Linux-HyperV List

Linux-HyperV List
 help / color / mirror / Atom feed

* Re: [PATCH 4/6] mshv: limit SynIC management to MSHV-owned resources
From: Anirudh Rayabharam @ 2026-04-02 16:13 UTC (permalink / raw)
  To: Jork Loeser
  Cc: linux-hyperv, x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H . Peter Anvin, Arnd Bergmann,
	Roman Kisel, Michael Kelley, linux-kernel, linux-arch
In-Reply-To: <20260327201920.2100427-5-jloeser@linux.microsoft.com>

On Fri, Mar 27, 2026 at 01:19:15PM -0700, Jork Loeser wrote:
> The SynIC is shared between VMBus and MSHV. VMBus owns the message
> page (SIMP), event flags page (SIEFP), global enable (SCONTROL), and
> SINT2. MSHV adds SINT0, SINT5, and the event ring page (SIRBP).
> 
> Currently mshv_synic_init() redundantly enables SIMP, SIEFP, and
> SCONTROL that VMBus already configured, and mshv_synic_cleanup()
> disables all of them. This is wrong because MSHV can be torn down
> while VMBus is still active. In particular, a kexec reboot notifier
> tears down MSHV first. Disabling SCONTROL, SIMP, and SIEFP out from
> under VMBus causes its later cleanup to write SynIC MSRs while SynIC
> is disabled, which the hypervisor does not tolerate.
> 
> Restrict MSHV to managing only the resources it owns:
> - SINT0, SINT5: mask on cleanup, unmask on init
> - SIRBP: enable/disable as before
> - SIMP, SIEFP, SCONTROL: on L1VH leave entirely to VMBus (it
>   already enabled them); on root partition VMBus doesn't run, so
>   MSHV must enable/disable them

Maybe it's time to extract all the synic management stuff to a separate
file to act like synic "driver" which facilitates synic access to
mutiple users i.e. mshv & vmbus.

Thanks,
Anirudh.


^ permalink raw reply

* Re: [PATCH 3/7] mshv: Rename mshv_mem_region to mshv_region
From: Anirudh Rayabharam @ 2026-04-02 16:33 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <177508156067.215674.12361225930217655159.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

On Wed, Apr 01, 2026 at 10:12:40PM +0000, Stanislav Kinsburskii wrote:
> The mshv_mem_region structure represents guest address space regions,
> which can be either RAM-backed memory or memory-mapped IO regions
> without physical backing. The "mem_" prefix incorrectly suggests the
> structure only handles memory regions, creating confusion about its
> actual purpose.
> 
> Remove the "mem_" prefix to align with existing function naming
> (mshv_region_map, mshv_region_pin, etc.) and accurately reflect that
> this structure manages arbitrary guest address space mappings
> regardless of their backing type.

I don't think the "mem_" prefix automatically suggested the backing
type.

Isn't mshv_region too vague now? Region of what?

Thanks,
Anirudh.


^ permalink raw reply

* Re: [PATCH 3/8] firmware: sysfb: Make CONFIG_SYSFB a user-selectable option
From: Arnd Bergmann @ 2026-04-02 16:47 UTC (permalink / raw)
  To: Thomas Zimmermann, Javier Martinez Canillas, Ard Biesheuvel,
	Ilias Apalodimas, Huacai Chen, WANG Xuerui, Maarten Lankhorst,
	Maxime Ripard, Dave Airlie, Simona Vetter, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, longli, Helge Deller
  Cc: linux-arm-kernel, loongarch, linux-efi, linux-riscv, dri-devel,
	linux-hyperv, linux-fbdev
In-Reply-To: <001efe27-9cbb-4a89-8d2d-a1f3ae15e505@suse.de>

On Thu, Apr 2, 2026, at 17:27, Thomas Zimmermann wrote:
> Am 02.04.26 um 16:59 schrieb Arnd Bergmann:
>> On Thu, Apr 2, 2026, at 16:10, Thomas Zimmermann wrote:
>>> Am 02.04.26 um 15:08 schrieb Arnd Bergmann:
>>>> On Thu, Apr 2, 2026, at 11:09, Thomas Zimmermann wrote:
>>>> I don't really like this part of the series and would prefer
>>>> to keep CONFIG_SYSFB hidden as much as possible as an x86
>>>> (and EFI) specific implementation detail, with the hope
>>>> of eventually seperating out the x86 bits from the EFI ones.
>>> You mean, you want to use the EFI-provided framebuffers without the
>>> intermediate step of going through sysfb_primary_display?
>>>
>>> In that case, CONFIG_SYSFB would become an x86-internal thing, right?
>> The part that is still needed from sysfb is the arbitration
>> between DRM_EFI and the PCI device driver for the same hardware,
>> so I think some part of sysfb is clearly needed, in particular
>> the sysfb_disable() function that removes the EFI framebuffer
>> when there is a conflicting simpledrm or hardware specific
>> driver.
>
> We do most of that in the aperture-helper module. (see 
> <linux/aperture.h>). Calling sysfb_disable() from there is a workaround 
> for some corner cases. We can have an EFI-specific function that does 
> the same.

That sounds good, yes. The same change would need to go into
of_platform_default_populate_init() then.

> BTW, simpledrm-on-EFI/VESA is considered obsolete and should preferably 
> be removed from that driver. Simpledrm should become a driver for 
> Devicetree nodes of type simple-framebuffer (as it originally has been 
> intended).

Sure, I was only thinking of the case where there are both
sysfb (from Arm/riscv UEFI) and simpledrm (from devicetree)
objects referring to the same one, not the simpledrm
device created by sysfb_simplefb.

>> The parts that I want to keep out of that is anything
>> related to the x86 boot protocol, non-EFI framebuffers,
>> text console, and kexec handoff, which we don't need on
>> non-x86 UEFI systems.
>>
>> I don't mind the idea of having a sysfb_primary_display
>> in the EFI code if that helps keep EFI sane on x86,
>> but it would be good to make that local to
>> drivers/firmware/efi and (eventually) detached from
>> include/uapi/linux/screen_info.h.
>
> Efidrm retrieves the framebuffer settings from the contained struct 
> screen_info. Disconnecting from screen_info would require separate 
> graphics drivers for x86 and non-x86. If we split off EFI from sysfb, 
> we'd likely need a sysfbdrm driver of some sort. Just saying.

Yes, I saw that as well and don't have an immediate idea for how
to best do it. I saw that you already abstracted the access to
the screen_info members in drm_sysfb_screen_info.c, which I think
is a step in that direction.

I also noticed that efidrm is mostly a subset of vesadrm, so
in theory they could be merged back into an x86 drm driver
along with the drm_sysfb_screen_info helpers, and have a non-x86
driver that constructs a drm_sysfb_device directly from the
EFI structures.

> I think we'd also have to duplicate the framebuffer-relocation code that 
> currently works on anything using struct screen_info (see patch 5).

You mean the code from include/linux/screen_info.h? I think
it would make sense to have an x86 specific version of that
to operate on the x86 screen_info, and a simpler version
that just updates the resource for the efirdrm driver, but
that could also be done one level higher or lower.

>>>> In general, I am always in favor of properly using Kconfig
>>>> dependencies over 'select' statements, for the same reasons
>>>> you describe, but I don't want the the x86 logic for
>>>> the legacy VESA and VGA console handling to leak into more
>>>> architectures than necessary.
>>>>
>>>> Do you think we could instead move the sysfb_init()
>>>> function into the same two places that contain the
>>>> sysfb_primary_display definition (arch/x86/kernel/setup.c,
>>>> drivers/firmware/efi/efi-init.c) and simplify the efi version
>>>> to take out the x86 bits? That would reduce the rest
>>>> of sysfb-primary.c to the logic to unregister the device,
>>>> and that could then be selected by both x86 and EFI.
>>> No, I'm more than happy that sysfb finally consolidates all the
>>> init-framebuffer setup and detection that floated around in the kernel.
>>> I would not want it to be duplicated again.
>>>
>>> For now, we could certainly keep CONFIG_SYSFB hidden and autoselected.
>>> Although I think this will require soem sort of solution at a later point.
>> Can you clarify which problem you are trying to solve
>> with that?
>
> One thing is that some users simply what control over their kernel build.
>
> I also think that there might be systems that want to use 
> sysfb_primary_display (plus the relocation feature), but not create the 
> framebuffer device. Say for efi-earlycon. It needs user-control over the 
> SYSFB option to do that.

I'm still not following, sorry. efi-earlycon doesn't require
CONFIG_SYSFB today, and I don't see why that would need to change,
or why it couldn't just 'select SYSFB' if it it does change.

> As a side-effect, user-configurable SYSFB gives us a nice place to put 
> SYSFB_SIMPLEFB and FIRMWARE_EDID; two options that currently float 
> around in the config somewhat arbitrarily.

You said that SYSFB_SIMPLEFB should get phased out in the future,
right?

I'm also missing your plan for CONFIG_FIRMWARE_EDID. I only
see three legacy drivers using the old fb_firmware_edid()
interface, so I assume this is not what you are interested in. 

For the global copy that is filled by x86 and efi, and
consumed by vesadrm and efidrm, does that even need to
be a configuration option rather than get always enabled?

       Arnd

^ permalink raw reply

* RE: [PATCH] PCI: hv: Allocate MMIO from above 4GB for the config window
From: Dexuan Cui @ 2026-04-02 17:09 UTC (permalink / raw)
  To: Michael Kelley, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
	Long Li, lpieralisi@kernel.org, kwilczynski@kernel.org,
	mani@kernel.org, robh@kernel.org, bhelgaas@google.com,
	Jake Oshins, linux-hyperv@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org
  Cc: stable@vger.kernel.org, Matthew Ruffell, Krister Johansen
In-Reply-To: <BN7PR02MB41486A6DBF839D5BC5D5BC2ED497A@BN7PR02MB4148.namprd02.prod.outlook.com>

> From: Michael Kelley <mhklinux@outlook.com>
> Sent: Wednesday, January 21, 2026 11:11 PM
> ...
> From: Dexuan Cui <decui@microsoft.com> Sent: Wednesday, January 21,
> 2026 6:04 PM
> >
> > There has been a longstanding MMIO conflict between the pci_hyperv
> > driver's config_window (see hv_allocate_config_window()) and the
> > hyperv_drm (or hyperv_fb) driver (see hyperv_setup_vram()): typically
> > both get MMIO from the low MMIO range below 4GB; this is not an issue
> > in the normal kernel since the VMBus driver reserves the framebuffer
> > MMIO in vmbus_reserve_fb(), so the drm driver's hyperv_setup_vram()
> > can always get the reserved framebuffer MMIO; however, a Gen2 VM's 
> > kdump kernel fails to reserve the framebuffer MMIO in vmbus_reserve_fb()
> >  because the screen_info.lfb_base is zero in the kdump kernel: 
> > the screen_info is not initialized at all in the kdump kernel, because the
> > EFI stub code, which initializes screen_info, doesn't run in the case of kdump.
> 
> I don't think this is correct. Yes, the EFI stub doesn't run, but screen_info

Hi Michael, sorry for delaying the reply for so long! Now I think I should
understand all the details.

My earlier statement "the screen_info is not initialized at all in the kdump
kernel" is not correct on x86, but I believe it's correct on ARM64. Please see
my explanation below.

> should be initialized in the kdump kernel by the code that loads the
> kdump kernel into the reserved crash memory. See discussion in the commit
> message for commit 304386373007.
> 
> I wonder if commit a41e0ab394e4 broke the initialization of screen_info
> in the kdump kernel. Or perhaps there is now a rev-lock between the kernel
> with this commit and a new version of the user space kexec command.

The commit
a41e0ab394e4 ("sysfb: Replace screen_info with sysfb_primary_display")
should be unrelated here.

> There's a parameter to the kexec() command that governs whether it
> uses the kexec_file_load() system call or the kexec_load() system call.
> I wonder if that parameter makes a difference in the problem described
> for this patch.
> 
> I can't immediately remember if, when I was working on commit
> 304386373007, I tested kdump in a Gen 2 VM with an NVMe OS disk to
> ensure that MMIO space was properly allocated to the frame buffer
> driver (either hyperv_fb or hyperv_drm). I'm thinking I did, but tomorrow
> I'll check for any definitive notes on that.
> 
> Michael

If vmbus_reserve_fb() in the kdump kernel fails to reserve the framebuffer
MMIO range due to a Gen2 VM's screen_info.lfb_base being 0,  the MMIO
conflict between hyperv_fb/hyperv_drm and hv_pci happens -- this is
especially an issue if hv_pci is built-in and hyperv_fb/hyperv_drm is built
as modules. vmbus_reserve_fb() should always succeed for a Gen1 VM, since
it can always get the framebuffer MMIO base from the legacy PCI graphics
device, so we only need to discuss Gen2 VMs here.

When kdump-tools loads the kdump kernel into memory, the tool can
accept any of the 3 parameters (e.g. I got the below via "man kexec" in
Ubuntu 24.04):

       -s (--kexec-file-syscall)
              Specify that the new KEXEC_FILE_LOAD syscall should be used exclusively.

       -c (--kexec-syscall)
              Specify that the old KEXEC_LOAD syscall should be used exclusively.

       -a (--kexec-syscall-auto)
              Try the new KEXEC_FILE_LOAD syscall first and when it is not supported or the kernel does not understand the supplied  im‐
              age fall back to the old KEXEC_LOAD interface.

              There is no one single interface that always works, so this is the default.

              KEXEC_FILE_LOAD is required on systems that use locked-down secure boot to verify the kernel signature.  KEXEC_LOAD may be
              also disabled in the kernel configuration.

              KEXEC_LOAD is required for some kernel image formats and on architectures that do not implement KEXEC_FILE_LOAD.

If none of the parameters are specified, the default may be -c, or -s
or -a, depending on the distro and the version in use.  We can run
    strace -f kdump-config reload  2>&1 | egrep 'kexec_file_load|kexec_load'
to tell which syscall is being used.

Old distro versions seem to use KEXEC_LOAD by default, and new distro
versions tend to use KEXEC_FILE_LOAD by default, especially when
Secure Boot is enabled (e.g. see /usr/sbin/kdump-config: kdump_load()
in Ubuntu).

In Ubuntu, we can explicitly specify one of the parameters in
"/etc/default/kdump-tools", e.g. KDUMP_KEXEC_ARGS="-c -d".

The -d is for debugging. I found it very useful: when we run
"kdump-config show" or "kdump-config reload", we get very useful
debug info with -d.

On x86-64, with -c:
The kdump-tools gets the framebuffer's MMIO base using
ioctl(fd, FBIOGET_FSCREENINFO, ....): see the end of the email for
an example program; kdump-tools then uses the KEXEC_LOAD syscall
to set up the screen_info.lfb_base for the kdump kernel.

The function in kdump-tools that gets the framebuffer MMIO base 
is kexec/arch/i386/x86-linux-setup.c: setup_linux_vesafb():
https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/tree/kexec/arch/i386/x86-linux-setup.c?h=v2.0.32#n133

Unluckily, setup_linux_vesafb() only recognizes the vesafb
driver in Linux kernel ("VESA VGA") and the efifb driver ("EFI VGA").
It looks like normally arch_options.reuse_video_type is always 0.

This means the kdump kernel's screen_info.lfb_base is 0, if
hyperv_fb or hyperv_drm loads. In the past,  for a Ubuntu kernel
with CONFIG_FB_EFI=y, our workaround is blacklisting 
hyperv_fb or hyperv_drm, so /dev/fb0 is backed by efifb, and
the screen_info.lfb_base is correctly set for kdump.

However, now CONFIG_FB_EFI is not set in recent Ubuntu kernels:
$ egrep 'CONFIG_FB_EFI|CONFIG_SYSFB|CONFIG_SYSFB_SIMPLEFB|CONFIG_DRM_SIMPLEDRM|CONFIG_DRM_HYPERV' /boot/config-6.8.0-1051-azure
CONFIG_SYSFB=y
CONFIG_SYSFB_SIMPLEFB=y
CONFIG_DRM_SIMPLEDRM=y
CONFIG_DRM_HYPERV=m
# CONFIG_FB_EFI is not set

So, with Ubuntu 22.04/24.04,  -c can't avoid the MMIO conflict
for Gen2 x86-64 VMs now, even if we blacklist hyperv_fb/hyperv_drm.
Note: Ubuntu 20.04 uses an old version of the kdump-tools, so
the statement is different there (see the later discussion below).

hyperv_fb has been removed in the mainline kernel: see
commit 40227f2efcfb ("fbdev: hyperv_fb: Remove hyperv_fb driver")
so we no longer need to worry about it.

Even if we modify setup_linux_vesafb() to support  hyperv_drm,
it still won't work, because the MMIO base is hidden by commit
da6c7707caf3 ("fbdev: Add FBINFO_HIDE_SMEM_START flag")

On x86-64, with -s:
The KEXEC_FILE_LOAD syscall sets the kdump kernel's
screen_info.lfb_base in the kernel: see 

"arch/x86/kernel/kexec-bzimage64.c"
    bzImage64_load
        setup_boot_parameters
            memcpy(&params->screen_info, &screen_info, sizeof(struct screen_info));

so, as long as the first kernel's hyperv_drm doesn't relocate the
MMIO base, kdump should work fine; if the MMIO base is relocated,
currently hyperv_drm doesn't update the screen_info.lfb_base,
so the kdump's efifb driver and hv_pci driver won't work. Normally
hyperv_drm doesn't relocate the MMIO base, unless the user
specifies a very high resolution and the required MMIO size
exceeds the default 8MB reserved by vmbus_reserve_fb() -- let's
ignore that scenario for now.

On AMR64, with -c:
The kdump-tools doesn't even open /dev/fb0 (we can confirm this by using 
strace or bpftrace), so the kdump kernel's screen_info.lfb_base ia always 0.

On AMR64, with -s:
"arch/arm64/kernel/kexec_image.c": image_load() doesn't set the
params->screen_info, so the kdump kernel's screen_info.lfb_base ia always 0.

To recap, with a recent mainline kernel (or the linux-azure kernels) that
has 304386373007, my observation on Ubuntu 22.04 and 24.04 is: 
    on x86-64, -c fails, but -s works.
    on ARM64, -c fails, and -s also fails. 

Note: the kdump-tools v2.0.18 in Ubuntu 20.04 doesn't have this commit:
https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/commit/?id=fb5a8792e6e4ee7de7ae3e06d193ea5beaaececc
(Note the "return 0;" in setup_linux_vesafb())
so, on x86-64, -c also works in Ubuntu 20.04, if hyperv_fb is used
(-c still doesn't work if hyperv_drm is used due to da6c7707caf3).

With this patch
"PCI: hv: Allocate MMIO from above 4GB for the config window",
both -c  and -s work on x86-64 and ARM64 due to no MMIO conflict,
as long as there are no 32-bit PCI BARs (which should be true on
Azure and on modern hosts.)

With the patch, even if hyperv_drm relocates the framebuffer MMO
base, there would still be no MMIO conflict because typically hyperv_drm
gets its MMIO from below 4GB: it seems like vmbus_walk_resources() 
always finds the low MMIO range first and adds it to the beginning of the 
MMIO resources "hyperv_mmio", so presumably hyperv_drm would
get MMIO from the low MMIO range.

I'll update the commit message, add Matthew's and Krister's 
Tested-by's and post v2.

Thanks,
Dexuan

//Print the info of the frame buffer for /dev/fb0:
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fb.h>

static void print_bitfield(const char *name, const struct fb_bitfield *bf) {
    printf("%s:\n", name);
    printf("  offset       : %u\n", bf->offset);
    printf("  length       : %u\n", bf->length);
    printf("  msb_right    : %u\n", bf->msb_right);
}

static void print_fix_screeninfo(const struct fb_fix_screeninfo *fix) {
    printf("struct fb_fix_screeninfo:\n");
    printf("  id           : %.16s\n", fix->id);
    printf("  smem_start   : 0x%lx\n", fix->smem_start);
    printf("  smem_len     : %u\n", fix->smem_len);
    printf("  type         : %u\n", fix->type);
    printf("  type_aux     : %u\n", fix->type_aux);
    printf("  visual       : %u\n", fix->visual);
    printf("  xpanstep     : %u\n", fix->xpanstep);
    printf("  ypanstep     : %u\n", fix->ypanstep);
    printf("  ywrapstep    : %u\n", fix->ywrapstep);
    printf("  line_length  : %u\n", fix->line_length);
    printf("  mmio_start   : %lu\n", fix->mmio_start);
    printf("  mmio_len     : %u\n", fix->mmio_len);
    printf("  accel        : %u\n", fix->accel);
    printf("  capabilities : %u\n", fix->capabilities);
    printf("  reserved[0]  : %u\n", fix->reserved[0]);
    printf("  reserved[1]  : %u\n", fix->reserved[1]);
}

static void print_var_screeninfo(const struct fb_var_screeninfo *var) {
    printf("struct fb_var_screeninfo:\n");
    printf("  xres         : %u\n", var->xres);
    printf("  yres         : %u\n", var->yres);
    printf("  xres_virtual : %u\n", var->xres_virtual);
    printf("  yres_virtual : %u\n", var->yres_virtual);
    printf("  xoffset      : %u\n", var->xoffset);
    printf("  yoffset      : %u\n", var->yoffset);
    printf("  bits_per_pixel: %u\n", var->bits_per_pixel);
    printf("  grayscale    : %u\n", var->grayscale);

    print_bitfield("  red", &var->red);
    print_bitfield("  green", &var->green);
    print_bitfield("  blue", &var->blue);
    print_bitfield("  transp", &var->transp);

    printf("  nonstd       : %u\n", var->nonstd);
    printf("  activate     : %u\n", var->activate);
    printf("  height       : %u\n", var->height);
    printf("  width        : %u\n", var->width);
    printf("  accel_flags  : %u\n", var->accel_flags);
    printf("  pixclock     : %u\n", var->pixclock);
    printf("  left_margin  : %u\n", var->left_margin);
    printf("  right_margin : %u\n", var->right_margin);
    printf("  upper_margin : %u\n", var->upper_margin);
    printf("  lower_margin : %u\n", var->lower_margin);
    printf("  hsync_len    : %u\n", var->hsync_len);
    printf("  vsync_len    : %u\n", var->vsync_len);
    printf("  sync         : %u\n", var->sync);
    printf("  vmode        : %u\n", var->vmode);
    printf("  rotate       : %u\n", var->rotate);
    printf("  colorspace   : %u\n", var->colorspace);
    printf("  reserved[0]  : %u\n", var->reserved[0]);
    printf("  reserved[1]  : %u\n", var->reserved[1]);
    printf("  reserved[2]  : %u\n", var->reserved[2]);
    printf("  reserved[3]  : %u\n", var->reserved[3]);
}

int main(void) {
    int fd;
    struct fb_fix_screeninfo fix;
    struct fb_var_screeninfo var;

    fd = open("/dev/fb0", O_RDONLY);
    if (fd == -1) {
        perror("open");
        return EXIT_FAILURE;
    }

    if (ioctl(fd, FBIOGET_FSCREENINFO, &fix) == -1) {
        perror("ioctl(FBIOGET_FSCREENINFO)");
        close(fd);
        return EXIT_FAILURE;
    }

    if (ioctl(fd, FBIOGET_VSCREENINFO, &var) == -1) {
        perror("ioctl(FBIOGET_VSCREENINFO)");
        close(fd);
        return EXIT_FAILURE;
    }

    print_fix_screeninfo(&fix);
    printf("\n");
    print_var_screeninfo(&var);

    close(fd);
    return EXIT_SUCCESS;
}

^ permalink raw reply

* Re: [PATCH 3/7] mshv: Rename mshv_mem_region to mshv_region
From: Stanislav Kinsburskii @ 2026-04-02 17:57 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <20260402-fervent-thick-boobook-45dba9@anirudhrb>

On Thu, Apr 02, 2026 at 04:33:24PM +0000, Anirudh Rayabharam wrote:
> On Wed, Apr 01, 2026 at 10:12:40PM +0000, Stanislav Kinsburskii wrote:
> > The mshv_mem_region structure represents guest address space regions,
> > which can be either RAM-backed memory or memory-mapped IO regions
> > without physical backing. The "mem_" prefix incorrectly suggests the
> > structure only handles memory regions, creating confusion about its
> > actual purpose.
> > 
> > Remove the "mem_" prefix to align with existing function naming
> > (mshv_region_map, mshv_region_pin, etc.) and accurately reflect that
> > this structure manages arbitrary guest address space mappings
> > regardless of their backing type.
> 
> I don't think the "mem_" prefix automatically suggested the backing
> type.
> 

What else can it suggest?

> Isn't mshv_region too vague now? Region of what?
> 

The region of address space, which can or can not be backed by memory.

Thanks,
Stanislav

> Thanks,
> Anirudh.

^ permalink raw reply

* [PATCH v2 0/7] mshv: Reduce memory consumption for unpinned regions
From: Stanislav Kinsburskii @ 2026-04-02 18:04 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

This series reduces memory consumption for unpinned regions by avoiding
PFN array allocation. A 1GB unpinned region currently wastes 2MB for an
unused PFN array that HMM-managed regions don't need.

The first three patches are preparatory refactoring. Patch 1 consolidates
region creation and mapping logic, reducing API surface by 4 functions.
Patch 2 introduces a typedef for PFN handler callbacks to simplify
function signatures. Patch 3 renames mshv_mem_region to mshv_region to
align with existing function naming conventions.

Patch 4 optimizes unmap and no-access remap operations by eliminating
redundant PFN iteration when all frames are guaranteed to be mapped.
This uses large page flags for aligned chunks and removes unnecessary
helper functions.

Patches 5-6 decouple PFN processing from the region->pfns storage.
Patch 5 threads the pfns pointer explicitly through the processing
chain. Patch 6 removes offset-based indexing by having callers pass
pre-offset pointers.

Patch 7 converts the pfns array from a flexible array member to a
conditional pointer, allocated only for pinned regions that need it
for share/unshare/evict operations. This eliminates the memory waste
for unpinned regions and allows using kzalloc instead of vzalloc.

v2: 
- Improved commit message
- Fixed invalid vfree(region->mreg_pfns) call for MMIO-backed regions
- Fixed unpinning of already-released pages in the error path during
  pinned region creation
- Removed redundant mshv_map_region helper in favor of the new
  optimized mapping logic

---

Stanislav Kinsburskii (7):
      mshv: Consolidate region creation and mapping
      mshv: Improve code readability with handler function typedef
      mshv: Rename mshv_mem_region to mshv_region
      mshv: Optimize memory region mapping operations
      mshv: Pass pfns array explicitly through processing chain
      mshv: Simplify pfn array handling in region processing
      mshv: Allocate pfns array only for pinned regions

 drivers/hv/mshv_regions.c   |  350 +++++++++++++++++++++++++------------------
 drivers/hv/mshv_root.h      |   26 ++-
 drivers/hv/mshv_root_main.c |   79 ++++------
 3 files changed, 249 insertions(+), 206 deletions(-)

^ permalink raw reply

* [PATCH v2 1/7] mshv: Consolidate region creation and mapping
From: Stanislav Kinsburskii @ 2026-04-02 18:04 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177515251087.119822.1940529498624181326.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

Consolidate region type detection and initialization into
mshv_region_create() to simplify the region creation flow. Move type
determination logic (MMIO/pinned/movable) earlier in the process and
initialize type-specific fields during creation rather than after.

This eliminates the need for mshv_region_movable_init/fini() by
handling MMU interval notifier setup directly in the constructor and
teardown in the destructor. Region mapping is also unified through a
single mshv_map_region() dispatcher that routes to the appropriate
type-specific handler.

Changes improve code organization by:
- Reducing API surface (4 fewer exported functions)
- Centralizing type determination and validation
- Making region lifecycle more explicit and easier to follow
- Removing post-construction initialization steps

The refactoring maintains existing functionality while making the
codebase more maintainable and less error-prone.

Additionally, movable region initialization now fails explicitly
if mmu_interval_notifier_insert() returns an error, rather than
silently falling back to pinned memory. This fail-fast approach
makes configuration issues more visible.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c   |   81 ++++++++++++++++++++++++++++---------------
 drivers/hv/mshv_root.h      |   14 +++----
 drivers/hv/mshv_root_main.c |   61 +++++++++++++-------------------
 3 files changed, 83 insertions(+), 73 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index 6b703b269a4f..a85d18e2c279 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -20,6 +20,8 @@
 #define MSHV_MAP_FAULT_IN_PAGES				PTRS_PER_PMD
 #define MSHV_INVALID_PFN				ULONG_MAX
 
+static const struct mmu_interval_notifier_ops mshv_region_mni_ops;
+
 /**
  * mshv_chunk_stride - Compute stride for mapping guest memory
  * @page      : The page to check for huge page backing
@@ -241,16 +243,39 @@ static int mshv_region_process_range(struct mshv_mem_region *region,
 	return 0;
 }
 
-struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pfns,
-					   u64 uaddr, u32 flags)
+struct mshv_mem_region *mshv_region_create(enum mshv_region_type type,
+					   u64 guest_pfn, u64 nr_pfns,
+					   u64 uaddr, u32 flags,
+					   ulong mmio_pfn)
 {
 	struct mshv_mem_region *region;
+	int ret = 0;
 	u64 i;
 
 	region = vzalloc(sizeof(*region) + sizeof(unsigned long) * nr_pfns);
 	if (!region)
 		return ERR_PTR(-ENOMEM);
 
+	switch (type) {
+	case MSHV_REGION_TYPE_MEM_MOVABLE:
+		ret = mmu_interval_notifier_insert(&region->mreg_mni,
+						   current->mm, uaddr,
+						   nr_pfns << HV_HYP_PAGE_SHIFT,
+						   &mshv_region_mni_ops);
+		break;
+	case MSHV_REGION_TYPE_MEM_PINNED:
+		break;
+	case MSHV_REGION_TYPE_MMIO:
+		region->mreg_mmio_pfn = mmio_pfn;
+		break;
+	default:
+		ret = -EINVAL;
+	}
+
+	if (ret)
+		goto free_region;
+
+	region->mreg_type = type;
 	region->nr_pfns = nr_pfns;
 	region->start_gfn = guest_pfn;
 	region->start_uaddr = uaddr;
@@ -263,9 +288,14 @@ struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pfns,
 	for (i = 0; i < nr_pfns; i++)
 		region->mreg_pfns[i] = MSHV_INVALID_PFN;
 
+	mutex_init(&region->mreg_mutex);
 	kref_init(&region->mreg_refcount);
 
 	return region;
+
+free_region:
+	vfree(region);
+	return ERR_PTR(ret);
 }
 
 static int mshv_region_chunk_share(struct mshv_mem_region *region,
@@ -462,7 +492,7 @@ static void mshv_region_destroy(struct kref *ref)
 	int ret;
 
 	if (region->mreg_type == MSHV_REGION_TYPE_MEM_MOVABLE)
-		mshv_region_movable_fini(region);
+		mmu_interval_notifier_remove(&region->mreg_mni);
 
 	if (mshv_partition_encrypted(partition)) {
 		ret = mshv_region_share(region);
@@ -736,27 +766,6 @@ static const struct mmu_interval_notifier_ops mshv_region_mni_ops = {
 	.invalidate = mshv_region_interval_invalidate,
 };
 
-void mshv_region_movable_fini(struct mshv_mem_region *region)
-{
-	mmu_interval_notifier_remove(&region->mreg_mni);
-}
-
-bool mshv_region_movable_init(struct mshv_mem_region *region)
-{
-	int ret;
-
-	ret = mmu_interval_notifier_insert(&region->mreg_mni, current->mm,
-					   region->start_uaddr,
-					   region->nr_pfns << HV_HYP_PAGE_SHIFT,
-					   &mshv_region_mni_ops);
-	if (ret)
-		return false;
-
-	mutex_init(&region->mreg_mutex);
-
-	return true;
-}
-
 /**
  * mshv_map_pinned_region - Pin and map memory regions
  * @region: Pointer to the memory region structure
@@ -770,7 +779,7 @@ bool mshv_region_movable_init(struct mshv_mem_region *region)
  *
  * Return: 0 on success, negative error code on failure.
  */
-int mshv_map_pinned_region(struct mshv_mem_region *region)
+static int mshv_map_pinned_region(struct mshv_mem_region *region)
 {
 	struct mshv_partition *partition = region->partition;
 	int ret;
@@ -826,17 +835,31 @@ int mshv_map_pinned_region(struct mshv_mem_region *region)
 	return ret;
 }
 
-int mshv_map_movable_region(struct mshv_mem_region *region)
+static int mshv_map_movable_region(struct mshv_mem_region *region)
 {
 	return mshv_region_collect_and_map(region, 0, region->nr_pfns,
 					   false);
 }
 
-int mshv_map_mmio_region(struct mshv_mem_region *region,
-			 unsigned long mmio_pfn)
+static int mshv_map_mmio_region(struct mshv_mem_region *region)
 {
 	struct mshv_partition *partition = region->partition;
 
 	return hv_call_map_mmio_pfns(partition->pt_id, region->start_gfn,
-				     mmio_pfn, region->nr_pfns);
+				     region->mreg_mmio_pfn,
+				     region->nr_pfns);
+}
+
+int mshv_map_region(struct mshv_mem_region *region)
+{
+	switch (region->mreg_type) {
+	case MSHV_REGION_TYPE_MEM_PINNED:
+		return mshv_map_pinned_region(region);
+	case MSHV_REGION_TYPE_MEM_MOVABLE:
+		return mshv_map_movable_region(region);
+	case MSHV_REGION_TYPE_MMIO:
+		return mshv_map_mmio_region(region);
+	}
+
+	return -EINVAL;
 }
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 1f92b9f85b60..2bcdfa070517 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -92,6 +92,7 @@ struct mshv_mem_region {
 	enum mshv_region_type mreg_type;
 	struct mmu_interval_notifier mreg_mni;
 	struct mutex mreg_mutex;	/* protects region PFNs remapping */
+	u64 mreg_mmio_pfn;
 	unsigned long mreg_pfns[];
 };
 
@@ -366,16 +367,13 @@ extern struct mshv_root mshv_root;
 extern enum hv_scheduler_type hv_scheduler_type;
 extern u8 * __percpu *hv_synic_eventring_tail;
 
-struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
-					   u64 uaddr, u32 flags);
+struct mshv_mem_region *mshv_region_create(enum mshv_region_type type,
+					   u64 guest_pfn, u64 nr_pfns,
+					   u64 uaddr, u32 flags,
+					   ulong mmio_pfn);
 void mshv_region_put(struct mshv_mem_region *region);
 int mshv_region_get(struct mshv_mem_region *region);
 bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn);
-void mshv_region_movable_fini(struct mshv_mem_region *region);
-bool mshv_region_movable_init(struct mshv_mem_region *region);
-int mshv_map_pinned_region(struct mshv_mem_region *region);
-int mshv_map_movable_region(struct mshv_mem_region *region);
-int mshv_map_mmio_region(struct mshv_mem_region *region,
-			 unsigned long mmio_pfn);
+int mshv_map_region(struct mshv_mem_region *region);
 
 #endif /* _MSHV_ROOT_H_ */
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index adb09350205a..3bfa9e9c575f 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1217,11 +1217,14 @@ static void mshv_async_hvcall_handler(void *data, u64 *status)
  */
 static int mshv_partition_create_region(struct mshv_partition *partition,
 					struct mshv_user_mem_region *mem,
-					struct mshv_mem_region **regionpp,
-					bool is_mmio)
+					struct mshv_mem_region **regionpp)
 {
 	struct mshv_mem_region *rg;
+	enum mshv_region_type type;
 	u64 nr_pfns = HVPFN_DOWN(mem->size);
+	struct vm_area_struct *vma;
+	ulong mmio_pfn;
+	bool is_mmio;
 
 	/* Reject overlapping regions */
 	spin_lock(&partition->pt_mem_regions_lock);
@@ -1234,18 +1237,27 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
 	}
 	spin_unlock(&partition->pt_mem_regions_lock);
 
-	rg = mshv_region_create(mem->guest_pfn, nr_pfns,
-				mem->userspace_addr, mem->flags);
-	if (IS_ERR(rg))
-		return PTR_ERR(rg);
+	mmap_read_lock(current->mm);
+	vma = vma_lookup(current->mm, mem->userspace_addr);
+	is_mmio = vma ? !!(vma->vm_flags & (VM_IO | VM_PFNMAP)) : 0;
+	mmio_pfn = is_mmio ? vma->vm_pgoff : 0;
+	mmap_read_unlock(current->mm);
+
+	if (!vma)
+		return -EINVAL;
 
 	if (is_mmio)
-		rg->mreg_type = MSHV_REGION_TYPE_MMIO;
-	else if (mshv_partition_encrypted(partition) ||
-		 !mshv_region_movable_init(rg))
-		rg->mreg_type = MSHV_REGION_TYPE_MEM_PINNED;
+		type = MSHV_REGION_TYPE_MMIO;
+	else if (mshv_partition_encrypted(partition))
+		type = MSHV_REGION_TYPE_MEM_PINNED;
 	else
-		rg->mreg_type = MSHV_REGION_TYPE_MEM_MOVABLE;
+		type = MSHV_REGION_TYPE_MEM_MOVABLE;
+
+	rg = mshv_region_create(type, mem->guest_pfn, nr_pfns,
+				mem->userspace_addr, mem->flags,
+				mmio_pfn);
+	if (IS_ERR(rg))
+		return PTR_ERR(rg);
 
 	rg->partition = partition;
 
@@ -1271,40 +1283,17 @@ mshv_map_user_memory(struct mshv_partition *partition,
 		     struct mshv_user_mem_region mem)
 {
 	struct mshv_mem_region *region;
-	struct vm_area_struct *vma;
-	bool is_mmio;
-	ulong mmio_pfn;
 	long ret;
 
 	if (mem.flags & BIT(MSHV_SET_MEM_BIT_UNMAP) ||
 	    !access_ok((const void __user *)mem.userspace_addr, mem.size))
 		return -EINVAL;
 
-	mmap_read_lock(current->mm);
-	vma = vma_lookup(current->mm, mem.userspace_addr);
-	is_mmio = vma ? !!(vma->vm_flags & (VM_IO | VM_PFNMAP)) : 0;
-	mmio_pfn = is_mmio ? vma->vm_pgoff : 0;
-	mmap_read_unlock(current->mm);
-
-	if (!vma)
-		return -EINVAL;
-
-	ret = mshv_partition_create_region(partition, &mem, &region,
-					   is_mmio);
+	ret = mshv_partition_create_region(partition, &mem, &region);
 	if (ret)
 		return ret;
 
-	switch (region->mreg_type) {
-	case MSHV_REGION_TYPE_MEM_PINNED:
-		ret = mshv_map_pinned_region(region);
-		break;
-	case MSHV_REGION_TYPE_MEM_MOVABLE:
-		ret = mshv_map_movable_region(region);
-		break;
-	case MSHV_REGION_TYPE_MMIO:
-		ret = mshv_map_mmio_region(region, mmio_pfn);
-		break;
-	}
+	ret = mshv_map_region(region);
 
 	trace_mshv_map_user_memory(partition->pt_id, region->start_uaddr,
 				   region->start_gfn, region->nr_pfns,



^ permalink raw reply related

* [PATCH v2 2/7] mshv: Improve code readability with handler function typedef
From: Stanislav Kinsburskii @ 2026-04-02 18:04 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177515251087.119822.1940529498624181326.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

The inline function pointer declarations in mshv_region_process_*
functions make the code harder to read and maintain. Each function
signature repeats the same lengthy callback parameter definition,
adding visual noise and making the actual logic less clear.

Introduce pfn_handler_t typedef to replace the repeated inline
function pointer declarations. This simplifies function signatures,
makes the code more maintainable, and follows common kernel
patterns for callback handling.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c |   28 ++++++++--------------------
 1 file changed, 8 insertions(+), 20 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index a85d18e2c279..70cd0857a28e 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -20,6 +20,10 @@
 #define MSHV_MAP_FAULT_IN_PAGES				PTRS_PER_PMD
 #define MSHV_INVALID_PFN				ULONG_MAX
 
+typedef int (*pfn_handler_t)(struct mshv_mem_region *region, u32 flags,
+			     u64 pfn_offset, u64 pfn_count,
+			     bool huge_page);
+
 static const struct mmu_interval_notifier_ops mshv_region_mni_ops;
 
 /**
@@ -80,11 +84,7 @@ static int mshv_chunk_stride(struct page *page,
 static long mshv_region_process_pfns(struct mshv_mem_region *region,
 				     u32 flags,
 				     u64 pfn_offset, u64 pfn_count,
-				     int (*handler)(struct mshv_mem_region *region,
-						    u32 flags,
-						    u64 pfn_offset,
-						    u64 pfn_count,
-						    bool huge_page))
+				     pfn_handler_t handler)
 {
 	u64 gfn = region->start_gfn + pfn_offset;
 	u64 count;
@@ -138,11 +138,7 @@ static long mshv_region_process_pfns(struct mshv_mem_region *region,
 static long mshv_region_process_hole(struct mshv_mem_region *region,
 				     u32 flags,
 				     u64 pfn_offset, u64 pfn_count,
-				     int (*handler)(struct mshv_mem_region *region,
-						    u32 flags,
-						    u64 pfn_offset,
-						    u64 pfn_count,
-						    bool huge_page))
+				     pfn_handler_t handler)
 {
 	long ret;
 
@@ -156,11 +152,7 @@ static long mshv_region_process_hole(struct mshv_mem_region *region,
 static long mshv_region_process_chunk(struct mshv_mem_region *region,
 				      u32 flags,
 				      u64 pfn_offset, u64 pfn_count,
-				      int (*handler)(struct mshv_mem_region *region,
-						     u32 flags,
-						     u64 pfn_offset,
-						     u64 pfn_count,
-						     bool huge_page))
+				      pfn_handler_t handler)
 {
 	if (pfn_valid(region->mreg_pfns[pfn_offset]))
 		return mshv_region_process_pfns(region, flags,
@@ -193,11 +185,7 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
 static int mshv_region_process_range(struct mshv_mem_region *region,
 				     u32 flags,
 				     u64 pfn_offset, u64 pfn_count,
-				     int (*handler)(struct mshv_mem_region *region,
-						    u32 flags,
-						    u64 pfn_offset,
-						    u64 pfn_count,
-						    bool huge_page))
+				     pfn_handler_t handler)
 {
 	u64 start, end;
 	long ret;



^ permalink raw reply related

* [PATCH v2 3/7] mshv: Rename mshv_mem_region to mshv_region
From: Stanislav Kinsburskii @ 2026-04-02 18:04 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177515251087.119822.1940529498624181326.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

The mshv_mem_region structure represents guest address space regions,
which can be either RAM-backed memory or memory-mapped IO regions
without physical backing. The "mem_" prefix incorrectly suggests the
structure only handles memory regions, creating confusion about its
actual purpose.

Remove the "mem_" prefix to align with existing function naming
(mshv_region_map, mshv_region_pin, etc.) and accurately reflect that
this structure manages arbitrary guest address space mappings
regardless of their backing type.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c   |   74 ++++++++++++++++++++++---------------------
 drivers/hv/mshv_root.h      |   18 +++++-----
 drivers/hv/mshv_root_main.c |   20 ++++++------
 3 files changed, 56 insertions(+), 56 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index 70cd0857a28e..2c4215381e0b 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -20,7 +20,7 @@
 #define MSHV_MAP_FAULT_IN_PAGES				PTRS_PER_PMD
 #define MSHV_INVALID_PFN				ULONG_MAX
 
-typedef int (*pfn_handler_t)(struct mshv_mem_region *region, u32 flags,
+typedef int (*pfn_handler_t)(struct mshv_region *region, u32 flags,
 			     u64 pfn_offset, u64 pfn_count,
 			     bool huge_page);
 
@@ -81,7 +81,7 @@ static int mshv_chunk_stride(struct page *page,
  *
  * Return: Number of pages handled, or negative error code.
  */
-static long mshv_region_process_pfns(struct mshv_mem_region *region,
+static long mshv_region_process_pfns(struct mshv_region *region,
 				     u32 flags,
 				     u64 pfn_offset, u64 pfn_count,
 				     pfn_handler_t handler)
@@ -135,7 +135,7 @@ static long mshv_region_process_pfns(struct mshv_mem_region *region,
  *
  * Return: Number of PFNs handled, or negative error code.
  */
-static long mshv_region_process_hole(struct mshv_mem_region *region,
+static long mshv_region_process_hole(struct mshv_region *region,
 				     u32 flags,
 				     u64 pfn_offset, u64 pfn_count,
 				     pfn_handler_t handler)
@@ -149,7 +149,7 @@ static long mshv_region_process_hole(struct mshv_mem_region *region,
 	return pfn_count;
 }
 
-static long mshv_region_process_chunk(struct mshv_mem_region *region,
+static long mshv_region_process_chunk(struct mshv_region *region,
 				      u32 flags,
 				      u64 pfn_offset, u64 pfn_count,
 				      pfn_handler_t handler)
@@ -182,7 +182,7 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
  *
  * Returns 0 on success, or a negative error code on failure.
  */
-static int mshv_region_process_range(struct mshv_mem_region *region,
+static int mshv_region_process_range(struct mshv_region *region,
 				     u32 flags,
 				     u64 pfn_offset, u64 pfn_count,
 				     pfn_handler_t handler)
@@ -231,12 +231,12 @@ static int mshv_region_process_range(struct mshv_mem_region *region,
 	return 0;
 }
 
-struct mshv_mem_region *mshv_region_create(enum mshv_region_type type,
-					   u64 guest_pfn, u64 nr_pfns,
-					   u64 uaddr, u32 flags,
-					   ulong mmio_pfn)
+struct mshv_region *mshv_region_create(enum mshv_region_type type,
+				       u64 guest_pfn, u64 nr_pfns,
+				       u64 uaddr, u32 flags,
+				       ulong mmio_pfn)
 {
-	struct mshv_mem_region *region;
+	struct mshv_region *region;
 	int ret = 0;
 	u64 i;
 
@@ -286,7 +286,7 @@ struct mshv_mem_region *mshv_region_create(enum mshv_region_type type,
 	return ERR_PTR(ret);
 }
 
-static int mshv_region_chunk_share(struct mshv_mem_region *region,
+static int mshv_region_chunk_share(struct mshv_region *region,
 				   u32 flags,
 				   u64 pfn_offset, u64 pfn_count,
 				   bool huge_page)
@@ -305,7 +305,7 @@ static int mshv_region_chunk_share(struct mshv_mem_region *region,
 					      flags, true);
 }
 
-static int mshv_region_share(struct mshv_mem_region *region)
+static int mshv_region_share(struct mshv_region *region)
 {
 	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_SHARED;
 
@@ -314,7 +314,7 @@ static int mshv_region_share(struct mshv_mem_region *region)
 					 mshv_region_chunk_share);
 }
 
-static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
+static int mshv_region_chunk_unshare(struct mshv_region *region,
 				     u32 flags,
 				     u64 pfn_offset, u64 pfn_count,
 				     bool huge_page)
@@ -331,7 +331,7 @@ static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
 					      flags, false);
 }
 
-static int mshv_region_unshare(struct mshv_mem_region *region)
+static int mshv_region_unshare(struct mshv_region *region)
 {
 	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_EXCLUSIVE;
 
@@ -340,7 +340,7 @@ static int mshv_region_unshare(struct mshv_mem_region *region)
 					 mshv_region_chunk_unshare);
 }
 
-static int mshv_region_chunk_remap(struct mshv_mem_region *region,
+static int mshv_region_chunk_remap(struct mshv_region *region,
 				   u32 flags,
 				   u64 pfn_offset, u64 pfn_count,
 				   bool huge_page)
@@ -362,7 +362,7 @@ static int mshv_region_chunk_remap(struct mshv_mem_region *region,
 				    region->mreg_pfns + pfn_offset);
 }
 
-static int mshv_region_remap_pfns(struct mshv_mem_region *region,
+static int mshv_region_remap_pfns(struct mshv_region *region,
 				  u32 map_flags,
 				  u64 pfn_offset, u64 pfn_count)
 {
@@ -371,7 +371,7 @@ static int mshv_region_remap_pfns(struct mshv_mem_region *region,
 					 mshv_region_chunk_remap);
 }
 
-static int mshv_region_map(struct mshv_mem_region *region)
+static int mshv_region_map(struct mshv_region *region)
 {
 	u32 map_flags = region->hv_map_flags;
 
@@ -379,7 +379,7 @@ static int mshv_region_map(struct mshv_mem_region *region)
 				      0, region->nr_pfns);
 }
 
-static void mshv_region_invalidate_pfns(struct mshv_mem_region *region,
+static void mshv_region_invalidate_pfns(struct mshv_region *region,
 					u64 pfn_offset, u64 pfn_count)
 {
 	u64 i;
@@ -395,12 +395,12 @@ static void mshv_region_invalidate_pfns(struct mshv_mem_region *region,
 	}
 }
 
-static void mshv_region_invalidate(struct mshv_mem_region *region)
+static void mshv_region_invalidate(struct mshv_region *region)
 {
 	mshv_region_invalidate_pfns(region, 0, region->nr_pfns);
 }
 
-static int mshv_region_pin(struct mshv_mem_region *region)
+static int mshv_region_pin(struct mshv_region *region)
 {
 	u64 done_count, nr_pfns, i;
 	unsigned long *pfns;
@@ -449,7 +449,7 @@ static int mshv_region_pin(struct mshv_mem_region *region)
 	return ret < 0 ? ret : -ENOMEM;
 }
 
-static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
+static int mshv_region_chunk_unmap(struct mshv_region *region,
 				   u32 flags,
 				   u64 pfn_offset, u64 pfn_count,
 				   bool huge_page)
@@ -465,7 +465,7 @@ static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
 				  pfn_count, flags);
 }
 
-static int mshv_region_unmap(struct mshv_mem_region *region)
+static int mshv_region_unmap(struct mshv_region *region)
 {
 	return mshv_region_process_range(region, 0,
 					 0, region->nr_pfns,
@@ -474,8 +474,8 @@ static int mshv_region_unmap(struct mshv_mem_region *region)
 
 static void mshv_region_destroy(struct kref *ref)
 {
-	struct mshv_mem_region *region =
-		container_of(ref, struct mshv_mem_region, mreg_refcount);
+	struct mshv_region *region =
+		container_of(ref, struct mshv_region, mreg_refcount);
 	struct mshv_partition *partition = region->partition;
 	int ret;
 
@@ -499,12 +499,12 @@ static void mshv_region_destroy(struct kref *ref)
 	vfree(region);
 }
 
-void mshv_region_put(struct mshv_mem_region *region)
+void mshv_region_put(struct mshv_region *region)
 {
 	kref_put(&region->mreg_refcount, mshv_region_destroy);
 }
 
-int mshv_region_get(struct mshv_mem_region *region)
+int mshv_region_get(struct mshv_region *region)
 {
 	return kref_get_unless_zero(&region->mreg_refcount);
 }
@@ -534,7 +534,7 @@ int mshv_region_get(struct mshv_mem_region *region)
  *
  * Return: 0 on success, a negative error code otherwise.
  */
-static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
+static int mshv_region_hmm_fault_and_lock(struct mshv_region *region,
 					  unsigned long start,
 					  unsigned long end,
 					  unsigned long *pfns,
@@ -613,7 +613,7 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
  *
  * Return: 0 on success, negative errno on failure.
  */
-static int mshv_region_collect_and_map(struct mshv_mem_region *region,
+static int mshv_region_collect_and_map(struct mshv_region *region,
 				       u64 pfn_offset, u64 pfn_count,
 				       bool do_fault)
 {
@@ -653,14 +653,14 @@ static int mshv_region_collect_and_map(struct mshv_mem_region *region,
 	return ret;
 }
 
-static int mshv_region_range_fault(struct mshv_mem_region *region,
+static int mshv_region_range_fault(struct mshv_region *region,
 				   u64 pfn_offset, u64 pfn_count)
 {
 	return mshv_region_collect_and_map(region, pfn_offset, pfn_count,
 					   true);
 }
 
-bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn)
+bool mshv_region_handle_gfn_fault(struct mshv_region *region, u64 gfn)
 {
 	u64 pfn_offset, pfn_count;
 	int ret;
@@ -706,9 +706,9 @@ static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
 					    const struct mmu_notifier_range *range,
 					    unsigned long cur_seq)
 {
-	struct mshv_mem_region *region = container_of(mni,
-						      struct mshv_mem_region,
-						      mreg_mni);
+	struct mshv_region *region = container_of(mni,
+						  struct mshv_region,
+						  mreg_mni);
 	u64 pfn_offset, pfn_count;
 	unsigned long mstart, mend;
 	int ret = -EPERM;
@@ -767,7 +767,7 @@ static const struct mmu_interval_notifier_ops mshv_region_mni_ops = {
  *
  * Return: 0 on success, negative error code on failure.
  */
-static int mshv_map_pinned_region(struct mshv_mem_region *region)
+static int mshv_map_pinned_region(struct mshv_region *region)
 {
 	struct mshv_partition *partition = region->partition;
 	int ret;
@@ -823,13 +823,13 @@ static int mshv_map_pinned_region(struct mshv_mem_region *region)
 	return ret;
 }
 
-static int mshv_map_movable_region(struct mshv_mem_region *region)
+static int mshv_map_movable_region(struct mshv_region *region)
 {
 	return mshv_region_collect_and_map(region, 0, region->nr_pfns,
 					   false);
 }
 
-static int mshv_map_mmio_region(struct mshv_mem_region *region)
+static int mshv_map_mmio_region(struct mshv_region *region)
 {
 	struct mshv_partition *partition = region->partition;
 
@@ -838,7 +838,7 @@ static int mshv_map_mmio_region(struct mshv_mem_region *region)
 				     region->nr_pfns);
 }
 
-int mshv_map_region(struct mshv_mem_region *region)
+int mshv_map_region(struct mshv_region *region)
 {
 	switch (region->mreg_type) {
 	case MSHV_REGION_TYPE_MEM_PINNED:
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 2bcdfa070517..97659ba55418 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -81,7 +81,7 @@ enum mshv_region_type {
 	MSHV_REGION_TYPE_MMIO
 };
 
-struct mshv_mem_region {
+struct mshv_region {
 	struct hlist_node hnode;
 	struct kref mreg_refcount;
 	u64 nr_pfns;
@@ -367,13 +367,13 @@ extern struct mshv_root mshv_root;
 extern enum hv_scheduler_type hv_scheduler_type;
 extern u8 * __percpu *hv_synic_eventring_tail;
 
-struct mshv_mem_region *mshv_region_create(enum mshv_region_type type,
-					   u64 guest_pfn, u64 nr_pfns,
-					   u64 uaddr, u32 flags,
-					   ulong mmio_pfn);
-void mshv_region_put(struct mshv_mem_region *region);
-int mshv_region_get(struct mshv_mem_region *region);
-bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn);
-int mshv_map_region(struct mshv_mem_region *region);
+struct mshv_region *mshv_region_create(enum mshv_region_type type,
+				       u64 guest_pfn, u64 nr_pfns,
+				       u64 uaddr, u32 flags,
+				       ulong mmio_pfn);
+void mshv_region_put(struct mshv_region *region);
+int mshv_region_get(struct mshv_region *region);
+bool mshv_region_handle_gfn_fault(struct mshv_region *region, u64 gfn);
+int mshv_map_region(struct mshv_region *region);
 
 #endif /* _MSHV_ROOT_H_ */
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 3bfa9e9c575f..9d83a2348655 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -612,10 +612,10 @@ static long mshv_run_vp_with_root_scheduler(struct mshv_vp *vp)
 static_assert(sizeof(struct hv_message) <= MSHV_RUN_VP_BUF_SZ,
 	      "sizeof(struct hv_message) must not exceed MSHV_RUN_VP_BUF_SZ");
 
-static struct mshv_mem_region *
+static struct mshv_region *
 mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
 {
-	struct mshv_mem_region *region;
+	struct mshv_region *region;
 
 	hlist_for_each_entry(region, &partition->pt_mem_regions, hnode) {
 		if (gfn >= region->start_gfn &&
@@ -626,10 +626,10 @@ mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
 	return NULL;
 }
 
-static struct mshv_mem_region *
+static struct mshv_region *
 mshv_partition_region_by_gfn_get(struct mshv_partition *p, u64 gfn)
 {
-	struct mshv_mem_region *region;
+	struct mshv_region *region;
 
 	spin_lock(&p->pt_mem_regions_lock);
 	region = mshv_partition_region_by_gfn(p, gfn);
@@ -656,7 +656,7 @@ mshv_partition_region_by_gfn_get(struct mshv_partition *p, u64 gfn)
 static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
 {
 	struct mshv_partition *p = vp->vp_partition;
-	struct mshv_mem_region *region;
+	struct mshv_region *region;
 	bool ret = false;
 	u64 gfn;
 #if defined(CONFIG_X86_64)
@@ -1217,9 +1217,9 @@ static void mshv_async_hvcall_handler(void *data, u64 *status)
  */
 static int mshv_partition_create_region(struct mshv_partition *partition,
 					struct mshv_user_mem_region *mem,
-					struct mshv_mem_region **regionpp)
+					struct mshv_region **regionpp)
 {
-	struct mshv_mem_region *rg;
+	struct mshv_region *rg;
 	enum mshv_region_type type;
 	u64 nr_pfns = HVPFN_DOWN(mem->size);
 	struct vm_area_struct *vma;
@@ -1282,7 +1282,7 @@ static long
 mshv_map_user_memory(struct mshv_partition *partition,
 		     struct mshv_user_mem_region mem)
 {
-	struct mshv_mem_region *region;
+	struct mshv_region *region;
 	long ret;
 
 	if (mem.flags & BIT(MSHV_SET_MEM_BIT_UNMAP) ||
@@ -1318,7 +1318,7 @@ static long
 mshv_unmap_user_memory(struct mshv_partition *partition,
 		       struct mshv_user_mem_region mem)
 {
-	struct mshv_mem_region *region;
+	struct mshv_region *region;
 
 	if (!(mem.flags & BIT(MSHV_SET_MEM_BIT_UNMAP)))
 		return -EINVAL;
@@ -1690,7 +1690,7 @@ remove_partition(struct mshv_partition *partition)
 static void destroy_partition(struct mshv_partition *partition)
 {
 	struct mshv_vp *vp;
-	struct mshv_mem_region *region;
+	struct mshv_region *region;
 	struct hlist_node *n;
 	int i;
 



^ permalink raw reply related

* [PATCH v2 4/7] mshv: Optimize memory region mapping operations
From: Stanislav Kinsburskii @ 2026-04-02 18:04 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177515251087.119822.1940529498624181326.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

Two specific operations don't require PFN iteration: region unmapping
and region remapping with no access. For unmapping, all frames in MSHV
memory regions are guaranteed to be mapped with page access, so we can
unmap them all without checking individual PFNs. For remapping with no
access, all frames are already mapped with page access, allowing us to
unmap them all in one pass.

Since neither operation needs PFN validation, iterating over PFNs is
redundant. Batch operations into large page-aligned chunks followed by
remaining pages. This eliminates PFN traversal for these operations,
requires no additional hypercalls compared to the PFN-checking approach,
and provides the simplest possible sequential execution path.

The optimization utilizes HV_MAP_GPA_LARGE_PAGE and
HV_UNMAP_GPA_LARGE_PAGE flags for aligned portions, processing only the
remainder with base page granularity. This removes
mshv_region_chunk_unmap() and eliminates PFN iteration for unmap and
no-access operations, reducing code complexity.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c |   65 ++++++++++++++++++++++++++++++++-------------
 1 file changed, 46 insertions(+), 19 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index 2c4215381e0b..a92381219758 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -449,27 +449,27 @@ static int mshv_region_pin(struct mshv_region *region)
 	return ret < 0 ? ret : -ENOMEM;
 }
 
-static int mshv_region_chunk_unmap(struct mshv_region *region,
-				   u32 flags,
-				   u64 pfn_offset, u64 pfn_count,
-				   bool huge_page)
+static int mshv_region_unmap(struct mshv_region *region)
 {
-	if (!pfn_valid(region->mreg_pfns[pfn_offset]))
-		return 0;
+	u64 aligned_pages, remaining_pages;
+	int ret = 0;
 
-	if (huge_page)
-		flags |= HV_UNMAP_GPA_LARGE_PAGE;
+	aligned_pages = ALIGN_DOWN(region->nr_pfns, PTRS_PER_PMD);
+	remaining_pages = region->nr_pfns - aligned_pages;
 
-	return hv_call_unmap_pfns(region->partition->pt_id,
-				  region->start_gfn + pfn_offset,
-				  pfn_count, flags);
-}
+	if (aligned_pages)
+		ret = hv_call_unmap_pfns(region->partition->pt_id,
+					 region->start_gfn,
+					 aligned_pages,
+					 HV_UNMAP_GPA_LARGE_PAGE);
 
-static int mshv_region_unmap(struct mshv_region *region)
-{
-	return mshv_region_process_range(region, 0,
-					 0, region->nr_pfns,
-					 mshv_region_chunk_unmap);
+	if (!ret && remaining_pages)
+		ret = hv_call_unmap_pfns(region->partition->pt_id,
+					 region->start_gfn + aligned_pages,
+					 remaining_pages,
+					 0);
+
+	return ret;
 }
 
 static void mshv_region_destroy(struct kref *ref)
@@ -684,6 +684,34 @@ bool mshv_region_handle_gfn_fault(struct mshv_region *region, u64 gfn)
 	return !ret;
 }
 
+static int mshv_region_map_no_access(struct mshv_region *region,
+				     u64 pfn_offset, u64 pfn_count)
+{
+	u64 aligned_pages, remaining_pages;
+	int ret = 0;
+
+	aligned_pages = ALIGN_DOWN(pfn_count, PTRS_PER_PMD);
+	remaining_pages = pfn_count - aligned_pages;
+
+	if (aligned_pages)
+		ret = hv_call_map_ram_pfns(region->partition->pt_id,
+					   region->start_gfn + pfn_offset,
+					   aligned_pages,
+					   HV_MAP_GPA_NO_ACCESS |
+					   HV_MAP_GPA_LARGE_PAGE,
+					   NULL);
+
+	if (!ret && remaining_pages)
+		ret = hv_call_map_ram_pfns(region->partition->pt_id,
+					   region->start_gfn +
+					   aligned_pages + pfn_offset,
+					   remaining_pages,
+					   HV_MAP_GPA_NO_ACCESS,
+					   NULL);
+
+	return ret;
+}
+
 /**
  * mshv_region_interval_invalidate - Invalidate a range of memory region
  * @mni: Pointer to the mmu_interval_notifier structure
@@ -727,8 +755,7 @@ static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
 
 	mmu_interval_set_seq(mni, cur_seq);
 
-	ret = mshv_region_remap_pfns(region, HV_MAP_GPA_NO_ACCESS,
-				     pfn_offset, pfn_count);
+	ret = mshv_region_map_no_access(region, pfn_offset, pfn_count);
 	if (ret)
 		goto out_unlock;
 



^ permalink raw reply related

* [PATCH v2 5/7] mshv: Pass pfns array explicitly through processing chain
From: Stanislav Kinsburskii @ 2026-04-02 18:04 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177515251087.119822.1940529498624181326.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

The current implementation relies on accessing region->pfns directly
within the pfn processing chain, making it difficult to use these
handlers with alternative pfn sources. This tight coupling limits
flexibility when processing pfns from different locations, such as
temporary arrays or external sources.

By threading the pfns pointer through the entire processing chain
(mshv_region_process_range, mshv_region_process_chunk, and all
handlers), we decouple the processing logic from the storage location.
This enables future enhancements like processing pfns from multiple
sources or implementing more sophisticated memory management strategies
without duplicating the core processing logic.

No functional change intended.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c |   61 +++++++++++++++++++++++++++------------------
 1 file changed, 37 insertions(+), 24 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index a92381219758..120ce2588501 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -22,7 +22,7 @@
 
 typedef int (*pfn_handler_t)(struct mshv_region *region, u32 flags,
 			     u64 pfn_offset, u64 pfn_count,
-			     bool huge_page);
+			     unsigned long *pfns, bool huge_page);
 
 static const struct mmu_interval_notifier_ops mshv_region_mni_ops;
 
@@ -84,6 +84,7 @@ static int mshv_chunk_stride(struct page *page,
 static long mshv_region_process_pfns(struct mshv_region *region,
 				     u32 flags,
 				     u64 pfn_offset, u64 pfn_count,
+				     unsigned long *pfns,
 				     pfn_handler_t handler)
 {
 	u64 gfn = region->start_gfn + pfn_offset;
@@ -91,7 +92,7 @@ static long mshv_region_process_pfns(struct mshv_region *region,
 	unsigned long pfn;
 	int stride, ret;
 
-	pfn = region->mreg_pfns[pfn_offset];
+	pfn = pfns[pfn_offset];
 	if (!pfn_valid(pfn))
 		return -EINVAL;
 
@@ -101,7 +102,7 @@ static long mshv_region_process_pfns(struct mshv_region *region,
 
 	/* Start at stride since the first stride is validated */
 	for (count = stride; count < pfn_count ; count += stride) {
-		pfn = region->mreg_pfns[pfn_offset + count];
+		pfn = pfns[pfn_offset + count];
 
 		/* Break if current pfn is invalid */
 		if (!pfn_valid(pfn))
@@ -114,7 +115,7 @@ static long mshv_region_process_pfns(struct mshv_region *region,
 			break;
 	}
 
-	ret = handler(region, flags, pfn_offset, count, stride > 1);
+	ret = handler(region, flags, pfn_offset, count, pfns, stride > 1);
 	if (ret)
 		return ret;
 
@@ -138,11 +139,12 @@ static long mshv_region_process_pfns(struct mshv_region *region,
 static long mshv_region_process_hole(struct mshv_region *region,
 				     u32 flags,
 				     u64 pfn_offset, u64 pfn_count,
+				     unsigned long *pfns,
 				     pfn_handler_t handler)
 {
 	long ret;
 
-	ret = handler(region, flags, pfn_offset, pfn_count, 0);
+	ret = handler(region, flags, pfn_offset, pfn_count, pfns, 0);
 	if (ret)
 		return ret;
 
@@ -152,15 +154,16 @@ static long mshv_region_process_hole(struct mshv_region *region,
 static long mshv_region_process_chunk(struct mshv_region *region,
 				      u32 flags,
 				      u64 pfn_offset, u64 pfn_count,
+				      unsigned long *pfns,
 				      pfn_handler_t handler)
 {
-	if (pfn_valid(region->mreg_pfns[pfn_offset]))
+	if (pfn_valid(pfns[pfn_offset]))
 		return mshv_region_process_pfns(region, flags,
-				pfn_offset, pfn_count,
+				pfn_offset, pfn_count, pfns,
 				handler);
 	else
 		return mshv_region_process_hole(region, flags,
-				pfn_offset, pfn_count,
+				pfn_offset, pfn_count, pfns,
 				handler);
 }
 
@@ -170,12 +173,13 @@ static long mshv_region_process_chunk(struct mshv_region *region,
  * @flags     : Flags to pass to the handler.
  * @pfn_offset: Offset into the region's PFNs array to start processing.
  * @pfn_count : Number of PFNs to process.
+ * @pfns      : Pointer to an array of PFNs corresponding to the region.
  * @handler   : Callback function to handle each chunk of contiguous
  *              valid PFNs.
  *
- * Iterates over the specified range of PFNs in @region, skipping
- * invalid PFNs. For each contiguous chunk of valid PFNS, invokes
- * @handler via mshv_region_process_pfns.
+ * Iterates over the specified range of PFNs, skipping invalid PFNs.
+ * For each contiguous chunk of valid PFNS, invokes @handler via
+ * mshv_region_process_pfns.
  *
  * Note: The @handler callback must be able to handle PFNs backed by both
  * normal and huge pages.
@@ -185,6 +189,7 @@ static long mshv_region_process_chunk(struct mshv_region *region,
 static int mshv_region_process_range(struct mshv_region *region,
 				     u32 flags,
 				     u64 pfn_offset, u64 pfn_count,
+				     unsigned long *pfns,
 				     pfn_handler_t handler)
 {
 	u64 start, end;
@@ -207,15 +212,14 @@ static int mshv_region_process_range(struct mshv_region *region,
 		 * Accumulate contiguous pfns with the same validity
 		 * (valid or not).
 		 */
-		if (pfn_valid(region->mreg_pfns[start]) ==
-		    pfn_valid(region->mreg_pfns[end])) {
+		if (pfn_valid(pfns[start]) == pfn_valid(pfns[end])) {
 			end++;
 			continue;
 		}
 
 		ret = mshv_region_process_chunk(region, flags,
 						start, end - start,
-						handler);
+						pfns, handler);
 		if (ret < 0)
 			return ret;
 
@@ -224,7 +228,7 @@ static int mshv_region_process_range(struct mshv_region *region,
 
 	ret = mshv_region_process_chunk(region, flags,
 					start, end - start,
-					handler);
+					pfns, handler);
 	if (ret < 0)
 		return ret;
 
@@ -289,16 +293,17 @@ struct mshv_region *mshv_region_create(enum mshv_region_type type,
 static int mshv_region_chunk_share(struct mshv_region *region,
 				   u32 flags,
 				   u64 pfn_offset, u64 pfn_count,
+				   unsigned long *pfns,
 				   bool huge_page)
 {
-	if (!pfn_valid(region->mreg_pfns[pfn_offset]))
+	if (!pfn_valid(pfns[pfn_offset]))
 		return -EINVAL;
 
 	if (huge_page)
 		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
 
 	return hv_call_modify_spa_host_access(region->partition->pt_id,
-					      region->mreg_pfns + pfn_offset,
+					      pfns + pfn_offset,
 					      pfn_count,
 					      HV_MAP_GPA_READABLE |
 					      HV_MAP_GPA_WRITABLE,
@@ -311,22 +316,24 @@ static int mshv_region_share(struct mshv_region *region)
 
 	return mshv_region_process_range(region, flags,
 					 0, region->nr_pfns,
+					 region->mreg_pfns,
 					 mshv_region_chunk_share);
 }
 
 static int mshv_region_chunk_unshare(struct mshv_region *region,
 				     u32 flags,
 				     u64 pfn_offset, u64 pfn_count,
+				     unsigned long *pfns,
 				     bool huge_page)
 {
-	if (!pfn_valid(region->mreg_pfns[pfn_offset]))
+	if (!pfn_valid(pfns[pfn_offset]))
 		return -EINVAL;
 
 	if (huge_page)
 		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
 
 	return hv_call_modify_spa_host_access(region->partition->pt_id,
-					      region->mreg_pfns + pfn_offset,
+					      pfns + pfn_offset,
 					      pfn_count, 0,
 					      flags, false);
 }
@@ -337,12 +344,14 @@ static int mshv_region_unshare(struct mshv_region *region)
 
 	return mshv_region_process_range(region, flags,
 					 0, region->nr_pfns,
+					 region->mreg_pfns,
 					 mshv_region_chunk_unshare);
 }
 
 static int mshv_region_chunk_remap(struct mshv_region *region,
 				   u32 flags,
 				   u64 pfn_offset, u64 pfn_count,
+				   unsigned long *pfns,
 				   bool huge_page)
 {
 	/*
@@ -350,7 +359,7 @@ static int mshv_region_chunk_remap(struct mshv_region *region,
 	 * hypervisor track dirty pages, enabling precopy live
 	 * migration.
 	 */
-	if (!pfn_valid(region->mreg_pfns[pfn_offset]))
+	if (!pfn_valid(pfns[pfn_offset]))
 		flags = HV_MAP_GPA_NO_ACCESS;
 
 	if (huge_page)
@@ -359,15 +368,17 @@ static int mshv_region_chunk_remap(struct mshv_region *region,
 	return hv_call_map_ram_pfns(region->partition->pt_id,
 				    region->start_gfn + pfn_offset,
 				    pfn_count, flags,
-				    region->mreg_pfns + pfn_offset);
+				    pfns + pfn_offset);
 }
 
 static int mshv_region_remap_pfns(struct mshv_region *region,
 				  u32 map_flags,
-				  u64 pfn_offset, u64 pfn_count)
+				  u64 pfn_offset, u64 pfn_count,
+				  unsigned long *pfns)
 {
 	return mshv_region_process_range(region, map_flags,
 					 pfn_offset, pfn_count,
+					 pfns,
 					 mshv_region_chunk_remap);
 }
 
@@ -376,7 +387,8 @@ static int mshv_region_map(struct mshv_region *region)
 	u32 map_flags = region->hv_map_flags;
 
 	return mshv_region_remap_pfns(region, map_flags,
-				      0, region->nr_pfns);
+				      0, region->nr_pfns,
+				      region->mreg_pfns);
 }
 
 static void mshv_region_invalidate_pfns(struct mshv_region *region,
@@ -645,7 +657,8 @@ static int mshv_region_collect_and_map(struct mshv_region *region,
 	}
 
 	ret = mshv_region_remap_pfns(region, region->hv_map_flags,
-				     pfn_offset, pfn_count);
+				     pfn_offset, pfn_count,
+				     region->mreg_pfns);
 
 	mutex_unlock(&region->mreg_mutex);
 out:



^ permalink raw reply related

* [PATCH v2 6/7] mshv: Simplify pfn array handling in region processing
From: Stanislav Kinsburskii @ 2026-04-02 18:04 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177515251087.119822.1940529498624181326.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

The current code requires passing both the full pfn array and an offset
parameter to region processing functions, forcing callees to manually
index into arrays. This approach is inflexible and makes it difficult
to work with different sources of pfn arrays.

Upcoming changes will need to pass pfn arrays obtained from the HMM
framework directly to these functions. The HMM framework returns arrays
that represent specific ranges rather than full region arrays with
offsets, making the current offset-based indexing pattern incompatible.

Refactor by having callers pass pre-offset pointers to pfn arrays and
removing offset-based indexing from callees. This allows functions to
work with any pfn array starting at index 0, regardless of its source,
and prepares the code for HMM integration.

No functional change intended.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c |   38 ++++++++++++++++++--------------------
 1 file changed, 18 insertions(+), 20 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index 120ce2588501..3c2a35f3bc31 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -92,7 +92,7 @@ static long mshv_region_process_pfns(struct mshv_region *region,
 	unsigned long pfn;
 	int stride, ret;
 
-	pfn = pfns[pfn_offset];
+	pfn = pfns[0];
 	if (!pfn_valid(pfn))
 		return -EINVAL;
 
@@ -102,7 +102,7 @@ static long mshv_region_process_pfns(struct mshv_region *region,
 
 	/* Start at stride since the first stride is validated */
 	for (count = stride; count < pfn_count ; count += stride) {
-		pfn = pfns[pfn_offset + count];
+		pfn = pfns[count];
 
 		/* Break if current pfn is invalid */
 		if (!pfn_valid(pfn))
@@ -157,7 +157,7 @@ static long mshv_region_process_chunk(struct mshv_region *region,
 				      unsigned long *pfns,
 				      pfn_handler_t handler)
 {
-	if (pfn_valid(pfns[pfn_offset]))
+	if (pfn_valid(pfns[0]))
 		return mshv_region_process_pfns(region, flags,
 				pfn_offset, pfn_count, pfns,
 				handler);
@@ -204,10 +204,7 @@ static int mshv_region_process_range(struct mshv_region *region,
 	if (end > region->nr_pfns)
 		return -EINVAL;
 
-	start = pfn_offset;
-	end = pfn_offset + 1;
-
-	while (end < pfn_offset + pfn_count) {
+	for (start = 0, end = 1; end < pfn_count; ) {
 		/*
 		 * Accumulate contiguous pfns with the same validity
 		 * (valid or not).
@@ -218,8 +215,9 @@ static int mshv_region_process_range(struct mshv_region *region,
 		}
 
 		ret = mshv_region_process_chunk(region, flags,
-						start, end - start,
-						pfns, handler);
+						pfn_offset + start,
+						end - start,
+						pfns + start, handler);
 		if (ret < 0)
 			return ret;
 
@@ -227,8 +225,9 @@ static int mshv_region_process_range(struct mshv_region *region,
 	}
 
 	ret = mshv_region_process_chunk(region, flags,
-					start, end - start,
-					pfns, handler);
+					pfn_offset + start,
+					end - start,
+					pfns + start, handler);
 	if (ret < 0)
 		return ret;
 
@@ -296,15 +295,14 @@ static int mshv_region_chunk_share(struct mshv_region *region,
 				   unsigned long *pfns,
 				   bool huge_page)
 {
-	if (!pfn_valid(pfns[pfn_offset]))
+	if (!pfn_valid(pfns[0]))
 		return -EINVAL;
 
 	if (huge_page)
 		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
 
 	return hv_call_modify_spa_host_access(region->partition->pt_id,
-					      pfns + pfn_offset,
-					      pfn_count,
+					      pfns, pfn_count,
 					      HV_MAP_GPA_READABLE |
 					      HV_MAP_GPA_WRITABLE,
 					      flags, true);
@@ -326,15 +324,15 @@ static int mshv_region_chunk_unshare(struct mshv_region *region,
 				     unsigned long *pfns,
 				     bool huge_page)
 {
-	if (!pfn_valid(pfns[pfn_offset]))
+	if (!pfn_valid(pfns[0]))
 		return -EINVAL;
 
 	if (huge_page)
 		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
 
 	return hv_call_modify_spa_host_access(region->partition->pt_id,
-					      pfns + pfn_offset,
-					      pfn_count, 0,
+					      pfns, pfn_count,
+					      0,
 					      flags, false);
 }
 
@@ -359,7 +357,7 @@ static int mshv_region_chunk_remap(struct mshv_region *region,
 	 * hypervisor track dirty pages, enabling precopy live
 	 * migration.
 	 */
-	if (!pfn_valid(pfns[pfn_offset]))
+	if (!pfn_valid(pfns[0]))
 		flags = HV_MAP_GPA_NO_ACCESS;
 
 	if (huge_page)
@@ -368,7 +366,7 @@ static int mshv_region_chunk_remap(struct mshv_region *region,
 	return hv_call_map_ram_pfns(region->partition->pt_id,
 				    region->start_gfn + pfn_offset,
 				    pfn_count, flags,
-				    pfns + pfn_offset);
+				    pfns);
 }
 
 static int mshv_region_remap_pfns(struct mshv_region *region,
@@ -658,7 +656,7 @@ static int mshv_region_collect_and_map(struct mshv_region *region,
 
 	ret = mshv_region_remap_pfns(region, region->hv_map_flags,
 				     pfn_offset, pfn_count,
-				     region->mreg_pfns);
+				     region->mreg_pfns + pfn_offset);
 
 	mutex_unlock(&region->mreg_mutex);
 out:



^ permalink raw reply related

* [PATCH v2 7/7] mshv: Allocate pfns array only for pinned regions
From: Stanislav Kinsburskii @ 2026-04-02 18:04 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177515251087.119822.1940529498624181326.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

Convert pfns to a pointer allocated only for pinned regions that
actually need it for share/unshare/evict operations. Unpinned
regions use NULL since HMM handles their mappings dynamically.

The pfns array was previously a flexible array member, forcing
allocation for all regions regardless of memory type. This wastes
significant memory for unpinned HMM-managed regions which don't
need persistent PFN tracking - a 1GB region wastes 2MB for an
unused array.

This also allows using kzalloc for the main structure instead of
vzalloc, improving allocation efficiency and cache locality.

Simplify unpinned region invalidation by calling the hypervisor
directly rather than tracking PFNs. The tradeoff of skipping huge
page optimization is acceptable since invalidation ranges are
typically small and not performance-critical.

Add NULL checks where pfns array is required and update cleanup
to handle conditional allocation.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c |   61 ++++++++++++++++++++++++---------------------
 drivers/hv/mshv_root.h    |    6 +++-
 2 files changed, 37 insertions(+), 30 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index 3c2a35f3bc31..cd28758ca494 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -243,7 +243,7 @@ struct mshv_region *mshv_region_create(enum mshv_region_type type,
 	int ret = 0;
 	u64 i;
 
-	region = vzalloc(sizeof(*region) + sizeof(unsigned long) * nr_pfns);
+	region = kzalloc_obj(*region);
 	if (!region)
 		return ERR_PTR(-ENOMEM);
 
@@ -255,6 +255,13 @@ struct mshv_region *mshv_region_create(enum mshv_region_type type,
 						   &mshv_region_mni_ops);
 		break;
 	case MSHV_REGION_TYPE_MEM_PINNED:
+		region->mreg_pfns = vmalloc_array(nr_pfns, sizeof(unsigned long));
+		if (!region->mreg_pfns) {
+			ret = -ENOMEM;
+			break;
+		}
+		for (i = 0; i < nr_pfns; i++)
+			region->mreg_pfns[i] = MSHV_INVALID_PFN;
 		break;
 	case MSHV_REGION_TYPE_MMIO:
 		region->mreg_mmio_pfn = mmio_pfn;
@@ -276,16 +283,13 @@ struct mshv_region *mshv_region_create(enum mshv_region_type type,
 	if (flags & BIT(MSHV_SET_MEM_BIT_EXECUTABLE))
 		region->hv_map_flags |= HV_MAP_GPA_EXECUTABLE;
 
-	for (i = 0; i < nr_pfns; i++)
-		region->mreg_pfns[i] = MSHV_INVALID_PFN;
-
 	mutex_init(&region->mreg_mutex);
 	kref_init(&region->mreg_refcount);
 
 	return region;
 
 free_region:
-	vfree(region);
+	kfree(region);
 	return ERR_PTR(ret);
 }
 
@@ -312,6 +316,9 @@ static int mshv_region_share(struct mshv_region *region)
 {
 	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_SHARED;
 
+	if (!region->mreg_pfns)
+		return -EINVAL;
+
 	return mshv_region_process_range(region, flags,
 					 0, region->nr_pfns,
 					 region->mreg_pfns,
@@ -340,6 +347,9 @@ static int mshv_region_unshare(struct mshv_region *region)
 {
 	u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_EXCLUSIVE;
 
+	if (!region->mreg_pfns)
+		return -EINVAL;
+
 	return mshv_region_process_range(region, flags,
 					 0, region->nr_pfns,
 					 region->mreg_pfns,
@@ -380,27 +390,19 @@ static int mshv_region_remap_pfns(struct mshv_region *region,
 					 mshv_region_chunk_remap);
 }
 
-static int mshv_region_map(struct mshv_region *region)
-{
-	u32 map_flags = region->hv_map_flags;
-
-	return mshv_region_remap_pfns(region, map_flags,
-				      0, region->nr_pfns,
-				      region->mreg_pfns);
-}
-
 static void mshv_region_invalidate_pfns(struct mshv_region *region,
 					u64 pfn_offset, u64 pfn_count)
 {
 	u64 i;
 
+	if (region->mreg_type != MSHV_REGION_TYPE_MEM_PINNED)
+		return;
+
 	for (i = pfn_offset; i < pfn_offset + pfn_count; i++) {
 		if (!pfn_valid(region->mreg_pfns[i]))
 			continue;
 
-		if (region->mreg_type == MSHV_REGION_TYPE_MEM_PINNED)
-			unpin_user_page(pfn_to_page(region->mreg_pfns[i]));
-
+		unpin_user_page(pfn_to_page(region->mreg_pfns[i]));
 		region->mreg_pfns[i] = MSHV_INVALID_PFN;
 	}
 }
@@ -506,7 +508,9 @@ static void mshv_region_destroy(struct kref *ref)
 
 	mshv_region_invalidate(region);
 
-	vfree(region);
+	if (region->mreg_type == MSHV_REGION_TYPE_MEM_PINNED)
+		vfree(region->mreg_pfns);
+	kfree(region);
 }
 
 void mshv_region_put(struct mshv_region *region)
@@ -616,10 +620,9 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_region *region,
  *   leaving missing pages as invalid PFN markers.
  *   Used for initial region setup.
  *
- * Collected PFNs are stored in region->mreg_pfns[] with HMM bookkeeping
- * flags cleared, then the range is mapped into the hypervisor. Present
- * PFNs get mapped with region access permissions; missing PFNs (zero
- * entries) get mapped with no-access permissions.
+ * HMM bookkeeping flags are stripped from collected PFNs before mapping.
+ * Present PFNs get mapped with region access permissions; missing PFNs
+ * (marked as MSHV_INVALID_PFN) get mapped with no-access permissions.
  *
  * Return: 0 on success, negative errno on failure.
  */
@@ -648,15 +651,17 @@ static int mshv_region_collect_and_map(struct mshv_region *region,
 		goto out;
 
 	for (i = 0; i < pfn_count; i++) {
-		if (!(pfns[i] & HMM_PFN_VALID))
+		if (!(pfns[i] & HMM_PFN_VALID)) {
+			pfns[i] = MSHV_INVALID_PFN;
 			continue;
+		}
 		/* Drop HMM_PFN_* flags to ensure PFNs are valid. */
-		region->mreg_pfns[pfn_offset + i] = pfns[i] & ~HMM_PFN_FLAGS;
+		pfns[i] &= ~HMM_PFN_FLAGS;
 	}
 
 	ret = mshv_region_remap_pfns(region, region->hv_map_flags,
 				     pfn_offset, pfn_count,
-				     region->mreg_pfns + pfn_offset);
+				     pfns);
 
 	mutex_unlock(&region->mreg_mutex);
 out:
@@ -770,8 +775,6 @@ static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
 	if (ret)
 		goto out_unlock;
 
-	mshv_region_invalidate_pfns(region, pfn_offset, pfn_count);
-
 	mutex_unlock(&region->mreg_mutex);
 
 	return true;
@@ -834,7 +837,9 @@ static int mshv_map_pinned_region(struct mshv_region *region)
 		}
 	}
 
-	ret = mshv_region_map(region);
+	ret = mshv_region_remap_pfns(region, region->hv_map_flags,
+				     0, region->nr_pfns,
+				     region->mreg_pfns);
 	if (!ret)
 		return 0;
 
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 97659ba55418..e43bdbf1ada8 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -92,8 +92,10 @@ struct mshv_region {
 	enum mshv_region_type mreg_type;
 	struct mmu_interval_notifier mreg_mni;
 	struct mutex mreg_mutex;	/* protects region PFNs remapping */
-	u64 mreg_mmio_pfn;
-	unsigned long mreg_pfns[];
+	union {
+		unsigned long *mreg_pfns;
+		u64 mreg_mmio_pfn;
+	};
 };
 
 struct mshv_irq_ack_notifier {



^ permalink raw reply related

* [PATCH net-next v5 0/3] net: mana: debugfs fixes and diagnostic info
From: Erni Sri Satya Vennela @ 2026-04-02 18:26 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, kotaranov, horms, shradhagupta,
	shirazsaleem, yury.norov, kees, ssengar, ernis, dipayanroy,
	gargaditya, linux-hyperv, netdev, linux-kernel, linux-rdma

This series first fixes two pre-existing debugfs issues in the MANA
driver, then adds new debugfs entries for hardware diagnostic info.

Patch 1 fixes the per-device debugfs directory naming to use the unique
PCI BDF address via pci_name(), avoiding a potential NULL pointer
dereference when pdev->slot is NULL and preventing name collisions
across multiple PFs or VFs.

Patch 2 moves the current_speed debugfs file creation from
mana_probe_port() to mana_init_port() so it survives detach/attach
cycles triggered by MTU changes or XDP program changes.

Patch 3 adds new debugfs entries exposing hardware configuration and
diagnostic information (device capabilities, vPort config, steering
parameters) and consolidates debugfs directory lifecycle into
mana_gd_setup()/mana_gd_cleanup_device().
---
Changes in v5:
* Create new patchset including all the the patches.
Changes in v4:
* Rebase and fix conflicts.
Changes in v3:
* Rename mana_gd_cleanup to mana_gd_cleanup_device.
* Add creation of debugfs entries in mana_gd_setup.
* Add removal of debugfs entries in mana_gd_cleanup_device.
* Remove bm_hostmode and num_vports from debugfs in mana_remove itself,
  because "ac" gets freed before debugfs_remove_recursive, to avoid
  Use-After-Free error.
* Add "goto out:" in mana_cfg_vport_steering to avoid populating apc
  values when resp.hdr.status is not NULL.
Changes in v2:
* Add debugfs_remove_recursice for gc>mana_pci_debugfs in
  mana_gd_suspend to handle multiple duplicates creation in
  mana_gd_setup and mana_gd_resume path.
* Move debugfs creation for num_vports and bm_hostmode out of
  if(!resuming) condition since we have to create it again even for
  resume.
* Recreate mana_pci_debugfs in mana_gd_resume.
---
Erni Sri Satya Vennela (3):
  net: mana: Use pci_name() for debugfs directory naming
  net: mana: Move current_speed debugfs file to mana_init_port()
  net: mana: Expose hardware diagnostic info via debugfs

 .../net/ethernet/microsoft/mana/gdma_main.c   | 62 ++++++++++---------
 drivers/net/ethernet/microsoft/mana/mana_en.c | 37 ++++++++++-
 include/net/mana/gdma.h                       |  1 +
 include/net/mana/mana.h                       |  8 +++
 4 files changed, 76 insertions(+), 32 deletions(-)

-- 
2.34.1


^ permalink raw reply

* [PATCH net-next v5 1/3] net: mana: Use pci_name() for debugfs directory naming
From: Erni Sri Satya Vennela @ 2026-04-02 18:26 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, kotaranov, horms, shradhagupta,
	shirazsaleem, yury.norov, kees, ssengar, ernis, dipayanroy,
	gargaditya, linux-hyperv, netdev, linux-kernel, linux-rdma
In-Reply-To: <20260402182704.2474739-1-ernis@linux.microsoft.com>

Use pci_name(pdev) for the per-device debugfs directory instead of
hardcoded "0" for PFs and pci_slot_name(pdev->slot) for VFs. The
previous approach had two issues:

1. pci_slot_name() dereferences pdev->slot, which can be NULL for VFs
   in environments like generic VFIO passthrough or nested KVM,
   causing a NULL pointer dereference.

2. Multiple PFs would all use "0", and VFs across different PCI
   domains or buses could share the same slot name, leading to
   -EEXIST errors from debugfs_create_dir().

pci_name(pdev) returns the unique BDF address, is always valid, and
is unique across the system.

Fixes: 6607c17c6c5e ("net: mana: Enable debugfs files for MANA device")
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
---
Changes in v5:
* New to patchset.
Changes in v4, v3, v2:
* Not created
---
 drivers/net/ethernet/microsoft/mana/gdma_main.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 43741cd35af8..098fbda0d128 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -2065,11 +2065,8 @@ static int mana_gd_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	gc->dev = &pdev->dev;
 	xa_init(&gc->irq_contexts);
 
-	if (gc->is_pf)
-		gc->mana_pci_debugfs = debugfs_create_dir("0", mana_debugfs_root);
-	else
-		gc->mana_pci_debugfs = debugfs_create_dir(pci_slot_name(pdev->slot),
-							  mana_debugfs_root);
+	gc->mana_pci_debugfs = debugfs_create_dir(pci_name(pdev),
+						  mana_debugfs_root);
 
 	err = mana_gd_setup(pdev);
 	if (err)
-- 
2.34.1


^ permalink raw reply related

* [PATCH net-next v5 2/3] net: mana: Move current_speed debugfs file to mana_init_port()
From: Erni Sri Satya Vennela @ 2026-04-02 18:26 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, kotaranov, horms, shradhagupta,
	shirazsaleem, yury.norov, kees, ssengar, ernis, dipayanroy,
	gargaditya, linux-hyperv, netdev, linux-kernel, linux-rdma
In-Reply-To: <20260402182704.2474739-1-ernis@linux.microsoft.com>

Move the current_speed debugfs file creation from mana_probe_port() to
mana_init_port(). The file was previously created only during initial
probe, but mana_cleanup_port_context() removes the entire vPort debugfs
directory during detach/attach cycles. Since mana_init_port() recreates
the directory on re-attach, moving current_speed here ensures it survives
these cycles.

Fixes: 75cabb46935b ("net: mana: Add support for net_shaper_ops")
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
---
Changes in v5:
* New to the patchset.
Changes in v4, v3, v2:
* Not created.
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 565375cc20d3..78522205c5d0 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -3147,6 +3147,8 @@ static int mana_init_port(struct net_device *ndev)
 	eth_hw_addr_set(ndev, apc->mac_addr);
 	sprintf(vport, "vport%d", port_idx);
 	apc->mana_port_debugfs = debugfs_create_dir(vport, gc->mana_pci_debugfs);
+	debugfs_create_u32("current_speed", 0400, apc->mana_port_debugfs,
+			   &apc->speed);
 	return 0;
 
 reset_apc:
@@ -3425,8 +3427,6 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
 
 	netif_carrier_on(ndev);
 
-	debugfs_create_u32("current_speed", 0400, apc->mana_port_debugfs, &apc->speed);
-
 	return 0;
 
 free_indir:
-- 
2.34.1


^ permalink raw reply related

* [PATCH net-next v5 3/3] net: mana: Expose hardware diagnostic info via debugfs
From: Erni Sri Satya Vennela @ 2026-04-02 18:26 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, kotaranov, horms, shradhagupta,
	shirazsaleem, yury.norov, kees, ssengar, ernis, dipayanroy,
	gargaditya, linux-hyperv, netdev, linux-kernel, linux-rdma
In-Reply-To: <20260402182704.2474739-1-ernis@linux.microsoft.com>

Add debugfs entries to expose hardware configuration and diagnostic
information that aids in debugging driver initialization and runtime
operations without adding noise to dmesg.

The debugfs directory creation and removal for each PCI device is
integrated into mana_gd_setup() and mana_gd_cleanup_device()
respectively, so that all callers (probe, remove, suspend, resume,
shutdown) share a single code path.

Device-level entries (under /sys/kernel/debug/mana/<BDF>/):
  - num_msix_usable, max_num_queues: Max resources from hardware
  - gdma_protocol_ver, pf_cap_flags1: VF version negotiation results
  - num_vports, bm_hostmode: Device configuration

Per-vPort entries (under /sys/kernel/debug/mana/<BDF>/vportN/):
  - port_handle: Hardware vPort handle
  - max_sq, max_rq: Max queues from vPort config
  - indir_table_sz: Indirection table size
  - steer_rx, steer_rss, steer_update_tab, steer_cqe_coalescing:
    Last applied steering configuration parameters

Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
---
Changes in v5:
* Update commit message.
* Fix conflicts to align with the new patches.
* Make it part of patchset.
Changes in v4:
* Rebase and fix conflicts.
Changes in v3:
* Rename mana_gd_cleanup to mana_gd_cleanup_device.
* Add creation of debugfs entries in mana_gd_setup.
* Add removal of debugfs entries in mana_gd_cleanup_device.
* Remove bm_hostmode and num_vports from debugfs in mana_remove itself,
  because "ac" gets freed before debugfs_remove_recursive, to avoid
  Use-After-Free error.
* Add "goto out:" in mana_cfg_vport_steering to avoid populating apc
  values when resp.hdr.status is not NULL.
Changes in v2:
* Add debugfs_remove_recursice for gc>mana_pci_debugfs in
  mana_gd_suspend to handle multiple duplicates creation in
  mana_gd_setup and mana_gd_resume path.
* Move debugfs creation for num_vports and bm_hostmode out of
  if(!resuming) condition since we have to create it again even for
  resume.
* Recreate mana_pci_debugfs in mana_gd_resume.
---
 .../net/ethernet/microsoft/mana/gdma_main.c   | 59 ++++++++++---------
 drivers/net/ethernet/microsoft/mana/mana_en.c | 33 +++++++++++
 include/net/mana/gdma.h                       |  1 +
 include/net/mana/mana.h                       |  8 +++
 4 files changed, 74 insertions(+), 27 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 098fbda0d128..7a99db9afa03 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -194,6 +194,11 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
 	if (gc->max_num_queues > gc->num_msix_usable - 1)
 		gc->max_num_queues = gc->num_msix_usable - 1;
 
+	debugfs_create_u32("num_msix_usable", 0400, gc->mana_pci_debugfs,
+			   &gc->num_msix_usable);
+	debugfs_create_u32("max_num_queues", 0400, gc->mana_pci_debugfs,
+			   &gc->max_num_queues);
+
 	return 0;
 }
 
@@ -1264,6 +1269,13 @@ int mana_gd_verify_vf_version(struct pci_dev *pdev)
 		return err ? err : -EPROTO;
 	}
 	gc->pf_cap_flags1 = resp.pf_cap_flags1;
+	gc->gdma_protocol_ver = resp.gdma_protocol_ver;
+
+	debugfs_create_x64("gdma_protocol_ver", 0400, gc->mana_pci_debugfs,
+			   &gc->gdma_protocol_ver);
+	debugfs_create_x64("pf_cap_flags1", 0400, gc->mana_pci_debugfs,
+			   &gc->pf_cap_flags1);
+
 	if (resp.pf_cap_flags1 & GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECONFIG) {
 		err = mana_gd_query_hwc_timeout(pdev, &hwc->hwc_timeout);
 		if (err) {
@@ -1943,15 +1955,20 @@ static int mana_gd_setup(struct pci_dev *pdev)
 	struct gdma_context *gc = pci_get_drvdata(pdev);
 	int err;
 
+	gc->mana_pci_debugfs = debugfs_create_dir(pci_name(pdev),
+						  mana_debugfs_root);
+
 	err = mana_gd_init_registers(pdev);
 	if (err)
-		return err;
+		goto remove_debugfs;
 
 	mana_smc_init(&gc->shm_channel, gc->dev, gc->shm_base);
 
 	gc->service_wq = alloc_ordered_workqueue("gdma_service_wq", 0);
-	if (!gc->service_wq)
-		return -ENOMEM;
+	if (!gc->service_wq) {
+		err = -ENOMEM;
+		goto remove_debugfs;
+	}
 
 	err = mana_gd_setup_hwc_irqs(pdev);
 	if (err) {
@@ -1992,11 +2009,14 @@ static int mana_gd_setup(struct pci_dev *pdev)
 free_workqueue:
 	destroy_workqueue(gc->service_wq);
 	gc->service_wq = NULL;
+remove_debugfs:
+	debugfs_remove_recursive(gc->mana_pci_debugfs);
+	gc->mana_pci_debugfs = NULL;
 	dev_err(&pdev->dev, "%s failed (error %d)\n", __func__, err);
 	return err;
 }
 
-static void mana_gd_cleanup(struct pci_dev *pdev)
+static void mana_gd_cleanup_device(struct pci_dev *pdev)
 {
 	struct gdma_context *gc = pci_get_drvdata(pdev);
 
@@ -2008,6 +2028,10 @@ static void mana_gd_cleanup(struct pci_dev *pdev)
 		destroy_workqueue(gc->service_wq);
 		gc->service_wq = NULL;
 	}
+
+	debugfs_remove_recursive(gc->mana_pci_debugfs);
+	gc->mana_pci_debugfs = NULL;
+
 	dev_dbg(&pdev->dev, "mana gdma cleanup successful\n");
 }
 
@@ -2065,9 +2089,6 @@ static int mana_gd_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	gc->dev = &pdev->dev;
 	xa_init(&gc->irq_contexts);
 
-	gc->mana_pci_debugfs = debugfs_create_dir(pci_name(pdev),
-						  mana_debugfs_root);
-
 	err = mana_gd_setup(pdev);
 	if (err)
 		goto unmap_bar;
@@ -2096,16 +2117,8 @@ static int mana_gd_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 cleanup_mana:
 	mana_remove(&gc->mana, false);
 cleanup_gd:
-	mana_gd_cleanup(pdev);
+	mana_gd_cleanup_device(pdev);
 unmap_bar:
-	/*
-	 * at this point we know that the other debugfs child dir/files
-	 * are either not yet created or are already cleaned up.
-	 * The pci debugfs folder clean-up now, will only be cleaning up
-	 * adapter-MTU file and apc->mana_pci_debugfs folder.
-	 */
-	debugfs_remove_recursive(gc->mana_pci_debugfs);
-	gc->mana_pci_debugfs = NULL;
 	xa_destroy(&gc->irq_contexts);
 	pci_iounmap(pdev, bar0_va);
 free_gc:
@@ -2155,11 +2168,7 @@ static void mana_gd_remove(struct pci_dev *pdev)
 	mana_rdma_remove(&gc->mana_ib);
 	mana_remove(&gc->mana, false);
 
-	mana_gd_cleanup(pdev);
-
-	debugfs_remove_recursive(gc->mana_pci_debugfs);
-
-	gc->mana_pci_debugfs = NULL;
+	mana_gd_cleanup_device(pdev);
 
 	xa_destroy(&gc->irq_contexts);
 
@@ -2181,7 +2190,7 @@ int mana_gd_suspend(struct pci_dev *pdev, pm_message_t state)
 	mana_rdma_remove(&gc->mana_ib);
 	mana_remove(&gc->mana, true);
 
-	mana_gd_cleanup(pdev);
+	mana_gd_cleanup_device(pdev);
 
 	return 0;
 }
@@ -2220,11 +2229,7 @@ static void mana_gd_shutdown(struct pci_dev *pdev)
 	mana_rdma_remove(&gc->mana_ib);
 	mana_remove(&gc->mana, true);
 
-	mana_gd_cleanup(pdev);
-
-	debugfs_remove_recursive(gc->mana_pci_debugfs);
-
-	gc->mana_pci_debugfs = NULL;
+	mana_gd_cleanup_device(pdev);
 
 	pci_disable_device(pdev);
 }
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 78522205c5d0..33464a28e5d5 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1269,6 +1269,9 @@ static int mana_query_vport_cfg(struct mana_port_context *apc, u32 vport_index,
 	apc->port_handle = resp.vport;
 	ether_addr_copy(apc->mac_addr, resp.mac_addr);
 
+	apc->vport_max_sq = *max_sq;
+	apc->vport_max_rq = *max_rq;
+
 	return 0;
 }
 
@@ -1423,6 +1426,11 @@ static int mana_cfg_vport_steering(struct mana_port_context *apc,
 
 	netdev_info(ndev, "Configured steering vPort %llu entries %u\n",
 		    apc->port_handle, apc->indir_table_sz);
+
+	apc->steer_rx = rx;
+	apc->steer_rss = apc->rss_state;
+	apc->steer_update_tab = update_tab;
+	apc->steer_cqe_coalescing = req->cqe_coalescing_enable;
 out:
 	kfree(req);
 	return err;
@@ -3147,6 +3155,23 @@ static int mana_init_port(struct net_device *ndev)
 	eth_hw_addr_set(ndev, apc->mac_addr);
 	sprintf(vport, "vport%d", port_idx);
 	apc->mana_port_debugfs = debugfs_create_dir(vport, gc->mana_pci_debugfs);
+
+	debugfs_create_u64("port_handle", 0400, apc->mana_port_debugfs,
+			   &apc->port_handle);
+	debugfs_create_u32("max_sq", 0400, apc->mana_port_debugfs,
+			   &apc->vport_max_sq);
+	debugfs_create_u32("max_rq", 0400, apc->mana_port_debugfs,
+			   &apc->vport_max_rq);
+	debugfs_create_u32("indir_table_sz", 0400, apc->mana_port_debugfs,
+			   &apc->indir_table_sz);
+	debugfs_create_u32("steer_rx", 0400, apc->mana_port_debugfs,
+			   &apc->steer_rx);
+	debugfs_create_u32("steer_rss", 0400, apc->mana_port_debugfs,
+			   &apc->steer_rss);
+	debugfs_create_u32("steer_update_tab", 0400, apc->mana_port_debugfs,
+			   &apc->steer_update_tab);
+	debugfs_create_u32("steer_cqe_coalescing", 0400, apc->mana_port_debugfs,
+			   &apc->steer_cqe_coalescing);
 	debugfs_create_u32("current_speed", 0400, apc->mana_port_debugfs,
 			   &apc->speed);
 	return 0;
@@ -3639,6 +3664,11 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
 
 	ac->bm_hostmode = bm_hostmode;
 
+	debugfs_create_u16("num_vports", 0400, gc->mana_pci_debugfs,
+			   &ac->num_ports);
+	debugfs_create_u8("bm_hostmode", 0400, gc->mana_pci_debugfs,
+			  &ac->bm_hostmode);
+
 	if (!resuming) {
 		ac->num_ports = num_ports;
 
@@ -3779,6 +3809,9 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
 
 	mana_gd_deregister_device(gd);
 
+	debugfs_lookup_and_remove("bm_hostmode", gc->mana_pci_debugfs);
+	debugfs_lookup_and_remove("num_vports", gc->mana_pci_debugfs);
+
 	if (suspending)
 		return;
 
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 7fe3a1b61b2d..c4e3ce5147f7 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -442,6 +442,7 @@ struct gdma_context {
 	struct gdma_dev		mana_ib;
 
 	u64 pf_cap_flags1;
+	u64 gdma_protocol_ver;
 
 	struct workqueue_struct *service_wq;
 
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 96d21cbbdee2..6d2e05a7368c 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -568,6 +568,14 @@ struct mana_port_context {
 
 	/* Debugfs */
 	struct dentry *mana_port_debugfs;
+
+	/* Cached vport/steering config for debugfs */
+	u32 vport_max_sq;
+	u32 vport_max_rq;
+	u32 steer_rx;
+	u32 steer_rss;
+	u32 steer_update_tab;
+	u32 steer_cqe_coalescing;
 };
 
 netdev_tx_t mana_start_xmit(struct sk_buff *skb, struct net_device *ndev);
-- 
2.34.1


^ permalink raw reply related

* RE: [PATCH] PCI: hv: Allocate MMIO from above 4GB for the config window
From: Dexuan Cui @ 2026-04-02 18:49 UTC (permalink / raw)
  To: Michael Kelley, Matthew Ruffell
  Cc: bhelgaas@google.com, Haiyang Zhang, Jake Oshins,
	kwilczynski@kernel.org, KY Srinivasan,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, Long Li, lpieralisi@kernel.org,
	mani@kernel.org, robh@kernel.org, stable@vger.kernel.org,
	wei.liu@kernel.org
In-Reply-To: <SN6PR02MB41573CD2EA6CD82A0C238F66D494A@SN6PR02MB4157.namprd02.prod.outlook.com>

> From: Michael Kelley <mhklinux@outlook.com>
> Sent: Thursday, January 22, 2026 10:39 PM
>  ...
> This is good info, and definitely a clue. So to be clear, the problem repros
> only when kexec_load() is used. With kexec_file_load(), it does not repro. Is
> that right?

Yes and no. The answer depends on the combination of the version of
kdump-tools, and the architecture (x86-64 vs. ARM64), and the hypercall
(KEXEC_LOAD vs. KEXEC_FILE_LOAD)  and the Linux kernel version (there
have been patches fixing and breaking kdump over the past several years...)

Please see the reply I posted about 2 hours ago for all the details. 

> I saw a similar distinction when working on commit 304386373007,
> though in the opposite direction!
I think this happens because you're using Ubuntu 20.04:

> https://lwn.net/ml/linux-kernel/SN6PR02MB41572155B6D139C499814EB7D4F12@SN6PR02MB4157.namprd02.prod.outlook.com/
> To further complicate matters, the kexec on Oracle Linux 9.4 seems to
> have a bug when the -c option forces the use of kexec_load() instead
> of kexec_file_load(). As an experiment, I modified the kdumpctl shell
> script to add the "-c" option to kexec, but in that case the value "0x0"
> is passed as the framebuffer address, which is wrong.

Before commit 304386373007, hyperv_fb relocates the framebuffer
MMIO base,  so KEXEC_FILE_LOAD doesn't work for you, because the kdump
kernel's screen_info.lfb_base still points to the initial MMIO, and hence
the kdump kernel's efifb driver fails to work properly; KEXEC_LOAD works
for you because the kdump-tools v2.0.18 in Ubuntu 20.04 doesn't have that
commit (see the other reply from me)

The kexec on Oracle Linux 9.4 has that commit, and it's not buggy -- unless
we'd like to claim that all the recent kdump-tools versions are buggy :-)

Thanks,
Dexuan

^ permalink raw reply

* RE: [PATCH] PCI: hv: Allocate MMIO from above 4GB for the config window
From: Dexuan Cui @ 2026-04-02 19:23 UTC (permalink / raw)
  To: Michael Kelley, Matthew Ruffell
  Cc: bhelgaas@google.com, Haiyang Zhang, Jake Oshins,
	kwilczynski@kernel.org, KY Srinivasan,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, Long Li, lpieralisi@kernel.org,
	mani@kernel.org, robh@kernel.org, stable@vger.kernel.org,
	wei.liu@kernel.org
In-Reply-To: <SN6PR02MB415759DBA9428256D379841AD494A@SN6PR02MB4157.namprd02.prod.outlook.com>

> From: Michael Kelley <mhklinux@outlook.com>
> Sent: Friday, January 23, 2026 10:28 AM
>  ...
> One more thought here: Is commit 96959283a58d relevant? The
> commit message describes a scenario where vmbus_reserve_fb()
> doesn't do anything because CONFIG_SYSFB is not set. Looking at
> the code for vmbus_reserve_fb(), it doing nothing might imply that
> screen_info.lfb_base is 0. But when CONFIG_SYSFB is not set,
> screen_info.lfb_base is just ignored, with the same result. This behavior
> started with the 6.7 kernel due to commit a07b50d80ab6.
> 
> Note that commit 96959283a58d has a follow-on to correct a
> problem when CONFIG_EFI is not set.  See commit 7b89a44b2e8c.
> If there's a reason to backport 96959283a58d, also get
> 7b89a44b2e8c.
> 
> Michael

In my opinion,
96959283a58d ("Drivers: hv: Always select CONFIG_SYSFB for Hyper-V guests")
is not a good fix for a07b50d80ab6: the commit message of a07b50d80ab6
says "the vmbus_drv code marks the original EFI framebuffer as reserved, but
this is not required if there is no sysfb" -- IMO the message is incorrect.

Even if CONFIG_SYSFB is not set, we still need to reserve the framebuffer
MMIO range, because we need to make sure that hv_pci doesn't allocate
MMIO from there.

96959283a58d adds "select SYSFB if !HYPERV_VTL_MODE", but we can
still manually unset CONFIG_SYSFB (I happened to do this when debugging
the kdump issue), and hv_pci won't work.

IMO vmbus_reserve_fb() should unconditionally reserve the frame buffer
MMIO range. I'll post a patch like this:

--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -2395,10 +2398,8 @@ static void __maybe_unused vmbus_reserve_fb(void)

        if (efi_enabled(EFI_BOOT)) {
                /* Gen2 VM: get FB base from EFI framebuffer */
-               if (IS_ENABLED(CONFIG_SYSFB)) {
-                       start = sysfb_primary_display.screen.lfb_base;
-                       size = max_t(__u32, sysfb_primary_display.screen.lfb_size, 0x800000);
-               }
+               start = sysfb_primary_display.screen.lfb_base;
+               size = max_t(__u32, sysfb_primary_display.screen.lfb_size, 0x800000);
        } else {
                /* Gen1 VM: get FB base from PCI */
                pdev = pci_get_device(PCI_VENDOR_ID_MICROSOFT,


diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
index 7937ac0cbd0f..78d7f8c66278 100644
--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -9,7 +9,6 @@ config HYPERV
        select PARAVIRT
        select X86_HV_CALLBACK_VECTOR if X86
        select OF_EARLY_FLATTREE if OF
-       select SYSFB if EFI && !HYPERV_VTL_MODE
        select IRQ_MSI_LIB if X86
        help
          Select this option to run Linux as a Hyper-V client operating

Thanks,
Dexuan

^ permalink raw reply related

* [PATCH 0/2] genirq and Hyper-V: Clean up handling of add_interrupt_randomness()
From: Michael Kelley @ 2026-04-02 20:23 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, tglx, mingo, bp,
	dave.hansen, hpa, maz, bigeasy, x86, linux-kernel, linux-hyperv

From: Michael Kelley <mhklinux@outlook.com>

The Hyper-V ISRs are calling add_interrupt_randomness() as a
primary source of entropy in VMs, since the VMs don't have another
good source. The call is currently in the ISRs as a common place to
handle both x86/x64 and arm64. On x86/x64, hypervisor interrupts come
through a custom sysvec entry, and on arm64 they come through an
emulated GICv3. GICv3 uses the generic handler handle_percpu_devid_irq(),
which does not do add_interrupt_randomness(), unlike its counterpart
handle_percpu_irq().

Cleanup this somewhat confusing situation by doing the
add_interrupt_randomness() in handle_percpu_devid_irq(), and remove
it from the Hyper-V ISRs. Then only the Hyper-V custom sysvec path,
which plays the role of a generic interrupt handler, needs to do
add_interrupt_randomness(). As a result of this change, no
device drivers are calling add_interrupt_randomness(), which is
appropriate.

The change is broken into two patches since it spans generic
interrupt handling code and Hyper-V specific code. But the two
patches should be taken together through the same tree. It's
OK to have a bisect window where add_interrupt_randomness() is
done in both places, but taking the Hyper-V patch first could leave
a bisect window where add_interrupt_randomness() is not done in
either place.

Michael Kelley (2):
  genirq/chip: Do add_interrupt_randomness() in
    handle_percpu_devid_irq()
  Drivers: hv: Move add_interrupt_randomness() to hypervisor callback
    sysvec

 arch/x86/kernel/cpu/mshyperv.c | 2 ++
 drivers/hv/mshv_synic.c        | 3 ---
 drivers/hv/vmbus_drv.c         | 3 ---
 kernel/irq/chip.c              | 3 +++
 4 files changed, 5 insertions(+), 6 deletions(-)

-- 
2.25.1

^ permalink raw reply

* [PATCH 1/2] genirq/chip: Do add_interrupt_randomness() in handle_percpu_devid_irq()
From: Michael Kelley @ 2026-04-02 20:23 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, tglx, mingo, bp,
	dave.hansen, hpa, maz, bigeasy, x86, linux-kernel, linux-hyperv
In-Reply-To: <20260402202400.1707-1-mhklkml@zohomail.com>

From: Michael Kelley <mhklinux@outlook.com>

handle_percpu_devid_irq() is a version of handle_percpu_irq() but with
the addition of a pointer to a per-cpu devid. However, handle_percpu_irq()
does add_interrupt_randomness(), while handle_percpu_devid_irq() currently
does not. Add the missing add_interrupt_randomness(), as it is needed
when per-cpu interrupts with devid's are used in VMs for interrupts
from the hypervisor.

Signed-off-by: Michael Kelley <mhklinux@outlook.com>
---
 kernel/irq/chip.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index 6147a07d0127..6c9b1dc4e7d4 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -14,6 +14,7 @@
 #include <linux/interrupt.h>
 #include <linux/kernel_stat.h>
 #include <linux/irqdomain.h>
+#include <linux/random.h>
 
 #include <trace/events/irq.h>
 
@@ -929,6 +930,8 @@ void handle_percpu_devid_irq(struct irq_desc *desc)
 			    enabled ? " and unmasked" : "", irq, cpu);
 	}
 
+	add_interrupt_randomness(irq);
+
 	if (chip->irq_eoi)
 		chip->irq_eoi(&desc->irq_data);
 }
-- 
2.25.1


^ permalink raw reply related

* [PATCH 2/2] Drivers: hv: Move add_interrupt_randomness() to hypervisor callback sysvec
From: Michael Kelley @ 2026-04-02 20:24 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, tglx, mingo, bp,
	dave.hansen, hpa, maz, bigeasy, x86, linux-kernel, linux-hyperv
In-Reply-To: <20260402202400.1707-1-mhklkml@zohomail.com>

From: Michael Kelley <mhklinux@outlook.com>

The Hyper-V ISRs, for normal guests and when running in the
hypervisor root patition, are calling add_interrupt_randomness() as a
primary source of entropy. The call is currently in the ISRs as a common
place to handle both x86/x64 and arm64. On x86/x64, hypervisor interrupts
come through a custom sysvec entry, and do not go through a generic
interrupt handler. On arm64, hypervisor interrupts come through an
emulated GICv3. GICv3 uses the generic handler handle_percpu_devid_irq(),
which does not do add_interrupt_randomness() -- unlike its counterpart
handle_percpu_irq(). But handle_percpu_devid_irq() is now updated to do
the add_interrupt_randomness(). So add_interrupt_randomness() is now
needed only in Hyper-V's x86/x64 custom sysvec path.

Move add_interrupt_randomness() from the Hyper-V ISRs into the Hyper-V
x86/x64 custom sysvec path, matching the existing STIMER0 sysvec path.
With this change, add_interrupt_randomness() is no longer called from any
device drivers, which is appropriate.

Signed-off-by: Michael Kelley <mhklinux@outlook.com>
---
 arch/x86/kernel/cpu/mshyperv.c | 2 ++
 drivers/hv/mshv_synic.c        | 3 ---
 drivers/hv/vmbus_drv.c         | 3 ---
 3 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 9befdc557d9e..a7dfc29d3470 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -161,6 +161,8 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_hyperv_callback)
 	if (vmbus_handler)
 		vmbus_handler();
 
+	add_interrupt_randomness(HYPERVISOR_CALLBACK_VECTOR);
+
 	if (ms_hyperv.hints & HV_DEPRECATING_AEOI_RECOMMENDED)
 		apic_eoi();
 
diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
index 43f1bcbbf2d3..e2288a726fec 100644
--- a/drivers/hv/mshv_synic.c
+++ b/drivers/hv/mshv_synic.c
@@ -12,7 +12,6 @@
 #include <linux/mm.h>
 #include <linux/interrupt.h>
 #include <linux/io.h>
-#include <linux/random.h>
 #include <linux/cpuhotplug.h>
 #include <linux/reboot.h>
 #include <asm/mshyperv.h>
@@ -445,8 +444,6 @@ void mshv_isr(void)
 		mb();
 		if (msg->header.message_flags.msg_pending)
 			hv_set_non_nested_msr(HV_MSR_EOM, 0);
-
-		add_interrupt_randomness(mshv_sint_vector);
 	} else {
 		pr_warn_once("%s: unknown message type 0x%x\n", __func__,
 			     msg->header.message_type);
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 1d71c28fadba..3faa74e49a6b 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -32,7 +32,6 @@
 #include <linux/ptrace.h>
 #include <linux/sysfb.h>
 #include <linux/efi.h>
-#include <linux/random.h>
 #include <linux/kernel.h>
 #include <linux/syscore_ops.h>
 #include <linux/dma-map-ops.h>
@@ -1356,8 +1355,6 @@ static void __vmbus_isr(void)
 
 	vmbus_message_sched(hv_cpu, hv_cpu->hyp_synic_message_page);
 	vmbus_message_sched(hv_cpu, hv_cpu->para_synic_message_page);
-
-	add_interrupt_randomness(vmbus_interrupt);
 }
 
 static DEFINE_PER_CPU(bool, vmbus_irq_pending);
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH 2/2] Drivers: hv: Move add_interrupt_randomness() to hypervisor callback sysvec
From: Thomas Gleixner @ 2026-04-02 21:07 UTC (permalink / raw)
  To: mhklinux, kys, haiyangz, wei.liu, decui, longli, mingo, bp,
	dave.hansen, hpa, maz, bigeasy, x86, linux-kernel, linux-hyperv
In-Reply-To: <20260402202400.1707-3-mhklkml@zohomail.com>

On Thu, Apr 02 2026 at 13:24, Michael Kelley wrote:
> From: Michael Kelley <mhklinux@outlook.com>
>
> The Hyper-V ISRs, for normal guests and when running in the
> hypervisor root patition, are calling add_interrupt_randomness() as a
> primary source of entropy. The call is currently in the ISRs as a common
> place to handle both x86/x64 and arm64. On x86/x64, hypervisor interrupts
> come through a custom sysvec entry, and do not go through a generic
> interrupt handler. On arm64, hypervisor interrupts come through an
> emulated GICv3. GICv3 uses the generic handler handle_percpu_devid_irq(),
> which does not do add_interrupt_randomness() -- unlike its counterpart
> handle_percpu_irq(). But handle_percpu_devid_irq() is now updated to do
> the add_interrupt_randomness(). So add_interrupt_randomness() is now
> needed only in Hyper-V's x86/x64 custom sysvec path.
>
> Move add_interrupt_randomness() from the Hyper-V ISRs into the Hyper-V
> x86/x64 custom sysvec path, matching the existing STIMER0 sysvec path.
> With this change, add_interrupt_randomness() is no longer called from any
> device drivers, which is appropriate.
>
> Signed-off-by: Michael Kelley <mhklinux@outlook.com>

Acked-by: Thomas Gleixner <tglx@kernel.org>

^ permalink raw reply

* Re: [PATCH 4/6] mshv: limit SynIC management to MSHV-owned resources
From: Jork Loeser @ 2026-04-02 21:21 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: linux-hyperv, x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H . Peter Anvin, Arnd Bergmann,
	Roman Kisel, Michael Kelley, linux-kernel, linux-arch
In-Reply-To: <20260402-sturdy-chirpy-cat-78baeb@anirudhrb>

On Thu, 2 Apr 2026, Anirudh Rayabharam wrote:

> Maybe it's time to extract all the synic management stuff to a separate
> file to act like synic "driver" which facilitates synic access to
> mutiple users i.e. mshv & vmbus.

Yes, such a refactor maybe warranted. This series is about getting 
kexec to work on L1VH, and so such refactor would be ill-placed here.

Best,
Jork

^ permalink raw reply

* Re: [PATCH v2] Drivers: hv: mshv: fix integer overflow in memory region overlap check
From: Stanislav Kinsburskii @ 2026-04-02 23:25 UTC (permalink / raw)
  To: Junrui Luo
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Nuno Das Neves, Anirudh Rayabharam, Mukesh Rathor, Muminul Islam,
	Praveen K Paladugu, Jinank Jain, linux-hyperv, linux-kernel,
	Yuhao Jiang, Roman Kisel, stable
In-Reply-To: <SYBPR01MB788138A30BC69B0F5C3316E5AF54A@SYBPR01MB7881.ausprd01.prod.outlook.com>

On Sat, Mar 28, 2026 at 05:18:45PM +0800, Junrui Luo wrote:
> mshv_partition_create_region() computes mem->guest_pfn + nr_pages to
> check for overlapping regions without verifying u64 wraparound. A
> sufficiently large guest_pfn can cause the addition to overflow,
> bypassing the overlap check and allowing creation of regions that wrap
> around the address space.
> 
> Fix by using check_add_overflow() to reject such regions early, and
> validate that the region end does not exceed MAX_PHYSMEM_BITS. These
> checks also protect downstream callers that compute start_gfn +
> nr_pages on stored regions without overflow guards.
> 
> Fixes: 621191d709b1 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Reported-by: Yuhao Jiang <danisjiang@gmail.com>
> Suggested-by: Roman Kisel <romank@linux.microsoft.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Junrui Luo <moonafterrain@outlook.com>
> ---
> Changes in v2:
> - Add a maximum check suggested by Roman Kisel
> - Link to v1: https://lore.kernel.org/all/SYBPR01MB7881689C0F58149DD986A6D1AF49A@SYBPR01MB7881.ausprd01.prod.outlook.com/
> ---
>  drivers/hv/mshv_root_main.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index 6f42423f7faa..32826247dbce 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -1174,11 +1174,20 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
>  {
>  	struct mshv_mem_region *rg;
>  	u64 nr_pages = HVPFN_DOWN(mem->size);
> +	u64 new_region_end;
> +
> +	/* Reject regions whose end address would wrap around */
> +	if (check_add_overflow(mem->guest_pfn, nr_pages, &new_region_end))
> +		return -EOVERFLOW;
> +
> +	/* Reject regions beyond the maximum physical address */

nit: both comments are redundant - the meaning is clear from the code
itself.

> +	if (new_region_end > HVPFN_DOWN(1ULL << MAX_PHYSMEM_BITS))

This maximum value check bugs me a bit.

First of all, why does it matter what is the region end? Potentially, there can be
regions not backed by host address space (leave alone host RAM), so why
intropducing this limitation?

Second, this check takes a host-specific constant (MAX_PHYSMEM_BITS) and rounds it down
to hypervisor-specific units which may not be aligned with the host page
size. Should this be host pages instead?

Thanks,
Stanislav

> +		return -EINVAL;
>  
>  	/* Reject overlapping regions */
>  	spin_lock(&partition->pt_mem_regions_lock);
>  	hlist_for_each_entry(rg, &partition->pt_mem_regions, hnode) {
> -		if (mem->guest_pfn + nr_pages <= rg->start_gfn ||
> +		if (new_region_end <= rg->start_gfn ||
>  		    rg->start_gfn + rg->nr_pages <= mem->guest_pfn)
>  			continue;
>  		spin_unlock(&partition->pt_mem_regions_lock);
> 
> ---
> base-commit: c369299895a591d96745d6492d4888259b004a9e
> change-id: 20260328-fixes-0296eb3dbb52
> 
> Best regards,
> -- 
> Junrui Luo <moonafterrain@outlook.com>

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox