Linux-HyperV List
* Re: [PATCH net-next 2/9] net: atlantic: convert to use .get_rx_ring_count
From: Breno Leitao @ 2026-01-22 10:43 UTC
  To: Creeley, Brett
  Cc: Ajit Khaparde, Sriharsha Basavapatna, Somnath Kotur, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Igor Russkikh, Simon Horman, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Alexander Duyck, kernel-team,
	Edward Cree, Brett Creeley, netdev, linux-kernel, oss-drivers,
	linux-hyperv, linux-net-drivers
In-Reply-To: <4daf0901-20a6-4c9b-9b56-32efef24e5e5@amd.com>

Hello Brett,

On Wed, Jan 21, 2026 at 08:49:23AM -0800, Creeley, Brett wrote:
> On 1/21/2026 7:54 AM, Breno Leitao wrote:
> > 
> > Use the newly introduced .get_rx_ring_count ethtool ops callback instead
> > of handling ETHTOOL_GRXRINGS directly in .get_rxnfc().
> > 
> > Signed-off-by: Breno Leitao <leitao@debian.org>
> > ---
> >   drivers/net/ethernet/aquantia/atlantic/aq_ethtool.c | 15 +++++++++------
> >   1 file changed, 9 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_ethtool.c b/drivers/net/ethernet/aquantia/atlantic/aq_ethtool.c
> > index 6fef47ba0a59b..d8b5491c9cb2b 100644
> > --- a/drivers/net/ethernet/aquantia/atlantic/aq_ethtool.c
> > +++ b/drivers/net/ethernet/aquantia/atlantic/aq_ethtool.c
> > @@ -500,20 +500,22 @@ static int aq_ethtool_set_rss(struct net_device *netdev,
> >          return err;
> >   }
> > 
> > +static u32 aq_ethtool_get_rx_ring_count(struct net_device *ndev)
> > +{
> > +       struct aq_nic_s *aq_nic = netdev_priv(ndev);
> > +       struct aq_nic_cfg_s *cfg = aq_nic_get_cfg(aq_nic);
> > +
> > +       return cfg->vecs;
> > +}
> > +
> 
> Tiny nit, but RCT ordering is not maintained.

Damn, I will re-spin. Thanks for catching it.
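
(For anyone following along: RCT is "reverse Christmas tree", the netdev
convention of ordering local variable declarations from the longest line to
the shortest. Below is a minimal sketch of one way the function could be
reordered, splitting the dependent initialization out of the declaration.
The struct definitions and the two helper stubs are simplified stand-ins so
the snippet compiles on its own; they are not the real driver or kernel
definitions, and the actual respin may look different.)

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins, NOT the real kernel/driver definitions. */
struct aq_nic_cfg_s { uint32_t vecs; };
struct aq_nic_s { struct aq_nic_cfg_s cfg; };
struct net_device { struct aq_nic_s *priv; };

/* Stub with the same name as the kernel helper, for illustration only. */
static struct aq_nic_s *netdev_priv(struct net_device *ndev)
{
	assert(ndev != NULL);
	return ndev->priv;
}

/* Stub with the same name as the driver helper, for illustration only. */
static struct aq_nic_cfg_s *aq_nic_get_cfg(struct aq_nic_s *nic)
{
	return &nic->cfg;
}

static uint32_t aq_ethtool_get_rx_ring_count(struct net_device *ndev)
{
	/* Longest declaration first, shortest last. */
	struct aq_nic_s *aq_nic = netdev_priv(ndev);
	struct aq_nic_cfg_s *cfg;

	/* cfg depends on aq_nic, so it is assigned after the declarations. */
	cfg = aq_nic_get_cfg(aq_nic);

	return cfg->vecs;
}
```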

--breno
--
pw-bot: cr


* RE: [PATCH] PCI: hv: Allocate MMIO from above 4GB for the config window
From: Michael Kelley @ 2026-01-22  7:10 UTC
  To: Dexuan Cui, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, longli@microsoft.com, lpieralisi@kernel.org,
	kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org,
	bhelgaas@google.com, jakeo@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-pci@vger.kernel.org,
	linux-kernel@vger.kernel.org
  Cc: stable@vger.kernel.org
In-Reply-To: <20260122020337.94967-1-decui@microsoft.com>

From: Dexuan Cui <decui@microsoft.com> Sent: Wednesday, January 21, 2026 6:04 PM
> 
> There has been a longstanding MMIO conflict between the pci_hyperv
> driver's config_window (see hv_allocate_config_window()) and the
> hyperv_drm (or hyperv_fb) driver (see hyperv_setup_vram()): typically
> both get MMIO from the low MMIO range below 4GB; this is not an issue
> in the normal kernel since the VMBus driver reserves the framebuffer
> MMIO in vmbus_reserve_fb(), so the drm driver's hyperv_setup_vram() can
> always get the reserved framebuffer MMIO; however, a Gen2 VM's kdump
> kernel fails to reserve the framebuffer MMIO in vmbus_reserve_fb() because
> the screen_info.lfb_base is zero in the kdump kernel: the screen_info
> is not initialized at all in the kdump kernel, because the EFI stub
> code, which initializes screen_info, doesn't run in the case of kdump.

I don't think this is correct. Yes, the EFI stub doesn't run, but screen_info
should be initialized in the kdump kernel by the code that loads the
kdump kernel into the reserved crash memory. See discussion in the commit
message for commit 304386373007.

I wonder if commit a41e0ab394e4 broke the initialization of screen_info
in the kdump kernel. Or perhaps there is now a rev-lock between the kernel
with this commit and a new version of the user space kexec command.

There's a parameter to the kexec command that governs whether it
uses the kexec_file_load() system call or the kexec_load() system call.
I wonder if that parameter makes a difference in the problem described
for this patch.

I can't immediately remember if, when I was working on commit
304386373007, I tested kdump in a Gen 2 VM with an NVMe OS disk to
ensure that MMIO space was properly allocated to the frame buffer
driver (either hyperv_fb or hyperv_drm). I'm thinking I did, but tomorrow
I'll check for any definitive notes on that.

Michael

> 
> When vmbus_reserve_fb() fails to reserve the framebuffer MMIO in the
> kdump kernel, if pci_hyperv in the kdump kernel loads before hyperv_drm
> loads, pci_hyperv's vmbus_allocate_mmio() gets the framebuffer MMIO
> and tries to use it, but since the host thinks that the MMIO range is
> still in use by hyperv_drm, the host refuses to accept the MMIO range
> as the config window, and pci_hyperv's hv_pci_enter_d0() errors out:
> "PCI Pass-through VSP failed D0 Entry with status c0370048".
> 
> This PCI error in the kdump kernel was not fatal in the past because
> the kdump kernel normally doesn't rely on pci_hyperv, and the root
> file system is on a VMBus SCSI device.
> 
> Now, a VM on Azure can boot from NVMe, i.e. the root FS can be on an
> NVMe device, which depends on pci_hyperv. When the PCI error occurs,
> the kdump kernel fails to boot up since no root FS is detected.
> 
> Fix the MMIO conflict by allocating MMIO above 4GB for the
> config_window.
> 
> Note: we still need to figure out how to address the possible MMIO
> conflict between hyperv_drm and pci_hyperv in the case of 32-bit PCI
> MMIO BARs, but that's of low priority because all PCI devices available
> to a Linux VM on Azure should use 64-bit BARs and should not use 32-bit
> BARs -- I checked Mellanox VFs, MANA VFs, NVMe devices, and GPUs in
> Linux VMs on Azure, and found no 32-bit BARs.
> 
> Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft Hyper-V VMs")
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> Cc: stable@vger.kernel.org
> ---
>  drivers/pci/controller/pci-hyperv.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index 1e237d3538f9..a6aecb1b5cab 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -3406,9 +3406,13 @@ static int hv_allocate_config_window(struct hv_pcibus_device *hbus)
> 
>  	/*
>  	 * Set up a region of MMIO space to use for accessing configuration
> -	 * space.
> +	 * space. Use the high MMIO range to not conflict with the hyperv_drm
> +	 * driver (which normally gets MMIO from the low MMIO range) in the
> +	 * kdump kernel of a Gen2 VM, which fails to reserve the framebuffer
> +	 * MMIO range in vmbus_reserve_fb() due to screen_info.lfb_base being
> +	 * zero in the kdump kernel.
>  	 */
> -	ret = vmbus_allocate_mmio(&hbus->mem_config, hbus->hdev, 0, -1,
> +	ret = vmbus_allocate_mmio(&hbus->mem_config, hbus->hdev, SZ_4G, -1,
>  				  PCI_CONFIG_MMIO_LENGTH, 0x1000, false);
>  	if (ret)
>  		return ret;
> --
> 2.43.0



* Re: [PATCH net-next 4/9] net: mana: convert to use .get_rx_ring_count
From: Subbaraya Sundeep @ 2026-01-22  7:06 UTC
  To: Breno Leitao
  Cc: Ajit Khaparde, Sriharsha Basavapatna, Somnath Kotur, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Igor Russkikh, Simon Horman, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Alexander Duyck, kernel-team,
	Edward Cree, Brett Creeley, netdev, linux-kernel, oss-drivers,
	linux-hyperv, linux-net-drivers
In-Reply-To: <20260121-grxring_big_v4-v1-4-07655be56bcf@debian.org>

On 2026-01-21 at 21:24:41, Breno Leitao (leitao@debian.org) wrote:
> Use the newly introduced .get_rx_ring_count ethtool ops callback instead
> of handling ETHTOOL_GRXRINGS directly in .get_rxnfc().
> 
> Since ETHTOOL_GRXRINGS was the only command handled by mana_get_rxnfc(),
> remove the function entirely.
> 
> Signed-off-by: Breno Leitao <leitao@debian.org>

Reviewed-by: Subbaraya Sundeep <sbhatta@marvell.com>

Thanks,
Sundeep
> ---
>  drivers/net/ethernet/microsoft/mana/mana_ethtool.c | 13 +++----------
>  1 file changed, 3 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> index 0e2f4343ac67f..f2d220b371b5d 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> @@ -282,18 +282,11 @@ static void mana_get_ethtool_stats(struct net_device *ndev,
>  	}
>  }
>  
> -static int mana_get_rxnfc(struct net_device *ndev, struct ethtool_rxnfc *cmd,
> -			  u32 *rules)
> +static u32 mana_get_rx_ring_count(struct net_device *ndev)
>  {
>  	struct mana_port_context *apc = netdev_priv(ndev);
>  
> -	switch (cmd->cmd) {
> -	case ETHTOOL_GRXRINGS:
> -		cmd->data = apc->num_queues;
> -		return 0;
> -	}
> -
> -	return -EOPNOTSUPP;
> +	return apc->num_queues;
>  }
>  
>  static u32 mana_get_rxfh_key_size(struct net_device *ndev)
> @@ -520,7 +513,7 @@ const struct ethtool_ops mana_ethtool_ops = {
>  	.get_ethtool_stats	= mana_get_ethtool_stats,
>  	.get_sset_count		= mana_get_sset_count,
>  	.get_strings		= mana_get_strings,
> -	.get_rxnfc		= mana_get_rxnfc,
> +	.get_rx_ring_count	= mana_get_rx_ring_count,
>  	.get_rxfh_key_size	= mana_get_rxfh_key_size,
>  	.get_rxfh_indir_size	= mana_rss_indir_size,
>  	.get_rxfh		= mana_get_rxfh,
> 
> -- 
> 2.47.3
> 


* Re: [PATCH net-next 3/9] net: nfp: convert to use .get_rx_ring_count
From: Subbaraya Sundeep @ 2026-01-22  7:03 UTC
  To: Breno Leitao
  Cc: Ajit Khaparde, Sriharsha Basavapatna, Somnath Kotur, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Igor Russkikh, Simon Horman, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Alexander Duyck, kernel-team,
	Edward Cree, Brett Creeley, netdev, linux-kernel, oss-drivers,
	linux-hyperv, linux-net-drivers
In-Reply-To: <20260121-grxring_big_v4-v1-3-07655be56bcf@debian.org>

On 2026-01-21 at 21:24:40, Breno Leitao (leitao@debian.org) wrote:
> Use the newly introduced .get_rx_ring_count ethtool ops callback instead
> of handling ETHTOOL_GRXRINGS directly in .get_rxnfc().
> 
> Signed-off-by: Breno Leitao <leitao@debian.org>

Reviewed-by: Subbaraya Sundeep <sbhatta@marvell.com>

Thanks,
Sundeep
> ---
>  drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c b/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c
> index 16c828dd5c1a3..e88b1c4732a57 100644
> --- a/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c
> +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c
> @@ -1435,15 +1435,19 @@ static int nfp_net_get_fs_loc(struct nfp_net *nn, u32 *rule_locs)
>  	return 0;
>  }
>  
> +static u32 nfp_net_get_rx_ring_count(struct net_device *netdev)
> +{
> +	struct nfp_net *nn = netdev_priv(netdev);
> +
> +	return nn->dp.num_rx_rings;
> +}
> +
>  static int nfp_net_get_rxnfc(struct net_device *netdev,
>  			     struct ethtool_rxnfc *cmd, u32 *rule_locs)
>  {
>  	struct nfp_net *nn = netdev_priv(netdev);
>  
>  	switch (cmd->cmd) {
> -	case ETHTOOL_GRXRINGS:
> -		cmd->data = nn->dp.num_rx_rings;
> -		return 0;
>  	case ETHTOOL_GRXCLSRLCNT:
>  		cmd->rule_cnt = nn->fs.count;
>  		return 0;
> @@ -2501,6 +2505,7 @@ static const struct ethtool_ops nfp_net_ethtool_ops = {
>  	.get_sset_count		= nfp_net_get_sset_count,
>  	.get_rxnfc		= nfp_net_get_rxnfc,
>  	.set_rxnfc		= nfp_net_set_rxnfc,
> +	.get_rx_ring_count	= nfp_net_get_rx_ring_count,
>  	.get_rxfh_indir_size	= nfp_net_get_rxfh_indir_size,
>  	.get_rxfh_key_size	= nfp_net_get_rxfh_key_size,
>  	.get_rxfh		= nfp_net_get_rxfh,
> 
> -- 
> 2.47.3
> 


* Re: [GIT PULL] Hyper-V fixes for v6.19-rc7
From: pr-tracker-bot @ 2026-01-22  5:59 UTC
  To: Wei Liu
  Cc: Linus Torvalds, Wei Liu, Linux on Hyper-V List, Linux Kernel List,
	kys, haiyangz, decui, longli
In-Reply-To: <20260122050548.GA909211@liuwe-devbox-debian-v2.local>

The pull request you sent on Thu, 22 Jan 2026 05:05:48 +0000:

> ssh://git@gitolite.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git tags/hyperv-fixes-signed-20260121

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/a66191c590b3b58eaff05d2277971f854772bd5b

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html


* Re: [PATCH v0 12/15] x86/hyperv: Implement hyperv virtual iommu
From: Jacob Pan @ 2026-01-22  5:18 UTC
  To: Mukesh R
  Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
	linux-arch, kys, haiyangz, wei.liu, decui, longli,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
	lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
	mhklinux, Jacob Pan
In-Reply-To: <20260120064230.3602565-13-mrathor@linux.microsoft.com>

Hi Mukesh,

On Mon, 19 Jan 2026 22:42:27 -0800
Mukesh R <mrathor@linux.microsoft.com> wrote:

> From: Mukesh Rathor <mrathor@linux.microsoft.com>
> 
> Add a new file to implement management of device domains, mapping and
> unmapping of iommu memory, and other iommu_ops to fit within the VFIO
> framework for PCI passthru on Hyper-V running Linux as root or L1VH
> parent. This also implements direct attach mechanism for PCI passthru,
> and it is also made to work within the VFIO framework.
> 
> At a high level, during boot the hypervisor creates a default identity
> domain and attaches all devices to it. This nicely maps to Linux iommu
> subsystem IOMMU_DOMAIN_IDENTITY domain. As a result, Linux does not
> need to explicitly ask Hyper-V to attach devices and do maps/unmaps
> during boot. As mentioned previously, Hyper-V supports two ways to do
> PCI passthru:
> 
>   1. Device Domain: root must create a device domain in the
> hypervisor, and do map/unmap hypercalls for mapping and unmapping
> guest RAM. All hypervisor communications use device id of type PCI for
>      identifying and referencing the device.
> 
>   2. Direct Attach: the hypervisor will simply use the guest's HW
>      page table for mappings, thus the host need not do map/unmap
>      device memory hypercalls. As such, direct attach passthru setup
>      during guest boot is extremely fast. A direct attached device
>      must be referenced via logical device id and not via the PCI
>      device id.
> 
> At present, L1VH root/parent only supports direct attaches. Also
> direct attach is default in non-L1VH cases because there are some
> significant performance issues with device domain implementation
> currently for guests with higher RAM (say more than 8GB), and that
> unfortunately cannot be addressed in the short term.
> 
> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> ---
>  MAINTAINERS                     |   1 +
>  arch/x86/include/asm/mshyperv.h |   7 +-
>  arch/x86/kernel/pci-dma.c       |   2 +
>  drivers/iommu/Makefile          |   2 +-
>  drivers/iommu/hyperv-iommu.c    | 876 ++++++++++++++++++++++++++++++++
>  include/linux/hyperv.h          |   6 +
>  6 files changed, 890 insertions(+), 4 deletions(-)
>  create mode 100644 drivers/iommu/hyperv-iommu.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 381a0e086382..63160cee942c 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -11741,6 +11741,7 @@ F:	drivers/hid/hid-hyperv.c
>  F:	drivers/hv/
>  F:	drivers/infiniband/hw/mana/
>  F:	drivers/input/serio/hyperv-keyboard.c
> +F:	drivers/iommu/hyperv-iommu.c
Given we are also developing a guest iommu driver on hyperv, I think it
is more clear to name them accordingly. Perhaps, hyperv-iommu-root.c?

>  F:	drivers/iommu/hyperv-irq.c
>  F:	drivers/net/ethernet/microsoft/
>  F:	drivers/net/hyperv/
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index 97477c5a8487..e4ccdbbf1d12 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -189,16 +189,17 @@ static inline void hv_apic_init(void) {}
>  #endif
>  
>  #if IS_ENABLED(CONFIG_HYPERV_IOMMU)
> -static inline bool hv_pcidev_is_attached_dev(struct pci_dev *pdev)
> -{ return false; }       /* temporary */
> +bool hv_pcidev_is_attached_dev(struct pci_dev *pdev);
>  u64 hv_build_devid_oftype(struct pci_dev *pdev, enum hv_device_type type);
> +u64 hv_iommu_get_curr_partid(void);
>  #else	/* CONFIG_HYPERV_IOMMU */
>  static inline bool hv_pcidev_is_attached_dev(struct pci_dev *pdev)
>  { return false; }
> -
>  static inline u64 hv_build_devid_oftype(struct pci_dev *pdev,
>  				       enum hv_device_type type)
>  { return 0; }
> +static inline u64 hv_iommu_get_curr_partid(void)
> +{ return HV_PARTITION_ID_INVALID; }
>  
>  #endif	/* CONFIG_HYPERV_IOMMU */
>  
> diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
> index 6267363e0189..cfeee6505e17 100644
> --- a/arch/x86/kernel/pci-dma.c
> +++ b/arch/x86/kernel/pci-dma.c
> @@ -8,6 +8,7 @@
>  #include <linux/gfp.h>
>  #include <linux/pci.h>
>  #include <linux/amd-iommu.h>
> +#include <linux/hyperv.h>
>  
>  #include <asm/proto.h>
>  #include <asm/dma.h>
> @@ -105,6 +106,7 @@ void __init pci_iommu_alloc(void)
>  	gart_iommu_hole_init();
>  	amd_iommu_detect();
>  	detect_intel_iommu();
> +	hv_iommu_detect();
Will this driver be x86 only?

>  	swiotlb_init(x86_swiotlb_enable, x86_swiotlb_flags);
>  }
>  
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index 598c39558e7d..cc9774864b00 100644
> --- a/drivers/iommu/Makefile
> +++ b/drivers/iommu/Makefile
> @@ -30,7 +30,7 @@ obj-$(CONFIG_TEGRA_IOMMU_SMMU) += tegra-smmu.o
>  obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o
>  obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
>  obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
> -obj-$(CONFIG_HYPERV_IOMMU) += hyperv-irq.o
> +obj-$(CONFIG_HYPERV_IOMMU) += hyperv-irq.o hyperv-iommu.o
DMA and IRQ remapping should be separate

>  obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
>  obj-$(CONFIG_IOMMU_SVA) += iommu-sva.o
>  obj-$(CONFIG_IOMMU_IOPF) += io-pgfault.o
> diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c
> new file mode 100644
> index 000000000000..548483fec6b1
> --- /dev/null
> +++ b/drivers/iommu/hyperv-iommu.c
> @@ -0,0 +1,876 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Hyper-V root vIOMMU driver.
> + * Copyright (C) 2026, Microsoft, Inc.
> + */
> +
> +#include <linux/module.h>
I don't think this is needed since this driver cannot be a module

> +#include <linux/pci.h>
> +#include <linux/dmar.h>
should not depend on Intel's DMAR

> +#include <linux/dma-map-ops.h>
> +#include <linux/interval_tree.h>
> +#include <linux/hyperv.h>
> +#include "dma-iommu.h"
> +#include <asm/iommu.h>
> +#include <asm/mshyperv.h>
> +
> +/* We will not claim these PCI devices, eg hypervisor needs it for debugger */
> +static char *pci_devs_to_skip;
> +static int __init hv_iommu_setup_skip(char *str)
> +{
> +	pci_devs_to_skip = str;
> +
> +	return 0;
> +}
> +/* hv_iommu_skip=(SSSS:BB:DD.F)(SSSS:BB:DD.F) */
> +__setup("hv_iommu_skip=", hv_iommu_setup_skip);
> +
> +bool hv_no_attdev;	 /* disable direct device attach for passthru */
> +EXPORT_SYMBOL_GPL(hv_no_attdev);
> +static int __init setup_hv_no_attdev(char *str)
> +{
> +	hv_no_attdev = true;
> +	return 0;
> +}
> +__setup("hv_no_attdev", setup_hv_no_attdev);
> +
> +/* Iommu device that we export to the world. HyperV supports max of one */
> +static struct iommu_device hv_virt_iommu;
> +
> +struct hv_domain {
> +	struct iommu_domain iommu_dom;
> +	u32 domid_num;			      /* as opposed to domain_id.type */
> +	u32 num_attchd;		      /* number of currently attached devices */
rename to num_dev_attached?

> +	bool attached_dom;		      /* is this direct attached dom? */
> +	spinlock_t mappings_lock;	      /* protects mappings_tree */
> +	struct rb_root_cached mappings_tree;  /* iova to pa lookup tree */
> +};
> +
> +#define to_hv_domain(d) container_of(d, struct hv_domain, iommu_dom)
> +
> +struct hv_iommu_mapping {
> +	phys_addr_t paddr;
> +	struct interval_tree_node iova;
> +	u32 flags;
> +};
> +
> +/*
> + * By default, during boot the hypervisor creates one Stage 2 (S2) default
> + * domain. Stage 2 means that the page table is controlled by the hypervisor.
> + *   S2 default: access to entire root partition memory. This for us easily
> + *		 maps to IOMMU_DOMAIN_IDENTITY in the iommu subsystem, and
> + *		 is called HV_DEVICE_DOMAIN_ID_S2_DEFAULT in the hypervisor.
> + *
> + * Device Management:
> + *   There are two ways to manage device attaches to domains:
> + *     1. Domain Attach: A device domain is created in the hypervisor, the
> + *			 device is attached to this domain, and then memory
> + *			 ranges are mapped in the map callbacks.
> + *     2. Direct Attach: No need to create a domain in the hypervisor for direct
> + *			 attached devices. A hypercall is made to tell the
> + *			 hypervisor to attach the device to a guest. There is
> + *			 no need for explicit memory mappings because the
> + *			 hypervisor will just use the guest HW page table.
> + *
> + * Since a direct attach is much faster, it is the default. This can be
> + * changed via hv_no_attdev.
> + *
> + * L1VH: hypervisor only supports direct attach.
> + */
> +
> +/*
> + * Create dummy domain to correspond to hypervisor prebuilt default identity
> + * domain (dummy because we do not make hypercall to create them).
> + */
> +static struct hv_domain hv_def_identity_dom;
> +
> +static bool hv_special_domain(struct hv_domain *hvdom)
> +{
> +	return hvdom == &hv_def_identity_dom;
> +}
> +
> +struct iommu_domain_geometry default_geometry = (struct iommu_domain_geometry) {
> +	.aperture_start = 0,
> +	.aperture_end = -1UL,
> +	.force_aperture = true,
> +};
> +
> +/*
> + * Since the relevant hypercalls can only fit less than 512 PFNs in the pfn
> + * array, report 1M max.
> + */
> +#define HV_IOMMU_PGSIZES (SZ_4K | SZ_1M)
> +
> +static u32 unique_id;	      /* unique numeric id of a new domain */
> +
> +static void hv_iommu_detach_dev(struct iommu_domain *immdom,
> +				struct device *dev);
> +static size_t hv_iommu_unmap_pages(struct iommu_domain *immdom, ulong iova,
> +				   size_t pgsize, size_t pgcount,
> +				   struct iommu_iotlb_gather *gather);
> +
> +/*
> + * If the current thread is a VMM thread, return the partition id of the VM it
> + * is managing, else return HV_PARTITION_ID_INVALID.
> + */
> +u64 hv_iommu_get_curr_partid(void)
> +{
> +	u64 (*fn)(pid_t pid);
> +	u64 partid;
> +
> +	fn = symbol_get(mshv_pid_to_partid);
> +	if (!fn)
> +		return HV_PARTITION_ID_INVALID;
> +
> +	partid = fn(current->tgid);
> +	symbol_put(mshv_pid_to_partid);
> +
> +	return partid;
> +}
This function is not iommu specific. Maybe move it to mshv code?

> +
> +/* If this is a VMM thread, then this domain is for a guest VM */
> +static bool hv_curr_thread_is_vmm(void)
> +{
> +	return hv_iommu_get_curr_partid() != HV_PARTITION_ID_INVALID;
> +}
> +
> +static bool hv_iommu_capable(struct device *dev, enum iommu_cap cap)
> +{
> +	switch (cap) {
> +	case IOMMU_CAP_CACHE_COHERENCY:
> +		return true;
> +	default:
> +		return false;
> +	}
> +	return false;
> +}
> +
> +/*
> + * Check if given pci device is a direct attached device. Caller must have
> + * verified pdev is a valid pci device.
> + */
> +bool hv_pcidev_is_attached_dev(struct pci_dev *pdev)
> +{
> +	struct iommu_domain *iommu_domain;
> +	struct hv_domain *hvdom;
> +	struct device *dev = &pdev->dev;
> +
> +	iommu_domain = iommu_get_domain_for_dev(dev);
> +	if (iommu_domain) {
> +		hvdom = to_hv_domain(iommu_domain);
> +		return hvdom->attached_dom;
> +	}
> +
> +	return false;
> +}
> +EXPORT_SYMBOL_GPL(hv_pcidev_is_attached_dev);
Attached domain can change anytime, what guarantee does the caller have?

> +
> +/* Create a new device domain in the hypervisor */
> +static int hv_iommu_create_hyp_devdom(struct hv_domain *hvdom)
> +{
> +	u64 status;
> +	unsigned long flags;
> +	struct hv_input_device_domain *ddp;
> +	struct hv_input_create_device_domain *input;
nit: use consistent coding style, inverse Christmas tree.

> +
> +	local_irq_save(flags);
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +
> +	ddp = &input->device_domain;
> +	ddp->partition_id = HV_PARTITION_ID_SELF;
> +	ddp->domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
> +	ddp->domain_id.id = hvdom->domid_num;
> +
> +	input->create_device_domain_flags.forward_progress_required = 1;
> +	input->create_device_domain_flags.inherit_owning_vtl = 0;
> +
> +	status = hv_do_hypercall(HVCALL_CREATE_DEVICE_DOMAIN, input, NULL);
> +
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status))
> +		hv_status_err(status, "\n");
> +
> +	return hv_result_to_errno(status);
> +}
> +
> +/* During boot, all devices are attached to this */
> +static struct iommu_domain *hv_iommu_domain_alloc_identity(struct device *dev)
> +{
> +	return &hv_def_identity_dom.iommu_dom;
> +}
> +
> +static struct iommu_domain *hv_iommu_domain_alloc_paging(struct device *dev)
> +{
> +	struct hv_domain *hvdom;
> +	int rc;
> +
> +	if (hv_l1vh_partition() && !hv_curr_thread_is_vmm() && !hv_no_attdev) {
> +		pr_err("Hyper-V: l1vh iommu does not support host devices\n");
Why is this an error if the user chose not to do direct attach?

> +		return NULL;
> +	}
> +
> +	hvdom = kzalloc(sizeof(struct hv_domain), GFP_KERNEL);
> +	if (hvdom == NULL)
> +		goto out;
> +
> +	spin_lock_init(&hvdom->mappings_lock);
> +	hvdom->mappings_tree = RB_ROOT_CACHED;
> +
> +	if (++unique_id == HV_DEVICE_DOMAIN_ID_S2_DEFAULT)   /* ie, 0 */
This is true only when unique_id wraps around, right? Then this driver
stops working?

Can you use an IDR for the unique_id and free it as you detach instead
of doing this cyclic allocation?
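
A rough userspace sketch of the allocate-and-release idea, assuming a small
fixed pool and no locking. DOMID_RESERVED, DOMID_MAX, domid_alloc() and
domid_free() are made-up names for illustration only; in-kernel code would
more likely use the existing IDR/IDA helpers rather than a hand-rolled table:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DOMID_RESERVED 0u  /* stands in for HV_DEVICE_DOMAIN_ID_S2_DEFAULT */
#define DOMID_MAX      64u /* hypothetical pool size, chosen for the sketch */

static bool domid_in_use[DOMID_MAX];

/* Return the lowest free non-reserved id, or DOMID_RESERVED if exhausted. */
static uint32_t domid_alloc(void)
{
	for (uint32_t id = DOMID_RESERVED + 1; id < DOMID_MAX; id++) {
		if (!domid_in_use[id]) {
			domid_in_use[id] = true;
			return id;
		}
	}

	return DOMID_RESERVED;
}

/* Release an id so that a later domid_alloc() can hand it out again. */
static void domid_free(uint32_t id)
{
	assert(id > DOMID_RESERVED && id < DOMID_MAX && domid_in_use[id]);
	domid_in_use[id] = false;
}
```

With something along these lines, the id never needs to wrap: an id released
when a domain is freed simply becomes available to the next allocation, and
the reserved default-domain id is never handed out.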

> +		goto out_free;
> +
> +	hvdom->domid_num = unique_id;
> +	hvdom->iommu_dom.geometry = default_geometry;
> +	hvdom->iommu_dom.pgsize_bitmap = HV_IOMMU_PGSIZES;
> +
> +	/* For guests, by default we do direct attaches, so no domain in hyp */
> +	if (hv_curr_thread_is_vmm() && !hv_no_attdev)
> +		hvdom->attached_dom = true;
> +	else {
> +		rc = hv_iommu_create_hyp_devdom(hvdom);
> +		if (rc)
> +			goto out_free_id;
> +	}
> +
> +	return &hvdom->iommu_dom;
> +
> +out_free_id:
> +	unique_id--;
> +out_free:
> +	kfree(hvdom);
> +out:
> +	return NULL;
> +}
> +
> +static void hv_iommu_domain_free(struct iommu_domain *immdom)
> +{
> +	struct hv_domain *hvdom = to_hv_domain(immdom);
> +	unsigned long flags;
> +	u64 status;
> +	struct hv_input_delete_device_domain *input;
> +
> +	if (hv_special_domain(hvdom))
> +		return;
> +
> +	if (hvdom->num_attchd) {
> +		pr_err("Hyper-V: can't free busy iommu domain (%p)\n", immdom);
> +		return;
> +	}
> +
> +	if (!hv_curr_thread_is_vmm() || hv_no_attdev) {
> +		struct hv_input_device_domain *ddp;
> +
> +		local_irq_save(flags);
> +		input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +		ddp = &input->device_domain;
> +		memset(input, 0, sizeof(*input));
> +
> +		ddp->partition_id = HV_PARTITION_ID_SELF;
> +		ddp->domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
> +		ddp->domain_id.id = hvdom->domid_num;
> +
> +		status = hv_do_hypercall(HVCALL_DELETE_DEVICE_DOMAIN, input,
> +					 NULL);
> +		local_irq_restore(flags);
> +
> +		if (!hv_result_success(status))
> +			hv_status_err(status, "\n");
> +	}
you could free the domid here, no?

> +
> +	kfree(hvdom);
> +}
> +
> +/* Attach a device to a domain previously created in the hypervisor */
> +static int hv_iommu_att_dev2dom(struct hv_domain *hvdom, struct pci_dev *pdev)
> +{
> +	unsigned long flags;
> +	u64 status;
> +	enum hv_device_type dev_type;
> +	struct hv_input_attach_device_domain *input;
> +
> +	local_irq_save(flags);
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +
> +	input->device_domain.partition_id = HV_PARTITION_ID_SELF;
> +	input->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
> +	input->device_domain.domain_id.id = hvdom->domid_num;
> +
> +	/* NB: Upon guest shutdown, device is re-attached to the default domain
> +	 * without explicit detach.
> +	 */
> +	if (hv_l1vh_partition())
> +		dev_type = HV_DEVICE_TYPE_LOGICAL;
> +	else
> +		dev_type = HV_DEVICE_TYPE_PCI;
> +
> +	input->device_id.as_uint64 = hv_build_devid_oftype(pdev, dev_type);
> +
> +	status = hv_do_hypercall(HVCALL_ATTACH_DEVICE_DOMAIN, input, NULL);
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status))
> +		hv_status_err(status, "\n");
> +
> +	return hv_result_to_errno(status);
> +}
> +
> +/* Caller must have validated that dev is a valid pci dev */
> +static int hv_iommu_direct_attach_device(struct pci_dev *pdev)
> +{
> +	struct hv_input_attach_device *input;
> +	u64 status;
> +	int rc;
> +	unsigned long flags;
> +	union hv_device_id host_devid;
> +	enum hv_device_type dev_type;
> +	u64 ptid = hv_iommu_get_curr_partid();
> +
> +	if (ptid == HV_PARTITION_ID_INVALID) {
> +		pr_err("Hyper-V: Invalid partition id in direct attach\n");
> +		return -EINVAL;
> +	}
> +
> +	if (hv_l1vh_partition())
> +		dev_type = HV_DEVICE_TYPE_LOGICAL;
> +	else
> +		dev_type = HV_DEVICE_TYPE_PCI;
> +
> +	host_devid.as_uint64 = hv_build_devid_oftype(pdev, dev_type);
> +
> +	do {
> +		local_irq_save(flags);
> +		input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +		memset(input, 0, sizeof(*input));
> +		input->partition_id = ptid;
> +		input->device_id = host_devid;
> +
> +		/* Hypervisor associates logical_id with this device, and in
> +		 * some hypercalls like retarget interrupts, logical_id must be
> +		 * used instead of the BDF. It is a required parameter.
> +		 */
> +		input->attdev_flags.logical_id = 1;
> +		input->logical_devid =
> +			   hv_build_devid_oftype(pdev, HV_DEVICE_TYPE_LOGICAL);
> +
> +		status = hv_do_hypercall(HVCALL_ATTACH_DEVICE, input, NULL);
> +		local_irq_restore(flags);
> +
> +		if (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY) {
> +			rc = hv_call_deposit_pages(NUMA_NO_NODE, ptid, 1);
> +			if (rc)
> +				break;
> +		}
> +	} while (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY);
> +
> +	if (!hv_result_success(status))
> +		hv_status_err(status, "\n");
> +
> +	return hv_result_to_errno(status);
> +}
> +
> +/* This to attach a device to both host app (like DPDK) and a guest VM */
The IOMMU driver should be agnostic to the type of consumer, whether a
userspace driver or a VM. This comment is not necessary.

> +static int hv_iommu_attach_dev(struct iommu_domain *immdom, struct device *dev,
> +			       struct iommu_domain *old)
This does not match upstream kernel prototype, which kernel version is
this based on? I will stop here for now.

struct iommu_domain_ops {
	int (*attach_dev)(struct iommu_domain *domain, struct device *dev);

> +{
> +	struct pci_dev *pdev;
> +	int rc;
> +	struct hv_domain *hvdom_new = to_hv_domain(immdom);
> +	struct hv_domain *hvdom_prev = dev_iommu_priv_get(dev);
> +
> +	/* Only allow PCI devices for now */
> +	if (!dev_is_pci(dev))
> +		return -EINVAL;
> +
> +	pdev = to_pci_dev(dev);
> +
> +	/* l1vh does not support host device (eg DPDK) passthru */
> +	if (hv_l1vh_partition() && !hv_special_domain(hvdom_new) &&
> +	    !hvdom_new->attached_dom)
> +		return -EINVAL;
> +
> +	/*
> +	 * VFIO does not do explicit detach calls, hence check first if we need
> +	 * to detach first. Also, in case of guest shutdown, it's the VMM
> +	 * thread that attaches it back to the hv_def_identity_dom, and
> +	 * hvdom_prev will not be null then. It is null during boot.
> +	 */
> +	if (hvdom_prev)
> +		if (!hv_l1vh_partition() || !hv_special_domain(hvdom_prev))
> +			hv_iommu_detach_dev(&hvdom_prev->iommu_dom, dev);
> +
> +	if (hv_l1vh_partition() && hv_special_domain(hvdom_new)) {
> +		dev_iommu_priv_set(dev, hvdom_new);  /* sets "private" field */
> +		return 0;
> +	}
> +
> +	if (hvdom_new->attached_dom)
> +		rc = hv_iommu_direct_attach_device(pdev);
> +	else
> +		rc = hv_iommu_att_dev2dom(hvdom_new, pdev);
> +
> +	if (rc && hvdom_prev) {
> +		int rc1;
> +
> +		if (hvdom_prev->attached_dom)
> +			rc1 = hv_iommu_direct_attach_device(pdev);
> +		else
> +			rc1 = hv_iommu_att_dev2dom(hvdom_prev, pdev);
> +
> +		if (rc1)
> +			pr_err("Hyper-V: iommu could not restore
> orig device state.. dev:%s\n",
> +			       dev_name(dev));
> +	}
> +
> +	if (rc == 0) {
> +		dev_iommu_priv_set(dev, hvdom_new);  /* sets
> "private" field */
> +		hvdom_new->num_attchd++;
> +	}
> +
> +	return rc;
> +}
> +
> +static void hv_iommu_det_dev_from_guest(struct hv_domain *hvdom,
> +					struct pci_dev *pdev)
> +{
> +	struct hv_input_detach_device *input;
> +	u64 status, log_devid;
> +	unsigned long flags;
> +
> +	log_devid = hv_build_devid_oftype(pdev,
> HV_DEVICE_TYPE_LOGICAL); +
> +	local_irq_save(flags);
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +
> +	input->partition_id = hv_iommu_get_curr_partid();
> +	input->logical_devid = log_devid;
> +	status = hv_do_hypercall(HVCALL_DETACH_DEVICE, input, NULL);
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status))
> +		hv_status_err(status, "\n");
> +}
> +
> +static void hv_iommu_det_dev_from_dom(struct hv_domain *hvdom,
> +				      struct pci_dev *pdev)
> +{
> +	u64 status, devid;
> +	unsigned long flags;
> +	struct hv_input_detach_device_domain *input;
> +
> +	devid = hv_build_devid_oftype(pdev, HV_DEVICE_TYPE_PCI);
> +
> +	local_irq_save(flags);
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +
> +	input->partition_id = HV_PARTITION_ID_SELF;
> +	input->device_id.as_uint64 = devid;
> +	status = hv_do_hypercall(HVCALL_DETACH_DEVICE_DOMAIN, input,
> NULL);
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status))
> +		hv_status_err(status, "\n");
> +}
> +
> +static void hv_iommu_detach_dev(struct iommu_domain *immdom, struct
> device *dev) +{
> +	struct pci_dev *pdev;
> +	struct hv_domain *hvdom = to_hv_domain(immdom);
> +
> +	/* See the attach function, only PCI devices for now */
> +	if (!dev_is_pci(dev))
> +		return;
> +
> +	if (hvdom->num_attchd == 0)
> +		pr_warn("Hyper-V: num_attchd is zero (%s)\n",
> dev_name(dev)); +
> +	pdev = to_pci_dev(dev);
> +
> +	if (hvdom->attached_dom) {
> +		hv_iommu_det_dev_from_guest(hvdom, pdev);
> +
> +		/* Do not reset attached_dom, hv_iommu_unmap_pages
> happens
> +		 * next.
> +		 */
> +	} else {
> +		hv_iommu_det_dev_from_dom(hvdom, pdev);
> +	}
> +
> +	hvdom->num_attchd--;
> +}
> +
> +static int hv_iommu_add_tree_mapping(struct hv_domain *hvdom,
> +				     unsigned long iova, phys_addr_t
> paddr,
> +				     size_t size, u32 flags)
> +{
> +	unsigned long irqflags;
> +	struct hv_iommu_mapping *mapping;
> +
> +	mapping = kzalloc(sizeof(*mapping), GFP_ATOMIC);
> +	if (!mapping)
> +		return -ENOMEM;
> +
> +	mapping->paddr = paddr;
> +	mapping->iova.start = iova;
> +	mapping->iova.last = iova + size - 1;
> +	mapping->flags = flags;
> +
> +	spin_lock_irqsave(&hvdom->mappings_lock, irqflags);
> +	interval_tree_insert(&mapping->iova, &hvdom->mappings_tree);
> +	spin_unlock_irqrestore(&hvdom->mappings_lock, irqflags);
> +
> +	return 0;
> +}
> +
> +static size_t hv_iommu_del_tree_mappings(struct hv_domain *hvdom,
> +					unsigned long iova, size_t
> size) +{
> +	unsigned long flags;
> +	size_t unmapped = 0;
> +	unsigned long last = iova + size - 1;
> +	struct hv_iommu_mapping *mapping = NULL;
> +	struct interval_tree_node *node, *next;
> +
> +	spin_lock_irqsave(&hvdom->mappings_lock, flags);
> +	next = interval_tree_iter_first(&hvdom->mappings_tree, iova,
> last);
> +	while (next) {
> +		node = next;
> +		mapping = container_of(node, struct
> hv_iommu_mapping, iova);
> +		next = interval_tree_iter_next(node, iova, last);
> +
> +		/* Trying to split a mapping? Not supported for now.
> */
> +		if (mapping->iova.start < iova)
> +			break;
> +
> +		unmapped += mapping->iova.last - mapping->iova.start
> + 1; +
> +		interval_tree_remove(node, &hvdom->mappings_tree);
> +		kfree(mapping);
> +	}
> +	spin_unlock_irqrestore(&hvdom->mappings_lock, flags);
> +
> +	return unmapped;
> +}
> +
> +/* Return: must return exact status from the hypercall without
> changes */ +static u64 hv_iommu_map_pgs(struct hv_domain *hvdom,
> +			    unsigned long iova, phys_addr_t paddr,
> +			    unsigned long npages, u32 map_flags)
> +{
> +	u64 status;
> +	int i;
> +	struct hv_input_map_device_gpa_pages *input;
> +	unsigned long flags, pfn = paddr >> HV_HYP_PAGE_SHIFT;
> +
> +	local_irq_save(flags);
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +
> +	input->device_domain.partition_id = HV_PARTITION_ID_SELF;
> +	input->device_domain.domain_id.type =
> HV_DEVICE_DOMAIN_TYPE_S2;
> +	input->device_domain.domain_id.id = hvdom->domid_num;
> +	input->map_flags = map_flags;
> +	input->target_device_va_base = iova;
> +
> +	pfn = paddr >> HV_HYP_PAGE_SHIFT;
> +	for (i = 0; i < npages; i++, pfn++)
> +		input->gpa_page_list[i] = pfn;
> +
> +	status = hv_do_rep_hypercall(HVCALL_MAP_DEVICE_GPA_PAGES,
> npages, 0,
> +				     input, NULL);
> +
> +	local_irq_restore(flags);
> +	return status;
> +}
> +
> +/*
> + * The core VFIO code loops over memory ranges calling this function
> with
> + * the largest size from HV_IOMMU_PGSIZES. cond_resched() is in
> vfio_iommu_map.
> + */
> +static int hv_iommu_map_pages(struct iommu_domain *immdom, ulong
> iova,
> +			      phys_addr_t paddr, size_t pgsize,
> size_t pgcount,
> +			      int prot, gfp_t gfp, size_t *mapped)
> +{
> +	u32 map_flags;
> +	int ret;
> +	u64 status;
> +	unsigned long npages, done = 0;
> +	struct hv_domain *hvdom = to_hv_domain(immdom);
> +	size_t size = pgsize * pgcount;
> +
> +	map_flags = HV_MAP_GPA_READABLE;	/* required */
> +	map_flags |= prot & IOMMU_WRITE ? HV_MAP_GPA_WRITABLE : 0;
> +
> +	ret = hv_iommu_add_tree_mapping(hvdom, iova, paddr, size,
> map_flags);
> +	if (ret)
> +		return ret;
> +
> +	if (hvdom->attached_dom) {
> +		*mapped = size;
> +		return 0;
> +	}
> +
> +	npages = size >> HV_HYP_PAGE_SHIFT;
> +	while (done < npages) {
> +		ulong completed, remain = npages - done;
> +
> +		status = hv_iommu_map_pgs(hvdom, iova, paddr, remain,
> +					  map_flags);
> +
> +		completed = hv_repcomp(status);
> +		done = done + completed;
> +		iova = iova + (completed << HV_HYP_PAGE_SHIFT);
> +		paddr = paddr + (completed << HV_HYP_PAGE_SHIFT);
> +
> +		if (hv_result(status) ==
> HV_STATUS_INSUFFICIENT_MEMORY) {
> +			ret = hv_call_deposit_pages(NUMA_NO_NODE,
> +
> hv_current_partition_id,
> +						    256);
> +			if (ret)
> +				break;
> +		}
> +		if (!hv_result_success(status))
> +			break;
> +	}
> +
> +	if (!hv_result_success(status)) {
> +		size_t done_size = done << HV_HYP_PAGE_SHIFT;
> +
> +		hv_status_err(status, "pgs:%lx/%lx iova:%lx\n",
> +			      done, npages, iova);
> +		/*
> +		 * lookup tree has all mappings [0 - size-1]. Below
> unmap will
> +		 * only remove from [0 - done], we need to remove
> second chunk
> +		 * [done+1 - size-1].
> +		 */
> +		hv_iommu_del_tree_mappings(hvdom, iova, size -
> done_size);
> +		hv_iommu_unmap_pages(immdom, iova - done_size,
> pgsize,
> +				     done, NULL);
> +		if (mapped)
> +			*mapped = 0;
> +	} else
> +		if (mapped)
> +			*mapped = size;
> +
> +	return hv_result_to_errno(status);
> +}
> +
> +static size_t hv_iommu_unmap_pages(struct iommu_domain *immdom,
> ulong iova,
> +				   size_t pgsize, size_t pgcount,
> +				   struct iommu_iotlb_gather *gather)
> +{
> +	unsigned long flags, npages;
> +	struct hv_input_unmap_device_gpa_pages *input;
> +	u64 status;
> +	struct hv_domain *hvdom = to_hv_domain(immdom);
> +	size_t unmapped, size = pgsize * pgcount;
> +
> +	unmapped = hv_iommu_del_tree_mappings(hvdom, iova, size);
> +	if (unmapped < size)
> +		pr_err("%s: could not delete all mappings
> (%lx:%lx/%lx)\n",
> +		       __func__, iova, unmapped, size);
> +
> +	if (hvdom->attached_dom)
> +		return size;
> +
> +	npages = size >> HV_HYP_PAGE_SHIFT;
> +
> +	local_irq_save(flags);
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +
> +	input->device_domain.partition_id = HV_PARTITION_ID_SELF;
> +	input->device_domain.domain_id.type =
> HV_DEVICE_DOMAIN_TYPE_S2;
> +	input->device_domain.domain_id.id = hvdom->domid_num;
> +	input->target_device_va_base = iova;
> +
> +	status = hv_do_rep_hypercall(HVCALL_UNMAP_DEVICE_GPA_PAGES,
> npages,
> +				     0, input, NULL);
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status))
> +		hv_status_err(status, "\n");
> +
> +	return unmapped;
> +}
> +
> +static phys_addr_t hv_iommu_iova_to_phys(struct iommu_domain *immdom,
> +					 dma_addr_t iova)
> +{
> +	u64 paddr = 0;
> +	unsigned long flags;
> +	struct hv_iommu_mapping *mapping;
> +	struct interval_tree_node *node;
> +	struct hv_domain *hvdom = to_hv_domain(immdom);
> +
> +	spin_lock_irqsave(&hvdom->mappings_lock, flags);
> +	node = interval_tree_iter_first(&hvdom->mappings_tree, iova,
> iova);
> +	if (node) {
> +		mapping = container_of(node, struct
> hv_iommu_mapping, iova);
> +		paddr = mapping->paddr + (iova -
> mapping->iova.start);
> +	}
> +	spin_unlock_irqrestore(&hvdom->mappings_lock, flags);
> +
> +	return paddr;
> +}
> +
> +/*
> + * Currently, hypervisor does not provide list of devices it is using
> + * dynamically. So use this to allow users to manually specify
> devices that
> + * should be skipped. (eg. hypervisor debugger using some network
> device).
> + */
> +static struct iommu_device *hv_iommu_probe_device(struct device *dev)
> +{
> +	if (!dev_is_pci(dev))
> +		return ERR_PTR(-ENODEV);
> +
> +	if (pci_devs_to_skip && *pci_devs_to_skip) {
> +		int rc, pos = 0;
> +		int parsed;
> +		int segment, bus, slot, func;
> +		struct pci_dev *pdev = to_pci_dev(dev);
> +
> +		do {
> +			parsed = 0;
> +
> +			rc = sscanf(pci_devs_to_skip + pos, "
> (%x:%x:%x.%x) %n",
> +				    &segment, &bus, &slot, &func,
> &parsed);
> +			if (rc)
> +				break;
> +			if (parsed <= 0)
> +				break;
> +
> +			if (pci_domain_nr(pdev->bus) == segment &&
> +			    pdev->bus->number == bus &&
> +			    PCI_SLOT(pdev->devfn) == slot &&
> +			    PCI_FUNC(pdev->devfn) == func) {
> +
> +				dev_info(dev, "skipped by Hyper-V
> IOMMU\n");
> +				return ERR_PTR(-ENODEV);
> +			}
> +			pos += parsed;
> +
> +		} while (pci_devs_to_skip[pos]);
> +	}
> +
> +	/* Device will be explicitly attached to the default domain,
> so no need
> +	 * to do dev_iommu_priv_set() here.
> +	 */
> +
> +	return &hv_virt_iommu;
> +}
> +
> +static void hv_iommu_probe_finalize(struct device *dev)
> +{
> +	struct iommu_domain *immdom = iommu_get_domain_for_dev(dev);
> +
> +	if (immdom && immdom->type == IOMMU_DOMAIN_DMA)
> +		iommu_setup_dma_ops(dev);
> +	else
> +		set_dma_ops(dev, NULL);
> +}
> +
> +static void hv_iommu_release_device(struct device *dev)
> +{
> +	struct hv_domain *hvdom = dev_iommu_priv_get(dev);
> +
> +	/* Need to detach device from device domain if necessary. */
> +	if (hvdom)
> +		hv_iommu_detach_dev(&hvdom->iommu_dom, dev);
> +
> +	dev_iommu_priv_set(dev, NULL);
> +	set_dma_ops(dev, NULL);
> +}
> +
> +static struct iommu_group *hv_iommu_device_group(struct device *dev)
> +{
> +	if (dev_is_pci(dev))
> +		return pci_device_group(dev);
> +	else
> +		return generic_device_group(dev);
> +}
> +
> +static int hv_iommu_def_domain_type(struct device *dev)
> +{
> +	/* The hypervisor always creates this by default during boot
> */
> +	return IOMMU_DOMAIN_IDENTITY;
> +}
> +
> +static struct iommu_ops hv_iommu_ops = {
> +	.capable	    = hv_iommu_capable,
> +	.domain_alloc_identity	=
> hv_iommu_domain_alloc_identity,
> +	.domain_alloc_paging	= hv_iommu_domain_alloc_paging,
> +	.probe_device	    = hv_iommu_probe_device,
> +	.probe_finalize     = hv_iommu_probe_finalize,
> +	.release_device     = hv_iommu_release_device,
> +	.def_domain_type    = hv_iommu_def_domain_type,
> +	.device_group	    = hv_iommu_device_group,
> +	.default_domain_ops = &(const struct iommu_domain_ops) {
> +		.attach_dev   = hv_iommu_attach_dev,
> +		.map_pages    = hv_iommu_map_pages,
> +		.unmap_pages  = hv_iommu_unmap_pages,
> +		.iova_to_phys = hv_iommu_iova_to_phys,
> +		.free	      = hv_iommu_domain_free,
> +	},
> +	.owner		    = THIS_MODULE,
> +};
> +
> +static void __init hv_initialize_special_domains(void)
> +{
> +	hv_def_identity_dom.iommu_dom.geometry = default_geometry;
> +	hv_def_identity_dom.domid_num =
> HV_DEVICE_DOMAIN_ID_S2_DEFAULT; /* 0 */ +}
This could be initialized statically.
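
As a rough illustration of the comment above, an object like this can be built at compile time with designated initializers, so no __init helper is needed. The struct layouts, the geometry values, and every demo_/DEMO_ name below are simplified stand-ins chosen for this sketch; they are not the kernel's or the patch's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct demo_geometry {
	uint64_t aperture_start;
	uint64_t aperture_end;
	bool force_aperture;
};

struct demo_iommu_domain {
	struct demo_geometry geometry;
};

struct demo_hv_domain {
	struct demo_iommu_domain iommu_dom;
	uint32_t domid_num;
};

/* Assumed to mirror the "0" noted in the patch comment. */
#define DEMO_DOMAIN_ID_S2_DEFAULT 0

/*
 * A file-scope initializer cannot copy another object, so shared
 * defaults are expressed as a macro. The values are illustrative only.
 */
#define DEMO_DEFAULT_GEOMETRY {			\
	.aperture_start = 0,			\
	.aperture_end = UINT64_MAX,		\
	.force_aperture = true,			\
}

static_assert(DEMO_DOMAIN_ID_S2_DEFAULT == 0, "default domain id is 0");

/* Fully initialized at compile time; no runtime init call required. */
static const struct demo_hv_domain demo_def_identity_dom = {
	.iommu_dom.geometry = DEMO_DEFAULT_GEOMETRY,
	.domid_num = DEMO_DOMAIN_ID_S2_DEFAULT,
};

/* Accessor so other code can read the statically built object. */
const struct demo_hv_domain *demo_get_identity_dom(void)
{
	return &demo_def_identity_dom;
}
```

With this shape the ordering constraint described in the quoted hv_iommu_init() comment would no longer apply, since the object is valid before any code runs.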

> +
> +static int __init hv_iommu_init(void)
> +{
> +	int ret;
> +	struct iommu_device *iommup = &hv_virt_iommu;
> +
> +	if (!hv_is_hyperv_initialized())
> +		return -ENODEV;
> +
> +	ret = iommu_device_sysfs_add(iommup, NULL, NULL, "%s",
> "hyperv-iommu");
> +	if (ret) {
> +		pr_err("Hyper-V: iommu_device_sysfs_add failed:
> %d\n", ret);
> +		return ret;
> +	}
> +
> +	/* This must come before iommu_device_register because the
> latter calls
> +	 * into the hooks.
> +	 */
> +	hv_initialize_special_domains();
> +
> +	ret = iommu_device_register(iommup, &hv_iommu_ops, NULL);
> +	if (ret) {
> +		pr_err("Hyper-V: iommu_device_register failed:
> %d\n", ret);
> +		goto err_sysfs_remove;
> +	}
> +
> +	pr_info("Hyper-V IOMMU initialized\n");
> +
> +	return 0;
> +
> +err_sysfs_remove:
> +	iommu_device_sysfs_remove(iommup);
> +	return ret;
> +}
> +
> +void __init hv_iommu_detect(void)
> +{
> +	if (no_iommu || iommu_detected)
> +		return;
> +
> +	/* For l1vh, always expose an iommu unit */
> +	if (!hv_l1vh_partition())
> +		if (!(ms_hyperv.misc_features &
> HV_DEVICE_DOMAIN_AVAILABLE))
> +			return;
> +
> +	iommu_detected = 1;
> +	x86_init.iommu.iommu_init = hv_iommu_init;
> +
> +	pci_request_acs();
> +}
> diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
> index dfc516c1c719..2ad111727e82 100644
> --- a/include/linux/hyperv.h
> +++ b/include/linux/hyperv.h
> @@ -1767,4 +1767,10 @@ static inline unsigned long virt_to_hvpfn(void
> *addr) #define HVPFN_DOWN(x)	((x) >> HV_HYP_PAGE_SHIFT)
>  #define page_to_hvpfn(page)	(page_to_pfn(page) *
> NR_HV_HYP_PAGES_IN_PAGE) 
> +#ifdef CONFIG_HYPERV_IOMMU
> +void __init hv_iommu_detect(void);
> +#else
> +static inline void hv_iommu_detect(void) { }
> +#endif /* CONFIG_HYPERV_IOMMU */
> +
>  #endif /* _HYPERV_H */



* [GIT PULL] Hyper-V fixes for v6.19-rc7
From: Wei Liu @ 2026-01-22  5:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Wei Liu, Linux on Hyper-V List, Linux Kernel List, kys, haiyangz,
	decui, longli

Hi Linus,

The following changes since commit 173d6f64f9558ff022a777a72eb8669b6cdd2649:

  mshv: release mutex on region invalidation failure (2025-12-18 20:00:10 +0000)

are available in the Git repository at:

  ssh://git@gitolite.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git tags/hyperv-fixes-signed-20260121

for you to fetch changes up to 12ffd561d2de28825f39e15e8d22346d26b09688:

  mshv: handle gpa intercepts for arm64 (2026-01-15 07:29:14 +0000)

----------------------------------------------------------------
hyperv-fixes for v6.19-rc7
 - Fix ARM64 port of the MSHV driver (Anirudh Rayabharam)
 - Fix huge page handling in the MSHV driver (Stanislav Kinsburskii)
 - Minor fixes to driver code (Julia Lawall, Michael Kelley)
----------------------------------------------------------------
Anirudh Rayabharam (Microsoft) (2):
      mshv: add definitions for arm64 gpa intercepts
      mshv: handle gpa intercepts for arm64

Julia Lawall (1):
      Drivers: hv: vmbus: fix typo in function name reference

Michael Kelley (3):
      Drivers: hv: Always do Hyper-V panic notification in hv_kmsg_dump()
      mshv: Store the result of vfs_poll in a variable of type __poll_t
      mshv: Add __user attribute to argument passed to access_ok()

Stanislav Kinsburskii (1):
      mshv: Align huge page stride with guest mapping

 drivers/hv/hv_common.c      | 12 +++---
 drivers/hv/hyperv_vmbus.h   |  2 +-
 drivers/hv/mshv_eventfd.c   |  2 +-
 drivers/hv/mshv_regions.c   | 93 ++++++++++++++++++++++++++++++---------------
 drivers/hv/mshv_root_main.c | 17 +++++----
 include/hyperv/hvhdk.h      | 47 +++++++++++++++++++++++
 6 files changed, 127 insertions(+), 46 deletions(-)


* Re: [PATCH net-next v2] net: mana: Improve diagnostic logging for better debuggability
From: Jakub Kicinski @ 2026-01-22  4:14 UTC (permalink / raw)
  To: Erni Sri Satya Vennela
  Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, pabeni, leon, kotaranov, shradhagupta, yury.norov,
	dipayanroy, shirazsaleem, ssengar, gargaditya, linux-hyperv,
	netdev, linux-kernel
In-Reply-To: <20260121065655.18249-1-ernis@linux.microsoft.com>

On Tue, 20 Jan 2026 22:56:55 -0800 Erni Sri Satya Vennela wrote:
> Enhance MANA driver logging to provide better visibility into
> hardware configuration and error states during driver initialization
> and runtime operations.

> +	dev_info(gc->dev, "Max Resources: msix_usable=%u max_queues=%u\n",
> +		 gc->num_msix_usable, gc->max_num_queues);

> +	dev_info(dev, "Device Config: max_vports=%u adapter_mtu=%u bm_hostmode=%u\n",
> +		 *max_num_vports, gc->adapter_mtu, *bm_hostmode);

IIUC in networking we try to follow the mantra that if the system is
functioning correctly there should be no logs. You can expose the debug
info via ethtool, devlink, debugfs etc. Take your pick.
-- 
pw-bot: cr


* [PATCH] PCI: hv: Allocate MMIO from above 4GB for the config window
From: Dexuan Cui @ 2026-01-22  2:03 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, lpieralisi, kwilczynski,
	mani, robh, bhelgaas, jakeo, linux-hyperv, linux-pci,
	linux-kernel
  Cc: mhklinux, stable

There has been a longstanding MMIO conflict between the pci_hyperv
driver's config_window (see hv_allocate_config_window()) and the
hyperv_drm (or hyperv_fb) driver (see hyperv_setup_vram()): typically
both get MMIO from the low MMIO range below 4GB; this is not an issue
in the normal kernel since the VMBus driver reserves the framebuffer
MMIO in vmbus_reserve_fb(), so the drm driver's hyperv_setup_vram() can
always get the reserved framebuffer MMIO; however, a Gen2 VM's kdump
kernel fails to reserve the framebuffer MMIO in vmbus_reserve_fb() because
the screen_info.lfb_base is zero in the kdump kernel: the screen_info
is not initialized at all in the kdump kernel, because the EFI stub
code, which initializes screen_info, doesn't run in the case of kdump.

When vmbus_reserve_fb() fails to reserve the framebuffer MMIO in the
kdump kernel, if pci_hyperv in the kdump kernel loads before hyperv_drm
loads, pci_hyperv's vmbus_allocate_mmio() gets the framebuffer MMIO
and tries to use it, but since the host thinks that the MMIO range is
still in use by hyperv_drm, the host refuses to accept the MMIO range
as the config window, and pci_hyperv's hv_pci_enter_d0() errors out:
"PCI Pass-through VSP failed D0 Entry with status c0370048".

This PCI error in the kdump kernel was not fatal in the past because
the kdump kernel normally doesn't rely on pci_hyperv, and the root
file system is on a VMBus SCSI device.

Now, a VM on Azure can boot from NVMe, i.e. the root FS can be on an
NVMe device, which depends on pci_hyperv. When the PCI error occurs,
the kdump kernel fails to boot up since no root FS is detected.

Fix the MMIO conflict by allocating MMIO above 4GB for the
config_window.
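
The effect of raising the lower bound can be sketched in user space with a toy first-fit allocator. This is only a model of the idea, not the VMBus implementation: demo_alloc_mmio(), struct demo_range, the free ranges, and the 0x2000 request size are all hypothetical values picked for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_SZ_4G 0x100000000ULL

/* Hypothetical free MMIO range with inclusive bounds. */
struct demo_range {
	uint64_t start;
	uint64_t end;
};

/*
 * First-fit search: find the lowest aligned address that is >= min,
 * lies inside a free range, and leaves room for size bytes without
 * exceeding max. Returns 0 and stores the address in *out on success,
 * -1 if nothing fits or the arguments are invalid.
 */
int demo_alloc_mmio(const struct demo_range *ranges, size_t nranges,
		    uint64_t min, uint64_t max, uint64_t size,
		    uint64_t align, uint64_t *out)
{
	size_t i;

	assert(ranges != NULL && out != NULL);

	/* align must be a non-zero power of two */
	if (size == 0 || align == 0 || (align & (align - 1)))
		return -1;

	for (i = 0; i < nranges; i++) {
		uint64_t lo = ranges[i].start > min ? ranges[i].start : min;
		uint64_t hi = ranges[i].end < max ? ranges[i].end : max;
		uint64_t aligned = (lo + align - 1) & ~(align - 1);

		if (aligned < lo)	/* rounding up wrapped around */
			continue;
		if (aligned > hi || hi - aligned < size - 1)
			continue;

		*out = aligned;
		return 0;
	}

	return -1;
}
```

Given one free range below 4GB and one above it, a request with min = 0 lands in the low range (where the framebuffer may live), while the same request with min = DEMO_SZ_4G can only be satisfied from the high range.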

Note: we still need to figure out how to address the possible MMIO
conflict between hyperv_drm and pci_hyperv in the case of 32-bit PCI
MMIO BARs, but that's of low priority because all PCI devices available
to a Linux VM on Azure should use 64-bit BARs and should not use 32-bit
BARs -- I checked Mellanox VFs, MANA VFs, NVMe devices, and GPUs in
Linux VMs on Azure, and found no 32-bit BARs.

Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft Hyper-V VMs")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Cc: stable@vger.kernel.org
---
 drivers/pci/controller/pci-hyperv.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 1e237d3538f9..a6aecb1b5cab 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -3406,9 +3406,13 @@ static int hv_allocate_config_window(struct hv_pcibus_device *hbus)
 
 	/*
 	 * Set up a region of MMIO space to use for accessing configuration
-	 * space.
+	 * space. Use the high MMIO range to not conflict with the hyperv_drm
+	 * driver (which normally gets MMIO from the low MMIO range) in the
+	 * kdump kernel of a Gen2 VM, which fails to reserve the framebuffer
+	 * MMIO range in vmbus_reserve_fb() due to screen_info.lfb_base being
+	 * zero in the kdump kernel.
 	 */
-	ret = vmbus_allocate_mmio(&hbus->mem_config, hbus->hdev, 0, -1,
+	ret = vmbus_allocate_mmio(&hbus->mem_config, hbus->hdev, SZ_4G, -1,
 				  PCI_CONFIG_MMIO_LENGTH, 0x1000, false);
 	if (ret)
 		return ret;
-- 
2.43.0



* Re: [PATCH v4 5/7] mshv: Update hv_stats_page definitions
From: Stanislav Kinsburskii @ 2026-01-22  1:22 UTC (permalink / raw)
  To: Nuno Das Neves
  Cc: linux-hyperv, linux-kernel, mhklinux, kys, haiyangz, wei.liu,
	decui, longli, prapal, mrathor, paekkaladevi
In-Reply-To: <20260121214623.76374-6-nunodasneves@linux.microsoft.com>

On Wed, Jan 21, 2026 at 01:46:21PM -0800, Nuno Das Neves wrote:
> hv_stats_page belongs in hvhdk.h, move it there.
> 
> It does not require a union to access the data for different counters,
> just use a single u64 array for simplicity and to match the Windows
> definitions.
> 
> While at it, correct the ARM64 value for VpRootDispatchThreadBlocked.
> 
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>
> ---
>  drivers/hv/mshv_root_main.c | 22 ++++++----------------
>  include/hyperv/hvhdk.h      |  8 ++++++++
>  2 files changed, 14 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index fbfc9e7d9fa4..12825666e21b 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -39,23 +39,14 @@ MODULE_AUTHOR("Microsoft");
>  MODULE_LICENSE("GPL");
>  MODULE_DESCRIPTION("Microsoft Hyper-V root partition VMM interface /dev/mshv");
>  
> -/* TODO move this to another file when debugfs code is added */
>  enum hv_stats_vp_counters {			/* HV_THREAD_COUNTER */

Given the changes you are making for printing VP counters in the
subsequent patches, this enum looks redundant.
I'd suggest replacing it with a simple define for the thread blocked counter.

But nonetheless:

Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
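
One possible shape for that suggestion, sketched in user space: a single define for the counter index plus the flat u64 page from the quoted hvhdk.h hunk. The index 202 is the x86 value from the quoted enum (the patch sets the ARM64 value to 95). The 4096-byte page size and all demo_/DEMO_ names are assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed hypervisor page size for this sketch. */
#define DEMO_HYP_PAGE_SIZE 4096

/* x86 index from the quoted enum; ARM64 would use 95 per the patch. */
#define DEMO_VP_ROOT_DISPATCH_THREAD_BLOCKED 202

/* Flat array layout, modeled on the struct in the quoted hunk. */
struct demo_stats_page {
	uint64_t data[DEMO_HYP_PAGE_SIZE / sizeof(uint64_t)];
};

/* True when either stats area reports the dispatch thread as blocked. */
bool demo_dispatch_thread_blocked(const struct demo_stats_page *self,
				  const struct demo_stats_page *parent)
{
	assert(self != NULL && parent != NULL);

	return parent->data[DEMO_VP_ROOT_DISPATCH_THREAD_BLOCKED] ||
	       self->data[DEMO_VP_ROOT_DISPATCH_THREAD_BLOCKED];
}
```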


>  #if defined(CONFIG_X86)
> -	VpRootDispatchThreadBlocked			= 202,
> +	VpRootDispatchThreadBlocked = 202,
>  #elif defined(CONFIG_ARM64)
> -	VpRootDispatchThreadBlocked			= 94,
> +	VpRootDispatchThreadBlocked = 95,
>  #endif
> -	VpStatsMaxCounter
>  };
>  
> -struct hv_stats_page {
> -	union {
> -		u64 vp_cntrs[VpStatsMaxCounter];		/* VP counters */
> -		u8 data[HV_HYP_PAGE_SIZE];
> -	};
> -} __packed;
> -
>  struct mshv_root mshv_root;
>  
>  enum hv_scheduler_type hv_scheduler_type;
> @@ -485,12 +476,11 @@ static u64 mshv_vp_interrupt_pending(struct mshv_vp *vp)
>  static bool mshv_vp_dispatch_thread_blocked(struct mshv_vp *vp)
>  {
>  	struct hv_stats_page **stats = vp->vp_stats_pages;
> -	u64 *self_vp_cntrs = stats[HV_STATS_AREA_SELF]->vp_cntrs;
> -	u64 *parent_vp_cntrs = stats[HV_STATS_AREA_PARENT]->vp_cntrs;
> +	u64 *self_vp_cntrs = stats[HV_STATS_AREA_SELF]->data;
> +	u64 *parent_vp_cntrs = stats[HV_STATS_AREA_PARENT]->data;
>  
> -	if (self_vp_cntrs[VpRootDispatchThreadBlocked])
> -		return self_vp_cntrs[VpRootDispatchThreadBlocked];
> -	return parent_vp_cntrs[VpRootDispatchThreadBlocked];
> +	return parent_vp_cntrs[VpRootDispatchThreadBlocked] ||
> +	       self_vp_cntrs[VpRootDispatchThreadBlocked];
>  }
>  
>  static int
> diff --git a/include/hyperv/hvhdk.h b/include/hyperv/hvhdk.h
> index 469186df7826..ac501969105c 100644
> --- a/include/hyperv/hvhdk.h
> +++ b/include/hyperv/hvhdk.h
> @@ -10,6 +10,14 @@
>  #include "hvhdk_mini.h"
>  #include "hvgdk.h"
>  
> +/*
> + * Hypervisor statistics page format
> + */
> +struct hv_stats_page {
> +	u64 data[HV_HYP_PAGE_SIZE / sizeof(u64)];
> +} __packed;
> +
> +
>  /* Bits for dirty mask of hv_vp_register_page */
>  #define HV_X64_REGISTER_CLASS_GENERAL	0
>  #define HV_X64_REGISTER_CLASS_IP	1
> -- 
> 2.34.1


* Re: [PATCH v4 6/7] mshv: Add data for printing stats page counters
From: Stanislav Kinsburskii @ 2026-01-22  1:18 UTC (permalink / raw)
  To: Nuno Das Neves
  Cc: linux-hyperv, linux-kernel, mhklinux, kys, haiyangz, wei.liu,
	decui, longli, prapal, mrathor, paekkaladevi
In-Reply-To: <20260121214623.76374-7-nunodasneves@linux.microsoft.com>

On Wed, Jan 21, 2026 at 01:46:22PM -0800, Nuno Das Neves wrote:
> Introduce hv_counters.c, containing static data corresponding to
> HV_*_COUNTER enums in the hypervisor source. Defining the enum
> members as an array instead makes more sense, since it will be
> iterated over to print counter information to debugfs.
> 
> Include hypervisor, logical processor, partition, and virtual
> processor counters.
> 
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  drivers/hv/hv_counters.c | 488 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 488 insertions(+)
>  create mode 100644 drivers/hv/hv_counters.c
> 
> diff --git a/drivers/hv/hv_counters.c b/drivers/hv/hv_counters.c
> new file mode 100644
> index 000000000000..a8e07e72cc29
> --- /dev/null
> +++ b/drivers/hv/hv_counters.c
> @@ -0,0 +1,488 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2026, Microsoft Corporation.
> + *
> + * Data for printing stats page counters via debugfs.
> + *
> + * Authors: Microsoft Linux virtualization team
> + */
> +
> +struct hv_counter_entry {
> +	char *name;
> +	int idx;
> +};

This structure looks redundant to me, mostly because of the "idx".
It looks like what you need here is an array of pointers to strings, like
below:

static const char *hv_hypervisor_counters[] = {
        NULL, /* 0 is unused */
	"HvLogicalProcessors",
	"HvPartitions",
	"HvTotalPages",
	"HvVirtualProcessors",
	"HvMonitoredNotifications",
	"HvModernStandbyEntries",
	"HvPlatformIdleTransitions",
	"HvHypervisorStartupCost",
	NULL, /* 9 is unused */
	"HvIOSpacePages",
	...
};

which can be iterated like this:

for (idx = 0; idx < ARRAY_SIZE(hv_hypervisor_counters); idx++) {
    const char *name = hv_hypervisor_counters[idx];
    if (!name)
	continue;
    /* print */
    ...
}

What do you think?

Thanks,
Stanislav
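
A self-contained variation on the idea above, for illustration only: it uses designated array initializers so that each index stays visible next to its name and unused slots (0 and 9 here) are implicitly NULL. The demo_ names, the output format, and the plain u64 array standing in for a stats page are assumptions of this sketch; the counter names and indices are copied from the quoted hv_hypervisor_counters table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define DEMO_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Index == counter slot in the stats page; gaps are implicitly NULL. */
static const char *const demo_hv_counter_names[] = {
	[1]  = "HvLogicalProcessors",
	[2]  = "HvPartitions",
	[3]  = "HvTotalPages",
	[4]  = "HvVirtualProcessors",
	[5]  = "HvMonitoredNotifications",
	[6]  = "HvModernStandbyEntries",
	[7]  = "HvPlatformIdleTransitions",
	[8]  = "HvHypervisorStartupCost",
	/* 9 is unused */
	[10] = "HvIOSpacePages",
	[11] = "HvNonEssentialPagesForDump",
	[12] = "HvSubsumedPages",
};

/*
 * Print "name: value" for every named counter that fits in the page.
 * Returns the number of counters printed.
 */
size_t demo_print_counters(FILE *out, const uint64_t *page, size_t page_len)
{
	size_t idx, printed = 0;

	assert(out != NULL && page != NULL);

	for (idx = 0; idx < DEMO_ARRAY_SIZE(demo_hv_counter_names) &&
		      idx < page_len; idx++) {
		const char *name = demo_hv_counter_names[idx];

		if (!name)
			continue;	/* unused slot */

		fprintf(out, "%-32s: %llu\n", name,
			(unsigned long long)page[idx]);
		printed++;
	}

	return printed;
}
```

Compared with a positional list, the explicit indices make per-architecture differences easier to audit, at the cost of slightly longer lines.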

> +
> +/* HV_HYPERVISOR_COUNTER */
> +static struct hv_counter_entry hv_hypervisor_counters[] = {
> +	{ "HvLogicalProcessors", 1 },
> +	{ "HvPartitions", 2 },
> +	{ "HvTotalPages", 3 },
> +	{ "HvVirtualProcessors", 4 },
> +	{ "HvMonitoredNotifications", 5 },
> +	{ "HvModernStandbyEntries", 6 },
> +	{ "HvPlatformIdleTransitions", 7 },
> +	{ "HvHypervisorStartupCost", 8 },
> +
> +	{ "HvIOSpacePages", 10 },
> +	{ "HvNonEssentialPagesForDump", 11 },
> +	{ "HvSubsumedPages", 12 },
> +};
> +
> +/* HV_CPU_COUNTER */
> +static struct hv_counter_entry hv_lp_counters[] = {
> +	{ "LpGlobalTime", 1 },
> +	{ "LpTotalRunTime", 2 },
> +	{ "LpHypervisorRunTime", 3 },
> +	{ "LpHardwareInterrupts", 4 },
> +	{ "LpContextSwitches", 5 },
> +	{ "LpInterProcessorInterrupts", 6 },
> +	{ "LpSchedulerInterrupts", 7 },
> +	{ "LpTimerInterrupts", 8 },
> +	{ "LpInterProcessorInterruptsSent", 9 },
> +	{ "LpProcessorHalts", 10 },
> +	{ "LpMonitorTransitionCost", 11 },
> +	{ "LpContextSwitchTime", 12 },
> +	{ "LpC1TransitionsCount", 13 },
> +	{ "LpC1RunTime", 14 },
> +	{ "LpC2TransitionsCount", 15 },
> +	{ "LpC2RunTime", 16 },
> +	{ "LpC3TransitionsCount", 17 },
> +	{ "LpC3RunTime", 18 },
> +	{ "LpRootVpIndex", 19 },
> +	{ "LpIdleSequenceNumber", 20 },
> +	{ "LpGlobalTscCount", 21 },
> +	{ "LpActiveTscCount", 22 },
> +	{ "LpIdleAccumulation", 23 },
> +	{ "LpReferenceCycleCount0", 24 },
> +	{ "LpActualCycleCount0", 25 },
> +	{ "LpReferenceCycleCount1", 26 },
> +	{ "LpActualCycleCount1", 27 },
> +	{ "LpProximityDomainId", 28 },
> +	{ "LpPostedInterruptNotifications", 29 },
> +	{ "LpBranchPredictorFlushes", 30 },
> +#if IS_ENABLED(CONFIG_X86_64)
> +	{ "LpL1DataCacheFlushes", 31 },
> +	{ "LpImmediateL1DataCacheFlushes", 32 },
> +	{ "LpMbFlushes", 33 },
> +	{ "LpCounterRefreshSequenceNumber", 34 },
> +	{ "LpCounterRefreshReferenceTime", 35 },
> +	{ "LpIdleAccumulationSnapshot", 36 },
> +	{ "LpActiveTscCountSnapshot", 37 },
> +	{ "LpHwpRequestContextSwitches", 38 },
> +	{ "LpPlaceholder1", 39 },
> +	{ "LpPlaceholder2", 40 },
> +	{ "LpPlaceholder3", 41 },
> +	{ "LpPlaceholder4", 42 },
> +	{ "LpPlaceholder5", 43 },
> +	{ "LpPlaceholder6", 44 },
> +	{ "LpPlaceholder7", 45 },
> +	{ "LpPlaceholder8", 46 },
> +	{ "LpPlaceholder9", 47 },
> +	{ "LpSchLocalRunListSize", 48 },
> +	{ "LpReserveGroupId", 49 },
> +	{ "LpRunningPriority", 50 },
> +	{ "LpPerfmonInterruptCount", 51 },
> +#elif IS_ENABLED(CONFIG_ARM64)
> +	{ "LpCounterRefreshSequenceNumber", 31 },
> +	{ "LpCounterRefreshReferenceTime", 32 },
> +	{ "LpIdleAccumulationSnapshot", 33 },
> +	{ "LpActiveTscCountSnapshot", 34 },
> +	{ "LpHwpRequestContextSwitches", 35 },
> +	{ "LpPlaceholder2", 36 },
> +	{ "LpPlaceholder3", 37 },
> +	{ "LpPlaceholder4", 38 },
> +	{ "LpPlaceholder5", 39 },
> +	{ "LpPlaceholder6", 40 },
> +	{ "LpPlaceholder7", 41 },
> +	{ "LpPlaceholder8", 42 },
> +	{ "LpPlaceholder9", 43 },
> +	{ "LpSchLocalRunListSize", 44 },
> +	{ "LpReserveGroupId", 45 },
> +	{ "LpRunningPriority", 46 },
> +#endif
> +};
> +
> +/* HV_PROCESS_COUNTER */
> +static struct hv_counter_entry hv_partition_counters[] = {
> +	{ "PtVirtualProcessors", 1 },
> +
> +	{ "PtTlbSize", 3 },
> +	{ "PtAddressSpaces", 4 },
> +	{ "PtDepositedPages", 5 },
> +	{ "PtGpaPages", 6 },
> +	{ "PtGpaSpaceModifications", 7 },
> +	{ "PtVirtualTlbFlushEntires", 8 },
> +	{ "PtRecommendedTlbSize", 9 },
> +	{ "PtGpaPages4K", 10 },
> +	{ "PtGpaPages2M", 11 },
> +	{ "PtGpaPages1G", 12 },
> +	{ "PtGpaPages512G", 13 },
> +	{ "PtDevicePages4K", 14 },
> +	{ "PtDevicePages2M", 15 },
> +	{ "PtDevicePages1G", 16 },
> +	{ "PtDevicePages512G", 17 },
> +	{ "PtAttachedDevices", 18 },
> +	{ "PtDeviceInterruptMappings", 19 },
> +	{ "PtIoTlbFlushes", 20 },
> +	{ "PtIoTlbFlushCost", 21 },
> +	{ "PtDeviceInterruptErrors", 22 },
> +	{ "PtDeviceDmaErrors", 23 },
> +	{ "PtDeviceInterruptThrottleEvents", 24 },
> +	{ "PtSkippedTimerTicks", 25 },
> +	{ "PtPartitionId", 26 },
> +#if IS_ENABLED(CONFIG_X86_64)
> +	{ "PtNestedTlbSize", 27 },
> +	{ "PtRecommendedNestedTlbSize", 28 },
> +	{ "PtNestedTlbFreeListSize", 29 },
> +	{ "PtNestedTlbTrimmedPages", 30 },
> +	{ "PtPagesShattered", 31 },
> +	{ "PtPagesRecombined", 32 },
> +	{ "PtHwpRequestValue", 33 },
> +	{ "PtAutoSuspendEnableTime", 34 },
> +	{ "PtAutoSuspendTriggerTime", 35 },
> +	{ "PtAutoSuspendDisableTime", 36 },
> +	{ "PtPlaceholder1", 37 },
> +	{ "PtPlaceholder2", 38 },
> +	{ "PtPlaceholder3", 39 },
> +	{ "PtPlaceholder4", 40 },
> +	{ "PtPlaceholder5", 41 },
> +	{ "PtPlaceholder6", 42 },
> +	{ "PtPlaceholder7", 43 },
> +	{ "PtPlaceholder8", 44 },
> +	{ "PtHypervisorStateTransferGeneration", 45 },
> +	{ "PtNumberofActiveChildPartitions", 46 },
> +#elif IS_ENABLED(CONFIG_ARM64)
> +	{ "PtHwpRequestValue", 27 },
> +	{ "PtAutoSuspendEnableTime", 28 },
> +	{ "PtAutoSuspendTriggerTime", 29 },
> +	{ "PtAutoSuspendDisableTime", 30 },
> +	{ "PtPlaceholder1", 31 },
> +	{ "PtPlaceholder2", 32 },
> +	{ "PtPlaceholder3", 33 },
> +	{ "PtPlaceholder4", 34 },
> +	{ "PtPlaceholder5", 35 },
> +	{ "PtPlaceholder6", 36 },
> +	{ "PtPlaceholder7", 37 },
> +	{ "PtPlaceholder8", 38 },
> +	{ "PtHypervisorStateTransferGeneration", 39 },
> +	{ "PtNumberofActiveChildPartitions", 40 },
> +#endif
> +};
> +
> +/* HV_THREAD_COUNTER */
> +static struct hv_counter_entry hv_vp_counters[] = {
> +	{ "VpTotalRunTime", 1 },
> +	{ "VpHypervisorRunTime", 2 },
> +	{ "VpRemoteNodeRunTime", 3 },
> +	{ "VpNormalizedRunTime", 4 },
> +	{ "VpIdealCpu", 5 },
> +
> +	{ "VpHypercallsCount", 7 },
> +	{ "VpHypercallsTime", 8 },
> +#if IS_ENABLED(CONFIG_X86_64)
> +	{ "VpPageInvalidationsCount", 9 },
> +	{ "VpPageInvalidationsTime", 10 },
> +	{ "VpControlRegisterAccessesCount", 11 },
> +	{ "VpControlRegisterAccessesTime", 12 },
> +	{ "VpIoInstructionsCount", 13 },
> +	{ "VpIoInstructionsTime", 14 },
> +	{ "VpHltInstructionsCount", 15 },
> +	{ "VpHltInstructionsTime", 16 },
> +	{ "VpMwaitInstructionsCount", 17 },
> +	{ "VpMwaitInstructionsTime", 18 },
> +	{ "VpCpuidInstructionsCount", 19 },
> +	{ "VpCpuidInstructionsTime", 20 },
> +	{ "VpMsrAccessesCount", 21 },
> +	{ "VpMsrAccessesTime", 22 },
> +	{ "VpOtherInterceptsCount", 23 },
> +	{ "VpOtherInterceptsTime", 24 },
> +	{ "VpExternalInterruptsCount", 25 },
> +	{ "VpExternalInterruptsTime", 26 },
> +	{ "VpPendingInterruptsCount", 27 },
> +	{ "VpPendingInterruptsTime", 28 },
> +	{ "VpEmulatedInstructionsCount", 29 },
> +	{ "VpEmulatedInstructionsTime", 30 },
> +	{ "VpDebugRegisterAccessesCount", 31 },
> +	{ "VpDebugRegisterAccessesTime", 32 },
> +	{ "VpPageFaultInterceptsCount", 33 },
> +	{ "VpPageFaultInterceptsTime", 34 },
> +	{ "VpGuestPageTableMaps", 35 },
> +	{ "VpLargePageTlbFills", 36 },
> +	{ "VpSmallPageTlbFills", 37 },
> +	{ "VpReflectedGuestPageFaults", 38 },
> +	{ "VpApicMmioAccesses", 39 },
> +	{ "VpIoInterceptMessages", 40 },
> +	{ "VpMemoryInterceptMessages", 41 },
> +	{ "VpApicEoiAccesses", 42 },
> +	{ "VpOtherMessages", 43 },
> +	{ "VpPageTableAllocations", 44 },
> +	{ "VpLogicalProcessorMigrations", 45 },
> +	{ "VpAddressSpaceEvictions", 46 },
> +	{ "VpAddressSpaceSwitches", 47 },
> +	{ "VpAddressDomainFlushes", 48 },
> +	{ "VpAddressSpaceFlushes", 49 },
> +	{ "VpGlobalGvaRangeFlushes", 50 },
> +	{ "VpLocalGvaRangeFlushes", 51 },
> +	{ "VpPageTableEvictions", 52 },
> +	{ "VpPageTableReclamations", 53 },
> +	{ "VpPageTableResets", 54 },
> +	{ "VpPageTableValidations", 55 },
> +	{ "VpApicTprAccesses", 56 },
> +	{ "VpPageTableWriteIntercepts", 57 },
> +	{ "VpSyntheticInterrupts", 58 },
> +	{ "VpVirtualInterrupts", 59 },
> +	{ "VpApicIpisSent", 60 },
> +	{ "VpApicSelfIpisSent", 61 },
> +	{ "VpGpaSpaceHypercalls", 62 },
> +	{ "VpLogicalProcessorHypercalls", 63 },
> +	{ "VpLongSpinWaitHypercalls", 64 },
> +	{ "VpOtherHypercalls", 65 },
> +	{ "VpSyntheticInterruptHypercalls", 66 },
> +	{ "VpVirtualInterruptHypercalls", 67 },
> +	{ "VpVirtualMmuHypercalls", 68 },
> +	{ "VpVirtualProcessorHypercalls", 69 },
> +	{ "VpHardwareInterrupts", 70 },
> +	{ "VpNestedPageFaultInterceptsCount", 71 },
> +	{ "VpNestedPageFaultInterceptsTime", 72 },
> +	{ "VpPageScans", 73 },
> +	{ "VpLogicalProcessorDispatches", 74 },
> +	{ "VpWaitingForCpuTime", 75 },
> +	{ "VpExtendedHypercalls", 76 },
> +	{ "VpExtendedHypercallInterceptMessages", 77 },
> +	{ "VpMbecNestedPageTableSwitches", 78 },
> +	{ "VpOtherReflectedGuestExceptions", 79 },
> +	{ "VpGlobalIoTlbFlushes", 80 },
> +	{ "VpGlobalIoTlbFlushCost", 81 },
> +	{ "VpLocalIoTlbFlushes", 82 },
> +	{ "VpLocalIoTlbFlushCost", 83 },
> +	{ "VpHypercallsForwardedCount", 84 },
> +	{ "VpHypercallsForwardingTime", 85 },
> +	{ "VpPageInvalidationsForwardedCount", 86 },
> +	{ "VpPageInvalidationsForwardingTime", 87 },
> +	{ "VpControlRegisterAccessesForwardedCount", 88 },
> +	{ "VpControlRegisterAccessesForwardingTime", 89 },
> +	{ "VpIoInstructionsForwardedCount", 90 },
> +	{ "VpIoInstructionsForwardingTime", 91 },
> +	{ "VpHltInstructionsForwardedCount", 92 },
> +	{ "VpHltInstructionsForwardingTime", 93 },
> +	{ "VpMwaitInstructionsForwardedCount", 94 },
> +	{ "VpMwaitInstructionsForwardingTime", 95 },
> +	{ "VpCpuidInstructionsForwardedCount", 96 },
> +	{ "VpCpuidInstructionsForwardingTime", 97 },
> +	{ "VpMsrAccessesForwardedCount", 98 },
> +	{ "VpMsrAccessesForwardingTime", 99 },
> +	{ "VpOtherInterceptsForwardedCount", 100 },
> +	{ "VpOtherInterceptsForwardingTime", 101 },
> +	{ "VpExternalInterruptsForwardedCount", 102 },
> +	{ "VpExternalInterruptsForwardingTime", 103 },
> +	{ "VpPendingInterruptsForwardedCount", 104 },
> +	{ "VpPendingInterruptsForwardingTime", 105 },
> +	{ "VpEmulatedInstructionsForwardedCount", 106 },
> +	{ "VpEmulatedInstructionsForwardingTime", 107 },
> +	{ "VpDebugRegisterAccessesForwardedCount", 108 },
> +	{ "VpDebugRegisterAccessesForwardingTime", 109 },
> +	{ "VpPageFaultInterceptsForwardedCount", 110 },
> +	{ "VpPageFaultInterceptsForwardingTime", 111 },
> +	{ "VpVmclearEmulationCount", 112 },
> +	{ "VpVmclearEmulationTime", 113 },
> +	{ "VpVmptrldEmulationCount", 114 },
> +	{ "VpVmptrldEmulationTime", 115 },
> +	{ "VpVmptrstEmulationCount", 116 },
> +	{ "VpVmptrstEmulationTime", 117 },
> +	{ "VpVmreadEmulationCount", 118 },
> +	{ "VpVmreadEmulationTime", 119 },
> +	{ "VpVmwriteEmulationCount", 120 },
> +	{ "VpVmwriteEmulationTime", 121 },
> +	{ "VpVmxoffEmulationCount", 122 },
> +	{ "VpVmxoffEmulationTime", 123 },
> +	{ "VpVmxonEmulationCount", 124 },
> +	{ "VpVmxonEmulationTime", 125 },
> +	{ "VpNestedVMEntriesCount", 126 },
> +	{ "VpNestedVMEntriesTime", 127 },
> +	{ "VpNestedSLATSoftPageFaultsCount", 128 },
> +	{ "VpNestedSLATSoftPageFaultsTime", 129 },
> +	{ "VpNestedSLATHardPageFaultsCount", 130 },
> +	{ "VpNestedSLATHardPageFaultsTime", 131 },
> +	{ "VpInvEptAllContextEmulationCount", 132 },
> +	{ "VpInvEptAllContextEmulationTime", 133 },
> +	{ "VpInvEptSingleContextEmulationCount", 134 },
> +	{ "VpInvEptSingleContextEmulationTime", 135 },
> +	{ "VpInvVpidAllContextEmulationCount", 136 },
> +	{ "VpInvVpidAllContextEmulationTime", 137 },
> +	{ "VpInvVpidSingleContextEmulationCount", 138 },
> +	{ "VpInvVpidSingleContextEmulationTime", 139 },
> +	{ "VpInvVpidSingleAddressEmulationCount", 140 },
> +	{ "VpInvVpidSingleAddressEmulationTime", 141 },
> +	{ "VpNestedTlbPageTableReclamations", 142 },
> +	{ "VpNestedTlbPageTableEvictions", 143 },
> +	{ "VpFlushGuestPhysicalAddressSpaceHypercalls", 144 },
> +	{ "VpFlushGuestPhysicalAddressListHypercalls", 145 },
> +	{ "VpPostedInterruptNotifications", 146 },
> +	{ "VpPostedInterruptScans", 147 },
> +	{ "VpTotalCoreRunTime", 148 },
> +	{ "VpMaximumRunTime", 149 },
> +	{ "VpHwpRequestContextSwitches", 150 },
> +	{ "VpWaitingForCpuTimeBucket0", 151 },
> +	{ "VpWaitingForCpuTimeBucket1", 152 },
> +	{ "VpWaitingForCpuTimeBucket2", 153 },
> +	{ "VpWaitingForCpuTimeBucket3", 154 },
> +	{ "VpWaitingForCpuTimeBucket4", 155 },
> +	{ "VpWaitingForCpuTimeBucket5", 156 },
> +	{ "VpWaitingForCpuTimeBucket6", 157 },
> +	{ "VpVmloadEmulationCount", 158 },
> +	{ "VpVmloadEmulationTime", 159 },
> +	{ "VpVmsaveEmulationCount", 160 },
> +	{ "VpVmsaveEmulationTime", 161 },
> +	{ "VpGifInstructionEmulationCount", 162 },
> +	{ "VpGifInstructionEmulationTime", 163 },
> +	{ "VpEmulatedErrataSvmInstructions", 164 },
> +	{ "VpPlaceholder1", 165 },
> +	{ "VpPlaceholder2", 166 },
> +	{ "VpPlaceholder3", 167 },
> +	{ "VpPlaceholder4", 168 },
> +	{ "VpPlaceholder5", 169 },
> +	{ "VpPlaceholder6", 170 },
> +	{ "VpPlaceholder7", 171 },
> +	{ "VpPlaceholder8", 172 },
> +	{ "VpContentionTime", 173 },
> +	{ "VpWakeUpTime", 174 },
> +	{ "VpSchedulingPriority", 175 },
> +	{ "VpRdpmcInstructionsCount", 176 },
> +	{ "VpRdpmcInstructionsTime", 177 },
> +	{ "VpPerfmonPmuMsrAccessesCount", 178 },
> +	{ "VpPerfmonLbrMsrAccessesCount", 179 },
> +	{ "VpPerfmonIptMsrAccessesCount", 180 },
> +	{ "VpPerfmonInterruptCount", 181 },
> +	{ "VpVtl1DispatchCount", 182 },
> +	{ "VpVtl2DispatchCount", 183 },
> +	{ "VpVtl2DispatchBucket0", 184 },
> +	{ "VpVtl2DispatchBucket1", 185 },
> +	{ "VpVtl2DispatchBucket2", 186 },
> +	{ "VpVtl2DispatchBucket3", 187 },
> +	{ "VpVtl2DispatchBucket4", 188 },
> +	{ "VpVtl2DispatchBucket5", 189 },
> +	{ "VpVtl2DispatchBucket6", 190 },
> +	{ "VpVtl1RunTime", 191 },
> +	{ "VpVtl2RunTime", 192 },
> +	{ "VpIommuHypercalls", 193 },
> +	{ "VpCpuGroupHypercalls", 194 },
> +	{ "VpVsmHypercalls", 195 },
> +	{ "VpEventLogHypercalls", 196 },
> +	{ "VpDeviceDomainHypercalls", 197 },
> +	{ "VpDepositHypercalls", 198 },
> +	{ "VpSvmHypercalls", 199 },
> +	{ "VpBusLockAcquisitionCount", 200 },
> +	{ "VpLoadAvg", 201 },
> +	{ "VpRootDispatchThreadBlocked", 202 },
> +	{ "VpIdleCpuTime", 203 },
> +	{ "VpWaitingForCpuTimeBucket7", 204 },
> +	{ "VpWaitingForCpuTimeBucket8", 205 },
> +	{ "VpWaitingForCpuTimeBucket9", 206 },
> +	{ "VpWaitingForCpuTimeBucket10", 207 },
> +	{ "VpWaitingForCpuTimeBucket11", 208 },
> +	{ "VpWaitingForCpuTimeBucket12", 209 },
> +	{ "VpHierarchicalSuspendTime", 210 },
> +	{ "VpExpressSchedulingAttempts", 211 },
> +	{ "VpExpressSchedulingCount", 212 },
> +	{ "VpBusLockAcquisitionTime", 213 },
> +#elif IS_ENABLED(CONFIG_ARM64)
> +	{ "VpSysRegAccessesCount", 9 },
> +	{ "VpSysRegAccessesTime", 10 },
> +	{ "VpSmcInstructionsCount", 11 },
> +	{ "VpSmcInstructionsTime", 12 },
> +	{ "VpOtherInterceptsCount", 13 },
> +	{ "VpOtherInterceptsTime", 14 },
> +	{ "VpExternalInterruptsCount", 15 },
> +	{ "VpExternalInterruptsTime", 16 },
> +	{ "VpPendingInterruptsCount", 17 },
> +	{ "VpPendingInterruptsTime", 18 },
> +	{ "VpGuestPageTableMaps", 19 },
> +	{ "VpLargePageTlbFills", 20 },
> +	{ "VpSmallPageTlbFills", 21 },
> +	{ "VpReflectedGuestPageFaults", 22 },
> +	{ "VpMemoryInterceptMessages", 23 },
> +	{ "VpOtherMessages", 24 },
> +	{ "VpLogicalProcessorMigrations", 25 },
> +	{ "VpAddressDomainFlushes", 26 },
> +	{ "VpAddressSpaceFlushes", 27 },
> +	{ "VpSyntheticInterrupts", 28 },
> +	{ "VpVirtualInterrupts", 29 },
> +	{ "VpApicSelfIpisSent", 30 },
> +	{ "VpGpaSpaceHypercalls", 31 },
> +	{ "VpLogicalProcessorHypercalls", 32 },
> +	{ "VpLongSpinWaitHypercalls", 33 },
> +	{ "VpOtherHypercalls", 34 },
> +	{ "VpSyntheticInterruptHypercalls", 35 },
> +	{ "VpVirtualInterruptHypercalls", 36 },
> +	{ "VpVirtualMmuHypercalls", 37 },
> +	{ "VpVirtualProcessorHypercalls", 38 },
> +	{ "VpHardwareInterrupts", 39 },
> +	{ "VpNestedPageFaultInterceptsCount", 40 },
> +	{ "VpNestedPageFaultInterceptsTime", 41 },
> +	{ "VpLogicalProcessorDispatches", 42 },
> +	{ "VpWaitingForCpuTime", 43 },
> +	{ "VpExtendedHypercalls", 44 },
> +	{ "VpExtendedHypercallInterceptMessages", 45 },
> +	{ "VpMbecNestedPageTableSwitches", 46 },
> +	{ "VpOtherReflectedGuestExceptions", 47 },
> +	{ "VpGlobalIoTlbFlushes", 48 },
> +	{ "VpGlobalIoTlbFlushCost", 49 },
> +	{ "VpLocalIoTlbFlushes", 50 },
> +	{ "VpLocalIoTlbFlushCost", 51 },
> +	{ "VpFlushGuestPhysicalAddressSpaceHypercalls", 52 },
> +	{ "VpFlushGuestPhysicalAddressListHypercalls", 53 },
> +	{ "VpPostedInterruptNotifications", 54 },
> +	{ "VpPostedInterruptScans", 55 },
> +	{ "VpTotalCoreRunTime", 56 },
> +	{ "VpMaximumRunTime", 57 },
> +	{ "VpWaitingForCpuTimeBucket0", 58 },
> +	{ "VpWaitingForCpuTimeBucket1", 59 },
> +	{ "VpWaitingForCpuTimeBucket2", 60 },
> +	{ "VpWaitingForCpuTimeBucket3", 61 },
> +	{ "VpWaitingForCpuTimeBucket4", 62 },
> +	{ "VpWaitingForCpuTimeBucket5", 63 },
> +	{ "VpWaitingForCpuTimeBucket6", 64 },
> +	{ "VpHwpRequestContextSwitches", 65 },
> +	{ "VpPlaceholder2", 66 },
> +	{ "VpPlaceholder3", 67 },
> +	{ "VpPlaceholder4", 68 },
> +	{ "VpPlaceholder5", 69 },
> +	{ "VpPlaceholder6", 70 },
> +	{ "VpPlaceholder7", 71 },
> +	{ "VpPlaceholder8", 72 },
> +	{ "VpContentionTime", 73 },
> +	{ "VpWakeUpTime", 74 },
> +	{ "VpSchedulingPriority", 75 },
> +	{ "VpVtl1DispatchCount", 76 },
> +	{ "VpVtl2DispatchCount", 77 },
> +	{ "VpVtl2DispatchBucket0", 78 },
> +	{ "VpVtl2DispatchBucket1", 79 },
> +	{ "VpVtl2DispatchBucket2", 80 },
> +	{ "VpVtl2DispatchBucket3", 81 },
> +	{ "VpVtl2DispatchBucket4", 82 },
> +	{ "VpVtl2DispatchBucket5", 83 },
> +	{ "VpVtl2DispatchBucket6", 84 },
> +	{ "VpVtl1RunTime", 85 },
> +	{ "VpVtl2RunTime", 86 },
> +	{ "VpIommuHypercalls", 87 },
> +	{ "VpCpuGroupHypercalls", 88 },
> +	{ "VpVsmHypercalls", 89 },
> +	{ "VpEventLogHypercalls", 90 },
> +	{ "VpDeviceDomainHypercalls", 91 },
> +	{ "VpDepositHypercalls", 92 },
> +	{ "VpSvmHypercalls", 93 },
> +	{ "VpLoadAvg", 94 },
> +	{ "VpRootDispatchThreadBlocked", 95 },
> +	{ "VpIdleCpuTime", 96 },
> +	{ "VpWaitingForCpuTimeBucket7", 97 },
> +	{ "VpWaitingForCpuTimeBucket8", 98 },
> +	{ "VpWaitingForCpuTimeBucket9", 99 },
> +	{ "VpWaitingForCpuTimeBucket10", 100 },
> +	{ "VpWaitingForCpuTimeBucket11", 101 },
> +	{ "VpWaitingForCpuTimeBucket12", 102 },
> +	{ "VpHierarchicalSuspendTime", 103 },
> +	{ "VpExpressSchedulingAttempts", 104 },
> +	{ "VpExpressSchedulingCount", 105 },
> +#endif
> +};
> +
> -- 
> 2.34.1


* [PATCH net-next v16 03/12] vsock: add netns support to virtio transports
From: Bobby Eshleman @ 2026-01-21 22:11 UTC (permalink / raw)
  To: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Stefan Hajnoczi, Michael S. Tsirkin,
	Jason Wang, Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	Broadcom internal kernel review list, Shuah Khan, Long Li,
	Jonathan Corbet
  Cc: linux-kernel, virtualization, netdev, kvm, linux-hyperv,
	linux-kselftest, berrange, Sargun Dhillon, linux-doc,
	Bobby Eshleman, Bobby Eshleman
In-Reply-To: <20260121-vsock-vmtest-v16-0-2859a7512097@meta.com>

From: Bobby Eshleman <bobbyeshleman@meta.com>

Add netns support to loopback and vhost. Keep netns disabled for
virtio-vsock, but add necessary changes to comply with common API
updates.

This is the patch in the series where vhost-vsock namespaces actually
come online.

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v15:
- add vsock_net_mode_global() (Stefano)

Changes in v14:
- fixed merge conflicts in drivers/vhost/vsock.c

Changes in v13:
- do not store or pass the mode around now that net->vsock.mode is
  immutable
- move virtio_transport_stream_allow() into virtio_transport.c
  because virtio is the only caller now

Changes in v12:
- change seqpacket_allow() and stream_allow() to return true for
  loopback and vhost (Stefano)

Changes in v11:
- reorder with the skb ownership patch for loopback (Stefano)
- toggle vhost_transport_supports_local_mode() to true

Changes in v10:
- Splitting patches complicates the series with meaningless placeholder
  values that eventually get replaced anyway. To avoid that, this patch
  combines them into one. Links to previous patches here:
  - Link: https://lore.kernel.org/all/20251111-vsock-vmtest-v9-3-852787a37bed@meta.com/
  - Link: https://lore.kernel.org/all/20251111-vsock-vmtest-v9-6-852787a37bed@meta.com/
  - Link: https://lore.kernel.org/all/20251111-vsock-vmtest-v9-7-852787a37bed@meta.com/
- remove placeholder values (Stefano)
- update comment describing net/net_mode for
  virtio_transport_reset_no_sock()
---
 drivers/vhost/vsock.c                   | 38 ++++++++++++++++-------
 include/linux/virtio_vsock.h            |  5 +--
 net/vmw_vsock/virtio_transport.c        | 13 ++++++--
 net/vmw_vsock/virtio_transport_common.c | 54 +++++++++++++++++++--------------
 net/vmw_vsock/vsock_loopback.c          | 14 +++++++--
 5 files changed, 84 insertions(+), 40 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 647ded6f6ea5..488d7fa6e4ec 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -48,6 +48,8 @@ static DEFINE_READ_MOSTLY_HASHTABLE(vhost_vsock_hash, 8);
 struct vhost_vsock {
 	struct vhost_dev dev;
 	struct vhost_virtqueue vqs[2];
+	struct net *net;
+	netns_tracker ns_tracker;
 
 	/* Link to global vhost_vsock_hash, writes use vhost_vsock_mutex */
 	struct hlist_node hash;
@@ -69,7 +71,7 @@ static u32 vhost_transport_get_local_cid(void)
 /* Callers must be in an RCU read section or hold the vhost_vsock_mutex.
  * The return value can only be dereferenced while within the section.
  */
-static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
+static struct vhost_vsock *vhost_vsock_get(u32 guest_cid, struct net *net)
 {
 	struct vhost_vsock *vsock;
 
@@ -81,9 +83,9 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
 		if (other_cid == 0)
 			continue;
 
-		if (other_cid == guest_cid)
+		if (other_cid == guest_cid &&
+		    vsock_net_check_mode(net, vsock->net))
 			return vsock;
-
 	}
 
 	return NULL;
@@ -272,7 +274,7 @@ static void vhost_transport_send_pkt_work(struct vhost_work *work)
 }
 
 static int
-vhost_transport_send_pkt(struct sk_buff *skb)
+vhost_transport_send_pkt(struct sk_buff *skb, struct net *net)
 {
 	struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);
 	struct vhost_vsock *vsock;
@@ -281,7 +283,7 @@ vhost_transport_send_pkt(struct sk_buff *skb)
 	rcu_read_lock();
 
 	/* Find the vhost_vsock according to guest context id  */
-	vsock = vhost_vsock_get(le64_to_cpu(hdr->dst_cid));
+	vsock = vhost_vsock_get(le64_to_cpu(hdr->dst_cid), net);
 	if (!vsock) {
 		rcu_read_unlock();
 		kfree_skb(skb);
@@ -308,7 +310,8 @@ vhost_transport_cancel_pkt(struct vsock_sock *vsk)
 	rcu_read_lock();
 
 	/* Find the vhost_vsock according to guest context id  */
-	vsock = vhost_vsock_get(vsk->remote_addr.svm_cid);
+	vsock = vhost_vsock_get(vsk->remote_addr.svm_cid,
+				sock_net(sk_vsock(vsk)));
 	if (!vsock)
 		goto out;
 
@@ -410,6 +413,12 @@ static bool vhost_transport_msgzerocopy_allow(void)
 static bool vhost_transport_seqpacket_allow(struct vsock_sock *vsk,
 					    u32 remote_cid);
 
+static bool
+vhost_transport_stream_allow(struct vsock_sock *vsk, u32 cid, u32 port)
+{
+	return true;
+}
+
 static struct virtio_transport vhost_transport = {
 	.transport = {
 		.module                   = THIS_MODULE,
@@ -434,7 +443,7 @@ static struct virtio_transport vhost_transport = {
 		.stream_has_space         = virtio_transport_stream_has_space,
 		.stream_rcvhiwat          = virtio_transport_stream_rcvhiwat,
 		.stream_is_active         = virtio_transport_stream_is_active,
-		.stream_allow             = virtio_transport_stream_allow,
+		.stream_allow             = vhost_transport_stream_allow,
 
 		.seqpacket_dequeue        = virtio_transport_seqpacket_dequeue,
 		.seqpacket_enqueue        = virtio_transport_seqpacket_enqueue,
@@ -467,11 +476,12 @@ static struct virtio_transport vhost_transport = {
 static bool vhost_transport_seqpacket_allow(struct vsock_sock *vsk,
 					    u32 remote_cid)
 {
+	struct net *net = sock_net(sk_vsock(vsk));
 	struct vhost_vsock *vsock;
 	bool seqpacket_allow = false;
 
 	rcu_read_lock();
-	vsock = vhost_vsock_get(remote_cid);
+	vsock = vhost_vsock_get(remote_cid, net);
 
 	if (vsock)
 		seqpacket_allow = vsock->seqpacket_allow;
@@ -542,7 +552,8 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work *work)
 		if (le64_to_cpu(hdr->src_cid) == vsock->guest_cid &&
 		    le64_to_cpu(hdr->dst_cid) ==
 		    vhost_transport_get_local_cid())
-			virtio_transport_recv_pkt(&vhost_transport, skb);
+			virtio_transport_recv_pkt(&vhost_transport, skb,
+						  vsock->net);
 		else
 			kfree_skb(skb);
 
@@ -659,6 +670,7 @@ static int vhost_vsock_dev_open(struct inode *inode, struct file *file)
 {
 	struct vhost_virtqueue **vqs;
 	struct vhost_vsock *vsock;
+	struct net *net;
 	int ret;
 
 	/* This struct is large and allocation could fail, fall back to vmalloc
@@ -674,6 +686,9 @@ static int vhost_vsock_dev_open(struct inode *inode, struct file *file)
 		goto out;
 	}
 
+	net = current->nsproxy->net_ns;
+	vsock->net = get_net_track(net, &vsock->ns_tracker, GFP_KERNEL);
+
 	vsock->guest_cid = 0; /* no CID assigned yet */
 	vsock->seqpacket_allow = false;
 
@@ -715,7 +730,7 @@ static void vhost_vsock_reset_orphans(struct sock *sk)
 	rcu_read_lock();
 
 	/* If the peer is still valid, no need to reset connection */
-	if (vhost_vsock_get(vsk->remote_addr.svm_cid)) {
+	if (vhost_vsock_get(vsk->remote_addr.svm_cid, sock_net(sk))) {
 		rcu_read_unlock();
 		return;
 	}
@@ -764,6 +779,7 @@ static int vhost_vsock_dev_release(struct inode *inode, struct file *file)
 	virtio_vsock_skb_queue_purge(&vsock->send_pkt_queue);
 
 	vhost_dev_cleanup(&vsock->dev);
+	put_net_track(vsock->net, &vsock->ns_tracker);
 	kfree(vsock->dev.vqs);
 	vhost_vsock_free(vsock);
 	return 0;
@@ -790,7 +806,7 @@ static int vhost_vsock_set_cid(struct vhost_vsock *vsock, u64 guest_cid)
 
 	/* Refuse if CID is already in use */
 	mutex_lock(&vhost_vsock_mutex);
-	other = vhost_vsock_get(guest_cid);
+	other = vhost_vsock_get(guest_cid, vsock->net);
 	if (other && other != vsock) {
 		mutex_unlock(&vhost_vsock_mutex);
 		return -EADDRINUSE;
diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index 1845e8d4f78d..f91704731057 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -173,6 +173,7 @@ struct virtio_vsock_pkt_info {
 	u32 remote_cid, remote_port;
 	struct vsock_sock *vsk;
 	struct msghdr *msg;
+	struct net *net;
 	u32 pkt_len;
 	u16 type;
 	u16 op;
@@ -185,7 +186,7 @@ struct virtio_transport {
 	struct vsock_transport transport;
 
 	/* Takes ownership of the packet */
-	int (*send_pkt)(struct sk_buff *skb);
+	int (*send_pkt)(struct sk_buff *skb, struct net *net);
 
 	/* Used in MSG_ZEROCOPY mode. Checks, that provided data
 	 * (number of buffers) could be transmitted with zerocopy
@@ -280,7 +281,7 @@ virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
 void virtio_transport_destruct(struct vsock_sock *vsk);
 
 void virtio_transport_recv_pkt(struct virtio_transport *t,
-			       struct sk_buff *skb);
+			       struct sk_buff *skb, struct net *net);
 void virtio_transport_inc_tx_pkt(struct virtio_vsock_sock *vvs, struct sk_buff *skb);
 u32 virtio_transport_get_credit(struct virtio_vsock_sock *vvs, u32 wanted);
 void virtio_transport_put_credit(struct virtio_vsock_sock *vvs, u32 credit);
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index f0a9e51118f3..3f7ea2db9bd7 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -231,7 +231,7 @@ static int virtio_transport_send_skb_fast_path(struct virtio_vsock *vsock, struc
 }
 
 static int
-virtio_transport_send_pkt(struct sk_buff *skb)
+virtio_transport_send_pkt(struct sk_buff *skb, struct net *net)
 {
 	struct virtio_vsock_hdr *hdr;
 	struct virtio_vsock *vsock;
@@ -536,6 +536,11 @@ static bool virtio_transport_msgzerocopy_allow(void)
 	return true;
 }
 
+bool virtio_transport_stream_allow(struct vsock_sock *vsk, u32 cid, u32 port)
+{
+	return vsock_net_mode_global(vsk);
+}
+
 static bool virtio_transport_seqpacket_allow(struct vsock_sock *vsk,
 					     u32 remote_cid);
 
@@ -665,7 +670,11 @@ static void virtio_transport_rx_work(struct work_struct *work)
 				virtio_vsock_skb_put(skb, payload_len);
 
 			virtio_transport_deliver_tap_pkt(skb);
-			virtio_transport_recv_pkt(&virtio_transport, skb);
+
+			/* Force virtio-transport into global mode since it
+			 * does not yet support local-mode namespacing.
+			 */
+			virtio_transport_recv_pkt(&virtio_transport, skb, NULL);
 		}
 	} while (!virtqueue_enable_cb(vq));
 
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 718be9f33274..c126aa235091 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -413,7 +413,7 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
 
 		virtio_transport_inc_tx_pkt(vvs, skb);
 
-		ret = t_ops->send_pkt(skb);
+		ret = t_ops->send_pkt(skb, info->net);
 		if (ret < 0)
 			break;
 
@@ -527,6 +527,7 @@ static int virtio_transport_send_credit_update(struct vsock_sock *vsk)
 	struct virtio_vsock_pkt_info info = {
 		.op = VIRTIO_VSOCK_OP_CREDIT_UPDATE,
 		.vsk = vsk,
+		.net = sock_net(sk_vsock(vsk)),
 	};
 
 	return virtio_transport_send_pkt_info(vsk, &info);
@@ -1043,12 +1044,6 @@ bool virtio_transport_stream_is_active(struct vsock_sock *vsk)
 }
 EXPORT_SYMBOL_GPL(virtio_transport_stream_is_active);
 
-bool virtio_transport_stream_allow(struct vsock_sock *vsk, u32 cid, u32 port)
-{
-	return vsock_net_mode(sock_net(sk_vsock(vsk))) == VSOCK_NET_MODE_GLOBAL;
-}
-EXPORT_SYMBOL_GPL(virtio_transport_stream_allow);
-
 int virtio_transport_dgram_bind(struct vsock_sock *vsk,
 				struct sockaddr_vm *addr)
 {
@@ -1067,6 +1062,7 @@ int virtio_transport_connect(struct vsock_sock *vsk)
 	struct virtio_vsock_pkt_info info = {
 		.op = VIRTIO_VSOCK_OP_REQUEST,
 		.vsk = vsk,
+		.net = sock_net(sk_vsock(vsk)),
 	};
 
 	return virtio_transport_send_pkt_info(vsk, &info);
@@ -1082,6 +1078,7 @@ int virtio_transport_shutdown(struct vsock_sock *vsk, int mode)
 			 (mode & SEND_SHUTDOWN ?
 			  VIRTIO_VSOCK_SHUTDOWN_SEND : 0),
 		.vsk = vsk,
+		.net = sock_net(sk_vsock(vsk)),
 	};
 
 	return virtio_transport_send_pkt_info(vsk, &info);
@@ -1108,6 +1105,7 @@ virtio_transport_stream_enqueue(struct vsock_sock *vsk,
 		.msg = msg,
 		.pkt_len = len,
 		.vsk = vsk,
+		.net = sock_net(sk_vsock(vsk)),
 	};
 
 	return virtio_transport_send_pkt_info(vsk, &info);
@@ -1145,6 +1143,7 @@ static int virtio_transport_reset(struct vsock_sock *vsk,
 		.op = VIRTIO_VSOCK_OP_RST,
 		.reply = !!skb,
 		.vsk = vsk,
+		.net = sock_net(sk_vsock(vsk)),
 	};
 
 	/* Send RST only if the original pkt is not a RST pkt */
@@ -1156,9 +1155,13 @@ static int virtio_transport_reset(struct vsock_sock *vsk,
 
 /* Normally packets are associated with a socket.  There may be no socket if an
  * attempt was made to connect to a socket that does not exist.
+ *
+ * net refers to the namespace of whoever sent the invalid message. For
+ * loopback, this is the namespace of the socket. For vhost, this is the
+ * namespace of the VM (i.e., vhost_vsock).
  */
 static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
-					  struct sk_buff *skb)
+					  struct sk_buff *skb, struct net *net)
 {
 	struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);
 	struct virtio_vsock_pkt_info info = {
@@ -1171,6 +1174,12 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
 		 * sock_net(sk) until the reply skb is freed.
 		 */
 		.vsk = vsock_sk(skb->sk),
+
+		/* net is not defined here because we pass it directly to
+		 * t->send_pkt(), instead of relying on
+		 * virtio_transport_send_pkt_info() to pass it. It is not needed
+		 * by virtio_transport_alloc_skb().
+		 */
 	};
 	struct sk_buff *reply;
 
@@ -1189,7 +1198,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
 	if (!reply)
 		return -ENOMEM;
 
-	return t->send_pkt(reply);
+	return t->send_pkt(reply, net);
 }
 
 /* This function should be called with sk_lock held and SOCK_DONE set */
@@ -1471,6 +1480,7 @@ virtio_transport_send_response(struct vsock_sock *vsk,
 		.remote_port = le32_to_cpu(hdr->src_port),
 		.reply = true,
 		.vsk = vsk,
+		.net = sock_net(sk_vsock(vsk)),
 	};
 
 	return virtio_transport_send_pkt_info(vsk, &info);
@@ -1513,12 +1523,12 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
 	int ret;
 
 	if (le16_to_cpu(hdr->op) != VIRTIO_VSOCK_OP_REQUEST) {
-		virtio_transport_reset_no_sock(t, skb);
+		virtio_transport_reset_no_sock(t, skb, sock_net(sk));
 		return -EINVAL;
 	}
 
 	if (sk_acceptq_is_full(sk)) {
-		virtio_transport_reset_no_sock(t, skb);
+		virtio_transport_reset_no_sock(t, skb, sock_net(sk));
 		return -ENOMEM;
 	}
 
@@ -1526,13 +1536,13 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
 	 * Subsequent enqueues would lead to a memory leak.
 	 */
 	if (sk->sk_shutdown == SHUTDOWN_MASK) {
-		virtio_transport_reset_no_sock(t, skb);
+		virtio_transport_reset_no_sock(t, skb, sock_net(sk));
 		return -ESHUTDOWN;
 	}
 
 	child = vsock_create_connected(sk);
 	if (!child) {
-		virtio_transport_reset_no_sock(t, skb);
+		virtio_transport_reset_no_sock(t, skb, sock_net(sk));
 		return -ENOMEM;
 	}
 
@@ -1554,7 +1564,7 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
 	 */
 	if (ret || vchild->transport != &t->transport) {
 		release_sock(child);
-		virtio_transport_reset_no_sock(t, skb);
+		virtio_transport_reset_no_sock(t, skb, sock_net(sk));
 		sock_put(child);
 		return ret;
 	}
@@ -1582,7 +1592,7 @@ static bool virtio_transport_valid_type(u16 type)
  * lock.
  */
 void virtio_transport_recv_pkt(struct virtio_transport *t,
-			       struct sk_buff *skb)
+			       struct sk_buff *skb, struct net *net)
 {
 	struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);
 	struct sockaddr_vm src, dst;
@@ -1605,24 +1615,24 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
 					le32_to_cpu(hdr->fwd_cnt));
 
 	if (!virtio_transport_valid_type(le16_to_cpu(hdr->type))) {
-		(void)virtio_transport_reset_no_sock(t, skb);
+		(void)virtio_transport_reset_no_sock(t, skb, net);
 		goto free_pkt;
 	}
 
 	/* The socket must be in connected or bound table
 	 * otherwise send reset back
 	 */
-	sk = vsock_find_connected_socket(&src, &dst);
+	sk = vsock_find_connected_socket_net(&src, &dst, net);
 	if (!sk) {
-		sk = vsock_find_bound_socket(&dst);
+		sk = vsock_find_bound_socket_net(&dst, net);
 		if (!sk) {
-			(void)virtio_transport_reset_no_sock(t, skb);
+			(void)virtio_transport_reset_no_sock(t, skb, net);
 			goto free_pkt;
 		}
 	}
 
 	if (virtio_transport_get_type(sk) != le16_to_cpu(hdr->type)) {
-		(void)virtio_transport_reset_no_sock(t, skb);
+		(void)virtio_transport_reset_no_sock(t, skb, net);
 		sock_put(sk);
 		goto free_pkt;
 	}
@@ -1641,7 +1651,7 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
 	 */
 	if (sock_flag(sk, SOCK_DONE) ||
 	    (sk->sk_state != TCP_LISTEN && vsk->transport != &t->transport)) {
-		(void)virtio_transport_reset_no_sock(t, skb);
+		(void)virtio_transport_reset_no_sock(t, skb, net);
 		release_sock(sk);
 		sock_put(sk);
 		goto free_pkt;
@@ -1673,7 +1683,7 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
 		kfree_skb(skb);
 		break;
 	default:
-		(void)virtio_transport_reset_no_sock(t, skb);
+		(void)virtio_transport_reset_no_sock(t, skb, net);
 		kfree_skb(skb);
 		break;
 	}
diff --git a/net/vmw_vsock/vsock_loopback.c b/net/vmw_vsock/vsock_loopback.c
index deff68c64a09..8068d1b6e851 100644
--- a/net/vmw_vsock/vsock_loopback.c
+++ b/net/vmw_vsock/vsock_loopback.c
@@ -26,7 +26,7 @@ static u32 vsock_loopback_get_local_cid(void)
 	return VMADDR_CID_LOCAL;
 }
 
-static int vsock_loopback_send_pkt(struct sk_buff *skb)
+static int vsock_loopback_send_pkt(struct sk_buff *skb, struct net *net)
 {
 	struct vsock_loopback *vsock = &the_vsock_loopback;
 	int len = skb->len;
@@ -48,6 +48,13 @@ static int vsock_loopback_cancel_pkt(struct vsock_sock *vsk)
 
 static bool vsock_loopback_seqpacket_allow(struct vsock_sock *vsk,
 					   u32 remote_cid);
+
+static bool vsock_loopback_stream_allow(struct vsock_sock *vsk, u32 cid,
+					u32 port)
+{
+	return true;
+}
+
 static bool vsock_loopback_msgzerocopy_allow(void)
 {
 	return true;
@@ -77,7 +84,7 @@ static struct virtio_transport loopback_transport = {
 		.stream_has_space         = virtio_transport_stream_has_space,
 		.stream_rcvhiwat          = virtio_transport_stream_rcvhiwat,
 		.stream_is_active         = virtio_transport_stream_is_active,
-		.stream_allow             = virtio_transport_stream_allow,
+		.stream_allow             = vsock_loopback_stream_allow,
 
 		.seqpacket_dequeue        = virtio_transport_seqpacket_dequeue,
 		.seqpacket_enqueue        = virtio_transport_seqpacket_enqueue,
@@ -132,7 +139,8 @@ static void vsock_loopback_work(struct work_struct *work)
 		 */
 		virtio_transport_consume_skb_sent(skb, false);
 		virtio_transport_deliver_tap_pkt(skb);
-		virtio_transport_recv_pkt(&loopback_transport, skb);
+		virtio_transport_recv_pkt(&loopback_transport, skb,
+					  sock_net(skb->sk));
 	}
 }
 

-- 
2.47.3



* [PATCH net-next v16 00/12] vsock: add namespace support to vhost-vsock and loopback
From: Bobby Eshleman @ 2026-01-21 22:11 UTC (permalink / raw)
  To: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Stefan Hajnoczi, Michael S. Tsirkin,
	Jason Wang, Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	Broadcom internal kernel review list, Shuah Khan, Long Li,
	Jonathan Corbet
  Cc: linux-kernel, virtualization, netdev, kvm, linux-hyperv,
	linux-kselftest, berrange, Sargun Dhillon, linux-doc,
	Bobby Eshleman, Bobby Eshleman

This series adds namespace support to vhost-vsock and loopback. It does
not add namespaces to any of the other guest transports (virtio-vsock,
hyperv, or vmci).

The current revision supports two modes: local and global. Local
mode is complete isolation of namespaces, while global mode is complete
sharing between namespaces of CIDs (the original behavior).

The mode is set using the parent namespace's
/proc/sys/net/vsock/child_ns_mode and inherited when a new namespace is
created. The mode of the current namespace can be queried by reading
/proc/sys/net/vsock/ns_mode. The mode cannot change after the namespace
has been created.
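
For illustration only, a minimal userspace sketch of querying the mode
described above. It assumes the file holds the plain strings "global" or
"local" (possibly followed by a newline), which this cover letter does not
spell out; the enum and function names are hypothetical and are not part of
the series.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical userspace view of the per-netns vsock mode. */
enum ns_mode { NS_MODE_UNKNOWN, NS_MODE_GLOBAL, NS_MODE_LOCAL };

/* Map the text read from ns_mode to an enum; a trailing newline is ignored. */
enum ns_mode parse_ns_mode(const char *text)
{
	size_t len;

	assert(text != NULL);
	len = strcspn(text, "\n");
	if (len == strlen("global") && strncmp(text, "global", len) == 0)
		return NS_MODE_GLOBAL;
	if (len == strlen("local") && strncmp(text, "local", len) == 0)
		return NS_MODE_LOCAL;
	return NS_MODE_UNKNOWN;
}

/* Read the mode of the calling process's namespace from the given path. */
enum ns_mode read_ns_mode(const char *path)
{
	char buf[16] = "";
	FILE *f = fopen(path, "r");

	if (!f)
		return NS_MODE_UNKNOWN;	/* e.g. kernel without this series */
	if (!fgets(buf, sizeof(buf), f))
		buf[0] = '\0';
	fclose(f);
	return parse_ns_mode(buf);
}
```

A caller would pass "/proc/sys/net/vsock/ns_mode" to read_ns_mode() and treat
NS_MODE_UNKNOWN as "namespace support not available".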

Modes are per-netns. This allows a system to configure namespaces
independently (some may share CIDs, others are completely isolated).
This also supports future possible mixed use cases, where there may be
namespaces in global mode spinning up VMs while there are mixed mode
namespaces that provide services to the VMs, but are not allowed to
allocate from the global CID pool (this mode is not implemented in this
series).

Additionally, this series adds tests for the new namespace features:

tools/testing/selftests/vsock/vmtest.sh
1..25
ok 1 vm_server_host_client
ok 2 vm_client_host_server
ok 3 vm_loopback
ok 4 ns_host_vsock_ns_mode_ok
ok 5 ns_host_vsock_child_ns_mode_ok
ok 6 ns_global_same_cid_fails
ok 7 ns_local_same_cid_ok
ok 8 ns_global_local_same_cid_ok
ok 9 ns_local_global_same_cid_ok
ok 10 ns_diff_global_host_connect_to_global_vm_ok
ok 11 ns_diff_global_host_connect_to_local_vm_fails
ok 12 ns_diff_global_vm_connect_to_global_host_ok
ok 13 ns_diff_global_vm_connect_to_local_host_fails
ok 14 ns_diff_local_host_connect_to_local_vm_fails
ok 15 ns_diff_local_vm_connect_to_local_host_fails
ok 16 ns_diff_global_to_local_loopback_local_fails
ok 17 ns_diff_local_to_global_loopback_fails
ok 18 ns_diff_local_to_local_loopback_fails
ok 19 ns_diff_global_to_global_loopback_ok
ok 20 ns_same_local_loopback_ok
ok 21 ns_same_local_host_connect_to_local_vm_ok
ok 22 ns_same_local_vm_connect_to_local_host_ok
ok 23 ns_delete_vm_ok
ok 24 ns_delete_host_ok
ok 25 ns_delete_both_ok
SUMMARY: PASS=25 SKIP=0 FAIL=0
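
The connectivity expectations encoded in the test names above can be
summarized as a small predicate. This is only a model inferred from the
_ok/_fails suffixes of tests 10-22, not the kernel implementation; the struct
and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model types; not taken from the series. */
enum model_mode { MODEL_GLOBAL, MODEL_LOCAL };

struct endpoint {
	int ns_id;		/* stand-in for a network namespace identity */
	enum model_mode mode;	/* fixed when the namespace is created */
};

/*
 * Expected outcome of a connection attempt between two endpoints:
 * - the same namespace always works (ns_same_local_* tests),
 * - different namespaces work only if both are global
 *   (ns_diff_global_*_global_*_ok tests),
 * - anything crossing into or out of a local namespace fails.
 */
bool vsock_may_connect(const struct endpoint *a, const struct endpoint *b)
{
	assert(a && b);
	if (a->ns_id == b->ns_id)
		return true;
	return a->mode == MODEL_GLOBAL && b->mode == MODEL_GLOBAL;
}
```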

Thanks again for everyone's help and reviews!

Suggested-by: Sargun Dhillon <sargun@sargun.me>
Signed-off-by: Bobby Eshleman <bobbyeshleman@gmail.com>

Changes in v16:
- updated comments/docs/commit msg (vsock_find_* funcs, init net
  mode, why change random port alloc)
- removed init ns mode cmdline
- fixed the missing ${ns} arg for vm_ssh in vmtest.sh
- Link to v15: https://lore.kernel.org/r/20260116-vsock-vmtest-v15-0-bbfd1a668548@meta.com

Changes in v15:
- see per-patch change notes in 'vsock: add netns to vsock core'
- Link to v14: https://lore.kernel.org/r/20260112-vsock-vmtest-v14-0-a5c332db3e2b@meta.com

Changes in v14:
- squashed 'vsock: add per-net vsock NS mode state' into 'vsock: add
  netns to vsock core' (MST)
- remove RFC tag
- fixed base-commit (still had b4 configured to depend on old vmtest.sh
  series)
- Link to v13: https://lore.kernel.org/all/20251223-vsock-vmtest-v13-0-9d6db8e7c80b@meta.com/

Changes in v13:
- add support for immutable sysctl ns_mode and inheritance from sysctl
  child_ns_mode
- remove passing around of net_mode, can be accessed now via
  vsock_net_mode(net) since it is immutable
- update tests for new uAPI
- add one patch to extend the kselftest timeout (it was starting to
  fail with the new tests added)
- Link to v12: https://lore.kernel.org/r/20251126-vsock-vmtest-v12-0-257ee21cd5de@meta.com

Changes in v12:
- add ns mode checking to _allow() callbacks to reject local mode for
  incompatible transports (Stefano)
- flip vhost/loopback to return true for stream_allow() and
  seqpacket_allow() in "vsock: add netns support to virtio transports"
  (Stefano)
- add VMADDR_CID_ANY + local mode documentation in af_vsock.c (Stefano)
- change "selftests/vsock: add tests for host <-> vm connectivity with
  namespaces" to skip test 29 in vsock_test for namespace local
  vsock_test calls in a host local-mode namespace. There is a
  false-positive edge case for that test encountered with the
  ->stream_allow() approach. More details in that patch.
- updated cover letter with new test output
- Link to v11: https://lore.kernel.org/r/20251120-vsock-vmtest-v11-0-55cbc80249a7@meta.com

Changes in v11:
- vmtest: add a patch to use ss in wait_for_listener functions and
  support vsock, tcp, and unix. Change all patches to use the new
  functions.
- vmtest: add a patch to re-use vm dmesg / warn counting functions
- Link to v10: https://lore.kernel.org/r/20251117-vsock-vmtest-v10-0-df08f165bf3e@meta.com

Changes in v10:
- Combine virtio common patches into one (Stefano)
- Resolve vsock_loopback virtio_transport_reset_no_sock() issue
  with info->vsk setting. This eliminates the need for skb->cb,
  so remove skb->cb patches.
- many line width 80 fixes
- Link to v9: https://lore.kernel.org/all/20251111-vsock-vmtest-v9-0-852787a37bed@meta.com

Changes in v9:
- reorder loopback patch after patch for virtio transport common code
- remove module ordering tests patch because loopback no longer depends
  on pernet ops
- major simplifications in vsock_loopback
- added a new patch for blocking local mode for guests, added test case
  to check
- add net ref tracking to vsock_loopback patch
- Link to v8: https://lore.kernel.org/r/20251023-vsock-vmtest-v8-0-dea984d02bb0@meta.com

Changes in v8:
- Break generic cleanup/refactoring patches into standalone series,
  remove those from this series
- Link to dependency: https://lore.kernel.org/all/20251022-vsock-selftests-fixes-and-improvements-v1-0-edeb179d6463@meta.com/
- Link to v7: https://lore.kernel.org/r/20251021-vsock-vmtest-v7-0-0661b7b6f081@meta.com

Changes in v7:
- fix hv_sock build
- break out vmtest patches into distinct, more well-scoped patches
- change `orig_net_mode` to `net_mode`
- many fixes and style changes in per-patch change sets (see individual
  patches for specific changes)
- optimize `virtio_vsock_skb_cb` layout
- update commit messages with more useful descriptions
- vsock_loopback: use orig_net_mode instead of current net mode
- add tests for edge cases (ns deletion, mode changing, loopback module
  load ordering)
- Link to v6: https://lore.kernel.org/r/20250916-vsock-vmtest-v6-0-064d2eb0c89d@meta.com

Changes in v6:
- define behavior when mode changes to local while socket/VM is alive
- af_vsock: clarify description of CID behavior
- af_vsock: use stronger language around CID rules (don't use "may")
- af_vsock: improve naming of buf/buffer
- af_vsock: improve string length checking on proc writes
- vsock_loopback: add space in struct to clarify lock protection
- vsock_loopback: do proper cleanup/unregister on vsock_loopback_exit()
- vsock_loopback: use virtio_vsock_skb_net() instead of sock_net()
- vsock_loopback: set loopback to NULL after kfree()
- vsock_loopback: use pernet_operations and remove callback mechanism
- vsock_loopback: add macros for "global" and "local"
- vsock_loopback: fix length checking
- vmtest.sh: check for namespace support in vmtest.sh
- Link to v5: https://lore.kernel.org/r/20250827-vsock-vmtest-v5-0-0ba580bede5b@meta.com

Changes in v5:
- /proc/net/vsock_ns_mode -> /proc/sys/net/vsock/ns_mode
- vsock_global_net -> vsock_global_dummy_net
- fix netns lookup in vhost_vsock to respect pid namespaces
- add callbacks for vsock_loopback to avoid circular dependency
- vmtest.sh loads vsock_loopback module
- remove vsock_net_mode_can_set()
- change vsock_net_write_mode() to return true/false based on success
- make vsock_net_mode enum instead of u8
- Link to v4: https://lore.kernel.org/r/20250805-vsock-vmtest-v4-0-059ec51ab111@meta.com

Changes in v4:
- removed RFC tag
- implemented loopback support
- renamed new tests to better reflect behavior
- completed suite of tests with permutations of ns modes and vsock_test
  as guest/host
- simplified socat bridging with unix socket instead of tcp + veth
- only use vsock_test for success case, socat for failure case (context
  in commit message)
- lots of cleanup

Changes in v3:
- add notion of "modes"
- add procfs /proc/net/vsock_ns_mode
- local and global modes only
- no /dev/vhost-vsock-netns
- vmtest.sh already merged, so new patch just adds new tests for NS
- Link to v2:
  https://lore.kernel.org/kvm/20250312-vsock-netns-v2-0-84bffa1aa97a@gmail.com

Changes in v2:
- only support vhost-vsock namespaces
- all g2h namespaces retain old behavior, only common API changes
  impacted by vhost-vsock changes
- add /dev/vhost-vsock-netns for "opt-in"
- leave /dev/vhost-vsock to old behavior
- removed netns module param
- Link to v1:
  https://lore.kernel.org/r/20200116172428.311437-1-sgarzare@redhat.com

Changes in v1:
- added 'netns' module param to vsock.ko to enable the
  network namespace support (disabled by default)
- added 'vsock_net_eq()' to check the "net" assigned to a socket
  only when 'netns' support is enabled
- Link to RFC: https://patchwork.ozlabs.org/cover/1202235/

---
Bobby Eshleman (12):
      vsock: add netns to vsock core
      virtio: set skb owner of virtio_transport_reset_no_sock() reply
      vsock: add netns support to virtio transports
      selftests/vsock: increase timeout to 1200
      selftests/vsock: add namespace helpers to vmtest.sh
      selftests/vsock: prepare vm management helpers for namespaces
      selftests/vsock: add vm_dmesg_{warn,oops}_count() helpers
      selftests/vsock: use ss to wait for listeners instead of /proc/net
      selftests/vsock: add tests for proc sys vsock ns_mode
      selftests/vsock: add namespace tests for CID collisions
      selftests/vsock: add tests for host <-> vm connectivity with namespaces
      selftests/vsock: add tests for namespace deletion

 MAINTAINERS                             |    1 +
 drivers/vhost/vsock.c                   |   44 +-
 include/linux/virtio_vsock.h            |    9 +-
 include/net/af_vsock.h                  |   61 +-
 include/net/net_namespace.h             |    4 +
 include/net/netns/vsock.h               |   21 +
 net/vmw_vsock/af_vsock.c                |  335 +++++++++-
 net/vmw_vsock/hyperv_transport.c        |    7 +-
 net/vmw_vsock/virtio_transport.c        |   22 +-
 net/vmw_vsock/virtio_transport_common.c |   62 +-
 net/vmw_vsock/vmci_transport.c          |   26 +-
 net/vmw_vsock/vsock_loopback.c          |   22 +-
 tools/testing/selftests/vsock/settings  |    2 +-
 tools/testing/selftests/vsock/vmtest.sh | 1055 +++++++++++++++++++++++++++++--
 14 files changed, 1531 insertions(+), 140 deletions(-)
---
base-commit: d8f87aa5fa0a4276491fa8ef436cd22605a3f9ba
change-id: 20250325-vsock-vmtest-b3a21d2102c2

Best regards,
-- 
Bobby Eshleman <bobbyeshleman@meta.com>



* [PATCH 2/2] mshv: Add support for integrated scheduler
From: Stanislav Kinsburskii @ 2026-01-21 22:35 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <176903475057.166619.9437539561789960983.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

From: Andreea Pintilie <anpintil@microsoft.com>

Query the hypervisor for integrated scheduler support and use it if
configured.

Microsoft Hypervisor originally provided two schedulers: root and core. The
root scheduler allows the root partition to schedule guest vCPUs across
physical cores, supporting both time slicing and CPU affinity (e.g., via
cgroups). In contrast, the core scheduler delegates vCPU-to-physical-core
scheduling entirely to the hypervisor.

Direct virtualization introduces a new privileged guest partition type, L1
Virtual Host (L1VH), which can create child partitions from its own
resources. These child partitions are effectively siblings, scheduled by
the hypervisor's core scheduler. This prevents the L1VH parent from setting
affinity or time slicing for its own processes or guest VPs. While cgroups,
CFS, and cpuset controllers can still be used, their effectiveness is
unpredictable, as the core scheduler swaps vCPUs according to its own logic
(typically round-robin across all allocated physical CPUs). As a result,
the system may appear to "steal" time from the L1VH and its children.

To address this, Microsoft Hypervisor introduces the integrated scheduler.
This allows an L1VH partition to schedule its own vCPUs and those of its
guests across its "physical" cores, effectively emulating root scheduler
behavior within the L1VH, while retaining core scheduler behavior for the
rest of the system.

The integrated scheduler is controlled by the root partition and gated by
the vmm_enable_integrated_scheduler capability bit. If set, the hypervisor
supports the integrated scheduler. The L1VH partition must then check if it
is enabled by querying the corresponding extended partition property. If
this property is true, the L1VH partition must use the root scheduler
logic; otherwise, it must use the core scheduler.
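
As a rough userspace model of the selection logic described above (not the
kernel code from this patch), the decision reduces to the following. The
callback type and the stub query functions are hypothetical stand-ins for
the extended partition property hypercall, and -5 is an arbitrary error
value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum sched_type { SCHED_CORE_SMT, SCHED_ROOT };

/* Hypothetical stand-in for the extended partition property query. */
typedef int (*query_enabled_fn)(bool *enabled);

int l1vh_select_scheduler(bool cap_bit, query_enabled_fn query,
			  enum sched_type *out)
{
	bool enabled = false;
	int ret;

	assert(out != NULL);
	*out = SCHED_CORE_SMT;		/* default: core scheduler */

	if (!cap_bit)			/* capability bit clear: unsupported */
		return 0;

	ret = query(&enabled);		/* enabled for this partition? */
	if (ret)
		return ret;

	if (enabled)
		*out = SCHED_ROOT;	/* use the root scheduler logic */
	return 0;
}

/* Example stubs modelling the three possible query outcomes. */
int query_enabled(bool *enabled)  { *enabled = true;  return 0; }
int query_disabled(bool *enabled) { *enabled = false; return 0; }
int query_fails(bool *enabled)    { (void)enabled;    return -5; }
```

The root scheduler is chosen only when the capability bit is set and the
property query both succeeds and reports true; a failed query is propagated
to the caller.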

Signed-off-by: Andreea Pintilie <anpintil@microsoft.com>
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_root_main.c |   79 +++++++++++++++++++++++++++++--------------
 include/hyperv/hvhdk_mini.h |    6 +++
 2 files changed, 58 insertions(+), 27 deletions(-)

diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 1134a82c7881..7a36297feea7 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -2053,6 +2053,32 @@ static const char *scheduler_type_to_string(enum hv_scheduler_type type)
 	};
 }
 
+static int __init l1vh_retrive_scheduler_type(enum hv_scheduler_type *out)
+{
+	size_t root_sched_enabled;
+	int ret;
+
+	*out = HV_SCHEDULER_TYPE_CORE_SMT;
+
+	if (!mshv_root.vmm_caps.vmm_enable_integrated_scheduler)
+		return 0;
+
+	ret = hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
+						HV_PARTITION_PROPERTY_INTEGRATED_SCHEDULER_ENABLED,
+						0, &root_sched_enabled,
+						sizeof(root_sched_enabled));
+	if (ret)
+		return ret;
+
+	if (root_sched_enabled)
+		*out = HV_SCHEDULER_TYPE_ROOT;
+
+	pr_debug("%s: integrated scheduler property read: ret=%d value=%zu\n",
+		 __func__, ret, root_sched_enabled);
+
+	return 0;
+}
+
 /* TODO move this to hv_common.c when needed outside */
 static int __init hv_retrieve_scheduler_type(enum hv_scheduler_type *out)
 {
@@ -2085,13 +2111,12 @@ static int __init hv_retrieve_scheduler_type(enum hv_scheduler_type *out)
 /* Retrieve and stash the supported scheduler type */
 static int __init mshv_retrieve_scheduler_type(struct device *dev)
 {
-	int ret = 0;
+	int ret;
 
 	if (hv_l1vh_partition())
-		hv_scheduler_type = HV_SCHEDULER_TYPE_CORE_SMT;
+		ret = l1vh_retrive_scheduler_type(&hv_scheduler_type);
 	else
 		ret = hv_retrieve_scheduler_type(&hv_scheduler_type);
-
 	if (ret)
 		return ret;
 
@@ -2211,42 +2236,35 @@ struct notifier_block mshv_reboot_nb = {
 static void mshv_root_partition_exit(void)
 {
 	unregister_reboot_notifier(&mshv_reboot_nb);
-	root_scheduler_deinit();
 }
 
 static int __init mshv_root_partition_init(struct device *dev)
 {
 	int err;
 
-	err = root_scheduler_init(dev);
-	if (err)
-		return err;
-
 	err = register_reboot_notifier(&mshv_reboot_nb);
 	if (err)
-		goto root_sched_deinit;
+		return err;
 
 	return 0;
-
-root_sched_deinit:
-	root_scheduler_deinit();
-	return err;
 }
 
-static void mshv_init_vmm_caps(struct device *dev)
+static int mshv_init_vmm_caps(struct device *dev)
 {
-	/*
-	 * This can only fail here if HVCALL_GET_PARTITION_PROPERTY_EX or
-	 * HV_PARTITION_PROPERTY_VMM_CAPABILITIES are not supported. In that
-	 * case it's valid to proceed as if all vmm_caps are disabled (zero).
-	 */
-	if (hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
-					      HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
-					      0, &mshv_root.vmm_caps,
-					      sizeof(mshv_root.vmm_caps)))
-		dev_warn(dev, "Unable to get VMM capabilities\n");
+	int ret;
+
+	ret = hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
+						HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
+						0, &mshv_root.vmm_caps,
+						sizeof(mshv_root.vmm_caps));
+	if (ret) {
+		dev_err(dev, "Failed to get VMM capabilities: %d\n", ret);
+		return ret;
+	}
 
 	dev_dbg(dev, "vmm_caps = %#llx\n", mshv_root.vmm_caps.as_uint64[0]);
+
+	return 0;
 }
 
 static int __init mshv_parent_partition_init(void)
@@ -2292,6 +2310,10 @@ static int __init mshv_parent_partition_init(void)
 
 	mshv_cpuhp_online = ret;
 
+	ret = mshv_init_vmm_caps(dev);
+	if (ret)
+		goto remove_cpu_state;
+
 	ret = mshv_retrieve_scheduler_type(dev);
 	if (ret)
 		goto remove_cpu_state;
@@ -2301,11 +2323,13 @@ static int __init mshv_parent_partition_init(void)
 	if (ret)
 		goto remove_cpu_state;
 
-	mshv_init_vmm_caps(dev);
+	ret = root_scheduler_init(dev);
+	if (ret)
+		goto exit_partition;
 
 	ret = mshv_irqfd_wq_init();
 	if (ret)
-		goto exit_partition;
+		goto deinit_root_scheduler;
 
 	spin_lock_init(&mshv_root.pt_ht_lock);
 	hash_init(mshv_root.pt_htable);
@@ -2314,6 +2338,8 @@ static int __init mshv_parent_partition_init(void)
 
 	return 0;
 
+deinit_root_scheduler:
+	root_scheduler_deinit();
 exit_partition:
 	if (hv_root_partition())
 		mshv_root_partition_exit();
@@ -2332,6 +2358,7 @@ static void __exit mshv_parent_partition_exit(void)
 	mshv_port_table_fini();
 	misc_deregister(&mshv_dev);
 	mshv_irqfd_wq_cleanup();
+	root_scheduler_deinit();
 	if (hv_root_partition())
 		mshv_root_partition_exit();
 	cpuhp_remove_state(mshv_cpuhp_online);
diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
index aa03616f965b..0f7178fa88a8 100644
--- a/include/hyperv/hvhdk_mini.h
+++ b/include/hyperv/hvhdk_mini.h
@@ -87,6 +87,9 @@ enum hv_partition_property_code {
 	HV_PARTITION_PROPERTY_PRIVILEGE_FLAGS			= 0x00010000,
 	HV_PARTITION_PROPERTY_SYNTHETIC_PROC_FEATURES		= 0x00010001,
 
+	/* Integrated scheduling properties */
+	HV_PARTITION_PROPERTY_INTEGRATED_SCHEDULER_ENABLED	= 0x00020005,
+
 	/* Resource properties */
 	HV_PARTITION_PROPERTY_GPA_PAGE_ACCESS_TRACKING		= 0x00050005,
 	HV_PARTITION_PROPERTY_UNIMPLEMENTED_MSR_ACTION		= 0x00050017,
@@ -102,7 +105,7 @@ enum hv_partition_property_code {
 };
 
 #define HV_PARTITION_VMM_CAPABILITIES_BANK_COUNT		1
-#define HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT	58
+#define HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT	57
 
 struct hv_partition_property_vmm_capabilities {
 	u16 bank_count;
@@ -120,6 +123,7 @@ struct hv_partition_property_vmm_capabilities {
 #endif
 			u64 assignable_synthetic_proc_features: 1;
 			u64 tag_hv_message_from_child: 1;
+			u64 vmm_enable_integrated_scheduler : 1;
 			u64 reserved0: HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT;
 		} __packed;
 	};




* [PATCH 1/2] hyperv: Sync guest VMM capabilities structure with Microsoft Hypervisor ABI
From: Stanislav Kinsburskii @ 2026-01-21 22:35 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <176903475057.166619.9437539561789960983.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

From: Andreea Pintilie <anpintil@microsoft.com>

Update the partition VMM capability structure to match the hypervisor
representation, bringing it up to date. This is a precursor patch for
Root-on-Core scheduler feature support.
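
To illustrate why the reserved count drops by one whenever a flag is added,
here is a standalone sketch of the bit accounting. It is a simplified model,
not the real header: the first four flag names are hypothetical placeholders,
and it relies on compiler support for 64-bit bit-fields and C11 anonymous
structs (as GCC and Clang provide).

```c
#include <assert.h>
#include <stdint.h>

#define CAPS_NAMED_BITS		6			/* single-bit flags */
#define CAPS_RESERVED_BITS	(64 - CAPS_NAMED_BITS)	/* 58 after this patch */

union vmm_caps_model {
	uint64_t as_uint64;
	struct {
		uint64_t flag0 : 1;	/* hypothetical placeholder */
		uint64_t flag1 : 1;	/* hypothetical placeholder */
		uint64_t flag2 : 1;	/* hypothetical placeholder */
		uint64_t flag3 : 1;	/* hypothetical placeholder */
		uint64_t assignable_synthetic_proc_features : 1;
		uint64_t tag_hv_message_from_child : 1;
		uint64_t reserved0 : CAPS_RESERVED_BITS;
	};
};

/* Named bits plus reserved bits must fill exactly one 64-bit bank. */
static_assert(CAPS_RESERVED_BITS == 58, "reserved count out of sync");
static_assert(sizeof(union vmm_caps_model) == sizeof(uint64_t),
	      "bit-fields must fill exactly one 64-bit bank");
```

Adding a seventh flag without lowering the reserved count to 57 would grow
the struct past 64 bits, which the second assertion would catch at build
time.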

Signed-off-by: Andreea Pintilie <anpintil@microsoft.com>
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 include/hyperv/hvhdk_mini.h |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
index 41a29bf8ec14..aa03616f965b 100644
--- a/include/hyperv/hvhdk_mini.h
+++ b/include/hyperv/hvhdk_mini.h
@@ -102,7 +102,7 @@ enum hv_partition_property_code {
 };
 
 #define HV_PARTITION_VMM_CAPABILITIES_BANK_COUNT		1
-#define HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT	59
+#define HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT	58
 
 struct hv_partition_property_vmm_capabilities {
 	u16 bank_count;
@@ -119,6 +119,7 @@ struct hv_partition_property_vmm_capabilities {
 			u64 reservedbit3: 1;
 #endif
 			u64 assignable_synthetic_proc_features: 1;
+			u64 tag_hv_message_from_child: 1;
 			u64 reserved0: HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT;
 		} __packed;
 	};




* [PATCH 0/2] Introduce Hyper-V integrated scheduler support
From: Stanislav Kinsburskii @ 2026-01-21 22:35 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

Microsoft Hypervisor originally provided two schedulers: root and core. The
root scheduler allows the root partition to schedule guest vCPUs across
physical cores, supporting both time slicing and CPU affinity (e.g., via
cgroups). In contrast, the core scheduler delegates vCPU-to-physical-core
scheduling entirely to the hypervisor.

Direct virtualization introduces a new privileged guest partition type, L1
Virtual Host (L1VH), which can create child partitions from its own
resources. These child partitions are effectively siblings, scheduled by
the hypervisor's core scheduler. This prevents the L1VH parent from setting
affinity or time slicing for its own processes or guest VPs. While cgroups,
CFS, and cpuset controllers can still be used, their effectiveness is
unpredictable, as the core scheduler swaps vCPUs according to its own logic
(typically round-robin across all allocated physical CPUs). As a result,
the system may appear to "steal" time from the L1VH and its children.

To address this, Microsoft Hypervisor introduces the integrated scheduler.
This allows an L1VH partition to schedule its own vCPUs and those of its
guests across its "physical" cores, effectively emulating root scheduler
behavior within the L1VH, while retaining core scheduler behavior for the
rest of the system.

---

Andreea Pintilie (2):
      hyperv: Sync guest VMM capabilities structure with Microsoft Hypervisor ABI
      mshv: Add support for integrated scheduler


 drivers/hv/mshv_root_main.c |   79 +++++++++++++++++++++++++++++--------------
 include/hyperv/hvhdk_mini.h |    7 +++-
 2 files changed, 59 insertions(+), 27 deletions(-)



* [PATCH net-next v16 11/12] selftests/vsock: add tests for host <-> vm connectivity with namespaces
From: Bobby Eshleman @ 2026-01-21 22:11 UTC (permalink / raw)
  To: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Stefan Hajnoczi, Michael S. Tsirkin,
	Jason Wang, Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	Broadcom internal kernel review list, Shuah Khan, Long Li,
	Jonathan Corbet
  Cc: linux-kernel, virtualization, netdev, kvm, linux-hyperv,
	linux-kselftest, berrange, Sargun Dhillon, linux-doc,
	Bobby Eshleman, Bobby Eshleman
In-Reply-To: <20260121-vsock-vmtest-v16-0-2859a7512097@meta.com>

From: Bobby Eshleman <bobbyeshleman@meta.com>

Add tests to validate namespace correctness using vsock_test and socat.
The vsock_test tool is used to validate expected success tests, but
socat is used for expected failure tests. socat is used to ensure that
connections are rejected outright instead of failing due to some other
socket behavior (as tested in vsock_test). Additionally, socat is
already required for tunneling TCP traffic from vsock_test. Using only
one of the vsock_test tests like 'test_stream_client_close_client' would
have yielded a similar result, but doing so wouldn't remove the socat
dependency.

Additionally, check for the dependency socat. socat needs special
handling beyond just checking if it is on the path because it must be
compiled with support for both vsock and unix. The function
check_socat() checks that this support exists.

Add more padding to test name printf strings because the tests added in
this patch would otherwise overflow.

Add vm_dmesg_* helpers to encapsulate checking dmesg
for oops and warnings.

Add ability to pass extra args to host-side vsock_test so that tests
that cause false positives may be skipped with arg --skip.

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v12:
- add test skip (vsock_test test 29) when host_vsock_test() uses client
  mode in a local namespace. Test 29 causes a false positive to trigger.

Changes in v11:
- add 'sleep "${WAIT_PERIOD}"' after any non-TCP socat LISTEN cmd
  (Stefano)
- add host_wait_for_listener() after any socat TCP-LISTEN (Stefano)
- reuse vm_dmesg_{oops,warn}_count() inside vm_dmesg_check()
- fix copy-paste in test_ns_same_local_vm_connect_to_local_host_ok()
  (Stefano)

Changes in v10:
- add vm_dmesg_start() and vm_dmesg_check()

Changes in v9:
- consistent variable quoting
---
 tools/testing/selftests/vsock/vmtest.sh | 572 +++++++++++++++++++++++++++++++-
 1 file changed, 568 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/vsock/vmtest.sh b/tools/testing/selftests/vsock/vmtest.sh
index 1bf537410ea6..a9eaf37bc31b 100755
--- a/tools/testing/selftests/vsock/vmtest.sh
+++ b/tools/testing/selftests/vsock/vmtest.sh
@@ -7,6 +7,7 @@
 #		* virtme-ng
 #		* busybox-static (used by virtme-ng)
 #		* qemu	(used by virtme-ng)
+#		* socat
 #
 # shellcheck disable=SC2317,SC2119
 
@@ -54,6 +55,19 @@ readonly TEST_NAMES=(
 	ns_local_same_cid_ok
 	ns_global_local_same_cid_ok
 	ns_local_global_same_cid_ok
+	ns_diff_global_host_connect_to_global_vm_ok
+	ns_diff_global_host_connect_to_local_vm_fails
+	ns_diff_global_vm_connect_to_global_host_ok
+	ns_diff_global_vm_connect_to_local_host_fails
+	ns_diff_local_host_connect_to_local_vm_fails
+	ns_diff_local_vm_connect_to_local_host_fails
+	ns_diff_global_to_local_loopback_local_fails
+	ns_diff_local_to_global_loopback_fails
+	ns_diff_local_to_local_loopback_fails
+	ns_diff_global_to_global_loopback_ok
+	ns_same_local_loopback_ok
+	ns_same_local_host_connect_to_local_vm_ok
+	ns_same_local_vm_connect_to_local_host_ok
 )
 readonly TEST_DESCS=(
 	# vm_server_host_client
@@ -82,6 +96,45 @@ readonly TEST_DESCS=(
 
 	# ns_local_global_same_cid_ok
 	"Check QEMU successfully starts one VM in a local ns and then another VM in a global ns with the same CID."
+
+	# ns_diff_global_host_connect_to_global_vm_ok
+	"Run vsock_test client in global ns with server in VM in another global ns."
+
+	# ns_diff_global_host_connect_to_local_vm_fails
+	"Run socat to test a process in a global ns fails to connect to a VM in a local ns."
+
+	# ns_diff_global_vm_connect_to_global_host_ok
+	"Run vsock_test client in VM in a global ns with server in another global ns."
+
+	# ns_diff_global_vm_connect_to_local_host_fails
+	"Run socat to test a VM in a global ns fails to connect to a host process in a local ns."
+
+	# ns_diff_local_host_connect_to_local_vm_fails
+	"Run socat to test a host process in a local ns fails to connect to a VM in another local ns."
+
+	# ns_diff_local_vm_connect_to_local_host_fails
+	"Run socat to test a VM in a local ns fails to connect to a host process in another local ns."
+
+	# ns_diff_global_to_local_loopback_local_fails
+	"Run socat to test a loopback vsock in a global ns fails to connect to a vsock in a local ns."
+
+	# ns_diff_local_to_global_loopback_fails
+	"Run socat to test a loopback vsock in a local ns fails to connect to a vsock in a global ns."
+
+	# ns_diff_local_to_local_loopback_fails
+	"Run socat to test a loopback vsock in a local ns fails to connect to a vsock in another local ns."
+
+	# ns_diff_global_to_global_loopback_ok
+	"Run socat to test a loopback vsock in a global ns successfully connects to a vsock in another global ns."
+
+	# ns_same_local_loopback_ok
+	"Run socat to test a loopback vsock in a local ns successfully connects to a vsock in the same ns."
+
+	# ns_same_local_host_connect_to_local_vm_ok
+	"Run vsock_test client in a local ns with server in VM in same ns."
+
+	# ns_same_local_vm_connect_to_local_host_ok
+	"Run vsock_test client in VM in a local ns with server in same ns."
 )
 
 readonly USE_SHARED_VM=(
@@ -112,7 +165,7 @@ usage() {
 	for ((i = 0; i < ${#TEST_NAMES[@]}; i++)); do
 		name=${TEST_NAMES[${i}]}
 		desc=${TEST_DESCS[${i}]}
-		printf "\t%-35s%-35s\n" "${name}" "${desc}"
+		printf "\t%-55s%-35s\n" "${name}" "${desc}"
 	done
 	echo
 
@@ -222,7 +275,7 @@ check_args() {
 }
 
 check_deps() {
-	for dep in vng ${QEMU} busybox pkill ssh ss; do
+	for dep in vng ${QEMU} busybox pkill ssh ss socat; do
 		if [[ ! -x $(command -v "${dep}") ]]; then
 			echo -e "skip:    dependency ${dep} not found!\n"
 			exit "${KSFT_SKIP}"
@@ -273,6 +326,20 @@ check_vng() {
 	fi
 }
 
+check_socat() {
+	local support_string
+
+	support_string="$(socat -V)"
+
+	if [[ "${support_string}" != *"WITH_VSOCK 1"* ]]; then
+		die "err: socat is missing vsock support"
+	fi
+
+	if [[ "${support_string}" != *"WITH_UNIX 1"* ]]; then
+		die "err: socat is missing unix support"
+	fi
+}
+
 handle_build() {
 	if [[ ! "${BUILD}" -eq 1 ]]; then
 		return
@@ -321,6 +388,14 @@ terminate_pidfiles() {
 	done
 }
 
+terminate_pids() {
+	local pid
+
+	for pid in "$@"; do
+		kill -SIGTERM "${pid}" &>/dev/null || :
+	done
+}
+
 vm_start() {
 	local pidfile=$1
 	local ns=$2
@@ -459,6 +534,28 @@ vm_dmesg_warn_count() {
 	vm_ssh "${ns}" -- dmesg --level=warn 2>/dev/null | grep -c -i 'vsock'
 }
 
+vm_dmesg_check() {
+	local pidfile=$1
+	local ns=$2
+	local oops_before=$3
+	local warn_before=$4
+	local oops_after warn_after
+
+	oops_after=$(vm_dmesg_oops_count "${ns}")
+	if [[ "${oops_after}" -gt "${oops_before}" ]]; then
+		echo "FAIL: kernel oops detected on vm in ns ${ns}" | log_host
+		return 1
+	fi
+
+	warn_after=$(vm_dmesg_warn_count "${ns}")
+	if [[ "${warn_after}" -gt "${warn_before}" ]]; then
+		echo "FAIL: kernel warning detected on vm in ns ${ns}" | log_host
+		return 1
+	fi
+
+	return 0
+}
+
 vm_vsock_test() {
 	local ns=$1
 	local host=$2
@@ -502,6 +599,8 @@ host_vsock_test() {
 	local host=$2
 	local cid=$3
 	local port=$4
+	shift 4
+	local extra_args=("$@")
 	local rc
 
 	local cmd="${VSOCK_TEST}"
@@ -516,13 +615,15 @@ host_vsock_test() {
 			--mode=client \
 			--peer-cid="${cid}" \
 			--control-host="${host}" \
-			--control-port="${port}" 2>&1 | log_host
+			--control-port="${port}" \
+			"${extra_args[@]}" 2>&1 | log_host
 		rc=$?
 	else
 		${cmd} \
 			--mode=server \
 			--peer-cid="${cid}" \
-			--control-port="${port}" 2>&1 | log_host &
+			--control-port="${port}" \
+			"${extra_args[@]}" 2>&1 | log_host &
 		rc=$?
 
 		if [[ $rc -ne 0 ]]; then
@@ -593,6 +694,468 @@ test_ns_host_vsock_ns_mode_ok() {
 	return "${KSFT_PASS}"
 }
 
+test_ns_diff_global_host_connect_to_global_vm_ok() {
+	local oops_before warn_before
+	local pids pid pidfile
+	local ns0 ns1 port
+	declare -a pids
+	local unixfile
+	ns0="global0"
+	ns1="global1"
+	port=1234
+	local rc
+
+	init_namespaces
+
+	pidfile="$(create_pidfile)"
+
+	if ! vm_start "${pidfile}" "${ns0}"; then
+		return "${KSFT_FAIL}"
+	fi
+
+	vm_wait_for_ssh "${ns0}"
+	oops_before=$(vm_dmesg_oops_count "${ns0}")
+	warn_before=$(vm_dmesg_warn_count "${ns0}")
+
+	unixfile=$(mktemp -u /tmp/XXXX.sock)
+	ip netns exec "${ns1}" \
+		socat TCP-LISTEN:"${TEST_HOST_PORT}",fork \
+			UNIX-CONNECT:"${unixfile}" &
+	pids+=($!)
+	host_wait_for_listener "${ns1}" "${TEST_HOST_PORT}" "tcp"
+
+	ip netns exec "${ns0}" socat UNIX-LISTEN:"${unixfile}",fork \
+		TCP-CONNECT:localhost:"${TEST_HOST_PORT}" &
+	pids+=($!)
+	host_wait_for_listener "${ns0}" "${unixfile}" "unix"
+
+	vm_vsock_test "${ns0}" "server" 2 "${TEST_GUEST_PORT}"
+	vm_wait_for_listener "${ns0}" "${TEST_GUEST_PORT}" "tcp"
+	host_vsock_test "${ns1}" "127.0.0.1" "${VSOCK_CID}" "${TEST_HOST_PORT}"
+	rc=$?
+
+	vm_dmesg_check "${pidfile}" "${ns0}" "${oops_before}" "${warn_before}"
+	dmesg_rc=$?
+
+	terminate_pids "${pids[@]}"
+	terminate_pidfiles "${pidfile}"
+
+	if [[ "${rc}" -ne 0 ]] || [[ "${dmesg_rc}" -ne 0 ]]; then
+		return "${KSFT_FAIL}"
+	fi
+
+	return "${KSFT_PASS}"
+}
+
+test_ns_diff_global_host_connect_to_local_vm_fails() {
+	local oops_before warn_before
+	local ns0="global0"
+	local ns1="local0"
+	local port=12345
+	local dmesg_rc
+	local pidfile
+	local result
+	local pid
+
+	init_namespaces
+
+	outfile=$(mktemp)
+
+	pidfile="$(create_pidfile)"
+	if ! vm_start "${pidfile}" "${ns1}"; then
+		log_host "failed to start vm (cid=${VSOCK_CID}, ns=${ns1})"
+		return "${KSFT_FAIL}"
+	fi
+
+	vm_wait_for_ssh "${ns1}"
+	oops_before=$(vm_dmesg_oops_count "${ns1}")
+	warn_before=$(vm_dmesg_warn_count "${ns1}")
+
+	vm_ssh "${ns1}" -- socat VSOCK-LISTEN:"${port}" STDOUT > "${outfile}" &
+	vm_wait_for_listener "${ns1}" "${port}" "vsock"
+	echo TEST | ip netns exec "${ns0}" \
+		socat STDIN VSOCK-CONNECT:"${VSOCK_CID}":"${port}" 2>/dev/null
+
+	vm_dmesg_check "${pidfile}" "${ns1}" "${oops_before}" "${warn_before}"
+	dmesg_rc=$?
+
+	terminate_pidfiles "${pidfile}"
+	result=$(cat "${outfile}")
+	rm -f "${outfile}"
+
+	if [[ "${result}" == "TEST" ]] || [[ "${dmesg_rc}" -ne 0 ]]; then
+		return "${KSFT_FAIL}"
+	fi
+
+	return "${KSFT_PASS}"
+}
+
+test_ns_diff_global_vm_connect_to_global_host_ok() {
+	local oops_before warn_before
+	local ns0="global0"
+	local ns1="global1"
+	local port=12345
+	local unixfile
+	local dmesg_rc
+	local pidfile
+	local pids
+	local rc
+
+	init_namespaces
+
+	declare -a pids
+
+	log_host "Setup socat bridge from ns ${ns0} to ns ${ns1} over port ${port}"
+
+	unixfile=$(mktemp -u /tmp/XXXX.sock)
+
+	ip netns exec "${ns0}" \
+		socat TCP-LISTEN:"${port}" UNIX-CONNECT:"${unixfile}" &
+	pids+=($!)
+	host_wait_for_listener "${ns0}" "${port}" "tcp"
+
+	ip netns exec "${ns1}" \
+		socat UNIX-LISTEN:"${unixfile}" TCP-CONNECT:127.0.0.1:"${port}" &
+	pids+=($!)
+	host_wait_for_listener "${ns1}" "${unixfile}" "unix"
+
+	log_host "Launching ${VSOCK_TEST} in ns ${ns1}"
+	host_vsock_test "${ns1}" "server" "${VSOCK_CID}" "${port}"
+
+	pidfile="$(create_pidfile)"
+	if ! vm_start "${pidfile}" "${ns0}"; then
+		log_host "failed to start vm (cid=${VSOCK_CID}, ns=${ns0})"
+		terminate_pids "${pids[@]}"
+		rm -f "${unixfile}"
+		return "${KSFT_FAIL}"
+	fi
+
+	vm_wait_for_ssh "${ns0}"
+
+	oops_before=$(vm_dmesg_oops_count "${ns0}")
+	warn_before=$(vm_dmesg_warn_count "${ns0}")
+
+	vm_vsock_test "${ns0}" "10.0.2.2" 2 "${port}"
+	rc=$?
+
+	vm_dmesg_check "${pidfile}" "${ns0}" "${oops_before}" "${warn_before}"
+	dmesg_rc=$?
+
+	terminate_pidfiles "${pidfile}"
+	terminate_pids "${pids[@]}"
+	rm -f "${unixfile}"
+
+	if [[ "${rc}" -ne 0 ]] || [[ "${dmesg_rc}" -ne 0 ]]; then
+		return "${KSFT_FAIL}"
+	fi
+
+	return "${KSFT_PASS}"
+
+}
+
+test_ns_diff_global_vm_connect_to_local_host_fails() {
+	local ns0="global0"
+	local ns1="local0"
+	local port=12345
+	local oops_before warn_before
+	local dmesg_rc
+	local pidfile
+	local result
+	local pid
+
+	init_namespaces
+
+	log_host "Launching socat in ns ${ns1}"
+	outfile=$(mktemp)
+
+	ip netns exec "${ns1}" socat VSOCK-LISTEN:"${port}" STDOUT &> "${outfile}" &
+	pid=$!
+	host_wait_for_listener "${ns1}" "${port}" "vsock"
+
+	pidfile="$(create_pidfile)"
+	if ! vm_start "${pidfile}" "${ns0}"; then
+		log_host "failed to start vm (cid=${VSOCK_CID}, ns=${ns0})"
+		terminate_pids "${pid}"
+		rm -f "${outfile}"
+		return "${KSFT_FAIL}"
+	fi
+
+	vm_wait_for_ssh "${ns0}"
+
+	oops_before=$(vm_dmesg_oops_count "${ns0}")
+	warn_before=$(vm_dmesg_warn_count "${ns0}")
+
+	vm_ssh "${ns0}" -- \
+		bash -c "echo TEST | socat STDIN VSOCK-CONNECT:2:${port}" 2>&1 | log_guest
+
+	vm_dmesg_check "${pidfile}" "${ns0}" "${oops_before}" "${warn_before}"
+	dmesg_rc=$?
+
+	terminate_pidfiles "${pidfile}"
+	terminate_pids "${pid}"
+
+	result=$(cat "${outfile}")
+	rm -f "${outfile}"
+
+	if [[ "${result}" != TEST ]] && [[ "${dmesg_rc}" -eq 0 ]]; then
+		return "${KSFT_PASS}"
+	fi
+
+	return "${KSFT_FAIL}"
+}
+
+test_ns_diff_local_host_connect_to_local_vm_fails() {
+	local ns0="local0"
+	local ns1="local1"
+	local port=12345
+	local oops_before warn_before
+	local dmesg_rc
+	local pidfile
+	local result
+	local pid
+
+	init_namespaces
+
+	outfile=$(mktemp)
+
+	pidfile="$(create_pidfile)"
+	if ! vm_start "${pidfile}" "${ns1}"; then
+		log_host "failed to start vm (cid=${VSOCK_CID}, ns=${ns1})"
+		return "${KSFT_FAIL}"
+	fi
+
+	vm_wait_for_ssh "${ns1}"
+	oops_before=$(vm_dmesg_oops_count "${ns1}")
+	warn_before=$(vm_dmesg_warn_count "${ns1}")
+
+	vm_ssh "${ns1}" -- socat VSOCK-LISTEN:"${port}" STDOUT > "${outfile}" &
+	vm_wait_for_listener "${ns1}" "${port}" "vsock"
+
+	echo TEST | ip netns exec "${ns0}" \
+		socat STDIN VSOCK-CONNECT:"${VSOCK_CID}":"${port}" 2>/dev/null
+
+	vm_dmesg_check "${pidfile}" "${ns1}" "${oops_before}" "${warn_before}"
+	dmesg_rc=$?
+
+	terminate_pidfiles "${pidfile}"
+
+	result=$(cat "${outfile}")
+	rm -f "${outfile}"
+
+	if [[ "${result}" != TEST ]] && [[ "${dmesg_rc}" -eq 0 ]]; then
+		return "${KSFT_PASS}"
+	fi
+
+	return "${KSFT_FAIL}"
+}
+
+test_ns_diff_local_vm_connect_to_local_host_fails() {
+	local oops_before warn_before
+	local ns0="local0"
+	local ns1="local1"
+	local port=12345
+	local dmesg_rc
+	local pidfile
+	local result
+	local pid
+
+	init_namespaces
+
+	log_host "Launching socat in ns ${ns1}"
+	outfile=$(mktemp)
+	ip netns exec "${ns1}" socat VSOCK-LISTEN:"${port}" STDOUT &> "${outfile}" &
+	pid=$!
+	host_wait_for_listener "${ns1}" "${port}" "vsock"
+
+	pidfile="$(create_pidfile)"
+	if ! vm_start "${pidfile}" "${ns0}"; then
+		log_host "failed to start vm (cid=${VSOCK_CID}, ns=${ns0})"
+		rm -f "${outfile}"
+		return "${KSFT_FAIL}"
+	fi
+
+	vm_wait_for_ssh "${ns0}"
+	oops_before=$(vm_dmesg_oops_count "${ns0}")
+	warn_before=$(vm_dmesg_warn_count "${ns0}")
+
+	vm_ssh "${ns0}" -- \
+		bash -c "echo TEST | socat STDIN VSOCK-CONNECT:2:${port}" 2>&1 | log_guest
+
+	vm_dmesg_check "${pidfile}" "${ns0}" "${oops_before}" "${warn_before}"
+	dmesg_rc=$?
+
+	terminate_pidfiles "${pidfile}"
+	terminate_pids "${pid}"
+
+	result=$(cat "${outfile}")
+	rm -f "${outfile}"
+
+	if [[ "${result}" != TEST ]] && [[ "${dmesg_rc}" -eq 0 ]]; then
+		return "${KSFT_PASS}"
+	fi
+
+	return "${KSFT_FAIL}"
+}
+
+__test_loopback_two_netns() {
+	local ns0=$1
+	local ns1=$2
+	local port=12345
+	local result
+	local pid
+
+	modprobe vsock_loopback &> /dev/null || :
+
+	log_host "Launching socat in ns ${ns1}"
+	outfile=$(mktemp)
+
+	ip netns exec "${ns1}" socat VSOCK-LISTEN:"${port}" STDOUT > "${outfile}" 2>/dev/null &
+	pid=$!
+	host_wait_for_listener "${ns1}" "${port}" "vsock"
+
+	log_host "Launching socat in ns ${ns0}"
+	echo TEST | ip netns exec "${ns0}" socat STDIN VSOCK-CONNECT:1:"${port}" 2>/dev/null
+	terminate_pids "${pid}"
+
+	result=$(cat "${outfile}")
+	rm -f "${outfile}"
+
+	if [[ "${result}" == TEST ]]; then
+		return 0
+	fi
+
+	return 1
+}
+
+test_ns_diff_global_to_local_loopback_local_fails() {
+	init_namespaces
+
+	if ! __test_loopback_two_netns "global0" "local0"; then
+		return "${KSFT_PASS}"
+	fi
+
+	return "${KSFT_FAIL}"
+}
+
+test_ns_diff_local_to_global_loopback_fails() {
+	init_namespaces
+
+	if ! __test_loopback_two_netns "local0" "global0"; then
+		return "${KSFT_PASS}"
+	fi
+
+	return "${KSFT_FAIL}"
+}
+
+test_ns_diff_local_to_local_loopback_fails() {
+	init_namespaces
+
+	if ! __test_loopback_two_netns "local0" "local1"; then
+		return "${KSFT_PASS}"
+	fi
+
+	return "${KSFT_FAIL}"
+}
+
+test_ns_diff_global_to_global_loopback_ok() {
+	init_namespaces
+
+	if __test_loopback_two_netns "global0" "global1"; then
+		return "${KSFT_PASS}"
+	fi
+
+	return "${KSFT_FAIL}"
+}
+
+test_ns_same_local_loopback_ok() {
+	init_namespaces
+
+	if __test_loopback_two_netns "local0" "local0"; then
+		return "${KSFT_PASS}"
+	fi
+
+	return "${KSFT_FAIL}"
+}
+
+test_ns_same_local_host_connect_to_local_vm_ok() {
+	local oops_before warn_before
+	local ns="local0"
+	local port=1234
+	local dmesg_rc
+	local pidfile
+	local rc
+
+	init_namespaces
+
+	pidfile="$(create_pidfile)"
+
+	if ! vm_start "${pidfile}" "${ns}"; then
+		return "${KSFT_FAIL}"
+	fi
+
+	vm_wait_for_ssh "${ns}"
+	oops_before=$(vm_dmesg_oops_count "${ns}")
+	warn_before=$(vm_dmesg_warn_count "${ns}")
+
+	vm_vsock_test "${ns}" "server" 2 "${TEST_GUEST_PORT}"
+
+	# Skip test 29 (transport release use-after-free): This test attempts
+	# binding both G2H and H2G CIDs. Because virtio-vsock (G2H) doesn't
+	# support local namespaces the test will fail when
+	# transport_g2h->stream_allow() returns false. This edge case only
+	# happens for vsock_test in client mode on the host in a local
+	# namespace. This is a false positive.
+	host_vsock_test "${ns}" "127.0.0.1" "${VSOCK_CID}" "${TEST_HOST_PORT}" --skip=29
+	rc=$?
+
+	vm_dmesg_check "${pidfile}" "${ns}" "${oops_before}" "${warn_before}"
+	dmesg_rc=$?
+
+	terminate_pidfiles "${pidfile}"
+
+	if [[ "${rc}" -ne 0 ]] || [[ "${dmesg_rc}" -ne 0 ]]; then
+		return "${KSFT_FAIL}"
+	fi
+
+	return "${KSFT_PASS}"
+}
+
+test_ns_same_local_vm_connect_to_local_host_ok() {
+	local oops_before warn_before
+	local ns="local0"
+	local port=1234
+	local dmesg_rc
+	local pidfile
+	local rc
+
+	init_namespaces
+
+	pidfile="$(create_pidfile)"
+
+	if ! vm_start "${pidfile}" "${ns}"; then
+		return "${KSFT_FAIL}"
+	fi
+
+	vm_wait_for_ssh "${ns}"
+	oops_before=$(vm_dmesg_oops_count "${ns}")
+	warn_before=$(vm_dmesg_warn_count "${ns}")
+
+	host_vsock_test "${ns}" "server" "${VSOCK_CID}" "${port}"
+	vm_vsock_test "${ns}" "10.0.2.2" 2 "${port}"
+	rc=$?
+
+	vm_dmesg_check "${pidfile}" "${ns}" "${oops_before}" "${warn_before}"
+	dmesg_rc=$?
+
+	terminate_pidfiles "${pidfile}"
+
+	if [[ "${rc}" -ne 0 ]] || [[ "${dmesg_rc}" -ne 0 ]]; then
+		return "${KSFT_FAIL}"
+	fi
+
+	return "${KSFT_PASS}"
+}
+
 namespaces_can_boot_same_cid() {
 	local ns0=$1
 	local ns1=$2
@@ -882,6 +1445,7 @@ fi
 check_args "${ARGS[@]}"
 check_deps
 check_vng
+check_socat
 handle_build
 
 echo "1..${#ARGS[@]}"

-- 
2.47.3



* [PATCH net-next v16 12/12] selftests/vsock: add tests for namespace deletion
From: Bobby Eshleman @ 2026-01-21 22:11 UTC (permalink / raw)
  To: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Stefan Hajnoczi, Michael S. Tsirkin,
	Jason Wang, Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	Broadcom internal kernel review list, Shuah Khan, Long Li,
	Jonathan Corbet
  Cc: linux-kernel, virtualization, netdev, kvm, linux-hyperv,
	linux-kselftest, berrange, Sargun Dhillon, linux-doc,
	Bobby Eshleman, Bobby Eshleman
In-Reply-To: <20260121-vsock-vmtest-v16-0-2859a7512097@meta.com>

From: Bobby Eshleman <bobbyeshleman@meta.com>

Add tests that validate vsock sockets are resilient to deleting
namespaces. The vsock sockets should still function normally.

The function check_ns_delete_doesnt_break_connection() is added to
re-use the step-by-step logic of 1) setup connections, 2) delete ns,
3) check that the connections are still ok.
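
The "check that the connections are still ok" step boils down to polling
an output file under a timeout. A minimal sketch of that pattern follows.
wait_for_nonempty() is a hypothetical helper that is not part of this
patch; it passes the path as a positional argument rather than splicing
it into the script text.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in vmtest.sh): wait up to <seconds> for <file>
# to become non-empty. Returns non-zero (124 from timeout) on expiry.
wait_for_nonempty() {
	local file=$1
	local seconds=$2

	# "$1" inside the child script is the path given after the "_" placeholder.
	timeout "${seconds}" \
		bash -c 'while [[ ! -s "$1" ]]; do sleep 1; done' _ "${file}"
}

# Simulate a peer that delivers data a little later.
outfile=$(mktemp)
(sleep 1; echo "TEST" > "${outfile}") &

if wait_for_nonempty "${outfile}" 5 && grep -q "TEST" "${outfile}"; then
	echo "data arrived"
fi

wait
rm -f "${outfile}"
```

Checking the return value of the wait lets a caller tell "nothing
arrived in time" apart from "something arrived but it was wrong".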

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v13:
- remove tests that change the mode after socket creation (this is not
  supported behavior now and the immutability property is tested in other
  tests)
- remove "change_mode" behavior of
  check_ns_changes_dont_break_connection() and rename to
  check_ns_delete_doesnt_break_connection() because we only need to test
  namespace deletion (other tests confirm that the mode cannot change)

Changes in v11:
- remove pipefile (Stefano)

Changes in v9:
- more consistent shell style
- clarify -u usage comment for pipefile
---
 tools/testing/selftests/vsock/vmtest.sh | 84 +++++++++++++++++++++++++++++++++
 1 file changed, 84 insertions(+)

diff --git a/tools/testing/selftests/vsock/vmtest.sh b/tools/testing/selftests/vsock/vmtest.sh
index a9eaf37bc31b..dc8dbe74a6d0 100755
--- a/tools/testing/selftests/vsock/vmtest.sh
+++ b/tools/testing/selftests/vsock/vmtest.sh
@@ -68,6 +68,9 @@ readonly TEST_NAMES=(
 	ns_same_local_loopback_ok
 	ns_same_local_host_connect_to_local_vm_ok
 	ns_same_local_vm_connect_to_local_host_ok
+	ns_delete_vm_ok
+	ns_delete_host_ok
+	ns_delete_both_ok
 )
 readonly TEST_DESCS=(
 	# vm_server_host_client
@@ -135,6 +138,15 @@ readonly TEST_DESCS=(
 
 	# ns_same_local_vm_connect_to_local_host_ok
 	"Run vsock_test client in VM in a local ns with server in same ns."
+
+	# ns_delete_vm_ok
+	"Check that deleting the VM's namespace does not break the socket connection"
+
+	# ns_delete_host_ok
+	"Check that deleting the host's namespace does not break the socket connection"
+
+	# ns_delete_both_ok
+	"Check that deleting the VM and host's namespaces does not break the socket connection"
 )
 
 readonly USE_SHARED_VM=(
@@ -1287,6 +1299,78 @@ test_vm_loopback() {
 	return "${KSFT_PASS}"
 }
 
+check_ns_delete_doesnt_break_connection() {
+	local pipefile pidfile outfile
+	local ns0="global0"
+	local ns1="global1"
+	local port=12345
+	local pids=()
+	local rc=0
+
+	init_namespaces
+
+	pidfile="$(create_pidfile)"
+	if ! vm_start "${pidfile}" "${ns0}"; then
+		return "${KSFT_FAIL}"
+	fi
+	vm_wait_for_ssh "${ns0}"
+
+	outfile=$(mktemp)
+	vm_ssh "${ns0}" -- \
+		socat VSOCK-LISTEN:"${port}",fork STDOUT > "${outfile}" 2>/dev/null &
+	pids+=($!)
+	vm_wait_for_listener "${ns0}" "${port}" "vsock"
+
+	# We use a pipe here so that we can echo into the pipe instead of using
+	# socat and a unix socket file. We just need a name for the pipe (not a
+	# regular file) so use -u.
+	pipefile=$(mktemp -u /tmp/vmtest_pipe_XXXX)
+	ip netns exec "${ns1}" \
+		socat PIPE:"${pipefile}" VSOCK-CONNECT:"${VSOCK_CID}":"${port}" &
+	pids+=($!)
+
+	timeout "${WAIT_PERIOD}" \
+		bash -c 'while [[ ! -e '"${pipefile}"' ]]; do sleep 1; done; exit 0'
+
+	if [[ "$1" == "vm" ]]; then
+		ip netns del "${ns0}"
+	elif [[ "$1" == "host" ]]; then
+		ip netns del "${ns1}"
+	elif [[ "$1" == "both" ]]; then
+		ip netns del "${ns0}"
+		ip netns del "${ns1}"
+	fi
+
+	echo "TEST" > "${pipefile}"
+
+	timeout "${WAIT_PERIOD}" \
+		bash -c 'while [[ ! -s '"${outfile}"' ]]; do sleep 1; done; exit 0'
+
+	if grep -q "TEST" "${outfile}"; then
+		rc="${KSFT_PASS}"
+	else
+		rc="${KSFT_FAIL}"
+	fi
+
+	terminate_pidfiles "${pidfile}"
+	terminate_pids "${pids[@]}"
+	rm -f "${outfile}" "${pipefile}"
+
+	return "${rc}"
+}
+
+test_ns_delete_vm_ok() {
+	check_ns_delete_doesnt_break_connection "vm"
+}
+
+test_ns_delete_host_ok() {
+	check_ns_delete_doesnt_break_connection "host"
+}
+
+test_ns_delete_both_ok() {
+	check_ns_delete_doesnt_break_connection "both"
+}
+
 shared_vm_test() {
 	local tname
 

-- 
2.47.3



* [PATCH net-next v16 10/12] selftests/vsock: add namespace tests for CID collisions
From: Bobby Eshleman @ 2026-01-21 22:11 UTC (permalink / raw)
  To: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Stefan Hajnoczi, Michael S. Tsirkin,
	Jason Wang, Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	Broadcom internal kernel review list, Shuah Khan, Long Li,
	Jonathan Corbet
  Cc: linux-kernel, virtualization, netdev, kvm, linux-hyperv,
	linux-kselftest, berrange, Sargun Dhillon, linux-doc,
	Bobby Eshleman, Bobby Eshleman
In-Reply-To: <20260121-vsock-vmtest-v16-0-2859a7512097@meta.com>

From: Bobby Eshleman <bobbyeshleman@meta.com>

Add tests to verify CID collision rules across different vsock namespace
modes.

1. Two VMs with the same CID cannot start in different global namespaces
   (ns_global_same_cid_fails)
2. Two VMs with the same CID can start in different local namespaces
   (ns_local_same_cid_ok)
3. VMs with the same CID can coexist when one is in a global namespace
   and another is in a local namespace (ns_global_local_same_cid_ok and
   ns_local_global_same_cid_ok)

The tests ns_global_local_same_cid_ok and ns_local_global_same_cid_ok
make sure that ordering does not matter.

The tests use a shared helper function namespaces_can_boot_same_cid()
that attempts to start two VMs with identical CIDs in the specified
namespaces and verifies whether VM initialization failed or succeeded.
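
The three rules above can be summarized as a small truth table. The
sketch below is only an illustration: same_cid_expected_to_boot() is a
hypothetical helper that is not part of this patch, and it takes the
modes of two *different* namespaces.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in vmtest.sh): given the modes of two different
# namespaces, return 0 if a second VM with the same CID is expected to
# boot, or 1 if a CID collision is expected.
same_cid_expected_to_boot() {
	local mode0=$1
	local mode1=$2

	# Only the global/global pair is expected to collide (rule 1).
	if [[ "${mode0}" == "global" && "${mode1}" == "global" ]]; then
		return 1
	fi

	return 0
}

for pair in "global global" "local local" "global local" "local global"; do
	# Intentional word splitting: each pair holds two mode names.
	# shellcheck disable=SC2086
	if same_cid_expected_to_boot ${pair}; then
		echo "${pair}: ok"
	else
		echo "${pair}: fails"
	fi
done
```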

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v11:
- check vm_start() rc in namespaces_can_boot_same_cid() (Stefano)
- fix ns_local_same_cid_ok() to use local0 and local1 instead of reusing
  local0 twice. This check should pass, ensuring local namespaces do not
  collide (Stefano)
---
 tools/testing/selftests/vsock/vmtest.sh | 78 +++++++++++++++++++++++++++++++++
 1 file changed, 78 insertions(+)

diff --git a/tools/testing/selftests/vsock/vmtest.sh b/tools/testing/selftests/vsock/vmtest.sh
index 38785a102236..1bf537410ea6 100755
--- a/tools/testing/selftests/vsock/vmtest.sh
+++ b/tools/testing/selftests/vsock/vmtest.sh
@@ -50,6 +50,10 @@ readonly TEST_NAMES=(
 	vm_loopback
 	ns_host_vsock_ns_mode_ok
 	ns_host_vsock_child_ns_mode_ok
+	ns_global_same_cid_fails
+	ns_local_same_cid_ok
+	ns_global_local_same_cid_ok
+	ns_local_global_same_cid_ok
 )
 readonly TEST_DESCS=(
 	# vm_server_host_client
@@ -66,6 +70,18 @@ readonly TEST_DESCS=(
 
 	# ns_host_vsock_child_ns_mode_ok
 	"Check /proc/sys/net/vsock/ns_mode is read-only and child_ns_mode is writable."
+
+	# ns_global_same_cid_fails
+	"Check QEMU fails to start two VMs with same CID in two different global namespaces."
+
+	# ns_local_same_cid_ok
+	"Check QEMU successfully starts two VMs with same CID in two different local namespaces."
+
+	# ns_global_local_same_cid_ok
+	"Check QEMU successfully starts one VM in a global ns and then another VM in a local ns with the same CID."
+
+	# ns_local_global_same_cid_ok
+	"Check QEMU successfully starts one VM in a local ns and then another VM in a global ns with the same CID."
 )
 
 readonly USE_SHARED_VM=(
@@ -577,6 +593,68 @@ test_ns_host_vsock_ns_mode_ok() {
 	return "${KSFT_PASS}"
 }
 
+namespaces_can_boot_same_cid() {
+	local ns0=$1
+	local ns1=$2
+	local pidfile1 pidfile2
+	local rc
+
+	pidfile1="$(create_pidfile)"
+
+	# The first VM should be able to start. If it can't then we have
+	# problems and need to return non-zero.
+	if ! vm_start "${pidfile1}" "${ns0}"; then
+		return 1
+	fi
+
+	pidfile2="$(create_pidfile)"
+	vm_start "${pidfile2}" "${ns1}"
+	rc=$?
+	terminate_pidfiles "${pidfile1}" "${pidfile2}"
+
+	return "${rc}"
+}
+
+test_ns_global_same_cid_fails() {
+	init_namespaces
+
+	if namespaces_can_boot_same_cid "global0" "global1"; then
+		return "${KSFT_FAIL}"
+	fi
+
+	return "${KSFT_PASS}"
+}
+
+test_ns_local_global_same_cid_ok() {
+	init_namespaces
+
+	if namespaces_can_boot_same_cid "local0" "global0"; then
+		return "${KSFT_PASS}"
+	fi
+
+	return "${KSFT_FAIL}"
+}
+
+test_ns_global_local_same_cid_ok() {
+	init_namespaces
+
+	if namespaces_can_boot_same_cid "global0" "local0"; then
+		return "${KSFT_PASS}"
+	fi
+
+	return "${KSFT_FAIL}"
+}
+
+test_ns_local_same_cid_ok() {
+	init_namespaces
+
+	if namespaces_can_boot_same_cid "local0" "local1"; then
+		return "${KSFT_PASS}"
+	fi
+
+	return "${KSFT_FAIL}"
+}
+
 test_ns_host_vsock_child_ns_mode_ok() {
 	local orig_mode
 	local rc

-- 
2.47.3



* [PATCH net-next v16 09/12] selftests/vsock: add tests for proc sys vsock ns_mode
From: Bobby Eshleman @ 2026-01-21 22:11 UTC (permalink / raw)
  To: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Stefan Hajnoczi, Michael S. Tsirkin,
	Jason Wang, Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	Broadcom internal kernel review list, Shuah Khan, Long Li,
	Jonathan Corbet
  Cc: linux-kernel, virtualization, netdev, kvm, linux-hyperv,
	linux-kselftest, berrange, Sargun Dhillon, linux-doc,
	Bobby Eshleman, Bobby Eshleman
In-Reply-To: <20260121-vsock-vmtest-v16-0-2859a7512097@meta.com>

From: Bobby Eshleman <bobbyeshleman@meta.com>

Add tests for the /proc/sys/net/vsock/{ns_mode,child_ns_mode}
interfaces. Specifically, check that they accept and report the "global"
and "local" strings and that they enforce their access policies.

Start a convention of commenting the test name over the test
description. Add test name comments over test descriptions that existed
before this convention.

Add a check_netns() function that checks if the test requires namespaces
and if the current kernel supports namespaces. Skip tests that require
namespaces if the system does not have namespace support.
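
The skip decision can be sketched on its own as below. The function
names test_needs_netns() and netns_supported() are hypothetical and only
split the check into two parts for illustration; the check_netns() added
by this patch does both in one function and logs through log_host.

```bash
#!/usr/bin/env bash
# Hypothetical helper: namespace tests are recognized by the ns_ prefix.
test_needs_netns() {
	local tname=$1

	[[ "${tname}" =~ ^ns_ ]]
}

# Hypothetical helper: same probe as the patch, the /proc/self/ns directory.
netns_supported() {
	[[ -e /proc/self/ns ]]
}

for t in vm_loopback ns_host_vsock_ns_mode_ok; do
	if test_needs_netns "${t}" && ! netns_supported; then
		echo "skip ${t}"
	else
		echo "run ${t}"
	fi
done
```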

This patch is the first to add tests that do *not* re-use the same
shared VM. For that reason, it adds a run_ns_tests() function to run
these tests and filter out the shared VM tests.

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v13:
- remove write-once test ns_host_vsock_ns_mode_write_once_ok to reflect
  removing the write-once policy
- add child_ns_mode test test_ns_host_vsock_child_ns_mode_ok
- modify test_ns_host_vsock_ns_mode_ok() to check that the correct mode
  was inherited from child_ns_mode

Changes in v12:
- remove ns_vm_local_mode_rejected test, due to dropping that constraint

Changes in v11:
- Document ns_ prefix above TEST_NAMES (Stefano)

Changes in v10:
- Remove extraneous add_namespaces/del_namespaces calls.
- Rename run_tests() to run_ns_tests() since it is designed to only
  run ns tests.

Changes in v9:
- add test ns_vm_local_mode_rejected to check that guests cannot use
  local mode
---
 tools/testing/selftests/vsock/vmtest.sh | 140 +++++++++++++++++++++++++++++++-
 1 file changed, 138 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/vsock/vmtest.sh b/tools/testing/selftests/vsock/vmtest.sh
index 0e681d4c3a15..38785a102236 100755
--- a/tools/testing/selftests/vsock/vmtest.sh
+++ b/tools/testing/selftests/vsock/vmtest.sh
@@ -41,14 +41,38 @@ readonly KERNEL_CMDLINE="\
 	virtme.ssh virtme_ssh_channel=tcp virtme_ssh_user=$USER \
 "
 readonly LOG=$(mktemp /tmp/vsock_vmtest_XXXX.log)
-readonly TEST_NAMES=(vm_server_host_client vm_client_host_server vm_loopback)
+
+# Namespace tests must use the ns_ prefix. This is checked in check_netns() and
+# is used to determine if a test needs namespace setup before test execution.
+readonly TEST_NAMES=(
+	vm_server_host_client
+	vm_client_host_server
+	vm_loopback
+	ns_host_vsock_ns_mode_ok
+	ns_host_vsock_child_ns_mode_ok
+)
 readonly TEST_DESCS=(
+	# vm_server_host_client
 	"Run vsock_test in server mode on the VM and in client mode on the host."
+
+	# vm_client_host_server
 	"Run vsock_test in client mode on the VM and in server mode on the host."
+
+	# vm_loopback
 	"Run vsock_test using the loopback transport in the VM."
+
+	# ns_host_vsock_ns_mode_ok
+	"Check /proc/sys/net/vsock/ns_mode strings on the host."
+
+	# ns_host_vsock_child_ns_mode_ok
+	"Check /proc/sys/net/vsock/ns_mode is read-only and child_ns_mode is writable."
 )
 
-readonly USE_SHARED_VM=(vm_server_host_client vm_client_host_server vm_loopback)
+readonly USE_SHARED_VM=(
+	vm_server_host_client
+	vm_client_host_server
+	vm_loopback
+)
 readonly NS_MODES=("local" "global")
 
 VERBOSE=0
@@ -196,6 +220,20 @@ check_deps() {
 	fi
 }
 
+check_netns() {
+	local tname=$1
+
+	# If the test requires NS support, check if NS support exists
+	# using /proc/self/ns
+	if [[ "${tname}" =~ ^ns_ ]] &&
+	   [[ ! -e /proc/self/ns ]]; then
+		log_host "No NS support detected for test ${tname}"
+		return 1
+	fi
+
+	return 0
+}
+
 check_vng() {
 	local tested_versions
 	local version
@@ -519,6 +557,54 @@ log_guest() {
 	LOG_PREFIX=guest log "$@"
 }
 
+ns_get_mode() {
+	local ns=$1
+
+	ip netns exec "${ns}" cat /proc/sys/net/vsock/ns_mode 2>/dev/null
+}
+
+test_ns_host_vsock_ns_mode_ok() {
+	for mode in "${NS_MODES[@]}"; do
+		local actual
+
+		actual=$(ns_get_mode "${mode}0")
+		if [[ "${actual}" != "${mode}" ]]; then
+			log_host "expected mode ${mode}, got ${actual}"
+			return "${KSFT_FAIL}"
+		fi
+	done
+
+	return "${KSFT_PASS}"
+}
+
+test_ns_host_vsock_child_ns_mode_ok() {
+	local orig_mode
+	local rc
+
+	orig_mode=$(cat /proc/sys/net/vsock/child_ns_mode)
+
+	rc="${KSFT_PASS}"
+	for mode in "${NS_MODES[@]}"; do
+		local ns="${mode}0"
+
+		if echo "${mode}" 2>/dev/null > /proc/sys/net/vsock/ns_mode; then
+			log_host "ns_mode should be read-only but write succeeded"
+			rc="${KSFT_FAIL}"
+			continue
+		fi
+
+		if ! echo "${mode}" > /proc/sys/net/vsock/child_ns_mode; then
+			log_host "child_ns_mode should be writable to ${mode}"
+			rc="${KSFT_FAIL}"
+			continue
+		fi
+	done
+
+	echo "${orig_mode}" > /proc/sys/net/vsock/child_ns_mode
+
+	return "${rc}"
+}
+
 test_vm_server_host_client() {
 	if ! vm_vsock_test "init_ns" "server" 2 "${TEST_GUEST_PORT}"; then
 		return "${KSFT_FAIL}"
@@ -592,6 +678,11 @@ run_shared_vm_tests() {
 			continue
 		fi
 
+		if ! check_netns "${arg}"; then
+			check_result "${KSFT_SKIP}" "${arg}"
+			continue
+		fi
+
 		run_shared_vm_test "${arg}"
 		check_result "$?" "${arg}"
 	done
@@ -645,6 +736,49 @@ run_shared_vm_test() {
 	return "${rc}"
 }
 
+run_ns_tests() {
+	for arg in "${ARGS[@]}"; do
+		if shared_vm_test "${arg}"; then
+			continue
+		fi
+
+		if ! check_netns "${arg}"; then
+			check_result "${KSFT_SKIP}" "${arg}"
+			continue
+		fi
+
+		add_namespaces
+
+		name=$(echo "${arg}" | awk '{ print $1 }')
+		log_host "Executing test_${name}"
+
+		host_oops_before=$(dmesg 2>/dev/null | grep -c -i 'Oops')
+		host_warn_before=$(dmesg --level=warn 2>/dev/null | grep -c -i 'vsock')
+		eval test_"${name}"
+		rc=$?
+
+		host_oops_after=$(dmesg 2>/dev/null | grep -c -i 'Oops')
+		if [[ "${host_oops_after}" -gt "${host_oops_before}" ]]; then
+			echo "FAIL: kernel oops detected on host" | log_host
+			check_result "${KSFT_FAIL}" "${name}"
+			del_namespaces
+			continue
+		fi
+
+		host_warn_after=$(dmesg --level=warn 2>/dev/null | grep -c -i 'vsock')
+		if [[ "${host_warn_after}" -gt "${host_warn_before}" ]]; then
+			echo "FAIL: kernel warning detected on host" | log_host
+			check_result "${KSFT_FAIL}" "${name}"
+			del_namespaces
+			continue
+		fi
+
+		check_result "${rc}" "${name}"
+
+		del_namespaces
+	done
+}
+
 BUILD=0
 QEMU="qemu-system-$(uname -m)"
 
@@ -690,6 +824,8 @@ if shared_vm_tests_requested "${ARGS[@]}"; then
 	terminate_pidfiles "${pidfile}"
 fi
 
+run_ns_tests "${ARGS[@]}"
+
 echo "SUMMARY: PASS=${cnt_pass} SKIP=${cnt_skip} FAIL=${cnt_fail}"
 echo "Log: ${LOG}"
 

-- 
2.47.3



* [PATCH net-next v16 08/12] selftests/vsock: use ss to wait for listeners instead of /proc/net
From: Bobby Eshleman @ 2026-01-21 22:11 UTC (permalink / raw)
  To: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Stefan Hajnoczi, Michael S. Tsirkin,
	Jason Wang, Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	Broadcom internal kernel review list, Shuah Khan, Long Li,
	Jonathan Corbet
  Cc: linux-kernel, virtualization, netdev, kvm, linux-hyperv,
	linux-kselftest, berrange, Sargun Dhillon, linux-doc,
	Bobby Eshleman, Bobby Eshleman
In-Reply-To: <20260121-vsock-vmtest-v16-0-2859a7512097@meta.com>

From: Bobby Eshleman <bobbyeshleman@meta.com>

Replace /proc/net parsing with ss(8) for detecting listening sockets in
wait_for_listener() functions and add support for TCP, VSOCK, and Unix
socket protocols.

The previous implementation parsed /proc/net/tcp using awk to detect
listening sockets, but this approach could not support vsock because
vsock does not export socket information to /proc/net/.

Instead, use ss so that we can detect listeners on tcp, vsock, and unix.

The protocol parameter is now required for all wait_for_listener family
functions (wait_for_listener, vm_wait_for_listener,
host_wait_for_listener) to explicitly specify which socket type to wait
for.

ss is added to the dependency check in check_deps().
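
The port match used for tcp and vsock can be tried without a live socket
by feeding canned text to the same grep pattern. In the sketch below,
listing_has_port() is a hypothetical helper that is not part of this
patch, and the sample text is illustrative only, merely shaped like
`ss --listening --tcp --numeric` output.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in vmtest.sh): read ss-style text on stdin and
# succeed if some address ends in ":<port> ".
listing_has_port() {
	local port=$1

	grep -q ":${port} "
}

# Illustrative sample, not captured from a real system.
sample='State  Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0      128          0.0.0.0:22        0.0.0.0:*
LISTEN 0      5          127.0.0.1:51000     0.0.0.0:*'

if printf '%s\n' "${sample}" | listing_has_port 51000; then
	echo "listener on 51000"
fi

if ! printf '%s\n' "${sample}" | listing_has_port 2; then
	echo "no listener on 2"
fi
```

The leading colon and trailing space in the pattern are what keep port 2
from matching the listener on port 22.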

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
 tools/testing/selftests/vsock/vmtest.sh | 47 +++++++++++++++++++++------------
 1 file changed, 30 insertions(+), 17 deletions(-)

diff --git a/tools/testing/selftests/vsock/vmtest.sh b/tools/testing/selftests/vsock/vmtest.sh
index 4b5929ffc9eb..0e681d4c3a15 100755
--- a/tools/testing/selftests/vsock/vmtest.sh
+++ b/tools/testing/selftests/vsock/vmtest.sh
@@ -182,7 +182,7 @@ check_args() {
 }
 
 check_deps() {
-	for dep in vng ${QEMU} busybox pkill ssh; do
+	for dep in vng ${QEMU} busybox pkill ssh ss; do
 		if [[ ! -x $(command -v "${dep}") ]]; then
 			echo -e "skip:    dependency ${dep} not found!\n"
 			exit "${KSFT_SKIP}"
@@ -337,21 +337,32 @@ wait_for_listener()
 	local port=$1
 	local interval=$2
 	local max_intervals=$3
-	local protocol=tcp
-	local pattern
+	local protocol=$4
 	local i
 
-	pattern=":$(printf "%04X" "${port}") "
-
-	# for tcp protocol additionally check the socket state
-	[ "${protocol}" = "tcp" ] && pattern="${pattern}0A"
-
 	for i in $(seq "${max_intervals}"); do
-		if awk -v pattern="${pattern}" \
-			'BEGIN {rc=1} $2" "$4 ~ pattern {rc=0} END {exit rc}' \
-			/proc/net/"${protocol}"*; then
+		case "${protocol}" in
+		tcp)
+			if ss --listening --tcp --numeric | grep -q ":${port} "; then
+				break
+			fi
+			;;
+		vsock)
+			if ss --listening --vsock --numeric | grep -q ":${port} "; then
+				break
+			fi
+			;;
+		unix)
+			# For unix sockets, port is actually the socket path
+			if ss --listening --unix | grep -q "${port}"; then
+				break
+			fi
+			;;
+		*)
+			echo "Unknown protocol: ${protocol}" >&2
 			break
-		fi
+			;;
+		esac
 		sleep "${interval}"
 	done
 }
@@ -359,23 +370,25 @@ wait_for_listener()
 vm_wait_for_listener() {
 	local ns=$1
 	local port=$2
+	local protocol=$3
 
 	vm_ssh "${ns}" <<EOF
 $(declare -f wait_for_listener)
-wait_for_listener ${port} ${WAIT_PERIOD} ${WAIT_PERIOD_MAX}
+wait_for_listener ${port} ${WAIT_PERIOD} ${WAIT_PERIOD_MAX} ${protocol}
 EOF
 }
 
 host_wait_for_listener() {
 	local ns=$1
 	local port=$2
+	local protocol=$3
 
 	if [[ "${ns}" == "init_ns" ]]; then
-		wait_for_listener "${port}" "${WAIT_PERIOD}" "${WAIT_PERIOD_MAX}"
+		wait_for_listener "${port}" "${WAIT_PERIOD}" "${WAIT_PERIOD_MAX}" "${protocol}"
 	else
 		ip netns exec "${ns}" bash <<-EOF
 			$(declare -f wait_for_listener)
-			wait_for_listener ${port} ${WAIT_PERIOD} ${WAIT_PERIOD_MAX}
+			wait_for_listener ${port} ${WAIT_PERIOD} ${WAIT_PERIOD_MAX} ${protocol}
 		EOF
 	fi
 }
@@ -422,7 +435,7 @@ vm_vsock_test() {
 			return $rc
 		fi
 
-		vm_wait_for_listener "${ns}" "${port}"
+		vm_wait_for_listener "${ns}" "${port}" "tcp"
 		rc=$?
 	fi
 	set +o pipefail
@@ -463,7 +476,7 @@ host_vsock_test() {
 			return $rc
 		fi
 
-		host_wait_for_listener "${ns}" "${port}"
+		host_wait_for_listener "${ns}" "${port}" "tcp"
 		rc=$?
 	fi
 	set +o pipefail

-- 
2.47.3



* [PATCH net-next v16 07/12] selftests/vsock: add vm_dmesg_{warn,oops}_count() helpers
From: Bobby Eshleman @ 2026-01-21 22:11 UTC (permalink / raw)
  To: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Stefan Hajnoczi, Michael S. Tsirkin,
	Jason Wang, Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	Broadcom internal kernel review list, Shuah Khan, Long Li,
	Jonathan Corbet
  Cc: linux-kernel, virtualization, netdev, kvm, linux-hyperv,
	linux-kselftest, berrange, Sargun Dhillon, linux-doc,
	Bobby Eshleman, Bobby Eshleman
In-Reply-To: <20260121-vsock-vmtest-v16-0-2859a7512097@meta.com>

From: Bobby Eshleman <bobbyeshleman@meta.com>

These functions are reused by the VM tests to collect and compare dmesg
warning and oops counts. The future VM-specific tests use them heavily.
This patch relies on vm_ssh() already supporting namespaces.
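
The before/after comparison these helpers feed can be sketched on canned
text. count_matches() is a hypothetical helper that is not part of this
patch, and the log lines are illustrative rather than real dmesg output.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in vmtest.sh): count case-insensitive matches
# in text read from stdin. grep -c still prints 0 when nothing matches
# but exits non-zero, so the status is masked here.
count_matches() {
	local pattern=$1

	grep -c -i "${pattern}" || :
}

# Illustrative log text standing in for two dmesg snapshots.
before=$(printf '%s\n' "boot ok" | count_matches 'Oops')
after=$(printf '%s\n' "boot ok" "Oops: general protection fault" |
	count_matches 'Oops')

if [[ "${after}" -gt "${before}" ]]; then
	echo "FAIL: new oops lines: $((after - before))"
fi
```

Comparing counts rather than checking for any match means that messages
already in the log before the test started do not fail the test.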

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v11:
- break these out into an earlier patch so that they can be used
  directly in new patches (instead of causing churn by adding this
  later)
---
 tools/testing/selftests/vsock/vmtest.sh | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/vsock/vmtest.sh b/tools/testing/selftests/vsock/vmtest.sh
index c4d73dd0a4cf..4b5929ffc9eb 100755
--- a/tools/testing/selftests/vsock/vmtest.sh
+++ b/tools/testing/selftests/vsock/vmtest.sh
@@ -380,6 +380,17 @@ host_wait_for_listener() {
 	fi
 }
 
+vm_dmesg_oops_count() {
+	local ns=$1
+
+	vm_ssh "${ns}" -- dmesg 2>/dev/null | grep -c -i 'Oops'
+}
+
+vm_dmesg_warn_count() {
+	local ns=$1
+
+	vm_ssh "${ns}" -- dmesg --level=warn 2>/dev/null | grep -c -i 'vsock'
+}
 
 vm_vsock_test() {
 	local ns=$1
@@ -587,8 +598,8 @@ run_shared_vm_test() {
 
 	host_oops_cnt_before=$(dmesg | grep -c -i 'Oops')
 	host_warn_cnt_before=$(dmesg --level=warn | grep -c -i 'vsock')
-	vm_oops_cnt_before=$(vm_ssh "init_ns" -- dmesg | grep -c -i 'Oops')
-	vm_warn_cnt_before=$(vm_ssh "init_ns" -- dmesg --level=warn | grep -c -i 'vsock')
+	vm_oops_cnt_before=$(vm_dmesg_oops_count "init_ns")
+	vm_warn_cnt_before=$(vm_dmesg_warn_count "init_ns")
 
 	name=$(echo "${1}" | awk '{ print $1 }')
 	eval test_"${name}"
@@ -606,13 +617,13 @@ run_shared_vm_test() {
 		rc=$KSFT_FAIL
 	fi
 
-	vm_oops_cnt_after=$(vm_ssh "init_ns" -- dmesg | grep -i 'Oops' | wc -l)
+	vm_oops_cnt_after=$(vm_dmesg_oops_count "init_ns")
 	if [[ ${vm_oops_cnt_after} -gt ${vm_oops_cnt_before} ]]; then
 		echo "FAIL: kernel oops detected on vm" | log_host
 		rc=$KSFT_FAIL
 	fi
 
-	vm_warn_cnt_after=$(vm_ssh "init_ns" -- dmesg --level=warn | grep -c -i 'vsock')
+	vm_warn_cnt_after=$(vm_dmesg_warn_count "init_ns")
 	if [[ ${vm_warn_cnt_after} -gt ${vm_warn_cnt_before} ]]; then
 		echo "FAIL: kernel warning detected on vm" | log_host
 		rc=$KSFT_FAIL

-- 
2.47.3



* [PATCH net-next v16 05/12] selftests/vsock: add namespace helpers to vmtest.sh
From: Bobby Eshleman @ 2026-01-21 22:11 UTC (permalink / raw)
  To: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Stefan Hajnoczi, Michael S. Tsirkin,
	Jason Wang, Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	Broadcom internal kernel review list, Shuah Khan, Long Li,
	Jonathan Corbet
  Cc: linux-kernel, virtualization, netdev, kvm, linux-hyperv,
	linux-kselftest, berrange, Sargun Dhillon, linux-doc,
	Bobby Eshleman, Bobby Eshleman
In-Reply-To: <20260121-vsock-vmtest-v16-0-2859a7512097@meta.com>

From: Bobby Eshleman <bobbyeshleman@meta.com>

Add functions for initializing namespaces with the different vsock NS
modes. Callers can use add_namespaces() and del_namespaces() to create
namespaces global0, global1, local0, and local1.

The add_namespaces() function initializes each namespace (global0,
local0, etc.) with its respective vsock NS mode by writing that mode to
child_ns_mode before creating the namespace.
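
The save/toggle/restore sequence can be sketched without root as
follows. This is an illustration only: MODE_FILE is a temporary file
standing in for /proc/sys/net/vsock/child_ns_mode, create_ns() is a
hypothetical stand-in for 'ip netns add' that records which mode was
active at creation time, and the initial value "local" is arbitrary.

```shell
#!/usr/bin/env bash

readonly NS_MODES=("local" "global")

# Assumption: a temp file replaces the real sysctl so this runs unprivileged.
MODE_FILE=$(mktemp)
echo "local" > "${MODE_FILE}"

created=()

# Hypothetical stand-in for 'ip netns add': remember the name and the
# mode that was in effect when the namespace was "created".
create_ns() {
	created+=("$1:$(cat "${MODE_FILE}")")
}

add_namespaces_sketch() {
	local orig_mode
	local mode

	# Save the current mode so it can be restored afterwards.
	orig_mode=$(cat "${MODE_FILE}")

	for mode in "${NS_MODES[@]}"; do
		# New namespaces inherit whatever mode is set at creation.
		echo "${mode}" > "${MODE_FILE}"
		create_ns "${mode}0"
		create_ns "${mode}1"
	done

	echo "${orig_mode}" > "${MODE_FILE}"
}

add_namespaces_sketch
printf '%s\n' "${created[@]}"

restored=$(cat "${MODE_FILE}")
rm -f "${MODE_FILE}"
```

Each recorded entry pairs a namespace name with the mode that was set
when it was created, and the mode file ends up back at its original
value.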

Remove namespaces upon exiting the program in cleanup(). This is
unlikely to be needed for a healthy run, but it is useful for tests that
are manually killed mid-test.

This patch is in preparation for later namespace tests.

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v13:
- initialize namespaces to use the child_ns_mode mechanism
- remove setting modes from init_namespaces() function (this function
  only sets up the lo device now)
- remove ns_set_mode(ns) because ns_mode is no longer mutable
---
 tools/testing/selftests/vsock/vmtest.sh | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/tools/testing/selftests/vsock/vmtest.sh b/tools/testing/selftests/vsock/vmtest.sh
index c7b270dd77a9..c2bdc293b94c 100755
--- a/tools/testing/selftests/vsock/vmtest.sh
+++ b/tools/testing/selftests/vsock/vmtest.sh
@@ -49,6 +49,7 @@ readonly TEST_DESCS=(
 )
 
 readonly USE_SHARED_VM=(vm_server_host_client vm_client_host_server vm_loopback)
+readonly NS_MODES=("local" "global")
 
 VERBOSE=0
 
@@ -103,6 +104,36 @@ check_result() {
 	fi
 }
 
+add_namespaces() {
+	local orig_mode
+	orig_mode=$(cat /proc/sys/net/vsock/child_ns_mode)
+
+	for mode in "${NS_MODES[@]}"; do
+		echo "${mode}" > /proc/sys/net/vsock/child_ns_mode
+		ip netns add "${mode}0" 2>/dev/null
+		ip netns add "${mode}1" 2>/dev/null
+	done
+
+	echo "${orig_mode}" > /proc/sys/net/vsock/child_ns_mode
+}
+
+init_namespaces() {
+	for mode in "${NS_MODES[@]}"; do
+		# we need lo for qemu port forwarding
+		ip netns exec "${mode}0" ip link set dev lo up
+		ip netns exec "${mode}1" ip link set dev lo up
+	done
+}
+
+del_namespaces() {
+	for mode in "${NS_MODES[@]}"; do
+		ip netns del "${mode}0" &>/dev/null
+		ip netns del "${mode}1" &>/dev/null
+		log_host "removed ns ${mode}0"
+		log_host "removed ns ${mode}1"
+	done
+}
+
 vm_ssh() {
 	ssh -q -o UserKnownHostsFile=/dev/null -p ${SSH_HOST_PORT} localhost "$@"
 	return $?
@@ -110,6 +141,7 @@ vm_ssh() {
 
 cleanup() {
 	terminate_pidfiles "${!PIDFILES[@]}"
+	del_namespaces
 }
 
 check_args() {

-- 
2.47.3



* [PATCH net-next v16 06/12] selftests/vsock: prepare vm management helpers for namespaces
From: Bobby Eshleman @ 2026-01-21 22:11 UTC (permalink / raw)
  To: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Stefan Hajnoczi, Michael S. Tsirkin,
	Jason Wang, Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	Broadcom internal kernel review list, Shuah Khan, Long Li,
	Jonathan Corbet
  Cc: linux-kernel, virtualization, netdev, kvm, linux-hyperv,
	linux-kselftest, berrange, Sargun Dhillon, linux-doc,
	Bobby Eshleman, Bobby Eshleman
In-Reply-To: <20260121-vsock-vmtest-v16-0-2859a7512097@meta.com>

From: Bobby Eshleman <bobbyeshleman@meta.com>

Add namespace support to vm management, ssh helpers, and vsock_test
wrapper functions. This enables running VMs and test helpers in specific
namespaces, which is required for upcoming namespace isolation tests.

The functions still work correctly within the init ns, though the caller
must now pass "init_ns" explicitly.

No functional changes for existing tests. All have been updated to pass
"init_ns" explicitly.

Affected functions (such as vm_start() and vm_ssh()) now wrap their
commands with 'ip netns exec' when executing commands in non-init
namespaces.
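
A rough sketch of the wrapping pattern, for illustration only. The patch
itself builds the prefix as a string and expands it unquoted; the array
form here is a variation, not what the patch does. run_in_ns() and
NETNS_EXEC are hypothetical names, and NETNS_EXEC starts with 'echo' so
the sketch prints the command line instead of requiring root.

```shell
#!/usr/bin/env bash

# Dry run: print the wrapped command instead of executing it.
# Assumption: drop the leading 'echo' to actually enter the namespace.
NETNS_EXEC=(echo ip netns exec)

# Run a command in the given namespace, or directly for "init_ns".
run_in_ns() {
	local ns=$1
	local -a prefix=()

	shift

	if [[ "${ns}" != "init_ns" ]]; then
		prefix=("${NETNS_EXEC[@]}" "${ns}")
	fi

	# An empty array expands to nothing, so the init_ns case runs
	# the command unchanged.
	"${prefix[@]}" "$@"
}

run_in_ns "init_ns" printf '%s\n' "ran in init ns"
run_in_ns "local0" ssh -q localhost true
```

With the dry-run prefix, the first call executes printf directly and the
second prints the 'ip netns exec local0 ...' command line it would run.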

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v16:
- add "init_ns" to vm_ssh calls in run_shared_vm_tests
---
 tools/testing/selftests/vsock/vmtest.sh | 101 ++++++++++++++++++++++----------
 1 file changed, 69 insertions(+), 32 deletions(-)

diff --git a/tools/testing/selftests/vsock/vmtest.sh b/tools/testing/selftests/vsock/vmtest.sh
index c2bdc293b94c..c4d73dd0a4cf 100755
--- a/tools/testing/selftests/vsock/vmtest.sh
+++ b/tools/testing/selftests/vsock/vmtest.sh
@@ -135,7 +135,18 @@ del_namespaces() {
 }
 
 vm_ssh() {
-	ssh -q -o UserKnownHostsFile=/dev/null -p ${SSH_HOST_PORT} localhost "$@"
+	local ns_exec
+
+	if [[ "${1}" == init_ns ]]; then
+		ns_exec=""
+	else
+		ns_exec="ip netns exec ${1}"
+	fi
+
+	shift
+
+	${ns_exec} ssh -q -o UserKnownHostsFile=/dev/null -p "${SSH_HOST_PORT}" localhost "$@"
+
 	return $?
 }
 
@@ -258,10 +269,12 @@ terminate_pidfiles() {
 
 vm_start() {
 	local pidfile=$1
+	local ns=$2
 	local logfile=/dev/null
 	local verbose_opt=""
 	local kernel_opt=""
 	local qemu_opts=""
+	local ns_exec=""
 	local qemu
 
 	qemu=$(command -v "${QEMU}")
@@ -282,7 +295,11 @@ vm_start() {
 		kernel_opt="${KERNEL_CHECKOUT}"
 	fi
 
-	vng \
+	if [[ "${ns}" != "init_ns" ]]; then
+		ns_exec="ip netns exec ${ns}"
+	fi
+
+	${ns_exec} vng \
 		--run \
 		${kernel_opt} \
 		${verbose_opt} \
@@ -297,6 +314,7 @@ vm_start() {
 }
 
 vm_wait_for_ssh() {
+	local ns=$1
 	local i
 
 	i=0
@@ -304,7 +322,8 @@ vm_wait_for_ssh() {
 		if [[ ${i} -gt ${WAIT_PERIOD_MAX} ]]; then
 			die "Timed out waiting for guest ssh"
 		fi
-		if vm_ssh -- true; then
+
+		if vm_ssh "${ns}" -- true; then
 			break
 		fi
 		i=$(( i + 1 ))
@@ -338,30 +357,41 @@ wait_for_listener()
 }
 
 vm_wait_for_listener() {
-	local port=$1
+	local ns=$1
+	local port=$2
 
-	vm_ssh <<EOF
+	vm_ssh "${ns}" <<EOF
 $(declare -f wait_for_listener)
 wait_for_listener ${port} ${WAIT_PERIOD} ${WAIT_PERIOD_MAX}
 EOF
 }
 
 host_wait_for_listener() {
-	local port=$1
+	local ns=$1
+	local port=$2
 
-	wait_for_listener "${port}" "${WAIT_PERIOD}" "${WAIT_PERIOD_MAX}"
+	if [[ "${ns}" == "init_ns" ]]; then
+		wait_for_listener "${port}" "${WAIT_PERIOD}" "${WAIT_PERIOD_MAX}"
+	else
+		ip netns exec "${ns}" bash <<-EOF
+			$(declare -f wait_for_listener)
+			wait_for_listener ${port} ${WAIT_PERIOD} ${WAIT_PERIOD_MAX}
+		EOF
+	fi
 }
 
+
 vm_vsock_test() {
-	local host=$1
-	local cid=$2
-	local port=$3
+	local ns=$1
+	local host=$2
+	local cid=$3
+	local port=$4
 	local rc
 
 	# log output and use pipefail to respect vsock_test errors
 	set -o pipefail
 	if [[ "${host}" != server ]]; then
-		vm_ssh -- "${VSOCK_TEST}" \
+		vm_ssh "${ns}" -- "${VSOCK_TEST}" \
 			--mode=client \
 			--control-host="${host}" \
 			--peer-cid="${cid}" \
@@ -369,7 +399,7 @@ vm_vsock_test() {
 			2>&1 | log_guest
 		rc=$?
 	else
-		vm_ssh -- "${VSOCK_TEST}" \
+		vm_ssh "${ns}" -- "${VSOCK_TEST}" \
 			--mode=server \
 			--peer-cid="${cid}" \
 			--control-port="${port}" \
@@ -381,7 +411,7 @@ vm_vsock_test() {
 			return $rc
 		fi
 
-		vm_wait_for_listener "${port}"
+		vm_wait_for_listener "${ns}" "${port}"
 		rc=$?
 	fi
 	set +o pipefail
@@ -390,22 +420,28 @@ vm_vsock_test() {
 }
 
 host_vsock_test() {
-	local host=$1
-	local cid=$2
-	local port=$3
+	local ns=$1
+	local host=$2
+	local cid=$3
+	local port=$4
 	local rc
 
+	local cmd="${VSOCK_TEST}"
+	if [[ "${ns}" != "init_ns" ]]; then
+		cmd="ip netns exec ${ns} ${cmd}"
+	fi
+
 	# log output and use pipefail to respect vsock_test errors
 	set -o pipefail
 	if [[ "${host}" != server ]]; then
-		${VSOCK_TEST} \
+		${cmd} \
 			--mode=client \
 			--peer-cid="${cid}" \
 			--control-host="${host}" \
 			--control-port="${port}" 2>&1 | log_host
 		rc=$?
 	else
-		${VSOCK_TEST} \
+		${cmd} \
 			--mode=server \
 			--peer-cid="${cid}" \
 			--control-port="${port}" 2>&1 | log_host &
@@ -416,7 +452,7 @@ host_vsock_test() {
 			return $rc
 		fi
 
-		host_wait_for_listener "${port}"
+		host_wait_for_listener "${ns}" "${port}"
 		rc=$?
 	fi
 	set +o pipefail
@@ -460,11 +496,11 @@ log_guest() {
 }
 
 test_vm_server_host_client() {
-	if ! vm_vsock_test "server" 2 "${TEST_GUEST_PORT}"; then
+	if ! vm_vsock_test "init_ns" "server" 2 "${TEST_GUEST_PORT}"; then
 		return "${KSFT_FAIL}"
 	fi
 
-	if ! host_vsock_test "127.0.0.1" "${VSOCK_CID}" "${TEST_HOST_PORT}"; then
+	if ! host_vsock_test "init_ns" "127.0.0.1" "${VSOCK_CID}" "${TEST_HOST_PORT}"; then
 		return "${KSFT_FAIL}"
 	fi
 
@@ -472,11 +508,11 @@ test_vm_server_host_client() {
 }
 
 test_vm_client_host_server() {
-	if ! host_vsock_test "server" "${VSOCK_CID}" "${TEST_HOST_PORT_LISTENER}"; then
+	if ! host_vsock_test "init_ns" "server" "${VSOCK_CID}" "${TEST_HOST_PORT_LISTENER}"; then
 		return "${KSFT_FAIL}"
 	fi
 
-	if ! vm_vsock_test "10.0.2.2" 2 "${TEST_HOST_PORT_LISTENER}"; then
+	if ! vm_vsock_test "init_ns" "10.0.2.2" 2 "${TEST_HOST_PORT_LISTENER}"; then
 		return "${KSFT_FAIL}"
 	fi
 
@@ -486,13 +522,14 @@ test_vm_client_host_server() {
 test_vm_loopback() {
 	local port=60000 # non-forwarded local port
 
-	vm_ssh -- modprobe vsock_loopback &> /dev/null || :
+	vm_ssh "init_ns" -- modprobe vsock_loopback &> /dev/null || :
 
-	if ! vm_vsock_test "server" 1 "${port}"; then
+	if ! vm_vsock_test "init_ns" "server" 1 "${port}"; then
 		return "${KSFT_FAIL}"
 	fi
 
-	if ! vm_vsock_test "127.0.0.1" 1 "${port}"; then
+
+	if ! vm_vsock_test "init_ns" "127.0.0.1" 1 "${port}"; then
 		return "${KSFT_FAIL}"
 	fi
 
@@ -550,8 +587,8 @@ run_shared_vm_test() {
 
 	host_oops_cnt_before=$(dmesg | grep -c -i 'Oops')
 	host_warn_cnt_before=$(dmesg --level=warn | grep -c -i 'vsock')
-	vm_oops_cnt_before=$(vm_ssh -- dmesg | grep -c -i 'Oops')
-	vm_warn_cnt_before=$(vm_ssh -- dmesg --level=warn | grep -c -i 'vsock')
+	vm_oops_cnt_before=$(vm_ssh "init_ns" -- dmesg | grep -c -i 'Oops')
+	vm_warn_cnt_before=$(vm_ssh "init_ns" -- dmesg --level=warn | grep -c -i 'vsock')
 
 	name=$(echo "${1}" | awk '{ print $1 }')
 	eval test_"${name}"
@@ -569,13 +606,13 @@ run_shared_vm_test() {
 		rc=$KSFT_FAIL
 	fi
 
-	vm_oops_cnt_after=$(vm_ssh -- dmesg | grep -i 'Oops' | wc -l)
+	vm_oops_cnt_after=$(vm_ssh "init_ns" -- dmesg | grep -i 'Oops' | wc -l)
 	if [[ ${vm_oops_cnt_after} -gt ${vm_oops_cnt_before} ]]; then
 		echo "FAIL: kernel oops detected on vm" | log_host
 		rc=$KSFT_FAIL
 	fi
 
-	vm_warn_cnt_after=$(vm_ssh -- dmesg --level=warn | grep -c -i 'vsock')
+	vm_warn_cnt_after=$(vm_ssh "init_ns" -- dmesg --level=warn | grep -c -i 'vsock')
 	if [[ ${vm_warn_cnt_after} -gt ${vm_warn_cnt_before} ]]; then
 		echo "FAIL: kernel warning detected on vm" | log_host
 		rc=$KSFT_FAIL
@@ -621,8 +658,8 @@ cnt_total=0
 if shared_vm_tests_requested "${ARGS[@]}"; then
 	log_host "Booting up VM"
 	pidfile="$(create_pidfile)"
-	vm_start "${pidfile}"
-	vm_wait_for_ssh
+	vm_start "${pidfile}" "init_ns"
+	vm_wait_for_ssh "init_ns"
 	log_host "VM booted up"
 
 	run_shared_vm_tests "${ARGS[@]}"

-- 
2.47.3



* [PATCH net-next v16 04/12] selftests/vsock: increase timeout to 1200
From: Bobby Eshleman @ 2026-01-21 22:11 UTC (permalink / raw)
  To: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Stefan Hajnoczi, Michael S. Tsirkin,
	Jason Wang, Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	Broadcom internal kernel review list, Shuah Khan, Long Li,
	Jonathan Corbet
  Cc: linux-kernel, virtualization, netdev, kvm, linux-hyperv,
	linux-kselftest, berrange, Sargun Dhillon, linux-doc,
	Bobby Eshleman, Bobby Eshleman
In-Reply-To: <20260121-vsock-vmtest-v16-0-2859a7512097@meta.com>

From: Bobby Eshleman <bobbyeshleman@meta.com>

Increase the timeout from 300s to 1200s. On a modern bare metal server
my last run showed the new set of tests taking ~400s. Multiply by an
(arbitrary) factor of three to account for slower/nested runners.
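
The arithmetic behind the new value, as a small sketch (measured_s and
factor are illustrative variable names, not part of the selftests):

```shell
#!/usr/bin/env bash

measured_s=400	# approximate runtime observed on bare metal
factor=3	# arbitrary headroom for slower/nested runners

timeout=$(( measured_s * factor ))
echo "timeout=${timeout}"
```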

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
 tools/testing/selftests/vsock/settings | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/vsock/settings b/tools/testing/selftests/vsock/settings
index 694d70710ff0..79b65bdf05db 100644
--- a/tools/testing/selftests/vsock/settings
+++ b/tools/testing/selftests/vsock/settings
@@ -1 +1 @@
-timeout=300
+timeout=1200

-- 
2.47.3



