Linux-HyperV List


* Re: [PATCH net-next v12 02/12] vsock: add netns to vsock core
From: Michael S. Tsirkin @ 2026-01-11  9:16 UTC
  To: Bobby Eshleman
  Cc: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Stefan Hajnoczi, Jason Wang,
	Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	Broadcom internal kernel review list, Shuah Khan, linux-kernel,
	virtualization, netdev, kvm, linux-hyperv, linux-kselftest,
	berrange, Sargun Dhillon, Bobby Eshleman
In-Reply-To: <20251126-vsock-vmtest-v12-2-257ee21cd5de@meta.com>

On Wed, Nov 26, 2025 at 11:47:31PM -0800, Bobby Eshleman wrote:
> From: Bobby Eshleman <bobbyeshleman@meta.com>
> 
> Add netns logic to vsock core. Additionally, modify transport hook
> prototypes to be used by later transport-specific patches (e.g.,
> *_seqpacket_allow()).
> 
> Namespaces are supported primarily by changing socket lookup functions
> (e.g., vsock_find_connected_socket()) to take into account the socket
> namespace and the namespace mode before considering a candidate socket a
> "match".
> 
> This patch also introduces the sysctl /proc/sys/net/vsock/ns_mode that
> accepts the "global" or "local" mode strings.
> 
> Add netns functionality (initialization, passing to transports, procfs,
> etc...) to the af_vsock socket layer. Later patches that add netns
> support to transports depend on this patch.
> 
> dgram_allow(), stream_allow(), and seqpacket_allow() callbacks are
> modified to take a vsk in order to perform logic on namespace modes. In
> future patches, the net and net_mode will also be used for socket
> lookups in these functions.
> 
> Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
> ---

...

> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> index adcba1b7bf74..6113c22db8dc 100644
> --- a/net/vmw_vsock/af_vsock.c
> +++ b/net/vmw_vsock/af_vsock.c

...

> @@ -2658,6 +2745,142 @@ static struct miscdevice vsock_device = {
>  	.fops		= &vsock_device_ops,
>  };
>  
> +static int vsock_net_mode_string(const struct ctl_table *table, int write,
> +				 void *buffer, size_t *lenp, loff_t *ppos)
> +{
> +	char data[VSOCK_NET_MODE_STR_MAX] = {0};
> +	enum vsock_net_mode mode;
> +	struct ctl_table tmp;

nit: I think this file should now include linux/sysctl.h for this struct
definition?



* Re: [PATCH RFC net-next v13 03/13] virtio: set skb owner of virtio_transport_reset_no_sock() reply
From: Michael S. Tsirkin @ 2026-01-11  6:46 UTC
  To: Bobby Eshleman
  Cc: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Stefan Hajnoczi, Jason Wang,
	Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	Broadcom internal kernel review list, Shuah Khan, Long Li,
	linux-kernel, virtualization, netdev, kvm, linux-hyperv,
	linux-kselftest, berrange, Sargun Dhillon, Bobby Eshleman
In-Reply-To: <20251223-vsock-vmtest-v13-3-9d6db8e7c80b@meta.com>

On Tue, Dec 23, 2025 at 04:28:37PM -0800, Bobby Eshleman wrote:
> From: Bobby Eshleman <bobbyeshleman@meta.com>
> 
> Associate reply packets with the sending socket. When vsock must reply
> with an RST packet and there exists a sending socket (e.g., for
> loopback), setting the skb owner to the socket correctly handles
> reference counting between the skb and sk (i.e., the sk stays alive
> until the skb is freed).
> 
> This allows the net namespace to be used for socket lookups for the
> duration of the reply skb's lifetime, preventing race conditions between
> the namespace lifecycle and vsock socket search using the namespace
> pointer.
> 
> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
> Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
> ---
> Changes in v11:
> - move before adding to netns support (Stefano)

Can you explain the revert, please?
I looked at the feedback from Stefano and all he said,
apparently, was not to break bisect.

> Changes in v10:
> - break this out into its own patch for easy revert (Stefano)
> ---
>  net/vmw_vsock/virtio_transport_common.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> index fdb8f5b3fa60..718be9f33274 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -1165,6 +1165,12 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
>  		.op = VIRTIO_VSOCK_OP_RST,
>  		.type = le16_to_cpu(hdr->type),
>  		.reply = true,
> +
> +		/* Set sk owner to socket we are replying to (may be NULL for
> +		 * non-loopback). This keeps a reference to the sock and
> +		 * sock_net(sk) until the reply skb is freed.
> +		 */
> +		.vsk = vsock_sk(skb->sk),
>  	};
>  	struct sk_buff *reply;
>  
> 
> -- 
> 2.47.3



* Re: [PATCH RFC net-next v13 02/13] vsock: add netns to vsock core
From: Michael S. Tsirkin @ 2026-01-11  6:43 UTC
  To: Bobby Eshleman
  Cc: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Stefan Hajnoczi, Jason Wang,
	Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	Broadcom internal kernel review list, Shuah Khan, Long Li,
	linux-kernel, virtualization, netdev, kvm, linux-hyperv,
	linux-kselftest, berrange, Sargun Dhillon, Bobby Eshleman
In-Reply-To: <20251223-vsock-vmtest-v13-2-9d6db8e7c80b@meta.com>

On Tue, Dec 23, 2025 at 04:28:36PM -0800, Bobby Eshleman wrote:
> From: Bobby Eshleman <bobbyeshleman@meta.com>
> 
> Add netns logic to vsock core. Additionally, modify transport hook
> prototypes to be used by later transport-specific patches (e.g.,
> *_seqpacket_allow()).
> 
> Namespaces are supported primarily by changing socket lookup functions
> (e.g., vsock_find_connected_socket()) to take into account the socket
> namespace and the namespace mode before considering a candidate socket a
> "match".
> 
> This patch also introduces the sysctl /proc/sys/net/vsock/ns_mode to
> report the mode and /proc/sys/net/vsock/child_ns_mode to set the mode
> for new namespaces.
> 
> Add netns functionality (initialization, passing to transports, procfs,
> etc...) to the af_vsock socket layer. Later patches that add netns
> support to transports depend on this patch.
> 
> dgram_allow(), stream_allow(), and seqpacket_allow() callbacks are
> modified to take a vsk in order to perform logic on namespace modes. In
> future patches, the net will also be used for socket
> lookups in these functions.
> 
> Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>

...


>  static int __vsock_bind_connectible(struct vsock_sock *vsk,
>  				    struct sockaddr_vm *addr)
>  {
> +	struct net *net = sock_net(sk_vsock(vsk));
>  	static u32 port;
>  	struct sockaddr_vm new_addr;
>


Hmm this static port gives me pause. So some port number info leaks
between namespaces. I am not saying it's a big security issue
and yet ... people expect isolation.


-- 
MST
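
To make the concern about the function-local static concrete, here is a
minimal userspace model, not the kernel code. It only shows that a single
static hint is advanced by every caller, so an allocation in one namespace
changes the next port seen in another, while a hint stored per namespace
does not. struct ns_model, alloc_port_shared(), alloc_port_per_ns() and the
start value 1024 are hypothetical names and values chosen for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIRST_EPHEMERAL_PORT 1024u	/* arbitrary start value for the model */

/* Hypothetical stand-in for a network namespace; not a kernel type. */
struct ns_model {
	uint32_t next_port;	/* per-namespace allocation hint */
};

/* Models the pattern under review: one hint shared by every namespace. */
static uint32_t alloc_port_shared(struct ns_model *ns)
{
	static uint32_t port = FIRST_EPHEMERAL_PORT;

	(void)ns;	/* the namespace has no influence on the result */
	return port++;
}

/* Alternative: the hint lives in the namespace, so nothing is shared. */
static uint32_t alloc_port_per_ns(struct ns_model *ns)
{
	assert(ns != NULL);
	return ns->next_port++;
}
```

With two namespaces a and b, two shared allocations in a followed by one in
b hand b the third port in the sequence, so b can infer how many binds
happened elsewhere. The per-namespace variant hands b its own first port.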



* Re: [PATCH RFC net-next v13 01/13] vsock: add per-net vsock NS mode state
From: Michael S. Tsirkin @ 2026-01-11  6:29 UTC
  To: Bobby Eshleman
  Cc: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Stefan Hajnoczi, Jason Wang,
	Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	Broadcom internal kernel review list, Shuah Khan, Long Li,
	linux-kernel, virtualization, netdev, kvm, linux-hyperv,
	linux-kselftest, berrange, Sargun Dhillon, Bobby Eshleman
In-Reply-To: <20251223-vsock-vmtest-v13-1-9d6db8e7c80b@meta.com>

On Tue, Dec 23, 2025 at 04:28:35PM -0800, Bobby Eshleman wrote:
> From: Bobby Eshleman <bobbyeshleman@meta.com>
> 
> Add the per-net vsock NS mode state. This only adds the structure for
> holding the mode and some of the functions for setting/getting and
> checking the mode, but does not integrate the functionality yet.
> 
> Future patches add the uAPI and transport-specific usage of these
> structures and helpers.
> 
> Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>


I do not much like splitting out functionality like this,
myself - there's little to no docs and one can't figure out
whether this code does what it is supposed to do without
reading the next patch.

I would just squash this with that one.

If you are splitting functionality along some API lines,
you need detailed docs so one can review implementation
separately from use.


> ---
> Changes in v13:
> - remove net_mode because net->vsock.mode becomes immutable, no need to
>   save the mode when vsocks are created.
> - add the new helpers for child_ns_mode to support ns_mode inheriting
>   the mode from child_ns_mode.
> - because ns_mode is immutable and child_ns_mode can be changed multiple
>   times, remove the write-once lock.
> - simplify vsock_net_check_mode() to no longer take mode arguments since
>   the mode can be accessed via the net pointers without fear of the mode
>   changing.
> - add logic in vsock_net_check_mode() to infer VSOCK_NET_MODE_GLOBAL
>   from NULL namespaces in order to allow only net pointers to be passed
>   to vsock_net_check_mode(), while still allowing namespace-unaware
>   transports to force global mode.
> 
> Changes in v10:
> - change mode_locked to int (Stefano)
> 
> Changes in v9:
> - use xchg(), WRITE_ONCE(), READ_ONCE() for mode and mode_locked (Stefano)
> - clarify mode0/mode1 meaning in vsock_net_check_mode() comment
> - remove spin lock in net->vsock (not used anymore)
> - change mode from u8 to enum vsock_net_mode in vsock_net_write_mode()
> 
> Changes in v7:
> - clarify vsock_net_check_mode() comments
> - change to `orig_net_mode == VSOCK_NET_MODE_GLOBAL && orig_net_mode == vsk->orig_net_mode`
> - remove extraneous explanation of `orig_net_mode`
> - rename `written` to `mode_locked`
> - rename `vsock_hdr` to `sysctl_hdr`
> - change `orig_net_mode` to `net_mode`
> - make vsock_net_check_mode() more generic by taking just net pointers
>   and modes, instead of a vsock_sock ptr, for reuse by transports
>   (e.g., vhost_vsock)
> 
> Changes in v6:
> - add orig_net_mode to store mode at creation time which will be used to
>   avoid breakage when namespace changes mode during socket/VM lifespan
> 
> Changes in v5:
> - use /proc/sys/net/vsock/ns_mode instead of /proc/net/vsock_ns_mode
> - change from net->vsock.ns_mode to net->vsock.mode
> - change vsock_net_set_mode() to vsock_net_write_mode()
> - vsock_net_write_mode() returns bool for write success to avoid
>   need to use vsock_net_mode_can_set()
> - remove vsock_net_mode_can_set()
> ---
>  MAINTAINERS                 |  1 +
>  include/net/af_vsock.h      | 42 ++++++++++++++++++++++++++++++++++++++++++
>  include/net/net_namespace.h |  4 ++++
>  include/net/netns/vsock.h   | 17 +++++++++++++++++
>  4 files changed, 64 insertions(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 454b8ed119e9..38d24e5a957c 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -27516,6 +27516,7 @@ L:	netdev@vger.kernel.org
>  S:	Maintained
>  F:	drivers/vhost/vsock.c
>  F:	include/linux/virtio_vsock.h
> +F:	include/net/netns/vsock.h
>  F:	include/uapi/linux/virtio_vsock.h
>  F:	net/vmw_vsock/virtio_transport.c
>  F:	net/vmw_vsock/virtio_transport_common.c
> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
> index d40e978126e3..6f5bc9dbefa5 100644
> --- a/include/net/af_vsock.h
> +++ b/include/net/af_vsock.h
> @@ -10,6 +10,7 @@
>  
>  #include <linux/kernel.h>
>  #include <linux/workqueue.h>
> +#include <net/netns/vsock.h>
>  #include <net/sock.h>
>  #include <uapi/linux/vm_sockets.h>
>  
> @@ -256,4 +257,45 @@ static inline bool vsock_msgzerocopy_allow(const struct vsock_transport *t)
>  {
>  	return t->msgzerocopy_allow && t->msgzerocopy_allow();
>  }
> +
> +static inline enum vsock_net_mode vsock_net_mode(struct net *net)
> +{
> +	return READ_ONCE(net->vsock.mode);
> +}
> +
> +static inline void vsock_net_set_child_mode(struct net *net,
> +					    enum vsock_net_mode mode)
> +{
> +	WRITE_ONCE(net->vsock.child_ns_mode, mode);
> +}
> +
> +static inline enum vsock_net_mode vsock_net_child_mode(struct net *net)
> +{
> +	return READ_ONCE(net->vsock.child_ns_mode);
> +}
> +
> +/* Return true if two namespaces pass the mode rules. Otherwise, return false.
> + *
> + * A NULL namespace is treated as VSOCK_NET_MODE_GLOBAL.
> + *
> + * Read more about modes in the comment header of net/vmw_vsock/af_vsock.c.
> + */
> +static inline bool vsock_net_check_mode(struct net *ns0, struct net *ns1)
> +{
> +	enum vsock_net_mode mode0, mode1;
> +
> +	/* Any vsocks within the same network namespace are always reachable,
> +	 * regardless of the mode.
> +	 */
> +	if (net_eq(ns0, ns1))
> +		return true;
> +
> +	mode0 = ns0 ? vsock_net_mode(ns0) : VSOCK_NET_MODE_GLOBAL;
> +	mode1 = ns1 ? vsock_net_mode(ns1) : VSOCK_NET_MODE_GLOBAL;
> +
> +	/* Different namespaces are only reachable if they are both
> +	 * global mode.
> +	 */
> +	return mode0 == VSOCK_NET_MODE_GLOBAL && mode0 == mode1;
> +}
>  #endif /* __AF_VSOCK_H__ */
> diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
> index cb664f6e3558..66d3de1d935f 100644
> --- a/include/net/net_namespace.h
> +++ b/include/net/net_namespace.h
> @@ -37,6 +37,7 @@
>  #include <net/netns/smc.h>
>  #include <net/netns/bpf.h>
>  #include <net/netns/mctp.h>
> +#include <net/netns/vsock.h>
>  #include <net/net_trackers.h>
>  #include <linux/ns_common.h>
>  #include <linux/idr.h>
> @@ -196,6 +197,9 @@ struct net {
>  	/* Move to a better place when the config guard is removed. */
>  	struct mutex		rtnl_mutex;
>  #endif
> +#if IS_ENABLED(CONFIG_VSOCKETS)
> +	struct netns_vsock	vsock;
> +#endif
>  } __randomize_layout;
>  
>  #include <linux/seq_file_net.h>
> diff --git a/include/net/netns/vsock.h b/include/net/netns/vsock.h
> new file mode 100644
> index 000000000000..e2325e2d6ec5
> --- /dev/null
> +++ b/include/net/netns/vsock.h
> @@ -0,0 +1,17 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __NET_NET_NAMESPACE_VSOCK_H
> +#define __NET_NET_NAMESPACE_VSOCK_H
> +
> +#include <linux/types.h>
> +
> +enum vsock_net_mode {
> +	VSOCK_NET_MODE_GLOBAL,
> +	VSOCK_NET_MODE_LOCAL,
> +};
> +
> +struct netns_vsock {
> +	struct ctl_table_header *sysctl_hdr;
> +	enum vsock_net_mode mode;
> +	enum vsock_net_mode child_ns_mode;
> +};
> +#endif /* __NET_NET_NAMESPACE_VSOCK_H */
> 
> -- 
> 2.47.3
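
The reachability rule in the quoted vsock_net_check_mode() can be written
down as a small truth table. The sketch below is a userspace model that
mirrors the quoted logic only; struct net_model and check_mode_model() are
hypothetical stand-ins for struct net and the kernel helper, and pointer
equality stands in for net_eq().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum vsock_net_mode {
	VSOCK_NET_MODE_GLOBAL,
	VSOCK_NET_MODE_LOCAL,
};

/* Hypothetical stand-in for struct net; only the vsock mode is modeled. */
struct net_model {
	enum vsock_net_mode mode;
};

/* Mirrors the quoted helper: same namespace, or both namespaces global. */
static bool check_mode_model(const struct net_model *ns0,
			     const struct net_model *ns1)
{
	enum vsock_net_mode mode0, mode1;

	if (ns0 == ns1)
		return true;

	/* A NULL namespace is treated as global mode. */
	mode0 = ns0 ? ns0->mode : VSOCK_NET_MODE_GLOBAL;
	mode1 = ns1 ? ns1->mode : VSOCK_NET_MODE_GLOBAL;

	return mode0 == VSOCK_NET_MODE_GLOBAL && mode0 == mode1;
}

/* The truth table, expressed as assertions. */
static void check_mode_model_selftest(void)
{
	struct net_model g0 = { VSOCK_NET_MODE_GLOBAL };
	struct net_model g1 = { VSOCK_NET_MODE_GLOBAL };
	struct net_model l0 = { VSOCK_NET_MODE_LOCAL };
	struct net_model l1 = { VSOCK_NET_MODE_LOCAL };

	assert(check_mode_model(&l0, &l0));	/* same ns: always reachable */
	assert(check_mode_model(&g0, &g1));	/* global <-> global */
	assert(!check_mode_model(&g0, &l0));	/* global <-> local */
	assert(!check_mode_model(&l0, &l1));	/* local <-> other local */
	assert(check_mode_model(NULL, &g0));	/* NULL counts as global */
	assert(!check_mode_model(NULL, &l0));	/* NULL (global) <-> local */
}
```

Calling check_mode_model_selftest() from a test program exercises all six
cases.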



* Re: [PATCH RFC net-next v13 00/13] vsock: add namespace support to vhost-vsock and loopback
From: Michael S. Tsirkin @ 2026-01-11  0:12 UTC
  To: Bobby Eshleman
  Cc: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Stefan Hajnoczi, Jason Wang,
	Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	Broadcom internal kernel review list, Shuah Khan, Long Li,
	linux-kernel, virtualization, netdev, kvm, linux-hyperv,
	linux-kselftest, berrange, Sargun Dhillon, Bobby Eshleman
In-Reply-To: <aWGZILlNWzIbRNuO@devvm11784.nha0.facebook.com>

On Fri, Jan 09, 2026 at 04:11:12PM -0800, Bobby Eshleman wrote:
> On Tue, Dec 23, 2025 at 04:28:34PM -0800, Bobby Eshleman wrote:
> > This series adds namespace support to vhost-vsock and loopback. It does
> > not add namespaces to any of the other guest transports (virtio-vsock,
> > hyperv, or vmci).
> > 
> > The current revision supports two modes: local and global. Local
> > mode is complete isolation of namespaces, while global mode is complete
> > sharing between namespaces of CIDs (the original behavior).
> > 
> > The mode is set using the parent namespace's
> > /proc/sys/net/vsock/child_ns_mode and inherited when a new namespace is
> > created. The mode of the current namespace can be queried by reading
> > /proc/sys/net/vsock/ns_mode. The mode can not change after the namespace
> > has been created.
> > 
> > Modes are per-netns. This allows a system to configure namespaces
> > independently (some may share CIDs, others are completely isolated).
> > This also supports future possible mixed use cases, where there may be
> > namespaces in global mode spinning up VMs while there are mixed mode
> > namespaces that provide services to the VMs, but are not allowed to
> > allocate from the global CID pool (this mode is not implemented in this
> > series).
> 
> Stefano, would you like me to resend this without the RFC tag, or should I
> just leave it as is for review? I don't have any planned changes at the
> moment.
> 
> Best,
> Bobby

I couldn't apply it on top of net-next, so please do.



* Re: [RFC v1 0/5] Hyper-V: Add para-virtualized IOMMU support for Linux guests
From: Yu Zhang @ 2026-01-10  5:39 UTC
  To: Michael Kelley
  Cc: linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	iommu@lists.linux.dev, linux-pci@vger.kernel.org,
	kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, lpieralisi@kernel.org,
	kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org,
	bhelgaas@google.com, arnd@arndb.de, joro@8bytes.org,
	will@kernel.org, robin.murphy@arm.com,
	easwar.hariharan@linux.microsoft.com,
	jacob.pan@linux.microsoft.com, nunodasneves@linux.microsoft.com,
	mrathor@linux.microsoft.com, peterz@infradead.org,
	linux-arch@vger.kernel.org
In-Reply-To: <SN6PR02MB4157342641D173ABE9B4F1FED485A@SN6PR02MB4157.namprd02.prod.outlook.com>

On Thu, Jan 08, 2026 at 06:45:52PM +0000, Michael Kelley wrote:
> From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, December 8, 2025 9:11 PM
> > 
> > This patch series introduces a para-virtualized IOMMU driver for
> > Linux guests running on Microsoft Hyper-V. The primary objective
> > is to enable hardware-assisted DMA isolation and scalable device
> 
> Is there any particular meaning for the qualifier "scalable" vs. just
> "device assignment"? I just want to understand what you are getting
> at.
> 

Sorry for the ambiguity.
I intended to highlight two primary use cases for pvIOMMU:
- to enable in-kernel DMA protection within the guest.
- to allow device assignment to guest user space (e.g., via VFIO).

I avoided using the phrase "device assignment" alone, because people may be
confused about whether the main purpose of introducing pvIOMMU is device
assignment to an L1 guest (which does not actually depend on any virtual
IOMMU) or to an L2 nested guest (although I guess it should work with
pvIOMMU, we've never tested that case and are not aware of any such
requirement).

And you are right, simply adding "scalable" didn't help clarify this.
I will rephrase the commit message. Thanks!

B.R.
Yu


* Re: [RFC v1 4/5] hyperv: allow hypercall output pages to be allocated for child partitions
From: Yu Zhang @ 2026-01-10  5:07 UTC
  To: Michael Kelley
  Cc: linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	iommu@lists.linux.dev, linux-pci@vger.kernel.org,
	kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, lpieralisi@kernel.org,
	kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org,
	bhelgaas@google.com, arnd@arndb.de, joro@8bytes.org,
	will@kernel.org, robin.murphy@arm.com,
	easwar.hariharan@linux.microsoft.com,
	jacob.pan@linux.microsoft.com, nunodasneves@linux.microsoft.com,
	mrathor@linux.microsoft.com, peterz@infradead.org,
	linux-arch@vger.kernel.org
In-Reply-To: <SN6PR02MB4157C3EF6617A7BA4CA9E432D485A@SN6PR02MB4157.namprd02.prod.outlook.com>

On Thu, Jan 08, 2026 at 06:47:44PM +0000, Michael Kelley wrote:
> From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, December 8, 2025 9:11 PM
> > 
> 
> The "Subject:" line prefix for this patch should probably be "Drivers: hv:"
> to be consistent with most other changes to this source code file.
> 
> > Previously, the allocation of per-CPU output argument pages was restricted
> > to root partitions or those operating in VTL mode.
> > 
> > Remove this restriction to support guest IOMMU related hypercalls, which
> > require valid output pages to function correctly.
> 
> The thinking here isn't quite correct. Just because a hypercall produces output
> doesn't mean that Linux needs to allocate a page for the output that is separate
> from the input. It's perfectly OK to use the same page for both input and output,
> as long as the two areas don't overlap. Yes, the page is called
> "hyperv_pcpu_input_arg", but that's a historical artifact from before the time
> it was realized that the same page can be used for both input and output.
> 
> Of course, if there's ever a hypercall that needs lots of input and lots of output
> such that the combined size doesn't fit in a single page, then separate input
> and output pages will be needed. But I'm skeptical that will ever happen. Rep
> hypercalls could have large amounts of input and/or output, but I'd venture
> that the rep count can always be managed so everything fits in a single page.
> 

Thanks, Michael.

Is there an existing hypercall precedent that reuses the input page for output?
I believe reusing the input page should be acceptable, at least for pvIOMMU's
hypercalls, but I will confirm these interfaces with the Hyper-V team.

> > 
> > While unconditionally allocating per-CPU output pages scales with the number
> > of vCPUs, and potentially adding overhead for guests that may not utilize the
> > IOMMU, this change anticipates that future hypercalls from child partitions
> > may also require these output pages.
> 
> I've heard the argument that the amount of overhead is modest relative to the
> overall amount of memory that is typically in a VM, particularly VMs with high
> vCPU counts. And I don't disagree. But on the flip side, why tie up memory when
> there's no need to do so? I'd argue for dropping this patch, and changing the
> two hypercall call sites in Patch 5 to just use part of the so-called hypercall input
> page for the output as well. It's only a one-line change in each hypercall call site.
> 

I share your concern about unconditionally allocating a separate output page
for each vCPU. And if reusing the input page isn't accepted by the Hyper-V team,
perhaps we could gate the allocation by checking IS_ENABLED(CONFIG_HYPERV_PVIOMMU)
in hv_output_page_exist()?

B.R.
Yu
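
The single-page suggestion quoted in this message can be sketched in
userspace as follows. This is only a model of the layout arithmetic, not
Hyper-V code: it places the output area directly after the input area in
the same page and refuses when the two do not fit. The 4096-byte page size,
the 8-byte alignment, and the name hv_model_output_in_input_page() are
assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_MODEL	4096u	/* assumed page size */
#define HV_ARG_ALIGN	8u	/* assumed alignment for hypercall arguments */

/*
 * Return a pointer to a non-overlapping output area located after an input
 * area of in_size bytes in the same page, or NULL if input plus output do
 * not fit in one page. The name and signature are hypothetical.
 */
static void *hv_model_output_in_input_page(void *page, size_t in_size,
					   size_t out_size)
{
	/* Round the end of the input area up to the next aligned offset. */
	size_t out_off = (in_size + HV_ARG_ALIGN - 1) &
			 ~(size_t)(HV_ARG_ALIGN - 1);

	assert(((uintptr_t)page % PAGE_SIZE_MODEL) == 0);

	if (out_off > PAGE_SIZE_MODEL || out_size > PAGE_SIZE_MODEL - out_off)
		return NULL;

	return (unsigned char *)page + out_off;
}
```

A NULL return corresponds to the case described in the quoted text where
the combined size does not fit in a single page and separate pages would be
needed.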


* Re: [PATCH 00/12] Recover sysfb after DRM probe failure
From: Zack Rusin @ 2026-01-10  4:52 UTC
  To: Thomas Zimmermann
  Cc: dri-devel, Alex Deucher, amd-gfx, Ard Biesheuvel, Ce Sun,
	Chia-I Wu, Christian König, Danilo Krummrich, Dave Airlie,
	Deepak Rawat, Dmitry Osipenko, Gerd Hoffmann, Gurchetan Singh,
	Hans de Goede, Hawking Zhang, Helge Deller, intel-gfx, intel-xe,
	Jani Nikula, Javier Martinez Canillas, Jocelyn Falempe,
	Joonas Lahtinen, Lijo Lazar, linux-efi, linux-fbdev, linux-hyperv,
	linux-kernel, Lucas De Marchi, Lyude Paul, Maarten Lankhorst,
	Mario Limonciello (AMD), Mario Limonciello, Maxime Ripard,
	nouveau, Rodrigo Vivi, Simona Vetter, spice-devel,
	Thomas Hellström, Timur Kristóf, Tvrtko Ursulin,
	virtualization, Vitaly Prosyak
In-Reply-To: <c816f7ed-66e0-4773-b3d1-4769234bd30b@suse.de>


On Fri, Jan 9, 2026 at 5:34 AM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>
> Hi
>
> Am 29.12.25 um 22:58 schrieb Zack Rusin:
> > Almost a rite of passage for every DRM developer and most Linux users
> > is upgrading your DRM driver/updating boot flags/changing some config
> > and having DRM driver fail at probe resulting in a blank screen.
> >
> > Currently there's no way to recover from DRM driver probe failure. PCI
> > DRM drivers explicitly throw out the existing sysfb to get exclusive
> > access to PCI resources so if the probe fails the system is left without
> > a functioning display driver.
> >
> > Add code to sysfb to recover the system framebuffer when a DRM driver's
> > probe fails. This means that a DRM driver that fails to load reloads the
> > system framebuffer driver.
> >
> > This works best with simpledrm. Without it Xorg won't recover because
> > it still tries to load the vendor specific driver which ends up usually
> > not working at all. With simpledrm the system recovers really nicely
> > ending up with a working console and not a blank screen.
> >
> > There's a caveat in that some hardware might require some special magic
> > register write to recover EFI display. I'd appreciate it a lot if
> > maintainers could introduce a temporary failure in their drivers
> > probe to validate that the sysfb recovers and they get a working console.
> > The easiest way to double check it is by adding:
> >   /* XXX: Temporary failure to test sysfb restore - REMOVE BEFORE COMMIT */
> >   dev_info(&pdev->dev, "Testing sysfb restore: forcing probe failure\n");
> >   ret = -EINVAL;
> >   goto out_error;
> > or such right after the devm_aperture_remove_conflicting_pci_devices .
>
> Recovering the display like that is guess work and will at best work
> with simple discrete devices where the framebuffer is always located in
> a confined graphics aperture.
>
> But the problem you're trying to solve is a real one.
>
> What we'd want to do instead is to take the initial hardware state into
> account when we do the initial mode-setting operation.
>
> The first step is to move each driver's remove_conflicting_devices call
> to the latest possible location in the probe function. We usually do it
> first, because that's easy. But on most hardware, it could happen much
> later.

Well, some drivers (vbox, vmwgfx, bochs and cirrus-qemu) do it because
they request PCI regions, which would fail otherwise. Because
grabbing the PCI resources is in general the very first thing that
those drivers need to do to set up anything, we call
remove_conflicting_devices first, or at least very early.

I also don't think it's possible, or even desirable for some drivers, to
reuse the initial state. A good example here is vmwgfx, where by default
some people will set up their VMs with e.g. 8 MB of VRAM. When vmwgfx
loads we allow scanning out from system memory, so you can set your VM
up with 8 MB of VRAM but still use 4K resolutions once the driver
loads. This way the suspend size of the VM is very predictable (tiny
VRAM plus whatever RAM was set up) while still allowing a lot of
flexibility.

In general I think however this is planned it's two or three separate series:
1) infrastructure to reload the sysfb driver (what this series is)
2) making sure that drivers that do want to recover cleanly actually
clean out all the state on exit properly,
3) abstracting at least some of that cleanup in some driver independent way

z



* Re: [PATCH net-next, v7] net: mana: Implement ndo_tx_timeout and serialize queue resets per port.
From: Jakub Kicinski @ 2026-01-10  2:02 UTC
  To: Dipayaan Roy
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	pabeni, longli, kotaranov, horms, shradhagupta, ssengar, ernis,
	shirazsaleem, linux-hyperv, netdev, linux-kernel, linux-rdma,
	dipayanroy
In-Reply-To: <20260106230438.GA13125@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>

On Tue, 6 Jan 2026 15:04:38 -0800 Dipayaan Roy wrote:
> +static void mana_per_port_queue_reset_work_handler(struct work_struct *work)
> +{
> +	struct mana_queue_reset_work *reset_queue_work =
> +			container_of(work, struct mana_queue_reset_work, work);
> +
> +	struct mana_port_context *apc = container_of(reset_queue_work,
> +						     struct mana_port_context,
> +						     queue_reset_work);

> +struct mana_queue_reset_work {
> +	/* Work structure */

Not sure what value this comment adds. Looks like something an AI
generator would add.

> +	struct work_struct work;
> +};
> +
>  struct mana_port_context {
>  	struct mana_context *ac;
>  	struct net_device *ndev;
> +	struct mana_queue_reset_work queue_reset_work;

Why did you wrap the work in another struct with just one member?
It forces you to work through two layers of container_of.

Either way, container_of supports nested structs so I think something
like:

	struct mana_port_context *apc = container_of(work,
						     struct mana_port_context,
						     queue_reset_work.work);

should work (untested). But really, better to just delete the pointless
nesting.
-- 
pw-bot: cr
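
The nested-member form of container_of suggested in this message can be
tried in userspace. The sketch below uses a simplified container_of_model()
built on offsetof() (no type checking, unlike the kernel macro) and
hypothetical stand-in structs. Nested member designators in offsetof() are
accepted by GCC and Clang; other compilers are not covered by this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace equivalent of container_of(); no type checking. */
#define container_of_model(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for struct work_struct. */
struct work_model {
	int pending;
};

/* Wrapper with a single member, as in the patch under review. */
struct reset_work_model {
	struct work_model work;
};

/* Hypothetical stand-in for struct mana_port_context. */
struct port_model {
	int port_idx;
	struct reset_work_model queue_reset_work;
};

/* One step from the innermost work item to the outer port context. */
static struct port_model *port_from_work(struct work_model *work)
{
	assert(work != NULL);
	return container_of_model(work, struct port_model,
				  queue_reset_work.work);
}
```

Passing &p.queue_reset_work.work for some struct port_model p yields &p
again, without the intermediate container_of on the wrapper. Dropping the
wrapper struct, as the review prefers, makes the member designator a plain
field name.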


* Re: [PATCH V2,net-next, 2/2] net: mana: Add ethtool counters for RX CQEs in coalesced type
From: Jakub Kicinski @ 2026-01-10  1:56 UTC
  To: Haiyang Zhang
  Cc: linux-hyperv, netdev, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Andrew Lunn, David S. Miller, Eric Dumazet,
	Paolo Abeni, Konstantin Taranov, Simon Horman,
	Erni Sri Satya Vennela, Shradha Gupta, Saurabh Sengar,
	Aditya Garg, Dipayaan Roy, Shiraz Saleem, linux-kernel,
	linux-rdma, paulros
In-Reply-To: <1767732407-12389-3-git-send-email-haiyangz@linux.microsoft.com>

On Tue,  6 Jan 2026 12:46:47 -0800 Haiyang Zhang wrote:
> @@ -227,8 +232,6 @@ struct mana_rxcomp_perpkt_info {
>  	u32 pkt_hash;
>  }; /* HW DATA */
>  
> -#define MANA_RXCOMP_OOB_NUM_PPI 4
> -
>  /* Receive completion OOB */
>  struct mana_rxcomp_oob {
>  	struct mana_cqe_header cqe_hdr;
> @@ -378,7 +381,6 @@ struct mana_ethtool_stats {
>  	u64 tx_cqe_err;
>  	u64 tx_cqe_unknown_type;
>  	u64 tx_linear_pkt_cnt;
> -	u64 rx_coalesced_err;
>  	u64 rx_cqe_unknown_type;
>  };

This should be deleted in the previous patch already


* Re: [PATCH V2,net-next, 1/2] net: mana: Add support for coalesced RX packets on CQE
From: Jakub Kicinski @ 2026-01-10  1:56 UTC
  To: Haiyang Zhang
  Cc: linux-hyperv, netdev, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Andrew Lunn, David S. Miller, Eric Dumazet,
	Paolo Abeni, Konstantin Taranov, Simon Horman,
	Erni Sri Satya Vennela, Shradha Gupta, Saurabh Sengar,
	Aditya Garg, Dipayaan Roy, Shiraz Saleem, linux-kernel,
	linux-rdma, paulros
In-Reply-To: <1767732407-12389-2-git-send-email-haiyangz@linux.microsoft.com>

On Tue,  6 Jan 2026 12:46:46 -0800 Haiyang Zhang wrote:
> From: Haiyang Zhang <haiyangz@microsoft.com>
> 
> Our NIC can have up to 4 RX packets on 1 CQE. To support this feature,
> check and process the type CQE_RX_COALESCED_4. The default setting is
> disabled, to avoid possible regression on latency.
> 
> And add ethtool handler to switch this feature. To turn it on, run:
>   ethtool -C <nic> rx-frames 4
> To turn it off:
>   ethtool -C <nic> rx-frames 1

Exposing just rx frame count, and only two values is quite unusual.
Please explain in more detail the coalescing logic of the device.

> @@ -2079,14 +2081,10 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
>  		return;
>  	}
>  
> -	pktlen = oob->ppi[0].pkt_len;
> -
> -	if (pktlen == 0) {
> -		/* data packets should never have packetlength of zero */
> -		netdev_err(ndev, "RX pkt len=0, rq=%u, cq=%u, rxobj=0x%llx\n",
> -			   rxq->gdma_id, cq->gdma_id, rxq->rxobj);
> +nextpkt:
> +	pktlen = oob->ppi[i].pkt_len;
> +	if (pktlen == 0)
>  		return;
> -	}
>  
>  	curr = rxq->buf_index;
>  	rxbuf_oob = &rxq->rx_oobs[curr];
> @@ -2097,12 +2095,15 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
>  	/* Unsuccessful refill will have old_buf == NULL.
>  	 * In this case, mana_rx_skb() will drop the packet.
>  	 */
> -	mana_rx_skb(old_buf, old_fp, oob, rxq);
> +	mana_rx_skb(old_buf, old_fp, oob, rxq, i);
>  
>  drop:
>  	mana_move_wq_tail(rxq->gdma_rq, rxbuf_oob->wqe_inf.wqe_size_in_bu);
>  
>  	mana_post_pkt_rxq(rxq);
> +
> +	if (coalesced && (++i < MANA_RXCOMP_OOB_NUM_PPI))
> +		goto nextpkt;

Please code this up as a loop. Using gotos for control flow other than
to jump to error handling epilogues is a poor coding practice (see the
kernel coding style).
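
A minimal sketch of the loop form, for illustration only. It is not the
driver code: the struct layouts, NUM_PPI, process_rx_cqe() and the
deliver() callback are hypothetical stand-ins for the MANA types and for
the refill/mana_rx_skb()/tail-move steps. It keeps the behavior of the
goto version: one packet when coalescing is off, up to four when it is
on, stopping at the first zero-length entry.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_PPI 4	/* stand-in for MANA_RXCOMP_OOB_NUM_PPI */

/* Hypothetical, simplified stand-ins for the per-packet info in the CQE. */
struct ppi {
	uint32_t pkt_len;
};

struct rx_oob {
	struct ppi ppi[NUM_PPI];
};

/*
 * Walk the packet entries of one completion with a bounded for loop.
 * Returns the number of packets handed to deliver().
 */
int process_rx_cqe(const struct rx_oob *oob, bool coalesced,
		   void (*deliver)(uint32_t pktlen, int idx))
{
	int max = coalesced ? NUM_PPI : 1;
	int i;

	for (i = 0; i < max; i++) {
		uint32_t pktlen = oob->ppi[i].pkt_len;

		/* A zero length marks the end of the valid entries. */
		if (pktlen == 0)
			break;

		/* Stand-in for refill, mana_rx_skb() and the tail move. */
		if (deliver)
			deliver(pktlen, i);
	}

	return i;
}
```

With the loop bound computed up front, the exit conditions sit in one
place instead of being split between the label and the trailing
conditional goto.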

> +static int mana_set_coalesce(struct net_device *ndev,
> +			     struct ethtool_coalesce *ec,
> +			     struct kernel_ethtool_coalesce *kernel_coal,
> +			     struct netlink_ext_ack *extack)
> +{
> +	struct mana_port_context *apc = netdev_priv(ndev);
> +	u8 saved_cqe_coalescing_enable;
> +	int err;
> +
> +	if (ec->rx_max_coalesced_frames != 1 &&
> +	    ec->rx_max_coalesced_frames != MANA_RXCOMP_OOB_NUM_PPI) {
> +		NL_SET_ERR_MSG_FMT(extack,
> +				   "rx-frames must be 1 or %u, got %u",
> +				   MANA_RXCOMP_OOB_NUM_PPI,
> +				   ec->rx_max_coalesced_frames);
> +		return -EINVAL;
> +	}
> +
> +	saved_cqe_coalescing_enable = apc->cqe_coalescing_enable;
> +	apc->cqe_coalescing_enable =
> +		ec->rx_max_coalesced_frames == MANA_RXCOMP_OOB_NUM_PPI;
> +
> +	if (!apc->port_is_up)
> +		return 0;
> +
> +	err = mana_config_rss(apc, TRI_STATE_TRUE, false, false);
> +

unnecessary empty line

> +	if (err) {
> +		netdev_err(ndev, "Set rx-frames to %u failed:%d\n",
> +			   ec->rx_max_coalesced_frames, err);
> +		NL_SET_ERR_MSG_FMT(extack, "Set rx-frames to %u failed",
> +				   ec->rx_max_coalesced_frames);

These messages are both pointless. If HW communication has failed
presumably there will already be an error in the logs. The extack
gives the user no information they wouldn't already have.

> +		apc->cqe_coalescing_enable = saved_cqe_coalescing_enable;
> +	}
> +
> +	return err;
> +}


* Re: [PATCH RFC net-next v13 00/13] vsock: add namespace support to vhost-vsock and loopback
From: Bobby Eshleman @ 2026-01-10  0:11 UTC (permalink / raw)
  To: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Stefan Hajnoczi, Michael S. Tsirkin,
	Jason Wang, Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	Broadcom internal kernel review list, Shuah Khan, Long Li
  Cc: linux-kernel, virtualization, netdev, kvm, linux-hyperv,
	linux-kselftest, berrange, Sargun Dhillon, Bobby Eshleman
In-Reply-To: <20251223-vsock-vmtest-v13-0-9d6db8e7c80b@meta.com>

On Tue, Dec 23, 2025 at 04:28:34PM -0800, Bobby Eshleman wrote:
> This series adds namespace support to vhost-vsock and loopback. It does
> not add namespaces to any of the other guest transports (virtio-vsock,
> hyperv, or vmci).
> 
> The current revision supports two modes: local and global. Local
> mode is complete isolation of namespaces, while global mode is complete
> sharing between namespaces of CIDs (the original behavior).
> 
> The mode is set using the parent namespace's
> /proc/sys/net/vsock/child_ns_mode and inherited when a new namespace is
> created. The mode of the current namespace can be queried by reading
> /proc/sys/net/vsock/ns_mode. The mode can not change after the namespace
> has been created.
> 
> Modes are per-netns. This allows a system to configure namespaces
> independently (some may share CIDs, others are completely isolated).
> This also supports future possible mixed use cases, where there may be
> namespaces in global mode spinning up VMs while there are mixed mode
> namespaces that provide services to the VMs, but are not allowed to
> allocate from the global CID pool (this mode is not implemented in this
> series).

Stefano, would you like me to resend this without the RFC tag, or should I
just leave it as is for review? I don't have any planned changes at the
moment.

Best,
Bobby


* [PATCH] mshv: make certain field names descriptive in a header struct
From: Mukesh Rathor @ 2026-01-09 20:06 UTC (permalink / raw)
  To: linux-hyperv; +Cc: wei.liu, nunodasneves

There is no functional change. Just make a couple of field names in
struct mshv_mem_region, in a header that can be used in many
places, a little more descriptive to make the code easier to read by
allowing better support for grep, cscope, etc.

Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
---
 drivers/hv/mshv_regions.c   | 44 ++++++++++++++++++-------------------
 drivers/hv/mshv_root.h      |  6 ++---
 drivers/hv/mshv_root_main.c | 10 ++++-----
 3 files changed, 30 insertions(+), 30 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index 202b9d551e39..af81405f859b 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -52,7 +52,7 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
 	struct page *page;
 	int ret;
 
-	page = region->pages[page_offset];
+	page = region->mreg_pages[page_offset];
 	if (!page)
 		return -EINVAL;
 
@@ -65,7 +65,7 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
 
 	/* Start at stride since the first page is validated */
 	for (count = stride; count < page_count; count += stride) {
-		page = region->pages[page_offset + count];
+		page = region->mreg_pages[page_offset + count];
 
 		/* Break if current page is not present */
 		if (!page)
@@ -117,7 +117,7 @@ static int mshv_region_process_range(struct mshv_mem_region *region,
 
 	while (page_count) {
 		/* Skip non-present pages */
-		if (!region->pages[page_offset]) {
+		if (!region->mreg_pages[page_offset]) {
 			page_offset++;
 			page_count--;
 			continue;
@@ -164,13 +164,13 @@ static int mshv_region_chunk_share(struct mshv_mem_region *region,
 				   u32 flags,
 				   u64 page_offset, u64 page_count)
 {
-	struct page *page = region->pages[page_offset];
+	struct page *page = region->mreg_pages[page_offset];
 
 	if (PageHuge(page) || PageTransCompound(page))
 		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
 
 	return hv_call_modify_spa_host_access(region->partition->pt_id,
-					      region->pages + page_offset,
+					      region->mreg_pages + page_offset,
 					      page_count,
 					      HV_MAP_GPA_READABLE |
 					      HV_MAP_GPA_WRITABLE,
@@ -190,13 +190,13 @@ static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
 				     u32 flags,
 				     u64 page_offset, u64 page_count)
 {
-	struct page *page = region->pages[page_offset];
+	struct page *page = region->mreg_pages[page_offset];
 
 	if (PageHuge(page) || PageTransCompound(page))
 		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
 
 	return hv_call_modify_spa_host_access(region->partition->pt_id,
-					      region->pages + page_offset,
+					      region->mreg_pages + page_offset,
 					      page_count, 0,
 					      flags, false);
 }
@@ -214,7 +214,7 @@ static int mshv_region_chunk_remap(struct mshv_mem_region *region,
 				   u32 flags,
 				   u64 page_offset, u64 page_count)
 {
-	struct page *page = region->pages[page_offset];
+	struct page *page = region->mreg_pages[page_offset];
 
 	if (PageHuge(page) || PageTransCompound(page))
 		flags |= HV_MAP_GPA_LARGE_PAGE;
@@ -222,7 +222,7 @@ static int mshv_region_chunk_remap(struct mshv_mem_region *region,
 	return hv_call_map_gpa_pages(region->partition->pt_id,
 				     region->start_gfn + page_offset,
 				     page_count, flags,
-				     region->pages + page_offset);
+				     region->mreg_pages + page_offset);
 }
 
 static int mshv_region_remap_pages(struct mshv_mem_region *region,
@@ -245,10 +245,10 @@ int mshv_region_map(struct mshv_mem_region *region)
 static void mshv_region_invalidate_pages(struct mshv_mem_region *region,
 					 u64 page_offset, u64 page_count)
 {
-	if (region->type == MSHV_REGION_TYPE_MEM_PINNED)
-		unpin_user_pages(region->pages + page_offset, page_count);
+	if (region->mreg_type == MSHV_REGION_TYPE_MEM_PINNED)
+		unpin_user_pages(region->mreg_pages + page_offset, page_count);
 
-	memset(region->pages + page_offset, 0,
+	memset(region->mreg_pages + page_offset, 0,
 	       page_count * sizeof(struct page *));
 }
 
@@ -265,7 +265,7 @@ int mshv_region_pin(struct mshv_mem_region *region)
 	int ret;
 
 	for (done_count = 0; done_count < region->nr_pages; done_count += ret) {
-		pages = region->pages + done_count;
+		pages = region->mreg_pages + done_count;
 		userspace_addr = region->start_uaddr +
 				 done_count * HV_HYP_PAGE_SIZE;
 		nr_pages = min(region->nr_pages - done_count,
@@ -297,7 +297,7 @@ static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
 				   u32 flags,
 				   u64 page_offset, u64 page_count)
 {
-	struct page *page = region->pages[page_offset];
+	struct page *page = region->mreg_pages[page_offset];
 
 	if (PageHuge(page) || PageTransCompound(page))
 		flags |= HV_UNMAP_GPA_LARGE_PAGE;
@@ -321,7 +321,7 @@ static void mshv_region_destroy(struct kref *ref)
 	struct mshv_partition *partition = region->partition;
 	int ret;
 
-	if (region->type == MSHV_REGION_TYPE_MEM_MOVABLE)
+	if (region->mreg_type == MSHV_REGION_TYPE_MEM_MOVABLE)
 		mshv_region_movable_fini(region);
 
 	if (mshv_partition_encrypted(partition)) {
@@ -374,9 +374,9 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
 	int ret;
 
 	range->notifier_seq = mmu_interval_read_begin(range->notifier);
-	mmap_read_lock(region->mni.mm);
+	mmap_read_lock(region->mreg_mni.mm);
 	ret = hmm_range_fault(range);
-	mmap_read_unlock(region->mni.mm);
+	mmap_read_unlock(region->mreg_mni.mm);
 	if (ret)
 		return ret;
 
@@ -407,7 +407,7 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
 				   u64 page_offset, u64 page_count)
 {
 	struct hmm_range range = {
-		.notifier = &region->mni,
+		.notifier = &region->mreg_mni,
 		.default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
 	};
 	unsigned long *pfns;
@@ -430,7 +430,7 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
 		goto out;
 
 	for (i = 0; i < page_count; i++)
-		region->pages[page_offset + i] = hmm_pfn_to_page(pfns[i]);
+		region->mreg_pages[page_offset + i] = hmm_pfn_to_page(pfns[i]);
 
 	ret = mshv_region_remap_pages(region, region->hv_map_flags,
 				      page_offset, page_count);
@@ -489,7 +489,7 @@ static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
 {
 	struct mshv_mem_region *region = container_of(mni,
 						      struct mshv_mem_region,
-						      mni);
+						      mreg_mni);
 	u64 page_offset, page_count;
 	unsigned long mstart, mend;
 	int ret = -EPERM;
@@ -535,14 +535,14 @@ static const struct mmu_interval_notifier_ops mshv_region_mni_ops = {
 
 void mshv_region_movable_fini(struct mshv_mem_region *region)
 {
-	mmu_interval_notifier_remove(&region->mni);
+	mmu_interval_notifier_remove(&region->mreg_mni);
 }
 
 bool mshv_region_movable_init(struct mshv_mem_region *region)
 {
 	int ret;
 
-	ret = mmu_interval_notifier_insert(&region->mni, current->mm,
+	ret = mmu_interval_notifier_insert(&region->mreg_mni, current->mm,
 					   region->start_uaddr,
 					   region->nr_pages << HV_HYP_PAGE_SHIFT,
 					   &mshv_region_mni_ops);
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 3c1d88b36741..f5b6d3979e5a 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -85,10 +85,10 @@ struct mshv_mem_region {
 	u64 start_uaddr;
 	u32 hv_map_flags;
 	struct mshv_partition *partition;
-	enum mshv_region_type type;
-	struct mmu_interval_notifier mni;
+	enum mshv_region_type mreg_type;
+	struct mmu_interval_notifier mreg_mni;
 	struct mutex mutex;	/* protects region pages remapping */
-	struct page *pages[];
+	struct page *mreg_pages[];
 };
 
 struct mshv_irq_ack_notifier {
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 1134a82c7881..eff1b21461dc 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -657,7 +657,7 @@ static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
 		return false;
 
 	/* Only movable memory ranges are supported for GPA intercepts */
-	if (region->type == MSHV_REGION_TYPE_MEM_MOVABLE)
+	if (region->mreg_type == MSHV_REGION_TYPE_MEM_MOVABLE)
 		ret = mshv_region_handle_gfn_fault(region, gfn);
 	else
 		ret = false;
@@ -1175,12 +1175,12 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
 		return PTR_ERR(rg);
 
 	if (is_mmio)
-		rg->type = MSHV_REGION_TYPE_MMIO;
+		rg->mreg_type = MSHV_REGION_TYPE_MMIO;
 	else if (mshv_partition_encrypted(partition) ||
 		 !mshv_region_movable_init(rg))
-		rg->type = MSHV_REGION_TYPE_MEM_PINNED;
+		rg->mreg_type = MSHV_REGION_TYPE_MEM_PINNED;
 	else
-		rg->type = MSHV_REGION_TYPE_MEM_MOVABLE;
+		rg->mreg_type = MSHV_REGION_TYPE_MEM_MOVABLE;
 
 	rg->partition = partition;
 
@@ -1297,7 +1297,7 @@ mshv_map_user_memory(struct mshv_partition *partition,
 	if (ret)
 		return ret;
 
-	switch (region->type) {
+	switch (region->mreg_type) {
 	case MSHV_REGION_TYPE_MEM_PINNED:
 		ret = mshv_prepare_pinned_region(region);
 		break;
-- 
2.51.2.vfs.0.1



* RE: [RFC v1 3/5] hyperv: Introduce new hypercall interfaces used by Hyper-V guest IOMMU
From: Michael Kelley @ 2026-01-09 19:24 UTC (permalink / raw)
  To: Easwar Hariharan
  Cc: Yu Zhang, linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org, iommu@lists.linux.dev,
	linux-pci@vger.kernel.org, kys@microsoft.com,
	haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com,
	lpieralisi@kernel.org, kwilczynski@kernel.org, mani@kernel.org,
	robh@kernel.org, bhelgaas@google.com, arnd@arndb.de,
	joro@8bytes.org, will@kernel.org, robin.murphy@arm.com,
	jacob.pan@linux.microsoft.com, nunodasneves@linux.microsoft.com,
	mrathor@linux.microsoft.com, peterz@infradead.org,
	linux-arch@vger.kernel.org
In-Reply-To: <330d26ac-f1a2-4ee9-8cd7-20fd17db9f92@linux.microsoft.com>

From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> Sent: Friday, January 9, 2026 10:47 AM
> 
> On 1/8/2026 10:47 AM, Michael Kelley wrote:
> > From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, December 8, 2025 9:11 PM
> >>
> >> From: Wei Liu <wei.liu@kernel.org>
> >>
> 
> <snip>
> 
> >> +struct hv_input_get_iommu_capabilities {
> >> +	u64 partition_id;
> >> +	u64 reserved;
> >> +} __packed;
> >> +
> >> +struct hv_output_get_iommu_capabilities {
> >> +	u32 size;
> >> +	u16 reserved;
> >> +	u8  max_iova_width;
> >> +	u8  max_pasid_width;
> >> +
> >> +#define HV_IOMMU_CAP_PRESENT (1ULL << 0)
> >> +#define HV_IOMMU_CAP_S2 (1ULL << 1)
> >> +#define HV_IOMMU_CAP_S1 (1ULL << 2)
> >> +#define HV_IOMMU_CAP_S1_5LVL (1ULL << 3)
> >> +#define HV_IOMMU_CAP_PASID (1ULL << 4)
> >> +#define HV_IOMMU_CAP_ATS (1ULL << 5)
> >> +#define HV_IOMMU_CAP_PRI (1ULL << 6)
> >> +
> >> +	u64 iommu_cap;
> >> +	u64 pgsize_bitmap;
> >> +} __packed;
> >> +
> >> +enum hv_logical_device_property_code {
> >> +	HV_LOGICAL_DEVICE_PROPERTY_PVIOMMU = 10,
> >> +};
> >> +
> >> +struct hv_input_get_logical_device_property {
> >> +	u64 partition_id;
> >> +	u64 logical_device_id;
> >> +	enum hv_logical_device_property_code code;
> >
> > Historically we've avoided "enum" types in structures that are part of
> > the hypervisor ABI. Use u32 here?
> 
> <snip>
> What has been the reasoning for that practice? Since the introduction of the
> include/hyperv/ headers, we have generally wanted to import as directly as
> possible the relevant definitions from the hypervisor code base. If there's
> a strong reason, we could consider switching the enum for a u32 here
> since, at least for the moment, there's only a single value being used.
> 

In the C language, the size of an enum is implementation defined. Do
a Co-Pilot search on "How many bytes is an enum in C", and you'll get a
fairly long answer explaining the idiosyncrasies. For gcc, and for MSVC on
the hypervisor side, the default is that an "enum" size is the same as an
"int", so everything works in current practice. But the compiler is allowed
to optimize the size of an enum if a smaller integer type can contain all
the values, and that would mess things up in an ABI. Hence the intent
to not use "enum" in the hypervisor ABI. Windows/Hyper-V historically
didn't have to worry about such things since they controlled both sides
of the ABI, but the more Linux uses the ABI, the greater potential for
something to go wrong.

I wish Windows/Hyper-V would tighten up their ABI specification, but
it is what it is. So I'm not sure how best to deal with the issue in light
of wanting to take the hypervisor ABI definitions directly from the
Windows environment and not modify them. I did a quick grep of the
hv*.h files in include/hyperv from linux-next, and while there are many
enum types defined, none are used as fields in a structure. There are
many cases of u32, and a couple u16's, followed by a comment
identifying the enum type that should be used to populate the field.
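
As an illustration of that convention, here is a minimal sketch. The
struct name example_input_get_property and the helper
example_set_code() are hypothetical, the struct contains only the three
fields quoted above, and __attribute__((packed)) stands in for the
kernel's __packed. The field is a fixed-width integer, with a comment
naming the enum that populates it, so the layout does not depend on the
size the compiler picks for the enum type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum hv_logical_device_property_code {
	HV_LOGICAL_DEVICE_PROPERTY_PVIOMMU = 10,
};

/* Hypothetical ABI struct: fixed-width field instead of an enum field. */
struct example_input_get_property {
	uint64_t partition_id;
	uint64_t logical_device_id;
	uint32_t code;	/* enum hv_logical_device_property_code */
} __attribute__((packed));

/* The layout is now fixed regardless of how the compiler sizes enums. */
static_assert(sizeof(struct example_input_get_property) == 20,
	      "unexpected ABI struct size");
static_assert(offsetof(struct example_input_get_property, code) == 16,
	      "unexpected offset of code field");

/* The enum stays usable as the type at the call site. */
void example_set_code(struct example_input_get_property *in,
		      enum hv_logical_device_property_code code)
{
	in->code = (uint32_t)code;
}
```

The static_assert lines turn a silent layout change into a build
failure, which is the property an enum-typed field cannot guarantee
across compilers or compiler options.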

Michael


* Re: [RFC v1 3/5] hyperv: Introduce new hypercall interfaces used by Hyper-V guest IOMMU
From: Easwar Hariharan @ 2026-01-09 18:47 UTC (permalink / raw)
  To: Michael Kelley
  Cc: Yu Zhang, linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org, iommu@lists.linux.dev,
	linux-pci@vger.kernel.org, easwar.hariharan, kys@microsoft.com,
	haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com,
	lpieralisi@kernel.org, kwilczynski@kernel.org, mani@kernel.org,
	robh@kernel.org, bhelgaas@google.com, arnd@arndb.de,
	joro@8bytes.org, will@kernel.org, robin.murphy@arm.com,
	jacob.pan@linux.microsoft.com, nunodasneves@linux.microsoft.com,
	mrathor@linux.microsoft.com, peterz@infradead.org,
	linux-arch@vger.kernel.org
In-Reply-To: <SN6PR02MB415755B0CED30E8BEB062942D485A@SN6PR02MB4157.namprd02.prod.outlook.com>

On 1/8/2026 10:47 AM, Michael Kelley wrote:
> From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, December 8, 2025 9:11 PM
>>
>> From: Wei Liu <wei.liu@kernel.org>
>>

<snip>

>> +struct hv_input_get_iommu_capabilities {
>> +	u64 partition_id;
>> +	u64 reserved;
>> +} __packed;
>> +
>> +struct hv_output_get_iommu_capabilities {
>> +	u32 size;
>> +	u16 reserved;
>> +	u8  max_iova_width;
>> +	u8  max_pasid_width;
>> +
>> +#define HV_IOMMU_CAP_PRESENT (1ULL << 0)
>> +#define HV_IOMMU_CAP_S2 (1ULL << 1)
>> +#define HV_IOMMU_CAP_S1 (1ULL << 2)
>> +#define HV_IOMMU_CAP_S1_5LVL (1ULL << 3)
>> +#define HV_IOMMU_CAP_PASID (1ULL << 4)
>> +#define HV_IOMMU_CAP_ATS (1ULL << 5)
>> +#define HV_IOMMU_CAP_PRI (1ULL << 6)
>> +
>> +	u64 iommu_cap;
>> +	u64 pgsize_bitmap;
>> +} __packed;
>> +
>> +enum hv_logical_device_property_code {
>> +	HV_LOGICAL_DEVICE_PROPERTY_PVIOMMU = 10,
>> +};
>> +
>> +struct hv_input_get_logical_device_property {
>> +	u64 partition_id;
>> +	u64 logical_device_id;
>> +	enum hv_logical_device_property_code code;
> 
> Historically we've avoided "enum" types in structures that are part of
> the hypervisor ABI. Use u32 here?

<snip>
What has been the reasoning for that practice? Since the introduction of the
include/hyperv/ headers, we have generally wanted to import as directly as
possible the relevant definitions from the hypervisor code base. If there's
a strong reason, we could consider switching the enum for a u32 here
since, at least for the moment, there's only a single value being used.

Thanks,
Easwar (he/him)



* Re: [RFC v1 1/5] PCI: hv: Create and export hv_build_logical_dev_id()
From: Easwar Hariharan @ 2026-01-09 18:40 UTC (permalink / raw)
  To: Michael Kelley
  Cc: Yu Zhang, linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org, iommu@lists.linux.dev,
	linux-pci@vger.kernel.org, easwar.hariharan, kys@microsoft.com,
	haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com,
	lpieralisi@kernel.org, kwilczynski@kernel.org, mani@kernel.org,
	robh@kernel.org, bhelgaas@google.com, arnd@arndb.de,
	joro@8bytes.org, will@kernel.org, robin.murphy@arm.com,
	jacob.pan@linux.microsoft.com, nunodasneves@linux.microsoft.com,
	mrathor@linux.microsoft.com, peterz@infradead.org,
	linux-arch@vger.kernel.org
In-Reply-To: <SN6PR02MB41570FC0D7EA1364FB48CD1ED485A@SN6PR02MB4157.namprd02.prod.outlook.com>

On 1/8/2026 10:46 AM, Michael Kelley wrote:
> From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, December 8, 2025 9:11 PM
>>
>> From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
>>
>> Hyper-V uses a logical device ID to identify a PCI endpoint device for
>> child partitions. This ID will also be required for future hypercalls
>> used by the Hyper-V IOMMU driver.
>>
>> Refactor the logic for building this logical device ID into a standalone
>> helper function and export the interface for wider use.
>>
>> Signed-off-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
>> Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com>
>> ---
>>  drivers/pci/controller/pci-hyperv.c | 28 ++++++++++++++++++++--------
>>  include/asm-generic/mshyperv.h      |  2 ++
>>  2 files changed, 22 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
>> index 146b43981b27..4b82e06b5d93 100644
>> --- a/drivers/pci/controller/pci-hyperv.c
>> +++ b/drivers/pci/controller/pci-hyperv.c
>> @@ -598,15 +598,31 @@ static unsigned int hv_msi_get_int_vector(struct irq_data *data)
>>
>>  #define hv_msi_prepare		pci_msi_prepare
>>
>> +/**
>> + * Build a "Device Logical ID" out of this PCI bus's instance GUID and the
>> + * function number of the device.
>> + */
>> +u64 hv_build_logical_dev_id(struct pci_dev *pdev)
>> +{
>> +	struct pci_bus *pbus = pdev->bus;
>> +	struct hv_pcibus_device *hbus = container_of(pbus->sysdata,
>> +						struct hv_pcibus_device, sysdata);
>> +
>> +	return (u64)((hbus->hdev->dev_instance.b[5] << 24) |
>> +		     (hbus->hdev->dev_instance.b[4] << 16) |
>> +		     (hbus->hdev->dev_instance.b[7] << 8)  |
>> +		     (hbus->hdev->dev_instance.b[6] & 0xf8) |
>> +		     PCI_FUNC(pdev->devfn));
>> +}
>> +EXPORT_SYMBOL_GPL(hv_build_logical_dev_id);
> 
> This change is fine for hv_irq_retarget_interrupt(), but it doesn't help for the
> new IOMMU driver because pci-hyperv.c can be (and often is) built as a module.
> The new Hyper-V IOMMU driver in this patch series is built-in, and so it can't
> use this symbol in that case -- you'll get a link error on vmlinux when building
> the kernel. Requiring pci-hyperv.c to *not* be built as a module would also
> require that the VMBus driver not be built as a module, so I don't think that's
> the right solution.
> 
> This is a messy problem. The new IOMMU driver needs to start with a generic
> "struct device" for the PCI device, and somehow find the corresponding VMBus
> PCI pass-thru device from which it can get the VMBus instance ID. I'm thinking
> about ways to do this that don't depend on code and data structures that are
> private to the pci-hyperv.c driver, and will follow-up if I have a good suggestion.

Thank you, Michael. FWIW, I did try to pull the device ID components out of
pci-hyperv into include/linux/hyperv.h and/or a new include/linux/pci-hyperv.h,
but it was just too messy, as you say.

> I was wondering if this "logical device id" is actually parsed by the hypervisor,
> or whether it is just a unique ID that is opaque to the hypervisor. From the
> usage in the hypercalls in pci-hyperv.c and this new IOMMU driver, it appears
> to be the former. Evidently the hypervisor is taking this logical device ID and
> matching against bytes 4 thru 7 of the instance GUIDs of PCI pass-thru
> devices offered to the guest, so as to identify a particular PCI pass-thru device.
> If that's the case, then Linux doesn't have the option of choosing some other
> unique ID that is easier to generate and access. 

Yes, the device ID is actually used by the hypervisor to find the corresponding PCI
pass-thru device and the physical IOMMUs the device is behind and execute the
requested operation for those IOMMUs.
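
For reference, the byte layout under discussion can be sketched as a
stand-alone function. This is an illustration, not the driver code:
example_build_logical_dev_id() is a hypothetical name, the GUID is
passed as a plain 16-byte array rather than read from the VMBus device,
and the function number is masked to three bits as PCI_FUNC() does.
Widening each byte to an unsigned 32-bit value before shifting is a
choice made for this sketch, so that a set high bit in byte 5 cannot be
sign-extended into the upper half of the 64-bit result.

```c
#include <stdint.h>

/*
 * Build the 32-bit logical device ID from bytes 4..7 of the instance
 * GUID and the PCI function number:
 *
 *   bits 31..24  GUID byte 5
 *   bits 23..16  GUID byte 4
 *   bits 15..8   GUID byte 7
 *   bits  7..3   GUID byte 6, upper five bits
 *   bits  2..0   PCI function number
 */
uint64_t example_build_logical_dev_id(const uint8_t guid_b[16],
				      unsigned int func)
{
	uint32_t id = ((uint32_t)guid_b[5] << 24) |
		      ((uint32_t)guid_b[4] << 16) |
		      ((uint32_t)guid_b[7] << 8) |
		      ((uint32_t)guid_b[6] & 0xf8) |
		      (func & 0x7);

	return id;
}
```

Only 29 bits of the GUID end up in the ID, which is why two distinct
GUIDs can map to the same value.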

> There's a uniqueness issue with this kind of logical device ID that has been
> around for years, but I had never thought about before. In hv_pci_probe()
> instance GUID bytes 4 and 5 are used to generate the PCI domain number for
> the "fake" PCI bus that the PCI pass-thru device resides on. The issue is the
> lack of guaranteed uniqueness of bytes 4 and 5, so there's code to deal with
> a collision. (The full GUID is unique, but not necessarily some subset of the
> GUID.) It seems like the same kind of uniqueness issue could occur here. Does
> the Hyper-V host provide any guarantees about the uniqueness of bytes 4 thru
> 7 as a unit, and if not, what happens if there is a collision? Again, this
> uniqueness issue has existed for years, so it's not new to this patch set, but
> with new uses of the logical device ID, it seems relevant to consider.
 
Thank you for bringing that up, I was aware of the uniqueness workaround but, like you,
I had not considered that the workaround could prevent matching the device ID with the
record the hypervisor has of the PCI pass-thru device assigned to us. I will work with
the hypervisor folks to resolve this before this patch series is posted for merge.

Thanks,
Easwar (he/him)


* RE: [PATCH] scsi: storvsc: Process unsupported MODE_SENSE_10
From: Michael Kelley @ 2026-01-09 17:48 UTC (permalink / raw)
  To: longli@linux.microsoft.com, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, James E.J. Bottomley, Martin K. Petersen,
	James Bottomley, linux-hyperv@vger.kernel.org,
	linux-scsi@vger.kernel.org, linux-kernel@vger.kernel.org
  Cc: Long Li, stable@kernel.org
In-Reply-To: <1767815803-3747-1-git-send-email-longli@linux.microsoft.com>

From: longli@linux.microsoft.com <longli@linux.microsoft.com> Sent: Wednesday, January 7, 2026 11:57 AM
> 
> The Hyper-V host does not support MODE_SENSE_10 and MODE_SENSE.
> The driver handles MODE_SENSE as an unsupported command, but not
> MODE_SENSE_10. Add MODE_SENSE_10 to the same handling logic and
> return the correct code to the SCSI layer.
> 
> Fixes: 89ae7d709357 ("Staging: hv: storvsc: Move the storage driver out of the staging area")
> Cc: stable@kernel.org
> Signed-off-by: Long Li <longli@microsoft.com>
> ---
>  drivers/scsi/storvsc_drv.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> index 6e4112143c76..9b15784e2d64 100644
> --- a/drivers/scsi/storvsc_drv.c
> +++ b/drivers/scsi/storvsc_drv.c
> @@ -1154,6 +1154,7 @@ static void storvsc_on_io_completion(struct storvsc_device *stor_device,
> 
>  	if ((stor_pkt->vm_srb.cdb[0] == INQUIRY) ||
>  	   (stor_pkt->vm_srb.cdb[0] == MODE_SENSE) ||
> +	   (stor_pkt->vm_srb.cdb[0] == MODE_SENSE_10) ||
>  	   (stor_pkt->vm_srb.cdb[0] == MAINTENANCE_IN &&
>  	   hv_dev_is_fc(device))) {
>  		vstor_packet->vm_srb.scsi_status = 0;

There's a code comment above this "if" statement that describes the situation.
The comment specifically lists INQUIRY, MODE_SENSE, and MAINTENANCE_IN. For
consistency, it should be updated to include MODE_SENSE_10.

With the comment updated,

Reviewed-by: Michael Kelley <mhklinux@outlook.com>


* Re: [PATCH 00/12] Recover sysfb after DRM probe failure
From: Thomas Zimmermann @ 2026-01-09 10:34 UTC (permalink / raw)
  To: Zack Rusin, dri-devel
  Cc: Alex Deucher, amd-gfx, Ard Biesheuvel, Ce Sun, Chia-I Wu,
	Christian König, Danilo Krummrich, Dave Airlie, Deepak Rawat,
	Dmitry Osipenko, Gerd Hoffmann, Gurchetan Singh, Hans de Goede,
	Hawking Zhang, Helge Deller, intel-gfx, intel-xe, Jani Nikula,
	Javier Martinez Canillas, Jocelyn Falempe, Joonas Lahtinen,
	Lijo Lazar, linux-efi, linux-fbdev, linux-hyperv, linux-kernel,
	Lucas De Marchi, Lyude Paul, Maarten Lankhorst,
	Mario Limonciello (AMD), Mario Limonciello, Maxime Ripard,
	nouveau, Rodrigo Vivi, Simona Vetter, spice-devel,
	Thomas Hellström, Timur Kristóf, Tvrtko Ursulin,
	virtualization, Vitaly Prosyak
In-Reply-To: <20251229215906.3688205-1-zack.rusin@broadcom.com>

Hi

Am 29.12.25 um 22:58 schrieb Zack Rusin:
> Almost a rite of passage for every DRM developer and most Linux users
> is upgrading your DRM driver/updating boot flags/changing some config
> and having DRM driver fail at probe resulting in a blank screen.
>
> Currently there's no way to recover from DRM driver probe failure. PCI
> DRM drivers explicitly throw out the existing sysfb to get exclusive
> access to PCI resources so if the probe fails the system is left without
> a functioning display driver.
>
> Add code to sysfb to recover the system framebuffer when a DRM driver's probe
> fails. This means that a DRM driver that fails to load reloads the system
> framebuffer driver.
>
> This works best with simpledrm. Without it Xorg won't recover because
> it still tries to load the vendor specific driver which ends up usually
> not working at all. With simpledrm the system recovers really nicely
> ending up with a working console and not a blank screen.
>
> There's a caveat in that some hardware might require some special magic
> register write to recover EFI display. I'd appreciate it a lot if
> maintainers could introduce a temporary failure in their drivers
> probe to validate that the sysfb recovers and they get a working console.
> The easiest way to double check it is by adding:
>   /* XXX: Temporary failure to test sysfb restore - REMOVE BEFORE COMMIT */
>   dev_info(&pdev->dev, "Testing sysfb restore: forcing probe failure\n");
>   ret = -EINVAL;
>   goto out_error;
> or such right after the devm_aperture_remove_conflicting_pci_devices .

Recovering the display like that is guesswork and will at best work
with simple discrete devices where the framebuffer is always located in
a confined graphics aperture.

But the problem you're trying to solve is a real one.

What we'd want to do instead is to take the initial hardware state into 
account when we do the initial mode-setting operation.

The first step is to move each driver's remove_conflicting_devices call 
to the latest possible location in the probe function. We usually do it 
first, because that's easy. But on most hardware, it could happen much 
later. The native driver is free to examine hardware state while probing 
the device as long as it does not interfere with the pre-configured 
framebuffer mode/format/address. Hence it can set up its internal
structures while the sysfb device is still active.

The next step for the native driver is to load the pre-configured 
hardware state into its initial internal atomic state. Maxime has worked 
on that on and off. The last iteration I'm aware of is at [1].

After the state-readout, the sysfb device has to be unplugged. But as 
the underlying hardware config remains active, the native driver can now 
use and modify it. We currently do a drm_mode_config_reset(), which 
clears the state and then let the first client set a new display state. 
But with state-readout, we could either pick up the existing framebuffer 
directly or do a proper modeset from existing state.

As DRM clients control the mode setting, they'd likely need some changes 
to handle state-readout. There's such code in i915's fbdev support AFAIK.

Best regards
Thomas

[1] 
https://lore.kernel.org/dri-devel/20250902-drm-state-readout-v1-0-14ad5315da3f@kernel.org/

>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: amd-gfx@lists.freedesktop.org
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: Ce Sun <cesun102@amd.com>
> Cc: Chia-I Wu <olvaffe@gmail.com>
> Cc: "Christian König" <christian.koenig@amd.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: Dave Airlie <airlied@redhat.com>
> Cc: Deepak Rawat <drawat.floss@gmail.com>
> Cc: Dmitry Osipenko <dmitry.osipenko@collabora.com>
> Cc: dri-devel@lists.freedesktop.org
> Cc: Gerd Hoffmann <kraxel@redhat.com>
> Cc: Gurchetan Singh <gurchetansingh@chromium.org>
> Cc: Hans de Goede <hansg@kernel.org>
> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
> Cc: Helge Deller <deller@gmx.de>
> Cc: intel-gfx@lists.freedesktop.org
> Cc: intel-xe@lists.freedesktop.org
> Cc: Jani Nikula <jani.nikula@linux.intel.com>
> Cc: Javier Martinez Canillas <javierm@redhat.com>
> Cc: Jocelyn Falempe <jfalempe@redhat.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Lijo Lazar <lijo.lazar@amd.com>
> Cc: linux-efi@vger.kernel.org
> Cc: linux-fbdev@vger.kernel.org
> Cc: linux-hyperv@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Cc: "Mario Limonciello (AMD)" <superm1@kernel.org>
> Cc: Mario Limonciello <mario.limonciello@amd.com>
> Cc: Maxime Ripard <mripard@kernel.org>
> Cc: nouveau@lists.freedesktop.org
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: spice-devel@lists.freedesktop.org
> Cc: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
> Cc: Thomas Zimmermann <tzimmermann@suse.de>
> Cc: "Timur Kristóf" <timur.kristof@gmail.com>
> Cc: Tvrtko Ursulin <tursulin@ursulin.net>
> Cc: virtualization@lists.linux.dev
> Cc: Vitaly Prosyak <vitaly.prosyak@amd.com>
>
> Zack Rusin (12):
>    video/aperture: Add sysfb restore on DRM probe failure
>    drm/vmwgfx: Use devm aperture helpers for sysfb restore on probe
>      failure
>    drm/xe: Use devm aperture helpers for sysfb restore on probe failure
>    drm/amdgpu: Use devm aperture helpers for sysfb restore on probe
>      failure
>    drm/virtio: Add sysfb restore on probe failure
>    drm/nouveau: Use devm aperture helpers for sysfb restore on probe
>      failure
>    drm/qxl: Use devm aperture helpers for sysfb restore on probe failure
>    drm/vboxvideo: Use devm aperture helpers for sysfb restore on probe
>      failure
>    drm/hyperv: Add sysfb restore on probe failure
>    drm/ast: Use devm aperture helpers for sysfb restore on probe failure
>    drm/radeon: Use devm aperture helpers for sysfb restore on probe
>      failure
>    drm/i915: Use devm aperture helpers for sysfb restore on probe failure
>
>   drivers/firmware/efi/sysfb_efi.c           |   2 +-
>   drivers/firmware/sysfb.c                   | 191 +++++++++++++--------
>   drivers/firmware/sysfb_simplefb.c          |  10 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   9 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |   7 +
>   drivers/gpu/drm/ast/ast_drv.c              |  13 +-
>   drivers/gpu/drm/hyperv/hyperv_drm_drv.c    |  23 +++
>   drivers/gpu/drm/i915/i915_driver.c         |  13 +-
>   drivers/gpu/drm/nouveau/nouveau_drm.c      |  16 +-
>   drivers/gpu/drm/qxl/qxl_drv.c              |  14 +-
>   drivers/gpu/drm/radeon/radeon_drv.c        |  15 +-
>   drivers/gpu/drm/vboxvideo/vbox_drv.c       |  13 +-
>   drivers/gpu/drm/virtio/virtgpu_drv.c       |  29 ++++
>   drivers/gpu/drm/vmwgfx/vmwgfx_drv.c        |  13 +-
>   drivers/gpu/drm/xe/xe_device.c             |   7 +-
>   drivers/gpu/drm/xe/xe_pci.c                |   7 +
>   drivers/video/aperture.c                   |  54 ++++++
>   include/linux/aperture.h                   |  14 ++
>   include/linux/sysfb.h                      |   6 +
>   19 files changed, 368 insertions(+), 88 deletions(-)
>

-- 
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Frankenstr. 146, 90461 Nürnberg, Germany, www.suse.com
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich, (HRB 36809, AG Nürnberg)



^ permalink raw reply

* Re: [PATCH v2] mshv: Align huge page stride with guest mapping
From: Nuno Das Neves @ 2026-01-08 20:03 UTC (permalink / raw)
  To: Stanislav Kinsburskii, kys, haiyangz, wei.liu, decui, longli
  Cc: linux-hyperv, linux-kernel
In-Reply-To: <176781093198.21595.6373086133020540990.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

On 1/7/2026 10:45 AM, Stanislav Kinsburskii wrote:
> Ensure that a stride larger than 1 (huge page) is only used when page
> points to a head of a huge page and both the guest frame number (gfn) and
> the operation size (page_count) are aligned to the huge page size
> (PTRS_PER_PMD). This matches the hypervisor requirement that map/unmap
> operations for huge pages must be guest-aligned and cover a full huge page.
> 
> Add mshv_chunk_stride() to encapsulate this alignment and page-order
> validation, and plumb a huge_page flag into the region chunk handlers.
> This prevents issuing large-page map/unmap/share operations that the
> hypervisor would reject due to misaligned guest mappings.
> 
> Fixes: abceb4297bf8 ("mshv: Fix huge page handling in memory region traversal")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_regions.c |   93 ++++++++++++++++++++++++++++++---------------
>  1 file changed, 62 insertions(+), 31 deletions(-)
> 

Reviewed-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>

^ permalink raw reply

* RE: [PATCH v2] mshv: Align huge page stride with guest mapping
From: Michael Kelley @ 2026-01-08 19:00 UTC (permalink / raw)
  To: Stanislav Kinsburskii, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <176781093198.21595.6373086133020540990.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Wednesday, January 7, 2026 10:46 AM
> 
> Ensure that a stride larger than 1 (huge page) is only used when page
> points to a head of a huge page and both the guest frame number (gfn) and
> the operation size (page_count) are aligned to the huge page size
> (PTRS_PER_PMD). This matches the hypervisor requirement that map/unmap
> operations for huge pages must be guest-aligned and cover a full huge page.
> 
> Add mshv_chunk_stride() to encapsulate this alignment and page-order
> validation, and plumb a huge_page flag into the region chunk handlers.
> This prevents issuing large-page map/unmap/share operations that the
> hypervisor would reject due to misaligned guest mappings.
> 
> Fixes: abceb4297bf8 ("mshv: Fix huge page handling in memory region traversal")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_regions.c |   93 ++++++++++++++++++++++++++++++---------------
>  1 file changed, 62 insertions(+), 31 deletions(-)
> 
> diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> index 30bacba6aec3..adba3564d9f1 100644
> --- a/drivers/hv/mshv_regions.c
> +++ b/drivers/hv/mshv_regions.c
> @@ -19,6 +19,41 @@
> 
>  #define MSHV_MAP_FAULT_IN_PAGES				PTRS_PER_PMD
> 
> +/**
> + * mshv_chunk_stride - Compute stride for mapping guest memory
> + * @page      : The page to check for huge page backing
> + * @gfn       : Guest frame number for the mapping
> + * @page_count: Total number of pages in the mapping
> + *
> + * Determines the appropriate stride (in pages) for mapping guest memory.
> + * Uses huge page stride if the backing page is huge and the guest mapping
> + * is properly aligned; otherwise falls back to single page stride.
> + *
> + * Return: Stride in pages, or -EINVAL if page order is unsupported.
> + */
> +static int mshv_chunk_stride(struct page *page,
> +			     u64 gfn, u64 page_count)
> +{
> +	unsigned int page_order;
> +
> +	/*
> +	 * Use single page stride by default. For huge page stride, the
> +	 * page must be compound and point to the head of the compound
> +	 * page, and both gfn and page_count must be huge-page aligned.
> +	 */
> +	if (!PageCompound(page) || !PageHead(page) ||
> +	    !IS_ALIGNED(gfn, PTRS_PER_PMD) ||
> +	    !IS_ALIGNED(page_count, PTRS_PER_PMD))
> +		return 1;
> +
> +	page_order = folio_order(page_folio(page));
> +	/* The hypervisor only supports 2M huge page */
> +	if (page_order != PMD_ORDER)
> +		return -EINVAL;
> +
> +	return 1 << page_order;
> +}

I think this works and solves the problem we've been discussing. My
knowledge of PageCompound() and PageHead() is limited to the obvious,
so I can't spot any weird edge cases that might occur. My preference would
be to just check the alignment of the PFN corresponding to "page", which
is what the hypervisor will do, but this approach provides a different kind
of explicitness, and it's your call to make.

With that, for the entire patch:

Reviewed-by: Michael Kelley <mhklinux@outlook.com>
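
For illustration only, the PFN-alignment alternative mentioned above could
look roughly like the user-space model below. It is a sketch, not the
patch's code: model_chunk_stride(), model_is_aligned() and
MODEL_PTRS_PER_PMD are hypothetical names, and the model assumes 4 KiB base
pages with 2 MiB huge pages (512 pages per PMD). It also assumes the caller
already knows the 512 pages are physically contiguous, which the
PageCompound()/PageHead() checks in the patch establish differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB base pages and 2 MiB huge pages, so 512 pages per PMD. */
#define MODEL_PTRS_PER_PMD 512u

/* Hypothetical stand-in for the kernel's IS_ALIGNED(); align must be a power of two. */
static bool model_is_aligned(uint64_t value, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (value & (align - 1)) == 0;
}

/*
 * Hypothetical model of a stride check driven purely by alignment:
 * use the huge-page stride only when the host PFN, the guest frame
 * number and the remaining page count are all 2 MiB aligned;
 * otherwise fall back to a single-page stride.
 */
static unsigned int model_chunk_stride(uint64_t pfn, uint64_t gfn,
				       uint64_t page_count)
{
	if (!model_is_aligned(pfn, MODEL_PTRS_PER_PMD) ||
	    !model_is_aligned(gfn, MODEL_PTRS_PER_PMD) ||
	    !model_is_aligned(page_count, MODEL_PTRS_PER_PMD))
		return 1;

	return MODEL_PTRS_PER_PMD;
}
```

With these assumptions, model_chunk_stride(0x200, 0x400, 0x200) selects the
huge-page stride, while a misaligned PFN, gfn or count drops back to a
stride of 1.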

> +
>  /**
>   * mshv_region_process_chunk - Processes a contiguous chunk of memory pages
>   *                             in a region.
> @@ -45,25 +80,23 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
>  				      int (*handler)(struct mshv_mem_region *region,
>  						     u32 flags,
>  						     u64 page_offset,
> -						     u64 page_count))
> +						     u64 page_count,
> +						     bool huge_page))
>  {
> -	u64 count, stride;
> -	unsigned int page_order;
> +	u64 gfn = region->start_gfn + page_offset;
> +	u64 count;
>  	struct page *page;
> -	int ret;
> +	int stride, ret;
> 
>  	page = region->pages[page_offset];
>  	if (!page)
>  		return -EINVAL;
> 
> -	page_order = folio_order(page_folio(page));
> -	/* The hypervisor only supports 4K and 2M page sizes */
> -	if (page_order && page_order != PMD_ORDER)
> -		return -EINVAL;
> +	stride = mshv_chunk_stride(page, gfn, page_count);
> +	if (stride < 0)
> +		return stride;
> 
> -	stride = 1 << page_order;
> -
> -	/* Start at stride since the first page is validated */
> +	/* Start at stride since the first stride is validated */
>  	for (count = stride; count < page_count; count += stride) {
>  		page = region->pages[page_offset + count];
> 
> @@ -71,12 +104,13 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
>  		if (!page)
>  			break;
> 
> -		/* Break if page size changes */
> -		if (page_order != folio_order(page_folio(page)))
> +		/* Break if stride size changes */
> +		if (stride != mshv_chunk_stride(page, gfn + count,
> +						page_count - count))
>  			break;
>  	}
> 
> -	ret = handler(region, flags, page_offset, count);
> +	ret = handler(region, flags, page_offset, count, stride > 1);
>  	if (ret)
>  		return ret;
> 
> @@ -108,7 +142,8 @@ static int mshv_region_process_range(struct mshv_mem_region *region,
>  				     int (*handler)(struct mshv_mem_region *region,
>  						    u32 flags,
>  						    u64 page_offset,
> -						    u64 page_count))
> +						    u64 page_count,
> +						    bool huge_page))
>  {
>  	long ret;
> 
> @@ -162,11 +197,10 @@ struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
> 
>  static int mshv_region_chunk_share(struct mshv_mem_region *region,
>  				   u32 flags,
> -				   u64 page_offset, u64 page_count)
> +				   u64 page_offset, u64 page_count,
> +				   bool huge_page)
>  {
> -	struct page *page = region->pages[page_offset];
> -
> -	if (PageHuge(page) || PageTransCompound(page))
> +	if (huge_page)
>  		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
> 
>  	return hv_call_modify_spa_host_access(region->partition->pt_id,
> @@ -188,11 +222,10 @@ int mshv_region_share(struct mshv_mem_region *region)
> 
>  static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
>  				     u32 flags,
> -				     u64 page_offset, u64 page_count)
> +				     u64 page_offset, u64 page_count,
> +				     bool huge_page)
>  {
> -	struct page *page = region->pages[page_offset];
> -
> -	if (PageHuge(page) || PageTransCompound(page))
> +	if (huge_page)
>  		flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
> 
>  	return hv_call_modify_spa_host_access(region->partition->pt_id,
> @@ -212,11 +245,10 @@ int mshv_region_unshare(struct mshv_mem_region *region)
> 
>  static int mshv_region_chunk_remap(struct mshv_mem_region *region,
>  				   u32 flags,
> -				   u64 page_offset, u64 page_count)
> +				   u64 page_offset, u64 page_count,
> +				   bool huge_page)
>  {
> -	struct page *page = region->pages[page_offset];
> -
> -	if (PageHuge(page) || PageTransCompound(page))
> +	if (huge_page)
>  		flags |= HV_MAP_GPA_LARGE_PAGE;
> 
>  	return hv_call_map_gpa_pages(region->partition->pt_id,
> @@ -295,11 +327,10 @@ int mshv_region_pin(struct mshv_mem_region *region)
> 
>  static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
>  				   u32 flags,
> -				   u64 page_offset, u64 page_count)
> +				   u64 page_offset, u64 page_count,
> +				   bool huge_page)
>  {
> -	struct page *page = region->pages[page_offset];
> -
> -	if (PageHuge(page) || PageTransCompound(page))
> +	if (huge_page)
>  		flags |= HV_UNMAP_GPA_LARGE_PAGE;
> 
>  	return hv_call_unmap_gpa_pages(region->partition->pt_id,
> 
> 


^ permalink raw reply

* RE: [RFC v1 5/5] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest
From: Michael Kelley @ 2026-01-08 18:48 UTC (permalink / raw)
  To: Yu Zhang, linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org, iommu@lists.linux.dev,
	linux-pci@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, lpieralisi@kernel.org,
	kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org,
	bhelgaas@google.com, arnd@arndb.de, joro@8bytes.org,
	will@kernel.org, robin.murphy@arm.com,
	easwar.hariharan@linux.microsoft.com,
	jacob.pan@linux.microsoft.com, nunodasneves@linux.microsoft.com,
	mrathor@linux.microsoft.com, peterz@infradead.org,
	linux-arch@vger.kernel.org
In-Reply-To: <20251209051128.76913-6-zhangyu1@linux.microsoft.com>

From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, December 8, 2025 9:11 PM
> 
> Add a para-virtualized IOMMU driver for Linux guests running on Hyper-V.
> This driver implements stage-1 IO translation within the guest OS.
> It integrates with the Linux IOMMU core, utilizing Hyper-V hypercalls
> for:
>  - Capability discovery
>  - Domain allocation, configuration, and deallocation
>  - Device attachment and detachment
>  - IOTLB invalidation
> 
> The driver constructs x86-compatible stage-1 IO page tables in the
> guest memory using consolidated IO page table helpers. This allows
> the guest to manage stage-1 translations independently of vendor-
> specific drivers (like Intel VT-d or AMD IOMMU).
> 
> Hyper-v consumes this stage-1 IO page table, when a device domain is

s/Hyper-v/Hyper-V/

> created and configured, and nests it with the host's stage-2 IO page
> tables, therefore elemenating the VM exits for guest IOMMU mapping

s/elemenating/eliminating/

> operations.
> 
> For guest IOMMU unmapping operations, VM exits to perform the IOTLB
> flush(and possibly the device TLB flush) is still unavoidable. For

Typo: Add a space after "flush" and before the open parenthesis.

> now, HVCALL_FLUSH_DEVICE_DOMAIN	is used to implement a domain-selective

Typo:  Extra white space after HVCALL_FLUSH_DEVICE_DOMAIN

> IOTLB flush. New hypercalls for finer-grained hypercall will be provided
> in future patches.
> 
> Co-developed-by: Wei Liu <wei.liu@kernel.org>
> Signed-off-by: Wei Liu <wei.liu@kernel.org>
> Co-developed-by: Jacob Pan <jacob.pan@linux.microsoft.com>
> Signed-off-by: Jacob Pan <jacob.pan@linux.microsoft.com>
> Co-developed-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
> Signed-off-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
> Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com>
> ---
>  drivers/iommu/hyperv/Kconfig  |  14 +
>  drivers/iommu/hyperv/Makefile |   1 +
>  drivers/iommu/hyperv/iommu.c  | 608 ++++++++++++++++++++++++++++++++++
>  drivers/iommu/hyperv/iommu.h  |  53 +++
>  4 files changed, 676 insertions(+)
>  create mode 100644 drivers/iommu/hyperv/iommu.c
>  create mode 100644 drivers/iommu/hyperv/iommu.h
> 
> diff --git a/drivers/iommu/hyperv/Kconfig b/drivers/iommu/hyperv/Kconfig
> index 30f40d867036..fa3c77752d7b 100644
> --- a/drivers/iommu/hyperv/Kconfig
> +++ b/drivers/iommu/hyperv/Kconfig
> @@ -8,3 +8,17 @@ config HYPERV_IOMMU
>  	help
>  	  Stub IOMMU driver to handle IRQs to support Hyper-V Linux
>  	  guest and root partitions.
> +
> +if HYPERV_IOMMU
> +config HYPERV_PVIOMMU
> +	bool "Microsoft Hypervisor para-virtualized IOMMU support"
> +	depends on X86 && HYPERV && PCI_HYPERV

Depending on PCI_HYPERV is problematic as pointed out in my comments
on Patch 1 of this series.

> +	depends on IOMMU_PT

Use "select IOMMU_PT" instead of "depends"? Other IOMMU drivers use
"select".

> +	select IOMMU_API
> +	select IOMMU_DMA

IOMMU_DMA is enabled by default on x86 and arm64 architectures.
Other IOMMU drivers don't select it, so maybe this could be dropped.

> +	select DMA_OPS

DMA_OPS doesn't exist.  I'm not sure what this is supposed to be.

> +	select IOMMU_IOVA
> +	default HYPERV
> +	help
> +	  A para-virtualized IOMMU for Microsoft Hypervisor guest.
> +endif
> diff --git a/drivers/iommu/hyperv/Makefile b/drivers/iommu/hyperv/Makefile
> index 9f557bad94ff..8669741c0a51 100644
> --- a/drivers/iommu/hyperv/Makefile
> +++ b/drivers/iommu/hyperv/Makefile
> @@ -1,2 +1,3 @@
>  # SPDX-License-Identifier: GPL-2.0
>  obj-$(CONFIG_HYPERV_IOMMU) += irq_remapping.o
> +obj-$(CONFIG_HYPERV_PVIOMMU) += iommu.o
> diff --git a/drivers/iommu/hyperv/iommu.c b/drivers/iommu/hyperv/iommu.c
> new file mode 100644
> index 000000000000..3d0aff868e16
> --- /dev/null
> +++ b/drivers/iommu/hyperv/iommu.c
> @@ -0,0 +1,608 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Hyper-V IOMMU driver.
> + *
> + * Copyright (C) 2019, 2024-2025 Microsoft, Inc.
> + */
> +
> +#include <linux/iommu.h>
> +#include <linux/pci.h>
> +#include <linux/dma-map-ops.h>
> +#include <linux/generic_pt/iommu.h>
> +#include <linux/syscore_ops.h>
> +#include <linux/pci-ats.h>
> +
> +#include <asm/iommu.h>
> +#include <asm/hypervisor.h>
> +#include <asm/mshyperv.h>
> +
> +#include "iommu.h"
> +#include "../dma-iommu.h"
> +#include "../iommu-pages.h"
> +
> +static void hv_iommu_detach_dev(struct iommu_domain *domain, struct device *dev);
> +static void hv_flush_device_domain(struct hv_iommu_domain *hv_domain);

With some fairly simple reordering of code in this source file, these
two declarations could go away. Generally, the best practice is to order
so such declarations aren't needed, though that's not always possible.
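
As a generic illustration of the reordering (a user-space sketch with
hypothetical demo_* names, not the driver's functions): when each callee is
defined before its caller, no forward declaration is needed. The call chain
mirrors the one in the patch, where attach calls detach and detach calls
flush.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a domain; only tracks what the demo needs. */
struct demo_domain {
	int attached;
	int flushes;
};

/* Defined first: it calls nothing else in this file. */
static void demo_flush(struct demo_domain *d)
{
	assert(d != NULL);
	d->flushes++;
}

/* Defined second: demo_flush() is already visible, so no prototype is needed. */
static void demo_detach(struct demo_domain *d)
{
	demo_flush(d);
	d->attached = 0;
}

/* Defined last: it may call demo_detach() without a forward declaration. */
static void demo_attach(struct demo_domain *new_d, struct demo_domain *old_d)
{
	if (old_d)
		demo_detach(old_d);
	new_d->attached = 1;
}
```

Ordering leaf functions first works whenever the call graph has no cycles;
mutually recursive functions are the case where a forward declaration
remains necessary.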

> +struct hv_iommu_dev *hv_iommu_device;
> +static struct hv_iommu_domain hv_identity_domain;
> +static struct hv_iommu_domain hv_blocking_domain;

Why is hv_iommu_device allocated dynamically while the two
domains are allocated statically? Seems like the approach could
be consistent, though maybe there's some reason I'm missing.

> +static const struct iommu_domain_ops hv_iommu_identity_domain_ops;
> +static const struct iommu_domain_ops hv_iommu_blocking_domain_ops;
> +static struct iommu_ops hv_iommu_ops;

I'm wondering if this declaration could also be eliminated by some
reordering, though I didn't take time to figure out the details. Maybe
this is one of those cases that can't be avoided.

> +
> +#define hv_iommu_present(iommu_cap) (iommu_cap & HV_IOMMU_CAP_PRESENT)
> +#define hv_iommu_s1_domain_supported(iommu_cap) (iommu_cap & HV_IOMMU_CAP_S1)
> +#define hv_iommu_5lvl_supported(iommu_cap) (iommu_cap & HV_IOMMU_CAP_S1_5LVL)
> +#define hv_iommu_ats_supported(iommu_cap) (iommu_cap & HV_IOMMU_CAP_ATS)
> +
> +static int hv_create_device_domain(struct hv_iommu_domain *hv_domain, u32 domain_stage)
> +{
> +	int ret;
> +	u64 status;
> +	unsigned long flags;
> +	struct hv_input_create_device_domain *input;
> +
> +	ret = ida_alloc_range(&hv_iommu_device->domain_ids,
> +			hv_iommu_device->first_domain, hv_iommu_device->last_domain,
> +			GFP_KERNEL);
> +	if (ret < 0)
> +		return ret;
> +
> +	hv_domain->device_domain.partition_id = HV_PARTITION_ID_SELF;
> +	hv_domain->device_domain.domain_id.type = domain_stage;
> +	hv_domain->device_domain.domain_id.id = ret;
> +	hv_domain->hv_iommu = hv_iommu_device;
> +
> +	local_irq_save(flags);
> +
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +	input->device_domain = hv_domain->device_domain;
> +	input->create_device_domain_flags.forward_progress_required = 1;
> +	input->create_device_domain_flags.inherit_owning_vtl = 0;
> +	status = hv_do_hypercall(HVCALL_CREATE_DEVICE_DOMAIN, input, NULL);
> +
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status)) {
> +		pr_err("%s: hypercall failed, status %lld\n", __func__, status);
> +		ida_free(&hv_iommu_device->domain_ids, hv_domain->device_domain.domain_id.id);
> +	}
> +
> +	return hv_result_to_errno(status);
> +}
> +
> +static void hv_delete_device_domain(struct hv_iommu_domain *hv_domain)
> +{
> +	u64 status;
> +	unsigned long flags;
> +	struct hv_input_delete_device_domain *input;
> +
> +	local_irq_save(flags);
> +
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +	input->device_domain = hv_domain->device_domain;
> +	status = hv_do_hypercall(HVCALL_DELETE_DEVICE_DOMAIN, input, NULL);
> +
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status))
> +		pr_err("%s: hypercall failed, status %lld\n", __func__, status);
> +
> +	ida_free(&hv_domain->hv_iommu->domain_ids, hv_domain->device_domain.domain_id.id);
> +}
> +
> +static bool hv_iommu_capable(struct device *dev, enum iommu_cap cap)
> +{
> +	switch (cap) {
> +	case IOMMU_CAP_CACHE_COHERENCY:
> +		return true;
> +	case IOMMU_CAP_DEFERRED_FLUSH:
> +		return true;
> +	default:
> +		return false;
> +	}
> +}
> +
> +static int hv_iommu_attach_dev(struct iommu_domain *domain, struct device *dev)
> +{
> +	u64 status;
> +	unsigned long flags;
> +	struct pci_dev *pdev;
> +	struct hv_input_attach_device_domain *input;
> +	struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
> +	struct hv_iommu_domain *hv_domain = to_hv_iommu_domain(domain);
> +
> +	/* Only allow PCI devices for now */
> +	if (!dev_is_pci(dev))
> +		return -EINVAL;
> +
> +	if (vdev->hv_domain == hv_domain)
> +		return 0;
> +
> +	if (vdev->hv_domain)
> +		hv_iommu_detach_dev(&vdev->hv_domain->domain, dev);
> +
> +	pdev = to_pci_dev(dev);
> +	dev_dbg(dev, "Attaching (%strusted) to %d\n", pdev->untrusted ? "un" : "",
> +		hv_domain->device_domain.domain_id.id);
> +
> +	local_irq_save(flags);
> +
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +	input->device_domain = hv_domain->device_domain;
> +	input->device_id.as_uint64 = hv_build_logical_dev_id(pdev);
> +	status = hv_do_hypercall(HVCALL_ATTACH_DEVICE_DOMAIN, input, NULL);
> +
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status)) {
> +		pr_err("%s: hypercall failed, status %lld\n", __func__, status);
> +	} else {
> +		vdev->hv_domain = hv_domain;
> +		spin_lock_irqsave(&hv_domain->lock, flags);
> +		list_add(&vdev->list, &hv_domain->dev_list);
> +		spin_unlock_irqrestore(&hv_domain->lock, flags);
> +	}
> +
> +	return hv_result_to_errno(status);
> +}
> +
> +static void hv_iommu_detach_dev(struct iommu_domain *domain, struct device *dev)
> +{
> +	u64 status;
> +	unsigned long flags;
> +	struct hv_input_detach_device_domain *input;
> +	struct pci_dev *pdev;
> +	struct hv_iommu_domain *hv_domain = to_hv_iommu_domain(domain);
> +	struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
> +
> +	/* See the attach function, only PCI devices for now */
> +	if (!dev_is_pci(dev) || vdev->hv_domain != hv_domain)
> +		return;
> +
> +	pdev = to_pci_dev(dev);
> +
> +	dev_dbg(dev, "Detaching from %d\n", hv_domain->device_domain.domain_id.id);
> +
> +	local_irq_save(flags);
> +
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +	input->partition_id = HV_PARTITION_ID_SELF;
> +	input->device_id.as_uint64 = hv_build_logical_dev_id(pdev);
> +	status = hv_do_hypercall(HVCALL_DETACH_DEVICE_DOMAIN, input, NULL);
> +
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status))
> +		pr_err("%s: hypercall failed, status %lld\n", __func__, status);
> +
> +	spin_lock_irqsave(&hv_domain->lock, flags);
> +	hv_flush_device_domain(hv_domain);
> +	list_del(&vdev->list);
> +	spin_unlock_irqrestore(&hv_domain->lock, flags);
> +
> +	vdev->hv_domain = NULL;
> +}
> +
> +static int hv_iommu_get_logical_device_property(struct device *dev,
> +					enum hv_logical_device_property_code code,
> +					struct hv_output_get_logical_device_property *property)
> +{
> +	u64 status;
> +	unsigned long flags;
> +	struct hv_input_get_logical_device_property *input;
> +	struct hv_output_get_logical_device_property *output;
> +
> +	local_irq_save(flags);
> +
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	output = *this_cpu_ptr(hyperv_pcpu_output_arg);
> +	memset(input, 0, sizeof(*input));
> +	memset(output, 0, sizeof(*output));

General practice is to *not* zero the output area prior to a hypercall. The hypervisor
should be correctly setting all the output bits. There are a couple of cases in the new
MSHV code where the output is zero'ed, but I'm planning to submit a patch to
remove those so that hypercall call sites that have output are consistent across the
code base. Of course, it's possible to have a Hyper-V bug where it doesn't do the
right thing, and zero'ing the output could be done as a workaround. But such cases
should be explicitly known with code comments indicating the reason for the
zero'ing.

Same applies in hv_iommu_detect().

> +	input->partition_id = HV_PARTITION_ID_SELF;
> +	input->logical_device_id = hv_build_logical_dev_id(to_pci_dev(dev));
> +	input->code = code;
> +	status = hv_do_hypercall(HVCALL_GET_LOGICAL_DEVICE_PROPERTY, input, output);
> +	*property = *output;
> +
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status))
> +		pr_err("%s: hypercall failed, status %lld\n", __func__, status);
> +
> +	return hv_result_to_errno(status);
> +}
> +
> +static struct iommu_device *hv_iommu_probe_device(struct device *dev)
> +{
> +	struct pci_dev *pdev;
> +	struct hv_iommu_endpoint *vdev;
> +	struct hv_output_get_logical_device_property device_iommu_property = {0};
> +
> +	if (!dev_is_pci(dev))
> +		return ERR_PTR(-ENODEV);
> +
> +	if (hv_iommu_get_logical_device_property(dev,
> +						 HV_LOGICAL_DEVICE_PROPERTY_PVIOMMU,
> +						 &device_iommu_property) ||
> +	    !(device_iommu_property.device_iommu & HV_DEVICE_IOMMU_ENABLED))
> +		return ERR_PTR(-ENODEV);
> +
> +	pdev = to_pci_dev(dev);
> +	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
> +	if (!vdev)
> +		return ERR_PTR(-ENOMEM);
> +
> +	vdev->dev = dev;
> +	vdev->hv_iommu = hv_iommu_device;
> +	dev_iommu_priv_set(dev, vdev);
> +
> +	if (hv_iommu_ats_supported(hv_iommu_device->cap) &&
> +	    pci_ats_supported(pdev))
> +		pci_enable_ats(pdev, __ffs(hv_iommu_device->pgsize_bitmap));
> +
> +	return &vdev->hv_iommu->iommu;
> +}
> +
> +static void hv_iommu_release_device(struct device *dev)
> +{
> +	struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
> +
> +	if (vdev->hv_domain)
> +		hv_iommu_detach_dev(&vdev->hv_domain->domain, dev);
> +
> +	dev_iommu_priv_set(dev, NULL);
> +	set_dma_ops(dev, NULL);
> +
> +	kfree(vdev);
> +}
> +
> +static struct iommu_group *hv_iommu_device_group(struct device *dev)
> +{
> +	if (dev_is_pci(dev))
> +		return pci_device_group(dev);
> +	else
> +		return generic_device_group(dev);
> +}
> +
> +static int hv_configure_device_domain(struct hv_iommu_domain *hv_domain, u32 domain_type)
> +{
> +	u64 status;
> +	unsigned long flags;
> +	struct pt_iommu_x86_64_hw_info pt_info;
> +	struct hv_input_configure_device_domain *input;
> +
> +	local_irq_save(flags);
> +
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +	input->device_domain = hv_domain->device_domain;
> +	input->settings.flags.blocked = (domain_type == IOMMU_DOMAIN_BLOCKED);
> +	input->settings.flags.translation_enabled = (domain_type != IOMMU_DOMAIN_IDENTITY);
> +
> +	if (domain_type & __IOMMU_DOMAIN_PAGING) {
> +		pt_iommu_x86_64_hw_info(&hv_domain->pt_iommu_x86_64, &pt_info);
> +		input->settings.page_table_root = pt_info.gcr3_pt;
> +		input->settings.flags.first_stage_paging_mode =
> +			pt_info.levels == 5;
> +	}
> +	status = hv_do_hypercall(HVCALL_CONFIGURE_DEVICE_DOMAIN, input, NULL);
> +
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status))
> +		pr_err("%s: hypercall failed, status %lld\n", __func__, status);
> +
> +	return hv_result_to_errno(status);
> +}
> +
> +static int __init hv_initialize_static_domains(void)
> +{
> +	int ret;
> +	struct hv_iommu_domain *hv_domain;
> +
> +	/* Default stage-1 identity domain */
> +	hv_domain = &hv_identity_domain;
> +	memset(hv_domain, 0, sizeof(*hv_domain));

The memset() isn't necessary. hv_identity_domain is a static variable, so it is already
initialized to zero.
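
A small user-space sketch of the point (hypothetical demo_* names, not the
driver's types): C gives objects with static storage duration zero-valued
members before any code runs, so a one-time init function can assign fields
directly. This only holds for the first initialization; a function that
could be re-run on an already used object would still need to clear it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a statically allocated domain structure. */
struct demo_static_domain {
	unsigned int id;
	const void *ops;
};

/* Static storage duration: members start out as 0 / NULL without a memset(). */
static struct demo_static_domain demo_identity_domain;

/* One-time setup that relies on the implicit zero initialization. */
static void demo_init_domain(struct demo_static_domain *d, unsigned int id)
{
	assert(d != NULL);
	/* No memset() here: untouched fields keep their initial zero value. */
	d->id = id;
}
```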

> +
> +	ret = hv_create_device_domain(hv_domain, HV_DEVICE_DOMAIN_TYPE_S1);
> +	if (ret)
> +		return ret;
> +
> +	ret = hv_configure_device_domain(hv_domain, IOMMU_DOMAIN_IDENTITY);
> +	if (ret)
> +		goto delete_identity_domain;
> +
> +	hv_domain->domain.type = IOMMU_DOMAIN_IDENTITY;
> +	hv_domain->domain.ops = &hv_iommu_identity_domain_ops;
> +	hv_domain->domain.owner = &hv_iommu_ops;
> +	hv_domain->domain.geometry = hv_iommu_device->geometry;
> +	hv_domain->domain.pgsize_bitmap = hv_iommu_device->pgsize_bitmap;
> +	INIT_LIST_HEAD(&hv_domain->dev_list);
> +
> +	/* Default stage-1 blocked domain */
> +	hv_domain = &hv_blocking_domain;
> +	memset(hv_domain, 0, sizeof(*hv_domain));

Same here.

> +
> +	ret = hv_create_device_domain(hv_domain, HV_DEVICE_DOMAIN_TYPE_S1);
> +	if (ret)
> +		goto delete_identity_domain;
> +
> +	ret = hv_configure_device_domain(hv_domain, IOMMU_DOMAIN_BLOCKED);
> +	if (ret)
> +		goto delete_blocked_domain;
> +
> +	hv_domain->domain.type = IOMMU_DOMAIN_BLOCKED;
> +	hv_domain->domain.ops = &hv_iommu_blocking_domain_ops;
> +	hv_domain->domain.owner = &hv_iommu_ops;
> +	hv_domain->domain.geometry = hv_iommu_device->geometry;
> +	hv_domain->domain.pgsize_bitmap = hv_iommu_device->pgsize_bitmap;
> +	INIT_LIST_HEAD(&hv_domain->dev_list);
> +
> +	return 0;
> +
> +delete_blocked_domain:
> +	hv_delete_device_domain(&hv_blocking_domain);
> +delete_identity_domain:
> +	hv_delete_device_domain(&hv_identity_domain);
> +	return ret;
> +}
> +
> +#define INTERRUPT_RANGE_START	(0xfee00000)
> +#define INTERRUPT_RANGE_END	(0xfeefffff)
> +static void hv_iommu_get_resv_regions(struct device *dev,
> +		struct list_head *head)
> +{
> +	struct iommu_resv_region *region;
> +
> +	region = iommu_alloc_resv_region(INTERRUPT_RANGE_START,
> +				      INTERRUPT_RANGE_END - INTERRUPT_RANGE_START + 1,
> +				      0, IOMMU_RESV_MSI, GFP_KERNEL);
> +	if (!region)
> +		return;
> +
> +	list_add_tail(&region->list, head);
> +}
> +
> +static void hv_flush_device_domain(struct hv_iommu_domain *hv_domain)
> +{
> +	u64 status;
> +	unsigned long flags;
> +	struct hv_input_flush_device_domain *input;
> +
> +	local_irq_save(flags);
> +
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +	input->device_domain.partition_id = hv_domain->device_domain.partition_id;
> +	input->device_domain.owner_vtl = hv_domain->device_domain.owner_vtl;
> +	input->device_domain.domain_id.type = hv_domain->device_domain.domain_id.type;
> +	input->device_domain.domain_id.id = hv_domain->device_domain.domain_id.id;
> +	status = hv_do_hypercall(HVCALL_FLUSH_DEVICE_DOMAIN, input, NULL);
> +
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status))
> +		pr_err("%s: hypercall failed, status %lld\n", __func__, status);
> +}
> +
> +static void hv_iommu_flush_iotlb_all(struct iommu_domain *domain)
> +{
> +	hv_flush_device_domain(to_hv_iommu_domain(domain));
> +}
> +
> +static void hv_iommu_iotlb_sync(struct iommu_domain *domain,
> +				struct iommu_iotlb_gather *iotlb_gather)
> +{
> +	hv_flush_device_domain(to_hv_iommu_domain(domain));
> +
> +	iommu_put_pages_list(&iotlb_gather->freelist);
> +}
> +
> +static void hv_iommu_paging_domain_free(struct iommu_domain *domain)
> +{
> +	struct hv_iommu_domain *hv_domain = to_hv_iommu_domain(domain);
> +
> +	/* Free all remaining mappings */
> +	pt_iommu_deinit(&hv_domain->pt_iommu);
> +
> +	hv_delete_device_domain(hv_domain);
> +
> +	kfree(hv_domain);
> +}
> +
> +static const struct iommu_domain_ops hv_iommu_identity_domain_ops = {
> +	.attach_dev	= hv_iommu_attach_dev,
> +};
> +
> +static const struct iommu_domain_ops hv_iommu_blocking_domain_ops = {
> +	.attach_dev	= hv_iommu_attach_dev,
> +};
> +
> +static const struct iommu_domain_ops hv_iommu_paging_domain_ops = {
> +	.attach_dev	= hv_iommu_attach_dev,
> +	IOMMU_PT_DOMAIN_OPS(x86_64),
> +	.flush_iotlb_all = hv_iommu_flush_iotlb_all,
> +	.iotlb_sync = hv_iommu_iotlb_sync,
> +	.free = hv_iommu_paging_domain_free,
> +};
> +
> +static struct iommu_domain *hv_iommu_domain_alloc_paging(struct device *dev)
> +{
> +	int ret;
> +	struct hv_iommu_domain *hv_domain;
> +	struct pt_iommu_x86_64_cfg cfg = {};
> +
> +	hv_domain = kzalloc(sizeof(*hv_domain), GFP_KERNEL);
> +	if (!hv_domain)
> +		return ERR_PTR(-ENOMEM);
> +
> +	ret = hv_create_device_domain(hv_domain, HV_DEVICE_DOMAIN_TYPE_S1);
> +	if (ret) {
> +		kfree(hv_domain);
> +		return ERR_PTR(ret);
> +	}
> +
> +	hv_domain->domain.pgsize_bitmap = hv_iommu_device->pgsize_bitmap;
> +	hv_domain->domain.geometry = hv_iommu_device->geometry;
> +	hv_domain->pt_iommu.nid = dev_to_node(dev);
> +	INIT_LIST_HEAD(&hv_domain->dev_list);
> +	spin_lock_init(&hv_domain->lock);
> +
> +	cfg.common.hw_max_vasz_lg2 = hv_iommu_device->max_iova_width;
> +	cfg.common.hw_max_oasz_lg2 = 52;

FYI, when this code is rebased onto the latest linux-next, cfg.top_level will need to be set as well.

> +
> +	ret = pt_iommu_x86_64_init(&hv_domain->pt_iommu_x86_64, &cfg, GFP_KERNEL);
> +	if (ret) {
> +		hv_delete_device_domain(hv_domain);
> +		return ERR_PTR(ret);
> +	}
> +
> +	hv_domain->domain.ops = &hv_iommu_paging_domain_ops;
> +
> +	ret = hv_configure_device_domain(hv_domain, __IOMMU_DOMAIN_PAGING);
> +	if (ret) {
> +		pt_iommu_deinit(&hv_domain->pt_iommu);
> +		hv_delete_device_domain(hv_domain);
> +		return ERR_PTR(ret);
> +	}
> +
> +	return &hv_domain->domain;
> +}
> +
> +static struct iommu_ops hv_iommu_ops = {
> +	.capable		  = hv_iommu_capable,
> +	.domain_alloc_paging	  = hv_iommu_domain_alloc_paging,
> +	.probe_device		  = hv_iommu_probe_device,
> +	.release_device		  = hv_iommu_release_device,
> +	.device_group		  = hv_iommu_device_group,
> +	.get_resv_regions	  = hv_iommu_get_resv_regions,
> +	.owner			  = THIS_MODULE,
> +	.identity_domain	  = &hv_identity_domain.domain,
> +	.blocked_domain		  = &hv_blocking_domain.domain,
> +	.release_domain		  = &hv_blocking_domain.domain,
> +};
> +
> +static void hv_iommu_shutdown(void)
> +{
> +	iommu_device_sysfs_remove(&hv_iommu_device->iommu);
> +
> +	kfree(hv_iommu_device);
> +}
> +
> +static struct syscore_ops hv_iommu_syscore_ops = {
> +	.shutdown = hv_iommu_shutdown,
> +};

Why is a shutdown needed at all?  hv_iommu_shutdown() doesn't do anything
that is really needed, since sysfs entries are transient, and freeing memory isn't
relevant for a shutdown.

> +
> +static int hv_iommu_detect(struct hv_output_get_iommu_capabilities *hv_iommu_cap)
> +{
> +	u64 status;
> +	unsigned long flags;
> +	struct hv_input_get_iommu_capabilities *input;
> +	struct hv_output_get_iommu_capabilities *output;
> +
> +	local_irq_save(flags);
> +
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	output = *this_cpu_ptr(hyperv_pcpu_output_arg);
> +	memset(input, 0, sizeof(*input));
> +	memset(output, 0, sizeof(*output));
> +	input->partition_id = HV_PARTITION_ID_SELF;
> +	status = hv_do_hypercall(HVCALL_GET_IOMMU_CAPABILITIES, input, output);
> +	*hv_iommu_cap = *output;
> +
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status))
> +		pr_err("%s: hypercall failed, status %lld\n", __func__, status);
> +
> +	return hv_result_to_errno(status);
> +}
> +
> +static void __init hv_init_iommu_device(struct hv_iommu_dev *hv_iommu,
> +			struct hv_output_get_iommu_capabilities *hv_iommu_cap)
> +{
> +	ida_init(&hv_iommu->domain_ids);
> +
> +	hv_iommu->cap = hv_iommu_cap->iommu_cap;
> +	hv_iommu->max_iova_width = hv_iommu_cap->max_iova_width;
> +	if (!hv_iommu_5lvl_supported(hv_iommu->cap) &&
> +	    hv_iommu->max_iova_width > 48) {
> +		pr_err("5-level paging not supported, limiting iova width to 48.\n");
> +		hv_iommu->max_iova_width = 48;
> +	}
> +
> +	hv_iommu->geometry = (struct iommu_domain_geometry) {
> +		.aperture_start = 0,
> +		.aperture_end = (((u64)1) << hv_iommu_cap->max_iova_width) - 1,
> +		.force_aperture = true,
> +	};
> +
> +	hv_iommu->first_domain = HV_DEVICE_DOMAIN_ID_DEFAULT + 1;
> +	hv_iommu->last_domain = HV_DEVICE_DOMAIN_ID_NULL - 1;
> +	hv_iommu->pgsize_bitmap = hv_iommu_cap->pgsize_bitmap;
> +	hv_iommu_device = hv_iommu;
> +}
> +
> +static int __init hv_iommu_init(void)
> +{
> +	int ret = 0;
> +	struct hv_iommu_dev *hv_iommu = NULL;
> +	struct hv_output_get_iommu_capabilities hv_iommu_cap = {0};
> +
> +	if (no_iommu || iommu_detected)
> +		return -ENODEV;
> +
> +	if (!hv_is_hyperv_initialized())
> +		return -ENODEV;
> +
> +	if (hv_iommu_detect(&hv_iommu_cap) ||
> +	    !hv_iommu_present(hv_iommu_cap.iommu_cap) ||
> +	    !hv_iommu_s1_domain_supported(hv_iommu_cap.iommu_cap))
> +		return -ENODEV;
> +
> +	iommu_detected = 1;
> +	pci_request_acs();
> +
> +	hv_iommu = kzalloc(sizeof(*hv_iommu), GFP_KERNEL);
> +	if (!hv_iommu)
> +		return -ENOMEM;
> +
> +	hv_init_iommu_device(hv_iommu, &hv_iommu_cap);
> +
> +	ret = hv_initialize_static_domains();
> +	if (ret) {
> +		pr_err("hv_initialize_static_domains failed: %d\n", ret);
> +		goto err_sysfs_remove;
> +	}
> +
> +	ret = iommu_device_sysfs_add(&hv_iommu->iommu, NULL, NULL, "%s", "hv-iommu");
> +	if (ret) {
> +		pr_err("iommu_device_sysfs_add failed: %d\n", ret);
> +		goto err_free;
> +	}
> +

Extra blank line.

> +
> +	ret = iommu_device_register(&hv_iommu->iommu, &hv_iommu_ops, NULL);
> +	if (ret) {
> +		pr_err("iommu_device_register failed: %d\n", ret);
> +		goto err_sysfs_remove;
> +	}
> +
> +	register_syscore_ops(&hv_iommu_syscore_ops);

Per above, not sure why this is needed.

> +
> +	pr_info("Microsoft Hypervisor IOMMU initialized\n");

Could this be changed to fit the "standardized" messages that are output
about Hyper-V specific code? They all start with "Hyper-V: ", such as these:

[    0.000000] Hyper-V: privilege flags low 0xae7f, high 0x3b8030, ext 0x62, hints 0xa0e24, misc 0xe0bed7b2
[    0.000000] Hyper-V: Nested features: 0x0
[    0.000000] Hyper-V: LAPIC Timer Frequency: 0xc3500
[    0.000000] Hyper-V: Using hypercall for remote TLB flush
[    0.019223] Hyper-V: PV spinlocks enabled
[    0.052575] Hyper-V: Hypervisor Build 10.0.26100.7462-7-0
[    0.052577] Hyper-V: enabling crash_kexec_post_notifiers
[    0.052633] Hyper-V: Using IPI hypercalls

Maybe "Hyper-V: PV IOMMU initialized"?
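
One way to get a consistent prefix is the kernel's pr_fmt() convention, where
the prefix is defined once (before the printk headers are pulled in) and every
pr_info()/pr_err() in the file inherits it. Below is a minimal userspace sketch
of that idea, not code from this patch set: DEMO_PREFIX, demo_format_init_msg()
and demo_has_prefix() are hypothetical names, and snprintf() stands in for the
kernel log.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Userspace stand-in for the kernel's pr_fmt() prefix convention. */
#define DEMO_PREFIX "Hyper-V: "
#define pr_fmt(fmt) DEMO_PREFIX fmt

/* Hypothetical helper: format the message into buf instead of logging it. */
int demo_format_init_msg(char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, pr_fmt("PV IOMMU initialized\n"));
}

/* Check that a message carries the shared prefix. */
int demo_has_prefix(const char *msg)
{
	return strncmp(msg, DEMO_PREFIX, strlen(DEMO_PREFIX)) == 0;
}
```

With the prefix in one place, individual messages only carry their own text
and cannot drift from the "Hyper-V: " convention.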

> +	return 0;
> +
> +err_sysfs_remove:
> +	iommu_device_sysfs_remove(&hv_iommu->iommu);
> +err_free:
> +	kfree(hv_iommu);
> +	return ret;
> +}
> +
> +device_initcall(hv_iommu_init);

I'm concerned about the timing of this initialization. VMBus is initialized with
subsys_initcall(), which is initcall level 4 while device_initcall() is initcall level 6.
So VMBus initialization happens quite a bit earlier, and the hypervisor starts
offering devices to the guest, including PCI pass-thru devices, before the
IOMMU initialization starts. I cobbled together a way to make this IOMMU code
run in an Azure VM using the identity domain. The VM has an NVMe OS disk,
two NVMe data disks, and a MANA NIC. The NVMe devices were offered, and
completed hv_pci_probe() before this IOMMU initialization was started. When
IOMMU initialization did run, it went back and found the NVMe devices. But
I'm unsure if that's OK because my hacked-together environment obviously
couldn't do real IOMMU mapping. It appears that the NVMe device driver
didn't start its initialization until after the IOMMU driver was set up, which
would probably make everything OK. But that might be just timing luck, or
maybe there's something that affirmatively prevents the native PCI driver
(like NVMe) from getting started until after all the initcalls have finished.
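
To make the ordering concrete, here is a small userspace simulation of initcall
levels. It is only a sketch: the level numbers 4 (subsys) and 6 (device) are the
ones mentioned above, while the table, the "vmbus" label, and
demo_run_initcalls() are hypothetical stand-ins for the real initcall machinery.

```c
#include <assert.h>
#include <stddef.h>

/* Level numbers as used by the initcall macros in include/linux/init.h. */
enum { DEMO_LEVEL_SUBSYS = 4, DEMO_LEVEL_DEVICE = 6, DEMO_LEVEL_MAX = 7 };

struct demo_initcall {
	int level;
	const char *name;
};

/* Listed in the "wrong" order on purpose: the level decides, not the position. */
static const struct demo_initcall demo_calls[] = {
	{ DEMO_LEVEL_DEVICE, "hv_iommu_init" },	/* device_initcall() */
	{ DEMO_LEVEL_SUBSYS, "vmbus" },		/* subsys_initcall() */
};

/* Walk the levels in ascending order and record which call runs when. */
size_t demo_run_initcalls(const char *order[], size_t max)
{
	size_t count = sizeof(demo_calls) / sizeof(demo_calls[0]);
	size_t n = 0;

	assert(order != NULL);
	for (int level = 0; level <= DEMO_LEVEL_MAX; level++) {
		for (size_t i = 0; i < count; i++) {
			if (demo_calls[i].level == level && n < max)
				order[n++] = demo_calls[i].name;
		}
	}
	return n;
}
```

In this model the VMBus entry always runs first, which is the window in which
devices can be offered and probed before the IOMMU entry has run.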

I'm planning to look at this further to see if there's a way for a PCI driver
to try initializing a pass-thru device *before* this IOMMU driver has initialized.
If so, a different way to do the IOMMU initialization will be needed that is
linked to VMBus initialization so things can't happen out-of-order. Establishing
such a linkage is probably a good idea regardless.

FWIW, the Azure VM with the 3 NVMe devices and MANA, and operating with
the identity IOMMU domain, all seemed to work fine! Got 4 IOMMU groups,
and devices coming and going dynamically all worked correctly. When a device
was removed, it was moved to the blocking domain, and then flushed before
being finally removed. All good! I wish I had a way to test with an IOMMU
paging domain that was doing real translation.

> diff --git a/drivers/iommu/hyperv/iommu.h b/drivers/iommu/hyperv/iommu.h
> new file mode 100644
> index 000000000000..c8657e791a6e
> --- /dev/null
> +++ b/drivers/iommu/hyperv/iommu.h
> @@ -0,0 +1,53 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +/*
> + * Hyper-V IOMMU driver.
> + *
> + * Copyright (C) 2024-2025, Microsoft, Inc.
> + *
> + */
> +
> +#ifndef _HYPERV_IOMMU_H
> +#define _HYPERV_IOMMU_H
> +
> +struct hv_iommu_dev {
> +	struct iommu_device iommu;
> +	struct ida domain_ids;
> +
> +	/* Device configuration */
> +	u8  max_iova_width;
> +	u8  max_pasid_width;
> +	u64 cap;
> +	u64 pgsize_bitmap;
> +
> +	struct iommu_domain_geometry geometry;
> +	u64 first_domain;
> +	u64 last_domain;
> +};
> +
> +struct hv_iommu_domain {
> +	union {
> +		struct iommu_domain    domain;
> +		struct pt_iommu        pt_iommu;
> +		struct pt_iommu_x86_64 pt_iommu_x86_64;
> +	};
> +	struct hv_iommu_dev *hv_iommu;
> +	struct hv_input_device_domain device_domain;
> +	u64		pgsize_bitmap;
> +
> +	spinlock_t lock; /* protects dev_list and TLB flushes */
> +	/* List of devices in this DMA domain */

It appears that this list is really a list of endpoints (i.e., struct
hv_iommu_endpoint), not devices (which I read to be struct
hv_iommu_dev). 

But that said, what is the list used for?  I see code to add
endpoints to the list, and to remove them, but the list is never
walked by any code in this patch set. If there is an anticipated
future use, it would be better to add the list as part of the code
for that future use.

> +	struct list_head dev_list;
> +};
> +
> +struct hv_iommu_endpoint {
> +	struct device *dev;
> +	struct hv_iommu_dev *hv_iommu;
> +	struct hv_iommu_domain *hv_domain;
> +	struct list_head list; /* For domain->dev_list */
> +};
> +
> +#define to_hv_iommu_domain(d) \
> +	container_of(d, struct hv_iommu_domain, domain)
> +
> +#endif /* _HYPERV_IOMMU_H */
> --
> 2.49.0



* RE: [RFC v1 4/5] hyperv: allow hypercall output pages to be allocated for child partitions
From: Michael Kelley @ 2026-01-08 18:47 UTC (permalink / raw)
  To: Yu Zhang, linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org, iommu@lists.linux.dev,
	linux-pci@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, lpieralisi@kernel.org,
	kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org,
	bhelgaas@google.com, arnd@arndb.de, joro@8bytes.org,
	will@kernel.org, robin.murphy@arm.com,
	easwar.hariharan@linux.microsoft.com,
	jacob.pan@linux.microsoft.com, nunodasneves@linux.microsoft.com,
	mrathor@linux.microsoft.com, peterz@infradead.org,
	linux-arch@vger.kernel.org
In-Reply-To: <20251209051128.76913-5-zhangyu1@linux.microsoft.com>

From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, December 8, 2025 9:11 PM
> 

The "Subject:" line prefix for this patch should probably be "Drivers: hv:"
to be consistent with most other changes to this source code file.

> Previously, the allocation of per-CPU output argument pages was restricted
> to root partitions or those operating in VTL mode.
> 
> Remove this restriction to support guest IOMMU related hypercalls, which
> require valid output pages to function correctly.

The thinking here isn't quite correct. Just because a hypercall produces output
doesn't mean that Linux needs to allocate a page for the output that is separate
from the input. It's perfectly OK to use the same page for both input and output,
as long as the two areas don't overlap. Yes, the page is called
"hyperv_pcpu_input_arg", but that's a historical artifact from before the time
it was realized that the same page can be used for both input and output.

Of course, if there's ever a hypercall that needs lots of input and lots of output
such that the combined size doesn't fit in a single page, then separate input
and output pages will be needed. But I'm skeptical that will ever happen. Rep
hypercalls could have large amounts of input and/or output, but I'd venture
that the rep count can always be managed so everything fits in a single page.
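
As a rough illustration of the single-page approach, the userspace sketch below
places the output area immediately after the input area in one page. Everything
here is an assumption made for illustration: demo_page stands in for the page
behind hyperv_pcpu_input_arg, demo_hypercall() is a mock rather than
hv_do_hypercall(), and the structure layouts are simplified copies.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096

/* Hypothetical stand-ins for a hypercall's input and output layouts. */
struct demo_input {
	uint64_t partition_id;
	uint64_t reserved;
};

struct demo_output {
	uint32_t size;
	uint16_t reserved;
	uint8_t  max_iova_width;
	uint8_t  max_pasid_width;
	uint64_t iommu_cap;
	uint64_t pgsize_bitmap;
};

/* Stand-in for the page that hyperv_pcpu_input_arg points to. */
static _Alignas(DEMO_PAGE_SIZE) unsigned char demo_page[DEMO_PAGE_SIZE];

/* Mock hypervisor: reads the input area and fills the output area. */
static uint64_t demo_hypercall(const struct demo_input *in,
			       struct demo_output *out)
{
	out->size = sizeof(*out);
	out->max_iova_width = in->partition_id ? 48 : 0;
	return 0;	/* success */
}

/* Carve the output area out of the same page, right after the input. */
uint64_t demo_get_caps(struct demo_output *caps)
{
	struct demo_input *input = (struct demo_input *)demo_page;
	struct demo_output *output = (struct demo_output *)(input + 1);
	uint64_t status;

	/* The two areas must fit in one page without overlapping. */
	assert(sizeof(*input) + sizeof(*output) <= DEMO_PAGE_SIZE);

	memset(input, 0, sizeof(*input));
	memset(output, 0, sizeof(*output));
	input->partition_id = UINT64_MAX;	/* placeholder for "self" */
	status = demo_hypercall(input, output);
	*caps = *output;
	return status;
}
```

The only difference from a two-page version is how the output pointer is
derived, which is why the change at each call site can be so small.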

> 
> While unconditionally allocating per-CPU output pages scales with the number
> of vCPUs, and potentially adding overhead for guests that may not utilize the
> IOMMU, this change anticipates that future hypercalls from child partitions
> may also require these output pages.

I've heard the argument that the amount of overhead is modest relative to the
overall amount of memory that is typically in a VM, particularly VMs with high
vCPU counts. And I don't disagree. But on the flip side, why tie up memory when
there's no need to do so? I'd argue for dropping this patch, and changing the
two hypercall call sites in Patch 5 to just use part of the so-called hypercall input
page for the output as well. It's only a one-line change in each hypercall call site.

If folks really want to always allocate the separate output page, it's not an
issue that I'll continue to fight. But at least give a valid reason "why" in the
commit message.

Michael

> 
> Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com>
> ---
>  drivers/hv/hv_common.c | 21 ++++++---------------
>  1 file changed, 6 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
> index e109a620c83f..034fb2592884 100644
> --- a/drivers/hv/hv_common.c
> +++ b/drivers/hv/hv_common.c
> @@ -255,11 +255,6 @@ static void hv_kmsg_dump_register(void)
>  	}
>  }
> 
> -static inline bool hv_output_page_exists(void)
> -{
> -	return hv_parent_partition() || IS_ENABLED(CONFIG_HYPERV_VTL_MODE);
> -}
> -
>  void __init hv_get_partition_id(void)
>  {
>  	struct hv_output_get_partition_id *output;
> @@ -371,11 +366,9 @@ int __init hv_common_init(void)
>  	hyperv_pcpu_input_arg = alloc_percpu(void  *);
>  	BUG_ON(!hyperv_pcpu_input_arg);
> 
> -	/* Allocate the per-CPU state for output arg for root */
> -	if (hv_output_page_exists()) {
> -		hyperv_pcpu_output_arg = alloc_percpu(void *);
> -		BUG_ON(!hyperv_pcpu_output_arg);
> -	}
> +	/* Allocate the per-CPU state for output arg*/
> +	hyperv_pcpu_output_arg = alloc_percpu(void *);
> +	BUG_ON(!hyperv_pcpu_output_arg);
> 
>  	if (hv_parent_partition()) {
>  		hv_synic_eventring_tail = alloc_percpu(u8 *);
> @@ -473,7 +466,7 @@ int hv_common_cpu_init(unsigned int cpu)
>  	u8 **synic_eventring_tail;
>  	u64 msr_vp_index;
>  	gfp_t flags;
> -	const int pgcount = hv_output_page_exists() ? 2 : 1;
> +	const int pgcount = 2;
>  	void *mem;
>  	int ret = 0;
> 
> @@ -491,10 +484,8 @@ int hv_common_cpu_init(unsigned int cpu)
>  		if (!mem)
>  			return -ENOMEM;
> 
> -		if (hv_output_page_exists()) {
> -			outputarg = (void **)this_cpu_ptr(hyperv_pcpu_output_arg);
> -			*outputarg = (char *)mem + HV_HYP_PAGE_SIZE;
> -		}
> +		outputarg = (void **)this_cpu_ptr(hyperv_pcpu_output_arg);
> +		*outputarg = (char *)mem + HV_HYP_PAGE_SIZE;
> 
>  		if (!ms_hyperv.paravisor_present &&
>  		    (hv_isolation_type_snp() || hv_isolation_type_tdx())) {
> --
> 2.49.0


* RE: [RFC v1 3/5] hyperv: Introduce new hypercall interfaces used by Hyper-V guest IOMMU
From: Michael Kelley @ 2026-01-08 18:47 UTC (permalink / raw)
  To: Yu Zhang, linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org, iommu@lists.linux.dev,
	linux-pci@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, lpieralisi@kernel.org,
	kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org,
	bhelgaas@google.com, arnd@arndb.de, joro@8bytes.org,
	will@kernel.org, robin.murphy@arm.com,
	easwar.hariharan@linux.microsoft.com,
	jacob.pan@linux.microsoft.com, nunodasneves@linux.microsoft.com,
	mrathor@linux.microsoft.com, peterz@infradead.org,
	linux-arch@vger.kernel.org
In-Reply-To: <20251209051128.76913-4-zhangyu1@linux.microsoft.com>

From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, December 8, 2025 9:11 PM
> 
> From: Wei Liu <wei.liu@kernel.org>
> 
> Hyper-V guest IOMMU is a para-virtualized IOMMU based on hypercalls.
> Introduce the hypercalls used by the child partition to interact with
> this facility.
> 
> These hypercalls fall into below categories:
> - Detection and capability: HVCALL_GET_IOMMU_CAPABILITIES is used to
>   detect the existence and capabilities of the guest IOMMU.
> 
> - Device management: HVCALL_GET_LOGICAL_DEVICE_PROPERTY is used to
>   check whether an endpoint device is managed by the guest IOMMU.
> 
> - Domain management: A set of hypercalls is provided to handle the
>   creation, configuration, and deletion of guest domains, as well as
>   the attachment/detachment of endpoint devices to/from those domains.
> 
> - IOTLB flushing: HVCALL_FLUSH_DEVICE_DOMAIN is used to ask Hyper-V
>   for a domain-selective IOTLB flush(which in its handler may flush

Typo:  Add a space after "IOTLB flush" and before the open parenthesis.

>   the device TLB as well). Page-selective IOTLB flushes will be offered
>   by new hypercalls in future patches.
> 
> Signed-off-by: Wei Liu <wei.liu@kernel.org>
> Co-developed-by: Jacob Pan <jacob.pan@linux.microsoft.com>
> Signed-off-by: Jacob Pan <jacob.pan@linux.microsoft.com>
> Co-developed-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
> Signed-off-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
> Co-developed-by: Yu Zhang <zhangyu1@linux.microsoft.com>
> Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com>
> ---
>  include/hyperv/hvgdk_mini.h |   8 +++
>  include/hyperv/hvhdk_mini.h | 123 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 131 insertions(+)
> 
> diff --git a/include/hyperv/hvgdk_mini.h b/include/hyperv/hvgdk_mini.h
> index 77abddfc750e..e5b302bbfe14 100644
> --- a/include/hyperv/hvgdk_mini.h
> +++ b/include/hyperv/hvgdk_mini.h
> @@ -478,10 +478,16 @@ union hv_vp_assist_msr_contents {	 /*
> HV_REGISTER_VP_ASSIST_PAGE */
>  #define HVCALL_GET_VP_INDEX_FROM_APIC_ID			0x009a
>  #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE	0x00af
>  #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_LIST	0x00b0
> +#define HVCALL_CREATE_DEVICE_DOMAIN			0x00b1
> +#define HVCALL_ATTACH_DEVICE_DOMAIN			0x00b2
>  #define HVCALL_SIGNAL_EVENT_DIRECT			0x00c0
>  #define HVCALL_POST_MESSAGE_DIRECT			0x00c1
>  #define HVCALL_DISPATCH_VP				0x00c2
> +#define HVCALL_DETACH_DEVICE_DOMAIN			0x00c4
> +#define HVCALL_DELETE_DEVICE_DOMAIN			0x00c5
>  #define HVCALL_GET_GPA_PAGES_ACCESS_STATES		0x00c9
> +#define HVCALL_CONFIGURE_DEVICE_DOMAIN			0x00ce
> +#define HVCALL_FLUSH_DEVICE_DOMAIN			0x00d0
>  #define HVCALL_ACQUIRE_SPARSE_SPA_PAGE_HOST_ACCESS	0x00d7
>  #define HVCALL_RELEASE_SPARSE_SPA_PAGE_HOST_ACCESS	0x00d8
>  #define HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY	0x00db
> @@ -492,6 +498,8 @@ union hv_vp_assist_msr_contents {	 /* HV_REGISTER_VP_ASSIST_PAGE */
>  #define HVCALL_GET_VP_CPUID_VALUES			0x00f4
>  #define HVCALL_MMIO_READ				0x0106
>  #define HVCALL_MMIO_WRITE				0x0107
> +#define HVCALL_GET_IOMMU_CAPABILITIES			0x0125
> +#define HVCALL_GET_LOGICAL_DEVICE_PROPERTY		0x0127
> 
>  /* HV_HYPERCALL_INPUT */
>  #define HV_HYPERCALL_RESULT_MASK	GENMASK_ULL(15, 0)
> diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
> index 858f6a3925b3..ba6b91746b13 100644
> --- a/include/hyperv/hvhdk_mini.h
> +++ b/include/hyperv/hvhdk_mini.h
> @@ -400,4 +400,127 @@ union hv_device_id {		/* HV_DEVICE_ID */
>  	} acpi;
>  } __packed;
> 
> +/* Device domain types */
> +#define HV_DEVICE_DOMAIN_TYPE_S1	1 /* Stage 1 domain */
> +
> +/* ID for default domain and NULL domain */
> +#define HV_DEVICE_DOMAIN_ID_DEFAULT 0
> +#define HV_DEVICE_DOMAIN_ID_NULL    0xFFFFFFFFULL
> +
> +union hv_device_domain_id {
> +	u64 as_uint64;
> +	struct {
> +		u32 type: 4;
> +		u32 reserved: 28;
> +		u32 id;
> +	} __packed;
> +};
> +
> +struct hv_input_device_domain {
> +	u64 partition_id;
> +	union hv_input_vtl owner_vtl;
> +	u8 padding[7];
> +	union hv_device_domain_id domain_id;
> +} __packed;
> +
> +union hv_create_device_domain_flags {
> +	u32 as_uint32;
> +	struct {
> +		u32 forward_progress_required: 1;
> +		u32 inherit_owning_vtl: 1;
> +		u32 reserved: 30;
> +	} __packed;
> +};
> +
> +struct hv_input_create_device_domain {
> +	struct hv_input_device_domain device_domain;
> +	union hv_create_device_domain_flags create_device_domain_flags;
> +} __packed;
> +
> +struct hv_input_delete_device_domain {
> +	struct hv_input_device_domain device_domain;
> +} __packed;
> +
> +struct hv_input_attach_device_domain {
> +	struct hv_input_device_domain device_domain;
> +	union hv_device_id device_id;
> +} __packed;
> +
> +struct hv_input_detach_device_domain {
> +	u64 partition_id;
> +	union hv_device_id device_id;
> +} __packed;
> +
> +struct hv_device_domain_settings {
> +	struct {
> +		/*
> +		 * Enable translations. If not enabled, all transaction bypass
> +		 * S1 translations.
> +		 */
> +		u64 translation_enabled: 1;
> +		u64 blocked: 1;
> +		/*
> +		 * First stage address translation paging mode:
> +		 * 0: 4-level paging (default)
> +		 * 1: 5-level paging
> +		 */
> +		u64 first_stage_paging_mode: 1;
> +		u64 reserved: 61;
> +	} flags;
> +
> +	/* Address of translation table */
> +	u64 page_table_root;
> +} __packed;
> +
> +struct hv_input_configure_device_domain {
> +	struct hv_input_device_domain device_domain;
> +	struct hv_device_domain_settings settings;
> +} __packed;
> +
> +struct hv_input_get_iommu_capabilities {
> +	u64 partition_id;
> +	u64 reserved;
> +} __packed;
> +
> +struct hv_output_get_iommu_capabilities {
> +	u32 size;
> +	u16 reserved;
> +	u8  max_iova_width;
> +	u8  max_pasid_width;
> +
> +#define HV_IOMMU_CAP_PRESENT (1ULL << 0)
> +#define HV_IOMMU_CAP_S2 (1ULL << 1)
> +#define HV_IOMMU_CAP_S1 (1ULL << 2)
> +#define HV_IOMMU_CAP_S1_5LVL (1ULL << 3)
> +#define HV_IOMMU_CAP_PASID (1ULL << 4)
> +#define HV_IOMMU_CAP_ATS (1ULL << 5)
> +#define HV_IOMMU_CAP_PRI (1ULL << 6)
> +
> +	u64 iommu_cap;
> +	u64 pgsize_bitmap;
> +} __packed;
> +
> +enum hv_logical_device_property_code {
> +	HV_LOGICAL_DEVICE_PROPERTY_PVIOMMU = 10,
> +};
> +
> +struct hv_input_get_logical_device_property {
> +	u64 partition_id;
> +	u64 logical_device_id;
> +	enum hv_logical_device_property_code code;

Historically we've avoided "enum" types in structures that are part of
the hypervisor ABI. Use u32 here?
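
A sketch of what that could look like, written as userspace C so the layout can
be checked at compile time. The demo_ structure name is hypothetical, the
property code becomes a plain constant, and __attribute__((packed)) stands in
for the kernel's __packed. The C standard leaves the underlying type of an enum
to the implementation, so a fixed-width field removes that dependency from the
ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Property code kept as a named constant rather than an enum-typed field. */
#define HV_LOGICAL_DEVICE_PROPERTY_PVIOMMU 10u

/* Userspace mirror of the quoted input structure with a fixed-width code. */
struct demo_input_get_logical_device_property {
	uint64_t partition_id;
	uint64_t logical_device_id;
	uint32_t code;		/* was: enum hv_logical_device_property_code */
	uint32_t reserved;
} __attribute__((packed));

/* The layout no longer depends on how the compiler sizes enums. */
static_assert(sizeof(struct demo_input_get_logical_device_property) == 24,
	      "unexpected ABI size");
static_assert(offsetof(struct demo_input_get_logical_device_property,
		       reserved) == 20,
	      "unexpected ABI offset");
```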

Michael

> +	u32 reserved;
> +} __packed;
> +
> +struct hv_output_get_logical_device_property {
> +#define HV_DEVICE_IOMMU_ENABLED (1ULL << 0)
> +	u64 device_iommu;
> +	u64 reserved;
> +} __packed;
> +
> +struct hv_input_flush_device_domain {
> +	struct hv_input_device_domain device_domain;
> +	u32 flags;
> +	u32 reserved;
> +} __packed;
> +
>  #endif /* _HV_HVHDK_MINI_H */
> --
> 2.49.0



* RE: [RFC v1 1/5] PCI: hv: Create and export hv_build_logical_dev_id()
From: Michael Kelley @ 2026-01-08 18:46 UTC (permalink / raw)
  To: Yu Zhang, linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org, iommu@lists.linux.dev,
	linux-pci@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, lpieralisi@kernel.org,
	kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org,
	bhelgaas@google.com, arnd@arndb.de, joro@8bytes.org,
	will@kernel.org, robin.murphy@arm.com,
	easwar.hariharan@linux.microsoft.com,
	jacob.pan@linux.microsoft.com, nunodasneves@linux.microsoft.com,
	mrathor@linux.microsoft.com, peterz@infradead.org,
	linux-arch@vger.kernel.org
In-Reply-To: <20251209051128.76913-2-zhangyu1@linux.microsoft.com>

From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, December 8, 2025 9:11 PM
> 
> From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
> 
> Hyper-V uses a logical device ID to identify a PCI endpoint device for
> child partitions. This ID will also be required for future hypercalls
> used by the Hyper-V IOMMU driver.
> 
> Refactor the logic for building this logical device ID into a standalone
> helper function and export the interface for wider use.
> 
> Signed-off-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
> Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com>
> ---
>  drivers/pci/controller/pci-hyperv.c | 28 ++++++++++++++++++++--------
>  include/asm-generic/mshyperv.h      |  2 ++
>  2 files changed, 22 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index 146b43981b27..4b82e06b5d93 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -598,15 +598,31 @@ static unsigned int hv_msi_get_int_vector(struct irq_data *data)
> 
>  #define hv_msi_prepare		pci_msi_prepare
> 
> +/**
> + * Build a "Device Logical ID" out of this PCI bus's instance GUID and the
> + * function number of the device.
> + */
> +u64 hv_build_logical_dev_id(struct pci_dev *pdev)
> +{
> +	struct pci_bus *pbus = pdev->bus;
> +	struct hv_pcibus_device *hbus = container_of(pbus->sysdata,
> +						struct hv_pcibus_device, sysdata);
> +
> +	return (u64)((hbus->hdev->dev_instance.b[5] << 24) |
> +		     (hbus->hdev->dev_instance.b[4] << 16) |
> +		     (hbus->hdev->dev_instance.b[7] << 8)  |
> +		     (hbus->hdev->dev_instance.b[6] & 0xf8) |
> +		     PCI_FUNC(pdev->devfn));
> +}
> +EXPORT_SYMBOL_GPL(hv_build_logical_dev_id);

This change is fine for hv_irq_retarget_interrupt(), but it doesn't help the
new IOMMU driver because pci-hyperv.c can be (and often is) built as a module.
The new Hyper-V IOMMU driver in this patch series is built-in, and so it can't
use this symbol in that case -- you'll get a link error on vmlinux when building
the kernel. Requiring pci-hyperv.c to *not* be built as a module would also
require that the VMBus driver not be built as a module, so I don't think that's
the right solution.

This is a messy problem. The new IOMMU driver needs to start with a generic
"struct device" for the PCI device, and somehow find the corresponding VMBus
PCI pass-thru device from which it can get the VMBus instance ID. I'm thinking
about ways to do this that don't depend on code and data structures that are
private to the pci-hyperv.c driver, and will follow-up if I have a good suggestion.

I was wondering if this "logical device id" is actually parsed by the hypervisor,
or whether it is just a unique ID that is opaque to the hypervisor. From the
usage in the hypercalls in pci-hyperv.c and this new IOMMU driver, it appears
to be the former. Evidently the hypervisor is taking this logical device ID and
matching against bytes 4 thru 7 of the instance GUIDs of PCI pass-thru
devices offered to the guest, so as to identify a particular PCI pass-thru device.
If that's the case, then Linux doesn't have the option of choosing some other
unique ID that is easier to generate and access.

There's a uniqueness issue with this kind of logical device ID that has been
around for years, but that I had never thought about before. In hv_pci_probe(),
instance GUID bytes 4 and 5 are used to generate the PCI domain number for
the "fake" PCI bus that the PCI pass-thru device resides on. The issue is the
lack of guaranteed uniqueness of bytes 4 and 5, so there's code to deal with
a collision. (The full GUID is unique, but not necessarily some subset of the
GUID.) It seems like the same kind of uniqueness issue could occur here. Does
the Hyper-V host provide any guarantees about the uniqueness of bytes 4 thru
7 as a unit, and if not, what happens if there is a collision? Again, this
uniqueness issue has existed for years, so it's not new to this patch set, but
with new uses of the logical device ID, it seems relevant to consider.
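
To show the kind of collision in question, here is a userspace mirror of the
quoted arithmetic. It is a sketch under assumptions: demo_logical_dev_id() and
DEMO_PCI_FUNC() are hypothetical names, the GUID is a plain 16-byte array, and
each byte is widened to uint32_t before shifting so the result does not depend
on signed int promotion. Two GUIDs that differ only outside bytes 4 thru 7
produce the same ID.

```c
#include <assert.h>
#include <stdint.h>

/* Function number is the low three bits of devfn. */
#define DEMO_PCI_FUNC(devfn) ((devfn) & 0x07)

/*
 * Build the logical device ID from GUID bytes 4 thru 7 plus the PCI
 * function number. No other GUID bytes contribute to the result.
 */
uint64_t demo_logical_dev_id(const uint8_t guid[16], uint8_t devfn)
{
	return ((uint32_t)guid[5] << 24) |
	       ((uint32_t)guid[4] << 16) |
	       ((uint32_t)guid[7] << 8)  |
	       ((uint32_t)guid[6] & 0xf8) |
	       DEMO_PCI_FUNC(devfn);
}
```

For example, GUIDs whose first bytes are 0x11 and 0x22 but whose bytes 4 thru 7
are both 34 12 78 56 map to 0x12345678 for function 0. The low three bits of
byte 6 are masked off as well, so they do not help distinguish devices either.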

Michael

> +
>  /**
>   * hv_irq_retarget_interrupt() - "Unmask" the IRQ by setting its current
>   * affinity.
>   * @data:	Describes the IRQ
>   *
>   * Build new a destination for the MSI and make a hypercall to
> - * update the Interrupt Redirection Table. "Device Logical ID"
> - * is built out of this PCI bus's instance GUID and the function
> - * number of the device.
> + * update the Interrupt Redirection Table.
>   */
>  static void hv_irq_retarget_interrupt(struct irq_data *data)
>  {
> @@ -642,11 +658,7 @@ static void hv_irq_retarget_interrupt(struct irq_data *data)
>  	params->int_entry.source = HV_INTERRUPT_SOURCE_MSI;
>  	params->int_entry.msi_entry.address.as_uint32 = int_desc->address & 0xffffffff;
>  	params->int_entry.msi_entry.data.as_uint32 = int_desc->data;
> -	params->device_id = (hbus->hdev->dev_instance.b[5] << 24) |
> -			   (hbus->hdev->dev_instance.b[4] << 16) |
> -			   (hbus->hdev->dev_instance.b[7] << 8) |
> -			   (hbus->hdev->dev_instance.b[6] & 0xf8) |
> -			   PCI_FUNC(pdev->devfn);
> +	params->device_id = hv_build_logical_dev_id(pdev);
>  	params->int_target.vector = hv_msi_get_int_vector(data);
> 
>  	if (hbus->protocol_version >= PCI_PROTOCOL_VERSION_1_2) {
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index 64ba6bc807d9..1a205ed69435 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -71,6 +71,8 @@ extern enum hv_partition_type hv_curr_partition_type;
>  extern void * __percpu *hyperv_pcpu_input_arg;
>  extern void * __percpu *hyperv_pcpu_output_arg;
> 
> +extern u64 hv_build_logical_dev_id(struct pci_dev *pdev);
> +
>  u64 hv_do_hypercall(u64 control, void *inputaddr, void *outputaddr);
>  u64 hv_do_fast_hypercall8(u16 control, u64 input8);
>  u64 hv_do_fast_hypercall16(u16 control, u64 input1, u64 input2);
> --
> 2.49.0


* RE: [RFC v1 0/5] Hyper-V: Add para-virtualized IOMMU support for Linux guests
From: Michael Kelley @ 2026-01-08 18:45 UTC (permalink / raw)
  To: Yu Zhang, linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org, iommu@lists.linux.dev,
	linux-pci@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, lpieralisi@kernel.org,
	kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org,
	bhelgaas@google.com, arnd@arndb.de, joro@8bytes.org,
	will@kernel.org, robin.murphy@arm.com,
	easwar.hariharan@linux.microsoft.com,
	jacob.pan@linux.microsoft.com, nunodasneves@linux.microsoft.com,
	mrathor@linux.microsoft.com, peterz@infradead.org,
	linux-arch@vger.kernel.org
In-Reply-To: <20251209051128.76913-1-zhangyu1@linux.microsoft.com>

From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, December 8, 2025 9:11 PM
> 
> This patch series introduces a para-virtualized IOMMU driver for
> Linux guests running on Microsoft Hyper-V. The primary objective
> is to enable hardware-assisted DMA isolation and scalable device

Is there any particular meaning for the qualifier "scalable" vs. just
"device assignment"? I just want to understand what you are getting
at.

> assignment for Hyper-V child partitions, bypassing the performance
> overhead and complexity associated with emulated IOMMU hardware.
> 
> The driver implements the following core functionality:
> *   Hypercall-based Enumeration
>     Unlike traditional ACPI-based discovery (e.g., DMAR/IVRS),
>     this driver enumerates the Hyper-V IOMMU capabilities directly
>     via hypercalls. This approach allows the guest to discover
>     IOMMU presence and features without requiring specific virtual
>     firmware extensions or modifications.
> 
> *   Domain Management
>     The driver manages IOMMU domains through a new set of Hyper-V
>     hypercall interfaces, handling domain allocation, attachment,
>     and detachment for endpoint devices.
> 
> *   IOTLB Invalidation
>     IOTLB invalidation requests are marshaled and issued to the
>     hypervisor through the same hypercall mechanism.
> 
> *   Nested Translation Support
>     This implementation leverages guest-managed stage-1 I/O page
>     tables nested with host stage-2 translations. It is built
>     upon the consolidated IOMMU page table framework designed by
>     Jason Gunthorpe [1]. This design eliminates the need for complex
>     emulation during map operations and ensures scalability across
>     different architectures.
> 
> Implementation Notes:
> *   Architecture Independence
>     While the current implementation only supports x86 platforms (Intel
>     VT-d and AMD IOMMU), the driver design aims to be as architecture-
>     agnostic as possible. To achieve this, initialization occurs via
>     `device_initcall` rather than `x86_init.iommu.iommu_init`, and shutdown
>     is handled via `syscore_ops` instead of `x86_platform.iommu_shutdown`.
> 
> *   MSI Region Handling
>     In this RFC, the hardware MSI region is hard-coded to the standard
>     x86 interrupt range (0xfee00000 - 0xfeefffff). Future updates may
>     allow this configuration to be queried via hypercalls if new hardware
>     platforms are to be supported.
> 
> *   Reserved Regions (RMRR)
>     There is currently no requirement to support assigned devices with
>     ACPI RMRR limitations. Consequently, this patch series does not specify
>     or query reserved memory regions.
> 
> Testing:
> This series has been validated using dmatest with Intel DSA devices
> assigned to the child partition. The tests confirmed successful DMA
> transactions under the para-virtualized IOMMU.
> 
> Future Work:
> *   Page-selective IOTLB Invalidation
>     The current implementation relies on full-domain flushes. Support
>     for page-selective invalidation is planned for a future series.
> 
> *   Advanced Features
>     Support for vSVA and virtual PRI will be addressed in subsequent
>     updates.
> 
> *   Root Partition Co-existence
>     Ensure compatibility with the distinct para-virtualized IOMMU driver
>     used by Hyper-V's Linux root partition, in which the DMA remapping
>     is not achieved by stage-1 IO page tables and another set of iommu
>     ops is provided.
> 
> [1] https://github.com/jgunthorpe/linux/tree/iommu_pt_all 
> 
> Easwar Hariharan (2):
>   PCI: hv: Create and export hv_build_logical_dev_id()
>   iommu: Move Hyper-V IOMMU driver to its own subdirectory
> 
> Wei Liu (1):
>   hyperv: Introduce new hypercall interfaces used by Hyper-V guest IOMMU
> 
> Yu Zhang (2):
>   hyperv: allow hypercall output pages to be allocated for child
>     partitions
>   iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest
> 
>  drivers/hv/hv_common.c                        |  21 +-
>  drivers/iommu/Kconfig                         |  10 +-
>  drivers/iommu/Makefile                        |   2 +-
>  drivers/iommu/hyperv/Kconfig                  |  24 +
>  drivers/iommu/hyperv/Makefile                 |   3 +
>  drivers/iommu/hyperv/iommu.c                  | 608 ++++++++++++++++++
>  drivers/iommu/hyperv/iommu.h                  |  53 ++
>  .../irq_remapping.c}                          |   2 +-
>  drivers/pci/controller/pci-hyperv.c           |  28 +-
>  include/asm-generic/mshyperv.h                |   2 +
>  include/hyperv/hvgdk_mini.h                   |   8 +
>  include/hyperv/hvhdk_mini.h                   | 123 ++++
>  12 files changed, 850 insertions(+), 34 deletions(-)
>  create mode 100644 drivers/iommu/hyperv/Kconfig
>  create mode 100644 drivers/iommu/hyperv/Makefile
>  create mode 100644 drivers/iommu/hyperv/iommu.c
>  create mode 100644 drivers/iommu/hyperv/iommu.h
>  rename drivers/iommu/{hyperv-iommu.c => hyperv/irq_remapping.c} (99%)
> 
> --
> 2.49.0

