* [PATCH RFC net-next v13 03/13] virtio: set skb owner of virtio_transport_reset_no_sock() reply
From: Bobby Eshleman @ 2025-12-24 0:28 UTC (permalink / raw)
To: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Stefan Hajnoczi, Michael S. Tsirkin,
Jason Wang, Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Shuah Khan, Long Li
Cc: linux-kernel, virtualization, netdev, kvm, linux-hyperv,
linux-kselftest, berrange, Sargun Dhillon, Bobby Eshleman,
Bobby Eshleman
In-Reply-To: <20251223-vsock-vmtest-v13-0-9d6db8e7c80b@meta.com>
From: Bobby Eshleman <bobbyeshleman@meta.com>
Associate reply packets with the sending socket. When vsock must reply
with an RST packet and there exists a sending socket (e.g., for
loopback), setting the skb owner to the socket correctly handles
reference counting between the skb and sk (i.e., the sk stays alive
until the skb is freed).
This allows the net namespace to be used for socket lookups for the
duration of the reply skb's lifetime, preventing race conditions between
the namespace lifecycle and vsock socket search using the namespace
pointer.
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v11:
- move before adding to netns support (Stefano)
Changes in v10:
- break this out into its own patch for easy revert (Stefano)
---
net/vmw_vsock/virtio_transport_common.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index fdb8f5b3fa60..718be9f33274 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -1165,6 +1165,12 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
.op = VIRTIO_VSOCK_OP_RST,
.type = le16_to_cpu(hdr->type),
.reply = true,
+
+ /* Set sk owner to socket we are replying to (may be NULL for
+ * non-loopback). This keeps a reference to the sock and
+ * sock_net(sk) until the reply skb is freed.
+ */
+ .vsk = vsock_sk(skb->sk),
};
struct sk_buff *reply;
--
2.47.3
^ permalink raw reply related
* [PATCH RFC net-next v13 02/13] vsock: add netns to vsock core
From: Bobby Eshleman @ 2025-12-24 0:28 UTC (permalink / raw)
To: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Stefan Hajnoczi, Michael S. Tsirkin,
Jason Wang, Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Shuah Khan, Long Li
Cc: linux-kernel, virtualization, netdev, kvm, linux-hyperv,
linux-kselftest, berrange, Sargun Dhillon, Bobby Eshleman,
Bobby Eshleman
In-Reply-To: <20251223-vsock-vmtest-v13-0-9d6db8e7c80b@meta.com>
From: Bobby Eshleman <bobbyeshleman@meta.com>
Add netns logic to vsock core. Additionally, modify transport hook
prototypes to be used by later transport-specific patches (e.g.,
*_seqpacket_allow()).
Namespaces are supported primarily by changing socket lookup functions
(e.g., vsock_find_connected_socket()) to take into account the socket
namespace and the namespace mode before considering a candidate socket a
"match".
This patch also introduces the sysctl /proc/sys/net/vsock/ns_mode to
report the mode and /proc/sys/net/vsock/child_ns_mode to set the mode
for new namespaces.
Add netns functionality (initialization, passing to transports, procfs,
etc...) to the af_vsock socket layer. Later patches that add netns
support to transports depend on this patch.
dgram_allow(), stream_allow(), and seqpacket_allow() callbacks are
modified to take a vsk in order to perform logic on namespace modes. In
future patches, the net will also be used for socket
lookups in these functions.
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v13:
- remove net_mode and replace with direct accesses to net->vsock.mode,
since this is now immutable.
- update comments about mode behavior and mutability, and sysctl API
- only pass NULL for net when wanting global, instead of net_mode ==
VSOCK_NET_MODE_GLOBAL. This reflects the new logic
of vsock_net_check_mode() that only requires net pointers (not
net_mode).
- refactor sysctl string code into a re-usable function, because
child_ns_mode and ns_mode both handle the same strings.
- remove redundant vsock_net_init(&init_net) call in module init because
pernet registration calls the callback on the init_net too
Changes in v12:
- return true in dgram_allow(), stream_allow(), and seqpacket_allow()
only if net_mode == VSOCK_NET_MODE_GLOBAL (Stefano)
- document bind(VMADDR_CID_ANY) case in af_vsock.c (Stefano)
- change order of stream_allow() call in vmci so we can pass vsk
to it
Changes in v10:
- add file-level comment about what happens to sockets/devices
when the namespace mode changes (Stefano)
- change the 'if (write)' boolean in vsock_net_mode_string() to
if (!write), this simplifies a later patch which adds "goto"
for mutex unlocking on function exit.
Changes in v9:
- remove virtio_vsock_alloc_rx_skb() (Stefano)
- remove vsock_global_dummy_net, not needed as net=NULL +
net_mode=VSOCK_NET_MODE_GLOBAL achieves identical result
Changes in v7:
- hv_sock: fix hyperv build error
- explain why vhost does not use the dummy
- explain usage of __vsock_global_dummy_net
- explain why VSOCK_NET_MODE_STR_MAX is 8 characters
- use switch-case in vsock_net_mode_string()
- avoid changing transports as much as possible
- add vsock_find_{bound,connected}_socket_net()
- rename `vsock_hdr` to `sysctl_hdr`
- add virtio_vsock_alloc_linear_skb() wrapper for setting dummy net and
global mode for virtio-vsock, move skb->cb zero-ing into wrapper
- explain seqpacket_allow() change
- move net setting to __vsock_create() instead of vsock_create() so
that child sockets also have their net assigned upon accept()
Changes in v6:
- unregister sysctl ops in vsock_exit()
- af_vsock: clarify description of CID behavior
- af_vsock: fix buf vs buffer naming, and length checking
- af_vsock: fix length checking w/ correct ctl_table->maxlen
Changes in v5:
- vsock_global_net() -> vsock_global_dummy_net()
- update comments for new uAPI
- use /proc/sys/net/vsock/ns_mode instead of /proc/net/vsock_ns_mode
- add prototype changes so patch remains compilable
---
drivers/vhost/vsock.c | 9 +-
include/linux/virtio_vsock.h | 4 +-
include/net/af_vsock.h | 11 +-
net/vmw_vsock/af_vsock.c | 296 +++++++++++++++++++++++++++++---
net/vmw_vsock/hyperv_transport.c | 7 +-
net/vmw_vsock/virtio_transport.c | 9 +-
net/vmw_vsock/virtio_transport_common.c | 6 +-
net/vmw_vsock/vmci_transport.c | 26 ++-
net/vmw_vsock/vsock_loopback.c | 8 +-
9 files changed, 332 insertions(+), 44 deletions(-)
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 0298ddc34824..8eca59fd1afb 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -406,7 +406,8 @@ static bool vhost_transport_msgzerocopy_allow(void)
return true;
}
-static bool vhost_transport_seqpacket_allow(u32 remote_cid);
+static bool vhost_transport_seqpacket_allow(struct vsock_sock *vsk,
+ u32 remote_cid);
static struct virtio_transport vhost_transport = {
.transport = {
@@ -462,11 +463,15 @@ static struct virtio_transport vhost_transport = {
.send_pkt = vhost_transport_send_pkt,
};
-static bool vhost_transport_seqpacket_allow(u32 remote_cid)
+static bool vhost_transport_seqpacket_allow(struct vsock_sock *vsk,
+ u32 remote_cid)
{
struct vhost_vsock *vsock;
bool seqpacket_allow = false;
+ if (vsock_net_mode(sock_net(sk_vsock(vsk))) != VSOCK_NET_MODE_GLOBAL)
+ return false;
+
rcu_read_lock();
vsock = vhost_vsock_get(remote_cid);
diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index 0c67543a45c8..1845e8d4f78d 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -256,10 +256,10 @@ void virtio_transport_notify_buffer_size(struct vsock_sock *vsk, u64 *val);
u64 virtio_transport_stream_rcvhiwat(struct vsock_sock *vsk);
bool virtio_transport_stream_is_active(struct vsock_sock *vsk);
-bool virtio_transport_stream_allow(u32 cid, u32 port);
+bool virtio_transport_stream_allow(struct vsock_sock *vsk, u32 cid, u32 port);
int virtio_transport_dgram_bind(struct vsock_sock *vsk,
struct sockaddr_vm *addr);
-bool virtio_transport_dgram_allow(u32 cid, u32 port);
+bool virtio_transport_dgram_allow(struct vsock_sock *vsk, u32 cid, u32 port);
int virtio_transport_connect(struct vsock_sock *vsk);
diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index 6f5bc9dbefa5..10c2846fcc58 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -125,7 +125,7 @@ struct vsock_transport {
size_t len, int flags);
int (*dgram_enqueue)(struct vsock_sock *, struct sockaddr_vm *,
struct msghdr *, size_t len);
- bool (*dgram_allow)(u32 cid, u32 port);
+ bool (*dgram_allow)(struct vsock_sock *vsk, u32 cid, u32 port);
/* STREAM. */
/* TODO: stream_bind() */
@@ -137,14 +137,14 @@ struct vsock_transport {
s64 (*stream_has_space)(struct vsock_sock *);
u64 (*stream_rcvhiwat)(struct vsock_sock *);
bool (*stream_is_active)(struct vsock_sock *);
- bool (*stream_allow)(u32 cid, u32 port);
+ bool (*stream_allow)(struct vsock_sock *vsk, u32 cid, u32 port);
/* SEQ_PACKET. */
ssize_t (*seqpacket_dequeue)(struct vsock_sock *vsk, struct msghdr *msg,
int flags);
int (*seqpacket_enqueue)(struct vsock_sock *vsk, struct msghdr *msg,
size_t len);
- bool (*seqpacket_allow)(u32 remote_cid);
+ bool (*seqpacket_allow)(struct vsock_sock *vsk, u32 remote_cid);
u32 (*seqpacket_has_data)(struct vsock_sock *vsk);
/* Notification. */
@@ -217,6 +217,11 @@ void vsock_remove_connected(struct vsock_sock *vsk);
struct sock *vsock_find_bound_socket(struct sockaddr_vm *addr);
struct sock *vsock_find_connected_socket(struct sockaddr_vm *src,
struct sockaddr_vm *dst);
+struct sock *vsock_find_bound_socket_net(struct sockaddr_vm *addr,
+ struct net *net);
+struct sock *vsock_find_connected_socket_net(struct sockaddr_vm *src,
+ struct sockaddr_vm *dst,
+ struct net *net);
void vsock_remove_sock(struct vsock_sock *vsk);
void vsock_for_each_connected_socket(struct vsock_transport *transport,
void (*fn)(struct sock *sk));
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index adcba1b7bf74..0eb2364e09aa 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -83,6 +83,42 @@
* TCP_ESTABLISHED - connected
* TCP_CLOSING - disconnecting
* TCP_LISTEN - listening
+ *
+ * - Namespaces in vsock support two different modes configured
+ * through /proc/sys/net/vsock/ns_mode. The modes are "local" and "global".
+ * Each mode defines how the namespace interacts with CIDs.
+ * /proc/sys/net/vsock/ns_mode is read-only and inherited from the
+ * parent namespace's /proc/sys/net/vsock/child_ns_mode at creation
+ * time and is immutable thereafter. The default is "global".
+ *
+ * The modes affect the allocation and accessibility of CIDs as follows:
+ *
+ * - global - access and allocation are all system-wide
+ * - all CID allocation from global namespaces draw from the same
+ * system-wide pool.
+ * - if one global namespace has already allocated some CID, another
+ * global namespace will not be able to allocate the same CID.
+ * - global mode AF_VSOCK sockets can reach any VM or socket in any global
+ * namespace, they are not contained to only their own namespace.
+ * - AF_VSOCK sockets in a global mode namespace cannot reach VMs or
+ * sockets in any local mode namespace.
+ * - local - access and allocation are contained within the namespace
+ * - CID allocation draws only from a private pool local only to the
+ * namespace, and does not affect the CIDs available for allocation in any
+ * other namespace (global or local).
+ * - VMs in a local namespace do not collide with CIDs in any other local
+ * namespace or any global namespace. For example, if a VM in a local mode
+ * namespace is given CID 10, then CID 10 is still available for
+ * allocation in any other namespace, but not in the same namespace.
+ * - AF_VSOCK sockets in a local mode namespace can connect only to VMs or
+ * other sockets within their own namespace.
+ * - sockets bound to VMADDR_CID_ANY in local namespaces will never resolve
+ * to any transport that is not compatible with local mode. There is no
+ * error that propagates to the user (as there is for connection attempts)
+ * because it is possible for some packet to reach this socket from
+ * a different transport that *does* support local mode. For
+ * example, virtio-vsock may not support local mode, but the socket
+ * may still accept a connection from vhost-vsock which does.
*/
#include <linux/compat.h>
@@ -100,6 +136,7 @@
#include <linux/module.h>
#include <linux/mutex.h>
#include <linux/net.h>
+#include <linux/proc_fs.h>
#include <linux/poll.h>
#include <linux/random.h>
#include <linux/skbuff.h>
@@ -111,9 +148,18 @@
#include <linux/workqueue.h>
#include <net/sock.h>
#include <net/af_vsock.h>
+#include <net/netns/vsock.h>
#include <uapi/linux/vm_sockets.h>
#include <uapi/asm-generic/ioctls.h>
+#define VSOCK_NET_MODE_STR_GLOBAL "global"
+#define VSOCK_NET_MODE_STR_LOCAL "local"
+
+/* 6 chars for "global", 1 for null-terminator, and 1 more for '\n'.
+ * The newline is added by proc_dostring() for read operations.
+ */
+#define VSOCK_NET_MODE_STR_MAX 8
+
static int __vsock_bind(struct sock *sk, struct sockaddr_vm *addr);
static void vsock_sk_destruct(struct sock *sk);
static int vsock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
@@ -235,33 +281,42 @@ static void __vsock_remove_connected(struct vsock_sock *vsk)
sock_put(&vsk->sk);
}
-static struct sock *__vsock_find_bound_socket(struct sockaddr_vm *addr)
+static struct sock *__vsock_find_bound_socket_net(struct sockaddr_vm *addr,
+ struct net *net)
{
struct vsock_sock *vsk;
list_for_each_entry(vsk, vsock_bound_sockets(addr), bound_table) {
- if (vsock_addr_equals_addr(addr, &vsk->local_addr))
- return sk_vsock(vsk);
+ struct sock *sk = sk_vsock(vsk);
+
+ if (vsock_addr_equals_addr(addr, &vsk->local_addr) &&
+ vsock_net_check_mode(sock_net(sk), net))
+ return sk;
if (addr->svm_port == vsk->local_addr.svm_port &&
(vsk->local_addr.svm_cid == VMADDR_CID_ANY ||
- addr->svm_cid == VMADDR_CID_ANY))
- return sk_vsock(vsk);
+ addr->svm_cid == VMADDR_CID_ANY) &&
+ vsock_net_check_mode(sock_net(sk), net))
+ return sk;
}
return NULL;
}
-static struct sock *__vsock_find_connected_socket(struct sockaddr_vm *src,
- struct sockaddr_vm *dst)
+static struct sock *
+__vsock_find_connected_socket_net(struct sockaddr_vm *src,
+ struct sockaddr_vm *dst, struct net *net)
{
struct vsock_sock *vsk;
list_for_each_entry(vsk, vsock_connected_sockets(src, dst),
connected_table) {
+ struct sock *sk = sk_vsock(vsk);
+
if (vsock_addr_equals_addr(src, &vsk->remote_addr) &&
- dst->svm_port == vsk->local_addr.svm_port) {
- return sk_vsock(vsk);
+ dst->svm_port == vsk->local_addr.svm_port &&
+ vsock_net_check_mode(sock_net(sk), net)) {
+ return sk;
}
}
@@ -304,12 +359,13 @@ void vsock_remove_connected(struct vsock_sock *vsk)
}
EXPORT_SYMBOL_GPL(vsock_remove_connected);
-struct sock *vsock_find_bound_socket(struct sockaddr_vm *addr)
+struct sock *vsock_find_bound_socket_net(struct sockaddr_vm *addr,
+ struct net *net)
{
struct sock *sk;
spin_lock_bh(&vsock_table_lock);
- sk = __vsock_find_bound_socket(addr);
+ sk = __vsock_find_bound_socket_net(addr, net);
if (sk)
sock_hold(sk);
@@ -317,15 +373,22 @@ struct sock *vsock_find_bound_socket(struct sockaddr_vm *addr)
return sk;
}
+EXPORT_SYMBOL_GPL(vsock_find_bound_socket_net);
+
+struct sock *vsock_find_bound_socket(struct sockaddr_vm *addr)
+{
+ return vsock_find_bound_socket_net(addr, NULL);
+}
EXPORT_SYMBOL_GPL(vsock_find_bound_socket);
-struct sock *vsock_find_connected_socket(struct sockaddr_vm *src,
- struct sockaddr_vm *dst)
+struct sock *vsock_find_connected_socket_net(struct sockaddr_vm *src,
+ struct sockaddr_vm *dst,
+ struct net *net)
{
struct sock *sk;
spin_lock_bh(&vsock_table_lock);
- sk = __vsock_find_connected_socket(src, dst);
+ sk = __vsock_find_connected_socket_net(src, dst, net);
if (sk)
sock_hold(sk);
@@ -333,6 +396,13 @@ struct sock *vsock_find_connected_socket(struct sockaddr_vm *src,
return sk;
}
+EXPORT_SYMBOL_GPL(vsock_find_connected_socket_net);
+
+struct sock *vsock_find_connected_socket(struct sockaddr_vm *src,
+ struct sockaddr_vm *dst)
+{
+ return vsock_find_connected_socket_net(src, dst, NULL);
+}
EXPORT_SYMBOL_GPL(vsock_find_connected_socket);
void vsock_remove_sock(struct vsock_sock *vsk)
@@ -528,7 +598,7 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk)
if (sk->sk_type == SOCK_SEQPACKET) {
if (!new_transport->seqpacket_allow ||
- !new_transport->seqpacket_allow(remote_cid)) {
+ !new_transport->seqpacket_allow(vsk, remote_cid)) {
module_put(new_transport->module);
return -ESOCKTNOSUPPORT;
}
@@ -676,6 +746,7 @@ static void vsock_pending_work(struct work_struct *work)
static int __vsock_bind_connectible(struct vsock_sock *vsk,
struct sockaddr_vm *addr)
{
+ struct net *net = sock_net(sk_vsock(vsk));
static u32 port;
struct sockaddr_vm new_addr;
@@ -695,7 +766,7 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
new_addr.svm_port = port++;
- if (!__vsock_find_bound_socket(&new_addr)) {
+ if (!__vsock_find_bound_socket_net(&new_addr, net)) {
found = true;
break;
}
@@ -712,7 +783,7 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
return -EACCES;
}
- if (__vsock_find_bound_socket(&new_addr))
+ if (__vsock_find_bound_socket_net(&new_addr, net))
return -EADDRINUSE;
}
@@ -1314,7 +1385,7 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
goto out;
}
- if (!transport->dgram_allow(remote_addr->svm_cid,
+ if (!transport->dgram_allow(vsk, remote_addr->svm_cid,
remote_addr->svm_port)) {
err = -EINVAL;
goto out;
@@ -1355,7 +1426,7 @@ static int vsock_dgram_connect(struct socket *sock,
if (err)
goto out;
- if (!vsk->transport->dgram_allow(remote_addr->svm_cid,
+ if (!vsk->transport->dgram_allow(vsk, remote_addr->svm_cid,
remote_addr->svm_port)) {
err = -EINVAL;
goto out;
@@ -1585,7 +1656,7 @@ static int vsock_connect(struct socket *sock, struct sockaddr_unsized *addr,
* endpoints.
*/
if (!transport ||
- !transport->stream_allow(remote_addr->svm_cid,
+ !transport->stream_allow(vsk, remote_addr->svm_cid,
remote_addr->svm_port)) {
err = -ENETUNREACH;
goto out;
@@ -2658,6 +2729,183 @@ static struct miscdevice vsock_device = {
.fops = &vsock_device_ops,
};
+static int __vsock_net_mode_string(const struct ctl_table *table, int write,
+ void *buffer, size_t *lenp, loff_t *ppos,
+ enum vsock_net_mode mode,
+ enum vsock_net_mode *new_mode)
+{
+ char data[VSOCK_NET_MODE_STR_MAX] = {0};
+ struct ctl_table tmp;
+ int ret;
+
+ if (!table->data || !table->maxlen || !*lenp) {
+ *lenp = 0;
+ return 0;
+ }
+
+ tmp = *table;
+ tmp.data = data;
+
+ if (!write) {
+ const char *p;
+
+ switch (mode) {
+ case VSOCK_NET_MODE_GLOBAL:
+ p = VSOCK_NET_MODE_STR_GLOBAL;
+ break;
+ case VSOCK_NET_MODE_LOCAL:
+ p = VSOCK_NET_MODE_STR_LOCAL;
+ break;
+ default:
+ WARN_ONCE(true, "netns has invalid vsock mode");
+ *lenp = 0;
+ return 0;
+ }
+
+ strscpy(data, p, sizeof(data));
+ tmp.maxlen = strlen(p);
+ }
+
+ ret = proc_dostring(&tmp, write, buffer, lenp, ppos);
+ if (ret)
+ return ret;
+
+ if (!write)
+ return 0;
+
+ if (*lenp >= sizeof(data))
+ return -EINVAL;
+
+ if (!strncmp(data, VSOCK_NET_MODE_STR_GLOBAL, sizeof(data)))
+ *new_mode = VSOCK_NET_MODE_GLOBAL;
+ else if (!strncmp(data, VSOCK_NET_MODE_STR_LOCAL, sizeof(data)))
+ *new_mode = VSOCK_NET_MODE_LOCAL;
+ else
+ return -EINVAL;
+
+ return 0;
+}
+
+static int vsock_net_mode_string(const struct ctl_table *table, int write,
+ void *buffer, size_t *lenp, loff_t *ppos)
+{
+ struct net *net;
+
+ if (write)
+ return -EPERM;
+
+ net = current->nsproxy->net_ns;
+
+ return __vsock_net_mode_string(table, write, buffer, lenp, ppos,
+ vsock_net_mode(net), NULL);
+}
+
+static int vsock_net_child_mode_string(const struct ctl_table *table, int write,
+ void *buffer, size_t *lenp, loff_t *ppos)
+{
+ enum vsock_net_mode new_mode;
+ struct net *net;
+ int ret;
+
+ net = current->nsproxy->net_ns;
+
+ ret = __vsock_net_mode_string(table, write, buffer, lenp, ppos,
+ vsock_net_child_mode(net), &new_mode);
+ if (ret)
+ return ret;
+
+ if (write)
+ vsock_net_set_child_mode(net, new_mode);
+
+ return 0;
+}
+
+static struct ctl_table vsock_table[] = {
+ {
+ .procname = "ns_mode",
+ .data = &init_net.vsock.mode,
+ .maxlen = VSOCK_NET_MODE_STR_MAX,
+ .mode = 0444,
+ .proc_handler = vsock_net_mode_string
+ },
+ {
+ .procname = "child_ns_mode",
+ .data = &init_net.vsock.child_ns_mode,
+ .maxlen = VSOCK_NET_MODE_STR_MAX,
+ .mode = 0644,
+ .proc_handler = vsock_net_child_mode_string
+ },
+};
+
+static int __net_init vsock_sysctl_register(struct net *net)
+{
+ struct ctl_table *table;
+
+ if (net_eq(net, &init_net)) {
+ table = vsock_table;
+ } else {
+ table = kmemdup(vsock_table, sizeof(vsock_table), GFP_KERNEL);
+ if (!table)
+ goto err_alloc;
+
+ table[0].data = &net->vsock.mode;
+ table[1].data = &net->vsock.child_ns_mode;
+ }
+
+ net->vsock.sysctl_hdr = register_net_sysctl_sz(net, "net/vsock", table,
+ ARRAY_SIZE(vsock_table));
+ if (!net->vsock.sysctl_hdr)
+ goto err_reg;
+
+ return 0;
+
+err_reg:
+ if (!net_eq(net, &init_net))
+ kfree(table);
+err_alloc:
+ return -ENOMEM;
+}
+
+static void vsock_sysctl_unregister(struct net *net)
+{
+ const struct ctl_table *table;
+
+ table = net->vsock.sysctl_hdr->ctl_table_arg;
+ unregister_net_sysctl_table(net->vsock.sysctl_hdr);
+ if (!net_eq(net, &init_net))
+ kfree(table);
+}
+
+static void vsock_net_init(struct net *net)
+{
+ if (net_eq(net, &init_net))
+ net->vsock.mode = VSOCK_NET_MODE_GLOBAL;
+ else
+ net->vsock.mode = vsock_net_child_mode(current->nsproxy->net_ns);
+
+ net->vsock.child_ns_mode = VSOCK_NET_MODE_GLOBAL;
+}
+
+static __net_init int vsock_sysctl_init_net(struct net *net)
+{
+ vsock_net_init(net);
+
+ if (vsock_sysctl_register(net))
+ return -ENOMEM;
+
+ return 0;
+}
+
+static __net_exit void vsock_sysctl_exit_net(struct net *net)
+{
+ vsock_sysctl_unregister(net);
+}
+
+static struct pernet_operations vsock_sysctl_ops __net_initdata = {
+ .init = vsock_sysctl_init_net,
+ .exit = vsock_sysctl_exit_net,
+};
+
static int __init vsock_init(void)
{
int err = 0;
@@ -2685,10 +2933,17 @@ static int __init vsock_init(void)
goto err_unregister_proto;
}
+ if (register_pernet_subsys(&vsock_sysctl_ops)) {
+ err = -ENOMEM;
+ goto err_unregister_sock;
+ }
+
vsock_bpf_build_proto();
return 0;
+err_unregister_sock:
+ sock_unregister(AF_VSOCK);
err_unregister_proto:
proto_unregister(&vsock_proto);
err_deregister_misc:
@@ -2702,6 +2957,7 @@ static void __exit vsock_exit(void)
misc_deregister(&vsock_device);
sock_unregister(AF_VSOCK);
proto_unregister(&vsock_proto);
+ unregister_pernet_subsys(&vsock_sysctl_ops);
}
const struct vsock_transport *vsock_core_get_transport(struct vsock_sock *vsk)
diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
index 432fcbbd14d4..4d6d7807f152 100644
--- a/net/vmw_vsock/hyperv_transport.c
+++ b/net/vmw_vsock/hyperv_transport.c
@@ -570,7 +570,7 @@ static int hvs_dgram_enqueue(struct vsock_sock *vsk,
return -EOPNOTSUPP;
}
-static bool hvs_dgram_allow(u32 cid, u32 port)
+static bool hvs_dgram_allow(struct vsock_sock *vsk, u32 cid, u32 port)
{
return false;
}
@@ -745,8 +745,11 @@ static bool hvs_stream_is_active(struct vsock_sock *vsk)
return hvs->chan != NULL;
}
-static bool hvs_stream_allow(u32 cid, u32 port)
+static bool hvs_stream_allow(struct vsock_sock *vsk, u32 cid, u32 port)
{
+ if (vsock_net_mode(sock_net(sk_vsock(vsk))) != VSOCK_NET_MODE_GLOBAL)
+ return false;
+
if (cid == VMADDR_CID_HOST)
return true;
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 8c867023a2e5..37eeefddb48c 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -536,7 +536,8 @@ static bool virtio_transport_msgzerocopy_allow(void)
return true;
}
-static bool virtio_transport_seqpacket_allow(u32 remote_cid);
+static bool virtio_transport_seqpacket_allow(struct vsock_sock *vsk,
+ u32 remote_cid);
static struct virtio_transport virtio_transport = {
.transport = {
@@ -593,11 +594,15 @@ static struct virtio_transport virtio_transport = {
.can_msgzerocopy = virtio_transport_can_msgzerocopy,
};
-static bool virtio_transport_seqpacket_allow(u32 remote_cid)
+static bool
+virtio_transport_seqpacket_allow(struct vsock_sock *vsk, u32 remote_cid)
{
struct virtio_vsock *vsock;
bool seqpacket_allow;
+ if (vsock_net_mode(sock_net(sk_vsock(vsk))) != VSOCK_NET_MODE_GLOBAL)
+ return false;
+
seqpacket_allow = false;
rcu_read_lock();
vsock = rcu_dereference(the_virtio_vsock);
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index dcc8a1d5851e..fdb8f5b3fa60 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -1043,9 +1043,9 @@ bool virtio_transport_stream_is_active(struct vsock_sock *vsk)
}
EXPORT_SYMBOL_GPL(virtio_transport_stream_is_active);
-bool virtio_transport_stream_allow(u32 cid, u32 port)
+bool virtio_transport_stream_allow(struct vsock_sock *vsk, u32 cid, u32 port)
{
- return true;
+ return vsock_net_mode(sock_net(sk_vsock(vsk))) == VSOCK_NET_MODE_GLOBAL;
}
EXPORT_SYMBOL_GPL(virtio_transport_stream_allow);
@@ -1056,7 +1056,7 @@ int virtio_transport_dgram_bind(struct vsock_sock *vsk,
}
EXPORT_SYMBOL_GPL(virtio_transport_dgram_bind);
-bool virtio_transport_dgram_allow(u32 cid, u32 port)
+bool virtio_transport_dgram_allow(struct vsock_sock *vsk, u32 cid, u32 port)
{
return false;
}
diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
index 7eccd6708d66..d5ce39ea5a1b 100644
--- a/net/vmw_vsock/vmci_transport.c
+++ b/net/vmw_vsock/vmci_transport.c
@@ -646,13 +646,17 @@ static int vmci_transport_recv_dgram_cb(void *data, struct vmci_datagram *dg)
return VMCI_SUCCESS;
}
-static bool vmci_transport_stream_allow(u32 cid, u32 port)
+static bool vmci_transport_stream_allow(struct vsock_sock *vsk, u32 cid,
+ u32 port)
{
static const u32 non_socket_contexts[] = {
VMADDR_CID_LOCAL,
};
int i;
+ if (vsock_net_mode(sock_net(sk_vsock(vsk))) != VSOCK_NET_MODE_GLOBAL)
+ return false;
+
BUILD_BUG_ON(sizeof(cid) != sizeof(*non_socket_contexts));
for (i = 0; i < ARRAY_SIZE(non_socket_contexts); i++) {
@@ -682,12 +686,10 @@ static int vmci_transport_recv_stream_cb(void *data, struct vmci_datagram *dg)
err = VMCI_SUCCESS;
bh_process_pkt = false;
- /* Ignore incoming packets from contexts without sockets, or resources
- * that aren't vsock implementations.
+ /* Ignore incoming packets from resources that aren't vsock
+ * implementations.
*/
-
- if (!vmci_transport_stream_allow(dg->src.context, -1)
- || vmci_transport_peer_rid(dg->src.context) != dg->src.resource)
+ if (vmci_transport_peer_rid(dg->src.context) != dg->src.resource)
return VMCI_ERROR_NO_ACCESS;
if (VMCI_DG_SIZE(dg) < sizeof(*pkt))
@@ -749,6 +751,12 @@ static int vmci_transport_recv_stream_cb(void *data, struct vmci_datagram *dg)
goto out;
}
+ /* Ignore incoming packets from contexts without sockets. */
+ if (!vmci_transport_stream_allow(vsk, dg->src.context, -1)) {
+ err = VMCI_ERROR_NO_ACCESS;
+ goto out;
+ }
+
/* We do most everything in a work queue, but let's fast path the
* notification of reads and writes to help data transfer performance.
* We can only do this if there is no process context code executing
@@ -1784,8 +1792,12 @@ static int vmci_transport_dgram_dequeue(struct vsock_sock *vsk,
return err;
}
-static bool vmci_transport_dgram_allow(u32 cid, u32 port)
+static bool vmci_transport_dgram_allow(struct vsock_sock *vsk, u32 cid,
+ u32 port)
{
+ if (vsock_net_mode(sock_net(sk_vsock(vsk))) != VSOCK_NET_MODE_GLOBAL)
+ return false;
+
if (cid == VMADDR_CID_HYPERVISOR) {
/* Registrations of PBRPC Servers do not modify VMX/Hypervisor
* state and are allowed.
diff --git a/net/vmw_vsock/vsock_loopback.c b/net/vmw_vsock/vsock_loopback.c
index bc2ff918b315..378a96dcb666 100644
--- a/net/vmw_vsock/vsock_loopback.c
+++ b/net/vmw_vsock/vsock_loopback.c
@@ -46,7 +46,8 @@ static int vsock_loopback_cancel_pkt(struct vsock_sock *vsk)
return 0;
}
-static bool vsock_loopback_seqpacket_allow(u32 remote_cid);
+static bool vsock_loopback_seqpacket_allow(struct vsock_sock *vsk,
+ u32 remote_cid);
static bool vsock_loopback_msgzerocopy_allow(void)
{
return true;
@@ -106,9 +107,10 @@ static struct virtio_transport loopback_transport = {
.send_pkt = vsock_loopback_send_pkt,
};
-static bool vsock_loopback_seqpacket_allow(u32 remote_cid)
+static bool
+vsock_loopback_seqpacket_allow(struct vsock_sock *vsk, u32 remote_cid)
{
- return true;
+ return vsock_net_mode(sock_net(sk_vsock(vsk))) == VSOCK_NET_MODE_GLOBAL;
}
static void vsock_loopback_work(struct work_struct *work)
--
2.47.3
^ permalink raw reply related
* [PATCH RFC net-next v13 01/13] vsock: add per-net vsock NS mode state
From: Bobby Eshleman @ 2025-12-24 0:28 UTC (permalink / raw)
To: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Stefan Hajnoczi, Michael S. Tsirkin,
Jason Wang, Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Shuah Khan, Long Li
Cc: linux-kernel, virtualization, netdev, kvm, linux-hyperv,
linux-kselftest, berrange, Sargun Dhillon, Bobby Eshleman,
Bobby Eshleman
In-Reply-To: <20251223-vsock-vmtest-v13-0-9d6db8e7c80b@meta.com>
From: Bobby Eshleman <bobbyeshleman@meta.com>
Add the per-net vsock NS mode state. This only adds the structure for
holding the mode and some of the functions for setting/getting and
checking the mode, but does not integrate the functionality yet.
Future patches add the uAPI and transport-specific usage of these
structures and helpers.
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v13:
- remove net_mode because net->vsock.mode becomes immutable, no need to
save the mode when vsocks are created.
- add the new helpers for child_ns_mode to support ns_mode inheriting
the mode from child_ns_mode.
- because ns_mode is immutable and child_ns_mode can be changed multiple
times, remove the write-once lock.
- simplify vsock_net_check_mode() to no longer take mode arguments since
the mode can be accessed via the net pointers without fear of the mode
changing.
- add logic in vsock_net_check_mode() to infer VSOCK_NET_MODE_GLOBAL
from NULL namespaces in order to allow only net pointers to be passed
to vsock_net_check_mode(), while still allowing namespace-unaware
transports to force global mode.
Changes in v10:
- change mode_locked to int (Stefano)
Changes in v9:
- use xchg(), WRITE_ONCE(), READ_ONCE() for mode and mode_locked (Stefano)
- clarify mode0/mode1 meaning in vsock_net_check_mode() comment
- remove spin lock in net->vsock (not used anymore)
- change mode from u8 to enum vsock_net_mode in vsock_net_write_mode()
Changes in v7:
- clarify vsock_net_check_mode() comments
- change to `orig_net_mode == VSOCK_NET_MODE_GLOBAL && orig_net_mode == vsk->orig_net_mode`
- remove extraneous explanation of `orig_net_mode`
- rename `written` to `mode_locked`
- rename `vsock_hdr` to `sysctl_hdr`
- change `orig_net_mode` to `net_mode`
- make vsock_net_check_mode() more generic by taking just net pointers
and modes, instead of a vsock_sock ptr, for reuse by transports
(e.g., vhost_vsock)
Changes in v6:
- add orig_net_mode to store mode at creation time which will be used to
avoid breakage when namespace changes mode during socket/VM lifespan
Changes in v5:
- use /proc/sys/net/vsock/ns_mode instead of /proc/net/vsock_ns_mode
- change from net->vsock.ns_mode to net->vsock.mode
- change vsock_net_set_mode() to vsock_net_write_mode()
- vsock_net_write_mode() returns bool for write success to avoid
need to use vsock_net_mode_can_set()
- remove vsock_net_mode_can_set()
---
MAINTAINERS | 1 +
include/net/af_vsock.h | 42 ++++++++++++++++++++++++++++++++++++++++++
include/net/net_namespace.h | 4 ++++
include/net/netns/vsock.h | 17 +++++++++++++++++
4 files changed, 64 insertions(+)
diff --git a/MAINTAINERS b/MAINTAINERS
index 454b8ed119e9..38d24e5a957c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -27516,6 +27516,7 @@ L: netdev@vger.kernel.org
S: Maintained
F: drivers/vhost/vsock.c
F: include/linux/virtio_vsock.h
+F: include/net/netns/vsock.h
F: include/uapi/linux/virtio_vsock.h
F: net/vmw_vsock/virtio_transport.c
F: net/vmw_vsock/virtio_transport_common.c
diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index d40e978126e3..6f5bc9dbefa5 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -10,6 +10,7 @@
#include <linux/kernel.h>
#include <linux/workqueue.h>
+#include <net/netns/vsock.h>
#include <net/sock.h>
#include <uapi/linux/vm_sockets.h>
@@ -256,4 +257,45 @@ static inline bool vsock_msgzerocopy_allow(const struct vsock_transport *t)
{
return t->msgzerocopy_allow && t->msgzerocopy_allow();
}
+
+static inline enum vsock_net_mode vsock_net_mode(struct net *net)
+{
+ return READ_ONCE(net->vsock.mode);
+}
+
+static inline void vsock_net_set_child_mode(struct net *net,
+ enum vsock_net_mode mode)
+{
+ WRITE_ONCE(net->vsock.child_ns_mode, mode);
+}
+
+static inline enum vsock_net_mode vsock_net_child_mode(struct net *net)
+{
+ return READ_ONCE(net->vsock.child_ns_mode);
+}
+
+/* Return true if two namespaces pass the mode rules. Otherwise, return false.
+ *
+ * A NULL namespace is treated as VSOCK_NET_MODE_GLOBAL.
+ *
+ * Read more about modes in the comment header of net/vmw_vsock/af_vsock.c.
+ */
+static inline bool vsock_net_check_mode(struct net *ns0, struct net *ns1)
+{
+ enum vsock_net_mode mode0, mode1;
+
+ /* Any vsocks within the same network namespace are always reachable,
+ * regardless of the mode.
+ */
+ if (net_eq(ns0, ns1))
+ return true;
+
+ mode0 = ns0 ? vsock_net_mode(ns0) : VSOCK_NET_MODE_GLOBAL;
+ mode1 = ns1 ? vsock_net_mode(ns1) : VSOCK_NET_MODE_GLOBAL;
+
+ /* Different namespaces are only reachable if they are both
+ * global mode.
+ */
+ return mode0 == VSOCK_NET_MODE_GLOBAL && mode0 == mode1;
+}
#endif /* __AF_VSOCK_H__ */
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index cb664f6e3558..66d3de1d935f 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -37,6 +37,7 @@
#include <net/netns/smc.h>
#include <net/netns/bpf.h>
#include <net/netns/mctp.h>
+#include <net/netns/vsock.h>
#include <net/net_trackers.h>
#include <linux/ns_common.h>
#include <linux/idr.h>
@@ -196,6 +197,9 @@ struct net {
/* Move to a better place when the config guard is removed. */
struct mutex rtnl_mutex;
#endif
+#if IS_ENABLED(CONFIG_VSOCKETS)
+ struct netns_vsock vsock;
+#endif
} __randomize_layout;
#include <linux/seq_file_net.h>
diff --git a/include/net/netns/vsock.h b/include/net/netns/vsock.h
new file mode 100644
index 000000000000..e2325e2d6ec5
--- /dev/null
+++ b/include/net/netns/vsock.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __NET_NET_NAMESPACE_VSOCK_H
+#define __NET_NET_NAMESPACE_VSOCK_H
+
+#include <linux/types.h>
+
+enum vsock_net_mode {
+ VSOCK_NET_MODE_GLOBAL,
+ VSOCK_NET_MODE_LOCAL,
+};
+
+struct netns_vsock {
+ struct ctl_table_header *sysctl_hdr;
+ enum vsock_net_mode mode;
+ enum vsock_net_mode child_ns_mode;
+};
+#endif /* __NET_NET_NAMESPACE_VSOCK_H */
--
2.47.3
^ permalink raw reply related
* [PATCH v5 2/2] Drivers: hv: vmbus: retrieve connection-id from DeviceTree
From: Hardik Garg @ 2025-12-23 23:09 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, krzk+dt, robh, conor+dt, mhklinux
Cc: devicetree, linux-hyperv, linux-kernel, ssengar, longli,
Naman Jain, hargar
In-Reply-To: <58cb22cb-b0c8-4694-b9e4-971aa7f0f972@linux.microsoft.com>
The connection-id determines which hypervisor communication channel
the guest should use to talk to the VMBus host. Add steps to read
this value from the DeviceTree. When this property is not present,
the driver selects the connection ID based on the protocol version
(4 for VERSION_WIN10_V5 and newer, or 1 for older versions).
Signed-off-by: Hardik Garg <hargar@linux.microsoft.com>
---
v4:
https://lore.kernel.org/all/1750374395-14615-3-git-send-email-hargar@linux.microsoft.com
v3:
https://lore.kernel.org/all/6a92ca86-ad6b-4d49-af6e-1ed7651b8ab8@linux.microsoft.com
v2:
https://lore.kernel.org/all/096edaf7-cc90-42b6-aff4-c5f088574e1e@linux.microsoft.com
v1:
https://lore.kernel.org/all/6acee4bf-cb04-43b9-9476-e8d811d26dfd@linux.microsoft.com
---
drivers/hv/connection.c | 7 +++++--
drivers/hv/vmbus_drv.c | 8 ++++++++
2 files changed, 13 insertions(+), 2 deletions(-)
diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index 5d9cb5bf2d62..660cad3886f5 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -100,12 +100,15 @@ int vmbus_negotiate_version(struct
vmbus_channel_msginfo *msginfo, u32 version)
if (version >= VERSION_WIN10_V5) {
msg->msg_sint = VMBUS_MESSAGE_SINT;
msg->msg_vtl = ms_hyperv.vtl;
- vmbus_connection.msg_conn_id = VMBUS_MESSAGE_CONNECTION_ID_4;
} else {
msg->interrupt_page = virt_to_phys(vmbus_connection.int_page);
- vmbus_connection.msg_conn_id = VMBUS_MESSAGE_CONNECTION_ID;
}
+ /* Set default connection ID if not provided via DeviceTree */
+ if (!vmbus_connection.msg_conn_id)
+ vmbus_connection.msg_conn_id = (version >= VERSION_WIN10_V5) ?
+ VMBUS_MESSAGE_CONNECTION_ID_4 : VMBUS_MESSAGE_CONNECTION_ID;
+
if (vmbus_is_confidential() && version >= VERSION_WIN10_V6_0)
msg->feature_flags = VMBUS_FEATURE_FLAG_CONFIDENTIAL_CHANNELS;
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 47fcab38398a..f8c0594ab85f 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -2600,10 +2600,18 @@ static int vmbus_device_add(struct
platform_device *pdev)
struct of_range range;
struct of_range_parser parser;
struct device_node *np = pdev->dev.of_node;
+ unsigned int conn_id;
int ret;
vmbus_root_device = &pdev->dev;
+ /* Read connection ID from DeviceTree */
+ if (!of_property_read_u32(np, "microsoft,message-connection-id",
+ &conn_id)) {
+ pr_info("VMBus message connection ID: %u\n", conn_id);
+ vmbus_connection.msg_conn_id = conn_id;
+ }
+
ret = of_range_parser_init(&parser, np);
if (ret)
return ret;
--
2.34.1
^ permalink raw reply related
* [PATCH v5 1/2] dt-bindings: microsoft: Add vmbus message-connection-id property
From: Hardik Garg @ 2025-12-23 23:07 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, krzk+dt, robh, conor+dt, mhklinux
Cc: devicetree, linux-hyperv, linux-kernel, ssengar, longli,
Naman Jain, hargar
In-Reply-To: <58cb22cb-b0c8-4694-b9e4-971aa7f0f972@linux.microsoft.com>
Document the microsoft,message-connection-id property for VMBus
DeviceTree node. The connection-id is a hardware-level identifier
that specifies which Hyper-V message port (or mailbox slot) the
guest should use to communicate with the VMBus control plane.
Historically, VMBus always used a single, fixed connection ID.
However, with the introduction of Virtual Trust Level 2 (VTL2)
support, the control plane can now reside in a different trust
level, requiring the guest to communicate through a different
message port. This connection-id is determined by the hypervisor
based on the status of VMBus relay. From the guests perspective,
it is completely static for the lifetime of that VM instance, it
never changes at runtime. Once the kernel boots, it must read this
value to establish communication with the correct VMBus control
plane. There is currently no system API, or discoverable interface
that allows the guest to determine this value dynamically.
Each guest has a private hypervisor mailbox and cannot access any other
guests communication path. Using an incorrect connection ID does not
allow eavesdropping or cause interference, it only results in failed
VMBus initialization because the host drops messages sent to an
unexpected port. Thus, exposing the correct connection ID to the guest
is safe and necessary for correct initialization.
Signed-off-by: Hardik Garg <hargar@linux.microsoft.com>
---
v4:
https://lore.kernel.org/all/1750374395-14615-2-git-send-email-hargar@linux.microsoft.com
v3:
https://lore.kernel.org/all/6a92ca86-ad6b-4d49-af6e-1ed7651b8ab8@linux.microsoft.com
v2:
https://lore.kernel.org/all/096edaf7-cc90-42b6-aff4-c5f088574e1e@linux.microsoft.com
v1:
https://lore.kernel.org/all/6acee4bf-cb04-43b9-9476-e8d811d26dfd@linux.microsoft.com
---
.../devicetree/bindings/bus/microsoft,vmbus.yaml | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/Documentation/devicetree/bindings/bus/microsoft,vmbus.yaml
b/Documentation/devicetree/bindings/bus/microsoft,vmbus.yaml
index 0bea4f5287ce..4745c2b89ac5 100644
--- a/Documentation/devicetree/bindings/bus/microsoft,vmbus.yaml
+++ b/Documentation/devicetree/bindings/bus/microsoft,vmbus.yaml
@@ -17,6 +17,17 @@ properties:
compatible:
const: microsoft,vmbus
+ microsoft,message-connection-id:
+ description:
+ connection-id is a hardware-level identifier that specifies
+ which Hyper-V message port (or mailbox slot) the guest should
+ use to communicate with the VMBus control plane. When this
+ property is not present, the driver selects the connection ID
+ based on the protocol version (4 for VERSION_WIN10_V5 and
+ newer, or 1 for older versions).
+ $ref: /schemas/types.yaml#/definitions/uint32
+ minimum: 1
+
ranges: true
'#address-cells':
@@ -55,6 +66,7 @@ examples:
vmbus@ff0000000 {
compatible = "microsoft,vmbus";
+ microsoft,message-connection-id = <4>;
#address-cells = <2>;
#size-cells = <1>;
ranges = <0x0f 0xf0000000 0x0f 0xf0000000 0x10000000>;
--
2.34.1
^ permalink raw reply related
* [PATCH v5 0/2] Add VMBus message connection ID support via DeviceTree
From: Hardik Garg @ 2025-12-23 23:05 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, krzk+dt, robh, conor+dt, mhklinux
Cc: devicetree, linux-hyperv, linux-kernel, ssengar, longli,
Naman Jain, hargar
This patch series adds support for reading the VMBus message
connection ID from DeviceTree. The connection-id determines which
hypervisor communication channel the guest should use to talk to
the VMBus host.
Changes in v5:
- Updated subject line and commit description to clarify what
connection ID is and why DeviceTree support is required
- Addressed reviewer feedback about zero handling and binding
constraints
- Revised binding description to clarify version-based selection
instead of using "defaults" language
- Fixed checkpatch warnings (indentation and alignment)
Changes in v4:
- Split the patch into two separate patches:
* DeviceTree bindings documentation
* Implementation changes
- Fixed warnings reported by checkpatch
Changes in v3:
- Added documentation for the new property in DeviceTree bindings
-
https://lore.kernel.org/all/6a92ca86-ad6b-4d49-af6e-1ed7651b8ab8@linux.microsoft.com/
Changes in v2:
- Rebased on hyperv-next branch as requested by maintainers
- Added details about the property name format in the commit message
-
https://lore.kernel.org/all/096edaf7-cc90-42b6-aff4-c5f088574e1e@linux.microsoft.com/
Changes in v1:
- Initial submission
-
https://lore.kernel.org/all/6acee4bf-cb04-43b9-9476-e8d811d26dfd@linux.microsoft.com/
Testing:
- Tested on Microsoft Hyper-V
- Verified with and without the DeviceTree property
- Confirmed proper fallback to version-based connection ID selection
- Validated binding documentation with dt_binding_check
Hardik Garg (2):
dt-bindings: microsoft: Add vmbus message-connection-id property
Drivers: hv: vmbus: retrieve connection-id from DeviceTree
.../devicetree/bindings/bus/microsoft,vmbus.yaml | 12 ++++++++++++
drivers/hv/connection.c | 7 +++++--
drivers/hv/vmbus_drv.c | 8 ++++++++
3 files changed, 25 insertions(+), 2 deletions(-)
--
2.34.1
^ permalink raw reply
* RE: [PATCH] mshv: Align huge page stride with guest mapping
From: Michael Kelley @ 2025-12-23 19:17 UTC (permalink / raw)
To: Stanislav Kinsburskii
Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
decui@microsoft.com, longli@microsoft.com,
linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <aUrCr5wBSTrGm-IM@skinsburskii.localdomain>
From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Tuesday, December 23, 2025 8:26 AM
>
> On Tue, Dec 23, 2025 at 03:51:22PM +0000, Michael Kelley wrote:
> > From: Michael Kelley Sent: Monday, December 22, 2025 10:25 AM
> > >
> > [snip]
> > >
> > > Separately, in looking at this, I spotted another potential problem with
> > > 2 Meg mappings that somewhat depends on hypervisor behavior that I'm
> > > not clear on. To create a new region, the user space VMM issues the
> > > MSHV_GET_GUEST_MEMORY ioctl, specifying the userspace address, the
> > > size, and the guest PFN. The only requirement on these values is that the
> > > userspace address and size be page aligned. But suppose a 4 Meg region is
> > > specified where the userspace address and the guest PFN have different
> > > offsets modulo 2 Meg. The userspace address range gets populated first,
> > > and may contain a 2 Meg large page. Then when mshv_chunk_stride()
> > > detects a 2 Meg aligned guest PFN so HVCALL_MAP_GPA_PAGES can be told
> > > to create a 2 Meg mapping for the guest, the corresponding system PFN in
> > > the page array may not be 2 Meg aligned. What does the hypervisor do in
> > > this case? It can't create a 2 Meg mapping, right? So does it silently fallback
> > > to creating 4K mappings, or does it return an error? Returning an error would
> > > seem to be problematic for movable pages because the error wouldn't
> > > occur until the guest VM is running and takes a range fault on the region.
> > > Silently falling back to creating 4K mappings has performance implications,
> > > though I guess it would work. My question is whether the
> > > MSHV_GET_GUEST_MEMORY ioctl should detect this case and return an
> > > error immediately.
> > >
> >
> > In thinking about this more, I can answer my own question about the
> > hypervisor behavior. When HVCALL_MAP_GPA_PAGES is set, the full
> > list of 4K system PFNs is not provided as an input to the hypercall, so
> > the hypervisor cannot silently fall back to 4K mappings. Assuming
> > sequential PFNs would be wrong, so it must return an error if the
> > alignment of a system PFN isn't on a 2 Meg boundary.
> >
> > For a pinned region, this error happens in mshv_region_map() as
> > called from mshv_prepare_pinned_region(), so will propagate back
> > to the ioctl. But the error happens only if pin_user_pages_fast()
> > allocates one or more 2 Meg pages. So creating a pinned region
> > where the guest PFN and userspace address have different offsets
> > modulo 2 Meg might or might not succeed.
> >
> > For a movable region, the error probably can't occur.
> > mshv_region_handle_gfn_fault() builds an aligned 2 Meg chunk
> > around the faulting guest PFN. mshv_region_range_fault() then
> > determines the corresponding userspace addr, which won't be on
> > a 2 Meg boundary, so the allocated memory won't contain a 2 Meg
> > page. With no 2 Meg pages, mshv_region_remap_pages() will
> > always do 4K mappings and will succeed. The downside is that a
> > movable region with a guest PFN and userspace address with
> > different offsets never gets any 2 Meg pages or mappings.
> >
> > My conclusion is the same -- such misalignment should not be
> > allowed when creating a region that has the potential to use 2 Meg
> > pages. Regions less than 2 Meg in size could be excluded from such
> > a requirement if there is benefit in doing so. It's possible to have
> > regions up to (but not including) 4 Meg where the alignment prevents
> > having a 2 Meg page, and those could also be excluded from the
> > requirement.
> >
>
> I'm not sure I understand the problem.
> There are three cases to consider:
> 1. Guest mapping, where page sizes are controlled by the guest.
> 2. Host mapping, where page sizes are controlled by the host.
And by "host", you mean specifically the Linux instance running in the
root partition. It hosts the VMM processes and creates the memory
regions for each guest.
> 3. Hypervisor mapping, where page sizes are controlled by the hypervisor.
>
> The first case is not relevant here and is included for completeness.
Agreed.
>
> The second and third cases (host and hypervisor) share the memory layout,
Right. More specifically, they are both operating on the same set of physical
memory pages, and hence "share" a set of what I've referred to as
"system PFNs" (to distinguish from guest PFNs, or GFNs).
> but it is up
> to each entity to decide which page sizes to use. For example, the host might map the
> proposed 4M region with only 4K pages, even if a 2M page is available in the middle.
Agreed.
> In this case, the host will map the memory as represented by 4K pages, but the hypervisor
> can still discover the 2M page in the middle and adjust its page tables to use a 2M page.
Yes, that's possible, but subject to significant requirements. A 2M page can be
used only if the underlying physical memory is a physically contiguous 2M chunk.
Furthermore, that contiguous 2M chunk must start on a physical 2M boundary,
and the virtual address to which it is being mapped must be on a 2M boundary.
In the case of the host, that virtual address is the user space address in the
user space process. In the case of the hypervisor, that "virtual address" is the
the location in guest physical address space; i.e., the guest PFN left-shifted 9
to be a guest physical address.
These requirements are from the physical processor and its requirements on
page table formats as specified by the hardware architecture. Whereas the
page table entry for a 4K page contains the entire PFN, the page table entry
for a 2M page omits the low order 9 bits of the PFN -- those bits must be zero,
which is equivalent to requiring that the PFN be on a 2M boundary. These
requirements apply to both host and hypervisor mappings.
When MSHV code in the host creates a new pinned region via the ioctl,
MSHV code first allocates memory for the region using pin_user_pages_fast(),
which returns the system PFN for each page of physical memory that is
allocated. If the host, at its discretion, allocates a 2M page, then a series
of 512 sequential 4K PFNs is returned for that 2M page, and the first of
the 512 sequential PFNs must have its low order 9 bits be zero.
Then the MSHV ioctl makes the HVCALL_MAP_GPA_PAGES hypercall for
the hypervisor to map the allocated memory into the guest physical
address space at a particular guest PFN. If the allocated memory contains
a 2M page, mshv_chunk_stride() will see a folio order of 9 for the 2M page,
causing the HV_MAP_GPA_LARGE_PAGE flag to be set, which requests that
the hypervisor do that mapping as a 2M large page. The hypercall does not
have the option of dropping back to 4K page mappings in this case. If
the 2M alignment of the system PFN is different from the 2M alignment
of the target guest PFN, it's not possible to create the mapping and the
hypercall fails.
The core problem is that the same 2M of physical memory wants to be
mapped by the host as a 2M page and by the hypervisor as a 2M page.
That can't be done unless the host alignment (in the VMM virtual address
space) and the guest physical address (i.e., the target guest PFN) alignment
match and are both on 2M boundaries.
Movable regions behave a bit differently because the memory for the
region is not allocated on the host "up front" when the region is created.
The memory is faulted in as the guest runs, and the vagaries of the current
MSHV in Linux code are such that 2M pages are never created on the host
if the alignments don't match. HV_MAP_GPA_LARGE_PAGE is never passed
to the HVCALL_MAP_GPA_PAGES hypercall, so the hypervisor just does 4K
mappings, which works even with the misalignment.
>
> This adjustment happens at runtime. Could this be the missing detail here?
Adjustments at runtime are a different topic from the issue I'm raising,
though eventually there's some relationship. My issue occurs in the
creation of a new region, and the setting up of the initial hypervisor
mapping. I haven't thought through the details of adjustments at runtime.
My usual caveats apply -- this is all "thought experiment". If I had the
means do some runtime testing to confirm, I would. It's possible the
hypervisor is playing some trick I haven't envisioned, but I'm skeptical of
that given the basics of how physical processors work with page tables.
Michael
^ permalink raw reply
* Re: [PATCH] mshv: Align huge page stride with guest mapping
From: Stanislav Kinsburskii @ 2025-12-23 16:27 UTC (permalink / raw)
To: Michael Kelley
Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
decui@microsoft.com, longli@microsoft.com,
linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB41578A17A4DADD9276392298D4B4A@SN6PR02MB4157.namprd02.prod.outlook.com>
On Mon, Dec 22, 2025 at 06:25:02PM +0000, Michael Kelley wrote:
> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Friday, December 19, 2025 2:54 PM
> >
> > On Thu, Dec 18, 2025 at 07:41:24PM +0000, Michael Kelley wrote:
> > > From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Tuesday,
> > December 16, 2025 4:41 PM
> > > >
> > > > Ensure that a stride larger than 1 (huge page) is only used when both
> > > > the guest frame number (gfn) and the operation size (page_count) are
> > > > aligned to the huge page size (PTRS_PER_PMD). This matches the
> > > > hypervisor requirement that map/unmap operations for huge pages must be
> > > > guest-aligned and cover a full huge page.
> > > >
> > > > Add mshv_chunk_stride() to encapsulate this alignment and page-order
> > > > validation, and plumb a huge_page flag into the region chunk handlers.
> > > > This prevents issuing large-page map/unmap/share operations that the
> > > > hypervisor would reject due to misaligned guest mappings.
> > >
> > > This code looks good to me on the surface. But I can only make an educated
> > > guess as to the hypervisor behavior in certain situations, and if my guess is
> > > correct there's still a flaw in one case.
> > >
> > > Consider the madvise() DONTNEED experiment that I previously called out. [1]
> > > I surmise that the intent of this patch is to make that case work correctly.
> > > When the .invalidate callback is made for the 32 Kbyte range embedded in
> > > a previously mapped 2 Meg page, this new code detects that case. It calls the
> > > hypervisor to remap the 32 Kbyte range for no access, and clears the 8
> > > corresponding entries in the struct page array attached to the mshv region. The
> > > call to the hypervisor is made *without* the HV_MAP_GPA_LARGE_PAGE flag.
> > > Since the mapping was originally done *with* the HV_MAP_GPA_LARGE_PAGE
> > > flag, my guess is that the hypervisor is smart enough to handle this case by
> > > splitting the 2 Meg mapping it created, setting the 32 Kbyte range to no access,
> > > and returning "success". If my guess is correct, there's no problem here.
> > >
> > > But then there's a second .invalidate callback for the entire 2 Meg page. Here's
> > > the call stack:
> > >
> > > [ 194.259337] dump_stack+0x14/0x20
> > > [ 194.259339] mhktest_invalidate+0x2a/0x40 [my dummy invalidate callback]
> > > [ 194.259342] __mmu_notifier_invalidate_range_start+0x1f4/0x250
> > > [ 194.259347] __split_huge_pmd+0x14f/0x170
> > > [ 194.259349] unmap_page_range+0x104d/0x1a00
> > > [ 194.259358] unmap_single_vma+0x7d/0xc0
> > > [ 194.259360] zap_page_range_single_batched+0xe0/0x1c0
> > > [ 194.259363] madvise_vma_behavior+0xb01/0xc00
> > > [ 194.259366] madvise_do_behavior.part.0+0x3cd/0x4a0
> > > [ 194.259375] do_madvise+0xc7/0x170
> > > [ 194.259380] __x64_sys_madvise+0x2f/0x40
> > > [ 194.259382] x64_sys_call+0x1d77/0x21b0
> > > [ 194.259385] do_syscall_64+0x56/0x640
> > > [ 194.259388] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > >
> > > In __split_huge_pmd(), the .invalidate callback is made *before* the 2 Meg
> > > page is actually split by the root partition. So mshv_chunk_stride() returns "9"
> > > for the stride, and the hypervisor is called with HV_MAP_GPA_LARGE_PAGE
> > > set. My guess is that the hypervisor returns an error because it has already
> > > split the mapping. The whole point of this patch set is to avoid passing
> > > HV_MAP_GPA_LARGE_PAGE to the hypervisor when the hypervisor mapping
> > > is not a large page mapping, but this looks like a case where it still happens.
> > >
> > > My concern is solely from looking at the code and thinking about the problem,
> > > as I don't have an environment where I can test root partition interactions
> > > with the hypervisor. So maybe I'm missing something. Lemme know what you
> > > think .....
> > >
> >
> > Yeah, I see your point: according to this stack, once a part of the page
> > is invalidated, the folio order remains the same until another invocation
> > of the same callback - this time for the whole huge
> > page - is made. Thus, the stride is still reported as the huge page size,
> > even though a part of the page has already been unmapped.
> >
> > This indeed looks like a flaw in the current approach, but it's actually
> > not. The reason is that upon the invalidation callback, the driver
> > simply remaps the whole huge page with no access (in this case, the PFNs
> > provided to the hypervisor are zero), and it's fine as the hypervisor
> > simply drops all the pages from the previous mapping and marks this page
> > as inaccessible. The only check the hypervisor makes in this case is
> > that both the GFN and mapping size are huge page aligned (which they are
> > in this case).
> >
> > I hope this clarifies the situation. Please let me know if you have any
> > other questions.
>
> Thanks. Yes, this clarifies. My guess about the hypervisor behavior was wrong.
> Based on what you've said about what the hypervisor does, and further studying
> MSHV code, here's my recap of the HV_MAP_GPA_LARGE_PAGE flag:
>
> 1. The hypervisor uses the flag to determine the granularity (4K or 2M) of the
> mapping HVCALL_MAP_GPA_PAGES or HVCALL_UNMAP_GPA_PAGES will
> create/remove. As such, the hypercall "repcount" is in this granularity. GFNs,
> such as the target_gpa_base input parameter and GFNs in the pfn_array, are
> always 4K GFNs, but if the flag is set, a GFN is treated as the first 4K GFN in
> a contiguous 2M range. If the flag is set, the target_gpa_base GFN must be
> 2M aligned.
>
> 2. The hypervisor doesn't care whether any existing mapping is 4K or 2M. It
> always removes an existing mapping, including splitting any 2M mappings if
> necessary. Then if the operation is to create/re-create a mapping, it creates
> an appropriate new mapping.
>
> My error was in thinking that the flag had to match any existing mapping.
> But the behavior you've clarified is certainly better. It handles the vagaries
> of the Linux "mm" subsystem, which in one case in my original experiment
> (madvise) invalidates the small range, then the 2M range, but the other
> case (mprotect) invalidates the 2M range, then the small range.
>
> Since there's no documentation for these root partition hypercalls, it sure
> would be nice if this info could be captured in code comments for some
> future developer to benefit from. If that's not something you want to
> worry about, I could submit a patch later to add the code comments
> (subject to your review, of course).
>
Please, feel free ti add the comments you see fit. I think you'll do it
better as you have a better understanding of what needs to be
documented.
Thanks,
Stanislav
> Separately, in looking at this, I spotted another potential problem with
> 2 Meg mappings that somewhat depends on hypervisor behavior that I'm
> not clear on. To create a new region, the user space VMM issues the
> MSHV_GET_GUEST_MEMORY ioctl, specifying the userspace address, the
> size, and the guest PFN. The only requirement on these values is that the
> userspace address and size be page aligned. But suppose a 4 Meg region is
> specified where the userspace address and the guest PFN have different
> offsets modulo 2 Meg. The userspace address range gets populated first,
> and may contain a 2 Meg large page. Then when mshv_chunk_stride()
> detects a 2 Meg aligned guest PFN so HVCALL_MAP_GPA_PAGES can be told
> to create a 2 Meg mapping for the guest, the corresponding system PFN in
> the page array may not be 2 Meg aligned. What does the hypervisor do in
> this case? It can't create a 2 Meg mapping, right? So does it silently fallback
> to creating 4K mappings, or does it return an error? Returning an error would
> seem to be problematic for movable pages because the error wouldn't
> occur until the guest VM is running and takes a range fault on the region.
> Silently falling back to creating 4K mappings has performance implications,
> though I guess it would work. My question is whether the
> MSHV_GET_GUEST_MEMORY ioctl should detect this case and return an
> error immediately.
>
> Michael
>
> > >
> > > [1] https://lore.kernel.org/linux-hyperv/SN6PR02MB4157978DFAA6C2584D0678E1D4A1A@SN6PR02MB4157.namprd02.prod.outlook.com/
^ permalink raw reply
* Re: [PATCH] mshv: Align huge page stride with guest mapping
From: Stanislav Kinsburskii @ 2025-12-23 16:26 UTC (permalink / raw)
To: Michael Kelley
Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
decui@microsoft.com, longli@microsoft.com,
linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB4157AAFDD8BD5BDCD2D3DB99D4B5A@SN6PR02MB4157.namprd02.prod.outlook.com>
On Tue, Dec 23, 2025 at 03:51:22PM +0000, Michael Kelley wrote:
> From: Michael Kelley Sent: Monday, December 22, 2025 10:25 AM
> >
> [snip]
> >
> > Separately, in looking at this, I spotted another potential problem with
> > 2 Meg mappings that somewhat depends on hypervisor behavior that I'm
> > not clear on. To create a new region, the user space VMM issues the
> > MSHV_GET_GUEST_MEMORY ioctl, specifying the userspace address, the
> > size, and the guest PFN. The only requirement on these values is that the
> > userspace address and size be page aligned. But suppose a 4 Meg region is
> > specified where the userspace address and the guest PFN have different
> > offsets modulo 2 Meg. The userspace address range gets populated first,
> > and may contain a 2 Meg large page. Then when mshv_chunk_stride()
> > detects a 2 Meg aligned guest PFN so HVCALL_MAP_GPA_PAGES can be told
> > to create a 2 Meg mapping for the guest, the corresponding system PFN in
> > the page array may not be 2 Meg aligned. What does the hypervisor do in
> > this case? It can't create a 2 Meg mapping, right? So does it silently fallback
> > to creating 4K mappings, or does it return an error? Returning an error would
> > seem to be problematic for movable pages because the error wouldn't
> > occur until the guest VM is running and takes a range fault on the region.
> > Silently falling back to creating 4K mappings has performance implications,
> > though I guess it would work. My question is whether the
> > MSHV_GET_GUEST_MEMORY ioctl should detect this case and return an
> > error immediately.
> >
>
> In thinking about this more, I can answer my own question about the
> hypervisor behavior. When HVCALL_MAP_GPA_PAGES is set, the full
> list of 4K system PFNs is not provided as an input to the hypercall, so
> the hypervisor cannot silently fall back to 4K mappings. Assuming
> sequential PFNs would be wrong, so it must return an error if the
> alignment of a system PFN isn't on a 2 Meg boundary.
>
> For a pinned region, this error happens in mshv_region_map() as
> called from mshv_prepare_pinned_region(), so will propagate back
> to the ioctl. But the error happens only if pin_user_pages_fast()
> allocates one or more 2 Meg pages. So creating a pinned region
> where the guest PFN and userspace address have different offsets
> modulo 2 Meg might or might not succeed.
>
> For a movable region, the error probably can't occur.
> mshv_region_handle_gfn_fault() builds an aligned 2 Meg chunk
> around the faulting guest PFN. mshv_region_range_fault() then
> determines the corresponding userspace addr, which won't be on
> a 2 Meg boundary, so the allocated memory won't contain a 2 Meg
> page. With no 2 Meg pages, mshv_region_remap_pages() will
> always do 4K mappings and will succeed. The downside is that a
> movable region with a guest PFN and userspace address with
> different offsets never gets any 2 Meg pages or mappings.
>
> My conclusion is the same -- such misalignment should not be
> allowed when creating a region that has the potential to use 2 Meg
> pages. Regions less than 2 Meg in size could be excluded from such
> a requirement if there is benefit in doing so. It's possible to have
> regions up to (but not including) 4 Meg where the alignment prevents
> having a 2 Meg page, and those could also be excluded from the
> requirement.
>
I'm not sure I understand the problem.
There are three cases to consider:
1. Guest mapping, where page sizes are controlled by the guest.
2. Host mapping, where page sizes are controlled by the host.
3. Hypervisor mapping, where page sizes are controlled by the hypervisor.
The first case is not relevant here and is included for completeness.
The second and third cases (host and hypervisor) share the memory layout, but it is up to each entity to decide which page sizes to use. For example, the host might map the proposed 4M region with only 4K pages, even if a 2M page is available in the middle. In this case, the host will map the memory as represented by 4K pages, but the hypervisor can still discover the 2M page in the middle and adjust its page tables to use a 2M page.
This adjustment happens at runtime. Could this be the missing detail here?
Thanks,
Stanislav
> Michael
^ permalink raw reply
* RE: [PATCH] mshv: Align huge page stride with guest mapping
From: Michael Kelley @ 2025-12-23 15:51 UTC (permalink / raw)
To: Stanislav Kinsburskii
Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
decui@microsoft.com, longli@microsoft.com,
linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB41578A17A4DADD9276392298D4B4A@SN6PR02MB4157.namprd02.prod.outlook.com>
From: Michael Kelley Sent: Monday, December 22, 2025 10:25 AM
>
[snip]
>
> Separately, in looking at this, I spotted another potential problem with
> 2 Meg mappings that somewhat depends on hypervisor behavior that I'm
> not clear on. To create a new region, the user space VMM issues the
> MSHV_GET_GUEST_MEMORY ioctl, specifying the userspace address, the
> size, and the guest PFN. The only requirement on these values is that the
> userspace address and size be page aligned. But suppose a 4 Meg region is
> specified where the userspace address and the guest PFN have different
> offsets modulo 2 Meg. The userspace address range gets populated first,
> and may contain a 2 Meg large page. Then when mshv_chunk_stride()
> detects a 2 Meg aligned guest PFN so HVCALL_MAP_GPA_PAGES can be told
> to create a 2 Meg mapping for the guest, the corresponding system PFN in
> the page array may not be 2 Meg aligned. What does the hypervisor do in
> this case? It can't create a 2 Meg mapping, right? So does it silently fallback
> to creating 4K mappings, or does it return an error? Returning an error would
> seem to be problematic for movable pages because the error wouldn't
> occur until the guest VM is running and takes a range fault on the region.
> Silently falling back to creating 4K mappings has performance implications,
> though I guess it would work. My question is whether the
> MSHV_GET_GUEST_MEMORY ioctl should detect this case and return an
> error immediately.
>
In thinking about this more, I can answer my own question about the
hypervisor behavior. When HVCALL_MAP_GPA_PAGES is set, the full
list of 4K system PFNs is not provided as an input to the hypercall, so
the hypervisor cannot silently fall back to 4K mappings. Assuming
sequential PFNs would be wrong, so it must return an error if the
alignment of a system PFN isn't on a 2 Meg boundary.
For a pinned region, this error happens in mshv_region_map() as
called from mshv_prepare_pinned_region(), so will propagate back
to the ioctl. But the error happens only if pin_user_pages_fast()
allocates one or more 2 Meg pages. So creating a pinned region
where the guest PFN and userspace address have different offsets
modulo 2 Meg might or might not succeed.
For a movable region, the error probably can't occur.
mshv_region_handle_gfn_fault() builds an aligned 2 Meg chunk
around the faulting guest PFN. mshv_region_range_fault() then
determines the corresponding userspace addr, which won't be on
a 2 Meg boundary, so the allocated memory won't contain a 2 Meg
page. With no 2 Meg pages, mshv_region_remap_pages() will
always do 4K mappings and will succeed. The downside is that a
movable region with a guest PFN and userspace address with
different offsets never gets any 2 Meg pages or mappings.
My conclusion is the same -- such misalignment should not be
allowed when creating a region that has the potential to use 2 Meg
pages. Regions less than 2 Meg in size could be excluded from such
a requirement if there is benefit in doing so. It's possible to have
regions up to (but not including) 4 Meg where the alignment prevents
having a 2 Meg page, and those could also be excluded from the
requirement.
Michael
^ permalink raw reply
* RE: [PATCH] mshv: Align huge page stride with guest mapping
From: Michael Kelley @ 2025-12-22 18:25 UTC (permalink / raw)
To: Stanislav Kinsburskii
Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
decui@microsoft.com, longli@microsoft.com,
linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <aUXXdjMyZ5swiCI2@skinsburskii.localdomain>
From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Friday, December 19, 2025 2:54 PM
>
> On Thu, Dec 18, 2025 at 07:41:24PM +0000, Michael Kelley wrote:
> > From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Tuesday,
> December 16, 2025 4:41 PM
> > >
> > > Ensure that a stride larger than 1 (huge page) is only used when both
> > > the guest frame number (gfn) and the operation size (page_count) are
> > > aligned to the huge page size (PTRS_PER_PMD). This matches the
> > > hypervisor requirement that map/unmap operations for huge pages must be
> > > guest-aligned and cover a full huge page.
> > >
> > > Add mshv_chunk_stride() to encapsulate this alignment and page-order
> > > validation, and plumb a huge_page flag into the region chunk handlers.
> > > This prevents issuing large-page map/unmap/share operations that the
> > > hypervisor would reject due to misaligned guest mappings.
> >
> > This code looks good to me on the surface. But I can only make an educated
> > guess as to the hypervisor behavior in certain situations, and if my guess is
> > correct there's still a flaw in one case.
> >
> > Consider the madvise() DONTNEED experiment that I previously called out. [1]
> > I surmise that the intent of this patch is to make that case work correctly.
> > When the .invalidate callback is made for the 32 Kbyte range embedded in
> > a previously mapped 2 Meg page, this new code detects that case. It calls the
> > hypervisor to remap the 32 Kbyte range for no access, and clears the 8
> > corresponding entries in the struct page array attached to the mshv region. The
> > call to the hypervisor is made *without* the HV_MAP_GPA_LARGE_PAGE flag.
> > Since the mapping was originally done *with* the HV_MAP_GPA_LARGE_PAGE
> > flag, my guess is that the hypervisor is smart enough to handle this case by
> > splitting the 2 Meg mapping it created, setting the 32 Kbyte range to no access,
> > and returning "success". If my guess is correct, there's no problem here.
> >
> > But then there's a second .invalidate callback for the entire 2 Meg page. Here's
> > the call stack:
> >
> > [ 194.259337] dump_stack+0x14/0x20
> > [ 194.259339] mhktest_invalidate+0x2a/0x40 [my dummy invalidate callback]
> > [ 194.259342] __mmu_notifier_invalidate_range_start+0x1f4/0x250
> > [ 194.259347] __split_huge_pmd+0x14f/0x170
> > [ 194.259349] unmap_page_range+0x104d/0x1a00
> > [ 194.259358] unmap_single_vma+0x7d/0xc0
> > [ 194.259360] zap_page_range_single_batched+0xe0/0x1c0
> > [ 194.259363] madvise_vma_behavior+0xb01/0xc00
> > [ 194.259366] madvise_do_behavior.part.0+0x3cd/0x4a0
> > [ 194.259375] do_madvise+0xc7/0x170
> > [ 194.259380] __x64_sys_madvise+0x2f/0x40
> > [ 194.259382] x64_sys_call+0x1d77/0x21b0
> > [ 194.259385] do_syscall_64+0x56/0x640
> > [ 194.259388] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >
> > In __split_huge_pmd(), the .invalidate callback is made *before* the 2 Meg
> > page is actually split by the root partition. So mshv_chunk_stride() returns "9"
> > for the stride, and the hypervisor is called with HV_MAP_GPA_LARGE_PAGE
> > set. My guess is that the hypervisor returns an error because it has already
> > split the mapping. The whole point of this patch set is to avoid passing
> > HV_MAP_GPA_LARGE_PAGE to the hypervisor when the hypervisor mapping
> > is not a large page mapping, but this looks like a case where it still happens.
> >
> > My concern is solely from looking at the code and thinking about the problem,
> > as I don't have an environment where I can test root partition interactions
> > with the hypervisor. So maybe I'm missing something. Lemme know what you
> > think .....
> >
>
> Yeah, I see your point: according to this stack, once a part of the page
> is invalidated, the folio order remains the same until another invocation
> of the same callback - this time for the whole huge
> page - is made. Thus, the stride is still reported as the huge page size,
> even though a part of the page has already been unmapped.
>
> This indeed looks like a flaw in the current approach, but it's actually
> not. The reason is that upon the invalidation callback, the driver
> simply remaps the whole huge page with no access (in this case, the PFNs
> provided to the hypervisor are zero), and it's fine as the hypervisor
> simply drops all the pages from the previous mapping and marks this page
> as inaccessible. The only check the hypervisor makes in this case is
> that both the GFN and mapping size are huge page aligned (which they are
> in this case).
>
> I hope this clarifies the situation. Please let me know if you have any
> other questions.
Thanks. Yes, this clarifies. My guess about the hypervisor behavior was wrong.
Based on what you've said about what the hypervisor does, and further studying
MSHV code, here's my recap of the HV_MAP_GPA_LARGE_PAGE flag:
1. The hypervisor uses the flag to determine the granularity (4K or 2M) of the
mapping HVCALL_MAP_GPA_PAGES or HVCALL_UNMAP_GPA_PAGES will
create/remove. As such, the hypercall "repcount" is in this granularity. GFNs,
such as the target_gpa_base input parameter and GFNs in the pfn_array, are
always 4K GFNs, but if the flag is set, a GFN is treated as the first 4K GFN in
a contiguous 2M range. If the flag is set, the target_gpa_base GFN must be
2M aligned.
2. The hypervisor doesn't care whether any existing mapping is 4K or 2M. It
always removes an existing mapping, including splitting any 2M mappings if
necessary. Then if the operation is to create/re-create a mapping, it creates
an appropriate new mapping.
My error was in thinking that the flag had to match any existing mapping.
But the behavior you've clarified is certainly better. It handles the vagaries
of the Linux "mm" subsystem, which in one case in my original experiment
(madvise) invalidates the small range, then the 2M range, but the other
case (mprotect) invalidates the 2M range, then the small range.
Since there's no documentation for these root partition hypercalls, it sure
would be nice if this info could be captured in code comments for some
future developer to benefit from. If that's not something you want to
worry about, I could submit a patch later to add the code comments
(subject to your review, of course).
Separately, in looking at this, I spotted another potential problem with
2 Meg mappings that somewhat depends on hypervisor behavior that I'm
not clear on. To create a new region, the user space VMM issues the
MSHV_GET_GUEST_MEMORY ioctl, specifying the userspace address, the
size, and the guest PFN. The only requirement on these values is that the
userspace address and size be page aligned. But suppose a 4 Meg region is
specified where the userspace address and the guest PFN have different
offsets modulo 2 Meg. The userspace address range gets populated first,
and may contain a 2 Meg large page. Then when mshv_chunk_stride()
detects a 2 Meg aligned guest PFN so HVCALL_MAP_GPA_PAGES can be told
to create a 2 Meg mapping for the guest, the corresponding system PFN in
the page array may not be 2 Meg aligned. What does the hypervisor do in
this case? It can't create a 2 Meg mapping, right? So does it silently fallback
to creating 4K mappings, or does it return an error? Returning an error would
seem to be problematic for movable pages because the error wouldn't
occur until the guest VM is running and takes a range fault on the region.
Silently falling back to creating 4K mappings has performance implications,
though I guess it would work. My question is whether the
MSHV_GET_GUEST_MEMORY ioctl should detect this case and return an
error immediately.
Michael
> >
> > [1] https://lore.kernel.org/linux-hyperv/SN6PR02MB4157978DFAA6C2584D0678E1D4A1A@SN6PR02MB4157.namprd02.prod.outlook.com/
^ permalink raw reply
* Re: [PATCH v3 9/9] efi: libstub: Simplify interfaces for primary_display
From: Thomas Zimmermann @ 2025-12-21 16:18 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: javierm, arnd, richard.lyu, helgaas, x86, linux-arm-kernel,
linux-kernel, linux-efi, loongarch, linux-riscv, dri-devel,
linux-hyperv, linux-pci, linux-fbdev
In-Reply-To: <CAMj1kXFeBS7O5A-CPds3UfFnjegGTpVsuF7VznBc-zZ+gjygtw@mail.gmail.com>
Hi
Am 16.12.25 um 14:23 schrieb Ard Biesheuvel:
> Hi Thomas
>
> On Wed, 26 Nov 2025 at 17:09, Thomas Zimmermann <tzimmermann@suse.de> wrote:
>> Rename alloc_primary_display() and __alloc_primary_display(), clarify
>> free semantics to make interfaces easier to understand.
>>
>> Rename alloc_primary_display() to lookup_primary_display() as it
>> does not necessarily allocate. Then rename __alloc_primary_display()
>> to the new alloc_primary_display(). The helper belongs to
>> free_primary_display), so it should be named without underscores.
>>
>> The lookup helper does not necessarily allocate, so the output
>> parameter needs_free to indicate when free should be called.
> I don't understand why we need this. Whether or not the helper
> allocates is a compile time decision, and in builds where it doesn't,
> the free helper doesn't do anything.
>
> I'm all for making things simpler, but I don't think this patch
> achieves that tbh.
>
> I've queued up this series now up until this patch - once we converge
> on the simplification, I'm happy to apply it on top.
If you don't want this patch, just leave it out then. Coming from
another subsystem, I found the current logic and naming confusing THB.
Best regards
Thomas
>
> Thanks,
>
>
>
>> Pass
>> an argument through the calls to track this state. Put the free
>> handling into release_primary_display() for simplificy.
>>
>> Also move the comment fro primary_display.c to efi-stub-entry.c,
>> where it now describes lookup_primary_display().
>>
>> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
>> ---
>> drivers/firmware/efi/libstub/efi-stub-entry.c | 23 +++++++++++++++++--
>> drivers/firmware/efi/libstub/efi-stub.c | 22 ++++++++++++------
>> drivers/firmware/efi/libstub/efistub.h | 2 +-
>> .../firmware/efi/libstub/primary_display.c | 17 +-------------
>> drivers/firmware/efi/libstub/zboot.c | 6 +++--
>> 5 files changed, 42 insertions(+), 28 deletions(-)
>>
>> diff --git a/drivers/firmware/efi/libstub/efi-stub-entry.c b/drivers/firmware/efi/libstub/efi-stub-entry.c
>> index aa85e910fe59..3077b51fe0b2 100644
>> --- a/drivers/firmware/efi/libstub/efi-stub-entry.c
>> +++ b/drivers/firmware/efi/libstub/efi-stub-entry.c
>> @@ -14,10 +14,29 @@ static void *kernel_image_addr(void *addr)
>> return addr + kernel_image_offset;
>> }
>>
>> -struct sysfb_display_info *alloc_primary_display(void)
>> +/*
>> + * There are two ways of populating the core kernel's sysfb_primary_display
>> + * via the stub:
>> + *
>> + * - using a configuration table, which relies on the EFI init code to
>> + * locate the table and copy the contents; or
>> + *
>> + * - by linking directly to the core kernel's copy of the global symbol.
>> + *
>> + * The latter is preferred because it makes the EFIFB earlycon available very
>> + * early, but it only works if the EFI stub is part of the core kernel image
>> + * itself. The zboot decompressor can only use the configuration table
>> + * approach.
>> + */
>> +
>> +struct sysfb_display_info *lookup_primary_display(bool *needs_free)
>> {
>> + *needs_free = true;
>> +
>> if (IS_ENABLED(CONFIG_ARM))
>> - return __alloc_primary_display();
>> + return alloc_primary_display();
>> +
>> + *needs_free = false;
>>
>> if (IS_ENABLED(CONFIG_X86) ||
>> IS_ENABLED(CONFIG_EFI_EARLYCON) ||
>> diff --git a/drivers/firmware/efi/libstub/efi-stub.c b/drivers/firmware/efi/libstub/efi-stub.c
>> index 42d6073bcd06..dc545f62c62b 100644
>> --- a/drivers/firmware/efi/libstub/efi-stub.c
>> +++ b/drivers/firmware/efi/libstub/efi-stub.c
>> @@ -51,14 +51,14 @@ static bool flat_va_mapping = (EFI_RT_VIRTUAL_OFFSET != 0);
>> void __weak free_primary_display(struct sysfb_display_info *dpy)
>> { }
>>
>> -static struct sysfb_display_info *setup_primary_display(void)
>> +static struct sysfb_display_info *setup_primary_display(bool *dpy_needs_free)
>> {
>> struct sysfb_display_info *dpy;
>> struct screen_info *screen = NULL;
>> struct edid_info *edid = NULL;
>> efi_status_t status;
>>
>> - dpy = alloc_primary_display();
>> + dpy = lookup_primary_display(dpy_needs_free);
>> if (!dpy)
>> return NULL;
>> screen = &dpy->screen;
>> @@ -68,15 +68,22 @@ static struct sysfb_display_info *setup_primary_display(void)
>>
>> status = efi_setup_graphics(screen, edid);
>> if (status != EFI_SUCCESS)
>> - goto err_free_primary_display;
>> + goto err___free_primary_display;
>>
>> return dpy;
>>
>> -err_free_primary_display:
>> - free_primary_display(dpy);
>> +err___free_primary_display:
>> + if (*dpy_needs_free)
>> + free_primary_display(dpy);
>> return NULL;
>> }
>>
>> +static void release_primary_display(struct sysfb_display_info *dpy, bool dpy_needs_free)
>> +{
>> + if (dpy && dpy_needs_free)
>> + free_primary_display(dpy);
>> +}
>> +
>> static void install_memreserve_table(void)
>> {
>> struct linux_efi_memreserve *rsv;
>> @@ -156,13 +163,14 @@ efi_status_t efi_stub_common(efi_handle_t handle,
>> char *cmdline_ptr)
>> {
>> struct sysfb_display_info *dpy;
>> + bool dpy_needs_free;
>> efi_status_t status;
>>
>> status = check_platform_features();
>> if (status != EFI_SUCCESS)
>> return status;
>>
>> - dpy = setup_primary_display();
>> + dpy = setup_primary_display(&dpy_needs_free);
>>
>> efi_retrieve_eventlog();
>>
>> @@ -182,7 +190,7 @@ efi_status_t efi_stub_common(efi_handle_t handle,
>>
>> status = efi_boot_kernel(handle, image, image_addr, cmdline_ptr);
>>
>> - free_primary_display(dpy);
>> + release_primary_display(dpy, dpy_needs_free);
>>
>> return status;
>> }
>> diff --git a/drivers/firmware/efi/libstub/efistub.h b/drivers/firmware/efi/libstub/efistub.h
>> index 979a21818cc1..1503ffb82903 100644
>> --- a/drivers/firmware/efi/libstub/efistub.h
>> +++ b/drivers/firmware/efi/libstub/efistub.h
>> @@ -1176,8 +1176,8 @@ efi_enable_reset_attack_mitigation(void) { }
>>
>> void efi_retrieve_eventlog(void);
>>
>> +struct sysfb_display_info *lookup_primary_display(bool *needs_free);
>> struct sysfb_display_info *alloc_primary_display(void);
>> -struct sysfb_display_info *__alloc_primary_display(void);
>> void free_primary_display(struct sysfb_display_info *dpy);
>>
>> void efi_cache_sync_image(unsigned long image_base,
>> diff --git a/drivers/firmware/efi/libstub/primary_display.c b/drivers/firmware/efi/libstub/primary_display.c
>> index cdaebab26514..34c54ac1e02a 100644
>> --- a/drivers/firmware/efi/libstub/primary_display.c
>> +++ b/drivers/firmware/efi/libstub/primary_display.c
>> @@ -7,24 +7,9 @@
>>
>> #include "efistub.h"
>>
>> -/*
>> - * There are two ways of populating the core kernel's sysfb_primary_display
>> - * via the stub:
>> - *
>> - * - using a configuration table, which relies on the EFI init code to
>> - * locate the table and copy the contents; or
>> - *
>> - * - by linking directly to the core kernel's copy of the global symbol.
>> - *
>> - * The latter is preferred because it makes the EFIFB earlycon available very
>> - * early, but it only works if the EFI stub is part of the core kernel image
>> - * itself. The zboot decompressor can only use the configuration table
>> - * approach.
>> - */
>> -
>> static efi_guid_t primary_display_guid = LINUX_EFI_PRIMARY_DISPLAY_TABLE_GUID;
>>
>> -struct sysfb_display_info *__alloc_primary_display(void)
>> +struct sysfb_display_info *alloc_primary_display(void)
>> {
>> struct sysfb_display_info *dpy;
>> efi_status_t status;
>> diff --git a/drivers/firmware/efi/libstub/zboot.c b/drivers/firmware/efi/libstub/zboot.c
>> index 4b76f74c56da..c1fd1fdbcb08 100644
>> --- a/drivers/firmware/efi/libstub/zboot.c
>> +++ b/drivers/firmware/efi/libstub/zboot.c
>> @@ -26,9 +26,11 @@ void __weak efi_cache_sync_image(unsigned long image_base,
>> // executable code loaded into memory to be safe for execution.
>> }
>>
>> -struct sysfb_display_info *alloc_primary_display(void)
>> +struct sysfb_display_info *lookup_primary_display(bool *needs_free)
>> {
>> - return __alloc_primary_display();
>> + *needs_free = true;
>> +
>> + return alloc_primary_display();
>> }
>>
>> asmlinkage efi_status_t __efiapi
>> --
>> 2.51.1
>>
--
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Frankenstr. 146, 90461 Nürnberg, Germany, www.suse.com
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich, (HRB 36809, AG Nürnberg)
^ permalink raw reply
* Re: [PATCH] mshv: Align huge page stride with guest mapping
From: Stanislav Kinsburskii @ 2025-12-19 22:53 UTC (permalink / raw)
To: Michael Kelley
Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
decui@microsoft.com, longli@microsoft.com,
linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB4157D69A4C08B0A4FE01F9FED4A8A@SN6PR02MB4157.namprd02.prod.outlook.com>
On Thu, Dec 18, 2025 at 07:41:24PM +0000, Michael Kelley wrote:
> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Tuesday, December 16, 2025 4:41 PM
> >
> > Ensure that a stride larger than 1 (huge page) is only used when both
> > the guest frame number (gfn) and the operation size (page_count) are
> > aligned to the huge page size (PTRS_PER_PMD). This matches the
> > hypervisor requirement that map/unmap operations for huge pages must be
> > guest-aligned and cover a full huge page.
> >
> > Add mshv_chunk_stride() to encapsulate this alignment and page-order
> > validation, and plumb a huge_page flag into the region chunk handlers.
> > This prevents issuing large-page map/unmap/share operations that the
> > hypervisor would reject due to misaligned guest mappings.
>
> This code looks good to me on the surface. But I can only make an educated
> guess as to the hypervisor behavior in certain situations, and if my guess is
> correct there's still a flaw in one case.
>
> Consider the madvise() DONTNEED experiment that I previously called out. [1]
> I surmise that the intent of this patch is to make that case work correctly.
> When the .invalidate callback is made for the 32 Kbyte range embedded in
> a previously mapped 2 Meg page, this new code detects that case. It calls the
> hypervisor to remap the 32 Kbyte range for no access, and clears the 8
> corresponding entries in the struct page array attached to the mshv region. The
> call to the hypervisor is made *without* the HV_MAP_GPA_LARGE_PAGE flag.
> Since the mapping was originally done *with* the HV_MAP_GPA_LARGE_PAGE
> flag, my guess is that the hypervisor is smart enough to handle this case by
> splitting the 2 Meg mapping it created, setting the 32 Kbyte range to no access,
> and returning "success". If my guess is correct, there's no problem here.
>
> But then there's a second .invalidate callback for the entire 2 Meg page. Here's
> the call stack:
>
> [ 194.259337] dump_stack+0x14/0x20
> [ 194.259339] mhktest_invalidate+0x2a/0x40 [my dummy invalidate callback]
> [ 194.259342] __mmu_notifier_invalidate_range_start+0x1f4/0x250
> [ 194.259347] __split_huge_pmd+0x14f/0x170
> [ 194.259349] unmap_page_range+0x104d/0x1a00
> [ 194.259358] unmap_single_vma+0x7d/0xc0
> [ 194.259360] zap_page_range_single_batched+0xe0/0x1c0
> [ 194.259363] madvise_vma_behavior+0xb01/0xc00
> [ 194.259366] madvise_do_behavior.part.0+0x3cd/0x4a0
> [ 194.259375] do_madvise+0xc7/0x170
> [ 194.259380] __x64_sys_madvise+0x2f/0x40
> [ 194.259382] x64_sys_call+0x1d77/0x21b0
> [ 194.259385] do_syscall_64+0x56/0x640
> [ 194.259388] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> In __split_huge_pmd(), the .invalidate callback is made *before* the 2 Meg
> page is actually split by the root partition. So mshv_chunk_stride() returns "9"
> for the stride, and the hypervisor is called with HV_MAP_GPA_LARGE_PAGE
> set. My guess is that the hypervisor returns an error because it has already
> split the mapping. The whole point of this patch set is to avoid passing
> HV_MAP_GPA_LARGE_PAGE to the hypervisor when the hypervisor mapping
> is not a large page mapping, but this looks like a case where it still happens.
>
> My concern is solely from looking at the code and thinking about the problem,
> as I don't have an environment where I can test root partition interactions
> with the hypervisor. So maybe I'm missing something. Lemme know what you
> think .....
>
Yeah, I see your point: according to this stack, once a part of the page
is invalidated, the folio order remains the same until another invocation
of the same callback — this time for the whole huge
page — is made. Thus, the stride is still reported as the huge page size,
even though a part of the page has already been unmapped.
This indeed looks like a flaw in the current approach, but it's actually
not. The reason is that upon the invalidation callback, the driver
simply remaps the whole huge page with no access (in this case, the PFNs
provided to the hypervisor are zero), and it's fine as the hypervisor
simply drops all the pages from the previous mapping and marks this page
as inaccessible. The only check the hypervisor makes in this case is
that both the GFN and mapping size are huge page aligned (which they are
in this case).
I hope this clarifies the situation. Please let me know if you have any
other questions.
Thanks,
Stanislav
> Michael
>
> [1] https://lore.kernel.org/linux-hyperv/SN6PR02MB4157978DFAA6C2584D0678E1D4A1A@SN6PR02MB4157.namprd02.prod.outlook.com/
>
> >
> > Fixes: abceb4297bf8 ("mshv: Fix huge page handling in memory region traversal")
> > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > ---
> > drivers/hv/mshv_regions.c | 94 ++++++++++++++++++++++++++++++---------------
> > 1 file changed, 63 insertions(+), 31 deletions(-)
> >
> > diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> > index 30bacba6aec3..29776019bcde 100644
> > --- a/drivers/hv/mshv_regions.c
> > +++ b/drivers/hv/mshv_regions.c
> > @@ -19,6 +19,42 @@
> >
> > #define MSHV_MAP_FAULT_IN_PAGES PTRS_PER_PMD
> >
> > +/**
> > + * mshv_chunk_stride - Compute stride for mapping guest memory
> > + * @page : The page to check for huge page backing
> > + * @gfn : Guest frame number for the mapping
> > + * @page_count: Total number of pages in the mapping
> > + *
> > + * Determines the appropriate stride (in pages) for mapping guest memory.
> > + * Uses huge page stride if the backing page is huge and the guest mapping
> > + * is properly aligned; otherwise falls back to single page stride.
> > + *
> > + * Return: Stride in pages, or -EINVAL if page order is unsupported.
> > + */
> > +static int mshv_chunk_stride(struct page *page,
> > + u64 gfn, u64 page_count)
> > +{
> > + unsigned int page_order;
> > +
> > + page_order = folio_order(page_folio(page));
> > + /* The hypervisor only supports 4K and 2M page sizes */
> > + if (page_order && page_order != PMD_ORDER)
> > + return -EINVAL;
> > +
> > + /*
> > + * Default to a single page stride. If page_order is set and both
> > + * the guest frame number (gfn) and page_count are huge-page
> > + * aligned (PTRS_PER_PMD), use a larger stride so the mapping can
> > + * be backed by a huge page in both guest and hypervisor.
> > + */
> > + if (page_order &&
> > + IS_ALIGNED(gfn, PTRS_PER_PMD) &&
> > + IS_ALIGNED(page_count, PTRS_PER_PMD))
> > + return 1 << page_order;
> > +
> > + return 1;
> > +}
> > +
> > /**
> > * mshv_region_process_chunk - Processes a contiguous chunk of memory pages
> > * in a region.
> > @@ -45,25 +81,23 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
> > int (*handler)(struct mshv_mem_region *region,
> > u32 flags,
> > u64 page_offset,
> > - u64 page_count))
> > + u64 page_count,
> > + bool huge_page))
> > {
> > - u64 count, stride;
> > - unsigned int page_order;
> > + u64 gfn = region->start_gfn + page_offset;
> > + u64 count;
> > struct page *page;
> > - int ret;
> > + int stride, ret;
> >
> > page = region->pages[page_offset];
> > if (!page)
> > return -EINVAL;
> >
> > - page_order = folio_order(page_folio(page));
> > - /* The hypervisor only supports 4K and 2M page sizes */
> > - if (page_order && page_order != PMD_ORDER)
> > - return -EINVAL;
> > -
> > - stride = 1 << page_order;
> > + stride = mshv_chunk_stride(page, gfn, page_count);
> > + if (stride < 0)
> > + return stride;
> >
> > - /* Start at stride since the first page is validated */
> > + /* Start at stride since the first stride is validated */
> > for (count = stride; count < page_count; count += stride) {
> > page = region->pages[page_offset + count];
> >
> > @@ -71,12 +105,13 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
> > if (!page)
> > break;
> >
> > - /* Break if page size changes */
> > - if (page_order != folio_order(page_folio(page)))
> > + /* Break if stride size changes */
> > + if (stride != mshv_chunk_stride(page, gfn + count,
> > + page_count - count))
> > break;
> > }
> >
> > - ret = handler(region, flags, page_offset, count);
> > + ret = handler(region, flags, page_offset, count, stride > 1);
> > if (ret)
> > return ret;
> >
> > @@ -108,7 +143,8 @@ static int mshv_region_process_range(struct mshv_mem_region *region,
> > int (*handler)(struct mshv_mem_region *region,
> > u32 flags,
> > u64 page_offset,
> > - u64 page_count))
> > + u64 page_count,
> > + bool huge_page))
> > {
> > long ret;
> >
> > @@ -162,11 +198,10 @@ struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
> >
> > static int mshv_region_chunk_share(struct mshv_mem_region *region,
> > u32 flags,
> > - u64 page_offset, u64 page_count)
> > + u64 page_offset, u64 page_count,
> > + bool huge_page)
> > {
> > - struct page *page = region->pages[page_offset];
> > -
> > - if (PageHuge(page) || PageTransCompound(page))
> > + if (huge_page)
> > flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
> >
> > return hv_call_modify_spa_host_access(region->partition->pt_id,
> > @@ -188,11 +223,10 @@ int mshv_region_share(struct mshv_mem_region *region)
> >
> > static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
> > u32 flags,
> > - u64 page_offset, u64 page_count)
> > + u64 page_offset, u64 page_count,
> > + bool huge_page)
> > {
> > - struct page *page = region->pages[page_offset];
> > -
> > - if (PageHuge(page) || PageTransCompound(page))
> > + if (huge_page)
> > flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
> >
> > return hv_call_modify_spa_host_access(region->partition->pt_id,
> > @@ -212,11 +246,10 @@ int mshv_region_unshare(struct mshv_mem_region *region)
> >
> > static int mshv_region_chunk_remap(struct mshv_mem_region *region,
> > u32 flags,
> > - u64 page_offset, u64 page_count)
> > + u64 page_offset, u64 page_count,
> > + bool huge_page)
> > {
> > - struct page *page = region->pages[page_offset];
> > -
> > - if (PageHuge(page) || PageTransCompound(page))
> > + if (huge_page)
> > flags |= HV_MAP_GPA_LARGE_PAGE;
> >
> > return hv_call_map_gpa_pages(region->partition->pt_id,
> > @@ -295,11 +328,10 @@ int mshv_region_pin(struct mshv_mem_region *region)
> >
> > static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
> > u32 flags,
> > - u64 page_offset, u64 page_count)
> > + u64 page_offset, u64 page_count,
> > + bool huge_page)
> > {
> > - struct page *page = region->pages[page_offset];
> > -
> > - if (PageHuge(page) || PageTransCompound(page))
> > + if (huge_page)
> > flags |= HV_UNMAP_GPA_LARGE_PAGE;
> >
> > return hv_call_unmap_gpa_pages(region->partition->pt_id,
> >
> >
>
^ permalink raw reply
* [PATCH 1/1] Drivers: hv: Fix uninit'ed variable in hv_msg_dump() if CONFIG_PRINTK not set
From: mhkelley58 @ 2025-12-19 16:08 UTC (permalink / raw)
To: haiyangz, wei.liu, decui, kys, linux-kernel, linux-hyperv,
dan.carpenter
From: Michael Kelley <mhklinux@outlook.com>
When CONFIG_PRINTK is not set, kmsg_dump_get_buffer() returns 'false'
without setting the bytes_written argument. In such case, bytes_written
is uninitialized when it is tested for zero.
This is admittedly an unlikely scenario, but in the interest of correctness
and avoiding tool noise about uninitialized variables, fix this by testing
the return value before testing bytes_written.
Fixes: 9c318a1d9b50 ("Drivers: hv: move panic report code from vmbus to hv early init code")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/all/202512172102.OcUspn1Z-lkp@intel.com/
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
---
drivers/hv/hv_common.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
index f466a6099eff..de9e069c5a0c 100644
--- a/drivers/hv/hv_common.c
+++ b/drivers/hv/hv_common.c
@@ -188,6 +188,7 @@ static void hv_kmsg_dump(struct kmsg_dumper *dumper,
{
struct kmsg_dump_iter iter;
size_t bytes_written;
+ bool ret;
/* We are only interested in panics. */
if (detail->reason != KMSG_DUMP_PANIC || !sysctl_record_panic_msg)
@@ -198,9 +199,9 @@ static void hv_kmsg_dump(struct kmsg_dumper *dumper,
* be single-threaded.
*/
kmsg_dump_rewind(&iter);
- kmsg_dump_get_buffer(&iter, false, hv_panic_page, HV_HYP_PAGE_SIZE,
- &bytes_written);
- if (!bytes_written)
+ ret = kmsg_dump_get_buffer(&iter, false, hv_panic_page, HV_HYP_PAGE_SIZE,
+ &bytes_written);
+ if (!ret || !bytes_written)
return;
/*
* P3 to contain the physical address of the panic page & P4 to
--
2.25.1
^ permalink raw reply related
* Re: [PATCH][next] hyperv: Avoid -Wflex-array-member-not-at-end warning
From: Gustavo A. R. Silva @ 2025-12-19 2:53 UTC (permalink / raw)
To: Wei Liu, Gustavo A. R. Silva
Cc: K. Y. Srinivasan, Haiyang Zhang, Dexuan Cui, Long Li,
linux-hyperv, linux-kernel, linux-hardening
In-Reply-To: <20251218194231.GC1749918@liuwe-devbox-debian-v2.local>
On 12/19/25 04:42, Wei Liu wrote:
> On Fri, Dec 12, 2025 at 03:44:50PM +0900, Gustavo A. R. Silva wrote:
>> -Wflex-array-member-not-at-end was introduced in GCC-14, and we are
>> getting ready to enable it, globally.
>>
>> Use the new __TRAILING_OVERLAP() helper to fix the following warning:
>>
>> include/hyperv/hvgdk_mini.h:581:25: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
>>
>> This helper creates a union between a flexible-array member (FAM) and a
>> set of MEMBERS that would otherwise follow it.
>>
>> This overlays the trailing MEMBER u64 gva_list[]; onto the FAM
>> struct hv_tlb_flush_ex::hv_vp_set.bank_contents[], while keeping
>> the FAM and the start of MEMBER aligned.
>>
>> The static_assert() ensures this alignment remains, and it's
>> intentionally placed inmediately after the related structure --no
>> blank line in between.
>>
>> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
>
> Applied. I fixed a typo in the commit message ("inmediately" ->
> "immediately").
Awesome. :)
Thanks
-Gustavo
^ permalink raw reply
* Re: [PATCH 1/3] hyperv: add definitions for arm64 gpa intercepts
From: Wei Liu @ 2025-12-18 20:06 UTC (permalink / raw)
To: Anirudh Rayabharam
Cc: vdso, kys, decui, haiyangz, linux-kernel, longli, wei.liu,
linux-hyperv
In-Reply-To: <jhyqp7vlqsmnps52cgzzuyon3aihcxizog4bknnofuibhud5ry@3nix3cwzwapw>
On Wed, Dec 17, 2025 at 10:38:47AM +0530, Anirudh Rayabharam wrote:
> On Tue, Dec 16, 2025 at 07:07:45AM -0800, vdso@mailbox.org wrote:
> >
> > > On 12/16/2025 6:20 AM Anirudh Rayabharam <anirudh@anirudhrb.com> wrote:
> >
> > [...]
> >
> > > +#if IS_ENABLED(CONFIG_ARM64)
> > > +union hv_arm64_vp_execution_state {
> > > + u16 as_uint16;
> > > + struct {
> > > + u16 cpl:2;
> >
> > That looks oddly x86(-64)-specific (Current Priviledge Level).
> >
> > Unless I'm mistaken, CPL doesn't belong here, and the bitfield isn't
> > used on ARM64. Provided the layout of the struct is correct, the
> > bitfield can have a better name of `reserved0` or something like that.
>
> Hmmm... this is how it is defined in the hypervisor headers though.
We should ask the hypervisor folks then. Once this gets out in the wild
it will be difficult to change.
Wei
>
> Anirudh.
^ permalink raw reply
* Re: [PATCH 3/3] mshv: release mutex on region invalidation failure
From: Wei Liu @ 2025-12-18 20:01 UTC (permalink / raw)
To: Anirudh Rayabharam
Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <20251216142030.4095527-4-anirudh@anirudhrb.com>
On Tue, Dec 16, 2025 at 02:20:30PM +0000, Anirudh Rayabharam wrote:
> From: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
>
> In the region invalidation failure path in
> mshv_region_interval_invalidate(), the region mutex is not released. Fix
> it by releasing the mutex in the failure path.
>
> Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
I applied this patch to hyperv-fixes.
Wei
^ permalink raw reply
* Re: [PATCH][next] hyperv: Avoid -Wflex-array-member-not-at-end warning
From: Wei Liu @ 2025-12-18 19:42 UTC (permalink / raw)
To: Gustavo A. R. Silva
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
linux-hyperv, linux-kernel, linux-hardening
In-Reply-To: <aTu54qH2iHLKScRW@kspp>
On Fri, Dec 12, 2025 at 03:44:50PM +0900, Gustavo A. R. Silva wrote:
> -Wflex-array-member-not-at-end was introduced in GCC-14, and we are
> getting ready to enable it, globally.
>
> Use the new __TRAILING_OVERLAP() helper to fix the following warning:
>
> include/hyperv/hvgdk_mini.h:581:25: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
>
> This helper creates a union between a flexible-array member (FAM) and a
> set of MEMBERS that would otherwise follow it.
>
> This overlays the trailing MEMBER u64 gva_list[]; onto the FAM
> struct hv_tlb_flush_ex::hv_vp_set.bank_contents[], while keeping
> the FAM and the start of MEMBER aligned.
>
> The static_assert() ensures this alignment remains, and it's
> intentionally placed inmediately after the related structure --no
> blank line in between.
>
> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Applied. I fixed a typo in the commit message ("inmediately" ->
"immediately").
Wei
> ---
> include/hyperv/hvgdk_mini.h | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/include/hyperv/hvgdk_mini.h b/include/hyperv/hvgdk_mini.h
> index 04b18d0e37af..30fbbde81c5c 100644
> --- a/include/hyperv/hvgdk_mini.h
> +++ b/include/hyperv/hvgdk_mini.h
> @@ -578,9 +578,12 @@ struct hv_tlb_flush { /* HV_INPUT_FLUSH_VIRTUAL_ADDRESS_LIST */
> struct hv_tlb_flush_ex {
> u64 address_space;
> u64 flags;
> - struct hv_vpset hv_vp_set;
> - u64 gva_list[];
> + __TRAILING_OVERLAP(struct hv_vpset, hv_vp_set, bank_contents, __packed,
> + u64 gva_list[];
> + );
> } __packed;
> +static_assert(offsetof(struct hv_tlb_flush_ex, hv_vp_set.bank_contents) ==
> + offsetof(struct hv_tlb_flush_ex, gva_list));
>
> struct ms_hyperv_tsc_page { /* HV_REFERENCE_TSC_PAGE */
> volatile u32 tsc_sequence;
> --
> 2.43.0
>
^ permalink raw reply
* RE: [PATCH] mshv: Align huge page stride with guest mapping
From: Michael Kelley @ 2025-12-18 19:41 UTC (permalink / raw)
To: Stanislav Kinsburskii, kys@microsoft.com, haiyangz@microsoft.com,
wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com
Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <176593206931.276257.13023250440372517478.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Tuesday, December 16, 2025 4:41 PM
>
> Ensure that a stride larger than 1 (huge page) is only used when both
> the guest frame number (gfn) and the operation size (page_count) are
> aligned to the huge page size (PTRS_PER_PMD). This matches the
> hypervisor requirement that map/unmap operations for huge pages must be
> guest-aligned and cover a full huge page.
>
> Add mshv_chunk_stride() to encapsulate this alignment and page-order
> validation, and plumb a huge_page flag into the region chunk handlers.
> This prevents issuing large-page map/unmap/share operations that the
> hypervisor would reject due to misaligned guest mappings.
This code looks good to me on the surface. But I can only make an educated
guess as to the hypervisor behavior in certain situations, and if my guess is
correct there's still a flaw in one case.
Consider the madvise() DONTNEED experiment that I previously called out. [1]
I surmise that the intent of this patch is to make that case work correctly.
When the .invalidate callback is made for the 32 Kbyte range embedded in
a previously mapped 2 Meg page, this new code detects that case. It calls the
hypervisor to remap the 32 Kbyte range for no access, and clears the 8
corresponding entries in the struct page array attached to the mshv region. The
call to the hypervisor is made *without* the HV_MAP_GPA_LARGE_PAGE flag.
Since the mapping was originally done *with* the HV_MAP_GPA_LARGE_PAGE
flag, my guess is that the hypervisor is smart enough to handle this case by
splitting the 2 Meg mapping it created, setting the 32 Kbyte range to no access,
and returning "success". If my guess is correct, there's no problem here.
But then there's a second .invalidate callback for the entire 2 Meg page. Here's
the call stack:
[ 194.259337] dump_stack+0x14/0x20
[ 194.259339] mhktest_invalidate+0x2a/0x40 [my dummy invalidate callback]
[ 194.259342] __mmu_notifier_invalidate_range_start+0x1f4/0x250
[ 194.259347] __split_huge_pmd+0x14f/0x170
[ 194.259349] unmap_page_range+0x104d/0x1a00
[ 194.259358] unmap_single_vma+0x7d/0xc0
[ 194.259360] zap_page_range_single_batched+0xe0/0x1c0
[ 194.259363] madvise_vma_behavior+0xb01/0xc00
[ 194.259366] madvise_do_behavior.part.0+0x3cd/0x4a0
[ 194.259375] do_madvise+0xc7/0x170
[ 194.259380] __x64_sys_madvise+0x2f/0x40
[ 194.259382] x64_sys_call+0x1d77/0x21b0
[ 194.259385] do_syscall_64+0x56/0x640
[ 194.259388] entry_SYSCALL_64_after_hwframe+0x76/0x7e
In __split_huge_pmd(), the .invalidate callback is made *before* the 2 Meg
page is actually split by the root partition. So mshv_chunk_stride() returns "9"
for the stride, and the hypervisor is called with HV_MAP_GPA_LARGE_PAGE
set. My guess is that the hypervisor returns an error because it has already
split the mapping. The whole point of this patch set is to avoid passing
HV_MAP_GPA_LARGE_PAGE to the hypervisor when the hypervisor mapping
is not a large page mapping, but this looks like a case where it still happens.
My concern is solely from looking at the code and thinking about the problem,
as I don't have an environment where I can test root partition interactions
with the hypervisor. So maybe I'm missing something. Lemme know what you
think .....
Michael
[1] https://lore.kernel.org/linux-hyperv/SN6PR02MB4157978DFAA6C2584D0678E1D4A1A@SN6PR02MB4157.namprd02.prod.outlook.com/
>
> Fixes: abceb4297bf8 ("mshv: Fix huge page handling in memory region traversal")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
> drivers/hv/mshv_regions.c | 94 ++++++++++++++++++++++++++++++---------------
> 1 file changed, 63 insertions(+), 31 deletions(-)
>
> diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> index 30bacba6aec3..29776019bcde 100644
> --- a/drivers/hv/mshv_regions.c
> +++ b/drivers/hv/mshv_regions.c
> @@ -19,6 +19,42 @@
>
> #define MSHV_MAP_FAULT_IN_PAGES PTRS_PER_PMD
>
> +/**
> + * mshv_chunk_stride - Compute stride for mapping guest memory
> + * @page : The page to check for huge page backing
> + * @gfn : Guest frame number for the mapping
> + * @page_count: Total number of pages in the mapping
> + *
> + * Determines the appropriate stride (in pages) for mapping guest memory.
> + * Uses huge page stride if the backing page is huge and the guest mapping
> + * is properly aligned; otherwise falls back to single page stride.
> + *
> + * Return: Stride in pages, or -EINVAL if page order is unsupported.
> + */
> +static int mshv_chunk_stride(struct page *page,
> + u64 gfn, u64 page_count)
> +{
> + unsigned int page_order;
> +
> + page_order = folio_order(page_folio(page));
> + /* The hypervisor only supports 4K and 2M page sizes */
> + if (page_order && page_order != PMD_ORDER)
> + return -EINVAL;
> +
> + /*
> + * Default to a single page stride. If page_order is set and both
> + * the guest frame number (gfn) and page_count are huge-page
> + * aligned (PTRS_PER_PMD), use a larger stride so the mapping can
> + * be backed by a huge page in both guest and hypervisor.
> + */
> + if (page_order &&
> + IS_ALIGNED(gfn, PTRS_PER_PMD) &&
> + IS_ALIGNED(page_count, PTRS_PER_PMD))
> + return 1 << page_order;
> +
> + return 1;
> +}
> +
> /**
> * mshv_region_process_chunk - Processes a contiguous chunk of memory pages
> * in a region.
> @@ -45,25 +81,23 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
> int (*handler)(struct mshv_mem_region *region,
> u32 flags,
> u64 page_offset,
> - u64 page_count))
> + u64 page_count,
> + bool huge_page))
> {
> - u64 count, stride;
> - unsigned int page_order;
> + u64 gfn = region->start_gfn + page_offset;
> + u64 count;
> struct page *page;
> - int ret;
> + int stride, ret;
>
> page = region->pages[page_offset];
> if (!page)
> return -EINVAL;
>
> - page_order = folio_order(page_folio(page));
> - /* The hypervisor only supports 4K and 2M page sizes */
> - if (page_order && page_order != PMD_ORDER)
> - return -EINVAL;
> -
> - stride = 1 << page_order;
> + stride = mshv_chunk_stride(page, gfn, page_count);
> + if (stride < 0)
> + return stride;
>
> - /* Start at stride since the first page is validated */
> + /* Start at stride since the first stride is validated */
> for (count = stride; count < page_count; count += stride) {
> page = region->pages[page_offset + count];
>
> @@ -71,12 +105,13 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
> if (!page)
> break;
>
> - /* Break if page size changes */
> - if (page_order != folio_order(page_folio(page)))
> + /* Break if stride size changes */
> + if (stride != mshv_chunk_stride(page, gfn + count,
> + page_count - count))
> break;
> }
>
> - ret = handler(region, flags, page_offset, count);
> + ret = handler(region, flags, page_offset, count, stride > 1);
> if (ret)
> return ret;
>
> @@ -108,7 +143,8 @@ static int mshv_region_process_range(struct mshv_mem_region *region,
> int (*handler)(struct mshv_mem_region *region,
> u32 flags,
> u64 page_offset,
> - u64 page_count))
> + u64 page_count,
> + bool huge_page))
> {
> long ret;
>
> @@ -162,11 +198,10 @@ struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
>
> static int mshv_region_chunk_share(struct mshv_mem_region *region,
> u32 flags,
> - u64 page_offset, u64 page_count)
> + u64 page_offset, u64 page_count,
> + bool huge_page)
> {
> - struct page *page = region->pages[page_offset];
> -
> - if (PageHuge(page) || PageTransCompound(page))
> + if (huge_page)
> flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
>
> return hv_call_modify_spa_host_access(region->partition->pt_id,
> @@ -188,11 +223,10 @@ int mshv_region_share(struct mshv_mem_region *region)
>
> static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
> u32 flags,
> - u64 page_offset, u64 page_count)
> + u64 page_offset, u64 page_count,
> + bool huge_page)
> {
> - struct page *page = region->pages[page_offset];
> -
> - if (PageHuge(page) || PageTransCompound(page))
> + if (huge_page)
> flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
>
> return hv_call_modify_spa_host_access(region->partition->pt_id,
> @@ -212,11 +246,10 @@ int mshv_region_unshare(struct mshv_mem_region *region)
>
> static int mshv_region_chunk_remap(struct mshv_mem_region *region,
> u32 flags,
> - u64 page_offset, u64 page_count)
> + u64 page_offset, u64 page_count,
> + bool huge_page)
> {
> - struct page *page = region->pages[page_offset];
> -
> - if (PageHuge(page) || PageTransCompound(page))
> + if (huge_page)
> flags |= HV_MAP_GPA_LARGE_PAGE;
>
> return hv_call_map_gpa_pages(region->partition->pt_id,
> @@ -295,11 +328,10 @@ int mshv_region_pin(struct mshv_mem_region *region)
>
> static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
> u32 flags,
> - u64 page_offset, u64 page_count)
> + u64 page_offset, u64 page_count,
> + bool huge_page)
> {
> - struct page *page = region->pages[page_offset];
> -
> - if (PageHuge(page) || PageTransCompound(page))
> + if (huge_page)
> flags |= HV_UNMAP_GPA_LARGE_PAGE;
>
> return hv_call_unmap_gpa_pages(region->partition->pt_id,
>
>
^ permalink raw reply
* Re: [PATCH] mshv: hide x86-specific functions on arm64
From: Wei Liu @ 2025-12-18 19:15 UTC (permalink / raw)
To: Stanislav Kinsburskii
Cc: Wei Liu, Arnd Bergmann, K. Y. Srinivasan, Haiyang Zhang,
Dexuan Cui, Long Li, Stansialv Kinsburskii, Nuno Das Neves,
Praveen K Paladugu, Easwar Hariharan, Anatol Belski,
Arnd Bergmann, Sean Christopherson, Naman Jain, linux-hyperv,
linux-kernel
In-Reply-To: <aUH9GGzROHfbP6pj@skinsburskii.localdomain>
On Tue, Dec 16, 2025 at 04:45:12PM -0800, Stanislav Kinsburskii wrote:
> On Tue, Dec 16, 2025 at 10:21:29PM +0000, Wei Liu wrote:
> > On Tue, Dec 16, 2025 at 10:36:12PM +0100, Arnd Bergmann wrote:
> > > From: Arnd Bergmann <arnd@arndb.de>
> > >
> > > The hv_sleep_notifiers_register() and hv_machine_power_off() functions are
> > > only called and declared on x86, but cause a warning on arm64:
> > >
> > > drivers/hv/mshv_common.c:210:6: error: no previous prototype for 'hv_sleep_notifiers_register' [-Werror=missing-prototypes]
> > > 210 | void hv_sleep_notifiers_register(void)
> > > | ^~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > drivers/hv/mshv_common.c:224:6: error: no previous prototype for 'hv_machine_power_off' [-Werror=missing-prototypes]
> > > 224 | void hv_machine_power_off(void)
> > > | ^~~~~~~~~~~~~~~~~~~~
> > >
> > > Hide both inside of an #ifdef block.
> > >
> > > Fixes: f0be2600ac55 ("mshv: Use reboot notifier to configure sleep state")
> > > Fixes: 615a6e7d83f9 ("mshv: Cleanly shutdown root partition with MSHV")
> > > Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> >
> > Thanks Arnd. I have a queued up a patch which make available the
> > declarations to arm64. Let me think about whether to use your patch
> > instead.
> >
> > Anyway, this issue will be fixed one way or the other once I send a PR
> > to Linus.
> >
>
> The whole idea of mshv_common.c is to have common code for all archs.
> It would be nice to keep this file as it is.
The path forward is either to move these functions to an x86-specific
file or to make arm64 work the same way. I'm not too sure about the
latter yet given the port is yet to be done. For the time being, I am
going to drop my own patch and take Arnd's patch. We can figure out
which of the two approaches to take later.
Wei
^ permalink raw reply
* [PATCH net, v2] net: mana: Fix use-after-free in reset service rescan path
From: Dipayaan Roy @ 2025-12-18 13:10 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
kuba, pabeni, longli, kotaranov, horms, shradhagupta, ssengar,
ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
linux-rdma, dipayanroy
When mana_serv_reset() encounters -ETIMEDOUT or -EPROTO from
mana_gd_resume(), it performs a PCI rescan via mana_serv_rescan().
mana_serv_rescan() calls pci_stop_and_remove_bus_device(), which can
invoke the driver's remove path and free the gdma_context associated
with the device. After returning, mana_serv_reset() currently jumps to
the out label and attempts to clear gc->in_service, dereferencing a
freed gdma_context.
The issue was observed with the following call logs:
[ 698.942636] BUG: unable to handle page fault for address: ff6c2b638088508d
[ 698.943121] #PF: supervisor write access in kernel mode
[ 698.943423] #PF: error_code(0x0002) - not-present page
[S[ 698.943793] Pat Dec 6 07:GD5 100000067 P4D 1002f7067 PUD 1002f8067 PMD 101bef067 PTE 0
0:56 2025] hv_[n e 698.944283] Oops: Oops: 0002 [#1] SMP NOPTI
tvsc f8615163-00[ 698.944611] CPU: 28 UID: 0 PID: 249 Comm: kworker/28:1
...
[Sat Dec 6 07:50:56 2025] R10: [ 699.121594] mana 7870:00:00.0 enP30832s1: Configured vPort 0 PD 18 DB 16
000000000000001b R11: 0000000000000000 R12: ff44cf3f40270000
[Sat Dec 6 07:50:56 2025] R13: 0000000000000001 R14: ff44cf3f402700c8 R15: ff44cf3f4021b405
[Sat Dec 6 07:50:56 2025] FS: 0000000000000000(0000) GS:ff44cf7e9fcf9000(0000) knlGS:0000000000000000
[Sat Dec 6 07:50:56 2025] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Sat Dec 6 07:50:56 2025] CR2: ff6c2b638088508d CR3: 000000011fe43001 CR4: 0000000000b73ef0
[Sat Dec 6 07:50:56 2025] Call Trace:
[Sat Dec 6 07:50:56 2025] <TASK>
[Sat Dec 6 07:50:56 2025] mana_serv_func+0x24/0x50 [mana]
[Sat Dec 6 07:50:56 2025] process_one_work+0x190/0x350
[Sat Dec 6 07:50:56 2025] worker_thread+0x2b7/0x3d0
[Sat Dec 6 07:50:56 2025] kthread+0xf3/0x200
[Sat Dec 6 07:50:56 2025] ? __pfx_worker_thread+0x10/0x10
[Sat Dec 6 07:50:56 2025] ? __pfx_kthread+0x10/0x10
[Sat Dec 6 07:50:56 2025] ret_from_fork+0x21a/0x250
[Sat Dec 6 07:50:56 2025] ? __pfx_kthread+0x10/0x10
[Sat Dec 6 07:50:56 2025] ret_from_fork_asm+0x1a/0x30
[Sat Dec 6 07:50:56 2025] </TASK>
Fix this by returning immediately after mana_serv_rescan() to avoid
accessing GC state that may no longer be valid.
Fixes: 9bf66036d686 ("net: mana: Handle hardware recovery events when probing the device")
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
---
drivers/net/ethernet/microsoft/mana/gdma_main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index efb4e412ec7e..0055c231acf6 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -481,7 +481,7 @@ static void mana_serv_reset(struct pci_dev *pdev)
/* Perform PCI rescan on device if we failed on HWC */
dev_err(&pdev->dev, "MANA service: resume failed, rescanning\n");
mana_serv_rescan(pdev);
- goto out;
+ return;
}
if (ret)
--
2.34.1
^ permalink raw reply related
* Re: [PATCH 1/3] hyperv: add definitions for arm64 gpa intercepts
From: vdso @ 2025-12-17 5:46 UTC (permalink / raw)
To: Anirudh Rayabharam
Cc: kys, decui, haiyangz, linux-kernel, longli, wei.liu, linux-hyperv
In-Reply-To: <jhyqp7vlqsmnps52cgzzuyon3aihcxizog4bknnofuibhud5ry@3nix3cwzwapw>
> On 12/16/2025 9:08 PM Anirudh Rayabharam <anirudh@anirudhrb.com> wrote:
>
>
> On Tue, Dec 16, 2025 at 07:07:45AM -0800, vdso@mailbox.org wrote:
> >
> > > On 12/16/2025 6:20 AM Anirudh Rayabharam <anirudh@anirudhrb.com> wrote:
> >
> > [...]
> >
> > > +#if IS_ENABLED(CONFIG_ARM64)
> > > +union hv_arm64_vp_execution_state {
> > > + u16 as_uint16;
> > > + struct {
> > > + u16 cpl:2;
> >
> > That looks oddly x86(-64)-specific (Current Priviledge Level).
> >
> > Unless I'm mistaken, CPL doesn't belong here, and the bitfield isn't
> > used on ARM64. Provided the layout of the struct is correct, the
> > bitfield can have a better name of `reserved0` or something like that.
>
> Hmmm... this is how it is defined in the hypervisor headers though.
The questions would be why the hypervisor has got that there (e.g., the
definitions of that struct for x86 and ARM64 are merged), and if Linux needs
to care about the reason valid in the hv's codebase. Perhaps the definitions
are merged there to write less arch-specific code, similar to what Stas suggested
for the patch 2.
I haven't been able to find anything called CPL in the ARM64 arch docs, and
that field really sticks out as the x86-64's CPL. Naming it like that in an
ARM64-specific structure doesn't look justified.
^ permalink raw reply
* Re: [PATCH 2/3] mshv: handle gpa intercepts for arm64
From: Anirudh Rayabharam @ 2025-12-17 5:10 UTC (permalink / raw)
To: Stanislav Kinsburskii
Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <aUGElRga1r2g8cb-@skinsburskii.localdomain>
On Tue, Dec 16, 2025 at 08:11:01AM -0800, Stanislav Kinsburskii wrote:
> On Tue, Dec 16, 2025 at 02:20:29PM +0000, Anirudh Rayabharam wrote:
> > From: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> >
> > The mshv driver now uses movable pages for guests. For arm64 guests
> > to be functional, handle gpa intercepts for arm64 too (the current
> > code implements handling only for x86). Without this, arm64 guests are
> > broken.
> >
> > Move some arch-agnostic functions out of #ifdefs so that they can be
> > re-used.
> >
> > Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> > ---
> > drivers/hv/mshv_root_main.c | 38 ++++++++++++++++++++++++++++---------
> > 1 file changed, 29 insertions(+), 9 deletions(-)
> >
> > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > index 9cf28a3f12fe..882605349664 100644
> > --- a/drivers/hv/mshv_root_main.c
> > +++ b/drivers/hv/mshv_root_main.c
> > @@ -608,7 +608,6 @@ mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
> > return NULL;
> > }
> >
> > -#ifdef CONFIG_X86_64
> > static struct mshv_mem_region *
> > mshv_partition_region_by_gfn_get(struct mshv_partition *p, u64 gfn)
> > {
> > @@ -625,6 +624,34 @@ mshv_partition_region_by_gfn_get(struct mshv_partition *p, u64 gfn)
> > return region;
> > }
> >
> > +#ifdef CONFIG_X86_64
> > +static u64 mshv_get_gpa_intercept_gfn(struct mshv_vp *vp)
> > +{
> > + struct hv_x64_memory_intercept_message *msg;
> > + u64 gfn;
> > +
> > + msg = (struct hv_x64_memory_intercept_message *)
> > + vp->vp_intercept_msg_page->u.payload;
> > +
> > + gfn = HVPFN_DOWN(msg->guest_physical_address);
> > +
> > + return gfn;
> > +}
> > +#else /* CONFIG_X86_64 */
>
> It's better to explicitly branch for ARM64 here for clarity as
> hv_arm64_memory_intercept_message is defined only for ARM64.
Ack.
>
> > +static u64 mshv_get_gpa_intercept_gfn(struct mshv_vp *vp)
> > +{
> > + struct hv_arm64_memory_intercept_message *msg;
> > + u64 gfn;
> > +
> > + msg = (struct hv_arm64_memory_intercept_message *)
> > + vp->vp_intercept_msg_page->u.payload;
> > +
> > + gfn = HVPFN_DOWN(msg->guest_physical_address);
> > +
> > + return gfn;
> > +}
> > +#endif /* CONFIG_X86_64 */
> > +
>
> Are these functions really needed?
> It would be clearer (and shorter) to branch directly in
> mshv_handle_gpa_intercept.
True, that might be simpler. I'll send a v2.
Thanks,
Anirudh
^ permalink raw reply
* Re: [PATCH 1/3] hyperv: add definitions for arm64 gpa intercepts
From: Anirudh Rayabharam @ 2025-12-17 5:08 UTC (permalink / raw)
To: vdso; +Cc: kys, decui, haiyangz, linux-kernel, longli, wei.liu, linux-hyperv
In-Reply-To: <1801063954.177813.1765897665357@app.mailbox.org>
On Tue, Dec 16, 2025 at 07:07:45AM -0800, vdso@mailbox.org wrote:
>
> > On 12/16/2025 6:20 AM Anirudh Rayabharam <anirudh@anirudhrb.com> wrote:
>
> [...]
>
> > +#if IS_ENABLED(CONFIG_ARM64)
> > +union hv_arm64_vp_execution_state {
> > + u16 as_uint16;
> > + struct {
> > + u16 cpl:2;
>
> That looks oddly x86(-64)-specific (Current Priviledge Level).
>
> Unless I'm mistaken, CPL doesn't belong here, and the bitfield isn't
> used on ARM64. Provided the layout of the struct is correct, the
> bitfield can have a better name of `reserved0` or something like that.
Hmmm... this is how it is defined in the hypervisor headers though.
Anirudh.
^ permalink raw reply
* Re: [PATCH] mshv: Use PMD_ORDER instead of HPAGE_PMD_ORDER when processing regions
From: Anirudh Rayabharam @ 2025-12-17 5:03 UTC (permalink / raw)
To: Stanislav Kinsburskii
Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <176529822862.17729.14849117117197568731.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
On Tue, Dec 09, 2025 at 04:37:20PM +0000, Stanislav Kinsburskii wrote:
> Fix page order determination logic when CONFIG_PGTABLE_HAS_HUGE_LEAVES
> is undefined, as HPAGE_PMD_SHIFT is defined as BUILD_BUG in that case.
>
> Fixes: abceb4297bf8 ("mshv: Fix huge page handling in memory region
> traversal")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
> drivers/hv/mshv_regions.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> index 202b9d551e39..dc2d7044fb91 100644
> --- a/drivers/hv/mshv_regions.c
> +++ b/drivers/hv/mshv_regions.c
> @@ -58,7 +58,7 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
>
> page_order = folio_order(page_folio(page));
> /* The hypervisor only supports 4K and 2M page sizes */
> - if (page_order && page_order != HPAGE_PMD_ORDER)
> + if (page_order && page_order != PMD_ORDER)
> return -EINVAL;
>
> stride = 1 << page_order;
Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox