* Re: [PATCH RFC net-next v13 02/13] vsock: add netns to vsock core
From: Bobby Eshleman @ 2026-01-13 0:52 UTC
To: Michael S. Tsirkin
Cc: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Stefan Hajnoczi, Jason Wang,
Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Shuah Khan, Long Li,
linux-kernel, virtualization, netdev, kvm, linux-hyperv,
linux-kselftest, berrange, Sargun Dhillon, Bobby Eshleman
In-Reply-To: <aWWFB2K5H5OXGWP8@devvm11784.nha0.facebook.com>
On Mon, Jan 12, 2026 at 03:34:31PM -0800, Bobby Eshleman wrote:
> On Sun, Jan 11, 2026 at 01:43:37AM -0500, Michael S. Tsirkin wrote:
> > On Tue, Dec 23, 2025 at 04:28:36PM -0800, Bobby Eshleman wrote:
> > > From: Bobby Eshleman <bobbyeshleman@meta.com>
> > >
> > > Add netns logic to vsock core. Additionally, modify transport hook
> > > prototypes to be used by later transport-specific patches (e.g.,
> > > *_seqpacket_allow()).
> > >
> > > Namespaces are supported primarily by changing socket lookup functions
> > > (e.g., vsock_find_connected_socket()) to take into account the socket
> > > namespace and the namespace mode before considering a candidate socket a
> > > "match".
> > >
> > > This patch also introduces the sysctl /proc/sys/net/vsock/ns_mode to
> > > report the mode and /proc/sys/net/vsock/child_ns_mode to set the mode
> > > for new namespaces.
> > >
> > > Add netns functionality (initialization, passing to transports, procfs,
> > > etc...) to the af_vsock socket layer. Later patches that add netns
> > > support to transports depend on this patch.
> > >
> > > dgram_allow(), stream_allow(), and seqpacket_allow() callbacks are
> > > modified to take a vsk in order to perform logic on namespace modes. In
> > > future patches, the net will also be used for socket
> > > lookups in these functions.
> > >
> > > Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
> >
> > ...
> >
> >
> > > static int __vsock_bind_connectible(struct vsock_sock *vsk,
> > > struct sockaddr_vm *addr)
> > > {
> > > + struct net *net = sock_net(sk_vsock(vsk));
> > > static u32 port;
> > > struct sockaddr_vm new_addr;
> > >
> >
> >
> > Hmm this static port gives me pause. So some port number info leaks
> > between namespaces. I am not saying it's a big security issue
> > and yet ... people expect isolation.
>
> Probably the easiest solution is making it per-ns, my quick rough draft
> looks like this:
>
> diff --git a/include/net/netns/vsock.h b/include/net/netns/vsock.h
> index e2325e2d6ec5..b34d69a22fa8 100644
> --- a/include/net/netns/vsock.h
> +++ b/include/net/netns/vsock.h
> @@ -11,6 +11,10 @@ enum vsock_net_mode {
>
> struct netns_vsock {
> struct ctl_table_header *sysctl_hdr;
> +
> + /* protected by the vsock_table_lock in af_vsock.c */
> + u32 port;
> +
> enum vsock_net_mode mode;
> enum vsock_net_mode child_ns_mode;
> };
> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> index 9d614e4a4fa5..cd2a47140134 100644
> --- a/net/vmw_vsock/af_vsock.c
> +++ b/net/vmw_vsock/af_vsock.c
> @@ -748,11 +748,10 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
> struct sockaddr_vm *addr)
> {
> struct net *net = sock_net(sk_vsock(vsk));
> - static u32 port;
> struct sockaddr_vm new_addr;
>
> - if (!port)
> - port = get_random_u32_above(LAST_RESERVED_PORT);
> + if (!net->vsock.port)
> + net->vsock.port = get_random_u32_above(LAST_RESERVED_PORT);
>
> vsock_addr_init(&new_addr, addr->svm_cid, addr->svm_port);
>
> @@ -761,11 +760,11 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
> unsigned int i;
>
> for (i = 0; i < MAX_PORT_RETRIES; i++) {
> - if (port == VMADDR_PORT_ANY ||
> - port <= LAST_RESERVED_PORT)
> - port = LAST_RESERVED_PORT + 1;
> + if (net->vsock.port == VMADDR_PORT_ANY ||
> + net->vsock.port <= LAST_RESERVED_PORT)
> + net->vsock.port = LAST_RESERVED_PORT + 1;
>
> - new_addr.svm_port = port++;
> + new_addr.svm_port = net->vsock.port++;
>
> if (!__vsock_find_bound_socket_net(&new_addr, net)) {
> found = true;
>
>
>
Another option being to follow inet's path and use
siphash_4u32() the way that secure_ipv4_port_ephemeral() does...
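As a rough illustration of the hashing alternative, here is a minimal user-space sketch. It is not kernel code and not the proposed patch: mix64() is only a stand-in for a keyed hash such as siphash_4u32(), and the names secret, ns_cookie, and pick_start_port() are assumptions invented for this example. LAST_RESERVED_PORT and VMADDR_PORT_ANY are assumed to have the values used in af_vsock.c.

```c
#include <stdint.h>

#define LAST_RESERVED_PORT 1023u
#define VMADDR_PORT_ANY    0xFFFFFFFFu   /* -1U */

/* Stand-in for a keyed hash such as siphash; NOT cryptographically secure. */
static uint64_t mix64(uint64_t x)
{
	x ^= x >> 33;
	x *= 0xff51afd7ed558ccdULL;
	x ^= x >> 33;
	x *= 0xc4ceb9fe1a85ec53ULL;
	x ^= x >> 33;
	return x;
}

/*
 * Derive the first candidate port for an auto-bind from a secret and a
 * per-namespace cookie, so that namespaces do not share one cursor.
 */
static uint32_t pick_start_port(uint64_t secret, uint64_t ns_cookie,
				uint32_t cid)
{
	/* Usable ports lie strictly between the two reserved bounds. */
	const uint32_t range = VMADDR_PORT_ANY - LAST_RESERVED_PORT - 1;
	uint64_t h = mix64(secret ^ mix64(ns_cookie ^ ((uint64_t)cid << 32)));

	return LAST_RESERVED_PORT + 1 + (uint32_t)(h % range);
}
```

With this shape the starting point depends on the namespace rather than on shared mutable state, which is the property the isolation concern is about. Collision handling would still need a retry loop like the existing one.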
* Re: [PATCH RFC net-next v13 02/13] vsock: add netns to vsock core
From: Bobby Eshleman @ 2026-01-12 23:34 UTC
To: Michael S. Tsirkin
Cc: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Stefan Hajnoczi, Jason Wang,
Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Shuah Khan, Long Li,
linux-kernel, virtualization, netdev, kvm, linux-hyperv,
linux-kselftest, berrange, Sargun Dhillon, Bobby Eshleman
In-Reply-To: <20260111013536-mutt-send-email-mst@kernel.org>
On Sun, Jan 11, 2026 at 01:43:37AM -0500, Michael S. Tsirkin wrote:
> On Tue, Dec 23, 2025 at 04:28:36PM -0800, Bobby Eshleman wrote:
> > From: Bobby Eshleman <bobbyeshleman@meta.com>
> >
> > Add netns logic to vsock core. Additionally, modify transport hook
> > prototypes to be used by later transport-specific patches (e.g.,
> > *_seqpacket_allow()).
> >
> > Namespaces are supported primarily by changing socket lookup functions
> > (e.g., vsock_find_connected_socket()) to take into account the socket
> > namespace and the namespace mode before considering a candidate socket a
> > "match".
> >
> > This patch also introduces the sysctl /proc/sys/net/vsock/ns_mode to
> > report the mode and /proc/sys/net/vsock/child_ns_mode to set the mode
> > for new namespaces.
> >
> > Add netns functionality (initialization, passing to transports, procfs,
> > etc...) to the af_vsock socket layer. Later patches that add netns
> > support to transports depend on this patch.
> >
> > dgram_allow(), stream_allow(), and seqpacket_allow() callbacks are
> > modified to take a vsk in order to perform logic on namespace modes. In
> > future patches, the net will also be used for socket
> > lookups in these functions.
> >
> > Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
>
> ...
>
>
> > static int __vsock_bind_connectible(struct vsock_sock *vsk,
> > struct sockaddr_vm *addr)
> > {
> > + struct net *net = sock_net(sk_vsock(vsk));
> > static u32 port;
> > struct sockaddr_vm new_addr;
> >
>
>
> Hmm this static port gives me pause. So some port number info leaks
> between namespaces. I am not saying it's a big security issue
> and yet ... people expect isolation.
Probably the easiest solution is making it per-ns, my quick rough draft
looks like this:
diff --git a/include/net/netns/vsock.h b/include/net/netns/vsock.h
index e2325e2d6ec5..b34d69a22fa8 100644
--- a/include/net/netns/vsock.h
+++ b/include/net/netns/vsock.h
@@ -11,6 +11,10 @@ enum vsock_net_mode {
struct netns_vsock {
struct ctl_table_header *sysctl_hdr;
+
+ /* protected by the vsock_table_lock in af_vsock.c */
+ u32 port;
+
enum vsock_net_mode mode;
enum vsock_net_mode child_ns_mode;
};
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 9d614e4a4fa5..cd2a47140134 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -748,11 +748,10 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
struct sockaddr_vm *addr)
{
struct net *net = sock_net(sk_vsock(vsk));
- static u32 port;
struct sockaddr_vm new_addr;
- if (!port)
- port = get_random_u32_above(LAST_RESERVED_PORT);
+ if (!net->vsock.port)
+ net->vsock.port = get_random_u32_above(LAST_RESERVED_PORT);
vsock_addr_init(&new_addr, addr->svm_cid, addr->svm_port);
@@ -761,11 +760,11 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
unsigned int i;
for (i = 0; i < MAX_PORT_RETRIES; i++) {
- if (port == VMADDR_PORT_ANY ||
- port <= LAST_RESERVED_PORT)
- port = LAST_RESERVED_PORT + 1;
+ if (net->vsock.port == VMADDR_PORT_ANY ||
+ net->vsock.port <= LAST_RESERVED_PORT)
+ net->vsock.port = LAST_RESERVED_PORT + 1;
- new_addr.svm_port = port++;
+ new_addr.svm_port = net->vsock.port++;
if (!__vsock_find_bound_socket_net(&new_addr, net)) {
found = true;
Not as nice, but not necessarily horrid. WDYT?
Best,
Bobby
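The per-namespace cursor in the draft above can be modelled in user space roughly as follows. This is a sketch under assumptions, not the kernel implementation: struct ns_model stands in for struct netns_vsock, the in_use callback is a hypothetical replacement for __vsock_find_bound_socket_net(), and the seed argument replaces get_random_u32_above(). MAX_PORT_RETRIES is assumed to be 24, as in af_vsock.c.

```c
#include <stdbool.h>
#include <stdint.h>

#define LAST_RESERVED_PORT 1023u
#define VMADDR_PORT_ANY    0xFFFFFFFFu
#define MAX_PORT_RETRIES   24

/* Toy stand-in for struct netns_vsock: one cursor per namespace. */
struct ns_model {
	uint32_t port;   /* next candidate port; 0 means "not seeded yet" */
};

/* Hypothetical callback: returns true if the port is already bound. */
typedef bool (*port_in_use_fn)(const struct ns_model *ns, uint32_t port);

/* Returns the chosen port, or VMADDR_PORT_ANY if every retry collided. */
static uint32_t autobind_port(struct ns_model *ns, uint32_t seed,
			      port_in_use_fn in_use)
{
	if (!ns->port)
		ns->port = seed;   /* the draft uses get_random_u32_above() */

	for (int i = 0; i < MAX_PORT_RETRIES; i++) {
		/* Skip the reserved range and the wildcard value. */
		if (ns->port == VMADDR_PORT_ANY ||
		    ns->port <= LAST_RESERVED_PORT)
			ns->port = LAST_RESERVED_PORT + 1;

		uint32_t candidate = ns->port++;

		if (!in_use(ns, candidate))
			return candidate;
	}
	return VMADDR_PORT_ANY;
}

/* Example predicate for the model: nothing is bound yet. */
static bool never_in_use(const struct ns_model *ns, uint32_t port)
{
	(void)ns;
	(void)port;
	return false;
}
```

Two ns_model instances advance independently, so a bind in one namespace reveals nothing about the cursor in another. In the kernel the cursor would additionally need the locking noted in the draft's comment.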
* Re: [PATCH RFC net-next v13 03/13] virtio: set skb owner of virtio_transport_reset_no_sock() reply
From: Bobby Eshleman @ 2026-01-12 23:21 UTC
To: Michael S. Tsirkin
Cc: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Stefan Hajnoczi, Jason Wang,
Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Shuah Khan, Long Li,
linux-kernel, virtualization, netdev, kvm, linux-hyperv,
linux-kselftest, berrange, Sargun Dhillon, Bobby Eshleman
In-Reply-To: <20260111014500-mutt-send-email-mst@kernel.org>
On Sun, Jan 11, 2026 at 01:46:43AM -0500, Michael S. Tsirkin wrote:
> On Tue, Dec 23, 2025 at 04:28:37PM -0800, Bobby Eshleman wrote:
> > From: Bobby Eshleman <bobbyeshleman@meta.com>
> >
> > Associate reply packets with the sending socket. When vsock must reply
> > with an RST packet and there exists a sending socket (e.g., for
> > loopback), setting the skb owner to the socket correctly handles
> > reference counting between the skb and sk (i.e., the sk stays alive
> > until the skb is freed).
> >
> > This allows the net namespace to be used for socket lookups for the
> > duration of the reply skb's lifetime, preventing race conditions between
> > the namespace lifecycle and vsock socket search using the namespace
> > pointer.
> >
> > Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
> > Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
> > ---
> > Changes in v11:
> > - move before adding to netns support (Stefano)
>
> can you explain about the revert please?
> I looked at feedback from Stefano and all he said
> apparently was not to break bisect.
The patch that brings support into vsock_loopback depends on this one to
avoid introducing a race condition, so it should come before that one.
Best,
Bobby
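The lifetime argument in the quoted commit message (the socket stays alive until the reply skb is freed) can be pictured with a toy reference-counting model. Everything here is illustrative: struct toy_sock, struct toy_skb, and the helper names are assumptions for this sketch and do not correspond to the kernel's struct sock, struct sk_buff, or their real ownership helpers.

```c
#include <stdlib.h>

/* Toy stand-ins for a socket and a packet buffer; names are illustrative. */
struct toy_sock {
	int refcnt;
	int netns_id;   /* stands in for the socket's net namespace */
};

struct toy_skb {
	struct toy_sock *owner;   /* NULL when no sending socket exists */
};

/* Take a reference on behalf of the skb, as an owner-setting helper would. */
static void toy_skb_set_owner(struct toy_skb *skb, struct toy_sock *sk)
{
	sk->refcnt++;
	skb->owner = sk;
}

/* Free the skb and drop the reference it held on its owner, if any. */
static void toy_skb_free(struct toy_skb *skb)
{
	if (skb->owner)
		skb->owner->refcnt--;
	free(skb);
}
```

While the skb holds its reference, the owner (and, in this model, its namespace identifier) can be read safely through skb->owner, which is the property the loopback patch relies on.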
* Re: [PATCH RFC net-next v13 00/13] vsock: add namespace support to vhost-vsock and loopback
From: Bobby Eshleman @ 2026-01-12 21:48 UTC
To: Stefano Garzarella
Cc: Michael S. Tsirkin, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Stefan Hajnoczi, Jason Wang,
Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Shuah Khan, Long Li,
linux-kernel, virtualization, netdev, kvm, linux-hyperv,
linux-kselftest, berrange, Sargun Dhillon, Bobby Eshleman
In-Reply-To: <aWUnqbDlBmjfnC_Q@sgarzare-redhat>
On Mon, Jan 12, 2026 at 06:26:18PM +0100, Stefano Garzarella wrote:
> On Sat, Jan 10, 2026 at 07:12:07PM -0500, Michael S. Tsirkin wrote:
> > On Fri, Jan 09, 2026 at 04:11:12PM -0800, Bobby Eshleman wrote:
> > > On Tue, Dec 23, 2025 at 04:28:34PM -0800, Bobby Eshleman wrote:
> > > > This series adds namespace support to vhost-vsock and loopback. It does
> > > > not add namespaces to any of the other guest transports (virtio-vsock,
> > > > hyperv, or vmci).
> > > >
> > > > The current revision supports two modes: local and global. Local
> > > > mode is complete isolation of namespaces, while global mode is complete
> > > > sharing between namespaces of CIDs (the original behavior).
> > > >
> > > > The mode is set using the parent namespace's
> > > > /proc/sys/net/vsock/child_ns_mode and inherited when a new namespace is
> > > > created. The mode of the current namespace can be queried by reading
> > > > /proc/sys/net/vsock/ns_mode. The mode can not change after the namespace
> > > > has been created.
> > > >
> > > > Modes are per-netns. This allows a system to configure namespaces
> > > > independently (some may share CIDs, others are completely isolated).
> > > > This also supports future possible mixed use cases, where there may be
> > > > namespaces in global mode spinning up VMs while there are mixed mode
> > > > namespaces that provide services to the VMs, but are not allowed to
> > > > allocate from the global CID pool (this mode is not implemented in this
> > > > series).
> > >
> > > Stefano, would you like me to resend this without the RFC tag, or should I
> > > just leave as is for review? I don't have any planned changes at the
> > > moment.
> > >
> > > Best,
> > > Bobby
> >
> > i couldn't apply it on top of net-next so pls do.
> >
>
> Yeah, some difficulties to apply also here.
> I tried `base-commit: 962ac5ca99a5c3e7469215bf47572440402dfd59` as mentioned
> in the cover, but didn't apply. After several tries I successfully applied
> on top of commit bc69ed975203 ("Merge tag 'for_linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost")
>
> So, I agree, better to resend and you can remove RFC.
>
> BTW I'll do my best to start to review tomorrow!
>
> Thanks,
> Stefano
>
Sounds good to me. Sorry about that, I must have done something weird
with b4 to pin the base commit because it has been
962ac5ca99a5c3e7469215bf47572440402dfd59 for the last several revisions.
Looks like my local copy of this is actually based on:
7b8e9264f55a ("Merge tag 'net-6.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
I'll re-apply to head and resend today!
While I'm at it I will try to address Michael's feedback and bump a
version.
Best,
Bobby
* RE: [EXTERNAL] Re: [PATCH V2,net-next, 2/2] net: mana: Add ethtool counters for RX CQEs in coalesced type
From: Haiyang Zhang @ 2026-01-12 21:03 UTC
To: Jakub Kicinski, Haiyang Zhang
Cc: linux-hyperv@vger.kernel.org, netdev@vger.kernel.org,
KY Srinivasan, Wei Liu, Dexuan Cui, Long Li, Andrew Lunn,
David S. Miller, Eric Dumazet, Paolo Abeni, Konstantin Taranov,
Simon Horman, Erni Sri Satya Vennela, Shradha Gupta,
Saurabh Sengar, Aditya Garg, Dipayaan Roy, Shiraz Saleem,
linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org,
Paul Rosswurm
In-Reply-To: <20260109175620.3e461176@kernel.org>
> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Friday, January 9, 2026 8:56 PM
> To: Haiyang Zhang <haiyangz@linux.microsoft.com>
> Cc: linux-hyperv@vger.kernel.org; netdev@vger.kernel.org; KY Srinivasan
> <kys@microsoft.com>; Haiyang Zhang <haiyangz@microsoft.com>; Wei Liu
> <wei.liu@kernel.org>; Dexuan Cui <DECUI@microsoft.com>; Long Li
> <longli@microsoft.com>; Andrew Lunn <andrew+netdev@lunn.ch>; David S.
> Miller <davem@davemloft.net>; Eric Dumazet <edumazet@google.com>; Paolo
> Abeni <pabeni@redhat.com>; Konstantin Taranov <kotaranov@microsoft.com>;
> Simon Horman <horms@kernel.org>; Erni Sri Satya Vennela
> <ernis@linux.microsoft.com>; Shradha Gupta
> <shradhagupta@linux.microsoft.com>; Saurabh Sengar
> <ssengar@linux.microsoft.com>; Aditya Garg
> <gargaditya@linux.microsoft.com>; Dipayaan Roy
> <dipayanroy@linux.microsoft.com>; Shiraz Saleem
> <shirazsaleem@microsoft.com>; linux-kernel@vger.kernel.org; linux-
> rdma@vger.kernel.org; Paul Rosswurm <paulros@microsoft.com>
> Subject: [EXTERNAL] Re: [PATCH V2,net-next, 2/2] net: mana: Add ethtool
> counters for RX CQEs in coalesced type
>
> On Tue, 6 Jan 2026 12:46:47 -0800 Haiyang Zhang wrote:
> > @@ -227,8 +232,6 @@ struct mana_rxcomp_perpkt_info {
> > u32 pkt_hash;
> > }; /* HW DATA */
> >
> > -#define MANA_RXCOMP_OOB_NUM_PPI 4
> > -
> > /* Receive completion OOB */
> > struct mana_rxcomp_oob {
> > struct mana_cqe_header cqe_hdr;
> > @@ -378,7 +381,6 @@ struct mana_ethtool_stats {
> > u64 tx_cqe_err;
> > u64 tx_cqe_unknown_type;
> > u64 tx_linear_pkt_cnt;
> > - u64 rx_coalesced_err;
> > u64 rx_cqe_unknown_type;
> > };
>
> This should be deleted in the previous patch already
Will do.
- Haiyang
* RE: [EXTERNAL] Re: [PATCH V2,net-next, 1/2] net: mana: Add support for coalesced RX packets on CQE
From: Haiyang Zhang @ 2026-01-12 21:01 UTC
To: Jakub Kicinski, Haiyang Zhang
Cc: linux-hyperv@vger.kernel.org, netdev@vger.kernel.org,
KY Srinivasan, Wei Liu, Dexuan Cui, Long Li, Andrew Lunn,
David S. Miller, Eric Dumazet, Paolo Abeni, Konstantin Taranov,
Simon Horman, Erni Sri Satya Vennela, Shradha Gupta,
Saurabh Sengar, Aditya Garg, Dipayaan Roy, Shiraz Saleem,
linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org,
Paul Rosswurm
In-Reply-To: <20260109175610.0eb69acb@kernel.org>
> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Friday, January 9, 2026 8:56 PM
> To: Haiyang Zhang <haiyangz@linux.microsoft.com>
> Cc: linux-hyperv@vger.kernel.org; netdev@vger.kernel.org; KY Srinivasan
> <kys@microsoft.com>; Haiyang Zhang <haiyangz@microsoft.com>; Wei Liu
> <wei.liu@kernel.org>; Dexuan Cui <DECUI@microsoft.com>; Long Li
> <longli@microsoft.com>; Andrew Lunn <andrew+netdev@lunn.ch>; David S.
> Miller <davem@davemloft.net>; Eric Dumazet <edumazet@google.com>; Paolo
> Abeni <pabeni@redhat.com>; Konstantin Taranov <kotaranov@microsoft.com>;
> Simon Horman <horms@kernel.org>; Erni Sri Satya Vennela
> <ernis@linux.microsoft.com>; Shradha Gupta
> <shradhagupta@linux.microsoft.com>; Saurabh Sengar
> <ssengar@linux.microsoft.com>; Aditya Garg
> <gargaditya@linux.microsoft.com>; Dipayaan Roy
> <dipayanroy@linux.microsoft.com>; Shiraz Saleem
> <shirazsaleem@microsoft.com>; linux-kernel@vger.kernel.org; linux-
> rdma@vger.kernel.org; Paul Rosswurm <paulros@microsoft.com>
> Subject: [EXTERNAL] Re: [PATCH V2,net-next, 1/2] net: mana: Add support
> for coalesced RX packets on CQE
>
> On Tue, 6 Jan 2026 12:46:46 -0800 Haiyang Zhang wrote:
> > From: Haiyang Zhang <haiyangz@microsoft.com>
> >
> > Our NIC can have up to 4 RX packets on 1 CQE. To support this feature,
> > check and process the type CQE_RX_COALESCED_4. The default setting is
> > disabled, to avoid possible regression on latency.
> >
> > And add ethtool handler to switch this feature. To turn it on, run:
> > ethtool -C <nic> rx-frames 4
> > To turn it off:
> > ethtool -C <nic> rx-frames 1
>
> Exposing just rx frame count, and only two values is quite unusual.
> Please explain in more detail the coalescing logic of the device.
Our NIC device only supports coalescing on RX. And when it's disabled each
RX CQE indicates 1 RX packet; when enabled each RX CQE indicates up to 4 packets.
>
> > @@ -2079,14 +2081,10 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
> > return;
> > }
> >
> > - pktlen = oob->ppi[0].pkt_len;
> > -
> > - if (pktlen == 0) {
> > - /* data packets should never have packetlength of zero */
> > - netdev_err(ndev, "RX pkt len=0, rq=%u, cq=%u, rxobj=0x%llx\n",
> > - rxq->gdma_id, cq->gdma_id, rxq->rxobj);
> > +nextpkt:
> > + pktlen = oob->ppi[i].pkt_len;
> > + if (pktlen == 0)
> > return;
> > - }
> >
> > curr = rxq->buf_index;
> > rxbuf_oob = &rxq->rx_oobs[curr];
> > @@ -2097,12 +2095,15 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
> > /* Unsuccessful refill will have old_buf == NULL.
> > * In this case, mana_rx_skb() will drop the packet.
> > */
> > - mana_rx_skb(old_buf, old_fp, oob, rxq);
> > + mana_rx_skb(old_buf, old_fp, oob, rxq, i);
> >
> > drop:
> > mana_move_wq_tail(rxq->gdma_rq, rxbuf_oob->wqe_inf.wqe_size_in_bu);
> >
> > mana_post_pkt_rxq(rxq);
> > +
> > + if (coalesced && (++i < MANA_RXCOMP_OOB_NUM_PPI))
> > + goto nextpkt;
>
> Please code this up as a loop. Using gotos for control flow other than
> to jump to error handling epilogues is a poor coding practice (see the
> kernel coding style).
Will do.
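One possible loop shape for the requested change, written as a simplified user-space model rather than driver code, is sketched below. struct toy_rx_cqe and toy_process_rx_cqe() are assumptions for illustration; the real function also refills buffers, moves the work queue tail, and hands packets to the stack, all of which is reduced to a comment here.

```c
#include <stdbool.h>
#include <stdint.h>

#define MANA_RXCOMP_OOB_NUM_PPI 4

/* Simplified stand-in for the per-packet info carried by one RX CQE. */
struct toy_rx_cqe {
	uint32_t pkt_len[MANA_RXCOMP_OOB_NUM_PPI];
};

/*
 * Walk the packets described by one CQE and return how many were handled.
 * A non-coalesced CQE carries exactly one slot; a coalesced CQE carries up
 * to MANA_RXCOMP_OOB_NUM_PPI, terminated early by a zero length.
 */
static unsigned int toy_process_rx_cqe(const struct toy_rx_cqe *cqe,
				       bool coalesced)
{
	unsigned int limit = coalesced ? MANA_RXCOMP_OOB_NUM_PPI : 1;
	unsigned int i;

	for (i = 0; i < limit; i++) {
		if (cqe->pkt_len[i] == 0)
			break;
		/* The driver would hand buffer i to the stack and refill here. */
	}
	return i;
}
```

The loop bound replaces the "coalesced && ++i < MANA_RXCOMP_OOB_NUM_PPI" test and the break replaces the early return on a zero length, so the visit order matches the goto version in the quoted hunk.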
>
> > +static int mana_set_coalesce(struct net_device *ndev,
> > + struct ethtool_coalesce *ec,
> > + struct kernel_ethtool_coalesce *kernel_coal,
> > + struct netlink_ext_ack *extack)
> > +{
> > + struct mana_port_context *apc = netdev_priv(ndev);
> > + u8 saved_cqe_coalescing_enable;
> > + int err;
> > +
> > + if (ec->rx_max_coalesced_frames != 1 &&
> > + ec->rx_max_coalesced_frames != MANA_RXCOMP_OOB_NUM_PPI) {
> > + NL_SET_ERR_MSG_FMT(extack,
> > + "rx-frames must be 1 or %u, got %u",
> > + MANA_RXCOMP_OOB_NUM_PPI,
> > + ec->rx_max_coalesced_frames);
> > + return -EINVAL;
> > + }
> > +
> > + saved_cqe_coalescing_enable = apc->cqe_coalescing_enable;
> > + apc->cqe_coalescing_enable =
> > + ec->rx_max_coalesced_frames == MANA_RXCOMP_OOB_NUM_PPI;
> > +
> > + if (!apc->port_is_up)
> > + return 0;
> > +
> > + err = mana_config_rss(apc, TRI_STATE_TRUE, false, false);
> > +
>
> unnecessary empty line
Will rm.
>
> > + if (err) {
> > + netdev_err(ndev, "Set rx-frames to %u failed:%d\n",
> > + ec->rx_max_coalesced_frames, err);
> > + NL_SET_ERR_MSG_FMT(extack, "Set rx-frames to %u failed",
> > + ec->rx_max_coalesced_frames);
>
> These messages are both pointless. If HW communication has failed
> presumably there will already be an error in the logs. The extack
> gives the user no information they wouldn't already have.
Will rm.
Thanks,
- Haiyang
* [PATCH v1] mshv: make certain field names descriptive in a header struct
From: Mukesh Rathor @ 2026-01-12 19:49 UTC
To: linux-hyperv; +Cc: wei.liu, nunodasneves
When header struct fields use very common names like "pages" or "type",
it makes it difficult to find uses of these fields with tools like grep
and cscope. Add the prefix mreg_ to some fields in struct
mshv_mem_region to make it easier to find them.
There is no functional change.
Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
---
drivers/hv/mshv_regions.c | 44 ++++++++++++++++++-------------------
drivers/hv/mshv_root.h | 6 ++---
drivers/hv/mshv_root_main.c | 10 ++++-----
3 files changed, 30 insertions(+), 30 deletions(-)
diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index 202b9d551e39..af81405f859b 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -52,7 +52,7 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
struct page *page;
int ret;
- page = region->pages[page_offset];
+ page = region->mreg_pages[page_offset];
if (!page)
return -EINVAL;
@@ -65,7 +65,7 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
/* Start at stride since the first page is validated */
for (count = stride; count < page_count; count += stride) {
- page = region->pages[page_offset + count];
+ page = region->mreg_pages[page_offset + count];
/* Break if current page is not present */
if (!page)
@@ -117,7 +117,7 @@ static int mshv_region_process_range(struct mshv_mem_region *region,
while (page_count) {
/* Skip non-present pages */
- if (!region->pages[page_offset]) {
+ if (!region->mreg_pages[page_offset]) {
page_offset++;
page_count--;
continue;
@@ -164,13 +164,13 @@ static int mshv_region_chunk_share(struct mshv_mem_region *region,
u32 flags,
u64 page_offset, u64 page_count)
{
- struct page *page = region->pages[page_offset];
+ struct page *page = region->mreg_pages[page_offset];
if (PageHuge(page) || PageTransCompound(page))
flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
return hv_call_modify_spa_host_access(region->partition->pt_id,
- region->pages + page_offset,
+ region->mreg_pages + page_offset,
page_count,
HV_MAP_GPA_READABLE |
HV_MAP_GPA_WRITABLE,
@@ -190,13 +190,13 @@ static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
u32 flags,
u64 page_offset, u64 page_count)
{
- struct page *page = region->pages[page_offset];
+ struct page *page = region->mreg_pages[page_offset];
if (PageHuge(page) || PageTransCompound(page))
flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
return hv_call_modify_spa_host_access(region->partition->pt_id,
- region->pages + page_offset,
+ region->mreg_pages + page_offset,
page_count, 0,
flags, false);
}
@@ -214,7 +214,7 @@ static int mshv_region_chunk_remap(struct mshv_mem_region *region,
u32 flags,
u64 page_offset, u64 page_count)
{
- struct page *page = region->pages[page_offset];
+ struct page *page = region->mreg_pages[page_offset];
if (PageHuge(page) || PageTransCompound(page))
flags |= HV_MAP_GPA_LARGE_PAGE;
@@ -222,7 +222,7 @@ static int mshv_region_chunk_remap(struct mshv_mem_region *region,
return hv_call_map_gpa_pages(region->partition->pt_id,
region->start_gfn + page_offset,
page_count, flags,
- region->pages + page_offset);
+ region->mreg_pages + page_offset);
}
static int mshv_region_remap_pages(struct mshv_mem_region *region,
@@ -245,10 +245,10 @@ int mshv_region_map(struct mshv_mem_region *region)
static void mshv_region_invalidate_pages(struct mshv_mem_region *region,
u64 page_offset, u64 page_count)
{
- if (region->type == MSHV_REGION_TYPE_MEM_PINNED)
- unpin_user_pages(region->pages + page_offset, page_count);
+ if (region->mreg_type == MSHV_REGION_TYPE_MEM_PINNED)
+ unpin_user_pages(region->mreg_pages + page_offset, page_count);
- memset(region->pages + page_offset, 0,
+ memset(region->mreg_pages + page_offset, 0,
page_count * sizeof(struct page *));
}
@@ -265,7 +265,7 @@ int mshv_region_pin(struct mshv_mem_region *region)
int ret;
for (done_count = 0; done_count < region->nr_pages; done_count += ret) {
- pages = region->pages + done_count;
+ pages = region->mreg_pages + done_count;
userspace_addr = region->start_uaddr +
done_count * HV_HYP_PAGE_SIZE;
nr_pages = min(region->nr_pages - done_count,
@@ -297,7 +297,7 @@ static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
u32 flags,
u64 page_offset, u64 page_count)
{
- struct page *page = region->pages[page_offset];
+ struct page *page = region->mreg_pages[page_offset];
if (PageHuge(page) || PageTransCompound(page))
flags |= HV_UNMAP_GPA_LARGE_PAGE;
@@ -321,7 +321,7 @@ static void mshv_region_destroy(struct kref *ref)
struct mshv_partition *partition = region->partition;
int ret;
- if (region->type == MSHV_REGION_TYPE_MEM_MOVABLE)
+ if (region->mreg_type == MSHV_REGION_TYPE_MEM_MOVABLE)
mshv_region_movable_fini(region);
if (mshv_partition_encrypted(partition)) {
@@ -374,9 +374,9 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
int ret;
range->notifier_seq = mmu_interval_read_begin(range->notifier);
- mmap_read_lock(region->mni.mm);
+ mmap_read_lock(region->mreg_mni.mm);
ret = hmm_range_fault(range);
- mmap_read_unlock(region->mni.mm);
+ mmap_read_unlock(region->mreg_mni.mm);
if (ret)
return ret;
@@ -407,7 +407,7 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
u64 page_offset, u64 page_count)
{
struct hmm_range range = {
- .notifier = &region->mni,
+ .notifier = &region->mreg_mni,
.default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
};
unsigned long *pfns;
@@ -430,7 +430,7 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
goto out;
for (i = 0; i < page_count; i++)
- region->pages[page_offset + i] = hmm_pfn_to_page(pfns[i]);
+ region->mreg_pages[page_offset + i] = hmm_pfn_to_page(pfns[i]);
ret = mshv_region_remap_pages(region, region->hv_map_flags,
page_offset, page_count);
@@ -489,7 +489,7 @@ static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
{
struct mshv_mem_region *region = container_of(mni,
struct mshv_mem_region,
- mni);
+ mreg_mni);
u64 page_offset, page_count;
unsigned long mstart, mend;
int ret = -EPERM;
@@ -535,14 +535,14 @@ static const struct mmu_interval_notifier_ops mshv_region_mni_ops = {
void mshv_region_movable_fini(struct mshv_mem_region *region)
{
- mmu_interval_notifier_remove(&region->mni);
+ mmu_interval_notifier_remove(&region->mreg_mni);
}
bool mshv_region_movable_init(struct mshv_mem_region *region)
{
int ret;
- ret = mmu_interval_notifier_insert(&region->mni, current->mm,
+ ret = mmu_interval_notifier_insert(&region->mreg_mni, current->mm,
region->start_uaddr,
region->nr_pages << HV_HYP_PAGE_SHIFT,
&mshv_region_mni_ops);
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 3c1d88b36741..f5b6d3979e5a 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -85,10 +85,10 @@ struct mshv_mem_region {
u64 start_uaddr;
u32 hv_map_flags;
struct mshv_partition *partition;
- enum mshv_region_type type;
- struct mmu_interval_notifier mni;
+ enum mshv_region_type mreg_type;
+ struct mmu_interval_notifier mreg_mni;
struct mutex mutex; /* protects region pages remapping */
- struct page *pages[];
+ struct page *mreg_pages[];
};
struct mshv_irq_ack_notifier {
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 1134a82c7881..eff1b21461dc 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -657,7 +657,7 @@ static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
return false;
/* Only movable memory ranges are supported for GPA intercepts */
- if (region->type == MSHV_REGION_TYPE_MEM_MOVABLE)
+ if (region->mreg_type == MSHV_REGION_TYPE_MEM_MOVABLE)
ret = mshv_region_handle_gfn_fault(region, gfn);
else
ret = false;
@@ -1175,12 +1175,12 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
return PTR_ERR(rg);
if (is_mmio)
- rg->type = MSHV_REGION_TYPE_MMIO;
+ rg->mreg_type = MSHV_REGION_TYPE_MMIO;
else if (mshv_partition_encrypted(partition) ||
!mshv_region_movable_init(rg))
- rg->type = MSHV_REGION_TYPE_MEM_PINNED;
+ rg->mreg_type = MSHV_REGION_TYPE_MEM_PINNED;
else
- rg->type = MSHV_REGION_TYPE_MEM_MOVABLE;
+ rg->mreg_type = MSHV_REGION_TYPE_MEM_MOVABLE;
rg->partition = partition;
@@ -1297,7 +1297,7 @@ mshv_map_user_memory(struct mshv_partition *partition,
if (ret)
return ret;
- switch (region->type) {
+ switch (region->mreg_type) {
case MSHV_REGION_TYPE_MEM_PINNED:
ret = mshv_prepare_pinned_region(region);
break;
--
2.51.2.vfs.0.1
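One detail of a rename like this is that member names also appear as arguments to container_of(), so those call sites have to change together with the struct. The user-space sketch below shows that with a cut-down model. toy_container_of(), struct toy_notifier, and struct toy_mem_region are stand-ins invented for this example, not the kernel definitions.

```c
#include <stddef.h>

/* User-space equivalent of the kernel's container_of(), for illustration. */
#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct toy_notifier {
	int seq;
};

/* Cut-down model of struct mshv_mem_region with the prefixed field names. */
struct toy_mem_region {
	unsigned long nr_pages;
	int mreg_type;
	struct toy_notifier mreg_mni;
	void *mreg_pages[];   /* flexible array member stays last */
};

/* Recover the enclosing region from a pointer to its embedded notifier. */
static struct toy_mem_region *region_from_mni(struct toy_notifier *mni)
{
	return toy_container_of(mni, struct toy_mem_region, mreg_mni);
}
```

Because the member name is a macro argument, a stale "mni" here fails at compile time, which makes a mechanical rename of this kind easy to verify.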
* RE: [PATCH] scsi: storvsc: Process unsupported MODE_SENSE_10
From: Long Li @ 2026-01-12 17:52 UTC
To: Michael Kelley, longli@linux.microsoft.com, KY Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, James E.J. Bottomley,
Martin K. Petersen, James Bottomley, linux-hyperv@vger.kernel.org,
linux-scsi@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: stable@kernel.org
In-Reply-To: <SN6PR02MB4157232BE7BB9B6B1AA81AEBD482A@SN6PR02MB4157.namprd02.prod.outlook.com>
> > @@ -1154,6 +1154,7 @@ static void storvsc_on_io_completion(struct storvsc_device *stor_device,
> >
> > if ((stor_pkt->vm_srb.cdb[0] == INQUIRY) ||
> > (stor_pkt->vm_srb.cdb[0] == MODE_SENSE) ||
> > + (stor_pkt->vm_srb.cdb[0] == MODE_SENSE_10) ||
> > (stor_pkt->vm_srb.cdb[0] == MAINTENANCE_IN &&
> > hv_dev_is_fc(device))) {
> > vstor_packet->vm_srb.scsi_status = 0;
>
> There's a code comment above this "if" statement that describes the situation.
> The comment specifically lists INQUIRY, MODE_SENSE, and
> MAINTENANCE_IN. For consistency, it should be updated to include
> MODE_SENSE_10.
>
> With the comment updated,
>
> Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Will send v2, thank you.
Long
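The condition in the quoted hunk can be read as a small predicate over the SCSI opcode. The following user-space sketch restates it, assuming the standard opcode values. toy_should_mask_failure() and its is_fc parameter are hypothetical names, with is_fc standing in for the result of hv_dev_is_fc(device).

```c
#include <stdbool.h>
#include <stdint.h>

/* SCSI operation codes (standard values). */
#define INQUIRY        0x12
#define MODE_SENSE     0x1a
#define MODE_SENSE_10  0x5a
#define MAINTENANCE_IN 0xa3

/*
 * Decide whether a failed command should have its SCSI status cleared,
 * as the quoted hunk does by setting scsi_status to 0.
 * is_fc is a hypothetical parameter standing in for hv_dev_is_fc(device).
 */
static bool toy_should_mask_failure(uint8_t opcode, bool is_fc)
{
	switch (opcode) {
	case INQUIRY:
	case MODE_SENSE:
	case MODE_SENSE_10:
		return true;
	case MAINTENANCE_IN:
		return is_fc;
	default:
		return false;
	}
}
```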
* RE: [RFC v1 5/5] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest
From: Michael Kelley @ 2026-01-12 17:48 UTC
To: Yu Zhang
Cc: linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
iommu@lists.linux.dev, linux-pci@vger.kernel.org,
kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
decui@microsoft.com, lpieralisi@kernel.org,
kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org,
bhelgaas@google.com, arnd@arndb.de, joro@8bytes.org,
will@kernel.org, robin.murphy@arm.com,
easwar.hariharan@linux.microsoft.com,
jacob.pan@linux.microsoft.com, nunodasneves@linux.microsoft.com,
mrathor@linux.microsoft.com, peterz@infradead.org,
linux-arch@vger.kernel.org
In-Reply-To: <dws34g6znmam7eabwetg722b4wgf2wxufcqxqphhbqlryx23mb@we5utwanawe2>
From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, January 12, 2026 8:56 AM
>
> On Thu, Jan 08, 2026 at 06:48:59PM +0000, Michael Kelley wrote:
> > From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, December 8, 2025 9:11 PM
>
> <snip>
> Thank you so much, Michael, for the thorough review!
>
> I've snipped some comments I fully agree with and will address in
> next version. Actually, I have to admit I agree with your remaining
> comments below as well. :)
>
> > > +struct hv_iommu_dev *hv_iommu_device;
> > > +static struct hv_iommu_domain hv_identity_domain;
> > > +static struct hv_iommu_domain hv_blocking_domain;
> >
> > Why is hv_iommu_device allocated dynamically while the two
> > domains are allocated statically? Seems like the approach could
> > be consistent, though maybe there's some reason I'm missing.
> >
>
> On second thought, `hv_identity_domain` and `hv_blocking_domain` should
> likely be allocated dynamically as well, consistent with `hv_iommu_device`.
I don't know if there's a strong rationale either way (static allocation vs.
dynamic). If the long-term expectation is that there is never more than one
PV IOMMU in a guest, then static would be OK. If future direction allows that
there could be multiple PV IOMMUs in a guest, then doing dynamic from
the start is justifiable (though the current PV IOMMU hypercalls seem to
assume only one PV IOMMU). But either way, being consistent is desirable.
>
> <snip>
> > > +static int hv_iommu_get_logical_device_property(struct device *dev,
> > > + enum hv_logical_device_property_code code,
> > > + struct hv_output_get_logical_device_property *property)
> > > +{
> > > + u64 status;
> > > + unsigned long flags;
> > > + struct hv_input_get_logical_device_property *input;
> > > + struct hv_output_get_logical_device_property *output;
> > > +
> > > + local_irq_save(flags);
> > > +
> > > + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > > + output = *this_cpu_ptr(hyperv_pcpu_output_arg);
> > > + memset(input, 0, sizeof(*input));
> > > + memset(output, 0, sizeof(*output));
> >
> > General practice is to *not* zero the output area prior to a hypercall. The hypervisor
> > should be correctly setting all the output bits. There are a couple of cases in the new
> > MSHV code where the output is zero'ed, but I'm planning to submit a patch to
> > remove those so that hypercall call sites that have output are consistent across the
> > code base. Of course, it's possible to have a Hyper-V bug where it doesn't do the
> > right thing, and zero'ing the output could be done as a workaround. But such cases
> > should be explicitly known with code comments indicating the reason for the
> > zero'ing.
> >
> > Same applies in hv_iommu_detect().
> >
>
> Thanks for the information! Just to clarify: this is only because Hyper-V is
> supposed to zero the output page, and for input page, memset is still needed.
> Am I correct?
Yes, you are correct.
The general TLFS requirement for hypercall input is that unused fields and bits
are set to zero. This requirement ensures forward compatibility if a later version of
the hypervisor assigns some meaning to previously unused fields/bits. So best practice
for hypercall call sites is to use memset() to zero the entire input area, and then specific
field values are set on top of that. Any fields/bits that aren't explicitly set then meet
the TLFS requirement.
It would be OK if a hypercall call site explicitly set every field/bit instead of using
memset(), but it's easy to unintentionally miss a field/bit and create a forward
compatibility problem. However, when the hypercall input contains a large array,
the code usually does *not* do memset() on the large array because of the perf
impact, but instead the code populating the large array must be careful to not leave
any bits uninitialized.
For hypercall output, the hypervisor essentially has the same requirement. It should
make sure that any unused fields/bits in the output area are zero, so that the Linux
guest can properly deal with a future hypervisor version that assigns meaning to
previously unused fields/bits.
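To make the input-side practice concrete, here is a minimal user-space sketch of
the "memset() first, then set specific fields" pattern. The struct layout and
the demo_* names are hypothetical and exist only for illustration; they are not
a real Hyper-V hypercall ABI, and no hypercall is actually issued.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical input layout, for illustration only (not a real Hyper-V ABI). */
struct demo_hv_input {
	uint64_t partition_id;
	uint32_t code;
	uint32_t reserved;	/* unused today, so it must be zero */
	uint64_t reserved2[2];	/* unused today, so it must be zero */
};

/*
 * Zero the entire input area, then set only the fields that are in use.
 * Any field or bit not explicitly set below is guaranteed to be zero.
 */
static void demo_prepare_input(struct demo_hv_input *input,
			       uint64_t partition_id, uint32_t code)
{
	assert(input);
	memset(input, 0, sizeof(*input));
	input->partition_id = partition_id;
	input->code = code;
}

/* Returns 1 if every currently unused field is zero, else 0. */
static int demo_reserved_is_zero(const struct demo_hv_input *input)
{
	return input->reserved == 0 &&
	       input->reserved2[0] == 0 &&
	       input->reserved2[1] == 0;
}
```

Even if the buffer held stale data from a previous use, the reserved fields end
up zero without the call site having to name each one.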
>
> <snip>
>
> > > +static void hv_iommu_shutdown(void)
> > > +{
> > > + iommu_device_sysfs_remove(&hv_iommu_device->iommu);
> > > +
> > > + kfree(hv_iommu_device);
> > > +}
> > > +
> > > +static struct syscore_ops hv_iommu_syscore_ops = {
> > > + .shutdown = hv_iommu_shutdown,
> > > +};
> >
> > Why is a shutdown needed at all? hv_iommu_shutdown() doesn't do anything
> > that's really needed, since sysfs entries are transient, and freeing memory isn't
> > relevant for a shutdown.
> >
>
> For iommu_device_sysfs_remove(), I guess they are not necessary, and
> I will need to do some homework to better understand the sysfs. :)
> Originally, we wanted a shutdown routine to trigger some hypercall,
> so that Hyper-V will disable the DMA translation, e.g., during the VM
> reboot process.
I would presume that if Hyper-V reboots the VM, Hyper-V automatically
resets the PV IOMMU and prevents any further DMA operations. But
consider kexec(), where a new kernel gets loaded without going through
the hypervisor "reboot-this-VM" path. There have been problems in the
past with kexec() where parts of Hyper-V state for the guest didn't get
reset, and the PV IOMMU is likely something in that category. So there
may indeed be a need to tell the hypervisor to reset everything related
to the PV IOMMU. There are already functions to do Hyper-V cleanup: see
vmbus_initiate_unload() and hyperv_cleanup(). These existing functions
may be a better place to do PV IOMMU cleanup/reset if needed.
>
> <snip>
>
> > > +device_initcall(hv_iommu_init);
> >
> > I'm concerned about the timing of this initialization. VMBus is initialized with
> > subsys_initcall(), which is initcall level 4 while device_initcall() is initcall level 6.
> > So VMBus initialization happens quite a bit earlier, and the hypervisor starts
> > offering devices to the guest, including PCI pass-thru devices, before the
> > IOMMU initialization starts. I cobbled together a way to make this IOMMU code
> > run in an Azure VM using the identity domain. The VM has an NVMe OS disk,
> > two NVMe data disks, and a MANA NIC. The NVMe devices were offered, and
> > completed hv_pci_probe() before this IOMMU initialization was started. When
> > IOMMU initialization did run, it went back and found the NVMe devices. But
> > I'm unsure if that's OK because my hacked together environment obviously
> > couldn't do real IOMMU mapping. It appears that the NVMe device driver
> > didn't start its initialization until after the IOMMU driver was setup, which
> > would probably make everything OK. But that might be just timing luck, or
> > maybe there's something that affirmatively prevents the native PCI driver
> > (like NVMe) from getting started until after all the initcalls have finished.
> >
>
> This is yet another immature attempt by me to do the hv_iommu_init() in
> an arch-independent path. And I do not think using device_initcall() is
> harmless. This patch set was tested using an assigned Intel DSA device,
> and the DMA tests succeeded w/o any error. But that is not enough to
> justify using device_initcall(): I reset the idxd driver as kernel
> builtin and realized, just like you said, both hv_pci_probe() and
> idxd_pci_probe() were triggered before hv_iommu_init(), and when pvIOMMU
> tries to probe the endpoint device, a warning is printed:
>
> [ 3.609697] idxd 13d7:00:00.0: late IOMMU probe at driver bind, something fishy here!
>
You succeeded in doing what I was going to try! I won't spend time on it now.
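For reference, the ordering at issue can be sketched in user space. This is only
a toy model under stated assumptions: the demo_* names are hypothetical, the
name strings are just labels, and the real kernel runs initcalls from linker
sections rather than by sorting an array. The two level numbers are the ones
mentioned above (subsys_initcall() is level 4, device_initcall() is level 6).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Initcall levels discussed in this thread. */
enum demo_initcall_level {
	DEMO_LEVEL_SUBSYS = 4,	/* subsys_initcall() */
	DEMO_LEVEL_DEVICE = 6,	/* device_initcall() */
};

/* Hypothetical record for one registered initcall. */
struct demo_initcall {
	int level;
	const char *name;
};

/* qsort() comparator: lower initcall level sorts first. */
static int demo_cmp_level(const void *a, const void *b)
{
	const struct demo_initcall *x = a, *y = b;

	return x->level - y->level;
}

/* Order the calls the way boot would run them: lowest level first. */
static void demo_sort_initcalls(struct demo_initcall *calls, size_t n)
{
	assert(calls);
	qsort(calls, n, sizeof(*calls), demo_cmp_level);
}
```

With one entry at level 4 and one at level 6, the level 4 entry always comes
first, which is why devices can be offered and probed before a level 6 IOMMU
init has run.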
> > I'm planning to look at this further to see if there's a way for a PCI driver
> > to try initializing a pass-thru device *before* this IOMMU driver has initialized.
> > If so, a different way to do the IOMMU initialization will be needed that is
> > linked to VMBus initialization so things can't happen out-of-order. Establishing
> > such a linkage is probably a good idea regardless.
> >
> > FWIW, the Azure VM with the 3 NVMe devices and MANA, and operating with
> > the identity IOMMU domain, all seemed to work fine! Got 4 IOMMU groups,
> > and devices coming and going dynamically all worked correctly. When a device
> > was removed, it was moved to the blocking domain, and then flushed before
> > being finally removed. All good! I wish I had a way to test with an IOMMU
> > paging domain that was doing real translation.
> >
>
> Thank you, Michael! I really appreciate you running these extra experiments!
>
> My tests on this DSA device passed (using paging domain) too, with no DMA
> errors observed (regardless of whether its driver is built in or a kernel module).
> But that doesn't make me confident about using `device_initcall`. I believe
> your concern is valid. E.g., an endpoint device might allocate a DMA address
> (using a raw GPA, instead of gIOVA) before pvIOMMU is initialized, and then
> use that address for DMA later, after a paging domain is attached?
Yes, that's exactly my concern.
>
> > > diff --git a/drivers/iommu/hyperv/iommu.h b/drivers/iommu/hyperv/iommu.h
> > > new file mode 100644
> > > index 000000000000..c8657e791a6e
> > > --- /dev/null
> > > +++ b/drivers/iommu/hyperv/iommu.h
> > > @@ -0,0 +1,53 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +
> > > +/*
> > > + * Hyper-V IOMMU driver.
> > > + *
> > > + * Copyright (C) 2024-2025, Microsoft, Inc.
> > > + *
> > > + */
> > > +
> > > +#ifndef _HYPERV_IOMMU_H
> > > +#define _HYPERV_IOMMU_H
> > > +
> > > +struct hv_iommu_dev {
> > > + struct iommu_device iommu;
> > > + struct ida domain_ids;
> > > +
> > > + /* Device configuration */
> > > + u8 max_iova_width;
> > > + u8 max_pasid_width;
> > > + u64 cap;
> > > + u64 pgsize_bitmap;
> > > +
> > > + struct iommu_domain_geometry geometry;
> > > + u64 first_domain;
> > > + u64 last_domain;
> > > +};
> > > +
> > > +struct hv_iommu_domain {
> > > + union {
> > > + struct iommu_domain domain;
> > > + struct pt_iommu pt_iommu;
> > > + struct pt_iommu_x86_64 pt_iommu_x86_64;
> > > + };
> > > + struct hv_iommu_dev *hv_iommu;
> > > + struct hv_input_device_domain device_domain;
> > > + u64 pgsize_bitmap;
> > > +
> > > + spinlock_t lock; /* protects dev_list and TLB flushes */
> > > + /* List of devices in this DMA domain */
> >
> > It appears that this list is really a list of endpoints (i.e., struct
> > hv_iommu_endpoint), not devices (which I read to be struct
> > hv_iommu_dev).
> >
> > But that said, what is the list used for? I see code to add
> > endpoints to the list, and to remove them, but the list is never
> > walked by any code in this patch set. If there is an anticipated
> > future use, it would be better to add the list as part of the code
> > for that future use.
> >
>
> Yes, we do not really need this list for this patch set. Thanks!
>
> B.R.
> Yu
* Re: [PATCH RFC net-next v13 00/13] vsock: add namespace support to vhost-vsock and loopback
From: Stefano Garzarella @ 2026-01-12 17:26 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Bobby Eshleman, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Stefan Hajnoczi, Jason Wang,
Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Shuah Khan, Long Li,
linux-kernel, virtualization, netdev, kvm, linux-hyperv,
linux-kselftest, berrange, Sargun Dhillon, Bobby Eshleman
In-Reply-To: <20260110191107-mutt-send-email-mst@kernel.org>
On Sat, Jan 10, 2026 at 07:12:07PM -0500, Michael S. Tsirkin wrote:
>On Fri, Jan 09, 2026 at 04:11:12PM -0800, Bobby Eshleman wrote:
>> On Tue, Dec 23, 2025 at 04:28:34PM -0800, Bobby Eshleman wrote:
>> > This series adds namespace support to vhost-vsock and loopback. It does
>> > not add namespaces to any of the other guest transports (virtio-vsock,
>> > hyperv, or vmci).
>> >
>> > The current revision supports two modes: local and global. Local
>> > mode is complete isolation of namespaces, while global mode is complete
>> > sharing between namespaces of CIDs (the original behavior).
>> >
>> > The mode is set using the parent namespace's
>> > /proc/sys/net/vsock/child_ns_mode and inherited when a new namespace is
>> > created. The mode of the current namespace can be queried by reading
>> > /proc/sys/net/vsock/ns_mode. The mode can not change after the namespace
>> > has been created.
>> >
>> > Modes are per-netns. This allows a system to configure namespaces
>> > independently (some may share CIDs, others are completely isolated).
>> > This also supports future possible mixed use cases, where there may be
>> > namespaces in global mode spinning up VMs while there are mixed mode
>> > namespaces that provide services to the VMs, but are not allowed to
>> > allocate from the global CID pool (this mode is not implemented in this
>> > series).
>>
>> Stefano, would you like me to resend this without the RFC tag, or should I
>> just leave as is for review? I don't have any planned changes at the
>> moment.
>>
>> Best,
>> Bobby
>
>i couldn't apply it on top of net-next so pls do.
>
Yeah, I had some difficulty applying it here too.
I tried `base-commit: 962ac5ca99a5c3e7469215bf47572440402dfd59` as
mentioned in the cover letter, but it didn't apply. After several tries I
successfully applied on top of commit bc69ed975203 ("Merge tag
'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost")
So, I agree, better to resend and you can remove RFC.
BTW I'll do my best to start to review tomorrow!
Thanks,
Stefano
* Re: [PATCH 1/1] PCI: hv: Remove unused field pci_bus in struct hv_pcibus_device
From: Easwar Hariharan @ 2026-01-12 17:01 UTC (permalink / raw)
To: mhklinux
Cc: kys, haiyangz, wei.liu, decui, longli, lpieralisi, kwilczynski,
mani, robh, bhelgaas, easwar.hariharan, linux-pci, linux-kernel,
linux-hyperv
In-Reply-To: <20260111170034.67558-1-mhklinux@outlook.com>
On 1/11/2026 9:00 AM, mhkelley58@gmail.com wrote:
> From: Michael Kelley <mhklinux@outlook.com>
>
> Field pci_bus in struct hv_pcibus_device is unused since
> commit 418cb6c8e051 ("PCI: hv: Generify PCI probing"). Remove it.
>
> No functional change.
>
> Signed-off-by: Michael Kelley <mhklinux@outlook.com>
> ---
> drivers/pci/controller/pci-hyperv.c | 1 -
> 1 file changed, 1 deletion(-)
>
Reviewed-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
* Re: [RFC v1 5/5] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest
From: Yu Zhang @ 2026-01-12 16:56 UTC (permalink / raw)
To: Michael Kelley
Cc: linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
iommu@lists.linux.dev, linux-pci@vger.kernel.org,
kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
decui@microsoft.com, lpieralisi@kernel.org,
kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org,
bhelgaas@google.com, arnd@arndb.de, joro@8bytes.org,
will@kernel.org, robin.murphy@arm.com,
easwar.hariharan@linux.microsoft.com,
jacob.pan@linux.microsoft.com, nunodasneves@linux.microsoft.com,
mrathor@linux.microsoft.com, peterz@infradead.org,
linux-arch@vger.kernel.org
In-Reply-To: <SN6PR02MB41572D46CF6C1DE68974EAA1D485A@SN6PR02MB4157.namprd02.prod.outlook.com>
On Thu, Jan 08, 2026 at 06:48:59PM +0000, Michael Kelley wrote:
> From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, December 8, 2025 9:11 PM
<snip>
Thank you so much, Michael, for the thorough review!
I've snipped some comments I fully agree with and will address in
next version. Actually, I have to admit I agree with your remaining
comments below as well. :)
> > +struct hv_iommu_dev *hv_iommu_device;
> > +static struct hv_iommu_domain hv_identity_domain;
> > +static struct hv_iommu_domain hv_blocking_domain;
>
> Why is hv_iommu_device allocated dynamically while the two
> domains are allocated statically? Seems like the approach could
> be consistent, though maybe there's some reason I'm missing.
>
On second thought, `hv_identity_domain` and `hv_blocking_domain` should
likely be allocated dynamically as well, consistent with `hv_iommu_device`.
<snip>
> > +static int hv_iommu_get_logical_device_property(struct device *dev,
> > + enum hv_logical_device_property_code code,
> > + struct hv_output_get_logical_device_property *property)
> > +{
> > + u64 status;
> > + unsigned long flags;
> > + struct hv_input_get_logical_device_property *input;
> > + struct hv_output_get_logical_device_property *output;
> > +
> > + local_irq_save(flags);
> > +
> > + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > + output = *this_cpu_ptr(hyperv_pcpu_output_arg);
> > + memset(input, 0, sizeof(*input));
> > + memset(output, 0, sizeof(*output));
>
> General practice is to *not* zero the output area prior to a hypercall. The hypervisor
> should be correctly setting all the output bits. There are a couple of cases in the new
> MSHV code where the output is zero'ed, but I'm planning to submit a patch to
> remove those so that hypercall call sites that have output are consistent across the
> code base. Of course, it's possible to have a Hyper-V bug where it doesn't do the
> right thing, and zero'ing the output could be done as a workaround. But such cases
> should be explicitly known with code comments indicating the reason for the
> zero'ing.
>
> Same applies in hv_iommu_detect().
>
Thanks for the information! Just to clarify: this is only because Hyper-V is
supposed to zero the output page, and for input page, memset is still needed.
Am I correct?
<snip>
> > +static void hv_iommu_shutdown(void)
> > +{
> > + iommu_device_sysfs_remove(&hv_iommu_device->iommu);
> > +
> > + kfree(hv_iommu_device);
> > +}
> > +
> > +static struct syscore_ops hv_iommu_syscore_ops = {
> > + .shutdown = hv_iommu_shutdown,
> > +};
>
> Why is a shutdown needed at all? hv_iommu_shutdown() doesn't do anything
> that's really needed, since sysfs entries are transient, and freeing memory isn't
> relevant for a shutdown.
>
For iommu_device_sysfs_remove(), I guess they are not necessary, and
I will need to do some homework to better understand the sysfs. :)
Originally, we wanted a shutdown routine to trigger some hypercall,
so that Hyper-V will disable the DMA translation, e.g., during the VM
reboot process.
<snip>
> > +device_initcall(hv_iommu_init);
>
> I'm concerned about the timing of this initialization. VMBus is initialized with
> subsys_initcall(), which is initcall level 4 while device_initcall() is initcall level 6.
> So VMBus initialization happens quite a bit earlier, and the hypervisor starts
> offering devices to the guest, including PCI pass-thru devices, before the
> IOMMU initialization starts. I cobbled together a way to make this IOMMU code
> run in an Azure VM using the identity domain. The VM has an NVMe OS disk,
> two NVMe data disks, and a MANA NIC. The NVMe devices were offered, and
> completed hv_pci_probe() before this IOMMU initialization was started. When
> IOMMU initialization did run, it went back and found the NVMe devices. But
> I'm unsure if that's OK because my hacked together environment obviously
> couldn't do real IOMMU mapping. It appears that the NVMe device driver
> didn't start its initialization until after the IOMMU driver was setup, which
> would probably make everything OK. But that might be just timing luck, or
> maybe there's something that affirmatively prevents the native PCI driver
> (like NVMe) from getting started until after all the initcalls have finished.
>
This is yet another immature attempt by me to do the hv_iommu_init() in
an arch-independent path. And I do not think using device_initcall() is
harmless. This patch set was tested using an assigned Intel DSA device,
and the DMA tests succeeded w/o any error. But that is not enough to
justify using device_initcall(): I reset the idxd driver as kernel
builtin and realized, just like you said, both hv_pci_probe() and
idxd_pci_probe() were triggered before hv_iommu_init(), and when pvIOMMU
tries to probe the endpoint device, a warning is printed:
[ 3.609697] idxd 13d7:00:00.0: late IOMMU probe at driver bind, something fishy here!
> I'm planning to look at this further to see if there's a way for a PCI driver
> to try initializing a pass-thru device *before* this IOMMU driver has initialized.
> If so, a different way to do the IOMMU initialization will be needed that is
> linked to VMBus initialization so things can't happen out-of-order. Establishing
> such a linkage is probably a good idea regardless.
>
> FWIW, the Azure VM with the 3 NVMe devices and MANA, and operating with
> the identity IOMMU domain, all seemed to work fine! Got 4 IOMMU groups,
> and devices coming and going dynamically all worked correctly. When a device
> was removed, it was moved to the blocking domain, and then flushed before
> being finally removed. All good! I wish I had a way to test with an IOMMU
> paging domain that was doing real translation.
>
Thank you, Michael! I really appreciate you running these extra experiments!
My tests on this DSA device passed (using paging domain) too, with no DMA
errors observed (regardless of whether its driver is built in or a kernel module).
But that doesn't make me confident about using `device_initcall`. I believe
your concern is valid. E.g., an endpoint device might allocate a DMA address
(using a raw GPA, instead of gIOVA) before pvIOMMU is initialized, and then
use that address for DMA later, after a paging domain is attached?
> > diff --git a/drivers/iommu/hyperv/iommu.h b/drivers/iommu/hyperv/iommu.h
> > new file mode 100644
> > index 000000000000..c8657e791a6e
> > --- /dev/null
> > +++ b/drivers/iommu/hyperv/iommu.h
> > @@ -0,0 +1,53 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +
> > +/*
> > + * Hyper-V IOMMU driver.
> > + *
> > + * Copyright (C) 2024-2025, Microsoft, Inc.
> > + *
> > + */
> > +
> > +#ifndef _HYPERV_IOMMU_H
> > +#define _HYPERV_IOMMU_H
> > +
> > +struct hv_iommu_dev {
> > + struct iommu_device iommu;
> > + struct ida domain_ids;
> > +
> > + /* Device configuration */
> > + u8 max_iova_width;
> > + u8 max_pasid_width;
> > + u64 cap;
> > + u64 pgsize_bitmap;
> > +
> > + struct iommu_domain_geometry geometry;
> > + u64 first_domain;
> > + u64 last_domain;
> > +};
> > +
> > +struct hv_iommu_domain {
> > + union {
> > + struct iommu_domain domain;
> > + struct pt_iommu pt_iommu;
> > + struct pt_iommu_x86_64 pt_iommu_x86_64;
> > + };
> > + struct hv_iommu_dev *hv_iommu;
> > + struct hv_input_device_domain device_domain;
> > + u64 pgsize_bitmap;
> > +
> > + spinlock_t lock; /* protects dev_list and TLB flushes */
> > + /* List of devices in this DMA domain */
>
> It appears that this list is really a list of endpoints (i.e., struct
> hv_iommu_endpoint), not devices (which I read to be struct
> hv_iommu_dev).
>
> But that said, what is the list used for? I see code to add
> endpoints to the list, and to remove them, but the list is never
> walked by any code in this patch set. If there is an anticipated
> future use, it would be better to add the list as part of the code
> for that future use.
>
Yes, we do not really need this list for this patch set. Thanks!
B.R.
Yu
* Re: [PATCH 1/1] PCI: hv: Remove unused field pci_bus in struct hv_pcibus_device
From: Srivatsa S. Bhat @ 2026-01-12 16:55 UTC (permalink / raw)
To: Michael Kelley
Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
decui@microsoft.com, longli@microsoft.com, lpieralisi@kernel.org,
kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org,
bhelgaas@google.com, linux-pci@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org
In-Reply-To: <SN6PR02MB4157BFB422607900AC1EBD82D481A@SN6PR02MB4157.namprd02.prod.outlook.com>
On Mon, Jan 12, 2026 at 03:54:51PM +0000, Michael Kelley wrote:
> From: Srivatsa S. Bhat <srivatsa@csail.mit.edu> Sent: Monday, January 12, 2026 6:29 AM
> > Hi Michael,
> >
> > On Sun, Jan 11, 2026 at 09:00:34AM -0800, mhkelley58@gmail.com wrote:
> > > From: Michael Kelley <mhklinux@outlook.com>
> > >
> > > Field pci_bus in struct hv_pcibus_device is unused since
> > > commit 418cb6c8e051 ("PCI: hv: Generify PCI probing"). Remove it.
> > >
> >
> > Since that commit is several years old (2021), I was curious if this was found by
> > manual inspection or if the compiler was able to flag the unused
> > variable as well.
>
> Code inspection. I was brushing up on how the structs defined
> in pci-hyperv.c relate to the standard Linux PCI struct pci_bus and
> struct pci_dev. Having a pointer to struct pci_bus in struct
> hv_pcibus_device makes sense, and I was a bit surprised to find
> it's not set or used. Instead, the PCI bus is always found through
> the PCI bridge.
>
Ah, I see, thank you for the background!
Regards,
Srivatsa
>
> >
> > > No functional change.
> > >
> > > Signed-off-by: Michael Kelley <mhklinux@outlook.com>
> >
> > Reviewed-by: Srivatsa S. Bhat (Microsoft) <srivatsa@csail.mit.edu>
> >
> > Regards,
> > Srivatsa
> > Microsoft Linux Systems Group
> >
> > > ---
> > > drivers/pci/controller/pci-hyperv.c | 1 -
> > > 1 file changed, 1 deletion(-)
> > >
> > > diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> > > index 1e237d3538f9..7fcba05cec30 100644
> > > --- a/drivers/pci/controller/pci-hyperv.c
> > > +++ b/drivers/pci/controller/pci-hyperv.c
> > > @@ -501,7 +501,6 @@ struct hv_pcibus_device {
> > > struct resource *low_mmio_res;
> > > struct resource *high_mmio_res;
> > > struct completion *survey_event;
> > > - struct pci_bus *pci_bus;
> > > spinlock_t config_lock; /* Avoid two threads writing index page */
> > > spinlock_t device_list_lock; /* Protect lists below */
> > > void __iomem *cfg_addr;
> > > --
> > > 2.25.1
> > >
> > >
* Re: [PATCH] mshv: make certain field names descriptive in a header struct
From: Nuno Das Neves @ 2026-01-12 16:55 UTC (permalink / raw)
To: Mukesh Rathor, linux-hyperv; +Cc: wei.liu
In-Reply-To: <20260109200611.1422390-1-mrathor@linux.microsoft.com>
On 1/9/2026 12:06 PM, Mukesh Rathor wrote:
> There is no functional change. Just make couple field names in
> struct mshv_mem_region, in a header that can be used in many
> places, a little descriptive to make code easier to read by
> allowing better support for grep, cscope, etc.
>
The commit message could be improved a bit. Putting the
motivation first is usually better e.g.:
"
When struct fields use very common names like "pages" or "type",
it makes it difficult to find uses of these fields with tools
like grep and cscope.
Add the prefix mreg_ to some fields in struct mshv_mem_region to
make it easier to find them. No functional change.
"
Looks good to me otherwise.
> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> ---
> drivers/hv/mshv_regions.c | 44 ++++++++++++++++++-------------------
> drivers/hv/mshv_root.h | 6 ++---
> drivers/hv/mshv_root_main.c | 10 ++++-----
> 3 files changed, 30 insertions(+), 30 deletions(-)
>
> diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> index 202b9d551e39..af81405f859b 100644
> --- a/drivers/hv/mshv_regions.c
> +++ b/drivers/hv/mshv_regions.c
> @@ -52,7 +52,7 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
> struct page *page;
> int ret;
>
> - page = region->pages[page_offset];
> + page = region->mreg_pages[page_offset];
> if (!page)
> return -EINVAL;
>
> @@ -65,7 +65,7 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
>
> /* Start at stride since the first page is validated */
> for (count = stride; count < page_count; count += stride) {
> - page = region->pages[page_offset + count];
> + page = region->mreg_pages[page_offset + count];
>
> /* Break if current page is not present */
> if (!page)
> @@ -117,7 +117,7 @@ static int mshv_region_process_range(struct mshv_mem_region *region,
>
> while (page_count) {
> /* Skip non-present pages */
> - if (!region->pages[page_offset]) {
> + if (!region->mreg_pages[page_offset]) {
> page_offset++;
> page_count--;
> continue;
> @@ -164,13 +164,13 @@ static int mshv_region_chunk_share(struct mshv_mem_region *region,
> u32 flags,
> u64 page_offset, u64 page_count)
> {
> - struct page *page = region->pages[page_offset];
> + struct page *page = region->mreg_pages[page_offset];
>
> if (PageHuge(page) || PageTransCompound(page))
> flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
>
> return hv_call_modify_spa_host_access(region->partition->pt_id,
> - region->pages + page_offset,
> + region->mreg_pages + page_offset,
> page_count,
> HV_MAP_GPA_READABLE |
> HV_MAP_GPA_WRITABLE,
> @@ -190,13 +190,13 @@ static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
> u32 flags,
> u64 page_offset, u64 page_count)
> {
> - struct page *page = region->pages[page_offset];
> + struct page *page = region->mreg_pages[page_offset];
>
> if (PageHuge(page) || PageTransCompound(page))
> flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
>
> return hv_call_modify_spa_host_access(region->partition->pt_id,
> - region->pages + page_offset,
> + region->mreg_pages + page_offset,
> page_count, 0,
> flags, false);
> }
> @@ -214,7 +214,7 @@ static int mshv_region_chunk_remap(struct mshv_mem_region *region,
> u32 flags,
> u64 page_offset, u64 page_count)
> {
> - struct page *page = region->pages[page_offset];
> + struct page *page = region->mreg_pages[page_offset];
>
> if (PageHuge(page) || PageTransCompound(page))
> flags |= HV_MAP_GPA_LARGE_PAGE;
> @@ -222,7 +222,7 @@ static int mshv_region_chunk_remap(struct mshv_mem_region *region,
> return hv_call_map_gpa_pages(region->partition->pt_id,
> region->start_gfn + page_offset,
> page_count, flags,
> - region->pages + page_offset);
> + region->mreg_pages + page_offset);
> }
>
> static int mshv_region_remap_pages(struct mshv_mem_region *region,
> @@ -245,10 +245,10 @@ int mshv_region_map(struct mshv_mem_region *region)
> static void mshv_region_invalidate_pages(struct mshv_mem_region *region,
> u64 page_offset, u64 page_count)
> {
> - if (region->type == MSHV_REGION_TYPE_MEM_PINNED)
> - unpin_user_pages(region->pages + page_offset, page_count);
> + if (region->mreg_type == MSHV_REGION_TYPE_MEM_PINNED)
> + unpin_user_pages(region->mreg_pages + page_offset, page_count);
>
> - memset(region->pages + page_offset, 0,
> + memset(region->mreg_pages + page_offset, 0,
> page_count * sizeof(struct page *));
> }
>
> @@ -265,7 +265,7 @@ int mshv_region_pin(struct mshv_mem_region *region)
> int ret;
>
> for (done_count = 0; done_count < region->nr_pages; done_count += ret) {
> - pages = region->pages + done_count;
> + pages = region->mreg_pages + done_count;
> userspace_addr = region->start_uaddr +
> done_count * HV_HYP_PAGE_SIZE;
> nr_pages = min(region->nr_pages - done_count,
> @@ -297,7 +297,7 @@ static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
> u32 flags,
> u64 page_offset, u64 page_count)
> {
> - struct page *page = region->pages[page_offset];
> + struct page *page = region->mreg_pages[page_offset];
>
> if (PageHuge(page) || PageTransCompound(page))
> flags |= HV_UNMAP_GPA_LARGE_PAGE;
> @@ -321,7 +321,7 @@ static void mshv_region_destroy(struct kref *ref)
> struct mshv_partition *partition = region->partition;
> int ret;
>
> - if (region->type == MSHV_REGION_TYPE_MEM_MOVABLE)
> + if (region->mreg_type == MSHV_REGION_TYPE_MEM_MOVABLE)
> mshv_region_movable_fini(region);
>
> if (mshv_partition_encrypted(partition)) {
> @@ -374,9 +374,9 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
> int ret;
>
> range->notifier_seq = mmu_interval_read_begin(range->notifier);
> - mmap_read_lock(region->mni.mm);
> + mmap_read_lock(region->mreg_mni.mm);
> ret = hmm_range_fault(range);
> - mmap_read_unlock(region->mni.mm);
> + mmap_read_unlock(region->mreg_mni.mm);
> if (ret)
> return ret;
>
> @@ -407,7 +407,7 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
> u64 page_offset, u64 page_count)
> {
> struct hmm_range range = {
> - .notifier = &region->mni,
> + .notifier = &region->mreg_mni,
> .default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
> };
> unsigned long *pfns;
> @@ -430,7 +430,7 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
> goto out;
>
> for (i = 0; i < page_count; i++)
> - region->pages[page_offset + i] = hmm_pfn_to_page(pfns[i]);
> + region->mreg_pages[page_offset + i] = hmm_pfn_to_page(pfns[i]);
>
> ret = mshv_region_remap_pages(region, region->hv_map_flags,
> page_offset, page_count);
> @@ -489,7 +489,7 @@ static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
> {
> struct mshv_mem_region *region = container_of(mni,
> struct mshv_mem_region,
> - mni);
> + mreg_mni);
> u64 page_offset, page_count;
> unsigned long mstart, mend;
> int ret = -EPERM;
> @@ -535,14 +535,14 @@ static const struct mmu_interval_notifier_ops mshv_region_mni_ops = {
>
> void mshv_region_movable_fini(struct mshv_mem_region *region)
> {
> - mmu_interval_notifier_remove(&region->mni);
> + mmu_interval_notifier_remove(&region->mreg_mni);
> }
>
> bool mshv_region_movable_init(struct mshv_mem_region *region)
> {
> int ret;
>
> - ret = mmu_interval_notifier_insert(&region->mni, current->mm,
> + ret = mmu_interval_notifier_insert(&region->mreg_mni, current->mm,
> region->start_uaddr,
> region->nr_pages << HV_HYP_PAGE_SHIFT,
> &mshv_region_mni_ops);
> diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
> index 3c1d88b36741..f5b6d3979e5a 100644
> --- a/drivers/hv/mshv_root.h
> +++ b/drivers/hv/mshv_root.h
> @@ -85,10 +85,10 @@ struct mshv_mem_region {
> u64 start_uaddr;
> u32 hv_map_flags;
> struct mshv_partition *partition;
> - enum mshv_region_type type;
> - struct mmu_interval_notifier mni;
> + enum mshv_region_type mreg_type;
> + struct mmu_interval_notifier mreg_mni;
> struct mutex mutex; /* protects region pages remapping */
> - struct page *pages[];
> + struct page *mreg_pages[];
> };
>
> struct mshv_irq_ack_notifier {
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index 1134a82c7881..eff1b21461dc 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -657,7 +657,7 @@ static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
> return false;
>
> /* Only movable memory ranges are supported for GPA intercepts */
> - if (region->type == MSHV_REGION_TYPE_MEM_MOVABLE)
> + if (region->mreg_type == MSHV_REGION_TYPE_MEM_MOVABLE)
> ret = mshv_region_handle_gfn_fault(region, gfn);
> else
> ret = false;
> @@ -1175,12 +1175,12 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
> return PTR_ERR(rg);
>
> if (is_mmio)
> - rg->type = MSHV_REGION_TYPE_MMIO;
> + rg->mreg_type = MSHV_REGION_TYPE_MMIO;
> else if (mshv_partition_encrypted(partition) ||
> !mshv_region_movable_init(rg))
> - rg->type = MSHV_REGION_TYPE_MEM_PINNED;
> + rg->mreg_type = MSHV_REGION_TYPE_MEM_PINNED;
> else
> - rg->type = MSHV_REGION_TYPE_MEM_MOVABLE;
> + rg->mreg_type = MSHV_REGION_TYPE_MEM_MOVABLE;
>
> rg->partition = partition;
>
> @@ -1297,7 +1297,7 @@ mshv_map_user_memory(struct mshv_partition *partition,
> if (ret)
> return ret;
>
> - switch (region->type) {
> + switch (region->mreg_type) {
> case MSHV_REGION_TYPE_MEM_PINNED:
> ret = mshv_prepare_pinned_region(region);
> break;
* RE: [PATCH 1/1] PCI: hv: Remove unused field pci_bus in struct hv_pcibus_device
From: Michael Kelley @ 2026-01-12 15:54 UTC (permalink / raw)
To: Srivatsa S. Bhat
Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
decui@microsoft.com, longli@microsoft.com, lpieralisi@kernel.org,
kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org,
bhelgaas@google.com, linux-pci@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org
In-Reply-To: <aWUFPUxrMkM32zDD@csail.mit.edu>
From: Srivatsa S. Bhat <srivatsa@csail.mit.edu> Sent: Monday, January 12, 2026 6:29 AM
> Hi Michael,
>
> On Sun, Jan 11, 2026 at 09:00:34AM -0800, mhkelley58@gmail.com wrote:
> > From: Michael Kelley <mhklinux@outlook.com>
> >
> > Field pci_bus in struct hv_pcibus_device is unused since
> > commit 418cb6c8e051 ("PCI: hv: Generify PCI probing"). Remove it.
> >
>
> Since that commit is several years old (2021), I was curious if this was found by
> manual inspection or if the compiler was able to flag the unused
> variable as well.
Code inspection. I was brushing up on how the structs defined
in pci-hyperv.c relate to the standard Linux PCI struct pci_bus and
struct pci_dev. Having a pointer to struct pci_bus in struct
hv_pcibus_device makes sense, and I was a bit surprised to find
it's not set or used. Instead, the PCI bus is always found through
the PCI bridge.
Michael
>
> > No functional change.
> >
> > Signed-off-by: Michael Kelley <mhklinux@outlook.com>
>
> Reviewed-by: Srivatsa S. Bhat (Microsoft) <srivatsa@csail.mit.edu>
>
> Regards,
> Srivatsa
> Microsoft Linux Systems Group
>
> > ---
> > drivers/pci/controller/pci-hyperv.c | 1 -
> > 1 file changed, 1 deletion(-)
> >
> > diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> > index 1e237d3538f9..7fcba05cec30 100644
> > --- a/drivers/pci/controller/pci-hyperv.c
> > +++ b/drivers/pci/controller/pci-hyperv.c
> > @@ -501,7 +501,6 @@ struct hv_pcibus_device {
> > struct resource *low_mmio_res;
> > struct resource *high_mmio_res;
> > struct completion *survey_event;
> > - struct pci_bus *pci_bus;
> > spinlock_t config_lock; /* Avoid two threads writing index page */
> > spinlock_t device_list_lock; /* Protect lists below */
> > void __iomem *cfg_addr;
> > --
> > 2.25.1
> >
> >
* RE: [PATCH net-next] net: hv_netvsc: reject RSS hash key programming without RX indirection table
From: Haiyang Zhang @ 2026-01-12 15:34 UTC (permalink / raw)
To: Aditya Garg, KY Srinivasan, wei.liu@kernel.org, Dexuan Cui,
Long Li, andrew+netdev@lunn.ch, davem@davemloft.net,
edumazet@google.com, kuba@kernel.org, pabeni@redhat.com,
stephen@networkplumber.org, linux-hyperv@vger.kernel.org,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
dipayanroy@linux.microsoft.com, ssengar@linux.microsoft.com,
shradhagupta@linux.microsoft.com, ernis@linux.microsoft.com,
Aditya Garg
In-Reply-To: <1768212093-1594-1-git-send-email-gargaditya@linux.microsoft.com>
> -----Original Message-----
> From: Aditya Garg <gargaditya@linux.microsoft.com>
> Sent: Monday, January 12, 2026 5:02 AM
> To: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
> <haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui
> <DECUI@microsoft.com>; Long Li <longli@microsoft.com>;
> andrew+netdev@lunn.ch; davem@davemloft.net; edumazet@google.com;
> kuba@kernel.org; pabeni@redhat.com; stephen@networkplumber.org;
> linux-hyperv@vger.kernel.org; netdev@vger.kernel.org;
> linux-kernel@vger.kernel.org; dipayanroy@linux.microsoft.com;
> ssengar@linux.microsoft.com; shradhagupta@linux.microsoft.com;
> ernis@linux.microsoft.com; Aditya Garg <gargaditya@microsoft.com>;
> gargaditya@linux.microsoft.com
> Subject: [PATCH net-next] net: hv_netvsc: reject RSS hash key programming
> without RX indirection table
>
> RSS configuration requires a valid RX indirection table. When the device
> reports a single receive queue, rndis_filter_device_add() does not
> allocate an indirection table. Accepting RSS hash key updates in this
> state leads to a hang.
>
> Fix this by gating netvsc_set_rxfh() on ndc->rx_table_sz and returning
> -EOPNOTSUPP when the table is absent. This aligns set_rxfh with the device
> capabilities and prevents incorrect behavior.
>
> Fixes: 962f3fee83a4 ("netvsc: add ethtool ops to get/set RSS key")
> Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com>
> Reviewed-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
* Re: [PATCH 1/1] PCI: hv: Remove unused field pci_bus in struct hv_pcibus_device
From: Srivatsa S. Bhat @ 2026-01-12 14:29 UTC (permalink / raw)
To: mhklinux
Cc: kys, haiyangz, wei.liu, decui, longli, lpieralisi, kwilczynski,
mani, robh, bhelgaas, linux-pci, linux-kernel, linux-hyperv
In-Reply-To: <20260111170034.67558-1-mhklinux@outlook.com>
Hi Michael,
On Sun, Jan 11, 2026 at 09:00:34AM -0800, mhkelley58@gmail.com wrote:
> From: Michael Kelley <mhklinux@outlook.com>
>
> Field pci_bus in struct hv_pcibus_device is unused since
> commit 418cb6c8e051 ("PCI: hv: Generify PCI probing"). Remove it.
>
Since that commit is several years old (2021), I was curious if this was found by
manual inspection or if the compiler was able to flag the unused
variable as well.
> No functional change.
>
> Signed-off-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Srivatsa S. Bhat (Microsoft) <srivatsa@csail.mit.edu>
Regards,
Srivatsa
Microsoft Linux Systems Group
> ---
> drivers/pci/controller/pci-hyperv.c | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index 1e237d3538f9..7fcba05cec30 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -501,7 +501,6 @@ struct hv_pcibus_device {
> struct resource *low_mmio_res;
> struct resource *high_mmio_res;
> struct completion *survey_event;
> - struct pci_bus *pci_bus;
> spinlock_t config_lock; /* Avoid two threads writing index page */
> spinlock_t device_list_lock; /* Protect lists below */
> void __iomem *cfg_addr;
> --
> 2.25.1
>
>
* Re: [PATCH net-next] net: hv_netvsc: reject RSS hash key programming without RX indirection table
From: Dipayaan Roy @ 2026-01-12 13:09 UTC (permalink / raw)
To: Aditya Garg
Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, kuba, pabeni, stephen, linux-hyperv, netdev,
linux-kernel, ssengar, shradhagupta, ernis, gargaditya
In-Reply-To: <1768212093-1594-1-git-send-email-gargaditya@linux.microsoft.com>
On Mon, Jan 12, 2026 at 02:01:33AM -0800, Aditya Garg wrote:
> RSS configuration requires a valid RX indirection table. When the device
> reports a single receive queue, rndis_filter_device_add() does not
> allocate an indirection table. Accepting RSS hash key updates in this
> state leads to a hang.
>
> Fix this by gating netvsc_set_rxfh() on ndc->rx_table_sz and returning
> -EOPNOTSUPP when the table is absent. This aligns set_rxfh with the device
> capabilities and prevents incorrect behavior.
>
> Fixes: 962f3fee83a4 ("netvsc: add ethtool ops to get/set RSS key")
> Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com>
> Reviewed-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
> ---
> drivers/net/hyperv/netvsc_drv.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
> index 3d47d749ef9f..cbd52cb79268 100644
> --- a/drivers/net/hyperv/netvsc_drv.c
> +++ b/drivers/net/hyperv/netvsc_drv.c
> @@ -1750,6 +1750,9 @@ static int netvsc_set_rxfh(struct net_device *dev,
> rxfh->hfunc != ETH_RSS_HASH_TOP)
> return -EOPNOTSUPP;
>
> + if (!ndc->rx_table_sz)
> + return -EOPNOTSUPP;
> +
> rndis_dev = ndev->extension;
> if (rxfh->indir) {
> for (i = 0; i < ndc->rx_table_sz; i++)
> --
> 2.43.0
>
LGTM.
Reviewed-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
* [PATCH net-next, v8] net: mana: Implement ndo_tx_timeout and serialize queue resets per port.
From: Dipayaan Roy @ 2026-01-12 13:05 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
kuba, pabeni, longli, kotaranov, horms, shradhagupta, ssengar,
ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
linux-rdma, dipayanroy
Implement .ndo_tx_timeout for MANA so that any stalled TX queue can be detected
and a device-controlled port reset for all queues can be scheduled on an
ordered workqueue. Resetting all queues on stall detection is
recommended by the hardware team.
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
---
Changes in v8:
- better aligned queue reset work struct.
Changes in v7:
- Add enable_work in resume path.
Changes in v6:
- Rebased.
Changes in v5:
-Fixed commit message, used 'create_singlethread_workqueue' and fixed
cleanup part.
Changes in v4:
-Fixed commit message, work initialization before registering netdev,
fixed potential null pointer de-reference bug.
Changes in v3:
-Fixed commit message, removed rtnl_trylock and added
disable_work_sync, fixed mana_queue_reset_work, and few
cosmetics.
Changes in v2:
-Fixed cosmetic changes.
---
---
drivers/net/ethernet/microsoft/mana/mana_en.c | 77 ++++++++++++++++++-
include/net/mana/gdma.h | 7 +-
include/net/mana/mana.h | 3 +-
3 files changed, 84 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 1ad154f9db1a..91c418097284 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -299,6 +299,39 @@ static int mana_get_gso_hs(struct sk_buff *skb)
return gso_hs;
}
+static void mana_per_port_queue_reset_work_handler(struct work_struct *work)
+{
+ struct mana_port_context *apc = container_of(work,
+ struct mana_port_context,
+ queue_reset_work);
+ struct net_device *ndev = apc->ndev;
+ int err;
+
+ rtnl_lock();
+
+ /* Pre-allocate buffers to prevent failure in mana_attach later */
+ err = mana_pre_alloc_rxbufs(apc, ndev->mtu, apc->num_queues);
+ if (err) {
+ netdev_err(ndev, "Insufficient memory for reset post tx stall detection\n");
+ goto out;
+ }
+
+ err = mana_detach(ndev, false);
+ if (err) {
+ netdev_err(ndev, "mana_detach failed: %d\n", err);
+ goto dealloc_pre_rxbufs;
+ }
+
+ err = mana_attach(ndev);
+ if (err)
+ netdev_err(ndev, "mana_attach failed: %d\n", err);
+
+dealloc_pre_rxbufs:
+ mana_pre_dealloc_rxbufs(apc);
+out:
+ rtnl_unlock();
+}
+
netdev_tx_t mana_start_xmit(struct sk_buff *skb, struct net_device *ndev)
{
enum mana_tx_pkt_format pkt_fmt = MANA_SHORT_PKT_FMT;
@@ -839,6 +872,23 @@ static int mana_change_mtu(struct net_device *ndev, int new_mtu)
return err;
}
+static void mana_tx_timeout(struct net_device *netdev, unsigned int txqueue)
+{
+ struct mana_port_context *apc = netdev_priv(netdev);
+ struct mana_context *ac = apc->ac;
+ struct gdma_context *gc = ac->gdma_dev->gdma_context;
+
+ /* Already in service, hence tx queue reset is not required.*/
+ if (gc->in_service)
+ return;
+
+ /* Note: If there are pending queue reset work for this port(apc),
+ * subsequent request queued up from here are ignored. This is because
+ * we are using the same work instance per port(apc).
+ */
+ queue_work(ac->per_port_queue_reset_wq, &apc->queue_reset_work);
+}
+
static int mana_shaper_set(struct net_shaper_binding *binding,
const struct net_shaper *shaper,
struct netlink_ext_ack *extack)
@@ -924,6 +974,7 @@ static const struct net_device_ops mana_devops = {
.ndo_bpf = mana_bpf,
.ndo_xdp_xmit = mana_xdp_xmit,
.ndo_change_mtu = mana_change_mtu,
+ .ndo_tx_timeout = mana_tx_timeout,
.net_shaper_ops = &mana_shaper_ops,
};
@@ -3287,6 +3338,8 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
ndev->min_mtu = ETH_MIN_MTU;
ndev->needed_headroom = MANA_HEADROOM;
ndev->dev_port = port_idx;
+ /* Recommended timeout based on HW FPGA re-config scenario. */
+ ndev->watchdog_timeo = 15 * HZ;
SET_NETDEV_DEV(ndev, gc->dev);
netif_set_tso_max_size(ndev, GSO_MAX_SIZE);
@@ -3303,6 +3356,10 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
if (err)
goto reset_apc;
+ /* Initialize the per port queue reset work.*/
+ INIT_WORK(&apc->queue_reset_work,
+ mana_per_port_queue_reset_work_handler);
+
netdev_lockdep_set_classes(ndev);
ndev->hw_features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM;
@@ -3492,6 +3549,7 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
{
struct gdma_context *gc = gd->gdma_context;
struct mana_context *ac = gd->driver_data;
+ struct mana_port_context *apc = NULL;
struct device *dev = gc->dev;
u8 bm_hostmode = 0;
u16 num_ports = 0;
@@ -3549,6 +3607,14 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
if (ac->num_ports > MAX_PORTS_IN_MANA_DEV)
ac->num_ports = MAX_PORTS_IN_MANA_DEV;
+ ac->per_port_queue_reset_wq =
+ create_singlethread_workqueue("mana_per_port_queue_reset_wq");
+ if (!ac->per_port_queue_reset_wq) {
+ dev_err(dev, "Failed to allocate per port queue reset workqueue\n");
+ err = -ENOMEM;
+ goto out;
+ }
+
if (!resuming) {
for (i = 0; i < ac->num_ports; i++) {
err = mana_probe_port(ac, i, &ac->ports[i]);
@@ -3565,6 +3631,8 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
} else {
for (i = 0; i < ac->num_ports; i++) {
rtnl_lock();
+ apc = netdev_priv(ac->ports[i]);
+ enable_work(&apc->queue_reset_work);
err = mana_attach(ac->ports[i]);
rtnl_unlock();
/* we log the port for which the attach failed and stop
@@ -3616,13 +3684,15 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
for (i = 0; i < ac->num_ports; i++) {
ndev = ac->ports[i];
- apc = netdev_priv(ndev);
if (!ndev) {
if (i == 0)
dev_err(dev, "No net device to remove\n");
goto out;
}
+ apc = netdev_priv(ndev);
+ disable_work_sync(&apc->queue_reset_work);
+
/* All cleanup actions should stay after rtnl_lock(), otherwise
* other functions may access partially cleaned up data.
*/
@@ -3649,6 +3719,11 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
mana_destroy_eq(ac);
out:
+ if (ac->per_port_queue_reset_wq) {
+ destroy_workqueue(ac->per_port_queue_reset_wq);
+ ac->per_port_queue_reset_wq = NULL;
+ }
+
mana_gd_deregister_device(gd);
if (suspending)
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index eaa27483f99b..a59bd4035a99 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -598,6 +598,10 @@ enum {
/* Driver can self reset on FPGA Reconfig EQE notification */
#define GDMA_DRV_CAP_FLAG_1_HANDLE_RECONFIG_EQE BIT(17)
+
+/* Driver detects stalled send queues and recovers them */
+#define GDMA_DRV_CAP_FLAG_1_HANDLE_STALL_SQ_RECOVERY BIT(18)
+
#define GDMA_DRV_CAP_FLAG_1_HW_VPORT_LINK_AWARE BIT(6)
/* Driver supports linearizing the skb when num_sge exceeds hardware limit */
@@ -621,7 +625,8 @@ enum {
GDMA_DRV_CAP_FLAG_1_HW_VPORT_LINK_AWARE | \
GDMA_DRV_CAP_FLAG_1_PERIODIC_STATS_QUERY | \
GDMA_DRV_CAP_FLAG_1_SKB_LINEARIZE | \
- GDMA_DRV_CAP_FLAG_1_PROBE_RECOVERY)
+ GDMA_DRV_CAP_FLAG_1_PROBE_RECOVERY | \
+ GDMA_DRV_CAP_FLAG_1_HANDLE_STALL_SQ_RECOVERY)
#define GDMA_DRV_CAP_FLAGS2 0
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index d7e089c6b694..a078af283bdd 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -480,7 +480,7 @@ struct mana_context {
struct mana_ethtool_hc_stats hc_stats;
struct mana_eq *eqs;
struct dentry *mana_eqs_debugfs;
-
+ struct workqueue_struct *per_port_queue_reset_wq;
/* Workqueue for querying hardware stats */
struct delayed_work gf_stats_work;
bool hwc_timeout_occurred;
@@ -495,6 +495,7 @@ struct mana_context {
struct mana_port_context {
struct mana_context *ac;
struct net_device *ndev;
+ struct work_struct queue_reset_work;
u8 mac_addr[ETH_ALEN];
--
2.43.0
* Re: [PATCH net-next, v7] net: mana: Implement ndo_tx_timeout and serialize queue resets per port.
From: Dipayaan Roy @ 2026-01-12 12:59 UTC (permalink / raw)
To: Jakub Kicinski
Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
pabeni, longli, kotaranov, horms, shradhagupta, ssengar, ernis,
shirazsaleem, linux-hyperv, netdev, linux-kernel, linux-rdma,
dipayanroy
In-Reply-To: <20260109180209.023c50cf@kernel.org>
On Fri, Jan 09, 2026 at 06:02:09PM -0800, Jakub Kicinski wrote:
> On Tue, 6 Jan 2026 15:04:38 -0800 Dipayaan Roy wrote:
> > +static void mana_per_port_queue_reset_work_handler(struct work_struct *work)
> > +{
> > + struct mana_queue_reset_work *reset_queue_work =
> > + container_of(work, struct mana_queue_reset_work, work);
> > +
> > + struct mana_port_context *apc = container_of(reset_queue_work,
> > + struct mana_port_context,
> > + queue_reset_work);
>
> > +struct mana_queue_reset_work {
> > + /* Work structure */
>
> Not sure what value this comment adds. Looks like something AI
> generator would add.
>
> > + struct work_struct work;
> > +};
> > +
> > struct mana_port_context {
> > struct mana_context *ac;
> > struct net_device *ndev;
> > + struct mana_queue_reset_work queue_reset_work;
>
> Why did you wrap the work in another struct with just one member?
> It forces you to work thru two layers of container of.
>
> Either way, container_of supports nested structs so I think something
> like:
>
> struct mana_port_context *apc = container_of(work,
> struct mana_port_context,
> queue_reset_work.work);
>
> should work (untested). But really, better to just delete the pointless
> nesting.
Thanks Jakub, I will remove the nesting and re-share a new patch after
testing.
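The two container_of() forms discussed above can be compared in a small
user-space sketch. This is not the MANA driver code: the container_of()
macro here is a simplified stand-in for the kernel's, and the struct and
function names (port_ctx_nested, port_ctx_flat, nested_from_work,
flat_from_work) are hypothetical, chosen only to mirror the v7 and v8
layouts.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified user-space stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for struct work_struct. */
struct work_struct {
	int pending;
};

/* v7-style layout: the work item is wrapped in a one-member struct. */
struct queue_reset_wrapper {
	struct work_struct work;
};

struct port_ctx_nested {
	int port_idx;
	struct queue_reset_wrapper queue_reset_work;
};

/* v8-style layout: the work item is embedded directly. */
struct port_ctx_flat {
	int port_idx;
	struct work_struct queue_reset_work;
};

/* The wrapper's only member sits at offset 0, so the nesting adds nothing. */
static_assert(offsetof(struct port_ctx_nested, queue_reset_work.work) ==
	      offsetof(struct port_ctx_nested, queue_reset_work),
	      "wrapper adds no offset");

/* One container_of() with a nested member designator, as suggested. */
static struct port_ctx_nested *nested_from_work(struct work_struct *w)
{
	return container_of(w, struct port_ctx_nested, queue_reset_work.work);
}

/* With the wrapper removed, the plain member name is enough. */
static struct port_ctx_flat *flat_from_work(struct work_struct *w)
{
	return container_of(w, struct port_ctx_flat, queue_reset_work);
}
```

Both helpers recover the enclosing context from the embedded work pointer
in a single step, which is why the extra wrapper struct buys nothing.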
> --
> pw-bot: cr
Regards
Dipayaan Roy
* [PATCH net-next] net: hv_netvsc: reject RSS hash key programming without RX indirection table
From: Aditya Garg @ 2026-01-12 10:01 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, kuba, pabeni, stephen, linux-hyperv, netdev,
linux-kernel, dipayanroy, ssengar, shradhagupta, ernis,
gargaditya, gargaditya
RSS configuration requires a valid RX indirection table. When the device
reports a single receive queue, rndis_filter_device_add() does not
allocate an indirection table. Accepting RSS hash key updates in this
state leads to a hang.
Fix this by gating netvsc_set_rxfh() on ndc->rx_table_sz and returning
-EOPNOTSUPP when the table is absent. This aligns set_rxfh with the device
capabilities and prevents incorrect behavior.
Fixes: 962f3fee83a4 ("netvsc: add ethtool ops to get/set RSS key")
Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com>
Reviewed-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
---
drivers/net/hyperv/netvsc_drv.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 3d47d749ef9f..cbd52cb79268 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -1750,6 +1750,9 @@ static int netvsc_set_rxfh(struct net_device *dev,
rxfh->hfunc != ETH_RSS_HASH_TOP)
return -EOPNOTSUPP;
+ if (!ndc->rx_table_sz)
+ return -EOPNOTSUPP;
+
rndis_dev = ndev->extension;
if (rxfh->indir) {
for (i = 0; i < ndc->rx_table_sz; i++)
--
2.43.0
* RE: [RFC v1 4/5] hyperv: allow hypercall output pages to be allocated for child partitions
From: Michael Kelley @ 2026-01-11 22:27 UTC (permalink / raw)
To: Yu Zhang
Cc: linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
iommu@lists.linux.dev, linux-pci@vger.kernel.org,
kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
decui@microsoft.com, lpieralisi@kernel.org,
kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org,
bhelgaas@google.com, arnd@arndb.de, joro@8bytes.org,
will@kernel.org, robin.murphy@arm.com,
easwar.hariharan@linux.microsoft.com,
jacob.pan@linux.microsoft.com, nunodasneves@linux.microsoft.com,
mrathor@linux.microsoft.com, peterz@infradead.org,
linux-arch@vger.kernel.org
In-Reply-To: <4xjdq3js7w4qxcev727ujedpcujvzgrhf4xsfn3plfrn7fskxu@2qwxljanz3i6>
From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Friday, January 9, 2026 9:07 PM
>
> On Thu, Jan 08, 2026 at 06:47:44PM +0000, Michael Kelley wrote:
> > From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, December 8, 2025 9:11 PM
> > >
> >
> > The "Subject:" line prefix for this patch should probably be "Drivers: hv:"
> > to be consistent with most other changes to this source code file.
> >
> > > Previously, the allocation of per-CPU output argument pages was restricted
> > > to root partitions or those operating in VTL mode.
> > >
> > > Remove this restriction to support guest IOMMU related hypercalls, which
> > > require valid output pages to function correctly.
> >
> > The thinking here isn't quite correct. Just because a hypercall produces output
> > doesn't mean that Linux needs to allocate a page for the output that is separate
> > from the input. It's perfectly OK to use the same page for both input and output,
> > as long as the two areas don't overlap. Yes, the page is called
> > "hyperv_pcpu_input_arg", but that's a historical artifact from before the time
> > it was realized that the same page can be used for both input and output.
> >
> > Of course, if there's ever a hypercall that needs lots of input and lots of output
> > such that the combined size doesn't fit in a single page, then separate input
> > and output pages will be needed. But I'm skeptical that will ever happen. Rep
> > hypercalls could have large amounts of input and/or output, but I'd venture
> > that the rep count can always be managed so everything fits in a single page.
> >
>
> Thanks, Michael.
>
> Is there an existing hypercall precedent that reuses the input page for output?
> I believe reusing the input page should be acceptable, at least for pvIOMMU's
> hypercalls, but I will confirm these interfaces with the Hyper-V team.
See hv_pci_read_mmio() for a precedent in current kernel code.
There's also hv_get_partition_id() which uses hyperv_pcpu_input_arg for
the hypercall output. But in this case, there is no input, so input and output
aren't actually sharing the page.
In kernel 6.13 and earlier, get_vtl() used the hyperv_pcpu_input_arg page
for both input and output, but it did it wrong because the input and output areas
overlapped. While overlap worked because the hypercall is a simple "one-shot"
operation (i.e., read the input, then write the output), it's not legal according
to the TLFS. When the illegal overlap was fixed in commit 07412e1f163d, the
developer decided to allocate the hyperv_pcpu_output_arg page for VTL2 images,
so the fix uses separate pages for the input and output. There was extensive
discussion of the tradeoffs in allocating the output page for VTL2. In my view
it was an unnecessary use of memory, but the developer preferred to do it for
consistency, and I didn't press the argument because it was limited to VTL2.
Similarly, I won't press the argument here if folks really want to always allocate
the output page. My only request is that the commit message not be misleading
about the reason.
See https://elixir.bootlin.com/linux/v6.13/source/arch/x86/hyperv/hv_init.c#L416
for the older get_vtl() code that puts the input and output in the same page, but
improperly overlaps.
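As a rough illustration of the single-page approach, here is a user-space
sketch (not kernel code) that carves one page into an input area followed
by a non-overlapping output area. The names split_arg_page, shared_args,
SKETCH_PAGE_SIZE and SKETCH_ALIGN are hypothetical, and the 4096-byte page
and 8-byte alignment are assumptions made for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed hypervisor page size */
#define SKETCH_ALIGN     8u	/* assumed alignment of hypercall arguments */

struct shared_args {
	void *input;	/* start of the input area (start of the page) */
	void *output;	/* start of the output area, after the input */
};

/*
 * Split one page into an input area followed by an output area so the
 * two never overlap. Returns 0 on success, or -1 if the combined size
 * does not fit in a single page.
 */
static int split_arg_page(void *page, size_t in_size, size_t out_size,
			  struct shared_args *args)
{
	size_t out_off;

	/* The page itself is expected to be suitably aligned. */
	assert(((uintptr_t)page % SKETCH_ALIGN) == 0);

	if (in_size > SKETCH_PAGE_SIZE)
		return -1;

	/* Round the input size up so the output starts on an aligned offset. */
	out_off = (in_size + SKETCH_ALIGN - 1) & ~(size_t)(SKETCH_ALIGN - 1);
	if (out_size > SKETCH_PAGE_SIZE - out_off)
		return -1;

	args->input = page;
	args->output = (char *)page + out_off;
	return 0;
}
```

With a 20-byte input and a 16-byte output, the output would start at
offset 24. A request whose rounded input plus output exceeds the page is
rejected, which corresponds to the case where separate pages (or a smaller
rep count) would be needed.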
>
> > >
> > > While unconditionally allocating per-CPU output pages scales with the number
> > > of vCPUs, and potentially adding overhead for guests that may not utilize the
> > > IOMMU, this change anticipates that future hypercalls from child partitions
> > > may also require these output pages.
> >
> > I've heard the argument that the amount of overhead is modest relative to the
> > overall amount of memory that is typically in a VM, particularly VMs with high
> > vCPU counts. And I don't disagree. But on the flip side, why tie up memory when
> > there's no need to do so? I'd argue for dropping this patch, and changing the
> > two hypercall call sites in Patch 5 to just use part of the so-called hypercall input
> > page for the output as well. It's only a one-line change in each hypercall call site.
> >
>
> I share your concern about unconditionally allocating a separate output page
> for each vCPU. And if reusing the input page isn't accepted by the Hyper-V team,
> perhaps we could gate the allocation by checking
> IS_ENABLED(CONFIG_HYPERV_PVIOMMU) in hv_output_page_exist()?
Yes, that's doable, though I hope it doesn't come to that. At some point the
additional complexity starts to favor just allocating the output page. :-)
Michael
* RE: [RFC v1 1/5] PCI: hv: Create and export hv_build_logical_dev_id()
From: Michael Kelley @ 2026-01-11 17:36 UTC (permalink / raw)
To: Easwar Hariharan
Cc: Yu Zhang, linux-kernel@vger.kernel.org,
linux-hyperv@vger.kernel.org, iommu@lists.linux.dev,
linux-pci@vger.kernel.org, kys@microsoft.com,
haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com,
lpieralisi@kernel.org, kwilczynski@kernel.org, mani@kernel.org,
robh@kernel.org, bhelgaas@google.com, arnd@arndb.de,
joro@8bytes.org, will@kernel.org, robin.murphy@arm.com,
jacob.pan@linux.microsoft.com, nunodasneves@linux.microsoft.com,
mrathor@linux.microsoft.com, peterz@infradead.org,
linux-arch@vger.kernel.org
In-Reply-To: <162c901f-69a7-420a-9148-a469d5a8ca4f@linux.microsoft.com>
From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> Sent: Friday, January 9, 2026 10:41 AM
>
> On 1/8/2026 10:46 AM, Michael Kelley wrote:
> > From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, December 8, 2025 9:11 PM
> >>
> >> From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
> >>
> >> Hyper-V uses a logical device ID to identify a PCI endpoint device for
> >> child partitions. This ID will also be required for future hypercalls
> >> used by the Hyper-V IOMMU driver.
> >>
> >> Refactor the logic for building this logical device ID into a standalone
> >> helper function and export the interface for wider use.
> >>
> >> Signed-off-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
> >> Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com>
> >> ---
> >> drivers/pci/controller/pci-hyperv.c | 28 ++++++++++++++++++++--------
> >> include/asm-generic/mshyperv.h | 2 ++
> >> 2 files changed, 22 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> >> index 146b43981b27..4b82e06b5d93 100644
> >> --- a/drivers/pci/controller/pci-hyperv.c
> >> +++ b/drivers/pci/controller/pci-hyperv.c
> >> @@ -598,15 +598,31 @@ static unsigned int hv_msi_get_int_vector(struct irq_data *data)
> >>
> >> #define hv_msi_prepare pci_msi_prepare
> >>
> >> +/**
> >> + * Build a "Device Logical ID" out of this PCI bus's instance GUID and the
> >> + * function number of the device.
> >> + */
> >> +u64 hv_build_logical_dev_id(struct pci_dev *pdev)
> >> +{
> >> + struct pci_bus *pbus = pdev->bus;
> >> + struct hv_pcibus_device *hbus = container_of(pbus->sysdata,
> >> + struct hv_pcibus_device, sysdata);
> >> +
> >> + return (u64)((hbus->hdev->dev_instance.b[5] << 24) |
> >> + (hbus->hdev->dev_instance.b[4] << 16) |
> >> + (hbus->hdev->dev_instance.b[7] << 8) |
> >> + (hbus->hdev->dev_instance.b[6] & 0xf8) |
> >> + PCI_FUNC(pdev->devfn));
> >> +}
> >> +EXPORT_SYMBOL_GPL(hv_build_logical_dev_id);
> >
> > This change is fine for hv_irq_retarget_interrupt(), but it doesn't help for the
> > new IOMMU driver because pci-hyperv.c can be (and often is) built as a module.
> > The new Hyper-V IOMMU driver in this patch series is built-in, and so it can't
> > use this symbol in that case -- you'll get a link error on vmlinux when building
> > the kernel. Requiring pci-hyperv.c to *not* be built as a module would also
> > require that the VMBus driver not be built as a module, so I don't think that's
> > the right solution.
> >
> > This is a messy problem. The new IOMMU driver needs to start with a generic
> > "struct device" for the PCI device, and somehow find the corresponding VMBus
> > PCI pass-thru device from which it can get the VMBus instance ID. I'm thinking
> > about ways to do this that don't depend on code and data structures that are
> > private to the pci-hyperv.c driver, and will follow-up if I have a good suggestion.
>
> Thank you, Michael. FWIW, I did try to pull the device ID components out of
> pci-hyperv into include/linux/hyperv.h and/or a new include/linux/pci-hyperv.h
> but it was just too messy as you say.
Yes, the current approach for getting the device ID wanders through struct
hv_pcibus_device (which is private to the pci-hyperv driver), and through
struct hv_device (which is a VMBus data structure). That makes the linkage
between the PV IOMMU driver and the pci-hyperv and VMBus drivers rather
substantial, which is not good.

But here's an idea for an alternate approach. The PV IOMMU driver doesn't
have to generate the logical device ID on-the-fly by going to the dev_instance
field of struct hv_device. Instead, the pci-hyperv driver can generate the logical
device ID in hv_pci_probe(), and put it somewhere that's easy for the IOMMU
driver to access. The logical device ID doesn't change while Linux is running, so
stashing another copy somewhere isn't a problem.

So have the Hyper-V PV IOMMU driver provide an EXPORTed function to accept
a PCI domain ID and the related logical device ID. The PV IOMMU driver is
responsible for storing this data in a form that it can later search. hv_pci_probe()
calls this new function when it instantiates a new PCI pass-thru device. Then when
the IOMMU driver needs to attach a new device, it can get the PCI domain ID
from the struct pci_dev (or struct pci_bus), search for the related logical device
ID in its own data structure, and use it. The pci-hyperv driver has a dependency
on the IOMMU driver, but that's a dependency in the desired direction. The
PCI domain ID and logical device ID are just integers, so no data structures are
shared.

Note that the pci-hyperv driver must inform the PV IOMMU driver of the logical
device ID *before* create_root_hv_pci_bus() calls pci_scan_root_bus_bridge().
The latter function eventually invokes hv_iommu_attach_dev(), which will
need the logical device ID. See the example stack trace. [1]

I don't think the pci-hyperv driver even needs to tell the IOMMU driver to
remove the information if a PCI pass-thru device is unbound or removed, as
the logical device ID will be the same if the device ever comes back. At worst,
the IOMMU driver can simply replace an existing logical device ID if a new one
is provided for the same PCI domain ID.
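
For illustration only, the registration scheme described above might look
roughly like the following userspace C sketch. It is not code from the patch
series. The names hv_iommu_register_dev_id(), hv_iommu_lookup_dev_id(),
struct hv_iommu_dev_id_entry and HV_IOMMU_MAX_DOMAINS are hypothetical, and
the integer widths and the fixed-size table are assumptions made to keep the
example short.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capacity; a real driver would likely use a dynamic structure. */
#define HV_IOMMU_MAX_DOMAINS 64

/* One record per PCI domain. Both values are plain integers. */
struct hv_iommu_dev_id_entry {
	int in_use;
	uint32_t pci_domain;		/* width is an assumption */
	uint64_t logical_dev_id;	/* width is an assumption */
};

static struct hv_iommu_dev_id_entry hv_iommu_dev_ids[HV_IOMMU_MAX_DOMAINS];

/*
 * Hypothetical function the IOMMU driver would export. The PCI driver
 * would call it from its probe path before scanning the root bus.
 * An existing entry for the same PCI domain is overwritten.
 * Returns 0 on success, -1 if the table is full.
 */
int hv_iommu_register_dev_id(uint32_t pci_domain, uint64_t logical_dev_id)
{
	struct hv_iommu_dev_id_entry *slot = NULL;
	size_t i;

	for (i = 0; i < HV_IOMMU_MAX_DOMAINS; i++) {
		struct hv_iommu_dev_id_entry *e = &hv_iommu_dev_ids[i];

		if (e->in_use && e->pci_domain == pci_domain) {
			slot = e;	/* domain seen before: replace */
			break;
		}
		if (!e->in_use && !slot)
			slot = e;	/* remember the first free slot */
	}
	if (!slot)
		return -1;

	slot->in_use = 1;
	slot->pci_domain = pci_domain;
	slot->logical_dev_id = logical_dev_id;
	return 0;
}

/*
 * Hypothetical lookup used on the attach path. Returns 0 and fills
 * *logical_dev_id if the domain is known, -1 otherwise.
 */
int hv_iommu_lookup_dev_id(uint32_t pci_domain, uint64_t *logical_dev_id)
{
	size_t i;

	for (i = 0; i < HV_IOMMU_MAX_DOMAINS; i++) {
		const struct hv_iommu_dev_id_entry *e = &hv_iommu_dev_ids[i];

		if (e->in_use && e->pci_domain == pci_domain) {
			*logical_dev_id = e->logical_dev_id;
			return 0;
		}
	}
	return -1;
}
```

A kernel version would also need locking around the table and an
EXPORT_SYMBOL_GPL() on the registration function; both are left out here.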
An include file must provide a stub for the new function if
CONFIG_HYPERV_PVIOMMU is not defined, so that the pci-hyperv driver still
builds and works.
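
A minimal sketch of such a stub, again purely illustrative: the function
name hv_iommu_register_dev_id() and its signature are hypothetical, and only
CONFIG_HYPERV_PVIOMMU comes from this thread. When the option is not
defined, the call compiles to a no-op that reports success.

```c
#include <stdint.h>

#ifdef CONFIG_HYPERV_PVIOMMU
/* Real implementation lives in the (built-in) PV IOMMU driver. */
int hv_iommu_register_dev_id(uint32_t pci_domain, uint64_t logical_dev_id);
#else
/* Hypothetical stub: nothing to record when the PV IOMMU driver is absent. */
static inline int hv_iommu_register_dev_id(uint32_t pci_domain,
					   uint64_t logical_dev_id)
{
	(void)pci_domain;
	(void)logical_dev_id;
	return 0;
}
#endif
```

With this shape the caller in the PCI driver needs no #ifdef of its own.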
I haven't coded this up, but it seems like it should be pretty clean.

Michael

[1] Example stack trace, starting with vmbus_add_channel_work() as a
result of Hyper-V offering the PCI pass-thru device to the guest.
hv_pci_probe() runs, and ends up in the generic Linux code for adding
a PCI device, which in turn sets up the IOMMU.
[ 1.731786] hv_iommu_attach_dev+0xf0/0x1d0
[ 1.731788] __iommu_attach_device+0x21/0xb0
[ 1.731790] __iommu_device_set_domain+0x65/0xd0
[ 1.731792] __iommu_group_set_domain_internal+0x61/0x120
[ 1.731795] iommu_setup_default_domain+0x3a4/0x530
[ 1.731796] __iommu_probe_device.part.0+0x15d/0x1d0
[ 1.731798] iommu_probe_device+0x81/0xb0
[ 1.731799] iommu_bus_notifier+0x2c/0x80
[ 1.731800] notifier_call_chain+0x66/0xe0
[ 1.731802] blocking_notifier_call_chain+0x47/0x70
[ 1.731804] bus_notify+0x3b/0x50
[ 1.731805] device_add+0x631/0x850
[ 1.731807] pci_device_add+0x2db/0x670
[ 1.731809] pci_scan_single_device+0xc3/0x100
[ 1.731810] pci_scan_slot+0x97/0x230
[ 1.731812] pci_scan_child_bus_extend+0x3b/0x2f0
[ 1.731814] pci_scan_root_bus_bridge+0xc0/0xf0
[ 1.731816] hv_pci_probe+0x398/0x5f0
[ 1.731817] vmbus_probe+0x42/0xa0
[ 1.731819] really_probe+0xe5/0x3e0
[ 1.731822] __driver_probe_device+0x7e/0x170
[ 1.731823] driver_probe_device+0x23/0xa0
[ 1.731824] __device_attach_driver+0x92/0x130
[ 1.731826] bus_for_each_drv+0x8c/0xe0
[ 1.731828] __device_attach+0xc0/0x200
[ 1.731830] device_initial_probe+0x4c/0x50
[ 1.731831] bus_probe_device+0x32/0x90
[ 1.731832] device_add+0x65b/0x850
[ 1.731836] device_register+0x1f/0x30
[ 1.731837] vmbus_device_register+0x87/0x130
[ 1.731840] vmbus_add_channel_work+0x139/0x1a0
[ 1.731841] process_one_work+0x19f/0x3f0
[ 1.731843] worker_thread+0x188/0x2f0
[ 1.731845] kthread+0x119/0x230
[ 1.731852] ret_from_fork+0x1b4/0x1e0
[ 1.731854] ret_from_fork_asm+0x1a/0x30
>
> > I was wondering if this "logical device id" is actually parsed by the hypervisor,
> > or whether it is just a unique ID that is opaque to the hypervisor. From the
> > usage in the hypercalls in pci-hyperv.c and this new IOMMU driver, it appears
> > to be the former. Evidently the hypervisor is taking this logical device ID and
> > matching against bytes 4 thru 7 of the instance GUIDs of PCI pass-thru
> > devices offered to the guest, so as to identify a particular PCI pass-thru device.
> > If that's the case, then Linux doesn't have the option of choosing some other
> > unique ID that is easier to generate and access.
>
> Yes, the device ID is actually used by the hypervisor to find the corresponding PCI
> pass-thru device and the physical IOMMUs the device is behind and execute the
> requested operation for those IOMMUs.
>
> > There's a uniqueness issue with this kind of logical device ID that has been
> > around for years, but I had never thought about before. In hv_pci_probe()
> > instance GUID bytes 4 and 5 are used to generate the PCI domain number for
> > the "fake" PCI bus that the PCI pass-thru device resides on. The issue is the
> > lack of guaranteed uniqueness of bytes 4 and 5, so there's code to deal with
> > a collision. (The full GUID is unique, but not necessarily some subset of the
> > GUID.) It seems like the same kind of uniqueness issue could occur here. Does
> > the Hyper-V host provide any guarantees about the uniqueness of bytes 4 thru
> > 7 as a unit, and if not, what happens if there is a collision? Again, this
> > uniqueness issue has existed for years, so it's not new to this patch set, but
> > with new uses of the logical device ID, it seems relevant to consider.
>
> Thank you for bringing that up, I was aware of the uniqueness workaround but, like you,
> I had not considered that the workaround could prevent matching the device ID with the
> record the hypervisor has of the PCI pass-thru device assigned to us. I will work with
> the hypervisor folks to resolve this before this patch series is posted for merge.
>
> Thanks,
> Easwar (he/him)
* [PATCH 1/1] PCI: hv: Remove unused field pci_bus in struct hv_pcibus_device
From: mhkelley58 @ 2026-01-11 17:00 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli, lpieralisi, kwilczynski,
mani, robh, bhelgaas
Cc: linux-pci, linux-kernel, linux-hyperv
From: Michael Kelley <mhklinux@outlook.com>
Field pci_bus in struct hv_pcibus_device is unused since
commit 418cb6c8e051 ("PCI: hv: Generify PCI probing"). Remove it.
No functional change.
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
---
drivers/pci/controller/pci-hyperv.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 1e237d3538f9..7fcba05cec30 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -501,7 +501,6 @@ struct hv_pcibus_device {
struct resource *low_mmio_res;
struct resource *high_mmio_res;
struct completion *survey_event;
- struct pci_bus *pci_bus;
spinlock_t config_lock; /* Avoid two threads writing index page */
spinlock_t device_list_lock; /* Protect lists below */
void __iomem *cfg_addr;
--
2.25.1
* Re: [PATCH net-next v12 02/12] vsock: add netns to vsock core
From: Michael S. Tsirkin @ 2026-01-11 9:16 UTC (permalink / raw)
To: Bobby Eshleman
Cc: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Stefan Hajnoczi, Jason Wang,
Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Shuah Khan, linux-kernel,
virtualization, netdev, kvm, linux-hyperv, linux-kselftest,
berrange, Sargun Dhillon, Bobby Eshleman
In-Reply-To: <20251126-vsock-vmtest-v12-2-257ee21cd5de@meta.com>
On Wed, Nov 26, 2025 at 11:47:31PM -0800, Bobby Eshleman wrote:
> From: Bobby Eshleman <bobbyeshleman@meta.com>
>
> Add netns logic to vsock core. Additionally, modify transport hook
> prototypes to be used by later transport-specific patches (e.g.,
> *_seqpacket_allow()).
>
> Namespaces are supported primarily by changing socket lookup functions
> (e.g., vsock_find_connected_socket()) to take into account the socket
> namespace and the namespace mode before considering a candidate socket a
> "match".
>
> This patch also introduces the sysctl /proc/sys/net/vsock/ns_mode that
> accepts the "global" or "local" mode strings.
>
> Add netns functionality (initialization, passing to transports, procfs,
> etc...) to the af_vsock socket layer. Later patches that add netns
> support to transports depend on this patch.
>
> dgram_allow(), stream_allow(), and seqpacket_allow() callbacks are
> modified to take a vsk in order to perform logic on namespace modes. In
> future patches, the net and net_mode will also be used for socket
> lookups in these functions.
>
> Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
> ---
...
> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> index adcba1b7bf74..6113c22db8dc 100644
> --- a/net/vmw_vsock/af_vsock.c
> +++ b/net/vmw_vsock/af_vsock.c
...
> @@ -2658,6 +2745,142 @@ static struct miscdevice vsock_device = {
> .fops = &vsock_device_ops,
> };
>
> +static int vsock_net_mode_string(const struct ctl_table *table, int write,
> + void *buffer, size_t *lenp, loff_t *ppos)
> +{
> + char data[VSOCK_NET_MODE_STR_MAX] = {0};
> + enum vsock_net_mode mode;
> + struct ctl_table tmp;
nit: this file should now include linux/sysctl.h for this struct definition I
think?