Linux-HyperV List
 help / color / mirror / Atom feed
* Re: [PATCH net 0/2] net: mana: fix error-path issues in queue setup
From: Aditya Garg @ 2026-06-08  4:43 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, pabeni, horms, shradhagupta, dipayanroy, ernis, kees,
	shacharr, stephen, gargaditya, ssengar, linux-hyperv, netdev,
	linux-kernel
In-Reply-To: <20260605182748.5f106575@kernel.org>

On 06-06-2026 06:57, Jakub Kicinski wrote:
> On Thu,  4 Jun 2026 01:01:24 -0700 Aditya Garg wrote:
>> Two error-path fixes in MANA queue setup, both surfaced during Sashiko
>> AI review of a recently upstreamed patch series.
>>
>> Patch 1 initializes queue->id to INVALID_QUEUE_ID in
>> mana_gd_create_mana_wq_cq() so that a CQ creation failure before the
>> firmware id is assigned does not NULL gc->cq_table[0] and silently
>> break whichever real CQ owns that slot. This mirrors the existing
>> pattern in mana_gd_create_eq().
>>
>> Patch 2 guards mana_destroy_txq()'s call to mana_destroy_wq_obj() with
>> an INVALID_MANA_HANDLE check, mirroring mana_destroy_rxq(). Without
>> it, TX setup failures lead to a firmware-rejected destroy of (u64)-1
>> and a spurious error in dmesg.
> 
> Looks like these patches were generated against net-next, please rebase:
> 
> Applying: net: mana: initialize gdma queue id to INVALID_QUEUE_ID
> Applying: net: mana: guard TX wq object destroy with INVALID_MANA_HANDLE check
> error: patch failed: drivers/net/ethernet/microsoft/mana/mana_en.c:2351
> error: drivers/net/ethernet/microsoft/mana/mana_en.c: patch does not apply
> Patch failed at 0002 net: mana: guard TX wq object destroy with INVALID_MANA_HANDLE check

Thanks Jakub for pointing it out.
I'll rebase against net and post a v2

Regards,
Aditya

^ permalink raw reply

* Re: [PATCH 1/1] Drivers: hv: vmbus: Set DMA coherent mask for VMBus devices
From: sashiko-bot @ 2026-06-08  2:19 UTC (permalink / raw)
  To: Michael Kelley; +Cc: linux-hyperv
In-Reply-To: <20260608020616.52852-1-mhklkml@zohomail.com>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] If device_register() fails in vmbus_device_register(), the error handling path panics the kernel due to a BUG_ON, and would result in a double free if the panic were bypassed.
--

commit 575a138c8a50043a8fa9993f9a0f7b0f34865b8a
Author: Michael Kelley <mhklinux@outlook.com>
Drivers: hv: vmbus: Set DMA coherent mask for VMBus devices

This commit fixes DMA allocation failures for VMBus devices in CoCo VMs.
It ensures the coherent_dma_mask is set to 64 bits at the same time the
dma_mask is configured during vmbus_device_register().

> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 23206640c6139..a6b9a33db6578 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -2169,6 +2169,7 @@ int vmbus_device_register(struct hv_device *child_device_obj)
>  	child_device_obj->device.dma_parms = &child_device_obj->dma_parms;
>  	child_device_obj->device.dma_mask = &child_device_obj->dma_mask;
>  	dma_set_mask(&child_device_obj->device, DMA_BIT_MASK(64));
> +	dma_set_coherent_mask(&child_device_obj->device, DMA_BIT_MASK(64));
>  
>  	/*
>  	 * Register with the LDM. This will kick off the driver/device

[Severity: High]
This is a pre-existing issue, but does the error handling path for
device_register() further down in this function cause a kernel panic?

If device_register() fails, the code calls put_device():

vmbus_device_register() {
    ...
    ret = device_register(&child_device_obj->device);
    if (ret) {
        pr_err("Unable to register child device\n");
        put_device(&child_device_obj->device);
        return ret;
    }
    ...
}

This triggers the release callback vmbus_device_release(), which
unconditionally invokes hv_process_channel_removal(). However,
hv_process_channel_removal() begins with:

    BUG_ON(!channel->rescind);

Since channel->rescind is false during initial channel offer processing,
would this BUG_ON fire and crash the kernel?

Additionally, if the BUG_ON were bypassed, hv_process_channel_removal()
calls free_channel(channel). The error then propagates up to
vmbus_add_channel_work(), which takes the err_deq_chan label and calls
free_channel(newchannel) again. Does this sequence result in a double free?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260608020616.52852-1-mhklkml@zohomail.com?part=1

^ permalink raw reply

* [PATCH 1/1] Drivers: hv: vmbus: Set DMA coherent mask for VMBus devices
From: Michael Kelley @ 2026-06-08  2:06 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, linux-hyperv; +Cc: linux-kernel

From: Michael Kelley <mhklinux@outlook.com>

In current code, the coherent_dma_mask for VMBus devices is not set, so
it has the default value of 0, which essentially means "invalid". Because
drivers for VMBus devices do not use dma_alloc_*() functions, the usual
use of the coherent mask does not occur, and no errors result.

However, a valid coherent_dma_mask may be needed even though the drivers
don't use dma_alloc_*() functions. In a CoCo VM, the VMBus storvsc and
netvsc drivers must bounce buffer DMA operations through the swiotlb
because the Hyper-V host can't DMA into encrypted guest memory. If the
kernel is built with CONFIG_SWIOTLB_DYNAMIC and the initial swiotlb size
is small, swiotlb code may need to grow the swiotlb in response to a DMA
mapping request. That growth first allocates a transient pool while the
swiotlb is expanded in the background. The transient pool memory is
allocated from the DMA atomic pools, and the allocation code checks for
a valid coherent_dma_mask. With current code, this check fails, then the
DMA mapping request from the storvsc or netvsc driver fails, and finally
an I/O error occurs.

Fix this problem by setting coherent_dma_mask for VMBus devices at the
same time that dma_mask is set. Being a synthetic bus, VMBus does not
have any restrictions on coherent DMA, so the coherent mask is set to
the full 64 bits for all VMBus devices, just like with dma_mask.

Signed-off-by: Michael Kelley <mhklinux@outlook.com>
---
I have not provided a Fixes: tag because the scenario under which
the error occurs is an artificial test case that I came across
while stressing a unrelated patch set. The fix is valid for general
goodness, but the likelihood of the problem occurring in the real
world is extremely small. So I see little value in adding this
patch to the stable kernel maintainers' workload. If someone wants
to argue otherwise, I have no fundamental objection to adding the
Fixes: tag.

 drivers/hv/vmbus_drv.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index c9eeb2ec365d..26e8273bbddd 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -2192,6 +2192,7 @@ int vmbus_device_register(struct hv_device *child_device_obj)
 	child_device_obj->device.dma_parms = &child_device_obj->dma_parms;
 	child_device_obj->device.dma_mask = &child_device_obj->dma_mask;
 	dma_set_mask(&child_device_obj->device, DMA_BIT_MASK(64));
+	dma_set_coherent_mask(&child_device_obj->device, DMA_BIT_MASK(64));
 
 	/*
 	 * Register with the LDM. This will kick off the driver/device
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH net v3] net: mana: Optimize irq affinity for low vcpu configs
From: Shradha Gupta @ 2026-06-07 12:16 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Konstantin Taranov,
	Simon Horman, Erni Sri Satya Vennela, Dipayaan Roy, Shiraz Saleem,
	Michael Kelley, Long Li, Yury Norov, linux-hyperv, linux-kernel,
	netdev, Paul Rosswurm, Shradha Gupta, Saurabh Singh Sengar,
	stable
In-Reply-To: <6d1fa9d9-73c2-48b5-95a1-51710d81b3ed@redhat.com>

On Thu, Jun 04, 2026 at 12:45:03PM +0200, Paolo Abeni wrote:
> On 6/1/26 12:27 PM, Shradha Gupta wrote:
> > In mana driver, the number of IRQs allocated is capped by the
> > min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
> > than the vcpu count, we want to utilize all the vCPUs, irrespective of
> > their NUMA/core bindings.
> > 
> > This is important, especially in the envs where number of vCPUs are so
> > few that the softIRQ handling overhead on two IRQs on the same vCPU is
> > much more than their overheads if they were spread across sibling vCPUs.
> > 
> > This behaviour is more evident with dynamic IRQ allocation. Since MANA
> > IRQs are assigned at a later stage compared to static allocation, other
> > device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> > weights become imbalanced, causing multiple MANA IRQs to land on the
> > same vCPU, while some vCPUs have none.
> > 
> > In such cases when many parallel TCP connections are tested, the
> > throughput drops significantly.
> > 
> > Test envs:
> > =======================================================
> > Case 1: without this patch
> > =======================================================
> > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > 
> > 	TYPE		effective vCPU aff
> > =======================================================
> > IRQ0:	HWC		0
> > IRQ1:	mana_q1		0
> > IRQ2:	mana_q2		2
> > IRQ3:	mana_q3		0
> > IRQ4:	mana_q4		3
> > 
> > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > vCPU		0	1	2	3
> > =======================================================
> > pass 1:		38.85	0.03	24.89	24.65
> > pass 2:		39.15	0.03	24.57	25.28
> > pass 3:		40.36	0.03	23.20	23.17
> > 
> > =======================================================
> > Case 2: with this patch
> > =======================================================
> > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > 
> >         TYPE            effective vCPU aff
> > =======================================================
> > IRQ0:   HWC             0
> > IRQ1:   mana_q1         0
> > IRQ2:   mana_q2         1
> > IRQ3:   mana_q3         2
> > IRQ4:   mana_q4         3
> > 
> > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > vCPU            0       1       2       3
> > =======================================================
> > pass 1:         15.42	15.85	14.99	14.51
> > pass 2:         15.53	15.94	15.81	15.93
> > pass 3:         16.41	16.35	16.40	16.36
> > 
> > =======================================================
> > Throughput Impact(in Gbps, same env)
> > =======================================================
> > TCP conn	with patch	w/o patch
> > 20480		15.65		7.73
> > 10240		15.63		8.93
> > 8192		15.64		9.69
> > 6144		15.64		13.16
> > 4096		15.69		15.75
> > 2048		15.69		15.83
> > 1024		15.71		15.28
> > 
> > Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> > Cc: stable@vger.kernel.org
> > Co-developed-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> > Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> > Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> > Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> > Reviewed-by: Simon Horman <horms@kernel.org>
> 
> Why do you consider this patch a fix? To me is a configuration
> improvement and should land on net-next.

Hi Paolo,

This is a fix for commit 755391121038 ("net: mana: Allocate MSI-X
vectors dynamically"). Before that commit, IRQs were statically
allocated and clustering of MANA IRQs happened less often on low vCPU
configs. With dynamic allocation, MANA IRQs are assigned at a later
stage when other device IRQs have already occupied vCPUs. The NUMA-aware
affinity logic in that commit increased the probability of IRQ
clustering, causing a 2x throughput regression (15.65 vs 7.73 Gbps) on
low vCPU Azure SKUs at high connection counts.

> 
> > @@ -1717,11 +1719,24 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
> >  	return 0;
> >  }
> >  
> > +/* should be called with cpus_read_lock() held */
> 
> Minor nit: s/should/must/ or just drop the comment, as
> `for_each_online_cpu()` usage implies that.
> 

Thanks, will change this in next version.

> > +static void irq_setup_linear(unsigned int *irqs, unsigned int len)
> > +{
> > +	int cpu;
> > +
> > +	for_each_online_cpu(cpu) {
> > +		if (len == 0)
> > +			break;
> > +
> > +		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
> > +		len--;
> > +	}
> 
> As this is another heuristic regarding irq spreading, why don't you
> implement that inside irq_setup()?
> 

irq_setup() already handles multiple cases - dynamic, static, and HWC
affinity logic with NUMA-aware sibling group spreading. Adding the
linear case there would make it more complex and harder to follow.
Keeping it as a separate function makes both, NUMA aware and linear
paths easier to understand and maintain.

Happy to reconsider if you feel strongly about it.

> > @@ -1767,13 +1784,42 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> >  	 * first CPU sibling group since they are already affinitized to HWC IRQ
> >  	 */
> >  	cpus_read_lock();
> > -	if (gc->num_msix_usable <= num_online_cpus())
> > -		skip_first_cpu = true;
> > +	if (gc->num_msix_usable <= num_online_cpus()) {
> > +		err = irq_setup(irqs, nvec, gc->numa_node, true);
> > +		if (err) {
> > +			cpus_read_unlock();
> > +			goto free_irq;
> > +		}
> > +	} else {
> > +		/*
> > +		 * When num_msix_usable are more than num_online_cpus, our
> > +		 * queue IRQs should be equal to num of online vCPUs.
> > +		 * We try to make sure queue IRQs spread across all vCPUs.
> > +		 * In such a case NUMA or CPU core affinity does not matter.
> > +		 * Note: in this case the total mana IRQ should always be
> > +		 * num_online_cpus + 1. The first HWC IRQ is already handled
> > +		 * in HWC setup calls
> > +		 * However, if CPUs went offline since num_msix_usable was
> > +		 * computed, queue IRQs will be more than num_online_cpus().
> > +		 * In such cases remaining extra IRQs will retain their default
> > +		 * affinity.
> > +		 */
> > +		int first_unassigned = num_online_cpus();
> > +		if (nvec > first_unassigned) {
> 
> An empty line is needed between the variable declaration and the code.

noted, Thanks.

> 
> /P

^ permalink raw reply

* [PATCH net-next v2] net: mana: Cache MANA_QUERY_LINK_CONFIG result to avoid repeated HWC queries
From: Erni Sri Satya Vennela @ 2026-06-06 13:32 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, kotaranov, horms, ernis, dipayanroy, kees,
	linux-hyperv, netdev, linux-kernel, linux-rdma

mana_query_link_cfg() sends an HWC command to firmware on every call,
but the link speed and QoS values it returns only change when the
driver explicitly calls mana_set_bw_clamp(). This function is called
not only by userspace via ethtool get_link_ksettings, but also
periodically by hv_netvsc through netvsc_get_link_ksettings and by
the sysfs speed_show attribute via dev_attr_show, resulting in
unnecessary HWC traffic every few minutes.

Add a link_cfg_error field to mana_port_context to cache the query
result. The field uses three states: 1 (not yet queried, initial
value set during mana_probe_port), 0 (success, speed/max_speed are
valid), or a negative errno for permanent errors like -EOPNOTSUPP
when the hardware does not support the command. Transient errors and
qos_unconfigured responses are not cached so that subsequent calls
will retry.

MANA is ops-locked because it implements net_shaper_ops, so the core
already takes netdev_lock() around all ethtool_ops and net_shaper_ops
entry points. Reuse that lock to serialize mana_query_link_cfg() and
mana_set_bw_clamp(). This prevents a concurrent mana_set_bw_clamp()
from racing with an in-flight query and publishing stale pre-clamp
speed/max_speed.

Invalidate the cache inside mana_set_bw_clamp() on success, so all
current and future callers that change the link configuration
automatically trigger a fresh query on the next mana_query_link_cfg()
call. Also reset link_cfg_error during resume in mana_probe() under
netdev_lock(), so that any query already in flight cannot later
store 0 and silently overwrite the post-resume invalidation.

Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
---
Changes in v2:
* Use netdev_lock() instead of introducing new per-port mutex.
* Update commit message.
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 23 +++++++++++++++----
 include/net/mana/mana.h                       |  4 ++++
 2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index db14357d3732..af2517a27aad 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1456,6 +1456,12 @@ int mana_query_link_cfg(struct mana_port_context *apc)
 	struct mana_query_link_config_req req = {};
 	int err;
 
+	netdev_assert_locked(ndev);
+
+	err = apc->link_cfg_error;
+	if (err <= 0)
+		return err;
+
 	mana_gd_init_req_hdr(&req.hdr, MANA_QUERY_LINK_CONFIG,
 			     sizeof(req), sizeof(resp));
 
@@ -1468,6 +1474,7 @@ int mana_query_link_cfg(struct mana_port_context *apc)
 	if (err) {
 		if (err == -EOPNOTSUPP) {
 			netdev_info_once(ndev, "MANA_QUERY_LINK_CONFIG not supported\n");
+			apc->link_cfg_error = err;
 			return err;
 		}
 		netdev_err(ndev, "Failed to query link config: %d\n", err);
@@ -1485,12 +1492,12 @@ int mana_query_link_cfg(struct mana_port_context *apc)
 		return err;
 	}
 
-	if (resp.qos_unconfigured) {
-		err = -EINVAL;
-		return err;
-	}
+	if (resp.qos_unconfigured)
+		return -EINVAL;
+
 	apc->speed = resp.link_speed_mbps;
 	apc->max_speed = resp.qos_speed_mbps;
+	apc->link_cfg_error = 0;
 	return 0;
 }
 
@@ -1502,6 +1509,8 @@ int mana_set_bw_clamp(struct mana_port_context *apc, u32 speed,
 	struct net_device *ndev = apc->ndev;
 	int err;
 
+	netdev_assert_locked(ndev);
+
 	mana_gd_init_req_hdr(&req.hdr, MANA_SET_BW_CLAMP,
 			     sizeof(req), sizeof(resp));
 	req.vport = apc->port_handle;
@@ -1535,6 +1544,8 @@ int mana_set_bw_clamp(struct mana_port_context *apc, u32 speed,
 	if (resp.qos_unconfigured)
 		netdev_info(ndev, "QoS is unconfigured\n");
 
+	/* Invalidate the cache; next query will re-fetch from firmware. */
+	apc->link_cfg_error = 1;
 	return 0;
 }
 
@@ -3448,6 +3459,7 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
 	apc->port_handle = INVALID_MANA_HANDLE;
 	apc->pf_filter_handle = INVALID_MANA_HANDLE;
 	apc->port_idx = port_idx;
+	apc->link_cfg_error = 1;
 	apc->cqe_coalescing_enable = 0;
 
 	mutex_init(&apc->vport_mutex);
@@ -3768,6 +3780,9 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
 			rtnl_lock();
 			apc = netdev_priv(ac->ports[i]);
 			enable_work(&apc->queue_reset_work);
+			netdev_lock(ac->ports[i]);
+			apc->link_cfg_error = 1;
+			netdev_unlock(ac->ports[i]);
 			err = mana_attach(ac->ports[i]);
 			rtnl_unlock();
 			/* Log the port for which the attach failed, stop
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index d9c27310fd04..2a45ff7211ef 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -555,6 +555,10 @@ struct mana_port_context {
 	u32 speed;
 	/* Maximum speed supported by the SKU (mbps) */
 	u32 max_speed;
+	/* 1 = not queried, 0 = cached success, negative = permanent error.
+	 * Protected by the netdev instance lock.
+	 */
+	int link_cfg_error;
 
 	bool port_is_up;
 	bool port_st_save; /* Saved port state */
-- 
2.34.1


^ permalink raw reply related

* Re: [PATCH net-next] net: mana: Cache MANA_QUERY_LINK_CONFIG result to avoid repeated HWC queries
From: Erni Sri Satya Vennela @ 2026-06-06 11:17 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, pabeni, kotaranov, horms, dipayanroy, kees,
	linux-hyperv, netdev, linux-kernel, linux-rdma
In-Reply-To: <20260605161315.26784677@kernel.org>

On Fri, Jun 05, 2026 at 04:13:15PM -0700, Jakub Kicinski wrote:
> On Thu, 4 Jun 2026 22:29:29 -0700 Erni Sri Satya Vennela wrote:
> > I tried two netdev_lock-based variants. 
> > 
> > mana_query_link_cfg() has four callers:
> > 
> > 1 ethtool ioctl/netlink			- has RTNL	- has netdev->lock
> > 2 sysfs speed_show/duplex_show		- has RTNL	- no netdev->lock
> > 3 netvsc_get_link_ksettings VF forward	- has RTNL	- no netdev->lock
> > 4 mana_shaper_set			- no RTNL	- has netdev->lock
> > 
> > No existing lock covers all four.
> 
> How fresh is your tree? The just-minted commit 9f275c2e9020 should
> address the gap, I believe?

Hi Jakub,

Thanks for pointing out the commit 9f275c2e9020. It does close the gap.
My analysis was against a tree that predated it but the commit landed
on net-next on Jun 4 21:30 UTC and my reply went out about an hour later,
so the rebase that picked it up hadn't happened on my end yet.

I've now rebased to current net-next and re-walked the four callers of
mana_query_link_cfg(). All of them hold netdev->lock by the time they 
reach mana_query_link_cfg(), and the race scenarios I described no longer
apply.

Thanks,
Vennela

^ permalink raw reply

* Re: [PATCH v4 10/47] x86/tsc: Consolidate forcing of X86_FEATURE_TSC_KNOWN_FREQ for PV code
From: David Woodhouse @ 2026-06-06 10:52 UTC (permalink / raw)
  To: Thomas Gleixner, Sean Christopherson, Paolo Bonzini, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, Kiryl Shutsemau,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
	Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
  Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, Tom Lendacky, Nikunj A Dadhania,
	Michael Kelley
In-Reply-To: <877boc554l.ffs@fw13>

[-- Attachment #1: Type: text/plain, Size: 3487 bytes --]

On Sat, 2026-06-06 at 12:34 +0200, Thomas Gleixner wrote:
> On Fri, May 29 2026 at 07:43, Sean Christopherson wrote:
> 
> > Now that all paravirt code that explicitly specifies the TSC frequency
> > also sets X86_FEATURE_TSC_KNOWN_FREQ, replace all of the one-off code
> > and simply set X86_FEATURE_TSC_KNOWN_FREQ if the TSC frequency is known.
> > 
> > Do NOT force set TSC_KNOWN_FREQ if the "known" TSC frequency was provided
> > by the user.  Per commit bd35c77e32e4 ("x86/tsc: Add tsc_early_khz command
> > line parameter"), one of the goals of the param is to allow the refined
> > calibration work "to do meaningful error checking".
> > 
> > Note, preferring the user-provided TSC frequency over the frequency from
> > the hypervisor or trusted firmware, while simultaneously not treating the
> > user-provided frequency as gospel, is obviously incongruous.  Sweep the
> > problem under the rug for now to avoid opening a big can of worms that
> > likely doesn't have a great answer.
> 
> There is a good answer I think.
> 
> early_tsc_khz exists to cater for the overclocking crowd. On their
> modded systems the firmware supplied TSC frequency (CPUID/MSR) is not
> matching reality anymore. So they work around that by supplying a close
> enough tsc_early_khz and then they let the refined calibration work
> figure it out.
> 
> Arguably that's only relevant for bare metal systems and what's worse is
> that in virtual environments the refined calibration work can fail,
> which renders the TSC unstable.
> 
> So I'd rather say we change this logic to:
> 
>    if (!hypervisor_is_type(X86_HYPER_NATIVE)) {
>       tsc_khz = x86_init.....();
>       force(X86_FEATURE_TSC_KNOWN_FREQ);
>    } else if (tsc_khz_early) {
>       ....
>    } else {
>       ...
>    }
> 
> Along with:
> 
>    if (!hypervisor_is_type(X86_HYPER_NATIVE)) {
>       if (tsc_khz_early)
>          pr_warn("Ignoring non-sensical tsc_early_khz command line argument\n");
> 
> or something daft like that.
> 
> The kernel has for various reasons always tried to cater for the needs
> of users who are plagued by bonkers firmware, but we have to stop to
> prioritize or treating equal ancient and modded out of spec hardware.
> 
> TBH, I consider that whole KVM clock nonsense to fall into the modded
> out of spec hardware realm. Do a reality check:
> 
>    How many production systems are out there still which run VMs on CPUs
>    with a broken TSC and the lack of VM TSC scaling?
> 
> I'm not saying that we should not support the few remaining systems
> anymore, but our tendency to pretend that we can keep all of this
> nonsense working and at the same time making progress is just a fallacy.

I don't know that we can take the KVM (and Xen) clock away from guests,
but all of the *horrid* part about it is the way it attempts to cope
with the possibility that the *host* timekeeping might flip away from
TSC-based mode at any point in time. By the end of my outstanding
cleanup series, that is the *only* thing the gtod_notifier remains for.

If we can trust the hardware *and* the host kernel, then KVM could
theoretically hardwire the kvmclock into 'master clock mode' where it
basically just advertises the TSC→kvmclock relationship *once* to all
CPUs and it never changes.

All the nonsense about updating it every time we enter a CPU could just
go away completely.


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v4 10/47] x86/tsc: Consolidate forcing of X86_FEATURE_TSC_KNOWN_FREQ for PV code
From: Thomas Gleixner @ 2026-06-06 10:34 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
	Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
  Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley
In-Reply-To: <20260529144435.704127-11-seanjc@google.com>

On Fri, May 29 2026 at 07:43, Sean Christopherson wrote:

> Now that all paravirt code that explicitly specifies the TSC frequency
> also sets X86_FEATURE_TSC_KNOWN_FREQ, replace all of the one-off code
> and simply set X86_FEATURE_TSC_KNOWN_FREQ if the TSC frequency is known.
>
> Do NOT force set TSC_KNOWN_FREQ if the "known" TSC frequency was provided
> by the user.  Per commit bd35c77e32e4 ("x86/tsc: Add tsc_early_khz command
> line parameter"), one of the goals of the param is to allow the refined
> calibration work "to do meaningful error checking".
>
> Note, preferring the user-provided TSC frequency over the frequency from
> the hypervisor or trusted firmware, while simultaneously not treating the
> user-provided frequency as gospel, is obviously incongruous.  Sweep the
> problem under the rug for now to avoid opening a big can of worms that
> likely doesn't have a great answer.

There is a good answer I think.

early_tsc_khz exists to cater for the overclocking crowd. On their
modded systems the firmware supplied TSC frequency (CPUID/MSR) is not
matching reality anymore. So they work around that by supplying a close
enough tsc_early_khz and then they let the refined calibration work
figure it out.

Arguably that's only relevant for bare metal systems and what's worse is
that in virtual environments the refined calibration work can fail,
which renders the TSC unstable.

So I'd rather say we change this logic to:

   if (!hypervisor_is_type(X86_HYPER_NATIVE)) {
      tsc_khz = x86_init.....();
      force(X86_FEATURE_TSC_KNOWN_FREQ);
   } else if (tsc_khz_early) {
      ....
   } else {
      ...
   }

Along with:

   if (!hypervisor_is_type(X86_HYPER_NATIVE)) {
      if (tsc_khz_early)
         pr_warn("Ignoring non-sensical tsc_early_khz command line argument\n");

or something daft like that.

The kernel has for various reasons always tried to cater for the needs
of users who are plagued by bonkers firmware, but we have to stop to
prioritize or treating equal ancient and modded out of spec hardware.

TBH, I consider that whole KVM clock nonsense to fall into the modded
out of spec hardware realm. Do a reality check:

   How many production systems are out there still which run VMs on CPUs
   with a broken TSC and the lack of VM TSC scaling?

I'm not saying that we should not support the few remaining systems
anymore, but our tendency to pretend that we can keep all of this
nonsense working and at the same time making progress is just a fallacy.

I rather want to have a more fine grained differentiation and
prioritization of:

  1) The actual real world relevant use cases which run on contemporary
     hardware.

  2) Still relevant use cases on slightly older hardware with less
     capabilities

  3) Broken firmware

  4) Modded out of spec nonsense

  5) Support for ancient museums pieces

Thanks,

        tglx


^ permalink raw reply

* Re: [PATCH net 0/2] net: mana: fix error-path issues in queue setup
From: Jakub Kicinski @ 2026-06-06  1:27 UTC (permalink / raw)
  To: Aditya Garg
  Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, pabeni, horms, shradhagupta, dipayanroy, ernis, kees,
	shacharr, stephen, gargaditya, ssengar, linux-hyperv, netdev,
	linux-kernel
In-Reply-To: <20260604080137.1995269-1-gargaditya@linux.microsoft.com>

On Thu,  4 Jun 2026 01:01:24 -0700 Aditya Garg wrote:
> Two error-path fixes in MANA queue setup, both surfaced during Sashiko
> AI review of a recently upstreamed patch series.
> 
> Patch 1 initializes queue->id to INVALID_QUEUE_ID in
> mana_gd_create_mana_wq_cq() so that a CQ creation failure before the
> firmware id is assigned does not NULL gc->cq_table[0] and silently
> break whichever real CQ owns that slot. This mirrors the existing
> pattern in mana_gd_create_eq().
> 
> Patch 2 guards mana_destroy_txq()'s call to mana_destroy_wq_obj() with
> an INVALID_MANA_HANDLE check, mirroring mana_destroy_rxq(). Without
> it, TX setup failures lead to a firmware-rejected destroy of (u64)-1
> and a spurious error in dmesg.

Looks like these patches were generated against net-next, please rebase:

Applying: net: mana: initialize gdma queue id to INVALID_QUEUE_ID
Applying: net: mana: guard TX wq object destroy with INVALID_MANA_HANDLE check
error: patch failed: drivers/net/ethernet/microsoft/mana/mana_en.c:2351
error: drivers/net/ethernet/microsoft/mana/mana_en.c: patch does not apply
Patch failed at 0002 net: mana: guard TX wq object destroy with INVALID_MANA_HANDLE check

^ permalink raw reply

* Re: [PATCH net-next v12 2/6] net: mana: Query device capabilities and configure MSI-X sharing for EQs
From: sashiko-bot @ 2026-06-06  0:57 UTC (permalink / raw)
  To: Long Li; +Cc: linux-hyperv
In-Reply-To: <20260605005717.2059954-3-longli@microsoft.com>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Medium] Permanent downward clamping of maximum queues (`apc->max_queues`) across suspend/resume cycles.
--

commit ae1648776d10be5e3fd6640d34f239a82e77a213
Author: Long Li <longli@microsoft.com>
Subject: net: mana: Query device capabilities and configure MSI-X sharing for EQs

This commit adjusts the max number of queues to allow dedicated MSI-X
vectors for each vPort and configures MSI-X sharing when resources are
scarce. It also updates port initialization to clamp the queue counts
to the recomputed per-vPort maximums.

> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index ed60cc15fe78d..3ec8e94e7c17e 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
[ ... ]
> @@ -3208,6 +3213,8 @@ static int mana_init_port(struct net_device *ndev)
>  	max_queues = min_t(u32, max_txq, max_rxq);
>  	if (apc->max_queues > max_queues)
>  		apc->max_queues = max_queues;
> +	if (apc->max_queues > gc->max_num_queues_vport)
> +		apc->max_queues = gc->max_num_queues_vport;

[Severity: Medium]
Will this result in a permanent downward clamping of the maximum queues across
suspend and resume cycles?

Because gc->max_num_queues_vport is dynamically recalculated on every
suspend/resume cycle based on current hardware state (such as the number of
online CPUs and available MSI-X vectors), it might temporarily shrink.

If a system resumes with fewer vectors, this logic reduces apc->max_queues.

However, if the system subsequently resumes again with the original number of
vectors restored, apc->max_queues cannot recover because this clamping only
ever reduces the value and the original allocation limit is permanently
overwritten. 

Should we be clamping a temporary runtime variable instead of permanently
modifying the device's capability maximums here?

>  
>  	if (apc->num_queues > apc->max_queues)
>  		apc->num_queues = apc->max_queues;

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260605005717.2059954-1-longli@microsoft.com?part=2

^ permalink raw reply

* Re: [PATCH net-next v2] net: mana: Add Interrupt Moderation support
From: sashiko-bot @ 2026-06-05 23:43 UTC (permalink / raw)
  To: Haiyang Zhang; +Cc: linux-hyperv
In-Reply-To: <20260604234211.2056341-1-haiyangz@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [High] Use-after-free and workqueue corruption due to a race between NAPI polling and interface teardown.
- [High] Unconditional use of GDMA_MESSAGE_V3 breaks backward compatibility with older firmware.
- [Medium] Static interrupt moderation is silently disabled if only one parameter is configured.
--

commit ec13708996b121633f56552fac3f0ceba004bf71
Author: Haiyang Zhang <haiyangz@microsoft.com>

net: mana: Add Interrupt Moderation support

This commit adds Static and Dynamic Interrupt Moderation (DIM) support for Rx
and Tx. It updates the queue creation procedure with new data structures and
settings, adds statistics collection functions for DIM, and introduces workers
to update DIM settings. It also hooks into ethtool to allow configuration.

> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index db14357d3732e..b1e0c444f4144 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -1551,6 +1551,9 @@ int mana_create_wq_obj(struct mana_port_context *apc,
>  
>  	mana_gd_init_req_hdr(&req.hdr, MANA_CREATE_WQ_OBJ,
>  			     sizeof(req), sizeof(resp));
> +
> +	req.hdr.req.msg_version = GDMA_MESSAGE_V3;
> +	req.hdr.resp.msg_version = GDMA_MESSAGE_V2;
>  	req.vport = vport;

[Severity: High]
Will this break initialization on older firmware? We unconditionally set
msg_version to GDMA_MESSAGE_V3 here without checking if the hardware actually
supports it.

If an older Azure host without the GDMA_PF_CAP_FLAG_1_DYN_INTERRUPT_MODERATION
capability rejects the V3 request, could this cause queue creation to fail
and result in a complete loss of network connectivity?

[ ... ]
> @@ -2274,7 +2344,13 @@ static int mana_cq_handler(void *context, struct gdma_queue *gdma_queue)
>  	if (w < cq->budget) {
>  		mana_gd_ring_cq(gdma_queue, SET_ARM_BIT);
>  		cq->work_done_since_doorbell = 0;
> -		napi_complete_done(&cq->napi, w);
> +
> +		if (napi_complete_done(&cq->napi, w)) {
> +			if (cq->type == MANA_CQ_TYPE_RX)
> +				mana_update_rx_dim(cq);
> +			else
> +				mana_update_tx_dim(cq);
> +		}

[Severity: High]
Is there a race condition here during teardown? If napi_complete_done()
clears the scheduling state, a concurrent teardown thread in
mana_destroy_rxq() can return from napi_disable_locked() and proceed to call
cancel_work_sync().

Because mana_update_rx_dim() hasn't been called yet to schedule the work,
cancel_work_sync() would return immediately and the queue memory would be
freed. When the NAPI thread resumes and calls mana_update_rx_dim(), could
this lead to a use-after-free or corruption of the workqueue linked list
by scheduling work on freed memory?

[ ... ]
> @@ -2783,6 +2869,10 @@ static struct mana_rxq *mana_create_rxq(struct mana_port_context *apc,
>  	cq_spec.queue_size = cq->gdma_cq->queue_size;
>  	cq_spec.modr_ctx_id = 0;
>  	cq_spec.attached_eq = cq->gdma_cq->cq.parent->id;
> +	cq_spec.req_cq_moderation = apc->rx_dim_enabled ||
> +		(apc->intr_modr_rx_usec && apc->intr_modr_rx_comp);
> +	cq_spec.cq_moderation_usec = apc->intr_modr_rx_usec;
> +	cq_spec.cq_moderation_comp = apc->intr_modr_rx_comp;

[Severity: Medium]
Does this logic unintentionally ignore user configurations where only one
parameter is non-zero? 

In mana_set_coalesce(), ethtool allows setting rx-usecs without setting
rx-frames. If a user sets only time-based moderation (e.g., rx-usecs 50,
rx-frames 0), the condition here evaluates to false and silently disables
static moderation without returning an error via ethtool.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260604234211.2056341-1-haiyangz@linux.microsoft.com?part=1

^ permalink raw reply

* Re: [PATCH net-next] net: mana: Cache MANA_QUERY_LINK_CONFIG result to avoid repeated HWC queries
From: Jakub Kicinski @ 2026-06-05 23:13 UTC (permalink / raw)
  To: Erni Sri Satya Vennela
  Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, pabeni, kotaranov, horms, dipayanroy, kees,
	linux-hyperv, netdev, linux-kernel, linux-rdma
In-Reply-To: <aiJeuU3DLKL7JcPN@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>

On Thu, 4 Jun 2026 22:29:29 -0700 Erni Sri Satya Vennela wrote:
> I tried two netdev_lock-based variants. 
> 
> mana_query_link_cfg() has four callers:
> 
> 1 ethtool ioctl/netlink			- has RTNL	- has netdev->lock
> 2 sysfs speed_show/duplex_show		- has RTNL	- no netdev->lock
> 3 netvsc_get_link_ksettings VF forward	- has RTNL	- no netdev->lock
> 4 mana_shaper_set			- no RTNL	- has netdev->lock
> 
> No existing lock covers all four.

How fresh is your tree? The just-minted commit 9f275c2e9020 should
address the gap, I believe?

^ permalink raw reply

* [PATCH net-next] net: mana: Add support for PF device 0x00C1
From: Haiyang Zhang @ 2026-06-05 21:22 UTC (permalink / raw)
  To: linux-hyperv, netdev, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Konstantin Taranov, Simon Horman,
	Shradha Gupta, Erni Sri Satya Vennela, Dipayaan Roy, linux-kernel,
	linux-rdma
  Cc: paulros

From: Haiyang Zhang <haiyangz@microsoft.com>

Update the device id table to include the new device id 0x00C1.
This device's BAR layout is similar to VF's, update the function,
mana_gd_init_registers(), accordingly.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/gdma_main.c | 7 +++++--
 include/net/mana/gdma.h                         | 2 ++
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 712a0881d720..5bc91ee8a543 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -150,7 +150,7 @@ static int mana_gd_init_registers(struct pci_dev *pdev)
 {
 	struct gdma_context *gc = pci_get_drvdata(pdev);
 
-	if (gc->is_pf)
+	if (gc->is_pf && !gc->is_pf2)
 		return mana_gd_init_pf_regs(pdev);
 	else
 		return mana_gd_init_vf_regs(pdev);
@@ -2070,7 +2070,7 @@ static void mana_gd_cleanup_device(struct pci_dev *pdev)
 
 static bool mana_is_pf(unsigned short dev_id)
 {
-	return dev_id == MANA_PF_DEVICE_ID;
+	return dev_id == MANA_PF_DEVICE_ID || dev_id == MANA_PF2_DEVICE_ID;
 }
 
 static int mana_gd_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
@@ -2118,6 +2118,8 @@ static int mana_gd_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 
 	gc->numa_node = dev_to_node(&pdev->dev);
 	gc->is_pf = mana_is_pf(pdev->device);
+	gc->is_pf2 = (pdev->device == MANA_PF2_DEVICE_ID);
+
 	gc->bar0_va = bar0_va;
 	gc->dev = &pdev->dev;
 	xa_init(&gc->irq_contexts);
@@ -2269,6 +2271,7 @@ static void mana_gd_shutdown(struct pci_dev *pdev)
 
 static const struct pci_device_id mana_id_table[] = {
 	{ PCI_DEVICE(PCI_VENDOR_ID_MICROSOFT, MANA_PF_DEVICE_ID) },
+	{ PCI_DEVICE(PCI_VENDOR_ID_MICROSOFT, MANA_PF2_DEVICE_ID) },
 	{ PCI_DEVICE(PCI_VENDOR_ID_MICROSOFT, MANA_VF_DEVICE_ID) },
 	{ }
 };
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 70d62bc32837..7361e98d94ce 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -418,6 +418,7 @@ struct gdma_context {
 	u32			test_event_eq_id;
 
 	bool			is_pf;
+	bool			is_pf2;
 
 	phys_addr_t		bar0_pa;
 	void __iomem		*bar0_va;
@@ -571,6 +572,7 @@ struct gdma_eqe {
 #define GDMA_SRIOV_REG_CFG_BASE_OFF	0x108
 
 #define MANA_PF_DEVICE_ID 0x00B9
+#define MANA_PF2_DEVICE_ID 0x00C1
 #define MANA_VF_DEVICE_ID 0x00BA
 
 struct gdma_posted_wqe_info {
-- 
2.34.1


^ permalink raw reply related

* Re: [PATCH v4 01/47] x86/tsc: Never re-calibrate TSC frequency if its exact timing is known
From: Thomas Gleixner @ 2026-06-05 19:51 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Kiryl Shutsemau, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov, Jan Kiszka,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	John Stultz, H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley
In-Reply-To: <aiMPxl5vkvJDldi9@google.com>

On Fri, Jun 05 2026 at 11:04, Sean Christopherson wrote:
> On Fri, Jun 05, 2026, Thomas Gleixner wrote:
>> On Fri, May 29 2026 at 07:43, Sean Christopherson wrote:
>> > Don't re-calibrate the TSC frequency if the TSC is known to run at a fixed
>> > frequency.
>> 
>> That's misleading because fixed frequency means that the frequency does
>> not change, i.e. X86_FEATURE_CONSTANT_TSC is set. But
>> X86_FEATURE_CONSTANT_TSC does not imply that the frequency can be read
>> from CPUID/MSRs.
>
> Sorry, "if the TSC runs at a known, fixed frequency" would be a better way to
> phrase this.
>
>> > In practice, this is likely one big nop, as re-calibration is
>> > used only for SMP=n kernels, and only for hardware that is 20+ years old,
>> > i.e. is extremely unlikely to collide with TSC_KNOWN_FREQ.
>> 
>> recalibrate_cpu_khz() is only invoked from Intel P4 and AMD K7 CPU
>> frequency drivers, which means that's absolutely not interesting and
>> neither X86_FEATURE_CONSTANT_TSC nor X86_FEATURE_TSC_KNOWN_FREQ can be
>> set on those systems.
>
> It _shouldn't_ be set on those systems, but in the world of virtualization it's
> not completely impossible.
>
>> IOW, this patch is pointless voodoo ware.
>
> Would y'all be opposed to adding a WARN?  I don't actually care about P4 or K7
> CPUs, but without any reference to X86_FEATURE_TSC_KNOWN_FREQ in
> recalibrate_cpu_khz(), the code _looks_ wrong, and so is very confusing for
> readers that don't already know that in practice, it's limited to ancient CPUs.
>
> In other words, the point is to document expectations and mutual exclusion, not
> to "fix" anything.

Fair enough.

So yes, having a check there for actually X86_FEATURE_CONSTANT_TSC
(X86_FEATURE_CONSTANT_TSC is not interesting) and emitting a warning and
returning early is the right thing to do there.

But we also should have a check in the TSC init code somewhere which
validates that X86_FEATURE_CONSTANT_TSC is set when
X86_FEATURE_TSC_KNOWN_FREQ is set. X86_FEATURE_TSC_KNOWN_FREQ is useless
w/o X86_FEATURE_CONSTANT_TSC.

Thanks,

        tglx



^ permalink raw reply

* Re: [PATCH] hv_balloon: Simplify data output in hv_balloon_debug_show()
From: Sahil Chandna @ 2026-06-05 19:11 UTC (permalink / raw)
  To: Markus Elfring, linux-hyperv, Dexuan Cui, Haiyang Zhang,
	K. Y. Srinivasan, Long Li, Wei Liu
  Cc: LKML
In-Reply-To: <4f969d00-df24-4cde-8539-8cbe4a09f417@web.de>



On 05/06/2026 18:20, Markus Elfring wrote:
> From: Markus Elfring <elfring@users.sourceforge.net>
> Date: Fri, 5 Jun 2026 14:44:54 +0200
> 
> Move the specification for a line break from a seq_puts() call
> to a seq_printf() call.
> 
> The source code was transformed by using the Coccinelle software.
> 
> Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
> ---
>   drivers/hv/hv_balloon.c | 4 +---
>   1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
> index 9a55f5c43307..42ce27be344d 100644
> --- a/drivers/hv/hv_balloon.c
> +++ b/drivers/hv/hv_balloon.c
> @@ -1862,9 +1862,7 @@ static int hv_balloon_debug_show(struct seq_file *f, void *offset)
>   	if (hot_add_enabled())
>   		seq_puts(f, " hot_add");
>   
> -	seq_puts(f, "\n");
> -
> -	seq_printf(f, "%-22s: %u", "state", dm->state);
> +	seq_printf(f, "\n%-22s: %u", "state", dm->state);
>   	switch (dm->state) {
>   	case DM_INITIALIZING:
>   			sname = "Initializing";

Reviewed-by: Sahil Chandna <sahilchandna@linux.microsoft.com>

^ permalink raw reply

* Re: [PATCH v4 01/47] x86/tsc: Never re-calibrate TSC frequency if its exact timing is known
From: Sean Christopherson @ 2026-06-05 18:04 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Paolo Bonzini, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Kiryl Shutsemau, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov, Jan Kiszka,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	John Stultz, H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley
In-Reply-To: <87fr315fq9.ffs@fw13>

On Fri, Jun 05, 2026, Thomas Gleixner wrote:
> On Fri, May 29 2026 at 07:43, Sean Christopherson wrote:
> > Don't re-calibrate the TSC frequency if the TSC is known to run at a fixed
> > frequency.
> 
> That's misleading because fixed frequency means that the frequency does
> not change, i.e. X86_FEATURE_CONSTANT_TSC is set. But
> X86_FEATURE_CONSTANT_TSC does not imply that the frequency can be read
> from CPUID/MSRs.

Sorry, "if the TSC runs at a known, fixed frequency" would be a better way to
phrase this.

> > In practice, this is likely one big nop, as re-calibration is
> > used only for SMP=n kernels, and only for hardware that is 20+ years old,
> > i.e. is extremely unlikely to collide with TSC_KNOWN_FREQ.
> 
> recalibrate_cpu_khz() is only invoked from Intel P4 and AMD K7 CPU
> frequency drivers, which means that's absolutely not interesting and
> neither X86_FEATURE_CONSTANT_TSC nor X86_FEATURE_TSC_KNOWN_FREQ can be
> set on those systems.

It _shouldn't_ be set on those systems, but in the world of virtualization it's
not completely impossible.

> IOW, this patch is pointless voodoo ware.

Would y'all be opposed to adding a WARN?  I don't actually care about P4 or K7
CPUs, but without any reference to X86_FEATURE_TSC_KNOWN_FREQ in
recalibrate_cpu_khz(), the code _looks_ wrong, and so is very confusing for
readers that don't already know that in practice, it's limited to ancient CPUs.

In other words, the point is to document expectations and mutual exclusion, not
to "fix" anything.

^ permalink raw reply

* [PATCH] hv_balloon: Simplify data output in hv_balloon_debug_show()
From: Markus Elfring @ 2026-06-05 12:50 UTC (permalink / raw)
  To: linux-hyperv, Dexuan Cui, Haiyang Zhang, K. Y. Srinivasan,
	Long Li, Wei Liu
  Cc: LKML

From: Markus Elfring <elfring@users.sourceforge.net>
Date: Fri, 5 Jun 2026 14:44:54 +0200

Move the specification for a line break from a seq_puts() call
to a seq_printf() call.

The source code was transformed by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
---
 drivers/hv/hv_balloon.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index 9a55f5c43307..42ce27be344d 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -1862,9 +1862,7 @@ static int hv_balloon_debug_show(struct seq_file *f, void *offset)
 	if (hot_add_enabled())
 		seq_puts(f, " hot_add");
 
-	seq_puts(f, "\n");
-
-	seq_printf(f, "%-22s: %u", "state", dm->state);
+	seq_printf(f, "\n%-22s: %u", "state", dm->state);
 	switch (dm->state) {
 	case DM_INITIALIZING:
 			sname = "Initializing";
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH v4 02/47] x86/tsc: Add a standalone helpers for getting TSC info from CPUID.0x15
From: Thomas Gleixner @ 2026-06-05 12:37 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
	Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
  Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley
In-Reply-To: <20260529144435.704127-3-seanjc@google.com>

On Fri, May 29 2026 at 07:43, Sean Christopherson wrote:
>  		cpuid(CPUID_LEAF_FREQ, &eax_base_mhz, &ebx, &ecx, &edx);
> -		crystal_khz = eax_base_mhz * 1000 *
> -			eax_denominator / ebx_numerator;
> +		info.crystal_khz = eax_base_mhz * 1000 *
> +			info.denominator / info.numerator;

Please get rid of this ugly line break. You have 100 characters.


^ permalink raw reply

* Re: [PATCH v4 01/47] x86/tsc: Never re-calibrate TSC frequency if its exact timing is known
From: Thomas Gleixner @ 2026-06-05 12:33 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
	Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
  Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley
In-Reply-To: <20260529144435.704127-2-seanjc@google.com>

On Fri, May 29 2026 at 07:43, Sean Christopherson wrote:
> Don't re-calibrate the TSC frequency if the TSC is known to run at a fixed
> frequency.

That's misleading because fixed frequency means that the frequency does
not change, i.e. X86_FEATURE_CONSTANT_TSC is set. But
X86_FEATURE_CONSTANT_TSC does not imply that the frequency can be read
from CPUID/MSRs.

> In practice, this is likely one big nop, as re-calibration is
> used only for SMP=n kernels, and only for hardware that is 20+ years old,
> i.e. is extremely unlikely to collide with TSC_KNOWN_FREQ.

recalibrate_cpu_khz() is only invoked from Intel P4 and AMD K7 CPU
frequency drivers, which means that's absolutely not interesting and
neither X86_FEATURE_CONSTANT_TSC nor X86_FEATURE_TSC_KNOWN_FREQ can be
set on those systems.

IOW, this patch is pointless voodoo ware.

Thanks,

        tglx

^ permalink raw reply

* Re: [PATCH v2 3/3] ptp: vmclock: Use hw_cycles from snapshot for precise TSC pairing
From: sashiko-bot @ 2026-06-05  9:58 UTC (permalink / raw)
  To: David Woodhouse; +Cc: kvm, linux-hyperv
In-Reply-To: <20260604095755.64849-4-dwmw2@infradead.org>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] Uninitialized stack memory read / Undefined Behavior in `vmclock_get_crosststamp()`
--

commit 9639c73bd7986c7a2cf17a27c0bf79db7385ce83
Author: David Woodhouse <dwmw@amazon.co.uk>

ptp: vmclock: Use hw_cycles from snapshot for precise TSC pairing

When the system clocksource is kvmclock or Hyper-V, vmclock_get_crosststamp()
previously fell through to get_cycles(). This patch uses hw_cycles from
ktime_get_snapshot_id() when available, giving a perfect pairing for the
device time calculation.

> diff --git a/drivers/ptp/ptp_vmclock.c b/drivers/ptp/ptp_vmclock.c
> index d6a5a533164aa..eebdcd5ebc080 100644
> --- a/drivers/ptp/ptp_vmclock.c
> +++ b/drivers/ptp/ptp_vmclock.c
[ ... ]
> @@ -140,6 +140,10 @@ static int vmclock_get_crosststamp(struct vmclock_state *st,
>  			if (sts->pre_sts.cs_id == st->cs_id) {
>  				cycle = sts->pre_sts.cycles;
>  				sts->post_sts = sts->pre_sts;
> +			} else if (sts->pre_sts.hw_csid == st->cs_id &&
> +				   sts->pre_sts.hw_cycles) {

[Severity: High]
This is a pre-existing issue, but by adding these checks, does this code
evaluate uninitialized stack memory?

Looking at ptp_sys_offset_extended() in drivers/ptp/ptp_chardev.c, the
struct ptp_system_timestamp sts is allocated on the stack but not
zero-initialized.

When ptp_read_system_prets(&sts) calls ktime_get_snapshot_id(), it sets
systime_snapshot->valid = false and returns early in failure cases (e.g.,
if an invalid or unavailable AUX clock ID is passed).

In these cases, the cs_id, hw_csid, and hw_cycles fields remain uninitialized
stack garbage.

vmclock_get_crosststamp() evaluates these branch conditions without first
checking if sts->pre_sts.valid is true.

Although the sample is eventually discarded by ptp_sys_offset_extended()
because sts.pre_sts.valid is false, branching on uninitialized stack memory
is undefined behavior and can trigger KMSAN warnings.

Could a check for sts->pre_sts.valid be added before evaluating these fields?

> +				cycle = sts->pre_sts.hw_cycles;
> +				sts->post_sts = sts->pre_sts;
>  			} else {
>  				cycle = get_cycles();
>  				ptp_read_system_postts(sts);

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260604095755.64849-1-dwmw2@infradead.org?part=3

^ permalink raw reply

* Re: [PATCH v2 2/3] x86/kvmclock: Implement read_snapshot() for kvmclock clocksource
From: sashiko-bot @ 2026-06-05  9:58 UTC (permalink / raw)
  To: David Woodhouse; +Cc: kvm, linux-hyperv
In-Reply-To: <20260604095755.64849-3-dwmw2@infradead.org>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] kvm_clock_get_cycles_snapshot() unconditionally returns raw TSC cycles and bypasses the global last_value monotonicity clamp, breaking timekeeping on hosts with unstable TSCs.
--

commit 6dc8b64e567331ab0edc9710a62a9a13cd0b8811
Author: David Woodhouse <dwmw@amazon.co.uk>

x86/kvmclock: Implement read_snapshot() for kvmclock clocksource

This commit implements the read_snapshot() callback for the kvmclock
clocksource. This allows returning both the kvmclock nanosecond value
and the raw TSC value used to compute it.

> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index b5991d53fc0ea..cb3d0ca1fa225 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c

[ ... ]

> @@ -87,6 +87,27 @@ static u64 kvm_clock_get_cycles(struct clocksource *cs)
>  	return kvm_clock_read();
>  }
>  
> +static u64 kvm_clock_get_cycles_snapshot(struct clocksource *cs,
> +					 struct clocksource_hw_snapshot *chs)
> +{
> +	struct pvclock_vcpu_time_info *src;
> +	unsigned version;
> +	u64 ret, tsc;
> +
> +	preempt_disable_notrace();
> +	src = this_cpu_pvti();
> +	do {
> +		version = pvclock_read_begin(src);
> +		tsc = rdtsc_ordered();
> +		ret = __pvclock_read_cycles(src, tsc);
> +	} while (pvclock_read_retry(src, version));
> +	preempt_enable_notrace();
> +
> +	chs->hw_cycles = tsc;
> +	chs->hw_csid = CSID_X86_TSC;

[Severity: High]
Does this open-coded read loop bypass the fallback monotonic clamping used in
the standard kvm-clock read path?

The standard path (__pvclock_clocksource_read) enforces clock monotonicity
by clamping to a globally shared last_value accumulator when
PVCLOCK_TSC_STABLE_BIT is not set.

Without this check and clamping logic, if a vCPU migrates to a host with an
unsynchronized or slower TSC, could the snapshot return a derived time that
goes backwards relative to the timekeeper's cycle_last?

Also, by unconditionally returning CSID_X86_TSC and the raw hardware cycles
even when the host marks the TSC as unstable, does this incorrectly trick
cross-timestamping consumers into assuming they have a reliable, synchronized
hardware timestamp?

> +	return ret;
> +}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260604095755.64849-1-dwmw2@infradead.org?part=2

^ permalink raw reply

* Re: [PATCH v2 1/3] clocksource/hyperv: Implement read_snapshot() for TSC page clocksource
From: sashiko-bot @ 2026-06-05  9:58 UTC (permalink / raw)
  To: David Woodhouse; +Cc: kvm, linux-hyperv
In-Reply-To: <20260604095755.64849-2-dwmw2@infradead.org>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] This problem wasn't introduced by this patch, but if `memremap` fails during Hyper-V root partition initialization, the active clocksource is not unregistered and `tsc_page` becomes NULL, leading to a guaranteed kernel panic upon the next clock read.
--

commit fd7e137ffdb89db4afcdcc509ba544a534823492
Author: David Woodhouse <dwmw@amazon.co.uk>

clocksource/hyperv: Implement read_snapshot() for TSC page clocksource

This commit implements the read_snapshot() callback for the Hyper-V TSC
page clocksource. It allows the raw TSC value and the derived time to be
atomically captured, enabling ktime_get_snapshot_id() to provide the raw
TSC to consumers like KVM's master clock.

> diff --git a/drivers/clocksource/hyperv_timer.c b/drivers/clocksource/hyperv_timer.c
> index e9f5034a1bc89..df567795d1751 100644
> --- a/drivers/clocksource/hyperv_timer.c
> +++ b/drivers/clocksource/hyperv_timer.c

[ ... ]

> @@ -444,6 +444,22 @@ static u64 notrace read_hv_clock_tsc_cs(struct clocksource *arg)
>  	return read_hv_clock_tsc();
>  }
>  
> +static u64 notrace read_hv_clock_tsc_cs_snapshot(struct clocksource *arg,
> +						  struct clocksource_hw_snapshot *chs)
> +{
> +	u64 time;
> +
> +	if (hv_read_tsc_page_tsc(tsc_page, &chs->hw_cycles, &time)) {

[Severity: High]
This isn't a bug introduced by this patch, but if memremap() fails during
Hyper-V root partition initialization, tsc_page becomes NULL:

drivers/clocksource/hyperv_timer.c:hv_remap_tsc_clocksource() {
	...
	tsc_page = memremap(tsc_pfn << HV_HYP_PAGE_SHIFT, sizeof(tsc_pg),
			    MEMREMAP_WB);
	if (!tsc_page)
		pr_err("Failed to remap Hyper-V TSC page.\n");
	...
}

Since the clocksource is not unregistered and a valid fallback pointer isn't
restored on failure, will passing this NULL tsc_page to hv_read_tsc_page_tsc()
cause a kernel panic upon the next snapshot read?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260604095755.64849-1-dwmw2@infradead.org?part=1

^ permalink raw reply

* Re: [PATCH net 2/2] net: mana: guard TX wq object destroy with INVALID_MANA_HANDLE check
From: sashiko-bot @ 2026-06-05  8:02 UTC (permalink / raw)
  To: Aditya Garg; +Cc: linux-hyperv
In-Reply-To: <20260604080137.1995269-3-gargaditya@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] In `mana_create_vport()`, if `mana_create_txq()` fails, the function returns the error directly without unwinding the previously successful `mana_cfg_vport()` and `mana_pf_register_hw_vport()`. Similarly, if `mana_cfg_vport()` fails, it returns the error without unwinding `mana_pf_register_hw_vport()`. This leaks firmware vPort registrations and leaves `apc->vport_use_count > 0`, leading to a permanent `-EBUSY` error on all future attempts to open the device.
--

commit 22337dc3f6a86510c8d8b1e096424ad2f6000592
Author: Aditya Garg <gargaditya@linux.microsoft.com>

net: mana: guard TX wq object destroy with INVALID_MANA_HANDLE check

This commit fixes an issue where mana_destroy_txq() attempts to destroy a
WQ object using INVALID_MANA_HANDLE if allocation fails earlier. It adds
a check to verify the handle is valid before calling mana_destroy_wq_obj().

> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index db14357d3732e..4a2cd800111e7 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -2351,7 +2351,8 @@ static void mana_destroy_txq(struct mana_port_context *apc)
>  			netif_napi_del_locked(napi);
>  			apc->tx_qp[i]->txq.napi_initialized = false;
>  		}
> -		mana_destroy_wq_obj(apc, GDMA_SQ, apc->tx_qp[i]->tx_object);
> +		if (apc->tx_qp[i]->tx_object != INVALID_MANA_HANDLE)
> +			mana_destroy_wq_obj(apc, GDMA_SQ, apc->tx_qp[i]->tx_object);
>  
>  		mana_deinit_cq(apc, &apc->tx_qp[i]->tx_cq);
>  

[Severity: High]
This is a pre-existing issue, but while reviewing the error paths that lead
to mana_destroy_txq(), I noticed a potential resource leak in
mana_create_vport().

If mana_create_txq() fails, the function returns the error directly without
unwinding the previously successful mana_cfg_vport() and
mana_pf_register_hw_vport():

drivers/net/ethernet/microsoft/mana/mana_en.c:mana_create_vport() {
    ...
	err = mana_cfg_vport(apc, gd->pdid, gd->doorbell);
	if (err)
		return err;

	return mana_create_txq(apc, net);
}

The same applies if mana_cfg_vport() fails, as it returns the error without
unwinding mana_pf_register_hw_vport().

Because mana_alloc_queues() propagates this error without calling
mana_destroy_vport() or unwinding the partial initialization:

drivers/net/ethernet/microsoft/mana/mana_en.c:mana_alloc_queues() {
    ...
	err = mana_create_vport(apc, ndev);
	if (err) {
		netdev_err(ndev, "Failed to create vPort %u : %d\n", apc->port_idx, err);
		return err;
	}
    ...
}

Does this leak firmware vPort registrations and leave apc->vport_use_count > 0,
leading to a permanent -EBUSY error on all future attempts to open the device?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260604080137.1995269-1-gargaditya@linux.microsoft.com?part=2

^ permalink raw reply

* Re: [PATCH net-next] net: mana: Cache MANA_QUERY_LINK_CONFIG result to avoid repeated HWC queries
From: Erni Sri Satya Vennela @ 2026-06-05  5:29 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, pabeni, kotaranov, horms, dipayanroy, kees,
	linux-hyperv, netdev, linux-kernel, linux-rdma
In-Reply-To: <20260602132127.25fc27ee@kernel.org>

On Tue, Jun 02, 2026 at 01:21:27PM -0700, Jakub Kicinski wrote:
> On Thu, 28 May 2026 11:07:51 -0700 Erni Sri Satya Vennela wrote:
> > mana_query_link_cfg() sends an HWC command to firmware on every call,
> > but the link speed and QoS values it returns only change when the
> > driver explicitly calls mana_set_bw_clamp(). This function is called
> > not only by userspace via ethtool get_link_ksettings, but also
> > periodically by hv_netvsc through netvsc_get_link_ksettings and by
> > the sysfs speed_show attribute via dev_attr_show, resulting in
> > unnecessary HWC traffic every few minutes.
> 
> mana is ops-locked, right? Because you support net shapers
> 
> Could you instead take the netdev_lock() in the work?
> It's already held around the user space originated calls.

Hi Jakub,

I tried two netdev_lock-based variants. 

mana_query_link_cfg() has four callers:

1 ethtool ioctl/netlink			- has RTNL	- has netdev->lock
2 sysfs speed_show/duplex_show		- has RTNL	- no netdev->lock
3 netvsc_get_link_ksettings VF forward	- has RTNL	- no netdev->lock
4 mana_shaper_set			- no RTNL	- has netdev->lock

No existing lock covers all four.

A. netdev_assert_locked() in the mana_query_link_cfg() :
Lockdep WARN on every sysfs cat /sys/class/net/eth*/speed and every
periodic netvsc_get_link_ksettings() poll since callers 2 and 3 hold
RTNL only.
A slow firmware reply on callers 2/3 can land after mana_shaper_set
(caller 4) has changed the rate and invalidated the cache,
publishing a stale apc->speed / apc->max_speed as "cached valid". 
Because the value is cached, the staleness then persists until the next
shaper change

B. ASSERT_RTNL() + netdev_lock_ops() inside mana_query_link_cfg():
Self-deadlocks on #1 (__dev_ethtool already holds it) and #4
(mana_shaper_set already holds it and calls mana_query_link_cfg() before
the clamp).
ASSERT_RTNL() also WARNs from #4 — shaper genl ops don't take RTNL.

Eg. Deadlock scenario:
__dev_ethtool()
  netdev_lock_ops(dev)              ← held
    ops->get_link_ksettings()
      mana_get_link_ksettings()
        mana_query_link_cfg()
          netdev_lock_ops(ndev)     ← DEADLOCK

Can we consider private link_cfg_mutex which is orthogonal to RTNL and
netdev->lock and covers all four callers?

Thanks,
Vennela

^ permalink raw reply

* RE: [EXTERNAL] Re: [PATCH net-next v11 0/6] net: mana: Per-vPort EQ and MSI-X management
From: Long Li @ 2026-06-05  0:58 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Konstantin Taranov, David S . Miller, Paolo Abeni, Eric Dumazet,
	Andrew Lunn, Jason Gunthorpe, Leon Romanovsky, Haiyang Zhang,
	KY Srinivasan, Wei Liu, Dexuan Cui,
	shradhagupta@linux.microsoft.com, Simon Horman,
	netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20260602173607.34063034@kernel.org>

> On Tue, 2 Jun 2026 22:48:05 +0000 Long Li wrote:
> > Changes in patch 6 from v2 to v11:
> >
> >  - Error handling updated from NULL/-ENOMEM to IS_ERR()/PTR_ERR() for
> > mana_gd_get_gic() return values
> >  - Added mdev->eqs[i]->eq.irq = gic->irq to populate the irq field on
> > all RDMA EQs for consistency with the Ethernet path
> >  - Introduced a separate msi variable instead of modifying
> > spec.eq.msix_index directly
> >  - Commit message updated
> >
> > The gdma.h changes are identical to v2.
> 
> Hm, yes, Leon seems to be AFK since May 19th.
> Please repost with his tag included, the list of changes you provided does seem
> immaterial. I don't want to merge v11 as is, there's a good chance people marked
> this thread as ignored by now.

I have sent v12 with Leon's tag on patch 6.

Thanks,
Long

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox