* [PATCH v3 1/4] scsi: scan: allocate sdev and starget on the NUMA node of the host adapter
From: Sumit Saxena @ 2026-06-09 12:18 UTC
To: Martin K . Petersen, Jens Axboe
Cc: James E . J . Bottomley, linux-scsi, linux-block, Adam Radford,
Khalid Aziz, Adaptec OEM Raid Solutions, Matthew Wilcox,
Hannes Reinecke, Juergen E . Fischer, Russell King,
linux-arm-kernel, Finn Thain, Michael Schmitz, Anil Gurumurthy,
Sudarsana Kalluru, Oliver Neukum, Ali Akcaagac, Jamie Lenehan,
Ram Vegesna, target-devel, Bradley Grove, Satish Kharat,
Sesidhar Baddela, Karan Tilak Kumar, Yihang Li, Don Brace,
storagedev, HighPoint Linux Team, Tyrel Datwyler,
Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, linuxppc-dev, Brian King, Lee Duncan,
Chris Leech, Mike Christie, open-iscsi, Justin Tee, Paul Ely,
Kashyap Desai, Shivasharan S, Chandrakanth Patil,
megaraidlinux.pdl, Sathya Prakash Veerichetty, Sreekanth Reddy,
mpi3mr-linuxdrv.pdl, Suganath Prabu Subramani, Ranjan Kumar,
MPT-FusionLinux.pdl, Daniel Palmer, GOTO Masanori, YOKOTA Hiroshi,
Jack Wang, Geoff Levand, Michael Reed, Nilesh Javali,
GR-QLogic-Storage-Upstream, Narsimhulu Musini, K . Y . Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, linux-hyperv,
Michael S . Tsirkin, Jason Wang, Paolo Bonzini, Stefan Hajnoczi,
Eugenio Perez, virtualization, Vishal Bhakta,
bcm-kernel-feedback-list, Juergen Gross, Stefano Stabellini,
Oleksandr Tyshchenko, xen-devel, James Rizzo, Sumit Saxena
In-Reply-To: <20260609121806.2121755-1-sumit.saxena@broadcom.com>
From: James Rizzo <james.rizzo@broadcom.com>
When a host adapter is attached to a specific NUMA node, allocating
scsi_device and scsi_target via kzalloc() may place them on a remote
node. All hot-path I/O accesses to these structures then cross the NUMA
interconnect, adding latency and consuming inter-node bandwidth.
Use kzalloc_node() with dev_to_node(shost->dma_dev) so allocations land
on the same node as the HBA, reducing cross-node traffic and improving
I/O performance on NUMA systems.
Signed-off-by: James Rizzo <james.rizzo@broadcom.com>
Signed-off-by: Sumit Saxena <sumit.saxena@broadcom.com>
---
drivers/scsi/scsi_scan.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index e27da038603a..121a14d5fdb8 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -34,6 +34,7 @@
#include <linux/kthread.h>
#include <linux/spinlock.h>
#include <linux/async.h>
+#include <linux/topology.h>
#include <linux/slab.h>
#include <linux/unaligned.h>
@@ -287,8 +288,8 @@ static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
struct Scsi_Host *shost = dev_to_shost(starget->dev.parent);
struct queue_limits lim;
- sdev = kzalloc(sizeof(*sdev) + shost->transportt->device_size,
- GFP_KERNEL);
+ sdev = kzalloc_node(sizeof(*sdev) + shost->transportt->device_size,
+ GFP_KERNEL, dev_to_node(shost->dma_dev));
if (!sdev)
goto out;
@@ -502,7 +503,7 @@ static struct scsi_target *scsi_alloc_target(struct device *parent,
struct scsi_target *found_target;
int error, ref_got;
- starget = kzalloc(size, GFP_KERNEL);
+ starget = kzalloc_node(size, GFP_KERNEL, dev_to_node(shost->dma_dev));
if (!starget) {
printk(KERN_ERR "%s: allocation failure\n", __func__);
return NULL;
--
2.43.7
* [PATCH v3 0/4] scsi/block: NUMA-local scan allocations, shared-tag path cleanup, and SCSI I/O counters
From: Sumit Saxena @ 2026-06-09 12:17 UTC
To: Martin K . Petersen, Jens Axboe
Cc: James E . J . Bottomley, linux-scsi, linux-block, Adam Radford,
Khalid Aziz, Adaptec OEM Raid Solutions, Matthew Wilcox,
Hannes Reinecke, Juergen E . Fischer, Russell King,
linux-arm-kernel, Finn Thain, Michael Schmitz, Anil Gurumurthy,
Sudarsana Kalluru, Oliver Neukum, Ali Akcaagac, Jamie Lenehan,
Ram Vegesna, target-devel, Bradley Grove, Satish Kharat,
Sesidhar Baddela, Karan Tilak Kumar, Yihang Li, Don Brace,
storagedev, HighPoint Linux Team, Tyrel Datwyler,
Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, linuxppc-dev, Brian King, Lee Duncan,
Chris Leech, Mike Christie, open-iscsi, Justin Tee, Paul Ely,
Kashyap Desai, Shivasharan S, Chandrakanth Patil,
megaraidlinux.pdl, Sathya Prakash Veerichetty, Sreekanth Reddy,
mpi3mr-linuxdrv.pdl, Suganath Prabu Subramani, Ranjan Kumar,
MPT-FusionLinux.pdl, Daniel Palmer, GOTO Masanori, YOKOTA Hiroshi,
Jack Wang, Geoff Levand, Michael Reed, Nilesh Javali,
GR-QLogic-Storage-Upstream, Narsimhulu Musini, K . Y . Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, linux-hyperv,
Michael S . Tsirkin, Jason Wang, Paolo Bonzini, Stefan Hajnoczi,
Eugenio Perez, virtualization, Vishal Bhakta,
bcm-kernel-feedback-list, Juergen Gross, Stefano Stabellini,
Oleksandr Tyshchenko, xen-devel, Sumit Saxena
This series contains three performance improvements targeting the SCSI
and block layers on multi-socket NUMA and heavily loaded SMP systems.
On multi-socket NUMA systems we observed extreme I/O throughput variance
of 50-60% between runs. This series identifies and fixes two root causes:
cross-node memory accesses due to NUMA-unaware allocations in the scan
path, and false sharing between hot atomic counters in struct request_queue
and struct scsi_device.
Performance notes:
Tested on a dual-socket NUMA system (2x 32-core, 256 GB/socket) with
an mpi3mr HBA, running fio (random read, 4K, QD 64, 16 jobs, 60 s,
direct I/O). IOPS figures are in KIOPS (thousands of IOPS):
Configuration            Avg KIOPS   Range (KIOPS)   Spread
Baseline                 6,255       4,200 - 6,700   ~37%
Baseline + all patches   7,350       7,000 - 7,700   ~10%
Key findings:
Combined, these patches reduce the observed 50-60% run-to-run variance
to under 10%, significantly improving workload predictability, and
improve IOPS by 16-18%.
No functional regressions observed.
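As a sanity check on the figures above, the sketch below recomputes the
average gain and the per-configuration spread from the table. The spread
formula, (max - min) / max, is an assumption: it reproduces the ~37%
baseline figure, but the cover letter does not state how spread was
calculated. The helper names are hypothetical.

```c
#include <stdio.h>

/* Relative gain of 'after' over 'before', in percent. */
double gain_pct(double before, double after)
{
	return (after - before) / before * 100.0;
}

/*
 * Run-to-run spread in percent.
 * ASSUMPTION: spread = (max - min) / max; the cover letter does not
 * define the formula.
 */
double spread_pct(double min, double max)
{
	return (max - min) / max * 100.0;
}

/* Print the figures derived from the table (all inputs in KIOPS). */
void print_summary(void)
{
	printf("avg gain:        %.1f%%\n", gain_pct(6255.0, 7350.0));
	printf("baseline spread: %.1f%%\n", spread_pct(4200.0, 6700.0));
	printf("patched spread:  %.1f%%\n", spread_pct(7000.0, 7700.0));
}
```

Under that assumption the average gain works out to about 17.5%, which
falls inside the quoted 16-18% range.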
Changes in v3
-------------
- Handled feedback from Bart Van Assche and John Garry.
- Added a patch for shost local NUMA allocation.
- Converted ioerr_cnt and iotmo_cnt atomic counters into per-cpu counters.
Changes in v2
--------------
Patch 1 — Same functional goal as v1 patch 1: NUMA-local scsi_device /
scsi_target allocations in the scan path so steady-state I/O does not
habitually touch remote memory when the host has a fixed DMA/NUMA
affinity.
Patch 2 — Replaces v1’s ____cacheline_aligned_in_smp on
nr_active_requests_shared_tags with removal of the shared-tag fairness
throttling machinery (including hctx_may_queue(), blk_mq_hw_ctx.nr_active,
and request_queue.nr_active_requests_shared_tags and their updates).
This follows the earlier standalone proposal by Bart Van Assche [1],
rebased for the current tree; it removes the high-frequency atomic
accounting that motivated the v1 false-sharing workaround and, in our
testing, improves IOPS on the order of roughly 16–18% for the shared-tag
workload exercised.
Patch 3 — Replaces v1’s cache-line padding of iodone_cnt with
percpu_counter for both iorequest_cnt and iodone_cnt, so submission and
completion paths mostly update CPU-local state instead of bouncing a
single cache line, without inflating struct scsi_device for SMP
alignment.
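The idea behind the counter conversion can be illustrated with a small
userspace model. This is a sketch only, not the kernel's percpu_counter
API: the shard count, the cache-line size, and all names are
hypothetical. Unlike the kernel's percpu_counter, which keeps its per-CPU
data outside the owning structure, this model pads the slots inline for
simplicity.

```c
#include <stdatomic.h>
#include <stddef.h>

#define NR_SHARDS  8	/* hypothetical stand-in for the number of CPUs */
#define CACHE_LINE 64	/* assumed cache-line size in bytes */

/* One counter slot per shard, aligned so slots never share a cache line. */
struct shard {
	_Alignas(CACHE_LINE) atomic_long count;
};

struct sharded_counter {
	struct shard shards[NR_SHARDS];
};

/* Hot path: update only the caller's own slot. */
void sharded_counter_inc(struct sharded_counter *c, unsigned int shard_id)
{
	atomic_fetch_add_explicit(&c->shards[shard_id % NR_SHARDS].count, 1,
				  memory_order_relaxed);
}

/* Slow path (for example a statistics read): sum every slot. */
long sharded_counter_sum(struct sharded_counter *c)
{
	long sum = 0;

	for (size_t i = 0; i < NR_SHARDS; i++)
		sum += atomic_load_explicit(&c->shards[i].count,
					    memory_order_relaxed);
	return sum;
}
```

Writers on different shards never contend for the same cache line; the
cost moves to the reader, which has to visit every slot.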
Merge / review hints
--------------------
Patch 3 touches the block layer and should have block maintainer review;
the rest of the patches are SCSI-oriented. Please route or Ack as your
subsystem workflow requires.
Bart Van Assche (1):
block: drop shared-tag fairness throttling
James Rizzo (1):
scsi: scan: allocate sdev and starget on the NUMA node of the host
adapter
Sumit Saxena (2):
scsi: host: allocate struct Scsi_Host on the NUMA node of the host
adapter
scsi: use percpu counters for iostat counters in struct scsi_device
block/blk-core.c | 2 -
block/blk-mq-debugfs.c | 22 ++++-
block/blk-mq-tag.c | 4 -
block/blk-mq.c | 17 +---
block/blk-mq.h | 100 ----------------------
drivers/scsi/3w-9xxx.c | 2 +-
drivers/scsi/3w-sas.c | 2 +-
drivers/scsi/3w-xxxx.c | 2 +-
drivers/scsi/53c700.c | 2 +-
drivers/scsi/BusLogic.c | 2 +-
drivers/scsi/a100u2w.c | 2 +-
drivers/scsi/a2091.c | 2 +-
drivers/scsi/a3000.c | 2 +-
drivers/scsi/aacraid/linit.c | 2 +-
drivers/scsi/advansys.c | 6 +-
drivers/scsi/aha152x.c | 2 +-
drivers/scsi/aha1542.c | 2 +-
drivers/scsi/aha1740.c | 2 +-
drivers/scsi/aic7xxx/aic79xx_osm.c | 2 +-
drivers/scsi/aic7xxx/aic7xxx_osm.c | 2 +-
drivers/scsi/aic94xx/aic94xx_init.c | 2 +-
drivers/scsi/am53c974.c | 2 +-
drivers/scsi/arcmsr/arcmsr_hba.c | 3 +-
drivers/scsi/arm/acornscsi.c | 2 +-
drivers/scsi/arm/arxescsi.c | 2 +-
drivers/scsi/arm/cumana_1.c | 2 +-
drivers/scsi/arm/cumana_2.c | 2 +-
drivers/scsi/arm/eesox.c | 2 +-
drivers/scsi/arm/oak.c | 2 +-
drivers/scsi/arm/powertec.c | 2 +-
drivers/scsi/atari_scsi.c | 2 +-
drivers/scsi/atp870u.c | 2 +-
drivers/scsi/bfa/bfad_im.c | 2 +-
drivers/scsi/csiostor/csio_init.c | 4 +-
drivers/scsi/dc395x.c | 2 +-
drivers/scsi/dmx3191d.c | 2 +-
drivers/scsi/elx/efct/efct_xport.c | 4 +-
drivers/scsi/esas2r/esas2r_main.c | 2 +-
drivers/scsi/fdomain.c | 2 +-
drivers/scsi/fnic/fnic_main.c | 2 +-
drivers/scsi/g_NCR5380.c | 2 +-
drivers/scsi/gvp11.c | 2 +-
drivers/scsi/hisi_sas/hisi_sas_main.c | 2 +-
drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 2 +-
drivers/scsi/hosts.c | 6 +-
drivers/scsi/hpsa.c | 2 +-
drivers/scsi/hptiop.c | 2 +-
drivers/scsi/ibmvscsi/ibmvfc.c | 2 +-
drivers/scsi/ibmvscsi/ibmvscsi.c | 2 +-
drivers/scsi/imm.c | 2 +-
drivers/scsi/initio.c | 2 +-
drivers/scsi/ipr.c | 2 +-
drivers/scsi/ips.c | 2 +-
drivers/scsi/isci/init.c | 2 +-
drivers/scsi/jazz_esp.c | 2 +-
drivers/scsi/libiscsi.c | 2 +-
drivers/scsi/lpfc/lpfc_init.c | 2 +-
drivers/scsi/mac53c94.c | 2 +-
drivers/scsi/mac_esp.c | 2 +-
drivers/scsi/mac_scsi.c | 2 +-
drivers/scsi/megaraid.c | 2 +-
drivers/scsi/megaraid/megaraid_mbox.c | 2 +-
drivers/scsi/megaraid/megaraid_sas_base.c | 2 +-
drivers/scsi/mesh.c | 2 +-
drivers/scsi/mpi3mr/mpi3mr_os.c | 2 +-
drivers/scsi/mpt3sas/mpt3sas_scsih.c | 4 +-
drivers/scsi/mvme147.c | 2 +-
drivers/scsi/mvsas/mv_init.c | 2 +-
drivers/scsi/mvumi.c | 2 +-
drivers/scsi/myrb.c | 2 +-
drivers/scsi/myrs.c | 2 +-
drivers/scsi/ncr53c8xx.c | 2 +-
drivers/scsi/nsp32.c | 2 +-
drivers/scsi/pcmcia/nsp_cs.c | 2 +-
drivers/scsi/pcmcia/qlogic_stub.c | 2 +-
drivers/scsi/pcmcia/sym53c500_cs.c | 2 +-
drivers/scsi/pm8001/pm8001_init.c | 2 +-
drivers/scsi/pmcraid.c | 2 +-
drivers/scsi/ppa.c | 2 +-
drivers/scsi/ps3rom.c | 2 +-
drivers/scsi/qla1280.c | 2 +-
drivers/scsi/qla2xxx/qla_mid.c | 2 +-
drivers/scsi/qla2xxx/qla_os.c | 2 +-
drivers/scsi/qlogicfas.c | 2 +-
drivers/scsi/qlogicpti.c | 2 +-
drivers/scsi/scsi_debug.c | 2 +-
drivers/scsi/scsi_error.c | 4 +-
drivers/scsi/scsi_lib.c | 10 +--
drivers/scsi/scsi_scan.c | 15 +++-
drivers/scsi/scsi_sysfs.c | 23 +++--
drivers/scsi/sd.c | 2 +-
drivers/scsi/sgiwd93.c | 2 +-
drivers/scsi/smartpqi/smartpqi_init.c | 2 +-
drivers/scsi/snic/snic_main.c | 2 +-
drivers/scsi/stex.c | 2 +-
drivers/scsi/storvsc_drv.c | 2 +-
drivers/scsi/sun3_scsi.c | 2 +-
drivers/scsi/sun3x_esp.c | 2 +-
drivers/scsi/sun_esp.c | 2 +-
drivers/scsi/sym53c8xx_2/sym_glue.c | 2 +-
drivers/scsi/virtio_scsi.c | 2 +-
drivers/scsi/vmw_pvscsi.c | 2 +-
drivers/scsi/wd719x.c | 2 +-
drivers/scsi/xen-scsifront.c | 2 +-
drivers/scsi/zorro_esp.c | 2 +-
include/linux/blk-mq.h | 6 --
include/linux/blkdev.h | 2 -
include/scsi/libfc.h | 2 +-
include/scsi/scsi_device.h | 9 +-
include/scsi/scsi_host.h | 3 +-
110 files changed, 168 insertions(+), 258 deletions(-)
--
2.43.7
* Re: [PATCH net v3] hv_netvsc: use kmap_local_page in netvsc_copy_to_send_buf
From: patchwork-bot+netdevbpf @ 2026-06-09 11:20 UTC
To: Anton Leontev
Cc: netdev, linux-hyperv, haiyangz, kys, wei.liu, decui, longli, kuba,
pabeni, edumazet, davem, stable, linux-kernel
In-Reply-To: <20260604165938.32033-1-leontyevantony@gmail.com>
Hello:
This patch was applied to netdev/net.git (main)
by Paolo Abeni <pabeni@redhat.com>:
On Thu, 4 Jun 2026 19:59:38 +0300 you wrote:
> netvsc_copy_to_send_buf() copies page buffer entries into the VMBus
> send buffer using phys_to_virt() on the entry PFN. Entries for the
> RNDIS header and the skb linear data come from kmalloc'd memory and
> are always in the kernel direct map, but entries for skb fragments
> reference page cache or user pages, which on 32-bit x86 with
> CONFIG_HIGHMEM=y can live above the LOWMEM boundary. For such a page
> phys_to_virt() returns an address outside the direct map and the
> subsequent memcpy() faults on the transmit softirq path, which is
> fatal.
>
> [...]
Here is the summary with links:
- [net,v3] hv_netvsc: use kmap_local_page in netvsc_copy_to_send_buf
https://git.kernel.org/netdev/net/c/004e9ecfe6c5
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
* Re: [PATCH net v2 2/2] net: mana: guard TX wq object destroy with INVALID_MANA_HANDLE check
From: sashiko-bot @ 2026-06-09 10:14 UTC
To: Aditya Garg; +Cc: linux-hyperv
In-Reply-To: <20260608101345.2267320-3-gargaditya@linux.microsoft.com>
Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
Pre-existing issues:
- [High] The patch prevents destroying uninitialized TX WQ objects that hold the `INVALID_MANA_HANDLE` sentinel, but bypasses unreached elements in the zero-initialized array, causing handle `0` to be erroneously passed to firmware.
- [High] Hardware vport registrations and configurations are not cleanly reverted if subsequent queue allocations fail inside `mana_create_vport()`, permanently breaking the network interface.
--
commit 27abeb920ad813c7f2d581fe8f249fdc9e94666a
Author: Aditya Garg <gargaditya@linux.microsoft.com>
net: mana: guard TX wq object destroy with INVALID_MANA_HANDLE check
This commit adds a check in mana_destroy_txq() to prevent destroying
uninitialized TX WQ objects that hold the INVALID_MANA_HANDLE sentinel. This
avoids passing invalid handles to the firmware, which previously caused
errors and logs. The change mirrors existing RX-side cleanup patterns.
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index c9b1df1ed1098..d7de4c4d25bbe 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -2334,7 +2334,8 @@ static void mana_destroy_txq(struct mana_port_context *apc)
> netif_napi_del_locked(napi);
> apc->tx_qp[i].txq.napi_initialized = false;
> }
> - mana_destroy_wq_obj(apc, GDMA_SQ, apc->tx_qp[i].tx_object);
> + if (apc->tx_qp[i].tx_object != INVALID_MANA_HANDLE)
> + mana_destroy_wq_obj(apc, GDMA_SQ, apc->tx_qp[i].tx_object);
[Severity: High]
This is a pre-existing issue, but does this check safely handle all
uninitialized tx_object values?
In mana_create_txq(), apc->tx_qp is allocated using kzalloc_objs(), which
zero-initializes the array:
mana_create_txq() {
apc->tx_qp = kzalloc_objs(struct mana_tx_qp, apc->num_queues);
if (!apc->tx_qp)
return -ENOMEM;
for (i = 0; i < apc->num_queues; i++) {
apc->tx_qp[i].tx_object = INVALID_MANA_HANDLE;
...
}
If mana_create_txq() fails during the loop, the unvisited elements
(where j > i) will still have their zero-initialized values, meaning
tx_qp[j].tx_object == 0.
Since mana_destroy_txq() iterates over all apc->num_queues, and 0 is not
INVALID_MANA_HANDLE, the condition apc->tx_qp[i].tx_object !=
INVALID_MANA_HANDLE will be true for these unvisited queues.
Could this lead to erroneously calling mana_destroy_wq_obj() with handle 0,
potentially destroying an active WQ object used by another queue or port?
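One way to close the gap described above is to mark every slot invalid
before any step that can fail. The following is a userspace model of that
idea, not the driver's code: the struct, the sentinel value, and the
function names are hypothetical, and whether the driver should be fixed
this way is an assumption.

```c
#include <stdint.h>
#include <stdlib.h>

#define INVALID_HANDLE UINT64_MAX	/* hypothetical sentinel for this model */

struct txq {
	uint64_t handle;
};

/*
 * Allocate 'n' queues and mark every slot invalid up front, so teardown
 * never mistakes a zero-filled slot for a real handle.
 */
struct txq *txq_array_alloc(size_t n)
{
	struct txq *q = calloc(n, sizeof(*q));

	if (!q)
		return NULL;
	for (size_t i = 0; i < n; i++)
		q[i].handle = INVALID_HANDLE;
	return q;
}

/* Count the slots that teardown would pass to the destroy routine. */
size_t txq_array_count_destroyable(const struct txq *q, size_t n)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (q[i].handle != INVALID_HANDLE)
			count++;
	return count;
}
```

With this layout a partially initialized array only ever exposes handles
that were actually assigned.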
>
> mana_deinit_cq(apc, &apc->tx_qp[i].tx_cq);
>
[Severity: High]
This isn't a bug introduced by this patch, but is there a resource leak in
mana_create_vport() if mana_create_txq() fails?
mana_create_vport() {
...
err = mana_pf_register_hw_vport(apc);
if (err)
return err;
err = mana_cfg_vport(apc, gd->pdid, gd->doorbell);
if (err)
return err;
return mana_create_txq(apc, net);
}
If mana_create_txq() fails, it returns the error directly to
mana_alloc_queues() without jumping to a destroy_vport label. This bypasses
mana_destroy_vport() and skips cleaning up the hardware vport registrations.
Because the interface fails to open, apc->port_is_up remains false. Future
teardown attempts will skip mana_dealloc_queues(), leaving vport_use_count
permanently incremented.
Would this prevent the interface from ever being brought up again, as
mana_cfg_vport() enforces apc->vport_use_count == 0, requiring a driver
reload to recover?
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260608101345.2267320-1-gargaditya@linux.microsoft.com?part=2
* Re: [PATCH net v2 1/2] net: mana: initialize gdma queue id to INVALID_QUEUE_ID
From: sashiko-bot @ 2026-06-09 10:14 UTC
To: Aditya Garg; +Cc: linux-hyperv
In-Reply-To: <20260608101345.2267320-2-gargaditya@linux.microsoft.com>
Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
Pre-existing issues:
- [High] The mana_gd_create_mana_eq function leaks a firmware DMA region object on its error path if the EQ hardware creation fails.
--
commit 4c25a3c164c64bd1c7eed40e10b6c6526c99eb4c
Author: Aditya Garg <gargaditya@linux.microsoft.com>
net: mana: initialize gdma queue id to INVALID_QUEUE_ID
mana_gd_create_mana_wq_cq() leaves queue->id as 0 (from kzalloc_obj())
until mana_create_wq_obj() assigns the firmware-returned id. If creation
fails before that, cleanup calls mana_gd_destroy_cq() with id 0, NULLing
gc->cq_table[0] and silently breaking whichever real CQ owns that slot.
Initialize queue->id to INVALID_QUEUE_ID right after allocation, matching
mana_gd_create_eq(). The existing (id >= max_num_cqs) guard then
short-circuits cleanly.
> diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> index d8e816882f02c..ac71ca8450bf4 100644
> --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> @@ -1192,6 +1192,8 @@ int mana_gd_create_mana_wq_cq(struct gdma_dev *gd,
> if (!queue)
> return -ENOMEM;
>
> + queue->id = INVALID_QUEUE_ID;
> +
> gmi = &queue->mem_info;
> err = mana_gd_alloc_memory(gc, spec->queue_size, gmi);
> if (err) {
[Severity: High]
This is a pre-existing issue, but I noticed mana_gd_create_mana_eq() leaks a
firmware DMA region object on its error path if the EQ hardware creation fails.
In mana_gd_create_mana_eq(), mana_gd_create_dma_region() registers a DMA region
with the hardware and returns a valid handle. If the subsequent call to
mana_gd_create_eq() fails, execution jumps to the out label:
drivers/net/ethernet/microsoft/mana/gdma_main.c:mana_gd_create_mana_eq() {
...
err = mana_gd_create_eq(gd, spec, true, queue);
if (err)
goto out;
...
out:
dev_err(gc->dev, "Failed to create queue type %d of size: %u, err: %d\n",
spec->type, spec->queue_size, err);
mana_gd_free_memory(gmi);
free_q:
kfree(queue);
return err;
}
Does this code need to call mana_gd_destroy_dma_region(gc,
gmi->dma_region_handle) in the error path?
Without it, does this leak the hardware DMA region resource and leave freed
physical pages mapped in the hardware, potentially creating a hardware
use-after-free scenario?
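The unwinding order the question implies can be sketched with a userspace
model. Everything here is hypothetical (the resource tracker, the helper
names, the simulated failure); it only illustrates releasing resources in
reverse order of acquisition and makes no claim about the actual fix.

```c
#include <stdbool.h>

/* Hypothetical tracker standing in for queue memory and a DMA region. */
struct res_state {
	bool mem_allocated;
	bool region_registered;
};

static int alloc_mem(struct res_state *s)
{
	s->mem_allocated = true;
	return 0;
}

static void free_mem(struct res_state *s)
{
	s->mem_allocated = false;
}

static int register_region(struct res_state *s)
{
	s->region_registered = true;
	return 0;
}

static void destroy_region(struct res_state *s)
{
	s->region_registered = false;
}

/* Create a queue; 'hw_create_err' simulates the final hardware step failing. */
int create_queue(struct res_state *s, int hw_create_err)
{
	int err;

	err = alloc_mem(s);
	if (err)
		return err;

	err = register_region(s);
	if (err)
		goto err_mem;

	err = hw_create_err;
	if (err)
		goto err_region;

	return 0;

err_region:
	destroy_region(s);	/* the step the report says is skipped */
err_mem:
	free_mem(s);
	return err;
}
```

Each label undoes exactly one acquisition, so a failure at any step
leaves nothing registered or allocated.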
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260608101345.2267320-1-gargaditya@linux.microsoft.com?part=1
* Re: [PATCH v4 10/47] x86/tsc: Consolidate forcing of X86_FEATURE_TSC_KNOWN_FREQ for PV code
From: Thomas Gleixner @ 2026-06-09 7:48 UTC
To: Sean Christopherson, David Woodhouse
Cc: Paolo Bonzini, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kiryl Shutsemau, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov, Jan Kiszka,
Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
John Stultz, H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, Tom Lendacky, Nikunj A Dadhania,
Michael Kelley
In-Reply-To: <aidEfvTMjLa2zt43@google.com>
On Mon, Jun 08 2026 at 15:38, Sean Christopherson wrote:
> On Sat, Jun 06, 2026, David Woodhouse wrote:
>> > Along with:
>> >
>> > if (!hypervisor_is_type(X86_HYPER_NATIVE)) {
>> > if (tsc_khz_early)
>> > pr_warn("Ignoring non-sensical tsc_early_khz command line argument\n");
>> >
>> > or something daft like that.
>
> Ya, I ended up in the same place once Sashiko pointed out that skipping the SNP/TDX
> setup was hazardous[*], and also once I realized that tsc_khz_early *complemented*
> the refinement instead of replacing it.
>
> This is what I have locally:
>
> if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
> known_tsc_khz = snp_secure_tsc_init();
> else if (boot_cpu_has(X86_FEATURE_TDX_GUEST))
> known_tsc_khz = tdx_tsc_init();
>
> /*
> * If the TSC frequency wasn't provided by trusted firmware, try to get
> * it from the hypervisor (which is untrusted when running as a CoCo guest).
> */
> if (!known_tsc_khz && x86_init.hyper.get_tsc_khz)
> known_tsc_khz = x86_init.hyper.get_tsc_khz();
>
> /*
> * Mark the TSC frequency as known if it was obtained from a hypervisor
> * or trusted firmware. Don't mark the frequency as known if the user
> * specified the frequency, as the user-provided frequency is intended
> * as a "starting point", not a known, guaranteed frequency.
> */
> if (known_tsc_khz && !tsc_early_khz)
> setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
If the frequency is known via the above then you want to set the
KNOWN_FREQ feature bit unconditionally. SNP/TDX/hypervisor override the
command line argument as you print below.
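The precedence being argued for here can be written down as a small
userspace model. This is a sketch of the policy only, not kernel code:
the struct, the function name, and the convention that 0 means "not
provided" are assumptions made for illustration.

```c
#include <stdbool.h>

struct tsc_choice {
	unsigned int khz;	/* frequency to use, 0 if none is available */
	bool known_freq;	/* models the KNOWN_FREQ feature bit */
	bool cmdline_ignored;	/* user value was overridden */
};

/*
 * Prefer trusted firmware, then the hypervisor, then the command line.
 * A value of 0 means "not provided".
 */
struct tsc_choice choose_tsc_khz(unsigned int firmware_khz,
				 unsigned int hypervisor_khz,
				 unsigned int cmdline_khz)
{
	struct tsc_choice c = { 0, false, false };

	c.khz = firmware_khz ? firmware_khz : hypervisor_khz;
	if (c.khz) {
		/* Known from firmware or hypervisor: set the bit unconditionally. */
		c.known_freq = true;
		c.cmdline_ignored = cmdline_khz != 0;
	} else {
		/* The user value is only a starting point, never "known". */
		c.khz = cmdline_khz;
	}
	return c;
}
```

In this model the command line value never affects known_freq; it only
matters when neither firmware nor the hypervisor supplies a frequency.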
> /*
> * Ignore the user-provided TSC frequency if the exact frequency was
> * obtained from trusted firmware or the hypervisor, as the user-
> * provided frequency is intended as a "starting point", not a known,
> * guaranteed frequency.
> */
> if (!known_tsc_khz)
> known_tsc_khz = tsc_early_khz;
> else if (tsc_early_khz)
> pr_err("Ignoring 'tsc_early_khz' in favor of firmware/hypervisor.\n");
>> All the nonsense about updating it every time we enter a CPU could just
>> go away completely.
>
> But to Thomas' point, why bother? For actual old hardware, kvmclock is what it
> is. For modern hardware, it's completely antiquated.
I agree, but we are not forced to make it a first class citizen to the
detriment of sane systems.
Thanks,
tglx
* Re: [PATCH net-next v10 2/2] net: mana: force full-page RX buffers via ethtool private flag
From: Dipayaan Roy @ 2026-06-09 4:32 UTC
To: Jacob Keller
Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
linux-rdma, stephen, dipayanroy, leitao, kees, john.fastabend,
hawk, bpf, daniel, ast, sdf, yury.norov, pavan.chebbi
In-Reply-To: <c3b2ab74-754d-4d09-b7a2-d274343d0936@intel.com>
On Thu, Jun 04, 2026 at 11:40:30AM -0700, Jacob Keller wrote:
> On 6/2/2026 1:24 PM, Dipayaan Roy wrote:
> > On some ARM64 platforms with 4K PAGE_SIZE, page_pool fragment
> > allocation in the RX refill path can cause 15-20% throughput
> > regression under high connection counts (>16 TCP streams).
> >
> > Add an ethtool private flag "full-page-rx" that allows the user to
> > force one RX buffer per page, bypassing the page_pool fragment path.
> > This restores line-rate (180+ Gbps) performance on affected platforms.
> >
> > Usage:
> > ethtool --set-priv-flags eth0 full-page-rx on
> >
> > There is no behavioral change by default. The flag must be explicitly
> > enabled by the user or udev rule.
> >
> > The existing single-buffer-per-page logic for XDP and jumbo frames is
> > consolidated into a new helper mana_use_single_rxbuf_per_page() which
> > is now the single decision point for both the automatic and
> > user-controlled paths.
> >
> > Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
> > ---
>
> I had one or two minor nits, but nothing that I think really deserves a
> v11. The only real comment is a future "gotcha" that could happen if you
> ever added a second private flag, which seems unlikely and maybe not
> worth dealing with until it matters.
>
> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
>
Hi Jacob,
Thank you for the review.
I will keep this patch as is, since there are no plans for any new private flags.
Regards
Dipayaan Roy
> > drivers/net/ethernet/microsoft/mana/mana_en.c | 22 +++-
> > .../ethernet/microsoft/mana/mana_ethtool.c | 103 ++++++++++++++++++
> > include/net/mana/mana.h | 8 ++
> > 3 files changed, 131 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > index db14357d3732..447cecfd3f67 100644
> > --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > @@ -744,6 +744,25 @@ static void *mana_get_rxbuf_pre(struct mana_rxq *rxq, dma_addr_t *da)
> > return va;
> > }
> >
> > +static bool
> > +mana_use_single_rxbuf_per_page(struct mana_port_context *apc, u32 mtu)
> > +{
> > + /* On some platforms with 4K PAGE_SIZE, page_pool fragment allocation
> > + * in the RX refill path (~2kB buffer) can cause significant throughput
> > + * regression under high connection counts. Allow user to force one RX
> > + * buffer per page via ethtool private flag to bypass the fragment
> > + * path.
> > + */
> > + if (apc->priv_flags & BIT(MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF))
> > + return true;
> > +
> > + /* For xdp and jumbo frames make sure only one packet fits per page. */
> > + if (mtu + MANA_RXBUF_PAD > PAGE_SIZE / 2 || mana_xdp_get(apc))
> > + return true;
>
> Technically you could combine all three into one if, but I agree that
> clarity and space for the comment about why the private flag exists
> makes sense.
>
> > +
> > + return false;
> > +}
> > +
> > /* Get RX buffer's data size, alloc size, XDP headroom based on MTU */
> > static void mana_get_rxbuf_cfg(struct mana_port_context *apc,
> > int mtu, u32 *datasize, u32 *alloc_size,
> > @@ -754,8 +773,7 @@ static void mana_get_rxbuf_cfg(struct mana_port_context *apc,
> > /* Calculate datasize first (consistent across all cases) */
> > *datasize = mtu + ETH_HLEN;
> >
> > - /* For xdp and jumbo frames make sure only one packet fits per page */
> > - if (mtu + MANA_RXBUF_PAD > PAGE_SIZE / 2 || mana_xdp_get(apc)) {
> > + if (mana_use_single_rxbuf_per_page(apc, mtu)) {
> > if (mana_xdp_get(apc)) {
> > *headroom = XDP_PACKET_HEADROOM;
> > *alloc_size = PAGE_SIZE;
> > diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> > index 7e79681634db..f22bbb325948 100644
> > --- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> > +++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> > @@ -133,6 +133,10 @@ static const struct mana_stats_desc mana_phy_stats[] = {
> > { "hc_tc7_tx_pause_phy", offsetof(struct mana_ethtool_phy_stats, tx_pause_tc7_phy) },
> > };
> >
> > +static const char mana_priv_flags[MANA_PRIV_FLAG_MAX][ETH_GSTRING_LEN] = {
> > + [MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF] = "full-page-rx"
> > +};
> > +
> > static int mana_get_sset_count(struct net_device *ndev, int stringset)
> > {
> > struct mana_port_context *apc = netdev_priv(ndev);
> > @@ -144,6 +148,10 @@ static int mana_get_sset_count(struct net_device *ndev, int stringset)
> > ARRAY_SIZE(mana_phy_stats) +
> > ARRAY_SIZE(mana_hc_stats) +
> > num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT);
> > +
> > + case ETH_SS_PRIV_FLAGS:
> > + return MANA_PRIV_FLAG_MAX;
> > +
> > default:
> > return -EINVAL;
> > }
> > @@ -192,6 +200,14 @@ static void mana_get_strings_stats(struct mana_port_context *apc, u8 **data)
> > }
> > }
> >
> > +static void mana_get_strings_priv_flags(u8 **data)
> > +{
> > + int i;
> > +
> > + for (i = 0; i < MANA_PRIV_FLAG_MAX; i++)
> > + ethtool_puts(data, mana_priv_flags[i]);
> > +}
> > +
> > static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
> > {
> > struct mana_port_context *apc = netdev_priv(ndev);
> > @@ -200,6 +216,9 @@ static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
> > case ETH_SS_STATS:
> > mana_get_strings_stats(apc, &data);
> > break;
> > + case ETH_SS_PRIV_FLAGS:
> > + mana_get_strings_priv_flags(&data);
> > + break;
> > default:
> > break;
> > }
> > @@ -590,6 +609,88 @@ static int mana_get_link_ksettings(struct net_device *ndev,
> > return 0;
> > }
> >
> > +static u32 mana_get_priv_flags(struct net_device *ndev)
> > +{
> > + struct mana_port_context *apc = netdev_priv(ndev);
> > +
> > + return apc->priv_flags;
> > +}
> > +
> > +static int mana_set_priv_flags(struct net_device *ndev, u32 priv_flags)
> > +{
> > + struct mana_port_context *apc = netdev_priv(ndev);
> > + u32 changed = apc->priv_flags ^ priv_flags;
> > + u32 old_priv_flags = apc->priv_flags;
> > + bool schedule_port_reset = false;
> > + int err = 0;
> > +
> > + if (!changed)
> > + return 0;
> > +
> > + /* Reject unknown bits */
> > + if (priv_flags & ~GENMASK(MANA_PRIV_FLAG_MAX - 1, 0))
> > + return -EINVAL;
>
> Good. Explicit rejection ensures that there's no risk of a bad value. I
> think this is only required for the legacy ioctl interface, and won't be
> able to have a bit set that isn't in your accepted list. However, the
> legacy ioctl interface looks like it doesn't do that double checking, so
> it's good to have this.
>
> > +
> > + if (changed & BIT(MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF)) {
> > + apc->priv_flags = priv_flags;
> > +
>
> In the (unlikely) event that you need another private flag in the
> future, this bit seems like it shouldn't be inside the if block here. It
> seems like you'd want to either do this at the end or up front. Of
> course it doesn't matter as long as this is the only private flag you have.
>
> > + if (!apc->port_is_up) {
> > + /* Port is down, flag updated to apply on next up
> > + * so just return.
> > + */
> > + return 0;
> > + }
> > +
> > + /* Pre-allocate buffers to prevent failure in mana_attach
> > + * later
> > + */
> > + err = mana_pre_alloc_rxbufs(apc, ndev->mtu, apc->num_queues);
> > + if (err) {
> > + netdev_err(ndev,
> > + "Insufficient memory for new allocations\n");
> > + apc->priv_flags = old_priv_flags;
> > + return err;
> > + }
> > +
> > + err = mana_detach(ndev, false);
> > + if (err) {
> > + netdev_err(ndev, "mana_detach failed: %d\n", err);
> > + apc->priv_flags = old_priv_flags;
> > +
> > + /* Port is in an inconsistent state. Restore
> > + * 'port_is_up' so that queue reset work handler
> > + * can properly detach and re-attach.
> > + */
> > + apc->port_is_up = true;
> > + schedule_port_reset = true;
> > + goto out;
> > + }
> > +
> > + err = mana_attach(ndev);
> > + if (err) {
> > + netdev_err(ndev, "mana_attach failed: %d\n", err);
> > + apc->priv_flags = old_priv_flags;
> > +
> > + /* Restore 'port_is_up' so the reset work handler
> > + * can properly detach/attach. Without this,
> > + * the handler sees port_is_up=false and skips
> > + * queue allocation, leaving the port dead.
> > + */
> > + apc->port_is_up = true;
> > + schedule_port_reset = true;
> > + }
>
> I might have made this bit a separate function, but that comes from
> history of working with older drivers which accumulated a larger number
> of private flags. Given that we frown on adding new ones except in more
> rare cases these days, this is probably fine.
>
> > + }
> > +
> > +out:
> > + mana_pre_dealloc_rxbufs(apc);
> > +
> > + if (schedule_port_reset)
> > + queue_work(apc->ac->per_port_queue_reset_wq,
> > + &apc->queue_reset_work);
> > +
> > + return err;
> > +}
> > +
> > const struct ethtool_ops mana_ethtool_ops = {
> > .supported_coalesce_params = ETHTOOL_COALESCE_RX_CQE_FRAMES,
> > .get_ethtool_stats = mana_get_ethtool_stats,
> > @@ -608,4 +709,6 @@ const struct ethtool_ops mana_ethtool_ops = {
> > .set_ringparam = mana_set_ringparam,
> > .get_link_ksettings = mana_get_link_ksettings,
> > .get_link = ethtool_op_get_link,
> > + .get_priv_flags = mana_get_priv_flags,
> > + .set_priv_flags = mana_set_priv_flags,
> > };
> > diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
> > index d9c27310fd04..26fd5e041a47 100644
> > --- a/include/net/mana/mana.h
> > +++ b/include/net/mana/mana.h
> > @@ -30,6 +30,12 @@ enum TRI_STATE {
> > TRI_STATE_TRUE = 1
> > };
> >
> > +/* MANA ethtool private flag bit positions */
> > +enum mana_priv_flag_bits {
> > + MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF = 0,
> > + MANA_PRIV_FLAG_MAX,
>
> For cases like this, I find it helpful to add a comment indicating this
> must be the last entry. (and in that case, drop the trailing comma).
>
> > +};
> > +
> > /* Number of entries for hardware indirection table must be in power of 2 */
> > #define MANA_INDIRECT_TABLE_MAX_SIZE 512
> > #define MANA_INDIRECT_TABLE_DEF_SIZE 64
> > @@ -531,6 +537,8 @@ struct mana_port_context {
> > u32 rxbpre_headroom;
> > u32 rxbpre_frag_count;
> >
> > + u32 priv_flags;
> > +
> > struct bpf_prog *bpf_prog;
> >
> > /* Create num_queues EQs, SQs, SQ-CQs, RQs and RQ-CQs, respectively. */
>
* Re: [PATCH v4 26/47] x86/kvm: Don't disable kvmclock on BSP in syscore_suspend()
From: Sean Christopherson @ 2026-06-09 0:31 UTC (permalink / raw)
To: sashiko-reviews; +Cc: linux-hyperv, kvm
In-Reply-To: <20260530070848.158521F00893@smtp.kernel.org>
On Sat, May 30, 2026, sashiko-bot@kernel.org wrote:
> > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> > index c81a24d0efdf6..fd1c417b4f9b9 100644
> > --- a/arch/x86/kernel/kvm.c
> > +++ b/arch/x86/kernel/kvm.c
> > @@ -460,7 +460,7 @@ static void __init sev_map_percpu_data(void)
>
> [ ... ]
>
> > -static void kvm_guest_cpu_offline(bool shutdown)
> > +static void kvm_guest_cpu_offline(enum kvm_guest_cpu_action action)
> > {
> > kvm_disable_steal_time();
> > if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
> > @@ -468,9 +468,10 @@ static void kvm_guest_cpu_offline(bool shutdown)
> > if (kvm_para_has_feature(KVM_FEATURE_MIGRATION_CONTROL))
> > wrmsrq(MSR_KVM_MIGRATION_CONTROL, 0);
>
> [Severity: High]
> This isn't a bug introduced by this patch, but I noticed that
> MSR_KVM_MIGRATION_CONTROL is cleared to 0 here during CPU offline and system
> suspend, but it does not appear to be re-enabled upon online/resume.
>
> Since this patch introduces the kvm_guest_cpu_action enum to solve an
> identical state-machine issue for kvmclock, should we also use this action
> enum to ensure the migration control MSR isn't incorrectly cleared on
> suspend? Otherwise, wouldn't a single CPU hotplug event or a suspend/resume
> cycle permanently disable live migration for the entire VM?
Looks like. That's someone else's future problem though.
* Re: [PATCH v4 21/47] x86/xen: Obtain TSC frequency from CPUID if present
From: Sean Christopherson @ 2026-06-09 0:29 UTC (permalink / raw)
To: sashiko-reviews; +Cc: kvm, linux-hyperv
In-Reply-To: <20260530063531.D034A1F00898@smtp.kernel.org>
On Sat, May 30, 2026, sashiko-bot@kernel.org wrote:
> > diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
> > index 487ad838c4418..36d66abf53796 100644
> > --- a/arch/x86/xen/time.c
> > +++ b/arch/x86/xen/time.c
> > @@ -42,6 +42,17 @@ static unsigned int __init xen_tsc_khz(void)
> > {
> > struct pvclock_vcpu_time_info *info =
> > &HYPERVISOR_shared_info->vcpu_info[0].time;
> > + u32 base = xen_cpuid_base();
> > + u32 eax, ebx, ecx, edx;
> > +
> > + /*
> > + * If Xen provides the guest TSC frequency directly in CPUID
> > + * (leaf 3, sub-leaf 0, ECX), use that instead of reverse-
> > + * calculating from the pvclock mul/shift.
> > + */
> > + cpuid_count(base + 3, 0, &eax, &ebx, &ecx, &edx);
> > + if (ecx)
> > + return ecx;
>
> [Severity: High]
> Is it possible for the hypervisor to support a maximum leaf of base + 2,
> resulting in an out-of-bounds CPUID query here?
>
> Looking at xen_cpuid_base(), it calls cpuid_base_hypervisor(XEN_SIGNATURE, 2)
> which only guarantees that the hypervisor supports up to leaf base + 2.
>
> If the hypervisor does not support base + 3, this CPUID query could be out
> of bounds. Processors handling out-of-bounds CPUID requests typically return
> data from the maximum basic leaf.
Heh, depends on the hypervisor. This quirk is specific to Intel CPUs, and so
KVM emulates this behavior only when the advertised vCPU vendor is Intel.
Anyways, AFAICT, Sashiko is right to be skeptical, I don't see anything obvious
that guarantees +3 will be supported.
David, can you send this as a standalone patch, and either address Sashiko's
concern or add a blurb/comment explaining why it's safe? Unlike the KVM changes,
this won't conflict with any of the other changes in this series. So while it's
thematically very related to this series, in practice it can go in separately,
and I'd strongly prefer to let the Xen folks handle this one.
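
To make the concern concrete, here is a minimal userspace sketch of the kind of
range check being asked for. It assumes that EAX of the base leaf reports the
highest hypervisor leaf, which is what the "guarantees up to base + 2" check in
cpuid_base_hypervisor() relies on. The helper names hv_leaf_in_range() and
tsc_khz_from_leaf3() are hypothetical and not part of the patch; the CPUID
results are passed in as plain values so the logic can be exercised without a
hypervisor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HV_CPUID_MIN 0x40000000u	/* start of the hypervisor CPUID range */

/*
 * Hypothetical helper: true if leaf (base + offset) is within the range
 * advertised by the hypervisor. max_leaf stands in for EAX of CPUID(base).
 */
static bool hv_leaf_in_range(uint32_t base, uint32_t max_leaf, uint32_t offset)
{
	assert(base >= HV_CPUID_MIN);
	return max_leaf >= base && max_leaf - base >= offset;
}

/*
 * Hypothetical model of the new xen_tsc_khz() fast path: only trust ECX of
 * leaf base + 3 when that leaf is advertised. Returning 0 means "fall back
 * to the pvclock mul/shift calculation".
 */
static uint32_t tsc_khz_from_leaf3(uint32_t base, uint32_t max_leaf,
				   uint32_t leaf3_ecx)
{
	if (!hv_leaf_in_range(base, max_leaf, 3))
		return 0;

	return leaf3_ecx;
}
```

With a guard like this, a hypervisor that only advertises up to base + 2 never
has whatever CPUID returned for the out-of-range leaf interpreted as a
frequency.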
* Re: [PATCH v4 10/47] x86/tsc: Consolidate forcing of X86_FEATURE_TSC_KNOWN_FREQ for PV code
From: Sean Christopherson @ 2026-06-08 22:38 UTC (permalink / raw)
To: David Woodhouse
Cc: Thomas Gleixner, Paolo Bonzini, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Kiryl Shutsemau, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Ajay Kaher,
Alexey Makhalov, Jan Kiszka, Andy Lutomirski, Peter Zijlstra,
Juergen Gross, Daniel Lezcano, John Stultz, H. Peter Anvin,
Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, Tom Lendacky, Nikunj A Dadhania,
Michael Kelley
In-Reply-To: <eef867eae15e30d08482ba16a1a32159745b64a7.camel@infradead.org>
On Sat, Jun 06, 2026, David Woodhouse wrote:
> On Sat, 2026-06-06 at 12:34 +0200, Thomas Gleixner wrote:
> > On Fri, May 29 2026 at 07:43, Sean Christopherson wrote:
> >
> > > Now that all paravirt code that explicitly specifies the TSC frequency
> > > also sets X86_FEATURE_TSC_KNOWN_FREQ, replace all of the one-off code
> > > and simply set X86_FEATURE_TSC_KNOWN_FREQ if the TSC frequency is known.
> > >
> > > Do NOT force set TSC_KNOWN_FREQ if the "known" TSC frequency was provided
> > > by the user. Per commit bd35c77e32e4 ("x86/tsc: Add tsc_early_khz command
> > > line parameter"), one of the goals of the param is to allow the refined
> > > calibration work "to do meaningful error checking".
> > >
> > > Note, preferring the user-provided TSC frequency over the frequency from
> > > the hypervisor or trusted firmware, while simultaneously not treating the
> > > user-provided frequency as gospel, is obviously incongruous. Sweep the
> > > problem under the rug for now to avoid opening a big can of worms that
> > > likely doesn't have a great answer.
> >
> > There is a good answer I think.
> >
> > early_tsc_khz exists to cater for the overclocking crowd. On their
> > modded systems the firmware supplied TSC frequency (CPUID/MSR) is not
> > matching reality anymore. So they work around that by supplying a close
> > enough tsc_early_khz and then they let the refined calibration work
> > figure it out.
> >
> > Arguably that's only relevant for bare metal systems and what's worse is
> > that in virtual environments the refined calibration work can fail,
> > which renders the TSC unstable.
> >
> > So I'd rather say we change this logic to:
> >
> > if (!hypervisor_is_type(X86_HYPER_NATIVE)) {
> > tsc_khz = x86_init.....();
> > force(X86_FEATURE_TSC_KNOWN_FREQ);
> > } else if (tsc_khz_early) {
> > ....
> > } else {
> > ...
> > }
> >
> > Along with:
> >
> > if (!hypervisor_is_type(X86_HYPER_NATIVE)) {
> > if (tsc_khz_early)
> > pr_warn("Ignoring non-sensical tsc_early_khz command line argument\n");
> >
> > or something daft like that.
Ya, I ended up in the same place once Sashiko pointed out that skipping the SNP/TDX
setup was hazardous[*], and also once I realized that tsc_early_khz *complemented*
the refinement instead of replacing it.
This is what I have locally:

	if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
		known_tsc_khz = snp_secure_tsc_init();
	else if (boot_cpu_has(X86_FEATURE_TDX_GUEST))
		known_tsc_khz = tdx_tsc_init();

	/*
	 * If the TSC frequency wasn't provided by trusted firmware, try to get
	 * it from the hypervisor (which is untrusted when running as a CoCo guest).
	 */
	if (!known_tsc_khz && x86_init.hyper.get_tsc_khz)
		known_tsc_khz = x86_init.hyper.get_tsc_khz();

	/*
	 * Mark the TSC frequency as known if it was obtained from a hypervisor
	 * or trusted firmware. Don't mark the frequency as known if the user
	 * specified the frequency, as the user-provided frequency is intended
	 * as a "starting point", not a known, guaranteed frequency.
	 */
	if (known_tsc_khz && !tsc_early_khz)
		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);

	/*
	 * Ignore the user-provided TSC frequency if the exact frequency was
	 * obtained from trusted firmware or the hypervisor, as the user-
	 * provided frequency is intended as a "starting point", not a known,
	 * guaranteed frequency.
	 */
	if (!known_tsc_khz)
		known_tsc_khz = tsc_early_khz;
	else if (tsc_early_khz)
		pr_err("Ignoring 'tsc_early_khz' in favor of firmware/hypervisor.\n");

[*] https://lore.kernel.org/all/ahnF-FehodVd474X@google.com
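
For anyone who wants to poke at the ordering without booting a guest, the
decision logic above can be modeled in plain userspace C. This is only a sketch:
struct tsc_sources, struct tsc_decision and tsc_pick_khz() are made-up names,
the three inputs stand in for the SNP/TDX init helpers, the hypervisor callback
and the command line parameter, and a value of 0 means "not provided". It
mirrors the snippet as posted, including the fact that TSC_KNOWN_FREQ is not
forced when tsc_early_khz is set even though the firmware/hypervisor value wins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical inputs; 0 means "not provided". */
struct tsc_sources {
	unsigned int firmware_khz;	/* trusted firmware (SNP/TDX) */
	unsigned int hypervisor_khz;	/* hypervisor-reported frequency */
	unsigned int user_khz;		/* tsc_early_khz command line param */
};

struct tsc_decision {
	unsigned int khz;	/* frequency the kernel would start with */
	bool known_freq;	/* would force X86_FEATURE_TSC_KNOWN_FREQ */
	bool user_ignored;	/* would print the "Ignoring" message */
};

static struct tsc_decision tsc_pick_khz(const struct tsc_sources *src)
{
	struct tsc_decision d = { 0, false, false };
	unsigned int known;

	assert(src != NULL);

	/* Trusted firmware first, then the (possibly untrusted) hypervisor. */
	known = src->firmware_khz;
	if (!known)
		known = src->hypervisor_khz;

	/* As posted: only marked "known" when the user supplied nothing. */
	d.known_freq = known && !src->user_khz;

	if (!known) {
		/* Nothing authoritative; the user value is a starting point. */
		d.khz = src->user_khz;
	} else {
		d.khz = known;
		d.user_ignored = src->user_khz != 0;
	}

	return d;
}
```

Feeding it { 0, 2500000, 3000000 } picks 2500000, reports the user value as
ignored, and leaves known_freq false, which is the combination worth a second
look in the real code.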
> > The kernel has for various reasons always tried to cater for the needs
> > of users who are plagued by bonkers firmware, but we have to stop to
> > prioritize or treating equal ancient and modded out of spec hardware.
> >
> > TBH, I consider that whole KVM clock nonsense to fall into the modded
> > out of spec hardware realm. Do a reality check:
> >
> > How many production systems are out there still which run VMs on CPUs
> > with a broken TSC and the lack of VM TSC scaling?
> >
> > I'm not saying that we should not support the few remaining systems
> > anymore, but our tendency to pretend that we can keep all of this
> > nonsense working and at the same time making progress is just a fallacy.
FWIW, I have the exact same sentiments about kvmclock, but I'm also trying my
best not to break folks that are happily running on what is effectively flawed,
ancient "hardware".
> I don't know that we can take the KVM (and Xen) clock away from guests,
> but all of the *horrid* part about it is the way it attempts to cope
> with the possibility that the *host* timekeeping might flip away from
> TSC-based mode at any point in time. By the end of my outstanding
> cleanup series, that is the *only* thing the gtod_notifier remains for.
>
> If we can trust the hardware *and* the host kernel, then KVM could
> theoretically hardwire the kvmclock into 'master clock mode' where it
> basically just advertises the TSC→kvmclock relationship *once* to all
> CPUs and it never changes.
>
> All the nonsense about updating it every time we enter a CPU could just
> go away completely.
But to Thomas' point, why bother? For actual old hardware, kvmclock is what it
is. For modern hardware, it's completely antiquated.
* Re: [PATCH net v3] net: mana: Optimize irq affinity for low vcpu configs
From: Yury Norov @ 2026-06-08 22:35 UTC (permalink / raw)
To: Shradha Gupta
Cc: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
Dipayaan Roy, Shiraz Saleem, Michael Kelley, Long Li, Yury Norov,
linux-hyperv, linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
Saurabh Singh Sengar, stable
In-Reply-To: <20260601102749.1768304-1-shradhagupta@linux.microsoft.com>
On Mon, Jun 01, 2026 at 03:27:46AM -0700, Shradha Gupta wrote:
> In mana driver, the number of IRQs allocated is capped by the
> min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
> than the vcpu count, we want to utilize all the vCPUs, irrespective of
> their NUMA/core bindings.
>
> This is important, especially in the envs where number of vCPUs are so
> few that the softIRQ handling overhead on two IRQs on the same vCPU is
> much more than their overheads if they were spread across sibling vCPUs.
>
> This behaviour is more evident with dynamic IRQ allocation. Since MANA
> IRQs are assigned at a later stage compared to static allocation, other
> device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> weights become imbalanced, causing multiple MANA IRQs to land on the
> same vCPU, while some vCPUs have none.
>
> In such cases when many parallel TCP connections are tested, the
> throughput drops significantly.
>
> Test envs:
> =======================================================
> Case 1: without this patch
> =======================================================
> 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
>
> TYPE effective vCPU aff
> =======================================================
> IRQ0: HWC 0
> IRQ1: mana_q1 0
> IRQ2: mana_q2 2
> IRQ3: mana_q3 0
> IRQ4: mana_q4 3
>
> %soft on each vCPU(mpstat -P ALL 1) on receiver
> vCPU 0 1 2 3
> =======================================================
> pass 1: 38.85 0.03 24.89 24.65
> pass 2: 39.15 0.03 24.57 25.28
> pass 3: 40.36 0.03 23.20 23.17
>
> =======================================================
> Case 2: with this patch
> =======================================================
> 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
>
> TYPE effective vCPU aff
> =======================================================
> IRQ0: HWC 0
> IRQ1: mana_q1 0
> IRQ2: mana_q2 1
> IRQ3: mana_q3 2
> IRQ4: mana_q4 3
>
> %soft on each vCPU(mpstat -P ALL 1) on receiver
> vCPU 0 1 2 3
> =======================================================
> pass 1: 15.42 15.85 14.99 14.51
> pass 2: 15.53 15.94 15.81 15.93
> pass 3: 16.41 16.35 16.40 16.36
>
> =======================================================
> Throughput Impact(in Gbps, same env)
> =======================================================
> TCP conn with patch w/o patch
> 20480 15.65 7.73
> 10240 15.63 8.93
> 8192 15.64 9.69
> 6144 15.64 13.16
> 4096 15.69 15.75
> 2048 15.69 15.83
> 1024 15.71 15.28
So, case 1 is irq_setup(), and case 2 is irq_setup_linear(). Is that
correct?
In the previous round we discussed a no-affinity case:

	irq_set_affinity_and_hint(irq, NULL);

My naive view is that the more freedom you give the scheduler in
balancing the IRQ handling load, the better the results you get. But
your numbers show that the 'linear' distribution is still better. Can
you add the results of that experiment as 'case 3', please? Any
idea why the linear case wins over the no-affinity case?
Thanks,
Yury
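
For reference, the 'linear' distribution in case 2 boils down to handing each
queue IRQ the next online CPU and stopping when either runs out. Below is a
rough userspace model of that loop, not the driver code: assign_linear(),
NO_CPU and the online[] byte map are invented for illustration and stand in for
for_each_online_cpu() and irq_set_affinity_and_hint().

```c
#include <assert.h>
#include <stddef.h>

#define NO_CPU (-1)	/* slot keeps its default affinity */

/*
 * Hypothetical model of the linear walk: online[] is a 0/1 map of ncpus
 * CPUs, irq_cpu[] receives the CPU chosen for each of the len IRQ slots.
 * Returns how many slots were given a CPU.
 */
static size_t assign_linear(const unsigned char *online, size_t ncpus,
			    int *irq_cpu, size_t len)
{
	size_t assigned = 0;
	size_t i, cpu;

	assert(online != NULL && irq_cpu != NULL);

	/* Slots beyond the number of online CPUs stay unassigned. */
	for (i = 0; i < len; i++)
		irq_cpu[i] = NO_CPU;

	/* One IRQ per online CPU, in CPU order, until IRQs run out. */
	for (cpu = 0; cpu < ncpus && assigned < len; cpu++) {
		if (!online[cpu])
			continue;
		irq_cpu[assigned++] = (int)cpu;
	}

	return assigned;
}
```

With four online CPUs and four queue IRQs this gives 0, 1, 2, 3, matching the
case 2 table. If CPU 1 is offline, the result is 0, 2, 3 and the last slot is
left at NO_CPU, which corresponds to the "retain the default CPU affinity"
path in the patch.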
> Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> Cc: stable@vger.kernel.org
> Co-developed-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> Reviewed-by: Simon Horman <horms@kernel.org>
> ---
> Changes in v3
> * Optimize the comments in mana_gd_setup_dyn_irqs()
> * add more details in the dev_dbg for extra IRQs
> ---
> Changes in v2
> * Removed the unused skip_first_cpu variable
> * fixed exit condition in irq_setup_linear() with len == 0
> * changed return type of irq_setup_linear() as it will always be 0
> * removed the unnecessary rcu_read_lock() in irq_setup_linear()
> * added appropriate comments to indicate expected behaviour when
> IRQs are more than or equal to num_online_cpus()
> ---
> .../net/ethernet/microsoft/mana/gdma_main.c | 60 ++++++++++++++++---
> 1 file changed, 53 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> index 712a0881d720..00a28b3ca0a6 100644
> --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> @@ -197,6 +197,8 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
> } else {
> /* If dynamic allocation is enabled we have already allocated
> * hwc msi
> + * Also, we make sure in this case the following is always true
> + * (num_msix_usable - 1 HWC) <= num_online_cpus()
> */
> gc->num_msix_usable = min(resp.max_msix, num_online_cpus() + 1);
> }
> @@ -1717,11 +1719,24 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
> return 0;
> }
>
> +/* should be called with cpus_read_lock() held */
> +static void irq_setup_linear(unsigned int *irqs, unsigned int len)
> +{
> + int cpu;
> +
> + for_each_online_cpu(cpu) {
> + if (len == 0)
> + break;
> +
> + irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
> + len--;
> + }
> +}
> +
> static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> {
> struct gdma_context *gc = pci_get_drvdata(pdev);
> struct gdma_irq_context *gic;
> - bool skip_first_cpu = false;
> int *irqs, irq, err, i;
>
> irqs = kmalloc_objs(int, nvec);
> @@ -1729,6 +1744,8 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> return -ENOMEM;
>
> /*
> + * In this function, num_msix_usable = HWC IRQ + Queue IRQ.
> + * nvec is only Queue IRQ (HWC already setup).
> * While processing the next pci irq vector, we start with index 1,
> * as IRQ vector at index 0 is already processed for HWC.
> * However, the population of irqs array starts with index 0, to be
> @@ -1767,13 +1784,42 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> * first CPU sibling group since they are already affinitized to HWC IRQ
> */
> cpus_read_lock();
> - if (gc->num_msix_usable <= num_online_cpus())
> - skip_first_cpu = true;
> + if (gc->num_msix_usable <= num_online_cpus()) {
> + err = irq_setup(irqs, nvec, gc->numa_node, true);
> + if (err) {
> + cpus_read_unlock();
> + goto free_irq;
> + }
> + } else {
> + /*
> + * When num_msix_usable are more than num_online_cpus, our
> + * queue IRQs should be equal to num of online vCPUs.
> + * We try to make sure queue IRQs spread across all vCPUs.
> + * In such a case NUMA or CPU core affinity does not matter.
> + * Note: in this case the total mana IRQ should always be
> + * num_online_cpus + 1. The first HWC IRQ is already handled
> + * in HWC setup calls
> + * However, if CPUs went offline since num_msix_usable was
> + * computed, queue IRQs will be more than num_online_cpus().
> + * In such cases remaining extra IRQs will retain their default
> + * affinity.
> + */
> + int first_unassigned = num_online_cpus();
> + if (nvec > first_unassigned) {
> + char buf[32];
> +
> + if (first_unassigned == nvec - 1)
> + snprintf(buf, sizeof(buf), "%d",
> + first_unassigned);
> + else
> + snprintf(buf, sizeof(buf), "%d-%d",
> + first_unassigned, nvec - 1);
> +
> + dev_dbg(&pdev->dev,
> + "MANA IRQ indices #%s will retain the default CPU affinity\n", buf);
> + }
>
> - err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
> - if (err) {
> - cpus_read_unlock();
> - goto free_irq;
> + irq_setup_linear(irqs, nvec);
> }
>
> cpus_read_unlock();
>
> base-commit: 8415598365503ced2e3d019491b0a2756c85c494
> --
> 2.34.1
* Re: [PATCH net v3] net: mana: Optimize irq affinity for low vcpu configs
From: Yury Norov @ 2026-06-08 22:28 UTC (permalink / raw)
To: Shradha Gupta
Cc: Jacob Keller, Dexuan Cui, Wei Liu, Haiyang Zhang,
K. Y. Srinivasan, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Konstantin Taranov, Simon Horman,
Erni Sri Satya Vennela, Dipayaan Roy, Shiraz Saleem,
Michael Kelley, Long Li, Yury Norov, linux-hyperv, linux-kernel,
netdev, Paul Rosswurm, Shradha Gupta, Saurabh Singh Sengar,
stable
In-Reply-To: <aiEBdsP7NTBd0+ah@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>
On Wed, Jun 03, 2026 at 09:39:18PM -0700, Shradha Gupta wrote:
> On Wed, Jun 03, 2026 at 02:49:24PM -0700, Jacob Keller wrote:
> > On 6/1/2026 3:27 AM, Shradha Gupta wrote:
> > > In mana driver, the number of IRQs allocated is capped by the
> > > min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
> > > than the vcpu count, we want to utilize all the vCPUs, irrespective of
> > > their NUMA/core bindings.
> > >
> > > This is important, especially in the envs where number of vCPUs are so
> > > few that the softIRQ handling overhead on two IRQs on the same vCPU is
> > > much more than their overheads if they were spread across sibling vCPUs.
> > >
> > > This behaviour is more evident with dynamic IRQ allocation. Since MANA
> > > IRQs are assigned at a later stage compared to static allocation, other
> > > device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> > > weights become imbalanced, causing multiple MANA IRQs to land on the
> > > same vCPU, while some vCPUs have none.
> > >
> > > In such cases when many parallel TCP connections are tested, the
> > > throughput drops significantly.
> > >
> > > Test envs:
> > > =======================================================
> > > Case 1: without this patch
> > > =======================================================
> > > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > >
> > > TYPE effective vCPU aff
> > > =======================================================
> > > IRQ0: HWC 0
> > > IRQ1: mana_q1 0
> > > IRQ2: mana_q2 2
> > > IRQ3: mana_q3 0
> > > IRQ4: mana_q4 3
> > >
> > > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > > vCPU 0 1 2 3
> > > =======================================================
> > > pass 1: 38.85 0.03 24.89 24.65
> > > pass 2: 39.15 0.03 24.57 25.28
> > > pass 3: 40.36 0.03 23.20 23.17
> > >
> > > =======================================================
> > > Case 2: with this patch
> > > =======================================================
> > > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > >
> > > TYPE effective vCPU aff
> > > =======================================================
> > > IRQ0: HWC 0
> > > IRQ1: mana_q1 0
> > > IRQ2: mana_q2 1
> > > IRQ3: mana_q3 2
> > > IRQ4: mana_q4 3
> > >
> > > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > > vCPU 0 1 2 3
> > > =======================================================
> > > pass 1: 15.42 15.85 14.99 14.51
> > > pass 2: 15.53 15.94 15.81 15.93
> > > pass 3: 16.41 16.35 16.40 16.36
> > >
> > > =======================================================
> > > Throughput Impact(in Gbps, same env)
> > > =======================================================
> > > TCP conn with patch w/o patch
> > > 20480 15.65 7.73
> > > 10240 15.63 8.93
> > > 8192 15.64 9.69
> > > 6144 15.64 13.16
> > > 4096 15.69 15.75
> > > 2048 15.69 15.83
> > > 1024 15.71 15.28
> > >
> > > Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> > > Cc: stable@vger.kernel.org
> > > Co-developed-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> > > Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> > > Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> > > Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> > > Reviewed-by: Simon Horman <horms@kernel.org>
> > > ---
> > > Changes in v3
> > > * Optimize the comments in mana_gd_setup_dyn_irqs()
> > > * add more details in the dev_dbg for extra IRQs
> > > ---
> > > Changes in v2
> > > * Removed the unused skip_first_cpu variable
> > > * fixed exit condition in irq_setup_linear() with len == 0
> > > * changed return type of irq_setup_linear() as it will always be 0
> > > * removed the unnecessary rcu_read_lock() in irq_setup_linear()
> > > * added appropriate comments to indicate expected behaviour when
> > > IRQs are more than or equal to num_online_cpus()
> > > ---
> > > .../net/ethernet/microsoft/mana/gdma_main.c | 60 ++++++++++++++++---
> > > 1 file changed, 53 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > index 712a0881d720..00a28b3ca0a6 100644
> > > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > @@ -197,6 +197,8 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
> > > } else {
> > > /* If dynamic allocation is enabled we have already allocated
> > > * hwc msi
> > > + * Also, we make sure in this case the following is always true
> > > + * (num_msix_usable - 1 HWC) <= num_online_cpus()
> > > */
> > > gc->num_msix_usable = min(resp.max_msix, num_online_cpus() + 1);
> > > }
> > > @@ -1717,11 +1719,24 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
> > > return 0;
> > > }
> > >
> > > +/* should be called with cpus_read_lock() held */
> > > +static void irq_setup_linear(unsigned int *irqs, unsigned int len)
> > > +{
> > > + int cpu;
> > > +
> > > + for_each_online_cpu(cpu) {
> > > + if (len == 0)
> > > + break;
> > > +
> > > + irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
> > > + len--;
> > > + }
> > > +}
> >
> > I would find all of this a bit easier to follow if irq_setup_linear()
> > and irq_setup() had a mana prefix so it was more obvious these are
> > > specific to the driver. Of course irq_setup is pre-existing, and it's not
> > my driver so do as you will :)
> >
> > > +
> > > static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> > > {
> > > struct gdma_context *gc = pci_get_drvdata(pdev);
> > > struct gdma_irq_context *gic;
> > > - bool skip_first_cpu = false;
> > > int *irqs, irq, err, i;
> > >
> > > irqs = kmalloc_objs(int, nvec);
> > > @@ -1729,6 +1744,8 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> > > return -ENOMEM;
> > >
> > > /*
> > > + * In this function, num_msix_usable = HWC IRQ + Queue IRQ.
> > > + * nvec is only Queue IRQ (HWC already setup).
> > > * While processing the next pci irq vector, we start with index 1,
> > > * as IRQ vector at index 0 is already processed for HWC.
> > > * However, the population of irqs array starts with index 0, to be
> > > @@ -1767,13 +1784,42 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> > > * first CPU sibling group since they are already affinitized to HWC IRQ
> > > */
> > > cpus_read_lock();
> > > - if (gc->num_msix_usable <= num_online_cpus())
> > > - skip_first_cpu = true;
> > > + if (gc->num_msix_usable <= num_online_cpus()) {
> > > + err = irq_setup(irqs, nvec, gc->numa_node, true);
> > > + if (err) {
> > > + cpus_read_unlock();
> > > + goto free_irq;
> > > + }
> > > + } else {
> > > + /*
> > > + * When num_msix_usable are more than num_online_cpus, our
> > > + * queue IRQs should be equal to num of online vCPUs.
> > > + * We try to make sure queue IRQs spread across all vCPUs.
> > > + * In such a case NUMA or CPU core affinity does not matter.
> > > + * Note: in this case the total mana IRQ should always be
> > > + * num_online_cpus + 1. The first HWC IRQ is already handled
> > > + * in HWC setup calls
> > > + * However, if CPUs went offline since num_msix_usable was
> > > + * computed, queue IRQs will be more than num_online_cpus().
> > > + * In such cases remaining extra IRQs will retain their default
> > > + * affinity.
> > > + */
> > > + int first_unassigned = num_online_cpus();
> > > + if (nvec > first_unassigned) {
> > > + char buf[32];
> > > +
> > > + if (first_unassigned == nvec - 1)
> > > + snprintf(buf, sizeof(buf), "%d",
> > > + first_unassigned);
> > > + else
> > > + snprintf(buf, sizeof(buf), "%d-%d",
> > > + first_unassigned, nvec - 1);
> > > +
> > > + dev_dbg(&pdev->dev,
> > > + "MANA IRQ indices #%s will retain the default CPU affinity\n", buf);
> > > + }
> > >
> > > - err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
> > > - if (err) {
> > > - cpus_read_unlock();
> > > - goto free_irq;
> > > + irq_setup_linear(irqs, nvec);
> >
> > irq_setup() doesn't have a driver prefix, but is actually a static
> > function in gdma_main.c, so its implementation is specific to this
> > driver despite its name.
> >
> > So if I understand this change correctly, if the number of usable MSI-X
> > > vectors is smaller than the number of CPUs, you continue to use the
> > > current irq_setup logic; otherwise you switch to the simpler "linear"
> > logic.
> >
> > I guess this means the logic and heuristic used in irq_setup() breaks
> > down when the number of vectors is large and number of vCPU is small?
> >
> > Makes sense.
> >
>
> Hi Jacob,
>
> Yes, that's the right understanding.
> Regarding the function names, let me take that up in a separate patch to
> add prefixes to all such functions.
I agree. Now that you've got more than one setup method, the short
'irq_setup' looks confusing, if not misleading. You need names that
distinguish the NUMA-based method from the plain linear one.
Thanks,
Yury
* Re: [PATCH v2 0/4] Convert remaining buses to generic driver_override handling
From: Danilo Krummrich @ 2026-06-08 18:09 UTC (permalink / raw)
To: Runyu Xiao
Cc: gregkh, rafael, driver-core, linux, andersson, mathieu.poirier,
kys, haiyangz, wei.liu, decui, longli, nipun.gupta,
nikhil.agarwal, linux-remoteproc, linux-arm-msm, linux-hyperv,
linux-kernel, jianhao.xu
In-Reply-To: <20260604035239.1711889-1-runyu.xiao@seu.edu.cn>
On Thu Jun 4, 2026 at 5:52 AM CEST, Runyu Xiao wrote:
> Runyu Xiao (4):
> amba: use generic driver_override infrastructure
> rpmsg: core: use generic driver_override infrastructure
> vmbus: use generic driver_override infrastructure
> cdx: use generic driver_override infrastructure
Given that you changed the approach to use the new driver_override
infrastructure, I assume you read the message in [1]?
In this message I also explained that this all has been addressed and was merged
into driver-core-next already.
[1] https://lore.kernel.org/all/DIYPR0K2CZW7.254R8K7ONBX5D@kernel.org/
* Re: [PATCH v2 2/4] rpmsg: core: use generic driver_override infrastructure
From: Mathieu Poirier @ 2026-06-08 17:47 UTC (permalink / raw)
To: Runyu Xiao
Cc: gregkh, rafael, dakr, driver-core, linux, andersson, kys,
haiyangz, wei.liu, decui, longli, nipun.gupta, nikhil.agarwal,
linux-remoteproc, linux-arm-msm, linux-hyperv, linux-kernel,
jianhao.xu, stable
In-Reply-To: <20260604035239.1711889-3-runyu.xiao@seu.edu.cn>
On Wed, 3 Jun 2026 at 21:52, Runyu Xiao <runyu.xiao@seu.edu.cn> wrote:
>
> RPMSG still keeps driver_override in bus-private storage.
>
> That private pointer can be updated from the sysfs driver_override
> attribute, and also from rpmsg_register_device_override(). Both paths
> replace the pointer and can free the old value.
>
> However, driver_match_device() can call rpmsg_dev_match() from
> __driver_attach() without holding the device lock, and rpmsg_dev_match()
> still dereferences that private pointer directly.
>
> This leaves the match path racing with concurrent driver_override
> updates, with the usual risk of comparing against freed memory.
>
> Switch rpmsg to the driver-core driver_override infrastructure. This
> removes the private storage, uses device_match_driver_override() for the
> locked read in rpmsg_dev_match(), and converts
> rpmsg_register_device_override() to device_set_driver_override() so the
> in-kernel override path uses the same core-managed storage. With that
> storage now owned by struct device, drop the remaining rpmsg transport
> release-path frees of rpdev->driver_override as well.
>
> Link: https://lore.kernel.org/driver-core/DGRGTIRHA62X.3RY09D9SOK77P@kernel.org/
> Fixes: 39e47767ec9b ("rpmsg: Add driver_override device attribute for rpmsg_device")
> Cc: stable@vger.kernel.org
> Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
> ---
> drivers/rpmsg/qcom_glink_native.c | 2 --
> drivers/rpmsg/rpmsg_core.c | 41 ++++++--------------------------------
> drivers/rpmsg/virtio_rpmsg_bus.c | 1 -
> include/linux/rpmsg.h | 4 ----
For the bottom 3:
Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org>
> 4 files changed, 6 insertions(+), 42 deletions(-)
>
> diff --git a/drivers/rpmsg/rpmsg_core.c b/drivers/rpmsg/rpmsg_core.c
> index e7f7831d37f8..11d3007db5cd 100644
> --- a/drivers/rpmsg/rpmsg_core.c
> +++ b/drivers/rpmsg/rpmsg_core.c
> @@ -358,33 +358,6 @@ rpmsg_show_attr(src, src, "0x%x\n");
> rpmsg_show_attr(dst, dst, "0x%x\n");
> rpmsg_show_attr(announce, announce ? "true" : "false", "%s\n");
>
> -static ssize_t driver_override_store(struct device *dev,
> - struct device_attribute *attr,
> - const char *buf, size_t count)
> -{
> - struct rpmsg_device *rpdev = to_rpmsg_device(dev);
> - int ret;
> -
> - ret = driver_set_override(dev, &rpdev->driver_override, buf, count);
> - if (ret)
> - return ret;
> -
> - return count;
> -}
> -
> -static ssize_t driver_override_show(struct device *dev,
> - struct device_attribute *attr, char *buf)
> -{
> - struct rpmsg_device *rpdev = to_rpmsg_device(dev);
> - ssize_t len;
> -
> - device_lock(dev);
> - len = sysfs_emit(buf, "%s\n", rpdev->driver_override);
> - device_unlock(dev);
> - return len;
> -}
> -static DEVICE_ATTR_RW(driver_override);
> -
> static ssize_t modalias_show(struct device *dev,
> struct device_attribute *attr, char *buf)
> {
> @@ -405,7 +378,6 @@ static struct attribute *rpmsg_dev_attrs[] = {
> &dev_attr_dst.attr,
> &dev_attr_src.attr,
> &dev_attr_announce.attr,
> - &dev_attr_driver_override.attr,
> NULL,
> };
> ATTRIBUTE_GROUPS(rpmsg_dev);
> @@ -424,9 +396,11 @@ static int rpmsg_dev_match(struct device *dev, const struct device_driver *drv)
> const struct rpmsg_driver *rpdrv = to_rpmsg_driver(drv);
> const struct rpmsg_device_id *ids = rpdrv->id_table;
> unsigned int i;
> + int ret;
>
> - if (rpdev->driver_override)
> - return !strcmp(rpdev->driver_override, drv->name);
> + ret = device_match_driver_override(dev, drv);
> + if (ret >= 0)
> + return ret;
>
> if (ids)
> for (i = 0; ids[i].name[0]; i++)
> @@ -533,6 +507,7 @@ static void rpmsg_dev_remove(struct device *dev)
>
> static const struct bus_type rpmsg_bus = {
> .name = "rpmsg",
> + .driver_override = true,
> .match = rpmsg_dev_match,
> .dev_groups = rpmsg_dev_groups,
> .uevent = rpmsg_uevent,
> @@ -560,9 +535,7 @@ int rpmsg_register_device_override(struct rpmsg_device *rpdev,
>
> device_initialize(dev);
> if (driver_override) {
> - ret = driver_set_override(dev, &rpdev->driver_override,
> - driver_override,
> - strlen(driver_override));
> + ret = device_set_driver_override(dev, driver_override);
> if (ret) {
> dev_err(dev, "device_set_override failed: %d\n", ret);
> put_device(dev);
> @@ -573,8 +546,6 @@ int rpmsg_register_device_override(struct rpmsg_device *rpdev,
> ret = device_add(dev);
> if (ret) {
> dev_err(dev, "device_add failed: %d\n", ret);
> - kfree(rpdev->driver_override);
> - rpdev->driver_override = NULL;
> put_device(dev);
> }
>
> diff --git a/drivers/rpmsg/qcom_glink_native.c b/drivers/rpmsg/qcom_glink_native.c
> index 401a4ece0c97..d9d4468e4cbd 100644
> --- a/drivers/rpmsg/qcom_glink_native.c
> +++ b/drivers/rpmsg/qcom_glink_native.c
> @@ -1626,7 +1626,6 @@ static void qcom_glink_rpdev_release(struct device *dev)
> {
> struct rpmsg_device *rpdev = to_rpmsg_device(dev);
>
> - kfree(rpdev->driver_override);
> kfree(rpdev);
> }
>
> @@ -1862,7 +1861,6 @@ static void qcom_glink_device_release(struct device *dev)
>
> /* Release qcom_glink_alloc_channel() reference */
> kref_put(&channel->refcount, qcom_glink_channel_release);
> - kfree(rpdev->driver_override);
> kfree(rpdev);
> }
>
> diff --git a/drivers/rpmsg/virtio_rpmsg_bus.c b/drivers/rpmsg/virtio_rpmsg_bus.c
> index 5ae15111fb4f..1b8bb05924af 100644
> --- a/drivers/rpmsg/virtio_rpmsg_bus.c
> +++ b/drivers/rpmsg/virtio_rpmsg_bus.c
> @@ -374,7 +374,6 @@ static void virtio_rpmsg_release_device(struct device *dev)
> struct rpmsg_device *rpdev = to_rpmsg_device(dev);
> struct virtio_rpmsg_channel *vch = to_virtio_rpmsg_channel(rpdev);
>
> - kfree(rpdev->driver_override);
> kfree(vch);
> }
>
> diff --git a/include/linux/rpmsg.h b/include/linux/rpmsg.h
> index 83266ce14642..2e40eb54155e 100644
> --- a/include/linux/rpmsg.h
> +++ b/include/linux/rpmsg.h
> @@ -41,9 +41,6 @@ struct rpmsg_channel_info {
> * rpmsg_device - device that belong to the rpmsg bus
> * @dev: the device struct
> * @id: device id (used to match between rpmsg drivers and devices)
> - * @driver_override: driver name to force a match; do not set directly,
> - * because core frees it; use driver_set_override() to
> - * set or clear it.
> * @src: local address
> * @dst: destination address
> * @ept: the rpmsg endpoint of this channel
> @@ -53,7 +50,6 @@ struct rpmsg_channel_info {
> struct rpmsg_device {
> struct device dev;
> struct rpmsg_device_id id;
> - const char *driver_override;
> u32 src;
> u32 dst;
> struct rpmsg_endpoint *ept;
> --
> 2.34.1
^ permalink raw reply
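The match-path change in the rpmsg patch above depends on a three-way return value: the `if (ret >= 0) return ret;` check in the hunk implies that a negative result means "no override set, fall back to the ID table". Below is a minimal user-space sketch of that contract. It is not the driver-core implementation: `model_device`, `model_match_override()` and `model_bus_match()` are hypothetical names, and a pthread mutex stands in for whatever locking the core uses internally.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a device whose override string is core-managed. */
struct model_device {
	pthread_mutex_t lock;   /* protects 'override' against concurrent updates */
	const char *override;   /* NULL when no override is set */
};

/*
 * Assumed tri-state contract, inferred from the "ret >= 0" check in the patch:
 *   < 0  no override is set; the caller falls back to its ID table
 *     0  an override is set but names a different driver
 *     1  an override is set and names this driver
 * The string is only read while the lock is held.
 */
static int model_match_override(struct model_device *dev, const char *drv_name)
{
	int ret = -1;

	assert(drv_name != NULL);
	pthread_mutex_lock(&dev->lock);
	if (dev->override)
		ret = strcmp(dev->override, drv_name) == 0;
	pthread_mutex_unlock(&dev->lock);
	return ret;
}

/* Shape of a bus match callback: the override wins, else scan the ID table. */
static int model_bus_match(struct model_device *dev, const char *drv_name,
			   const char *const *ids, const char *dev_id)
{
	int ret = model_match_override(dev, drv_name);

	if (ret >= 0)
		return ret;
	for (size_t i = 0; ids && ids[i]; i++)
		if (strcmp(ids[i], dev_id) == 0)
			return 1;
	return 0;
}
```

The point of the sketch is that the comparison happens under the same lock that guards replacement of the string, so the reader never sees a freed pointer. That is the property the unlocked `strcmp()` on `rpdev->driver_override` lacked.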
* Re: [GIT PULL] Hyper-V fixes for v7.1-rc8
From: pr-tracker-bot @ 2026-06-08 15:03 UTC (permalink / raw)
To: Wei Liu
Cc: Linus Torvalds, Wei Liu, Linux on Hyper-V List, Linux Kernel List,
kys, haiyangz, decui, longli
In-Reply-To: <20260608053408.GA1541576@liuwe-devbox-debian-v2.local>
The pull request you sent on Sun, 7 Jun 2026 22:34:08 -0700:
> ssh://git@gitolite.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git tags/hyperv-fixes-signed-20260607
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/e92a7628772ba49f3cdc1d141cd2b0b5d607bda2
Thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html
^ permalink raw reply
* [PATCH net v2 2/2] net: mana: guard TX wq object destroy with INVALID_MANA_HANDLE check
From: Aditya Garg @ 2026-06-08 10:13 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, kuba, pabeni, horms, shradhagupta, dipayanroy, ernis,
kees, shacharr, stephen, gargaditya, gargaditya, ssengar,
linux-hyperv, netdev, linux-kernel
In-Reply-To: <20260608101345.2267320-1-gargaditya@linux.microsoft.com>
mana_create_txq() has several error paths (after mana_alloc_queues() or
mana_create_wq_obj() failure) where tx_qp[i].tx_object stays as the
INVALID_MANA_HANDLE sentinel set at allocation. mana_destroy_txq() then
unconditionally calls mana_destroy_wq_obj() with (u64)-1, which firmware
rejects and logs an error.
Mirror the RX-side pattern in mana_destroy_rxq() and skip the destroy
when the handle is still INVALID_MANA_HANDLE.
Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com>
Reviewed-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
---
drivers/net/ethernet/microsoft/mana/mana_en.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index c9b1df1ed109..d7de4c4d25bb 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -2334,7 +2334,8 @@ static void mana_destroy_txq(struct mana_port_context *apc)
netif_napi_del_locked(napi);
apc->tx_qp[i].txq.napi_initialized = false;
}
- mana_destroy_wq_obj(apc, GDMA_SQ, apc->tx_qp[i].tx_object);
+ if (apc->tx_qp[i].tx_object != INVALID_MANA_HANDLE)
+ mana_destroy_wq_obj(apc, GDMA_SQ, apc->tx_qp[i].tx_object);
mana_deinit_cq(apc, &apc->tx_qp[i].tx_cq);
--
2.43.0
^ permalink raw reply related
* [PATCH net v2 1/2] net: mana: initialize gdma queue id to INVALID_QUEUE_ID
From: Aditya Garg @ 2026-06-08 10:13 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, kuba, pabeni, horms, shradhagupta, dipayanroy, ernis,
kees, shacharr, stephen, gargaditya, gargaditya, ssengar,
linux-hyperv, netdev, linux-kernel
In-Reply-To: <20260608101345.2267320-1-gargaditya@linux.microsoft.com>
mana_gd_create_mana_wq_cq() leaves queue->id as 0 (from kzalloc_obj())
until mana_create_wq_obj() assigns the firmware-returned id. If creation
fails before that, cleanup calls mana_gd_destroy_cq() with id 0, NULLing
gc->cq_table[0] and silently breaking whichever real CQ owns that slot.
Initialize queue->id to INVALID_QUEUE_ID right after allocation, matching
mana_gd_create_eq(). The existing (id >= max_num_cqs) guard then
short-circuits cleanly.
Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com>
Reviewed-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
---
drivers/net/ethernet/microsoft/mana/gdma_main.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index d8e816882f02..ac71ca8450bf 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -1192,6 +1192,8 @@ int mana_gd_create_mana_wq_cq(struct gdma_dev *gd,
if (!queue)
return -ENOMEM;
+ queue->id = INVALID_QUEUE_ID;
+
gmi = &queue->mem_info;
err = mana_gd_alloc_memory(gc, spec->queue_size, gmi);
if (err) {
--
2.43.0
^ permalink raw reply related
* [PATCH net v2 0/2] net: mana: fix error-path issues in queue setup
From: Aditya Garg @ 2026-06-08 10:13 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, kuba, pabeni, horms, shradhagupta, dipayanroy, ernis,
kees, shacharr, stephen, gargaditya, gargaditya, ssengar,
linux-hyperv, netdev, linux-kernel
Two error-path fixes in MANA queue setup, both surfaced during Sashiko
AI review of a recently upstreamed patch series.
Patch 1 initializes queue->id to INVALID_QUEUE_ID in
mana_gd_create_mana_wq_cq() so that a CQ creation failure before the
firmware id is assigned does not NULL gc->cq_table[0] and silently
break whichever real CQ owns that slot. This mirrors the existing
pattern in mana_gd_create_eq().
Patch 2 guards mana_destroy_txq()'s call to mana_destroy_wq_obj() with
an INVALID_MANA_HANDLE check, mirroring mana_destroy_rxq(). Without
it, TX setup failures lead to a firmware-rejected destroy of (u64)-1
and a spurious error in dmesg.
Changes in v2:
- Rebased onto net.
Aditya Garg (2):
net: mana: initialize gdma queue id to INVALID_QUEUE_ID
net: mana: guard TX wq object destroy with INVALID_MANA_HANDLE check
drivers/net/ethernet/microsoft/mana/gdma_main.c | 2 ++
drivers/net/ethernet/microsoft/mana/mana_en.c | 3 ++-
2 files changed, 4 insertions(+), 1 deletion(-)
--
2.43.0
^ permalink raw reply
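The failure mode described for patch 1 of the MANA series above (a zero-initialized id making cleanup clear slot 0 of the CQ table) can be illustrated with a small user-space model. This is a sketch under stated assumptions, not the MANA code: `MODEL_INVALID_ID`, `MODEL_MAX_CQS`, `model_cq_table` and the helper functions are all hypothetical stand-ins for the driver's sentinel, table size, table and destroy path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_INVALID_ID  UINT32_MAX  /* hypothetical sentinel, never a real slot */
#define MODEL_MAX_CQS     4           /* hypothetical table size */

struct model_queue {
	uint32_t id;
};

/* Table mapping an assigned id to the queue that owns it. */
static struct model_queue *model_cq_table[MODEL_MAX_CQS];

/* Mark the queue as "no id assigned yet" right after allocation. */
static void model_queue_init(struct model_queue *q)
{
	assert(q != NULL);
	q->id = MODEL_INVALID_ID;
}

/*
 * Destroy path with a range guard: an id outside the table (which includes
 * the sentinel) is skipped. Returns 1 if a slot was cleared, 0 if skipped.
 */
static int model_destroy_cq(struct model_queue *q)
{
	assert(q != NULL);
	if (q->id >= MODEL_MAX_CQS)
		return 0;
	model_cq_table[q->id] = NULL;
	return 1;
}
```

With the sentinel in place, tearing down a queue whose creation failed early leaves the table untouched. Without it, the same teardown would see id 0 and clear a slot owned by an unrelated live queue.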
* [PATCH 3/4] x86/msr: Switch wrmsrl() users to wrmsrq()
From: Juergen Gross @ 2026-06-08 8:28 UTC (permalink / raw)
To: linux-kernel, x86, linux-perf-users, kvm, linux-coco,
linux-hyperv, linux-pm
Cc: Juergen Gross, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
James Clark, Thomas Gleixner, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Tony Luck, Reinette Chatre, Dave Martin,
James Morse, Babu Moger, Sean Christopherson, Paolo Bonzini,
Kiryl Shutsemau, Rick Edgecombe, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, Long Li, Rafael J. Wysocki, Artem Bityutskiy,
Artem Bityutskiy, Len Brown
In-Reply-To: <20260608082809.3492719-1-jgross@suse.com>
wrmsrl() is a deprecated synonym for wrmsrq(). Switch its users to
wrmsrq().
Signed-off-by: Juergen Gross <jgross@suse.com>
---
arch/x86/events/amd/uncore.c | 2 +-
arch/x86/events/intel/core.c | 4 ++--
arch/x86/kernel/cpu/resctrl/monitor.c | 2 +-
arch/x86/kernel/process_64.c | 2 +-
arch/x86/kvm/pmu.c | 6 +++---
arch/x86/kvm/vmx/tdx.c | 6 +++---
drivers/hv/mshv_vtl_main.c | 2 +-
drivers/idle/intel_idle.c | 2 +-
8 files changed, 13 insertions(+), 13 deletions(-)
diff --git a/arch/x86/events/amd/uncore.c b/arch/x86/events/amd/uncore.c
index 98ef4bf9911a..7dc6af4231cc 100644
--- a/arch/x86/events/amd/uncore.c
+++ b/arch/x86/events/amd/uncore.c
@@ -975,7 +975,7 @@ static void amd_uncore_umc_read(struct perf_event *event)
* that the counter never gets a chance to saturate.
*/
if (new & BIT_ULL(63 - COUNTER_SHIFT)) {
- wrmsrl(hwc->event_base, 0);
+ wrmsrq(hwc->event_base, 0);
local64_set(&hwc->prev_count, 0);
} else {
local64_set(&hwc->prev_count, new);
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index dd1e3aa75ee9..e9baa64dc962 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3166,12 +3166,12 @@ static void intel_pmu_config_acr(int idx, u64 mask, u32 reload)
}
if (cpuc->acr_cfg_b[idx] != mask) {
- wrmsrl(msr_b + msr_offset, mask);
+ wrmsrq(msr_b + msr_offset, mask);
cpuc->acr_cfg_b[idx] = mask;
}
/* Only need to update the reload value when there is a valid config value. */
if (mask && cpuc->acr_cfg_c[idx] != reload) {
- wrmsrl(msr_c + msr_offset, reload);
+ wrmsrq(msr_c + msr_offset, reload);
cpuc->acr_cfg_c[idx] = reload;
}
}
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index c5ed0bc1f831..e4918c32a822 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -532,7 +532,7 @@ static void resctrl_abmc_config_one_amd(void *info)
{
union l3_qos_abmc_cfg *abmc_cfg = info;
- wrmsrl(MSR_IA32_L3_QOS_ABMC_CFG, abmc_cfg->full);
+ wrmsrq(MSR_IA32_L3_QOS_ABMC_CFG, abmc_cfg->full);
}
/*
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index b85e715ebb30..d44afbe005bb 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -708,7 +708,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
/* Reset hw history on AMD CPUs */
if (cpu_feature_enabled(X86_FEATURE_AMD_WORKLOAD_CLASS))
- wrmsrl(MSR_AMD_WORKLOAD_HRST, 0x1);
+ wrmsrq(MSR_AMD_WORKLOAD_HRST, 0x1);
return prev_p;
}
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index e218352e3423..aee70e5dc15d 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -1313,14 +1313,14 @@ static void kvm_pmu_load_guest_pmcs(struct kvm_vcpu *vcpu)
pmc = &pmu->gp_counters[i];
if (pmc->counter != rdpmc(i))
- wrmsrl(gp_counter_msr(i), pmc->counter);
- wrmsrl(gp_eventsel_msr(i), pmc->eventsel_hw);
+ wrmsrq(gp_counter_msr(i), pmc->counter);
+ wrmsrq(gp_eventsel_msr(i), pmc->eventsel_hw);
}
for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
pmc = &pmu->fixed_counters[i];
if (pmc->counter != rdpmc(INTEL_PMC_FIXED_RDPMC_BASE | i))
- wrmsrl(fixed_counter_msr(i), pmc->counter);
+ wrmsrq(fixed_counter_msr(i), pmc->counter);
}
}
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 04ce321ebdf3..cb50e23c39ca 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -823,7 +823,7 @@ static void tdx_prepare_switch_to_host(struct kvm_vcpu *vcpu)
return;
++vcpu->stat.host_state_reload;
- wrmsrl(MSR_KERNEL_GS_BASE, vt->msr_host_kernel_gs_base);
+ wrmsrq(MSR_KERNEL_GS_BASE, vt->msr_host_kernel_gs_base);
vt->guest_state_loaded = false;
}
@@ -1048,10 +1048,10 @@ static void tdx_load_host_xsave_state(struct kvm_vcpu *vcpu)
/*
* Likewise, even if a TDX hosts didn't support XSS both arms of
- * the comparison would be 0 and the wrmsrl would be skipped.
+ * the comparison would be 0 and the wrmsrq would be skipped.
*/
if (kvm_host.xss != (kvm_tdx->xfam & kvm_caps.supported_xss))
- wrmsrl(MSR_IA32_XSS, kvm_host.xss);
+ wrmsrq(MSR_IA32_XSS, kvm_host.xss);
}
#define TDX_DEBUGCTL_PRESERVED (DEBUGCTLMSR_BTF | \
diff --git a/drivers/hv/mshv_vtl_main.c b/drivers/hv/mshv_vtl_main.c
index f5d27f28d6ad..0d3d4161974f 100644
--- a/drivers/hv/mshv_vtl_main.c
+++ b/drivers/hv/mshv_vtl_main.c
@@ -596,7 +596,7 @@ static int mshv_vtl_get_set_reg(struct hv_register_assoc *regs, bool set)
} else {
/* Handle MSRs */
if (set)
- wrmsrl(reg_table[i].msr_addr, *reg64);
+ wrmsrq(reg_table[i].msr_addr, *reg64);
else
rdmsrq(reg_table[i].msr_addr, *reg64);
}
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 15c698291b32..67d5993c7387 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -2379,7 +2379,7 @@ static void intel_c1_demotion_toggle(void *enable)
msr_val |= NHM_C1_AUTO_DEMOTE | SNB_C1_AUTO_UNDEMOTE;
else
msr_val &= ~(NHM_C1_AUTO_DEMOTE | SNB_C1_AUTO_UNDEMOTE);
- wrmsrl(MSR_PKG_CST_CONFIG_CONTROL, msr_val);
+ wrmsrq(MSR_PKG_CST_CONFIG_CONTROL, msr_val);
}
static ssize_t intel_c1_demotion_store(struct device *dev,
--
2.54.0
^ permalink raw reply related
* [PATCH 1/4] x86/msr: Switch rdmsrl() users to rdmsrq()
From: Juergen Gross @ 2026-06-08 8:28 UTC (permalink / raw)
To: linux-kernel, x86, linux-perf-users, linux-hyperv, linux-pm
Cc: Juergen Gross, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
James Clark, Thomas Gleixner, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Tony Luck, Reinette Chatre, Dave Martin,
James Morse, Babu Moger, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Rafael J. Wysocki, Artem Bityutskiy,
Artem Bityutskiy, Len Brown
In-Reply-To: <20260608082809.3492719-1-jgross@suse.com>
rdmsrl() is a deprecated synonym for rdmsrq(). Switch its users to
rdmsrq().
Signed-off-by: Juergen Gross <jgross@suse.com>
---
arch/x86/events/amd/uncore.c | 2 +-
arch/x86/kernel/cpu/resctrl/monitor.c | 2 +-
drivers/hv/mshv_vtl_main.c | 2 +-
drivers/idle/intel_idle.c | 4 ++--
4 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/arch/x86/events/amd/uncore.c b/arch/x86/events/amd/uncore.c
index dd956cfcadef..98ef4bf9911a 100644
--- a/arch/x86/events/amd/uncore.c
+++ b/arch/x86/events/amd/uncore.c
@@ -966,7 +966,7 @@ static void amd_uncore_umc_read(struct perf_event *event)
* UMC counters do not have RDPMC assignments. Read counts directly
* from the corresponding PERF_CTR.
*/
- rdmsrl(hwc->event_base, new);
+ rdmsrq(hwc->event_base, new);
/*
* Unlike the other uncore counters, UMC counters saturate and set the
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 59215fef3924..c5ed0bc1f831 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -301,7 +301,7 @@ static int __cntr_id_read(u32 cntr_id, u64 *val)
* is set if the counter data is unavailable.
*/
wrmsr(MSR_IA32_QM_EVTSEL, ABMC_EXTENDED_EVT_ID | ABMC_EVT_ID, cntr_id);
- rdmsrl(MSR_IA32_QM_CTR, msr_val);
+ rdmsrq(MSR_IA32_QM_CTR, msr_val);
if (msr_val & RMID_VAL_ERROR)
return -EIO;
diff --git a/drivers/hv/mshv_vtl_main.c b/drivers/hv/mshv_vtl_main.c
index c19400701467..f5d27f28d6ad 100644
--- a/drivers/hv/mshv_vtl_main.c
+++ b/drivers/hv/mshv_vtl_main.c
@@ -598,7 +598,7 @@ static int mshv_vtl_get_set_reg(struct hv_register_assoc *regs, bool set)
if (set)
wrmsrl(reg_table[i].msr_addr, *reg64);
else
- rdmsrl(reg_table[i].msr_addr, *reg64);
+ rdmsrq(reg_table[i].msr_addr, *reg64);
}
return 0;
}
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index f49354e37777..15c698291b32 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -2370,7 +2370,7 @@ static void intel_c1_demotion_toggle(void *enable)
{
unsigned long long msr_val;
- rdmsrl(MSR_PKG_CST_CONFIG_CONTROL, msr_val);
+ rdmsrq(MSR_PKG_CST_CONFIG_CONTROL, msr_val);
/*
* Enable/disable C1 undemotion along with C1 demotion, as this is the
* most sensible configuration in general.
@@ -2410,7 +2410,7 @@ static ssize_t intel_c1_demotion_show(struct device *dev,
* Read the MSR value for a CPU and assume it is the same for all CPUs. Any other
* configuration would be a BIOS bug.
*/
- rdmsrl(MSR_PKG_CST_CONFIG_CONTROL, msr_val);
+ rdmsrq(MSR_PKG_CST_CONFIG_CONTROL, msr_val);
return sysfs_emit(buf, "%d\n", !!(msr_val & NHM_C1_AUTO_DEMOTE));
}
static DEVICE_ATTR_RW(intel_c1_demotion);
--
2.54.0
^ permalink raw reply related
* [PATCH 0/4] x86/msr: Get rid of rdmsrl() and wrmsrl()
From: Juergen Gross @ 2026-06-08 8:28 UTC (permalink / raw)
To: linux-kernel, x86, linux-perf-users, linux-hyperv, linux-pm, kvm,
linux-coco
Cc: Juergen Gross, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
James Clark, Thomas Gleixner, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Tony Luck, Reinette Chatre, Dave Martin,
James Morse, Babu Moger, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Rafael J. Wysocki, Artem Bityutskiy,
Artem Bityutskiy, Len Brown, Sean Christopherson, Paolo Bonzini,
Kiryl Shutsemau, Rick Edgecombe
rdmsrl() and wrmsrl() are deprecated aliases of rdmsrq() and wrmsrq().
Switch all users and remove the deprecated variants.
Juergen Gross (4):
x86/msr: Switch rdmsrl() users to rdmsrq()
x86/msr: Remove rdmsrl()
x86/msr: Switch wrmsrl() users to wrmsrq()
x86/msr: Remove wrmsrl()
arch/x86/events/amd/uncore.c | 4 ++--
arch/x86/events/intel/core.c | 4 ++--
arch/x86/include/asm/msr.h | 5 -----
arch/x86/kernel/cpu/resctrl/monitor.c | 4 ++--
arch/x86/kernel/process_64.c | 2 +-
arch/x86/kvm/pmu.c | 6 +++---
arch/x86/kvm/vmx/tdx.c | 6 +++---
drivers/hv/mshv_vtl_main.c | 4 ++--
drivers/idle/intel_idle.c | 6 +++---
9 files changed, 18 insertions(+), 23 deletions(-)
--
2.54.0
^ permalink raw reply
* Re: [PATCH] Drivers: hv: mshv: add bounds check on vp_index in mshv_intercept_isr()
From: Wei Liu @ 2026-06-08 6:23 UTC (permalink / raw)
To: Junrui Luo
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Jinank Jain, Praveen K Paladugu, Mukesh Rathor, Nuno Das Neves,
Anirudh Rayabharam, Roman Kisel, Muminul Islam, linux-hyperv,
linux-kernel, Stanislav Kinsburskii, Yuhao Jiang
In-Reply-To: <SYBPR01MB7881B8B5D35E02A0E8404E4FAF232@SYBPR01MB7881.ausprd01.prod.outlook.com>
On Thu, Apr 16, 2026 at 10:18:05PM +0800, Junrui Luo wrote:
> mshv_intercept_isr() extracts vp_index from the hypervisor message
> payload and uses it directly to index into pt_vp_array without
> validation. handle_bitset_message() and handle_pair_message() already
> validate vp_index against MSHV_MAX_VPS before array access.
>
> A vp_index equal to or exceeding MSHV_MAX_VPS leads to an out-of-bounds
> read from pt_vp_array.
>
> Add the same MSHV_MAX_VPS bounds check for consistency with the other
> message handlers.
>
> Fixes: 621191d709b1 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Reported-by: Yuhao Jiang <danisjiang@gmail.com>
> Signed-off-by: Junrui Luo <moonafterrain@outlook.com>
As noted elsewhere, the hypervisor shouldn't give us an out-of-bounds
index. It has many different ways to screw with the root kernel, so I'm
not overly concerned about this.
That said, having a bit more consistency and defensive programming
doesn't hurt. I have applied this patch. Thanks.
Wei
> ---
> drivers/hv/mshv_synic.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> index 43f1bcbbf2d3..5bceb8122981 100644
> --- a/drivers/hv/mshv_synic.c
> +++ b/drivers/hv/mshv_synic.c
> @@ -384,6 +384,10 @@ mshv_intercept_isr(struct hv_message *msg)
> */
> vp_index =
> ((struct hv_opaque_intercept_message *)msg->u.payload)->vp_index;
> + if (unlikely(vp_index >= MSHV_MAX_VPS)) {
> + pr_debug("VP index %u out of bounds\n", vp_index);
> + goto unlock_out;
> + }
> vp = partition->pt_vp_array[vp_index];
> if (unlikely(!vp)) {
> pr_debug("failed to find VP %u\n", vp_index);
>
> ---
> base-commit: 7aaa8047eafd0bd628065b15757d9b48c5f9c07d
> change-id: 20260416-fixes-693196e52f93
>
> Best regards,
> --
> Junrui Luo <moonafterrain@outlook.com>
>
^ permalink raw reply
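The defensive check applied in the patch above follows a common pattern: validate an index taken from an external message before using it to index a fixed-size array. Below is a minimal user-space sketch of that pattern. The names `MODEL_MAX_VPS`, `model_partition` and `model_lookup_vp()` are hypothetical, and the array size is chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_VPS 8  /* hypothetical array capacity */

struct model_vp {
	int unused;
};

struct model_partition {
	struct model_vp *vp_array[MODEL_MAX_VPS];
};

/*
 * Look up a VP by an index supplied by an external message.
 * Returns NULL both for an out-of-range index and for an empty slot,
 * so the caller needs only one failure branch.
 */
static struct model_vp *model_lookup_vp(const struct model_partition *pt,
					uint32_t vp_index)
{
	assert(pt != NULL);
	if (vp_index >= MODEL_MAX_VPS)
		return NULL;
	return pt->vp_array[vp_index];
}
```

Note that the comparison is `>=`, not `>`: an index equal to the capacity is already one past the last valid slot.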
* Re: [PATCH] x86/hyperv: Cosmetic changes in irqdomain.c for readability
From: Wei Liu @ 2026-06-08 6:00 UTC (permalink / raw)
To: Mukesh R; +Cc: linux-hyperv, linux-kernel, wei.liu
In-Reply-To: <20260601225116.956392-1-mrathor@linux.microsoft.com>
On Mon, Jun 01, 2026 at 03:51:16PM -0700, Mukesh R wrote:
> Make cosmetic changes:
> o Rename struct pci_dev *dev to *pdev since there are cases of
> struct device *dev in the file and all over the kernel
> o Rename hv_build_pci_dev_id to hv_build_devid_type_pci in anticipation
> of building different types of device ids
> o Fix checkpatch.pl issues with return and extraneous printk
> o Replace spaces with tabs
> o Rename struct hv_devid *xxx to struct hv_devid *hv_devid given code
> paths involve many types of device ids
> o Fix indentation in a large if block by using goto.
>
> There are no functional changes.
>
> Reviewed-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
> Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
Applied to hyperv-next.
Wei
^ permalink raw reply
* [GIT PULL] Hyper-V fixes for v7.1-rc8
From: Wei Liu @ 2026-06-08 5:34 UTC (permalink / raw)
To: Linus Torvalds
Cc: Wei Liu, Linux on Hyper-V List, Linux Kernel List, kys, haiyangz,
decui, longli
Hi Linus,
The following changes since commit 254f49634ee16a731174d2ae34bc50bd5f45e731:
Linux 7.1-rc1 (2026-04-26 14:19:00 -0700)
are available in the Git repository at:
ssh://git@gitolite.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git tags/hyperv-fixes-signed-20260607
for you to fetch changes up to 98e0fc32e53dd62cd38a0d67eaf5846ae20078cc:
mshv: support 1G hugepages by passing them as 2M-aligned chunks (2026-05-27 15:30:15 -0700)
----------------------------------------------------------------
hyperv-fixes for v7.1-rc8
- MSHV driver fixes from various people (Anirudh Rayabharam, Can Peng,
Dexuan Cui, Michael Kelley, Jork Loeser, Wei Liu)
- Hyper-V user space tools fixes (Thorsten Blum)
- Allow VMBus to be unloaded after frame buffer is flushed (Michael
Kelley)
----------------------------------------------------------------
Anirudh Rayabharam (Microsoft) (1):
mshv: support 1G hugepages by passing them as 2M-aligned chunks
Can Peng (1):
mshv: use kmalloc_array in mshv_root_scheduler_init
Dexuan Cui (2):
hyperv: Clean up and fix the guest ID comment in hvgdk.h
Drivers: hv: vmbus: Improve the logic of reserving fb_mmio on Gen2 VMs
Jork Loeser (3):
mshv: limit SynIC management to MSHV-owned resources
mshv: clean up SynIC state on kexec for L1VH
mshv: unmap debugfs stats pages on kexec
Michael Kelley (3):
Drivers: hv: vmbus: Provide option to skip VMBus unload on panic
drm/hyperv: During panic do VMBus unload after frame buffer is flushed
mshv: Add conditional VMBus dependency
Thorsten Blum (2):
hv: utils: handle and propagate errors in kvp_register
hv: utils: replace deprecated strcpy with strscpy in kvp_register
Wei Liu (1):
mshv: add a missing padding field
drivers/gpu/drm/hyperv/hyperv_drm_drv.c | 5 +
drivers/gpu/drm/hyperv/hyperv_drm_modeset.c | 15 +--
drivers/hv/Kconfig | 1 +
drivers/hv/channel_mgmt.c | 1 +
drivers/hv/hv.c | 3 +
drivers/hv/hv_kvp.c | 27 +++--
drivers/hv/hyperv_vmbus.h | 1 -
drivers/hv/mshv_debugfs.c | 7 +-
drivers/hv/mshv_regions.c | 29 +++---
drivers/hv/mshv_root_main.c | 2 +-
drivers/hv/mshv_synic.c | 156 ++++++++++++++++++----------
drivers/hv/vmbus_drv.c | 54 ++++++++--
include/hyperv/hvgdk.h | 10 +-
include/hyperv/hvhdk.h | 1 +
include/linux/hyperv.h | 7 ++
15 files changed, 207 insertions(+), 112 deletions(-)
^ permalink raw reply
* Re: [PATCH net 0/2] net: mana: fix error-path issues in queue setup
From: Aditya Garg @ 2026-06-08 4:43 UTC (permalink / raw)
To: Jakub Kicinski
Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, pabeni, horms, shradhagupta, dipayanroy, ernis, kees,
shacharr, stephen, gargaditya, ssengar, linux-hyperv, netdev,
linux-kernel
In-Reply-To: <20260605182748.5f106575@kernel.org>
On 06-06-2026 06:57, Jakub Kicinski wrote:
> On Thu, 4 Jun 2026 01:01:24 -0700 Aditya Garg wrote:
>> Two error-path fixes in MANA queue setup, both surfaced during Sashiko
>> AI review of a recently upstreamed patch series.
>>
>> Patch 1 initializes queue->id to INVALID_QUEUE_ID in
>> mana_gd_create_mana_wq_cq() so that a CQ creation failure before the
>> firmware id is assigned does not NULL gc->cq_table[0] and silently
>> break whichever real CQ owns that slot. This mirrors the existing
>> pattern in mana_gd_create_eq().
>>
>> Patch 2 guards mana_destroy_txq()'s call to mana_destroy_wq_obj() with
>> an INVALID_MANA_HANDLE check, mirroring mana_destroy_rxq(). Without
>> it, TX setup failures lead to a firmware-rejected destroy of (u64)-1
>> and a spurious error in dmesg.
>
> Looks like these patches were generated against net-next, please rebase:
>
> Applying: net: mana: initialize gdma queue id to INVALID_QUEUE_ID
> Applying: net: mana: guard TX wq object destroy with INVALID_MANA_HANDLE check
> error: patch failed: drivers/net/ethernet/microsoft/mana/mana_en.c:2351
> error: drivers/net/ethernet/microsoft/mana/mana_en.c: patch does not apply
> Patch failed at 0002 net: mana: guard TX wq object destroy with INVALID_MANA_HANDLE check
Thanks Jakub for pointing it out.
I'll rebase onto net and post a v2.
Regards,
Aditya
^ permalink raw reply