Linux-HyperV List

* Re: [PATCH v4 09/10] drm/damage-helper: Rename state parameters in damage helpers
From: Javier Martinez Canillas @ 2026-06-01 10:29 UTC
  To: Thomas Zimmermann, mripard, maarten.lankhorst, airlied, airlied,
	simona, admin, gargaditya08, paul, jani.nikula, mhklinux,
	zack.rusin, bcm-kernel-feedback-list
  Cc: dri-devel, linux-hyperv, intel-gfx, intel-xe, linux-mips,
	virtualization, Thomas Zimmermann
In-Reply-To: <20260530185716.65688-10-tzimmermann@suse.de>

Thomas Zimmermann <tzimmermann@suse.de> writes:

> Rename some of the state parameters of the damage-helper functions to
> align them with each other and other helpers. No functional changes.
>
> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
> Acked-by: Zack Rusin <zack.rusin@broadcom.com>
> ---

Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>

-- 
Best regards,

Javier Martinez Canillas
Core Platforms
Red Hat



* Re: [PATCH v4 10/10] drm/vmwgfx: Remove unused field struct vmwgfx_du_update_plane.old_state
From: Javier Martinez Canillas @ 2026-06-01 10:30 UTC
  To: Thomas Zimmermann, mripard, maarten.lankhorst, airlied, airlied,
	simona, admin, gargaditya08, paul, jani.nikula, mhklinux,
	zack.rusin, bcm-kernel-feedback-list
  Cc: dri-devel, linux-hyperv, intel-gfx, intel-xe, linux-mips,
	virtualization, Thomas Zimmermann
In-Reply-To: <20260530185716.65688-11-tzimmermann@suse.de>

Thomas Zimmermann <tzimmermann@suse.de> writes:

> Plane updates no longer require the old plane state. Remove the field
> from struct vmwgfx_du_update_plane and fix all callers.
>
> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
> Reviewed-by: Zack Rusin <zack.rusin@broadcom.com>
> ---

Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>

-- 
Best regards,

Javier Martinez Canillas
Core Platforms
Red Hat



* Re: [PATCH v4 07/10] drm/damage-helper: Remove old state from drm_atomic_helper_damage_iter_init()
From: Hamza Mahfooz @ 2026-06-01 14:01 UTC
  To: Thomas Zimmermann
  Cc: mripard, maarten.lankhorst, airlied, airlied, simona, admin,
	gargaditya08, paul, jani.nikula, mhklinux, zack.rusin,
	bcm-kernel-feedback-list, dri-devel, linux-hyperv, intel-gfx,
	intel-xe, linux-mips, virtualization
In-Reply-To: <20260530185716.65688-8-tzimmermann@suse.de>

On Sat, May 30, 2026 at 08:53:20PM +0200, Thomas Zimmermann wrote:
> Nothing in drm_atomic_helper_damage_iter_init() requires the old
> plane state. Remove the parameter and mass-convert callers.
> 
> Most callers now no longer require the old plane state in their plane's
> atomic_update helper. Remove it as well.
> 
> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
> Acked-by: Zack Rusin <zack.rusin@broadcom.com>
> ---

Acked-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com> # hyperv


* Re: [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions
From: Pasha Tatashin @ 2026-06-01 15:00 UTC
  To: Mike Rapoport
  Cc: Jork Loeser, linux-hyperv, linux-mm, kexec, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Pasha Tatashin,
	Pratyush Yadav, Alexander Graf, Jason Miu, Andrew Morton,
	David Hildenbrand, Muchun Song, Oscar Salvador, Baoquan He,
	Catalin Marinas, Will Deacon, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Kees Cook,
	Ran Xiaokai, Justinien Bouron, Sourabh Jain, Pingfan Liu,
	Rafael J. Wysocki, Mario Limonciello, linux-arm-kernel, x86,
	linux-kernel, Michael Kelley
In-Reply-To: <ahxrc4pTvVU20RTX@kernel.org>

On 05-31 20:10, Mike Rapoport wrote:
> Hi Jork,
> 
> Only had time to skim through the patches.
> I have a couple of high level questions for now.
> 
> On Wed, May 27, 2026 at 05:41:42PM -0700, Jork Loeser wrote:
> > When Linux runs as an L1 Virtual Host (L1VH) under Hyper-V, the MSHV
> > root partition driver deposits pages to the hypervisor and creates
> > partitions for guest VMs. Prior patches enabled kexec for L1VH, but
> > only when no partitions had been created and no memory had been donated.
> > 
> > This series lifts that limitation. It uses KHO (Kexec Handover) to:
> > 
> >  - Track all pages deposited to the hypervisor in a KHO radix tree
> >    and preserve them across kexec so the new kernel knows which pages
> >    are owned by the hypervisor.
> > 
> >  - Freeze running partitions before kexec, record their IDs in the
> >    KHO FDT, and vacuum (tear down + reclaim memory) stale partitions
> >    after kexec.
> > 
> >  - In case of a crash, exclude hypervisor-owned pages from crash
> >    dump collection by passing the radix tree root PA via Hyper-V
> >    crash MSR P2 to the crash kernel.
> > 
> > Dependency on Pratyush's KHO series
> > ===================================
> > 
> > Patches 1-12 are cherry-picked from Pratyush Yadav's v1 series
> > "kho: make boot time huge page allocation work nicely with KHO" [1],
> > which is still under discussion. This series uses functionality from
> > those patches -- specifically the meta-data page enumeration via table
> > callbacks and the restructured radix tree API. It also extends the
> > KHO radix tree with:
> > 
> >  - A freeze mechanism to lock the tree before serializing for kexec
> >    (patch 13).
> 
> There was a lot of effort to make KHO stateless and drop the requirement
> for finalization/freeze.

Yes, using KHO directly here is incorrect. The state machine is provided 
by LUO, so we should use LUO here. MSHV should provide a file that 
userspace adds to LUO, and all state machine management would be the 
same as for all other clients participating in LU.

> 
> Why is it necessary to add a freeze mechanism to kho_radix_tree?
> If it's a hard requirement of mshv maybe the freeze part should be handled
> there?
> 
> >  - A crash-kernel-safe variant that memremaps radix nodes for use
> >    outside the direct map (patch 14).
> > 
> > Patch overview
> > ==============
> > 
> > Patches 1-12:  KHO radix tree and memblock changes (from [1])
> > Patch 13:      Radix tree freeze and del_key() error reporting
> 
> del_key() error reporting sounds like something we'd want to avoid.
> del_key() is called on "freeing" path and during error handling, it would
> be hard if at all possible to deal with errors from del_key().
> 
> > Patch 14:      Crash-kernel-safe radix tree presence check
> > Patch 15:      Page tracker using KHO radix tree for deposited pages
> > Patch 16:      Debugfs interface for page tracker
> > Patches 17-18: Crash MSR reshuffling + crash dump page exclusion
> > Patch 19:      Export kexec_in_progress for modules
> 
> Isn't there another way to differentiate kexec reboot?
> 
> > Patch 20:      Freeze and vacuum partitions across kexec
> > 
> > Feedback
> > ========
> > 
> > This is an RFC. I am looking for feedback on the overall approach as
> > well as the KHO changes (patches 13-14).
> > 
> > [1] https://lore.kernel.org/linux-mm/20260429133928.850721-1-pratyush@kernel.org/
> > 
> > Based-on: linux-next/master (next-20260527)
> 
> -- 
> Sincerely yours,
> Mike.


* RE: [PATCH net-next] net: mana: Add Interrupt Moderation support
From: Haiyang Zhang @ 2026-06-01 16:19 UTC
  To: Jagielski, Jedrzej, Haiyang Zhang, linux-hyperv@vger.kernel.org,
	netdev@vger.kernel.org, KY Srinivasan, Wei Liu, Dexuan Cui,
	Long Li, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Konstantin Taranov, Simon Horman,
	Shradha Gupta, Erni Sri Satya Vennela, Dipayaan Roy, Aditya Garg,
	Kees Cook, Breno Leitao, linux-kernel@vger.kernel.org,
	linux-rdma@vger.kernel.org
  Cc: Paul Rosswurm
In-Reply-To: <PH0PR11MB590230F0CEBE8B1DEAA15F82F0152@PH0PR11MB5902.namprd11.prod.outlook.com>



> -----Original Message-----
> From: Jagielski, Jedrzej <jedrzej.jagielski@intel.com>
> Sent: Monday, June 1, 2026 5:39 AM
> To: Haiyang Zhang <haiyangz@linux.microsoft.com>;
>     linux-hyperv@vger.kernel.org; netdev@vger.kernel.org;
>     KY Srinivasan <kys@microsoft.com>;
>     Haiyang Zhang <haiyangz@microsoft.com>; Wei Liu <wei.liu@kernel.org>;
>     Dexuan Cui <DECUI@microsoft.com>; Long Li <longli@microsoft.com>;
>     Andrew Lunn <andrew+netdev@lunn.ch>;
>     David S. Miller <davem@davemloft.net>;
>     Eric Dumazet <edumazet@google.com>; Jakub Kicinski <kuba@kernel.org>;
>     Paolo Abeni <pabeni@redhat.com>;
>     Konstantin Taranov <kotaranov@microsoft.com>;
>     Simon Horman <horms@kernel.org>;
>     Shradha Gupta <shradhagupta@linux.microsoft.com>;
>     Erni Sri Satya Vennela <ernis@linux.microsoft.com>;
>     Dipayaan Roy <dipayanroy@linux.microsoft.com>;
>     Aditya Garg <gargaditya@linux.microsoft.com>;
>     Kees Cook <kees@kernel.org>; Breno Leitao <leitao@debian.org>;
>     linux-kernel@vger.kernel.org; linux-rdma@vger.kernel.org
> Cc: Paul Rosswurm <paulros@microsoft.com>
> Subject: [EXTERNAL] RE: [PATCH net-next] net: mana: Add Interrupt Moderation support
> 
> [Some people who received this message don't often get email from
> jedrzej.jagielski@intel.com. Learn why this is important at
> https://aka.ms/LearnAboutSenderIdentification ]
> 
> From: Haiyang Zhang <haiyangz@linux.microsoft.com>
> Sent: Saturday, May 30, 2026 9:50 PM
> 
> >From: Haiyang Zhang <haiyangz@microsoft.com>
> >
> >Add Static and Dynamic Interrupt Moderation (DIM) support for
> >Rx and Tx.
> >Update queue creation procedure with new data struct with the related
> >settings.
> >Add functions to collect stat for DIM, and workers to update DIM data
> >and settings.
> >Update ethtool handler to get/set the moderation settings from a user.
> >By default, adaptive-rx/tx (DIM) are enabled.
> >
> >Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
> >---
> > drivers/net/ethernet/microsoft/Kconfig        |   1 +
> > .../net/ethernet/microsoft/mana/gdma_main.c   |  27 ++++
> > drivers/net/ethernet/microsoft/mana/mana_en.c | 101 ++++++++++++++-
> > .../ethernet/microsoft/mana/mana_ethtool.c    | 120 +++++++++++++++++-
> > include/net/mana/gdma.h                       |  24 +++-
> > include/net/mana/mana.h                       |  42 ++++++
> > 6 files changed, 309 insertions(+), 6 deletions(-)
> >
> >diff --git a/drivers/net/ethernet/microsoft/Kconfig b/drivers/net/ethernet/microsoft/Kconfig
> >index 3f36ee6a8ece..e9be18c92ca5 100644
> >--- a/drivers/net/ethernet/microsoft/Kconfig
> >+++ b/drivers/net/ethernet/microsoft/Kconfig
> >@@ -21,6 +21,7 @@ config MICROSOFT_MANA
> >       depends on X86_64 || (ARM64 && !CPU_BIG_ENDIAN)
> >       depends on PCI_HYPERV
> >       select AUXILIARY_BUS
> >+      select DIMLIB
> >       select PAGE_POOL
> >       select NET_SHAPER
> >       help
> >diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> >index 712a0881d720..5aa0ea794a00 100644
> >--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> >+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> >@@ -405,6 +405,7 @@ static int mana_gd_disable_queue(struct gdma_queue *queue)
> > #define DOORBELL_OFFSET_RQ    0x400
> > #define DOORBELL_OFFSET_CQ    0x800
> > #define DOORBELL_OFFSET_EQ    0xFF8
> >+#define DOORBELL_OFFSET_DIM   0x820
> >
> > static void mana_gd_ring_doorbell(struct gdma_context *gc, u32 db_index,
> >                                 enum gdma_queue_type q_type, u32 qid,
> >@@ -445,6 +446,16 @@ static void mana_gd_ring_doorbell(struct gdma_context *gc, u32 db_index,
> >               addr += DOORBELL_OFFSET_SQ;
> >               break;
> >
> >+      case GDMA_DIM:
> >+              e.dim.id = qid;
> >+              e.dim.mod_usec = tail_ptr;
> >+              e.dim.mod_usec_vld = tail_ptr >> 15;
> >+              e.dim.mod_comps = tail_ptr >> 16;
> 
> please use defines instead of magic numbers
Will do.

> 
> >+              e.dim.mod_comps_vld = num_req;
> >+
> >+              addr += DOORBELL_OFFSET_DIM;
> >+              break;
> >+
> >       default:
> >               WARN_ON(1);
> >               return;
> >@@ -479,6 +490,22 @@ void mana_gd_ring_cq(struct gdma_queue *cq, u8 arm_bit)
> > }
> > EXPORT_SYMBOL_NS(mana_gd_ring_cq, "NET_MANA");
> >
> >+void mana_gd_ring_dim(struct gdma_queue *cq, u32 mod_usec, bool mod_usec_vld,
> >+                    u32 mod_comps, bool mod_comps_vld)
> >+{
> >+      struct gdma_context *gc = cq->gdma_dev->gdma_context;
> >+      u32 dim_val;
> >+
> >+      /* Convert the DIM values to doorbell parameters */
> >+      dim_val = (mod_usec & MANA_INTR_MODR_USEC_MAX) |
> >+                (((u32)mod_usec_vld & 1) << 15) |
> >+                ((mod_comps & MANA_INTR_MODR_COMP_MAX) << 16);
> 
> I believe FIELD_PREP is preferable in such cases
Will do.

> 
> >+
> >+      mana_gd_ring_doorbell(gc, cq->gdma_dev->doorbell, GDMA_DIM, cq->id,
> >+                            dim_val, (u8)mod_comps_vld & 1);
> >+}
> >+EXPORT_SYMBOL_NS(mana_gd_ring_dim, "NET_MANA");
> >+
> > #define MANA_SERVICE_PERIOD 10
> >
> > static void mana_serv_rescan(struct pci_dev *pdev)
> >diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> >index 82f1461a48e9..f1a16f8aca66 100644
> >--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> >+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> >@@ -1551,6 +1551,9 @@ int mana_create_wq_obj(struct mana_port_context *apc,
> >
> >       mana_gd_init_req_hdr(&req.hdr, MANA_CREATE_WQ_OBJ,
> >                            sizeof(req), sizeof(resp));
> >+
> >+      req.hdr.req.msg_version = GDMA_MESSAGE_V3;
> >+      req.hdr.resp.msg_version = GDMA_MESSAGE_V2;
> >       req.vport = vport;
> >       req.wq_type = wq_type;
> >       req.wq_gdma_region = wq_spec->gdma_region;
> >@@ -1559,6 +1562,9 @@ int mana_create_wq_obj(struct mana_port_context *apc,
> >       req.cq_size = cq_spec->queue_size;
> >       req.cq_moderation_ctx_id = cq_spec->modr_ctx_id;
> >       req.cq_parent_qid = cq_spec->attached_eq;
> >+      req.req_cq_moderation = cq_spec->req_cq_moderation;
> >+      req.cq_moderation_comp = cq_spec->cq_moderation_comp;
> >+      req.cq_moderation_usec = cq_spec->cq_moderation_usec;
> >
> >       err = mana_send_request(apc->ac, &req, sizeof(req), &resp,
> >                               sizeof(resp));
> >@@ -2253,6 +2259,66 @@ static void mana_poll_rx_cq(struct mana_cq *cq)
> >               xdp_do_flush();
> > }
> >
> >+static void mana_rx_dim_work(struct work_struct *work)
> >+{
> >+      struct dim *dim = container_of(work, struct dim, work);
> >+      struct mana_cq *cq = container_of(dim, struct mana_cq, dim);
> >+      struct dim_cq_moder cur_moder =
> >+              net_dim_get_rx_moderation(dim->mode, dim->profile_ix);
> 
> RCT (reverse Christmas tree ordering of declarations); here and in the following functions
Will update this and below.

> 
> >+
> >+      cur_moder.usec = min_t(u16, cur_moder.usec, MANA_INTR_MODR_USEC_MAX);
> >+      cur_moder.pkts = min_t(u16, cur_moder.pkts, MANA_INTR_MODR_COMP_MAX);
> >+
> >+      mana_gd_ring_dim(cq->gdma_cq, cur_moder.usec, true,
> >+                       cur_moder.pkts, true);
> >+
> >+      dim->state = DIM_START_MEASURE;
> >+}
> >+
> >+static void mana_tx_dim_work(struct work_struct *work)
> >+{
> >+      struct dim *dim = container_of(work, struct dim, work);
> >+      struct mana_cq *cq = container_of(dim, struct mana_cq, dim);
> >+      struct dim_cq_moder cur_moder =
> >+              net_dim_get_tx_moderation(dim->mode, dim->profile_ix);
> >+
> >+      cur_moder.usec = min_t(u16, cur_moder.usec, MANA_INTR_MODR_USEC_MAX);
> >+      cur_moder.pkts = min_t(u16, cur_moder.pkts, MANA_INTR_MODR_COMP_MAX);
> >+
> >+      mana_gd_ring_dim(cq->gdma_cq, cur_moder.usec, true,
> >+                       cur_moder.pkts, true);
> >+
> >+      dim->state = DIM_START_MEASURE;
> >+}
> >+
> >+static void mana_update_rx_dim(struct mana_cq *cq)
> >+{
> >+      struct mana_rxq *rxq = cq->rxq;
> >+      struct mana_port_context *apc = netdev_priv(rxq->ndev);
> >+      struct dim_sample dim_sample = {};
> >+
> >+      if (!apc->rx_dim_enabled)
> >+              return;
> >+
> >+      dim_update_sample(READ_ONCE(cq->dim_event_ctr), rxq->stats.packets,
> >+                        rxq->stats.bytes, &dim_sample);
> >+      net_dim(&cq->dim, &dim_sample);
> >+}
> >+
> >+static void mana_update_tx_dim(struct mana_cq *cq)
> >+{
> >+      struct mana_txq *txq = cq->txq;
> >+      struct mana_port_context *apc = netdev_priv(txq->ndev);
> >+      struct dim_sample dim_sample = {};
> >+
> >+      if (!apc->tx_dim_enabled)
> >+              return;
> >+
> >+      dim_update_sample(READ_ONCE(cq->dim_event_ctr), txq->stats.packets,
> >+                        txq->stats.bytes, &dim_sample);
> >+      net_dim(&cq->dim, &dim_sample);
> >+}
> >+
> > static int mana_cq_handler(void *context, struct gdma_queue *gdma_queue)
> > {
> >       struct mana_cq *cq = context;
> >@@ -2271,7 +2337,13 @@ static int mana_cq_handler(void *context, struct gdma_queue *gdma_queue)
> >       if (w < cq->budget) {
> >               mana_gd_ring_cq(gdma_queue, SET_ARM_BIT);
> >               cq->work_done_since_doorbell = 0;
> >-              napi_complete_done(&cq->napi, w);
> >+
> >+              if (napi_complete_done(&cq->napi, w)) {
> >+                      if (cq->type == MANA_CQ_TYPE_RX)
> >+                              mana_update_rx_dim(cq);
> >+                      else
> >+                              mana_update_tx_dim(cq);
> >+              }
> >       } else if (cq->work_done_since_doorbell >=
> >                  (cq->gdma_cq->queue_size / COMP_ENTRY_SIZE) * 4) {
> >               /* MANA hardware requires at least one doorbell ring every 8
> >@@ -2303,6 +2375,7 @@ static void mana_schedule_napi(void *context, struct gdma_queue *gdma_queue)
> > {
> >       struct mana_cq *cq = context;
> >
> >+      WRITE_ONCE(cq->dim_event_ctr, cq->dim_event_ctr + 1);
> >       napi_schedule_irqoff(&cq->napi);
> > }
> >
> >@@ -2345,6 +2418,7 @@ static void mana_destroy_txq(struct mana_port_context *apc)
> >               if (apc->tx_qp[i]->txq.napi_initialized) {
> >                       napi_synchronize(napi);
> >                       napi_disable_locked(napi);
> >+                      cancel_work_sync(&apc->tx_qp[i]->tx_cq.dim.work);
> >                       netif_napi_del_locked(napi);
> >                       apc->tx_qp[i]->txq.napi_initialized = false;
> >               }
> >@@ -2475,6 +2549,10 @@ static int mana_create_txq(struct mana_port_context *apc,
> >               cq_spec.queue_size = cq->gdma_cq->queue_size;
> >               cq_spec.modr_ctx_id = 0;
> >               cq_spec.attached_eq = cq->gdma_cq->cq.parent->id;
> >+              cq_spec.req_cq_moderation = apc->tx_dim_enabled ||
> >+                      (apc->intr_modr_tx_usec && apc->intr_modr_tx_comp);
> >+              cq_spec.cq_moderation_usec = apc->intr_modr_tx_usec;
> >+              cq_spec.cq_moderation_comp = apc->intr_modr_tx_comp;
> >
> >               err = mana_create_wq_obj(apc, apc->port_handle, GDMA_SQ,
> >                                        &wq_spec, &cq_spec,
> >@@ -2509,6 +2587,9 @@ static int mana_create_txq(struct mana_port_context *apc,
> >               napi_enable_locked(&cq->napi);
> >               txq->napi_initialized = true;
> >
> >+              INIT_WORK(&cq->dim.work, mana_tx_dim_work);
> >+              cq->dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
> >+
> >               mana_gd_ring_cq(cq->gdma_cq, SET_ARM_BIT);
> >       }
> >
> >@@ -2543,6 +2624,7 @@ static void mana_destroy_rxq(struct mana_port_context *apc,
> >               napi_synchronize(napi);
> >
> >               napi_disable_locked(napi);
> >+              cancel_work_sync(&rxq->rx_cq.dim.work);
> >               netif_napi_del_locked(napi);
> >       }
> >
> >@@ -2780,6 +2862,10 @@ static struct mana_rxq *mana_create_rxq(struct mana_port_context *apc,
> >       cq_spec.queue_size = cq->gdma_cq->queue_size;
> >       cq_spec.modr_ctx_id = 0;
> >       cq_spec.attached_eq = cq->gdma_cq->cq.parent->id;
> >+      cq_spec.req_cq_moderation = apc->rx_dim_enabled ||
> >+              (apc->intr_modr_rx_usec && apc->intr_modr_rx_comp);
> >+      cq_spec.cq_moderation_usec = apc->intr_modr_rx_usec;
> >+      cq_spec.cq_moderation_comp = apc->intr_modr_rx_comp;
> >
> >       err = mana_create_wq_obj(apc, apc->port_handle, GDMA_RQ,
> >                                &wq_spec, &cq_spec, &rxq->rxobj);
> >@@ -2815,6 +2901,9 @@ static struct mana_rxq *mana_create_rxq(struct mana_port_context *apc,
> >
> >       napi_enable_locked(&cq->napi);
> >
> >+      INIT_WORK(&cq->dim.work, mana_rx_dim_work);
> >+      cq->dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
> >+
> >       mana_gd_ring_cq(cq->gdma_cq, SET_ARM_BIT);
> > out:
> >       if (!err)
> >@@ -3432,6 +3521,16 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
> >       apc->port_idx = port_idx;
> >       apc->cqe_coalescing_enable = 0;
> >
> >+      /* Initialize interrupt moderation settings if supported by HW */
> >+      if (gc->pf_cap_flags1 & GDMA_PF_CAP_FLAG_1_DYN_INTERRUPT_MODERATION) {
> >+              apc->intr_modr_rx_usec = MANA_INTR_MODR_USEC_DEF;
> >+              apc->intr_modr_rx_comp = MANA_INTR_MODR_COMP_DEF;
> >+              apc->intr_modr_tx_usec = MANA_INTR_MODR_USEC_DEF;
> >+              apc->intr_modr_tx_comp = MANA_INTR_MODR_COMP_DEF;
> >+              apc->rx_dim_enabled = MANA_ADAPTIVE_RX_DEF;
> >+              apc->tx_dim_enabled = MANA_ADAPTIVE_TX_DEF;
> >+      }
> >+
> >       mutex_init(&apc->vport_mutex);
> >       apc->vport_use_count = 0;
> >
> >diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> >index 04350973e19e..a90216eba794 100644
> >--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> >+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> >@@ -419,6 +419,15 @@ static int mana_get_coalesce(struct net_device *ndev,
> >           !kernel_coal->rx_cqe_nsecs)
> >               kernel_coal->rx_cqe_nsecs = MANA_RX_CQE_NSEC_DEF;
> >
> >+      ec->rx_coalesce_usecs = apc->intr_modr_rx_usec;
> >+      ec->rx_max_coalesced_frames = apc->intr_modr_rx_comp;
> >+
> >+      ec->tx_coalesce_usecs = apc->intr_modr_tx_usec;
> >+      ec->tx_max_coalesced_frames = apc->intr_modr_tx_comp;
> >+
> >+      ec->use_adaptive_rx_coalesce = apc->rx_dim_enabled;
> >+      ec->use_adaptive_tx_coalesce = apc->tx_dim_enabled;
> >+
> >       return 0;
> > }
> >
> >@@ -429,8 +438,28 @@ static int mana_set_coalesce(struct net_device *ndev,
> > {
> >       struct mana_port_context *apc = netdev_priv(ndev);
> >       u8 saved_cqe_coalescing_enable;
> >+      u16 old_rx_usec, old_rx_comp;
> >+      u16 old_tx_usec, old_tx_comp;
> >+      bool old_rx_dim, old_tx_dim;
> 
> how about using some sort of struct instead of declaring a number
> of params for bookkeeping? imho would be cleaner
Will consider this.
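
For reference, one possible shape for such a struct, as a userspace sketch
only. struct modr_cfg, struct port_ctx, apply_modr() and the two demo
callbacks are hypothetical names, not part of the patch; the fields merely
mirror the intr_modr_* and *_dim_enabled members touched above. The idea is
that a single struct copy replaces the six old_* locals and the field-by-field
restore.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical grouping of the moderation settings kept in the port context. */
struct modr_cfg {
	uint16_t rx_usec, rx_comp;
	uint16_t tx_usec, tx_comp;
	bool rx_dim, tx_dim;
};

/* Hypothetical stand-in for the relevant part of struct mana_port_context. */
struct port_ctx {
	struct modr_cfg modr;
};

/* Apply new settings; on reconfigure failure restore the saved snapshot. */
static int apply_modr(struct port_ctx *apc, const struct modr_cfg *new_cfg,
		      int (*reconfigure)(struct port_ctx *))
{
	struct modr_cfg saved = apc->modr;	/* one copy instead of six locals */
	int err;

	apc->modr = *new_cfg;
	err = reconfigure(apc);
	if (err)
		apc->modr = saved;		/* roll back everything at once */
	return err;
}

/* Demo callbacks standing in for the detach/attach sequence. */
static int reconfigure_ok(struct port_ctx *apc)   { (void)apc; return 0; }
static int reconfigure_fail(struct port_ctx *apc) { (void)apc; return -1; }
```

With this layout the error path no longer needs a dedicated restore label for
each field, although whether that fits the planned refactoring is a separate
question.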

> 
> >+      bool modr_changed = false;
> >+      bool dim_changed = false;
> >+      struct gdma_context *gc;
> >       int err;
> >
> >+      gc = apc->ac->gdma_dev->gdma_context;
> >+
> >+      /* Both static and dynamic interrupt moderation (DIM) rely on the
> >+       * same HW capability advertised by the PF.
> >+       */
> >+      if ((ec->use_adaptive_rx_coalesce || ec->use_adaptive_tx_coalesce ||
> >+           ec->rx_coalesce_usecs || ec->tx_coalesce_usecs ||
> >+           ec->rx_max_coalesced_frames || ec->tx_max_coalesced_frames) &&
> >+          !(gc->pf_cap_flags1 & GDMA_PF_CAP_FLAG_1_DYN_INTERRUPT_MODERATION)) {
> >+              NL_SET_ERR_MSG(extack,
> >+                             "Interrupt Moderation is not supported by HW");
> >+              return -EOPNOTSUPP;
> >+      }
> >+
> >       if (kernel_coal->rx_cqe_frames != 1 &&
> >           kernel_coal->rx_cqe_frames != MANA_RXCOMP_OOB_NUM_PPI) {
> >               NL_SET_ERR_MSG_FMT(extack,
> >@@ -440,6 +469,47 @@ static int mana_set_coalesce(struct net_device *ndev,
> >               return -EINVAL;
> >       }
> >
> >+      if (ec->rx_coalesce_usecs > MANA_INTR_MODR_USEC_MAX ||
> >+          ec->tx_coalesce_usecs > MANA_INTR_MODR_USEC_MAX) {
> >+              NL_SET_ERR_MSG_FMT(extack,
> >+                                 "coalesce usecs must be <= %u",
> >+                                 MANA_INTR_MODR_USEC_MAX);
> >+              return -EINVAL;
> >+      }
> >+
> >+      if (ec->rx_max_coalesced_frames > MANA_INTR_MODR_COMP_MAX ||
> >+          ec->tx_max_coalesced_frames > MANA_INTR_MODR_COMP_MAX) {
> >+              NL_SET_ERR_MSG_FMT(extack,
> >+                                 "coalesce frames must be <= %u",
> >+                                 MANA_INTR_MODR_COMP_MAX);
> >+              return -EINVAL;
> >+      }
> >+
> >+      if (ec->rx_coalesce_usecs != apc->intr_modr_rx_usec ||
> >+          ec->rx_max_coalesced_frames != apc->intr_modr_rx_comp ||
> >+          ec->tx_coalesce_usecs != apc->intr_modr_tx_usec ||
> >+          ec->tx_max_coalesced_frames != apc->intr_modr_tx_comp)
> >+              modr_changed = true;
> >+
> >+      old_rx_usec = apc->intr_modr_rx_usec;
> >+      old_rx_comp = apc->intr_modr_rx_comp;
> >+      old_tx_usec = apc->intr_modr_tx_usec;
> >+      old_tx_comp = apc->intr_modr_tx_comp;
> >+
> >+      apc->intr_modr_rx_usec = ec->rx_coalesce_usecs;
> >+      apc->intr_modr_rx_comp = ec->rx_max_coalesced_frames;
> >+      apc->intr_modr_tx_usec = ec->tx_coalesce_usecs;
> >+      apc->intr_modr_tx_comp = ec->tx_max_coalesced_frames;
> >+
> >+      if (!!ec->use_adaptive_rx_coalesce != apc->rx_dim_enabled ||
> >+          !!ec->use_adaptive_tx_coalesce != apc->tx_dim_enabled)
> >+              dim_changed = true;
> >+
> >+      old_rx_dim = apc->rx_dim_enabled;
> >+      old_tx_dim = apc->tx_dim_enabled;
> >+      apc->rx_dim_enabled = !!ec->use_adaptive_rx_coalesce;
> >+      apc->tx_dim_enabled = !!ec->use_adaptive_tx_coalesce;
> >+
> >       saved_cqe_coalescing_enable = apc->cqe_coalescing_enable;
> >       apc->cqe_coalescing_enable =
> >               kernel_coal->rx_cqe_frames == MANA_RXCOMP_OOB_NUM_PPI;
> >@@ -447,10 +517,46 @@ static int mana_set_coalesce(struct net_device *ndev,
> >       if (!apc->port_is_up)
> >               return 0;
> >
> >-      err = mana_config_rss(apc, TRI_STATE_TRUE, false, false);
> >-      if (err)
> >-              apc->cqe_coalescing_enable = saved_cqe_coalescing_enable;
> >+      if (apc->cqe_coalescing_enable != saved_cqe_coalescing_enable &&
> >+          !modr_changed && !dim_changed) {
> >+              /* If only CQE coalescing setting is changed, we can just update
> >+               * RSS configuration.
> >+               */
> >+              err = mana_config_rss(apc, TRI_STATE_TRUE, false, false);
> >+              if (err) {
> >+                      netdev_err(ndev, "Change CQE coalescing failed: %d\n",
> >+                                 err);
> >+                      apc->cqe_coalescing_enable =
> >+                              saved_cqe_coalescing_enable;
> >+                      return err;
> >+              }
> >+              return 0;
> >+      }
> >+
> >+      if (modr_changed || dim_changed) {
> >+              err = mana_detach(ndev, false);
> >+              if (err) {
> >+                      netdev_err(ndev, "mana_detach failed: %d\n", err);
> >+                      goto restore_modr;
> >+              }
> >+
> >+              err = mana_attach(ndev);
> >+              if (err) {
> >+                      netdev_err(ndev, "mana_attach failed: %d\n", err);
> >+                      goto restore_modr;
> 
> I see there is already such a pattern in the mana code; how about
> creating a helper?
We are planning to rework this pattern, so for now I have kept this code
consistent with the other functions. We will refactor them in a separate
patch set.

> 
> >+              }
> >+      }
> >+
> >+      return 0;
> >
> >+restore_modr:
> >+      apc->cqe_coalescing_enable = saved_cqe_coalescing_enable;
> >+      apc->intr_modr_rx_usec = old_rx_usec;
> >+      apc->intr_modr_rx_comp = old_rx_comp;
> >+      apc->intr_modr_tx_usec = old_tx_usec;
> >+      apc->intr_modr_tx_comp = old_tx_comp;
> >+      apc->rx_dim_enabled = old_rx_dim;
> >+      apc->tx_dim_enabled = old_tx_dim;
> >       return err;
> > }
> >
> >@@ -574,7 +680,13 @@ static int mana_get_link_ksettings(struct net_device *ndev,
> > }
> >
> > const struct ethtool_ops mana_ethtool_ops = {
> >-      .supported_coalesce_params = ETHTOOL_COALESCE_RX_CQE_FRAMES,
> >+      .supported_coalesce_params = ETHTOOL_COALESCE_RX_CQE_FRAMES |
> >+                                  ETHTOOL_COALESCE_RX_USECS |
> >+                                  ETHTOOL_COALESCE_RX_MAX_FRAMES |
> >+                                  ETHTOOL_COALESCE_TX_USECS |
> >+                                  ETHTOOL_COALESCE_TX_MAX_FRAMES |
> >+                                  ETHTOOL_COALESCE_USE_ADAPTIVE_RX |
> >+                                  ETHTOOL_COALESCE_USE_ADAPTIVE_TX,
> >       .get_ethtool_stats      = mana_get_ethtool_stats,
> >       .get_sset_count         = mana_get_sset_count,
> >       .get_strings            = mana_get_strings,
> >diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
> >index 70d62bc32837..0a0cc7b080d3 100644
> >--- a/include/net/mana/gdma.h
> >+++ b/include/net/mana/gdma.h
> >@@ -47,6 +47,7 @@ enum gdma_queue_type {
> >       GDMA_RQ,
> >       GDMA_CQ,
> >       GDMA_EQ,
> >+      GDMA_DIM,
> > };
> >
> > enum gdma_work_request_flags {
> >@@ -126,6 +127,17 @@ union gdma_doorbell_entry {
> >               u64 tail_ptr    : 31;
> >               u64 arm         : 1;
> >       } eq;
> >+
> >+      struct {
> >+              u64 id           : 24;
> >+              u64 reserved     : 8;
> >+              u64 mod_usec     : 10;
> >+              u64 reserve1     : 5;
> >+              u64 mod_usec_vld : 1;
> >+              u64 mod_comps    : 8;
> >+              u64 reserve2     : 7;
> >+              u64 mod_comps_vld: 1;
> >+      } dim;
> > }; /* HW DATA */
> >
> > struct gdma_msg_hdr {
> >@@ -484,6 +496,9 @@ void mana_gd_ring_cq(struct gdma_queue *cq, u8 arm_bit);
> >
> > int mana_schedule_serv_work(struct gdma_context *gc, enum gdma_eqe_type type);
> >
> >+void mana_gd_ring_dim(struct gdma_queue *cq, u32 mod_usec, bool mod_usec_vld,
> >+                    u32 mod_comps, bool mod_comps_vld);
> >+
> > struct gdma_wqe {
> >       u32 reserved    :24;
> >       u32 last_vbytes :8;
> >@@ -629,6 +644,9 @@ enum {
> > /* Driver supports self recovery on Hardware Channel timeouts */
> > #define GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECOVERY BIT(25)
> >
> >+/* Driver supports dynamic interrupt moderation - DIM */
> >+#define GDMA_DRV_CAP_FLAG_1_DYN_INTERRUPT_MODERATION BIT(27)
> >+
> > #define GDMA_DRV_CAP_FLAGS1 \
> >       (GDMA_DRV_CAP_FLAG_1_EQ_SHARING_MULTI_VPORT | \
> >        GDMA_DRV_CAP_FLAG_1_NAPI_WKDONE_FIX | \
> >@@ -643,7 +661,8 @@ enum {
> >        GDMA_DRV_CAP_FLAG_1_SKB_LINEARIZE | \
> >        GDMA_DRV_CAP_FLAG_1_PROBE_RECOVERY | \
> >        GDMA_DRV_CAP_FLAG_1_HANDLE_STALL_SQ_RECOVERY | \
> >-       GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECOVERY)
> >+       GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECOVERY | \
> >+       GDMA_DRV_CAP_FLAG_1_DYN_INTERRUPT_MODERATION)
> >
> > #define GDMA_DRV_CAP_FLAGS2 0
> >
> >@@ -679,6 +698,9 @@ struct gdma_verify_ver_req {
> >       u8 os_ver_str4[128];
> > }; /* HW DATA */
> >
> >+/* HW supports dynamic interrupt moderation - DIM */
> >+#define GDMA_PF_CAP_FLAG_1_DYN_INTERRUPT_MODERATION BIT(15)
> >+
> > struct gdma_verify_ver_resp {
> >       struct gdma_resp_hdr hdr;
> >       u64 gdma_protocol_ver;
> >diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
> >index d9c27310fd04..57868a79f23d 100644
> >--- a/include/net/mana/mana.h
> >+++ b/include/net/mana/mana.h
> >@@ -4,6 +4,7 @@
> > #ifndef _MANA_H
> > #define _MANA_H
> >
> >+#include <linux/dim.h>
> > #include <net/xdp.h>
> > #include <net/net_shaper.h>
> >
> >@@ -64,6 +65,16 @@ enum TRI_STATE {
> > /* Maximum number of packets per coalesced CQE */
> > #define MANA_RXCOMP_OOB_NUM_PPI 4
> >
> >+/* Default/max interrupt moderation settings */
> >+#define MANA_INTR_MODR_USEC_DEF 0
> >+#define MANA_INTR_MODR_COMP_DEF 0
> >+
> >+#define MANA_ADAPTIVE_RX_DEF true
> >+#define MANA_ADAPTIVE_TX_DEF true
> >+
> >+#define MANA_INTR_MODR_USEC_MAX 1023
> >+#define MANA_INTR_MODR_COMP_MAX 255
> 
> used as a limiter and mask - for the mask case I believe
> GENMASK can be used
Will do. 
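
For illustration, a minimal user-space sketch of what the GENMASK suggestion
could look like. EX_GENMASK is a hypothetical stand-in for the kernel's
GENMASK(h, l), and the EX_* names and the 10-bit/8-bit field widths are
assumptions inferred from the 1023 and 255 limits, not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical user-space stand-in for the kernel's GENMASK(h, l):
 * a mask with bits h..l (inclusive) set. Valid for 0 <= l <= h <= 31.
 */
#define EX_GENMASK(h, l) \
	((uint32_t)((~UINT32_C(0) >> (31 - (h))) & (~UINT32_C(0) << (l))))

/* Assumed field widths: 10 bits for usec, 8 bits for completions. */
#define EX_INTR_MODR_USEC_MAX EX_GENMASK(9, 0)
#define EX_INTR_MODR_COMP_MAX EX_GENMASK(7, 0)

/* The masks reproduce the literal limits from the quoted patch. */
static_assert(EX_INTR_MODR_USEC_MAX == 1023, "10-bit usec field");
static_assert(EX_INTR_MODR_COMP_MAX == 255, "8-bit completion field");

/* Use as a limiter: clamp a requested value to the field maximum. */
static inline uint32_t ex_clamp_usec(uint32_t usec)
{
	return usec > EX_INTR_MODR_USEC_MAX ? EX_INTR_MODR_USEC_MAX : usec;
}

/* Use as a mask: keep only the bits that fit in the field. */
static inline uint32_t ex_mask_comps(uint32_t comps)
{
	return comps & EX_INTR_MODR_COMP_MAX;
}
```

Deriving the limit from the bit range keeps the limiter and the mask tied to
the same field width, so the two cannot drift apart.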

Thanks,
- Haiyang

^ permalink raw reply

* Re: [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions
From: Jork Loeser @ 2026-06-01 20:09 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-hyperv, linux-mm, kexec, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Pasha Tatashin, Pratyush Yadav,
	Alexander Graf, Jason Miu, Andrew Morton, David Hildenbrand,
	Muchun Song, Oscar Salvador, Baoquan He, Catalin Marinas,
	Will Deacon, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Kees Cook, Ran Xiaokai,
	Justinien Bouron, Sourabh Jain, Pingfan Liu, Rafael J. Wysocki,
	Mario Limonciello, linux-arm-kernel, x86, linux-kernel,
	Michael Kelley
In-Reply-To: <ahxrc4pTvVU20RTX@kernel.org>



On Sun, 31 May 2026, Mike Rapoport wrote:

> Hi Jork,

>>  - A freeze mechanism to lock the tree before serializing for kexec
>>    (patch 13).
>
> There was a lot of effort to make KHO stateless and drop the requirement
> for finalization/freeze.
>
> Why is it necessary to add a freeze mechanism to kho_radix_tree?
> If it's a hard requirement of mshv maybe the freeze part should be handled
> there?

Good feedback. It's a safety-net so we do not accidentally donate pages 
without being able to track them. Thought it might be a good generic 
feature. Let me keep it in the MSHV driver.

>> Patch 13:      Radix tree freeze and del_key() error reporting
>
> del_key() error reporting sounds like something we'd want to avoid.
> del_key() is called on "freeing" path and during error handling, it would
> be hard if at all possible to deal with errors from del_key().

I hear you. Stating "yeah, it can only really fail if the key isn't there, 
or it's frozen, but not due to other things, so don't bother to check the 
return code if you are sure" is an odd contract. With the freeze-logic 
moving into MSHV, will revert to no-error.

>> Patch 19:      Export kexec_in_progress for modules
>
> Isn't there another way to differentiate kexec reboot?

I could not find one, unfortunately.

> Sincerely yours,
> Mike.

Best,
Jork

^ permalink raw reply

* Re: [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions
From: Jork Loeser @ 2026-06-01 20:15 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Mike Rapoport, linux-hyperv, linux-mm, kexec, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Pratyush Yadav,
	Alexander Graf, Jason Miu, Andrew Morton, David Hildenbrand,
	Muchun Song, Oscar Salvador, Baoquan He, Catalin Marinas,
	Will Deacon, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Kees Cook, Ran Xiaokai,
	Justinien Bouron, Sourabh Jain, Pingfan Liu, Rafael J. Wysocki,
	Mario Limonciello, linux-arm-kernel, x86, linux-kernel,
	Michael Kelley
In-Reply-To: <ah2eBxaBnVs_1j5n@google.com>

On Mon, 1 Jun 2026, Pasha Tatashin wrote:

> On 05-31 20:10, Mike Rapoport wrote:

>>>  - A freeze mechanism to lock the tree before serializing for kexec
>>>    (patch 13).
>>
>> There was a lot of effort to make KHO stateless and drop the requirement
>> for finalization/freeze.
>
> Yes, using KHO directly here is incorrect. The state machine is provided
> by LUO, so we should use LUO here. MSHV should provide a file that
> userspace adds to LUO, and all state machine management would be the
> same as for all other clients participating in LU.

The thing is, there is no file handle to rely on. Even once partitions are 
all removed, Hyper-V might hang onto pages (and won't return them even if 
asked). However, these pages very much must be excluded from Linux 
post-kexec, or the system will crash. We cannot rely on UM to ensure 
integrity of memory management.

Contrast that to standard LUO use: If you drop individual file handles, or 
even skip the LUO phase entirely, the worst that will happen is that the 
objects will be gone post-kexec. The MM itself will still be consistent. 
For MSHV & page donation, this is different.

(And yes, partition preservation will very much tie into LUO)

Best,
Jork

^ permalink raw reply

* Re: [PATCH v4 1/47] x86/tsc: Never re-calibrate TSC frequency if its exact timing is known
From: David Woodhouse @ 2026-06-01 21:46 UTC (permalink / raw)
  To: seanjc
  Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, kas, kys, haiyangz,
	wei.liu, decui, longli, ajay.kaher, alexey.makhalov, jan.kiszka,
	luto, peterz, jgross, daniel.lezcano, jstultz, hpa,
	rick.p.edgecombe, vkuznets, bcm-kernel-feedback-list,
	boris.ostrovsky, sboyd, kvm, linux-kernel, linux-coco,
	linux-hyperv, virtualization, xen-devel, dwmw, thomas.lendacky,
	nikunj, dwmw2, mhklinux, tglx
In-Reply-To: <20260529144435.704127-2-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 473 bytes --]

On Fri, 29 May 2026 07:43:48 -0700, Sean Christopherson wrote:
> Don't re-calibrate the TSC frequency if the TSC is known to run at a fixed
> frequency.  In practice, this is likely one big nop, as re-calibration is
> used only for SMP=n kernels, and only for hardware that is 20+ years old,
> i.e. is extremely unlikely to collide with TSC_KNOWN_FREQ.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v4 8/47] x86/tsc: Add dedicated hypervisor hooks for getting known TSC/CPU frequencies
From: David Woodhouse @ 2026-06-01 21:49 UTC (permalink / raw)
  To: seanjc
  Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, kas, kys, haiyangz,
	wei.liu, decui, longli, ajay.kaher, alexey.makhalov, jan.kiszka,
	luto, peterz, jgross, daniel.lezcano, jstultz, hpa,
	rick.p.edgecombe, vkuznets, bcm-kernel-feedback-list,
	boris.ostrovsky, sboyd, kvm, linux-kernel, linux-coco,
	linux-hyperv, virtualization, xen-devel, dwmw, thomas.lendacky,
	nikunj, dwmw2, mhklinux, tglx
In-Reply-To: <20260529144435.704127-9-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 740 bytes --]

On Fri, 29 May 2026 07:43:55 -0700, Sean Christopherson wrote:
> Add dedicated hypervisor hooks for getting known TSC/CPU frequencies
> instead of overriding seemingly generic platform hooks, and explicitly
> prioritize hypervisor-provided frequencies over native methods, but do NOT
> clobber the frequency obtained from trusted firmware.  While shuffling the
> hooks around is arguably "six of one, half dozen of the other", scoping
> them to x86_hyper_init makes their purpose more obvious, and allows for
> explicitly defining the priority of sources (as is done here).
>
> Cc: David Woodhouse <dwmw2@infradead.org>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v4 11/47] x86/tsc: Kill off x86_platform_ops.calibrate_{cpu,tsc}() hooks
From: David Woodhouse @ 2026-06-01 21:51 UTC (permalink / raw)
  To: seanjc
  Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, kas, kys, haiyangz,
	wei.liu, decui, longli, ajay.kaher, alexey.makhalov, jan.kiszka,
	luto, peterz, jgross, daniel.lezcano, jstultz, hpa,
	rick.p.edgecombe, vkuznets, bcm-kernel-feedback-list,
	boris.ostrovsky, sboyd, kvm, linux-kernel, linux-coco,
	linux-hyperv, virtualization, xen-devel, dwmw, thomas.lendacky,
	nikunj, dwmw2, mhklinux, tglx
In-Reply-To: <20260529144435.704127-12-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 634 bytes --]

On Fri, 29 May 2026 07:43:58 -0700, Sean Christopherson wrote:
> Now that getting the CPU and/or TSC frequencies from the hypervisor uses
> dedicated hooks, drop x86_platform_ops.calibrate_{cpu,tsc}() and instead
> directly invoke the correct helper at each phase of (re)calibration.  In
> addition to eliminating unnecessary code, this makes it a bit more obvious
> when the "late" path invokes pit_hpet_ptimer_calibrate_cpu() instead of
> x86_platform_ops.calibrate_cpu().
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v4 13/47] x86/tsc: Fold native_calibrate_cpu() into recalibrate_cpu_khz()
From: David Woodhouse @ 2026-06-01 21:52 UTC (permalink / raw)
  To: seanjc
  Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, kas, kys, haiyangz,
	wei.liu, decui, longli, ajay.kaher, alexey.makhalov, jan.kiszka,
	luto, peterz, jgross, daniel.lezcano, jstultz, hpa,
	rick.p.edgecombe, vkuznets, bcm-kernel-feedback-list,
	boris.ostrovsky, sboyd, kvm, linux-kernel, linux-coco,
	linux-hyperv, virtualization, xen-devel, dwmw, thomas.lendacky,
	nikunj, dwmw2, mhklinux, tglx
In-Reply-To: <20260529144435.704127-14-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 503 bytes --]

On Fri, 29 May 2026 07:44:00 -0700, Sean Christopherson wrote:
> Fold the guts of native_calibrate_cpu() into its sole remaining caller,
> recalibrate_cpu_khz() to eliminate the extra SMP=n #ifdef, and so that it's
> more obvious that directly invoking the early vs. late calibration routines
> in determine_cpu_tsc_frequencies() is intentional.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v4 12/47] x86/tsc: Rename pit_hpet_ptimer_calibrate_cpu() => native_calibrate_cpu_late()
From: David Woodhouse @ 2026-06-01 21:52 UTC (permalink / raw)
  To: seanjc
  Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, kas, kys, haiyangz,
	wei.liu, decui, longli, ajay.kaher, alexey.makhalov, jan.kiszka,
	luto, peterz, jgross, daniel.lezcano, jstultz, hpa,
	rick.p.edgecombe, vkuznets, bcm-kernel-feedback-list,
	boris.ostrovsky, sboyd, kvm, linux-kernel, linux-coco,
	linux-hyperv, virtualization, xen-devel, dwmw, thomas.lendacky,
	nikunj, dwmw2, mhklinux, tglx
In-Reply-To: <20260529144435.704127-13-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 339 bytes --]

On Fri, 29 May 2026 07:43:59 -0700, Sean Christopherson wrote:
> Rename the late CPU calibration routine so that its relationship to the
> early routine is more obvious and intuitive.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v4 14/47] x86/kvmclock: Rename kvm_get_tsc_khz() to kvmclock_get_tsc_khz()
From: David Woodhouse @ 2026-06-01 21:53 UTC (permalink / raw)
  To: seanjc
  Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, kas, kys, haiyangz,
	wei.liu, decui, longli, ajay.kaher, alexey.makhalov, jan.kiszka,
	luto, peterz, jgross, daniel.lezcano, jstultz, hpa,
	rick.p.edgecombe, vkuznets, bcm-kernel-feedback-list,
	boris.ostrovsky, sboyd, kvm, linux-kernel, linux-coco,
	linux-hyperv, virtualization, xen-devel, dwmw, thomas.lendacky,
	nikunj, dwmw2, mhklinux, tglx
In-Reply-To: <20260529144435.704127-15-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 400 bytes --]

On Fri, 29 May 2026 07:44:01 -0700, Sean Christopherson wrote:
> Rename kvm_get_tsc_khz() to kvmclock_get_tsc_khz() in anticipation of
> adding support for getting TSC info from PV CPUID, i.e. in a KVM specific
> way, but without non-kvmclock.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v4 17/47] x86/kvm: Mark TSC as reliable when it's constant and nonstop
From: David Woodhouse @ 2026-06-01 22:02 UTC (permalink / raw)
  To: seanjc
  Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, kas, kys, haiyangz,
	wei.liu, decui, longli, ajay.kaher, alexey.makhalov, jan.kiszka,
	luto, peterz, jgross, daniel.lezcano, jstultz, hpa,
	rick.p.edgecombe, vkuznets, bcm-kernel-feedback-list,
	boris.ostrovsky, sboyd, kvm, linux-kernel, linux-coco,
	linux-hyperv, virtualization, xen-devel, dwmw, thomas.lendacky,
	nikunj, dwmw2, mhklinux, tglx, sashiko-reviews
In-Reply-To: <ahnhnjvfIblFxTFX@google.com>

[-- Attachment #1: Type: text/plain, Size: 2028 bytes --]

On Fri, 29 May 2026 11:57:34 -0700, Sean Christopherson wrote:
> On Fri, May 29, 2026, sashiko-bot@kernel.org wrote:
> > > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> > > index 909d3e5e5bcd5..4fe9c69bf40b3 100644
> > > --- a/arch/x86/kernel/kvm.c
> > > +++ b/arch/x86/kernel/kvm.c
> > [ ... ]
> > > @@ -1040,7 +1041,20 @@ static void __init kvm_init_platform(void)
> > [ ... ]
> > > -	kvmclock_init();
> > > +        /*
> > > +         * If the TSC counts at a constant frequency across P/T states, counts
> > > +         * in deep C-states, and the TSC hasn't been marked unstable, treat the
> > > +         * TSC reliable, as guaranteed by KVM.  Note, the TSC unstable check
> > > +         * exists purely to honor the TSC being marked unstable via command
> > > +         * line, any runtime detection of an unstable will happen after this.
> > > +         */
> > > +	tsc_is_reliable = boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> > > +			  boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> > > +			  !check_tsc_unstable();
> > 
> > [Severity: High]
> > Does this evaluate check_tsc_unstable() too early to catch the command line
> > parameter?
> 
> Huh, it does indeed.
> 
> > It looks like kvm_init_platform() is called from setup_arch(), but the
> > tsc=unstable kernel parameter is parsed via __setup() later during
> > parse_args() in start_kernel().
> > 
> > If check_tsc_unstable() evaluates to 0 here because the parameter hasn't
> > been parsed yet, wouldn't it incorrectly force X86_FEATURE_TSC_RELIABLE
> > and set prefer_tsc to true?
> 
> Yep, but this is a pre-existing problem that goes all the way back to the original
> commit 7539b174aef4 ("x86: kvmguest: use TSC clocksource if invariant TSC is exposed").
> 
> We could try to fix that, but I'm _very_ strongly inclined to add (yet another)
> patch to simply drop the check_tsc_unstable() since it has always been dead code.

Yeah, kill it with fire.
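
For illustration, a minimal user-space model of the ordering problem described
above. All ex_* names are hypothetical stand-ins, not the kernel's symbols; the
sketch only assumes what the thread states, namely that the platform init code
samples the flag before the command line option is parsed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the "TSC marked unstable" state. */
static bool ex_tsc_unstable;

/* Models the "tsc=unstable" handler that runs during argument parsing. */
static void ex_parse_cmdline(const char *cmdline)
{
	assert(cmdline != NULL);
	if (strstr(cmdline, "tsc=unstable"))
		ex_tsc_unstable = true;
}

/* Models the early platform init sampling the flag. */
static bool ex_platform_init_tsc_reliable(bool constant, bool nonstop)
{
	return constant && nonstop && !ex_tsc_unstable;
}

/* Early init runs first, so the flag is still false when it is read. */
static bool ex_boot(const char *cmdline)
{
	bool reliable = ex_platform_init_tsc_reliable(true, true);

	ex_parse_cmdline(cmdline);	/* too late to affect 'reliable' */
	return reliable;
}
```

In this model ex_boot("tsc=unstable") still reports the TSC as reliable, which
is why the check can be described as dead code.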

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v4 18/47] x86/kvm: Get local APIC bus frequency from PV CPUID Timing Info
From: David Woodhouse @ 2026-06-01 22:06 UTC (permalink / raw)
  To: seanjc
  Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, kas, kys, haiyangz,
	wei.liu, decui, longli, ajay.kaher, alexey.makhalov, jan.kiszka,
	luto, peterz, jgross, daniel.lezcano, jstultz, hpa,
	rick.p.edgecombe, vkuznets, bcm-kernel-feedback-list,
	boris.ostrovsky, sboyd, kvm, linux-kernel, linux-coco,
	linux-hyperv, virtualization, xen-devel, dwmw, thomas.lendacky,
	nikunj, dwmw2, mhklinux, tglx, sashiko-reviews
In-Reply-To: <ahnZ8tuHFVFTeofW@google.com>

[-- Attachment #1: Type: text/plain, Size: 1734 bytes --]

On Fri, 29 May 2026 11:24:50 -0700, Sean Christopherson wrote:
> On Fri, May 29, 2026, sashiko-bot@kernel.org wrote:
> > > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> > > index 4fe9c69bf40b3..c1139182121d5 100644
> > > --- a/arch/x86/kernel/kvm.c
> > > +++ b/arch/x86/kernel/kvm.c
> > 
> > [ ... ]
> > 
> > > @@ -1039,6 +1040,13 @@ static void __init kvm_init_platform(void)
> > >                     x86_init.hyper.get_tsc_khz = kvm_get_tsc_khz;
> > >                     x86_init.hyper.get_cpu_khz = kvm_get_tsc_khz;
> > >             }
> > > +
> > > +#ifdef CONFIG_X86_LOCAL_APIC
> > > +           /* The leaf also includes the local APIC bus/timer frequency.*/
> > > +           apic_khz = cpuid_ebx(timing_info_leaf);
> > > +           if (apic_khz)
> > > +                  lapic_timer_period = apic_khz * 1000 / HZ;
> > 
> > [Severity: High]
> > Could the intermediate multiplication of apic_khz * 1000 overflow here?
> > 
> > Since apic_khz is a u32, the multiplication is evaluated in 32-bit unsigned
> > arithmetic. If the hypervisor reports an APIC timer frequency greater than
> > ~4.29 GHz (apic_khz > 4294967), this calculation will silently overflow.
> 
> Hmm, easy enough to use mul_u64_u32_div() (I think that's the right helper for
> this?).

Yep.

> But this problem pre-exists in almost every other path that sets lapic_timer_period.
> So while I tried to avoid doing yet more tangentially related cleanup, it seems
> like adding a helper to set lapic_timer_period is the way to go.  That would also
> allow making lapic_timer_period local to arch/x86/kernel/apic/apic.c.
> 
> *sigh*

Yay, more patches!
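
For illustration, a minimal user-space sketch of the overflow discussed above.
The ex_* helpers are hypothetical, and the widening cast is a plain stand-in
for the kernel helper named in the thread, not its implementation.

```c
#include <assert.h>
#include <stdint.h>

/*
 * The quoted expression, evaluated in 32-bit unsigned arithmetic:
 * the product wraps once apic_khz exceeds 4294967.
 */
static uint32_t ex_period_32(uint32_t apic_khz, uint32_t hz)
{
	assert(hz != 0);
	return apic_khz * 1000u / hz;
}

/* Widening to 64 bits before the multiply avoids the wrap. */
static uint64_t ex_period_64(uint32_t apic_khz, uint32_t hz)
{
	assert(hz != 0);
	return (uint64_t)apic_khz * 1000u / hz;
}
```

With apic_khz = 5000000 (5 GHz) and hz = 1000, the 32-bit version yields 705032
because 5000000000 wraps to 705032704, while the widened version yields the
expected 5000000.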

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v4 31/47] x86/vmware: NOP-ify save/restore hooks when using VMware's sched_clock
From: David Woodhouse @ 2026-06-01 22:09 UTC (permalink / raw)
  To: seanjc
  Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, kas, kys, haiyangz,
	wei.liu, decui, longli, ajay.kaher, alexey.makhalov, jan.kiszka,
	luto, peterz, jgross, daniel.lezcano, jstultz, hpa,
	rick.p.edgecombe, vkuznets, bcm-kernel-feedback-list,
	boris.ostrovsky, sboyd, kvm, linux-kernel, linux-coco,
	linux-hyperv, virtualization, xen-devel, dwmw, thomas.lendacky,
	nikunj, dwmw2, mhklinux, tglx
In-Reply-To: <20260529150753.714296-1-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 628 bytes --]

On Fri, 29 May 2026 08:07:52 -0700, Sean Christopherson wrote:
> NOP-ify the sched_clock save/restore hooks when using VMware's version of
> sched_clock.  This will allow extending paravirt_set_sched_clock() to set
> the save/restore hooks, without having to simultaneously change the
> behavior of VMware guests.
>
> Note, it's not at all obvious that it's safe/correct for VMware guests to
> do nothing on suspend/resume, but that's a pre-existing problem.  Leave it
> for a VMware expert to sort out.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v4 30/47] x86/xen/time: NOP-ify x86_platform's sched_clock save/restore hooks
From: David Woodhouse @ 2026-06-01 22:09 UTC (permalink / raw)
  To: seanjc
  Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, kas, kys, haiyangz,
	wei.liu, decui, longli, ajay.kaher, alexey.makhalov, jan.kiszka,
	luto, peterz, jgross, daniel.lezcano, jstultz, hpa,
	rick.p.edgecombe, vkuznets, bcm-kernel-feedback-list,
	boris.ostrovsky, sboyd, kvm, linux-kernel, linux-coco,
	linux-hyperv, virtualization, xen-devel, dwmw, thomas.lendacky,
	nikunj, dwmw2, mhklinux, tglx
In-Reply-To: <20260529150741.714145-1-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 403 bytes --]

On Fri, 29 May 2026 08:07:41 -0700, Sean Christopherson wrote:
> NOP-ify the x86_platform sched_clock save/restore hooks when setting up
> Xen's PV clock to make it somewhat obvious the hooks aren't used when
> running as a Xen guest (Xen uses a paravirtualized suspend/resume flow).
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v4 46/47] x86/kvmclock: Plumb in AP-online and BSP-resume to kvmlock, for documentation
From: David Woodhouse @ 2026-06-01 22:09 UTC (permalink / raw)
  To: seanjc
  Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, kas, kys, haiyangz,
	wei.liu, decui, longli, ajay.kaher, alexey.makhalov, jan.kiszka,
	luto, peterz, jgross, daniel.lezcano, jstultz, hpa,
	rick.p.edgecombe, vkuznets, bcm-kernel-feedback-list,
	boris.ostrovsky, sboyd, kvm, linux-kernel, linux-coco,
	linux-hyperv, virtualization, xen-devel, dwmw, thomas.lendacky,
	nikunj, dwmw2, mhklinux, tglx
In-Reply-To: <20260529150833.715042-1-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 543 bytes --]

On Fri, 29 May 2026 08:08:33 -0700, Sean Christopherson wrote:
> Invoke kvmclock_cpu_action() with AP_ONLINE and BSP_RESUME, even though
> kvmclock doesn't need to do anything in either case, so that the asymmetry
> of kvmclock is a detail buried in kvmclock, and to explicitly document
> that doing nothing during those phases is intentional and correct.
>
> For all intents and purposes, no functional change intended.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* [PATCH] x86/hyperv: Cosmetic changes in irqdomain.c for readability
From: Mukesh R @ 2026-06-01 22:51 UTC (permalink / raw)
  To: linux-hyperv, linux-kernel; +Cc: wei.liu

Make cosmetic changes:
 o Rename struct pci_dev *dev to *pdev since there are cases of
   struct device *dev in the file and all over the kernel
 o Rename hv_build_pci_dev_id to hv_build_devid_type_pci in anticipation
   of building different types of device ids
 o Fix checkpatch.pl issues with return and extraneous printk
 o Replace spaces with tabs
 o Rename struct hv_devid *xxx to struct hv_devid *hv_devid given code
   paths involve many types of device ids
 o Fix indentation in a large if block by using goto.

There are no functional changes.

Reviewed-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
---
 arch/x86/hyperv/irqdomain.c | 198 +++++++++++++++++++-----------------
 1 file changed, 104 insertions(+), 94 deletions(-)

diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
index 365e364268d9..b3ad50a874dc 100644
--- a/arch/x86/hyperv/irqdomain.c
+++ b/arch/x86/hyperv/irqdomain.c
@@ -1,5 +1,4 @@
 // SPDX-License-Identifier: GPL-2.0
-
 /*
  * Irqdomain for Linux to run as the root partition on Microsoft Hypervisor.
  *
@@ -14,8 +13,8 @@
 #include <linux/irqchip/irq-msi-lib.h>
 #include <asm/mshyperv.h>
 
-static int hv_map_interrupt(union hv_device_id device_id, bool level,
-		int cpu, int vector, struct hv_interrupt_entry *entry)
+static int hv_map_interrupt(union hv_device_id hv_devid, bool level,
+		int cpu, int vector, struct hv_interrupt_entry *ret_entry)
 {
 	struct hv_input_map_device_interrupt *input;
 	struct hv_output_map_device_interrupt *output;
@@ -32,7 +31,7 @@ static int hv_map_interrupt(union hv_device_id device_id, bool level,
 	intr_desc = &input->interrupt_descriptor;
 	memset(input, 0, sizeof(*input));
 	input->partition_id = hv_current_partition_id;
-	input->device_id = device_id.as_uint64;
+	input->device_id = hv_devid.as_uint64;
 	intr_desc->interrupt_type = HV_X64_INTERRUPT_TYPE_FIXED;
 	intr_desc->vector_count = 1;
 	intr_desc->target.vector = vector;
@@ -44,7 +43,7 @@ static int hv_map_interrupt(union hv_device_id device_id, bool level,
 
 	intr_desc->target.vp_set.valid_bank_mask = 0;
 	intr_desc->target.vp_set.format = HV_GENERIC_SET_SPARSE_4K;
-	nr_bank = cpumask_to_vpset(&(intr_desc->target.vp_set), cpumask_of(cpu));
+	nr_bank = cpumask_to_vpset(&intr_desc->target.vp_set, cpumask_of(cpu));
 	if (nr_bank < 0) {
 		local_irq_restore(flags);
 		pr_err("%s: unable to generate VP set\n", __func__);
@@ -61,7 +60,7 @@ static int hv_map_interrupt(union hv_device_id device_id, bool level,
 
 	status = hv_do_rep_hypercall(HVCALL_MAP_DEVICE_INTERRUPT, 0, var_size,
 			input, output);
-	*entry = output->interrupt_entry;
+	*ret_entry = output->interrupt_entry;
 
 	local_irq_restore(flags);
 
@@ -71,21 +70,19 @@ static int hv_map_interrupt(union hv_device_id device_id, bool level,
 	return hv_result_to_errno(status);
 }
 
-static int hv_unmap_interrupt(u64 id, struct hv_interrupt_entry *old_entry)
+static int hv_unmap_interrupt(u64 id, struct hv_interrupt_entry *irq_entry)
 {
 	unsigned long flags;
 	struct hv_input_unmap_device_interrupt *input;
-	struct hv_interrupt_entry *intr_entry;
 	u64 status;
 
 	local_irq_save(flags);
 	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
 
 	memset(input, 0, sizeof(*input));
-	intr_entry = &input->interrupt_entry;
 	input->partition_id = hv_current_partition_id;
 	input->device_id = id;
-	*intr_entry = *old_entry;
+	input->interrupt_entry = *irq_entry;
 
 	status = hv_do_hypercall(HVCALL_UNMAP_DEVICE_INTERRUPT, input, NULL);
 	local_irq_restore(flags);
@@ -115,67 +112,71 @@ static int get_rid_cb(struct pci_dev *pdev, u16 alias, void *data)
 	return 0;
 }
 
-static union hv_device_id hv_build_pci_dev_id(struct pci_dev *dev)
+static union hv_device_id hv_build_devid_type_pci(struct pci_dev *pdev)
 {
-	union hv_device_id dev_id;
+	int pos;
+	union hv_device_id hv_devid;
 	struct rid_data data = {
 		.bridge = NULL,
-		.rid = PCI_DEVID(dev->bus->number, dev->devfn)
+		.rid = PCI_DEVID(pdev->bus->number, pdev->devfn)
 	};
 
-	pci_for_each_dma_alias(dev, get_rid_cb, &data);
+	pci_for_each_dma_alias(pdev, get_rid_cb, &data);
 
-	dev_id.as_uint64 = 0;
-	dev_id.device_type = HV_DEVICE_TYPE_PCI;
-	dev_id.pci.segment = pci_domain_nr(dev->bus);
+	hv_devid.as_uint64 = 0;
+	hv_devid.device_type = HV_DEVICE_TYPE_PCI;
+	hv_devid.pci.segment = pci_domain_nr(pdev->bus);
 
-	dev_id.pci.bdf.bus = PCI_BUS_NUM(data.rid);
-	dev_id.pci.bdf.device = PCI_SLOT(data.rid);
-	dev_id.pci.bdf.function = PCI_FUNC(data.rid);
-	dev_id.pci.source_shadow = HV_SOURCE_SHADOW_NONE;
+	hv_devid.pci.bdf.bus = PCI_BUS_NUM(data.rid);
+	hv_devid.pci.bdf.device = PCI_SLOT(data.rid);
+	hv_devid.pci.bdf.function = PCI_FUNC(data.rid);
+	hv_devid.pci.source_shadow = HV_SOURCE_SHADOW_NONE;
 
-	if (data.bridge) {
-		int pos;
+	if (data.bridge == NULL)
+		goto out;
 
-		/*
-		 * Microsoft Hypervisor requires a bus range when the bridge is
-		 * running in PCI-X mode.
-		 *
-		 * To distinguish conventional vs PCI-X bridge, we can check
-		 * the bridge's PCI-X Secondary Status Register, Secondary Bus
-		 * Mode and Frequency bits. See PCI Express to PCI/PCI-X Bridge
-		 * Specification Revision 1.0 5.2.2.1.3.
-		 *
-		 * Value zero means it is in conventional mode, otherwise it is
-		 * in PCI-X mode.
-		 */
+	/*
+	 * Microsoft Hypervisor requires a bus range when the bridge is
+	 * running in PCI-X mode.
+	 *
+	 * To distinguish conventional vs PCI-X bridge, we can check
+	 * the bridge's PCI-X Secondary Status Register, Secondary Bus
+	 * Mode and Frequency bits. See PCI Express to PCI/PCI-X Bridge
+	 * Specification Revision 1.0 5.2.2.1.3.
+	 *
+	 * Value zero means it is in conventional mode, otherwise it is
+	 * in PCI-X mode.
+	 */
 
-		pos = pci_find_capability(data.bridge, PCI_CAP_ID_PCIX);
-		if (pos) {
-			u16 status;
+	pos = pci_find_capability(data.bridge, PCI_CAP_ID_PCIX);
+	if (pos) {
+		u16 status;
 
-			pci_read_config_word(data.bridge, pos +
-					PCI_X_BRIDGE_SSTATUS, &status);
+		pci_read_config_word(data.bridge, pos + PCI_X_BRIDGE_SSTATUS,
+				     &status);
 
-			if (status & PCI_X_SSTATUS_FREQ) {
-				/* Non-zero, PCI-X mode */
-				u8 sec_bus, sub_bus;
+		if (status & PCI_X_SSTATUS_FREQ) {
+			/* Non-zero, PCI-X mode */
+			u8 sec_bus, sub_bus;
 
-				dev_id.pci.source_shadow = HV_SOURCE_SHADOW_BRIDGE_BUS_RANGE;
+			hv_devid.pci.source_shadow =
+					     HV_SOURCE_SHADOW_BRIDGE_BUS_RANGE;
 
-				pci_read_config_byte(data.bridge, PCI_SECONDARY_BUS, &sec_bus);
-				dev_id.pci.shadow_bus_range.secondary_bus = sec_bus;
-				pci_read_config_byte(data.bridge, PCI_SUBORDINATE_BUS, &sub_bus);
-				dev_id.pci.shadow_bus_range.subordinate_bus = sub_bus;
-			}
+			pci_read_config_byte(data.bridge, PCI_SECONDARY_BUS,
+					     &sec_bus);
+			hv_devid.pci.shadow_bus_range.secondary_bus = sec_bus;
+			pci_read_config_byte(data.bridge, PCI_SUBORDINATE_BUS,
+					     &sub_bus);
+			hv_devid.pci.shadow_bus_range.subordinate_bus = sub_bus;
 		}
 	}
 
-	return dev_id;
+out:
+	return hv_devid;
 }
 
-/**
- * hv_map_msi_interrupt() - "Map" the MSI IRQ in the hypervisor.
+/*
+ * hv_map_msi_interrupt() - Map the MSI IRQ in the hypervisor.
  * @data:      Describes the IRQ
  * @out_entry: Hypervisor (MSI) interrupt entry (can be NULL)
  *
@@ -188,22 +189,23 @@ int hv_map_msi_interrupt(struct irq_data *data,
 {
 	struct irq_cfg *cfg = irqd_cfg(data);
 	struct hv_interrupt_entry dummy;
-	union hv_device_id device_id;
+	union hv_device_id hv_devid;
 	struct msi_desc *msidesc;
-	struct pci_dev *dev;
+	struct pci_dev *pdev;
 	int cpu;
 
 	msidesc = irq_data_get_msi_desc(data);
-	dev = msi_desc_to_pci_dev(msidesc);
-	device_id = hv_build_pci_dev_id(dev);
+	pdev = msi_desc_to_pci_dev(msidesc);
+	hv_devid = hv_build_devid_type_pci(pdev);
 	cpu = cpumask_first(irq_data_get_effective_affinity_mask(data));
 
-	return hv_map_interrupt(device_id, false, cpu, cfg->vector,
+	return hv_map_interrupt(hv_devid, false, cpu, cfg->vector,
 				out_entry ? out_entry : &dummy);
 }
 EXPORT_SYMBOL_GPL(hv_map_msi_interrupt);
 
-static inline void entry_to_msi_msg(struct hv_interrupt_entry *entry, struct msi_msg *msg)
+static void entry_to_msi_msg(struct hv_interrupt_entry *entry,
+			     struct msi_msg *msg)
 {
 	/* High address is always 0 */
 	msg->address_hi = 0;
@@ -211,17 +213,19 @@ static inline void entry_to_msi_msg(struct hv_interrupt_entry *entry, struct msi
 	msg->data = entry->msi_entry.data.as_uint32;
 }
 
-static int hv_unmap_msi_interrupt(struct pci_dev *dev, struct hv_interrupt_entry *old_entry);
+static int hv_unmap_msi_interrupt(struct pci_dev *pdev,
+				  struct hv_interrupt_entry *irq_entry);
+
 static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
 {
 	struct hv_interrupt_entry *stored_entry;
 	struct irq_cfg *cfg = irqd_cfg(data);
 	struct msi_desc *msidesc;
-	struct pci_dev *dev;
+	struct pci_dev *pdev;
 	int ret;
 
 	msidesc = irq_data_get_msi_desc(data);
-	dev = msi_desc_to_pci_dev(msidesc);
+	pdev = msi_desc_to_pci_dev(msidesc);
 
 	if (!cfg) {
 		pr_debug("%s: cfg is NULL", __func__);
@@ -240,7 +244,7 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
 		stored_entry = data->chip_data;
 		data->chip_data = NULL;
 
-		ret = hv_unmap_msi_interrupt(dev, stored_entry);
+		ret = hv_unmap_msi_interrupt(pdev, stored_entry);
 
 		kfree(stored_entry);
 
@@ -249,10 +253,8 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
 	}
 
 	stored_entry = kzalloc_obj(*stored_entry, GFP_ATOMIC);
-	if (!stored_entry) {
-		pr_debug("%s: failed to allocate chip data\n", __func__);
+	if (!stored_entry)
 		return;
-	}
 
 	ret = hv_map_msi_interrupt(data, stored_entry);
 	if (ret) {
@@ -262,18 +264,21 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
 
 	data->chip_data = stored_entry;
 	entry_to_msi_msg(data->chip_data, msg);
-
-	return;
 }
 
-static int hv_unmap_msi_interrupt(struct pci_dev *dev, struct hv_interrupt_entry *old_entry)
+static int hv_unmap_msi_interrupt(struct pci_dev *pdev,
+				  struct hv_interrupt_entry *irq_entry)
 {
-	return hv_unmap_interrupt(hv_build_pci_dev_id(dev).as_uint64, old_entry);
+	union hv_device_id hv_devid;
+
+	hv_devid = hv_build_devid_type_pci(pdev);
+	return hv_unmap_interrupt(hv_devid.as_uint64, irq_entry);
 }
 
-static void hv_teardown_msi_irq(struct pci_dev *dev, struct irq_data *irqd)
+/* NB: during map, hv_interrupt_entry is saved via data->chip_data */
+static void hv_teardown_msi_irq(struct pci_dev *pdev, struct irq_data *irqd)
 {
-	struct hv_interrupt_entry old_entry;
+	struct hv_interrupt_entry irq_entry;
 	struct msi_msg msg;
 
 	if (!irqd->chip_data) {
@@ -281,13 +286,13 @@ static void hv_teardown_msi_irq(struct pci_dev *dev, struct irq_data *irqd)
 		return;
 	}
 
-	old_entry = *(struct hv_interrupt_entry *)irqd->chip_data;
-	entry_to_msi_msg(&old_entry, &msg);
+	irq_entry = *(struct hv_interrupt_entry *)irqd->chip_data;
+	entry_to_msi_msg(&irq_entry, &msg);
 
 	kfree(irqd->chip_data);
 	irqd->chip_data = NULL;
 
-	(void)hv_unmap_msi_interrupt(dev, &old_entry);
+	(void)hv_unmap_msi_interrupt(pdev, &irq_entry);
 }
 
 /*
@@ -302,7 +307,8 @@ static struct irq_chip hv_pci_msi_controller = {
 };
 
 static bool hv_init_dev_msi_info(struct device *dev, struct irq_domain *domain,
-				 struct irq_domain *real_parent, struct msi_domain_info *info)
+				 struct irq_domain *real_parent,
+				 struct msi_domain_info *info)
 {
 	struct irq_chip *chip = info->chip;
 
@@ -317,7 +323,8 @@ static bool hv_init_dev_msi_info(struct device *dev, struct irq_domain *domain,
 }
 
 #define HV_MSI_FLAGS_SUPPORTED	(MSI_GENERIC_FLAGS_MASK | MSI_FLAG_PCI_MSIX)
-#define HV_MSI_FLAGS_REQUIRED	(MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS)
+#define HV_MSI_FLAGS_REQUIRED	(MSI_FLAG_USE_DEF_DOM_OPS |	\
+				 MSI_FLAG_USE_DEF_CHIP_OPS)
 
 static struct msi_parent_ops hv_msi_parent_ops = {
 	.supported_flags	= HV_MSI_FLAGS_SUPPORTED,
@@ -329,14 +336,14 @@ static struct msi_parent_ops hv_msi_parent_ops = {
 	.init_dev_msi_info	= hv_init_dev_msi_info,
 };
 
-static int hv_msi_domain_alloc(struct irq_domain *d, unsigned int virq, unsigned int nr_irqs,
-			       void *arg)
+/* Allocate nr_irqs IRQs for the given irq domain */
+static int hv_msi_domain_alloc(struct irq_domain *d, unsigned int virq,
+			       unsigned int nr_irqs, void *arg)
 {
 	/*
-	 * TODO: The allocation bits of hv_irq_compose_msi_msg(), i.e. everything except
-	 * entry_to_msi_msg() should be in here.
+	 * TODO: The allocation bits of hv_irq_compose_msi_msg(), i.e.
+	 *	 everything except entry_to_msi_msg() should be in here.
 	 */
-
 	int ret;
 
 	ret = irq_domain_alloc_irqs_parent(d, virq, nr_irqs, arg);
@@ -344,13 +351,15 @@ static int hv_msi_domain_alloc(struct irq_domain *d, unsigned int virq, unsigned
 		return ret;
 
 	for (int i = 0; i < nr_irqs; ++i) {
-		irq_domain_set_info(d, virq + i, 0, &hv_pci_msi_controller, NULL,
-				    handle_edge_irq, NULL, "edge");
+		irq_domain_set_info(d, virq + i, 0, &hv_pci_msi_controller,
+				    NULL, handle_edge_irq, NULL, "edge");
 	}
+
 	return 0;
 }
 
-static void hv_msi_domain_free(struct irq_domain *d, unsigned int virq, unsigned int nr_irqs)
+static void hv_msi_domain_free(struct irq_domain *d, unsigned int virq,
+			       unsigned int nr_irqs)
 {
 	for (int i = 0; i < nr_irqs; ++i) {
 		struct irq_data *irqd = irq_domain_get_irq_data(d, virq);
@@ -362,6 +371,7 @@ static void hv_msi_domain_free(struct irq_domain *d, unsigned int virq, unsigned
 
 		hv_teardown_msi_irq(to_pci_dev(desc->dev), irqd);
 	}
+
 	irq_domain_free_irqs_top(d, virq, nr_irqs);
 }
 
@@ -394,25 +404,25 @@ struct irq_domain * __init hv_create_pci_msi_domain(void)
 
 int hv_unmap_ioapic_interrupt(int ioapic_id, struct hv_interrupt_entry *entry)
 {
-	union hv_device_id device_id;
+	union hv_device_id hv_devid;
 
-	device_id.as_uint64 = 0;
-	device_id.device_type = HV_DEVICE_TYPE_IOAPIC;
-	device_id.ioapic.ioapic_id = (u8)ioapic_id;
+	hv_devid.as_uint64 = 0;
+	hv_devid.device_type = HV_DEVICE_TYPE_IOAPIC;
+	hv_devid.ioapic.ioapic_id = (u8)ioapic_id;
 
-	return hv_unmap_interrupt(device_id.as_uint64, entry);
+	return hv_unmap_interrupt(hv_devid.as_uint64, entry);
 }
 EXPORT_SYMBOL_GPL(hv_unmap_ioapic_interrupt);
 
 int hv_map_ioapic_interrupt(int ioapic_id, bool level, int cpu, int vector,
 		struct hv_interrupt_entry *entry)
 {
-	union hv_device_id device_id;
+	union hv_device_id hv_devid;
 
-	device_id.as_uint64 = 0;
-	device_id.device_type = HV_DEVICE_TYPE_IOAPIC;
-	device_id.ioapic.ioapic_id = (u8)ioapic_id;
+	hv_devid.as_uint64 = 0;
+	hv_devid.device_type = HV_DEVICE_TYPE_IOAPIC;
+	hv_devid.ioapic.ioapic_id = (u8)ioapic_id;
 
-	return hv_map_interrupt(device_id, level, cpu, vector, entry);
+	return hv_map_interrupt(hv_devid, level, cpu, vector, entry);
 }
 EXPORT_SYMBOL_GPL(hv_map_ioapic_interrupt);
-- 
2.51.2.vfs.0.1


^ permalink raw reply related

* Re: [PATCH] x86/hyperv: Cosmetic changes in irqdomain.c for readability
From: sashiko-bot @ 2026-06-01 22:59 UTC (permalink / raw)
  To: Mukesh R; +Cc: linux-hyperv
In-Reply-To: <20260601225116.956392-1-mrathor@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:

Pre-existing issues:
- [Critical] Allocation or mapping failure in `hv_irq_compose_msi_msg()` causes the function to return without initializing the caller-provided `msg` structure, potentially leading to uncontrolled MSI writes by the device.
- [High] The `hv_msi_domain_free` function fails to iterate through allocated IRQs for multi-vector MSI, leading to leaked memory and stale hypervisor mappings.
--

commit 042e553a37dead023377268bdb9c0d43c5394d75
Author: Mukesh R <mrathor@linux.microsoft.com>

x86/hyperv: Cosmetic changes in irqdomain.c for readability

This commit makes various cosmetic changes to irqdomain.c, such as renaming
variables, fixing indentation, replacing spaces with tabs, and resolving
checkpatch warnings, with no functional changes intended.

> diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
> index 365e364268d9a..b3ad50a874dca 100644
> --- a/arch/x86/hyperv/irqdomain.c
> +++ b/arch/x86/hyperv/irqdomain.c

[ ... ]

> @@ -249,10 +253,8 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>  	}
>  
>  	stored_entry = kzalloc_obj(*stored_entry, GFP_ATOMIC);
> -	if (!stored_entry) {
> -		pr_debug("%s: failed to allocate chip data\n", __func__);
> +	if (!stored_entry)
>  		return;
> -	}
>  
>  	ret = hv_map_msi_interrupt(data, stored_entry);
>  	if (ret) {
>  		kfree(stored_entry);
>  		return;
>  	}

[Severity: High]
This is a pre-existing issue, but does returning early here leave the
caller-provided msg structure uninitialized?

If kzalloc_obj() or hv_map_msi_interrupt() fails in hv_irq_compose_msi_msg(),
the function returns without populating msg. Because the callback returns
void, the error is swallowed and the caller might write the uninitialized
contents of msg to the device's MSI address/data registers, potentially
causing the device to perform DMA writes to arbitrary or null physical
addresses when an interrupt fires.
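
As a rough userspace illustration of the concern (not the driver's code),
the sketch below clears the message before any early return, so a failed
mapping leaves a known value rather than stack contents. The names
demo_msi_msg and demo_compose_msi_msg, the map_ok parameter and the
address/data constants are hypothetical stand-ins. Whether an all-zero
message is the right failure value for the real callback is not settled
in this thread.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's struct msi_msg. */
struct demo_msi_msg {
	uint32_t address_lo;
	uint32_t address_hi;
	uint32_t data;
};

/*
 * Hypothetical model of a void compose callback. map_ok models whether
 * the allocation and the hypervisor mapping both succeeded.
 */
static void demo_compose_msi_msg(int map_ok, struct demo_msi_msg *msg)
{
	/* Clear first, so every early return leaves a defined value. */
	memset(msg, 0, sizeof(*msg));

	if (!map_ok)
		return;

	/* Illustrative values only; not taken from the driver. */
	msg->address_lo = 0xfee00000u;
	msg->data = 0x4021u;
}
```

Because the callback returns void, the caller cannot see the failure
either way; the sketch only makes the leftover contents predictable.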

[ ... ]

> -static void hv_msi_domain_free(struct irq_domain *d, unsigned int virq, unsigned int nr_irqs)
> +static void hv_msi_domain_free(struct irq_domain *d, unsigned int virq,
> +			       unsigned int nr_irqs)
>  {
>  	for (int i = 0; i < nr_irqs; ++i) {
>  		struct irq_data *irqd = irq_domain_get_irq_data(d, virq);

[Severity: High]
This is a pre-existing issue, but does this loop process the same base IRQ
on every iteration?

By calling irq_domain_get_irq_data() with virq rather than virq + i inside
hv_msi_domain_free(), subsequent iterations will skip the remaining
interrupts in a multi-vector MSI scenario. This could leave them unfreed
and potentially lead to leaked kernel memory and stale hypervisor mappings.
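
The difference between the two loop shapes can be shown with a small
userspace model. Everything here (demo_irq_slot, the demo_* functions,
DEMO_NR_IRQS) is hypothetical and only mirrors the indexing pattern, not
the irqdomain API.

```c
#include <stddef.h>

#define DEMO_NR_IRQS 4

/* Hypothetical per-IRQ slot standing in for irq_data->chip_data. */
struct demo_irq_slot {
	int mapped;
};

/* Shape flagged by the review: the base IRQ is looked up on every pass. */
static void demo_free_base_only(struct demo_irq_slot *slots,
				unsigned int virq, unsigned int nr_irqs)
{
	for (unsigned int i = 0; i < nr_irqs; ++i)
		slots[virq].mapped = 0;
}

/* Shape the review implies: the lookup advances with the loop index. */
static void demo_free_each(struct demo_irq_slot *slots,
			   unsigned int virq, unsigned int nr_irqs)
{
	for (unsigned int i = 0; i < nr_irqs; ++i)
		slots[virq + i].mapped = 0;
}

/* Count the slots that are still marked as mapped. */
static unsigned int demo_count_mapped(const struct demo_irq_slot *slots,
				      size_t n)
{
	unsigned int count = 0;

	for (size_t i = 0; i < n; ++i)
		count += slots[i].mapped ? 1 : 0;
	return count;
}
```

With four mapped slots, the first variant leaves three of them mapped,
while the second releases all four.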

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260601225116.956392-1-mrathor@linux.microsoft.com?part=1

^ permalink raw reply

* Re: [PATCH v4 02/47] x86/tsc: Add a standalone helpers for getting TSC info from CPUID.0x15
From: Borislav Petkov @ 2026-06-02  3:49 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Dave Hansen, x86,
	Kiryl Shutsemau, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov, Jan Kiszka,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	John Stultz, H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley,
	Thomas Gleixner
In-Reply-To: <20260529144435.704127-3-seanjc@google.com>

On Fri, May 29, 2026 at 07:43:49AM -0700, Sean Christopherson wrote:
> +static int cpuid_get_tsc_info(struct cpuid_tsc_info *info)
> +{
> +	unsigned int ecx_hz, edx;
> +
> +	memset(info, 0, sizeof(*info));

Let's not clear this unnecessarily...

> +
> +	if (boot_cpu_data.cpuid_level < CPUID_LEAF_TSC)
> +		return -ENOENT;

... just to return here...

> +
> +	/* CPUID 15H TSC/Crystal ratio, plus optionally Crystal Hz */
> +	cpuid(CPUID_LEAF_TSC, &info->denominator, &info->numerator, &ecx_hz, &edx);
> +
> +	if (!info->denominator || !info->numerator)
> +		return -ENOENT;

... or here.

We wanna clear it here, when we'll return success.

> +
> +	/*
> +	 * Note, some CPUs provide the multiplier information, but not the core

	Note: some CPUs...

> +	 * crystal frequency.  The multiplier information is still useful for
> +	 * such CPUs, as the crystal frequency can be gleaned from CPUID.0x16.
> +	 */
> +	info->crystal_khz = ecx_hz / 1000;
> +	return 0;
> +}
> +
> +int __init cpuid_get_tsc_freq(struct cpuid_tsc_info *info)
> +{
> +	if (cpuid_get_tsc_info(info) || !info->crystal_khz)
> +		return -ENOENT;
> +
> +	info->tsc_khz = info->crystal_khz * info->numerator / info->denominator;
> +	return 0;
> +}

Unused here. Add it with its first user pls.
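
One possible reading of the memset comments, as a userspace sketch: since
the quoted code has cpuid() write straight into info, clearing after that
call would wipe the ratio, so this version assumes the register values
are taken as plain arguments and copied in only on the success path. The
demo_* names and the struct are hypothetical mirrors of the quoted patch,
and the CPUID outputs are passed in rather than read from hardware.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical mirror of the fields used in the quoted patch. */
struct demo_tsc_info {
	unsigned int denominator;
	unsigned int numerator;
	unsigned int crystal_khz;
	unsigned int tsc_khz;
};

/* eax/ebx/ecx model the CPUID.0x15 outputs. */
static int demo_get_tsc_info(unsigned int eax_denominator,
			     unsigned int ebx_numerator,
			     unsigned int ecx_hz,
			     struct demo_tsc_info *info)
{
	if (!eax_denominator || !ebx_numerator)
		return -ENOENT;

	/* Clear only once success is certain, then fill from the locals. */
	memset(info, 0, sizeof(*info));
	info->denominator = eax_denominator;
	info->numerator = ebx_numerator;
	info->crystal_khz = ecx_hz / 1000;
	return 0;
}

/* TSC frequency = crystal frequency * numerator / denominator. */
static int demo_get_tsc_freq(unsigned int eax_denominator,
			     unsigned int ebx_numerator,
			     unsigned int ecx_hz,
			     struct demo_tsc_info *info)
{
	if (demo_get_tsc_info(eax_denominator, ebx_numerator, ecx_hz, info) ||
	    !info->crystal_khz)
		return -ENOENT;

	info->tsc_khz = info->crystal_khz * info->numerator / info->denominator;
	return 0;
}
```

For example, a 24 MHz crystal with a 200/2 ratio gives
24000 * 200 / 2 = 2400000 kHz.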

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply

* Re: [PATCH net v3] net: mana: Optimize irq affinity for low vcpu configs
From: sashiko-bot @ 2026-06-02 10:28 UTC (permalink / raw)
  To: Shradha Gupta; +Cc: linux-hyperv
In-Reply-To: <20260601102749.1768304-1-shradhagupta@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] Missing error check for xa_store() leads to resource leak and potential kernel oops on allocation failure.
--

commit 853688f50dc4ca56dd74698a3ecfdfeaf857e716
Author: Shradha Gupta <shradhagupta@linux.microsoft.com>

net: mana: Optimize irq affinity for low vcpu configs

This patch optimizes irq affinity in the mana driver for low vCPU
environments. When the number of allocated irqs is greater than the vCPU
count, it ensures that queue irqs are spread across all available vCPUs
to prevent softirq handling imbalances.

> diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
[ ... ]
> @@ -1767,13 +1784,42 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
>  	 * first CPU sibling group since they are already affinitized to HWC IRQ
>  	 */
>  	cpus_read_lock();
> -	if (gc->num_msix_usable <= num_online_cpus())
> -		skip_first_cpu = true;
> +	if (gc->num_msix_usable <= num_online_cpus()) {
> +		err = irq_setup(irqs, nvec, gc->numa_node, true);
> +		if (err) {
> +			cpus_read_unlock();
> +			goto free_irq;

[Severity: High]
This isn't a bug introduced by this patch, but looking at the error handling
path for goto free_irq, is there a potential resource leak if xa_store()
fails earlier in the function?

When setting up the irqs in mana_gd_setup_dyn_irqs(), we call xa_store()
with GFP_KERNEL:

    xa_store(&gc->irq_contexts, i, gic, GFP_KERNEL);

However, the return value is not checked. If this allocation fails, the gic
pointer is not saved. If a subsequent operation like irq_setup() fails and
we jump to the free_irq label, the cleanup loop attempts to retrieve it:

    gic = xa_load(&gc->irq_contexts, i);
    if (WARN_ON(!gic))
        continue;

Because the store failed, xa_load() will return NULL, causing the loop
to continue without calling free_irq() or kfree(gic). Could this result
in an un-freed irq handler that might trigger a panic later if the
module is unloaded and the hardware raises an interrupt?
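
The leak pattern described above can be modelled in userspace. This is
only a sketch: demo_store() is a hypothetical stand-in that returns 0 or
-ENOMEM (in the kernel, the result of xa_store() would be tested with
xa_is_err()/xa_err() instead), and demo_live simply counts contexts that
were allocated and never released.

```c
#include <errno.h>
#include <stdlib.h>

#define DEMO_SLOTS 4

/* Hypothetical per-IRQ context. */
struct demo_ctx {
	int irq_requested;
};

/* Hypothetical table; fail_at selects the index whose store fails. */
struct demo_table {
	struct demo_ctx *slot[DEMO_SLOTS];
	int fail_at;
};

static int demo_live;	/* contexts allocated and not yet released */

static struct demo_ctx *demo_alloc(void)
{
	struct demo_ctx *c = calloc(1, sizeof(*c));

	if (c)
		demo_live++;
	return c;
}

static void demo_release(struct demo_ctx *c)
{
	if (c) {
		free(c);
		demo_live--;
	}
}

/* Models a store that can fail with -ENOMEM. */
static int demo_store(struct demo_table *t, int i, struct demo_ctx *c)
{
	if (i == t->fail_at)
		return -ENOMEM;
	t->slot[i] = c;
	return 0;
}

/* Set up all slots, then run the cleanup loop; returns leaked contexts. */
static int demo_setup_then_fail(int fail_at, int check_store)
{
	struct demo_table t = { .fail_at = fail_at };
	int i;

	demo_live = 0;
	for (i = 0; i < DEMO_SLOTS; i++) {
		struct demo_ctx *c = demo_alloc();

		if (!c)
			break;
		/* With the check, a failed store releases c right away. */
		if (demo_store(&t, i, c) && check_store) {
			demo_release(c);
			break;
		}
	}

	/* Cleanup only sees what made it into the table. */
	for (i = 0; i < DEMO_SLOTS; i++) {
		demo_release(t.slot[i]);
		t.slot[i] = NULL;
	}
	return demo_live;
}
```

In this model an unchecked failed store leaves one context unreachable
at cleanup, while checking the store result leaves none.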

> +		}
> +	} else {
[ ... ]

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260601102749.1768304-1-shradhagupta@linux.microsoft.com?part=1

^ permalink raw reply

* [PATCH net] hv_netvsc: use kmap_local_page in netvsc_copy_to_send_buf
From: LeantionX @ 2026-06-02 15:52 UTC (permalink / raw)
  To: netdev
  Cc: linux-hyperv, kys, haiyangz, wei.liu, decui, longli,
	andrew+netdev, kuba, pabeni, edumazet, davem, stable,
	linux-kernel, Anton Leontev

From: Anton Leontev <leontyevantony@gmail.com>

netvsc_copy_to_send_buf() copies skb fragment pages into the shared
VMBus send buffer using phys_to_virt() on the fragment PFN. On 32-bit
x86 with CONFIG_HIGHMEM=y, phys_to_virt() (i.e. __va()) is only valid
for LOWMEM addresses below 896 MiB. For a HIGHMEM page it returns an
address that has no kernel page table entry and lies outside the
kernel direct map, so the subsequent memcpy() faults. As this happens
on the transmit softirq path, the fault is fatal.

A HIGHMEM fragment reaches this path whenever the page backing an skb
fragment lives above the LOWMEM boundary, which is common on a 32-bit
guest with several GiB of RAM (for example when the in-kernel NFS
server splices page cache pages directly into the reply skb).

Map the fragment page on demand with kmap_local_page()/kunmap_local()
instead. Using pfn_to_page() on pb[i].pfn maps exactly the page
described by the page buffer entry. On configurations without HIGHMEM
(amd64, i386 without CONFIG_HIGHMEM) kmap_local_page() reduces to
page_address(), so this is a no-op there.
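
For illustration only, the LOWMEM test described above can be written as
a small userspace check. It assumes 4 KiB pages and the 896 MiB boundary
cited in this message; the real boundary depends on the kernel
configuration, and the DEMO_* names and demo_pfn_is_lowmem() are
hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT		12		/* assumed 4 KiB pages */
#define DEMO_LOWMEM_BYTES	(896ULL << 20)	/* boundary cited above */

/* True if the page frame lies below the assumed LOWMEM boundary. */
static bool demo_pfn_is_lowmem(uint64_t pfn)
{
	return (pfn << DEMO_PAGE_SHIFT) < DEMO_LOWMEM_BYTES;
}
```

Under these assumptions the last LOWMEM frame is PFN 229375
(896 MiB / 4 KiB = 229376 frames); any fragment page at or above
PFN 229376 is the case where phys_to_virt() is not valid.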

Fixes: c25aaf814a63 ("hyperv: Enable sendbuf mechanism on the send path")
Cc: stable@vger.kernel.org
Signed-off-by: Anton Leontev <leontyevantony@gmail.com>
---
 drivers/net/hyperv/netvsc.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 59e95341f9b1..6984f6c97257 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -12,6 +12,7 @@
 #include <linux/sched.h>
 #include <linux/wait.h>
 #include <linux/mm.h>
+#include <linux/highmem.h>
 #include <linux/delay.h>
 #include <linux/io.h>
 #include <linux/slab.h>
@@ -965,11 +966,13 @@ static void netvsc_copy_to_send_buf(struct netvsc_device *net_device,
 	}

 	for (i = 0; i < page_count; i++) {
-		char *src = phys_to_virt(pb[i].pfn << HV_HYP_PAGE_SHIFT);
+		struct page *page = pfn_to_page(pb[i].pfn);
+		char *src = kmap_local_page(page);
 		u32 offset = pb[i].offset;
 		u32 len = pb[i].len;

 		memcpy(dest, (src + offset), len);
+		kunmap_local(src);
 		dest += len;
 	}

-- 
2.43.0

^ permalink raw reply related

* Re: [PATCH net-next] net: mana: Cache MANA_QUERY_LINK_CONFIG result to avoid repeated HWC queries
From: Jakub Kicinski @ 2026-06-02 20:21 UTC (permalink / raw)
  To: Erni Sri Satya Vennela
  Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, pabeni, kotaranov, horms, dipayanroy, kees,
	linux-hyperv, netdev, linux-kernel, linux-rdma
In-Reply-To: <20260528180757.1536640-1-ernis@linux.microsoft.com>

On Thu, 28 May 2026 11:07:51 -0700 Erni Sri Satya Vennela wrote:
> mana_query_link_cfg() sends an HWC command to firmware on every call,
> but the link speed and QoS values it returns only change when the
> driver explicitly calls mana_set_bw_clamp(). This function is called
> not only by userspace via ethtool get_link_ksettings, but also
> periodically by hv_netvsc through netvsc_get_link_ksettings and by
> the sysfs speed_show attribute via dev_attr_show, resulting in
> unnecessary HWC traffic every few minutes.

mana is ops-locked, right? Because you support net shapers

Could you instead take the netdev_lock() in the work?
It's already held around the user space originated calls.

^ permalink raw reply

* [PATCH net-next v10 0/2] net: mana: add ethtool private flag for full-page RX buffers
From: Dipayaan Roy @ 2026-06-02 20:24 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov,
	pavan.chebbi

On some ARM64 platforms with 4K PAGE_SIZE, utilizing page_pool
fragments for allocation in the RX refill path (~2kB buffer per fragment)
causes 15-20% throughput regression under high connection counts
(>16 TCP streams at 180+ Gbps). Using full-page buffers on these
platforms shows no regression and restores line-rate performance.

This behavior is observed on a single platform; other platforms
perform better with page_pool fragments, indicating this is not a
page_pool issue but platform-specific.

This series adds an ethtool private flag "full-page-rx" to let the
user opt in to one RX buffer per page:

  ethtool --set-priv-flags eth0 full-page-rx on

There is no behavioral change by default. The flag can be persisted
via udev rule for affected platforms.

This series depends on the following fixes now merged in net-next:
  17bfe0a8c014 ("net: mana: add NULL guards in teardown path to prevent panic")
  5b05aa36ee24 ("net: mana: skip redundant detach on already-detached port")

Changes in v10:
  - Rebased on net-next which now includes the prerequisite fixes.
  - Recovery logic in mana_set_priv_flags() leverages the idempotent
    mana_detach() from the merged fixes.
Changes in v9:
  - Added correct tree.
Changes in v8:
  - Fixed queue_reset_work recovery by restoring port_is_up before
    scheduling reset so the handler can properly re-attach.
  - Simplified "err && schedule_port_reset" to "schedule_port_reset".
Changes in v7:
  - Rebased onto net-next.
  - Retained private flag approach after David Wei's testing on
    Grace (ARM64) confirmed that fragment mode outperforms
    full-page mode on other platforms, validating this is a
    single-platform workaround rather than a generic issue.
Changes in v6:
  - Added missed maintainers.
Changes in v5:
  - Split prep refactor into separate patch (patch 1/2)
Changes in v4:
  - Dropped the SMBIOS string parsing and added an ethtool private flag
    to reconfigure the queues with full-page RX buffers.
Changes in v3:
  - Changed u8* to char*.
Changes in v2:
  - Separated reading of the string index from reading of the string;
    removed inline.

Dipayaan Roy (2):
  net: mana: refactor mana_get_strings() and mana_get_sset_count() to
    use switch
  net: mana: force full-page RX buffers via ethtool private flag

 drivers/net/ethernet/microsoft/mana/mana_en.c |  22 ++-
 .../ethernet/microsoft/mana/mana_ethtool.c    | 178 +++++++++++++++---
 include/net/mana/mana.h                       |   8 +
 3 files changed, 177 insertions(+), 31 deletions(-)

-- 
2.43.0


^ permalink raw reply
