Linux-HyperV List
* [PATCH net-next v4] net: mana: Add Interrupt Moderation support
From: Haiyang Zhang @ 2026-06-13 20:57 UTC
  To: linux-hyperv, netdev, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Konstantin Taranov, Simon Horman,
	Shradha Gupta, Erni Sri Satya Vennela, Dipayaan Roy, Aditya Garg,
	Breno Leitao, linux-kernel, linux-rdma
  Cc: paulros

From: Haiyang Zhang <haiyangz@microsoft.com>

Add static and Dynamic Interrupt Moderation (DIM) support for Rx and Tx.
Extend the queue creation request with new fields that carry the
moderation settings.
Add functions to collect stats for DIM, and workers to update DIM data
and settings.
Update the ethtool handlers to get/set the moderation settings.
To avoid detach/re-attach operations, ring the DIM doorbell to change
settings at run time.
By default, adaptive-rx/tx (DIM) is enabled if supported by HW.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
v4:
  Fixed the tx stats, concurrency, and memory barrier issues raised in
  Simon's review.

v3:
  Updated to avoid detach/re-attach ops as suggested by Paolo.

v2:
  Updated with comments from Jedrzej.
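
For readers following the doorbell change, here is a minimal userspace sketch of how the 32-bit DIM doorbell value could be packed and unpacked. The bit positions are taken from the MANA_INTR_MODR_* masks in this patch; the helper names (dim_pack() and the dim_unpack_*() functions) are hypothetical and only mirror what FIELD_PREP()/FIELD_GET() do in mana_gd_ring_dim() and mana_gd_ring_doorbell(). The mod_comps_vld flag is not part of this value; the patch passes it separately.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Field layout copied from the patch's mana.h additions. */
#define MODR_USEC_MASK   0x000003FFu  /* GENMASK(9, 0)   */
#define MODR_USEC_VLD    0x00008000u  /* BIT(15)         */
#define MODR_COMP_SHIFT  16
#define MODR_COMP_MASK   0x00FF0000u  /* GENMASK(23, 16) */

/* Hypothetical stand-in for the packing done in mana_gd_ring_dim(). */
static uint32_t dim_pack(uint32_t usec, bool usec_vld, uint32_t comps)
{
	uint32_t val;

	/* Callers are expected to clamp or reject larger values first. */
	assert(usec <= MODR_USEC_MASK);
	assert(comps <= (MODR_COMP_MASK >> MODR_COMP_SHIFT));

	val = usec | (comps << MODR_COMP_SHIFT);
	if (usec_vld)
		val |= MODR_USEC_VLD;
	return val;
}

/* Hypothetical inverses, matching the FIELD_GET() calls on the other side. */
static uint32_t dim_unpack_usec(uint32_t val)
{
	return val & MODR_USEC_MASK;
}

static uint32_t dim_unpack_comps(uint32_t val)
{
	return (val & MODR_COMP_MASK) >> MODR_COMP_SHIFT;
}

static bool dim_unpack_usec_vld(uint32_t val)
{
	return (val & MODR_USEC_VLD) != 0;
}
```

Packing 64 usec and 32 completions with the usec-valid bit set gives 0x00208040 under this layout.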

---
 drivers/net/ethernet/microsoft/Kconfig        |   1 +
 .../net/ethernet/microsoft/mana/gdma_main.c   |  29 +++
 drivers/net/ethernet/microsoft/mana/mana_en.c | 171 ++++++++++++++++++
 .../ethernet/microsoft/mana/mana_ethtool.c    | 167 ++++++++++++++++-
 include/net/mana/gdma.h                       |  24 ++-
 include/net/mana/mana.h                       |  54 ++++++
 6 files changed, 437 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/Kconfig b/drivers/net/ethernet/microsoft/Kconfig
index 3f36ee6a8ece..e9be18c92ca5 100644
--- a/drivers/net/ethernet/microsoft/Kconfig
+++ b/drivers/net/ethernet/microsoft/Kconfig
@@ -21,6 +21,7 @@ config MICROSOFT_MANA
 	depends on X86_64 || (ARM64 && !CPU_BIG_ENDIAN)
 	depends on PCI_HYPERV
 	select AUXILIARY_BUS
+	select DIMLIB
 	select PAGE_POOL
 	select NET_SHAPER
 	help
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index c9ec80a1dd6f..7a012b1e5751 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
 /* Copyright (c) 2021, Microsoft Corporation. */
 
+#include <linux/bitfield.h>
 #include <linux/debugfs.h>
 #include <linux/module.h>
 #include <linux/pci.h>
@@ -464,6 +465,7 @@ static int mana_gd_disable_queue(struct gdma_queue *queue)
 #define DOORBELL_OFFSET_RQ	0x400
 #define DOORBELL_OFFSET_CQ	0x800
 #define DOORBELL_OFFSET_EQ	0xFF8
+#define DOORBELL_OFFSET_DIM	0x820
 
 static void mana_gd_ring_doorbell(struct gdma_context *gc, u32 db_index,
 				  enum gdma_queue_type q_type, u32 qid,
@@ -504,6 +506,16 @@ static void mana_gd_ring_doorbell(struct gdma_context *gc, u32 db_index,
 		addr += DOORBELL_OFFSET_SQ;
 		break;
 
+	case GDMA_DIM:
+		e.dim.id = qid;
+		e.dim.mod_usec = FIELD_GET(MANA_INTR_MODR_USEC_MAX, tail_ptr);
+		e.dim.mod_usec_vld = !!(tail_ptr & MANA_INTR_MODR_USEC_VLD);
+		e.dim.mod_comps = FIELD_GET(MANA_INTR_MODR_COMP_MASK, tail_ptr);
+		e.dim.mod_comps_vld = num_req;
+
+		addr += DOORBELL_OFFSET_DIM;
+		break;
+
 	default:
 		WARN_ON(1);
 		return;
@@ -538,6 +550,23 @@ void mana_gd_ring_cq(struct gdma_queue *cq, u8 arm_bit)
 }
 EXPORT_SYMBOL_NS(mana_gd_ring_cq, "NET_MANA");
 
+void mana_gd_ring_dim(struct gdma_queue *cq, u32 mod_usec, bool mod_usec_vld,
+		      u32 mod_comps, bool mod_comps_vld)
+{
+	struct gdma_context *gc = cq->gdma_dev->gdma_context;
+	u32 dim_val;
+
+	/* Convert the DIM values to doorbell parameters */
+	dim_val = FIELD_PREP(MANA_INTR_MODR_USEC_MAX, mod_usec) |
+		  FIELD_PREP(MANA_INTR_MODR_COMP_MASK, mod_comps);
+	if (mod_usec_vld)
+		dim_val |= MANA_INTR_MODR_USEC_VLD;
+
+	mana_gd_ring_doorbell(gc, cq->gdma_dev->doorbell, GDMA_DIM, cq->id,
+			      dim_val, mod_comps_vld);
+}
+EXPORT_SYMBOL_NS(mana_gd_ring_dim, "NET_MANA");
+
 #define MANA_SERVICE_PERIOD 10
 
 static void mana_serv_rescan(struct pci_dev *pdev)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 26aef21c6c2c..d36850084f2e 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1579,6 +1579,9 @@ int mana_create_wq_obj(struct mana_port_context *apc,
 
 	mana_gd_init_req_hdr(&req.hdr, MANA_CREATE_WQ_OBJ,
 			     sizeof(req), sizeof(resp));
+
+	req.hdr.req.msg_version = GDMA_MESSAGE_V3;
+	req.hdr.resp.msg_version = GDMA_MESSAGE_V2;
 	req.vport = vport;
 	req.wq_type = wq_type;
 	req.wq_gdma_region = wq_spec->gdma_region;
@@ -1587,6 +1590,9 @@ int mana_create_wq_obj(struct mana_port_context *apc,
 	req.cq_size = cq_spec->queue_size;
 	req.cq_moderation_ctx_id = cq_spec->modr_ctx_id;
 	req.cq_parent_qid = cq_spec->attached_eq;
+	req.req_cq_moderation = cq_spec->req_cq_moderation;
+	req.cq_moderation_comp = cq_spec->cq_moderation_comp;
+	req.cq_moderation_usec = cq_spec->cq_moderation_usec;
 
 	err = mana_send_request(apc->ac, &req, sizeof(req), &resp,
 				sizeof(resp));
@@ -1844,6 +1850,7 @@ static void mana_poll_tx_cq(struct mana_cq *cq)
 	struct gdma_posted_wqe_info *wqe_info;
 	unsigned int pkt_transmitted = 0;
 	unsigned int wqe_unit_cnt = 0;
+	unsigned int tx_bytes = 0;
 	struct mana_txq *txq = cq->txq;
 	struct mana_port_context *apc;
 	struct netdev_queue *net_txq;
@@ -1925,6 +1932,8 @@ static void mana_poll_tx_cq(struct mana_cq *cq)
 
 		mana_unmap_skb(skb, apc);
 
+		tx_bytes += skb->len;
+
 		napi_consume_skb(skb, cq->budget);
 
 		pkt_transmitted++;
@@ -1955,6 +1964,10 @@ static void mana_poll_tx_cq(struct mana_cq *cq)
 	if (atomic_sub_return(pkt_transmitted, &txq->pending_sends) < 0)
 		WARN_ON_ONCE(1);
 
+	/* Feed DIM with the completion rate observed here, in NAPI context. */
+	cq->tx_dim_pkts += pkt_transmitted;
+	cq->tx_dim_bytes += tx_bytes;
+
 	cq->work_done = pkt_transmitted;
 }
 
@@ -2306,6 +2319,119 @@ static void mana_poll_rx_cq(struct mana_cq *cq)
 		xdp_do_flush();
 }
 
+static void mana_rx_dim_work(struct work_struct *work)
+{
+	struct dim *dim = container_of(work, struct dim, work);
+	struct dim_cq_moder cur_moder;
+	struct mana_cq *cq;
+
+	cur_moder = net_dim_get_rx_moderation(dim->mode, dim->profile_ix);
+	cq = container_of(dim, struct mana_cq, dim);
+
+	cur_moder.usec = min_t(u16, cur_moder.usec, MANA_INTR_MODR_USEC_MAX);
+	cur_moder.pkts = min_t(u16, cur_moder.pkts, MANA_INTR_MODR_COMP_MAX);
+
+	mana_gd_ring_dim(cq->gdma_cq, cur_moder.usec, true,
+			 cur_moder.pkts, true);
+
+	dim->state = DIM_START_MEASURE;
+}
+
+static void mana_tx_dim_work(struct work_struct *work)
+{
+	struct dim *dim = container_of(work, struct dim, work);
+	struct dim_cq_moder cur_moder;
+	struct mana_cq *cq;
+
+	cur_moder = net_dim_get_tx_moderation(dim->mode, dim->profile_ix);
+	cq = container_of(dim, struct mana_cq, dim);
+
+	cur_moder.usec = min_t(u16, cur_moder.usec, MANA_INTR_MODR_USEC_MAX);
+	cur_moder.pkts = min_t(u16, cur_moder.pkts, MANA_INTR_MODR_COMP_MAX);
+
+	mana_gd_ring_dim(cq->gdma_cq, cur_moder.usec, true,
+			 cur_moder.pkts, true);
+
+	dim->state = DIM_START_MEASURE;
+}
+
+/* The caller must clear apc->rx/tx_dim_enabled before disabling and set
+ * it after enabling. On disable it must also call synchronize_net()
+ * before draining the DIM work, so that NAPI cannot observe a stale flag.
+ */
+int mana_dim_change(struct mana_cq *cq, bool enable)
+{
+	bool is_rx = cq->type == MANA_CQ_TYPE_RX;
+	struct mana_port_context *apc;
+	work_func_t work_func;
+	u32 usec, comp;
+
+	if (is_rx) {
+		apc = netdev_priv(cq->rxq->ndev);
+		usec = apc->intr_modr_rx_usec;
+		comp = apc->intr_modr_rx_comp;
+		work_func = mana_rx_dim_work;
+	} else {
+		apc = netdev_priv(cq->txq->ndev);
+		usec = apc->intr_modr_tx_usec;
+		comp = apc->intr_modr_tx_comp;
+		work_func = mana_tx_dim_work;
+	}
+
+	/* On enable, zero the DIM state so net_dim() starts measuring from
+	 * scratch.
+	 * On disable, drain any pending DIM work and restore the static
+	 * moderation values.
+	 */
+	if (enable) {
+		memset(&cq->dim, 0, sizeof(cq->dim));
+		cq->dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
+		INIT_WORK(&cq->dim.work, work_func);
+	} else {
+		cancel_work_sync(&cq->dim.work);
+		mana_gd_ring_dim(cq->gdma_cq, usec, true, comp, true);
+	}
+
+	return 0;
+}
+
+static void mana_update_rx_dim(struct mana_cq *cq)
+{
+	struct mana_port_context *apc = netdev_priv(cq->rxq->ndev);
+	struct dim_sample dim_sample = {};
+	struct mana_rxq *rxq = cq->rxq;
+
+	/* Pairs with smp_store_release() in mana_set_coalesce(): observing the
+	 * enable flag set guarantees the DIM (re)initialization is visible.
+	 */
+	if (!smp_load_acquire(&apc->rx_dim_enabled))
+		return;
+
+	dim_update_sample(READ_ONCE(cq->dim_event_ctr), rxq->stats.packets,
+			  rxq->stats.bytes, &dim_sample);
+	net_dim(&cq->dim, &dim_sample);
+}
+
+static void mana_update_tx_dim(struct mana_cq *cq)
+{
+	struct mana_port_context *apc = netdev_priv(cq->txq->ndev);
+	struct dim_sample dim_sample = {};
+
+	/* Pairs with smp_store_release() in mana_set_coalesce(): observing the
+	 * enable flag set guarantees the DIM (re)initialization is visible.
+	 */
+	if (!smp_load_acquire(&apc->tx_dim_enabled))
+		return;
+
+	/* cq->tx_dim_pkts/bytes are accumulated in mana_poll_tx_cq(), in the
+	 * same NAPI context as this read, so they track the hardware
+	 * completion rate and need no u64_stats_sync protection.
+	 */
+	dim_update_sample(READ_ONCE(cq->dim_event_ctr), cq->tx_dim_pkts,
+			  cq->tx_dim_bytes, &dim_sample);
+	net_dim(&cq->dim, &dim_sample);
+}
+
 static int mana_cq_handler(void *context, struct gdma_queue *gdma_queue)
 {
 	struct mana_cq *cq = context;
@@ -2324,6 +2450,15 @@ static int mana_cq_handler(void *context, struct gdma_queue *gdma_queue)
 	if (w < cq->budget) {
 		mana_gd_ring_cq(gdma_queue, SET_ARM_BIT);
 		cq->work_done_since_doorbell = 0;
+
+		/* Update DIM before napi_complete_done() to prevent running
+		 * net_dim() concurrently.
+		 */
+		if (cq->type == MANA_CQ_TYPE_RX)
+			mana_update_rx_dim(cq);
+		else
+			mana_update_tx_dim(cq);
+
 		napi_complete_done(&cq->napi, w);
 	} else if (cq->work_done_since_doorbell >=
 		   (cq->gdma_cq->queue_size / COMP_ENTRY_SIZE) * 4) {
@@ -2356,6 +2491,7 @@ static void mana_schedule_napi(void *context, struct gdma_queue *gdma_queue)
 {
 	struct mana_cq *cq = context;
 
+	WRITE_ONCE(cq->dim_event_ctr, cq->dim_event_ctr + 1);
 	napi_schedule_irqoff(&cq->napi);
 }
 
@@ -2398,6 +2534,7 @@ static void mana_destroy_txq(struct mana_port_context *apc)
 		if (apc->tx_qp[i]->txq.napi_initialized) {
 			napi_synchronize(napi);
 			napi_disable_locked(napi);
+			cancel_work_sync(&apc->tx_qp[i]->tx_cq.dim.work);
 			netif_napi_del_locked(napi);
 			apc->tx_qp[i]->txq.napi_initialized = false;
 		}
@@ -2529,6 +2666,11 @@ static int mana_create_txq(struct mana_port_context *apc,
 		cq_spec.modr_ctx_id = 0;
 		cq_spec.attached_eq = cq->gdma_cq->cq.parent->id;
 
+		/* DIM setting can be changed at runtime */
+		cq_spec.req_cq_moderation = true;
+		cq_spec.cq_moderation_usec = apc->intr_modr_tx_usec;
+		cq_spec.cq_moderation_comp = apc->intr_modr_tx_comp;
+
 		err = mana_create_wq_obj(apc, apc->port_handle, GDMA_SQ,
 					 &wq_spec, &cq_spec,
 					 &apc->tx_qp[i]->tx_object);
@@ -2559,6 +2701,13 @@ static int mana_create_txq(struct mana_port_context *apc,
 
 		set_bit(NAPI_STATE_NO_BUSY_POLL, &cq->napi.state);
 		netif_napi_add_locked(net, &cq->napi, mana_poll);
+
+		/* Initialize the DIM work before enabling NAPI, so that a poll
+		 * cannot reach net_dim() with an uninitialized cq->dim.work.
+		 */
+		INIT_WORK(&cq->dim.work, mana_tx_dim_work);
+		cq->dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
+
 		napi_enable_locked(&cq->napi);
 		txq->napi_initialized = true;
 
@@ -2596,6 +2745,7 @@ static void mana_destroy_rxq(struct mana_port_context *apc,
 		napi_synchronize(napi);
 
 		napi_disable_locked(napi);
+		cancel_work_sync(&rxq->rx_cq.dim.work);
 		netif_napi_del_locked(napi);
 	}
 
@@ -2834,6 +2984,11 @@ static struct mana_rxq *mana_create_rxq(struct mana_port_context *apc,
 	cq_spec.modr_ctx_id = 0;
 	cq_spec.attached_eq = cq->gdma_cq->cq.parent->id;
 
+	/* DIM setting can be changed at runtime */
+	cq_spec.req_cq_moderation = true;
+	cq_spec.cq_moderation_usec = apc->intr_modr_rx_usec;
+	cq_spec.cq_moderation_comp = apc->intr_modr_rx_comp;
+
 	err = mana_create_wq_obj(apc, apc->port_handle, GDMA_RQ,
 				 &wq_spec, &cq_spec, &rxq->rxobj);
 	if (err)
@@ -2866,6 +3021,12 @@ static struct mana_rxq *mana_create_rxq(struct mana_port_context *apc,
 	WARN_ON(xdp_rxq_info_reg_mem_model(&rxq->xdp_rxq, MEM_TYPE_PAGE_POOL,
 					   rxq->page_pool));
 
+	/* Initialize the DIM work before enabling NAPI, so that a poll
+	 * cannot reach net_dim() with an uninitialized cq->dim.work.
+	 */
+	INIT_WORK(&cq->dim.work, mana_rx_dim_work);
+	cq->dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
+
 	napi_enable_locked(&cq->napi);
 
 	mana_gd_ring_cq(cq->gdma_cq, SET_ARM_BIT);
@@ -3532,6 +3693,16 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
 	apc->link_cfg_error = 1;
 	apc->cqe_coalescing_enable = 0;
 
+	/* Initialize interrupt moderation settings if supported by HW */
+	if (gc->pf_cap_flags1 & GDMA_PF_CAP_FLAG_1_DYN_INTERRUPT_MODERATION) {
+		apc->intr_modr_rx_usec = MANA_INTR_MODR_USEC_DEF;
+		apc->intr_modr_rx_comp = MANA_INTR_MODR_COMP_DEF;
+		apc->intr_modr_tx_usec = MANA_INTR_MODR_USEC_DEF;
+		apc->intr_modr_tx_comp = MANA_INTR_MODR_COMP_DEF;
+		apc->rx_dim_enabled = MANA_ADAPTIVE_RX_DEF;
+		apc->tx_dim_enabled = MANA_ADAPTIVE_TX_DEF;
+	}
+
 	mutex_init(&apc->vport_mutex);
 	apc->vport_use_count = 0;
 
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index 94e658d07a27..5e5fb5b18bbf 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -419,6 +419,15 @@ static int mana_get_coalesce(struct net_device *ndev,
 	    !kernel_coal->rx_cqe_nsecs)
 		kernel_coal->rx_cqe_nsecs = MANA_RX_CQE_NSEC_DEF;
 
+	ec->rx_coalesce_usecs = apc->intr_modr_rx_usec;
+	ec->rx_max_coalesced_frames = apc->intr_modr_rx_comp;
+
+	ec->tx_coalesce_usecs = apc->intr_modr_tx_usec;
+	ec->tx_max_coalesced_frames = apc->intr_modr_tx_comp;
+
+	ec->use_adaptive_rx_coalesce = apc->rx_dim_enabled;
+	ec->use_adaptive_tx_coalesce = apc->tx_dim_enabled;
+
 	return 0;
 }
 
@@ -428,9 +437,34 @@ static int mana_set_coalesce(struct net_device *ndev,
 			     struct netlink_ext_ack *extack)
 {
 	struct mana_port_context *apc = netdev_priv(ndev);
-	u8 saved_cqe_coalescing_enable;
+	struct {
+		u16 intr_modr_rx_usec;
+		u16 intr_modr_rx_comp;
+		u16 intr_modr_tx_usec;
+		u16 intr_modr_tx_comp;
+		u8 cqe_coalescing_enable;
+		bool rx_dim_enabled;
+		bool tx_dim_enabled;
+	} saved;
+	bool modr_changed = false;
+	bool dim_changed = false;
+	struct gdma_context *gc;
 	int err;
 
+	gc = apc->ac->gdma_dev->gdma_context;
+
+	/* Both static and dynamic interrupt moderation (DIM) rely on the
+	 * same HW capability advertised by the PF.
+	 */
+	if ((ec->use_adaptive_rx_coalesce || ec->use_adaptive_tx_coalesce ||
+	     ec->rx_coalesce_usecs || ec->tx_coalesce_usecs ||
+	     ec->rx_max_coalesced_frames || ec->tx_max_coalesced_frames) &&
+	    !(gc->pf_cap_flags1 & GDMA_PF_CAP_FLAG_1_DYN_INTERRUPT_MODERATION)) {
+		NL_SET_ERR_MSG(extack,
+			       "Interrupt Moderation is not supported by HW");
+		return -EOPNOTSUPP;
+	}
+
 	if (kernel_coal->rx_cqe_frames != 1 &&
 	    kernel_coal->rx_cqe_frames != MANA_RXCOMP_OOB_NUM_PPI) {
 		NL_SET_ERR_MSG_FMT(extack,
@@ -440,18 +474,129 @@ static int mana_set_coalesce(struct net_device *ndev,
 		return -EINVAL;
 	}
 
-	saved_cqe_coalescing_enable = apc->cqe_coalescing_enable;
+	if (ec->rx_coalesce_usecs > MANA_INTR_MODR_USEC_MAX ||
+	    ec->tx_coalesce_usecs > MANA_INTR_MODR_USEC_MAX) {
+		NL_SET_ERR_MSG_FMT(extack,
+				   "coalesce usecs must be <= %lu",
+				   MANA_INTR_MODR_USEC_MAX);
+		return -EINVAL;
+	}
+
+	if (ec->rx_max_coalesced_frames > MANA_INTR_MODR_COMP_MAX ||
+	    ec->tx_max_coalesced_frames > MANA_INTR_MODR_COMP_MAX) {
+		NL_SET_ERR_MSG_FMT(extack,
+				   "coalesce frames must be <= %lu",
+				   MANA_INTR_MODR_COMP_MAX);
+		return -EINVAL;
+	}
+
+	if (ec->rx_coalesce_usecs != apc->intr_modr_rx_usec ||
+	    ec->rx_max_coalesced_frames != apc->intr_modr_rx_comp ||
+	    ec->tx_coalesce_usecs != apc->intr_modr_tx_usec ||
+	    ec->tx_max_coalesced_frames != apc->intr_modr_tx_comp)
+		modr_changed = true;
+
+	saved.intr_modr_rx_usec = apc->intr_modr_rx_usec;
+	saved.intr_modr_rx_comp = apc->intr_modr_rx_comp;
+	saved.intr_modr_tx_usec = apc->intr_modr_tx_usec;
+	saved.intr_modr_tx_comp = apc->intr_modr_tx_comp;
+
+	apc->intr_modr_rx_usec = ec->rx_coalesce_usecs;
+	apc->intr_modr_rx_comp = ec->rx_max_coalesced_frames;
+	apc->intr_modr_tx_usec = ec->tx_coalesce_usecs;
+	apc->intr_modr_tx_comp = ec->tx_max_coalesced_frames;
+
+	if (!!ec->use_adaptive_rx_coalesce != apc->rx_dim_enabled ||
+	    !!ec->use_adaptive_tx_coalesce != apc->tx_dim_enabled)
+		dim_changed = true;
+
+	saved.rx_dim_enabled = apc->rx_dim_enabled;
+	saved.tx_dim_enabled = apc->tx_dim_enabled;
+
+	saved.cqe_coalescing_enable = apc->cqe_coalescing_enable;
 	apc->cqe_coalescing_enable =
 		kernel_coal->rx_cqe_frames == MANA_RXCOMP_OOB_NUM_PPI;
 
-	if (!apc->port_is_up)
+	if (!apc->port_is_up) {
+		WRITE_ONCE(apc->rx_dim_enabled, !!ec->use_adaptive_rx_coalesce);
+		WRITE_ONCE(apc->tx_dim_enabled, !!ec->use_adaptive_tx_coalesce);
 		return 0;
+	}
 
-	err = mana_config_rss(apc, TRI_STATE_TRUE, false, false);
-	if (err)
-		apc->cqe_coalescing_enable = saved_cqe_coalescing_enable;
+	if (apc->cqe_coalescing_enable != saved.cqe_coalescing_enable) {
+		/* CQE coalescing setting is applied via RSS configuration. */
+		err = mana_config_rss(apc, TRI_STATE_TRUE, false, false);
+		if (err) {
+			netdev_err(ndev, "Change CQE coalescing failed: %d\n",
+				   err);
+			apc->cqe_coalescing_enable =
+				saved.cqe_coalescing_enable;
+			apc->intr_modr_rx_usec = saved.intr_modr_rx_usec;
+			apc->intr_modr_rx_comp = saved.intr_modr_rx_comp;
+			apc->intr_modr_tx_usec = saved.intr_modr_tx_usec;
+			apc->intr_modr_tx_comp = saved.intr_modr_tx_comp;
+			return err;
+		}
+	}
 
-	return err;
+	if (modr_changed || dim_changed) {
+		bool new_rx_dim = !!ec->use_adaptive_rx_coalesce;
+		bool new_tx_dim = !!ec->use_adaptive_tx_coalesce;
+		bool disable_rx_dim = saved.rx_dim_enabled && !new_rx_dim;
+		bool disable_tx_dim = saved.tx_dim_enabled && !new_tx_dim;
+		bool enable_rx_dim = !saved.rx_dim_enabled && new_rx_dim;
+		bool enable_tx_dim = !saved.tx_dim_enabled && new_tx_dim;
+		int q;
+
+		/* On disable: clear the per-port flag first and
+		 * synchronize_net() so any in-flight NAPI poll observes
+		 * the new value and will not schedule further DIM work;
+		 * then drain pending work and restore the static
+		 * moderation values.
+		 */
+		if (disable_rx_dim)
+			WRITE_ONCE(apc->rx_dim_enabled, false);
+		if (disable_tx_dim)
+			WRITE_ONCE(apc->tx_dim_enabled, false);
+		if (disable_rx_dim || disable_tx_dim)
+			synchronize_net();
+
+		for (q = 0; q < apc->num_queues; q++) {
+			struct mana_cq *rx_cq = &apc->rxqs[q]->rx_cq;
+			struct mana_cq *tx_cq = &apc->tx_qp[q]->tx_cq;
+
+			if (disable_rx_dim)
+				mana_dim_change(rx_cq, false);
+			else if (enable_rx_dim)
+				mana_dim_change(rx_cq, true);
+			else if (!new_rx_dim && modr_changed)
+				mana_gd_ring_dim(rx_cq->gdma_cq,
+						 apc->intr_modr_rx_usec, true,
+						 apc->intr_modr_rx_comp, true);
+
+			if (disable_tx_dim)
+				mana_dim_change(tx_cq, false);
+			else if (enable_tx_dim)
+				mana_dim_change(tx_cq, true);
+			else if (!new_tx_dim && modr_changed)
+				mana_gd_ring_dim(tx_cq->gdma_cq,
+						 apc->intr_modr_tx_usec, true,
+						 apc->intr_modr_tx_comp, true);
+		}
+
+		/* Publish the enable flag with release semantics so a
+		 * concurrent NAPI poll that observes it set also sees the DIM
+		 * (re)init done by mana_dim_change() above.
+		 */
+		if (enable_rx_dim)
+			/* pairs with smp_load_acquire() in mana_update_rx_dim() */
+			smp_store_release(&apc->rx_dim_enabled, true);
+		if (enable_tx_dim)
+			/* pairs with smp_load_acquire() in mana_update_tx_dim() */
+			smp_store_release(&apc->tx_dim_enabled, true);
+	}
+
+	return 0;
 }
 
 /* mana_set_channels - change the number of queues on a port
@@ -595,7 +740,13 @@ static int mana_get_link_ksettings(struct net_device *ndev,
 }
 
 const struct ethtool_ops mana_ethtool_ops = {
-	.supported_coalesce_params = ETHTOOL_COALESCE_RX_CQE_FRAMES,
+	.supported_coalesce_params = ETHTOOL_COALESCE_RX_CQE_FRAMES |
+				     ETHTOOL_COALESCE_RX_USECS |
+				     ETHTOOL_COALESCE_RX_MAX_FRAMES |
+				     ETHTOOL_COALESCE_TX_USECS |
+				     ETHTOOL_COALESCE_TX_MAX_FRAMES |
+				     ETHTOOL_COALESCE_USE_ADAPTIVE_RX |
+				     ETHTOOL_COALESCE_USE_ADAPTIVE_TX,
 	.op_needs_rtnl		= ETHTOOL_OP_NEEDS_RTNL_SCHANNELS |
 				  ETHTOOL_OP_NEEDS_RTNL_SRINGPARAM,
 	.get_ethtool_stats	= mana_get_ethtool_stats,
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 0c395917b214..8529cef0d7c4 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -47,6 +47,7 @@ enum gdma_queue_type {
 	GDMA_RQ,
 	GDMA_CQ,
 	GDMA_EQ,
+	GDMA_DIM,
 };
 
 enum gdma_work_request_flags {
@@ -126,6 +127,17 @@ union gdma_doorbell_entry {
 		u64 tail_ptr	: 31;
 		u64 arm		: 1;
 	} eq;
+
+	struct {
+		u64 id           : 24;
+		u64 reserved     : 8;
+		u64 mod_usec     : 10;
+		u64 reserve1     : 5;
+		u64 mod_usec_vld : 1;
+		u64 mod_comps    : 8;
+		u64 reserve2     : 7;
+		u64 mod_comps_vld: 1;
+	} dim;
 }; /* HW DATA */
 
 struct gdma_msg_hdr {
@@ -502,6 +514,9 @@ void mana_gd_ring_cq(struct gdma_queue *cq, u8 arm_bit);
 
 int mana_schedule_serv_work(struct gdma_context *gc, enum gdma_eqe_type type);
 
+void mana_gd_ring_dim(struct gdma_queue *cq, u32 mod_usec, bool mod_usec_vld,
+		      u32 mod_comps, bool mod_comps_vld);
+
 struct gdma_wqe {
 	u32 reserved	:24;
 	u32 last_vbytes	:8;
@@ -650,6 +665,9 @@ enum {
 /* Driver supports self recovery on Hardware Channel timeouts */
 #define GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECOVERY BIT(25)
 
+/* Driver supports dynamic interrupt moderation - DIM */
+#define GDMA_DRV_CAP_FLAG_1_DYN_INTERRUPT_MODERATION BIT(28)
+
 #define GDMA_DRV_CAP_FLAGS1 \
 	(GDMA_DRV_CAP_FLAG_1_EQ_SHARING_MULTI_VPORT | \
 	 GDMA_DRV_CAP_FLAG_1_NAPI_WKDONE_FIX | \
@@ -665,7 +683,8 @@ enum {
 	 GDMA_DRV_CAP_FLAG_1_PROBE_RECOVERY | \
 	 GDMA_DRV_CAP_FLAG_1_HANDLE_STALL_SQ_RECOVERY | \
 	 GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECOVERY | \
-	 GDMA_DRV_CAP_FLAG_1_EQ_MSI_UNSHARE_MULTI_VPORT)
+	 GDMA_DRV_CAP_FLAG_1_EQ_MSI_UNSHARE_MULTI_VPORT | \
+	 GDMA_DRV_CAP_FLAG_1_DYN_INTERRUPT_MODERATION)
 
 #define GDMA_DRV_CAP_FLAGS2 0
 
@@ -701,6 +720,9 @@ struct gdma_verify_ver_req {
 	u8 os_ver_str4[128];
 }; /* HW DATA */
 
+/* HW supports dynamic interrupt moderation - DIM */
+#define GDMA_PF_CAP_FLAG_1_DYN_INTERRUPT_MODERATION BIT(15)
+
 struct gdma_verify_ver_resp {
 	struct gdma_resp_hdr hdr;
 	u64 gdma_protocol_ver;
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 13c87baf018e..df4c4a3f68fa 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -4,6 +4,7 @@
 #ifndef _MANA_H
 #define _MANA_H
 
+#include <linux/dim.h>
 #include <net/xdp.h>
 #include <net/net_shaper.h>
 
@@ -64,6 +65,19 @@ enum TRI_STATE {
 /* Maximum number of packets per coalesced CQE */
 #define MANA_RXCOMP_OOB_NUM_PPI 4
 
+/* Default/max interrupt moderation settings */
+#define MANA_INTR_MODR_USEC_DEF 0
+#define MANA_INTR_MODR_COMP_DEF 0
+
+#define MANA_ADAPTIVE_RX_DEF true
+#define MANA_ADAPTIVE_TX_DEF true
+
+/* DIM doorbell value field layout */
+#define MANA_INTR_MODR_USEC_MAX    GENMASK(9, 0)
+#define MANA_INTR_MODR_USEC_VLD    BIT(15)
+#define MANA_INTR_MODR_COMP_MAX    GENMASK(7, 0)
+#define MANA_INTR_MODR_COMP_MASK   GENMASK(23, 16)
+
 /* Update this count whenever the respective structures are changed */
 #define MANA_STATS_RX_COUNT (6 + MANA_RXCOMP_OOB_NUM_PPI - 1)
 #define MANA_STATS_TX_COUNT 11
@@ -297,6 +311,17 @@ struct mana_cq {
 	int work_done;
 	int work_done_since_doorbell;
 	int budget;
+
+	/* DIM - Dynamic Interrupt Moderation */
+	struct dim dim;
+	u16 dim_event_ctr;
+
+	/* Cumulative TX completions fed to DIM. Updated and read only in
+	 * NAPI context (mana_poll_tx_cq() / mana_update_tx_dim()), so they
+	 * measure the hardware completion rate and need no u64_stats_sync.
+	 */
+	u64 tx_dim_pkts;
+	u64 tx_dim_bytes;
 };
 
 struct mana_recv_buf_oob {
@@ -573,6 +598,15 @@ struct mana_port_context {
 	u8 cqe_coalescing_enable;
 	u32 cqe_coalescing_timeout_ns;
 
+	/* Interrupt moderation settings */
+	u16 intr_modr_rx_usec;
+	u16 intr_modr_rx_comp;
+	u16 intr_modr_tx_usec;
+	u16 intr_modr_tx_comp;
+
+	bool rx_dim_enabled;
+	bool tx_dim_enabled;
+
 	struct mana_ethtool_stats eth_stats;
 
 	struct mana_ethtool_phy_stats phy_stats;
@@ -598,6 +632,8 @@ int mana_alloc_queues(struct net_device *ndev);
 int mana_attach(struct net_device *ndev);
 int mana_detach(struct net_device *ndev, bool from_close);
 
+int mana_dim_change(struct mana_cq *cq, bool enable);
+
 int mana_probe(struct gdma_dev *gd, bool resuming);
 void mana_remove(struct gdma_dev *gd, bool suspending);
 
@@ -633,6 +669,9 @@ struct mana_obj_spec {
 	u32 queue_size;
 	u32 attached_eq;
 	u32 modr_ctx_id;
+	u8 req_cq_moderation;
+	u16 cq_moderation_comp;
+	u16 cq_moderation_usec;
 };
 
 enum mana_command_code {
@@ -764,6 +803,15 @@ struct mana_create_wqobj_req {
 	u32 cq_size;
 	u32 cq_moderation_ctx_id;
 	u32 cq_parent_qid;
+
+	/* V2 */
+	u8 allow_rqwqe_chain;
+
+	/* V3 */
+	u8 req_cq_moderation;
+	u16 cq_moderation_comp;
+	u16 cq_moderation_usec;
+	u8 reserved2[2];
 }; /* HW DATA */
 
 struct mana_create_wqobj_resp {
@@ -771,6 +819,12 @@ struct mana_create_wqobj_resp {
 	u32 wq_id;
 	u32 cq_id;
 	mana_handle_t wq_obj;
+
+	/* V2 */
+	u16 cq_moderation_comp;
+	u16 cq_moderation_usec;
+	u8 cq_moderation_enabled;
+	u8 reserved1[3];
 }; /* HW DATA */
 
 /* Destroy WQ Object */
-- 
2.34.1



* RE: [EXTERNAL] Re: [PATCH net-next v3] net: mana: Add Interrupt Moderation support
From: Haiyang Zhang @ 2026-06-13 20:48 UTC
  To: Simon Horman
  Cc: linux-hyperv@vger.kernel.org, netdev@vger.kernel.org,
	KY Srinivasan, wei.liu@kernel.org, Dexuan Cui, Long Li,
	andrew+netdev@lunn.ch, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, Konstantin Taranov,
	shradhagupta@linux.microsoft.com, ernis@linux.microsoft.com,
	dipayanroy@linux.microsoft.com, gargaditya@linux.microsoft.com,
	kees@kernel.org, leitao@debian.org, linux-kernel@vger.kernel.org,
	linux-rdma@vger.kernel.org, Paul Rosswurm
In-Reply-To: <20260613082014.715350-1-horms@kernel.org>



> -----Original Message-----
> From: Simon Horman <horms@kernel.org>
> Sent: Saturday, June 13, 2026 4:20 AM
> To: haiyangz@linux.microsoft.com
> Cc: Simon Horman <horms@kernel.org>; linux-hyperv@vger.kernel.org;
> netdev@vger.kernel.org; KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
> <haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui
> <DECUI@microsoft.com>; Long Li <longli@microsoft.com>;
> andrew+netdev@lunn.ch; davem@davemloft.net; edumazet@google.com;
> kuba@kernel.org; pabeni@redhat.com; Konstantin Taranov
> <kotaranov@microsoft.com>; shradhagupta@linux.microsoft.com;
> ernis@linux.microsoft.com; dipayanroy@linux.microsoft.com;
> gargaditya@linux.microsoft.com; kees@kernel.org; leitao@debian.org; linux-
> kernel@vger.kernel.org; linux-rdma@vger.kernel.org; Paul Rosswurm
> <paulros@microsoft.com>
> Subject: [EXTERNAL] Re: [PATCH net-next v3] net: mana: Add Interrupt
> Moderation support
> 
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> Full review at: https://sashiko.dev/
> ---
> Date:   Thu Jun 11 19:02:22 2026 +0000
> 
>     net: mana: Add Interrupt Moderation support
> 
>     This commit adds Static and Dynamic Interrupt Moderation (DIM)
>     support for Rx and Tx. It updates the queue creation procedure, adds
>     functions to collect stats for DIM, and provides ethtool handler
>     support to get/set the moderation settings.
> 
> > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > index 26aef21c6c2c8..d92e022d8533c 100644
> > --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > @@ -1579,6 +1579,9 @@ int mana_create_wq_obj(struct mana_port_context *apc,
> >
> >  	mana_gd_init_req_hdr(&req.hdr, MANA_CREATE_WQ_OBJ,
> >  			     sizeof(req), sizeof(resp));
> > +
> > +	req.hdr.req.msg_version = GDMA_MESSAGE_V3;
> > +	req.hdr.resp.msg_version = GDMA_MESSAGE_V2;
> >  	req.vport = vport;
> >  	req.wq_type = wq_type;
> >  	req.wq_gdma_region = wq_spec->gdma_region;
> > @@ -1587,6 +1590,9 @@ int mana_create_wq_obj(struct mana_port_context *apc,
> >  	req.cq_size = cq_spec->queue_size;
> >  	req.cq_moderation_ctx_id = cq_spec->modr_ctx_id;
> >  	req.cq_parent_qid = cq_spec->attached_eq;
> > +	req.req_cq_moderation = cq_spec->req_cq_moderation;
> > +	req.cq_moderation_comp = cq_spec->cq_moderation_comp;
> > +	req.cq_moderation_usec = cq_spec->cq_moderation_usec;
> 
> [Severity: High]
> Does this break queue creation on older Azure host firmwares?
> 
> The code unconditionally bumps the WQ object creation request to use
> GDMA_MESSAGE_V3 and passes the req_cq_moderation flag. If the firmware
> does not support GDMA_PF_CAP_FLAG_1_DYN_INTERRUPT_MODERATION, could the
> firmware reject the queue creation request and break device
> initialization entirely?

Our firmware is forward compatible with newer message versions: older
firmware still handles this message properly and simply ignores the new
feature fields, so queue creation succeeds.
If the firmware reports the DIM capability bit as zero, the driver keeps
the DIM feature off and does not allow it to be changed.
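
As a rough illustration of the capability gating described here, the sketch below mirrors the checks at the top of mana_set_coalesce() in the v4 patch. It is a userspace approximation, not the driver code: struct coal_req and check_coalesce() are hypothetical names, the constants are copied from the patch, and the rx_cqe_frames check that sits between these tests in the driver is left out.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define PF_CAP_DIM     (1u << 15) /* GDMA_PF_CAP_FLAG_1_DYN_INTERRUPT_MODERATION */
#define MODR_USEC_MAX  1023u      /* GENMASK(9, 0) */
#define MODR_COMP_MAX  255u       /* GENMASK(7, 0) */

/* Hypothetical subset of the ethtool coalesce request. */
struct coal_req {
	uint32_t rx_usecs, tx_usecs;
	uint32_t rx_frames, tx_frames;
	bool adaptive_rx, adaptive_tx;
};

/* Returns 0 if the request is acceptable, or a negative errno value. */
static int check_coalesce(uint64_t pf_cap_flags1, const struct coal_req *r)
{
	bool wants_modr = r->adaptive_rx || r->adaptive_tx ||
			  r->rx_usecs || r->tx_usecs ||
			  r->rx_frames || r->tx_frames;

	/* Any moderation request needs the HW capability bit. */
	if (wants_modr && !(pf_cap_flags1 & PF_CAP_DIM))
		return -EOPNOTSUPP;

	/* Values must fit the doorbell fields. */
	if (r->rx_usecs > MODR_USEC_MAX || r->tx_usecs > MODR_USEC_MAX)
		return -EINVAL;
	if (r->rx_frames > MODR_COMP_MAX || r->tx_frames > MODR_COMP_MAX)
		return -EINVAL;

	return 0;
}
```

With the capability bit clear, only an all-zero request passes, which is what keeps the feature off and unchangeable on older firmware.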

> 
> [ ... ]
> 
> > +static void mana_update_tx_dim(struct mana_cq *cq)
> > +{
> > +	struct mana_port_context *apc = netdev_priv(cq->txq->ndev);
> > +	struct dim_sample dim_sample = {};
> > +	struct mana_txq *txq = cq->txq;
> > +
> > +	if (!apc->tx_dim_enabled)
> > +		return;
> > +
> > +	dim_update_sample(READ_ONCE(cq->dim_event_ctr), txq->stats.packets,
> > +			  txq->stats.bytes, &dim_sample);
> > +	net_dim(&cq->dim, &dim_sample);
> > +}
> 
> [Severity: High]
> Does this measure the CPU enqueue rate rather than the hardware
> completion rate?
> 
> It appears that txq->stats.packets is incremented in mana_start_xmit()
> upon enqueue, not during completion in mana_poll_tx_cq(). Dynamic
> interrupt moderation relies on measuring the actual hardware completion
> rate to function correctly.
> 
> Additionally, is it safe to read txq->stats.packets here without the
> u64_stats_fetch_begin() retry loop? Reading a 64-bit statistic
> locklessly without the synchronization primitives could result in torn
> reads on 32-bit systems.
In the next patch version, I added counters to the cq that track the HW
completion rate. The reader runs in the same NAPI context as the writer,
which prevents torn reads.
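
A small sketch of the idea, under stated assumptions: cumulative counters are bumped in the completion path and sampled from the same context, and a consumer works on the differences between two samples. All names here (struct tx_cq_sketch, on_tx_completions(), take_sample(), event_delta()) are hypothetical and are not the dimlib API; the 16-bit event counter matches the width of dim_event_ctr in the patch.

```c
#include <stdint.h>

/* Hypothetical stand-in for the DIM-related fields of struct mana_cq. */
struct tx_cq_sketch {
	uint16_t event_ctr; /* bumped once per interrupt */
	uint64_t tx_pkts;   /* cumulative completed packets */
	uint64_t tx_bytes;  /* cumulative completed bytes */
};

struct sample {
	uint16_t events;
	uint64_t pkts;
	uint64_t bytes;
};

/* Completion path: account for what the hardware reported as finished. */
static void on_tx_completions(struct tx_cq_sketch *cq,
			      unsigned int pkts, unsigned int bytes)
{
	cq->tx_pkts += pkts;
	cq->tx_bytes += bytes;
}

/* Same context as the writer, so plain reads cannot be torn. */
static struct sample take_sample(const struct tx_cq_sketch *cq)
{
	struct sample s = { cq->event_ctr, cq->tx_pkts, cq->tx_bytes };

	return s;
}

/* Wrap-safe difference of two 16-bit event counters. */
static uint16_t event_delta(uint16_t start, uint16_t end)
{
	return (uint16_t)(end - start);
}
```

Because only differences between samples matter, the event counter can wrap without special handling, as long as fewer than 65536 events occur between two samples.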

> 
> [ ... ]
> 
> > @@ -2324,7 +2434,13 @@ static int mana_cq_handler(void *context, struct gdma_queue *gdma_queue)
> >  	if (w < cq->budget) {
> >  		mana_gd_ring_cq(gdma_queue, SET_ARM_BIT);
> >  		cq->work_done_since_doorbell = 0;
> > -		napi_complete_done(&cq->napi, w);
> > +
> > +		if (napi_complete_done(&cq->napi, w)) {
> > +			if (cq->type == MANA_CQ_TYPE_RX)
> > +				mana_update_rx_dim(cq);
> > +			else
> > +				mana_update_tx_dim(cq);
> > +		}
> 
> [Severity: High]
> Is it safe to call mana_update_rx_dim() and mana_update_tx_dim() after
> napi_complete_done()?
> 
> Since napi_complete_done() ends the NAPI polling cycle and relinquishes
> mutual exclusion, and the hardware interrupt was already re-armed by
> mana_gd_ring_cq() right before this, couldn't a new packet trigger an
> interrupt immediately? This could allow NAPI to be scheduled on another CPU
> concurrently while this CPU is still running the DIM update, potentially
> corrupting the non-thread-safe net_dim() state machine.

Thanks, I moved the DIM update before napi_complete_done() so that it
cannot run concurrently with another poll.

> 
> [ ... ]
> 
> > @@ -2562,6 +2685,9 @@ static int mana_create_txq(struct mana_port_context *apc,
> >  		napi_enable_locked(&cq->napi);
> >  		txq->napi_initialized = true;
> >
> > +		INIT_WORK(&cq->dim.work, mana_tx_dim_work);
> > +		cq->dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
> 
> [Severity: High]
> Could this allow NAPI to schedule an uninitialized work_struct?
> 
> napi_enable_locked() is called right before INIT_WORK(). Once NAPI is
> enabled, it can be scheduled to run immediately (for example, via netpoll).
> If NAPI executes before INIT_WORK completes, mana_update_tx_dim() would
> pass the uninitialized cq->dim to net_dim(), which could cause a kernel
> panic when it attempts to call schedule_work().
> 
> This pattern also occurs in mana_create_rxq().

I have updated the init order in the new version.

> 
> > diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> > index 94e658d07a27e..67a2b282ff4da 100644
> > --- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> > +++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> 
> [ ... ]
> 
> > @@ -440,18 +474,123 @@ static int mana_set_coalesce(struct net_device *ndev,
> > +		if (enable_rx_dim)
> > +			apc->rx_dim_enabled = true;
> > +		if (enable_tx_dim)
> > +			apc->tx_dim_enabled = true;
> 
> [Severity: High]
> Could the lack of memory barriers here expose uninitialized DIM structures
> to NAPI?
> 
> The driver initializes DIM via mana_dim_change() and then sets
> apc->rx_dim_enabled to true. Without an smp_store_release() here and a
> corresponding smp_load_acquire() in mana_update_rx_dim(), weakly-ordered
> CPUs like ARM64 might reorder the stores. Concurrently, NAPI polling might
> observe the flag as true before the initialization is fully visible in memory,
> potentially invoking net_dim() on garbage memory.

I added smp_store_release() & smp_load_acquire() to fix it in the next version.
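
For readers following the thread, the publish/consume ordering discussed
above can be sketched in userspace with C11 atomics. This is an
illustration only, not the driver code: struct sketch_dim, enable_dim()
and dim_ready() are hypothetical names, and the C11 release store and
acquire load stand in for the kernel's smp_store_release() and
smp_load_acquire().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-queue DIM state; not the MANA layout. */
struct sketch_dim {
	int mode;
	int profile_ix;
	atomic_bool enabled;	/* the flag the poll path checks */
};

/* Writer (configuration path): initialize first, then publish the flag. */
static void enable_dim(struct sketch_dim *d, int mode, int profile_ix)
{
	assert(d);
	d->mode = mode;
	d->profile_ix = profile_ix;
	/* Release: the stores above are visible before 'enabled' reads true. */
	atomic_store_explicit(&d->enabled, true, memory_order_release);
}

/* Reader (poll path): copies the state out only if it has been published. */
static bool dim_ready(struct sketch_dim *d, int *mode, int *profile_ix)
{
	assert(d && mode && profile_ix);
	/* Acquire: pairs with the release store in enable_dim(). */
	if (!atomic_load_explicit(&d->enabled, memory_order_acquire))
		return false;
	*mode = d->mode;
	*profile_ix = d->profile_ix;
	return true;
}
```

The point of the pairing is that a reader which sees the flag as true is
also guaranteed to see the fields written before it.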

I will submit the next version soon.

Thanks,
- Haiyang



* Re: [PATCH net-next v3] net: mana: Add Interrupt Moderation support
From: Simon Horman @ 2026-06-13  8:20 UTC (permalink / raw)
  To: haiyangz
  Cc: Simon Horman, linux-hyperv, netdev, kys, haiyangz, wei.liu, decui,
	longli, andrew+netdev, davem, edumazet, kuba, pabeni, kotaranov,
	shradhagupta, ernis, dipayanroy, gargaditya, kees, leitao,
	linux-kernel, linux-rdma, paulros
In-Reply-To: <20260611190239.2532429-1-haiyangz@linux.microsoft.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
Date:   Thu Jun 11 19:02:22 2026 +0000

    net: mana: Add Interrupt Moderation support
    
    This commit adds Static and Dynamic Interrupt Moderation (DIM) support for
    Rx and Tx. It updates the queue creation procedure, adds functions to
    collect stats for DIM, and provides ethtool handler support to get/set
    the moderation settings.

> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index 26aef21c6c2c8..d92e022d8533c 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -1579,6 +1579,9 @@ int mana_create_wq_obj(struct mana_port_context *apc,
>  
>  	mana_gd_init_req_hdr(&req.hdr, MANA_CREATE_WQ_OBJ,
>  			     sizeof(req), sizeof(resp));
> +
> +	req.hdr.req.msg_version = GDMA_MESSAGE_V3;
> +	req.hdr.resp.msg_version = GDMA_MESSAGE_V2;
>  	req.vport = vport;
>  	req.wq_type = wq_type;
>  	req.wq_gdma_region = wq_spec->gdma_region;
> @@ -1587,6 +1590,9 @@ int mana_create_wq_obj(struct mana_port_context *apc,
>  	req.cq_size = cq_spec->queue_size;
>  	req.cq_moderation_ctx_id = cq_spec->modr_ctx_id;
>  	req.cq_parent_qid = cq_spec->attached_eq;
> +	req.req_cq_moderation = cq_spec->req_cq_moderation;
> +	req.cq_moderation_comp = cq_spec->cq_moderation_comp;
> +	req.cq_moderation_usec = cq_spec->cq_moderation_usec;

[Severity: High]
Does this break queue creation on older Azure host firmwares?

The code unconditionally bumps the WQ object creation request to use
GDMA_MESSAGE_V3 and passes the req_cq_moderation flag. If the firmware does
not support GDMA_PF_CAP_FLAG_1_DYN_INTERRUPT_MODERATION, could the firmware
reject the queue creation request and break device initialization entirely?

[ ... ]

> +static void mana_update_tx_dim(struct mana_cq *cq)
> +{
> +	struct mana_port_context *apc = netdev_priv(cq->txq->ndev);
> +	struct dim_sample dim_sample = {};
> +	struct mana_txq *txq = cq->txq;
> +
> +	if (!apc->tx_dim_enabled)
> +		return;
> +
> +	dim_update_sample(READ_ONCE(cq->dim_event_ctr), txq->stats.packets,
> +			  txq->stats.bytes, &dim_sample);
> +	net_dim(&cq->dim, &dim_sample);
> +}

[Severity: High]
Does this measure the CPU enqueue rate rather than the hardware completion
rate?

It appears that txq->stats.packets is incremented in mana_start_xmit()
upon enqueue, not during completion in mana_poll_tx_cq(). Dynamic interrupt
moderation relies on measuring the actual hardware completion rate to
function correctly.

Additionally, is it safe to read txq->stats.packets here without the
u64_stats_fetch_begin() retry loop? Reading a 64-bit statistic locklessly
without the synchronization primitives could result in torn reads on 32-bit
systems.

[ ... ]

> @@ -2324,7 +2434,13 @@ static int mana_cq_handler(void *context, struct gdma_queue *gdma_queue)
>  	if (w < cq->budget) {
>  		mana_gd_ring_cq(gdma_queue, SET_ARM_BIT);
>  		cq->work_done_since_doorbell = 0;
> -		napi_complete_done(&cq->napi, w);
> +
> +		if (napi_complete_done(&cq->napi, w)) {
> +			if (cq->type == MANA_CQ_TYPE_RX)
> +				mana_update_rx_dim(cq);
> +			else
> +				mana_update_tx_dim(cq);
> +		}

[Severity: High]
Is it safe to call mana_update_rx_dim() and mana_update_tx_dim() after
napi_complete_done()?

Since napi_complete_done() ends the NAPI polling cycle and relinquishes
mutual exclusion, and the hardware interrupt was already re-armed by
mana_gd_ring_cq() right before this, couldn't a new packet trigger an
interrupt immediately? This could allow NAPI to be scheduled on another CPU
concurrently while this CPU is still running the DIM update, potentially
corrupting the non-thread-safe net_dim() state machine.

[ ... ]

> @@ -2562,6 +2685,9 @@ static int mana_create_txq(struct mana_port_context *apc,
>  		napi_enable_locked(&cq->napi);
>  		txq->napi_initialized = true;
>  
> +		INIT_WORK(&cq->dim.work, mana_tx_dim_work);
> +		cq->dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;

[Severity: High]
Could this allow NAPI to schedule an uninitialized work_struct?

napi_enable_locked() is called right before INIT_WORK(). Once NAPI is
enabled, it can be scheduled to run immediately (for example, via netpoll).
If NAPI executes before INIT_WORK completes, mana_update_tx_dim() would
pass the uninitialized cq->dim to net_dim(), which could cause a kernel
panic when it attempts to call schedule_work().

This pattern also occurs in mana_create_rxq().

> diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> index 94e658d07a27e..67a2b282ff4da 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c

[ ... ]

> @@ -440,18 +474,123 @@ static int mana_set_coalesce(struct net_device *ndev,
> +		if (enable_rx_dim)
> +			apc->rx_dim_enabled = true;
> +		if (enable_tx_dim)
> +			apc->tx_dim_enabled = true;

[Severity: High]
Could the lack of memory barriers here expose uninitialized DIM structures
to NAPI?

The driver initializes DIM via mana_dim_change() and then sets
apc->rx_dim_enabled to true. Without an smp_store_release() here and a
corresponding smp_load_acquire() in mana_update_rx_dim(), weakly-ordered
CPUs like ARM64 might reorder the stores. Concurrently, NAPI polling might
observe the flag as true before the initialization is fully visible in memory,
potentially invoking net_dim() on garbage memory.


* Re: [PATCH net v2 0/2] net: mana: fix error-path issues in queue setup
From: patchwork-bot+netdevbpf @ 2026-06-13  1:00 UTC (permalink / raw)
  To: Aditya Garg
  Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, horms, shradhagupta, dipayanroy, ernis,
	kees, shacharr, stephen, gargaditya, ssengar, linux-hyperv,
	netdev, linux-kernel
In-Reply-To: <20260608101345.2267320-1-gargaditya@linux.microsoft.com>

Hello:

This series was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Mon,  8 Jun 2026 03:13:39 -0700 you wrote:
> Two error-path fixes in MANA queue setup, both surfaced during Sashiko
> AI review of a recently upstreamed patch series.
> 
> Patch 1 initializes queue->id to INVALID_QUEUE_ID in
> mana_gd_create_mana_wq_cq() so that a CQ creation failure before the
> firmware id is assigned does not NULL gc->cq_table[0] and silently
> break whichever real CQ owns that slot. This mirrors the existing
> pattern in mana_gd_create_eq().
> 
> [...]
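
The sentinel pattern described for patch 1 can be sketched in plain C as
below. This is a hedged illustration, not the MANA code: SKETCH_INVALID_ID,
sketch_table and the sketch_queue_*() helpers are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_INVALID_ID	UINT32_MAX	/* hypothetical sentinel */
#define SKETCH_TABLE_SIZE	4

struct sketch_queue {
	uint32_t id;
};

static struct sketch_queue *sketch_table[SKETCH_TABLE_SIZE];

/* Start from the sentinel so an early failure never looks like id 0. */
static void sketch_queue_init(struct sketch_queue *q)
{
	assert(q);
	q->id = SKETCH_INVALID_ID;
}

/* Error/teardown path: skip the table when no real id was ever assigned. */
static void sketch_queue_destroy(struct sketch_queue *q)
{
	assert(q);
	if (q->id == SKETCH_INVALID_ID || q->id >= SKETCH_TABLE_SIZE)
		return;
	sketch_table[q->id] = NULL;
}
```

Without the sentinel, a zero-initialized id would make the error path
clear slot 0 even though another queue owns it.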

Here is the summary with links:
  - [net,v2,1/2] net: mana: initialize gdma queue id to INVALID_QUEUE_ID
    https://git.kernel.org/netdev/net/c/5985474e1cb4
  - [net,v2,2/2] net: mana: guard TX wq object destroy with INVALID_MANA_HANDLE check
    https://git.kernel.org/netdev/net/c/f8fd56977eee

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH] PCI: hv: add hard timeout to wait_for_response()
From: Easwar Hariharan @ 2026-06-12 19:27 UTC (permalink / raw)
  To: Hamza Mahfooz
  Cc: linux-hyperv, easwar.hariharan, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Lorenzo Pieralisi,
	Krzysztof Wilczyński, Manivannan Sadhasivam, Rob Herring,
	Bjorn Helgaas, linux-pci, stable
In-Reply-To: <20260612174010.2598695-1-hamzamahfooz@linux.microsoft.com>

On 6/12/2026 10:40, Hamza Mahfooz wrote:
> It is possible that we never receive a rescind event, in which case we
> will wait indefinitely for a device that will never show up. So, assume
> a device is gone if we have been polling for more than 5 seconds.
> 

Echoing Long's request for more context on where this was seen.

> Cc: stable@vger.kernel.org
> Fixes: c3635da2a336 ("PCI: hv: Do not wait forever on a device that has disappeared")
> Signed-off-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
> ---
>  drivers/pci/controller/pci-hyperv.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index cfc8fa403dad..bd63efc4a210 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -52,6 +52,7 @@
>  #include <linux/acpi.h>
>  #include <linux/sizes.h>
>  #include <linux/of_irq.h>
> +#include <linux/jiffies.h>
>  #include <asm/mshyperv.h>
>  
>  /*
> @@ -1038,6 +1039,8 @@ static void put_pcichild(struct hv_pci_dev *hpdev)
>  		kfree(hpdev);
>  }
>  
> +#define TIMEOUT_MS 5000
> +

This can be in seconds, see below.

>  /*
>   * There is no good way to get notified from vmbus_onoffer_rescind(),
>   * so let's use polling here, since this is not a hot path.
> @@ -1045,8 +1048,13 @@ static void put_pcichild(struct hv_pci_dev *hpdev)
>  static int wait_for_response(struct hv_device *hdev,
>  			     struct completion *comp)
>  {
> +	unsigned long timeout = get_jiffies_64() + msecs_to_jiffies(TIMEOUT_MS);

You can use secs_to_jiffies() here instead.

> +	unsigned long now;
> +
>  	while (true) {
> -		if (hdev->channel->rescind) {
> +		now = get_jiffies_64();
> +		if (hdev->channel->rescind ||
> +		    time_after(now, timeout)) {
>  			dev_warn_once(&hdev->device, "The device is gone.\n");
>  			return -ENODEV;
>  		}



* Re: [PATCH net-next v3] net: mana: Add Interrupt Moderation support
From: sashiko-bot @ 2026-06-12 19:03 UTC (permalink / raw)
  To: Haiyang Zhang; +Cc: linux-hyperv
In-Reply-To: <20260611190239.2532429-1-haiyangz@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 5 potential issue(s) to consider:
- [High] Calling `mana_update_rx_dim` and `mana_update_tx_dim` after `napi_complete_done` violates NAPI serialization, leading to data races and corruption of the `net_dim` state.
- [High] Unconditional use of the `GDMA_MESSAGE_V3` format and `req_cq_moderation` flag breaks WQ object creation on older MANA firmwares.
- [High] NAPI is enabled before the DIM work struct is initialized, allowing NAPI to run on an uninitialized work struct.
- [High] Lockless access to the DIM enable flags lacks memory barriers, potentially exposing uninitialized DIM structures to NAPI.
- [High] TX DIM incorrectly uses enqueued packets instead of completed packets, and reads concurrent stats unsafely.
--

commit 275fa81625ff24a83b95ff846fea9070fb3193d8
Author: Haiyang Zhang <haiyangz@microsoft.com>
Date:   Thu Jun 11 19:02:22 2026 +0000

    net: mana: Add Interrupt Moderation support
    
    This commit adds Static and Dynamic Interrupt Moderation (DIM) support for
    Rx and Tx. It updates the queue creation procedure, adds functions to
    collect stats for DIM, and provides ethtool handler support to get/set
    the moderation settings.

> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index 26aef21c6c2c8..d92e022d8533c 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -1579,6 +1579,9 @@ int mana_create_wq_obj(struct mana_port_context *apc,
>  
>  	mana_gd_init_req_hdr(&req.hdr, MANA_CREATE_WQ_OBJ,
>  			     sizeof(req), sizeof(resp));
> +
> +	req.hdr.req.msg_version = GDMA_MESSAGE_V3;
> +	req.hdr.resp.msg_version = GDMA_MESSAGE_V2;
>  	req.vport = vport;
>  	req.wq_type = wq_type;
>  	req.wq_gdma_region = wq_spec->gdma_region;
> @@ -1587,6 +1590,9 @@ int mana_create_wq_obj(struct mana_port_context *apc,
>  	req.cq_size = cq_spec->queue_size;
>  	req.cq_moderation_ctx_id = cq_spec->modr_ctx_id;
>  	req.cq_parent_qid = cq_spec->attached_eq;
> +	req.req_cq_moderation = cq_spec->req_cq_moderation;
> +	req.cq_moderation_comp = cq_spec->cq_moderation_comp;
> +	req.cq_moderation_usec = cq_spec->cq_moderation_usec;

[Severity: High]
Does this break queue creation on older Azure host firmwares?

The code unconditionally bumps the WQ object creation request to use
GDMA_MESSAGE_V3 and passes the req_cq_moderation flag. If the firmware does
not support GDMA_PF_CAP_FLAG_1_DYN_INTERRUPT_MODERATION, could the firmware
reject the queue creation request and break device initialization entirely?

[ ... ]

> +static void mana_update_tx_dim(struct mana_cq *cq)
> +{
> +	struct mana_port_context *apc = netdev_priv(cq->txq->ndev);
> +	struct dim_sample dim_sample = {};
> +	struct mana_txq *txq = cq->txq;
> +
> +	if (!apc->tx_dim_enabled)
> +		return;
> +
> +	dim_update_sample(READ_ONCE(cq->dim_event_ctr), txq->stats.packets,
> +			  txq->stats.bytes, &dim_sample);
> +	net_dim(&cq->dim, &dim_sample);
> +}

[Severity: High]
Does this measure the CPU enqueue rate rather than the hardware completion
rate?

It appears that txq->stats.packets is incremented in mana_start_xmit()
upon enqueue, not during completion in mana_poll_tx_cq(). Dynamic interrupt
moderation relies on measuring the actual hardware completion rate to
function correctly.

Additionally, is it safe to read txq->stats.packets here without the
u64_stats_fetch_begin() retry loop? Reading a 64-bit statistic locklessly
without the synchronization primitives could result in torn reads on 32-bit
systems.

[ ... ]

> @@ -2324,7 +2434,13 @@ static int mana_cq_handler(void *context, struct gdma_queue *gdma_queue)
>  	if (w < cq->budget) {
>  		mana_gd_ring_cq(gdma_queue, SET_ARM_BIT);
>  		cq->work_done_since_doorbell = 0;
> -		napi_complete_done(&cq->napi, w);
> +
> +		if (napi_complete_done(&cq->napi, w)) {
> +			if (cq->type == MANA_CQ_TYPE_RX)
> +				mana_update_rx_dim(cq);
> +			else
> +				mana_update_tx_dim(cq);
> +		}

[Severity: High]
Is it safe to call mana_update_rx_dim() and mana_update_tx_dim() after
napi_complete_done()?

Since napi_complete_done() ends the NAPI polling cycle and relinquishes
mutual exclusion, and the hardware interrupt was already re-armed by
mana_gd_ring_cq() right before this, couldn't a new packet trigger an
interrupt immediately? This could allow NAPI to be scheduled on another CPU
concurrently while this CPU is still running the DIM update, potentially
corrupting the non-thread-safe net_dim() state machine.

[ ... ]

> @@ -2562,6 +2685,9 @@ static int mana_create_txq(struct mana_port_context *apc,
>  		napi_enable_locked(&cq->napi);
>  		txq->napi_initialized = true;
>  
> +		INIT_WORK(&cq->dim.work, mana_tx_dim_work);
> +		cq->dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;

[Severity: High]
Could this allow NAPI to schedule an uninitialized work_struct?

napi_enable_locked() is called right before INIT_WORK(). Once NAPI is
enabled, it can be scheduled to run immediately (for example, via netpoll).
If NAPI executes before INIT_WORK completes, mana_update_tx_dim() would
pass the uninitialized cq->dim to net_dim(), which could cause a kernel
panic when it attempts to call schedule_work().

This pattern also occurs in mana_create_rxq().

> diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> index 94e658d07a27e..67a2b282ff4da 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c

[ ... ]

> @@ -440,18 +474,123 @@ static int mana_set_coalesce(struct net_device *ndev,
> +		if (enable_rx_dim)
> +			apc->rx_dim_enabled = true;
> +		if (enable_tx_dim)
> +			apc->tx_dim_enabled = true;

[Severity: High]
Could the lack of memory barriers here expose uninitialized DIM structures
to NAPI?

The driver initializes DIM via mana_dim_change() and then sets
apc->rx_dim_enabled to true. Without an smp_store_release() here and a
corresponding smp_load_acquire() in mana_update_rx_dim(), weakly-ordered
CPUs like ARM64 might reorder the stores. Concurrently, NAPI polling might
observe the flag as true before the initialization is fully visible in memory,
potentially invoking net_dim() on garbage memory.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260611190239.2532429-1-haiyangz@linux.microsoft.com?part=1


* Re: [PATCH] PCI: hv: add hard timeout to wait_for_response()
From: Hamza Mahfooz @ 2026-06-12 18:44 UTC (permalink / raw)
  To: Long Li
  Cc: linux-hyperv@vger.kernel.org, KY Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Lorenzo Pieralisi, Krzysztof Wilczyński,
	Manivannan Sadhasivam, Rob Herring, Bjorn Helgaas,
	linux-pci@vger.kernel.org, stable@vger.kernel.org
In-Reply-To: <SA1PR21MB6683BAA40F52C52150D7811ECE182@SA1PR21MB6683.namprd21.prod.outlook.com>

On Fri, Jun 12, 2026 at 06:04:22PM +0000, Long Li wrote:
> What if the VMBus response comes in right after this check? The completion
> is allocated on the caller's stack, so a late response will cause a kernel
> oops.

That is a fair point. I'll see if there is a better way to handle this error case.


* RE: [PATCH] PCI: hv: add hard timeout to wait_for_response()
From: Long Li @ 2026-06-12 18:04 UTC (permalink / raw)
  To: Hamza Mahfooz, linux-hyperv@vger.kernel.org
  Cc: KY Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Lorenzo Pieralisi, Krzysztof Wilczyński,
	Manivannan Sadhasivam, Rob Herring, Bjorn Helgaas,
	linux-pci@vger.kernel.org, stable@vger.kernel.org
In-Reply-To: <20260612174010.2598695-1-hamzamahfooz@linux.microsoft.com>

> Subject: [PATCH] PCI: hv: add hard timeout to wait_for_response()
> 
> It is possible that we never receive a rescind event, in which case we will wait
> indefinitely for a device that will never show up. So, assume a device is gone if
> we have been polling for more than 5 seconds.

Can you explain in what situation we never receive a rescind event? Is this for dealing with a device unload when the VMBus is dead? Please provide more context. A kernel trace would be helpful.

Does this patch handle the case where the rescind event comes in right after the timeout?

> 
> Cc: stable@vger.kernel.org
> Fixes: c3635da2a336 ("PCI: hv: Do not wait forever on a device that has disappeared")
> Signed-off-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
> ---
>  drivers/pci/controller/pci-hyperv.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index cfc8fa403dad..bd63efc4a210 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -52,6 +52,7 @@
>  #include <linux/acpi.h>
>  #include <linux/sizes.h>
>  #include <linux/of_irq.h>
> +#include <linux/jiffies.h>
>  #include <asm/mshyperv.h>
> 
>  /*
> @@ -1038,6 +1039,8 @@ static void put_pcichild(struct hv_pci_dev *hpdev)
>  		kfree(hpdev);
>  }
> 
> +#define TIMEOUT_MS 5000
> +
>  /*
>   * There is no good way to get notified from vmbus_onoffer_rescind(),
>   * so let's use polling here, since this is not a hot path.
> @@ -1045,8 +1048,13 @@ static void put_pcichild(struct hv_pci_dev *hpdev)
>  static int wait_for_response(struct hv_device *hdev,
>  			     struct completion *comp)
>  {
> +	unsigned long timeout = get_jiffies_64() + msecs_to_jiffies(TIMEOUT_MS);
> +	unsigned long now;
> +
>  	while (true) {
> -		if (hdev->channel->rescind) {
> +		now = get_jiffies_64();
> +		if (hdev->channel->rescind ||
> +		    time_after(now, timeout)) {

What if the VMBus response comes in right after this check? The completion is allocated on the caller's stack, so a late response will cause a kernel oops.

How did you test this patch?



* Re: [PATCH] PCI: hv: add hard timeout to wait_for_response()
From: sashiko-bot @ 2026-06-12 17:52 UTC (permalink / raw)
  To: Hamza Mahfooz; +Cc: linux-pci, linux-hyperv
In-Reply-To: <20260612174010.2598695-1-hamzamahfooz@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [Critical] Use-After-Free and Stack Corruption due to abandoned VMBus requests on timeout.
- [Medium] Race condition in wait_for_response() polling loop can falsely trigger timeout.
--

commit 18b7ab3e09172b3f0651ad176d0ab6367f1e43bb
Author: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>

PCI: hv: add hard timeout to wait_for_response()

This commit adds a hard 5-second timeout to wait_for_response() in the
Hyper-V PCI controller driver. It addresses an issue where the driver could
wait indefinitely for a device to appear if a rescind event is never
received.

> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index cfc8fa403dad6..bd63efc4a2103 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c

[ ... ]

> @@ -1045,8 +1048,13 @@ static void put_pcichild(struct hv_pci_dev *hpdev)
>  static int wait_for_response(struct hv_device *hdev,
>  			     struct completion *comp)
>  {
> +	unsigned long timeout = get_jiffies_64() + msecs_to_jiffies(TIMEOUT_MS);
> +	unsigned long now;
> +
>  	while (true) {
> -		if (hdev->channel->rescind) {
> +		now = get_jiffies_64();
> +		if (hdev->channel->rescind ||
> +		    time_after(now, timeout)) {

[Severity: Medium]
Is there a race condition if the thread is preempted right before updating
the time?

If the thread is preempted just before calling get_jiffies_64() and the
delay crosses the 5-second boundary, this will evaluate to true and
immediately return -ENODEV.

Since there is no final check of the completion status after the timeout
check, could this falsely assume a timeout even if the Hyper-V host
successfully replied during the preemption window?
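
A minimal userspace sketch of the "final check" idea follows, using a
simulated clock instead of jiffies. Everything here is hypothetical
(struct wait_sim, wait_with_deadline(), the stall field that models
preemption). It illustrates only the false-timeout concern and does not
address the separate lifetime issue raised below. In the kernel,
completion_done() is one existing primitive that could serve for such a
re-check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { WAIT_OK = 0, WAIT_TIMED_OUT = -1 };

/* Hypothetical simulation state; a fake tick counter replaces jiffies. */
struct wait_sim {
	uint64_t now;		/* current tick */
	uint64_t reply_at;	/* tick at which the reply arrives */
	uint64_t slice;		/* length of one bounded wait slice */
	uint64_t stall;		/* one-off delay before the deadline check */
};

static bool reply_arrived(const struct wait_sim *s)
{
	return s->now >= s->reply_at;
}

static int wait_with_deadline(struct wait_sim *s, uint64_t timeout)
{
	uint64_t deadline;

	assert(s && s->slice > 0);
	deadline = s->now + timeout;

	for (;;) {
		/* Model preemption just before the clock is sampled. */
		s->now += s->stall;
		s->stall = 0;

		if (s->now > deadline) {
			/*
			 * Final check: a reply that landed during the stall
			 * is a success, not a timeout.
			 */
			return reply_arrived(s) ? WAIT_OK : WAIT_TIMED_OUT;
		}

		s->now += s->slice;	/* one bounded wait slice */
		if (reply_arrived(s))
			return WAIT_OK;
	}
}
```

With the re-check in place, a long stall that spans both the reply and
the deadline is reported as success rather than as a missing device.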

>  			dev_warn_once(&hdev->device, "The device is gone.\n");
>  			return -ENODEV;

[Severity: Critical]
By adding this timeout, do we risk a use-after-free and stack corruption if
the VMBus request is never explicitly cancelled?

When we return -ENODEV early here, callers like hv_send_resources_allocated()
will free their heap-allocated request packets and return, which pops the
local completion structure off the stack.

However, since the VMBus channel remains active and the pending request is
not removed from the requestor array, what happens if the Hyper-V host
eventually replies?

Wouldn't hv_pci_onchannelcallback() fetch the dangling packet pointer and
invoke comp_packet->completion_func(...) on freed memory?

It seems this could also corrupt the stack when writing to the destroyed
completion structure. Additionally, in hv_pci_query_relations(), returning
on timeout leaves hbus->survey_event pointing to a local stack variable
that will later be completed by survey_child_resources().

Does this also leak a slot in the VMBus requestor array for each timed-out
request?

>  		}
> 
>  		if (wait_for_completion_timeout(comp, HZ / 10))

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260612174010.2598695-1-hamzamahfooz@linux.microsoft.com?part=1


* [PATCH] PCI: hv: add hard timeout to wait_for_response()
From: Hamza Mahfooz @ 2026-06-12 17:40 UTC (permalink / raw)
  To: linux-hyperv
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Lorenzo Pieralisi, Krzysztof Wilczyński,
	Manivannan Sadhasivam, Rob Herring, Bjorn Helgaas, linux-pci,
	Hamza Mahfooz, stable

It is possible that we never receive a rescind event, in which case we
will wait indefinitely for a device that will never show up. So, assume
a device is gone if we have been polling for more than 5 seconds.

Cc: stable@vger.kernel.org
Fixes: c3635da2a336 ("PCI: hv: Do not wait forever on a device that has disappeared")
Signed-off-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
---
 drivers/pci/controller/pci-hyperv.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index cfc8fa403dad..bd63efc4a210 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -52,6 +52,7 @@
 #include <linux/acpi.h>
 #include <linux/sizes.h>
 #include <linux/of_irq.h>
+#include <linux/jiffies.h>
 #include <asm/mshyperv.h>
 
 /*
@@ -1038,6 +1039,8 @@ static void put_pcichild(struct hv_pci_dev *hpdev)
 		kfree(hpdev);
 }
 
+#define TIMEOUT_MS 5000
+
 /*
  * There is no good way to get notified from vmbus_onoffer_rescind(),
  * so let's use polling here, since this is not a hot path.
@@ -1045,8 +1048,13 @@ static void put_pcichild(struct hv_pci_dev *hpdev)
 static int wait_for_response(struct hv_device *hdev,
 			     struct completion *comp)
 {
+	unsigned long timeout = get_jiffies_64() + msecs_to_jiffies(TIMEOUT_MS);
+	unsigned long now;
+
 	while (true) {
-		if (hdev->channel->rescind) {
+		now = get_jiffies_64();
+		if (hdev->channel->rescind ||
+		    time_after(now, timeout)) {
 			dev_warn_once(&hdev->device, "The device is gone.\n");
 			return -ENODEV;
 		}
-- 
2.54.0



* [tip:irq/core] [x86/irq]  2b57c69917: lkvs.thermal_test.sh_-t_check_pkg_interrupts.fail
From: kernel test robot @ 2026-06-12  6:00 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: oe-lkp, lkp, linux-kernel, x86, Michael Kelley, Radu Rendec,
	linux-perf-users, linux-hyperv, linux-edac, kvm, xen-devel,
	oliver.sang



Hello,

kernel test robot noticed "lkvs.thermal_test.sh_-t_check_pkg_interrupts.fail" on:

commit: 2b57c69917eeba3ee657f252257e37f31916ba2a ("x86/irq: Make irqstats array based")
https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git irq/core


in testcase: lkvs
version: lkvs-x86_64-388e0c1-1_20260521
with following parameters:

	test: thermal


config: x86_64-rhel-9.4-func
compiler: gcc-14
test machine: 512 threads 4 sockets Intel(R) Xeon(R) 6768P  CPU @ 2.4GHz (Granite Rapids) with 128G memory

(please refer to attached dmesg/kmsg for entire log/backtrace)



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202606121325.97b29701-lkp@intel.com


2026-06-10 15:11:39 cd /lkp/benchmarks/lkvs/lkvs/BM
2026-06-10 15:11:39 ./runtests -f thermal/tests
Next run cases from thermal/tests
<<<test start - 'thermal_test.sh -t check_thermal_throttling'>>

...

/lkp/benchmarks/lkvs/lkvs/BM/thermal/thermal_test.sh: line 177: 15994 Killed                  taskset -c 0-"$NUM_CPUS" stress -c "$cpus" -t 20
<<<test end, result: PASS, duration: 41.108s>>

<<<test start - 'thermal_test.sh -t check_pkg_interrupts'>>
|0610_151221.580|ERROR| common.sh:142:die() - FATAL: die() is called by thermal_test.sh:93:pkg_interrupts()|
|0610_151221.584|ERROR| common.sh:143:die() - FATAL: Thermal event interrupts is not detected.|
<<<test end, result: FAIL, duration: 0.786s>>

Test Start Time: 2026-06-10_15-11-39
--------------------------------------------------------
Testcase                                                                     Result    Exit Value  Duration
--------                                                                     ------    ----------  --------
[RESULT][thermal_test.sh -t check_thermal_throttling]                        [PASS]    [0]         [41.108s]
[RESULT][thermal_test.sh -t check_pkg_interrupts]                            [FAIL]    [1]         [0.786s]
--------------------------------------------------------



The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20260612/202606121325.97b29701-lkp@intel.com



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



* [PATCH net-next v3] net: mana: Add Interrupt Moderation support
From: Haiyang Zhang @ 2026-06-11 19:02 UTC (permalink / raw)
  To: linux-hyperv, netdev, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Konstantin Taranov, Simon Horman,
	Shradha Gupta, Erni Sri Satya Vennela, Dipayaan Roy, Aditya Garg,
	Kees Cook, Breno Leitao, linux-kernel, linux-rdma
  Cc: paulros

From: Haiyang Zhang <haiyangz@microsoft.com>

Add Static and Dynamic Interrupt Moderation (DIM) support for
Rx and Tx.
Update the queue creation procedure to use a new data struct that
carries the related settings.
Add functions to collect stats for DIM, and workers to update DIM data
and settings.
Update the ethtool handlers to get/set the moderation settings from a user.
To avoid detach/re-attach ops, ring the DIM doorbell to change settings
at run time.
By default, adaptive-rx/tx (DIM) are enabled.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
v3:
  Updated to avoid detach/re-attach ops as suggested by Paolo.

v2:
  Updated with comments from Jedrzej.

---
 drivers/net/ethernet/microsoft/Kconfig        |   1 +
 .../net/ethernet/microsoft/mana/gdma_main.c   |  29 ++++
 drivers/net/ethernet/microsoft/mana/mana_en.c | 147 +++++++++++++++-
 .../ethernet/microsoft/mana/mana_ethtool.c    | 161 +++++++++++++++++-
 include/net/mana/gdma.h                       |  24 ++-
 include/net/mana/mana.h                       |  47 +++++
 6 files changed, 399 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/Kconfig b/drivers/net/ethernet/microsoft/Kconfig
index 3f36ee6a8ece..e9be18c92ca5 100644
--- a/drivers/net/ethernet/microsoft/Kconfig
+++ b/drivers/net/ethernet/microsoft/Kconfig
@@ -21,6 +21,7 @@ config MICROSOFT_MANA
 	depends on X86_64 || (ARM64 && !CPU_BIG_ENDIAN)
 	depends on PCI_HYPERV
 	select AUXILIARY_BUS
+	select DIMLIB
 	select PAGE_POOL
 	select NET_SHAPER
 	help
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index c9ec80a1dd6f..7a012b1e5751 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
 /* Copyright (c) 2021, Microsoft Corporation. */
 
+#include <linux/bitfield.h>
 #include <linux/debugfs.h>
 #include <linux/module.h>
 #include <linux/pci.h>
@@ -464,6 +465,7 @@ static int mana_gd_disable_queue(struct gdma_queue *queue)
 #define DOORBELL_OFFSET_RQ	0x400
 #define DOORBELL_OFFSET_CQ	0x800
 #define DOORBELL_OFFSET_EQ	0xFF8
+#define DOORBELL_OFFSET_DIM	0x820
 
 static void mana_gd_ring_doorbell(struct gdma_context *gc, u32 db_index,
 				  enum gdma_queue_type q_type, u32 qid,
@@ -504,6 +506,16 @@ static void mana_gd_ring_doorbell(struct gdma_context *gc, u32 db_index,
 		addr += DOORBELL_OFFSET_SQ;
 		break;
 
+	case GDMA_DIM:
+		e.dim.id = qid;
+		e.dim.mod_usec = FIELD_GET(MANA_INTR_MODR_USEC_MAX, tail_ptr);
+		e.dim.mod_usec_vld = !!(tail_ptr & MANA_INTR_MODR_USEC_VLD);
+		e.dim.mod_comps = FIELD_GET(MANA_INTR_MODR_COMP_MASK, tail_ptr);
+		e.dim.mod_comps_vld = num_req;
+
+		addr += DOORBELL_OFFSET_DIM;
+		break;
+
 	default:
 		WARN_ON(1);
 		return;
@@ -538,6 +550,23 @@ void mana_gd_ring_cq(struct gdma_queue *cq, u8 arm_bit)
 }
 EXPORT_SYMBOL_NS(mana_gd_ring_cq, "NET_MANA");
 
+void mana_gd_ring_dim(struct gdma_queue *cq, u32 mod_usec, bool mod_usec_vld,
+		      u32 mod_comps, bool mod_comps_vld)
+{
+	struct gdma_context *gc = cq->gdma_dev->gdma_context;
+	u32 dim_val;
+
+	/* Convert the DIM values to doorbell parameters */
+	dim_val = FIELD_PREP(MANA_INTR_MODR_USEC_MAX, mod_usec) |
+		  FIELD_PREP(MANA_INTR_MODR_COMP_MASK, mod_comps);
+	if (mod_usec_vld)
+		dim_val |= MANA_INTR_MODR_USEC_VLD;
+
+	mana_gd_ring_doorbell(gc, cq->gdma_dev->doorbell, GDMA_DIM, cq->id,
+			      dim_val, mod_comps_vld);
+}
+EXPORT_SYMBOL_NS(mana_gd_ring_dim, "NET_MANA");
+
 #define MANA_SERVICE_PERIOD 10
 
 static void mana_serv_rescan(struct pci_dev *pdev)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 26aef21c6c2c..d92e022d8533 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1579,6 +1579,9 @@ int mana_create_wq_obj(struct mana_port_context *apc,
 
 	mana_gd_init_req_hdr(&req.hdr, MANA_CREATE_WQ_OBJ,
 			     sizeof(req), sizeof(resp));
+
+	req.hdr.req.msg_version = GDMA_MESSAGE_V3;
+	req.hdr.resp.msg_version = GDMA_MESSAGE_V2;
 	req.vport = vport;
 	req.wq_type = wq_type;
 	req.wq_gdma_region = wq_spec->gdma_region;
@@ -1587,6 +1590,9 @@ int mana_create_wq_obj(struct mana_port_context *apc,
 	req.cq_size = cq_spec->queue_size;
 	req.cq_moderation_ctx_id = cq_spec->modr_ctx_id;
 	req.cq_parent_qid = cq_spec->attached_eq;
+	req.req_cq_moderation = cq_spec->req_cq_moderation;
+	req.cq_moderation_comp = cq_spec->cq_moderation_comp;
+	req.cq_moderation_usec = cq_spec->cq_moderation_usec;
 
 	err = mana_send_request(apc->ac, &req, sizeof(req), &resp,
 				sizeof(resp));
@@ -2306,6 +2312,110 @@ static void mana_poll_rx_cq(struct mana_cq *cq)
 		xdp_do_flush();
 }
 
+static void mana_rx_dim_work(struct work_struct *work)
+{
+	struct dim *dim = container_of(work, struct dim, work);
+	struct dim_cq_moder cur_moder;
+	struct mana_cq *cq;
+
+	cur_moder = net_dim_get_rx_moderation(dim->mode, dim->profile_ix);
+	cq = container_of(dim, struct mana_cq, dim);
+
+	cur_moder.usec = min_t(u16, cur_moder.usec, MANA_INTR_MODR_USEC_MAX);
+	cur_moder.pkts = min_t(u16, cur_moder.pkts, MANA_INTR_MODR_COMP_MAX);
+
+	mana_gd_ring_dim(cq->gdma_cq, cur_moder.usec, true,
+			 cur_moder.pkts, true);
+
+	dim->state = DIM_START_MEASURE;
+}
+
+static void mana_tx_dim_work(struct work_struct *work)
+{
+	struct dim *dim = container_of(work, struct dim, work);
+	struct dim_cq_moder cur_moder;
+	struct mana_cq *cq;
+
+	cur_moder = net_dim_get_tx_moderation(dim->mode, dim->profile_ix);
+	cq = container_of(dim, struct mana_cq, dim);
+
+	cur_moder.usec = min_t(u16, cur_moder.usec, MANA_INTR_MODR_USEC_MAX);
+	cur_moder.pkts = min_t(u16, cur_moder.pkts, MANA_INTR_MODR_COMP_MAX);
+
+	mana_gd_ring_dim(cq->gdma_cq, cur_moder.usec, true,
+			 cur_moder.pkts, true);
+
+	dim->state = DIM_START_MEASURE;
+}
+
+/* The caller must clear apc->rx/tx_dim_enabled before disabling and set
+ * it after enabling, and must call synchronize_net() before draining
+ * the DIM work, so that NAPI cannot observe a stale flag.
+ */
+int mana_dim_change(struct mana_cq *cq, bool enable)
+{
+	bool is_rx = cq->type == MANA_CQ_TYPE_RX;
+	struct mana_port_context *apc;
+	work_func_t work_func;
+	u32 usec, comp;
+
+	if (is_rx) {
+		apc = netdev_priv(cq->rxq->ndev);
+		usec = apc->intr_modr_rx_usec;
+		comp = apc->intr_modr_rx_comp;
+		work_func = mana_rx_dim_work;
+	} else {
+		apc = netdev_priv(cq->txq->ndev);
+		usec = apc->intr_modr_tx_usec;
+		comp = apc->intr_modr_tx_comp;
+		work_func = mana_tx_dim_work;
+	}
+
+	/* On enable, zero the DIM state so net_dim() starts measuring from
+	 * scratch.
+	 * On disable, drain any pending DIM work and restore the static
+	 * moderation values.
+	 */
+	if (enable) {
+		memset(&cq->dim, 0, sizeof(cq->dim));
+		cq->dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
+		INIT_WORK(&cq->dim.work, work_func);
+	} else {
+		cancel_work_sync(&cq->dim.work);
+		mana_gd_ring_dim(cq->gdma_cq, usec, true, comp, true);
+	}
+
+	return 0;
+}
+
+static void mana_update_rx_dim(struct mana_cq *cq)
+{
+	struct mana_port_context *apc = netdev_priv(cq->rxq->ndev);
+	struct dim_sample dim_sample = {};
+	struct mana_rxq *rxq = cq->rxq;
+
+	if (!apc->rx_dim_enabled)
+		return;
+
+	dim_update_sample(READ_ONCE(cq->dim_event_ctr), rxq->stats.packets,
+			  rxq->stats.bytes, &dim_sample);
+	net_dim(&cq->dim, &dim_sample);
+}
+
+static void mana_update_tx_dim(struct mana_cq *cq)
+{
+	struct mana_port_context *apc = netdev_priv(cq->txq->ndev);
+	struct dim_sample dim_sample = {};
+	struct mana_txq *txq = cq->txq;
+
+	if (!apc->tx_dim_enabled)
+		return;
+
+	dim_update_sample(READ_ONCE(cq->dim_event_ctr), txq->stats.packets,
+			  txq->stats.bytes, &dim_sample);
+	net_dim(&cq->dim, &dim_sample);
+}
+
 static int mana_cq_handler(void *context, struct gdma_queue *gdma_queue)
 {
 	struct mana_cq *cq = context;
@@ -2324,7 +2434,13 @@ static int mana_cq_handler(void *context, struct gdma_queue *gdma_queue)
 	if (w < cq->budget) {
 		mana_gd_ring_cq(gdma_queue, SET_ARM_BIT);
 		cq->work_done_since_doorbell = 0;
-		napi_complete_done(&cq->napi, w);
+
+		if (napi_complete_done(&cq->napi, w)) {
+			if (cq->type == MANA_CQ_TYPE_RX)
+				mana_update_rx_dim(cq);
+			else
+				mana_update_tx_dim(cq);
+		}
 	} else if (cq->work_done_since_doorbell >=
 		   (cq->gdma_cq->queue_size / COMP_ENTRY_SIZE) * 4) {
 		/* MANA hardware requires at least one doorbell ring every 8
@@ -2356,6 +2472,7 @@ static void mana_schedule_napi(void *context, struct gdma_queue *gdma_queue)
 {
 	struct mana_cq *cq = context;
 
+	WRITE_ONCE(cq->dim_event_ctr, cq->dim_event_ctr + 1);
 	napi_schedule_irqoff(&cq->napi);
 }
 
@@ -2398,6 +2515,7 @@ static void mana_destroy_txq(struct mana_port_context *apc)
 		if (apc->tx_qp[i]->txq.napi_initialized) {
 			napi_synchronize(napi);
 			napi_disable_locked(napi);
+			cancel_work_sync(&apc->tx_qp[i]->tx_cq.dim.work);
 			netif_napi_del_locked(napi);
 			apc->tx_qp[i]->txq.napi_initialized = false;
 		}
@@ -2529,6 +2647,11 @@ static int mana_create_txq(struct mana_port_context *apc,
 		cq_spec.modr_ctx_id = 0;
 		cq_spec.attached_eq = cq->gdma_cq->cq.parent->id;
 
+		/* DIM setting can be changed at runtime */
+		cq_spec.req_cq_moderation = true;
+		cq_spec.cq_moderation_usec = apc->intr_modr_tx_usec;
+		cq_spec.cq_moderation_comp = apc->intr_modr_tx_comp;
+
 		err = mana_create_wq_obj(apc, apc->port_handle, GDMA_SQ,
 					 &wq_spec, &cq_spec,
 					 &apc->tx_qp[i]->tx_object);
@@ -2562,6 +2685,9 @@ static int mana_create_txq(struct mana_port_context *apc,
 		napi_enable_locked(&cq->napi);
 		txq->napi_initialized = true;
 
+		INIT_WORK(&cq->dim.work, mana_tx_dim_work);
+		cq->dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
+
 		mana_gd_ring_cq(cq->gdma_cq, SET_ARM_BIT);
 	}
 
@@ -2596,6 +2722,7 @@ static void mana_destroy_rxq(struct mana_port_context *apc,
 		napi_synchronize(napi);
 
 		napi_disable_locked(napi);
+		cancel_work_sync(&rxq->rx_cq.dim.work);
 		netif_napi_del_locked(napi);
 	}
 
@@ -2834,6 +2961,11 @@ static struct mana_rxq *mana_create_rxq(struct mana_port_context *apc,
 	cq_spec.modr_ctx_id = 0;
 	cq_spec.attached_eq = cq->gdma_cq->cq.parent->id;
 
+	/* DIM setting can be changed at runtime */
+	cq_spec.req_cq_moderation = true;
+	cq_spec.cq_moderation_usec = apc->intr_modr_rx_usec;
+	cq_spec.cq_moderation_comp = apc->intr_modr_rx_comp;
+
 	err = mana_create_wq_obj(apc, apc->port_handle, GDMA_RQ,
 				 &wq_spec, &cq_spec, &rxq->rxobj);
 	if (err)
@@ -2868,6 +3000,9 @@ static struct mana_rxq *mana_create_rxq(struct mana_port_context *apc,
 
 	napi_enable_locked(&cq->napi);
 
+	INIT_WORK(&cq->dim.work, mana_rx_dim_work);
+	cq->dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
+
 	mana_gd_ring_cq(cq->gdma_cq, SET_ARM_BIT);
 out:
 	if (!err)
@@ -3532,6 +3667,16 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
 	apc->link_cfg_error = 1;
 	apc->cqe_coalescing_enable = 0;
 
+	/* Initialize interrupt moderation settings if supported by HW */
+	if (gc->pf_cap_flags1 & GDMA_PF_CAP_FLAG_1_DYN_INTERRUPT_MODERATION) {
+		apc->intr_modr_rx_usec = MANA_INTR_MODR_USEC_DEF;
+		apc->intr_modr_rx_comp = MANA_INTR_MODR_COMP_DEF;
+		apc->intr_modr_tx_usec = MANA_INTR_MODR_USEC_DEF;
+		apc->intr_modr_tx_comp = MANA_INTR_MODR_COMP_DEF;
+		apc->rx_dim_enabled = MANA_ADAPTIVE_RX_DEF;
+		apc->tx_dim_enabled = MANA_ADAPTIVE_TX_DEF;
+	}
+
 	mutex_init(&apc->vport_mutex);
 	apc->vport_use_count = 0;
 
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index 94e658d07a27..67a2b282ff4d 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -419,6 +419,15 @@ static int mana_get_coalesce(struct net_device *ndev,
 	    !kernel_coal->rx_cqe_nsecs)
 		kernel_coal->rx_cqe_nsecs = MANA_RX_CQE_NSEC_DEF;
 
+	ec->rx_coalesce_usecs = apc->intr_modr_rx_usec;
+	ec->rx_max_coalesced_frames = apc->intr_modr_rx_comp;
+
+	ec->tx_coalesce_usecs = apc->intr_modr_tx_usec;
+	ec->tx_max_coalesced_frames = apc->intr_modr_tx_comp;
+
+	ec->use_adaptive_rx_coalesce = apc->rx_dim_enabled;
+	ec->use_adaptive_tx_coalesce = apc->tx_dim_enabled;
+
 	return 0;
 }
 
@@ -428,9 +437,34 @@ static int mana_set_coalesce(struct net_device *ndev,
 			     struct netlink_ext_ack *extack)
 {
 	struct mana_port_context *apc = netdev_priv(ndev);
-	u8 saved_cqe_coalescing_enable;
+	struct {
+		u16 intr_modr_rx_usec;
+		u16 intr_modr_rx_comp;
+		u16 intr_modr_tx_usec;
+		u16 intr_modr_tx_comp;
+		u8 cqe_coalescing_enable;
+		bool rx_dim_enabled;
+		bool tx_dim_enabled;
+	} saved;
+	bool modr_changed = false;
+	bool dim_changed = false;
+	struct gdma_context *gc;
 	int err;
 
+	gc = apc->ac->gdma_dev->gdma_context;
+
+	/* Both static and dynamic interrupt moderation (DIM) rely on the
+	 * same HW capability advertised by the PF.
+	 */
+	if ((ec->use_adaptive_rx_coalesce || ec->use_adaptive_tx_coalesce ||
+	     ec->rx_coalesce_usecs || ec->tx_coalesce_usecs ||
+	     ec->rx_max_coalesced_frames || ec->tx_max_coalesced_frames) &&
+	    !(gc->pf_cap_flags1 & GDMA_PF_CAP_FLAG_1_DYN_INTERRUPT_MODERATION)) {
+		NL_SET_ERR_MSG(extack,
+			       "Interrupt Moderation is not supported by HW");
+		return -EOPNOTSUPP;
+	}
+
 	if (kernel_coal->rx_cqe_frames != 1 &&
 	    kernel_coal->rx_cqe_frames != MANA_RXCOMP_OOB_NUM_PPI) {
 		NL_SET_ERR_MSG_FMT(extack,
@@ -440,18 +474,123 @@ static int mana_set_coalesce(struct net_device *ndev,
 		return -EINVAL;
 	}
 
-	saved_cqe_coalescing_enable = apc->cqe_coalescing_enable;
+	if (ec->rx_coalesce_usecs > MANA_INTR_MODR_USEC_MAX ||
+	    ec->tx_coalesce_usecs > MANA_INTR_MODR_USEC_MAX) {
+		NL_SET_ERR_MSG_FMT(extack,
+				   "coalesce usecs must be <= %lu",
+				   MANA_INTR_MODR_USEC_MAX);
+		return -EINVAL;
+	}
+
+	if (ec->rx_max_coalesced_frames > MANA_INTR_MODR_COMP_MAX ||
+	    ec->tx_max_coalesced_frames > MANA_INTR_MODR_COMP_MAX) {
+		NL_SET_ERR_MSG_FMT(extack,
+				   "coalesce frames must be <= %lu",
+				   MANA_INTR_MODR_COMP_MAX);
+		return -EINVAL;
+	}
+
+	if (ec->rx_coalesce_usecs != apc->intr_modr_rx_usec ||
+	    ec->rx_max_coalesced_frames != apc->intr_modr_rx_comp ||
+	    ec->tx_coalesce_usecs != apc->intr_modr_tx_usec ||
+	    ec->tx_max_coalesced_frames != apc->intr_modr_tx_comp)
+		modr_changed = true;
+
+	saved.intr_modr_rx_usec = apc->intr_modr_rx_usec;
+	saved.intr_modr_rx_comp = apc->intr_modr_rx_comp;
+	saved.intr_modr_tx_usec = apc->intr_modr_tx_usec;
+	saved.intr_modr_tx_comp = apc->intr_modr_tx_comp;
+
+	apc->intr_modr_rx_usec = ec->rx_coalesce_usecs;
+	apc->intr_modr_rx_comp = ec->rx_max_coalesced_frames;
+	apc->intr_modr_tx_usec = ec->tx_coalesce_usecs;
+	apc->intr_modr_tx_comp = ec->tx_max_coalesced_frames;
+
+	if (!!ec->use_adaptive_rx_coalesce != apc->rx_dim_enabled ||
+	    !!ec->use_adaptive_tx_coalesce != apc->tx_dim_enabled)
+		dim_changed = true;
+
+	saved.rx_dim_enabled = apc->rx_dim_enabled;
+	saved.tx_dim_enabled = apc->tx_dim_enabled;
+
+	saved.cqe_coalescing_enable = apc->cqe_coalescing_enable;
 	apc->cqe_coalescing_enable =
 		kernel_coal->rx_cqe_frames == MANA_RXCOMP_OOB_NUM_PPI;
 
-	if (!apc->port_is_up)
+	if (!apc->port_is_up) {
+		apc->rx_dim_enabled = !!ec->use_adaptive_rx_coalesce;
+		apc->tx_dim_enabled = !!ec->use_adaptive_tx_coalesce;
 		return 0;
+	}
 
-	err = mana_config_rss(apc, TRI_STATE_TRUE, false, false);
-	if (err)
-		apc->cqe_coalescing_enable = saved_cqe_coalescing_enable;
+	if (apc->cqe_coalescing_enable != saved.cqe_coalescing_enable) {
+		/* CQE coalescing setting is applied via RSS configuration. */
+		err = mana_config_rss(apc, TRI_STATE_TRUE, false, false);
+		if (err) {
+			netdev_err(ndev, "Change CQE coalescing failed: %d\n",
+				   err);
+			apc->cqe_coalescing_enable =
+				saved.cqe_coalescing_enable;
+			apc->intr_modr_rx_usec = saved.intr_modr_rx_usec;
+			apc->intr_modr_rx_comp = saved.intr_modr_rx_comp;
+			apc->intr_modr_tx_usec = saved.intr_modr_tx_usec;
+			apc->intr_modr_tx_comp = saved.intr_modr_tx_comp;
+			return err;
+		}
+	}
 
-	return err;
+	if (modr_changed || dim_changed) {
+		bool new_rx_dim = !!ec->use_adaptive_rx_coalesce;
+		bool new_tx_dim = !!ec->use_adaptive_tx_coalesce;
+		bool disable_rx_dim = saved.rx_dim_enabled && !new_rx_dim;
+		bool disable_tx_dim = saved.tx_dim_enabled && !new_tx_dim;
+		bool enable_rx_dim = !saved.rx_dim_enabled && new_rx_dim;
+		bool enable_tx_dim = !saved.tx_dim_enabled && new_tx_dim;
+		int q;
+
+		/* On disable: clear the per-port flag first and
+		 * synchronize_net() so any in-flight NAPI poll observes
+		 * the new value and will not schedule further DIM work;
+		 * then drain pending work and restore the static
+		 * moderation values.
+		 */
+		if (disable_rx_dim)
+			apc->rx_dim_enabled = false;
+		if (disable_tx_dim)
+			apc->tx_dim_enabled = false;
+		if (disable_rx_dim || disable_tx_dim)
+			synchronize_net();
+
+		for (q = 0; q < apc->num_queues; q++) {
+			struct mana_cq *rx_cq = &apc->rxqs[q]->rx_cq;
+			struct mana_cq *tx_cq = &apc->tx_qp[q]->tx_cq;
+
+			if (disable_rx_dim)
+				mana_dim_change(rx_cq, false);
+			else if (enable_rx_dim)
+				mana_dim_change(rx_cq, true);
+			else if (!new_rx_dim && modr_changed)
+				mana_gd_ring_dim(rx_cq->gdma_cq,
+						 apc->intr_modr_rx_usec, true,
+						 apc->intr_modr_rx_comp, true);
+
+			if (disable_tx_dim)
+				mana_dim_change(tx_cq, false);
+			else if (enable_tx_dim)
+				mana_dim_change(tx_cq, true);
+			else if (!new_tx_dim && modr_changed)
+				mana_gd_ring_dim(tx_cq->gdma_cq,
+						 apc->intr_modr_tx_usec, true,
+						 apc->intr_modr_tx_comp, true);
+		}
+
+		if (enable_rx_dim)
+			apc->rx_dim_enabled = true;
+		if (enable_tx_dim)
+			apc->tx_dim_enabled = true;
+	}
+
+	return 0;
 }
 
 /* mana_set_channels - change the number of queues on a port
@@ -595,7 +734,13 @@ static int mana_get_link_ksettings(struct net_device *ndev,
 }
 
 const struct ethtool_ops mana_ethtool_ops = {
-	.supported_coalesce_params = ETHTOOL_COALESCE_RX_CQE_FRAMES,
+	.supported_coalesce_params = ETHTOOL_COALESCE_RX_CQE_FRAMES |
+				     ETHTOOL_COALESCE_RX_USECS |
+				     ETHTOOL_COALESCE_RX_MAX_FRAMES |
+				     ETHTOOL_COALESCE_TX_USECS |
+				     ETHTOOL_COALESCE_TX_MAX_FRAMES |
+				     ETHTOOL_COALESCE_USE_ADAPTIVE_RX |
+				     ETHTOOL_COALESCE_USE_ADAPTIVE_TX,
 	.op_needs_rtnl		= ETHTOOL_OP_NEEDS_RTNL_SCHANNELS |
 				  ETHTOOL_OP_NEEDS_RTNL_SRINGPARAM,
 	.get_ethtool_stats	= mana_get_ethtool_stats,
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 0c395917b214..8529cef0d7c4 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -47,6 +47,7 @@ enum gdma_queue_type {
 	GDMA_RQ,
 	GDMA_CQ,
 	GDMA_EQ,
+	GDMA_DIM,
 };
 
 enum gdma_work_request_flags {
@@ -126,6 +127,17 @@ union gdma_doorbell_entry {
 		u64 tail_ptr	: 31;
 		u64 arm		: 1;
 	} eq;
+
+	struct {
+		u64 id           : 24;
+		u64 reserved     : 8;
+		u64 mod_usec     : 10;
+		u64 reserve1     : 5;
+		u64 mod_usec_vld : 1;
+		u64 mod_comps    : 8;
+		u64 reserve2     : 7;
+		u64 mod_comps_vld: 1;
+	} dim;
 }; /* HW DATA */
 
 struct gdma_msg_hdr {
@@ -502,6 +514,9 @@ void mana_gd_ring_cq(struct gdma_queue *cq, u8 arm_bit);
 
 int mana_schedule_serv_work(struct gdma_context *gc, enum gdma_eqe_type type);
 
+void mana_gd_ring_dim(struct gdma_queue *cq, u32 mod_usec, bool mod_usec_vld,
+		      u32 mod_comps, bool mod_comps_vld);
+
 struct gdma_wqe {
 	u32 reserved	:24;
 	u32 last_vbytes	:8;
@@ -650,6 +665,9 @@ enum {
 /* Driver supports self recovery on Hardware Channel timeouts */
 #define GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECOVERY BIT(25)
 
+/* Driver supports dynamic interrupt moderation - DIM */
+#define GDMA_DRV_CAP_FLAG_1_DYN_INTERRUPT_MODERATION BIT(28)
+
 #define GDMA_DRV_CAP_FLAGS1 \
 	(GDMA_DRV_CAP_FLAG_1_EQ_SHARING_MULTI_VPORT | \
 	 GDMA_DRV_CAP_FLAG_1_NAPI_WKDONE_FIX | \
@@ -665,7 +683,8 @@ enum {
 	 GDMA_DRV_CAP_FLAG_1_PROBE_RECOVERY | \
 	 GDMA_DRV_CAP_FLAG_1_HANDLE_STALL_SQ_RECOVERY | \
 	 GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECOVERY | \
-	 GDMA_DRV_CAP_FLAG_1_EQ_MSI_UNSHARE_MULTI_VPORT)
+	 GDMA_DRV_CAP_FLAG_1_EQ_MSI_UNSHARE_MULTI_VPORT | \
+	 GDMA_DRV_CAP_FLAG_1_DYN_INTERRUPT_MODERATION)
 
 #define GDMA_DRV_CAP_FLAGS2 0
 
@@ -701,6 +720,9 @@ struct gdma_verify_ver_req {
 	u8 os_ver_str4[128];
 }; /* HW DATA */
 
+/* HW supports dynamic interrupt moderation - DIM */
+#define GDMA_PF_CAP_FLAG_1_DYN_INTERRUPT_MODERATION BIT(15)
+
 struct gdma_verify_ver_resp {
 	struct gdma_resp_hdr hdr;
 	u64 gdma_protocol_ver;
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 13c87baf018e..f7d0109c9c22 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -4,6 +4,7 @@
 #ifndef _MANA_H
 #define _MANA_H
 
+#include <linux/dim.h>
 #include <net/xdp.h>
 #include <net/net_shaper.h>
 
@@ -64,6 +65,19 @@ enum TRI_STATE {
 /* Maximum number of packets per coalesced CQE */
 #define MANA_RXCOMP_OOB_NUM_PPI 4
 
+/* Default/max interrupt moderation settings */
+#define MANA_INTR_MODR_USEC_DEF 0
+#define MANA_INTR_MODR_COMP_DEF 0
+
+#define MANA_ADAPTIVE_RX_DEF true
+#define MANA_ADAPTIVE_TX_DEF true
+
+/* DIM doorbell value field layout */
+#define MANA_INTR_MODR_USEC_MAX    GENMASK(9, 0)
+#define MANA_INTR_MODR_USEC_VLD    BIT(15)
+#define MANA_INTR_MODR_COMP_MAX    GENMASK(7, 0)
+#define MANA_INTR_MODR_COMP_MASK   GENMASK(23, 16)
+
 /* Update this count whenever the respective structures are changed */
 #define MANA_STATS_RX_COUNT (6 + MANA_RXCOMP_OOB_NUM_PPI - 1)
 #define MANA_STATS_TX_COUNT 11
@@ -297,6 +311,10 @@ struct mana_cq {
 	int work_done;
 	int work_done_since_doorbell;
 	int budget;
+
+	/* DIM - Dynamic Interrupt Moderation */
+	struct dim dim;
+	u16 dim_event_ctr;
 };
 
 struct mana_recv_buf_oob {
@@ -573,6 +591,15 @@ struct mana_port_context {
 	u8 cqe_coalescing_enable;
 	u32 cqe_coalescing_timeout_ns;
 
+	/* Interrupt moderation settings */
+	u16 intr_modr_rx_usec;
+	u16 intr_modr_rx_comp;
+	u16 intr_modr_tx_usec;
+	u16 intr_modr_tx_comp;
+
+	bool rx_dim_enabled;
+	bool tx_dim_enabled;
+
 	struct mana_ethtool_stats eth_stats;
 
 	struct mana_ethtool_phy_stats phy_stats;
@@ -598,6 +625,8 @@ int mana_alloc_queues(struct net_device *ndev);
 int mana_attach(struct net_device *ndev);
 int mana_detach(struct net_device *ndev, bool from_close);
 
+int mana_dim_change(struct mana_cq *cq, bool enable);
+
 int mana_probe(struct gdma_dev *gd, bool resuming);
 void mana_remove(struct gdma_dev *gd, bool suspending);
 
@@ -633,6 +662,9 @@ struct mana_obj_spec {
 	u32 queue_size;
 	u32 attached_eq;
 	u32 modr_ctx_id;
+	u8 req_cq_moderation;
+	u16 cq_moderation_comp;
+	u16 cq_moderation_usec;
 };
 
 enum mana_command_code {
@@ -764,6 +796,15 @@ struct mana_create_wqobj_req {
 	u32 cq_size;
 	u32 cq_moderation_ctx_id;
 	u32 cq_parent_qid;
+
+	/* V2 */
+	u8 allow_rqwqe_chain;
+
+	/* V3 */
+	u8 req_cq_moderation;
+	u16 cq_moderation_comp;
+	u16 cq_moderation_usec;
+	u8 reserved2[2];
 }; /* HW DATA */
 
 struct mana_create_wqobj_resp {
@@ -771,6 +812,12 @@ struct mana_create_wqobj_resp {
 	u32 wq_id;
 	u32 cq_id;
 	mana_handle_t wq_obj;
+
+	/* V2 */
+	u16 cq_moderation_comp;
+	u16 cq_moderation_usec;
+	u8 cq_moderation_enabled;
+	u8 reserved1[3];
 }; /* HW DATA */
 
 /* Destroy WQ Object */
-- 
2.34.1


^ permalink raw reply related

* RE: [EXTERNAL] Re: [PATCH net-next v2] net: mana: Add Interrupt Moderation support
From: Haiyang Zhang @ 2026-06-11 18:38 UTC (permalink / raw)
  To: Paolo Abeni, Haiyang Zhang, linux-hyperv@vger.kernel.org,
	netdev@vger.kernel.org, KY Srinivasan, Wei Liu, Dexuan Cui,
	Long Li, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Konstantin Taranov, Simon Horman, Shradha Gupta,
	Erni Sri Satya Vennela, Dipayaan Roy, Aditya Garg, Kees Cook,
	Breno Leitao, linux-kernel@vger.kernel.org,
	linux-rdma@vger.kernel.org
  Cc: Paul Rosswurm
In-Reply-To: <dcd35c42-3aae-4ba2-bd84-4af08467b2fc@redhat.com>



> -----Original Message-----
> From: Paolo Abeni <pabeni@redhat.com>
> Sent: Tuesday, June 9, 2026 9:49 AM
> To: Haiyang Zhang <haiyangz@linux.microsoft.com>; linux-
> hyperv@vger.kernel.org; netdev@vger.kernel.org; KY Srinivasan
> <kys@microsoft.com>; Haiyang Zhang <haiyangz@microsoft.com>; Wei Liu
> <wei.liu@kernel.org>; Dexuan Cui <DECUI@microsoft.com>; Long Li
> <longli@microsoft.com>; Andrew Lunn <andrew+netdev@lunn.ch>; David S.
> Miller <davem@davemloft.net>; Eric Dumazet <edumazet@google.com>; Jakub
> Kicinski <kuba@kernel.org>; Konstantin Taranov <kotaranov@microsoft.com>;
> Simon Horman <horms@kernel.org>; Shradha Gupta
> <shradhagupta@linux.microsoft.com>; Erni Sri Satya Vennela
> <ernis@linux.microsoft.com>; Dipayaan Roy
> <dipayanroy@linux.microsoft.com>; Aditya Garg
> <gargaditya@linux.microsoft.com>; Kees Cook <kees@kernel.org>; Breno
> Leitao <leitao@debian.org>; linux-kernel@vger.kernel.org; linux-
> rdma@vger.kernel.org
> Cc: Paul Rosswurm <paulros@microsoft.com>
> Subject: [EXTERNAL] Re: [PATCH net-next v2] net: mana: Add Interrupt
> Moderation support
> 
> On 6/5/26 1:41 AM, Haiyang Zhang wrote:
> > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c
> b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > index db14357d3732..b1e0c444f414 100644
> > --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > @@ -1551,6 +1551,9 @@ int mana_create_wq_obj(struct mana_port_context
> *apc,
> >
> >  	mana_gd_init_req_hdr(&req.hdr, MANA_CREATE_WQ_OBJ,
> >  			     sizeof(req), sizeof(resp));
> > +
> > +	req.hdr.req.msg_version = GDMA_MESSAGE_V3;
> > +	req.hdr.resp.msg_version = GDMA_MESSAGE_V2;
> 
> Sashiko noted the above could break initialization on older firmware:

Our firmware is forward compatible with newer message versions: old
firmware still handles this message properly, simply ignores the new
feature fields, and queue creation succeeds.
If the DIM capability bit from FW is zero, the driver keeps the DIM
feature off and unchangeable.
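
The capability gating described above can be modelled roughly as below.
This is a simplified user-space sketch, assuming only the PF capability
bit from the patch (bit 15 of pf_cap_flags1). struct coalesce_req,
PF_CAP_DIM and check_modr_supported() are hypothetical stand-ins for the
ethtool structures and the check in mana_set_coalesce(), not driver
symbols.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Bit 15, as GDMA_PF_CAP_FLAG_1_DYN_INTERRUPT_MODERATION in the patch. */
#define PF_CAP_DIM (UINT64_C(1) << 15)

/* Hypothetical, trimmed-down model of the ethtool coalesce request. */
struct coalesce_req {
	uint32_t rx_usecs, tx_usecs;
	uint32_t rx_frames, tx_frames;
	int adaptive_rx, adaptive_tx;
};

/* Return 0 if the request is acceptable, -EOPNOTSUPP if it asks for any
 * static or adaptive moderation while the PF does not advertise it.
 */
static inline int check_modr_supported(uint64_t pf_cap_flags1,
				       const struct coalesce_req *req)
{
	int wants_modr;

	assert(req != NULL);

	wants_modr = req->adaptive_rx || req->adaptive_tx ||
		     req->rx_usecs || req->tx_usecs ||
		     req->rx_frames || req->tx_frames;

	if (wants_modr && !(pf_cap_flags1 & PF_CAP_DIM))
		return -EOPNOTSUPP;

	return 0;
}
```

With the capability bit clear, an all-zero request still passes, so
unrelated coalesce settings remain changeable on old firmware.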


> https://sashiko.dev/#/patchset/20260604234211.2056341-1-haiyangz%40linux.microsoft.com
> 
> [...]
> > +static void mana_update_rx_dim(struct mana_cq *cq)
> > +{
> > +	struct mana_port_context *apc = netdev_priv(cq->rxq->ndev);
> > +	struct mana_rxq *rxq = cq->rxq;
> > +	struct dim_sample dim_sample = {};
> 
> Minor nit: please fix the variable declaration order above. Other
> occurrences below.

Done in the new version.

> 
> [...]
> > @@ -440,17 +474,94 @@ static int mana_set_coalesce(struct net_device
> *ndev,
> >  		return -EINVAL;
> >  	}
> >
> > -	saved_cqe_coalescing_enable = apc->cqe_coalescing_enable;
> > +	if (ec->rx_coalesce_usecs > MANA_INTR_MODR_USEC_MAX ||
> > +	    ec->tx_coalesce_usecs > MANA_INTR_MODR_USEC_MAX) {
> > +		NL_SET_ERR_MSG_FMT(extack,
> > +				   "coalesce usecs must be <= %lu",
> > +				   MANA_INTR_MODR_USEC_MAX);
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (ec->rx_max_coalesced_frames > MANA_INTR_MODR_COMP_MAX ||
> > +	    ec->tx_max_coalesced_frames > MANA_INTR_MODR_COMP_MAX) {
> > +		NL_SET_ERR_MSG_FMT(extack,
> > +				   "coalesce frames must be <= %lu",
> > +				   MANA_INTR_MODR_COMP_MAX);
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (ec->rx_coalesce_usecs != apc->intr_modr_rx_usec ||
> > +	    ec->rx_max_coalesced_frames != apc->intr_modr_rx_comp ||
> > +	    ec->tx_coalesce_usecs != apc->intr_modr_tx_usec ||
> > +	    ec->tx_max_coalesced_frames != apc->intr_modr_tx_comp)
> > +		modr_changed = true;
> > +
> > +	saved.intr_modr_rx_usec = apc->intr_modr_rx_usec;
> > +	saved.intr_modr_rx_comp = apc->intr_modr_rx_comp;
> > +	saved.intr_modr_tx_usec = apc->intr_modr_tx_usec;
> > +	saved.intr_modr_tx_comp = apc->intr_modr_tx_comp;
> > +
> > +	apc->intr_modr_rx_usec = ec->rx_coalesce_usecs;
> > +	apc->intr_modr_rx_comp = ec->rx_max_coalesced_frames;
> > +	apc->intr_modr_tx_usec = ec->tx_coalesce_usecs;
> > +	apc->intr_modr_tx_comp = ec->tx_max_coalesced_frames;
> > +
> > +	if (!!ec->use_adaptive_rx_coalesce != apc->rx_dim_enabled ||
> > +	    !!ec->use_adaptive_tx_coalesce != apc->tx_dim_enabled)
> > +		dim_changed = true;
> > +
> > +	saved.rx_dim_enabled = apc->rx_dim_enabled;
> > +	saved.tx_dim_enabled = apc->tx_dim_enabled;
> > +	apc->rx_dim_enabled = !!ec->use_adaptive_rx_coalesce;
> > +	apc->tx_dim_enabled = !!ec->use_adaptive_tx_coalesce;
> > +
> > +	saved.cqe_coalescing_enable = apc->cqe_coalescing_enable;
> >  	apc->cqe_coalescing_enable =
> >  		kernel_coal->rx_cqe_frames == MANA_RXCOMP_OOB_NUM_PPI;
> >
> >  	if (!apc->port_is_up)
> >  		return 0;
> >
> > -	err = mana_config_rss(apc, TRI_STATE_TRUE, false, false);
> > -	if (err)
> > -		apc->cqe_coalescing_enable = saved_cqe_coalescing_enable;
> > +	if (apc->cqe_coalescing_enable != saved.cqe_coalescing_enable &&
> > +	    !modr_changed && !dim_changed) {
> > +		/* If only CQE coalescing setting is changed, we can just
> update
> > +		 * RSS configuration.
> > +		 */
> > +		err = mana_config_rss(apc, TRI_STATE_TRUE, false, false);
> > +		if (err) {
> > +			netdev_err(ndev, "Change CQE coalescing failed: %d\n",
> > +				   err);
> > +			apc->cqe_coalescing_enable =
> > +				saved.cqe_coalescing_enable;
> > +			return err;
> > +		}
> > +		return 0;
> > +	}
> > +
> > +	if (modr_changed || dim_changed) {
> > +		err = mana_detach(ndev, false);
> > +		if (err) {
> > +			netdev_err(ndev, "mana_detach failed: %d\n", err);
> > +			goto restore_modr;
> > +		}
> > +
> > +		err = mana_attach(ndev);
> > +		if (err) {
> > +			netdev_err(ndev, "mana_attach failed: %d\n", err);
> > +			goto restore_modr;
> > +		}
> 
> You should try hard to avoid this sequence: if mana_attach fails,
> mana_set_coalesce() will leave the NIC unexpectedly down.

I have updated the patch to apply this setting change through a doorbell,
without detach/re-attach, and will submit the new version soon.
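
For illustration, the doorbell parameter packing can be modelled in user
space as below. This is a minimal sketch, assuming the field layout used
in the v3 patch (usec in bits 9:0, usec-valid in bit 15, completions in
bits 23:16). The DIM_* macros and dim_*() helpers are illustrative names,
not driver symbols, and the completions-valid flag is left out because
the patch passes it to the doorbell routine separately.

```c
#include <assert.h>
#include <stdint.h>

/* Field layout taken from the v3 patch; the names are illustrative. */
#define DIM_USEC_MASK   0x000003FFu  /* bits 9:0,   like GENMASK(9, 0)   */
#define DIM_USEC_VLD    0x00008000u  /* bit 15,     like BIT(15)         */
#define DIM_COMP_MASK   0x00FF0000u  /* bits 23:16, like GENMASK(23, 16) */
#define DIM_COMP_SHIFT  16

/* Pack the moderation settings into one 32-bit doorbell parameter. */
static inline uint32_t dim_pack(uint32_t usec, int usec_vld, uint32_t comps)
{
	uint32_t val;

	/* Callers are expected to clamp first, as the DIM workers do. */
	assert(usec <= DIM_USEC_MASK);
	assert(comps <= (DIM_COMP_MASK >> DIM_COMP_SHIFT));

	val = usec | (comps << DIM_COMP_SHIFT);
	if (usec_vld)
		val |= DIM_USEC_VLD;
	return val;
}

/* Unpack helpers, mirroring what the GDMA_DIM doorbell case extracts. */
static inline uint32_t dim_usec(uint32_t val)
{
	return val & DIM_USEC_MASK;
}

static inline uint32_t dim_comps(uint32_t val)
{
	return (val & DIM_COMP_MASK) >> DIM_COMP_SHIFT;
}

static inline int dim_usec_vld(uint32_t val)
{
	return !!(val & DIM_USEC_VLD);
}
```

Packing and unpacking through the same masks keeps the two sides of the
doorbell path in agreement if the layout ever changes.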

Thanks,
- Haiyang

^ permalink raw reply

* Re: [PATCH v3 2/4] scsi: host: allocate struct Scsi_Host on the NUMA node of the host adapter
From: Stefan Hajnoczi @ 2026-06-10 15:37 UTC (permalink / raw)
  To: Sumit Saxena, Michael S. Tsirkin
  Cc: Martin K . Petersen, Jens Axboe, James E . J . Bottomley,
	linux-scsi, linux-block, Adam Radford, Khalid Aziz,
	Adaptec OEM Raid Solutions, Matthew Wilcox, Hannes Reinecke,
	Juergen E . Fischer, Russell King, linux-arm-kernel, Finn Thain,
	Michael Schmitz, Anil Gurumurthy, Sudarsana Kalluru,
	Oliver Neukum, Ali Akcaagac, Jamie Lenehan, Ram Vegesna,
	target-devel, Bradley Grove, Satish Kharat, Sesidhar Baddela,
	Karan Tilak Kumar, Yihang Li, Don Brace, storagedev,
	HighPoint Linux Team, Tyrel Datwyler, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy, linuxppc-dev,
	Brian King, Lee Duncan, Chris Leech, Mike Christie, open-iscsi,
	Justin Tee, Paul Ely, Kashyap Desai, Shivasharan S,
	Chandrakanth Patil, megaraidlinux.pdl, Sathya Prakash Veerichetty,
	Sreekanth Reddy, mpi3mr-linuxdrv.pdl, Suganath Prabu Subramani,
	Ranjan Kumar, MPT-FusionLinux.pdl, Daniel Palmer, GOTO Masanori,
	Jack Wang, Geoff Levand, Michael Reed, Nilesh Javali,
	GR-QLogic-Storage-Upstream, Narsimhulu Musini, K . Y . Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, linux-hyperv,
	Michael S . Tsirkin, Jason Wang, Paolo Bonzini, Eugenio Perez,
	virtualization, Vishal Bhakta, bcm-kernel-feedback-list,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	xen-devel, John Garry
In-Reply-To: <20260609121806.2121755-3-sumit.saxena@broadcom.com>

[-- Attachment #1: Type: text/plain, Size: 878 bytes --]

On Tue, Jun 09, 2026 at 05:48:01PM +0530, Sumit Saxena wrote:
> diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
> index 5fdaa71f0652..88375574cb18 100644
> --- a/drivers/scsi/virtio_scsi.c
> +++ b/drivers/scsi/virtio_scsi.c
> @@ -929,7 +929,7 @@ static int virtscsi_probe(struct virtio_device *vdev)
>  	num_targets = virtscsi_config_get(vdev, max_target) + 1;
>  
>  	shost = scsi_host_alloc(&virtscsi_host_template,
> -				struct_size(vscsi, req_vqs, num_queues));
> +				struct_size(vscsi, req_vqs, num_queues), NULL);

A virtio_device has a parent (the virtio transport, such as virtio_pci),
and that parent may have a NUMA node.

drivers/virtio/virtio.c:register_virtio_device() could call
set_dev_node(dev, dev_to_node(dev->parent)) to propagate the NUMA node
to the virtio_device if it is not already automatically propagated.
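
A rough user-space model of that suggestion, for illustration only. The
model_* names and the struct are hypothetical stand-ins, not the kernel's
struct device or the real set_dev_node()/dev_to_node() helpers, and whether
register_virtio_device() needs such a call at all is left open above.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NO_NODE (-1)	/* mirrors the kernel's NUMA_NO_NODE value */

/* Hypothetical stand-in for struct device: only the fields needed here. */
struct model_device {
	struct model_device *parent;
	int numa_node;
};

/* Model of dev_to_node(): report the node recorded for a device. */
static int model_dev_to_node(const struct model_device *dev)
{
	return dev->numa_node;
}

/* Model of set_dev_node(): record a node for a device. */
static void model_set_dev_node(struct model_device *dev, int node)
{
	dev->numa_node = node;
}

/* Copy the parent's node to the child, but only if the child has none. */
static void model_inherit_node(struct model_device *dev)
{
	if (dev->parent && model_dev_to_node(dev) == MODEL_NO_NODE)
		model_set_dev_node(dev, model_dev_to_node(dev->parent));
}
```

In this model a child that already has a node keeps it, which matches the
"if it is not already automatically propagated" condition.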

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH rdma-next v3] RDMA/mana_ib: Clamp adapter capabilities at the ib_device_attr boundary
From: Leon Romanovsky @ 2026-06-11 11:17 UTC (permalink / raw)
  To: Erni Sri Satya Vennela
  Cc: longli, kotaranov, Jason Gunthorpe, linux-rdma, linux-hyperv,
	linux-kernel
In-Reply-To: <20260525190101.1264185-1-ernis@linux.microsoft.com>

On Mon, May 25, 2026 at 12:01:01PM -0700, Erni Sri Satya Vennela wrote:
> mana_ib stores its adapter capabilities internally as u32 in
> struct mana_ib_adapter_caps. The IB core, however, exposes the
> corresponding device attributes through struct ib_device_attr, where
> fields such as max_qp, max_qp_wr, max_send_sge, max_recv_sge,
> max_sge_rd, max_cq, max_cqe, max_mr, max_pd, max_qp_rd_atom,
> max_res_rd_atom and max_qp_init_rd_atom are signed int.
> 
> mana_ib_query_device() is the only place that copies the cached u32
> caps into these int fields. If a cap exceeds INT_MAX, the implicit
> u32-to-int narrowing yields a negative value. Clamp each cap to
> INT_MAX at this boundary so the values handed to the IB core are always
> non-negative.
> 
> While here, fix a related overflow in the computation of
> max_res_rd_atom. It is derived as max_qp_rd_atom * max_qp, both of
> which are int after the assignment above; the multiplication can
> overflow an int even with the new clamps in place. Widen to s64
> before multiplying and clamp the result to INT_MAX.
> 
> Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> ---
> Changes in v3:
> * Drop clamping from mana_ib_gd_query_adapter_caps(). The internal u32
>   caps cache does not need to be clamped.
> * Move all clamping exclusively to mana_ib_query_device(), which is the
>   only place the cached u32 values are narrowed into the signed int
>   fields of struct ib_device_attr.
> * Reframe commit message: this is a u32-to-int type boundary fix, not a
>   CVM/untrusted-hardware hardening patch.

You should align all types to u32 and avoid hiding the issue behind
min_t().

Thanks
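
For readers following the thread, a minimal user-space sketch of the
narrowing problem the quoted commit message describes. The helper names are
hypothetical and not taken from mana_ib, and the sketch assumes a 32-bit
int. It illustrates the clamping approach under review, not the type
alignment requested in the reply above.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical helper: clamp a u32 capability so it fits a signed int. */
static int clamp_u32_to_int(uint32_t cap)
{
	return cap > (uint32_t)INT_MAX ? INT_MAX : (int)cap;
}

/*
 * Hypothetical helper: multiply two non-negative ints in 64 bits, then
 * clamp the product to INT_MAX so the result never wraps.
 */
static int mul_clamp_to_int(int a, int b)
{
	int64_t prod = (int64_t)a * (int64_t)b;

	return prod > INT_MAX ? INT_MAX : (int)prod;
}
```

Without the clamp, a cap of 0xFFFFFFFF would typically read back as -1 once
stored in an int field.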

^ permalink raw reply

* Re: [PATCH v5 01/15] drm/amd/display: Handle struct drm_plane_state.ignore_damage_clips
From: Thomas Zimmermann @ 2026-06-11 11:09 UTC (permalink / raw)
  To: Javier Martinez Canillas, mripard, maarten.lankhorst, airlied,
	airlied, simona, admin, gargaditya08, paul, jani.nikula, mhklkml,
	zack.rusin, bcm-kernel-feedback-list, harry.wentland, sunpeng.li,
	siqueira, alexander.deucher, rodrigo.vivi, joonas.lahtinen,
	tursulin, dmitry.osipenko, gurchetansingh, olvaffe
  Cc: dri-devel, linux-hyperv, intel-gfx, intel-xe, linux-mips,
	virtualization, amd-gfx, Zack Rusin, stable
In-Reply-To: <87mrx15om7.fsf@ocarina.mail-host-address-is-not-set>

Hi

Am 11.06.26 um 12:59 schrieb Javier Martinez Canillas:
> Thomas Zimmermann <tzimmermann@suse.de> writes:
>
>> Hi Javier
>>
>> Am 11.06.26 um 12:10 schrieb Javier Martinez Canillas:
>>> Thomas Zimmermann <tzimmermann@suse.de> writes:
>>>
>>> Hello Thomas,
>>>
>>>> The mode-setting pipeline can disable damage clipping for a commit
>>>> by setting ignore_damage_clips in struct drm_plane_state. The commit
>>>> will then do a full display update.
>>>>
>>>> Test the flag in DCN code and do a full update in DCN code if it has
>>>> been set.
>>>>
>>>> Commit 35ed38d58257 ("drm: Allow drivers to indicate the damage helpers
>>>> to ignore damage clips") introduced ignore_damage_clips to selectively
>>>> ignore damage clipping in certain framebuffer changes. This driver does
>>>> not do that, but DRM's damage iterator will soon rely on the flag.
>>>> Therefore supporting it here as well makes sense for consistency.
>>>>
>>>> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
>>>> Fixes: 35ed38d58257 ("drm: Allow drivers to indicate the damage helpers to ignore damage clips")
>>> I don't think that a Fixes tag is correct here? Your patch series
>>> is changing the 'struct drm_plane_state.ignore_damage_clips' and
>>> the changes make sense, but definitely isn't a fix in my opinion.
>> But shouldn't we have added this test in amdgpu and the other drivers as
>> part of commit 35ed38d58257 ? Sure, these drivers don't use
>> ignore_damage_clips, but it's still an inconsistency wrt damage
> I don't think so, since the original scope of ignore_damage_clips was for DRM
> drivers of virtual devices (namely virtio-gpu and vmwgfx). These do per-buffer
> uploads instead of per-plane uploads, and so there was a need to force a full
> plane update if the framebuffer attached to the plane was changed.
>
> Your series is now extending the scope of ignore_damage_clips to be used by
> core helpers and force a full plane update when doing a modeset. This makes
> sense to me but it wasn't the original intention of the property and that is
> why I don't think that should be considered a fix.
>
> The only patch that IMO is really a fix for commit 35ed38d58257 is patch #6,
> because it is true that the plane state ignore_damage_clips was carried over
> when the state was duplicated.

Ok, I'll drop the Fixes tags then.

Best regards
Thomas

>

-- 
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Frankenstr. 146, 90461 Nürnberg, Germany, www.suse.com
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich, (HRB 36809, AG Nürnberg)



^ permalink raw reply

* Re: [PATCH v5 01/15] drm/amd/display: Handle struct drm_plane_state.ignore_damage_clips
From: Javier Martinez Canillas @ 2026-06-11 10:59 UTC (permalink / raw)
  To: Thomas Zimmermann, mripard, maarten.lankhorst, airlied, airlied,
	simona, admin, gargaditya08, paul, jani.nikula, mhklkml,
	zack.rusin, bcm-kernel-feedback-list, harry.wentland, sunpeng.li,
	siqueira, alexander.deucher, rodrigo.vivi, joonas.lahtinen,
	tursulin, dmitry.osipenko, gurchetansingh, olvaffe
  Cc: dri-devel, linux-hyperv, intel-gfx, intel-xe, linux-mips,
	virtualization, amd-gfx, Zack Rusin, stable
In-Reply-To: <45aec54a-ec80-48ed-9bcc-84e7bccc11eb@suse.de>

Thomas Zimmermann <tzimmermann@suse.de> writes:

> Hi Javier
>
> Am 11.06.26 um 12:10 schrieb Javier Martinez Canillas:
>> Thomas Zimmermann <tzimmermann@suse.de> writes:
>>
>> Hello Thomas,
>>
>>> The mode-setting pipeline can disable damage clipping for a commit
>>> by setting ignore_damage_clips in struct drm_plane_state. The commit
>>> will then do a full display update.
>>>
>>> Test the flag in DCN code and do a full update in DCN code if it has
>>> been set.
>>>
>>> Commit 35ed38d58257 ("drm: Allow drivers to indicate the damage helpers
>>> to ignore damage clips") introduced ignore_damage_clips to selectively
>>> ignore damage clipping in certain framebuffer changes. This driver does
>>> not do that, but DRM's damage iterator will soon rely on the flag.
>>> Therefore supporting it here as well makes sense for consistency.
>>>
>>> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
>>> Fixes: 35ed38d58257 ("drm: Allow drivers to indicate the damage helpers to ignore damage clips")
>> I don't think that a Fixes tag is correct here? Your patch series
>> is changing the 'struct drm_plane_state.ignore_damage_clips' and
>> the changes make sense, but definitely isn't a fix in my opinion.
>
> But shouldn't we have added this test in amdgpu and the other drivers as 
> part of commit 35ed38d58257 ? Sure, these drivers don't use
> ignore_damage_clips, but it's still an inconsistency wrt damage

I don't think so, since the original scope of ignore_damage_clips was for DRM
drivers of virtual devices (namely virtio-gpu and vmwgfx). These do per-buffer
uploads instead of per-plane uploads, and so there was a need to force a full
plane update if the framebuffer attached to the plane was changed.

Your series is now extending the scope of ignore_damage_clips to be used by
core helpers and force a full plane update when doing a modeset. This makes
sense to me but it wasn't the original intention of the property and that is
why I don't think that should be considered a fix.

The only patch that IMO is really a fix for commit 35ed38d58257 is patch #6,
because it is true that the plane state ignore_damage_clips was carried over
when the state was duplicated.
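
To make the behaviour under discussion concrete, here is a small user-space
model of "force a full plane update when the flag is set". The model_* types
and function are hypothetical and much simpler than struct drm_plane_state
or the DRM damage helpers; this is a sketch of the idea, not the code from
the series.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical rectangle type, loosely modelled on a clip rectangle. */
struct model_rect {
	int x1, y1, x2, y2;
};

/* Hypothetical, heavily reduced plane state. */
struct model_plane_state {
	struct model_rect src;		/* full visible plane area */
	struct model_rect damage;	/* merged damage clips, if any */
	bool has_damage;		/* userspace supplied damage clips */
	bool ignore_damage_clips;	/* set to force a full update */
};

/*
 * Pick the area to upload: the whole plane when the flag is set or no
 * clips were supplied, otherwise only the damaged area.
 */
static struct model_rect
model_update_area(const struct model_plane_state *st)
{
	if (st->ignore_damage_clips || !st->has_damage)
		return st->src;
	return st->damage;
}
```

A driver that does per-buffer uploads would set the flag when the attached
framebuffer changes, so the stale clips are not applied to the new buffer.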

-- 
Best regards,

Javier Martinez Canillas
Core Platforms
Red Hat


^ permalink raw reply

* Re: [PATCH v3 3/4] block: drop shared-tag fairness throttling
From: Sumit Saxena @ 2026-06-11 10:43 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Martin K . Petersen, Jens Axboe,
	James E . J . Bottomley, linux-scsi, linux-block, Adam Radford,
	Khalid Aziz, Adaptec OEM Raid Solutions, Matthew Wilcox,
	Hannes Reinecke, Juergen E . Fischer, Russell King,
	linux-arm-kernel, Finn Thain, Michael Schmitz, Anil Gurumurthy,
	Sudarsana Kalluru, Oliver Neukum, Ali Akcaagac, Jamie Lenehan,
	Ram Vegesna, target-devel, Bradley Grove, Satish Kharat,
	Sesidhar Baddela, Karan Tilak Kumar, Yihang Li, Don Brace,
	storagedev, HighPoint Linux Team, Tyrel Datwyler,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy, linuxppc-dev, Brian King, Lee Duncan,
	Chris Leech, Mike Christie, open-iscsi, Justin Tee, Paul Ely,
	Kashyap Desai, Shivasharan S, Chandrakanth Patil,
	megaraidlinux.pdl, Sathya Prakash Veerichetty, Sreekanth Reddy,
	mpi3mr-linuxdrv.pdl, Suganath Prabu Subramani, Ranjan Kumar,
	MPT-FusionLinux.pdl, Daniel Palmer, GOTO Masanori, YOKOTA Hiroshi,
	Jack Wang, Geoff Levand, Michael Reed, Nilesh Javali,
	GR-QLogic-Storage-Upstream, Narsimhulu Musini, K . Y . Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, linux-hyperv,
	Michael S . Tsirkin, Jason Wang, Paolo Bonzini, Stefan Hajnoczi,
	Eugenio Perez, virtualization, Vishal Bhakta,
	bcm-kernel-feedback-list, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, xen-devel, Bart Van Assche
In-Reply-To: <aimSb9I0Vl-68hy9@kbusch-mbp>


[-- Attachment #1.1: Type: text/plain, Size: 1232 bytes --]

On Wed, Jun 10, 2026 at 10:06 PM Keith Busch <kbusch@kernel.org> wrote:
>
> On Wed, Jun 10, 2026 at 09:16:11PM +0530, Sumit Saxena wrote:
> > The motivation for this change stems from a performance issue we
> > encountered due to false sharing of the
> > 'nr_active_requests_shared_tags' counter on certain CPU architectures.
> > I initially submitted a patch to move that counter to its own cache
> > line to avoid conflicts with 'nr_requests' and other hot fields (see:
> > https://patchwork.kernel.org/project/linux-scsi/patch/20260402074637.92417-3-sumit.saxena@broadcom.com/
> > ).
> >
> > During the review, Bart shared his work, which eliminates the counter
> > entirely by removing the fairness throttling. My testing confirmed
> > that this approach resolved the performance issues and improved IOPS.
> > This patch is part of a larger set, and I have reported the cumulative
> > performance improvements in the cover letter.
>
> So the problem is just the atomic operation accounting overhead? I
> previously thought the device just really needed to consume all the tags
> to hit performance.

That's correct, it's the atomic operation accounting overhead.

Thanks,
Sumit

[-- Attachment #1.2: Type: text/html, Size: 1660 bytes --]

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 5469 bytes --]

^ permalink raw reply

* Re: [PATCH v5 01/15] drm/amd/display: Handle struct drm_plane_state.ignore_damage_clips
From: Thomas Zimmermann @ 2026-06-11 10:41 UTC (permalink / raw)
  To: Javier Martinez Canillas, mripard, maarten.lankhorst, airlied,
	airlied, simona, admin, gargaditya08, paul, jani.nikula, mhklkml,
	zack.rusin, bcm-kernel-feedback-list, harry.wentland, sunpeng.li,
	siqueira, alexander.deucher, rodrigo.vivi, joonas.lahtinen,
	tursulin, dmitry.osipenko, gurchetansingh, olvaffe
  Cc: dri-devel, linux-hyperv, intel-gfx, intel-xe, linux-mips,
	virtualization, amd-gfx, Zack Rusin, stable
In-Reply-To: <87y0gl5qw8.fsf@ocarina.mail-host-address-is-not-set>

Hi Javier

Am 11.06.26 um 12:10 schrieb Javier Martinez Canillas:
> Thomas Zimmermann <tzimmermann@suse.de> writes:
>
> Hello Thomas,
>
>> The mode-setting pipeline can disable damage clipping for a commit
>> by setting ignore_damage_clips in struct drm_plane_state. The commit
>> will then do a full display update.
>>
>> Test the flag in DCN code and do a full update in DCN code if it has
>> been set.
>>
>> Commit 35ed38d58257 ("drm: Allow drivers to indicate the damage helpers
>> to ignore damage clips") introduced ignore_damage_clips to selectively
>> ignore damage clipping in certain framebuffer changes. This driver does
>> not do that, but DRM's damage iterator will soon rely on the flag.
>> Therefore supporting it here as well makes sense for consistency.
>>
>> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
>> Fixes: 35ed38d58257 ("drm: Allow drivers to indicate the damage helpers to ignore damage clips")
> I don't think that a Fixes tag is correct here? Your patch series
> is changing the 'struct drm_plane_state.ignore_damage_clips' and
> the changes make sense, but definitely isn't a fix in my opinion.

But shouldn't we have added this test in amdgpu and the other drivers as 
part of commit 35ed38d58257 ? Sure, these drivers don't use 
ignore_damage_clips, but it's still an inconsistency wrt damage 
handlers. Hence the Fixes tag. Another problem is that the drivers never 
did the test for changes to the plane-state src coordinate that the 
damage iterator does. But that is only fixed later in the series.
>
> Having said that, the change looks good to me.
>
> Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>

Thanks for reviewing.

Best regards
Thomas

>

-- 
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Frankenstr. 146, 90461 Nürnberg, Germany, www.suse.com
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich, (HRB 36809, AG Nürnberg)



^ permalink raw reply

* Re: [PATCH v5 04/15] drm/vmwgfx: Handle struct drm_plane_state.ignore_damage_clips
From: Javier Martinez Canillas @ 2026-06-11 10:12 UTC (permalink / raw)
  To: Thomas Zimmermann, mripard, maarten.lankhorst, airlied, airlied,
	simona, admin, gargaditya08, paul, jani.nikula, mhklkml,
	zack.rusin, bcm-kernel-feedback-list, harry.wentland, sunpeng.li,
	siqueira, alexander.deucher, rodrigo.vivi, joonas.lahtinen,
	tursulin, dmitry.osipenko, gurchetansingh, olvaffe
  Cc: dri-devel, linux-hyperv, intel-gfx, intel-xe, linux-mips,
	virtualization, amd-gfx, Thomas Zimmermann, Zack Rusin, stable
In-Reply-To: <20260610152505.260172-5-tzimmermann@suse.de>

Thomas Zimmermann <tzimmermann@suse.de> writes:

> The mode-setting pipeline can disable damage clipping for a commit
> by setting ignore_damage_clips in struct drm_plane_state. The commit
> will then do a full display update.
>
> Test the flag in the primary ldu plane's atomic_update and do a full
> update if it has been set.
>
> Commit 35ed38d58257 ("drm: Allow drivers to indicate the damage helpers
> to ignore damage clips") introduced ignore_damage_clips to selectively
> ignore damage clipping in certain framebuffer changes. Vmwgfx does not
> do that, but DRM's damage iterator will soon rely on the flag. Therefore
> supporting it here as well makes sense for consistency.
>
> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
> Fixes: 35ed38d58257 ("drm: Allow drivers to indicate the damage helpers to ignore damage clips")

Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>

-- 
Best regards,

Javier Martinez Canillas
Core Platforms
Red Hat


^ permalink raw reply

* Re: [PATCH v5 03/15] drm/vboxvideo: Handle struct drm_plane_state.ignore_damage_clips
From: Javier Martinez Canillas @ 2026-06-11 10:12 UTC (permalink / raw)
  To: Thomas Zimmermann, mripard, maarten.lankhorst, airlied, airlied,
	simona, admin, gargaditya08, paul, jani.nikula, mhklkml,
	zack.rusin, bcm-kernel-feedback-list, harry.wentland, sunpeng.li,
	siqueira, alexander.deucher, rodrigo.vivi, joonas.lahtinen,
	tursulin, dmitry.osipenko, gurchetansingh, olvaffe
  Cc: dri-devel, linux-hyperv, intel-gfx, intel-xe, linux-mips,
	virtualization, amd-gfx, Thomas Zimmermann, Zack Rusin, stable
In-Reply-To: <20260610152505.260172-4-tzimmermann@suse.de>

Thomas Zimmermann <tzimmermann@suse.de> writes:

> The mode-setting pipeline can disable damage clipping for a commit
> by setting ignore_damage_clips in struct drm_plane_state. The commit
> will then do a full display update.
>
> Test the flag in the primary plane's atomic_update and do a full update
> if it has been set.
>
> Commit 35ed38d58257 ("drm: Allow drivers to indicate the damage helpers
> to ignore damage clips") introduced ignore_damage_clips to selectively
> ignore damage clipping in certain framebuffer changes. Vboxvideo does not
> do that, but DRM's damage iterator will soon rely on the flag. Therefore
> supporting it here as well makes sense for consistency.
>
> While at it, also replace uint32_t with the preferred u32.
>
> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
> Fixes: 35ed38d58257 ("drm: Allow drivers to indicate the damage helpers to ignore damage clips")

And for this one as well.

Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>

-- 
Best regards,

Javier Martinez Canillas
Core Platforms
Red Hat


^ permalink raw reply

* Re: [PATCH v5 02/15] drm/i915/display: Handle struct drm_plane_state.ignore_damage_clips
From: Javier Martinez Canillas @ 2026-06-11 10:11 UTC (permalink / raw)
  To: Thomas Zimmermann, mripard, maarten.lankhorst, airlied, airlied,
	simona, admin, gargaditya08, paul, jani.nikula, mhklkml,
	zack.rusin, bcm-kernel-feedback-list, harry.wentland, sunpeng.li,
	siqueira, alexander.deucher, rodrigo.vivi, joonas.lahtinen,
	tursulin, dmitry.osipenko, gurchetansingh, olvaffe
  Cc: dri-devel, linux-hyperv, intel-gfx, intel-xe, linux-mips,
	virtualization, amd-gfx, Thomas Zimmermann, stable, Zack Rusin
In-Reply-To: <20260610152505.260172-3-tzimmermann@suse.de>

Thomas Zimmermann <tzimmermann@suse.de> writes:

> The mode-setting pipeline can disable damage clipping for a commit
> by setting ignore_damage_clips in struct drm_plane_state. The commit
> will then do a full display update. Commit 35ed38d58257 ("drm: Allow
> drivers to indicate the damage helpers to ignore damage clips") introduced
> ignore_damage_clips to selectively ignore damage clipping in certain
> framebuffer changes.
>
> The i915 driver does not modify the flag, but DRM's damage iterator
> will soon rely on it. Calling drm_atomic_helper_check_plane_damage()
> right before drm_atomic_helper_damage_merged() guarantees that it
> has the correct state. The i915 driver does not do this elsewhere
> so far.
>
> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
> Fixes: 35ed38d58257 ("drm: Allow drivers to indicate the damage helpers to ignore damage clips")

Same comment here as for patch #1. I don't think this is a fix.

Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>

-- 
Best regards,

Javier Martinez Canillas
Core Platforms
Red Hat


^ permalink raw reply

* Re: [PATCH v5 01/15] drm/amd/display: Handle struct drm_plane_state.ignore_damage_clips
From: Javier Martinez Canillas @ 2026-06-11 10:10 UTC (permalink / raw)
  To: Thomas Zimmermann, mripard, maarten.lankhorst, airlied, airlied,
	simona, admin, gargaditya08, paul, jani.nikula, mhklkml,
	zack.rusin, bcm-kernel-feedback-list, harry.wentland, sunpeng.li,
	siqueira, alexander.deucher, rodrigo.vivi, joonas.lahtinen,
	tursulin, dmitry.osipenko, gurchetansingh, olvaffe
  Cc: dri-devel, linux-hyperv, intel-gfx, intel-xe, linux-mips,
	virtualization, amd-gfx, Thomas Zimmermann, Zack Rusin, stable
In-Reply-To: <20260610152505.260172-2-tzimmermann@suse.de>

Thomas Zimmermann <tzimmermann@suse.de> writes:

Hello Thomas,

> The mode-setting pipeline can disable damage clipping for a commit
> by setting ignore_damage_clips in struct drm_plane_state. The commit
> will then do a full display update.
>
> Test the flag in DCN code and do a full update in DCN code if it has
> been set.
>
> Commit 35ed38d58257 ("drm: Allow drivers to indicate the damage helpers
> to ignore damage clips") introduced ignore_damage_clips to selectively
> ignore damage clipping in certain framebuffer changes. This driver does
> not do that, but DRM's damage iterator will soon rely on the flag.
> Therefore supporting it here as well makes sense for consistency.
>
> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
> Fixes: 35ed38d58257 ("drm: Allow drivers to indicate the damage helpers to ignore damage clips")

I don't think that a Fixes tag is correct here? Your patch series
is changing the 'struct drm_plane_state.ignore_damage_clips' and
the changes make sense, but definitely isn't a fix in my opinion.

Having said that, the change looks good to me.

Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>

-- 
Best regards,

Javier Martinez Canillas
Core Platforms
Red Hat


^ permalink raw reply

* Re: [PATCH v3 3/4] block: drop shared-tag fairness throttling
From: Keith Busch @ 2026-06-10 16:35 UTC (permalink / raw)
  To: Sumit Saxena
  Cc: Christoph Hellwig, Martin K . Petersen, Jens Axboe,
	James E . J . Bottomley, linux-scsi, linux-block, Adam Radford,
	Khalid Aziz, Adaptec OEM Raid Solutions, Matthew Wilcox,
	Hannes Reinecke, Juergen E . Fischer, Russell King,
	linux-arm-kernel, Finn Thain, Michael Schmitz, Anil Gurumurthy,
	Sudarsana Kalluru, Oliver Neukum, Ali Akcaagac, Jamie Lenehan,
	Ram Vegesna, target-devel, Bradley Grove, Satish Kharat,
	Sesidhar Baddela, Karan Tilak Kumar, Yihang Li, Don Brace,
	storagedev, HighPoint Linux Team, Tyrel Datwyler,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy, linuxppc-dev, Brian King, Lee Duncan,
	Chris Leech, Mike Christie, open-iscsi, Justin Tee, Paul Ely,
	Kashyap Desai, Shivasharan S, Chandrakanth Patil,
	megaraidlinux.pdl, Sathya Prakash Veerichetty, Sreekanth Reddy,
	mpi3mr-linuxdrv.pdl, Suganath Prabu Subramani, Ranjan Kumar,
	MPT-FusionLinux.pdl, Daniel Palmer, GOTO Masanori, YOKOTA Hiroshi,
	Jack Wang, Geoff Levand, Michael Reed, Nilesh Javali,
	GR-QLogic-Storage-Upstream, Narsimhulu Musini, K . Y . Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, linux-hyperv,
	Michael S . Tsirkin, Jason Wang, Paolo Bonzini, Stefan Hajnoczi,
	Eugenio Perez, virtualization, Vishal Bhakta,
	bcm-kernel-feedback-list, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, xen-devel, Bart Van Assche
In-Reply-To: <CAL2rwxr1uGshb1o=jvP2OnBffNz2cKXj8tHuAUCN5HFuy2vB_g@mail.gmail.com>

On Wed, Jun 10, 2026 at 09:16:11PM +0530, Sumit Saxena wrote:
> The motivation for this change stems from a performance issue we
> encountered due to false sharing of the 'nr_active_requests_shared_tags'
> counter on certain CPU architectures. I initially submitted a patch to
> move that counter to its own cache line to avoid conflicts with
> 'nr_requests' and other hot fields (see:
> https://patchwork.kernel.org/project/linux-scsi/patch/20260402074637.92417-3-sumit.saxena@broadcom.com/
> ).
>
> During the review, Bart shared his work, which eliminates the counter
> entirely by removing the fairness throttling. My testing confirmed that
> this approach resolved the performance issues and improved IOPS.
> This patch is part of a larger set, and I have reported the cumulative
> performance improvements in the cover letter.

So the problem is just the atomic operation accounting overhead? I
previously thought the device just really needed to consume all the tags
to hit performance.

^ permalink raw reply

* Re: [PATCH v3 3/4] block: drop shared-tag fairness throttling
From: Sumit Saxena @ 2026-06-10 15:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Martin K . Petersen, Jens Axboe, James E . J . Bottomley,
	linux-scsi, linux-block, Adam Radford, Khalid Aziz,
	Adaptec OEM Raid Solutions, Matthew Wilcox, Hannes Reinecke,
	Juergen E . Fischer, Russell King, linux-arm-kernel, Finn Thain,
	Michael Schmitz, Anil Gurumurthy, Sudarsana Kalluru,
	Oliver Neukum, Ali Akcaagac, Jamie Lenehan, Ram Vegesna,
	target-devel, Bradley Grove, Satish Kharat, Sesidhar Baddela,
	Karan Tilak Kumar, Yihang Li, Don Brace, storagedev,
	HighPoint Linux Team, Tyrel Datwyler, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy, linuxppc-dev,
	Brian King, Lee Duncan, Chris Leech, Mike Christie, open-iscsi,
	Justin Tee, Paul Ely, Kashyap Desai, Shivasharan S,
	Chandrakanth Patil, megaraidlinux.pdl, Sathya Prakash Veerichetty,
	Sreekanth Reddy, mpi3mr-linuxdrv.pdl, Suganath Prabu Subramani,
	Ranjan Kumar, MPT-FusionLinux.pdl, Daniel Palmer, GOTO Masanori,
	YOKOTA Hiroshi, Jack Wang, Geoff Levand, Michael Reed,
	Nilesh Javali, GR-QLogic-Storage-Upstream, Narsimhulu Musini,
	K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	linux-hyperv, Michael S . Tsirkin, Jason Wang, Paolo Bonzini,
	Stefan Hajnoczi, Eugenio Perez, virtualization, Vishal Bhakta,
	bcm-kernel-feedback-list, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, xen-devel, Bart Van Assche
In-Reply-To: <aikAs4X-2NWTuwCc@infradead.org>


[-- Attachment #1.1: Type: text/plain, Size: 1889 bytes --]

On Wed, Jun 10, 2026 at 11:44 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> Just dropping the fairness was rejected before and there is no
> explanation here on why any of that has changed.

I missed the fact that this patch had been previously rejected.

The motivation for this change stems from a performance issue we
encountered due to false sharing of the 'nr_active_requests_shared_tags'
counter on certain CPU architectures. I initially submitted a patch to
move that counter to its own cache line to avoid conflicts with
'nr_requests' and other hot fields (see:
https://patchwork.kernel.org/project/linux-scsi/patch/20260402074637.92417-3-sumit.saxena@broadcom.com/
).

During the review, Bart shared his work, which eliminates the counter
entirely by removing the fairness throttling. My testing confirmed that
this approach resolved the performance issues and improved IOPS.
This patch is part of a larger set, and I have reported the cumulative
performance improvements in the cover letter.
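
As an illustration of the earlier "own cache line" approach mentioned above,
here is a user-space C11 sketch. The struct, its field names and the 64-byte
line size are assumptions for the example; the kernel would normally use its
own annotations such as ____cacheline_aligned_in_smp, and the real layout of
struct request_queue is not reproduced here.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

#define MODEL_CACHE_LINE 64	/* assumed cache line size */

/* Hypothetical queue layout, not the kernel's struct request_queue. */
struct model_queue {
	unsigned long nr_requests;	/* hot, mostly read */

	/* Written on every submit and completion; starts its own line. */
	alignas(MODEL_CACHE_LINE) atomic_int nr_active_shared;

	/* Aligned too, so nothing else lands on the counter's line. */
	alignas(MODEL_CACHE_LINE) unsigned long other_hot_field;
};

/* Report whether two byte offsets fall into the same cache line. */
static int model_same_cache_line(size_t a, size_t b)
{
	return a / MODEL_CACHE_LINE == b / MODEL_CACHE_LINE;
}
```

Isolating the counter stops writers from invalidating the line that holds
the read-mostly fields, but every submission still performs the atomic
update itself, which is the overhead that removing the counter avoids.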

Thanks,
Sumit

>
> On Tue, Jun 09, 2026 at 05:48:02PM +0530, Sumit Saxena wrote:
> > From: Bart Van Assche <bvanassche@acm.org>
> >
> > Original patch [1] by Bart Van Assche; this version is rebased onto the
> > current tree.  In testing it improves IOPS by roughly 16-18% by removing
> > the fair-sharing throttle on shared tag queues.
> >
> > This patch removes the following code and structure members:
> > - The function hctx_may_queue().
> > - blk_mq_hw_ctx.nr_active and
> >   request_queue.nr_active_requests_shared_tags
> >   and also all the code that modifies these two member variables.
>
> .. and besides that, this commit message is still entirely useless
> as it doesn't explain any of the thoughts of why this change is safe
> and desirable.  While the mechanics above are totally obvious from
> the code change itself.
>

[-- Attachment #1.2: Type: text/html, Size: 2485 bytes --]

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 5469 bytes --]

^ permalink raw reply

