Re: [net-next v8 2/3] net: ethernet: mtk_eth_soc: Add RSS support

Netdev List
 help / color / mirror / Atom feed

From: Jakub Kicinski <kuba@kernel.org>
To: linux@fw-web.de
Cc: Jakub Kicinski <kuba@kernel.org>,
	nbd@nbd.name, lorenzo@kernel.org, andrew+netdev@lunn.ch,
	davem@davemloft.net, edumazet@google.com, pabeni@redhat.com,
	matthias.bgg@gmail.com, angelogioacchino.delregno@collabora.com,
	linux@armlinux.org.uk, mason-cw.chang@mediatek.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	daniel@makrotopia.org, linux-mediatek@lists.infradead.org,
	linux-arm-kernel@lists.infradead.org
Subject: Re: [net-next v8 2/3] net: ethernet: mtk_eth_soc: Add RSS support
Date: Wed, 13 May 2026 18:52:20 -0700	[thread overview]
Message-ID: <20260514015221.521091-1-kuba@kernel.org> (raw)
In-Reply-To: <20260509190938.169290-3-linux@fw-web.de>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
Subject: net: ethernet: mtk_eth_soc: Add RSS support

This commit adds Receive Side Scaling support for the mtk_eth_soc driver,
enabling up to four RX rings on MT7988 with ethtool -x and
--set-rxfh-indir support. It also reworks the RX NAPI / IRQ handling
into per-ring instances and renames eth->irq[] to eth->irq_fe[] /
eth->irq_pdma[].

> diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c

[ ... ]

> @@ -2235,16 +2199,15 @@ static int mtk_poll_rx(struct napi_struct *napi, int budget,
[ ... ]
> @@ -2436,7 +2399,7 @@ static int mtk_poll_rx(struct napi_struct *napi, int budget,
>  		 * we continue
>  		 */
>  		wmb();
> -		mtk_update_rx_cpu_idx(eth);
> +		mtk_update_rx_cpu_idx(eth, ring);
>  	}
>  
>  	eth->rx_packets += done;

[High]
With RSS enabled, up to MTK_RX_NAPI_NUM (=4) NAPI instances can run
concurrently on different CPUs.  eth->rx_packets, eth->rx_bytes (here)
and eth->rx_events (incremented in mtk_handle_irq_rx with IRQF_SHARED)
are plain u32 fields with no atomic, no spinlock, and no per-CPU
storage:

	eth->rx_packets += done;
	eth->rx_bytes += bytes;
	dim_update_sample(eth->rx_events, eth->rx_packets, eth->rx_bytes,
			  &dim_sample);
	net_dim(&eth->rx_dim, &dim_sample);

Pre-patch a single rx_napi serialised these accesses.  Are these
unsynchronised cross-CPU updates intentional?  The corrupted values
feed dim_update_sample() / net_dim() which drives interrupt
moderation, and eth->rx_dim is a single struct dim shared across all
RX NAPIs.

> @@ -2645,7 +2608,9 @@ static int mtk_napi_tx(struct napi_struct *napi, int budget)
>  
>  static int mtk_napi_rx(struct napi_struct *napi, int budget)
>  {
> -	struct mtk_eth *eth = container_of(napi, struct mtk_eth, rx_napi);
> +	struct mtk_napi *rx_napi = container_of(napi, struct mtk_napi, napi);
> +	struct mtk_eth *eth = rx_napi->eth;
> +	struct mtk_rx_ring *ring = rx_napi->rx_ring;
>  	const struct mtk_reg_map *reg_map = eth->soc->reg_map;
>  	int rx_done_total = 0;

[Medium]
mtk_napi_rx() calls mtk_handle_status_irq(eth) on entry of every poll.
mtk_handle_status_irq() does an unprotected read / clear of
MTK_INT_STATUS2 and calls mtk_stats_update():

	status2 = mtk_r32(eth, MTK_INT_STATUS2);
	if (unlikely(status2 & (MTK_GDM1_AF | MTK_GDM2_AF))) {
		mtk_stats_update(eth);
		mtk_w32(eth, (MTK_GDM1_AF | MTK_GDM2_AF), MTK_INT_STATUS2);
	}

With multiple rx_napi instances polling on different CPUs, can a fresh
AF event arriving between the read and the write be silently dropped,
or stats be double-counted?  The per-MAC stats_lock taken inside
mtk_stats_update_mac does not synchronise the status read or the
clear.

[ ... ]

> @@ -3222,6 +3192,105 @@ static int mtk_hwlro_get_fdir_all(struct net_device *dev,
[ ... ]
> +static int mtk_rss_init(struct mtk_eth *eth)
> +{
> +	const struct mtk_soc_data *soc = eth->soc;
> +	const struct mtk_reg_map *reg_map = eth->soc->reg_map;
> +	struct mtk_rss_params *rss_params = &eth->rss_params;
> +	u32 val;
> +	int i;
> +
> +	netdev_rss_key_fill(rss_params->hash_key, MTK_RSS_HASH_KEYSIZE);
> +
> +	for (i = 0; i < MTK_RSS_MAX_INDIRECTION_TABLE; i++)
> +		rss_params->indirection_table[i] = ethtool_rxfh_indir_default(i, eth->soc->rss_num);
> +
> +	if (soc->rx.desc_size == sizeof(struct mtk_rx_dma)) {
> +		/* Set RSS rings to PSE modes */
> +		for (i = 1; i <= MTK_HW_LRO_RING_NUM(eth); i++) {
> +			val = mtk_r32(eth, MTK_LRO_CTRL_DW2_CFG(reg_map, i));
> +			val |= MTK_RING_PSE_MODE;
> +			mtk_w32(eth, val, MTK_LRO_CTRL_DW2_CFG(reg_map, i));
> +		}
> +
> +		/* Enable non-lro multiple rx */
> +		val = mtk_r32(eth, reg_map->pdma.lro_ctrl_dw0);
> +		val |= MTK_NON_LRO_MULTI_EN;
> +		mtk_w32(eth, val, reg_map->pdma.lro_ctrl_dw0);
> +
> +		/* Enable RSS dly int supoort */
> +		val |= MTK_LRO_DLY_INT_EN;
> +		mtk_w32(eth, val, reg_map->pdma.lro_ctrl_dw0);
> +	}

[Medium]
Is this block reachable on any SoC?  mtk_rss_init() is only invoked
from mtk_dma_init() under MTK_HAS_CAPS(MTK_RSS).  Only mt7988_data
sets MTK_RSS in caps, and mt7988_data uses
rx.desc_size = sizeof(struct mtk_rx_dma_v2), not
sizeof(struct mtk_rx_dma).

If the PSE-mode / MTK_NON_LRO_MULTI_EN / MTK_LRO_DLY_INT_EN
programming is needed on MT7988, is it being silently skipped?  If it
was intended for mt7981 / mt7986, those SoCs never reach this path
because their caps lack MTK_RSS.

There is also a typo above: "RSS dly int supoort" -> "support".

[ ... ]

> +	/* Set perRSS GRP INT */
> +	mtk_m32(eth, MTK_RX_DONE_INT(eth, MTK_RSS_RING(1)),
> +		MTK_RX_DONE_INT(eth, MTK_RSS_RING(1)), reg_map->pdma.int_grp);

[ ... ]

> +
> +	return 0;
> +}

[ ... ]

> @@ -3312,6 +3381,17 @@ static int mtk_dma_init(struct mtk_eth *eth)
[ ... ]
> +	if (MTK_HAS_CAPS(eth->soc->caps, MTK_RSS)) {
> +		for (i = 1; i < MTK_RX_RSS_NUM(eth); i++) {
> +			err = mtk_rx_alloc(eth, MTK_RSS_RING(i), MTK_RX_FLAGS_NORMAL);
> +			if (err)
> +				return err;
> +		}
> +		err = mtk_rss_init(eth);
> +		if (err)
> +			return err;
> +	}
> +

[ ... ]

> @@ -3437,14 +3523,36 @@ static int mtk_get_irqs(struct platform_device *pdev, struct mtk_eth *eth)
>  	return 0;
>  }
>  
> -static irqreturn_t mtk_handle_irq_rx(int irq, void *_eth)
> +static int mtk_get_irqs_pdma(struct platform_device *pdev, struct mtk_eth *eth)
>  {
> -	struct mtk_eth *eth = _eth;
> +	char rxring[] = "pdma0";
> +	int i;
> +
> +	for (i = 0; i < MTK_PDMA_IRQ_NUM; i++) {
> +		rxring[4] = '0' + i;
> +		eth->irq_pdma[i] = platform_get_irq_byname(pdev, rxring);
> +		if (eth->irq_pdma[i] < 0)
> +			return eth->irq_pdma[i];
> +	}
> +
> +	return 0;
> +}

[High]
This loop demands all four pdma0..pdma3 IRQs whenever
MTK_HAS_CAPS(MTK_PDMA_INT) is set, but the same patch adds
MTK_PDMA_INT to MT7981_CAPS and MT7986_CAPS — SoCs which do not
have MTK_RSS.  On those SoCs only rx_napi[0] / pdma0 is ever
referenced (every pdma1..pdma3 use site is gated on
MTK_HAS_CAPS(MTK_RSS) in probe / open / stop / dma_init / dma_free).

Existing MT7981 / MT7986 device trees declare only fe1 / fe2.  Does
this turn into a probe-time -ENXIO failure for those boards?  Should
the call to mtk_get_irqs_pdma(), or at least the requests for
pdma1..pdma3, be folded under MTK_HAS_CAPS(MTK_RSS) so older DTs
continue to probe?

> +
> +static irqreturn_t mtk_handle_irq_rx(int irq, void *priv)
> +{
> +	struct mtk_napi *rx_napi = priv;
> +	struct mtk_eth *eth = rx_napi->eth;
> +	struct mtk_rx_ring *ring = rx_napi->rx_ring;
>  
>  	eth->rx_events++;
> -	if (likely(napi_schedule_prep(&eth->rx_napi))) {
> -		mtk_rx_irq_disable(eth, eth->soc->rx.irq_done_mask);
> -		__napi_schedule(&eth->rx_napi);
> +	if (unlikely(!(mtk_r32(eth, eth->soc->reg_map->pdma.irq_status) &
> +		       mtk_r32(eth, eth->soc->reg_map->pdma.irq_mask) &
> +		       MTK_RX_DONE_INT(eth, ring->ring_no))))
> +		return IRQ_NONE;

[High]
Pre-patch mt7988_data set rx.irq_done_mask = MTK_RX_DONE_INT_V2 =
BIT(14).  Post-patch the per-soc field is removed and replaced with
MTK_RX_DONE_INT(eth, ring_no), which for V3 (mt7988) expands to
BIT(24 + ring_no), so MTK_RX_DONE_INT(eth, 0) is now BIT(24) instead
of BIT(14).

Is the bit number change intentional?  If MT7988 RX-done is actually
signalled on BIT(14), the new mask never matches, mtk_handle_irq_rx
returns IRQ_NONE, and RX traffic stalls.  If BIT(14) was wrong all
along, could the commit message describe the silent fix?  The patch
otherwise advertises only RSS.

[ ... ]

> @@ -3888,9 +4011,7 @@ static void mtk_dim_rx(struct work_struct *work)
[ ... ]
> -	val = mtk_r32(eth, reg_map->pdma.delay_irq);
> -	val &= MTK_PDMA_DELAY_TX_MASK;
> -	val |= MTK_PDMA_DELAY_RX_EN;
> +	val = MTK_PDMA_DELAY_RX_EN;
[ ... ]
> -	mtk_w32(eth, val, reg_map->pdma.delay_irq);
>  	if (MTK_HAS_CAPS(eth->soc->caps, MTK_QDMA))
> -		mtk_w32(eth, val, reg_map->qdma.delay_irq);
> +		mtk_m32(eth, MTK_PDMA_DELAY_TX_MASK,
> +			val << MTK_PDMA_DELAY_TX_PTIME_SHIFT, reg_map->qdma.delay_irq);
> +
> +	if (eth->soc->rx.desc_size == sizeof(struct mtk_rx_dma)) {
> +		mtk_m32(eth, MTK_PDMA_DELAY_RX_MASK, val, reg_map->pdma.delay_irq);
> +		mtk_w32(eth, val, reg_map->pdma.lro_rx1_dly_int);
> +		mtk_w32(eth, val, reg_map->pdma.lro_rx1_dly_int + 0x4);
> +		mtk_w32(eth, val, reg_map->pdma.lro_rx1_dly_int + 0x8);
> +	} else {
> +		val = val | (val << MTK_PDMA_DELAY_RX_RING_SHIFT);
> +		mtk_w32(eth, val, reg_map->pdma.rx_delay_irq);
> +	}

[Critical]
The new branch is gated on rx.desc_size, not on a capability bit.
rt5350_data (MT7628) has rx.desc_size = sizeof(struct mtk_rx_dma) and
uses mt7628_reg_map, which does not initialise lro_rx1_dly_int (it
defaults to 0).

mtk_hw_init() calls mtk_dim_rx(&eth->rx_dim.work) inside the MT7628
branch, and net_dim invokes it again on every moderation update.  Each
call therefore performs:

	mtk_w32(eth, val, 0);
	mtk_w32(eth, val, 0x4);
	mtk_w32(eth, val, 0x8);

These offsets land in the Frame Engine global config region
(MTK_FE_GLO_CFG(0) = 0x00 etc.) instead of any LRO/RSS register, with
val containing MTK_PDMA_DELAY_RX_EN = BIT(15).  Can this corrupt
MT7628 frame-engine state on every boot and every NAPI moderation
update?  Should the branch be gated on MTK_HWLRO or MTK_RSS rather
than on rx.desc_size?

[Medium]
On V3 hardware (the only chip enabling RSS) only two ring slots are
programmed here:

	val = val | (val << MTK_PDMA_DELAY_RX_RING_SHIFT);
	mtk_w32(eth, val, reg_map->pdma.rx_delay_irq);

With MTK_PDMA_DELAY_RX_RING_SHIFT = 16 the register holds two rings,
but RSS uses MTK_RX_RSS_NUM(eth) = 4 rings.  Do rings 2 and 3 ever
receive DIM-driven moderation updates after probe, or do they stay at
the hardware default?

[ ... ]

> @@ -4149,6 +4285,25 @@ static void mtk_hw_reset_monitor_work(struct work_struct *work)
[ ... ]
> +static int mtk_napi_init(struct mtk_eth *eth)
> +{
[ ... ]
> +}

[ ... ]

> @@ -4238,12 +4393,11 @@ static int mtk_hw_init(struct mtk_eth *eth, bool reset)
[ ... ]

> @@ -4700,8 +4860,13 @@ static void mtk_get_ethtool_stats(struct net_device *dev,
>  
>  static u32 mtk_get_rx_ring_count(struct net_device *dev)
>  {
> +	struct mtk_mac *mac = netdev_priv(dev);
> +	struct mtk_eth *eth = mac->hw;
> +
>  	if (dev->hw_features & NETIF_F_LRO)
>  		return MTK_MAX_RX_RING_NUM;
> +	else if (MTK_HAS_CAPS(eth->soc->caps, MTK_RSS))
> +		return MTK_RX_RSS_NUM(eth);
>  
>  	return 0;
>  }

[ ... ]

> @@ -4784,6 +4949,70 @@ static int mtk_set_eee(struct net_device *dev, struct ethtool_keee *eee)
[ ... ]
> +static int mtk_set_rxfh(struct net_device *dev, struct ethtool_rxfh_param *rxfh,
> +			struct netlink_ext_ack *extack)
> +{
> +	struct mtk_mac *mac = netdev_priv(dev);
> +	struct mtk_eth *eth = mac->hw;
> +	struct mtk_rss_params *rss_params = &eth->rss_params;
> +	const struct mtk_reg_map *reg_map = eth->soc->reg_map;
> +	int i;
> +
> +	if (rxfh->hfunc != ETH_RSS_HASH_NO_CHANGE &&
> +	    rxfh->hfunc != ETH_RSS_HASH_TOP)
> +		return -EOPNOTSUPP;
> +
> +	if (rxfh->key) {
> +		memcpy(rss_params->hash_key, rxfh->key,
> +		       sizeof(rss_params->hash_key));
> +		for (i = 0; i < MTK_RSS_HASH_KEYSIZE / sizeof(u32); i++)
> +			mtk_w32(eth, rss_params->hash_key[i],
> +				MTK_RSS_HASH_KEY_DW(reg_map, i));
> +	}
> +
> +	if (rxfh->indir) {
> +		for (i = 0; i < MTK_RSS_MAX_INDIRECTION_TABLE; i++)
> +			rss_params->indirection_table[i] = rxfh->indir[i];
> +		for (i = 0; i < MTK_RSS_MAX_INDIRECTION_TABLE / 16; i++)
> +			mtk_w32(eth, mtk_rss_indr_table(rss_params, i),
> +				MTK_RSS_INDR_TABLE_DW(reg_map, i));
> +	}
> +
> +	return 0;
> +}

[High]
Should this op (and .get_rxfh / .get_rxfh_key_size /
.get_rxfh_indir_size) be gated on MTK_HAS_CAPS(MTK_RSS)?  As written,
mtk_ethtool_ops exposes them unconditionally for every SoC the driver
supports.

MTK_RSS_HASH_KEY_DW(reg_map, x) and MTK_RSS_INDR_TABLE_DW(reg_map, x)
are computed as (reg_map)->pdma.rss_glo_cfg + 0x20 + ... and + 0x50 +
... .  mtk_reg_map (mt2701/mt7621/mt7622/mt7623/mt7629) and
mt7628_reg_map do not initialise rss_glo_cfg, so it defaults to 0.

A CAP_NET_ADMIN user invoking ethtool -X dev hkey ... on those SoCs
would write user-controlled u32 values into absolute frame-engine
offsets 0x20..0x44 (and indirection writes to 0x50..0x6c).  Can these
ops be made conditional on MTK_RSS, or filled in only for the RSS-
capable SoC?

[Medium]
mtk_set_rxfh() does not validate user-supplied indir[i] entries
against MTK_RX_RSS_NUM(eth).  The packing helper allots only 2 bits
per entry:

	val |= (rss_params->indirection_table[i] << (2 * (i % 16)));

ethtool core only enforces indir[i] < num_rx_rings when
num_rx_rings is non-zero, and MTK_RSS hardware has at most 4 rings.
Should the driver reject values >= MTK_RX_RSS_NUM(eth) so they don't
silently bleed across the 2-bit slots in the packed register?  The
indirection table also stores the raw u32 into a u8 field, which
truncates 4..255 instead of catching them.

[Medium]
mtk_rss_init() wraps key / indirection writes with an
MTK_RSS_CFG_REQ pause / release handshake:

	/* Pause */
	val |= MTK_RSS_CFG_REQ;
	mtk_w32(eth, val, reg_map->pdma.rss_glo_cfg);
	...
	/* Release pause */
	val &= ~(MTK_RSS_CFG_REQ);

mtk_set_rxfh() writes the same registers at runtime without that
handshake.  Can the hardware sample partially-updated key /
indirection state and steer packets to incorrect or out-of-range
rings while the write loop is in flight?

[ ... ]

> @@ -5424,6 +5708,7 @@ static const struct mtk_soc_data mt2701_data = {

[ ... entries for mt2701/mt7621/mt7622/mt7623/mt7629/rt5350 ... ]

> @@ -5567,16 +5851,16 @@ static const struct mtk_soc_data mt7981_data = {
[ ... ]
>  	.foe_entry_size = MTK_FOE_ENTRY_V2_SIZE,
> +	.rss_num = 4,
[ ... ]
> @@ -5597,6 +5881,7 @@ static const struct mtk_soc_data mt7986_data = {
[ ... ]
>  	.foe_entry_size = MTK_FOE_ENTRY_V2_SIZE,
> +	.rss_num = 4,

[Low]
mt7981_data and mt7986_data set rss_num = 4, but MT7981_CAPS and
MT7986_CAPS do not include MTK_RSS (only MT7988_CAPS does).  Every
RSS code path is gated on MTK_HAS_CAPS(MTK_RSS) before consulting
rss_num, so this has no current effect, but the SoC capability table
becomes internally inconsistent.  Should rss_num be 0 on these SoCs
until MTK_RSS is actually wired up and tested for them?

> @@ -5627,20 +5911,20 @@ static const struct mtk_soc_data mt7988_data = {
[ ... ]
>  	.foe_entry_size = MTK_FOE_ENTRY_V3_SIZE,
> +	.rss_num = 4,
[ ... ]
>  	.rx = {
>  		.desc_size = sizeof(struct mtk_rx_dma_v2),
> -		.irq_done_mask = MTK_RX_DONE_INT_V2,
>  		.dma_l4_valid = RX_DMA_L4_VALID_V2,
>  		.dma_max_len = MTK_TX_DMA_BUF_LEN_V2,
>  		.dma_len_offset = 8,
> -		.dma_size = MTK_DMA_SIZE(2K),
> +		.dma_size = MTK_DMA_SIZE(1K),
>  	},
>  };

[ ... ]
-- 
pw-bot: cr

next prev parent reply	other threads:[~2026-05-14  1:52 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-09 19:09 [net-next v8 0/3] Add RSS and LRO support Frank Wunderlich
2026-05-09 19:09 ` [net-next v8 1/3] net: ethernet: mtk_eth_soc: Add register definitions for RSS and LRO Frank Wunderlich
2026-05-09 19:09 ` [net-next v8 2/3] net: ethernet: mtk_eth_soc: Add RSS support Frank Wunderlich
2026-05-14  1:52   ` Jakub Kicinski [this message]
2026-05-14  1:56   ` Jakub Kicinski
2026-05-09 19:09 ` [net-next v8 3/3] net: ethernet: mtk_eth_soc: Add LRO support Frank Wunderlich
2026-05-14  1:52   ` Jakub Kicinski
2026-05-14  1:53   ` Jakub Kicinski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260514015221.521091-1-kuba@kernel.org \
    --to=kuba@kernel.org \
    --cc=andrew+netdev@lunn.ch \
    --cc=angelogioacchino.delregno@collabora.com \
    --cc=daniel@makrotopia.org \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mediatek@lists.infradead.org \
    --cc=linux@armlinux.org.uk \
    --cc=linux@fw-web.de \
    --cc=lorenzo@kernel.org \
    --cc=mason-cw.chang@mediatek.com \
    --cc=matthias.bgg@gmail.com \
    --cc=nbd@nbd.name \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox