From: "Théo Lebrun" <theo.lebrun@bootlin.com>
To: "Nicolai Buchwitz" <nb@tipi-net.de>,
	"Théo Lebrun" <theo.lebrun@bootlin.com>
Cc: "Nicolas Ferre" <nicolas.ferre@microchip.com>,
	"Claudiu Beznea" <claudiu.beznea@tuxon.dev>,
	"Andrew Lunn" <andrew+netdev@lunn.ch>,
	"David S. Miller" <davem@davemloft.net>,
	"Eric Dumazet" <edumazet@google.com>,
	"Jakub Kicinski" <kuba@kernel.org>,
	"Paolo Abeni" <pabeni@redhat.com>,
	"Richard Cochran" <richardcochran@gmail.com>,
	"Russell King" <linux@armlinux.org.uk>,
	"Paolo Valerio" <pvalerio@redhat.com>,
	"Conor Dooley" <conor@kernel.org>,
	"Vladimir Kondratiev" <vladimir.kondratiev@mobileye.com>,
	"Gregory CLEMENT" <gregory.clement@bootlin.com>,
	"Benoît Monin" <benoit.monin@bootlin.com>,
	"Tawfik Bayouk" <tawfik.bayouk@mobileye.com>,
	"Thomas Petazzoni" <thomas.petazzoni@bootlin.com>,
	"Maxime Chevallier" <maxime.chevallier@bootlin.com>,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH net-next 10/11] net: macb: use context swapping in .set_ringparam()
Date: Thu, 02 Apr 2026 18:31:44 +0200
Message-ID: <DHIT9TPJQJ46.21A89R5UAFXVH@bootlin.com>
In-Reply-To: <90f843aa3940bdbabadddce27314c1f1@tipi-net.de>

On Thu Apr 2, 2026 at 1:29 PM CEST, Nicolai Buchwitz wrote:
> On 1.4.2026 18:39, Théo Lebrun wrote:
>> ethtool_ops.set_ringparam() is implemented using the primitive close /
>> update ring size / reopen sequence. Under memory pressure this does not
>> fly: we free our buffers at close and cannot reallocate new ones at
>> open. Also, it triggers a slow PHY reinit.
>> 
>> Instead, exploit the new context mechanism and improve our sequence to:
>>  - allocate a new context (including buffers) first
>>  - if it fails, early return without any impact to the interface
>>  - stop interface
>>  - update global state (bp, netdev, etc)
>>  - pass buffer pointers to the hardware
>>  - start interface
>>  - free old context.
>> 
>> The HW disable sequence is inspired by macb_reset_hw() but avoids
>> (1) setting NCR bit CLRSTAT and (2) clearing register PBUFRXCUT.
>> 
>> The HW re-enable sequence is inspired by macb_mac_link_up(), skipping
>> over register writes which would be redundant (because values have not
>> changed).
>> 
>> The generic context swapping parts are isolated into helper functions
>> macb_context_swap_start|end(), reusable by other operations
>> (change_mtu, set_channels, etc).
>> 
>> Signed-off-by: Théo Lebrun <theo.lebrun@bootlin.com>
>> ---
>>  drivers/net/ethernet/cadence/macb_main.c | 89 +++++++++++++++++++++++++++++---
>>  1 file changed, 82 insertions(+), 7 deletions(-)
>> 
>> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
>> index 42b19b969f3e..543356554c11 100644
>> --- a/drivers/net/ethernet/cadence/macb_main.c
>> +++ b/drivers/net/ethernet/cadence/macb_main.c
>> @@ -2905,6 +2905,76 @@ static struct macb_context *macb_context_alloc(struct macb *bp,
>>  	return ctx;
>>  }
>> 
>> +static void macb_context_swap_start(struct macb *bp)
>> +{
>> +	struct macb_queue *queue;
>> +	unsigned int q;
>> +	u32 ctrl;
>> +
>> +	/* Disable software Tx, disable HW Tx/Rx and disable NAPI. */
>> +
>> +	netif_tx_disable(bp->netdev);
>> +
>> +	ctrl = macb_readl(bp, NCR);
>> +	macb_writel(bp, NCR, ctrl & ~(MACB_BIT(RE) | MACB_BIT(TE)));
>> +
>> +	macb_writel(bp, TSR, -1);
>> +	macb_writel(bp, RSR, -1);
>> +
>> +	for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, ++queue) {
>> +		queue_writel(queue, IDR, -1);
>> +		queue_readl(queue, ISR);
>> +		if (bp->caps & MACB_CAPS_ISR_CLEAR_ON_WRITE)
>> +			queue_writel(queue, ISR, -1);
>> +	}
>> +
>> +	for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, ++queue) {
>> +		napi_disable(&queue->napi_rx);
>> +		napi_disable(&queue->napi_tx);
>> +	}
>
> tx_error_task, hresp_err_bh_work, and tx_lpi_work all dereference
> bp->ctx and could race with the pointer swap in swap_end.
> macb_close() cancels at least tx_lpi_work here. Should these be
> flushed too?

This is a large topic! While trying to find a solution as part of this
series I am noticing many race conditions. With this context series we
worsen some (by introducing a potential NULL pointer dereference of
bp->ctx).

Let's start by identifying all schedulable contexts involved:
 - #1 any request from userspace, too many callbacks to list
 - #2 NAPI softirq or kthread context, macb_{rx,tx}_poll()
 - #3 bp->hresp_err_bh_work / macb_hresp_error_task()
 - #4 bp->tx_lpi_work / macb_tx_lpi_work_fn()
 - #5 queue->tx_error_task / macb_tx_error_task()
 - #6 IRQ context, macb_interrupt()

Some race conditions:

 - #1 macb_close() doesn't cancel & wait upon #3 hresp_err_bh_work.
   They could race, especially as #3 doesn't grab bp->lock. One race
   example: the #3 HRESP work starts the interface after it has been
   closed and its buffers freed. RBQP/TBQP are not reset, so MACB would
   corrupt memory on Rx and transmit the content of freed memory on Tx.

 - #1 macb_close() doesn't cancel & wait upon #5 tx_error_task. #5 does
   grab bp->lock but that doesn't make it much safer. One race example:
   same as above, restart of interface with ghost ring buffers.

 - #3 hresp_err_bh_work could collide with anything as it does no
   locking, especially #1 (xmit for example) or #2 (NAPI). It is less
   likely to collide with #6 IRQ because it starts by disabling
   interrupts, but the IRQ may have already triggered, with
   macb_interrupt() running in parallel with macb_hresp_error_task().

 - #5 queue->tx_error_task writes to Tx head/tail inside bp->lock.
   #1 macb_start_xmit() modifies those too, but inside
   queue->tx_ptr_lock. Oops. There probably are other places modifying
   head/tail or any other Tx queue value without queue->tx_ptr_lock.

 - #5 macb_tx_error_task() tries to gently disable Tx but if that
   times out then it uses the global switch (TE field in NCR
   register). That sounds racy with #2 NAPI, which doesn't grab
   bp->lock and would probably break if the interface is shut down
   under its feet.

I don't see much more. To fix all that, someone ought to exhaustively go
through all tasks (#1-6 above) & all shared data and reason one by one.
Who will be that someone? ;-) But that sounds pretty unrelated to the
series at hand, no?

I'd agree that taking bp->lock around the swap operation would improve
the series, and I'll add that in V2 for sure!
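
A minimal userspace sketch of what such locking could look like, with a
pthread mutex standing in for bp->lock. Every name here (ctx_model,
dev_model, ...) is hypothetical and none of it is driver code; it only
models the ordering: allocate first, return early on failure without
touching the device, swap the pointer under the lock, then free the old
context (buffers and struct).

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins for struct macb_context / struct macb. */
struct ctx_model {
	unsigned int ring_size;
	int *ring;		/* stands in for the DMA buffers */
};

struct dev_model {
	pthread_mutex_t lock;	/* plays the role of bp->lock */
	struct ctx_model *ctx;	/* plays the role of bp->ctx */
};

static struct ctx_model *ctx_model_alloc(unsigned int ring_size)
{
	struct ctx_model *ctx;

	if (!ring_size)
		return NULL;	/* models an allocation failure */

	ctx = malloc(sizeof(*ctx));
	if (!ctx)
		return NULL;

	ctx->ring = calloc(ring_size, sizeof(*ctx->ring));
	if (!ctx->ring) {
		free(ctx);
		return NULL;
	}
	ctx->ring_size = ring_size;
	return ctx;
}

static void ctx_model_free(struct ctx_model *ctx)
{
	if (!ctx)
		return;
	free(ctx->ring);	/* the buffers */
	free(ctx);		/* the struct itself */
}

/* Deferred work analogue: only touches dev->ctx while holding the lock. */
static unsigned int dev_model_work(struct dev_model *dev)
{
	unsigned int size;

	pthread_mutex_lock(&dev->lock);
	size = dev->ctx ? dev->ctx->ring_size : 0;
	pthread_mutex_unlock(&dev->lock);
	return size;
}

/* Returns 0 on success, -1 if allocation failed (device left untouched). */
static int dev_model_resize(struct dev_model *dev, unsigned int ring_size)
{
	struct ctx_model *new_ctx, *old_ctx;

	assert(dev);

	new_ctx = ctx_model_alloc(ring_size);
	if (!new_ctx)
		return -1;

	pthread_mutex_lock(&dev->lock);
	old_ctx = dev->ctx;
	dev->ctx = new_ctx;
	pthread_mutex_unlock(&dev->lock);

	/* Nobody can reach old_ctx through dev->ctx anymore. */
	ctx_model_free(old_ctx);
	return 0;
}
```

The lock only helps readers that take it too; a work item that
dereferences the context pointer without it would still have to be
cancelled or flushed before the swap.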

>
>> +}
>> +
>> +static void macb_context_swap_end(struct macb *bp,
>> +				  struct macb_context *new_ctx)
>> +{
>> +	struct macb_context *old_ctx;
>> +	struct macb_queue *queue;
>> +	unsigned int q;
>> +	u32 ctrl;
>> +
>> +	/* Swap contexts & give buffer pointers to HW. */
>> +
>> +	old_ctx = bp->ctx;
>> +	bp->ctx = new_ctx;
>> +	macb_init_buffers(bp);
>> +
>> +	/* Start NAPI, HW Tx/Rx and software Tx. */
>> +
>> +	for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, ++queue) {
>> +		napi_enable(&queue->napi_rx);
>> +		napi_enable(&queue->napi_tx);
>> +	}
>> +
>> +	if (!(bp->caps & MACB_CAPS_MACB_IS_EMAC)) {
>> +		for (q = 0, queue = bp->queues; q < bp->num_queues;
>> +		     ++q, ++queue) {
>> +			queue_writel(queue, IER,
>> +				     bp->rx_intr_mask |
>> +				     MACB_TX_INT_FLAGS |
>> +				     MACB_BIT(HRESP));
>> +		}
>> +	}
>> +
>> +	ctrl = macb_readl(bp, NCR);
>> +	macb_writel(bp, NCR, ctrl | MACB_BIT(RE) | MACB_BIT(TE));
>> +
>> +	netif_tx_start_all_queues(bp->netdev);
>> +
>> +	/* Free old context. */
>> +
>> +	macb_free_consistent(old_ctx);
>
> 1. kfree(old_ctx) is missing. The context struct itself leaks on
>     every swap.

Agreed.

> 2. macb_close() calls netdev_tx_reset_queue() for each queue.
>     Shouldn't the swap do the same? BQL accounting will be stale
>     after switching to a fresh context.

I explicitly left that out as I thought DQL would benefit from keeping
the past context of the traffic. But indeed, as we start afresh from a
new set of buffers, we should reset DQL. fbnic, pointed out as a good
example by Jakub recently, does that.
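
A heavily simplified userspace model of why the reset matters; it is a
sketch under assumptions, not the kernel's DQL implementation, and
bql_model and its helpers are hypothetical names. Bytes queued on the old
ring are never reported as completed once that ring is dropped, so
without a reset the in-flight count stays stuck and can keep the queue
throttled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model of per-queue byte accounting. */
struct bql_model {
	unsigned int queued;	/* bytes handed to the ring so far */
	unsigned int completed;	/* bytes reported as transmitted */
	unsigned int limit;	/* max bytes allowed in flight */
};

static unsigned int bql_inflight(const struct bql_model *q)
{
	assert(q->queued >= q->completed);
	return q->queued - q->completed;
}

/* The stack may hand over more data only while under the limit. */
static bool bql_can_send(const struct bql_model *q)
{
	return bql_inflight(q) < q->limit;
}

static void bql_sent(struct bql_model *q, unsigned int bytes)
{
	q->queued += bytes;
}

static void bql_completed(struct bql_model *q, unsigned int bytes)
{
	q->completed += bytes;
}

/* Analogue of resetting the accounting when starting on a fresh ring. */
static void bql_reset(struct bql_model *q)
{
	q->queued = 0;
	q->completed = 0;
}
```

With a limit of 3000 bytes and two 1500-byte frames still in flight when
the ring is swapped, bql_can_send() stays false until bql_reset() is
called.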

>
> 3. macb_configure_dma() is not called after the swap. For
>     set_ringparam this is probably fine since rx_buffer_size
>     does not change, but this becomes a problem in patch 11.

Indeed, I had missed that it takes bp->ctx->rx_buffer_size as a
parameter. Will fix.
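
A small userspace sketch of the dependency, with hypothetical names
(dma_ctx, dma_dev) and an assumed 64-byte granularity for the buffer size
field. It only illustrates that a value derived from the context must be
reprogrammed when the context changes; it is not the driver's
macb_configure_dma().

```c
#include <assert.h>

#define RX_BUF_UNIT 64u	/* assumed granularity of the buffer size field */

/* Hypothetical stand-in for the context holding the Rx buffer size. */
struct dma_ctx {
	unsigned int rx_buffer_size;	/* in bytes */
};

struct dma_dev {
	struct dma_ctx *ctx;
	unsigned int hw_rxbs;	/* models the buffer size field in hardware */
};

/* Derive the hardware field from the current context. */
static void dma_dev_configure(struct dma_dev *dev)
{
	assert(dev->ctx->rx_buffer_size % RX_BUF_UNIT == 0);
	dev->hw_rxbs = dev->ctx->rx_buffer_size / RX_BUF_UNIT;
}

/* Swap the context and reprogram everything that was derived from it. */
static void dma_dev_swap(struct dma_dev *dev, struct dma_ctx *new_ctx)
{
	dev->ctx = new_ctx;
	dma_dev_configure(dev);
}
```

Assigning dev->ctx alone would leave hw_rxbs describing the old buffer
size, which is harmless while the size is unchanged and wrong as soon as
it differs.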

Thanks,

--
Théo Lebrun, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

