From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Fri, 03 Apr 2026 11:03:48 +0200
Subject: Re: [PATCH net-next 10/11] net: macb: use context swapping in .set_ringparam()
From: Théo Lebrun
To: Théo Lebrun, Nicolai Buchwitz
Cc: Nicolas Ferre, Claudiu Beznea, Andrew Lunn, David S. Miller,
 Eric Dumazet, Jakub Kicinski, Paolo Abeni, Richard Cochran,
 Russell King, Paolo Valerio, Conor Dooley, Vladimir Kondratiev,
 Gregory CLEMENT, Benoît Monin, Tawfik Bayouk, Thomas Petazzoni,
 Maxime Chevallier
X-Mailing-List: netdev@vger.kernel.org
X-Mailer: aerc 0.21.0-0-g5549850facc2
References: <20260401-macb-context-v1-0-9590c5ab7272@bootlin.com>
 <20260401-macb-context-v1-10-9590c5ab7272@bootlin.com>
 <90f843aa3940bdbabadddce27314c1f1@tipi-net.de>

On Thu Apr 2, 2026 at 6:31 PM CEST, Théo Lebrun wrote:
> On Thu Apr 2, 2026 at 1:29 PM CEST, Nicolai Buchwitz wrote:
>> On 1.4.2026 18:39, Théo Lebrun wrote:
>>> ethtool_ops.set_ringparam() is implemented using the primitive close /
>>> update ring size / reopen sequence. Under memory pressure this does not
>>> fly: we free our buffers at close and cannot reallocate new ones at
>>> open. Also, it triggers a slow PHY reinit.
>>>
>>> Instead, exploit the new context mechanism and improve our sequence to:
>>> - allocate a new context (including buffers) first
>>> - if it fails, early return without any impact to the interface
>>> - stop interface
>>> - update global state (bp, netdev, etc)
>>> - pass buffer pointers to the hardware
>>> - start interface
>>> - free old context.
>>>
>>> The HW disable sequence is inspired by macb_reset_hw() but avoids
>>> (1) setting NCR bit CLRSTAT and (2) clearing register PBUFRXCUT.
>>>
>>> The HW re-enable sequence is inspired by macb_mac_link_up(), skipping
>>> over register writes which would be redundant (because values have not
>>> changed).
>>>
>>> The generic context swapping parts are isolated into helper functions
>>> macb_context_swap_start|end(), reusable by other operations
>>> (change_mtu, set_channels, etc).
>>>
>>> Signed-off-by: Théo Lebrun
>>> ---
>>>  drivers/net/ethernet/cadence/macb_main.c | 89 +++++++++++++++++++++++++++++---
>>>  1 file changed, 82 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
>>> index 42b19b969f3e..543356554c11 100644
>>> --- a/drivers/net/ethernet/cadence/macb_main.c
>>> +++ b/drivers/net/ethernet/cadence/macb_main.c
>>> @@ -2905,6 +2905,76 @@ static struct macb_context *macb_context_alloc(struct macb *bp,
>>>  	return ctx;
>>>  }
>>>
>>> +static void macb_context_swap_start(struct macb *bp)
>>> +{
>>> +	struct macb_queue *queue;
>>> +	unsigned int q;
>>> +	u32 ctrl;
>>> +
>>> +	/* Disable software Tx, disable HW Tx/Rx and disable NAPI. */
>>> +
>>> +	netif_tx_disable(bp->netdev);
>>> +
>>> +	ctrl = macb_readl(bp, NCR);
>>> +	macb_writel(bp, NCR, ctrl & ~(MACB_BIT(RE) | MACB_BIT(TE)));
>>> +
>>> +	macb_writel(bp, TSR, -1);
>>> +	macb_writel(bp, RSR, -1);
>>> +
>>> +	for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, ++queue) {
>>> +		queue_writel(queue, IDR, -1);
>>> +		queue_readl(queue, ISR);
>>> +		if (bp->caps & MACB_CAPS_ISR_CLEAR_ON_WRITE)
>>> +			queue_writel(queue, ISR, -1);
>>> +	}
>>> +
>>> +	for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, ++queue) {
>>> +		napi_disable(&queue->napi_rx);
>>> +		napi_disable(&queue->napi_tx);
>>> +	}
>>
>> tx_error_task, hresp_err_bh_work, and tx_lpi_work all dereference
>> bp->ctx and could race with the pointer swap in swap_end.
>> macb_close() cancels at least tx_lpi_work here. Should these be
>> flushed too?
>
> This is a large topic! While trying to find a solution as part of this
> series I am noticing many race conditions. With this context series we
> worsen some (by introducing a bp->ctx NULL pointer dereference).
>
> Let's start by identifying all schedulable contexts involved:
> - #1 any request from userspace, too many callbacks to list
> - #2 NAPI softirq or kthread context, macb_{rx,tx}_poll()
> - #3 bp->hresp_err_bh_work / macb_hresp_error_task()
> - #4 bp->tx_lpi_work / macb_tx_lpi_work_fn()
> - #5 queue->tx_error_task / macb_tx_error_task()
> - #6 IRQ context, macb_interrupt()
>
> Some race conditions:
>
> - #1 macb_close() doesn't cancel & wait upon #3 hresp_err_bh_work.
> They could race, especially as #3 doesn't grab bp->lock. One race
> example: #3 restarts the interface after HRESP, after it has been
> closed and its buffers freed. RBQP/TBQP are not reset, so MACB would
> cause memory corruption on Rx and transmit memory contents.
>
> - #1 macb_close() doesn't cancel & wait upon #5 tx_error_task. #5 does
> grab bp->lock but that doesn't make it much safer. One race example:
> same as above, restart of the interface with ghost ring buffers.
>
> - #3 hresp_err_bh_work could collide with anything as it does no
> locking, especially #1 (xmit for example) or #2 (NAPI). It is less
> likely to collide with #6 IRQ because it starts by disabling those,
> but the IRQ may have already triggered, with macb_interrupt()
> already running in parallel with macb_hresp_error_task().
>
> - #5 queue->tx_error_task writes to the Tx head/tail inside bp->lock.
> #1 macb_start_xmit() modifies those too, but inside
> queue->tx_ptr_lock. Oops. There are probably other places modifying
> head/tail or other Tx queue state without queue->tx_ptr_lock.
>
> - #5 macb_tx_error_task() tries to gently disable Tx, but if that
> times out it uses the global switch (the TE field in the NCR
> register). That sounds racy with #2 NAPI, which doesn't grab bp->lock
> and would probably break if the interface is shut down under its
> feet.
>
> I don't see much more. To fix all that, someone ought to exhaustively go
> through all tasks (#1-6 above) & all shared data and reason one by one.
> Who will be that someone? ;-) But that sounds pretty unrelated to the
> series at hand, no?
>
> I'd agree that some locking of bp->lock around the swap operation would
> improve the series, and I'll add that in V2 for sure!

After some sleep, I feel like my message was a bit rough. To clarify
what I plan for V2:

- grab bp->lock on swap to protect us against some of #1 userspace and
  all of #6 IRQ.
- disabling #2 NAPI on swap is already done.
- disable all three BH work items on swap.

That will not fix everything listed above. On top, we should:

- check/revise our locking strategy for almost all codepaths,
- check all BH work items are disabled and waited upon in the right
  codepaths,
- in many bp->lock critical sections, early exit if !bp->ctx.

>
>>
>>> +}
>>> +
>>> +static void macb_context_swap_end(struct macb *bp,
>>> +				 struct macb_context *new_ctx)
>>> +{
>>> +	struct macb_context *old_ctx;
>>> +	struct macb_queue *queue;
>>> +	unsigned int q;
>>> +	u32 ctrl;
>>> +
>>> +	/* Swap contexts & give buffer pointers to HW. */
>>> +
>>> +	old_ctx = bp->ctx;
>>> +	bp->ctx = new_ctx;
>>> +	macb_init_buffers(bp);
>>> +
>>> +	/* Start NAPI, HW Tx/Rx and software Tx. */
>>> +
>>> +	for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, ++queue) {
>>> +		napi_enable(&queue->napi_rx);
>>> +		napi_enable(&queue->napi_tx);
>>> +	}
>>> +
>>> +	if (!(bp->caps & MACB_CAPS_MACB_IS_EMAC)) {
>>> +		for (q = 0, queue = bp->queues; q < bp->num_queues;
>>> +		     ++q, ++queue) {
>>> +			queue_writel(queue, IER,
>>> +				     bp->rx_intr_mask |
>>> +				     MACB_TX_INT_FLAGS |
>>> +				     MACB_BIT(HRESP));
>>> +		}
>>> +	}
>>> +
>>> +	ctrl = macb_readl(bp, NCR);
>>> +	macb_writel(bp, NCR, ctrl | MACB_BIT(RE) | MACB_BIT(TE));
>>> +
>>> +	netif_tx_start_all_queues(bp->netdev);
>>> +
>>> +	/* Free old context. */
>>> +
>>> +	macb_free_consistent(old_ctx);
>>
>> 1. kfree(old_ctx) is missing. The context struct itself leaks on
>> every swap.
>
> Agreed.
>
>> 2.
>> macb_close() calls netdev_tx_reset_queue() for each queue.
>> Shouldn't the swap do the same? BQL accounting will be stale
>> after switching to a fresh context.
>
> I explicitly left that out as I thought DQL would benefit from keeping
> past context of the traffic. But indeed, as we start afresh from a new
> set of buffers, we should reset DQL. fbnic, pointed out as a good
> example by Jakub recently, does that.
>
>>
>> 3. macb_configure_dma() is not called after the swap. For
>> set_ringparam this is probably fine since rx_buffer_size
>> does not change, but this becomes a problem in patch 11.
>
> Indeed, I had missed that it takes bp->ctx->rx_buffer_size as a
> parameter. Will fix.

Thanks,

--
Théo Lebrun, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com