From: Robert Hancock <robert.hancock@calian.com>
To: "kuba@kernel.org" <kuba@kernel.org>
Cc: "pabeni@redhat.com" <pabeni@redhat.com>,
"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
"davem@davemloft.net" <davem@davemloft.net>,
"michal.simek@xilinx.com" <michal.simek@xilinx.com>,
"radhey.shyam.pandey@xilinx.com" <radhey.shyam.pandey@xilinx.com>,
"edumazet@google.com" <edumazet@google.com>,
"linux-arm-kernel@lists.infradead.org"
<linux-arm-kernel@lists.infradead.org>
Subject: Re: [PATCH net-next v5] net: axienet: Use NAPI for TX completion path
Date: Wed, 11 May 2022 17:16:55 +0000
Message-ID: <f3868a5f9abe263f4ebebd21382cd022afa6a029.camel@calian.com>
In-Reply-To: <20220510185639.1c6d6c8a@kernel.org>

On Tue, 2022-05-10 at 18:56 -0700, Jakub Kicinski wrote:
> On Mon, 9 May 2022 11:30:39 -0600 Robert Hancock wrote:
> > This driver was using the TX IRQ handler to perform all TX completion
> > tasks. Under heavy TX network load, this can cause significant irqs-off
> > latencies (found to be in the hundreds of microseconds using ftrace).
> > This can cause other issues, such as overrunning serial UART FIFOs when
> > using high baud rates with limited UART FIFO sizes.
> >
> > Switch to using a NAPI poll handler to perform the TX completion work
> > to get this out of hard IRQ context and avoid the IRQ latency impact.
> > A separate poll handler is used for TX and RX since they have separate
> > IRQs on this controller, so that the completion work for each of them
> > stays on the same CPU as the interrupt.
> >
> > Testing on a Xilinx MPSoC ZU9EG platform using iperf3 from a Linux PC
> > through a switch at 1G link speed showed no significant change in TX or
> > RX throughput, with approximately 941 Mbps before and after. Hard IRQ
> > time in the TX throughput test was significantly reduced from 12% to
> > below 1% on the CPU handling TX interrupts, with total hard+soft IRQ CPU
> > usage dropping from about 56% down to 48%.
> >
> > Signed-off-by: Robert Hancock <robert.hancock@calian.com>
> > ---
> >
> > Changed since v4: Added locking to protect TX ring tail pointer against
> > concurrent access by TX transmit and TX poll paths.
>
> Hi, sorry for a late reply there's just too many patches to look at
> lately.
>
> The lock is slightly concerning, the driver follows the usual wake up
> scheme based on memory barriers. If we add the lock we should probably
> take the barriers out.

So there are basically two places where there is contention: axienet_start_xmit,
where it advances the tail pointer after adding more entries to the TX ring,
and the TX poll function calling axienet_check_tx_bd_space, where it uses the
tail pointer to see whether there is enough space in the TX ring to wake the
queue. I suppose barriers are likely sufficient if the code updating the ring
pointer is more careful about how it does so - for example, in the snippet
quoted below it advances the pointer and then moves it back to 0 if it has gone
past the end of the ring. That would need to change so the pointer is updated
only once, without an intermediate state where it holds an invalid position.
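
A minimal sketch of that single-update approach, assuming the existing
lp->tx_bd_tail / lp->tx_bd_num fields and using a hypothetical helper name
(this is not code from the driver):

/*
 * Hypothetical helper: compute the next tail index locally and publish it
 * with a single WRITE_ONCE(), so a concurrent reader in the TX poll path
 * only ever sees either the old or the new in-range value.
 */
static void axienet_advance_tx_tail(struct axienet_local *lp)
{
	u32 next = lp->tx_bd_tail + 1;

	if (next >= lp->tx_bd_num)
		next = 0;

	WRITE_ONCE(lp->tx_bd_tail, next);
}

The reader side (axienet_check_tx_bd_space) would then pair this with a
READ_ONCE() of tx_bd_tail.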

I think the stability issue I saw earlier was not actually due to these
changes, however, but to similar changes in v1 of the "net: macb: use NAPI for
TX completion path" patch. That driver previously relied on the TX completion
path being protected by a spinlock taken in the IRQ handler, and that
protection was lost when the TX completion work was moved to a poll function.
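
Restoring that protection would basically mean the NAPI poll function taking
the same lock the IRQ handler used to hold. A rough sketch of the pattern, with
placeholder names rather than the actual macb structures:

/* Placeholder queue state - field names are illustrative only. */
struct example_queue {
	struct napi_struct tx_napi;
	spinlock_t tx_lock;
	/* ... descriptor ring state ... */
};

static int example_tx_complete(struct example_queue *q, int budget);

/*
 * TX reclaim used to run inside the IRQ handler under tx_lock; once it
 * moves to NAPI poll, the poll function has to take that lock itself so
 * it still serializes against the transmit path.
 */
static int example_tx_poll(struct napi_struct *napi, int budget)
{
	struct example_queue *q = container_of(napi, struct example_queue, tx_napi);
	unsigned long flags;
	int cleaned;

	spin_lock_irqsave(&q->tx_lock, flags);
	cleaned = example_tx_complete(q, budget);
	spin_unlock_irqrestore(&q->tx_lock, flags);

	if (cleaned < budget) {
		napi_complete_done(napi, cleaned);
		/* re-enable the TX completion interrupt here */
	}

	return cleaned;
}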
>
> We can also try to avoid the lock and drill into what the issue is.
> At a quick look it seems like there is a barrier missing between setup
> of the descriptors and kicking the transfer off:
>
> diff --git a/drivers/net/ethernet/xilinx/xilinx_axienet_main.c
> b/drivers/net/ethernet/xilinx/xilinx_axienet_main.c
> index d6fc3f7acdf0..9e244b73a0ca 100644
> --- a/drivers/net/ethernet/xilinx/xilinx_axienet_main.c
> +++ b/drivers/net/ethernet/xilinx/xilinx_axienet_main.c
> @@ -878,10 +878,11 @@ axienet_start_xmit(struct sk_buff *skb, struct
> net_device *ndev)
> cur_p->skb = skb;
>
> tail_p = lp->tx_bd_p + sizeof(*lp->tx_bd_v) * lp->tx_bd_tail;
> - /* Start the transfer */
> - axienet_dma_out_addr(lp, XAXIDMA_TX_TDESC_OFFSET, tail_p);
> if (++lp->tx_bd_tail >= lp->tx_bd_num)
> lp->tx_bd_tail = 0;
> + wmb(); // possibly dma_wmb()
I think the MMIO write in axienet_dma_out_addr is meant to act as an implicit
write barrier, so that explicit wmb() shouldn't be needed? (See the sketch
after the quoted diff below.)
> + /* Start the transfer */
> + axienet_dma_out_addr(lp, XAXIDMA_TX_TDESC_OFFSET, tail_p);
>
> /* Stop queue if next transmit may not have space */
> if (axienet_check_tx_bd_space(lp, MAX_SKB_FRAGS + 1)) {
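
For reference, a sketch of the ordering my comment above relies on, assuming
axienet_dma_out_addr() ends up in a non-relaxed writel()/iowrite32()-style
accessor (names and layout follow the quoted diff; this is not a proposed
patch):

	cur_p->skb = skb;	/* descriptor setup: normal memory writes */

	tail_p = lp->tx_bd_p + sizeof(*lp->tx_bd_v) * lp->tx_bd_tail;
	if (++lp->tx_bd_tail >= lp->tx_bd_num)
		lp->tx_bd_tail = 0;

	/*
	 * Non-relaxed MMIO accessors such as writel()/iowrite32() order prior
	 * normal-memory stores before the device write, so an explicit
	 * wmb()/dma_wmb() before this doorbell would be redundant; a
	 * _relaxed() accessor would not give that guarantee.
	 */
	axienet_dma_out_addr(lp, XAXIDMA_TX_TDESC_OFFSET, tail_p);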
--
Robert Hancock
Senior Hardware Designer, Calian Advanced Technologies
www.calian.com