All of lore.kernel.org
 help / color / mirror / Atom feed
From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
To: Gerhard Engleder <gerhard@engleder-embedded.com>
Cc: <netdev@vger.kernel.org>, <bpf@vger.kernel.org>,
	<davem@davemloft.net>, <kuba@kernel.org>, <edumazet@google.com>,
	<pabeni@redhat.com>, <bjorn@kernel.org>,
	<magnus.karlsson@intel.com>, <jonathan.lemon@gmail.com>
Subject: Re: [PATCH net-next 4/5] tsnep: Add XDP socket zero-copy RX support
Date: Fri, 7 Apr 2023 15:41:08 +0200	[thread overview]
Message-ID: <ZDAddB+bF7W2OhY1@boxer> (raw)
In-Reply-To: <72bb23b0-50cc-7333-56e7-a887223ac6e1@engleder-embedded.com>

On Wed, Apr 05, 2023 at 09:13:58PM +0200, Gerhard Engleder wrote:
> On 03.04.23 19:26, Maciej Fijalkowski wrote:
> > On Mon, Apr 03, 2023 at 07:19:15PM +0200, Maciej Fijalkowski wrote:
> > > On Sun, Apr 02, 2023 at 09:38:37PM +0200, Gerhard Engleder wrote:
> > > 
> > > Hey Gerhard,
> > > 
> > > > Add support for XSK zero-copy to RX path. The setup of the XSK pool can
> > > > be done at runtime. If the netdev is running, then the queue must be
> > > > disabled and enabled during reconfiguration. This can be done easily
> > > > with functions introduced in previous commits.
> > > > 
> > > > A more important property is that, if the netdev is running, then the
> > > > setup of the XSK pool shall not stop the netdev in case of errors. A
> > > > broken netdev after a failed XSK pool setup is bad behavior. Therefore,
> > > > the allocation and setup of resources during XSK pool setup is done only
> > > > before any queue is disabled. Additionally, freeing and later allocation
> > > > of resources is eliminated in some cases. Page pool entries are kept for
> > > > later use. Two memory models are registered in parallel. As a result,
> > > > the XSK pool setup cannot fail during queue reconfiguration.
> > > > 
> > > > In contrast to other drivers, XSK pool setup and XDP BPF program setup
> > > > are separate actions. XSK pool setup can be done without any XDP BPF
> > > > program. The XDP BPF program can be added, removed or changed without
> > > > any reconfiguration of the XSK pool.
> > > 
> > > I won't argue about your design, but I'd be glad if you would present any
> > > perf numbers (ZC vs copy mode) just to give us some overview how your
> > > implementation works out. Also, please consider using batching APIs and
> > > see if this gives you any boost (my assumption is that it would).
> > > 
> > > > 
> > > > Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
> > > > ---
> > > >   drivers/net/ethernet/engleder/tsnep.h      |   7 +
> > > >   drivers/net/ethernet/engleder/tsnep_main.c | 432 ++++++++++++++++++++-
> > > >   drivers/net/ethernet/engleder/tsnep_xdp.c  |  67 ++++
> > > >   3 files changed, 488 insertions(+), 18 deletions(-)
> > 
> > (...)
> > 
> > > > +
> > > >   static bool tsnep_xdp_run_prog(struct tsnep_rx *rx, struct bpf_prog *prog,
> > > >   			       struct xdp_buff *xdp, int *status,
> > > > -			       struct netdev_queue *tx_nq, struct tsnep_tx *tx)
> > > > +			       struct netdev_queue *tx_nq, struct tsnep_tx *tx,
> > > > +			       bool zc)
> > > >   {
> > > >   	unsigned int length;
> > > > -	unsigned int sync;
> > > >   	u32 act;
> > > >   	length = xdp->data_end - xdp->data_hard_start - XDP_PACKET_HEADROOM;
> > > >   	act = bpf_prog_run_xdp(prog, xdp);
> > > > -
> > > > -	/* Due xdp_adjust_tail: DMA sync for_device cover max len CPU touch */
> > > > -	sync = xdp->data_end - xdp->data_hard_start - XDP_PACKET_HEADROOM;
> > > > -	sync = max(sync, length);
> > > > -
> > > >   	switch (act) {
> > > >   	case XDP_PASS:
> > > >   		return false;
> > > > @@ -1027,8 +1149,21 @@ static bool tsnep_xdp_run_prog(struct tsnep_rx *rx, struct bpf_prog *prog,
> > > >   		trace_xdp_exception(rx->adapter->netdev, prog, act);
> > > >   		fallthrough;
> > > >   	case XDP_DROP:
> > > > -		page_pool_put_page(rx->page_pool, virt_to_head_page(xdp->data),
> > > > -				   sync, true);
> > > > +		if (zc) {
> > > > +			xsk_buff_free(xdp);
> > > > +		} else {
> > > > +			unsigned int sync;
> > > > +
> > > > +			/* Due xdp_adjust_tail: DMA sync for_device cover max
> > > > +			 * len CPU touch
> > > > +			 */
> > > > +			sync = xdp->data_end - xdp->data_hard_start -
> > > > +			       XDP_PACKET_HEADROOM;
> > > > +			sync = max(sync, length);
> > > > +			page_pool_put_page(rx->page_pool,
> > > > +					   virt_to_head_page(xdp->data), sync,
> > > > +					   true);
> > > > +		}
> > > >   		return true;
> > > >   	}
> > > >   }
> > > > @@ -1181,7 +1316,8 @@ static int tsnep_rx_poll(struct tsnep_rx *rx, struct napi_struct *napi,
> > > >   					 length, false);
> > > >   			consume = tsnep_xdp_run_prog(rx, prog, &xdp,
> > > > -						     &xdp_status, tx_nq, tx);
> > > > +						     &xdp_status, tx_nq, tx,
> > > > +						     false);
> > > >   			if (consume) {
> > > >   				rx->packets++;
> > > >   				rx->bytes += length;
> > > > @@ -1205,6 +1341,125 @@ static int tsnep_rx_poll(struct tsnep_rx *rx, struct napi_struct *napi,
> > > >   	return done;
> > > >   }
> > > > +static int tsnep_rx_poll_zc(struct tsnep_rx *rx, struct napi_struct *napi,
> > > > +			    int budget)
> > > > +{
> > > > +	struct tsnep_rx_entry *entry;
> > > > +	struct netdev_queue *tx_nq;
> > > > +	struct bpf_prog *prog;
> > > > +	struct tsnep_tx *tx;
> > > > +	int desc_available;
> > > > +	int xdp_status = 0;
> > > > +	struct page *page;
> > > > +	int done = 0;
> > > > +	int length;
> > > > +
> > > > +	desc_available = tsnep_rx_desc_available(rx);
> > > > +	prog = READ_ONCE(rx->adapter->xdp_prog);
> > > > +	if (prog) {
> > > > +		tx_nq = netdev_get_tx_queue(rx->adapter->netdev,
> > > > +					    rx->tx_queue_index);
> > > > +		tx = &rx->adapter->tx[rx->tx_queue_index];
> > > > +	}
> > > > +
> > > > +	while (likely(done < budget) && (rx->read != rx->write)) {
> > > > +		entry = &rx->entry[rx->read];
> > > > +		if ((__le32_to_cpu(entry->desc_wb->properties) &
> > > > +		     TSNEP_DESC_OWNER_COUNTER_MASK) !=
> > > > +		    (entry->properties & TSNEP_DESC_OWNER_COUNTER_MASK))
> > > > +			break;
> > > > +		done++;
> > > > +
> > > > +		if (desc_available >= TSNEP_RING_RX_REFILL) {
> > > > +			bool reuse = desc_available >= TSNEP_RING_RX_REUSE;
> > > > +
> > > > +			desc_available -= tsnep_rx_refill_zc(rx, desc_available,
> > > > +							     reuse);
> > > > +			if (!entry->xdp) {
> > > > +				/* buffer has been reused for refill to prevent
> > > > +				 * empty RX ring, thus buffer cannot be used for
> > > > +				 * RX processing
> > > > +				 */
> > > > +				rx->read = (rx->read + 1) % TSNEP_RING_SIZE;
> > > > +				desc_available++;
> > > > +
> > > > +				rx->dropped++;
> > > > +
> > > > +				continue;
> > > > +			}
> > > > +		}
> > > > +
> > > > +		/* descriptor properties shall be read first, because valid data
> > > > +		 * is signaled there
> > > > +		 */
> > > > +		dma_rmb();
> > > > +
> > > > +		prefetch(entry->xdp->data);
> > > > +		length = __le32_to_cpu(entry->desc_wb->properties) &
> > > > +			 TSNEP_DESC_LENGTH_MASK;
> > > > +		entry->xdp->data_end = entry->xdp->data + length;
> > > > +		xsk_buff_dma_sync_for_cpu(entry->xdp, rx->xsk_pool);
> > > > +
> > > > +		/* RX metadata with timestamps is in front of actual data,
> > > > +		 * subtract metadata size to get length of actual data and
> > > > +		 * consider metadata size as offset of actual data during RX
> > > > +		 * processing
> > > > +		 */
> > > > +		length -= TSNEP_RX_INLINE_METADATA_SIZE;
> > > > +
> > > > +		rx->read = (rx->read + 1) % TSNEP_RING_SIZE;
> > > > +		desc_available++;
> > > > +
> > > > +		if (prog) {
> > > > +			bool consume;
> > > > +
> > > > +			entry->xdp->data += TSNEP_RX_INLINE_METADATA_SIZE;
> > > > +			entry->xdp->data_meta += TSNEP_RX_INLINE_METADATA_SIZE;
> > > > +
> > > > +			consume = tsnep_xdp_run_prog(rx, prog, entry->xdp,
> > > > +						     &xdp_status, tx_nq, tx,
> > > > +						     true);
> > > 
> > > reason for separate xdp run prog routine for ZC was usually "likely-fying"
> > > XDP_REDIRECT action as this is the main action for AF_XDP which was giving
> > > us perf improvement. Please try this out on your side to see if this
> > > yields any positive value.
> > 
> > One more thing - you have to handle XDP_TX action in a ZC specific way.
> > Your current code will break if you enable xsk_pool and return XDP_TX from
> > XDP prog.
> 
> I took again a look to igc, but I didn't found any specifics for XDP_TX
> ZC. Only some buffer flipping, which I assume is needed for shared
> pages.
> For ice I see a call to xdp_convert_buff_to_frame() in ZC path, which
> has some XSK logic within. Is this the ZC specific way? igc calls
> xdp_convert_buff_to_frame() in both cases, so I'm not sure. But I will
> try the XDP_TX action. I did test only with xdpsock.

I think I will back off a bit with a statement that your XDP_TX is clearly
broken, here's why.

igc when converting xdp_buff to xdp_frame and xdp_buff's memory being
backed by xsk_buff_pool will grab the new page from kernel, copy the
contents of xdp_buff to it, recycle xdp_buff back to xsk_buff_pool and
return new page back to driver (i have just described what
xdp_convert_zc_to_xdp_frame() is doing). Thing is that it is expensive and
hurts perf and we stepped away from this on ice, this is a matter of
storing the xdp_buff onto adequate Tx buffer struct that you can access
while cleaning Tx descriptors so that you'll be able to xsk_buff_free()
it.

So saying 'your current code will break' might have been too much from my
side. Just make sure that your XDP_TX action on ZC works. In order to do
that, i was loading xdpsock with XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD flag
on a queue receiving frames and then running xdp_rxq_info with XDP_TX
action.


  reply	other threads:[~2023-04-07 13:44 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-04-02 19:38 [PATCH net-next 0/5] tsnep: XDP socket zero-copy support Gerhard Engleder
2023-04-02 19:38 ` [PATCH net-next 1/5] tsnep: Rework TX/RX queue initialization Gerhard Engleder
2023-04-02 19:38 ` [PATCH net-next 2/5] tsnep: Add functions for queue enable/disable Gerhard Engleder
2023-04-02 19:38 ` [PATCH net-next 3/5] tsnep: Move skb receive action to separate function Gerhard Engleder
2023-04-02 19:38 ` [PATCH net-next 4/5] tsnep: Add XDP socket zero-copy RX support Gerhard Engleder
2023-04-03 17:19   ` Maciej Fijalkowski
2023-04-03 17:26     ` Maciej Fijalkowski
2023-04-05 19:13       ` Gerhard Engleder
2023-04-07 13:41         ` Maciej Fijalkowski [this message]
2023-04-15 14:19           ` Gerhard Engleder
2023-04-05 18:49     ` Gerhard Engleder
2023-04-02 19:38 ` [PATCH net-next 5/5] tsnep: Add XDP socket zero-copy TX support Gerhard Engleder
2023-04-03 22:36   ` Maciej Fijalkowski
2023-04-05 19:15     ` Gerhard Engleder

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZDAddB+bF7W2OhY1@boxer \
    --to=maciej.fijalkowski@intel.com \
    --cc=bjorn@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=gerhard@engleder-embedded.com \
    --cc=jonathan.lemon@gmail.com \
    --cc=kuba@kernel.org \
    --cc=magnus.karlsson@intel.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.