Re: [net-next V7 PATCH 3/5] bpf: cpumap xdp_buff to skb conversion and allocation

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Jesper Dangaard Brouer <brouer@redhat.com>
To: Edward Cree <ecree@solarflare.com>
Cc: <netdev@vger.kernel.org>, <jakub.kicinski@netronome.com>,
	"Michael S. Tsirkin" <mst@redhat.com>, <pavel.odintsov@gmail.com>,
	Jason Wang <jasowang@redhat.com>, <mchan@broadcom.com>,
	John Fastabend <john.fastabend@gmail.com>,
	<peter.waskiewicz.jr@intel.com>, <ast@fiberby.dk>,
	Daniel Borkmann <borkmann@iogearbox.net>,
	Alexei Starovoitov <alexei.starovoitov@gmail.com>,
	Andy Gospodarek <andy@greyhouse.net>,
	brouer@redhat.com
Subject: Re: [net-next V7 PATCH 3/5] bpf: cpumap xdp_buff to skb conversion and allocation
Date: Fri, 13 Oct 2017 11:13:47 +0200	[thread overview]
Message-ID: <20171013111347.1472fcfb@redhat.com> (raw)
In-Reply-To: <21efa4a5-1e2f-bbe9-c1d0-115843fb885b@solarflare.com>

On Thu, 12 Oct 2017 22:13:43 +0100
Edward Cree <ecree@solarflare.com> wrote:

> On 12/10/17 13:26, Jesper Dangaard Brouer wrote:
> > This patch makes cpumap functional, by adding SKB allocation and
> > invoking the network stack on the dequeuing CPU.
> >
> > For constructing the SKB on the remote CPU, the xdp_buff in converted
> > into a struct xdp_pkt, and it mapped into the top headroom of the
> > packet, to avoid allocating separate mem.  For now, struct xdp_pkt is
> > just a cpumap internal data structure, with info carried between
> > enqueue to dequeue.  
> 
> <snip>
> 
> > +struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu,
> > +				  struct xdp_pkt *xdp_pkt)
> > +{
> > +	unsigned int frame_size;
> > +	void *pkt_data_start;
> > +	struct sk_buff *skb;
> > +
> > +	/* build_skb need to place skb_shared_info after SKB end, and
> > +	 * also want to know the memory "truesize".  Thus, need to
> > +	 * know the memory frame size backing xdp_buff.
> > +	 *
> > +	 * XDP was designed to have PAGE_SIZE frames, but this
> > +	 * assumption is not longer true with ixgbe and i40e.  It
> > +	 * would be preferred to set frame_size to 2048 or 4096
> > +	 * depending on the driver.
> > +	 *   frame_size = 2048;
> > +	 *   frame_len  = frame_size - sizeof(*xdp_pkt);
> > +	 *
> > +	 * Instead, with info avail, skb_shared_info in placed after
> > +	 * packet len.  This, unfortunately fakes the truesize.
> > +	 * Another disadvantage of this approach, the skb_shared_info
> > +	 * is not at a fixed memory location, with mixed length
> > +	 * packets, which is bad for cache-line hotness.
> > +	 */
> > +	frame_size = SKB_DATA_ALIGN(xdp_pkt->len) + xdp_pkt->headroom +
> > +		SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > +
> > +	pkt_data_start = xdp_pkt->data - xdp_pkt->headroom;
> > +	skb = build_skb(pkt_data_start, frame_size);
> > +	if (!skb)
> > +		return NULL;
> > +
> > +	skb_reserve(skb, xdp_pkt->headroom);
> > +	__skb_put(skb, xdp_pkt->len);
> > +	if (xdp_pkt->metasize)
> > +		skb_metadata_set(skb, xdp_pkt->metasize);
> > +
> > +	/* Essential SKB info: protocol and skb->dev */
> > +	skb->protocol = eth_type_trans(skb, xdp_pkt->dev_rx);
> > +
> > +	/* Optional SKB info, currently missing:
> > +	 * - HW checksum info		(skb->ip_summed)
> > +	 * - HW RX hash			(skb_set_hash)
> > +	 * - RX ring dev queue index	(skb_record_rx_queue)
> > +	 */  
> One possibility for dealing with these and related issues — also things
>  like the proper way to free an xdp_buff if SKB creation fails, which
>  might not be page_frag_free() for some drivers with unusual recycle ring
>  implementations — is to have a new ndo for 'receiving' an xdp_pkt from a
>  cpumap redirect.
> Since you're always receiving from the same driver that enqueued it, even
>  the structure of the metadata stored in the top of the packet page
>  doesn't have to be standardised; instead, each driver can put there just
>  whatever happens to be needed for its ndo_xdp_rx routine.  (Though there
>  would probably be standard enqueue and dequeue functions that the
>  'common-case' drivers could use.)
> In some cases, the driver could even just leave in the page the packet
>  prefix it got from the NIC, rather than reading it and then writing an
>  interpreted version back, thus minimising the number of packet-page
>  cachelines the 'bottom half' RX function has to touch (it would still
>  need to write in anything it got from the RX event, of course).
> It shouldn't be much work as many driver RX routines are already
>  structured this way — sfc, for instance, has a split into efx_rx_packet()
>  and __efx_rx_packet(), as a software pipeline for prefetching.

This is all talking about future work.  I'll be happy to discuss this
outside/after this patchset.  I see nothing blocking these ideas.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

next prev parent reply	other threads:[~2017-10-13  9:13 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-10-12 12:26 [net-next V7 PATCH 0/5] New bpf cpumap type for XDP_REDIRECT Jesper Dangaard Brouer
2017-10-12 12:26 ` [net-next V7 PATCH 1/5] bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP Jesper Dangaard Brouer
2017-10-12 20:35   ` Edward Cree
2017-10-13  8:17     ` Jesper Dangaard Brouer
2017-10-14  0:36   ` kbuild test robot
2017-10-12 12:26 ` [net-next V7 PATCH 2/5] bpf: XDP_REDIRECT enable use of cpumap Jesper Dangaard Brouer
2017-10-12 12:26 ` [net-next V7 PATCH 3/5] bpf: cpumap xdp_buff to skb conversion and allocation Jesper Dangaard Brouer
2017-10-12 21:13   ` Edward Cree
2017-10-13  9:13     ` Jesper Dangaard Brouer [this message]
2017-10-12 12:27 ` [net-next V7 PATCH 4/5] bpf: cpumap add tracepoints Jesper Dangaard Brouer
2017-10-14 16:54   ` David Miller
2017-10-12 12:27 ` [net-next V7 PATCH 5/5] samples/bpf: add cpumap sample program xdp_redirect_cpu Jesper Dangaard Brouer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20171013111347.1472fcfb@redhat.com \
    --to=brouer@redhat.com \
    --cc=alexei.starovoitov@gmail.com \
    --cc=andy@greyhouse.net \
    --cc=ast@fiberby.dk \
    --cc=borkmann@iogearbox.net \
    --cc=ecree@solarflare.com \
    --cc=jakub.kicinski@netronome.com \
    --cc=jasowang@redhat.com \
    --cc=john.fastabend@gmail.com \
    --cc=mchan@broadcom.com \
    --cc=mst@redhat.com \
    --cc=netdev@vger.kernel.org \
    --cc=pavel.odintsov@gmail.com \
    --cc=peter.waskiewicz.jr@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.