Re: [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* Re: [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter
       [not found] ` <20160408123614.2a15a346@redhat.com>
@ 2016-04-08 12:33   ` Jesper Dangaard Brouer
  2016-04-08 17:02     ` Brenden Blanco
  2016-04-08 17:26     ` Alexei Starovoitov
  0 siblings, 2 replies; 12+ messages in thread
From: Jesper Dangaard Brouer @ 2016-04-08 12:33 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: davem, netdev, tom, alexei.starovoitov, ogerlitz, daniel,
	eric.dumazet, ecree, john.fastabend, tgraf, johannes,
	eranlinuxmellanox, lorenzo, brouer, linux-mm

On Fri, 8 Apr 2016 12:36:14 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> > +/* user return codes for PHYS_DEV prog type */
> > +enum bpf_phys_dev_action {
> > +	BPF_PHYS_DEV_DROP,
> > +	BPF_PHYS_DEV_OK,
> > +};  
> 
> I can imagine these extra return codes:
> 
>  BPF_PHYS_DEV_MODIFIED,   /* Packet page/payload modified */
>  BPF_PHYS_DEV_STOLEN,     /* E.g. forward use-case */
>  BPF_PHYS_DEV_SHARED,     /* Queue for async processing, e.g. tcpdump use-case */
> 
> The "STOLEN" and "SHARED" use-cases require some refcnt manipulations,
> which we can look at when we get that far...

I want to point out something which is quite FUNDAMENTAL, for
understanding these return codes (and network stack).

At driver RX time, the network stack basically have two ways of
building an SKB, which is send up the stack.

Option-A (fastest): The packet page is writable. The SKB can be
allocated and skb->data/head can point directly to the page.  And
we place/write skb_shared_info in the end/tail-room. (This is done by
calling build_skb()).

Option-B (slower): The packet page is read-only.  The SKB cannot point
skb->data/head directly to the page, because skb_shared_info need to be
written into skb->end (slightly hidden via skb_shinfo() casting).  To
get around this, a separate piece of memory is allocated (speedup by
__alloc_page_frag) for pointing skb->data/head, so skb_shared_info can
be written. (This is done when calling netdev/napi_alloc_skb()).
  Drivers then need to copy over packet headers, and assign + adjust
skb_shinfo(skb)->frags[0] offset to skip copied headers.

Unfortunately most drivers use option-B.  Due to cost of calling the
page allocator.  It is only slightly most expensive to get a larger
compound page from the page allocator, which then can be partitioned into
page-fragments, thus amortizing the page alloc cost.  Unfortunately the
cost is added later, when constructing the SKB.
 Another reason for option-B, is that archs with expensive IOMMU
requirements (like PowerPC), don't need to dma_unmap on every packet,
but only on the compound page level.

Side-note: Most drivers have a "copy-break" optimization.  Especially
for option-B, when copying header data anyhow. For small packet, one
might as well free (or recycle) the RX page, if header size fits into
the newly allocated memory (for skb_shared_info).

For the early filter drop (DDoS use-case), it does not matter that the
packet-page is read-only.

BUT for the future XDP (eXpress Data Path) use-case it does matter.  If
we ever want to see speeds comparable to DPDK, then drivers to
need to implement option-A, as this allow forwarding at the packet-page
level.

I hope, my future page-pool facility can remove/hide the cost calling
the page allocator.

Back to the return codes, thus:
-------------------------------
BPF_PHYS_DEV_SHARED requires driver use option-B, when constructing
the SKB, and treat packet data as read-only.

BPF_PHYS_DEV_MODIFIED requires driver to provide a writable packet-page.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter
  2016-04-08 12:33   ` [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter Jesper Dangaard Brouer
@ 2016-04-08 17:02     ` Brenden Blanco
  2016-04-08 17:26     ` Alexei Starovoitov
  1 sibling, 0 replies; 12+ messages in thread
From: Brenden Blanco @ 2016-04-08 17:02 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: davem, netdev, tom, alexei.starovoitov, ogerlitz, daniel,
	eric.dumazet, ecree, john.fastabend, tgraf, johannes,
	eranlinuxmellanox, lorenzo, linux-mm

On Fri, Apr 08, 2016 at 02:33:40PM +0200, Jesper Dangaard Brouer wrote:
> 
> On Fri, 8 Apr 2016 12:36:14 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> 
> > > +/* user return codes for PHYS_DEV prog type */
> > > +enum bpf_phys_dev_action {
> > > +	BPF_PHYS_DEV_DROP,
> > > +	BPF_PHYS_DEV_OK,
> > > +};  
> > 
> > I can imagine these extra return codes:
> > 
> >  BPF_PHYS_DEV_MODIFIED,   /* Packet page/payload modified */
> >  BPF_PHYS_DEV_STOLEN,     /* E.g. forward use-case */
> >  BPF_PHYS_DEV_SHARED,     /* Queue for async processing, e.g. tcpdump use-case */
> > 
> > The "STOLEN" and "SHARED" use-cases require some refcnt manipulations,
> > which we can look at when we get that far...
> 
> I want to point out something which is quite FUNDAMENTAL, for
> understanding these return codes (and network stack).
> 
> 
> At driver RX time, the network stack basically have two ways of
> building an SKB, which is send up the stack.
> 
> Option-A (fastest): The packet page is writable. The SKB can be
> allocated and skb->data/head can point directly to the page.  And
> we place/write skb_shared_info in the end/tail-room. (This is done by
> calling build_skb()).
> 
> Option-B (slower): The packet page is read-only.  The SKB cannot point
> skb->data/head directly to the page, because skb_shared_info need to be
> written into skb->end (slightly hidden via skb_shinfo() casting).  To
> get around this, a separate piece of memory is allocated (speedup by
> __alloc_page_frag) for pointing skb->data/head, so skb_shared_info can
> be written. (This is done when calling netdev/napi_alloc_skb()).
>   Drivers then need to copy over packet headers, and assign + adjust
> skb_shinfo(skb)->frags[0] offset to skip copied headers.
> 
> 
> Unfortunately most drivers use option-B.  Due to cost of calling the
> page allocator.  It is only slightly most expensive to get a larger
> compound page from the page allocator, which then can be partitioned into
> page-fragments, thus amortizing the page alloc cost.  Unfortunately the
> cost is added later, when constructing the SKB.
>  Another reason for option-B, is that archs with expensive IOMMU
> requirements (like PowerPC), don't need to dma_unmap on every packet,
> but only on the compound page level.
> 
> Side-note: Most drivers have a "copy-break" optimization.  Especially
> for option-B, when copying header data anyhow. For small packet, one
> might as well free (or recycle) the RX page, if header size fits into
> the newly allocated memory (for skb_shared_info).
> 
> 
> For the early filter drop (DDoS use-case), it does not matter that the
> packet-page is read-only.
> 
> BUT for the future XDP (eXpress Data Path) use-case it does matter.  If
> we ever want to see speeds comparable to DPDK, then drivers to
> need to implement option-A, as this allow forwarding at the packet-page
> level.
> 
> I hope, my future page-pool facility can remove/hide the cost calling
> the page allocator.
> 
Can't wait! This will open up a lot of doors.
> 
> Back to the return codes, thus:
> -------------------------------
> BPF_PHYS_DEV_SHARED requires driver use option-B, when constructing
> the SKB, and treat packet data as read-only.
> 
> BPF_PHYS_DEV_MODIFIED requires driver to provide a writable packet-page.
I understand the driver/hw requirement, but the codes themselves I think
need some tweaking. For instance, if the packet is both modified and
forwarded, should the flags be ORed together? Or is the need for this
return code made obsolete if the driver knows ahead of time via struct
bpf_prog flags that the prog intends to modify the packet, and can set
up the page accordingly?
> 
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter
  2016-04-08 12:33   ` [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter Jesper Dangaard Brouer
  2016-04-08 17:02     ` Brenden Blanco
@ 2016-04-08 17:26     ` Alexei Starovoitov
  2016-04-08 20:08       ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 12+ messages in thread
From: Alexei Starovoitov @ 2016-04-08 17:26 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Brenden Blanco, davem, netdev, tom, ogerlitz, daniel,
	eric.dumazet, ecree, john.fastabend, tgraf, johannes,
	eranlinuxmellanox, lorenzo, linux-mm

On Fri, Apr 08, 2016 at 02:33:40PM +0200, Jesper Dangaard Brouer wrote:
> 
> On Fri, 8 Apr 2016 12:36:14 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> 
> > > +/* user return codes for PHYS_DEV prog type */
> > > +enum bpf_phys_dev_action {
> > > +	BPF_PHYS_DEV_DROP,
> > > +	BPF_PHYS_DEV_OK,
> > > +};  
> > 
> > I can imagine these extra return codes:
> > 
> >  BPF_PHYS_DEV_MODIFIED,   /* Packet page/payload modified */
> >  BPF_PHYS_DEV_STOLEN,     /* E.g. forward use-case */
> >  BPF_PHYS_DEV_SHARED,     /* Queue for async processing, e.g. tcpdump use-case */
> > 
> > The "STOLEN" and "SHARED" use-cases require some refcnt manipulations,
> > which we can look at when we get that far...
> 
> I want to point out something which is quite FUNDAMENTAL, for
> understanding these return codes (and network stack).
> 
> 
> At driver RX time, the network stack basically have two ways of
> building an SKB, which is send up the stack.
> 
> Option-A (fastest): The packet page is writable. The SKB can be
> allocated and skb->data/head can point directly to the page.  And
> we place/write skb_shared_info in the end/tail-room. (This is done by
> calling build_skb()).
> 
> Option-B (slower): The packet page is read-only.  The SKB cannot point
> skb->data/head directly to the page, because skb_shared_info need to be
> written into skb->end (slightly hidden via skb_shinfo() casting).  To
> get around this, a separate piece of memory is allocated (speedup by
> __alloc_page_frag) for pointing skb->data/head, so skb_shared_info can
> be written. (This is done when calling netdev/napi_alloc_skb()).
>   Drivers then need to copy over packet headers, and assign + adjust
> skb_shinfo(skb)->frags[0] offset to skip copied headers.
> 
> 
> Unfortunately most drivers use option-B.  Due to cost of calling the
> page allocator.  It is only slightly most expensive to get a larger
> compound page from the page allocator, which then can be partitioned into
> page-fragments, thus amortizing the page alloc cost.  Unfortunately the
> cost is added later, when constructing the SKB.
>  Another reason for option-B, is that archs with expensive IOMMU
> requirements (like PowerPC), don't need to dma_unmap on every packet,
> but only on the compound page level.
> 
> Side-note: Most drivers have a "copy-break" optimization.  Especially
> for option-B, when copying header data anyhow. For small packet, one
> might as well free (or recycle) the RX page, if header size fits into
> the newly allocated memory (for skb_shared_info).

I think you guys are going into overdesign territory, so
. nack on read-only pages
. nack on copy-break approach
. nack on per-ring programs
. nack on modified/stolen/shared return codes

The whole thing must be dead simple to use. Above is not simple by any means.
The programs must see writeable pages only and return codes:
drop, pass to stack, redirect to xmit.
If program wishes to modify packets before passing it to stack, it
shouldn't need to deal with different return values.
No special things to deal with small or large packets. No header splits.
Program must not be aware of any such things.
Drivers can use DMA_BIDIRECTIONAL to allow received page to be
modified by the program and immediately sent to xmit. 
No dma map/unmap/sync per packet. If some odd architectures/dma setups
cannot do it, then XDP will not be applicable there.
We are not going to sacrifice performance for generality.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter
  2016-04-08 17:26     ` Alexei Starovoitov
@ 2016-04-08 20:08       ` Jesper Dangaard Brouer
  2016-04-08 21:34         ` Alexei Starovoitov
  0 siblings, 1 reply; 12+ messages in thread
From: Jesper Dangaard Brouer @ 2016-04-08 20:08 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Brenden Blanco, davem, netdev, tom, ogerlitz, daniel,
	eric.dumazet, ecree, john.fastabend, tgraf, johannes,
	eranlinuxmellanox, lorenzo, linux-mm, brouer

On Fri, 8 Apr 2016 10:26:53 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Fri, Apr 08, 2016 at 02:33:40PM +0200, Jesper Dangaard Brouer wrote:
> > 
> > On Fri, 8 Apr 2016 12:36:14 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> >   
> > > > +/* user return codes for PHYS_DEV prog type */
> > > > +enum bpf_phys_dev_action {
> > > > +	BPF_PHYS_DEV_DROP,
> > > > +	BPF_PHYS_DEV_OK,
> > > > +};    
> > > 
> > > I can imagine these extra return codes:
> > > 
> > >  BPF_PHYS_DEV_MODIFIED,   /* Packet page/payload modified */
> > >  BPF_PHYS_DEV_STOLEN,     /* E.g. forward use-case */
> > >  BPF_PHYS_DEV_SHARED,     /* Queue for async processing, e.g. tcpdump use-case */
> > > 
> > > The "STOLEN" and "SHARED" use-cases require some refcnt manipulations,
> > > which we can look at when we get that far...  
> > 
> > I want to point out something which is quite FUNDAMENTAL, for
> > understanding these return codes (and network stack).
> > 
> > 
> > At driver RX time, the network stack basically have two ways of
> > building an SKB, which is send up the stack.
> > 
> > Option-A (fastest): The packet page is writable. The SKB can be
> > allocated and skb->data/head can point directly to the page.  And
> > we place/write skb_shared_info in the end/tail-room. (This is done by
> > calling build_skb()).
> > 
> > Option-B (slower): The packet page is read-only.  The SKB cannot point
> > skb->data/head directly to the page, because skb_shared_info need to be
> > written into skb->end (slightly hidden via skb_shinfo() casting).  To
> > get around this, a separate piece of memory is allocated (speedup by
> > __alloc_page_frag) for pointing skb->data/head, so skb_shared_info can
> > be written. (This is done when calling netdev/napi_alloc_skb()).
> >   Drivers then need to copy over packet headers, and assign + adjust
> > skb_shinfo(skb)->frags[0] offset to skip copied headers.
> > 
> > 
> > Unfortunately most drivers use option-B.  Due to cost of calling the
> > page allocator.  It is only slightly most expensive to get a larger
> > compound page from the page allocator, which then can be partitioned into
> > page-fragments, thus amortizing the page alloc cost.  Unfortunately the
> > cost is added later, when constructing the SKB.
> >  Another reason for option-B, is that archs with expensive IOMMU
> > requirements (like PowerPC), don't need to dma_unmap on every packet,
> > but only on the compound page level.
> > 
> > Side-note: Most drivers have a "copy-break" optimization.  Especially
> > for option-B, when copying header data anyhow. For small packet, one
> > might as well free (or recycle) the RX page, if header size fits into
> > the newly allocated memory (for skb_shared_info).  
> 
> I think you guys are going into overdesign territory, so
> . nack on read-only pages

Unfortunately you cannot just ignore or nack read-only pages. They are
a fact in the current drivers.

Most drivers today (at-least the ones we care about) only deliver
read-only pages.  If you don't accept read-only pages day-1, then you
first have to rewrite a lot of drivers... and that will stall the
project!  How will you deal with this fact?

The early drop filter use-case in this patchset, can ignore read-only
pages.  But ABI wise we need to deal with the future case where we do
need/require writeable pages.  A simple need-writable pages in the API
could help us move forward.


> . nack on copy-break approach

Copy-break can be ignored.  It sort of happens at a higher-level in the
driver. (Eric likely want/care this happens for local socket delivery).


> . nack on per-ring programs

Hmmm... I don't see it as a lot more complicated to attach the program
to the ring.  But maybe we can extend the API later, and thus postpone that
discussion.

> . nack on modified/stolen/shared return codes
> 
> The whole thing must be dead simple to use. Above is not simple by any means.

Maybe you missed that the above was a description of how the current
network stack handles this, which is not simple... which is root of the
hole performance issue.


> The programs must see writeable pages only and return codes:
> drop, pass to stack, redirect to xmit.
> If program wishes to modify packets before passing it to stack, it
> shouldn't need to deal with different return values.

> No special things to deal with small or large packets. No header splits.
> Program must not be aware of any such things.

I agree on this.  This layer only deals with packets at the page level,
single packets stored in continuous memory.


> Drivers can use DMA_BIDIRECTIONAL to allow received page to be
> modified by the program and immediately sent to xmit.

We just have to verify that DMA_BIDIRECTIONAL does not add extra
overhead (which is explicitly stated that it likely does on the
DMA-API-HOWTO.txt, but I like to verify this with a micro benchmark)

> No dma map/unmap/sync per packet. If some odd architectures/dma setups
> cannot do it, then XDP will not be applicable there.

I do like the idea of rejecting XDP eBPF programs based on the DMA
setup is not compatible, or if the driver does not implement e.g.
writable DMA pages.

Customers wanting this feature will then go buy the NIC which support
this feature.  There is nothing more motivating for NIC vendors seeing
customers buying the competitors hardware. And it only require a driver
change to get this market...


> We are not going to sacrifice performance for generality.

Agree.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter
  2016-04-08 20:08       ` Jesper Dangaard Brouer
@ 2016-04-08 21:34         ` Alexei Starovoitov
  2016-04-09 11:29           ` Tom Herbert
  0 siblings, 1 reply; 12+ messages in thread
From: Alexei Starovoitov @ 2016-04-08 21:34 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Brenden Blanco, davem, netdev, tom, ogerlitz, daniel,
	eric.dumazet, ecree, john.fastabend, tgraf, johannes,
	eranlinuxmellanox, lorenzo, linux-mm

On Fri, Apr 08, 2016 at 10:08:08PM +0200, Jesper Dangaard Brouer wrote:
> On Fri, 8 Apr 2016 10:26:53 -0700
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> > On Fri, Apr 08, 2016 at 02:33:40PM +0200, Jesper Dangaard Brouer wrote:
> > > 
> > > On Fri, 8 Apr 2016 12:36:14 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> > >   
> > > > > +/* user return codes for PHYS_DEV prog type */
> > > > > +enum bpf_phys_dev_action {
> > > > > +	BPF_PHYS_DEV_DROP,
> > > > > +	BPF_PHYS_DEV_OK,
> > > > > +};    
> > > > 
> > > > I can imagine these extra return codes:
> > > > 
> > > >  BPF_PHYS_DEV_MODIFIED,   /* Packet page/payload modified */
> > > >  BPF_PHYS_DEV_STOLEN,     /* E.g. forward use-case */
> > > >  BPF_PHYS_DEV_SHARED,     /* Queue for async processing, e.g. tcpdump use-case */
> > > > 
> > > > The "STOLEN" and "SHARED" use-cases require some refcnt manipulations,
> > > > which we can look at when we get that far...  
> > > 
> > > I want to point out something which is quite FUNDAMENTAL, for
> > > understanding these return codes (and network stack).
> > > 
> > > 
> > > At driver RX time, the network stack basically have two ways of
> > > building an SKB, which is send up the stack.
> > > 
> > > Option-A (fastest): The packet page is writable. The SKB can be
> > > allocated and skb->data/head can point directly to the page.  And
> > > we place/write skb_shared_info in the end/tail-room. (This is done by
> > > calling build_skb()).
> > > 
> > > Option-B (slower): The packet page is read-only.  The SKB cannot point
> > > skb->data/head directly to the page, because skb_shared_info need to be
> > > written into skb->end (slightly hidden via skb_shinfo() casting).  To
> > > get around this, a separate piece of memory is allocated (speedup by
> > > __alloc_page_frag) for pointing skb->data/head, so skb_shared_info can
> > > be written. (This is done when calling netdev/napi_alloc_skb()).
> > >   Drivers then need to copy over packet headers, and assign + adjust
> > > skb_shinfo(skb)->frags[0] offset to skip copied headers.
> > > 
> > > 
> > > Unfortunately most drivers use option-B.  Due to cost of calling the
> > > page allocator.  It is only slightly most expensive to get a larger
> > > compound page from the page allocator, which then can be partitioned into
> > > page-fragments, thus amortizing the page alloc cost.  Unfortunately the
> > > cost is added later, when constructing the SKB.
> > >  Another reason for option-B, is that archs with expensive IOMMU
> > > requirements (like PowerPC), don't need to dma_unmap on every packet,
> > > but only on the compound page level.
> > > 
> > > Side-note: Most drivers have a "copy-break" optimization.  Especially
> > > for option-B, when copying header data anyhow. For small packet, one
> > > might as well free (or recycle) the RX page, if header size fits into
> > > the newly allocated memory (for skb_shared_info).  
> > 
> > I think you guys are going into overdesign territory, so
> > . nack on read-only pages
> 
> Unfortunately you cannot just ignore or nack read-only pages. They are
> a fact in the current drivers.
> 
> Most drivers today (at-least the ones we care about) only deliver
> read-only pages.  If you don't accept read-only pages day-1, then you
> first have to rewrite a lot of drivers... and that will stall the
> project!  How will you deal with this fact?
> 
> The early drop filter use-case in this patchset, can ignore read-only
> pages.  But ABI wise we need to deal with the future case where we do
> need/require writeable pages.  A simple need-writable pages in the API
> could help us move forward.

the program should never need to worry about whether dma buffer is
writeable or not. Complicating drivers, api, abi, usability
for the single use case of fast packet drop is not acceptable.
XDP is not going to be a fit for all drivers and all architectures.
That is cruicial 'performance vs generality' aspect of the design.
All kernel-bypasses are taking advantage of specific architecture.
We have to take advantage of it as well. If it doesn't fit
powerpc with iommu, so be it. XDP will return -enotsupp.
That is fundamental point. We have to cut such corners and avoid
all cases where unnecessary generality hurts performance.
Read-only pages is clearly such thing.

> > The whole thing must be dead simple to use. Above is not simple by any means.
> 
> Maybe you missed that the above was a description of how the current
> network stack handles this, which is not simple... which is root of the
> hole performance issue.

Disagree. The stack has copy-break, gro, gso and everything else because
it's serving _host_ use case. XDP is packet forwarder use case.
The requirements are completely different. Ex. the host needs gso
in the core and drivers. It needs to deliver data all the way
to user space and back. That is hard and that's where complexity
comes from. For packet forwarder none of it is needed. So saying,
look we have this complexity, so XDP needs it too, is flawed argument.
The kernel is serving host and applications.
XDP is pure packet-in/packet-out framework to achieve better
performance than kernel-bypass, since kernel is the right
place to do it. It has clean access to interrupts, per-cpu,
scheduler, device registers and so on.
Though there are only two broad use cases packet drop and forward,
they cover a ton of real cases: firewalls, dos prevention,
load balancer, nat, etc. In other words mostly stateless.
As soon as packet needs to be queued somewhere we have to
instantiate skb and pass it to the stack.
So no queues in XDP and no 'stolen' and 'shared' return codes.
The program always runs to completion with single packet.
There is no header vs payload split. There is no header
from program point of view. It's raw bytes in dma buffer.

> I do like the idea of rejecting XDP eBPF programs based on the DMA
> setup is not compatible, or if the driver does not implement e.g.
> writable DMA pages.

exactly.

> Customers wanting this feature will then go buy the NIC which support
> this feature.  There is nothing more motivating for NIC vendors seeing
> customers buying the competitors hardware. And it only require a driver
> change to get this market...

exactly.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter
  2016-04-08 21:34         ` Alexei Starovoitov
@ 2016-04-09 11:29           ` Tom Herbert
  2016-04-09 15:29             ` Jamal Hadi Salim
  0 siblings, 1 reply; 12+ messages in thread
From: Tom Herbert @ 2016-04-09 11:29 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jesper Dangaard Brouer, Brenden Blanco, David S. Miller,
	Linux Kernel Network Developers, Or Gerlitz, Daniel Borkmann,
	Eric Dumazet, Edward Cree, john fastabend, Thomas Graf,
	Johannes Berg, eranlinuxmellanox, Lorenzo Colitti, linux-mm

On Fri, Apr 8, 2016 at 6:34 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Fri, Apr 08, 2016 at 10:08:08PM +0200, Jesper Dangaard Brouer wrote:
>> On Fri, 8 Apr 2016 10:26:53 -0700
>> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>
>> > On Fri, Apr 08, 2016 at 02:33:40PM +0200, Jesper Dangaard Brouer wrote:
>> > >
>> > > On Fri, 8 Apr 2016 12:36:14 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>> > >
>> > > > > +/* user return codes for PHYS_DEV prog type */
>> > > > > +enum bpf_phys_dev_action {
>> > > > > +     BPF_PHYS_DEV_DROP,
>> > > > > +     BPF_PHYS_DEV_OK,
>> > > > > +};
>> > > >
>> > > > I can imagine these extra return codes:
>> > > >
>> > > >  BPF_PHYS_DEV_MODIFIED,   /* Packet page/payload modified */
>> > > >  BPF_PHYS_DEV_STOLEN,     /* E.g. forward use-case */
>> > > >  BPF_PHYS_DEV_SHARED,     /* Queue for async processing, e.g. tcpdump use-case */
>> > > >
>> > > > The "STOLEN" and "SHARED" use-cases require some refcnt manipulations,
>> > > > which we can look at when we get that far...
>> > >
>> > > I want to point out something which is quite FUNDAMENTAL, for
>> > > understanding these return codes (and network stack).
>> > >
>> > >
>> > > At driver RX time, the network stack basically have two ways of
>> > > building an SKB, which is send up the stack.
>> > >
>> > > Option-A (fastest): The packet page is writable. The SKB can be
>> > > allocated and skb->data/head can point directly to the page.  And
>> > > we place/write skb_shared_info in the end/tail-room. (This is done by
>> > > calling build_skb()).
>> > >
>> > > Option-B (slower): The packet page is read-only.  The SKB cannot point
>> > > skb->data/head directly to the page, because skb_shared_info need to be
>> > > written into skb->end (slightly hidden via skb_shinfo() casting).  To
>> > > get around this, a separate piece of memory is allocated (speedup by
>> > > __alloc_page_frag) for pointing skb->data/head, so skb_shared_info can
>> > > be written. (This is done when calling netdev/napi_alloc_skb()).
>> > >   Drivers then need to copy over packet headers, and assign + adjust
>> > > skb_shinfo(skb)->frags[0] offset to skip copied headers.
>> > >
>> > >
>> > > Unfortunately most drivers use option-B.  Due to cost of calling the
>> > > page allocator.  It is only slightly most expensive to get a larger
>> > > compound page from the page allocator, which then can be partitioned into
>> > > page-fragments, thus amortizing the page alloc cost.  Unfortunately the
>> > > cost is added later, when constructing the SKB.
>> > >  Another reason for option-B, is that archs with expensive IOMMU
>> > > requirements (like PowerPC), don't need to dma_unmap on every packet,
>> > > but only on the compound page level.
>> > >
>> > > Side-note: Most drivers have a "copy-break" optimization.  Especially
>> > > for option-B, when copying header data anyhow. For small packet, one
>> > > might as well free (or recycle) the RX page, if header size fits into
>> > > the newly allocated memory (for skb_shared_info).
>> >
>> > I think you guys are going into overdesign territory, so
>> > . nack on read-only pages
>>
>> Unfortunately you cannot just ignore or nack read-only pages. They are
>> a fact in the current drivers.
>>
>> Most drivers today (at-least the ones we care about) only deliver
>> read-only pages.  If you don't accept read-only pages day-1, then you
>> first have to rewrite a lot of drivers... and that will stall the
>> project!  How will you deal with this fact?
>>
>> The early drop filter use-case in this patchset, can ignore read-only
>> pages.  But ABI wise we need to deal with the future case where we do
>> need/require writeable pages.  A simple need-writable pages in the API
>> could help us move forward.
>
> the program should never need to worry about whether dma buffer is
> writeable or not. Complicating drivers, api, abi, usability
> for the single use case of fast packet drop is not acceptable.
> XDP is not going to be a fit for all drivers and all architectures.
> That is cruicial 'performance vs generality' aspect of the design.
> All kernel-bypasses are taking advantage of specific architecture.
> We have to take advantage of it as well. If it doesn't fit
> powerpc with iommu, so be it. XDP will return -enotsupp.
> That is fundamental point. We have to cut such corners and avoid
> all cases where unnecessary generality hurts performance.
> Read-only pages is clearly such thing.
>
+1. Forwarding which will be a common application almost always
requires modification (decrement TTL), and header data split has
always been a weak feature since the device has to have some arbitrary
rules about what headers needs to be split out (either implements
protocol specific parsing or some fixed length).

>> > The whole thing must be dead simple to use. Above is not simple by any means.
>>
>> Maybe you missed that the above was a description of how the current
>> network stack handles this, which is not simple... which is root of the
>> hole performance issue.
>
> Disagree. The stack has copy-break, gro, gso and everything else because
> it's serving _host_ use case. XDP is packet forwarder use case.
> The requirements are completely different. Ex. the host needs gso
> in the core and drivers. It needs to deliver data all the way
> to user space and back. That is hard and that's where complexity
> comes from. For packet forwarder none of it is needed. So saying,
> look we have this complexity, so XDP needs it too, is flawed argument.
> The kernel is serving host and applications.
> XDP is pure packet-in/packet-out framework to achieve better
> performance than kernel-bypass, since kernel is the right
> place to do it. It has clean access to interrupts, per-cpu,
> scheduler, device registers and so on.
> Though there are only two broad use cases packet drop and forward,
> they cover a ton of real cases: firewalls, dos prevention,
> load balancer, nat, etc. In other words mostly stateless.
> As soon as packet needs to be queued somewhere we have to
> instantiate skb and pass it to the stack.
> So no queues in XDP and no 'stolen' and 'shared' return codes.
> The program always runs to completion with single packet.
> There is no header vs payload split. There is no header
> from program point of view. It's raw bytes in dma buffer.
>
Exactly. We are rethinking the low level data path for performance. An
all encompassing solution that covers ever existing driver model only
results in complexity which is what makes things "slow" in the first
place. Drivers need to change to implement XDP, but the model is as
simple as we can make it-- for instance we are putting very little
requirements on device features.

Tom


>> I do like the idea of rejecting XDP eBPF programs based on the DMA
>> setup is not compatible, or if the driver does not implement e.g.
>> writable DMA pages.
>
> exactly.
>
>> Customers wanting this feature will then go buy the NIC which support
>> this feature.  There is nothing more motivating for NIC vendors seeing
>> customers buying the competitors hardware. And it only require a driver
>> change to get this market...
>
> exactly.
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter
  2016-04-09 11:29           ` Tom Herbert
@ 2016-04-09 15:29             ` Jamal Hadi Salim
  2016-04-09 17:26               ` Alexei Starovoitov
  0 siblings, 1 reply; 12+ messages in thread
From: Jamal Hadi Salim @ 2016-04-09 15:29 UTC (permalink / raw)
  To: Tom Herbert, Alexei Starovoitov
  Cc: Jesper Dangaard Brouer, Brenden Blanco, David S. Miller,
	Linux Kernel Network Developers, Or Gerlitz, Daniel Borkmann,
	Eric Dumazet, Edward Cree, john fastabend, Thomas Graf,
	Johannes Berg, eranlinuxmellanox, Lorenzo Colitti, linux-mm

On 16-04-09 07:29 AM, Tom Herbert wrote:

> +1. Forwarding which will be a common application almost always
> requires modification (decrement TTL), and header data split has
> always been a weak feature since the device has to have some arbitrary
> rules about what headers needs to be split out (either implements
> protocol specific parsing or some fixed length).

Then this is sensible. I was cruising the threads and
confused by your earlier emails Tom because you talked
about XPS etc. It sounded like the idea evolved into putting
the whole freaking stack on bpf.
If this is _forwarding only_ it maybe useful to look at
Alexey's old code in particular the DMA bits;
he built his own lookup algorithm but sounds like bpf is
a much better fit today.

cheers,
jamal

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter
  2016-04-09 15:29             ` Jamal Hadi Salim
@ 2016-04-09 17:26               ` Alexei Starovoitov
  2016-04-10  7:55                 ` Thomas Graf
  2016-04-10 13:07                 ` Jamal Hadi Salim
  0 siblings, 2 replies; 12+ messages in thread
From: Alexei Starovoitov @ 2016-04-09 17:26 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Tom Herbert, Jesper Dangaard Brouer, Brenden Blanco,
	David S. Miller, Linux Kernel Network Developers, Or Gerlitz,
	Daniel Borkmann, Eric Dumazet, Edward Cree, john fastabend,
	Thomas Graf, Johannes Berg, eranlinuxmellanox, Lorenzo Colitti,
	linux-mm

On Sat, Apr 09, 2016 at 11:29:18AM -0400, Jamal Hadi Salim wrote:
> On 16-04-09 07:29 AM, Tom Herbert wrote:
> 
> >+1. Forwarding which will be a common application almost always
> >requires modification (decrement TTL), and header data split has
> >always been a weak feature since the device has to have some arbitrary
> >rules about what headers needs to be split out (either implements
> >protocol specific parsing or some fixed length).
> 
> Then this is sensible. I was cruising the threads and
> confused by your earlier emails Tom because you talked
> about XPS etc. It sounded like the idea evolved into putting
> the whole freaking stack on bpf.

yeah, no stack, no queues in bpf.

> If this is _forwarding only_ it maybe useful to look at
> Alexey's old code in particular the DMA bits;
> he built his own lookup algorithm but sounds like bpf is
> a much better fit today.

a link to these old bits?

Just to be clear: this rfc is not the only thing we're considering.
In particular huawei guys did a monster effort to improve performance
in this area as well. We'll try to blend all the code together and
pick what's the best.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter
  2016-04-09 17:26               ` Alexei Starovoitov
@ 2016-04-10  7:55                 ` Thomas Graf
  2016-04-10 16:53                   ` Tom Herbert
  2016-04-10 13:07                 ` Jamal Hadi Salim
  1 sibling, 1 reply; 12+ messages in thread
From: Thomas Graf @ 2016-04-10  7:55 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jamal Hadi Salim, Tom Herbert, Jesper Dangaard Brouer,
	Brenden Blanco, David S. Miller, Linux Kernel Network Developers,
	Or Gerlitz, Daniel Borkmann, Eric Dumazet, Edward Cree,
	john fastabend, Johannes Berg, eranlinuxmellanox, Lorenzo Colitti,
	linux-mm

On 04/09/16 at 10:26am, Alexei Starovoitov wrote:
> On Sat, Apr 09, 2016 at 11:29:18AM -0400, Jamal Hadi Salim wrote:
> > If this is _forwarding only_ it maybe useful to look at
> > Alexey's old code in particular the DMA bits;
> > he built his own lookup algorithm but sounds like bpf is
> > a much better fit today.
> 
> a link to these old bits?
> 
> Just to be clear: this rfc is not the only thing we're considering.
> In particular huawei guys did a monster effort to improve performance
> in this area as well. We'll try to blend all the code together and
> pick what's the best.

What's the plan on opening the discussion on this? Can we get a peek?
Is it an alternative to XDP and the driver hook? Different architecture
or just different implementation? I understood it as another pseudo
skb model with a path on converting to real skbs for stack processing.

I really like the current proposal by Brenden for its simplicity and
targeted compatibility with cls_bpf.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter
  2016-04-10  7:55                 ` Thomas Graf
@ 2016-04-10 16:53                   ` Tom Herbert
  2016-04-10 18:09                     ` Jamal Hadi Salim
  0 siblings, 1 reply; 12+ messages in thread
From: Tom Herbert @ 2016-04-10 16:53 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Alexei Starovoitov, Jamal Hadi Salim, Jesper Dangaard Brouer,
	Brenden Blanco, David S. Miller, Linux Kernel Network Developers,
	Or Gerlitz, Daniel Borkmann, Eric Dumazet, Edward Cree,
	john fastabend, Johannes Berg, eranlinuxmellanox, Lorenzo Colitti,
	linux-mm

On Sun, Apr 10, 2016 at 12:55 AM, Thomas Graf <tgraf@suug.ch> wrote:
> On 04/09/16 at 10:26am, Alexei Starovoitov wrote:
>> On Sat, Apr 09, 2016 at 11:29:18AM -0400, Jamal Hadi Salim wrote:
>> > If this is _forwarding only_ it maybe useful to look at
>> > Alexey's old code in particular the DMA bits;
>> > he built his own lookup algorithm but sounds like bpf is
>> > a much better fit today.
>>
>> a link to these old bits?
>>
>> Just to be clear: this rfc is not the only thing we're considering.
>> In particular huawei guys did a monster effort to improve performance
>> in this area as well. We'll try to blend all the code together and
>> pick what's the best.
>
> What's the plan on opening the discussion on this? Can we get a peek?
> Is it an alternative to XDP and the driver hook? Different architecture
> or just different implementation? I understood it as another pseudo
> skb model with a path on converting to real skbs for stack processing.
>
We started discussions about this in IOvisor. The Huawei project is
called ceth (Common Ethernet). It is essentially a layer called
directly from drivers intended for fast path forwarding and network
virtualization. They have put quite a bit of effort into buffer
management and other parts of the infrastructure, much of which we
would like to leverage in XDP. The code is currently in github, will
ask them to make it generally accessible.

The programmability part, essentially BPF, should be part of a common
solution. We can define the necessary interfaces independently of the
underlying infrastructure which is really the only way we can do this
if we want the BPF programs to be portable across different
platforms-- in Linux, userspace, HW, etc.

Tom

> I really like the current proposal by Brenden for its simplicity and
> targeted compatibility with cls_bpf.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter
  2016-04-10 16:53                   ` Tom Herbert
@ 2016-04-10 18:09                     ` Jamal Hadi Salim
  0 siblings, 0 replies; 12+ messages in thread
From: Jamal Hadi Salim @ 2016-04-10 18:09 UTC (permalink / raw)
  To: Tom Herbert, Thomas Graf
  Cc: Alexei Starovoitov, Jesper Dangaard Brouer, Brenden Blanco,
	David S. Miller, Linux Kernel Network Developers, Or Gerlitz,
	Daniel Borkmann, Eric Dumazet, Edward Cree, john fastabend,
	Johannes Berg, eranlinuxmellanox, Lorenzo Colitti, linux-mm

On 16-04-10 12:53 PM, Tom Herbert wrote:

> We started discussions about this in IOvisor. The Huawei project is
> called ceth (Common Ethernet). It is essentially a layer called
> directly from drivers intended for fast path forwarding and network
> virtualization. They have put quite a bit of effort into buffer
> management and other parts of the infrastructure, much of which we
> would like to leverage in XDP. The code is currently in github, will
> ask them to make it generally accessible.
>

Cant seem to find any info on it on the googles.
If it is forwarding then it should hopefully at least make use of Linux
control APIs I hope.

cheers,
jamal

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter
  2016-04-09 17:26               ` Alexei Starovoitov
  2016-04-10  7:55                 ` Thomas Graf
@ 2016-04-10 13:07                 ` Jamal Hadi Salim
  1 sibling, 0 replies; 12+ messages in thread
From: Jamal Hadi Salim @ 2016-04-10 13:07 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Tom Herbert, Jesper Dangaard Brouer, Brenden Blanco,
	David S. Miller, Linux Kernel Network Developers, Or Gerlitz,
	Daniel Borkmann, Eric Dumazet, Edward Cree, john fastabend,
	Thomas Graf, Johannes Berg, eranlinuxmellanox, Lorenzo Colitti,
	linux-mm, Robert Olsson, kuznet

On 16-04-09 01:26 PM, Alexei Starovoitov wrote:

>
> yeah, no stack, no queues in bpf.

Thanks.

>
>> If this is _forwarding only_ it maybe useful to look at
>> Alexey's old code in particular the DMA bits;
>> he built his own lookup algorithm but sounds like bpf is
>> a much better fit today.
>
> a link to these old bits?
>

Dang. Trying to remember exact name (I think it has been gone for at
least 10 years now). I know it is not CONFIG_NET_FASTROUTE although
it could have been that depending on the driver (tulip had some
nice DMA properties - which by todays standards would be considered
primitive ;->).
+Cc Robert and Alexey (Trying to figure out name of driver based
routing code that DMAed from ingress to egress port)

> Just to be clear: this rfc is not the only thing we're considering.
> In particular huawei guys did a monster effort to improve performance
> in this area as well. We'll try to blend all the code together and
> pick what's the best.
>

Sounds very interesting.

cheers,
jamal

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2016-04-10 18:09 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <1460090930-11219-1-git-send-email-bblanco@plumgrid.com>
     [not found] ` <20160408123614.2a15a346@redhat.com>
2016-04-08 12:33   ` [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter Jesper Dangaard Brouer
2016-04-08 17:02     ` Brenden Blanco
2016-04-08 17:26     ` Alexei Starovoitov
2016-04-08 20:08       ` Jesper Dangaard Brouer
2016-04-08 21:34         ` Alexei Starovoitov
2016-04-09 11:29           ` Tom Herbert
2016-04-09 15:29             ` Jamal Hadi Salim
2016-04-09 17:26               ` Alexei Starovoitov
2016-04-10  7:55                 ` Thomas Graf
2016-04-10 16:53                   ` Tom Herbert
2016-04-10 18:09                     ` Jamal Hadi Salim
2016-04-10 13:07                 ` Jamal Hadi Salim

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).