* Re: [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter
       [not found] ` <20160408123614.2a15a346@redhat.com>
@ 2016-04-08 12:33   ` Jesper Dangaard Brouer
  2016-04-08 17:02     ` Brenden Blanco
  2016-04-08 17:26     ` Alexei Starovoitov
  0 siblings, 2 replies; 12+ messages in thread
From: Jesper Dangaard Brouer @ 2016-04-08 12:33 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: davem, netdev, tom, alexei.starovoitov, ogerlitz, daniel,
	eric.dumazet, ecree, john.fastabend, tgraf, johannes,
	eranlinuxmellanox, lorenzo, brouer, linux-mm

On Fri, 8 Apr 2016 12:36:14 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> > +/* user return codes for PHYS_DEV prog type */
> > +enum bpf_phys_dev_action {
> > +    BPF_PHYS_DEV_DROP,
> > +    BPF_PHYS_DEV_OK,
> > +};
>
> I can imagine these extra return codes:
>
>  BPF_PHYS_DEV_MODIFIED,   /* Packet page/payload modified */
>  BPF_PHYS_DEV_STOLEN,     /* E.g. forward use-case */
>  BPF_PHYS_DEV_SHARED,     /* Queue for async processing, e.g. tcpdump use-case */
>
> The "STOLEN" and "SHARED" use-cases require some refcnt manipulations,
> which we can look at when we get that far...

I want to point out something which is quite FUNDAMENTAL for
understanding these return codes (and the network stack).


At driver RX time, the network stack basically has two ways of
building an SKB, which is then sent up the stack.

Option-A (fastest): The packet page is writable. The SKB can be
allocated and skb->data/head can point directly to the page, and
we place/write skb_shared_info in the end/tail-room. (This is done by
calling build_skb()).

Option-B (slower): The packet page is read-only. The SKB cannot point
skb->data/head directly to the page, because skb_shared_info needs to be
written into skb->end (slightly hidden via skb_shinfo() casting). To
get around this, a separate piece of memory is allocated (sped up by
__alloc_page_frag) for pointing skb->data/head, so skb_shared_info can
be written. (This is done when calling netdev/napi_alloc_skb()).
Drivers then need to copy over packet headers, and assign + adjust
skb_shinfo(skb)->frags[0] offset to skip copied headers.


Unfortunately most drivers use option-B, due to the cost of calling the
page allocator. It is only slightly more expensive to get a larger
compound page from the page allocator, which then can be partitioned into
page-fragments, thus amortizing the page alloc cost. Unfortunately the
cost is added later, when constructing the SKB.
Another reason for option-B is that archs with expensive IOMMU
requirements (like PowerPC) don't need to dma_unmap on every packet,
but only on the compound page level.

Side-note: Most drivers have a "copy-break" optimization, especially
for option-B, where header data is copied anyhow. For small packets, one
might as well free (or recycle) the RX page, if the header size fits into
the newly allocated memory (for skb_shared_info).


For the early filter drop (DDoS use-case), it does not matter that the
packet-page is read-only.

BUT for the future XDP (eXpress Data Path) use-case it does matter. If
we ever want to see speeds comparable to DPDK, then drivers
need to implement option-A, as this allows forwarding at the packet-page
level.

I hope my future page-pool facility can remove/hide the cost of calling
the page allocator.


Back to the return codes, thus:
-------------------------------
BPF_PHYS_DEV_SHARED requires the driver to use option-B when constructing
the SKB, and to treat packet data as read-only.

BPF_PHYS_DEV_MODIFIED requires the driver to provide a writable packet-page.

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread
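The two SKB construction paths described in the message above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not kernel code: `toy_skb`, `toy_shared_info`, `toy_build_skb()` and `toy_alloc_skb_copy_hdr()` are hypothetical stand-ins for `struct sk_buff`, `skb_shared_info`, the build_skb() path and the napi_alloc_skb() path, and a plain malloc() replaces the page-fragment allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096u
/* Round x up to pointer alignment so the metadata struct stays aligned. */
#define TOY_ALIGN(x) (((x) + sizeof(void *) - 1) & ~(sizeof(void *) - 1))

/* Hypothetical stand-in for skb_shared_info: a single page fragment. */
struct toy_shared_info {
    const unsigned char *frag_page; /* NULL when there is no fragment */
    size_t frag_off;                /* where the fragment starts in the page */
    size_t frag_len;
};

/* Hypothetical stand-in for struct sk_buff. */
struct toy_skb {
    unsigned char *head;            /* start of the linear buffer */
    size_t len;                     /* packet bytes in the linear buffer */
    struct toy_shared_info *shinfo; /* metadata at the end of that buffer */
};

/* Option-A: the page is writable, so the metadata is written into its
 * tail room and no packet bytes are copied. The page must be suitably
 * aligned (a malloc'ed buffer is). */
static int toy_build_skb(struct toy_skb *skb, unsigned char *page, size_t pkt_len)
{
    size_t shinfo_off = TOY_PAGE_SIZE - sizeof(struct toy_shared_info);

    assert(skb != NULL && page != NULL);
    if (pkt_len > shinfo_off)
        return -1; /* no tail room left for the metadata */
    skb->head = page;
    skb->len = pkt_len;
    skb->shinfo = (struct toy_shared_info *)(page + shinfo_off);
    skb->shinfo->frag_page = NULL;
    skb->shinfo->frag_off = 0;
    skb->shinfo->frag_len = 0;
    return 0;
}

/* Option-B: the page is read-only, so a separate head buffer holds the
 * copied headers plus the metadata, and the rest of the packet is
 * referenced as a fragment that starts after the copied headers. */
static int toy_alloc_skb_copy_hdr(struct toy_skb *skb, const unsigned char *page,
                                  size_t pkt_len, size_t hdr_len)
{
    size_t shinfo_off = TOY_ALIGN(hdr_len);
    unsigned char *head;

    assert(skb != NULL && page != NULL);
    if (hdr_len > pkt_len)
        return -1;
    head = malloc(shinfo_off + sizeof(struct toy_shared_info));
    if (head == NULL)
        return -1;
    memcpy(head, page, hdr_len); /* the extra copy option-A avoids */
    skb->head = head;
    skb->len = hdr_len;
    skb->shinfo = (struct toy_shared_info *)(head + shinfo_off);
    skb->shinfo->frag_page = page;
    skb->shinfo->frag_off = hdr_len;
    skb->shinfo->frag_len = pkt_len - hdr_len;
    return 0;
}
```

In this model the caller of toy_alloc_skb_copy_hdr() owns `skb->head` and must free() it; the option-A path allocates nothing beyond the page itself.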
* Re: [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter 2016-04-08 12:33 ` [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter Jesper Dangaard Brouer @ 2016-04-08 17:02 ` Brenden Blanco 2016-04-08 17:26 ` Alexei Starovoitov 1 sibling, 0 replies; 12+ messages in thread From: Brenden Blanco @ 2016-04-08 17:02 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: davem, netdev, tom, alexei.starovoitov, ogerlitz, daniel, eric.dumazet, ecree, john.fastabend, tgraf, johannes, eranlinuxmellanox, lorenzo, linux-mm On Fri, Apr 08, 2016 at 02:33:40PM +0200, Jesper Dangaard Brouer wrote: > > On Fri, 8 Apr 2016 12:36:14 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote: > > > > +/* user return codes for PHYS_DEV prog type */ > > > +enum bpf_phys_dev_action { > > > + BPF_PHYS_DEV_DROP, > > > + BPF_PHYS_DEV_OK, > > > +}; > > > > I can imagine these extra return codes: > > > > BPF_PHYS_DEV_MODIFIED, /* Packet page/payload modified */ > > BPF_PHYS_DEV_STOLEN, /* E.g. forward use-case */ > > BPF_PHYS_DEV_SHARED, /* Queue for async processing, e.g. tcpdump use-case */ > > > > The "STOLEN" and "SHARED" use-cases require some refcnt manipulations, > > which we can look at when we get that far... > > I want to point out something which is quite FUNDAMENTAL, for > understanding these return codes (and network stack). > > > At driver RX time, the network stack basically have two ways of > building an SKB, which is send up the stack. > > Option-A (fastest): The packet page is writable. The SKB can be > allocated and skb->data/head can point directly to the page. And > we place/write skb_shared_info in the end/tail-room. (This is done by > calling build_skb()). > > Option-B (slower): The packet page is read-only. The SKB cannot point > skb->data/head directly to the page, because skb_shared_info need to be > written into skb->end (slightly hidden via skb_shinfo() casting). 
To > get around this, a separate piece of memory is allocated (speedup by > __alloc_page_frag) for pointing skb->data/head, so skb_shared_info can > be written. (This is done when calling netdev/napi_alloc_skb()). > Drivers then need to copy over packet headers, and assign + adjust > skb_shinfo(skb)->frags[0] offset to skip copied headers. > > > Unfortunately most drivers use option-B. Due to cost of calling the > page allocator. It is only slightly most expensive to get a larger > compound page from the page allocator, which then can be partitioned into > page-fragments, thus amortizing the page alloc cost. Unfortunately the > cost is added later, when constructing the SKB. > Another reason for option-B, is that archs with expensive IOMMU > requirements (like PowerPC), don't need to dma_unmap on every packet, > but only on the compound page level. > > Side-note: Most drivers have a "copy-break" optimization. Especially > for option-B, when copying header data anyhow. For small packet, one > might as well free (or recycle) the RX page, if header size fits into > the newly allocated memory (for skb_shared_info). > > > For the early filter drop (DDoS use-case), it does not matter that the > packet-page is read-only. > > BUT for the future XDP (eXpress Data Path) use-case it does matter. If > we ever want to see speeds comparable to DPDK, then drivers to > need to implement option-A, as this allow forwarding at the packet-page > level. > > I hope, my future page-pool facility can remove/hide the cost calling > the page allocator. > Can't wait! This will open up a lot of doors. > > Back to the return codes, thus: > ------------------------------- > BPF_PHYS_DEV_SHARED requires driver use option-B, when constructing > the SKB, and treat packet data as read-only. > > BPF_PHYS_DEV_MODIFIED requires driver to provide a writable packet-page. I understand the driver/hw requirement, but the codes themselves I think need some tweaking. 
For instance, if the packet is both modified and forwarded, should the flags be ORed together? Or is the need for this return code made obsolete if the driver knows ahead of time via struct bpf_prog flags that the prog intends to modify the packet, and can set up the page accordingly?
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer
* Re: [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter 2016-04-08 12:33 ` [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter Jesper Dangaard Brouer 2016-04-08 17:02 ` Brenden Blanco @ 2016-04-08 17:26 ` Alexei Starovoitov 2016-04-08 20:08 ` Jesper Dangaard Brouer 1 sibling, 1 reply; 12+ messages in thread From: Alexei Starovoitov @ 2016-04-08 17:26 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Brenden Blanco, davem, netdev, tom, ogerlitz, daniel, eric.dumazet, ecree, john.fastabend, tgraf, johannes, eranlinuxmellanox, lorenzo, linux-mm On Fri, Apr 08, 2016 at 02:33:40PM +0200, Jesper Dangaard Brouer wrote: > > On Fri, 8 Apr 2016 12:36:14 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote: > > > > +/* user return codes for PHYS_DEV prog type */ > > > +enum bpf_phys_dev_action { > > > + BPF_PHYS_DEV_DROP, > > > + BPF_PHYS_DEV_OK, > > > +}; > > > > I can imagine these extra return codes: > > > > BPF_PHYS_DEV_MODIFIED, /* Packet page/payload modified */ > > BPF_PHYS_DEV_STOLEN, /* E.g. forward use-case */ > > BPF_PHYS_DEV_SHARED, /* Queue for async processing, e.g. tcpdump use-case */ > > > > The "STOLEN" and "SHARED" use-cases require some refcnt manipulations, > > which we can look at when we get that far... > > I want to point out something which is quite FUNDAMENTAL, for > understanding these return codes (and network stack). > > > At driver RX time, the network stack basically have two ways of > building an SKB, which is send up the stack. > > Option-A (fastest): The packet page is writable. The SKB can be > allocated and skb->data/head can point directly to the page. And > we place/write skb_shared_info in the end/tail-room. (This is done by > calling build_skb()). > > Option-B (slower): The packet page is read-only. The SKB cannot point > skb->data/head directly to the page, because skb_shared_info need to be > written into skb->end (slightly hidden via skb_shinfo() casting). 
To > get around this, a separate piece of memory is allocated (speedup by > __alloc_page_frag) for pointing skb->data/head, so skb_shared_info can > be written. (This is done when calling netdev/napi_alloc_skb()). > Drivers then need to copy over packet headers, and assign + adjust > skb_shinfo(skb)->frags[0] offset to skip copied headers. > > > Unfortunately most drivers use option-B. Due to cost of calling the > page allocator. It is only slightly most expensive to get a larger > compound page from the page allocator, which then can be partitioned into > page-fragments, thus amortizing the page alloc cost. Unfortunately the > cost is added later, when constructing the SKB. > Another reason for option-B, is that archs with expensive IOMMU > requirements (like PowerPC), don't need to dma_unmap on every packet, > but only on the compound page level. > > Side-note: Most drivers have a "copy-break" optimization. Especially > for option-B, when copying header data anyhow. For small packet, one > might as well free (or recycle) the RX page, if header size fits into > the newly allocated memory (for skb_shared_info). I think you guys are going into overdesign territory, so . nack on read-only pages . nack on copy-break approach . nack on per-ring programs . nack on modified/stolen/shared return codes The whole thing must be dead simple to use. Above is not simple by any means. The programs must see writeable pages only and return codes: drop, pass to stack, redirect to xmit. If program wishes to modify packets before passing it to stack, it shouldn't need to deal with different return values. No special things to deal with small or large packets. No header splits. Program must not be aware of any such things. Drivers can use DMA_BIDIRECTIONAL to allow received page to be modified by the program and immediately sent to xmit. No dma map/unmap/sync per packet. If some odd architectures/dma setups cannot do it, then XDP will not be applicable there. 
We are not going to sacrifice performance for generality.
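The three-verdict model proposed in the message above (drop, pass to stack, redirect to xmit) can be sketched as follows. This is a userspace illustration under assumptions: `toy_action`, `toy_prog_t`, `toy_rx_one()` and `toy_reflect_prog()` are hypothetical names, not anything from the patch set. The example program rewrites the packet in place and still returns only one of the three plain verdicts, which is the point being argued.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical verdicts: drop, pass to stack, redirect to xmit. */
enum toy_action { TOY_DROP, TOY_PASS, TOY_TX };

/* A program sees raw, writable packet bytes and returns a verdict. */
typedef enum toy_action (*toy_prog_t)(uint8_t *data, size_t len);

struct toy_rx_stats {
    unsigned long dropped, passed, xmit;
};

/* Example program: drop runts, pass ARP to the stack, and reflect
 * everything else by swapping the Ethernet MAC addresses in place. */
static enum toy_action toy_reflect_prog(uint8_t *data, size_t len)
{
    uint8_t tmp[6];

    if (len < 14)
        return TOY_DROP;
    if (data[12] == 0x08 && data[13] == 0x06) /* EtherType ARP */
        return TOY_PASS;
    memcpy(tmp, data, 6);
    memcpy(data, data + 6, 6);
    memcpy(data + 6, tmp, 6);
    return TOY_TX; /* modified, yet no special "modified" verdict */
}

/* Driver side of the model: run the program once per packet and act
 * on the verdict. Only TOY_PASS would ever need an SKB. */
static enum toy_action toy_rx_one(toy_prog_t prog, uint8_t *data, size_t len,
                                  struct toy_rx_stats *st)
{
    enum toy_action act;

    assert(prog != NULL && st != NULL);
    act = prog(data, len);
    switch (act) {
    case TOY_PASS:
        st->passed++; /* a real driver would build an SKB here */
        break;
    case TOY_TX:
        st->xmit++;   /* a real driver would queue the page on a TX ring */
        break;
    case TOY_DROP:
    default:
        st->dropped++; /* a real driver would recycle the page */
        break;
    }
    return act;
}
```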
* Re: [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter 2016-04-08 17:26 ` Alexei Starovoitov @ 2016-04-08 20:08 ` Jesper Dangaard Brouer 2016-04-08 21:34 ` Alexei Starovoitov 0 siblings, 1 reply; 12+ messages in thread From: Jesper Dangaard Brouer @ 2016-04-08 20:08 UTC (permalink / raw) To: Alexei Starovoitov Cc: Brenden Blanco, davem, netdev, tom, ogerlitz, daniel, eric.dumazet, ecree, john.fastabend, tgraf, johannes, eranlinuxmellanox, lorenzo, linux-mm, brouer On Fri, 8 Apr 2016 10:26:53 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > On Fri, Apr 08, 2016 at 02:33:40PM +0200, Jesper Dangaard Brouer wrote: > > > > On Fri, 8 Apr 2016 12:36:14 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote: > > > > > > +/* user return codes for PHYS_DEV prog type */ > > > > +enum bpf_phys_dev_action { > > > > + BPF_PHYS_DEV_DROP, > > > > + BPF_PHYS_DEV_OK, > > > > +}; > > > > > > I can imagine these extra return codes: > > > > > > BPF_PHYS_DEV_MODIFIED, /* Packet page/payload modified */ > > > BPF_PHYS_DEV_STOLEN, /* E.g. forward use-case */ > > > BPF_PHYS_DEV_SHARED, /* Queue for async processing, e.g. tcpdump use-case */ > > > > > > The "STOLEN" and "SHARED" use-cases require some refcnt manipulations, > > > which we can look at when we get that far... > > > > I want to point out something which is quite FUNDAMENTAL, for > > understanding these return codes (and network stack). > > > > > > At driver RX time, the network stack basically have two ways of > > building an SKB, which is send up the stack. > > > > Option-A (fastest): The packet page is writable. The SKB can be > > allocated and skb->data/head can point directly to the page. And > > we place/write skb_shared_info in the end/tail-room. (This is done by > > calling build_skb()). > > > > Option-B (slower): The packet page is read-only. 
The SKB cannot point > > skb->data/head directly to the page, because skb_shared_info need to be > > written into skb->end (slightly hidden via skb_shinfo() casting). To > > get around this, a separate piece of memory is allocated (speedup by > > __alloc_page_frag) for pointing skb->data/head, so skb_shared_info can > > be written. (This is done when calling netdev/napi_alloc_skb()). > > Drivers then need to copy over packet headers, and assign + adjust > > skb_shinfo(skb)->frags[0] offset to skip copied headers. > > > > > > Unfortunately most drivers use option-B. Due to cost of calling the > > page allocator. It is only slightly most expensive to get a larger > > compound page from the page allocator, which then can be partitioned into > > page-fragments, thus amortizing the page alloc cost. Unfortunately the > > cost is added later, when constructing the SKB. > > Another reason for option-B, is that archs with expensive IOMMU > > requirements (like PowerPC), don't need to dma_unmap on every packet, > > but only on the compound page level. > > > > Side-note: Most drivers have a "copy-break" optimization. Especially > > for option-B, when copying header data anyhow. For small packet, one > > might as well free (or recycle) the RX page, if header size fits into > > the newly allocated memory (for skb_shared_info). > > I think you guys are going into overdesign territory, so > . nack on read-only pages Unfortunately you cannot just ignore or nack read-only pages. They are a fact in the current drivers. Most drivers today (at-least the ones we care about) only deliver read-only pages. If you don't accept read-only pages day-1, then you first have to rewrite a lot of drivers... and that will stall the project! How will you deal with this fact? The early drop filter use-case in this patchset, can ignore read-only pages. But ABI wise we need to deal with the future case where we do need/require writeable pages. 
A simple "need writable pages" indication in the API could help us move forward.

> . nack on copy-break approach

Copy-break can be ignored. It sort of happens at a higher level in the
driver. (Eric likely wants this to happen for local socket delivery.)

> . nack on per-ring programs

Hmmm... I don't see it as a lot more complicated to attach the program
to the ring. But maybe we can extend the API later, and thus postpone that
discussion.

> . nack on modified/stolen/shared return codes
>
> The whole thing must be dead simple to use. Above is not simple by any means.

Maybe you missed that the above was a description of how the current
network stack handles this, which is not simple... which is the root of the
whole performance issue.

> The programs must see writeable pages only and return codes:
> drop, pass to stack, redirect to xmit.
> If program wishes to modify packets before passing it to stack, it
> shouldn't need to deal with different return values.
> No special things to deal with small or large packets. No header splits.
> Program must not be aware of any such things.

I agree on this. This layer only deals with packets at the page level,
single packets stored in contiguous memory.

> Drivers can use DMA_BIDIRECTIONAL to allow received page to be
> modified by the program and immediately sent to xmit.

We just have to verify that DMA_BIDIRECTIONAL does not add extra
overhead (DMA-API-HOWTO.txt explicitly states that it likely does,
but I would like to verify this with a micro benchmark).

> No dma map/unmap/sync per packet. If some odd architectures/dma setups
> cannot do it, then XDP will not be applicable there.

I do like the idea of rejecting XDP eBPF programs when the DMA
setup is not compatible, or when the driver does not implement e.g.
writable DMA pages.

Customers wanting this feature will then go buy a NIC which supports
this feature. Nothing motivates NIC vendors more than seeing
customers buy the competitor's hardware. And it only requires a driver
change to get this market...

> We are not going to sacrifice performance for generality.

Agree.

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
* Re: [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter 2016-04-08 20:08 ` Jesper Dangaard Brouer @ 2016-04-08 21:34 ` Alexei Starovoitov 2016-04-09 11:29 ` Tom Herbert 0 siblings, 1 reply; 12+ messages in thread From: Alexei Starovoitov @ 2016-04-08 21:34 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Brenden Blanco, davem, netdev, tom, ogerlitz, daniel, eric.dumazet, ecree, john.fastabend, tgraf, johannes, eranlinuxmellanox, lorenzo, linux-mm On Fri, Apr 08, 2016 at 10:08:08PM +0200, Jesper Dangaard Brouer wrote: > On Fri, 8 Apr 2016 10:26:53 -0700 > Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > > On Fri, Apr 08, 2016 at 02:33:40PM +0200, Jesper Dangaard Brouer wrote: > > > > > > On Fri, 8 Apr 2016 12:36:14 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote: > > > > > > > > +/* user return codes for PHYS_DEV prog type */ > > > > > +enum bpf_phys_dev_action { > > > > > + BPF_PHYS_DEV_DROP, > > > > > + BPF_PHYS_DEV_OK, > > > > > +}; > > > > > > > > I can imagine these extra return codes: > > > > > > > > BPF_PHYS_DEV_MODIFIED, /* Packet page/payload modified */ > > > > BPF_PHYS_DEV_STOLEN, /* E.g. forward use-case */ > > > > BPF_PHYS_DEV_SHARED, /* Queue for async processing, e.g. tcpdump use-case */ > > > > > > > > The "STOLEN" and "SHARED" use-cases require some refcnt manipulations, > > > > which we can look at when we get that far... > > > > > > I want to point out something which is quite FUNDAMENTAL, for > > > understanding these return codes (and network stack). > > > > > > > > > At driver RX time, the network stack basically have two ways of > > > building an SKB, which is send up the stack. > > > > > > Option-A (fastest): The packet page is writable. The SKB can be > > > allocated and skb->data/head can point directly to the page. And > > > we place/write skb_shared_info in the end/tail-room. (This is done by > > > calling build_skb()). > > > > > > Option-B (slower): The packet page is read-only. 
The SKB cannot point > > > skb->data/head directly to the page, because skb_shared_info need to be > > > written into skb->end (slightly hidden via skb_shinfo() casting). To > > > get around this, a separate piece of memory is allocated (speedup by > > > __alloc_page_frag) for pointing skb->data/head, so skb_shared_info can > > > be written. (This is done when calling netdev/napi_alloc_skb()). > > > Drivers then need to copy over packet headers, and assign + adjust > > > skb_shinfo(skb)->frags[0] offset to skip copied headers. > > > > > > > > > Unfortunately most drivers use option-B. Due to cost of calling the > > > page allocator. It is only slightly most expensive to get a larger > > > compound page from the page allocator, which then can be partitioned into > > > page-fragments, thus amortizing the page alloc cost. Unfortunately the > > > cost is added later, when constructing the SKB. > > > Another reason for option-B, is that archs with expensive IOMMU > > > requirements (like PowerPC), don't need to dma_unmap on every packet, > > > but only on the compound page level. > > > > > > Side-note: Most drivers have a "copy-break" optimization. Especially > > > for option-B, when copying header data anyhow. For small packet, one > > > might as well free (or recycle) the RX page, if header size fits into > > > the newly allocated memory (for skb_shared_info). > > > > I think you guys are going into overdesign territory, so > > . nack on read-only pages > > Unfortunately you cannot just ignore or nack read-only pages. They are > a fact in the current drivers. > > Most drivers today (at-least the ones we care about) only deliver > read-only pages. If you don't accept read-only pages day-1, then you > first have to rewrite a lot of drivers... and that will stall the > project! How will you deal with this fact? > > The early drop filter use-case in this patchset, can ignore read-only > pages. 
But ABI wise we need to deal with the future case where we do
> need/require writeable pages. A simple need-writable pages in the API
> could help us move forward.

The program should never need to worry about whether the dma buffer is
writeable or not. Complicating drivers, api, abi, usability
for the single use case of fast packet drop is not acceptable.
XDP is not going to be a fit for all drivers and all architectures.
That is the crucial 'performance vs generality' aspect of the design.
All kernel-bypasses are taking advantage of a specific architecture.
We have to take advantage of it as well. If it doesn't fit
powerpc with iommu, so be it. XDP will return -ENOTSUPP.
That is the fundamental point. We have to cut such corners and avoid
all cases where unnecessary generality hurts performance.
Read-only pages are clearly such a case.

> > The whole thing must be dead simple to use. Above is not simple by any means.
>
> Maybe you missed that the above was a description of how the current
> network stack handles this, which is not simple... which is root of the
> hole performance issue.

Disagree. The stack has copy-break, gro, gso and everything else because
it's serving the _host_ use case. XDP is the packet forwarder use case.
The requirements are completely different. E.g. the host needs gso
in the core and drivers. It needs to deliver data all the way
to user space and back. That is hard and that's where the complexity
comes from. For a packet forwarder none of it is needed. So saying,
look we have this complexity, so XDP needs it too, is a flawed argument.
The kernel is serving host and applications.
XDP is a pure packet-in/packet-out framework to achieve better
performance than kernel-bypass, since the kernel is the right
place to do it. It has clean access to interrupts, per-cpu,
scheduler, device registers and so on.
Though there are only two broad use cases, packet drop and forward,
they cover a ton of real cases: firewalls, dos prevention,
load balancer, nat, etc. In other words mostly stateless.
As soon as a packet needs to be queued somewhere we have to
instantiate an skb and pass it to the stack.
So no queues in XDP and no 'stolen' and 'shared' return codes.
The program always runs to completion with a single packet.
There is no header vs payload split. There is no header
from the program's point of view. It's raw bytes in a dma buffer.

> I do like the idea of rejecting XDP eBPF programs based on the DMA
> setup is not compatible, or if the driver does not implement e.g.
> writable DMA pages.

Exactly.

> Customers wanting this feature will then go buy the NIC which support
> this feature.  There is nothing more motivating for NIC vendors seeing
> customers buying the competitors hardware. And it only require a driver
> change to get this market...

Exactly.
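The "reject at attach time instead of complicating the program" idea agreed on above could look roughly like this. It is a sketch under assumptions: `toy_netdev`, its two capability flags and `toy_attach_prog()` are hypothetical, and userspace `EOPNOTSUPP` stands in for the kernel-internal ENOTSUPP mentioned in the message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-device state; the capability flags are assumptions
 * made for illustration, not fields of any real driver structure. */
struct toy_netdev {
    int rx_pages_writable; /* driver hands programs writable packet pages */
    int dma_bidirectional; /* RX buffers are mapped so they can be sent back out */
    const void *prog;      /* attached program, NULL when none */
};

/* Refuse the program up front when the driver or DMA setup cannot meet
 * the model's requirements, so the program itself never has to care. */
static int toy_attach_prog(struct toy_netdev *dev, const void *prog)
{
    assert(dev != NULL);
    if (!dev->rx_pages_writable || !dev->dma_bidirectional)
        return -EOPNOTSUPP; /* stand-in for the kernel's -ENOTSUPP */
    dev->prog = prog;
    return 0;
}
```

With a check like this, a driver that only delivers read-only pages keeps working as before; it simply cannot have a program attached until it implements the writable-page path.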
* Re: [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter 2016-04-08 21:34 ` Alexei Starovoitov @ 2016-04-09 11:29 ` Tom Herbert 2016-04-09 15:29 ` Jamal Hadi Salim 0 siblings, 1 reply; 12+ messages in thread From: Tom Herbert @ 2016-04-09 11:29 UTC (permalink / raw) To: Alexei Starovoitov Cc: Jesper Dangaard Brouer, Brenden Blanco, David S. Miller, Linux Kernel Network Developers, Or Gerlitz, Daniel Borkmann, Eric Dumazet, Edward Cree, john fastabend, Thomas Graf, Johannes Berg, eranlinuxmellanox, Lorenzo Colitti, linux-mm On Fri, Apr 8, 2016 at 6:34 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > On Fri, Apr 08, 2016 at 10:08:08PM +0200, Jesper Dangaard Brouer wrote: >> On Fri, 8 Apr 2016 10:26:53 -0700 >> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: >> >> > On Fri, Apr 08, 2016 at 02:33:40PM +0200, Jesper Dangaard Brouer wrote: >> > > >> > > On Fri, 8 Apr 2016 12:36:14 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote: >> > > >> > > > > +/* user return codes for PHYS_DEV prog type */ >> > > > > +enum bpf_phys_dev_action { >> > > > > + BPF_PHYS_DEV_DROP, >> > > > > + BPF_PHYS_DEV_OK, >> > > > > +}; >> > > > >> > > > I can imagine these extra return codes: >> > > > >> > > > BPF_PHYS_DEV_MODIFIED, /* Packet page/payload modified */ >> > > > BPF_PHYS_DEV_STOLEN, /* E.g. forward use-case */ >> > > > BPF_PHYS_DEV_SHARED, /* Queue for async processing, e.g. tcpdump use-case */ >> > > > >> > > > The "STOLEN" and "SHARED" use-cases require some refcnt manipulations, >> > > > which we can look at when we get that far... >> > > >> > > I want to point out something which is quite FUNDAMENTAL, for >> > > understanding these return codes (and network stack). >> > > >> > > >> > > At driver RX time, the network stack basically have two ways of >> > > building an SKB, which is send up the stack. >> > > >> > > Option-A (fastest): The packet page is writable. 
The SKB can be
>> > > allocated and skb->data/head can point directly to the page. And
>> > > we place/write skb_shared_info in the end/tail-room. (This is done by
>> > > calling build_skb()).
>> > >
>> > > Option-B (slower): The packet page is read-only. The SKB cannot point
>> > > skb->data/head directly to the page, because skb_shared_info need to be
>> > > written into skb->end (slightly hidden via skb_shinfo() casting). To
>> > > get around this, a separate piece of memory is allocated (speedup by
>> > > __alloc_page_frag) for pointing skb->data/head, so skb_shared_info can
>> > > be written. (This is done when calling netdev/napi_alloc_skb()).
>> > > Drivers then need to copy over packet headers, and assign + adjust
>> > > skb_shinfo(skb)->frags[0] offset to skip copied headers.
>> > >
>> > >
>> > > Unfortunately most drivers use option-B. Due to cost of calling the
>> > > page allocator. It is only slightly more expensive to get a larger
>> > > compound page from the page allocator, which then can be partitioned into
>> > > page-fragments, thus amortizing the page alloc cost. Unfortunately the
>> > > cost is added later, when constructing the SKB.
>> > > Another reason for option-B, is that archs with expensive IOMMU
>> > > requirements (like PowerPC), don't need to dma_unmap on every packet,
>> > > but only on the compound page level.
>> > >
>> > > Side-note: Most drivers have a "copy-break" optimization. Especially
>> > > for option-B, when copying header data anyhow. For small packet, one
>> > > might as well free (or recycle) the RX page, if header size fits into
>> > > the newly allocated memory (for skb_shared_info).
>> >
>> > I think you guys are going into overdesign territory, so
>> > . nack on read-only pages
>>
>> Unfortunately you cannot just ignore or nack read-only pages. They are
>> a fact in the current drivers.
>>
>> Most drivers today (at-least the ones we care about) only deliver
>> read-only pages. If you don't accept read-only pages day-1, then you
>> first have to rewrite a lot of drivers... and that will stall the
>> project! How will you deal with this fact?
>>
>> The early drop filter use-case in this patchset, can ignore read-only
>> pages. But ABI wise we need to deal with the future case where we do
>> need/require writeable pages. A simple need-writable pages in the API
>> could help us move forward.
>
> the program should never need to worry about whether dma buffer is
> writeable or not. Complicating drivers, api, abi, usability
> for the single use case of fast packet drop is not acceptable.
> XDP is not going to be a fit for all drivers and all architectures.
> That is crucial 'performance vs generality' aspect of the design.
> All kernel-bypasses are taking advantage of specific architecture.
> We have to take advantage of it as well. If it doesn't fit
> powerpc with iommu, so be it. XDP will return -enotsupp.
> That is fundamental point. We have to cut such corners and avoid
> all cases where unnecessary generality hurts performance.
> Read-only pages is clearly such thing.
>
+1. Forwarding which will be a common application almost always
requires modification (decrement TTL), and header data split has
always been a weak feature since the device has to have some arbitrary
rules about what headers needs to be split out (either implements
protocol specific parsing or some fixed length).

>> > The whole thing must be dead simple to use. Above is not simple by any means.
>>
>> Maybe you missed that the above was a description of how the current
>> network stack handles this, which is not simple... which is root of the
>> whole performance issue.
>
> Disagree. The stack has copy-break, gro, gso and everything else because
> it's serving _host_ use case. XDP is packet forwarder use case.
> The requirements are completely different. Ex. the host needs gso
> in the core and drivers. It needs to deliver data all the way
> to user space and back. That is hard and that's where complexity
> comes from. For packet forwarder none of it is needed. So saying,
> look we have this complexity, so XDP needs it too, is flawed argument.
> The kernel is serving host and applications.
> XDP is pure packet-in/packet-out framework to achieve better
> performance than kernel-bypass, since kernel is the right
> place to do it. It has clean access to interrupts, per-cpu,
> scheduler, device registers and so on.
> Though there are only two broad use cases packet drop and forward,
> they cover a ton of real cases: firewalls, dos prevention,
> load balancer, nat, etc. In other words mostly stateless.
> As soon as packet needs to be queued somewhere we have to
> instantiate skb and pass it to the stack.
> So no queues in XDP and no 'stolen' and 'shared' return codes.
> The program always runs to completion with single packet.
> There is no header vs payload split. There is no header
> from program point of view. It's raw bytes in dma buffer.
>
Exactly. We are rethinking the low level data path for performance. An
all encompassing solution that covers every existing driver model only
results in complexity which is what makes things "slow" in the first
place. Drivers need to change to implement XDP, but the model is as
simple as we can make it-- for instance we are putting very little
requirements on device features.

Tom

>> I do like the idea of rejecting XDP eBPF programs based on the DMA
>> setup is not compatible, or if the driver does not implement e.g.
>> writable DMA pages.
>
> exactly.
>
>> Customers wanting this feature will then go buy the NIC which support
>> this feature. There is nothing more motivating for NIC vendors seeing
>> customers buying the competitors hardware. And it only require a driver
>> change to get this market...
>
> exactly.
>
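Tom's point that forwarding almost always modifies the packet (decrement TTL) is the practical reason a writable packet page matters: both the TTL byte and the IPv4 header checksum have to be rewritten in place. A minimal user-space sketch in C, assuming a raw IPv4 header with no kernel types and using the incremental update from RFC 1624 rather than a full recompute. The helper names (csum_fold, ipv4_hdr_csum_ok, ipv4_dec_ttl) are illustrative only and do not come from the patch set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte offsets inside an IPv4 header (fixed by the protocol). */
enum { IPV4_MIN_HDR_LEN = 20, IPV4_TTL_OFF = 8, IPV4_CSUM_OFF = 10 };

/* Fold a 32-bit accumulator into a 16-bit one's-complement sum. */
static uint16_t csum_fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)sum;
}

/* A valid header, summed together with its checksum field, folds to 0xffff. */
static int ipv4_hdr_csum_ok(const uint8_t *hdr, size_t hdr_len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i + 1 < hdr_len; i += 2)
        sum += (uint32_t)(hdr[i] << 8 | hdr[i + 1]);
    return csum_fold(sum) == 0xffffu;
}

/*
 * Hypothetical helper: decrement the TTL in place and patch the checksum
 * incrementally, HC' = ~(~HC + ~m + m'), where m is the 16-bit word that
 * holds TTL and protocol. Returns the new TTL, or -1 if the buffer is too
 * short or the TTL is already 0. This writes to the packet, so it only
 * works when the packet memory is writable.
 */
static int ipv4_dec_ttl(uint8_t *pkt, size_t len)
{
    assert(pkt != NULL);
    if (len < IPV4_MIN_HDR_LEN || pkt[IPV4_TTL_OFF] == 0)
        return -1;

    uint16_t old_word = (uint16_t)(pkt[IPV4_TTL_OFF] << 8 | pkt[IPV4_TTL_OFF + 1]);
    uint16_t old_csum = (uint16_t)(pkt[IPV4_CSUM_OFF] << 8 | pkt[IPV4_CSUM_OFF + 1]);

    pkt[IPV4_TTL_OFF]--;
    uint16_t new_word = (uint16_t)(pkt[IPV4_TTL_OFF] << 8 | pkt[IPV4_TTL_OFF + 1]);

    uint32_t sum = (uint16_t)~old_csum;
    sum += (uint16_t)~old_word;
    sum += new_word;
    uint16_t new_csum = (uint16_t)~csum_fold(sum);

    pkt[IPV4_CSUM_OFF] = (uint8_t)(new_csum >> 8);
    pkt[IPV4_CSUM_OFF + 1] = (uint8_t)(new_csum & 0xff);
    return pkt[IPV4_TTL_OFF];
}
```

With a header whose TTL/protocol word is 0x4011 and whose checksum is 0xb861, ipv4_dec_ttl() leaves the word at 0x3f11 and the checksum at 0xb961. On a read-only page (option-B) the same change would first need the headers copied somewhere writable.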
* Re: [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter
From: Jamal Hadi Salim @ 2016-04-09 15:29 UTC
To: Tom Herbert, Alexei Starovoitov
Cc: Jesper Dangaard Brouer, Brenden Blanco, David S. Miller, Linux Kernel Network Developers, Or Gerlitz, Daniel Borkmann, Eric Dumazet, Edward Cree, john fastabend, Thomas Graf, Johannes Berg, eranlinuxmellanox, Lorenzo Colitti, linux-mm

On 16-04-09 07:29 AM, Tom Herbert wrote:

> +1. Forwarding which will be a common application almost always
> requires modification (decrement TTL), and header data split has
> always been a weak feature since the device has to have some arbitrary
> rules about what headers needs to be split out (either implements
> protocol specific parsing or some fixed length).

Then this is sensible. I was cruising the threads and
confused by your earlier emails Tom because you talked
about XPS etc. It sounded like the idea evolved into putting
the whole freaking stack on bpf.
If this is _forwarding only_ it may be useful to look at
Alexey's old code in particular the DMA bits;
he built his own lookup algorithm but sounds like bpf is
a much better fit today.

cheers,
jamal
* Re: [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter
From: Alexei Starovoitov @ 2016-04-09 17:26 UTC
To: Jamal Hadi Salim
Cc: Tom Herbert, Jesper Dangaard Brouer, Brenden Blanco, David S. Miller, Linux Kernel Network Developers, Or Gerlitz, Daniel Borkmann, Eric Dumazet, Edward Cree, john fastabend, Thomas Graf, Johannes Berg, eranlinuxmellanox, Lorenzo Colitti, linux-mm

On Sat, Apr 09, 2016 at 11:29:18AM -0400, Jamal Hadi Salim wrote:
> On 16-04-09 07:29 AM, Tom Herbert wrote:
>
> >+1. Forwarding which will be a common application almost always
> >requires modification (decrement TTL), and header data split has
> >always been a weak feature since the device has to have some arbitrary
> >rules about what headers needs to be split out (either implements
> >protocol specific parsing or some fixed length).
>
> Then this is sensible. I was cruising the threads and
> confused by your earlier emails Tom because you talked
> about XPS etc. It sounded like the idea evolved into putting
> the whole freaking stack on bpf.

yeah, no stack, no queues in bpf.

> If this is _forwarding only_ it may be useful to look at
> Alexey's old code in particular the DMA bits;
> he built his own lookup algorithm but sounds like bpf is
> a much better fit today.

a link to these old bits?

Just to be clear: this rfc is not the only thing we're considering.
In particular huawei guys did a monster effort to improve performance
in this area as well. We'll try to blend all the code together and
pick what's the best.
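The model Alexei describes in this thread (no stack, no queues, a program that runs to completion on one packet, sees only raw bytes, and returns an action) can be sketched in plain user-space C. The enum is the one quoted from the RFC patch; everything else here (struct pkt_view, drop_udp(), struct drv_caps, struct prog_req, attach_prog()) is a hypothetical stand-in for illustration, not the patch's API. The userspace errno ENOTSUP stands in for the "-enotsupp" mentioned earlier in the thread, and a real program would be eBPF loaded through the kernel rather than a C function.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Return codes as proposed in the RFC patch under discussion. */
enum bpf_phys_dev_action {
    BPF_PHYS_DEV_DROP,
    BPF_PHYS_DEV_OK,
};

/* Hypothetical context: raw bytes of exactly one packet, nothing else. */
struct pkt_view {
    const uint8_t *data;
    size_t len;
};

enum { ETH_HDR_LEN = 14, ETHERTYPE_IPV4 = 0x0800, IP_PROTO_UDP = 17 };

/* Example filter: drop every IPv4/UDP frame, pass everything else up. */
static enum bpf_phys_dev_action drop_udp(const struct pkt_view *pkt)
{
    assert(pkt != NULL && pkt->data != NULL);

    /* Need the Ethernet header plus the IPv4 protocol byte (IP offset 9). */
    if (pkt->len < ETH_HDR_LEN + 10)
        return BPF_PHYS_DEV_OK;

    uint16_t ethertype = (uint16_t)(pkt->data[12] << 8 | pkt->data[13]);
    if (ethertype != ETHERTYPE_IPV4)
        return BPF_PHYS_DEV_OK;

    if (pkt->data[ETH_HDR_LEN + 9] == IP_PROTO_UDP)
        return BPF_PHYS_DEV_DROP;
    return BPF_PHYS_DEV_OK;
}

/* Hypothetical capability flags, to model rejecting a program at attach time. */
struct drv_caps { int writable_pages; };
struct prog_req { int needs_write; };

/* Refuse the program up front instead of handling read-only pages per packet. */
static int attach_prog(const struct drv_caps *drv, const struct prog_req *prog)
{
    if (prog->needs_write && !drv->writable_pages)
        return -ENOTSUP;
    return 0;
}
```

The filter holds no state between calls and never queues a packet, which matches the "runs to completion with single packet" constraint. The attach check reflects the idea, agreed on in the thread, of rejecting programs on drivers that cannot provide writable DMA pages.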
* Re: [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter
From: Thomas Graf @ 2016-04-10 7:55 UTC
To: Alexei Starovoitov
Cc: Jamal Hadi Salim, Tom Herbert, Jesper Dangaard Brouer, Brenden Blanco, David S. Miller, Linux Kernel Network Developers, Or Gerlitz, Daniel Borkmann, Eric Dumazet, Edward Cree, john fastabend, Johannes Berg, eranlinuxmellanox, Lorenzo Colitti, linux-mm

On 04/09/16 at 10:26am, Alexei Starovoitov wrote:
> On Sat, Apr 09, 2016 at 11:29:18AM -0400, Jamal Hadi Salim wrote:
> > If this is _forwarding only_ it may be useful to look at
> > Alexey's old code in particular the DMA bits;
> > he built his own lookup algorithm but sounds like bpf is
> > a much better fit today.
>
> a link to these old bits?
>
> Just to be clear: this rfc is not the only thing we're considering.
> In particular huawei guys did a monster effort to improve performance
> in this area as well. We'll try to blend all the code together and
> pick what's the best.

What's the plan on opening the discussion on this? Can we get a peek?
Is it an alternative to XDP and the driver hook? Different architecture
or just different implementation? I understood it as another pseudo
skb model with a path on converting to real skbs for stack processing.

I really like the current proposal by Brenden for its simplicity and
targeted compatibility with cls_bpf.
* Re: [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter
From: Tom Herbert @ 2016-04-10 16:53 UTC
To: Thomas Graf
Cc: Alexei Starovoitov, Jamal Hadi Salim, Jesper Dangaard Brouer, Brenden Blanco, David S. Miller, Linux Kernel Network Developers, Or Gerlitz, Daniel Borkmann, Eric Dumazet, Edward Cree, john fastabend, Johannes Berg, eranlinuxmellanox, Lorenzo Colitti, linux-mm

On Sun, Apr 10, 2016 at 12:55 AM, Thomas Graf <tgraf@suug.ch> wrote:
> On 04/09/16 at 10:26am, Alexei Starovoitov wrote:
>> On Sat, Apr 09, 2016 at 11:29:18AM -0400, Jamal Hadi Salim wrote:
>> > If this is _forwarding only_ it may be useful to look at
>> > Alexey's old code in particular the DMA bits;
>> > he built his own lookup algorithm but sounds like bpf is
>> > a much better fit today.
>>
>> a link to these old bits?
>>
>> Just to be clear: this rfc is not the only thing we're considering.
>> In particular huawei guys did a monster effort to improve performance
>> in this area as well. We'll try to blend all the code together and
>> pick what's the best.
>
> What's the plan on opening the discussion on this? Can we get a peek?
> Is it an alternative to XDP and the driver hook? Different architecture
> or just different implementation? I understood it as another pseudo
> skb model with a path on converting to real skbs for stack processing.
>
We started discussions about this in IOvisor. The Huawei project is
called ceth (Common Ethernet). It is essentially a layer called
directly from drivers intended for fast path forwarding and network
virtualization. They have put quite a bit of effort into buffer
management and other parts of the infrastructure, much of which we
would like to leverage in XDP. The code is currently in github, will
ask them to make it generally accessible.

The programmability part, essentially BPF, should be part of a common
solution. We can define the necessary interfaces independently of the
underlying infrastructure which is really the only way we can do this
if we want the BPF programs to be portable across different
platforms-- in Linux, userspace, HW, etc.

Tom

> I really like the current proposal by Brenden for its simplicity and
> targeted compatibility with cls_bpf.
* Re: [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter
From: Jamal Hadi Salim @ 2016-04-10 18:09 UTC
To: Tom Herbert, Thomas Graf
Cc: Alexei Starovoitov, Jesper Dangaard Brouer, Brenden Blanco, David S. Miller, Linux Kernel Network Developers, Or Gerlitz, Daniel Borkmann, Eric Dumazet, Edward Cree, john fastabend, Johannes Berg, eranlinuxmellanox, Lorenzo Colitti, linux-mm

On 16-04-10 12:53 PM, Tom Herbert wrote:

> We started discussions about this in IOvisor. The Huawei project is
> called ceth (Common Ethernet). It is essentially a layer called
> directly from drivers intended for fast path forwarding and network
> virtualization. They have put quite a bit of effort into buffer
> management and other parts of the infrastructure, much of which we
> would like to leverage in XDP. The code is currently in github, will
> ask them to make it generally accessible.
>

Can't seem to find any info on it on the googles.
If it is forwarding then it should hopefully at least make use of Linux
control APIs I hope.

cheers,
jamal
* Re: [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter
From: Jamal Hadi Salim @ 2016-04-10 13:07 UTC
To: Alexei Starovoitov
Cc: Tom Herbert, Jesper Dangaard Brouer, Brenden Blanco, David S. Miller, Linux Kernel Network Developers, Or Gerlitz, Daniel Borkmann, Eric Dumazet, Edward Cree, john fastabend, Thomas Graf, Johannes Berg, eranlinuxmellanox, Lorenzo Colitti, linux-mm, Robert Olsson, kuznet

On 16-04-09 01:26 PM, Alexei Starovoitov wrote:
>
> yeah, no stack, no queues in bpf.

Thanks.

>
>> If this is _forwarding only_ it may be useful to look at
>> Alexey's old code in particular the DMA bits;
>> he built his own lookup algorithm but sounds like bpf is
>> a much better fit today.
>
> a link to these old bits?
>

Dang. Trying to remember exact name (I think it has been gone for at
least 10 years now). I know it is not CONFIG_NET_FASTROUTE although
it could have been that depending on the driver (tulip had some
nice DMA properties - which by today's standards would be considered
primitive ;->).
+Cc Robert and Alexey
(Trying to figure out name of driver based routing code that DMAed
from ingress to egress port)

> Just to be clear: this rfc is not the only thing we're considering.
> In particular huawei guys did a monster effort to improve performance
> in this area as well. We'll try to blend all the code together and
> pick what's the best.
>

Sounds very interesting.

cheers,
jamal