From: Jesper Dangaard Brouer <brouer@redhat.com>
To: John Fastabend <john.fastabend@gmail.com>
Cc: bjorn.topel@gmail.com, jasowang@redhat.com, ast@fb.com,
alexander.duyck@gmail.com, john.r.fastabend@intel.com,
netdev@vger.kernel.org, brouer@redhat.com
Subject: Re: [RFC PATCH 1/2] af_packet: direct dma for packet ineterface
Date: Mon, 30 Jan 2017 19:16:07 +0100
Message-ID: <20170130191607.14d964e4@redhat.com>
In-Reply-To: <20170127213344.14162.59976.stgit@john-Precision-Tower-5810>

On Fri, 27 Jan 2017 13:33:44 -0800 John Fastabend <john.fastabend@gmail.com> wrote:

> This adds ndo ops for upper layer objects to request direct DMA from
> the network interface into memory "slots". The slots must be DMA'able
> memory given by a page/offset/size vector in a packet_ring_buffer
> structure.
>
> The PF_PACKET socket interface can use these ndo_ops to do zerocopy
> RX from the network device into memory mapped userspace memory. For
> this to work drivers encode the correct descriptor blocks and headers
> so that existing PF_PACKET applications work without any modification.
> This only supports the V2 header formats for now. And works by mapping
> a ring of the network device to these slots. Originally I used V2
> header formats but this does complicate the driver a bit.
>
> V3 header formats added bulk polling via socket calls and timers
> used in the polling interface to return every n milliseconds. Currently,
> I don't see any way to support this in hardware because we can't
> know if the hardware is in the middle of a DMA operation or not
> on a slot. So when a timer fires I don't know how to advance the
> descriptor ring leaving empty descriptors similar to how the software
> ring works. The easiest (best?) route is to simply not support this.

From a performance point of view, bulking is essential. Systems like
netmap, which also depend on transferring control between kernel and
userspace, report[1] that they need a bulking size of at least 8 to
amortize the overhead.

[1] Figure 7, page 10, http://info.iet.unipi.it/~luigi/papers/20120503-netmap-atc12.pdf

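To make the amortization argument concrete, here is a minimal sketch of
the arithmetic: one kernel/userspace transition is shared by a whole bulk
of packets, so its cost per packet shrinks as the bulk grows. The helper
name and both cost parameters are hypothetical; they are not measurements
from netmap or from this patch.

```c
#include <assert.h>

/*
 * Per-packet cost in nanoseconds when a single kernel<->userspace
 * transition is shared by `bulk` packets.
 *
 * transition_ns: assumed fixed cost of one transition (hypothetical input)
 * work_ns:       assumed per-packet processing cost (hypothetical input)
 */
static double per_packet_cost_ns(double transition_ns, double work_ns,
                                 unsigned int bulk)
{
    assert(bulk > 0);
    return transition_ns / bulk + work_ns;
}
```

With made-up inputs of 800 ns per transition and 50 ns of work per
packet, a bulk of 1 costs 850 ns per packet while a bulk of 8 costs
150 ns per packet. Real numbers have to come from benchmarking.
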
> It might be worth creating a new v4 header that is simple for drivers
> to support direct DMA ops with. I can imagine using the xdp_buff
> structure as a header for example. Thoughts?

Likely, but I would like us to take a measurement-based approach. Let's
benchmark with this V2 header format, see how far we are from the
target, and see what lights up in perf report and whether it is
something we can address.

> The ndo operations and new socket option PACKET_RX_DIRECT work by
> giving a queue_index to run the direct dma operations over. Once
> setsockopt returns successfully the indicated queue is mapped
> directly to the requesting application and can not be used for
> other purposes. Also any kernel layers such as tc will be bypassed
> and need to be implemented in the hardware via some other mechanism
> such as tc offload or other offload interfaces.

Will this need to bypass XDP as well?

E.g. how will you support XDP_TX? AFAIK you cannot remove/detach a
packet with this solution (and place it on a TX queue and wait for DMA
TX completion).

> Users steer traffic to the selected queue using flow director,
> tc offload infrastructure or via macvlan offload.
>
> The new socket option added to PF_PACKET is called PACKET_RX_DIRECT.
> It takes a single unsigned int value specifying the queue index,
>
> setsockopt(sock, SOL_PACKET, PACKET_RX_DIRECT,
> &queue_index, sizeof(queue_index));
>
> Implementing busy_poll support will allow userspace to kick the
> drivers receive routine if needed. This work is TBD.
>
> To test this I hacked a hardcoded test into the tool psock_tpacket
> in the selftests kernel directory here:
>
> ./tools/testing/selftests/net/psock_tpacket.c
>
> Running this tool opens a socket and listens for packets over
> the PACKET_RX_DIRECT enabled socket. Obviously it needs to be
> reworked to enable all the older tests and not hardcode my
> interface before it actually gets released.
>
> In general this is a rough patch to explore the interface and
> put something concrete up for debate. The patch does not handle
> all the error cases correctly and needs to be cleaned up.
>
> Known Limitations (TBD):
>
> (1) Users are required to match the number of rx ring
> slots with ethtool to the number requested by the
> setsockopt PF_PACKET layout. In the future we could
> possibly do this automatically.
>
> (2) Users need to configure Flow director or setup_tc
> to steer traffic to the correct queues. I don't believe
> this needs to be changed it seems to be a good mechanism
> for driving directed dma.
>
> (3) Not supporting timestamps or priv space yet, pushing
> a v4 packet header would resolve this nicely.
>
> (5) Only RX supported so far. TX already supports direct DMA
> interface but uses skbs which is really not needed. In
> the TX_RING case we can optimize this path as well.
>
> To support TX case we can do a similar "slots" mechanism and
> kick operation. The kick could be a busy_poll like operation
> but on the TX side. The flow would be user space loads up
> n number of slots with packets, kicks tx busy poll bit, the
> driver sends packets, and finally when xmit is complete
> clears header bits to give slots back. When we have qdisc
> bypass set today we already bypass the entire stack so no
> particular reason to use skb's in this case. Using xdp_buff
> as a v4 packet header would also allow us to consolidate
> driver code.
>
> To be done:
>
> (1) More testing and performance analysis
> (2) Busy polling sockets
> (3) Implement v4 xdp_buff headers for analysis
> (4) performance testing :/ hopefully it looks good.

I guess I don't understand the details of the af_packet versions well
enough, but can you explain to me how userspace knows which slots it
can read/fetch, and how it marks a slot as finished so the kernel knows
it can reuse it?

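For reference, in the existing software TPACKET_V2 RX ring the handoff
goes through the tp_status field of each frame header: the kernel sets
TP_STATUS_USER when a frame is ready, and userspace writes
TP_STATUS_KERNEL to return the slot. Below is a minimal sketch of that
consumer loop. The helper ring_consume() and its parameters are
hypothetical, and whether the direct-DMA path keeps exactly this protocol
is an assumption, not something taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/if_packet.h>   /* struct tpacket2_hdr, TP_STATUS_* */

/*
 * Hypothetical helper (not from the patch): walk a TPACKET_V2 RX ring
 * starting at *idx, consume every frame currently owned by userspace,
 * and hand each slot back to the kernel.
 *
 * ring:       start of the ring memory (normally the mmap()ed area)
 * nr_frames:  number of frame slots in the ring
 * frame_size: size of one slot in bytes
 * idx:        in/out index of the next slot to inspect
 * bytes:      accumulates tp_len of the consumed frames
 *
 * Returns the number of frames consumed.
 */
static unsigned int ring_consume(uint8_t *ring, unsigned int nr_frames,
                                 unsigned int frame_size, unsigned int *idx,
                                 uint64_t *bytes)
{
    unsigned int done = 0;

    assert(nr_frames > 0 && *idx < nr_frames);

    while (done < nr_frames) {
        volatile struct tpacket2_hdr *hdr =
            (volatile struct tpacket2_hdr *)(ring + (size_t)*idx * frame_size);

        /* Slot is still owned by the kernel: nothing more to read. */
        if (!(hdr->tp_status & TP_STATUS_USER))
            break;

        /* Read the frame only after observing the status bit. */
        __sync_synchronize();
        *bytes += hdr->tp_len;

        /* Finish all reads before returning the slot to the kernel. */
        __sync_synchronize();
        hdr->tp_status = TP_STATUS_KERNEL;

        *idx = (*idx + 1) % nr_frames;
        done++;
    }
    return done;
}
```

A real application would typically block in poll() on the socket when
this returns 0 and then call it again. The open question for direct DMA
is what plays the role of these status writes when the hardware, rather
than the kernel, fills the slot.
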
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer