From: Lorenzo Bianconi <lorenzo@kernel.org>
To: John Fastabend <john.fastabend@gmail.com>
Cc: netdev@vger.kernel.org, bpf@vger.kernel.org, davem@davemloft.net,
lorenzo.bianconi@redhat.com, brouer@redhat.com,
echaudro@redhat.com, sameehj@amazon.com, kuba@kernel.org,
daniel@iogearbox.net, ast@kernel.org, shayagr@amazon.com,
edumazet@google.com
Subject: Re: [PATCH v2 net-next 6/9] bpf: helpers: add bpf_xdp_adjust_mb_header helper
Date: Fri, 4 Sep 2020 11:45:11 +0200 [thread overview]
Message-ID: <20200904094511.GF2884@lore-desk> (raw)
In-Reply-To: <5f51e2f2eb22_3eceb20837@john-XPS-13-9370.notmuch>
[-- Attachment #1: Type: text/plain, Size: 8672 bytes --]
> Lorenzo Bianconi wrote:
> > Introduce bpf_xdp_adjust_mb_header helper in order to adjust frame
> > headers moving *offset* bytes from/to the second buffer to/from the
> > first one.
> > This helper can be used to move headers when the hw DMA SG is not able
> > to copy all the headers in the first fragment and split header and data
> > pages.
> >
> > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> > ---
> > include/uapi/linux/bpf.h | 25 ++++++++++++----
> > net/core/filter.c | 54 ++++++++++++++++++++++++++++++++++
> > tools/include/uapi/linux/bpf.h | 26 ++++++++++++----
> > 3 files changed, 95 insertions(+), 10 deletions(-)
> >
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 8dda13880957..c4a6d245619c 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -3571,11 +3571,25 @@ union bpf_attr {
> > * value.
> > *
> > * long bpf_copy_from_user(void *dst, u32 size, const void *user_ptr)
> > - * Description
> > - * Read *size* bytes from user space address *user_ptr* and store
> > - * the data in *dst*. This is a wrapper of copy_from_user().
> > - * Return
> > - * 0 on success, or a negative error in case of failure.
> > + * Description
> > + * Read *size* bytes from user space address *user_ptr* and store
> > + * the data in *dst*. This is a wrapper of copy_from_user().
> > + *
> > + * long bpf_xdp_adjust_mb_header(struct xdp_buff *xdp_md, int offset)
> > + * Description
> > + * Adjust frame headers moving *offset* bytes from/to the second
> > + * buffer to/from the first one. This helper can be used to move
> > + * headers when the hw DMA SG does not copy all the headers in
> > + * the first fragment.
+ Eric to the discussion
>
> This is confusing to read. Does this mean I can "move bytes to the second
> buffer from the first one" or "move bytes from the second buffer to the first
> one" And what are frame headers? I'm sure I can read below code and work
> it out, but users reading the header file should be able to parse this.
Our main goal with this helper is to fix the use-case where we request the hw
to put L2/L3/L4 headers (and all the TCP options) in the first fragment and TCP
data starting from the second fragment (headers split) but for some reasons
the hw puts the TCP options in the second fragment (if we understood correctly
this issue has been introduced by Eric @ NetDevConf 0x14).
bpf_xdp_adjust_mb_header() can fix this use-case moving bytes from the second fragment
to the first one (offset > 0) or from the first buffer to the second one (offset < 0).
>
> Also we want to be able to read all data not just headers. Reading the
> payload of a TCP packet is equally important for many l7 load balancers.
>
In order to avoid to slow down performances we require that eBPF sandbox can
read/write only buff0 in a xdp multi-buffer. The xdp program can only
perform some restricted changes to buff<n> (n >= 1) (e.g. what we did in
bpf_xdp_adjust_mb_header()).
You can find the xdp multi-buff design principles here [0][1]
[0] https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp-multi-buffer01-design.org
[1] http://people.redhat.com/lbiancon/conference/NetDevConf2020-0x14/add-xdp-on-driver.html - XDP multi-buffers section (slide 40)
> > + *
> > + * A call to this helper is susceptible to change the underlying
> > + * packet buffer. Therefore, at load time, all checks on pointers
> > + * previously done by the verifier are invalidated and must be
> > + * performed again, if the helper is used in combination with
> > + * direct packet access.
> > + *
> > + * Return
> > + * 0 on success, or a negative error in case of failure.
> > */
> > #define __BPF_FUNC_MAPPER(FN) \
> > FN(unspec), \
> > @@ -3727,6 +3741,7 @@ union bpf_attr {
> > FN(inode_storage_delete), \
> > FN(d_path), \
> > FN(copy_from_user), \
> > + FN(xdp_adjust_mb_header), \
> > /* */
> >
> > /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 47eef9a0be6a..ae6b10cf062d 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -3475,6 +3475,57 @@ static const struct bpf_func_proto bpf_xdp_adjust_head_proto = {
> > .arg2_type = ARG_ANYTHING,
> > };
> >
> > +BPF_CALL_2(bpf_xdp_adjust_mb_header, struct xdp_buff *, xdp,
> > + int, offset)
> > +{
> > + void *data_hard_end, *data_end;
> > + struct skb_shared_info *sinfo;
> > + int frag_offset, frag_len;
> > + u8 *addr;
> > +
> > + if (!xdp->mb)
> > + return -EOPNOTSUPP;
> > +
> > + sinfo = xdp_get_shared_info_from_buff(xdp);
> > +
> > + frag_len = skb_frag_size(&sinfo->frags[0]);
> > + if (offset > frag_len)
> > + return -EINVAL;
>
> What if we want data in frags[1] and so on.
>
> > +
> > + frag_offset = skb_frag_off(&sinfo->frags[0]);
> > + data_end = xdp->data_end + offset;
> > +
> > + if (offset < 0 && (-offset > frag_offset ||
> > + data_end < xdp->data + ETH_HLEN))
> > + return -EINVAL;
> > +
> > + data_hard_end = xdp_data_hard_end(xdp); /* use xdp->frame_sz */
> > + if (data_end > data_hard_end)
> > + return -EINVAL;
> > +
> > + addr = page_address(skb_frag_page(&sinfo->frags[0])) + frag_offset;
> > + if (offset > 0) {
> > + memcpy(xdp->data_end, addr, offset);
>
> But this could get expensive for large amount of data? And also
> limiting because we require the room to do the copy. Presumably
> the reason we have fargs[1] is because the normal page or half
> page is in use?
>
> > + } else {
> > + memcpy(addr + offset, xdp->data_end + offset, -offset);
> > + memset(xdp->data_end + offset, 0, -offset);
> > + }
> > +
> > + skb_frag_size_sub(&sinfo->frags[0], offset);
> > + skb_frag_off_add(&sinfo->frags[0], offset);
> > + xdp->data_end = data_end;
> > +
> > + return 0;
> > +}
>
> So overall I don't see the point of copying bytes from one frag to
> another. Create an API that adjusts the data pointers and then
> copies are avoided and manipulating frags is not needed.
please see above.
>
> Also and even more concerning I think this API requires the
> driver to populate shinfo. If we use TX_REDIRECT a lot or TX_XMIT
> this means we need to populate shinfo when its probably not ever
> used. If our driver is smart L2/L3 headers are in the readable
> data and prefetched. Writing frags into the end of a page is likely
> not free.
Sorry I did not get what you mean with "populate shinfo" here. We need to
properly set shared_info in order to create the xdp multi-buff.
Apart of header splits, please consider the main uses-cases for
xdp multi-buff are XDP with TSO and Jumbo frames.
>
> Did you benchmark this?
will do, I need to understand if we can use tiny buffers in mvneta.
>
> In general users of this API should know the bytes they want
> to fetch. Use an API like this,
>
> bpf_xdp_adjust_bytes(xdp, start, end)
>
> Where xdp is the frame, start is the first byte the user wants
> and end is the last byte. Then usually you can skip the entire
> copy part and just move the xdp pointesr around. The ugly case
> is if the user puts start/end across a frag boundary and a copy
> is needed. In that case maybe we use end as a hint and not a
> hard requirement.
>
> The use case I see is I read L2/L3/L4 headers and I need the
> first N bytes of the payload. I know where the payload starts
> and I know how many bytes I need so,
>
> bpf_xdp_adjust_bytes(xdp, payload_offset, bytes_needed);
>
> Then hopefully that is all in one frag. If its not I'll need
> to do a second helper call. Still nicer than forcing drivers
> to populate this shinfo I think. If you think I'm wrong try
> a benchmark. Benchmarks aside I get stuck when data_end and
> data_hard_end are too close together.
IIUC what you mean here is to expose L2/L3/L4 headers + some data to
the ebpf program to perform like L7 load-balancing, right?
Let's consider the Jumbo frames use-case (so the data are split in multiple
buffers). I can see to issues here:
- since in XDP we can't linearize the buffer if start and end are on the
different pages (like we do in bpf_msg_pull_data()), are we ending up
in the case where requested data are all in buff0?
- if start and end are in buff<2>, we should report the fragment number to the
ebpf program to "fetch" the data. Are we exposing too low-level details to
user-space?
Regards,
Lorenzo
>
> Thanks,
> John
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]
next prev parent reply other threads:[~2020-09-04 9:45 UTC|newest]
Thread overview: 42+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-09-03 20:58 [PATCH v2 net-next 0/9] mvneta: introduce XDP multi-buffer support Lorenzo Bianconi
2020-09-03 20:58 ` [PATCH v2 net-next 1/9] xdp: introduce mb in xdp_buff/xdp_frame Lorenzo Bianconi
2020-09-04 1:07 ` Alexei Starovoitov
2020-09-04 7:19 ` Jesper Dangaard Brouer
2020-09-04 15:15 ` David Ahern
2020-09-04 15:59 ` Jesper Dangaard Brouer
2020-09-04 16:30 ` David Ahern
2020-09-07 18:02 ` Jesper Dangaard Brouer
2020-09-08 1:22 ` David Ahern
2020-09-03 20:58 ` [PATCH v2 net-next 2/9] xdp: initialize xdp_buff mb bit to 0 in all XDP drivers Lorenzo Bianconi
2020-09-04 7:35 ` Jesper Dangaard Brouer
2020-09-04 7:54 ` Lorenzo Bianconi
2020-09-03 20:58 ` [PATCH v2 net-next 3/9] net: mvneta: update mb bit before passing the xdp buffer to eBPF layer Lorenzo Bianconi
2020-09-06 7:33 ` Shay Agroskin
2020-09-06 9:05 ` Lorenzo Bianconi
2020-09-03 20:58 ` [PATCH v2 net-next 4/9] xdp: add multi-buff support to xdp_return_{buff/frame} Lorenzo Bianconi
2020-09-03 20:58 ` [PATCH v2 net-next 5/9] net: mvneta: add multi buffer support to XDP_TX Lorenzo Bianconi
2020-09-06 7:20 ` Shay Agroskin
2020-09-06 8:43 ` Lorenzo Bianconi
2020-09-03 20:58 ` [PATCH v2 net-next 6/9] bpf: helpers: add bpf_xdp_adjust_mb_header helper Lorenzo Bianconi
2020-09-04 1:09 ` Alexei Starovoitov
2020-09-04 8:07 ` Lorenzo Bianconi
2020-09-04 1:13 ` Alexei Starovoitov
2020-09-04 7:50 ` Lorenzo Bianconi
2020-09-04 13:52 ` Jesper Dangaard Brouer
2020-09-04 14:27 ` Lorenzo Bianconi
2020-09-04 6:47 ` John Fastabend
2020-09-04 9:45 ` Lorenzo Bianconi [this message]
2020-09-04 15:23 ` John Fastabend
2020-09-06 13:36 ` Lorenzo Bianconi
2020-09-08 19:57 ` John Fastabend
2020-09-08 21:31 ` Lorenzo Bianconi
2020-09-09 20:51 ` Lorenzo Bianconi
2020-09-03 20:58 ` [PATCH v2 net-next 7/9] bpf: helpers: add multibuffer support Lorenzo Bianconi
2020-09-03 21:24 ` Maciej Fijalkowski
2020-09-04 9:47 ` Lorenzo Bianconi
2020-09-03 20:58 ` [PATCH v2 net-next 8/9] samples/bpf: add bpf program that uses xdp mb helpers Lorenzo Bianconi
2020-09-03 20:58 ` [PATCH v2 net-next 9/9] net: mvneta: enable jumbo frames for XDP Lorenzo Bianconi
2020-09-04 1:08 ` [PATCH v2 net-next 0/9] mvneta: introduce XDP multi-buffer support Alexei Starovoitov
2020-09-04 7:40 ` Lorenzo Bianconi
2020-09-04 5:41 ` John Fastabend
2020-09-04 7:39 ` Lorenzo Bianconi
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200904094511.GF2884@lore-desk \
--to=lorenzo@kernel.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=brouer@redhat.com \
--cc=daniel@iogearbox.net \
--cc=davem@davemloft.net \
--cc=echaudro@redhat.com \
--cc=edumazet@google.com \
--cc=john.fastabend@gmail.com \
--cc=kuba@kernel.org \
--cc=lorenzo.bianconi@redhat.com \
--cc=netdev@vger.kernel.org \
--cc=sameehj@amazon.com \
--cc=shayagr@amazon.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).