From: Jesper Dangaard Brouer <brouer@redhat.com>
To: John Fastabend <john.fastabend@gmail.com>
Cc: bpf@vger.kernel.org, netdev@vger.kernel.org,
Daniel Borkmann <borkmann@iogearbox.net>,
Alexei Starovoitov <alexei.starovoitov@gmail.com>,
maze@google.com, lmb@cloudflare.com, shaun@tigera.io,
Lorenzo Bianconi <lorenzo@kernel.org>,
marek@cloudflare.com, Jakub Kicinski <kuba@kernel.org>,
eyal.birger@gmail.com, brouer@redhat.com
Subject: Re: [PATCH bpf-next V5 3/5] bpf: add BPF-helper for MTU checking
Date: Mon, 2 Nov 2020 12:15:48 +0100 [thread overview]
Message-ID: <20201102121548.5e2c36b1@carbon> (raw)
In-Reply-To: <5f9c764fc98c6_16d4208d5@john-XPS-13-9370.notmuch>
On Fri, 30 Oct 2020 13:23:43 -0700
John Fastabend <john.fastabend@gmail.com> wrote:
> Jesper Dangaard Brouer wrote:
> > This BPF-helper bpf_check_mtu() works for both XDP and TC-BPF programs.
> >
> > The API is designed to help the BPF-programmer, that want to do packet
> > context size changes, which involves other helpers. These other helpers
> > usually does a delta size adjustment. This helper also support a delta
> > size (len_diff), which allow BPF-programmer to reuse arguments needed by
> > these other helpers, and perform the MTU check prior to doing any actual
> > size adjustment of the packet context.
> >
> > It is on purpose, that we allow the len adjustment to become a negative
> > result, that will pass the MTU check. This might seem weird, but it's not
> > this helpers responsibility to "catch" wrong len_diff adjustments. Other
> > helpers will take care of these checks, if BPF-programmer chooses to do
> > actual size adjustment.
> >
> > V4: Lot of changes
> > - ifindex 0 now use current netdev for MTU lookup
> > - rename helper from bpf_mtu_check to bpf_check_mtu
> > - fix bug for GSO pkt length (as skb->len is total len)
> > - remove __bpf_len_adj_positive, simply allow negative len adj
> >
> > V3: Take L2/ETH_HLEN header size into account and document it.
> >
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > ---
>
> Sorry for the late feedback here.
>
> This seems like a lot of baked in functionality into the helper? Can you
> say something about why the simpler and, at least to me, more intuitive
> helper to simply return the ifindex mtu is not ideal?
I tried to explain this in the patch description. This is for easier
collaboration with other helpers, that also have the len_diff parameter.
This API allow to check the MTU *prior* to doing the size adjustment.
Let me explain what is not in the patch desc:
In the first patchset, I started with the simply implementation of
returning the MTU. Then I realized that this puts more work into the
BPF program (thus increasing BPF code instructions). As we in BPF-prog
need to extract the packet length to compare against the returned MTU
size. Looking at other programs that does the ctx/packet size adjust, we
don't extract the packet length as ctx is about to change, and we don't
need the MTU variable in the BPF prog (unless it fails).
> Rough pseudo code being,
>
> my_sender(struct __sk_buff *skb, int fwd_ifindex)
> {
> mtu = bpf_get_ifindex_mtu(fwd_ifindex, 0);
> if (skb->len + HDR_SIZE < mtu)
> return send_with_hdrs(skb);
> return -EMSGSIZE
> }
>
>
> > include/uapi/linux/bpf.h | 70 +++++++++++++++++++++++
> > net/core/filter.c | 120 ++++++++++++++++++++++++++++++++++++++++
> > tools/include/uapi/linux/bpf.h | 70 +++++++++++++++++++++++
> > 3 files changed, 260 insertions(+)
> >
>
> [...]
>
> > + * **BPF_MTU_CHK_RELAX**
> > + * This flag relax or increase the MTU with room for one
> > + * VLAN header (4 bytes). This relaxation is also used by
> > + * the kernels own forwarding MTU checks.
>
> I noted below as well, but not sure why this is needed. Seems if user
> knows to add a flag because they want a vlan header we can just as
> easily expect BPF program to do it. Also it only works for VLAN headers
> any other header data wont be accounted for so it seems only useful
> in one specific case.
This was added because the kernels own forwarding have this relaxation
build in. Thus, I though that I should add flag to compatible with the
kernels forwarding checks.
> > + *
> > + * **BPF_MTU_CHK_SEGS**
> > + * This flag will only works for *ctx* **struct sk_buff**.
> > + * If packet context contains extra packet segment buffers
> > + * (often knows as GSO skb), then MTU check is partly
> > + * skipped, because in transmit path it is possible for the
> > + * skb packet to get re-segmented (depending on net device
> > + * features). This could still be a MTU violation, so this
> > + * flag enables performing MTU check against segments, with
> > + * a different violation return code to tell it apart.
> > + *
> > + * The *mtu_result* pointer contains the MTU value of the net
> > + * device including the L2 header size (usually 14 bytes Ethernet
> > + * header). The net device configured MTU is the L3 size, but as
> > + * XDP and TX length operate at L2 this helper include L2 header
> > + * size in reported MTU.
> > + *
> > + * Return
> > + * * 0 on success, and populate MTU value in *mtu_result* pointer.
> > + *
> > + * * < 0 if any input argument is invalid (*mtu_result* not updated)
> > + *
> > + * MTU violations return positive values, but also populate MTU
> > + * value in *mtu_result* pointer, as this can be needed for
> > + * implementing PMTU handing:
> > + *
> > + * * **BPF_MTU_CHK_RET_FRAG_NEEDED**
> > + * * **BPF_MTU_CHK_RET_SEGS_TOOBIG**
> > + *
> > */
>
> [...]
>
> > +static int __bpf_lookup_mtu(struct net_device *dev_curr, u32 ifindex, u64 flags)
> > +{
> > + struct net *netns = dev_net(dev_curr);
> > + struct net_device *dev;
> > + int mtu;
> > +
> > + /* Non-redirect use-cases can use ifindex=0 and save ifindex lookup */
> > + if (ifindex == 0)
> > + dev = dev_curr;
> > + else
> > + dev = dev_get_by_index_rcu(netns, ifindex);
> > +
> > + if (!dev)
> > + return -ENODEV;
> > +
> > + /* XDP+TC len is L2: Add L2-header as dev MTU is L3 size */
> > + mtu = dev->mtu + dev->hard_header_len;
>
> READ_ONCE() on dev->mtu and hard_header_len as well? We don't have
> any locks.
This is based on similar checks done in the same execution context,
which don't have these READ_ONCE() macros. I'm not introducing reading
these, I'm simply moving when they are read. If this is really needed,
then I think we need separate fixes patches, for stable backporting.
While doing this work, I've realized that mtu + hard_header_len is
located on two different cache-lines, which is unfortunate, but I will
look at fixing this in followup patches.
> > +
> > + /* Same relax as xdp_ok_fwd_dev() and is_skb_forwardable() */
> > + if (flags & BPF_MTU_CHK_RELAX)
> > + mtu += VLAN_HLEN;
>
> I'm trying to think about the use case where this might be used?
> Compared to just adjusting MTU in BPF program side as needed for
> packet encapsulation/headers/etc.
As I wrote above, this were added because the kernels own forwarding
have this relaxation in it's checks (in is_skb_forwardable()). I even
tried to dig through the history, introduced in [1] and copy-pasted
in[2]. And this seems to be a workaround, that have become standard,
that still have practical implications.
My practical experiments showed, that e.g. ixgbe driver with MTU=1500
(L3-size) will allow and fully send packets with 1504 (L3-size). But
i40e will not, and drops the packet in hardware/firmware step. So,
what is the correct action, strict or relaxed?
My own conclusion is that we should inverse the flag. Meaning to
default add this VLAN_HLEN (4 bytes) relaxation, and have a flag to do
more strict check, e.g. BPF_MTU_CHK_STRICT. As for historical reasons
we must act like kernels version of MTU check. Unless you object, I will
do this in V6.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
[1] https://git.kernel.org/torvalds/c/57f89bfa2140 ("network: Allow af_packet to transmit +4 bytes for VLAN packets.") (Author: Ben Greear)
[2] https://git.kernel.org/torvalds/c/79b569f0ec53 ("netdev: fix mtu check when TSO is enabled") (Author: Daniel Lezcano)
next prev parent reply other threads:[~2020-11-02 11:16 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-10-30 16:50 [PATCH bpf-next V5 0/5] Subj: bpf: New approach for BPF MTU handling Jesper Dangaard Brouer
2020-10-30 16:50 ` [PATCH bpf-next V5 1/5] bpf: Remove MTU check in __bpf_skb_max_len Jesper Dangaard Brouer
2020-10-30 16:50 ` [PATCH bpf-next V5 2/5] bpf: bpf_fib_lookup return MTU value as output when looked up Jesper Dangaard Brouer
2020-10-30 19:40 ` John Fastabend
2020-11-02 9:28 ` Jesper Dangaard Brouer
2020-11-02 15:59 ` David Ahern
2020-11-02 16:18 ` John Fastabend
2020-10-31 15:52 ` David Ahern
2020-10-30 16:51 ` [PATCH bpf-next V5 3/5] bpf: add BPF-helper for MTU checking Jesper Dangaard Brouer
2020-10-30 20:23 ` John Fastabend
2020-11-02 11:15 ` Jesper Dangaard Brouer [this message]
2020-11-02 18:04 ` John Fastabend
2020-11-02 20:10 ` Jesper Dangaard Brouer
2020-11-12 12:58 ` Jesper Dangaard Brouer
2020-10-30 16:51 ` [PATCH bpf-next V5 4/5] bpf: drop MTU check when doing TC-BPF redirect to ingress Jesper Dangaard Brouer
2020-10-30 20:36 ` John Fastabend
2020-11-02 12:46 ` Jesper Dangaard Brouer
2020-11-02 16:23 ` John Fastabend
2020-10-30 16:51 ` [PATCH bpf-next V5 5/5] bpf: make it possible to identify BPF redirected SKBs Jesper Dangaard Brouer
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20201102121548.5e2c36b1@carbon \
--to=brouer@redhat.com \
--cc=alexei.starovoitov@gmail.com \
--cc=borkmann@iogearbox.net \
--cc=bpf@vger.kernel.org \
--cc=eyal.birger@gmail.com \
--cc=john.fastabend@gmail.com \
--cc=kuba@kernel.org \
--cc=lmb@cloudflare.com \
--cc=lorenzo@kernel.org \
--cc=marek@cloudflare.com \
--cc=maze@google.com \
--cc=netdev@vger.kernel.org \
--cc=shaun@tigera.io \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.