Re: [PATCH bpf-next V11 4/7] bpf: add BPF-helper for MTU checking

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Jesper Dangaard Brouer <brouer@redhat.com>
To: Daniel Borkmann <daniel@iogearbox.net>
Cc: bpf@vger.kernel.org, netdev@vger.kernel.org,
	Daniel Borkmann <borkmann@iogearbox.net>,
	Alexei Starovoitov <alexei.starovoitov@gmail.com>,
	maze@google.com, lmb@cloudflare.com, shaun@tigera.io,
	Lorenzo Bianconi <lorenzo@kernel.org>,
	marek@cloudflare.com, John Fastabend <john.fastabend@gmail.com>,
	Jakub Kicinski <kuba@kernel.org>,
	eyal.birger@gmail.com, colrack@gmail.com, brouer@redhat.com
Subject: Re: [PATCH bpf-next V11 4/7] bpf: add BPF-helper for MTU checking
Date: Thu, 14 Jan 2021 15:36:07 +0100	[thread overview]
Message-ID: <20210114153607.6eea9b37@carbon> (raw)
In-Reply-To: <a14a7490-88c6-9d14-0886-547113242c45@iogearbox.net>

On Thu, 14 Jan 2021 00:07:14 +0100
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 1/12/21 6:45 PM, Jesper Dangaard Brouer wrote:
> > This BPF-helper bpf_check_mtu() works for both XDP and TC-BPF programs.  
> [...]
> > + * int bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_len, s32 len_diff, u64 flags)
> > + *	Description
> > + *		Check ctx packet size against MTU of net device (based on
> > + *		*ifindex*).  This helper will likely be used in combination with
> > + *		helpers that adjust/change the packet size.  The argument
> > + *		*len_diff* can be used for querying with a planned size
> > + *		change. This allows to check MTU prior to changing packet ctx.
> > + *
> > + *		Specifying *ifindex* zero means the MTU check is performed
> > + *		against the current net device.  This is practical if this isn't
> > + *		used prior to redirect.
> > + *
> > + *		The Linux kernel route table can configure MTUs on a more
> > + *		specific per route level, which is not provided by this helper.
> > + *		For route level MTU checks use the **bpf_fib_lookup**\ ()
> > + *		helper.
> > + *
> > + *		*ctx* is either **struct xdp_md** for XDP programs or
> > + *		**struct sk_buff** for tc cls_act programs.
> > + *
> > + *		The *flags* argument can be a combination of one or more of the
> > + *		following values:
> > + *
> > + *		**BPF_MTU_CHK_SEGS**
> > + *			This flag will only works for *ctx* **struct sk_buff**.
> > + *			If packet context contains extra packet segment buffers
> > + *			(often knows as GSO skb), then MTU check is harder to
> > + *			check at this point, because in transmit path it is
> > + *			possible for the skb packet to get re-segmented
> > + *			(depending on net device features).  This could still be
> > + *			a MTU violation, so this flag enables performing MTU
> > + *			check against segments, with a different violation
> > + *			return code to tell it apart. Check cannot use len_diff.
> > + *
> > + *		On return *mtu_len* pointer contains the MTU value of the net
> > + *		device.  Remember the net device configured MTU is the L3 size,
> > + *		which is returned here and XDP and TX length operate at L2.
> > + *		Helper take this into account for you, but remember when using
> > + *		MTU value in your BPF-code.  On input *mtu_len* must be a valid
> > + *		pointer and be initialized (to zero), else verifier will reject
> > + *		BPF program.
> > + *
> > + *	Return
> > + *		* 0 on success, and populate MTU value in *mtu_len* pointer.
> > + *
> > + *		* < 0 if any input argument is invalid (*mtu_len* not updated)
> > + *
> > + *		MTU violations return positive values, but also populate MTU
> > + *		value in *mtu_len* pointer, as this can be needed for
> > + *		implementing PMTU handing:
> > + *
> > + *		* **BPF_MTU_CHK_RET_FRAG_NEEDED**
> > + *		* **BPF_MTU_CHK_RET_SEGS_TOOBIG**
> > + *
> >    */
> >   #define __BPF_FUNC_MAPPER(FN)		\
> >   	FN(unspec),			\
> > @@ -3998,6 +4053,7 @@ union bpf_attr {
> >   	FN(ktime_get_coarse_ns),	\
> >   	FN(ima_inode_hash),		\
> >   	FN(sock_from_file),		\
> > +	FN(check_mtu),			\
> >   	/* */
> >   
> >   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> > @@ -5030,6 +5086,17 @@ struct bpf_redir_neigh {
> >   	};
> >   };
> >   
> > +/* bpf_check_mtu flags*/
> > +enum  bpf_check_mtu_flags {
> > +	BPF_MTU_CHK_SEGS  = (1U << 0),
> > +};
> > +
> > +enum bpf_check_mtu_ret {
> > +	BPF_MTU_CHK_RET_SUCCESS,      /* check and lookup successful */
> > +	BPF_MTU_CHK_RET_FRAG_NEEDED,  /* fragmentation required to fwd */
> > +	BPF_MTU_CHK_RET_SEGS_TOOBIG,  /* GSO re-segmentation needed to fwd */
> > +};
> > +
> >   enum bpf_task_fd_type {
> >   	BPF_FD_TYPE_RAW_TRACEPOINT,	/* tp name */
> >   	BPF_FD_TYPE_TRACEPOINT,		/* tp name */
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index db59ab55572c..3f2e593244ca 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -5604,6 +5604,124 @@ static const struct bpf_func_proto bpf_skb_fib_lookup_proto = {
> >   	.arg4_type	= ARG_ANYTHING,
> >   };
> >   
> > +static struct net_device *__dev_via_ifindex(struct net_device *dev_curr,
> > +					    u32 ifindex)
> > +{
> > +	struct net *netns = dev_net(dev_curr);
> > +
> > +	/* Non-redirect use-cases can use ifindex=0 and save ifindex lookup */
> > +	if (ifindex == 0)
> > +		return dev_curr;
> > +
> > +	return dev_get_by_index_rcu(netns, ifindex);
> > +}
> > +
> > +BPF_CALL_5(bpf_skb_check_mtu, struct sk_buff *, skb,
> > +	   u32, ifindex, u32 *, mtu_len, s32, len_diff, u64, flags)
> > +{
> > +	int ret = BPF_MTU_CHK_RET_FRAG_NEEDED;
> > +	struct net_device *dev = skb->dev;
> > +	int skb_len, dev_len;
> > +	int mtu;
> > +
> > +	if (unlikely(flags & ~(BPF_MTU_CHK_SEGS)))
> > +		return -EINVAL;
> > +
> > +	dev = __dev_via_ifindex(dev, ifindex);
> > +	if (unlikely(!dev))
> > +		return -ENODEV;
> > +
> > +	mtu = READ_ONCE(dev->mtu);
> > +
> > +	dev_len = mtu + dev->hard_header_len;
> > +	skb_len = skb->len + len_diff; /* minus result pass check */
> > +	if (skb_len <= dev_len) {
> > +		ret = BPF_MTU_CHK_RET_SUCCESS;
> > +		goto out;
> > +	}
> > +	/* At this point, skb->len exceed MTU, but as it include length of all
> > +	 * segments, it can still be below MTU.  The SKB can possibly get
> > +	 * re-segmented in transmit path (see validate_xmit_skb).  Thus, user
> > +	 * must choose if segs are to be MTU checked.  Last SKB "headlen" is
> > +	 * checked against MTU.
> > +	 */
> > +	if (skb_is_gso(skb)) {
> > +		ret = BPF_MTU_CHK_RET_SUCCESS;
> > +
> > +		if (!(flags & BPF_MTU_CHK_SEGS))
> > +			goto out;
> > +
> > +		if (!skb_gso_validate_network_len(skb, mtu)) {
> > +			ret = BPF_MTU_CHK_RET_SEGS_TOOBIG;
> > +			goto out;
> > +		}
> > +
> > +		skb_len = skb_headlen(skb) + len_diff;
> > +		if (skb_len > dev_len) {

Maybe I'm misunderstanding you below?  Do you just want the above two
lines moved from the patch? (sure I can do that... as it is just an
extra check of the "head"/first segment of the packet, and only done if
BPF_MTU_CHK_SEGS is set)

> 
> This is still not universally correct given drivers could cook up non-linear
> skbs (e.g. page frags) on rx. So the result from BPF_MTU_CHK_SEGS flag cannot
> be relied on. 

That is why it is a flag, that need to be explicitly set.

> Do you have a particular use case for the BPF_MTU_CHK_SEGS?

The complaint from Maze (and others) were that when skb_is_gso then all
the MTU checks are bypassed.  This flag enables checking the GSO part
via skb_gso_validate_network_len().  We cannot enable it per default,
as you say, it is universally correct in all cases.

> I also don't see the flag being used anywhere in your selftests, so I presume
> not as otherwise you would have added an example there?

I'm using the flag in the bpf-examples code[1], this is how I've tested
the code path.

I've not found a way to generate GSO packet via the selftests
infrastructure via bpf_prog_test_run_xattr().  I'm 

[1] https://github.com/xdp-project/bpf-examples/blob/master/MTU-tests/tc_mtu_enforce.c


> I would just drop the flag altogether for the tc helper..

As explain I cannot drop the flag altogether, I would also have to
remove the code then.  Sorry, but I don't 100% understand the change
you are requesting.


> > +			ret = BPF_MTU_CHK_RET_FRAG_NEEDED;
> > +			goto out;
> > +		}
> > +	}
> > +out:
> > +	/* BPF verifier guarantees valid pointer */
> > +	*mtu_len = mtu;
> > +
> > +	return ret;
> > +}  
> 



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

next prev parent reply	other threads:[~2021-01-14 14:37 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-01-12 17:45 [PATCH bpf-next V11 0/7] bpf: New approach for BPF MTU handling Jesper Dangaard Brouer
2021-01-12 17:45 ` [PATCH bpf-next V11 1/7] bpf: Remove MTU check in __bpf_skb_max_len Jesper Dangaard Brouer
2021-01-14  8:18   ` John Fastabend
2021-01-12 17:45 ` [PATCH bpf-next V11 2/7] bpf: fix bpf_fib_lookup helper MTU check for SKB ctx Jesper Dangaard Brouer
2021-01-12 17:45 ` [PATCH bpf-next V11 3/7] bpf: bpf_fib_lookup return MTU value as output when looked up Jesper Dangaard Brouer
2021-01-12 17:45 ` [PATCH bpf-next V11 4/7] bpf: add BPF-helper for MTU checking Jesper Dangaard Brouer
2021-01-12 19:23   ` Andrii Nakryiko
2021-01-14 14:52     ` Jesper Dangaard Brouer
2021-01-14 15:33       ` Yonghong Song
2021-01-13 23:07   ` Daniel Borkmann
2021-01-14 14:36     ` Jesper Dangaard Brouer [this message]
2021-01-14 22:28       ` Daniel Borkmann
2021-01-18 11:04         ` Jesper Dangaard Brouer
2021-01-12 17:45 ` [PATCH bpf-next V11 5/7] bpf: drop MTU check when doing TC-BPF redirect to ingress Jesper Dangaard Brouer
2021-01-14  9:03   ` John Fastabend
2021-01-14 16:14     ` Jesper Dangaard Brouer
2021-01-12 17:45 ` [PATCH bpf-next V11 6/7] selftests/bpf: use bpf_check_mtu in selftest test_cls_redirect Jesper Dangaard Brouer
2021-01-12 17:45 ` [PATCH bpf-next V11 7/7] bpf/selftests: tests using bpf_check_mtu BPF-helper Jesper Dangaard Brouer
2021-01-12 19:29   ` Andrii Nakryiko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210114153607.6eea9b37@carbon \
    --to=brouer@redhat.com \
    --cc=alexei.starovoitov@gmail.com \
    --cc=borkmann@iogearbox.net \
    --cc=bpf@vger.kernel.org \
    --cc=colrack@gmail.com \
    --cc=daniel@iogearbox.net \
    --cc=eyal.birger@gmail.com \
    --cc=john.fastabend@gmail.com \
    --cc=kuba@kernel.org \
    --cc=lmb@cloudflare.com \
    --cc=lorenzo@kernel.org \
    --cc=marek@cloudflare.com \
    --cc=maze@google.com \
    --cc=netdev@vger.kernel.org \
    --cc=shaun@tigera.io \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.