Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH 1/2] netlink: add NLA_REJECT policy type
From: Marcelo Ricardo Leitner @ 2018-09-13 19:30 UTC (permalink / raw)
  To: Johannes Berg; +Cc: linux-wireless, netdev, Michal Kubecek, Johannes Berg
In-Reply-To: <20180913084603.7979-1-johannes@sipsolutions.net>

On Thu, Sep 13, 2018 at 10:46:02AM +0200, Johannes Berg wrote:
> From: Johannes Berg <johannes.berg@intel.com>
> 
> In some situations some netlink attributes may be used for output
> only (kernel->userspace) or may be reserved for future use. It's
> then helpful to be able to prevent userspace from using them in
> messages sent to the kernel, since they'd otherwise be ignored and
> any future will become impossible if this happens.

I don't follow, what's the issue with simply ignoring such attributes?

> 
> Add NLA_REJECT to the policy which does nothing but reject (with
> EINVAL) validation of any messages containing this attribute.
> Allow for returning a specific extended ACK error message in the
> validation_data pointer.

This has potential to create confusion because we can't use it on
{output,reserved} attributes that are already there (as we must always
ignore them now) and we will end up with a mix of it.

> 
> While at it fix the indentation of NLA_BITFIELD32 and describe the
> validation_data pointer for it.
> 
> The specific case I have in mind now is a shared nested attribute
> containing request/response data, and it would be pointless and
> potentially confusing to have userspace include response data in
> the messages that actually contain a request.

Would be nice to see some patches for it too. Maybe it helps.

> 
> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
> ---
>  include/net/netlink.h |  6 +++++-
>  lib/nlattr.c          | 22 +++++++++++++++-------
>  2 files changed, 20 insertions(+), 8 deletions(-)
> 
> diff --git a/include/net/netlink.h b/include/net/netlink.h
> index 0c154f98e987..04e40fcc70d6 100644
> --- a/include/net/netlink.h
> +++ b/include/net/netlink.h
> @@ -180,6 +180,7 @@ enum {
>  	NLA_S32,
>  	NLA_S64,
>  	NLA_BITFIELD32,
> +	NLA_REJECT,
>  	__NLA_TYPE_MAX,
>  };
>  
> @@ -208,7 +209,10 @@ enum {
>   *    NLA_MSECS            Leaving the length field zero will verify the
>   *                         given type fits, using it verifies minimum length
>   *                         just like "All other"
> - *    NLA_BITFIELD32      A 32-bit bitmap/bitselector attribute
> + *    NLA_BITFIELD32       A 32-bit bitmap/bitselector attribute, validation
> + *                         data must point to a u32 value of valid flags
> + *    NLA_REJECT           Reject this attribute, validation data may point
> + *                         to a string to report as the error in extended ACK.
>   *    All other            Minimum length of attribute payload
>   *
>   * Example:
> diff --git a/lib/nlattr.c b/lib/nlattr.c
> index e335bcafa9e4..56e0aae5cf23 100644
> --- a/lib/nlattr.c
> +++ b/lib/nlattr.c
> @@ -69,7 +69,8 @@ static int validate_nla_bitfield32(const struct nlattr *nla,
>  }
>  
>  static int validate_nla(const struct nlattr *nla, int maxtype,
> -			const struct nla_policy *policy)
> +			const struct nla_policy *policy,
> +			struct netlink_ext_ack *extack)
>  {
>  	const struct nla_policy *pt;
>  	int minlen = 0, attrlen = nla_len(nla), type = nla_type(nla);
> @@ -87,6 +88,11 @@ static int validate_nla(const struct nlattr *nla, int maxtype,
>  	}
>  
>  	switch (pt->type) {
> +	case NLA_REJECT:
> +		if (pt->validation_data && extack)
> +			extack->_msg = pt->validation_data;
> +		return -EINVAL;
> +
>  	case NLA_FLAG:
>  		if (attrlen > 0)
>  			return -ERANGE;
> @@ -180,11 +186,10 @@ int nla_validate(const struct nlattr *head, int len, int maxtype,
>  	int rem;
>  
>  	nla_for_each_attr(nla, head, len, rem) {
> -		int err = validate_nla(nla, maxtype, policy);
> +		int err = validate_nla(nla, maxtype, policy, extack);
>  
>  		if (err < 0) {
> -			if (extack)
> -				extack->bad_attr = nla;
> +			NL_SET_BAD_ATTR(extack, nla);
>  			return err;
>  		}
>  	}
> @@ -251,10 +256,13 @@ int nla_parse(struct nlattr **tb, int maxtype, const struct nlattr *head,
>  
>  		if (type > 0 && type <= maxtype) {
>  			if (policy) {
> -				err = validate_nla(nla, maxtype, policy);
> +				err = validate_nla(nla, maxtype, policy,
> +						   extack);
>  				if (err < 0) {
> -					NL_SET_ERR_MSG_ATTR(extack, nla,
> -							    "Attribute failed policy validation");
> +					NL_SET_BAD_ATTR(extack, nla);
> +					if (extack && !extack->_msg)
> +						NL_SET_ERR_MSG(extack,
> +							       "Attribute failed policy validation");
>  					goto errout;
>  				}
>  			}
> -- 
> 2.14.4
> 

^ permalink raw reply

* [PATCH iproute2] libnetlink: fix leak and using unused memory on error
From: Stephen Hemminger @ 2018-09-13 19:33 UTC (permalink / raw)
  To: Mahesh Bandewar; +Cc: netdev, Stephen Hemminger

If an error happens in multi-segment message (tc only)
then report the error and stop processing further responses.
This also fixes refering to the buffer after free.

The sequence check is not necessary here because the
response message has already been validated to be in
the window of the sequence number of the iov.

Reported-by: Mahesh Bandewar <mahesh@bandewar.net>
Fixes: 7b2ee50c0cd5 ("hv_netvsc: common detach logic")
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
 lib/libnetlink.c | 23 +++++++++--------------
 1 file changed, 9 insertions(+), 14 deletions(-)

diff --git a/lib/libnetlink.c b/lib/libnetlink.c
index 928de1dd16d8..586809292594 100644
--- a/lib/libnetlink.c
+++ b/lib/libnetlink.c
@@ -617,7 +617,6 @@ static int __rtnl_talk_iov(struct rtnl_handle *rtnl, struct iovec *iov,
 	msg.msg_iovlen = 1;
 	i = 0;
 	while (1) {
-next:
 		status = rtnl_recvmsg(rtnl->fd, &msg, &buf);
 		++i;
 
@@ -660,27 +659,23 @@ next:
 
 				if (l < sizeof(struct nlmsgerr)) {
 					fprintf(stderr, "ERROR truncated\n");
-				} else if (!err->error) {
+					free(buf);
+					return -1;
+				}
+
+				if (!err->error)
 					/* check messages from kernel */
 					nl_dump_ext_ack(h, errfn);
 
-					if (answer)
-						*answer = (struct nlmsghdr *)buf;
-					else
-						free(buf);
-					if (h->nlmsg_seq == seq)
-						return 0;
-					else if (i < iovlen)
-						goto next;
-					return 0;
-				}
-
 				if (rtnl->proto != NETLINK_SOCK_DIAG &&
 				    show_rtnl_err)
 					rtnl_talk_error(h, err, errfn);
 
 				errno = -err->error;
-				free(buf);
+				if (answer)
+					*answer = (struct nlmsghdr *)buf;
+				else
+					free(buf);
 				return -i;
 			}
 
-- 
2.18.0

^ permalink raw reply related

* Re: [PATCH 2/2] netlink: add ethernet address policy types
From: Marcelo Ricardo Leitner @ 2018-09-13 19:41 UTC (permalink / raw)
  To: Johannes Berg; +Cc: Michal Kubecek, linux-wireless, netdev
In-Reply-To: <1536840966.4160.6.camel@sipsolutions.net>

On Thu, Sep 13, 2018 at 02:16:06PM +0200, Johannes Berg wrote:
> On Thu, 2018-09-13 at 14:12 +0200, Michal Kubecek wrote:
> > On Thu, Sep 13, 2018 at 02:02:53PM +0200, Johannes Berg wrote:
> > > On Thu, 2018-09-13 at 13:58 +0200, Michal Kubecek wrote:
> > > 
> > > > The code looks correct to me but I have some doubts. Having a special
> > > > policy for MAC addresses may lead to adding one for IPv4 address (maybe
> > > > not, we can use NLA_U32 for them), IPv6 addresses and other data types
> > > > with fixed length. Wouldn't it be more helpful to add a variant of
> > > > NLA_BINARY (NLA_BINARY_EXACT?) which would fail/warn if attribute length
> > > > isn't equal to .len?
> > > 
> > > Yeah, I guess we could do that, and then
> > > 
> > > #define NLA_ETH_ADDR .len = ETH_ALEN, .type = NLA_BINARY_EXACT
> > > #define NLA_IP6_ADDR .len = 16, .type = NLA_BINARY_EXACT
> > > 
> > > or so?
> > 
> > Maybe rather
> > 
> >   #define NLA_ETH_ADDR NLA_BINARY_EXACT, .len = ETH_ALEN
> >   #define NLA_IP6_ADDR NLA_BINARY_EXACT, .len = sizeof(struct in6_addr)
> > 
> > so that one could write
> > 
> >   { .type = NLA_ETH_ADDR }
> 
> Yeah, that's possible. I considered it for a second, but it was slightly
> too magical for my taste :-)
> 
> Better pick a different "namespace", perhaps NLA_POLICY_ETH_ADDR or so?

What about
#define NLA_FIELD_ETH_ADDR  { .type = NLA_BINARY_EXACT, .len = ETH_ALEN }

Or even
#define NLA_FIELD_BINARY_EXACT(_len)	{ .type = NLA_BINARY_EXACT, .len = (_len) }
#define NLA_FIELD_ETH_ADDR		NLA_FIELD_BINARY_EXACT(ETH_ALEN)

So that one would just:
   [MYADDR] = NLA_FIELD_ETH_ADDR,

and if we change how we parse/validate it, users should be good
already.

  Marcelo

^ permalink raw reply

* Re: [bpf-next, v3 1/5] flow_dissector: implements flow dissector BPF hook
From: Alexei Starovoitov @ 2018-09-13 19:43 UTC (permalink / raw)
  To: Petar Penkov
  Cc: netdev, davem, ast, daniel, simon.horman, ecree, songliubraving,
	tom, Petar Penkov, Willem de Bruijn
In-Reply-To: <20180913174557.188828-2-peterpenkov96@gmail.com>

On Thu, Sep 13, 2018 at 10:45:53AM -0700, Petar Penkov wrote:
> From: Petar Penkov <ppenkov@google.com>
> 
> Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
> attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
> path. The BPF program is per-network namespace.
> 
> Signed-off-by: Petar Penkov <ppenkov@google.com>
> Signed-off-by: Willem de Bruijn <willemb@google.com>
...
> @@ -2333,6 +2335,7 @@ struct __sk_buff {
>  	/* ... here. */
>  
>  	__u32 data_meta;
> +	struct bpf_flow_keys *flow_keys;

the bpf prog form patch 4 looks much better now. Thanks!

>  };
>  
>  struct bpf_tunnel_key {
> @@ -2778,4 +2781,27 @@ enum bpf_task_fd_type {
>  	BPF_FD_TYPE_URETPROBE,		/* filename + offset */
>  };
>  
> +struct bpf_flow_keys {
> +	__u16	nhoff;
> +	__u16	thoff;
> +	__u16	addr_proto;			/* ETH_P_* of valid addrs */
> +	__u8	is_frag;
> +	__u8	is_first_frag;
> +	__u8	is_encap;
> +	__be16	n_proto;
> +	__u8	ip_proto;
> +	union {
> +		struct {
> +			__be32	ipv4_src;
> +			__be32	ipv4_dst;
> +		};
> +		struct {
> +			__u32	ipv6_src[4];	/* in6_addr; network order */
> +			__u32	ipv6_dst[4];	/* in6_addr; network order */
> +		};
> +	};
> +	__be16	sport;
> +	__be16	dport;
> +};

can you please pack it?
struct bpf_flow_keys {
	__u16                      nhoff;                /*     0     2 */
	__u16                      thoff;                /*     2     2 */
	__u16                      addr_proto;           /*     4     2 */
	__u8                       is_frag;              /*     6     1 */
	__u8                       is_first_frag;        /*     7     1 */
	__u8                       is_encap;             /*     8     1 */

	/* XXX 1 byte hole, try to pack */

	__be16                     n_proto;              /*    10     2 */
	__u8                       ip_proto;             /*    12     1 */

	/* XXX 3 bytes hole, try to pack */

	union {

also is_frag and other fields are not used by the kernel and
only used by the prog to pass data between tail_calls ?
In such case reserve some space in bpf_flow_keys similar to skb->cb
so it can contain any fields and accommodate for inevitable changes
to bpf flow dissector prog in the future.

^ permalink raw reply

* [PATCH net] net/ipv6: do not copy DST_NOCOUNT flag on rt init
From: Peter Oskolkov @ 2018-09-13 20:38 UTC (permalink / raw)
  To: David Miller, netdev; +Cc: Peter Oskolkov, David Ahern

DST_NOCOUNT in dst_entry::flags tracks whether the entry counts
toward route cache size (net->ipv6.sysctl.ip6_rt_max_size).

If the flag is NOT set, dst_ops::pcpuc_entries counter is incremented
in dist_init() and decremented in dst_destroy().

This flag is tied to allocation/deallocation of dst_entry and
should not be copied from another dst/route. Otherwise it can happen
that dst_ops::pcpuc_entries counter grows until no new routes can
be allocated because the counter reached ip6_rt_max_size due to
DST_NOCOUNT not set and thus no counter decrements on gc-ed routes.

Fixes: 3b6761d18bc1 ("net/ipv6: Move dst flags to booleans in fib entries")
Cc: David Ahern <dsahern@gmail.com>
Acked-by: Wei Wang <weiwan@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
---
 net/ipv6/route.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 3eed045c65a5..a3902f805305 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -946,7 +946,7 @@ static void ip6_rt_init_dst_reject(struct rt6_info *rt, struct fib6_info *ort)
 
 static void ip6_rt_init_dst(struct rt6_info *rt, struct fib6_info *ort)
 {
-	rt->dst.flags |= fib6_info_dst_flags(ort);
+	rt->dst.flags |= fib6_info_dst_flags(ort) & ~DST_NOCOUNT;
 
 	if (ort->fib6_flags & RTF_REJECT) {
 		ip6_rt_init_dst_reject(rt, ort);
-- 
2.19.0.397.gdd90340f6a-goog

^ permalink raw reply related

* Re: [PATCH 2/2] netlink: add ethernet address policy types
From: Michal Kubecek @ 2018-09-13 20:39 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner; +Cc: Johannes Berg, linux-wireless, netdev
In-Reply-To: <20180913194116.GG4590@localhost.localdomain>

On Thu, Sep 13, 2018 at 04:41:16PM -0300, Marcelo Ricardo Leitner wrote:
> On Thu, Sep 13, 2018 at 02:16:06PM +0200, Johannes Berg wrote:
> > On Thu, 2018-09-13 at 14:12 +0200, Michal Kubecek wrote:
> > > On Thu, Sep 13, 2018 at 02:02:53PM +0200, Johannes Berg wrote:
> > > > On Thu, 2018-09-13 at 13:58 +0200, Michal Kubecek wrote:
> > > > 
> > > > > The code looks correct to me but I have some doubts. Having a special
> > > > > policy for MAC addresses may lead to adding one for IPv4 address (maybe
> > > > > not, we can use NLA_U32 for them), IPv6 addresses and other data types
> > > > > with fixed length. Wouldn't it be more helpful to add a variant of
> > > > > NLA_BINARY (NLA_BINARY_EXACT?) which would fail/warn if attribute length
> > > > > isn't equal to .len?
> > > > 
> > > > Yeah, I guess we could do that, and then
> > > > 
> > > > #define NLA_ETH_ADDR .len = ETH_ALEN, .type = NLA_BINARY_EXACT
> > > > #define NLA_IP6_ADDR .len = 16, .type = NLA_BINARY_EXACT
> > > > 
> > > > or so?
> > > 
> > > Maybe rather
> > > 
> > >   #define NLA_ETH_ADDR NLA_BINARY_EXACT, .len = ETH_ALEN
> > >   #define NLA_IP6_ADDR NLA_BINARY_EXACT, .len = sizeof(struct in6_addr)
> > > 
> > > so that one could write
> > > 
> > >   { .type = NLA_ETH_ADDR }
> > 
> > Yeah, that's possible. I considered it for a second, but it was slightly
> > too magical for my taste :-)
> > 
> > Better pick a different "namespace", perhaps NLA_POLICY_ETH_ADDR or so?
> 
> What about
> #define NLA_FIELD_ETH_ADDR  { .type = NLA_BINARY_EXACT, .len = ETH_ALEN }
> 
> Or even
> #define NLA_FIELD_BINARY_EXACT(_len)	{ .type = NLA_BINARY_EXACT, .len = (_len) }
> #define NLA_FIELD_ETH_ADDR		NLA_FIELD_BINARY_EXACT(ETH_ALEN)
> 
> So that one would just:
>    [MYADDR] = NLA_FIELD_ETH_ADDR,

Yes, that's how I understoold Johannes' proposal above.

Michal Kubecek

^ permalink raw reply

* Re: [PATCH 1/2] netlink: add NLA_REJECT policy type
From: Michal Kubecek @ 2018-09-13 20:43 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner; +Cc: Johannes Berg, linux-wireless, netdev
In-Reply-To: <20180913192014.GE4590@localhost.localdomain>

On Thu, Sep 13, 2018 at 04:20:14PM -0300, Marcelo Ricardo Leitner wrote:
> On Thu, Sep 13, 2018 at 02:05:54PM +0200, Michal Kubecek wrote:
> > On Thu, Sep 13, 2018 at 01:25:15PM +0200, Johannes Berg wrote:
> > > On Thu, 2018-09-13 at 12:49 +0200, Michal Kubecek wrote:
> > > 
> > > > >  		if (type > 0 && type <= maxtype) {
> > > > >  			if (policy) {
> > > > > -				err = validate_nla(nla, maxtype, policy);
> > > > > +				err = validate_nla(nla, maxtype, policy,
> > > > > +						   extack);
> > > > >  				if (err < 0) {
> > > > > -					NL_SET_ERR_MSG_ATTR(extack, nla,
> > > > > -							    "Attribute failed policy validation");
> > > > > +					NL_SET_BAD_ATTR(extack, nla);
> > > > > +					if (extack && !extack->_msg)
> > > > > +						NL_SET_ERR_MSG(extack,
> > > > > +							       "Attribute failed policy validation");
> > > > >  					goto errout;
> > > > >  				}
> > > > >  			}
> > > > > -- 
> > > > 
> > > > Technically, this would change the outcome when nla_parse() is called
> > > > with extack->_msg already set nad validate_nla() fails on something else
> > > > than NLA_REJECT; it will preserve the previous message in such case.
> > > > But I don't think this is a serious problem.
> > > 
> > > Yes, that's true. I looked at quite a few of the setters just now (there
> > > are ~500 already, wow!), and they all set & return, so this shouldn't be
> > > an issue.
> > 
> > In ethtool (work in progress) I sometimes use extack message for
> > non-fatal warnings but AFAICS never before parsing the userspace
> > request (actually always shortly before returning). So I don't have
> > a problem with it.
> 
> Considering we can only report 1 message, it should be okay to drop
> the previous message in favor of the new one, which is either a
> critical one or just another non-fatal one.

What I wanted to point out is that the code above does not behave like
this. It does not distinguish between extack->_msg set by NLA_REJECT
branch and extack->_msg which had been set before nla_parse() was
called.

Michal Kubecek

^ permalink raw reply

* Re: [bpf-next, v3 1/5] flow_dissector: implements flow dissector BPF hook
From: Willem de Bruijn @ 2018-09-13 20:51 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Petar Penkov, Network Development, David Miller,
	Alexei Starovoitov, Daniel Borkmann, simon.horman, ecree,
	songliubraving, Tom Herbert, Petar Penkov, Willem de Bruijn
In-Reply-To: <20180913194335.fszwj4jj4yputxlv@ast-mbp>

On Thu, Sep 13, 2018 at 3:45 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Sep 13, 2018 at 10:45:53AM -0700, Petar Penkov wrote:
> > From: Petar Penkov <ppenkov@google.com>
> >
> > Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
> > attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
> > path. The BPF program is per-network namespace.
> >
> > Signed-off-by: Petar Penkov <ppenkov@google.com>
> > Signed-off-by: Willem de Bruijn <willemb@google.com>
> ...
> > @@ -2333,6 +2335,7 @@ struct __sk_buff {
> >       /* ... here. */
> >
> >       __u32 data_meta;
> > +     struct bpf_flow_keys *flow_keys;
>
> the bpf prog form patch 4 looks much better now. Thanks!
>
> >  };
> >
> >  struct bpf_tunnel_key {
> > @@ -2778,4 +2781,27 @@ enum bpf_task_fd_type {
> >       BPF_FD_TYPE_URETPROBE,          /* filename + offset */
> >  };
> >
> > +struct bpf_flow_keys {
> > +     __u16   nhoff;
> > +     __u16   thoff;
> > +     __u16   addr_proto;                     /* ETH_P_* of valid addrs */
> > +     __u8    is_frag;
> > +     __u8    is_first_frag;
> > +     __u8    is_encap;
> > +     __be16  n_proto;
> > +     __u8    ip_proto;
> > +     union {
> > +             struct {
> > +                     __be32  ipv4_src;
> > +                     __be32  ipv4_dst;
> > +             };
> > +             struct {
> > +                     __u32   ipv6_src[4];    /* in6_addr; network order */
> > +                     __u32   ipv6_dst[4];    /* in6_addr; network order */
> > +             };
> > +     };
> > +     __be16  sport;
> > +     __be16  dport;
> > +};
>
> can you please pack it?
> struct bpf_flow_keys {
>         __u16                      nhoff;                /*     0     2 */
>         __u16                      thoff;                /*     2     2 */
>         __u16                      addr_proto;           /*     4     2 */
>         __u8                       is_frag;              /*     6     1 */
>         __u8                       is_first_frag;        /*     7     1 */
>         __u8                       is_encap;             /*     8     1 */
>
>         /* XXX 1 byte hole, try to pack */
>
>         __be16                     n_proto;              /*    10     2 */
>         __u8                       ip_proto;             /*    12     1 */
>
>         /* XXX 3 bytes hole, try to pack */
>
>         union {
>
> also is_frag and other fields are not used by the kernel and
> only used by the prog to pass data between tail_calls ?

No, these are mapped directly onto fields in struct flow_keys
on return from the BPF program in __skb_flow_bpf_to_target.
For is_frag, for instance:

       if (flow_keys->is_frag)
               key_control->flags |= FLOW_DIS_IS_FRAGMENT;

This is true for all fields in the struct except nhoff.

> In such case reserve some space in bpf_flow_keys similar to skb->cb
> so it can contain any fields and accommodate for inevitable changes
> to bpf flow dissector prog in the future.

Do you mean a second scratch space akin to cb[], just a few
reserved padding bytes?

We have given some thought to forward compatibility. The existing
fields cannot be changed, but it should be fine if we need to expand
the struct later.

^ permalink raw reply

* Re: [PATCH bpf-next 07/11] bpf: Add helper to retrieve socket in BPF
From: Joe Stringer @ 2018-09-13 20:55 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Joe Stringer, daniel, netdev, ast, john fastabend, tgraf,
	Martin KaFai Lau, Nitin Hande, mauricio.vasquez
In-Reply-To: <CAADnVQ+Ge1HYXPmkEqUWMP0X1F3a9RFg35arZQoktP2cVN4Fkg@mail.gmail.com>

On Thu, 13 Sep 2018 at 12:06, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Sep 12, 2018 at 5:06 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Tue, Sep 11, 2018 at 05:36:36PM -0700, Joe Stringer wrote:
> >> This patch adds new BPF helper functions, bpf_sk_lookup_tcp() and
> >> bpf_sk_lookup_udp() which allows BPF programs to find out if there is a
> >> socket listening on this host, and returns a socket pointer which the
> >> BPF program can then access to determine, for instance, whether to
> >> forward or drop traffic. bpf_sk_lookup_xxx() may take a reference on the
> >> socket, so when a BPF program makes use of this function, it must
> >> subsequently pass the returned pointer into the newly added sk_release()
> >> to return the reference.
> >>
> >> By way of example, the following pseudocode would filter inbound
> >> connections at XDP if there is no corresponding service listening for
> >> the traffic:
> >>
> >>   struct bpf_sock_tuple tuple;
> >>   struct bpf_sock_ops *sk;
> >>
> >>   populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet
> >>   sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof tuple, netns, 0);
> > ...
> >> +struct bpf_sock_tuple {
> >> +     union {
> >> +             __be32 ipv6[4];
> >> +             __be32 ipv4;
> >> +     } saddr;
> >> +     union {
> >> +             __be32 ipv6[4];
> >> +             __be32 ipv4;
> >> +     } daddr;
> >> +     __be16 sport;
> >> +     __be16 dport;
> >> +     __u8 family;
> >> +};
> >
> > since we can pass ptr_to_packet into map lookup and other helpers now,
> > can you move 'family' out of bpf_sock_tuple and combine with netns_id arg?
> > then progs wouldn't need to copy bytes from the packet into tuple
> > to do a lookup.

If I follow, you're proposing that users should be able to pass a
pointer to the source address field of the L3 header, and assuming
that the L3 header ends with saddr+daddr (no options/extheaders), and
is immediately followed by the sport/dport then a packet pointer
should work for performing socket lookup. Then it is up to the BPF
program writer to ensure that this is the case, or otherwise fall back
to populating a copy of the sock tuple on the stack.

> have been thinking more about it.
> since only ipv4 and ipv6 supported may be use size of bpf_sock_tuple
> to infer family inside the helper, so it doesn't need to be passed explicitly?

Let me make sure I understand the proposal here.

The current structure and function prototypes are:

struct bpf_sock_tuple {
      union {
              __be32 ipv6[4];
              __be32 ipv4;
      } saddr;
      union {
              __be32 ipv6[4];
              __be32 ipv4;
      } daddr;
      __be16 sport;
      __be16 dport;
      __u8 family;
};

static struct bpf_sock *(*bpf_sk_lookup_tcp)(void *ctx,
                                           struct bpf_sock_tuple *tuple,
                                           int size, unsigned int netns_id,
                                           unsigned long long flags);
static struct bpf_sock *(*bpf_sk_lookup_udp)(void *ctx,
                                           struct bpf_sock_tuple *tuple,
                                           int size, unsigned int netns_id,
                                           unsigned long long flags);
static int (*bpf_sk_release)(struct bpf_sock *sk, unsigned long long flags);

You're proposing something like:

struct bpf_sock_tuple4 {
      __be32 saddr;
      __be32 daddr;
      __be16 sport;
      __be16 dport;
      __u8 family;
};

struct bpf_sock_tuple6 {
      __be32 saddr[4];
      __be32 daddr[4];
      __be16 sport;
      __be16 dport;
      __u8 family;
};

static struct bpf_sock *(*bpf_sk_lookup_tcp)(void *ctx,
                                           void *tuple,
                                           int size, unsigned int
netns_id,
                                           unsigned long long flags);
static struct bpf_sock *(*bpf_sk_lookup_udp)(void *ctx,
                                           void *tuple,
                                           int size, unsigned int netns_id,
                                           unsigned long long flags);
static int (*bpf_sk_release)(struct bpf_sock *sk, unsigned long long flags);

Then the implementation will check the size against either
"sizeof(struct bpf_sock_tuple4)" or "sizeof(struct bpf_sock_tuple6)"
and interpret as the v4 or v6 handler from this.

Sure, I can try this out.

^ permalink raw reply

* Re: [PATCH bpf-next 10/11] selftests/bpf: Add C tests for reference tracking
From: Joe Stringer @ 2018-09-13 20:56 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Joe Stringer, daniel, netdev, ast, john fastabend, tgraf,
	Martin KaFai Lau, Nitin Hande, mauricio.vasquez
In-Reply-To: <20180913001115.mhddud4avgojdwra@ast-mbp>

On Wed, 12 Sep 2018 at 17:11, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Sep 11, 2018 at 05:36:39PM -0700, Joe Stringer wrote:
> > Signed-off-by: Joe Stringer <joe@wand.net.nz>
>
> really nice set of tests.
> please describe them briefly in commit log.
>
> Acked-by: Alexei Starovoitov <ast@kernel.org>

Ack, will do.

^ permalink raw reply

* Re: [PATCH bpf-next 11/11] Documentation: Describe bpf reference tracking
From: Joe Stringer @ 2018-09-13 20:56 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Joe Stringer, daniel, netdev, ast, john fastabend, tgraf,
	Martin KaFai Lau, Nitin Hande, mauricio.vasquez
In-Reply-To: <20180913001250.wqhqvox3kccslq32@ast-mbp>

On Wed, 12 Sep 2018 at 17:13, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Sep 11, 2018 at 05:36:40PM -0700, Joe Stringer wrote:
> > Signed-off-by: Joe Stringer <joe@wand.net.nz>
>
> just few words in commit log would be better than nothing.
>
> Acked-by: Alexei Starovoitov <ast@kernel.org>

Ack, thanks for the review!

^ permalink raw reply

* Re: [bpf-next, v3 1/5] flow_dissector: implements flow dissector BPF hook
From: Alexei Starovoitov @ 2018-09-13 20:57 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Petar Penkov, Network Development, David Miller,
	Alexei Starovoitov, Daniel Borkmann, simon.horman, ecree,
	songliubraving, Tom Herbert, Petar Penkov, Willem de Bruijn
In-Reply-To: <CAF=yD-LBqT9hZQXuhqs8st_-EJvkXm6cGvt9pu8Ox9rEMHOFQg@mail.gmail.com>

On Thu, Sep 13, 2018 at 04:51:49PM -0400, Willem de Bruijn wrote:
> On Thu, Sep 13, 2018 at 3:45 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, Sep 13, 2018 at 10:45:53AM -0700, Petar Penkov wrote:
> > > From: Petar Penkov <ppenkov@google.com>
> > >
> > > Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
> > > attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
> > > path. The BPF program is per-network namespace.
> > >
> > > Signed-off-by: Petar Penkov <ppenkov@google.com>
> > > Signed-off-by: Willem de Bruijn <willemb@google.com>
> > ...
> > > @@ -2333,6 +2335,7 @@ struct __sk_buff {
> > >       /* ... here. */
> > >
> > >       __u32 data_meta;
> > > +     struct bpf_flow_keys *flow_keys;
> >
> > the bpf prog form patch 4 looks much better now. Thanks!
> >
> > >  };
> > >
> > >  struct bpf_tunnel_key {
> > > @@ -2778,4 +2781,27 @@ enum bpf_task_fd_type {
> > >       BPF_FD_TYPE_URETPROBE,          /* filename + offset */
> > >  };
> > >
> > > +struct bpf_flow_keys {
> > > +     __u16   nhoff;
> > > +     __u16   thoff;
> > > +     __u16   addr_proto;                     /* ETH_P_* of valid addrs */
> > > +     __u8    is_frag;
> > > +     __u8    is_first_frag;
> > > +     __u8    is_encap;
> > > +     __be16  n_proto;
> > > +     __u8    ip_proto;
> > > +     union {
> > > +             struct {
> > > +                     __be32  ipv4_src;
> > > +                     __be32  ipv4_dst;
> > > +             };
> > > +             struct {
> > > +                     __u32   ipv6_src[4];    /* in6_addr; network order */
> > > +                     __u32   ipv6_dst[4];    /* in6_addr; network order */
> > > +             };
> > > +     };
> > > +     __be16  sport;
> > > +     __be16  dport;
> > > +};
> >
> > can you please pack it?
> > struct bpf_flow_keys {
> >         __u16                      nhoff;                /*     0     2 */
> >         __u16                      thoff;                /*     2     2 */
> >         __u16                      addr_proto;           /*     4     2 */
> >         __u8                       is_frag;              /*     6     1 */
> >         __u8                       is_first_frag;        /*     7     1 */
> >         __u8                       is_encap;             /*     8     1 */
> >
> >         /* XXX 1 byte hole, try to pack */
> >
> >         __be16                     n_proto;              /*    10     2 */
> >         __u8                       ip_proto;             /*    12     1 */
> >
> >         /* XXX 3 bytes hole, try to pack */
> >
> >         union {
> >
> > also is_frag and other fields are not used by the kernel and
> > only used by the prog to pass data between tail_calls ?
> 
> No, these are mapped directly onto fields in struct flow_keys
> on return from the BPF program in __skb_flow_bpf_to_target.
> For is_frag, for instance:
> 
>        if (flow_keys->is_frag)
>                key_control->flags |= FLOW_DIS_IS_FRAGMENT;

right. my search-fu failed me. only packing is needed then.

> This is true for all fields in the struct except nhoff.

> > In such case reserve some space in bpf_flow_keys similar to skb->cb
> > so it can contain any fields and accommodate for inevitable changes
> > to bpf flow dissector prog in the future.
> 
> Do you mean a second scratch space akin to cb[], just a few
> reserved padding bytes?

it looks to me it's possible to rearrange the fields to avoid all holes,
so no extra padding bytes necessary.

> We have given some thought to forward compatibility. The existing
> fields cannot be changed, but it should be fine if we need to expand
> the struct later.

let's keep cb-like idea for later. It seems to me we can add it to
the end of bpf_flow_keys any time later.

^ permalink raw reply

* Re: [PATCH bpf-next 07/11] bpf: Add helper to retrieve socket in BPF
From: Joe Stringer @ 2018-09-13 20:57 UTC (permalink / raw)
  To: Joe Stringer
  Cc: Alexei Starovoitov, daniel, netdev, ast, john fastabend, tgraf,
	Martin KaFai Lau, Nitin Hande, mauricio.vasquez
In-Reply-To: <CAOftzPiG6JMb2=U3ZU9D2+0U=1zLqZPgax8OFRHF_1UTcs5Shw@mail.gmail.com>

On Thu, 13 Sep 2018 at 13:55, Joe Stringer <joe@wand.net.nz> wrote:
> struct bpf_sock_tuple4 {
>       __be32 saddr;
>       __be32 daddr;
>       __be16 sport;
>       __be16 dport;
>       __u8 family;
> };
>
> struct bpf_sock_tuple6 {
>       __be32 saddr[4];
>       __be32 daddr[4];
>       __be16 sport;
>       __be16 dport;
>       __u8 family;
> };

(ignore the family bit here, I forgot to remove it..)

^ permalink raw reply

* Re: [PATCH v2,net-next 1/2] ip_gre: fix parsing gre header in ipgre_err
From: Haishuang Yan @ 2018-09-14  2:09 UTC (permalink / raw)
  To: David Miller; +Cc: kuznet, jbenc, netdev, linux-kernel
In-Reply-To: <20180913.105840.140151724801067072.davem@davemloft.net>



> On 2018年9月14日, at 上午1:58, David Miller <davem@davemloft.net> wrote:
> 
> From: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
> Date: Wed, 12 Sep 2018 17:21:21 +0800
> 
>> @@ -86,7 +86,7 @@ int gre_parse_header(struct sk_buff *skb, struct tnl_ptk_info *tpi,
>> 
>> 	options = (__be32 *)(greh + 1);
>> 	if (greh->flags & GRE_CSUM) {
>> -		if (skb_checksum_simple_validate(skb)) {
>> +		if (csum_err && skb_checksum_simple_validate(skb)) {
>> 			*csum_err = true;
>> 			return -EINVAL;
>> 		}
> 
> You want to ignore csum errors, but you do not want to elide the side
> effects of the skb_checksum_simple_validate() call which are to set
> skb->csum_valid and skb->csum.
> 
> Therefore, the skb_checksum_simple_validate() call still needs to be
> performed. We just wont return -EINVAL in the NULL csum_err case.
> 
> 

How about doing like this:

--- a/net/ipv4/gre_demux.c
+++ b/net/ipv4/gre_demux.c
@@ -86,13 +86,14 @@ int gre_parse_header(struct sk_buff *skb, struct tnl_ptk_info *tpi,

        options = (__be32 *)(greh + 1);
        if (greh->flags & GRE_CSUM) {
-               if (skb_checksum_simple_validate(skb)) {
+               if (!skb_checksum_simple_validate(skb)) {
+                       skb_checksum_try_convert(skb, IPPROTO_GRE, 0,
+                                                null_compute_pseudo);
+               } else if (csum_err) {
                        *csum_err = true;
                        return -EINVAL;
                }

-               skb_checksum_try_convert(skb, IPPROTO_GRE, 0,
-                                        null_compute_pseudo);

Thanks for reviewing.

^ permalink raw reply

* Re: [bpf-next, v3 1/5] flow_dissector: implements flow dissector BPF hook
From: Willem de Bruijn @ 2018-09-13 21:00 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Petar Penkov, Network Development, David Miller,
	Alexei Starovoitov, Daniel Borkmann, simon.horman, ecree,
	songliubraving, Tom Herbert, Petar Penkov, Willem de Bruijn
In-Reply-To: <20180913205740.a3zqadwevavshopv@ast-mbp>

On Thu, Sep 13, 2018 at 4:57 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Sep 13, 2018 at 04:51:49PM -0400, Willem de Bruijn wrote:
> > On Thu, Sep 13, 2018 at 3:45 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Thu, Sep 13, 2018 at 10:45:53AM -0700, Petar Penkov wrote:
> > > > From: Petar Penkov <ppenkov@google.com>
> > > >
> > > > Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
> > > > attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
> > > > path. The BPF program is per-network namespace.
> > > >
> > > > Signed-off-by: Petar Penkov <ppenkov@google.com>
> > > > Signed-off-by: Willem de Bruijn <willemb@google.com>
> > > ...
> > > > @@ -2333,6 +2335,7 @@ struct __sk_buff {
> > > >       /* ... here. */
> > > >
> > > >       __u32 data_meta;
> > > > +     struct bpf_flow_keys *flow_keys;
> > >
> > > the bpf prog form patch 4 looks much better now. Thanks!
> > >
> > > >  };
> > > >
> > > >  struct bpf_tunnel_key {
> > > > @@ -2778,4 +2781,27 @@ enum bpf_task_fd_type {
> > > >       BPF_FD_TYPE_URETPROBE,          /* filename + offset */
> > > >  };
> > > >
> > > > +struct bpf_flow_keys {
> > > > +     __u16   nhoff;
> > > > +     __u16   thoff;
> > > > +     __u16   addr_proto;                     /* ETH_P_* of valid addrs */
> > > > +     __u8    is_frag;
> > > > +     __u8    is_first_frag;
> > > > +     __u8    is_encap;
> > > > +     __be16  n_proto;
> > > > +     __u8    ip_proto;
> > > > +     union {
> > > > +             struct {
> > > > +                     __be32  ipv4_src;
> > > > +                     __be32  ipv4_dst;
> > > > +             };
> > > > +             struct {
> > > > +                     __u32   ipv6_src[4];    /* in6_addr; network order */
> > > > +                     __u32   ipv6_dst[4];    /* in6_addr; network order */
> > > > +             };
> > > > +     };
> > > > +     __be16  sport;
> > > > +     __be16  dport;
> > > > +};
> > >
> > > can you please pack it?
> > > struct bpf_flow_keys {
> > >         __u16                      nhoff;                /*     0     2 */
> > >         __u16                      thoff;                /*     2     2 */
> > >         __u16                      addr_proto;           /*     4     2 */
> > >         __u8                       is_frag;              /*     6     1 */
> > >         __u8                       is_first_frag;        /*     7     1 */
> > >         __u8                       is_encap;             /*     8     1 */
> > >
> > >         /* XXX 1 byte hole, try to pack */
> > >
> > >         __be16                     n_proto;              /*    10     2 */
> > >         __u8                       ip_proto;             /*    12     1 */
> > >
> > >         /* XXX 3 bytes hole, try to pack */
> > >
> > >         union {
> > >
> > > also is_frag and other fields are not used by the kernel and
> > > only used by the prog to pass data between tail_calls ?
> >
> > No, these are mapped directly onto fields in struct flow_keys
> > on return from the BPF program in __skb_flow_bpf_to_target.
> > For is_frag, for instance:
> >
> >        if (flow_keys->is_frag)
> >                key_control->flags |= FLOW_DIS_IS_FRAGMENT;
>
> right. my search-fu failed me. only packing is needed then.
>
> > This is true for all fields in the struct except nhoff.
>
> > > In such case reserve some space in bpf_flow_keys similar to skb->cb
> > > so it can contain any fields and accommodate for inevitable changes
> > > to bpf flow dissector prog in the future.
> >
> > Do you mean a second scratch space akin to cb[], just a few
> > reserved padding bytes?
>
> it looks to me it's possible to rearrange the fields to avoid all holes,
> so no extra padding bytes necessary.

Absolutely. We forgot to run pahole earlier. Updating now.

> > We have given some thought to forward compatibility. The existing
> > fields cannot be changed, but it should be fine if we need to expand
> > the struct later.
>
> let's keep cb-like idea for later. It seems to me we can add it to
> the end of bpf_flow_keys any time later.

^ permalink raw reply

* Re: [PATCH bpf-next 07/11] bpf: Add helper to retrieve socket in BPF
From: Alexei Starovoitov @ 2018-09-13 21:01 UTC (permalink / raw)
  To: Joe Stringer
  Cc: daniel, netdev, ast, john fastabend, tgraf, Martin KaFai Lau,
	Nitin Hande, mauricio.vasquez
In-Reply-To: <CAOftzPiG6JMb2=U3ZU9D2+0U=1zLqZPgax8OFRHF_1UTcs5Shw@mail.gmail.com>

On Thu, Sep 13, 2018 at 01:55:01PM -0700, Joe Stringer wrote:
> On Thu, 13 Sep 2018 at 12:06, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Wed, Sep 12, 2018 at 5:06 PM, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > > On Tue, Sep 11, 2018 at 05:36:36PM -0700, Joe Stringer wrote:
> > >> This patch adds new BPF helper functions, bpf_sk_lookup_tcp() and
> > >> bpf_sk_lookup_udp() which allows BPF programs to find out if there is a
> > >> socket listening on this host, and returns a socket pointer which the
> > >> BPF program can then access to determine, for instance, whether to
> > >> forward or drop traffic. bpf_sk_lookup_xxx() may take a reference on the
> > >> socket, so when a BPF program makes use of this function, it must
> > >> subsequently pass the returned pointer into the newly added sk_release()
> > >> to return the reference.
> > >>
> > >> By way of example, the following pseudocode would filter inbound
> > >> connections at XDP if there is no corresponding service listening for
> > >> the traffic:
> > >>
> > >>   struct bpf_sock_tuple tuple;
> > >>   struct bpf_sock_ops *sk;
> > >>
> > >>   populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet
> > >>   sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof tuple, netns, 0);
> > > ...
> > >> +struct bpf_sock_tuple {
> > >> +     union {
> > >> +             __be32 ipv6[4];
> > >> +             __be32 ipv4;
> > >> +     } saddr;
> > >> +     union {
> > >> +             __be32 ipv6[4];
> > >> +             __be32 ipv4;
> > >> +     } daddr;
> > >> +     __be16 sport;
> > >> +     __be16 dport;
> > >> +     __u8 family;
> > >> +};
> > >
> > > since we can pass ptr_to_packet into map lookup and other helpers now,
> > > can you move 'family' out of bpf_sock_tuple and combine with netns_id arg?
> > > then progs wouldn't need to copy bytes from the packet into tuple
> > > to do a lookup.
> 
> If I follow, you're proposing that users should be able to pass a
> pointer to the source address field of the L3 header, and assuming
> that the L3 header ends with saddr+daddr (no options/extheaders), and
> is immediately followed by the sport/dport then a packet pointer
> should work for performing socket lookup. Then it is up to the BPF
> program writer to ensure that this is the case, or otherwise fall back
> to populating a copy of the sock tuple on the stack.

yep.

> > have been thinking more about it.
> > since only ipv4 and ipv6 supported may be use size of bpf_sock_tuple
> > to infer family inside the helper, so it doesn't need to be passed explicitly?
> 
> Let me make sure I understand the proposal here.
> 
> The current structure and function prototypes are:
> 
> struct bpf_sock_tuple {
>       union {
>               __be32 ipv6[4];
>               __be32 ipv4;
>       } saddr;
>       union {
>               __be32 ipv6[4];
>               __be32 ipv4;
>       } daddr;
>       __be16 sport;
>       __be16 dport;
>       __u8 family;
> };
...
> You're proposing something like:
> 
> struct bpf_sock_tuple4 {
>       __be32 saddr;
>       __be32 daddr;
>       __be16 sport;
>       __be16 dport;
>       __u8 family;
> };
> 
> struct bpf_sock_tuple6 {
>       __be32 saddr[4];
>       __be32 daddr[4];
>       __be16 sport;
>       __be16 dport;
>       __u8 family;
> };

I think the split is unnecessary.
I'm proposing:
struct bpf_sock_tuple {
      union {
              __be32 ipv6[4];
              __be32 ipv4;
      } saddr;
      union {
              __be32 ipv6[4];
              __be32 ipv4;
      } daddr;
      __be16 sport;
      __be16 dport;
};

that points directly into the packet (when ipv4 options are not there)
and bpf_sk_lookup_tcp() uses 'size' argument to figure out ipv4/ipv6 family.

^ permalink raw reply

* Re: [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390)
From: maowenan @ 2018-09-14  2:24 UTC (permalink / raw)
  To: Eric Dumazet, gregkh
  Cc: mkubecek, David Woodhouse, netdev, Eric Dumazet, David Miller,
	Yuchung Cheng, jdw, stable, tiwai
In-Reply-To: <CANn89iLVjY+KWs3y=cYW_OpwCLAOt9=x=iTnsPatwzCMYxUpTA@mail.gmail.com>



On 2018/9/13 20:44, Eric Dumazet wrote:
> On Thu, Sep 13, 2018 at 5:32 AM Greg KH <gregkh@linux-foundation.org> wrote:
>>
>> On Thu, Aug 16, 2018 at 05:24:09PM +0200, Greg KH wrote:
>>> On Thu, Aug 16, 2018 at 02:33:56PM +0200, Michal Kubecek wrote:
>>>> On Thu, Aug 16, 2018 at 08:05:50PM +0800, maowenan wrote:
>>>>> On 2018/8/16 19:39, Michal Kubecek wrote:
>>>>>>
>>>>>> I suspect you may be doing something wrong with your tests. I checked
>>>>>> the segmentsmack testcase and the CPU utilization on receiving side
>>>>>> (with sending 10 times as many packets as default) went down from ~100%
>>>>>> to ~3% even when comparing what is in stable 4.4 now against older 4.4
>>>>>> kernel.
>>>>>
>>>>> There seems no obvious problem when you send packets with default
>>>>> parameter in Segmentsmack POC, Which is also very related with your
>>>>> server's hardware configuration. Please try with below parameter to
>>>>> form OFO packets
>>>>
>>>> I did and even with these (questionable, see below) changes, I did not
>>>> get more than 10% (of one core) by receiving ksoftirqd.
>>>>
>>>>>       for (i = 0; i < 1024; i++)      // 128->1024
>>>> ...
>>>>>       usleep(10*1000); // Adjust this and packet count to match the target!, sleep 100ms->10ms
>>>>
>>>> The comment in the testcase source suggests to do _one_ of these two
>>>> changes so that you generate 10 times as many packets as the original
>>>> testcase. You did both so that you end up sending 102400 packets per
>>>> second. With 55 byte long packets, this kind of attack requires at least
>>>> 5.5 MB/s (44 Mb/s) of throughput. This is no longer a "low packet rate
>>>> DoS", I'm afraid.
>>>>
>>>> Anyway, even at this rate, I only get ~10% of one core (Intel E5-2697).
>>>>
>>>> What I can see, though, is that with current stable 4.4 code, modified
>>>> testcase which sends something like
>>>>
>>>>   2:3, 3:4, ..., 3001:3002, 3003:3004, 3004:3005, ... 6001:6002, ...
>>>>
>>>> I quickly eat 6 MB of memory for receive queue of one socket while
>>>> earlier 4.4 kernels only take 200-300 KB. I didn't test latest 4.4 with
>>>> Takashi's follow-up yet but I'm pretty sure it will help while
>>>> preserving nice performance when using the original segmentsmack
>>>> testcase (with increased packet ratio).
>>>
>>> Ok, for now I've applied Takashi's fix to the 4.4 stable queue and will
>>> push out a new 4.4-rc later tonight.  Can everyone standardize on that
>>> and test and let me know if it does, or does not, fix the reported
>>> issues?
>>>
>>> If not, we can go from there and evaluate this much larger patch series.
>>> But let's try the simple thing first.
>>
>> So, is the issue still present on the latest 4.4 release?  Has anyone
>> tested it?  If not, I'm more than willing to look at backported patches,
>> but I want to ensure that they really are needed here.
>>
>> thanks,
> 
> Honestly, TCP stack without rb-tree for the OOO queue is vulnerable,
> even with non malicious sender,
> but with big enough TCP receive window and a not favorable network.
> 
> So a malicious peer can definitely send packets needed to make TCP
> stack behave in O(N), which is pretty bad if N is big...
> 
> 9f5afeae51526b3ad7b7cb21ee8b145ce6ea7a7a ("tcp: use an RB tree for ooo
> receive queue")
> was proven to be almost bug free [1], and should be backported if possible.
> 
> [1] bug fixed :
> 76f0dcbb5ae1a7c3dbeec13dd98233b8e6b0b32a tcp: fix a stale ooo_last_skb
> after a replace

Thank you for Eric's suggestion, I will do some work to backport them.
> 
> .
> 

^ permalink raw reply

* Re: [PATCH bpf-next 07/11] bpf: Add helper to retrieve socket in BPF
From: Joe Stringer @ 2018-09-13 21:17 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Joe Stringer, daniel, netdev, ast, john fastabend, tgraf,
	Martin KaFai Lau, Nitin Hande, mauricio.vasquez
In-Reply-To: <20180913210158.b5r53sk6vy6vcj52@ast-mbp>

On Thu, 13 Sep 2018 at 14:02, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Sep 13, 2018 at 01:55:01PM -0700, Joe Stringer wrote:
> > On Thu, 13 Sep 2018 at 12:06, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Wed, Sep 12, 2018 at 5:06 PM, Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > > On Tue, Sep 11, 2018 at 05:36:36PM -0700, Joe Stringer wrote:
> > > >> This patch adds new BPF helper functions, bpf_sk_lookup_tcp() and
> > > >> bpf_sk_lookup_udp() which allows BPF programs to find out if there is a
> > > >> socket listening on this host, and returns a socket pointer which the
> > > >> BPF program can then access to determine, for instance, whether to
> > > >> forward or drop traffic. bpf_sk_lookup_xxx() may take a reference on the
> > > >> socket, so when a BPF program makes use of this function, it must
> > > >> subsequently pass the returned pointer into the newly added sk_release()
> > > >> to return the reference.
> > > >>
> > > >> By way of example, the following pseudocode would filter inbound
> > > >> connections at XDP if there is no corresponding service listening for
> > > >> the traffic:
> > > >>
> > > >>   struct bpf_sock_tuple tuple;
> > > >>   struct bpf_sock_ops *sk;
> > > >>
> > > >>   populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet
> > > >>   sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof tuple, netns, 0);
> > > > ...
> > > >> +struct bpf_sock_tuple {
> > > >> +     union {
> > > >> +             __be32 ipv6[4];
> > > >> +             __be32 ipv4;
> > > >> +     } saddr;
> > > >> +     union {
> > > >> +             __be32 ipv6[4];
> > > >> +             __be32 ipv4;
> > > >> +     } daddr;
> > > >> +     __be16 sport;
> > > >> +     __be16 dport;
> > > >> +     __u8 family;
> > > >> +};
> > > >
> > > > since we can pass ptr_to_packet into map lookup and other helpers now,
> > > > can you move 'family' out of bpf_sock_tuple and combine with netns_id arg?
> > > > then progs wouldn't need to copy bytes from the packet into tuple
> > > > to do a lookup.
> >
> > If I follow, you're proposing that users should be able to pass a
> > pointer to the source address field of the L3 header, and assuming
> > that the L3 header ends with saddr+daddr (no options/extheaders), and
> > is immediately followed by the sport/dport then a packet pointer
> > should work for performing socket lookup. Then it is up to the BPF
> > program writer to ensure that this is the case, or otherwise fall back
> > to populating a copy of the sock tuple on the stack.
>
> yep.
>
> > > have been thinking more about it.
> > > since only ipv4 and ipv6 supported may be use size of bpf_sock_tuple
> > > to infer family inside the helper, so it doesn't need to be passed explicitly?
> >
> > Let me make sure I understand the proposal here.
> >
> > The current structure and function prototypes are:
> >
> > struct bpf_sock_tuple {
> >       union {
> >               __be32 ipv6[4];
> >               __be32 ipv4;
> >       } saddr;
> >       union {
> >               __be32 ipv6[4];
> >               __be32 ipv4;
> >       } daddr;
> >       __be16 sport;
> >       __be16 dport;
> >       __u8 family;
> > };
> ...
> > You're proposing something like:
> >
> > struct bpf_sock_tuple4 {
> >       __be32 saddr;
> >       __be32 daddr;
> >       __be16 sport;
> >       __be16 dport;
> >       __u8 family;
> > };
> >
> > struct bpf_sock_tuple6 {
> >       __be32 saddr[4];
> >       __be32 daddr[4];
> >       __be16 sport;
> >       __be16 dport;
> >       __u8 family;
> > };
>
> I think the split is unnecessary.
> I'm proposing:
> struct bpf_sock_tuple {
>       union {
>               __be32 ipv6[4];
>               __be32 ipv4;
>       } saddr;
>       union {
>               __be32 ipv6[4];
>               __be32 ipv4;
>       } daddr;
>       __be16 sport;
>       __be16 dport;
> };
>
> that points directly into the packet (when ipv4 options are not there)
> and bpf_sk_lookup_tcp() uses 'size' argument to figure out ipv4/ipv6 family.

Needs to be subtly different, the 'sport'/'dport' offset would be
wrong in the IPv4 case otherwise:

$ cat foo.c
#include <linux/types.h>

struct bpf_sock_tuple {
     union {
             __be32 ipv6[4];
             __be32 ipv4;
     } saddr;
     union {
             __be32 ipv6[4];
             __be32 ipv4;
     } daddr;
     __be16 sport;
     __be16 dport;
};

int main(int argc, char *argv[]) {
       struct bpf_sock_tuple tuple;

       return 0;
}
$ gcc -g ./foo.c -o foo.o
$ pahole foo.o
struct bpf_sock_tuple {
       union {
               __be32             ipv6[4];              /*          16 */
               __be32             ipv4;                 /*           4 */
       } saddr;                                         /*     0    16 */
       union {
               __be32             ipv6[4];              /*          16 */
               __be32             ipv4;                 /*           4 */
       } daddr;                                         /*    16    16 */
       __be16                     sport;                /*    32     2 */
       __be16                     dport;                /*    34     2 */

       /* size: 36, cachelines: 1, members: 4 */
       /* last cacheline: 36 bytes */
};

---

We could take my definitions above and do the following if we want to
try to type the helper definition:

union bpf_sock_tuple {
       struct bpf_sock_tuple4 t4;
       struct bpf_sock_tuple6 t6;
};

^ permalink raw reply

* Re: [PATCH bpf-next 07/11] bpf: Add helper to retrieve socket in BPF
From: Alexei Starovoitov @ 2018-09-13 21:22 UTC (permalink / raw)
  To: Joe Stringer
  Cc: daniel, netdev, ast, john fastabend, tgraf, Martin KaFai Lau,
	Nitin Hande, mauricio.vasquez
In-Reply-To: <CAOftzPjodTDZHg+Y7ayR-JX7LMZ3PXfYWBgWDonXJj_1mhZaqA@mail.gmail.com>

On Thu, Sep 13, 2018 at 02:17:17PM -0700, Joe Stringer wrote:
> On Thu, 13 Sep 2018 at 14:02, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, Sep 13, 2018 at 01:55:01PM -0700, Joe Stringer wrote:
> > > On Thu, 13 Sep 2018 at 12:06, Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Wed, Sep 12, 2018 at 5:06 PM, Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > > On Tue, Sep 11, 2018 at 05:36:36PM -0700, Joe Stringer wrote:
> > > > >> This patch adds new BPF helper functions, bpf_sk_lookup_tcp() and
> > > > >> bpf_sk_lookup_udp() which allows BPF programs to find out if there is a
> > > > >> socket listening on this host, and returns a socket pointer which the
> > > > >> BPF program can then access to determine, for instance, whether to
> > > > >> forward or drop traffic. bpf_sk_lookup_xxx() may take a reference on the
> > > > >> socket, so when a BPF program makes use of this function, it must
> > > > >> subsequently pass the returned pointer into the newly added sk_release()
> > > > >> to return the reference.
> > > > >>
> > > > >> By way of example, the following pseudocode would filter inbound
> > > > >> connections at XDP if there is no corresponding service listening for
> > > > >> the traffic:
> > > > >>
> > > > >>   struct bpf_sock_tuple tuple;
> > > > >>   struct bpf_sock_ops *sk;
> > > > >>
> > > > >>   populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet
> > > > >>   sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof tuple, netns, 0);
> > > > > ...
> > > > >> +struct bpf_sock_tuple {
> > > > >> +     union {
> > > > >> +             __be32 ipv6[4];
> > > > >> +             __be32 ipv4;
> > > > >> +     } saddr;
> > > > >> +     union {
> > > > >> +             __be32 ipv6[4];
> > > > >> +             __be32 ipv4;
> > > > >> +     } daddr;
> > > > >> +     __be16 sport;
> > > > >> +     __be16 dport;
> > > > >> +     __u8 family;
> > > > >> +};
> > > > >
> > > > > since we can pass ptr_to_packet into map lookup and other helpers now,
> > > > > can you move 'family' out of bpf_sock_tuple and combine with netns_id arg?
> > > > > then progs wouldn't need to copy bytes from the packet into tuple
> > > > > to do a lookup.
> > >
> > > If I follow, you're proposing that users should be able to pass a
> > > pointer to the source address field of the L3 header, and assuming
> > > that the L3 header ends with saddr+daddr (no options/extheaders), and
> > > is immediately followed by the sport/dport then a packet pointer
> > > should work for performing socket lookup. Then it is up to the BPF
> > > program writer to ensure that this is the case, or otherwise fall back
> > > to populating a copy of the sock tuple on the stack.
> >
> > yep.
> >
> > > > have been thinking more about it.
> > > > since only ipv4 and ipv6 supported may be use size of bpf_sock_tuple
> > > > to infer family inside the helper, so it doesn't need to be passed explicitly?
> > >
> > > Let me make sure I understand the proposal here.
> > >
> > > The current structure and function prototypes are:
> > >
> > > struct bpf_sock_tuple {
> > >       union {
> > >               __be32 ipv6[4];
> > >               __be32 ipv4;
> > >       } saddr;
> > >       union {
> > >               __be32 ipv6[4];
> > >               __be32 ipv4;
> > >       } daddr;
> > >       __be16 sport;
> > >       __be16 dport;
> > >       __u8 family;
> > > };
> > ...
> > > You're proposing something like:
> > >
> > > struct bpf_sock_tuple4 {
> > >       __be32 saddr;
> > >       __be32 daddr;
> > >       __be16 sport;
> > >       __be16 dport;
> > >       __u8 family;
> > > };
> > >
> > > struct bpf_sock_tuple6 {
> > >       __be32 saddr[4];
> > >       __be32 daddr[4];
> > >       __be16 sport;
> > >       __be16 dport;
> > >       __u8 family;
> > > };
> >
> > I think the split is unnecessary.
> > I'm proposing:
> > struct bpf_sock_tuple {
> >       union {
> >               __be32 ipv6[4];
> >               __be32 ipv4;
> >       } saddr;
> >       union {
> >               __be32 ipv6[4];
> >               __be32 ipv4;
> >       } daddr;
> >       __be16 sport;
> >       __be16 dport;
> > };
> >
> > that points directly into the packet (when ipv4 options are not there)
> > and bpf_sk_lookup_tcp() uses 'size' argument to figure out ipv4/ipv6 family.
> 
> Needs to be subtly different, the 'sport'/'dport' offset would be
> wrong in the IPv4 case otherwise:

ahh. right.

> 
> We could take my definitions above and do the following if we want to
> try to type the helper definition:
> 
> union bpf_sock_tuple {
>        struct bpf_sock_tuple4 t4;
>        struct bpf_sock_tuple6 t6;
> };

yes. sounds great to me. Much better than 'void *' in the helper.

^ permalink raw reply

* [PATCH] net/ibm/emac: Remove VLA usage
From: Kees Cook @ 2018-09-13 21:23 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Benjamin Herrenschmidt, Christian Lamparter,
	Ivan Mikhaylov, linux-kernel

In the quest to remove all stack VLA usage from the kernel[1], this
removes the VLA used for the emac xaht registers size. Since the size
of registers can only ever be 4 or 8, as detected in emac_init_config(),
the max can be hardcoded and a runtime test added for robustness.

[1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Christian Lamparter <chunkeey@gmail.com>
Cc: Ivan Mikhaylov <ivan@de.ibm.com>
Cc: netdev@vger.kernel.org
Co-developed-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
---
 drivers/net/ethernet/ibm/emac/core.c | 6 +++++-
 drivers/net/ethernet/ibm/emac/core.h | 3 +++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ibm/emac/core.c b/drivers/net/ethernet/ibm/emac/core.c
index 372664686309..7410a1de8f1d 100644
--- a/drivers/net/ethernet/ibm/emac/core.c
+++ b/drivers/net/ethernet/ibm/emac/core.c
@@ -423,7 +423,7 @@ static void emac_hash_mc(struct emac_instance *dev)
 {
 	const int regs = EMAC_XAHT_REGS(dev);
 	u32 *gaht_base = emac_gaht_base(dev);
-	u32 gaht_temp[regs];
+	u32 gaht_temp[EMAC_XAHT_MAX_REGS];
 	struct netdev_hw_addr *ha;
 	int i;
 
@@ -2964,6 +2964,10 @@ static int emac_init_config(struct emac_instance *dev)
 		dev->xaht_width_shift = EMAC4_XAHT_WIDTH_SHIFT;
 	}
 
+	/* This should never happen */
+	if (WARN_ON(EMAC_XAHT_REGS(dev) > EMAC_XAHT_MAX_REGS))
+		return -ENXIO;
+
 	DBG(dev, "features     : 0x%08x / 0x%08x\n", dev->features, EMAC_FTRS_POSSIBLE);
 	DBG(dev, "tx_fifo_size : %d (%d gige)\n", dev->tx_fifo_size, dev->tx_fifo_size_gige);
 	DBG(dev, "rx_fifo_size : %d (%d gige)\n", dev->rx_fifo_size, dev->rx_fifo_size_gige);
diff --git a/drivers/net/ethernet/ibm/emac/core.h b/drivers/net/ethernet/ibm/emac/core.h
index 369de2cfb15b..84caa4a3fc52 100644
--- a/drivers/net/ethernet/ibm/emac/core.h
+++ b/drivers/net/ethernet/ibm/emac/core.h
@@ -390,6 +390,9 @@ static inline int emac_has_feature(struct emac_instance *dev,
 #define	EMAC4SYNC_XAHT_SLOTS_SHIFT	8
 #define	EMAC4SYNC_XAHT_WIDTH_SHIFT	5
 
+/* The largest span between slots and widths above is 3 */
+#define	EMAC_XAHT_MAX_REGS		(1 << 3)
+
 #define	EMAC_XAHT_SLOTS(dev)         	(1 << (dev)->xaht_slots_shift)
 #define	EMAC_XAHT_WIDTH(dev)         	(1 << (dev)->xaht_width_shift)
 #define	EMAC_XAHT_REGS(dev)          	(1 << ((dev)->xaht_slots_shift - \
-- 
2.17.1


-- 
Kees Cook
Pixel Security

^ permalink raw reply related

* Re: [PATCH bpf-next 07/11] bpf: Add helper to retrieve socket in BPF
From: Joe Stringer @ 2018-09-13 21:24 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Joe Stringer, daniel, netdev, ast, john fastabend, tgraf,
	Martin KaFai Lau, Nitin Hande, mauricio.vasquez
In-Reply-To: <20180913212205.ght2mompuoyuhd4g@ast-mbp>

On Thu, 13 Sep 2018 at 14:22, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Sep 13, 2018 at 02:17:17PM -0700, Joe Stringer wrote:
> > On Thu, 13 Sep 2018 at 14:02, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Thu, Sep 13, 2018 at 01:55:01PM -0700, Joe Stringer wrote:
> > > > On Thu, 13 Sep 2018 at 12:06, Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > On Wed, Sep 12, 2018 at 5:06 PM, Alexei Starovoitov
> > > > > <alexei.starovoitov@gmail.com> wrote:
> > > > > > On Tue, Sep 11, 2018 at 05:36:36PM -0700, Joe Stringer wrote:
> > > > > >> This patch adds new BPF helper functions, bpf_sk_lookup_tcp() and
> > > > > >> bpf_sk_lookup_udp() which allows BPF programs to find out if there is a
> > > > > >> socket listening on this host, and returns a socket pointer which the
> > > > > >> BPF program can then access to determine, for instance, whether to
> > > > > >> forward or drop traffic. bpf_sk_lookup_xxx() may take a reference on the
> > > > > >> socket, so when a BPF program makes use of this function, it must
> > > > > >> subsequently pass the returned pointer into the newly added sk_release()
> > > > > >> to return the reference.
> > > > > >>
> > > > > >> By way of example, the following pseudocode would filter inbound
> > > > > >> connections at XDP if there is no corresponding service listening for
> > > > > >> the traffic:
> > > > > >>
> > > > > >>   struct bpf_sock_tuple tuple;
> > > > > >>   struct bpf_sock_ops *sk;
> > > > > >>
> > > > > >>   populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet
> > > > > >>   sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof tuple, netns, 0);
> > > > > > ...
> > > > > >> +struct bpf_sock_tuple {
> > > > > >> +     union {
> > > > > >> +             __be32 ipv6[4];
> > > > > >> +             __be32 ipv4;
> > > > > >> +     } saddr;
> > > > > >> +     union {
> > > > > >> +             __be32 ipv6[4];
> > > > > >> +             __be32 ipv4;
> > > > > >> +     } daddr;
> > > > > >> +     __be16 sport;
> > > > > >> +     __be16 dport;
> > > > > >> +     __u8 family;
> > > > > >> +};
> > > > > >
> > > > > > since we can pass ptr_to_packet into map lookup and other helpers now,
> > > > > > can you move 'family' out of bpf_sock_tuple and combine with netns_id arg?
> > > > > > then progs wouldn't need to copy bytes from the packet into tuple
> > > > > > to do a lookup.
> > > >
> > > > If I follow, you're proposing that users should be able to pass a
> > > > pointer to the source address field of the L3 header, and assuming
> > > > that the L3 header ends with saddr+daddr (no options/extheaders), and
> > > > is immediately followed by the sport/dport then a packet pointer
> > > > should work for performing socket lookup. Then it is up to the BPF
> > > > program writer to ensure that this is the case, or otherwise fall back
> > > > to populating a copy of the sock tuple on the stack.
> > >
> > > yep.
> > >
> > > > > have been thinking more about it.
> > > > > since only ipv4 and ipv6 supported may be use size of bpf_sock_tuple
> > > > > to infer family inside the helper, so it doesn't need to be passed explicitly?
> > > >
> > > > Let me make sure I understand the proposal here.
> > > >
> > > > The current structure and function prototypes are:
> > > >
> > > > struct bpf_sock_tuple {
> > > >       union {
> > > >               __be32 ipv6[4];
> > > >               __be32 ipv4;
> > > >       } saddr;
> > > >       union {
> > > >               __be32 ipv6[4];
> > > >               __be32 ipv4;
> > > >       } daddr;
> > > >       __be16 sport;
> > > >       __be16 dport;
> > > >       __u8 family;
> > > > };
> > > ...
> > > > You're proposing something like:
> > > >
> > > > struct bpf_sock_tuple4 {
> > > >       __be32 saddr;
> > > >       __be32 daddr;
> > > >       __be16 sport;
> > > >       __be16 dport;
> > > >       __u8 family;
> > > > };
> > > >
> > > > struct bpf_sock_tuple6 {
> > > >       __be32 saddr[4];
> > > >       __be32 daddr[4];
> > > >       __be16 sport;
> > > >       __be16 dport;
> > > >       __u8 family;
> > > > };
> > >
> > > I think the split is unnecessary.
> > > I'm proposing:
> > > struct bpf_sock_tuple {
> > >       union {
> > >               __be32 ipv6[4];
> > >               __be32 ipv4;
> > >       } saddr;
> > >       union {
> > >               __be32 ipv6[4];
> > >               __be32 ipv4;
> > >       } daddr;
> > >       __be16 sport;
> > >       __be16 dport;
> > > };
> > >
> > > that points directly into the packet (when ipv4 options are not there)
> > > and bpf_sk_lookup_tcp() uses 'size' argument to figure out ipv4/ipv6 family.
> >
> > Needs to be subtly different, the 'sport'/'dport' offset would be
> > wrong in the IPv4 case otherwise:
>
> ahh. right.
>
> >
> > We could take my definitions above and do the following if we want to
> > try to type the helper definition:
> >
> > union bpf_sock_tuple {
> >        struct bpf_sock_tuple4 t4;
> >        struct bpf_sock_tuple6 t6;
> > };
>
> yes. sounds great to me. Much better than 'void *' in the helper.

Could even do something like this:

$ cat foo.c
#include <linux/types.h>

struct bpf_sock_tuple {
   union {
   struct {
       __be32 saddr;
       __be32 daddr;
       __be16 sport;
       __be16 dport;
   } ipv4;
   struct {
       __be32 saddr[4];
       __be32 daddr[4];
       __be16 sport;
       __be16 dport;
   } ipv6;
   };
};

int main(int argc, char *argv[]) {
       struct bpf_sock_tuple tuple;

       return 0;
}
$ gcc -g ./foo.c -o foo.o
$ pahole foo.o
struct bpf_sock_tuple {
       union {
               struct {
                       __be32     saddr;                /*     0     4 */
                       __be32     daddr;                /*     4     4 */
                       __be16     sport;                /*     8     2 */
                       __be16     dport;                /*    10     2 */
               } ipv4;                                  /*          12 */
               struct {
                       __be32     saddr[4];             /*     0    16 */
                       __be32     daddr[4];             /*    16    16 */
                       __be16     sport;                /*    32     2 */
                       __be16     dport;                /*    34     2 */
               } ipv6;                                  /*          36 */
       };                                               /*     0    36 */

       /* size: 36, cachelines: 1, members: 1 */
       /* last cacheline: 36 bytes */
};

^ permalink raw reply

* Re: [PATCH 1/2] netlink: add NLA_REJECT policy type
From: Michal Kubecek @ 2018-09-13 21:27 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: Johannes Berg, linux-wireless, netdev, Johannes Berg
In-Reply-To: <20180913193004.GF4590@localhost.localdomain>

On Thu, Sep 13, 2018 at 04:30:04PM -0300, Marcelo Ricardo Leitner wrote:
> On Thu, Sep 13, 2018 at 10:46:02AM +0200, Johannes Berg wrote:
> > From: Johannes Berg <johannes.berg@intel.com>
> > 
> > In some situations some netlink attributes may be used for output
> > only (kernel->userspace) or may be reserved for future use. It's
> > then helpful to be able to prevent userspace from using them in
> > messages sent to the kernel, since they'd otherwise be ignored and
> > any future will become impossible if this happens.
> 
> I don't follow, what's the issue with simply ignoring such attributes?

A bit artificial example but I can't come with somethin better at the
moment: let's say newer kernel and iproute2 allow something like

  ip link del <condition1> <condition2>

and you run new ip with older kernel which only supports <condition1>.
You don't really want kernel to silently ignore the second condition.
Or e.g. when adding a netfilter rule, you wouldn't want kernel to ignore
parts of the rule it does not understand.

I must admit I'm not sure if there is really need for having reserved
attributes with netlink. Maybe e.g. when we want to share part of
(numeric) attribute types between different message types. Anyway, we
have the same problem with attributes higher than maximum; NLA_REJECT
wouldn't help with this but the discussion also touched the topic.

> > Add NLA_REJECT to the policy which does nothing but reject (with
> > EINVAL) validation of any messages containing this attribute.
> > Allow for returning a specific extended ACK error message in the
> > validation_data pointer.
> 
> This has potential to create confusion because we can't use it on
> {output,reserved} attributes that are already there (as we must always
> ignore them now) and we will end up with a mix of it.

We can return -EINVAL even now, we just need to add a check after
nla_parse() wrapper returns, e.g. here: (lines 314-320)

  https://github.com/mkubecek/ethnl/blob/master/kernel/0019-ethtool-implement-SET_PARAMS-message.patch#L314

It would be easier and IMHO cleaner if I could simply list these "read
only attributes" with NLA_REJECT policy for "set" request.

Michal Kubecek

^ permalink raw reply

* Re: [PATCH 1/2] netlink: add NLA_REJECT policy type
From: Marcelo Ricardo Leitner @ 2018-09-13 21:58 UTC (permalink / raw)
  To: Michal Kubecek
  Cc: Johannes Berg, linux-wireless, netdev, Johannes Berg, jbenc
In-Reply-To: <20180913212742.GC3876@unicorn.suse.cz>

On Thu, Sep 13, 2018 at 11:27:42PM +0200, Michal Kubecek wrote:
> On Thu, Sep 13, 2018 at 04:30:04PM -0300, Marcelo Ricardo Leitner wrote:
> > On Thu, Sep 13, 2018 at 10:46:02AM +0200, Johannes Berg wrote:
> > > From: Johannes Berg <johannes.berg@intel.com>
> > > 
> > > In some situations some netlink attributes may be used for output
> > > only (kernel->userspace) or may be reserved for future use. It's
> > > then helpful to be able to prevent userspace from using them in
> > > messages sent to the kernel, since they'd otherwise be ignored and
> > > any future will become impossible if this happens.
> > 
> > I don't follow, what's the issue with simply ignoring such attributes?
> 
> A bit artificial example but I can't come with somethin better at the
> moment: let's say newer kernel and iproute2 allow something like
> 
>   ip link del <condition1> <condition2>
> 
> and you run new ip with older kernel which only supports <condition1>.
> You don't really want kernel to silently ignore the second condition.
> Or e.g. when adding a netfilter rule, you wouldn't want kernel to ignore
> parts of the rule it does not understand.

Oh, I thought these are different issues. How would NLA_REJECT protect
against this, by reserving <condition2> in advance?

This example looks more like a case for NLM_F_STRICT:
https://lwn.net/Articles/661266/
On which the user space would explicitly say "please reject this if
you don't get it all", but it was rejected back then.

> 
> I must admit I'm not sure if there is really need for having reserved
> attributes with netlink. Maybe e.g. when we want to share part of
> (numeric) attribute types between different message types. Anyway, we
> have the same problem with attributes higher than maximum; NLA_REJECT
> wouldn't help with this but the discussion also touched the topic.

Yes, agree.

> 
> > > Add NLA_REJECT to the policy which does nothing but reject (with
> > > EINVAL) validation of any messages containing this attribute.
> > > Allow for returning a specific extended ACK error message in the
> > > validation_data pointer.
> > 
> > This has potential to create confusion because we can't use it on
> > {output,reserved} attributes that are already there (as we must always
> > ignore them now) and we will end up with a mix of it.
> 
> We can return -EINVAL even now, we just need to add a check after
> nla_parse() wrapper returns, e.g. here: (lines 314-320)
> 
>   https://github.com/mkubecek/ethnl/blob/master/kernel/0019-ethtool-implement-SET_PARAMS-message.patch#L314

That's new code, it's okay. Won't break anyone's setup.

> 
> It would be easier and IMHO cleaner if I could simply list these "read
> only attributes" with NLA_REJECT policy for "set" request.

Not that I'm against this. Point was fields that are considered output
only today are probably being silently ignored, and we can't change
them to be NLA_REJECT as it would break user applications. Then we
will have fields that are rejected, and those old that are not. In the
long run, nearly all output fields would be marked as NLA_REJECT,
okay.

Then I ask my first question again: why reject these? They are not
hurting anything, are they?  It's different from your example I think.
In there, the extra information which was ignored leads to a
different behavior.

Maybe it would be better to have NLA_IGNORE instead? </idea>

  Marcelo

^ permalink raw reply

* Re: [PATCH net-next V2 11/11] vhost_net: batch submitting XDP buffers to underlayer sockets
From: Jason Wang @ 2018-09-14  3:11 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20180913140332-mutt-send-email-mst@kernel.org>



On 2018年09月14日 02:04, Michael S. Tsirkin wrote:
> On Wed, Sep 12, 2018 at 11:17:09AM +0800, Jason Wang wrote:
>> +static void vhost_tx_batch(struct vhost_net *net,
>> +			   struct vhost_net_virtqueue *nvq,
>> +			   struct socket *sock,
>> +			   struct msghdr *msghdr)
>> +{
>> +	struct tun_msg_ctl ctl = {
>> +		.type = TUN_MSG_PTR,
>> +		.num = nvq->batched_xdp,
>> +		.ptr = nvq->xdp,
>> +	};
>> +	int err;
>> +
>> +	if (nvq->batched_xdp == 0)
>> +		goto signal_used;
>> +
>> +	msghdr->msg_control = &ctl;
>> +	err = sock->ops->sendmsg(sock, msghdr, 0);
>> +	if (unlikely(err < 0)) {
>> +		vq_err(&nvq->vq, "Fail to batch sending packets\n");
>> +		return;
>> +	}
>> +
>> +signal_used:
>> +	vhost_net_signal_used(nvq);
>> +	nvq->batched_xdp = 0;
>> +}
>> +
> Given it's all tun-specific now, how about just exporting tap_sendmsg
> and calling that? Will get rid of some indirection too.
>

Yes we can do it on top in the future.

Thanks

^ permalink raw reply

* Re: [RFC PATCH iproute2-next] man: Add devlink health man page
From: Tobin C. Harding @ 2018-09-13 22:06 UTC (permalink / raw)
  To: Eran Ben Elisha
  Cc: netdev, Jiri Pirko, Andy Gospodarek, Michael Chan, Jakub Kicinski,
	Simon Horman, Alexander Duyck, Andrew Lunn, Florian Fainelli,
	Tal Alon, Ariel Almog
In-Reply-To: <c6ca90da-c8f3-b632-4b39-b4c8a6e11460@mellanox.com>

On Thu, Sep 13, 2018 at 02:58:52PM +0300, Eran Ben Elisha wrote:
> 
> 
> On 9/13/2018 1:27 PM, Tobin C. Harding wrote:
> > On Thu, Sep 13, 2018 at 11:18:16AM +0300, Eran Ben Elisha wrote:
> > > Add devlink-health man page. Devlink-health tool will control device
> > > health attributes, sensors, actions and logging.
> > > 
> > > Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
> > > 
> > > -------------------------------------------------------
> > > Copy paste man output to here for easier review process of the RFC.
> > > 
> > > DEVLINK-HEALTH(8)                                                                                               Linux                                                                                              DEVLINK-HEALTH(8)
> > > 
> > > NAME
> > >         devlink-health - devlink health configuration
> > > 
> > > SYNOPSIS
> > >         devlink [ OPTIONS ] health  { COMMAND | help }
> > > 
> > >         OPTIONS := { -V[ersion] | -n[no-nice-names] }
> > > 
> > >         devlink health show [ DEV ] [ sensor NAME ]
> > > 
> > >         devlink health sensor set DEV name NAME [ action NAME { active | inactive } ]"
> > > 
> > >         devlink health action set DEV name NAME period PERIOD count COUNT fail { ignore | down }
> > > 
> > >         devlink health action reinit DEV name NAME
> > > 
> > >         devlink health help
> > > 
> > > DESCRIPTION
> > >         devlink-health tool allows user to configure the way driver treats unexpected status. The tool allows configuration of the sensors that can trigger health activity. Set for each sensor the follow up operations, such as,
> > >         reset and dump of info. In addition, set the health activity termination action.
> > > 
> > >     devlink health show - Display devlink health sensors and actions attributes
> > >         DEV - Specifies the devlink device to show.  If this argument is omitted, all devices are listed.
> > > 
> > >             Format is:
> > >               BUS_NAME/BUS_ADDRESS
> > > 
> > >         sensor NAME - Specifies the devlink sensor to show.
> > > 
> > 
> > Perhaps the commands should include the optional arguments so when
> > reading the description one doesn't have to scroll to the top of the
> > page all the time
> > 
> > e.g
> >       devlink health show [ DEV ] [ sensor NAME ] - Display devlink health sensors and actions attributes
> > 
> 
> I followed the scheme presented in all other devlink man pages.
> see devlink-region, devlink-port, etc.

Oh ok, my mistake.  I'd stick with what you have then.  Thanks for
pointing this out.

> From my perspective, I am fine with adding it to devlink-health, need ack
> from the devlink maintainer to see if he likes it...
> 
> > >     devlink health sensor set - sets devlink health sensor attributes
> > >         DEV    Specifies the devlink device to show.
> > 
> > 	 		      	      	     	set
> > 
> > >         name NAME
> > >                Name of the sensor to set.
> > > 
> > >         action NAME { active | inactive }
> > >                    Specify which actions to activate and which to deactivate once a sensor was triggered. actions can be dump, reset, etc.
> > > 
> > >     devlink health action set - sets devlink action attributes
> > >         DEV    Specifies the devlink device to set.
> > > 
> > >         name NAME
> > >                Specifies the devlink action to set.
> > 
> > This is a little unclear to me?
> 
> what is not clear? the term 'action' or the naming? can you elaborate?

It wasn't immediately clear what 'name' referred to.  But following on
from discussion above this may be because I have not read any of the
other devlink man pages.

thanks,
Tobin.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox