Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH bpf-next 04/11] bpf: Add PTR_TO_SOCKET verifier type
From: Joe Stringer @ 2018-09-13 18:59 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Joe Stringer, daniel, netdev, ast, john fastabend, tgraf,
	Martin KaFai Lau, Nitin Hande, mauricio.vasquez
In-Reply-To: <20180912225013.gi5mwnwckhex3m3c@ast-mbp>

On Wed, 12 Sep 2018 at 15:50, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Sep 11, 2018 at 05:36:33PM -0700, Joe Stringer wrote:
> > ...
> > +static bool reg_type_mismatch(enum bpf_reg_type src, enum bpf_reg_type prev)
> > +{
> > +     return src != prev && (!reg_type_mismatch_ok(src) ||
> > +                            !reg_type_mismatch_ok(prev));
> > +}
> > +
> >  static int do_check(struct bpf_verifier_env *env)
> >  {
> >       struct bpf_verifier_state *state;
> > @@ -4778,9 +4862,7 @@ static int do_check(struct bpf_verifier_env *env)
> >                                */
> >                               *prev_src_type = src_reg_type;
> >
> > -                     } else if (src_reg_type != *prev_src_type &&
> > -                                (src_reg_type == PTR_TO_CTX ||
> > -                                 *prev_src_type == PTR_TO_CTX)) {
> > +                     } else if (reg_type_mismatch(src_reg_type, *prev_src_type)) {
> >                               /* ABuser program is trying to use the same insn
> >                                * dst_reg = *(u32*) (src_reg + off)
> >                                * with different pointer types:
> > @@ -4826,8 +4908,8 @@ static int do_check(struct bpf_verifier_env *env)
> >                       if (*prev_dst_type == NOT_INIT) {
> >                               *prev_dst_type = dst_reg_type;
> >                       } else if (dst_reg_type != *prev_dst_type &&
> > -                                (dst_reg_type == PTR_TO_CTX ||
> > -                                 *prev_dst_type == PTR_TO_CTX)) {
> > +                                (!reg_type_mismatch_ok(dst_reg_type) ||
> > +                                 !reg_type_mismatch_ok(*prev_dst_type))) {
>
> reg_type_mismatch() could have been used here as well ?

Missed that before, will fix.

> >                               verbose(env, "same insn cannot be used with different pointers\n");
> >                               return -EINVAL;
> >                       }
> > @@ -5244,10 +5326,14 @@ static void sanitize_dead_code(struct bpf_verifier_env *env)
> >       }
> >  }
> >
> > -/* convert load instructions that access fields of 'struct __sk_buff'
> > - * into sequence of instructions that access fields of 'struct sk_buff'
> > +/* convert load instructions that access fields of a context type into a
> > + * sequence of instructions that access fields of the underlying structure:
> > + *     struct __sk_buff    -> struct sk_buff
> > + *     struct bpf_sock_ops -> struct sock
> >   */
> > -static int convert_ctx_accesses(struct bpf_verifier_env *env)
> > +static int convert_ctx_accesses(struct bpf_verifier_env *env,
> > +                             bpf_convert_ctx_access_t convert_ctx_access,
> > +                             enum bpf_reg_type ctx_type)
> >  {
> >       const struct bpf_verifier_ops *ops = env->ops;
> >       int i, cnt, size, ctx_field_size, delta = 0;
> > @@ -5274,12 +5360,14 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
> >               }
> >       }
> >
> > -     if (!ops->convert_ctx_access || bpf_prog_is_dev_bound(env->prog->aux))
> > +     if (!convert_ctx_access || bpf_prog_is_dev_bound(env->prog->aux))
> >               return 0;
> >
> >       insn = env->prog->insnsi + delta;
> >
> >       for (i = 0; i < insn_cnt; i++, insn++) {
> > +             enum bpf_reg_type ptr_type;
> > +
> >               if (insn->code == (BPF_LDX | BPF_MEM | BPF_B) ||
> >                   insn->code == (BPF_LDX | BPF_MEM | BPF_H) ||
> >                   insn->code == (BPF_LDX | BPF_MEM | BPF_W) ||
> > @@ -5321,7 +5409,8 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
> >                       continue;
> >               }
> >
> > -             if (env->insn_aux_data[i + delta].ptr_type != PTR_TO_CTX)
> > +             ptr_type = env->insn_aux_data[i + delta].ptr_type;
> > +             if (ptr_type != ctx_type)
> >                       continue;
> >
> >               ctx_field_size = env->insn_aux_data[i + delta].ctx_field_size;
> > @@ -5354,8 +5443,8 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
> >               }
> >
> >               target_size = 0;
> > -             cnt = ops->convert_ctx_access(type, insn, insn_buf, env->prog,
> > -                                           &target_size);
> > +             cnt = convert_ctx_access(type, insn, insn_buf, env->prog,
> > +                                      &target_size);
> >               if (cnt == 0 || cnt >= ARRAY_SIZE(insn_buf) ||
> >                   (ctx_field_size && !target_size)) {
> >                       verbose(env, "bpf verifier is misconfigured\n");
> > @@ -5899,7 +5988,13 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr)
> >
> >       if (ret == 0)
> >               /* program is valid, convert *(u32*)(ctx + off) accesses */
> > -             ret = convert_ctx_accesses(env);
> > +             ret = convert_ctx_accesses(env, env->ops->convert_ctx_access,
> > +                                        PTR_TO_CTX);
> > +
> > +     if (ret == 0)
> > +             /* Convert *(u32*)(sock_ops + off) accesses */
> > +             ret = convert_ctx_accesses(env, bpf_sock_convert_ctx_access,
> > +                                        PTR_TO_SOCKET);
>
> can it be done in single pass instead?
> env->insn_aux_data tells us what kind of conversion needs to happen.
> while progs are small, I guess, it's ok, but would like to see a follow up
> patch to this.

Yeah, good point - upon review it seems like this would be a simple
change (incremental):

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index db05f78bfc6e..7fa1b695a2a1 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -5329,9 +5329,7 @@ static void sanitize_dead_code(struct
bpf_verifier_env *env)
 *     struct __sk_buff    -> struct sk_buff
 *     struct bpf_sock_ops -> struct sock
 */
-static int convert_ctx_accesses(struct bpf_verifier_env *env,
-                               bpf_convert_ctx_access_t convert_ctx_access,
-                               enum bpf_reg_type ctx_type)
+static int convert_ctx_accesses(struct bpf_verifier_env *env)
{
       const struct bpf_verifier_ops *ops = env->ops;
       int i, cnt, size, ctx_field_size, delta = 0;
@@ -5358,13 +5356,13 @@ static int convert_ctx_accesses(struct
bpf_verifier_env *env,
               }
       }

-       if (!convert_ctx_access || bpf_prog_is_dev_bound(env->prog->aux))
+       if (bpf_prog_is_dev_bound(env->prog->aux))
               return 0;

       insn = env->prog->insnsi + delta;

       for (i = 0; i < insn_cnt; i++, insn++) {
-               enum bpf_reg_type ptr_type;
+               bpf_convert_ctx_access_t convert_ctx_access;

               if (insn->code == (BPF_LDX | BPF_MEM | BPF_B) ||
                   insn->code == (BPF_LDX | BPF_MEM | BPF_H) ||
@@ -5407,9 +5405,18 @@ static int convert_ctx_accesses(struct
bpf_verifier_env *env,
                       continue;
               }

-               ptr_type = env->insn_aux_data[i + delta].ptr_type;
-               if (ptr_type != ctx_type)
+               switch (env->insn_aux_data[i + delta].ptr_type) {
+               case PTR_TO_CTX:
+                       if (!ops->convert_ctx_access)
+                               continue;
+                       convert_ctx_access = ops->convert_ctx_access;
+                       break;
+               case PTR_TO_SOCKET:
+                       convert_ctx_access = bpf_sock_convert_ctx_access;
+                       break;
+               default:
                       continue;
+               }

               ctx_field_size = env->insn_aux_data[i + delta].ctx_field_size;
               size = BPF_LDST_BYTES(insn);
@@ -5986,13 +5993,7 @@ int bpf_check(struct bpf_prog **prog, union
bpf_attr *attr)

       if (ret == 0)
               /* program is valid, convert *(u32*)(ctx + off) accesses */
-               ret = convert_ctx_accesses(env, env->ops->convert_ctx_access,
-                                          PTR_TO_CTX);
-
-       if (ret == 0)
-               /* Convert *(u32*)(sock_ops + off) accesses */
-               ret = convert_ctx_accesses(env, bpf_sock_convert_ctx_access,
-                                          PTR_TO_SOCKET);
+               ret = convert_ctx_accesses(env);

       if (ret == 0)
               ret = fixup_bpf_calls(env);

^ permalink raw reply related

* Re: [PATCH bpf-next 06/11] bpf: Add reference tracking to verifier
From: Joe Stringer @ 2018-09-13 19:00 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Joe Stringer, daniel, netdev, ast, john fastabend, tgraf,
	Martin KaFai Lau, Nitin Hande, mauricio.vasquez
In-Reply-To: <20180912231735.nowruf5o3vkdxf64@ast-mbp>

On Wed, 12 Sep 2018 at 16:17, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Sep 11, 2018 at 05:36:35PM -0700, Joe Stringer wrote:
> > ...
> > +
> > +/* release function corresponding to acquire_reference_state(). Idempotent. */
> > +static int __release_reference_state(struct bpf_func_state *state, int ptr_id)
> > +{
> > +     int i, last_idx;
> > +
> > +     if (!ptr_id)
> > +             return 0;
>
> Is this defensive programming or this condition can actually happen?
> As far as I can see all callers suppose to pass valid ptr_id into it.
>
> Acked-by: Alexei Starovoitov <ast@kernel.org>
>

Looks like defensive programming to me. That said, if it's being
defensive, why not return `-EFAULT`? I'll try this out locally.

^ permalink raw reply

* Re: [PATCH net v2 0/3] tls: don't leave keys in kernel memory
From: David Miller @ 2018-09-13 19:04 UTC (permalink / raw)
  To: sd; +Cc: netdev, aviadye, borisp, davejwatson, vakul.garg
In-Reply-To: <cover.1536766755.git.sd@queasysnail.net>

From: Sabrina Dubroca <sd@queasysnail.net>
Date: Wed, 12 Sep 2018 17:44:40 +0200

> There are a few places where the RX/TX key for a TLS socket is copied
> to kernel memory. This series clears those memory areas when they're no
> longer needed.
> 
> v2: add union tls_crypto_context, following Vakul Garg's comment
>     swap patch 2 and 3, using new union in patch 3

Series applied and queued up for -stable.

Thanks.

^ permalink raw reply

* Re: [PATCH bpf-next 07/11] bpf: Add helper to retrieve socket in BPF
From: Alexei Starovoitov @ 2018-09-13 19:06 UTC (permalink / raw)
  To: Joe Stringer
  Cc: Daniel Borkmann, Network Development, Alexei Starovoitov,
	John Fastabend, Thomas Graf, Martin KaFai Lau, Nitin Hande,
	Mauricio Vasquez

On Wed, Sep 12, 2018 at 5:06 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Tue, Sep 11, 2018 at 05:36:36PM -0700, Joe Stringer wrote:
>> This patch adds new BPF helper functions, bpf_sk_lookup_tcp() and
>> bpf_sk_lookup_udp() which allows BPF programs to find out if there is a
>> socket listening on this host, and returns a socket pointer which the
>> BPF program can then access to determine, for instance, whether to
>> forward or drop traffic. bpf_sk_lookup_xxx() may take a reference on the
>> socket, so when a BPF program makes use of this function, it must
>> subsequently pass the returned pointer into the newly added sk_release()
>> to return the reference.
>>
>> By way of example, the following pseudocode would filter inbound
>> connections at XDP if there is no corresponding service listening for
>> the traffic:
>>
>>   struct bpf_sock_tuple tuple;
>>   struct bpf_sock_ops *sk;
>>
>>   populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet
>>   sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof tuple, netns, 0);
> ...
>> +struct bpf_sock_tuple {
>> +     union {
>> +             __be32 ipv6[4];
>> +             __be32 ipv4;
>> +     } saddr;
>> +     union {
>> +             __be32 ipv6[4];
>> +             __be32 ipv4;
>> +     } daddr;
>> +     __be16 sport;
>> +     __be16 dport;
>> +     __u8 family;
>> +};
>
> since we can pass ptr_to_packet into map lookup and other helpers now,
> can you move 'family' out of bpf_sock_tuple and combine with netns_id arg?
> then progs wouldn't need to copy bytes from the packet into tuple
> to do a lookup.

have been thinking more about it.
since only ipv4 and ipv6 supported may be use size of bpf_sock_tuple
to infer family inside the helper, so it doesn't need to be passed explicitly?

^ permalink raw reply

* Re: pull request: bluetooth 2018-09-13
From: David Miller @ 2018-09-13 19:07 UTC (permalink / raw)
  To: johan.hedberg; +Cc: linux-bluetooth, netdev
In-Reply-To: <20180913094551.GA14832@x1c.lan>

From: Johan Hedberg <johan.hedberg@gmail.com>
Date: Thu, 13 Sep 2018 12:45:51 +0300

> A few Bluetooth fixes for the 4.19-rc series:
> 
>  - Fixed rw_semaphore leak in hci_ldisc
>  - Fixed local Out-of-Band pairing data handling
> 
> Let me know if there are any issues pulling. Thanks.

Pulled, thanks Johan.

^ permalink raw reply

* Re: [PATCH net v2] gso_segment: Reset skb->mac_len after modifying network header
From: David Miller @ 2018-09-13 19:10 UTC (permalink / raw)
  To: toke; +Cc: netdev, cake, edumazet, dave.taht
In-Reply-To: <20180913144306.9077-1-toke@toke.dk>

From: Toke Høiland-Jørgensen <toke@toke.dk>
Date: Thu, 13 Sep 2018 16:43:07 +0200

> When splitting a GSO segment that consists of encapsulated packets, the
> skb->mac_len of the segments can end up being set wrong, causing packet
> drops in particular when using act_mirred and ifb interfaces in
> combination with a qdisc that splits GSO packets.
> 
> This happens because at the time skb_segment() is called, network_header
> will point to the inner header, throwing off the calculation in
> skb_reset_mac_len(). The network_header is subsequently adjust by the
> outer IP gso_segment handlers, but they don't set the mac_len.
> 
> Fix this by adding skb_reset_mac_len() calls to both the IPv4 and IPv6
> gso_segment handlers, after they modify the network_header.
> 
> Many thanks to Eric Dumazet for his help in identifying the cause of
> the bug.
> 
> Acked-by: Dave Taht <dave.taht@gmail.com>
> Reviewed-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
> ---
> v2:
>   - Properly credit Eric for his help
>   - Add review and ack tags

Applied and queued up for -stable, thanks.

^ permalink raw reply

* [PATCH] net/mlx4_core: print firmware version during driver loading
From: Qing Huang @ 2018-09-14  0:25 UTC (permalink / raw)
  To: netdev, linux-rdma, linux-kernel; +Cc: tariqt, davem

When debugging firmware related issues, it's very helpful to have
the installed FW version info in the kernel log when the driver is
loaded. It's easier to match error/warning messages with different
FW versions in the log other than running a separate tool to get
the information back and forth.

Signed-off-by: Qing Huang <qing.huang@oracle.com>
---
 drivers/net/ethernet/mellanox/mlx4/fw.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/fw.c b/drivers/net/ethernet/mellanox/mlx4/fw.c
index babcfd9..e1c5218 100644
--- a/drivers/net/ethernet/mellanox/mlx4/fw.c
+++ b/drivers/net/ethernet/mellanox/mlx4/fw.c
@@ -1686,11 +1686,11 @@ int mlx4_QUERY_FW(struct mlx4_dev *dev)
 	MLX4_GET(lg, outbox, QUERY_FW_MAX_CMD_OFFSET);
 	cmd->max_cmds = 1 << lg;
 
-	mlx4_dbg(dev, "FW version %d.%d.%03d (cmd intf rev %d), max commands %d\n",
-		 (int) (dev->caps.fw_ver >> 32),
-		 (int) (dev->caps.fw_ver >> 16) & 0xffff,
-		 (int) dev->caps.fw_ver & 0xffff,
-		 cmd_if_rev, cmd->max_cmds);
+	mlx4_info(dev, "FW version %d.%d.%03d (cmd intf rev %d), max commands %d\n",
+		  (int)(dev->caps.fw_ver >> 32),
+		  (int)(dev->caps.fw_ver >> 16) & 0xffff,
+		  (int)dev->caps.fw_ver & 0xffff,
+		  cmd_if_rev, cmd->max_cmds);
 
 	MLX4_GET(fw->catas_offset, outbox, QUERY_FW_ERR_START_OFFSET);
 	MLX4_GET(fw->catas_size,   outbox, QUERY_FW_ERR_SIZE_OFFSET);
-- 
2.9.3

^ permalink raw reply related

* Re: [PATCH 1/2] netlink: add NLA_REJECT policy type
From: Marcelo Ricardo Leitner @ 2018-09-13 19:20 UTC (permalink / raw)
  To: Michal Kubecek; +Cc: Johannes Berg, linux-wireless, netdev
In-Reply-To: <20180913120554.GG29691@unicorn.suse.cz>

On Thu, Sep 13, 2018 at 02:05:54PM +0200, Michal Kubecek wrote:
> On Thu, Sep 13, 2018 at 01:25:15PM +0200, Johannes Berg wrote:
> > On Thu, 2018-09-13 at 12:49 +0200, Michal Kubecek wrote:
> > 
> > > >  		if (type > 0 && type <= maxtype) {
> > > >  			if (policy) {
> > > > -				err = validate_nla(nla, maxtype, policy);
> > > > +				err = validate_nla(nla, maxtype, policy,
> > > > +						   extack);
> > > >  				if (err < 0) {
> > > > -					NL_SET_ERR_MSG_ATTR(extack, nla,
> > > > -							    "Attribute failed policy validation");
> > > > +					NL_SET_BAD_ATTR(extack, nla);
> > > > +					if (extack && !extack->_msg)
> > > > +						NL_SET_ERR_MSG(extack,
> > > > +							       "Attribute failed policy validation");
> > > >  					goto errout;
> > > >  				}
> > > >  			}
> > > > -- 
> > > 
> > > Technically, this would change the outcome when nla_parse() is called
> > > with extack->_msg already set nad validate_nla() fails on something else
> > > than NLA_REJECT; it will preserve the previous message in such case.
> > > But I don't think this is a serious problem.
> > 
> > Yes, that's true. I looked at quite a few of the setters just now (there
> > are ~500 already, wow!), and they all set & return, so this shouldn't be
> > an issue.
> 
> In ethtool (work in progress) I sometimes use extack message for
> non-fatal warnings but AFAICS never before parsing the userspace
> request (actually always shortly before returning). So I don't have
> a problem with it.

Considering we can only report 1 message, it should be okay to drop
the previous message in favor of the new one, which is either a
critical one or just another non-fatal one.

  Marcelo

^ permalink raw reply

* Re: [PATCH 1/2] netlink: add NLA_REJECT policy type
From: Marcelo Ricardo Leitner @ 2018-09-13 19:30 UTC (permalink / raw)
  To: Johannes Berg; +Cc: linux-wireless, netdev, Michal Kubecek, Johannes Berg
In-Reply-To: <20180913084603.7979-1-johannes@sipsolutions.net>

On Thu, Sep 13, 2018 at 10:46:02AM +0200, Johannes Berg wrote:
> From: Johannes Berg <johannes.berg@intel.com>
> 
> In some situations some netlink attributes may be used for output
> only (kernel->userspace) or may be reserved for future use. It's
> then helpful to be able to prevent userspace from using them in
> messages sent to the kernel, since they'd otherwise be ignored and
> any future will become impossible if this happens.

I don't follow, what's the issue with simply ignoring such attributes?

> 
> Add NLA_REJECT to the policy which does nothing but reject (with
> EINVAL) validation of any messages containing this attribute.
> Allow for returning a specific extended ACK error message in the
> validation_data pointer.

This has potential to create confusion because we can't use it on
{output,reserved} attributes that are already there (as we must always
ignore them now) and we will end up with a mix of it.

> 
> While at it fix the indentation of NLA_BITFIELD32 and describe the
> validation_data pointer for it.
> 
> The specific case I have in mind now is a shared nested attribute
> containing request/response data, and it would be pointless and
> potentially confusing to have userspace include response data in
> the messages that actually contain a request.

Would be nice to see some patches for it too. Maybe it helps.

> 
> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
> ---
>  include/net/netlink.h |  6 +++++-
>  lib/nlattr.c          | 22 +++++++++++++++-------
>  2 files changed, 20 insertions(+), 8 deletions(-)
> 
> diff --git a/include/net/netlink.h b/include/net/netlink.h
> index 0c154f98e987..04e40fcc70d6 100644
> --- a/include/net/netlink.h
> +++ b/include/net/netlink.h
> @@ -180,6 +180,7 @@ enum {
>  	NLA_S32,
>  	NLA_S64,
>  	NLA_BITFIELD32,
> +	NLA_REJECT,
>  	__NLA_TYPE_MAX,
>  };
>  
> @@ -208,7 +209,10 @@ enum {
>   *    NLA_MSECS            Leaving the length field zero will verify the
>   *                         given type fits, using it verifies minimum length
>   *                         just like "All other"
> - *    NLA_BITFIELD32      A 32-bit bitmap/bitselector attribute
> + *    NLA_BITFIELD32       A 32-bit bitmap/bitselector attribute, validation
> + *                         data must point to a u32 value of valid flags
> + *    NLA_REJECT           Reject this attribute, validation data may point
> + *                         to a string to report as the error in extended ACK.
>   *    All other            Minimum length of attribute payload
>   *
>   * Example:
> diff --git a/lib/nlattr.c b/lib/nlattr.c
> index e335bcafa9e4..56e0aae5cf23 100644
> --- a/lib/nlattr.c
> +++ b/lib/nlattr.c
> @@ -69,7 +69,8 @@ static int validate_nla_bitfield32(const struct nlattr *nla,
>  }
>  
>  static int validate_nla(const struct nlattr *nla, int maxtype,
> -			const struct nla_policy *policy)
> +			const struct nla_policy *policy,
> +			struct netlink_ext_ack *extack)
>  {
>  	const struct nla_policy *pt;
>  	int minlen = 0, attrlen = nla_len(nla), type = nla_type(nla);
> @@ -87,6 +88,11 @@ static int validate_nla(const struct nlattr *nla, int maxtype,
>  	}
>  
>  	switch (pt->type) {
> +	case NLA_REJECT:
> +		if (pt->validation_data && extack)
> +			extack->_msg = pt->validation_data;
> +		return -EINVAL;
> +
>  	case NLA_FLAG:
>  		if (attrlen > 0)
>  			return -ERANGE;
> @@ -180,11 +186,10 @@ int nla_validate(const struct nlattr *head, int len, int maxtype,
>  	int rem;
>  
>  	nla_for_each_attr(nla, head, len, rem) {
> -		int err = validate_nla(nla, maxtype, policy);
> +		int err = validate_nla(nla, maxtype, policy, extack);
>  
>  		if (err < 0) {
> -			if (extack)
> -				extack->bad_attr = nla;
> +			NL_SET_BAD_ATTR(extack, nla);
>  			return err;
>  		}
>  	}
> @@ -251,10 +256,13 @@ int nla_parse(struct nlattr **tb, int maxtype, const struct nlattr *head,
>  
>  		if (type > 0 && type <= maxtype) {
>  			if (policy) {
> -				err = validate_nla(nla, maxtype, policy);
> +				err = validate_nla(nla, maxtype, policy,
> +						   extack);
>  				if (err < 0) {
> -					NL_SET_ERR_MSG_ATTR(extack, nla,
> -							    "Attribute failed policy validation");
> +					NL_SET_BAD_ATTR(extack, nla);
> +					if (extack && !extack->_msg)
> +						NL_SET_ERR_MSG(extack,
> +							       "Attribute failed policy validation");
>  					goto errout;
>  				}
>  			}
> -- 
> 2.14.4
> 

^ permalink raw reply

* [PATCH iproute2] libnetlink: fix leak and using unused memory on error
From: Stephen Hemminger @ 2018-09-13 19:33 UTC (permalink / raw)
  To: Mahesh Bandewar; +Cc: netdev, Stephen Hemminger

If an error happens in multi-segment message (tc only)
then report the error and stop processing further responses.
This also fixes refering to the buffer after free.

The sequence check is not necessary here because the
response message has already been validated to be in
the window of the sequence number of the iov.

Reported-by: Mahesh Bandewar <mahesh@bandewar.net>
Fixes: 7b2ee50c0cd5 ("hv_netvsc: common detach logic")
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
 lib/libnetlink.c | 23 +++++++++--------------
 1 file changed, 9 insertions(+), 14 deletions(-)

diff --git a/lib/libnetlink.c b/lib/libnetlink.c
index 928de1dd16d8..586809292594 100644
--- a/lib/libnetlink.c
+++ b/lib/libnetlink.c
@@ -617,7 +617,6 @@ static int __rtnl_talk_iov(struct rtnl_handle *rtnl, struct iovec *iov,
 	msg.msg_iovlen = 1;
 	i = 0;
 	while (1) {
-next:
 		status = rtnl_recvmsg(rtnl->fd, &msg, &buf);
 		++i;
 
@@ -660,27 +659,23 @@ next:
 
 				if (l < sizeof(struct nlmsgerr)) {
 					fprintf(stderr, "ERROR truncated\n");
-				} else if (!err->error) {
+					free(buf);
+					return -1;
+				}
+
+				if (!err->error)
 					/* check messages from kernel */
 					nl_dump_ext_ack(h, errfn);
 
-					if (answer)
-						*answer = (struct nlmsghdr *)buf;
-					else
-						free(buf);
-					if (h->nlmsg_seq == seq)
-						return 0;
-					else if (i < iovlen)
-						goto next;
-					return 0;
-				}
-
 				if (rtnl->proto != NETLINK_SOCK_DIAG &&
 				    show_rtnl_err)
 					rtnl_talk_error(h, err, errfn);
 
 				errno = -err->error;
-				free(buf);
+				if (answer)
+					*answer = (struct nlmsghdr *)buf;
+				else
+					free(buf);
 				return -i;
 			}
 
-- 
2.18.0

^ permalink raw reply related

* Re: [PATCH 2/2] netlink: add ethernet address policy types
From: Marcelo Ricardo Leitner @ 2018-09-13 19:41 UTC (permalink / raw)
  To: Johannes Berg; +Cc: Michal Kubecek, linux-wireless, netdev
In-Reply-To: <1536840966.4160.6.camel@sipsolutions.net>

On Thu, Sep 13, 2018 at 02:16:06PM +0200, Johannes Berg wrote:
> On Thu, 2018-09-13 at 14:12 +0200, Michal Kubecek wrote:
> > On Thu, Sep 13, 2018 at 02:02:53PM +0200, Johannes Berg wrote:
> > > On Thu, 2018-09-13 at 13:58 +0200, Michal Kubecek wrote:
> > > 
> > > > The code looks correct to me but I have some doubts. Having a special
> > > > policy for MAC addresses may lead to adding one for IPv4 address (maybe
> > > > not, we can use NLA_U32 for them), IPv6 addresses and other data types
> > > > with fixed length. Wouldn't it be more helpful to add a variant of
> > > > NLA_BINARY (NLA_BINARY_EXACT?) which would fail/warn if attribute length
> > > > isn't equal to .len?
> > > 
> > > Yeah, I guess we could do that, and then
> > > 
> > > #define NLA_ETH_ADDR .len = ETH_ALEN, .type = NLA_BINARY_EXACT
> > > #define NLA_IP6_ADDR .len = 16, .type = NLA_BINARY_EXACT
> > > 
> > > or so?
> > 
> > Maybe rather
> > 
> >   #define NLA_ETH_ADDR NLA_BINARY_EXACT, .len = ETH_ALEN
> >   #define NLA_IP6_ADDR NLA_BINARY_EXACT, .len = sizeof(struct in6_addr)
> > 
> > so that one could write
> > 
> >   { .type = NLA_ETH_ADDR }
> 
> Yeah, that's possible. I considered it for a second, but it was slightly
> too magical for my taste :-)
> 
> Better pick a different "namespace", perhaps NLA_POLICY_ETH_ADDR or so?

What about
#define NLA_FIELD_ETH_ADDR  { .type = NLA_BINARY_EXACT, .len = ETH_ALEN }

Or even
#define NLA_FIELD_BINARY_EXACT(_len)	{ .type = NLA_BINARY_EXACT, .len = (_len) }
#define NLA_FIELD_ETH_ADDR		NLA_FIELD_BINARY_EXACT(ETH_ALEN)

So that one would just:
   [MYADDR] = NLA_FIELD_ETH_ADDR,

and if we change how we parse/validate it, users should be good
already.

  Marcelo

^ permalink raw reply

* Re: [bpf-next, v3 1/5] flow_dissector: implements flow dissector BPF hook
From: Alexei Starovoitov @ 2018-09-13 19:43 UTC (permalink / raw)
  To: Petar Penkov
  Cc: netdev, davem, ast, daniel, simon.horman, ecree, songliubraving,
	tom, Petar Penkov, Willem de Bruijn
In-Reply-To: <20180913174557.188828-2-peterpenkov96@gmail.com>

On Thu, Sep 13, 2018 at 10:45:53AM -0700, Petar Penkov wrote:
> From: Petar Penkov <ppenkov@google.com>
> 
> Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
> attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
> path. The BPF program is per-network namespace.
> 
> Signed-off-by: Petar Penkov <ppenkov@google.com>
> Signed-off-by: Willem de Bruijn <willemb@google.com>
...
> @@ -2333,6 +2335,7 @@ struct __sk_buff {
>  	/* ... here. */
>  
>  	__u32 data_meta;
> +	struct bpf_flow_keys *flow_keys;

the bpf prog form patch 4 looks much better now. Thanks!

>  };
>  
>  struct bpf_tunnel_key {
> @@ -2778,4 +2781,27 @@ enum bpf_task_fd_type {
>  	BPF_FD_TYPE_URETPROBE,		/* filename + offset */
>  };
>  
> +struct bpf_flow_keys {
> +	__u16	nhoff;
> +	__u16	thoff;
> +	__u16	addr_proto;			/* ETH_P_* of valid addrs */
> +	__u8	is_frag;
> +	__u8	is_first_frag;
> +	__u8	is_encap;
> +	__be16	n_proto;
> +	__u8	ip_proto;
> +	union {
> +		struct {
> +			__be32	ipv4_src;
> +			__be32	ipv4_dst;
> +		};
> +		struct {
> +			__u32	ipv6_src[4];	/* in6_addr; network order */
> +			__u32	ipv6_dst[4];	/* in6_addr; network order */
> +		};
> +	};
> +	__be16	sport;
> +	__be16	dport;
> +};

can you please pack it?
struct bpf_flow_keys {
	__u16                      nhoff;                /*     0     2 */
	__u16                      thoff;                /*     2     2 */
	__u16                      addr_proto;           /*     4     2 */
	__u8                       is_frag;              /*     6     1 */
	__u8                       is_first_frag;        /*     7     1 */
	__u8                       is_encap;             /*     8     1 */

	/* XXX 1 byte hole, try to pack */

	__be16                     n_proto;              /*    10     2 */
	__u8                       ip_proto;             /*    12     1 */

	/* XXX 3 bytes hole, try to pack */

	union {

also is_frag and other fields are not used by the kernel and
only used by the prog to pass data between tail_calls ?
In such case reserve some space in bpf_flow_keys similar to skb->cb
so it can contain any fields and accommodate for inevitable changes
to bpf flow dissector prog in the future.

^ permalink raw reply

* [PATCH net] net/ipv6: do not copy DST_NOCOUNT flag on rt init
From: Peter Oskolkov @ 2018-09-13 20:38 UTC (permalink / raw)
  To: David Miller, netdev; +Cc: Peter Oskolkov, David Ahern

DST_NOCOUNT in dst_entry::flags tracks whether the entry counts
toward route cache size (net->ipv6.sysctl.ip6_rt_max_size).

If the flag is NOT set, dst_ops::pcpuc_entries counter is incremented
in dist_init() and decremented in dst_destroy().

This flag is tied to allocation/deallocation of dst_entry and
should not be copied from another dst/route. Otherwise it can happen
that dst_ops::pcpuc_entries counter grows until no new routes can
be allocated because the counter reached ip6_rt_max_size due to
DST_NOCOUNT not set and thus no counter decrements on gc-ed routes.

Fixes: 3b6761d18bc1 ("net/ipv6: Move dst flags to booleans in fib entries")
Cc: David Ahern <dsahern@gmail.com>
Acked-by: Wei Wang <weiwan@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
---
 net/ipv6/route.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 3eed045c65a5..a3902f805305 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -946,7 +946,7 @@ static void ip6_rt_init_dst_reject(struct rt6_info *rt, struct fib6_info *ort)

 static void ip6_rt_init_dst(struct rt6_info *rt, struct fib6_info *ort)
 {
-	rt->dst.flags |= fib6_info_dst_flags(ort);
+	rt->dst.flags |= fib6_info_dst_flags(ort) & ~DST_NOCOUNT;

 	if (ort->fib6_flags & RTF_REJECT) {
 		ip6_rt_init_dst_reject(rt, ort);
-- 
2.19.0.397.gdd90340f6a-goog

^ permalink raw reply related

* Re: [PATCH 2/2] netlink: add ethernet address policy types
From: Michal Kubecek @ 2018-09-13 20:39 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner; +Cc: Johannes Berg, linux-wireless, netdev
In-Reply-To: <20180913194116.GG4590@localhost.localdomain>

On Thu, Sep 13, 2018 at 04:41:16PM -0300, Marcelo Ricardo Leitner wrote:
> On Thu, Sep 13, 2018 at 02:16:06PM +0200, Johannes Berg wrote:
> > On Thu, 2018-09-13 at 14:12 +0200, Michal Kubecek wrote:
> > > On Thu, Sep 13, 2018 at 02:02:53PM +0200, Johannes Berg wrote:
> > > > On Thu, 2018-09-13 at 13:58 +0200, Michal Kubecek wrote:
> > > > 
> > > > > The code looks correct to me but I have some doubts. Having a special
> > > > > policy for MAC addresses may lead to adding one for IPv4 address (maybe
> > > > > not, we can use NLA_U32 for them), IPv6 addresses and other data types
> > > > > with fixed length. Wouldn't it be more helpful to add a variant of
> > > > > NLA_BINARY (NLA_BINARY_EXACT?) which would fail/warn if attribute length
> > > > > isn't equal to .len?
> > > > 
> > > > Yeah, I guess we could do that, and then
> > > > 
> > > > #define NLA_ETH_ADDR .len = ETH_ALEN, .type = NLA_BINARY_EXACT
> > > > #define NLA_IP6_ADDR .len = 16, .type = NLA_BINARY_EXACT
> > > > 
> > > > or so?
> > > 
> > > Maybe rather
> > > 
> > >   #define NLA_ETH_ADDR NLA_BINARY_EXACT, .len = ETH_ALEN
> > >   #define NLA_IP6_ADDR NLA_BINARY_EXACT, .len = sizeof(struct in6_addr)
> > > 
> > > so that one could write
> > > 
> > >   { .type = NLA_ETH_ADDR }
> > 
> > Yeah, that's possible. I considered it for a second, but it was slightly
> > too magical for my taste :-)
> > 
> > Better pick a different "namespace", perhaps NLA_POLICY_ETH_ADDR or so?
> 
> What about
> #define NLA_FIELD_ETH_ADDR  { .type = NLA_BINARY_EXACT, .len = ETH_ALEN }
> 
> Or even
> #define NLA_FIELD_BINARY_EXACT(_len)	{ .type = NLA_BINARY_EXACT, .len = (_len) }
> #define NLA_FIELD_ETH_ADDR		NLA_FIELD_BINARY_EXACT(ETH_ALEN)
> 
> So that one would just:
>    [MYADDR] = NLA_FIELD_ETH_ADDR,

Yes, that's how I understoold Johannes' proposal above.

Michal Kubecek

^ permalink raw reply

* Re: [PATCH 1/2] netlink: add NLA_REJECT policy type
From: Michal Kubecek @ 2018-09-13 20:43 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner; +Cc: Johannes Berg, linux-wireless, netdev
In-Reply-To: <20180913192014.GE4590@localhost.localdomain>

On Thu, Sep 13, 2018 at 04:20:14PM -0300, Marcelo Ricardo Leitner wrote:
> On Thu, Sep 13, 2018 at 02:05:54PM +0200, Michal Kubecek wrote:
> > On Thu, Sep 13, 2018 at 01:25:15PM +0200, Johannes Berg wrote:
> > > On Thu, 2018-09-13 at 12:49 +0200, Michal Kubecek wrote:
> > > 
> > > > >  		if (type > 0 && type <= maxtype) {
> > > > >  			if (policy) {
> > > > > -				err = validate_nla(nla, maxtype, policy);
> > > > > +				err = validate_nla(nla, maxtype, policy,
> > > > > +						   extack);
> > > > >  				if (err < 0) {
> > > > > -					NL_SET_ERR_MSG_ATTR(extack, nla,
> > > > > -							    "Attribute failed policy validation");
> > > > > +					NL_SET_BAD_ATTR(extack, nla);
> > > > > +					if (extack && !extack->_msg)
> > > > > +						NL_SET_ERR_MSG(extack,
> > > > > +							       "Attribute failed policy validation");
> > > > >  					goto errout;
> > > > >  				}
> > > > >  			}
> > > > > -- 
> > > > 
> > > > Technically, this would change the outcome when nla_parse() is called
> > > > with extack->_msg already set nad validate_nla() fails on something else
> > > > than NLA_REJECT; it will preserve the previous message in such case.
> > > > But I don't think this is a serious problem.
> > > 
> > > Yes, that's true. I looked at quite a few of the setters just now (there
> > > are ~500 already, wow!), and they all set & return, so this shouldn't be
> > > an issue.
> > 
> > In ethtool (work in progress) I sometimes use extack message for
> > non-fatal warnings but AFAICS never before parsing the userspace
> > request (actually always shortly before returning). So I don't have
> > a problem with it.
> 
> Considering we can only report 1 message, it should be okay to drop
> the previous message in favor of the new one, which is either a
> critical one or just another non-fatal one.

What I wanted to point out is that the code above does not behave like
this. It does not distinguish between extack->_msg set by NLA_REJECT
branch and extack->_msg which had been set before nla_parse() was
called.

Michal Kubecek

^ permalink raw reply

* Re: [bpf-next, v3 1/5] flow_dissector: implements flow dissector BPF hook
From: Willem de Bruijn @ 2018-09-13 20:51 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Petar Penkov, Network Development, David Miller,
	Alexei Starovoitov, Daniel Borkmann, simon.horman, ecree,
	songliubraving, Tom Herbert, Petar Penkov, Willem de Bruijn
In-Reply-To: <20180913194335.fszwj4jj4yputxlv@ast-mbp>

On Thu, Sep 13, 2018 at 3:45 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Sep 13, 2018 at 10:45:53AM -0700, Petar Penkov wrote:
> > From: Petar Penkov <ppenkov@google.com>
> >
> > Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
> > attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
> > path. The BPF program is per-network namespace.
> >
> > Signed-off-by: Petar Penkov <ppenkov@google.com>
> > Signed-off-by: Willem de Bruijn <willemb@google.com>
> ...
> > @@ -2333,6 +2335,7 @@ struct __sk_buff {
> >       /* ... here. */
> >
> >       __u32 data_meta;
> > +     struct bpf_flow_keys *flow_keys;
>
> the bpf prog form patch 4 looks much better now. Thanks!
>
> >  };
> >
> >  struct bpf_tunnel_key {
> > @@ -2778,4 +2781,27 @@ enum bpf_task_fd_type {
> >       BPF_FD_TYPE_URETPROBE,          /* filename + offset */
> >  };
> >
> > +struct bpf_flow_keys {
> > +     __u16   nhoff;
> > +     __u16   thoff;
> > +     __u16   addr_proto;                     /* ETH_P_* of valid addrs */
> > +     __u8    is_frag;
> > +     __u8    is_first_frag;
> > +     __u8    is_encap;
> > +     __be16  n_proto;
> > +     __u8    ip_proto;
> > +     union {
> > +             struct {
> > +                     __be32  ipv4_src;
> > +                     __be32  ipv4_dst;
> > +             };
> > +             struct {
> > +                     __u32   ipv6_src[4];    /* in6_addr; network order */
> > +                     __u32   ipv6_dst[4];    /* in6_addr; network order */
> > +             };
> > +     };
> > +     __be16  sport;
> > +     __be16  dport;
> > +};
>
> can you please pack it?
> struct bpf_flow_keys {
>         __u16                      nhoff;                /*     0     2 */
>         __u16                      thoff;                /*     2     2 */
>         __u16                      addr_proto;           /*     4     2 */
>         __u8                       is_frag;              /*     6     1 */
>         __u8                       is_first_frag;        /*     7     1 */
>         __u8                       is_encap;             /*     8     1 */
>
>         /* XXX 1 byte hole, try to pack */
>
>         __be16                     n_proto;              /*    10     2 */
>         __u8                       ip_proto;             /*    12     1 */
>
>         /* XXX 3 bytes hole, try to pack */
>
>         union {
>
> also is_frag and other fields are not used by the kernel and
> only used by the prog to pass data between tail_calls ?

No, these are mapped directly onto fields in struct flow_keys
on return from the BPF program in __skb_flow_bpf_to_target.
For is_frag, for instance:

       if (flow_keys->is_frag)
               key_control->flags |= FLOW_DIS_IS_FRAGMENT;

This is true for all fields in the struct except nhoff.

> In such case reserve some space in bpf_flow_keys similar to skb->cb
> so it can contain any fields and accommodate for inevitable changes
> to bpf flow dissector prog in the future.

Do you mean a second scratch space akin to cb[], just a few
reserved padding bytes?

We have given some thought to forward compatibility. The existing
fields cannot be changed, but it should be fine if we need to expand
the struct later.

^ permalink raw reply

* Re: [PATCH bpf-next 07/11] bpf: Add helper to retrieve socket in BPF
From: Joe Stringer @ 2018-09-13 20:55 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Joe Stringer, daniel, netdev, ast, john fastabend, tgraf,
	Martin KaFai Lau, Nitin Hande, mauricio.vasquez
In-Reply-To: <CAADnVQ+Ge1HYXPmkEqUWMP0X1F3a9RFg35arZQoktP2cVN4Fkg@mail.gmail.com>

On Thu, 13 Sep 2018 at 12:06, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Sep 12, 2018 at 5:06 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Tue, Sep 11, 2018 at 05:36:36PM -0700, Joe Stringer wrote:
> >> This patch adds new BPF helper functions, bpf_sk_lookup_tcp() and
> >> bpf_sk_lookup_udp() which allows BPF programs to find out if there is a
> >> socket listening on this host, and returns a socket pointer which the
> >> BPF program can then access to determine, for instance, whether to
> >> forward or drop traffic. bpf_sk_lookup_xxx() may take a reference on the
> >> socket, so when a BPF program makes use of this function, it must
> >> subsequently pass the returned pointer into the newly added sk_release()
> >> to return the reference.
> >>
> >> By way of example, the following pseudocode would filter inbound
> >> connections at XDP if there is no corresponding service listening for
> >> the traffic:
> >>
> >>   struct bpf_sock_tuple tuple;
> >>   struct bpf_sock_ops *sk;
> >>
> >>   populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet
> >>   sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof tuple, netns, 0);
> > ...
> >> +struct bpf_sock_tuple {
> >> +     union {
> >> +             __be32 ipv6[4];
> >> +             __be32 ipv4;
> >> +     } saddr;
> >> +     union {
> >> +             __be32 ipv6[4];
> >> +             __be32 ipv4;
> >> +     } daddr;
> >> +     __be16 sport;
> >> +     __be16 dport;
> >> +     __u8 family;
> >> +};
> >
> > since we can pass ptr_to_packet into map lookup and other helpers now,
> > can you move 'family' out of bpf_sock_tuple and combine with netns_id arg?
> > then progs wouldn't need to copy bytes from the packet into tuple
> > to do a lookup.

If I follow, you're proposing that users should be able to pass a
pointer to the source address field of the L3 header, and assuming
that the L3 header ends with saddr+daddr (no options/extheaders), and
is immediately followed by the sport/dport then a packet pointer
should work for performing socket lookup. Then it is up to the BPF
program writer to ensure that this is the case, or otherwise fall back
to populating a copy of the sock tuple on the stack.

> have been thinking more about it.
> since only ipv4 and ipv6 supported may be use size of bpf_sock_tuple
> to infer family inside the helper, so it doesn't need to be passed explicitly?

Let me make sure I understand the proposal here.

The current structure and function prototypes are:

struct bpf_sock_tuple {
      union {
              __be32 ipv6[4];
              __be32 ipv4;
      } saddr;
      union {
              __be32 ipv6[4];
              __be32 ipv4;
      } daddr;
      __be16 sport;
      __be16 dport;
      __u8 family;
};

static struct bpf_sock *(*bpf_sk_lookup_tcp)(void *ctx,
                                           struct bpf_sock_tuple *tuple,
                                           int size, unsigned int netns_id,
                                           unsigned long long flags);
static struct bpf_sock *(*bpf_sk_lookup_udp)(void *ctx,
                                           struct bpf_sock_tuple *tuple,
                                           int size, unsigned int netns_id,
                                           unsigned long long flags);
static int (*bpf_sk_release)(struct bpf_sock *sk, unsigned long long flags);

You're proposing something like:

struct bpf_sock_tuple4 {
      __be32 saddr;
      __be32 daddr;
      __be16 sport;
      __be16 dport;
      __u8 family;
};

struct bpf_sock_tuple6 {
      __be32 saddr[4];
      __be32 daddr[4];
      __be16 sport;
      __be16 dport;
      __u8 family;
};

static struct bpf_sock *(*bpf_sk_lookup_tcp)(void *ctx,
                                           void *tuple,
                                           int size, unsigned int
netns_id,
                                           unsigned long long flags);
static struct bpf_sock *(*bpf_sk_lookup_udp)(void *ctx,
                                           void *tuple,
                                           int size, unsigned int netns_id,
                                           unsigned long long flags);
static int (*bpf_sk_release)(struct bpf_sock *sk, unsigned long long flags);

Then the implementation will check the size against either
"sizeof(struct bpf_sock_tuple4)" or "sizeof(struct bpf_sock_tuple6)"
and interpret as the v4 or v6 handler from this.

Sure, I can try this out.

^ permalink raw reply

* Re: [PATCH bpf-next 10/11] selftests/bpf: Add C tests for reference tracking
From: Joe Stringer @ 2018-09-13 20:56 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Joe Stringer, daniel, netdev, ast, john fastabend, tgraf,
	Martin KaFai Lau, Nitin Hande, mauricio.vasquez
In-Reply-To: <20180913001115.mhddud4avgojdwra@ast-mbp>

On Wed, 12 Sep 2018 at 17:11, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Sep 11, 2018 at 05:36:39PM -0700, Joe Stringer wrote:
> > Signed-off-by: Joe Stringer <joe@wand.net.nz>
>
> really nice set of tests.
> please describe them briefly in commit log.
>
> Acked-by: Alexei Starovoitov <ast@kernel.org>

Ack, will do.

^ permalink raw reply

* Re: [PATCH bpf-next 11/11] Documentation: Describe bpf reference tracking
From: Joe Stringer @ 2018-09-13 20:56 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Joe Stringer, daniel, netdev, ast, john fastabend, tgraf,
	Martin KaFai Lau, Nitin Hande, mauricio.vasquez
In-Reply-To: <20180913001250.wqhqvox3kccslq32@ast-mbp>

On Wed, 12 Sep 2018 at 17:13, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Sep 11, 2018 at 05:36:40PM -0700, Joe Stringer wrote:
> > Signed-off-by: Joe Stringer <joe@wand.net.nz>
>
> just few words in commit log would be better than nothing.
>
> Acked-by: Alexei Starovoitov <ast@kernel.org>

Ack, thanks for the review!

^ permalink raw reply

* Re: [bpf-next, v3 1/5] flow_dissector: implements flow dissector BPF hook
From: Alexei Starovoitov @ 2018-09-13 20:57 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Petar Penkov, Network Development, David Miller,
	Alexei Starovoitov, Daniel Borkmann, simon.horman, ecree,
	songliubraving, Tom Herbert, Petar Penkov, Willem de Bruijn
In-Reply-To: <CAF=yD-LBqT9hZQXuhqs8st_-EJvkXm6cGvt9pu8Ox9rEMHOFQg@mail.gmail.com>

On Thu, Sep 13, 2018 at 04:51:49PM -0400, Willem de Bruijn wrote:
> On Thu, Sep 13, 2018 at 3:45 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, Sep 13, 2018 at 10:45:53AM -0700, Petar Penkov wrote:
> > > From: Petar Penkov <ppenkov@google.com>
> > >
> > > Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
> > > attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
> > > path. The BPF program is per-network namespace.
> > >
> > > Signed-off-by: Petar Penkov <ppenkov@google.com>
> > > Signed-off-by: Willem de Bruijn <willemb@google.com>
> > ...
> > > @@ -2333,6 +2335,7 @@ struct __sk_buff {
> > >       /* ... here. */
> > >
> > >       __u32 data_meta;
> > > +     struct bpf_flow_keys *flow_keys;
> >
> > the bpf prog form patch 4 looks much better now. Thanks!
> >
> > >  };
> > >
> > >  struct bpf_tunnel_key {
> > > @@ -2778,4 +2781,27 @@ enum bpf_task_fd_type {
> > >       BPF_FD_TYPE_URETPROBE,          /* filename + offset */
> > >  };
> > >
> > > +struct bpf_flow_keys {
> > > +     __u16   nhoff;
> > > +     __u16   thoff;
> > > +     __u16   addr_proto;                     /* ETH_P_* of valid addrs */
> > > +     __u8    is_frag;
> > > +     __u8    is_first_frag;
> > > +     __u8    is_encap;
> > > +     __be16  n_proto;
> > > +     __u8    ip_proto;
> > > +     union {
> > > +             struct {
> > > +                     __be32  ipv4_src;
> > > +                     __be32  ipv4_dst;
> > > +             };
> > > +             struct {
> > > +                     __u32   ipv6_src[4];    /* in6_addr; network order */
> > > +                     __u32   ipv6_dst[4];    /* in6_addr; network order */
> > > +             };
> > > +     };
> > > +     __be16  sport;
> > > +     __be16  dport;
> > > +};
> >
> > can you please pack it?
> > struct bpf_flow_keys {
> >         __u16                      nhoff;                /*     0     2 */
> >         __u16                      thoff;                /*     2     2 */
> >         __u16                      addr_proto;           /*     4     2 */
> >         __u8                       is_frag;              /*     6     1 */
> >         __u8                       is_first_frag;        /*     7     1 */
> >         __u8                       is_encap;             /*     8     1 */
> >
> >         /* XXX 1 byte hole, try to pack */
> >
> >         __be16                     n_proto;              /*    10     2 */
> >         __u8                       ip_proto;             /*    12     1 */
> >
> >         /* XXX 3 bytes hole, try to pack */
> >
> >         union {
> >
> > also is_frag and other fields are not used by the kernel and
> > only used by the prog to pass data between tail_calls ?
> 
> No, these are mapped directly onto fields in struct flow_keys
> on return from the BPF program in __skb_flow_bpf_to_target.
> For is_frag, for instance:
> 
>        if (flow_keys->is_frag)
>                key_control->flags |= FLOW_DIS_IS_FRAGMENT;

right. my search-fu failed me. only packing is needed then.

> This is true for all fields in the struct except nhoff.

> > In such case reserve some space in bpf_flow_keys similar to skb->cb
> > so it can contain any fields and accommodate for inevitable changes
> > to bpf flow dissector prog in the future.
> 
> Do you mean a second scratch space akin to cb[], just a few
> reserved padding bytes?

it looks to me it's possible to rearrange the fields to avoid all holes,
so no extra padding bytes necessary.

> We have given some thought to forward compatibility. The existing
> fields cannot be changed, but it should be fine if we need to expand
> the struct later.

let's keep cb-like idea for later. It seems to me we can add it to
the end of bpf_flow_keys any time later.

^ permalink raw reply

* Re: [PATCH bpf-next 07/11] bpf: Add helper to retrieve socket in BPF
From: Joe Stringer @ 2018-09-13 20:57 UTC (permalink / raw)
  To: Joe Stringer
  Cc: Alexei Starovoitov, daniel, netdev, ast, john fastabend, tgraf,
	Martin KaFai Lau, Nitin Hande, mauricio.vasquez
In-Reply-To: <CAOftzPiG6JMb2=U3ZU9D2+0U=1zLqZPgax8OFRHF_1UTcs5Shw@mail.gmail.com>

On Thu, 13 Sep 2018 at 13:55, Joe Stringer <joe@wand.net.nz> wrote:
> struct bpf_sock_tuple4 {
>       __be32 saddr;
>       __be32 daddr;
>       __be16 sport;
>       __be16 dport;
>       __u8 family;
> };
>
> struct bpf_sock_tuple6 {
>       __be32 saddr[4];
>       __be32 daddr[4];
>       __be16 sport;
>       __be16 dport;
>       __u8 family;
> };

(ignore the family bit here, I forgot to remove it..)

^ permalink raw reply

* Re: [PATCH v2,net-next 1/2] ip_gre: fix parsing gre header in ipgre_err
From: Haishuang Yan @ 2018-09-14  2:09 UTC (permalink / raw)
  To: David Miller; +Cc: kuznet, jbenc, netdev, linux-kernel
In-Reply-To: <20180913.105840.140151724801067072.davem@davemloft.net>



> On 2018年9月14日, at 上午1:58, David Miller <davem@davemloft.net> wrote:
> 
> From: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
> Date: Wed, 12 Sep 2018 17:21:21 +0800
> 
>> @@ -86,7 +86,7 @@ int gre_parse_header(struct sk_buff *skb, struct tnl_ptk_info *tpi,
>> 
>> 	options = (__be32 *)(greh + 1);
>> 	if (greh->flags & GRE_CSUM) {
>> -		if (skb_checksum_simple_validate(skb)) {
>> +		if (csum_err && skb_checksum_simple_validate(skb)) {
>> 			*csum_err = true;
>> 			return -EINVAL;
>> 		}
> 
> You want to ignore csum errors, but you do not want to elide the side
> effects of the skb_checksum_simple_validate() call which are to set
> skb->csum_valid and skb->csum.
> 
> Therefore, the skb_checksum_simple_validate() call still needs to be
> performed. We just wont return -EINVAL in the NULL csum_err case.
> 
> 

How about doing like this:

--- a/net/ipv4/gre_demux.c
+++ b/net/ipv4/gre_demux.c
@@ -86,13 +86,14 @@ int gre_parse_header(struct sk_buff *skb, struct tnl_ptk_info *tpi,

        options = (__be32 *)(greh + 1);
        if (greh->flags & GRE_CSUM) {
-               if (skb_checksum_simple_validate(skb)) {
+               if (!skb_checksum_simple_validate(skb)) {
+                       skb_checksum_try_convert(skb, IPPROTO_GRE, 0,
+                                                null_compute_pseudo);
+               } else if (csum_err) {
                        *csum_err = true;
                        return -EINVAL;
                }

-               skb_checksum_try_convert(skb, IPPROTO_GRE, 0,
-                                        null_compute_pseudo);

Thanks for reviewing.

^ permalink raw reply

* Re: [bpf-next, v3 1/5] flow_dissector: implements flow dissector BPF hook
From: Willem de Bruijn @ 2018-09-13 21:00 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Petar Penkov, Network Development, David Miller,
	Alexei Starovoitov, Daniel Borkmann, simon.horman, ecree,
	songliubraving, Tom Herbert, Petar Penkov, Willem de Bruijn
In-Reply-To: <20180913205740.a3zqadwevavshopv@ast-mbp>

On Thu, Sep 13, 2018 at 4:57 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Sep 13, 2018 at 04:51:49PM -0400, Willem de Bruijn wrote:
> > On Thu, Sep 13, 2018 at 3:45 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Thu, Sep 13, 2018 at 10:45:53AM -0700, Petar Penkov wrote:
> > > > From: Petar Penkov <ppenkov@google.com>
> > > >
> > > > Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
> > > > attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
> > > > path. The BPF program is per-network namespace.
> > > >
> > > > Signed-off-by: Petar Penkov <ppenkov@google.com>
> > > > Signed-off-by: Willem de Bruijn <willemb@google.com>
> > > ...
> > > > @@ -2333,6 +2335,7 @@ struct __sk_buff {
> > > >       /* ... here. */
> > > >
> > > >       __u32 data_meta;
> > > > +     struct bpf_flow_keys *flow_keys;
> > >
> > > the bpf prog form patch 4 looks much better now. Thanks!
> > >
> > > >  };
> > > >
> > > >  struct bpf_tunnel_key {
> > > > @@ -2778,4 +2781,27 @@ enum bpf_task_fd_type {
> > > >       BPF_FD_TYPE_URETPROBE,          /* filename + offset */
> > > >  };
> > > >
> > > > +struct bpf_flow_keys {
> > > > +     __u16   nhoff;
> > > > +     __u16   thoff;
> > > > +     __u16   addr_proto;                     /* ETH_P_* of valid addrs */
> > > > +     __u8    is_frag;
> > > > +     __u8    is_first_frag;
> > > > +     __u8    is_encap;
> > > > +     __be16  n_proto;
> > > > +     __u8    ip_proto;
> > > > +     union {
> > > > +             struct {
> > > > +                     __be32  ipv4_src;
> > > > +                     __be32  ipv4_dst;
> > > > +             };
> > > > +             struct {
> > > > +                     __u32   ipv6_src[4];    /* in6_addr; network order */
> > > > +                     __u32   ipv6_dst[4];    /* in6_addr; network order */
> > > > +             };
> > > > +     };
> > > > +     __be16  sport;
> > > > +     __be16  dport;
> > > > +};
> > >
> > > can you please pack it?
> > > struct bpf_flow_keys {
> > >         __u16                      nhoff;                /*     0     2 */
> > >         __u16                      thoff;                /*     2     2 */
> > >         __u16                      addr_proto;           /*     4     2 */
> > >         __u8                       is_frag;              /*     6     1 */
> > >         __u8                       is_first_frag;        /*     7     1 */
> > >         __u8                       is_encap;             /*     8     1 */
> > >
> > >         /* XXX 1 byte hole, try to pack */
> > >
> > >         __be16                     n_proto;              /*    10     2 */
> > >         __u8                       ip_proto;             /*    12     1 */
> > >
> > >         /* XXX 3 bytes hole, try to pack */
> > >
> > >         union {
> > >
> > > also is_frag and other fields are not used by the kernel and
> > > only used by the prog to pass data between tail_calls ?
> >
> > No, these are mapped directly onto fields in struct flow_keys
> > on return from the BPF program in __skb_flow_bpf_to_target.
> > For is_frag, for instance:
> >
> >        if (flow_keys->is_frag)
> >                key_control->flags |= FLOW_DIS_IS_FRAGMENT;
>
> right. my search-fu failed me. only packing is needed then.
>
> > This is true for all fields in the struct except nhoff.
>
> > > In such case reserve some space in bpf_flow_keys similar to skb->cb
> > > so it can contain any fields and accommodate for inevitable changes
> > > to bpf flow dissector prog in the future.
> >
> > Do you mean a second scratch space akin to cb[], just a few
> > reserved padding bytes?
>
> it looks to me it's possible to rearrange the fields to avoid all holes,
> so no extra padding bytes necessary.

Absolutely. We forgot to run pahole earlier. Updating now.

> > We have given some thought to forward compatibility. The existing
> > fields cannot be changed, but it should be fine if we need to expand
> > the struct later.
>
> let's keep cb-like idea for later. It seems to me we can add it to
> the end of bpf_flow_keys any time later.

^ permalink raw reply

* Re: [PATCH bpf-next 07/11] bpf: Add helper to retrieve socket in BPF
From: Alexei Starovoitov @ 2018-09-13 21:01 UTC (permalink / raw)
  To: Joe Stringer
  Cc: daniel, netdev, ast, john fastabend, tgraf, Martin KaFai Lau,
	Nitin Hande, mauricio.vasquez
In-Reply-To: <CAOftzPiG6JMb2=U3ZU9D2+0U=1zLqZPgax8OFRHF_1UTcs5Shw@mail.gmail.com>

On Thu, Sep 13, 2018 at 01:55:01PM -0700, Joe Stringer wrote:
> On Thu, 13 Sep 2018 at 12:06, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Wed, Sep 12, 2018 at 5:06 PM, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > > On Tue, Sep 11, 2018 at 05:36:36PM -0700, Joe Stringer wrote:
> > >> This patch adds new BPF helper functions, bpf_sk_lookup_tcp() and
> > >> bpf_sk_lookup_udp() which allows BPF programs to find out if there is a
> > >> socket listening on this host, and returns a socket pointer which the
> > >> BPF program can then access to determine, for instance, whether to
> > >> forward or drop traffic. bpf_sk_lookup_xxx() may take a reference on the
> > >> socket, so when a BPF program makes use of this function, it must
> > >> subsequently pass the returned pointer into the newly added sk_release()
> > >> to return the reference.
> > >>
> > >> By way of example, the following pseudocode would filter inbound
> > >> connections at XDP if there is no corresponding service listening for
> > >> the traffic:
> > >>
> > >>   struct bpf_sock_tuple tuple;
> > >>   struct bpf_sock_ops *sk;
> > >>
> > >>   populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet
> > >>   sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof tuple, netns, 0);
> > > ...
> > >> +struct bpf_sock_tuple {
> > >> +     union {
> > >> +             __be32 ipv6[4];
> > >> +             __be32 ipv4;
> > >> +     } saddr;
> > >> +     union {
> > >> +             __be32 ipv6[4];
> > >> +             __be32 ipv4;
> > >> +     } daddr;
> > >> +     __be16 sport;
> > >> +     __be16 dport;
> > >> +     __u8 family;
> > >> +};
> > >
> > > since we can pass ptr_to_packet into map lookup and other helpers now,
> > > can you move 'family' out of bpf_sock_tuple and combine with netns_id arg?
> > > then progs wouldn't need to copy bytes from the packet into tuple
> > > to do a lookup.
> 
> If I follow, you're proposing that users should be able to pass a
> pointer to the source address field of the L3 header, and assuming
> that the L3 header ends with saddr+daddr (no options/extheaders), and
> is immediately followed by the sport/dport then a packet pointer
> should work for performing socket lookup. Then it is up to the BPF
> program writer to ensure that this is the case, or otherwise fall back
> to populating a copy of the sock tuple on the stack.

yep.

> > have been thinking more about it.
> > since only ipv4 and ipv6 supported may be use size of bpf_sock_tuple
> > to infer family inside the helper, so it doesn't need to be passed explicitly?
> 
> Let me make sure I understand the proposal here.
> 
> The current structure and function prototypes are:
> 
> struct bpf_sock_tuple {
>       union {
>               __be32 ipv6[4];
>               __be32 ipv4;
>       } saddr;
>       union {
>               __be32 ipv6[4];
>               __be32 ipv4;
>       } daddr;
>       __be16 sport;
>       __be16 dport;
>       __u8 family;
> };
...
> You're proposing something like:
> 
> struct bpf_sock_tuple4 {
>       __be32 saddr;
>       __be32 daddr;
>       __be16 sport;
>       __be16 dport;
>       __u8 family;
> };
> 
> struct bpf_sock_tuple6 {
>       __be32 saddr[4];
>       __be32 daddr[4];
>       __be16 sport;
>       __be16 dport;
>       __u8 family;
> };

I think the split is unnecessary.
I'm proposing:
struct bpf_sock_tuple {
      union {
              __be32 ipv6[4];
              __be32 ipv4;
      } saddr;
      union {
              __be32 ipv6[4];
              __be32 ipv4;
      } daddr;
      __be16 sport;
      __be16 dport;
};

that points directly into the packet (when ipv4 options are not there)
and bpf_sk_lookup_tcp() uses 'size' argument to figure out ipv4/ipv6 family.

^ permalink raw reply

* Re: [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390)
From: maowenan @ 2018-09-14  2:24 UTC (permalink / raw)
  To: Eric Dumazet, gregkh
  Cc: mkubecek, David Woodhouse, netdev, Eric Dumazet, David Miller,
	Yuchung Cheng, jdw, stable, tiwai
In-Reply-To: <CANn89iLVjY+KWs3y=cYW_OpwCLAOt9=x=iTnsPatwzCMYxUpTA@mail.gmail.com>



On 2018/9/13 20:44, Eric Dumazet wrote:
> On Thu, Sep 13, 2018 at 5:32 AM Greg KH <gregkh@linux-foundation.org> wrote:
>>
>> On Thu, Aug 16, 2018 at 05:24:09PM +0200, Greg KH wrote:
>>> On Thu, Aug 16, 2018 at 02:33:56PM +0200, Michal Kubecek wrote:
>>>> On Thu, Aug 16, 2018 at 08:05:50PM +0800, maowenan wrote:
>>>>> On 2018/8/16 19:39, Michal Kubecek wrote:
>>>>>>
>>>>>> I suspect you may be doing something wrong with your tests. I checked
>>>>>> the segmentsmack testcase and the CPU utilization on receiving side
>>>>>> (with sending 10 times as many packets as default) went down from ~100%
>>>>>> to ~3% even when comparing what is in stable 4.4 now against older 4.4
>>>>>> kernel.
>>>>>
>>>>> There seems no obvious problem when you send packets with default
>>>>> parameter in Segmentsmack POC, Which is also very related with your
>>>>> server's hardware configuration. Please try with below parameter to
>>>>> form OFO packets
>>>>
>>>> I did and even with these (questionable, see below) changes, I did not
>>>> get more than 10% (of one core) by receiving ksoftirqd.
>>>>
>>>>>       for (i = 0; i < 1024; i++)      // 128->1024
>>>> ...
>>>>>       usleep(10*1000); // Adjust this and packet count to match the target!, sleep 100ms->10ms
>>>>
>>>> The comment in the testcase source suggests to do _one_ of these two
>>>> changes so that you generate 10 times as many packets as the original
>>>> testcase. You did both so that you end up sending 102400 packets per
>>>> second. With 55 byte long packets, this kind of attack requires at least
>>>> 5.5 MB/s (44 Mb/s) of throughput. This is no longer a "low packet rate
>>>> DoS", I'm afraid.
>>>>
>>>> Anyway, even at this rate, I only get ~10% of one core (Intel E5-2697).
>>>>
>>>> What I can see, though, is that with current stable 4.4 code, modified
>>>> testcase which sends something like
>>>>
>>>>   2:3, 3:4, ..., 3001:3002, 3003:3004, 3004:3005, ... 6001:6002, ...
>>>>
>>>> I quickly eat 6 MB of memory for receive queue of one socket while
>>>> earlier 4.4 kernels only take 200-300 KB. I didn't test latest 4.4 with
>>>> Takashi's follow-up yet but I'm pretty sure it will help while
>>>> preserving nice performance when using the original segmentsmack
>>>> testcase (with increased packet ratio).
>>>
>>> Ok, for now I've applied Takashi's fix to the 4.4 stable queue and will
>>> push out a new 4.4-rc later tonight.  Can everyone standardize on that
>>> and test and let me know if it does, or does not, fix the reported
>>> issues?
>>>
>>> If not, we can go from there and evaluate this much larger patch series.
>>> But let's try the simple thing first.
>>
>> So, is the issue still present on the latest 4.4 release?  Has anyone
>> tested it?  If not, I'm more than willing to look at backported patches,
>> but I want to ensure that they really are needed here.
>>
>> thanks,
> 
> Honestly, TCP stack without rb-tree for the OOO queue is vulnerable,
> even with non malicious sender,
> but with big enough TCP receive window and a not favorable network.
> 
> So a malicious peer can definitely send packets needed to make TCP
> stack behave in O(N), which is pretty bad if N is big...
> 
> 9f5afeae51526b3ad7b7cb21ee8b145ce6ea7a7a ("tcp: use an RB tree for ooo
> receive queue")
> was proven to be almost bug free [1], and should be backported if possible.
> 
> [1] bug fixed :
> 76f0dcbb5ae1a7c3dbeec13dd98233b8e6b0b32a tcp: fix a stale ooo_last_skb
> after a replace

Thank you for Eric's suggestion, I will do some work to backport them.
> 
> .
> 

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox