Netdev List
 help / color / mirror / Atom feed
* Re: [bpf-next, v2 1/3] flow_dissector: implements flow dissector BPF hook
From: Willem de Bruijn @ 2018-09-12 22:53 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Petar Penkov, Network Development, David Miller,
	Alexei Starovoitov, Daniel Borkmann, simon.horman, ecree,
	songliubraving, Tom Herbert, Petar Penkov, Willem de Bruijn
In-Reply-To: <20180912222500.57wnc7sjofgatuxu@ast-mbp>

On Wed, Sep 12, 2018 at 6:25 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Sep 12, 2018 at 02:43:37PM -0400, Willem de Bruijn wrote:
> > On Tue, Sep 11, 2018 at 11:47 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Fri, Sep 07, 2018 at 05:11:08PM -0700, Petar Penkov wrote:
> > > > From: Petar Penkov <ppenkov@google.com>
> > > >
> > > > Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
> > > > attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
> > > > path. The BPF program is per-network namespace.
> > > >
> > > > Signed-off-by: Petar Penkov <ppenkov@google.com>
> > > > Signed-off-by: Willem de Bruijn <willemb@google.com>
> > > > ---
> > > >  include/linux/bpf.h            |   1 +
> > > >  include/linux/bpf_types.h      |   1 +
> > > >  include/linux/skbuff.h         |   7 ++
> > > >  include/net/net_namespace.h    |   3 +
> > > >  include/net/sch_generic.h      |  12 ++-
> > > >  include/uapi/linux/bpf.h       |  25 ++++++
> > > >  kernel/bpf/syscall.c           |   8 ++
> > > >  kernel/bpf/verifier.c          |  32 ++++++++
> > > >  net/core/filter.c              |  67 ++++++++++++++++
> > > >  net/core/flow_dissector.c      | 136 +++++++++++++++++++++++++++++++++
> > > >  tools/bpf/bpftool/prog.c       |   1 +
> > > >  tools/include/uapi/linux/bpf.h |  25 ++++++
> > > >  tools/lib/bpf/libbpf.c         |   2 +
> > >
> > > please split up update to tools/include/uapi/linux/bpf.h as a separate patch 2.
> > > We often have conflicts in there, so best to have a separate.
> > > Also please split tools/lib and tools/bpf chnages into patch 3.
> >
> > Will do in v3.
> >
> > > >  struct qdisc_skb_cb {
> > > > -     unsigned int            pkt_len;
> > > > -     u16                     slave_dev_queue_mapping;
> > > > -     u16                     tc_classid;
> > > > +     union {
> > > > +             struct {
> > > > +                     unsigned int            pkt_len;
> > > > +                     u16                     slave_dev_queue_mapping;
> > > > +                     u16                     tc_classid;
> > > > +             };
> > > > +             struct bpf_flow_keys *flow_keys;
> > > > +     };
> > >
> > > is this magic really necessary? flow_dissector runs very early in recv path.
> > > There is no qdisc or conflicts with tcp/ip use of cb.
> > > I think the whole cb block can be used.
> >
> > The flow dissector also runs in the context of TC, from flower.
> > But that is not a reason to use this struct.
> >
> > We need both (a) data shared with the BPF application and between
> > applications using tail-calls, to pass along the parse offset (nhoff),
> > and (b) data not accessible by the program, to store the flow_keys
> > pointer.
> >
> > qdisc_skb_cb already has this split, exposing only the 20B .data
> > field to the application. Flow dissection currently reuses the existing
> > bpf_convert_ctx_access logic for this field.
> >
> > We could create a separate flowdissect_skb_cb struct with the
> > same split setup, but a second constraint is that relevant internal
> > BPF interfaces already expect qdisc_skb_cb, notably
> > bkf_skb_data_end. So the union was easier.
>
> got it. all makes sense.
>
> >
> > There is another way to pass nhoff besides cb[] (see below). But
> > I don't immediately see another place to store the flow_keys ptr.
> >
> > At least, if we pass skb as context. One larger change would
> > be to introduce another ctx struct, similar to sk_reuseport_(kern|md).
>
> yeah. thought about this too, but your approach looks easier and faster.
> Accesses to skb have one less dereference.
>
> > > > @@ -2333,6 +2335,7 @@ struct __sk_buff {
> > > >       /* ... here. */
> > > >
> > > >       __u32 data_meta;
> > > > +     __u32 flow_keys;
> > >
> > > please use
> > > struct bpf_flow_keys *flow_keys;
> > > instead.
> > >
> > > See what we did in 'struct sk_msg_md' and in 'struct sk_reuseport_md'.
> > > There is no need to hide pointers in u32.
> > >
> >
> > Will do in v3.
> >
> > > > @@ -658,6 +754,46 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
> > > >                                             FLOW_DISSECTOR_KEY_BASIC,
> > > >                                             target_container);
> > > >
> > > > +     rcu_read_lock();
> > > > +     attached = skb ? rcu_dereference(dev_net(skb->dev)->flow_dissector_prog)
> > > > +                    : NULL;
> > > > +     if (attached) {
> > > > +             /* Note that even though the const qualifier is discarded
> > > > +              * throughout the execution of the BPF program, all changes(the
> > > > +              * control block) are reverted after the BPF program returns.
> > > > +              * Therefore, __skb_flow_dissect does not alter the skb.
> > > > +              */
> > > > +             struct bpf_flow_keys flow_keys = {};
> > > > +             struct qdisc_skb_cb cb_saved;
> > > > +             struct qdisc_skb_cb *cb;
> > > > +             u16 *pseudo_cb;
> > > > +             u32 result;
> > > > +
> > > > +             cb = qdisc_skb_cb(skb);
> > > > +             pseudo_cb = (u16 *)bpf_skb_cb((struct sk_buff *)skb);
> > > > +
> > > > +             /* Save Control Block */
> > > > +             memcpy(&cb_saved, cb, sizeof(cb_saved));
> > > > +             memset(cb, 0, sizeof(cb_saved));
> > > > +
> > > > +             /* Pass parameters to the BPF program */
> > > > +             cb->flow_keys = &flow_keys;
> > > > +             *pseudo_cb = nhoff;
> > >
> > > I don't understand this bit.
> > > What is this pseudo_cb and why nhoff goes in there?
> > > Some odd way to pass it into the prog?
> >
> > Yes. nhoff passes the offset to the program to start parsing from.
> > Applications also pass this during tail calls.
> >
> > Alternatively we can just add a new field to struct bpf_flow_keys.
>
> I think that certainly will be cleaner and easier to use from
> bpf prog pov. Since flow_keys stay constant any change to nhoff
> between tail_calls will be preserved too. I see no cons to such approach.

Yes, it's definitely simpler. We'll do that.

Thanks!

^ permalink raw reply

* Re: [PATCH bpf-next 05/11] bpf: Macrofy stack state copy
From: Alexei Starovoitov @ 2018-09-12 22:51 UTC (permalink / raw)
  To: Joe Stringer
  Cc: daniel, netdev, ast, john.fastabend, tgraf, kafai, nitin.hande,
	mauricio.vasquez
In-Reply-To: <20180912003640.28316-6-joe@wand.net.nz>

On Tue, Sep 11, 2018 at 05:36:34PM -0700, Joe Stringer wrote:
> An upcoming commit will need very similar copy/realloc boilerplate, so
> refactor the existing stack copy/realloc functions into macros to
> simplify it.
> 
> Signed-off-by: Joe Stringer <joe@wand.net.nz>

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [PATCH bpf-next 04/11] bpf: Add PTR_TO_SOCKET verifier type
From: Alexei Starovoitov @ 2018-09-12 22:50 UTC (permalink / raw)
  To: Joe Stringer
  Cc: daniel, netdev, ast, john.fastabend, tgraf, kafai, nitin.hande,
	mauricio.vasquez
In-Reply-To: <20180912003640.28316-5-joe@wand.net.nz>

On Tue, Sep 11, 2018 at 05:36:33PM -0700, Joe Stringer wrote:
> Teach the verifier a little bit about a new type of pointer, a
> PTR_TO_SOCKET. This pointer type is accessed from BPF through the
> 'struct bpf_sock' structure.
> 
> Signed-off-by: Joe Stringer <joe@wand.net.nz>
> ---
>  include/linux/bpf.h          |  17 +++++
>  include/linux/bpf_verifier.h |   2 +
>  kernel/bpf/verifier.c        | 125 ++++++++++++++++++++++++++++++-----
>  net/core/filter.c            |  30 +++++----
>  4 files changed, 147 insertions(+), 27 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 523481a3471b..6ec93f3d66dd 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -154,6 +154,7 @@ enum bpf_arg_type {
>  
>  	ARG_PTR_TO_CTX,		/* pointer to context */
>  	ARG_ANYTHING,		/* any (initialized) argument is ok */
> +	ARG_PTR_TO_SOCKET,	/* pointer to bpf_sock */
>  };
>  
>  /* type of values returned from helper functions */
> @@ -162,6 +163,7 @@ enum bpf_return_type {
>  	RET_VOID,			/* function doesn't return anything */
>  	RET_PTR_TO_MAP_VALUE,		/* returns a pointer to map elem value */
>  	RET_PTR_TO_MAP_VALUE_OR_NULL,	/* returns a pointer to map elem value or NULL */
> +	RET_PTR_TO_SOCKET_OR_NULL,	/* returns a pointer to a socket or NULL */
>  };
>  
>  /* eBPF function prototype used by verifier to allow BPF_CALLs from eBPF programs
> @@ -212,6 +214,8 @@ enum bpf_reg_type {
>  	PTR_TO_PACKET_META,	 /* skb->data - meta_len */
>  	PTR_TO_PACKET,		 /* reg points to skb->data */
>  	PTR_TO_PACKET_END,	 /* skb->data + headlen */
> +	PTR_TO_SOCKET,		 /* reg points to struct bpf_sock */
> +	PTR_TO_SOCKET_OR_NULL,	 /* reg points to struct bpf_sock or NULL */
>  };
>  
>  /* The information passed from prog-specific *_is_valid_access
> @@ -334,6 +338,11 @@ const struct bpf_func_proto *bpf_get_trace_printk_proto(void);
>  
>  typedef unsigned long (*bpf_ctx_copy_t)(void *dst, const void *src,
>  					unsigned long off, unsigned long len);
> +typedef u32 (*bpf_convert_ctx_access_t)(enum bpf_access_type type,
> +					const struct bpf_insn *src,
> +					struct bpf_insn *dst,
> +					struct bpf_prog *prog,
> +					u32 *target_size);
>  
>  u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size,
>  		     void *ctx, u64 ctx_size, bpf_ctx_copy_t ctx_copy);
> @@ -827,4 +836,12 @@ extern const struct bpf_func_proto bpf_get_local_storage_proto;
>  void bpf_user_rnd_init_once(void);
>  u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
>  
> +bool bpf_sock_is_valid_access(int off, int size, enum bpf_access_type type,
> +			      struct bpf_insn_access_aux *info);
> +u32 bpf_sock_convert_ctx_access(enum bpf_access_type type,
> +			        const struct bpf_insn *si,
> +			        struct bpf_insn *insn_buf,
> +			        struct bpf_prog *prog,
> +			        u32 *target_size);
> +
>  #endif /* _LINUX_BPF_H */
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index af262b97f586..23a2b17bfd75 100644
> --- a/include/linux/bpf_verifier.h
> +++ b/include/linux/bpf_verifier.h
> @@ -58,6 +58,8 @@ struct bpf_reg_state {
>  	 * offset, so they can share range knowledge.
>  	 * For PTR_TO_MAP_VALUE_OR_NULL this is used to share which map value we
>  	 * came from, when one is tested for != NULL.
> +	 * For PTR_TO_SOCKET this is used to share which pointers retain the
> +	 * same reference to the socket, to determine proper reference freeing.
>  	 */
>  	u32 id;
>  	/* For scalar types (SCALAR_VALUE), this represents our knowledge of
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index f2357c8c90de..111e031cf65d 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -80,8 +80,8 @@ static const struct bpf_verifier_ops * const bpf_verifier_ops[] = {
>   * (like pointer plus pointer becomes SCALAR_VALUE type)
>   *
>   * When verifier sees load or store instructions the type of base register
> - * can be: PTR_TO_MAP_VALUE, PTR_TO_CTX, PTR_TO_STACK. These are three pointer
> - * types recognized by check_mem_access() function.
> + * can be: PTR_TO_MAP_VALUE, PTR_TO_CTX, PTR_TO_STACK, PTR_TO_SOCKET. These are
> + * four pointer types recognized by check_mem_access() function.
>   *
>   * PTR_TO_MAP_VALUE means that this register is pointing to 'map element value'
>   * and the range of [ptr, ptr + map's value_size) is accessible.
> @@ -266,6 +266,8 @@ static const char * const reg_type_str[] = {
>  	[PTR_TO_PACKET]		= "pkt",
>  	[PTR_TO_PACKET_META]	= "pkt_meta",
>  	[PTR_TO_PACKET_END]	= "pkt_end",
> +	[PTR_TO_SOCKET]		= "sock",
> +	[PTR_TO_SOCKET_OR_NULL] = "sock_or_null",
>  };
>  
>  static char slot_type_char[] = {
> @@ -971,6 +973,8 @@ static bool is_spillable_regtype(enum bpf_reg_type type)
>  	case PTR_TO_PACKET_META:
>  	case PTR_TO_PACKET_END:
>  	case CONST_PTR_TO_MAP:
> +	case PTR_TO_SOCKET:
> +	case PTR_TO_SOCKET_OR_NULL:
>  		return true;
>  	default:
>  		return false;
> @@ -1326,6 +1330,28 @@ static int check_ctx_access(struct bpf_verifier_env *env, int insn_idx, int off,
>  	return -EACCES;
>  }
>  
> +static int check_sock_access(struct bpf_verifier_env *env, u32 regno, int off,
> +			     int size, enum bpf_access_type t)
> +{
> +	struct bpf_reg_state *regs = cur_regs(env);
> +	struct bpf_reg_state *reg = &regs[regno];
> +	struct bpf_insn_access_aux info;
> +
> +	if (reg->smin_value < 0) {
> +		verbose(env, "R%d min value is negative, either use unsigned index or do a if (index >=0) check.\n",
> +			regno);
> +		return -EACCES;
> +	}
> +
> +	if (!bpf_sock_is_valid_access(off, size, t, &info)) {
> +		verbose(env, "invalid bpf_sock_ops access off=%d size=%d\n",
> +			off, size);
> +		return -EACCES;
> +	}
> +
> +	return 0;
> +}
> +
>  static bool __is_pointer_value(bool allow_ptr_leaks,
>  			       const struct bpf_reg_state *reg)
>  {
> @@ -1441,6 +1467,9 @@ static int check_ptr_alignment(struct bpf_verifier_env *env,
>  		 */
>  		strict = true;
>  		break;
> +	case PTR_TO_SOCKET:
> +		pointer_desc = "sock ";
> +		break;
>  	default:
>  		break;
>  	}
> @@ -1697,6 +1726,16 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
>  		err = check_packet_access(env, regno, off, size, false);
>  		if (!err && t == BPF_READ && value_regno >= 0)
>  			mark_reg_unknown(env, regs, value_regno);
> +
> +	} else if (reg->type == PTR_TO_SOCKET) {
> +		if (t == BPF_WRITE) {
> +			verbose(env, "cannot write into socket\n");
> +			return -EACCES;
> +		}
> +		err = check_sock_access(env, regno, off, size, t);
> +		if (!err && value_regno >= 0)
> +			mark_reg_unknown(env, regs, value_regno);
> +
>  	} else {
>  		verbose(env, "R%d invalid mem access '%s'\n", regno,
>  			reg_type_str[reg->type]);
> @@ -1918,6 +1957,10 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
>  		err = check_ctx_reg(env, reg, regno);
>  		if (err < 0)
>  			return err;
> +	} else if (arg_type == ARG_PTR_TO_SOCKET) {
> +		expected_type = PTR_TO_SOCKET;
> +		if (type != expected_type)
> +			goto err_type;
>  	} else if (arg_type_is_mem_ptr(arg_type)) {
>  		expected_type = PTR_TO_STACK;
>  		/* One exception here. In case function allows for NULL to be
> @@ -2511,6 +2554,10 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
>  		}
>  		regs[BPF_REG_0].map_ptr = meta.map_ptr;
>  		regs[BPF_REG_0].id = ++env->id_gen;
> +	} else if (fn->ret_type == RET_PTR_TO_SOCKET_OR_NULL) {
> +		mark_reg_known_zero(env, regs, BPF_REG_0);
> +		regs[BPF_REG_0].type = PTR_TO_SOCKET_OR_NULL;
> +		regs[BPF_REG_0].id = ++env->id_gen;
>  	} else {
>  		verbose(env, "unknown return type %d of func %s#%d\n",
>  			fn->ret_type, func_id_name(func_id), func_id);
> @@ -2648,6 +2695,8 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env,
>  		return -EACCES;
>  	case CONST_PTR_TO_MAP:
>  	case PTR_TO_PACKET_END:
> +	case PTR_TO_SOCKET:
> +	case PTR_TO_SOCKET_OR_NULL:
>  		verbose(env, "R%d pointer arithmetic on %s prohibited\n",
>  			dst, reg_type_str[ptr_reg->type]);
>  		return -EACCES;
> @@ -3595,6 +3644,8 @@ static void mark_ptr_or_null_reg(struct bpf_reg_state *reg, u32 id,
>  			} else {
>  				reg->type = PTR_TO_MAP_VALUE;
>  			}
> +		} else if (reg->type == PTR_TO_SOCKET_OR_NULL) {
> +			reg->type = PTR_TO_SOCKET;
>  		}
>  		/* We don't need id from this point onwards anymore, thus we
>  		 * should better reset it, so that state pruning has chances
> @@ -4369,6 +4420,8 @@ static bool regsafe(struct bpf_reg_state *rold, struct bpf_reg_state *rcur,
>  	case PTR_TO_CTX:
>  	case CONST_PTR_TO_MAP:
>  	case PTR_TO_PACKET_END:
> +	case PTR_TO_SOCKET:
> +	case PTR_TO_SOCKET_OR_NULL:
>  		/* Only valid matches are exact, which memcmp() above
>  		 * would have accepted
>  		 */
> @@ -4646,6 +4699,37 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx)
>  	return 0;
>  }
>  
> +/* Return true if it's OK to have the same insn return a different type. */
> +static bool reg_type_mismatch_ok(enum bpf_reg_type type)
> +{
> +	switch (type) {
> +	case PTR_TO_CTX:
> +	case PTR_TO_SOCKET:
> +	case PTR_TO_SOCKET_OR_NULL:
> +		return false;
> +	default:
> +		return true;
> +	}
> +}
> +
> +/* If an instruction was previously used with particular pointer types, then we
> + * need to be careful to avoid cases such as the below, where it may be ok
> + * for one branch accessing the pointer, but not ok for the other branch:
> + *
> + * R1 = sock_ptr
> + * goto X;
> + * ...
> + * R1 = some_other_valid_ptr;
> + * goto X;
> + * ...
> + * R2 = *(u32 *)(R1 + 0);
> + */
> +static bool reg_type_mismatch(enum bpf_reg_type src, enum bpf_reg_type prev)
> +{
> +	return src != prev && (!reg_type_mismatch_ok(src) ||
> +			       !reg_type_mismatch_ok(prev));
> +}
> +
>  static int do_check(struct bpf_verifier_env *env)
>  {
>  	struct bpf_verifier_state *state;
> @@ -4778,9 +4862,7 @@ static int do_check(struct bpf_verifier_env *env)
>  				 */
>  				*prev_src_type = src_reg_type;
>  
> -			} else if (src_reg_type != *prev_src_type &&
> -				   (src_reg_type == PTR_TO_CTX ||
> -				    *prev_src_type == PTR_TO_CTX)) {
> +			} else if (reg_type_mismatch(src_reg_type, *prev_src_type)) {
>  				/* ABuser program is trying to use the same insn
>  				 * dst_reg = *(u32*) (src_reg + off)
>  				 * with different pointer types:
> @@ -4826,8 +4908,8 @@ static int do_check(struct bpf_verifier_env *env)
>  			if (*prev_dst_type == NOT_INIT) {
>  				*prev_dst_type = dst_reg_type;
>  			} else if (dst_reg_type != *prev_dst_type &&
> -				   (dst_reg_type == PTR_TO_CTX ||
> -				    *prev_dst_type == PTR_TO_CTX)) {
> +				   (!reg_type_mismatch_ok(dst_reg_type) ||
> +				    !reg_type_mismatch_ok(*prev_dst_type))) {

reg_type_mismatch() could have been used here as well ?

>  				verbose(env, "same insn cannot be used with different pointers\n");
>  				return -EINVAL;
>  			}
> @@ -5244,10 +5326,14 @@ static void sanitize_dead_code(struct bpf_verifier_env *env)
>  	}
>  }
>  
> -/* convert load instructions that access fields of 'struct __sk_buff'
> - * into sequence of instructions that access fields of 'struct sk_buff'
> +/* convert load instructions that access fields of a context type into a
> + * sequence of instructions that access fields of the underlying structure:
> + *     struct __sk_buff    -> struct sk_buff
> + *     struct bpf_sock_ops -> struct sock
>   */
> -static int convert_ctx_accesses(struct bpf_verifier_env *env)
> +static int convert_ctx_accesses(struct bpf_verifier_env *env,
> +				bpf_convert_ctx_access_t convert_ctx_access,
> +				enum bpf_reg_type ctx_type)
>  {
>  	const struct bpf_verifier_ops *ops = env->ops;
>  	int i, cnt, size, ctx_field_size, delta = 0;
> @@ -5274,12 +5360,14 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
>  		}
>  	}
>  
> -	if (!ops->convert_ctx_access || bpf_prog_is_dev_bound(env->prog->aux))
> +	if (!convert_ctx_access || bpf_prog_is_dev_bound(env->prog->aux))
>  		return 0;
>  
>  	insn = env->prog->insnsi + delta;
>  
>  	for (i = 0; i < insn_cnt; i++, insn++) {
> +		enum bpf_reg_type ptr_type;
> +
>  		if (insn->code == (BPF_LDX | BPF_MEM | BPF_B) ||
>  		    insn->code == (BPF_LDX | BPF_MEM | BPF_H) ||
>  		    insn->code == (BPF_LDX | BPF_MEM | BPF_W) ||
> @@ -5321,7 +5409,8 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
>  			continue;
>  		}
>  
> -		if (env->insn_aux_data[i + delta].ptr_type != PTR_TO_CTX)
> +		ptr_type = env->insn_aux_data[i + delta].ptr_type;
> +		if (ptr_type != ctx_type)
>  			continue;
>  
>  		ctx_field_size = env->insn_aux_data[i + delta].ctx_field_size;
> @@ -5354,8 +5443,8 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
>  		}
>  
>  		target_size = 0;
> -		cnt = ops->convert_ctx_access(type, insn, insn_buf, env->prog,
> -					      &target_size);
> +		cnt = convert_ctx_access(type, insn, insn_buf, env->prog,
> +					 &target_size);
>  		if (cnt == 0 || cnt >= ARRAY_SIZE(insn_buf) ||
>  		    (ctx_field_size && !target_size)) {
>  			verbose(env, "bpf verifier is misconfigured\n");
> @@ -5899,7 +5988,13 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr)
>  
>  	if (ret == 0)
>  		/* program is valid, convert *(u32*)(ctx + off) accesses */
> -		ret = convert_ctx_accesses(env);
> +		ret = convert_ctx_accesses(env, env->ops->convert_ctx_access,
> +					   PTR_TO_CTX);
> +
> +	if (ret == 0)
> +		/* Convert *(u32*)(sock_ops + off) accesses */
> +		ret = convert_ctx_accesses(env, bpf_sock_convert_ctx_access,
> +					   PTR_TO_SOCKET);

can it be done in single pass instead?
env->insn_aux_data tells us what kind of conversion needs to happen.
while progs are small, I guess, it's ok, but would like to see a follow up
patch to this.

^ permalink raw reply

* Re: [PATCH bpf-next 4/4] tools/bpf: bpftool: add net support
From: Daniel Borkmann @ 2018-09-12 22:40 UTC (permalink / raw)
  To: Yonghong Song, ast, netdev; +Cc: kernel-team
In-Reply-To: <b5b3d620-80a5-3b00-e84f-12ebce6c6925@iogearbox.net>

On 09/13/2018 12:29 AM, Daniel Borkmann wrote:
> On 09/06/2018 01:58 AM, Yonghong Song wrote:
>> Add "bpftool net" support. Networking devices are enumerated
>> to dump device index/name associated with xdp progs.
>>
>> For each networking device, tc classes and qdiscs are enumerated
>> in order to check their bpf filters.
>> In addition, root handle and clsact ingress/egress are also checked for
>> bpf filters.  Not all filter information is printed out. Only ifindex,
>> kind, filter name, prog_id and tag are printed out, which are good
>> enough to show attachment information. If the filter action
>> is a bpf action, its bpf program id, bpf name and tag will be
>> printed out as well.
>>
>> For example,
>>   $ ./bpftool net
>>   xdp [
>>   ifindex 2 devname eth0 prog_id 198
>>   ]
> 
> Could we make the output more terse? E.g. the 'ifindex' and 'devname' is basically
> zero information but will take lots of space. 'eth0 (2)' would for example make it
> shorter. Also info is missing whether the attached prog is driver/hw/generic XDP. :(
> 
>>   tc_filters [
>>   ifindex 2 kind qdisc_htb name prefix_matcher.o:[cls_prefix_matcher_htb]
>>             prog_id 111727 tag d08fe3b4319bc2fd act []
>>   ifindex 2 kind qdisc_clsact_ingress name fbflow_icmp
>>             prog_id 130246 tag 3f265c7f26db62c9 act []
>>   ifindex 2 kind qdisc_clsact_egress name prefix_matcher.o:[cls_prefix_matcher_clsact]
>>             prog_id 111726 tag 99a197826974c876
>>   ifindex 2 kind qdisc_clsact_egress name cls_fg_dscp
>>             prog_id 108619 tag dc4630674fd72dcc act []
>>   ifindex 2 kind qdisc_clsact_egress name fbflow_egress
>>             prog_id 130245 tag 72d2d830d6888d2c
>>   ]
> 
> Similar comment here. Do we need the tag here? I think it's not really needed, e.g.
> the output of bpftool perf [0] doesn't provide it either and therefore makes the list
> as nice one-liners, so overview is much nicer there. Is there a reason that tc progs
> do not show dev name as opposed to xdp progs? Can we shorten everything to make it
> a one-liner like in bpftool perf?
> 
> Should we have a small indicator here if the tc prog was offloaded?
> 
> Does the dump work with tc shared blocks?
> 
> Should we also dump networking related cgroup BPF progs here under bpftool net?

One more thought, I think it would make sense to also explicitly document current
limitations of the progs that this listing does not show where user should consult
other tools aka iproute2 e.g. lwt, seg6, tc bpf actions, etc, so the current scope
would be more clear for users.

> Thanks,
> Daniel
> 
>   [0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b04df400c30235fa347313c9e2a0695549bd2c8e
> 

^ permalink raw reply

* Re: [PATCH] docs: net: Remove TCP congestion document
From: David Miller @ 2018-09-13  3:43 UTC (permalink / raw)
  To: me; +Cc: edumazet, netdev, linux-doc, linux-kernel
In-Reply-To: <20180912081444.17258-1-me@tobin.cc>

From: "Tobin C. Harding" <me@tobin.cc>
Date: Wed, 12 Sep 2018 18:14:44 +1000

> Document is stale, let's remove it.
> 
> Remove TCP congestion document.
> 
> Signed-off-by: Tobin C. Harding <me@tobin.cc>

Applied, thanks.

^ permalink raw reply

* [PATCH v3] PCI: Reprogram bridge prefetch registers on resume
From: Daniel Drake @ 2018-09-13  3:37 UTC (permalink / raw)
  To: bhelgaas-hpIqsD4AKlfQT0dZR+AlfA
  Cc: andy.shevchenko-VuQAYsv1563Yd54FQh9/CA,
	linux-pm-u79uwXL29TY76Z2rM5mHXA, linux-pci-u79uwXL29TY76Z2rM5mHXA,
	rafael.j.wysocki-ral2JQCrhuEAvxtiuMwx3w,
	nic_swsd-Rasf1IRRPZFBDgjK7y7TUQ,
	keith.busch-ral2JQCrhuEAvxtiuMwx3w, netdev-u79uwXL29TY76Z2rM5mHXA,
	nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	rchang-eYqpPyKDWXRBDgjK7y7TUQ, linux-6IF/jdPJHihWk0Htik3J/w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	jonathan.derrick-ral2JQCrhuEAvxtiuMwx3w,
	hkallweit1-Re5JQEeQqe8AvxtiuMwx3w

On 38+ Intel-based Asus products, the nvidia GPU becomes unusable
after S3 suspend/resume. The affected products include multiple
generations of nvidia GPUs and Intel SoCs. After resume, nouveau logs
many errors such as:

    fifo: fault 00 [READ] at 0000005555555000 engine 00 [GR] client 04
          [HUB/FE] reason 4a [] on channel -1 [007fa91000 unknown]
    DRM: failed to idle channel 0 [DRM]

Similarly, the nvidia proprietary driver also fails after resume
(black screen, 100% CPU usage in Xorg process). We shipped a sample
to Nvidia for diagnosis, and their response indicated that it's a
problem with the parent PCI bridge (on the Intel SoC), not the GPU.

Runtime suspend/resume works fine, only S3 suspend is affected.

We found a workaround: on resume, rewrite the Intel PCI bridge
'Prefetchable Base Upper 32 Bits' register (PCI_PREF_BASE_UPPER32). In
the cases that I checked, this register has value 0 and we just have to
rewrite that value.

Linux already saves and restores PCI config space during suspend/resume,
but this register was being skipped because upon resume, it already
has value 0 (the correct, pre-suspend value).

Intel appear to have previously acknowledged this behaviour and the
requirement to rewrite this register.
https://bugzilla.kernel.org/show_bug.cgi?id=116851#c23

Based on that, rewrite the prefetch register values even when that
appears unnecessary.

We have confirmed this solution on all the affected models we have
in-hands (X542UQ, UX533FD, X530UN, V272UN).

Additionally, this solves an issue where r8169 MSI-X interrupts were
broken after S3 suspend/resume on Asus X441UAR. This issue was recently
worked around in commit 7bb05b85bc2d ("r8169: don't use MSI-X on
RTL8106e"). It also fixes the same issue on RTL6186evl/8111evl on an
Aimfor-tech laptop that we had not yet patched. I suspect it will also
fix the issue that was worked around in commit 7c53a722459c ("r8169:
don't use MSI-X on RTL8168g").

Thomas Martitz reports that this change also solves an issue where
the AMD Radeon Polaris 10 GPU on the HP Zbook 14u G5 is unresponsive
after S3 suspend/resume.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=201069
Signed-off-by: Daniel Drake <drake@endlessm.com>
---
 drivers/pci/pci.c | 25 +++++++++++++++++--------
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 29ff9619b5fa..5d58220b6997 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1289,12 +1289,12 @@ int pci_save_state(struct pci_dev *dev)
 EXPORT_SYMBOL(pci_save_state);
 
 static void pci_restore_config_dword(struct pci_dev *pdev, int offset,
-				     u32 saved_val, int retry)
+				     u32 saved_val, int retry, bool force)
 {
 	u32 val;
 
 	pci_read_config_dword(pdev, offset, &val);
-	if (val == saved_val)
+	if (!force && val == saved_val)
 		return;
 
 	for (;;) {
@@ -1313,25 +1313,34 @@ static void pci_restore_config_dword(struct pci_dev *pdev, int offset,
 }
 
 static void pci_restore_config_space_range(struct pci_dev *pdev,
-					   int start, int end, int retry)
+					   int start, int end, int retry,
+					   bool force)
 {
 	int index;
 
 	for (index = end; index >= start; index--)
 		pci_restore_config_dword(pdev, 4 * index,
 					 pdev->saved_config_space[index],
-					 retry);
+					 retry, force);
 }
 
 static void pci_restore_config_space(struct pci_dev *pdev)
 {
 	if (pdev->hdr_type == PCI_HEADER_TYPE_NORMAL) {
-		pci_restore_config_space_range(pdev, 10, 15, 0);
+		pci_restore_config_space_range(pdev, 10, 15, 0, false);
 		/* Restore BARs before the command register. */
-		pci_restore_config_space_range(pdev, 4, 9, 10);
-		pci_restore_config_space_range(pdev, 0, 3, 0);
+		pci_restore_config_space_range(pdev, 4, 9, 10, false);
+		pci_restore_config_space_range(pdev, 0, 3, 0, false);
+	} else if (pdev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
+		pci_restore_config_space_range(pdev, 12, 15, 0, false);
+		/* Force rewriting of prefetch registers to avoid
+		 * S3 resume issues on Intel PCI bridges that occur when
+		 * these registers are not explicitly written.
+		 */
+		pci_restore_config_space_range(pdev, 9, 11, 0, true);
+		pci_restore_config_space_range(pdev, 0, 8, 0, false);
 	} else {
-		pci_restore_config_space_range(pdev, 0, 15, 0);
+		pci_restore_config_space_range(pdev, 0, 15, 0, false);
 	}
 }
 
-- 
2.17.1

_______________________________________________
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau

^ permalink raw reply related

* Re: [PATCH bpf-next 4/4] tools/bpf: bpftool: add net support
From: Daniel Borkmann @ 2018-09-12 22:29 UTC (permalink / raw)
  To: Yonghong Song, ast, netdev; +Cc: kernel-team
In-Reply-To: <20180905235806.1536396-5-yhs@fb.com>

On 09/06/2018 01:58 AM, Yonghong Song wrote:
> Add "bpftool net" support. Networking devices are enumerated
> to dump device index/name associated with xdp progs.
> 
> For each networking device, tc classes and qdiscs are enumerated
> in order to check their bpf filters.
> In addition, root handle and clsact ingress/egress are also checked for
> bpf filters.  Not all filter information is printed out. Only ifindex,
> kind, filter name, prog_id and tag are printed out, which are good
> enough to show attachment information. If the filter action
> is a bpf action, its bpf program id, bpf name and tag will be
> printed out as well.
> 
> For example,
>   $ ./bpftool net
>   xdp [
>   ifindex 2 devname eth0 prog_id 198
>   ]

Could we make the output more terse? E.g. the 'ifindex' and 'devname' is basically
zero information but will take lots of space. 'eth0 (2)' would for example make it
shorter. Also info is missing whether the attached prog is driver/hw/generic XDP. :(

>   tc_filters [
>   ifindex 2 kind qdisc_htb name prefix_matcher.o:[cls_prefix_matcher_htb]
>             prog_id 111727 tag d08fe3b4319bc2fd act []
>   ifindex 2 kind qdisc_clsact_ingress name fbflow_icmp
>             prog_id 130246 tag 3f265c7f26db62c9 act []
>   ifindex 2 kind qdisc_clsact_egress name prefix_matcher.o:[cls_prefix_matcher_clsact]
>             prog_id 111726 tag 99a197826974c876
>   ifindex 2 kind qdisc_clsact_egress name cls_fg_dscp
>             prog_id 108619 tag dc4630674fd72dcc act []
>   ifindex 2 kind qdisc_clsact_egress name fbflow_egress
>             prog_id 130245 tag 72d2d830d6888d2c
>   ]

Similar comment here. Do we need the tag here? I think it's not really needed, e.g.
the output of bpftool perf [0] doesn't provide it either and therefore makes the list
as nice one-liners, so overview is much nicer there. Is there a reason that tc progs
do not show dev name as opposed to xdp progs? Can we shorten everything to make it
a one-liner like in bpftool perf?

Should we have a small indicator here if the tc prog was offloaded?

Does the dump work with tc shared blocks?

Should we also dump networking related cgroup BPF progs here under bpftool net?

Thanks,
Daniel

  [0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b04df400c30235fa347313c9e2a0695549bd2c8e

^ permalink raw reply

* Re: [bpf-next, v2 1/3] flow_dissector: implements flow dissector BPF hook
From: Alexei Starovoitov @ 2018-09-12 22:25 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Petar Penkov, Network Development, David Miller,
	Alexei Starovoitov, Daniel Borkmann, simon.horman, ecree,
	songliubraving, Tom Herbert, Petar Penkov, Willem de Bruijn
In-Reply-To: <CAF=yD-LOvyj9t1O7MpqfXE7eYKpFZn96Gx144vmkFgppZe2UbQ@mail.gmail.com>

On Wed, Sep 12, 2018 at 02:43:37PM -0400, Willem de Bruijn wrote:
> On Tue, Sep 11, 2018 at 11:47 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Fri, Sep 07, 2018 at 05:11:08PM -0700, Petar Penkov wrote:
> > > From: Petar Penkov <ppenkov@google.com>
> > >
> > > Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
> > > attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
> > > path. The BPF program is per-network namespace.
> > >
> > > Signed-off-by: Petar Penkov <ppenkov@google.com>
> > > Signed-off-by: Willem de Bruijn <willemb@google.com>
> > > ---
> > >  include/linux/bpf.h            |   1 +
> > >  include/linux/bpf_types.h      |   1 +
> > >  include/linux/skbuff.h         |   7 ++
> > >  include/net/net_namespace.h    |   3 +
> > >  include/net/sch_generic.h      |  12 ++-
> > >  include/uapi/linux/bpf.h       |  25 ++++++
> > >  kernel/bpf/syscall.c           |   8 ++
> > >  kernel/bpf/verifier.c          |  32 ++++++++
> > >  net/core/filter.c              |  67 ++++++++++++++++
> > >  net/core/flow_dissector.c      | 136 +++++++++++++++++++++++++++++++++
> > >  tools/bpf/bpftool/prog.c       |   1 +
> > >  tools/include/uapi/linux/bpf.h |  25 ++++++
> > >  tools/lib/bpf/libbpf.c         |   2 +
> >
> > please split up update to tools/include/uapi/linux/bpf.h as a separate patch 2.
> > We often have conflicts in there, so best to have a separate.
> > Also please split tools/lib and tools/bpf chnages into patch 3.
> 
> Will do in v3.
> 
> > >  struct qdisc_skb_cb {
> > > -     unsigned int            pkt_len;
> > > -     u16                     slave_dev_queue_mapping;
> > > -     u16                     tc_classid;
> > > +     union {
> > > +             struct {
> > > +                     unsigned int            pkt_len;
> > > +                     u16                     slave_dev_queue_mapping;
> > > +                     u16                     tc_classid;
> > > +             };
> > > +             struct bpf_flow_keys *flow_keys;
> > > +     };
> >
> > is this magic really necessary? flow_dissector runs very early in recv path.
> > There is no qdisc or conflicts with tcp/ip use of cb.
> > I think the whole cb block can be used.
> 
> The flow dissector also runs in the context of TC, from flower.
> But that is not a reason to use this struct.
> 
> We need both (a) data shared with the BPF application and between
> applications using tail-calls, to pass along the parse offset (nhoff),
> and (b) data not accessible by the program, to store the flow_keys
> pointer.
> 
> qdisc_skb_cb already has this split, exposing only the 20B .data
> field to the application. Flow dissection currently reuses the existing
> bpf_convert_ctx_access logic for this field.
> 
> We could create a separate flowdissect_skb_cb struct with the
> same split setup, but a second constraint is that relevant internal
> BPF interfaces already expect qdisc_skb_cb, notably
> bkf_skb_data_end. So the union was easier.

got it. all makes sense.

> 
> There is another way to pass nhoff besides cb[] (see below). But
> I don't immediately see another place to store the flow_keys ptr.
> 
> At least, if we pass skb as context. One larger change would
> be to introduce another ctx struct, similar to sk_reuseport_(kern|md).

yeah. thought about this too, but your approach looks easier and faster.
Accesses to skb have one less dereference.

> > > @@ -2333,6 +2335,7 @@ struct __sk_buff {
> > >       /* ... here. */
> > >
> > >       __u32 data_meta;
> > > +     __u32 flow_keys;
> >
> > please use
> > struct bpf_flow_keys *flow_keys;
> > instead.
> >
> > See what we did in 'struct sk_msg_md' and in 'struct sk_reuseport_md'.
> > There is no need to hide pointers in u32.
> >
> 
> Will do in v3.
> 
> > > @@ -658,6 +754,46 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
> > >                                             FLOW_DISSECTOR_KEY_BASIC,
> > >                                             target_container);
> > >
> > > +     rcu_read_lock();
> > > +     attached = skb ? rcu_dereference(dev_net(skb->dev)->flow_dissector_prog)
> > > +                    : NULL;
> > > +     if (attached) {
> > > +             /* Note that even though the const qualifier is discarded
> > > +              * throughout the execution of the BPF program, all changes(the
> > > +              * control block) are reverted after the BPF program returns.
> > > +              * Therefore, __skb_flow_dissect does not alter the skb.
> > > +              */
> > > +             struct bpf_flow_keys flow_keys = {};
> > > +             struct qdisc_skb_cb cb_saved;
> > > +             struct qdisc_skb_cb *cb;
> > > +             u16 *pseudo_cb;
> > > +             u32 result;
> > > +
> > > +             cb = qdisc_skb_cb(skb);
> > > +             pseudo_cb = (u16 *)bpf_skb_cb((struct sk_buff *)skb);
> > > +
> > > +             /* Save Control Block */
> > > +             memcpy(&cb_saved, cb, sizeof(cb_saved));
> > > +             memset(cb, 0, sizeof(cb_saved));
> > > +
> > > +             /* Pass parameters to the BPF program */
> > > +             cb->flow_keys = &flow_keys;
> > > +             *pseudo_cb = nhoff;
> >
> > I don't understand this bit.
> > What is this pseudo_cb and why nhoff goes in there?
> > Some odd way to pass it into the prog?
> 
> Yes. nhoff passes the offset to the program to start parsing from.
> Applications also pass this during tail calls.
> 
> Alternatively we can just add a new field to struct bpf_flow_keys.

I think that certainly will be cleaner and easier to use from
bpf prog pov. Since flow_keys stay constant any change to nhoff
between tail_calls will be preserved too. I see no cons to such approach.

^ permalink raw reply

* Re: [PATCH net-next v2] net: bridge: add support for sticky fdb entries
From: David Miller @ 2018-09-13  3:30 UTC (permalink / raw)
  To: nikolay; +Cc: netdev, roopa, bridge
In-Reply-To: <20180911063953.15008-1-nikolay@cumulusnetworks.com>

From: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Date: Tue, 11 Sep 2018 09:39:53 +0300

> Add support for entries which are "sticky", i.e. will not change their port
> if they show up from a different one. A new ndm flag is introduced for that
> purpose - NTF_STICKY. We allow to set it only to non-local entries.
> 
> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
> ---
> v2: As Stephen suggested pass the whole ndm_flags because it will probably
>     be used by someone else soon.

Applied, thanks Nikolay.

^ permalink raw reply

* Re: [PATCH bpf] bpf: use __GFP_COMP while allocating page
From: Daniel Borkmann @ 2018-09-12 21:49 UTC (permalink / raw)
  To: Tushar Dave, ast, davem, netdev, john.fastabend
In-Reply-To: <1536783329-784-1-git-send-email-tushar.n.dave@oracle.com>

On 09/12/2018 10:15 PM, Tushar Dave wrote:
> Helper bpg_msg_pull_data() can allocate multiple pages while
> linearizing multiple scatterlist elements into one shared page.
> However, if the shared page has size > PAGE_SIZE, using
> copy_page_to_iter() causes below warning.
> 
> e.g.
> [ 6367.019832] WARNING: CPU: 2 PID: 7410 at lib/iov_iter.c:825
> page_copy_sane.part.8+0x0/0x8
> 
> To avoid above warning, use __GFP_COMP while allocating multiple
> contiguous pages.
> 
> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>

Applied to bpf, thanks Tushar!

^ permalink raw reply

* Re: [PATCH bpf] bpf/verifier: disallow pointer subtraction
From: Daniel Borkmann @ 2018-09-12 21:48 UTC (permalink / raw)
  To: Alexei Starovoitov, David S . Miller; +Cc: netdev, kernel-team
In-Reply-To: <20180912210610.829166-1-ast@kernel.org>

On 09/12/2018 11:06 PM, Alexei Starovoitov wrote:
> Subtraction of pointers was accidentally allowed for unpriv programs
> by commit 82abbf8d2fc4. Revert that part of commit.
> 
> Fixes: 82abbf8d2fc4 ("bpf: do not allow root to mangle valid pointers")
> Reported-by: Jann Horn <jannh@google.com>
> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Applied to bpf, thanks Alexei!

^ permalink raw reply

* [PATCH bpf] bpf/verifier: disallow pointer subtraction
From: Alexei Starovoitov @ 2018-09-12 21:06 UTC (permalink / raw)
  To: David S . Miller; +Cc: daniel, netdev, kernel-team

Subtraction of pointers was accidentally allowed for unpriv programs
by commit 82abbf8d2fc4. Revert that part of commit.

Fixes: 82abbf8d2fc4 ("bpf: do not allow root to mangle valid pointers")
Reported-by: Jann Horn <jannh@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/verifier.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 92246117d2b0..bb07e74b34a2 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3163,7 +3163,7 @@ static int adjust_reg_min_max_vals(struct bpf_verifier_env *env,
 				 * an arbitrary scalar. Disallow all math except
 				 * pointer subtraction
 				 */
-				if (opcode == BPF_SUB){
+				if (opcode == BPF_SUB && env->allow_ptr_leaks) {
 					mark_reg_unknown(env, regs, insn->dst_reg);
 					return 0;
 				}
-- 
2.17.1

^ permalink raw reply related

* [PATCH net] net: rtnl_configure_link: fix dev flags changes arg to __dev_notify_flags
From: Roopa Prabhu @ 2018-09-12 20:21 UTC (permalink / raw)
  To: davem; +Cc: netdev, stephen, liam.mcbirnie

From: Roopa Prabhu <roopa@cumulusnetworks.com>

This fix addresses https://bugzilla.kernel.org/show_bug.cgi?id=201071

Commit 5025f7f7d506 wrongly relied on __dev_change_flags to notify users of
dev flag changes in the case when dev->rtnl_link_state = RTNL_LINK_INITIALIZED.
Fix it by indicating flag changes explicitly to __dev_notify_flags.

Fixes: 5025f7f7d506 ("rtnetlink: add rtnl_link_state check in rtnl_configure_link")
Reported-By: Liam mcbirnie <liam.mcbirnie@boeing.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
---
Dave, if 5025f7f7d506 made it to stable, request you to pls queue this one up too. Thanks.

 net/core/rtnetlink.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 60c9288..63ce2283 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -2810,7 +2810,7 @@ int rtnl_configure_link(struct net_device *dev, const struct ifinfomsg *ifm)
 	}
 
 	if (dev->rtnl_link_state == RTNL_LINK_INITIALIZED) {
-		__dev_notify_flags(dev, old_flags, 0U);
+		__dev_notify_flags(dev, old_flags, (old_flags ^ dev->flags));
 	} else {
 		dev->rtnl_link_state = RTNL_LINK_INITIALIZED;
 		__dev_notify_flags(dev, old_flags, ~0U);
-- 
2.1.4

^ permalink raw reply related

* Re: [PATCH net-next] nfp: report FW vNIC stats in interface stats
From: David Miller @ 2018-09-12 20:21 UTC (permalink / raw)
  To: jakub.kicinski; +Cc: netdev, oss-drivers
In-Reply-To: <20180911134408.15129-1-jakub.kicinski@netronome.com>

From: Jakub Kicinski <jakub.kicinski@netronome.com>
Date: Tue, 11 Sep 2018 06:44:08 -0700

> Report in standard netdev stats drops and errors as well as
> RX multicast from the FW vNIC counters.
> 
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> Reviewed-by: Simon Horman <simon.horman@netronome.com>

Applied, thank you.

^ permalink raw reply

* Re: [PATCH net 0/2] nfp: flower: fixes for flower offload
From: David Miller @ 2018-09-12 20:19 UTC (permalink / raw)
  To: jakub.kicinski; +Cc: netdev, oss-drivers
In-Reply-To: <20180911133845.14214-1-jakub.kicinski@netronome.com>

From: Jakub Kicinski <jakub.kicinski@netronome.com>
Date: Tue, 11 Sep 2018 06:38:43 -0700

> Two fixes for flower matching and tunnel encap.  Pieter fixes
> VLAN matching if the entire VLAN id is masked out and match
> is only performed on the PCP field.  Louis adds validation of
> tunnel flags for encap, most importantly we should not offload
> actions on IPv6 tunnels if it's not supported.

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next 1/5] bpf: use __GFP_COMP while allocating page
From: Tushar Dave @ 2018-09-12 20:15 UTC (permalink / raw)
  To: John Fastabend, ast, daniel, davem, santosh.shilimkar,
	jakub.kicinski, quentin.monnet, jiong.wang, sandipan, kafai, rdna,
	yhs, netdev, rds-devel, sowmini.varadhan
In-Reply-To: <6855742d-5925-0d94-e4f3-74bf118ca3d2@gmail.com>



On 09/12/2018 09:51 AM, John Fastabend wrote:
> On 09/12/2018 09:21 AM, Tushar Dave wrote:
>>
>>
>> On 09/11/2018 12:38 PM, Tushar Dave wrote:
>>> Helper bpg_msg_pull_data() can allocate multiple pages while
>>> linearizing multiple scatterlist elements into one shared page.
>>> However, if the shared page has size > PAGE_SIZE, using
>>> copy_page_to_iter() causes below warning.
>>>
>>> e.g.
>>> [ 6367.019832] WARNING: CPU: 2 PID: 7410 at lib/iov_iter.c:825
>>> page_copy_sane.part.8+0x0/0x8
>>>
>>> To avoid above warning, use __GFP_COMP while allocating multiple
>>> contiguous pages.
>>>
>>> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
>>> ---
>>>    net/core/filter.c | 3 ++-
>>>    1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>> index d301134..0b40f95 100644
>>> --- a/net/core/filter.c
>>> +++ b/net/core/filter.c
>>> @@ -2344,7 +2344,8 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>>>        if (unlikely(bytes_sg_total > copy))
>>>            return -EINVAL;
>>>    -    page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC, get_order(copy));
>>> +    page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC | __GFP_COMP,
>>> +               get_order(copy));
>>>        if (unlikely(!page))
>>>            return -ENOMEM;
>>>        p = page_address(page);
>>
>> I should have mentioned that I could re-order this patch anywhere in
>> patch series (as long as it doesn't break git bisect). I kept it first
>> because I think it is more like a bug fix. I sent it along with these
>> patch series considering we have a context of why and for what I need
>> this patch!
>>
>> Daniel, John,
>>
>> Not sure if you guys hit this page_copy_sane warning. I hit it when RDS
>> copy sg page to userspace using copy_page_to_iter().
>>
> 
> I have not hit this before but I'm working on a set of patches for
> test_sockmap to test the bpf_msg_pull_data() so I'll add a case
> for this. Currently, we only test the simple case where we pull
> data out of a single page in selftests. This was sufficient for
> my use case but missed a handful of other valid cases.
> 
>> example:
>>
>> RDS packet size 8KB represented in scatterlist:
>> sg_data[0].length = 1400
>> sg_data[1].length = 1448
>> sg_data[2].length = 1448
>> sg_data[3].length = 1448
>> sg_data[4].length = 1448
>> sg_data[5].length = 1000
>>
>> If start=0 and end=8192, bpf_msg_pull_data() will linearize all
>> sg_data elements into one shared page. e.g. sg_data[0].length = 8192.
>> Using this sg_data[0].page in function copy_page_to_iter() causes:
>> WARNING: CPU: 2 PID: 7410 at lib/iov_iter.c:825
>> page_copy_sane.part.8+0x0/0x8
>>
>> (FYI, patch 4 has code that does copy_page_to_iter)
>>
> 
> How about sending it as a bugfix against bpf on its own. It
> looks like we could reproduce this with a combination of
> bpf_msg_pull_data() + redirect (to ingress) perhaps. Either
> way seems like a candidate for the bpf fixes tree to me.

Done.

Thanks.
-Tushar
> 
> Thanks,
> John
> 
>>
>> Comments?
>>
>> Thanks in advance,
>> -Tushar
>>
>>
> 
> 

^ permalink raw reply

* [PATCH bpf] bpf: use __GFP_COMP while allocating page
From: Tushar Dave @ 2018-09-12 20:15 UTC (permalink / raw)
  To: ast, daniel, davem, netdev, john.fastabend

Helper bpg_msg_pull_data() can allocate multiple pages while
linearizing multiple scatterlist elements into one shared page.
However, if the shared page has size > PAGE_SIZE, using
copy_page_to_iter() causes below warning.

e.g.
[ 6367.019832] WARNING: CPU: 2 PID: 7410 at lib/iov_iter.c:825
page_copy_sane.part.8+0x0/0x8

To avoid above warning, use __GFP_COMP while allocating multiple
contiguous pages.

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
---
 net/core/filter.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index d301134..0b40f95 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2344,7 +2344,8 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
 	if (unlikely(bytes_sg_total > copy))
 		return -EINVAL;
 
-	page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC, get_order(copy));
+	page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC | __GFP_COMP,
+			   get_order(copy));
 	if (unlikely(!page))
 		return -ENOMEM;
 	p = page_address(page);
-- 
1.8.3.1

^ permalink raw reply related

* Re: [Patch net] tipc: check return value of __tipc_dump_start()
From: David Miller @ 2018-09-12 20:15 UTC (permalink / raw)
  To: xiyou.wangcong; +Cc: netdev, tipc-discussion, jon.maloy, ying.xue
In-Reply-To: <20180911221217.23392-1-xiyou.wangcong@gmail.com>

From: Cong Wang <xiyou.wangcong@gmail.com>
Date: Tue, 11 Sep 2018 15:12:17 -0700

> When __tipc_dump_start() fails with running out of memory,
> we have no reason to continue, especially we should avoid
> calling tipc_dump_done().
> 
> Fixes: 8f5c5fcf3533 ("tipc: call start and done ops directly in __tipc_nl_compat_dumpit()")
> Reported-and-tested-by: syzbot+3f8324abccfbf8c74a9f@syzkaller.appspotmail.com
> Cc: Jon Maloy <jon.maloy@ericsson.com>
> Cc: Ying Xue <ying.xue@windriver.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>

Applied, thanks Cong.

^ permalink raw reply

* Re: [PATCH net 0/4] s390/qeth: fixes 2018-09-12
From: David Miller @ 2018-09-12 20:13 UTC (permalink / raw)
  To: jwi; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, ubraun
In-Reply-To: <20180912133135.12335-1-jwi@linux.ibm.com>

From: Julian Wiedmann <jwi@linux.ibm.com>
Date: Wed, 12 Sep 2018 15:31:31 +0200

> please apply the following qeth fixes for -net.
> 
> Patch 1 resolves a regression in an error path, while patch 2 enables
> the SG support by default that was newly introduced with 4.19.
> Patch 3 takes care of a longstanding problem with large-order
> allocations, and patch 4 fixes a potential out-of-bounds access.

Series applied, thanks Julian.

^ permalink raw reply

* Re: [PATCH iproute2] q_cake: Add printing of no-split-gso option
From: Stephen Hemminger @ 2018-09-12 20:07 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen; +Cc: netdev
In-Reply-To: <20180911223216.15609-1-toke@toke.dk>

On Wed, 12 Sep 2018 00:32:16 +0200
Toke Høiland-Jørgensen <toke@toke.dk> wrote:

> When the GSO splitting was turned into dual split-gso/no-split-gso options,
> the printing of the latter was left out. Add that, so output is consistent
> with the options passed.
> 
> Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>

Applied. I noticed that nat/nonat and wash/nowash have similar missing output.

^ permalink raw reply

* Re: [PATCH bpf] bpf: btf: Fix end boundary calculation for type section
From: Daniel Borkmann @ 2018-09-12 20:02 UTC (permalink / raw)
  To: Martin KaFai Lau, netdev; +Cc: Alexei Starovoitov, kernel-team
In-Reply-To: <20180912172911.3609494-1-kafai@fb.com>

On 09/12/2018 07:29 PM, Martin KaFai Lau wrote:
> The end boundary math for type section is incorrect in
> btf_check_all_metas().  It just happens that hdr->type_off
> is always 0 for now because there are only two sections
> (type and string) and string section must be at the end (ensured
> in btf_parse_str_sec).
> 
> However, type_off may not be 0 if a new section would be added later.
> This patch fixes it.
> 
> Fixes: f80442a4cd18 ("bpf: btf: Change how section is supported in btf_header")
> Reported-by: Dmitry Vyukov <dvyukov@google.com>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>

Applied to bpf, thanks everyone!

^ permalink raw reply

* Re: [RFC v2 1/2] netlink: add NLA_REJECT policy type
From: Johannes Berg @ 2018-09-12 19:37 UTC (permalink / raw)
  To: Michal Kubecek; +Cc: David Miller, linux-wireless, netdev
In-Reply-To: <20180912192925.GD29691@unicorn.suse.cz>

On Wed, 2018-09-12 at 21:29 +0200, Michal Kubecek wrote:

> > 3) eventually replace nlmsg_parse() calls by nlmsg_parse_strict() and
> >    see what breaks? :-) We won't be able to rely on that any time soon
> >    though (unless userspace first checks with a guaranteed rejected
> >    attribute, e.g. one that has NLA_REJECT, perhaps the u64 pad
> >    attributes could be marked such since the kernel can't assume
> >    alignment anyway)
> 
> I'm not so sure we (eventually) want to reject unknown attributes
> everywhere. I don't have any concrete example in mind but I think there
> will be use cases where we want to ignore unrecognized attributes
> (probably per parse call). But it makes sense to require caller to
> explicitely declare this is the case.

Yeah, I think nla_parse() vs. nla_parse_strict() - along with
remembering in review to say "perhaps you should prefer
nla_parse_strict() for this new thing" might be all we want (or
realistically can do).

> > While we're talking about wishlist, I'm also toying with the idea of
> > having some sort of generic mechanism to convert netlink attributes
> > to/from structs, for internal kernel representation; so far though I
> > haven't been able to come up with anything useful.
> 
> I was also thinking about something like this. One motivation was to
> design extensible version of ethtool_ops, the other was allowing complex
> data types (structures, arrays) for ethtool tunables. But I have nothing
> more than a vague idea so far.

I played with macros a bit, but ultimately that wasn't so readable.
There's no way to do upper-casing in the preprocessor :-) Realistically,
I ended up with something that would require either a lot of boiler-
plate code estill, or perhaps double-including a header file (though I
never really finished the latter thought.)

This is what I toyed with earlier today:
https://p.sipsolutions.net/35e260d7debaa406.txt

johannes

^ permalink raw reply

* Re: [PATCH net-next 5/5] ebpf: Add sample ebpf program for SOCKET_SG_FILTER
From: Tushar Dave @ 2018-09-12 19:32 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: ast, daniel, davem, santosh.shilimkar, jakub.kicinski,
	quentin.monnet, jiong.wang, sandipan, john.fastabend, kafai, rdna,
	yhs, netdev, rds-devel, sowmini.varadhan
In-Reply-To: <20180912040038.oobnr4yfzoaajk6k@ast-mbp>



On 09/11/2018 09:00 PM, Alexei Starovoitov wrote:
> On Tue, Sep 11, 2018 at 09:38:04PM +0200, Tushar Dave wrote:
>> Add a sample program that shows how socksg program is used and attached
>> to socket filter. The kernel sample program deals with struct
>> scatterlist that is passed as bpf context.
>>
>> When run in server mode, the sample RDS program opens PF_RDS socket,
>> attaches eBPF program to RDS socket which then uses bpf_msg_pull_data
>> helper to inspect packet data contained in struct scatterlist and
>> returns appropriate action code back to kernel.
>>
>> To ease testing, RDS client functionality is also added so that users
>> can generate RDS packet.
>>
>> Server:
>> [root@lab71 bpf]# ./rds_filter -s 192.168.3.71 -t tcp
>> running server in a loop
>> transport tcp
>> server bound to address: 192.168.3.71 port 4000
>> server listening on 192.168.3.71
>>
>> Client:
>> [root@lab70 bpf]# ./rds_filter -s 192.168.3.71 -c 192.168.3.70 -t tcp
>> transport tcp
>> client bound to address: 192.168.3.70 port 25278
>> client sending 8192 byte message  from 192.168.3.70 to 192.168.3.71 on
>> port 25278
>> payload contains:30 31 32 33 34 35 36 37 38 39 ...
>>
>> Server output:
>> 192.168.3.71 received a packet from 192.168.3.71 of len 8192 cmsg len 0,
>> on port 25278
>> payload contains:30 31 32 33 34 35 36 37 38 39 ...
>> server listening on 192.168.3.71
>>
>> [root@lab71 tushar]# cat /sys/kernel/debug/tracing/trace_pipe
>>            <idle>-0     [038] ..s.   146.947362: 0: 30 31 32
>>            <idle>-0     [038] ..s.   146.947364: 0: 33 34 35
>>
>> Similarly specifying '-t ib' will run this on IB link.
>>
>> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
>> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
>> ---
>>   samples/bpf/Makefile          |   3 +
>>   samples/bpf/rds_filter_kern.c |  42 ++++++
>>   samples/bpf/rds_filter_user.c | 339 ++++++++++++++++++++++++++++++++++++++++++
> 
> please no samples.
> Add this as proper test to tools/testing/selftests/bpf
> that reports PASS/FAIL and can be run automatically.
> samples/bpf is effectively dead code.

Okay :(

-Tushar
> 
> 

^ permalink raw reply

* Re: [PATCH net-next 3/5] ebpf: Add sg_filter_run()
From: Tushar Dave @ 2018-09-12 19:27 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: ast, daniel, davem, santosh.shilimkar, jakub.kicinski,
	quentin.monnet, jiong.wang, sandipan, john.fastabend, kafai, rdna,
	yhs, netdev, rds-devel, sowmini.varadhan
In-Reply-To: <20180912035846.3lwf4wrbqky7vpwe@ast-mbp>



On 09/11/2018 08:58 PM, Alexei Starovoitov wrote:
> On Tue, Sep 11, 2018 at 09:38:02PM +0200, Tushar Dave wrote:
>> When sg_filter_run() is invoked it runs the attached eBPF
>> prog of type BPF_PROG_TYPE_SOCKET_SG_FILTER which deals with
>> struct scatterlist.
>>
>> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
>> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
>> ---
>>   include/linux/filter.h         |  8 ++++++++
>>   include/uapi/linux/bpf.h       |  6 ++++++
>>   net/core/filter.c              | 35 +++++++++++++++++++++++++++++++++++
>>   tools/include/uapi/linux/bpf.h |  6 ++++++
>>   4 files changed, 55 insertions(+)
>>
>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>> index 6791a0a..ae664a9 100644
>> --- a/include/linux/filter.h
>> +++ b/include/linux/filter.h
>> @@ -1113,4 +1113,12 @@ struct bpf_sock_ops_kern {
>>   					 */
>>   };
>>   
>> +enum __socksg_action {
>> +	__SOCKSG_PASS = 0,
>> +	__SOCKSG_DROP,
>> +	__SOCKSG_REDIRECT,
> 
> what is this? I see no code that handles it either in this patch
> or in the later patches?!

Yes, I am not handling these actions in RDS in this patch series. I am
thinking first I should have basic infrastructure in place (and want to
make sure it works and accepted by community) then I will add more
changes for RDS which includes handling socksg actions at RDS level.
But fine, I can add code that handles socksg actions.

Thanks

-Tushar
> 
> 

^ permalink raw reply

* Re: [PATCH net-next 2/5] eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER
From: Tushar Dave @ 2018-09-12 19:25 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: ast, daniel, davem, santosh.shilimkar, jakub.kicinski,
	quentin.monnet, jiong.wang, sandipan, john.fastabend, kafai, rdna,
	yhs, netdev, rds-devel, sowmini.varadhan
In-Reply-To: <20180912035719.ff5mcjg3gbrg52xt@ast-mbp>



On 09/11/2018 08:57 PM, Alexei Starovoitov wrote:
> On Tue, Sep 11, 2018 at 09:38:01PM +0200, Tushar Dave wrote:
>> Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER which uses the
>> existing socket filter infrastructure for bpf program attach and load.
>> SOCKET_SG_FILTER eBPF program receives struct scatterlist as bpf context
>> contrast to SOCKET_FILTER which deals with struct skb. This is useful
>> for kernel entities that don't have skb to represent packet data but
>> want to run eBPF socket filter on packet data that is in form of struct
>> scatterlist e.g. IB/RDMA
>>
>> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
>> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
>> ---
>>   include/linux/bpf_types.h      |  1 +
>>   include/uapi/linux/bpf.h       |  1 +
>>   kernel/bpf/syscall.c           |  1 +
>>   kernel/bpf/verifier.c          |  1 +
>>   net/core/filter.c              | 55 ++++++++++++++++++++++++++++++++++++++++--
>>   samples/bpf/bpf_load.c         | 11 ++++++---
>>   tools/bpf/bpftool/prog.c       |  1 +
>>   tools/include/uapi/linux/bpf.h |  1 +
>>   tools/lib/bpf/libbpf.c         |  3 +++
>>   tools/lib/bpf/libbpf.h         |  2 ++
>

Alexi,

Thank you for reviewing the patches.

> please do not mix core kernel and user space into single patch.
> split tools/include/uapi/linux/bpf.h sync into separate patch
> and changes to tools/lib/bpf as yet another patch.

Sure, I can do that.

> 
>>   10 files changed, 72 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
>> index cd26c09..7dc1503 100644
>> --- a/include/linux/bpf_types.h
>> +++ b/include/linux/bpf_types.h
>> @@ -16,6 +16,7 @@
>>   BPF_PROG_TYPE(BPF_PROG_TYPE_SOCK_OPS, sock_ops)
>>   BPF_PROG_TYPE(BPF_PROG_TYPE_SK_SKB, sk_skb)
>>   BPF_PROG_TYPE(BPF_PROG_TYPE_SK_MSG, sk_msg)
>> +BPF_PROG_TYPE(BPF_PROG_TYPE_SOCKET_SG_FILTER, socksg_filter)
>>   #endif
>>   #ifdef CONFIG_BPF_EVENTS
>>   BPF_PROG_TYPE(BPF_PROG_TYPE_KPROBE, kprobe)
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 66917a4..6ec1e32 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -152,6 +152,7 @@ enum bpf_prog_type {
>>   	BPF_PROG_TYPE_LWT_SEG6LOCAL,
>>   	BPF_PROG_TYPE_LIRC_MODE2,
>>   	BPF_PROG_TYPE_SK_REUSEPORT,
>> +	BPF_PROG_TYPE_SOCKET_SG_FILTER,
>>   };
>>   
>>   enum bpf_attach_type {
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index 3c9636f..5f302b7 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -1361,6 +1361,7 @@ static int bpf_prog_load(union bpf_attr *attr)
>>   
>>   	if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
>>   	    type != BPF_PROG_TYPE_CGROUP_SKB &&
>> +	    type != BPF_PROG_TYPE_SOCKET_SG_FILTER &&
> 
> I'm not comfortable to let unpriv use this right away.
> Can you live with root-only ?

Honestly, I prep this the same as how we treat
BPF_PROG_TYPE_SOCKET_FILTER. But sure I can live with root-only.

> 
>>   	    !capable(CAP_SYS_ADMIN))
>>   		return -EPERM;
>>   
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index f4ff0c5..17fc4d2 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -1234,6 +1234,7 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
>>   	case BPF_PROG_TYPE_LWT_XMIT:
>>   	case BPF_PROG_TYPE_SK_SKB:
>>   	case BPF_PROG_TYPE_SK_MSG:
>> +	case BPF_PROG_TYPE_SOCKET_SG_FILTER:
>>   		if (meta)
>>   			return meta->pkt_access;
>>   
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 0b40f95..469c488 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -1140,7 +1140,8 @@ static void bpf_release_orig_filter(struct bpf_prog *fp)
>>   
>>   static void __bpf_prog_release(struct bpf_prog *prog)
>>   {
>> -	if (prog->type == BPF_PROG_TYPE_SOCKET_FILTER) {
>> +	if (prog->type == BPF_PROG_TYPE_SOCKET_FILTER ||
>> +	    prog->type == BPF_PROG_TYPE_SOCKET_SG_FILTER) {
>>   		bpf_prog_put(prog);
> 
> this doesn't look right.
> Why this is needed?
> Are you using old-style setsockopt to attach?

Yes.

> I think new style of attaching that all bpf prog types that came
> after socket_filter are using is preferred.
> Pls take a look at BPF_PROG_ATTACH cmd.

well, I am not sure if that is going to work.

I did this way so I can attach eBPF prog type
BPF_PROG_TYPE_SOCKET_SG_FILTER to a socket. Like today we can attach
regular socket filter bpf program (e.g. BPF_PROG_TYPE_SOCKET_FILTER) to
TCP, UDP sockets using setsockopt. The only difference between them is,
BPF_PROG_TYPE_SOCKET_FILTER deals with struct sk_buff while
BPF_PROG_TYPE_SOCKET_SG_FILTER deals with strut scatterlist.

e.g. setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, prog_fd,
sizeof(prog_fd[0]));


> 
> Also it looks the first patch doesn't really add the useful logic, but adds
> few lines of code here and there. Then more code comes in patches 3 and 4.
> Please rearrange them that they're reviewable as logical pieces.

okay.

-Tushar
> 
> 

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox