Netdev List
 help / color / mirror / Atom feed
* Re: [bpf PATCH 1/2] bpf: Document sockmap '-target bpf' requirement for PROG_TYPE_SK_MSG
From: Alexei Starovoitov @ 2018-04-23 19:35 UTC (permalink / raw)
  To: John Fastabend; +Cc: ast, daniel, netdev
In-Reply-To: <20180423191102.21348.85601.stgit@john-Precision-Tower-5810>

On Mon, Apr 23, 2018 at 12:11:02PM -0700, John Fastabend wrote:
> BPF_PROG_TYPE_SK_MSG programs use a 'void *' for both data and the
> data_end pointers. Additionally, the verifier ensures that every
> accesses into the values is a __u64 read. This correctly maps on
> to the BPF 64-bit architecture.
> 
> However, to ensure that when building on 32bit architectures that
> clang uses correct types the '-target bpf' option _must_ be
> specified. To make this clear add a note to the Documentation.
> 
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [bpf PATCH 2/2] bpf: sockmap sample use clang flag, -target bpf
From: Alexei Starovoitov @ 2018-04-23 19:35 UTC (permalink / raw)
  To: John Fastabend; +Cc: ast, daniel, netdev
In-Reply-To: <20180423191107.21348.15124.stgit@john-Precision-Tower-5810>

On Mon, Apr 23, 2018 at 12:11:08PM -0700, John Fastabend wrote:
> Per Documentation/bpf/bpf_devel_QA.txt add the -target flag to the
> sockmap Makefile. Relevant text quoted here,
> 
>    Otherwise, you can use bpf target. Additionally, you _must_ use
>    bpf target when:
> 
>  - Your program uses data structures with pointer or long / unsigned
>    long types that interface with BPF helpers or context data
>    structures. Access into these structures is verified by the BPF
>    verifier and may result in verification failures if the native
>    architecture is not aligned with the BPF architecture, e.g. 64-bit.
>    An example of this is BPF_PROG_TYPE_SK_MSG require '-target bpf'
> 
> Fixes: 69e8cc134bcb ("bpf: sockmap sample program")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [PATCH bpf-next v5 03/10] bpf/verifier: refine retval R0 state for bpf_get_stack helper
From: Alexei Starovoitov @ 2018-04-23 19:42 UTC (permalink / raw)
  To: Yonghong Song; +Cc: ast, daniel, netdev, ecree, kernel-team
In-Reply-To: <20180423175417.4104464-4-yhs@fb.com>

On Mon, Apr 23, 2018 at 10:54:10AM -0700, Yonghong Song wrote:
> The special property of return values for helpers bpf_get_stack
> and bpf_probe_read_str are captured in verifier.
> Both helpers return a negative error code or
> a length, which is equal to or smaller than the buffer
> size argument. This additional information in the
> verifier can avoid the condition such as "retval > bufsize"
> in the bpf program. For example, for the code blow,
>     usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
>     if (usize < 0 || usize > max_len)
>         return 0;
> The verifier may have the following errors:
>     52: (85) call bpf_get_stack#65
>      R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0)
>      R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256
>      R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
>      R9_w=inv800 R10=fp0,call_-1
>     53: (bf) r8 = r0
>     54: (bf) r1 = r8
>     55: (67) r1 <<= 32
>     56: (bf) r2 = r1
>     57: (77) r2 >>= 32
>     58: (25) if r2 > 0x31f goto pc+33
>      R0=inv(id=0) R1=inv(id=0,smax_value=9223372032559808512,
>                          umax_value=18446744069414584320,
>                          var_off=(0x0; 0xffffffff00000000))
>      R2=inv(id=0,umax_value=799,var_off=(0x0; 0x3ff))
>      R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
>      R8=inv(id=0) R9=inv800 R10=fp0,call_-1
>     59: (1f) r9 -= r8
>     60: (c7) r1 s>>= 32
>     61: (bf) r2 = r7
>     62: (0f) r2 += r1
>     math between map_value pointer and register with unbounded
>     min value is not allowed
> The failure is due to llvm compiler optimization where register "r2",
> which is a copy of "r1", is tested for condition while later on "r1"
> is used for map_ptr operation. The verifier is not able to track such
> inst sequence effectively.
> 
> Without the "usize > max_len" condition, there is no llvm optimization
> and the below generated code passed verifier:
>     52: (85) call bpf_get_stack#65
>      R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0)
>      R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256
>      R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
>      R9_w=inv800 R10=fp0,call_-1
>     53: (b7) r1 = 0
>     54: (bf) r8 = r0
>     55: (67) r8 <<= 32
>     56: (c7) r8 s>>= 32
>     57: (6d) if r1 s> r8 goto pc+24
>      R0=inv(id=0,umax_value=800) R1=inv0 R6=ctx(id=0,off=0,imm=0)
>      R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
>      R8=inv(id=0,umax_value=800,var_off=(0x0; 0x3ff)) R9=inv800
>      R10=fp0,call_-1
>     58: (bf) r2 = r7
>     59: (0f) r2 += r8
>     60: (1f) r9 -= r8
>     61: (bf) r1 = r6
> 
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  kernel/bpf/verifier.c | 25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+)
> 
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index aba9425..d00bf53 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -164,6 +164,8 @@ struct bpf_call_arg_meta {
>  	bool pkt_access;
>  	int regno;
>  	int access_size;
> +	s64 msize_smax_value;
> +	u64 msize_umax_value;
>  };
>  
>  static DEFINE_MUTEX(bpf_verifier_lock);
> @@ -1994,6 +1996,12 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
>  	} else if (arg_type_is_mem_size(arg_type)) {
>  		bool zero_size_allowed = (arg_type == ARG_CONST_SIZE_OR_ZERO);
>  
> +		/* remember the mem_size which may be used later
> +		 * to refine return values.
> +		 */
> +		meta->msize_smax_value = reg->smax_value;
> +		meta->msize_umax_value = reg->umax_value;
> +
>  		/* The register is SCALAR_VALUE; the access check
>  		 * happens using its boundaries.
>  		 */
> @@ -2333,6 +2341,21 @@ static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx)
>  	return 0;
>  }
>  
> +static void do_refine_retval_range(struct bpf_reg_state *regs, int ret_type,
> +				   int func_id,
> +				   struct bpf_call_arg_meta *meta)
> +{
> +	struct bpf_reg_state *ret_reg = &regs[BPF_REG_0];
> +
> +	if (ret_type != RET_INTEGER ||
> +	    (func_id != BPF_FUNC_get_stack &&
> +	     func_id != BPF_FUNC_probe_read_str))
> +		return;
> +
> +	ret_reg->smax_value = meta->msize_smax_value;
> +	ret_reg->umax_value = meta->msize_umax_value;

should we call:
        __reg_deduce_bounds(ret_reg);
        __reg_bound_offset(ret_reg);
here?
It doesn't seem that it will be called later and
ret_reg->var_off state will be more conservative than necessary.

^ permalink raw reply

* Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Siwei Liu @ 2018-04-23 19:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stephen Hemminger, Jiri Pirko, Sridhar Samudrala, David Miller,
	Netdev, virtualization, virtio-dev, Brandeburg, Jesse,
	Alexander Duyck, Jakub Kicinski, Jason Wang
In-Reply-To: <20180423205019-mutt-send-email-mst@kernel.org>

On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:
>> On Mon, 23 Apr 2018 20:24:56 +0300
>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>
>> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:
>> > > > >
>> > > > >I will NAK patches to change to common code for netvsc especially the
>> > > > >three device model.  MS worked hard with distro vendors to support transparent
>> > > > >mode, ans we really can't have a new model; or do backport.
>> > > > >
>> > > > >Plus, DPDK is now dependent on existing model.
>> > > >
>> > > > Sorry, but nobody here cares about dpdk or other similar oddities.
>> > >
>> > > The network device model is a userspace API, and DPDK is a userspace application.
>> >
>> > It is userspace but are you sure dpdk is actually poking at netdevs?
>> > AFAIK it's normally banging device registers directly.
>> >
>> > > You can't go breaking userspace even if you don't like the application.
>> >
>> > Could you please explain how is the proposed patchset breaking
>> > userspace? Ignoring DPDK for now, I don't think it changes the userspace
>> > API at all.
>> >
>>
>> The DPDK has a device driver vdev_netvsc which scans the Linux network devices
>> to look for Linux netvsc device and the paired VF device and setup the
>> DPDK environment.  This setup creates a DPDK failsafe (bondingish) instance
>> and sets up TAP support over the Linux netvsc device as well as the Mellanox
>> VF device.
>>
>> So it depends on existing 2 device model. You can't go to a 3 device model
>> or start hiding devices from userspace.
>
> Okay so how does the existing patch break that? IIUC does not go to
> a 3 device model since netvsc calls failover_register directly.
>
>> Also, I am working on associating netvsc and VF device based on serial number
>> rather than MAC address. The serial number is how Windows works now, and it makes
>> sense for Linux and Windows to use the same mechanism if possible.
>
> Maybe we should support same for virtio ...
> Which serial do you mean? From vpd?
>
> I guess you will want to keep supporting MAC for old hypervisors?
>
> It all seems like a reasonable thing to support in the generic core.

That's the reason why I chose explicit identifier rather than rely on
MAC address to bind/pair a device. MAC address can change. Even if it
can't, malicious guest user can fake MAC address to skip binding.

-Siwei


>
> --
> MST

^ permalink raw reply

* Re: [PATCH bpf-next v5 10/10] tools/bpf: add a test for bpf_get_stack with tracepoint prog
From: Alexei Starovoitov @ 2018-04-23 19:47 UTC (permalink / raw)
  To: Yonghong Song; +Cc: ast, daniel, netdev, ecree, kernel-team
In-Reply-To: <20180423175417.4104464-11-yhs@fb.com>

On Mon, Apr 23, 2018 at 10:54:17AM -0700, Yonghong Song wrote:
> The test_stacktrace_map and test_stacktrace_build_id are
> enhanced to call bpf_get_stack in the helper to get the
> stack trace as well.  The stack traces from bpf_get_stack
> and bpf_get_stackid are compared to ensure that for the
> same stack as represented as the same hash, their ip addresses
> or build id's must be the same.
> 
> Signed-off-by: Yonghong Song <yhs@fb.com>
...
>  /* taken from /sys/kernel/debug/tracing/events/sched/sched_switch/format */
> @@ -44,7 +51,10 @@ struct sched_switch_args {
>  SEC("tracepoint/sched/sched_switch")
>  int oncpu(struct sched_switch_args *ctx)
>  {
> +	__u32 max_len = PERF_MAX_STACK_DEPTH * sizeof(__u64);
>  	__u32 key = 0, val = 0, *value_p;
> +	void *stack_p;
> +

nit: unnecessary extra empty line.
otherwise
Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [PATCH bpf-next v5 06/10] tools/bpf: add bpf_get_stack helper to tools headers
From: Alexei Starovoitov @ 2018-04-23 19:48 UTC (permalink / raw)
  To: Yonghong Song; +Cc: ast, daniel, netdev, ecree, kernel-team
In-Reply-To: <20180423175417.4104464-7-yhs@fb.com>

On Mon, Apr 23, 2018 at 10:54:13AM -0700, Yonghong Song wrote:
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---

please add few words to commit log. Don't leave it empty
even for trivial patches.

^ permalink raw reply

* Re: [RFC V3 PATCH 0/8] Packed ring for vhost
From: Michael S. Tsirkin @ 2018-04-23 19:59 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Jason Wang, kvm, virtualization, netdev, linux-kernel, tiwei.bie,
	jfreimann, wexu
In-Reply-To: <20180423193120.GD30033@char.us.oracle.com>

On Mon, Apr 23, 2018 at 03:31:20PM -0400, Konrad Rzeszutek Wilk wrote:
> On Mon, Apr 23, 2018 at 01:34:52PM +0800, Jason Wang wrote:
> > Hi all:
> > 
> > This RFC implement packed ring layout. The code were tested with
> > Tiwei's RFC V2 a thttps://lkml.org/lkml/2018/4/1/48. Some fixups and
> > tweaks were needed on top of Tiwei's code to make it run. TCP stream
> > and pktgen does not show obvious difference compared with split ring.
> 
> I have to ask then - what is the benefit of this?

You can use this with dpdk within guest.
The DPDK version did see a performance improvement so hopefully with
future versions this will too.

> > 
> > Changes from V2:
> > - do not use & in checking desc_event_flags
> > - off should be most significant bit
> > - remove the workaround of mergeable buffer for dpdk prototype
> > - id should be in the last descriptor in the chain
> > - keep _F_WRITE for write descriptor when adding used
> > - device flags updating should use ADDR_USED type
> > - return error on unexpected unavail descriptor in a chain
> > - return false in vhost_ve_avail_empty is descriptor is available
> > - track last seen avail_wrap_counter
> > - correctly examine available descriptor in get_indirect_packed()
> > - vhost_idx_diff should return u16 instead of bool
> > 
> > Changes from V1:
> > 
> > - Refactor vhost used elem code to avoid open coding on used elem
> > - Event suppression support (compile test only).
> > - Indirect descriptor support (compile test only).
> > - Zerocopy support.
> > - vIOMMU support.
> > - SCSI/VSOCK support (compile test only).
> > - Fix several bugs
> > 
> > For simplicity, I don't implement batching or other optimizations.
> > 
> > Please review.
> > 
> > Jason Wang (8):
> >   vhost: move get_rx_bufs to vhost.c
> >   vhost: hide used ring layout from device
> >   vhost: do not use vring_used_elem
> >   vhost_net: do not explicitly manipulate vhost_used_elem
> >   vhost: vhost_put_user() can accept metadata type
> >   virtio: introduce packed ring defines
> >   vhost: packed ring support
> >   vhost: event suppression for packed ring
> > 
> >  drivers/vhost/net.c                | 136 ++----
> >  drivers/vhost/scsi.c               |  62 +--
> >  drivers/vhost/vhost.c              | 824 ++++++++++++++++++++++++++++++++++---
> >  drivers/vhost/vhost.h              |  47 ++-
> >  drivers/vhost/vsock.c              |  42 +-
> >  include/uapi/linux/virtio_config.h |   9 +
> >  include/uapi/linux/virtio_ring.h   |  32 ++
> >  7 files changed, 926 insertions(+), 226 deletions(-)
> > 
> > -- 
> > 2.7.4
> > 

^ permalink raw reply

* Re: [PATCH bpf-next 02/15] xsk: add user memory registration support sockopt
From: Björn Töpel @ 2018-04-23 20:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Karlsson, Magnus, Duyck, Alexander H, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Willem de Bruijn, Daniel Borkmann, Netdev, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali,
	Zhang, Qi Z
In-Reply-To: <20180423190615-mutt-send-email-mst@kernel.org>

2018-04-23 18:18 GMT+02:00 Michael S. Tsirkin <mst@redhat.com>:

[...]

>> +static void xdp_umem_unpin_pages(struct xdp_umem *umem)
>> +{
>> +     unsigned int i;
>> +
>> +     if (umem->pgs) {
>> +             for (i = 0; i < umem->npgs; i++)
>
> Since you pin them with FOLL_WRITE, I assume these pages
> are written to.
> Don't you need set_page_dirty_lock here?
>

Hmm, I actually *removed* it from the RFC V2, but after doing some
homework, I think you're right. Thanks for pointing this out!

Thinking more about this; This function is called from sk_destruct,
and in the Tx case the sk_destruct can be called from interrupt
context, where set_page_dirty_lock cannot be called.

Are there any preferred ways of solving this? Scheduling the whole
xsk_destruct call to a workqueue is one way (I think). Any
cleaner/better way?

[...]

>> +static int __xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
>> +{
>> +     u32 frame_size = mr->frame_size, frame_headroom = mr->frame_headroom;
>> +     u64 addr = mr->addr, size = mr->len;
>> +     unsigned int nframes;
>> +     int size_chk, err;
>> +
>> +     if (frame_size < XDP_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
>> +             /* Strictly speaking we could support this, if:
>> +              * - huge pages, or*
>
> what does "or*" here mean?
>

Oops, I'll change to just 'or' in the next revision.


Thanks!
Björn

^ permalink raw reply

* Re: [nf-next] netfilter: extend SRH match to support matching previous, next and last SID
From: Ahmed Abdelsalam @ 2018-04-23 20:01 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: fw, davem, dav.lebrun, linux-kernel, netfilter-devel, coreteam,
	netdev
In-Reply-To: <20180423173047.gsf2xjlmpichyvte@salvia>

On Mon, 23 Apr 2018 19:30:47 +0200
Pablo Neira Ayuso <pablo@netfilter.org> wrote:

> On Mon, Apr 23, 2018 at 05:48:22AM -0500, Ahmed Abdelsalam wrote:
> > Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com>
> > ---
> >  include/uapi/linux/netfilter_ipv6/ip6t_srh.h | 22 +++++++++++++--
> >  net/ipv6/netfilter/ip6t_srh.c                | 41 +++++++++++++++++++++++++++-
> >  2 files changed, 60 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/uapi/linux/netfilter_ipv6/ip6t_srh.h b/include/uapi/linux/netfilter_ipv6/ip6t_srh.h
> > index f3cc0ef..9808382 100644
> > --- a/include/uapi/linux/netfilter_ipv6/ip6t_srh.h
> > +++ b/include/uapi/linux/netfilter_ipv6/ip6t_srh.h
> > @@ -17,7 +17,10 @@
> >  #define IP6T_SRH_LAST_GT        0x0100
> >  #define IP6T_SRH_LAST_LT        0x0200
> >  #define IP6T_SRH_TAG            0x0400
> > -#define IP6T_SRH_MASK           0x07FF
> > +#define IP6T_SRH_PSID           0x0800
> > +#define IP6T_SRH_NSID           0x1000
> > +#define IP6T_SRH_LSID           0x2000
> > +#define IP6T_SRH_MASK           0x3FFF
> >  
> >  /* Values for "mt_invflags" field in struct ip6t_srh */
> >  #define IP6T_SRH_INV_NEXTHDR    0x0001
> > @@ -31,7 +34,10 @@
> >  #define IP6T_SRH_INV_LAST_GT    0x0100
> >  #define IP6T_SRH_INV_LAST_LT    0x0200
> >  #define IP6T_SRH_INV_TAG        0x0400
> > -#define IP6T_SRH_INV_MASK       0x07FF
> > +#define IP6T_SRH_INV_PSID       0x0800
> > +#define IP6T_SRH_INV_NSID       0x1000
> > +#define IP6T_SRH_INV_LSID       0x2000
> > +#define IP6T_SRH_INV_MASK       0x3FFF
> >  
> >  /**
> >   *      struct ip6t_srh - SRH match options
> > @@ -40,6 +46,12 @@
> >   *      @ segs_left: Segments left field of SRH
> >   *      @ last_entry: Last entry field of SRH
> >   *      @ tag: Tag field of SRH
> > + *      @ psid_addr: Address of previous SID in SRH SID list
> > + *      @ nsid_addr: Address of NEXT SID in SRH SID list
> > + *      @ lsid_addr: Address of LAST SID in SRH SID list
> > + *      @ psid_msk: Mask of previous SID in SRH SID list
> > + *      @ nsid_msk: Mask of next SID in SRH SID list
> > + *      @ lsid_msk: MAsk of last SID in SRH SID list
> >   *      @ mt_flags: match options
> >   *      @ mt_invflags: Invert the sense of match options
> >   */
> > @@ -50,6 +62,12 @@ struct ip6t_srh {
> >  	__u8                    segs_left;
> >  	__u8                    last_entry;
> >  	__u16                   tag;
> > +	struct in6_addr		psid_addr;
> > +	struct in6_addr		nsid_addr;
> > +	struct in6_addr		lsid_addr;
> > +	struct in6_addr		psid_msk;
> > +	struct in6_addr		nsid_msk;
> > +	struct in6_addr		lsid_msk;
> 
> This is changing something exposed through UAPI, so you will need a
> new revision for this.

Could you please advice what should be done in this case? 

> 
> >  	__u16                   mt_flags;
> >  	__u16                   mt_invflags;
> >  };
> > diff --git a/net/ipv6/netfilter/ip6t_srh.c b/net/ipv6/netfilter/ip6t_srh.c
> > index 33719d5..2b5cc73 100644
> > --- a/net/ipv6/netfilter/ip6t_srh.c
> > +++ b/net/ipv6/netfilter/ip6t_srh.c
> > @@ -30,7 +30,9 @@ static bool srh_mt6(const struct sk_buff *skb, struct xt_action_param *par)
> >  	const struct ip6t_srh *srhinfo = par->matchinfo;
> >  	struct ipv6_sr_hdr *srh;
> >  	struct ipv6_sr_hdr _srh;
> > -	int hdrlen, srhoff = 0;
> > +	int hdrlen, psidoff, nsidoff, lsidoff, srhoff = 0;
> > +	struct in6_addr *psid, *nsid, *lsid;
> > +	struct in6_addr _psid, _nsid, _lsid;
> 
> Could you rearrange variable definitions? ie. longest line first, eg.
> 
> 	int hdrlen, psidoff, nsidoff, lsidoff, srhoff = 0;
>   	const struct ip6t_srh *srhinfo = par->matchinfo;
> 	struct in6_addr *psid, *nsid, *lsid;
>   	struct ipv6_sr_hdr *srh;
>   	struct ipv6_sr_hdr _srh;
> 

Ok I will re-arrange them in reverse christmas tree form. 

Ahmed 

-- 
Ahmed Abdelsalam <amsalam20@gmail.com>

^ permalink raw reply

* Re: [PATCH bpf] bpf: disable and restore preemption in __BPF_PROG_RUN_ARRAY
From: Alexei Starovoitov @ 2018-04-23 20:05 UTC (permalink / raw)
  To: Roman Gushchin; +Cc: netdev, kernel-team, Alexei Starovoitov, Daniel Borkmann
In-Reply-To: <20180423170921.16162-1-guro@fb.com>

On Mon, Apr 23, 2018 at 06:09:21PM +0100, Roman Gushchin wrote:
> Running bpf programs requires disabled preemption,
> however at least some* of the BPF_PROG_RUN_ARRAY users
> do not follow this rule.
> 
> To fix this bug, and also to make it not happen in the future,
> let's add explicit preemption disabling/re-enabling
> to the __BPF_PROG_RUN_ARRAY code.
> 
> * for example:
>  [   17.624472] RIP: 0010:__cgroup_bpf_run_filter_sk+0x1c4/0x1d0
>  ...
>  [   17.640890]  inet6_create+0x3eb/0x520
>  [   17.641405]  __sock_create+0x242/0x340
>  [   17.641939]  __sys_socket+0x57/0xe0
>  [   17.642370]  ? trace_hardirqs_off_thunk+0x1a/0x1c
>  [   17.642944]  SyS_socket+0xa/0x10
>  [   17.643357]  do_syscall_64+0x79/0x220
>  [   17.643879]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Daniel Borkmann <daniel@iogearbox.net>

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: KASAN: null-ptr-deref Read in refcount_inc_not_zero
From: Cong Wang @ 2018-04-23 20:05 UTC (permalink / raw)
  To: syzbot
  Cc: David Miller, Denys Vlasenko, LKML,
	Linux Kernel Network Developers, syzkaller-bugs, xiaolou4617
In-Reply-To: <000000000000eeeeff056a79954c@google.com>

#syz fix: llc: fix NULL pointer deref for SOCK_ZAPPED

^ permalink raw reply

* Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Michael S. Tsirkin @ 2018-04-23 20:06 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Stephen Hemminger, Jiri Pirko, Sridhar Samudrala, David Miller,
	Netdev, virtualization, virtio-dev, Brandeburg, Jesse,
	Alexander Duyck, Jakub Kicinski, Jason Wang
In-Reply-To: <CADGSJ20ge75T+ddxtUBT4d9m1i3=HLOAHJEoS7Cg0bqnXrutwA@mail.gmail.com>

On Mon, Apr 23, 2018 at 12:44:39PM -0700, Siwei Liu wrote:
> On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:
> >> On Mon, 23 Apr 2018 20:24:56 +0300
> >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >>
> >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:
> >> > > > >
> >> > > > >I will NAK patches to change to common code for netvsc especially the
> >> > > > >three device model.  MS worked hard with distro vendors to support transparent
> >> > > > >mode, ans we really can't have a new model; or do backport.
> >> > > > >
> >> > > > >Plus, DPDK is now dependent on existing model.
> >> > > >
> >> > > > Sorry, but nobody here cares about dpdk or other similar oddities.
> >> > >
> >> > > The network device model is a userspace API, and DPDK is a userspace application.
> >> >
> >> > It is userspace but are you sure dpdk is actually poking at netdevs?
> >> > AFAIK it's normally banging device registers directly.
> >> >
> >> > > You can't go breaking userspace even if you don't like the application.
> >> >
> >> > Could you please explain how is the proposed patchset breaking
> >> > userspace? Ignoring DPDK for now, I don't think it changes the userspace
> >> > API at all.
> >> >
> >>
> >> The DPDK has a device driver vdev_netvsc which scans the Linux network devices
> >> to look for Linux netvsc device and the paired VF device and setup the
> >> DPDK environment.  This setup creates a DPDK failsafe (bondingish) instance
> >> and sets up TAP support over the Linux netvsc device as well as the Mellanox
> >> VF device.
> >>
> >> So it depends on existing 2 device model. You can't go to a 3 device model
> >> or start hiding devices from userspace.
> >
> > Okay so how does the existing patch break that? IIUC does not go to
> > a 3 device model since netvsc calls failover_register directly.
> >
> >> Also, I am working on associating netvsc and VF device based on serial number
> >> rather than MAC address. The serial number is how Windows works now, and it makes
> >> sense for Linux and Windows to use the same mechanism if possible.
> >
> > Maybe we should support same for virtio ...
> > Which serial do you mean? From vpd?
> >
> > I guess you will want to keep supporting MAC for old hypervisors?
> >
> > It all seems like a reasonable thing to support in the generic core.
> 
> That's the reason why I chose explicit identifier rather than rely on
> MAC address to bind/pair a device. MAC address can change. Even if it
> can't, malicious guest user can fake MAC address to skip binding.
> 
> -Siwei

Address should be sampled at device creation to prevent this
kind of hack. Not that it buys the malicious user much:
if you can poke at MAC addresses you probably already can
break networking.




> 
> >
> > --
> > MST

^ permalink raw reply

* Re: [bpf-next PATCH 3/3] bpf: add sample program to trace map events
From: Alexei Starovoitov @ 2018-04-23 20:08 UTC (permalink / raw)
  To: Sebastiano Miano
  Cc: Jesper Dangaard Brouer, netdev, ast, daniel, mingo, rostedt,
	fulvio.risso, David S. Miller
In-Reply-To: <062780b1-6a24-a7f3-175c-db3b02605850@polito.it>

On Mon, Apr 23, 2018 at 04:08:36PM +0200, Sebastiano Miano wrote:
> 
> That's in fact the real use case for the first two patches. Since bpf
> tracepoints are still a rather common (and easy to use) troubleshooting and
> monitoring tool why shouldn't we "enhance" their support with the newly
> added map/prog IDs?

because these tracepoints can be abused in the way that this patch demonstrated.
Whether to keep this patch in the series or not is irrelevant.

^ permalink raw reply

* Re: [nf-next] netfilter: extend SRH match to support matching previous, next and last SID
From: Florian Westphal @ 2018-04-23 20:08 UTC (permalink / raw)
  To: Ahmed Abdelsalam
  Cc: Pablo Neira Ayuso, fw, davem, dav.lebrun, linux-kernel,
	netfilter-devel, coreteam, netdev
In-Reply-To: <20180423220148.03800031d0cb8e8a7a83dc31@gmail.com>

Ahmed Abdelsalam <amsalam20@gmail.com> wrote:
> > > @@ -50,6 +62,12 @@ struct ip6t_srh {
> > >  	__u8                    segs_left;
> > >  	__u8                    last_entry;
> > >  	__u16                   tag;
> > > +	struct in6_addr		psid_addr;
> > > +	struct in6_addr		nsid_addr;
> > > +	struct in6_addr		lsid_addr;
> > > +	struct in6_addr		psid_msk;
> > > +	struct in6_addr		nsid_msk;
> > > +	struct in6_addr		lsid_msk;
> > 
> > This is changing something exposed through UAPI, so you will need a
> > new revision for this.
> 
> Could you please advice what should be done in this case? 

You need to add
struct ip6t_srh_v1 {
	/* copy of struct ip6t_srh here */

	/* new fields go here */
};


Look at xt_conntrack.c, conntrack_mt_reg[] for an example of
multi-revision match.

You can probably re-origanise code to avoid too much duplication.
See 5a786232eb69a1f870ddc0cfd69d5bdef241a2ea in nf.git for an example,
it makes v0 into a v1 struct at runtime and re-uses new v1 code
for old v0.

^ permalink raw reply

* Re: [PATCH 0/3] bpf: Store/dump license string for loaded program
From: Alexei Starovoitov @ 2018-04-23 20:11 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Alexei Starovoitov, Daniel Borkmann, lkml, netdev, Quentin Monnet
In-Reply-To: <20180423065927.23127-1-jolsa@kernel.org>

On Mon, Apr 23, 2018 at 08:59:24AM +0200, Jiri Olsa wrote:
> hi,
> sending the change to store and dump the license
> info for loaded BPF programs. It's important for
> us get the license info, when investigating on
> screwed up machine.

hmm. boolean flag whether bpf prog is gpl or not
is already exposed via bpf_prog_info.
I see no point of wasting extra 128 bytes of kernel memory.

^ permalink raw reply

* Re: [PATCH bpf-next 02/15] xsk: add user memory registration support sockopt
From: Michael S. Tsirkin @ 2018-04-23 20:11 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Karlsson, Magnus, Duyck, Alexander H, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Willem de Bruijn, Daniel Borkmann, Netdev, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali,
	Zhang, Qi Z
In-Reply-To: <CAJ+HfNjwMFig+aJbacFK--5_1i8F2DLSyAUOvU12Xc-OvJBAzQ@mail.gmail.com>

On Mon, Apr 23, 2018 at 10:00:15PM +0200, Björn Töpel wrote:
> 2018-04-23 18:18 GMT+02:00 Michael S. Tsirkin <mst@redhat.com>:
> 
> [...]
> 
> >> +static void xdp_umem_unpin_pages(struct xdp_umem *umem)
> >> +{
> >> +     unsigned int i;
> >> +
> >> +     if (umem->pgs) {
> >> +             for (i = 0; i < umem->npgs; i++)
> >
> > Since you pin them with FOLL_WRITE, I assume these pages
> > are written to.
> > Don't you need set_page_dirty_lock here?
> >
> 
> Hmm, I actually *removed* it from the RFC V2, but after doing some
> homework, I think you're right. Thanks for pointing this out!
> 
> Thinking more about this; This function is called from sk_destruct,
> and in the Tx case the sk_destruct can be called from interrupt
> context, where set_page_dirty_lock cannot be called.
> 
> Are there any preferred ways of solving this? Scheduling the whole
> xsk_destruct call to a workqueue is one way (I think). Any
> cleaner/better way?
> 
> [...]

Defer unpinning pages until the next tx call?


> >> +static int __xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
> >> +{
> >> +     u32 frame_size = mr->frame_size, frame_headroom = mr->frame_headroom;
> >> +     u64 addr = mr->addr, size = mr->len;
> >> +     unsigned int nframes;
> >> +     int size_chk, err;
> >> +
> >> +     if (frame_size < XDP_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
> >> +             /* Strictly speaking we could support this, if:
> >> +              * - huge pages, or*
> >
> > what does "or*" here mean?
> >
> 
> Oops, I'll change to just 'or' in the next revision.
> 
> 
> Thanks!
> Björn

^ permalink raw reply

* Re: [RFC V3 PATCH 0/8] Packed ring for vhost
From: Konrad Rzeszutek Wilk @ 2018-04-23 20:11 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <20180423224715-mutt-send-email-mst@kernel.org>

On Mon, Apr 23, 2018 at 10:59:43PM +0300, Michael S. Tsirkin wrote:
> On Mon, Apr 23, 2018 at 03:31:20PM -0400, Konrad Rzeszutek Wilk wrote:
> > On Mon, Apr 23, 2018 at 01:34:52PM +0800, Jason Wang wrote:
> > > Hi all:
> > > 
> > > This RFC implement packed ring layout. The code were tested with
> > > Tiwei's RFC V2 a thttps://lkml.org/lkml/2018/4/1/48. Some fixups and
> > > tweaks were needed on top of Tiwei's code to make it run. TCP stream
> > > and pktgen does not show obvious difference compared with split ring.
> > 
> > I have to ask then - what is the benefit of this?
> 
> You can use this with dpdk within guest.
> The DPDK version did see a performance improvement so hopefully with

Is there a link to this performance improvement document?

> future versions this will too.
> 
> > > 
> > > Changes from V2:
> > > - do not use & in checking desc_event_flags
> > > - off should be most significant bit
> > > - remove the workaround of mergeable buffer for dpdk prototype
> > > - id should be in the last descriptor in the chain
> > > - keep _F_WRITE for write descriptor when adding used
> > > - device flags updating should use ADDR_USED type
> > > - return error on unexpected unavail descriptor in a chain
> > > - return false in vhost_ve_avail_empty is descriptor is available
> > > - track last seen avail_wrap_counter
> > > - correctly examine available descriptor in get_indirect_packed()
> > > - vhost_idx_diff should return u16 instead of bool
> > > 
> > > Changes from V1:
> > > 
> > > - Refactor vhost used elem code to avoid open coding on used elem
> > > - Event suppression support (compile test only).
> > > - Indirect descriptor support (compile test only).
> > > - Zerocopy support.
> > > - vIOMMU support.
> > > - SCSI/VSOCK support (compile test only).
> > > - Fix several bugs
> > > 
> > > For simplicity, I don't implement batching or other optimizations.
> > > 
> > > Please review.
> > > 
> > > Jason Wang (8):
> > >   vhost: move get_rx_bufs to vhost.c
> > >   vhost: hide used ring layout from device
> > >   vhost: do not use vring_used_elem
> > >   vhost_net: do not explicitly manipulate vhost_used_elem
> > >   vhost: vhost_put_user() can accept metadata type
> > >   virtio: introduce packed ring defines
> > >   vhost: packed ring support
> > >   vhost: event suppression for packed ring
> > > 
> > >  drivers/vhost/net.c                | 136 ++----
> > >  drivers/vhost/scsi.c               |  62 +--
> > >  drivers/vhost/vhost.c              | 824 ++++++++++++++++++++++++++++++++++---
> > >  drivers/vhost/vhost.h              |  47 ++-
> > >  drivers/vhost/vsock.c              |  42 +-
> > >  include/uapi/linux/virtio_config.h |   9 +
> > >  include/uapi/linux/virtio_ring.h   |  32 ++
> > >  7 files changed, 926 insertions(+), 226 deletions(-)
> > > 
> > > -- 
> > > 2.7.4
> > > 

^ permalink raw reply

* Re: [PATCH net-next 0/2] net/ipv6: couple of fixes for rcu change to from
From: David Miller @ 2018-04-23 20:13 UTC (permalink / raw)
  To: dsahern; +Cc: netdev
In-Reply-To: <20180423183207.9124-1-dsahern@gmail.com>

From: David Ahern <dsahern@gmail.com>
Date: Mon, 23 Apr 2018 11:32:05 -0700

> So many details... I am thankful for all the robots running the
> permutations and tools.

This is why I am not afraid of the robots taking over. :-)

> Two bug fixes from the rcu change to rt->from:
> 1. missing rcu lock in ip6_negative_advice
> 2. rcu dereferences in 2 sites

Series applied, thanks David.

^ permalink raw reply

* Re: [PATCH bpf-next 02/15] xsk: add user memory registration support sockopt
From: Björn Töpel @ 2018-04-23 20:15 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Karlsson, Magnus, Duyck, Alexander H, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Willem de Bruijn, Daniel Borkmann, Netdev, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali,
	Zhang, Qi Z
In-Reply-To: <20180423231026-mutt-send-email-mst@kernel.org>

2018-04-23 22:11 GMT+02:00 Michael S. Tsirkin <mst@redhat.com>:
> On Mon, Apr 23, 2018 at 10:00:15PM +0200, Björn Töpel wrote:
>> 2018-04-23 18:18 GMT+02:00 Michael S. Tsirkin <mst@redhat.com>:
>>
>> [...]
>>
>> >> +static void xdp_umem_unpin_pages(struct xdp_umem *umem)
>> >> +{
>> >> +     unsigned int i;
>> >> +
>> >> +     if (umem->pgs) {
>> >> +             for (i = 0; i < umem->npgs; i++)
>> >
>> > Since you pin them with FOLL_WRITE, I assume these pages
>> > are written to.
>> > Don't you need set_page_dirty_lock here?
>> >
>>
>> Hmm, I actually *removed* it from the RFC V2, but after doing some
>> homework, I think you're right. Thanks for pointing this out!
>>
>> Thinking more about this; This function is called from sk_destruct,
>> and in the Tx case the sk_destruct can be called from interrupt
>> context, where set_page_dirty_lock cannot be called.
>>
>> Are there any preferred ways of solving this? Scheduling the whole
>> xsk_destruct call to a workqueue is one way (I think). Any
>> cleaner/better way?
>>
>> [...]
>
> Defer unpinning pages until the next tx call?
>

If the sock is released, there wont be another tx call. Or am I
missing something obvious?

>
>> >> +static int __xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
>> >> +{
>> >> +     u32 frame_size = mr->frame_size, frame_headroom = mr->frame_headroom;
>> >> +     u64 addr = mr->addr, size = mr->len;
>> >> +     unsigned int nframes;
>> >> +     int size_chk, err;
>> >> +
>> >> +     if (frame_size < XDP_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
>> >> +             /* Strictly speaking we could support this, if:
>> >> +              * - huge pages, or*
>> >
>> > what does "or*" here mean?
>> >
>>
>> Oops, I'll change to just 'or' in the next revision.
>>
>>
>> Thanks!
>> Björn

^ permalink raw reply

* Re: [nf-next] netfilter: extend SRH match to support matching previous, next and last SID
From: Ahmed Abdelsalam @ 2018-04-23 20:16 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Pablo Neira Ayuso, davem, dav.lebrun, linux-kernel,
	netfilter-devel, coreteam, netdev
In-Reply-To: <20180423200844.bq3ksj262brrifnj@breakpoint.cc>

On Mon, 23 Apr 2018 22:08:44 +0200
Florian Westphal <fw@strlen.de> wrote:

> Ahmed Abdelsalam <amsalam20@gmail.com> wrote:
> > > > @@ -50,6 +62,12 @@ struct ip6t_srh {
> > > >  	__u8                    segs_left;
> > > >  	__u8                    last_entry;
> > > >  	__u16                   tag;
> > > > +	struct in6_addr		psid_addr;
> > > > +	struct in6_addr		nsid_addr;
> > > > +	struct in6_addr		lsid_addr;
> > > > +	struct in6_addr		psid_msk;
> > > > +	struct in6_addr		nsid_msk;
> > > > +	struct in6_addr		lsid_msk;
> > > 
> > > This is changing something exposed through UAPI, so you will need a
> > > new revision for this.
> > 
> > Could you please advice what should be done in this case? 
> 
> You need to add
> struct ip6t_srh_v1 {
> 	/* copy of struct ip6t_srh here */
> 
> 	/* new fields go here */
> };
> 
> 
> Look at xt_conntrack.c, conntrack_mt_reg[] for an example of
> multi-revision match.
> 
> You can probably re-origanise code to avoid too much duplication.
> See 5a786232eb69a1f870ddc0cfd69d5bdef241a2ea in nf.git for an example,
> it makes v0 into a v1 struct at runtime and re-uses new v1 code
> for old v0.
> 
> 

Thanks Florian!  

-- 
Ahmed Abdelsalam <amsalam20@gmail.com>

^ permalink raw reply

* Re: [bpf-next PATCH 3/3] bpf: add sample program to trace map events
From: Jesper Dangaard Brouer @ 2018-04-23 20:22 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Sebastiano Miano, netdev, ast, daniel, mingo, rostedt,
	fulvio.risso, David S. Miller, brouer
In-Reply-To: <20180423200801.m2kcrjcoavuibbmk@ast-mbp>

On Mon, 23 Apr 2018 14:08:02 -0600
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Mon, Apr 23, 2018 at 04:08:36PM +0200, Sebastiano Miano wrote:
> > 
> > That's in fact the real use case for the first two patches. Since bpf
> > tracepoints are still a rather common (and easy to use) troubleshooting and
> > monitoring tool why shouldn't we "enhance" their support with the newly
> > added map/prog IDs?  
> 
> because these tracepoints can be abused in the way that this patch demonstrated.
> Whether to keep this patch in the series or not is irrelevant.

I don't understand your abuse use-case, can you explain what you mean?

You do realize that these tracepoints can _only_ monitor the userspace
map activity (not kernel map changes) ... and we _do_ need a way to
debug this (and without the map_id I can tell which map).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH 00/12] Netfilter/IPVS fixes for net
From: David Miller @ 2018-04-23 20:22 UTC (permalink / raw)
  To: pablo; +Cc: netfilter-devel, netdev
In-Reply-To: <20180423175714.9794-1-pablo@netfilter.org>

From: Pablo Neira Ayuso <pablo@netfilter.org>
Date: Mon, 23 Apr 2018 19:57:02 +0200

> The following patchset contains Netfilter/IPVS fixes for your net tree,
> they are:
 ...
> You can pull these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git

Pulled, thank you.

^ permalink raw reply

* Re: [PATCH bpf-next 02/15] xsk: add user memory registration support sockopt
From: Michael S. Tsirkin @ 2018-04-23 20:26 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Karlsson, Magnus, Duyck, Alexander H, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Willem de Bruijn, Daniel Borkmann, Netdev, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali,
	Zhang, Qi Z
In-Reply-To: <CAJ+HfNhbNeBfiMP+dK0ZGU0dfYrkWpo5jOAidUoUnkxRMBb0xg@mail.gmail.com>

On Mon, Apr 23, 2018 at 10:15:18PM +0200, Björn Töpel wrote:
> 2018-04-23 22:11 GMT+02:00 Michael S. Tsirkin <mst@redhat.com>:
> > On Mon, Apr 23, 2018 at 10:00:15PM +0200, Björn Töpel wrote:
> >> 2018-04-23 18:18 GMT+02:00 Michael S. Tsirkin <mst@redhat.com>:
> >>
> >> [...]
> >>
> >> >> +static void xdp_umem_unpin_pages(struct xdp_umem *umem)
> >> >> +{
> >> >> +     unsigned int i;
> >> >> +
> >> >> +     if (umem->pgs) {
> >> >> +             for (i = 0; i < umem->npgs; i++)
> >> >
> >> > Since you pin them with FOLL_WRITE, I assume these pages
> >> > are written to.
> >> > Don't you need set_page_dirty_lock here?
> >> >
> >>
> >> Hmm, I actually *removed* it from the RFC V2, but after doing some
> >> homework, I think you're right. Thanks for pointing this out!
> >>
> >> Thinking more about this; This function is called from sk_destruct,
> >> and in the Tx case the sk_destruct can be called from interrupt
> >> context, where set_page_dirty_lock cannot be called.
> >>
> >> Are there any preferred ways of solving this? Scheduling the whole
> >> xsk_destruct call to a workqueue is one way (I think). Any
> >> cleaner/better way?
> >>
> >> [...]
> >
> > Defer unpinning pages until the next tx call?
> >
> 
> If the sock is released, there wont be another tx call.

unpin them on socket release too?

> Or am I
> missing something obvious?
> 
> >
> >> >> +static int __xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
> >> >> +{
> >> >> +     u32 frame_size = mr->frame_size, frame_headroom = mr->frame_headroom;
> >> >> +     u64 addr = mr->addr, size = mr->len;
> >> >> +     unsigned int nframes;
> >> >> +     int size_chk, err;
> >> >> +
> >> >> +     if (frame_size < XDP_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
> >> >> +             /* Strictly speaking we could support this, if:
> >> >> +              * - huge pages, or*
> >> >
> >> > what does "or*" here mean?
> >> >
> >>
> >> Oops, I'll change to just 'or' in the next revision.
> >>
> >>
> >> Thanks!
> >> Björn

^ permalink raw reply

* Re: [PATCH net-next v2 0/2] openvswitch: Support conntrack zone limit
From: Pravin Shelar @ 2018-04-23 20:10 UTC (permalink / raw)
  To: David Miller; +Cc: Yi-Hung Wei, Linux Kernel Network Developers
In-Reply-To: <20180423.093927.2276689799061091037.davem@davemloft.net>

On Mon, Apr 23, 2018 at 6:39 AM, David Miller <davem@davemloft.net> wrote:
> From: Yi-Hung Wei <yihung.wei@gmail.com>
> Date: Tue, 17 Apr 2018 17:30:27 -0700
>
>> Currently, nf_conntrack_max is used to limit the maximum number of
>> conntrack entries in the conntrack table for every network namespace.
>> For the VMs and containers that reside in the same namespace,
>> they share the same conntrack table, and the total # of conntrack entries
>> for all the VMs and containers are limited by nf_conntrack_max.  In this
>> case, if one of the VM/container abuses the usage the conntrack entries,
>> it blocks the others from committing valid conntrack entries into the
>> conntrack table.  Even if we can possibly put the VM in different network
>> namespace, the current nf_conntrack_max configuration is kind of rigid
>> that we cannot limit different VM/container to have different # conntrack
>> entries.
>>

Hi
This looks like general problem related to nf zone usage limit, Did
you considered changing nf-conntrack to have a per zone limit, so that
all users of nf-filter can use it. I prefer this to adding a wrapper
in OVS nf-filter layer.

Thanks,
Pravin.

>> To address the aforementioned issue, this patch proposes to have a
>> fine-grained mechanism that could further limit the # of conntrack entries
>> per-zone.  For example, we can designate different zone to different VM,
>> and set conntrack limit to each zone.  By providing this isolation, a
>> mis-behaved VM only consumes the conntrack entries in its own zone, and
>> it will not influence other well-behaved VMs.  Moreover, the users can
>> set various conntrack limit to different zone based on their preference.
>>
>> The proposed implementation utilizes Netfilter's nf_conncount backend
>> to count the number of connections in a particular zone.  If the number of
>> connection is above a configured limitation, OVS will return ENOMEM to the
>> userspace.  If userspace does not configure the zone limit, the limit
>> defaults to zero that is no limitation, which is backward compatible to
>> the behavior without this patch.
>>
>> The first patch defines the conntrack limit netlink definition, and the
>> second patch provides the implementation.
>
> Pravin, I need this series reviewed.
>
> Thank you.

^ permalink raw reply

* [Patch nf] ipvs: initialize tbl->entries after allocation
From: Cong Wang @ 2018-04-23 20:53 UTC (permalink / raw)
  To: netdev
  Cc: lvs-devel, netfilter-devel, Cong Wang, Simon Horman,
	Julian Anastasov, Pablo Neira Ayuso

tbl->entries is not initialized after kmalloc(), therefore
causes an uninit-value warning in ip_vs_lblc_check_expire()
as reported by syzbot.

Reported-by: <syzbot+3dfdea57819073a04f21@syzkaller.appspotmail.com>
Cc: Simon Horman <horms@verge.net.au>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/netfilter/ipvs/ip_vs_lblcr.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/netfilter/ipvs/ip_vs_lblcr.c b/net/netfilter/ipvs/ip_vs_lblcr.c
index 92adc04557ed..bc2bc5eebcb8 100644
--- a/net/netfilter/ipvs/ip_vs_lblcr.c
+++ b/net/netfilter/ipvs/ip_vs_lblcr.c
@@ -534,6 +534,7 @@ static int ip_vs_lblcr_init_svc(struct ip_vs_service *svc)
 	tbl->counter = 1;
 	tbl->dead = false;
 	tbl->svc = svc;
+	atomic_set(&tbl->entries, 0);
 
 	/*
 	 *    Hook periodic timer for garbage collection
-- 
2.13.0

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox