* Re: [bpf PATCH 1/2] bpf: Document sockmap '-target bpf' requirement for PROG_TYPE_SK_MSG
From: Alexei Starovoitov @ 2018-04-23 19:35 UTC (permalink / raw)
To: John Fastabend; +Cc: ast, daniel, netdev
In-Reply-To: <20180423191102.21348.85601.stgit@john-Precision-Tower-5810>
On Mon, Apr 23, 2018 at 12:11:02PM -0700, John Fastabend wrote:
> BPF_PROG_TYPE_SK_MSG programs use a 'void *' for both data and the
> data_end pointers. Additionally, the verifier ensures that every
> accesses into the values is a __u64 read. This correctly maps on
> to the BPF 64-bit architecture.
>
> However, to ensure that when building on 32bit architectures that
> clang uses correct types the '-target bpf' option _must_ be
> specified. To make this clear add a note to the Documentation.
>
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
^ permalink raw reply
* Re: [bpf PATCH 2/2] bpf: sockmap sample use clang flag, -target bpf
From: Alexei Starovoitov @ 2018-04-23 19:35 UTC (permalink / raw)
To: John Fastabend; +Cc: ast, daniel, netdev
In-Reply-To: <20180423191107.21348.15124.stgit@john-Precision-Tower-5810>
On Mon, Apr 23, 2018 at 12:11:08PM -0700, John Fastabend wrote:
> Per Documentation/bpf/bpf_devel_QA.txt add the -target flag to the
> sockmap Makefile. Relevant text quoted here,
>
> Otherwise, you can use bpf target. Additionally, you _must_ use
> bpf target when:
>
> - Your program uses data structures with pointer or long / unsigned
> long types that interface with BPF helpers or context data
> structures. Access into these structures is verified by the BPF
> verifier and may result in verification failures if the native
> architecture is not aligned with the BPF architecture, e.g. 64-bit.
> An example of this is BPF_PROG_TYPE_SK_MSG require '-target bpf'
>
> Fixes: 69e8cc134bcb ("bpf: sockmap sample program")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
^ permalink raw reply
* Re: [PATCH bpf-next v5 03/10] bpf/verifier: refine retval R0 state for bpf_get_stack helper
From: Alexei Starovoitov @ 2018-04-23 19:42 UTC (permalink / raw)
To: Yonghong Song; +Cc: ast, daniel, netdev, ecree, kernel-team
In-Reply-To: <20180423175417.4104464-4-yhs@fb.com>
On Mon, Apr 23, 2018 at 10:54:10AM -0700, Yonghong Song wrote:
> The special property of return values for helpers bpf_get_stack
> and bpf_probe_read_str are captured in verifier.
> Both helpers return a negative error code or
> a length, which is equal to or smaller than the buffer
> size argument. This additional information in the
> verifier can avoid the condition such as "retval > bufsize"
> in the bpf program. For example, for the code blow,
> usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
> if (usize < 0 || usize > max_len)
> return 0;
> The verifier may have the following errors:
> 52: (85) call bpf_get_stack#65
> R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0)
> R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256
> R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
> R9_w=inv800 R10=fp0,call_-1
> 53: (bf) r8 = r0
> 54: (bf) r1 = r8
> 55: (67) r1 <<= 32
> 56: (bf) r2 = r1
> 57: (77) r2 >>= 32
> 58: (25) if r2 > 0x31f goto pc+33
> R0=inv(id=0) R1=inv(id=0,smax_value=9223372032559808512,
> umax_value=18446744069414584320,
> var_off=(0x0; 0xffffffff00000000))
> R2=inv(id=0,umax_value=799,var_off=(0x0; 0x3ff))
> R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
> R8=inv(id=0) R9=inv800 R10=fp0,call_-1
> 59: (1f) r9 -= r8
> 60: (c7) r1 s>>= 32
> 61: (bf) r2 = r7
> 62: (0f) r2 += r1
> math between map_value pointer and register with unbounded
> min value is not allowed
> The failure is due to llvm compiler optimization where register "r2",
> which is a copy of "r1", is tested for condition while later on "r1"
> is used for map_ptr operation. The verifier is not able to track such
> inst sequence effectively.
>
> Without the "usize > max_len" condition, there is no llvm optimization
> and the below generated code passed verifier:
> 52: (85) call bpf_get_stack#65
> R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0)
> R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256
> R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
> R9_w=inv800 R10=fp0,call_-1
> 53: (b7) r1 = 0
> 54: (bf) r8 = r0
> 55: (67) r8 <<= 32
> 56: (c7) r8 s>>= 32
> 57: (6d) if r1 s> r8 goto pc+24
> R0=inv(id=0,umax_value=800) R1=inv0 R6=ctx(id=0,off=0,imm=0)
> R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
> R8=inv(id=0,umax_value=800,var_off=(0x0; 0x3ff)) R9=inv800
> R10=fp0,call_-1
> 58: (bf) r2 = r7
> 59: (0f) r2 += r8
> 60: (1f) r9 -= r8
> 61: (bf) r1 = r6
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
> kernel/bpf/verifier.c | 25 +++++++++++++++++++++++++
> 1 file changed, 25 insertions(+)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index aba9425..d00bf53 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -164,6 +164,8 @@ struct bpf_call_arg_meta {
> bool pkt_access;
> int regno;
> int access_size;
> + s64 msize_smax_value;
> + u64 msize_umax_value;
> };
>
> static DEFINE_MUTEX(bpf_verifier_lock);
> @@ -1994,6 +1996,12 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
> } else if (arg_type_is_mem_size(arg_type)) {
> bool zero_size_allowed = (arg_type == ARG_CONST_SIZE_OR_ZERO);
>
> + /* remember the mem_size which may be used later
> + * to refine return values.
> + */
> + meta->msize_smax_value = reg->smax_value;
> + meta->msize_umax_value = reg->umax_value;
> +
> /* The register is SCALAR_VALUE; the access check
> * happens using its boundaries.
> */
> @@ -2333,6 +2341,21 @@ static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx)
> return 0;
> }
>
> +static void do_refine_retval_range(struct bpf_reg_state *regs, int ret_type,
> + int func_id,
> + struct bpf_call_arg_meta *meta)
> +{
> + struct bpf_reg_state *ret_reg = ®s[BPF_REG_0];
> +
> + if (ret_type != RET_INTEGER ||
> + (func_id != BPF_FUNC_get_stack &&
> + func_id != BPF_FUNC_probe_read_str))
> + return;
> +
> + ret_reg->smax_value = meta->msize_smax_value;
> + ret_reg->umax_value = meta->msize_umax_value;
should we call:
__reg_deduce_bounds(ret_reg);
__reg_bound_offset(ret_reg);
here?
It doesn't seem that it will be called later and
ret_reg->var_off state will be more conservative than necessary.
^ permalink raw reply
* Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Siwei Liu @ 2018-04-23 19:44 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Stephen Hemminger, Jiri Pirko, Sridhar Samudrala, David Miller,
Netdev, virtualization, virtio-dev, Brandeburg, Jesse,
Alexander Duyck, Jakub Kicinski, Jason Wang
In-Reply-To: <20180423205019-mutt-send-email-mst@kernel.org>
On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:
>> On Mon, 23 Apr 2018 20:24:56 +0300
>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>
>> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:
>> > > > >
>> > > > >I will NAK patches to change to common code for netvsc especially the
>> > > > >three device model. MS worked hard with distro vendors to support transparent
>> > > > >mode, ans we really can't have a new model; or do backport.
>> > > > >
>> > > > >Plus, DPDK is now dependent on existing model.
>> > > >
>> > > > Sorry, but nobody here cares about dpdk or other similar oddities.
>> > >
>> > > The network device model is a userspace API, and DPDK is a userspace application.
>> >
>> > It is userspace but are you sure dpdk is actually poking at netdevs?
>> > AFAIK it's normally banging device registers directly.
>> >
>> > > You can't go breaking userspace even if you don't like the application.
>> >
>> > Could you please explain how is the proposed patchset breaking
>> > userspace? Ignoring DPDK for now, I don't think it changes the userspace
>> > API at all.
>> >
>>
>> The DPDK has a device driver vdev_netvsc which scans the Linux network devices
>> to look for Linux netvsc device and the paired VF device and setup the
>> DPDK environment. This setup creates a DPDK failsafe (bondingish) instance
>> and sets up TAP support over the Linux netvsc device as well as the Mellanox
>> VF device.
>>
>> So it depends on existing 2 device model. You can't go to a 3 device model
>> or start hiding devices from userspace.
>
> Okay so how does the existing patch break that? IIUC does not go to
> a 3 device model since netvsc calls failover_register directly.
>
>> Also, I am working on associating netvsc and VF device based on serial number
>> rather than MAC address. The serial number is how Windows works now, and it makes
>> sense for Linux and Windows to use the same mechanism if possible.
>
> Maybe we should support same for virtio ...
> Which serial do you mean? From vpd?
>
> I guess you will want to keep supporting MAC for old hypervisors?
>
> It all seems like a reasonable thing to support in the generic core.
That's the reason why I chose explicit identifier rather than rely on
MAC address to bind/pair a device. MAC address can change. Even if it
can't, malicious guest user can fake MAC address to skip binding.
-Siwei
>
> --
> MST
^ permalink raw reply
* Re: [PATCH bpf-next v5 10/10] tools/bpf: add a test for bpf_get_stack with tracepoint prog
From: Alexei Starovoitov @ 2018-04-23 19:47 UTC (permalink / raw)
To: Yonghong Song; +Cc: ast, daniel, netdev, ecree, kernel-team
In-Reply-To: <20180423175417.4104464-11-yhs@fb.com>
On Mon, Apr 23, 2018 at 10:54:17AM -0700, Yonghong Song wrote:
> The test_stacktrace_map and test_stacktrace_build_id are
> enhanced to call bpf_get_stack in the helper to get the
> stack trace as well. The stack traces from bpf_get_stack
> and bpf_get_stackid are compared to ensure that for the
> same stack as represented as the same hash, their ip addresses
> or build id's must be the same.
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
...
> /* taken from /sys/kernel/debug/tracing/events/sched/sched_switch/format */
> @@ -44,7 +51,10 @@ struct sched_switch_args {
> SEC("tracepoint/sched/sched_switch")
> int oncpu(struct sched_switch_args *ctx)
> {
> + __u32 max_len = PERF_MAX_STACK_DEPTH * sizeof(__u64);
> __u32 key = 0, val = 0, *value_p;
> + void *stack_p;
> +
nit: unnecessary extra empty line.
otherwise
Acked-by: Alexei Starovoitov <ast@kernel.org>
^ permalink raw reply
* Re: [PATCH bpf-next v5 06/10] tools/bpf: add bpf_get_stack helper to tools headers
From: Alexei Starovoitov @ 2018-04-23 19:48 UTC (permalink / raw)
To: Yonghong Song; +Cc: ast, daniel, netdev, ecree, kernel-team
In-Reply-To: <20180423175417.4104464-7-yhs@fb.com>
On Mon, Apr 23, 2018 at 10:54:13AM -0700, Yonghong Song wrote:
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
please add few words to commit log. Don't leave it empty
even for trivial patches.
^ permalink raw reply
* Re: [RFC V3 PATCH 0/8] Packed ring for vhost
From: Michael S. Tsirkin @ 2018-04-23 19:59 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk
Cc: Jason Wang, kvm, virtualization, netdev, linux-kernel, tiwei.bie,
jfreimann, wexu
In-Reply-To: <20180423193120.GD30033@char.us.oracle.com>
On Mon, Apr 23, 2018 at 03:31:20PM -0400, Konrad Rzeszutek Wilk wrote:
> On Mon, Apr 23, 2018 at 01:34:52PM +0800, Jason Wang wrote:
> > Hi all:
> >
> > This RFC implement packed ring layout. The code were tested with
> > Tiwei's RFC V2 a thttps://lkml.org/lkml/2018/4/1/48. Some fixups and
> > tweaks were needed on top of Tiwei's code to make it run. TCP stream
> > and pktgen does not show obvious difference compared with split ring.
>
> I have to ask then - what is the benefit of this?
You can use this with dpdk within guest.
The DPDK version did see a performance improvement so hopefully with
future versions this will too.
> >
> > Changes from V2:
> > - do not use & in checking desc_event_flags
> > - off should be most significant bit
> > - remove the workaround of mergeable buffer for dpdk prototype
> > - id should be in the last descriptor in the chain
> > - keep _F_WRITE for write descriptor when adding used
> > - device flags updating should use ADDR_USED type
> > - return error on unexpected unavail descriptor in a chain
> > - return false in vhost_ve_avail_empty is descriptor is available
> > - track last seen avail_wrap_counter
> > - correctly examine available descriptor in get_indirect_packed()
> > - vhost_idx_diff should return u16 instead of bool
> >
> > Changes from V1:
> >
> > - Refactor vhost used elem code to avoid open coding on used elem
> > - Event suppression support (compile test only).
> > - Indirect descriptor support (compile test only).
> > - Zerocopy support.
> > - vIOMMU support.
> > - SCSI/VSOCK support (compile test only).
> > - Fix several bugs
> >
> > For simplicity, I don't implement batching or other optimizations.
> >
> > Please review.
> >
> > Jason Wang (8):
> > vhost: move get_rx_bufs to vhost.c
> > vhost: hide used ring layout from device
> > vhost: do not use vring_used_elem
> > vhost_net: do not explicitly manipulate vhost_used_elem
> > vhost: vhost_put_user() can accept metadata type
> > virtio: introduce packed ring defines
> > vhost: packed ring support
> > vhost: event suppression for packed ring
> >
> > drivers/vhost/net.c | 136 ++----
> > drivers/vhost/scsi.c | 62 +--
> > drivers/vhost/vhost.c | 824 ++++++++++++++++++++++++++++++++++---
> > drivers/vhost/vhost.h | 47 ++-
> > drivers/vhost/vsock.c | 42 +-
> > include/uapi/linux/virtio_config.h | 9 +
> > include/uapi/linux/virtio_ring.h | 32 ++
> > 7 files changed, 926 insertions(+), 226 deletions(-)
> >
> > --
> > 2.7.4
> >
^ permalink raw reply
* Re: [PATCH bpf-next 02/15] xsk: add user memory registration support sockopt
From: Björn Töpel @ 2018-04-23 20:00 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Karlsson, Magnus, Duyck, Alexander H, Alexander Duyck,
John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
Willem de Bruijn, Daniel Borkmann, Netdev, Björn Töpel,
michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali,
Zhang, Qi Z
In-Reply-To: <20180423190615-mutt-send-email-mst@kernel.org>
2018-04-23 18:18 GMT+02:00 Michael S. Tsirkin <mst@redhat.com>:
[...]
>> +static void xdp_umem_unpin_pages(struct xdp_umem *umem)
>> +{
>> + unsigned int i;
>> +
>> + if (umem->pgs) {
>> + for (i = 0; i < umem->npgs; i++)
>
> Since you pin them with FOLL_WRITE, I assume these pages
> are written to.
> Don't you need set_page_dirty_lock here?
>
Hmm, I actually *removed* it from the RFC V2, but after doing some
homework, I think you're right. Thanks for pointing this out!
Thinking more about this; This function is called from sk_destruct,
and in the Tx case the sk_destruct can be called from interrupt
context, where set_page_dirty_lock cannot be called.
Are there any preferred ways of solving this? Scheduling the whole
xsk_destruct call to a workqueue is one way (I think). Any
cleaner/better way?
[...]
>> +static int __xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
>> +{
>> + u32 frame_size = mr->frame_size, frame_headroom = mr->frame_headroom;
>> + u64 addr = mr->addr, size = mr->len;
>> + unsigned int nframes;
>> + int size_chk, err;
>> +
>> + if (frame_size < XDP_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
>> + /* Strictly speaking we could support this, if:
>> + * - huge pages, or*
>
> what does "or*" here mean?
>
Oops, I'll change to just 'or' in the next revision.
Thanks!
Björn
^ permalink raw reply
* Re: [nf-next] netfilter: extend SRH match to support matching previous, next and last SID
From: Ahmed Abdelsalam @ 2018-04-23 20:01 UTC (permalink / raw)
To: Pablo Neira Ayuso
Cc: fw, davem, dav.lebrun, linux-kernel, netfilter-devel, coreteam,
netdev
In-Reply-To: <20180423173047.gsf2xjlmpichyvte@salvia>
On Mon, 23 Apr 2018 19:30:47 +0200
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Mon, Apr 23, 2018 at 05:48:22AM -0500, Ahmed Abdelsalam wrote:
> > Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com>
> > ---
> > include/uapi/linux/netfilter_ipv6/ip6t_srh.h | 22 +++++++++++++--
> > net/ipv6/netfilter/ip6t_srh.c | 41 +++++++++++++++++++++++++++-
> > 2 files changed, 60 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/uapi/linux/netfilter_ipv6/ip6t_srh.h b/include/uapi/linux/netfilter_ipv6/ip6t_srh.h
> > index f3cc0ef..9808382 100644
> > --- a/include/uapi/linux/netfilter_ipv6/ip6t_srh.h
> > +++ b/include/uapi/linux/netfilter_ipv6/ip6t_srh.h
> > @@ -17,7 +17,10 @@
> > #define IP6T_SRH_LAST_GT 0x0100
> > #define IP6T_SRH_LAST_LT 0x0200
> > #define IP6T_SRH_TAG 0x0400
> > -#define IP6T_SRH_MASK 0x07FF
> > +#define IP6T_SRH_PSID 0x0800
> > +#define IP6T_SRH_NSID 0x1000
> > +#define IP6T_SRH_LSID 0x2000
> > +#define IP6T_SRH_MASK 0x3FFF
> >
> > /* Values for "mt_invflags" field in struct ip6t_srh */
> > #define IP6T_SRH_INV_NEXTHDR 0x0001
> > @@ -31,7 +34,10 @@
> > #define IP6T_SRH_INV_LAST_GT 0x0100
> > #define IP6T_SRH_INV_LAST_LT 0x0200
> > #define IP6T_SRH_INV_TAG 0x0400
> > -#define IP6T_SRH_INV_MASK 0x07FF
> > +#define IP6T_SRH_INV_PSID 0x0800
> > +#define IP6T_SRH_INV_NSID 0x1000
> > +#define IP6T_SRH_INV_LSID 0x2000
> > +#define IP6T_SRH_INV_MASK 0x3FFF
> >
> > /**
> > * struct ip6t_srh - SRH match options
> > @@ -40,6 +46,12 @@
> > * @ segs_left: Segments left field of SRH
> > * @ last_entry: Last entry field of SRH
> > * @ tag: Tag field of SRH
> > + * @ psid_addr: Address of previous SID in SRH SID list
> > + * @ nsid_addr: Address of NEXT SID in SRH SID list
> > + * @ lsid_addr: Address of LAST SID in SRH SID list
> > + * @ psid_msk: Mask of previous SID in SRH SID list
> > + * @ nsid_msk: Mask of next SID in SRH SID list
> > + * @ lsid_msk: MAsk of last SID in SRH SID list
> > * @ mt_flags: match options
> > * @ mt_invflags: Invert the sense of match options
> > */
> > @@ -50,6 +62,12 @@ struct ip6t_srh {
> > __u8 segs_left;
> > __u8 last_entry;
> > __u16 tag;
> > + struct in6_addr psid_addr;
> > + struct in6_addr nsid_addr;
> > + struct in6_addr lsid_addr;
> > + struct in6_addr psid_msk;
> > + struct in6_addr nsid_msk;
> > + struct in6_addr lsid_msk;
>
> This is changing something exposed through UAPI, so you will need a
> new revision for this.
Could you please advice what should be done in this case?
>
> > __u16 mt_flags;
> > __u16 mt_invflags;
> > };
> > diff --git a/net/ipv6/netfilter/ip6t_srh.c b/net/ipv6/netfilter/ip6t_srh.c
> > index 33719d5..2b5cc73 100644
> > --- a/net/ipv6/netfilter/ip6t_srh.c
> > +++ b/net/ipv6/netfilter/ip6t_srh.c
> > @@ -30,7 +30,9 @@ static bool srh_mt6(const struct sk_buff *skb, struct xt_action_param *par)
> > const struct ip6t_srh *srhinfo = par->matchinfo;
> > struct ipv6_sr_hdr *srh;
> > struct ipv6_sr_hdr _srh;
> > - int hdrlen, srhoff = 0;
> > + int hdrlen, psidoff, nsidoff, lsidoff, srhoff = 0;
> > + struct in6_addr *psid, *nsid, *lsid;
> > + struct in6_addr _psid, _nsid, _lsid;
>
> Could you rearrange variable definitions? ie. longest line first, eg.
>
> int hdrlen, psidoff, nsidoff, lsidoff, srhoff = 0;
> const struct ip6t_srh *srhinfo = par->matchinfo;
> struct in6_addr *psid, *nsid, *lsid;
> struct ipv6_sr_hdr *srh;
> struct ipv6_sr_hdr _srh;
>
Ok I will re-arrange them in reverse christmas tree form.
Ahmed
--
Ahmed Abdelsalam <amsalam20@gmail.com>
^ permalink raw reply
* Re: [PATCH bpf] bpf: disable and restore preemption in __BPF_PROG_RUN_ARRAY
From: Alexei Starovoitov @ 2018-04-23 20:05 UTC (permalink / raw)
To: Roman Gushchin; +Cc: netdev, kernel-team, Alexei Starovoitov, Daniel Borkmann
In-Reply-To: <20180423170921.16162-1-guro@fb.com>
On Mon, Apr 23, 2018 at 06:09:21PM +0100, Roman Gushchin wrote:
> Running bpf programs requires disabled preemption,
> however at least some* of the BPF_PROG_RUN_ARRAY users
> do not follow this rule.
>
> To fix this bug, and also to make it not happen in the future,
> let's add explicit preemption disabling/re-enabling
> to the __BPF_PROG_RUN_ARRAY code.
>
> * for example:
> [ 17.624472] RIP: 0010:__cgroup_bpf_run_filter_sk+0x1c4/0x1d0
> ...
> [ 17.640890] inet6_create+0x3eb/0x520
> [ 17.641405] __sock_create+0x242/0x340
> [ 17.641939] __sys_socket+0x57/0xe0
> [ 17.642370] ? trace_hardirqs_off_thunk+0x1a/0x1c
> [ 17.642944] SyS_socket+0xa/0x10
> [ 17.643357] do_syscall_64+0x79/0x220
> [ 17.643879] entry_SYSCALL_64_after_hwframe+0x42/0xb7
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
^ permalink raw reply
* Re: KASAN: null-ptr-deref Read in refcount_inc_not_zero
From: Cong Wang @ 2018-04-23 20:05 UTC (permalink / raw)
To: syzbot
Cc: David Miller, Denys Vlasenko, LKML,
Linux Kernel Network Developers, syzkaller-bugs, xiaolou4617
In-Reply-To: <000000000000eeeeff056a79954c@google.com>
#syz fix: llc: fix NULL pointer deref for SOCK_ZAPPED
^ permalink raw reply
* Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Michael S. Tsirkin @ 2018-04-23 20:06 UTC (permalink / raw)
To: Siwei Liu
Cc: Stephen Hemminger, Jiri Pirko, Sridhar Samudrala, David Miller,
Netdev, virtualization, virtio-dev, Brandeburg, Jesse,
Alexander Duyck, Jakub Kicinski, Jason Wang
In-Reply-To: <CADGSJ20ge75T+ddxtUBT4d9m1i3=HLOAHJEoS7Cg0bqnXrutwA@mail.gmail.com>
On Mon, Apr 23, 2018 at 12:44:39PM -0700, Siwei Liu wrote:
> On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:
> >> On Mon, 23 Apr 2018 20:24:56 +0300
> >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >>
> >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:
> >> > > > >
> >> > > > >I will NAK patches to change to common code for netvsc especially the
> >> > > > >three device model. MS worked hard with distro vendors to support transparent
> >> > > > >mode, ans we really can't have a new model; or do backport.
> >> > > > >
> >> > > > >Plus, DPDK is now dependent on existing model.
> >> > > >
> >> > > > Sorry, but nobody here cares about dpdk or other similar oddities.
> >> > >
> >> > > The network device model is a userspace API, and DPDK is a userspace application.
> >> >
> >> > It is userspace but are you sure dpdk is actually poking at netdevs?
> >> > AFAIK it's normally banging device registers directly.
> >> >
> >> > > You can't go breaking userspace even if you don't like the application.
> >> >
> >> > Could you please explain how is the proposed patchset breaking
> >> > userspace? Ignoring DPDK for now, I don't think it changes the userspace
> >> > API at all.
> >> >
> >>
> >> The DPDK has a device driver vdev_netvsc which scans the Linux network devices
> >> to look for Linux netvsc device and the paired VF device and setup the
> >> DPDK environment. This setup creates a DPDK failsafe (bondingish) instance
> >> and sets up TAP support over the Linux netvsc device as well as the Mellanox
> >> VF device.
> >>
> >> So it depends on existing 2 device model. You can't go to a 3 device model
> >> or start hiding devices from userspace.
> >
> > Okay so how does the existing patch break that? IIUC does not go to
> > a 3 device model since netvsc calls failover_register directly.
> >
> >> Also, I am working on associating netvsc and VF device based on serial number
> >> rather than MAC address. The serial number is how Windows works now, and it makes
> >> sense for Linux and Windows to use the same mechanism if possible.
> >
> > Maybe we should support same for virtio ...
> > Which serial do you mean? From vpd?
> >
> > I guess you will want to keep supporting MAC for old hypervisors?
> >
> > It all seems like a reasonable thing to support in the generic core.
>
> That's the reason why I chose explicit identifier rather than rely on
> MAC address to bind/pair a device. MAC address can change. Even if it
> can't, malicious guest user can fake MAC address to skip binding.
>
> -Siwei
Address should be sampled at device creation to prevent this
kind of hack. Not that it buys the malicious user much:
if you can poke at MAC addresses you probably already can
break networking.
>
> >
> > --
> > MST
^ permalink raw reply
* Re: [bpf-next PATCH 3/3] bpf: add sample program to trace map events
From: Alexei Starovoitov @ 2018-04-23 20:08 UTC (permalink / raw)
To: Sebastiano Miano
Cc: Jesper Dangaard Brouer, netdev, ast, daniel, mingo, rostedt,
fulvio.risso, David S. Miller
In-Reply-To: <062780b1-6a24-a7f3-175c-db3b02605850@polito.it>
On Mon, Apr 23, 2018 at 04:08:36PM +0200, Sebastiano Miano wrote:
>
> That's in fact the real use case for the first two patches. Since bpf
> tracepoints are still a rather common (and easy to use) troubleshooting and
> monitoring tool why shouldn't we "enhance" their support with the newly
> added map/prog IDs?
because these tracepoints can be abused in the way that this patch demonstrated.
Whether to keep this patch in the series or not is irrelevant.
^ permalink raw reply
* Re: [nf-next] netfilter: extend SRH match to support matching previous, next and last SID
From: Florian Westphal @ 2018-04-23 20:08 UTC (permalink / raw)
To: Ahmed Abdelsalam
Cc: Pablo Neira Ayuso, fw, davem, dav.lebrun, linux-kernel,
netfilter-devel, coreteam, netdev
In-Reply-To: <20180423220148.03800031d0cb8e8a7a83dc31@gmail.com>
Ahmed Abdelsalam <amsalam20@gmail.com> wrote:
> > > @@ -50,6 +62,12 @@ struct ip6t_srh {
> > > __u8 segs_left;
> > > __u8 last_entry;
> > > __u16 tag;
> > > + struct in6_addr psid_addr;
> > > + struct in6_addr nsid_addr;
> > > + struct in6_addr lsid_addr;
> > > + struct in6_addr psid_msk;
> > > + struct in6_addr nsid_msk;
> > > + struct in6_addr lsid_msk;
> >
> > This is changing something exposed through UAPI, so you will need a
> > new revision for this.
>
> Could you please advice what should be done in this case?
You need to add
struct ip6t_srh_v1 {
/* copy of struct ip6t_srh here */
/* new fields go here */
};
Look at xt_conntrack.c, conntrack_mt_reg[] for an example of
multi-revision match.
You can probably re-origanise code to avoid too much duplication.
See 5a786232eb69a1f870ddc0cfd69d5bdef241a2ea in nf.git for an example,
it makes v0 into a v1 struct at runtime and re-uses new v1 code
for old v0.
^ permalink raw reply
* Re: [PATCH 0/3] bpf: Store/dump license string for loaded program
From: Alexei Starovoitov @ 2018-04-23 20:11 UTC (permalink / raw)
To: Jiri Olsa
Cc: Alexei Starovoitov, Daniel Borkmann, lkml, netdev, Quentin Monnet
In-Reply-To: <20180423065927.23127-1-jolsa@kernel.org>
On Mon, Apr 23, 2018 at 08:59:24AM +0200, Jiri Olsa wrote:
> hi,
> sending the change to store and dump the license
> info for loaded BPF programs. It's important for
> us get the license info, when investigating on
> screwed up machine.
hmm. boolean flag whether bpf prog is gpl or not
is already exposed via bpf_prog_info.
I see no point of wasting extra 128 bytes of kernel memory.
^ permalink raw reply
* Re: [PATCH bpf-next 02/15] xsk: add user memory registration support sockopt
From: Michael S. Tsirkin @ 2018-04-23 20:11 UTC (permalink / raw)
To: Björn Töpel
Cc: Karlsson, Magnus, Duyck, Alexander H, Alexander Duyck,
John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
Willem de Bruijn, Daniel Borkmann, Netdev, Björn Töpel,
michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali,
Zhang, Qi Z
In-Reply-To: <CAJ+HfNjwMFig+aJbacFK--5_1i8F2DLSyAUOvU12Xc-OvJBAzQ@mail.gmail.com>
On Mon, Apr 23, 2018 at 10:00:15PM +0200, Björn Töpel wrote:
> 2018-04-23 18:18 GMT+02:00 Michael S. Tsirkin <mst@redhat.com>:
>
> [...]
>
> >> +static void xdp_umem_unpin_pages(struct xdp_umem *umem)
> >> +{
> >> + unsigned int i;
> >> +
> >> + if (umem->pgs) {
> >> + for (i = 0; i < umem->npgs; i++)
> >
> > Since you pin them with FOLL_WRITE, I assume these pages
> > are written to.
> > Don't you need set_page_dirty_lock here?
> >
>
> Hmm, I actually *removed* it from the RFC V2, but after doing some
> homework, I think you're right. Thanks for pointing this out!
>
> Thinking more about this; This function is called from sk_destruct,
> and in the Tx case the sk_destruct can be called from interrupt
> context, where set_page_dirty_lock cannot be called.
>
> Are there any preferred ways of solving this? Scheduling the whole
> xsk_destruct call to a workqueue is one way (I think). Any
> cleaner/better way?
>
> [...]
Defer unpinning pages until the next tx call?
> >> +static int __xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
> >> +{
> >> + u32 frame_size = mr->frame_size, frame_headroom = mr->frame_headroom;
> >> + u64 addr = mr->addr, size = mr->len;
> >> + unsigned int nframes;
> >> + int size_chk, err;
> >> +
> >> + if (frame_size < XDP_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
> >> + /* Strictly speaking we could support this, if:
> >> + * - huge pages, or*
> >
> > what does "or*" here mean?
> >
>
> Oops, I'll change to just 'or' in the next revision.
>
>
> Thanks!
> Björn
^ permalink raw reply
* Re: [RFC V3 PATCH 0/8] Packed ring for vhost
From: Konrad Rzeszutek Wilk @ 2018-04-23 20:11 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: kvm, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <20180423224715-mutt-send-email-mst@kernel.org>
On Mon, Apr 23, 2018 at 10:59:43PM +0300, Michael S. Tsirkin wrote:
> On Mon, Apr 23, 2018 at 03:31:20PM -0400, Konrad Rzeszutek Wilk wrote:
> > On Mon, Apr 23, 2018 at 01:34:52PM +0800, Jason Wang wrote:
> > > Hi all:
> > >
> > > This RFC implement packed ring layout. The code were tested with
> > > Tiwei's RFC V2 a thttps://lkml.org/lkml/2018/4/1/48. Some fixups and
> > > tweaks were needed on top of Tiwei's code to make it run. TCP stream
> > > and pktgen does not show obvious difference compared with split ring.
> >
> > I have to ask then - what is the benefit of this?
>
> You can use this with dpdk within guest.
> The DPDK version did see a performance improvement so hopefully with
Is there a link to this performance improvement document?
> future versions this will too.
>
> > >
> > > Changes from V2:
> > > - do not use & in checking desc_event_flags
> > > - off should be most significant bit
> > > - remove the workaround of mergeable buffer for dpdk prototype
> > > - id should be in the last descriptor in the chain
> > > - keep _F_WRITE for write descriptor when adding used
> > > - device flags updating should use ADDR_USED type
> > > - return error on unexpected unavail descriptor in a chain
> > > - return false in vhost_ve_avail_empty is descriptor is available
> > > - track last seen avail_wrap_counter
> > > - correctly examine available descriptor in get_indirect_packed()
> > > - vhost_idx_diff should return u16 instead of bool
> > >
> > > Changes from V1:
> > >
> > > - Refactor vhost used elem code to avoid open coding on used elem
> > > - Event suppression support (compile test only).
> > > - Indirect descriptor support (compile test only).
> > > - Zerocopy support.
> > > - vIOMMU support.
> > > - SCSI/VSOCK support (compile test only).
> > > - Fix several bugs
> > >
> > > For simplicity, I don't implement batching or other optimizations.
> > >
> > > Please review.
> > >
> > > Jason Wang (8):
> > > vhost: move get_rx_bufs to vhost.c
> > > vhost: hide used ring layout from device
> > > vhost: do not use vring_used_elem
> > > vhost_net: do not explicitly manipulate vhost_used_elem
> > > vhost: vhost_put_user() can accept metadata type
> > > virtio: introduce packed ring defines
> > > vhost: packed ring support
> > > vhost: event suppression for packed ring
> > >
> > > drivers/vhost/net.c | 136 ++----
> > > drivers/vhost/scsi.c | 62 +--
> > > drivers/vhost/vhost.c | 824 ++++++++++++++++++++++++++++++++++---
> > > drivers/vhost/vhost.h | 47 ++-
> > > drivers/vhost/vsock.c | 42 +-
> > > include/uapi/linux/virtio_config.h | 9 +
> > > include/uapi/linux/virtio_ring.h | 32 ++
> > > 7 files changed, 926 insertions(+), 226 deletions(-)
> > >
> > > --
> > > 2.7.4
> > >
^ permalink raw reply
* Re: [PATCH net-next 0/2] net/ipv6: couple of fixes for rcu change to from
From: David Miller @ 2018-04-23 20:13 UTC (permalink / raw)
To: dsahern; +Cc: netdev
In-Reply-To: <20180423183207.9124-1-dsahern@gmail.com>
From: David Ahern <dsahern@gmail.com>
Date: Mon, 23 Apr 2018 11:32:05 -0700
> So many details... I am thankful for all the robots running the
> permutations and tools.
This is why I am not afraid of the robots taking over. :-)
> Two bug fixes from the rcu change to rt->from:
> 1. missing rcu lock in ip6_negative_advice
> 2. rcu dereferences in 2 sites
Series applied, thanks David.
^ permalink raw reply
* Re: [PATCH bpf-next 02/15] xsk: add user memory registration support sockopt
From: Björn Töpel @ 2018-04-23 20:15 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Karlsson, Magnus, Duyck, Alexander H, Alexander Duyck,
John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
Willem de Bruijn, Daniel Borkmann, Netdev, Björn Töpel,
michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali,
Zhang, Qi Z
In-Reply-To: <20180423231026-mutt-send-email-mst@kernel.org>
2018-04-23 22:11 GMT+02:00 Michael S. Tsirkin <mst@redhat.com>:
> On Mon, Apr 23, 2018 at 10:00:15PM +0200, Björn Töpel wrote:
>> 2018-04-23 18:18 GMT+02:00 Michael S. Tsirkin <mst@redhat.com>:
>>
>> [...]
>>
>> >> +static void xdp_umem_unpin_pages(struct xdp_umem *umem)
>> >> +{
>> >> + unsigned int i;
>> >> +
>> >> + if (umem->pgs) {
>> >> + for (i = 0; i < umem->npgs; i++)
>> >
>> > Since you pin them with FOLL_WRITE, I assume these pages
>> > are written to.
>> > Don't you need set_page_dirty_lock here?
>> >
>>
>> Hmm, I actually *removed* it from the RFC V2, but after doing some
>> homework, I think you're right. Thanks for pointing this out!
>>
>> Thinking more about this; This function is called from sk_destruct,
>> and in the Tx case the sk_destruct can be called from interrupt
>> context, where set_page_dirty_lock cannot be called.
>>
>> Are there any preferred ways of solving this? Scheduling the whole
>> xsk_destruct call to a workqueue is one way (I think). Any
>> cleaner/better way?
>>
>> [...]
>
> Defer unpinning pages until the next tx call?
>
If the sock is released, there wont be another tx call. Or am I
missing something obvious?
>
>> >> +static int __xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
>> >> +{
>> >> + u32 frame_size = mr->frame_size, frame_headroom = mr->frame_headroom;
>> >> + u64 addr = mr->addr, size = mr->len;
>> >> + unsigned int nframes;
>> >> + int size_chk, err;
>> >> +
>> >> + if (frame_size < XDP_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
>> >> + /* Strictly speaking we could support this, if:
>> >> + * - huge pages, or*
>> >
>> > what does "or*" here mean?
>> >
>>
>> Oops, I'll change to just 'or' in the next revision.
>>
>>
>> Thanks!
>> Björn
^ permalink raw reply
* Re: [nf-next] netfilter: extend SRH match to support matching previous, next and last SID
From: Ahmed Abdelsalam @ 2018-04-23 20:16 UTC (permalink / raw)
To: Florian Westphal
Cc: Pablo Neira Ayuso, davem, dav.lebrun, linux-kernel,
netfilter-devel, coreteam, netdev
In-Reply-To: <20180423200844.bq3ksj262brrifnj@breakpoint.cc>
On Mon, 23 Apr 2018 22:08:44 +0200
Florian Westphal <fw@strlen.de> wrote:
> Ahmed Abdelsalam <amsalam20@gmail.com> wrote:
> > > > @@ -50,6 +62,12 @@ struct ip6t_srh {
> > > > __u8 segs_left;
> > > > __u8 last_entry;
> > > > __u16 tag;
> > > > + struct in6_addr psid_addr;
> > > > + struct in6_addr nsid_addr;
> > > > + struct in6_addr lsid_addr;
> > > > + struct in6_addr psid_msk;
> > > > + struct in6_addr nsid_msk;
> > > > + struct in6_addr lsid_msk;
> > >
> > > This is changing something exposed through UAPI, so you will need a
> > > new revision for this.
> >
> > Could you please advice what should be done in this case?
>
> You need to add
> struct ip6t_srh_v1 {
> /* copy of struct ip6t_srh here */
>
> /* new fields go here */
> };
>
>
> Look at xt_conntrack.c, conntrack_mt_reg[] for an example of
> multi-revision match.
>
> You can probably re-origanise code to avoid too much duplication.
> See 5a786232eb69a1f870ddc0cfd69d5bdef241a2ea in nf.git for an example,
> it makes v0 into a v1 struct at runtime and re-uses new v1 code
> for old v0.
>
>
Thanks Florian!
--
Ahmed Abdelsalam <amsalam20@gmail.com>
^ permalink raw reply
* Re: [bpf-next PATCH 3/3] bpf: add sample program to trace map events
From: Jesper Dangaard Brouer @ 2018-04-23 20:22 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Sebastiano Miano, netdev, ast, daniel, mingo, rostedt,
fulvio.risso, David S. Miller, brouer
In-Reply-To: <20180423200801.m2kcrjcoavuibbmk@ast-mbp>
On Mon, 23 Apr 2018 14:08:02 -0600
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> On Mon, Apr 23, 2018 at 04:08:36PM +0200, Sebastiano Miano wrote:
> >
> > That's in fact the real use case for the first two patches. Since bpf
> > tracepoints are still a rather common (and easy to use) troubleshooting and
> > monitoring tool why shouldn't we "enhance" their support with the newly
> > added map/prog IDs?
>
> because these tracepoints can be abused in the way that this patch demonstrated.
> Whether to keep this patch in the series or not is irrelevant.
I don't understand your abuse use-case, can you explain what you mean?
You do realize that these tracepoints can _only_ monitor the userspace
map activity (not kernel map changes) ... and we _do_ need a way to
debug this (and without the map_id I can tell which map).
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply
* Re: [PATCH 00/12] Netfilter/IPVS fixes for net
From: David Miller @ 2018-04-23 20:22 UTC (permalink / raw)
To: pablo; +Cc: netfilter-devel, netdev
In-Reply-To: <20180423175714.9794-1-pablo@netfilter.org>
From: Pablo Neira Ayuso <pablo@netfilter.org>
Date: Mon, 23 Apr 2018 19:57:02 +0200
> The following patchset contains Netfilter/IPVS fixes for your net tree,
> they are:
...
> You can pull these changes from:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git
Pulled, thank you.
^ permalink raw reply
* Re: [PATCH bpf-next 02/15] xsk: add user memory registration support sockopt
From: Michael S. Tsirkin @ 2018-04-23 20:26 UTC (permalink / raw)
To: Björn Töpel
Cc: Karlsson, Magnus, Duyck, Alexander H, Alexander Duyck,
John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
Willem de Bruijn, Daniel Borkmann, Netdev, Björn Töpel,
michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali,
Zhang, Qi Z
In-Reply-To: <CAJ+HfNhbNeBfiMP+dK0ZGU0dfYrkWpo5jOAidUoUnkxRMBb0xg@mail.gmail.com>
On Mon, Apr 23, 2018 at 10:15:18PM +0200, Björn Töpel wrote:
> 2018-04-23 22:11 GMT+02:00 Michael S. Tsirkin <mst@redhat.com>:
> > On Mon, Apr 23, 2018 at 10:00:15PM +0200, Björn Töpel wrote:
> >> 2018-04-23 18:18 GMT+02:00 Michael S. Tsirkin <mst@redhat.com>:
> >>
> >> [...]
> >>
> >> >> +static void xdp_umem_unpin_pages(struct xdp_umem *umem)
> >> >> +{
> >> >> + unsigned int i;
> >> >> +
> >> >> + if (umem->pgs) {
> >> >> + for (i = 0; i < umem->npgs; i++)
> >> >
> >> > Since you pin them with FOLL_WRITE, I assume these pages
> >> > are written to.
> >> > Don't you need set_page_dirty_lock here?
> >> >
> >>
> >> Hmm, I actually *removed* it from the RFC V2, but after doing some
> >> homework, I think you're right. Thanks for pointing this out!
> >>
> >> Thinking more about this; This function is called from sk_destruct,
> >> and in the Tx case the sk_destruct can be called from interrupt
> >> context, where set_page_dirty_lock cannot be called.
> >>
> >> Are there any preferred ways of solving this? Scheduling the whole
> >> xsk_destruct call to a workqueue is one way (I think). Any
> >> cleaner/better way?
> >>
> >> [...]
> >
> > Defer unpinning pages until the next tx call?
> >
>
> If the sock is released, there wont be another tx call.
unpin them on socket release too?
> Or am I
> missing something obvious?
>
> >
> >> >> +static int __xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
> >> >> +{
> >> >> + u32 frame_size = mr->frame_size, frame_headroom = mr->frame_headroom;
> >> >> + u64 addr = mr->addr, size = mr->len;
> >> >> + unsigned int nframes;
> >> >> + int size_chk, err;
> >> >> +
> >> >> + if (frame_size < XDP_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
> >> >> + /* Strictly speaking we could support this, if:
> >> >> + * - huge pages, or*
> >> >
> >> > what does "or*" here mean?
> >> >
> >>
> >> Oops, I'll change to just 'or' in the next revision.
> >>
> >>
> >> Thanks!
> >> Björn
^ permalink raw reply
* Re: [PATCH net-next v2 0/2] openvswitch: Support conntrack zone limit
From: Pravin Shelar @ 2018-04-23 20:10 UTC (permalink / raw)
To: David Miller; +Cc: Yi-Hung Wei, Linux Kernel Network Developers
In-Reply-To: <20180423.093927.2276689799061091037.davem@davemloft.net>
On Mon, Apr 23, 2018 at 6:39 AM, David Miller <davem@davemloft.net> wrote:
> From: Yi-Hung Wei <yihung.wei@gmail.com>
> Date: Tue, 17 Apr 2018 17:30:27 -0700
>
>> Currently, nf_conntrack_max is used to limit the maximum number of
>> conntrack entries in the conntrack table for every network namespace.
>> For the VMs and containers that reside in the same namespace,
>> they share the same conntrack table, and the total # of conntrack entries
>> for all the VMs and containers are limited by nf_conntrack_max. In this
>> case, if one of the VM/container abuses the usage the conntrack entries,
>> it blocks the others from committing valid conntrack entries into the
>> conntrack table. Even if we can possibly put the VM in different network
>> namespace, the current nf_conntrack_max configuration is kind of rigid
>> that we cannot limit different VM/container to have different # conntrack
>> entries.
>>
Hi
This looks like general problem related to nf zone usage limit, Did
you considered changing nf-conntrack to have a per zone limit, so that
all users of nf-filter can use it. I prefer this to adding a wrapper
in OVS nf-filter layer.
Thanks,
Pravin.
>> To address the aforementioned issue, this patch proposes to have a
>> fine-grained mechanism that could further limit the # of conntrack entries
>> per-zone. For example, we can designate different zone to different VM,
>> and set conntrack limit to each zone. By providing this isolation, a
>> mis-behaved VM only consumes the conntrack entries in its own zone, and
>> it will not influence other well-behaved VMs. Moreover, the users can
>> set various conntrack limit to different zone based on their preference.
>>
>> The proposed implementation utilizes Netfilter's nf_conncount backend
>> to count the number of connections in a particular zone. If the number of
>> connection is above a configured limitation, OVS will return ENOMEM to the
>> userspace. If userspace does not configure the zone limit, the limit
>> defaults to zero that is no limitation, which is backward compatible to
>> the behavior without this patch.
>>
>> The first patch defines the conntrack limit netlink definition, and the
>> second patch provides the implementation.
>
> Pravin, I need this series reviewed.
>
> Thank you.
^ permalink raw reply
* [Patch nf] ipvs: initialize tbl->entries after allocation
From: Cong Wang @ 2018-04-23 20:53 UTC (permalink / raw)
To: netdev
Cc: lvs-devel, netfilter-devel, Cong Wang, Simon Horman,
Julian Anastasov, Pablo Neira Ayuso
tbl->entries is not initialized after kmalloc(), therefore
causes an uninit-value warning in ip_vs_lblc_check_expire()
as reported by syzbot.
Reported-by: <syzbot+3dfdea57819073a04f21@syzkaller.appspotmail.com>
Cc: Simon Horman <horms@verge.net.au>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
net/netfilter/ipvs/ip_vs_lblcr.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/netfilter/ipvs/ip_vs_lblcr.c b/net/netfilter/ipvs/ip_vs_lblcr.c
index 92adc04557ed..bc2bc5eebcb8 100644
--- a/net/netfilter/ipvs/ip_vs_lblcr.c
+++ b/net/netfilter/ipvs/ip_vs_lblcr.c
@@ -534,6 +534,7 @@ static int ip_vs_lblcr_init_svc(struct ip_vs_service *svc)
tbl->counter = 1;
tbl->dead = false;
tbl->svc = svc;
+ atomic_set(&tbl->entries, 0);
/*
* Hook periodic timer for garbage collection
--
2.13.0
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox