* Re: [PATCH net-next] net: stmmac: Implement logic to automatically select HW Interface
From: David Miller @ 2018-04-23 0:59 UTC (permalink / raw)
To: Jose.Abreu
Cc: netdev, Joao.Pinto, Vitor.Soares, peppe.cavallaro,
alexandre.torgue
In-Reply-To: <9d14a1f97e0b963125c898d571ff44547660e9a9.1524151243.git.joabreu@synopsys.com>
From: Jose Abreu <Jose.Abreu@synopsys.com>
Date: Thu, 19 Apr 2018 16:24:15 +0100
> @@ -0,0 +1,216 @@
> +// SPDX-License-Identifier: (GPL-2.0 OR MIT)
> +// Copyright (c) 2018 Synopsys, Inc. and/or its affiliates.
> +// stmmac HW Interface Handling
Please do not use C++ style comments for anything past the
SPDX identifier line.
Thank you.
^ permalink raw reply
* Re: XDP breakage with virtio due to 6870de435b90c083ae0f3f7f341287976ef56f03
From: Nikita V. Shirokov @ 2018-04-23 0:55 UTC (permalink / raw)
To: David Ahern; +Cc: Jason Wang, daniel, netdev@vger.kernel.org
In-Reply-To: <f50aa641-09a7-4a08-498c-e7aa6ac50cc6@gmail.com>
On Sun, Apr 22, 2018 at 04:47:48PM -0600, David Ahern wrote:
> This commit breaks my FIB forwarding program:
>
> commit 6870de435b90c083ae0f3f7f341287976ef56f03
> Author: Nikita V. Shirokov <tehnerd@tehnerd.com>
> Date: Tue Apr 17 21:42:20 2018 -0700
>
> bpf: make virtio compatible w/ bpf_xdp_adjust_tail
>
> w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as
> well (only "decrease" of pointer's location is going to be supported).
> changing of this pointer will change packet's size.
> for virtio driver we need to adjust XDP_PASS handling by recalculating
> length of the packet if it was passed to the TCP/IP stack
>
> Reviewed-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
>
> ###
>
> Some of the packets (e.g., ARP or those without a resolved neighbor) are
> passed to the networking stack. What shows up are clearly broken packets:
>
> # tcpdump -n -i eth1
> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
> listening on eth1, link-type EN10MB (Ethernet), capture size 262144 bytes
> 15:45:29.693238 [|ARP]
> 0x0000: 0001 0800 0604 0001 42c0 ce2f 3fa9 0a64 ........B../?..d
> 15:45:30.710327 [|ARP]
> 0x0000: 0001 0800 0604 0001 42c0 ce2f 3fa9 0a64 ........B../?..d
> 15:45:31.734296 [|ARP]
> 0x0000: 0001 0800 0604 0001 42c0 ce2f 3fa9 0a64 ........B../?..d
> 15:45:32.908720 IP6 truncated-ip6 - 12 bytes
> missing!fe80::40c0:ceff:fe2f:3fa9 > ff02::1:ff00:2: ICMP6, neighbor
> solicitation[|icmp6]
> 15:45:33.910530 IP6 truncated-ip6 - 12 bytes missing!2001:db8:1::64 >
> ff02::1:ff00:2: ICMP6, neighbor solicitation[|icmp6]
> 15:45:34.934437 IP6 truncated-ip6 - 12 bytes missing!2001:db8:1::64 >
> ff02::1:ff00:2: ICMP6, neighbor solicitation[|icmp6]
> 15:45:35.958394 IP6 truncated-ip6 - 12 bytes missing!2001:db8:1::64 >
> ff02::1:ff00:2: ICMP6, neighbor solicitation[|icmp6]
>
> Reverting the mentioned patch fixes the problem.
Hi, David.
thanks for reporting this. looks like in my calculation i've missed
vi->hdr_len during new lengths calculation (it was len = xdp->data_end -
xdp->data; but also shouldbe +vi->hdr_len). will run few more tests
before sending a fix.
^ permalink raw reply
* Re: [PATCH bpf-next,v2 1/2] bpf: add helper for getting xfrm states
From: Alexei Starovoitov @ 2018-04-23 0:34 UTC (permalink / raw)
To: Eyal Birger; +Cc: netdev, shmulik, ast, daniel, fw, steffen.klassert
In-Reply-To: <20180420064356.7a703a1e@jimi>
On Fri, Apr 20, 2018 at 06:43:56AM +0300, Eyal Birger wrote:
> Hi,
>
> On Wed, 18 Apr 2018 15:31:03 -0700
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
> > On Thu, Apr 19, 2018 at 12:58:22AM +0300, Eyal Birger wrote:
> > > This commit introduces a helper which allows fetching xfrm state
> > > parameters by eBPF programs attached to TC.
> > >
> > > Prototype:
> > > bpf_skb_get_xfrm_state(skb, index, xfrm_state, size, flags)
> > >
> > > skb: pointer to skb
> > > index: the index in the skb xfrm_state secpath array
> > > xfrm_state: pointer to 'struct bpf_xfrm_state'
> > > size: size of 'struct bpf_xfrm_state'
> > > flags: reserved for future extensions
> > >
>
> <snip>
>
> > > +#ifdef CONFIG_XFRM
> > > +BPF_CALL_5(bpf_skb_get_xfrm_state, struct sk_buff *, skb, u32,
> > > index,
> > > + struct bpf_xfrm_state *, to, u32, size, u64, flags)
> > > +{
> > > + const struct sec_path *sp = skb_sec_path(skb);
> > > + const struct xfrm_state *x;
> > > +
> > > + if (!sp || unlikely(index >= sp->len || flags))
> > > + goto err_clear;
> > > +
> > > + x = sp->xvec[index];
> > > +
> > > + if (unlikely(size != sizeof(struct bpf_xfrm_state)))
> > > + goto err_clear;
> > > +
> > > + to->reqid = x->props.reqid;
> > > + to->spi = be32_to_cpu(x->id.spi);
> > > + to->family = x->props.family;
> > > + if (to->family == AF_INET6) {
> > > + memcpy(to->remote_ipv6, x->props.saddr.a6,
> > > + sizeof(to->remote_ipv6));
> > > + } else {
> > > + to->remote_ipv4 = be32_to_cpu(x->props.saddr.a4);
> > > + }
> >
> > that looks inconsistent. Why v4 is cpu endian, but v6 not?
>
> I agree. I followed the reference in bpf_skb_get_tunnel_key().
> I can keep v4 in net endianess too.
argh.
On one side it makes sense to be consistent with bpf_skb_get_tunnel_key()
but it's certainly confusing to have v4 and v6 in different endianness.
Imagine man page that says that bpf folks made a mistake in that
helper can kept repeating it in other helpers for consistency...
Daniel, what do you think?
Do you remember the history with bpf_skb_get_tunnel_key and
why it happened that way?
> > Why change endianness of the spi?
>
> I felt it was more consistent with other fields and usually helpful for
> programs. I can keep it in network order.
>
> In which case, do you expect it to be typed as __be32 in bpf.h?
> (I haven't seen other cases)?
It can be __u32 with a comment /* Stored in network byte order */
like in bunch of other fields.
^ permalink raw reply
* Re: [PATCH bpf-next v3 9/9] tools/bpf: add a test for bpf_get_stack with tracepoint prog
From: Alexei Starovoitov @ 2018-04-23 0:27 UTC (permalink / raw)
To: Yonghong Song; +Cc: ast, daniel, netdev, kernel-team
In-Reply-To: <20180420221842.742330-10-yhs@fb.com>
On Fri, Apr 20, 2018 at 03:18:42PM -0700, Yonghong Song wrote:
> The test_stacktrace_map and test_stacktrace_build_id are
> enhanced to call bpf_get_stack in the helper to get the
> stack trace as well. The stack traces from bpf_get_stack
> and bpf_get_stackid are compared to ensure that for the
> same stack as represented as the same hash, their ip addresses
> or build id's must be the same.
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
> tools/testing/selftests/bpf/test_progs.c | 63 +++++++++++++++++++---
> .../selftests/bpf/test_stacktrace_build_id.c | 20 ++++++-
> tools/testing/selftests/bpf/test_stacktrace_map.c | 20 +++++--
> 3 files changed, 92 insertions(+), 11 deletions(-)
>
> diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
> index dad4c3f..06b922a 100644
> --- a/tools/testing/selftests/bpf/test_progs.c
> +++ b/tools/testing/selftests/bpf/test_progs.c
> @@ -897,11 +897,40 @@ static int compare_map_keys(int map1_fd, int map2_fd)
> return 0;
> }
>
> +static int compare_stack_ips(int smap_fd, int amap_fd, int stack_trace_len)
> +{
> + __u32 key, next_key, *cur_key_p, *next_key_p;
> + char val_buf1[stack_trace_len], val_buf2[stack_trace_len];
the kernel is trying to get rid of VLAs.
test_progs.c already uses them, but if possible let's not
add more uses of them.
Other than that looks great.
^ permalink raw reply
* Re: [PATCH bpf-next v3 8/9] tools/bpf: add a test for bpf_get_stack with raw tracepoint prog
From: Alexei Starovoitov @ 2018-04-23 0:23 UTC (permalink / raw)
To: Yonghong Song; +Cc: ast, daniel, netdev, kernel-team
In-Reply-To: <20180420221842.742330-9-yhs@fb.com>
On Fri, Apr 20, 2018 at 03:18:41PM -0700, Yonghong Song wrote:
> The test attached a raw_tracepoint program to sched/sched_switch.
> It tested to get stack for user space, kernel space and user
> space with build_id request. It also tested to get user
> and kernel stack into the same buffer with back-to-back
> bpf_get_stack helper calls.
>
> Whenever the kernel stack is available, the user space
> application will check to ensure that the kernel function
> for raw_tracepoint ___bpf_prog_run is part of the stack.
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
> tools/testing/selftests/bpf/Makefile | 3 +-
> tools/testing/selftests/bpf/test_get_stack_rawtp.c | 102 ++++++++++++++++++
> tools/testing/selftests/bpf/test_progs.c | 115 +++++++++++++++++++++
> 3 files changed, 219 insertions(+), 1 deletion(-)
> create mode 100644 tools/testing/selftests/bpf/test_get_stack_rawtp.c
>
> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> index 0b72cc7..54e9e74 100644
> --- a/tools/testing/selftests/bpf/Makefile
> +++ b/tools/testing/selftests/bpf/Makefile
> @@ -32,7 +32,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test
> test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o \
> sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o \
> sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o test_adjust_tail.o \
> - test_btf_haskv.o test_btf_nokv.o
> + test_btf_haskv.o test_btf_nokv.o test_get_stack_rawtp.o
>
> # Order correspond to 'make run_tests' order
> TEST_PROGS := test_kmod.sh \
> @@ -56,6 +56,7 @@ $(TEST_GEN_PROGS_EXTENDED): $(OUTPUT)/libbpf.a
> $(OUTPUT)/test_dev_cgroup: cgroup_helpers.c
> $(OUTPUT)/test_sock: cgroup_helpers.c
> $(OUTPUT)/test_sock_addr: cgroup_helpers.c
> +$(OUTPUT)/test_progs: trace_helpers.c
>
> .PHONY: force
>
> diff --git a/tools/testing/selftests/bpf/test_get_stack_rawtp.c b/tools/testing/selftests/bpf/test_get_stack_rawtp.c
> new file mode 100644
> index 0000000..ba1dcf9
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/test_get_stack_rawtp.c
> @@ -0,0 +1,102 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/bpf.h>
> +#include "bpf_helpers.h"
> +
> +/* Permit pretty deep stack traces */
> +#define MAX_STACK_RAWTP 100
> +struct stack_trace_t {
> + int pid;
> + int kern_stack_size;
> + int user_stack_size;
> + int user_stack_buildid_size;
> + __u64 kern_stack[MAX_STACK_RAWTP];
> + __u64 user_stack[MAX_STACK_RAWTP];
> + struct bpf_stack_build_id user_stack_buildid[MAX_STACK_RAWTP];
> +};
> +
> +struct bpf_map_def SEC("maps") perfmap = {
> + .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
> + .key_size = sizeof(int),
> + .value_size = sizeof(__u32),
> + .max_entries = 2,
> +};
> +
> +struct bpf_map_def SEC("maps") stackdata_map = {
> + .type = BPF_MAP_TYPE_PERCPU_ARRAY,
> + .key_size = sizeof(__u32),
> + .value_size = sizeof(struct stack_trace_t),
> + .max_entries = 1,
> +};
> +
> +/* Allocate per-cpu space twice the needed. For the code below
> + * usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
> + * if (usize < 0)
> + * return 0;
> + * ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
> + *
> + * If we have value_size = MAX_STACK_RAWTP * sizeof(__u64),
> + * verifier will complain that access "raw_data + usize"
> + * with size "max_len - usize" may be out of bound.
> + * The maximum "raw_data + usize" is "raw_data + max_len"
> + * and the maximum "max_len - usize" is "max_len", verifier
> + * concludes that the maximum buffer access range is
> + * "raw_data[0...max_len * 2 - 1]" and hence reject the program.
> + *
> + * Doubling the to-be-used max buffer size can fix this verifier
> + * issue and avoid complicated C programming massaging.
> + * This is an acceptable workaround since there is one entry here.
> + */
> +struct bpf_map_def SEC("maps") rawdata_map = {
> + .type = BPF_MAP_TYPE_PERCPU_ARRAY,
> + .key_size = sizeof(__u32),
> + .value_size = MAX_STACK_RAWTP * sizeof(__u64) * 2,
> + .max_entries = 1,
> +};
> +
> +SEC("tracepoint/sched/sched_switch")
> +int bpf_prog1(void *ctx)
> +{
> + int max_len, max_buildid_len, usize, ksize, total_size;
> + struct stack_trace_t *data;
> + void *raw_data;
> + __u32 key = 0;
> +
> + data = bpf_map_lookup_elem(&stackdata_map, &key);
> + if (!data)
> + return 0;
> +
> + max_len = MAX_STACK_RAWTP * sizeof(__u64);
> + max_buildid_len = MAX_STACK_RAWTP * sizeof(struct bpf_stack_build_id);
> + data->pid = bpf_get_current_pid_tgid();
> + data->kern_stack_size = bpf_get_stack(ctx, data->kern_stack,
> + max_len, 0);
> + data->user_stack_size = bpf_get_stack(ctx, data->user_stack, max_len,
> + BPF_F_USER_STACK);
> + data->user_stack_buildid_size = bpf_get_stack(
> + ctx, data->user_stack_buildid, max_buildid_len,
> + BPF_F_USER_STACK | BPF_F_USER_BUILD_ID);
> + bpf_perf_event_output(ctx, &perfmap, 0, data, sizeof(*data));
> +
> + /* write both kernel and user stacks to the same buffer */
> + raw_data = bpf_map_lookup_elem(&rawdata_map, &key);
> + if (!raw_data)
> + return 0;
> +
> + usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
> + if (usize < 0)
> + return 0;
> +
> + ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
> + if (ksize < 0)
may be instead of teaching verifier about ARSH (which doesn't look
straighforward) such use case can be done as:
u32 max_len, usize, ksize;
ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
if ((int)ksize < 0)
That's certainly suboptimal and very much non obvious to program
developers, but at least it can unblock the bpf_get_stack part
landing and proper ARSH support can be added later?
Just a thought.
> + return 0;
> +
> + total_size = usize + ksize;
> + if (total_size > 0 && total_size <= max_len)
> + bpf_perf_event_output(ctx, &perfmap, 0, raw_data, total_size);
> +
> + return 0;
> +}
the rest of the test looks great. Thank you for adding such exhaustive test.
^ permalink raw reply
* Re: [PATCH bpf-next v3 6/9] samples/bpf: move common-purpose trace functions to selftests
From: Alexei Starovoitov @ 2018-04-23 0:17 UTC (permalink / raw)
To: Yonghong Song; +Cc: ast, daniel, netdev, kernel-team
In-Reply-To: <20180420221842.742330-7-yhs@fb.com>
On Fri, Apr 20, 2018 at 03:18:39PM -0700, Yonghong Song wrote:
> There is no functionality change in this patch. The common-purpose
> trace functions, including perf_event polling and ksym lookup,
> are moved from trace_output_user.c and bpf_load.c to
> selftests/bpf/trace_helpers.c so that these function can
> be reused later in selftests.
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
^ permalink raw reply
* Re: [PATCH bpf-next v3 4/9] bpf/verifier: improve register value range tracking with ARSH
From: Alexei Starovoitov @ 2018-04-23 0:16 UTC (permalink / raw)
To: Yonghong Song; +Cc: ast, daniel, netdev, kernel-team
In-Reply-To: <20180420221842.742330-5-yhs@fb.com>
On Fri, Apr 20, 2018 at 03:18:37PM -0700, Yonghong Song wrote:
> When helpers like bpf_get_stack returns an int value
> and later on used for arithmetic computation, the LSH and ARSH
> operations are often required to get proper sign extension into
> 64-bit. For example, without this patch:
> 54: R0=inv(id=0,umax_value=800)
> 54: (bf) r8 = r0
> 55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
> 55: (67) r8 <<= 32
> 56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff00000000))
> 56: (c7) r8 s>>= 32
> 57: R8=inv(id=0)
> With this patch:
> 54: R0=inv(id=0,umax_value=800)
> 54: (bf) r8 = r0
> 55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
> 55: (67) r8 <<= 32
> 56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff00000000))
> 56: (c7) r8 s>>= 32
> 57: R8=inv(id=0, umax_value=800,var_off=(0x0; 0x3ff))
> With better range of "R8", later on when "R8" is added to other register,
> e.g., a map pointer or scalar-value register, the better register
> range can be derived and verifier failure may be avoided.
>
> In our later example,
> ......
> usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
> if (usize < 0)
> return 0;
> ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
> ......
> Without improving ARSH value range tracking, the register representing
> "max_len - usize" will have smin_value equal to S64_MIN and will be
> rejected by verifier.
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
> kernel/bpf/verifier.c | 26 ++++++++++++++++++++++++++
> 1 file changed, 26 insertions(+)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 3c8bb92..01c215d 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -2975,6 +2975,32 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env,
> /* We may learn something more from the var_off */
> __update_reg_bounds(dst_reg);
> break;
> + case BPF_ARSH:
> + if (umax_val >= insn_bitness) {
> + /* Shifts greater than 31 or 63 are undefined.
> + * This includes shifts by a negative number.
> + */
> + mark_reg_unknown(env, regs, insn->dst_reg);
> + break;
> + }
> + if (dst_reg->smin_value < 0)
> + dst_reg->smin_value >>= umin_val;
> + else
> + dst_reg->smin_value >>= umax_val;
> + if (dst_reg->smax_value < 0)
> + dst_reg->smax_value >>= umax_val;
> + else
> + dst_reg->smax_value >>= umin_val;
> + if (src_known)
> + dst_reg->var_off = tnum_rshift(dst_reg->var_off,
> + umin_val);
> + else
> + dst_reg->var_off = tnum_rshift(tnum_unknown, umin_val);
> + dst_reg->umin_value >>= umax_val;
> + dst_reg->umax_value >>= umin_val;
> + /* We may learn something more from the var_off */
> + __update_reg_bounds(dst_reg);
I'm struggling to understand how these bounds are computed.
Could you add examples in the comments?
In particular if dst_reg is unknown (tnum.mask == -1)
the above tnum_rshift() will clear upper bits and will make it
64-bit positive, but that doesn't seem correct.
What am I missing?
^ permalink raw reply
* Re: [PATCH bpf-next v3 3/9] bpf/verifier: refine retval R0 state for bpf_get_stack helper
From: Alexei Starovoitov @ 2018-04-22 23:55 UTC (permalink / raw)
To: Yonghong Song; +Cc: ast, daniel, netdev, kernel-team
In-Reply-To: <20180420221842.742330-4-yhs@fb.com>
On Fri, Apr 20, 2018 at 03:18:36PM -0700, Yonghong Song wrote:
> The special property of return values for helpers bpf_get_stack
> and bpf_probe_read_str are captured in verifier.
> Both helpers return a negative error code or
> a length, which is equal to or smaller than the buffer
> size argument. This additional information in the
> verifier can avoid the condition such as "retval > bufsize"
> in the bpf program. For example, for the code blow,
> usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
> if (usize < 0 || usize > max_len)
> return 0;
> The verifier may have the following errors:
> 52: (85) call bpf_get_stack#65
> R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0)
> R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256
> R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
> R9_w=inv800 R10=fp0,call_-1
> 53: (bf) r8 = r0
> 54: (bf) r1 = r8
> 55: (67) r1 <<= 32
> 56: (bf) r2 = r1
> 57: (77) r2 >>= 32
> 58: (25) if r2 > 0x31f goto pc+33
> R0=inv(id=0) R1=inv(id=0,smax_value=9223372032559808512,
> umax_value=18446744069414584320,
> var_off=(0x0; 0xffffffff00000000))
> R2=inv(id=0,umax_value=799,var_off=(0x0; 0x3ff))
> R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
> R8=inv(id=0) R9=inv800 R10=fp0,call_-1
> 59: (1f) r9 -= r8
> 60: (c7) r1 s>>= 32
> 61: (bf) r2 = r7
> 62: (0f) r2 += r1
> math between map_value pointer and register with unbounded
> min value is not allowed
> The failure is due to llvm compiler optimization where register "r2",
> which is a copy of "r1", is tested for condition while later on "r1"
> is used for map_ptr operation. The verifier is not able to track such
> inst sequence effectively.
>
> Without the "usize > max_len" condition, there is no llvm optimization
> and the below generated code passed verifier:
> 52: (85) call bpf_get_stack#65
> R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0)
> R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256
> R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
> R9_w=inv800 R10=fp0,call_-1
> 53: (b7) r1 = 0
> 54: (bf) r8 = r0
> 55: (67) r8 <<= 32
> 56: (c7) r8 s>>= 32
> 57: (6d) if r1 s> r8 goto pc+24
> R0=inv(id=0,umax_value=800) R1=inv0 R6=ctx(id=0,off=0,imm=0)
> R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
> R8=inv(id=0,umax_value=800,var_off=(0x0; 0x3ff)) R9=inv800
> R10=fp0,call_-1
> 58: (bf) r2 = r7
> 59: (0f) r2 += r8
> 60: (1f) r9 -= r8
> 61: (bf) r1 = r6
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
> kernel/bpf/verifier.c | 27 +++++++++++++++++++++++++++
> 1 file changed, 27 insertions(+)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index aba9425..3c8bb92 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -164,6 +164,8 @@ struct bpf_call_arg_meta {
> bool pkt_access;
> int regno;
> int access_size;
> + s64 msize_smax_value;
> + u64 msize_umax_value;
> };
>
> static DEFINE_MUTEX(bpf_verifier_lock);
> @@ -2027,6 +2029,14 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
> err = check_helper_mem_access(env, regno - 1,
> reg->umax_value,
> zero_size_allowed, meta);
> +
> + if (!err && !!meta) {
Please drop !! in the above.
Also what happens when
if (!tnum_is_const(reg->var_off))
meta = NULL;
?
it seems two new fields of meta will stay zero initialized
that later do_refine_retval_range() will set R0->umax_value = 0
which seems incorrect.
> + /* remember the mem_size which may be used later
> + * to refine return values.
> + */
> + meta->msize_smax_value = reg->smax_value;
> + meta->msize_umax_value = reg->umax_value;
> + }
> }
>
> return err;
> @@ -2333,6 +2343,21 @@ static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx)
> return 0;
> }
>
> +static void do_refine_retval_range(struct bpf_reg_state *regs, int ret_type,
> + int func_id,
> + struct bpf_call_arg_meta *meta)
> +{
> + struct bpf_reg_state *ret_reg = ®s[BPF_REG_0];
> +
> + if (ret_type != RET_INTEGER ||
> + (func_id != BPF_FUNC_get_stack &&
> + func_id != BPF_FUNC_probe_read_str))
> + return;
> +
> + ret_reg->smax_value = meta->msize_smax_value;
> + ret_reg->umax_value = meta->msize_umax_value;
> +}
> +
> static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn_idx)
> {
> const struct bpf_func_proto *fn = NULL;
> @@ -2456,6 +2481,8 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
> return -EINVAL;
> }
>
> + do_refine_retval_range(regs, fn->ret_type, func_id, &meta);
> +
> err = check_map_func_compatibility(env, meta.map_ptr, func_id);
> if (err)
> return err;
> --
> 2.9.5
>
^ permalink raw reply
* XDP breakage with virtio due to 6870de435b90c083ae0f3f7f341287976ef56f03
From: David Ahern @ 2018-04-22 22:47 UTC (permalink / raw)
To: tehnerd, Jason Wang, daniel; +Cc: netdev@vger.kernel.org
This commit breaks my FIB forwarding program:
commit 6870de435b90c083ae0f3f7f341287976ef56f03
Author: Nikita V. Shirokov <tehnerd@tehnerd.com>
Date: Tue Apr 17 21:42:20 2018 -0700
bpf: make virtio compatible w/ bpf_xdp_adjust_tail
w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as
well (only "decrease" of pointer's location is going to be supported).
changing of this pointer will change packet's size.
for virtio driver we need to adjust XDP_PASS handling by recalculating
length of the packet if it was passed to the TCP/IP stack
Reviewed-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
###
Some of the packets (e.g., ARP or those without a resolved neighbor) are
passed to the networking stack. What shows up are clearly broken packets:
# tcpdump -n -i eth1
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 262144 bytes
15:45:29.693238 [|ARP]
0x0000: 0001 0800 0604 0001 42c0 ce2f 3fa9 0a64 ........B../?..d
15:45:30.710327 [|ARP]
0x0000: 0001 0800 0604 0001 42c0 ce2f 3fa9 0a64 ........B../?..d
15:45:31.734296 [|ARP]
0x0000: 0001 0800 0604 0001 42c0 ce2f 3fa9 0a64 ........B../?..d
15:45:32.908720 IP6 truncated-ip6 - 12 bytes
missing!fe80::40c0:ceff:fe2f:3fa9 > ff02::1:ff00:2: ICMP6, neighbor
solicitation[|icmp6]
15:45:33.910530 IP6 truncated-ip6 - 12 bytes missing!2001:db8:1::64 >
ff02::1:ff00:2: ICMP6, neighbor solicitation[|icmp6]
15:45:34.934437 IP6 truncated-ip6 - 12 bytes missing!2001:db8:1::64 >
ff02::1:ff00:2: ICMP6, neighbor solicitation[|icmp6]
15:45:35.958394 IP6 truncated-ip6 - 12 bytes missing!2001:db8:1::64 >
ff02::1:ff00:2: ICMP6, neighbor solicitation[|icmp6]
Reverting the mentioned patch fixes the problem.
^ permalink raw reply
* [PATCH bpf-next v4 1/2] bpf: allow map helpers access to map values directly
From: Paul Chaignon @ 2018-04-22 21:52 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, netdev; +Cc: iovisor-dev, paul.chaignon
In-Reply-To: <cover.1524407664.git.paul.chaignon@orange.com>
Helpers that expect ARG_PTR_TO_MAP_KEY and ARG_PTR_TO_MAP_VALUE can only
access stack and packet memory. Allow these helpers to directly access
map values by passing registers of type PTR_TO_MAP_VALUE.
This change removes the need for an extra copy to the stack when using a
map value to perform a second map lookup, as in the following:
struct bpf_map_def SEC("maps") infobyreq = {
.type = BPF_MAP_TYPE_HASHMAP,
.key_size = sizeof(struct request *),
.value_size = sizeof(struct info_t),
.max_entries = 1024,
};
struct bpf_map_def SEC("maps") counts = {
.type = BPF_MAP_TYPE_HASHMAP,
.key_size = sizeof(struct info_t),
.value_size = sizeof(u64),
.max_entries = 1024,
};
SEC("kprobe/blk_account_io_start")
int bpf_blk_account_io_start(struct pt_regs *ctx)
{
struct info_t *info = bpf_map_lookup_elem(&infobyreq, &ctx->di);
u64 *count = bpf_map_lookup_elem(&counts, info);
(*count)++;
}
Signed-off-by: Paul Chaignon <paul.chaignon@orange.com>
---
kernel/bpf/verifier.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 5dd1dcb902bf..70e00beade03 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1914,7 +1914,7 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
if (arg_type == ARG_PTR_TO_MAP_KEY ||
arg_type == ARG_PTR_TO_MAP_VALUE) {
expected_type = PTR_TO_STACK;
- if (!type_is_pkt_pointer(type) &&
+ if (!type_is_pkt_pointer(type) && type != PTR_TO_MAP_VALUE &&
type != expected_type)
goto err_type;
} else if (arg_type == ARG_CONST_SIZE ||
@@ -1970,6 +1970,9 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
err = check_packet_access(env, regno, reg->off,
meta->map_ptr->key_size,
false);
+ else if (type == PTR_TO_MAP_VALUE)
+ err = check_map_access(env, regno, reg->off,
+ meta->map_ptr->key_size, false);
else
err = check_stack_boundary(env, regno,
meta->map_ptr->key_size,
@@ -1987,6 +1990,10 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
err = check_packet_access(env, regno, reg->off,
meta->map_ptr->value_size,
false);
+ else if (type == PTR_TO_MAP_VALUE)
+ err = check_map_access(env, regno, reg->off,
+ meta->map_ptr->value_size,
+ false);
else
err = check_stack_boundary(env, regno,
meta->map_ptr->value_size,
--
2.14.1
^ permalink raw reply related
* [PATCH bpf-next v4 2/2] tools/bpf: add verifier tests for accesses to map
From: Paul Chaignon via iovisor-dev @ 2018-04-22 21:52 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann,
netdev-u79uwXL29TY76Z2rM5mHXA
Cc: iovisor-dev-9jONkmmOlFHEE9lA1F8Ukti2O/JbrIOy
In-Reply-To: <cover.1524407664.git.paul.chaignon-C0LM0jrOve7QT0dZR+AlfA@public.gmane.org>
This patch adds new test cases for accesses to map values from map
helpers.
Signed-off-by: Paul Chaignon <paul.chaignon-C0LM0jrOve7QT0dZR+AlfA@public.gmane.org>
---
tools/testing/selftests/bpf/test_verifier.c | 266 ++++++++++++++++++++++++++++
1 file changed, 266 insertions(+)
diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 3e7718b1a9ae..165e9ddfa446 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -64,6 +64,7 @@ struct bpf_test {
struct bpf_insn insns[MAX_INSNS];
int fixup_map1[MAX_FIXUPS];
int fixup_map2[MAX_FIXUPS];
+ int fixup_map3[MAX_FIXUPS];
int fixup_prog[MAX_FIXUPS];
int fixup_map_in_map[MAX_FIXUPS];
const char *errstr;
@@ -88,6 +89,11 @@ struct test_val {
int foo[MAX_ENTRIES];
};
+struct other_val {
+ long long foo;
+ long long bar;
+};
+
static struct bpf_test tests[] = {
{
"add+sub+mul",
@@ -5593,6 +5599,257 @@ static struct bpf_test tests[] = {
.errstr = "R1 min value is negative",
.prog_type = BPF_PROG_TYPE_TRACEPOINT,
},
+ {
+ "map lookup helper access to map",
+ .insns = {
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_ST_MEM(BPF_DW, BPF_REG_2, 0, 0),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+ BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 4),
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_0),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+ BPF_EXIT_INSN(),
+ },
+ .fixup_map3 = { 3, 8 },
+ .result = ACCEPT,
+ .prog_type = BPF_PROG_TYPE_TRACEPOINT,
+ },
+ {
+ "map update helper access to map",
+ .insns = {
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_ST_MEM(BPF_DW, BPF_REG_2, 0, 0),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+ BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 6),
+ BPF_MOV64_IMM(BPF_REG_4, 0),
+ BPF_MOV64_REG(BPF_REG_3, BPF_REG_0),
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_0),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_EMIT_CALL(BPF_FUNC_map_update_elem),
+ BPF_EXIT_INSN(),
+ },
+ .fixup_map3 = { 3, 10 },
+ .result = ACCEPT,
+ .prog_type = BPF_PROG_TYPE_TRACEPOINT,
+ },
+ {
+ "map update helper access to map: wrong size",
+ .insns = {
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_ST_MEM(BPF_DW, BPF_REG_2, 0, 0),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+ BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 6),
+ BPF_MOV64_IMM(BPF_REG_4, 0),
+ BPF_MOV64_REG(BPF_REG_3, BPF_REG_0),
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_0),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_EMIT_CALL(BPF_FUNC_map_update_elem),
+ BPF_EXIT_INSN(),
+ },
+ .fixup_map1 = { 3 },
+ .fixup_map3 = { 10 },
+ .result = REJECT,
+ .errstr = "invalid access to map value, value_size=8 off=0 size=16",
+ .prog_type = BPF_PROG_TYPE_TRACEPOINT,
+ },
+ {
+ "map helper access to adjusted map (via const imm)",
+ .insns = {
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_ST_MEM(BPF_DW, BPF_REG_2, 0, 0),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+ BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 5),
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_0),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2,
+ offsetof(struct other_val, bar)),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+ BPF_EXIT_INSN(),
+ },
+ .fixup_map3 = { 3, 9 },
+ .result = ACCEPT,
+ .prog_type = BPF_PROG_TYPE_TRACEPOINT,
+ },
+ {
+ "map helper access to adjusted map (via const imm): out-of-bound 1",
+ .insns = {
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_ST_MEM(BPF_DW, BPF_REG_2, 0, 0),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+ BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 5),
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_0),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2,
+ sizeof(struct other_val) - 4),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+ BPF_EXIT_INSN(),
+ },
+ .fixup_map3 = { 3, 9 },
+ .result = REJECT,
+ .errstr = "invalid access to map value, value_size=16 off=12 size=8",
+ .prog_type = BPF_PROG_TYPE_TRACEPOINT,
+ },
+ {
+ "map helper access to adjusted map (via const imm): out-of-bound 2",
+ .insns = {
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_ST_MEM(BPF_DW, BPF_REG_2, 0, 0),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+ BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 5),
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_0),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+ BPF_EXIT_INSN(),
+ },
+ .fixup_map3 = { 3, 9 },
+ .result = REJECT,
+ .errstr = "invalid access to map value, value_size=16 off=-4 size=8",
+ .prog_type = BPF_PROG_TYPE_TRACEPOINT,
+ },
+ {
+ "map helper access to adjusted map (via const reg)",
+ .insns = {
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_ST_MEM(BPF_DW, BPF_REG_2, 0, 0),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+ BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 6),
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_0),
+ BPF_MOV64_IMM(BPF_REG_3,
+ offsetof(struct other_val, bar)),
+ BPF_ALU64_REG(BPF_ADD, BPF_REG_2, BPF_REG_3),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+ BPF_EXIT_INSN(),
+ },
+ .fixup_map3 = { 3, 10 },
+ .result = ACCEPT,
+ .prog_type = BPF_PROG_TYPE_TRACEPOINT,
+ },
+ {
+ "map helper access to adjusted map (via const reg): out-of-bound 1",
+ .insns = {
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_ST_MEM(BPF_DW, BPF_REG_2, 0, 0),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+ BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 6),
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_0),
+ BPF_MOV64_IMM(BPF_REG_3,
+ sizeof(struct other_val) - 4),
+ BPF_ALU64_REG(BPF_ADD, BPF_REG_2, BPF_REG_3),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+ BPF_EXIT_INSN(),
+ },
+ .fixup_map3 = { 3, 10 },
+ .result = REJECT,
+ .errstr = "invalid access to map value, value_size=16 off=12 size=8",
+ .prog_type = BPF_PROG_TYPE_TRACEPOINT,
+ },
+ {
+ "map helper access to adjusted map (via const reg): out-of-bound 2",
+ .insns = {
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_ST_MEM(BPF_DW, BPF_REG_2, 0, 0),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+ BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 6),
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_0),
+ BPF_MOV64_IMM(BPF_REG_3, -4),
+ BPF_ALU64_REG(BPF_ADD, BPF_REG_2, BPF_REG_3),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+ BPF_EXIT_INSN(),
+ },
+ .fixup_map3 = { 3, 10 },
+ .result = REJECT,
+ .errstr = "invalid access to map value, value_size=16 off=-4 size=8",
+ .prog_type = BPF_PROG_TYPE_TRACEPOINT,
+ },
+ {
+ "map helper access to adjusted map (via variable)",
+ .insns = {
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_ST_MEM(BPF_DW, BPF_REG_2, 0, 0),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+ BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 7),
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_0),
+ BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_0, 0),
+ BPF_JMP_IMM(BPF_JGT, BPF_REG_3,
+ offsetof(struct other_val, bar), 4),
+ BPF_ALU64_REG(BPF_ADD, BPF_REG_2, BPF_REG_3),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+ BPF_EXIT_INSN(),
+ },
+ .fixup_map3 = { 3, 11 },
+ .result = ACCEPT,
+ .prog_type = BPF_PROG_TYPE_TRACEPOINT,
+ },
+ {
+ "map helper access to adjusted map (via variable): no max check",
+ .insns = {
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_ST_MEM(BPF_DW, BPF_REG_2, 0, 0),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+ BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 6),
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_0),
+ BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_0, 0),
+ BPF_ALU64_REG(BPF_ADD, BPF_REG_2, BPF_REG_3),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+ BPF_EXIT_INSN(),
+ },
+ .fixup_map3 = { 3, 10 },
+ .result = REJECT,
+ .errstr = "R2 unbounded memory access, make sure to bounds check any array access into a map",
+ .prog_type = BPF_PROG_TYPE_TRACEPOINT,
+ },
+ {
+ "map helper access to adjusted map (via variable): wrong max check",
+ .insns = {
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_ST_MEM(BPF_DW, BPF_REG_2, 0, 0),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+ BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 7),
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_0),
+ BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_0, 0),
+ BPF_JMP_IMM(BPF_JGT, BPF_REG_3,
+ offsetof(struct other_val, bar) + 1, 4),
+ BPF_ALU64_REG(BPF_ADD, BPF_REG_2, BPF_REG_3),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+ BPF_EXIT_INSN(),
+ },
+ .fixup_map3 = { 3, 11 },
+ .result = REJECT,
+ .errstr = "invalid access to map value, value_size=16 off=9 size=8",
+ .prog_type = BPF_PROG_TYPE_TRACEPOINT,
+ },
{
"map element value is preserved across register spilling",
.insns = {
@@ -11533,6 +11790,7 @@ static void do_test_fixup(struct bpf_test *test, struct bpf_insn *prog,
{
int *fixup_map1 = test->fixup_map1;
int *fixup_map2 = test->fixup_map2;
+ int *fixup_map3 = test->fixup_map3;
int *fixup_prog = test->fixup_prog;
int *fixup_map_in_map = test->fixup_map_in_map;
@@ -11556,6 +11814,14 @@ static void do_test_fixup(struct bpf_test *test, struct bpf_insn *prog,
} while (*fixup_map2);
}
+ if (*fixup_map3) {
+ map_fds[1] = create_map(sizeof(struct other_val), 1);
+ do {
+ prog[*fixup_map3].imm = map_fds[1];
+ fixup_map3++;
+ } while (*fixup_map3);
+ }
+
if (*fixup_prog) {
map_fds[2] = create_prog_array();
do {
--
2.14.1
^ permalink raw reply related
* [PATCH bpf-next v4 0/2] bpf: allow map helpers access to map values directly
From: Paul Chaignon via iovisor-dev @ 2018-04-22 21:50 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann,
netdev-u79uwXL29TY76Z2rM5mHXA
Cc: iovisor-dev-9jONkmmOlFHEE9lA1F8Ukti2O/JbrIOy
Currently, helpers that expect ARG_PTR_TO_MAP_KEY and ARG_PTR_TO_MAP_VALUE
can only access stack and packet memory. This patchset allows these
helpers to directly access map values by passing registers of type
PTR_TO_MAP_VALUE.
The first patch changes the verifier; the second adds new test cases.
Previous versions of this patchset were sent on the iovisor-dev mailing
list only.
Changelogs:
Changes in v4:
- Rebase.
Changes in v3:
- Bug fixes.
- Negative test cases.
Changes in v2:
- Additional test cases for adjusted maps.
Paul Chaignon (2):
bpf: allow map helpers access to map values directly
tools/bpf: add verifier tests for accesses to map
kernel/bpf/verifier.c | 9 +-
tools/testing/selftests/bpf/test_verifier.c | 266 ++++++++++++++++++++++++++++
2 files changed, 274 insertions(+), 1 deletion(-)
--
2.14.1
^ permalink raw reply
* kTLS in combination with mlx4 is very unstable
From: Andre Tomt @ 2018-04-22 21:21 UTC (permalink / raw)
To: netdev; +Cc: davejwatson, Tariq Toukan
Hello!
kTLS looks fun, so I decided to play with it. It is quite spiffy -
however with mlx4 I get kernel crashes I'm not seeing when testing on ixgbe.
For testing I'm using a git build of the "stream reflector" cubemap[1]
configured with kTLS and 8 worker threads running on 4 physical cores,
loading it up with a ~13Mbps MPEG-TS stream pulled from satelite TV.
The kernel seems to get increasingly unstable as I load it up with
client connections. At about 9Gbps and 700 connections, it is okay at
least for a while - it might run fine for say 45 minutes. Once it gets
to 20 - 30Gbps, the kernel will usually start spewing OOPSes within
minutes and the traffic drops.
Some bad interaction between mlx4 and kTLS?
Hardware is a quad core Xeon-D 1520 using a dual port Mellanox
ConnectX-3 VPI with a single 40Gbps ethernet link configured. Mellanox
mlx4 driver settings are kernel.org upstream defaults. Interface is
configured with FQ qdisc and sockets are using BBR congestion control.
Tested on kernel 4.14.34, 4.15.17, and 4.16.2 - 4.16.3.
[1] https://git.sesse.net/?p=cubemap
First OOPS (from 4.16.3):
> [ 660.467358] BUG: stack guard page was hit at 00000000b136e403 (stack is 00000000ded3f179..00000000835ee6c5)
> [ 660.467422] kernel stack overflow (double-fault): 0000 [#1] SMP PTI
> [ 660.467457] Modules linked in: coretemp intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp iTCO_wdt gpio_ich iTCO_vendor_support kvm_intel mxm_wmi xfs libcrc32c kvm crc32c_generic irqbypass nls_iso8859_1 crct10dif_pclmul crc32_pclmul nls_cp437 ghash_clmulni_intel vfat fat aesni_intel aes_x86_64 crypto_simd cryptd glue_helper intel_pch_thermal mei_me sg mei lpc_ich mfd_core evdev ipmi_si ipmi_devintf ipmi_msghandler wmi acpi_pad tls ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 hid_generic usbhid hid mlx4_ib mlx4_en ib_core sd_mod ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect ehci_pci xhci_pci sysimgblt fb_sys_fops ahci libahci xhci_hcd ehci_hcd libata crc32c_intel nvme drm usbcore scsi_mod mlx4_core ixgbe i2c_core mdio usb_common devlink hwmon nvme_core rtc_cm
os
> [ 660.467856] CPU: 4 PID: 660 Comm: cubemap Not tainted 4.16.0-1 #1
> [ 660.467890] Hardware name: Supermicro Super Server/X10SDV-4C-TLN2F, BIOS 1.2c 09/19/2017
> [ 660.467939] RIP: 0010:__kmalloc+0x7/0x1f0
> [ 660.467962] RSP: 0018:ffffabafc27b8000 EFLAGS: 00010206
> [ 660.467992] RAX: 000000000000000d RBX: 0000000000000010 RCX: ffffabafc27b8070
> [ 660.468030] RDX: ffff98a0d0235490 RSI: 0000000001080020 RDI: 000000000000001d
> [ 660.468069] RBP: 000000000000000d R08: ffff98a0d5be4860 R09: ffff98a0ec299180
> [ 660.468106] R10: ffffabafc27b80b8 R11: 0000000000000010 R12: 0000000000000010
> [ 660.468145] R13: ffff98a0ec299180 R14: ffff98a0ec299180 R15: 0000000000000000
> [ 660.468184] FS: 00007f8a35ffb700(0000) GS:ffff98a17fd00000(0000) knlGS:0000000000000000
> [ 660.468227] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 660.468258] CR2: ffffabafc27b7ff8 CR3: 00000004698ee001 CR4: 00000000003606e0
> [ 660.468297] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 660.468334] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 660.468373] Call Trace:
> [ 660.468401] gcmaes_encrypt.constprop.5+0x137/0x240 [aesni_intel]
> [ 660.468439] ? generic_gcmaes_encrypt+0x5f/0x80 [aesni_intel]
> [ 660.468476] ? gcmaes_wrapper_encrypt+0x36/0x80 [aesni_intel]
> [ 660.468511] ? tls_push_record+0x1d3/0x390 [tls]
> [ 660.468537] ? tls_push_record+0x1d3/0x390 [tls]
> [ 660.468565] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.468593] ? do_tcp_sendpages+0x8d/0x580
> [ 660.468618] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.468643] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.468671] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.468697] ? do_tcp_sendpages+0x8d/0x580
> [ 660.468722] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.468748] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.468776] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.468802] ? do_tcp_sendpages+0x8d/0x580
> [ 660.468826] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.468852] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.468880] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.468906] ? do_tcp_sendpages+0x8d/0x580
> [ 660.468931] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.468957] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.470165] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.471363] ? do_tcp_sendpages+0x8d/0x580
> [ 660.472555] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.473713] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.474838] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.475927] ? do_tcp_sendpages+0x8d/0x580
> [ 660.476977] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.477999] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.478968] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.479902] ? do_tcp_sendpages+0x8d/0x580
> [ 660.480790] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.481644] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.482483] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.483301] ? do_tcp_sendpages+0x8d/0x580
> [ 660.484099] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.484891] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.485674] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.486455] ? do_tcp_sendpages+0x8d/0x580
> [ 660.487220] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.487890] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.488328] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.488748] ? do_tcp_sendpages+0x8d/0x580
> [ 660.489167] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.489565] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.489970] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.490370] ? do_tcp_sendpages+0x8d/0x580
> [ 660.490771] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.491165] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.491550] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.491914] ? do_tcp_sendpages+0x8d/0x580
> [ 660.492274] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.492641] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.493008] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.493374] ? do_tcp_sendpages+0x8d/0x580
> [ 660.493787] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.494177] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.494585] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.494972] ? do_tcp_sendpages+0x8d/0x580
> [ 660.495359] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.495742] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.496128] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.496512] ? do_tcp_sendpages+0x8d/0x580
> [ 660.496901] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.497301] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.497697] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.498096] ? do_tcp_sendpages+0x8d/0x580
> [ 660.498490] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.498884] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.499291] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.499700] ? do_tcp_sendpages+0x8d/0x580
> [ 660.500103] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.500511] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.500909] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.501326] ? do_tcp_sendpages+0x8d/0x580
> [ 660.501737] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.502131] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.502525] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.502928] ? do_tcp_sendpages+0x8d/0x580
> [ 660.503331] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.503724] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.504127] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.504547] ? do_tcp_sendpages+0x8d/0x580
> [ 660.504949] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.505348] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.505769] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.506207] ? do_tcp_sendpages+0x8d/0x580
> [ 660.506622] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.507030] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.507435] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.507841] ? do_tcp_sendpages+0x8d/0x580
> [ 660.508518] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.509261] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.510011] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.510187] BUG: stack guard page was hit at 00000000e0315e51 (stack is 00000000bea6f919..0000000005fc5eb4)
> [ 660.510473] BUG: stack guard page was hit at 000000004b958a15 (stack is 000000001f2af2d1..000000006295a4b1)
> [ 660.510758] ? do_tcp_sendpages+0x8d/0x580
> [ 660.513094] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.513886] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.514680] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.515487] ? do_tcp_sendpages+0x8d/0x580
> [ 660.515750] BUG: stack guard page was hit at 00000000bc93cf0d (stack is 0000000031a15c9c..0000000029a82776)
> [ 660.516295] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.518017] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.518883] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.519752] ? do_tcp_sendpages+0x8d/0x580
> [ 660.519816] BUG: stack guard page was hit at 000000002d1db286 (stack is 00000000b5bb06d4..000000007a29c8f2)
> [ 660.520544] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.522315] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.523162] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.524006] ? do_tcp_sendpages+0x8d/0x580
> [ 660.524849] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.525695] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.526545] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.527399] ? do_tcp_sendpages+0x8d/0x580
> [ 660.528247] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.529099] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.529955] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.530797] ? do_tcp_sendpages+0x8d/0x580
> [ 660.531643] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.532010] BUG: stack guard page was hit at 0000000027abda92 (stack is 00000000aadcb221..00000000a587b67b)
> [ 660.532535] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.534511] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.535506] ? do_tcp_sendpages+0x8d/0x580
> [ 660.536500] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.537495] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.538493] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.539482] ? do_tcp_sendpages+0x8d/0x580
> [ 660.540462] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.541447] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.542430] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.543411] ? do_tcp_sendpages+0x8d/0x580
> [ 660.544395] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.545382] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.546365] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.547347] ? do_tcp_sendpages+0x8d/0x580
> [ 660.548334] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.549318] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.550300] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.551284] ? do_tcp_sendpages+0x8d/0x580
> [ 660.552267] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.553250] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.554205] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.555158] ? do_tcp_sendpages+0x8d/0x580
> [ 660.556083] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.557009] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.557936] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.558862] ? do_tcp_sendpages+0x8d/0x580
> [ 660.559786] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.560681] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.561547] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.562413] ? do_tcp_sendpages+0x8d/0x580
> [ 660.563279] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.564143] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.564979] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.565783] ? do_tcp_sendpages+0x8d/0x580
> [ 660.566587] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.567392] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.568197] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.569000] ? do_tcp_sendpages+0x8d/0x580
> [ 660.569804] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.570609] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.571415] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.572218] ? do_tcp_sendpages+0x8d/0x580
> [ 660.573023] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.573830] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.574634] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.575437] ? do_tcp_sendpages+0x8d/0x580
> [ 660.576210] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.576953] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.577698] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.578441] ? do_tcp_sendpages+0x8d/0x580
> [ 660.579183] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.579929] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.580673] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.581417] ? do_tcp_sendpages+0x8d/0x580
> [ 660.582159] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.582904] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.583649] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.584394] ? do_tcp_sendpages+0x8d/0x580
> [ 660.585137] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.585882] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.586628] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.587372] ? do_tcp_sendpages+0x8d/0x580
> [ 660.588115] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.588861] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.589605] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.590350] ? do_tcp_sendpages+0x8d/0x580
> [ 660.591093] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.591840] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.592585] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.593328] ? do_tcp_sendpages+0x8d/0x580
> [ 660.594072] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.594816] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.595563] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.596308] ? do_tcp_sendpages+0x8d/0x580
> [ 660.597050] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.597794] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.598540] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.599281] ? do_tcp_sendpages+0x8d/0x580
> [ 660.600025] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.600772] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.601517] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.602260] ? do_tcp_sendpages+0x8d/0x580
> [ 660.603003] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.603750] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.604495] ? tls_write_space+0x6a/0x80 [tls]
> [ 660.605238] ? do_tcp_sendpages+0x8d/0x580
> [ 660.605981] ? tls_push_sg+0x74/0x130 [tls]
> [ 660.606726] ? tls_push_record+0x24a/0x390 [tls]
> [ 660.607474] ? tls_sw_sendpage+0x14a/0x390 [tls]
> [ 660.608214] ? direct_splice_actor+0x40/0x40
> [ 660.608951] ? inet_sendpage+0x40/0xf0
> [ 660.609689] ? kernel_sendpage+0x1a/0x30
> [ 660.610426] ? sock_sendpage+0x20/0x30
> [ 660.611161] ? pipe_to_sendpage+0x5f/0x70
> [ 660.611898] ? __splice_from_pipe+0x80/0x180
> [ 660.612637] ? generic_file_splice_read+0x100/0x150
> [ 660.613382] ? direct_splice_actor+0x40/0x40
> [ 660.614128] ? splice_from_pipe+0x4f/0x70
> [ 660.614871] ? direct_splice_actor+0x35/0x40
> [ 660.615619] ? splice_direct_to_actor+0xce/0x1d0
> [ 660.616368] ? generic_pipe_buf_nosteal+0x10/0x10
> [ 660.617122] ? do_splice_direct+0x8c/0xa0
> [ 660.617876] ? do_sendfile+0x19d/0x380
> [ 660.618626] ? SyS_sendfile64+0x4c/0x90
> [ 660.619376] ? do_syscall_64+0x7a/0x390
> [ 660.620121] ? do_page_fault+0x31/0x130
> [ 660.620863] ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> [ 660.621618] Code: 24 08 4c 89 e9 48 89 de e8 d7 66 63 00 4d 8b 17 5a 4d 85 d2 75 d7 e9 e9 fe ff ff 48 89 c5 e9 e1 fe ff ff 90 0f 1f 44 00 00 41 57 <41> 56 41 55 41 54 55 53 48 81 ff 00 20 00 00 0f 87 a4 01 00 00
> [ 660.623316] RIP: __kmalloc+0x7/0x1f0 RSP: ffffabafc27b8000
> [ 660.624168] ---[ end trace 7f6206177c0cc58f ]---
^ permalink raw reply
* Re: [PATCH 1/3] ethtool: Support ETHTOOL_GSTATS2 command.
From: Roopa Prabhu @ 2018-04-22 21:15 UTC (permalink / raw)
To: David Miller
Cc: Johannes Berg, Ben Greear, netdev,
linux-wireless-u79uwXL29TY76Z2rM5mHXA,
ath10k-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r
In-Reply-To: <20180422.145420.1197041027922699603.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
On Sun, Apr 22, 2018 at 11:54 AM, David Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> wrote:
> From: Johannes Berg <johannes-cdvu00un1VgdHxzADdlk8Q@public.gmane.org>
> Date: Thu, 19 Apr 2018 17:26:57 +0200
>
>> On Thu, 2018-04-19 at 08:25 -0700, Ben Greear wrote:
>>>
>>> Maybe this could be in followup patches? It's going to touch a lot of files,
>>> and might be hell to get merged all at once, and I've never used spatch, so
>>> just maybe someone else will volunteer that part :)
>>
>> I guess you'll have to ask davem. :)
>
> Well, first of all, I really don't like this.
>
> The first reason is that every time I see interface foo become foo2,
> foo3 is never far behind it.
>
> If foo was not extensible enough such that we needed foo2, we beter
> design the new thing with explicitly better extensibility in mind.
>
> Furthermore, what you want here is a specific filter. Someone else
> will want to filter on another criteria, and the next person will
> want yet another.
>
> This needs to be properly generalized.
>
> And frankly if we had moved to ethtool netlink/devlink by now, we
> could just add a netlink attribute for filtering and not even be
> having this conversation.
+1.
Also, the RTM_GETSTATS api was added to improve stats query efficiency
(with filters).
we should look at it to see if this fits there. Keeping all stats
queries in one place will help.
^ permalink raw reply
* Re: linux-next on x60: network manager often complains "network is disabled" after resume
From: Pavel Machek @ 2018-04-22 19:22 UTC (permalink / raw)
To: Woody Suwalski
Cc: Rafael J. Wysocki, kernel list, Linux-pm mailing list,
Netdev list
In-Reply-To: <01331018-8c62-30b2-673c-11b0bcd38c0e@gmail.com>
[-- Attachment #1: Type: text/plain, Size: 1046 bytes --]
Hi!
> >I guess you can't reproduce it easily? I tried bisecting, but while it
> >happens often enough to make v4.17 hard to use, it does not permit
> >reliable bisect.
> >
> >These should be bad according to my notes
> >
> >b04240a33b99b32cf6fbdf5c943c04e505a0cb07
> > ed80dc19e4dd395c951f745acd1484d61c4cfb20
> > 52113a0d3889d6e2738cf09bf79bc9cac7b5e1c6
> > 4fc97ef94bbfa185d16b3e44199b7559d0668747
> > 14ebdb2c814f508936fe178a2abc906a16a3ab48
> > 639adbeef5ae1bb8eeebbb0cde0b885397bde192
> >
> >bisection claimed
> >
> >c16add24522547bf52c189b3c0d1ab6f5c2b4375
> >
> >is first bad commit, but I'm not sure if I trust that.
> > Pavel
> It has not happen on any of my systems in the last month. Good, but bad for
> getting more info :-(
My current theory is that it only happens if you suspend your machine
for > ten minutes, or something like that.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]
^ permalink raw reply
* Re: [Patch net] llc: fix NULL pointer deref for SOCK_ZAPPED
From: David Miller @ 2018-04-22 18:56 UTC (permalink / raw)
To: xiyou.wangcong; +Cc: netdev
In-Reply-To: <20180420045434.21477-1-xiyou.wangcong@gmail.com>
From: Cong Wang <xiyou.wangcong@gmail.com>
Date: Thu, 19 Apr 2018 21:54:34 -0700
> For SOCK_ZAPPED socket, we don't need to care about llc->sap,
> so we should just skip these refcount functions in this case.
>
> Fixes: f7e43672683b ("llc: hold llc_sap before release_sock()")
> Reported-by: kernel test robot <lkp@intel.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Applied and queued up for -stable, thanks Cong.
^ permalink raw reply
* Re: [PATCH v2] net: ethernet: ti: cpsw: fix tx vlan priority mapping
From: David Miller @ 2018-04-22 18:56 UTC (permalink / raw)
To: ivan.khoronzhuk; +Cc: grygorii.strashko, linux-omap, netdev, linux-kernel
In-Reply-To: <1524167349-11004-1-git-send-email-ivan.khoronzhuk@linaro.org>
From: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Date: Thu, 19 Apr 2018 22:49:09 +0300
> The CPDMA_TX_PRIORITY_MAP in real is vlan pcp field priority mapping
> register and basically replaces vlan pcp field for tagged packets.
> So, set it to be 1:1 mapping. Otherwise, it will cause unexpected
> change of egress vlan tagged packets, like prio 2 -> prio 5.
>
> Fixes: e05107e6b747 ("net: ethernet: ti: cpsw: add multi queue support")
> Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
> Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
> ---
Applied and queud up for -stable.
^ permalink raw reply
* Re: [Patch net] llc: delete timers synchronously in llc_sk_free()
From: David Miller @ 2018-04-22 18:55 UTC (permalink / raw)
To: xiyou.wangcong; +Cc: netdev
In-Reply-To: <20180419192538.3362-1-xiyou.wangcong@gmail.com>
From: Cong Wang <xiyou.wangcong@gmail.com>
Date: Thu, 19 Apr 2018 12:25:38 -0700
> The connection timers of an llc sock could be still flying
> after we delete them in llc_sk_free(), and even possibly
> after we free the sock. We could just wait synchronously
> here in case of troubles.
>
> Note, I leave other call paths as they are, since they may
> not have to wait, at least we can change them to synchronously
> when needed.
>
> Also, move the code to net/llc/llc_conn.c, which is apparently
> a better place.
>
> Reported-by: <syzbot+f922284c18ea23a8e457@syzkaller.appspotmail.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Applied and queued up for -stable.
^ permalink raw reply
* Re: [PATCH 1/3] ethtool: Support ETHTOOL_GSTATS2 command.
From: David Miller @ 2018-04-22 18:54 UTC (permalink / raw)
To: johannes-cdvu00un1VgdHxzADdlk8Q
Cc: greearb-my8/4N5VtI7c+919tysfdA, netdev-u79uwXL29TY76Z2rM5mHXA,
linux-wireless-u79uwXL29TY76Z2rM5mHXA,
ath10k-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r
In-Reply-To: <1524151617.3024.25.camel-cdvu00un1VgdHxzADdlk8Q@public.gmane.org>
From: Johannes Berg <johannes-cdvu00un1VgdHxzADdlk8Q@public.gmane.org>
Date: Thu, 19 Apr 2018 17:26:57 +0200
> On Thu, 2018-04-19 at 08:25 -0700, Ben Greear wrote:
>>
>> Maybe this could be in followup patches? It's going to touch a lot of files,
>> and might be hell to get merged all at once, and I've never used spatch, so
>> just maybe someone else will volunteer that part :)
>
> I guess you'll have to ask davem. :)
Well, first of all, I really don't like this.
The first reason is that every time I see interface foo become foo2,
foo3 is never far behind it.
If foo was not extensible enough such that we needed foo2, we beter
design the new thing with explicitly better extensibility in mind.
Furthermore, what you want here is a specific filter. Someone else
will want to filter on another criteria, and the next person will
want yet another.
This needs to be properly generalized.
And frankly if we had moved to ethtool netlink/devlink by now, we
could just add a netlink attribute for filtering and not even be
having this conversation.
^ permalink raw reply
* Re: [PATCH net] l2tp: fix {pppol2tp,l2tp_dfs}_seq_stop() in case of seq_file overflow
From: David Miller @ 2018-04-22 18:49 UTC (permalink / raw)
To: g.nault; +Cc: netdev, jchapman
In-Reply-To: <27278b212848db65313d91d3c39407212bc21fe8.1524146802.git.g.nault@alphalink.fr>
From: Guillaume Nault <g.nault@alphalink.fr>
Date: Thu, 19 Apr 2018 16:20:48 +0200
> Commit 0e0c3fee3a59 ("l2tp: hold reference on tunnels printed in pppol2tp proc file")
> assumed that if pppol2tp_seq_stop() was called with non-NULL private
> data (the 'v' pointer), then pppol2tp_seq_start() would not be called
> again. It turns out that this isn't guaranteed, and overflowing the
> seq_file's buffer in pppol2tp_seq_show() is a way to get into this
> situation.
>
> Therefore, pppol2tp_seq_stop() needs to reset pd->tunnel, so that
> pppol2tp_seq_start() won't drop a reference again if it gets called.
> We also have to clear pd->session, because the rest of the code expects
> a non-NULL tunnel when pd->session is set.
>
> The l2tp_debugfs module has the same issue. Fix it in the same way.
>
> Fixes: 0e0c3fee3a59 ("l2tp: hold reference on tunnels printed in pppol2tp proc file")
> Fixes: f726214d9b23 ("l2tp: hold reference on tunnels printed in l2tp/tunnels debugfs file")
> Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Applied and queued up for -stable, thank you.
^ permalink raw reply
* Re: [PATCH net 0/6] s390/qeth: fixes 2018-04-19
From: David Miller @ 2018-04-22 18:42 UTC (permalink / raw)
To: jwi; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, ubraun
In-Reply-To: <20180419105211.83935-1-jwi@linux.ibm.com>
From: Julian Wiedmann <jwi@linux.ibm.com>
Date: Thu, 19 Apr 2018 12:52:05 +0200
> new mail address, same old me.
>
> Please apply the following qeth fixes for 4.17. The common theme
> seems to be error handling improvements in various areas of cmd IO.
>
> Patches 1-3 should also go back to stable.
Series applied and patches 1-3 queued up for -stable, thanks.
^ permalink raw reply
* Re: [PATCH v7 net-next 2/4] net: Introduce generic failover module
From: kbuild test robot @ 2018-04-22 18:29 UTC (permalink / raw)
To: Sridhar Samudrala
Cc: alexander.h.duyck, virtio-dev, jiri, mst, kubakici, netdev,
virtualization, loseweigh, kbuild-all, sridhar.samudrala, davem
In-Reply-To: <1524188524-28411-3-git-send-email-sridhar.samudrala@intel.com>
Hi Sridhar,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on net/master]
[also build test WARNING on v4.17-rc1]
[cannot apply to net-next/master next-20180420]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Sridhar-Samudrala/Enable-virtio_net-to-act-as-a-standby-for-a-passthru-device/20180422-210557
reproduce:
# apt-get install sparse
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__
sparse warnings: (new ones prefixed by >>)
>> net/core/failover.c:99:36: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct net_device *dev @@ got struct net_devicestruct net_device *dev @@
net/core/failover.c:99:36: expected struct net_device *dev
net/core/failover.c:99:36: got struct net_device [noderef] <asn:4>*standby_dev
net/core/failover.c:102:36: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct net_device *dev @@ got struct net_devicestruct net_device *dev @@
net/core/failover.c:102:36: expected struct net_device *dev
net/core/failover.c:102:36: got struct net_device [noderef] <asn:4>*primary_dev
>> net/core/failover.c:468:12: sparse: context imbalance in 'failover_select_queue' - wrong count at exit
vim +99 net/core/failover.c
58
59 static int failover_slave_join(struct net_device *slave_dev,
60 struct net_device *failover_dev,
61 struct failover_ops *failover_ops)
62 {
63 struct failover_info *finfo;
64 int err, orig_mtu;
65 bool standby;
66
67 if (failover_ops) {
68 if (!failover_ops->slave_join)
69 return -EINVAL;
70
71 return failover_ops->slave_join(slave_dev, failover_dev);
72 }
73
74 if (netif_running(failover_dev)) {
75 err = dev_open(slave_dev);
76 if (err && (err != -EBUSY)) {
77 netdev_err(failover_dev, "Opening slave %s failed err:%d\n",
78 slave_dev->name, err);
79 goto err_dev_open;
80 }
81 }
82
83 /* Align MTU of slave with failover dev */
84 orig_mtu = slave_dev->mtu;
85 err = dev_set_mtu(slave_dev, failover_dev->mtu);
86 if (err) {
87 netdev_err(failover_dev, "unable to change mtu of %s to %u register failed\n",
88 slave_dev->name, failover_dev->mtu);
89 goto err_set_mtu;
90 }
91
92 finfo = netdev_priv(failover_dev);
93 standby = (slave_dev->dev.parent == failover_dev->dev.parent);
94
95 dev_hold(slave_dev);
96
97 if (standby) {
98 rcu_assign_pointer(finfo->standby_dev, slave_dev);
> 99 dev_get_stats(finfo->standby_dev, &finfo->standby_stats);
100 } else {
101 rcu_assign_pointer(finfo->primary_dev, slave_dev);
102 dev_get_stats(finfo->primary_dev, &finfo->primary_stats);
103 failover_dev->min_mtu = slave_dev->min_mtu;
104 failover_dev->max_mtu = slave_dev->max_mtu;
105 }
106
107 netdev_info(failover_dev, "failover slave:%s joined\n",
108 slave_dev->name);
109
110 return 0;
111
112 err_set_mtu:
113 dev_close(slave_dev);
114 err_dev_open:
115 return err;
116 }
117
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
^ permalink raw reply
* Re: [PATCH v3 14/20] mtd: Remove depends on HAS_DMA in case of platform dependency
From: Boris Brezillon @ 2018-04-22 17:12 UTC (permalink / raw)
To: Geert Uytterhoeven
Cc: Ulf Hansson, Wolfram Sang, linux-iio-u79uwXL29TY76Z2rM5mHXA,
linux-fpga-u79uwXL29TY76Z2rM5mHXA,
linux-remoteproc-u79uwXL29TY76Z2rM5mHXA,
alsa-devel-K7yf7f+aM1XWsZ/bQMPhNw, Bjorn Andersson, Eric Anholt,
netdev-u79uwXL29TY76Z2rM5mHXA,
linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
linux-i2c-u79uwXL29TY76Z2rM5mHXA,
linux1394-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
Christoph Hellwig, Stefan Wahren, Boris Brezillon, Herbert Xu,
Richard Weinberger, Jassi Brar, Marek Vasut,
linux-serial-u79uwXL29TY76Z2rM5mHXA, Matias Bjorling,
David Woodhouse, linux-media-u79uwXL29TY76Z2rM5mHXA,
Ohad Ben-Cohen, devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b, Alan Tull
In-Reply-To: <1523987360-18760-15-git-send-email-geert-Td1EMuHUCqxL1ZNQvxDV9g@public.gmane.org>
On Tue, 17 Apr 2018 19:49:14 +0200
Geert Uytterhoeven <geert-Td1EMuHUCqxL1ZNQvxDV9g@public.gmane.org> wrote:
> Remove dependencies on HAS_DMA where a Kconfig symbol depends on another
> symbol that implies HAS_DMA, and, optionally, on "|| COMPILE_TEST".
> In most cases this other symbol is an architecture or platform specific
> symbol, or PCI.
>
> Generic symbols and drivers without platform dependencies keep their
> dependencies on HAS_DMA, to prevent compiling subsystems or drivers that
> cannot work anyway.
>
> This simplifies the dependencies, and allows to improve compile-testing.
>
> Signed-off-by: Geert Uytterhoeven <geert-Td1EMuHUCqxL1ZNQvxDV9g@public.gmane.org>
> Reviewed-by: Mark Brown <broonie-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Acked-by: Robin Murphy <robin.murphy-5wv7dgnIgG8@public.gmane.org>
Applied to mtd/next.
Thanks,
Boris
> ---
> v3:
> - Rebase to v4.17-rc1,
>
> v2:
> - Add Reviewed-by, Acked-by,
> - Drop RFC state,
> - Drop new dependency of MTD_NAND_MARVELL on HAS_DMA,
> - Split per subsystem.
> ---
> drivers/mtd/nand/raw/Kconfig | 8 ++------
> drivers/mtd/spi-nor/Kconfig | 2 +-
> 2 files changed, 3 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/mtd/nand/raw/Kconfig b/drivers/mtd/nand/raw/Kconfig
> index 19a2b283fbbe627e..6871ff0fd300bb81 100644
> --- a/drivers/mtd/nand/raw/Kconfig
> +++ b/drivers/mtd/nand/raw/Kconfig
> @@ -46,7 +46,7 @@ config MTD_NAND_DENALI
> config MTD_NAND_DENALI_PCI
> tristate "Support Denali NAND controller on Intel Moorestown"
> select MTD_NAND_DENALI
> - depends on HAS_DMA && PCI
> + depends on PCI
> help
> Enable the driver for NAND flash on Intel Moorestown, using the
> Denali NAND controller core.
> @@ -152,7 +152,6 @@ config MTD_NAND_S3C2410_CLKSTOP
> config MTD_NAND_TANGO
> tristate "NAND Flash support for Tango chips"
> depends on ARCH_TANGO || COMPILE_TEST
> - depends on HAS_DMA
> help
> Enables the NAND Flash controller on Tango chips.
>
> @@ -285,7 +284,7 @@ config MTD_NAND_MARVELL
> tristate "NAND controller support on Marvell boards"
> depends on PXA3xx || ARCH_MMP || PLAT_ORION || ARCH_MVEBU || \
> COMPILE_TEST
> - depends on HAS_IOMEM && HAS_DMA
> + depends on HAS_IOMEM
> help
> This enables the NAND flash controller driver for Marvell boards,
> including:
> @@ -447,7 +446,6 @@ config MTD_NAND_SH_FLCTL
> tristate "Support for NAND on Renesas SuperH FLCTL"
> depends on SUPERH || COMPILE_TEST
> depends on HAS_IOMEM
> - depends on HAS_DMA
> help
> Several Renesas SuperH CPU has FLCTL. This option enables support
> for NAND Flash using FLCTL.
> @@ -515,7 +513,6 @@ config MTD_NAND_SUNXI
> config MTD_NAND_HISI504
> tristate "Support for NAND controller on Hisilicon SoC Hip04"
> depends on ARCH_HISI || COMPILE_TEST
> - depends on HAS_DMA
> help
> Enables support for NAND controller on Hisilicon SoC Hip04.
>
> @@ -529,7 +526,6 @@ config MTD_NAND_QCOM
> config MTD_NAND_MTK
> tristate "Support for NAND controller on MTK SoCs"
> depends on ARCH_MEDIATEK || COMPILE_TEST
> - depends on HAS_DMA
> help
> Enables support for NAND controller on MTK SoCs.
> This controller is found on mt27xx, mt81xx, mt65xx SoCs.
> diff --git a/drivers/mtd/spi-nor/Kconfig b/drivers/mtd/spi-nor/Kconfig
> index 89da88e591215db1..c493b8230a38c059 100644
> --- a/drivers/mtd/spi-nor/Kconfig
> +++ b/drivers/mtd/spi-nor/Kconfig
> @@ -71,7 +71,7 @@ config SPI_FSL_QUADSPI
> config SPI_HISI_SFC
> tristate "Hisilicon SPI-NOR Flash Controller(SFC)"
> depends on ARCH_HISI || COMPILE_TEST
> - depends on HAS_IOMEM && HAS_DMA
> + depends on HAS_IOMEM
> help
> This enables support for hisilicon SPI-NOR flash controller.
>
^ permalink raw reply
* Re: [PATCH v7 net-next 2/4] net: Introduce generic failover module
From: Michael S. Tsirkin @ 2018-04-22 17:06 UTC (permalink / raw)
To: Sridhar Samudrala
Cc: stephen, davem, netdev, virtualization, virtio-dev,
jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
loseweigh, jiri
In-Reply-To: <1524188524-28411-3-git-send-email-sridhar.samudrala@intel.com>
On Thu, Apr 19, 2018 at 06:42:02PM -0700, Sridhar Samudrala wrote:
> +#if IS_ENABLED(CONFIG_NET_FAILOVER)
> +
> +int failover_create(struct net_device *standby_dev,
> + struct failover **pfailover);
Should we rename all these structs net_failover?
It's possible to extend the concept to storage I think.
> +void failover_destroy(struct failover *failover);
> +
> +int failover_register(struct net_device *standby_dev, struct failover_ops *ops,
> + struct failover **pfailover);
> +void failover_unregister(struct failover *failover);
> +
> +int failover_slave_unregister(struct net_device *slave_dev);
> +
> +#else
> +
> +static inline
> +int failover_create(struct net_device *standby_dev,
> + struct failover **pfailover);
> +{
> + return 0;
Does this make callers do something sane?
Shouldn't these return an error?
> +}
> +
> +static inline
> +void failover_destroy(struct failover *failover)
> +{
> +}
> +
> +static inline
> +int failover_register(struct net_device *standby_dev, struct failover_ops *ops,
> + struct pfailover **pfailover);
> +{
> + return 0;
> +}
struct pfailover seems like a typo.
> +
> +static inline
> +void failover_unregister(struct failover *failover)
> +{
> +}
> +
> +static inline
> +int failover_slave_unregister(struct net_device *slave_dev)
> +{
> + return 0;
> +}
Does anyone test return value of unregister?
should this be void?
> +
> +#endif
> +
> +#endif /* _NET_FAILOVER_H */
^ permalink raw reply
* Re: Page allocator bottleneck
From: Tariq Toukan @ 2018-04-22 16:43 UTC (permalink / raw)
To: Aaron Lu
Cc: Linux Kernel Network Developers, linux-mm, Mel Gorman,
David Miller, Jesper Dangaard Brouer, Eric Dumazet,
Alexei Starovoitov, Saeed Mahameed, Eran Ben Elisha,
Andrew Morton, Michal Hocko
In-Reply-To: <20180421081505.GA24916@intel.com>
On 21/04/2018 11:15 AM, Aaron Lu wrote:
> Sorry to bring up an old thread...
>
I want to thank you very much for bringing this up!
> On Thu, Nov 02, 2017 at 07:21:09PM +0200, Tariq Toukan wrote:
>>
>>
>> On 18/09/2017 12:16 PM, Tariq Toukan wrote:
>>>
>>>
>>> On 15/09/2017 1:23 PM, Mel Gorman wrote:
>>>> On Thu, Sep 14, 2017 at 07:49:31PM +0300, Tariq Toukan wrote:
>>>>> Insights: Major degradation between #1 and #2, not getting any
>>>>> close to linerate! Degradation is fixed between #2 and #3. This is
>>>>> because page allocator cannot stand the higher allocation rate. In
>>>>> #2, we also see that the addition of rings (cores) reduces BW (!!),
>>>>> as result of increasing congestion over shared resources.
>>>>>
>>>>
>>>> Unfortunately, no surprises there.
>>>>
>>>>> Congestion in this case is very clear. When monitored in perf
>>>>> top: 85.58% [kernel] [k] queued_spin_lock_slowpath
>>>>>
>>>>
>>>> While it's not proven, the most likely candidate is the zone lock
>>>> and that should be confirmed using a call-graph profile. If so, then
>>>> the suggestion to tune to the size of the per-cpu allocator would
>>>> mitigate the problem.
>>>>
>>> Indeed, I tuned the per-cpu allocator and bottleneck is released.
>>>
>>
>> Hi all,
>>
>> After leaving this task for a while doing other tasks, I got back to it now
>> and see that the good behavior I observed earlier was not stable.
>
> I posted a patchset to improve zone->lock contention for order-0 pages
> recently, it can almost eliminate 80% zone->lock contention for
> will-it-scale/page_fault1 testcase when tested on a 2 sockets Intel
> Skylake server and it doesn't require PCP size tune, so should have
> some effects on your workload where one CPU does allocation while
> another does free.
>
That is great news. In our driver's memory scheme (and many others as
well) we allocate only order-0 pages (the only flow that does not do
that yet in upstream will do so very soon, we already have the patches
in our internal branch).
Allocation of order-0 pages is not only the common case, but is the only
type of allocation in our data-path. Let's optimize it!
> It did this by some disruptive changes:
> 1 on free path, it skipped doing merge(so could be bad for mixed
> workloads where both 4K and high order pages are needed);
I think there are so many advantages to not using high order
allocations, especially in production servers that are not rebooted for
long periods and become fragmented.
AFAIK, the community direction (at least in networking) is using order-0
pages in datapath, so optimizing their allocaiton is a very good idea.
Need of course to perf evaluate possible degradations, and see how
important these use cases are.
> 2 on allocation path, it avoided touching multiple cachelines.
>
Great!
> RFC v2 patchset:
> https://lkml.org/lkml/2018/3/20/171
>
> repo:
> https://github.com/aaronlu/linux zone_lock_rfc_v2
>
I will check them out first thing tomorrow!
p.s., I will be on vacation for a week starting Tuesday.
I hope I can make some progress before that :)
Thanks,
Tariq
>
>> Recall: I work with a modified driver that allocates a page (4K) per packet
>> (MTU=1500), in order to simulate the stress on page-allocator in 200Gbps
>> NICs.
>>
>> Performance is good as long as pages are available in the allocating cores's
>> PCP.
>> Issue is that pages are allocated in one core, then free'd in another,
>> making it's hard for the PCP to work efficiently, and both the allocator
>> core and the freeing core need to access the buddy allocator very often.
>>
>> I'd like to share with you some testing numbers:
>>
>> Test: ./super_netperf 128 -H 24.134.0.51 -l 1000
>>
>> 100% cpu on all cores, top func in perf:
>> 84.98% [kernel] [k] queued_spin_lock_slowpath
>>
>> system wide (all cores)
>> 1135941 kmem:mm_page_alloc
>>
>> 2606629 kmem:mm_page_free
>>
>> 0 kmem:mm_page_alloc_extfrag
>> 4784616 kmem:mm_page_alloc_zone_locked
>>
>> 1337 kmem:mm_page_free_batched
>>
>> 6488213 kmem:mm_page_pcpu_drain
>>
>> 8925503 net:napi_gro_receive_entry
>>
>>
>> Two types of cores:
>> A core mostly running napi (8 such cores):
>> 221875 kmem:mm_page_alloc
>>
>> 17100 kmem:mm_page_free
>>
>> 0 kmem:mm_page_alloc_extfrag
>> 766584 kmem:mm_page_alloc_zone_locked
>>
>> 16 kmem:mm_page_free_batched
>>
>> 35 kmem:mm_page_pcpu_drain
>>
>> 1340139 net:napi_gro_receive_entry
>>
>>
>> Other core, mostly running user application (40 such):
>> 2 kmem:mm_page_alloc
>>
>> 38922 kmem:mm_page_free
>>
>> 0 kmem:mm_page_alloc_extfrag
>> 1 kmem:mm_page_alloc_zone_locked
>>
>> 8 kmem:mm_page_free_batched
>>
>> 107289 kmem:mm_page_pcpu_drain
>>
>> 34 net:napi_gro_receive_entry
>>
>>
>> As you can see, sync overhead is enormous.
>>
>> PCP-wise, a key improvement in such scenarios would be reached if we could
>> (1) keep and handle the allocated page on same cpu, or (2) somehow get the
>> page back to the allocating core's PCP in a fast-path, without going through
>> the regular buddy allocator paths.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox