* Re: [PATCH bpf-next v3 4/8] bpf: add documentation for eBPF helpers (23-32)
From: Alexei Starovoitov @ 2018-04-18 22:11 UTC (permalink / raw)
To: Quentin Monnet; +Cc: daniel, ast, netdev, oss-drivers, linux-doc, linux-man
In-Reply-To: <20180417143438.7018-5-quentin.monnet@netronome.com>
On Tue, Apr 17, 2018 at 03:34:34PM +0100, Quentin Monnet wrote:
> Add documentation for eBPF helper functions to bpf.h user header file.
> This documentation can be parsed with the Python script provided in
> another commit of the patch series, in order to provide a RST document
> that can later be converted into a man page.
>
> The objective is to make the documentation easily understandable and
> accessible to all eBPF developers, including beginners.
>
> This patch contains descriptions for the following helper functions, all
> written by Daniel:
>
> - bpf_get_prandom_u32()
> - bpf_get_smp_processor_id()
> - bpf_get_cgroup_classid()
> - bpf_get_route_realm()
> - bpf_skb_load_bytes()
> - bpf_csum_diff()
> - bpf_skb_get_tunnel_opt()
> - bpf_skb_set_tunnel_opt()
> - bpf_skb_change_proto()
> - bpf_skb_change_type()
>
> v3:
> - bpf_get_prandom_u32(): Fix helper name :(. Add description, including
> a note on the internal random state.
> - bpf_get_smp_processor_id(): Add description, including a note on the
> processor id remaining stable during program run.
> - bpf_get_cgroup_classid(): State that CONFIG_CGROUP_NET_CLASSID is
> required to use the helper. Add a reference to related documentation.
> State that placing a task in net_cls controller disables cgroup-bpf.
> - bpf_get_route_realm(): State that CONFIG_CGROUP_NET_CLASSID is
> required to use this helper.
> - bpf_skb_load_bytes(): Fix comment on current use cases for the helper.
>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
^ permalink raw reply
* Re: [PATCH bpf-next v3 3/8] bpf: add documentation for eBPF helpers (12-22)
From: Alexei Starovoitov @ 2018-04-18 22:10 UTC (permalink / raw)
To: Quentin Monnet; +Cc: daniel, ast, netdev, oss-drivers, linux-doc, linux-man
In-Reply-To: <20180417143438.7018-4-quentin.monnet@netronome.com>
On Tue, Apr 17, 2018 at 03:34:33PM +0100, Quentin Monnet wrote:
> Add documentation for eBPF helper functions to bpf.h user header file.
> This documentation can be parsed with the Python script provided in
> another commit of the patch series, in order to provide a RST document
> that can later be converted into a man page.
>
> The objective is to make the documentation easily understandable and
> accessible to all eBPF developers, including beginners.
>
> This patch contains descriptions for the following helper functions, all
> written by Alexei:
>
> - bpf_get_current_pid_tgid()
> - bpf_get_current_uid_gid()
> - bpf_get_current_comm()
> - bpf_skb_vlan_push()
> - bpf_skb_vlan_pop()
> - bpf_skb_get_tunnel_key()
> - bpf_skb_set_tunnel_key()
> - bpf_redirect()
> - bpf_perf_event_output()
> - bpf_get_stackid()
> - bpf_get_current_task()
>
> v3:
> - bpf_skb_get_tunnel_key(): Change and improve description and example.
> - bpf_redirect(): Improve description of BPF_F_INGRESS flag.
> - bpf_perf_event_output(): Fix first sentence of description. Delete
> wrong statement on context being evaluated as a struct pt_reg. Remove
> the long yet incomplete example.
> - bpf_get_stackid(): Add a note about PERF_MAX_STACK_DEPTH being
> configurable.
>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
looks great.
Acked-by: Alexei Starovoitov <ast@kernel.org>
^ permalink raw reply
* Re: net namespaces kernel stack overflow
From: Kirill Tkhai @ 2018-04-18 22:08 UTC (permalink / raw)
To: Alexander Aring; +Cc: netdev, Jamal Hadi Salim
In-Reply-To: <CAOHTApjPwWDxHmGrfMwFvszWubfjJ7YfBgZL-DQ0mdv00UtQEQ@mail.gmail.com>
Hi, Alexander!
On 18.04.2018 22:45, Alexander Aring wrote:
> I currently can crash my net/master kernel by execute the following script:
>
> --- snip
>
> modprobe dummy
>
> #mkdir /var/run/netns
> #touch /var/run/netns/init_net
> #mount --bind /proc/1/ns/net /var/run/netns/init_net
>
> while true
> do
> mkdir /var/run/netns
> touch /var/run/netns/init_net
> mount --bind /proc/1/ns/net /var/run/netns/init_net
>
> ip netns add foo
> ip netns exec foo ip link add dummy0 type dummy
> ip netns delete foo
> done
Fast answer is the best, so I tried your test on my not-for-work computer.
There is old kernel without asynchronous pernet operations:
$uname -a
Linux localhost.localdomain 4.15.0-2-amd64 #1 SMP Debian 4.15.11-1 (2018-03-20) x86_64 GNU/Linux
After approximately 15 seconds of your test execution it died :(
(Hopefully, I executed it in "init 1" with all partitions RO as usual).
There is no serial console, so I can't say that the first stack is exactly
the same as you see. But it crashed. So, it seems, the problem have been
existing long ago.
Have you tried to reproduce it in older kernels or to bisect the problem commit?
Or maybe it doesn't reproduce on old kernels in your environment?
> --- snap
>
> After max ~1 minute the kernel will crash.
> Doing my hack of saving init_net outside the loop it will run fine...
> So the mount bind is necessary.
>
> The last message which I see is:
>
> BUG: stack guard page was hit at 00000000f0751759 (stack is
> 0000000069363195..0000000073ddc474)
> kernel stack overflow (double-fault): 0000 [#1] SMP PTI
> Modules linked in:
> CPU: 0 PID: 13917 Comm: ip Not tainted 4.16.0-11878-gef9d066f6808 #32
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> RIP: 0010:validate_chain.isra.23+0x44/0xc40
> RSP: 0018:ffffc900002cbff8 EFLAGS: 00010002
> RAX: 0000000000040000 RBX: 0e58b88e1d4d15da RCX: 0e58b88e1d4d15da
> RDX: 0000000000000000 RSI: ffff8802b25ee2a0 RDI: ffff8802b25edb00
> RBP: 0e58b88e1d4d15da R08: 0000000000000000 R09: 0000000000000004
> R10: ffffc900002cc050 R11: ffff8802b1054be8 R12: 0000000000000001
> R13: ffff8802b25ee268 R14: ffff8802b25edb00 R15: 0000000000000000
> FS: 0000000000000000(0000) GS:ffff8802bfc00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: ffffc900002cbfe8 CR3: 0000000002024000 CR4: 00000000000006f0
> Call Trace:
> ? get_max_files+0x10/0x10
> __lock_acquire+0x332/0x710
> lock_acquire+0x67/0xb0
> ? lockref_put_or_lock+0x9/0x30
> ? dput.part.7+0x17/0x2d0
> _raw_spin_lock+0x2b/0x60
> ? lockref_put_or_lock+0x9/0x30
> lockref_put_or_lock+0x9/0x30
> dput.part.7+0x1ec/0x2d0
> drop_mountpoint+0x10/0x40
> pin_kill+0x9b/0x3a0
> ? wait_woken+0x90/0x90
> ? mnt_pin_kill+0x2d/0x100
> mnt_pin_kill+0x2d/0x100
> cleanup_mnt+0x66/0x70
> pin_kill+0x9b/0x3a0
> ? wait_woken+0x90/0x90
> ? mnt_pin_kill+0x2d/0x100
> mnt_pin_kill+0x2d/0x100
> cleanup_mnt+0x66/0x70
> ...
>
> I guess maybe it has something to do with recently switching to
> migrate per-net ops to async.
>
> - Alex
Kirill
^ permalink raw reply
* Re: [PATCH bpf-next v3 00/11] introduction of bpf_xdp_adjust_tail
From: Daniel Borkmann @ 2018-04-18 22:05 UTC (permalink / raw)
To: Nikita V. Shirokov, Alexei Starovoitov; +Cc: netdev
In-Reply-To: <20180418044223.17685-1-tehnerd@tehnerd.com>
On 04/18/2018 06:42 AM, Nikita V. Shirokov wrote:
> In this patch series i'm add new bpf helper which allow to manupulate
> xdp's data_end pointer. right now only "shrinking" (reduce packet's size
> by moving pointer) is supported (and i see no use case for "growing").
> Main use case for such helper is to be able to generate controll (ICMP)
> messages from XDP context. such messages usually contains first N bytes
> from original packets as a payload, and this is exactly what this helper
> would allow us to do (see patch 3 for sample program, where we generate
> ICMP "packet too big" message). This helper could be usefull for load
> balancing applications where after additional encapsulation, resulting
> packet could be bigger then interface MTU.
> Aside from new helper this patch series contains minor changes in device
> drivers (for ones which requires), so they would recal packet's length
> not only when head pointer was adjusted, but if tail's one as well.
>
> v2->v3:
> * adding missed "signed off by" in v2
>
> v1->v2:
> * fixed kbuild warning
> * made offset eq 0 invalid for xdp_bpf_adjust_tail
> * splitted bpf_prog_test_run fix and selftests in sep commits
> * added SPDX licence where applicable
> * some reshuffling in patches order (tests now in the end)
Looks good! Applied to bpf-next, thanks Nikita!
^ permalink raw reply
* Re: [PATCH bpf-next 1/2] bpf: add helper for getting xfrm states
From: Eyal Birger @ 2018-04-18 22:02 UTC (permalink / raw)
To: Daniel Borkmann; +Cc: netdev
In-Reply-To: <3ebc429e-c8fb-5f1b-29bb-3285352c0831@iogearbox.net>
On Wed, 18 Apr 2018 22:59:27 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:
> On 04/17/2018 06:48 AM, Eyal Birger wrote:
> > This commit introduces a helper which allows fetching xfrm state
> > parameters by eBPF programs attached to TC.
> >
> > Prototype:
> > bpf_skb_get_xfrm_state(skb, index, xfrm_state, size, flags)
> >
> > skb: pointer to skb
> > index: the index in the skb xfrm_state secpath array
> > xfrm_state: pointer to 'struct bpf_xfrm_state'
> > size: size of 'struct bpf_xfrm_state'
> > flags: reserved for future extensions
> >
> > The helper returns 0 on success. Non zero if no xfrm state at the
> > index is found - or non exists at all.
> >
> > struct bpf_xfrm_state currently includes the SPI, peer IPv4/IPv6
> > address and the reqid; it can be further extended by adding
> > elements to its end - indicating the populated fields by the 'size'
> > argument - keeping backwards compatibility.
> >
> > Typical usage:
> >
> > struct bpf_xfrm_state x = {};
> > bpf_skb_get_xfrm_state(skb, 0, &x, sizeof(x), 0);
> > ...
> >
> > Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
>
> Patch looks good to me, two comments below:
Thanks! I incorporated your suggestions in v2.
Eyal.
>
> > ---
> > include/uapi/linux/bpf.h | 25 ++++++++++++++++++++++++-
> > net/core/filter.c | 46
> > ++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 70
> > insertions(+), 1 deletion(-)
> >
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index c5ec897..132e172 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -755,6 +755,15 @@ union bpf_attr {
> > * @addr: pointer to struct sockaddr to bind socket to
> > * @addr_len: length of sockaddr structure
> > * Return: 0 on success or negative error code
> > + *
> > + * int bpf_skb_get_xfrm_state(skb, index, xfrm_state, size, flags)
> > + * retrieve XFRM state
> > + * @skb: pointer to skb
> > + * @index: index of the xfrm state in the secpath
> > + * @key: pointer to 'struct bpf_xfrm_state'
> > + * @size: size of 'struct bpf_xfrm_state'
> > + * @flags: room for future extensions
> > + * Return: 0 on success or negative error
> > */
> > #define __BPF_FUNC_MAPPER(FN) \
> > FN(unspec), \
> > @@ -821,7 +830,8 @@ union bpf_attr {
> > FN(msg_apply_bytes), \
> > FN(msg_cork_bytes), \
> > FN(msg_pull_data), \
> > - FN(bind),
> > + FN(bind), \
> > + FN(skb_get_xfrm_state),
> >
> > /* integer value in 'imm' field of BPF_CALL instruction selects
> > which helper
> > * function eBPF program intends to call
> > @@ -927,6 +937,19 @@ struct bpf_tunnel_key {
> > __u32 tunnel_label;
> > };
> >
> > +/* user accessible mirror of in-kernel xfrm_state.
> > + * new fields can only be added to the end of this structure
> > + */
> > +struct bpf_xfrm_state {
> > + __u32 reqid;
> > + __u32 spi;
> > + __u16 family;
> > + union {
> > + __u32 remote_ipv4;
> > + __u32 remote_ipv6[4];
> > + };
> > +};
> > +
> > /* Generic BPF return codes which all BPF program types may
> > support.
> > * The values are binary compatible with their TC_ACT_*
> > counter-part to
> > * provide backwards compatibility with existing SCHED_CLS and
> > SCHED_ACT diff --git a/net/core/filter.c b/net/core/filter.c
> > index d31aff9..c06600a 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -57,6 +57,7 @@
> > #include <net/sock_reuseport.h>
> > #include <net/busy_poll.h>
> > #include <net/tcp.h>
> > +#include <net/xfrm.h>
> > #include <linux/bpf_trace.h>
> >
> > /**
> > @@ -3703,6 +3704,49 @@ static const struct bpf_func_proto
> > bpf_bind_proto = { .arg3_type = ARG_CONST_SIZE,
> > };
> >
> > +BPF_CALL_5(bpf_skb_get_xfrm_state, struct sk_buff *, skb, u32,
> > index,
> > + struct bpf_xfrm_state *, to, u32, size, u64, flags)
> > +{
> > +#ifdef CONFIG_XFRM
> > + const struct sec_path *sp = skb_sec_path(skb);
> > + const struct xfrm_state *x;
> > +
> > + if (!sp || index >= sp->len)
>
> This should be something like: if (!sp || unlikely(index >= sp->len
> || flags)) Such that we unconditionally bail out on any flags
> currently, since this is reserved for future use and anything
> non-zero would be invalid and rejected until we start extending it.
>
> > + goto err_clear;
> > +
> > + x = sp->xvec[index];
> > +
> > + if (unlikely(size != sizeof(struct bpf_xfrm_state)))
> > + goto err_clear;
> > +
> > + to->reqid = x->props.reqid;
> > + to->spi = be32_to_cpu(x->id.spi);
> > + to->family = x->props.family;
> > + if (to->family == AF_INET6) {
> > + memcpy(to->remote_ipv6, x->props.saddr.a6,
> > + sizeof(to->remote_ipv6));
> > + } else {
> > + to->remote_ipv4 = be32_to_cpu(x->props.saddr.a4);
> > + }
> > +
> > + return 0;
> > +err_clear:
> > +#endif
> > + memset(to, 0, size);
> > + return -EINVAL;
> > +}
> > +
> > +static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = {
> > + .func = bpf_skb_get_xfrm_state,
> > + .gpl_only = false,
> > + .ret_type = RET_INTEGER,
> > + .arg1_type = ARG_PTR_TO_CTX,
> > + .arg2_type = ARG_ANYTHING,
> > + .arg3_type = ARG_PTR_TO_UNINIT_MEM,
> > + .arg4_type = ARG_CONST_SIZE,
> > + .arg5_type = ARG_ANYTHING,
> > +};
> > +
> > static const struct bpf_func_proto *
> > bpf_base_func_proto(enum bpf_func_id func_id)
> > {
> > @@ -3844,6 +3888,8 @@ tc_cls_act_func_proto(enum bpf_func_id
> > func_id, const struct bpf_prog *prog) return
> > &bpf_get_socket_cookie_proto; case BPF_FUNC_get_socket_uid:
> > return &bpf_get_socket_uid_proto;
> > + case BPF_FUNC_skb_get_xfrm_state:
> > + return &bpf_skb_get_xfrm_state_proto;
>
> Potentially, on kernels with !CONFIG_XFRM, you might want to let the
> program bail out at program verification phase already? Thus it would
> become ...
>
> #ifdef CONFIG_XFRM
> case BPF_FUNC_skb_get_xfrm_state:
> return &bpf_skb_get_xfrm_state_proto;
> #endif
>
> ... where you'd also wrap the helper + state_proto in CONFIG_XFRM.
>
> > default:
> > return bpf_base_func_proto(func_id);
> > }
> >
>
^ permalink raw reply
* [PATCH bpf-next,v2 2/2] samples/bpf: extend test_tunnel_bpf.sh with xfrm state test
From: Eyal Birger @ 2018-04-18 21:58 UTC (permalink / raw)
To: netdev; +Cc: shmulik, ast, daniel, fw, steffen.klassert, Eyal Birger
In-Reply-To: <1524088703-6324-1-git-send-email-eyal.birger@gmail.com>
Add a test for fetching xfrm state parameters from a tc program running
on ingress.
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
---
samples/bpf/tcbpf2_kern.c | 15 +++++++
samples/bpf/test_tunnel_bpf.sh | 71 +++++++++++++++++++++++++++++++
tools/include/uapi/linux/bpf.h | 25 ++++++++++-
tools/testing/selftests/bpf/bpf_helpers.h | 4 +-
4 files changed, 113 insertions(+), 2 deletions(-)
diff --git a/samples/bpf/tcbpf2_kern.c b/samples/bpf/tcbpf2_kern.c
index 9a8db7bd..3303803 100644
--- a/samples/bpf/tcbpf2_kern.c
+++ b/samples/bpf/tcbpf2_kern.c
@@ -593,4 +593,19 @@ int _ip6ip6_get_tunnel(struct __sk_buff *skb)
return TC_ACT_OK;
}
+SEC("xfrm_get_state")
+int _xfrm_get_state(struct __sk_buff *skb)
+{
+ struct bpf_xfrm_state x;
+ char fmt[] = "reqid %d spi 0x%x remote ip 0x%x\n";
+ int ret;
+
+ ret = bpf_skb_get_xfrm_state(skb, 0, &x, sizeof(x), 0);
+ if (ret < 0)
+ return TC_ACT_OK;
+
+ bpf_trace_printk(fmt, sizeof(fmt), x.reqid, x.spi, x.remote_ipv4);
+ return TC_ACT_OK;
+}
+
char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/test_tunnel_bpf.sh b/samples/bpf/test_tunnel_bpf.sh
index c265863..9c534dc 100755
--- a/samples/bpf/test_tunnel_bpf.sh
+++ b/samples/bpf/test_tunnel_bpf.sh
@@ -155,6 +155,57 @@ function add_ipip_tunnel {
ip addr add dev $DEV 10.1.1.200/24
}
+function setup_xfrm_tunnel {
+ auth=0x$(printf '1%.0s' {1..40})
+ enc=0x$(printf '2%.0s' {1..32})
+ spi_in_to_out=0x1
+ spi_out_to_in=0x2
+ # in namespace
+ # in -> out
+ ip netns exec at_ns0 \
+ ip xfrm state add src 172.16.1.100 dst 172.16.1.200 proto esp \
+ spi $spi_in_to_out reqid 1 mode tunnel \
+ auth-trunc 'hmac(sha1)' $auth 96 enc 'cbc(aes)' $enc
+ ip netns exec at_ns0 \
+ ip xfrm policy add src 10.1.1.100/32 dst 10.1.1.200/32 dir out \
+ tmpl src 172.16.1.100 dst 172.16.1.200 proto esp reqid 1 \
+ mode tunnel
+ # out -> in
+ ip netns exec at_ns0 \
+ ip xfrm state add src 172.16.1.200 dst 172.16.1.100 proto esp \
+ spi $spi_out_to_in reqid 2 mode tunnel \
+ auth-trunc 'hmac(sha1)' $auth 96 enc 'cbc(aes)' $enc
+ ip netns exec at_ns0 \
+ ip xfrm policy add src 10.1.1.200/32 dst 10.1.1.100/32 dir in \
+ tmpl src 172.16.1.200 dst 172.16.1.100 proto esp reqid 2 \
+ mode tunnel
+ # address & route
+ ip netns exec at_ns0 \
+ ip addr add dev veth0 10.1.1.100/32
+ ip netns exec at_ns0 \
+ ip route add 10.1.1.200 dev veth0 via 172.16.1.200 \
+ src 10.1.1.100
+
+ # out of namespace
+ # in -> out
+ ip xfrm state add src 172.16.1.100 dst 172.16.1.200 proto esp \
+ spi $spi_in_to_out reqid 1 mode tunnel \
+ auth-trunc 'hmac(sha1)' $auth 96 enc 'cbc(aes)' $enc
+ ip xfrm policy add src 10.1.1.100/32 dst 10.1.1.200/32 dir in \
+ tmpl src 172.16.1.100 dst 172.16.1.200 proto esp reqid 1 \
+ mode tunnel
+ # out -> in
+ ip xfrm state add src 172.16.1.200 dst 172.16.1.100 proto esp \
+ spi $spi_out_to_in reqid 2 mode tunnel \
+ auth-trunc 'hmac(sha1)' $auth 96 enc 'cbc(aes)' $enc
+ ip xfrm policy add src 10.1.1.200/32 dst 10.1.1.100/32 dir out \
+ tmpl src 172.16.1.200 dst 172.16.1.100 proto esp reqid 2 \
+ mode tunnel
+ # address & route
+ ip addr add dev veth1 10.1.1.200/32
+ ip route add 10.1.1.100 dev veth1 via 172.16.1.100 src 10.1.1.200
+}
+
function attach_bpf {
DEV=$1
SET_TUNNEL=$2
@@ -278,6 +329,22 @@ function test_ipip {
cleanup
}
+function test_xfrm_tunnel {
+ config_device
+ tcpdump -nei veth1 ip &
+ output=$(mktemp)
+ cat /sys/kernel/debug/tracing/trace_pipe | tee $output &
+ setup_xfrm_tunnel
+ tc qdisc add dev veth1 clsact
+ tc filter add dev veth1 proto ip ingress bpf da obj tcbpf2_kern.o \
+ sec xfrm_get_state
+ ip netns exec at_ns0 ping -c 1 10.1.1.200
+ grep "reqid 1" $output
+ grep "spi 0x1" $output
+ grep "remote ip 0xac100164" $output
+ cleanup
+}
+
function cleanup {
set +ex
pkill iperf
@@ -291,6 +358,8 @@ function cleanup {
ip link del geneve11
ip link del erspan11
ip link del ip6erspan11
+ ip x s flush
+ ip x p flush
pkill tcpdump
pkill cat
set -ex
@@ -316,4 +385,6 @@ echo "Testing GENEVE tunnel..."
test_geneve
echo "Testing IPIP tunnel..."
test_ipip
+echo "Testing IPSec tunnel..."
+test_xfrm_tunnel
echo "*** PASS ***"
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 56bf493..233754a 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -762,6 +762,15 @@ union bpf_attr {
* @xdp_md: pointer to xdp_md
* @delta: A negative integer to be added to xdp_md.data_end
* Return: 0 on success or negative on error
+ *
+ * int bpf_skb_get_xfrm_state(skb, index, xfrm_state, size, flags)
+ * retrieve XFRM state
+ * @skb: pointer to skb
+ * @index: index of the xfrm state in the secpath
+ * @key: pointer to 'struct bpf_xfrm_state'
+ * @size: size of 'struct bpf_xfrm_state'
+ * @flags: room for future extensions
+ * Return: 0 on success or negative error
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@@ -829,7 +838,8 @@ union bpf_attr {
FN(msg_cork_bytes), \
FN(msg_pull_data), \
FN(bind), \
- FN(xdp_adjust_tail),
+ FN(xdp_adjust_tail), \
+ FN(skb_get_xfrm_state),
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
@@ -934,6 +944,19 @@ struct bpf_tunnel_key {
__u32 tunnel_label;
};
+/* user accessible mirror of in-kernel xfrm_state.
+ * new fields can only be added to the end of this structure
+ */
+struct bpf_xfrm_state {
+ __u32 reqid;
+ __u32 spi;
+ __u16 family;
+ union {
+ __u32 remote_ipv4;
+ __u32 remote_ipv6[4];
+ };
+};
+
/* Generic BPF return codes which all BPF program types may support.
* The values are binary compatible with their TC_ACT_* counter-part to
* provide backwards compatibility with existing SCHED_CLS and SCHED_ACT
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 9271576..69d7b91 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -98,7 +98,9 @@ static int (*bpf_bind)(void *ctx, void *addr, int addr_len) =
(void *) BPF_FUNC_bind;
static int (*bpf_xdp_adjust_tail)(void *ctx, int offset) =
(void *) BPF_FUNC_xdp_adjust_tail;
-
+static int (*bpf_skb_get_xfrm_state)(void *ctx, int index, void *state,
+ int size, int flags) =
+ (void *) BPF_FUNC_skb_get_xfrm_state;
/* llvm builtin functions that eBPF C program may use to
* emit BPF_LD_ABS and BPF_LD_IND instructions
--
2.7.4
^ permalink raw reply related
* [PATCH bpf-next,v2 1/2] bpf: add helper for getting xfrm states
From: Eyal Birger @ 2018-04-18 21:58 UTC (permalink / raw)
To: netdev; +Cc: shmulik, ast, daniel, fw, steffen.klassert, Eyal Birger
In-Reply-To: <1524088703-6324-1-git-send-email-eyal.birger@gmail.com>
This commit introduces a helper which allows fetching xfrm state
parameters by eBPF programs attached to TC.
Prototype:
bpf_skb_get_xfrm_state(skb, index, xfrm_state, size, flags)
skb: pointer to skb
index: the index in the skb xfrm_state secpath array
xfrm_state: pointer to 'struct bpf_xfrm_state'
size: size of 'struct bpf_xfrm_state'
flags: reserved for future extensions
The helper returns 0 on success. Non zero if no xfrm state at the index
is found - or non exists at all.
struct bpf_xfrm_state currently includes the SPI, peer IPv4/IPv6
address and the reqid; it can be further extended by adding elements to
its end - indicating the populated fields by the 'size' argument -
keeping backwards compatibility.
Typical usage:
struct bpf_xfrm_state x = {};
bpf_skb_get_xfrm_state(skb, 0, &x, sizeof(x), 0);
...
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
---
include/uapi/linux/bpf.h | 25 ++++++++++++++++++++++++-
net/core/filter.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 72 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 9a2d1a0..82b407a 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -762,6 +762,15 @@ union bpf_attr {
* @xdp_md: pointer to xdp_md
* @delta: A negative integer to be added to xdp_md.data_end
* Return: 0 on success or negative on error
+ *
+ * int bpf_skb_get_xfrm_state(skb, index, xfrm_state, size, flags)
+ * retrieve XFRM state
+ * @skb: pointer to skb
+ * @index: index of the xfrm state in the secpath
+ * @key: pointer to 'struct bpf_xfrm_state'
+ * @size: size of 'struct bpf_xfrm_state'
+ * @flags: room for future extensions
+ * Return: 0 on success or negative error
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@@ -829,7 +838,8 @@ union bpf_attr {
FN(msg_cork_bytes), \
FN(msg_pull_data), \
FN(bind), \
- FN(xdp_adjust_tail),
+ FN(xdp_adjust_tail), \
+ FN(skb_get_xfrm_state),
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
@@ -935,6 +945,19 @@ struct bpf_tunnel_key {
__u32 tunnel_label;
};
+/* user accessible mirror of in-kernel xfrm_state.
+ * new fields can only be added to the end of this structure
+ */
+struct bpf_xfrm_state {
+ __u32 reqid;
+ __u32 spi;
+ __u16 family;
+ union {
+ __u32 remote_ipv4;
+ __u32 remote_ipv6[4];
+ };
+};
+
/* Generic BPF return codes which all BPF program types may support.
* The values are binary compatible with their TC_ACT_* counter-part to
* provide backwards compatibility with existing SCHED_CLS and SCHED_ACT
diff --git a/net/core/filter.c b/net/core/filter.c
index 2931859..489d360 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -57,6 +57,7 @@
#include <net/sock_reuseport.h>
#include <net/busy_poll.h>
#include <net/tcp.h>
+#include <net/xfrm.h>
#include <linux/bpf_trace.h>
/**
@@ -3749,6 +3750,49 @@ static const struct bpf_func_proto bpf_bind_proto = {
.arg3_type = ARG_CONST_SIZE,
};
+#ifdef CONFIG_XFRM
+BPF_CALL_5(bpf_skb_get_xfrm_state, struct sk_buff *, skb, u32, index,
+ struct bpf_xfrm_state *, to, u32, size, u64, flags)
+{
+ const struct sec_path *sp = skb_sec_path(skb);
+ const struct xfrm_state *x;
+
+ if (!sp || unlikely(index >= sp->len || flags))
+ goto err_clear;
+
+ x = sp->xvec[index];
+
+ if (unlikely(size != sizeof(struct bpf_xfrm_state)))
+ goto err_clear;
+
+ to->reqid = x->props.reqid;
+ to->spi = be32_to_cpu(x->id.spi);
+ to->family = x->props.family;
+ if (to->family == AF_INET6) {
+ memcpy(to->remote_ipv6, x->props.saddr.a6,
+ sizeof(to->remote_ipv6));
+ } else {
+ to->remote_ipv4 = be32_to_cpu(x->props.saddr.a4);
+ }
+
+ return 0;
+err_clear:
+ memset(to, 0, size);
+ return -EINVAL;
+}
+
+static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = {
+ .func = bpf_skb_get_xfrm_state,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_CTX,
+ .arg2_type = ARG_ANYTHING,
+ .arg3_type = ARG_PTR_TO_UNINIT_MEM,
+ .arg4_type = ARG_CONST_SIZE,
+ .arg5_type = ARG_ANYTHING,
+};
+#endif
+
static const struct bpf_func_proto *
bpf_base_func_proto(enum bpf_func_id func_id)
{
@@ -3890,6 +3934,10 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
return &bpf_get_socket_cookie_proto;
case BPF_FUNC_get_socket_uid:
return &bpf_get_socket_uid_proto;
+#ifdef CONFIG_XFRM
+ case BPF_FUNC_skb_get_xfrm_state:
+ return &bpf_skb_get_xfrm_state_proto;
+#endif
default:
return bpf_base_func_proto(func_id);
}
--
2.7.4
^ permalink raw reply related
* [PATCH bpf-next,v2 0/2] bpf: add helper for getting xfrm states
From: Eyal Birger @ 2018-04-18 21:58 UTC (permalink / raw)
To: netdev; +Cc: shmulik, ast, daniel, fw, steffen.klassert, Eyal Birger
This patchset adds support for fetching XFRM state information from
an eBPF program called from TC.
The first patch introduces a helper for fetching an XFRM state from the
skb's secpath. The XFRM state is modeled using a new virtual struct which
contains the SPI, peer address, and reqid values of the state; This struct
can be extended in the future to provide additional state information.
The second patch adds a test example in test_tunnel_bpf.sh. The sample
validates the correct extraction of state information by the eBPF program.
---
v2:
- Fixed two comments by Daniel Borkmann:
- disallow reserved flags in helper call
- avoid compiling in helper code when CONFIG_XFRM is off
Eyal Birger (2):
bpf: add helper for getting xfrm states
samples/bpf: extend test_tunnel_bpf.sh with xfrm state test
include/uapi/linux/bpf.h | 25 ++++++++++-
net/core/filter.c | 48 +++++++++++++++++++++
samples/bpf/tcbpf2_kern.c | 15 +++++++
samples/bpf/test_tunnel_bpf.sh | 71 +++++++++++++++++++++++++++++++
tools/include/uapi/linux/bpf.h | 25 ++++++++++-
tools/testing/selftests/bpf/bpf_helpers.h | 4 +-
6 files changed, 185 insertions(+), 3 deletions(-)
--
2.7.4
^ permalink raw reply
* Re: [PATCH net-next 2/2] netns: isolate seqnums to use per-netns locks
From: Christian Brauner @ 2018-04-18 21:52 UTC (permalink / raw)
To: Eric W. Biederman
Cc: davem, netdev, linux-kernel, avagin, ktkhai, serge, gregkh
In-Reply-To: <874lk8wj1j.fsf@xmission.com>
On Wed, Apr 18, 2018 at 11:55:52AM -0500, Eric W. Biederman wrote:
> Christian Brauner <christian.brauner@ubuntu.com> writes:
>
> > Now that it's possible to have a different set of uevents in different
> > network namespaces, per-network namespace uevent sequence numbers are
> > introduced. This increases performance as locking is now restricted to the
> > network namespace affected by the uevent rather than locking
> > everything.
>
> Numbers please. I personally expect that the netlink mc_list issues
> will swamp any benefit you get from this.
I wouldn't see how this would be the case. The gist of this is:
Everytime you send a uevent into a network namespace *not* owned by
init_user_ns you currently *have* to take mutex_lock(uevent_sock_list)
effectively blocking the host from processing uevents even though
- the uevent you're receiving might be totally different from the
uevent that you're sending
- the uevent socket of the non-init_user_ns owned network namespace
isn't even recorded in the list.
The other argument is that we now have properly isolated network
namespaces wrt to uevents such that each netns can have its own set of
uevents. This can either happen by a sufficiently privileged userspace
process sending it uevents that are only dedicated to that specific
netns. Or - and this *has been true for a long time* - because network
devices are *properly namespaced*. Meaning a uevent for that network
device is *tied to a network namespace*. For both cases the uevent
sequence numbering will be absolutely misleading. For example, whenever
you create e.g. a new veth device in a new network namespace it
shouldn't be accounted against the initial network namespace but *only*
against the network namespace that has that device added to it.
Thanks!
Christian
^ permalink raw reply
* Re: [PATCH 1/3] ethtool: Support ETHTOOL_GSTATS2 command.
From: Ben Greear @ 2018-04-18 21:51 UTC (permalink / raw)
To: Johannes Berg, netdev; +Cc: linux-wireless, ath10k
In-Reply-To: <1524086817.3024.9.camel@sipsolutions.net>
On 04/18/2018 02:26 PM, Johannes Berg wrote:
> On Tue, 2018-04-17 at 18:49 -0700, greearb@candelatech.com wrote:
>>
>> + * @get_ethtool_stats2: Return extended statistics about the device.
>> + * This is only useful if the device maintains statistics not
>> + * included in &struct rtnl_link_stats64.
>> + * Takes a flags argument: 0 means all (same as get_ethtool_stats),
>> + * 0x1 (ETHTOOL_GS2_SKIP_FW) means skip firmware stats.
>> + * Other flags are reserved for now.
>
> It'd be pretty hard to know which flags are firmware stats?
Yes, it is, but ethtool stats are difficult to understand in a generic
manner anyway, so someone using them is already likely aware of low-level
details of the driver(s) they are using.
In my case, I have lots of virtual stations (or APs), and I want stats
for them as well as for the 'radio', so I would probe the first vdev with
flags of 'skip-none' to get all stats, including radio (firmware) stats.
And then the rest I would just probe the non-firmware stats.
To be honest, I was slightly amused that anyone expressed interest in
this patch originally, but maybe other people have similar use case
and/or drivers with slow-to-acquire stats.
> Anyway, there's no way I'm going to take this patch, so you need to
> float it on netdev first (best CC us here) and get it applied there
> before we can do anything on the wifi side.
I posted the patches to netdev, ath10k and linux-wireless. If I had only
posted them individually to different lists I figure I'd be hearing about how
the netdev patch is useless because it has no driver support, etc.
Thanks,
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
^ permalink raw reply
* [PATCH net 3/3] net: sched: ife: check on metadata length
From: Alexander Aring @ 2018-04-18 21:35 UTC (permalink / raw)
To: yotam.gi
Cc: jhs, davem, xiyou.wangcong, jiri, yuvalm, netdev, kernel,
Alexander Aring
In-Reply-To: <20180418213534.6215-1-aring@mojatatu.com>
This patch checks if sk buffer is available to dererence ife header. If
not then NULL will returned to signal an malformed ife packet. This
avoids to crashing the kernel from outside.
Signed-off-by: Alexander Aring <aring@mojatatu.com>
---
net/ife/ife.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/net/ife/ife.c b/net/ife/ife.c
index 8632d2685efb..7c100034fbee 100644
--- a/net/ife/ife.c
+++ b/net/ife/ife.c
@@ -70,6 +70,9 @@ void *ife_decode(struct sk_buff *skb, u16 *metalen)
u16 ifehdrln;
ifehdr = (struct ifeheadr *) (skb->data + skb->dev->hard_header_len);
+ if (skb->len < skb->dev->hard_header_len + IFE_METAHDRLEN)
+ return NULL;
+
ifehdrln = ntohs(ifehdr->metalen);
total_pull = skb->dev->hard_header_len + ifehdrln;
--
2.11.0
^ permalink raw reply related
* [PATCH net 2/3] net: sched: ife: handle malformed tlv length
From: Alexander Aring @ 2018-04-18 21:35 UTC (permalink / raw)
To: yotam.gi
Cc: jhs, davem, xiyou.wangcong, jiri, yuvalm, netdev, kernel,
Alexander Aring
In-Reply-To: <20180418213534.6215-1-aring@mojatatu.com>
There is currently no handling to check on a invalid tlv length. This
patch adds such handling to avoid killing the kernel with a malformed
ife packet.
Signed-off-by: Alexander Aring <aring@mojatatu.com>
---
include/net/ife.h | 3 ++-
net/ife/ife.c | 35 +++++++++++++++++++++++++++++++++--
net/sched/act_ife.c | 7 ++++++-
3 files changed, 41 insertions(+), 4 deletions(-)
diff --git a/include/net/ife.h b/include/net/ife.h
index 44b9c00f7223..e117617e3c34 100644
--- a/include/net/ife.h
+++ b/include/net/ife.h
@@ -12,7 +12,8 @@
void *ife_encode(struct sk_buff *skb, u16 metalen);
void *ife_decode(struct sk_buff *skb, u16 *metalen);
-void *ife_tlv_meta_decode(void *skbdata, u16 *attrtype, u16 *dlen, u16 *totlen);
+void *ife_tlv_meta_decode(void *skbdata, const void *ifehdr_end, u16 *attrtype,
+ u16 *dlen, u16 *totlen);
int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen,
const void *dval);
diff --git a/net/ife/ife.c b/net/ife/ife.c
index 7d1ec76e7f43..8632d2685efb 100644
--- a/net/ife/ife.c
+++ b/net/ife/ife.c
@@ -92,12 +92,43 @@ struct meta_tlvhdr {
__be16 len;
};
+static inline bool __ife_tlv_meta_valid(const unsigned char *skbdata,
+ const unsigned char *ifehdr_end)
+{
+ const struct meta_tlvhdr *tlv;
+ u16 tlvlen;
+
+ if (unlikely(skbdata + sizeof(*tlv) > ifehdr_end))
+ return false;
+
+ tlv = (struct meta_tlvhdr *)skbdata;
+ tlvlen = ntohs(tlv->len);
+
+ /* tlv length field is inc header, check on minimum */
+ if (tlvlen < NLA_HDRLEN)
+ return false;
+
+ /* overflow by NLA_ALIGN check */
+ if (NLA_ALIGN(tlvlen) < tlvlen)
+ return false;
+
+ if (unlikely(skbdata + NLA_ALIGN(tlvlen) > ifehdr_end))
+ return false;
+
+ return true;
+}
+
/* Caller takes care of presenting data in network order
*/
-void *ife_tlv_meta_decode(void *skbdata, u16 *attrtype, u16 *dlen, u16 *totlen)
+void *ife_tlv_meta_decode(void *skbdata, const void *ifehdr_end, u16 *attrtype,
+ u16 *dlen, u16 *totlen)
{
- struct meta_tlvhdr *tlv = (struct meta_tlvhdr *) skbdata;
+ struct meta_tlvhdr *tlv;
+
+ if (!__ife_tlv_meta_valid(skbdata, ifehdr_end))
+ return NULL;
+ tlv = (struct meta_tlvhdr *)skbdata;
*dlen = ntohs(tlv->len) - NLA_HDRLEN;
*attrtype = ntohs(tlv->type);
diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index 49b8ab551fbe..8527cfdc446d 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -682,7 +682,12 @@ static int tcf_ife_decode(struct sk_buff *skb, const struct tc_action *a,
u16 mtype;
u16 dlen;
- curr_data = ife_tlv_meta_decode(tlv_data, &mtype, &dlen, NULL);
+ curr_data = ife_tlv_meta_decode(tlv_data, ifehdr_end, &mtype,
+ &dlen, NULL);
+ if (!curr_data) {
+ qstats_drop_inc(this_cpu_ptr(ife->common.cpu_qstats));
+ return TC_ACT_SHOT;
+ }
if (find_decode_metaid(skb, ife, mtype, dlen, curr_data)) {
/* abuse overlimits to count when we receive metadata
--
2.11.0
^ permalink raw reply related
* [PATCH net 1/3] net: sched: ife: signal not finding metaid
From: Alexander Aring @ 2018-04-18 21:35 UTC (permalink / raw)
To: yotam.gi
Cc: jhs, davem, xiyou.wangcong, jiri, yuvalm, netdev, kernel,
Alexander Aring
In-Reply-To: <20180418213534.6215-1-aring@mojatatu.com>
We need to record stats for received metadata that we dont know how
to process. Have find_decode_metaid() return -ENOENT to capture this.
Signed-off-by: Alexander Aring <aring@mojatatu.com>
---
net/sched/act_ife.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index a5994cf0512b..49b8ab551fbe 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -652,7 +652,7 @@ static int find_decode_metaid(struct sk_buff *skb, struct tcf_ife_info *ife,
}
}
- return 0;
+ return -ENOENT;
}
static int tcf_ife_decode(struct sk_buff *skb, const struct tc_action *a,
--
2.11.0
^ permalink raw reply related
* [PATCH net 0/3] net: sched: ife: malformed ife packet fixes
From: Alexander Aring @ 2018-04-18 21:35 UTC (permalink / raw)
To: yotam.gi
Cc: jhs, davem, xiyou.wangcong, jiri, yuvalm, netdev, kernel,
Alexander Aring
As promised at netdev 2.2 tc workshop I am working on adding scapy support for
tdc testing. It is still work in progress. I will submit the patches to tdc
later (they are not in good shape yet). The good news is I have been able to
find bugs which normal packet testing would not be able to find.
With fuzzy testing I was able to craft certain malformed packets that IFE
action was not able to deal with. This patch set fixes those bugs.
Alexander Aring (3):
net: sched: ife: signal not finding metaid
net: sched: ife: handle malformed tlv length
net: sched: ife: check on metadata length
include/net/ife.h | 3 ++-
net/ife/ife.c | 38 ++++++++++++++++++++++++++++++++++++--
net/sched/act_ife.c | 9 +++++++--
3 files changed, 45 insertions(+), 5 deletions(-)
--
2.11.0
^ permalink raw reply
* Re: [PATCH v3 00/10] New network driver for Amiga X-Surf 100 (m68k)
From: Michael Schmitz @ 2018-04-18 21:34 UTC (permalink / raw)
To: Andrew Lunn
Cc: netdev, Finn Thain, Geert Uytterhoeven, Florian Fainelli,
Linux/m68k, Michael Karcher
In-Reply-To: <20180418121957.GB31643@lunn.ch>
Hi Andrew,
sorry, my mistake. I didn't realize how fast DaveM's tree diverges
from Linus' (and Geert's) once the merge window opens.
Cheers,
Michael
On Thu, Apr 19, 2018 at 12:19 AM, Andrew Lunn <andrew@lunn.ch> wrote:
> On Wed, Apr 18, 2018 at 05:10:45PM +1200, Michael Schmitz wrote:
>> All,
>>
>> just noticed belatedly that the Makefile hunk of patch 9 does no
>> longer apply cleanly in 4.17-rc1, sorry. My series was based on 4.16.
>> I'll resend that one, OK?
>
> Hi Michael
>
> You should be based on DaveM net-next tree:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git
>
> Please also have "net-next" in the patch subject. See
> Documentation/networking/netdev-FAQ.txt
>
> Andrew
^ permalink raw reply
* Re: [PATCH 1/3] ethtool: Support ETHTOOL_GSTATS2 command.
From: Johannes Berg @ 2018-04-18 21:26 UTC (permalink / raw)
To: greearb, netdev; +Cc: linux-wireless, ath10k
In-Reply-To: <1524016176-3881-1-git-send-email-greearb@candelatech.com>
On Tue, 2018-04-17 at 18:49 -0700, greearb@candelatech.com wrote:
>
> + * @get_ethtool_stats2: Return extended statistics about the device.
> + * This is only useful if the device maintains statistics not
> + * included in &struct rtnl_link_stats64.
> + * Takes a flags argument: 0 means all (same as get_ethtool_stats),
> + * 0x1 (ETHTOOL_GS2_SKIP_FW) means skip firmware stats.
> + * Other flags are reserved for now.
It'd be pretty hard to know which flags are firmware stats?
Anyway, there's no way I'm going to take this patch, so you need to
float it on netdev first (best CC us here) and get it applied there
before we can do anything on the wifi side.
johannes
^ permalink raw reply
* Re: [PATCH v3 00/10] New network driver for Amiga X-Surf 100 (m68k)
From: Michael Schmitz @ 2018-04-18 21:26 UTC (permalink / raw)
To: Finn Thain
Cc: netdev, Andrew Lunn, Geert Uytterhoeven, Florian Fainelli,
Linux/m68k, Michael Karcher
In-Reply-To: <alpine.LNX.2.21.1804181531180.8@nippy.intranet>
Hi Finn,
thanks for the feedback!
On Wed, Apr 18, 2018 at 5:45 PM, Finn Thain <fthain@telegraphics.com.au> wrote:
>> > 1/9 net: phy: new Asix Electronics PHY driver
>> > 2/9 net: ax88796: Fix MAC address reading
>> > 3/9 net: ax88796: Attach MII bus only when open
>> > 4/9 net: ax88796: Do not free IRQ in ax_remove() (already freed in ax_close()).
>> > 5/9 net: ax88796: Add block_input/output hooks to ax_plat_data
>
> I found that git am rejects this one, though 'patch' applies it with fuzz.
Can't remember seeing anything there when rebasing the series, odd.
>
>> > 6/9 net: ax88796: add interrupt status callback to platform data
>> > 7/9 net: ax88796: set IRQF_SHARED flag when IRQ resource is marked as shareable
>> > 8/9 net: ax88796: release platform device drvdata on probe error and module remove
>> > 9/9 net: New ax88796 platform driver for Amiga X-Surf 100 Zorro board (m68k)
>
> git am rejected this one and also complained about trailing whitespace.
>
> I'd rebase on v4.17-rc1 and also run checkpatch over the results.
Done that now, thanks. Andrew recommended basing my patches on
net-next so I'll do that before resubmitting.
Cheers,
Michael
> --
>
>> >
>> > drivers/net/ethernet/8390/Kconfig | 17 ++-
>> > drivers/net/ethernet/8390/Makefile | 1 +
>> > drivers/net/ethernet/8390/ax88796.c | 228 ++++++++++++--------
>> > drivers/net/ethernet/8390/xsurf100.c | 381 ++++++++++++++++++++++++++++++++++
>> > drivers/net/phy/Kconfig | 6 +
>> > drivers/net/phy/Makefile | 1 +
>> > drivers/net/phy/asix.c | 65 ++++++
>> > drivers/net/phy/phy_device.c | 3 +-
>> > include/linux/phy.h | 1 +
>> > include/net/ax88796.h | 14 ++
>> > 10 files changed, 621 insertions(+), 96 deletions(-)
>> >
>> > Cheers,
>> >
>> > Michael
^ permalink raw reply
* Re: WARNING: suspicious RCU usage in fib6_info_alloc
From: Eric Dumazet @ 2018-04-18 21:16 UTC (permalink / raw)
To: David Ahern, syzbot, christian.brauner, davem, fw, jbenc, ktkhai,
linux-kernel, lucien.xin, mschiffer, netdev, syzkaller-bugs,
vyasevich
In-Reply-To: <3096e146-ee1d-d3f8-e900-8aa70cb54d00@gmail.com>
On 04/18/2018 02:04 PM, David Ahern wrote:
> On 4/18/18 3:02 PM, syzbot wrote:
>> stack backtrace:
>> CPU: 1 PID: 25 Comm: kworker/1:1 Not tainted 4.16.0+ #5
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
>> Google 01/01/2011
>> Workqueue: ipv6_addrconf addrconf_dad_work
>> Call Trace:
>> __dump_stack lib/dump_stack.c:77 [inline]
>> dump_stack+0x1b9/0x294 lib/dump_stack.c:113
>> lockdep_rcu_suspicious+0x14a/0x153 kernel/locking/lockdep.c:4592
>> ___might_sleep+0x2e7/0x320 kernel/sched/core.c:6153
>> __might_sleep+0x95/0x190 kernel/sched/core.c:6141
>> slab_pre_alloc_hook mm/slab.h:421 [inline]
>> slab_alloc mm/slab.c:3378 [inline]
>> kmem_cache_alloc_trace+0x2bc/0x780 mm/slab.c:3618
>> kmalloc include/linux/slab.h:512 [inline]
>> kzalloc include/linux/slab.h:701 [inline]
>> fib6_info_alloc+0xbb/0x280 net/ipv6/ip6_fib.c:152
>> ip6_route_info_create+0x7bf/0x3240 net/ipv6/route.c:2891
>> ip6_route_add+0x23/0xb0 net/ipv6/route.c:3030
>> addrconf_prefix_route.isra.47+0x4f7/0x6f0 net/ipv6/addrconf.c:2347
>> __ipv6_ifa_notify+0x591/0xa00 net/ipv6/addrconf.c:5620
>
> Eric is faster than the robot ;-)
Not really ;)
>
> Patch is queued up; will send later today.
>
David, I simply looked at the report before allowing the bot to send it ;)
I determined the bug was added recently, and warned you about it.
Thanks !
^ permalink raw reply
* Re: [PATCH 1/3] ethtool: Support ETHTOOL_GSTATS2 command.
From: Florian Fainelli @ 2018-04-18 21:07 UTC (permalink / raw)
To: greearb, netdev; +Cc: linux-wireless, ath10k
In-Reply-To: <1524016176-3881-1-git-send-email-greearb@candelatech.com>
On 04/17/2018 06:49 PM, greearb@candelatech.com wrote:
> From: Ben Greear <greearb@candelatech.com>
>
> This is similar to ETHTOOL_GSTATS, but it allows you to specify
> flags. These flags can be used by the driver to decrease the
> amount of stats refreshed. In particular, this helps with ath10k
> since getting the firmware stats can be slow.
You can configure the statistics refresh rate through the ethtool
coalescing parameter stats_block_coalesce_usecs, not sure if this is
exactly what you had in mind, but if it works, then you might as well
want to use it.
>
> Signed-off-by: Ben Greear <greearb@candelatech.com>
> ---
> include/linux/ethtool.h | 12 ++++++++++++
> include/uapi/linux/ethtool.h | 10 ++++++++++
> net/core/ethtool.c | 40 +++++++++++++++++++++++++++++++++++-----
> 3 files changed, 57 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
> index ebe4181..a4aa11f 100644
> --- a/include/linux/ethtool.h
> +++ b/include/linux/ethtool.h
> @@ -243,6 +243,15 @@ bool ethtool_convert_link_mode_to_legacy_u32(u32 *legacy_u32,
> * @get_ethtool_stats: Return extended statistics about the device.
> * This is only useful if the device maintains statistics not
> * included in &struct rtnl_link_stats64.
> + * @get_ethtool_stats2: Return extended statistics about the device.
> + * This is only useful if the device maintains statistics not
> + * included in &struct rtnl_link_stats64.
> + * Takes a flags argument: 0 means all (same as get_ethtool_stats),
> + * 0x1 (ETHTOOL_GS2_SKIP_FW) means skip firmware stats.
> + * Other flags are reserved for now.
> + * Same number of stats will be returned, but some of them might
> + * not be as accurate/refreshed. This is to allow not querying
> + * firmware or other expensive-to-read stats, for instance.
> * @begin: Function to be called before any other operation. Returns a
> * negative error code or zero.
> * @complete: Function to be called after any other operation except
> @@ -355,6 +364,9 @@ struct ethtool_ops {
> int (*set_phys_id)(struct net_device *, enum ethtool_phys_id_state);
> void (*get_ethtool_stats)(struct net_device *,
> struct ethtool_stats *, u64 *);
> + void (*get_ethtool_stats2)(struct net_device *dev,
> + struct ethtool_stats *gstats, u64 *data,
> + u32 flags);
> int (*begin)(struct net_device *);
> void (*complete)(struct net_device *);
> u32 (*get_priv_flags)(struct net_device *);
> diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
> index 4ca65b5..1c74f3e 100644
> --- a/include/uapi/linux/ethtool.h
> +++ b/include/uapi/linux/ethtool.h
> @@ -1396,11 +1396,21 @@ enum ethtool_fec_config_bits {
> #define ETHTOOL_PHY_STUNABLE 0x0000004f /* Set PHY tunable configuration */
> #define ETHTOOL_GFECPARAM 0x00000050 /* Get FEC settings */
> #define ETHTOOL_SFECPARAM 0x00000051 /* Set FEC settings */
> +#define ETHTOOL_GSTATS2 0x00000052 /* get NIC-specific statistics
> + * with ability to specify flags.
> + * See ETHTOOL_GS2* below.
> + */
>
> /* compatibility with older code */
> #define SPARC_ETH_GSET ETHTOOL_GSET
> #define SPARC_ETH_SSET ETHTOOL_SSET
>
> +/* GSTATS2 flags */
> +#define ETHTOOL_GS2_SKIP_NONE (0) /* default is to update all stats */
> +#define ETHTOOL_GS2_SKIP_FW (1<<0) /* Skip reading stats that probe firmware,
> + * and thus are slow/expensive.
> + */
> +
> /* Link mode bit indices */
> enum ethtool_link_mode_bit_indices {
> ETHTOOL_LINK_MODE_10baseT_Half_BIT = 0,
> diff --git a/net/core/ethtool.c b/net/core/ethtool.c
> index 03416e6..6ec3413 100644
> --- a/net/core/ethtool.c
> +++ b/net/core/ethtool.c
> @@ -1952,16 +1952,14 @@ static int ethtool_phys_id(struct net_device *dev, void __user *useraddr)
> return rc;
> }
>
> -static int ethtool_get_stats(struct net_device *dev, void __user *useraddr)
> +static int _ethtool_get_stats(struct net_device *dev, void __user *useraddr,
> + u32 flags)
> {
> struct ethtool_stats stats;
> const struct ethtool_ops *ops = dev->ethtool_ops;
> u64 *data;
> int ret, n_stats;
>
> - if (!ops->get_ethtool_stats || !ops->get_sset_count)
> - return -EOPNOTSUPP;
> -
> n_stats = ops->get_sset_count(dev, ETH_SS_STATS);
> if (n_stats < 0)
> return n_stats;
> @@ -1976,7 +1974,10 @@ static int ethtool_get_stats(struct net_device *dev, void __user *useraddr)
> if (n_stats && !data)
> return -ENOMEM;
>
> - ops->get_ethtool_stats(dev, &stats, data);
> + if (flags != ETHTOOL_GS2_SKIP_NONE)
> + ops->get_ethtool_stats2(dev, &stats, data, flags);
> + else
> + ops->get_ethtool_stats(dev, &stats, data);
>
> ret = -EFAULT;
> if (copy_to_user(useraddr, &stats, sizeof(stats)))
> @@ -1991,6 +1992,31 @@ static int ethtool_get_stats(struct net_device *dev, void __user *useraddr)
> return ret;
> }
>
> +static int ethtool_get_stats(struct net_device *dev, void __user *useraddr)
> +{
> + const struct ethtool_ops *ops = dev->ethtool_ops;
> +
> + if (!ops->get_ethtool_stats || !ops->get_sset_count)
> + return -EOPNOTSUPP;
> + return _ethtool_get_stats(dev, useraddr, ETHTOOL_GS2_SKIP_NONE);
> +}
> +
> +static int ethtool_get_stats2(struct net_device *dev, void __user *useraddr)
> +{
> + struct ethtool_stats stats;
> + const struct ethtool_ops *ops = dev->ethtool_ops;
> + u32 flags = 0;
> +
> + if (!ops->get_ethtool_stats2 || !ops->get_sset_count)
> + return -EOPNOTSUPP;
> +
> + if (copy_from_user(&stats, useraddr, sizeof(stats)))
> + return -EFAULT;
> +
> + flags = stats.n_stats;
> + return _ethtool_get_stats(dev, useraddr, flags);
> +}
> +
> static int ethtool_get_phy_stats(struct net_device *dev, void __user *useraddr)
> {
> struct ethtool_stats stats;
> @@ -2632,6 +2658,7 @@ int dev_ethtool(struct net *net, struct ifreq *ifr)
> case ETHTOOL_GSSET_INFO:
> case ETHTOOL_GSTRINGS:
> case ETHTOOL_GSTATS:
> + case ETHTOOL_GSTATS2:
> case ETHTOOL_GPHYSTATS:
> case ETHTOOL_GTSO:
> case ETHTOOL_GPERMADDR:
> @@ -2742,6 +2769,9 @@ int dev_ethtool(struct net *net, struct ifreq *ifr)
> case ETHTOOL_GSTATS:
> rc = ethtool_get_stats(dev, useraddr);
> break;
> + case ETHTOOL_GSTATS2:
> + rc = ethtool_get_stats2(dev, useraddr);
> + break;
> case ETHTOOL_GPERMADDR:
> rc = ethtool_get_perm_addr(dev, useraddr);
> break;
>
--
Florian
^ permalink raw reply
* Re: WARNING: suspicious RCU usage in fib6_info_alloc
From: David Ahern @ 2018-04-18 21:04 UTC (permalink / raw)
To: syzbot, christian.brauner, davem, fw, jbenc, ktkhai, linux-kernel,
lucien.xin, mschiffer, netdev, syzkaller-bugs, vyasevich
In-Reply-To: <0000000000004366bc056a25c4f3@google.com>
On 4/18/18 3:02 PM, syzbot wrote:
> stack backtrace:
> CPU: 1 PID: 25 Comm: kworker/1:1 Not tainted 4.16.0+ #5
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Workqueue: ipv6_addrconf addrconf_dad_work
> Call Trace:
> __dump_stack lib/dump_stack.c:77 [inline]
> dump_stack+0x1b9/0x294 lib/dump_stack.c:113
> lockdep_rcu_suspicious+0x14a/0x153 kernel/locking/lockdep.c:4592
> ___might_sleep+0x2e7/0x320 kernel/sched/core.c:6153
> __might_sleep+0x95/0x190 kernel/sched/core.c:6141
> slab_pre_alloc_hook mm/slab.h:421 [inline]
> slab_alloc mm/slab.c:3378 [inline]
> kmem_cache_alloc_trace+0x2bc/0x780 mm/slab.c:3618
> kmalloc include/linux/slab.h:512 [inline]
> kzalloc include/linux/slab.h:701 [inline]
> fib6_info_alloc+0xbb/0x280 net/ipv6/ip6_fib.c:152
> ip6_route_info_create+0x7bf/0x3240 net/ipv6/route.c:2891
> ip6_route_add+0x23/0xb0 net/ipv6/route.c:3030
> addrconf_prefix_route.isra.47+0x4f7/0x6f0 net/ipv6/addrconf.c:2347
> __ipv6_ifa_notify+0x591/0xa00 net/ipv6/addrconf.c:5620
Eric is faster than the robot ;-)
Patch is queued up; will send later today.
^ permalink raw reply
* WARNING: suspicious RCU usage in fib6_info_alloc
From: syzbot @ 2018-04-18 21:02 UTC (permalink / raw)
To: christian.brauner, davem, dsahern, fw, jbenc, ktkhai,
linux-kernel, lucien.xin, mschiffer, netdev, syzkaller-bugs,
vyasevich
Hello,
syzbot hit the following crash on net-next commit
0565de29cbd65b378147d36f9642f93a046240dc (Wed Apr 18 03:41:18 2018 +0000)
Merge branch 'ipv6-Separate-data-structures-for-FIB-and-data-path'
syzbot dashboard link:
https://syzkaller.appspot.com/bug?extid=2add39b05179b31f912f
So far this crash happened 2 times on net-next.
syzkaller reproducer:
https://syzkaller.appspot.com/x/repro.syz?id=4660613020123136
Raw console output:
https://syzkaller.appspot.com/x/log.txt?id=5742127124316160
Kernel config:
https://syzkaller.appspot.com/x/.config?id=-5947642240294114534
compiler: gcc (GCC) 8.0.1 20180413 (experimental)
IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+2add39b05179b31f912f@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for
details.
If you forward the report, please keep this part and the footer.
IPVS: ftp: loaded support on port[0] = 21
IPVS: ftp: loaded support on port[0] = 21
IPVS: ftp: loaded support on port[0] = 21
=============================
WARNING: suspicious RCU usage
4.16.0+ #5 Not tainted
-----------------------------
kernel/sched/core.c:6153 Illegal context switch in RCU-bh read-side
critical section!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
4 locks held by kworker/1:1/25:
#0: 00000000df858653 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
__write_once_size include/linux/compiler.h:215 [inline]
#0: 00000000df858653 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
#0: 00000000df858653 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
atomic64_set include/asm-generic/atomic-instrumented.h:40 [inline]
#0: 00000000df858653 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
atomic_long_set include/asm-generic/atomic-long.h:57 [inline]
#0: 00000000df858653 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
set_work_data kernel/workqueue.c:617 [inline]
#0: 00000000df858653 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline]
#0: 00000000df858653 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
process_one_work+0xaef/0x1b50 kernel/workqueue.c:2116
#1: 000000007d88bc46 ((work_completion)(&(&ifa->dad_work)->work)){+.+.},
at: process_one_work+0xb46/0x1b50 kernel/workqueue.c:2120
#2: 00000000943eaf98 (rtnl_mutex){+.+.}, at: rtnl_lock+0x17/0x20
net/core/rtnetlink.c:74
#3: 00000000a39c89a4 (rcu_read_lock_bh){....}, at:
ipv6_ifa_notify+0x0/0x210 net/ipv6/addrconf.c:5621
stack backtrace:
CPU: 1 PID: 25 Comm: kworker/1:1 Not tainted 4.16.0+ #5
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Workqueue: ipv6_addrconf addrconf_dad_work
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1b9/0x294 lib/dump_stack.c:113
lockdep_rcu_suspicious+0x14a/0x153 kernel/locking/lockdep.c:4592
___might_sleep+0x2e7/0x320 kernel/sched/core.c:6153
__might_sleep+0x95/0x190 kernel/sched/core.c:6141
slab_pre_alloc_hook mm/slab.h:421 [inline]
slab_alloc mm/slab.c:3378 [inline]
kmem_cache_alloc_trace+0x2bc/0x780 mm/slab.c:3618
kmalloc include/linux/slab.h:512 [inline]
kzalloc include/linux/slab.h:701 [inline]
fib6_info_alloc+0xbb/0x280 net/ipv6/ip6_fib.c:152
ip6_route_info_create+0x7bf/0x3240 net/ipv6/route.c:2891
ip6_route_add+0x23/0xb0 net/ipv6/route.c:3030
addrconf_prefix_route.isra.47+0x4f7/0x6f0 net/ipv6/addrconf.c:2347
__ipv6_ifa_notify+0x591/0xa00 net/ipv6/addrconf.c:5620
ipv6_ifa_notify+0xff/0x210 net/ipv6/addrconf.c:5650
addrconf_dad_completed+0xeb/0xbf0 net/ipv6/addrconf.c:4083
addrconf_dad_begin net/ipv6/addrconf.c:3889 [inline]
addrconf_dad_work+0x873/0x1300 net/ipv6/addrconf.c:3991
process_one_work+0xc1e/0x1b50 kernel/workqueue.c:2145
worker_thread+0x1cc/0x1440 kernel/workqueue.c:2279
kthread+0x345/0x410 kernel/kthread.c:238
ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:411
BUG: sleeping function called from invalid context at mm/slab.h:421
in_atomic(): 1, irqs_disabled(): 0, pid: 25, name: kworker/1:1
4 locks held by kworker/1:1/25:
#0: 00000000df858653 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
__write_once_size include/linux/compiler.h:215 [inline]
#0: 00000000df858653 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
#0: 00000000df858653 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
atomic64_set include/asm-generic/atomic-instrumented.h:40 [inline]
#0: 00000000df858653 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
atomic_long_set include/asm-generic/atomic-long.h:57 [inline]
#0: 00000000df858653 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
set_work_data kernel/workqueue.c:617 [inline]
#0: 00000000df858653 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline]
#0: 00000000df858653 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
process_one_work+0xaef/0x1b50 kernel/workqueue.c:2116
#1: 000000007d88bc46 ((work_completion)(&(&ifa->dad_work)->work)){+.+.},
at: process_one_work+0xb46/0x1b50 kernel/workqueue.c:2120
#2: 00000000943eaf98 (rtnl_mutex){+.+.}, at: rtnl_lock+0x17/0x20
net/core/rtnetlink.c:74
#3: 00000000a39c89a4 (rcu_read_lock_bh){....}, at:
ipv6_ifa_notify+0x0/0x210 net/ipv6/addrconf.c:5621
CPU: 1 PID: 25 Comm: kworker/1:1 Not tainted 4.16.0+ #5
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Workqueue: ipv6_addrconf addrconf_dad_work
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1b9/0x294 lib/dump_stack.c:113
___might_sleep.cold.88+0x11f/0x13a kernel/sched/core.c:6188
__might_sleep+0x95/0x190 kernel/sched/core.c:6141
slab_pre_alloc_hook mm/slab.h:421 [inline]
slab_alloc mm/slab.c:3378 [inline]
kmem_cache_alloc_trace+0x2bc/0x780 mm/slab.c:3618
kmalloc include/linux/slab.h:512 [inline]
kzalloc include/linux/slab.h:701 [inline]
fib6_info_alloc+0xbb/0x280 net/ipv6/ip6_fib.c:152
ip6_route_info_create+0x7bf/0x3240 net/ipv6/route.c:2891
ip6_route_add+0x23/0xb0 net/ipv6/route.c:3030
addrconf_prefix_route.isra.47+0x4f7/0x6f0 net/ipv6/addrconf.c:2347
__ipv6_ifa_notify+0x591/0xa00 net/ipv6/addrconf.c:5620
ipv6_ifa_notify+0xff/0x210 net/ipv6/addrconf.c:5650
addrconf_dad_completed+0xeb/0xbf0 net/ipv6/addrconf.c:4083
addrconf_dad_begin net/ipv6/addrconf.c:3889 [inline]
addrconf_dad_work+0x873/0x1300 net/ipv6/addrconf.c:3991
process_one_work+0xc1e/0x1b50 kernel/workqueue.c:2145
worker_thread+0x1cc/0x1440 kernel/workqueue.c:2279
kthread+0x345/0x410 kernel/kthread.c:238
ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:411
---
This bug is generated by a dumb bot. It may contain errors.
See https://goo.gl/tpsmEJ for details.
Direct all questions to syzkaller@googlegroups.com.
syzbot will keep track of this bug report.
If you forgot to add the Reported-by tag, once the fix for this bug is
merged
into any tree, please reply to this email with:
#syz fix: exact-commit-title
If you want to test a patch for this bug, please reply with:
#syz test: git://repo/address.git branch
and provide the patch inline or as an attachment.
To mark this as a duplicate of another syzbot report, please reply with:
#syz dup: exact-subject-of-another-report
If it's a one-off invalid bug report, please reply with:
#syz invalid
Note: if the crash happens again, it will cause creation of a new bug
report.
Note: all commands must start from beginning of the line in the email body.
^ permalink raw reply
* Re: [PATCH bpf-next 1/2] bpf: add helper for getting xfrm states
From: Daniel Borkmann @ 2018-04-18 20:59 UTC (permalink / raw)
To: Eyal Birger, netdev; +Cc: shmulik, ast, fw, steffen.klassert
In-Reply-To: <1523940529-3791-2-git-send-email-eyal.birger@gmail.com>
On 04/17/2018 06:48 AM, Eyal Birger wrote:
> This commit introduces a helper which allows fetching xfrm state
> parameters by eBPF programs attached to TC.
>
> Prototype:
> bpf_skb_get_xfrm_state(skb, index, xfrm_state, size, flags)
>
> skb: pointer to skb
> index: the index in the skb xfrm_state secpath array
> xfrm_state: pointer to 'struct bpf_xfrm_state'
> size: size of 'struct bpf_xfrm_state'
> flags: reserved for future extensions
>
> The helper returns 0 on success. Non zero if no xfrm state at the index
> is found - or non exists at all.
>
> struct bpf_xfrm_state currently includes the SPI, peer IPv4/IPv6
> address and the reqid; it can be further extended by adding elements to
> its end - indicating the populated fields by the 'size' argument -
> keeping backwards compatibility.
>
> Typical usage:
>
> struct bpf_xfrm_state x = {};
> bpf_skb_get_xfrm_state(skb, 0, &x, sizeof(x), 0);
> ...
>
> Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Patch looks good to me, two comments below:
> ---
> include/uapi/linux/bpf.h | 25 ++++++++++++++++++++++++-
> net/core/filter.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 70 insertions(+), 1 deletion(-)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index c5ec897..132e172 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -755,6 +755,15 @@ union bpf_attr {
> * @addr: pointer to struct sockaddr to bind socket to
> * @addr_len: length of sockaddr structure
> * Return: 0 on success or negative error code
> + *
> + * int bpf_skb_get_xfrm_state(skb, index, xfrm_state, size, flags)
> + * retrieve XFRM state
> + * @skb: pointer to skb
> + * @index: index of the xfrm state in the secpath
> + * @key: pointer to 'struct bpf_xfrm_state'
> + * @size: size of 'struct bpf_xfrm_state'
> + * @flags: room for future extensions
> + * Return: 0 on success or negative error
> */
> #define __BPF_FUNC_MAPPER(FN) \
> FN(unspec), \
> @@ -821,7 +830,8 @@ union bpf_attr {
> FN(msg_apply_bytes), \
> FN(msg_cork_bytes), \
> FN(msg_pull_data), \
> - FN(bind),
> + FN(bind), \
> + FN(skb_get_xfrm_state),
>
> /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> * function eBPF program intends to call
> @@ -927,6 +937,19 @@ struct bpf_tunnel_key {
> __u32 tunnel_label;
> };
>
> +/* user accessible mirror of in-kernel xfrm_state.
> + * new fields can only be added to the end of this structure
> + */
> +struct bpf_xfrm_state {
> + __u32 reqid;
> + __u32 spi;
> + __u16 family;
> + union {
> + __u32 remote_ipv4;
> + __u32 remote_ipv6[4];
> + };
> +};
> +
> /* Generic BPF return codes which all BPF program types may support.
> * The values are binary compatible with their TC_ACT_* counter-part to
> * provide backwards compatibility with existing SCHED_CLS and SCHED_ACT
> diff --git a/net/core/filter.c b/net/core/filter.c
> index d31aff9..c06600a 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -57,6 +57,7 @@
> #include <net/sock_reuseport.h>
> #include <net/busy_poll.h>
> #include <net/tcp.h>
> +#include <net/xfrm.h>
> #include <linux/bpf_trace.h>
>
> /**
> @@ -3703,6 +3704,49 @@ static const struct bpf_func_proto bpf_bind_proto = {
> .arg3_type = ARG_CONST_SIZE,
> };
>
> +BPF_CALL_5(bpf_skb_get_xfrm_state, struct sk_buff *, skb, u32, index,
> + struct bpf_xfrm_state *, to, u32, size, u64, flags)
> +{
> +#ifdef CONFIG_XFRM
> + const struct sec_path *sp = skb_sec_path(skb);
> + const struct xfrm_state *x;
> +
> + if (!sp || index >= sp->len)
This should be something like: if (!sp || unlikely(index >= sp->len || flags))
Such that we unconditionally bail out on any flags currently, since this is
reserved for future use and anything non-zero would be invalid and rejected
until we start extending it.
> + goto err_clear;
> +
> + x = sp->xvec[index];
> +
> + if (unlikely(size != sizeof(struct bpf_xfrm_state)))
> + goto err_clear;
> +
> + to->reqid = x->props.reqid;
> + to->spi = be32_to_cpu(x->id.spi);
> + to->family = x->props.family;
> + if (to->family == AF_INET6) {
> + memcpy(to->remote_ipv6, x->props.saddr.a6,
> + sizeof(to->remote_ipv6));
> + } else {
> + to->remote_ipv4 = be32_to_cpu(x->props.saddr.a4);
> + }
> +
> + return 0;
> +err_clear:
> +#endif
> + memset(to, 0, size);
> + return -EINVAL;
> +}
> +
> +static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = {
> + .func = bpf_skb_get_xfrm_state,
> + .gpl_only = false,
> + .ret_type = RET_INTEGER,
> + .arg1_type = ARG_PTR_TO_CTX,
> + .arg2_type = ARG_ANYTHING,
> + .arg3_type = ARG_PTR_TO_UNINIT_MEM,
> + .arg4_type = ARG_CONST_SIZE,
> + .arg5_type = ARG_ANYTHING,
> +};
> +
> static const struct bpf_func_proto *
> bpf_base_func_proto(enum bpf_func_id func_id)
> {
> @@ -3844,6 +3888,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> return &bpf_get_socket_cookie_proto;
> case BPF_FUNC_get_socket_uid:
> return &bpf_get_socket_uid_proto;
> + case BPF_FUNC_skb_get_xfrm_state:
> + return &bpf_skb_get_xfrm_state_proto;
Potentially, on kernels with !CONFIG_XFRM, you might want to let the program
bail out at program verification phase already? Thus it would become ...
#ifdef CONFIG_XFRM
case BPF_FUNC_skb_get_xfrm_state:
return &bpf_skb_get_xfrm_state_proto;
#endif
... where you'd also wrap the helper + state_proto in CONFIG_XFRM.
> default:
> return bpf_base_func_proto(func_id);
> }
>
^ permalink raw reply
* Re: [PATCH] net: don't use kvzalloc for DMA memory
From: Eric Dumazet @ 2018-04-18 20:38 UTC (permalink / raw)
To: Michael S. Tsirkin, Eric Dumazet
Cc: Mikulas Patocka, David S. Miller, Eric Dumazet, Joby Poriyath,
Ben Hutchings, netdev, linux-kernel
In-Reply-To: <20180418204229-mutt-send-email-mst@kernel.org>
On 04/18/2018 10:55 AM, Michael S. Tsirkin wrote:
> Imagine you want to pass some data to card.
> Natural thing is to just put it in a variable and start DMA.
> However DMA API disallows stack access nowdays,
> so it's natural to put this within struct device.
>
> See e.g.
>
> commit a725ee3e44e39dab1ec82cc745899a785d2a555e
> Author: Andy Lutomirski <luto@kernel.org>
> Date: Mon Jul 18 15:34:49 2016 -0700
>
> virtio-net: Remove more stack DMA
>
Andy just moved the problem to another one, since at that time we already
had vmalloc() fallback for at least 2 years.
Note that my original patch had :
p = kzalloc(alloc_size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
if (!p)
p = vzalloc(alloc_size);
So really, normal (less than PAGE_SIZE) allocations would have almost-zero-chance to end up to vmalloc(one_page)
^ permalink raw reply
* Re: [PATCH bpf-next v4 00/10] BTF: BPF Type Format
From: Daniel Borkmann @ 2018-04-18 20:36 UTC (permalink / raw)
To: Martin KaFai Lau, netdev
Cc: Alexei Starovoitov, kernel-team, Arnaldo Carvalho de Melo
In-Reply-To: <20180417204243.4028831-1-kafai@fb.com>
On 04/17/2018 10:42 PM, Martin KaFai Lau wrote:
> This patch introduces BPF Type Format (BTF).
>
> BTF (BPF Type Format) is the meta data format which describes
> the data types of BPF program/map. Hence, it basically focus
> on the C programming language which the modern BPF is primary
> using. The first use case is to provide a generic pretty print
> capability for a BPF map.
Ok, as discussed with Martin this will have one more minor respin.
Thanks,
Daniel
^ permalink raw reply
* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
From: Jiri Pirko @ 2018-04-18 20:32 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Samudrala, Sridhar, stephen, davem, netdev, virtualization,
virtio-dev, jesse.brandeburg, alexander.h.duyck, kubakici,
jasowang, loseweigh
In-Reply-To: <20180418222309-mutt-send-email-mst@kernel.org>
Wed, Apr 18, 2018 at 09:46:04PM CEST, mst@redhat.com wrote:
>On Wed, Apr 18, 2018 at 09:13:15PM +0200, Jiri Pirko wrote:
>> Wed, Apr 18, 2018 at 08:43:15PM CEST, sridhar.samudrala@intel.com wrote:
>> >On 4/18/2018 2:25 AM, Jiri Pirko wrote:
>> >> Wed, Apr 11, 2018 at 09:13:52PM CEST, sridhar.samudrala@intel.com wrote:
>> >> > On 4/11/2018 8:51 AM, Jiri Pirko wrote:
>> >> > > Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudrala@intel.com wrote:
>> >> > > > This provides a generic interface for paravirtual drivers to listen
>> >> > > > for netdev register/unregister/link change events from pci ethernet
>> >> > > > devices with the same MAC and takeover their datapath. The notifier and
>> >> > > > event handling code is based on the existing netvsc implementation.
>> >> > > >
>> >> > > > It exposes 2 sets of interfaces to the paravirtual drivers.
>> >> > > > 1. existing netvsc driver that uses 2 netdev model. In this model, no
>> >> > > > master netdev is created. The paravirtual driver registers each bypass
>> >> > > > instance along with a set of ops to manage the slave events.
>> >> > > > bypass_master_register()
>> >> > > > bypass_master_unregister()
>> >> > > > 2. new virtio_net based solution that uses 3 netdev model. In this model,
>> >> > > > the bypass module provides interfaces to create/destroy additional master
>> >> > > > netdev and all the slave events are managed internally.
>> >> > > > bypass_master_create()
>> >> > > > bypass_master_destroy()
>> >> > > >
>> >> > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> >> > > > ---
>> >> > > > include/linux/netdevice.h | 14 +
>> >> > > > include/net/bypass.h | 96 ++++++
>> >> > > > net/Kconfig | 18 +
>> >> > > > net/core/Makefile | 1 +
>> >> > > > net/core/bypass.c | 844 ++++++++++++++++++++++++++++++++++++++++++++++
>> >> > > > 5 files changed, 973 insertions(+)
>> >> > > > create mode 100644 include/net/bypass.h
>> >> > > > create mode 100644 net/core/bypass.c
>> >> > > >
>> >> > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> >> > > > index cf44503ea81a..587293728f70 100644
>> >> > > > --- a/include/linux/netdevice.h
>> >> > > > +++ b/include/linux/netdevice.h
>> >> > > > @@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
>> >> > > > IFF_PHONY_HEADROOM = 1<<24,
>> >> > > > IFF_MACSEC = 1<<25,
>> >> > > > IFF_NO_RX_HANDLER = 1<<26,
>> >> > > > + IFF_BYPASS = 1 << 27,
>> >> > > > + IFF_BYPASS_SLAVE = 1 << 28,
>> >> > > I wonder, why you don't follow the existing coding style... Also, please
>> >> > > add these to into the comment above.
>> >> > To avoid checkpatch warnings. If it is OK to ignore these warnings, I can switch back
>> >> > to the existing coding style to be consistent.
>> >> Please do.
>> >>
>> >>
>> >> > >
>> >> > > > };
>> >> > > >
>> >> > > > #define IFF_802_1Q_VLAN IFF_802_1Q_VLAN
>> >> > > > @@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
>> >> > > > #define IFF_RXFH_CONFIGURED IFF_RXFH_CONFIGURED
>> >> > > > #define IFF_MACSEC IFF_MACSEC
>> >> > > > #define IFF_NO_RX_HANDLER IFF_NO_RX_HANDLER
>> >> > > > +#define IFF_BYPASS IFF_BYPASS
>> >> > > > +#define IFF_BYPASS_SLAVE IFF_BYPASS_SLAVE
>> >> > > >
>> >> > > > /**
>> >> > > > * struct net_device - The DEVICE structure.
>> >> > > > @@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
>> >> > > > return dev->priv_flags & IFF_RXFH_CONFIGURED;
>> >> > > > }
>> >> > > >
>> >> > > > +static inline bool netif_is_bypass_master(const struct net_device *dev)
>> >> > > > +{
>> >> > > > + return dev->priv_flags & IFF_BYPASS;
>> >> > > > +}
>> >> > > > +
>> >> > > > +static inline bool netif_is_bypass_slave(const struct net_device *dev)
>> >> > > > +{
>> >> > > > + return dev->priv_flags & IFF_BYPASS_SLAVE;
>> >> > > > +}
>> >> > > > +
>> >> > > > /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
>> >> > > > static inline void netif_keep_dst(struct net_device *dev)
>> >> > > > {
>> >> > > > diff --git a/include/net/bypass.h b/include/net/bypass.h
>> >> > > > new file mode 100644
>> >> > > > index 000000000000..86b02cb894cf
>> >> > > > --- /dev/null
>> >> > > > +++ b/include/net/bypass.h
>> >> > > > @@ -0,0 +1,96 @@
>> >> > > > +// SPDX-License-Identifier: GPL-2.0
>> >> > > > +/* Copyright (c) 2018, Intel Corporation. */
>> >> > > > +
>> >> > > > +#ifndef _NET_BYPASS_H
>> >> > > > +#define _NET_BYPASS_H
>> >> > > > +
>> >> > > > +#include <linux/netdevice.h>
>> >> > > > +
>> >> > > > +struct bypass_ops {
>> >> > > > + int (*slave_pre_register)(struct net_device *slave_netdev,
>> >> > > > + struct net_device *bypass_netdev);
>> >> > > > + int (*slave_join)(struct net_device *slave_netdev,
>> >> > > > + struct net_device *bypass_netdev);
>> >> > > > + int (*slave_pre_unregister)(struct net_device *slave_netdev,
>> >> > > > + struct net_device *bypass_netdev);
>> >> > > > + int (*slave_release)(struct net_device *slave_netdev,
>> >> > > > + struct net_device *bypass_netdev);
>> >> > > > + int (*slave_link_change)(struct net_device *slave_netdev,
>> >> > > > + struct net_device *bypass_netdev);
>> >> > > > + rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
>> >> > > > +};
>> >> > > > +
>> >> > > > +struct bypass_master {
>> >> > > > + struct list_head list;
>> >> > > > + struct net_device __rcu *bypass_netdev;
>> >> > > > + struct bypass_ops __rcu *ops;
>> >> > > > +};
>> >> > > > +
>> >> > > > +/* bypass state */
>> >> > > > +struct bypass_info {
>> >> > > > + /* passthru netdev with same MAC */
>> >> > > > + struct net_device __rcu *active_netdev;
>> >> > > You still use "active"/"backup" names which is highly misleading as
>> >> > > it has completely different meaning that in bond for example.
>> >> > > I noted that in my previous review already. Please change it.
>> >> > I guess the issue is with only the 'active' name. 'backup' should be fine as it also
>> >> > matches with the BACKUP feature bit we are adding to virtio_net.
>> >> I think that "backup" is also misleading. Both "active" and "backup"
>> >> mean a *state* of slaves. This should be named differently.
>> >>
>> >>
>> >>
>> >> > With regards to alternate names for 'active', you suggested 'stolen', but i
>> >> > am not too happy with it.
>> >> > netvsc uses vf_netdev, are you OK with this? Or another option is 'passthru'
>> >> No. The netdev could be any netdevice. It does not have to be a "VF".
>> >> I think "stolen" is quite appropriate since it describes the modus
>> >> operandi. The bypass master steals some netdevice according to some
>> >> match.
>> >>
>> >> But I don't insist on "stolen". Just sounds right.
>> >
>> >We are adding VIRTIO_NET_F_BACKUP as a new feature bit to enable this feature, So i think
>> >'backup' name is consistent.
>>
>> It perhaps makes sense from the view of virtio device. However, as I
>> described couple of times, for master/slave device the name "backup" is
>> highly misleading.
>
>virtio is the backup. You are supposed to use another
>(typically passthrough) device, if that fails use virtio.
>It does seem appropriate to me. If you like, we can
>change that to "standby". Active I don't like either. "main"?
Sounds much better, yes.
>
>In fact would failover be better than bypass?
Also, much better.
>
>
>>
>> >
>> >The intent is to restrict the 'active' netdev to be a VF. If there is a way to check that
>> >a PCI device is a VF in the guest kernel, we could restrict 'active' netdev to be a VF.
>> >
>> >Will look for any suggestions in the next day or two. If i don't get any, i will go
>> >with 'stolen'
>> >
>> ><snip>
>> >
>> >
>> >> +
>> >> +static struct net_device *bypass_master_get_bymac(u8 *mac,
>> >> + struct bypass_ops **ops)
>> >> +{
>> >> + struct bypass_master *bypass_master;
>> >> + struct net_device *bypass_netdev;
>> >> +
>> >> + spin_lock(&bypass_lock);
>> >> + list_for_each_entry(bypass_master, &bypass_master_list, list) {
>> >> > > As I wrote the last time, you don't need this list, spinlock.
>> >> > > You can do just something like:
>> >> > > for_each_net(net) {
>> >> > > for_each_netdev(net, dev) {
>> >> > > if (netif_is_bypass_master(dev)) {
>> >> > This function returns the upper netdev as well as the ops associated
>> >> > with that netdev.
>> >> > bypass_master_list is a list of 'struct bypass_master' that associates
>> >> Well, can't you have it in netdev priv?
>> >
>> >We cannot do this for 2-netdev model as there is no bypass_netdev created.
>>
>> Howcome? You have no master? I don't understand..
>>
>>
>>
>> >
>> >>
>> >>
>> >> > 'bypass_netdev' with 'bypass_ops' and gets added via bypass_master_register().
>> >> > We need 'ops' only to support the 2 netdev model of netvsc. ops will be
>> >> > NULL for 3-netdev model.
>> >> I see :(
>> >>
>> >>
>> >> >
>> >> > >
>> >> > >
>> >> > >
>> >> > > > + bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>> >> > > > + if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
>> >> > > > + *ops = rcu_dereference(bypass_master->ops);
>> >> > > I don't see how rcu_dereference is ok here.
>> >> > > 1) I don't see rcu_read_lock taken
>> >> > > 2) Looks like bypass_master->ops has the same value across the whole
>> >> > > existence.
>> >> > We hold rtnl_lock(), i think i need to change this to rtnl_dereference.
>> >> > Yes. ops doesn't change.
>> >> If it does not change, you can just access it directly.
>> >>
>> >>
>> >> > >
>> >> > > > + spin_unlock(&bypass_lock);
>> >> > > > + return bypass_netdev;
>> >> > > > + }
>> >> > > > + }
>> >> > > > + spin_unlock(&bypass_lock);
>> >> > > > + return NULL;
>> >> > > > +}
>> >> > > > +
>> >> > > > +static int bypass_slave_register(struct net_device *slave_netdev)
>> >> > > > +{
>> >> > > > + struct net_device *bypass_netdev;
>> >> > > > + struct bypass_ops *bypass_ops;
>> >> > > > + int ret, orig_mtu;
>> >> > > > +
>> >> > > > + ASSERT_RTNL();
>> >> > > > +
>> >> > > > + bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> >> > > > + &bypass_ops);
>> >> > > For master, could you use word "master" in the variables so it is clear?
>> >> > > Also, "dev" is fine instead of "netdev".
>> >> > > Something like "bpmaster_dev"
>> >> > bypass_master is of type struct bypass_master, bypass_netdev is of type struct net_device.
>> >> I was trying to point out, that "bypass_netdev" represents a "master"
>> >> netdev, yet it does not say master. That is why I suggested
>> >> "bpmaster_dev"
>> >>
>> >>
>> >> > I can change all _netdev suffixes to _dev to make the names shorter.
>> >> ok.
>> >>
>> >>
>> >> >
>> >> > >
>> >> > > > + if (!bypass_netdev)
>> >> > > > + goto done;
>> >> > > > +
>> >> > > > + ret = bypass_slave_pre_register(slave_netdev, bypass_netdev,
>> >> > > > + bypass_ops);
>> >> > > > + if (ret != 0)
>> >> > > Just "if (ret)" will do. You have this on more places.
>> >> > OK.
>> >> >
>> >> >
>> >> > >
>> >> > > > + goto done;
>> >> > > > +
>> >> > > > + ret = netdev_rx_handler_register(slave_netdev,
>> >> > > > + bypass_ops ? bypass_ops->handle_frame :
>> >> > > > + bypass_handle_frame, bypass_netdev);
>> >> > > > + if (ret != 0) {
>> >> > > > + netdev_err(slave_netdev, "can not register bypass rx handler (err = %d)\n",
>> >> > > > + ret);
>> >> > > > + goto done;
>> >> > > > + }
>> >> > > > +
>> >> > > > + ret = netdev_upper_dev_link(slave_netdev, bypass_netdev, NULL);
>> >> > > > + if (ret != 0) {
>> >> > > > + netdev_err(slave_netdev, "can not set master device %s (err = %d)\n",
>> >> > > > + bypass_netdev->name, ret);
>> >> > > > + goto upper_link_failed;
>> >> > > > + }
>> >> > > > +
>> >> > > > + slave_netdev->priv_flags |= IFF_BYPASS_SLAVE;
>> >> > > > +
>> >> > > > + if (netif_running(bypass_netdev)) {
>> >> > > > + ret = dev_open(slave_netdev);
>> >> > > > + if (ret && (ret != -EBUSY)) {
>> >> > > > + netdev_err(bypass_netdev, "Opening slave %s failed ret:%d\n",
>> >> > > > + slave_netdev->name, ret);
>> >> > > > + goto err_interface_up;
>> >> > > > + }
>> >> > > > + }
>> >> > > > +
>> >> > > > + /* Align MTU of slave with master */
>> >> > > > + orig_mtu = slave_netdev->mtu;
>> >> > > > + ret = dev_set_mtu(slave_netdev, bypass_netdev->mtu);
>> >> > > > + if (ret != 0) {
>> >> > > > + netdev_err(bypass_netdev, "unable to change mtu of %s to %u register failed\n",
>> >> > > > + slave_netdev->name, bypass_netdev->mtu);
>> >> > > > + goto err_set_mtu;
>> >> > > > + }
>> >> > > > +
>> >> > > > + ret = bypass_slave_join(slave_netdev, bypass_netdev, bypass_ops);
>> >> > > > + if (ret != 0)
>> >> > > > + goto err_join;
>> >> > > > +
>> >> > > > + call_netdevice_notifiers(NETDEV_JOIN, slave_netdev);
>> >> > > > +
>> >> > > > + netdev_info(bypass_netdev, "bypass slave:%s registered\n",
>> >> > > > + slave_netdev->name);
>> >> > > > +
>> >> > > > + goto done;
>> >> > > > +
>> >> > > > +err_join:
>> >> > > > + dev_set_mtu(slave_netdev, orig_mtu);
>> >> > > > +err_set_mtu:
>> >> > > > + dev_close(slave_netdev);
>> >> > > > +err_interface_up:
>> >> > > > + netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>> >> > > > + slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>> >> > > > +upper_link_failed:
>> >> > > > + netdev_rx_handler_unregister(slave_netdev);
>> >> > > > +done:
>> >> > > > + return NOTIFY_DONE;
>> >> > > > +}
>> >> > > > +
>> >> > > > +static int bypass_slave_pre_unregister(struct net_device *slave_netdev,
>> >> > > > + struct net_device *bypass_netdev,
>> >> > > > + struct bypass_ops *bypass_ops)
>> >> > > > +{
>> >> > > > + struct net_device *backup_netdev, *active_netdev;
>> >> > > > + struct bypass_info *bi;
>> >> > > > +
>> >> > > > + if (bypass_ops) {
>> >> > > > + if (!bypass_ops->slave_pre_unregister)
>> >> > > > + return -EINVAL;
>> >> > > > +
>> >> > > > + return bypass_ops->slave_pre_unregister(slave_netdev,
>> >> > > > + bypass_netdev);
>> >> > > > + }
>> >> > > > +
>> >> > > > + bi = netdev_priv(bypass_netdev);
>> >> > > > + active_netdev = rtnl_dereference(bi->active_netdev);
>> >> > > > + backup_netdev = rtnl_dereference(bi->backup_netdev);
>> >> > > > +
>> >> > > > + if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>> >> > > > + return -EINVAL;
>> >> > > > +
>> >> > > > + return 0;
>> >> > > > +}
>> >> > > > +
>> >> > > > +static int bypass_slave_release(struct net_device *slave_netdev,
>> >> > > > + struct net_device *bypass_netdev,
>> >> > > > + struct bypass_ops *bypass_ops)
>> >> > > > +{
>> >> > > > + struct net_device *backup_netdev, *active_netdev;
>> >> > > > + struct bypass_info *bi;
>> >> > > > +
>> >> > > > + if (bypass_ops) {
>> >> > > > + if (!bypass_ops->slave_release)
>> >> > > > + return -EINVAL;
>> >> > > I think it would be good to make the API to the driver more strict and
>> >> > > have a separate set of ops for "active" and "backup" netdevices.
>> >> > > That should stop people thinking about extending this to more slaves in
>> >> > > the future.
>> >> > We have checks in slave_pre_register() that allows only 1 'backup' and 1
>> >> > 'active' slave.
>> >> I'm very well aware of that. I just thought that explicit ops for the
>> >> two slaves would make this more clear.
>> >>
>> >>
>> >> >
>> >> > >
>> >> > >
>> >> > > > +
>> >> > > > + return bypass_ops->slave_release(slave_netdev, bypass_netdev);
>> >> > > > + }
>> >> > > > +
>> >> > > > + bi = netdev_priv(bypass_netdev);
>> >> > > > + active_netdev = rtnl_dereference(bi->active_netdev);
>> >> > > > + backup_netdev = rtnl_dereference(bi->backup_netdev);
>> >> > > > +
>> >> > > > + if (slave_netdev == backup_netdev) {
>> >> > > > + RCU_INIT_POINTER(bi->backup_netdev, NULL);
>> >> > > > + } else {
>> >> > > > + RCU_INIT_POINTER(bi->active_netdev, NULL);
>> >> > > > + if (backup_netdev) {
>> >> > > > + bypass_netdev->min_mtu = backup_netdev->min_mtu;
>> >> > > > + bypass_netdev->max_mtu = backup_netdev->max_mtu;
>> >> > > > + }
>> >> > > > + }
>> >> > > > +
>> >> > > > + dev_put(slave_netdev);
>> >> > > > +
>> >> > > > + netdev_info(bypass_netdev, "bypass slave:%s released\n",
>> >> > > > + slave_netdev->name);
>> >> > > > +
>> >> > > > + return 0;
>> >> > > > +}
>> >> > > > +
>> >> > > > +int bypass_slave_unregister(struct net_device *slave_netdev)
>> >> > > > +{
>> >> > > > + struct net_device *bypass_netdev;
>> >> > > > + struct bypass_ops *bypass_ops;
>> >> > > > + int ret;
>> >> > > > +
>> >> > > > + if (!netif_is_bypass_slave(slave_netdev))
>> >> > > > + goto done;
>> >> > > > +
>> >> > > > + ASSERT_RTNL();
>> >> > > > +
>> >> > > > + bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> >> > > > + &bypass_ops);
>> >> > > > + if (!bypass_netdev)
>> >> > > > + goto done;
>> >> > > > +
>> >> > > > + ret = bypass_slave_pre_unregister(slave_netdev, bypass_netdev,
>> >> > > > + bypass_ops);
>> >> > > > + if (ret != 0)
>> >> > > > + goto done;
>> >> > > > +
>> >> > > > + netdev_rx_handler_unregister(slave_netdev);
>> >> > > > + netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>> >> > > > + slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>> >> > > > +
>> >> > > > + bypass_slave_release(slave_netdev, bypass_netdev, bypass_ops);
>> >> > > > +
>> >> > > > + netdev_info(bypass_netdev, "bypass slave:%s unregistered\n",
>> >> > > > + slave_netdev->name);
>> >> > > > +
>> >> > > > +done:
>> >> > > > + return NOTIFY_DONE;
>> >> > > > +}
>> >> > > > +EXPORT_SYMBOL_GPL(bypass_slave_unregister);
>> >> > > > +
>> >> > > > +static bool bypass_xmit_ready(struct net_device *dev)
>> >> > > > +{
>> >> > > > + return netif_running(dev) && netif_carrier_ok(dev);
>> >> > > > +}
>> >> > > > +
>> >> > > > +static int bypass_slave_link_change(struct net_device *slave_netdev)
>> >> > > > +{
>> >> > > > + struct net_device *bypass_netdev, *active_netdev, *backup_netdev;
>> >> > > > + struct bypass_ops *bypass_ops;
>> >> > > > + struct bypass_info *bi;
>> >> > > > +
>> >> > > > + if (!netif_is_bypass_slave(slave_netdev))
>> >> > > > + goto done;
>> >> > > > +
>> >> > > > + ASSERT_RTNL();
>> >> > > > +
>> >> > > > + bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> >> > > > + &bypass_ops);
>> >> > > > + if (!bypass_netdev)
>> >> > > > + goto done;
>> >> > > > +
>> >> > > > + if (bypass_ops) {
>> >> > > > + if (!bypass_ops->slave_link_change)
>> >> > > > + goto done;
>> >> > > > +
>> >> > > > + return bypass_ops->slave_link_change(slave_netdev,
>> >> > > > + bypass_netdev);
>> >> > > > + }
>> >> > > > +
>> >> > > > + if (!netif_running(bypass_netdev))
>> >> > > > + return 0;
>> >> > > > +
>> >> > > > + bi = netdev_priv(bypass_netdev);
>> >> > > > +
>> >> > > > + active_netdev = rtnl_dereference(bi->active_netdev);
>> >> > > > + backup_netdev = rtnl_dereference(bi->backup_netdev);
>> >> > > > +
>> >> > > > + if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>> >> > > > + goto done;
>> >> > > You don't need this check. "if (!netif_is_bypass_slave(slave_netdev))"
>> >> > > above is enough.
>> >> > I think we need this check to not allow events from a slave that is not
>> >> > attached to this master but has the same MAC.
>> >> Why do we need such events? Seems wrong to me.
>> >
>> >We want to avoid events from a netdev that is mis-configured with the same MAC as
>> >a bypass setup.
>> >
>> >> Consider:
>> >>
>> >> bp1 bp2
>> >> a1 b1 a2 b2
>> >>
>> >>
>> >> a1 and a2 have the same mac and bp1 and bp2 have the same mac.
>> >
>> >We should not have 2 bypass configs with the same MAC.
>> >I need to add a check in the bypass_master_register() to prevent this.
>>
>> Mac can change, you would have to check in change as well. Feels odd
>> thought.
>>
>>
>> >
>> >The above check is to avoid cases where we have
>> >bp1(a1, b1) with mac1
>> >and a2 is mis-configured with mac1, we want to avoid using a2 link events to update bp1.
>> >
>> >> Now bypass_master_get_bymac() will return always bp1 or bp2 - depending on
>> >> the order of creation.
>> >> Let's say it will return bp1. Then when we have event for a2, the
>> >> bypass_ops->slave_link_change is called with (a2, bp1). That is wrong.
>> >>
>> >>
>> >> You cannot use bypass_master_get_bymac() here.
>> >>
>> >>
>> >>
>> >> > >
>> >> > > > +
>> >> > > > + if ((active_netdev && bypass_xmit_ready(active_netdev)) ||
>> >> > > > + (backup_netdev && bypass_xmit_ready(backup_netdev))) {
>> >> > > > + netif_carrier_on(bypass_netdev);
>> >> > > > + netif_tx_wake_all_queues(bypass_netdev);
>> >> > > > + } else {
>> >> > > > + netif_carrier_off(bypass_netdev);
>> >> > > > + netif_tx_stop_all_queues(bypass_netdev);
>> >> > > > + }
>> >> > > > +
>> >> > > > +done:
>> >> > > > + return NOTIFY_DONE;
>> >> > > > +}
>> >> > > > +
>> >> > > > +static bool bypass_validate_event_dev(struct net_device *dev)
>> >> > > > +{
>> >> > > > + /* Skip parent events */
>> >> > > > + if (netif_is_bypass_master(dev))
>> >> > > > + return false;
>> >> > > > +
>> >> > > > + /* Avoid non-Ethernet type devices */
>> >> > > > + if (dev->type != ARPHRD_ETHER)
>> >> > > > + return false;
>> >> > > > +
>> >> > > > + /* Avoid Vlan dev with same MAC registering as VF */
>> >> > > > + if (is_vlan_dev(dev))
>> >> > > > + return false;
>> >> > > > +
>> >> > > > + /* Avoid Bonding master dev with same MAC registering as slave dev */
>> >> > > > + if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
>> >> > > Yeah, this is certainly incorrect. One thing is, you should be using the
>> >> > > helpers netif_is_bond_master().
>> >> > > But what about the rest? macsec, macvlan, team, bridge, ovs and others?
>> >> > >
>> >> > > You need to do it not by blacklisting, but with whitelisting. You need
>> >> > > to whitelist VF devices. My port flavours patchset might help with this.
>> >> > May be i can use netdev_has_lower_dev() helper to make sure that the slave
>> >> I don't see such function in the code.
>> >
>> >It is netdev_has_any_lower_dev(). I need to export it.
>>
>> Come on, you cannot use that. That would allow bonding without slaves,
>> but the slaves could be added later on.
>>
>> What exactly you are trying to achieve by this?
>>
>>
>> >
>> >>
>> >>
>> >> > device is not an upper dev.
>> >> > Can you point to your port flavours patchset? Is it upstream?
>> >> I sent rfc couple of weeks ago:
>> >> [patch net-next RFC 00/12] devlink: introduce port flavours and common phys_port_name generation
>> >
>> >
>> >
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox