* Re: netdevice notifier and device private data
From: Alexander Aring @ 2018-06-08 19:41 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: netdev, linux-wpan, linux-bluetooth
In-Reply-To: <20180608111457.0a9b4cae@xeon-e3>
Hi Stephen,
On Fri, Jun 08, 2018 at 11:14:57AM -0700, Stephen Hemminger wrote:
...
>
> notifiers are always called with RTNL mutex held
> and dev->type should not change unless RTNL is held.
thanks for you answer. I am not talking about any race between notifiers
vs dev->type change.
I am talking that dev->type was already changed and a upcoming notifier ends
in undefined behaviour when it derefences dev->priv. I have some notifier
which maps a cast from dev->type to a specific structure at dev->priv. This
structure is not there in tap/tun devices if they changed to "my" dev->type
and the notifier occurs.
- Alex
^ permalink raw reply
* Re: netdevice notifier and device private data
From: Michael Richardson @ 2018-06-08 19:37 UTC (permalink / raw)
To: Alexander Aring; +Cc: netdev, linux-wpan, linux-bluetooth
In-Reply-To: <20180608173455.vrnfvv7dlu4oxwqf@x220t>
[-- Attachment #1: Type: text/plain, Size: 822 bytes --]
Alexander Aring <aring@mojatatu.com> wrote:
Alex> I already see code outside who changed tun netdevice to the
Alex> ARPHRD_6LOWPAN type and I suppose they running into this
Alex> issue. (Btw: I don't know why somebody wants to changed that
Alex> type to ARPHRD_6LOWPAN on tun).
so that they can have the kernel do 6lowpan processing, emitting 6lowPAN
packets into userspace to be transfered into a radio via some proprietary
interface (including, for instance SLIP over USB cable to Contiki or OpenWSN stack,
set up to act as radio only)
--
] Never tell me the odds! | ipv6 mesh networks [
] Michael Richardson, Sandelman Software Works | network architect [
] mcr@sandelman.ca http://www.sandelman.ca/ | ruby on rails [
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 464 bytes --]
^ permalink raw reply
* Re: [PATCH 2/3] bpfilter: include bpfilter_umh in assembly instead of using objcopy
From: Alexei Starovoitov @ 2018-06-08 20:47 UTC (permalink / raw)
To: Masahiro Yamada
Cc: netdev, Alexei Starovoitov, David S . Miller, Arnd Bergmann,
Geert Uytterhoeven, linux-kernel, YueHaibing
In-Reply-To: <1528477930-7342-3-git-send-email-yamada.masahiro@socionext.com>
On Sat, Jun 09, 2018 at 02:12:09AM +0900, Masahiro Yamada wrote:
> Do not use the troublesome ELF magic. What is happening here is to
> embed a user-space program into the kernel. Simply wrap it in the
> assembly with the '.incbin' directive.
>
> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
> ---
>
> net/bpfilter/Makefile | 15 ++-------------
> net/bpfilter/bpfilter_kern.c | 11 +++++------
> net/bpfilter/bpfilter_umh_blob.S | 7 +++++++
> 3 files changed, 14 insertions(+), 19 deletions(-)
> create mode 100644 net/bpfilter/bpfilter_umh_blob.S
>
> diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile
> index aafa720..39c6980 100644
> --- a/net/bpfilter/Makefile
> +++ b/net/bpfilter/Makefile
> @@ -15,18 +15,7 @@ ifeq ($(CONFIG_BPFILTER_UMH), y)
> HOSTLDFLAGS += -static
> endif
>
> -# a bit of elf magic to convert bpfilter_umh binary into a binary blob
> -# inside bpfilter_umh.o elf file referenced by
> -# _binary_net_bpfilter_bpfilter_umh_start symbol
> -# which bpfilter_kern.c passes further into umh blob loader at run-time
> -quiet_cmd_copy_umh = GEN $@
> - cmd_copy_umh = echo ':' > $(obj)/.bpfilter_umh.o.cmd; \
> - $(OBJCOPY) -I binary -O $(CONFIG_OUTPUT_FORMAT) \
> - -B `$(OBJDUMP) -f $<|grep architecture|cut -d, -f1|cut -d' ' -f2` \
> - --rename-section .data=.init.rodata $< $@
> -
> -$(obj)/bpfilter_umh.o: $(obj)/bpfilter_umh
> - $(call cmd,copy_umh)
> +$(obj)/bpfilter_umh_blob.o: $(obj)/bpfilter_umh
>
> obj-$(CONFIG_BPFILTER_UMH) += bpfilter.o
> -bpfilter-objs += bpfilter_kern.o bpfilter_umh.o
> +bpfilter-objs += bpfilter_kern.o bpfilter_umh_blob.o
> diff --git a/net/bpfilter/bpfilter_kern.c b/net/bpfilter/bpfilter_kern.c
> index b13d058..fcc1a7c 100644
> --- a/net/bpfilter/bpfilter_kern.c
> +++ b/net/bpfilter/bpfilter_kern.c
> @@ -10,11 +10,8 @@
> #include <linux/file.h>
> #include "msgfmt.h"
>
> -#define UMH_start _binary_net_bpfilter_bpfilter_umh_start
> -#define UMH_end _binary_net_bpfilter_bpfilter_umh_end
> -
> -extern char UMH_start;
> -extern char UMH_end;
> +extern char bpfilter_umh_start;
> +extern char bpfilter_umh_end;
>
> static struct umh_info info;
> /* since ip_getsockopt() can run in parallel, serialize access to umh */
> @@ -89,7 +86,9 @@ static int __init load_umh(void)
> int err;
>
> /* fork usermode process */
> - err = fork_usermode_blob(&UMH_start, &UMH_end - &UMH_start, &info);
> + err = fork_usermode_blob(&bpfilter_umh_end,
> + &bpfilter_umh_end - &bpfilter_umh_start,
> + &info);
> if (err)
> return err;
> pr_info("Loaded bpfilter_umh pid %d\n", info.pid);
> diff --git a/net/bpfilter/bpfilter_umh_blob.S b/net/bpfilter/bpfilter_umh_blob.S
> new file mode 100644
> index 0000000..40311d1
> --- /dev/null
> +++ b/net/bpfilter/bpfilter_umh_blob.S
> @@ -0,0 +1,7 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> + .section .init.rodata, "a"
> + .global bpfilter_umh_start
> +bpfilter_umh_start:
> + .incbin "net/bpfilter/bpfilter_umh"
Interesting. I think this is good idea. Looks cleaner than objcopy magic.
btw CONFIG_OUTPUT_FORMAT already fixed by
commit 8d97ca6b6755 ("bpfilter: fix OUTPUT_FORMAT") in net tree.
Could you please rebase on top of that tree?
^ permalink raw reply
* Re: [PATCH 3/3] bpfilter: do not (ab)use host-program build rule
From: Alexei Starovoitov @ 2018-06-08 20:52 UTC (permalink / raw)
To: Masahiro Yamada
Cc: netdev, Alexei Starovoitov, David S . Miller, Arnd Bergmann,
Geert Uytterhoeven, linux-kernel, YueHaibing, Daniel Borkmann
In-Reply-To: <1528477930-7342-4-git-send-email-yamada.masahiro@socionext.com>
On Sat, Jun 09, 2018 at 02:12:10AM +0900, Masahiro Yamada wrote:
> It is an ugly hack to overwrite $(HOSTCC) with $(CC) to reuse the
> build rules from scripts/Makefile.host. It should not be tedious
> to write a build rule for its own.
>
> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
> ---
>
> net/bpfilter/Makefile | 17 +++++++++++------
> net/bpfilter/{main.c => bpfilter_umh.c} | 0
> 2 files changed, 11 insertions(+), 6 deletions(-)
> rename net/bpfilter/{main.c => bpfilter_umh.c} (100%)
>
> diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile
> index 39c6980..6571b30 100644
> --- a/net/bpfilter/Makefile
> +++ b/net/bpfilter/Makefile
> @@ -3,18 +3,23 @@
> # Makefile for the Linux BPFILTER layer.
> #
>
> -hostprogs-y := bpfilter_umh
> -bpfilter_umh-objs := main.o
> -HOSTCFLAGS += -I. -Itools/include/ -Itools/include/uapi
> -HOSTCC := $(CC)
that is a hack indeed. I don't like it either, but see below.
> -
> ifeq ($(CONFIG_BPFILTER_UMH), y)
> # builtin bpfilter_umh should be compiled with -static
> # since rootfs isn't mounted at the time of __init
> # function is called and do_execv won't find elf interpreter
> -HOSTLDFLAGS += -static
> +STATIC := -static
> endif
>
> +quiet_cmd_cc_user = CC $@
> + cmd_cc_user = $(CC) -Wall -Wmissing-prototypes -O2 -std=gnu89 \
> + -I$(srctree) -I$(srctree)/tools/include/ \
> + -I$(srctree)/tools/include/uapi $(STATIC) -o $@ $<
> +
> +$(obj)/bpfilter_umh: $(src)/bpfilter_umh.c FORCE
> + $(call if_changed,cc_user)
Does this scale?
Please see two top patches here:
https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/log/?h=ipt_bpf
that add more meat to bpfilter and a lot more files.
Recompiling all of them at once isn't nice either.
This Makefile needs different .c -> .o rules for bpfilter_kern.c files
that get compiled into kernel module and for the rest of umh files:
main.c ctor.c init.c gen.c etc
that need to be compiled .c -> .o differently.
I don't see how to do it without ugly hacks in Makefile.
In that sense HOSTCC = CC hack looked the least ugly to me that's
why I went with it.
Better ideas?
^ permalink raw reply
* [RFC PATCH 0/3] BPF socket filter to deal with skb frags
From: Tushar Dave @ 2018-06-08 21:00 UTC (permalink / raw)
To: netdev, ast, daniel, davem, john.fastabend, jakub.kicinski, kafai,
rdna, quentin.monnet, brakmo, acme
This RFC allows bpf socket filter programs to look into complete skb
i.e. linear and non-linear part of skb. (patch1)
For a proof of concept I'm using RDS sample program that uses bpf socket
filter and inspect skb packet data from linear and non-linear part e.g.
skb frags. (patch 2 and 3)
I'm sharing this RFC to get some feedback on direction.
Details:
patch1 adds new bpf helper function and needed infrastructure so that
socket(sk) filter based eBPF program can retrieve non-linear part of skb
(e.g. skb frags) unlike current socket filter that only deals with
linear skb. This patch adds very basic functionality and for now allow
socket filter programs to only read packet data (from linear and
non-linear part of) skb. The idea behind this patch is to add eBPF
helper that allow socket filter based ebpf program to walk through the
skb frag using bpf tail call. This way ebpf program can do deep packet
inspection (i.e. allows to look into headers as well as payload).
patch2 adds sample ebpf socket filter program that uses rds socket. The
sample program opens an rds socket, attach ebpf program to rds socket
and uses bpf helper added in patch 1 to look into skb. For a test,
current ebpf program only prints first few bytes from skb->data and skb
frags.
patch3 allows rds_recv_incoming to invoke bpf socket filter program if
any program is attached to rds socket.
FYI, I'm also working on a follow-up patchset that deals with *struct
scatterlist* to allow RDS filtering for IB/RDMA use cases that do not
have an sk_buff.
Thanks.
-Tushar
Tushar Dave (3):
ebpf: add next_skb_frag bpf helper for sk filter
samples/bpf: add sample RDS program
rds: invoke sk filter attached to rds socket
include/linux/filter.h | 2 +
include/uapi/linux/bpf.h | 10 +-
net/core/filter.c | 44 ++++-
net/rds/recv.c | 17 ++
samples/bpf/Makefile | 3 +
samples/bpf/rds_skb_kern.c | 87 +++++++++
samples/bpf/rds_skb_user.c | 311 ++++++++++++++++++++++++++++++
tools/include/uapi/linux/bpf.h | 10 +-
tools/testing/selftests/bpf/bpf_helpers.h | 2 +
9 files changed, 482 insertions(+), 4 deletions(-)
create mode 100644 samples/bpf/rds_skb_kern.c
create mode 100644 samples/bpf/rds_skb_user.c
--
1.8.3.1
^ permalink raw reply
* [RFC PATCH 1/3] ebpf: add next_skb_frag bpf helper for sk filter
From: Tushar Dave @ 2018-06-08 21:00 UTC (permalink / raw)
To: netdev, ast, daniel, davem, john.fastabend, jakub.kicinski, kafai,
rdna, quentin.monnet, brakmo, acme
In-Reply-To: <1528491607-10399-1-git-send-email-tushar.n.dave@oracle.com>
Today socket filter only deals with linear skbs. This change allows
ebpf programs to look into non-linear skb e.g. skb frags. This will be
useful when users need to look into data which is not contained in the
linear part of skb.
Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
include/linux/filter.h | 2 ++
include/uapi/linux/bpf.h | 10 ++++++-
net/core/filter.c | 44 +++++++++++++++++++++++++++++--
tools/include/uapi/linux/bpf.h | 10 ++++++-
tools/testing/selftests/bpf/bpf_helpers.h | 2 ++
5 files changed, 64 insertions(+), 4 deletions(-)
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 9dbcb9d..603b8bf 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -500,6 +500,7 @@ struct sk_filter {
struct bpf_skb_data_end {
struct qdisc_skb_cb qdisc_cb;
+ u8 index;
void *data_meta;
void *data_end;
};
@@ -534,6 +535,7 @@ static inline void bpf_compute_data_pointers(struct sk_buff *skb)
BUILD_BUG_ON(sizeof(*cb) > FIELD_SIZEOF(struct sk_buff, cb));
cb->data_meta = skb->data - skb_metadata_len(skb);
cb->data_end = skb->data + skb_headlen(skb);
+ cb->index = 0;
}
static inline u8 *bpf_skb_cb(struct sk_buff *skb)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index d94d333..5fe9668 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1902,6 +1902,13 @@ struct bpf_stack_build_id {
* egress otherwise). This is the only flag supported for now.
* Return
* **SK_PASS** on success, or **SK_DROP** on error.
+ *
+ * int bpf_next_skb_frag(struct sk_buff *skb)
+ * Description
+ * This helper allows users to look into non-linear part of skb
+ * e.g. skb frags.
+ * Return
+ * 0 on success, or a negative error in case of failure.
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@@ -1976,7 +1983,8 @@ struct bpf_stack_build_id {
FN(fib_lookup), \
FN(sock_hash_update), \
FN(msg_redirect_hash), \
- FN(sk_redirect_hash),
+ FN(sk_redirect_hash), \
+ FN(next_skb_frag),
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index 51ea7dd..fd8e90f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3752,6 +3752,38 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const void *src_buff,
.arg1_type = ARG_PTR_TO_CTX,
};
+BPF_CALL_1(bpf_next_skb_frag, struct sk_buff *, skb)
+{
+ struct bpf_skb_data_end *cb = (struct bpf_skb_data_end *)skb->cb;
+ const skb_frag_t *frag;
+
+ if (skb->data_len == 0)
+ return -ENODATA;
+
+ if (cb->index == (u8)skb_shinfo(skb)->nr_frags)
+ return -ENODATA;
+
+ /* get the frag start and end address into data_meta and data_end
+ * respectively so eBPF program can look into skb frag
+ */
+ frag = &skb_shinfo(skb)->frags[cb->index];
+ cb->data_meta = page_address(skb_frag_page(frag)) +
+ frag->page_offset;
+ cb->data_end = cb->data_meta + skb_frag_size(frag);
+
+ /* update frag index */
+ cb->index++;
+
+ return 0;
+}
+
+static const struct bpf_func_proto bpf_next_skb_frag_proto = {
+ .func = bpf_next_skb_frag,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_CTX,
+};
+
BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
int, level, int, optname, char *, optval, int, optlen)
{
@@ -4415,6 +4447,8 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
return &bpf_get_socket_cookie_proto;
case BPF_FUNC_get_socket_uid:
return &bpf_get_socket_uid_proto;
+ case BPF_FUNC_next_skb_frag:
+ return &bpf_next_skb_frag_proto;
default:
return bpf_base_func_proto(func_id);
}
@@ -4698,10 +4732,16 @@ static bool sk_filter_is_valid_access(int off, int size,
struct bpf_insn_access_aux *info)
{
switch (off) {
- case bpf_ctx_range(struct __sk_buff, tc_classid):
case bpf_ctx_range(struct __sk_buff, data):
- case bpf_ctx_range(struct __sk_buff, data_meta):
+ info->reg_type = PTR_TO_PACKET;
+ break;
case bpf_ctx_range(struct __sk_buff, data_end):
+ info->reg_type = PTR_TO_PACKET_END;
+ break;
+ case bpf_ctx_range(struct __sk_buff, data_meta):
+ info->reg_type = PTR_TO_PACKET;
+ break;
+ case bpf_ctx_range(struct __sk_buff, tc_classid):
case bpf_ctx_range_till(struct __sk_buff, family, local_port):
return false;
}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index d94d333..5fe9668 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1902,6 +1902,13 @@ struct bpf_stack_build_id {
* egress otherwise). This is the only flag supported for now.
* Return
* **SK_PASS** on success, or **SK_DROP** on error.
+ *
+ * int bpf_next_skb_frag(struct sk_buff *skb)
+ * Description
+ * This helper allows users to look into non-linear part of skb
+ * e.g. skb frags.
+ * Return
+ * 0 on success, or a negative error in case of failure.
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@@ -1976,7 +1983,8 @@ struct bpf_stack_build_id {
FN(fib_lookup), \
FN(sock_hash_update), \
FN(msg_redirect_hash), \
- FN(sk_redirect_hash),
+ FN(sk_redirect_hash), \
+ FN(next_skb_frag),
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 8f143df..51f2153 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -114,6 +114,8 @@ static int (*bpf_get_stack)(void *ctx, void *buf, int size, int flags) =
static int (*bpf_fib_lookup)(void *ctx, struct bpf_fib_lookup *params,
int plen, __u32 flags) =
(void *) BPF_FUNC_fib_lookup;
+static unsigned long long (*bpf_next_skb_frag)(void *ctx) =
+ (void *) BPF_FUNC_next_skb_frag;
/* llvm builtin functions that eBPF C program may use to
* emit BPF_LD_ABS and BPF_LD_IND instructions
--
1.8.3.1
^ permalink raw reply related
* [RFC PATCH 2/3] samples/bpf: add sample RDS program
From: Tushar Dave @ 2018-06-08 21:00 UTC (permalink / raw)
To: netdev, ast, daniel, davem, john.fastabend, jakub.kicinski, kafai,
rdna, quentin.monnet, brakmo, acme
In-Reply-To: <1528491607-10399-1-git-send-email-tushar.n.dave@oracle.com>
When run in server mode, the sample RDS program opens PF_RDS socket,
attaches ebpf program to RDS socket which then uses bpf_skb_next_frag
helper along with bpf tail calls to inspect skb linear and non-linear
data.
To ease testing, RDS client functionality is also added so that users
can generate RDS packet.
Run server:
[root@lab71 bpf]# ./rds_skb -s 192.168.3.71
running server in a loop
transport tcp
server bound to address: 192.168.3.71 port 4000
server listening on 192.168.3.71
192.168.3.71 received a packet from 192.168.3.71 of len 8192 cmsg len 0,
on port 52287
payload contains:30 31 32 33 34 35 36 37 38 39 3a 3b 3c 3d 3e 3f 40 41
42 43 44 45 46 47 48 49 4a 4b 4c 4d 4e 4f 50 51 52 53 54 55 56 57 58 59
5a 5b 5c 5d 5e 5f 60 61 62 63 64 65 66 67 68 69 6a 6b ...
server listening on 192.168.3.71
Run client:
[root@lab70 bpf]# ./rds_skb -s 192.168.3.71 -c 192.168.3.70
transport tcp
client bound to address: 192.168.3.71 port 47437
client sending 8192 byte message from 192.168.3.71 to 192.168.3.70 on
port 47437
bpf program output:
[root@lab71]# cat /sys/kernel/debug/tracing/trace_pipe
<idle>-0 [000] ..s. 218923.839673: 0: 30 31 32
<idle>-0 [000] ..s. 218923.839682: 0: 33 34 35
<idle>-0 [000] ..s. 218923.845133: 0: be bf c0
<idle>-0 [000] ..s. 218923.845135: 0: c1 c2 c3
<idle>-0 [000] ..s. 218923.850581: 0: be bf c0
<idle>-0 [000] ..s. 218923.850582: 0: c1 c2 c3
<idle>-0 [000] ..s. 218923.850582: 0: no more skb frag
Note: changing MTU to 9000 help assure that RDS get skb with
fragments.
Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
samples/bpf/Makefile | 3 +
samples/bpf/rds_skb_kern.c | 87 +++++++++++++
samples/bpf/rds_skb_user.c | 311 +++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 401 insertions(+)
create mode 100644 samples/bpf/rds_skb_kern.c
create mode 100644 samples/bpf/rds_skb_user.c
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 62a99ab..a05c3b2 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -51,6 +51,7 @@ hostprogs-y += cpustat
hostprogs-y += xdp_adjust_tail
hostprogs-y += xdpsock
hostprogs-y += xdp_fwd
+hostprogs-y += rds_skb
# Libbpf dependencies
LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
@@ -105,6 +106,7 @@ cpustat-objs := bpf_load.o cpustat_user.o
xdp_adjust_tail-objs := xdp_adjust_tail_user.o
xdpsock-objs := bpf_load.o xdpsock_user.o
xdp_fwd-objs := bpf_load.o xdp_fwd_user.o
+rds_skb-objs := bpf_load.o rds_skb_user.o
# Tell kbuild to always build the programs
always := $(hostprogs-y)
@@ -160,6 +162,7 @@ always += cpustat_kern.o
always += xdp_adjust_tail_kern.o
always += xdpsock_kern.o
always += xdp_fwd_kern.o
+always += rds_skb_kern.o
HOSTCFLAGS += -I$(objtree)/usr/include
HOSTCFLAGS += -I$(srctree)/tools/lib/
diff --git a/samples/bpf/rds_skb_kern.c b/samples/bpf/rds_skb_kern.c
new file mode 100644
index 0000000..c8832d4
--- /dev/null
+++ b/samples/bpf/rds_skb_kern.c
@@ -0,0 +1,87 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/filter.h>
+#include <linux/ptrace.h>
+#include <linux/version.h>
+#include <uapi/linux/bpf.h>
+#include <linux/rds.h>
+#include "bpf_helpers.h"
+
+
+#define PROG(F) SEC("socket/"__stringify(F)) int bpf_func_##F
+
+#define bpf_printk(fmt, ...) \
+({ \
+ char ____fmt[] = fmt; \
+ bpf_trace_printk(____fmt, sizeof(____fmt), \
+ ##__VA_ARGS__); \
+})
+
+
+struct bpf_map_def SEC("maps") jmp_table = {
+ .type = BPF_MAP_TYPE_PROG_ARRAY,
+ .key_size = sizeof(u32),
+ .value_size = sizeof(u32),
+ .max_entries = 2,
+};
+
+#define FRAG 1
+
+static inline void dump_skb(struct __sk_buff *skb)
+{
+ void *data = (void *)(long) skb->data_meta;
+ void *data_end = (void *)(long) skb->data_end;
+ unsigned char *d;
+
+ if (data + 6 > data_end)
+ return;
+
+ d = (unsigned char *)data;
+ bpf_printk("%x %x %x\n", d[0], d[1], d[2]);
+ bpf_printk("%x %x %x\n", d[3], d[4], d[5]);
+ return;
+}
+
+static void populate_skb_frags(struct __sk_buff *skb)
+{
+ int ret;
+
+ ret = bpf_next_skb_frag(skb);
+ if (ret == -ENODATA) {
+ bpf_printk("no more skb frag\n");
+ return;
+ }
+
+ bpf_tail_call(skb, &jmp_table, 1);
+}
+
+/* walk skb frag */
+
+PROG(FRAG)(struct __sk_buff *skb)
+{
+ dump_skb(skb);
+ populate_skb_frags(skb);
+ return 0;
+}
+
+SEC("socket/0")
+int main_prog(struct __sk_buff *skb)
+{
+ void *data = (void *)(long) skb->data;
+ void *data_end = (void *)(long) skb->data_end;
+ int ret;
+ unsigned char *d;
+
+ if (data + 6 > data_end) {
+ bpf_printk("out\n");
+ return 0;
+ }
+
+ d = (unsigned char *)data;
+ bpf_printk("%x %x %x\n", d[0], d[1], d[2]);
+ bpf_printk("%x %x %x\n", d[3], d[4], d[5]);
+
+ populate_skb_frags(skb);
+ return 0;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/rds_skb_user.c b/samples/bpf/rds_skb_user.c
new file mode 100644
index 0000000..9f73dc3
--- /dev/null
+++ b/samples/bpf/rds_skb_user.c
@@ -0,0 +1,311 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <arpa/inet.h>
+#include <assert.h>
+#include "bpf_load.h"
+#include <getopt.h>
+#include <errno.h>
+#include <netinet/in.h>
+#include <limits.h>
+#include <linux/sockios.h>
+#include <linux/rds.h>
+#include <linux/errqueue.h>
+#include <linux/bpf.h>
+#include <strings.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <string.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <unistd.h>
+
+#define TESTPORT 4000
+#define BUFSIZE 8192
+
+static const char *trans2str(int trans)
+{
+ switch (trans) {
+ case RDS_TRANS_TCP:
+ return ("tcp");
+ case RDS_TRANS_NONE:
+ return ("none");
+ default:
+ return ("unknown");
+ }
+}
+
+static int gettransport(int sock)
+{
+ int err;
+ char val;
+ socklen_t len = sizeof(int);
+
+ err = getsockopt(sock, SOL_RDS, SO_RDS_TRANSPORT,
+ (char *)&val, &len);
+ if (err < 0) {
+ fprintf(stderr, "%s: getsockopt %s\n",
+ __func__, strerror(errno));
+ return err;
+ }
+ return (int)val;
+}
+
+static int settransport(int sock, int transport)
+{
+ int err;
+
+ err = setsockopt(sock, SOL_RDS, SO_RDS_TRANSPORT,
+ (char *)&transport, sizeof(transport));
+ if (err < 0) {
+ fprintf(stderr, "could not set transport %s, %s\n",
+ trans2str(transport), strerror(errno));
+ }
+ return err;
+}
+
+static void print_sock_local_info(int fd, char *str, struct sockaddr_in *ret)
+{
+ socklen_t sin_size = sizeof(struct sockaddr_in);
+ struct sockaddr_in sin;
+ int err;
+
+ err = getsockname(fd, (struct sockaddr *)&sin, &sin_size);
+ if (err < 0) {
+ fprintf(stderr, "%s getsockname %s\n",
+ __func__, strerror(errno));
+ return;
+ }
+ printf("%s address: %s port %d\n",
+ (str ? str : ""), inet_ntoa(sin.sin_addr), ntohs(sin.sin_port));
+
+ if (ret != NULL)
+ *ret = sin;
+}
+
+static void server(char *address, in_port_t port)
+{
+ struct sockaddr_in sin, din;
+ struct msghdr msg;
+ struct iovec *iov;
+ int rc, sock;
+ char *buf;
+
+ buf = calloc(BUFSIZE, sizeof(char));
+ if (!buf) {
+ fprintf(stderr, "%s: calloc %s\n", __func__, strerror(errno));
+ return;
+ }
+
+ sock = socket(PF_RDS, SOCK_SEQPACKET, 0);
+ if (sock < 0) {
+ fprintf(stderr, "%s: socket %s\n", __func__, strerror(errno));
+ goto out;
+ }
+ if (settransport(sock, RDS_TRANS_TCP) < 0)
+ goto out;
+
+ printf("transport %s\n", trans2str(gettransport(sock)));
+
+ memset(&sin, 0, sizeof(sin));
+ sin.sin_family = AF_INET;
+ sin.sin_addr.s_addr = inet_addr(address);
+ sin.sin_port = htons(port);
+
+ rc = bind(sock, (struct sockaddr *)&sin, sizeof(sin));
+ if (rc < 0) {
+ fprintf(stderr, "%s: bind %s\n", __func__, strerror(errno));
+ goto out;
+ }
+
+ /* attach eBPF program */
+ assert(setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd[1],
+ sizeof(prog_fd[0])) == 0);
+
+ print_sock_local_info(sock, "server bound to", NULL);
+
+ iov = calloc(1, sizeof(struct iovec));
+ if (!iov) {
+ fprintf(stderr, "%s: calloc %s\n", __func__, strerror(errno));
+ goto out;
+ }
+
+ while (1) {
+ memset(buf, 0, BUFSIZE);
+ iov[0].iov_base = buf;
+ iov[0].iov_len = BUFSIZE;
+
+ memset(&msg, 0, sizeof(msg));
+ msg.msg_name = &din;
+ msg.msg_namelen = sizeof(din);
+ msg.msg_iov = iov;
+ msg.msg_iovlen = 1;
+
+ printf("server listening on %s\n", inet_ntoa(sin.sin_addr));
+
+ rc = recvmsg(sock, &msg, 0);
+ if (rc < 0) {
+ fprintf(stderr, "%s: recvmsg %s\n",
+ __func__, strerror(errno));
+ break;
+ }
+
+ printf("%s received a packet from %s of len %d cmsg len %d, on port %d\n",
+ inet_ntoa(sin.sin_addr),
+ inet_ntoa(din.sin_addr),
+ (uint32_t) iov[0].iov_len,
+ (uint32_t) msg.msg_controllen,
+ ntohs(din.sin_port));
+
+ {
+ int i;
+
+ printf("payload contains:");
+ for (i = 0; i < 60; i++)
+ printf("%x ", buf[i]);
+ printf("...\n");
+ }
+ }
+ free(iov);
+out:
+ free(buf);
+}
+
+static void create_message(char *buf)
+{
+ unsigned int i;
+
+ for (i = 0; i < BUFSIZE; i++) {
+ buf[i] = i + 0x30;
+ }
+}
+
+static int build_rds_packet(struct msghdr *msg, char *buf)
+{
+ struct iovec *iov;
+
+ iov = calloc(1, sizeof(struct iovec));
+ if (!iov) {
+ fprintf(stderr, "%s: calloc %s\n", __func__, strerror(errno));
+ return -1;
+ }
+
+ msg->msg_iov = iov;
+ msg->msg_iovlen = 1;
+
+ iov[0].iov_base = buf;
+ iov[0].iov_len = BUFSIZE * sizeof(char);
+
+ return 0;
+}
+
+static void client(char *localaddr, char *remoteaddr, in_port_t server_port)
+{
+ struct sockaddr_in sin, din;
+ struct msghdr msg;
+ int rc, sock;
+ char *buf;
+
+ buf = calloc(BUFSIZE, sizeof(char));
+ if (!buf) {
+ fprintf(stderr, "%s: calloc %s\n", __func__, strerror(errno));
+ return;
+ }
+
+ create_message(buf);
+
+ sock = socket(PF_RDS, SOCK_SEQPACKET, 0);
+ if (sock < 0) {
+ fprintf(stderr, "%s: socket %s\n", __func__, strerror(errno));
+ goto out;
+ }
+
+ if (settransport(sock, RDS_TRANS_TCP) < 0)
+ goto out;
+
+ printf("transport %s\n", trans2str(gettransport(sock)));
+
+ memset(&sin, 0, sizeof(sin));
+ sin.sin_family = AF_INET;
+ sin.sin_addr.s_addr = inet_addr(localaddr);
+ sin.sin_port = 0;
+
+ rc = bind(sock, (struct sockaddr *)&sin, sizeof(sin));
+ if (rc < 0) {
+ fprintf(stderr, "%s: bind %s\n", __func__, strerror(errno));
+ goto out;
+ }
+ print_sock_local_info(sock, "client bound to", &sin);
+
+ memset(&msg, 0, sizeof(msg));
+ msg.msg_name = &din;
+ msg.msg_namelen = sizeof(din);
+
+ memset(&din, 0, sizeof(din));
+ din.sin_family = AF_INET;
+ din.sin_addr.s_addr = inet_addr(remoteaddr);
+ din.sin_port = htons(server_port);
+
+ rc = build_rds_packet(&msg, buf);
+ if (rc < 0)
+ goto out;
+
+ printf("client sending %d byte message from %s to %s on port %d\n",
+ (uint32_t) msg.msg_iov->iov_len, localaddr,
+ remoteaddr, ntohs(sin.sin_port));
+
+ rc = sendmsg(sock, &msg, 0);
+ if (rc < 0)
+ fprintf(stderr, "%s: sendmsg %s\n", __func__, strerror(errno));
+
+ if (msg.msg_control)
+ free(msg.msg_control);
+ if (msg.msg_iov)
+ free(msg.msg_iov);
+out:
+ free(buf);
+
+ return;
+}
+
+static void usage(char *progname)
+{
+ fprintf(stderr, "Usage %s [-s srvaddr] [-c clientaddr]\n", progname);
+}
+
+int main(int argc, char **argv)
+{
+ in_port_t server_port = TESTPORT;
+ char *serveraddr = NULL;
+ char *clientaddr = NULL;
+ char filename[256];
+ int opt;
+
+ while ((opt = getopt(argc, argv, "s:c:")) != -1) {
+ switch (opt) {
+ case 's':
+ serveraddr = optarg;
+ break;
+ case 'c':
+ clientaddr = optarg;
+ break;
+ default:
+ usage(argv[0]);
+ return 1;
+ }
+ }
+
+ snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+ if (load_bpf_file(filename)) {
+ fprintf(stderr, "Error: load_bpf_file %s", bpf_log_buf);
+ return 1;
+ }
+
+ if (serveraddr && !clientaddr) {
+ printf("running server in a loop\n");
+ server(serveraddr, server_port);
+ } else if (serveraddr && clientaddr) {
+ client(clientaddr, serveraddr, server_port);
+ }
+
+ return 0;
+}
--
1.8.3.1
^ permalink raw reply related
* [RFC PATCH 3/3] rds: invoke sk filter attached to rds socket
From: Tushar Dave @ 2018-06-08 21:00 UTC (permalink / raw)
To: netdev, ast, daniel, davem, john.fastabend, jakub.kicinski, kafai,
rdna, quentin.monnet, brakmo, acme
In-Reply-To: <1528491607-10399-1-git-send-email-tushar.n.dave@oracle.com>
RDS module sits on top of TCP (rds_tcp) and IB (rds_rdma), so messages
arrive in form of skb (over TCP) and scatterlist (over IB/RDMA).
However, because socket filter only deal with skb (e.g. struct skb as
bpf context) we can only use socket filter for rds_tcp and not for
rds_rdma. For that reason this patch invokes socket filter only for
rds socket with tcp transport e.g. rds_tcp.
note:
BTW, we dont want rds-core to be polluted by module-specific data
structures e.g. we included tcp.h to retrieve rds_tcp specific
structures. For non-RFC version we will add a way to get transport
specific indirections to get the skb.
Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
net/rds/recv.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/net/rds/recv.c b/net/rds/recv.c
index dc67458..3be9628 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -39,6 +39,7 @@
#include <linux/rds.h>
#include "rds.h"
+#include "tcp.h"
void rds_inc_init(struct rds_incoming *inc, struct rds_connection *conn,
__be32 saddr)
@@ -369,6 +370,22 @@ void rds_recv_incoming(struct rds_connection *conn, __be32 saddr, __be32 daddr,
/* We can be racing with rds_release() which marks the socket dead. */
sk = rds_rs_to_sk(rs);
+ if (rs->rs_transport->t_type == RDS_TRANS_TCP) {
+ struct sk_buff *skb;
+ struct sk_filter *filter = sk->sk_filter;
+ struct rds_tcp_incoming *tinc;
+
+ tinc = container_of(inc, struct rds_tcp_incoming, ti_inc);
+ skb = tinc->ti_skb_list.next;
+ rcu_read_lock();
+ filter = rcu_dereference(sk->sk_filter);
+ if (filter) {
+ bpf_compute_data_pointers(skb);
+ bpf_prog_run_save_cb(filter->prog, skb);
+ }
+ rcu_read_unlock();
+ }
+
/* serialize with rds_release -> sock_orphan */
write_lock_irqsave(&rs->rs_recv_lock, flags);
if (!sock_flag(sk, SOCK_DEAD)) {
--
1.8.3.1
^ permalink raw reply related
* Re: Fw: [Bug 199995] New: Ramdomly sent TCP Reset from Kernel with bonding mode "brodcast"
From: Michal Kubecek @ 2018-06-08 21:04 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: Eric Dumazet, netdev
In-Reply-To: <20180608095954.4a0437e4@xeon-e3>
On Fri, Jun 08, 2018 at 09:59:54AM -0700, Stephen Hemminger wrote:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=199995
>
> Bug ID: 199995
> Summary: Ramdomly sent TCP Reset from Kernel with bonding mode
> "brodcast"
>
> after a dist upgrade from Ubuntu 17.10 (Kernel 4.13.x) to Ubuntu 18.04 (Kernel
> 4.15.0) I suffer from ramdomly generated TCP RST packets sent (presumably) by
> the Kernel
> on a bonding device that uses bonding mode "brodcast" with 2 physical NICs.
>
> With tcpdump/whireshark I can see that the kernel randomly sends TCP-RST
> packets after the SYN/ACK/ACK packet is received (see attached PCAP).
> This only happens if the kernel receives the initial SYN packet on both
> physical NICs (and therefore seeing it twice), before the connection is
> established by sending SYN/ACK.
> It's not happening in 100% of all cases and only, if the system can use two or
> more CPU cores/threads. With only one CPU available to the system, this
> behaviour is not reproducable.
I have seen similar report earlier from one of our customers running
SLE12 SP2 (kernel 4.4). The problem is that if duplicated SYN packet is
received on both slaves, these two copies can be processed by the
lockless listener simultaneously on different CPUs and each can reply by
SYNACK with different sequence number which results in a reset.
I tried to think of a way to prevent this race without losing the
performance gain of lockless listener but couldn't come with anything.
Eventually, I managed to persuade the customer that this setup (where
each packet is received twice under normal circumstances) is not what
broadcast mode was designed for (based on the description in
Documentation/networking/bonding.txt).
However, the lockless listener was introduced in 4.4 so it's not clear
why reporter started encountering this after an upgrade from 4.13 to
4.15.
Michal Kubecek
^ permalink raw reply
* Re: [RFC PATCH 1/3] ebpf: add next_skb_frag bpf helper for sk filter
From: Daniel Borkmann @ 2018-06-08 21:27 UTC (permalink / raw)
To: Tushar Dave, netdev, ast, davem, john.fastabend, jakub.kicinski,
kafai, rdna, quentin.monnet, brakmo, acme
In-Reply-To: <1528491607-10399-2-git-send-email-tushar.n.dave@oracle.com>
On 06/08/2018 11:00 PM, Tushar Dave wrote:
> Today socket filter only deals with linear skbs. This change allows
> ebpf programs to look into non-linear skb e.g. skb frags. This will be
> useful when users need to look into data which is not contained in the
> linear part of skb.
Hmm, I don't think this statement is correct in its form here ... they
can handle non-linear skbs just fine.
Straight forward way is to use bpf_skb_load_bytes(). It's simple and uses
internally skb_header_pointer(), and that one of course walks everything
if it really has to via skb_copy_bits() (page frags _and_ frag list). And
if you need to look into mac/net headers that may otherwise not be accessible
anymore from socket layer, there's bpf_skb_load_bytes_relative() helper
which is effectively doing the negative offset trick from ld_abs/ind more
efficient for multi-byte loads.
Thanks,
Daniel
^ permalink raw reply
* Re: Fw: [Bug 199995] New: Ramdomly sent TCP Reset from Kernel with bonding mode "brodcast"
From: Eric Dumazet @ 2018-06-08 21:38 UTC (permalink / raw)
To: Michal Kubecek, Stephen Hemminger; +Cc: Eric Dumazet, netdev
In-Reply-To: <20180608210403.2moomjshtwszvsso@unicorn.suse.cz>
On 06/08/2018 02:04 PM, Michal Kubecek wrote:
> On Fri, Jun 08, 2018 at 09:59:54AM -0700, Stephen Hemminger wrote:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=199995
>>
>> Bug ID: 199995
>> Summary: Ramdomly sent TCP Reset from Kernel with bonding mode
>> "brodcast"
>>
>> after a dist upgrade from Ubuntu 17.10 (Kernel 4.13.x) to Ubuntu 18.04 (Kernel
>> 4.15.0) I suffer from ramdomly generated TCP RST packets sent (presumably) by
>> the Kernel
>> on a bonding device that uses bonding mode "brodcast" with 2 physical NICs.
>>
>> With tcpdump/whireshark I can see that the kernel randomly sends TCP-RST
>> packets after the SYN/ACK/ACK packet is received (see attached PCAP).
>> This only happens if the kernel receives the initial SYN packet on both
>> physical NICs (and therefore seeing it twice), before the connection is
>> established by sending SYN/ACK.
>> It's not happening in 100% of all cases and only, if the system can use two or
>> more CPU cores/threads. With only one CPU available to the system, this
>> behaviour is not reproducable.
>
> I have seen similar report earlier from one of our customers running
> SLE12 SP2 (kernel 4.4). The problem is that if duplicated SYN packet is
> received on both slaves, these two copies can be processed by the
> lockless listener simultaneously on different CPUs and each can reply by
> SYNACK with different sequence number which results in a reset.
>
> I tried to think of a way to prevent this race without losing the
> performance gain of lockless listener but couldn't come with anything.
> Eventually, I managed to persuade the customer that this setup (where
> each packet is received twice under normal circumstances) is not what
> broadcast mode was designed for (based on the description in
> Documentation/networking/bonding.txt).
>
> However, the lockless listener was introduced in 4.4 so it's not clear
> why reporter started encountering this after an upgrade from 4.13 to
> 4.15.
Yes, I do not buy this at all.
If two identical SYN are received by two cpus, we should create one SYN_RECV and send
two SYNACK.
But it is a bit hard to test this :/
I will take a look, thanks.
^ permalink raw reply
* Re: [PATCH v2] net-fq: Add WARN_ON check for null flow.
From: Arend van Spriel @ 2018-06-08 21:40 UTC (permalink / raw)
To: Ben Greear, Michał Kazior
Cc: Cong Wang, Linux Kernel Network Developers,
linux-wireless-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <1f11144f-7580-03f4-72bd-76b0907d7ed1-my8/4N5VtI7c+919tysfdA@public.gmane.org>
On 6/8/2018 5:17 PM, Ben Greear wrote:
I recalled an email from Michał leaving tieto so adding his alternate
email he provided back then.
Gr. AvS
> On 06/07/2018 04:59 PM, Cong Wang wrote:
>> On Thu, Jun 7, 2018 at 4:48 PM, <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote:
>>> diff --git a/include/net/fq_impl.h b/include/net/fq_impl.h
>>> index be7c0fa..cb911f0 100644
>>> --- a/include/net/fq_impl.h
>>> +++ b/include/net/fq_impl.h
>>> @@ -78,7 +78,10 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq,
>>> return NULL;
>>> }
>>>
>>> - flow = list_first_entry(head, struct fq_flow, flowchain);
>>> + flow = list_first_entry_or_null(head, struct fq_flow,
>>> flowchain);
>>> +
>>> + if (WARN_ON_ONCE(!flow))
>>> + return NULL;
>>
>> This does not make sense either. list_first_entry_or_null()
>> returns NULL only when the list is empty, but we already check
>> list_empty() right before this code, and it is protected by fq->lock.
>>
>
> Hello Michal,
>
> git blame shows you as the author of the fq_impl.h code.
>
> I saw a crash when debugging funky ath10k firmware in a 4.16 + hacks
> kernel. There was an apparent
> mostly-null deref in the fq_tin_dequeue method. According to gdb, it
> was within
> 1 line of the dereference of 'flow'.
>
> My hack above is probably not that useful. Cong thinks maybe the
> locking is bad.
>
> If you get a chance, please review this thread and see if you have any
> ideas for
> a better fix (or better debugging code).
>
> As always, if you would like me to generate you a buggy firmware that
> will crash
> in the tx path and cause all sorts of mayhem in the ath10k driver and
> wifi stack,
> I will be happy to do so.
>
> https://www.mail-archive.com/netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org/msg239738.html
>
> Thanks,
> Ben
>
^ permalink raw reply
* Re: [RFC PATCH 1/3] ebpf: add next_skb_frag bpf helper for sk filter
From: Tushar Dave @ 2018-06-08 21:46 UTC (permalink / raw)
To: Daniel Borkmann, netdev, ast, davem, john.fastabend,
jakub.kicinski, kafai, rdna, quentin.monnet, brakmo, acme
In-Reply-To: <9588eb72-f1d5-f6ce-b2a3-aefb431e70d5@iogearbox.net>
On 06/08/2018 02:27 PM, Daniel Borkmann wrote:
> On 06/08/2018 11:00 PM, Tushar Dave wrote:
>> Today socket filter only deals with linear skbs. This change allows
>> ebpf programs to look into non-linear skb e.g. skb frags. This will be
>> useful when users need to look into data which is not contained in the
>> linear part of skb.
>
> Hmm, I don't think this statement is correct in its form here ... they
> can handle non-linear skbs just fine.
Thanks Daniel for your reply.
>
> Straight forward way is to use bpf_skb_load_bytes(). It's simple and uses
> internally skb_header_pointer(), and that one of course walks everything
> if it really has to via skb_copy_bits() (page frags _and_ frag list). And
> if you need to look into mac/net headers that may otherwise not be accessible
> anymore from socket layer, there's bpf_skb_load_bytes_relative() helper
> which is effectively doing the negative offset trick from ld_abs/ind more
> efficient for multi-byte loads.
I'm looking into bpf_skb_load_bytes and friends.
Thanks.
-Tushar
>
> Thanks,
> Daniel
>
^ permalink raw reply
* Re: Fw: [Bug 199995] New: Ramdomly sent TCP Reset from Kernel with bonding mode "brodcast"
From: Eric Dumazet @ 2018-06-08 21:53 UTC (permalink / raw)
To: Eric Dumazet, Michal Kubecek, Stephen Hemminger; +Cc: netdev
In-Reply-To: <3cbd2c1f-4e03-1cb1-3731-4ce440778bb8@gmail.com>
On 06/08/2018 02:38 PM, Eric Dumazet wrote:
>
>
> On 06/08/2018 02:04 PM, Michal Kubecek wrote:
>>
>> However, the lockless listener was introduced in 4.4 so it's not clear
>> why reporter started encountering this after an upgrade from 4.13 to
>> 4.15.
>
> Yes, I do not buy this at all.
>
> If two identical SYN are received by two cpus, we should create one SYN_RECV and send
> two SYNACK.
>
> But it is a bit hard to test this :/
>
> I will take a look, thanks.
Oh well, this is not done as I thought, this needs a fix, I will work on this.
reqsk_queue_hash_req() calls inet_ehash_insert() without making sure that the same 4-tuple
is not already there.
Do not worry, we will keep the listener lockless :)
^ permalink raw reply
* Re: [RFC PATCH 1/3] ebpf: add next_skb_frag bpf helper for sk filter
From: Tushar Dave @ 2018-06-08 22:24 UTC (permalink / raw)
To: Daniel Borkmann, netdev, ast, davem, john.fastabend,
jakub.kicinski, kafai, rdna, quentin.monnet, brakmo, acme
In-Reply-To: <39186936-9af3-f609-7b2a-26c908558a5a@oracle.com>
On 06/08/2018 02:46 PM, Tushar Dave wrote:
>
>
> On 06/08/2018 02:27 PM, Daniel Borkmann wrote:
>> On 06/08/2018 11:00 PM, Tushar Dave wrote:
>>> Today socket filter only deals with linear skbs. This change allows
>>> ebpf programs to look into non-linear skb e.g. skb frags. This will be
>>> useful when users need to look into data which is not contained in the
>>> linear part of skb.
>>
>> Hmm, I don't think this statement is correct in its form here ... they
>> can handle non-linear skbs just fine.
> Thanks Daniel for your reply.
>>
>> Straight forward way is to use bpf_skb_load_bytes(). It's simple and uses
>> internally skb_header_pointer(), and that one of course walks everything
>> if it really has to via skb_copy_bits() (page frags _and_ frag list). And
>> if you need to look into mac/net headers that may otherwise not be
>> accessible
>> anymore from socket layer, there's bpf_skb_load_bytes_relative() helper
>> which is effectively doing the negative offset trick from ld_abs/ind more
>> efficient for multi-byte loads.
> I'm looking into bpf_skb_load_bytes and friends.
Daniel,
While I am trying to see if I can use exiting bpf_skb_load helpers, I am
wondering socket filter based ebpf program are allowed to change packet
data? In other words, can we use them to build firewall?
Thanks.
-Tushar
>
> Thanks.
> -Tushar
>>
>> Thanks,
>> Daniel
>>
>
^ permalink raw reply
* Re: [PATCH net] failover: eliminate callback hell
From: Siwei Liu @ 2018-06-08 22:25 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Michael S. Tsirkin, Jiri Pirko, kys, haiyangz, David Miller,
Samudrala, Sridhar, Netdev, Stephen Hemminger
In-Reply-To: <20180606142447.3c5072d8@xeon-e3>
On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Wed, 6 Jun 2018 15:30:27 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
>> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
>> > Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:
>> > >The net failover should be a simple library, not a virtual
>> > >object with function callbacks (see callback hell).
>> >
>> > Why just a library? It should do a common things. I think it should be a
>> > virtual object. Looks like your patch again splits the common
>> > functionality into multiple drivers. That is kind of backwards attitude.
>> > I don't get it. We should rather focus on fixing the mess the
>> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
>> > model.
>>
>> So it seems that at least one benefit for netvsc would be better
>> handling of renames.
>>
>> Question is how can this change to 3-netdev happen? Stephen is
>> concerned about risk of breaking some userspace.
>>
>> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
>> address, and you said then "why not use existing network namespaces
>> rather than inventing a new abstraction". So how about it then? Do you
>> want to find a way to use namespaces to hide the PV device for netvsc
>> compatibility?
>>
>
> Netvsc can't work with 3 dev model. MS has worked with enough distro's and
> startups that all demand eth0 always be present. And VF may come and go.
> After this history, there is a strong motivation not to change how kernel
> behaves. Switching to 3 device model would be perceived as breaking
> existing userspace.
>
> With virtio you can work it out with the distro's yourself.
> There is no pre-existing semantics to deal with.
>
> For the virtio, I don't see the need for IFF_HIDDEN.
I have a somewhat different view regarding IFF_HIDDEN. The purpose of
that flag, as well as the 1-netdev model, is to have a means to
inherit the interface name from the VF, and to eliminate playing hacks
around renaming devices, customizing udev rules and et al. Why
inheriting VF's name important? To allow existing config/setup around
VF continues to work across kernel feature upgrade. Most of network
config files in all distros are based on interface names. Few are MAC
address based but making lower slaves hidden would cover the rest. And
most importantly, preserving the same level of user experience as
using raw VF interface once getting all ndo_ops and ethtool_ops
exposed. This is essential to realize transparent live migration that
users dont have to learn and be aware of the undertaken.
It's unfair to say all virtio use cases don't need IFF_HIDDEN. A few
use cases don't care about getting slaves exposed, the 3-netdev model
would work for them. For the rest, the pre-existing semantics to them
is the single VF device model they've already dealt with. This is
nothing different than having Azure stick to the 2-netdev model
because of existing user base IMHO.
-Siwei
> With 3-dev model as long as you mark the PV and VF devices
> as slaves, then userspace knows to leave them alone. Assuming userspace
> is already able to deal with team and bond devices.
> Any time you introduce new UAPI behavior something breaks.
>
> On the rename front, I really don't care if VF can be renamed. And for
> netvsc want to allow the PV device to be renamed. Udev developers want that
> but have not found a stable/persistent value to expose to userspace
> to allow it.
^ permalink raw reply
* Re: [PATCH net] net: aquantia: fix unsigned numvecs comparison with less than zero
From: David Miller @ 2018-06-08 22:46 UTC (permalink / raw)
To: igor.russkikh; +Cc: netdev, darcari, pavel.belous, colin.king
In-Reply-To: <b3f15fb11d16929c60728800bfaae4fdd36f1406.1528407764.git.igor.russkikh@aquantia.com>
From: Igor Russkikh <igor.russkikh@aquantia.com>
Date: Thu, 7 Jun 2018 17:54:37 -0400
> From: Colin Ian King <colin.king@canonical.com>
>
> From: Colin Ian King <colin.king@canonical.com>
>
> This was originally mistakenly submitted to net-next. Resubmitting to net.
>
> The comparison of numvecs < 0 is always false because numvecs is a u32
> and hence the error return from a failed call to pci_alloc_irq_vectores
> is never detected. Fix this by using the signed int ret to handle the
> error return and assign numvecs to err.
>
> Detected by CoverityScan, CID#1468650 ("Unsigned compared against 0")
>
> Fixes: a09bd81b5413 ("net: aquantia: Limit number of vectors to actually allocated irqs")
> Signed-off-by: Colin Ian King <colin.king@canonical.com>
> Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Applied and queued up for -stable, thanks.
^ permalink raw reply
* Re: [PATCH v2 net] net: fddi: fix a possible null-ptr-deref
From: David Miller @ 2018-06-08 22:48 UTC (permalink / raw)
To: yuehaibing; +Cc: netdev, linux-kernel
In-Reply-To: <20180608025825.25716-1-yuehaibing@huawei.com>
From: YueHaibing <yuehaibing@huawei.com>
Date: Fri, 8 Jun 2018 10:58:25 +0800
> bp->SharedMemAddr is set to NULL while bp->SharedMemSize lesser-or-equal 0,
> then memset will trigger null-ptr-deref.
>
> fix it by replacing pci_alloc_consistent with dma_zalloc_coherent.
>
> Signed-off-by: YueHaibing <yuehaibing@huawei.com>
> ---
> v1->v2: move from pci_dma* to dma_* as Christoph suggested
Applied.
^ permalink raw reply
* Re: [PATCH net v2] net/sched: act_simple: fix parsing of TCA_DEF_DATA
From: David Miller @ 2018-06-08 22:50 UTC (permalink / raw)
To: dcaratti; +Cc: jhs, xiyou.wangcong, jiri, netdev
In-Reply-To: <473a0039fa5c28284ebb5928eb894674cc65b925.1528426801.git.dcaratti@redhat.com>
From: Davide Caratti <dcaratti@redhat.com>
Date: Fri, 8 Jun 2018 05:02:31 +0200
> use nla_strlcpy() to avoid copying data beyond the length of TCA_DEF_DATA
> netlink attribute, in case it is less than SIMP_MAX_DATA and it does not
> end with '\0' character.
>
> v2: fix errors in the commit message, thanks Hangbin Liu
>
> Fixes: fa1b1cff3d06 ("net_cls_act: Make act_simple use of netlink policy.")
> Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Applied and queued up for -stable.
^ permalink raw reply
* Re: [PATCH net] failover: eliminate callback hell
From: Siwei Liu @ 2018-06-08 22:54 UTC (permalink / raw)
To: Samudrala, Sridhar
Cc: Stephen Hemminger, Michael S. Tsirkin, Jiri Pirko, kys, haiyangz,
David Miller, Netdev, Stephen Hemminger
In-Reply-To: <e20a6cdf-34b4-cbc9-1dc9-75c436d6c2fe@intel.com>
On Wed, Jun 6, 2018 at 2:54 PM, Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
>
>
> On 6/6/2018 2:24 PM, Stephen Hemminger wrote:
>>
>> On Wed, 6 Jun 2018 15:30:27 +0300
>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>
>>> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
>>>>
>>>> Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:
>>>>>
>>>>> The net failover should be a simple library, not a virtual
>>>>> object with function callbacks (see callback hell).
>>>>
>>>> Why just a library? It should do a common things. I think it should be a
>>>> virtual object. Looks like your patch again splits the common
>>>> functionality into multiple drivers. That is kind of backwards attitude.
>>>> I don't get it. We should rather focus on fixing the mess the
>>>> introduction of netvsc-bonding caused and switch netvsc to 3-netdev
>>>> model.
>>>
>>> So it seems that at least one benefit for netvsc would be better
>>> handling of renames.
>>>
>>> Question is how can this change to 3-netdev happen? Stephen is
>>> concerned about risk of breaking some userspace.
>>>
>>> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
>>> address, and you said then "why not use existing network namespaces
>>> rather than inventing a new abstraction". So how about it then? Do you
>>> want to find a way to use namespaces to hide the PV device for netvsc
>>> compatibility?
>>>
>> Netvsc can't work with 3 dev model. MS has worked with enough distro's and
>> startups that all demand eth0 always be present. And VF may come and go.
>> After this history, there is a strong motivation not to change how kernel
>> behaves. Switching to 3 device model would be perceived as breaking
>> existing userspace.
>
>
> I think it should be possible for netvsc to work with 3 dev model if the
> only
> requirement is that eth0 will always be present. With net_failover, you will
> see eth0 and eth0nsby OR with older distros eth0 and eth1. It may be an
> issue
> if somehow there is userspace requirement that there can be only 2 netdevs,
> not 3
> when VF is plugged.
>
> eth0 will be the net_failover device and eth0nsby/eth1 will be the netvsc
> device
> and the IP address gets configured on eth0. Will this be an issue?
>
Did you realize that the eth0 name in the current 3-netdev code can't
be consistently persisted across reboot, if you have more than one VFs
to pair with? On one boot it got eth0/eth0nsby, on the next it may get
eth1/eth1nsby on the same interface.
It won't be useable by default until you add some custom udev rules.
-Siwei
>
>
>>
>> With virtio you can work it out with the distro's yourself.
>> There is no pre-existing semantics to deal with.
>>
>> For the virtio, I don't see the need for IFF_HIDDEN.
>> With 3-dev model as long as you mark the PV and VF devices
>> as slaves, then userspace knows to leave them alone. Assuming userspace
>> is already able to deal with team and bond devices.
>> Any time you introduce new UAPI behavior something breaks.
>>
>> On the rename front, I really don't care if VF can be renamed. And for
>> netvsc want to allow the PV device to be renamed. Udev developers want
>> that
>> but have not found a stable/persistent value to expose to userspace
>> to allow it.
>
>
^ permalink raw reply
* Re: [PATCH net] failover: eliminate callback hell
From: Stephen Hemminger @ 2018-06-08 23:18 UTC (permalink / raw)
To: Siwei Liu
Cc: Michael S. Tsirkin, Jiri Pirko, kys, haiyangz, David Miller,
Samudrala, Sridhar, Netdev, Stephen Hemminger
In-Reply-To: <CADGSJ201TaacZhRTgQZC+a3c5fu8JkHcGifGTfbFQzyocCJxVg@mail.gmail.com>
On Fri, 8 Jun 2018 15:25:59 -0700
Siwei Liu <loseweigh@gmail.com> wrote:
> On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger
> <stephen@networkplumber.org> wrote:
> > On Wed, 6 Jun 2018 15:30:27 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >
> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
> >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:
> >> > >The net failover should be a simple library, not a virtual
> >> > >object with function callbacks (see callback hell).
> >> >
> >> > Why just a library? It should do a common things. I think it should be a
> >> > virtual object. Looks like your patch again splits the common
> >> > functionality into multiple drivers. That is kind of backwards attitude.
> >> > I don't get it. We should rather focus on fixing the mess the
> >> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> >> > model.
> >>
> >> So it seems that at least one benefit for netvsc would be better
> >> handling of renames.
> >>
> >> Question is how can this change to 3-netdev happen? Stephen is
> >> concerned about risk of breaking some userspace.
> >>
> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> >> address, and you said then "why not use existing network namespaces
> >> rather than inventing a new abstraction". So how about it then? Do you
> >> want to find a way to use namespaces to hide the PV device for netvsc
> >> compatibility?
> >>
> >
> > Netvsc can't work with 3 dev model. MS has worked with enough distro's and
> > startups that all demand eth0 always be present. And VF may come and go.
> > After this history, there is a strong motivation not to change how kernel
> > behaves. Switching to 3 device model would be perceived as breaking
> > existing userspace.
> >
> > With virtio you can work it out with the distro's yourself.
> > There is no pre-existing semantics to deal with.
> >
> > For the virtio, I don't see the need for IFF_HIDDEN.
>
> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
> that flag, as well as the 1-netdev model, is to have a means to
> inherit the interface name from the VF, and to eliminate playing hacks
> around renaming devices, customizing udev rules and et al. Why
> inheriting VF's name important? To allow existing config/setup around
> VF continues to work across kernel feature upgrade. Most of network
> config files in all distros are based on interface names. Few are MAC
> address based but making lower slaves hidden would cover the rest. And
> most importantly, preserving the same level of user experience as
> using raw VF interface once getting all ndo_ops and ethtool_ops
> exposed. This is essential to realize transparent live migration that
> users dont have to learn and be aware of the undertaken.
Inheriting the VF name will fail in the migration scenario.
It is perfectly reasonable to migrate a guest to another machine where
the VF PCI address is different. And since current udev/systemd model
is to base network device name off of PCI address, the device will change
name when guest is migrated.
On Azure, the VF maybe removed (by host) at any time and then later
reattached. There is no guarantee that VF will show back up at
the same synthetic PCI address. It will likely have a different
PCI domain value.
^ permalink raw reply
* Re: linux-next: Please add mlx5-next tree
From: Stephen Hemminger @ 2018-06-08 23:19 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Stephen Rothwell, Linux-Next Mailing List,
Linux Kernel Mailing List, Saeed Mahameed, Jason Gunthorpe,
Doug Ledford, David S. Miller, linux-netdev, RDMA mailing list
In-Reply-To: <20180606192500.GZ2958@mtr-leonro.mtl.com>
[-- Attachment #1: Type: text/plain, Size: 579 bytes --]
On Wed, 6 Jun 2018 22:25:00 +0300
Leon Romanovsky <leon@kernel.org> wrote:
> Hi Stephen,
>
> Can you please add the branch "mlx5-next" from the following tree to the
> linux-next integration tree?
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux.git/
>
> The mlx5-next branch is used to send shared pull requests to both netdev
> and RDMA subsystems and it is worth to have linux-next coverage on it
> before.
>
> Thanks
I would hope all network drivers keep getting review in netdev.
Don't want a path to upstream that bypasses review.
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply
* Re: [PATCH net] failover: eliminate callback hell
From: Siwei Liu @ 2018-06-08 23:44 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Michael S. Tsirkin, Jiri Pirko, kys, haiyangz, David Miller,
Samudrala, Sridhar, Netdev, Stephen Hemminger
In-Reply-To: <20180608161801.64afda65@xeon-e3>
On Fri, Jun 8, 2018 at 4:18 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Fri, 8 Jun 2018 15:25:59 -0700
> Siwei Liu <loseweigh@gmail.com> wrote:
>
>> On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger
>> <stephen@networkplumber.org> wrote:
>> > On Wed, 6 Jun 2018 15:30:27 +0300
>> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> >
>> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
>> >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:
>> >> > >The net failover should be a simple library, not a virtual
>> >> > >object with function callbacks (see callback hell).
>> >> >
>> >> > Why just a library? It should do a common things. I think it should be a
>> >> > virtual object. Looks like your patch again splits the common
>> >> > functionality into multiple drivers. That is kind of backwards attitude.
>> >> > I don't get it. We should rather focus on fixing the mess the
>> >> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
>> >> > model.
>> >>
>> >> So it seems that at least one benefit for netvsc would be better
>> >> handling of renames.
>> >>
>> >> Question is how can this change to 3-netdev happen? Stephen is
>> >> concerned about risk of breaking some userspace.
>> >>
>> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
>> >> address, and you said then "why not use existing network namespaces
>> >> rather than inventing a new abstraction". So how about it then? Do you
>> >> want to find a way to use namespaces to hide the PV device for netvsc
>> >> compatibility?
>> >>
>> >
>> > Netvsc can't work with 3 dev model. MS has worked with enough distro's and
>> > startups that all demand eth0 always be present. And VF may come and go.
>> > After this history, there is a strong motivation not to change how kernel
>> > behaves. Switching to 3 device model would be perceived as breaking
>> > existing userspace.
>> >
>> > With virtio you can work it out with the distro's yourself.
>> > There is no pre-existing semantics to deal with.
>> >
>> > For the virtio, I don't see the need for IFF_HIDDEN.
>>
>> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
>> that flag, as well as the 1-netdev model, is to have a means to
>> inherit the interface name from the VF, and to eliminate playing hacks
>> around renaming devices, customizing udev rules and et al. Why
>> inheriting VF's name important? To allow existing config/setup around
>> VF continues to work across kernel feature upgrade. Most of network
>> config files in all distros are based on interface names. Few are MAC
>> address based but making lower slaves hidden would cover the rest. And
>> most importantly, preserving the same level of user experience as
>> using raw VF interface once getting all ndo_ops and ethtool_ops
>> exposed. This is essential to realize transparent live migration that
>> users dont have to learn and be aware of the undertaken.
>
> Inheriting the VF name will fail in the migration scenario.
> It is perfectly reasonable to migrate a guest to another machine where
> the VF PCI address is different. And since current udev/systemd model
> is to base network device name off of PCI address, the device will change
> name when guest is migrated.
>
The scenario of having VF on a different PCI address on post migration
is essentially equal to plugging in a new NIC. Why it has to pair with
the original PV? A sepearte PV device should be in place to pair the
new VF.
> On Azure, the VF maybe removed (by host) at any time and then later
> reattached. There is no guarantee that VF will show back up at
> the same synthetic PCI address. It will likely have a different
> PCI domain value.
This is something QEMU can do and make sure the PCI address is
consistent after migration.
-Siwei
^ permalink raw reply
* Re: [PATCH net,stable] cdc_ncm: avoid padding beyond end of skb
From: David Miller @ 2018-06-08 23:50 UTC (permalink / raw)
To: bjorn; +Cc: netdev, linux-usb, bodqhrohro, dennis.wassenberg, mrkiko.rs
In-Reply-To: <20180608071524.11528-1-bjorn@mork.no>
From: Bjørn Mork <bjorn@mork.no>
Date: Fri, 8 Jun 2018 09:15:24 +0200
> Commit 4a0e3e989d66 ("cdc_ncm: Add support for moving NDP to end
> of NCM frame") added logic to reserve space for the NDP at the
> end of the NTB/skb. This reservation did not take the final
> alignment of the NDP into account, causing us to reserve too
> little space. Additionally the padding prior to NDP addition did
> not ensure there was enough space for the NDP.
>
> The NTB/skb with the NDP appended would then exceed the configured
> max size. This caused the final padding of the NTB to use a
> negative count, padding to almost INT_MAX, and resulting in:
...
> Commit e1069bbfcf3b ("net: cdc_ncm: Reduce memory use when kernel
> memory low") made this bug much more likely to trigger by reducing
> the NTB size under memory pressure.
>
> Link: https://bugs.debian.org/893393
> Reported-by: Горбешко Богдан <bodqhrohro@gmail.com>
> Reported-and-tested-by: Dennis Wassenberg <dennis.wassenberg@secunet.com>
> Cc: Enrico Mioso <mrkiko.rs@gmail.com>
> Fixes: 4a0e3e989d66 ("cdc_ncm: Add support for moving NDP to end of NCM frame")
> Signed-off-by: Bjørn Mork <bjorn@mork.no>
> ---
> Big thanks to Dennis for the observation that this crash depended on
> FLAG_SEND_ZLP not being set. This made it possible to pinpoint where
> the problem was.
Applied and queued up for -stable.
^ permalink raw reply
* Re: [PATCH net] udp: fix rx queue len reported by diag and proc interface
From: David Miller @ 2018-06-08 23:56 UTC (permalink / raw)
To: pabeni; +Cc: netdev, edumazet, trevor.francis
In-Reply-To: <6189b60449cb3ab524db8e91acd6be8d01ba887c.1528450183.git.pabeni@redhat.com>
From: Paolo Abeni <pabeni@redhat.com>
Date: Fri, 8 Jun 2018 11:35:40 +0200
> After commit 6b229cf77d68 ("udp: add batching to udp_rmem_release()")
> the sk_rmem_alloc field does not measure exactly anymore the
> receive queue length, because we batch the rmem release. The issue
> is really apparent only after commit 0d4a6608f68c ("udp: do rmem bulk
> free even if the rx sk queue is empty"): the user space can easily
> check for an empty socket with not-0 queue length reported by the 'ss'
> tool or the procfs interface.
>
> We need to use a custom UDP helper to report the correct queue length,
> taking into account the forward allocation deficit.
>
> Reported-by: trevor.francis@46labs.com
> Fixes: 6b229cf77d68 ("UDP: add batching to udp_rmem_release()")
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Applied and queued up for -stable, thanks.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox