Netdev List


* Re: [patch iproute2 v3 3/4] tc: Add -bs option to batch mode
From: Marcelo Ricardo Leitner @ 2017-12-27 19:56 UTC (permalink / raw)
  To: Chris Mi; +Cc: netdev, gerlitz.or, stephen, dsahern
In-Reply-To: <20171225084658.24076-4-chrism@mellanox.com>

On Mon, Dec 25, 2017 at 05:46:57PM +0900, Chris Mi wrote:
> @@ -267,6 +287,7 @@ int main(int argc, char **argv)
>  {
>  	int ret;
>  	char *batch_file = NULL;
> +	int batch_size = 1;
>  
>  	while (argc > 1) {
>  		if (argv[1][0] != '-')
> @@ -297,6 +318,14 @@ int main(int argc, char **argv)
>  			if (argc <= 1)
>  				usage();
>  			batch_file = argv[1];
> +		} else if (matches(argv[1], "-batchsize") == 0 ||
> +				matches(argv[1], "-bs") == 0) {
> +			argc--;	argv++;
> +			if (argc <= 1)
> +				usage();
> +			batch_size = atoi(argv[1]);
> +			if (batch_size > MSG_IOV_MAX)
> +				batch_size = MSG_IOV_MAX;

what about
if (batch_size < 1)
	batch_size = 1;

>  		} else if (matches(argv[1], "-netns") == 0) {
>  			NEXT_ARG();
>  			if (netns_switch(argv[1]))
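
A minimal sketch of the two-sided clamp suggested above, for illustration only. EXAMPLE_BATCH_MAX is a made-up placeholder for MSG_IOV_MAX (its value is not shown in this thread), and clamp_batch_size() and parse_batch_size() are hypothetical helper names, not part of the patch.

```c
#include <stdlib.h>

/* Placeholder upper bound; the patch itself uses MSG_IOV_MAX. */
#define EXAMPLE_BATCH_MAX 1024

/* Force the requested batch size into the range [1, EXAMPLE_BATCH_MAX]. */
static int clamp_batch_size(int requested)
{
	if (requested < 1)
		return 1;
	if (requested > EXAMPLE_BATCH_MAX)
		return EXAMPLE_BATCH_MAX;
	return requested;
}

/* Parse the option argument the way the patch does (atoi), then clamp. */
static int parse_batch_size(const char *arg)
{
	return clamp_batch_size(atoi(arg));
}
```

With the lower bound in place, a zero, negative, or non-numeric argument falls back to a batch size of 1, which matches the default the patch initializes.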

^ permalink raw reply

* Re: [patch net-next v2 00/10] Add support for resource abstraction
From: Roopa Prabhu @ 2017-12-27 19:43 UTC (permalink / raw)
  To: David Ahern
  Cc: Jiri Pirko, netdev, David Miller, Arkadi Sharshevsky, mlxsw,
	Andrew Lunn, Vivien Didelot, Florian Fainelli, Michael Chan,
	ganeshgr, Saeed Mahameed, matanb, leonro, Ido Schimmel,
	jakub.kicinski, ast, Daniel Borkmann, Simon Horman,
	pieter.jansenvanvuuren, john.hurley, Alexander Duyck,
	John W. Linville, Andy Gospodarek <gospo@
In-Reply-To: <ae70d810-8277-899b-b2a9-6b2dbdd5eb21@cumulusnetworks.com>

On Wed, Dec 27, 2017 at 8:34 AM, David Ahern <dsa@cumulusnetworks.com> wrote:
> On 12/27/17 2:09 AM, Jiri Pirko wrote:
>> Wed, Dec 27, 2017 at 05:05:09AM CET, dsa@cumulusnetworks.com wrote:
>>> On 12/26/17 5:23 AM, Jiri Pirko wrote:
>>>> From: Jiri Pirko <jiri@mellanox.com>
>>>>
>>>> Many of the ASIC's internal resources are limited and are shared between
>>>> several hardware procedures. For example, unified hash-based memory can
>>>> be used for many lookup purposes, like FDB and LPM. In many cases the user
>>>> can provide a partitioning scheme for such a resource in order to perform
>>>> fine tuning for his application. In such cases performing driver reload is
>>>> needed for the changes to take place, thus this patchset also adds support
>>>> for hot reload.
>>>>
>>>> Such an abstraction can be coupled with devlink's dpipe interface, which
>>>> models the ASIC's pipeline as a graph of match/action tables. By modeling
>>>> the hardware resource object, and by coupling it to several dpipe tables,
>>>> further visibility can be achieved in order to debug ASIC-wide issues.
>>>>
>>>> The proposed interface will provide the user the ability to understand the
>>>> limitations of the hardware, and receive notification regarding its occupancy.
>>>> Furthermore, monitoring the resource occupancy can be done in real-time and
>>>> can be useful in many cases.
>>>
>>> In the last RFC (not v1, but RFC) I asked for some kind of description
>>> for each resource, and you and Arkadi have pushed back. Let's walk
>>> through an example to see what I mean:
>>>
>>> $ devlink resource show pci/0000:03:00.0
>>> pci/0000:03:00.0:
>>>  name kvd size 245760 size_valid true
>>>  resources:
>>>    name linear size 98304 occ 0
>>>    name hash_double size 60416
>>>    name hash_single size 87040
>>>
>>> So this 2700 has 3 resources that can be managed -- some table or
>>> resource or something named 'kvd' with linear, hash_double and
>>> hash_single sub-resources. What are these names referring to? The above
>>> output gives no description, and 'kvd' is not an industry term. Further,
>>
>> These are internal resources specific to the ASIC. Would you like some
>> description for each or something like that?
>
> devlink has some nice self-documenting capabilities. What's missing here
> is a description of what the resource is used for in standard terms --
> ipv4 host routes, fdb, nexthops, rifs, etc. Even if the description is a
> short list versus an exhaustive list of everything it is used for. e.g.,
> Why would a user decrease linear and increase hash_single or vice versa?


Arkadi, on what David says above, can the resource names and ids not
be driver specific, but moved up to the switchdev layer and just map
to fdb, host routes, nexthops table sizes, etc.?
Can these generic networking resources then in turn be mapped to kvd
sizes by the driver?

^ permalink raw reply

* Re: [patch net-next v2 00/10] Add support for resource abstraction
From: Andrew Lunn @ 2017-12-27 19:31 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: David Ahern, netdev, davem, arkadis, mlxsw, vivien.didelot,
	f.fainelli, michael.chan, ganeshgr, saeedm, matanb, leonro,
	idosch, jakub.kicinski, ast, daniel, simon.horman,
	pieter.jansenvanvuuren, john.hurley, alexander.h.duyck, linville,
	gospo, steven.lin1, yuvalm, ogerlitz, roopa
In-Reply-To: <20171227131531.GE1997@nanopsycho>

> Hmm. That documents mainly sysfs. No mention of Netlink at all. But
> maybe I missed it. Also, that defines the interface as is. However we
> are talking about the data exchanged over the interface, not the
> interface itself. I don't see how an ASIC/HW-specific thing, like for
> example KVD in our case could be part of kernel ABI.

You need to be very careful here. As soon as somebody starts using it,
it might become an ABI. Or you need to clearly document it is not ABI,
there is no guarantee it will not disappear or change its meaning in
the next kernel, and it should be used with extreme caution.

Personally, if DSA drivers were to expose such settings, i would
consider them ABI, which people can rely on to remain stable.

	Andrew

^ permalink raw reply

* Re: [PATCH net] sfp: fix sfp-bus oops when removing socket/upstream
From: Florian Fainelli @ 2017-12-27 19:29 UTC (permalink / raw)
  To: Russell King, Andrew Lunn; +Cc: netdev
In-Reply-To: <E1eTyRJ-0007ZJ-IS@rmk-PC.armlinux.org.uk>



On 12/26/2017 03:15 PM, Russell King wrote:
> When we remove a socket or upstream, and the other side isn't
> registered, we dereference a NULL pointer, causing a kernel oops.
> Fix this.
> 
> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Fixes: ce0aa27ff3f6 ("sfp: add sfp-bus to bridge between network devices and sfp cages")
-- 
Florian

^ permalink raw reply

* Re: [PATCH net] phylink: ensure we report link down when LOS asserted
From: Florian Fainelli @ 2017-12-27 19:27 UTC (permalink / raw)
  To: Russell King, Andrew Lunn; +Cc: netdev
In-Reply-To: <E1eTyRE-0007ZC-Ap@rmk-PC.armlinux.org.uk>



On 12/26/2017 03:15 PM, Russell King wrote:
> Although we disable the netdev carrier, we fail to report in the kernel
> log that the link went down.  Fix this.
> 
> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Fixes: 9525ae83959b ("phylink: add phylink infrastructure")
-- 
Florian

^ permalink raw reply

* Re: WARNING in strp_data_ready
From: Dmitry Vyukov @ 2017-12-27 19:20 UTC (permalink / raw)
  To: Tom Herbert
  Cc: John Fastabend, syzbot, David S. Miller, Eric Biggers, LKML,
	Linux Kernel Network Developers, syzkaller-bugs, Tom Herbert,
	Cong Wang
In-Reply-To: <CALx6S37RTEzd5pABpULPfrsoe-Huj0GZtQpOUfG=tT9dn5wL_A@mail.gmail.com>

On Wed, Dec 27, 2017 at 8:09 PM, Tom Herbert <tom@herbertland.com> wrote:
> Did you try the patch I posted?

Hi Tom,

No. And I didn't know I needed to. Why?
If you think the patch needs additional testing, you can ask syzbot to
test it. See https://github.com/google/syzkaller/blob/master/docs/syzbot.md#communication-with-syzbot
Otherwise proceed with committing it. Or what are we waiting for?

Thanks



> On Wed, Dec 27, 2017 at 10:25 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
>> On Wed, Dec 6, 2017 at 4:44 PM, Dmitry Vyukov <dvyukov@google.com> wrote:
>>>> <john.fastabend@gmail.com> wrote:
>>>>> On 10/24/2017 08:20 AM, syzbot wrote:
>>>>>> Hello,
>>>>>>
>>>>>> syzkaller hit the following crash on 73d3393ada4f70fa3df5639c8d438f2f034c0ecb
>>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
>>>>>> compiler: gcc (GCC) 7.1.1 20170620
>>>>>> .config is attached
>>>>>> Raw console output is attached.
>>>>>> C reproducer is attached
>>>>>> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
>>>>>> for information about syzkaller reproducers
>>>>>>
>>>>>>
>>>>>> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 sock_owned_by_me include/net/sock.h:1505 [inline]
>>>>>> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 sock_owned_by_user include/net/sock.h:1511 [inline]
>>>>>> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404
>>>>>> Kernel panic - not syncing: panic_on_warn set ...
>>>>>>
>>>>>> CPU: 0 PID: 2996 Comm: syzkaller142210 Not tainted 4.14.0-rc5+ #138
>>>>>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
>>>>>> Call Trace:
>>>>>>  <IRQ>
>>>>>>  __dump_stack lib/dump_stack.c:16 [inline]
>>>>>>  dump_stack+0x194/0x257 lib/dump_stack.c:52
>>>>>>  panic+0x1e4/0x417 kernel/panic.c:181
>>>>>>  __warn+0x1c4/0x1d9 kernel/panic.c:542
>>>>>>  report_bug+0x211/0x2d0 lib/bug.c:183
>>>>>>  fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:178
>>>>>>  do_trap_no_signal arch/x86/kernel/traps.c:212 [inline]
>>>>>>  do_trap+0x260/0x390 arch/x86/kernel/traps.c:261
>>>>>>  do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:298
>>>>>>  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:311
>>>>>>  invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905
>>>>>> RIP: 0010:sock_owned_by_me include/net/sock.h:1505 [inline]
>>>>>> RIP: 0010:sock_owned_by_user include/net/sock.h:1511 [inline]
>>>>>> RIP: 0010:strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404
>>>>>> RSP: 0018:ffff8801db206b18 EFLAGS: 00010206
>>>>>> RAX: ffff8801d1e02080 RBX: ffff8801dad74c48 RCX: 0000000000000000
>>>>>> RDX: 0000000000000100 RSI: ffff8801d29fa0a0 RDI: ffffffff85cbede0
>>>>>> RBP: ffff8801db206b38 R08: 0000000000000005 R09: 1ffffffff0ce0bcd
>>>>>> R10: ffff8801db206a00 R11: dffffc0000000000 R12: ffff8801d29fa000
>>>>>> R13: ffff8801dad74c50 R14: ffff8801d4350a92 R15: 0000000000000001
>>>>>>  psock_data_ready+0x56/0x70 net/kcm/kcmsock.c:353
>>>>>
>>>>> Looks like KCM is calling sk_data_ready() without first taking the
>>>>> sock lock.
>>>>>
>>>>> /* Called with lower sock held */
>>>>> static void kcm_rcv_strparser(struct strparser *strp, struct sk_buff *skb)
>>>>> {
>>>>>  [...]
>>>>>         if (kcm_queue_rcv_skb(&kcm->sk, skb)) {
>>>>>
>>>>> In this case kcm->sk is not the same lock the comment is referring to.
>>>>> And kcm_queue_rcv_skb() will eventually call sk_data_ready().
>>>>>
>>>>> @Tom, how about wrapping the sk_data_ready call in {lock|release}_sock?
>>>>> I don't have anything better in mind immediately.
>>>>>
>>>> The sock locks are taken in reverse order in the send path, so
>>>> grabbing kcm sock lock with lower lock held to call sk_data_ready may
>>>> lead to deadlock, I think.
>>>>
>>>> It might be possible to change the order in the send path to do this.
>>>> Something like:
>>>>
>>>> trylock on lower socket lock
>>>> -if trylock fails
>>>>   - release kcm sock lock
>>>>   - lock lower sock
>>>>   - lock kcm sock
>>>> - call sendpage locked function
>>>>
>>>> I admit that dealing with two levels of socket locks in the data path
>>>> is quite a pain :-)
>>>
>>> up
>>>
>>> still happening and we've lost 50K+ test VMs on this
>>
>> up
>>
>> Still happens and number of crashes crossed 60K, can we do something
>> with this please?
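
The trylock-and-reorder idea quoted above can be sketched in userspace roughly as follows. This is only an illustration under stated assumptions: pthread mutexes stand in for the kernel socket locks, and struct sock_pair and lock_lower_from_send_path() are hypothetical names, not KCM code.

```c
#include <pthread.h>

/* Hypothetical stand-ins for the two socket locks discussed in the thread. */
struct sock_pair {
	pthread_mutex_t kcm_lock;   /* upper (KCM) socket lock */
	pthread_mutex_t lower_lock; /* lower (transport) socket lock */
};

/*
 * Send-path helper: entered with kcm_lock held, returns with both locks held.
 * Returns 1 if kcm_lock had to be dropped and retaken, in which case the
 * caller should revalidate any state it read before the call.
 */
static int lock_lower_from_send_path(struct sock_pair *p)
{
	/* Fast path: the lower lock is free, so no ordering problem arises. */
	if (pthread_mutex_trylock(&p->lower_lock) == 0)
		return 0;

	/* Slow path: back off and retake both in lower-then-kcm order. */
	pthread_mutex_unlock(&p->kcm_lock);
	pthread_mutex_lock(&p->lower_lock);
	pthread_mutex_lock(&p->kcm_lock);
	return 1;
}
```

The slow path never blocks on the lower lock while holding the upper one, so the send path cannot deadlock against a receive path that takes the lower lock first.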

^ permalink raw reply

* [PATCH v3 bpf-next 2/2] tools/bpftool: fix bpftool build with binutils >= 2.29
From: Roman Gushchin @ 2017-12-27 19:16 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, kernel-team, Roman Gushchin, Jakub Kicinski,
	Alexei Starovoitov, Daniel Borkmann
In-Reply-To: <20171227191629.4920-1-guro@fb.com>

Bpftool build is broken with binutils version 2.29 and later.
The cause is commit 003ca0fd2286 ("Refactor disassembler selection")
in the binutils repo, which changed the disassembler() function
signature.

Fix this by adding a new "feature" to the tools/build/features
infrastructure and making it responsible for deciding which
disassembler() function signature to use.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
---
 tools/bpf/Makefile                                | 29 +++++++++++++++++++++++
 tools/bpf/bpf_jit_disasm.c                        |  7 ++++++
 tools/bpf/bpftool/Makefile                        | 24 +++++++++++++++++++
 tools/bpf/bpftool/jit_disasm.c                    |  7 ++++++
 tools/build/feature/Makefile                      |  4 ++++
 tools/build/feature/test-disassembler-four-args.c | 15 ++++++++++++
 6 files changed, 86 insertions(+)
 create mode 100644 tools/build/feature/test-disassembler-four-args.c

diff --git a/tools/bpf/Makefile b/tools/bpf/Makefile
index 07a6697466ef..c8ec0ae16bf0 100644
--- a/tools/bpf/Makefile
+++ b/tools/bpf/Makefile
@@ -9,6 +9,35 @@ MAKE = make
 CFLAGS += -Wall -O2
 CFLAGS += -D__EXPORTED_HEADERS__ -I../../include/uapi -I../../include
 
+ifeq ($(srctree),)
+srctree := $(patsubst %/,%,$(dir $(CURDIR)))
+srctree := $(patsubst %/,%,$(dir $(srctree)))
+endif
+
+FEATURE_USER = .bpf
+FEATURE_TESTS = libbfd disassembler-four-args
+FEATURE_DISPLAY = libbfd disassembler-four-args
+
+check_feat := 1
+NON_CHECK_FEAT_TARGETS := clean bpftool_clean
+ifdef MAKECMDGOALS
+ifeq ($(filter-out $(NON_CHECK_FEAT_TARGETS),$(MAKECMDGOALS)),)
+  check_feat := 0
+endif
+endif
+
+ifeq ($(check_feat),1)
+ifeq ($(FEATURES_DUMP),)
+include $(srctree)/tools/build/Makefile.feature
+else
+include $(FEATURES_DUMP)
+endif
+endif
+
+ifeq ($(feature-disassembler-four-args), 1)
+CFLAGS += -DDISASM_FOUR_ARGS_SIGNATURE
+endif
+
 %.yacc.c: %.y
 	$(YACC) -o $@ -d $<
 
diff --git a/tools/bpf/bpf_jit_disasm.c b/tools/bpf/bpf_jit_disasm.c
index 75bf526a0168..30044bc4f389 100644
--- a/tools/bpf/bpf_jit_disasm.c
+++ b/tools/bpf/bpf_jit_disasm.c
@@ -72,7 +72,14 @@ static void get_asm_insns(uint8_t *image, size_t len, int opcodes)
 
 	disassemble_init_for_target(&info);
 
+#ifdef DISASM_FOUR_ARGS_SIGNATURE
+	disassemble = disassembler(info.arch,
+				   bfd_big_endian(bfdf),
+				   info.mach,
+				   bfdf);
+#else
 	disassemble = disassembler(bfdf);
+#endif
 	assert(disassemble);
 
 	do {
diff --git a/tools/bpf/bpftool/Makefile b/tools/bpf/bpftool/Makefile
index f8f31a8d9269..2237bc43f71c 100644
--- a/tools/bpf/bpftool/Makefile
+++ b/tools/bpf/bpftool/Makefile
@@ -46,6 +46,30 @@ LIBS = -lelf -lbfd -lopcodes $(LIBBPF)
 INSTALL ?= install
 RM ?= rm -f
 
+FEATURE_USER = .bpftool
+FEATURE_TESTS = libbfd disassembler-four-args
+FEATURE_DISPLAY = libbfd disassembler-four-args
+
+check_feat := 1
+NON_CHECK_FEAT_TARGETS := clean uninstall doc doc-clean doc-install doc-uninstall
+ifdef MAKECMDGOALS
+ifeq ($(filter-out $(NON_CHECK_FEAT_TARGETS),$(MAKECMDGOALS)),)
+  check_feat := 0
+endif
+endif
+
+ifeq ($(check_feat),1)
+ifeq ($(FEATURES_DUMP),)
+include $(srctree)/tools/build/Makefile.feature
+else
+include $(FEATURES_DUMP)
+endif
+endif
+
+ifeq ($(feature-disassembler-four-args), 1)
+CFLAGS += -DDISASM_FOUR_ARGS_SIGNATURE
+endif
+
 include $(wildcard *.d)
 
 all: $(OUTPUT)bpftool
diff --git a/tools/bpf/bpftool/jit_disasm.c b/tools/bpf/bpftool/jit_disasm.c
index 1551d3918d4c..57d32e8a1391 100644
--- a/tools/bpf/bpftool/jit_disasm.c
+++ b/tools/bpf/bpftool/jit_disasm.c
@@ -107,7 +107,14 @@ void disasm_print_insn(unsigned char *image, ssize_t len, int opcodes)
 
 	disassemble_init_for_target(&info);
 
+#ifdef DISASM_FOUR_ARGS_SIGNATURE
+	disassemble = disassembler(info.arch,
+				   bfd_big_endian(bfdf),
+				   info.mach,
+				   bfdf);
+#else
 	disassemble = disassembler(bfdf);
+#endif
 	assert(disassemble);
 
 	if (json_output)
diff --git a/tools/build/feature/Makefile b/tools/build/feature/Makefile
index 96982640fbf8..17f2c73fff8b 100644
--- a/tools/build/feature/Makefile
+++ b/tools/build/feature/Makefile
@@ -13,6 +13,7 @@ FILES=                                          \
          test-hello.bin                         \
          test-libaudit.bin                      \
          test-libbfd.bin                        \
+         test-disassembler-four-args.bin        \
          test-liberty.bin                       \
          test-liberty-z.bin                     \
          test-cplus-demangle.bin                \
@@ -188,6 +189,9 @@ $(OUTPUT)test-libpython-version.bin:
 $(OUTPUT)test-libbfd.bin:
 	$(BUILD) -DPACKAGE='"perf"' -lbfd -lz -liberty -ldl
 
+$(OUTPUT)test-disassembler-four-args.bin:
+	$(BUILD) -lbfd -lopcodes
+
 $(OUTPUT)test-liberty.bin:
 	$(CC) $(CFLAGS) -Wall -Werror -o $@ test-libbfd.c -DPACKAGE='"perf"' $(LDFLAGS) -lbfd -ldl -liberty
 
diff --git a/tools/build/feature/test-disassembler-four-args.c b/tools/build/feature/test-disassembler-four-args.c
new file mode 100644
index 000000000000..45ce65cfddf0
--- /dev/null
+++ b/tools/build/feature/test-disassembler-four-args.c
@@ -0,0 +1,15 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <bfd.h>
+#include <dis-asm.h>
+
+int main(void)
+{
+	bfd *abfd = bfd_openr(NULL, NULL);
+
+	disassembler(bfd_get_arch(abfd),
+		     bfd_big_endian(abfd),
+		     bfd_get_mach(abfd),
+		     abfd);
+
+	return 0;
+}
-- 
2.14.3
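
One possible way to keep the signature switch in a single place is a small wrapper macro, sketched below. This is not what the patch does (it repeats the #ifdef at each call site). Everything prefixed example_ or EXAMPLE_ is a hypothetical stand-in so the sketch builds without libbfd; only DISASM_FOUR_ARGS_SIGNATURE comes from the patch.

```c
#include <stddef.h>

/* Hypothetical stand-in for the function pointer type libopcodes returns. */
typedef int (*example_disasm_fn)(unsigned long addr);

static int example_print_insn(unsigned long addr)
{
	(void)addr;
	return 0;
}

#ifdef DISASM_FOUR_ARGS_SIGNATURE
/* Mimics the newer four-argument lookup. */
static example_disasm_fn example_disassembler(int arch, int big_endian,
					      unsigned long mach, void *abfd)
{
	(void)arch; (void)big_endian; (void)mach; (void)abfd;
	return example_print_insn;
}
#define EXAMPLE_GET_DISASM(arch, be, mach, abfd) \
	example_disassembler((arch), (be), (mach), (abfd))
#else
/* Mimics the older one-argument lookup. */
static example_disasm_fn example_disassembler(void *abfd)
{
	(void)abfd;
	return example_print_insn;
}
/* The extra arguments are simply dropped for the old signature. */
#define EXAMPLE_GET_DISASM(arch, be, mach, abfd) \
	example_disassembler((abfd))
#endif
```

Callers then always write EXAMPLE_GET_DISASM(arch, big_endian, mach, abfd), and the feature test result decides which underlying call is compiled.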

^ permalink raw reply related

* [PATCH v3 bpf-next 1/2] tools/bpftool: use version from the kernel source tree
From: Roman Gushchin @ 2017-12-27 19:16 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, kernel-team, Roman Gushchin, Alexei Starovoitov,
	Daniel Borkmann

Bpftool determines its own version based on the kernel
version, which is picked from the linux/version.h header.

It's strange to use the version of the installed kernel
headers, and makes much more sense to use the version
of the actual source tree, where bpftool sources are.

Fix this by building the kernelversion target and using
the resulting string as the bpftool version.

Example:
before:

$ bpftool version
bpftool v4.14.6

after:
$ bpftool version
bpftool v4.15.0-rc3

$ bpftool version --json
{"version":"4.15.0-rc3"}

Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
---
 tools/bpf/bpftool/Makefile |  3 +++
 tools/bpf/bpftool/main.c   | 13 ++-----------
 2 files changed, 5 insertions(+), 11 deletions(-)

diff --git a/tools/bpf/bpftool/Makefile b/tools/bpf/bpftool/Makefile
index 3f17ad317512..f8f31a8d9269 100644
--- a/tools/bpf/bpftool/Makefile
+++ b/tools/bpf/bpftool/Makefile
@@ -23,6 +23,8 @@ endif
 
 LIBBPF = $(BPF_PATH)libbpf.a
 
+BPFTOOL_VERSION=$(shell make --no-print-directory -sC ../../.. kernelversion)
+
 $(LIBBPF): FORCE
 	$(Q)$(MAKE) -C $(BPF_DIR) OUTPUT=$(OUTPUT) $(OUTPUT)libbpf.a FEATURES_DUMP=$(FEATURE_DUMP_EXPORT)
 
@@ -38,6 +40,7 @@ CC = gcc
 CFLAGS += -O2
 CFLAGS += -W -Wall -Wextra -Wno-unused-parameter -Wshadow
 CFLAGS += -D__EXPORTED_HEADERS__ -I$(srctree)/tools/include/uapi -I$(srctree)/tools/include -I$(srctree)/tools/lib/bpf -I$(srctree)/kernel/bpf/
+CFLAGS += -DBPFTOOL_VERSION='"$(BPFTOOL_VERSION)"'
 LIBS = -lelf -lbfd -lopcodes $(LIBBPF)
 
 INSTALL ?= install
diff --git a/tools/bpf/bpftool/main.c b/tools/bpf/bpftool/main.c
index ecd53ccf1239..3a0396d87c42 100644
--- a/tools/bpf/bpftool/main.c
+++ b/tools/bpf/bpftool/main.c
@@ -38,7 +38,6 @@
 #include <errno.h>
 #include <getopt.h>
 #include <linux/bpf.h>
-#include <linux/version.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
@@ -95,21 +94,13 @@ static int do_help(int argc, char **argv)
 
 static int do_version(int argc, char **argv)
 {
-	unsigned int version[3];
-
-	version[0] = LINUX_VERSION_CODE >> 16;
-	version[1] = LINUX_VERSION_CODE >> 8 & 0xf;
-	version[2] = LINUX_VERSION_CODE & 0xf;
-
 	if (json_output) {
 		jsonw_start_object(json_wtr);
 		jsonw_name(json_wtr, "version");
-		jsonw_printf(json_wtr, "\"%u.%u.%u\"",
-			     version[0], version[1], version[2]);
+		jsonw_printf(json_wtr, "\"%s\"", BPFTOOL_VERSION);
 		jsonw_end_object(json_wtr);
 	} else {
-		printf("%s v%u.%u.%u\n", bin_name,
-		       version[0], version[1], version[2]);
+		printf("%s v%s\n", bin_name, BPFTOOL_VERSION);
 	}
 	return 0;
 }
-- 
2.14.3
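
A minimal sketch of how a version string passed with -D can be consumed, under stated assumptions: the "unknown" fallback and the format_version() helper are hypothetical additions for illustration and are not in the patch, which prints directly from do_version().

```c
#include <stdio.h>

/* Assumed fallback so the sketch builds without -DBPFTOOL_VERSION=... */
#ifndef BPFTOOL_VERSION
#define BPFTOOL_VERSION "unknown"
#endif

/*
 * Format the version into buf, either as plain text or as a JSON object.
 * Returns what snprintf() returns: the length the full string would need.
 */
static int format_version(char *buf, size_t len, const char *bin_name,
			  int json)
{
	if (json)
		return snprintf(buf, len, "{\"version\":\"%s\"}",
				BPFTOOL_VERSION);
	return snprintf(buf, len, "%s v%s", bin_name, BPFTOOL_VERSION);
}
```

Because the macro is a plain string, suffixes such as -rc3 survive intact, which the old three-integer LINUX_VERSION_CODE decoding could not represent.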

^ permalink raw reply related

* Re: WARNING in strp_data_ready
From: Tom Herbert @ 2017-12-27 19:09 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: John Fastabend, syzbot, David S. Miller, Eric Biggers, LKML,
	Linux Kernel Network Developers, syzkaller-bugs, Tom Herbert,
	Cong Wang
In-Reply-To: <CACT4Y+bFsT4ZVvZSZvQdcKXb8n-79gy2tLB02jBm8wwp32YoAA@mail.gmail.com>

Did you try the patch I posted?


On Wed, Dec 27, 2017 at 10:25 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
> On Wed, Dec 6, 2017 at 4:44 PM, Dmitry Vyukov <dvyukov@google.com> wrote:
>>> <john.fastabend@gmail.com> wrote:
>>>> On 10/24/2017 08:20 AM, syzbot wrote:
>>>>> Hello,
>>>>>
>>>>> syzkaller hit the following crash on 73d3393ada4f70fa3df5639c8d438f2f034c0ecb
>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
>>>>> compiler: gcc (GCC) 7.1.1 20170620
>>>>> .config is attached
>>>>> Raw console output is attached.
>>>>> C reproducer is attached
>>>>> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
>>>>> for information about syzkaller reproducers
>>>>>
>>>>>
>>>>> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 sock_owned_by_me include/net/sock.h:1505 [inline]
>>>>> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 sock_owned_by_user include/net/sock.h:1511 [inline]
>>>>> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404
>>>>> Kernel panic - not syncing: panic_on_warn set ...
>>>>>
>>>>> CPU: 0 PID: 2996 Comm: syzkaller142210 Not tainted 4.14.0-rc5+ #138
>>>>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
>>>>> Call Trace:
>>>>>  <IRQ>
>>>>>  __dump_stack lib/dump_stack.c:16 [inline]
>>>>>  dump_stack+0x194/0x257 lib/dump_stack.c:52
>>>>>  panic+0x1e4/0x417 kernel/panic.c:181
>>>>>  __warn+0x1c4/0x1d9 kernel/panic.c:542
>>>>>  report_bug+0x211/0x2d0 lib/bug.c:183
>>>>>  fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:178
>>>>>  do_trap_no_signal arch/x86/kernel/traps.c:212 [inline]
>>>>>  do_trap+0x260/0x390 arch/x86/kernel/traps.c:261
>>>>>  do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:298
>>>>>  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:311
>>>>>  invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905
>>>>> RIP: 0010:sock_owned_by_me include/net/sock.h:1505 [inline]
>>>>> RIP: 0010:sock_owned_by_user include/net/sock.h:1511 [inline]
>>>>> RIP: 0010:strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404
>>>>> RSP: 0018:ffff8801db206b18 EFLAGS: 00010206
>>>>> RAX: ffff8801d1e02080 RBX: ffff8801dad74c48 RCX: 0000000000000000
>>>>> RDX: 0000000000000100 RSI: ffff8801d29fa0a0 RDI: ffffffff85cbede0
>>>>> RBP: ffff8801db206b38 R08: 0000000000000005 R09: 1ffffffff0ce0bcd
>>>>> R10: ffff8801db206a00 R11: dffffc0000000000 R12: ffff8801d29fa000
>>>>> R13: ffff8801dad74c50 R14: ffff8801d4350a92 R15: 0000000000000001
>>>>>  psock_data_ready+0x56/0x70 net/kcm/kcmsock.c:353
>>>>
>>>> Looks like KCM is calling sk_data_ready() without first taking the
>>>> sock lock.
>>>>
>>>> /* Called with lower sock held */
>>>> static void kcm_rcv_strparser(struct strparser *strp, struct sk_buff *skb)
>>>> {
>>>>  [...]
>>>>         if (kcm_queue_rcv_skb(&kcm->sk, skb)) {
>>>>
>>>> In this case kcm->sk is not the same lock the comment is referring to.
>>>> And kcm_queue_rcv_skb() will eventually call sk_data_ready().
>>>>
>>>> @Tom, how about wrapping the sk_data_ready call in {lock|release}_sock?
>>>> I don't have anything better in mind immediately.
>>>>
>>> The sock locks are taken in reverse order in the send path, so
>>> grabbing kcm sock lock with lower lock held to call sk_data_ready may
>>> lead to deadlock, I think.
>>>
>>> It might be possible to change the order in the send path to do this.
>>> Something like:
>>>
>>> trylock on lower socket lock
>>> -if trylock fails
>>>   - release kcm sock lock
>>>   - lock lower sock
>>>   - lock kcm sock
>>> - call sendpage locked function
>>>
>>> I admit that dealing with two levels of socket locks in the data path
>>> is quite a pain :-)
>>
>> up
>>
>> still happening and we've lost 50K+ test VMs on this
>
> up
>
> Still happens and number of crashes crossed 60K, can we do something
> with this please?

^ permalink raw reply

* Re: [PATCH v2 bpf-next 2/2] tools/bpftool: fix bpftool build with binutils >= 2.28
From: Roman Gushchin @ 2017-12-27 19:04 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Quentin Monnet, netdev, linux-kernel, kernel-team, Jakub Kicinski,
	Alexei Starovoitov, Daniel Borkmann
In-Reply-To: <20171227023204.eulgkbg7epj7nl76@ast-mbp>

On Tue, Dec 26, 2017 at 06:32:05PM -0800, Alexei Starovoitov wrote:
> On Fri, Dec 22, 2017 at 06:50:01PM +0000, Quentin Monnet wrote:
> > Hi Roman,
> > 
> > 2017-12-22 16:11 UTC+0000 ~ Roman Gushchin <guro@fb.com>
> > > Bpftool build is broken with binutils version 2.28 and later.
> > 
> > Could you check the binutils version? I believe it changed in 2.29
> > instead of 2.28. Could you update your commit log and subject
> > accordingly, please?

Yes, you're right. Thanks!

> > 
> > > The cause is commit 003ca0fd2286 ("Refactor disassembler selection")
> > > in the binutils repo, which changed the disassembler() function
> > > signature.
> > > 
> > > Fix this by adding a new "feature" to the tools/build/features
> > > infrastructure and making it responsible for deciding which
> > > disassembler() function signature to use.
> > > 
> > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
> > > Cc: Alexei Starovoitov <ast@kernel.org>
> > > Cc: Daniel Borkmann <daniel@iogearbox.net>
> > > ---
> > >  tools/bpf/Makefile                                | 29 +++++++++++++++++++++++
> > >  tools/bpf/bpf_jit_disasm.c                        |  7 ++++++
> > >  tools/bpf/bpftool/Makefile                        | 24 +++++++++++++++++++
> > >  tools/bpf/bpftool/jit_disasm.c                    |  7 ++++++
> > >  tools/build/feature/Makefile                      |  4 ++++
> > >  tools/build/feature/test-disassembler-four-args.c | 15 ++++++++++++
> > >  6 files changed, 86 insertions(+)
> > >  create mode 100644 tools/build/feature/test-disassembler-four-args.c
> > > 
> > > diff --git a/tools/bpf/Makefile b/tools/bpf/Makefile
> > > index 07a6697466ef..c8ec0ae16bf0 100644
> > > --- a/tools/bpf/Makefile
> > > +++ b/tools/bpf/Makefile
> > > @@ -9,6 +9,35 @@ MAKE = make
> > >  CFLAGS += -Wall -O2
> > >  CFLAGS += -D__EXPORTED_HEADERS__ -I../../include/uapi -I../../include
> > >  
> > > +ifeq ($(srctree),)
> > > +srctree := $(patsubst %/,%,$(dir $(CURDIR)))
> > > +srctree := $(patsubst %/,%,$(dir $(srctree)))
> > > +endif
> > > +
> > > +FEATURE_USER = .bpf
> > > +FEATURE_TESTS = libbfd disassembler-four-args
> > > +FEATURE_DISPLAY = libbfd disassembler-four-args
> > 
> > Thanks for adding libbfd as I requested. However, you do not use it in
> > the Makefile to prevent compilation if the feature is not detected (see
> > "bpfdep" or "elfdep" in tools/lib/bpf/Makefile). Sorry, I should have
> > pointed it out in my previous review.
> > 
> > But actually, I have another issue related to the libbfd feature: since
> > commit 280e7c48c3b8 ("perf tools: fix BFD detection on opensuse") it
> > requires libiberty so that libbfd is correctly detected, but libiberty
> > is not needed on all distros (at least Ubuntu can have libbfd without
> > libiberty). Typically, detection fails on my setup, although I do have
> > libbfd installed. So forcing libbfd feature here may eventually force
> > users to install libraries they do not need to compile bpftool, which is
> > not what we want.
> > 
> > I do not have a clean workaround to suggest. Maybe have one
> > "libbfd-something" feature that tries to compile without libiberty, then
> > another one that tries with it, and compile the tools if at least one of
> > them succeeds. But it's probably for another patch series. In the
> > meantime, would you please simply remove libbfd detection here and
> > accept my apologies for suggesting to add it in the previous review?
> 
> I think since libbfd is already used by bpftool it's a good thing
> to add feature detection. Even if it's not perfect on some setups.

Agree, we can enhance it later.

> 
> Roman,
> I think you still need to do one more respin to address commit log nit?
> 

Sure, will send soon-ish.

Thanks!

^ permalink raw reply

* Re: [PATCH 4/4] libbpf: add missing SPDX-License-Identifier
From: Alexei Starovoitov @ 2017-12-27 19:01 UTC (permalink / raw)
  To: Eric Leblond; +Cc: netdev, daniel, linux-kernel
In-Reply-To: <20171227180229.1926-5-eric@regit.org>

On Wed, Dec 27, 2017 at 07:02:29PM +0100, Eric Leblond wrote:
> Signed-off-by: Eric Leblond <eric@regit.org>

thank you.
Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [PATCH 3/4] libbpf: break loop earlier
From: Alexei Starovoitov @ 2017-12-27 19:00 UTC (permalink / raw)
  To: Eric Leblond; +Cc: netdev, daniel, linux-kernel
In-Reply-To: <20171227180229.1926-4-eric@regit.org>

On Wed, Dec 27, 2017 at 07:02:28PM +0100, Eric Leblond wrote:
> Get out of the loop when we have a match.
> 
> Signed-off-by: Eric Leblond <eric@regit.org>
> ---
>  tools/lib/bpf/libbpf.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 5fe8aaa2123e..d263748aa341 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -412,6 +412,7 @@ bpf_object__init_prog_names(struct bpf_object *obj)
>  					   prog->section_name);
>  				return -LIBBPF_ERRNO__LIBELF;
>  			}
> +			break;

Why is this needed?
The top of the loop is:
 for (si = 0; si < symbols->d_size / sizeof(GElf_Sym) && !name;

so as soon as name is found the loop will exit.
I agree that the loop structure is non-standard and confusing,
but adding break here will make it even more so.
If 'break' is added then '&& !name' should be removed.
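
To illustrate the point about the two loop forms, here is a small standalone sketch. It is not libbpf code: find_by_condition() and find_by_break() are hypothetical helpers over a plain string array.

```c
#include <stddef.h>
#include <string.h>

/* Form used in the quoted loop: the loop condition itself stops on a match. */
static const char *find_by_condition(const char *const *names, size_t n,
				     const char *want)
{
	const char *name = NULL;
	size_t i;

	for (i = 0; i < n && !name; i++) {
		if (strcmp(names[i], want) == 0)
			name = names[i];
	}
	return name;
}

/* Equivalent form: a plain bound check plus an explicit break. */
static const char *find_by_break(const char *const *names, size_t n,
				 const char *want)
{
	const char *name = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (strcmp(names[i], want) == 0) {
			name = names[i];
			break;
		}
	}
	return name;
}
```

Both helpers stop at the first match and return the same pointer, so using the condition and the break together would be redundant.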

^ permalink raw reply

* Re: [PATCH 1/4] libbpf: add function to setup XDP
From: Alexei Starovoitov @ 2017-12-27 18:57 UTC (permalink / raw)
  To: Eric Leblond; +Cc: netdev, daniel, linux-kernel
In-Reply-To: <20171227180229.1926-2-eric@regit.org>

On Wed, Dec 27, 2017 at 07:02:26PM +0100, Eric Leblond wrote:
> Most of the code is taken from set_link_xdp_fd() in bpf_load.c and
> slightly modified to be library compliant.
> 
> Signed-off-by: Eric Leblond <eric@regit.org>

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [PATCH 2/4] libbpf: add error reporting in XDP
From: Alexei Starovoitov @ 2017-12-27 18:57 UTC (permalink / raw)
  To: Eric Leblond; +Cc: netdev, daniel, linux-kernel
In-Reply-To: <20171227180229.1926-3-eric@regit.org>

On Wed, Dec 27, 2017 at 07:02:27PM +0100, Eric Leblond wrote:
> Parse netlink ext attribute to get the error message returned by
> the card. Code is partially taken from libnl.
> 
> Signed-off-by: Eric Leblond <eric@regit.org>

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [PATCH v2] sctp: Replace use of sockets_allocated with specified macro.
From: David Miller @ 2017-12-27 18:48 UTC (permalink / raw)
  To: xiangxia.m.yue; +Cc: netdev, eric.dumazet
In-Reply-To: <1513966520-5429-1-git-send-email-xiangxia.m.yue@gmail.com>

From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Date: Fri, 22 Dec 2017 10:15:20 -0800

> The patch (180d8cd942ce) replaced all uses of the struct sock fields
> memory_pressure, memory_allocated, sockets_allocated, and sysctl_mem
> with accessor macros. But the sockets_allocated field of the sctp sock
> was not replaced. Replace it now to unify the code.
> 
> Fixes: 180d8cd942ce ("foundations of per-cgroup memory pressure controlling.")
> Cc: Glauber Costa <glommer@parallels.com>
> Signed-off-by: Tonghao Zhang <zhangtonghao@didichuxing.com>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH V2 net-next 0/3] rds bug fixes
From: David Miller @ 2017-12-27 18:38 UTC (permalink / raw)
  To: sowmini.varadhan; +Cc: netdev, rds-devel, santosh.shilimkar
In-Reply-To: <cover.1513962765.git.sowmini.varadhan@oracle.com>

From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Fri, 22 Dec 2017 09:38:58 -0800

> Ran into pre-existing bugs when working on the fix for
>    https://www.spinics.net/lists/netdev/msg472849.html
> 
> The bugs fixed in this patchset are unrelated to the syzbot 
> failure (which I'm still testing and trying to reproduce) but 
> meanwhile, let's get these fixes out of the way.
> 
> V2: target net-next (rds:tcp patches have a dependency on
> changes that are in net-next, but not yet in net)

Series applied, thanks.

^ permalink raw reply

* Re: [ovs-dev] Pravin Shelar
From: Joe Perches @ 2017-12-27 18:33 UTC (permalink / raw)
  To: Ben Pfaff, Julia Lawall, Pravin Shelar; +Cc: netdev, dev
In-Reply-To: <20171227182540.GT13883@ovn.org>

On Wed, 2017-12-27 at 10:25 -0800, Ben Pfaff wrote:
> On Wed, Dec 27, 2017 at 04:22:55PM +0100, Julia Lawall wrote:
> > The email address pshelar@nicira.com listed for Pravin Shelar in
> > MAINTAINERS (OPENVSWITCH section) seems to bounce.
> 
> Pravin has used a newer address recently, so I sent out a suggested
> update (for OVS):
>         https://patchwork.ozlabs.org/patch/853232/

As Pravin is still active with acks but not any authored patches in the
last year, this should still be updated in the linux-kernel's MAINTAINERS
file too.
---
diff --git a/MAINTAINERS b/MAINTAINERS
index a6e86e20761e..5869e5f0b930 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10137,7 +10137,7 @@ F:	drivers/irqchip/irq-ompic.c
 F:	drivers/irqchip/irq-or1k-*
 
 OPENVSWITCH
-M:	Pravin Shelar <pshelar@nicira.com>
+M:	Pravin Shelar <pshelar@ovn.org>
 L:	netdev@vger.kernel.org
 L:	dev@openvswitch.org
 W:	http://openvswitch.org

^ permalink raw reply

* Re: [Patch net-next] net_sched: remove the unsafe __skb_array_empty()
From: Cong Wang @ 2017-12-27 18:29 UTC (permalink / raw)
  To: John Fastabend; +Cc: Linux Kernel Network Developers, Jakub Kicinski
In-Reply-To: <419268d4-4078-098c-3c2e-5ce967feb5e5@gmail.com>

On Sat, Dec 23, 2017 at 10:57 PM, John Fastabend
<john.fastabend@gmail.com> wrote:
> On 12/22/2017 12:31 PM, Cong Wang wrote:
>> I understand why you had it, but it is just not safe. You don't want
>> to achieve performance gain by crashing system, right?
>
> huh? So my point is the patch you submit here is not a
> real fix but a work around. To peek the head of a consumer/producer ring
> without a lock, _should_ be fine. This _should_ work as well with
> consumer or producer operations happening at the same time. After some
> digging the issue is in the ptr_ring code.


The comments disagree with you:

/* Might be slightly faster than skb_array_empty below, but only safe if the
 * array is never resized. Also, callers invoking this in a loop must take care
 * to use a compiler barrier, for example cpu_relax().
 */

If the comments are right, you miss a barrier here too.


>
> The peek code (what empty check calls) is the following,
>
> static inline void *__ptr_ring_peek(struct ptr_ring *r)
> {
>         if (likely(r->size))
>                 return r->queue[r->consumer_head];
>         return NULL;
> }
>
> So what the splat is detecting is consumer head being 'out of bounds'.
> This happens because ptr_ring_discard_one increments the consumer_head
> and then checks to see if it overran the array size. If above peek
> happens after the increment, but before the size check we get the
> splat. There are two ways, as far as I can see, to fix this. First
> do the check before incrementing the consumer head. Or the easier
> fix,
>
> --- a/include/linux/ptr_ring.h
> +++ b/include/linux/ptr_ring.h
> @@ -438,7 +438,7 @@ static inline int ptr_ring_consume_batched_bh(struct
> ptr_ring *r,
>
>  static inline void **__ptr_ring_init_queue_alloc(unsigned int size,
> gfp_t gfp)
>  {
> -       return kcalloc(size, sizeof(void *), gfp);
> +       return kcalloc(size + 1, sizeof(void *), gfp);
>  }
>
> With Jakub's help (Thanks!) I was able to reproduce the original splat
> and also verify the above removes it.
>
> To be clear "resizing" a skb_array only refers to changing the actual
> array size not adding/removing elements.

I never looked into the implementation; I simply trusted the comments.

At least the comments above __skb_array_empty() need to improve.
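
For reference, the window described in the quoted text can be modelled in
a few lines of user-space C. This is only a sketch under assumptions: the
toy_* names are invented, and the two-step discard follows the quoted
description of ptr_ring_discard_one, not its real code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a ring with a consumer index (all names hypothetical). */
struct toy_ring {
	size_t size;          /* number of usable slots */
	size_t allocated;     /* number of slots actually allocated */
	size_t consumer_head; /* index a lockless peek would read */
};

/* Step 1 of the discard as described above: bump the head first. */
static void toy_discard_step1(struct toy_ring *r)
{
	r->consumer_head++;
}

/* Step 2: only afterwards wrap the head back into range. */
static void toy_discard_step2(struct toy_ring *r)
{
	if (r->consumer_head >= r->size)
		r->consumer_head = 0;
}

/* Would a peek running right now index inside the allocation? */
static bool toy_peek_in_bounds(const struct toy_ring *r)
{
	return r->consumer_head < r->allocated;
}
```

With size 4, allocated 4 and the head at 3, a peek between the two steps
sees index 4 and is out of bounds. With allocated set to size + 1, which
is what the quoted kcalloc() change does, the same peek stays inside the
allocation.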


>
>>
>>>
>>> Although it's not logical IMO to have both reset and dequeue running at
>>> the same time. Some skbs would get through others would get sent, sort
>>> of a mess. I don't see how it can be an issue. The never resized bit
>>> in the documentation is referring to resizing the ring size _not_ popping
>>> off elements of the ring. array_empty just reads the consumer head.
>>> The only ring resizing in pfifo fast should be at init and destroy where
>>> enqueue/dequeue should be disconnected by then. Although based on the
>>> trace I missed a case.
>>
>>
>> Both pfifo_fast_reset() and pfifo_fast_dequeue() call
>> skb_array_consume_bh(), so there is no difference w.r.t. resizing.
>>
>
> Sorry not following.
>
>> And ->reset() is called in qdisc_graft() too. Let's say we have htb+pfifo_fast,
>> htb_graft() calls qdisc_replace() which calls qdisc_reset() on pfifo_fast,
>> so clearly pfifo_fast_reset() can run with pfifo_fast_dequeue()
>> concurrently.
>
> Yes and this _should_ be perfectly fine for pfifo_fast. I'm wondering
> though if this API can be cleaned up. What are the paths that do a reset
> without a destroy.. Do we really need to have this pattern where reset
> is called then later destroy. Seems destroy could do the entire cleanup
> and this would simplify things. None of this has to do with the splat
> though.

I don't follow your point any more.

We are talking about ->reset() race with ->dequeue() which is the
cause of the bug, right?

If you expect ->reset() runs in parallel with ->dequeue() for pfifo_fast,
why did you even mention synchronize_net() from the beginning???
Also you changed the code too, to adjust rcu grace period.


>
>>
>>
>>>
>>> I think the right fix is to only call reset/destroy patterns after
>>> waiting a grace period and for all tx_action calls in-flight to
>>> complete. This is also better going forward for more complex qdiscs.
>>
>> But we don't even have rcu read lock in TX BH, do we?
>>
>> Also, people certainly don't like yet another synchronize_net()...
>>
>
> This needs a fix and is a _real_ bug, but removing __skb_array_empty()
> doesn't help solve this at all. Will work on a fix after the holiday
> break. The fix here is to ensure the destroy is not going to happen
> while tx_action is in-flight. Can be done with qdisc_run and checking
> correct bits in lockless case.

Sounds like you missed a lot of things with your "lockless" patches....
First qdisc rcu callback, second rcu read lock in TX BH...

My quick one-line fix is to amend this bug before you go deeper
into this rabbit hole.

^ permalink raw reply

* Re: Pravin Shelar
From: Ben Pfaff @ 2017-12-27 18:25 UTC (permalink / raw)
  To: Julia Lawall; +Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <alpine.DEB.2.20.1712271621370.14107@hadrien>

On Wed, Dec 27, 2017 at 04:22:55PM +0100, Julia Lawall wrote:
> The email address pshelar@nicira.com listed for Pravin Shelar in
> MAINTAINERS (OPENVSWITCH section) seems to bounce.

Pravin has used a newer address recently, so I sent out a suggested
update (for OVS):
        https://patchwork.ozlabs.org/patch/853232/

^ permalink raw reply

* Re: WARNING in strp_data_ready
From: Dmitry Vyukov @ 2017-12-27 18:25 UTC (permalink / raw)
  To: Tom Herbert
  Cc: John Fastabend, syzbot, David S. Miller, Eric Biggers, LKML,
	Linux Kernel Network Developers, syzkaller-bugs, Tom Herbert,
	Cong Wang
In-Reply-To: <CACT4Y+YXhkZsDdst+H1VQZ9Tc0fMJhF3niXEJoK9TqcTUaC4iA@mail.gmail.com>

On Wed, Dec 6, 2017 at 4:44 PM, Dmitry Vyukov <dvyukov@google.com> wrote:
>> <john.fastabend@gmail.com> wrote:
>>> On 10/24/2017 08:20 AM, syzbot wrote:
>>>> Hello,
>>>>
>>>> syzkaller hit the following crash on 73d3393ada4f70fa3df5639c8d438f2f034c0ecb
>>>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
>>>> compiler: gcc (GCC) 7.1.1 20170620
>>>> .config is attached
>>>> Raw console output is attached.
>>>> C reproducer is attached
>>>> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
>>>> for information about syzkaller reproducers
>>>>
>>>>
>>>> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 sock_owned_by_me include/net/sock.h:1505 [inline]
>>>> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 sock_owned_by_user include/net/sock.h:1511 [inline]
>>>> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404
>>>> Kernel panic - not syncing: panic_on_warn set ...
>>>>
>>>> CPU: 0 PID: 2996 Comm: syzkaller142210 Not tainted 4.14.0-rc5+ #138
>>>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
>>>> Call Trace:
>>>>  <IRQ>
>>>>  __dump_stack lib/dump_stack.c:16 [inline]
>>>>  dump_stack+0x194/0x257 lib/dump_stack.c:52
>>>>  panic+0x1e4/0x417 kernel/panic.c:181
>>>>  __warn+0x1c4/0x1d9 kernel/panic.c:542
>>>>  report_bug+0x211/0x2d0 lib/bug.c:183
>>>>  fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:178
>>>>  do_trap_no_signal arch/x86/kernel/traps.c:212 [inline]
>>>>  do_trap+0x260/0x390 arch/x86/kernel/traps.c:261
>>>>  do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:298
>>>>  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:311
>>>>  invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905
>>>> RIP: 0010:sock_owned_by_me include/net/sock.h:1505 [inline]
>>>> RIP: 0010:sock_owned_by_user include/net/sock.h:1511 [inline]
>>>> RIP: 0010:strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404
>>>> RSP: 0018:ffff8801db206b18 EFLAGS: 00010206
>>>> RAX: ffff8801d1e02080 RBX: ffff8801dad74c48 RCX: 0000000000000000
>>>> RDX: 0000000000000100 RSI: ffff8801d29fa0a0 RDI: ffffffff85cbede0
>>>> RBP: ffff8801db206b38 R08: 0000000000000005 R09: 1ffffffff0ce0bcd
>>>> R10: ffff8801db206a00 R11: dffffc0000000000 R12: ffff8801d29fa000
>>>> R13: ffff8801dad74c50 R14: ffff8801d4350a92 R15: 0000000000000001
>>>>  psock_data_ready+0x56/0x70 net/kcm/kcmsock.c:353
>>>
>>> Looks like KCM is calling sk_data_ready() without first taking the
>>> sock lock.
>>>
>>> /* Called with lower sock held */
>>> static void kcm_rcv_strparser(struct strparser *strp, struct sk_buff *skb)
>>> {
>>>  [...]
>>>         if (kcm_queue_rcv_skb(&kcm->sk, skb)) {
>>>
>>> In this case kcm->sk is not the same lock the comment is referring to.
>>> And kcm_queue_rcv_skb() will eventually call sk_data_ready().
>>>
>>> @Tom, how about wrapping the sk_data_ready call in {lock|release}_sock?
>>> I don't have anything better in mind immediately.
>>>
>> The sock locks are taken in reverse order in the send path, so
>> grabbing the kcm sock lock with the lower lock held to call
>> sk_data_ready may lead to deadlock, I think.
>>
>> It might be possible to change the order in the send path to do this.
>> Something like:
>>
>> trylock on lower socket lock
>> -if trylock fails
>>   - release kcm sock lock
>>   - lock lower sock
>>   - lock kcm sock
>> - call sendpage locked function
>>
>> I admit that dealing with two levels of socket locks in the data path
>> is quite a pain :-)
>
> up
>
> still happening and we've lost 50K+ test VMs on this

up

This is still happening and the number of crashes has crossed 60K. Can we
do something about this, please?
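
For anyone picking this up, the trylock-then-reorder sequence from the
quoted proposal can be sketched in user space with pthread mutexes. This
is an analogue only, not kernel code: 'upper' stands in for the kcm sock
lock, 'lower' for the lower socket lock, and the function name is made up.

```c
#include <pthread.h>

/*
 * Hypothetical user-space analogue of the proposed send path.
 * Called with 'upper' held; returns with both locks held.
 */
static void lock_lower_with_upper_held(pthread_mutex_t *upper,
				       pthread_mutex_t *lower)
{
	/* fast path: lower lock is free, take it without dropping upper */
	if (pthread_mutex_trylock(lower) == 0)
		return;

	/* slow path: drop upper, then take both in lower -> upper order */
	pthread_mutex_unlock(upper);
	pthread_mutex_lock(lower);
	pthread_mutex_lock(upper);
}
```

On the slow path the upper lock is dropped for a moment, so any state it
protects has to be rechecked by the caller before sending.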

^ permalink raw reply

* Re: lost connection to test machine (3)
From: Dmitry Vyukov @ 2017-12-27 18:22 UTC (permalink / raw)
  To: syzbot
  Cc: LKML, syzkaller-bugs, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, David Miller, netfilter-devel, coreteam, netdev
In-Reply-To: <001a1143d40c2b55b10561566d26@google.com>

On Wed, Dec 27, 2017 at 7:18 PM, syzbot
<syzbot+4396883fa8c4f64e0175@syzkaller.appspotmail.com> wrote:
> Hello,
>
> syzkaller hit the following crash on
> beacbc68ac3e23821a681adb30b45dc55b17488d
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached
> Raw console output is attached.
> C reproducer is attached
> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
> for information about syzkaller reproducers
>
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: <syzbot+4396883fa8c4f64e0175@syzkaller.appspotmail.com>
> It will help syzbot understand when the bug is fixed. See footer for
> details.
> If you forward the report, please keep this part and the footer.

+netfilter maintainers

Here is cleaned reproducer:

// autogenerated by syzkaller (http://github.com/google/syzkaller)
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <linux/if.h>
#include <linux/netfilter_ipv4/ip_tables.h>

int main()
{
  int fd;

  fd = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
  struct ipt_replace opt = {};
  opt.num_counters = 1;
  opt.size = -1;
  setsockopt(fd, SOL_IP, 0x40, &opt, 0x4);
  return 0;
}


What happens there is that here:

struct xt_table_info *xt_alloc_table_info(unsigned int size)
{
    ...
    if ((SMP_ALIGN(size) >> PAGE_SHIFT) + 2 > totalram_pages)
        return NULL;

size = -1 and SMP_ALIGN(size) = 0, so this still tries to allocate
4GB+delta bytes.

I don't understand why this uses SMP_ALIGN since we add 2 pages on
top, it seems that we could just drop SMP_ALIGN and local SMP_ALIGN
definition altogether.
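
The wraparound can be checked in user space with a small sketch. It
assumes that SMP_ALIGN rounds up to a multiple of the cache line size in
32-bit unsigned arithmetic; the 64-byte line, the 4K page and all toy_*
and pages_* names are illustrative assumptions, not the kernel
definitions.

```c
#include <limits.h>

/* Assumed values for illustration; the kernel's depend on the arch. */
#define TOY_CACHE_BYTES 64u
#define TOY_PAGE_SHIFT  12

/* Round x up to a multiple of TOY_CACHE_BYTES, in unsigned int math. */
static unsigned int toy_smp_align(unsigned int x)
{
	return (x + TOY_CACHE_BYTES - 1) & ~(TOY_CACHE_BYTES - 1);
}

/* Page count the check would see with the alignment step. */
static unsigned long pages_with_align(unsigned int size)
{
	return (toy_smp_align(size) >> TOY_PAGE_SHIFT) + 2;
}

/* Page count the check would see without the alignment step. */
static unsigned long pages_without_align(unsigned int size)
{
	return (size >> TOY_PAGE_SHIFT) + 2;
}
```

For size = UINT_MAX (the reproducer's -1) the aligned variant yields 2
pages, which passes the totalram_pages check on any machine, while the
unaligned variant yields 1048577 pages and would be rejected on a machine
with less than about 4GB of RAM.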



> ---
> This bug is generated by a dumb bot. It may contain errors.
> See https://goo.gl/tpsmEJ for details.
> Direct all questions to syzkaller@googlegroups.com.
>
> syzbot will keep track of this bug report.
> If you forgot to add the Reported-by tag, once the fix for this bug is
> merged
> into any tree, please reply to this email with:
> #syz fix: exact-commit-title
> If you want to test a patch for this bug, please reply with:
> #syz test: git://repo/address.git branch
> and provide the patch inline or as an attachment.
> To mark this as a duplicate of another syzbot report, please reply with:
> #syz dup: exact-subject-of-another-report
> If it's a one-off invalid bug report, please reply with:
> #syz invalid
> Note: if the crash happens again, it will cause creation of a new bug
> report.
> Note: all commands must start from beginning of the line in the email body.
>
> --
> You received this message because you are subscribed to the Google Groups
> "syzkaller-bugs" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to syzkaller-bugs+unsubscribe@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/syzkaller-bugs/001a1143d40c2b55b10561566d26%40google.com.
> For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply

* [PATCH 4/4] libbpf: add missing SPDX-License-Identifier
From: Eric Leblond @ 2017-12-27 18:02 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: netdev, daniel, linux-kernel, Eric Leblond
In-Reply-To: <20171227180229.1926-1-eric@regit.org>

Signed-off-by: Eric Leblond <eric@regit.org>
---
 tools/lib/bpf/bpf.c    | 2 ++
 tools/lib/bpf/bpf.h    | 2 ++
 tools/lib/bpf/libbpf.c | 2 ++
 tools/lib/bpf/libbpf.h | 2 ++
 4 files changed, 8 insertions(+)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index cdfabbe118cc..9e53dbbca2bd 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -1,3 +1,5 @@
+/* SPDX-License-Identifier: LGPL-2.1 */
+
 /*
  * common eBPF ELF operations.
  *
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 9f44c196931e..8d18fb73d7fb 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -1,3 +1,5 @@
+/* SPDX-License-Identifier: LGPL-2.1 */
+
 /*
  * common eBPF ELF operations.
  *
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index d263748aa341..50d4b5e73d0e 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1,3 +1,5 @@
+/* SPDX-License-Identifier: LGPL-2.1 */
+
 /*
  * Common eBPF ELF object loading operations.
  *
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index e42f96900318..f85906533cdd 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -1,3 +1,5 @@
+/* SPDX-License-Identifier: LGPL-2.1 */
+
 /*
  * Common eBPF ELF object loading operations.
  *
-- 
2.15.1

^ permalink raw reply related

* [PATCH 3/4] libbpf: break loop earlier
From: Eric Leblond @ 2017-12-27 18:02 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: netdev, daniel, linux-kernel, Eric Leblond
In-Reply-To: <20171227180229.1926-1-eric@regit.org>

Get out of the loop when we have a match.

Signed-off-by: Eric Leblond <eric@regit.org>
---
 tools/lib/bpf/libbpf.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 5fe8aaa2123e..d263748aa341 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -412,6 +412,7 @@ bpf_object__init_prog_names(struct bpf_object *obj)
 					   prog->section_name);
 				return -LIBBPF_ERRNO__LIBELF;
 			}
+			break;
 		}
 
 		if (!name) {
-- 
2.15.1

^ permalink raw reply related

* [PATCH 2/4] libbpf: add error reporting in XDP
From: Eric Leblond @ 2017-12-27 18:02 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: netdev, daniel, linux-kernel, Eric Leblond
In-Reply-To: <20171227180229.1926-1-eric@regit.org>

Parse netlink ext attribute to get the error message returned by
the card. Code is partially taken from libnl.

Signed-off-by: Eric Leblond <eric@regit.org>
---
 tools/lib/bpf/Build    |   2 +-
 tools/lib/bpf/bpf.c    |   9 +++
 tools/lib/bpf/nlattr.c | 187 +++++++++++++++++++++++++++++++++++++++++++++++++
 tools/lib/bpf/nlattr.h |  69 ++++++++++++++++++
 4 files changed, 266 insertions(+), 1 deletion(-)
 create mode 100644 tools/lib/bpf/nlattr.c
 create mode 100644 tools/lib/bpf/nlattr.h

diff --git a/tools/lib/bpf/Build b/tools/lib/bpf/Build
index d8749756352d..64c679d67109 100644
--- a/tools/lib/bpf/Build
+++ b/tools/lib/bpf/Build
@@ -1 +1 @@
-libbpf-y := libbpf.o bpf.o
+libbpf-y := libbpf.o bpf.o nlattr.o
diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 1e3cfe6b9fce..cdfabbe118cc 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -26,6 +26,7 @@
 #include <linux/bpf.h>
 #include "bpf.h"
 #include "libbpf.h"
+#include "nlattr.h"
 #include <linux/rtnetlink.h>
 #include <sys/socket.h>
 #include <errno.h>
@@ -436,6 +437,7 @@ int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags)
 	struct nlmsghdr *nh;
 	struct nlmsgerr *err;
 	socklen_t addrlen;
+	int one = 1;
 
 	memset(&sa, 0, sizeof(sa));
 	sa.nl_family = AF_NETLINK;
@@ -445,6 +447,12 @@ int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags)
 		return -errno;
 	}
 
+	if (setsockopt(sock, SOL_NETLINK, NETLINK_EXT_ACK,
+		       &one, sizeof(one)) < 0) {
+		/* debug/verbose message that it is not supported */
+		fprintf(stderr, "Netlink error reporting not supported\n");
+	}
+
 	if (bind(sock, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
 		ret = -errno;
 		goto cleanup;
@@ -521,6 +529,7 @@ int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags)
 			if (!err->error)
 				continue;
 			ret = err->error;
+			nla_dump_errormsg(nh);
 			goto cleanup;
 		case NLMSG_DONE:
 			break;
diff --git a/tools/lib/bpf/nlattr.c b/tools/lib/bpf/nlattr.c
new file mode 100644
index 000000000000..5cc74fa98049
--- /dev/null
+++ b/tools/lib/bpf/nlattr.c
@@ -0,0 +1,187 @@
+/* SPDX-License-Identifier: LGPL-2.1 */
+
+/*
+ * NETLINK      Netlink attributes
+ *
+ *	This library is free software; you can redistribute it and/or
+ *	modify it under the terms of the GNU Lesser General Public
+ *	License as published by the Free Software Foundation version 2.1
+ *	of the License.
+ *
+ * Copyright (c) 2003-2013 Thomas Graf <tgraf@suug.ch>
+ */
+
+#include <errno.h>
+#include "nlattr.h"
+#include <linux/rtnetlink.h>
+#include <string.h>
+#include <stdio.h>
+
+static uint16_t nla_attr_minlen[NLA_TYPE_MAX+1] = {
+	[NLA_U8]	= sizeof(uint8_t),
+	[NLA_U16]	= sizeof(uint16_t),
+	[NLA_U32]	= sizeof(uint32_t),
+	[NLA_U64]	= sizeof(uint64_t),
+	[NLA_STRING]	= 1,
+	[NLA_FLAG]	= 0,
+};
+
+static int nla_len(const struct nlattr *nla)
+{
+	return nla->nla_len - NLA_HDRLEN;
+}
+
+static struct nlattr *nla_next(const struct nlattr *nla, int *remaining)
+{
+	int totlen = NLA_ALIGN(nla->nla_len);
+
+	*remaining -= totlen;
+	return (struct nlattr *) ((char *) nla + totlen);
+}
+
+static int nla_ok(const struct nlattr *nla, int remaining)
+{
+	return remaining >= sizeof(*nla) &&
+	       nla->nla_len >= sizeof(*nla) &&
+	       nla->nla_len <= remaining;
+}
+
+static void *nla_data(const struct nlattr *nla)
+{
+	return (char *) nla + NLA_HDRLEN;
+}
+
+static int nla_type(const struct nlattr *nla)
+{
+	return nla->nla_type & NLA_TYPE_MASK;
+}
+
+static int validate_nla(struct nlattr *nla, int maxtype,
+			struct nla_policy *policy)
+{
+	struct nla_policy *pt;
+	unsigned int minlen = 0;
+	int type = nla_type(nla);
+
+	if (type < 0 || type > maxtype)
+		return 0;
+
+	pt = &policy[type];
+
+	if (pt->type > NLA_TYPE_MAX)
+		return 0;
+
+	if (pt->minlen)
+		minlen = pt->minlen;
+	else if (pt->type != NLA_UNSPEC)
+		minlen = nla_attr_minlen[pt->type];
+
+	if (nla_len(nla) < minlen)
+		return -1;
+
+	if (pt->maxlen && nla_len(nla) > pt->maxlen)
+		return -1;
+
+	if (pt->type == NLA_STRING) {
+		char *data = nla_data(nla);
+		if (data[nla_len(nla) - 1] != '\0')
+			return -1;
+	}
+
+	return 0;
+}
+
+static inline int nlmsg_len(const struct nlmsghdr *nlh)
+{
+	return nlh->nlmsg_len - NLMSG_HDRLEN;
+}
+
+/**
+ * Create attribute index based on a stream of attributes.
+ * @arg tb		Index array to be filled (maxtype+1 elements).
+ * @arg maxtype		Maximum attribute type expected and accepted.
+ * @arg head		Head of attribute stream.
+ * @arg len		Length of attribute stream.
+ * @arg policy		Attribute validation policy.
+ *
+ * Iterates over the stream of attributes and stores a pointer to each
+ * attribute in the index array using the attribute type as index to
+ * the array. Attribute with a type greater than the maximum type
+ * specified will be silently ignored in order to maintain backwards
+ * compatibility. If \a policy is not NULL, the attribute will be
+ * validated using the specified policy.
+ *
+ * @see nla_validate
+ * @return 0 on success or a negative error code.
+ */
+static int nla_parse(struct nlattr *tb[], int maxtype, struct nlattr *head, int len,
+		     struct nla_policy *policy)
+{
+	struct nlattr *nla;
+	int rem, err;
+
+	memset(tb, 0, sizeof(struct nlattr *) * (maxtype + 1));
+
+	nla_for_each_attr(nla, head, len, rem) {
+		int type = nla_type(nla);
+
+		if (type > maxtype)
+			continue;
+
+		if (policy) {
+			err = validate_nla(nla, maxtype, policy);
+			if (err < 0)
+				goto errout;
+		}
+
+		if (tb[type])
+			fprintf(stderr, "Attribute of type %#x found multiple times in message, "
+				  "previous attribute is being ignored.\n", type);
+
+		tb[type] = nla;
+	}
+
+	err = 0;
+errout:
+	return err;
+}
+
+/* dump netlink extended ack error message */
+int nla_dump_errormsg(struct nlmsghdr *nlh)
+{
+	struct nla_policy extack_policy[NLMSGERR_ATTR_MAX + 1] = {
+		[NLMSGERR_ATTR_MSG]	= { .type = NLA_STRING },
+		[NLMSGERR_ATTR_OFFS]	= { .type = NLA_U32 },
+	};
+	struct nlattr *tb[NLMSGERR_ATTR_MAX + 1], *attr;
+	struct nlmsgerr *err;
+	char *errmsg = NULL;
+	int hlen, alen;
+
+	/* no TLVs, nothing to do here */
+	if (!(nlh->nlmsg_flags & NLM_F_ACK_TLVS))
+		return 0;
+
+	err = (struct nlmsgerr *)NLMSG_DATA(nlh);
+	hlen = sizeof(*err);
+
+	/* if NLM_F_CAPPED is set then the inner err msg was capped */
+	if (!(nlh->nlmsg_flags & NLM_F_CAPPED))
+		hlen += nlmsg_len(&err->msg);
+
+	attr = (struct nlattr *) ((void *) err + hlen);
+	alen = nlh->nlmsg_len - hlen;
+
+	if (nla_parse(tb, NLMSGERR_ATTR_MAX, attr, alen, extack_policy) != 0) {
+		fprintf(stderr,
+			"Failed to parse extended error attributes\n");
+		return 0;
+	}
+
+	if (tb[NLMSGERR_ATTR_MSG])
+		errmsg = (char *) nla_data(tb[NLMSGERR_ATTR_MSG]);
+
+	fprintf(stderr, "Kernel error message: %s\n", errmsg);
+
+	return 0;
+}
diff --git a/tools/lib/bpf/nlattr.h b/tools/lib/bpf/nlattr.h
new file mode 100644
index 000000000000..fa2d015334ef
--- /dev/null
+++ b/tools/lib/bpf/nlattr.h
@@ -0,0 +1,69 @@
+/* SPDX-License-Identifier: LGPL-2.1 */
+
+/*
+ * NETLINK      Netlink attributes
+ *
+ *	This library is free software; you can redistribute it and/or
+ *	modify it under the terms of the GNU Lesser General Public
+ *	License as published by the Free Software Foundation version 2.1
+ *	of the License.
+ *
+ * Copyright (c) 2003-2013 Thomas Graf <tgraf@suug.ch>
+ */
+
+#ifndef __NLATTR_H
+#define __NLATTR_H
+
+#include <linux/netlink.h>
+
+/**
+ * Standard attribute types to specify validation policy
+ */
+enum {
+	NLA_UNSPEC,	/**< Unspecified type, binary data chunk */
+	NLA_U8,		/**< 8 bit integer */
+	NLA_U16,	/**< 16 bit integer */
+	NLA_U32,	/**< 32 bit integer */
+	NLA_U64,	/**< 64 bit integer */
+	NLA_STRING,	/**< NUL terminated character string */
+	NLA_FLAG,	/**< Flag */
+	NLA_MSECS,	/**< Milliseconds (64bit) */
+	NLA_NESTED,	/**< Nested attributes */
+	__NLA_TYPE_MAX,
+};
+
+#define NLA_TYPE_MAX (__NLA_TYPE_MAX - 1)
+
+/**
+ * @ingroup attr
+ * Attribute validation policy.
+ *
+ * See section @core_doc{core_attr_parse,Attribute Parsing} for more details.
+ */
+struct nla_policy {
+	/** Type of attribute or NLA_UNSPEC */
+	uint16_t	type;
+
+	/** Minimal length of payload required */
+	uint16_t	minlen;
+
+	/** Maximal length of payload allowed */
+	uint16_t	maxlen;
+};
+
+/**
+ * @ingroup attr
+ * Iterate over a stream of attributes
+ * @arg pos	loop counter, set to current attribute
+ * @arg head	head of attribute stream
+ * @arg len	length of attribute stream
+ * @arg rem	initialized to len, holds bytes currently remaining in stream
+ */
+#define nla_for_each_attr(pos, head, len, rem) \
+	for (pos = head, rem = len; \
+	     nla_ok(pos, rem); \
+	     pos = nla_next(pos, &(rem)))
+
+int nla_dump_errormsg(struct nlmsghdr *nlh);
+
+#endif /* __NLATTR_H */
-- 
2.15.1

^ permalink raw reply related

* [PATCH 1/4] libbpf: add function to setup XDP
From: Eric Leblond @ 2017-12-27 18:02 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: netdev, daniel, linux-kernel, Eric Leblond
In-Reply-To: <20171227180229.1926-1-eric@regit.org>

Most of the code is taken from set_link_xdp_fd() in bpf_load.c and
slightly modified to be library compliant.

Signed-off-by: Eric Leblond <eric@regit.org>
---
 tools/lib/bpf/bpf.c    | 126 ++++++++++++++++++++++++++++++++++++++++++++++++-
 tools/lib/bpf/libbpf.c |   2 +
 tools/lib/bpf/libbpf.h |   4 ++
 3 files changed, 130 insertions(+), 2 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 5128677e4117..1e3cfe6b9fce 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -25,6 +25,16 @@
 #include <asm/unistd.h>
 #include <linux/bpf.h>
 #include "bpf.h"
+#include "libbpf.h"
+#include <linux/rtnetlink.h>
+#include <sys/socket.h>
+#include <errno.h>
+
+#ifndef IFLA_XDP_MAX
+#define IFLA_XDP	43
+#define IFLA_XDP_FD	1
+#define IFLA_XDP_FLAGS	3
+#endif
 
 /*
  * When building perf, unistd.h is overridden. __NR_bpf is
@@ -46,8 +56,6 @@
 # endif
 #endif
 
-#define min(x, y) ((x) < (y) ? (x) : (y))
-
 static inline __u64 ptr_to_u64(const void *ptr)
 {
 	return (__u64) (unsigned long) ptr;
@@ -413,3 +421,117 @@ int bpf_obj_get_info_by_fd(int prog_fd, void *info, __u32 *info_len)
 
 	return err;
 }
+
+int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags)
+{
+	struct sockaddr_nl sa;
+	int sock, seq = 0, len, ret = -1;
+	char buf[4096];
+	struct nlattr *nla, *nla_xdp;
+	struct {
+		struct nlmsghdr  nh;
+		struct ifinfomsg ifinfo;
+		char             attrbuf[64];
+	} req;
+	struct nlmsghdr *nh;
+	struct nlmsgerr *err;
+	socklen_t addrlen;
+
+	memset(&sa, 0, sizeof(sa));
+	sa.nl_family = AF_NETLINK;
+
+	sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
+	if (sock < 0) {
+		return -errno;
+	}
+
+	if (bind(sock, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
+		ret = -errno;
+		goto cleanup;
+	}
+
+	addrlen = sizeof(sa);
+	if (getsockname(sock, (struct sockaddr *)&sa, &addrlen) < 0) {
+		ret = -errno;
+		goto cleanup;
+	}
+
+	if (addrlen != sizeof(sa)) {
+		ret = errno;
+		goto cleanup;
+	}
+
+	memset(&req, 0, sizeof(req));
+	req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
+	req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
+	req.nh.nlmsg_type = RTM_SETLINK;
+	req.nh.nlmsg_pid = 0;
+	req.nh.nlmsg_seq = ++seq;
+	req.ifinfo.ifi_family = AF_UNSPEC;
+	req.ifinfo.ifi_index = ifindex;
+
+	/* started nested attribute for XDP */
+	nla = (struct nlattr *)(((char *)&req)
+				+ NLMSG_ALIGN(req.nh.nlmsg_len));
+	nla->nla_type = NLA_F_NESTED | IFLA_XDP;
+	nla->nla_len = NLA_HDRLEN;
+
+	/* add XDP fd */
+	nla_xdp = (struct nlattr *)((char *)nla + nla->nla_len);
+	nla_xdp->nla_type = IFLA_XDP_FD;
+	nla_xdp->nla_len = NLA_HDRLEN + sizeof(int);
+	memcpy((char *)nla_xdp + NLA_HDRLEN, &fd, sizeof(fd));
+	nla->nla_len += nla_xdp->nla_len;
+
+	/* if user passed in any flags, add those too */
+	if (flags) {
+		nla_xdp = (struct nlattr *)((char *)nla + nla->nla_len);
+		nla_xdp->nla_type = IFLA_XDP_FLAGS;
+		nla_xdp->nla_len = NLA_HDRLEN + sizeof(flags);
+		memcpy((char *)nla_xdp + NLA_HDRLEN, &flags, sizeof(flags));
+		nla->nla_len += nla_xdp->nla_len;
+	}
+
+	req.nh.nlmsg_len += NLA_ALIGN(nla->nla_len);
+
+	if (send(sock, &req, req.nh.nlmsg_len, 0) < 0) {
+		ret = -errno;
+		goto cleanup;
+	}
+
+	len = recv(sock, buf, sizeof(buf), 0);
+	if (len < 0) {
+		ret = -errno;
+		goto cleanup;
+	}
+
+	for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
+	     nh = NLMSG_NEXT(nh, len)) {
+		if (nh->nlmsg_pid != sa.nl_pid) {
+			ret = -LIBBPF_ERRNO__WRNGPID;
+			goto cleanup;
+		}
+		if (nh->nlmsg_seq != seq) {
+			ret = -LIBBPF_ERRNO__INVSEQ;
+			goto cleanup;
+		}
+		switch (nh->nlmsg_type) {
+		case NLMSG_ERROR:
+			err = (struct nlmsgerr *)NLMSG_DATA(nh);
+			if (!err->error)
+				continue;
+			ret = err->error;
+			goto cleanup;
+		case NLMSG_DONE:
+			break;
+		default:
+			break;
+		}
+	}
+
+	ret = 0;
+
+cleanup:
+	close(sock);
+	return ret;
+}
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index e9c4b7cabcf2..5fe8aaa2123e 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -106,6 +106,8 @@ static const char *libbpf_strerror_table[NR_ERRNO] = {
 	[ERRCODE_OFFSET(PROG2BIG)]	= "Program too big",
 	[ERRCODE_OFFSET(KVER)]		= "Incorrect kernel version",
 	[ERRCODE_OFFSET(PROGTYPE)]	= "Kernel doesn't support this program type",
+	[ERRCODE_OFFSET(WRNGPID)]	= "Wrong pid in netlink message",
+	[ERRCODE_OFFSET(INVSEQ)]	= "Invalid netlink sequence",
 };
 
 int libbpf_strerror(int err, char *buf, size_t size)
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 6e20003109e0..e42f96900318 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -42,6 +42,8 @@ enum libbpf_errno {
 	LIBBPF_ERRNO__PROG2BIG,	/* Program too big */
 	LIBBPF_ERRNO__KVER,	/* Incorrect kernel version */
 	LIBBPF_ERRNO__PROGTYPE,	/* Kernel doesn't support this program type */
+	LIBBPF_ERRNO__WRNGPID,	/* Wrong pid in netlink message */
+	LIBBPF_ERRNO__INVSEQ,	/* Invalid netlink sequence */
 	__LIBBPF_ERRNO__END,
 };
 
@@ -246,4 +248,6 @@ long libbpf_get_error(const void *ptr);
 
 int bpf_prog_load(const char *file, enum bpf_prog_type type,
 		  struct bpf_object **pobj, int *prog_fd);
+
+int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags);
 #endif
-- 
2.15.1

^ permalink raw reply related
