Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH v2] hv_netvsc: automatically name slave VF network device
From: David Miller @ 2017-12-20 18:47 UTC (permalink / raw)
  To: stephen; +Cc: netdev, sthemmin
In-Reply-To: <20171219235930.16916-1-sthemmin@microsoft.com>

From: Stephen Hemminger <stephen@networkplumber.org>
Date: Tue, 19 Dec 2017 15:59:30 -0800

> Rename the VF device to ethX_vf based on the ethX as the
> synthetic device.  This eliminates the need for delay on setup,
> and the PCI (udev based) naming is not reproducible on Hyper-V
> anyway. The name of the VF does not matter since all control
> operations take place the primary device. It does make the
> user experience better to associate the names.
> 
> Based on feedback from all.systems.go talk.
> 
> Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
> ---
> v2 - also handle case where synthetic device gets renamed
>      and the case where rename causes clash with pre-existing device name

Besides Jakub's objections (which I agree with) it is just so
unexpected that changing the name of device X has the side effect of
changing the name of device Y.

This is why we do this policy stuff in userspace.  The kernel has no
business messing with these netdev names.

This patch means that if a user decides to add his own custom udev
rules for the VF (for whatever reason, it doesn't have to make sense
to you or me) we're just going to undo them.  That's bad.

^ permalink raw reply

* Re: [pull request][net 00/14] Mellanox, mlx5 fixes 2017-12-19
From: David Miller @ 2017-12-20 18:42 UTC (permalink / raw)
  To: saeedm; +Cc: netdev
In-Reply-To: <20171219222456.29627-1-saeedm@mellanox.com>

From: Saeed Mahameed <saeedm@mellanox.com>
Date: Wed, 20 Dec 2017 00:24:42 +0200

> The follwoing series includes some fixes for mlx5 core and etherent
> driver.
> 
> Please pull and let me know if there is any problem.

Pulled.

> For -stable:
> 
> kernels >= v4.7.y
>     ("net/mlx5e: Fix possible deadlock of VXLAN lock")
>     ("net/mlx5e: Add refcount to VXLAN structure")
>     ("net/mlx5e: Prevent possible races in VXLAN control flow")
>     ("net/mlx5e: Fix features check of IPv6 traffic")
> 
> kernels >= v4.9.y
>     ("net/mlx5: Fix error flow in CREATE_QP command")
>     ("net/mlx5: Fix rate limit packet pacing naming and struct")
> 
> kernels >= v4.13.y
>     ("net/mlx5: FPGA, return -EINVAL if size is zero")
> 
> kernels >= v4.14.y
>     ("Revert "mlx5: move affinity hints assignments to generic code")

Queued up.

Thanks.

^ permalink raw reply

* [PATCH bpf-next] tools/bpf: adjust rlimit RLIMIT_MEMLOCK for test_dev_cgroup
From: Yonghong Song @ 2017-12-20 18:37 UTC (permalink / raw)
  To: ast, daniel, guro, netdev; +Cc: kernel-team

The default rlimit RLIMIT_MEMLOCK is 64KB. In certain cases,
e.g. in a test machine mimicking our production system, this test may
fail due to unable to charge the required memory for prog load:

  $ ./test_dev_cgroup
  libbpf: load bpf program failed: Operation not permitted
  libbpf: failed to load program 'cgroup/dev'
  libbpf: failed to load object './dev_cgroup.o'
  Failed to load DEV_CGROUP program
  ...

Changing the default rlimit RLIMIT_MEMLOCK to unlimited
makes the test pass.

This patch also fixed a problem where when bpf_prog_load fails,
cleanup_cgroup_environment() should not be called since
setup_cgroup_environment() has not been invoked. Otherwise,
the following confusing message will appear:
  ...
  (/home/yhs/local/linux/tools/testing/selftests/bpf/cgroup_helpers.c:95:
   errno: No such file or directory) Opening Cgroup Procs: /mnt/cgroup.procs
  ...

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 tools/testing/selftests/bpf/test_dev_cgroup.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/test_dev_cgroup.c b/tools/testing/selftests/bpf/test_dev_cgroup.c
index 02c85d6..c1535b3 100644
--- a/tools/testing/selftests/bpf/test_dev_cgroup.c
+++ b/tools/testing/selftests/bpf/test_dev_cgroup.c
@@ -10,6 +10,8 @@
 #include <string.h>
 #include <errno.h>
 #include <assert.h>
+#include <sys/time.h>
+#include <sys/resource.h>
 
 #include <linux/bpf.h>
 #include <bpf/bpf.h>
@@ -23,15 +25,19 @@
 
 int main(int argc, char **argv)
 {
+	struct rlimit limit  = { RLIM_INFINITY, RLIM_INFINITY };
 	struct bpf_object *obj;
 	int error = EXIT_FAILURE;
 	int prog_fd, cgroup_fd;
 	__u32 prog_cnt;
 
+	if (setrlimit(RLIMIT_MEMLOCK, &limit) < 0)
+		perror("Unable to lift memlock rlimit");
+
 	if (bpf_prog_load(DEV_CGROUP_PROG, BPF_PROG_TYPE_CGROUP_DEVICE,
 			  &obj, &prog_fd)) {
 		printf("Failed to load DEV_CGROUP program\n");
-		goto err;
+		goto out;
 	}
 
 	if (setup_cgroup_environment()) {
@@ -89,5 +95,6 @@ int main(int argc, char **argv)
 err:
 	cleanup_cgroup_environment();
 
+out:
 	return error;
 }
-- 
2.9.5

^ permalink raw reply related

* Re: [PATCH net 0/2] cls_bpf: fix offload state tracking with block callbacks
From: David Miller @ 2017-12-20 18:32 UTC (permalink / raw)
  To: jakub.kicinski; +Cc: netdev, daniel, jiri, oss-drivers
In-Reply-To: <20171219213214.1084-1-jakub.kicinski@netronome.com>

From: Jakub Kicinski <jakub.kicinski@netronome.com>
Date: Tue, 19 Dec 2017 13:32:12 -0800

> After introduction of block callbacks classifiers can no longer track
> offload state.  cls_bpf used to do that in an attempt to move common
> code from drivers to the core.  Remove that functionality and fix
> drivers.
> 
> The user-visible bug this is fixing is that trying to offload a second
> filter would trigger a spurious DESTROY and in turn disable the already
> installed one.

Series applied, thanks for the heads up about the net-next conflict.

^ permalink raw reply

* Re: [RFC PATCH net-next] tools/bpf: fix build with binutils >= 2.28
From: Roman Gushchin @ 2017-12-20 18:32 UTC (permalink / raw)
  To: Quentin Monnet
  Cc: netdev, linux-kernel, kernel-team, Jakub Kicinski,
	Alexei Starovoitov, Daniel Borkmann
In-Reply-To: <140ea0e9-964f-6f62-0721-88c9f8905cd3@netronome.com>

On Tue, Dec 19, 2017 at 04:22:51PM +0000, Quentin Monnet wrote:
> 2017-12-19 16:10 UTC+0000 ~ Roman Gushchin <guro@fb.com>
> > On Tue, Dec 19, 2017 at 03:57:02PM +0000, Quentin Monnet wrote:
> >> Hi Roman, thanks for working on this!
> >>
> >>
> >> I discussed this issue with Jakub recently, and one suggestion he had
> >> was to look in tools/build/feature to add a new "feature", by trying to
> >> compile short programs, for making the distinction between binutils
> >> versions. It probably requires more work, but could be more robust than
> >> parsing the version from the command line?
> > 
> > Hm, might be an option. Parsing readelf output is pretty ugly, here I agree.
> > In general it feels more like a binutils issue, so we have to workaround it
> > in either way.
> > 
> > Is Jakub or someone else working on it?
> > 
> > Thanks!
> > 
> 
> Jakub isn't. On our side, I noticed last week that there was this change
> in binutils, and started to have a look at how these "features" work.
> But I have nothing that works so far, so feel free to tackle this.
> 
> Quentin

Hi Quentin!

Can you, please, check that the patch below works in your environment.

Thanks!

--


>From b08deabf42e4c143b9e0eec8c49714e4d2c928e3 Mon Sep 17 00:00:00 2001
From: Roman Gushchin <guro@fb.com>
Date: Wed, 20 Dec 2017 13:27:32 +0000
Subject: [RFC PATCH net-next] tools/bpftool: fix bpftool build with bintutils
 >= 2.8

Bpftool build is broken with binutils version 2.28 and later.
The cause is commit 003ca0fd2286 ("Refactor disassembler selection")
in the binutils repo, which changed the disassembler() function
signature.

Fix this by adding a new "feature" to the tools/build/features
infrastructure and make it responsible for decision which
disassembler() function signature to use.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
---
 tools/bpf/Makefile                      | 18 ++++++++++++++++++
 tools/bpf/bpf_jit_disasm.c              |  7 +++++++
 tools/bpf/bpftool/Makefile              | 13 +++++++++++++
 tools/bpf/bpftool/jit_disasm.c          |  7 +++++++
 tools/build/feature/Makefile            |  4 ++++
 tools/build/feature/test-disassembler.c | 15 +++++++++++++++
 6 files changed, 64 insertions(+)
 create mode 100644 tools/build/feature/test-disassembler.c

diff --git a/tools/bpf/Makefile b/tools/bpf/Makefile
index 07a6697466ef..c62b3a311486 100644
--- a/tools/bpf/Makefile
+++ b/tools/bpf/Makefile
@@ -9,6 +9,24 @@ MAKE = make
 CFLAGS += -Wall -O2
 CFLAGS += -D__EXPORTED_HEADERS__ -I../../include/uapi -I../../include
 
+ifeq ($(srctree),)
+srctree := $(patsubst %/,%,$(dir $(CURDIR)))
+srctree := $(patsubst %/,%,$(dir $(srctree)))
+endif
+
+FEATURE_TESTS = disassembler
+FEATURE_DISPLAY = disassembler
+
+ifeq ($(FEATURES_DUMP),)
+include $(srctree)/tools/build/Makefile.feature
+else
+include $(FEATURES_DUMP)
+endif
+
+ifeq ($(feature-disassembler), 1)
+CFLAGS += -DNEW_DISSASSEMBLER_SIGNATURE
+endif
+
 %.yacc.c: %.y
 	$(YACC) -o $@ -d $<
 
diff --git a/tools/bpf/bpf_jit_disasm.c b/tools/bpf/bpf_jit_disasm.c
index 75bf526a0168..a5f4dbacdb11 100644
--- a/tools/bpf/bpf_jit_disasm.c
+++ b/tools/bpf/bpf_jit_disasm.c
@@ -72,7 +72,14 @@ static void get_asm_insns(uint8_t *image, size_t len, int opcodes)
 
 	disassemble_init_for_target(&info);
 
+#ifdef NEW_DISSASSEMBLER_SIGNATURE
+	disassemble = disassembler(bfd_get_arch(bfdf),
+				   bfd_big_endian(bfdf),
+				   bfd_get_mach(bfdf),
+				   bfdf);
+#else
 	disassemble = disassembler(bfdf);
+#endif
 	assert(disassemble);
 
 	do {
diff --git a/tools/bpf/bpftool/Makefile b/tools/bpf/bpftool/Makefile
index 3f17ad317512..9c089cfa5f3f 100644
--- a/tools/bpf/bpftool/Makefile
+++ b/tools/bpf/bpftool/Makefile
@@ -43,6 +43,19 @@ LIBS = -lelf -lbfd -lopcodes $(LIBBPF)
 INSTALL ?= install
 RM ?= rm -f
 
+FEATURE_TESTS = disassembler
+FEATURE_DISPLAY = disassembler
+
+ifeq ($(FEATURES_DUMP),)
+include $(srctree)/tools/build/Makefile.feature
+else
+include $(FEATURES_DUMP)
+endif
+
+ifeq ($(feature-disassembler), 1)
+CFLAGS += -DNEW_DISSASSEMBLER_SIGNATURE
+endif
+
 include $(wildcard *.d)
 
 all: $(OUTPUT)bpftool
diff --git a/tools/bpf/bpftool/jit_disasm.c b/tools/bpf/bpftool/jit_disasm.c
index 1551d3918d4c..8295e2f14ed7 100644
--- a/tools/bpf/bpftool/jit_disasm.c
+++ b/tools/bpf/bpftool/jit_disasm.c
@@ -107,7 +107,14 @@ void disasm_print_insn(unsigned char *image, ssize_t len, int opcodes)
 
 	disassemble_init_for_target(&info);
 
+#ifdef NEW_DISSASSEMBLER_SIGNATURE
+	disassemble = disassembler(bfd_get_arch(bfdf),
+				   bfd_big_endian(bfdf),
+				   bfd_get_mach(bfdf),
+				   bfdf);
+#else
 	disassemble = disassembler(bfdf);
+#endif
 	assert(disassemble);
 
 	if (json_output)
diff --git a/tools/build/feature/Makefile b/tools/build/feature/Makefile
index 96982640fbf8..91f937943918 100644
--- a/tools/build/feature/Makefile
+++ b/tools/build/feature/Makefile
@@ -13,6 +13,7 @@ FILES=                                          \
          test-hello.bin                         \
          test-libaudit.bin                      \
          test-libbfd.bin                        \
+         test-disassembler.bin                  \
          test-liberty.bin                       \
          test-liberty-z.bin                     \
          test-cplus-demangle.bin                \
@@ -188,6 +189,9 @@ $(OUTPUT)test-libpython-version.bin:
 $(OUTPUT)test-libbfd.bin:
 	$(BUILD) -DPACKAGE='"perf"' -lbfd -lz -liberty -ldl
 
+$(OUTPUT)test-disassembler.bin:
+	$(BUILD) -lopcodes
+
 $(OUTPUT)test-liberty.bin:
 	$(CC) $(CFLAGS) -Wall -Werror -o $@ test-libbfd.c -DPACKAGE='"perf"' $(LDFLAGS) -lbfd -ldl -liberty
 
diff --git a/tools/build/feature/test-disassembler.c b/tools/build/feature/test-disassembler.c
new file mode 100644
index 000000000000..45ce65cfddf0
--- /dev/null
+++ b/tools/build/feature/test-disassembler.c
@@ -0,0 +1,15 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <bfd.h>
+#include <dis-asm.h>
+
+int main(void)
+{
+	bfd *abfd = bfd_openr(NULL, NULL);
+
+	disassembler(bfd_get_arch(abfd),
+		     bfd_big_endian(abfd),
+		     bfd_get_mach(abfd),
+		     abfd);
+
+	return 0;
+}
-- 
2.14.3

^ permalink raw reply related

* Re: RCU callback crashes
From: Cong Wang @ 2017-12-20 18:31 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Jiri Pirko, netdev@vger.kernel.org
In-Reply-To: <CAM_iQpWUjfv2-Sirmdb5WfV4pZ4uF0m7=HR5YGWaKxb4KHp8gQ@mail.gmail.com>

On Wed, Dec 20, 2017 at 10:17 AM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> I guess it is q->miniqp which is freed in qdisc_graft() without properly
> waiting for rcu readers?

It is probably so, the call_rcu_bh(&miniq_old->rcu, mini_qdisc_rcu_func)
in the end of mini_qdisc_pair_swap() is invoked on miniq_old->rcu,
but miniq is being freed, no rcu barrier waits for it...

You can try to add a rcu_barrier_bh() at the end to see if this crash
is gone, but I don't think people like adding yet another rcu barrier...

^ permalink raw reply

* (U.N.O/W.B.O/20/12/2017/1982/09/05)
From: UNCU @ 2017-12-20 18:19 UTC (permalink / raw)
  To: Recipients

[-- Attachment #1: Mail message body --]
[-- Type: text/plain, Size: 57 bytes --]

Please view the attached file for your compensation code.

[-- Attachment #2: United Nations Compensation Unit.docx --]
[-- Type: application/octet-stream, Size: 17648 bytes --]

^ permalink raw reply

* Re: RCU callback crashes
From: Cong Wang @ 2017-12-20 18:17 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Jiri Pirko, netdev@vger.kernel.org
In-Reply-To: <20171219223404.03786d66@cakuba.netronome.com>

On Tue, Dec 19, 2017 at 10:34 PM, Jakub Kicinski <kubakici@wp.pl> wrote:
> Ah, no object debug but KASAN on produces this:
>


I bet it is an ingress qdisc which is being freed?



> [   39.268209] BUG: KASAN: use-after-free in cpu_needs_another_gp+0x246/0x2b0
> [   39.275965] Read of size 8 at addr ffff8803aa64f138 by task swapper/13/0
> [   39.283524]
> [   39.285256] CPU: 13 PID: 0 Comm: swapper/13 Not tainted 4.15.0-rc3-perf-00955-g1d0b01347dd5-dirty #8
> [   39.295535] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.3.4 11/08/2016
> [   39.303969] Call Trace:
> [   39.306769]  <IRQ>
> [   39.309088]  dump_stack+0xa6/0x118
> [   39.312957]  ? _atomic_dec_and_lock+0xe8/0xe8
> [   39.317895]  ? cpu_needs_another_gp+0x246/0x2b0
> [   39.323030]  print_address_description+0x6a/0x270
> [   39.328380]  ? cpu_needs_another_gp+0x246/0x2b0
> [   39.333510]  kasan_report+0x23f/0x350
> [   39.337672]  cpu_needs_another_gp+0x246/0x2b0
> ...
> [   39.383026]  rcu_process_callbacks+0x1a0/0x620
> ...


This is confusing.

I guess it is q->miniqp which is freed in qdisc_graft() without properly
waiting for rcu readers?


> [   39.426713]  __do_softirq+0x17f/0x4de
> ...
> [   39.463841]  irq_exit+0xe1/0xf0
> [   39.467437]  smp_apic_timer_interrupt+0xd9/0x290
> [   39.472685]  ? smp_call_function_single_interrupt+0x230/0x230
> [   39.479195]  ? smp_reschedule_interrupt+0x240/0x240
> [   39.484736]  apic_timer_interrupt+0x8c/0xa0
> [   39.489497]  </IRQ>
> [   39.491929] RIP: 0010:cpuidle_enter_state+0x12a/0x510
> [   39.497660] RSP: 0018:ffff88086bf9fd08 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff11
> [   39.506228] RAX: 0000000000000000 RBX: ffffe8ffffb060e0 RCX: ffffffff921329f5
> [   39.514291] RDX: dffffc0000000000 RSI: dffffc0000000000 RDI: ffff88086f3246e8
> [   39.522354] RBP: 1ffff1010d7f3fa6 R08: fffffbfff2742768 R09: fffffbfff2742768
> [   39.530418] R10: ffff88086bf9fcc8 R11: fffffbfff2742767 R12: 0000000924148b4b
> [   39.538480] R13: 0000000000000004 R14: 0000000000000004 R15: ffffffff9383eb80
> [   39.546545]  ? sched_idle_set_state+0x25/0x30
> [   39.551502]  ? cpuidle_enter_state+0x106/0x510
> [   39.556556]  ? cpuidle_enter_s2idle+0x130/0x130
> [   39.561706]  ? rcu_eqs_enter_common.constprop.62+0xd1/0x1e0
> [   39.568037]  ? rcu_gp_init+0xf70/0xf70
> [   39.572331]  ? sched_set_stop_task+0x160/0x160
> [   39.577384]  do_idle+0x1af/0x200
> [   39.581076]  cpu_startup_entry+0xd2/0xe0
> [   39.585545]  ? cpu_in_idle+0x20/0x20
> [   39.589626]  ? _raw_spin_trylock+0xe0/0xe0
> [   39.594292]  ? memcpy+0x34/0x50
> [   39.597890]  start_secondary+0x271/0x2b0
> [   39.602361]  ? set_cpu_sibling_map+0x840/0x840
> [   39.607416]  secondary_startup_64+0xa5/0xb0
> [   39.612180]
> [   39.613929] Allocated by task 1358:
> [   39.617914]  __kmalloc_node+0x183/0x2c0
> [   39.622290]  qdisc_alloc+0xbd/0x3f0
> [   39.626274]  qdisc_create+0xd8/0x720
> [   39.630355]  tc_modify_qdisc+0x657/0x910
> [   39.634826]  rtnetlink_rcv_msg+0x37c/0x7e0
> [   39.639491]  netlink_rcv_skb+0x122/0x230
> [   39.643960]  netlink_unicast+0x2ae/0x360
> [   39.648443]  netlink_sendmsg+0x5d5/0x620
> [   39.652915]  sock_sendmsg+0x64/0x80
> [   39.656900]  ___sys_sendmsg+0x4a8/0x500
> [   39.661272]  __sys_sendmsg+0xa9/0x140
> [   39.665450]  entry_SYSCALL_64_fastpath+0x1e/0x81
> [   39.670695]
> [   39.672441] Freed by task 1370:
> [   39.676052]  kfree+0x8d/0x1c0
> [   39.679454]  qdisc_graft+0x208/0x670
> [   39.683535]  tc_get_qdisc+0x229/0x350
> [   39.687713]  rtnetlink_rcv_msg+0x37c/0x7e0
> [   39.692411]  netlink_rcv_skb+0x122/0x230
> [   39.696881]  netlink_unicast+0x2ae/0x360
> [   39.701350]  netlink_sendmsg+0x5d5/0x620
> [   39.705819]  sock_sendmsg+0x64/0x80
> [   39.709801]  ___sys_sendmsg+0x4a8/0x500
> [   39.714172]  __sys_sendmsg+0xa9/0x140
> [   39.718351]  entry_SYSCALL_64_fastpath+0x1e/0x81
> [   39.723597]
> [   39.725347] The buggy address belongs to the object at ffff8803aa64ef80
> [   39.725347]  which belongs to the cache kmalloc-512 of size 512
> [   39.739453] The buggy address is located 440 bytes inside of
> [   39.739453]  512-byte region [ffff8803aa64ef80, ffff8803aa64f180)
> [   39.752684] The buggy address belongs to the page:
> [   39.758127] page:0000000042b3124b count:1 mapcount:0 mapping:          (null) index:0x0 compound_mapcount: 0
> [   39.769222] flags: 0x2ffff0000008100(slab|head)
> [   39.774365] raw: 02ffff0000008100 0000000000000000 0000000000000000 0000000180190019
> [   39.783129] raw: dead000000000100 dead000000000200 ffff8803afc0ed80 0000000000000000
> [   39.791986] page dumped because: kasan: bad access detected
> [   39.798300]
> [   39.800063] Memory state around the buggy address:
> [   39.805503]  ffff8803aa64f000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [   39.813684]  ffff8803aa64f080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [   39.821866] >ffff8803aa64f100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [   39.830045]                                         ^
> [   39.835778]  ffff8803aa64f180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> [   39.843958]  ffff8803aa64f200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>

^ permalink raw reply

* Re: [PATCH net v2] openvswitch: Fix pop_vlan action for double tagged frames
From: Eric Garver @ 2017-12-20 18:13 UTC (permalink / raw)
  To: Jiri Benc; +Cc: netdev, ovs-dev
In-Reply-To: <20171220184117.19dfa575@redhat.com>

On Wed, Dec 20, 2017 at 06:41:17PM +0100, Jiri Benc wrote:
> On Wed, 20 Dec 2017 10:39:32 -0500, Eric Garver wrote:
> > +	if (is_flow_key_valid(key) && key->eth.vlan.tci && key->eth.cvlan.tci)
> 
> Maybe (key->eth.vlan.tci & htons(VLAN_TAG_PRESENT)) for consistency
> with the rest of the code? But it's just nitpicking.
> 
> The real problem here is when a double tagged packet leaves the ovs
> bridge, it won't have the skb->protocol that the kernel expects: it
> will be ethertype of the payload, while my understanding is it should
> be the inner tpid, right?

Right.

> 
> This patch fixes that nicely for the pop vlan case. But what about
> other cases? It seems to me that we need to add the logic to
> key_extract.
> 

The part I was missing is; before encap into the L3 tunnel all the VLAN
tags must be explicitly popped.

Setting skb->protocol to the TPID for double tagged frames means the pop
operations shift the payload ethertype into skb->protocol.

I'll send a v3 that does this is key_extract().

^ permalink raw reply

* Re: [PATCH v5 3/6] perf: implement pmu perf_kprobe
From: Song Liu @ 2017-12-20 18:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, mingo@redhat.com, David Miller,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	Daniel Borkmann, Kernel Team
In-Reply-To: <20171220101426.cquh5uv4bgj4bk7y@hirez.programming.kicks-ass.net>


> On Dec 20, 2017, at 2:14 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Wed, Dec 20, 2017 at 11:03:01AM +0100, Peter Zijlstra wrote:
>> On Wed, Dec 06, 2017 at 02:45:15PM -0800, Song Liu wrote:
>>> @@ -8537,7 +8620,7 @@ static int perf_event_set_filter(struct perf_event *event, void __user *arg)
>>> 	char *filter_str;
>>> 	int ret = -EINVAL;
>>> 
>>> -	if ((event->attr.type != PERF_TYPE_TRACEPOINT ||
>>> +	if ((!perf_event_is_tracing(event) ||
>>> 	     !IS_ENABLED(CONFIG_EVENT_TRACING)) &&
>>> 	    !has_addr_filter(event))
>>> 		return -EINVAL;
>> 
>> You actually missed an instance later in this same function... fixing
>> that.
> 
> 
> @@ -8518,23 +8601,19 @@ perf_event_set_addr_filter(struct perf_e
> 
> static int perf_event_set_filter(struct perf_event *event, void __user *arg)
> {
> -	char *filter_str;
> 	int ret = -EINVAL;
> -
> -	if ((event->attr.type != PERF_TYPE_TRACEPOINT ||
> -	    !IS_ENABLED(CONFIG_EVENT_TRACING)) &&
> -	    !has_addr_filter(event))
> -		return -EINVAL;
> +	char *filter_str;
> 
> 	filter_str = strndup_user(arg, PAGE_SIZE);
> 	if (IS_ERR(filter_str))
> 		return PTR_ERR(filter_str);
> 
> -	if (IS_ENABLED(CONFIG_EVENT_TRACING) &&
> -	    event->attr.type == PERF_TYPE_TRACEPOINT)
> -		ret = ftrace_profile_set_filter(event, event->attr.config,
> -						filter_str);
> -	else if (has_addr_filter(event))
> +#ifdef CONFIG_EVENT_TRACING
> +	if (perf_event_is_tracing(event))
> +		ret = ftrace_profile_set_filter(event, event->attr.config, filter_str);
> +	else
> +#endif
> +	if (has_addr_filter(event))
> 		ret = perf_event_set_addr_filter(event, filter_str);
> 
> 	kfree(filter_str);
> 
> 
> 
> Is that right?

Yeah, this is right and neat. Thanks a lot for your help on this. 

I think there is one more thing to change:

diff --git i/kernel/events/core.c w/kernel/events/core.c
index a906f30..516ff9b 100644
--- i/kernel/events/core.c
+++ w/kernel/events/core.c
@@ -8226,7 +8226,7 @@ static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd)

 static void perf_event_free_bpf_prog(struct perf_event *event)
 {
-       if (event->attr.type != PERF_TYPE_TRACEPOINT) {
+       if (!perf_event_is_tracing(event)) {
                perf_event_free_bpf_handler(event);
                return;
        }

Thanks,
Song

^ permalink raw reply related

* Re: [PATCH v2] net: amd-xgbe: Get rid of custom hex_dump_to_buffer()
From: David Miller @ 2017-12-20 18:05 UTC (permalink / raw)
  To: andriy.shevchenko; +Cc: thomas.lendacky, netdev
In-Reply-To: <20171219212215.11561-1-andriy.shevchenko@linux.intel.com>

From: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Date: Tue, 19 Dec 2017 23:22:15 +0200

> Get rid of yet another custom hex_dump_to_buffer().
> 
> The output is slightly changed, i.e. each byte followed by white space.
> 
> Note, we don't use print_hex_dump() here since the original code uses
> nedev_dbg().
> 
> Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>

Applied to net-next.

^ permalink raw reply

* Re: RCU callback crashes
From: John Fastabend @ 2017-12-20 18:04 UTC (permalink / raw)
  To: Jakub Kicinski, Jiri Pirko; +Cc: netdev@vger.kernel.org, Cong Wang
In-Reply-To: <20171219223404.03786d66@cakuba.netronome.com>

On 12/19/2017 10:34 PM, Jakub Kicinski wrote:
> On Tue, 19 Dec 2017 22:22:27 -0800, Jakub Kicinski wrote:
>>>> I get this:    
>>>
>>> Could you try to run it with kasan on?  
>>
>> I didn't manage to reproduce it with KASAN on so far :(  Even enabling
>> object debugging to get the second splat in my email (which is more
>> useful) actually makes the crash go away, I only see the warning...
> 
> Ah, no object debug but KASAN on produces this:
> 

@Jakub, This is with mq and pfifo_fast I guess?

> [   39.268209] BUG: KASAN: use-after-free in cpu_needs_another_gp+0x246/0x2b0
> [   39.275965] Read of size 8 at addr ffff8803aa64f138 by task swapper/13/0
> [   39.283524] 
> [   39.285256] CPU: 13 PID: 0 Comm: swapper/13 Not tainted 4.15.0-rc3-perf-00955-g1d0b01347dd5-dirty #8
> [   39.295535] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.3.4 11/08/2016
> [   39.303969] Call Trace:
> [   39.306769]  <IRQ>
> [   39.309088]  dump_stack+0xa6/0x118
> [   39.312957]  ? _atomic_dec_and_lock+0xe8/0xe8
> [   39.317895]  ? cpu_needs_another_gp+0x246/0x2b0
> [   39.323030]  print_address_description+0x6a/0x270
> [   39.328380]  ? cpu_needs_another_gp+0x246/0x2b0
> [   39.333510]  kasan_report+0x23f/0x350
> [   39.337672]  cpu_needs_another_gp+0x246/0x2b0
> ...
> [   39.383026]  rcu_process_callbacks+0x1a0/0x620
> ...
> [   39.426713]  __do_softirq+0x17f/0x4de
> ...
> [   39.463841]  irq_exit+0xe1/0xf0
> [   39.467437]  smp_apic_timer_interrupt+0xd9/0x290
> [   39.472685]  ? smp_call_function_single_interrupt+0x230/0x230
> [   39.479195]  ? smp_reschedule_interrupt+0x240/0x240
> [   39.484736]  apic_timer_interrupt+0x8c/0xa0
> [   39.489497]  </IRQ>
> [   39.491929] RIP: 0010:cpuidle_enter_state+0x12a/0x510
> [   39.497660] RSP: 0018:ffff88086bf9fd08 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff11
> [   39.506228] RAX: 0000000000000000 RBX: ffffe8ffffb060e0 RCX: ffffffff921329f5
> [   39.514291] RDX: dffffc0000000000 RSI: dffffc0000000000 RDI: ffff88086f3246e8
> [   39.522354] RBP: 1ffff1010d7f3fa6 R08: fffffbfff2742768 R09: fffffbfff2742768
> [   39.530418] R10: ffff88086bf9fcc8 R11: fffffbfff2742767 R12: 0000000924148b4b
> [   39.538480] R13: 0000000000000004 R14: 0000000000000004 R15: ffffffff9383eb80
> [   39.546545]  ? sched_idle_set_state+0x25/0x30
> [   39.551502]  ? cpuidle_enter_state+0x106/0x510
> [   39.556556]  ? cpuidle_enter_s2idle+0x130/0x130
> [   39.561706]  ? rcu_eqs_enter_common.constprop.62+0xd1/0x1e0
> [   39.568037]  ? rcu_gp_init+0xf70/0xf70
> [   39.572331]  ? sched_set_stop_task+0x160/0x160
> [   39.577384]  do_idle+0x1af/0x200
> [   39.581076]  cpu_startup_entry+0xd2/0xe0
> [   39.585545]  ? cpu_in_idle+0x20/0x20
> [   39.589626]  ? _raw_spin_trylock+0xe0/0xe0
> [   39.594292]  ? memcpy+0x34/0x50
> [   39.597890]  start_secondary+0x271/0x2b0
> [   39.602361]  ? set_cpu_sibling_map+0x840/0x840
> [   39.607416]  secondary_startup_64+0xa5/0xb0
> [   39.612180] 
> [   39.613929] Allocated by task 1358:
> [   39.617914]  __kmalloc_node+0x183/0x2c0
> [   39.622290]  qdisc_alloc+0xbd/0x3f0
> [   39.626274]  qdisc_create+0xd8/0x720
> [   39.630355]  tc_modify_qdisc+0x657/0x910
> [   39.634826]  rtnetlink_rcv_msg+0x37c/0x7e0
> [   39.639491]  netlink_rcv_skb+0x122/0x230
> [   39.643960]  netlink_unicast+0x2ae/0x360
> [   39.648443]  netlink_sendmsg+0x5d5/0x620
> [   39.652915]  sock_sendmsg+0x64/0x80
> [   39.656900]  ___sys_sendmsg+0x4a8/0x500
> [   39.661272]  __sys_sendmsg+0xa9/0x140
> [   39.665450]  entry_SYSCALL_64_fastpath+0x1e/0x81
> [   39.670695] 
> [   39.672441] Freed by task 1370:
> [   39.676052]  kfree+0x8d/0x1c0
> [   39.679454]  qdisc_graft+0x208/0x670
> [   39.683535]  tc_get_qdisc+0x229/0x350
> [   39.687713]  rtnetlink_rcv_msg+0x37c/0x7e0
> [   39.692411]  netlink_rcv_skb+0x122/0x230
> [   39.696881]  netlink_unicast+0x2ae/0x360
> [   39.701350]  netlink_sendmsg+0x5d5/0x620
> [   39.705819]  sock_sendmsg+0x64/0x80
> [   39.709801]  ___sys_sendmsg+0x4a8/0x500
> [   39.714172]  __sys_sendmsg+0xa9/0x140
> [   39.718351]  entry_SYSCALL_64_fastpath+0x1e/0x81
> [   39.723597] 
> [   39.725347] The buggy address belongs to the object at ffff8803aa64ef80
> [   39.725347]  which belongs to the cache kmalloc-512 of size 512
> [   39.739453] The buggy address is located 440 bytes inside of
> [   39.739453]  512-byte region [ffff8803aa64ef80, ffff8803aa64f180)
> [   39.752684] The buggy address belongs to the page:
> [   39.758127] page:0000000042b3124b count:1 mapcount:0 mapping:          (null) index:0x0 compound_mapcount: 0
> [   39.769222] flags: 0x2ffff0000008100(slab|head)
> [   39.774365] raw: 02ffff0000008100 0000000000000000 0000000000000000 0000000180190019
> [   39.783129] raw: dead000000000100 dead000000000200 ffff8803afc0ed80 0000000000000000
> [   39.791986] page dumped because: kasan: bad access detected
> [   39.798300] 
> [   39.800063] Memory state around the buggy address:
> [   39.805503]  ffff8803aa64f000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [   39.813684]  ffff8803aa64f080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [   39.821866] >ffff8803aa64f100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [   39.830045]                                         ^
> [   39.835778]  ffff8803aa64f180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> [   39.843958]  ffff8803aa64f200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 

So with lockless qdisc support we really do need to wait a
RCU grace period before free'ing the qdisc. I missed this initially
in the lockless qdisc set but we need to revert this,

commit 752fbcc33405d6f8249465e4b2c4e420091bb825
Author: Cong Wang <xiyou.wangcong@gmail.com>
Date:   Tue Sep 19 13:15:42 2017 -0700

    net_sched: no need to free qdisc in RCU callback
    
    gen estimator has been rewritten in commit 1c0d32fde5bd
    ("net_sched: gen_estimator: complete rewrite of rate estimators"),
    the caller no longer needs to wait for a grace period. So this
    patch gets rid of it.
    
    Cc: Jamal Hadi Salim <jhs@mojatatu.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
    Acked-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Thanks,
John

^ permalink raw reply

* Re: [PATCH net-next] net: Clarify dev_weight documentation for LRO and GRO_HW.
From: David Miller @ 2017-12-20 18:04 UTC (permalink / raw)
  To: michael.chan; +Cc: netdev
In-Reply-To: <1513717976-24263-1-git-send-email-michael.chan@broadcom.com>

From: Michael Chan <michael.chan@broadcom.com>
Date: Tue, 19 Dec 2017 16:12:56 -0500

> Reported-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> Signed-off-by: Michael Chan <michael.chan@broadcom.com>

Applied, thanks a lot.

^ permalink raw reply

* Re: [PATCH net V2] net: always reevalulate autoflowlabel setting for reset packet
From: David Miller @ 2017-12-20 18:03 UTC (permalink / raw)
  To: shli; +Cc: netdev, Kernel-team, shli, kafai
In-Reply-To: <b44762a8392137c6008e2f7490cbd0853f0015ac.1513716787.git.shli@fb.com>

From: Shaohua Li <shli@kernel.org>
Date: Tue, 19 Dec 2017 12:58:21 -0800

> From: Shaohua Li <shli@fb.com>
> 
> ipv6_pinfo.autoflowlabel is set in sock creation. Later if we change
> sysctl.ip6.auto_flowlabels, the ipv6_pinfo.autoflowlabel isn't changed,
> so the sock will keep the old behavior in terms of auto flowlabel. Reset
> packet is suffering from this problem, because reset packset is sent
> from a special control socket, which is created at boot time. Since
> sysctl.ipv6.auto_flowlabels is 2 by default, the control socket will
> always have its ipv6_pinfo.autoflowlabel set, even after user set
> sysctl.ipv6.auto_flowlabels to 1, so reset packset will always have
> flowlabel.
> 
> To fix this, we always reevaluate autoflowlabel setting for reset
> packet. Normal sock has the same issue too, but since the
> sysctl.ipv6.auto_flowlabels is usually set at host startup, this isn't a
> big issue for normal sock.
> 
> Cc: Martin KaFai Lau <kafai@fb.com>
> Signed-off-by: Shaohua Li <shli@fb.com>

This definitely goes into the category of hack.

Special casing resets is not the answer.

What about normal user sockets openned before the sysctl setting is
modified?

No user is going to be happy about this nice little surprise.

What you're saying is that we should specially treat the kernel
sockets that emit resets, in response to this sysctl setting being
modified.

But that is not what you are doing.  You are treating all sockets
in the system in this way, and only for reset packets.  That's
really terrible.

^ permalink raw reply

* Re: [for-next 07/11] net/mlx5: E-Switch, Create generic header struct to be used by representors
From: David Miller @ 2017-12-20 17:58 UTC (permalink / raw)
  To: saeedm; +Cc: dledford, netdev, linux-rdma, leonro, markb
In-Reply-To: <20171219203340.2600-8-saeedm@mellanox.com>

From: Saeed Mahameed <saeedm@mellanox.com>
Date: Tue, 19 Dec 2017 12:33:36 -0800

> +static int esw_offloads_load_reps(struct mlx5_eswitch *esw, int nvports)
> +{
> +	u8 rep_type = 0;
> +	int err;
> +
> +	while (rep_type < NUM_REP_TYPES) {
> +		err = esw_offloads_load_reps_type(esw, nvports,
> +						  rep_type);
> +		if (err)
> +			goto err_reps;
> +		rep_type++;
> +	}

Please, don't obfuscate what is a normal for() loop:

	for (rep_type = 0; rep_type < NUM_REP_TYPES; rep_type++)

Thanks.

^ permalink raw reply

* Re: [for-next 01/11] net/mlx5: E-Switch, Refactor vport representors initialization
From: David Miller @ 2017-12-20 17:57 UTC (permalink / raw)
  To: saeedm-VPRAkNaXOzVWk0Htik3J/w
  Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, leonro-VPRAkNaXOzVWk0Htik3J/w,
	markb-VPRAkNaXOzVWk0Htik3J/w
In-Reply-To: <20171219203340.2600-2-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

From: Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Date: Tue, 19 Dec 2017 12:33:30 -0800

> +int esw_offloads_init_reps(struct mlx5_eswitch *esw)
> +{
> +	struct mlx5_core_dev *dev = esw->dev;
> +	struct mlx5_esw_offload *offloads;
> +	struct mlx5_eswitch_rep *rep;
> +	int total_vfs = MLX5_TOTAL_VPORTS(dev);
> +	u8 hw_id[ETH_ALEN];
> +	int vport;

Reverse christmas-tree please.

> +	esw->offloads.vport_reps =
> +		kcalloc(total_vfs, sizeof(struct mlx5_eswitch_rep),
> +			GFP_KERNEL);

That looks really unpleasant:

	x = kcalloc(y,
		    z, GFP_KERNEL);

would look so much nicer.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [pull request][for-next 00/11] Mellanox, mlx5 E-Switch updates 2017-12-19
From: David Miller @ 2017-12-20 17:56 UTC (permalink / raw)
  To: saeedm-VPRAkNaXOzVWk0Htik3J/w
  Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, leonro-VPRAkNaXOzVWk0Htik3J/w
In-Reply-To: <20171219203340.2600-1-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

From: Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Date: Tue, 19 Dec 2017 12:33:29 -0800

> This patchset is based on rc4 and I see that net-next is still on rc3, i hope
> this is not a problem.

If it doesn't pull cleanly into net-next, then it would be a problem.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net-next 0/2] netdevsim: couple of build warning fixes
From: David Miller @ 2017-12-20 17:51 UTC (permalink / raw)
  To: jakub.kicinski; +Cc: netdev, oss-drivers
In-Reply-To: <20171219201043.18851-1-jakub.kicinski@netronome.com>

From: Jakub Kicinski <jakub.kicinski@netronome.com>
Date: Tue, 19 Dec 2017 12:10:41 -0800

> This series fixes two harmless build warning about a symbol which
> should be static and an unused variable.

Series applied, thanks Jakub.

^ permalink raw reply

* Re: [PATCH net-next] selftests: rtnetlink: add gretap test cases
From: David Miller @ 2017-12-20 17:49 UTC (permalink / raw)
  To: u9012063; +Cc: netdev, fw
In-Reply-To: <1513708622-111607-1-git-send-email-u9012063@gmail.com>

From: William Tu <u9012063@gmail.com>
Date: Tue, 19 Dec 2017 10:37:02 -0800

> Add test cases for gretap and ip6gretap, native mode
> and external (collect metadata) mode.
> 
> Signed-off-by: William Tu <u9012063@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH v1] net: pasemi: Replace mac address parsing
From: David Miller @ 2017-12-20 17:48 UTC (permalink / raw)
  To: andriy.shevchenko; +Cc: keescook, edumazet, netdev
In-Reply-To: <20171219183103.74530-1-andriy.shevchenko@linux.intel.com>

From: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Date: Tue, 19 Dec 2017 20:31:03 +0200

> Replace sscanf() with mac_pton().
> 
> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>

Applied to net-next.

^ permalink raw reply

* Re: [PATCH v1] net: bonding: Replace mac address parsing
From: David Miller @ 2017-12-20 17:48 UTC (permalink / raw)
  To: andriy.shevchenko; +Cc: j.vosburgh, vfalico, andy, netdev
In-Reply-To: <20171219182044.5678-1-andriy.shevchenko@linux.intel.com>

From: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Date: Tue, 19 Dec 2017 20:20:44 +0200

> Replace sscanf() with mac_pton().
> 
> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>

Applied to net-next.

^ permalink raw reply

* Re: [PATCH v1] bridge: Use helpers to handle MAC address
From: David Miller @ 2017-12-20 17:46 UTC (permalink / raw)
  To: andriy.shevchenko; +Cc: stephen, bridge, netdev
In-Reply-To: <20171219181053.86320-1-andriy.shevchenko@linux.intel.com>

From: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Date: Tue, 19 Dec 2017 20:10:53 +0200

> Use
> 	%pM to print MAC
> 	mac_pton() to convert it from ASCII to binary format, and
> 	ether_addr_copy() to copy.
> 
> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>

Applied to net-next

^ permalink raw reply

* Re: [PATCH net v3] ipv4: Fix use-after-free when flushing FIB tables
From: Alexander Duyck @ 2017-12-20 17:46 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Netdev, David Miller, Duyck, Alexander H, David Ahern,
	Fengguang Wu, mlxsw
In-Reply-To: <20171220173419.11259-1-idosch@mellanox.com>

On Wed, Dec 20, 2017 at 9:34 AM, Ido Schimmel <idosch@mellanox.com> wrote:
> Since commit 0ddcf43d5d4a ("ipv4: FIB Local/MAIN table collapse") the
> local table uses the same trie allocated for the main table when custom
> rules are not in use.
>
> When a net namespace is dismantled, the main table is flushed and freed
> (via an RCU callback) before the local table. In case the callback is
> invoked before the local table is iterated, a use-after-free can occur.
>
> Fix this by iterating over the FIB tables in reverse order, so that the
> main table is always freed after the local table.
>
> v3: Reworded comment according to Alex's suggestion.
> v2: Add a comment to make the fix more explicit per Dave's and Alex's
> feedback.
>
> Fixes: 0ddcf43d5d4a ("ipv4: FIB Local/MAIN table collapse")
> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
> Reported-by: Fengguang Wu <fengguang.wu@intel.com>

Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>

> ---
>  net/ipv4/fib_frontend.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
> index f52d27a422c3..08259d078b1c 100644
> --- a/net/ipv4/fib_frontend.c
> +++ b/net/ipv4/fib_frontend.c
> @@ -1298,14 +1298,19 @@ static int __net_init ip_fib_net_init(struct net *net)
>
>  static void ip_fib_net_exit(struct net *net)
>  {
> -       unsigned int i;
> +       int i;
>
>         rtnl_lock();
>  #ifdef CONFIG_IP_MULTIPLE_TABLES
>         RCU_INIT_POINTER(net->ipv4.fib_main, NULL);
>         RCU_INIT_POINTER(net->ipv4.fib_default, NULL);
>  #endif
> -       for (i = 0; i < FIB_TABLE_HASHSZ; i++) {
> +       /* Destroy the tables in reverse order to guarantee that the
> +        * local table, ID 255, is destroyed before the main table, ID
> +        * 254. This is necessary as the local table may contain
> +        * references to data contained in the main table.
> +        */
> +       for (i = FIB_TABLE_HASHSZ - 1; i >= 0; i--) {
>                 struct hlist_head *head = &net->ipv4.fib_table_hash[i];
>                 struct hlist_node *tmp;
>                 struct fib_table *tb;
> --
> 2.14.3
>

^ permalink raw reply

* Re: [PATCH v1] net: dwc-xlgmac: Get rid of custom hex_dump_to_buffer()
From: David Miller @ 2017-12-20 17:45 UTC (permalink / raw)
  To: andriy.shevchenko; +Cc: jiedeng, Jose.Abreu, netdev
In-Reply-To: <20171219180306.13409-1-andriy.shevchenko@linux.intel.com>

From: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Date: Tue, 19 Dec 2017 20:03:06 +0200

> +		hex_dump_to_buffer(&skb->data[i], len, 32, 1, buffer, 128, false);

Please make the changes here that were requested for the amd-xgbe version of
this patch.  In particular, use "sizeof(buffer)" instead of the magic constant 128.

Thank you.

^ permalink raw reply

* Re: [PATCH] net: Fix double free and memory corruption in get_net_ns_by_id()
From: David Miller @ 2017-12-20 17:43 UTC (permalink / raw)
  To: ebiederm
  Cc: netdev, ktkhai, security, secalert, eric.dumazet, stephen,
	nicolas.dichtel
In-Reply-To: <87d13aaaqr.fsf@xmission.com>

From: ebiederm@xmission.com (Eric W. Biederman)
Date: Tue, 19 Dec 2017 11:27:56 -0600

> (I can trivially verify that that idr_remove in cleanup_net happens
>  after the network namespace count has dropped to zero --EWB)
> 
> Function get_net_ns_by_id() does not check for net::count
> after it has found a peer in netns_ids idr.
> 
> It may dereference a peer, after its count has already been
> finaly decremented. This leads to double free and memory
> corruption:
 ...
> Also, put_net() on the right cpu may reorder with left's cpu
> list_replace_init(&cleanup_list, ..), and then cleanup_list
> will be corrupted.
> 
> Since cleanup_net() is executed in worker thread, while
> put_net(peer) can happen everywhere, there should be
> enough time for concurrent get_net_ns_by_id() to pick
> the peer up, and the race does not seem to be unlikely.
> The patch fixes the problem in standard way.
> 
> (Also, there is possible problem in peernet2id_alloc(), which requires
> check for net::count under nsid_lock and maybe_get_net(peer), but
> in current stable kernel it's used under rtnl_lock() and it has to be
> safe. Openswitch begun to use peernet2id_alloc(), and possibly it should
> be fixed too. While this is not in stable kernel yet, so I'll send
> a separate message to netdev@ later).
> 
> Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
> Fixes: 0c7aecd4bde4 "netns: add rtnl cmd to add and get peer netns ids"
> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

Applied and queued up for -stable, thanks.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox