Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH bpf-next v4 1/2] xsk: remove AF_XDP socket from map when the socket is released
From: Björn Töpel @ 2019-08-15  8:28 UTC (permalink / raw)
  To: Daniel Borkmann, Björn Töpel
  Cc: Alexei Starovoitov, Netdev, Karlsson, Magnus, Bruce Richardson,
	Song Liu, bpf
In-Reply-To: <5ce25e5b-a07a-31d8-4141-c6bd250bba0e@iogearbox.net>

On 2019-08-15 01:17, Daniel Borkmann wrote:
> Seems reasonable to me, and inc as opposed to inc_not_zero is also fine
> here since at this point in time we're still holding one reference to
> the map.

Ok, good.

> But I think there's a catch with the current code that still
> needs fixing:
> 
> Imagine you do a xsk_map_update_elem() where we have a situation where
> xs == old_xs. There, we first do the xsk_map_sock_add() to add the new
> xsk map node at the tail of the socket's xs->map_list. We do the xchg()
> and then xsk_map_sock_delete() for old_xs which then walks xs->map_list
> again and purges all entries including the just newly created one. This
> means we'll end up with an xs socket at the given map slot, but the xs
> socket has empty xs->map_list. This means we could release the xs sock
> and the xsk_delete_from_maps() won't need to clean up anything anymore
> but yet the xs is still in the map slot, so if you redirect to that
> socket, it would be use-after-free, no?

Ah, correct. Checking against self-assignment, or doing the delete prior 
add. I'll spin a v5.

...and again, thanks for detailed review, Daniel!


Björn

^ permalink raw reply

* [PATCH v5] perf machine: arm/arm64: Improve completeness for kernel address space
From: Leo Yan @ 2019-08-15  8:25 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, linux-kernel, netdev,
	bpf, clang-built-linux
  Cc: Leo Yan, Mathieu Poirier, Peter Zijlstra, Suzuki Poulouse,
	Adrian Hunter, coresight, linux-arm-kernel

Arm and arm64 architecture reserve some memory regions prior to the
symbol '_stext' and these memory regions later will be used by device
module and BPF jit.  The current code misses to consider these memory
regions thus any address in the regions will be taken as user space
mode, but perf cannot find the corresponding dso with the wrong CPU
mode so we misses to generate samples for device module and BPF
related trace data.

This patch parse the link scripts to get the memory size prior to start
address and reduce this size from 'machine>->kernel_start', then can
get a fixed up kernel start address which contain memory regions for
device module and BPF.  Finally, machine__get_kernel_start() can reflect
more complete kernel memory regions and perf can successfully generate
samples.

The reason for parsing the link scripts is Arm architecture changes text
offset dependent on different platforms, which define multiple text
offsets in $kernel/arch/arm/Makefile.  This offset is decided when build
kernel and the final value is extended in the link script, so we can
extract the used value from the link script.  We use the same way to
parse arm64 link script as well.  If fail to find the link script, the
pre start memory size is assumed as zero, in this case it has no any
change caused with this patch.

Below is detailed info for testing this patch:

- Install or build LLVM/Clang;

- Configure perf with ~/.perfconfig:

  root@debian:~# cat ~/.perfconfig
  # this file is auto-generated.
  [llvm]
          clang-path = /mnt/build/llvm-build/build/install/bin/clang
          kbuild-dir = /mnt/linux-kernel/linux-cs-dev/
          clang-opt = "-g"
          dump-obj = true

  [trace]
          show_zeros = yes
          show_duration = no
          no_inherit = yes
          show_timestamp = no
          show_arg_names = no
          args_alignment = 40
          show_prefix = yes

- Run 'perf trace' command with eBPF event:

  root@debian:~# perf trace -e string \
      -e $kernel/tools/perf/examples/bpf/augmented_raw_syscalls.c

- Read eBPF program memory mapping in kernel:

  root@debian:~# echo 1 > /proc/sys/net/core/bpf_jit_kallsyms
  root@debian:~# cat /proc/kallsyms | grep -E "bpf_prog_.+_sys_[enter|exit]"
  ffff00000008a0d0 t bpf_prog_e470211b846088d5_sys_enter  [bpf]
  ffff00000008c6a4 t bpf_prog_29c7ae234d79bd5c_sys_exit   [bpf]

- Launch any program which accesses file system frequently so can hit
  the system calls trace flow with eBPF event;

- Capture CoreSight trace data with filtering eBPF program:

  root@debian:~# perf record -e cs_etm/@tmc_etr0/ \
	--filter 'filter 0xffff00000008a0d0/0x800' -a sleep 5s

- Decode the eBPF program symbol 'bpf_prog_f173133dc38ccf87_sys_enter':

  root@debian:~# perf script -F,ip,sym
  Frame deformatter: Found 4 FSYNCS
                  0 [unknown]
   ffff00000008a1ac bpf_prog_e470211b846088d5_sys_enter
   ffff00000008a250 bpf_prog_e470211b846088d5_sys_enter
                  0 [unknown]
   ffff00000008a124 bpf_prog_e470211b846088d5_sys_enter
                  0 [unknown]
   ffff00000008a14c bpf_prog_e470211b846088d5_sys_enter
   ffff00000008a13c bpf_prog_e470211b846088d5_sys_enter
   ffff00000008a14c bpf_prog_e470211b846088d5_sys_enter
                  0 [unknown]
   ffff00000008a180 bpf_prog_e470211b846088d5_sys_enter
                  0 [unknown]
   ffff00000008a1ac bpf_prog_e470211b846088d5_sys_enter
   ffff00000008a190 bpf_prog_e470211b846088d5_sys_enter
   ffff00000008a1ac bpf_prog_e470211b846088d5_sys_enter
   ffff00000008a250 bpf_prog_e470211b846088d5_sys_enter
                  0 [unknown]
   ffff00000008a124 bpf_prog_e470211b846088d5_sys_enter
                  0 [unknown]
   ffff00000008a14c bpf_prog_e470211b846088d5_sys_enter
                  0 [unknown]
   ffff00000008a180 bpf_prog_e470211b846088d5_sys_enter
   [...]

Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suzuki Poulouse <suzuki.poulose@arm.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: coresight@lists.linaro.org
Cc: linux-arm-kernel@lists.infradead.org
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 tools/perf/Makefile.config | 22 ++++++++++++++++++++++
 tools/perf/util/machine.c  | 15 ++++++++++++++-
 2 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/tools/perf/Makefile.config b/tools/perf/Makefile.config
index e4988f49ea79..d7ff839d8b20 100644
--- a/tools/perf/Makefile.config
+++ b/tools/perf/Makefile.config
@@ -48,9 +48,20 @@ ifeq ($(SRCARCH),x86)
   NO_PERF_REGS := 0
 endif
 
+ARM_PRE_START_SIZE := 0
+
 ifeq ($(SRCARCH),arm)
   NO_PERF_REGS := 0
   LIBUNWIND_LIBS = -lunwind -lunwind-arm
+  ifneq ($(wildcard $(srctree)/arch/$(SRCARCH)/kernel/vmlinux.lds),)
+    # Extract info from lds:
+    #   . = ((0xC0000000)) + 0x00208000;
+    # ARM_PRE_START_SIZE := 0x00208000
+    ARM_PRE_START_SIZE := $(shell egrep ' \. \= \({2}0x[0-9a-fA-F]+\){2}' \
+      $(srctree)/arch/$(SRCARCH)/kernel/vmlinux.lds | \
+      sed -e 's/[(|)|.|=|+|<|;|-]//g' -e 's/ \+/ /g' -e 's/^[ \t]*//' | \
+      awk -F' ' '{printf "0x%x", $$2}' 2>/dev/null)
+  endif
 endif
 
 ifeq ($(SRCARCH),arm64)
@@ -58,8 +69,19 @@ ifeq ($(SRCARCH),arm64)
   NO_SYSCALL_TABLE := 0
   CFLAGS += -I$(OUTPUT)arch/arm64/include/generated
   LIBUNWIND_LIBS = -lunwind -lunwind-aarch64
+  ifneq ($(wildcard $(srctree)/arch/$(SRCARCH)/kernel/vmlinux.lds),)
+    # Extract info from lds:
+    #  . = ((((((((0xffffffffffffffff)) - (((1)) << (48)) + 1) + (0)) + (0x08000000))) + (0x08000000))) + 0x00080000;
+    # ARM_PRE_START_SIZE := (0x08000000 + 0x08000000 + 0x00080000) = 0x10080000
+    ARM_PRE_START_SIZE := $(shell egrep ' \. \= \({8}0x[0-9a-fA-F]+\){2}' \
+      $(srctree)/arch/$(SRCARCH)/kernel/vmlinux.lds | \
+      sed -e 's/[(|)|.|=|+|<|;|-]//g' -e 's/ \+/ /g' -e 's/^[ \t]*//' | \
+      awk -F' ' '{printf "0x%x", $$6+$$7+$$8}' 2>/dev/null)
+  endif
 endif
 
+CFLAGS += -DARM_PRE_START_SIZE=$(ARM_PRE_START_SIZE)
+
 ifeq ($(SRCARCH),csky)
   NO_PERF_REGS := 0
 endif
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index f6ee7fbad3e4..e993f891bb82 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -2687,13 +2687,26 @@ int machine__get_kernel_start(struct machine *machine)
 	machine->kernel_start = 1ULL << 63;
 	if (map) {
 		err = map__load(map);
+		if (err)
+			return err;
+
 		/*
 		 * On x86_64, PTI entry trampolines are less than the
 		 * start of kernel text, but still above 2^63. So leave
 		 * kernel_start = 1ULL << 63 for x86_64.
 		 */
-		if (!err && !machine__is(machine, "x86_64"))
+		if (!machine__is(machine, "x86_64"))
 			machine->kernel_start = map->start;
+
+		/*
+		 * On arm/arm64, the kernel uses some memory regions which are
+		 * prior to '_stext' symbol; to reflect the complete kernel
+		 * address space, compensate these pre-defined regions for
+		 * kernel start address.
+		 */
+		if (!strcmp(perf_env__arch(machine->env), "arm") ||
+		    !strcmp(perf_env__arch(machine->env), "arm64"))
+			machine->kernel_start -= ARM_PRE_START_SIZE;
 	}
 	return err;
 }
-- 
2.17.1


^ permalink raw reply related

* [PATCH] tun: fix use-after-free when register netdev failed
From: Yang Yingliang @ 2019-08-15  8:18 UTC (permalink / raw)
  To: netdev; +Cc: jasowang, xiyou.wangcong, davem, yangyingliang

I got a UAF repport in tun driver when doing fuzzy test:

[  466.269490] ==================================================================
[  466.271792] BUG: KASAN: use-after-free in tun_chr_read_iter+0x2ca/0x2d0
[  466.271806] Read of size 8 at addr ffff888372139250 by task tun-test/2699
[  466.271810]
[  466.271824] CPU: 1 PID: 2699 Comm: tun-test Not tainted 5.3.0-rc1-00001-g5a9433db2614-dirty #427
[  466.271833] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
[  466.271838] Call Trace:
[  466.271858]  dump_stack+0xca/0x13e
[  466.271871]  ? tun_chr_read_iter+0x2ca/0x2d0
[  466.271890]  print_address_description+0x79/0x440
[  466.271906]  ? vprintk_func+0x5e/0xf0
[  466.271920]  ? tun_chr_read_iter+0x2ca/0x2d0
[  466.271935]  __kasan_report+0x15c/0x1df
[  466.271958]  ? tun_chr_read_iter+0x2ca/0x2d0
[  466.271976]  kasan_report+0xe/0x20
[  466.271987]  tun_chr_read_iter+0x2ca/0x2d0
[  466.272013]  do_iter_readv_writev+0x4b7/0x740
[  466.272032]  ? default_llseek+0x2d0/0x2d0
[  466.272072]  do_iter_read+0x1c5/0x5e0
[  466.272110]  vfs_readv+0x108/0x180
[  466.299007]  ? compat_rw_copy_check_uvector+0x440/0x440
[  466.299020]  ? fsnotify+0x888/0xd50
[  466.299040]  ? __fsnotify_parent+0xd0/0x350
[  466.299064]  ? fsnotify_first_mark+0x1e0/0x1e0
[  466.304548]  ? vfs_write+0x264/0x510
[  466.304569]  ? ksys_write+0x101/0x210
[  466.304591]  ? do_preadv+0x116/0x1a0
[  466.304609]  do_preadv+0x116/0x1a0
[  466.309829]  do_syscall_64+0xc8/0x600
[  466.309849]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  466.309861] RIP: 0033:0x4560f9
[  466.309875] Code: 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
[  466.309889] RSP: 002b:00007ffffa5166e8 EFLAGS: 00000206 ORIG_RAX: 0000000000000127
[  466.322992] RAX: ffffffffffffffda RBX: 0000000000400460 RCX: 00000000004560f9
[  466.322999] RDX: 0000000000000003 RSI: 00000000200008c0 RDI: 0000000000000003
[  466.323007] RBP: 00007ffffa516700 R08: 0000000000000004 R09: 0000000000000000
[  466.323014] R10: 0000000000000000 R11: 0000000000000206 R12: 000000000040cb10
[  466.323021] R13: 0000000000000000 R14: 00000000006d7018 R15: 0000000000000000
[  466.323057]
[  466.323064] Allocated by task 2605:
[  466.335165]  save_stack+0x19/0x80
[  466.336240]  __kasan_kmalloc.constprop.8+0xa0/0xd0
[  466.337755]  kmem_cache_alloc+0xe8/0x320
[  466.339050]  getname_flags+0xca/0x560
[  466.340229]  user_path_at_empty+0x2c/0x50
[  466.341508]  vfs_statx+0xe6/0x190
[  466.342619]  __do_sys_newstat+0x81/0x100
[  466.343908]  do_syscall_64+0xc8/0x600
[  466.345303]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  466.347034]
[  466.347517] Freed by task 2605:
[  466.348471]  save_stack+0x19/0x80
[  466.349476]  __kasan_slab_free+0x12e/0x180
[  466.350726]  kmem_cache_free+0xc8/0x430
[  466.351874]  putname+0xe2/0x120
[  466.352921]  filename_lookup+0x257/0x3e0
[  466.354319]  vfs_statx+0xe6/0x190
[  466.355498]  __do_sys_newstat+0x81/0x100
[  466.356889]  do_syscall_64+0xc8/0x600
[  466.358037]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  466.359567]
[  466.360050] The buggy address belongs to the object at ffff888372139100
[  466.360050]  which belongs to the cache names_cache of size 4096
[  466.363735] The buggy address is located 336 bytes inside of
[  466.363735]  4096-byte region [ffff888372139100, ffff88837213a100)
[  466.367179] The buggy address belongs to the page:
[  466.368604] page:ffffea000dc84e00 refcount:1 mapcount:0 mapping:ffff8883df1b4f00 index:0x0 compound_mapcount: 0
[  466.371582] flags: 0x2fffff80010200(slab|head)
[  466.372910] raw: 002fffff80010200 dead000000000100 dead000000000122 ffff8883df1b4f00
[  466.375209] raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000
[  466.377778] page dumped because: kasan: bad access detected
[  466.379730]
[  466.380288] Memory state around the buggy address:
[  466.381844]  ffff888372139100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  466.384009]  ffff888372139180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  466.386131] >ffff888372139200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  466.388257]                                                  ^
[  466.390234]  ffff888372139280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  466.392512]  ffff888372139300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  466.394667] ==================================================================

tun_chr_read_iter() accessed the memory which freed by free_netdev()
called by tun_set_iff():

	CPUA				CPUB
    tun_set_iff()
      alloc_netdev_mqs()
      tun_attach()
				    tun_chr_read_iter()
				      tun_get()
      register_netdevice()
      tun_detach_all()
        synchronize_net()
				      tun_do_read()
				        tun_ring_recv()
				          schedule()
      free_netdev()
				      tun_put() <-- UAF

Set a new bit in tun->flag if register_netdevice() successed,
without this bit, tun_get() returns NULL to avoid using a
freed tun pointer.

Fixes: eb0fb363f920 ("tuntap: attach queue 0 before registering netdevice")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 drivers/net/tun.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index db16d7a13e00..cbd60c276c40 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -115,6 +115,7 @@ do {								\
 /* High bits in flags field are unused. */
 #define TUN_VNET_LE     0x80000000
 #define TUN_VNET_BE     0x40000000
+#define TUN_DEV_REGISTERED	0x20000000
 
 #define TUN_FEATURES (IFF_NO_PI | IFF_ONE_QUEUE | IFF_VNET_HDR | \
 		      IFF_MULTI_QUEUE | IFF_NAPI | IFF_NAPI_FRAGS)
@@ -719,8 +720,10 @@ static void __tun_detach(struct tun_file *tfile, bool clean)
 			netif_carrier_off(tun->dev);
 
 			if (!(tun->flags & IFF_PERSIST) &&
-			    tun->dev->reg_state == NETREG_REGISTERED)
+			    tun->dev->reg_state == NETREG_REGISTERED) {
 				unregister_netdevice(tun->dev);
+				tun->flags &= ~TUN_DEV_REGISTERED;
+			}
 		}
 		if (tun)
 			xdp_rxq_info_unreg(&tfile->xdp_rxq);
@@ -884,8 +887,10 @@ static struct tun_struct *tun_get(struct tun_file *tfile)
 
 	rcu_read_lock();
 	tun = rcu_dereference(tfile->tun);
-	if (tun)
+	if (tun && (tun->flags & TUN_DEV_REGISTERED))
 		dev_hold(tun->dev);
+	else
+		tun = NULL;
 	rcu_read_unlock();
 
 	return tun;
@@ -2836,6 +2841,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 		err = register_netdevice(tun->dev);
 		if (err < 0)
 			goto err_detach;
+		tun->flags |= TUN_DEV_REGISTERED;
 	}
 
 	netif_carrier_on(tun->dev);
-- 
2.17.1


^ permalink raw reply related

* [PATCH] can: flexcan: disable completely the ECC mechanism
From: Joakim Zhang @ 2019-08-15  8:00 UTC (permalink / raw)
  To: mkl@pengutronix.de, linux-can@vger.kernel.org
  Cc: wg@grandegger.com, netdev@vger.kernel.org, stefan@agner.ch,
	dl-linux-imx, Joakim Zhang

The ECC(memory error detection and correction) mechanism can be
activated or not, controlled by the ECCDIS bit in CAN_MECR. When
disabled, updates on indications and reporting registers are stopped.
So if want to disable ECC completely, had better assert ECCDIS bit,
not just mask the related interrupts.

Fixes:cdce844865be("can: flexcan: add vf610 support for FlexCAN")
Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com>
---
 drivers/net/can/flexcan.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/can/flexcan.c b/drivers/net/can/flexcan.c
index def8cbbc04e8..6a48333b40c7 100644
--- a/drivers/net/can/flexcan.c
+++ b/drivers/net/can/flexcan.c
@@ -1428,6 +1428,7 @@ static int flexcan_chip_start(struct net_device *dev)
 		reg_mecr = priv->read(&regs->mecr);
 		reg_mecr &= ~FLEXCAN_MECR_ECRWRDIS;
 		priv->write(reg_mecr, &regs->mecr);
+		reg_mecr |= FLEXCAN_MECR_ECCDIS;
 		reg_mecr &= ~(FLEXCAN_MECR_NCEFAFRZ | FLEXCAN_MECR_HANCEI_MSK |
 			      FLEXCAN_MECR_FANCEI_MSK);
 		priv->write(reg_mecr, &regs->mecr);
-- 
2.17.1


^ permalink raw reply related

* [PATCH] can: flexcan: update hardware feature information for i.MX8QM
From: Joakim Zhang @ 2019-08-15  7:58 UTC (permalink / raw)
  To: mkl@pengutronix.de, linux-can@vger.kernel.org
  Cc: wg@grandegger.com, netdev@vger.kernel.org, dl-linux-imx,
	Joakim Zhang

Update hardware feature information for i.MX8QM.

Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com>
---
 drivers/net/can/flexcan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/can/flexcan.c b/drivers/net/can/flexcan.c
index 58a794c7387b..def8cbbc04e8 100644
--- a/drivers/net/can/flexcan.c
+++ b/drivers/net/can/flexcan.c
@@ -207,7 +207,7 @@
  *   MX35  FlexCAN2  03.00.00.00     no        no        no       no        no           no
  *   MX53  FlexCAN2  03.00.00.00    yes        no        no       no        no           no
  *   MX6s  FlexCAN3  10.00.12.00    yes       yes        no       no       yes           no
- *  MX8QM         ?  ?                ?         ?         ?        ?         ?          yes
+ *  MX8QM  FlexCAN3  03.00.23.00    yes       yes        no       no       yes          yes
  *   VF610 FlexCAN3  ?               no       yes        no      yes       yes?          no
  * LS1021A FlexCAN2  03.00.04.00     no       yes        no       no       yes           no
  * LX2160A FlexCAN3  03.00.23.00     no       yes        no       no       yes          yes
-- 
2.17.1


^ permalink raw reply related

* [PATCH v2] selftests: net: tcp_fastopen_backup_key.sh: fix shellcheck issue
From: Anders Roxell @ 2019-08-15  7:58 UTC (permalink / raw)
  To: davem, shuah
  Cc: netdev, tim.bird, linux-kselftest, linux-kernel, Anders Roxell

When running tcp_fastopen_backup_key.sh the following issue was seen in
a busybox environment.
./tcp_fastopen_backup_key.sh: line 33: [: -ne: unary operator expected

Shellcheck showed the following issue.
$ shellcheck tools/testing/selftests/net/tcp_fastopen_backup_key.sh

In tools/testing/selftests/net/tcp_fastopen_backup_key.sh line 33:
        if [ $val -ne 0 ]; then
             ^-- SC2086: Double quote to prevent globbing and word splitting.

Rework to do a string comparison instead.

Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
---
 tools/testing/selftests/net/tcp_fastopen_backup_key.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/tcp_fastopen_backup_key.sh b/tools/testing/selftests/net/tcp_fastopen_backup_key.sh
index 41476399e184..f6e65674b83c 100755
--- a/tools/testing/selftests/net/tcp_fastopen_backup_key.sh
+++ b/tools/testing/selftests/net/tcp_fastopen_backup_key.sh
@@ -30,7 +30,7 @@ do_test() {
 	ip netns exec "${NETNS}" ./tcp_fastopen_backup_key "$1"
 	val=$(ip netns exec "${NETNS}" nstat -az | \
 		grep TcpExtTCPFastOpenPassiveFail | awk '{print $2}')
-	if [ $val -ne 0 ]; then
+	if [ "$val" != 0 ]; then
 		echo "FAIL: TcpExtTCPFastOpenPassiveFail non-zero"
 		return 1
 	fi
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH] selftests: net: tcp_fastopen_backup_key.sh: fix shellcheck issue
From: Anders Roxell @ 2019-08-15  7:57 UTC (permalink / raw)
  To: Tim.Bird
  Cc: David Miller, Shuah Khan, Networking,
	open list:KERNEL SELFTEST FRAMEWORK, Linux Kernel Mailing List
In-Reply-To: <ECADFF3FD767C149AD96A924E7EA6EAF977A2939@USCULXMSG01.am.sony.com>

On Thu, 15 Aug 2019 at 01:35, <Tim.Bird@sony.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Anders Roxell
> >
> > When running tcp_fastopen_backup_key.sh the following issue was seen in
> > a busybox environment.
> > ./tcp_fastopen_backup_key.sh: line 33: [: -ne: unary operator expected
> >
> > Shellcheck showed the following issue.
> > $ shellcheck tools/testing/selftests/net/tcp_fastopen_backup_key.sh
> >
> > In tools/testing/selftests/net/tcp_fastopen_backup_key.sh line 33:
> >         if [ $val -ne 0 ]; then
> >              ^-- SC2086: Double quote to prevent globbing and word splitting.
> >
> > Rework to add double quotes around the variable 'val' that shellcheck
> > recommends.
> >
> > Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
> > ---
> >  tools/testing/selftests/net/tcp_fastopen_backup_key.sh | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/tools/testing/selftests/net/tcp_fastopen_backup_key.sh
> > b/tools/testing/selftests/net/tcp_fastopen_backup_key.sh
> > index 41476399e184..ba5ec3eb314e 100755
> > --- a/tools/testing/selftests/net/tcp_fastopen_backup_key.sh
> > +++ b/tools/testing/selftests/net/tcp_fastopen_backup_key.sh
> > @@ -30,7 +30,7 @@ do_test() {
> >       ip netns exec "${NETNS}" ./tcp_fastopen_backup_key "$1"
> >       val=$(ip netns exec "${NETNS}" nstat -az | \
> >               grep TcpExtTCPFastOpenPassiveFail | awk '{print $2}')
> > -     if [ $val -ne 0 ]; then
> > +     if [ "$val" -ne 0 ]; then
>
> Did you test this in the failing environment?

I thought I did that but the environment wasn't exactly the same. =/

>
> With a busybox shell, I get:
>  $ [ "" -ne 0 ]
> sh: bad number
>
> You might need to explicitly check for empty string here, or switch to a string comparison instead:
> if [ "$val" != 0 ]; then

I'll do that instead.

Sending out a v2 shortly.

Cheers,
Anders

>
>    -- Tim
>
> >               echo "FAIL: TcpExtTCPFastOpenPassiveFail non-zero"
> >               return 1
> >       fi
> > --
> > 2.20.1
>

^ permalink raw reply

* Re: WARNING in is_bpf_text_address
From: Will Deacon @ 2019-08-15  7:51 UTC (permalink / raw)
  To: syzbot, bvanassche
  Cc: akpm, ast, bpf, bvanassche, daniel, davem, dvyukov, hawk, hdanton,
	jakub.kicinski, johannes.berg, johannes, john.fastabend, kafai,
	linux-kernel, longman, mingo, netdev, paulmck, peterz,
	songliubraving, syzkaller-bugs, tglx, tj, torvalds, will.deacon,
	xdp-newbies, yhs
In-Reply-To: <000000000000e56cb0058fcc6c28@google.com>

Hi Bart,

On Sat, Aug 10, 2019 at 05:24:06PM -0700, syzbot wrote:
> syzbot has found a reproducer for the following crash on:
> 
> HEAD commit:    451577f3 Merge tag 'kbuild-fixes-v5.3-3' of git://git.kern..
> git tree:       upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=120850a6600000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=2031e7d221391b8a
> dashboard link: https://syzkaller.appspot.com/bug?extid=bd3bba6ff3fcea7a6ec6
> compiler:       clang version 9.0.0 (/home/glider/llvm/clang
> 80fee25776c2fb61e74c1ecb1a523375c2500b69)
> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=130ffe4a600000
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=17137d2c600000
> 
> The bug was bisected to:
> 
> commit a0b0fd53e1e67639b303b15939b9c653dbe7a8c4
> Author: Bart Van Assche <bvanassche@acm.org>
> Date:   Thu Feb 14 23:00:46 2019 +0000
> 
>     locking/lockdep: Free lock classes that are no longer in use
> 
> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=152f6a9da00000
> final crash:    https://syzkaller.appspot.com/x/report.txt?x=172f6a9da00000
> console output: https://syzkaller.appspot.com/x/log.txt?x=132f6a9da00000

I know you don't think much to these reports, but please could you have a
look (even if it's just to declare it a false positive)?

Cheers,

Will

^ permalink raw reply

* Re: [PATCH] netfilter: nft_bitwise: Adjust parentheses to fix memcmp size argument
From: Joe Perches @ 2019-08-15  6:56 UTC (permalink / raw)
  To: Nathan Chancellor, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal
  Cc: David S. Miller, netfilter-devel, coreteam, netdev, linux-kernel,
	clang-built-linux, kbuild test robot
In-Reply-To: <20190814165809.46421-1-natechancellor@gmail.com>

On Wed, 2019-08-14 at 09:58 -0700, Nathan Chancellor wrote:
> clang warns:
> 
> net/netfilter/nft_bitwise.c:138:50: error: size argument in 'memcmp'
> call is a comparison [-Werror,-Wmemsize-comparison]
>         if (memcmp(&priv->xor, &zero, sizeof(priv->xor) ||
>                                       ~~~~~~~~~~~~~~~~~~^~
> net/netfilter/nft_bitwise.c:138:6: note: did you mean to compare the
> result of 'memcmp' instead?
>         if (memcmp(&priv->xor, &zero, sizeof(priv->xor) ||
>             ^
>                                                        )
> net/netfilter/nft_bitwise.c:138:32: note: explicitly cast the argument
> to size_t to silence this warning
>         if (memcmp(&priv->xor, &zero, sizeof(priv->xor) ||
>                                       ^
>                                       (size_t)(
> 1 error generated.
> 
> Adjust the parentheses so that the result of the sizeof is used for the
> size argument in memcmp, rather than the result of the comparison (which
> would always be true because sizeof is a non-zero number).
> 
> Fixes: bd8699e9e292 ("netfilter: nft_bitwise: add offload support")
> Link: https://github.com/ClangBuiltLinux/linux/issues/638
> Reported-by: kbuild test robot <lkp@intel.com>
> Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
> ---
>  net/netfilter/nft_bitwise.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/netfilter/nft_bitwise.c b/net/netfilter/nft_bitwise.c
[]
> @@ -135,8 +135,8 @@ static int nft_bitwise_offload(struct nft_offload_ctx *ctx,
>  {
>  	const struct nft_bitwise *priv = nft_expr_priv(expr);
>  
> -	if (memcmp(&priv->xor, &zero, sizeof(priv->xor) ||
> -	    priv->sreg != priv->dreg))
> +	if (memcmp(&priv->xor, &zero, sizeof(priv->xor)) ||
> +	    priv->sreg != priv->dreg)

This code should use memchr_inv and not compare against a
static uninitialized struct.

Perhaps linux should introduce and use memcchr like bsd. 
or just add something like #define memcchr memchr_inv





^ permalink raw reply

* Re: [PATCH] Bluetooth: 6lowpan: Make variable header_ops constant
From: Marcel Holtmann @ 2019-08-15  6:46 UTC (permalink / raw)
  To: Nishka Dasgupta
  Cc: Johan Hedberg, linux-bluetooth, linux-kernel, David S. Miller,
	netdev
In-Reply-To: <20190815055255.1153-1-nishkadg.linux@gmail.com>

Hi Nishka,

> Static variable header_ops, of type header_ops, is used only once, when
> it is assigned to field header_ops of a variable having type net_device.
> This corresponding field is declared as const in the definition of
> net_device. Hence make header_ops constant as well to protect it from
> unnecessary modification.
> Issue found with Coccinelle.
> 
> Signed-off-by: Nishka Dasgupta <nishkadg.linux@gmail.com>
> ---
> net/bluetooth/6lowpan.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)

patch has been applied to bluetooth-next tree.

Regards

Marcel


^ permalink raw reply

* Re: [PATCH net-next v2 13/14] selftests: devlink_trap: Add test cases for devlink-trap
From: Ido Schimmel @ 2019-08-15  6:34 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, davem, nhorman, jiri, toke, dsahern, roopa, nikolay, andy,
	f.fainelli, andrew, vivien.didelot, mlxsw, Ido Schimmel
In-Reply-To: <20190814174229.1ab4fd1b@cakuba.netronome.com>

On Wed, Aug 14, 2019 at 05:42:29PM -0700, Jakub Kicinski wrote:
> On Tue, 13 Aug 2019 10:53:59 +0300, Ido Schimmel wrote:
> > From: Ido Schimmel <idosch@mellanox.com>
> > 
> > Add test cases for devlink-trap on top of the netdevsim implementation.
> > 
> > The tests focus on the devlink-trap core infrastructure and user space
> > API. They test both good and bad flows and also dismantle of the netdev
> > and devlink device used to report trapped packets.
> > 
> > This allows device drivers to focus their tests on device-specific
> > functionality.
> > 
> > Signed-off-by: Ido Schimmel <idosch@mellanox.com>
> > Acked-by: Jiri Pirko <jiri@mellanox.com>
> 
> Thanks for the test!
> 
> Should it perhaps live in:
> tools/testing/selftests/drivers/net/netdevsim/
> ?
> 
> That's where Jiri puts his devlink tests..

Yea, good point. Will move it there.

> 
> Also the test seems to require netdevsim to be loaded, otherwise:
> # ./devlink_trap.sh 
> SKIP: No netdevsim support
> 
> Is that expected?

No, my bad. I need to change the check to see if netdevsim is loaded and
otherwise load the module.

Thanks!

^ permalink raw reply

* [PATCH] net: hns: hns_enet: Add of_node_put in hns_nic_dev_probe()
From: Nishka Dasgupta @ 2019-08-15  6:28 UTC (permalink / raw)
  To: yisen.zhuang, salil.mehta, davem, netdev; +Cc: Nishka Dasgupta

The local variable ae_node in function hns_nic_dev_probe takes the
return value of of_parse_phandle, which gets a node but does not put it.
This may cause a memory leak. Hence put ae_node after the last time it
is invoked.
Issue found with Coccinelle.

Signed-off-by: Nishka Dasgupta <nishkadg.linux@gmail.com>
---
 drivers/net/ethernet/hisilicon/hns/hns_enet.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/hisilicon/hns/hns_enet.c b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
index 2235dd55fab2..b26e84929e1e 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
@@ -2309,6 +2309,7 @@ static int hns_nic_dev_probe(struct platform_device *pdev)
 			goto out_read_prop_fail;
 		}
 		priv->fwnode = &ae_node->fwnode;
+		of_node_put(ae_node);
 	} else if (is_acpi_node(dev->fwnode)) {
 		struct fwnode_reference_args args;
 
-- 
2.19.1


^ permalink raw reply related

* Re: [PATCH net-next v2 11/14] netdevsim: Add devlink-trap support
From: Ido Schimmel @ 2019-08-15  6:27 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, davem, nhorman, jiri, toke, dsahern, roopa, nikolay, andy,
	f.fainelli, andrew, vivien.didelot, mlxsw, Ido Schimmel
In-Reply-To: <20190814165957.0e626f57@cakuba.netronome.com>

On Wed, Aug 14, 2019 at 04:59:57PM -0700, Jakub Kicinski wrote:
> On Tue, 13 Aug 2019 10:53:57 +0300, Ido Schimmel wrote:
> > diff --git a/drivers/net/netdevsim/dev.c b/drivers/net/netdevsim/dev.c
> > index 08ca59fc189b..2758d95c8d18 100644
> > --- a/drivers/net/netdevsim/dev.c
> > +++ b/drivers/net/netdevsim/dev.c
> > @@ -17,11 +17,21 @@
> >  
> >  #include <linux/debugfs.h>
> >  #include <linux/device.h>
> > +#include <linux/etherdevice.h>
> > +#include <linux/inet.h>
> > +#include <linux/jiffies.h>
> > +#include <linux/kernel.h>
> >  #include <linux/list.h>
> >  #include <linux/mutex.h>
> >  #include <linux/random.h>
> > +#include <linux/workqueue.h>
> > +#include <linux/random.h>
> >  #include <linux/rtnetlink.h>
> >  #include <net/devlink.h>
> > +#include <net/ip.h>
> > +#include <uapi/linux/devlink.h>
> > +#include <uapi/linux/ip.h>
> > +#include <uapi/linux/udp.h>
> 
> Please keep includes ordered alphabetically. You're adding
> linux/random.h second time.

Will fix.

> 
> >  #include "netdevsim.h"
> 
> > +static void nsim_dev_trap_report(struct nsim_dev_port *nsim_dev_port)
> > +{
> > +	struct nsim_dev *nsim_dev = nsim_dev_port->ns->nsim_dev;
> > +	struct nsim_trap_data *nsim_trap_data = nsim_dev->trap_data;
> > +	struct devlink *devlink = priv_to_devlink(nsim_dev);
> > +	int i;
> 
> reverse christmas tree, please

Likewise.

Thanks!

^ permalink raw reply

* [PATCH net] tunnel: fix dev null pointer dereference when send pkg larger than mtu in collect_md mode
From: Hangbin Liu @ 2019-08-15  6:09 UTC (permalink / raw)
  To: netdev
  Cc: Stefano Brivio, wenxu, Alexei Starovoitov, David S . Miller,
	Hangbin Liu

When we send a packet larger than PMTU, we need to reply with
icmp_send(ICMP_FRAG_NEEDED) or icmpv6_send(ICMPV6_PKT_TOOBIG).

But in collect_md mode, kernel will crash while accessing the dst dev
as __metadata_dst_init() init dst->dev to NULL by default. Here is what
the code path looks like, for GRE:

- ip6gre_tunnel_xmit
  - ip6gre_xmit_ipv4
    - __gre6_xmit
      - ip6_tnl_xmit
        - if skb->len - t->tun_hlen - eth_hlen > mtu; return -EMSGSIZE
    - icmp_send
      - net = dev_net(rt->dst.dev); <-- here
  - ip6gre_xmit_ipv6
    - __gre6_xmit
      - ip6_tnl_xmit
        - if skb->len - t->tun_hlen - eth_hlen > mtu; return -EMSGSIZE
    - icmpv6_send
      ...
      - decode_session4
        - oif = skb_dst(skb)->dev->ifindex; <-- here
      - decode_session6
        - oif = skb_dst(skb)->dev->ifindex; <-- here

Fix it by updating the dst dev if not set.

The reproducer is easy:

ovs-vsctl add-br br0
ip link set br0 up
ovs-vsctl add-port br0 gre0 -- \
	  set interface gre0 type=gre options:remote_ip=$dst_addr
ip link set gre0 up
ip addr add ${local_gre6}/64 dev br0
ping6 $remote_gre6 -s 1500

The kernel will crash like
[40595.821651] BUG: kernel NULL pointer dereference, address: 0000000000000108
[40595.822411] #PF: supervisor read access in kernel mode
[40595.822949] #PF: error_code(0x0000) - not-present page
[40595.823492] PGD 0 P4D 0
[40595.823767] Oops: 0000 [#1] SMP PTI
[40595.824139] CPU: 0 PID: 2831 Comm: handler12 Not tainted 5.2.0 #57
[40595.824788] Hardware name: Red Hat KVM, BIOS 1.11.1-3.module+el8.1.0+2983+b2ae9c0a 04/01/2014
[40595.825680] RIP: 0010:__xfrm_decode_session+0x6b/0x930
[40595.826219] Code: b7 c0 00 00 00 b8 06 00 00 00 66 85 d2 0f b7 ca 48 0f 45 c1 44 0f b6 2c 06 48 8b 47 58 48 83 e0 fe 0f 84 f4 04 00 00 48 8b 00 <44> 8b 80 08 01 00 00 41 f6 c4 01 4c 89 e7
ba 58 00 00 00 0f 85 47
[40595.828155] RSP: 0018:ffffc90000a73438 EFLAGS: 00010286
[40595.828705] RAX: 0000000000000000 RBX: ffff8881329d7100 RCX: 0000000000000000
[40595.829450] RDX: 0000000000000000 RSI: ffff8881339e70ce RDI: ffff8881329d7100
[40595.830191] RBP: ffffc90000a73470 R08: 0000000000000000 R09: 000000000000000a
[40595.830936] R10: 0000000000000000 R11: 0000000000000000 R12: ffffc90000a73490
[40595.831682] R13: 000000000000002c R14: ffff888132ff1301 R15: ffff8881329d7100
[40595.832427] FS:  00007f5bfcfd6700(0000) GS:ffff88813ba00000(0000) knlGS:0000000000000000
[40595.833266] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[40595.833883] CR2: 0000000000000108 CR3: 000000013a368000 CR4: 00000000000006f0
[40595.834633] Call Trace:
[40595.835392]  ? rt6_multipath_hash+0x4c/0x390
[40595.835853]  icmpv6_route_lookup+0xcb/0x1d0
[40595.836296]  ? icmpv6_xrlim_allow+0x3e/0x140
[40595.836751]  icmp6_send+0x537/0x840
[40595.837125]  icmpv6_send+0x20/0x30
[40595.837494]  tnl_update_pmtu.isra.27+0x19d/0x2a0 [ip_tunnel]
[40595.838088]  ip_md_tunnel_xmit+0x1b6/0x510 [ip_tunnel]
[40595.838633]  gre_tap_xmit+0x10c/0x160 [ip_gre]
[40595.839103]  dev_hard_start_xmit+0x93/0x200
[40595.839551]  sch_direct_xmit+0x101/0x2d0
[40595.839967]  __dev_queue_xmit+0x69f/0x9c0
[40595.840399]  do_execute_actions+0x1717/0x1910 [openvswitch]
[40595.840987]  ? validate_set.isra.12+0x2f5/0x3d0 [openvswitch]
[40595.841596]  ? reserve_sfa_size+0x31/0x130 [openvswitch]
[40595.842154]  ? __ovs_nla_copy_actions+0x1b4/0xad0 [openvswitch]
[40595.842778]  ? __kmalloc_reserve.isra.50+0x2e/0x80
[40595.843285]  ? should_failslab+0xa/0x20
[40595.843696]  ? __kmalloc+0x188/0x220
[40595.844078]  ? __alloc_skb+0x97/0x270
[40595.844472]  ovs_execute_actions+0x47/0x120 [openvswitch]
[40595.845041]  ovs_packet_cmd_execute+0x27d/0x2b0 [openvswitch]
[40595.845648]  genl_family_rcv_msg+0x3a8/0x430
[40595.846101]  genl_rcv_msg+0x47/0x90
[40595.846476]  ? __alloc_skb+0x83/0x270
[40595.846866]  ? genl_family_rcv_msg+0x430/0x430
[40595.847335]  netlink_rcv_skb+0xcb/0x100
[40595.847777]  genl_rcv+0x24/0x40
[40595.848113]  netlink_unicast+0x17f/0x230
[40595.848535]  netlink_sendmsg+0x2ed/0x3e0
[40595.848951]  sock_sendmsg+0x4f/0x60
[40595.849323]  ___sys_sendmsg+0x2bd/0x2e0
[40595.849733]  ? sock_poll+0x6f/0xb0
[40595.850098]  ? ep_scan_ready_list.isra.14+0x20b/0x240
[40595.850634]  ? _cond_resched+0x15/0x30
[40595.851032]  ? ep_poll+0x11b/0x440
[40595.851401]  ? _copy_to_user+0x22/0x30
[40595.851799]  __sys_sendmsg+0x58/0xa0
[40595.852180]  do_syscall_64+0x5b/0x190
[40595.852574]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[40595.853105] RIP: 0033:0x7f5c00038c7d
[40595.853489] Code: c7 20 00 00 75 10 b8 2e 00 00 00 0f 05 48 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 8e f7 ff ff 48 89 04 24 b8 2e 00 00 00 0f 05 <48> 8b 3c 24 48 89 c2 e8 d7 f7 ff ff 48 89
d0 48 83 c4 08 48 3d 01
[40595.855443] RSP: 002b:00007f5bfcf73c00 EFLAGS: 00003293 ORIG_RAX: 000000000000002e
[40595.856244] RAX: ffffffffffffffda RBX: 00007f5bfcf74a60 RCX: 00007f5c00038c7d
[40595.856990] RDX: 0000000000000000 RSI: 00007f5bfcf73c60 RDI: 0000000000000015
[40595.857736] RBP: 0000000000000004 R08: 0000000000000b7c R09: 0000000000000110
[40595.858613] R10: 0001000800050004 R11: 0000000000003293 R12: 000055c2d8329da0
[40595.859401] R13: 00007f5bfcf74120 R14: 0000000000000347 R15: 00007f5bfcf73c60
[40595.860185] Modules linked in: ip_gre ip_tunnel gre openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 sunrpc bochs_drm ttm drm_kms_helper drm pcspkr joydev i2c_piix4 qemu_fw_cfg xfs libcrc32c virtio_net net_failover serio_raw failover ata_generic virtio_blk pata_acpi floppy
[40595.863155] CR2: 0000000000000108
[40595.863551] ---[ end trace 22209bbcacb4addd ]---

Fixes: c8b34e680a09 ("ip_tunnel: Add tnl_update_pmtu in ip_md_tunnel_xmit")
Fixes: 8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6 tunnels")
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
 net/ipv4/ip_tunnel.c  |  3 +++
 net/ipv6/ip6_tunnel.c | 13 +++++++++----
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index 38c02bb62e2c..c6713c7287df 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -597,6 +597,9 @@ void ip_md_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
 		goto tx_error;
 	}
 
+	if (skb_dst(skb) && !skb_dst(skb)->dev)
+		skb_dst(skb)->dev = rt->dst.dev;
+
 	if (key->tun_flags & TUNNEL_DONT_FRAGMENT)
 		df = htons(IP_DF);
 	if (tnl_update_pmtu(dev, skb, rt, df, inner_iph, tunnel_hlen,
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 754a484d35df..6ccf8f0eb8e7 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1109,10 +1109,15 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device *dev, __u8 dsfield,
 			dst = NULL;
 			goto tx_err_link_failure;
 		}
-		if (t->parms.collect_md && ipv6_addr_any(&fl6->saddr) &&
-		    ipv6_dev_get_saddr(net, ip6_dst_idev(dst)->dev,
-				       &fl6->daddr, 0, &fl6->saddr))
-			goto tx_err_link_failure;
+		if (t->parms.collect_md) {
+			if (ipv6_addr_any(&fl6->saddr) &&
+			    ipv6_dev_get_saddr(net, ip6_dst_idev(dst)->dev,
+					       &fl6->daddr, 0, &fl6->saddr))
+				goto tx_err_link_failure;
+
+			if (skb_dst(skb) && !skb_dst(skb)->dev)
+				skb_dst(skb)->dev = dst->dev;
+		}
 		ndst = dst;
 	}
 
-- 
2.19.2


^ permalink raw reply related

* Re: [patch net-next v2 2/2] selftests: netdevsim: add devlink params tests
From: Jiri Pirko @ 2019-08-15  5:58 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: netdev, davem, mlxsw
In-Reply-To: <20190814180900.71712d88@cakuba.netronome.com>

Thu, Aug 15, 2019 at 03:09:00AM CEST, jakub.kicinski@netronome.com wrote:
>On Wed, 14 Aug 2019 17:26:04 +0200, Jiri Pirko wrote:
>> From: Jiri Pirko <jiri@mellanox.com>
>> 
>> Test recently added netdevsim devlink param implementation.
>> 
>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>> ---
>> v1->v2:
>> -using cmd_jq helper
>
>Still failing here :(

Ugh :/

>
># ./devlink.sh 
>TEST: fw flash test                                                 [ OK ]
>TEST: params test                                                   [FAIL]
>	Failed to get test1 param value
>TEST: regions test                                                  [ OK ]
>
># jq --version
>jq-1.5-1-a5b5cbe
># echo '{ "a" : false }' | jq -e -r '.[]'
>false
># echo $?
>1
>
>On another machine:
>
>$ echo '{ "a" : false }' | jq -e -r '.[]'
>false
>$ echo $?
>1
>
>Did you mean to drop the -e ?

No. -e is needed in order to jq return error in case there is no output.
Looks like a bug in jq 1.6 fixed. How about I add a check for jq >= 1.6?

^ permalink raw reply

* Re: [net-next  1/1] tipc: clean up skb list lock handling on send path
From: Xin Long @ 2019-08-15  5:57 UTC (permalink / raw)
  To: Jon Maloy
  Cc: davem, netdev, tung q nguyen, hoang h le, shuali, ying xue,
	edumazet, tipc-discussion
In-Reply-To: <1565794548-15425-1-git-send-email-jon.maloy@ericsson.com>



----- Original Message -----
> The policy for handling the skb list locks on the send and receive paths
> is simple.
> 
> - On the send path we never need to grab the lock on the 'xmitq' list
>   when the destination is an exernal node.
> 
> - On the receive path we always need to grab the lock on the 'inputq'
>   list, irrespective of source node.
> 
> However, when transmitting node local messages those will eventually
> end up on the receive path of a local socket, meaning that the argument
> 'xmitq' in tipc_node_xmit() will become the 'ínputq' argument in  the
> function tipc_sk_rcv(). This has been handled by always initializing
> the spinlock of the 'xmitq' list at message creation, just in case it
> may end up on the receive path later, and despite knowing that the lock
> in most cases never will be used.
> 
> This approach is inaccurate and confusing, and has also concealed the
> fact that the stated 'no lock grabbing' policy for the send path is
> violated in some cases.
> 
> We now clean up this by never initializing the lock at message creation,
> instead doing this at the moment we find that the message actually will
> enter the receive path. At the same time we fix the four locations
> where we incorrectly access the spinlock on the send/error path.
> 
> This patch also reverts commit d12cffe9329f ("tipc: ensure head->lock
> is initialised") which has now become redundant.
> 
> CC: Eric Dumazet <edumazet@google.com>
> Reported-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
> Acked-by: Ying Xue <ying.xue@windriver.com>
> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
> ---
>  net/tipc/bcast.c      | 10 +++++-----
>  net/tipc/link.c       |  4 ++--
>  net/tipc/name_distr.c |  2 +-
>  net/tipc/node.c       |  7 ++++---
>  net/tipc/socket.c     | 14 +++++++-------
>  5 files changed, 19 insertions(+), 18 deletions(-)
> 
> diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c
> index 34f3e56..6ef1abd 100644
> --- a/net/tipc/bcast.c
> +++ b/net/tipc/bcast.c
> @@ -185,7 +185,7 @@ static void tipc_bcbase_xmit(struct net *net, struct
> sk_buff_head *xmitq)
>  	}
>  
>  	/* We have to transmit across all bearers */
> -	skb_queue_head_init(&_xmitq);
> +	__skb_queue_head_init(&_xmitq);
>  	for (bearer_id = 0; bearer_id < MAX_BEARERS; bearer_id++) {
>  		if (!bb->dests[bearer_id])
>  			continue;
> @@ -256,7 +256,7 @@ static int tipc_bcast_xmit(struct net *net, struct
> sk_buff_head *pkts,
>  	struct sk_buff_head xmitq;
>  	int rc = 0;
>  
> -	skb_queue_head_init(&xmitq);
> +	__skb_queue_head_init(&xmitq);
>  	tipc_bcast_lock(net);
>  	if (tipc_link_bc_peers(l))
>  		rc = tipc_link_xmit(l, pkts, &xmitq);
> @@ -286,7 +286,7 @@ static int tipc_rcast_xmit(struct net *net, struct
> sk_buff_head *pkts,
>  	u32 dnode, selector;
>  
>  	selector = msg_link_selector(buf_msg(skb_peek(pkts)));
> -	skb_queue_head_init(&_pkts);
> +	__skb_queue_head_init(&_pkts);
>  
>  	list_for_each_entry_safe(dst, tmp, &dests->list, list) {
>  		dnode = dst->node;
> @@ -344,7 +344,7 @@ static int tipc_mcast_send_sync(struct net *net, struct
> sk_buff *skb,
>  	msg_set_size(_hdr, MCAST_H_SIZE);
>  	msg_set_is_rcast(_hdr, !msg_is_rcast(hdr));
>  
> -	skb_queue_head_init(&tmpq);
> +	__skb_queue_head_init(&tmpq);
>  	__skb_queue_tail(&tmpq, _skb);
>  	if (method->rcast)
>  		tipc_bcast_xmit(net, &tmpq, cong_link_cnt);
> @@ -378,7 +378,7 @@ int tipc_mcast_xmit(struct net *net, struct sk_buff_head
> *pkts,
>  	int rc = 0;
>  
>  	skb_queue_head_init(&inputq);
> -	skb_queue_head_init(&localq);
> +	__skb_queue_head_init(&localq);
>  
>  	/* Clone packets before they are consumed by next call */
>  	if (dests->local && !tipc_msg_reassemble(pkts, &localq)) {
> diff --git a/net/tipc/link.c b/net/tipc/link.c
> index dd3155b..ba057a9 100644
> --- a/net/tipc/link.c
> +++ b/net/tipc/link.c
> @@ -959,7 +959,7 @@ int tipc_link_xmit(struct tipc_link *l, struct
> sk_buff_head *list,
>  		pr_warn("Too large msg, purging xmit list %d %d %d %d %d!\n",
>  			skb_queue_len(list), msg_user(hdr),
>  			msg_type(hdr), msg_size(hdr), mtu);
> -		skb_queue_purge(list);
> +		__skb_queue_purge(list);
>  		return -EMSGSIZE;
>  	}
>  
> @@ -988,7 +988,7 @@ int tipc_link_xmit(struct tipc_link *l, struct
> sk_buff_head *list,
>  		if (likely(skb_queue_len(transmq) < maxwin)) {
>  			_skb = skb_clone(skb, GFP_ATOMIC);
>  			if (!_skb) {
> -				skb_queue_purge(list);
> +				__skb_queue_purge(list);
>  				return -ENOBUFS;
>  			}
>  			__skb_dequeue(list);
> diff --git a/net/tipc/name_distr.c b/net/tipc/name_distr.c
> index 44abc8e..61219f0 100644
> --- a/net/tipc/name_distr.c
> +++ b/net/tipc/name_distr.c
> @@ -190,7 +190,7 @@ void tipc_named_node_up(struct net *net, u32 dnode)
>  	struct name_table *nt = tipc_name_table(net);
>  	struct sk_buff_head head;
>  
> -	skb_queue_head_init(&head);
> +	__skb_queue_head_init(&head);
>  
>  	read_lock_bh(&nt->cluster_scope_lock);
>  	named_distribute(net, &head, dnode, &nt->cluster_scope);
> diff --git a/net/tipc/node.c b/net/tipc/node.c
> index 1bdcf0f..c8f6177 100644
> --- a/net/tipc/node.c
> +++ b/net/tipc/node.c
> @@ -1444,13 +1444,14 @@ int tipc_node_xmit(struct net *net, struct
> sk_buff_head *list,
>  
>  	if (in_own_node(net, dnode)) {
>  		tipc_loopback_trace(net, list);
> +		spin_lock_init(&list->lock);
>  		tipc_sk_rcv(net, list);
>  		return 0;
>  	}
>  
>  	n = tipc_node_find(net, dnode);
>  	if (unlikely(!n)) {
> -		skb_queue_purge(list);
> +		__skb_queue_purge(list);
>  		return -EHOSTUNREACH;
>  	}
>  
> @@ -1459,7 +1460,7 @@ int tipc_node_xmit(struct net *net, struct sk_buff_head
> *list,
>  	if (unlikely(bearer_id == INVALID_BEARER_ID)) {
>  		tipc_node_read_unlock(n);
>  		tipc_node_put(n);
> -		skb_queue_purge(list);
> +		__skb_queue_purge(list);
>  		return -EHOSTUNREACH;
>  	}
>  
> @@ -1491,7 +1492,7 @@ int tipc_node_xmit_skb(struct net *net, struct sk_buff
> *skb, u32 dnode,
>  {
>  	struct sk_buff_head head;
>  
> -	skb_queue_head_init(&head);
> +	__skb_queue_head_init(&head);
>  	__skb_queue_tail(&head, skb);
>  	tipc_node_xmit(net, &head, dnode, selector);
>  	return 0;
> diff --git a/net/tipc/socket.c b/net/tipc/socket.c
> index 83ae41d..3b9f8cc 100644
> --- a/net/tipc/socket.c
> +++ b/net/tipc/socket.c
> @@ -809,7 +809,7 @@ static int tipc_sendmcast(struct  socket *sock, struct
> tipc_name_seq *seq,
>  	msg_set_nameupper(hdr, seq->upper);
>  
>  	/* Build message as chain of buffers */
> -	skb_queue_head_init(&pkts);
> +	__skb_queue_head_init(&pkts);
>  	rc = tipc_msg_build(hdr, msg, 0, dlen, mtu, &pkts);
>  
>  	/* Send message if build was successful */
> @@ -853,7 +853,7 @@ static int tipc_send_group_msg(struct net *net, struct
> tipc_sock *tsk,
>  	msg_set_grp_bc_seqno(hdr, bc_snd_nxt);
>  
>  	/* Build message as chain of buffers */
> -	skb_queue_head_init(&pkts);
> +	__skb_queue_head_init(&pkts);
>  	mtu = tipc_node_get_mtu(net, dnode, tsk->portid);
>  	rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts);
>  	if (unlikely(rc != dlen))
> @@ -1058,7 +1058,7 @@ static int tipc_send_group_bcast(struct socket *sock,
> struct msghdr *m,
>  	msg_set_grp_bc_ack_req(hdr, ack);
>  
>  	/* Build message as chain of buffers */
> -	skb_queue_head_init(&pkts);
> +	__skb_queue_head_init(&pkts);
>  	rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts);
>  	if (unlikely(rc != dlen))
>  		return rc;
> @@ -1387,7 +1387,7 @@ static int __tipc_sendmsg(struct socket *sock, struct
> msghdr *m, size_t dlen)
>  	if (unlikely(rc))
>  		return rc;
>  
> -	skb_queue_head_init(&pkts);
> +	__skb_queue_head_init(&pkts);
>  	mtu = tipc_node_get_mtu(net, dnode, tsk->portid);
>  	rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts);
>  	if (unlikely(rc != dlen))
> @@ -1445,7 +1445,7 @@ static int __tipc_sendstream(struct socket *sock,
> struct msghdr *m, size_t dlen)
>  	int send, sent = 0;
>  	int rc = 0;
>  
> -	skb_queue_head_init(&pkts);
> +	__skb_queue_head_init(&pkts);
>  
>  	if (unlikely(dlen > INT_MAX))
>  		return -EMSGSIZE;
> @@ -1805,7 +1805,7 @@ static int tipc_recvmsg(struct socket *sock, struct
> msghdr *m,
>  
>  	/* Send group flow control advertisement when applicable */
>  	if (tsk->group && msg_in_group(hdr) && !grp_evt) {
> -		skb_queue_head_init(&xmitq);
> +		__skb_queue_head_init(&xmitq);
>  		tipc_group_update_rcv_win(tsk->group, tsk_blocks(hlen + dlen),
>  					  msg_orignode(hdr), msg_origport(hdr),
>  					  &xmitq);
> @@ -2674,7 +2674,7 @@ static void tipc_sk_timeout(struct timer_list *t)
>  	struct sk_buff_head list;
>  	int rc = 0;
>  
> -	skb_queue_head_init(&list);
> +	__skb_queue_head_init(&list);
>  	bh_lock_sock(sk);
>  
>  	/* Try again later if socket is busy */
> --
> 2.1.4
> 
> 
Patch looks good, can you also check those tmp tx queues in:

  tipc_group_cong()
  tipc_group_join()
  tipc_link_create_dummy_tnl_msg()
  tipc_link_tnl_prepare()

which are using skb_queue_head_init() to init?

Thanks.

^ permalink raw reply

* Re: [RFC PATCH 1/5] x86: tsc: add tsc to art helpers
From: Felipe Balbi @ 2019-08-15  5:57 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Richard Cochran, netdev, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, linux-kernel, Christopher S . Hall
In-Reply-To: <alpine.DEB.2.21.1907160952040.1767@nanos.tec.linutronix.de>


Hi,


Thomas Gleixner <tglx@linutronix.de> writes:

> Felipe,
>
> On Tue, 16 Jul 2019, Felipe Balbi wrote:
>
> -ENOCHANGELOG
>
> As you said in the cover letter:
>
>>  (3) The change in arch/x86/kernel/tsc.c needs to be reviewed at length
>>      before going in.
>
> So some information what those interfaces are used for and why they are
> needed would be really helpful.

Okay, I have some more details about this. The TGPIO device itself uses
ART since TSC is not directly available to anything other than the
CPU. The 'problem' here is that reading ART incurs extra latency which
we would like to avoid. Therefore, we use TSC and scale it to
nanoseconds which, would be the same as ART to ns.

>> +void get_tsc_ns(struct system_counterval_t *tsc_counterval, u64 *tsc_ns)
>> +{
>> +	u64 tmp, res, rem;
>> +	u64 cycles;
>> +
>> +	tsc_counterval->cycles = clocksource_tsc.read(NULL);
>> +	cycles = tsc_counterval->cycles;
>> +	tsc_counterval->cs = art_related_clocksource;
>> +
>> +	rem = do_div(cycles, tsc_khz);
>> +
>> +	res = cycles * USEC_PER_SEC;
>> +	tmp = rem * USEC_PER_SEC;
>> +
>> +	do_div(tmp, tsc_khz);
>> +	res += tmp;
>> +
>> +	*tsc_ns = res;
>> +}
>> +EXPORT_SYMBOL(get_tsc_ns);
>> +
>> +u64 get_art_ns_now(void)
>> +{
>> +	struct system_counterval_t tsc_cycles;
>> +	u64 tsc_ns;
>> +
>> +	get_tsc_ns(&tsc_cycles, &tsc_ns);
>> +
>> +	return tsc_ns;
>> +}
>> +EXPORT_SYMBOL(get_art_ns_now);
>
> While the changes look innocuous I'm missing the big picture why this needs
> to emulate ART instead of simply using TSC directly.

i don't think we're emulating ART here (other than the name in the
function). We're just reading TSC and converting to nanoseconds, right?

Cheers

-- 
balbi

^ permalink raw reply

* [PATCH] Bluetooth: 6lowpan: Make variable header_ops constant
From: Nishka Dasgupta @ 2019-08-15  5:52 UTC (permalink / raw)
  To: marcel, johan.hedberg, linux-bluetooth, linux-kernel, davem,
	netdev
  Cc: Nishka Dasgupta

Static variable header_ops, of type header_ops, is used only once, when
it is assigned to field header_ops of a variable having type net_device.
This corresponding field is declared as const in the definition of
net_device. Hence make header_ops constant as well to protect it from
unnecessary modification.
Issue found with Coccinelle.

Signed-off-by: Nishka Dasgupta <nishkadg.linux@gmail.com>
---
 net/bluetooth/6lowpan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/bluetooth/6lowpan.c b/net/bluetooth/6lowpan.c
index 9d41de1ec90f..bb55d92691b0 100644
--- a/net/bluetooth/6lowpan.c
+++ b/net/bluetooth/6lowpan.c
@@ -583,7 +583,7 @@ static const struct net_device_ops netdev_ops = {
 	.ndo_start_xmit		= bt_xmit,
 };
 
-static struct header_ops header_ops = {
+static const struct header_ops header_ops = {
 	.create	= header_create,
 };
 
-- 
2.19.1


^ permalink raw reply related

* Re: tun: mark small packets as owned by the tap sock
From: Jason Wang @ 2019-08-15  5:11 UTC (permalink / raw)
  To: Dave Jones; +Cc: Alexis Bauvin, netdev
In-Reply-To: <20190813140025.GA17823@codemonkey.org.uk>


On 2019/8/13 下午10:00, Dave Jones wrote:
> On Tue, Aug 13, 2019 at 04:33:59PM +0800, Jason Wang wrote:
>   >
>   > On 2019/8/13 上午6:19, Dave Jones wrote:
>   > > On Wed, Aug 07, 2019 at 12:30:07AM +0000, Linux Kernel wrote:
>   > >   > Commit:     4b663366246be1d1d4b1b8b01245b2e88ad9e706
>   > >   > Parent:     16b2084a8afa1432d14ba72b7c97d7908e178178
>   > >   > Web:        https://git.kernel.org/torvalds/c/4b663366246be1d1d4b1b8b01245b2e88ad9e706
>   > >   > Author:     Alexis Bauvin <abauvin@scaleway.com>
>   > >   > AuthorDate: Tue Jul 23 16:23:01 2019 +0200
>   > >   >
>   > >   >     tun: mark small packets as owned by the tap sock
>   > >   >
>   > >   >     - v1 -> v2: Move skb_set_owner_w to __tun_build_skb to reduce patch size
>   > >
>   > > This commit breaks ipv6 routing when I deployed on it a linode.
>   > > It seems to work briefly after boot, and then silently all packets get
>   > > dropped. (Presumably, it's dropping RA or ND packets)
>   > >
>   > > With this reverted, everything works as it did in rc3.
>   > >
>   > Two questions:
>   >
>   > - Are you using XDP for TUN?
>
> not knowingly.
> $ grep XDP .config
> # CONFIG_XDP_SOCKETS is not set
>
> What's configured on the hypervisor side I have no idea.


Ok, please tell me more about your setups:

- Are you using TUN in host or guest?

- Are you using it for VM or VPN(tunneling)?

- Where did the packet get dropped?


>
>   > - Does it work before 66ccbc9c87c2?
>
> that's been around since 4.14-rc1, and at one point it ran whatever was
> in debian9 (4.9).  I don't recall it ever not working, so I'd say yes.
>
> I can build a 4.13 if it'll prove something, but it'll take me a while.
> (This is my primary MX, so it's dropping email while it's on the broken
>   kernel, so I need to plan some time to be around to babysit it)


If possible please try that.


>
>   > If yes, could you show us the result of net_dropmonitor?
>
> where do I get that?  It doesn't seem packaged for debian.
>
> 	Dave


It's part of perf-script(1). You can simply start it through perf script 
record net_dropmonitor.

Thanks

>

^ permalink raw reply

* [PATCH bpf-next 1/5] xsk: Convert bool 'zc' field in struct xdp_umem to a u32 bitmap
From: Sridhar Samudrala @ 2019-08-15  3:46 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, netdev, bpf, sridhar.samudrala,
	intel-wired-lan, maciej.fijalkowski, tom.herbert
In-Reply-To: <1565840783-8269-1-git-send-email-sridhar.samudrala@intel.com>

The bool 'zc' field in struct xdp_uem is replaced with a u32 flags
field and a bit within flags is used to indicate zerocopy. This flags
field will be used in later patches for other bit fields.

Also, removed the bool 'zc' field from struct xdp_sock as it can be
accessed via flags in xs->umem.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 include/net/xdp_sock.h | 12 ++++++++++--
 net/xdp/xdp_umem.c     |  6 +++---
 net/xdp/xsk.c          | 12 +++++++++---
 net/xdp/xsk_diag.c     |  3 ++-
 4 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 69796d264f06..b6716dbdce1a 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -27,6 +27,9 @@ struct xdp_umem_fq_reuse {
 	u64 handles[];
 };
 
+/* Bits for the umem flags field. */
+#define XDP_UMEM_F_ZEROCOPY	(1 << 0)
+
 struct xdp_umem {
 	struct xsk_queue *fq;
 	struct xsk_queue *cq;
@@ -45,7 +48,7 @@ struct xdp_umem {
 	struct net_device *dev;
 	struct xdp_umem_fq_reuse *fq_reuse;
 	u16 queue_id;
-	bool zc;
+	u32 flags;
 	spinlock_t xsk_list_lock;
 	struct list_head xsk_list;
 };
@@ -58,7 +61,6 @@ struct xdp_sock {
 	struct xdp_umem *umem;
 	struct list_head flush_node;
 	u16 queue_id;
-	bool zc;
 	enum {
 		XSK_READY = 0,
 		XSK_BOUND,
@@ -95,6 +97,7 @@ struct xdp_umem_fq_reuse *xsk_reuseq_swap(struct xdp_umem *umem,
 					  struct xdp_umem_fq_reuse *newq);
 void xsk_reuseq_free(struct xdp_umem_fq_reuse *rq);
 struct xdp_umem *xdp_get_umem_from_qid(struct net_device *dev, u16 queue_id);
+bool xsk_umem_zerocopy(struct xdp_umem *umem);
 
 static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
 {
@@ -213,6 +216,11 @@ static inline struct xdp_umem *xdp_get_umem_from_qid(struct net_device *dev,
 	return NULL;
 }
 
+static inline bool xsk_umem_zerocopy(struct xdp_umem *umem)
+{
+	return false;
+}
+
 static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
 {
 	return NULL;
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index a0607969f8c0..411b3e3498c4 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -126,7 +126,7 @@ int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
 	if (err)
 		goto err_unreg_umem;
 
-	umem->zc = true;
+	umem->flags |= XDP_UMEM_F_ZEROCOPY;
 	return 0;
 
 err_unreg_umem:
@@ -147,7 +147,7 @@ void xdp_umem_clear_dev(struct xdp_umem *umem)
 	if (!umem->dev)
 		return;
 
-	if (umem->zc) {
+	if (xsk_umem_zerocopy(umem)) {
 		bpf.command = XDP_SETUP_XSK_UMEM;
 		bpf.xsk.umem = NULL;
 		bpf.xsk.queue_id = umem->queue_id;
@@ -162,7 +162,7 @@ void xdp_umem_clear_dev(struct xdp_umem *umem)
 
 	dev_put(umem->dev);
 	umem->dev = NULL;
-	umem->zc = false;
+	umem->flags &= ~XDP_UMEM_F_ZEROCOPY;
 }
 
 static void xdp_umem_unmap_pages(struct xdp_umem *umem)
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 59b57d708697..ca95676ef75d 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -295,6 +295,12 @@ static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
 	return err;
 }
 
+bool xsk_umem_zerocopy(struct xdp_umem *umem)
+{
+	return (umem && (umem->flags & XDP_UMEM_F_ZEROCOPY));
+}
+EXPORT_SYMBOL(xsk_umem_zerocopy);
+
 static int xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
 {
 	bool need_wait = !(m->msg_flags & MSG_DONTWAIT);
@@ -310,7 +316,8 @@ static int xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
 	if (need_wait)
 		return -EOPNOTSUPP;
 
-	return (xs->zc) ? xsk_zc_xmit(sk) : xsk_generic_xmit(sk, m, total_len);
+	return xsk_umem_zerocopy(xs->umem) ? xsk_zc_xmit(sk) :
+					     xsk_generic_xmit(sk, m, total_len);
 }
 
 static unsigned int xsk_poll(struct file *file, struct socket *sock,
@@ -503,7 +510,6 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 	}
 
 	xs->dev = dev;
-	xs->zc = xs->umem->zc;
 	xs->queue_id = qid;
 	xskq_set_umem(xs->rx, xs->umem->size, xs->umem->chunk_mask);
 	xskq_set_umem(xs->tx, xs->umem->size, xs->umem->chunk_mask);
@@ -683,7 +689,7 @@ static int xsk_getsockopt(struct socket *sock, int level, int optname,
 			return -EINVAL;
 
 		mutex_lock(&xs->mutex);
-		if (xs->zc)
+		if (xsk_umem_zerocopy(xs->umem))
 			opts.flags |= XDP_OPTIONS_ZEROCOPY;
 		mutex_unlock(&xs->mutex);
 
diff --git a/net/xdp/xsk_diag.c b/net/xdp/xsk_diag.c
index d5e06c8e0cbf..8a19b7e87cfb 100644
--- a/net/xdp/xsk_diag.c
+++ b/net/xdp/xsk_diag.c
@@ -10,6 +10,7 @@
 #include <net/xdp_sock.h>
 #include <linux/xdp_diag.h>
 #include <linux/sock_diag.h>
+#include <net/xdp_sock.h>
 
 #include "xsk_queue.h"
 #include "xsk.h"
@@ -61,7 +62,7 @@ static int xsk_diag_put_umem(const struct xdp_sock *xs, struct sk_buff *nlskb)
 	du.ifindex = umem->dev ? umem->dev->ifindex : 0;
 	du.queue_id = umem->queue_id;
 	du.flags = 0;
-	if (umem->zc)
+	if (xsk_umem_zerocopy(umem))
 		du.flags |= XDP_DU_F_ZEROCOPY;
 	du.refs = refcount_read(&umem->users);
 
-- 
2.20.1


^ permalink raw reply related

* [PATCH bpf-next 4/5] ixgbe: Enable XDP_SKIP_BPF option for AF_XDP sockets
From: Sridhar Samudrala @ 2019-08-15  3:46 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, netdev, bpf, sridhar.samudrala,
	intel-wired-lan, maciej.fijalkowski, tom.herbert
In-Reply-To: <1565840783-8269-1-git-send-email-sridhar.samudrala@intel.com>

This patch skips calling BPF program in the receive path if
the queue is associated with UMEM that is not shared and
bound to an AF_XDP socket that has enabled skip bpf during
bind() call.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 20 +++++++++++++++++--
 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c  | 16 +++++++++++++--
 2 files changed, 32 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index dc7b128c780e..594792860cdd 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2197,6 +2197,7 @@ static struct sk_buff *ixgbe_run_xdp(struct ixgbe_adapter *adapter,
 	int err, result = IXGBE_XDP_PASS;
 	struct bpf_prog *xdp_prog;
 	struct xdp_frame *xdpf;
+	struct xdp_umem *umem;
 	u32 act;
 
 	rcu_read_lock();
@@ -2207,6 +2208,13 @@ static struct sk_buff *ixgbe_run_xdp(struct ixgbe_adapter *adapter,
 
 	prefetchw(xdp->data_hard_start); /* xdp_frame write */
 
+	umem = xdp_get_umem_from_qid(rx_ring->netdev, rx_ring->queue_index);
+	if (xsk_umem_skip_bpf(umem)) {
+		err = xsk_umem_rcv(umem, xdp);
+		result = !err ? IXGBE_XDP_REDIR : IXGBE_XDP_CONSUMED;
+		goto xdp_out;
+	}
+
 	act = bpf_prog_run_xdp(xdp_prog, xdp);
 	switch (act) {
 	case XDP_PASS:
@@ -2400,8 +2408,16 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 		total_rx_packets++;
 	}
 
-	if (xdp_xmit & IXGBE_XDP_REDIR)
-		xdp_do_flush_map();
+	if (xdp_xmit & IXGBE_XDP_REDIR) {
+		struct xdp_umem *umem;
+
+		umem = xdp_get_umem_from_qid(rx_ring->netdev,
+					     rx_ring->queue_index);
+		if (xsk_umem_skip_bpf(umem))
+			xsk_umem_flush(umem);
+		else
+			xdp_do_flush_map();
+	}
 
 	if (xdp_xmit & IXGBE_XDP_TX) {
 		struct ixgbe_ring *ring = adapter->xdp_ring[smp_processor_id()];
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
index 6b609553329f..9ea8a769d7a8 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
@@ -148,6 +148,12 @@ static int ixgbe_run_xdp_zc(struct ixgbe_adapter *adapter,
 	struct xdp_frame *xdpf;
 	u32 act;
 
+	if (xsk_umem_skip_bpf(rx_ring->xsk_umem)) {
+		err = xsk_umem_rcv(rx_ring->xsk_umem, xdp);
+		result = !err ? IXGBE_XDP_REDIR : IXGBE_XDP_CONSUMED;
+		return result;
+	}
+
 	rcu_read_lock();
 	xdp_prog = READ_ONCE(rx_ring->xdp_prog);
 	act = bpf_prog_run_xdp(xdp_prog, xdp);
@@ -527,8 +533,14 @@ int ixgbe_clean_rx_irq_zc(struct ixgbe_q_vector *q_vector,
 		ixgbe_rx_skb(q_vector, skb);
 	}
 
-	if (xdp_xmit & IXGBE_XDP_REDIR)
-		xdp_do_flush_map();
+	if (xdp_xmit & IXGBE_XDP_REDIR) {
+		struct xdp_umem *umem = rx_ring->xsk_umem;
+
+		if (xsk_umem_skip_bpf(umem))
+			xsk_umem_flush(umem);
+		else
+			xdp_do_flush_map();
+	}
 
 	if (xdp_xmit & IXGBE_XDP_TX) {
 		struct ixgbe_ring *ring = adapter->xdp_ring[smp_processor_id()];
-- 
2.20.1


^ permalink raw reply related

* [PATCH bpf-next 5/5] xdpsock_user: Add skip_bpf option
From: Sridhar Samudrala @ 2019-08-15  3:46 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, netdev, bpf, sridhar.samudrala,
	intel-wired-lan, maciej.fijalkowski, tom.herbert
In-Reply-To: <1565840783-8269-1-git-send-email-sridhar.samudrala@intel.com>

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 samples/bpf/xdpsock_user.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
index 93eaaf7239b2..509fc6a18af9 100644
--- a/samples/bpf/xdpsock_user.c
+++ b/samples/bpf/xdpsock_user.c
@@ -123,6 +123,9 @@ static void print_benchmark(bool running)
 	if (opt_poll)
 		printf("poll() ");
 
+	if (opt_xdp_bind_flags & XDP_SKIP_BPF)
+		printf("skip-bpf ");
+
 	if (running) {
 		printf("running...");
 		fflush(stdout);
@@ -352,6 +355,7 @@ static struct option long_options[] = {
 	{"zero-copy", no_argument, 0, 'z'},
 	{"copy", no_argument, 0, 'c'},
 	{"frame-size", required_argument, 0, 'f'},
+	{"skip-bpf", no_argument, 0, 's'},
 	{0, 0, 0, 0}
 };
 
@@ -372,6 +376,7 @@ static void usage(const char *prog)
 		"  -z, --zero-copy      Force zero-copy mode.\n"
 		"  -c, --copy           Force copy mode.\n"
 		"  -f, --frame-size=n   Set the frame size (must be a power of two, default is %d).\n"
+		"  -s, --skip-bpf       Skip running bpf program.\n"
 		"\n";
 	fprintf(stderr, str, prog, XSK_UMEM__DEFAULT_FRAME_SIZE);
 	exit(EXIT_FAILURE);
@@ -430,6 +435,9 @@ static void parse_command_line(int argc, char **argv)
 		case 'f':
 			opt_xsk_frame_size = atoi(optarg);
 			break;
+		case 's':
+			opt_xdp_bind_flags |= XDP_SKIP_BPF;
+			break;
 		default:
 			usage(basename(argv[0]));
 		}
-- 
2.20.1


^ permalink raw reply related

* [PATCH bpf-next 2/5] xsk: Introduce XDP_SKIP_BPF bind option
From: Sridhar Samudrala @ 2019-08-15  3:46 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, netdev, bpf, sridhar.samudrala,
	intel-wired-lan, maciej.fijalkowski, tom.herbert
In-Reply-To: <1565840783-8269-1-git-send-email-sridhar.samudrala@intel.com>

This option enables an AF_XDP socket to specify XDP_SKIP_BPF flag
with the bind() call to skip calling the BPF program in the receive
path and pass the XDP buffer directly to the socket.

When a single AF_XDP socket is associated with a queue and a HW
filter is used to redirect the packets and the app is interested in
receiving all the packets on that queue, we don't need an additional 
BPF program to do further filtering or lookup/redirect to a socket.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 include/net/xdp_sock.h        |  9 +++++++++
 include/uapi/linux/if_xdp.h   |  1 +
 include/uapi/linux/xdp_diag.h |  1 +
 net/xdp/xdp_umem.c            |  5 ++++-
 net/xdp/xsk.c                 | 31 +++++++++++++++++++++++++++++--
 net/xdp/xsk_diag.c            |  2 ++
 6 files changed, 46 insertions(+), 3 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index b6716dbdce1a..ad132a69db7c 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -29,6 +29,7 @@ struct xdp_umem_fq_reuse {
 
 /* Bits for the umem flags field. */
 #define XDP_UMEM_F_ZEROCOPY	(1 << 0)
+#define XDP_UMEM_F_SKIP_BPF	(1 << 1)
 
 struct xdp_umem {
 	struct xsk_queue *fq;
@@ -98,6 +99,9 @@ struct xdp_umem_fq_reuse *xsk_reuseq_swap(struct xdp_umem *umem,
 void xsk_reuseq_free(struct xdp_umem_fq_reuse *rq);
 struct xdp_umem *xdp_get_umem_from_qid(struct net_device *dev, u16 queue_id);
 bool xsk_umem_zerocopy(struct xdp_umem *umem);
+bool xsk_umem_skip_bpf(struct xdp_umem *umem);
+void xsk_umem_flush(struct xdp_umem *umem);
+int xsk_umem_rcv(struct xdp_umem *umem, struct xdp_buff *xdp);
 
 static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
 {
@@ -221,6 +225,11 @@ static inline bool xsk_umem_zerocopy(struct xdp_umem *umem)
 	return false;
 }
 
+static inline bool xsk_umem_skip_bpf(struct xdp_umem *umem)
+{
+	return false;
+}
+
 static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
 {
 	return NULL;
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index faaa5ca2a117..881447ebf3c9 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -16,6 +16,7 @@
 #define XDP_SHARED_UMEM	(1 << 0)
 #define XDP_COPY	(1 << 1) /* Force copy-mode */
 #define XDP_ZEROCOPY	(1 << 2) /* Force zero-copy mode */
+#define XDP_SKIP_BPF	(1 << 3) /* Skip running BPF program */
 
 struct sockaddr_xdp {
 	__u16 sxdp_family;
diff --git a/include/uapi/linux/xdp_diag.h b/include/uapi/linux/xdp_diag.h
index 78b2591a7782..6caf3d9c9abe 100644
--- a/include/uapi/linux/xdp_diag.h
+++ b/include/uapi/linux/xdp_diag.h
@@ -56,6 +56,7 @@ struct xdp_diag_ring {
 };
 
 #define XDP_DU_F_ZEROCOPY (1 << 0)
+#define XDP_DU_F_SKIP_BPF (2 << 0)
 
 struct xdp_diag_umem {
 	__u64	size;
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 411b3e3498c4..cbc02509dc90 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -106,6 +106,9 @@ int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
 	umem->dev = dev;
 	umem->queue_id = queue_id;
 
+	if (flags & XDP_SKIP_BPF)
+		umem->flags |= XDP_UMEM_F_SKIP_BPF;
+
 	dev_hold(dev);
 
 	if (force_copy)
@@ -162,7 +165,7 @@ void xdp_umem_clear_dev(struct xdp_umem *umem)
 
 	dev_put(umem->dev);
 	umem->dev = NULL;
-	umem->flags &= ~XDP_UMEM_F_ZEROCOPY;
+	umem->flags &= ~(XDP_UMEM_F_ZEROCOPY | XDP_UMEM_F_SKIP_BPF);
 }
 
 static void xdp_umem_unmap_pages(struct xdp_umem *umem)
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index ca95676ef75d..bcb6a77fae22 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -166,6 +166,27 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 	return err;
 }
 
+void xsk_umem_flush(struct xdp_umem *umem)
+{
+	struct xdp_sock *xs;
+
+	if (!list_empty(&umem->xsk_list)) {
+		xs = list_first_entry(&umem->xsk_list, struct xdp_sock, list);
+		xsk_flush(xs);
+	}
+}
+EXPORT_SYMBOL(xsk_umem_flush);
+
+int xsk_umem_rcv(struct xdp_umem *umem, struct xdp_buff *xdp)
+{
+	struct xdp_sock *xs;
+
+	xs = list_first_entry(&umem->xsk_list, struct xdp_sock, list);
+	xdp->handle += xdp->data - xdp->data_hard_start;
+	return xsk_rcv(xs, xdp);
+}
+EXPORT_SYMBOL(xsk_umem_rcv);
+
 void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries)
 {
 	xskq_produce_flush_addr_n(umem->cq, nb_entries);
@@ -301,6 +322,12 @@ bool xsk_umem_zerocopy(struct xdp_umem *umem)
 }
 EXPORT_SYMBOL(xsk_umem_zerocopy);
 
+bool xsk_umem_skip_bpf(struct xdp_umem *umem)
+{
+	return (umem && (umem->flags & XDP_UMEM_F_SKIP_BPF));
+}
+EXPORT_SYMBOL(xsk_umem_skip_bpf);
+
 static int xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
 {
 	bool need_wait = !(m->msg_flags & MSG_DONTWAIT);
@@ -434,7 +461,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 		return -EINVAL;
 
 	flags = sxdp->sxdp_flags;
-	if (flags & ~(XDP_SHARED_UMEM | XDP_COPY | XDP_ZEROCOPY))
+	if (flags & ~(XDP_SHARED_UMEM | XDP_COPY | XDP_ZEROCOPY | XDP_SKIP_BPF))
 		return -EINVAL;
 
 	rtnl_lock();
@@ -461,7 +488,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 		struct xdp_sock *umem_xs;
 		struct socket *sock;
 
-		if ((flags & XDP_COPY) || (flags & XDP_ZEROCOPY)) {
+		if (flags & (XDP_COPY | XDP_ZEROCOPY | XDP_SKIP_BPF)) {
 			/* Cannot specify flags for shared sockets. */
 			err = -EINVAL;
 			goto out_unlock;
diff --git a/net/xdp/xsk_diag.c b/net/xdp/xsk_diag.c
index 8a19b7e87cfb..f6f4b7912a22 100644
--- a/net/xdp/xsk_diag.c
+++ b/net/xdp/xsk_diag.c
@@ -64,6 +64,8 @@ static int xsk_diag_put_umem(const struct xdp_sock *xs, struct sk_buff *nlskb)
 	du.flags = 0;
 	if (xsk_umem_zerocopy(umem))
 		du.flags |= XDP_DU_F_ZEROCOPY;
+	if (xsk_umem_skip_bpf(umem))
+		du.flags |= XDP_DU_F_SKIP_BPF;
 	du.refs = refcount_read(&umem->users);
 
 	err = nla_put(nlskb, XDP_DIAG_UMEM, sizeof(du), &du);
-- 
2.20.1


^ permalink raw reply related

* [PATCH bpf-next 3/5] i40e: Enable XDP_SKIP_BPF option for AF_XDP sockets
From: Sridhar Samudrala @ 2019-08-15  3:46 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, netdev, bpf, sridhar.samudrala,
	intel-wired-lan, maciej.fijalkowski, tom.herbert
In-Reply-To: <1565840783-8269-1-git-send-email-sridhar.samudrala@intel.com>

This patch skips calling BPF program in the receive path if
the queue is associated with UMEM that is not shared and
bound to an AF_XDP socket that has enabled skip bpf during
bind() call.

Here are some performance numbers collected on 
  - 2 socket 28 core Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
  - Intel 40Gb Ethernet NIC (i40e)

All tests use 2 cores and the results are in Mpps.

turbo on (default)
---------------------------------------------	
                      no-skip-bpf    skip-bpf
---------------------------------------------	
rxdrop zerocopy           21.9         38.5 
l2fwd  zerocopy           17.0         20.5
rxdrop copy               11.1         13.3
l2fwd  copy                1.9          2.0

no turbo :  echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
---------------------------------------------	
                      no-skip-bpf    skip-bpf
---------------------------------------------	
rxdrop zerocopy           15.4         29.0
l2fwd  zerocopy           11.8         18.2
rxdrop copy                8.2         10.5
l2fwd  copy                1.7          1.7
---------------------------------------------	

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 22 +++++++++++++++++++--
 drivers/net/ethernet/intel/i40e/i40e_xsk.c  |  6 ++++++
 2 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index e3f29dc8b290..5e63e3644e87 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2199,6 +2199,7 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
 	int err, result = I40E_XDP_PASS;
 	struct i40e_ring *xdp_ring;
 	struct bpf_prog *xdp_prog;
+	struct xdp_umem *umem;
 	u32 act;
 
 	rcu_read_lock();
@@ -2209,6 +2210,13 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
 
 	prefetchw(xdp->data_hard_start); /* xdp_frame write */
 
+	umem = xdp_get_umem_from_qid(rx_ring->netdev, rx_ring->queue_index);
+	if (xsk_umem_skip_bpf(umem)) {
+		err = xsk_umem_rcv(umem, xdp);
+		result = !err ? I40E_XDP_REDIR : I40E_XDP_CONSUMED;
+		goto xdp_out;
+	}
+
 	act = bpf_prog_run_xdp(xdp_prog, xdp);
 	switch (act) {
 	case XDP_PASS:
@@ -2303,8 +2311,18 @@ void i40e_update_rx_stats(struct i40e_ring *rx_ring,
  **/
 void i40e_finalize_xdp_rx(struct i40e_ring *rx_ring, unsigned int xdp_res)
 {
-	if (xdp_res & I40E_XDP_REDIR)
-		xdp_do_flush_map();
+	if (xdp_res & I40E_XDP_REDIR) {
+		struct xdp_umem *umem;
+
+		umem = rx_ring->xsk_umem;
+		if (!umem)
+			umem = xdp_get_umem_from_qid(rx_ring->netdev,
+						     rx_ring->queue_index);
+		if (xsk_umem_skip_bpf(umem))
+			xsk_umem_flush(umem);
+		else
+			xdp_do_flush_map();
+	}
 
 	if (xdp_res & I40E_XDP_TX) {
 		struct i40e_ring *xdp_ring =
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index 32bad014d76c..cc538479c95d 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -195,6 +195,12 @@ static int i40e_run_xdp_zc(struct i40e_ring *rx_ring, struct xdp_buff *xdp)
 	struct bpf_prog *xdp_prog;
 	u32 act;
 
+	if (xsk_umem_skip_bpf(rx_ring->xsk_umem)) {
+		err = xsk_umem_rcv(rx_ring->xsk_umem, xdp);
+		result = !err ? I40E_XDP_REDIR : I40E_XDP_CONSUMED;
+		return result;
+	}
+
 	rcu_read_lock();
 	/* NB! xdp_prog will always be !NULL, due to the fact that
 	 * this path is enabled by setting an XDP program.
-- 
2.20.1


^ permalink raw reply related

* [PATCH bpf-next 0/5] Add support for SKIP_BPF flag for AF_XDP sockets
From: Sridhar Samudrala @ 2019-08-15  3:46 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, netdev, bpf, sridhar.samudrala,
	intel-wired-lan, maciej.fijalkowski, tom.herbert

This patch series introduces XDP_SKIP_BPF flag that can be specified
during the bind() call of an AF_XDP socket to skip calling the BPF 
program in the receive path and pass the buffer directly to the socket.

When a single AF_XDP socket is associated with a queue and a HW
filter is used to redirect the packets and the app is interested in
receiving all the packets on that queue, we don't need an additional 
BPF program to do further filtering or lookup/redirect to a socket.

Here are some performance numbers collected on 
  - 2 socket 28 core Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
  - Intel 40Gb Ethernet NIC (i40e)

All tests use 2 cores and the results are in Mpps.

turbo on (default)
---------------------------------------------	
                      no-skip-bpf    skip-bpf
---------------------------------------------	
rxdrop zerocopy           21.9         38.5 
l2fwd  zerocopy           17.0         20.5
rxdrop copy               11.1         13.3
l2fwd  copy                1.9          2.0

no turbo :  echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
---------------------------------------------	
                      no-skip-bpf    skip-bpf
---------------------------------------------	
rxdrop zerocopy           15.4         29.0
l2fwd  zerocopy           11.8         18.2
rxdrop copy                8.2         10.5
l2fwd  copy                1.7          1.7
---------------------------------------------	

Sridhar Samudrala (5):
  xsk: Convert bool 'zc' field in struct xdp_umem to a u32 bitmap
  xsk: Introduce XDP_SKIP_BPF bind option
  i40e: Enable XDP_SKIP_BPF option for AF_XDP sockets
  ixgbe: Enable XDP_SKIP_BPF option for AF_XDP sockets
  xdpsock_user: Add skip_bpf option

 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 22 +++++++++-
 drivers/net/ethernet/intel/i40e/i40e_xsk.c    |  6 +++
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 20 ++++++++-
 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c  | 16 ++++++-
 include/net/xdp_sock.h                        | 21 ++++++++-
 include/uapi/linux/if_xdp.h                   |  1 +
 include/uapi/linux/xdp_diag.h                 |  1 +
 net/xdp/xdp_umem.c                            |  9 ++--
 net/xdp/xsk.c                                 | 43 ++++++++++++++++---
 net/xdp/xsk_diag.c                            |  5 ++-
 samples/bpf/xdpsock_user.c                    |  8 ++++
 11 files changed, 135 insertions(+), 17 deletions(-)

-- 
2.20.1


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox