Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] nfp: replace spin_lock_bh with spin_lock in tasklet callback
From: David Miller @ 2018-09-08  5:56 UTC (permalink / raw)
  To: hangdianqj
  Cc: jakub.kicinski, dirk.vandermerwe, daniel, quentin.monnet,
	oss-drivers, netdev, linux-kernel
In-Reply-To: <20180907170109.48150-1-hangdianqj@163.com>

From: jun qian <hangdianqj@163.com>
Date: Fri,  7 Sep 2018 10:01:09 -0700

> As you are already in a tasklet, it is unnecessary to call spin_lock_bh.
> 
> Signed-off-by: jun qian <hangdianqj@163.com>

Applied to net-next, thanks.

^ permalink raw reply

* Re: [PATCH net-next] net: dsa: Expose tagging protocol to user-space
From: Jiri Pirko @ 2018-09-08  9:43 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: netdev, Andrew Lunn, Vivien Didelot, David S. Miller, open list
In-Reply-To: <20180907180908.3223-1-f.fainelli@gmail.com>

Fri, Sep 07, 2018 at 08:09:02PM CEST, f.fainelli@gmail.com wrote:
>There is no way for user-space to know what a given DSA network device's
>tagging protocol is. Expose this information through a dsa/tagging
>attribute which reflects the tagging protocol currently in use.
>
>This is helpful for configuration (e.g: none behaves dramatically
>different wrt. bridges) as well as for packet capture tools when there
>is not a proper Ethernet type available.

Hmm, I wonder. It this something that varies between ports of an
individual ASIC? Or is it rather something defined per-ASIC. If so, this
looks more like a devlink-api material.

^ permalink raw reply

* Re: RE packet: fix reserve calculation - net/packet/af_packet.c
From: James Sakalaukus @ 2018-09-08  3:18 UTC (permalink / raw)
  To: David Miller, netdev; +Cc: Willem de Bruijn
In-Reply-To: <20180907.103402.1676027167656921237.davem@davemloft.net>

On Fri, Sep 7, 2018 at 1:34 PM, David Miller <davem@davemloft.net> wrote:
> From: Willem de Bruijn <willemb@google.com>
> Date: Fri, 7 Sep 2018 11:50:02 -0400
>
>> If you are not seeing these problems with other protocols, I must be
>> misreading that code.
>
> Right, the ethernet header is only guaranteed to be 2 byte aligned on
> transmit.
>
> Nothing ever gave larger alignment guarantees.
>
> This is the _only_ reasonable situation.
>
> If we made the ethernet header to be 4 byte aligned, that means the
> ipv4 addresses in the ipv4 header are not 4 byte aligned so we would
> do unaligned memory accesses when building the headers which are
> either expensive or trap depending upon the architecture.

Understood.

I thought SOCK_RAW made no assumptions with respect to the network
layer headers.  Previously, the buffer was allocated and then data was
copied into it without giving thought to alignment.  Sounds like I got
lucky with the alignment and I should have expected 2-bytes from day
1.

I've been using this device with only SOCK_RAW on a real time network,
and I have not even tried to use SOCK_STREAM or SOCK_DGRAM.
So, I just tried to use TCP on the device, and the buffer alignment is
in fact 2-bytes inside my transmit function, and my DMA core does not
work.  So...I guess I need to bring my hardware into the 21st century
or do a copy in the driver transmit function.

Thanks for your time

^ permalink raw reply

* Re: [PATCH] mt76: another missing 'select' in Kconfig
From: Lorenzo Bianconi @ 2018-09-08  7:46 UTC (permalink / raw)
  To: valdis.kletnieks-PjAqaU27lzQ
  Cc: Kalle Valo, Network Development, linux-wireless
In-Reply-To: <30971.1536347293-+bZmOdGhbsPr6rcHtW+onFJE71vCis6O@public.gmane.org>

>
> commit b40b15e1521f7764ea8c68d5a00ecc971b673d21
> Author: Lorenzo Bianconi <lorenzo.bianconi-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Date:   Tue Jul 31 10:09:19 2018 +0200
>
>     mt76: add usb support to mt76 layer
>
> add a new mt76-usb.c for MT76x2U USB devices, but failed to wire
> it up for MT76x0U USB devices.
>
> Signed-off-by: Valdis Kletnieks <valdis.kletnieks-PjAqaU27lzQ@public.gmane.org>
> ---
> diff --git a/drivers/net/wireless/mediatek/mt76/Kconfig b/drivers/net/wireless/mediatek/mt76/Kconfig
> index 6a270e759006..cec977b81305 100644
> --- a/drivers/net/wireless/mediatek/mt76/Kconfig
> +++ b/drivers/net/wireless/mediatek/mt76/Kconfig
> @@ -17,6 +17,7 @@ config MT76x2_COMMON
>  config MT76x0U
>         tristate "MediaTek MT76x0U (USB) support"
>         select MT76_CORE
> +       select MT76_USB
>         depends on MAC80211

This fix is contained in latest series sent by Stanislaw:
https://patchwork.kernel.org/patch/10590265/

Regards,
Lorenzo

^ permalink raw reply

* Re: RE packet: fix reserve calculation - net/packet/af_packet.c
From: James Sakalaukus @ 2018-09-08  2:59 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: David Miller, netdev
In-Reply-To: <CA+FuTSdiO_L8mOC_urUz24RQ5AMD=Ntyi1vcMQ+QpuKCV3Y9Sg@mail.gmail.com>

On Fri, Sep 7, 2018 at 11:50 AM, Willem de Bruijn <willemb@google.com> wrote:
> Hi James,
>
> Thanks for the report. In the future please always include
> netdev@vger.kernel.org in technical discussions.
>
> On Fri, Sep 7, 2018 at 1:00 AM James Sakalaukus <james@sakalaukus.com> wrote:
>>
>> Hello Willem and David,
>>
>> I have an unpolished Ethernet driver for a PCIe FPGA subsystem, and
>> the following commit has the side effect of moving the data alignment
>> for SOCK_RAW packets.
>>
>>
>> commit b84bbaf7a6c8cca24f8acf25a2c8e46913a947ba
>>     net/packet/af_packet.c
>>
>> These changes to packet_snd() moves the data and tail pointers
>> backwards by net_device->hard_header_len, which is nominally ETH_HLEN.
>> The .ndo_start_xmit driver function now gets a struct sk_buff with
>> data alignment on a 2-byte boundary.  My DMA core is not happy about
>> it.
>>
>>
>> commit 9aad13b087ab0a588cd68259de618f100053360e
>>
>> This commit changed the previous fix from a skb_push to skb_reserve.
>> The functionality from my end did not change though.  .ndo_start_xmit
>> still gets a struct sk_buff with 2 byte alignment.
>>
>>
>> This may be causing problems for other network drivers with DMA
>> alignment requirements, but maybe its just me.
>
> This is the crux of the question.
>
>
> Did the PACKET_TX_RING variant work with your device?
>
> A quick scan seems to indicate that it is common to allocate a linear
> buffer and then reserve hard_header_len aligned up to 16B
> (HH_MOD_LEN). The actual alignment of both link layer and network
> header then depends on the alignment with which kmalloc returned. It
> is probably safe to assume that the buddy allocator returns a multiple
> of 4B at least for allocations of this size. Then the network layer
> header is 4B aligned. But for Ethernet, the skb_push in eth_header()
> would make the link layer header 2B aligned.
>
> If you are not seeing these problems with other protocols, I must be
> misreading that code.
>
> I will take a closer look.
>

I actually have not tested the device with any protocols other than
SOCK_RAW.  This device is on a real time network that does not use
standard network layer protocols.
PACKET_TX_RING is something I haven't had time to delve into.

It may be the case that other protocols always hand over a 2-byte
aligned buffer.  All I know for sure is that I was getting a 4-byte
aligned buffer before the update.


>> I've only been working
>> on the driver for about 6 months, so its not released, though I would
>> like to put it out there for others.  I updated my distro kernel this
>> week, and then proceeded to beat my head against the wall trying to
>> figure out why my driver mysteriously stopped working.
>>
>>
>> Thanks for your time,
>>
>> James Sakalaukus
>> james@sakalaukus.com

^ permalink raw reply

* Re: [PATCH net-next] net: dsa: b53: Fix build with B53_SRAB enabled and not B53_SERDES
From: David Miller @ 2018-09-08  6:12 UTC (permalink / raw)
  To: f.fainelli; +Cc: netdev, andrew, vivien.didelot, linux-kernel
In-Reply-To: <20180906184245.31442-1-f.fainelli@gmail.com>

From: Florian Fainelli <f.fainelli@gmail.com>
Date: Thu,  6 Sep 2018 11:42:45 -0700

> In case B53_SRAB is enabled, but not B53_SERDES, we can get the
> following linking error:
> 
> ERROR: "b53_serdes_init" [drivers/net/dsa/b53/b53_srab.ko] undefined!
> 
> We also need to ifdef the body of b53_srab_serdes_map_lane() since it
> would not be used when B53_SERDES is disabled and that would produce a
> warning.
> 
> Fixes: 0e01491de646 ("net: dsa: b53: Add SerDes support")
> Reported-by: kbuild test robot <lkp@intel.com>
> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH] xen/netfront: fix waiting for xenbus state change
From: David Miller @ 2018-09-08  6:03 UTC (permalink / raw)
  To: jgross; +Cc: linux-kernel, xen-devel, netdev, boris.ostrovsky, stable
In-Reply-To: <20180907122130.30810-1-jgross@suse.com>

From: Juergen Gross <jgross@suse.com>
Date: Fri,  7 Sep 2018 14:21:30 +0200

> Commit 822fb18a82aba ("xen-netfront: wait xenbus state change when load
> module manually") added a new wait queue to wait on for a state change
> when the module is loaded manually. Unfortunately there is no wakeup
> anywhere to stop that waiting.
> 
> Instead of introducing a new wait queue rename the existing
> module_unload_q to module_wq and use it for both purposes (loading and
> unloading).
> 
> As any state change of the backend might be intended to stop waiting
> do the wake_up_all() in any case when netback_changed() is called.
> 
> Fixes: 822fb18a82aba ("xen-netfront: wait xenbus state change when load module manually")
> Cc: <stable@vger.kernel.org> #4.18
> Signed-off-by: Juergen Gross <jgross@suse.com>

Applied.

^ permalink raw reply

* Re: [PATCH 3/4] of: Convert to using %pOFn instead of device_node.name
From: Joe Perches @ 2018-09-08  0:30 UTC (permalink / raw)
  To: Thierry Reding, Rob Herring
  Cc: Frank Rowand, devicetree, linux-kernel, Andrew Lunn,
	Florian Fainelli, netdev
In-Reply-To: <20180907122928.GA5821@ulmo>

On Fri, 2018-09-07 at 14:29 +0200, Thierry Reding wrote:
> On Tue, Aug 28, 2018 at 10:52:53AM -0500, Rob Herring wrote:
> > In preparation to remove the node name pointer from struct device_node,
> > convert printf users to use the %pOFn format specifier.
> > 
> > Cc: Frank Rowand <frowand.list@gmail.com>
> > Cc: Andrew Lunn <andrew@lunn.ch>
> > Cc: Florian Fainelli <f.fainelli@gmail.com>
> > Cc: devicetree@vger.kernel.org
> > Cc: netdev@vger.kernel.org
> > Signed-off-by: Rob Herring <robh@kernel.org>
> > ---
> >  drivers/of/device.c   |  4 ++--
> >  drivers/of/of_mdio.c  | 12 ++++++------
> >  drivers/of/of_numa.c  |  4 ++--
> >  drivers/of/overlay.c  |  4 ++--
> >  drivers/of/platform.c |  8 ++++----
> >  drivers/of/unittest.c | 12 ++++++------
> >  6 files changed, 22 insertions(+), 22 deletions(-)
> > 
> > diff --git a/drivers/of/device.c b/drivers/of/device.c
> > index 5957cd4fa262..daa075d87317 100644
> > --- a/drivers/of/device.c
> > +++ b/drivers/of/device.c
> > @@ -219,7 +219,7 @@ static ssize_t of_device_get_modalias(struct device *dev, char *str, ssize_t len
> >  		return -ENODEV;
> >  
> >  	/* Name & Type */
> > -	csize = snprintf(str, len, "of:N%sT%s", dev->of_node->name,
> > +	csize = snprintf(str, len, "of:N%pOFnT%s", dev->of_node,
> >  			 dev->of_node->type);
> >  	tsize = csize;
> >  	len -= csize;
> 
> This seems to cause the modalias to be improperly constructed. As a
> consequence, automatic module loading at boot time is now broken. I
> think the reason why this fails is because vsnprintf() will skip all
> alpha-numeric characters after a call to pointer(). Presumably this
> is meant to be a generic way of skipping whatever specifiers we throw
> at it.
> 
> Unfortunately for the case of OF modaliases, this means that the 'T'
> character gets eaten, so we end up with something like this:
> 
> 	# udevadm info /sys/bus/platform/devices/54200000.dc
> 	[...]
> 	E: MODALIAS=of:Ndc<NULL>Cnvidia,tegra124-dc
> 	[...]
> 
> instead of this:
> 
> 	# udevadm info /sys/bus/platform/devices/54200000.dc
> 	[...]
> 	E: MODALIAS=of:NdcT<NULL>Cnvidia,tegra124-dc
> 	[...]
> 
> Everything is back to normal if I revert this patch. However, since
> that's obviously not what we want, I think perhaps what we need is a
> way for pointer() (and its implementations) to report back how many
> characters in the format string it consumed so that we can support
> these kinds of back-to-back strings.
> 
> If nobody else has the time I can look into coding up a fix, but in the
> meantime it might be best to back this one out until we can handle the
> OF modalias format string.

Or just use 2 consecutive snprintf calls

	csize = snprintf(str, len, "of:N%pOFn", dev->of_node);
	csize += snprintf(str + csize, len - csize, "T%s",
			  dev->of_node->type);

^ permalink raw reply

* Re: Containers and checkpoint/restart micro-conference at LPC2018
From: Stéphane Graber @ 2018-09-08  4:59 UTC (permalink / raw)
  To: lxc-devel-cunTk1MwBs9qMoObBWhMNEqPaTDuhLve2LY78lusg7I,
	lxc-users-cunTk1MwBs9qMoObBWhMNEqPaTDuhLve2LY78lusg7I,
	containers-cunTk1MwBs98uUxBSJOaYoYkZiVZrdSR2LY78lusg7I,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20180813161015.GD18492@castiana>


[-- Attachment #1.1: Type: text/plain, Size: 1465 bytes --]

On Mon, Aug 13, 2018 at 12:10:15PM -0400, Stéphane Graber wrote:
> Hello,
> 
> This year's edition of the Linux Plumbers Conference will once again
> have a containers micro-conference but this time around we'll have twice
> the usual amount of time and will include the content that would
> traditionally go into the checkpoint/restore micro-conference.
> 
> LPC2018 will be held in Vancouver, Canada from the 13th to the 15th of
> November, co-located with the Linux Kernel Summit.
> 
> 
> We're looking for discussion topics around kernel work related to
> containers and namespacing, resource control, access control,
> checkpoint/restore of kernel structures, filesystem/mount handling for
> containers and any related userspace work.
> 
> 
> The format of the event will mostly be discussions where someone
> introduces a given topic/problem and it then gets discussed for 20-30min
> before moving on to something else. There will also be limited room for
> short demos of recent work with shorter 15min slots.
> 
> 
> Details can be found here:
> 
>   https://discuss.linuxcontainers.org/t/containers-micro-conference-at-linux-plumbers-2018/2417
> 
> 
> Looking forward to seeing you in Vancouver!

Hello,

We've added an extra week to the CFP, new deadline is Friday 14th of September.

If you were thinking about sending something bug then forgot or just
missed the deadline, now is your chance to send it!

Stéphane

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

[-- Attachment #2: Type: text/plain, Size: 159 bytes --]

_______________________________________________
lxc-devel mailing list
lxc-devel@lists.linuxcontainers.org
http://lists.linuxcontainers.org/listinfo/lxc-devel

^ permalink raw reply

* [bpf-next, v2 3/3] selftests/bpf: test bpf flow dissection
From: Petar Penkov @ 2018-09-08  0:11 UTC (permalink / raw)
  To: netdev
  Cc: davem, ast, daniel, simon.horman, ecree, songliubraving, tom,
	Petar Penkov, Willem de Bruijn
In-Reply-To: <20180908001110.200828-1-peterpenkov96@gmail.com>

From: Petar Penkov <ppenkov@google.com>

Adds a test that sends different types of packets over multiple
tunnels and verifies that valid packets are dissected correctly.  To do
so, a tc-flower rule is added to drop packets on UDP src port 9, and
packets are sent from ports 8, 9, and 10. Only the packets on port 9
should be dropped. Because tc-flower relies on the flow dissector to
match flows, correct classification demonstrates correct dissection.

Also add support logic to load the BPF program and to inject the test
packets.

Signed-off-by: Petar Penkov <ppenkov@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 tools/testing/selftests/bpf/.gitignore        |   2 +
 tools/testing/selftests/bpf/Makefile          |   6 +-
 tools/testing/selftests/bpf/config            |   1 +
 .../selftests/bpf/flow_dissector_load.c       | 140 ++++
 .../selftests/bpf/test_flow_dissector.c       | 782 ++++++++++++++++++
 .../selftests/bpf/test_flow_dissector.sh      | 115 +++
 tools/testing/selftests/bpf/with_addr.sh      |  54 ++
 tools/testing/selftests/bpf/with_tunnels.sh   |  36 +
 8 files changed, 1134 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/flow_dissector_load.c
 create mode 100644 tools/testing/selftests/bpf/test_flow_dissector.c
 create mode 100755 tools/testing/selftests/bpf/test_flow_dissector.sh
 create mode 100755 tools/testing/selftests/bpf/with_addr.sh
 create mode 100755 tools/testing/selftests/bpf/with_tunnels.sh

diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
index 4d789c1e5167..8a60c9b9892d 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -23,3 +23,5 @@ test_skb_cgroup_id_user
 test_socket_cookie
 test_cgroup_storage
 test_select_reuseport
+test_flow_dissector
+flow_dissector_load
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index e65f50f9185e..fd3851d5c079 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -47,10 +47,12 @@ TEST_PROGS := test_kmod.sh \
 	test_tunnel.sh \
 	test_lwt_seg6local.sh \
 	test_lirc_mode2.sh \
-	test_skb_cgroup_id.sh
+	test_skb_cgroup_id.sh \
+	test_flow_dissector.sh
 
 # Compile but not part of 'make run_tests'
-TEST_GEN_PROGS_EXTENDED = test_libbpf_open test_sock_addr test_skb_cgroup_id_user
+TEST_GEN_PROGS_EXTENDED = test_libbpf_open test_sock_addr test_skb_cgroup_id_user \
+	flow_dissector_load test_flow_dissector
 
 include ../lib.mk
 
diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index b4994a94968b..3655508f95fd 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -18,3 +18,4 @@ CONFIG_CRYPTO_HMAC=m
 CONFIG_CRYPTO_SHA256=m
 CONFIG_VXLAN=y
 CONFIG_GENEVE=y
+CONFIG_NET_CLS_FLOWER=m
diff --git a/tools/testing/selftests/bpf/flow_dissector_load.c b/tools/testing/selftests/bpf/flow_dissector_load.c
new file mode 100644
index 000000000000..d3273b5b3173
--- /dev/null
+++ b/tools/testing/selftests/bpf/flow_dissector_load.c
@@ -0,0 +1,140 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <error.h>
+#include <errno.h>
+#include <getopt.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+const char *cfg_pin_path = "/sys/fs/bpf/flow_dissector";
+const char *cfg_map_name = "jmp_table";
+bool cfg_attach = true;
+char *cfg_section_name;
+char *cfg_path_name;
+
+static void load_and_attach_program(void)
+{
+	struct bpf_program *prog, *main_prog;
+	struct bpf_map *prog_array;
+	int i, fd, prog_fd, ret;
+	struct bpf_object *obj;
+	int prog_array_fd;
+
+	ret = bpf_prog_load(cfg_path_name, BPF_PROG_TYPE_FLOW_DISSECTOR, &obj,
+			    &prog_fd);
+	if (ret)
+		error(1, 0, "bpf_prog_load %s", cfg_path_name);
+
+	main_prog = bpf_object__find_program_by_title(obj, cfg_section_name);
+	if (!main_prog)
+		error(1, 0, "bpf_object__find_program_by_title %s",
+		      cfg_section_name);
+
+	prog_fd = bpf_program__fd(main_prog);
+	if (prog_fd < 0)
+		error(1, 0, "bpf_program__fd");
+
+	prog_array = bpf_object__find_map_by_name(obj, cfg_map_name);
+	if (!prog_array)
+		error(1, 0, "bpf_object__find_map_by_name %s", cfg_map_name);
+
+	prog_array_fd = bpf_map__fd(prog_array);
+	if (prog_array_fd < 0)
+		error(1, 0, "bpf_map__fd %s", cfg_map_name);
+
+	i = 0;
+	bpf_object__for_each_program(prog, obj) {
+		fd = bpf_program__fd(prog);
+		if (fd < 0)
+			error(1, 0, "bpf_program__fd");
+
+		if (fd != prog_fd) {
+			printf("%d: %s\n", i, bpf_program__title(prog, false));
+			bpf_map_update_elem(prog_array_fd, &i, &fd, BPF_ANY);
+			++i;
+		}
+	}
+
+	ret = bpf_prog_attach(prog_fd, 0 /* Ignore */, BPF_FLOW_DISSECTOR, 0);
+	if (ret)
+		error(1, 0, "bpf_prog_attach %s", cfg_path_name);
+
+	ret = bpf_object__pin(obj, cfg_pin_path);
+	if (ret)
+		error(1, 0, "bpf_object__pin %s", cfg_pin_path);
+
+}
+
+static void detach_program(void)
+{
+	char command[64];
+	int ret;
+
+	ret = bpf_prog_detach(0, BPF_FLOW_DISSECTOR);
+	if (ret)
+		error(1, 0, "bpf_prog_detach");
+
+	/* To unpin, it is necessary and sufficient to just remove this dir */
+	sprintf(command, "rm -r %s", cfg_pin_path);
+	ret = system(command);
+	if (ret)
+		error(1, errno, command);
+}
+
+static void parse_opts(int argc, char **argv)
+{
+	bool attach = false;
+	bool detach = false;
+	int c;
+
+	while ((c = getopt(argc, argv, "adp:s:")) != -1) {
+		switch (c) {
+		case 'a':
+			if (detach)
+				error(1, 0, "attach/detach are exclusive");
+			attach = true;
+			break;
+		case 'd':
+			if (attach)
+				error(1, 0, "attach/detach are exclusive");
+			detach = true;
+			break;
+		case 'p':
+			if (cfg_path_name)
+				error(1, 0, "only one prog name can be given");
+
+			cfg_path_name = optarg;
+			break;
+		case 's':
+			if (cfg_section_name)
+				error(1, 0, "only one section can be given");
+
+			cfg_section_name = optarg;
+			break;
+		}
+	}
+
+	if (detach)
+		cfg_attach = false;
+
+	if (cfg_attach && !cfg_path_name)
+		error(1, 0, "must provide a path to the BPF program");
+
+	if (cfg_attach && !cfg_section_name)
+		error(1, 0, "must provide a section name");
+}
+
+int main(int argc, char **argv)
+{
+	parse_opts(argc, argv);
+	if (cfg_attach)
+		load_and_attach_program();
+	else
+		detach_program();
+	return 0;
+}
diff --git a/tools/testing/selftests/bpf/test_flow_dissector.c b/tools/testing/selftests/bpf/test_flow_dissector.c
new file mode 100644
index 000000000000..12b784afba31
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_flow_dissector.c
@@ -0,0 +1,782 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Inject packets with all sorts of encapsulation into the kernel.
+ *
+ * IPv4/IPv6	outer layer 3
+ * GRE/GUE/BARE outer layer 4, where bare is IPIP/SIT/IPv4-in-IPv6/..
+ * IPv4/IPv6    inner layer 3
+ */
+
+#define _GNU_SOURCE
+
+#include <stddef.h>
+#include <arpa/inet.h>
+#include <asm/byteorder.h>
+#include <error.h>
+#include <errno.h>
+#include <linux/if_packet.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <linux/ipv6.h>
+#include <netinet/ip.h>
+#include <netinet/in.h>
+#include <netinet/udp.h>
+#include <poll.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/socket.h>
+#include <sys/stat.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#define CFG_PORT_INNER	8000
+
+/* Add some protocol definitions that do not exist in userspace */
+
+struct grehdr {
+	uint16_t unused;
+	uint16_t protocol;
+} __attribute__((packed));
+
+struct guehdr {
+	union {
+		struct {
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+			__u8	hlen:5,
+				control:1,
+				version:2;
+#elif defined (__BIG_ENDIAN_BITFIELD)
+			__u8	version:2,
+				control:1,
+				hlen:5;
+#else
+#error  "Please fix <asm/byteorder.h>"
+#endif
+			__u8	proto_ctype;
+			__be16	flags;
+		};
+		__be32	word;
+	};
+};
+
+static uint8_t	cfg_dsfield_inner;
+static uint8_t	cfg_dsfield_outer;
+static uint8_t	cfg_encap_proto;
+static bool	cfg_expect_failure = false;
+static int	cfg_l3_extra = AF_UNSPEC;	/* optional SIT prefix */
+static int	cfg_l3_inner = AF_UNSPEC;
+static int	cfg_l3_outer = AF_UNSPEC;
+static int	cfg_num_pkt = 10;
+static int	cfg_num_secs = 0;
+static char	cfg_payload_char = 'a';
+static int	cfg_payload_len = 100;
+static int	cfg_port_gue = 6080;
+static bool	cfg_only_rx;
+static bool	cfg_only_tx;
+static int	cfg_src_port = 9;
+
+static char	buf[ETH_DATA_LEN];
+
+#define INIT_ADDR4(name, addr4, port)				\
+	static struct sockaddr_in name = {			\
+		.sin_family = AF_INET,				\
+		.sin_port = __constant_htons(port),		\
+		.sin_addr.s_addr = __constant_htonl(addr4),	\
+	};
+
+#define INIT_ADDR6(name, addr6, port)				\
+	static struct sockaddr_in6 name = {			\
+		.sin6_family = AF_INET6,			\
+		.sin6_port = __constant_htons(port),		\
+		.sin6_addr = addr6,				\
+	};
+
+INIT_ADDR4(in_daddr4, INADDR_LOOPBACK, CFG_PORT_INNER)
+INIT_ADDR4(in_saddr4, INADDR_LOOPBACK + 2, 0)
+INIT_ADDR4(out_daddr4, INADDR_LOOPBACK, 0)
+INIT_ADDR4(out_saddr4, INADDR_LOOPBACK + 1, 0)
+INIT_ADDR4(extra_daddr4, INADDR_LOOPBACK, 0)
+INIT_ADDR4(extra_saddr4, INADDR_LOOPBACK + 1, 0)
+
+INIT_ADDR6(in_daddr6, IN6ADDR_LOOPBACK_INIT, CFG_PORT_INNER)
+INIT_ADDR6(in_saddr6, IN6ADDR_LOOPBACK_INIT, 0)
+INIT_ADDR6(out_daddr6, IN6ADDR_LOOPBACK_INIT, 0)
+INIT_ADDR6(out_saddr6, IN6ADDR_LOOPBACK_INIT, 0)
+INIT_ADDR6(extra_daddr6, IN6ADDR_LOOPBACK_INIT, 0)
+INIT_ADDR6(extra_saddr6, IN6ADDR_LOOPBACK_INIT, 0)
+
+static unsigned long util_gettime(void)
+{
+	struct timeval tv;
+
+	gettimeofday(&tv, NULL);
+	return (tv.tv_sec * 1000) + (tv.tv_usec / 1000);
+}
+
+static void util_printaddr(const char *msg, struct sockaddr *addr)
+{
+	unsigned long off = 0;
+	char nbuf[INET6_ADDRSTRLEN];
+
+	switch (addr->sa_family) {
+	case PF_INET:
+		off = __builtin_offsetof(struct sockaddr_in, sin_addr);
+		break;
+	case PF_INET6:
+		off = __builtin_offsetof(struct sockaddr_in6, sin6_addr);
+		break;
+	default:
+		error(1, 0, "printaddr: unsupported family %u\n",
+		      addr->sa_family);
+	}
+
+	if (!inet_ntop(addr->sa_family, ((void *) addr) + off, nbuf,
+		       sizeof(nbuf)))
+		error(1, errno, "inet_ntop");
+
+	fprintf(stderr, "%s: %s\n", msg, nbuf);
+}
+
+static unsigned long add_csum_hword(const uint16_t *start, int num_u16)
+{
+	unsigned long sum = 0;
+	int i;
+
+	for (i = 0; i < num_u16; i++)
+		sum += start[i];
+
+	return sum;
+}
+
+static uint16_t build_ip_csum(const uint16_t *start, int num_u16,
+			      unsigned long sum)
+{
+	sum += add_csum_hword(start, num_u16);
+
+	while (sum >> 16)
+		sum = (sum & 0xffff) + (sum >> 16);
+
+	return ~sum;
+}
+
+static void build_ipv4_header(void *header, uint8_t proto,
+			      uint32_t src, uint32_t dst,
+			      int payload_len, uint8_t tos)
+{
+	struct iphdr *iph = header;
+
+	iph->ihl = 5;
+	iph->version = 4;
+	iph->tos = tos;
+	iph->ttl = 8;
+	iph->tot_len = htons(sizeof(*iph) + payload_len);
+	iph->id = htons(1337);
+	iph->protocol = proto;
+	iph->saddr = src;
+	iph->daddr = dst;
+	iph->check = build_ip_csum((void *) iph, iph->ihl << 1, 0);
+}
+
+static void ipv6_set_dsfield(struct ipv6hdr *ip6h, uint8_t dsfield)
+{
+	uint16_t val, *ptr = (uint16_t *)ip6h;
+
+	val = ntohs(*ptr);
+	val &= 0xF00F;
+	val |= ((uint16_t) dsfield) << 4;
+	*ptr = htons(val);
+}
+
+static void build_ipv6_header(void *header, uint8_t proto,
+			      struct sockaddr_in6 *src,
+			      struct sockaddr_in6 *dst,
+			      int payload_len, uint8_t dsfield)
+{
+	struct ipv6hdr *ip6h = header;
+
+	ip6h->version = 6;
+	ip6h->payload_len = htons(payload_len);
+	ip6h->nexthdr = proto;
+	ip6h->hop_limit = 8;
+	ipv6_set_dsfield(ip6h, dsfield);
+
+	memcpy(&ip6h->saddr, &src->sin6_addr, sizeof(ip6h->saddr));
+	memcpy(&ip6h->daddr, &dst->sin6_addr, sizeof(ip6h->daddr));
+}
+
+static uint16_t build_udp_v4_csum(const struct iphdr *iph,
+				  const struct udphdr *udph,
+				  int num_words)
+{
+	unsigned long pseudo_sum;
+	int num_u16 = sizeof(iph->saddr);	/* halfwords: twice byte len */
+
+	pseudo_sum = add_csum_hword((void *) &iph->saddr, num_u16);
+	pseudo_sum += htons(IPPROTO_UDP);
+	pseudo_sum += udph->len;
+	return build_ip_csum((void *) udph, num_words, pseudo_sum);
+}
+
+static uint16_t build_udp_v6_csum(const struct ipv6hdr *ip6h,
+				  const struct udphdr *udph,
+				  int num_words)
+{
+	unsigned long pseudo_sum;
+	int num_u16 = sizeof(ip6h->saddr);	/* halfwords: twice byte len */
+
+	pseudo_sum = add_csum_hword((void *) &ip6h->saddr, num_u16);
+	pseudo_sum += htons(ip6h->nexthdr);
+	pseudo_sum += ip6h->payload_len;
+	return build_ip_csum((void *) udph, num_words, pseudo_sum);
+}
+
+static void build_udp_header(void *header, int payload_len,
+			     uint16_t dport, int family)
+{
+	struct udphdr *udph = header;
+	int len = sizeof(*udph) + payload_len;
+
+	udph->source = htons(cfg_src_port);
+	udph->dest = htons(dport);
+	udph->len = htons(len);
+	udph->check = 0;
+	if (family == AF_INET)
+		udph->check = build_udp_v4_csum(header - sizeof(struct iphdr),
+						udph, len >> 1);
+	else
+		udph->check = build_udp_v6_csum(header - sizeof(struct ipv6hdr),
+						udph, len >> 1);
+}
+
+static void build_gue_header(void *header, uint8_t proto)
+{
+	struct guehdr *gueh = header;
+
+	gueh->proto_ctype = proto;
+}
+
+static void build_gre_header(void *header, uint16_t proto)
+{
+	struct grehdr *greh = header;
+
+	greh->protocol = htons(proto);
+}
+
+static int l3_length(int family)
+{
+	if (family == AF_INET)
+		return sizeof(struct iphdr);
+	else
+		return sizeof(struct ipv6hdr);
+}
+
+static int build_packet(void)
+{
+	int ol3_len = 0, ol4_len = 0, il3_len = 0, il4_len = 0;
+	int el3_len = 0;
+
+	if (cfg_l3_extra)
+		el3_len = l3_length(cfg_l3_extra);
+
+	/* calculate header offsets */
+	if (cfg_encap_proto) {
+		ol3_len = l3_length(cfg_l3_outer);
+
+		if (cfg_encap_proto == IPPROTO_GRE)
+			ol4_len = sizeof(struct grehdr);
+		else if (cfg_encap_proto == IPPROTO_UDP)
+			ol4_len = sizeof(struct udphdr) + sizeof(struct guehdr);
+	}
+
+	il3_len = l3_length(cfg_l3_inner);
+	il4_len = sizeof(struct udphdr);
+
+	if (el3_len + ol3_len + ol4_len + il3_len + il4_len + cfg_payload_len >=
+	    sizeof(buf))
+		error(1, 0, "packet too large\n");
+
+	/*
+	 * Fill packet from inside out, to calculate correct checksums.
+	 * But create ip before udp headers, as udp uses ip for pseudo-sum.
+	 */
+	memset(buf + el3_len + ol3_len + ol4_len + il3_len + il4_len,
+	       cfg_payload_char, cfg_payload_len);
+
+	/* add zero byte for udp csum padding */
+	buf[el3_len + ol3_len + ol4_len + il3_len + il4_len + cfg_payload_len] = 0;
+
+	switch (cfg_l3_inner) {
+	case PF_INET:
+		build_ipv4_header(buf + el3_len + ol3_len + ol4_len,
+				  IPPROTO_UDP,
+				  in_saddr4.sin_addr.s_addr,
+				  in_daddr4.sin_addr.s_addr,
+				  il4_len + cfg_payload_len,
+				  cfg_dsfield_inner);
+		break;
+	case PF_INET6:
+		build_ipv6_header(buf + el3_len + ol3_len + ol4_len,
+				  IPPROTO_UDP,
+				  &in_saddr6, &in_daddr6,
+				  il4_len + cfg_payload_len,
+				  cfg_dsfield_inner);
+		break;
+	}
+
+	build_udp_header(buf + el3_len + ol3_len + ol4_len + il3_len,
+			 cfg_payload_len, CFG_PORT_INNER, cfg_l3_inner);
+
+	if (!cfg_encap_proto)
+		return il3_len + il4_len + cfg_payload_len;
+
+	switch (cfg_l3_outer) {
+	case PF_INET:
+		build_ipv4_header(buf + el3_len, cfg_encap_proto,
+				  out_saddr4.sin_addr.s_addr,
+				  out_daddr4.sin_addr.s_addr,
+				  ol4_len + il3_len + il4_len + cfg_payload_len,
+				  cfg_dsfield_outer);
+		break;
+	case PF_INET6:
+		build_ipv6_header(buf + el3_len, cfg_encap_proto,
+				  &out_saddr6, &out_daddr6,
+				  ol4_len + il3_len + il4_len + cfg_payload_len,
+				  cfg_dsfield_outer);
+		break;
+	}
+
+	switch (cfg_encap_proto) {
+	case IPPROTO_UDP:
+		build_gue_header(buf + el3_len + ol3_len + ol4_len -
+				 sizeof(struct guehdr),
+				 cfg_l3_inner == PF_INET ? IPPROTO_IPIP
+							 : IPPROTO_IPV6);
+		build_udp_header(buf + el3_len + ol3_len,
+				 sizeof(struct guehdr) + il3_len + il4_len +
+				 cfg_payload_len,
+				 cfg_port_gue, cfg_l3_outer);
+		break;
+	case IPPROTO_GRE:
+		build_gre_header(buf + el3_len + ol3_len,
+				 cfg_l3_inner == PF_INET ? ETH_P_IP
+							 : ETH_P_IPV6);
+		break;
+	}
+
+	switch (cfg_l3_extra) {
+	case PF_INET:
+		build_ipv4_header(buf,
+				  cfg_l3_outer == PF_INET ? IPPROTO_IPIP
+							  : IPPROTO_IPV6,
+				  extra_saddr4.sin_addr.s_addr,
+				  extra_daddr4.sin_addr.s_addr,
+				  ol3_len + ol4_len + il3_len + il4_len +
+				  cfg_payload_len, 0);
+		break;
+	case PF_INET6:
+		build_ipv6_header(buf,
+				  cfg_l3_outer == PF_INET ? IPPROTO_IPIP
+							  : IPPROTO_IPV6,
+				  &extra_saddr6, &extra_daddr6,
+				  ol3_len + ol4_len + il3_len + il4_len +
+				  cfg_payload_len, 0);
+		break;
+	}
+
+	return el3_len + ol3_len + ol4_len + il3_len + il4_len +
+	       cfg_payload_len;
+}
+
+/* sender transmits encapsulated over RAW or unencap'd over UDP */
+static int setup_tx(void)
+{
+	int family, fd, ret;
+
+	if (cfg_l3_extra)
+		family = cfg_l3_extra;
+	else if (cfg_l3_outer)
+		family = cfg_l3_outer;
+	else
+		family = cfg_l3_inner;
+
+	fd = socket(family, SOCK_RAW, IPPROTO_RAW);
+	if (fd == -1)
+		error(1, errno, "socket tx");
+
+	if (cfg_l3_extra) {
+		if (cfg_l3_extra == PF_INET)
+			ret = connect(fd, (void *) &extra_daddr4,
+				      sizeof(extra_daddr4));
+		else
+			ret = connect(fd, (void *) &extra_daddr6,
+				      sizeof(extra_daddr6));
+		if (ret)
+			error(1, errno, "connect tx");
+	} else if (cfg_l3_outer) {
+		/* connect to destination if not encapsulated */
+		if (cfg_l3_outer == PF_INET)
+			ret = connect(fd, (void *) &out_daddr4,
+				      sizeof(out_daddr4));
+		else
+			ret = connect(fd, (void *) &out_daddr6,
+				      sizeof(out_daddr6));
+		if (ret)
+			error(1, errno, "connect tx");
+	} else {
+		/* otherwise using loopback */
+		if (cfg_l3_inner == PF_INET)
+			ret = connect(fd, (void *) &in_daddr4,
+				      sizeof(in_daddr4));
+		else
+			ret = connect(fd, (void *) &in_daddr6,
+				      sizeof(in_daddr6));
+		if (ret)
+			error(1, errno, "connect tx");
+	}
+
+	return fd;
+}
+
+/* receiver reads unencapsulated UDP */
+static int setup_rx(void)
+{
+	int fd, ret;
+
+	fd = socket(cfg_l3_inner, SOCK_DGRAM, 0);
+	if (fd == -1)
+		error(1, errno, "socket rx");
+
+	if (cfg_l3_inner == PF_INET)
+		ret = bind(fd, (void *) &in_daddr4, sizeof(in_daddr4));
+	else
+		ret = bind(fd, (void *) &in_daddr6, sizeof(in_daddr6));
+	if (ret)
+		error(1, errno, "bind rx");
+
+	return fd;
+}
+
+static int do_tx(int fd, const char *pkt, int len)
+{
+	int ret;
+
+	ret = write(fd, pkt, len);
+	if (ret == -1)
+		error(1, errno, "send");
+	if (ret != len)
+		error(1, errno, "send: len (%d < %d)\n", ret, len);
+
+	return 1;
+}
+
+static int do_poll(int fd, short events, int timeout)
+{
+	struct pollfd pfd;
+	int ret;
+
+	pfd.fd = fd;
+	pfd.events = events;
+
+	ret = poll(&pfd, 1, timeout);
+	if (ret == -1)
+		error(1, errno, "poll");
+	if (ret && !(pfd.revents & POLLIN))
+		error(1, errno, "poll: unexpected event 0x%x\n", pfd.revents);
+
+	return ret;
+}
+
+static int do_rx(int fd)
+{
+	char rbuf;
+	int ret, num = 0;
+
+	while (1) {
+		ret = recv(fd, &rbuf, 1, MSG_DONTWAIT);
+		if (ret == -1 && errno == EAGAIN)
+			break;
+		if (ret == -1)
+			error(1, errno, "recv");
+		if (rbuf != cfg_payload_char)
+			error(1, 0, "recv: payload mismatch");
+		num++;
+	};
+
+	return num;
+}
+
+static int do_main(void)
+{
+	unsigned long tstop, treport, tcur;
+	int fdt = -1, fdr = -1, len, tx = 0, rx = 0;
+
+	if (!cfg_only_tx)
+		fdr = setup_rx();
+	if (!cfg_only_rx)
+		fdt = setup_tx();
+
+	len = build_packet();
+
+	tcur = util_gettime();
+	treport = tcur + 1000;
+	tstop = tcur + (cfg_num_secs * 1000);
+
+	while (1) {
+		if (!cfg_only_rx)
+			tx += do_tx(fdt, buf, len);
+
+		if (!cfg_only_tx)
+			rx += do_rx(fdr);
+
+		if (cfg_num_secs) {
+			tcur = util_gettime();
+			if (tcur >= tstop)
+				break;
+			if (tcur >= treport) {
+				fprintf(stderr, "pkts: tx=%u rx=%u\n", tx, rx);
+				tx = 0;
+				rx = 0;
+				treport = tcur + 1000;
+			}
+		} else {
+			if (tx == cfg_num_pkt)
+				break;
+		}
+	}
+
+	/* read straggler packets, if any */
+	if (rx < tx) {
+		tstop = util_gettime() + 100;
+		while (rx < tx) {
+			tcur = util_gettime();
+			if (tcur >= tstop)
+				break;
+
+			do_poll(fdr, POLLIN, tstop - tcur);
+			rx += do_rx(fdr);
+		}
+	}
+
+	fprintf(stderr, "pkts: tx=%u rx=%u\n", tx, rx);
+
+	if (fdr != -1 && close(fdr))
+		error(1, errno, "close rx");
+	if (fdt != -1 && close(fdt))
+		error(1, errno, "close tx");
+
+	/*
+	 * success (== 0) only if received all packets
+	 * unless failure is expected, in which case none must arrive.
+	 */
+	if (cfg_expect_failure)
+		return rx != 0;
+	else
+		return rx != tx;
+}
+
+
+static void __attribute__((noreturn)) usage(const char *filepath)
+{
+	fprintf(stderr, "Usage: %s [-e gre|gue|bare|none] [-i 4|6] [-l len] "
+			"[-O 4|6] [-o 4|6] [-n num] [-t secs] [-R] [-T] "
+			"[-s <osrc> [-d <odst>] [-S <isrc>] [-D <idst>] "
+			"[-x <otos>] [-X <itos>] [-f <isport>] [-F]\n",
+		filepath);
+	exit(1);
+}
+
+static void parse_addr(int family, void *addr, const char *optarg)
+{
+	int ret;
+
+	ret = inet_pton(family, optarg, addr);
+	if (ret == -1)
+		error(1, errno, "inet_pton");
+	if (ret == 0)
+		error(1, 0, "inet_pton: bad string");
+}
+
+static void parse_addr4(struct sockaddr_in *addr, const char *optarg)
+{
+	parse_addr(AF_INET, &addr->sin_addr, optarg);
+}
+
+static void parse_addr6(struct sockaddr_in6 *addr, const char *optarg)
+{
+	parse_addr(AF_INET6, &addr->sin6_addr, optarg);
+}
+
+static int parse_protocol_family(const char *filepath, const char *optarg)
+{
+	if (!strcmp(optarg, "4"))
+		return PF_INET;
+	if (!strcmp(optarg, "6"))
+		return PF_INET6;
+
+	usage(filepath);
+}
+
+static void parse_opts(int argc, char **argv)
+{
+	int c;
+
+	while ((c = getopt(argc, argv, "d:D:e:f:Fhi:l:n:o:O:Rs:S:t:Tx:X:")) != -1) {
+		switch (c) {
+		case 'd':
+			if (cfg_l3_outer == AF_UNSPEC)
+				error(1, 0, "-d must be preceded by -o");
+			if (cfg_l3_outer == AF_INET)
+				parse_addr4(&out_daddr4, optarg);
+			else
+				parse_addr6(&out_daddr6, optarg);
+			break;
+		case 'D':
+			if (cfg_l3_inner == AF_UNSPEC)
+				error(1, 0, "-D must be preceded by -i");
+			if (cfg_l3_inner == AF_INET)
+				parse_addr4(&in_daddr4, optarg);
+			else
+				parse_addr6(&in_daddr6, optarg);
+			break;
+		case 'e':
+			if (!strcmp(optarg, "gre"))
+				cfg_encap_proto = IPPROTO_GRE;
+			else if (!strcmp(optarg, "gue"))
+				cfg_encap_proto = IPPROTO_UDP;
+			else if (!strcmp(optarg, "bare"))
+				cfg_encap_proto = IPPROTO_IPIP;
+			else if (!strcmp(optarg, "none"))
+				cfg_encap_proto = IPPROTO_IP;	/* == 0 */
+			else
+				usage(argv[0]);
+			break;
+		case 'f':
+			cfg_src_port = strtol(optarg, NULL, 0);
+			break;
+		case 'F':
+			cfg_expect_failure = true;
+			break;
+		case 'h':
+			usage(argv[0]);
+			break;
+		case 'i':
+			if (!strcmp(optarg, "4"))
+				cfg_l3_inner = PF_INET;
+			else if (!strcmp(optarg, "6"))
+				cfg_l3_inner = PF_INET6;
+			else
+				usage(argv[0]);
+			break;
+		case 'l':
+			cfg_payload_len = strtol(optarg, NULL, 0);
+			break;
+		case 'n':
+			cfg_num_pkt = strtol(optarg, NULL, 0);
+			break;
+		case 'o':
+			cfg_l3_outer = parse_protocol_family(argv[0], optarg);
+			break;
+		case 'O':
+			cfg_l3_extra = parse_protocol_family(argv[0], optarg);
+			break;
+		case 'R':
+			cfg_only_rx = true;
+			break;
+		case 's':
+			if (cfg_l3_outer == AF_INET)
+				parse_addr4(&out_saddr4, optarg);
+			else
+				parse_addr6(&out_saddr6, optarg);
+			break;
+		case 'S':
+			if (cfg_l3_inner == AF_INET)
+				parse_addr4(&in_saddr4, optarg);
+			else
+				parse_addr6(&in_saddr6, optarg);
+			break;
+		case 't':
+			cfg_num_secs = strtol(optarg, NULL, 0);
+			break;
+		case 'T':
+			cfg_only_tx = true;
+			break;
+		case 'x':
+			cfg_dsfield_outer = strtol(optarg, NULL, 0);
+			break;
+		case 'X':
+			cfg_dsfield_inner = strtol(optarg, NULL, 0);
+			break;
+		}
+	}
+
+	if (cfg_only_rx && cfg_only_tx)
+		error(1, 0, "options: cannot combine rx-only and tx-only");
+
+	if (cfg_encap_proto && cfg_l3_outer == AF_UNSPEC)
+		error(1, 0, "options: must specify outer with encap");
+	else if ((!cfg_encap_proto) && cfg_l3_outer != AF_UNSPEC)
+		error(1, 0, "options: cannot combine no-encap and outer");
+	else if ((!cfg_encap_proto) && cfg_l3_extra != AF_UNSPEC)
+		error(1, 0, "options: cannot combine no-encap and extra");
+
+	if (cfg_l3_inner == AF_UNSPEC)
+		cfg_l3_inner = AF_INET6;
+	if (cfg_l3_inner == AF_INET6 && cfg_encap_proto == IPPROTO_IPIP)
+		cfg_encap_proto = IPPROTO_IPV6;
+
+	/* RFC 6040 4.2:
+	 *   on decap, if outer encountered congestion (CE == 0x3),
+	 *   but inner cannot encode ECN (NoECT == 0x0), then drop packet.
+	 */
+	if (((cfg_dsfield_outer & 0x3) == 0x3) &&
+	    ((cfg_dsfield_inner & 0x3) == 0x0))
+		cfg_expect_failure = true;
+}
+
+static void print_opts(void)
+{
+	if (cfg_l3_inner == PF_INET6) {
+		util_printaddr("inner.dest6", (void *) &in_daddr6);
+		util_printaddr("inner.source6", (void *) &in_saddr6);
+	} else {
+		util_printaddr("inner.dest4", (void *) &in_daddr4);
+		util_printaddr("inner.source4", (void *) &in_saddr4);
+	}
+
+	if (!cfg_l3_outer)
+		return;
+
+	fprintf(stderr, "encap proto:   %u\n", cfg_encap_proto);
+
+	if (cfg_l3_outer == PF_INET6) {
+		util_printaddr("outer.dest6", (void *) &out_daddr6);
+		util_printaddr("outer.source6", (void *) &out_saddr6);
+	} else {
+		util_printaddr("outer.dest4", (void *) &out_daddr4);
+		util_printaddr("outer.source4", (void *) &out_saddr4);
+	}
+
+	if (!cfg_l3_extra)
+		return;
+
+	if (cfg_l3_outer == PF_INET6) {
+		util_printaddr("extra.dest6", (void *) &extra_daddr6);
+		util_printaddr("extra.source6", (void *) &extra_saddr6);
+	} else {
+		util_printaddr("extra.dest4", (void *) &extra_daddr4);
+		util_printaddr("extra.source4", (void *) &extra_saddr4);
+	}
+
+}
+
+int main(int argc, char **argv)
+{
+	parse_opts(argc, argv);
+	print_opts();
+	return do_main();
+}
diff --git a/tools/testing/selftests/bpf/test_flow_dissector.sh b/tools/testing/selftests/bpf/test_flow_dissector.sh
new file mode 100755
index 000000000000..c0fb073b5eab
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_flow_dissector.sh
@@ -0,0 +1,115 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Load BPF flow dissector and verify it correctly dissects traffic
+export TESTNAME=test_flow_dissector
+unmount=0
+
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+
+msg="skip all tests:"
+if [ $UID != 0 ]; then
+	echo $msg please run this as root >&2
+	exit $ksft_skip
+fi
+
+# This test needs to be run in a network namespace with in_netns.sh. Check if
+# this is the case and run it with in_netns.sh if it is being run in the root
+# namespace.
+if [[ -z $(ip netns identify $$) ]]; then
+	../net/in_netns.sh "$0" "$@"
+	exit $?
+fi
+
+# Determine selftest success via shell exit code
+exit_handler()
+{
+	if (( $? == 0 )); then
+		echo "selftests: $TESTNAME [PASS]";
+	else
+		echo "selftests: $TESTNAME [FAILED]";
+	fi
+
+	set +e
+
+	# Cleanup
+	tc filter del dev lo ingress pref 1337 2> /dev/null
+	tc qdisc del dev lo ingress 2> /dev/null
+	./flow_dissector_load -d 2> /dev/null
+	if [ $unmount -ne 0 ]; then
+		umount bpffs 2> /dev/null
+	fi
+}
+
+# Exit script immediately (well catched by trap handler) if any
+# program/thing exits with a non-zero status.
+set -e
+
+# (Use 'trap -l' to list meaning of numbers)
+trap exit_handler 0 2 3 6 9
+
+# Mount BPF file system
+if /bin/mount | grep /sys/fs/bpf > /dev/null; then
+	echo "bpffs already mounted"
+else
+	echo "bpffs not mounted. Mounting..."
+	unmount=1
+	/bin/mount bpffs /sys/fs/bpf -t bpf
+fi
+
+# Attach BPF program
+./flow_dissector_load -p bpf_flow.o -s dissect
+
+# Setup
+tc qdisc add dev lo ingress
+
+echo "Testing IPv4..."
+# Drops all IP/UDP packets coming from port 9
+tc filter add dev lo parent ffff: protocol ip pref 1337 flower ip_proto \
+	udp src_port 9 action drop
+
+# Send 10 IPv4/UDP packets from port 8. Filter should not drop any.
+./test_flow_dissector -i 4 -f 8
+# Send 10 IPv4/UDP packets from port 9. Filter should drop all.
+./test_flow_dissector -i 4 -f 9 -F
+# Send 10 IPv4/UDP packets from port 10. Filter should not drop any.
+./test_flow_dissector -i 4 -f 10
+
+echo "Testing IPIP..."
+# Send 10 IPv4/IPv4/UDP packets from port 8. Filter should not drop any.
+./with_addr.sh ./with_tunnels.sh ./test_flow_dissector -o 4 -e bare -i 4 \
+	-D 192.168.0.1 -S 1.1.1.1 -f 8
+# Send 10 IPv4/IPv4/UDP packets from port 9. Filter should drop all.
+./with_addr.sh ./with_tunnels.sh ./test_flow_dissector -o 4 -e bare -i 4 \
+	-D 192.168.0.1 -S 1.1.1.1 -f 9 -F
+# Send 10 IPv4/IPv4/UDP packets from port 10. Filter should not drop any.
+./with_addr.sh ./with_tunnels.sh ./test_flow_dissector -o 4 -e bare -i 4 \
+	-D 192.168.0.1 -S 1.1.1.1 -f 10
+
+echo "Testing IPv4 + GRE..."
+# Send 10 IPv4/GRE/IPv4/UDP packets from port 8. Filter should not drop any.
+./with_addr.sh ./with_tunnels.sh ./test_flow_dissector -o 4 -e gre -i 4 \
+	-D 192.168.0.1 -S 1.1.1.1 -f 8
+# Send 10 IPv4/GRE/IPv4/UDP packets from port 9. Filter should drop all.
+./with_addr.sh ./with_tunnels.sh ./test_flow_dissector -o 4 -e gre -i 4 \
+	-D 192.168.0.1 -S 1.1.1.1 -f 9 -F
+# Send 10 IPv4/GRE/IPv4/UDP packets from port 10. Filter should not drop any.
+./with_addr.sh ./with_tunnels.sh ./test_flow_dissector -o 4 -e gre -i 4 \
+	-D 192.168.0.1 -S 1.1.1.1 -f 10
+
+tc filter del dev lo ingress pref 1337
+
+echo "Testing IPv6..."
+# Drops all IPv6/UDP packets coming from port 9
+tc filter add dev lo parent ffff: protocol ipv6 pref 1337 flower ip_proto \
+	udp src_port 9 action drop
+
+# Send 10 IPv6/UDP packets from port 8. Filter should not drop any.
+./test_flow_dissector -i 6 -f 8
+# Send 10 IPv6/UDP packets from port 9. Filter should drop all.
+./test_flow_dissector -i 6 -f 9 -F
+# Send 10 IPv6/UDP packets from port 10. Filter should not drop any.
+./test_flow_dissector -i 6 -f 10
+
+exit 0
diff --git a/tools/testing/selftests/bpf/with_addr.sh b/tools/testing/selftests/bpf/with_addr.sh
new file mode 100755
index 000000000000..ffcd3953f94c
--- /dev/null
+++ b/tools/testing/selftests/bpf/with_addr.sh
@@ -0,0 +1,54 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# add private ipv4 and ipv6 addresses to loopback
+
+readonly V6_INNER='100::a/128'
+readonly V4_INNER='192.168.0.1/32'
+
+if getopts ":s" opt; then
+  readonly SIT_DEV_NAME='sixtofourtest0'
+  readonly V6_SIT='2::/64'
+  readonly V4_SIT='172.17.0.1/32'
+  shift
+fi
+
+fail() {
+  echo "error: $*" 1>&2
+  exit 1
+}
+
+setup() {
+  ip -6 addr add "${V6_INNER}" dev lo || fail 'failed to setup v6 address'
+  ip -4 addr add "${V4_INNER}" dev lo || fail 'failed to setup v4 address'
+
+  if [[ -n "${V6_SIT}" ]]; then
+    ip link add "${SIT_DEV_NAME}" type sit remote any local any \
+	    || fail 'failed to add sit'
+    ip link set dev "${SIT_DEV_NAME}" up \
+	    || fail 'failed to bring sit device up'
+    ip -6 addr add "${V6_SIT}" dev "${SIT_DEV_NAME}" \
+	    || fail 'failed to setup v6 SIT address'
+    ip -4 addr add "${V4_SIT}" dev "${SIT_DEV_NAME}" \
+	    || fail 'failed to setup v4 SIT address'
+  fi
+
+  sleep 2	# avoid race causing bind to fail
+}
+
+cleanup() {
+  if [[ -n "${V6_SIT}" ]]; then
+    ip -4 addr del "${V4_SIT}" dev "${SIT_DEV_NAME}"
+    ip -6 addr del "${V6_SIT}" dev "${SIT_DEV_NAME}"
+    ip link del "${SIT_DEV_NAME}"
+  fi
+
+  ip -4 addr del "${V4_INNER}" dev lo
+  ip -6 addr del "${V6_INNER}" dev lo
+}
+
+trap cleanup EXIT
+
+setup
+"$@"
+exit "$?"
diff --git a/tools/testing/selftests/bpf/with_tunnels.sh b/tools/testing/selftests/bpf/with_tunnels.sh
new file mode 100755
index 000000000000..e24949ed3a20
--- /dev/null
+++ b/tools/testing/selftests/bpf/with_tunnels.sh
@@ -0,0 +1,36 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# setup tunnels for flow dissection test
+
+readonly SUFFIX="test_$(mktemp -u XXXX)"
+CONFIG="remote 127.0.0.2 local 127.0.0.1 dev lo"
+
+setup() {
+  ip link add "ipip_${SUFFIX}" type ipip ${CONFIG}
+  ip link add "gre_${SUFFIX}" type gre ${CONFIG}
+  ip link add "sit_${SUFFIX}" type sit ${CONFIG}
+
+  echo "tunnels before test:"
+  ip tunnel show
+
+  ip link set "ipip_${SUFFIX}" up
+  ip link set "gre_${SUFFIX}" up
+  ip link set "sit_${SUFFIX}" up
+}
+
+
+cleanup() {
+  ip tunnel del "ipip_${SUFFIX}"
+  ip tunnel del "gre_${SUFFIX}"
+  ip tunnel del "sit_${SUFFIX}"
+
+  echo "tunnels after test:"
+  ip tunnel show
+}
+
+trap cleanup EXIT
+
+setup
+"$@"
+exit "$?"
-- 
2.19.0.rc2.392.g5ba43deb5a-goog

^ permalink raw reply related

* [bpf-next, v2 2/3] flow_dissector: implements eBPF parser
From: Petar Penkov @ 2018-09-08  0:11 UTC (permalink / raw)
  To: netdev
  Cc: davem, ast, daniel, simon.horman, ecree, songliubraving, tom,
	Petar Penkov, Willem de Bruijn
In-Reply-To: <20180908001110.200828-1-peterpenkov96@gmail.com>

From: Petar Penkov <ppenkov@google.com>

This eBPF program extracts basic/control/ip address/ports keys from
incoming packets. It supports recursive parsing for IP encapsulation,
and VLAN, along with IPv4/IPv6 and extension headers.  This program is
meant to show how flow dissection and key extraction can be done in
eBPF.

Link: http://vger.kernel.org/netconf2017_files/rx_hardening_and_udp_gso.pdf
Signed-off-by: Petar Penkov <ppenkov@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 tools/testing/selftests/bpf/Makefile   |   2 +-
 tools/testing/selftests/bpf/bpf_flow.c | 386 +++++++++++++++++++++++++
 2 files changed, 387 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/bpf_flow.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index fff7fb1285fc..e65f50f9185e 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -35,7 +35,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test
 	test_get_stack_rawtp.o test_sockmap_kern.o test_sockhash_kern.o \
 	test_lwt_seg6local.o sendmsg4_prog.o sendmsg6_prog.o test_lirc_mode2_kern.o \
 	get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
-	test_skb_cgroup_id_kern.o
+	test_skb_cgroup_id_kern.o bpf_flow.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
diff --git a/tools/testing/selftests/bpf/bpf_flow.c b/tools/testing/selftests/bpf/bpf_flow.c
new file mode 100644
index 000000000000..f1e33a574841
--- /dev/null
+++ b/tools/testing/selftests/bpf/bpf_flow.c
@@ -0,0 +1,386 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <limits.h>
+#include <stddef.h>
+#include <stdbool.h>
+#include <string.h>
+#include <linux/pkt_cls.h>
+#include <linux/bpf.h>
+#include <linux/in.h>
+#include <linux/if_ether.h>
+#include <linux/icmp.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <linux/tcp.h>
+#include <linux/udp.h>
+#include <linux/if_packet.h>
+#include <sys/socket.h>
+#include <linux/if_tunnel.h>
+#include <linux/mpls.h>
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+int _version SEC("version") = 1;
+#define PROG(F) SEC(#F) int bpf_func_##F
+
+/* These are the identifiers of the BPF programs that will be used in tail
+ * calls. Name is limited to 16 characters, with the terminating character and
+ * bpf_func_ above, we have only 6 to work with, anything after will be cropped.
+ */
+enum {
+	IP,
+	IPV6,
+	IPV6OP,	/* Destination/Hop-by-Hop Options IPv6 Extension header */
+	IPV6FR,	/* Fragmentation IPv6 Extension Header */
+	MPLS,
+	VLAN,
+};
+
+#define IP_MF		0x2000
+#define IP_OFFSET	0x1FFF
+#define IP6_MF		0x0001
+#define IP6_OFFSET	0xFFF8
+
+struct vlan_hdr {
+	__be16 h_vlan_TCI;
+	__be16 h_vlan_encapsulated_proto;
+};
+
+struct gre_hdr {
+	__be16 flags;
+	__be16 proto;
+};
+
+struct frag_hdr {
+	__u8 nexthdr;
+	__u8 reserved;
+	__be16 frag_off;
+	__be32 identification;
+};
+
+struct bpf_map_def SEC("maps") jmp_table = {
+	.type = BPF_MAP_TYPE_PROG_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(__u32),
+	.max_entries = 8
+};
+
+struct bpf_dissect_cb {
+	__u16 nhoff;
+};
+
+static __always_inline void *bpf_flow_dissect_get_header(struct __sk_buff *skb,
+							 __u16 hdr_size,
+							 void *buffer)
+{
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+	void *data_end = (__u8 *)(long)skb->data_end;
+	void *data = (__u8 *)(long)skb->data;
+	__u8 *hdr;
+
+	/* Verifies this variable offset does not overflow */
+	if (cb->nhoff > (USHRT_MAX - hdr_size))
+		return NULL;
+
+	hdr = data + cb->nhoff;
+	if (hdr + hdr_size <= data_end)
+		return hdr;
+
+	if (bpf_skb_load_bytes(skb, cb->nhoff, buffer, hdr_size))
+		return NULL;
+
+	return buffer;
+}
+
+/* Dispatches on ETHERTYPE */
+static __always_inline int parse_eth_proto(struct __sk_buff *skb, __be16 proto)
+{
+	struct bpf_flow_keys *keys = (struct bpf_flow_keys *)(long)(skb->flow_keys);
+
+	keys->n_proto = proto;
+	switch (proto) {
+	case bpf_htons(ETH_P_IP):
+		bpf_tail_call(skb, &jmp_table, IP);
+		break;
+	case bpf_htons(ETH_P_IPV6):
+		bpf_tail_call(skb, &jmp_table, IPV6);
+		break;
+	case bpf_htons(ETH_P_MPLS_MC):
+	case bpf_htons(ETH_P_MPLS_UC):
+		bpf_tail_call(skb, &jmp_table, MPLS);
+		break;
+	case bpf_htons(ETH_P_8021Q):
+	case bpf_htons(ETH_P_8021AD):
+		bpf_tail_call(skb, &jmp_table, VLAN);
+		break;
+	default:
+		/* Protocol not supported */
+		return BPF_DROP;
+	}
+
+	return BPF_DROP;
+}
+
+SEC("dissect")
+int dissect(struct __sk_buff *skb)
+{
+	if (!skb->vlan_present)
+		return parse_eth_proto(skb, skb->protocol);
+	else
+		return parse_eth_proto(skb, skb->vlan_proto);
+}
+
+/* Parses on IPPROTO_* */
+static __always_inline int parse_ip_proto(struct __sk_buff *skb, __u8 proto)
+{
+	struct bpf_flow_keys *keys = (struct bpf_flow_keys *)(long)(skb->flow_keys);
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+	void *data_end = (void *)(long)skb->data_end;
+	struct icmphdr *icmp, _icmp;
+	struct gre_hdr *gre, _gre;
+	struct ethhdr *eth, _eth;
+	struct tcphdr *tcp, _tcp;
+	struct udphdr *udp, _udp;
+
+	keys->ip_proto = proto;
+	switch (proto) {
+	case IPPROTO_ICMP:
+		icmp = bpf_flow_dissect_get_header(skb, sizeof(*icmp), &_icmp);
+		if (!icmp)
+			return BPF_DROP;
+		return BPF_OK;
+	case IPPROTO_IPIP:
+		keys->is_encap = true;
+		return parse_eth_proto(skb, bpf_htons(ETH_P_IP));
+	case IPPROTO_IPV6:
+		keys->is_encap = true;
+		return parse_eth_proto(skb, bpf_htons(ETH_P_IPV6));
+	case IPPROTO_GRE:
+		gre = bpf_flow_dissect_get_header(skb, sizeof(*gre), &_gre);
+		if (!gre)
+			return BPF_DROP;
+
+		if (bpf_htons(gre->flags & GRE_VERSION))
+			/* Only inspect standard GRE packets with version 0 */
+			return BPF_OK;
+
+		cb->nhoff += sizeof(*gre); /* Step over GRE Flags and Proto */
+		if (GRE_IS_CSUM(gre->flags))
+			cb->nhoff += 4; /* Step over chksum and Padding */
+		if (GRE_IS_KEY(gre->flags))
+			cb->nhoff += 4; /* Step over key */
+		if (GRE_IS_SEQ(gre->flags))
+			cb->nhoff += 4; /* Step over sequence number */
+
+		keys->is_encap = true;
+
+		if (gre->proto == bpf_htons(ETH_P_TEB)) {
+			eth = bpf_flow_dissect_get_header(skb, sizeof(*eth),
+							  &_eth);
+			if (!eth)
+				return BPF_DROP;
+
+			cb->nhoff += sizeof(*eth);
+
+			return parse_eth_proto(skb, eth->h_proto);
+		} else {
+			return parse_eth_proto(skb, gre->proto);
+		}
+	case IPPROTO_TCP:
+		tcp = bpf_flow_dissect_get_header(skb, sizeof(*tcp), &_tcp);
+		if (!tcp)
+			return BPF_DROP;
+
+		if (tcp->doff < 5)
+			return BPF_DROP;
+
+		if ((__u8 *)tcp + (tcp->doff << 2) > data_end)
+			return BPF_DROP;
+
+		keys->thoff = cb->nhoff;
+		keys->sport = tcp->source;
+		keys->dport = tcp->dest;
+		return BPF_OK;
+	case IPPROTO_UDP:
+	case IPPROTO_UDPLITE:
+		udp = bpf_flow_dissect_get_header(skb, sizeof(*udp), &_udp);
+		if (!udp)
+			return BPF_DROP;
+
+		keys->thoff = cb->nhoff;
+		keys->sport = udp->source;
+		keys->dport = udp->dest;
+		return BPF_OK;
+	default:
+		return BPF_DROP;
+	}
+
+	return BPF_DROP;
+}
+
+static __always_inline int parse_ipv6_proto(struct __sk_buff *skb, __u8 nexthdr)
+{
+	struct bpf_flow_keys *keys = (struct bpf_flow_keys *)(long)(skb->flow_keys);
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+
+	keys->ip_proto = nexthdr;
+	switch (nexthdr) {
+	case IPPROTO_HOPOPTS:
+	case IPPROTO_DSTOPTS:
+		bpf_tail_call(skb, &jmp_table, IPV6OP);
+		break;
+	case IPPROTO_FRAGMENT:
+		bpf_tail_call(skb, &jmp_table, IPV6FR);
+		break;
+	default:
+		return parse_ip_proto(skb, nexthdr);
+	}
+
+	return BPF_DROP;
+}
+
+PROG(IP)(struct __sk_buff *skb)
+{
+	struct bpf_flow_keys *keys = (struct bpf_flow_keys *)(long)(skb->flow_keys);
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+	void *data_end = (__u8 *)(long)skb->data_end;
+	void *data = (__u8 *)(long)skb->data;
+	bool done = false;
+	struct iphdr *iph, _iph;
+
+	iph = bpf_flow_dissect_get_header(skb, sizeof(*iph), &_iph);
+	if (!iph)
+		return BPF_DROP;
+
+	/* IP header cannot be smaller than 20 bytes */
+	if (iph->ihl < 5)
+		return BPF_DROP;
+
+	keys->addr_proto = ETH_P_IP;
+	keys->ipv4_src = iph->saddr;
+	keys->ipv4_dst = iph->daddr;
+
+	cb->nhoff += iph->ihl << 2;
+	if (data + cb->nhoff > data_end)
+		return BPF_DROP;
+
+	if (iph->frag_off & bpf_htons(IP_MF | IP_OFFSET)) {
+		keys->is_frag = true;
+		if (iph->frag_off & bpf_htons(IP_OFFSET))
+			/* From second fragment on, packets do not have headers
+			 * we can parse.
+			 */
+			done = true;
+		else
+			keys->is_first_frag = true;
+	}
+
+	if (done)
+		return BPF_OK;
+
+	return parse_ip_proto(skb, iph->protocol);
+}
+
+PROG(IPV6)(struct __sk_buff *skb)
+{
+	struct bpf_flow_keys *keys = (struct bpf_flow_keys *)(long)(skb->flow_keys);
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+	struct ipv6hdr *ip6h, _ip6h;
+
+	ip6h = bpf_flow_dissect_get_header(skb, sizeof(*ip6h), &_ip6h);
+	if (!ip6h)
+		return BPF_DROP;
+
+	keys->addr_proto = ETH_P_IPV6;
+	memcpy(&keys->ipv6_src, &ip6h->saddr, 2*sizeof(ip6h->saddr));
+
+	cb->nhoff += sizeof(struct ipv6hdr);
+
+	return parse_ipv6_proto(skb, ip6h->nexthdr);
+}
+
+PROG(IPV6OP)(struct __sk_buff *skb)
+{
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+	struct ipv6_opt_hdr *ip6h, _ip6h;
+
+	ip6h = bpf_flow_dissect_get_header(skb, sizeof(*ip6h), &_ip6h);
+	if (!ip6h)
+		return BPF_DROP;
+
+	/* hlen is in 8-octects and does not include the first 8 bytes
+	 * of the header
+	 */
+	cb->nhoff += (1 + ip6h->hdrlen) << 3;
+
+	return parse_ipv6_proto(skb, ip6h->nexthdr);
+}
+
+PROG(IPV6FR)(struct __sk_buff *skb)
+{
+	struct bpf_flow_keys *keys = (struct bpf_flow_keys *)(long)(skb->flow_keys);
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+	struct frag_hdr *fragh, _fragh;
+
+	fragh = bpf_flow_dissect_get_header(skb, sizeof(*fragh), &_fragh);
+	if (!fragh)
+		return BPF_DROP;
+
+	cb->nhoff += sizeof(*fragh);
+	keys->is_frag = true;
+	if (!(fragh->frag_off & bpf_htons(IP6_OFFSET)))
+		keys->is_first_frag = true;
+
+	return parse_ipv6_proto(skb, fragh->nexthdr);
+}
+
+PROG(MPLS)(struct __sk_buff *skb)
+{
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+	void *data = (void *)(long)skb->data;
+	struct mpls_label *mpls, _mpls;
+	__u8 *version;
+
+	mpls = bpf_flow_dissect_get_header(skb, sizeof(*mpls), &_mpls);
+	if (!mpls)
+		return BPF_DROP;
+
+	return BPF_OK;
+}
+
+PROG(VLAN)(struct __sk_buff *skb)
+{
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+	struct vlan_hdr *vlan, _vlan;
+	__be16 proto;
+
+	/* Peek back to see if single or double-tagging */
+	if (bpf_skb_load_bytes(skb, cb->nhoff - sizeof(proto), &proto,
+			       sizeof(proto)))
+		return BPF_DROP;
+
+	/* Account for double-tagging */
+	if (proto == bpf_htons(ETH_P_8021AD)) {
+		vlan = bpf_flow_dissect_get_header(skb, sizeof(*vlan), &_vlan);
+		if (!vlan)
+			return BPF_DROP;
+
+		if (vlan->h_vlan_encapsulated_proto != bpf_htons(ETH_P_8021Q))
+			return BPF_DROP;
+
+		cb->nhoff += sizeof(*vlan);
+	}
+
+	vlan = bpf_flow_dissect_get_header(skb, sizeof(*vlan), &_vlan);
+	if (!vlan)
+		return BPF_DROP;
+
+	cb->nhoff += sizeof(*vlan);
+	/* Only allow 8021AD + 8021Q double tagging and no triple tagging.*/
+	if (vlan->h_vlan_encapsulated_proto == bpf_htons(ETH_P_8021AD) ||
+	    vlan->h_vlan_encapsulated_proto == bpf_htons(ETH_P_8021Q))
+		return BPF_DROP;
+
+	return parse_eth_proto(skb, vlan->h_vlan_encapsulated_proto);
+}
+
+char __license[] SEC("license") = "GPL";
-- 
2.19.0.rc2.392.g5ba43deb5a-goog

^ permalink raw reply related

* [bpf-next, v2 1/3] flow_dissector: implements flow dissector BPF hook
From: Petar Penkov @ 2018-09-08  0:11 UTC (permalink / raw)
  To: netdev
  Cc: davem, ast, daniel, simon.horman, ecree, songliubraving, tom,
	Petar Penkov, Willem de Bruijn
In-Reply-To: <20180908001110.200828-1-peterpenkov96@gmail.com>

From: Petar Penkov <ppenkov@google.com>

Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
path. The BPF program is per-network namespace.

Signed-off-by: Petar Penkov <ppenkov@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/linux/bpf.h            |   1 +
 include/linux/bpf_types.h      |   1 +
 include/linux/skbuff.h         |   7 ++
 include/net/net_namespace.h    |   3 +
 include/net/sch_generic.h      |  12 ++-
 include/uapi/linux/bpf.h       |  25 ++++++
 kernel/bpf/syscall.c           |   8 ++
 kernel/bpf/verifier.c          |  32 ++++++++
 net/core/filter.c              |  67 ++++++++++++++++
 net/core/flow_dissector.c      | 136 +++++++++++++++++++++++++++++++++
 tools/bpf/bpftool/prog.c       |   1 +
 tools/include/uapi/linux/bpf.h |  25 ++++++
 tools/lib/bpf/libbpf.c         |   2 +
 13 files changed, 317 insertions(+), 3 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 523481a3471b..988a00797bcd 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -212,6 +212,7 @@ enum bpf_reg_type {
 	PTR_TO_PACKET_META,	 /* skb->data - meta_len */
 	PTR_TO_PACKET,		 /* reg points to skb->data */
 	PTR_TO_PACKET_END,	 /* skb->data + headlen */
+	PTR_TO_FLOW_KEYS,	 /* reg points to bpf_flow_keys */
 };
 
 /* The information passed from prog-specific *_is_valid_access
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index cd26c090e7c0..22083712dd18 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -32,6 +32,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2)
 #ifdef CONFIG_INET
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_REUSEPORT, sk_reuseport)
 #endif
+BPF_PROG_TYPE(BPF_PROG_TYPE_FLOW_DISSECTOR, flow_dissector)
 
 BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 17a13e4785fc..ce0e863f02a2 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -243,6 +243,8 @@ struct scatterlist;
 struct pipe_inode_info;
 struct iov_iter;
 struct napi_struct;
+struct bpf_prog;
+union bpf_attr;
 
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
 struct nf_conntrack {
@@ -1192,6 +1194,11 @@ void skb_flow_dissector_init(struct flow_dissector *flow_dissector,
 			     const struct flow_dissector_key *key,
 			     unsigned int key_count);
 
+int skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr,
+				       struct bpf_prog *prog);
+
+int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr);
+
 bool __skb_flow_dissect(const struct sk_buff *skb,
 			struct flow_dissector *flow_dissector,
 			void *target_container,
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 9b5fdc50519a..99d4148e0f90 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -43,6 +43,7 @@ struct ctl_table_header;
 struct net_generic;
 struct uevent_sock;
 struct netns_ipvs;
+struct bpf_prog;
 
 
 #define NETDEV_HASHBITS    8
@@ -145,6 +146,8 @@ struct net {
 #endif
 	struct net_generic __rcu	*gen;
 
+	struct bpf_prog __rcu	*flow_dissector_prog;
+
 	/* Note : following structs are cache line aligned */
 #ifdef CONFIG_XFRM
 	struct netns_xfrm	xfrm;
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index a6d00093f35e..1b81ba85fd2d 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -19,6 +19,7 @@ struct Qdisc_ops;
 struct qdisc_walker;
 struct tcf_walker;
 struct module;
+struct bpf_flow_keys;
 
 typedef int tc_setup_cb_t(enum tc_setup_type type,
 			  void *type_data, void *cb_priv);
@@ -307,9 +308,14 @@ struct tcf_proto {
 };
 
 struct qdisc_skb_cb {
-	unsigned int		pkt_len;
-	u16			slave_dev_queue_mapping;
-	u16			tc_classid;
+	union {
+		struct {
+			unsigned int		pkt_len;
+			u16			slave_dev_queue_mapping;
+			u16			tc_classid;
+		};
+		struct bpf_flow_keys *flow_keys;
+	};
 #define QDISC_CB_PRIV_LEN 20
 	unsigned char		data[QDISC_CB_PRIV_LEN];
 };
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 66917a4eba27..3064706fcaaa 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -152,6 +152,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_LWT_SEG6LOCAL,
 	BPF_PROG_TYPE_LIRC_MODE2,
 	BPF_PROG_TYPE_SK_REUSEPORT,
+	BPF_PROG_TYPE_FLOW_DISSECTOR,
 };
 
 enum bpf_attach_type {
@@ -172,6 +173,7 @@ enum bpf_attach_type {
 	BPF_CGROUP_UDP4_SENDMSG,
 	BPF_CGROUP_UDP6_SENDMSG,
 	BPF_LIRC_MODE2,
+	BPF_FLOW_DISSECTOR,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -2333,6 +2335,7 @@ struct __sk_buff {
 	/* ... here. */
 
 	__u32 data_meta;
+	__u32 flow_keys;
 };
 
 struct bpf_tunnel_key {
@@ -2778,4 +2781,26 @@ enum bpf_task_fd_type {
 	BPF_FD_TYPE_URETPROBE,		/* filename + offset */
 };
 
+struct bpf_flow_keys {
+	__u16	thoff;
+	__u16	addr_proto;			/* ETH_P_* of valid addrs */
+	__u8	is_frag;
+	__u8	is_first_frag;
+	__u8	is_encap;
+	__be16	n_proto;
+	__u8	ip_proto;
+	union {
+		struct {
+			__be32	ipv4_src;
+			__be32	ipv4_dst;
+		};
+		struct {
+			__u32	ipv6_src[4];	/* in6_addr; network order */
+			__u32	ipv6_dst[4];	/* in6_addr; network order */
+		};
+	};
+	__be16	sport;
+	__be16	dport;
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 3c9636f03bb2..b3c2d09bcf7a 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1615,6 +1615,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 	case BPF_LIRC_MODE2:
 		ptype = BPF_PROG_TYPE_LIRC_MODE2;
 		break;
+	case BPF_FLOW_DISSECTOR:
+		ptype = BPF_PROG_TYPE_FLOW_DISSECTOR;
+		break;
 	default:
 		return -EINVAL;
 	}
@@ -1636,6 +1639,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 	case BPF_PROG_TYPE_LIRC_MODE2:
 		ret = lirc_prog_attach(attr, prog);
 		break;
+	case BPF_PROG_TYPE_FLOW_DISSECTOR:
+		ret = skb_flow_dissector_bpf_prog_attach(attr, prog);
+		break;
 	default:
 		ret = cgroup_bpf_prog_attach(attr, ptype, prog);
 	}
@@ -1688,6 +1694,8 @@ static int bpf_prog_detach(const union bpf_attr *attr)
 		return sockmap_get_from_fd(attr, BPF_PROG_TYPE_SK_SKB, NULL);
 	case BPF_LIRC_MODE2:
 		return lirc_prog_detach(attr);
+	case BPF_FLOW_DISSECTOR:
+		return skb_flow_dissector_bpf_prog_detach(attr);
 	default:
 		return -EINVAL;
 	}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 6ff1bac1795d..8ccbff4fff93 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -261,6 +261,7 @@ static const char * const reg_type_str[] = {
 	[PTR_TO_PACKET]		= "pkt",
 	[PTR_TO_PACKET_META]	= "pkt_meta",
 	[PTR_TO_PACKET_END]	= "pkt_end",
+	[PTR_TO_FLOW_KEYS]	= "flow_keys",
 };
 
 static char slot_type_char[] = {
@@ -965,6 +966,7 @@ static bool is_spillable_regtype(enum bpf_reg_type type)
 	case PTR_TO_PACKET:
 	case PTR_TO_PACKET_META:
 	case PTR_TO_PACKET_END:
+	case PTR_TO_FLOW_KEYS:
 	case CONST_PTR_TO_MAP:
 		return true;
 	default:
@@ -1238,6 +1240,7 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
 	case BPF_PROG_TYPE_LWT_XMIT:
 	case BPF_PROG_TYPE_SK_SKB:
 	case BPF_PROG_TYPE_SK_MSG:
+	case BPF_PROG_TYPE_FLOW_DISSECTOR:
 		if (meta)
 			return meta->pkt_access;
 
@@ -1321,6 +1324,18 @@ static int check_ctx_access(struct bpf_verifier_env *env, int insn_idx, int off,
 	return -EACCES;
 }
 
+static int check_flow_keys_access(struct bpf_verifier_env *env, int off,
+				  int size)
+{
+	if (size < 0 || off < 0 ||
+	    (u64)off + size > sizeof(struct bpf_flow_keys)) {
+		verbose(env, "invalid access to flow keys off=%d size=%d\n",
+			off, size);
+		return -EACCES;
+	}
+	return 0;
+}
+
 static bool __is_pointer_value(bool allow_ptr_leaks,
 			       const struct bpf_reg_state *reg)
 {
@@ -1422,6 +1437,9 @@ static int check_ptr_alignment(struct bpf_verifier_env *env,
 		 * right in front, treat it the very same way.
 		 */
 		return check_pkt_ptr_alignment(env, reg, off, size, strict);
+	case PTR_TO_FLOW_KEYS:
+		pointer_desc = "flow keys ";
+		break;
 	case PTR_TO_MAP_VALUE:
 		pointer_desc = "value ";
 		break;
@@ -1692,6 +1710,17 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
 		err = check_packet_access(env, regno, off, size, false);
 		if (!err && t == BPF_READ && value_regno >= 0)
 			mark_reg_unknown(env, regs, value_regno);
+	} else if (reg->type == PTR_TO_FLOW_KEYS) {
+		if (t == BPF_WRITE && value_regno >= 0 &&
+		    is_pointer_value(env, value_regno)) {
+			verbose(env, "R%d leaks addr into flow keys\n",
+				value_regno);
+			return -EACCES;
+		}
+
+		err = check_flow_keys_access(env, off, size);
+		if (!err && t == BPF_READ && value_regno >= 0)
+			mark_reg_unknown(env, regs, value_regno);
 	} else {
 		verbose(env, "R%d invalid mem access '%s'\n", regno,
 			reg_type_str[reg->type]);
@@ -1839,6 +1868,8 @@ static int check_helper_mem_access(struct bpf_verifier_env *env, int regno,
 	case PTR_TO_PACKET_META:
 		return check_packet_access(env, regno, reg->off, access_size,
 					   zero_size_allowed);
+	case PTR_TO_FLOW_KEYS:
+		return check_flow_keys_access(env, reg->off, access_size);
 	case PTR_TO_MAP_VALUE:
 		return check_map_access(env, regno, reg->off, access_size,
 					zero_size_allowed);
@@ -4366,6 +4397,7 @@ static bool regsafe(struct bpf_reg_state *rold, struct bpf_reg_state *rcur,
 	case PTR_TO_CTX:
 	case CONST_PTR_TO_MAP:
 	case PTR_TO_PACKET_END:
+	case PTR_TO_FLOW_KEYS:
 		/* Only valid matches are exact, which memcmp() above
 		 * would have accepted
 		 */
diff --git a/net/core/filter.c b/net/core/filter.c
index 8cb242b4400f..bc3725c26794 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5122,6 +5122,17 @@ sk_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	}
 }
 
+static const struct bpf_func_proto *
+flow_dissector_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	switch (func_id) {
+	case BPF_FUNC_skb_load_bytes:
+		return &bpf_skb_load_bytes_proto;
+	default:
+		return bpf_base_func_proto(func_id);
+	}
+}
+
 static const struct bpf_func_proto *
 lwt_out_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
@@ -5237,6 +5248,7 @@ static bool bpf_skb_is_valid_access(int off, int size, enum bpf_access_type type
 	case bpf_ctx_range(struct __sk_buff, data):
 	case bpf_ctx_range(struct __sk_buff, data_meta):
 	case bpf_ctx_range(struct __sk_buff, data_end):
+	case bpf_ctx_range(struct __sk_buff, flow_keys):
 		if (size != size_default)
 			return false;
 		break;
@@ -5265,6 +5277,7 @@ static bool sk_filter_is_valid_access(int off, int size,
 	case bpf_ctx_range(struct __sk_buff, data):
 	case bpf_ctx_range(struct __sk_buff, data_meta):
 	case bpf_ctx_range(struct __sk_buff, data_end):
+	case bpf_ctx_range(struct __sk_buff, flow_keys):
 	case bpf_ctx_range_till(struct __sk_buff, family, local_port):
 		return false;
 	}
@@ -5290,6 +5303,7 @@ static bool lwt_is_valid_access(int off, int size,
 	case bpf_ctx_range(struct __sk_buff, tc_classid):
 	case bpf_ctx_range_till(struct __sk_buff, family, local_port):
 	case bpf_ctx_range(struct __sk_buff, data_meta):
+	case bpf_ctx_range(struct __sk_buff, flow_keys):
 		return false;
 	}
 
@@ -5500,6 +5514,7 @@ static bool tc_cls_act_is_valid_access(int off, int size,
 	case bpf_ctx_range(struct __sk_buff, data_end):
 		info->reg_type = PTR_TO_PACKET_END;
 		break;
+	case bpf_ctx_range(struct __sk_buff, flow_keys):
 	case bpf_ctx_range_till(struct __sk_buff, family, local_port):
 		return false;
 	}
@@ -5701,6 +5716,7 @@ static bool sk_skb_is_valid_access(int off, int size,
 	switch (off) {
 	case bpf_ctx_range(struct __sk_buff, tc_classid):
 	case bpf_ctx_range(struct __sk_buff, data_meta):
+	case bpf_ctx_range(struct __sk_buff, flow_keys):
 		return false;
 	}
 
@@ -5760,6 +5776,39 @@ static bool sk_msg_is_valid_access(int off, int size,
 	return true;
 }
 
+static bool flow_dissector_is_valid_access(int off, int size,
+					   enum bpf_access_type type,
+					   const struct bpf_prog *prog,
+					   struct bpf_insn_access_aux *info)
+{
+	if (type == BPF_WRITE) {
+		switch (off) {
+		case bpf_ctx_range_till(struct __sk_buff, cb[0], cb[4]):
+			break;
+		default:
+			return false;
+		}
+	}
+
+	switch (off) {
+	case bpf_ctx_range(struct __sk_buff, data):
+		info->reg_type = PTR_TO_PACKET;
+		break;
+	case bpf_ctx_range(struct __sk_buff, data_end):
+		info->reg_type = PTR_TO_PACKET_END;
+		break;
+	case bpf_ctx_range(struct __sk_buff, flow_keys):
+		info->reg_type = PTR_TO_FLOW_KEYS;
+		break;
+	case bpf_ctx_range(struct __sk_buff, tc_classid):
+	case bpf_ctx_range(struct __sk_buff, data_meta):
+	case bpf_ctx_range_till(struct __sk_buff, family, local_port):
+		return false;
+	}
+
+	return bpf_skb_is_valid_access(off, size, type, prog, info);
+}
+
 static u32 bpf_convert_ctx_access(enum bpf_access_type type,
 				  const struct bpf_insn *si,
 				  struct bpf_insn *insn_buf,
@@ -6054,6 +6103,15 @@ static u32 bpf_convert_ctx_access(enum bpf_access_type type,
 				      bpf_target_off(struct sock_common,
 						     skc_num, 2, target_size));
 		break;
+
+	case offsetof(struct __sk_buff, flow_keys):
+		off  = si->off;
+		off -= offsetof(struct __sk_buff, flow_keys);
+		off += offsetof(struct sk_buff, cb);
+		off += offsetof(struct qdisc_skb_cb, flow_keys);
+		*insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), si->dst_reg,
+				      si->src_reg, off);
+		break;
 	}
 
 	return insn - insn_buf;
@@ -7017,6 +7075,15 @@ const struct bpf_verifier_ops sk_msg_verifier_ops = {
 const struct bpf_prog_ops sk_msg_prog_ops = {
 };
 
+const struct bpf_verifier_ops flow_dissector_verifier_ops = {
+	.get_func_proto		= flow_dissector_func_proto,
+	.is_valid_access	= flow_dissector_is_valid_access,
+	.convert_ctx_access	= bpf_convert_ctx_access,
+};
+
+const struct bpf_prog_ops flow_dissector_prog_ops = {
+};
+
 int sk_detach_filter(struct sock *sk)
 {
 	int ret = -ENOENT;
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index ce9eeeb7c024..7eed48c46a94 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -25,6 +25,9 @@
 #include <net/flow_dissector.h>
 #include <scsi/fc/fc_fcoe.h>
 #include <uapi/linux/batadv_packet.h>
+#include <linux/bpf.h>
+
+static DEFINE_MUTEX(flow_dissector_mutex);
 
 static void dissector_set_key(struct flow_dissector *flow_dissector,
 			      enum flow_dissector_key_id key_id)
@@ -62,6 +65,44 @@ void skb_flow_dissector_init(struct flow_dissector *flow_dissector,
 }
 EXPORT_SYMBOL(skb_flow_dissector_init);
 
+int skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr,
+				       struct bpf_prog *prog)
+{
+	struct bpf_prog *attached;
+	struct net *net;
+
+	net = current->nsproxy->net_ns;
+	mutex_lock(&flow_dissector_mutex);
+	attached = rcu_dereference_protected(net->flow_dissector_prog,
+					     lockdep_is_held(&flow_dissector_mutex));
+	if (attached) {
+		/* Only one BPF program can be attached at a time */
+		mutex_unlock(&flow_dissector_mutex);
+		return -EEXIST;
+	}
+	rcu_assign_pointer(net->flow_dissector_prog, prog);
+	mutex_unlock(&flow_dissector_mutex);
+	return 0;
+}
+
+int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr)
+{
+	struct bpf_prog *attached;
+	struct net *net;
+
+	net = current->nsproxy->net_ns;
+	mutex_lock(&flow_dissector_mutex);
+	attached = rcu_dereference_protected(net->flow_dissector_prog,
+					     lockdep_is_held(&flow_dissector_mutex));
+	if (!attached) {
+		mutex_unlock(&flow_dissector_mutex);
+		return -ENOENT;
+	}
+	bpf_prog_put(attached);
+	RCU_INIT_POINTER(net->flow_dissector_prog, NULL);
+	mutex_unlock(&flow_dissector_mutex);
+	return 0;
+}
 /**
  * skb_flow_get_be16 - extract be16 entity
  * @skb: sk_buff to extract from
@@ -588,6 +629,60 @@ static bool skb_flow_dissect_allowed(int *num_hdrs)
 	return (*num_hdrs <= MAX_FLOW_DISSECT_HDRS);
 }
 
+static void __skb_flow_bpf_to_target(const struct bpf_flow_keys *flow_keys,
+				     struct flow_dissector *flow_dissector,
+				     void *target_container)
+{
+	struct flow_dissector_key_control *key_control;
+	struct flow_dissector_key_basic *key_basic;
+	struct flow_dissector_key_addrs *key_addrs;
+	struct flow_dissector_key_ports *key_ports;
+
+	key_control = skb_flow_dissector_target(flow_dissector,
+						FLOW_DISSECTOR_KEY_CONTROL,
+						target_container);
+	key_control->thoff = flow_keys->thoff;
+	if (flow_keys->is_frag)
+		key_control->flags |= FLOW_DIS_IS_FRAGMENT;
+	if (flow_keys->is_first_frag)
+		key_control->flags |= FLOW_DIS_FIRST_FRAG;
+	if (flow_keys->is_encap)
+		key_control->flags |= FLOW_DIS_ENCAPSULATION;
+
+	key_basic = skb_flow_dissector_target(flow_dissector,
+					      FLOW_DISSECTOR_KEY_BASIC,
+					      target_container);
+	key_basic->n_proto = flow_keys->n_proto;
+	key_basic->ip_proto = flow_keys->ip_proto;
+
+	if (flow_keys->addr_proto == ETH_P_IP &&
+	    dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_IPV4_ADDRS)) {
+		key_addrs = skb_flow_dissector_target(flow_dissector,
+						      FLOW_DISSECTOR_KEY_IPV4_ADDRS,
+						      target_container);
+		key_addrs->v4addrs.src = flow_keys->ipv4_src;
+		key_addrs->v4addrs.dst = flow_keys->ipv4_dst;
+		key_control->addr_type = FLOW_DISSECTOR_KEY_IPV4_ADDRS;
+	} else if (flow_keys->addr_proto == ETH_P_IPV6 &&
+		   dissector_uses_key(flow_dissector,
+				      FLOW_DISSECTOR_KEY_IPV6_ADDRS)) {
+		key_addrs = skb_flow_dissector_target(flow_dissector,
+						      FLOW_DISSECTOR_KEY_IPV6_ADDRS,
+						      target_container);
+		memcpy(&key_addrs->v6addrs, &flow_keys->ipv6_src,
+		       sizeof(key_addrs->v6addrs));
+		key_control->addr_type = FLOW_DISSECTOR_KEY_IPV6_ADDRS;
+	}
+
+	if (dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_PORTS)) {
+		key_ports = skb_flow_dissector_target(flow_dissector,
+						      FLOW_DISSECTOR_KEY_PORTS,
+						      target_container);
+		key_ports->src = flow_keys->sport;
+		key_ports->dst = flow_keys->dport;
+	}
+}
+
 /**
  * __skb_flow_dissect - extract the flow_keys struct and return it
  * @skb: sk_buff to extract the flow from, can be NULL if the rest are specified
@@ -619,6 +714,7 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
 	struct flow_dissector_key_vlan *key_vlan;
 	enum flow_dissect_ret fdret;
 	enum flow_dissector_key_id dissector_vlan = FLOW_DISSECTOR_KEY_MAX;
+	struct bpf_prog *attached;
 	int num_hdrs = 0;
 	u8 ip_proto = 0;
 	bool ret;
@@ -658,6 +754,46 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
 					      FLOW_DISSECTOR_KEY_BASIC,
 					      target_container);
 
+	rcu_read_lock();
+	attached = skb ? rcu_dereference(dev_net(skb->dev)->flow_dissector_prog)
+		       : NULL;
+	if (attached) {
+		/* Note that even though the const qualifier is discarded
+		 * throughout the execution of the BPF program, all changes(the
+		 * control block) are reverted after the BPF program returns.
+		 * Therefore, __skb_flow_dissect does not alter the skb.
+		 */
+		struct bpf_flow_keys flow_keys = {};
+		struct qdisc_skb_cb cb_saved;
+		struct qdisc_skb_cb *cb;
+		u16 *pseudo_cb;
+		u32 result;
+
+		cb = qdisc_skb_cb(skb);
+		pseudo_cb = (u16 *)bpf_skb_cb((struct sk_buff *)skb);
+
+		/* Save Control Block */
+		memcpy(&cb_saved, cb, sizeof(cb_saved));
+		memset(cb, 0, sizeof(cb_saved));
+
+		/* Pass parameters to the BPF program */
+		cb->flow_keys = &flow_keys;
+		*pseudo_cb = nhoff;
+
+		bpf_compute_data_pointers((struct sk_buff *)skb);
+		result = BPF_PROG_RUN(attached, skb);
+
+		/* Restore state */
+		memcpy(cb, &cb_saved, sizeof(cb_saved));
+
+		__skb_flow_bpf_to_target(&flow_keys, flow_dissector,
+					 target_container);
+		key_control->thoff = min_t(u16, key_control->thoff, skb->len);
+		rcu_read_unlock();
+		return result == BPF_OK;
+	}
+	rcu_read_unlock();
+
 	if (dissector_uses_key(flow_dissector,
 			       FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
 		struct ethhdr *eth = eth_hdr(skb);
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index dce960d22106..b1cd3bc8db70 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -74,6 +74,7 @@ static const char * const prog_type_name[] = {
 	[BPF_PROG_TYPE_RAW_TRACEPOINT]	= "raw_tracepoint",
 	[BPF_PROG_TYPE_CGROUP_SOCK_ADDR] = "cgroup_sock_addr",
 	[BPF_PROG_TYPE_LIRC_MODE2]	= "lirc_mode2",
+	[BPF_PROG_TYPE_FLOW_DISSECTOR]	= "flow_dissector",
 };
 
 static void print_boot_time(__u64 nsecs, char *buf, unsigned int size)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 66917a4eba27..3064706fcaaa 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -152,6 +152,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_LWT_SEG6LOCAL,
 	BPF_PROG_TYPE_LIRC_MODE2,
 	BPF_PROG_TYPE_SK_REUSEPORT,
+	BPF_PROG_TYPE_FLOW_DISSECTOR,
 };
 
 enum bpf_attach_type {
@@ -172,6 +173,7 @@ enum bpf_attach_type {
 	BPF_CGROUP_UDP4_SENDMSG,
 	BPF_CGROUP_UDP6_SENDMSG,
 	BPF_LIRC_MODE2,
+	BPF_FLOW_DISSECTOR,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -2333,6 +2335,7 @@ struct __sk_buff {
 	/* ... here. */
 
 	__u32 data_meta;
+	__u32 flow_keys;
 };
 
 struct bpf_tunnel_key {
@@ -2778,4 +2781,26 @@ enum bpf_task_fd_type {
 	BPF_FD_TYPE_URETPROBE,		/* filename + offset */
 };
 
+struct bpf_flow_keys {
+	__u16	thoff;
+	__u16	addr_proto;			/* ETH_P_* of valid addrs */
+	__u8	is_frag;
+	__u8	is_first_frag;
+	__u8	is_encap;
+	__be16	n_proto;
+	__u8	ip_proto;
+	union {
+		struct {
+			__be32	ipv4_src;
+			__be32	ipv4_dst;
+		};
+		struct {
+			__u32	ipv6_src[4];	/* in6_addr; network order */
+			__u32	ipv6_dst[4];	/* in6_addr; network order */
+		};
+	};
+	__be16	sport;
+	__be16	dport;
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 8476da7f2720..9ca8e0e624d8 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1502,6 +1502,7 @@ static bool bpf_prog_type__needs_kver(enum bpf_prog_type type)
 	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
 	case BPF_PROG_TYPE_LIRC_MODE2:
 	case BPF_PROG_TYPE_SK_REUSEPORT:
+	case BPF_PROG_TYPE_FLOW_DISSECTOR:
 		return false;
 	case BPF_PROG_TYPE_UNSPEC:
 	case BPF_PROG_TYPE_KPROBE:
@@ -2121,6 +2122,7 @@ static const struct {
 	BPF_PROG_SEC("sk_skb",		BPF_PROG_TYPE_SK_SKB),
 	BPF_PROG_SEC("sk_msg",		BPF_PROG_TYPE_SK_MSG),
 	BPF_PROG_SEC("lirc_mode2",	BPF_PROG_TYPE_LIRC_MODE2),
+	BPF_PROG_SEC("flow_dissector",	BPF_PROG_TYPE_FLOW_DISSECTOR),
 	BPF_SA_PROG_SEC("cgroup/bind4",	BPF_CGROUP_INET4_BIND),
 	BPF_SA_PROG_SEC("cgroup/bind6",	BPF_CGROUP_INET6_BIND),
 	BPF_SA_PROG_SEC("cgroup/connect4", BPF_CGROUP_INET4_CONNECT),
-- 
2.19.0.rc2.392.g5ba43deb5a-goog

^ permalink raw reply related

* [bpf-next, v2 0/3] Introduce eBPF flow dissector
From: Petar Penkov @ 2018-09-08  0:11 UTC (permalink / raw)
  To: netdev
  Cc: davem, ast, daniel, simon.horman, ecree, songliubraving, tom,
	Petar Penkov

From: Petar Penkov <ppenkov@google.com>

This patch series hardens the RX stack by allowing flow dissection in BPF,
as previously discussed [1]. Because of the rigorous checks of the BPF
verifier, this provides significant security guarantees. In particular, the
BPF flow dissector cannot get inside of an infinite loop, as with
CVE-2013-4348, because BPF programs are guaranteed to terminate. It cannot
read outside of packet bounds, because all memory accesses are checked.
Also, with BPF the administrator can decide which protocols to support,
reducing potential attack surface. Rarely encountered protocols can be
excluded from dissection and the program can be updated without kernel
recompile or reboot if a bug is discovered.

Patch 1 adds infrastructure to execute a BPF program in __skb_flow_dissect.
This includes a new BPF program and attach type.

Patch 2 adds a flow dissector program in BPF. This parses most protocols in
__skb_flow_dissect in BPF for a subset of flow keys (basic, control, ports,
and address types).

Patch 3 adds a selftest that attaches the BPF program to the flow dissector
and sends traffic with different levels of encapsulation.

Performance Evaluation:
The in-kernel implementation was compared against the demo program from
patch 2 using the test in patch 3 with IPv4/UDP traffic over 10 seconds.
	$perf record -a -C 4 taskset -c 4 ./test_flow_dissector -i 4 -f 8 \
		-t 10

In-kernel Dissector:
	__skb_flow_dissect overhead: 2.12%
	Total Packets: 3,272,597 (from output of ./test_flow_dissector)

BPF Dissector:
	__skb_flow_dissect overhead: 1.63% 
	Total Packets: 3,232,356 (from output of ./test_flow_dissector)

No-op BPF Dissector:
	__skb_flow_dissect overhead: 1.52% 
	Total Packets: 3,330,635 (from output of ./test_flow_dissector)

Changes since v1:
1/ (Patch 1) LD_ABS instructions now disallowed for the new BPF prog type 
2/ (Patch 1) now checks if skb is NULL in __skb_flow_dissect()
3/ (Patch 1) fixed incorrect accesses in flow_dissector_is_valid_access()
	- writes to the flow_keys field now disallowed
	- reads/writes to tc_classid and data_meta now disallowed 
4/ (Patch 2) headers pulled with bpf_skb_load_data if direct access fails 

Changes since RFC:
1/ (Patch 1) Flow dissector hook changed from global to per-netns
2/ (Patch 1) Defined struct bpf_flow_keys to be used in BPF flow dissector
programs instead of exposing the internal flow keys layout. Added a
function to translate from bpf_flow_keys to the internal layout after BPF
dissection is complete. The pointer to this struct is stored in
qdisc_skb_cb rather than inside of the 20 byte control block which
simplifies verification and allows access to all 20 bytes of the cb.
3/ (Patch 2) Removed GUE parsing as it relied on a hardcoded port
4/ (Patch 2) MPLS parsing now stops at the first label which is consistent
with the in-kernel flow dissector
5/ (Patch 2) Refactored to use direct packet access and to write out to
struct bpf_flow_keys

[1] http://vger.kernel.org/netconf2017_files/rx_hardening_and_udp_gso.pdf

Petar Penkov (3):
  flow_dissector: implements flow dissector BPF hook
  flow_dissector: implements eBPF parser
  selftests/bpf: test bpf flow dissection

 include/linux/bpf.h                           |   1 +
 include/linux/bpf_types.h                     |   1 +
 include/linux/skbuff.h                        |   7 +
 include/net/net_namespace.h                   |   3 +
 include/net/sch_generic.h                     |  12 +-
 include/uapi/linux/bpf.h                      |  25 +
 kernel/bpf/syscall.c                          |   8 +
 kernel/bpf/verifier.c                         |  32 +
 net/core/filter.c                             |  67 ++
 net/core/flow_dissector.c                     | 136 +++
 tools/bpf/bpftool/prog.c                      |   1 +
 tools/include/uapi/linux/bpf.h                |  25 +
 tools/lib/bpf/libbpf.c                        |   2 +
 tools/testing/selftests/bpf/.gitignore        |   2 +
 tools/testing/selftests/bpf/Makefile          |   8 +-
 tools/testing/selftests/bpf/bpf_flow.c        | 386 +++++++++
 tools/testing/selftests/bpf/config            |   1 +
 .../selftests/bpf/flow_dissector_load.c       | 140 ++++
 .../selftests/bpf/test_flow_dissector.c       | 782 ++++++++++++++++++
 .../selftests/bpf/test_flow_dissector.sh      | 115 +++
 tools/testing/selftests/bpf/with_addr.sh      |  54 ++
 tools/testing/selftests/bpf/with_tunnels.sh   |  36 +
 22 files changed, 1838 insertions(+), 6 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/bpf_flow.c
 create mode 100644 tools/testing/selftests/bpf/flow_dissector_load.c
 create mode 100644 tools/testing/selftests/bpf/test_flow_dissector.c
 create mode 100755 tools/testing/selftests/bpf/test_flow_dissector.sh
 create mode 100755 tools/testing/selftests/bpf/with_addr.sh
 create mode 100755 tools/testing/selftests/bpf/with_tunnels.sh

-- 
2.19.0.rc2.392.g5ba43deb5a-goog

^ permalink raw reply

* Re: [**EXTERNAL**] Re: VRF with enslaved L3 enabled bridge
From: D'Souza, Nelson @ 2018-09-07 23:42 UTC (permalink / raw)
  To: David Ahern, netdev@vger.kernel.org; +Cc: Ido Schimmel
In-Reply-To: <a32ada04-6dd6-3d09-19bf-8db4e1479e03@cumulusnetworks.com>

Thanks David and Ido, for finding the root-cause for bridge Rx packets getting dropped, also for coming up with a patch.

Regards,
Nelson

On 9/7/18, 9:09 AM, "David Ahern" <dsa@cumulusnetworks.com> wrote:

    On 9/7/18 9:56 AM, D'Souza, Nelson wrote:
    
    > ------------------------------------------------------------------------
    > *From:* David Ahern <dsa@cumulusnetworks.com>
    > *Sent:* Thursday, September 6, 2018 5:27 PM
    > *To:* D'Souza, Nelson; netdev@vger.kernel.org
    > *Subject:* Re: [**EXTERNAL**] Re: VRF with enslaved L3 enabled bridge
    >  
    > On 9/5/18 12:00 PM, D'Souza, Nelson wrote:
    >> Just following up.... would you be able to confirm that this is a
    > Linux VRF issue?
    > 
    > I can confirm that I can reproduce the problem. Need to find time to dig
    > into it.
    
    bridge's netfilter hook is dropping the packet.
    
    bridge's netfilter code registers hook operations that are invoked when
    nh_hook is called. It then sees all subsequent calls to nf_hook.
    
    Packet wise, the bridge netfilter hook runs first. br_nf_pre_routing
    allocates nf_bridge, sets in_prerouting to 1 and calls NF_HOOK for
    NF_INET_PRE_ROUTING. It's finish function, br_nf_pre_routing_finish,
    then resets in_prerouting flag to 0. Any subsequent calls to nf_hook
    invoke ip_sabotage_in. That function sees in_prerouting is not
    set and steals (drops) the packet.
    
    The simplest change is to have ip_sabotage_in recognize that the bridge
    can be enslaved to a VRF (L3 master device) and allow the packet to
    continue.
    
    Thanks to Ido for the hint on ip_sabotage_in.
    
    This patch works for me:
    
    diff --git a/net/bridge/br_netfilter_hooks.c
    b/net/bridge/br_netfilter_hooks.c
    index 6e0dc6bcd32a..37278dc280eb 100644
    --- a/net/bridge/br_netfilter_hooks.c
    +++ b/net/bridge/br_netfilter_hooks.c
    @@ -835,7 +835,8 @@ static unsigned int ip_sabotage_in(void *priv,
                                       struct sk_buff *skb,
                                       const struct nf_hook_state *state)
     {
    -       if (skb->nf_bridge && !skb->nf_bridge->in_prerouting) {
    +       if (skb->nf_bridge && !skb->nf_bridge->in_prerouting &&
    +           !netif_is_l3_master(skb->dev)) {
                    state->okfn(state->net, state->sk, skb);
                    return NF_STOLEN;
            }
    


^ permalink raw reply

* Re: WARNING in bpf_jit_free
From: syzbot @ 2018-09-07 23:23 UTC (permalink / raw)
  To: ast, daniel, linux-kernel, netdev, syzkaller-bugs
In-Reply-To: <000000000000e92d1805711f5552@google.com>

syzbot has found a reproducer for the following crash on:

HEAD commit:    28619527b8a7 Merge git://git.kernel.org/pub/scm/linux/kern..
git tree:       bpf
console output: https://syzkaller.appspot.com/x/log.txt?x=1339498e400000
kernel config:  https://syzkaller.appspot.com/x/.config?x=62e9b447c16085cf
dashboard link: https://syzkaller.appspot.com/bug?extid=2ff1e7cb738fd3c41113
compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=163cc149400000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+2ff1e7cb738fd3c41113@syzkaller.appspotmail.com

IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
8021q: adding VLAN 0 to HW filter on device team0
WARNING: CPU: 0 PID: 5391 at kernel/bpf/core.c:628 bpf_jit_free+0x2e5/0x3f0
Kernel panic - not syncing: panic_on_warn set ...

CPU: 0 PID: 5391 Comm: kworker/0:0 Not tainted 4.19.0-rc2+ #50
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Workqueue: events bpf_prog_free_deferred
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113
  panic+0x238/0x4e7 kernel/panic.c:184
  __warn.cold.8+0x163/0x1ba kernel/panic.c:536
  report_bug+0x254/0x2d0 lib/bug.c:186
  fixup_bug arch/x86/kernel/traps.c:178 [inline]
  do_error_trap+0x1fc/0x4d0 arch/x86/kernel/traps.c:296
  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:316
  invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:993
RIP: 0010:bpf_jit_free+0x2e5/0x3f0
Code: 07 38 c8 7f 08 84 c0 0f 85 85 00 00 00 48 b8 00 02 00 00 00 00 ad de  
44 0f b6 63 02 48 39 c2 0f 84 d9 fd ff ff e8 8b 4b f3 ff <0f> 0b e9 cd fd  
ff ff e8 7f 4b f3 ff 4c 89 f0 48 ba 00 00 00 00 00
RSP: 0018:ffff8801b77bf648 EFLAGS: 00010293
RAX: ffff8801d8730500 RBX: ffffc9000192c000 RCX: 0000000000000002
RDX: 0000000000000000 RSI: ffffffff818b83a5 RDI: ffff8801cdcea6e8
RBP: ffff8801b77bf6e0 R08: ffff8801d8730dc8 R09: 0000000000000006
R10: 0000000000000000 R11: ffff8801d8730500 R12: 000000000000000f
R13: 1ffff10036ef7ecb R14: ffffc9000192c002 R15: ffffc9000192c020
BUG: unable to handle kernel paging request at fffffbfff4004000
PGD 21ffec067 P4D 21ffec067 PUD 21fe60067 PMD 1c1a04067 PTE 0
Oops: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 5391 Comm: kworker/0:0 Not tainted 4.19.0-rc2+ #50
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Workqueue: events bpf_prog_free_deferred
RIP: 0010:bpf_get_prog_addr_region kernel/bpf/core.c:384 [inline]
RIP: 0010:bpf_tree_comp kernel/bpf/core.c:435 [inline]
RIP: 0010:__lt_find include/linux/rbtree_latch.h:115 [inline]
RIP: 0010:latch_tree_find include/linux/rbtree_latch.h:208 [inline]
RIP: 0010:bpf_prog_kallsyms_find+0x2fe/0x4a0 kernel/bpf/core.c:509
Code: 8e f3 ff 4c 8b ad b0 fe ff ff 4c 89 e6 4c 89 ef e8 47 8f f3 ff 4d 39  
e5 0f 82 a7 00 00 00 e8 89 8e f3 ff 4c 89 e0 48 c1 e8 03 <42> 0f b6 04 30  
84 c0 74 08 3c 03 0f 8e 35 01 00 00 41 8b 04 24 4c
RSP: 0018:ffff8801b77bef80 EFLAGS: 00010806
RAX: 1ffffffff4004000 RBX: ffff8801cdcea6b0 RCX: ffffffff818b4099
RDX: 0000000000000000 RSI: ffffffff818b40a7 RDI: 0000000000000006
RBP: ffff8801b77bf0f8 R08: ffff8801d8730500 R09: ffffed003b5c4732
R10: ffffed003b5c4732 R11: ffff8801dae23993 R12: ffffffffa0020000
R13: ffffffffffffffff R14: dffffc0000000000 R15: ffff8801cdcea6b0
FS:  0000000000000000(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffbfff4004000 CR3: 000000000946a000 CR4: 00000000001406f0
Call Trace:
BUG: unable to handle kernel paging request at fffffbfff4004000
PGD 21ffec067 P4D 21ffec067 PUD 21fe60067 PMD 1c1a04067 PTE 0
Oops: 0000 [#2] PREEMPT SMP KASAN
CPU: 0 PID: 5391 Comm: kworker/0:0 Not tainted 4.19.0-rc2+ #50
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Workqueue: events bpf_prog_free_deferred
RIP: 0010:bpf_get_prog_addr_region kernel/bpf/core.c:384 [inline]
RIP: 0010:bpf_tree_comp kernel/bpf/core.c:435 [inline]
RIP: 0010:__lt_find include/linux/rbtree_latch.h:115 [inline]
RIP: 0010:latch_tree_find include/linux/rbtree_latch.h:208 [inline]
RIP: 0010:bpf_prog_kallsyms_find+0x2fe/0x4a0 kernel/bpf/core.c:509
Code: 8e f3 ff 4c 8b ad b0 fe ff ff 4c 89 e6 4c 89 ef e8 47 8f f3 ff 4d 39  
e5 0f 82 a7 00 00 00 e8 89 8e f3 ff 4c 89 e0 48 c1 e8 03 <42> 0f b6 04 30  
84 c0 74 08 3c 03 0f 8e 35 01 00 00 41 8b 04 24 4c
RSP: 0018:ffff8801b77be828 EFLAGS: 00010806
RAX: 1ffffffff4004000 RBX: ffff8801cdcea6b0 RCX: ffffffff818b4099
RDX: 0000000000000000 RSI: ffffffff818b40a7 RDI: 0000000000000006
RBP: ffff8801b77be9a0 R08: ffff8801d8730500 R09: 0000000000000001
R10: ffffed003b5c4732 R11: 0000000000000000 R12: ffffffffa0020000
R13: ffffffffffffffff R14: dffffc0000000000 R15: ffff8801cdcea6b0
FS:  0000000000000000(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffbfff4004000 CR3: 000000000946a000 CR4: 00000000001406f0
Call Trace:
BUG: unable to handle kernel paging request at fffffbfff4004000
PGD 21ffec067 P4D 21ffec067 PUD 21fe60067 PMD 1c1a04067 PTE 0
Oops: 0000 [#3] PREEMPT SMP KASAN
CPU: 0 PID: 5391 Comm: kworker/0:0 Not tainted 4.19.0-rc2+ #50
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Workqueue: events bpf_prog_free_deferred
RIP: 0010:bpf_get_prog_addr_region kernel/bpf/core.c:384 [inline]
RIP: 0010:bpf_tree_comp kernel/bpf/core.c:435 [inline]
RIP: 0010:__lt_find include/linux/rbtree_latch.h:115 [inline]
RIP: 0010:latch_tree_find include/linux/rbtree_latch.h:208 [inline]
RIP: 0010:bpf_prog_kallsyms_find+0x2fe/0x4a0 kernel/bpf/core.c:509
Code: 8e f3 ff 4c 8b ad b0 fe ff ff 4c 89 e6 4c 89 ef e8 47 8f f3 ff 4d 39  
e5 0f 82 a7 00 00 00 e8 89 8e f3 ff 4c 89 e0 48 c1 e8 03 <42> 0f b6 04 30  
84 c0 74 08 3c 03 0f 8e 35 01 00 00 41 8b 04 24 4c
RSP: 0018:ffff8801b77be0c8 EFLAGS: 00010806
RAX: 1ffffffff4004000 RBX: ffff8801cdcea6b0 RCX: ffffffff818b4099
RDX: 0000000000000000 RSI: ffffffff818b40a7 RDI: 0000000000000006
RBP: ffff8801b77be240 R08: ffff8801d8730500 R09: 0000000000000001
R10: ffffed003b5c4732 R11: 0000000000000000 R12: ffffffffa0020000
R13: ffffffffffffffff R14: dffffc0000000000 R15: ffff8801cdcea6b0
FS:  0000000000000000(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffbfff4004000 CR3: 000000000946a000 CR4: 00000000001406f0
Call Trace:
BUG: unable to handle kernel paging request at fffffbfff4004000
PGD 21ffec067 P4D 21ffec067 PUD 21fe60067 PMD 1c1a04067 PTE 0
Oops: 0000 [#4] PREEMPT SMP KASAN
CPU: 0 PID: 5391 Comm: kworker/0:0 Not tainted 4.19.0-rc2+ #50
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Workqueue: events bpf_prog_free_deferred
RIP: 0010:bpf_get_prog_addr_region kernel/bpf/core.c:384 [inline]
RIP: 0010:bpf_tree_comp kernel/bpf/core.c:435 [inline]
RIP: 0010:__lt_find include/linux/rbtree_latch.h:115 [inline]
RIP: 0010:latch_tree_find include/linux/rbtree_latch.h:208 [inline]
RIP: 0010:bpf_prog_kallsyms_find+0x2fe/0x4a0 kernel/bpf/core.c:509
Code: 8e f3 ff 4c 8b ad b0 fe ff ff 4c 89 e6 4c 89 ef e8 47 8f f3 ff 4d 39  
e5 0f 82 a7 00 00 00 e8 89 8e f3 ff 4c 89 e0 48 c1 e8 03 <42> 0f b6 04 30  
84 c0 74 08 3c 03 0f 8e 35 01 00 00 41 8b 04 24 4c
RSP: 0018:ffff8801b77bd968 EFLAGS: 00010806
RAX: 1ffffffff4004000 RBX: ffff8801cdcea6b0 RCX: ffffffff818b4099
RDX: 0000000000000000 RSI: ffffffff818b40a7 RDI: 0000000000000006
RBP: ffff8801b77bdae0 R08: ffff8801d8730500 R09: 0000000000000001
R10: ffffed003b5c4732 R11: 0000000000000000 R12: ffffffffa0020000
R13: ffffffffffffffff R14: dffffc0000000000 R15: ffff8801cdcea6b0
FS:  0000000000000000(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffbfff4004000 CR3: 000000000946a000 CR4: 00000000001406f0
Call Trace:
BUG: unable to handle kernel paging request at fffffbfff4004000
PGD 21ffec067 P4D 21ffec067 PUD 21fe60067 PMD 1c1a04067 PTE 0
Oops: 0000 [#5] PREEMPT SMP KASAN
CPU: 0 PID: 5391 Comm: kworker/0:0 Not tainted 4.19.0-rc2+ #50
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Workqueue: events bpf_prog_free_deferred
RIP: 0010:bpf_get_prog_addr_region kernel/bpf/core.c:384 [inline]
RIP: 0010:bpf_tree_comp kernel/bpf/core.c:435 [inline]
RIP: 0010:__lt_find include/linux/rbtree_latch.h:115 [inline]
RIP: 0010:latch_tree_find include/linux/rbtree_latch.h:208 [inline]
RIP: 0010:bpf_prog_kallsyms_find+0x2fe/0x4a0 kernel/bpf/core.c:509
Code: 8e f3 ff 4c 8b ad b0 fe ff ff 4c 89 e6 4c 89 ef e8 47 8f f3 ff 4d 39  
e5 0f 82 a7 00 00 00 e8 89 8e f3 ff 4c 89 e0 48 c1 e8 03 <42> 0f b6 04 30  
84 c0 74 08 3c 03 0f 8e 35 01 00 00 41 8b 04 24 4c
RSP: 0018:ffff8801b77bd208 EFLAGS: 00010806
RAX: 1ffffffff4004000 RBX: ffff8801cdcea6b0 RCX: ffffffff818b4099
RDX: 0000000000000000 RSI: ffffffff818b40a7 RDI: 0000000000000006
RBP: ffff8801b77bd380 R08: ffff8801d8730500 R09: 0000000000000001
R10: ffffed003b5c4732 R11: 0000000000000000 R12: ffffffffa0020000
R13: ffffffffffffffff R14: dffffc0000000000 R15: ffff8801cdcea6b0
FS:  0000000000000000(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffbfff4004000 CR3: 000000000946a000 CR4: 00000000001406f0
Call Trace:
BUG: unable to handle kernel paging request at fffffbfff4004000
PGD 21ffec067 P4D 21ffec067 PUD 21fe60067 PMD 1c1a04067 PTE 0
Oops: 0000 [#6] PREEMPT SMP KASAN
CPU: 0 PID: 5391 Comm: kworker/0:0 Not tainted 4.19.0-rc2+ #50
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Workqueue: events bpf_prog_free_deferred
RIP: 0010:bpf_get_prog_addr_region kernel/bpf/core.c:384 [inline]
RIP: 0010:bpf_tree_comp kernel/bpf/core.c:435 [inline]
RIP: 0010:__lt_find include/linux/rbtree_latch.h:115 [inline]
RIP: 0010:latch_tree_find include/linux/rbtree_latch.h:208 [inline]
RIP: 0010:bpf_prog_kallsyms_find+0x2fe/0x4a0 kernel/bpf/core.c:509
Code: 8e f3 ff 4c 8b ad b0 fe ff ff 4c 89 e6 4c 89 ef e8 47 8f f3 ff 4d 39  
e5 0f 82 a7 00 00 00 e8 89 8e f3 ff 4c 89 e0 48 c1 e8 03 <42> 0f b6 04 30  
84 c0 74 08 3c 03 0f 8e 35 01 00 00 41 8b 04 24 4c
RSP: 0018:ffff8801b77bcaa8 EFLAGS: 00010806
RAX: 1ffffffff4004000 RBX: ffff8801cdcea6b0 RCX: ffffffff818b4099
RDX: 0000000000000000 RSI: ffffffff818b40a7 RDI: 0000000000000006
RBP: ffff8801b77bcc20 R08: ffff8801d8730500 R09: 0000000000000001
R10: ffffed003b5c4732 R11: 0000000000000000 R12: ffffffffa0020000
R13: ffffffffffffffff R14: dffffc0000000000 R15: ffff8801cdcea6b0
FS:  0000000000000000(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffbfff4004000 CR3: 000000000946a000 CR4: 00000000001406f0
Call Trace:
BUG: unable to handle kernel paging request at fffffbfff4004000
PGD 21ffec067 P4D 21ffec067 PUD 21fe60067 PMD 1c1a04067 PTE 0
Oops: 0000 [#7] PREEMPT SMP KASAN
CPU: 0 PID: 5391 Comm: kworker/0:0 Not tainted 4.19.0-rc2+ #50
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Workqueue: events bpf_prog_free_deferred
RIP: 0010:bpf_get_prog_addr_region kernel/bpf/core.c:384 [inline]
RIP: 0010:bpf_tree_comp kernel/bpf/core.c:435 [inline]
RIP: 0010:__lt_find include/linux/rbtree_latch.h:115 [inline]
RIP: 0010:latch_tree_find include/linux/rbtree_latch.h:208 [inline]
RIP: 0010:bpf_prog_kallsyms_find+0x2fe/0x4a0 kernel/bpf/core.c:509
Code: 8e f3 ff 4c 8b ad b0 fe ff ff 4c 89 e6 4c 89 ef e8 47 8f f3 ff 4d 39  
e5 0f 82 a7 00 00 00 e8 89 8e f3 ff 4c 89 e0 48 c1 e8 03 <42> 0f b6 04 30  
84 c0 74 08 3c 03 0f 8e 35 01 00 00 41 8b 04 24 4c
RSP: 0018:ffff8801b77bc348 EFLAGS: 00010806
RAX: 1ffffffff4004000 RBX: ffff8801cdcea6b0 RCX: ffffffff818b4099
RDX: 0000000000000000 RSI: ffffffff818b40a7 RDI: 0000000000000006
RBP: ffff8801b77bc4c0 R08: ffff8801d8730500 R09: 0000000000000001
R10: ffffed003b5c4732 R11: 0000000000000000 R12: ffffffffa0020000
R13: ffffffffffffffff R14: dffffc0000000000 R15: ffff8801cdcea6b0
FS:  0000000000000000(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffbfff4004000 CR3: 000000000946a000 CR4: 00000000001406f0
Call Trace:
BUG: unable to handle kernel paging request at fffffbfff4004000
PGD 21ffec067 P4D 21ffec067 PUD 21fe60067 PMD 1c1a04067 PTE 0
Oops: 0000 [#8] PREEMPT SMP KASAN
CPU: 0 PID: 5391 Comm: kworker/0:0 Not tainted 4.19.0-rc2+ #50
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Workqueue: events bpf_prog_free_deferred
RIP: 0010:bpf_get_prog_addr_region kernel/bpf/core.c:384 [inline]
RIP: 0010:bpf_tree_comp kernel/bpf/core.c:435 [inline]
RIP: 0010:__lt_find include/linux/rbtree_latch.h:115 [inline]
RIP: 0010:latch_tree_find include/linux/rbtree_latch.h:208 [inline]
RIP: 0010:bpf_prog_kallsyms_find+0x2fe/0x4a0 kernel/bpf/core.c:509
Code: 8e f3 ff 4c 8b ad b0 fe ff ff 4c 89 e6 4c 89 ef e8 47 8f f3 ff 4d 39  
e5 0f 82 a7 00 00 00 e8 89 8e f3 ff 4c 89 e0 48 c1 e8 03 <42> 0f b6 04 30  
84 c0 74 08 3c 03 0f 8e 35 01 00 00 41 8b 04 24 4c
RSP: 0018:ffff8801b77bbbe8 EFLAGS: 00010806
RAX: 1ffffffff4004000 RBX: ffff8801cdcea6b0 RCX: ffffffff818b4099
RDX: 0000000000000000 RSI: ffffffff818b40a7 RDI: 0000000000000006
RBP: ffff8801b77bbd60 R08: ffff8801d8730500 R09: 0000000000000001
R10: ffffed003b5c4732 R11: 0000000000000000 R12: ffffffffa0020000
R13: ffffffffffffffff R14: dffffc0000000000 R15: ffff8801cdcea6b0
FS:  0000000000000000(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffbfff4004000 CR3: 000000000946a000 CR4: 00000000001406f0
Call Trace:
BUG: unable to handle kernel paging request at fffffbfff4004000
PGD 21ffec067 P4D 21ffec067 PUD 21fe60067 PMD 1c1a04067 PTE 0
Oops: 0000 [#9] PREEMPT SMP KASAN
CPU: 0 PID: 5391 Comm: kworker/0:0 Not tainted 4.19.0-rc2+ #50
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Workqueue: events bpf_prog_free_deferred
RIP: 0010:bpf_get_prog_addr_region kernel/bpf/core.c:384 [inline]
RIP: 0010:bpf_tree_comp kernel/bpf/core.c:435 [inline]
RIP: 0010:__lt_find include/linux/rbtree_latch.h:115 [inline]
RIP: 0010:latch_tree_find include/linux/rbtree_latch.h:208 [inline]
RIP: 0010:bpf_prog_kallsyms_find+0x2fe/0x4a0 kernel/bpf/core.c:509
Code: 8e f3 ff 4c 8b ad b0 fe ff ff 4c 89 e6 4c 89 ef e8 47 8f f3 ff 4d 39  
e5 0f 82 a7 00 00 00 e8 89 8e f3 ff 4c 89 e0 48 c1 e8 03 <42> 0f b6 04 30  
84 c0 74 08 3c 03 0f 8e 35 01 00 00 41 8b 04 24 4c
RSP: 0018:ffff8801b77bb488 EFLAGS: 00010806
RAX: 1ffffffff4004000 RBX: ffff8801cdcea6b0 RCX: ffffffff818b4099
RDX: 0000000000000000 RSI: ffffffff818b40a7 RDI: 0000000000000006
RBP: ffff8801b77bb600 R08: ffff8801d8730500 R09: 0000000000000001
R10: ffffed003b5c4732 R11: 0000000000000000 R12: ffffffffa0020000
R13: ffffffffffffffff R14: dffffc0000000000 R15: ffff8801cdcea6b0
FS:  0000000000000000(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffbfff4004000 CR3: 000000000946a000 CR4: 00000000001406f0
Call Trace:
BUG: unable to handle kernel paging request at fffffbfff4004000
PGD 21ffec067 P4D 21ffec067 PUD 21fe60067 PMD 1c1a04067 PTE 0
Oops: 0000 [#10] PREEMPT SMP KASAN
CPU: 0 PID: 5391 Comm: kworker/0:0 Not tainted 4.19.0-rc2+ #50
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Workqueue: events bpf_prog_free_deferred
RIP: 0010:bpf_get_prog_addr_region kernel/bpf/core.c:384 [inline]
RIP: 0010:bpf_tree_comp kernel/bpf/core.c:435 [inline]
RIP: 0010:__lt_find include/linux/rbtree_latch.h:115 [inline]
RIP: 0010:latch_tree_find include/linux/rbtree_latch.h:208 [inline]
RIP: 0010:bpf_prog_kallsyms_find+0x2fe/0x4a0 kernel/bpf/core.c:509
Code: 8e f3 ff 4c 8b ad b0 fe ff ff 4c 89 e6 4c 89 ef e8 47 8f f3 ff 4d 39  
e5 0f 82 a7 00 00 00 e8 89 8e f3 ff 4c 89 e0 48 c1 e8 03 <42> 0f b6 04 30  
84 c0 74 08 3c 03 0f 8e 35 01 00 00 41 8b 04 24 4c
RSP: 0018:ffff8801b77bad28 EFLAGS: 00010806
RAX: 1ffffffff4004000 RBX: ffff8801cdcea6b0 RCX: ffffffff818b4099
RDX: 0000000000000000 RSI: ffffffff818b40a7 RDI: 0000000000000006
RBP: ffff8801b77baea0 R08: ffff8801d8730500 R09: 0000000000000001
R10: ffffed003b5c4732 R11: 0000000000000000 R12: ffffffffa0020000
R13: ffffffffffffffff R14: dffffc0000000000 R15: ffff8801cdcea6b0
FS:  0000000000000000(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffbfff4004000 CR3: 000000000946a000 CR4: 00000000001406f0
Call Trace:
BUG: unable to handle kernel paging request at fffffbfff4004000
PGD 21ffec067 P4D 21ffec067 PUD 21fe60067 PMD 1c1a04067 PTE 0
Oops: 0000 [#11] PREEMPT SMP KASAN
CPU: 0 PID: 5391 Comm: kworker/0:0 Not tainted 4.19.0-rc2+ #50
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Workqueue: events bpf_prog_free_deferred
RIP: 0010:bpf_get_prog_addr_region kernel/bpf/core.c:384 [inline]
RIP: 0010:bpf_tree_comp kernel/bpf/core.c:435 [inline]
RIP: 0010:__lt_find include/linux/rbtree_latch.h:115 [inline]
RIP: 0010:latch_tree_find include/linux/rbtree_latch.h:208 [inline]
RIP: 0010:bpf_prog_kallsyms_find+0x2fe/0x4a0 kernel/bpf/core.c:509
Code: 8e f3 ff 4c 8b ad b0 fe ff ff 4c 89 e6 4c 89 ef e8 47 8f f3 ff 4d 39  
e5 0f 82 a7 00 00 00 e8 89 8e f3 ff 4c 89 e0 48 c1 e8 03 <42> 0f b6 04 30  
84 c0 74 08 3c 03 0f 8e 35 01 00 00 41 8b 04 24 4c
RSP: 0018:ffff8801b77ba5c8 EFLAGS: 00010806
RAX: 1ffffffff4004000 RBX: ffff8801cdcea6b0 RCX: ffffffff818b4099
RDX: 0000000000000000 RSI: ffffffff818b40a7 RDI: 0000000000000006
RBP: ffff8801b77ba740 R08: ffff8801d8730500 R09: 0000000000000001
R10: ffffed003b5c4732 R11: 0000000000000000 R12: ffffffffa0020000
R13: ffffffffffffffff R14: dffffc0000000000 R15: ffff8801cdcea6b0
FS:  0000000000000000(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffbfff4004000 CR3: 000000000946a000 CR4: 00000000001406f0
Call Trace:
BUG: unable to handle kernel paging request at fffffbfff4004000
PGD 21ffec067 P4D 21ffec067 PUD 21fe60067 PMD 1c1a04067 PTE 0
Oops: 0000 [#12] PREEMPT SMP KASAN
CPU: 0 PID: 5391 Comm: kworker/0:0 Not tainted 4.19.0-rc2+ #50
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Workqueue: events bpf_prog_free_deferred
RIP: 0010:bpf_get_prog_addr_region kernel/bpf/core.c:384 [inline]
RIP: 0010:bpf_tree_comp kernel/bpf/core.c:435 [inline]
RIP: 0010:__lt_find include/linux/rbtree_latch.h:115 [inline]
RIP: 0010:latch_tree_find include/linux/rbtree_latch.h:208 [inline]
RIP: 0010:bpf_prog_kallsyms_find+0x2fe/0x4a0 kernel/bpf/core.c:509
Code: 8e f3 ff 4c 8b ad b0 fe ff ff 4c 89 e6 4c 89 ef e8 47 8f f3 ff 4d 39  
e5 0f 82 a7 00 00 00 e8 89 8e f3 ff 4c 89 e0 48 c1 e8 03 <42> 0f b6 04 30  
84 c0 74 08 3c 03 0f 8e 35 01 00 00 41 8b 04 24 4c
RSP: 0018:ffff8801b77b9e68 EFLAGS: 00010806
RAX: 1ffffffff4004000 RBX: ffff8801cdcea6b0 RCX: ffffffff818b4099
RDX: 0000000000000000 RSI: ffffffff818b40a7 RDI: 0000000000000006
RBP: ffff8801b77b9fe0 R08: ffff8801d8730500 R09: 0000000000000001
R10: ffffed003b5c4732 R11: 0000000000000000 R12: ffffffffa0020000
R13: ffffffffffffffff R14: dffffc0000000000 R15: ffff8801cdcea6b0
FS:  0000000000000000(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffbfff4004000 CR3: 000000000946a000 CR4: 00000000001406f0
Call Trace:
BUG: unable to handle kernel paging request at fffffbfff4004000
PGD 21ffec067 P4D 21ffec067 PUD 21fe60067 PMD 1c1a04067 PTE 0
Oops: 0000 [#13] PREEMPT SMP KASAN
CPU: 0 PID: 5391 Comm: kworker/0:0 Not tainted 4.19.0-rc2+ #50
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Workqueue: events bpf_prog_free_deferred
RIP: 0010:bpf_get_prog_addr_region kernel/bpf/core.c:384 [inline]
RIP: 0010:bpf_tree_comp kernel/bpf/core.c:435 [inline]
RIP: 0010:__lt_find include/linux/rbtree_latch.h:115 [inline]
RIP: 0010:latch_tree_find include/linux/rbtree_latch.h:208 [inline]
RIP: 0010:bpf_prog_kallsyms_find+0x2fe/0x4a0 kernel/bpf/core.c:509
Code: 8e f3 ff 4c 8b ad b0 fe ff ff 4c 89 e6 4c 89 ef e8 47 8f f3 ff 4d 39  
e5 0f 82 a7 00 00 00 e8 89 8e f3 ff 4c 89 e0 48 c1 e8 03 <42> 0f b6 04 30  
84 c0 74 08 3c 03 0f 8e 35 01 00 00 41 8b 04 24 4c
RSP: 0018:ffff8801b77b9708 EFLAGS: 00010806
RAX: 1ffffffff4004000 RBX: ffff8801cdcea6b0 RCX: ffffffff818b4099
RDX: 0000000000000000 RSI: ffffffff818b40a7 RDI: 0000000000000006
RBP: ffff8801b77b9880 R08: ffff8801d8730500 R09: 0000000000000001
R10: ffffed003b5c4732 R11: 0000000000000000 R12: ffffffffa0020000
R13: ffffffffffffffff R14: dffffc0000000000 R15: ffff8801cdcea6b0
FS:  0000000000000000(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffbfff4004000 CR3: 000000000946a000 CR4: 00000000001406f0
Call Trace:
BUG: unable to handle kernel paging request at fffffbfff4004000
PGD 21ffec067 P4D 21ffec067 PUD 21fe60067 PMD 1c1a04067 PTE 0
Thread overran stack, or stack corrupted
Oops: 0000 [#14] PREEMPT SMP KASAN
CPU: 0 PID: 5391 Comm: kworker/0:0 Not tainted 4.19.0-rc2+ #50
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Workqueue: events bpf_prog_free_deferred
RIP: 0010:bpf_get_prog_addr_region kernel/bpf/core.c:384 [inline]
RIP: 0010:bpf_tree_comp kernel/bpf/core.c:435 [inline]
RIP: 0010:__lt_find include/linux/rbtree_latch.h:115 [inline]
RIP: 0010:latch_tree_find include/linux/rbtree_latch.h:208 [inline]
RIP: 0010:bpf_prog_kallsyms_find+0x2fe/0x4a0 kernel/bpf/core.c:509
Code: 8e f3 ff 4c 8b ad b0 fe ff ff 4c 89 e6 4c 89 ef e8 47 8f f3 ff 4d 39  
e5 0f 82 a7 00 00 00 e8 89 8e f3 ff 4c 89 e0 48 c1 e8 03 <42> 0f b6 04 30  
84 c0 74 08 3c 03 0f 8e 35 01 00 00 41 8b 04 24 4c
RSP: 0018:ffff8801b77b8fa8 EFLAGS: 00010806
RAX: 1ffffffff4004000 RBX: ffff8801cdcea6b0 RCX: ffffffff818b4099
RDX: 0000000000000000 RSI: ffffffff818b40a7 RDI: 0000000000000006
RBP: ffff8801b77b9120 R08: ffff8801d8730500 R09: 0000000000000001
R10: ffffed003b5c4732 R11: 0000000000000000 R12: ffffffffa0020000
R13: ffffffffffffffff R14: dffffc0000000000 R15: ffff8801cdcea6b0
FS:  0000000000000000(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffbfff4004000 CR3: 000000000946a000 CR4: 00000000001406f0
Call Trace:
BUG: unable to handle kernel paging request at fffffbfff4004000
PGD 21ffec067 P4D 21ffec067 PUD 21fe60067 PMD 1c1a04067 PTE 0
Thread overran stack, or stack corrupted
Oops: 0000 [#15] PREEMPT SMP KASAN
CPU: 0 PID: 5391 Comm: kworker/0:0 Not tainted 4.19.0-rc2+ #50
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Workqueue: events bpf_prog_free_deferred
RIP: 0010:bpf_get_prog_addr_region kernel/bpf/core.c:384 [inline]
RIP: 0010:bpf_tree_comp kernel/bpf/core.c:435 [inline]
RIP: 0010:__lt_find include/linux/rbtree_latch.h:115 [inline]
RIP: 0010:latch_tree_find include/linux/rbtree_latch.h:208 [inline]
RIP: 0010:bpf_prog_kallsyms_find+0x2fe/0x4a0 kernel/bpf/core.c:509
Code: 8e f3 ff 4c 8b ad b0 fe ff ff 4c 89 e6 4c 89 ef e8 47 8f f3 ff 4d 39  
e5 0f 82 a7 00 00 00 e8 89 8e f3 ff 4c 89 e0 48 c1 e8 03 <42> 0f b6 04 30  
84 c0 74 08 3c 03 0f 8e 35 01 00 00 41 8b 04 24 4c
RSP: 0018:ffff8801b77b8848 EFLAGS: 00010806
RAX: 1ffffffff4004000 RBX: ffff8801cdcea6b0 RCX: ffffffff818b4099
RDX: 0000000000000000 RSI: ffffffff818b40a7 RDI: 0000000000000006
RBP: ffff8801b77b89c0 R08: ffff8801d8730500 R09: 0000000000000001
R10: ffffed003b5c4732 R11: 0000000000000000 R12: ffffffffa0020000
R13: ffffffffffffffff R14: dffffc0000000000 R15: ffff8801cdcea6b0
FS:  0000000000000000(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffbfff4004000 CR3: 000000000946a000 CR4: 00000000001406f0
Call Trace:
==================================================================
BUG: KASAN: slab-out-of-bounds in do_error_trap+0x3b6/0x4d0  
arch/x86/kernel/traps.c:296
Read of size 8 at addr ffff8801b77b7430 by task kworker/0:0/5391

CPU: 0 PID: 5391 Comm: kworker/0:0 Not tainted 4.19.0-rc2+ #50
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Workqueue: events bpf_prog_free_deferred
Call Trace:
BUG: unable to handle kernel paging request at fffffbfff4004000
PGD 21ffec067 P4D 21ffec067 PUD 21fe60067 PMD 1c1a04067 PTE 0
Thread overran stack, or stack corrupted
Oops: 0000 [#16] PREEMPT SMP KASAN
CPU: 0 PID: 5391 Comm: kworker/0:0 Not tainted 4.19.0-rc2+ #50
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Workqueue: events bpf_prog_free_deferred
RIP: 0010:bpf_get_prog_addr_region kernel/bpf/core.c:384 [inline]
RIP: 0010:bpf_tree_comp kernel/bpf/core.c:435 [inline]
RIP: 0010:__lt_find include/linux/rbtree_latch.h:115 [inline]
RIP: 0010:latch_tree_find include/linux/rbtree_latch.h:208 [inline]
RIP: 0010:bpf_prog_kallsyms_find+0x2fe/0x4a0 kernel/bpf/core.c:509
Code: 8e f3 ff 4c 8b ad b0 fe ff ff 4c 89 e6 4c 89 ef e8 47 8f f3 ff 4d 39  
e5 0f 82 a7 00 00 00 e8 89 8e f3 ff 4c 89 e0 48 c1 e8 03 <42> 0f b6 04 30  
84 c0 74 08 3c 03 0f 8e 35 01 00 00 41 8b 04 24 4c
RSP: 0018:ffff8801b77b6e58 EFLAGS: 00010806
RAX: 1ffffffff4004000 RBX: ffff8801cdcea6b0 RCX: ffffffff818b4099
RDX: 0000000000000000 RSI: ffffffff818b40a7 RDI: 0000000000000006
RBP: ffff8801b77b6fd0 R08: ffff8801d8730500 R09: 0000000000000001
R10: ffffed003b5c4732 R11: 0000000000000000 R12: ffffffffa0020000
R13: ffffffffffffffff R14: dffffc0000000000 R15: ffff8801cdcea6b0
FS:  0000000000000000(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffbfff4004000 CR3: 000000000946a000 CR4: 00000000001406f0
Call Trace:
Modules linked in:
Dumping ftrace buffer:
    (ftrace buffer empty)
CR2: fffffbfff4004000
---[ end trace 89eec6ca57f730dc ]---
RIP: 0010:bpf_get_prog_addr_region kernel/bpf/core.c:384 [inline]
RIP: 0010:bpf_tree_comp kernel/bpf/core.c:435 [inline]
RIP: 0010:__lt_find include/linux/rbtree_latch.h:115 [inline]
RIP: 0010:latch_tree_find include/linux/rbtree_latch.h:208 [inline]
RIP: 0010:bpf_prog_kallsyms_find+0x2fe/0x4a0 kernel/bpf/core.c:509
Code: 8e f3 ff 4c 8b ad b0 fe ff ff 4c 89 e6 4c 89 ef e8 47 8f f3 ff 4d 39  
e5 0f 82 a7 00 00 00 e8 89 8e f3 ff 4c 89 e0 48 c1 e8 03 <42> 0f b6 04 30  
84 c0 74 08 3c 03 0f 8e 35 01 00 00 41 8b 04 24 4c
RSP: 0018:ffff8801b77bef80 EFLAGS: 00010806
RAX: 1ffffffff4004000 RBX: ffff8801cdcea6b0 RCX: ffffffff818b4099
RDX: 0000000000000000 RSI: ffffffff818b40a7 RDI: 0000000000000006
RBP: ffff8801b77bf0f8 R08: ffff8801d8730500 R09: ffffed003b5c4732
R10: ffffed003b5c4732 R11: ffff8801dae23993 R12: ffffffffa0020000
R13: ffffffffffffffff R14: dffffc0000000000 R15: ffff8801cdcea6b0
FS:  0000000000000000(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffbfff4004000 CR3: 000000000946a000 CR4: 00000000001406f0

^ permalink raw reply

* Re: GPL compliance issue with liquidio/lio_23xx_vsw.bin firmware
From: Felix Manlunas @ 2018-09-07 22:58 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Florian Weimer, linux-firmware, netdev@vger.kernel.org,
	Derek Chickles, Satanand Burla, Felix Manlunas, Raghu Vatsavayi,
	Manish Awasthi, Manojkumar.Panicker
In-Reply-To: <8dfbae2a66cfa1317e67e4ba21e017a1fa8054db.camel@decadent.org.uk>

On Wed, Aug 29, 2018 at 09:26:35PM +0100, Ben Hutchings wrote:
> Date: Wed, 29 Aug 2018 21:26:35 +0100
> From: Ben Hutchings <ben@decadent.org.uk>
> To: Felix Manlunas <felix.manlunas@cavium.com>, Florian Weimer
>  <fweimer@redhat.com>
> CC: linux-firmware@kernel.org, "netdev@vger.kernel.org"
>  <netdev@vger.kernel.org>, Derek Chickles
>  <derek.chickles@caviumnetworks.com>, Satanand Burla
>  <satananda.burla@caviumnetworks.com>, Felix Manlunas
>  <felix.manlunas@caviumnetworks.com>, Raghu Vatsavayi
>  <raghu.vatsavayi@caviumnetworks.com>, Manish Awasthi
>  <manish.awasthi@cavium.com>, Manojkumar.Panicker@cavium.com
> Subject: Re: GPL compliance issue with liquidio/lio_23xx_vsw.bin firmware
> User-Agent: Evolution 3.29.92-1 
> 
> On Mon, 2018-08-27 at 17:04 -0700, Felix Manlunas wrote:
> > On Mon, Aug 27, 2018 at 05:01:10PM +0200, Florian Weimer wrote:
> > > liquidio/lio_23xx_vsw.bin contains a compiled MIPS Linux kernel:
> > > 
> > > $ tail --bytes=+1313 liquidio/lio_23xx_vsw.bin > elf
> > > $ readelf -aW elf
> > > […]
> > >   [ 6] __ksymtab         PROGBITS        ffffffff80e495f8 64a5f8 00d130
> > > 00   A  0   0  8
> > >   [ 7] __ksymtab_gpl     PROGBITS        ffffffff80e56728 657728 008400
> > > 00   A  0   0  8
> > >   [ 8] __ksymtab_strings PROGBITS        ffffffff80e5eb28 65fb28 018868
> > > 00   A  0   0  1
> > > […]
> > > Symbol table '.symtab' contains 1349 entries:
> > >    Num:    Value          Size Type    Bind   Vis      Ndx Name
> > >      0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
> > >      1: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS
> > > arch/mips/kernel/head.o
> > >      2: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS init/main.c
> > >      3: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS
> > > include/linux/types.h
> > > […]
> > > 
> > > Yet there is no corresponding source provided, and LICENCE.cavium lacks
> > > the required notices.
> > > 
> > > Thanks,
> > > Florian
> > 
> > Cavium apologizes for the oversight.  Cavium has been advertising the
> > appropriate license terms including the existence of GPL in the firmware
> > in our outbox releases. We will update the license terms in LICENCE.cavium
> > in our upstream contribution in collaboration with our legal team.
> 
> Everything added to linux-firmware.git needs to be safe for Linux
> distributions to redistribute.  (There are some ancient firmware images
> with unclear licensing, but we don't want to add any more.)
> 
> My understanding is that GPL v2 requires that commercial distributors
> either provide the complete and corresponding source alongside the
> binaries, or include a written offer to provide it later.  They
> *cannot* simply point to your offer to do this (only non-commercial
> distributors can do that).  So the source needs to be published too.
> 
> Adding the complete Linux kernel source code to linux-firmware.git
> doesn't seem like a sensible step, so maybe this particular firmware
> needs to live elsewhere.
> 
> Ben.

Apologies for the delay. We're committed to provide sources for the GPL
components of LiquidIO firmware.

After Cavium's recent merger with Marvell, we are revisiting our Licensing
and expect to share the updated license soon. Since keeping Linux kernel
source code in linux-firmware.git doesn't make sense to us too, we're also
working internally on identifying the appropriate mechanism to share those
sources when requested.  That said, it is our opinion that for all
practical purposes (for example, distribution of driver + firmware) the
firmware binary for LiquidIO adapters should continue to reside in the
linux-firmware.git as long as all the prerequisites for sharing the
firmware are met.  In the case of liquidio_23xx_vsw.bin, this would mean an
additional license type in the LICENCE.cavium file for this firmware
including a reference to the accompanying sources for the Linux kernel
bundled in the firmware.

As we understand, there is no precedent in the linux-firmware.git for
device firmware that include a Linux kernel so we also request comments
from community on the best ways to support the LiquidIO firmware and
similar firmwares that may appear in future.

Cavium LiquidIO Team

^ permalink raw reply

* [PATCH net] netfilter: bridge: Don't sabotage nf_hook calls from an l3mdev
From: dsahern @ 2018-09-07 22:08 UTC (permalink / raw)
  To: netdev; +Cc: ndsouza, idosch, pablo, fw, David Ahern

From: David Ahern <dsahern@gmail.com>

For starters, the bridge netfilter code registers operations that
are invoked any time nh_hook is called. Specifically, ip_sabotage_in
watches for nested calls for NF_INET_PRE_ROUTING when a bridge is in
the stack.

Packet wise, the bridge netfilter hook runs first. br_nf_pre_routing
allocates nf_bridge, sets in_prerouting to 1 and calls NF_HOOK for
NF_INET_PRE_ROUTING. It's finish function, br_nf_pre_routing_finish,
then resets in_prerouting flag to 0 and the packet continues up the
stack. The packet eventually makes it to the VRF driver and it invokes
nf_hook for NF_INET_PRE_ROUTING in case any rules have been added against
the vrf device.

Because of the registered operations the call to nf_hook causes
ip_sabotage_in to be invoked. That function sees the nf_bridge on the
skb and that in_prerouting is not set. Thinking it is an invalid nested
call it steals (drops) the packet.

Update ip_sabotage_in to recognize that the bridge or one of its upper
devices (e.g., vlan) can be enslaved to a VRF (L3 master device) and
allow the packet to go through the nf_hook a second time.

Fixes: 73e20b761acf ("net: vrf: Add support for PREROUTING rules on vrf device")
Reported-by: D'Souza, Nelson <ndsouza@ciena.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
---
 net/bridge/br_netfilter_hooks.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index 6e0dc6bcd32a..37278dc280eb 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -835,7 +835,8 @@ static unsigned int ip_sabotage_in(void *priv,
 				   struct sk_buff *skb,
 				   const struct nf_hook_state *state)
 {
-	if (skb->nf_bridge && !skb->nf_bridge->in_prerouting) {
+	if (skb->nf_bridge && !skb->nf_bridge->in_prerouting &&
+	    !netif_is_l3_master(skb->dev)) {
 		state->okfn(state->net, state->sk, skb);
 		return NF_STOLEN;
 	}
-- 
2.11.0

^ permalink raw reply related

* Re: [net-next] i40e(vf): remove i40e_ethtool_stats.h header file
From: David Miller @ 2018-09-07 22:02 UTC (permalink / raw)
  To: jacob.e.keller
  Cc: jeffrey.t.kirsher, netdev, dongsheng.wang, intel-wired-lan,
	aaron.f.brown
In-Reply-To: <20180907215544.5957-1-jacob.e.keller@intel.com>

From: Jacob Keller <jacob.e.keller@intel.com>
Date: Fri,  7 Sep 2018 14:55:44 -0700

> Essentially reverts commit 8fd75c58a09a ("i40e: move ethtool
> stats boiler plate code to i40e_ethtool_stats.h", 2018-08-30), and
> additionally moves the similar code in i40evf into i40evf_ethtool.c.
> 
> The code was intially moved from i40e_ethtool.c into i40e_ethtool_stats.h
> as a way of better logically organizing the code. This has two problems.
> First, we can't have an inline function with variadic arguments on all
> platforms. Second, it gave the appearance that we had plans to share
> code between the i40e and i40evf drivers, due to having a near copy of
> the contents in the i40evf/i40e_ethtool_stats.h file.
> 
> Patches which actually attempt to combine or share code between the i40e
> and i40evf drivers have not materialized, and are likely a ways off.
> 
> Rather than fixing the one function which causes build issues, just move
> this code back into the i40e_ethtool.c and i40evf_ethtool.c files. Note
> that we also change these functions back from static inlines to just
> statics, since they're no longer in a header file.
> 
> We can revisit this if/when work is done to actually attempt to share
> code between drivers. Alternatively, this stats code could be made more
> generic so that it can be shared across drivers as part of ethtool
> kernel work.
> 
> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>

Applied and after the build checks out will be pushed to net-next.

Thanks.

^ permalink raw reply

* [net-next] i40e(vf): remove i40e_ethtool_stats.h header file
From: Jacob Keller @ 2018-09-07 21:55 UTC (permalink / raw)
  To: David Miller, Jeffrey Kirsher, netdev
  Cc: Wang Dongsheng, Intel Wired LAN, aaron.f.brown, Jacob Keller

Essentially reverts commit 8fd75c58a09a ("i40e: move ethtool
stats boiler plate code to i40e_ethtool_stats.h", 2018-08-30), and
additionally moves the similar code in i40evf into i40evf_ethtool.c.

The code was intially moved from i40e_ethtool.c into i40e_ethtool_stats.h
as a way of better logically organizing the code. This has two problems.
First, we can't have an inline function with variadic arguments on all
platforms. Second, it gave the appearance that we had plans to share
code between the i40e and i40evf drivers, due to having a near copy of
the contents in the i40evf/i40e_ethtool_stats.h file.

Patches which actually attempt to combine or share code between the i40e
and i40evf drivers have not materialized, and are likely a ways off.

Rather than fixing the one function which causes build issues, just move
this code back into the i40e_ethtool.c and i40evf_ethtool.c files. Note
that we also change these functions back from static inlines to just
statics, since they're no longer in a header file.

We can revisit this if/when work is done to actually attempt to share
code between drivers. Alternatively, this stats code could be made more
generic so that it can be shared across drivers as part of ethtool
kernel work.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
---
Sorry aboutthe delay, our iwl queue maintainer Jeff is on vacation, and
discussion about how best to resolve this issue was/is ongoing in the
IWL list.

I opted to just move the contents of these files back into
i40e_ethtool.c and i40evf_ethtool.c respectively, as I think it's better
than only moving the single offending function back into the .c file.
This is different from the approach suggested by Wang, Dongsheng
<dongsheng.wang@hxt-semitech.com> on Intel Wired LAN, but I think it's a
better approach for now.

I've also kindly asked if Aaron Brown could test this out.

 .../net/ethernet/intel/i40e/i40e_ethtool.c    | 219 ++++++++++++++++-
 .../ethernet/intel/i40e/i40e_ethtool_stats.h  | 221 ------------------
 .../intel/i40evf/i40e_ethtool_stats.h         | 221 ------------------
 .../ethernet/intel/i40evf/i40evf_ethtool.c    | 219 ++++++++++++++++-
 4 files changed, 436 insertions(+), 444 deletions(-)
 delete mode 100644 drivers/net/ethernet/intel/i40e/i40e_ethtool_stats.h
 delete mode 100644 drivers/net/ethernet/intel/i40evf/i40e_ethtool_stats.h

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index d7d3974beca2..87fe2e60602f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -6,7 +6,224 @@
 #include "i40e.h"
 #include "i40e_diag.h"
 
-#include "i40e_ethtool_stats.h"
+/* ethtool statistics helpers */
+
+/**
+ * struct i40e_stats - definition for an ethtool statistic
+ * @stat_string: statistic name to display in ethtool -S output
+ * @sizeof_stat: the sizeof() the stat, must be no greater than sizeof(u64)
+ * @stat_offset: offsetof() the stat from a base pointer
+ *
+ * This structure defines a statistic to be added to the ethtool stats buffer.
+ * It defines a statistic as offset from a common base pointer. Stats should
+ * be defined in constant arrays using the I40E_STAT macro, with every element
+ * of the array using the same _type for calculating the sizeof_stat and
+ * stat_offset.
+ *
+ * The @sizeof_stat is expected to be sizeof(u8), sizeof(u16), sizeof(u32) or
+ * sizeof(u64). Other sizes are not expected and will produce a WARN_ONCE from
+ * the i40e_add_ethtool_stat() helper function.
+ *
+ * The @stat_string is interpreted as a format string, allowing formatted
+ * values to be inserted while looping over multiple structures for a given
+ * statistics array. Thus, every statistic string in an array should have the
+ * same type and number of format specifiers, to be formatted by variadic
+ * arguments to the i40e_add_stat_string() helper function.
+ **/
+struct i40e_stats {
+	char stat_string[ETH_GSTRING_LEN];
+	int sizeof_stat;
+	int stat_offset;
+};
+
+/* Helper macro to define an i40e_stat structure with proper size and type.
+ * Use this when defining constant statistics arrays. Note that @_type expects
+ * only a type name and is used multiple times.
+ */
+#define I40E_STAT(_type, _name, _stat) { \
+	.stat_string = _name, \
+	.sizeof_stat = FIELD_SIZEOF(_type, _stat), \
+	.stat_offset = offsetof(_type, _stat) \
+}
+
+/* Helper macro for defining some statistics directly copied from the netdev
+ * stats structure.
+ */
+#define I40E_NETDEV_STAT(_net_stat) \
+	I40E_STAT(struct rtnl_link_stats64, #_net_stat, _net_stat)
+
+/* Helper macro for defining some statistics related to queues */
+#define I40E_QUEUE_STAT(_name, _stat) \
+	I40E_STAT(struct i40e_ring, _name, _stat)
+
+/* Stats associated with a Tx or Rx ring */
+static const struct i40e_stats i40e_gstrings_queue_stats[] = {
+	I40E_QUEUE_STAT("%s-%u.packets", stats.packets),
+	I40E_QUEUE_STAT("%s-%u.bytes", stats.bytes),
+};
+
+/**
+ * i40e_add_one_ethtool_stat - copy the stat into the supplied buffer
+ * @data: location to store the stat value
+ * @pointer: basis for where to copy from
+ * @stat: the stat definition
+ *
+ * Copies the stat data defined by the pointer and stat structure pair into
+ * the memory supplied as data. Used to implement i40e_add_ethtool_stats and
+ * i40e_add_queue_stats. If the pointer is null, data will be zero'd.
+ */
+static void
+i40e_add_one_ethtool_stat(u64 *data, void *pointer,
+			  const struct i40e_stats *stat)
+{
+	char *p;
+
+	if (!pointer) {
+		/* ensure that the ethtool data buffer is zero'd for any stats
+		 * which don't have a valid pointer.
+		 */
+		*data = 0;
+		return;
+	}
+
+	p = (char *)pointer + stat->stat_offset;
+	switch (stat->sizeof_stat) {
+	case sizeof(u64):
+		*data = *((u64 *)p);
+		break;
+	case sizeof(u32):
+		*data = *((u32 *)p);
+		break;
+	case sizeof(u16):
+		*data = *((u16 *)p);
+		break;
+	case sizeof(u8):
+		*data = *((u8 *)p);
+		break;
+	default:
+		WARN_ONCE(1, "unexpected stat size for %s",
+			  stat->stat_string);
+		*data = 0;
+	}
+}
+
+/**
+ * __i40e_add_ethtool_stats - copy stats into the ethtool supplied buffer
+ * @data: ethtool stats buffer
+ * @pointer: location to copy stats from
+ * @stats: array of stats to copy
+ * @size: the size of the stats definition
+ *
+ * Copy the stats defined by the stats array using the pointer as a base into
+ * the data buffer supplied by ethtool. Updates the data pointer to point to
+ * the next empty location for successive calls to __i40e_add_ethtool_stats.
+ * If pointer is null, set the data values to zero and update the pointer to
+ * skip these stats.
+ **/
+static void
+__i40e_add_ethtool_stats(u64 **data, void *pointer,
+			 const struct i40e_stats stats[],
+			 const unsigned int size)
+{
+	unsigned int i;
+
+	for (i = 0; i < size; i++)
+		i40e_add_one_ethtool_stat((*data)++, pointer, &stats[i]);
+}
+
+/**
+ * i40e_add_ethtool_stats - copy stats into ethtool supplied buffer
+ * @data: ethtool stats buffer
+ * @pointer: location where stats are stored
+ * @stats: static const array of stat definitions
+ *
+ * Macro to ease the use of __i40e_add_ethtool_stats by taking a static
+ * constant stats array and passing the ARRAY_SIZE(). This avoids typos by
+ * ensuring that we pass the size associated with the given stats array.
+ *
+ * The parameter @stats is evaluated twice, so parameters with side effects
+ * should be avoided.
+ **/
+#define i40e_add_ethtool_stats(data, pointer, stats) \
+	__i40e_add_ethtool_stats(data, pointer, stats, ARRAY_SIZE(stats))
+
+/**
+ * i40e_add_queue_stats - copy queue statistics into supplied buffer
+ * @data: ethtool stats buffer
+ * @ring: the ring to copy
+ *
+ * Queue statistics must be copied while protected by
+ * u64_stats_fetch_begin_irq, so we can't directly use i40e_add_ethtool_stats.
+ * Assumes that queue stats are defined in i40e_gstrings_queue_stats. If the
+ * ring pointer is null, zero out the queue stat values and update the data
+ * pointer. Otherwise safely copy the stats from the ring into the supplied
+ * buffer and update the data pointer when finished.
+ *
+ * This function expects to be called while under rcu_read_lock().
+ **/
+static void
+i40e_add_queue_stats(u64 **data, struct i40e_ring *ring)
+{
+	const unsigned int size = ARRAY_SIZE(i40e_gstrings_queue_stats);
+	const struct i40e_stats *stats = i40e_gstrings_queue_stats;
+	unsigned int start;
+	unsigned int i;
+
+	/* To avoid invalid statistics values, ensure that we keep retrying
+	 * the copy until we get a consistent value according to
+	 * u64_stats_fetch_retry_irq. But first, make sure our ring is
+	 * non-null before attempting to access its syncp.
+	 */
+	do {
+		start = !ring ? 0 : u64_stats_fetch_begin_irq(&ring->syncp);
+		for (i = 0; i < size; i++) {
+			i40e_add_one_ethtool_stat(&(*data)[i], ring,
+						  &stats[i]);
+		}
+	} while (ring && u64_stats_fetch_retry_irq(&ring->syncp, start));
+
+	/* Once we successfully copy the stats in, update the data pointer */
+	*data += size;
+}
+
+/**
+ * __i40e_add_stat_strings - copy stat strings into ethtool buffer
+ * @p: ethtool supplied buffer
+ * @stats: stat definitions array
+ * @size: size of the stats array
+ *
+ * Format and copy the strings described by stats into the buffer pointed at
+ * by p.
+ **/
+static void __i40e_add_stat_strings(u8 **p, const struct i40e_stats stats[],
+				    const unsigned int size, ...)
+{
+	unsigned int i;
+
+	for (i = 0; i < size; i++) {
+		va_list args;
+
+		va_start(args, size);
+		vsnprintf(*p, ETH_GSTRING_LEN, stats[i].stat_string, args);
+		*p += ETH_GSTRING_LEN;
+		va_end(args);
+	}
+}
+
+/**
+ * 40e_add_stat_strings - copy stat strings into ethtool buffer
+ * @p: ethtool supplied buffer
+ * @stats: stat definitions array
+ *
+ * Format and copy the strings described by the const static stats value into
+ * the buffer pointed at by p.
+ *
+ * The parameter @stats is evaluated twice, so parameters with side effects
+ * should be avoided. Additionally, stats must be an array such that
+ * ARRAY_SIZE can be called on it.
+ **/
+#define i40e_add_stat_strings(p, stats, ...) \
+	__i40e_add_stat_strings(p, stats, ARRAY_SIZE(stats), ## __VA_ARGS__)
 
 #define I40E_PF_STAT(_name, _stat) \
 	I40E_STAT(struct i40e_pf, _name, _stat)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool_stats.h b/drivers/net/ethernet/intel/i40e/i40e_ethtool_stats.h
deleted file mode 100644
index bba1cb0b658f..000000000000
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool_stats.h
+++ /dev/null
@@ -1,221 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-/* Copyright(c) 2013 - 2018 Intel Corporation. */
-
-/* ethtool statistics helpers */
-
-/**
- * struct i40e_stats - definition for an ethtool statistic
- * @stat_string: statistic name to display in ethtool -S output
- * @sizeof_stat: the sizeof() the stat, must be no greater than sizeof(u64)
- * @stat_offset: offsetof() the stat from a base pointer
- *
- * This structure defines a statistic to be added to the ethtool stats buffer.
- * It defines a statistic as offset from a common base pointer. Stats should
- * be defined in constant arrays using the I40E_STAT macro, with every element
- * of the array using the same _type for calculating the sizeof_stat and
- * stat_offset.
- *
- * The @sizeof_stat is expected to be sizeof(u8), sizeof(u16), sizeof(u32) or
- * sizeof(u64). Other sizes are not expected and will produce a WARN_ONCE from
- * the i40e_add_ethtool_stat() helper function.
- *
- * The @stat_string is interpreted as a format string, allowing formatted
- * values to be inserted while looping over multiple structures for a given
- * statistics array. Thus, every statistic string in an array should have the
- * same type and number of format specifiers, to be formatted by variadic
- * arguments to the i40e_add_stat_string() helper function.
- **/
-struct i40e_stats {
-	char stat_string[ETH_GSTRING_LEN];
-	int sizeof_stat;
-	int stat_offset;
-};
-
-/* Helper macro to define an i40e_stat structure with proper size and type.
- * Use this when defining constant statistics arrays. Note that @_type expects
- * only a type name and is used multiple times.
- */
-#define I40E_STAT(_type, _name, _stat) { \
-	.stat_string = _name, \
-	.sizeof_stat = FIELD_SIZEOF(_type, _stat), \
-	.stat_offset = offsetof(_type, _stat) \
-}
-
-/* Helper macro for defining some statistics directly copied from the netdev
- * stats structure.
- */
-#define I40E_NETDEV_STAT(_net_stat) \
-	I40E_STAT(struct rtnl_link_stats64, #_net_stat, _net_stat)
-
-/* Helper macro for defining some statistics related to queues */
-#define I40E_QUEUE_STAT(_name, _stat) \
-	I40E_STAT(struct i40e_ring, _name, _stat)
-
-/* Stats associated with a Tx or Rx ring */
-static const struct i40e_stats i40e_gstrings_queue_stats[] = {
-	I40E_QUEUE_STAT("%s-%u.packets", stats.packets),
-	I40E_QUEUE_STAT("%s-%u.bytes", stats.bytes),
-};
-
-/**
- * i40e_add_one_ethtool_stat - copy the stat into the supplied buffer
- * @data: location to store the stat value
- * @pointer: basis for where to copy from
- * @stat: the stat definition
- *
- * Copies the stat data defined by the pointer and stat structure pair into
- * the memory supplied as data. Used to implement i40e_add_ethtool_stats and
- * i40e_add_queue_stats. If the pointer is null, data will be zero'd.
- */
-static inline void
-i40e_add_one_ethtool_stat(u64 *data, void *pointer,
-			  const struct i40e_stats *stat)
-{
-	char *p;
-
-	if (!pointer) {
-		/* ensure that the ethtool data buffer is zero'd for any stats
-		 * which don't have a valid pointer.
-		 */
-		*data = 0;
-		return;
-	}
-
-	p = (char *)pointer + stat->stat_offset;
-	switch (stat->sizeof_stat) {
-	case sizeof(u64):
-		*data = *((u64 *)p);
-		break;
-	case sizeof(u32):
-		*data = *((u32 *)p);
-		break;
-	case sizeof(u16):
-		*data = *((u16 *)p);
-		break;
-	case sizeof(u8):
-		*data = *((u8 *)p);
-		break;
-	default:
-		WARN_ONCE(1, "unexpected stat size for %s",
-			  stat->stat_string);
-		*data = 0;
-	}
-}
-
-/**
- * __i40e_add_ethtool_stats - copy stats into the ethtool supplied buffer
- * @data: ethtool stats buffer
- * @pointer: location to copy stats from
- * @stats: array of stats to copy
- * @size: the size of the stats definition
- *
- * Copy the stats defined by the stats array using the pointer as a base into
- * the data buffer supplied by ethtool. Updates the data pointer to point to
- * the next empty location for successive calls to __i40e_add_ethtool_stats.
- * If pointer is null, set the data values to zero and update the pointer to
- * skip these stats.
- **/
-static inline void
-__i40e_add_ethtool_stats(u64 **data, void *pointer,
-			 const struct i40e_stats stats[],
-			 const unsigned int size)
-{
-	unsigned int i;
-
-	for (i = 0; i < size; i++)
-		i40e_add_one_ethtool_stat((*data)++, pointer, &stats[i]);
-}
-
-/**
- * i40e_add_ethtool_stats - copy stats into ethtool supplied buffer
- * @data: ethtool stats buffer
- * @pointer: location where stats are stored
- * @stats: static const array of stat definitions
- *
- * Macro to ease the use of __i40e_add_ethtool_stats by taking a static
- * constant stats array and passing the ARRAY_SIZE(). This avoids typos by
- * ensuring that we pass the size associated with the given stats array.
- *
- * The parameter @stats is evaluated twice, so parameters with side effects
- * should be avoided.
- **/
-#define i40e_add_ethtool_stats(data, pointer, stats) \
-	__i40e_add_ethtool_stats(data, pointer, stats, ARRAY_SIZE(stats))
-
-/**
- * i40e_add_queue_stats - copy queue statistics into supplied buffer
- * @data: ethtool stats buffer
- * @ring: the ring to copy
- *
- * Queue statistics must be copied while protected by
- * u64_stats_fetch_begin_irq, so we can't directly use i40e_add_ethtool_stats.
- * Assumes that queue stats are defined in i40e_gstrings_queue_stats. If the
- * ring pointer is null, zero out the queue stat values and update the data
- * pointer. Otherwise safely copy the stats from the ring into the supplied
- * buffer and update the data pointer when finished.
- *
- * This function expects to be called while under rcu_read_lock().
- **/
-static inline void
-i40e_add_queue_stats(u64 **data, struct i40e_ring *ring)
-{
-	const unsigned int size = ARRAY_SIZE(i40e_gstrings_queue_stats);
-	const struct i40e_stats *stats = i40e_gstrings_queue_stats;
-	unsigned int start;
-	unsigned int i;
-
-	/* To avoid invalid statistics values, ensure that we keep retrying
-	 * the copy until we get a consistent value according to
-	 * u64_stats_fetch_retry_irq. But first, make sure our ring is
-	 * non-null before attempting to access its syncp.
-	 */
-	do {
-		start = !ring ? 0 : u64_stats_fetch_begin_irq(&ring->syncp);
-		for (i = 0; i < size; i++) {
-			i40e_add_one_ethtool_stat(&(*data)[i], ring,
-						  &stats[i]);
-		}
-	} while (ring && u64_stats_fetch_retry_irq(&ring->syncp, start));
-
-	/* Once we successfully copy the stats in, update the data pointer */
-	*data += size;
-}
-
-/**
- * __i40e_add_stat_strings - copy stat strings into ethtool buffer
- * @p: ethtool supplied buffer
- * @stats: stat definitions array
- * @size: size of the stats array
- *
- * Format and copy the strings described by stats into the buffer pointed at
- * by p.
- **/
-static inline void __i40e_add_stat_strings(u8 **p, const struct i40e_stats stats[],
-				    const unsigned int size, ...)
-{
-	unsigned int i;
-
-	for (i = 0; i < size; i++) {
-		va_list args;
-
-		va_start(args, size);
-		vsnprintf(*p, ETH_GSTRING_LEN, stats[i].stat_string, args);
-		*p += ETH_GSTRING_LEN;
-		va_end(args);
-	}
-}
-
-/**
- * 40e_add_stat_strings - copy stat strings into ethtool buffer
- * @p: ethtool supplied buffer
- * @stats: stat definitions array
- *
- * Format and copy the strings described by the const static stats value into
- * the buffer pointed at by p.
- *
- * The parameter @stats is evaluated twice, so parameters with side effects
- * should be avoided. Additionally, stats must be an array such that
- * ARRAY_SIZE can be called on it.
- **/
-#define i40e_add_stat_strings(p, stats, ...) \
-	__i40e_add_stat_strings(p, stats, ARRAY_SIZE(stats), ## __VA_ARGS__)
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_ethtool_stats.h b/drivers/net/ethernet/intel/i40evf/i40e_ethtool_stats.h
deleted file mode 100644
index 60b595dd8c39..000000000000
--- a/drivers/net/ethernet/intel/i40evf/i40e_ethtool_stats.h
+++ /dev/null
@@ -1,221 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-/* Copyright(c) 2013 - 2018 Intel Corporation. */
-
-/* ethtool statistics helpers */
-
-/**
- * struct i40e_stats - definition for an ethtool statistic
- * @stat_string: statistic name to display in ethtool -S output
- * @sizeof_stat: the sizeof() the stat, must be no greater than sizeof(u64)
- * @stat_offset: offsetof() the stat from a base pointer
- *
- * This structure defines a statistic to be added to the ethtool stats buffer.
- * It defines a statistic as offset from a common base pointer. Stats should
- * be defined in constant arrays using the I40E_STAT macro, with every element
- * of the array using the same _type for calculating the sizeof_stat and
- * stat_offset.
- *
- * The @sizeof_stat is expected to be sizeof(u8), sizeof(u16), sizeof(u32) or
- * sizeof(u64). Other sizes are not expected and will produce a WARN_ONCE from
- * the i40e_add_ethtool_stat() helper function.
- *
- * The @stat_string is interpreted as a format string, allowing formatted
- * values to be inserted while looping over multiple structures for a given
- * statistics array. Thus, every statistic string in an array should have the
- * same type and number of format specifiers, to be formatted by variadic
- * arguments to the i40e_add_stat_string() helper function.
- **/
-struct i40e_stats {
-	char stat_string[ETH_GSTRING_LEN];
-	int sizeof_stat;
-	int stat_offset;
-};
-
-/* Helper macro to define an i40e_stat structure with proper size and type.
- * Use this when defining constant statistics arrays. Note that @_type expects
- * only a type name and is used multiple times.
- */
-#define I40E_STAT(_type, _name, _stat) { \
-	.stat_string = _name, \
-	.sizeof_stat = FIELD_SIZEOF(_type, _stat), \
-	.stat_offset = offsetof(_type, _stat) \
-}
-
-/* Helper macro for defining some statistics directly copied from the netdev
- * stats structure.
- */
-#define I40E_NETDEV_STAT(_net_stat) \
-	I40E_STAT(struct rtnl_link_stats64, #_net_stat, _net_stat)
-
-/* Helper macro for defining some statistics related to queues */
-#define I40E_QUEUE_STAT(_name, _stat) \
-	I40E_STAT(struct i40e_ring, _name, _stat)
-
-/* Stats associated with a Tx or Rx ring */
-static const struct i40e_stats i40e_gstrings_queue_stats[] = {
-	I40E_QUEUE_STAT("%s-%u.packets", stats.packets),
-	I40E_QUEUE_STAT("%s-%u.bytes", stats.bytes),
-};
-
-/**
- * i40evf_add_one_ethtool_stat - copy the stat into the supplied buffer
- * @data: location to store the stat value
- * @pointer: basis for where to copy from
- * @stat: the stat definition
- *
- * Copies the stat data defined by the pointer and stat structure pair into
- * the memory supplied as data. Used to implement i40e_add_ethtool_stats and
- * i40evf_add_queue_stats. If the pointer is null, data will be zero'd.
- */
-static inline void
-i40evf_add_one_ethtool_stat(u64 *data, void *pointer,
-			    const struct i40e_stats *stat)
-{
-	char *p;
-
-	if (!pointer) {
-		/* ensure that the ethtool data buffer is zero'd for any stats
-		 * which don't have a valid pointer.
-		 */
-		*data = 0;
-		return;
-	}
-
-	p = (char *)pointer + stat->stat_offset;
-	switch (stat->sizeof_stat) {
-	case sizeof(u64):
-		*data = *((u64 *)p);
-		break;
-	case sizeof(u32):
-		*data = *((u32 *)p);
-		break;
-	case sizeof(u16):
-		*data = *((u16 *)p);
-		break;
-	case sizeof(u8):
-		*data = *((u8 *)p);
-		break;
-	default:
-		WARN_ONCE(1, "unexpected stat size for %s",
-			  stat->stat_string);
-		*data = 0;
-	}
-}
-
-/**
- * __i40evf_add_ethtool_stats - copy stats into the ethtool supplied buffer
- * @data: ethtool stats buffer
- * @pointer: location to copy stats from
- * @stats: array of stats to copy
- * @size: the size of the stats definition
- *
- * Copy the stats defined by the stats array using the pointer as a base into
- * the data buffer supplied by ethtool. Updates the data pointer to point to
- * the next empty location for successive calls to __i40evf_add_ethtool_stats.
- * If pointer is null, set the data values to zero and update the pointer to
- * skip these stats.
- **/
-static inline void
-__i40evf_add_ethtool_stats(u64 **data, void *pointer,
-			   const struct i40e_stats stats[],
-			   const unsigned int size)
-{
-	unsigned int i;
-
-	for (i = 0; i < size; i++)
-		i40evf_add_one_ethtool_stat((*data)++, pointer, &stats[i]);
-}
-
-/**
- * i40e_add_ethtool_stats - copy stats into ethtool supplied buffer
- * @data: ethtool stats buffer
- * @pointer: location where stats are stored
- * @stats: static const array of stat definitions
- *
- * Macro to ease the use of __i40evf_add_ethtool_stats by taking a static
- * constant stats array and passing the ARRAY_SIZE(). This avoids typos by
- * ensuring that we pass the size associated with the given stats array.
- *
- * The parameter @stats is evaluated twice, so parameters with side effects
- * should be avoided.
- **/
-#define i40e_add_ethtool_stats(data, pointer, stats) \
-	__i40evf_add_ethtool_stats(data, pointer, stats, ARRAY_SIZE(stats))
-
-/**
- * i40evf_add_queue_stats - copy queue statistics into supplied buffer
- * @data: ethtool stats buffer
- * @ring: the ring to copy
- *
- * Queue statistics must be copied while protected by
- * u64_stats_fetch_begin_irq, so we can't directly use i40e_add_ethtool_stats.
- * Assumes that queue stats are defined in i40e_gstrings_queue_stats. If the
- * ring pointer is null, zero out the queue stat values and update the data
- * pointer. Otherwise safely copy the stats from the ring into the supplied
- * buffer and update the data pointer when finished.
- *
- * This function expects to be called while under rcu_read_lock().
- **/
-static inline void
-i40evf_add_queue_stats(u64 **data, struct i40e_ring *ring)
-{
-	const unsigned int size = ARRAY_SIZE(i40e_gstrings_queue_stats);
-	const struct i40e_stats *stats = i40e_gstrings_queue_stats;
-	unsigned int start;
-	unsigned int i;
-
-	/* To avoid invalid statistics values, ensure that we keep retrying
-	 * the copy until we get a consistent value according to
-	 * u64_stats_fetch_retry_irq. But first, make sure our ring is
-	 * non-null before attempting to access its syncp.
-	 */
-	do {
-		start = !ring ? 0 : u64_stats_fetch_begin_irq(&ring->syncp);
-		for (i = 0; i < size; i++) {
-			i40evf_add_one_ethtool_stat(&(*data)[i], ring,
-						    &stats[i]);
-		}
-	} while (ring && u64_stats_fetch_retry_irq(&ring->syncp, start));
-
-	/* Once we successfully copy the stats in, update the data pointer */
-	*data += size;
-}
-
-/**
- * __i40e_add_stat_strings - copy stat strings into ethtool buffer
- * @p: ethtool supplied buffer
- * @stats: stat definitions array
- * @size: size of the stats array
- *
- * Format and copy the strings described by stats into the buffer pointed at
- * by p.
- **/
-static void __i40e_add_stat_strings(u8 **p, const struct i40e_stats stats[],
-				    const unsigned int size, ...)
-{
-	unsigned int i;
-
-	for (i = 0; i < size; i++) {
-		va_list args;
-
-		va_start(args, size);
-		vsnprintf(*p, ETH_GSTRING_LEN, stats[i].stat_string, args);
-		*p += ETH_GSTRING_LEN;
-		va_end(args);
-	}
-}
-
-/**
- * 40e_add_stat_strings - copy stat strings into ethtool buffer
- * @p: ethtool supplied buffer
- * @stats: stat definitions array
- *
- * Format and copy the strings described by the const static stats value into
- * the buffer pointed at by p.
- *
- * The parameter @stats is evaluated twice, so parameters with side effects
- * should be avoided. Additionally, stats must be an array such that
- * ARRAY_SIZE can be called on it.
- **/
-#define i40e_add_stat_strings(p, stats, ...) \
-	__i40e_add_stat_strings(p, stats, ARRAY_SIZE(stats), ## __VA_ARGS__)
diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_ethtool.c b/drivers/net/ethernet/intel/i40evf/i40evf_ethtool.c
index 9319971c5c92..50f65ab737c9 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_ethtool.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_ethtool.c
@@ -6,7 +6,224 @@
 
 #include <linux/uaccess.h>
 
-#include "i40e_ethtool_stats.h"
+/* ethtool statistics helpers */
+
+/**
+ * struct i40e_stats - definition for an ethtool statistic
+ * @stat_string: statistic name to display in ethtool -S output
+ * @sizeof_stat: the sizeof() the stat, must be no greater than sizeof(u64)
+ * @stat_offset: offsetof() the stat from a base pointer
+ *
+ * This structure defines a statistic to be added to the ethtool stats buffer.
+ * It defines a statistic as offset from a common base pointer. Stats should
+ * be defined in constant arrays using the I40E_STAT macro, with every element
+ * of the array using the same _type for calculating the sizeof_stat and
+ * stat_offset.
+ *
+ * The @sizeof_stat is expected to be sizeof(u8), sizeof(u16), sizeof(u32) or
+ * sizeof(u64). Other sizes are not expected and will produce a WARN_ONCE from
+ * the i40e_add_ethtool_stat() helper function.
+ *
+ * The @stat_string is interpreted as a format string, allowing formatted
+ * values to be inserted while looping over multiple structures for a given
+ * statistics array. Thus, every statistic string in an array should have the
+ * same type and number of format specifiers, to be formatted by variadic
+ * arguments to the i40e_add_stat_string() helper function.
+ **/
+struct i40e_stats {
+	char stat_string[ETH_GSTRING_LEN];
+	int sizeof_stat;
+	int stat_offset;
+};
+
+/* Helper macro to define an i40e_stat structure with proper size and type.
+ * Use this when defining constant statistics arrays. Note that @_type expects
+ * only a type name and is used multiple times.
+ */
+#define I40E_STAT(_type, _name, _stat) { \
+	.stat_string = _name, \
+	.sizeof_stat = FIELD_SIZEOF(_type, _stat), \
+	.stat_offset = offsetof(_type, _stat) \
+}
+
+/* Helper macro for defining some statistics directly copied from the netdev
+ * stats structure.
+ */
+#define I40E_NETDEV_STAT(_net_stat) \
+	I40E_STAT(struct rtnl_link_stats64, #_net_stat, _net_stat)
+
+/* Helper macro for defining some statistics related to queues */
+#define I40E_QUEUE_STAT(_name, _stat) \
+	I40E_STAT(struct i40e_ring, _name, _stat)
+
+/* Stats associated with a Tx or Rx ring */
+static const struct i40e_stats i40e_gstrings_queue_stats[] = {
+	I40E_QUEUE_STAT("%s-%u.packets", stats.packets),
+	I40E_QUEUE_STAT("%s-%u.bytes", stats.bytes),
+};
+
+/**
+ * i40evf_add_one_ethtool_stat - copy the stat into the supplied buffer
+ * @data: location to store the stat value
+ * @pointer: basis for where to copy from
+ * @stat: the stat definition
+ *
+ * Copies the stat data defined by the pointer and stat structure pair into
+ * the memory supplied as data. Used to implement i40e_add_ethtool_stats and
+ * i40evf_add_queue_stats. If the pointer is null, data will be zero'd.
+ */
+static void
+i40evf_add_one_ethtool_stat(u64 *data, void *pointer,
+			    const struct i40e_stats *stat)
+{
+	char *p;
+
+	if (!pointer) {
+		/* ensure that the ethtool data buffer is zero'd for any stats
+		 * which don't have a valid pointer.
+		 */
+		*data = 0;
+		return;
+	}
+
+	p = (char *)pointer + stat->stat_offset;
+	switch (stat->sizeof_stat) {
+	case sizeof(u64):
+		*data = *((u64 *)p);
+		break;
+	case sizeof(u32):
+		*data = *((u32 *)p);
+		break;
+	case sizeof(u16):
+		*data = *((u16 *)p);
+		break;
+	case sizeof(u8):
+		*data = *((u8 *)p);
+		break;
+	default:
+		WARN_ONCE(1, "unexpected stat size for %s",
+			  stat->stat_string);
+		*data = 0;
+	}
+}
+
+/**
+ * __i40evf_add_ethtool_stats - copy stats into the ethtool supplied buffer
+ * @data: ethtool stats buffer
+ * @pointer: location to copy stats from
+ * @stats: array of stats to copy
+ * @size: the size of the stats definition
+ *
+ * Copy the stats defined by the stats array using the pointer as a base into
+ * the data buffer supplied by ethtool. Updates the data pointer to point to
+ * the next empty location for successive calls to __i40evf_add_ethtool_stats.
+ * If pointer is null, set the data values to zero and update the pointer to
+ * skip these stats.
+ **/
+static void
+__i40evf_add_ethtool_stats(u64 **data, void *pointer,
+			   const struct i40e_stats stats[],
+			   const unsigned int size)
+{
+	unsigned int i;
+
+	for (i = 0; i < size; i++)
+		i40evf_add_one_ethtool_stat((*data)++, pointer, &stats[i]);
+}
+
+/**
+ * i40e_add_ethtool_stats - copy stats into ethtool supplied buffer
+ * @data: ethtool stats buffer
+ * @pointer: location where stats are stored
+ * @stats: static const array of stat definitions
+ *
+ * Macro to ease the use of __i40evf_add_ethtool_stats by taking a static
+ * constant stats array and passing the ARRAY_SIZE(). This avoids typos by
+ * ensuring that we pass the size associated with the given stats array.
+ *
+ * The parameter @stats is evaluated twice, so parameters with side effects
+ * should be avoided.
+ **/
+#define i40e_add_ethtool_stats(data, pointer, stats) \
+	__i40evf_add_ethtool_stats(data, pointer, stats, ARRAY_SIZE(stats))
+
+/**
+ * i40evf_add_queue_stats - copy queue statistics into supplied buffer
+ * @data: ethtool stats buffer
+ * @ring: the ring to copy
+ *
+ * Queue statistics must be copied while protected by
+ * u64_stats_fetch_begin_irq, so we can't directly use i40e_add_ethtool_stats.
+ * Assumes that queue stats are defined in i40e_gstrings_queue_stats. If the
+ * ring pointer is null, zero out the queue stat values and update the data
+ * pointer. Otherwise safely copy the stats from the ring into the supplied
+ * buffer and update the data pointer when finished.
+ *
+ * This function expects to be called while under rcu_read_lock().
+ **/
+static void
+i40evf_add_queue_stats(u64 **data, struct i40e_ring *ring)
+{
+	const unsigned int size = ARRAY_SIZE(i40e_gstrings_queue_stats);
+	const struct i40e_stats *stats = i40e_gstrings_queue_stats;
+	unsigned int start;
+	unsigned int i;
+
+	/* To avoid invalid statistics values, ensure that we keep retrying
+	 * the copy until we get a consistent value according to
+	 * u64_stats_fetch_retry_irq. But first, make sure our ring is
+	 * non-null before attempting to access its syncp.
+	 */
+	do {
+		start = !ring ? 0 : u64_stats_fetch_begin_irq(&ring->syncp);
+		for (i = 0; i < size; i++) {
+			i40evf_add_one_ethtool_stat(&(*data)[i], ring,
+						    &stats[i]);
+		}
+	} while (ring && u64_stats_fetch_retry_irq(&ring->syncp, start));
+
+	/* Once we successfully copy the stats in, update the data pointer */
+	*data += size;
+}
+
+/**
+ * __i40e_add_stat_strings - copy stat strings into ethtool buffer
+ * @p: ethtool supplied buffer
+ * @stats: stat definitions array
+ * @size: size of the stats array
+ *
+ * Format and copy the strings described by stats into the buffer pointed at
+ * by p.
+ **/
+static void __i40e_add_stat_strings(u8 **p, const struct i40e_stats stats[],
+				    const unsigned int size, ...)
+{
+	unsigned int i;
+
+	for (i = 0; i < size; i++) {
+		va_list args;
+
+		va_start(args, size);
+		vsnprintf(*p, ETH_GSTRING_LEN, stats[i].stat_string, args);
+		*p += ETH_GSTRING_LEN;
+		va_end(args);
+	}
+}
+
+/**
+ * 40e_add_stat_strings - copy stat strings into ethtool buffer
+ * @p: ethtool supplied buffer
+ * @stats: stat definitions array
+ *
+ * Format and copy the strings described by the const static stats value into
+ * the buffer pointed at by p.
+ *
+ * The parameter @stats is evaluated twice, so parameters with side effects
+ * should be avoided. Additionally, stats must be an array such that
+ * ARRAY_SIZE can be called on it.
+ **/
+#define i40e_add_stat_strings(p, stats, ...) \
+	__i40e_add_stat_strings(p, stats, ARRAY_SIZE(stats), ## __VA_ARGS__)
 
 #define I40EVF_STAT(_name, _stat) \
 	I40E_STAT(struct i40evf_adapter, _name, _stat)
-- 
2.18.0.219.gaf81d287a9da

^ permalink raw reply related

* Re: kernels > v4.12 oops/crash with ipsec-traffic: bisected to b838d5e1c5b6e57b10ec8af2268824041e3ea911: ipv4: mark DST_NOGC and remove the operation of dst_free()
From: Wolfgang Walter @ 2018-09-07 21:10 UTC (permalink / raw)
  To: Steffen Klassert; +Cc: netdev, Wei Wang, Tobias Hommel
In-Reply-To: <1591446.aUMgYVGcZN@stwm.de>

Hello Steffen,

in one of your emails to Thomas you wrote:
> xfrm_lookup+0x2a is at the very beginning of xfrm_lookup(), here we
> find:
> 
> u16 family = dst_orig->ops->family;
> 
> ops has an offset of 32 bytes (20 hex) in dst_orig, so looks like
> dst_orig is NULL.
> 
> In the forwarding case, we get dst_orig from the skb and dst_orig
> can't be NULL here unless the skb itself is already fishy.

Is this really true?

If xfrm_lookup is called from 

__xfrm_route_forward():

int __xfrm_route_forward(struct sk_buff *skb, unsigned short family)
{
        struct net *net = dev_net(skb->dev);
        struct flowi fl;
        struct dst_entry *dst;
        int res = 1;

        if (xfrm_decode_session(skb, &fl, family) < 0) {
                XFRM_INC_STATS(net, LINUX_MIB_XFRMFWDHDRERROR);
                return 0;
        }

        skb_dst_force(skb);

        dst = xfrm_lookup(net, skb_dst(skb), &fl, NULL, XFRM_LOOKUP_QUEUE);
        if (IS_ERR(dst)) {
                res = 0;
                dst = NULL;
        }
        skb_dst_set(skb, dst);
        return res;
}

couldn't it be possible that skb_dst_force(skb) actually sets dst to NULL if 
it cannot safely lock it? If it is absolutely sure that skb_dst_force() never 
can set dst to NULL I wonder why it is called at all?


Here is  skb_dst_force()

static inline void skb_dst_force(struct sk_buff *skb)
{
        if (skb_dst_is_noref(skb)) {
                struct dst_entry *dst = skb_dst(skb);

                WARN_ON(!rcu_read_lock_held());
                if (!dst_hold_safe(dst))
                        dst = NULL;

                skb->_skb_refdst = (unsigned long)dst;
        }
}

and dst_hold_safe() is

static inline bool dst_hold_safe(struct dst_entry *dst)
{
        return atomic_inc_not_zero(&dst->__refcnt);
}



Am Freitag, 7. September 2018, 22:22:39 schrieb Wolfgang Walter:
> Am Freitag, 31. August 2018, 08:50:24 schrieb Steffen Klassert:
> > On Thu, Aug 30, 2018 at 08:53:50PM +0200, Wolfgang Walter wrote:
> > > Hello,
> > > 
> > > kernels > 4.12 do not work on one of our main routers. They crash as
> > > soon
> > > as ipsec-tunnels are configured and ipsec-traffic actually flows.
> > 
> > Can you please send the backtrace of this crash?
> 
> I bootet the b838d5e1c5b6e57b10ec8af2268824041e3ea911 several times but I
> could not record the complete trace. I think I have to log to the serial
> console but I can't do that before next week.
> 
> 
> What I could record ist:
> 
> There is a always
> 	<IRQ> ... </IRQ>
> the callrace.
> 
> This is the part I could see:
> 
> 
> irq_exit+0x71/0x80
> do_IRQ+0x4d/0xd0
> common_interrup+07a/0x7a
> </IRQ>
> RIP: 010:cpuidle_enter_state+0x11d/0x200
> RSP: 0018:ffffc9000321bee0 EFLAGS: 00000282 ORIG_RAX: ffffffffffffffc4
> RAX: ffff88085efde450 RBX: 0000000000000004 RCX: 00000003c9e63c13
> RDX: 00000003c9e63c13 RSI: ffb03103fe35ac43 RDI: 0000000000000000
> RBP: ffffe8ffff7cf600 R08: 000000000000000c R09: 0000000000000004
> R10: 0000000000000400 R11: 00000003c99e56fc R12: 00000003c9e63c13
> R13: 00000003c9da9567 R14: 0000000000000004 R15: ffffffff822763e0
> do_idle+0xd3/0x160
> cpu_startup_entry+0x14/0x20
> secondary_startup_64+0xa5/0xb0
> Code: 00 0f b7 83 c0 00 00 00 80 7c 02 08 01 0f 86 d3 02 00 00 41
> 8b 8c 24 3c 10 00 00 48 8b 6b 58 85 c9 0f 84 2f 01 00 00 48 83 e5 fe <f6> 45
> 60
> 02 0f 84 4e 01 00 00 f6 43 38 01 74 0d 80 00 bd ab 00 00
> RIP: ip_forward+0xd4/0x470 RSP: ffff88085efc3cb0
> CR2: 0000000000000060
> ----[ end trace 7205b53c25b7b35a ]---
> Kernel panic - not syncing: Fatal exception in interrupt
> Kernel Offset: disabled
> Rebooting in 60 seconds..
> 
> 
> I got an email from Tobias Hommel and I think it is the same problem.
> 
> It is very clear that it is the difference from
> 
> 	ipv4: call dst_hold_safe() properly
> 
> to
> 
> 	ipv4: mark DST_NOGC and remove the operation of dst_free()
> 
> which triggers this bug.
> 
> Regards,

Regards
-- 
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts

^ permalink raw reply

* Re: [RFC PATCH bpf-next v2 0/4] Implement bpf queue/stack maps
From: Mauricio Vasquez @ 2018-09-07 20:40 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Alexei Starovoitov, Daniel Borkmann, netdev, joe
In-Reply-To: <20180907001317.fj7f6fg6ihljompp@ast-mbp.dhcp.thefacebook.com>



On 09/06/2018 07:13 PM, Alexei Starovoitov wrote:
> On Fri, Aug 31, 2018 at 11:25:48PM +0200, Mauricio Vasquez B wrote:
>> In some applications this is needed have a pool of free elements, like for
>> example the list of free L4 ports in a SNAT.  None of the current maps allow
>> to do it as it is not possibleto get an any element without having they key
>> it is associated to.
>>
>> This patchset implements two new kind of eBPF maps: queue and stack.
>> Those maps provide to eBPF programs the peek, push and pop operations, and for
>> userspace applications a new bpf_map_lookup_and_delete_elem() is added.
>>
>> Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
>>
>> ---
>>
>> I am sending this as an RFC because there is still an issue I am not sure how
>> to solve.
>>
>> The queue/stack maps have a linked list for saving the nodes, and a
>> preallocation schema based on the pcpu_freelist already implemented and used
>> in the htabmap.  Each time an element is pushed into the map, a node from the
>> pcpu_freelist is taken and then added to the linked list.
>>
>> The pop operation takes and *removes* the first node from the linked list, then
>> it uses *call_rcu* to postpose freeing the node, i.e, the node is only returned
>> to the pcpu_freelist when the rcu callback is executed.  This is needed because
>> an element returned by the pop() operation should remain valid for the whole
>> duration of the eBPF program.
>>
>> The problem is that elements are not immediately returned to the free list, so
>> in some cases the push operation could fail because there are not free nodes
>> in the pcpu_freelist.
>>
>> The following code snippet exposes that problem.
>>
>> ...
>> 	/* Push MAP_SIZE elements */
>> 	for (i = 0; i < MAP_SIZE; i++)
>> 		assert(bpf_map_update_elem(fd, NULL, &vals[i], 0) == 0);
>>
>> 	/* Pop all elements */
>> 	for (i = 0; i < MAP_SIZE; i++)
>> 		assert(bpf_map_lookup_and_delete_elem(fd, NULL, &val) == 0 &&
>> 		       val == vals[i]);
>>
>>    // sleep(1) <-- If I put this sleep, everything works.
>> 	/* Push MAP_SIZE elements */
>> 	for (i = 0; i < MAP_SIZE; i++)
>> 		assert(bpf_map_update_elem(fd, NULL, &vals[i], 0) == 0);
>>             ^^^
>>             This fails because there are not available elements in pcpu_freelist
>> ...
>>
>> I think a possible solution is to oversize the pcpu_freelist (no idea by how
>> much, maybe double or, or make it 1.5 time the max elements in the map?)
>> I also have concerns about it, it would waste that memory in many cases and
>> this is also probably that it doesn't solve the issue because that code snippet
>> is puhsing and popping elements too fast, so even if the pcpu_freelist is much
>> large a certain time instant all the elements could be used.
>>
>> Is this really an important issue?
>> Any idea of how to solve it?
> It is important issue indeed and a difficult one to solve.
> We have the same issue with hash map.
> If the prog is doing:
> value = lookup(key);
> delete(key);
> // here the prog shouldn't be accessing the value anymore, since the memory
> // could have been reused, but value pointer is still valid and points to
> // allocated memory
Just to notice that for the queue map it is a little bit worse because 
there isn't a way to mark an element to be reused, hence in some cases 
the pool of free elements could be exhausted.
> bpf_map_pop_elem() is trying to do lookup_and_delete and preserve
> validity of value without races.
> With pcpu_freelist I don't think there is a solution.
> We can have this queue/stack map without prealloc and use kmalloc/kfree
> back and forth. Performance will not be as great, but for your use case,
> I suspect, it will be good enough.
I agree, for our use case we are not that worried about the performance, 
it is still in the dataplane but let's say it is not in the "hot" path.
> The key issue with kmalloc/kfree is unbounded time of rcu callbacks.
> If somebody starts doing push/pop for every packet, the rcu subsystem
> will struggle and nothing we can do about it.
>
> The only way I could think of to resolve this problem is to reuse
> the logic that Joe is working on for socket lookups inside the program.
> Joe,
> how is that going? Could you repost the latest patches?
>
> In such case the api for stack map will look like:
>
> elem = bpf_map_pop_elem(stack);
> // access elem
> bpf_map_free_elem(elem);
> // here prog is not allowed to access elem and verifier will catch that
>
> elem = bpf_map_alloc_elem(stack);
> // populate elem
> bpf_map_push_elem(elem);
> // here prog is not allowed to access elem and verifier will catch that
>
> Then both pre-allocated elems and kmalloc/kfree will work fine
> and no unbounded rcu issues in both cases.
>
>

I read the Joe's proposal and using that for this problem looks like a 
nice solution.

I think a good trade-off for now would be to go ahead with a queue/stack 
map without preallocating support (or maybe include it having always in 
mind that this issue has to be solved in the near future) and then, as a 
separated work, try to use Joe's proposal in the map helpers.

What do you think?

Thanks,
Mauricio.

^ permalink raw reply

* [Patch net-next] htb: use anonymous union for simplicity
From: Cong Wang @ 2018-09-07 20:29 UTC (permalink / raw)
  To: netdev; +Cc: jiri, Cong Wang, Jamal Hadi Salim
In-Reply-To: <20180907202914.21331-1-xiyou.wangcong@gmail.com>

cl->leaf.q is slightly more readable than cl->un.leaf.q.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/sched/sch_htb.c | 98 ++++++++++++++++++++++++++---------------------------
 1 file changed, 49 insertions(+), 49 deletions(-)

diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index 43c4bfe625a9..2c45e97b7b87 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -132,7 +132,7 @@ struct htb_class {
 		struct htb_class_inner {
 			struct htb_prio clprio[TC_HTB_NUMPRIO];
 		} inner;
-	} un;
+	};
 	s64			pq_key;
 
 	int			prio_activity;	/* for which prios are we active */
@@ -411,13 +411,13 @@ static void htb_activate_prios(struct htb_sched *q, struct htb_class *cl)
 			int prio = ffz(~m);
 			m &= ~(1 << prio);
 
-			if (p->un.inner.clprio[prio].feed.rb_node)
+			if (p->inner.clprio[prio].feed.rb_node)
 				/* parent already has its feed in use so that
 				 * reset bit in mask as parent is already ok
 				 */
 				mask &= ~(1 << prio);
 
-			htb_add_to_id_tree(&p->un.inner.clprio[prio].feed, cl, prio);
+			htb_add_to_id_tree(&p->inner.clprio[prio].feed, cl, prio);
 		}
 		p->prio_activity |= mask;
 		cl = p;
@@ -447,19 +447,19 @@ static void htb_deactivate_prios(struct htb_sched *q, struct htb_class *cl)
 			int prio = ffz(~m);
 			m &= ~(1 << prio);
 
-			if (p->un.inner.clprio[prio].ptr == cl->node + prio) {
+			if (p->inner.clprio[prio].ptr == cl->node + prio) {
 				/* we are removing child which is pointed to from
 				 * parent feed - forget the pointer but remember
 				 * classid
 				 */
-				p->un.inner.clprio[prio].last_ptr_id = cl->common.classid;
-				p->un.inner.clprio[prio].ptr = NULL;
+				p->inner.clprio[prio].last_ptr_id = cl->common.classid;
+				p->inner.clprio[prio].ptr = NULL;
 			}
 
 			htb_safe_rb_erase(cl->node + prio,
-					  &p->un.inner.clprio[prio].feed);
+					  &p->inner.clprio[prio].feed);
 
-			if (!p->un.inner.clprio[prio].feed.rb_node)
+			if (!p->inner.clprio[prio].feed.rb_node)
 				mask |= 1 << prio;
 		}
 
@@ -555,7 +555,7 @@ htb_change_class_mode(struct htb_sched *q, struct htb_class *cl, s64 *diff)
  */
 static inline void htb_activate(struct htb_sched *q, struct htb_class *cl)
 {
-	WARN_ON(cl->level || !cl->un.leaf.q || !cl->un.leaf.q->q.qlen);
+	WARN_ON(cl->level || !cl->leaf.q || !cl->leaf.q->q.qlen);
 
 	if (!cl->prio_activity) {
 		cl->prio_activity = 1 << cl->prio;
@@ -615,7 +615,7 @@ static int htb_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 		__qdisc_drop(skb, to_free);
 		return ret;
 #endif
-	} else if ((ret = qdisc_enqueue(skb, cl->un.leaf.q,
+	} else if ((ret = qdisc_enqueue(skb, cl->leaf.q,
 					to_free)) != NET_XMIT_SUCCESS) {
 		if (net_xmit_drop_count(ret)) {
 			qdisc_qstats_drop(sch);
@@ -823,7 +823,7 @@ static struct htb_class *htb_lookup_leaf(struct htb_prio *hprio, const int prio)
 			cl = rb_entry(*sp->pptr, struct htb_class, node[prio]);
 			if (!cl->level)
 				return cl;
-			clp = &cl->un.inner.clprio[prio];
+			clp = &cl->inner.clprio[prio];
 			(++sp)->root = clp->feed.rb_node;
 			sp->pptr = &clp->ptr;
 			sp->pid = &clp->last_ptr_id;
@@ -857,7 +857,7 @@ static struct sk_buff *htb_dequeue_tree(struct htb_sched *q, const int prio,
 		 * graft operation on the leaf since last dequeue;
 		 * simply deactivate and skip such class
 		 */
-		if (unlikely(cl->un.leaf.q->q.qlen == 0)) {
+		if (unlikely(cl->leaf.q->q.qlen == 0)) {
 			struct htb_class *next;
 			htb_deactivate(q, cl);
 
@@ -873,12 +873,12 @@ static struct sk_buff *htb_dequeue_tree(struct htb_sched *q, const int prio,
 			goto next;
 		}
 
-		skb = cl->un.leaf.q->dequeue(cl->un.leaf.q);
+		skb = cl->leaf.q->dequeue(cl->leaf.q);
 		if (likely(skb != NULL))
 			break;
 
-		qdisc_warn_nonwc("htb", cl->un.leaf.q);
-		htb_next_rb_node(level ? &cl->parent->un.inner.clprio[prio].ptr:
+		qdisc_warn_nonwc("htb", cl->leaf.q);
+		htb_next_rb_node(level ? &cl->parent->inner.clprio[prio].ptr:
 					 &q->hlevel[0].hprio[prio].ptr);
 		cl = htb_lookup_leaf(hprio, prio);
 
@@ -886,16 +886,16 @@ static struct sk_buff *htb_dequeue_tree(struct htb_sched *q, const int prio,
 
 	if (likely(skb != NULL)) {
 		bstats_update(&cl->bstats, skb);
-		cl->un.leaf.deficit[level] -= qdisc_pkt_len(skb);
-		if (cl->un.leaf.deficit[level] < 0) {
-			cl->un.leaf.deficit[level] += cl->quantum;
-			htb_next_rb_node(level ? &cl->parent->un.inner.clprio[prio].ptr :
+		cl->leaf.deficit[level] -= qdisc_pkt_len(skb);
+		if (cl->leaf.deficit[level] < 0) {
+			cl->leaf.deficit[level] += cl->quantum;
+			htb_next_rb_node(level ? &cl->parent->inner.clprio[prio].ptr :
 						 &q->hlevel[0].hprio[prio].ptr);
 		}
 		/* this used to be after charge_class but this constelation
 		 * gives us slightly better performance
 		 */
-		if (!cl->un.leaf.q->q.qlen)
+		if (!cl->leaf.q->q.qlen)
 			htb_deactivate(q, cl);
 		htb_charge_class(q, cl, level, skb);
 	}
@@ -972,10 +972,10 @@ static void htb_reset(struct Qdisc *sch)
 	for (i = 0; i < q->clhash.hashsize; i++) {
 		hlist_for_each_entry(cl, &q->clhash.hash[i], common.hnode) {
 			if (cl->level)
-				memset(&cl->un.inner, 0, sizeof(cl->un.inner));
+				memset(&cl->inner, 0, sizeof(cl->inner));
 			else {
-				if (cl->un.leaf.q)
-					qdisc_reset(cl->un.leaf.q);
+				if (cl->leaf.q)
+					qdisc_reset(cl->leaf.q);
 			}
 			cl->prio_activity = 0;
 			cl->cmode = HTB_CAN_SEND;
@@ -1098,8 +1098,8 @@ static int htb_dump_class(struct Qdisc *sch, unsigned long arg,
 	 */
 	tcm->tcm_parent = cl->parent ? cl->parent->common.classid : TC_H_ROOT;
 	tcm->tcm_handle = cl->common.classid;
-	if (!cl->level && cl->un.leaf.q)
-		tcm->tcm_info = cl->un.leaf.q->handle;
+	if (!cl->level && cl->leaf.q)
+		tcm->tcm_info = cl->leaf.q->handle;
 
 	nest = nla_nest_start(skb, TCA_OPTIONS);
 	if (nest == NULL)
@@ -1142,9 +1142,9 @@ htb_dump_class_stats(struct Qdisc *sch, unsigned long arg, struct gnet_dump *d)
 	};
 	__u32 qlen = 0;
 
-	if (!cl->level && cl->un.leaf.q) {
-		qlen = cl->un.leaf.q->q.qlen;
-		qs.backlog = cl->un.leaf.q->qstats.backlog;
+	if (!cl->level && cl->leaf.q) {
+		qlen = cl->leaf.q->q.qlen;
+		qs.backlog = cl->leaf.q->qstats.backlog;
 	}
 	cl->xstats.tokens = clamp_t(s64, PSCHED_NS2TICKS(cl->tokens),
 				    INT_MIN, INT_MAX);
@@ -1172,14 +1172,14 @@ static int htb_graft(struct Qdisc *sch, unsigned long arg, struct Qdisc *new,
 				     cl->common.classid, extack)) == NULL)
 		return -ENOBUFS;
 
-	*old = qdisc_replace(sch, new, &cl->un.leaf.q);
+	*old = qdisc_replace(sch, new, &cl->leaf.q);
 	return 0;
 }
 
 static struct Qdisc *htb_leaf(struct Qdisc *sch, unsigned long arg)
 {
 	struct htb_class *cl = (struct htb_class *)arg;
-	return !cl->level ? cl->un.leaf.q : NULL;
+	return !cl->level ? cl->leaf.q : NULL;
 }
 
 static void htb_qlen_notify(struct Qdisc *sch, unsigned long arg)
@@ -1205,15 +1205,15 @@ static void htb_parent_to_leaf(struct htb_sched *q, struct htb_class *cl,
 {
 	struct htb_class *parent = cl->parent;
 
-	WARN_ON(cl->level || !cl->un.leaf.q || cl->prio_activity);
+	WARN_ON(cl->level || !cl->leaf.q || cl->prio_activity);
 
 	if (parent->cmode != HTB_CAN_SEND)
 		htb_safe_rb_erase(&parent->pq_node,
 				  &q->hlevel[parent->level].wait_pq);
 
 	parent->level = 0;
-	memset(&parent->un.inner, 0, sizeof(parent->un.inner));
-	parent->un.leaf.q = new_q ? new_q : &noop_qdisc;
+	memset(&parent->inner, 0, sizeof(parent->inner));
+	parent->leaf.q = new_q ? new_q : &noop_qdisc;
 	parent->tokens = parent->buffer;
 	parent->ctokens = parent->cbuffer;
 	parent->t_c = ktime_get_ns();
@@ -1223,8 +1223,8 @@ static void htb_parent_to_leaf(struct htb_sched *q, struct htb_class *cl,
 static void htb_destroy_class(struct Qdisc *sch, struct htb_class *cl)
 {
 	if (!cl->level) {
-		WARN_ON(!cl->un.leaf.q);
-		qdisc_destroy(cl->un.leaf.q);
+		WARN_ON(!cl->leaf.q);
+		qdisc_destroy(cl->leaf.q);
 	}
 	gen_kill_estimator(&cl->rate_est);
 	tcf_block_put(cl->block);
@@ -1286,11 +1286,11 @@ static int htb_delete(struct Qdisc *sch, unsigned long arg)
 	sch_tree_lock(sch);
 
 	if (!cl->level) {
-		unsigned int qlen = cl->un.leaf.q->q.qlen;
-		unsigned int backlog = cl->un.leaf.q->qstats.backlog;
+		unsigned int qlen = cl->leaf.q->q.qlen;
+		unsigned int backlog = cl->leaf.q->qstats.backlog;
 
-		qdisc_reset(cl->un.leaf.q);
-		qdisc_tree_reduce_backlog(cl->un.leaf.q, qlen, backlog);
+		qdisc_reset(cl->leaf.q);
+		qdisc_tree_reduce_backlog(cl->leaf.q, qlen, backlog);
 	}
 
 	/* delete from hash and active; remainder in destroy_class */
@@ -1419,13 +1419,13 @@ static int htb_change_class(struct Qdisc *sch, u32 classid,
 					  classid, NULL);
 		sch_tree_lock(sch);
 		if (parent && !parent->level) {
-			unsigned int qlen = parent->un.leaf.q->q.qlen;
-			unsigned int backlog = parent->un.leaf.q->qstats.backlog;
+			unsigned int qlen = parent->leaf.q->q.qlen;
+			unsigned int backlog = parent->leaf.q->qstats.backlog;
 
 			/* turn parent into inner node */
-			qdisc_reset(parent->un.leaf.q);
-			qdisc_tree_reduce_backlog(parent->un.leaf.q, qlen, backlog);
-			qdisc_destroy(parent->un.leaf.q);
+			qdisc_reset(parent->leaf.q);
+			qdisc_tree_reduce_backlog(parent->leaf.q, qlen, backlog);
+			qdisc_destroy(parent->leaf.q);
 			if (parent->prio_activity)
 				htb_deactivate(q, parent);
 
@@ -1436,10 +1436,10 @@ static int htb_change_class(struct Qdisc *sch, u32 classid,
 			}
 			parent->level = (parent->parent ? parent->parent->level
 					 : TC_HTB_MAXDEPTH) - 1;
-			memset(&parent->un.inner, 0, sizeof(parent->un.inner));
+			memset(&parent->inner, 0, sizeof(parent->inner));
 		}
 		/* leaf (we) needs elementary qdisc */
-		cl->un.leaf.q = new_q ? new_q : &noop_qdisc;
+		cl->leaf.q = new_q ? new_q : &noop_qdisc;
 
 		cl->common.classid = classid;
 		cl->parent = parent;
@@ -1455,8 +1455,8 @@ static int htb_change_class(struct Qdisc *sch, u32 classid,
 		qdisc_class_hash_insert(&q->clhash, &cl->common);
 		if (parent)
 			parent->children++;
-		if (cl->un.leaf.q != &noop_qdisc)
-			qdisc_hash_add(cl->un.leaf.q, true);
+		if (cl->leaf.q != &noop_qdisc)
+			qdisc_hash_add(cl->leaf.q, true);
 	} else {
 		if (tca[TCA_RATE]) {
 			err = gen_replace_estimator(&cl->bstats, NULL,
@@ -1478,7 +1478,7 @@ static int htb_change_class(struct Qdisc *sch, u32 classid,
 	psched_ratecfg_precompute(&cl->ceil, &hopt->ceil, ceil64);
 
 	/* it used to be a nasty bug here, we have to check that node
-	 * is really leaf before changing cl->un.leaf !
+	 * is really leaf before changing cl->leaf !
 	 */
 	if (!cl->level) {
 		u64 quantum = cl->rate.rate_bytes_ps;
-- 
2.14.4

^ permalink raw reply related

* [Patch net-next] net_sched: remove redundant qdisc lock classes
From: Cong Wang @ 2018-09-07 20:29 UTC (permalink / raw)
  To: netdev; +Cc: jiri, Cong Wang, Jamal Hadi Salim

We no longer take any spinlock on RX path for ingress qdisc,
so this lockdep annotation is no longer needed.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/sched/sch_api.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 98541c6399db..411c40344b77 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -27,7 +27,6 @@
 #include <linux/kmod.h>
 #include <linux/list.h>
 #include <linux/hrtimer.h>
-#include <linux/lockdep.h>
 #include <linux/slab.h>
 #include <linux/hashtable.h>
 
@@ -1053,10 +1052,6 @@ static int qdisc_block_indexes_set(struct Qdisc *sch, struct nlattr **tca,
 	return 0;
 }
 
-/* lockdep annotation is needed for ingress; egress gets it only for name */
-static struct lock_class_key qdisc_tx_lock;
-static struct lock_class_key qdisc_rx_lock;
-
 /*
    Allocate and initialize new qdisc.
 
@@ -1121,7 +1116,6 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
 	if (handle == TC_H_INGRESS) {
 		sch->flags |= TCQ_F_INGRESS;
 		handle = TC_H_MAKE(TC_H_INGRESS, 0);
-		lockdep_set_class(qdisc_lock(sch), &qdisc_rx_lock);
 	} else {
 		if (handle == 0) {
 			handle = qdisc_alloc_handle(dev);
@@ -1129,7 +1123,6 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
 			if (handle == 0)
 				goto err_out3;
 		}
-		lockdep_set_class(qdisc_lock(sch), &qdisc_tx_lock);
 		if (!netif_is_multiqueue(dev))
 			sch->flags |= TCQ_F_ONETXQUEUE;
 	}
-- 
2.14.4

^ permalink raw reply related

* Re: kernels > v4.12 oops/crash with ipsec-traffic: bisected to b838d5e1c5b6e57b10ec8af2268824041e3ea911: ipv4: mark DST_NOGC and remove the operation of dst_free()
From: Wolfgang Walter @ 2018-09-07 20:22 UTC (permalink / raw)
  To: Steffen Klassert; +Cc: netdev, Wei Wang, Tobias Hommel
In-Reply-To: <20180831065024.GG23674@gauss3.secunet.de>

Am Freitag, 31. August 2018, 08:50:24 schrieb Steffen Klassert:
> On Thu, Aug 30, 2018 at 08:53:50PM +0200, Wolfgang Walter wrote:
> > Hello,
> > 
> > kernels > 4.12 do not work on one of our main routers. They crash as soon
> > as ipsec-tunnels are configured and ipsec-traffic actually flows.
> 
> Can you please send the backtrace of this crash?
> 

I bootet the b838d5e1c5b6e57b10ec8af2268824041e3ea911 several times but I 
could not record the complete trace. I think I have to log to the serial 
console but I can't do that before next week.


What I could record ist:

There is a always 
	<IRQ> ... </IRQ>
the callrace.

This is the part I could see:


irq_exit+0x71/0x80
do_IRQ+0x4d/0xd0
common_interrup+07a/0x7a
</IRQ>
RIP: 010:cpuidle_enter_state+0x11d/0x200
RSP: 0018:ffffc9000321bee0 EFLAGS: 00000282 ORIG_RAX: ffffffffffffffc4
RAX: ffff88085efde450 RBX: 0000000000000004 RCX: 00000003c9e63c13
RDX: 00000003c9e63c13 RSI: ffb03103fe35ac43 RDI: 0000000000000000
RBP: ffffe8ffff7cf600 R08: 000000000000000c R09: 0000000000000004
R10: 0000000000000400 R11: 00000003c99e56fc R12: 00000003c9e63c13
R13: 00000003c9da9567 R14: 0000000000000004 R15: ffffffff822763e0
do_idle+0xd3/0x160
cpu_startup_entry+0x14/0x20
secondary_startup_64+0xa5/0xb0
Code: 00 0f b7 83 c0 00 00 00 80 7c 02 08 01 0f 86 d3 02 00 00 41
8b 8c 24 3c 10 00 00 48 8b 6b 58 85 c9 0f 84 2f 01 00 00 48 83 e5 fe <f6> 45 
60
02 0f 84 4e 01 00 00 f6 43 38 01 74 0d 80 00 bd ab 00 00
RIP: ip_forward+0xd4/0x470 RSP: ffff88085efc3cb0
CR2: 0000000000000060
----[ end trace 7205b53c25b7b35a ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: disabled
Rebooting in 60 seconds..


I got an email from Tobias Hommel and I think it is the same problem.

It is very clear that it is the difference from

	ipv4: call dst_hold_safe() properly

to

	ipv4: mark DST_NOGC and remove the operation of dst_free()

which triggers this bug.

Regards,
-- 
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts

^ permalink raw reply

* Re: [PATCH net-next 08/13] net: sched: rename tcf_block_get{_ext}() and tcf_block_put{_ext}()
From: Cong Wang @ 2018-09-07 20:09 UTC (permalink / raw)
  To: Vlad Buslov
  Cc: Linux Kernel Network Developers, Jamal Hadi Salim, Jiri Pirko,
	David Miller, Stephen Hemminger, Kirill Tkhai, Paul E. McKenney,
	Nicolas Dichtel, Leon Romanovsky, Greg KH, mark.rutland,
	Florian Westphal, David Ahern, lucien xin, Jakub Kicinski,
	Christian Brauner, Jiri Benc
In-Reply-To: <1536220742-25650-9-git-send-email-vladbu@mellanox.com>

On Thu, Sep 6, 2018 at 12:59 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>
> Functions tcf_block_get{_ext}() and tcf_block_put{_ext}() actually
> attach/detach block to specific Qdisc besides just taking/putting
> reference. Rename them according to their purpose.

Where exactly does it attach to?

Each qdisc provides a pointer to a pointer of a block, like
&cl->block. It is where the result is saved to. It takes a parameter
of Qdisc* merely for read-only purpose.

So, renaming it to *attach() is even confusing, at least not
any better. Please find other names or leave them as they are.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox