Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH bpf-next v5 00/10] BTF: BPF Type Format
From: Arnaldo Carvalho de Melo @ 2018-06-06 12:33 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: netdev, Alexei Starovoitov, Daniel Borkmann, kernel-team
In-Reply-To: <20180605212548.lwtaw4svvydo2lhy@kafai-mbp.dhcp.thefacebook.com>

Em Tue, Jun 05, 2018 at 02:25:48PM -0700, Martin KaFai Lau escreveu:
> On Thu, Apr 19, 2018 at 04:40:34PM -0300, Arnaldo Carvalho de Melo wrote:
> > Em Wed, Apr 18, 2018 at 03:55:56PM -0700, Martin KaFai Lau escreveu:
> > > This patch introduces BPF Type Format (BTF).
> > > 
> > > BTF (BPF Type Format) is the meta data format which describes
> > > the data types of BPF program/map.  Hence, it basically focus
> > > on the C programming language which the modern BPF is primary
> > > using.  The first use case is to provide a generic pretty print
> > > capability for a BPF map.
> > > 
> > > A modified pahole that can convert dwarf to BTF is here:
> > > https://github.com/iamkafai/pahole/tree/btf
> > > (Arnaldo, there is some BTF_KIND numbering changes on
> > >  Apr 18th, d61426c1571)
> > 
> > Thanks for letting me know, I'm starting to look at this,
> Hi Arnaldo,
> 
> Do you have a chance to take a look and pull it?  The kernel
> changes will be in 4.18, so it will be handy if it is available in
> the pahole repository.
> 
> [ btw, the latest commit (1 commit) should be 94a11b59e592 ].

Got sidetracked, will get back to it later today.

- Arnaldo

^ permalink raw reply

* Re: [PATCH] r8169: Reinstate ALDPS and ASPM support
From: Andrew Lunn @ 2018-06-06 12:34 UTC (permalink / raw)
  To: Kai-Heng Feng
  Cc: David Miller, Hayes Wang, hkallweit1, romieu, Linux Netdev List,
	Linux Kernel Mailing List, Ryankao
In-Reply-To: <8BAED2AC-DA94-4F63-BA16-EB3D54649BC5@canonical.com>

> Hi Andrew,
> 
> Sure.
> 
> Do you think we should also strip out the enable/disable logic?
> 
> Or just remove the parameter?

Hi Kai-Heng

How would you use the logic if you remove the parameter?

    Andrew

^ permalink raw reply

* Re: [PATCH] net/phy: Micrel KSZ8061 PHY link failure after cable connect
From: Andrew Lunn @ 2018-06-06 12:39 UTC (permalink / raw)
  To: Alexander Onnasch; +Cc: Florian Fainelli, netdev, linux-kernel
In-Reply-To: <1528272198-10825-1-git-send-email-alexander.onnasch@landisgyr.com>

On Wed, Jun 06, 2018 at 10:03:18AM +0200, Alexander Onnasch wrote:
> With Micrel KSZ8061 PHY, the link may occasionally not come up after
> Ethernet cable connect. The vendor's (Microchip, former Micrel) errata
> sheet 80000688A.pdf describes the problem and possible workarounds in
> detail, see below.
> The patch implements workaround 1, which permanently fixes the issue.
> 
> DESCRIPTION
> Link-up may not occur properly when the Ethernet cable is initially
> connected. This issue occurs more commonly when the cable is connected
> slowly, but it may occur any time a cable is connected. This issue occurs
> in the auto-negotiation circuit, and will not occur if auto-negotiation
> is disabled (which requires that the two link partners be set to the
> same speed and duplex).
> 
> END USER IMPLICATIONS
> When this issue occurs, link is not established. Subsequent cable
> plug/unplug cycles will not correct the issue.
> 
> WORK AROUND
> There are four approaches to work around this issue:
> 1.  This issue can be prevented by setting bit 15 in MMD device address 1,
>     register 2, prior to connecting the  cable or prior to setting the
>     Restart Auto-Negotiation bit in register 0h.The MMD registers are
>     accessed via the indirect access registers Dh and Eh, or via the Micrel
>     EthUtil utility as shown here:
>     •  If using the EthUtil utility (usually with a Micrel KSZ8061
>        Evaluation Board), type the following commands:
>        > address 1
>        > mmd 1
>        > iw 2 b61a
>     •  Alternatively, write the following registers to write to the
>        indirect MMD register:
>        Write register Dh, data 0001h
>        Write register Eh, data 0002h
>        Write register Dh, data 4001h
>        Write register Eh, data B61Ah
> 2.  The issue can be avoided by disabling auto-negotiation in the KSZ8061,
>     either by the strapping option, or by clearing bit 12 in register 0h.
>     Care must be taken to ensure that the KSZ8061 and the link partner
>     will link with the same speed and duplex. Note that the KSZ8061
>     defaults to full-duplex when auto-negotiation is off, but other
>     devices may default to half-duplex in the event of failed
>     auto-negotiation.
> 3.  The issue can be avoided by connecting the cable prior to powering-up
>     or resetting the KSZ8061, and leaving it plugged in thereafter.
> 4.  If the above measures are not taken and the problem occurs, link can
>     be recovered by setting the Restart Auto-Negotiation bit in
>     register 0h, or by resetting or power cycling the device. Reset may
>     be either hardware reset or software reset (register 0h, bit 15).
> 
> PLAN
> This errata will not be corrected in a future revision.
> 
> Signed-off-by: Alexander Onnasch <alexander.onnasch@landisgyr.com>
> ---
>  drivers/net/phy/micrel.c | 24 +++++++++++++++++++++++-
>  1 file changed, 23 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/phy/micrel.c b/drivers/net/phy/micrel.c
> index 6c45ff6..7d80a00 100644
> --- a/drivers/net/phy/micrel.c
> +++ b/drivers/net/phy/micrel.c
> @@ -339,6 +339,28 @@ static int ksz8041_config_aneg(struct phy_device *phydev)
>  	return genphy_config_aneg(phydev);
>  }
>  
> +#define MII_KSZ8061RN_MMD_CTRL_REG	    0x0d
> +#define MII_KSZ8061RN_MMD_REGDATA_REG	0x0e
> +
> +static int ksz8061_extended_write(struct phy_device *phydev,
> +				  u8 mode, u32 dev_addr, u32 regnum, u16 val)
> +{
> +	phy_write(phydev, MII_KSZ8061RN_MMD_CTRL_REG, dev_addr);
> +	phy_write(phydev, MII_KSZ8061RN_MMD_REGDATA_REG, regnum);
> +	phy_write(phydev, MII_KSZ8061RN_MMD_CTRL_REG, (mode << 14) | dev_addr);
> +	return phy_write(phydev, MII_KSZ8061RN_MMD_REGDATA_REG, val);
> +}

Hi Alexander

This looks a lot like phy_write_mmd().

     Andrew

^ permalink raw reply

* [PATCH bpf-next v5 2/2] samples/bpf: Add xdp_sample_pkts example
From: Toke Høiland-Jørgensen @ 2018-06-06 12:43 UTC (permalink / raw)
  To: netdev; +Cc: Jesper Dangaard Brouer
In-Reply-To: <152828901936.9599.10143519677183418086.stgit@alrua-kau>

Add an example program showing how to sample packets from XDP using the
perf event buffer. The example userspace program just prints the ethernet
header for every packet sampled.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
---
 samples/bpf/Makefile               |    4 +
 samples/bpf/xdp_sample_pkts_kern.c |   66 ++++++++++++++
 samples/bpf/xdp_sample_pkts_user.c |  176 ++++++++++++++++++++++++++++++++++++
 3 files changed, 246 insertions(+)
 create mode 100644 samples/bpf/xdp_sample_pkts_kern.c
 create mode 100644 samples/bpf/xdp_sample_pkts_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 1303af10e54d..9ea2f7b64869 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -52,6 +52,7 @@ hostprogs-y += xdp_adjust_tail
 hostprogs-y += xdpsock
 hostprogs-y += xdp_fwd
 hostprogs-y += task_fd_query
+hostprogs-y += xdp_sample_pkts
 
 # Libbpf dependencies
 LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
@@ -107,6 +108,7 @@ xdp_adjust_tail-objs := xdp_adjust_tail_user.o
 xdpsock-objs := bpf_load.o xdpsock_user.o
 xdp_fwd-objs := bpf_load.o xdp_fwd_user.o
 task_fd_query-objs := bpf_load.o task_fd_query_user.o $(TRACE_HELPERS)
+xdp_sample_pkts-objs := xdp_sample_pkts_user.o $(TRACE_HELPERS)
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -163,6 +165,7 @@ always += xdp_adjust_tail_kern.o
 always += xdpsock_kern.o
 always += xdp_fwd_kern.o
 always += task_fd_query_kern.o
+always += xdp_sample_pkts_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
@@ -179,6 +182,7 @@ HOSTCFLAGS_spintest_user.o += -I$(srctree)/tools/lib/bpf/
 HOSTCFLAGS_trace_event_user.o += -I$(srctree)/tools/lib/bpf/
 HOSTCFLAGS_sampleip_user.o += -I$(srctree)/tools/lib/bpf/
 HOSTCFLAGS_task_fd_query_user.o += -I$(srctree)/tools/lib/bpf/
+HOSTCFLAGS_xdp_sample_pkts_user.o += -I$(srctree)/tools/lib/bpf/
 
 HOST_LOADLIBES		+= $(LIBBPF) -lelf
 HOSTLOADLIBES_tracex4		+= -lrt
diff --git a/samples/bpf/xdp_sample_pkts_kern.c b/samples/bpf/xdp_sample_pkts_kern.c
new file mode 100644
index 000000000000..f7ca8b850978
--- /dev/null
+++ b/samples/bpf/xdp_sample_pkts_kern.c
@@ -0,0 +1,66 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/ptrace.h>
+#include <linux/version.h>
+#include <uapi/linux/bpf.h>
+#include "bpf_helpers.h"
+
+#define SAMPLE_SIZE 64ul
+#define MAX_CPUS 128
+
+#define bpf_printk(fmt, ...)					\
+({								\
+	       char ____fmt[] = fmt;				\
+	       bpf_trace_printk(____fmt, sizeof(____fmt),	\
+				##__VA_ARGS__);			\
+})
+
+struct bpf_map_def SEC("maps") my_map = {
+	.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
+	.key_size = sizeof(int),
+	.value_size = sizeof(u32),
+	.max_entries = MAX_CPUS,
+};
+
+SEC("xdp_sample")
+int xdp_sample_prog(struct xdp_md *ctx)
+{
+	void *data_end = (void *)(long)ctx->data_end;
+	void *data = (void *)(long)ctx->data;
+
+	/* Metadata will be in the perf event before the packet data. */
+	struct S {
+		u16 cookie;
+		u16 pkt_len;
+	} __packed metadata;
+
+	if (data < data_end) {
+		/* The XDP perf_event_output handler will use the upper 32 bits
+		 * of the flags argument as a number of bytes to include of the
+		 * packet payload in the event data. If the size is too big, the
+		 * call to bpf_perf_event_output will fail and return -EFAULT.
+		 *
+		 * See bpf_xdp_event_output in net/core/filter.c.
+		 *
+		 * The BPF_F_CURRENT_CPU flag means that the event output fd
+		 * will be indexed by the CPU number in the event map.
+		 */
+		u64 flags = BPF_F_CURRENT_CPU;
+		u16 sample_size;
+		int ret;
+
+		metadata.cookie = 0xdead;
+		metadata.pkt_len = (u16)(data_end - data);
+		sample_size = min(metadata.pkt_len, SAMPLE_SIZE);
+		flags |= (u64)sample_size << 32;
+
+		ret = bpf_perf_event_output(ctx, &my_map, flags,
+					    &metadata, sizeof(metadata));
+		if (ret)
+			bpf_printk("perf_event_output failed: %d\n", ret);
+	}
+
+	return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/xdp_sample_pkts_user.c b/samples/bpf/xdp_sample_pkts_user.c
new file mode 100644
index 000000000000..944ff7512d2b
--- /dev/null
+++ b/samples/bpf/xdp_sample_pkts_user.c
@@ -0,0 +1,176 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stdio.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <fcntl.h>
+#include <poll.h>
+#include <linux/perf_event.h>
+#include <linux/bpf.h>
+#include <net/if.h>
+#include <errno.h>
+#include <assert.h>
+#include <sys/sysinfo.h>
+#include <sys/syscall.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <time.h>
+#include <signal.h>
+#include <libbpf.h>
+#include <bpf/bpf.h>
+
+#include "perf-sys.h"
+#include "trace_helpers.h"
+
+#define MAX_CPUS 128
+static int pmu_fds[MAX_CPUS], if_idx;
+static struct perf_event_mmap_page *headers[MAX_CPUS];
+static char *if_name;
+
+static int do_attach(int idx, int fd, const char *name)
+{
+	int err;
+
+	err = bpf_set_link_xdp_fd(idx, fd, 0);
+	if (err < 0)
+		printf("ERROR: failed to attach program to %s\n", name);
+
+	return err;
+}
+
+static int do_detach(int idx, const char *name)
+{
+	int err;
+
+	err = bpf_set_link_xdp_fd(idx, -1, 0);
+	if (err < 0)
+		printf("ERROR: failed to detach program from %s\n", name);
+
+	return err;
+}
+
+#define SAMPLE_SIZE 64
+
+static int print_bpf_output(void *data, int size)
+{
+	struct {
+		__u16 cookie;
+		__u16 pkt_len;
+		__u8  pkt_data[SAMPLE_SIZE];
+	} __packed *e = data;
+	int i;
+
+	if (e->cookie != 0xdead) {
+		printf("BUG cookie %x sized %d\n",
+		       e->cookie, size);
+		return LIBBPF_PERF_EVENT_ERROR;
+	}
+
+	printf("Pkt len: %-5d bytes. Ethernet hdr: ", e->pkt_len);
+	for (i = 0; i < 14 && i < e->pkt_len; i++)
+		printf("%02x ", e->pkt_data[i]);
+	printf("\n");
+
+	return LIBBPF_PERF_EVENT_CONT;
+}
+
+static void test_bpf_perf_event(int map_fd, int num)
+{
+	struct perf_event_attr attr = {
+		.sample_type = PERF_SAMPLE_RAW,
+		.type = PERF_TYPE_SOFTWARE,
+		.config = PERF_COUNT_SW_BPF_OUTPUT,
+		.wakeup_events = 1, /* get an fd notification for every event */
+	};
+	int i;
+
+	for (i = 0; i < num; i++) {
+		int key = i;
+
+		pmu_fds[i] = sys_perf_event_open(&attr, -1/*pid*/, i/*cpu*/,
+						 -1/*group_fd*/, 0);
+
+		assert(pmu_fds[i] >= 0);
+		assert(bpf_map_update_elem(map_fd, &key,
+					   &pmu_fds[i], BPF_ANY) == 0);
+		ioctl(pmu_fds[i], PERF_EVENT_IOC_ENABLE, 0);
+	}
+}
+
+static void sig_handler(int signo)
+{
+	do_detach(if_idx, if_name);
+	exit(0);
+}
+
+int main(int argc, char **argv)
+{
+	struct bpf_prog_load_attr prog_load_attr = {
+		.prog_type	= BPF_PROG_TYPE_XDP,
+	};
+	struct bpf_object *obj;
+	struct bpf_map *map;
+	int prog_fd, map_fd;
+	char filename[256];
+	int ret, err, i;
+	int numcpus;
+
+	if (argc < 2) {
+		printf("Usage: %s <ifname>\n", argv[0]);
+		return 1;
+	}
+
+	numcpus = get_nprocs();
+	if (numcpus > MAX_CPUS)
+		numcpus = MAX_CPUS;
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+	prog_load_attr.file = filename;
+
+	if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd))
+		return 1;
+
+	if (!prog_fd) {
+		printf("load_bpf_file: %s\n", strerror(errno));
+		return 1;
+	}
+
+	map = bpf_map__next(NULL, obj);
+	if (!map) {
+		printf("finding a map in obj file failed\n");
+		return 1;
+	}
+	map_fd = bpf_map__fd(map);
+
+	if_idx = if_nametoindex(argv[1]);
+	if (!if_idx)
+		if_idx = strtoul(argv[1], NULL, 0);
+
+	if (!if_idx) {
+		fprintf(stderr, "Invalid ifname\n");
+		return 1;
+	}
+	if_name = argv[1];
+	err = do_attach(if_idx, prog_fd, argv[1]);
+	if (err)
+		return err;
+
+	if (signal(SIGINT, sig_handler) ||
+	    signal(SIGHUP, sig_handler) ||
+	    signal(SIGTERM, sig_handler)) {
+		perror("signal");
+		return 1;
+	}
+
+	test_bpf_perf_event(map_fd, numcpus);
+
+	for (i = 0; i < numcpus; i++)
+		if (perf_event_mmap_header(pmu_fds[i], &headers[i]) < 0)
+			return 1;
+
+	ret = perf_event_poller_multi(pmu_fds, headers, numcpus,
+				      print_bpf_output);
+	kill(0, SIGINT);
+	return ret;
+}

^ permalink raw reply related

* [PATCH bpf-next v5 1/2] trace_helpers.c: Add helpers to poll multiple perf FDs for events
From: Toke Høiland-Jørgensen @ 2018-06-06 12:43 UTC (permalink / raw)
  To: netdev; +Cc: Jesper Dangaard Brouer

Add two new helper functions to trace_helpers that supports polling
multiple perf file descriptors for events. These are used to the XDP
perf_event_output example, which needs to work with one perf fd per CPU.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
---
 tools/testing/selftests/bpf/trace_helpers.c |   49 ++++++++++++++++++++++++++-
 tools/testing/selftests/bpf/trace_helpers.h |    4 ++
 2 files changed, 51 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/trace_helpers.c b/tools/testing/selftests/bpf/trace_helpers.c
index 3868dcb63420..0673e8840cc8 100644
--- a/tools/testing/selftests/bpf/trace_helpers.c
+++ b/tools/testing/selftests/bpf/trace_helpers.c
@@ -88,7 +88,7 @@ static int page_size;
 static int page_cnt = 8;
 static struct perf_event_mmap_page *header;
 
-int perf_event_mmap(int fd)
+int perf_event_mmap_header(int fd, struct perf_event_mmap_page **header)
 {
 	void *base;
 	int mmap_size;
@@ -102,10 +102,15 @@ int perf_event_mmap(int fd)
 		return -1;
 	}
 
-	header = base;
+	*header = base;
 	return 0;
 }
 
+int perf_event_mmap(int fd)
+{
+	return perf_event_mmap_header(fd, &header);
+}
+
 static int perf_event_poll(int fd)
 {
 	struct pollfd pfd = { .fd = fd, .events = POLLIN };
@@ -163,3 +168,43 @@ int perf_event_poller(int fd, perf_event_print_fn output_fn)
 
 	return ret;
 }
+
+int perf_event_poller_multi(int *fds, struct perf_event_mmap_page **headers,
+			    int num_fds, perf_event_print_fn output_fn)
+{
+	enum bpf_perf_event_ret ret;
+	struct pollfd *pfds;
+	void *buf = NULL;
+	size_t len = 0;
+	int i;
+
+	pfds = malloc(sizeof(*pfds) * num_fds);
+	if (!pfds)
+		return LIBBPF_PERF_EVENT_ERROR;
+
+	memset(pfds, 0, sizeof(*pfds) * num_fds);
+	for (i = 0; i < num_fds; i++) {
+		pfds[i].fd = fds[i];
+		pfds[i].events = POLLIN;
+	}
+
+	for (;;) {
+		poll(pfds, num_fds, 1000);
+		for (i = 0; i < num_fds; i++) {
+			if (!pfds[i].revents)
+				continue;
+
+			ret = bpf_perf_event_read_simple(headers[i],
+							 page_cnt * page_size,
+							 page_size, &buf, &len,
+							 bpf_perf_event_print,
+							 output_fn);
+			if (ret != LIBBPF_PERF_EVENT_CONT)
+				break;
+		}
+	}
+	free(buf);
+	free(pfds);
+
+	return ret;
+}
diff --git a/tools/testing/selftests/bpf/trace_helpers.h b/tools/testing/selftests/bpf/trace_helpers.h
index 3b4bcf7f5084..18924f23db1b 100644
--- a/tools/testing/selftests/bpf/trace_helpers.h
+++ b/tools/testing/selftests/bpf/trace_helpers.h
@@ -3,6 +3,7 @@
 #define __TRACE_HELPER_H
 
 #include <libbpf.h>
+#include <linux/perf_event.h>
 
 struct ksym {
 	long addr;
@@ -16,6 +17,9 @@ long ksym_get_addr(const char *name);
 typedef enum bpf_perf_event_ret (*perf_event_print_fn)(void *data, int size);
 
 int perf_event_mmap(int fd);
+int perf_event_mmap_header(int fd, struct perf_event_mmap_page **header);
 /* return LIBBPF_PERF_EVENT_DONE or LIBBPF_PERF_EVENT_ERROR */
 int perf_event_poller(int fd, perf_event_print_fn output_fn);
+int perf_event_poller_multi(int *fds, struct perf_event_mmap_page **headers,
+			    int num_fds, perf_event_print_fn output_fn);
 #endif

^ permalink raw reply related

* [PATCH] bnx2x: use the right constant
From: Julia Lawall @ 2018-06-06 13:03 UTC (permalink / raw)
  To: Ariel Elior
  Cc: kernel-janitors, everest-linux-l2, David S. Miller, netdev,
	linux-kernel

Nearby code that also tests port suggests that the P0 constant should be
used when port is zero.

The semantic match that finds this problem is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
expression e,e1;
@@

* e ? e1 : e1
// </smpl>


Fixes: 6c3218c6f7e5 ("bnx2x: Adjust ETS to 578xx")
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>

---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c
index 7dd83d0..22243c4 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c
@@ -588,7 +588,7 @@ static void bnx2x_ets_e3b0_nig_disabled(const struct link_params *params,
 	 * slots for the highest priority.
 	 */
 	REG_WR(bp, (port) ? NIG_REG_P1_TX_ARB_NUM_STRICT_ARB_SLOTS :
-		   NIG_REG_P1_TX_ARB_NUM_STRICT_ARB_SLOTS, 0x100);
+		   NIG_REG_P0_TX_ARB_NUM_STRICT_ARB_SLOTS, 0x100);
 	/* Mapping between the CREDIT_WEIGHT registers and actual client
 	 * numbers
 	 */

^ permalink raw reply related

* [PATCH V2 net-next 0/3] Bug fixes & optimization for HNS3 Driver
From: Salil Mehta @ 2018-06-06 13:07 UTC (permalink / raw)
  To: davem
  Cc: salil.mehta, yisen.zhuang, lipeng321, mehta.salil, netdev,
	linux-kernel, linuxarm

This patch-set presents miscellaneous bug fixes and an optimization
for HNS3 driver

V1->V2:
	* Fixes the compilation break reported by David Miller & Kbuild

Xi Wang (3):
  net: hns3: Fix for VF mailbox cannot receiving PF response
  net: hns3: Fix for VF mailbox receiving unknown message
  net: hns3: Optimize PF CMDQ interrupt switching process

 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c    | 13 ++++++++++++
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c  |  3 +++
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c   | 23 +++++++++++++++++++---
 3 files changed, 36 insertions(+), 3 deletions(-)

-- 
2.7.4

^ permalink raw reply

* [PATCH V2 net-next 1/3] net: hns3: Fix for VF mailbox cannot receiving PF response
From: Salil Mehta @ 2018-06-06 13:07 UTC (permalink / raw)
  To: davem
  Cc: salil.mehta, yisen.zhuang, lipeng321, mehta.salil, netdev,
	linux-kernel, linuxarm, Xi Wang
In-Reply-To: <20180606130753.54428-1-salil.mehta@huawei.com>

From: Xi Wang <wangxi11@huawei.com>

When the VF frequently switches the CMDQ interrupt, if the CMDQ_SRC is not
cleared, the VF will not receive the new PF response after the interrupt
is re-enabled, the corresponding log is as follows:

[  317.482222] hns3 0000:00:03.0: VF could not get mbx resp(=0) from PF
in 500 tries
[  317.483137] hns3 0000:00:03.0: VF request to get tqp info from PF
failed -5

This patch fixes this problem by clearing CMDQ_SRC before enabling
interrupt and syncing pending IRQ handlers after disabling interrupt.

Fixes: e2cb1dec9779 ("net: hns3: Add HNS3 VF HCL(Hardware Compatibility Layer) Support")
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
index dd8e8e6..d55ee9c 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
@@ -1576,6 +1576,8 @@ static int hclgevf_misc_irq_init(struct hclgevf_dev *hdev)
 		return ret;
 	}
 
+	hclgevf_clear_event_cause(hdev, 0);
+
 	/* enable misc. vector(vector 0) */
 	hclgevf_enable_vector(&hdev->misc_vector, true);
 
@@ -1586,6 +1588,7 @@ static void hclgevf_misc_irq_uninit(struct hclgevf_dev *hdev)
 {
 	/* disable misc vector(vector 0) */
 	hclgevf_enable_vector(&hdev->misc_vector, false);
+	synchronize_irq(hdev->misc_vector.vector_irq);
 	free_irq(hdev->misc_vector.vector_irq, hdev);
 	hclgevf_free_vector(hdev, 0);
 }
-- 
2.7.4

^ permalink raw reply related

* [PATCH V2 net-next 2/3] net: hns3: Fix for VF mailbox receiving unknown message
From: Salil Mehta @ 2018-06-06 13:07 UTC (permalink / raw)
  To: davem
  Cc: salil.mehta, yisen.zhuang, lipeng321, mehta.salil, netdev,
	linux-kernel, linuxarm, Xi Wang
In-Reply-To: <20180606130753.54428-1-salil.mehta@huawei.com>

From: Xi Wang <wangxi11@huawei.com>

Before the firmware updates the crq's tail pointer, if the VF driver
reads the data in the crq, the data may be incomplete at this time,
which will lead to the driver read an unknown message.

This patch fixes it by checking if crq is empty before reading the
message.

Fixes: b11a0bb231f3 ("net: hns3: Add mailbox support to VF driver")
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
Patch V2: Fixes the compilation break reported David Miller & Kbuild
   Link: https://lkml.org/lkml/2018/6/5/866
         https://lkml.org/lkml/2018/6/6/147
Patch V1: Initial Submit
---
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c   | 23 +++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c
index a286184..b598c06 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c
@@ -126,6 +126,13 @@ int hclgevf_send_mbx_msg(struct hclgevf_dev *hdev, u16 code, u16 subcode,
 	return status;
 }
 
+static bool hclgevf_cmd_crq_empty(struct hclgevf_hw *hw)
+{
+	u32 tail = hclgevf_read_dev(hw, HCLGEVF_NIC_CRQ_TAIL_REG);
+
+	return tail == hw->cmq.crq.next_to_use;
+}
+
 void hclgevf_mbx_handler(struct hclgevf_dev *hdev)
 {
 	struct hclgevf_mbx_resp_status *resp;
@@ -140,11 +147,22 @@ void hclgevf_mbx_handler(struct hclgevf_dev *hdev)
 	resp = &hdev->mbx_resp;
 	crq = &hdev->hw.cmq.crq;
 
-	flag = le16_to_cpu(crq->desc[crq->next_to_use].flag);
-	while (hnae_get_bit(flag, HCLGEVF_CMDQ_RX_OUTVLD_B)) {
+	while (!hclgevf_cmd_crq_empty(&hdev->hw)) {
 		desc = &crq->desc[crq->next_to_use];
 		req = (struct hclge_mbx_pf_to_vf_cmd *)desc->data;
 
+		flag = le16_to_cpu(crq->desc[crq->next_to_use].flag);
+		if (unlikely(!hnae_get_bit(flag, HCLGEVF_CMDQ_RX_OUTVLD_B))) {
+			dev_warn(&hdev->pdev->dev,
+				 "dropped invalid mailbox message, code = %d\n",
+				 req->msg[0]);
+
+			/* dropping/not processing this invalid message */
+			crq->desc[crq->next_to_use].flag = 0;
+			hclge_mbx_ring_ptr_move_crq(crq);
+			continue;
+		}
+
 		/* synchronous messages are time critical and need preferential
 		 * treatment. Therefore, we need to acknowledge all the sync
 		 * responses as quickly as possible so that waiting tasks do not
@@ -205,7 +223,6 @@ void hclgevf_mbx_handler(struct hclgevf_dev *hdev)
 		}
 		crq->desc[crq->next_to_use].flag = 0;
 		hclge_mbx_ring_ptr_move_crq(crq);
-		flag = le16_to_cpu(crq->desc[crq->next_to_use].flag);
 	}
 
 	/* Write back CMDQ_RQ header pointer, M7 need this pointer */
-- 
2.7.4

^ permalink raw reply related

* [PATCH V2 net-next 3/3] net: hns3: Optimize PF CMDQ interrupt switching process
From: Salil Mehta @ 2018-06-06 13:07 UTC (permalink / raw)
  To: davem
  Cc: salil.mehta, yisen.zhuang, lipeng321, mehta.salil, netdev,
	linux-kernel, linuxarm, Xi Wang
In-Reply-To: <20180606130753.54428-1-salil.mehta@huawei.com>

From: Xi Wang <wangxi11@huawei.com>

When the PF frequently switches the CMDQ interrupt, if the CMDQ_SRC is
not cleared before the hardware interrupt is generated, the new interrupt
will not be reported.

This patch optimizes this problem by clearing CMDQ_SRC and RESET_STS
before enabling interrupt and syncing pending IRQ handlers after disabling
interrupt.

Fixes: 466b0c00391b ("net: hns3: Add support for misc interrupt")
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 2a80134..d318d35 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -2557,6 +2557,15 @@ static void hclge_clear_event_cause(struct hclge_dev *hdev, u32 event_type,
 	}
 }
 
+static void hclge_clear_all_event_cause(struct hclge_dev *hdev)
+{
+	hclge_clear_event_cause(hdev, HCLGE_VECTOR0_EVENT_RST,
+				BIT(HCLGE_VECTOR0_GLOBALRESET_INT_B) |
+				BIT(HCLGE_VECTOR0_CORERESET_INT_B) |
+				BIT(HCLGE_VECTOR0_IMPRESET_INT_B));
+	hclge_clear_event_cause(hdev, HCLGE_VECTOR0_EVENT_MBX, 0);
+}
+
 static void hclge_enable_vector(struct hclge_misc_vector *vector, bool enable)
 {
 	writel(enable ? 1 : 0, vector->addr);
@@ -5688,6 +5697,8 @@ static int hclge_init_ae_dev(struct hnae3_ae_dev *ae_dev)
 	INIT_WORK(&hdev->rst_service_task, hclge_reset_service_task);
 	INIT_WORK(&hdev->mbx_service_task, hclge_mailbox_service_task);
 
+	hclge_clear_all_event_cause(hdev);
+
 	/* Enable MISC vector(vector0) */
 	hclge_enable_vector(&hdev->misc_vector, true);
 
@@ -5817,6 +5828,8 @@ static void hclge_uninit_ae_dev(struct hnae3_ae_dev *ae_dev)
 
 	/* Disable MISC vector(vector0) */
 	hclge_enable_vector(&hdev->misc_vector, false);
+	synchronize_irq(hdev->misc_vector.vector_irq);
+
 	hclge_destroy_cmd_queue(&hdev->hw);
 	hclge_misc_irq_uninit(hdev);
 	hclge_pci_uninit(hdev);
-- 
2.7.4

^ permalink raw reply related

* RE: [PATCH net-next 2/3] net: hns3: Fix for VF mailbox receiving unknown message
From: Salil Mehta @ 2018-06-06 13:10 UTC (permalink / raw)
  To: David Miller
  Cc: Zhuangyuzeng (Yisen), lipeng (Y), mehta.salil@opnsrc.net,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Linuxarm,
	wangxi (M)
In-Reply-To: <20180605.144046.766561700879427040.davem@davemloft.net>

Hi Dave,
Sorry about this. I have fixed this in the V2 patch floated just now.

Best regards
Salil.
> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Tuesday, June 05, 2018 7:41 PM
> To: Salil Mehta <salil.mehta@huawei.com>
> Cc: Zhuangyuzeng (Yisen) <yisen.zhuang@huawei.com>; lipeng (Y)
> <lipeng321@huawei.com>; mehta.salil@opnsrc.net; netdev@vger.kernel.org;
> linux-kernel@vger.kernel.org; Linuxarm <linuxarm@huawei.com>; wangxi
> (M) <wangxi11@huawei.com>
> Subject: Re: [PATCH net-next 2/3] net: hns3: Fix for VF mailbox
> receiving unknown message
> 
> From: Salil Mehta <salil.mehta@huawei.com>
> Date: Tue, 5 Jun 2018 12:42:00 +0100
> 
> > +		if (unlikely(!hnae3_get_bit(flag,
> HCLGEVF_CMDQ_RX_OUTVLD_B))) {
> 
> This breaks the build, there is no such symbol named hnae3_get_bit().

^ permalink raw reply

* [PATCH net] kcm: fix races on sk_receive_queue
From: Paolo Abeni @ 2018-06-06 13:16 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Tom Herbert, Kirill Tkhai

KCM removes the packets from sk_receive_queue in requeue_rx_msgs()

without acquiring any lock. Moreover, in R() when the MSG_PEEK
flag is not present, the skb is peeked and dequeued with two
separate, non-atomic, calls.

The above create room for races, which SYZBOT has been able to
exploit, causing list corruption and kernel oops:

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
    (ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 8484 Comm: syz-executor919 Not tainted 4.17.0-rc7+ #74
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:__skb_unlink include/linux/skbuff.h:1844 [inline]
RIP: 0010:skb_unlink+0xc1/0x160 net/core/skbuff.c:2921
RSP: 0018:ffff8801d012f6f0 EFLAGS: 00010002
RAX: 0000000000000286 RBX: ffff8801d6e073c0 RCX: 0000000000000001
RDX: dffffc0000000000 RSI: 0000000000000004 RDI: 0000000000000008
RBP: ffff8801d012f718 R08: ffffed0038bb3b6d R09: ffffed0038bb3b6c
R10: ffffed0038bb3b6c R11: ffff8801c5d9db63 R12: 0000000000000000
R13: 0000000000000000 R14: ffff8801c5d9db60 R15: ffff8801d012fce0
FS:  0000000000ab7880(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020e5b000 CR3: 00000001c31fb000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  kcm_recvmsg+0x48d/0x590 net/kcm/kcmsock.c:1160
  sock_recvmsg_nosec+0x8c/0xb0 net/socket.c:802
  ___sys_recvmsg+0x2b6/0x680 net/socket.c:2279
  __sys_recvmmsg+0x2f9/0xb80 net/socket.c:2391
  do_sys_recvmmsg+0xe4/0x190 net/socket.c:2472
  __do_sys_recvmmsg net/socket.c:2485 [inline]
  __se_sys_recvmmsg net/socket.c:2481 [inline]
  __x64_sys_recvmmsg+0xbe/0x150 net/socket.c:2481
  do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4417a9
RSP: 002b:00007ffe27282838 EFLAGS: 00000206 ORIG_RAX: 000000000000012b
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00000000004417a9
RDX: 00000000040000f7 RSI: 00000000200002c0 RDI: 0000000000000006
RBP: 0000000000000000 R08: 0000000020000200 R09: 00007ffe272829f8
R10: 0000000000000060 R11: 0000000000000206 R12: 00000000000001f3
R13: 000000000001f871 R14: 0000000000000000 R15: 0000000000000000
Code: 00 00 00 49 8d 7d 08 4c 8b 63 08 48 ba 00 00 00 00 00 fc ff df 48 c7
43 08 00 00 00 00 48 89 f9 48 c7 03 00 00 00 00 48 c1 e9 03 <80> 3c 11 00
75 5b 4c 89 e1 4d 89 65 08 48 ba 00 00 00 00 00 fc
RIP: __skb_unlink include/linux/skbuff.h:1844 [inline] RSP: ffff8801d012f6f0
RIP: skb_unlink+0xc1/0x160 net/core/skbuff.c:2921 RSP: ffff8801d012f6f0

To fix the above, we need to use the locked version of the socket dequeue
helper in requeue_rx_msgs() and kcm_wait_data is changed to dequeue
the available skb when not peeking.

RFC -> v1:
 - use skb_dequeue(), as suggested by Tom
 - explicitly close the race between skb_peek and skb_unlink

Fixes: ab7ac4eb9832 ("kcm: Kernel Connection Multiplexor module")
Reported-and-tested-by: syzbot+278279efdd2730dd14bf@syzkaller.appspotmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
This is an RFC, since I'm really new to this area, anyway the syzport
reported success in testing the proposed fix.
This is very likely a scenario where the upcoming skb->prev,next -> list_head
conversion would have helped a lot, thanks to list poisoning and list debug
---
 net/kcm/kcmsock.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
index d3601d421571..dd2d02bb35ae 100644
--- a/net/kcm/kcmsock.c
+++ b/net/kcm/kcmsock.c
@@ -223,7 +223,7 @@ static void requeue_rx_msgs(struct kcm_mux *mux, struct sk_buff_head *head)
 	struct sk_buff *skb;
 	struct kcm_sock *kcm;
 
-	while ((skb = __skb_dequeue(head))) {
+	while ((skb = skb_dequeue(head))) {
 		/* Reset destructor to avoid calling kcm_rcv_ready */
 		skb->destructor = sock_rfree;
 		skb_orphan(skb);
@@ -1080,12 +1080,17 @@ static int kcm_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
 	return err;
 }
 
-static struct sk_buff *kcm_wait_data(struct sock *sk, int flags,
+static struct sk_buff *kcm_wait_data(struct sock *sk, int flags, bool peek,
 				     long timeo, int *err)
 {
 	struct sk_buff *skb;
 
-	while (!(skb = skb_peek(&sk->sk_receive_queue))) {
+	for (;; ) {
+		skb = peek ? skb_peek(&sk->sk_receive_queue) :
+			     skb_dequeue(&sk->sk_receive_queue);
+		if (skb)
+			break;
+
 		if (sk->sk_err) {
 			*err = sock_error(sk);
 			return NULL;
@@ -1116,6 +1121,7 @@ static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
 {
 	struct sock *sk = sock->sk;
 	struct kcm_sock *kcm = kcm_sk(sk);
+	bool peek = flags & MSG_PEEK;
 	int err = 0;
 	long timeo;
 	struct strp_msg *stm;
@@ -1126,7 +1132,7 @@ static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
 
 	lock_sock(sk);
 
-	skb = kcm_wait_data(sk, flags, timeo, &err);
+	skb = kcm_wait_data(sk, flags, peek, timeo, &err);
 	if (!skb)
 		goto out;
 
@@ -1142,7 +1148,7 @@ static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
 		goto out;
 
 	copied = len;
-	if (likely(!(flags & MSG_PEEK))) {
+	if (likely(!peek)) {
 		KCM_STATS_ADD(kcm->stats.rx_bytes, copied);
 		if (copied < stm->full_len) {
 			if (sock->type == SOCK_DGRAM) {
@@ -1157,7 +1163,6 @@ static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
 			/* Finished with message */
 			msg->msg_flags |= MSG_EOR;
 			KCM_STATS_INCR(kcm->stats.rx_msgs);
-			skb_unlink(skb, &sk->sk_receive_queue);
 			kfree_skb(skb);
 		}
 	}
@@ -1186,7 +1191,7 @@ static ssize_t kcm_splice_read(struct socket *sock, loff_t *ppos,
 
 	lock_sock(sk);
 
-	skb = kcm_wait_data(sk, flags, timeo, &err);
+	skb = kcm_wait_data(sk, flags, true, timeo, &err);
 	if (!skb)
 		goto err_out;
 
-- 
2.17.1

^ permalink raw reply related

* Re: [PATCH net] kcm: fix races on sk_receive_queue
From: Kirill Tkhai @ 2018-06-06 13:28 UTC (permalink / raw)
  To: Paolo Abeni, netdev; +Cc: David S. Miller, Tom Herbert
In-Reply-To: <628e0398546aefabd68669450621909d269e1ba8.1528289745.git.pabeni@redhat.com>

On 06.06.2018 16:16, Paolo Abeni wrote:
> KCM removes the packets from sk_receive_queue in requeue_rx_msgs()
> 
> without acquiring any lock. Moreover, in R() when the MSG_PEEK
> flag is not present, the skb is peeked and dequeued with two
> separate, non-atomic, calls.
> 
> The above create room for races, which SYZBOT has been able to
> exploit, causing list corruption and kernel oops:
> 
> kasan: CONFIG_KASAN_INLINE enabled
> kasan: GPF could be caused by NULL-ptr deref or user memory access
> general protection fault: 0000 [#1] SMP KASAN
> Dumping ftrace buffer:
>     (ftrace buffer empty)
> Modules linked in:
> CPU: 0 PID: 8484 Comm: syz-executor919 Not tainted 4.17.0-rc7+ #74
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> RIP: 0010:__skb_unlink include/linux/skbuff.h:1844 [inline]
> RIP: 0010:skb_unlink+0xc1/0x160 net/core/skbuff.c:2921
> RSP: 0018:ffff8801d012f6f0 EFLAGS: 00010002
> RAX: 0000000000000286 RBX: ffff8801d6e073c0 RCX: 0000000000000001
> RDX: dffffc0000000000 RSI: 0000000000000004 RDI: 0000000000000008
> RBP: ffff8801d012f718 R08: ffffed0038bb3b6d R09: ffffed0038bb3b6c
> R10: ffffed0038bb3b6c R11: ffff8801c5d9db63 R12: 0000000000000000
> R13: 0000000000000000 R14: ffff8801c5d9db60 R15: ffff8801d012fce0
> FS:  0000000000ab7880(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000020e5b000 CR3: 00000001c31fb000 CR4: 00000000001406f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
>   kcm_recvmsg+0x48d/0x590 net/kcm/kcmsock.c:1160
>   sock_recvmsg_nosec+0x8c/0xb0 net/socket.c:802
>   ___sys_recvmsg+0x2b6/0x680 net/socket.c:2279
>   __sys_recvmmsg+0x2f9/0xb80 net/socket.c:2391
>   do_sys_recvmmsg+0xe4/0x190 net/socket.c:2472
>   __do_sys_recvmmsg net/socket.c:2485 [inline]
>   __se_sys_recvmmsg net/socket.c:2481 [inline]
>   __x64_sys_recvmmsg+0xbe/0x150 net/socket.c:2481
>   do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x4417a9
> RSP: 002b:00007ffe27282838 EFLAGS: 00000206 ORIG_RAX: 000000000000012b
> RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00000000004417a9
> RDX: 00000000040000f7 RSI: 00000000200002c0 RDI: 0000000000000006
> RBP: 0000000000000000 R08: 0000000020000200 R09: 00007ffe272829f8
> R10: 0000000000000060 R11: 0000000000000206 R12: 00000000000001f3
> R13: 000000000001f871 R14: 0000000000000000 R15: 0000000000000000
> Code: 00 00 00 49 8d 7d 08 4c 8b 63 08 48 ba 00 00 00 00 00 fc ff df 48 c7
> 43 08 00 00 00 00 48 89 f9 48 c7 03 00 00 00 00 48 c1 e9 03 <80> 3c 11 00
> 75 5b 4c 89 e1 4d 89 65 08 48 ba 00 00 00 00 00 fc
> RIP: __skb_unlink include/linux/skbuff.h:1844 [inline] RSP: ffff8801d012f6f0
> RIP: skb_unlink+0xc1/0x160 net/core/skbuff.c:2921 RSP: ffff8801d012f6f0
> 
> To fix the above, we need to use the locked version of the socket dequeue
> helper in requeue_rx_msgs() and kcm_wait_data is changed to dequeue
> the available skb when not peeking.
> 
> RFC -> v1:
>  - use skb_dequeue(), as suggested by Tom
>  - explicitly close the race between skb_peek and skb_unlink
> 
> Fixes: ab7ac4eb9832 ("kcm: Kernel Connection Multiplexor module")
> Reported-and-tested-by: syzbot+278279efdd2730dd14bf@syzkaller.appspotmail.com
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> This is an RFC, since I'm really new to this area, anyway the syzport
> reported success in testing the proposed fix.
> This is very likely a scenario where the upcoming skb->prev,next -> list_head
> conversion would have helped a lot, thanks to list poisoning and list debug
> ---
>  net/kcm/kcmsock.c | 19 ++++++++++++-------
>  1 file changed, 12 insertions(+), 7 deletions(-)
> 
> diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
> index d3601d421571..dd2d02bb35ae 100644
> --- a/net/kcm/kcmsock.c
> +++ b/net/kcm/kcmsock.c
> @@ -223,7 +223,7 @@ static void requeue_rx_msgs(struct kcm_mux *mux, struct sk_buff_head *head)
>  	struct sk_buff *skb;
>  	struct kcm_sock *kcm;
>  
> -	while ((skb = __skb_dequeue(head))) {
> +	while ((skb = skb_dequeue(head))) {

I try to find how the patch protects against the following race:

requeue_rx_msgs()         kcm_recvmsg()
  skb = skb_dequeue()       skb = kcm_wait_data(peek = true)
  ...                       ...
free skb                    ...
...                         skb_copy_datagram_msg(skb) <--- Use after free?

Isn't there possible a use-after-free?

Thanks,
Kirill

>  		/* Reset destructor to avoid calling kcm_rcv_ready */
>  		skb->destructor = sock_rfree;
>  		skb_orphan(skb);
> @@ -1080,12 +1080,17 @@ static int kcm_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
>  	return err;
>  }
>  
> -static struct sk_buff *kcm_wait_data(struct sock *sk, int flags,
> +static struct sk_buff *kcm_wait_data(struct sock *sk, int flags, bool peek,
>  				     long timeo, int *err)
>  {
>  	struct sk_buff *skb;
>  
> -	while (!(skb = skb_peek(&sk->sk_receive_queue))) {
> +	for (;; ) {
> +		skb = peek ? skb_peek(&sk->sk_receive_queue) :
> +			     skb_dequeue(&sk->sk_receive_queue);
> +		if (skb)
> +			break;
> +
>  		if (sk->sk_err) {
>  			*err = sock_error(sk);
>  			return NULL;
> @@ -1116,6 +1121,7 @@ static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
>  {
>  	struct sock *sk = sock->sk;
>  	struct kcm_sock *kcm = kcm_sk(sk);
> +	bool peek = flags & MSG_PEEK;
>  	int err = 0;
>  	long timeo;
>  	struct strp_msg *stm;
> @@ -1126,7 +1132,7 @@ static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
>  
>  	lock_sock(sk);
>  
> -	skb = kcm_wait_data(sk, flags, timeo, &err);
> +	skb = kcm_wait_data(sk, flags, peek, timeo, &err);
>  	if (!skb)
>  		goto out;
>  
> @@ -1142,7 +1148,7 @@ static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
>  		goto out;
>  
>  	copied = len;
> -	if (likely(!(flags & MSG_PEEK))) {
> +	if (likely(!peek)) {
>  		KCM_STATS_ADD(kcm->stats.rx_bytes, copied);
>  		if (copied < stm->full_len) {
>  			if (sock->type == SOCK_DGRAM) {
> @@ -1157,7 +1163,6 @@ static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
>  			/* Finished with message */
>  			msg->msg_flags |= MSG_EOR;
>  			KCM_STATS_INCR(kcm->stats.rx_msgs);
> -			skb_unlink(skb, &sk->sk_receive_queue);
>  			kfree_skb(skb);
>  		}
>  	}
> @@ -1186,7 +1191,7 @@ static ssize_t kcm_splice_read(struct socket *sock, loff_t *ppos,
>  
>  	lock_sock(sk);
>  
> -	skb = kcm_wait_data(sk, flags, timeo, &err);
> +	skb = kcm_wait_data(sk, flags, true, timeo, &err);
>  	if (!skb)
>  		goto err_out;
>  
> 

^ permalink raw reply

* WARNING: refcount bug in smc_tcp_listen_work
From: syzbot @ 2018-06-06 13:32 UTC (permalink / raw)
  To: davem, linux-kernel, linux-s390, netdev, syzkaller-bugs, ubraun

Hello,

syzbot found the following crash on:

HEAD commit:    75d4e704fa8d netdev-FAQ: clarify DaveM's position for stab..
git tree:       net-next
console output: https://syzkaller.appspot.com/x/log.txt?x=1632a6f7800000
kernel config:  https://syzkaller.appspot.com/x/.config?x=7b76ec81cd4aad4b
dashboard link: https://syzkaller.appspot.com/bug?extid=9e60d2428a42049a592a
compiler:       gcc (GCC) 8.0.1 20180413 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+9e60d2428a42049a592a@syzkaller.appspotmail.com

A link change request failed with some changes committed already. Interface  
bridge0 may have been left with an inconsistent configuration, please check.
------------[ cut here ]------------
refcount_t: underflow; use-after-free.
WARNING: CPU: 0 PID: 23870 at lib/refcount.c:187  
refcount_sub_and_test+0x2d3/0x330 lib/refcount.c:187
Kernel panic - not syncing: panic_on_warn set ...

CPU: 0 PID: 23870 Comm: kworker/0:4 Not tainted 4.17.0-rc7+ #79
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Workqueue: events smc_tcp_listen_work
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
  panic+0x22f/0x4de kernel/panic.c:184
  __warn.cold.8+0x163/0x1b3 kernel/panic.c:536
  report_bug+0x252/0x2d0 lib/bug.c:186
  fixup_bug arch/x86/kernel/traps.c:178 [inline]
  do_error_trap+0x1de/0x490 arch/x86/kernel/traps.c:296
  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
  invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992
RIP: 0010:refcount_sub_and_test+0x2d3/0x330 lib/refcount.c:187
RSP: 0018:ffff880194aaf4f8 EFLAGS: 00010286
RAX: 0000000000000026 RBX: 0000000000000000 RCX: ffffffff8160b8ad
RDX: 0000000000000000 RSI: ffffffff81610561 RDI: ffff880194aaf058
RBP: ffff880194aaf5e0 R08: ffff88014148e500 R09: 0000000000000006
R10: ffff88014148e500 R11: 0000000000000000 R12: 00000000ffffffff
R13: ffff880194aaf5b8 R14: 0000000000000001 R15: ffff880194aaf728
  refcount_dec_and_test+0x1a/0x20 lib/refcount.c:212
  sock_put include/net/sock.h:1668 [inline]
  smc_tcp_listen_work+0xb94/0xec0 net/smc/af_smc.c:1073
  process_one_work+0xc1e/0x1b50 kernel/workqueue.c:2145
  worker_thread+0x1cc/0x1440 kernel/workqueue.c:2279
  kthread+0x345/0x410 kernel/kthread.c:240
  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
Dumping ftrace buffer:
    (ftrace buffer empty)
Kernel Offset: disabled
Rebooting in 86400 seconds..


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with  
syzbot.

^ permalink raw reply

* Re: [PATCH] netfilter: provide udp*_lib_lookup for nf_tproxy
From: Pablo Neira Ayuso @ 2018-06-06 13:33 UTC (permalink / raw)
  To: David Miller
  Cc: arnd, kuznet, yoshfuji, ecklm94, pabeni, willemb, edumazet,
	dsahern, kafai, netdev, linux-kernel
In-Reply-To: <20180605.105453.339908802413146875.davem@davemloft.net>

On Tue, Jun 05, 2018 at 10:54:53AM -0400, David Miller wrote:
> From: Arnd Bergmann <arnd@arndb.de>
> Date: Tue,  5 Jun 2018 13:40:34 +0200
> 
> > It is now possible to enable the libified nf_tproxy modules without
> > also enabling NETFILTER_XT_TARGET_TPROXY, which throws off the
> > ifdef logic in the udp core code:
> > 
> > net/ipv6/netfilter/nf_tproxy_ipv6.o: In function `nf_tproxy_get_sock_v6':
> > nf_tproxy_ipv6.c:(.text+0x1a8): undefined reference to `udp6_lib_lookup'
> > net/ipv4/netfilter/nf_tproxy_ipv4.o: In function `nf_tproxy_get_sock_v4':
> > nf_tproxy_ipv4.c:(.text+0x3d0): undefined reference to `udp4_lib_lookup'
> > 
> > We can actually simplify the conditions now to provide the two functions
> > exactly when they are needed.
> > 
> > Fixes: 45ca4e0cf273 ("netfilter: Libify xt_TPROXY")
> > Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> 
> Pablo, I'm going to apply this directly to fix the link failure.

BTW, could you also pick this fix for net-next?

http://patchwork.ozlabs.org/patch/924706/

Thanks.

^ permalink raw reply

* Re: [PATCH net] kcm: fix races on sk_receive_queue
From: Paolo Abeni @ 2018-06-06 13:46 UTC (permalink / raw)
  To: Kirill Tkhai, netdev; +Cc: David S. Miller, Tom Herbert
In-Reply-To: <f79f8b6f-c137-1eb7-1161-2264c270c025@virtuozzo.com>

On Wed, 2018-06-06 at 16:28 +0300, Kirill Tkhai wrote:
> On 06.06.2018 16:16, Paolo Abeni wrote:
> > KCM removes the packets from sk_receive_queue in requeue_rx_msgs()
> > 
> > without acquiring any lock. Moreover, in R() when the MSG_PEEK
> > flag is not present, the skb is peeked and dequeued with two
> > separate, non-atomic, calls.
> > 
> > The above create room for races, which SYZBOT has been able to
> > exploit, causing list corruption and kernel oops:
> > 
> > kasan: CONFIG_KASAN_INLINE enabled
> > kasan: GPF could be caused by NULL-ptr deref or user memory access
> > general protection fault: 0000 [#1] SMP KASAN
> > Dumping ftrace buffer:
> >     (ftrace buffer empty)
> > Modules linked in:
> > CPU: 0 PID: 8484 Comm: syz-executor919 Not tainted 4.17.0-rc7+ #74
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > Google 01/01/2011
> > RIP: 0010:__skb_unlink include/linux/skbuff.h:1844 [inline]
> > RIP: 0010:skb_unlink+0xc1/0x160 net/core/skbuff.c:2921
> > RSP: 0018:ffff8801d012f6f0 EFLAGS: 00010002
> > RAX: 0000000000000286 RBX: ffff8801d6e073c0 RCX: 0000000000000001
> > RDX: dffffc0000000000 RSI: 0000000000000004 RDI: 0000000000000008
> > RBP: ffff8801d012f718 R08: ffffed0038bb3b6d R09: ffffed0038bb3b6c
> > R10: ffffed0038bb3b6c R11: ffff8801c5d9db63 R12: 0000000000000000
> > R13: 0000000000000000 R14: ffff8801c5d9db60 R15: ffff8801d012fce0
> > FS:  0000000000ab7880(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 0000000020e5b000 CR3: 00000001c31fb000 CR4: 00000000001406f0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > Call Trace:
> >   kcm_recvmsg+0x48d/0x590 net/kcm/kcmsock.c:1160
> >   sock_recvmsg_nosec+0x8c/0xb0 net/socket.c:802
> >   ___sys_recvmsg+0x2b6/0x680 net/socket.c:2279
> >   __sys_recvmmsg+0x2f9/0xb80 net/socket.c:2391
> >   do_sys_recvmmsg+0xe4/0x190 net/socket.c:2472
> >   __do_sys_recvmmsg net/socket.c:2485 [inline]
> >   __se_sys_recvmmsg net/socket.c:2481 [inline]
> >   __x64_sys_recvmmsg+0xbe/0x150 net/socket.c:2481
> >   do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
> >   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > RIP: 0033:0x4417a9
> > RSP: 002b:00007ffe27282838 EFLAGS: 00000206 ORIG_RAX: 000000000000012b
> > RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00000000004417a9
> > RDX: 00000000040000f7 RSI: 00000000200002c0 RDI: 0000000000000006
> > RBP: 0000000000000000 R08: 0000000020000200 R09: 00007ffe272829f8
> > R10: 0000000000000060 R11: 0000000000000206 R12: 00000000000001f3
> > R13: 000000000001f871 R14: 0000000000000000 R15: 0000000000000000
> > Code: 00 00 00 49 8d 7d 08 4c 8b 63 08 48 ba 00 00 00 00 00 fc ff df 48 c7
> > 43 08 00 00 00 00 48 89 f9 48 c7 03 00 00 00 00 48 c1 e9 03 <80> 3c 11 00
> > 75 5b 4c 89 e1 4d 89 65 08 48 ba 00 00 00 00 00 fc
> > RIP: __skb_unlink include/linux/skbuff.h:1844 [inline] RSP: ffff8801d012f6f0
> > RIP: skb_unlink+0xc1/0x160 net/core/skbuff.c:2921 RSP: ffff8801d012f6f0
> > 
> > To fix the above, we need to use the locked version of the socket dequeue
> > helper in requeue_rx_msgs() and kcm_wait_data is changed to dequeue
> > the available skb when not peeking.
> > 
> > RFC -> v1:
> >  - use skb_dequeue(), as suggested by Tom
> >  - explicitly close the race between skb_peek and skb_unlink
> > 
> > Fixes: ab7ac4eb9832 ("kcm: Kernel Connection Multiplexor module")
> > Reported-and-tested-by: syzbot+278279efdd2730dd14bf@syzkaller.appspotmail.com
> > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > ---
> > This is an RFC, since I'm really new to this area, anyway the syzport
> > reported success in testing the proposed fix.
> > This is very likely a scenario where the upcoming skb->prev,next -> list_head
> > conversion would have helped a lot, thanks to list poisoning and list debug
> > ---
> >  net/kcm/kcmsock.c | 19 ++++++++++++-------
> >  1 file changed, 12 insertions(+), 7 deletions(-)
> > 
> > diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
> > index d3601d421571..dd2d02bb35ae 100644
> > --- a/net/kcm/kcmsock.c
> > +++ b/net/kcm/kcmsock.c
> > @@ -223,7 +223,7 @@ static void requeue_rx_msgs(struct kcm_mux *mux, struct sk_buff_head *head)
> >  	struct sk_buff *skb;
> >  	struct kcm_sock *kcm;
> >  
> > -	while ((skb = __skb_dequeue(head))) {
> > +	while ((skb = skb_dequeue(head))) {
> 
> I try to find how the patch protects against the following race:
> 
> requeue_rx_msgs()         kcm_recvmsg()
>   skb = skb_dequeue()       skb = kcm_wait_data(peek = true)
>   ...                       ...
> free skb                    ...
> ...                         skb_copy_datagram_msg(skb) <--- Use after free?
> 
> Isn't there possible a use-after-free?

You are right, this patch does not fix the above race: is addressing a
different one, when recvmsg() is not peeking.

The race itself is not introduced by this code, and I think a separate
patch for the the above would be better (we probably need to increment
the skb reference count while peeking and consume the skb after the
copy)

Cheers,

Paolo

^ permalink raw reply

* Make CONFIG_NET and CONFIG_SECCOMP_FILTER independent of CONFIG_NET
From: Norbert Manthey @ 2018-06-06 13:52 UTC (permalink / raw)
  To: linux-kernel, kvm, netdev, x86

Dear all,

currently, KVM and SECCOMP rely on functionality of CONFIG_NET, and hence the
latter has to be enabled when building the kernel for the first two
configurations. However, there exists scenarios where the system does not need
networking, but KVM and SECCOMP filters. To reduce the kernel image size for
these scenarios, and to be able to drop active code, this commit series allows
to enable CONFIG_KVM and CONFIG_SECCOMP_FILTER without using CONFIG_NET.

The functionality that is required for seccomp filters is kept in the same
files and - after reordering the source code - is guarded with a single ifdef
per file.

I hope these changes are useful for other scenarios than the one I currently
face.

Best,
Norbert

Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B

^ permalink raw reply

* [less-CONFIG_NET 1/7] net: reorder filter code
From: Norbert Manthey @ 2018-06-06 13:53 UTC (permalink / raw)
  Cc: Norbert Manthey, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, netdev, linux-kernel
In-Reply-To: <1528293127-23825-1-git-send-email-nmanthey@amazon.de>

This commit reorders the definition of functions and struct in the
file filter.c, such that in the next step we can easily cut the file
into a commonly used part, as well as a part that is only required in
case CONFIG_NET is actually set.

This is part of the effort to split CONFIG_SECCOMP_FILTER and
CONFIG_NET.

Signed-off-by: Norbert Manthey <nmanthey@amazon.de>
---
 net/core/filter.c | 330 +++++++++++++++++++++++++++---------------------------
 1 file changed, 165 insertions(+), 165 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 201ff36b..0d980e9 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -59,58 +59,6 @@
 #include <net/tcp.h>
 #include <linux/bpf_trace.h>
 
-/**
- *	sk_filter_trim_cap - run a packet through a socket filter
- *	@sk: sock associated with &sk_buff
- *	@skb: buffer to filter
- *	@cap: limit on how short the eBPF program may trim the packet
- *
- * Run the eBPF program and then cut skb->data to correct size returned by
- * the program. If pkt_len is 0 we toss packet. If skb->len is smaller
- * than pkt_len we keep whole skb->data. This is the socket level
- * wrapper to BPF_PROG_RUN. It returns 0 if the packet should
- * be accepted or -EPERM if the packet should be tossed.
- *
- */
-int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
-{
-	int err;
-	struct sk_filter *filter;
-
-	/*
-	 * If the skb was allocated from pfmemalloc reserves, only
-	 * allow SOCK_MEMALLOC sockets to use it as this socket is
-	 * helping free memory
-	 */
-	if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC)) {
-		NET_INC_STATS(sock_net(sk), LINUX_MIB_PFMEMALLOCDROP);
-		return -ENOMEM;
-	}
-	err = BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb);
-	if (err)
-		return err;
-
-	err = security_sock_rcv_skb(sk, skb);
-	if (err)
-		return err;
-
-	rcu_read_lock();
-	filter = rcu_dereference(sk->sk_filter);
-	if (filter) {
-		struct sock *save_sk = skb->sk;
-		unsigned int pkt_len;
-
-		skb->sk = sk;
-		pkt_len = bpf_prog_run_save_cb(filter->prog, skb);
-		skb->sk = save_sk;
-		err = pkt_len ? pskb_trim(skb, max(cap, pkt_len)) : -EPERM;
-	}
-	rcu_read_unlock();
-
-	return err;
-}
-EXPORT_SYMBOL(sk_filter_trim_cap);
-
 BPF_CALL_1(__skb_get_pay_offset, struct sk_buff *, skb)
 {
 	return skb_get_poff(skb);
@@ -165,12 +113,6 @@ BPF_CALL_0(__get_raw_cpu_id)
 	return raw_smp_processor_id();
 }
 
-static const struct bpf_func_proto bpf_get_raw_smp_processor_id_proto = {
-	.func		= __get_raw_cpu_id,
-	.gpl_only	= false,
-	.ret_type	= RET_INTEGER,
-};
-
 static u32 convert_skb_access(int skb_field, int dst_reg, int src_reg,
 			      struct bpf_insn *insn_buf)
 {
@@ -954,71 +896,6 @@ static void __bpf_prog_release(struct bpf_prog *prog)
 	}
 }
 
-static void __sk_filter_release(struct sk_filter *fp)
-{
-	__bpf_prog_release(fp->prog);
-	kfree(fp);
-}
-
-/**
- * 	sk_filter_release_rcu - Release a socket filter by rcu_head
- *	@rcu: rcu_head that contains the sk_filter to free
- */
-static void sk_filter_release_rcu(struct rcu_head *rcu)
-{
-	struct sk_filter *fp = container_of(rcu, struct sk_filter, rcu);
-
-	__sk_filter_release(fp);
-}
-
-/**
- *	sk_filter_release - release a socket filter
- *	@fp: filter to remove
- *
- *	Remove a filter from a socket and release its resources.
- */
-static void sk_filter_release(struct sk_filter *fp)
-{
-	if (refcount_dec_and_test(&fp->refcnt))
-		call_rcu(&fp->rcu, sk_filter_release_rcu);
-}
-
-void sk_filter_uncharge(struct sock *sk, struct sk_filter *fp)
-{
-	u32 filter_size = bpf_prog_size(fp->prog->len);
-
-	atomic_sub(filter_size, &sk->sk_omem_alloc);
-	sk_filter_release(fp);
-}
-
-/* try to charge the socket memory if there is space available
- * return true on success
- */
-static bool __sk_filter_charge(struct sock *sk, struct sk_filter *fp)
-{
-	u32 filter_size = bpf_prog_size(fp->prog->len);
-
-	/* same check as in sock_kmalloc() */
-	if (filter_size <= sysctl_optmem_max &&
-	    atomic_read(&sk->sk_omem_alloc) + filter_size < sysctl_optmem_max) {
-		atomic_add(filter_size, &sk->sk_omem_alloc);
-		return true;
-	}
-	return false;
-}
-
-bool sk_filter_charge(struct sock *sk, struct sk_filter *fp)
-{
-	if (!refcount_inc_not_zero(&fp->refcnt))
-		return false;
-
-	if (!__sk_filter_charge(sk, fp)) {
-		sk_filter_release(fp);
-		return false;
-	}
-	return true;
-}
-
 static struct bpf_prog *bpf_migrate_filter(struct bpf_prog *fp)
 {
 	struct sock_filter *old_prog;
@@ -1127,19 +1004,22 @@ static struct bpf_prog *bpf_prepare_filter(struct bpf_prog *fp,
 }
 
 /**
- *	bpf_prog_create - create an unattached filter
+ *	bpf_prog_create_from_user - create an unattached filter from user buffer
  *	@pfp: the unattached filter that is created
  *	@fprog: the filter program
+ *	@trans: post-classic verifier transformation handler
+ *	@save_orig: save classic BPF program
  *
- * Create a filter independent of any socket. We first run some
- * sanity checks on it to make sure it does not explode on us later.
- * If an error occurs or there is insufficient memory for the filter
- * a negative errno code is returned. On success the return is zero.
+ * This function effectively does the same as bpf_prog_create(), only
+ * that it builds up its insns buffer from user space provided buffer.
+ * It also allows for passing a bpf_aux_classic_check_t handler.
  */
-int bpf_prog_create(struct bpf_prog **pfp, struct sock_fprog_kern *fprog)
+int bpf_prog_create_from_user(struct bpf_prog **pfp, struct sock_fprog *fprog,
+			      bpf_aux_classic_check_t trans, bool save_orig)
 {
 	unsigned int fsize = bpf_classic_proglen(fprog);
 	struct bpf_prog *fp;
+	int err;
 
 	/* Make sure new filter is there and in the right amounts. */
 	if (!bpf_check_basics_ok(fprog->filter, fprog->len))
@@ -1149,44 +1029,177 @@ int bpf_prog_create(struct bpf_prog **pfp, struct sock_fprog_kern *fprog)
 	if (!fp)
 		return -ENOMEM;
 
-	memcpy(fp->insns, fprog->filter, fsize);
+	if (copy_from_user(fp->insns, fprog->filter, fsize)) {
+		__bpf_prog_free(fp);
+		return -EFAULT;
+	}
 
 	fp->len = fprog->len;
-	/* Since unattached filters are not copied back to user
-	 * space through sk_get_filter(), we do not need to hold
-	 * a copy here, and can spare us the work.
-	 */
 	fp->orig_prog = NULL;
 
+	if (save_orig) {
+		err = bpf_prog_store_orig_filter(fp, fprog);
+		if (err) {
+			__bpf_prog_free(fp);
+			return -ENOMEM;
+		}
+	}
+
 	/* bpf_prepare_filter() already takes care of freeing
 	 * memory in case something goes wrong.
 	 */
-	fp = bpf_prepare_filter(fp, NULL);
+	fp = bpf_prepare_filter(fp, trans);
 	if (IS_ERR(fp))
 		return PTR_ERR(fp);
 
 	*pfp = fp;
 	return 0;
 }
-EXPORT_SYMBOL_GPL(bpf_prog_create);
+EXPORT_SYMBOL_GPL(bpf_prog_create_from_user);
+
+void bpf_prog_destroy(struct bpf_prog *fp)
+{
+	__bpf_prog_release(fp);
+}
+EXPORT_SYMBOL_GPL(bpf_prog_destroy);
 
 /**
- *	bpf_prog_create_from_user - create an unattached filter from user buffer
+ *	sk_filter_trim_cap - run a packet through a socket filter
+ *	@sk: sock associated with &sk_buff
+ *	@skb: buffer to filter
+ *	@cap: limit on how short the eBPF program may trim the packet
+ *
+ * Run the eBPF program and then cut skb->data to correct size returned by
+ * the program. If pkt_len is 0 we toss packet. If skb->len is smaller
+ * than pkt_len we keep whole skb->data. This is the socket level
+ * wrapper to BPF_PROG_RUN. It returns 0 if the packet should
+ * be accepted or -EPERM if the packet should be tossed.
+ *
+ */
+int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
+{
+	int err;
+	struct sk_filter *filter;
+
+	/*
+	 * If the skb was allocated from pfmemalloc reserves, only
+	 * allow SOCK_MEMALLOC sockets to use it as this socket is
+	 * helping free memory
+	 */
+	if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC)) {
+		NET_INC_STATS(sock_net(sk), LINUX_MIB_PFMEMALLOCDROP);
+		return -ENOMEM;
+	}
+	err = BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb);
+	if (err)
+		return err;
+
+	err = security_sock_rcv_skb(sk, skb);
+	if (err)
+		return err;
+
+	rcu_read_lock();
+	filter = rcu_dereference(sk->sk_filter);
+	if (filter) {
+		struct sock *save_sk = skb->sk;
+		unsigned int pkt_len;
+
+		skb->sk = sk;
+		pkt_len = bpf_prog_run_save_cb(filter->prog, skb);
+		skb->sk = save_sk;
+		err = pkt_len ? pskb_trim(skb, max(cap, pkt_len)) : -EPERM;
+	}
+	rcu_read_unlock();
+
+	return err;
+}
+EXPORT_SYMBOL(sk_filter_trim_cap);
+
+static const struct bpf_func_proto bpf_get_raw_smp_processor_id_proto = {
+	.func		= __get_raw_cpu_id,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+};
+
+static void __sk_filter_release(struct sk_filter *fp)
+{
+	__bpf_prog_release(fp->prog);
+	kfree(fp);
+}
+
+/**
+ * 	sk_filter_release_rcu - Release a socket filter by rcu_head
+ *	@rcu: rcu_head that contains the sk_filter to free
+ */
+static void sk_filter_release_rcu(struct rcu_head *rcu)
+{
+	struct sk_filter *fp = container_of(rcu, struct sk_filter, rcu);
+
+	__sk_filter_release(fp);
+}
+
+/**
+ *	sk_filter_release - release a socket filter
+ *	@fp: filter to remove
+ *
+ *	Remove a filter from a socket and release its resources.
+ */
+static void sk_filter_release(struct sk_filter *fp)
+{
+	if (refcount_dec_and_test(&fp->refcnt))
+		call_rcu(&fp->rcu, sk_filter_release_rcu);
+}
+
+void sk_filter_uncharge(struct sock *sk, struct sk_filter *fp)
+{
+	u32 filter_size = bpf_prog_size(fp->prog->len);
+
+	atomic_sub(filter_size, &sk->sk_omem_alloc);
+	sk_filter_release(fp);
+}
+
+/* try to charge the socket memory if there is space available
+ * return true on success
+ */
+static bool __sk_filter_charge(struct sock *sk, struct sk_filter *fp)
+{
+	u32 filter_size = bpf_prog_size(fp->prog->len);
+
+	/* same check as in sock_kmalloc() */
+	if (filter_size <= sysctl_optmem_max &&
+	    atomic_read(&sk->sk_omem_alloc) + filter_size < sysctl_optmem_max) {
+		atomic_add(filter_size, &sk->sk_omem_alloc);
+		return true;
+	}
+	return false;
+}
+
+bool sk_filter_charge(struct sock *sk, struct sk_filter *fp)
+{
+	if (!refcount_inc_not_zero(&fp->refcnt))
+		return false;
+
+	if (!__sk_filter_charge(sk, fp)) {
+		sk_filter_release(fp);
+		return false;
+	}
+	return true;
+}
+
+/**
+ *	bpf_prog_create - create an unattached filter
  *	@pfp: the unattached filter that is created
  *	@fprog: the filter program
- *	@trans: post-classic verifier transformation handler
- *	@save_orig: save classic BPF program
  *
- * This function effectively does the same as bpf_prog_create(), only
- * that it builds up its insns buffer from user space provided buffer.
- * It also allows for passing a bpf_aux_classic_check_t handler.
+ * Create a filter independent of any socket. We first run some
+ * sanity checks on it to make sure it does not explode on us later.
+ * If an error occurs or there is insufficient memory for the filter
+ * a negative errno code is returned. On success the return is zero.
  */
-int bpf_prog_create_from_user(struct bpf_prog **pfp, struct sock_fprog *fprog,
-			      bpf_aux_classic_check_t trans, bool save_orig)
+int bpf_prog_create(struct bpf_prog **pfp, struct sock_fprog_kern *fprog)
 {
 	unsigned int fsize = bpf_classic_proglen(fprog);
 	struct bpf_prog *fp;
-	int err;
 
 	/* Make sure new filter is there and in the right amounts. */
 	if (!bpf_check_basics_ok(fprog->filter, fprog->len))
@@ -1196,39 +1209,26 @@ int bpf_prog_create_from_user(struct bpf_prog **pfp, struct sock_fprog *fprog,
 	if (!fp)
 		return -ENOMEM;
 
-	if (copy_from_user(fp->insns, fprog->filter, fsize)) {
-		__bpf_prog_free(fp);
-		return -EFAULT;
-	}
+	memcpy(fp->insns, fprog->filter, fsize);
 
 	fp->len = fprog->len;
+	/* Since unattached filters are not copied back to user
+	 * space through sk_get_filter(), we do not need to hold
+	 * a copy here, and can spare us the work.
+	 */
 	fp->orig_prog = NULL;
 
-	if (save_orig) {
-		err = bpf_prog_store_orig_filter(fp, fprog);
-		if (err) {
-			__bpf_prog_free(fp);
-			return -ENOMEM;
-		}
-	}
-
 	/* bpf_prepare_filter() already takes care of freeing
 	 * memory in case something goes wrong.
 	 */
-	fp = bpf_prepare_filter(fp, trans);
+	fp = bpf_prepare_filter(fp, NULL);
 	if (IS_ERR(fp))
 		return PTR_ERR(fp);
 
 	*pfp = fp;
 	return 0;
 }
-EXPORT_SYMBOL_GPL(bpf_prog_create_from_user);
-
-void bpf_prog_destroy(struct bpf_prog *fp)
-{
-	__bpf_prog_release(fp);
-}
-EXPORT_SYMBOL_GPL(bpf_prog_destroy);
+EXPORT_SYMBOL_GPL(bpf_prog_create);
 
 static int __sk_attach_prog(struct bpf_prog *prog, struct sock *sk)
 {
-- 
2.7.4

Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B

^ permalink raw reply related

* [less-CONFIG_NET 2/7] net: reorder flow_dissector
From: Norbert Manthey @ 2018-06-06 13:53 UTC (permalink / raw)
  Cc: Norbert Manthey, David S. Miller, Simon Horman, Andrew Lunn,
	Jakub Kicinski, Tom Herbert, John Crispin, Eric Dumazet,
	Sven Eckelmann, WANG Cong, David Ahern, Jon Maloy, netdev,
	linux-kernel
In-Reply-To: <1528293206-24298-1-git-send-email-nmanthey@amazon.de>

This commit reorders the definitions, such that in the next step we
can easily cut the file into a commonly used part, as well as a part
that is only required in case CONFIG_NET is used.

This is part of the effort to split CONFIG_SECCOMP_FILTER and
CONFIG_NET.

Signed-off-by: Norbert Manthey <nmanthey@amazon.de>
---
 net/core/flow_dissector.c | 206 +++++++++++++++++++++++-----------------------
 1 file changed, 103 insertions(+), 103 deletions(-)

diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index d29f09b..70e0679 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -1085,36 +1085,6 @@ static inline size_t flow_keys_hash_length(const struct flow_keys *flow)
 	return (sizeof(*flow) - diff) / sizeof(u32);
 }
 
-__be32 flow_get_u32_src(const struct flow_keys *flow)
-{
-	switch (flow->control.addr_type) {
-	case FLOW_DISSECTOR_KEY_IPV4_ADDRS:
-		return flow->addrs.v4addrs.src;
-	case FLOW_DISSECTOR_KEY_IPV6_ADDRS:
-		return (__force __be32)ipv6_addr_hash(
-			&flow->addrs.v6addrs.src);
-	case FLOW_DISSECTOR_KEY_TIPC:
-		return flow->addrs.tipckey.key;
-	default:
-		return 0;
-	}
-}
-EXPORT_SYMBOL(flow_get_u32_src);
-
-__be32 flow_get_u32_dst(const struct flow_keys *flow)
-{
-	switch (flow->control.addr_type) {
-	case FLOW_DISSECTOR_KEY_IPV4_ADDRS:
-		return flow->addrs.v4addrs.dst;
-	case FLOW_DISSECTOR_KEY_IPV6_ADDRS:
-		return (__force __be32)ipv6_addr_hash(
-			&flow->addrs.v6addrs.dst);
-	default:
-		return 0;
-	}
-}
-EXPORT_SYMBOL(flow_get_u32_dst);
-
 static inline void __flow_hash_consistentify(struct flow_keys *keys)
 {
 	int addr_diff, i;
@@ -1162,49 +1132,6 @@ static inline u32 __flow_hash_from_keys(struct flow_keys *keys, u32 keyval)
 	return hash;
 }
 
-u32 flow_hash_from_keys(struct flow_keys *keys)
-{
-	__flow_hash_secret_init();
-	return __flow_hash_from_keys(keys, hashrnd);
-}
-EXPORT_SYMBOL(flow_hash_from_keys);
-
-static inline u32 ___skb_get_hash(const struct sk_buff *skb,
-				  struct flow_keys *keys, u32 keyval)
-{
-	skb_flow_dissect_flow_keys(skb, keys,
-				   FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL);
-
-	return __flow_hash_from_keys(keys, keyval);
-}
-
-struct _flow_keys_digest_data {
-	__be16	n_proto;
-	u8	ip_proto;
-	u8	padding;
-	__be32	ports;
-	__be32	src;
-	__be32	dst;
-};
-
-void make_flow_keys_digest(struct flow_keys_digest *digest,
-			   const struct flow_keys *flow)
-{
-	struct _flow_keys_digest_data *data =
-	    (struct _flow_keys_digest_data *)digest;
-
-	BUILD_BUG_ON(sizeof(*data) > sizeof(*digest));
-
-	memset(digest, 0, sizeof(*digest));
-
-	data->n_proto = flow->basic.n_proto;
-	data->ip_proto = flow->basic.ip_proto;
-	data->ports = flow->ports.ports;
-	data->src = flow->addrs.v4addrs.src;
-	data->dst = flow->addrs.v4addrs.dst;
-}
-EXPORT_SYMBOL(make_flow_keys_digest);
-
 static struct flow_dissector flow_keys_dissector_symmetric __read_mostly;
 
 u32 __skb_get_hash_symmetric(const struct sk_buff *skb)
@@ -1222,36 +1149,6 @@ u32 __skb_get_hash_symmetric(const struct sk_buff *skb)
 }
 EXPORT_SYMBOL_GPL(__skb_get_hash_symmetric);
 
-/**
- * __skb_get_hash: calculate a flow hash
- * @skb: sk_buff to calculate flow hash from
- *
- * This function calculates a flow hash based on src/dst addresses
- * and src/dst port numbers.  Sets hash in skb to non-zero hash value
- * on success, zero indicates no valid hash.  Also, sets l4_hash in skb
- * if hash is a canonical 4-tuple hash over transport ports.
- */
-void __skb_get_hash(struct sk_buff *skb)
-{
-	struct flow_keys keys;
-	u32 hash;
-
-	__flow_hash_secret_init();
-
-	hash = ___skb_get_hash(skb, &keys, hashrnd);
-
-	__skb_set_sw_hash(skb, hash, flow_keys_have_l4(&keys));
-}
-EXPORT_SYMBOL(__skb_get_hash);
-
-__u32 skb_get_hash_perturb(const struct sk_buff *skb, u32 perturb)
-{
-	struct flow_keys keys;
-
-	return ___skb_get_hash(skb, &keys, perturb);
-}
-EXPORT_SYMBOL(skb_get_hash_perturb);
-
 u32 __skb_get_poff(const struct sk_buff *skb, void *data,
 		   const struct flow_keys *keys, int hlen)
 {
@@ -1322,6 +1219,109 @@ u32 skb_get_poff(const struct sk_buff *skb)
 	return __skb_get_poff(skb, skb->data, &keys, skb_headlen(skb));
 }
 
+__be32 flow_get_u32_src(const struct flow_keys *flow)
+{
+	switch (flow->control.addr_type) {
+	case FLOW_DISSECTOR_KEY_IPV4_ADDRS:
+		return flow->addrs.v4addrs.src;
+	case FLOW_DISSECTOR_KEY_IPV6_ADDRS:
+		return (__force __be32)ipv6_addr_hash(
+			&flow->addrs.v6addrs.src);
+	case FLOW_DISSECTOR_KEY_TIPC:
+		return flow->addrs.tipckey.key;
+	default:
+		return 0;
+	}
+}
+EXPORT_SYMBOL(flow_get_u32_src);
+
+__be32 flow_get_u32_dst(const struct flow_keys *flow)
+{
+	switch (flow->control.addr_type) {
+	case FLOW_DISSECTOR_KEY_IPV4_ADDRS:
+		return flow->addrs.v4addrs.dst;
+	case FLOW_DISSECTOR_KEY_IPV6_ADDRS:
+		return (__force __be32)ipv6_addr_hash(
+			&flow->addrs.v6addrs.dst);
+	default:
+		return 0;
+	}
+}
+EXPORT_SYMBOL(flow_get_u32_dst);
+
+u32 flow_hash_from_keys(struct flow_keys *keys)
+{
+	__flow_hash_secret_init();
+	return __flow_hash_from_keys(keys, hashrnd);
+}
+EXPORT_SYMBOL(flow_hash_from_keys);
+
+static inline u32 ___skb_get_hash(const struct sk_buff *skb,
+				  struct flow_keys *keys, u32 keyval)
+{
+	skb_flow_dissect_flow_keys(skb, keys,
+				   FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL);
+
+	return __flow_hash_from_keys(keys, keyval);
+}
+
+struct _flow_keys_digest_data {
+	__be16	n_proto;
+	u8	ip_proto;
+	u8	padding;
+	__be32	ports;
+	__be32	src;
+	__be32	dst;
+};
+
+void make_flow_keys_digest(struct flow_keys_digest *digest,
+			   const struct flow_keys *flow)
+{
+	struct _flow_keys_digest_data *data =
+	    (struct _flow_keys_digest_data *)digest;
+
+	BUILD_BUG_ON(sizeof(*data) > sizeof(*digest));
+
+	memset(digest, 0, sizeof(*digest));
+
+	data->n_proto = flow->basic.n_proto;
+	data->ip_proto = flow->basic.ip_proto;
+	data->ports = flow->ports.ports;
+	data->src = flow->addrs.v4addrs.src;
+	data->dst = flow->addrs.v4addrs.dst;
+}
+EXPORT_SYMBOL(make_flow_keys_digest);
+
+/**
+ * __skb_get_hash: calculate a flow hash
+ * @skb: sk_buff to calculate flow hash from
+ *
+ * This function calculates a flow hash based on src/dst addresses
+ * and src/dst port numbers.  Sets hash in skb to non-zero hash value
+ * on success, zero indicates no valid hash.  Also, sets l4_hash in skb
+ * if hash is a canonical 4-tuple hash over transport ports.
+ */
+void __skb_get_hash(struct sk_buff *skb)
+{
+	struct flow_keys keys;
+	u32 hash;
+
+	__flow_hash_secret_init();
+
+	hash = ___skb_get_hash(skb, &keys, hashrnd);
+
+	__skb_set_sw_hash(skb, hash, flow_keys_have_l4(&keys));
+}
+EXPORT_SYMBOL(__skb_get_hash);
+
+__u32 skb_get_hash_perturb(const struct sk_buff *skb, u32 perturb)
+{
+	struct flow_keys keys;
+
+	return ___skb_get_hash(skb, &keys, perturb);
+}
+EXPORT_SYMBOL(skb_get_hash_perturb);
+
 __u32 __get_hash_from_flowi6(const struct flowi6 *fl6, struct flow_keys *keys)
 {
 	memset(keys, 0, sizeof(*keys));
-- 
2.7.4

Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B

^ permalink raw reply related

* [less-CONFIG_NET 3/7] seccomp: include net and bpf files
From: Norbert Manthey @ 2018-06-06 13:53 UTC (permalink / raw)
  Cc: Norbert Manthey, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, netdev, linux-kernel
In-Reply-To: <1528293206-24298-1-git-send-email-nmanthey@amazon.de>

When we want to use CONFIG_SECCOMP_FILTER without CONFIG_NET, we have
to ensure that the required files that would be pulled in via
CONFIG_NET are compiled when dropping CONFIG_NET.

Signed-off-by: Norbert Manthey <nmanthey@amazon.de>
---
 kernel/bpf/Makefile | 3 ++-
 net/Makefile        | 5 +++++
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index a713fd2..5d13269 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -4,7 +4,8 @@ obj-y := core.o
 obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o
 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o
-ifeq ($(CONFIG_NET),y)
+
+ifneq ($(filter y,$(CONFIG_NET) $(CONFIG_SECCOMP_FILTER)),)
 obj-$(CONFIG_BPF_SYSCALL) += devmap.o
 obj-$(CONFIG_BPF_SYSCALL) += cpumap.o
 obj-$(CONFIG_BPF_SYSCALL) += offload.o
diff --git a/net/Makefile b/net/Makefile
index a6147c6..08f1875 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -11,6 +11,11 @@ obj-$(CONFIG_NET)		:= socket.o core/
 tmp-$(CONFIG_COMPAT) 		:= compat.o
 obj-$(CONFIG_NET)		+= $(tmp-y)
 
+ifneq ($(CONFIG_NET),y)
+obj-$(CONFIG_SECCOMP_FILTER)    += core/filter.o
+obj-$(CONFIG_SECCOMP_FILTER)    += core/flow_dissector.o
+endif
+
 # LLC has to be linked before the files in net/802/
 obj-$(CONFIG_LLC)		+= llc/
 obj-$(CONFIG_NET)		+= ethernet/ 802/ sched/ netlink/ bpf/
-- 
2.7.4

Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B

^ permalink raw reply related

* [less-CONFIG_NET 5/7] seccomp: cut off functions not required
From: Norbert Manthey @ 2018-06-06 13:53 UTC (permalink / raw)
  Cc: Norbert Manthey, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, John Crispin, Simon Horman, Jakub Kicinski,
	Tom Herbert, Eric Dumazet, Sven Eckelmann, WANG Cong, David Ahern,
	Jon Maloy, netdev, linux-kernel
In-Reply-To: <1528293206-24298-1-git-send-email-nmanthey@amazon.de>

When using CONFIG_SECCOMP_FILTER, not all functions of filter.c and
flow_dissector.c are required. To not pull in more dependencies,
guard the functions that are not required with CONFIG_NET defines.
This way, these functions are enabled in case the file is compiled
because of CONFIG_NET, but they are not present when the file is
compiled because of other configurations.

Signed-off-by: Norbert Manthey <nmanthey@amazon.de>
---
 net/core/filter.c         | 2 ++
 net/core/flow_dissector.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index 0d980e9..4ddacb7 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1063,6 +1063,7 @@ void bpf_prog_destroy(struct bpf_prog *fp)
 }
 EXPORT_SYMBOL_GPL(bpf_prog_destroy);
 
+#if defined(CONFIG_NET)
 /**
  *	sk_filter_trim_cap - run a packet through a socket filter
  *	@sk: sock associated with &sk_buff
@@ -5657,3 +5658,4 @@ int sk_get_filter(struct sock *sk, struct sock_filter __user *ubuf,
 	release_sock(sk);
 	return ret;
 }
+#endif  // CONFIG_NET
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 70e0679..0903444 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -1219,6 +1219,7 @@ u32 skb_get_poff(const struct sk_buff *skb)
 	return __skb_get_poff(skb, skb->data, &keys, skb_headlen(skb));
 }
 
+#if defined(CONFIG_NET)
 __be32 flow_get_u32_src(const struct flow_keys *flow)
 {
 	switch (flow->control.addr_type) {
@@ -1340,6 +1341,7 @@ __u32 __get_hash_from_flowi6(const struct flowi6 *fl6, struct flow_keys *keys)
 	return flow_hash_from_keys(keys);
 }
 EXPORT_SYMBOL(__get_hash_from_flowi6);
+#endif  // CONFIG_NET
 
 static const struct flow_dissector_key flow_keys_dissector_keys[] = {
 	{
-- 
2.7.4

Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B

^ permalink raw reply related

* [PATCH net] rxrpc: Fix terminal retransmission connection ID to include the channel
From: David Howells @ 2018-06-06 13:59 UTC (permalink / raw)
  To: netdev; +Cc: dhowells, linux-afs, linux-kernel

When retransmitting the final ACK or ABORT packet for a call, the cid field
in the packet header is set to the connection's cid, but this is incorrect
as it also needs to include the channel number on that connection that the
call was made on.

Fix this by OR'ing in the channel number.

Note that this fixes the bug that:

	commit 1a025028d400b23477341aa7ec2ce55f8b39b554
	rxrpc: Fix handling of call quietly cancelled out on server

works around.  I'm not intending to revert that as it will help protect
against problems that might occur on the server.

Fixes: 3136ef49a14c ("rxrpc: Delay terminal ACK transmission on a client call")
Signed-off-by: David Howells <dhowells@redhat.com>
---

 net/rxrpc/conn_event.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/rxrpc/conn_event.c b/net/rxrpc/conn_event.c
index 1350f1be8037..8229a52c2acd 100644
--- a/net/rxrpc/conn_event.c
+++ b/net/rxrpc/conn_event.c
@@ -70,7 +70,7 @@ static void rxrpc_conn_retransmit_call(struct rxrpc_connection *conn,
 	iov[2].iov_len	= sizeof(ack_info);
 
 	pkt.whdr.epoch		= htonl(conn->proto.epoch);
-	pkt.whdr.cid		= htonl(conn->proto.cid);
+	pkt.whdr.cid		= htonl(conn->proto.cid | channel);
 	pkt.whdr.callNumber	= htonl(call_id);
 	pkt.whdr.seq		= 0;
 	pkt.whdr.type		= chan->last_type;

^ permalink raw reply related

* Re: [PATCH net] ipv4: igmp: hold wakelock to prevent delayed reports
From: Tejaswi Tanikella @ 2018-06-06 14:38 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, f.fainelli, andrew, edumazet
In-Reply-To: <20180604.110640.184022152967598302.davem@davemloft.net>

On Mon, Jun 04, 2018 at 11:06:40AM -0400, David Miller wrote:
> From: Tejaswi Tanikella <tejaswit@codeaurora.org>
> Date: Fri, 1 Jun 2018 19:35:41 +0530
> 
> > On receiving a IGMPv2/v3 query, based on max_delay set in the header a
> > timer is started to send out a response after a random time within
> > max_delay. If the system then moves into suspend state, Report is
> > delayed until system wakes up.
> > 
> > In one reported scenario, on arm64 devices, max_delay was set to 10s,
> > Reports were consistantly delayed if the timer is scheduled after 5 plus
> > seconds.
> > 
> > Hold wakelock while starting the timer to prevent moving into suspend
> > state.
> > 
> > Signed-off-by: Tejaswi Tanikella <tejaswit@codeaurora.org>
> 
> As Florian stated, this won't be the only networking facility to hit
> a problem like this.  So, if we go down this route, we probably want
> to generically solve this.
> 
> But I have a deeper concern.
> 
> Do we really want every timer based querying mechanism to prevent a
> system from being able to suspend?
> 
> We get to the point where external entities can generate traffic which
> can prevent a remote system from entering suspend state.

Like you suggested maybe aquiring a wakelock will unnecessarily stop the
system from suspending.
Would using alarmtimer be better? It seems similar to a hrtimer but can
wake the system up from suspend.

This should allow the system to suspend till the timer expires. But might
cause repeated wake-ups and suspends.

^ permalink raw reply

* Re: [PATCH net-next 1/6] net: hippi: use pci_zalloc_consistent
From: kbuild test robot @ 2018-06-06 15:04 UTC (permalink / raw)
  To: YueHaibing
  Cc: kbuild-all, davem, netdev, linux-kernel, jcliburn, chris.snook,
	benve, jdmason, chessman, jes, rahul.verma, YueHaibing
In-Reply-To: <20180605122851.23912-2-yuehaibing@huawei.com>

[-- Attachment #1: Type: text/plain, Size: 5410 bytes --]

Hi YueHaibing,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url:    https://github.com/0day-ci/linux/commits/YueHaibing/use-pci_zalloc_consistent/20180606-205531
config: x86_64-allyesdebian (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   drivers/net/hippi/rrunner.c: In function 'rr_init_one':
>> drivers/net/hippi/rrunner.c:176:7: error: 'trrpriv' undeclared (first use in this function); did you mean 'rrpriv'?
     if (!trrpriv->evt_ring) {
          ^~~~~~~
          rrpriv
   drivers/net/hippi/rrunner.c:176:7: note: each undeclared identifier is reported only once for each function it appears in

vim +176 drivers/net/hippi/rrunner.c

    74	
    75	/*
    76	 * Implementation notes:
    77	 *
    78	 * The DMA engine only allows for DMA within physical 64KB chunks of
    79	 * memory. The current approach of the driver (and stack) is to use
    80	 * linear blocks of memory for the skbuffs. However, as the data block
    81	 * is always the first part of the skb and skbs are 2^n aligned so we
    82	 * are guarantted to get the whole block within one 64KB align 64KB
    83	 * chunk.
    84	 *
    85	 * On the long term, relying on being able to allocate 64KB linear
    86	 * chunks of memory is not feasible and the skb handling code and the
    87	 * stack will need to know about I/O vectors or something similar.
    88	 */
    89	
    90	static int rr_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
    91	{
    92		struct net_device *dev;
    93		static int version_disp;
    94		u8 pci_latency;
    95		struct rr_private *rrpriv;
    96		dma_addr_t ring_dma;
    97		int ret = -ENOMEM;
    98	
    99		dev = alloc_hippi_dev(sizeof(struct rr_private));
   100		if (!dev)
   101			goto out3;
   102	
   103		ret = pci_enable_device(pdev);
   104		if (ret) {
   105			ret = -ENODEV;
   106			goto out2;
   107		}
   108	
   109		rrpriv = netdev_priv(dev);
   110	
   111		SET_NETDEV_DEV(dev, &pdev->dev);
   112	
   113		ret = pci_request_regions(pdev, "rrunner");
   114		if (ret < 0)
   115			goto out;
   116	
   117		pci_set_drvdata(pdev, dev);
   118	
   119		rrpriv->pci_dev = pdev;
   120	
   121		spin_lock_init(&rrpriv->lock);
   122	
   123		dev->netdev_ops = &rr_netdev_ops;
   124	
   125		/* display version info if adapter is found */
   126		if (!version_disp) {
   127			/* set display flag to TRUE so that */
   128			/* we only display this string ONCE */
   129			version_disp = 1;
   130			printk(version);
   131		}
   132	
   133		pci_read_config_byte(pdev, PCI_LATENCY_TIMER, &pci_latency);
   134		if (pci_latency <= 0x58){
   135			pci_latency = 0x58;
   136			pci_write_config_byte(pdev, PCI_LATENCY_TIMER, pci_latency);
   137		}
   138	
   139		pci_set_master(pdev);
   140	
   141		printk(KERN_INFO "%s: Essential RoadRunner serial HIPPI "
   142		       "at 0x%llx, irq %i, PCI latency %i\n", dev->name,
   143		       (unsigned long long)pci_resource_start(pdev, 0),
   144		       pdev->irq, pci_latency);
   145	
   146		/*
   147		 * Remap the MMIO regs into kernel space.
   148		 */
   149		rrpriv->regs = pci_iomap(pdev, 0, 0x1000);
   150		if (!rrpriv->regs) {
   151			printk(KERN_ERR "%s:  Unable to map I/O register, "
   152				"RoadRunner will be disabled.\n", dev->name);
   153			ret = -EIO;
   154			goto out;
   155		}
   156	
   157		rrpriv->tx_ring = pci_alloc_consistent(pdev, TX_TOTAL_SIZE, &ring_dma);
   158		rrpriv->tx_ring_dma = ring_dma;
   159	
   160		if (!rrpriv->tx_ring) {
   161			ret = -ENOMEM;
   162			goto out;
   163		}
   164	
   165		rrpriv->rx_ring = pci_alloc_consistent(pdev, RX_TOTAL_SIZE, &ring_dma);
   166		rrpriv->rx_ring_dma = ring_dma;
   167	
   168		if (!rrpriv->rx_ring) {
   169			ret = -ENOMEM;
   170			goto out;
   171		}
   172	
   173		rrpriv->evt_ring = pci_alloc_consistent(pdev, EVT_RING_SIZE, &ring_dma);
   174		rrpriv->evt_ring_dma = ring_dma;
   175	
 > 176		if (!trrpriv->evt_ring) {
   177			ret = -ENOMEM;
   178			goto out;
   179		}
   180	
   181		/*
   182		 * Don't access any register before this point!
   183		 */
   184	#ifdef __BIG_ENDIAN
   185		writel(readl(&rrpriv->regs->HostCtrl) | NO_SWAP,
   186			&rrpriv->regs->HostCtrl);
   187	#endif
   188		/*
   189		 * Need to add a case for little-endian 64-bit hosts here.
   190		 */
   191	
   192		rr_init(dev);
   193	
   194		ret = register_netdev(dev);
   195		if (ret)
   196			goto out;
   197		return 0;
   198	
   199	 out:
   200		if (rrpriv->evt_ring)
   201			pci_free_consistent(pdev, EVT_RING_SIZE, rrpriv->evt_ring,
   202					    rrpriv->evt_ring_dma);
   203		if (rrpriv->rx_ring)
   204			pci_free_consistent(pdev, RX_TOTAL_SIZE, rrpriv->rx_ring,
   205					    rrpriv->rx_ring_dma);
   206		if (rrpriv->tx_ring)
   207			pci_free_consistent(pdev, TX_TOTAL_SIZE, rrpriv->tx_ring,
   208					    rrpriv->tx_ring_dma);
   209		if (rrpriv->regs)
   210			pci_iounmap(pdev, rrpriv->regs);
   211		if (pdev)
   212			pci_release_regions(pdev);
   213	 out2:
   214		free_netdev(dev);
   215	 out3:
   216		return ret;
   217	}
   218	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 39429 bytes --]

^ permalink raw reply

* [PATCH net] net: in virtio_net_hdr only add VLAN_HLEN to csum_start if payload holds vlan
From: Willem de Bruijn @ 2018-06-06 15:23 UTC (permalink / raw)
  To: netdev; +Cc: davem, rppt, herbert, anton.ivanov, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Tun, tap, virtio, packet and uml vector all use struct virtio_net_hdr
to communicate packet metadata to userspace.

For skbuffs with vlan, the first two return the packet as it may have
existed on the wire, inserting the VLAN tag in the user buffer.  Then
virtio_net_hdr.csum_start needs to be adjusted by VLAN_HLEN bytes.

Commit f09e2249c4f5 ("macvtap: restore vlan header on user read")
added this feature to macvtap. Commit 3ce9b20f1971 ("macvtap: Fix
csum_start when VLAN tags are present") then fixed up csum_start.

Virtio, packet and uml do not insert the vlan header in the user
buffer.

When introducing virtio_net_hdr_from_skb to deduplicate filling in
the virtio_net_hdr, the variant from macvtap which adds VLAN_HLEN was
applied uniformly, breaking csum offset for packets with vlan on
virtio and packet.

Make insertion of VLAN_HLEN optional. Convert the callers to pass it
when needed.

Fixes: e858fae2b0b8f4 ("virtio_net: use common code for virtio_net_hdr and skb GSO conversion")
Fixes: 1276f24eeef2 ("packet: use common code for virtio_net_hdr and skb GSO conversion")
Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 arch/um/drivers/vector_transports.c |  3 ++-
 drivers/net/tap.c                   |  5 ++++-
 drivers/net/tun.c                   |  3 ++-
 drivers/net/virtio_net.c            |  3 ++-
 include/linux/virtio_net.h          | 11 ++++-------
 net/packet/af_packet.c              |  4 ++--
 6 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/arch/um/drivers/vector_transports.c b/arch/um/drivers/vector_transports.c
index 9065047f844b..77e4ebc206ae 100644
--- a/arch/um/drivers/vector_transports.c
+++ b/arch/um/drivers/vector_transports.c
@@ -120,7 +120,8 @@ static int raw_form_header(uint8_t *header,
 		skb,
 		vheader,
 		virtio_legacy_is_little_endian(),
-		false
+		false,
+		0
 	);
 
 	return 0;
diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 9b6cb780affe..f0f7cd977667 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -774,13 +774,16 @@ static ssize_t tap_put_user(struct tap_queue *q,
 	int total;
 
 	if (q->flags & IFF_VNET_HDR) {
+		int vlan_hlen = skb_vlan_tag_present(skb) ? VLAN_HLEN : 0;
 		struct virtio_net_hdr vnet_hdr;
+
 		vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);
 		if (iov_iter_count(iter) < vnet_hdr_len)
 			return -EINVAL;
 
 		if (virtio_net_hdr_from_skb(skb, &vnet_hdr,
-					    tap_is_little_endian(q), true))
+					    tap_is_little_endian(q), true,
+					    vlan_hlen))
 			BUG();
 
 		if (copy_to_iter(&vnet_hdr, sizeof(vnet_hdr), iter) !=
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 23e9eb66197f..409eb8b74740 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2078,7 +2078,8 @@ static ssize_t tun_put_user(struct tun_struct *tun,
 			return -EINVAL;
 
 		if (virtio_net_hdr_from_skb(skb, &gso,
-					    tun_is_little_endian(tun), true)) {
+					    tun_is_little_endian(tun), true,
+					    vlan_hlen)) {
 			struct skb_shared_info *sinfo = skb_shinfo(skb);
 			pr_err("unexpected GSO type: "
 			       "0x%x, gso_size %d, hdr_len %d\n",
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 032e1ac10a30..8c7207535179 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1358,7 +1358,8 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
 		hdr = skb_vnet_hdr(skb);
 
 	if (virtio_net_hdr_from_skb(skb, &hdr->hdr,
-				    virtio_is_little_endian(vi->vdev), false))
+				    virtio_is_little_endian(vi->vdev), false,
+				    0))
 		BUG();
 
 	if (vi->mergeable_rx_bufs)
diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index f144216febc6..9397628a1967 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -58,7 +58,8 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
 static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb,
 					  struct virtio_net_hdr *hdr,
 					  bool little_endian,
-					  bool has_data_valid)
+					  bool has_data_valid,
+					  int vlan_hlen)
 {
 	memset(hdr, 0, sizeof(*hdr));   /* no info leak */
 
@@ -83,12 +84,8 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb,
 
 	if (skb->ip_summed == CHECKSUM_PARTIAL) {
 		hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
-		if (skb_vlan_tag_present(skb))
-			hdr->csum_start = __cpu_to_virtio16(little_endian,
-				skb_checksum_start_offset(skb) + VLAN_HLEN);
-		else
-			hdr->csum_start = __cpu_to_virtio16(little_endian,
-				skb_checksum_start_offset(skb));
+		hdr->csum_start = __cpu_to_virtio16(little_endian,
+			skb_checksum_start_offset(skb) + vlan_hlen);
 		hdr->csum_offset = __cpu_to_virtio16(little_endian,
 				skb->csum_offset);
 	} else if (has_data_valid &&
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index acb7b86574cd..9198997b39d1 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2037,7 +2037,7 @@ static int packet_rcv_vnet(struct msghdr *msg, const struct sk_buff *skb,
 		return -EINVAL;
 	*len -= sizeof(vnet_hdr);
 
-	if (virtio_net_hdr_from_skb(skb, &vnet_hdr, vio_le(), true))
+	if (virtio_net_hdr_from_skb(skb, &vnet_hdr, vio_le(), true, 0))
 		return -EINVAL;
 
 	return memcpy_to_msg(msg, (void *)&vnet_hdr, sizeof(vnet_hdr));
@@ -2304,7 +2304,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 	if (do_vnet) {
 		if (virtio_net_hdr_from_skb(skb, h.raw + macoff -
 					    sizeof(struct virtio_net_hdr),
-					    vio_le(), true)) {
+					    vio_le(), true, 0)) {
 			spin_lock(&sk->sk_receive_queue.lock);
 			goto drop_n_account;
 		}
-- 
2.17.1.1185.g55be947832-goog

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox