Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH 0/4] irda: move it to drivers/staging so we can delete it
From: Greg Kroah-Hartman @ 2017-08-29 14:48 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: driverdevel, Samuel Ortiz, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, Stefano Brivio, David S. Miller
In-Reply-To: <CAMuHMdVb-oygJkkHwBOfSmNECAfF72QruJSmVTpqdUW+RNJbrg@mail.gmail.com>

On Tue, Aug 29, 2017 at 02:11:39PM +0200, Geert Uytterhoeven wrote:
> Hi Greg,
> 
> On Tue, Aug 29, 2017 at 1:28 PM, Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> > On Tue, Aug 29, 2017 at 01:11:31PM +0200, Stefano Brivio wrote:
> >> On Tue, 29 Aug 2017 12:59:00 +0200
> >> Geert Uytterhoeven <geert@linux-m68k.org> wrote:
> >> > On Sun, Aug 27, 2017 at 5:03 PM, Greg Kroah-Hartman
> >> > <gregkh@linuxfoundation.org> wrote:
> >> > > The IRDA code has long been obsolete and broken.  So, to keep people
> >> > > from trying to use it, and to prevent people from having to maintain it,
> >> > > let's move it to drivers/staging/ so that we can delete it entirely from
> >> > > the kernel in a few releases.
> >> >
> >> > (diving into an early boot crash)
> >> >
> >> > Have you tried running this? ;-)
> >> >
> >> > irda_init() and net_dev_init() are both subsys_initcall()s.
> >> > But the former now runs before the latter, leading to:
> >> >
> >> > Unable to handle kernel NULL pointer dereference at virtual address 00000004
> >>
> >> Should be fixed by https://patchwork.ozlabs.org/patch/807006/
> >> ("[net-next] staging: irda: force to be a kernel module") I guess...
> >
> > Yup, that's the fix for this issue.
> >
> > Geert, does that fix the problem for you?
> 
> Thanks, that patch fixes the crash, obviously.
> 
> It does mean you can no longer have IrDA in a non-modular kernel.

Given that irda doesn't really work, I doubt anyone is going to care
about it :)

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH] staging: r8822be: Fix typo for CONFIG_RTLWIFI_DEBUG
From: Greg KH @ 2017-08-29 14:42 UTC (permalink / raw)
  To: Larry Finger
  Cc: Andreas Ziegler, devel, Yan-Hsuan Chuang, Birming Chiu, netdev,
	Steven Ting
In-Reply-To: <fcab4e04-be42-5a05-c5ad-4947d2d41164@lwfinger.net>

On Tue, Aug 29, 2017 at 09:10:10AM -0500, Larry Finger wrote:
> On 08/29/2017 06:30 AM, Andreas Ziegler wrote:
> > The debugging output in deinit_priv is guarded by an  #ifdef using
> > CONFIG_RTL_DEBUG. This symbol does not exist and should be
> > CONFIG_RTLWIFI_DEBUG instead.
> > 
> > Signed-off-by: Andreas Ziegler <andreas.ziegler@fau.de>
> 
> NACK.
> 
> Yes, there is a problem; however, CONFIG_RTLWIFI_DEBUG is not the value that
> should be used. That one is reserved for the non-staging drivers in
> drivers/net/wireless/realtek/rtlwifi/. The correct symbol for r8822be is
> CONFIG_RTLWIFI_DEBUG_ST.

Yeah, kbuild just blew up on this as well, I wonder why my local build
testing didn't see that :(

Now dropped.

greg k-h

^ permalink raw reply

* [PATCH net-next 7/7] samples/bpf: xdp_monitor tool based on tracepoints
From: Jesper Dangaard Brouer @ 2017-08-29 14:38 UTC (permalink / raw)
  To: netdev; +Cc: John Fastabend, Jesper Dangaard Brouer
In-Reply-To: <150401743083.16384.15778781741742858567.stgit@firesoul>

This tool xdp_monitor demonstrate how to use the different xdp_redirect
tracepoints xdp_redirect{,_map}{,_err} from a BPF program.

The default mode is to only monitor the error counters, to avoid
affecting the per packet performance. Tracepoints comes with a base
overhead of 25 nanosec for an attached bpf_prog, and 48 nanosec for
using a full perf record (with non-matching filter).  Thus, default
loading the --stats mode could affect the maximum performance.

This version of the tool is very simple and count all types of errors
as one.  It will be natural to extend this later with the different
types of errors that can occur, which should help users quickly
identify common mistakes.

Because the TP_STRUCT was kept in sync all the tracepoints loads the
same BPF code.  It would also be natural to extend the map version to
demonstrate how the map information could be used.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 samples/bpf/Makefile           |    4 +
 samples/bpf/xdp_monitor_kern.c |   88 ++++++++++++
 samples/bpf/xdp_monitor_user.c |  295 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 387 insertions(+)
 create mode 100644 samples/bpf/xdp_monitor_kern.c
 create mode 100644 samples/bpf/xdp_monitor_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index f1010fe759fe..cf17c7932a6e 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -39,6 +39,7 @@ hostprogs-y += per_socket_stats_example
 hostprogs-y += load_sock_ops
 hostprogs-y += xdp_redirect
 hostprogs-y += xdp_redirect_map
+hostprogs-y += xdp_monitor
 hostprogs-y += syscall_tp
 
 # Libbpf dependencies
@@ -83,6 +84,7 @@ test_map_in_map-objs := bpf_load.o $(LIBBPF) test_map_in_map_user.o
 per_socket_stats_example-objs := $(LIBBPF) cookie_uid_helper_example.o
 xdp_redirect-objs := bpf_load.o $(LIBBPF) xdp_redirect_user.o
 xdp_redirect_map-objs := bpf_load.o $(LIBBPF) xdp_redirect_map_user.o
+xdp_monitor-objs := bpf_load.o $(LIBBPF) xdp_monitor_user.o
 syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o
 
 # Tell kbuild to always build the programs
@@ -127,6 +129,7 @@ always += tcp_iw_kern.o
 always += tcp_clamp_kern.o
 always += xdp_redirect_kern.o
 always += xdp_redirect_map_kern.o
+always += xdp_monitor_kern.o
 always += syscall_tp_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
@@ -166,6 +169,7 @@ HOSTLOADLIBES_xdp_tx_iptunnel += -lelf
 HOSTLOADLIBES_test_map_in_map += -lelf
 HOSTLOADLIBES_xdp_redirect += -lelf
 HOSTLOADLIBES_xdp_redirect_map += -lelf
+HOSTLOADLIBES_xdp_monitor += -lelf
 HOSTLOADLIBES_syscall_tp += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
diff --git a/samples/bpf/xdp_monitor_kern.c b/samples/bpf/xdp_monitor_kern.c
new file mode 100644
index 000000000000..74f3fd8ed729
--- /dev/null
+++ b/samples/bpf/xdp_monitor_kern.c
@@ -0,0 +1,88 @@
+/* XDP monitor tool, based on tracepoints
+ *
+ *  Copyright(c) 2017 Jesper Dangaard Brouer, Red Hat Inc.
+ */
+#include <uapi/linux/bpf.h>
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") redirect_err_cnt = {
+	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(u64),
+	.max_entries = 2,
+	/* TODO: have entries for all possible errno's */
+};
+
+/* Tracepoint format: /sys/kernel/debug/tracing/events/xdp/xdp_redirect/format
+ * Code in:                kernel/include/trace/events/xdp.h
+ */
+struct xdp_redirect_ctx {
+	unsigned short common_type;	//	offset:0;  size:2; signed:0;
+	unsigned char common_flags;	//	offset:2;  size:1; signed:0;
+	unsigned char common_preempt_count;//	offset:3;  size:1; signed:0;
+	int common_pid;			//	offset:4;  size:4; signed:1;
+
+	int prog_id;			//	offset:8;  size:4; signed:1;
+	u32 act;			//	offset:12  size:4; signed:0;
+	int ifindex;			//	offset:16  size:4; signed:1;
+	int err;			//	offset:20  size:4; signed:1;
+	int to_ifindex;			//	offset:24  size:4; signed:1;
+	u32 map_id;			//	offset:28  size:4; signed:0;
+	int map_index;			//	offset:32  size:4; signed:1;
+};					//	offset:36
+
+enum {
+	XDP_REDIRECT_SUCCESS = 0,
+	XDP_REDIRECT_ERROR = 1
+};
+
+static __always_inline
+int xdp_redirect_collect_stat(struct xdp_redirect_ctx *ctx)
+{
+	u32 key = XDP_REDIRECT_ERROR;
+	int err = ctx->err;
+	u64 *cnt;
+
+	if (!err)
+		key = XDP_REDIRECT_SUCCESS;
+
+	cnt  = bpf_map_lookup_elem(&redirect_err_cnt, &key);
+	if (!cnt)
+		return 0;
+	*cnt += 1;
+
+	return 0; /* Indicate event was filtered (no further processing)*/
+	/*
+	 * Returning 1 here would allow e.g. a perf-record tracepoint
+	 * to see and record these events, but it doesn't work well
+	 * in-practice as stopping perf-record also unload this
+	 * bpf_prog.  Plus, there is additional overhead of doing so.
+	 */
+}
+
+SEC("tracepoint/xdp/xdp_redirect_err")
+int trace_xdp_redirect_err(struct xdp_redirect_ctx *ctx)
+{
+	return xdp_redirect_collect_stat(ctx);
+}
+
+
+SEC("tracepoint/xdp/xdp_redirect_map_err")
+int trace_xdp_redirect_map_err(struct xdp_redirect_ctx *ctx)
+{
+	return xdp_redirect_collect_stat(ctx);
+}
+
+/* Likely unloaded when prog starts */
+SEC("tracepoint/xdp/xdp_redirect")
+int trace_xdp_redirect(struct xdp_redirect_ctx *ctx)
+{
+	return xdp_redirect_collect_stat(ctx);
+}
+
+/* Likely unloaded when prog starts */
+SEC("tracepoint/xdp/xdp_redirect_map")
+int trace_xdp_redirect_map(struct xdp_redirect_ctx *ctx)
+{
+	return xdp_redirect_collect_stat(ctx);
+}
diff --git a/samples/bpf/xdp_monitor_user.c b/samples/bpf/xdp_monitor_user.c
new file mode 100644
index 000000000000..b51b4f5e3257
--- /dev/null
+++ b/samples/bpf/xdp_monitor_user.c
@@ -0,0 +1,295 @@
+/* Copyright(c) 2017 Jesper Dangaard Brouer, Red Hat, Inc.
+ */
+static const char *__doc__=
+ "XDP monitor tool, based on tracepoints\n"
+;
+
+static const char *__doc_err_only__=
+ " NOTICE: Only tracking XDP redirect errors\n"
+ "         Enable TX success stats via '--stats'\n"
+ "         (which comes with a per packet processing overhead)\n"
+;
+
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <string.h>
+#include <ctype.h>
+#include <unistd.h>
+#include <locale.h>
+
+#include <getopt.h>
+#include <net/if.h>
+#include <time.h>
+
+#include "libbpf.h"
+#include "bpf_load.h"
+#include "bpf_util.h"
+
+static int verbose = 1;
+static bool debug = false;
+
+static const struct option long_options[] = {
+	{"help",	no_argument,		NULL, 'h' },
+	{"debug",	no_argument,		NULL, 'D' },
+	{"stats",	no_argument,		NULL, 'S' },
+	{"sec", 	required_argument,	NULL, 's' },
+	{0, 0, NULL,  0 }
+};
+
+static void usage(char *argv[])
+{
+	int i;
+	printf("\nDOCUMENTATION:\n%s\n", __doc__);
+	printf("\n");
+	printf(" Usage: %s (options-see-below)\n",
+	       argv[0]);
+	printf(" Listing options:\n");
+	for (i = 0; long_options[i].name != 0; i++) {
+		printf(" --%-15s", long_options[i].name);
+		if (long_options[i].flag != NULL)
+			printf(" flag (internal value:%d)",
+			       *long_options[i].flag);
+		else
+			printf("(internal short-option: -%c)",
+			       long_options[i].val);
+		printf("\n");
+	}
+	printf("\n");
+}
+
+#define NANOSEC_PER_SEC 1000000000 /* 10^9 */
+__u64 gettime(void)
+{
+	struct timespec t;
+	int res;
+
+	res = clock_gettime(CLOCK_MONOTONIC, &t);
+	if (res < 0) {
+		fprintf(stderr, "Error with gettimeofday! (%i)\n", res);
+		exit(EXIT_FAILURE);
+	}
+	return (__u64) t.tv_sec * NANOSEC_PER_SEC + t.tv_nsec;
+}
+
+enum {
+	REDIR_SUCCESS = 0,
+	REDIR_ERROR = 1,
+};
+#define REDIR_RES_MAX 2
+static const char *redir_names[REDIR_RES_MAX] = {
+	[REDIR_SUCCESS]	= "Success",
+	[REDIR_ERROR]	= "Error",
+};
+static const char *err2str(int err)
+{
+	if (err < REDIR_RES_MAX)
+		return redir_names[err];
+	return NULL;
+}
+
+struct record {
+	__u64 counter;
+	__u64 timestamp;
+};
+
+struct stats_record {
+	struct record xdp_redir[REDIR_RES_MAX];
+};
+
+static void stats_print_headers(bool err_only)
+{
+	if (err_only)
+		printf("\n%s\n", __doc_err_only__);
+
+	printf("%-14s %-10s %-18s %-9s\n",
+	       "XDP_REDIRECT", "pps ", "pps-human-readable", "measure-period");
+}
+
+static void stats_print(struct stats_record *rec,
+			struct stats_record *prev,
+			bool err_only)
+{
+	int i = 0;
+
+	if (err_only)
+		i = REDIR_ERROR;
+
+	for (; i < REDIR_RES_MAX; i++) {
+		struct record *r = &rec->xdp_redir[i];
+		struct record *p = &prev->xdp_redir[i];
+		__u64 period  = 0;
+		__u64 packets = 0;
+		double pps = 0;
+		double period_ = 0;
+
+		if (p->timestamp) {
+			packets = r->counter - p->counter;
+			period  = r->timestamp - p->timestamp;
+			if (period > 0) {
+				period_ = ((double) period / NANOSEC_PER_SEC);
+				pps = packets / period_;
+			}
+		}
+
+		printf("%-14s %-10.0f %'-18.0f %f\n",
+		       err2str(i), pps, pps, period_);
+	}
+}
+
+static __u64 get_key32_value64_percpu(int fd, __u32 key)
+{
+	/* For percpu maps, userspace gets a value per possible CPU */
+	unsigned int nr_cpus = bpf_num_possible_cpus();
+	__u64 values[nr_cpus];
+	__u64 sum = 0;
+	int i;
+
+	if ((bpf_map_lookup_elem(fd, &key, values)) != 0) {
+		fprintf(stderr,
+			"ERR: bpf_map_lookup_elem failed key:0x%X\n", key);
+		return 0;
+	}
+
+	/* Sum values from each CPU */
+	for (i = 0; i < nr_cpus; i++) {
+		sum += values[i];
+	}
+	return sum;
+}
+
+static bool stats_collect(int fd, struct stats_record *rec)
+{
+	int i;
+
+	/* TODO: Detect if someone unloaded the perf event_fd's, as
+	 * this can happen by someone running perf-record -e
+	 */
+
+	for (i = 0; i < REDIR_RES_MAX; i++) {
+		rec->xdp_redir[i].timestamp = gettime();
+		rec->xdp_redir[i].counter = get_key32_value64_percpu(fd, i);
+	}
+	return true;
+}
+
+static void stats_poll(int interval, bool err_only)
+{
+	struct stats_record rec, prev;
+	int map_fd;
+
+	memset(&rec, 0, sizeof(rec));
+
+	/* Trick to pretty printf with thousands separators use %' */
+	setlocale(LC_NUMERIC, "en_US");
+
+	/* Header */
+	if (verbose)
+		printf("\n%s", __doc__);
+
+	/* TODO Need more advanced stats on error types */
+	if (verbose)
+		printf(" - Stats map: %s\n", map_data[0].name);
+	map_fd = map_data[0].fd;
+
+	stats_print_headers(err_only);
+	fflush(stdout);
+
+	while (1) {
+		memcpy(&prev, &rec, sizeof(rec));
+		stats_collect(map_fd, &rec);
+		stats_print(&rec, &prev, err_only);
+		fflush(stdout);
+		sleep(interval);
+	}
+}
+
+void print_bpf_prog_info(void)
+{
+	int i;
+
+	/* Prog info */
+	printf("Loaded BPF prog have %d bpf program(s)\n", prog_cnt);
+	for (i = 0; i < prog_cnt; i++) {
+		printf(" - prog_fd[%d] = fd(%d)\n", i, prog_fd[i]);
+	}
+
+	/* Maps info */
+	printf("Loaded BPF prog have %d map(s)\n", map_data_count);
+	for (i = 0; i < map_data_count; i++) {
+		char *name = map_data[i].name;
+		int fd     = map_data[i].fd;
+
+		printf(" - map_data[%d] = fd(%d) name:%s\n", i, fd, name);
+	}
+
+	/* Event info */
+	printf("Searching for (max:%d) event file descriptor(s)\n", prog_cnt);
+	for (i = 0; i < prog_cnt; i++) {
+		if (event_fd[i] != -1)
+			printf(" - event_fd[%d] = fd(%d)\n", i, event_fd[i]);
+	}
+}
+
+int main(int argc, char **argv)
+{
+	int longindex = 0, opt;
+	int ret = EXIT_SUCCESS;
+	char bpf_obj_file[256];
+
+	/* Default settings: */
+	bool errors_only = true;
+	int interval = 2;
+
+	snprintf(bpf_obj_file, sizeof(bpf_obj_file), "%s_kern.o", argv[0]);
+
+	/* Parse commands line args */
+	while ((opt = getopt_long(argc, argv, "h",
+				  long_options, &longindex)) != -1) {
+		switch (opt) {
+		case 'D':
+			debug = true;
+			break;
+		case 'S':
+			errors_only = false;
+			break;
+		case 's':
+			interval = atoi(optarg);
+			break;
+		case 'h':
+		default:
+			usage(argv);
+			return EXIT_FAILURE;
+		}
+	}
+
+	if (load_bpf_file(bpf_obj_file)) {
+		printf("ERROR - bpf_log_buf: %s", bpf_log_buf);
+		return 1;
+	}
+	if (!prog_fd[0]) {
+		printf("ERROR - load_bpf_file: %s\n", strerror(errno));
+		return 1;
+	}
+
+	if (debug) {
+		print_bpf_prog_info();
+	}
+
+	/* Unload/stop tracepoint event by closing fd's */
+	if (errors_only) {
+		/* The prog_fd[i] and event_fd[i] depend on the
+		 * order the functions was defined in _kern.c
+		 */
+		close(event_fd[2]); /* tracepoint/xdp/xdp_redirect */
+		close(prog_fd[2]);  /* func: trace_xdp_redirect */
+		close(event_fd[3]); /* tracepoint/xdp/xdp_redirect_map */
+		close(prog_fd[3]);  /* func: trace_xdp_redirect_map */
+	}
+
+	stats_poll(interval, errors_only);
+
+	return ret;
+}

^ permalink raw reply related

* [PATCH net-next 6/7] samples/bpf: xdp_redirect load XDP dummy prog on TX device
From: Jesper Dangaard Brouer @ 2017-08-29 14:38 UTC (permalink / raw)
  To: netdev; +Cc: John Fastabend, Jesper Dangaard Brouer
In-Reply-To: <150401743083.16384.15778781741742858567.stgit@firesoul>

For supporting XDP_REDIRECT, a device driver must (obviously)
implement the "TX" function ndo_xdp_xmit().  An additional requirement
is you cannot TX out a device, unless it also have a xdp bpf program
attached. This dependency is caused by the driver code need to setup
XDP resources before it can ndo_xdp_xmit.

Update bpf samples xdp_redirect and xdp_redirect_map to automatically
attach a dummy XDP program to the configured ifindex_out device.  Use
the XDP flag XDP_FLAGS_UPDATE_IF_NOEXIST on the dummy load, to avoid
overriding an existing XDP prog on the device.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 samples/bpf/xdp_redirect_kern.c     |   11 ++++++++++-
 samples/bpf/xdp_redirect_map_kern.c |   11 ++++++++++-
 samples/bpf/xdp_redirect_map_user.c |   22 +++++++++++++++-------
 samples/bpf/xdp_redirect_user.c     |   21 +++++++++++++++------
 4 files changed, 50 insertions(+), 15 deletions(-)

diff --git a/samples/bpf/xdp_redirect_kern.c b/samples/bpf/xdp_redirect_kern.c
index a34ad457a684..1c90288d0203 100644
--- a/samples/bpf/xdp_redirect_kern.c
+++ b/samples/bpf/xdp_redirect_kern.c
@@ -26,6 +26,9 @@ struct bpf_map_def SEC("maps") tx_port = {
 	.max_entries = 1,
 };
 
+/* Count RX packets, as XDP bpf_prog doesn't get direct TX-success
+ * feedback.  Redirect TX errors can be caught via a tracepoint.
+ */
 struct bpf_map_def SEC("maps") rxcnt = {
 	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
 	.key_size = sizeof(u32),
@@ -33,7 +36,6 @@ struct bpf_map_def SEC("maps") rxcnt = {
 	.max_entries = 1,
 };
 
-
 static void swap_src_dst_mac(void *data)
 {
 	unsigned short *p = data;
@@ -78,4 +80,11 @@ int xdp_redirect_prog(struct xdp_md *ctx)
 	return bpf_redirect(*ifindex, 0);
 }
 
+/* Redirect require an XDP bpf_prog loaded on the TX device */
+SEC("xdp_redirect_dummy")
+int xdp_redirect_dummy(struct xdp_md *ctx)
+{
+	return XDP_PASS;
+}
+
 char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/xdp_redirect_map_kern.c b/samples/bpf/xdp_redirect_map_kern.c
index 2faf196e17ea..79795d41ad0d 100644
--- a/samples/bpf/xdp_redirect_map_kern.c
+++ b/samples/bpf/xdp_redirect_map_kern.c
@@ -26,6 +26,9 @@ struct bpf_map_def SEC("maps") tx_port = {
 	.max_entries = 100,
 };
 
+/* Count RX packets, as XDP bpf_prog doesn't get direct TX-success
+ * feedback.  Redirect TX errors can be caught via a tracepoint.
+ */
 struct bpf_map_def SEC("maps") rxcnt = {
 	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
 	.key_size = sizeof(u32),
@@ -33,7 +36,6 @@ struct bpf_map_def SEC("maps") rxcnt = {
 	.max_entries = 1,
 };
 
-
 static void swap_src_dst_mac(void *data)
 {
 	unsigned short *p = data;
@@ -80,4 +82,11 @@ int xdp_redirect_map_prog(struct xdp_md *ctx)
 	return bpf_redirect_map(&tx_port, vport, 0);
 }
 
+/* Redirect require an XDP bpf_prog loaded on the TX device */
+SEC("xdp_redirect_dummy")
+int xdp_redirect_dummy(struct xdp_md *ctx)
+{
+	return XDP_PASS;
+}
+
 char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/xdp_redirect_map_user.c b/samples/bpf/xdp_redirect_map_user.c
index a1ad00fdaa8a..d4d86a273fba 100644
--- a/samples/bpf/xdp_redirect_map_user.c
+++ b/samples/bpf/xdp_redirect_map_user.c
@@ -16,6 +16,7 @@
 #include <signal.h>
 #include <stdio.h>
 #include <stdlib.h>
+#include <stdbool.h>
 #include <string.h>
 #include <unistd.h>
 #include <libgen.h>
@@ -26,17 +27,18 @@
 
 static int ifindex_in;
 static int ifindex_out;
+static bool ifindex_out_xdp_dummy_attached = true;
 
 static __u32 xdp_flags;
 
 static void int_exit(int sig)
 {
 	set_link_xdp_fd(ifindex_in, -1, xdp_flags);
+	if (ifindex_out_xdp_dummy_attached)
+		set_link_xdp_fd(ifindex_out, -1, xdp_flags);
 	exit(0);
 }
 
-/* simple per-protocol drop counter
- */
 static void poll_stats(int interval, int ifindex)
 {
 	unsigned int nr_cpus = bpf_num_possible_cpus();
@@ -70,7 +72,6 @@ static void usage(const char *prog)
 		prog);
 }
 
-
 int main(int argc, char **argv)
 {
 	const char *optstr = "SN";
@@ -112,14 +113,21 @@ int main(int argc, char **argv)
 		return 1;
 	}
 
-	signal(SIGINT, int_exit);
-	signal(SIGTERM, int_exit);
-
 	if (set_link_xdp_fd(ifindex_in, prog_fd[0], xdp_flags) < 0) {
-		printf("link set xdp fd failed\n");
+		printf("ERROR: link set xdp fd failed on %d\n", ifindex_in);
 		return 1;
 	}
 
+	/* Loading dummy XDP prog on out-device */
+	if (set_link_xdp_fd(ifindex_out, prog_fd[1],
+			    (xdp_flags | XDP_FLAGS_UPDATE_IF_NOEXIST)) < 0) {
+		printf("WARN: link set xdp fd failed on %d\n", ifindex_out);
+		ifindex_out_xdp_dummy_attached = false;
+	}
+
+	signal(SIGINT, int_exit);
+	signal(SIGTERM, int_exit);
+
 	printf("map[0] (vports) = %i, map[1] (map) = %i, map[2] (count) = %i\n",
 		map_fd[0], map_fd[1], map_fd[2]);
 
diff --git a/samples/bpf/xdp_redirect_user.c b/samples/bpf/xdp_redirect_user.c
index f705a1905d2d..4475d837bf2c 100644
--- a/samples/bpf/xdp_redirect_user.c
+++ b/samples/bpf/xdp_redirect_user.c
@@ -16,6 +16,7 @@
 #include <signal.h>
 #include <stdio.h>
 #include <stdlib.h>
+#include <stdbool.h>
 #include <string.h>
 #include <unistd.h>
 #include <libgen.h>
@@ -26,17 +27,18 @@
 
 static int ifindex_in;
 static int ifindex_out;
+static bool ifindex_out_xdp_dummy_attached = true;
 
 static __u32 xdp_flags;
 
 static void int_exit(int sig)
 {
 	set_link_xdp_fd(ifindex_in, -1, xdp_flags);
+	if (ifindex_out_xdp_dummy_attached)
+		set_link_xdp_fd(ifindex_out, -1, xdp_flags);
 	exit(0);
 }
 
-/* simple per-protocol drop counter
- */
 static void poll_stats(int interval, int ifindex)
 {
 	unsigned int nr_cpus = bpf_num_possible_cpus();
@@ -112,14 +114,21 @@ int main(int argc, char **argv)
 		return 1;
 	}
 
-	signal(SIGINT, int_exit);
-	signal(SIGTERM, int_exit);
-
 	if (set_link_xdp_fd(ifindex_in, prog_fd[0], xdp_flags) < 0) {
-		printf("link set xdp fd failed\n");
+		printf("ERROR: link set xdp fd failed on %d\n", ifindex_in);
 		return 1;
 	}
 
+	/* Loading dummy XDP prog on out-device */
+	if (set_link_xdp_fd(ifindex_out, prog_fd[1],
+			    (xdp_flags | XDP_FLAGS_UPDATE_IF_NOEXIST)) < 0) {
+		printf("WARN: link set xdp fd failed on %d\n", ifindex_out);
+		ifindex_out_xdp_dummy_attached = false;
+	}
+
+	signal(SIGINT, int_exit);
+	signal(SIGTERM, int_exit);
+
 	/* bpf redirect port */
 	ret = bpf_map_update_elem(map_fd[0], &key, &ifindex_out, 0);
 	if (ret) {

^ permalink raw reply related

* [PATCH net-next 5/7] xdp: separate xdp_redirect tracepoint in map case
From: Jesper Dangaard Brouer @ 2017-08-29 14:38 UTC (permalink / raw)
  To: netdev; +Cc: John Fastabend, Jesper Dangaard Brouer
In-Reply-To: <150401743083.16384.15778781741742858567.stgit@firesoul>

Creating as specific xdp_redirect_map variant of the xdp tracepoints
allow users to write simpler/faster BPF progs that get attached to
these tracepoints.

Goal is to still keep the tracepoints in xdp_redirect and xdp_redirect_map
similar enough, that a tool can read the top part of the TP_STRUCT and
produce similar monitor statistics.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/trace/events/xdp.h |   46 ++++++++++++++++++++++++++++++++++++--------
 net/core/filter.c          |    4 ++--
 2 files changed, 40 insertions(+), 10 deletions(-)

diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
index 1eebad55ebd4..862575ac8da9 100644
--- a/include/trace/events/xdp.h
+++ b/include/trace/events/xdp.h
@@ -77,13 +77,11 @@ DECLARE_EVENT_CLASS(xdp_redirect_template,
 		__entry->map_index	= map_index;
 	),
 
-	TP_printk("prog_id=%d action=%s ifindex=%d to_ifindex=%d err=%d"
-		  " map_id=%d map_index=%d",
+	TP_printk("prog_id=%d action=%s ifindex=%d to_ifindex=%d err=%d",
 		  __entry->prog_id,
 		  __print_symbolic(__entry->act, __XDP_ACT_SYM_TAB),
 		  __entry->ifindex, __entry->to_ifindex,
-		  __entry->err,
-		  __entry->map_id, __entry->map_index)
+		  __entry->err)
 );
 
 DEFINE_EVENT(xdp_redirect_template, xdp_redirect,
@@ -108,11 +106,43 @@ DEFINE_EVENT(xdp_redirect_template, xdp_redirect_err,
 #define _trace_xdp_redirect_err(dev, xdp, to, err)	\
 	 trace_xdp_redirect_err(dev, xdp, to, err, NULL, 0);
 
-#define trace_xdp_redirect_map(dev, xdp, fwd, map, idx)	\
-	trace_xdp_redirect(dev, xdp, fwd ? fwd->ifindex : 0, 0, map, idx);
+DEFINE_EVENT_PRINT(xdp_redirect_template, xdp_redirect_map,
+	TP_PROTO(const struct net_device *dev,
+		 const struct bpf_prog *xdp,
+		 int to_ifindex, int err,
+		 const struct bpf_map *map, u32 map_index),
+	TP_ARGS(dev, xdp, to_ifindex, err, map, map_index),
+	TP_printk("prog_id=%d action=%s ifindex=%d to_ifindex=%d err=%d"
+		  " map_id=%d map_index=%d",
+		  __entry->prog_id,
+		  __print_symbolic(__entry->act, __XDP_ACT_SYM_TAB),
+		  __entry->ifindex, __entry->to_ifindex,
+		  __entry->err,
+		  __entry->map_id, __entry->map_index)
+);
+
+DEFINE_EVENT_PRINT(xdp_redirect_template, xdp_redirect_map_err,
+	TP_PROTO(const struct net_device *dev,
+		 const struct bpf_prog *xdp,
+		 int to_ifindex, int err,
+		 const struct bpf_map *map, u32 map_index),
+	TP_ARGS(dev, xdp, to_ifindex, err, map, map_index),
+	TP_printk("prog_id=%d action=%s ifindex=%d to_ifindex=%d err=%d"
+		  " map_id=%d map_index=%d",
+		  __entry->prog_id,
+		  __print_symbolic(__entry->act, __XDP_ACT_SYM_TAB),
+		  __entry->ifindex, __entry->to_ifindex,
+		  __entry->err,
+		  __entry->map_id, __entry->map_index)
+);
+
+#define _trace_xdp_redirect_map(dev, xdp, fwd, map, idx)		\
+	 trace_xdp_redirect_map(dev, xdp, fwd ? fwd->ifindex : 0,	\
+				0, map, idx);
 
-#define trace_xdp_redirect_map_err(dev, xdp, fwd, map, idx, err)	\
-	trace_xdp_redirect_err(dev, xdp, fwd ? fwd->ifindex : 0, err, map, idx);
+#define _trace_xdp_redirect_map_err(dev, xdp, fwd, map, idx, err)	\
+	 trace_xdp_redirect_map_err(dev, xdp, fwd ? fwd->ifindex : 0,	\
+				    err, map, idx);
 
 #endif /* _TRACE_XDP_H */
 
diff --git a/net/core/filter.c b/net/core/filter.c
index 096e78de0b97..c6a37fe0285b 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2525,10 +2525,10 @@ static int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
 		goto err;
 
 	ri->map_to_flush = map;
-	trace_xdp_redirect_map(dev, xdp_prog, fwd, map, index);
+	_trace_xdp_redirect_map(dev, xdp_prog, fwd, map, index);
 	return 0;
 err:
-	trace_xdp_redirect_map_err(dev, xdp_prog, fwd, map, index, err);
+	_trace_xdp_redirect_map_err(dev, xdp_prog, fwd, map, index, err);
 	return err;
 }
 

^ permalink raw reply related

* [PATCH net-next 4/7] xdp: separate xdp_redirect tracepoint in error case
From: Jesper Dangaard Brouer @ 2017-08-29 14:37 UTC (permalink / raw)
  To: netdev; +Cc: John Fastabend, Jesper Dangaard Brouer
In-Reply-To: <150401743083.16384.15778781741742858567.stgit@firesoul>

There is a need to separate the xdp_redirect tracepoint into two
tracepoints, for separating the error case from the normal forward
case.

Due to the extreme speeds XDP is operating at, loading a tracepoint
have a measurable impact.  Single core XDP REDIRECT (ethtool tuned
rx-usecs 25) can do 13.7 Mpps forwarding, but loading a simple
bpf_prog at the tracepoint (with a return 0) reduce perf to 10.2 Mpps
(CPU E5-1650 v4 @ 3.60GHz, driver: ixgbe)

The overhead of loading a bpf-based tracepoint can be calculated to
cost 25 nanosec ((1/13782002-1/10267937)*10^9 = -24.83 ns).

Using perf record on the tracepoint event, with a non-matching --filter
expression, the overhead is much larger. Performance drops to 8.3 Mpps,
cost 48 nanosec ((1/13782002-1/8312497)*10^9 = -47.74))

Having a separate tracepoint for err cases, which should be less
frequent, allow running a continuous monitor for errors while not
affecting the redirect forward performance (this have also been
verified by measurements).

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/trace/events/xdp.h |   22 ++++++++++++++++++----
 net/core/filter.c          |   37 ++++++++++++++++++++++++-------------
 2 files changed, 42 insertions(+), 17 deletions(-)

diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
index 89eba564e199..1eebad55ebd4 100644
--- a/include/trace/events/xdp.h
+++ b/include/trace/events/xdp.h
@@ -94,11 +94,25 @@ DEFINE_EVENT(xdp_redirect_template, xdp_redirect,
 	TP_ARGS(dev, xdp, to_ifindex, err, map, map_index)
 );
 
-#define _trace_xdp_redirect(dev, xdp, to, err)	\
-	 trace_xdp_redirect(dev, xdp, to, err, NULL, 0);
+DEFINE_EVENT(xdp_redirect_template, xdp_redirect_err,
+	TP_PROTO(const struct net_device *dev,
+		 const struct bpf_prog *xdp,
+		 int to_ifindex, int err,
+		 const struct bpf_map *map, u32 map_index),
+	TP_ARGS(dev, xdp, to_ifindex, err, map, map_index)
+);
+
+#define _trace_xdp_redirect(dev, xdp, to)		\
+	 trace_xdp_redirect(dev, xdp, to, 0, NULL, 0);
+
+#define _trace_xdp_redirect_err(dev, xdp, to, err)	\
+	 trace_xdp_redirect_err(dev, xdp, to, err, NULL, 0);
+
+#define trace_xdp_redirect_map(dev, xdp, fwd, map, idx)	\
+	trace_xdp_redirect(dev, xdp, fwd ? fwd->ifindex : 0, 0, map, idx);
 
-#define trace_xdp_redirect_map(dev, xdp, fwd, err, map, idx)	\
-	trace_xdp_redirect(dev, xdp, fwd ? fwd->ifindex : 0, err, map, idx);
+#define trace_xdp_redirect_map_err(dev, xdp, fwd, map, idx, err)	\
+	trace_xdp_redirect_err(dev, xdp, fwd ? fwd->ifindex : 0, err, map, idx);
 
 #endif /* _TRACE_XDP_H */
 
diff --git a/net/core/filter.c b/net/core/filter.c
index 31eab77cc842..096e78de0b97 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2515,16 +2515,20 @@ static int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
 	fwd = __dev_map_lookup_elem(map, index);
 	if (!fwd) {
 		err = -EINVAL;
-		goto out;
+		goto err;
 	}
 	if (ri->map_to_flush && ri->map_to_flush != map)
 		xdp_do_flush_map();
 
 	err = __bpf_tx_xdp(fwd, map, xdp, index);
-	if (likely(!err))
-		ri->map_to_flush = map;
-out:
-	trace_xdp_redirect_map(dev, xdp_prog, fwd, err, map, index);
+	if (unlikely(err))
+		goto err;
+
+	ri->map_to_flush = map;
+	trace_xdp_redirect_map(dev, xdp_prog, fwd, map, index);
+	return 0;
+err:
+	trace_xdp_redirect_map_err(dev, xdp_prog, fwd, map, index, err);
 	return err;
 }
 
@@ -2543,12 +2547,17 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 	ri->ifindex = 0;
 	if (unlikely(!fwd)) {
 		err = -EINVAL;
-		goto out;
+		goto err;
 	}
 
 	err = __bpf_tx_xdp(fwd, NULL, xdp, 0);
-out:
-	_trace_xdp_redirect(dev, xdp_prog, index, err);
+	if (unlikely(err))
+		goto err;
+
+	_trace_xdp_redirect(dev, xdp_prog, index);
+	return 0;
+err:
+	_trace_xdp_redirect_err(dev, xdp_prog, index, err);
 	return err;
 }
 EXPORT_SYMBOL_GPL(xdp_do_redirect);
@@ -2566,23 +2575,25 @@ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 	ri->ifindex = 0;
 	if (unlikely(!fwd)) {
 		err = -EINVAL;
-		goto out;
+		goto err;
 	}
 
 	if (unlikely(!(fwd->flags & IFF_UP))) {
 		err = -ENETDOWN;
-		goto out;
+		goto err;
 	}
 
 	len = fwd->mtu + fwd->hard_header_len + VLAN_HLEN;
 	if (skb->len > len) {
 		err = -EMSGSIZE;
-		goto out;
+		goto err;
 	}
 
 	skb->dev = fwd;
-out:
-	_trace_xdp_redirect(dev, xdp_prog, index, err);
+	_trace_xdp_redirect(dev, xdp_prog, index);
+	return 0;
+err:
+	_trace_xdp_redirect_err(dev, xdp_prog, index, err);
 	return err;
 }
 EXPORT_SYMBOL_GPL(xdp_do_generic_redirect);

^ permalink raw reply related

* [PATCH net-next 3/7] xdp: make xdp tracepoints report bpf prog id instead of prog_tag
From: Jesper Dangaard Brouer @ 2017-08-29 14:37 UTC (permalink / raw)
  To: netdev; +Cc: John Fastabend, Jesper Dangaard Brouer
In-Reply-To: <150401743083.16384.15778781741742858567.stgit@firesoul>

Given previous patch expose the map_id, it seems natural to also
report the bpf prog id.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/trace/events/xdp.h |   18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
index 573dcfa1aeaa..89eba564e199 100644
--- a/include/trace/events/xdp.h
+++ b/include/trace/events/xdp.h
@@ -31,20 +31,19 @@ TRACE_EVENT(xdp_exception,
 	TP_ARGS(dev, xdp, act),
 
 	TP_STRUCT__entry(
-		__array(u8, prog_tag, 8)
+		__field(int, prog_id)
 		__field(u32, act)
 		__field(int, ifindex)
 	),
 
 	TP_fast_assign(
-		BUILD_BUG_ON(sizeof(__entry->prog_tag) != sizeof(xdp->tag));
-		memcpy(__entry->prog_tag, xdp->tag, sizeof(xdp->tag));
+		__entry->prog_id	= xdp->aux->id;
 		__entry->act		= act;
 		__entry->ifindex	= dev->ifindex;
 	),
 
-	TP_printk("prog=%s action=%s ifindex=%d",
-		  __print_hex_str(__entry->prog_tag, 8),
+	TP_printk("prog_id=%d action=%s ifindex=%d",
+		  __entry->prog_id,
 		  __print_symbolic(__entry->act, __XDP_ACT_SYM_TAB),
 		  __entry->ifindex)
 );
@@ -59,7 +58,7 @@ DECLARE_EVENT_CLASS(xdp_redirect_template,
 	TP_ARGS(dev, xdp, to_ifindex, err, map, map_index),
 
 	TP_STRUCT__entry(
-		__array(u8, prog_tag, 8)
+		__field(int, prog_id)
 		__field(u32, act)
 		__field(int, ifindex)
 		__field(int, err)
@@ -69,8 +68,7 @@ DECLARE_EVENT_CLASS(xdp_redirect_template,
 	),
 
 	TP_fast_assign(
-		BUILD_BUG_ON(sizeof(__entry->prog_tag) != sizeof(xdp->tag));
-		memcpy(__entry->prog_tag, xdp->tag, sizeof(xdp->tag));
+		__entry->prog_id	= xdp->aux->id;
 		__entry->act		= XDP_REDIRECT;
 		__entry->ifindex	= dev->ifindex;
 		__entry->err		= err;
@@ -79,9 +77,9 @@ DECLARE_EVENT_CLASS(xdp_redirect_template,
 		__entry->map_index	= map_index;
 	),
 
-	TP_printk("prog=%s action=%s ifindex=%d to_ifindex=%d err=%d"
+	TP_printk("prog_id=%d action=%s ifindex=%d to_ifindex=%d err=%d"
 		  " map_id=%d map_index=%d",
-		  __print_hex_str(__entry->prog_tag, 8),
+		  __entry->prog_id,
 		  __print_symbolic(__entry->act, __XDP_ACT_SYM_TAB),
 		  __entry->ifindex, __entry->to_ifindex,
 		  __entry->err,

^ permalink raw reply related

* [PATCH net-next 2/7] xdp: tracepoint xdp_redirect also need a map argument
From: Jesper Dangaard Brouer @ 2017-08-29 14:37 UTC (permalink / raw)
  To: netdev; +Cc: John Fastabend, Jesper Dangaard Brouer
In-Reply-To: <150401743083.16384.15778781741742858567.stgit@firesoul>

To make sense of the map index, the tracepoint user also need to know
that map we are talking about.  Supply the map pointer but only expose
the map->id.

The 'to_index' is renamed 'to_ifindex'.  In the xdp_redirect_map case,
this is the result of the devmap lookup. The map lookup key is exposed
as map_index, which is needed to troubleshoot in case the lookup failed.
The 'to_ifindex' is placed after 'err' to keep TP_STRUCT as common as
possible.

This also keeps the TP_STRUCT similar enough, that userspace can write
a monitor program, that doesn't need to care about whether
bpf_redirect or bpf_redirect_map were used.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/trace/events/xdp.h |   38 ++++++++++++++++++++++++++++++--------
 net/core/filter.c          |    6 +++---
 2 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
index f684f3b36bca..573dcfa1aeaa 100644
--- a/include/trace/events/xdp.h
+++ b/include/trace/events/xdp.h
@@ -49,20 +49,23 @@ TRACE_EVENT(xdp_exception,
 		  __entry->ifindex)
 );
 
-TRACE_EVENT(xdp_redirect,
+DECLARE_EVENT_CLASS(xdp_redirect_template,
 
 	TP_PROTO(const struct net_device *dev,
 		 const struct bpf_prog *xdp,
-		 int to_index, int err),
+		 int to_ifindex, int err,
+		 const struct bpf_map *map, u32 map_index),
 
-	TP_ARGS(dev, xdp, to_index, err),
+	TP_ARGS(dev, xdp, to_ifindex, err, map, map_index),
 
 	TP_STRUCT__entry(
 		__array(u8, prog_tag, 8)
 		__field(u32, act)
 		__field(int, ifindex)
-		__field(int, to_index)
 		__field(int, err)
+		__field(int, to_ifindex)
+		__field(u32, map_id)
+		__field(int, map_index)
 	),
 
 	TP_fast_assign(
@@ -70,16 +73,35 @@ TRACE_EVENT(xdp_redirect,
 		memcpy(__entry->prog_tag, xdp->tag, sizeof(xdp->tag));
 		__entry->act		= XDP_REDIRECT;
 		__entry->ifindex	= dev->ifindex;
-		__entry->to_index	= to_index;
 		__entry->err		= err;
+		__entry->to_ifindex	= to_ifindex;
+		__entry->map_id		= map ? map->id : 0;
+		__entry->map_index	= map_index;
 	),
 
-	TP_printk("prog=%s action=%s ifindex=%d to_index=%d err=%d",
+	TP_printk("prog=%s action=%s ifindex=%d to_ifindex=%d err=%d"
+		  " map_id=%d map_index=%d",
 		  __print_hex_str(__entry->prog_tag, 8),
 		  __print_symbolic(__entry->act, __XDP_ACT_SYM_TAB),
-		  __entry->ifindex, __entry->to_index,
-		  __entry->err)
+		  __entry->ifindex, __entry->to_ifindex,
+		  __entry->err,
+		  __entry->map_id, __entry->map_index)
 );
+
+DEFINE_EVENT(xdp_redirect_template, xdp_redirect,
+	TP_PROTO(const struct net_device *dev,
+		 const struct bpf_prog *xdp,
+		 int to_ifindex, int err,
+		 const struct bpf_map *map, u32 map_index),
+	TP_ARGS(dev, xdp, to_ifindex, err, map, map_index)
+);
+
+#define _trace_xdp_redirect(dev, xdp, to, err)	\
+	 trace_xdp_redirect(dev, xdp, to, err, NULL, 0);
+
+#define trace_xdp_redirect_map(dev, xdp, fwd, err, map, idx)	\
+	trace_xdp_redirect(dev, xdp, fwd ? fwd->ifindex : 0, err, map, idx);
+
 #endif /* _TRACE_XDP_H */
 
 #include <trace/define_trace.h>
diff --git a/net/core/filter.c b/net/core/filter.c
index de31fb684ad4..31eab77cc842 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2524,7 +2524,7 @@ static int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
 	if (likely(!err))
 		ri->map_to_flush = map;
 out:
-	trace_xdp_redirect(dev, xdp_prog, index, err);
+	trace_xdp_redirect_map(dev, xdp_prog, fwd, err, map, index);
 	return err;
 }
 
@@ -2548,7 +2548,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 
 	err = __bpf_tx_xdp(fwd, NULL, xdp, 0);
 out:
-	trace_xdp_redirect(dev, xdp_prog, index, err);
+	_trace_xdp_redirect(dev, xdp_prog, index, err);
 	return err;
 }
 EXPORT_SYMBOL_GPL(xdp_do_redirect);
@@ -2582,7 +2582,7 @@ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 
 	skb->dev = fwd;
 out:
-	trace_xdp_redirect(dev, xdp_prog, index, err);
+	_trace_xdp_redirect(dev, xdp_prog, index, err);
 	return err;
 }
 EXPORT_SYMBOL_GPL(xdp_do_generic_redirect);

^ permalink raw reply related

* [PATCH net-next 1/7] xdp: remove redundant argument to trace_xdp_redirect
From: Jesper Dangaard Brouer @ 2017-08-29 14:37 UTC (permalink / raw)
  To: netdev; +Cc: John Fastabend, Jesper Dangaard Brouer
In-Reply-To: <150401743083.16384.15778781741742858567.stgit@firesoul>

Supplying the action argument XDP_REDIRECT to the tracepoint xdp_redirect
is redundant as it is only called in-case this action was specified.

Remove the argument, but keep "act" member of the tracepoint struct and
populate it with XDP_REDIRECT.  This makes it easier to write a common bpf_prog
processing events.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/trace/events/xdp.h |    6 +++---
 net/core/filter.c          |    6 +++---
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
index 27cf2ef35f5f..f684f3b36bca 100644
--- a/include/trace/events/xdp.h
+++ b/include/trace/events/xdp.h
@@ -52,10 +52,10 @@ TRACE_EVENT(xdp_exception,
 TRACE_EVENT(xdp_redirect,
 
 	TP_PROTO(const struct net_device *dev,
-		 const struct bpf_prog *xdp, u32 act,
+		 const struct bpf_prog *xdp,
 		 int to_index, int err),
 
-	TP_ARGS(dev, xdp, act, to_index, err),
+	TP_ARGS(dev, xdp, to_index, err),
 
 	TP_STRUCT__entry(
 		__array(u8, prog_tag, 8)
@@ -68,7 +68,7 @@ TRACE_EVENT(xdp_redirect,
 	TP_fast_assign(
 		BUILD_BUG_ON(sizeof(__entry->prog_tag) != sizeof(xdp->tag));
 		memcpy(__entry->prog_tag, xdp->tag, sizeof(xdp->tag));
-		__entry->act		= act;
+		__entry->act		= XDP_REDIRECT;
 		__entry->ifindex	= dev->ifindex;
 		__entry->to_index	= to_index;
 		__entry->err		= err;
diff --git a/net/core/filter.c b/net/core/filter.c
index 4bcd6baa80c9..de31fb684ad4 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2524,7 +2524,7 @@ static int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
 	if (likely(!err))
 		ri->map_to_flush = map;
 out:
-	trace_xdp_redirect(dev, xdp_prog, XDP_REDIRECT, index, err);
+	trace_xdp_redirect(dev, xdp_prog, index, err);
 	return err;
 }
 
@@ -2548,7 +2548,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 
 	err = __bpf_tx_xdp(fwd, NULL, xdp, 0);
 out:
-	trace_xdp_redirect(dev, xdp_prog, XDP_REDIRECT, index, err);
+	trace_xdp_redirect(dev, xdp_prog, index, err);
 	return err;
 }
 EXPORT_SYMBOL_GPL(xdp_do_redirect);
@@ -2582,7 +2582,7 @@ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 
 	skb->dev = fwd;
 out:
-	trace_xdp_redirect(dev, xdp_prog, XDP_REDIRECT, index, err);
+	trace_xdp_redirect(dev, xdp_prog, index, err);
 	return err;
 }
 EXPORT_SYMBOL_GPL(xdp_do_generic_redirect);

^ permalink raw reply related

* [PATCH net-next 0/7] XDP redirect tracepoints
From: Jesper Dangaard Brouer @ 2017-08-29 14:37 UTC (permalink / raw)
  To: netdev; +Cc: John Fastabend, Jesper Dangaard Brouer

I feel this is as far as I can take the tracepoint infrastructure to
assist XDP monitoring.

Tracepoints comes with a base overhead of 25 nanosec for an attached
bpf_prog, and 48 nanosec for using a full perf record. This is
problematic for the XDP use-case, but it is very convenient to use the
existing perf infrastructure.

>From a performance perspective, the real solution would be to attach
another bpf_prog (that understand xdp_buff), but I'm not sure we want
to introduce yet another bpf attach API for this.

One thing left is to standardize the possible err return codes, to a
limited set, to allow easier (and faster) mapping into a bpf map.

---

Jesper Dangaard Brouer (7):
      xdp: remove redundant argument to trace_xdp_redirect
      xdp: tracepoint xdp_redirect also need a map argument
      xdp: make xdp tracepoints report bpf prog id instead of prog_tag
      xdp: separate xdp_redirect tracepoint in error case
      xdp: separate xdp_redirect tracepoint in map case
      samples/bpf: xdp_redirect load XDP dummy prog on TX device
      samples/bpf: xdp_monitor tool based on tracepoints


 include/trace/events/xdp.h          |  100 ++++++++++--
 net/core/filter.c                   |   37 +++-
 samples/bpf/Makefile                |    4 
 samples/bpf/xdp_monitor_kern.c      |   88 ++++++++++
 samples/bpf/xdp_monitor_user.c      |  295 +++++++++++++++++++++++++++++++++++
 samples/bpf/xdp_redirect_kern.c     |   11 +
 samples/bpf/xdp_redirect_map_kern.c |   11 +
 samples/bpf/xdp_redirect_map_user.c |   22 ++-
 samples/bpf/xdp_redirect_user.c     |   21 ++
 9 files changed, 543 insertions(+), 46 deletions(-)
 create mode 100644 samples/bpf/xdp_monitor_kern.c
 create mode 100644 samples/bpf/xdp_monitor_user.c

^ permalink raw reply

* Re: [PATCH] staging: r8822be: Fix typo for CONFIG_RTLWIFI_DEBUG
From: Larry Finger @ 2017-08-29 14:10 UTC (permalink / raw)
  To: Andreas Ziegler, Greg KH
  Cc: devel, Yan-Hsuan Chuang, Birming Chiu, netdev, Steven Ting
In-Reply-To: <1504006208-18965-1-git-send-email-andreas.ziegler@fau.de>

On 08/29/2017 06:30 AM, Andreas Ziegler wrote:
> The debugging output in deinit_priv is guarded by an  #ifdef using
> CONFIG_RTL_DEBUG. This symbol does not exist and should be
> CONFIG_RTLWIFI_DEBUG instead.
> 
> Signed-off-by: Andreas Ziegler <andreas.ziegler@fau.de>

NACK.

Yes, there is a problem; however, CONFIG_RTLWIFI_DEBUG is not the value that 
should be used. That one is reserved for the non-staging drivers in 
drivers/net/wireless/realtek/rtlwifi/. The correct symbol for r8822be is 
CONFIG_RTLWIFI_DEBUG_ST.

Larry

> ---
>   drivers/staging/rtlwifi/halmac/rtl_halmac.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/staging/rtlwifi/halmac/rtl_halmac.c b/drivers/staging/rtlwifi/halmac/rtl_halmac.c
> index 031bf2c..888ca43 100644
> --- a/drivers/staging/rtlwifi/halmac/rtl_halmac.c
> +++ b/drivers/staging/rtlwifi/halmac/rtl_halmac.c
> @@ -386,7 +386,7 @@ static void deinit_priv(struct rtl_halmac *halmac)
>   		u32 count, size;
>   
>   		count = HALMAC_FEATURE_ALL + 1;
> -#ifdef CONFIG_RTL_DEBUG
> +#ifdef CONFIG_RTLWIFI_DEBUG
>   		{
>   			struct submit_ctx *sctx;
>   			u32 i;
> @@ -405,7 +405,7 @@ static void deinit_priv(struct rtl_halmac *halmac)
>   				rtl_mfree((u8 *)sctx, sizeof(*sctx));
>   			}
>   		}
> -#endif /* !CONFIG_RTL_DEBUG */
> +#endif /* !CONFIG_RTLWIFI_DEBUG */
>   		size = sizeof(*indicator) * count;
>   		kfree((u8 *)indicator);
>   	}
> 

^ permalink raw reply

* Re: [PATCH v4 4/5] net: stmmac: dwmac-sun8i: choose internal PHY via phy-is-integrated
From: Andrew Lunn @ 2017-08-29 14:06 UTC (permalink / raw)
  To: Corentin Labbe
  Cc: mark.rutland, devicetree, f.fainelli, alexandre.torgue, netdev,
	linux, linux-kernel, wens, robh+dt, peppe.cavallaro,
	maxime.ripard, linux-arm-kernel, icenowy
In-Reply-To: <20170829083432.GA28818@Red>

> Do you agree that another mdio_mux_init() must be written ? (taking
> a of_node (in our case: mdio-mux) instead of a device)

Yes, you are correct, it is using dev->of_node. So you need to
refactor the code a little, to allow an of_node to be passed.

	 Andrew

^ permalink raw reply

* Re: [PATCH net-next v2] bridge: fdb add and delete tracepoints
From: Roopa Prabhu @ 2017-08-29 13:49 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Nikolay Aleksandrov, netdev@vger.kernel.org, Florian Fainelli,
	davem@davemloft.net, bridge
In-Reply-To: <20170829133648.GF22093@lunn.ch>

On Tue, Aug 29, 2017 at 6:36 AM, Andrew Lunn <andrew@lunn.ch> wrote:
> On Mon, Aug 28, 2017 at 09:22:48PM -0700, Roopa Prabhu wrote:
>> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>>
>> A few useful tracepoints to trace bridge forwarding
>> database updates.
>
> Hi Roopa
>
> Once accepted, it would be nice to add mdb tracepoints as well. I
> expect the implementation would be very similar.
>

yes, sounds great. will do.

^ permalink raw reply

* Re: Question about ip_defrag
From: Florian Westphal @ 2017-08-29 13:46 UTC (permalink / raw)
  To: liujian (CE)
  Cc: Florian Westphal, Jesper Dangaard Brouer, netdev@vger.kernel.org,
	Wangkefeng (Kevin), weiyongjun (A)
In-Reply-To: <4F88C5DDA1E80143B232E89585ACE27D018F4850@DGGEMA502-MBX.china.huawei.com>

liujian (CE) <liujian56@huawei.com> wrote:

[ trimming cc list ]

> Now, I have not the real environment. 
> I use iperf generate fragment packets; 
> and I always change NIC rx irq's affinity cpu, to make sure frag_mem_limit reach to thresh.
> my test machine, CPU num is 384.

Oh well, that explains it.

> > > +	if (frag_mem_limit(nf) > nf->low_thresh) {
> > >  		inet_frag_schedule_worker(f);
> > > +		update_frag_mem_limit(nf, SKB_TRUESIZE(1500) * 16);
> > > +	}

You need to reduce this to a lower value.
Your cpu count * batch_value needs to be less than
low_thresh to avoid problems.

Wtih 384 cpus its close to 12 mbyte...

Perhaps do this:

update_frag_mem_limit(nf, 2 * 1024*1024 / NR_CPUS);


However, I think its better to revert the percpu counter change and
move back to a single atomic_t count.

^ permalink raw reply

* Re: [PATCH net-next v2] bridge: fdb add and delete tracepoints
From: Andrew Lunn @ 2017-08-29 13:36 UTC (permalink / raw)
  To: Roopa Prabhu; +Cc: davem, netdev, nikolay, f.fainelli, bridge
In-Reply-To: <1503980568-35240-1-git-send-email-roopa@cumulusnetworks.com>

On Mon, Aug 28, 2017 at 09:22:48PM -0700, Roopa Prabhu wrote:
> From: Roopa Prabhu <roopa@cumulusnetworks.com>
> 
> A few useful tracepoints to trace bridge forwarding
> database updates.

Hi Roopa

Once accepted, it would be nice to add mdb tracepoints as well. I
expect the implementation would be very similar.

       Andrew

^ permalink raw reply

* RE: [RFC PATCH] net: limit maximum number of packets to mark with xmit_more
From: David Laight @ 2017-08-29 13:35 UTC (permalink / raw)
  To: 'Jakub Kicinski', Jacob Keller; +Cc: netdev@vger.kernel.org
In-Reply-To: <20170825153418.53864810@cakuba>

From: Jakub Kicinski
> Sent: 25 August 2017 20:34
>
> On Fri, 25 Aug 2017 08:24:49 -0700, Jacob Keller wrote:
> > Under some circumstances, such as with many stacked devices, it is
> > possible that dev_hard_start_xmit will bundle many packets together, and
> > mark them all with xmit_more.
> 
> Excuse my ignorance but what are those stacked devices?  Could they
> perhaps be fixed somehow?  My intuition was that long xmit_more
> sequences can only happen if NIC and/or BQL are back pressuring, and
> therefore we shouldn't be seeing a long xmit_more "train" arriving at
> an empty device ring...

I also suspect that the packets could be coming from multiple sources.
So getting the sources to limit the number of packets with XMIT_MORE
set won't really solve any problem.

At some point the driver for the physical device will have to give it
a kick to start the transmits.

On the systems I've got (desktop x86) PCIe writes aren't really very
expensive.
Reads are a different matter entirely (2us into our fpga target).

	David.

^ permalink raw reply

* Re: XDP redirect measurements, gotchas and tracepoints
From: Jesper Dangaard Brouer @ 2017-08-29 13:26 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andy Gospodarek, Michael Chan, John Fastabend, Duyck, Alexander H,
	pstaszewski@itcare.pl, netdev@vger.kernel.org,
	xdp-newbies@vger.kernel.org, borkmann@iogearbox.net, brouer
In-Reply-To: <CAKgT0UeZ5Wn+8qY8kbBef5weORwGAXjq6Qp7jtB1bscmjY_oYw@mail.gmail.com>

On Mon, 28 Aug 2017 09:11:25 -0700 Alexander Duyck <alexander.duyck@gmail.com> wrote:

> My advice would be to not over complicate this. My big concern with
> all this buffer recycling is what happens the first time somebody
> introduces something like mirroring? Are you going to copy the data to
> a new page which would be quite expensive or just have to introduce
> reference counts? You are going to have to deal with stuff like
> reference counts eventually so you might as well bite that bullet now.
> My advice would be to not bother with optimizing for performance right
> now and instead focus on just getting functionality. The approach we
> took in ixgbe for the transmit path should work for almost any other
> driver since all you are looking at is having to free the page
> reference which takes care of reference counting already.

This return API is not about optimizing performance right now.  It is
actually about allowing us to change the underlying memory model per RX
queue for XDP.

If a RX-ring is use for both SKBs and XDP, then the refcnt model is
still enforced.  Although a driver using the 1-packet-per-page model,
should be able to reuse refcnt==1 pages when returned from XDP.

If a RX-ring is _ONLY_ used for XDP, then the driver have freedom to
implement another memory model, with the return-API.  We need to
experiment with the most optimal memory model.  The 1-packet-per-page
model is actually not the fastest, because of PCI-e bottlenecks.  With
HW support for packing descriptors and packets over the PCI-e bus, much
higher rates can be achieved.  Mellanox mlx5-Lx already have the needed HW
support.  And companies like NetCope also have 100G HW that does
similar tricks, and they even have a whitepaper[1][2] how they are
faster than DPDK with their NDP (Netcope Data Plane) API.

We do need the ability/flexibility to change the RX memory model, to
take advantage of this new NIC hardware.

[1] https://www.netcope.com/en/resources/improving-dpdk-performance
[2] https://www.netcope.com/en/company/press-center/press-releases/read-new-netcope-whitepaper-on-dpdk-acceleration

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH net-next v3 02/13] phy: add the mvebu cp110 comphy driver
From: Antoine Tenart @ 2017-08-29 13:12 UTC (permalink / raw)
  To: Kishon Vijay Abraham I
  Cc: Antoine Tenart, davem, andrew, jason, sebastian.hesselbarth,
	gregory.clement, thomas.petazzoni, nadavh, linux, linux-kernel,
	mw, stefanc, miquel.raynal, netdev
In-Reply-To: <d1d5212d-aa39-0b49-afb8-cc45d57717c0@ti.com>

[-- Attachment #1: Type: text/plain, Size: 4410 bytes --]

Hi Kishon,

On Tue, Aug 29, 2017 at 05:55:06PM +0530, Kishon Vijay Abraham I wrote:
> On Tuesday 29 August 2017 04:53 PM, Antoine Tenart wrote:
> > On Tue, Aug 29, 2017 at 04:34:17PM +0530, Kishon Vijay Abraham I wrote:
> >> On Monday 28 August 2017 08:27 PM, Antoine Tenart wrote:
> >>> +static const struct mvebu_comhy_conf mvebu_comphy_cp110_modes[] = {
> >>> +	/* lane 0 */
> >>> +	MVEBU_COMPHY_CONF(0, 1, PHY_MODE_SGMII, 0x1),
> >>> +	/* lane 1 */
> >>> +	MVEBU_COMPHY_CONF(1, 2, PHY_MODE_SGMII, 0x1),
> >>> +	/* lane 2 */
> >>> +	MVEBU_COMPHY_CONF(2, 0, PHY_MODE_SGMII, 0x1),
> >>> +	MVEBU_COMPHY_CONF(2, 0, PHY_MODE_10GKR, 0x1),
> >>> +	/* lane 3 */
> >>> +	MVEBU_COMPHY_CONF(3, 1, PHY_MODE_SGMII, 0x2),
> >>> +	/* lane 4 */
> >>> +	MVEBU_COMPHY_CONF(4, 0, PHY_MODE_SGMII, 0x2),
> >>> +	MVEBU_COMPHY_CONF(4, 0, PHY_MODE_10GKR, 0x2),
> >>> +	MVEBU_COMPHY_CONF(4, 1, PHY_MODE_SGMII, 0x1),
> >>> +	/* lane 5 */
> >>> +	MVEBU_COMPHY_CONF(5, 2, PHY_MODE_SGMII, 0x1),
> >>> +};
> >>
> >> IMHO all the lane and mode configuration should come from dt. That would make
> >> it more reusable when comphy is configured differently.
> > 
> > These connexions between engines and the comphy lanes are inside the
> > SoC. They won't change for a given SoC, and the actual configuration is
> > at the board level to know what is connected to the output of a given
> > lane, which is already described into the dt (the lane phandle).
> > 
> > So I think we can keep this inside the driver, and we'll had other
> > tables if the same comphy is ever used in another SoC.
> > 
> > What do you think?
> 
> I'd like to avoid adding tables for every SoC. These are configuration details
> and can come from dt.

Actually this is per CP design, not SoC (this one is used in both 7k and
8k SoCs from Marvell, and probably others). I'm still not convinced this
is a good idea to put this into the dt. First of all we would end up with
something like (and this is only for a single lane, out of *6*):

cpm_comphy: phy@phy@120000 {
	compatible = "marvell,comphy-cp110";
	reg = <0x120000 0x6000>;
	marvell,system-controller = <&cpm_syscon0>;
	#address-cells = <1>;
	#size-cells = <0>;

	cpm_comphy0: phy@0 {
		reg = <0>;
		#phy-cells = <1>;

		mode@0 {
			phy-mode = PHY_MODE_SGMII;
			selector = <0x1>;
			pipe-selector = <0x0>;
			port = <0>;
		};

		mode@1 {
			phy-mode = PHY_MODE_HS_SGMII;
			selector = <0x1>;
			pipe-selector = <0x0>;
			port = <0>;
		};

		mode@2 {
			phy-mode = PHY_MODE_RXAUI;
			selector = <0x2>;
			pipe-selector = <0x0>;
			port = <0>;
		};

		mode@3 {
			phy-mode = PHY_MODE_10GKR;
			selector = <0x2>;
			pipe-selector = <0x0>;
			port = <0>;
		};

		mode@4 {
			phy-mode = PHY_MODE_SATA;
			selector = <0x4>;
			pipe-selector = <0x0>;
			port = <1>;
		};

		mode@5 {
			phy-mode = PHY_MODE_USB;
			selector = <0x0>;
			pipe-selector = <0x1>;
			port = <2>;
		};

		mode@6 {
			phy-mode = PHY_MODE_USB;
			selector = <0x0>;
			pipe-selector = <0x2>;
			port = <0>;
		};

		... + PCIe, other ports ...
	};

	cpm_comphy1: phy@1 {
		...
	};

	...
};

And this "configuration" makes us think the comphy driver would be
somehow generic with only these parameters to update when using a
different CP. In reality we do have one other comphy in another CP, and
it requires more than just updating the above parameters: the init
functions also are SoC specific. So the table used in the patch proposed
only is a small part of this "configuration". In fact it's not a
configuration at all, but only a mode-to-bit indirection, used in the
comphy init functions.

What is proposed instead, is very close to what actually changes a lot,
and what a designer can change: the CP internals are described in the
driver as these won't change (and if they do in a future CP design, a
lot more effort would be needed than just updating the table), and the
actual lane connexions on the board are configured through the dt, which
is board specific.

Finally we do not have (yet) any example of this IP being reused as-is.
So we have no idea what other changes than the ones described above will
be needed. But for sure not only the mode and lane configurations.

Thanks,
Antoine

-- 
Antoine Ténart, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* [PATCH ] net: frag: print frag_mem_limit value in sockstat proc file
From: liujian56 @ 2017-08-29 13:05 UTC (permalink / raw)
  To: davem; +Cc: kuznet, yoshfuji, brouer, fw, netdev, liujian56

From: liujian <liujian56@huawei.com>

>From 6d7b857d5( net: use lib/percpu_counter API for fragmentation mem
accounting),
frag_mem_limit and sum_frag_mem_limit have different value if there are
multiple NIC RX CPU.
Print frag_mem_limit value, then we can get more debug info.

Signed-off-by: liujian <liujian56@huawei.com>
---
 net/ipv4/proc.c | 3 ++-
 net/ipv6/proc.c | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 43eb6567..370cda1 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -73,7 +73,8 @@ static int sockstat_seq_show(struct seq_file *seq, void *v)
 	seq_printf(seq, "RAW: inuse %d\n",
 		   sock_prot_inuse_get(net, &raw_prot));
 	frag_mem = ip_frag_mem(net);
-	seq_printf(seq,  "FRAG: inuse %u memory %u\n", !!frag_mem, frag_mem);
+	seq_printf(seq,  "FRAG: inuse %u memory %u current mem_limit %d\n",
+		!!frag_mem, frag_mem, frag_mem_limit(&net->ipv4.frags));
 	return 0;
 }
 
diff --git a/net/ipv6/proc.c b/net/ipv6/proc.c
index e88bcb8..00680bc 100644
--- a/net/ipv6/proc.c
+++ b/net/ipv6/proc.c
@@ -48,7 +48,8 @@ static int sockstat6_seq_show(struct seq_file *seq, void *v)
 			sock_prot_inuse_get(net, &udplitev6_prot));
 	seq_printf(seq, "RAW6: inuse %d\n",
 		       sock_prot_inuse_get(net, &rawv6_prot));
-	seq_printf(seq, "FRAG6: inuse %u memory %u\n", !!frag_mem, frag_mem);
+	seq_printf(seq, "FRAG6: inuse %u memory %u current mem_limit %d\n",
+		 !!frag_mem, frag_mem, frag_mem_limit(&net->ipv6.frags));
 	return 0;
 }
 
-- 
1.8.3.1

^ permalink raw reply related

* RE: Question about ip_defrag
From: liujian (CE) @ 2017-08-29 13:01 UTC (permalink / raw)
  To: liujian (CE), Florian Westphal
  Cc: Jesper Dangaard Brouer, davem@davemloft.net, kuznet@ms2.inr.ac.ru,
	yoshfuji@linux-ipv6.org, elena.reshetova@intel.com,
	edumazet@google.com, netdev@vger.kernel.org, Wangkefeng (Kevin),
	weiyongjun (A)
In-Reply-To: <20170828140032.GB12926@breakpoint.cc>




Best Regards,
liujian


> -----Original Message-----
> From: liujian (CE)
> Sent: Tuesday, August 29, 2017 3:39 PM
> To: 'Florian Westphal'
> Cc: Jesper Dangaard Brouer; davem@davemloft.net; kuznet@ms2.inr.ac.ru;
> yoshfuji@linux-ipv6.org; elena.reshetova@intel.com; edumazet@google.com;
> netdev@vger.kernel.org; Wangkefeng (Kevin); weiyongjun (A)
> Subject: RE: Question about ip_defrag
> 
> 
> 
> > -----Original Message-----
> > From: netdev-owner@vger.kernel.org
> > [mailto:netdev-owner@vger.kernel.org]
> > On Behalf Of Florian Westphal
> > Sent: Monday, August 28, 2017 10:01 PM
> > To: liujian (CE)
> > Cc: Jesper Dangaard Brouer; davem@davemloft.net; kuznet@ms2.inr.ac.ru;
> > yoshfuji@linux-ipv6.org; elena.reshetova@intel.com;
> > edumazet@google.com; netdev@vger.kernel.org; Wangkefeng (Kevin);
> > weiyongjun (A)
> > Subject: Re: Question about ip_defrag
> >
> > liujian (CE) <liujian56@huawei.com> wrote:
> > > Hi
> > >
> > > I checked our 3.10 kernel, we had backported all percpu_counter bug
> > > fix in
> > lib/percpu_counter.c and include/linux/percpu_counter.h.
> > > And I check 4.13-rc6, also has the issue if NIC's rx cpu num big enough.
> > >
> > > > > > > the issue:
> > > > > > > Ip_defrag fail caused by frag_mem_limit reached
> > 4M(frags.high_thresh).
> > > > > > > At this moment,sum_frag_mem_limit is about 10K.
> > >
> > > So should we change ipfrag high/low thresh to a reasonable value ?
> > > And if it is, is there a standard to change the value?
> >
> > Each cpu can have frag_percpu_counter_batch bytes rest doesn't know
> > about so with 64 cpus that is ~8 mbyte.
> >
> > possible solutions:
> > 1. reduce frag_percpu_counter_batch to 16k or so 2. make both low and
> > high thresh depend on NR_CPUS
> >
> Thank you for your reply.
> 
> > liujian, does this change help in any way?
> 
> I will have a try.

Now, I have not the real environment. 
I use iperf generate fragment packets; 
and I always change NIC rx irq's affinity cpu, to make sure frag_mem_limit reach to thresh.
my test machine, CPU num is 384.
As above , test the patch , seemingly , there is no improving...
Check /proc/net/snmp, there is no significant difference.
maybe we should find a good test method!

root@RH8100-V3:/proc/net# cat sockstat
sockets: used 1386
TCP: inuse 3 orphan 0 tw 0 alloc 4 mem 1
UDP: inuse 44 mem 42
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 1 memory 34336, 3144424.
 

> > diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
> > --- a/net/ipv4/inet_fragment.c
> > +++ b/net/ipv4/inet_fragment.c
> > @@ -123,6 +123,17 @@ static bool inet_fragq_should_evict(const struct
> > inet_frag_queue *q)
> >  	       frag_mem_limit(q->net) >= q->net->low_thresh;  }
> >
> > +/* ->mem batch size is huge, this can cause severe discrepancies
> > + * between actual value (sum of pcpu values) and the global estimate.
> > + *
> > + * Use a smaller batch to give an opportunity for the global estimate
> > + * to more accurately reflect current state.
> > + */
> > +static void update_frag_mem_limit(struct netns_frags *nf, unsigned
> > +int
> > +batch) {
> > +	 percpu_counter_add_batch(&nf->mem, 0, batch); }
> > +
> >  static unsigned int
> >  inet_evict_bucket(struct inet_frags *f, struct inet_frag_bucket *hb)
> > { @@
> > -146,8 +157,12 @@ inet_evict_bucket(struct inet_frags *f, struct
> > inet_frag_bucket *hb)
> >
> >  	spin_unlock(&hb->chain_lock);
> >
> > -	hlist_for_each_entry_safe(fq, n, &expired, list_evictor)
> > +	hlist_for_each_entry_safe(fq, n, &expired, list_evictor) {
> > +		struct netns_frags *nf = fq->net;
> > +
> >  		f->frag_expire((unsigned long) fq);
> > +		update_frag_mem_limit(nf, 1);
> 
> > +	}
> >
> >  	return evicted;
> >  }
> > @@ -396,8 +411,10 @@ struct inet_frag_queue *inet_frag_find(struct
> > netns_frags *nf,
> >  	struct inet_frag_queue *q;
> >  	int depth = 0;
> >
> > -	if (frag_mem_limit(nf) > nf->low_thresh)
> > +	if (frag_mem_limit(nf) > nf->low_thresh) {
> >  		inet_frag_schedule_worker(f);
> > +		update_frag_mem_limit(nf, SKB_TRUESIZE(1500) * 16);
> > +	}
> >
> >  	hash &= (INETFRAGS_HASHSZ - 1);
> >  	hb = &f->hash[hash];
> > @@ -416,6 +433,8 @@ struct inet_frag_queue *inet_frag_find(struct
> > netns_frags *nf,
> >  	if (depth <= INETFRAGS_MAXDEPTH)
> >  		return inet_frag_create(nf, f, key);
> >
> > +	update_frag_mem_limit(nf, 1);
> > +
> >  	if (inet_frag_may_rebuild(f)) {
> >  		if (!f->rebuild)
> >  			f->rebuild = true;

^ permalink raw reply

* Re: [PATCH net-next v2 00/10] net: dsa: add generic debugfs interface
From: Andrew Lunn @ 2017-08-29 12:50 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Vivien Didelot, netdev, linux-kernel, kernel, David S. Miller,
	Florian Fainelli, Egil Hjelmeland, John Crispin, Woojung Huh,
	Sean Wang, Nikita Yushchenko, Chris Healy, mlxsw
In-Reply-To: <20170829062523.GA1977@nanopsycho.orion>

On Tue, Aug 29, 2017 at 08:25:23AM +0200, Jiri Pirko wrote:
> Mon, Aug 28, 2017 at 10:08:34PM CEST, andrew@lunn.ch wrote:
> >> I see this overlaps a lot with DPIPE. Why won't you use that to expose
> >> your hw state?
> >
> >We took a look at dpipe and i talked to you about using it for this
> >sort of thing at netconf/netdev. But dpipe has issues displaying the
> >sort of information we have. I never figured out how to do two
> >dimensional tables. The output of the dpipe command is pretty
> >unreadable. A lot of the information being dumped here is not about
> >the data pipe, etc.
> 
> So improve it. No problem. Also, we extend it to support what you neede.

Will i did try to do this back in March. And i failed.

Lets start with stats. Vivien gives an example on the cover letter:

    # pr -mt switch0/port{5,6}/stats
    in_good_octets      : 0             in_good_octets      : 13824
    in_bad_octets       : 0             in_bad_octets       : 0
    in_unicast          : 0             in_unicast          : 0
    in_broadcasts       : 0             in_broadcasts       : 216
    in_multicasts       : 0             in_multicasts       : 0
    in_pause            : 0             in_pause            : 0
    in_undersize        : 0             in_undersize        : 0

This is what i tried to implement using dpipe. It is a simple two
dimensional table. First column is a string, second a u64. In debugfs
we have such a table per port. That fits with the hierarchy that each
port is a directory in debugfs. But it could also be implemented as
one table with N+1 columns, for N switch ports.

How about you, or one of your team, implement that. It should be able
to use the dsa_loop driver, which is a simple dummy switch. But it
does have statistics counters for all ports. Florian or I can help you
get it running if needed.

This branch contains some of the basic plumbing code from my previous
attempt:

https://github.com/lunn/linux.git v4.11-rc4-net-next-dpipe

	 Andrew

^ permalink raw reply

* Attention Dear Friend,
From: Mr.Robert Koffi @ 2017-08-29 12:39 UTC (permalink / raw)


Attention Dear Friend,

Compliment of the day to you, I sent you a mail a couple of day ago
about a transaction and haven't heard from you since then.The message
was to solicit your partnership to transfer  the sum of ($14 million
US dollars) deposited by my late client how have the same last name
with you from the same country.I shall send you more information and
procedures when I receive a positive response from you.

I look forward to hearing from you.

Yours sincerely,
Mr Robert Koffi

^ permalink raw reply

* Re: [PATCH net] ipv6: do not set sk_destruct in IPV6_ADDRFORM sockopt
From: Paolo Abeni @ 2017-08-29 12:33 UTC (permalink / raw)
  To: Xin Long, network dev; +Cc: davem, chunwang, syzkaller
In-Reply-To: <4cfe77b4a829c0d6134b842fe2ea7c41b6b210ff.1503888301.git.lucien.xin@gmail.com>

On Mon, 2017-08-28 at 10:45 +0800, Xin Long wrote:
> ChunYu found a kernel warn_on during syzkaller fuzzing:
> 
> [40226.038539] WARNING: CPU: 5 PID: 23720 at net/ipv4/af_inet.c:152 inet_sock_destruct+0x78d/0x9a0
> [40226.144849] Call Trace:
> [40226.147590]  <IRQ>
> [40226.149859]  dump_stack+0xe2/0x186
> [40226.176546]  __warn+0x1a4/0x1e0
> [40226.180066]  warn_slowpath_null+0x31/0x40
> [40226.184555]  inet_sock_destruct+0x78d/0x9a0
> [40226.246355]  __sk_destruct+0xfa/0x8c0
> [40226.290612]  rcu_process_callbacks+0xaa0/0x18a0
> [40226.336816]  __do_softirq+0x241/0x75e
> [40226.367758]  irq_exit+0x1f6/0x220
> [40226.371458]  smp_apic_timer_interrupt+0x7b/0xa0
> [40226.376507]  apic_timer_interrupt+0x93/0xa0
> 
> The warn_on happned when sk->sk_rmem_alloc wasn't 0 in inet_sock_destruct.
> As after commit f970bd9e3a06 ("udp: implement memory accounting helpers"),
> udp has changed to use udp_destruct_sock as sk_destruct where it would
> udp_rmem_release all rmem.
> 
> But IPV6_ADDRFORM sockopt sets sk_destruct with inet_sock_destruct after
> changing family to PF_INET. If rmem is not 0 at that time, and there is
> no place to release rmem before calling inet_sock_destruct, the warn_on
> will be triggered.
> 
> This patch is to fix it by not setting sk_destruct in IPV6_ADDRFORM sockopt
> any more. As IPV6_ADDRFORM sockopt only works for tcp and udp. TCP sock has
> already set it's sk_destruct with inet_sock_destruct and UDP has set with
> udp_destruct_sock since they're created.
> 
> Fixes: f970bd9e3a06 ("udp: implement memory accounting helpers")
> Reported-by: ChunYu Wang <chunwang@redhat.com>
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
> ---
>  net/ipv6/ipv6_sockglue.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
> index 02d795f..a5e466d 100644
> --- a/net/ipv6/ipv6_sockglue.c
> +++ b/net/ipv6/ipv6_sockglue.c
> @@ -242,7 +242,6 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
>  			pktopt = xchg(&np->pktoptions, NULL);
>  			kfree_skb(pktopt);
>  
> -			sk->sk_destruct = inet_sock_destruct;
>  			/*
>  			 * ... and add it to the refcnt debug socks count
>  			 * in the new family. -acme

That assignment looks like a relic from ancient past... the patch LGTM,

Thanks Xin to fix this.

Acked-by: Paolo Abeni <pabeni@redhat.com>

^ permalink raw reply

* [PATCH net-next 7/7] rxrpc: Allow failed client calls to be retried
From: David Howells @ 2017-08-29 12:27 UTC (permalink / raw)
  To: netdev; +Cc: dhowells, linux-afs, linux-kernel
In-Reply-To: <150400960745.21371.15455867950162722742.stgit@warthog.procyon.org.uk>

Allow a client call that failed on network error to be retried, provided
that the Tx queue still holds DATA packet 1.  This allows an operation to
be submitted to another server or another address for the same server
without having to repackage and re-encrypt the data so far processed.

Two new functions are provided:

 (1) rxrpc_kernel_check_call() - This is used to find out the completion
     state of a call to guess whether it can be retried and whether it
     should be retried.

 (2) rxrpc_kernel_retry_call() - Disconnect the call from its current
     connection, reset the state and submit it as a new client call to a
     new address.  The new address need not match the previous address.

A call may be retried even if all the data hasn't been loaded into it yet;
a partially constructed will be retained at the same point it was at when
an error condition was detected.  msg_data_left() can be used to find out
how much data was packaged before the error occurred.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 Documentation/networking/rxrpc.txt |   45 ++++++++++++++++
 include/net/af_rxrpc.h             |   16 ++++++
 net/rxrpc/af_rxrpc.c               |   69 ++++++++++++++++++++++++
 net/rxrpc/ar-internal.h            |   19 ++-----
 net/rxrpc/call_object.c            |  102 ++++++++++++++++++++++++++++++++++--
 net/rxrpc/conn_client.c            |   17 +++++-
 net/rxrpc/sendmsg.c                |   24 +++++---
 7 files changed, 261 insertions(+), 31 deletions(-)

diff --git a/Documentation/networking/rxrpc.txt b/Documentation/networking/rxrpc.txt
index 92a3c3bd5ac3..810620153a44 100644
--- a/Documentation/networking/rxrpc.txt
+++ b/Documentation/networking/rxrpc.txt
@@ -975,6 +975,51 @@ The kernel interface functions are as follows:
      size should be set when the call is begun.  tx_total_len may not be less
      than zero.
 
+ (*) Check to see the completion state of a call so that the caller can assess
+     whether it needs to be retried.
+
+	enum rxrpc_call_completion {
+		RXRPC_CALL_SUCCEEDED,
+		RXRPC_CALL_REMOTELY_ABORTED,
+		RXRPC_CALL_LOCALLY_ABORTED,
+		RXRPC_CALL_LOCAL_ERROR,
+		RXRPC_CALL_NETWORK_ERROR,
+	};
+
+	int rxrpc_kernel_check_call(struct socket *sock, struct rxrpc_call *call,
+				    enum rxrpc_call_completion *_compl,
+				    u32 *_abort_code);
+
+     On return, -EINPROGRESS will be returned if the call is still ongoing; if
+     it is finished, *_compl will be set to indicate the manner of completion,
+     *_abort_code will be set to any abort code that occurred.  0 will be
+     returned on a successful completion, -ECONNABORTED will be returned if the
+     client failed due to a remote abort and anything else will return an
+     appropriate error code.
+
+     The caller should look at this information to decide if it's worth
+     retrying the call.
+
+ (*) Retry a client call.
+
+	int rxrpc_kernel_retry_call(struct socket *sock,
+				    struct rxrpc_call *call,
+				    struct sockaddr_rxrpc *srx,
+				    struct key *key);
+
+     This attempts to partially reinitialise a call and submit it again whilst
+     reusing the original call's Tx queue to avoid the need to repackage and
+     re-encrypt the data to be sent.  call indicates the call to retry, srx the
+     new address to send it to and key the encryption key to use for signing or
+     encrypting the packets.
+
+     For this to work, the first Tx data packet must still be in the transmit
+     queue, and currently this is only permitted for local and network errors
+     and the call must not have been aborted.  Any partially constructed Tx
+     packet is left as is and can continue being filled afterwards.
+
+     It returns 0 if the call was requeued and an error otherwise.
+
 
 =======================
 CONFIGURABLE PARAMETERS
diff --git a/include/net/af_rxrpc.h b/include/net/af_rxrpc.h
index 07a47ee6f783..3ac79150291f 100644
--- a/include/net/af_rxrpc.h
+++ b/include/net/af_rxrpc.h
@@ -19,6 +19,18 @@ struct sock;
 struct socket;
 struct rxrpc_call;
 
+/*
+ * Call completion condition (state == RXRPC_CALL_COMPLETE).
+ */
+enum rxrpc_call_completion {
+	RXRPC_CALL_SUCCEEDED,		/* - Normal termination */
+	RXRPC_CALL_REMOTELY_ABORTED,	/* - call aborted by peer */
+	RXRPC_CALL_LOCALLY_ABORTED,	/* - call aborted locally on error or close */
+	RXRPC_CALL_LOCAL_ERROR,		/* - call failed due to local error */
+	RXRPC_CALL_NETWORK_ERROR,	/* - call terminated by network error */
+	NR__RXRPC_CALL_COMPLETIONS
+};
+
 typedef void (*rxrpc_notify_rx_t)(struct sock *, struct rxrpc_call *,
 				  unsigned long);
 typedef void (*rxrpc_notify_end_tx_t)(struct sock *, struct rxrpc_call *,
@@ -51,5 +63,9 @@ void rxrpc_kernel_get_peer(struct socket *, struct rxrpc_call *,
 int rxrpc_kernel_charge_accept(struct socket *, rxrpc_notify_rx_t,
 			       rxrpc_user_attach_call_t, unsigned long, gfp_t);
 void rxrpc_kernel_set_tx_length(struct socket *, struct rxrpc_call *, s64);
+int rxrpc_kernel_retry_call(struct socket *, struct rxrpc_call *,
+			    struct sockaddr_rxrpc *, struct key *);
+int rxrpc_kernel_check_call(struct socket *, struct rxrpc_call *,
+			    enum rxrpc_call_completion *, u32 *);
 
 #endif /* _NET_RXRPC_H */
diff --git a/net/rxrpc/af_rxrpc.c b/net/rxrpc/af_rxrpc.c
index 31e97f714ca9..fb17552fd292 100644
--- a/net/rxrpc/af_rxrpc.c
+++ b/net/rxrpc/af_rxrpc.c
@@ -337,6 +337,75 @@ void rxrpc_kernel_end_call(struct socket *sock, struct rxrpc_call *call)
 EXPORT_SYMBOL(rxrpc_kernel_end_call);
 
 /**
+ * rxrpc_kernel_check_call - Check a call's state
+ * @sock: The socket the call is on
+ * @call: The call to check
+ * @_compl: Where to store the completion state
+ * @_abort_code: Where to store any abort code
+ *
+ * Allow a kernel service to query the state of a call and find out the manner
+ * of its termination if it has completed.  Returns -EINPROGRESS if the call is
+ * still going, 0 if the call finished successfully, -ECONNABORTED if the call
+ * was aborted and an appropriate error if the call failed in some other way.
+ */
+int rxrpc_kernel_check_call(struct socket *sock, struct rxrpc_call *call,
+			    enum rxrpc_call_completion *_compl, u32 *_abort_code)
+{
+	if (call->state != RXRPC_CALL_COMPLETE)
+		return -EINPROGRESS;
+	smp_rmb();
+	*_compl = call->completion;
+	*_abort_code = call->abort_code;
+	return call->error;
+}
+EXPORT_SYMBOL(rxrpc_kernel_check_call);
+
+/**
+ * rxrpc_kernel_retry_call - Allow a kernel service to retry a call
+ * @sock: The socket the call is on
+ * @call: The call to retry
+ * @srx: The address of the peer to contact
+ * @key: The security context to use (defaults to socket setting)
+ *
+ * Allow a kernel service to try resending a client call that failed due to a
+ * network error to a new address.  The Tx queue is maintained intact, thereby
+ * relieving the need to re-encrypt any request data that has already been
+ * buffered.
+ */
+int rxrpc_kernel_retry_call(struct socket *sock, struct rxrpc_call *call,
+			    struct sockaddr_rxrpc *srx, struct key *key)
+{
+	struct rxrpc_conn_parameters cp;
+	struct rxrpc_sock *rx = rxrpc_sk(sock->sk);
+	int ret;
+
+	_enter("%d{%d}", call->debug_id, atomic_read(&call->usage));
+
+	if (!key)
+		key = rx->key;
+	if (key && !key->payload.data[0])
+		key = NULL; /* a no-security key */
+
+	memset(&cp, 0, sizeof(cp));
+	cp.local		= rx->local;
+	cp.key			= key;
+	cp.security_level	= 0;
+	cp.exclusive		= false;
+	cp.service_id		= srx->srx_service;
+
+	mutex_lock(&call->user_mutex);
+
+	ret = rxrpc_prepare_call_for_retry(rx, call);
+	if (ret == 0)
+		ret = rxrpc_retry_client_call(rx, call, &cp, srx, GFP_KERNEL);
+
+	mutex_unlock(&call->user_mutex);
+	_leave(" = %d", ret);
+	return ret;
+}
+EXPORT_SYMBOL(rxrpc_kernel_retry_call);
+
+/**
  * rxrpc_kernel_new_call_notification - Get notifications of new calls
  * @sock: The socket to intercept received messages on
  * @notify_new_call: Function to be called when new calls appear
diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index 227e24794055..ea5600b747cc 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -445,6 +445,7 @@ enum rxrpc_call_flag {
 	RXRPC_CALL_EXPOSED,		/* The call was exposed to the world */
 	RXRPC_CALL_RX_LAST,		/* Received the last packet (at rxtx_top) */
 	RXRPC_CALL_TX_LAST,		/* Last packet in Tx buffer (at rxtx_top) */
+	RXRPC_CALL_TX_LASTQ,		/* Last packet has been queued */
 	RXRPC_CALL_SEND_PING,		/* A ping will need to be sent */
 	RXRPC_CALL_PINGING,		/* Ping in process */
 	RXRPC_CALL_RETRANS_TIMEOUT,	/* Retransmission due to timeout occurred */
@@ -482,18 +483,6 @@ enum rxrpc_call_state {
 };
 
 /*
- * Call completion condition (state == RXRPC_CALL_COMPLETE).
- */
-enum rxrpc_call_completion {
-	RXRPC_CALL_SUCCEEDED,		/* - Normal termination */
-	RXRPC_CALL_REMOTELY_ABORTED,	/* - call aborted by peer */
-	RXRPC_CALL_LOCALLY_ABORTED,	/* - call aborted locally on error or close */
-	RXRPC_CALL_LOCAL_ERROR,		/* - call failed due to local error */
-	RXRPC_CALL_NETWORK_ERROR,	/* - call terminated by network error */
-	NR__RXRPC_CALL_COMPLETIONS
-};
-
-/*
  * Call Tx congestion management modes.
  */
 enum rxrpc_congest_mode {
@@ -687,9 +676,15 @@ struct rxrpc_call *rxrpc_new_client_call(struct rxrpc_sock *,
 					 struct rxrpc_conn_parameters *,
 					 struct sockaddr_rxrpc *,
 					 unsigned long, s64, gfp_t);
+int rxrpc_retry_client_call(struct rxrpc_sock *,
+			    struct rxrpc_call *,
+			    struct rxrpc_conn_parameters *,
+			    struct sockaddr_rxrpc *,
+			    gfp_t);
 void rxrpc_incoming_call(struct rxrpc_sock *, struct rxrpc_call *,
 			 struct sk_buff *);
 void rxrpc_release_call(struct rxrpc_sock *, struct rxrpc_call *);
+int rxrpc_prepare_call_for_retry(struct rxrpc_sock *, struct rxrpc_call *);
 void rxrpc_release_calls_on_socket(struct rxrpc_sock *);
 bool __rxrpc_queue_call(struct rxrpc_call *);
 bool rxrpc_queue_call(struct rxrpc_call *);
diff --git a/net/rxrpc/call_object.c b/net/rxrpc/call_object.c
index d7809a0620b4..fcdd6555a820 100644
--- a/net/rxrpc/call_object.c
+++ b/net/rxrpc/call_object.c
@@ -269,11 +269,6 @@ struct rxrpc_call *rxrpc_new_client_call(struct rxrpc_sock *rx,
 	trace_rxrpc_call(call, rxrpc_call_connected, atomic_read(&call->usage),
 			 here, NULL);
 
-	spin_lock_bh(&call->conn->params.peer->lock);
-	hlist_add_head(&call->error_link,
-		       &call->conn->params.peer->error_targets);
-	spin_unlock_bh(&call->conn->params.peer->lock);
-
 	rxrpc_start_call_timer(call);
 
 	_net("CALL new %d on CONN %d", call->debug_id, call->conn->debug_id);
@@ -304,6 +299,48 @@ struct rxrpc_call *rxrpc_new_client_call(struct rxrpc_sock *rx,
 }
 
 /*
+ * Retry a call to a new address.  It is expected that the Tx queue of the call
+ * will contain data previously packaged for an old call.
+ */
+int rxrpc_retry_client_call(struct rxrpc_sock *rx,
+			    struct rxrpc_call *call,
+			    struct rxrpc_conn_parameters *cp,
+			    struct sockaddr_rxrpc *srx,
+			    gfp_t gfp)
+{
+	const void *here = __builtin_return_address(0);
+	int ret;
+
+	/* Set up or get a connection record and set the protocol parameters,
+	 * including channel number and call ID.
+	 */
+	ret = rxrpc_connect_call(call, cp, srx, gfp);
+	if (ret < 0)
+		goto error;
+
+	trace_rxrpc_call(call, rxrpc_call_connected, atomic_read(&call->usage),
+			 here, NULL);
+
+	rxrpc_start_call_timer(call);
+
+	_net("CALL new %d on CONN %d", call->debug_id, call->conn->debug_id);
+
+	if (!test_and_set_bit(RXRPC_CALL_EV_RESEND, &call->events))
+		rxrpc_queue_call(call);
+
+	_leave(" = 0");
+	return 0;
+
+error:
+	rxrpc_set_call_completion(call, RXRPC_CALL_LOCAL_ERROR,
+				  RX_CALL_DEAD, ret);
+	trace_rxrpc_call(call, rxrpc_call_error, atomic_read(&call->usage),
+			 here, ERR_PTR(ret));
+	_leave(" = %d", ret);
+	return ret;
+}
+
+/*
  * Set up an incoming call.  call->conn points to the connection.
  * This is called in BH context and isn't allowed to fail.
  */
@@ -471,6 +508,61 @@ void rxrpc_release_call(struct rxrpc_sock *rx, struct rxrpc_call *call)
 }
 
 /*
+ * Prepare a kernel service call for retry.
+ */
+int rxrpc_prepare_call_for_retry(struct rxrpc_sock *rx, struct rxrpc_call *call)
+{
+	const void *here = __builtin_return_address(0);
+	int i;
+	u8 last = 0;
+
+	_enter("{%d,%d}", call->debug_id, atomic_read(&call->usage));
+
+	trace_rxrpc_call(call, rxrpc_call_release, atomic_read(&call->usage),
+			 here, (const void *)call->flags);
+
+	ASSERTCMP(call->state, ==, RXRPC_CALL_COMPLETE);
+	ASSERTCMP(call->completion, !=, RXRPC_CALL_REMOTELY_ABORTED);
+	ASSERTCMP(call->completion, !=, RXRPC_CALL_LOCALLY_ABORTED);
+	ASSERT(list_empty(&call->recvmsg_link));
+
+	del_timer_sync(&call->timer);
+
+	_debug("RELEASE CALL %p (%d CONN %p)", call, call->debug_id, call->conn);
+
+	if (call->conn)
+		rxrpc_disconnect_call(call);
+
+	if (rxrpc_is_service_call(call) ||
+	    !call->tx_phase ||
+	    call->tx_hard_ack != 0 ||
+	    call->rx_hard_ack != 0 ||
+	    call->rx_top != 0)
+		return -EINVAL;
+
+	call->state = RXRPC_CALL_UNINITIALISED;
+	call->completion = RXRPC_CALL_SUCCEEDED;
+	call->call_id = 0;
+	call->cid = 0;
+	call->cong_cwnd = 0;
+	call->cong_extra = 0;
+	call->cong_ssthresh = 0;
+	call->cong_mode = 0;
+	call->cong_dup_acks = 0;
+	call->cong_cumul_acks = 0;
+	call->acks_lowest_nak = 0;
+
+	for (i = 0; i < RXRPC_RXTX_BUFF_SIZE; i++) {
+		last |= call->rxtx_annotations[i];
+		call->rxtx_annotations[i] &= RXRPC_TX_ANNO_LAST;
+		call->rxtx_annotations[i] |= RXRPC_TX_ANNO_RETRANS;
+	}
+
+	_leave(" = 0");
+	return 0;
+}
+
+/*
  * release all the calls associated with a socket
  */
 void rxrpc_release_calls_on_socket(struct rxrpc_sock *rx)
diff --git a/net/rxrpc/conn_client.c b/net/rxrpc/conn_client.c
index eb2157680399..5f9624bd311c 100644
--- a/net/rxrpc/conn_client.c
+++ b/net/rxrpc/conn_client.c
@@ -555,7 +555,10 @@ static void rxrpc_activate_one_channel(struct rxrpc_connection *conn,
 	trace_rxrpc_client(conn, channel, rxrpc_client_chan_activate);
 
 	write_lock_bh(&call->state_lock);
-	call->state = RXRPC_CALL_CLIENT_SEND_REQUEST;
+	if (!test_bit(RXRPC_CALL_TX_LASTQ, &call->flags))
+		call->state = RXRPC_CALL_CLIENT_SEND_REQUEST;
+	else
+		call->state = RXRPC_CALL_CLIENT_AWAIT_REPLY;
 	write_unlock_bh(&call->state_lock);
 
 	rxrpc_see_call(call);
@@ -688,15 +691,23 @@ int rxrpc_connect_call(struct rxrpc_call *call,
 
 	ret = rxrpc_get_client_conn(call, cp, srx, gfp);
 	if (ret < 0)
-		return ret;
+		goto out;
 
 	rxrpc_animate_client_conn(rxnet, call->conn);
 	rxrpc_activate_channels(call->conn);
 
 	ret = rxrpc_wait_for_channel(call, gfp);
-	if (ret < 0)
+	if (ret < 0) {
 		rxrpc_disconnect_client_call(call);
+		goto out;
+	}
+
+	spin_lock_bh(&call->conn->params.peer->lock);
+	hlist_add_head(&call->error_link,
+		       &call->conn->params.peer->error_targets);
+	spin_unlock_bh(&call->conn->params.peer->lock);
 
+out:
 	_leave(" = %d", ret);
 	return ret;
 }
diff --git a/net/rxrpc/sendmsg.c b/net/rxrpc/sendmsg.c
index 344fdce89823..9ea6f972767e 100644
--- a/net/rxrpc/sendmsg.c
+++ b/net/rxrpc/sendmsg.c
@@ -128,8 +128,10 @@ static void rxrpc_queue_packet(struct rxrpc_sock *rx, struct rxrpc_call *call,
 
 	ASSERTCMP(seq, ==, call->tx_top + 1);
 
-	if (last)
+	if (last) {
 		annotation |= RXRPC_TX_ANNO_LAST;
+		set_bit(RXRPC_CALL_TX_LASTQ, &call->flags);
+	}
 
 	/* We have to set the timestamp before queueing as the retransmit
 	 * algorithm can see the packet as soon as we queue it.
@@ -326,11 +328,6 @@ static int rxrpc_send_data(struct rxrpc_sock *rx,
 				call->tx_total_len -= copy;
 		}
 
-		/* check for the far side aborting the call or a network error
-		 * occurring */
-		if (call->state == RXRPC_CALL_COMPLETE)
-			goto call_terminated;
-
 		/* add the packet to the send queue if it's now full */
 		if (sp->remain <= 0 ||
 		    (msg_data_left(msg) == 0 && !more)) {
@@ -370,6 +367,16 @@ static int rxrpc_send_data(struct rxrpc_sock *rx,
 					   notify_end_tx);
 			skb = NULL;
 		}
+
+		/* Check for the far side aborting the call or a network error
+		 * occurring.  If this happens, save any packet that was under
+		 * construction so that in the case of a network error, the
+		 * call can be retried or redirected.
+		 */
+		if (call->state == RXRPC_CALL_COMPLETE) {
+			ret = call->error;
+			goto out;
+		}
 	} while (msg_data_left(msg) > 0);
 
 success:
@@ -379,11 +386,6 @@ static int rxrpc_send_data(struct rxrpc_sock *rx,
 	_leave(" = %d", ret);
 	return ret;
 
-call_terminated:
-	rxrpc_free_skb(skb, rxrpc_skb_tx_freed);
-	_leave(" = %d", call->error);
-	return call->error;
-
 maybe_error:
 	if (copied)
 		goto success;

^ permalink raw reply related

* [PATCH net-next 6/7] rxrpc: Add notification of end-of-Tx phase
From: David Howells @ 2017-08-29 12:27 UTC (permalink / raw)
  To: netdev; +Cc: dhowells, linux-afs, linux-kernel
In-Reply-To: <150400960745.21371.15455867950162722742.stgit@warthog.procyon.org.uk>

Add a callback to rxrpc_kernel_send_data() so that a kernel service can get
a notification that the AF_RXRPC call has transitioned out the Tx phase and
is now waiting for a reply or a final ACK.

This is called from AF_RXRPC with the call state lock held so the
notification is guaranteed to come before any reply is passed back.

Further, modify the AFS filesystem to make use of this so that we don't have
to change the afs_call state before sending the last bit of data.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 Documentation/networking/rxrpc.txt |   12 +++++++++
 fs/afs/rxrpc.c                     |   46 +++++++++++++++++++++++++++---------
 include/net/af_rxrpc.h             |    5 +++-
 net/rxrpc/sendmsg.c                |   34 +++++++++++++++++++++------
 4 files changed, 77 insertions(+), 20 deletions(-)

diff --git a/Documentation/networking/rxrpc.txt b/Documentation/networking/rxrpc.txt
index 8c70ba5dee4d..92a3c3bd5ac3 100644
--- a/Documentation/networking/rxrpc.txt
+++ b/Documentation/networking/rxrpc.txt
@@ -818,10 +818,15 @@ The kernel interface functions are as follows:
 
  (*) Send data through a call.
 
+	typedef void (*rxrpc_notify_end_tx_t)(struct sock *sk,
+					      unsigned long user_call_ID,
+					      struct sk_buff *skb);
+
 	int rxrpc_kernel_send_data(struct socket *sock,
 				   struct rxrpc_call *call,
 				   struct msghdr *msg,
-				   size_t len);
+				   size_t len,
+				   rxrpc_notify_end_tx_t notify_end_rx);
 
      This is used to supply either the request part of a client call or the
      reply part of a server call.  msg.msg_iovlen and msg.msg_iov specify the
@@ -832,6 +837,11 @@ The kernel interface functions are as follows:
      The msg must not specify a destination address, control data or any flags
      other than MSG_MORE.  len is the total amount of data to transmit.
 
+     notify_end_rx can be NULL or it can be used to specify a function to be
+     called when the call changes state to end the Tx phase.  This function is
+     called with the call-state spinlock held to prevent any reply or final ACK
+     from being delivered first.
+
  (*) Receive data from a call.
 
 	int rxrpc_kernel_recv_data(struct socket *sock,
diff --git a/fs/afs/rxrpc.c b/fs/afs/rxrpc.c
index 10743043d431..0bf191f0dbaf 100644
--- a/fs/afs/rxrpc.c
+++ b/fs/afs/rxrpc.c
@@ -292,6 +292,19 @@ static void afs_load_bvec(struct afs_call *call, struct msghdr *msg,
 }
 
 /*
+ * Advance the AFS call state when the RxRPC call ends the transmit phase.
+ */
+static void afs_notify_end_request_tx(struct sock *sock,
+				      struct rxrpc_call *rxcall,
+				      unsigned long call_user_ID)
+{
+	struct afs_call *call = (struct afs_call *)call_user_ID;
+
+	if (call->state == AFS_CALL_REQUESTING)
+		call->state = AFS_CALL_AWAIT_REPLY;
+}
+
+/*
  * attach the data from a bunch of pages on an inode to a call
  */
 static int afs_send_pages(struct afs_call *call, struct msghdr *msg)
@@ -310,14 +323,8 @@ static int afs_send_pages(struct afs_call *call, struct msghdr *msg)
 		bytes = msg->msg_iter.count;
 		nr = msg->msg_iter.nr_segs;
 
-		/* Have to change the state *before* sending the last
-		 * packet as RxRPC might give us the reply before it
-		 * returns from sending the request.
-		 */
-		if (first + nr - 1 >= last)
-			call->state = AFS_CALL_AWAIT_REPLY;
-		ret = rxrpc_kernel_send_data(afs_socket, call->rxcall,
-					     msg, bytes);
+		ret = rxrpc_kernel_send_data(afs_socket, call->rxcall, msg,
+					     bytes, afs_notify_end_request_tx);
 		for (loop = 0; loop < nr; loop++)
 			put_page(bv[loop].bv_page);
 		if (ret < 0)
@@ -409,7 +416,8 @@ int afs_make_call(struct in_addr *addr, struct afs_call *call, gfp_t gfp,
 	if (!call->send_pages)
 		call->state = AFS_CALL_AWAIT_REPLY;
 	ret = rxrpc_kernel_send_data(afs_socket, rxcall,
-				     &msg, call->request_size);
+				     &msg, call->request_size,
+				     afs_notify_end_request_tx);
 	if (ret < 0)
 		goto error_do_abort;
 
@@ -741,6 +749,20 @@ static int afs_deliver_cm_op_id(struct afs_call *call)
 }
 
 /*
+ * Advance the AFS call state when an RxRPC service call ends the transmit
+ * phase.
+ */
+static void afs_notify_end_reply_tx(struct sock *sock,
+				    struct rxrpc_call *rxcall,
+				    unsigned long call_user_ID)
+{
+	struct afs_call *call = (struct afs_call *)call_user_ID;
+
+	if (call->state == AFS_CALL_REPLYING)
+		call->state = AFS_CALL_AWAIT_ACK;
+}
+
+/*
  * send an empty reply
  */
 void afs_send_empty_reply(struct afs_call *call)
@@ -759,7 +781,8 @@ void afs_send_empty_reply(struct afs_call *call)
 	msg.msg_flags		= 0;
 
 	call->state = AFS_CALL_AWAIT_ACK;
-	switch (rxrpc_kernel_send_data(afs_socket, call->rxcall, &msg, 0)) {
+	switch (rxrpc_kernel_send_data(afs_socket, call->rxcall, &msg, 0,
+				       afs_notify_end_reply_tx)) {
 	case 0:
 		_leave(" [replied]");
 		return;
@@ -797,7 +820,8 @@ void afs_send_simple_reply(struct afs_call *call, const void *buf, size_t len)
 	msg.msg_flags		= 0;
 
 	call->state = AFS_CALL_AWAIT_ACK;
-	n = rxrpc_kernel_send_data(afs_socket, call->rxcall, &msg, len);
+	n = rxrpc_kernel_send_data(afs_socket, call->rxcall, &msg, len,
+				   afs_notify_end_reply_tx);
 	if (n >= 0) {
 		/* Success */
 		_leave(" [replied]");
diff --git a/include/net/af_rxrpc.h b/include/net/af_rxrpc.h
index c172709787af..07a47ee6f783 100644
--- a/include/net/af_rxrpc.h
+++ b/include/net/af_rxrpc.h
@@ -21,6 +21,8 @@ struct rxrpc_call;
 
 typedef void (*rxrpc_notify_rx_t)(struct sock *, struct rxrpc_call *,
 				  unsigned long);
+typedef void (*rxrpc_notify_end_tx_t)(struct sock *, struct rxrpc_call *,
+				      unsigned long);
 typedef void (*rxrpc_notify_new_call_t)(struct sock *, struct rxrpc_call *,
 					unsigned long);
 typedef void (*rxrpc_discard_new_call_t)(struct rxrpc_call *, unsigned long);
@@ -37,7 +39,8 @@ struct rxrpc_call *rxrpc_kernel_begin_call(struct socket *,
 					   gfp_t,
 					   rxrpc_notify_rx_t);
 int rxrpc_kernel_send_data(struct socket *, struct rxrpc_call *,
-			   struct msghdr *, size_t);
+			   struct msghdr *, size_t,
+			   rxrpc_notify_end_tx_t);
 int rxrpc_kernel_recv_data(struct socket *, struct rxrpc_call *,
 			   void *, size_t, size_t *, bool, u32 *);
 bool rxrpc_kernel_abort_call(struct socket *, struct rxrpc_call *,
diff --git a/net/rxrpc/sendmsg.c b/net/rxrpc/sendmsg.c
index bc7f0241d92b..344fdce89823 100644
--- a/net/rxrpc/sendmsg.c
+++ b/net/rxrpc/sendmsg.c
@@ -101,11 +101,23 @@ static inline void rxrpc_instant_resend(struct rxrpc_call *call, int ix)
 }
 
 /*
+ * Notify the owner of the call that the transmit phase is ended and the last
+ * packet has been queued.
+ */
+static void rxrpc_notify_end_tx(struct rxrpc_sock *rx, struct rxrpc_call *call,
+				rxrpc_notify_end_tx_t notify_end_tx)
+{
+	if (notify_end_tx)
+		notify_end_tx(&rx->sk, call, call->user_call_ID);
+}
+
+/*
  * Queue a DATA packet for transmission, set the resend timeout and send the
  * packet immediately
  */
-static void rxrpc_queue_packet(struct rxrpc_call *call, struct sk_buff *skb,
-			       bool last)
+static void rxrpc_queue_packet(struct rxrpc_sock *rx, struct rxrpc_call *call,
+			       struct sk_buff *skb, bool last,
+			       rxrpc_notify_end_tx_t notify_end_tx)
 {
 	struct rxrpc_skb_priv *sp = rxrpc_skb(skb);
 	rxrpc_seq_t seq = sp->hdr.seq;
@@ -141,6 +153,7 @@ static void rxrpc_queue_packet(struct rxrpc_call *call, struct sk_buff *skb,
 		switch (call->state) {
 		case RXRPC_CALL_CLIENT_SEND_REQUEST:
 			call->state = RXRPC_CALL_CLIENT_AWAIT_REPLY;
+			rxrpc_notify_end_tx(rx, call, notify_end_tx);
 			break;
 		case RXRPC_CALL_SERVER_ACK_REQUEST:
 			call->state = RXRPC_CALL_SERVER_SEND_REPLY;
@@ -153,6 +166,7 @@ static void rxrpc_queue_packet(struct rxrpc_call *call, struct sk_buff *skb,
 				break;
 		case RXRPC_CALL_SERVER_SEND_REPLY:
 			call->state = RXRPC_CALL_SERVER_AWAIT_ACK;
+			rxrpc_notify_end_tx(rx, call, notify_end_tx);
 			break;
 		default:
 			break;
@@ -189,7 +203,8 @@ static void rxrpc_queue_packet(struct rxrpc_call *call, struct sk_buff *skb,
  */
 static int rxrpc_send_data(struct rxrpc_sock *rx,
 			   struct rxrpc_call *call,
-			   struct msghdr *msg, size_t len)
+			   struct msghdr *msg, size_t len,
+			   rxrpc_notify_end_tx_t notify_end_tx)
 {
 	struct rxrpc_skb_priv *sp;
 	struct sk_buff *skb;
@@ -350,7 +365,9 @@ static int rxrpc_send_data(struct rxrpc_sock *rx,
 			if (ret < 0)
 				goto out;
 
-			rxrpc_queue_packet(call, skb, !msg_data_left(msg) && !more);
+			rxrpc_queue_packet(rx, call, skb,
+					   !msg_data_left(msg) && !more,
+					   notify_end_tx);
 			skb = NULL;
 		}
 	} while (msg_data_left(msg) > 0);
@@ -611,7 +628,7 @@ int rxrpc_do_sendmsg(struct rxrpc_sock *rx, struct msghdr *msg, size_t len)
 		/* Reply phase not begun or not complete for service call. */
 		ret = -EPROTO;
 	} else {
-		ret = rxrpc_send_data(rx, call, msg, len);
+		ret = rxrpc_send_data(rx, call, msg, len, NULL);
 	}
 
 	mutex_unlock(&call->user_mutex);
@@ -631,6 +648,7 @@ int rxrpc_do_sendmsg(struct rxrpc_sock *rx, struct msghdr *msg, size_t len)
  * @call: The call to send data through
  * @msg: The data to send
  * @len: The amount of data to send
+ * @notify_end_tx: Notification that the last packet is queued.
  *
  * Allow a kernel service to send data on a call.  The call must be in an state
  * appropriate to sending data.  No control data should be supplied in @msg,
@@ -638,7 +656,8 @@ int rxrpc_do_sendmsg(struct rxrpc_sock *rx, struct msghdr *msg, size_t len)
  * more data to come, otherwise this data will end the transmission phase.
  */
 int rxrpc_kernel_send_data(struct socket *sock, struct rxrpc_call *call,
-			   struct msghdr *msg, size_t len)
+			   struct msghdr *msg, size_t len,
+			   rxrpc_notify_end_tx_t notify_end_tx)
 {
 	int ret;
 
@@ -656,7 +675,8 @@ int rxrpc_kernel_send_data(struct socket *sock, struct rxrpc_call *call,
 	case RXRPC_CALL_CLIENT_SEND_REQUEST:
 	case RXRPC_CALL_SERVER_ACK_REQUEST:
 	case RXRPC_CALL_SERVER_SEND_REPLY:
-		ret = rxrpc_send_data(rxrpc_sk(sock->sk), call, msg, len);
+		ret = rxrpc_send_data(rxrpc_sk(sock->sk), call, msg, len,
+				      notify_end_tx);
 		break;
 	case RXRPC_CALL_COMPLETE:
 		read_lock_bh(&call->state_lock);

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox