Netdev List
 help / color / mirror / Atom feed
* [patch net-next v3 00/10] net: sched: allow qdiscs to share filter block instances
From: Jiri Pirko @ 2017-12-13 15:10 UTC (permalink / raw)
  To: netdev
  Cc: davem, jhs, xiyou.wangcong, mlxsw, andrew, vivien.didelot,
	f.fainelli, michael.chan, ganeshgr, saeedm, matanb, leonro,
	idosch, jakub.kicinski, simon.horman, pieter.jansenvanvuuren,
	john.hurley, alexander.h.duyck, ogerlitz, john.fastabend, daniel

From: Jiri Pirko <jiri@mellanox.com>

Currently the filters added to qdiscs are independent. So for example if you
have 2 netdevices and you create ingress qdisc on both and you want to add
identical filter rules both, you need to add them twice. This patchset
makes this easier and mainly saves resources allowing to share all filters
within a qdisc - I call it a "filter block". Also this helps to save
resources when we do offload to hw for example to expensive TCAM.

So back to the example. First, we create 2 qdiscs. Both will share
block number 22. "22" is just an identification. If we don't pass any
block number, a new one will be generated by kernel:

$ tc qdisc add dev ens7 ingress block 22
                                ^^^^^^^^
$ tc qdisc add dev ens8 ingress block 22
                                ^^^^^^^^

Now if we list the qdiscs, we will see the block index in the output:

$ tc qdisc
qdisc ingress ffff: dev ens7 parent ffff:fff1 block 22
qdisc ingress ffff: dev ens8 parent ffff:fff1 block 22

To make is more visual, the situation looks like this:

   ens7 ingress qdisc                 ens7 ingress qdisc
          |                                  |
          |                                  |
          +---------->  block 22  <----------+

Unlimited number of qdiscs may share the same block.

Now we can add filter to any of qdiscs sharing the same block:

$ tc filter add dev ens7 ingress protocol ip pref 25 flower dst_ip 192.168.0.0/16 action drop


We will see the same output if we list filters for ens7 and ens8, including stats:

$ tc -s filter show dev ens7 ingress
filter protocol ip pref 25 flower chain 0
filter protocol ip pref 25 flower chain 0 handle 0x1
  eth_type ipv4
  dst_ip 192.168.0.0/16
  not_in_hw
        action order 1: gact action drop
         random type none pass val 0
         index 1 ref 1 bind 1 installed 39 sec used 2 sec
        Action statistics:
        Sent 3108 bytes 37 pkt (dropped 37, overlimits 0 requeues 0)
        backlog 0b 0p requeues 0

$ tc -s filter show dev ens8 ingress
filter protocol ip pref 25 flower chain 0
filter protocol ip pref 25 flower chain 0 handle 0x1
  eth_type ipv4
  dst_ip 192.168.0.0/16
  not_in_hw
        action order 1: gact action drop
         random type none pass val 0
         index 1 ref 1 bind 1 installed 40 sec used 3 sec
        Action statistics:
        Sent 3108 bytes 37 pkt (dropped 37, overlimits 0 requeues 0)
        backlog 0b 0p requeues 0

---
v2->v3:
- removed original patch 1, removing tp->q cls_bpf dependency. Fixed by
  Jakub in the meantime.
- patch 1:
 - rebased on top of the current net-next
- patch 5:
 - new patch
- patch 6:
 - removed "p_" prefix from block index function args
- patch 9:
 - add tc offload feature handling

Jiri Pirko (10):
  net: sched: introduce support for multiple filter chain pointers
    registration
  net: sched: avoid usage of tp->q in tcf_classify
  net: sched: introduce block mechanism to handle netif_keep_dst calls
  net: sched: remove classid and q fields from tcf_proto
  net: sched: keep track of offloaded filters and check tc offload
    feature
  net: sched: allow ingress and clsact qdiscs to share filter blocks
  mlxsw: spectrum_acl: Reshuffle code around
    mlxsw_sp_acl_ruleset_create/destroy
  mlxsw: spectrum_acl: Don't store netdev and ingress for ruleset unbind
  mlxsw: spectrum_acl: Implement TC block sharing
  mlxsw: spectrum_acl: Pass mlxsw_sp_port down to ruleset bind/unbind
    ops

 drivers/net/ethernet/mellanox/mlxsw/spectrum.c     | 182 +++++++++--
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h     |  44 ++-
 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c | 302 +++++++++++++----
 .../ethernet/mellanox/mlxsw/spectrum_acl_tcam.c    |  44 ++-
 .../net/ethernet/mellanox/mlxsw/spectrum_flower.c  |  41 +--
 include/net/pkt_cls.h                              |   4 +
 include/net/sch_generic.h                          |  26 +-
 include/uapi/linux/pkt_sched.h                     |  11 +
 net/sched/cls_api.c                                | 359 ++++++++++++++++++---
 net/sched/cls_bpf.c                                |   8 +-
 net/sched/cls_flow.c                               |   2 +-
 net/sched/cls_flower.c                             |   3 +-
 net/sched/cls_matchall.c                           |   3 +-
 net/sched/cls_route.c                              |   2 +-
 net/sched/cls_u32.c                                |  13 +-
 net/sched/sch_ingress.c                            |  89 ++++-
 16 files changed, 913 insertions(+), 220 deletions(-)

-- 
2.9.5

^ permalink raw reply

* Re: [PATCH net] tcp: fix potential underestimation on rcv_rtt
From: Neal Cardwell @ 2017-12-13 15:04 UTC (permalink / raw)
  To: Wei Wang; +Cc: David Miller, Netdev, Eric Dumazet
In-Reply-To: <20171213002858.74188-1-tracywwnj@gmail.com>

On Tue, Dec 12, 2017 at 7:28 PM, Wei Wang <weiwan@google.com> wrote:
> From: Wei Wang <weiwan@google.com>
>
> When ms timestamp is used, current logic uses 1us in
> tcp_rcv_rtt_update() when the real rcv_rtt is within 1 - 999us.
> This could cause rcv_rtt underestimation.
> Fix it by always using a min value of 1ms if ms timestamp is used.
>
> Fixes: 645f4c6f2ebd ("tcp: switch rcv_rtt_est and rcvq_space to high
> resolution timestamps")
>
> Signed-off-by: Wei Wang <weiwan@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---

Acked-by: Neal Cardwell <ncardwell@google.com>

Thanks!

neal

^ permalink raw reply

* Re: AF_VSOCK connection refused errno
From: Stefan Hajnoczi @ 2017-12-13 14:58 UTC (permalink / raw)
  To: Jorgen S. Hansen; +Cc: Dexuan Cui, netdev@vger.kernel.org
In-Reply-To: <7CCDA4D3-7683-4E5C-95ED-97B2FC9DFF33@vmware.com>

[-- Attachment #1: Type: text/plain, Size: 1226 bytes --]

On Wed, Dec 13, 2017 at 10:28:30AM +0000, Jorgen S. Hansen wrote:
> 
> > On Dec 12, 2017, at 4:53 PM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > 
> > When connect(2) fails because the peer is not listening the virtio vsock
> > transport returns ECONNRESET.  I believe the VMCI transport does the
> > same (based on code inspection).
> > 
> > Jorgen: Can you confirm this VMCI transport behavior?
> 
> Yes, that is correct.
> 
> > I'd like to change to ECONNREFUSED for all transports because developers
> > will be surprised when they get ECONNRESET.  It makes porting AF_INET
> > code harder.
> > 
> > On the other hand, it may be too late to fix this if there userspace
> > applications that rely on ECONNRESET?  I'm not aware of any such
> > applications myself.
> 
> In the past, I’ve explained to customers that an ECONNRESET error on connect
> can indicate that the peer isn’t listening on the dest address. Whether they went
> and used that information isn’t clear, but changing this behavior now would
> risk breaking applications. While it is unfortunate that we deviate from INET in
> this case, I would prefer it to stay as is.

That's fine.  Thanks for confirming.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply

* [PATCH 5/5] VSOCK: add AF_VSOCK test cases
From: Stefan Hajnoczi @ 2017-12-13 14:49 UTC (permalink / raw)
  To: netdev; +Cc: Jorgen Hansen, Dexuan Cui, Stefan Hajnoczi
In-Reply-To: <20171213144911.6428-1-stefanha@redhat.com>

The vsock_test.c program runs a test suite of AF_VSOCK test cases.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tools/testing/vsock/Makefile     |   5 +-
 tools/testing/vsock/vsock_test.c | 321 +++++++++++++++++++++++++++++++++++++++
 tools/testing/vsock/.gitignore   |   1 +
 tools/testing/vsock/README       |   1 +
 4 files changed, 326 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/vsock/vsock_test.c

diff --git a/tools/testing/vsock/Makefile b/tools/testing/vsock/Makefile
index cf0650007fa8..9efb9010df2e 100644
--- a/tools/testing/vsock/Makefile
+++ b/tools/testing/vsock/Makefile
@@ -1,9 +1,10 @@
 all: test
-test: vsock_diag_test
+test: vsock_test vsock_diag_test
+vsock_test: vsock_test.o timeout.o control.o util.o
 vsock_diag_test: vsock_diag_test.o timeout.o control.o util.o
 
 CFLAGS += -g -O2 -Werror -Wall -I. -I../../include/uapi -I../../include -Wno-pointer-sign -fno-strict-overflow -fno-strict-aliasing -fno-common -MMD -U_FORTIFY_SOURCE -D_GNU_SOURCE
 .PHONY: all test clean
 clean:
-	${RM} *.o *.d vsock_diag_test
+	${RM} *.o *.d vsock_test vsock_diag_test
 -include *.d
diff --git a/tools/testing/vsock/vsock_test.c b/tools/testing/vsock/vsock_test.c
new file mode 100644
index 000000000000..c1fc376eea00
--- /dev/null
+++ b/tools/testing/vsock/vsock_test.c
@@ -0,0 +1,321 @@
+/*
+ * vsock_test - vsock.ko test suite
+ *
+ * Copyright (C) 2017 Red Hat, Inc.
+ *
+ * Author: Stefan Hajnoczi <stefanha@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#include <getopt.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+#include <unistd.h>
+#include <sys/socket.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <linux/list.h>
+#include <linux/net.h>
+#include <linux/netlink.h>
+#include <linux/sock_diag.h>
+#include <netinet/tcp.h>
+
+#include "timeout.h"
+#include "control.h"
+#include "util.h"
+
+static void test_stream_connection_reset(const struct test_opts *opts)
+{
+	union {
+		struct sockaddr sa;
+		struct sockaddr_vm svm;
+	} addr = {
+		.svm = {
+			.svm_family = AF_VSOCK,
+			.svm_port = 1234,
+			.svm_cid = opts->peer_cid,
+		},
+	};
+	int ret;
+	int fd;
+
+	fd = socket(AF_VSOCK, SOCK_STREAM, 0);
+
+	timeout_begin(TIMEOUT);
+	do {
+		ret = connect(fd, &addr.sa, sizeof(addr.svm));
+		timeout_check("connect");
+	} while (ret < 0 && errno == EINTR);
+	timeout_end();
+
+	if (ret != -1) {
+		fprintf(stderr, "expected connect(2) failure, got %d\n", ret);
+		exit(EXIT_FAILURE);
+	}
+	if (errno != ECONNRESET) {
+		fprintf(stderr, "unexpected connect(2) errno %d\n", errno);
+		exit(EXIT_FAILURE);
+	}
+
+	close(fd);
+}
+
+static void test_stream_client_close_client(const struct test_opts *opts)
+{
+	int fd;
+
+	fd = vsock_stream_connect(opts->peer_cid, 1234);
+	if (fd < 0) {
+		perror("connect");
+		exit(EXIT_FAILURE);
+	}
+
+	send_byte(fd, 1);
+	close(fd);
+	control_writeln("CLOSED");
+}
+
+static void test_stream_client_close_server(const struct test_opts *opts)
+{
+	int fd;
+
+	fd = vsock_stream_accept(VMADDR_CID_ANY, 1234, NULL);
+	if (fd < 0) {
+		perror("accept");
+		exit(EXIT_FAILURE);
+	}
+
+	control_expectln("CLOSED");
+
+	send_byte(fd, -EPIPE);
+	recv_byte(fd, 1);
+	recv_byte(fd, 0);
+	close(fd);
+}
+
+static void test_stream_server_close_client(const struct test_opts *opts)
+{
+	int fd;
+
+	fd = vsock_stream_connect(opts->peer_cid, 1234);
+	if (fd < 0) {
+		perror("connect");
+		exit(EXIT_FAILURE);
+	}
+
+	control_expectln("CLOSED");
+
+	send_byte(fd, -EPIPE);
+	recv_byte(fd, 1);
+	recv_byte(fd, 0);
+	close(fd);
+}
+
+static void test_stream_server_close_server(const struct test_opts *opts)
+{
+	int fd;
+
+	fd = vsock_stream_accept(VMADDR_CID_ANY, 1234, NULL);
+	if (fd < 0) {
+		perror("accept");
+		exit(EXIT_FAILURE);
+	}
+
+	send_byte(fd, 1);
+	close(fd);
+	control_writeln("CLOSED");
+}
+
+#define MULTICONN_NFDS 1000
+
+static void test_stream_multiconn_client(const struct test_opts *opts)
+{
+	int fds[MULTICONN_NFDS];
+	int i;
+
+	for (i = 0; i < MULTICONN_NFDS; i++) {
+		fds[i] = vsock_stream_connect(opts->peer_cid, 1234);
+		if (fds[i] < 0) {
+			perror("connect");
+			exit(EXIT_FAILURE);
+		}
+	}
+
+	for (i = 0; i < MULTICONN_NFDS; i++) {
+		if (i % 1)
+			recv_byte(fds[i], 1);
+		else
+			send_byte(fds[i], 1);
+	}
+
+	for (i = 0; i < MULTICONN_NFDS; i++)
+		close(fds[i]);
+}
+
+static void test_stream_multiconn_server(const struct test_opts *opts)
+{
+	int fds[MULTICONN_NFDS];
+	int i;
+
+	for (i = 0; i < MULTICONN_NFDS; i++) {
+		fds[i] = vsock_stream_accept(VMADDR_CID_ANY, 1234, NULL);
+		if (fds[i] < 0) {
+			perror("accept");
+			exit(EXIT_FAILURE);
+		}
+	}
+
+	for (i = 0; i < MULTICONN_NFDS; i++) {
+		if (i % 1)
+			send_byte(fds[i], 1);
+		else
+			recv_byte(fds[i], 1);
+	}
+
+	for (i = 0; i < MULTICONN_NFDS; i++)
+		close(fds[i]);
+}
+
+static struct test_case test_cases[] = {
+	{
+		.name = "SOCK_STREAM connection reset",
+		.run_client = test_stream_connection_reset,
+	},
+	{
+		.name = "SOCK_STREAM client close",
+		.run_client = test_stream_client_close_client,
+		.run_server = test_stream_client_close_server,
+	},
+	{
+		.name = "SOCK_STREAM server close",
+		.run_client = test_stream_server_close_client,
+		.run_server = test_stream_server_close_server,
+	},
+	{
+		.name = "SOCK_STREAM multiple connections",
+		.run_client = test_stream_multiconn_client,
+		.run_server = test_stream_multiconn_server,
+	},
+	{},
+};
+
+static const char optstring[] = "";
+static const struct option longopts[] = {
+	{
+		.name = "control-host",
+		.has_arg = required_argument,
+		.val = 'H',
+	},
+	{
+		.name = "control-port",
+		.has_arg = required_argument,
+		.val = 'P',
+	},
+	{
+		.name = "mode",
+		.has_arg = required_argument,
+		.val = 'm',
+	},
+	{
+		.name = "peer-cid",
+		.has_arg = required_argument,
+		.val = 'p',
+	},
+	{
+		.name = "help",
+		.has_arg = no_argument,
+		.val = '?',
+	},
+	{},
+};
+
+static void usage(void)
+{
+	fprintf(stderr, "Usage: vsock_test [--help] [--control-host=<host>] --control-port=<port> --mode=client|server --peer-cid=<cid>\n"
+		"\n"
+		"  Server: vsock_test --control-port=1234 --mode=server --peer-cid=3\n"
+		"  Client: vsock_test --control-host=192.168.0.1 --control-port=1234 --mode=client --peer-cid=2\n"
+		"\n"
+		"Run vsock.ko tests.  Must be launched in both guest\n"
+		"and host.  One side must use --mode=client and\n"
+		"the other side must use --mode=server.\n"
+		"\n"
+		"A TCP control socket connection is used to coordinate tests\n"
+		"between the client and the server.  The server requires a\n"
+		"listen address and the client requires an address to\n"
+		"connect to.\n"
+		"\n"
+		"The CID of the other side must be given with --peer-cid=<cid>.\n");
+	exit(EXIT_FAILURE);
+}
+
+int main(int argc, char **argv)
+{
+	const char *control_host = NULL;
+	const char *control_port = NULL;
+	struct test_opts opts = {
+		.mode = TEST_MODE_UNSET,
+		.peer_cid = VMADDR_CID_ANY,
+	};
+
+	init_signals();
+
+	for (;;) {
+		int opt = getopt_long(argc, argv, optstring, longopts, NULL);
+
+		if (opt == -1)
+			break;
+
+		switch (opt) {
+		case 'H':
+			control_host = optarg;
+			break;
+		case 'm':
+			if (strcmp(optarg, "client") == 0)
+				opts.mode = TEST_MODE_CLIENT;
+			else if (strcmp(optarg, "server") == 0)
+				opts.mode = TEST_MODE_SERVER;
+			else {
+				fprintf(stderr, "--mode must be \"client\" or \"server\"\n");
+				return EXIT_FAILURE;
+			}
+			break;
+		case 'p':
+			opts.peer_cid = parse_cid(optarg);
+			break;
+		case 'P':
+			control_port = optarg;
+			break;
+		case '?':
+		default:
+			usage();
+		}
+	}
+
+	if (!control_port)
+		usage();
+	if (opts.mode == TEST_MODE_UNSET)
+		usage();
+	if (opts.peer_cid == VMADDR_CID_ANY)
+		usage();
+
+	if (!control_host) {
+		if (opts.mode != TEST_MODE_SERVER)
+			usage();
+		control_host = "0.0.0.0";
+	}
+
+	control_init(control_host, control_port,
+		     opts.mode == TEST_MODE_SERVER);
+
+	run_tests(test_cases, &opts);
+
+	control_cleanup();
+	return EXIT_SUCCESS;
+}
diff --git a/tools/testing/vsock/.gitignore b/tools/testing/vsock/.gitignore
index dc5f11faf530..7f7a2ccc30c4 100644
--- a/tools/testing/vsock/.gitignore
+++ b/tools/testing/vsock/.gitignore
@@ -1,2 +1,3 @@
 *.d
+vsock_test
 vsock_diag_test
diff --git a/tools/testing/vsock/README b/tools/testing/vsock/README
index 2cc6d7302db6..3014d9e4959a 100644
--- a/tools/testing/vsock/README
+++ b/tools/testing/vsock/README
@@ -5,6 +5,7 @@ Hyper-V.
 
 The following tests are available:
 
+  * vsock_test - core AF_VSOCK socket functionality
   * vsock_diag_test - vsock_diag.ko module for listing open sockets
 
 The following prerequisite steps are not automated and must be performed prior
-- 
2.14.3

^ permalink raw reply related

* [PATCH 4/5] VSOCK: add send_byte()/recv_byte() test utilities
From: Stefan Hajnoczi @ 2017-12-13 14:49 UTC (permalink / raw)
  To: netdev; +Cc: Jorgen Hansen, Dexuan Cui, Stefan Hajnoczi
In-Reply-To: <20171213144911.6428-1-stefanha@redhat.com>

Test cases will want to transfer data.  This patch adds utility
functions to do this.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tools/testing/vsock/util.h |  2 +
 tools/testing/vsock/util.c | 99 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 101 insertions(+)

diff --git a/tools/testing/vsock/util.h b/tools/testing/vsock/util.h
index 3e6971f15caa..cf43d42591bc 100644
--- a/tools/testing/vsock/util.h
+++ b/tools/testing/vsock/util.h
@@ -35,6 +35,8 @@ unsigned int parse_cid(const char *str);
 int vsock_stream_connect(unsigned int cid, unsigned int port);
 int vsock_stream_accept(unsigned int cid, unsigned int port,
 			struct sockaddr_vm *clientaddrp);
+void send_byte(int fd, int expected_ret);
+void recv_byte(int fd, int expected_ret);
 void run_tests(const struct test_case *test_cases,
 	       const struct test_opts *opts);
 
diff --git a/tools/testing/vsock/util.c b/tools/testing/vsock/util.c
index 2be923fe9922..e28085a2a429 100644
--- a/tools/testing/vsock/util.c
+++ b/tools/testing/vsock/util.c
@@ -13,6 +13,7 @@
 
 #include <errno.h>
 #include <stdio.h>
+#include <stdint.h>
 #include <stdlib.h>
 #include <signal.h>
 #include <unistd.h>
@@ -153,6 +154,104 @@ int vsock_stream_accept(unsigned int cid, unsigned int port,
 	return client_fd;
 }
 
+/* Transmit one byte and check the return value.
+ *
+ * expected_ret:
+ *  <0 Negative errno (for testing errors)
+ *   0 End-of-file
+ *   1 Success
+ */
+void send_byte(int fd, int expected_ret)
+{
+	const uint8_t byte = 'A';
+	ssize_t nwritten;
+
+	timeout_begin(TIMEOUT);
+	do {
+		nwritten = write(fd, &byte, sizeof(byte));
+		timeout_check("write");
+	} while (nwritten < 0 && errno == EINTR);
+	timeout_end();
+
+	if (expected_ret < 0) {
+		if (nwritten != -1) {
+			fprintf(stderr, "bogus write(2) return value %zd\n",
+				nwritten);
+			exit(EXIT_FAILURE);
+		}
+		if (errno != -expected_ret) {
+			perror("write");
+			exit(EXIT_FAILURE);
+		}
+		return;
+	}
+
+	if (nwritten < 0) {
+		perror("write");
+		exit(EXIT_FAILURE);
+	}
+	if (nwritten == 0) {
+		if (expected_ret == 0)
+			return;
+
+		fprintf(stderr, "unexpected EOF while sending byte\n");
+		exit(EXIT_FAILURE);
+	}
+	if (nwritten != sizeof(byte)) {
+		fprintf(stderr, "bogus write(2) return value %zd\n", nwritten);
+		exit(EXIT_FAILURE);
+	}
+}
+
+/* Receive one byte and check the return value.
+ *
+ * expected_ret:
+ *  <0 Negative errno (for testing errors)
+ *   0 End-of-file
+ *   1 Success
+ */
+void recv_byte(int fd, int expected_ret)
+{
+	uint8_t byte;
+	ssize_t nread;
+
+	timeout_begin(TIMEOUT);
+	do {
+		nread = read(fd, &byte, sizeof(byte));
+		timeout_check("read");
+	} while (nread < 0 && errno == EINTR);
+	timeout_end();
+
+	if (expected_ret < 0) {
+		if (nread != -1) {
+			fprintf(stderr, "bogus read(2) return value %zd\n",
+				nread);
+			exit(EXIT_FAILURE);
+		}
+		if (errno != -expected_ret) {
+			perror("read");
+			exit(EXIT_FAILURE);
+		}
+		return;
+	}
+
+	if (nread < 0) {
+		perror("read");
+		exit(EXIT_FAILURE);
+	}
+	if (nread == 0) {
+		if (expected_ret == 0)
+			return;
+
+		fprintf(stderr, "unexpected EOF while receiving byte\n");
+		exit(EXIT_FAILURE);
+	}
+	if (nread != sizeof(byte)) {
+		fprintf(stderr, "bogus read(2) return value %zd\n", nread);
+		exit(EXIT_FAILURE);
+	}
+}
+
 /* Run test cases.  The program terminates if a failure occurs. */
 void run_tests(const struct test_case *test_cases,
 	       const struct test_opts *opts)
-- 
2.14.3

^ permalink raw reply related

* [PATCH 3/5] VSOCK: add full barrier between test cases
From: Stefan Hajnoczi @ 2017-12-13 14:49 UTC (permalink / raw)
  To: netdev; +Cc: Jorgen Hansen, Dexuan Cui, Stefan Hajnoczi
In-Reply-To: <20171213144911.6428-1-stefanha@redhat.com>

See code comment for details.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tools/testing/vsock/util.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/tools/testing/vsock/util.c b/tools/testing/vsock/util.c
index 75a78f295b37..2be923fe9922 100644
--- a/tools/testing/vsock/util.c
+++ b/tools/testing/vsock/util.c
@@ -165,10 +165,24 @@ void run_tests(const struct test_case *test_cases,
 		printf("%s...", test_cases[i].name);
 		fflush(stdout);
 
-		if (opts->mode == TEST_MODE_CLIENT)
+		if (opts->mode == TEST_MODE_CLIENT) {
+			/* Full barrier before executing the next test.  This
+			 * ensures that client and server are executing the
+			 * same test case.  In particular, it means whoever is
+			 * faster will not see the peer still executing the
+			 * last test.  This is important because port numbers
+			 * can be used by multiple test cases.
+			 */
+			control_expectln("NEXT");
+			control_writeln("NEXT");
+
 			run = test_cases[i].run_client;
-		else
+		} else {
+			control_writeln("NEXT");
+			control_expectln("NEXT");
+
 			run = test_cases[i].run_server;
+		}
 
 		if (run)
 			run(opts);
-- 
2.14.3

^ permalink raw reply related

* [PATCH 2/5] VSOCK: extract connect/accept functions from vsock_diag_test.c
From: Stefan Hajnoczi @ 2017-12-13 14:49 UTC (permalink / raw)
  To: netdev; +Cc: Jorgen Hansen, Dexuan Cui, Stefan Hajnoczi
In-Reply-To: <20171213144911.6428-1-stefanha@redhat.com>

Many test cases will need to connect to the server or accept incoming
connections.  This patch extracts these operations into utility
functions that can be reused.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tools/testing/vsock/util.h            |   6 ++
 tools/testing/vsock/util.c            | 108 ++++++++++++++++++++++++++++++++++
 tools/testing/vsock/vsock_diag_test.c |  81 ++-----------------------
 3 files changed, 119 insertions(+), 76 deletions(-)

diff --git a/tools/testing/vsock/util.h b/tools/testing/vsock/util.h
index a51e17578ccf..3e6971f15caa 100644
--- a/tools/testing/vsock/util.h
+++ b/tools/testing/vsock/util.h
@@ -1,6 +1,9 @@
 #ifndef UTIL_H
 #define UTIL_H
 
+#include <sys/socket.h>
+#include "../../../include/uapi/linux/vm_sockets.h"
+
 /* Tests can either run as the client or the server */
 enum test_mode {
 	TEST_MODE_UNSET,
@@ -29,6 +32,9 @@ struct test_case {
 
 void init_signals(void);
 unsigned int parse_cid(const char *str);
+int vsock_stream_connect(unsigned int cid, unsigned int port);
+int vsock_stream_accept(unsigned int cid, unsigned int port,
+			struct sockaddr_vm *clientaddrp);
 void run_tests(const struct test_case *test_cases,
 	       const struct test_opts *opts);
 
diff --git a/tools/testing/vsock/util.c b/tools/testing/vsock/util.c
index 2567abd90153..75a78f295b37 100644
--- a/tools/testing/vsock/util.c
+++ b/tools/testing/vsock/util.c
@@ -15,8 +15,10 @@
 #include <stdio.h>
 #include <stdlib.h>
 #include <signal.h>
+#include <unistd.h>
 
 #include "timeout.h"
+#include "control.h"
 #include "util.h"
 
 /* Install signal handlers */
@@ -45,6 +47,112 @@ unsigned int parse_cid(const char *str)
 	return n;
 }
 
+/* Connect to <cid, port> and return the file descriptor. */
+int vsock_stream_connect(unsigned int cid, unsigned int port)
+{
+	union {
+		struct sockaddr sa;
+		struct sockaddr_vm svm;
+	} addr = {
+		.svm = {
+			.svm_family = AF_VSOCK,
+			.svm_port = port,
+			.svm_cid = cid,
+		},
+	};
+	int ret;
+	int fd;
+
+	control_expectln("LISTENING");
+
+	fd = socket(AF_VSOCK, SOCK_STREAM, 0);
+
+	timeout_begin(TIMEOUT);
+	do {
+		ret = connect(fd, &addr.sa, sizeof(addr.svm));
+		timeout_check("connect");
+	} while (ret < 0 && errno == EINTR);
+	timeout_end();
+
+	if (ret < 0) {
+		int old_errno = errno;
+
+		close(fd);
+		fd = -1;
+		errno = old_errno;
+	}
+	return fd;
+}
+
+/* Listen on <cid, port> and return the first incoming connection.  The remote
+ * address is stored to clientaddrp.  clientaddrp may be NULL.
+ */
+int vsock_stream_accept(unsigned int cid, unsigned int port,
+			struct sockaddr_vm *clientaddrp)
+{
+	union {
+		struct sockaddr sa;
+		struct sockaddr_vm svm;
+	} addr = {
+		.svm = {
+			.svm_family = AF_VSOCK,
+			.svm_port = port,
+			.svm_cid = cid,
+		},
+	};
+	union {
+		struct sockaddr sa;
+		struct sockaddr_vm svm;
+	} clientaddr;
+	socklen_t clientaddr_len = sizeof(clientaddr.svm);
+	int fd;
+	int client_fd;
+	int old_errno;
+
+	fd = socket(AF_VSOCK, SOCK_STREAM, 0);
+
+	if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
+		perror("bind");
+		exit(EXIT_FAILURE);
+	}
+
+	if (listen(fd, 1) < 0) {
+		perror("listen");
+		exit(EXIT_FAILURE);
+	}
+
+	control_writeln("LISTENING");
+
+	timeout_begin(TIMEOUT);
+	do {
+		client_fd = accept(fd, &clientaddr.sa, &clientaddr_len);
+		timeout_check("accept");
+	} while (client_fd < 0 && errno == EINTR);
+	timeout_end();
+
+	old_errno = errno;
+	close(fd);
+	errno = old_errno;
+
+	if (client_fd < 0)
+		return client_fd;
+
+	if (clientaddr_len != sizeof(clientaddr.svm)) {
+		fprintf(stderr, "unexpected addrlen from accept(2), %zu\n",
+			(size_t)clientaddr_len);
+		exit(EXIT_FAILURE);
+	}
+	if (clientaddr.sa.sa_family != AF_VSOCK) {
+		fprintf(stderr, "expected AF_VSOCK from accept(2), got %d\n",
+			clientaddr.sa.sa_family);
+		exit(EXIT_FAILURE);
+	}
+
+	if (clientaddrp)
+		*clientaddrp = clientaddr.svm;
+	return client_fd;
+}
+
 /* Run test cases.  The program terminates if a failure occurs. */
 void run_tests(const struct test_case *test_cases,
 	       const struct test_opts *opts)
diff --git a/tools/testing/vsock/vsock_diag_test.c b/tools/testing/vsock/vsock_diag_test.c
index f1ace5d96e00..654bbe2fc211 100644
--- a/tools/testing/vsock/vsock_diag_test.c
+++ b/tools/testing/vsock/vsock_diag_test.c
@@ -17,7 +17,6 @@
 #include <string.h>
 #include <errno.h>
 #include <unistd.h>
-#include <sys/socket.h>
 #include <sys/stat.h>
 #include <sys/types.h>
 #include <linux/list.h>
@@ -26,7 +25,6 @@
 #include <linux/sock_diag.h>
 #include <netinet/tcp.h>
 
-#include "../../../include/uapi/linux/vm_sockets.h"
 #include "../../../include/uapi/linux/vm_sockets_diag.h"
 
 #include "timeout.h"
@@ -383,33 +381,12 @@ static void test_listen_socket_server(const struct test_opts *opts)
 
 static void test_connect_client(const struct test_opts *opts)
 {
-	union {
-		struct sockaddr sa;
-		struct sockaddr_vm svm;
-	} addr = {
-		.svm = {
-			.svm_family = AF_VSOCK,
-			.svm_port = 1234,
-			.svm_cid = opts->peer_cid,
-		},
-	};
 	int fd;
-	int ret;
 	LIST_HEAD(sockets);
 	struct vsock_stat *st;
 
-	control_expectln("LISTENING");
-
-	fd = socket(AF_VSOCK, SOCK_STREAM, 0);
-
-	timeout_begin(TIMEOUT);
-	do {
-		ret = connect(fd, &addr.sa, sizeof(addr.svm));
-		timeout_check("connect");
-	} while (ret < 0 && errno == EINTR);
-	timeout_end();
-
-	if (ret < 0) {
+	fd = vsock_stream_connect(opts->peer_cid, 1234);
+	if (fd < 0) {
 		perror("connect");
 		exit(EXIT_FAILURE);
 	}
@@ -429,66 +406,19 @@ static void test_connect_client(const struct test_opts *opts)
 
 static void test_connect_server(const struct test_opts *opts)
 {
-	union {
-		struct sockaddr sa;
-		struct sockaddr_vm svm;
-	} addr = {
-		.svm = {
-			.svm_family = AF_VSOCK,
-			.svm_port = 1234,
-			.svm_cid = VMADDR_CID_ANY,
-		},
-	};
-	union {
-		struct sockaddr sa;
-		struct sockaddr_vm svm;
-	} clientaddr;
-	socklen_t clientaddr_len = sizeof(clientaddr.svm);
-	LIST_HEAD(sockets);
 	struct vsock_stat *st;
-	int fd;
+	LIST_HEAD(sockets);
 	int client_fd;
 
-	fd = socket(AF_VSOCK, SOCK_STREAM, 0);
-
-	if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
-		perror("bind");
-		exit(EXIT_FAILURE);
-	}
-
-	if (listen(fd, 1) < 0) {
-		perror("listen");
-		exit(EXIT_FAILURE);
-	}
-
-	control_writeln("LISTENING");
-
-	timeout_begin(TIMEOUT);
-	do {
-		client_fd = accept(fd, &clientaddr.sa, &clientaddr_len);
-		timeout_check("accept");
-	} while (client_fd < 0 && errno == EINTR);
-	timeout_end();
-
+	client_fd = vsock_stream_accept(VMADDR_CID_ANY, 1234, NULL);
 	if (client_fd < 0) {
 		perror("accept");
 		exit(EXIT_FAILURE);
 	}
-	if (clientaddr.sa.sa_family != AF_VSOCK) {
-		fprintf(stderr, "expected AF_VSOCK from accept(2), got %d\n",
-			clientaddr.sa.sa_family);
-		exit(EXIT_FAILURE);
-	}
-	if (clientaddr.svm.svm_cid != opts->peer_cid) {
-		fprintf(stderr, "expected peer CID %u from accept(2), got %u\n",
-			opts->peer_cid, clientaddr.svm.svm_cid);
-		exit(EXIT_FAILURE);
-	}
 
 	read_vsock_stat(&sockets);
 
-	check_num_sockets(&sockets, 2);
-	find_vsock_stat(&sockets, fd);
+	check_num_sockets(&sockets, 1);
 	st = find_vsock_stat(&sockets, client_fd);
 	check_socket_state(st, TCP_ESTABLISHED);
 
@@ -496,7 +426,6 @@ static void test_connect_server(const struct test_opts *opts)
 	control_expectln("DONE");
 
 	close(client_fd);
-	close(fd);
 	free_sock_stat(&sockets);
 }
 
-- 
2.14.3

^ permalink raw reply related

* [PATCH 1/5] VSOCK: extract utility functions from vsock_diag_test.c
From: Stefan Hajnoczi @ 2017-12-13 14:49 UTC (permalink / raw)
  To: netdev; +Cc: Jorgen Hansen, Dexuan Cui, Stefan Hajnoczi
In-Reply-To: <20171213144911.6428-1-stefanha@redhat.com>

Move useful functions into a separate file in preparation for more vsock
test programs.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tools/testing/vsock/Makefile          |  2 +-
 tools/testing/vsock/util.h            | 35 +++++++++++++
 tools/testing/vsock/util.c            | 70 ++++++++++++++++++++++++++
 tools/testing/vsock/vsock_diag_test.c | 92 +++++++++--------------------------
 4 files changed, 128 insertions(+), 71 deletions(-)
 create mode 100644 tools/testing/vsock/util.h
 create mode 100644 tools/testing/vsock/util.c

diff --git a/tools/testing/vsock/Makefile b/tools/testing/vsock/Makefile
index 66ba0924194d..cf0650007fa8 100644
--- a/tools/testing/vsock/Makefile
+++ b/tools/testing/vsock/Makefile
@@ -1,6 +1,6 @@
 all: test
 test: vsock_diag_test
-vsock_diag_test: vsock_diag_test.o timeout.o control.o
+vsock_diag_test: vsock_diag_test.o timeout.o control.o util.o
 
 CFLAGS += -g -O2 -Werror -Wall -I. -I../../include/uapi -I../../include -Wno-pointer-sign -fno-strict-overflow -fno-strict-aliasing -fno-common -MMD -U_FORTIFY_SOURCE -D_GNU_SOURCE
 .PHONY: all test clean
diff --git a/tools/testing/vsock/util.h b/tools/testing/vsock/util.h
new file mode 100644
index 000000000000..a51e17578ccf
--- /dev/null
+++ b/tools/testing/vsock/util.h
@@ -0,0 +1,35 @@
+#ifndef UTIL_H
+#define UTIL_H
+
+/* Tests can either run as the client or the server */
+enum test_mode {
+	TEST_MODE_UNSET,
+	TEST_MODE_CLIENT,
+	TEST_MODE_SERVER
+};
+
+/* Test runner options */
+struct test_opts {
+	enum test_mode mode;
+	unsigned int peer_cid;
+};
+
+/* A test case definition.  Test functions must print failures to stderr and
+ * terminate with exit(EXIT_FAILURE).
+ */
+struct test_case {
+	const char *name; /* human-readable name */
+
+	/* Called when test mode is TEST_MODE_CLIENT */
+	void (*run_client)(const struct test_opts *opts);
+
+	/* Called when test mode is TEST_MODE_SERVER */
+	void (*run_server)(const struct test_opts *opts);
+};
+
+void init_signals(void);
+unsigned int parse_cid(const char *str);
+void run_tests(const struct test_case *test_cases,
+	       const struct test_opts *opts);
+
+#endif /* UTIL_H */
diff --git a/tools/testing/vsock/util.c b/tools/testing/vsock/util.c
new file mode 100644
index 000000000000..2567abd90153
--- /dev/null
+++ b/tools/testing/vsock/util.c
@@ -0,0 +1,70 @@
+/*
+ * vsock test utilities
+ *
+ * Copyright (C) 2017 Red Hat, Inc.
+ *
+ * Author: Stefan Hajnoczi <stefanha@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <signal.h>
+
+#include "timeout.h"
+#include "util.h"
+
+/* Install signal handlers */
+void init_signals(void)
+{
+	struct sigaction act = {
+		.sa_handler = sigalrm,
+	};
+
+	sigaction(SIGALRM, &act, NULL);
+	signal(SIGPIPE, SIG_IGN);
+}
+
+/* Parse a CID in string representation */
+unsigned int parse_cid(const char *str)
+{
+	char *endptr = NULL;
+	unsigned long int n;
+
+	errno = 0;
+	n = strtoul(str, &endptr, 10);
+	if (errno || *endptr != '\0') {
+		fprintf(stderr, "malformed CID \"%s\"\n", str);
+		exit(EXIT_FAILURE);
+	}
+	return n;
+}
+
+/* Run test cases.  The program terminates if a failure occurs. */
+void run_tests(const struct test_case *test_cases,
+	       const struct test_opts *opts)
+{
+	int i;
+
+	for (i = 0; test_cases[i].name; i++) {
+		void (*run)(const struct test_opts *opts);
+
+		printf("%s...", test_cases[i].name);
+		fflush(stdout);
+
+		if (opts->mode == TEST_MODE_CLIENT)
+			run = test_cases[i].run_client;
+		else
+			run = test_cases[i].run_server;
+
+		if (run)
+			run(opts);
+
+		printf("ok\n");
+	}
+}
diff --git a/tools/testing/vsock/vsock_diag_test.c b/tools/testing/vsock/vsock_diag_test.c
index e896a4af52f4..f1ace5d96e00 100644
--- a/tools/testing/vsock/vsock_diag_test.c
+++ b/tools/testing/vsock/vsock_diag_test.c
@@ -13,12 +13,10 @@
 
 #include <getopt.h>
 #include <stdio.h>
-#include <stdbool.h>
 #include <stdlib.h>
 #include <string.h>
 #include <errno.h>
 #include <unistd.h>
-#include <signal.h>
 #include <sys/socket.h>
 #include <sys/stat.h>
 #include <sys/types.h>
@@ -33,12 +31,7 @@
 
 #include "timeout.h"
 #include "control.h"
-
-enum test_mode {
-	TEST_MODE_UNSET,
-	TEST_MODE_CLIENT,
-	TEST_MODE_SERVER
-};
+#include "util.h"
 
 /* Per-socket status */
 struct vsock_stat {
@@ -339,7 +332,7 @@ static void free_sock_stat(struct list_head *sockets)
 		free(st);
 }
 
-static void test_no_sockets(unsigned int peer_cid)
+static void test_no_sockets(const struct test_opts *opts)
 {
 	LIST_HEAD(sockets);
 
@@ -350,7 +343,7 @@ static void test_no_sockets(unsigned int peer_cid)
 	free_sock_stat(&sockets);
 }
 
-static void test_listen_socket_server(unsigned int peer_cid)
+static void test_listen_socket_server(const struct test_opts *opts)
 {
 	union {
 		struct sockaddr sa;
@@ -388,7 +381,7 @@ static void test_listen_socket_server(unsigned int peer_cid)
 	free_sock_stat(&sockets);
 }
 
-static void test_connect_client(unsigned int peer_cid)
+static void test_connect_client(const struct test_opts *opts)
 {
 	union {
 		struct sockaddr sa;
@@ -397,7 +390,7 @@ static void test_connect_client(unsigned int peer_cid)
 		.svm = {
 			.svm_family = AF_VSOCK,
 			.svm_port = 1234,
-			.svm_cid = peer_cid,
+			.svm_cid = opts->peer_cid,
 		},
 	};
 	int fd;
@@ -434,7 +427,7 @@ static void test_connect_client(unsigned int peer_cid)
 	free_sock_stat(&sockets);
 }
 
-static void test_connect_server(unsigned int peer_cid)
+static void test_connect_server(const struct test_opts *opts)
 {
 	union {
 		struct sockaddr sa;
@@ -486,9 +479,9 @@ static void test_connect_server(unsigned int peer_cid)
 			clientaddr.sa.sa_family);
 		exit(EXIT_FAILURE);
 	}
-	if (clientaddr.svm.svm_cid != peer_cid) {
+	if (clientaddr.svm.svm_cid != opts->peer_cid) {
 		fprintf(stderr, "expected peer CID %u from accept(2), got %u\n",
-			peer_cid, clientaddr.svm.svm_cid);
+			opts->peer_cid, clientaddr.svm.svm_cid);
 		exit(EXIT_FAILURE);
 	}
 
@@ -507,11 +500,7 @@ static void test_connect_server(unsigned int peer_cid)
 	free_sock_stat(&sockets);
 }
 
-static struct {
-	const char *name;
-	void (*run_client)(unsigned int peer_cid);
-	void (*run_server)(unsigned int peer_cid);
-} test_cases[] = {
+static struct test_case test_cases[] = {
 	{
 		.name = "No sockets",
 		.run_server = test_no_sockets,
@@ -528,30 +517,6 @@ static struct {
 	{},
 };
 
-static void init_signals(void)
-{
-	struct sigaction act = {
-		.sa_handler = sigalrm,
-	};
-
-	sigaction(SIGALRM, &act, NULL);
-	signal(SIGPIPE, SIG_IGN);
-}
-
-static unsigned int parse_cid(const char *str)
-{
-	char *endptr = NULL;
-	unsigned long int n;
-
-	errno = 0;
-	n = strtoul(str, &endptr, 10);
-	if (errno || *endptr != '\0') {
-		fprintf(stderr, "malformed CID \"%s\"\n", str);
-		exit(EXIT_FAILURE);
-	}
-	return n;
-}
-
 static const char optstring[] = "";
 static const struct option longopts[] = {
 	{
@@ -606,9 +571,10 @@ int main(int argc, char **argv)
 {
 	const char *control_host = NULL;
 	const char *control_port = NULL;
-	int mode = TEST_MODE_UNSET;
-	unsigned int peer_cid = VMADDR_CID_ANY;
-	int i;
+	struct test_opts opts = {
+		.mode = TEST_MODE_UNSET,
+		.peer_cid = VMADDR_CID_ANY,
+	};
 
 	init_signals();
 
@@ -624,16 +590,16 @@ int main(int argc, char **argv)
 			break;
 		case 'm':
 			if (strcmp(optarg, "client") == 0)
-				mode = TEST_MODE_CLIENT;
+				opts.mode = TEST_MODE_CLIENT;
 			else if (strcmp(optarg, "server") == 0)
-				mode = TEST_MODE_SERVER;
+				opts.mode = TEST_MODE_SERVER;
 			else {
 				fprintf(stderr, "--mode must be \"client\" or \"server\"\n");
 				return EXIT_FAILURE;
 			}
 			break;
 		case 'p':
-			peer_cid = parse_cid(optarg);
+			opts.peer_cid = parse_cid(optarg);
 			break;
 		case 'P':
 			control_port = optarg;
@@ -646,35 +612,21 @@ int main(int argc, char **argv)
 
 	if (!control_port)
 		usage();
-	if (mode == TEST_MODE_UNSET)
+	if (opts.mode == TEST_MODE_UNSET)
 		usage();
-	if (peer_cid == VMADDR_CID_ANY)
+	if (opts.peer_cid == VMADDR_CID_ANY)
 		usage();
 
 	if (!control_host) {
-		if (mode != TEST_MODE_SERVER)
+		if (opts.mode != TEST_MODE_SERVER)
 			usage();
 		control_host = "0.0.0.0";
 	}
 
-	control_init(control_host, control_port, mode == TEST_MODE_SERVER);
+	control_init(control_host, control_port,
+		     opts.mode == TEST_MODE_SERVER);
 
-	for (i = 0; test_cases[i].name; i++) {
-		void (*run)(unsigned int peer_cid);
-
-		printf("%s...", test_cases[i].name);
-		fflush(stdout);
-
-		if (mode == TEST_MODE_CLIENT)
-			run = test_cases[i].run_client;
-		else
-			run = test_cases[i].run_server;
-
-		if (run)
-			run(peer_cid);
-
-		printf("ok\n");
-	}
+	run_tests(test_cases, &opts);
 
 	control_cleanup();
 	return EXIT_SUCCESS;
-- 
2.14.3

^ permalink raw reply related

* [PATCH 0/5] VSOCK: add vsock_test test suite
From: Stefan Hajnoczi @ 2017-12-13 14:49 UTC (permalink / raw)
  To: netdev; +Cc: Jorgen Hansen, Dexuan Cui, Stefan Hajnoczi

The vsock_diag.ko module already has a test suite but the core AF_VSOCK
functionality has no tests.  This patch series adds several test cases that
exercise AF_VSOCK SOCK_STREAM socket semantics (send/recv, connect/accept,
half-closed connections, simultaneous connections).

The test suite is modest but I hope to cover additional cases in the future.
My goal is to have a shared test suite so VMCI, Hyper-V, and KVM can ensure
that our transports behave the same.

I have tested virtio-vsock.

Jorgen: Please test the VMCI transport and let me know if anything needs to be
adjusted.  See tools/testing/vsock/README for information on how to run the
test suite.

Dexuan: I'm not sure if this test suite is useful for the Hyper-V transport
since the host is Windows and uses a different API for AF_HYPERV?

Stefan Hajnoczi (5):
  VSOCK: extract utility functions from vsock_diag_test.c
  VSOCK: extract connect/accept functions from vsock_diag_test.c
  VSOCK: add full barrier between test cases
  VSOCK: add send_byte()/recv_byte() test utilities
  VSOCK: add AF_VSOCK test cases

 tools/testing/vsock/Makefile          |   7 +-
 tools/testing/vsock/util.h            |  43 +++++
 tools/testing/vsock/util.c            | 291 ++++++++++++++++++++++++++++++
 tools/testing/vsock/vsock_diag_test.c | 167 +++---------------
 tools/testing/vsock/vsock_test.c      | 321 ++++++++++++++++++++++++++++++++++
 tools/testing/vsock/.gitignore        |   1 +
 tools/testing/vsock/README            |   1 +
 7 files changed, 685 insertions(+), 146 deletions(-)
 create mode 100644 tools/testing/vsock/util.h
 create mode 100644 tools/testing/vsock/util.c
 create mode 100644 tools/testing/vsock/vsock_test.c

-- 
2.14.3

^ permalink raw reply

* [PATCH v2 net-next] net: dsa: lan9303: Introduce lan9303_read_wait
From: Egil Hjelmeland @ 2017-12-13 14:42 UTC (permalink / raw)
  To: andrew, vivien.didelot, f.fainelli, netdev, linux-kernel; +Cc: Egil Hjelmeland

Simplify lan9303_indirect_phy_wait_for_completion()
and lan9303_switch_wait_for_completion() by using a new function
lan9303_read_wait()

Changes v1 -> v2:
 - param 'mask' type u32
 - removed param 'value' (will probably never be used)
 - add newline before return

Signed-off-by: Egil Hjelmeland <privat@egil-hjelmeland.no>
---
 drivers/net/dsa/lan9303-core.c | 59 +++++++++++++++++++-----------------------
 1 file changed, 27 insertions(+), 32 deletions(-)

diff --git a/drivers/net/dsa/lan9303-core.c b/drivers/net/dsa/lan9303-core.c
index c1b004fa64d9..f412aad58253 100644
--- a/drivers/net/dsa/lan9303-core.c
+++ b/drivers/net/dsa/lan9303-core.c
@@ -249,6 +249,29 @@ static int lan9303_read(struct regmap *regmap, unsigned int offset, u32 *reg)
 	return -EIO;
 }
 
+/* Wait a while until mask & reg == value. Otherwise return timeout. */
+static int lan9303_read_wait(struct lan9303 *chip, int offset, u32 mask)
+{
+	int i;
+
+	for (i = 0; i < 25; i++) {
+		u32 reg;
+		int ret;
+
+		ret = lan9303_read(chip->regmap, offset, &reg);
+		if (ret) {
+			dev_err(chip->dev, "%s failed to read offset %d: %d\n",
+				__func__, offset, ret);
+			return ret;
+		}
+		if (!(reg & mask))
+			return 0;
+		usleep_range(1000, 2000);
+	}
+
+	return -ETIMEDOUT;
+}
+
 static int lan9303_virt_phy_reg_read(struct lan9303 *chip, int regnum)
 {
 	int ret;
@@ -274,22 +297,8 @@ static int lan9303_virt_phy_reg_write(struct lan9303 *chip, int regnum, u16 val)
 
 static int lan9303_indirect_phy_wait_for_completion(struct lan9303 *chip)
 {
-	int ret, i;
-	u32 reg;
-
-	for (i = 0; i < 25; i++) {
-		ret = lan9303_read(chip->regmap, LAN9303_PMI_ACCESS, &reg);
-		if (ret) {
-			dev_err(chip->dev,
-				"Failed to read pmi access status: %d\n", ret);
-			return ret;
-		}
-		if (!(reg & LAN9303_PMI_ACCESS_MII_BUSY))
-			return 0;
-		usleep_range(1000, 2000);
-	}
-
-	return -EIO;
+	return lan9303_read_wait(chip, LAN9303_PMI_ACCESS,
+				 LAN9303_PMI_ACCESS_MII_BUSY);
 }
 
 static int lan9303_indirect_phy_read(struct lan9303 *chip, int addr, int regnum)
@@ -366,22 +375,8 @@ EXPORT_SYMBOL_GPL(lan9303_indirect_phy_ops);
 
 static int lan9303_switch_wait_for_completion(struct lan9303 *chip)
 {
-	int ret, i;
-	u32 reg;
-
-	for (i = 0; i < 25; i++) {
-		ret = lan9303_read(chip->regmap, LAN9303_SWITCH_CSR_CMD, &reg);
-		if (ret) {
-			dev_err(chip->dev,
-				"Failed to read csr command status: %d\n", ret);
-			return ret;
-		}
-		if (!(reg & LAN9303_SWITCH_CSR_CMD_BUSY))
-			return 0;
-		usleep_range(1000, 2000);
-	}
-
-	return -EIO;
+	return lan9303_read_wait(chip, LAN9303_SWITCH_CSR_CMD,
+				 LAN9303_SWITCH_CSR_CMD_BUSY);
 }
 
 static int lan9303_write_switch_reg(struct lan9303 *chip, u16 regnum, u32 val)
-- 
2.14.1

^ permalink raw reply related

* Re: Setting large MTU size on slave interfaces may stall the whole system
From: Or Gerlitz @ 2017-12-13 14:28 UTC (permalink / raw)
  To: Qing Huang
  Cc: Linux Netdev List, Linux Kernel, Jay Vosburgh, Veaceslav Falico,
	Andy Gospodarek, Aviv Heller, Moni Shoua
In-Reply-To: <ea15b7f9-4c02-3d4e-1c01-4e2a90d2a23c@oracle.com>

On Tue, Dec 12, 2017 at 5:21 AM, Qing Huang <qing.huang@oracle.com> wrote:
> Hi,
>
> We found an issue with the bonding driver when testing Mellanox devices.
> The following test commands will stall the whole system sometimes, with
> serial console
> flooded with log messages from the bond_miimon_inspect() function. Setting
> mtu size
> to be 1500 seems okay but very rarely it may hit the same problem too.
>
> ip address flush dev ens3f0
> ip link set dev ens3f0 down
> ip address flush dev ens3f1
> ip link set dev ens3f1 down
> [root@ca-hcl629 etc]# modprobe bonding mode=0 miimon=250 use_carrier=1
> updelay=500 downdelay=500
> [root@ca-hcl629 etc]# ifconfig bond0 up
> [root@ca-hcl629 etc]# ifenslave bond0 ens3f0 ens3f1
> [root@ca-hcl629 etc]# ip link set bond0 mtu 4500 up

> Seiral console output:
>
> ** 4 printk messages dropped ** [ 3717.743761] bond0: link status down for
> interface ens3f0, disabling it in 500 ms
[..]


> It seems that when setting a large mtu size on an RoCE interface, the RTNL
> mutex may be held too long by the slave
> interface, causing bond_mii_monitor() to be called repeatedly at an interval
> of 1 tick (1K HZ kernel configuration) and kernel to become unresponsive.

Did you try/managed to reproduce that also with other NIC drivers?

Or.

^ permalink raw reply

* Re: [RFC][PATCH] new byteorder primitives - ..._{replace,get}_bits()
From: Al Viro @ 2017-12-13 14:22 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Linus Torvalds, netdev, linux-kernel
In-Reply-To: <20171212184400.13b27cf8@cakuba.netronome.com>

On Tue, Dec 12, 2017 at 06:44:00PM -0800, Jakub Kicinski wrote:
> On Wed, 13 Dec 2017 01:51:25 +0000, Al Viro wrote:
> > On Tue, Dec 12, 2017 at 05:35:28PM -0800, Jakub Kicinski wrote:
> > 
> > > It used to be __always_inline, but apparently LLVM/clang doesn't
> > > propagate constants :(  
> > > 
> > > 4e59532541c8 ("nfp: don't depend on compiler constant propagation")  
> > 
> > Doesn't propagate constants or doesn't have exact same set of
> > rules for __builtin_constant_p()?  IOW, if you dropped that
> > BUILD_BUG_ON(), what would be left after optimizations?
> 
> Hm.  You're right.  It just doesn't recognize the parameter as constant
> in __builtin_constant_p().

FWIW, clang does propagate them well enough to detect and optimize
multiplication by constant power of two.  With the variant I've posted.

The check on field overflow (which works on gcc builds) does nothing on
clang - __builtin_constant_p() gives a flat-out false for arguments of
inline function there.  I can understand their reasoning, even though
it's inconvenient in cases like this one - semantics of __builtin_constant_p()
is a fucking mess and the less you rely upon its details, the safer you
are...

Hell knows - a part of that can be recovered by taking the check into a
wrapper; as in

static __always_inline __le32_replace_bits(__le32 old, u32 v, u32 mask)
{
	__le32 m = cpu_to_le32(mask);
	return (old & ~m) | cpu_to_le32(v * mask/mask_to_multiplier(mask)));
}

#define le32_replace_bits(old, v, mask)	({			\
	typeof(v) ____v = (v);					\
	typeof(mask) ____m = (mask);				\
	if (__builtin_constant_p(____v))			\
		if (v & ~(____m / mask_to_multiplier(____m)))	\
			__field_overflow();			\
	__l32_replace_bits(old, ____v, ____m);			\
})

That would give that sanity check a better coverage on clang builds, but...
does it really buy us enough to bother, especially since all those macros
would have to be spelled out; you can have a macro expanding to definition
of static inline, but you can't have a macro expanding to anything that
would contain a preprocessor directive, including #define.  And with that
kind of sanity checks, the first build with gcc will catch everything
missed by clang builds anyway.

IMO it's not worth the trouble; let's go with the check inside of
static inline and accept that on clang builds it'll do nothing.

Next question: where do we put that bunch?  I've put it into
linux/byteorder/generic.h, so that anything picking fixed-endian primitives
would pick those as well; I hadn't thought of linux/bitfield.h at the time.
We certainly could put it there instead - it's never pulled by other headers,
so adding #include <asm/byteorder.h> into linux/bitfield.h is not going to
cause header order problems.  Not sure...

Linus, do you have any preferences in that area?

^ permalink raw reply

* Re: [bpf-next V1-RFC PATCH 06/14] mlx4: setup xdp_rxq_info
From: Jesper Dangaard Brouer @ 2017-12-13 14:00 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Daniel Borkmann, Alexei Starovoitov, netdev, dsahern, gospo,
	bjorn.topel, michael.chan, brouer
In-Reply-To: <5b7d2132-40c7-71f5-b939-64539c6c53bb@mellanox.com>

On Wed, 13 Dec 2017 14:42:25 +0200
Tariq Toukan <tariqt@mellanox.com> wrote:

> > @@ -650,6 +658,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
> >   	int cq_ring = cq->ring;
> >   	bool doorbell_pending;
> >   	struct mlx4_cqe *cqe;
> > +	struct xdp_buff xdp;
> >   	int polled = 0;
> >   	int index;
> >   
> > @@ -664,6 +673,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
> >   	/* Protect accesses to: ring->xdp_prog, priv->mac_hash list */
> >   	rcu_read_lock();
> >   	xdp_prog = rcu_dereference(ring->xdp_prog);
> > +	xdp.rxq = &ring->xdp_rxq;  
> 
> You moved this to update it only once, and not per packet. Right?
> This is because all fields of struct xdp_buff used to be specific 
> per-packet and filled in every iteration. Now you introduce a new field 
> which holds a context.
> 
> Well, I still need to go over the infrastructure in your other patches, 
> but from first glance it seems that we can use two separated structs: 
> one for context, and one for per-packet info.

This are two separate structs.  I guess, what you are suggesting is
passing them as separate structs to bpf_prog_run_xdp() ?

The reason I like/want to have xdp_buff point to xdp_rxq_info, is that
I want this information transferred through to the XDP_REDIRECT calls.

The plan (after this patchset) is to implement a kfree_xdp_buff() that
can free/return xdp frames directly to the driver (needed in err cases
in redirect code) by checking if the rxq->dev have defined an
ndo_xdp_return() function.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [bpf-next V1-RFC PATCH 02/14] xdp/mlx5: setup xdp_rxq_info and extend with qtype
From: Jesper Dangaard Brouer @ 2017-12-13 13:44 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Daniel Borkmann, Alexei Starovoitov, netdev, dsahern, Matan Barak,
	Saeed Mahameed, gospo, bjorn.topel, michael.chan, brouer
In-Reply-To: <e54da3e8-9944-6a7a-37a9-be438c88884b@mellanox.com>

On Wed, 13 Dec 2017 14:27:08 +0200
Tariq Toukan <tariqt@mellanox.com> wrote:

> Hi Jesper,
> Thanks for taking care of the drop RQ.
> 
> In general, mlx5 part looks ok to me.
> Find a few comments below. Mostly pointing out some typos.
> 
> On 13/12/2017 1:19 PM, Jesper Dangaard Brouer wrote:
> > The mlx5 driver have a special drop-RQ queue (one per interface) that
> > simply drops all incoming traffic. It helps driver keep other HW
> > objects (flow steering) alive upon down/up operations.  It is
> > temporarily pointed by flow steering objects during the interface
> > setup, and when interface is down. It lacks many fields that are set
> > in a regular RQ (for example its state is never switched to
> > MLX5_RQC_STATE_RDY). (Thanks to Tariq Toukan for explaination).  
> typo: explanation

Fixed 

> > 
> > The XDP RX-queue info API is extended with a queue-type, and mlx5 uses
> > this kind of drop/sink-type (RXQ_TYPE_SINK) for this kind of sink queue.
> > 
> > Driver hook points for xdp_rxq_info:
> >   * init+reg: mlx5e_alloc_rq()
> >   * init+reg: mlx5e_alloc_drop_rq()
> >   * unreg   : mlx5e_free_rq()
> > 
> > Tested on actual hardware with samples/bpf program
> > 
> > Cc: Saeed Mahameed <saeedm@mellanox.com>
> > Cc: Matan Barak <matanb@mellanox.com>
> > Cc: Tariq Toukan <tariqt@mellanox.com>
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > ---
> >   drivers/net/ethernet/mellanox/mlx5/core/en.h      |    4 ++++
> >   drivers/net/ethernet/mellanox/mlx5/core/en_main.c |   14 +++++++++++++
> >   drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   |    1 +
> >   include/net/xdp.h                                 |   23 +++++++++++++++++++++
> >   net/core/xdp.c                                    |    6 +++++
> >   5 files changed, 48 insertions(+)
> > 
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> > index c0872b3284cb..fe10a042783b 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> > @@ -46,6 +46,7 @@
> >   #include <linux/mlx5/transobj.h>
> >   #include <linux/rhashtable.h>
> >   #include <net/switchdev.h>
> > +#include <net/xdp.h>
> >   #include "wq.h"
> >   #include "mlx5_core.h"
> >   #include "en_stats.h"
> > @@ -568,6 +569,9 @@ struct mlx5e_rq {
> >   	u32                    rqn;
> >   	struct mlx5_core_dev  *mdev;
> >   	struct mlx5_core_mkey  umr_mkey;
> > +
> > +	/* XDP read-mostly */
> > +	struct xdp_rxq_info xdp_rxq;
> >   } ____cacheline_aligned_in_smp;
> >   
> >   struct mlx5e_channel {
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > index 0f5c012de52e..ea44b5f25e11 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > @@ -582,6 +582,12 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
> >   	rq->ix      = c->ix;
> >   	rq->mdev    = mdev;
> >   
> > +	/* XDP RX-queue info */
> > +	xdp_rxq_info_init(&rq->xdp_rxq);
> > +	rq->xdp_rxq.dev		= rq->netdev;
> > +	rq->xdp_rxq.queue_index = rq->ix;
> > +	xdp_rxq_info_reg(&rq->xdp_rxq);
> > +  
> You don't set type here. This is ok as long as the following hold:
> 1) RXQ_TYPE_DEFAULT is zero

True

> 2) xdp_rxq is zalloc'ed.

xdp_rxq memory area is part of rq allocation, but in
xdp_rxq_info_init() I memset/zero the area explicit.

 
> >   	rq->xdp_prog = params->xdp_prog ?
> > bpf_prog_inc(params->xdp_prog) : NULL; if (IS_ERR(rq->xdp_prog)) {
> >   		err = PTR_ERR(rq->xdp_prog);
> > @@ -695,6 +701,7 @@ static int mlx5e_alloc_rq(struct mlx5e_channel
> > *c, err_rq_wq_destroy:
> >   	if (rq->xdp_prog)
> >   		bpf_prog_put(rq->xdp_prog);
> > +	xdp_rxq_info_unreg(&rq->xdp_rxq);
> >   	mlx5_wq_destroy(&rq->wq_ctrl);
> >   
> >   	return err;
> > @@ -707,6 +714,8 @@ static void mlx5e_free_rq(struct mlx5e_rq *rq)
> >   	if (rq->xdp_prog)
> >   		bpf_prog_put(rq->xdp_prog);
> >   
> > +	xdp_rxq_info_unreg(&rq->xdp_rxq);
> > +
> >   	switch (rq->wq_type) {
> >   	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
> >   		mlx5e_rq_free_mpwqe_info(rq);
> > @@ -2768,6 +2777,11 @@ static int mlx5e_alloc_drop_rq(struct
> > mlx5_core_dev *mdev, if (err)
> >   		return err;
> >   
> > +	/* XDP RX-queue info for "Drop-RQ", packets never reach
> > XDP */
> > +	xdp_rxq_info_init(&rq->xdp_rxq);
> > +	xdp_rxq_info_type(&rq->xdp_rxq, RXQ_TYPE_SINK);
> > +	xdp_rxq_info_reg(&rq->xdp_rxq);
> > +
> >   	rq->mdev = mdev;
> >   
> >   	return 0;
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c index
> > 5b499c7a698f..7b38480811d4 100644 ---
> > a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c +++
> > b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c @@ -812,6 +812,7
> > @@ static inline int mlx5e_xdp_handle(struct mlx5e_rq *rq,
> > xdp_set_data_meta_invalid(&xdp); xdp.data_end = xdp.data + *len;
> >   	xdp.data_hard_start = va;
> > +	xdp.rxq = &rq->xdp_rxq;
> >   
> >   	act = bpf_prog_run_xdp(prog, &xdp);
> >   	switch (act) {
> > diff --git a/include/net/xdp.h b/include/net/xdp.h
> > index e4acd198fd60..5be560d943e1 100644
> > --- a/include/net/xdp.h
> > +++ b/include/net/xdp.h
> > @@ -36,10 +36,33 @@ struct xdp_rxq_info {
> >   	struct net_device *dev;
> >   	u32 queue_index;
> >   	u32 reg_state;
> > +	u32 qtype;
> >   } ____cacheline_aligned; /* perf critical, avoid false-sharing */
> >   
> >   void xdp_rxq_info_init(struct xdp_rxq_info *xdp_rxq);
> >   void xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq);
> >   void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq);
> >   
> > +/**
> > + * DOC: XDP RX-queue type
> > + *
> > + * The XDP RX-queue info can have associated a type.
> > + *
> > + * @RXQ_TYPE_DEFAULT: default no specifik queue type need to be
> > specified  
> 
> typo: specific

Thanks, this is a Danish typo (it's spelled that way in Danish).
 
> > + *
> > + * @RXQ_TYPE_SINK: indicate a fake queue that never reach XDP RX
> > + *	code.  Some drivers have a need to maintain a lower layer
> > + *	RX-queue as a sink queue, while reconfiguring other
> > RX-queues.
> > + */
> > +#define RXQ_TYPE_DEFAULT	0
> > +#define RXQ_TYPE_SINK		1
> > +#define RXQ_TYPE_MAX		RXQ_TYPE_SINK  
> 
> Definitions of incremental numbers, enum might be best here, you can 
> give them some enum type and use it in xdp_rxq_info->qtype.

I use defines to make the below BUILD_BUG_ON work, as enums does not
get expanded to their values in the C-preprocessor stage.

> > +
> > +static inline
> > +void xdp_rxq_info_type(struct xdp_rxq_info *xdp_rxq, u32 qtype)
> > +{
> > +	BUILD_BUG_ON(qtype > RXQ_TYPE_MAX);
> > +	xdp_rxq->qtype = qtype;
> > +}
> > +
> >   #endif /* __LINUX_NET_XDP_H__ */
> > diff --git a/net/core/xdp.c b/net/core/xdp.c
> > index a9d2dd7b1ede..2a111f5987f6 100644
> > --- a/net/core/xdp.c
> > +++ b/net/core/xdp.c
> > @@ -32,8 +32,14 @@ EXPORT_SYMBOL_GPL(xdp_rxq_info_init);
> >   
> >   void xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq)
> >   {
> > +	if (xdp_rxq->qtype == RXQ_TYPE_SINK)
> > +		goto skip_content_check;
> > +
> > +	/* Check information setup by driver code */
> >   	WARN(!xdp_rxq->dev, "Missing net_device from driver");
> >   	WARN(xdp_rxq->queue_index == U32_MAX, "Miss queue_index from driver"); +
> > +skip_content_check:
> >   	WARN(!(xdp_rxq->reg_state == REG_STATE_NEW),"API violation, miss init");
> >     xdp_rxq->reg_state = REG_STATE_REGISTRED;  
> typo: REGISTERED (introduced in a previous patch)

Thanks for catching that! :-)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* RE: [PATCH V3 net-next 3/8] net: hns3: Add HNS3 VF HCL(Hardware Compatibility Layer) Support
From: Salil Mehta @ 2017-12-13 13:02 UTC (permalink / raw)
  To: Philippe Ombredanne
  Cc: David S. Miller, Zhuangyuzeng (Yisen), lipeng (Y), Salil Mehta,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, LKML,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linuxarm
In-Reply-To: <CAOFm3uFH_X5Bfagpx=ai_EdEKAmBHrcPb-xhdO=E-gDhtq0rLw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Hi Philippe,

> -----Original Message-----
> From: Philippe Ombredanne [mailto:pombredanne@nexb.com]
> Sent: Wednesday, December 13, 2017 12:54 PM
> To: Salil Mehta <salil.mehta@huawei.com>
> Cc: David S. Miller <davem@davemloft.net>; Zhuangyuzeng (Yisen)
> <yisen.zhuang@huawei.com>; lipeng (Y) <lipeng321@huawei.com>; Salil
> Mehta <mehta.salil.lnk@gmail.com>; netdev@vger.kernel.org; LKML <linux-
> kernel@vger.kernel.org>; linux-rdma@vger.kernel.org; Linuxarm
> <linuxarm@huawei.com>
> Subject: Re: [PATCH V3 net-next 3/8] net: hns3: Add HNS3 VF
> HCL(Hardware Compatibility Layer) Support
> 
> Dear Salil,
> 
> On Wed, Dec 13, 2017 at 11:35 AM, Salil Mehta <salil.mehta@huawei.com>
> wrote:
> 
> > I could also make out from different articles, including from the
> below,
> > Linus suggesting moving to "//" type instead of starred ones for
> headers.
> >
> > It looks SPDX change is still a suggestion?
> 
> Well, I think this has been discussed and agreed among maintainers at
> the plumbers summit and here on this list. And when I see Thomas doc
> patches, Greg patches, other key maintainers as well as Linus comments
> and reviews and guidelines, I can can hardly see this as just being a
> "suggestion", but that's your call to consider it this way.
Ok. I understand. Thanks for confirming it. I will do the changes as
suggested by you and thanks for bringing this into notice.

Best regards
Salil

^ permalink raw reply

* Re: [PATCH V3 net-next 3/8] net: hns3: Add HNS3 VF HCL(Hardware Compatibility Layer) Support
From: Philippe Ombredanne @ 2017-12-13 12:53 UTC (permalink / raw)
  To: Salil Mehta
  Cc: David S. Miller, Zhuangyuzeng (Yisen), lipeng (Y), Salil Mehta,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, LKML,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linuxarm
In-Reply-To: <F4CC6FACFEB3C54C9141D49AD221F7F93BB76189-WFPaWmAhWqtUuCJht5byYAK1hpo4iccwjNknBlVQO8k@public.gmane.org>

Dear Salil,

On Wed, Dec 13, 2017 at 11:35 AM, Salil Mehta <salil.mehta-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> wrote:

> I could also make out from different articles, including from the below,
> Linus suggesting moving to "//" type instead of starred ones for headers.
>
> It looks SPDX change is still a suggestion?

Well, I think this has been discussed and agreed among maintainers at
the plumbers summit and here on this list. And when I see Thomas doc
patches, Greg patches, other key maintainers as well as Linus comments
and reviews and guidelines, I can can hardly see this as just being a
"suggestion", but that's your call to consider it this way.

-- 
Cordially
Philippe Ombredanne
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [bpf-next V1-RFC PATCH 06/14] mlx4: setup xdp_rxq_info
From: Tariq Toukan @ 2017-12-13 12:42 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Daniel Borkmann, Alexei Starovoitov
  Cc: netdev, dsahern, gospo, bjorn.topel, michael.chan, Tariq Toukan
In-Reply-To: <151316399143.14967.3940027276821596115.stgit@firesoul>



On 13/12/2017 1:19 PM, Jesper Dangaard Brouer wrote:
> Driver hook points for xdp_rxq_info:
>   * init+reg: mlx4_en_create_rx_ring
>   * unreg   : mlx4_en_destroy_rx_ring
> 
> Cc: Tariq Toukan <tariqt@mellanox.com>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>   drivers/net/ethernet/mellanox/mlx4/en_netdev.c |    3 ++-
>   drivers/net/ethernet/mellanox/mlx4/en_rx.c     |   13 +++++++++++--
>   drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |    4 +++-
>   3 files changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> index 99051a294fa6..0cfcf3089ae4 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> @@ -2172,8 +2172,9 @@ static int mlx4_en_alloc_resources(struct mlx4_en_priv *priv)
>   
>   		if (mlx4_en_create_rx_ring(priv, &priv->rx_ring[i],
>   					   prof->rx_ring_size, priv->stride,
> -					   node))
> +					   node, i))
>   			goto err;
> +
>   	}
>   
>   #ifdef CONFIG_RFS_ACCEL
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 85e28efcda33..2091c9734e6a 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -262,7 +262,7 @@ void mlx4_en_set_num_rx_rings(struct mlx4_en_dev *mdev)
>   
>   int mlx4_en_create_rx_ring(struct mlx4_en_priv *priv,
>   			   struct mlx4_en_rx_ring **pring,
> -			   u32 size, u16 stride, int node)
> +			   u32 size, u16 stride, int node, int queue_index)
>   {
>   	struct mlx4_en_dev *mdev = priv->mdev;
>   	struct mlx4_en_rx_ring *ring;
> @@ -286,6 +286,12 @@ int mlx4_en_create_rx_ring(struct mlx4_en_priv *priv,
>   	ring->log_stride = ffs(ring->stride) - 1;
>   	ring->buf_size = ring->size * ring->stride + TXBB_SIZE;
>   
> +	/* XDP RX-queue info */
> +	xdp_rxq_info_init(&ring->xdp_rxq);
> +	ring->xdp_rxq.dev = priv->dev;
> +	ring->xdp_rxq.queue_index = queue_index;
> +	xdp_rxq_info_reg(&ring->xdp_rxq);
> +
>   	tmp = size * roundup_pow_of_two(MLX4_EN_MAX_RX_FRAGS *
>   					sizeof(struct mlx4_en_rx_alloc));
>   	ring->rx_info = vzalloc_node(tmp, node);
> @@ -318,6 +324,7 @@ int mlx4_en_create_rx_ring(struct mlx4_en_priv *priv,
>   	vfree(ring->rx_info);
>   	ring->rx_info = NULL;
>   err_ring:
> +	xdp_rxq_info_unreg(&ring->xdp_rxq);
>   	kfree(ring);
>   	*pring = NULL;
>   
> @@ -440,6 +447,7 @@ void mlx4_en_destroy_rx_ring(struct mlx4_en_priv *priv,
>   					lockdep_is_held(&mdev->state_lock));
>   	if (old_prog)
>   		bpf_prog_put(old_prog);
> +	xdp_rxq_info_unreg(&ring->xdp_rxq);
>   	mlx4_free_hwq_res(mdev->dev, &ring->wqres, size * stride + TXBB_SIZE);
>   	vfree(ring->rx_info);
>   	ring->rx_info = NULL;
> @@ -650,6 +658,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>   	int cq_ring = cq->ring;
>   	bool doorbell_pending;
>   	struct mlx4_cqe *cqe;
> +	struct xdp_buff xdp;
>   	int polled = 0;
>   	int index;
>   
> @@ -664,6 +673,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>   	/* Protect accesses to: ring->xdp_prog, priv->mac_hash list */
>   	rcu_read_lock();
>   	xdp_prog = rcu_dereference(ring->xdp_prog);
> +	xdp.rxq = &ring->xdp_rxq;

You moved this to update it only once, and not per packet. Right?
This is because all fields of struct xdp_buff used to be specific 
per-packet and filled in every iteration. Now you introduce a new field 
which holds a context.

Well, I still need to go over the infrastructure in your other patches, 
but from first glance it seems that we can use two separated structs: 
one for context, and one for per-packet info.

>   	doorbell_pending = 0;
>   
>   	/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
> @@ -748,7 +758,6 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>   		 * read bytes but not past the end of the frag.
>   		 */
>   		if (xdp_prog) {
> -			struct xdp_buff xdp;
>   			dma_addr_t dma;
>   			void *orig_data;
>   			u32 act;
> diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> index 1856e279a7e0..bdfb4362b35a 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> @@ -46,6 +46,7 @@
>   #endif
>   #include <linux/cpu_rmap.h>
>   #include <linux/ptp_clock_kernel.h>
> +#include <net/xdp.h>
>   
>   #include <linux/mlx4/device.h>
>   #include <linux/mlx4/qp.h>
> @@ -353,6 +354,7 @@ struct mlx4_en_rx_ring {
>   	unsigned long dropped;
>   	int hwtstamp_rx_filter;
>   	cpumask_var_t affinity_mask;
> +	struct xdp_rxq_info xdp_rxq;
>   };
>   
>   struct mlx4_en_cq {
> @@ -716,7 +718,7 @@ void mlx4_en_set_num_rx_rings(struct mlx4_en_dev *mdev);
>   void mlx4_en_recover_from_oom(struct mlx4_en_priv *priv);
>   int mlx4_en_create_rx_ring(struct mlx4_en_priv *priv,
>   			   struct mlx4_en_rx_ring **pring,
> -			   u32 size, u16 stride, int node);
> +			   u32 size, u16 stride, int node, int queue_index);
>   void mlx4_en_destroy_rx_ring(struct mlx4_en_priv *priv,
>   			     struct mlx4_en_rx_ring **pring,
>   			     u32 size, u16 stride);
> 

^ permalink raw reply

* Re: [bpf-next V1-RFC PATCH 02/14] xdp/mlx5: setup xdp_rxq_info and extend with qtype
From: Tariq Toukan @ 2017-12-13 12:27 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Daniel Borkmann, Alexei Starovoitov
  Cc: netdev, dsahern, Matan Barak, Saeed Mahameed, gospo, bjorn.topel,
	michael.chan, Tariq Toukan
In-Reply-To: <151316397109.14967.10661226040285486466.stgit@firesoul>

Hi Jesper,
Thanks for taking care of the drop RQ.

In general, mlx5 part looks ok to me.
Find a few comments below. Mostly pointing out some typos.

On 13/12/2017 1:19 PM, Jesper Dangaard Brouer wrote:
> The mlx5 driver have a special drop-RQ queue (one per interface) that
> simply drops all incoming traffic. It helps driver keep other HW
> objects (flow steering) alive upon down/up operations.  It is
> temporarily pointed by flow steering objects during the interface
> setup, and when interface is down. It lacks many fields that are set
> in a regular RQ (for example its state is never switched to
> MLX5_RQC_STATE_RDY). (Thanks to Tariq Toukan for explaination).
typo: explanation
> 
> The XDP RX-queue info API is extended with a queue-type, and mlx5 uses
> this kind of drop/sink-type (RXQ_TYPE_SINK) for this kind of sink queue.
> 
> Driver hook points for xdp_rxq_info:
>   * init+reg: mlx5e_alloc_rq()
>   * init+reg: mlx5e_alloc_drop_rq()
>   * unreg   : mlx5e_free_rq()
> 
> Tested on actual hardware with samples/bpf program
> 
> Cc: Saeed Mahameed <saeedm@mellanox.com>
> Cc: Matan Barak <matanb@mellanox.com>
> Cc: Tariq Toukan <tariqt@mellanox.com>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>   drivers/net/ethernet/mellanox/mlx5/core/en.h      |    4 ++++
>   drivers/net/ethernet/mellanox/mlx5/core/en_main.c |   14 +++++++++++++
>   drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   |    1 +
>   include/net/xdp.h                                 |   23 +++++++++++++++++++++
>   net/core/xdp.c                                    |    6 +++++
>   5 files changed, 48 insertions(+)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> index c0872b3284cb..fe10a042783b 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> @@ -46,6 +46,7 @@
>   #include <linux/mlx5/transobj.h>
>   #include <linux/rhashtable.h>
>   #include <net/switchdev.h>
> +#include <net/xdp.h>
>   #include "wq.h"
>   #include "mlx5_core.h"
>   #include "en_stats.h"
> @@ -568,6 +569,9 @@ struct mlx5e_rq {
>   	u32                    rqn;
>   	struct mlx5_core_dev  *mdev;
>   	struct mlx5_core_mkey  umr_mkey;
> +
> +	/* XDP read-mostly */
> +	struct xdp_rxq_info xdp_rxq;
>   } ____cacheline_aligned_in_smp;
>   
>   struct mlx5e_channel {
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index 0f5c012de52e..ea44b5f25e11 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -582,6 +582,12 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
>   	rq->ix      = c->ix;
>   	rq->mdev    = mdev;
>   
> +	/* XDP RX-queue info */
> +	xdp_rxq_info_init(&rq->xdp_rxq);
> +	rq->xdp_rxq.dev		= rq->netdev;
> +	rq->xdp_rxq.queue_index = rq->ix;
> +	xdp_rxq_info_reg(&rq->xdp_rxq);
> +
You don't set type here. This is ok as long as the following hold:
1) RXQ_TYPE_DEFAULT is zero
2) xdp_rxq is zalloc'ed.

>   	rq->xdp_prog = params->xdp_prog ? bpf_prog_inc(params->xdp_prog) : NULL;
>   	if (IS_ERR(rq->xdp_prog)) {
>   		err = PTR_ERR(rq->xdp_prog);
> @@ -695,6 +701,7 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
>   err_rq_wq_destroy:
>   	if (rq->xdp_prog)
>   		bpf_prog_put(rq->xdp_prog);
> +	xdp_rxq_info_unreg(&rq->xdp_rxq);
>   	mlx5_wq_destroy(&rq->wq_ctrl);
>   
>   	return err;
> @@ -707,6 +714,8 @@ static void mlx5e_free_rq(struct mlx5e_rq *rq)
>   	if (rq->xdp_prog)
>   		bpf_prog_put(rq->xdp_prog);
>   
> +	xdp_rxq_info_unreg(&rq->xdp_rxq);
> +
>   	switch (rq->wq_type) {
>   	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
>   		mlx5e_rq_free_mpwqe_info(rq);
> @@ -2768,6 +2777,11 @@ static int mlx5e_alloc_drop_rq(struct mlx5_core_dev *mdev,
>   	if (err)
>   		return err;
>   
> +	/* XDP RX-queue info for "Drop-RQ", packets never reach XDP */
> +	xdp_rxq_info_init(&rq->xdp_rxq);
> +	xdp_rxq_info_type(&rq->xdp_rxq, RXQ_TYPE_SINK);
> +	xdp_rxq_info_reg(&rq->xdp_rxq);
> +
>   	rq->mdev = mdev;
>   
>   	return 0;
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> index 5b499c7a698f..7b38480811d4 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -812,6 +812,7 @@ static inline int mlx5e_xdp_handle(struct mlx5e_rq *rq,
>   	xdp_set_data_meta_invalid(&xdp);
>   	xdp.data_end = xdp.data + *len;
>   	xdp.data_hard_start = va;
> +	xdp.rxq = &rq->xdp_rxq;
>   
>   	act = bpf_prog_run_xdp(prog, &xdp);
>   	switch (act) {
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index e4acd198fd60..5be560d943e1 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -36,10 +36,33 @@ struct xdp_rxq_info {
>   	struct net_device *dev;
>   	u32 queue_index;
>   	u32 reg_state;
> +	u32 qtype;
>   } ____cacheline_aligned; /* perf critical, avoid false-sharing */
>   
>   void xdp_rxq_info_init(struct xdp_rxq_info *xdp_rxq);
>   void xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq);
>   void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq);
>   
> +/**
> + * DOC: XDP RX-queue type
> + *
> + * The XDP RX-queue info can have associated a type.
> + *
> + * @RXQ_TYPE_DEFAULT: default no specifik queue type need to be specified

typo: specific

> + *
> + * @RXQ_TYPE_SINK: indicate a fake queue that never reach XDP RX
> + *	code.  Some drivers have a need to maintain a lower layer
> + *	RX-queue as a sink queue, while reconfiguring other RX-queues.
> + */
> +#define RXQ_TYPE_DEFAULT	0
> +#define RXQ_TYPE_SINK		1
> +#define RXQ_TYPE_MAX		RXQ_TYPE_SINK

Definitions of incremental numbers, enum might be best here, you can 
give them some enum type and use it in xdp_rxq_info->qtype.

> +
> +static inline
> +void xdp_rxq_info_type(struct xdp_rxq_info *xdp_rxq, u32 qtype)
> +{
> +	BUILD_BUG_ON(qtype > RXQ_TYPE_MAX);
> +	xdp_rxq->qtype = qtype;
> +}
> +
>   #endif /* __LINUX_NET_XDP_H__ */
> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index a9d2dd7b1ede..2a111f5987f6 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -32,8 +32,14 @@ EXPORT_SYMBOL_GPL(xdp_rxq_info_init);
>   
>   void xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq)
>   {
> +	if (xdp_rxq->qtype == RXQ_TYPE_SINK)
> +		goto skip_content_check;
> +
> +	/* Check information setup by driver code */
>   	WARN(!xdp_rxq->dev, "Missing net_device from driver");
>   	WARN(xdp_rxq->queue_index == U32_MAX, "Miss queue_index from driver");
> +
> +skip_content_check:
>   	WARN(!(xdp_rxq->reg_state == REG_STATE_NEW),"API violation, miss init");
>   	xdp_rxq->reg_state = REG_STATE_REGISTRED;
typo: REGISTERED (introduced in a previous patch)
>   }
> 

^ permalink raw reply

* Business Proposal Of $18,100,000.00
From: Mr Youichi Kanno @ 2017-12-13  7:48 UTC (permalink / raw)




Dear Sir/Madam,

My name is Youichi Kanno and I work in Audit & credit Supervisory role at 
The Norinchukin Bank,I am contacting you regarding the asset of a deceased 
client Mr. Grigor Kassan and I need your assistance to process the fund 
claims oF $18,100,000.00. if intreasted get back to me so we can discuss 
the logistic of moving the funds to a safe offshore bank.

Yours sincerely,
Youichi Kanno

^ permalink raw reply

* [PATCH] ipv6: ip6mr: Recalc UDP checksum before forwarding
From: Brendan McGrath @ 2017-12-13 11:20 UTC (permalink / raw)
  To: David S . Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, netdev,
	linux-kernel
  Cc: Brendan McGrath

Currently, when forwarding from a Virtual Interface to a Physical
Interface, ip_summed is set to a value of CHECKSUM_UNNECESSARY and
the UDP checksum has not been calculated.

When the packet is then forwarded by a Multicast Router, the checksum
value is left as is and therefore rejected by the receiving
machine(s).

This patch ensures the checksum is recalculated before forwarding.

Signed-off-by: Brendan McGrath <redmcg@redmandi.dyndns.org>
---

It's a bit ugly putting UDP specific code in this spot - but I'm not
aware of any other protocols that are:
a) multicast;
b) forwarded; and
c) checksummed

 net/ipv6/ip6mr.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index 890f9bda..ee4370a 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -2077,6 +2077,13 @@ static int ip6mr_forward2(struct net *net, struct mr6_table *mrt,
 	ipv6h = ipv6_hdr(skb);
 	ipv6h->hop_limit--;
 
+	if (ipv6h->nexthdr == NEXTHDR_UDP &&
+					skb->ip_summed != CHECKSUM_PARTIAL) {
+		struct udphdr *uh = udp_hdr(skb);
+		udp6_set_csum(false, skb, &ipv6_hdr(skb)->saddr,
+					&ipv6_hdr(skb)->daddr, ntohs(uh->len));
+	}
+
 	IP6CB(skb)->flags |= IP6SKB_FORWARDED;
 
 	return NF_HOOK(NFPROTO_IPV6, NF_INET_FORWARD,
-- 
2.7.4

^ permalink raw reply related

* [bpf-next V1-RFC PATCH 14/14] samples/bpf: program demonstrating access to xdp_rxq_info
From: Jesper Dangaard Brouer @ 2017-12-13 11:20 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov
  Cc: netdev, Jesper Dangaard Brouer, dsahern, gospo, bjorn.topel,
	michael.chan
In-Reply-To: <151316391502.14967.13292358380181773729.stgit@firesoul>

This sample program can be used for monitoring and reporting how many
packets per sec (pps) are received per NIC RX queue index and which
CPU processed the packet. In itself it is a useful tool for quickly
identifying RSS imbalance issues, see below.

The default XDP action is XDP_PASS in-order to provide a monitor
mode. For benchmarking purposes it is possible to specify other XDP
actions on the cmdline --action.

Output below shows an imbalance RSS case where most RXQ's deliver to
CPU-0 while CPU-2 only get packets from a single RXQ.  Looking at
things from a CPU level the two CPUs are processing approx the same
amount, BUT looking at the rx_queue_index levels it is clear that
RXQ-2 receive much better service, than other RXQs which all share CPU-0.

Running XDP on dev:i40e1 (ifindex:3) action:XDP_PASS
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       900,473     0
XDP-RX CPU      2       906,921     0
XDP-RX CPU      total   1,807,395

RXQ stats       RXQ:CPU pps         issue-pps
rx_queue_index    0:0   180,098     0
rx_queue_index    0:sum 180,098
rx_queue_index    1:0   180,098     0
rx_queue_index    1:sum 180,098
rx_queue_index    2:2   906,921     0
rx_queue_index    2:sum 906,921
rx_queue_index    3:0   180,098     0
rx_queue_index    3:sum 180,098
rx_queue_index    4:0   180,082     0
rx_queue_index    4:sum 180,082
rx_queue_index    5:0   180,093     0
rx_queue_index    5:sum 180,093

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 samples/bpf/Makefile            |    4 
 samples/bpf/xdp_rxq_info_kern.c |   96 +++++++
 samples/bpf/xdp_rxq_info_user.c |  533 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 633 insertions(+)
 create mode 100644 samples/bpf/xdp_rxq_info_kern.c
 create mode 100644 samples/bpf/xdp_rxq_info_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 4fb944a7ecf8..3ff7a05bea9a 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -41,6 +41,7 @@ hostprogs-y += xdp_redirect
 hostprogs-y += xdp_redirect_map
 hostprogs-y += xdp_redirect_cpu
 hostprogs-y += xdp_monitor
+hostprogs-y += xdp_rxq_info
 hostprogs-y += syscall_tp
 
 # Libbpf dependencies
@@ -90,6 +91,7 @@ xdp_redirect-objs := bpf_load.o $(LIBBPF) xdp_redirect_user.o
 xdp_redirect_map-objs := bpf_load.o $(LIBBPF) xdp_redirect_map_user.o
 xdp_redirect_cpu-objs := bpf_load.o $(LIBBPF) xdp_redirect_cpu_user.o
 xdp_monitor-objs := bpf_load.o $(LIBBPF) xdp_monitor_user.o
+xdp_rxq_info-objs := bpf_load.o $(LIBBPF) xdp_rxq_info_user.o
 syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o
 
 # Tell kbuild to always build the programs
@@ -139,6 +141,7 @@ always += xdp_redirect_kern.o
 always += xdp_redirect_map_kern.o
 always += xdp_redirect_cpu_kern.o
 always += xdp_monitor_kern.o
+always += xdp_rxq_info_kern.o
 always += syscall_tp_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
@@ -182,6 +185,7 @@ HOSTLOADLIBES_xdp_redirect += -lelf
 HOSTLOADLIBES_xdp_redirect_map += -lelf
 HOSTLOADLIBES_xdp_redirect_cpu += -lelf
 HOSTLOADLIBES_xdp_monitor += -lelf
+HOSTLOADLIBES_xdp_rxq_info += -lelf
 HOSTLOADLIBES_syscall_tp += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
diff --git a/samples/bpf/xdp_rxq_info_kern.c b/samples/bpf/xdp_rxq_info_kern.c
new file mode 100644
index 000000000000..19cbd662202c
--- /dev/null
+++ b/samples/bpf/xdp_rxq_info_kern.c
@@ -0,0 +1,96 @@
+/* Example howto extract XDP RX-queue info
+ *
+ * Copyright (c) 2017 Jesper Dangaard Brouer, Red Hat Inc.
+ * Released under terms in GPL version 2.  See COPYING.
+ */
+#include <uapi/linux/bpf.h>
+#include "bpf_helpers.h"
+
+/* Config setup from with userspace
+ *
+ * User-side setup ifindex in config_map, to verify that
+ * ctx->ingress_ifindex is correct (against configured ifindex)
+ */
+struct config {
+	__u32 action;
+	int ifindex;
+};
+struct bpf_map_def SEC("maps") config_map = {
+	.type		= BPF_MAP_TYPE_ARRAY,
+	.key_size	= sizeof(int),
+	.value_size	= sizeof(struct config),
+	.max_entries	= 1,
+};
+
+/* Common stats data record (shared with userspace) */
+struct datarec {
+	__u64 processed;
+	__u64 issue;
+};
+
+struct bpf_map_def SEC("maps") stats_global_map = {
+	.type		= BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size	= sizeof(u32),
+	.value_size	= sizeof(struct datarec),
+	.max_entries	= 1,
+};
+
+#define MAX_RXQs 64
+
+/* Stats per rx_queue_index (per CPU) */
+struct bpf_map_def SEC("maps") rx_queue_index_map = {
+	.type		= BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size	= sizeof(u32),
+	.value_size	= sizeof(struct datarec),
+	.max_entries	= MAX_RXQs + 1,
+};
+
+SEC("xdp_prog0")
+int  xdp_prognum0(struct xdp_md *ctx)
+{
+	void *data_end = (void *)(long)ctx->data_end;
+	void *data     = (void *)(long)ctx->data;
+	struct datarec *rec, *rxq_rec;
+	int ingress_ifindex;
+	struct config *config;
+	u32 key = 0;
+
+	/* Global stats record */
+	rec = bpf_map_lookup_elem(&stats_global_map, &key);
+	if (!rec)
+		return XDP_ABORTED;
+	rec->processed++;
+
+	/* Accessing ctx->ingress_ifindex, cause BPF to rewrite BPF
+	 * instructions inside kernel to access xdp_rxq->dev->ifindex
+	 */
+	ingress_ifindex = ctx->ingress_ifindex;
+
+	config = bpf_map_lookup_elem(&config_map, &key);
+	if (!config)
+		return XDP_ABORTED;
+
+	/* Simple test: check ctx provided ifindex is as expected */
+	if (ingress_ifindex != config->ifindex) {
+		/* count this error case */
+		rec->issue++;
+		return XDP_ABORTED;
+	}
+
+	/* Update stats per rx_queue_index. Handle if rx_queue_index
+	 * is larger than stats map can contain info for.
+	 */
+	key = ctx->rx_queue_index;
+	if (key >= MAX_RXQs)
+		key = MAX_RXQs;
+	rxq_rec = bpf_map_lookup_elem(&rx_queue_index_map, &key);
+	if (!rxq_rec)
+		return XDP_ABORTED;
+	rxq_rec->processed++;
+	if (key == MAX_RXQs)
+		rxq_rec->issue++;
+
+	return config->action;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/xdp_rxq_info_user.c b/samples/bpf/xdp_rxq_info_user.c
new file mode 100644
index 000000000000..8e540c33631c
--- /dev/null
+++ b/samples/bpf/xdp_rxq_info_user.c
@@ -0,0 +1,533 @@
+/*
+ * Copyright (c) 2017 Jesper Dangaard Brouer, Red Hat Inc.
+ * Released under terms in GPL version 2.  See COPYING.
+ */
+static const char *__doc__ = " XDP RX-queue info extract example\n\n"
+	"Monitor how many packets per sec (pps) are received\n"
+	"per NIC RX queue index and which CPU processed the packet\n"
+	;
+
+#include <errno.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <unistd.h>
+#include <locale.h>
+#include <sys/resource.h>
+#include <getopt.h>
+#include <net/if.h>
+#include <time.h>
+
+#include <arpa/inet.h>
+#include <linux/if_link.h>
+
+#include "libbpf.h"
+#include "bpf_load.h"
+#include "bpf_util.h"
+
+static int ifindex = -1;
+static char ifname_buf[IF_NAMESIZE];
+static char *ifname;
+
+static __u32 xdp_flags;
+
+/* Exit return codes */
+#define EXIT_OK		0
+#define EXIT_FAIL		1
+#define EXIT_FAIL_OPTION	2
+#define EXIT_FAIL_XDP		3
+#define EXIT_FAIL_BPF		4
+#define EXIT_FAIL_MEM		5
+
+static const struct option long_options[] = {
+	{"help",	no_argument,		NULL, 'h' },
+	{"dev",		required_argument,	NULL, 'd' },
+	{"skb-mode",	no_argument,		NULL, 'S' },
+	{"sec",		required_argument,	NULL, 's' },
+	{"no-separators", no_argument,		NULL, 'z' },
+	{"action", 	required_argument,	NULL, 'a' },
+	{0, 0, NULL,  0 }
+};
+
+static void int_exit(int sig)
+{
+	fprintf(stderr,
+		"Interrupted: Removing XDP program on ifindex:%d device:%s\n",
+		ifindex, ifname);
+	if (ifindex > -1)
+		set_link_xdp_fd(ifindex, -1, xdp_flags);
+	exit(EXIT_OK);
+}
+
+struct config {
+	__u32 action;
+	int ifindex;
+};
+#define XDP_ACTION_MAX (XDP_TX + 1)
+#define XDP_ACTION_MAX_STRLEN 11
+static const char *xdp_action_names[XDP_ACTION_MAX] = {
+	[XDP_ABORTED]	= "XDP_ABORTED",
+	[XDP_DROP]	= "XDP_DROP",
+	[XDP_PASS]	= "XDP_PASS",
+	[XDP_TX]	= "XDP_TX",
+};
+
+static const char *action2str(int action)
+{
+	if (action < XDP_ACTION_MAX)
+		return xdp_action_names[action];
+	return NULL;
+}
+
+static int parse_xdp_action(char *action_str)
+{
+	size_t maxlen;
+	__u64 action = -1;
+	int i;
+
+	for (i = 0; i < XDP_ACTION_MAX; i++) {
+		maxlen = XDP_ACTION_MAX_STRLEN;
+		if (strncmp(xdp_action_names[i], action_str, maxlen) == 0) {
+			action = i;
+			break;
+		}
+	}
+	return action;
+}
+
+static void list_xdp_actions(void)
+{
+	int i;
+
+	printf("Available XDP --action <options>\n");
+	for (i = 0; i < XDP_ACTION_MAX; i++) {
+		printf("\t%s\n", xdp_action_names[i]);
+	}
+	printf("\n");
+}
+
+static void usage(char *argv[])
+{
+	int i;
+
+	printf("\nDOCUMENTATION:\n%s\n", __doc__);
+	printf(" Usage: %s (options-see-below)\n", argv[0]);
+	printf(" Listing options:\n");
+	for (i = 0; long_options[i].name != 0; i++) {
+		printf(" --%-12s", long_options[i].name);
+		if (long_options[i].flag != NULL)
+			printf(" flag (internal value:%d)",
+				*long_options[i].flag);
+		else
+			printf(" short-option: -%c",
+				long_options[i].val);
+		printf("\n");
+	}
+	printf("\n");
+	list_xdp_actions();
+}
+
+#define NANOSEC_PER_SEC 1000000000 /* 10^9 */
+static __u64 gettime(void)
+{
+	struct timespec t;
+	int res;
+
+	res = clock_gettime(CLOCK_MONOTONIC, &t);
+	if (res < 0) {
+		fprintf(stderr, "Error with gettimeofday! (%i)\n", res);
+		exit(EXIT_FAIL);
+	}
+	return (__u64) t.tv_sec * NANOSEC_PER_SEC + t.tv_nsec;
+}
+
+/* Common stats data record shared with _kern.c */
+struct datarec {
+	__u64 processed;
+	__u64 issue;
+};
+struct record {
+	__u64 timestamp;
+	struct datarec total;
+	struct datarec *cpu;
+};
+struct stats_record {
+	struct record stats;
+	struct record *rxq;
+};
+
+static struct datarec *alloc_record_per_cpu(void)
+{
+	unsigned int nr_cpus = bpf_num_possible_cpus();
+	struct datarec *array;
+	size_t size;
+
+	size = sizeof(struct datarec) * nr_cpus;
+	array = malloc(size);
+	memset(array, 0, size);
+	if (!array) {
+		fprintf(stderr, "Mem alloc error (nr_cpus:%u)\n", nr_cpus);
+		exit(EXIT_FAIL_MEM);
+	}
+	return array;
+}
+
+static struct record *alloc_record_per_rxq(void)
+{
+	unsigned int nr_rxqs = map_data[2].def.max_entries;
+	struct record *array;
+	size_t size;
+
+	size = sizeof(struct record) * nr_rxqs;
+	array = malloc(size);
+	memset(array, 0, size);
+	if (!array) {
+		fprintf(stderr, "Mem alloc error (nr_rxqs:%u)\n", nr_rxqs);
+		exit(EXIT_FAIL_MEM);
+	}
+	return array;
+}
+
+static struct stats_record *alloc_stats_record(void)
+{
+	unsigned int nr_rxqs = map_data[2].def.max_entries;
+	struct stats_record *rec;
+	int i;
+
+	rec = malloc(sizeof(*rec));
+	memset(rec, 0, sizeof(*rec));
+	if (!rec) {
+		fprintf(stderr, "Mem alloc error\n");
+		exit(EXIT_FAIL_MEM);
+	}
+	rec->rxq = alloc_record_per_rxq();
+	for (i = 0; i < nr_rxqs; i++) {
+		rec->rxq[i].cpu = alloc_record_per_cpu();
+	}
+	rec->stats.cpu = alloc_record_per_cpu();
+	return rec;
+}
+
+static void free_stats_record(struct stats_record *r)
+{
+	unsigned int nr_rxqs = map_data[2].def.max_entries;
+	int i;
+
+	for (i = 0; i < nr_rxqs; i++) {
+		free(r->rxq[i].cpu);
+	}
+	free(r->rxq);
+	free(r->stats.cpu);
+	free(r);
+}
+
+static bool map_collect_percpu(int fd, __u32 key, struct record *rec)
+{
+	/* For percpu maps, userspace gets a value per possible CPU */
+	unsigned int nr_cpus = bpf_num_possible_cpus();
+	struct datarec values[nr_cpus];
+	__u64 sum_processed = 0;
+	__u64 sum_issue = 0;
+	int i;
+
+	if ((bpf_map_lookup_elem(fd, &key, values)) != 0) {
+		fprintf(stderr,
+			"ERR: bpf_map_lookup_elem failed key:0x%X\n", key);
+		return false;
+	}
+	/* Get time as close as possible to reading map contents */
+	rec->timestamp = gettime();
+
+	/* Record and sum values from each CPU */
+	for (i = 0; i < nr_cpus; i++) {
+		rec->cpu[i].processed = values[i].processed;
+		sum_processed        += values[i].processed;
+		rec->cpu[i].issue = values[i].issue;
+		sum_issue        += values[i].issue;
+	}
+	rec->total.processed = sum_processed;
+	rec->total.issue     = sum_issue;
+	return true;
+}
+
+static void stats_collect(struct stats_record *rec)
+{
+	int fd, i, max_rxqs;
+
+	fd = map_data[1].fd; /* map: stats_global_map */
+	map_collect_percpu(fd, 0, &rec->stats);
+
+	fd = map_data[2].fd; /* map: rx_queue_index_map */
+	max_rxqs = map_data[2].def.max_entries;
+	for (i = 0; i < max_rxqs; i++)
+		map_collect_percpu(fd, i, &rec->rxq[i]);
+}
+
+static double calc_period(struct record *r, struct record *p)
+{
+	double period_ = 0;
+	__u64 period = 0;
+
+	period = r->timestamp - p->timestamp;
+	if (period > 0)
+		period_ = ((double) period / NANOSEC_PER_SEC);
+
+	return period_;
+}
+
+static __u64 calc_pps(struct datarec *r, struct datarec *p, double period_)
+{
+	__u64 packets = 0;
+	__u64 pps = 0;
+
+	if (period_ > 0) {
+		packets = r->processed - p->processed;
+		pps = packets / period_;
+	}
+	return pps;
+}
+
+static __u64 calc_errs_pps(struct datarec *r,
+			    struct datarec *p, double period_)
+{
+	__u64 packets = 0;
+	__u64 pps = 0;
+
+	if (period_ > 0) {
+		packets = r->issue - p->issue;
+		pps = packets / period_;
+	}
+	return pps;
+}
+
+static void stats_print(struct stats_record *stats_rec,
+			struct stats_record *stats_prev,
+			int action)
+{
+	unsigned int nr_cpus = bpf_num_possible_cpus();
+	unsigned int nr_rxqs = map_data[2].def.max_entries;
+	double pps = 0, err = 0;
+	struct record *rec, *prev;
+	double t;
+	int rxq;
+	int i;
+
+	/* Header */
+	printf("\nRunning XDP on dev:%s (ifindex:%d) action:%s\n",
+	       ifname, ifindex, action2str(action));
+
+	/* stats_global_map */
+	{
+		char *fmt_rx = "%-15s %-7d %'-11.0f %'-10.0f %s\n";
+		char *fm2_rx = "%-15s %-7s %'-11.0f\n";
+		char *errstr = "";
+
+		printf("%-15s %-7s %-11s %-11s\n",
+		       "XDP stats", "CPU", "pps", "issue-pps");
+
+		rec  =  &stats_rec->stats;
+		prev = &stats_prev->stats;
+		t = calc_period(rec, prev);
+		for (i = 0; i < nr_cpus; i++) {
+			struct datarec *r = &rec->cpu[i];
+			struct datarec *p = &prev->cpu[i];
+
+			pps = calc_pps     (r, p, t);
+			err = calc_errs_pps(r, p, t);
+			if (err > 0)
+				errstr = "invalid-ifindex";
+			if (pps > 0)
+				printf(fmt_rx, "XDP-RX CPU",
+					i, pps, err, errstr);
+		}
+		pps  = calc_pps     (&rec->total, &prev->total, t);
+		err  = calc_errs_pps(&rec->total, &prev->total, t);
+		printf(fm2_rx, "XDP-RX CPU", "total", pps, err);
+	}
+
+	/* rx_queue_index_map */
+	printf("\n%-15s %-7s %-11s %-11s\n",
+	       "RXQ stats", "RXQ:CPU", "pps", "issue-pps");
+
+	for (rxq = 0; rxq < nr_rxqs; rxq++) {
+		char *fmt_rx = "%-15s %3d:%-3d %'-11.0f %'-10.0f %s\n";
+		char *fm2_rx = "%-15s %3d:%-3s %'-11.0f\n";
+		char *errstr = "";
+		int rxq_ = rxq;
+
+		/* Last RXQ in map catch overflows */
+		if (rxq_ == nr_rxqs - 1)
+			rxq_ = -1;
+
+		rec  =  &stats_rec->rxq[rxq];
+		prev = &stats_prev->rxq[rxq];
+		t = calc_period(rec, prev);
+		for (i = 0; i < nr_cpus; i++) {
+			struct datarec *r = &rec->cpu[i];
+			struct datarec *p = &prev->cpu[i];
+
+			pps = calc_pps     (r, p, t);
+			err = calc_errs_pps(r, p, t);
+			if (err > 0) {
+				if (rxq_ == -1)
+					errstr = "map-overflow-RXQ";
+				else
+					errstr = "err";
+			}
+			if (pps > 0)
+				printf(fmt_rx, "rx_queue_index",
+				       rxq_, i, pps, err, errstr);
+		}
+		pps  = calc_pps     (&rec->total, &prev->total, t);
+		err  = calc_errs_pps(&rec->total, &prev->total, t);
+		if (pps || err)
+			printf(fm2_rx, "rx_queue_index", rxq_, "sum", pps, err);
+	}
+}
+
+
+/* Pointer swap trick */
+static inline void swap(struct stats_record **a, struct stats_record **b)
+{
+	struct stats_record *tmp;
+
+	tmp = *a;
+	*a = *b;
+	*b = tmp;
+}
+
+static void stats_poll(int interval, int action)
+{
+	struct stats_record *record, *prev;
+
+	record = alloc_stats_record();
+	prev   = alloc_stats_record();
+	stats_collect(record);
+
+	while (1) {
+		swap(&prev, &record);
+		stats_collect(record);
+		stats_print(record, prev, action);
+		sleep(interval);
+	}
+
+	free_stats_record(record);
+	free_stats_record(prev);
+}
+
+
+int main(int argc, char **argv)
+{
+	struct rlimit r = {10 * 1024 * 1024, RLIM_INFINITY};
+	bool use_separators = true;
+	struct config cfg = { 0 };
+	char filename[256];
+	int longindex = 0;
+	int interval = 2;
+	__u32 key = 0;
+	int opt, err;
+
+	char action_str_buf[XDP_ACTION_MAX_STRLEN + 1 /* for \0 */] = { 0 };
+	int action = XDP_PASS; /* Default action */
+	char *action_str = NULL;
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+	if (setrlimit(RLIMIT_MEMLOCK, &r)) {
+		perror("setrlimit(RLIMIT_MEMLOCK)");
+		return 1;
+	}
+
+	if (load_bpf_file(filename)) {
+		fprintf(stderr, "ERR in load_bpf_file(): %s", bpf_log_buf);
+		return EXIT_FAIL;
+	}
+
+	if (!prog_fd[0]) {
+		fprintf(stderr, "ERR: load_bpf_file: %s\n", strerror(errno));
+		return EXIT_FAIL;
+	}
+
+	/* Parse commands line args */
+	while ((opt = getopt_long(argc, argv, "hSd:",
+				  long_options, &longindex)) != -1) {
+		switch (opt) {
+		case 'd':
+			if (strlen(optarg) >= IF_NAMESIZE) {
+				fprintf(stderr, "ERR: --dev name too long\n");
+				goto error;
+			}
+			ifname = (char *)&ifname_buf;
+			strncpy(ifname, optarg, IF_NAMESIZE);
+			ifindex = if_nametoindex(ifname);
+			if (ifindex == 0) {
+				fprintf(stderr,
+					"ERR: --dev name unknown err(%d):%s\n",
+					errno, strerror(errno));
+				goto error;
+			}
+			break;
+		case 's':
+			interval = atoi(optarg);
+			break;
+		case 'S':
+			xdp_flags |= XDP_FLAGS_SKB_MODE;
+			break;
+		case 'z':
+			use_separators = false;
+			break;
+		case 'a':
+			action_str = (char *)&action_str_buf;
+			strncpy(action_str, optarg, XDP_ACTION_MAX_STRLEN);
+			break;
+		case 'h':
+		error:
+		default:
+			usage(argv);
+			return EXIT_FAIL_OPTION;
+		}
+	}
+	/* Required option */
+	if (ifindex == -1) {
+		fprintf(stderr, "ERR: required option --dev missing\n");
+		usage(argv);
+		return EXIT_FAIL_OPTION;
+	}
+	cfg.ifindex = ifindex;
+
+	/* Parse action string */
+	if (action_str) {
+		action = parse_xdp_action(action_str);
+		if (action < 0) {
+			fprintf(stderr, "ERR: Invalid XDP --action: %s\n",
+				action_str);
+			list_xdp_actions();
+			return EXIT_FAIL_OPTION;
+		}
+	}
+	cfg.action = action;
+
+	/* Trick to pretty printf with thousands separators use %' */
+	if (use_separators)
+		setlocale(LC_NUMERIC, "en_US");
+
+	/* User-side setup ifindex in config_map */
+	err = bpf_map_update_elem(map_fd[0], &key, &cfg, 0);
+	if (err) {
+		fprintf(stderr, "Store config failed (err:%d)\n", err);
+		exit(EXIT_FAIL_BPF);
+	}
+
+	/* Remove XDP program when program is interrupted */
+	signal(SIGINT, int_exit);
+
+	if (set_link_xdp_fd(ifindex, prog_fd[0], xdp_flags) < 0) {
+		fprintf(stderr, "link set xdp fd failed\n");
+		return EXIT_FAIL_XDP;
+	}
+
+	stats_poll(interval, action);
+	return EXIT_OK;
+}

^ permalink raw reply related

* [bpf-next V1-RFC PATCH 13/14] bpf: finally expose xdp_rxq_info to XDP bpf-programs
From: Jesper Dangaard Brouer @ 2017-12-13 11:20 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov
  Cc: netdev, Jesper Dangaard Brouer, dsahern, gospo, bjorn.topel,
	michael.chan
In-Reply-To: <151316391502.14967.13292358380181773729.stgit@firesoul>

Now all XDP driver have been updated to setup xdp_rxq_info and assign
this to xdp_buff->rxq.  Thus, it is now safe to enable access to some
of the xdp_rxq_info struct members.

This patch extend xdp_md and expose UAPI to userspace for
ingress_ifindex and rx_queue_index.  Access happens via bpf
instruction rewrite, that load data directly from struct xdp_rxq_info.

* ingress_ifindex map to xdp_rxq_info->dev->ifindex
* rx_queue_index  map to xdp_rxq_info->queue_index

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/uapi/linux/bpf.h |    3 +++
 net/core/filter.c        |   19 +++++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 595bda120cfb..512d456dc469 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -893,6 +893,9 @@ struct xdp_md {
 	__u32 data;
 	__u32 data_end;
 	__u32 data_meta;
+	/* Below access go though struct xdp_rxq_info */
+	__u32 ingress_ifindex; /* rxq->dev->ifindex */
+	__u32 rx_queue_index;  /* rxq->queue_index  */
 };
 
 enum sk_action {
diff --git a/net/core/filter.c b/net/core/filter.c
index 754abe1041b7..676ee9860019 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4303,6 +4303,25 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type type,
 				      si->dst_reg, si->src_reg,
 				      offsetof(struct xdp_buff, data_end));
 		break;
+	case offsetof(struct xdp_md, ingress_ifindex):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, rxq),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct xdp_buff, rxq));
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_rxq_info, dev),
+				      si->dst_reg, si->dst_reg,
+				      offsetof(struct xdp_rxq_info, dev));
+		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
+				      bpf_target_off(struct net_device,
+						     ifindex, 4, target_size));
+		break;
+	case offsetof(struct xdp_md, rx_queue_index):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, rxq),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct xdp_buff, rxq));
+		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
+				      bpf_target_off(struct xdp_rxq_info,
+						queue_index, 4, target_size));
+		break;
 	}
 
 	return insn - insn_buf;

^ permalink raw reply related

* [bpf-next V1-RFC PATCH 12/14] xdp: generic XDP handling of xdp_rxq_info
From: Jesper Dangaard Brouer @ 2017-12-13 11:20 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov
  Cc: netdev, Jesper Dangaard Brouer, dsahern, gospo, bjorn.topel,
	michael.chan
In-Reply-To: <151316391502.14967.13292358380181773729.stgit@firesoul>

Hook points for xdp_rxq_info:
 * init+reg: netif_alloc_rx_queues
 * unreg   : netif_free_rx_queues

The net_device have some members (num_rx_queues + real_num_rx_queues)
and data-area (dev->_rx with struct netdev_rx_queue's) that were
primarily used for exporting information about RPS (CONFIG_RPS) queues
to sysfs (CONFIG_SYSFS).

For generic XDP extend struct netdev_rx_queue with the xdp_rxq_info,
and remove some of the CONFIG_SYSFS ifdefs.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/linux/netdevice.h |    2 ++
 net/core/dev.c            |   60 ++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 52 insertions(+), 10 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cc4ce7456e38..43595b037872 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -44,6 +44,7 @@
 #include <net/dcbnl.h>
 #endif
 #include <net/netprio_cgroup.h>
+#include <net/xdp.h>
 
 #include <linux/netdev_features.h>
 #include <linux/neighbour.h>
@@ -686,6 +687,7 @@ struct netdev_rx_queue {
 #endif
 	struct kobject			kobj;
 	struct net_device		*dev;
+	struct xdp_rxq_info		xdp_rxq;
 } ____cacheline_aligned_in_smp;
 
 /*
diff --git a/net/core/dev.c b/net/core/dev.c
index 6bea8931bb62..44932d6194a2 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3873,9 +3873,33 @@ static int enqueue_to_backlog(struct sk_buff *skb, int cpu,
 	return NET_RX_DROP;
 }
 
+static struct netdev_rx_queue *netif_get_rxqueue(struct sk_buff *skb)
+{
+	struct net_device *dev = skb->dev;
+	struct netdev_rx_queue *rxqueue;
+
+	rxqueue = dev->_rx;
+
+	if (skb_rx_queue_recorded(skb)) {
+		u16 index = skb_get_rx_queue(skb);
+
+		if (unlikely(index >= dev->real_num_rx_queues)) {
+			WARN_ONCE(dev->real_num_rx_queues > 1,
+				  "%s received packet on queue %u, but number "
+				  "of RX queues is %u\n",
+				  dev->name, index, dev->real_num_rx_queues);
+
+			return rxqueue; /* Return first rxqueue */
+		}
+		rxqueue += index;
+	}
+	return rxqueue;
+}
+
 static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 				     struct bpf_prog *xdp_prog)
 {
+	struct netdev_rx_queue *rxqueue;
 	u32 metalen, act = XDP_DROP;
 	struct xdp_buff xdp;
 	void *orig_data;
@@ -3919,6 +3943,9 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 	xdp.data_hard_start = skb->data - skb_headroom(skb);
 	orig_data = xdp.data;
 
+	rxqueue = netif_get_rxqueue(skb);
+	xdp.rxq = &rxqueue->xdp_rxq;
+
 	act = bpf_prog_run_xdp(xdp_prog, &xdp);
 
 	off = xdp.data - orig_data;
@@ -7538,7 +7565,6 @@ void netif_stacked_transfer_operstate(const struct net_device *rootdev,
 }
 EXPORT_SYMBOL(netif_stacked_transfer_operstate);
 
-#ifdef CONFIG_SYSFS
 static int netif_alloc_rx_queues(struct net_device *dev)
 {
 	unsigned int i, count = dev->num_rx_queues;
@@ -7553,11 +7579,31 @@ static int netif_alloc_rx_queues(struct net_device *dev)
 
 	dev->_rx = rx;
 
-	for (i = 0; i < count; i++)
+	for (i = 0; i < count; i++) {
 		rx[i].dev = dev;
+
+		/* XDP RX-queue setup */
+		xdp_rxq_info_init(&rx[i].xdp_rxq);
+		rx[i].xdp_rxq.dev = dev;
+		rx[i].xdp_rxq.queue_index = i;
+		xdp_rxq_info_reg(&rx[i].xdp_rxq);
+	}
 	return 0;
 }
-#endif
+
+static void netif_free_rx_queues(struct net_device *dev)
+{
+	unsigned int i, count = dev->num_rx_queues;
+	struct netdev_rx_queue *rx;
+
+	if (!dev->_rx) /* netif_alloc_rx_queues alloc failed */
+		return;
+
+	rx = dev->_rx;
+
+	for (i = 0; i < count; i++)
+		xdp_rxq_info_unreg(&rx[i].xdp_rxq);
+}
 
 static void netdev_init_one_queue(struct net_device *dev,
 				  struct netdev_queue *queue, void *_unused)
@@ -8118,12 +8164,10 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 		return NULL;
 	}
 
-#ifdef CONFIG_SYSFS
 	if (rxqs < 1) {
 		pr_err("alloc_netdev: Unable to allocate device with zero RX queues\n");
 		return NULL;
 	}
-#endif
 
 	alloc_size = sizeof(struct net_device);
 	if (sizeof_priv) {
@@ -8180,12 +8224,10 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	if (netif_alloc_netdev_queues(dev))
 		goto free_all;
 
-#ifdef CONFIG_SYSFS
 	dev->num_rx_queues = rxqs;
 	dev->real_num_rx_queues = rxqs;
 	if (netif_alloc_rx_queues(dev))
 		goto free_all;
-#endif
 
 	strcpy(dev->name, name);
 	dev->name_assign_type = name_assign_type;
@@ -8224,9 +8266,7 @@ void free_netdev(struct net_device *dev)
 
 	might_sleep();
 	netif_free_tx_queues(dev);
-#ifdef CONFIG_SYSFS
-	kvfree(dev->_rx);
-#endif
+	netif_free_rx_queues(dev);
 
 	kfree(rcu_dereference_protected(dev->ingress_queue, 1));
 

^ permalink raw reply related

* [bpf-next V1-RFC PATCH 10/14] tun: setup xdp_rxq_info
From: Jesper Dangaard Brouer @ 2017-12-13 11:20 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov
  Cc: Willem de Bruijn, Michael S. Tsirkin, netdev, Jason Wang, dsahern,
	Jesper Dangaard Brouer, gospo, bjorn.topel, michael.chan
In-Reply-To: <151316391502.14967.13292358380181773729.stgit@firesoul>

Driver hook points for xdp_rxq_info:
 * init+reg: tun_attach
 * unreg   : __tun_detach

I've done some manual testing of this tun driver, but I would
appriciate good review and someone else running their use-case tests,
as I'm not 100% sure I understand the tfile->detached semantics.

Cc: Jason Wang <jasowang@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/tun.c |   25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index e367d6310353..f1df08c2c541 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -180,6 +180,7 @@ struct tun_file {
 	struct list_head next;
 	struct tun_struct *detached;
 	struct skb_array tx_array;
+	struct xdp_rxq_info xdp_rxq;
 };
 
 struct tun_flow_entry {
@@ -687,8 +688,10 @@ static void __tun_detach(struct tun_file *tfile, bool clean)
 			    tun->dev->reg_state == NETREG_REGISTERED)
 				unregister_netdevice(tun->dev);
 		}
-		if (tun)
+		if (tun) {
 			skb_array_cleanup(&tfile->tx_array);
+			xdp_rxq_info_unreg(&tfile->xdp_rxq);
+		}
 		sock_put(&tfile->sk);
 	}
 }
@@ -728,11 +731,15 @@ static void tun_detach_all(struct net_device *dev)
 		tun_napi_del(tun, tfile);
 		/* Drop read queue */
 		tun_queue_purge(tfile);
+		skb_array_cleanup(&tfile->tx_array);
+		xdp_rxq_info_unreg(&tfile->xdp_rxq);
 		sock_put(&tfile->sk);
 	}
 	list_for_each_entry_safe(tfile, tmp, &tun->disabled, next) {
 		tun_enable_queue(tfile);
 		tun_queue_purge(tfile);
+		skb_array_cleanup(&tfile->tx_array);
+		xdp_rxq_info_unreg(&tfile->xdp_rxq);
 		sock_put(&tfile->sk);
 	}
 	BUG_ON(tun->numdisabled != 0);
@@ -784,6 +791,21 @@ static int tun_attach(struct tun_struct *tun, struct file *file,
 
 	tfile->queue_index = tun->numqueues;
 	tfile->socket.sk->sk_shutdown &= ~RCV_SHUTDOWN;
+
+	if (tfile->detached) {
+		/* Re-attach detached tfile, updating XDP queue_index */
+		WARN_ON(!xdp_rxq_info_is_reg(&tfile->xdp_rxq));
+
+		if (tfile->xdp_rxq.queue_index    != tfile->queue_index)
+			tfile->xdp_rxq.queue_index = tfile->queue_index;
+	} else {
+		/* Setup XDP RX-queue info, for new tfile getting attached */
+		xdp_rxq_info_init(&tfile->xdp_rxq);
+		tfile->xdp_rxq.dev = tun->dev;
+		tfile->xdp_rxq.queue_index = tfile->queue_index;
+		xdp_rxq_info_reg(&tfile->xdp_rxq);
+	}
+
 	rcu_assign_pointer(tfile->tun, tun);
 	rcu_assign_pointer(tun->tfiles[tun->numqueues], tfile);
 	tun->numqueues++;
@@ -1508,6 +1530,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 		xdp.data = buf + pad;
 		xdp_set_data_meta_invalid(&xdp);
 		xdp.data_end = xdp.data + len;
+		xdp.rxq = &tfile->xdp_rxq;
 		orig_data = xdp.data;
 		act = bpf_prog_run_xdp(xdp_prog, &xdp);
 

^ permalink raw reply related

* [bpf-next V1-RFC PATCH 11/14] virtio_net: setup xdp_rxq_info
From: Jesper Dangaard Brouer @ 2017-12-13 11:20 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov
  Cc: Michael S. Tsirkin, netdev, Jesper Dangaard Brouer,
	virtualization, dsahern, gospo, bjorn.topel, michael.chan
In-Reply-To: <151316391502.14967.13292358380181773729.stgit@firesoul>

The virtio_net driver doesn't dynamically change the RX-ring queue
layout and backing pages, but instead reject XDP setup if all the
conditions for XDP is not meet.  Thus, the xdp_rxq_info also remains
fairly static.  This allow us to simply add the init+reg/unreg to
net_device open/close functions.

Driver hook points for xdp_rxq_info:
 * init+reg: virtnet_open
 * unreg   : virtnet_close

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/virtio_net.c |   12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 19a985ef9104..9792183befbf 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -31,6 +31,7 @@
 #include <linux/average.h>
 #include <linux/filter.h>
 #include <net/route.h>
+#include <net/xdp.h>
 
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -115,6 +116,8 @@ struct receive_queue {
 
 	/* Name of this receive queue: input.$index */
 	char name[40];
+
+	struct xdp_rxq_info xdp_rxq;
 };
 
 struct virtnet_info {
@@ -556,6 +559,7 @@ static struct sk_buff *receive_small(struct net_device *dev,
 		xdp.data = xdp.data_hard_start + xdp_headroom;
 		xdp_set_data_meta_invalid(&xdp);
 		xdp.data_end = xdp.data + len;
+		xdp.rxq = &rq->xdp_rxq;
 		orig_data = xdp.data;
 		act = bpf_prog_run_xdp(xdp_prog, &xdp);
 
@@ -1229,6 +1233,13 @@ static int virtnet_open(struct net_device *dev)
 			/* Make sure we have some buffers: if oom use wq. */
 			if (!try_fill_recv(vi, &vi->rq[i], GFP_KERNEL))
 				schedule_delayed_work(&vi->refill, 0);
+
+		/* XDP RX queue info */
+		xdp_rxq_info_init(&vi->rq[i].xdp_rxq);
+		vi->rq[i].xdp_rxq.dev = dev;
+		vi->rq[i].xdp_rxq.queue_index = i;
+		xdp_rxq_info_reg(&vi->rq[i].xdp_rxq);
+
 		virtnet_napi_enable(vi->rq[i].vq, &vi->rq[i].napi);
 		virtnet_napi_tx_enable(vi, vi->sq[i].vq, &vi->sq[i].napi);
 	}
@@ -1557,6 +1568,7 @@ static int virtnet_close(struct net_device *dev)
 	cancel_delayed_work_sync(&vi->refill);
 
 	for (i = 0; i < vi->max_queue_pairs; i++) {
+		xdp_rxq_info_unreg(&vi->rq[i].xdp_rxq);
 		napi_disable(&vi->rq[i].napi);
 		virtnet_napi_tx_disable(&vi->sq[i].napi);
 	}

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox