Netdev List


* Re: [PATCH 2/8] batman-adv: Disable CONFIG_BATMAN_ADV_DEBUGFS by default
From: Sergei Shtylyov @ 2018-05-25 15:56 UTC (permalink / raw)
  To: Sven Eckelmann
  Cc: Joe Perches, netdev-u79uwXL29TY76Z2rM5mHXA,
	b.a.t.m.a.n-ZwoEplunGu2X36UT3dwllkB+6BGkLq7r,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q
In-Reply-To: <2273548.5hmqMeOsCk@bentobox>

On 05/25/2018 02:15 PM, Sven Eckelmann wrote:

>>> [...]
>>>>> --- a/net/batman-adv/Kconfig
>>>>> +++ b/net/batman-adv/Kconfig
>>>>> @@ -94,13 +94,13 @@ config BATMAN_ADV_DEBUGFS
>>>>> bool "batman-adv debugfs entries"
>>>>> depends on BATMAN_ADV
>>>>> depends on DEBUG_FS
>>>>> -       default y
>>>>> +       default n
>>>>
>>>> N is the default default. :-) You don't need this line.
>>>
>>> Hm, looks like this would have to be changed in a lot of places (~782
>>> according to `git grep 'default n$'|wc -l` in my slightly outdated linux-
>>> next). Do you want to fix it everywhere?
>>
>>     No, but we can at least not add the new ones...
> 
> But the patch was added to net-next yesterday.

   DaveM is still too fast for me. :-)

> Kind regards,
> 	Sven

MBR, Sergei

* [PATCH v4 bpf-next 6/6] selftests/bpf: Selftest for sys_sendmsg hooks
From: Andrey Ignatov @ 2018-05-25 15:55 UTC (permalink / raw)
  To: netdev; +Cc: Andrey Ignatov, davem, kafai, ast, daniel, kernel-team
In-Reply-To: <cover.1527263217.git.rdna@fb.com>

Add selftest for BPF_CGROUP_UDP4_SENDMSG and BPF_CGROUP_UDP6_SENDMSG
attach types.

Try to sendmsg(2) to specific IP:port and test that:
* source IP is overridden as expected;
* remote IP:port pair is overridden as expected.

Both UDPv4 and UDPv6 are tested.
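
The hex constants in sendmsg4_prog.c below are host-order values passed
through bpf_htonl(), while test_sock_addr.c spells the same addresses as
dotted-quad strings. A minimal userspace sketch for cross-checking the two
spellings (ip4_const_matches() is a hypothetical helper, not part of this
patch):

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if the host-order constant names the same
 * IPv4 address as the dotted-quad string, 0 if it does not, and -1 if the
 * string cannot be parsed.
 */
int ip4_const_matches(uint32_t host_order, const char *text)
{
	struct in_addr addr;

	if (inet_pton(AF_INET, text, &addr) != 1)
		return -1;

	/* inet_pton() stores the address in network byte order. */
	return addr.s_addr == htonl(host_order);
}
```

For example, ip4_const_matches(0xAC100001U, "172.16.0.1") returns 1, which
corresponds to the SRC1_IP4 / SRC4_IP pair used by the test.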

Output:
  # test_sock_addr.sh 2>/dev/null
  Wait for testing IPv4/IPv6 to become available ... OK
  ... pre-existing test-cases skipped ...
  Test case: sendmsg4: load prog with wrong expected attach type .. [PASS]
  Test case: sendmsg4: attach prog with wrong attach type .. [PASS]
  Test case: sendmsg4: rewrite IP & port (asm) .. [PASS]
  Test case: sendmsg4: rewrite IP & port (C) .. [PASS]
  Test case: sendmsg4: deny call .. [PASS]
  Test case: sendmsg6: load prog with wrong expected attach type .. [PASS]
  Test case: sendmsg6: attach prog with wrong attach type .. [PASS]
  Test case: sendmsg6: rewrite IP & port (asm) .. [PASS]
  Test case: sendmsg6: rewrite IP & port (C) .. [PASS]
  Test case: sendmsg6: IPv4-mapped IPv6 .. [PASS]
  Test case: sendmsg6: deny call .. [PASS]
  Summary: 27 PASSED, 0 FAILED

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
---
 tools/testing/selftests/bpf/Makefile         |   2 +-
 tools/testing/selftests/bpf/sendmsg4_prog.c  |  49 +++
 tools/testing/selftests/bpf/sendmsg6_prog.c  |  60 ++++
 tools/testing/selftests/bpf/test_sock_addr.c | 518 +++++++++++++++++++++++++++
 4 files changed, 628 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/sendmsg4_prog.c
 create mode 100644 tools/testing/selftests/bpf/sendmsg6_prog.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 8504444..a1b66da 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -34,7 +34,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test
 	sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o test_adjust_tail.o \
 	test_btf_haskv.o test_btf_nokv.o test_sockmap_kern.o test_tunnel_kern.o \
 	test_get_stack_rawtp.o test_sockmap_kern.o test_sockhash_kern.o \
-	test_lwt_seg6local.o
+	test_lwt_seg6local.o sendmsg4_prog.o sendmsg6_prog.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
diff --git a/tools/testing/selftests/bpf/sendmsg4_prog.c b/tools/testing/selftests/bpf/sendmsg4_prog.c
new file mode 100644
index 0000000..a91536b
--- /dev/null
+++ b/tools/testing/selftests/bpf/sendmsg4_prog.c
@@ -0,0 +1,49 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2018 Facebook
+
+#include <linux/stddef.h>
+#include <linux/bpf.h>
+#include <sys/socket.h>
+
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+#define SRC1_IP4		0xAC100001U /* 172.16.0.1 */
+#define SRC2_IP4		0x00000000U
+#define SRC_REWRITE_IP4		0x7f000004U
+#define DST_IP4			0xC0A801FEU /* 192.168.1.254 */
+#define DST_REWRITE_IP4		0x7f000001U
+#define DST_PORT		4040
+#define DST_REWRITE_PORT4	4444
+
+int _version SEC("version") = 1;
+
+SEC("cgroup/sendmsg4")
+int sendmsg_v4_prog(struct bpf_sock_addr *ctx)
+{
+	if (ctx->type != SOCK_DGRAM)
+		return 0;
+
+	/* Rewrite source. */
+	if (ctx->msg_src_ip4 == bpf_htonl(SRC1_IP4) ||
+	    ctx->msg_src_ip4 == bpf_htonl(SRC2_IP4)) {
+		ctx->msg_src_ip4 = bpf_htonl(SRC_REWRITE_IP4);
+	} else {
+		/* Unexpected source. Reject sendmsg. */
+		return 0;
+	}
+
+	/* Rewrite destination. */
+	if ((ctx->user_ip4 >> 24) == (bpf_htonl(DST_IP4) >> 24) &&
+	     ctx->user_port == bpf_htons(DST_PORT)) {
+		ctx->user_ip4 = bpf_htonl(DST_REWRITE_IP4);
+		ctx->user_port = bpf_htons(DST_REWRITE_PORT4);
+	} else {
+		/* Unexpected destination. Reject sendmsg. */
+		return 0;
+	}
+
+	return 1;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/sendmsg6_prog.c b/tools/testing/selftests/bpf/sendmsg6_prog.c
new file mode 100644
index 0000000..5aeaa28
--- /dev/null
+++ b/tools/testing/selftests/bpf/sendmsg6_prog.c
@@ -0,0 +1,60 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2018 Facebook
+
+#include <linux/stddef.h>
+#include <linux/bpf.h>
+#include <sys/socket.h>
+
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+#define SRC_REWRITE_IP6_0	0
+#define SRC_REWRITE_IP6_1	0
+#define SRC_REWRITE_IP6_2	0
+#define SRC_REWRITE_IP6_3	6
+
+#define DST_REWRITE_IP6_0	0
+#define DST_REWRITE_IP6_1	0
+#define DST_REWRITE_IP6_2	0
+#define DST_REWRITE_IP6_3	1
+
+#define DST_REWRITE_PORT6	6666
+
+int _version SEC("version") = 1;
+
+SEC("cgroup/sendmsg6")
+int sendmsg_v6_prog(struct bpf_sock_addr *ctx)
+{
+	if (ctx->type != SOCK_DGRAM)
+		return 0;
+
+	/* Rewrite source. */
+	if (ctx->msg_src_ip6[3] == bpf_htonl(1) ||
+	    ctx->msg_src_ip6[3] == bpf_htonl(0)) {
+		ctx->msg_src_ip6[0] = bpf_htonl(SRC_REWRITE_IP6_0);
+		ctx->msg_src_ip6[1] = bpf_htonl(SRC_REWRITE_IP6_1);
+		ctx->msg_src_ip6[2] = bpf_htonl(SRC_REWRITE_IP6_2);
+		ctx->msg_src_ip6[3] = bpf_htonl(SRC_REWRITE_IP6_3);
+	} else {
+		/* Unexpected source. Reject sendmsg. */
+		return 0;
+	}
+
+	/* Rewrite destination. */
+	if ((ctx->user_ip6[0] & 0xFFFF) == bpf_htons(0xFACE) &&
+	     ctx->user_ip6[0] >> 16 == bpf_htons(0xB00C)) {
+		ctx->user_ip6[0] = bpf_htonl(DST_REWRITE_IP6_0);
+		ctx->user_ip6[1] = bpf_htonl(DST_REWRITE_IP6_1);
+		ctx->user_ip6[2] = bpf_htonl(DST_REWRITE_IP6_2);
+		ctx->user_ip6[3] = bpf_htonl(DST_REWRITE_IP6_3);
+
+		ctx->user_port = bpf_htons(DST_REWRITE_PORT6);
+	} else {
+		/* Unexpected destination. Reject sendmsg. */
+		return 0;
+	}
+
+	return 1;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_sock_addr.c b/tools/testing/selftests/bpf/test_sock_addr.c
index ed3e397..a5e76b9 100644
--- a/tools/testing/selftests/bpf/test_sock_addr.c
+++ b/tools/testing/selftests/bpf/test_sock_addr.c
@@ -1,12 +1,16 @@
 // SPDX-License-Identifier: GPL-2.0
 // Copyright (c) 2018 Facebook
 
+#define _GNU_SOURCE
+
 #include <stdio.h>
 #include <stdlib.h>
 #include <unistd.h>
 
 #include <arpa/inet.h>
+#include <netinet/in.h>
 #include <sys/types.h>
+#include <sys/select.h>
 #include <sys/socket.h>
 
 #include <linux/filter.h>
@@ -17,6 +21,10 @@
 #include "cgroup_helpers.h"
 #include "bpf_rlimit.h"
 
+#ifndef ENOTSUPP
+# define ENOTSUPP 524
+#endif
+
 #ifndef ARRAY_SIZE
 # define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
 #endif
@@ -24,15 +32,20 @@
 #define CG_PATH	"/foo"
 #define CONNECT4_PROG_PATH	"./connect4_prog.o"
 #define CONNECT6_PROG_PATH	"./connect6_prog.o"
+#define SENDMSG4_PROG_PATH	"./sendmsg4_prog.o"
+#define SENDMSG6_PROG_PATH	"./sendmsg6_prog.o"
 
 #define SERV4_IP		"192.168.1.254"
 #define SERV4_REWRITE_IP	"127.0.0.1"
+#define SRC4_IP			"172.16.0.1"
 #define SRC4_REWRITE_IP		"127.0.0.4"
 #define SERV4_PORT		4040
 #define SERV4_REWRITE_PORT	4444
 
 #define SERV6_IP		"face:b00c:1234:5678::abcd"
 #define SERV6_REWRITE_IP	"::1"
+#define SERV6_V4MAPPED_IP	"::ffff:192.168.0.4"
+#define SRC6_IP			"::1"
 #define SRC6_REWRITE_IP		"::6"
 #define SERV6_PORT		6060
 #define SERV6_REWRITE_PORT	6666
@@ -65,6 +78,8 @@ struct sock_addr_test {
 	enum {
 		LOAD_REJECT,
 		ATTACH_REJECT,
+		SYSCALL_EPERM,
+		SYSCALL_ENOTSUPP,
 		SUCCESS,
 	} expected_result;
 };
@@ -73,6 +88,12 @@ static int bind4_prog_load(const struct sock_addr_test *test);
 static int bind6_prog_load(const struct sock_addr_test *test);
 static int connect4_prog_load(const struct sock_addr_test *test);
 static int connect6_prog_load(const struct sock_addr_test *test);
+static int sendmsg_deny_prog_load(const struct sock_addr_test *test);
+static int sendmsg4_rw_asm_prog_load(const struct sock_addr_test *test);
+static int sendmsg4_rw_c_prog_load(const struct sock_addr_test *test);
+static int sendmsg6_rw_asm_prog_load(const struct sock_addr_test *test);
+static int sendmsg6_rw_c_prog_load(const struct sock_addr_test *test);
+static int sendmsg6_rw_v4mapped_prog_load(const struct sock_addr_test *test);
 
 static struct sock_addr_test tests[] = {
 	/* bind */
@@ -302,6 +323,162 @@ static struct sock_addr_test tests[] = {
 		SRC6_REWRITE_IP,
 		SUCCESS,
 	},
+
+	/* sendmsg */
+	{
+		"sendmsg4: load prog with wrong expected attach type",
+		sendmsg4_rw_asm_prog_load,
+		BPF_CGROUP_UDP6_SENDMSG,
+		BPF_CGROUP_UDP4_SENDMSG,
+		AF_INET,
+		SOCK_DGRAM,
+		NULL,
+		0,
+		NULL,
+		0,
+		NULL,
+		LOAD_REJECT,
+	},
+	{
+		"sendmsg4: attach prog with wrong attach type",
+		sendmsg4_rw_asm_prog_load,
+		BPF_CGROUP_UDP4_SENDMSG,
+		BPF_CGROUP_UDP6_SENDMSG,
+		AF_INET,
+		SOCK_DGRAM,
+		NULL,
+		0,
+		NULL,
+		0,
+		NULL,
+		ATTACH_REJECT,
+	},
+	{
+		"sendmsg4: rewrite IP & port (asm)",
+		sendmsg4_rw_asm_prog_load,
+		BPF_CGROUP_UDP4_SENDMSG,
+		BPF_CGROUP_UDP4_SENDMSG,
+		AF_INET,
+		SOCK_DGRAM,
+		SERV4_IP,
+		SERV4_PORT,
+		SERV4_REWRITE_IP,
+		SERV4_REWRITE_PORT,
+		SRC4_REWRITE_IP,
+		SUCCESS,
+	},
+	{
+		"sendmsg4: rewrite IP & port (C)",
+		sendmsg4_rw_c_prog_load,
+		BPF_CGROUP_UDP4_SENDMSG,
+		BPF_CGROUP_UDP4_SENDMSG,
+		AF_INET,
+		SOCK_DGRAM,
+		SERV4_IP,
+		SERV4_PORT,
+		SERV4_REWRITE_IP,
+		SERV4_REWRITE_PORT,
+		SRC4_REWRITE_IP,
+		SUCCESS,
+	},
+	{
+		"sendmsg4: deny call",
+		sendmsg_deny_prog_load,
+		BPF_CGROUP_UDP4_SENDMSG,
+		BPF_CGROUP_UDP4_SENDMSG,
+		AF_INET,
+		SOCK_DGRAM,
+		SERV4_IP,
+		SERV4_PORT,
+		SERV4_REWRITE_IP,
+		SERV4_REWRITE_PORT,
+		SRC4_REWRITE_IP,
+		SYSCALL_EPERM,
+	},
+	{
+		"sendmsg6: load prog with wrong expected attach type",
+		sendmsg6_rw_asm_prog_load,
+		BPF_CGROUP_UDP4_SENDMSG,
+		BPF_CGROUP_UDP6_SENDMSG,
+		AF_INET6,
+		SOCK_DGRAM,
+		NULL,
+		0,
+		NULL,
+		0,
+		NULL,
+		LOAD_REJECT,
+	},
+	{
+		"sendmsg6: attach prog with wrong attach type",
+		sendmsg6_rw_asm_prog_load,
+		BPF_CGROUP_UDP6_SENDMSG,
+		BPF_CGROUP_UDP4_SENDMSG,
+		AF_INET6,
+		SOCK_DGRAM,
+		NULL,
+		0,
+		NULL,
+		0,
+		NULL,
+		ATTACH_REJECT,
+	},
+	{
+		"sendmsg6: rewrite IP & port (asm)",
+		sendmsg6_rw_asm_prog_load,
+		BPF_CGROUP_UDP6_SENDMSG,
+		BPF_CGROUP_UDP6_SENDMSG,
+		AF_INET6,
+		SOCK_DGRAM,
+		SERV6_IP,
+		SERV6_PORT,
+		SERV6_REWRITE_IP,
+		SERV6_REWRITE_PORT,
+		SRC6_REWRITE_IP,
+		SUCCESS,
+	},
+	{
+		"sendmsg6: rewrite IP & port (C)",
+		sendmsg6_rw_c_prog_load,
+		BPF_CGROUP_UDP6_SENDMSG,
+		BPF_CGROUP_UDP6_SENDMSG,
+		AF_INET6,
+		SOCK_DGRAM,
+		SERV6_IP,
+		SERV6_PORT,
+		SERV6_REWRITE_IP,
+		SERV6_REWRITE_PORT,
+		SRC6_REWRITE_IP,
+		SUCCESS,
+	},
+	{
+		"sendmsg6: IPv4-mapped IPv6",
+		sendmsg6_rw_v4mapped_prog_load,
+		BPF_CGROUP_UDP6_SENDMSG,
+		BPF_CGROUP_UDP6_SENDMSG,
+		AF_INET6,
+		SOCK_DGRAM,
+		SERV6_IP,
+		SERV6_PORT,
+		SERV6_REWRITE_IP,
+		SERV6_REWRITE_PORT,
+		SRC6_REWRITE_IP,
+		SYSCALL_ENOTSUPP,
+	},
+	{
+		"sendmsg6: deny call",
+		sendmsg_deny_prog_load,
+		BPF_CGROUP_UDP6_SENDMSG,
+		BPF_CGROUP_UDP6_SENDMSG,
+		AF_INET6,
+		SOCK_DGRAM,
+		SERV6_IP,
+		SERV6_PORT,
+		SERV6_REWRITE_IP,
+		SERV6_REWRITE_PORT,
+		SRC6_REWRITE_IP,
+		SYSCALL_EPERM,
+	},
 };
 
 static int mk_sockaddr(int domain, const char *ip, unsigned short port,
@@ -540,6 +717,141 @@ static int connect6_prog_load(const struct sock_addr_test *test)
 	return load_path(test, CONNECT6_PROG_PATH);
 }
 
+static int sendmsg_deny_prog_load(const struct sock_addr_test *test)
+{
+	struct bpf_insn insns[] = {
+		/* return 0 */
+		BPF_MOV64_IMM(BPF_REG_0, 0),
+		BPF_EXIT_INSN(),
+	};
+	return load_insns(test, insns, sizeof(insns) / sizeof(struct bpf_insn));
+}
+
+static int sendmsg4_rw_asm_prog_load(const struct sock_addr_test *test)
+{
+	struct sockaddr_in dst4_rw_addr;
+	struct in_addr src4_rw_ip;
+
+	if (inet_pton(AF_INET, SRC4_REWRITE_IP, (void *)&src4_rw_ip) != 1) {
+		log_err("Invalid IPv4: %s", SRC4_REWRITE_IP);
+		return -1;
+	}
+
+	if (mk_sockaddr(AF_INET, SERV4_REWRITE_IP, SERV4_REWRITE_PORT,
+			(struct sockaddr *)&dst4_rw_addr,
+			sizeof(dst4_rw_addr)) == -1)
+		return -1;
+
+	struct bpf_insn insns[] = {
+		BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+
+		/* if (sk.family == AF_INET && */
+		BPF_LDX_MEM(BPF_W, BPF_REG_7, BPF_REG_6,
+			    offsetof(struct bpf_sock_addr, family)),
+		BPF_JMP_IMM(BPF_JNE, BPF_REG_7, AF_INET, 8),
+
+		/*     sk.type == SOCK_DGRAM)  { */
+		BPF_LDX_MEM(BPF_W, BPF_REG_7, BPF_REG_6,
+			    offsetof(struct bpf_sock_addr, type)),
+		BPF_JMP_IMM(BPF_JNE, BPF_REG_7, SOCK_DGRAM, 6),
+
+		/*      msg_src_ip4 = src4_rw_ip */
+		BPF_MOV32_IMM(BPF_REG_7, src4_rw_ip.s_addr),
+		BPF_STX_MEM(BPF_W, BPF_REG_6, BPF_REG_7,
+			    offsetof(struct bpf_sock_addr, msg_src_ip4)),
+
+		/*      user_ip4 = dst4_rw_addr.sin_addr */
+		BPF_MOV32_IMM(BPF_REG_7, dst4_rw_addr.sin_addr.s_addr),
+		BPF_STX_MEM(BPF_W, BPF_REG_6, BPF_REG_7,
+			    offsetof(struct bpf_sock_addr, user_ip4)),
+
+		/*      user_port = dst4_rw_addr.sin_port */
+		BPF_MOV32_IMM(BPF_REG_7, dst4_rw_addr.sin_port),
+		BPF_STX_MEM(BPF_W, BPF_REG_6, BPF_REG_7,
+			    offsetof(struct bpf_sock_addr, user_port)),
+		/* } */
+
+		/* return 1 */
+		BPF_MOV64_IMM(BPF_REG_0, 1),
+		BPF_EXIT_INSN(),
+	};
+
+	return load_insns(test, insns, sizeof(insns) / sizeof(struct bpf_insn));
+}
+
+static int sendmsg4_rw_c_prog_load(const struct sock_addr_test *test)
+{
+	return load_path(test, SENDMSG4_PROG_PATH);
+}
+
+static int sendmsg6_rw_dst_asm_prog_load(const struct sock_addr_test *test,
+					 const char *rw_dst_ip)
+{
+	struct sockaddr_in6 dst6_rw_addr;
+	struct in6_addr src6_rw_ip;
+
+	if (inet_pton(AF_INET6, SRC6_REWRITE_IP, (void *)&src6_rw_ip) != 1) {
+		log_err("Invalid IPv6: %s", SRC6_REWRITE_IP);
+		return -1;
+	}
+
+	if (mk_sockaddr(AF_INET6, rw_dst_ip, SERV6_REWRITE_PORT,
+			(struct sockaddr *)&dst6_rw_addr,
+			sizeof(dst6_rw_addr)) == -1)
+		return -1;
+
+	struct bpf_insn insns[] = {
+		BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+
+		/* if (sk.family == AF_INET6) { */
+		BPF_LDX_MEM(BPF_W, BPF_REG_7, BPF_REG_6,
+			    offsetof(struct bpf_sock_addr, family)),
+		BPF_JMP_IMM(BPF_JNE, BPF_REG_7, AF_INET6, 18),
+
+#define STORE_IPV6_WORD_N(DST, SRC, N)					       \
+		BPF_MOV32_IMM(BPF_REG_7, SRC[N]),			       \
+		BPF_STX_MEM(BPF_W, BPF_REG_6, BPF_REG_7,		       \
+			    offsetof(struct bpf_sock_addr, DST[N]))
+
+#define STORE_IPV6(DST, SRC)						       \
+		STORE_IPV6_WORD_N(DST, SRC, 0),				       \
+		STORE_IPV6_WORD_N(DST, SRC, 1),				       \
+		STORE_IPV6_WORD_N(DST, SRC, 2),				       \
+		STORE_IPV6_WORD_N(DST, SRC, 3)
+
+		STORE_IPV6(msg_src_ip6, src6_rw_ip.s6_addr32),
+		STORE_IPV6(user_ip6, dst6_rw_addr.sin6_addr.s6_addr32),
+
+		/*      user_port = dst6_rw_addr.sin6_port */
+		BPF_MOV32_IMM(BPF_REG_7, dst6_rw_addr.sin6_port),
+		BPF_STX_MEM(BPF_W, BPF_REG_6, BPF_REG_7,
+			    offsetof(struct bpf_sock_addr, user_port)),
+
+		/* } */
+
+		/* return 1 */
+		BPF_MOV64_IMM(BPF_REG_0, 1),
+		BPF_EXIT_INSN(),
+	};
+
+	return load_insns(test, insns, sizeof(insns) / sizeof(struct bpf_insn));
+}
+
+static int sendmsg6_rw_asm_prog_load(const struct sock_addr_test *test)
+{
+	return sendmsg6_rw_dst_asm_prog_load(test, SERV6_REWRITE_IP);
+}
+
+static int sendmsg6_rw_v4mapped_prog_load(const struct sock_addr_test *test)
+{
+	return sendmsg6_rw_dst_asm_prog_load(test, SERV6_V4MAPPED_IP);
+}
+
+static int sendmsg6_rw_c_prog_load(const struct sock_addr_test *test)
+{
+	return load_path(test, SENDMSG6_PROG_PATH);
+}
+
 static int cmp_addr(const struct sockaddr_storage *addr1,
 		    const struct sockaddr_storage *addr2, int cmp_port)
 {
@@ -656,6 +968,135 @@ static int connect_to_server(int type, const struct sockaddr_storage *addr,
 	return fd;
 }
 
+int init_pktinfo(int domain, struct cmsghdr *cmsg)
+{
+	struct in6_pktinfo *pktinfo6;
+	struct in_pktinfo *pktinfo4;
+
+	if (domain == AF_INET) {
+		cmsg->cmsg_level = SOL_IP;
+		cmsg->cmsg_type = IP_PKTINFO;
+		cmsg->cmsg_len = CMSG_LEN(sizeof(struct in_pktinfo));
+		pktinfo4 = (struct in_pktinfo *)CMSG_DATA(cmsg);
+		memset(pktinfo4, 0, sizeof(struct in_pktinfo));
+		if (inet_pton(domain, SRC4_IP,
+			      (void *)&pktinfo4->ipi_spec_dst) != 1)
+			return -1;
+	} else if (domain == AF_INET6) {
+		cmsg->cmsg_level = SOL_IPV6;
+		cmsg->cmsg_type = IPV6_PKTINFO;
+		cmsg->cmsg_len = CMSG_LEN(sizeof(struct in6_pktinfo));
+		pktinfo6 = (struct in6_pktinfo *)CMSG_DATA(cmsg);
+		memset(pktinfo6, 0, sizeof(struct in6_pktinfo));
+		if (inet_pton(domain, SRC6_IP,
+			      (void *)&pktinfo6->ipi6_addr) != 1)
+			return -1;
+	} else {
+		return -1;
+	}
+
+	return 0;
+}
+
+static int sendmsg_to_server(const struct sockaddr_storage *addr,
+			     socklen_t addr_len, int set_cmsg, int *syscall_err)
+{
+	union {
+		char buf[CMSG_SPACE(sizeof(struct in6_pktinfo))];
+		struct cmsghdr align;
+	} control6;
+	union {
+		char buf[CMSG_SPACE(sizeof(struct in_pktinfo))];
+		struct cmsghdr align;
+	} control4;
+	struct msghdr hdr;
+	struct iovec iov;
+	char data = 'a';
+	int domain;
+	int fd = -1;
+
+	domain = addr->ss_family;
+
+	if (domain != AF_INET && domain != AF_INET6) {
+		log_err("Unsupported address family");
+		goto err;
+	}
+
+	fd = socket(domain, SOCK_DGRAM, 0);
+	if (fd == -1) {
+		log_err("Failed to create client socket");
+		goto err;
+	}
+
+	memset(&iov, 0, sizeof(iov));
+	iov.iov_base = &data;
+	iov.iov_len = sizeof(data);
+
+	memset(&hdr, 0, sizeof(hdr));
+	hdr.msg_name = (void *)addr;
+	hdr.msg_namelen = addr_len;
+	hdr.msg_iov = &iov;
+	hdr.msg_iovlen = 1;
+
+	if (set_cmsg) {
+		if (domain == AF_INET) {
+			hdr.msg_control = &control4;
+			hdr.msg_controllen = sizeof(control4.buf);
+		} else if (domain == AF_INET6) {
+			hdr.msg_control = &control6;
+			hdr.msg_controllen = sizeof(control6.buf);
+		}
+		if (init_pktinfo(domain, CMSG_FIRSTHDR(&hdr))) {
+			log_err("Failed to init pktinfo");
+			goto err;
+		}
+	}
+
+	if (sendmsg(fd, &hdr, 0) != sizeof(data)) {
+		*syscall_err = errno;
+		log_err("Failed to send message to server");
+		goto err;
+	}
+
+	goto out;
+err:
+	close(fd);
+	fd = -1;
+out:
+	return fd;
+}
+
+static int recvmsg_from_client(int sockfd, struct sockaddr_storage *src_addr)
+{
+	struct timeval tv;
+	struct msghdr hdr;
+	struct iovec iov;
+	char data[64];
+	fd_set rfds;
+
+	FD_ZERO(&rfds);
+	FD_SET(sockfd, &rfds);
+
+	tv.tv_sec = 2;
+	tv.tv_usec = 0;
+
+	if (select(sockfd + 1, &rfds, NULL, NULL, &tv) <= 0 ||
+	    !FD_ISSET(sockfd, &rfds))
+		return -1;
+
+	memset(&iov, 0, sizeof(iov));
+	iov.iov_base = data;
+	iov.iov_len = sizeof(data);
+
+	memset(&hdr, 0, sizeof(hdr));
+	hdr.msg_name = src_addr;
+	hdr.msg_namelen = sizeof(struct sockaddr_storage);
+	hdr.msg_iov = &iov;
+	hdr.msg_iovlen = 1;
+
+	return recvmsg(sockfd, &hdr, 0);
+}
+
 static int init_addrs(const struct sock_addr_test *test,
 		      struct sockaddr_storage *requested_addr,
 		      struct sockaddr_storage *expected_addr,
@@ -753,6 +1194,69 @@ static int run_connect_test_case(const struct sock_addr_test *test)
 	return err;
 }
 
+static int run_sendmsg_test_case(const struct sock_addr_test *test)
+{
+	socklen_t addr_len = sizeof(struct sockaddr_storage);
+	struct sockaddr_storage expected_src_addr;
+	struct sockaddr_storage requested_addr;
+	struct sockaddr_storage expected_addr;
+	struct sockaddr_storage real_src_addr;
+	int clientfd = -1;
+	int servfd = -1;
+	int set_cmsg;
+	int err = 0;
+
+	if (test->type != SOCK_DGRAM)
+		goto err;
+
+	if (init_addrs(test, &requested_addr, &expected_addr,
+		       &expected_src_addr))
+		goto err;
+
+	/* Prepare server to sendmsg to */
+	servfd = start_server(test->type, &expected_addr, addr_len);
+	if (servfd == -1)
+		goto err;
+
+	for (set_cmsg = 0; set_cmsg <= 1; ++set_cmsg) {
+		if (clientfd >= 0)
+			close(clientfd);
+
+		clientfd = sendmsg_to_server(&requested_addr, addr_len,
+					     set_cmsg, &err);
+		if (err)
+			goto out;
+		else if (clientfd == -1)
+			goto err;
+
+		/* Try to receive message on server instead of using
+		 * getpeername(2) on client socket, to check that client's
+		 * destination address was rewritten properly, since
+		 * getpeername(2) doesn't work with unconnected datagram
+		 * sockets.
+		 *
+		 * Get source address from recvmsg(2) as well to make sure
+		 * source was rewritten properly: getsockname(2) can't be used
+		 * since socket is unconnected and source defined for one
+		 * specific packet may differ from the one used by default and
+		 * returned by getsockname(2).
+		 */
+		if (recvmsg_from_client(servfd, &real_src_addr) == -1)
+			goto err;
+
+		if (cmp_addr(&real_src_addr, &expected_src_addr, /*cmp_port*/0))
+			goto err;
+	}
+
+	goto out;
+err:
+	err = -1;
+out:
+	close(clientfd);
+	close(servfd);
+	return err;
+}
+
 static int run_test_case(int cgfd, const struct sock_addr_test *test)
 {
 	int progfd = -1;
@@ -784,10 +1288,24 @@ static int run_test_case(int cgfd, const struct sock_addr_test *test)
 	case BPF_CGROUP_INET6_CONNECT:
 		err = run_connect_test_case(test);
 		break;
+	case BPF_CGROUP_UDP4_SENDMSG:
+	case BPF_CGROUP_UDP6_SENDMSG:
+		err = run_sendmsg_test_case(test);
+		break;
 	default:
 		goto err;
 	}
 
+	if (test->expected_result == SYSCALL_EPERM && err == EPERM) {
+		err = 0; /* error was expected, reset it */
+		goto out;
+	}
+
+	if (test->expected_result == SYSCALL_ENOTSUPP && err == ENOTSUPP) {
+		err = 0; /* error was expected, reset it */
+		goto out;
+	}
+
 	if (err || test->expected_result != SUCCESS)
 		goto err;
 
-- 
2.9.5

* [PATCH v4 bpf-next 4/6] libbpf: Support guessing sendmsg{4,6} progs
From: Andrey Ignatov @ 2018-05-25 15:55 UTC (permalink / raw)
  To: netdev; +Cc: Andrey Ignatov, davem, kafai, ast, daniel, kernel-team
In-Reply-To: <cover.1527263217.git.rdna@fb.com>

libbpf can guess prog type and expected attach type based on section
name. Add hints for "cgroup/sendmsg4" and "cgroup/sendmsg6" section
names.
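
As a rough illustration of how such a section-name hint can work, the sketch
below maps a section name to an attach type by prefix comparison. The enum
values, struct sec_hint and guess_attach_type() are hypothetical stand-ins
rather than libbpf's real types, and prefix matching is an assumption about
the lookup strategy:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel's attach-type enum values. */
enum attach_type {
	ATTACH_UNKNOWN = 0,
	ATTACH_UDP4_SENDMSG,
	ATTACH_UDP6_SENDMSG,
};

struct sec_hint {
	const char *sec;		/* section-name prefix */
	size_t len;			/* prefix length without the NUL */
	enum attach_type attach;	/* attach type implied by the prefix */
};

#define SEC_HINT(name, type) { name, sizeof(name) - 1, type }

static const struct sec_hint hints[] = {
	SEC_HINT("cgroup/sendmsg4", ATTACH_UDP4_SENDMSG),
	SEC_HINT("cgroup/sendmsg6", ATTACH_UDP6_SENDMSG),
};

/* Return the attach type of the first hint whose prefix matches. */
enum attach_type guess_attach_type(const char *sec_name)
{
	size_t i;

	for (i = 0; i < sizeof(hints) / sizeof(hints[0]); i++) {
		if (strncmp(sec_name, hints[i].sec, hints[i].len) == 0)
			return hints[i].attach;
	}

	return ATTACH_UNKNOWN;
}
```

A program placed in SEC("cgroup/sendmsg4") would resolve to the UDPv4 sendmsg
attach type in this sketch, and an unlisted section name falls through to
ATTACH_UNKNOWN.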

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
---
 tools/lib/bpf/libbpf.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index d20411e..b1a60ac 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -2043,6 +2043,8 @@ static const struct {
 	BPF_SA_PROG_SEC("cgroup/bind6",	BPF_CGROUP_INET6_BIND),
 	BPF_SA_PROG_SEC("cgroup/connect4", BPF_CGROUP_INET4_CONNECT),
 	BPF_SA_PROG_SEC("cgroup/connect6", BPF_CGROUP_INET6_CONNECT),
+	BPF_SA_PROG_SEC("cgroup/sendmsg4", BPF_CGROUP_UDP4_SENDMSG),
+	BPF_SA_PROG_SEC("cgroup/sendmsg6", BPF_CGROUP_UDP6_SENDMSG),
 	BPF_S_PROG_SEC("cgroup/post_bind4", BPF_CGROUP_INET4_POST_BIND),
 	BPF_S_PROG_SEC("cgroup/post_bind6", BPF_CGROUP_INET6_POST_BIND),
 };
-- 
2.9.5

* Fw: [Bug 199823] New: HTB not honoring Queue priority when being overdriven with large limits
From: Stephen Hemminger @ 2018-05-25 15:45 UTC (permalink / raw)
  To: netdev

Looks like a TSO issue?

Begin forwarded message:

Date: Thu, 24 May 2018 16:40:26 +0000
From: bugzilla-daemon@bugzilla.kernel.org
To: stephen@networkplumber.org
Subject: [Bug 199823] New: HTB not honoring Queue priority when being overdriven with large limits

https://bugzilla.kernel.org/show_bug.cgi?id=199823

            Bug ID: 199823
           Summary: HTB not honoring Queue priority when being overdriven
                    with large limits
           Product: Networking
           Version: 2.5
    Kernel Version: 4.9.16
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: Other
          Assignee: stephen@networkplumber.org
          Reporter: tyohe@comtechefdata.com
        Regression: No

Created attachment 276165
  --> https://bugzilla.kernel.org/attachment.cgi?id=276165&action=edit  
Files needed to recreate problem

Our application reads packets from eth2 using pf_ring and outputs them to eth3
by doing raw socket sends to eth3.  We have two queues, one at priority 4 for
port 3232 traffic, and one at priority 5 for port 3233 traffic. When overdriving
the specified max rate of the HTB of 1 Mbps with two 1 Mbps iperf UDP streams,
the HTB splits the traffic evenly over the two queues.  However it should allow
the priority 4 queue to meet its max rate of 1 Mbps and give no traffic to the
lower priority queue.

If we specify a small byte limit (15000 bytes) to each queue, the shaping works
as expected.  I.e., the lower priority queue is completely starved and the higher
priority queue can meet its max rate.
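
The outcome expected above can be modelled, in a deliberately simplified way,
as strict-priority sharing of the 1 Mbps ceiling. This sketch is not HTB's
actual borrowing algorithm; strict_priority_share() and its parameters are
hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: divide link_bps among n classes that are already
 * sorted from highest to lowest priority. Each class is granted the smaller
 * of its demand and whatever rate is still left over.
 */
void strict_priority_share(uint64_t link_bps, const uint64_t *demand_bps,
			   uint64_t *granted_bps, size_t n)
{
	uint64_t remaining = link_bps;
	size_t i;

	for (i = 0; i < n; i++) {
		uint64_t grant = demand_bps[i] < remaining ?
				 demand_bps[i] : remaining;

		granted_bps[i] = grant;
		remaining -= grant;
	}
}
```

With a 1,000,000 bps ceiling and two 1,000,000 bps demands, the first
(priority 4) class is granted the full rate and the second (priority 5) class
is granted nothing, which is the behaviour the report describes for small
byte limits.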

At higher byte limits the shaping breaks down and traffic is split evenly.

When we set up a bridge br0 of eth2 and eth3 and run the same data streams
through it, the shaping works correctly regardless of the byte limits.

Therefore, we have concluded that the IP stack or bridge module of Linux is
marking the packets in such a way that HTB always shapes them correctly. 
However when doing raw socket sends to the eth3 interface the HTB shaping
breaks down.

We can recreate this problem with the following kernel versions:  4.16.9,
4.9.90, 3.10.102.

This problem can be recreated easily using the "tcpreplay" tool.  The steps to
recreate are:

1. Set up a Linux box with an eth2/eth3 bridge
2. Download and build "tcpreplay-4.2.4.tar.gz"
3. Set up the QoS
   a) copy attached file "traffic" and "dotraffic" to /tmp
   b) 'chmod 755 dotraffic'
   c) './dotraffic'
4. Execute "tcpreplay" using the attached pcap file with the following command:
        'tcpreplay -l 1000 -i eth3 -M 155 -K eth2.pcap0'

Verify that the problem is recreated by executing the following command:

     watch 'tc -s -d qdisc show dev eth3'

The above command will show that traffic is being split evenly between the two
queues instead of honoring the priority.

Also in the attachment is a screenshot of the web GUI that shows the status of
the queues.

Thanks.

-- 
You are receiving this mail because:
You are the assignee for the bug.

* [PATCH v2 net-next] net: stmmac: Add PPS and Flexible PPS support
From: Jose Abreu @ 2018-05-25 15:32 UTC (permalink / raw)
  To: netdev
  Cc: Jose Abreu, David S. Miller, Joao Pinto, Vitor Soares,
	Giuseppe Cavallaro, Alexandre Torgue, Richard Cochran

This adds support for PPS output and Flexible PPS (which is equivalent
to the per_out output of the PTP subsystem).

Tested using an oscilloscope and the following commands:

1) Start PTP4L:
	# ptp4l -A -4 -H -m -i eth0 &
2) Set Flexible PPS frequency:
	# echo <idx> <ts> <tns> <ps> <pns> > /sys/class/ptp/ptpX/period

Here ts/tns is the start time, ps/pns is the period, and ptpX is the PTP
device that belongs to eth0.
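
To make the period arithmetic concrete, here is a small standalone sketch of
how dwmac5_flex_pps_config() in this patch derives the interval and width
register values from the requested period. flex_pps_regs() and struct
pps_regs are hypothetical names, and the 20 ns sub_second_inc used in the
note below is an assumed value, not one read from hardware:

```c
#include <stdint.h>

/* Hypothetical container for the two computed register values. */
struct pps_regs {
	uint32_t interval;	/* value for MAC_PPSx_INTERVAL */
	uint32_t width;		/* value for MAC_PPSx_WIDTH */
};

/* Convert a period into clock ticks of sub_second_inc nanoseconds each and
 * derive the register values the same way the patch does: interval is the
 * tick count minus one, width is half the tick count minus one.
 * Returns 0 on success, -1 if the period is too short or the increment is 0.
 */
int flex_pps_regs(uint64_t period_sec, uint64_t period_nsec,
		  uint32_t sub_second_inc, struct pps_regs *out)
{
	uint64_t ticks, half;

	if (!sub_second_inc)
		return -1;

	ticks = period_sec * 1000000000ULL + period_nsec;
	ticks /= sub_second_inc;
	if (ticks <= 1)
		return -1;

	half = ticks >> 1;
	if (half <= 1)
		return -1;

	out->interval = (uint32_t)(ticks - 1);
	out->width = (uint32_t)(half - 1);
	return 0;
}
```

Under that assumption, a 1 s period gives 50,000,000 ticks, so the interval
value is 49,999,999 and the width value is 24,999,999. Unlike the driver
code, the sketch validates both values before storing either.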

Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Vitor Soares <soares@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Cc: Richard Cochran <richardcochran@gmail.com>
---
Changes from v1:
	- Correct kbuild errors in some archs
---
 drivers/net/ethernet/stmicro/stmmac/common.h      |    2 +
 drivers/net/ethernet/stmicro/stmmac/dwmac4.h      |    1 +
 drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c |    2 +
 drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c  |    2 +
 drivers/net/ethernet/stmicro/stmmac/dwmac5.c      |   68 +++++++++++++++++++++
 drivers/net/ethernet/stmicro/stmmac/dwmac5.h      |   23 +++++++
 drivers/net/ethernet/stmicro/stmmac/hwif.h        |   10 +++
 drivers/net/ethernet/stmicro/stmmac/stmmac.h      |   12 ++++
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c |    4 +
 drivers/net/ethernet/stmicro/stmmac/stmmac_ptp.c  |   50 ++++++++++++++-
 10 files changed, 170 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h
index a679cb7..78fd0f8 100644
--- a/drivers/net/ethernet/stmicro/stmmac/common.h
+++ b/drivers/net/ethernet/stmicro/stmmac/common.h
@@ -346,6 +346,8 @@ struct dma_features {
 	/* TX and RX number of queues */
 	unsigned int number_rx_queues;
 	unsigned int number_tx_queues;
+	/* PPS output */
+	unsigned int pps_out_num;
 	/* Alternate (enhanced) DESC mode */
 	unsigned int enh_desc;
 	/* TX and RX FIFO sizes */
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4.h b/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
index 6330a55..eb013d5 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
@@ -187,6 +187,7 @@ enum power_event {
 #define GMAC_HW_RXFIFOSIZE		GENMASK(4, 0)
 
 /* MAC HW features2 bitmap */
+#define GMAC_HW_FEAT_PPSOUTNUM		GENMASK(26, 24)
 #define GMAC_HW_FEAT_TXCHCNT		GENMASK(21, 18)
 #define GMAC_HW_FEAT_RXCHCNT		GENMASK(15, 12)
 #define GMAC_HW_FEAT_TXQCNT		GENMASK(9, 6)
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c b/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c
index a7121a7..d46e784 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c
@@ -796,6 +796,8 @@ static void dwmac4_debug(void __iomem *ioaddr, struct stmmac_extra_stats *x,
 	.safety_feat_irq_status = dwmac5_safety_feat_irq_status,
 	.safety_feat_dump = dwmac5_safety_feat_dump,
 	.rxp_config = dwmac5_rxp_config,
+	.pps_config = dwmac5_pps_config,
+	.flex_pps_config = dwmac5_flex_pps_config,
 };
 
 int dwmac4_setup(struct stmmac_priv *priv)
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c b/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c
index bf8e5a1..d37f17c 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c
@@ -373,6 +373,8 @@ static void dwmac4_get_hw_feature(void __iomem *ioaddr,
 		((hw_cap & GMAC_HW_FEAT_RXQCNT) >> 0) + 1;
 	dma_cap->number_tx_queues =
 		((hw_cap & GMAC_HW_FEAT_TXQCNT) >> 6) + 1;
+	/* PPS output */
+	dma_cap->pps_out_num = (hw_cap & GMAC_HW_FEAT_PPSOUTNUM) >> 24;
 
 	/* IEEE 1588-2002 */
 	dma_cap->time_stamp = 0;
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac5.c b/drivers/net/ethernet/stmicro/stmmac/dwmac5.c
index b2becb8..d12fa94 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac5.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac5.c
@@ -8,6 +8,7 @@
 #include "dwmac4.h"
 #include "dwmac5.h"
 #include "stmmac.h"
+#include "stmmac_ptp.h"
 
 struct dwmac5_error_desc {
 	bool valid;
@@ -494,3 +495,70 @@ int dwmac5_rxp_config(void __iomem *ioaddr, struct stmmac_tc_entry *entries,
 	writel(old_val, ioaddr + GMAC_CONFIG);
 	return ret;
 }
+
+int dwmac5_pps_config(void __iomem *ioaddr, bool enable)
+{
+	u32 val = readl(ioaddr + MAC_PPS_CONTROL);
+
+	/* There is no way to disable fixed PPS output so we just reset
+	 * the values to make sure it's in fixed PPS mode */
+	val &= ~PPSx_MASK(0);
+	val |= TRGTMODSELx(0, 0x2);
+
+	writel(val, ioaddr + MAC_PPS_CONTROL);
+	return 0;
+}
+
+int dwmac5_flex_pps_config(void __iomem *ioaddr, int index,
+			   struct stmmac_pps_cfg *cfg, bool enable,
+			   u32 sub_second_inc, u32 systime_flags)
+{
+	u32 tnsec = readl(ioaddr + MAC_PPSx_TARGET_TIME_NSEC(index));
+	u32 val = readl(ioaddr + MAC_PPS_CONTROL);
+	u64 period;
+
+	if (!cfg->available)
+		return -EINVAL;
+	if (tnsec & TRGTBUSY0)
+		return -EBUSY;
+	if (!sub_second_inc || !systime_flags)
+		return -EINVAL;
+
+	val &= ~PPSx_MASK(index);
+
+	if (!enable) {
+		val |= PPSCMDx(index, 0x5);
+		writel(val, ioaddr + MAC_PPS_CONTROL);
+		return 0;
+	}
+
+	val |= PPSCMDx(index, 0x2);
+	val |= TRGTMODSELx(index, 0x2);
+	val |= PPSEN0;
+
+	writel(cfg->start.tv_sec, ioaddr + MAC_PPSx_TARGET_TIME_SEC(index));
+
+	if (!(systime_flags & PTP_TCR_TSCTRLSSR))
+		cfg->start.tv_nsec = (cfg->start.tv_nsec * 1000) / 465;
+	writel(cfg->start.tv_nsec, ioaddr + MAC_PPSx_TARGET_TIME_NSEC(index));
+
+	period = cfg->period.tv_sec * 1000000000;
+	period += cfg->period.tv_nsec;
+
+	do_div(period, sub_second_inc);
+
+	if (period <= 1)
+		return -EINVAL;
+
+	writel(period - 1, ioaddr + MAC_PPSx_INTERVAL(index));
+
+	period >>= 1;
+	if (period <= 1)
+		return -EINVAL;
+
+	writel(period - 1, ioaddr + MAC_PPSx_WIDTH(index));
+
+	/* Finally, activate it */
+	writel(val, ioaddr + MAC_PPS_CONTROL);
+	return 0;
+}
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac5.h b/drivers/net/ethernet/stmicro/stmmac/dwmac5.h
index cc810af..d0a12cf 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac5.h
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac5.h
@@ -11,6 +11,25 @@
 #define PRTYEN				BIT(1)
 #define TMOUTEN				BIT(0)
 
+#define MAC_PPS_CONTROL			0x00000b70
+#define PPS_MAXIDX(x)			((((x) + 1) * 8) - 1)
+#define PPS_MINIDX(x)			((x) * 8)
+#define PPSx_MASK(x)			GENMASK(PPS_MAXIDX(x), PPS_MINIDX(x))
+#define MCGRENx(x)			BIT(PPS_MAXIDX(x))
+#define TRGTMODSELx(x, val)		\
+	GENMASK(PPS_MAXIDX(x) - 1, PPS_MAXIDX(x) - 2) & \
+	((val) << (PPS_MAXIDX(x) - 2))
+#define PPSCMDx(x, val)			\
+	GENMASK(PPS_MINIDX(x) + 3, PPS_MINIDX(x)) & \
+	((val) << PPS_MINIDX(x))
+#define PPSEN0				BIT(4)
+#define MAC_PPSx_TARGET_TIME_SEC(x)	(0x00000b80 + ((x) * 0x10))
+#define MAC_PPSx_TARGET_TIME_NSEC(x)	(0x00000b84 + ((x) * 0x10))
+#define TRGTBUSY0			BIT(31)
+#define TTSL0				GENMASK(30, 0)
+#define MAC_PPSx_INTERVAL(x)		(0x00000b88 + ((x) * 0x10))
+#define MAC_PPSx_WIDTH(x)		(0x00000b8c + ((x) * 0x10))
+
 #define MTL_RXP_CONTROL_STATUS		0x00000ca0
 #define RXPI				BIT(31)
 #define NPE				GENMASK(23, 16)
@@ -61,5 +80,9 @@ int dwmac5_safety_feat_dump(struct stmmac_safety_stats *stats,
 			int index, unsigned long *count, const char **desc);
 int dwmac5_rxp_config(void __iomem *ioaddr, struct stmmac_tc_entry *entries,
 		      unsigned int count);
+int dwmac5_pps_config(void __iomem *ioaddr, bool enable);
+int dwmac5_flex_pps_config(void __iomem *ioaddr, int index,
+			   struct stmmac_pps_cfg *cfg, bool enable,
+			   u32 sub_second_inc, u32 systime_flags);
 
 #endif /* __DWMAC5_H__ */
diff --git a/drivers/net/ethernet/stmicro/stmmac/hwif.h b/drivers/net/ethernet/stmicro/stmmac/hwif.h
index f499a7f..44ea531 100644
--- a/drivers/net/ethernet/stmicro/stmmac/hwif.h
+++ b/drivers/net/ethernet/stmicro/stmmac/hwif.h
@@ -241,6 +241,7 @@ struct stmmac_dma_ops {
 struct rgmii_adv;
 struct stmmac_safety_stats;
 struct stmmac_tc_entry;
+struct stmmac_pps_cfg;
 
 /* Helpers to program the MAC core */
 struct stmmac_ops {
@@ -313,6 +314,11 @@ struct stmmac_ops {
 	/* Flexible RX Parser */
 	int (*rxp_config)(void __iomem *ioaddr, struct stmmac_tc_entry *entries,
 			  unsigned int count);
+	/* PPS and Flexible PPS */
+	int (*pps_config)(void __iomem *ioaddr, bool enable);
+	int (*flex_pps_config)(void __iomem *ioaddr, int index,
+			       struct stmmac_pps_cfg *cfg, bool enable,
+			       u32 sub_second_inc, u32 systime_flags);
 };
 
 #define stmmac_core_init(__priv, __args...) \
@@ -379,6 +385,10 @@ struct stmmac_ops {
 	stmmac_do_callback(__priv, mac, safety_feat_dump, __args)
 #define stmmac_rxp_config(__priv, __args...) \
 	stmmac_do_callback(__priv, mac, rxp_config, __args)
+#define stmmac_pps_config(__priv, __args...) \
+	stmmac_do_callback(__priv, mac, pps_config, __args)
+#define stmmac_flex_pps_config(__priv, __args...) \
+	stmmac_do_callback(__priv, mac, flex_pps_config, __args)
 
 /* PTP and HW Timer helpers */
 struct stmmac_hwtimestamp {
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
index 4d425b1..d1a4cb7 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
@@ -100,6 +100,13 @@ struct stmmac_tc_entry {
 	} __packed val;
 };
 
+#define STMMAC_PPS_MAX		4
+struct stmmac_pps_cfg {
+	bool available;
+	struct timespec64 start;
+	struct timespec64 period;
+};
+
 struct stmmac_priv {
 	/* Frequently used values are kept adjacent for cache effect */
 	u32 tx_count_frames;
@@ -160,6 +167,8 @@ struct stmmac_priv {
 	struct ptp_clock *ptp_clock;
 	struct ptp_clock_info ptp_clock_ops;
 	unsigned int default_addend;
+	u32 sub_second_inc;
+	u32 systime_flags;
 	u32 adv_ts;
 	int use_riwt;
 	int irq_wake;
@@ -181,6 +190,9 @@ struct stmmac_priv {
 	unsigned int tc_entries_max;
 	unsigned int tc_off_max;
 	struct stmmac_tc_entry *tc_entries;
+
+	/* Pulse Per Second output */
+	struct stmmac_pps_cfg pps[STMMAC_PPS_MAX];
 };
 
 enum stmmac_state {
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index c32de53..14361c8 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -722,6 +722,10 @@ static int stmmac_hwtstamp_ioctl(struct net_device *dev, struct ifreq *ifr)
 				priv->plat->has_gmac4, &sec_inc);
 		temp = div_u64(1000000000ULL, sec_inc);
 
+		/* Store sub second increment and flags for later use */
+		priv->sub_second_inc = sec_inc;
+		priv->systime_flags = value;
+
 		/* calculate default added value:
 		 * formula is :
 		 * addend = (2^32)/freq_div_ratio;
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_ptp.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_ptp.c
index 7d3a5c7..35c6d0c 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_ptp.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_ptp.c
@@ -140,19 +140,50 @@ static int stmmac_set_time(struct ptp_clock_info *ptp,
 static int stmmac_enable(struct ptp_clock_info *ptp,
 			 struct ptp_clock_request *rq, int on)
 {
-	return -EOPNOTSUPP;
+	struct stmmac_priv *priv =
+	    container_of(ptp, struct stmmac_priv, ptp_clock_ops);
+	struct stmmac_pps_cfg *cfg;
+	int ret = -EOPNOTSUPP;
+	unsigned long flags;
+
+	switch (rq->type) {
+	case PTP_CLK_REQ_PEROUT:
+		cfg = &priv->pps[rq->perout.index];
+
+		cfg->start.tv_sec = rq->perout.start.sec;
+		cfg->start.tv_nsec = rq->perout.start.nsec;
+		cfg->period.tv_sec = rq->perout.period.sec;
+		cfg->period.tv_nsec = rq->perout.period.nsec;
+
+		spin_lock_irqsave(&priv->ptp_lock, flags);
+		ret = stmmac_flex_pps_config(priv, priv->ioaddr,
+					     rq->perout.index, cfg, on,
+					     priv->sub_second_inc,
+					     priv->systime_flags);
+		spin_unlock_irqrestore(&priv->ptp_lock, flags);
+		break;
+	case PTP_CLK_REQ_PPS:
+		spin_lock_irqsave(&priv->ptp_lock, flags);
+		ret = stmmac_pps_config(priv, priv->ioaddr, on);
+		spin_unlock_irqrestore(&priv->ptp_lock, flags);
+		break;
+	default:
+		break;
+	}
+
+	return ret;
 }
 
 /* structure describing a PTP hardware clock */
-static const struct ptp_clock_info stmmac_ptp_clock_ops = {
+static struct ptp_clock_info stmmac_ptp_clock_ops = {
 	.owner = THIS_MODULE,
 	.name = "stmmac_ptp_clock",
 	.max_adj = 62500000,
 	.n_alarm = 0,
 	.n_ext_ts = 0,
-	.n_per_out = 0,
+	.n_per_out = 0, /* will be overwritten in stmmac_ptp_register */
 	.n_pins = 0,
-	.pps = 0,
+	.pps = 0, /* will be overwritten in stmmac_ptp_register */
 	.adjfreq = stmmac_adjust_freq,
 	.adjtime = stmmac_adjust_time,
 	.gettime64 = stmmac_get_time,
@@ -168,6 +199,17 @@ static int stmmac_enable(struct ptp_clock_info *ptp,
  */
 void stmmac_ptp_register(struct stmmac_priv *priv)
 {
+	int i;
+
+	for (i = 0; i < priv->dma_cap.pps_out_num; i++) {
+		if (i >= STMMAC_PPS_MAX)
+			break;
+		priv->pps[i].available = true;
+	}
+
+	stmmac_ptp_clock_ops.pps = priv->dma_cap.pps_out_num > 0;
+	stmmac_ptp_clock_ops.n_per_out = priv->dma_cap.pps_out_num;
+
 	spin_lock_init(&priv->ptp_lock);
 	priv->ptp_clock_ops = stmmac_ptp_clock_ops;
 
-- 
1.7.1

^ permalink raw reply related

* Re: [PATCH] [RFC] bpf: tracing: new helper bpf_get_current_cgroup_ino
From: Alban Crequy @ 2018-05-25 15:21 UTC (permalink / raw)
  To: Y Song
  Cc: Iago López Galeiras, netdev, Linux Containers, LKML,
	Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Alexei Starovoitov
In-Reply-To: <CAH3MdRVdfw52atavT3KL8MpPw7zDM_hR6aUcqDP1PogLn_sH+w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Wed, May 23, 2018 at 4:34 AM Y Song <ys114321-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> I did a quick prototyping and the above interface seems working fine.

Thanks! I gave your kernel patch & userspace program a try and it works for
me on cgroup-v2.

Also, I found out how to get my containers to use both cgroup-v1 and
cgroup-v2 (by enabling systemd's hybrid cgroup mode and docker's
'--exec-opt native.cgroupdriver=systemd' option). So I should be able to
use the BPF helper function without having to add support for all the
cgroup-v1 hierarchies.

> The kernel change:
> ===============

> [yhs@localhost bpf-next]$ git diff
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 97446bbe2ca5..669b7383fddb 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1976,7 +1976,8 @@ union bpf_attr {
>          FN(fib_lookup),                 \
>          FN(sock_hash_update),           \
>          FN(msg_redirect_hash),          \
> -       FN(sk_redirect_hash),
> +       FN(sk_redirect_hash),           \
> +       FN(get_current_cgroup_id),

>   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>    * function eBPF program intends to call
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index ce2cbbff27e4..e11e3298f911 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -493,6 +493,21 @@ static const struct bpf_func_proto bpf_current_task_under_cgroup_proto = {
>          .arg2_type      = ARG_ANYTHING,
>   };

> +BPF_CALL_0(bpf_get_current_cgroup_id)
> +{
> +       struct cgroup *cgrp = task_dfl_cgroup(current);
> +       if (!cgrp)
> +               return -EINVAL;
> +
> +       return cgrp->kn->id.id;
> +}
> +
> +static const struct bpf_func_proto bpf_get_current_cgroup_id_proto = {
> +       .func           = bpf_get_current_cgroup_id,
> +       .gpl_only       = false,
> +       .ret_type       = RET_INTEGER,
> +};
> +
>   BPF_CALL_3(bpf_probe_read_str, void *, dst, u32, size,
>             const void *, unsafe_ptr)
>   {
> @@ -563,6 +578,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>                  return &bpf_get_prandom_u32_proto;
>          case BPF_FUNC_probe_read_str:
>                  return &bpf_probe_read_str_proto;
> +       case BPF_FUNC_get_current_cgroup_id:
> +               return &bpf_get_current_cgroup_id_proto;
>          default:
>                  return NULL;
>          }

> The following program can be used to print out a cgroup id given a cgroup path.
> [yhs@localhost cg]$ cat get_cgroup_id.c
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>

> int main(int argc, char **argv)
> {
>      int dirfd, err, flags, mount_id, fhsize;
>      struct file_handle *fhp;
>      char *pathname;

>      if (argc != 2) {
>          printf("usage: %s <cgroup_path>\n", argv[0]);
>          return 1;
>      }

>      pathname = argv[1];
>      dirfd = AT_FDCWD;
>      flags = 0;

>      fhsize = sizeof(*fhp);
>      fhp = calloc(1, fhsize);
>      if (!fhp)
>          return 1;

>      err = name_to_handle_at(dirfd, pathname, fhp, &mount_id, flags);
>      if (err >= 0) {
>          printf("error\n");
>          return 1;
>      }

>      fhsize = sizeof(struct file_handle) + fhp->handle_bytes;
>      fhp = realloc(fhp, fhsize);
>      if (!fhp)
>          return 1;

>      err = name_to_handle_at(dirfd, pathname, fhp, &mount_id, flags);
>      if (err < 0)
>          perror("name_to_handle_at");
>      else {
>          int i;

>          printf("dir = %s, mount_id = %d\n", pathname, mount_id);
>          printf("handle_bytes = %d, handle_type = %d\n", fhp->handle_bytes,
>              fhp->handle_type);
>          if (fhp->handle_bytes != 8)
>              return 1;

>          printf("cgroup_id = 0x%llx\n", *(unsigned long long *)fhp->f_handle);
>      }

>      return 0;
> }
> [yhs@localhost cg]$

> Given a cgroup path, the user can get cgroup_id and use it in their bpf
> program for filtering purpose.

> I run a simple program t.c
>     int main() { while(1) sleep(1); return 0; }
> in the cgroup v2 directory /home/yhs/tmp/yhs
>     none on /home/yhs/tmp type cgroup2 (rw,relatime,seclabel)

> $ ./get_cgroup_id /home/yhs/tmp/yhs
> dir = /home/yhs/tmp/yhs, mount_id = 124
> handle_bytes = 8, handle_type = 1
> cgroup_id = 0x1000006b2

> // the below command to get cgroup_id from the kernel for the
> // process compiled with t.c and ran under /home/yhs/tmp/yhs:
> $ sudo ./trace.py -p 4067 '__x64_sys_nanosleep "cgid = %llx", $cgid'
> PID     TID     COMM            FUNC             -
> 4067    4067    a.out           __x64_sys_nanosleep cgid = 1000006b2
> 4067    4067    a.out           __x64_sys_nanosleep cgid = 1000006b2
> 4067    4067    a.out           __x64_sys_nanosleep cgid = 1000006b2
> ^C[yhs@localhost tools]$

> The kernel and user space cgid matches. Will provide a
> formal patch later.




> On Mon, May 21, 2018 at 5:24 PM, Y Song <ys114321-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > On Mon, May 21, 2018 at 9:26 AM, Alexei Starovoitov
> > <alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >> On Sun, May 13, 2018 at 07:33:18PM +0200, Alban Crequy wrote:
> >>>
> >>> +BPF_CALL_2(bpf_get_current_cgroup_ino, u32, hierarchy, u64, flags)
> >>> +{
> >>> +     // TODO: pick the correct hierarchy instead of the mem controller
> >>> +     struct cgroup *cgrp = task_cgroup(current, memory_cgrp_id);
> >>> +
> >>> +     if (unlikely(!cgrp))
> >>> +             return -EINVAL;
> >>> +     if (unlikely(hierarchy))
> >>> +             return -EINVAL;
> >>> +     if (unlikely(flags))
> >>> +             return -EINVAL;
> >>> +
> >>> +     return cgrp->kn->id.ino;
> >>
> >> ino only is not enough to identify cgroup. It needs generation number too.
> >> I don't quite see how hierarchy and flags can be used in the future.
> >> Also why limit it to memcg?
> >>
> >> How about something like this instead:
> >>
> >> BPF_CALL_2(bpf_get_current_cgroup_id)
> >> {
> >>         struct cgroup *cgrp = task_dfl_cgroup(current);
> >>
> >>         return cgrp->kn->id.id;
> >> }
> >> The user space can use fhandle api to get the same 64-bit id.
> >
> > I think this should work. This will also be useful to bcc, as user
> > space can encode the desired id in the bpf program and compare that id
> > to the current cgroup id, so we can have cgroup-level tracing (esp.
> > stat collection) support. To cope with the cgroup hierarchy, the user
> > can use a cgroup-array based approach or explicitly compare against
> > multiple cgroup ids.

^ permalink raw reply

* Re: [PATCH iproute2] ip link: Do not call ll_name_to_index when creating a new link
From: Stephen Hemminger @ 2018-05-25 15:21 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev
In-Reply-To: <054faf87-f311-61a1-3c9a-e7d20cac8279@gmail.com>

On Fri, 18 May 2018 17:40:05 -0600
David Ahern <dsahern@gmail.com> wrote:

> On 5/18/18 4:08 PM, Stephen Hemminger wrote:
> > 
> > What about just pushing the lookup down to the leaf functions that need it?
> >   
> 
> That should work as well. You want to re-send a formal patch?
> 

I just pushed it up as a formal patch (with your text).

^ permalink raw reply

* Re: [PATCH] drivers/net/phy/micrel: Fix for PHY KSZ8061 errrata: Potential link-up failure when Ethernet cable is connected slowly
From: Florian Fainelli @ 2018-05-25 15:17 UTC (permalink / raw)
  To: Alexander Onnasch; +Cc: Andrew Lunn, netdev, linux-kernel
In-Reply-To: <1527251853-22218-1-git-send-email-alexander.onnasch@landisgyr.com>



On 05/25/2018 05:37 AM, Alexander Onnasch wrote:
> Signed-off-by: Alexander Onnasch <alexander.onnasch@landisgyr.com>

You would want to make the commit subject shorter (ideally capped
somewhere around 72 characters) and provide a commit message which
explains the issue and why the workaround is effective.

Thank you!

[snip]

> 
> P PLEASE CONSIDER OUR ENVIRONMENT BEFORE PRINTING THIS EMAIL.
> 
> This e-mail (including any attachments) is confidential and may be legally privileged. If you are not an intended recipient or an authorized representative of an intended recipient, you are prohibited from using, copying or distributing the information in this e-mail or its attachments. If you have received this e-mail in error, please notify the sender immediately by return e-mail and delete all copies of this message and any attachments. Thank you.

You need to remove that footer, otherwise we cannot accept your patch.
-- 
Florian

^ permalink raw reply

* Re: [PATCH 1/7] core, dma-direct: add a flag 32-bit dma limits
From: Greg Kroah-Hartman @ 2018-05-25 14:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Fenghua Yu, Tony Luck, linux-ia64-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Ingo Molnar,
	Thomas Gleixner
In-Reply-To: <20180525143512.1466-2-hch-jcswGhMUV9g@public.gmane.org>

On Fri, May 25, 2018 at 04:35:06PM +0200, Christoph Hellwig wrote:
> Various PCI bridges (VIA PCI, Xilinx PCIe) limit DMA to only 32-bits
> even if the device itself supports more.  Add a single bit flag to
> struct device (to be moved into the dma extension once we around it)

"once we around it"?  I don't understand, sorry.

> to flag such devices and reject larger DMA to them.
> 
> Signed-off-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
> ---
>  include/linux/device.h | 3 +++
>  lib/dma-direct.c       | 6 ++++++
>  2 files changed, 9 insertions(+)

For the patch, no objection from me:

Reviewed-by: Greg Kroah-Hartman <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org>

^ permalink raw reply

* Re: [PATCH net-next] net: sched: shrink struct Qdisc
From: Jiri Pirko @ 2018-05-25 14:41 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: netdev, David S. Miller, Jamal Hadi Salim, Cong Wang
In-Reply-To: <607936fe39bf1e78ca8b520e2ef25b7b326a767f.1527258390.git.pabeni@redhat.com>

Fri, May 25, 2018 at 04:28:44PM CEST, pabeni@redhat.com wrote:
>The struct Qdisc has a lot of holes, especially after commit
>a53851e2c321 ("net: sched: explicit locking in gso_cpu fallback"),
>which as a side effect, moved the fields just after 'busylock'
>on a new cacheline.
>
>Since both 'padded' and 'refcnt' are not updated frequently, and
>there is a hole before 'gso_skb', we can move such fields there,
>saving a cacheline without any performance side effect.
>
>Before this commit:
>
>pahole -C Qdisc net/sched/sch_generic.o
>	# ...
>        /* size: 384, cachelines: 6, members: 25 */
>        /* sum members: 236, holes: 3, sum holes: 92 */
>        /* padding: 56 */
>
>After this commit:
>pahole -C Qdisc net/sched/sch_generic.o
>	# ...
>	/* size: 320, cachelines: 5, members: 25 */
>	/* sum members: 236, holes: 2, sum holes: 28 */
>	/* padding: 56 */
>
>Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Acked-by: Jiri Pirko <jiri@mellanox.com>

^ permalink raw reply

* Re: [PATCH] net: stmmac: Use mutex instead of spinlock
From: Bhadram Varka @ 2018-05-25 14:36 UTC (permalink / raw)
  To: Thierry Reding, David S. Miller
  Cc: Giuseppe Cavallaro, Alexandre Torgue, Jon Hunter, netdev,
	linux-kernel
In-Reply-To: <20180524140907.24197-1-thierry.reding@gmail.com>

Hi,

On 5/24/2018 7:39 PM, Thierry Reding wrote:
> From: Thierry Reding <treding@nvidia.com>
> 
> Some drivers, such as DWC EQOS on Tegra, need to perform operations that
> can sleep under this lock (clk_set_rate() in tegra_eqos_fix_speed()) for
> proper operation. Since there is no need for this lock to be a spinlock,
> convert it to a mutex instead.
> 
> Fixes: e6ea2d16fc61 ("net: stmmac: dwc-qos: Add Tegra186 support")
> Reported-by: Jon Hunter <jonathanh@nvidia.com>
> Signed-off-by: Thierry Reding <treding@nvidia.com>
> ---

Tested on P3310 Tegra186 platform.

Tested-by: Bhadram Varka <vbhadram@nvidia.com>

-- 
Bhadram

^ permalink raw reply

* [PATCH 7/7] x86: switch the VIA 32-bit DMA quirk to use the struct device flag
From: Christoph Hellwig @ 2018-05-25 14:35 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Tony Luck, Fenghua Yu,
	Greg Kroah-Hartman
  Cc: x86, iommu, linux-kernel, linux-ia64, netdev
In-Reply-To: <20180525143512.1466-1-hch@lst.de>

Instead of globally disabling > 32bit DMA using the arch_dma_supported
hook, walk the PCI bus under the actually affected bridge and mark every
device with the dma_32bit_limit flag.  This also gets rid of the
arch_dma_supported hook entirely.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 arch/x86/include/asm/dma-mapping.h |  3 ---
 arch/x86/kernel/pci-dma.c          | 27 ++++++++++-----------------
 include/linux/dma-mapping.h        | 11 -----------
 3 files changed, 10 insertions(+), 31 deletions(-)

diff --git a/arch/x86/include/asm/dma-mapping.h b/arch/x86/include/asm/dma-mapping.h
index 89ce4bfd241f..eb4e1352e403 100644
--- a/arch/x86/include/asm/dma-mapping.h
+++ b/arch/x86/include/asm/dma-mapping.h
@@ -30,9 +30,6 @@ static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type *bus)
 	return dma_ops;
 }
 
-int arch_dma_supported(struct device *dev, u64 mask);
-#define arch_dma_supported arch_dma_supported
-
 bool arch_dma_alloc_attrs(struct device **dev, gfp_t *gfp);
 #define arch_dma_alloc_attrs arch_dma_alloc_attrs
 
diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index b5cbef974bd1..0d6fd0d1c14f 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -15,7 +15,7 @@
 #include <asm/x86_init.h>
 #include <asm/iommu_table.h>
 
-static int forbid_dac __read_mostly;
+static bool disable_dac_quirk __read_mostly;
 
 const struct dma_map_ops *dma_ops = &dma_direct_ops;
 EXPORT_SYMBOL(dma_ops);
@@ -129,7 +129,7 @@ static __init int iommu_setup(char *p)
 		if (!strncmp(p, "nodac", 5))
 			pr_warn("nodac option ignored.\n");
 		if (!strncmp(p, "usedac", 6)) {
-			forbid_dac = -1;
+			disable_dac_quirk = true;
 			return 1;
 		}
 #ifdef CONFIG_SWIOTLB
@@ -154,19 +154,6 @@ static __init int iommu_setup(char *p)
 }
 early_param("iommu", iommu_setup);
 
-int arch_dma_supported(struct device *dev, u64 mask)
-{
-#ifdef CONFIG_PCI
-	if (mask > 0xffffffff && forbid_dac > 0) {
-		dev_info(dev, "PCI: Disallowing DAC for device\n");
-		return 0;
-	}
-#endif
-
-	return 1;
-}
-EXPORT_SYMBOL(arch_dma_supported);
-
 static int __init pci_iommu_init(void)
 {
 	struct iommu_table_entry *p;
@@ -190,11 +177,17 @@ rootfs_initcall(pci_iommu_init);
 #ifdef CONFIG_PCI
 /* Many VIA bridges seem to corrupt data for DAC. Disable it here */
 
+static int via_no_dac_cb(struct pci_dev *pdev, void *data)
+{
+	pdev->dev.dma_32bit_limit = true;
+	return 0;
+}
+
 static void via_no_dac(struct pci_dev *dev)
 {
-	if (forbid_dac == 0) {
+	if (!disable_dac_quirk) {
 		dev_info(&dev->dev, "disabling DAC on VIA PCI bridge\n");
-		forbid_dac = 1;
+		pci_walk_bus(dev->subordinate, via_no_dac_cb, NULL);
 	}
 }
 DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_VIA, PCI_ANY_ID,
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index f8ab1c0f589e..0249bce7c5e7 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -572,14 +572,6 @@ static inline int dma_mapping_error(struct device *dev, dma_addr_t dma_addr)
 	return 0;
 }
 
-/*
- * This is a hack for the legacy x86 forbid_dac and iommu_sac_force. Please
- * don't use this in new code.
- */
-#ifndef arch_dma_supported
-#define arch_dma_supported(dev, mask)	(1)
-#endif
-
 static inline void dma_check_mask(struct device *dev, u64 mask)
 {
 	if (sme_active() && (mask < (((u64)sme_get_me_mask() << 1) - 1)))
@@ -592,9 +584,6 @@ static inline int dma_supported(struct device *dev, u64 mask)
 
 	if (!ops)
 		return 0;
-	if (!arch_dma_supported(dev, mask))
-		return 0;
-
 	if (!ops->dma_supported)
 		return 1;
 	return ops->dma_supported(dev, mask);
-- 
2.17.0

^ permalink raw reply related

* [PATCH 6/7] x86: remove the explicit nodac and allowdac option
From: Christoph Hellwig @ 2018-05-25 14:35 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Tony Luck, Fenghua Yu,
	Greg Kroah-Hartman
  Cc: x86, iommu, linux-kernel, linux-ia64, netdev
In-Reply-To: <20180525143512.1466-1-hch@lst.de>

This is something drivers should decide (modulo chipset quirks like
for VIA), which as far as I can tell is how things have been handled
for the last 15 years.

Note that we keep the usedac option for now, as it is used in the wild
to override the too generic VIA quirk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 Documentation/x86/x86_64/boot-options.txt | 5 -----
 arch/x86/kernel/pci-dma.c                 | 4 ++--
 2 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/Documentation/x86/x86_64/boot-options.txt b/Documentation/x86/x86_64/boot-options.txt
index 341588ec4e29..8d109ef67ab6 100644
--- a/Documentation/x86/x86_64/boot-options.txt
+++ b/Documentation/x86/x86_64/boot-options.txt
@@ -236,11 +236,6 @@ IOMMU (input/output memory management unit)
     nomerge            Don't do scatter-gather (SG) merging.
     noaperture         Ask the IOMMU not to touch the aperture for AGP.
     noagp              Don't initialize the AGP driver and use full aperture.
-    allowdac           Allow double-address cycle (DAC) mode, i.e. DMA >4GB.
-                       DAC is used with 32-bit PCI to push a 64-bit address in
-                       two cycles. When off all DMA over >4GB is forced through
-                       an IOMMU or software bounce buffering.
-    nodac              Forbid DAC mode, i.e. DMA >4GB.
     panic              Always panic when IOMMU overflows.
     calgary            Use the Calgary IOMMU if it is available
 
diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index 91dff954b745..b5cbef974bd1 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -125,9 +125,9 @@ static __init int iommu_setup(char *p)
 		if (!strncmp(p, "forcesac", 8))
 			pr_warn("forcesac option ignored.\n");
 		if (!strncmp(p, "allowdac", 8))
-			forbid_dac = 0;
+			pr_warn("allowdac option ignored.\n");
 		if (!strncmp(p, "nodac", 5))
-			forbid_dac = 1;
+			pr_warn("nodac option ignored.\n");
 		if (!strncmp(p, "usedac", 6)) {
 			forbid_dac = -1;
 			return 1;
-- 
2.17.0

^ permalink raw reply related

* [PATCH 5/7] x86: remove the experimental forcesac boot option
From: Christoph Hellwig @ 2018-05-25 14:35 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Tony Luck, Fenghua Yu,
	Greg Kroah-Hartman
  Cc: x86, iommu, linux-kernel, linux-ia64, netdev
In-Reply-To: <20180525143512.1466-1-hch@lst.de>

Limiting the dma mask to avoid PCI (pre-PCIe) DAC cycles while paying
the huge overhead of an IOMMU is rather pointless, and this seriously
gets in the way of dma mapping work.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 .../admin-guide/kernel-parameters.txt         |  1 -
 Documentation/x86/x86_64/boot-options.txt     |  4 +---
 arch/x86/kernel/pci-dma.c                     | 21 +------------------
 drivers/net/ethernet/sfc/efx.c                |  5 ++---
 drivers/net/ethernet/sfc/falcon/efx.c         |  5 ++---
 5 files changed, 6 insertions(+), 30 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index f2040d46f095..cc0ac035b8fe 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1705,7 +1705,6 @@
 		nopanic
 		merge
 		nomerge
-		forcesac
 		soft
 		pt		[x86, IA-64]
 		nobypass	[PPC/POWERNV]
diff --git a/Documentation/x86/x86_64/boot-options.txt b/Documentation/x86/x86_64/boot-options.txt
index 153b3a57fba2..341588ec4e29 100644
--- a/Documentation/x86/x86_64/boot-options.txt
+++ b/Documentation/x86/x86_64/boot-options.txt
@@ -208,7 +208,7 @@ IOMMU (input/output memory management unit)
       Kernel boot message: "PCI-DMA: Using Calgary IOMMU"
 
  iommu=[<size>][,noagp][,off][,force][,noforce][,leak[=<nr_of_leak_pages>]
-	[,memaper[=<order>]][,merge][,forcesac][,fullflush][,nomerge]
+	[,memaper[=<order>]][,merge][,fullflush][,nomerge]
 	[,noaperture][,calgary]
 
   General iommu options:
@@ -235,8 +235,6 @@ IOMMU (input/output memory management unit)
                        (experimental).
     nomerge            Don't do scatter-gather (SG) merging.
     noaperture         Ask the IOMMU not to touch the aperture for AGP.
-    forcesac           Force single-address cycle (SAC) mode for masks <40bits
-                       (experimental).
     noagp              Don't initialize the AGP driver and use full aperture.
     allowdac           Allow double-address cycle (DAC) mode, i.e. DMA >4GB.
                        DAC is used with 32-bit PCI to push a 64-bit address in
diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index 77625b60a510..91dff954b745 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -20,8 +20,6 @@ static int forbid_dac __read_mostly;
 const struct dma_map_ops *dma_ops = &dma_direct_ops;
 EXPORT_SYMBOL(dma_ops);
 
-static int iommu_sac_force __read_mostly;
-
 #ifdef CONFIG_IOMMU_DEBUG
 int panic_on_overflow __read_mostly = 1;
 int force_iommu __read_mostly = 1;
@@ -125,7 +123,7 @@ static __init int iommu_setup(char *p)
 		if (!strncmp(p, "nomerge", 7))
 			iommu_merge = 0;
 		if (!strncmp(p, "forcesac", 8))
-			iommu_sac_force = 1;
+			pr_warn("forcesac option ignored.\n");
 		if (!strncmp(p, "allowdac", 8))
 			forbid_dac = 0;
 		if (!strncmp(p, "nodac", 5))
@@ -165,23 +163,6 @@ int arch_dma_supported(struct device *dev, u64 mask)
 	}
 #endif
 
-	/* Tell the device to use SAC when IOMMU force is on.  This
-	   allows the driver to use cheaper accesses in some cases.
-
-	   Problem with this is that if we overflow the IOMMU area and
-	   return DAC as fallback address the device may not handle it
-	   correctly.
-
-	   As a special case some controllers have a 39bit address
-	   mode that is as efficient as 32bit (aic79xx). Don't force
-	   SAC for these.  Assume all masks <= 40 bits are of this
-	   type. Normally this doesn't make any difference, but gives
-	   more gentle handling of IOMMU overflow. */
-	if (iommu_sac_force && (mask >= DMA_BIT_MASK(40))) {
-		dev_info(dev, "Force SAC with mask %Lx\n", mask);
-		return 0;
-	}
-
 	return 1;
 }
 EXPORT_SYMBOL(arch_dma_supported);
diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
index a4ebd8715494..661828e8fdcf 100644
--- a/drivers/net/ethernet/sfc/efx.c
+++ b/drivers/net/ethernet/sfc/efx.c
@@ -1289,9 +1289,8 @@ static int efx_init_io(struct efx_nic *efx)
 
 	pci_set_master(pci_dev);
 
-	/* Set the PCI DMA mask.  Try all possibilities from our
-	 * genuine mask down to 32 bits, because some architectures
-	 * (e.g. x86_64 with iommu_sac_force set) will allow 40 bit
+	/* Set the PCI DMA mask.  Try all possibilities from our genuine mask
+	 * down to 32 bits, because some architectures will allow 40 bit
 	 * masks event though they reject 46 bit masks.
 	 */
 	while (dma_mask > 0x7fffffffUL) {
diff --git a/drivers/net/ethernet/sfc/falcon/efx.c b/drivers/net/ethernet/sfc/falcon/efx.c
index 3d6c91e96589..dd5530a4f8c8 100644
--- a/drivers/net/ethernet/sfc/falcon/efx.c
+++ b/drivers/net/ethernet/sfc/falcon/efx.c
@@ -1242,9 +1242,8 @@ static int ef4_init_io(struct ef4_nic *efx)
 
 	pci_set_master(pci_dev);
 
-	/* Set the PCI DMA mask.  Try all possibilities from our
-	 * genuine mask down to 32 bits, because some architectures
-	 * (e.g. x86_64 with iommu_sac_force set) will allow 40 bit
+	/* Set the PCI DMA mask.  Try all possibilities from our genuine mask
+	 * down to 32 bits, because some architectures will allow 40 bit
 	 * masks event though they reject 46 bit masks.
 	 */
 	while (dma_mask > 0x7fffffffUL) {
-- 
2.17.0

^ permalink raw reply related

* [PATCH 4/7] x86: remove a stray reference to pci-nommu.c
From: Christoph Hellwig @ 2018-05-25 14:35 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Tony Luck, Fenghua Yu,
	Greg Kroah-Hartman
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	x86-DgEjT+Ai2ygdnm+yROfE0A, linux-ia64-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20180525143512.1466-1-hch-jcswGhMUV9g@public.gmane.org>

This is just the minimal workaround.  The file is mostly either stale
and/or duplicative of Documentation/admin-guide/kernel-parameters.txt,
but that is much more work than I'm willing to do right now.

Signed-off-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
---
 Documentation/x86/x86_64/boot-options.txt | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/x86/x86_64/boot-options.txt b/Documentation/x86/x86_64/boot-options.txt
index b297c48389b9..153b3a57fba2 100644
--- a/Documentation/x86/x86_64/boot-options.txt
+++ b/Documentation/x86/x86_64/boot-options.txt
@@ -187,9 +187,9 @@ PCI
 
 IOMMU (input/output memory management unit)
 
- Currently four x86-64 PCI-DMA mapping implementations exist:
+ Multiple x86-64 PCI-DMA mapping implementations exist, for example:
 
-   1. <arch/x86_64/kernel/pci-nommu.c>: use no hardware/software IOMMU at all
+   1. <lib/dma-direct.c>: use no hardware/software IOMMU at all
       (e.g. because you have < 3 GB memory).
       Kernel boot message: "PCI-DMA: Disabling IOMMU"
 
-- 
2.17.0

^ permalink raw reply related

* [PATCH 3/7] ia64: remove iommu_dma_supported
From: Christoph Hellwig @ 2018-05-25 14:35 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Tony Luck, Fenghua Yu,
	Greg Kroah-Hartman
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	x86-DgEjT+Ai2ygdnm+yROfE0A, linux-ia64-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20180525143512.1466-1-hch-jcswGhMUV9g@public.gmane.org>

The generic dma_direct_supported helper already used by intel-iommu on
x86 does a better job than the ia64 reimplementation.

Signed-off-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
---
 arch/ia64/kernel/pci-dma.c  | 13 -------------
 drivers/iommu/intel-iommu.c |  2 --
 2 files changed, 15 deletions(-)

diff --git a/arch/ia64/kernel/pci-dma.c b/arch/ia64/kernel/pci-dma.c
index 50b6ad282a90..3c2884bef3d4 100644
--- a/arch/ia64/kernel/pci-dma.c
+++ b/arch/ia64/kernel/pci-dma.c
@@ -51,18 +51,6 @@ iommu_dma_init(void)
 	return;
 }
 
-int iommu_dma_supported(struct device *dev, u64 mask)
-{
-	/* Copied from i386. Doesn't make much sense, because it will
-	   only work for pci_alloc_coherent.
-	   The caller just has to use GFP_DMA in this case. */
-	if (mask < DMA_BIT_MASK(24))
-		return 0;
-
-	return 1;
-}
-EXPORT_SYMBOL(iommu_dma_supported);
-
 void __init pci_iommu_alloc(void)
 {
 	dma_ops = &intel_dma_ops;
@@ -71,7 +59,6 @@ void __init pci_iommu_alloc(void)
 	intel_dma_ops.sync_sg_for_cpu = machvec_dma_sync_sg;
 	intel_dma_ops.sync_single_for_device = machvec_dma_sync_single;
 	intel_dma_ops.sync_sg_for_device = machvec_dma_sync_sg;
-	intel_dma_ops.dma_supported = iommu_dma_supported;
 
 	/*
 	 * The order of these functions is important for
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 749d8f235346..5e0bef3754d1 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -3841,9 +3841,7 @@ const struct dma_map_ops intel_dma_ops = {
 	.map_page = intel_map_page,
 	.unmap_page = intel_unmap_page,
 	.mapping_error = intel_mapping_error,
-#ifdef CONFIG_X86
 	.dma_supported = dma_direct_supported,
-#endif
 };
 
 static inline int iommu_domain_cache_init(void)
-- 
2.17.0

^ permalink raw reply related

* [PATCH 2/7] ia64: remove the dead iommu_sac_force variable
From: Christoph Hellwig @ 2018-05-25 14:35 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Tony Luck, Fenghua Yu,
	Greg Kroah-Hartman
  Cc: x86, iommu, linux-kernel, linux-ia64, netdev
In-Reply-To: <20180525143512.1466-1-hch@lst.de>

Looks like copy and paste from x86 that never actually got used.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 arch/ia64/kernel/pci-dma.c | 19 -------------------
 1 file changed, 19 deletions(-)

diff --git a/arch/ia64/kernel/pci-dma.c b/arch/ia64/kernel/pci-dma.c
index b5df084c0af4..50b6ad282a90 100644
--- a/arch/ia64/kernel/pci-dma.c
+++ b/arch/ia64/kernel/pci-dma.c
@@ -18,8 +18,6 @@
 dma_addr_t bad_dma_address __read_mostly;
 EXPORT_SYMBOL(bad_dma_address);
 
-static int iommu_sac_force __read_mostly;
-
 int no_iommu __read_mostly;
 #ifdef CONFIG_IOMMU_DEBUG
 int force_iommu __read_mostly = 1;
@@ -61,23 +59,6 @@ int iommu_dma_supported(struct device *dev, u64 mask)
 	if (mask < DMA_BIT_MASK(24))
 		return 0;
 
-	/* Tell the device to use SAC when IOMMU force is on.  This
-	   allows the driver to use cheaper accesses in some cases.
-
-	   Problem with this is that if we overflow the IOMMU area and
-	   return DAC as fallback address the device may not handle it
-	   correctly.
-
-	   As a special case some controllers have a 39bit address
-	   mode that is as efficient as 32bit (aic79xx). Don't force
-	   SAC for these.  Assume all masks <= 40 bits are of this
-	   type. Normally this doesn't make any difference, but gives
-	   more gentle handling of IOMMU overflow. */
-	if (iommu_sac_force && (mask >= DMA_BIT_MASK(40))) {
-		dev_info(dev, "Force SAC with mask %llx\n", mask);
-		return 0;
-	}
-
 	return 1;
 }
 EXPORT_SYMBOL(iommu_dma_supported);
-- 
2.17.0

^ permalink raw reply related

* [PATCH 1/7] core, dma-direct: add a flag for 32-bit dma limits
From: Christoph Hellwig @ 2018-05-25 14:35 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Tony Luck, Fenghua Yu,
	Greg Kroah-Hartman
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	x86-DgEjT+Ai2ygdnm+yROfE0A, linux-ia64-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20180525143512.1466-1-hch-jcswGhMUV9g@public.gmane.org>

Various PCI bridges (VIA PCI, Xilinx PCIe) limit DMA to only 32 bits
even if the device itself supports more.  Add a single-bit flag to
struct device (to be moved into the dma extension once we get around to
it) to flag such devices and reject larger DMA masks for them.

Signed-off-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
---
 include/linux/device.h | 3 +++
 lib/dma-direct.c       | 6 ++++++
 2 files changed, 9 insertions(+)

diff --git a/include/linux/device.h b/include/linux/device.h
index 477956990f5e..fa317e45f5e6 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -904,6 +904,8 @@ struct dev_links_info {
  * @offline:	Set after successful invocation of bus type's .offline().
  * @of_node_reused: Set if the device-tree node is shared with an ancestor
  *              device.
+ * @dma_32bit_limit: bridge limited to 32bit DMA even if the device itself
+ *		indicates support for a higher limit in the dma_mask field.
  *
  * At the lowest level, every device in a Linux system is represented by an
  * instance of struct device. The device structure contains the information
@@ -992,6 +994,7 @@ struct device {
 	bool			offline_disabled:1;
 	bool			offline:1;
 	bool			of_node_reused:1;
+	bool			dma_32bit_limit:1;
 };
 
 static inline struct device *kobj_to_dev(struct kobject *kobj)
diff --git a/lib/dma-direct.c b/lib/dma-direct.c
index bbfb229aa067..0151a7b2bc87 100644
--- a/lib/dma-direct.c
+++ b/lib/dma-direct.c
@@ -165,6 +165,12 @@ int dma_direct_supported(struct device *dev, u64 mask)
 	if (mask < DMA_BIT_MASK(32))
 		return 0;
 #endif
+	/*
+	 * Various PCI/PCIe bridges have broken support for > 32bit DMA even
+	 * if the device itself might support it.
+	 */
+	if (dev->dma_32bit_limit && mask > DMA_BIT_MASK(32))
+		return 0;
 	return 1;
 }
 
-- 
2.17.0

^ permalink raw reply related

* refactor 32-bit dma limit quirks
From: Christoph Hellwig @ 2018-05-25 14:35 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Tony Luck, Fenghua Yu,
	Greg Kroah-Hartman
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	x86-DgEjT+Ai2ygdnm+yROfE0A, linux-ia64-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

Hi all,

x86 currently has some quirks to force lower dma masks.  They are mostly
useful for certain VIA systems that would otherwise corrupt data, but beyond
that they don't make much sense given that the modern DMA APIs do the right
thing automatically.

This series dumps a few of these old kernel command line options (including
their not really working versions on ia64), and moves the VIA quirk to a flag
in struct device so that it can be applied generically.  This will be needed
to support Xilinx root ports with a similar issue that show up in common
RISC-V boards.
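
To illustrate the intended behaviour, here is a minimal userspace sketch of
the check that patch 1/7 adds, together with the kind of mask fallback loop
a driver might run on top of it.  It is not the kernel code: struct
fake_device, fake_dma_supported() and fake_pick_mask() are hypothetical
names made up for this example, and only the dma_32bit_limit test itself is
taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same shape as the kernel's DMA_BIT_MASK(): the n low bits set. */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* Hypothetical stand-in for struct device; only the new flag is modelled. */
struct fake_device {
	bool dma_32bit_limit;
};

/* Userspace model of the test that patch 1/7 adds to dma_direct_supported(). */
static int fake_dma_supported(const struct fake_device *dev, uint64_t mask)
{
	if (dev->dma_32bit_limit && mask > DMA_BIT_MASK(32))
		return 0;
	return 1;
}

/*
 * Hypothetical driver-side fallback: halve the mask until it is accepted,
 * never going below 32 bits.  Returns 0 if no mask was accepted.
 */
static uint64_t fake_pick_mask(const struct fake_device *dev, uint64_t mask)
{
	assert(mask != 0);
	while (mask > 0x7fffffffULL) {
		if (fake_dma_supported(dev, mask))
			return mask;
		mask >>= 1;
	}
	return 0;
}
```

With the flag set, a device that asks for a 46-bit mask ends up with
DMA_BIT_MASK(32); without the flag the 46-bit mask is accepted unchanged.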

^ permalink raw reply

* Re: [PATCH net-next] net:sched: add action inheritdsfield to skbmod
From: Marcelo Ricardo Leitner @ 2018-05-25 14:34 UTC (permalink / raw)
  To: Fu, Qiaobin
  Cc: davem@davemloft.net, netdev@vger.kernel.org, jhs@mojatatu.com,
	Michel Machado
In-Reply-To: <C7516012-947F-4485-B5DA-DD9AD45427F8@bu.edu>

On Fri, May 25, 2018 at 05:45:03AM +0000, Fu, Qiaobin wrote:
> Hi Marcelo,
> 
> Thanks for pointing out these style issues. Below is the updated version:

Hi Qiaobin,

Looks good to me. Now you have to submit it like you submitted the
original patch, but add the version tag to the summary. Like '[PATCH
v2 net-next] ....'
And without the text before the changelog.

Thanks.

> 
> ---
> The new action inheritdsfield copies the field DS of
> IPv4 and IPv6 packets into skb->priority. This enables
> later classification of packets based on the DS field.
> 
> Original idea by Jamal Hadi Salim <jhs@mojatatu.com>
> 
> Signed-off-by: Qiaobin Fu <qiaobinf@bu.edu>
> Reviewed-by: Michel Machado <michel@digirati.com.br>
> ---
> 
> Note that the motivation for this patch is found in the following discussion:
> https://www.spinics.net/lists/netdev/msg501061.html
> ---
> 
> diff --git a/include/uapi/linux/tc_act/tc_skbmod.h b/include/uapi/linux/tc_act/tc_skbmod.h
> index 38c072f..0718b48 100644
> --- a/include/uapi/linux/tc_act/tc_skbmod.h
> +++ b/include/uapi/linux/tc_act/tc_skbmod.h
> @@ -19,6 +19,7 @@
>  #define SKBMOD_F_SMAC	0x2
>  #define SKBMOD_F_ETYPE	0x4
>  #define SKBMOD_F_SWAPMAC 0x8
> +#define SKBMOD_F_INHERITDSFIELD 0x10
>  
>  struct tc_skbmod {
>  	tc_gen;
> diff --git a/net/sched/act_skbmod.c b/net/sched/act_skbmod.c
> index ad050d7..e2082f6 100644
> --- a/net/sched/act_skbmod.c
> +++ b/net/sched/act_skbmod.c
> @@ -16,6 +16,9 @@
>  #include <linux/rtnetlink.h>
>  #include <net/netlink.h>
>  #include <net/pkt_sched.h>
> +#include <net/ip.h>
> +#include <net/ipv6.h>
> +#include <net/dsfield.h>
>  
>  #include <linux/tc_act/tc_skbmod.h>
>  #include <net/tc_act/tc_skbmod.h>
> @@ -72,6 +75,26 @@ static int tcf_skbmod_run(struct sk_buff *skb, const struct tc_action *a,
>  		ether_addr_copy(eth_hdr(skb)->h_source, (u8 *)tmpaddr);
>  	}
>  
> +	if (flags & SKBMOD_F_INHERITDSFIELD) {
> +		int wlen = skb_network_offset(skb);
> +
> +		switch (tc_skb_protocol(skb)) {
> +		case htons(ETH_P_IP):
> +			wlen += sizeof(struct iphdr);
> +			if (!pskb_may_pull(skb, wlen))
> +				return TC_ACT_SHOT;
> +			skb->priority = ipv4_get_dsfield(ip_hdr(skb)) >> 2;
> +			break;
> +
> +		case htons(ETH_P_IPV6):
> +			wlen += sizeof(struct ipv6hdr);
> +			if (!pskb_may_pull(skb, wlen))
> +				return TC_ACT_SHOT;
> +			skb->priority = ipv6_get_dsfield(ipv6_hdr(skb)) >> 2;
> +			break;
> +		}
> +	}
> +
>  	return action;
>  }
>  
> @@ -127,6 +150,9 @@ static int tcf_skbmod_init(struct net *net, struct nlattr *nla,
>  	if (parm->flags & SKBMOD_F_SWAPMAC)
>  		lflags = SKBMOD_F_SWAPMAC;
>  
> +	if (parm->flags & SKBMOD_F_INHERITDSFIELD)
> +		lflags |= SKBMOD_F_INHERITDSFIELD;
> +
>  	exists = tcf_idr_check(tn, parm->index, a, bind);
>  	if (exists && bind)
>  		return 0;
> 
> > On May 23, 2018, at 2:06 PM, Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> wrote:
> > 
> > Hi,
> > 
> > Some style fixes:
> > 
> > On Thu, May 17, 2018 at 07:33:08PM +0000, Fu, Qiaobin wrote:
> >> net/sched: add action inheritdsfield to skbmod
> > 
> > This extra line above should not be here.
> > 
> >> 
> >> The new action inheritdsfield copies the field DS of
> >> IPv4 and IPv6 packets into skb->prioriry. This enables
> >                              typo -----^
> > 
> >> later classification of packets based on the DS field.
> >> 
> >> Original idea by Jamal Hadi Salim <jhs@mojatatu.com>
> >> 
> >> Signed-off-by: Qiaobin Fu <qiaobinf@bu.edu>
> >> Reviewed-by: Michel Machado <michel@digirati.com.br>
> >> ---
> >> 
> >> Note that the motivation for this patch is found in the following discussion:
> >> https://www.spinics.net/lists/netdev/msg501061.html
> >> ---
> >> 
> >> diff --git a/include/uapi/linux/tc_act/tc_skbmod.h b/include/uapi/linux/tc_act/tc_skbmod.h
> >> index 38c072f..0718b48 100644
> >> --- a/include/uapi/linux/tc_act/tc_skbmod.h
> >> +++ b/include/uapi/linux/tc_act/tc_skbmod.h
> >> @@ -19,6 +19,7 @@
> >> #define SKBMOD_F_SMAC	0x2
> >> #define SKBMOD_F_ETYPE	0x4
> >> #define SKBMOD_F_SWAPMAC 0x8
> >> +#define SKBMOD_F_INHERITDSFIELD 0x10
> >> 
> >> struct tc_skbmod {
> >> 	tc_gen;
> >> diff --git a/net/sched/act_skbmod.c b/net/sched/act_skbmod.c
> >> index ad050d7..21d5bec 100644
> >> --- a/net/sched/act_skbmod.c
> >> +++ b/net/sched/act_skbmod.c
> >> @@ -16,6 +16,9 @@
> >> #include <linux/rtnetlink.h>
> >> #include <net/netlink.h>
> >> #include <net/pkt_sched.h>
> >> +#include <net/ip.h>
> >> +#include <net/ipv6.h>
> >> +#include <net/dsfield.h>
> >> 
> >> #include <linux/tc_act/tc_skbmod.h>
> >> #include <net/tc_act/tc_skbmod.h>
> >> @@ -72,6 +75,25 @@ static int tcf_skbmod_run(struct sk_buff *skb, const struct tc_action *a,
> >> 		ether_addr_copy(eth_hdr(skb)->h_source, (u8 *)tmpaddr);
> >> 	}
> >> 
> >> +	if (flags & SKBMOD_F_INHERITDSFIELD) {
> >> +		int wlen = skb_network_offset(skb);
> > 
> > You need a blank line here, between var declaration and the rest.
> > 
> >> +		switch (tc_skb_protocol(skb)) {
> >> +		case htons(ETH_P_IP):
> >> +			wlen += sizeof(struct iphdr);
> >> +			if (!pskb_may_pull(skb, wlen))
> >> +				return TC_ACT_SHOT;
> >> +			skb->priority = ipv4_get_dsfield(ip_hdr(skb)) >> 2;
> >> +			break;
> >> +
> >> +		case htons(ETH_P_IPV6):
> >> +			wlen += sizeof(struct ipv6hdr);
> >> +			if (!pskb_may_pull(skb, wlen))
> >> +				return TC_ACT_SHOT;
> >> +			skb->priority = ipv6_get_dsfield(ipv6_hdr(skb)) >> 2;
> >> +			break;
> >> +		}
> >> +	}
> >> +
> >> 	return action;
> >> }
> >> 
> >> @@ -127,6 +149,9 @@ static int tcf_skbmod_init(struct net *net, struct nlattr *nla,
> >> 	if (parm->flags & SKBMOD_F_SWAPMAC)
> >> 		lflags = SKBMOD_F_SWAPMAC;
> >> 
> >> +	if (parm->flags & SKBMOD_F_INHERITDSFIELD)
> >> +		lflags |= SKBMOD_F_INHERITDSFIELD;
> >> +
> >> 	exists = tcf_idr_check(tn, parm->index, a, bind);
> >> 	if (exists && bind)
> >> 		return 0;
> 
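
For readers following the thread, the ">> 2" in the quoted patch can be
illustrated with a small userspace sketch.  It assumes the usual header
layout (the DS field is the second byte of an IPv4 header and the eight
traffic-class bits after the version nibble of an IPv6 header); the upper
six bits of the DS field are the DSCP and the two bits dropped by the shift
are ECN.  The helper names below are hypothetical and are not the kernel's
ipv4_get_dsfield()/ipv6_get_dsfield().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* DS field of an IPv4 header: the second byte (the former TOS byte). */
static uint8_t ipv4_dsfield(const uint8_t *hdr, size_t len)
{
	assert(len >= 20);	/* minimal IPv4 header */
	return hdr[1];
}

/* DS field of an IPv6 header: the 8 bits that follow the version nibble. */
static uint8_t ipv6_dsfield(const uint8_t *hdr, size_t len)
{
	assert(len >= 40);	/* fixed IPv6 header */
	return (uint8_t)((((unsigned int)hdr[0] << 8) | hdr[1]) >> 4);
}

/* Drop the two ECN bits, leaving the 6-bit DSCP. */
static uint8_t dscp_from_dsfield(uint8_t ds)
{
	return ds >> 2;
}
```

For example, a DS field of 0xb8 gives DSCP 46 for both address families,
which is the value the action would store in skb->priority.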

^ permalink raw reply

* Re: STMMAC driver with TSO enabled issue
From: Jose Abreu @ 2018-05-25 14:32 UTC (permalink / raw)
  To: Bhadram Varka, Jose Abreu, netdev@vger.kernel.org, Joao Pinto
In-Reply-To: <94cda7c4-127c-cae1-e51e-8853224065e2@nvidia.com>

On 25-05-2018 15:25, Bhadram Varka wrote:
> Hi Jose,
>
> On 5/25/2018 7:35 PM, Jose Abreu wrote:
>> Hi Bhadram,
>>
>> On 25-05-2018 05:41, Bhadram Varka wrote:
>>> Hi Jose,
>>>
>>> On 5/24/2018 3:01 PM, Jose Abreu wrote:
>>>> Hi Bhadram,
>>>>
>>>> On 24-05-2018 06:58, Bhadram Varka wrote:
>>>>>
>>>>> After some time if check Tx descriptor status - then I see only
>>>>> below
>>>>>
>>>>> [..]
>>>>> [85788.286730] 027 [0x827951b0]: 0xf854f000 0x0 0x16d8 0x90000000
>>>>>
>>>>> index 025 and 026 descriptors processed but not index 027.
>>>>>
>>>>> At this stage Tx DMA is always in below state -
>>>>>
>>>>> ■ 3'b011: Running (Reading Data from system memory
>>>>> buffer and queuing it to the Tx buffer (Tx FIFO))
>>>>
>>>> Thats strange, I think the descriptors look okay though. I will
>>>> need the registers values (before the lock) and, if possible, the
>>>> git bisect output.
>>>
>>> Attaching the register dump file after the issue observed.
>>> Please check once.
>>>
>>
>> ----->8-----
>> 0x112c = 0x0000003F
>> 0x11ac = 0x0000003F
>> 0x122c = 0x0000003F
>> 0x12ac = 0x0000003F
>>
>> 0x1130 = 0x0000003F
>> 0x11b0 = 0x0000003F
>> 0x1230 = 0x0000003F
>> 0x12b0 = 0x0000003F
>> ----->8-----
>>
>> This can't be right, it should be DMA_{RX/TX}_SIZE - 1 = 511. Did
>> you change these values in the code?
>>
>
> Yes. I have changed the descriptor length to 64 - so that
> searching for the current descriptor status would be easy.

OK, that shouldn't impact anything. The only other thing I can think of
right now is that TSO may not be enabled in every DMA channel (the HW
configuration allows this). Please check whether TSO works with a
single queue.
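
For reference, the relation between the dumped values and the ring size
discussed above can be written down as a small sketch.  It assumes, as the
thread suggests, that the ring-length registers hold the number of
descriptors minus one; the helper names are hypothetical and are not
stmmac functions.

```c
#include <assert.h>
#include <stdint.h>

/* Value to program for a ring of ndesc descriptors (assumed: ndesc - 1). */
static uint32_t ring_len_reg(uint32_t ndesc)
{
	assert(ndesc > 0);
	return ndesc - 1;
}

/* Number of descriptors implied by a dumped ring-length register value. */
static uint32_t ring_len_to_ndesc(uint32_t reg)
{
	return reg + 1;
}
```

Under that assumption the default of 512 descriptors gives 511, and the
dumped 0x0000003F corresponds to the 64-entry rings mentioned above.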

Thanks and Best Regards,
Jose Miguel Abreu

^ permalink raw reply

* [PATCH net-next] net: sched: shrink struct Qdisc
From: Paolo Abeni @ 2018-05-25 14:28 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Jamal Hadi Salim, Cong Wang, Jiri Pirko

The struct Qdisc has a lot of holes, especially after commit
a53851e2c321 ("net: sched: explicit locking in gso_cpu fallback"),
which, as a side effect, moved the fields just after 'busylock'
onto a new cacheline.

Since both 'padded' and 'refcnt' are not updated frequently, and
there is a hole before 'gso_skb', we can move such fields there,
saving a cacheline without any performance side effect.

Before this commit:

pahole -C Qdisc net/sched/sch_generic.o
	# ...
	/* size: 384, cachelines: 6, members: 25 */
	/* sum members: 236, holes: 3, sum holes: 92 */
	/* padding: 56 */

After this commit:

pahole -C Qdisc net/sched/sch_generic.o
	# ...
	/* size: 320, cachelines: 5, members: 25 */
	/* sum members: 236, holes: 2, sum holes: 28 */
	/* padding: 56 */
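
As a sanity check on the pahole numbers, the following sketch recomputes
the struct size and cacheline count from the summary lines.  It assumes
64-byte cachelines (consistent with 384 bytes spanning 6 cachelines); the
struct and helper names are hypothetical and exist only for this
illustration.

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINE_BYTES 64	/* assumption: 64-byte cachelines */

/* The three byte counts that pahole prints in its summary. */
struct layout_summary {
	size_t sum_members;
	size_t sum_holes;
	size_t padding;
};

/* Total size is members plus holes plus trailing padding. */
static size_t layout_size(const struct layout_summary *l)
{
	return l->sum_members + l->sum_holes + l->padding;
}

/* Cachelines touched by a cacheline-aligned struct of that size. */
static size_t layout_cachelines(const struct layout_summary *l)
{
	assert(layout_size(l) > 0);
	return (layout_size(l) + CACHELINE_BYTES - 1) / CACHELINE_BYTES;
}
```

Feeding in { 236, 92, 56 } gives 384 bytes and 6 cachelines, and
{ 236, 28, 56 } gives 320 bytes and 5 cachelines, so the 64 bytes of holes
removed account for exactly the one cacheline saved.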

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/net/sch_generic.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 98c10a28cd01..827a3711dc68 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -85,6 +85,8 @@ struct Qdisc {
 	struct net_rate_estimator __rcu *rate_est;
 	struct gnet_stats_basic_cpu __percpu *cpu_bstats;
 	struct gnet_stats_queue	__percpu *cpu_qstats;
+	int			padded;
+	refcount_t		refcnt;

 	/*
 	 * For performance sake on SMP, we put highly modified fields at the end
@@ -97,8 +99,6 @@ struct Qdisc {
 	unsigned long		state;
 	struct Qdisc            *next_sched;
 	struct sk_buff_head	skb_bad_txq;
-	int			padded;
-	refcount_t		refcnt;

 	spinlock_t		busylock ____cacheline_aligned_in_smp;
 	spinlock_t		seqlock;
-- 
2.17.0

^ permalink raw reply related

* Re: STMMAC driver with TSO enabled issue
From: Bhadram Varka @ 2018-05-25 14:25 UTC (permalink / raw)
  To: Jose Abreu, netdev@vger.kernel.org, Joao Pinto
In-Reply-To: <78914761-0375-5929-ed88-5225e0e260b9@synopsys.com>

Hi Jose,

On 5/25/2018 7:35 PM, Jose Abreu wrote:
> Hi Bhadram,
> 
> On 25-05-2018 05:41, Bhadram Varka wrote:
>> Hi Jose,
>>
>> On 5/24/2018 3:01 PM, Jose Abreu wrote:
>>> Hi Bhadram,
>>>
>>> On 24-05-2018 06:58, Bhadram Varka wrote:
>>>>
>>>> After some time if check Tx descriptor status - then I see only
>>>> below
>>>>
>>>> [..]
>>>> [85788.286730] 027 [0x827951b0]: 0xf854f000 0x0 0x16d8
>>>> 0x90000000
>>>>
>>>> index 025 and 026 descriptors processed but not index 027.
>>>>
>>>> At this stage Tx DMA is always in below state -
>>>>
>>>> ■ 3'b011: Running (Reading Data from system memory
>>>> buffer and queuing it to the Tx buffer (Tx FIFO))
>>>
>>> Thats strange, I think the descriptors look okay though. I will
>>> need the registers values (before the lock) and, if possible, the
>>> git bisect output.
>>
>> Attaching the register dump file after the issue observed.
>> Please check once.
>>
> 
> ----->8-----
> 0x112c = 0x0000003F
> 0x11ac = 0x0000003F
> 0x122c = 0x0000003F
> 0x12ac = 0x0000003F
> 
> 0x1130 = 0x0000003F
> 0x11b0 = 0x0000003F
> 0x1230 = 0x0000003F
> 0x12b0 = 0x0000003F
> ----->8-----
> 
> This can't be right, it should be DMA_{RX/TX}_SIZE - 1 = 511. Did
> you change these values in the code?
> 

Yes. I have changed the descriptor length to 64 - so that searching for 
the current descriptor status would be easy.

-- 
Thanks,
Bhadram.

^ permalink raw reply

* Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
From: David Miller @ 2018-05-25 14:23 UTC (permalink / raw)
  To: qing.huang
  Cc: tariqt, haakon.bugge, yanjun.zhu, netdev, linux-rdma,
	linux-kernel, gi-oh.kim
In-Reply-To: <20180523232246.20445-1-qing.huang@oracle.com>

From: Qing Huang <qing.huang@oracle.com>
Date: Wed, 23 May 2018 16:22:46 -0700

> When a system is under memory pressure (high usage with fragments),
> the original 256KB ICM chunk allocations will likely trigger kernel
> memory management to enter slow path doing memory compact/migration
> ops in order to complete high order memory allocations.
> 
> When that happens, user processes calling uverb APIs may get stuck
> for more than 120s easily even though there are a lot of free pages
> in smaller chunks available in the system.
> 
> Syslog:
> ...
> Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
> oracle_205573_e:205573 blocked for more than 120 seconds.
> ...
> 
> With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.
> 
> However in order to support smaller ICM chunk size, we need to fix
> another issue in large size kcalloc allocations.
> 
> E.g.
> Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
> size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
> entry). So we need a 16MB allocation for a table->icm pointer array to
> hold 2M pointers which can easily cause kcalloc to fail.
> 
> The solution is to use kvzalloc to replace kcalloc which will fall back
> to vmalloc automatically if kmalloc fails.
> 
> Signed-off-by: Qing Huang <qing.huang@oracle.com>
> Acked-by: Daniel Jurgens <danielj@mellanox.com>
> Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>

Applied, thanks.
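
The sizing in the quoted changelog can be checked with a short sketch.  It
assumes 8-byte pointers (x86_64, as in the changelog's example); the helper
names are hypothetical and are not mlx4 functions.

```c
#include <assert.h>
#include <stdint.h>

/* Number of ICM chunks needed to hold 2^log_num_mtt entries. */
static uint64_t icm_chunks(unsigned int log_num_mtt, uint64_t chunk_bytes,
			   uint64_t entry_bytes)
{
	uint64_t entries_per_chunk;

	assert(log_num_mtt < 64);
	assert(entry_bytes > 0 && chunk_bytes >= entry_bytes);
	entries_per_chunk = chunk_bytes / entry_bytes;
	return (1ULL << log_num_mtt) / entries_per_chunk;
}

/* Bytes needed for an array holding one pointer per ICM chunk. */
static uint64_t icm_ptr_array_bytes(unsigned int log_num_mtt,
				    uint64_t chunk_bytes,
				    uint64_t entry_bytes,
				    uint64_t ptr_bytes)
{
	return icm_chunks(log_num_mtt, chunk_bytes, entry_bytes) * ptr_bytes;
}
```

With log_num_mtt=30, 4KB chunks and 8-byte mtt entries this yields 2M
chunks, so the pointer array needs 16MB, which is the allocation the
changelog says can easily make kcalloc fail.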

^ permalink raw reply
