Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH V3 net-next 1/2] tcp: send in-queue bytes in cmsg upon read
From: Soheil Hassas Yeganeh @ 2018-05-01 19:28 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Yuchung Cheng, Neal Cardwell, Eric Dumazet,
	Willem de Bruijn
In-Reply-To: <20180501.143447.737462619555001271.davem@davemloft.net>

On Tue, May 1, 2018 at 2:34 PM, David Miller <davem@davemloft.net> wrote:
> From: Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
> Date: Tue,  1 May 2018 10:11:27 -0400
>
>> +static inline int tcp_inq_hint(struct sock *sk)
>
> Please do not use 'inline' in foo.c files, let the compiler decide.
>
> Otherwise looks great, thanks.

Oops, sorry about this. Will send a v4.

^ permalink raw reply

* Re: [PATCH net-next v6] Add Common Applications Kept Enhanced (cake) qdisc
From: Toke Høiland-Jørgensen @ 2018-05-01 19:31 UTC (permalink / raw)
  To: Eric Dumazet, Eric Dumazet, Dave Taht, Cong Wang
  Cc: Linux Kernel Network Developers, Cake List
In-Reply-To: <4ec8da81-8671-f434-bada-27088b09ce7b@gmail.com>

Eric Dumazet <eric.dumazet@gmail.com> writes:

> On 05/01/2018 11:53 AM, Toke Høiland-Jørgensen wrote:
>
>> *sigh* - can do, I guess. Though I'm not sure what that is going to
>> accomplish, exactly?
>
>
> I guess that nobody really wants to really review Cake if
> it is a file with 2700 lines of code and hundreds of variables/tunables.
>
> Sure, we have big files in networking land, as a result of thousands
> of changes.
>
> If you split it, then maybe the work can be split among reviewers as a
> result.
>
> Or maybe David Miller can simply merge your patch as is, and hope for
> the best, I really do not know.
>
> It seems you guys spent years/months on work on this stuff, so what is
> the big deal about presenting your work in the best possible way ?

I was objecting to what felt like an arbitrary moving of goal posts
without an explanation. Now that you give one, that is fine of course.
I'll split it an resubmit.

Could you comment on specifically what you believe is broken in this,
please, so I can fix it in the same iteration?

+static inline struct tcphdr *cake_get_tcphdr(struct sk_buff *skb)
+{
+	struct ipv6hdr *ipv6h;
+	struct iphdr *iph;
+
+	/* check IPv6 header size immediately, since for IPv4 we need the space
+	 * for the TCP header anyway
+	 */
+	if (!pskb_may_pull(skb, skb_network_offset(skb) +
+				sizeof(struct ipv6hdr)))
+		return NULL;
+
+	iph = ip_hdr(skb);
+
+	if (iph->version == 4) {
+		/* special-case 6in4 tunnelling, as that is a common way to get
+		 * v6 connectivity in the home
+		 */
+		if (iph->protocol == IPPROTO_IPV6) {
+			if (!pskb_may_pull(skb, (skb_network_offset(skb) +
+						 ip_hdrlen(skb) +
+						 sizeof(struct ipv6hdr))))
+				return NULL;
+
+			ipv6h = (struct ipv6hdr *)(skb_network_header(skb) +
+						   ip_hdrlen(skb));
+
+			if (ipv6h->nexthdr != IPPROTO_TCP)
+				return NULL;
+
+			skb_set_inner_network_header(skb,
+						     skb_network_offset(skb) +
+						     ip_hdrlen(skb));
+			skb_set_inner_transport_header(skb,
+						skb_inner_network_offset(skb) +
+						sizeof(struct ipv6hdr));
+
+		} else if (iph->protocol == IPPROTO_TCP) {
+			/* we always set the inner headers so we can use those
+			 * unconditionally in the filtering logic
+			 */
+			skb_set_inner_network_header(skb,
+						     skb_network_offset(skb));
+			skb_set_inner_transport_header(skb,
+						       skb_network_offset(skb) +
+						       ip_hdrlen(skb));
+		} else {
+			return NULL;
+		}
+
+	} else if (iph->version == 6) {
+		ipv6h = (struct ipv6hdr *)iph;
+
+		if (ipv6h->nexthdr != IPPROTO_TCP)
+			return NULL;
+
+		/* we always set the inner headers so we can use those
+		 * unconditionally in the filtering logic
+		 */
+		skb_set_inner_network_header(skb,
+					     skb_network_offset(skb));
+		skb_set_inner_transport_header(skb,
+					       skb_network_offset(skb) +
+					       sizeof(struct ipv6hdr));
+
+	} else {
+		return NULL;
+	}
+
+	if (!pskb_may_pull(skb, skb_inner_transport_offset(skb) +
+				sizeof(struct tcphdr)) ||
+	    !pskb_may_pull(skb, skb_inner_transport_offset(skb) +
+				inner_tcp_hdrlen(skb)))
+		return NULL;
+
+	return inner_tcp_hdr(skb);
+}


Thanks,

-Toke

^ permalink raw reply

* [PATCH V4 net-next 1/2] tcp: send in-queue bytes in cmsg upon read
From: Soheil Hassas Yeganeh @ 2018-05-01 19:39 UTC (permalink / raw)
  To: davem, netdev; +Cc: ycheng, ncardwell, edumazet, willemb, Soheil Hassas Yeganeh

From: Soheil Hassas Yeganeh <soheil@google.com>

Applications with many concurrent connections, high variance
in receive queue length and tight memory bounds cannot
allocate worst-case buffer size to drain sockets. Knowing
the size of receive queue length, applications can optimize
how they allocate buffers to read from the socket.

The number of bytes pending on the socket is directly
available through ioctl(FIONREAD/SIOCINQ) and can be
approximated using getsockopt(MEMINFO) (rmem_alloc includes
skb overheads in addition to application data). But, both of
these options add an extra syscall per recvmsg. Moreover,
ioctl(FIONREAD/SIOCINQ) takes the socket lock.

Add the TCP_INQ socket option to TCP. When this socket
option is set, recvmsg() relays the number of bytes available
on the socket for reading to the application via the
TCP_CM_INQ control message.

Calculate the number of bytes after releasing the socket lock
to include the processed backlog, if any. To avoid an extra
branch in the hot path of recvmsg() for this new control
message, move all cmsg processing inside an existing branch for
processing receive timestamps. Since the socket lock is not held
when calculating the size of receive queue, TCP_INQ is a hint.
For example, it can overestimate the queue size by one byte,
if FIN is received.

With this method, applications can start reading from the socket
using a small buffer, and then use larger buffers based on the
remaining data when needed.

V3 change-log:
	As suggested by David Miller, added loads with barrier
	to check whether we have multiple threads calling recvmsg
	in parallel. When that happens we lock the socket to
	calculate inq.
V4 change-log:
	Removed inline from a static function.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Suggested-by: David Miller <davem@davemloft.net>
---
 include/linux/tcp.h      |  2 +-
 include/uapi/linux/tcp.h |  3 +++
 net/ipv4/tcp.c           | 43 ++++++++++++++++++++++++++++++++++++----
 3 files changed, 43 insertions(+), 5 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 20585d5c4e1c3..807776928cb86 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -228,7 +228,7 @@ struct tcp_sock {
 		unused:2;
 	u8	nonagle     : 4,/* Disable Nagle algorithm?             */
 		thin_lto    : 1,/* Use linear timeouts for thin streams */
-		unused1	    : 1,
+		recvmsg_inq : 1,/* Indicate # of bytes in queue upon recvmsg */
 		repair      : 1,
 		frto        : 1;/* F-RTO (RFC5682) activated in CA_Loss */
 	u8	repair_queue;
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index e9e8373b34b9d..29eb659aa77a1 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -123,6 +123,9 @@ enum {
 #define TCP_FASTOPEN_KEY	33	/* Set the key for Fast Open (cookie) */
 #define TCP_FASTOPEN_NO_COOKIE	34	/* Enable TFO without a TFO cookie */
 #define TCP_ZEROCOPY_RECEIVE	35
+#define TCP_INQ			36	/* Notify bytes available to read as a cmsg on read */
+
+#define TCP_CM_INQ		TCP_INQ
 
 struct tcp_repair_opt {
 	__u32	opt_code;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4028ddd14dd5a..868ed74a76a80 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1889,6 +1889,22 @@ static void tcp_recv_timestamp(struct msghdr *msg, const struct sock *sk,
 	}
 }
 
+static int tcp_inq_hint(struct sock *sk)
+{
+	const struct tcp_sock *tp = tcp_sk(sk);
+	u32 copied_seq = READ_ONCE(tp->copied_seq);
+	u32 rcv_nxt = READ_ONCE(tp->rcv_nxt);
+	int inq;
+
+	inq = rcv_nxt - copied_seq;
+	if (unlikely(inq < 0 || copied_seq != READ_ONCE(tp->copied_seq))) {
+		lock_sock(sk);
+		inq = tp->rcv_nxt - tp->copied_seq;
+		release_sock(sk);
+	}
+	return inq;
+}
+
 /*
  *	This routine copies from a sock struct into the user buffer.
  *
@@ -1905,13 +1921,14 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
 	u32 peek_seq;
 	u32 *seq;
 	unsigned long used;
-	int err;
+	int err, inq;
 	int target;		/* Read at least this many bytes */
 	long timeo;
 	struct sk_buff *skb, *last;
 	u32 urg_hole = 0;
 	struct scm_timestamping tss;
 	bool has_tss = false;
+	bool has_cmsg;
 
 	if (unlikely(flags & MSG_ERRQUEUE))
 		return inet_recv_error(sk, msg, len, addr_len);
@@ -1926,6 +1943,7 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
 	if (sk->sk_state == TCP_LISTEN)
 		goto out;
 
+	has_cmsg = tp->recvmsg_inq;
 	timeo = sock_rcvtimeo(sk, nonblock);
 
 	/* Urgent data needs to be handled specially. */
@@ -2112,6 +2130,7 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
 		if (TCP_SKB_CB(skb)->has_rxtstamp) {
 			tcp_update_recv_tstamps(skb, &tss);
 			has_tss = true;
+			has_cmsg = true;
 		}
 		if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
 			goto found_fin_ok;
@@ -2131,13 +2150,20 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
 	 * on connected socket. I was just happy when found this 8) --ANK
 	 */
 
-	if (has_tss)
-		tcp_recv_timestamp(msg, sk, &tss);
-
 	/* Clean up data we have read: This will do ACK frames. */
 	tcp_cleanup_rbuf(sk, copied);
 
 	release_sock(sk);
+
+	if (has_cmsg) {
+		if (has_tss)
+			tcp_recv_timestamp(msg, sk, &tss);
+		if (tp->recvmsg_inq) {
+			inq = tcp_inq_hint(sk);
+			put_cmsg(msg, SOL_TCP, TCP_CM_INQ, sizeof(inq), &inq);
+		}
+	}
+
 	return copied;
 
 out:
@@ -3006,6 +3032,12 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 		tp->notsent_lowat = val;
 		sk->sk_write_space(sk);
 		break;
+	case TCP_INQ:
+		if (val > 1 || val < 0)
+			err = -EINVAL;
+		else
+			tp->recvmsg_inq = val;
+		break;
 	default:
 		err = -ENOPROTOOPT;
 		break;
@@ -3431,6 +3463,9 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 	case TCP_NOTSENT_LOWAT:
 		val = tp->notsent_lowat;
 		break;
+	case TCP_INQ:
+		val = tp->recvmsg_inq;
+		break;
 	case TCP_SAVE_SYN:
 		val = tp->save_syn;
 		break;
-- 
2.17.0.441.gb46fe60e1d-goog

^ permalink raw reply related

* [PATCH V4 net-next 2/2] selftest: add test for TCP_INQ
From: Soheil Hassas Yeganeh @ 2018-05-01 19:39 UTC (permalink / raw)
  To: davem, netdev; +Cc: ycheng, ncardwell, edumazet, willemb, Soheil Hassas Yeganeh
In-Reply-To: <20180501193916.113642-1-soheil.kdev@gmail.com>

From: Soheil Hassas Yeganeh <soheil@google.com>

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
---
 tools/testing/selftests/net/Makefile  |   3 +-
 tools/testing/selftests/net/tcp_inq.c | 189 ++++++++++++++++++++++++++
 2 files changed, 191 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/net/tcp_inq.c

diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index df9102ec7b7af..0a1821f8dfb18 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -9,7 +9,7 @@ TEST_PROGS += fib_tests.sh fib-onlink-tests.sh in_netns.sh pmtu.sh udpgso.sh
 TEST_PROGS += udpgso_bench.sh
 TEST_GEN_FILES =  socket
 TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy
-TEST_GEN_FILES += tcp_mmap
+TEST_GEN_FILES += tcp_mmap tcp_inq
 TEST_GEN_PROGS = reuseport_bpf reuseport_bpf_cpu reuseport_bpf_numa
 TEST_GEN_PROGS += reuseport_dualstack reuseaddr_conflict
 TEST_GEN_PROGS += udpgso udpgso_bench_tx udpgso_bench_rx
@@ -18,3 +18,4 @@ include ../lib.mk
 
 $(OUTPUT)/reuseport_bpf_numa: LDFLAGS += -lnuma
 $(OUTPUT)/tcp_mmap: LDFLAGS += -lpthread
+$(OUTPUT)/tcp_inq: LDFLAGS += -lpthread
diff --git a/tools/testing/selftests/net/tcp_inq.c b/tools/testing/selftests/net/tcp_inq.c
new file mode 100644
index 0000000000000..d044b29ddabcc
--- /dev/null
+++ b/tools/testing/selftests/net/tcp_inq.c
@@ -0,0 +1,189 @@
+/*
+ * Copyright 2018 Google Inc.
+ * Author: Soheil Hassas Yeganeh (soheil@google.com)
+ *
+ * Simple example on how to use TCP_INQ and TCP_CM_INQ.
+ *
+ * License (GPLv2):
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. * See the GNU General Public License for
+ * more details.
+ */
+#define _GNU_SOURCE
+
+#include <error.h>
+#include <netinet/in.h>
+#include <netinet/tcp.h>
+#include <pthread.h>
+#include <stdio.h>
+#include <errno.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <unistd.h>
+
+#ifndef TCP_INQ
+#define TCP_INQ 36
+#endif
+
+#ifndef TCP_CM_INQ
+#define TCP_CM_INQ TCP_INQ
+#endif
+
+#define BUF_SIZE 8192
+#define CMSG_SIZE 32
+
+static int family = AF_INET6;
+static socklen_t addr_len = sizeof(struct sockaddr_in6);
+static int port = 4974;
+
+static void setup_loopback_addr(int family, struct sockaddr_storage *sockaddr)
+{
+	struct sockaddr_in6 *addr6 = (void *) sockaddr;
+	struct sockaddr_in *addr4 = (void *) sockaddr;
+
+	switch (family) {
+	case PF_INET:
+		memset(addr4, 0, sizeof(*addr4));
+		addr4->sin_family = AF_INET;
+		addr4->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
+		addr4->sin_port = htons(port);
+		break;
+	case PF_INET6:
+		memset(addr6, 0, sizeof(*addr6));
+		addr6->sin6_family = AF_INET6;
+		addr6->sin6_addr = in6addr_loopback;
+		addr6->sin6_port = htons(port);
+		break;
+	default:
+		error(1, 0, "illegal family");
+	}
+}
+
+void *start_server(void *arg)
+{
+	int server_fd = (int)(unsigned long)arg;
+	struct sockaddr_in addr;
+	socklen_t addrlen = sizeof(addr);
+	char *buf;
+	int fd;
+	int r;
+
+	buf = malloc(BUF_SIZE);
+
+	for (;;) {
+		fd = accept(server_fd, (struct sockaddr *)&addr, &addrlen);
+		if (fd == -1) {
+			perror("accept");
+			break;
+		}
+		do {
+			r = send(fd, buf, BUF_SIZE, 0);
+		} while (r < 0 && errno == EINTR);
+		if (r < 0)
+			perror("send");
+		if (r != BUF_SIZE)
+			fprintf(stderr, "can only send %d bytes\n", r);
+		/* TCP_INQ can overestimate in-queue by one byte if we send
+		 * the FIN packet. Sleep for 1 second, so that the client
+		 * likely invoked recvmsg().
+		 */
+		sleep(1);
+		close(fd);
+	}
+
+	free(buf);
+	close(server_fd);
+	pthread_exit(0);
+}
+
+int main(int argc, char *argv[])
+{
+	struct sockaddr_storage listen_addr, addr;
+	int c, one = 1, inq = -1;
+	pthread_t server_thread;
+	char cmsgbuf[CMSG_SIZE];
+	struct iovec iov[1];
+	struct cmsghdr *cm;
+	struct msghdr msg;
+	int server_fd, fd;
+	char *buf;
+
+	while ((c = getopt(argc, argv, "46p:")) != -1) {
+		switch (c) {
+		case '4':
+			family = PF_INET;
+			addr_len = sizeof(struct sockaddr_in);
+			break;
+		case '6':
+			family = PF_INET6;
+			addr_len = sizeof(struct sockaddr_in6);
+			break;
+		case 'p':
+			port = atoi(optarg);
+			break;
+		}
+	}
+
+	server_fd = socket(family, SOCK_STREAM, 0);
+	if (server_fd < 0)
+		error(1, errno, "server socket");
+	setup_loopback_addr(family, &listen_addr);
+	if (setsockopt(server_fd, SOL_SOCKET, SO_REUSEADDR,
+		       &one, sizeof(one)) != 0)
+		error(1, errno, "setsockopt(SO_REUSEADDR)");
+	if (bind(server_fd, (const struct sockaddr *)&listen_addr,
+		 addr_len) == -1)
+		error(1, errno, "bind");
+	if (listen(server_fd, 128) == -1)
+		error(1, errno, "listen");
+	if (pthread_create(&server_thread, NULL, start_server,
+			   (void *)(unsigned long)server_fd) != 0)
+		error(1, errno, "pthread_create");
+
+	fd = socket(family, SOCK_STREAM, 0);
+	if (fd < 0)
+		error(1, errno, "client socket");
+	setup_loopback_addr(family, &addr);
+	if (connect(fd, (const struct sockaddr *)&addr, addr_len) == -1)
+		error(1, errno, "connect");
+	if (setsockopt(fd, SOL_TCP, TCP_INQ, &one, sizeof(one)) != 0)
+		error(1, errno, "setsockopt(TCP_INQ)");
+
+	msg.msg_name = NULL;
+	msg.msg_namelen = 0;
+	msg.msg_iov = iov;
+	msg.msg_iovlen = 1;
+	msg.msg_control = cmsgbuf;
+	msg.msg_controllen = sizeof(cmsgbuf);
+	msg.msg_flags = 0;
+
+	buf = malloc(BUF_SIZE);
+	iov[0].iov_base = buf;
+	iov[0].iov_len = BUF_SIZE / 2;
+
+	if (recvmsg(fd, &msg, 0) != iov[0].iov_len)
+		error(1, errno, "recvmsg");
+	if (msg.msg_flags & MSG_CTRUNC)
+		error(1, 0, "control message is truncated");
+
+	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm))
+		if (cm->cmsg_level == SOL_TCP && cm->cmsg_type == TCP_CM_INQ)
+			inq = *((int *) CMSG_DATA(cm));
+
+	if (inq != BUF_SIZE - iov[0].iov_len) {
+		fprintf(stderr, "unexpected inq: %d\n", inq);
+		exit(1);
+	}
+
+	printf("PASSED\n");
+	free(buf);
+	close(fd);
+	return 0;
+}
-- 
2.17.0.441.gb46fe60e1d-goog

^ permalink raw reply related

* Re: [PATCH net-next v6] Add Common Applications Kept Enhanced (cake) qdisc
From: Eric Dumazet @ 2018-05-01 19:41 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Eric Dumazet, Dave Taht,
	Cong Wang
  Cc: Linux Kernel Network Developers, Cake List
In-Reply-To: <871sev2mvx.fsf@toke.dk>



On 05/01/2018 12:31 PM, Toke Høiland-Jørgensen wrote:

> Could you comment on specifically what you believe is broken in this,
> please, so I can fix it in the same iteration?
> 

Apart from the various pskb_may_pull() this helper should not change skb layout.

Ideally, the skb should be const and you would use skb_header_pointer() to make clear
you do not ever write this skb.

This would make the reviewer job pretty easy, as no side effect can possibly happen.


> +static inline struct tcphdr *cake_get_tcphdr(struct sk_buff *skb)
> +{
> +	struct ipv6hdr *ipv6h;
> +	struct iphdr *iph;
> +
> +	/* check IPv6 header size immediately, since for IPv4 we need the space
> +	 * for the TCP header anyway
> +	 */
> +	if (!pskb_may_pull(skb, skb_network_offset(skb) +
> +				sizeof(struct ipv6hdr)))
> +		return NULL;
> +
> +	iph = ip_hdr(skb);
> +
> +	if (iph->version == 4) {
> +		/* special-case 6in4 tunnelling, as that is a common way to get
> +		 * v6 connectivity in the home
> +		 */
> +		if (iph->protocol == IPPROTO_IPV6) {
> +			if (!pskb_may_pull(skb, (skb_network_offset(skb) +
> +						 ip_hdrlen(skb) +
> +						 sizeof(struct ipv6hdr))))
> +				return NULL;
> +
> +			ipv6h = (struct ipv6hdr *)(skb_network_header(skb) +
> +						   ip_hdrlen(skb));
> +
> +			if (ipv6h->nexthdr != IPPROTO_TCP)
> +				return NULL;
> +
> +			skb_set_inner_network_header(skb,
> +						     skb_network_offset(skb) +
> +						     ip_hdrlen(skb));
> +			skb_set_inner_transport_header(skb,
> +						skb_inner_network_offset(skb) +
> +						sizeof(struct ipv6hdr));

This is not allowed for a dissector.

^ permalink raw reply

* Re: [RFC net-next 0/5] Support for PHY test modes
From: Andrew Lunn @ 2018-05-01 19:59 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: David Miller, netdev, rmk, linux-kernel, cphealy, nikita.yoush,
	vivien.didelot, Nisar.Sayed, UNGLinuxDriver
In-Reply-To: <cd910177-bff9-261a-78e7-aa0a2c6532b5@gmail.com>

> # echo 4 > /sys/class/net/gphy/operstate
> # ip link show gphy
> 4: gphy@eth1: <NO-CARRIER,BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
> qdisc noqueue switchid 00000000 state TESTING mode DEFAULT group default
> qlen 1000
>     link/ether 00:10:18:de:38:1f brd ff:ff:ff:ff:ff:ff

This looks good.

I stopped using ifconfig years ago, so i personally don't care about
it. If you are using old tools, you have to expect some surprises and
missing information.

	Andrew

^ permalink raw reply

* [PATCH net] net/tls: Don't recursively call push_record during tls_write_space callbacks
From: Dave Watson @ 2018-05-01 20:05 UTC (permalink / raw)
  To: David S. Miller, Andre Tomt, netdev, borisp, Aviad Yehezkel

It is reported that in some cases, write_space may be called in
do_tcp_sendpages, such that we recursively invoke do_tcp_sendpages again:

[  660.468802]  ? do_tcp_sendpages+0x8d/0x580
[  660.468826]  ? tls_push_sg+0x74/0x130 [tls]
[  660.468852]  ? tls_push_record+0x24a/0x390 [tls]
[  660.468880]  ? tls_write_space+0x6a/0x80 [tls]
...

tls_push_sg already does a loop over all sending sg's, so ignore
any tls_write_space notifications until we are done sending.
We then have to call the previous write_space to wake up
poll() waiters after we are done with the send loop.

Reported-by: Andre Tomt <andre@tomt.net>
Signed-off-by: Dave Watson <davejwatson@fb.com>
---
 include/net/tls.h  | 1 +
 net/tls/tls_main.c | 7 +++++++
 2 files changed, 8 insertions(+)

diff --git a/include/net/tls.h b/include/net/tls.h
index 3da8e13..b400d0bb 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -148,6 +148,7 @@ struct tls_context {
 	struct scatterlist *partially_sent_record;
 	u16 partially_sent_offset;
 	unsigned long flags;
+	bool in_tcp_sendpages;
 
 	u16 pending_open_record_frags;
 	int (*push_pending_record)(struct sock *sk, int flags);
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 0d37997..cc03e00 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -114,6 +114,7 @@ int tls_push_sg(struct sock *sk,
 	size = sg->length - offset;
 	offset += sg->offset;
 
+	ctx->in_tcp_sendpages = true;
 	while (1) {
 		if (sg_is_last(sg))
 			sendpage_flags = flags;
@@ -148,6 +149,8 @@ int tls_push_sg(struct sock *sk,
 	}
 
 	clear_bit(TLS_PENDING_CLOSED_RECORD, &ctx->flags);
+	ctx->in_tcp_sendpages = false;
+	ctx->sk_write_space(sk);
 
 	return 0;
 }
@@ -217,6 +220,10 @@ static void tls_write_space(struct sock *sk)
 {
 	struct tls_context *ctx = tls_get_ctx(sk);
 
+	/* We are already sending pages, ignore notification */
+	if (ctx->in_tcp_sendpages)
+		return;
+
 	if (!sk->sk_write_pending && tls_is_pending_closed_record(ctx)) {
 		gfp_t sk_allocation = sk->sk_allocation;
 		int rc;
-- 
2.9.5

^ permalink raw reply related

* RE: [RFC net-next 4/5] net: phy: Add support for IEEE standard test modes
From: Woojung.Huh @ 2018-05-01 20:07 UTC (permalink / raw)
  To: f.fainelli, netdev
  Cc: andrew, rmk, linux-kernel, davem, cphealy, nikita.yoush,
	vivien.didelot, Nisar.Sayed, UNGLinuxDriver
In-Reply-To: <66f30fca-f485-7f52-7441-3c8cf1718640@gmail.com>

Hi Florian,

> Not sure I completely understand your suggestion, do you mean that I
> should break down the body of that function above such that there are
> per-speed lower level functions? Something like the pseudo-code below:
> 
> genphy_set_test() {
> 	switch (mode) {
> 	case PHY_STD_TEST_MODE_100BASET2_1:
> 	..
> 	case PHY_STD_TEST_MODE_100BASET2_3:
> 		return genphy_set_100baset2();
> 
> 	case PHY_STD_TEST_MODE_1000BASET_1:
> 	..
> 	case PHY_STD_TEST_MODE_1000BASET_4:
> 		return genphy_set_1000baset();
> 
> 	case PHY_STD_TEST_MODE_8021BWQCQ_1:
> 		return genphy_set_100baset1();
> 
> }
Yes, I should write pseudo code. Sorry about confusion.
User can override this function or expand to other modes.

Thanks.
Woojung


^ permalink raw reply

* [PATCH bpf-nex] tools: bpftool: change time format for program 'loaded at:' information
From: Quentin Monnet @ 2018-05-01 20:18 UTC (permalink / raw)
  To: ast, borkmann; +Cc: netdev, oss-drivers, quentin.monnet

To make eBPF program load time easier to parse from "bpftool prog"
output for machines, change the time format used by the program. The
format now differs for plain and JSON version:

- Plain version uses a string formatted according to ISO 8601.
- JSON uses the number of seconds since the Epoch, wich is less friendly
  for humans but even easier to process.

Example output:

    # ./bpftool prog
    41298: xdp  tag a04f5eef06a7f555 dev foo
            loaded_at 2018-04-18T17:19:47+0100  uid 0
            xlated 16B  not jited  memlock 4096B

    # ./bpftool prog -p
    [{
            "id": 41298,
            "type": "xdp",
            "tag": "a04f5eef06a7f555",
            "gpl_compatible": false,
            "dev": {
                "ifindex": 14,
                "ns_dev": 3,
                "ns_inode": 4026531993,
                "ifname": "foo"
            },
            "loaded_at": 1524068387,
            "uid": 0,
            "bytes_xlated": 16,
            "jited": false,
            "bytes_memlock": 4096
        }
    ]

Previously, "Apr 18/17:19" would be used at both places.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 tools/bpf/bpftool/prog.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index e71a0a11afde..9bdfdf2d3fbe 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -96,7 +96,10 @@ static void print_boot_time(__u64 nsecs, char *buf, unsigned int size)
 		return;
 	}
 
-	strftime(buf, size, "%b %d/%H:%M", &load_tm);
+	if (json_output)
+		strftime(buf, size, "%s", &load_tm);
+	else
+		strftime(buf, size, "%FT%T%z", &load_tm);
 }
 
 static int prog_fd_by_tag(unsigned char *tag)
@@ -245,7 +248,8 @@ static void print_prog_json(struct bpf_prog_info *info, int fd)
 		print_boot_time(info->load_time, buf, sizeof(buf));
 
 		/* Piggy back on load_time, since 0 uid is a valid one */
-		jsonw_string_field(json_wtr, "loaded_at", buf);
+		jsonw_name(json_wtr, "loaded_at");
+		jsonw_printf(json_wtr, "%s", buf);
 		jsonw_uint_field(json_wtr, "uid", info->created_by_uid);
 	}
 
-- 
2.7.4

^ permalink raw reply related

* Re: [RFC PATCH v3 bpf-next 2/5] bpf/verifier: rewrite subprog boundary detection
From: Edward Cree @ 2018-05-01 20:40 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Daniel Borkmann, netdev
In-Reply-To: <20180417234826.egydr2sg2rewzvyu@ast-mbp>

On 18/04/18 00:48, Alexei Starovoitov wrote:
> as I was saying before this is no go.
> subprogno is meaningless in the hierarchy of: prog -> func -> bb -> insn
> Soon bpf will have libraries and this field would need to become
> a pointer back to bb or func structure creating unnecessary circular dependency.
I'm afraid I don't follow.  Why can't func numbers (and later, bb numbers)
 be per-prog?  When verifier is linking multiple progs together it will
 necessarily have the subprog-info for each prog, and when making cross-
 prog calls it'll have to already know which prog it's calling into; I
 don't see any reason why the index into a prog's subprog_info array
 should become "meaningless" in such a setup.
Besides, subprogno is how the rest of the verifier currently identifies a
 func, and in the absence of any indication of how anything different
 will be implemented, that's what an incremental patch has to work with.

If you're worried about the SPOT violation from having both
 aux->subprogno and subprog_info->start... well, we could actually get
 rid of the latter!  Uses of it are:
* carving up insn arrays in jit_subprogs().  Could be done based on
  aux->subprogno instead (v1 of this series did that)
* checking CALL destination is at start of function.  That could be done
  by putting a flag in the aux_data to mark "this insn is at the start of
  its subprog".  Doesn't even need to increase memory usage: it could be
  done by ORing a flag (0x8000, say) into aux->subprogno; or we could
  replace 'bool seen;' with 'u8 flags;' again with no extra memory used.
* a few verbose() messages.
That would have another nice consequence, in that adjust_subprog_starts()
 could go away - another code simplification resulting from use of the
 right data structures.

Btw, sorry for delay in responding; got bogged down in some sfc driver
 work.

-Ed

^ permalink raw reply

* Re: [PATCH net-next v7 1/3] vmcore: add API to collect hardware dump in second kernel
From: kbuild test robot @ 2018-05-01 20:44 UTC (permalink / raw)
  To: Rahul Lakkireddy
  Cc: kbuild-all, netdev, kexec, linux-fsdevel, linux-kernel, davem,
	viro, ebiederm, stephen, akpm, torvalds, ganeshgr, nirranjan,
	indranil, Rahul Lakkireddy
In-Reply-To: <b6065b53c5446d98ee55e09119f6821f979418d4.1525197408.git.rahul.lakkireddy@chelsio.com>

[-- Attachment #1: Type: text/plain, Size: 730 bytes --]

Hi Rahul,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Rahul-Lakkireddy/vmcore-add-API-to-collect-hardware-dump-in-second-kernel/20180502-023638
config: i386-tinyconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All warnings (new ones prefixed by >>):

>> ./usr/include/linux/vmcore.h:9: found __[us]{8,16,32,64} type without #include <linux/types.h>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 6293 bytes --]

^ permalink raw reply

* Re: [RFC net-next 4/5] net: phy: Add support for IEEE standard test modes
From: Florian Fainelli @ 2018-05-01 20:51 UTC (permalink / raw)
  To: Woojung.Huh, netdev
  Cc: andrew, rmk, linux-kernel, davem, cphealy, nikita.yoush,
	vivien.didelot, Nisar.Sayed, UNGLinuxDriver
In-Reply-To: <9235D6609DB808459E95D78E17F2E43D40D55B0A@CHN-SV-EXMX02.mchp-main.com>

On 05/01/2018 01:07 PM, Woojung.Huh@microchip.com wrote:
> Hi Florian,
> 
>> Not sure I completely understand your suggestion, do you mean that I
>> should break down the body of that function above such that there are
>> per-speed lower level functions? Something like the pseudo-code below:
>>
>> genphy_set_test() {
>> 	switch (mode) {
>> 	case PHY_STD_TEST_MODE_100BASET2_1:
>> 	..
>> 	case PHY_STD_TEST_MODE_100BASET2_3:
>> 		return genphy_set_100baset2();
>>
>> 	case PHY_STD_TEST_MODE_1000BASET_1:
>> 	..
>> 	case PHY_STD_TEST_MODE_1000BASET_4:
>> 		return genphy_set_1000baset();
>>
>> 	case PHY_STD_TEST_MODE_8021BWQCQ_1:
>> 		return genphy_set_100baset1();
>>
>> }
> Yes, I should write pseudo code. Sorry about confusion.
> User can override this function or expand to other modes.

Well, the way the code is structure is that if you call that function
with a test mode value that is not part of the standard set, it returns
-EOPNOTSUPP, so if your particular PHY driver wants to "overlay"
standard and non-standard modes, it can by using that hint.

This should work even if we have more standard test modes in the future
because the test modes are dynamically fetched by user-space using the
ETH_GSTRINGS ioctl().

Does that cover what you had in mind?
-- 
Florian

^ permalink raw reply

* [PATCH] net: stmmac: Avoid VLA usage
From: Kees Cook @ 2018-05-01 21:01 UTC (permalink / raw)
  To: Giuseppe Cavallaro, Alexandre Torgue; +Cc: linux-kernel, netdev, Jose Abreu

In the quest to remove all stack VLAs from the kernel[1], this switches
the "status" stack buffer to use the existing small (8) upper bound on
how many queues can be checked for DMA, and adds a sanity-check just to
make sure it doesn't operate under pathological conditions.

[1] http://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index b65e2d144698..19bdc23fa314 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -2045,7 +2045,11 @@ static void stmmac_dma_interrupt(struct stmmac_priv *priv)
 				tx_channel_count : rx_channel_count;
 	u32 chan;
 	bool poll_scheduled = false;
-	int status[channels_to_check];
+	int status[max_t(u32, MTL_MAX_TX_QUEUES, MTL_MAX_RX_QUEUES)];
+
+	/* Make sure we never check beyond our status buffer. */
+	if (WARN_ON_ONCE(channels_to_check > ARRAY_SIZE(status)))
+		channels_to_check = ARRAY_SIZE(status);
 
 	/* Each DMA channel can be used for rx and tx simultaneously, yet
 	 * napi_struct is embedded in struct stmmac_rx_queue rather than in a
-- 
2.7.4


-- 
Kees Cook
Pixel Security

^ permalink raw reply related

* Re: [PATCH bpf-nex] tools: bpftool: change time format for program 'loaded at:' information
From: Alexei Starovoitov @ 2018-05-01 21:25 UTC (permalink / raw)
  To: Quentin Monnet; +Cc: ast, borkmann, netdev, oss-drivers
In-Reply-To: <1525205918-16779-1-git-send-email-quentin.monnet@netronome.com>

On Tue, May 01, 2018 at 09:18:38PM +0100, Quentin Monnet wrote:
> To make eBPF program load time easier to parse from "bpftool prog"
> output for machines, change the time format used by the program. The
> format now differs for plain and JSON version:
> 
> - Plain version uses a string formatted according to ISO 8601.
> - JSON uses the number of seconds since the Epoch, wich is less friendly
>   for humans but even easier to process.
> 
> Example output:
> 
>     # ./bpftool prog
>     41298: xdp  tag a04f5eef06a7f555 dev foo
>             loaded_at 2018-04-18T17:19:47+0100  uid 0
>             xlated 16B  not jited  memlock 4096B
> 
>     # ./bpftool prog -p
>     [{
>             "id": 41298,
>             "type": "xdp",
>             "tag": "a04f5eef06a7f555",
>             "gpl_compatible": false,
>             "dev": {
>                 "ifindex": 14,
>                 "ns_dev": 3,
>                 "ns_inode": 4026531993,
>                 "ifname": "foo"
>             },
>             "loaded_at": 1524068387,
>             "uid": 0,
>             "bytes_xlated": 16,
>             "jited": false,
>             "bytes_memlock": 4096
>         }
>     ]
> 
> Previously, "Apr 18/17:19" would be used at both places.
> 
> Suggested-by: Alexei Starovoitov <ast@kernel.org>
> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>

Applied, Thanks!

^ permalink raw reply

* Re: [PATCH net-next v6] Add Common Applications Kept Enhanced (cake) qdisc
From: Toke Høiland-Jørgensen @ 2018-05-01 21:54 UTC (permalink / raw)
  To: Eric Dumazet, Eric Dumazet, Dave Taht, Cong Wang
  Cc: Linux Kernel Network Developers, Cake List
In-Reply-To: <c43e3156-dce2-ced7-3a8f-47fe756202f9@gmail.com>

Eric Dumazet <eric.dumazet@gmail.com> writes:

> On 05/01/2018 12:31 PM, Toke Høiland-Jørgensen wrote:
>
>> Could you comment on specifically what you believe is broken in this,
>> please, so I can fix it in the same iteration?
>> 
>
> Apart from the various pskb_may_pull() this helper should not change skb layout.
>
> Ideally, the skb should be const and you would use
> skb_header_pointer() to make clear you do not ever write this skb.
>
> This would make the reviewer job pretty easy, as no side effect can
> possibly happen.

Gotcha. Will fix; thanks :)

-Toke

^ permalink raw reply

* Re: [PATCH bpf] bpf: minor fix to selftest test_stacktrace_build_id()
From: Daniel Borkmann @ 2018-05-01 22:19 UTC (permalink / raw)
  To: Song Liu, netdev; +Cc: kernel-team, Alexei Starovoitov
In-Reply-To: <20180501172024.781777-1-songliubraving@fb.com>

On 05/01/2018 07:20 PM, Song Liu wrote:
> 1. remove useless parameter list to ./urandom_read
> 2. add missing "\n" to the end of an error message
> 
> Fixes: 81f77fd0deeb ("bpf: add selftest for stackmap with BPF_F_STACK_BUILD_ID")
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Song Liu <songliubraving@fb.com>

Applied to bpf tree, thanks Song!

^ permalink raw reply

* Re: [PATCH bpf-next 0/3] bpf: cleanups on managing subprog information
From: Alexei Starovoitov @ 2018-05-01 22:22 UTC (permalink / raw)
  To: Jiong Wang; +Cc: borkmann, ecree, netdev, oss-drivers
In-Reply-To: <1525127296-3573-1-git-send-email-jiong.wang@netronome.com>

On Mon, Apr 30, 2018 at 06:28:13PM -0400, Jiong Wang wrote:
> 
> There is no functional change by this patch set.
> No bpf selftest regression found after this patch set.

I was about to apply them, but there is a regression:
[   27.773899] ==================================================================
[   27.774802] BUG: KASAN: slab-out-of-bounds in do_jit+0x5499/0x6020
[   27.775559] Read of size 4 at addr ffff8801197fe7f4 by task test_verifier/344
[   27.776412]
[   27.776607] CPU: 3 PID: 344 Comm: test_verifier Not tainted 4.17.0-rc2-00451-geb43cb64a84a #943
[   27.777644] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.el7.centos 04/01/2014
[   27.778894] Call Trace:
[   27.779217]  dump_stack+0x5b/0x8b
[   27.779675]  ? do_jit+0x5499/0x6020
[   27.780148]  print_address_description+0x73/0x290
[   27.780716]  ? do_jit+0x5499/0x6020
[   27.781152]  kasan_report+0x22b/0x350
[   27.781602]  do_jit+0x5499/0x6020
[   27.782020]  ? __mod_node_page_state+0xa2/0xd0
[   27.782557]  ? jit_fill_hole+0x20/0x20
[   27.783019]  ? ___slab_alloc+0x3e7/0x4d0
[   27.783498]  ? kasan_unpoison_shadow+0x30/0x40
[   27.784042]  ? kasan_kmalloc+0xa0/0xd0
[   27.784497]  ? __kmalloc+0x109/0x200
[   27.784931]  ? bpf_int_jit_compile+0x7ac/0xab0
[   27.785475]  bpf_int_jit_compile+0x2b6/0xab0
[   27.786001]  ? do_jit+0x6020/0x6020
[   27.786428]  ? kasan_kmalloc+0xa0/0xd0
[   27.786885]  bpf_check+0x2c05/0x4c40
[   27.787346]  ? fixup_bpf_calls+0x1140/0x1140
[   27.787865]  ? kasan_unpoison_shadow+0x30/0x40
[   27.788406]  ? kasan_kmalloc+0xa0/0xd0
[   27.788865]  ? memset+0x1f/0x40
[   27.789255]  ? bpf_obj_name_cpy+0x2d/0x200
[   27.789750]  bpf_prog_load+0xb07/0xeb0

simply running test_verifier with JIT and kasan on.

^ permalink raw reply

* Re: Silently dropped UDP packets on kernel 4.14
From: Florian Westphal @ 2018-05-01 22:24 UTC (permalink / raw)
  To: Kristian Evensen; +Cc: Netfilter Development Mailing list, Network Development
In-Reply-To: <CAKfDRXj5oSD9-LKRbS_LYko+WEKwwunUq0L35G3t7Gi9Om74zw@mail.gmail.com>

Kristian Evensen <kristian.evensen@gmail.com> wrote:
> On Tue, May 1, 2018 at 8:50 PM, Kristian Evensen
> <kristian.evensen@gmail.com> wrote:
> > Does anyone have any idea of what could be wrong, where I should look
> > or other things I can try? I tried to space the requests out a bit in
> > time (I inserted a sleep 1 between them), and then the problem went
> > away.
> 
> I should learn to always go through everything one last time before
> sending an email. First of all, I see that both requests are treated
> as new.

Normal when nfqueue is in use.

> Second, on my router, new requests are sent to user space for
> marking, which explains the large delay in processing. When removing
> the NFQUEUE-rule + handling and marking statically, my problem goes
> away and I get an answer for both packets.

Yes, because 2nd packet finds the conntrack entry created by the first
one.

> However, I do have one question. In my application, both packets are
> assigned the same mark. Shouldn't they then match the same conntrack
> entry, or am I missing something since that seems not to be the case?

The 2nd packet creates a new conntrack entry, because the conntrack
entry created by the first one is not inserted into global table yet.

This happens as last step, after packet has traversed all chains.
When nfqueue is used, this gets delayed, and 2nd packet will be dropped
as the insertion step finds that another packet created same flow
already.

I'm not sure what the best way to solve this is, we either need
to insert earlier in nfqueue case, or do conflict resolution in nfqueue
case (and perhaps also nat undo? not sure).

^ permalink raw reply

* Re: [PATCH net-next 0/2] bridge: FDB: Notify about removal of non-user-added entries
From: Petr Machata @ 2018-05-01 22:27 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: netdev, ivecera, davem, stephen, vivien.didelot, f.fainelli, jiri
In-Reply-To: <20180501175439.GB1265@lunn.ch>

Andrew Lunn <andrew@lunn.ch> writes:

> On Tue, May 01, 2018 at 07:04:19PM +0200, Petr Machata wrote:
>> Device drivers may generally need to keep in sync with bridge's FDB. In
>> particular, for its offload of tc mirror action where the mirrored-to
>> device is a gretap device, mlxsw needs to listen to a number of events.
>> SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE would be a natural notification to
>> listen to in order to keep up with FDB updates.
>> 
>> However, for removal of FDB entries added due to device activity (as
>> opposed to explicit addition through "bridge fdb add" or similar), there
>> are no notifications.
>
> Could you explain this some more. Why is mlxsw special in that it
> needs this, but other drivers don't. The b53 driver supports tc
> mirror, for example.

mlxsw supports offload of tc mirror to a gretap netdevice to achieve
mirroring of traffic encapsulated in GRE. However, the device doesn't do
routing and switching on the encapsulated mirrored packets. You need to
configure exact parameters--L2 and L3 addresses, VLAN tagging if any,
etc., and a single front panel port to forward to.

To offload this, the driver sort of plays the game of networking stack.
Finds the underlay route, finds the neighbor to get the Ethernet address
(and possibly kick starts ARP resolution), and if the target netdevice
is a bridge, queries the FDB to find the one port to forward to (and
bails out if it needs to flood, we can't really offload that).

Now any time anything in that pipeline changes, the stuff after that
point needs to be redone, and the offloading decision (if and how) may
change.

So if an administrator decides to remove an FDB entry by hand, the
driver should notice this and reflect the intent by removing offloads
that depend on existence of that FDB entry. But since there's no event,
mlxsw doesn't notice this outright. It does notice later, because of an
unrelated event--e.g. from a neighbor, a FIB notification, whatever it
happens to be. That discrepancy is what I'm trying to fix.

Thanks,
Petr

^ permalink raw reply

* Re: [PATCH net-next v6] Add Common Applications Kept Enhanced (cake) qdisc
From: Cong Wang @ 2018-05-01 22:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Toke Høiland-Jørgensen, Dave Taht,
	Linux Kernel Network Developers, Cake List
In-Reply-To: <4ec8da81-8671-f434-bada-27088b09ce7b@gmail.com>

On Tue, May 1, 2018 at 12:12 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> I guess that nobody really wants to really review Cake if
> it is a file with 2700 lines of code and hundreds of variables/tunables.
>
> Sure, we have big files in networking land, as a result of thousands of changes.
>
> If you split it, then maybe the work can be split among reviewers as a result.
>

+1

At *least* split it into CAKE without ack-filter and ack-filter implementation.

^ permalink raw reply

* Re: [PATCH net-next 2/2] net: bridge: Notify about !added_by_user FDB entries
From: Petr Machata @ 2018-05-01 22:33 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: netdev, ivecera, davem, stephen, andrew, vivien.didelot,
	f.fainelli, jiri, bridge@lists.linux-foundation.org
In-Reply-To: <df818113-33d4-d1ec-eb55-75e1aa59338d@cumulusnetworks.com>

Nikolay Aleksandrov <nikolay@cumulusnetworks.com> writes:

> On 01/05/18 20:04, Petr Machata wrote:
>> Do not automatically bail out on sending notifications about activity on
>> non-user-added FDB entries. Instead, notify about this activity except
>> for cases where the activity itself originates in a notification, to
>> avoid sending duplicate notifications.
>>
>> Signed-off-by: Petr Machata <petrm@mellanox.com>
>> ---
>>   net/bridge/br.c           |  4 ++--
>>   net/bridge/br_fdb.c       | 40 +++++++++++++++++++++++++++-------------
>>   net/bridge/br_private.h   |  4 ++--
>>   net/bridge/br_switchdev.c |  2 +-
>>   4 files changed, 32 insertions(+), 18 deletions(-)
>>
>
> Hi Petr,
> We already have 7 different fdb delete functions, I'm really not a fan of
> adding yet another one for such trivial change.
> Why don't you just add the new notify parameter to the already existing
> fdb_delete() ? (actually about the name see below)
> IMO it's confusing - if one wants a notification then use fdb_delete() or __fdb_delete(true)
> vs __fdb_delete(false) if a notification is not required. I think simply having the last
> parameter everywhere for fdb_delete() shows the intention clearer and avoids another
> fdb delete function.

All right--this is how I had it written actually, but then decided to do
this wrapping, because so many of the calls end up being true. I'll send
a v2 with just the extra argument.

> Another point, the notify parameter has a confusing name in this context because
> you're controlling the switchdev notifications not the rtnetlink ones. I'd suggest
> changing the name to something more descriptive like swdev_notify, otherwise you
> could get the funny end result of calling __fdb_notify() with notify == false which
> to me means don't notify. :-)

OK, swdev_notify it will be.

> Also please add the bridge maintainers to the CC list.

bridge@lists.linux-foundation.org? I saw it's a moderated list and for
some reason that made me think it's not meant for patch postings. I'll
add them the next time.

Thanks,
Petr

^ permalink raw reply

* Re: [PATCH net-next v7 1/3] vmcore: add API to collect hardware dump in second kernel
From: Eric W. Biederman @ 2018-05-01 22:41 UTC (permalink / raw)
  To: Rahul Lakkireddy
  Cc: netdev, kexec, linux-fsdevel, linux-kernel, davem, viro, stephen,
	akpm, torvalds, ganeshgr, nirranjan, indranil
In-Reply-To: <b6065b53c5446d98ee55e09119f6821f979418d4.1525197408.git.rahul.lakkireddy@chelsio.com>

Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> writes:

> The sequence of actions done by device drivers to append their device
> specific hardware/firmware logs to /proc/vmcore are as follows:

Except for the missing include that the kbuild test robot caught
I am not seeing a problems.

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>

> 1. During probe (before hardware is initialized), device drivers
> register to the vmcore module (via vmcore_add_device_dump()), with
> callback function, along with buffer size and log name needed for
> firmware/hardware log collection.
>
> 2. vmcore module allocates the buffer with requested size. It adds
> an Elf note and invokes the device driver's registered callback
> function.
>
> 3. Device driver collects all hardware/firmware logs into the buffer
> and returns control back to vmcore module.
>
> Ensure that the device dump buffer size is always aligned to page size
> so that it can be mmaped.
>
> Also, rename alloc_elfnotes_buf() to vmcore_alloc_buf() to make it more
> generic and reserve NT_VMCOREDD note type to indicate vmcore device
> dump.
>
> Suggested-by: Eric Biederman <ebiederm@xmission.com>.
> Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
> Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
> ---
> v7:
> - Removed "CHELSIO" vendor identifier in Elf Note name. Instead,
>   writing "LINUX".
> - Moved vmcoredd_header to new file include/uapi/linux/vmcore.h
> - Reworked vmcoredd_header to include Elf Note as part of the header
>   itself.
> - Removed vmcoredd_get_note_size().
> - Renamed vmcoredd_write_note() to vmcoredd_write_header().
> - Replaced all "unsigned long" with "unsigned int" for device dump
>   size since max size of Elf Word is u32.
>
> v6:
> - Reworked device dump elf note name to contain vendor identifier.
> - Added vmcoredd_header that precedes actual dump in the Elf Note.
> - Device dump's name is moved inside vmcoredd_header.
>
> v5:
> - Removed enabling CONFIG_PROC_VMCORE_DEVICE_DUMP by default and
>   updated help message to indicate that the driver must be present
>   in second kernel's initramfs to collect the underlying device
>   snapshot.
>
> v4:
> - Made __vmcore_add_device_dump() static.
> - Moved compile check to define vmcore_add_device_dump() to
>   crash_dump.h to fix compilation when vmcore.c is not compiled in.
> - Convert ---help--- to help in Kconfig as indicated by checkpatch.
> - Rebased to tip.
>
> v3:
> - Dropped sysfs crashdd module.
> - Added CONFIG_PROC_VMCORE_DEVICE_DUMP to allow configuring device
>   dump support.
> - Moved logic related to adding dumps from crashdd to vmcore module.
> - Rename all crashdd* to vmcoredd*.
>
> v2:
> - Added ABI Documentation for crashdd.
> - Directly use octal permission instead of macro.
>
> Changes since rfc v2:
> - Moved exporting crashdd from procfs to sysfs.  Suggested by
>   Stephen Hemminger <stephen@networkplumber.org>
> - Moved code from fs/proc/crashdd.c to fs/crashdd/ directory.
> - Replaced all proc API with sysfs API and updated comments.
> - Calling driver callback before creating the binary file under
>   crashdd sysfs.
> - Changed binary dump file permission from S_IRUSR to S_IRUGO.
> - Changed module name from CRASH_DRIVER_DUMP to CRASH_DEVICE_DUMP.
>
> rfc v2:
> - Collecting logs in 2nd kernel instead of during kernel panic.
>   Suggested by Eric Biederman <ebiederm@xmission.com>.
> - Patch added in this series.
>
>  fs/proc/Kconfig             |  15 +++++
>  fs/proc/vmcore.c            | 140 +++++++++++++++++++++++++++++++++++++++++---
>  include/linux/crash_dump.h  |  18 ++++++
>  include/linux/kcore.h       |   6 ++
>  include/uapi/linux/elf.h    |   1 +
>  include/uapi/linux/vmcore.h |  16 +++++
>  6 files changed, 187 insertions(+), 9 deletions(-)
>  create mode 100644 include/uapi/linux/vmcore.h
>
> diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
> index 1ade1206bb89..0eaeb41453f5 100644
> --- a/fs/proc/Kconfig
> +++ b/fs/proc/Kconfig
> @@ -43,6 +43,21 @@ config PROC_VMCORE
>          help
>          Exports the dump image of crashed kernel in ELF format.
>  
> +config PROC_VMCORE_DEVICE_DUMP
> +	bool "Device Hardware/Firmware Log Collection"
> +	depends on PROC_VMCORE
> +	default n
> +	help
> +	  After kernel panic, device drivers can collect the device
> +	  specific snapshot of their hardware or firmware before the
> +	  underlying devices are initialized in crash recovery kernel.
> +	  Note that the device driver must be present in the crash
> +	  recovery kernel's initramfs to collect its underlying device
> +	  snapshot.
> +
> +	  If you say Y here, the collected device dumps will be added
> +	  as ELF notes to /proc/vmcore.
> +
>  config PROC_SYSCTL
>  	bool "Sysctl support (/proc/sys)" if EXPERT
>  	depends on PROC_FS
> diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
> index a45f0af22a60..60fce8dfe4e3 100644
> --- a/fs/proc/vmcore.c
> +++ b/fs/proc/vmcore.c
> @@ -20,6 +20,7 @@
>  #include <linux/init.h>
>  #include <linux/crash_dump.h>
>  #include <linux/list.h>
> +#include <linux/mutex.h>
>  #include <linux/vmalloc.h>
>  #include <linux/pagemap.h>
>  #include <linux/uaccess.h>
> @@ -44,6 +45,12 @@ static u64 vmcore_size;
>  
>  static struct proc_dir_entry *proc_vmcore;
>  
> +#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
> +/* Device Dump list and mutex to synchronize access to list */
> +static LIST_HEAD(vmcoredd_list);
> +static DEFINE_MUTEX(vmcoredd_mutex);
> +#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
> +
>  /*
>   * Returns > 0 for RAM pages, 0 for non-RAM pages, < 0 on error
>   * The called function has to take care of module refcounting.
> @@ -302,10 +309,8 @@ static const struct vm_operations_struct vmcore_mmap_ops = {
>  };
>  
>  /**
> - * alloc_elfnotes_buf - allocate buffer for ELF note segment in
> - *                      vmalloc memory
> - *
> - * @notes_sz: size of buffer
> + * vmcore_alloc_buf - allocate buffer in vmalloc memory
> + * @sizez: size of buffer
>   *
>   * If CONFIG_MMU is defined, use vmalloc_user() to allow users to mmap
>   * the buffer to user-space by means of remap_vmalloc_range().
> @@ -313,12 +318,12 @@ static const struct vm_operations_struct vmcore_mmap_ops = {
>   * If CONFIG_MMU is not defined, use vzalloc() since mmap_vmcore() is
>   * disabled and there's no need to allow users to mmap the buffer.
>   */
> -static inline char *alloc_elfnotes_buf(size_t notes_sz)
> +static inline char *vmcore_alloc_buf(size_t size)
>  {
>  #ifdef CONFIG_MMU
> -	return vmalloc_user(notes_sz);
> +	return vmalloc_user(size);
>  #else
> -	return vzalloc(notes_sz);
> +	return vzalloc(size);
>  #endif
>  }
>  
> @@ -665,7 +670,7 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
>  		return rc;
>  
>  	*notes_sz = roundup(phdr_sz, PAGE_SIZE);
> -	*notes_buf = alloc_elfnotes_buf(*notes_sz);
> +	*notes_buf = vmcore_alloc_buf(*notes_sz);
>  	if (!*notes_buf)
>  		return -ENOMEM;
>  
> @@ -851,7 +856,7 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
>  		return rc;
>  
>  	*notes_sz = roundup(phdr_sz, PAGE_SIZE);
> -	*notes_buf = alloc_elfnotes_buf(*notes_sz);
> +	*notes_buf = vmcore_alloc_buf(*notes_sz);
>  	if (!*notes_buf)
>  		return -ENOMEM;
>  
> @@ -1145,6 +1150,120 @@ static int __init parse_crash_elf_headers(void)
>  	return 0;
>  }
>  
> +#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
> +/**
> + * vmcoredd_write_header - Write vmcore device dump header at the
> + * beginning of the dump's buffer.
> + * @buf: Output buffer where the note is written
> + * @data: Dump info
> + * @size: Size of the dump
> + *
> + * Fills beginning of the dump's buffer with vmcore device dump header.
> + */
> +static void vmcoredd_write_header(void *buf, struct vmcoredd_data *data,
> +				  u32 size)
> +{
> +	struct vmcoredd_header *vdd_hdr = (struct vmcoredd_header *)buf;
> +
> +	vdd_hdr->n_namesz = sizeof(vdd_hdr->name);
> +	vdd_hdr->n_descsz = size + sizeof(vdd_hdr->dump_name);
> +	vdd_hdr->n_type = NT_VMCOREDD;
> +
> +	strncpy((char *)vdd_hdr->name, VMCOREDD_NOTE_NAME,
> +		sizeof(vdd_hdr->name));
> +	memcpy(vdd_hdr->dump_name, data->dump_name, sizeof(vdd_hdr->dump_name));
> +}
> +
> +/**
> + * vmcore_add_device_dump - Add a buffer containing device dump to vmcore
> + * @data: dump info.
> + *
> + * Allocate a buffer and invoke the calling driver's dump collect routine.
> + * Write Elf note at the beginning of the buffer to indicate vmcore device
> + * dump and add the dump to global list.
> + */
> +static int __vmcore_add_device_dump(struct vmcoredd_data *data)
> +{
> +	struct vmcoredd_node *dump;
> +	void *buf = NULL;
> +	size_t data_size;
> +	int ret;
> +
> +	if (!data || !strlen(data->dump_name) ||
> +	    !data->vmcoredd_callback || !data->size)
> +		return -EINVAL;
> +
> +	dump = vzalloc(sizeof(*dump));
> +	if (!dump) {
> +		ret = -ENOMEM;
> +		goto out_err;
> +	}
> +
> +	/* Keep size of the buffer page aligned so that it can be mmaped */
> +	data_size = roundup(sizeof(struct vmcoredd_header) + data->size,
> +			    PAGE_SIZE);
> +
> +	/* Allocate buffer for driver's to write their dumps */
> +	buf = vmcore_alloc_buf(data_size);
> +	if (!buf) {
> +		ret = -ENOMEM;
> +		goto out_err;
> +	}
> +
> +	vmcoredd_write_header(buf, data, data_size -
> +			      sizeof(struct vmcoredd_header));
> +
> +	/* Invoke the driver's dump collection routing */
> +	ret = data->vmcoredd_callback(data, buf +
> +				      sizeof(struct vmcoredd_header));
> +	if (ret)
> +		goto out_err;
> +
> +	dump->buf = buf;
> +	dump->size = data_size;
> +
> +	/* Add the dump to driver sysfs list */
> +	mutex_lock(&vmcoredd_mutex);
> +	list_add_tail(&dump->list, &vmcoredd_list);
> +	mutex_unlock(&vmcoredd_mutex);
> +
> +	return 0;
> +
> +out_err:
> +	if (buf)
> +		vfree(buf);
> +
> +	if (dump)
> +		vfree(dump);
> +
> +	return ret;
> +}
> +
> +int vmcore_add_device_dump(struct vmcoredd_data *data)
> +{
> +	return __vmcore_add_device_dump(data);
> +}
> +EXPORT_SYMBOL(vmcore_add_device_dump);
> +#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
> +
> +/* Free all dumps in vmcore device dump list */
> +static void vmcore_free_device_dumps(void)
> +{
> +#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
> +	mutex_lock(&vmcoredd_mutex);
> +	while (!list_empty(&vmcoredd_list)) {
> +		struct vmcoredd_node *dump;
> +
> +		dump = list_first_entry(&vmcoredd_list, struct vmcoredd_node,
> +					list);
> +		list_del(&dump->list);
> +		vfree(dump->buf);
> +		vfree(dump);
> +	}
> +	mutex_unlock(&vmcoredd_mutex);
> +#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
> +}
> +
>  /* Init function for vmcore module. */
>  static int __init vmcore_init(void)
>  {
> @@ -1192,4 +1311,7 @@ void vmcore_cleanup(void)
>  		kfree(m);
>  	}
>  	free_elfcorebuf();
> +
> +	/* clear vmcore device dump list */
> +	vmcore_free_device_dumps();
>  }
> diff --git a/include/linux/crash_dump.h b/include/linux/crash_dump.h
> index f7ac2aa93269..3e4ba9d753c8 100644
> --- a/include/linux/crash_dump.h
> +++ b/include/linux/crash_dump.h
> @@ -5,6 +5,7 @@
>  #include <linux/kexec.h>
>  #include <linux/proc_fs.h>
>  #include <linux/elf.h>
> +#include <uapi/linux/vmcore.h>
>  
>  #include <asm/pgtable.h> /* for pgprot_t */
>  
> @@ -93,4 +94,21 @@ static inline bool is_kdump_kernel(void) { return 0; }
>  #endif /* CONFIG_CRASH_DUMP */
>  
>  extern unsigned long saved_max_pfn;
> +
> +/* Device Dump information to be filled by drivers */
> +struct vmcoredd_data {
> +	char dump_name[VMCOREDD_MAX_NAME_BYTES]; /* Unique name of the dump */
> +	unsigned int size;                       /* Size of the dump */
> +	/* Driver's registered callback to be invoked to collect dump */
> +	int (*vmcoredd_callback)(struct vmcoredd_data *data, void *buf);
> +};
> +
> +#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
> +int vmcore_add_device_dump(struct vmcoredd_data *data);
> +#else
> +static inline int vmcore_add_device_dump(struct vmcoredd_data *data)
> +{
> +	return -EOPNOTSUPP;
> +}
> +#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
>  #endif /* LINUX_CRASHDUMP_H */
> diff --git a/include/linux/kcore.h b/include/linux/kcore.h
> index 80db19d3a505..8de55e4b5ee9 100644
> --- a/include/linux/kcore.h
> +++ b/include/linux/kcore.h
> @@ -28,6 +28,12 @@ struct vmcore {
>  	loff_t offset;
>  };
>  
> +struct vmcoredd_node {
> +	struct list_head list;	/* List of dumps */
> +	void *buf;		/* Buffer containing device's dump */
> +	unsigned int size;	/* Size of the buffer */
> +};
> +
>  #ifdef CONFIG_PROC_KCORE
>  extern void kclist_add(struct kcore_list *, void *, size_t, int type);
>  #else
> diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
> index e2535d6dcec7..4e12c423b9fe 100644
> --- a/include/uapi/linux/elf.h
> +++ b/include/uapi/linux/elf.h
> @@ -421,6 +421,7 @@ typedef struct elf64_shdr {
>  #define NT_ARM_SYSTEM_CALL	0x404	/* ARM system call number */
>  #define NT_ARM_SVE	0x405		/* ARM Scalable Vector Extension registers */
>  #define NT_ARC_V2	0x600		/* ARCv2 accumulator/extra registers */
> +#define NT_VMCOREDD	0x700		/* Vmcore Device Dump Note */
>  
>  /* Note header in a PT_NOTE section */
>  typedef struct elf32_note {
> diff --git a/include/uapi/linux/vmcore.h b/include/uapi/linux/vmcore.h
> new file mode 100644
> index 000000000000..f9eab9a37370
> --- /dev/null
> +++ b/include/uapi/linux/vmcore.h
> @@ -0,0 +1,16 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _UAPI_VMCORE_H
> +#define _UAPI_VMCORE_H
> +
> +#define VMCOREDD_NOTE_NAME "LINUX"
> +#define VMCOREDD_MAX_NAME_BYTES 44
> +
> +struct vmcoredd_header {
> +	__u32 n_namesz; /* Name size */
> +	__u32 n_descsz; /* Content size */
> +	__u32 n_type;   /* NT_VMCOREDD */
> +	__u8 name[8];   /* LINUX\0\0\0 */
> +	__u8 dump_name[VMCOREDD_MAX_NAME_BYTES]; /* Device dump's name */
> +};
> +
> +#endif /* _UAPI_VMCORE_H */

^ permalink raw reply

* Re: RTL8723BE performance regression
From: João Paulo Rechi Vita @ 2018-05-01 22:41 UTC (permalink / raw)
  To: Larry Finger
  Cc: Steve deRosier, Yan-Hsuan Chuang, Ping-Ke Shih, Birming Chiu,
	Shaofu, Steven Ting, Chaoming Li, Kalle Valo, linux-wireless,
	Network Development, LKML, Daniel Drake,
	João Paulo Rechi Vita, linux
In-Reply-To: <059b40f0-b8e2-b55f-92d5-a859ba4204a4@lwfinger.net>

On Tue, Apr 3, 2018 at 7:51 PM, Larry Finger <Larry.Finger@lwfinger.net> wrote:
> On 04/03/2018 09:37 PM, João Paulo Rechi Vita wrote:
>>
>> On Tue, Apr 3, 2018 at 7:28 PM, Larry Finger <Larry.Finger@lwfinger.net>
>> wrote:
>>
>> (...)
>>
>>> As the antenna selection code changes affected your first bisection, do
>>> you
>>> have one of those HP laptops with only one antenna and the incorrect
>>> coding
>>> in the FUSE?
>>
>>
>> Yes, that is why I've been passing ant_sel=1 during my tests -- this
>> was needed to achieve a good performance in the past, before this
>> regression. I've also opened the laptop chassis and confirmed the
>> antenna cable is plugged to the connector labeled with "1" on the
>> card.
>>
>>> If so, please make sure that you still have the same signal
>>> strength for good and bad cases. I have tried to keep the driver and the
>>> btcoex code in sync, but there may be some combinations of antenna
>>> configuration and FUSE contents that cause the code to fail.
>>>
>>
>> What is the recommended way to monitor the signal strength?
>
>
> The btcoex code is developed for multiple platforms by a different group
> than the Linux driver. I think they made a change that caused ant_sel to
> switch from 1 to 2. At least numerous comments at
> github.com/lwfinger/rtlwifi_new claimed they needed to make that change.
>
> Mhy recommended method is to verify the wifi device name with "iw dev". Then
> using that device
>
> sudo iw dev <dev_name> scan | egrep "SSID|signal"
>

I have confirmed that the performance regression is indeed tied to
signal strength: on the good cases signal was between -16 and -8 dBm,
whereas in bad cases signal was always between -50 to - 40 dBm. I've
also switched to testing bandwidth in controlled LAN environment using
iperf3, as suggested by Steve deRosier, with the DUT being the only
machine connected to the 2.4 GHz radio and the machine running the
iperf3 server connected via ethernet.

Using those two tests (iperf3 and signal strength) I've dug deeper
into the culprit I had found previously, commit 7937f02d1953,
reverting it partially and testing the resulting driver, to isolate
which change was causing the problem. Besides "hooking up external
functions for newer ICs", as described by the commit message, that
commit also added code to decided whether ex_btc8723b1ant_*() or
ex_btc8723b2ant_*() functions should be used in halbtcoutsrc.c,
depending on the value of board_info.btdm_ant_num, whereas before that
commit ex_btc8723b2ant_*() were always used. Reverting to always using
ex_btc8723b2ant_*() functions fixes the regression on v4.15.

I've also tried to bisect between v4.15..v4.16 to find what else was
causing problems there, as the changes mentioned above on top of v4.16
did not solve the problem. The bisect pointed to "874e837d67d0
rtlwifi: fill FW version and subversion", only but reverting it plus
the changes mentioned above also didn't yield good results. That's
when I decided to get a bit creative: starting on v4.16 I first
applied the changes to have ex_btc8723b2ant_*() always being used, as
mentioned above, then reverted every commit between v4.15..v4.16
affecting drivers/net/wireless/realtek/rtlwifi/, and verified the
resulting kernel had a good performance. Then I started trimming down
the history and testing along the way, to reduce to the minimum set of
changes that had to be reverted in order to restore the good
performance. In addition to the ex_btc8723b2ant_*() changes and
reverting "874e837d67d0 rtlwifi: fill FW version and subversion", I've
also had to remove the following lines from
drivers/net/wireless/realtek/rtlwifi/btcoexist/halbtcoutsrc.c, which
were introduced by "40d9dd4f1c5d rtlwifi: btcoex: Remove global
variables from btcoex", in order to restore the upload performance and
signal strength.

        /* set default antenna position to main  port */
        btcoexist->board_info.btdm_ant_pos = BTC_ANTENNA_AT_MAIN_PORT;

These are the results I've got on v4.16 (similarly on
wireless-drivers-next-for-davem-2018-03-29 or v4.15):

     $ sudo iw dev wlp2s0 scan | grep -B3 JJ | grep signal
            signal: -42.00 dBm
     $ iperf3 -c 192.168.1.254
     Connecting to host 192.168.1.254, port 5201
     [  4] local 192.168.1.253 port 39678 connected to 192.168.1.254 port 5201
     [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
     [  4]   0.00-1.00   sec   735 KBytes  6.02 Mbits/sec    1   1.41 KBytes
     [  4]   1.00-2.00   sec   274 KBytes  2.25 Mbits/sec    1   1.41 KBytes
     [  4]   2.00-3.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
     [  4]   3.00-4.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
     [  4]   4.00-5.00   sec  0.00 Bytes  0.00 bits/sec    1   28.3 KBytes
     [  4]   5.00-6.00   sec   423 KBytes  3.47 Mbits/sec    3   41.0 KBytes
     [  4]   6.00-7.00   sec   840 KBytes  6.88 Mbits/sec    0   58.0 KBytes
     [  4]   7.00-8.00   sec   830 KBytes  6.79 Mbits/sec    1   1.41 KBytes
     [  4]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
     [  4]   9.00-10.00  sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
     - - - - - - - - - - - - - - - - - - - - - - - - -
     [ ID] Interval           Transfer     Bandwidth       Retr
     [  4]   0.00-10.00  sec  3.03 MBytes  2.54 Mbits/sec    7
    sender
     [  4]   0.00-10.00  sec  2.88 MBytes  2.41 Mbits/sec
    receiver

     iperf Done.

And these are the results on v4.16 with all the changes mentioned on
the previous paragraph, and similar results were obtained when
applying the same changes on the current wireless-drivers-next master
branch (f56324baf329):

     $ sudo iw dev wlp2s0 scan | grep -B3 JJ | grep signal
            signal: -14.00 dBm
     $ iperf3 -c 192.168.1.254
     Connecting to host 192.168.1.254, port 5201
     [  4] local 192.168.1.253 port 59380 connected to 192.168.1.254 port 5201
     [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
     [  4]   0.00-1.00   sec  4.63 MBytes  38.8 Mbits/sec    0    194 KBytes
     [  4]   1.00-2.00   sec  4.58 MBytes  38.4 Mbits/sec    0    273 KBytes
     [  4]   2.00-3.00   sec  5.05 MBytes  42.3 Mbits/sec    0    332 KBytes
     [  4]   3.00-4.00   sec  4.98 MBytes  41.8 Mbits/sec    0    393 KBytes
     [  4]   4.00-5.00   sec  4.76 MBytes  39.9 Mbits/sec    0    434 KBytes
     [  4]   5.00-6.00   sec  4.85 MBytes  40.7 Mbits/sec    0    434 KBytes
     [  4]   6.00-7.00   sec  3.96 MBytes  33.2 Mbits/sec    0    464 KBytes
     [  4]   7.00-8.00   sec  4.74 MBytes  39.8 Mbits/sec    0    481 KBytes
     [  4]   8.00-9.00   sec  4.22 MBytes  35.4 Mbits/sec    0    508 KBytes
     [  4]   9.00-10.00  sec  4.09 MBytes  34.3 Mbits/sec    0    564 KBytes
     - - - - - - - - - - - - - - - - - - - - - - - - -
     [ ID] Interval           Transfer     Bandwidth       Retr
     [  4]   0.00-10.00  sec  45.9 MBytes  38.5 Mbits/sec    0
    sender
     [  4]   0.00-10.00  sec  45.0 MBytes  37.7 Mbits/sec
    receiver

     iperf Done.

I'll send an RFC patchset with the changes that I mentioned above.
Please advise whether those changes themselves should be merged, or if
you suggest a different approach.

^ permalink raw reply

* Re: Silently dropped UDP packets on kernel 4.14
From: Kristian Evensen @ 2018-05-01 22:42 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Netfilter Development Mailing list, Network Development
In-Reply-To: <20180501222406.fy5rph2t4dzndhu2@breakpoint.cc>

Hi,

Thanks for your quick and detailed reply!

On Wed, May 2, 2018 at 12:24 AM, Florian Westphal <fw@strlen.de> wrote:
> I'm not sure what the best way to solve this is, we either need
> to insert earlier in nfqueue case, or do conflict resolution in nfqueue
> case (and perhaps also nat undo? not sure).

My knowledge of the conntrack/nat subsystem is not that great, and I
don't know the implications of what I am about to suggest. However,
considering that the two packets represent the same flow, wouldn't it
be possible to apply the existing nat-mapping to the second packet,
and then let the second packet pass?

BR,
Kristian

^ permalink raw reply

* [RFC PATCH 1/3] rtlwifi: btcoex: Don't init antenna position to main port
From: João Paulo Rechi Vita @ 2018-05-01 22:46 UTC (permalink / raw)
  To: Larry Finger
  Cc: Steve deRosier, Yan-Hsuan Chuang, Ping-Ke Shih, Birming Chiu,
	Shaofu, Steven Ting, Chaoming Li, Kalle Valo, linux-wireless,
	Network Development, LKML, Daniel Drake,
	João Paulo Rechi Vita, linux
In-Reply-To: <CA+A7VXUUsZHD2gr9TBcV6jZBOPGZv7_vK-wWZ0MvGgiCnAiUgQ@mail.gmail.com>

This partially reverts commit 40d9dd4f1c5dcd0d4a2a1f0efcd225c9c4b36d6f,
which does not only remove global variables from btcoex, as described on
its commit message, but also does a few other things -- in particular,
setting the default antenna position to BTC_ANTENNA_AT_MAIN_PORT.

This drastically affects the upload performance and signal strenght on
the HP 240 G4 laptop, as shown by the results bellow:

Without this change:

 $ sudo iw dev wlp2s0 scan | grep -B3 JJ | grep signal
 	signal: -42.00 dBm
 $ iperf3 -c 192.168.1.254
 Connecting to host 192.168.1.254, port 5201
 [  4] local 192.168.1.253 port 39678 connected to 192.168.1.254 port 5201
 [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
 [  4]   0.00-1.00   sec   735 KBytes  6.02 Mbits/sec    1   1.41 KBytes
 [  4]   1.00-2.00   sec   274 KBytes  2.25 Mbits/sec    1   1.41 KBytes
 [  4]   2.00-3.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
 [  4]   3.00-4.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
 [  4]   4.00-5.00   sec  0.00 Bytes  0.00 bits/sec    1   28.3 KBytes
 [  4]   5.00-6.00   sec   423 KBytes  3.47 Mbits/sec    3   41.0 KBytes
 [  4]   6.00-7.00   sec   840 KBytes  6.88 Mbits/sec    0   58.0 KBytes
 [  4]   7.00-8.00   sec   830 KBytes  6.79 Mbits/sec    1   1.41 KBytes
 [  4]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
 [  4]   9.00-10.00  sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
 - - - - - - - - - - - - - - - - - - - - - - - - -
 [ ID] Interval           Transfer     Bandwidth       Retr
 [  4]   0.00-10.00  sec  3.03 MBytes  2.54 Mbits/sec    7             sender
 [  4]   0.00-10.00  sec  2.88 MBytes  2.41 Mbits/sec                  receiver

 iperf Done.

With this change:

 $ sudo iw dev wlp2s0 scan | grep -B3 JJ | grep signal
 	signal: -14.00 dBm
 $ iperf3 -c 192.168.1.254
 Connecting to host 192.168.1.254, port 5201
 [  4] local 192.168.1.253 port 59380 connected to 192.168.1.254 port 5201
 [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
 [  4]   0.00-1.00   sec  4.63 MBytes  38.8 Mbits/sec    0    194 KBytes
 [  4]   1.00-2.00   sec  4.58 MBytes  38.4 Mbits/sec    0    273 KBytes
 [  4]   2.00-3.00   sec  5.05 MBytes  42.3 Mbits/sec    0    332 KBytes
 [  4]   3.00-4.00   sec  4.98 MBytes  41.8 Mbits/sec    0    393 KBytes
 [  4]   4.00-5.00   sec  4.76 MBytes  39.9 Mbits/sec    0    434 KBytes
 [  4]   5.00-6.00   sec  4.85 MBytes  40.7 Mbits/sec    0    434 KBytes
 [  4]   6.00-7.00   sec  3.96 MBytes  33.2 Mbits/sec    0    464 KBytes
 [  4]   7.00-8.00   sec  4.74 MBytes  39.8 Mbits/sec    0    481 KBytes
 [  4]   8.00-9.00   sec  4.22 MBytes  35.4 Mbits/sec    0    508 KBytes
 [  4]   9.00-10.00  sec  4.09 MBytes  34.3 Mbits/sec    0    564 KBytes
 - - - - - - - - - - - - - - - - - - - - - - - - -
 [ ID] Interval           Transfer     Bandwidth       Retr
 [  4]   0.00-10.00  sec  45.9 MBytes  38.5 Mbits/sec    0             sender
 [  4]   0.00-10.00  sec  45.0 MBytes  37.7 Mbits/sec                  receiver

 iperf Done.

Signed-off-by: João Paulo Rechi Vita <jprvita@endlessm.com>
---
 drivers/net/wireless/realtek/rtlwifi/btcoexist/halbtcoutsrc.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/net/wireless/realtek/rtlwifi/btcoexist/halbtcoutsrc.c b/drivers/net/wireless/realtek/rtlwifi/btcoexist/halbtcoutsrc.c
index b026e80940a4..86182b917c92 100644
--- a/drivers/net/wireless/realtek/rtlwifi/btcoexist/halbtcoutsrc.c
+++ b/drivers/net/wireless/realtek/rtlwifi/btcoexist/halbtcoutsrc.c
@@ -1388,9 +1388,6 @@ bool exhalbtc_bind_bt_coex_withadapter(void *adapter)
 	ant_num = rtl_get_hwpg_ant_num(rtlpriv);
 	exhalbtc_set_ant_num(rtlpriv, BT_COEX_ANT_TYPE_PG, ant_num);
 
-	/* set default antenna position to main  port */
-	btcoexist->board_info.btdm_ant_pos = BTC_ANTENNA_AT_MAIN_PORT;
-
 	single_ant_path = rtl_get_hwpg_single_ant_path(rtlpriv);
 	exhalbtc_set_single_ant_path(btcoexist, single_ant_path);
 
-- 
2.17.0

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox