* [PATCH net] tcp: better handle TCP_USER_TIMEOUT in SYN_SENT state
@ 2019-09-26 22:42 Eric Dumazet
  2019-09-27  8:25 ` Jonathan Maxwell (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Eric Dumazet @ 2019-09-26 22:42 UTC (permalink / raw)
To: David S. Miller
Cc: netdev, Eric Dumazet, Eric Dumazet, Yuchung Cheng,
Marek Majkowski, Jon Maxwell
Yuchung Cheng and Marek Majkowski independently reported a weird
behavior of TCP_USER_TIMEOUT option when used at connect() time.
When the TCP_USER_TIMEOUT is reached, tcp_write_timeout()
believes the flow should live, and the following condition
in tcp_clamp_rto_to_user_timeout() programs a one-jiffy timer:

    remaining = icsk->icsk_user_timeout - elapsed;
    if (remaining <= 0)
        return 1; /* user timeout has passed; fire ASAP */
This silly situation only ends when the max SYN retransmit count is reached.
This patch makes sure we honor both TCP_SYNCNT and TCP_USER_TIMEOUT,
avoiding these spurious SYN packets.
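
For anyone who wants to reproduce the old behavior, a stand-alone snippet like
the one below (not part of this patch) is enough when watched with tcpdump. The
TEST-NET-1 address, the discard port and the 2s timeout are arbitrary
assumptions chosen so that no SYN-ACK ever comes back:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
        struct sockaddr_in addr;
        int timeout_ms = 2000;  /* well below the default SYN retry schedule */
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        /* Must be set before connect() for the SYN_SENT case discussed here. */
        setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT, &timeout_ms, sizeof(timeout_ms));

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(9);                               /* assumed filtered port */
        inet_pton(AF_INET, "192.0.2.1", &addr.sin_addr);        /* TEST-NET-1, assumed blackholed */

        /* Blocks until the connection attempt gives up. */
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == -1)
                perror("connect");                              /* expected: ETIMEDOUT */
        close(fd);
        return 0;
}

Before the fix, the remaining SYN retransmits bunch up right after the user
timeout elapses; with it, retransmission stops as soon as either limit is hit.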
Fixes: b701a99e431d ("tcp: Add tcp_clamp_rto_to_user_timeout() helper to improve accuracy")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Yuchung Cheng <ycheng@google.com>
Reported-by: Marek Majkowski <marek@cloudflare.com>
Cc: Jon Maxwell <jmaxwell37@gmail.com>
Link: https://marc.info/?l=linux-netdev&m=156940118307949&w=2
---
net/ipv4/tcp_timer.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index dbd9d2d0ee63aa46ad2dda417da6ec9409442b77..40de2d2364a1eca14c259d77ebed361d17829eb9 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -210,7 +210,7 @@ static int tcp_write_timeout(struct sock *sk)
         struct inet_connection_sock *icsk = inet_csk(sk);
         struct tcp_sock *tp = tcp_sk(sk);
         struct net *net = sock_net(sk);
-        bool expired, do_reset;
+        bool expired = false, do_reset;
         int retry_until;
 
         if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
@@ -242,9 +242,10 @@ static int tcp_write_timeout(struct sock *sk)
                         if (tcp_out_of_resources(sk, do_reset))
                                 return 1;
                 }
+        }
+        if (!expired)
                 expired = retransmits_timed_out(sk, retry_until,
                                                 icsk->icsk_user_timeout);
-        }
         tcp_fastopen_active_detect_blackhole(sk, expired);
 
         if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_RTO_CB_FLAG))
--
2.23.0.444.g18eeb5a265-goog
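
One detail that is easy to miss when reading the diff above: the assignment that
computes "expired" for SYN_SENT/SYN_RECV sockets is unchanged context, so it does
not appear in the hunks. A condensed paraphrase of the resulting control flow
(hand-written for illustration only, with made-up placeholder helpers; only
retransmits_timed_out() is a real kernel function) would be:

/*
 * Sketch of tcp_write_timeout() after this patch.  Not the kernel source.
 */
static bool write_timeout_expired_sketch(struct sock *sk)
{
        bool expired = false;   /* was left uninitialized before the patch */
        int retry_until;

        if (state_is_syn_sent_or_syn_recv(sk)) {
                /* Unchanged line, hence invisible in the diff: the SYN
                 * retransmit counter is checked against TCP_SYNCNT /
                 * net.ipv4.tcp_syn_retries here. */
                retry_until = syn_retry_limit(sk);
                expired = syn_retransmit_count(sk) >= retry_until;
        } else {
                retry_until = established_retry_limit(sk);
                /* orphan / out-of-resources handling elided */
        }

        /* New: reached for SYN_SENT/SYN_RECV too, so the flow also dies once
         * TCP_USER_TIMEOUT has elapsed instead of looping on one-jiffy SYN
         * retransmissions until TCP_SYNCNT is exhausted. */
        if (!expired)
                expired = retransmits_timed_out(sk, retry_until,
                                                inet_csk(sk)->icsk_user_timeout);
        return expired;
}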
* Re: [PATCH net] tcp: better handle TCP_USER_TIMEOUT in SYN_SENT state
  2019-09-26 22:42 [PATCH net] tcp: better handle TCP_USER_TIMEOUT in SYN_SENT state Eric Dumazet
@ 2019-09-27  8:25 ` Jonathan Maxwell
  2019-09-27 12:40   ` Marek Majkowski
  2019-09-27 13:18     ` Eric Dumazet
  2019-09-27 16:35 ` Yuchung Cheng
  2019-09-27 18:42 ` David Miller
  2 siblings, 1 reply; 6+ messages in thread
From: Jonathan Maxwell @ 2019-09-27  8:25 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, netdev, Eric Dumazet, Yuchung Cheng, Marek Majkowski

Acked-by: Jon Maxwell <jmaxwell37@gmail.com>

Thanks for fixing that Eric.

On Fri, Sep 27, 2019 at 8:42 AM Eric Dumazet <edumazet@google.com> wrote:
>
> Yuchung Cheng and Marek Majkowski independently reported a weird
> behavior of TCP_USER_TIMEOUT option when used at connect() time.
> [...]
* Re: [PATCH net] tcp: better handle TCP_USER_TIMEOUT in SYN_SENT state
  2019-09-27  8:25 ` Jonathan Maxwell
@ 2019-09-27 12:40   ` Marek Majkowski
  2019-09-27 13:18     ` Eric Dumazet
  0 siblings, 1 reply; 6+ messages in thread
From: Marek Majkowski @ 2019-09-27 12:40 UTC (permalink / raw)
  To: edumazet
  Cc: davem, jmaxwell37, eric.dumazet, marek, netdev, ycheng, kernel-team

On 9/27/19 10:25 AM, Jonathan Maxwell wrote:
> Acked-by: Jon Maxwell <jmaxwell37@gmail.com>
>
> Thanks for fixing that Eric.
>

The patch seems to do the job.

Tested-by: Marek Majkowski <marek@cloudflare.com>

Here's a selftest:

---8<---
From: Marek Majkowski <marek@cloudflare.com>
Date: Fri, 27 Sep 2019 13:37:52 +0200
Subject: [PATCH] selftests/net: TCP_USER_TIMEOUT in SYN-SENT state

Test the TCP_USER_TIMEOUT behavior, overriding TCP_SYNCNT
when socket is in SYN-SENT state.

Signed-off-by: Marek Majkowski <marek@cloudflare.com>
---
 tools/testing/selftests/net/.gitignore        |   1 +
 tools/testing/selftests/net/Makefile          |   3 +-
 .../selftests/net/tcp_user_timeout_syn_sent.c | 322 ++++++++++++++++++
 .../net/tcp_user_timeout_syn_sent.sh          |   4 +
 4 files changed, 329 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/net/tcp_user_timeout_syn_sent.c
 create mode 100755 tools/testing/selftests/net/tcp_user_timeout_syn_sent.sh

diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore
index c7cced739c34..bc6a2b7199b6 100644
--- a/tools/testing/selftests/net/.gitignore
+++ b/tools/testing/selftests/net/.gitignore
@@ -21,3 +21,4 @@ ipv6_flowlabel
 ipv6_flowlabel_mgr
 so_txtime
 tcp_fastopen_backup_key
+tcp_user_timeout_syn_sent
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 0bd6b23c97ef..065a171b8834 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -11,13 +11,14 @@ TEST_PROGS += udpgso_bench.sh fib_rule_tests.sh msg_zerocopy.sh psock_snd.sh
 TEST_PROGS += udpgro_bench.sh udpgro.sh test_vxlan_under_vrf.sh reuseport_addr_any.sh
 TEST_PROGS += test_vxlan_fdb_changelink.sh so_txtime.sh ipv6_flowlabel.sh
 TEST_PROGS += tcp_fastopen_backup_key.sh fcnal-test.sh l2tp.sh
+TEST_PROGS += tcp_user_timeout_syn_sent.sh
 TEST_PROGS_EXTENDED := in_netns.sh
 TEST_GEN_FILES = socket nettest
 TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy reuseport_addr_any
 TEST_GEN_FILES += tcp_mmap tcp_inq psock_snd txring_overwrite
 TEST_GEN_FILES += udpgso udpgso_bench_tx udpgso_bench_rx ip_defrag
 TEST_GEN_FILES += so_txtime ipv6_flowlabel ipv6_flowlabel_mgr
-TEST_GEN_FILES += tcp_fastopen_backup_key
+TEST_GEN_FILES += tcp_fastopen_backup_key tcp_user_timeout_syn_sent
 TEST_GEN_PROGS = reuseport_bpf reuseport_bpf_cpu reuseport_bpf_numa
 TEST_GEN_PROGS += reuseport_dualstack reuseaddr_conflict tls
 
diff --git a/tools/testing/selftests/net/tcp_user_timeout_syn_sent.c b/tools/testing/selftests/net/tcp_user_timeout_syn_sent.c
new file mode 100644
index 000000000000..1c9ec582359a
--- /dev/null
+++ b/tools/testing/selftests/net/tcp_user_timeout_syn_sent.c
@@ -0,0 +1,322 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Testing if TCP_USER_TIMEOUT on SYN-SENT state overrides TCP_SYNCNT.
+ *
+ * Historically, TCP_USER_TIMEOUT made only sense on synchronized TCP
+ * states, like ESTABLISHED. There was a bug on SYN-SENT state: with
+ * TCP_USER_TIMEOUT set, the connect() would ETIMEDOUT after given
+ * time, but near the end of the timer would flood SYN packets to
+ * fulfill the TCP_SYNCNT counter. For example for 2000ms user
+ * timeout and default TCP_SYNCNT=6, the tcpdump would look like:
+ *
+ * 00:00.000000 IP 127.0.0.1.1 > 127.0.0.1.2: Flags [S]
+ * 00:01.029452 IP 127.0.0.1.1 > 127.0.0.1.2: Flags [S]
+ * 00:02.021354 IP 127.0.0.1.1 > 127.0.0.1.2: Flags [S]
+ * 00:02.033419 IP 127.0.0.1.1 > 127.0.0.1.2: Flags [S]
+ * 00:02.041633 IP 127.0.0.1.1 > 127.0.0.1.2: Flags [S]
+ * 00:02.049263 IP 127.0.0.1.1 > 127.0.0.1.2: Flags [S]
+ * 00:02.057264 IP 127.0.0.1.1 > 127.0.0.1.2: Flags [S]
+ *
+ * Notice, 5 out of 6 retransmissions are aligned to 2s. We fixed
+ * that, and this code tests for the regression. We do this by
+ * actively dropping SYN packets on listen socket with ebpf
+ * SOCKET_FILTER, and counting how many packets did we drop.
+ *
+ * See: https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/
+ */
+#include <arpa/inet.h>
+#include <errno.h>
+#include <error.h>
+#include <linux/bpf.h>
+#include <linux/tcp.h>
+#include <linux/unistd.h>
+#include <netinet/in.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <unistd.h>
+
+int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size,
+                   int max_entries, uint32_t map_flags)
+{
+        union bpf_attr attr = {};
+
+        attr.map_type = map_type;
+        attr.key_size = key_size;
+        attr.value_size = value_size;
+        attr.max_entries = max_entries;
+        attr.map_flags = map_flags;
+        return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
+}
+
+int bpf_load_program(enum bpf_prog_type prog_type, const struct bpf_insn *insns,
+                     size_t insns_cnt, const char *license,
+                     uint32_t kern_version)
+{
+        union bpf_attr attr = {};
+
+        attr.prog_type = prog_type;
+        attr.insns = (long)insns;
+        attr.insn_cnt = insns_cnt;
+        attr.license = (long)license;
+        attr.log_buf = (long)NULL;
+        attr.log_size = 0;
+        attr.log_level = 0;
+        attr.kern_version = kern_version;
+
+        int fd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
+        return fd;
+}
+
+int bpf_map_update_elem(int fd, const void *key, const void *value,
+                        uint64_t flags)
+{
+        union bpf_attr attr = {};
+
+        attr.map_fd = fd;
+        attr.key = (long)key;
+        attr.value = (long)value;
+        attr.flags = flags;
+
+        return syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
+}
+
+int bpf_map_lookup_elem(int fd, const void *key, void *value)
+{
+        union bpf_attr attr = {};
+
+        attr.map_fd = fd;
+        attr.key = (long)key;
+        attr.value = (long)value;
+
+        return syscall(__NR_bpf, BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
+}
+
+/*
+struct bpf_map_def SEC("maps") stats_map = {
+        .type = BPF_MAP_TYPE_ARRAY,
+        .key_size = sizeof(uint32_t),
+        .value_size = sizeof(uint64_t),
+        .max_entries = 2,
+};
+
+SEC("socket_filter")
+int _socket_filter(struct __sk_buff *skb)
+{
+        (void)skb;
+
+        uint32_t no = 0;
+        uint64_t *value = bpf_map_lookup_elem(&stats_map, &no);
+        if (value) {
+                __sync_fetch_and_add(value, 1);
+        }
+        return 0; // DROP inbound SYN packets
+}
+ */
+
+size_t bpf_insn_socket_filter_cnt = 12;
+struct bpf_insn bpf_insn_socket_filter[] = {
+        {
+                .code = 0xb7,
+                .dst_reg = BPF_REG_1,
+                .src_reg = BPF_REG_0,
+                .off = 0,
+                .imm = 0 /**/
+        },
+        {
+                .code = 0x63,
+                .dst_reg = BPF_REG_10,
+                .src_reg = BPF_REG_1,
+                .off = -4,
+                .imm = 0 /**/
+        },
+        {
+                .code = 0xbf,
+                .dst_reg = BPF_REG_2,
+                .src_reg = BPF_REG_10,
+                .off = 0,
+                .imm = 0 /**/
+        },
+        {
+                .code = 0x7,
+                .dst_reg = BPF_REG_2,
+                .src_reg = BPF_REG_0,
+                .off = 0,
+                .imm = -4 /**/
+        },
+        {
+                .code = 0x18,
+                .dst_reg = BPF_REG_1,
+                .src_reg = BPF_REG_0,
+                .off = 0,
+                .imm = 0 /* relocation for stats_map */
+        },
+        {
+                .code = 0x0,
+                .dst_reg = BPF_REG_0,
+                .src_reg = BPF_REG_0,
+                .off = 0,
+                .imm = 0 /**/
+        },
+        {
+                .code = 0x85,
+                .dst_reg = BPF_REG_0,
+                .src_reg = BPF_REG_0,
+                .off = 0,
+                .imm = 1 /**/
+        },
+        {
+                .code = 0x15,
+                .dst_reg = BPF_REG_0,
+                .src_reg = BPF_REG_0,
+                .off = 2,
+                .imm = 0 /**/
+        },
+        {
+                .code = 0xb7,
+                .dst_reg = BPF_REG_1,
+                .src_reg = BPF_REG_0,
+                .off = 0,
+                .imm = 1 /**/
+        },
+        {
+                .code = 0xdb,
+                .dst_reg = BPF_REG_0,
+                .src_reg = BPF_REG_1,
+                .off = 0,
+                .imm = 0 /**/
+        },
+        {
+                .code = 0xb7,
+                .dst_reg = BPF_REG_0,
+                .src_reg = BPF_REG_0,
+                .off = 0,
+                .imm = 0 /**/
+        },
+        {
+                .code = 0x95,
+                .dst_reg = BPF_REG_0,
+                .src_reg = BPF_REG_0,
+                .off = 0,
+                .imm = 0 /**/
+        }
+};
+
+void socket_filter_fill_stats_map(int fd)
+{
+        bpf_insn_socket_filter[4].src_reg = BPF_REG_1;
+        bpf_insn_socket_filter[4].imm = fd;
+}
+
+static int net_setup_ebpf(int sd)
+{
+        int stats_map, bpf, r;
+
+        stats_map = bpf_create_map(BPF_MAP_TYPE_ARRAY, sizeof(uint32_t),
+                                   sizeof(uint64_t), 2, 0);
+        if (stats_map < 0)
+                error(1, errno, "bpf");
+
+        socket_filter_fill_stats_map(stats_map);
+
+        bpf = bpf_load_program(BPF_PROG_TYPE_SOCKET_FILTER,
+                               bpf_insn_socket_filter,
+                               bpf_insn_socket_filter_cnt, "Dual BSD/GPL", 0);
+        if (bpf < 0)
+                error(1, errno, "bpf");
+
+        r = setsockopt(sd, SOL_SOCKET, SO_ATTACH_BPF, &bpf, sizeof(bpf));
+        if (r < 0)
+                error(1, errno, "setsockopt(SO_ATTACH_FILTER)");
+
+        return stats_map;
+}
+
+static int setup_server(struct sockaddr_in *addr)
+{
+        int sd, r;
+        socklen_t addr_sz;
+
+        sd = socket(AF_INET, SOCK_STREAM, 0);
+        if (sd < 0)
+                error(1, errno, "socket()");
+
+        r = bind(sd, (struct sockaddr *)addr, sizeof(*addr));
+        if (r != 0)
+                error(1, errno, "bind()");
+
+        r = listen(sd, 16);
+        if (r != 0)
+                error(1, errno, "listen()");
+
+        addr_sz = sizeof(*addr);
+        r = getsockname(sd, (struct sockaddr *)addr, &addr_sz);
+        if (r != 0)
+                error(1, errno, "getsockname()");
+
+        return sd;
+}
+
+int main(void)
+{
+        struct sockaddr_in addr = {
+                .sin_family = AF_INET,
+                .sin_addr = { inet_addr("127.0.0.1") },
+        };
+
+        int sd = setup_server(&addr);
+        int stats_map = net_setup_ebpf(sd);
+        struct {
+                int user_timeout;
+                int expected_counter;
+        } tests[] = {
+                { 200, 2 }, // TCP_USER_TIMEOUT kicks in on first retransmission
+                { 1500, 2 },
+                { 3500, 3 },
+                { -1, -1 },
+        };
+
+        int failed = 0, i;
+
+        for (i = 0; tests[i].user_timeout >= 0; i++) {
+                int r, cd, v;
+                uint32_t k = 0;
+                uint64_t counter = 0;
+
+                r = bpf_map_update_elem(stats_map, &k, &counter, 0);
+
+                cd = socket(AF_INET, SOCK_STREAM, 0);
+                if (cd < 0)
+                        error(1, errno, "socket()");
+
+                v = tests[i].user_timeout;
+                r = setsockopt(cd, IPPROTO_TCP, TCP_USER_TIMEOUT, &v,
+                               sizeof(v));
+                if (r != 0)
+                        error(1, errno, "setsockopt()");
+
+                r = connect(cd, (struct sockaddr *)&addr, sizeof(addr));
+                if (r != -1 && errno != ETIMEDOUT)
+                        error(1, errno, "connect()");
+
+                r = bpf_map_lookup_elem(stats_map, &k, &counter);
+                if (r != 0)
+                        error(1, errno, "bpf_map_lookup_elem()");
+
+                if ((int)counter != tests[i].expected_counter) {
+                        failed += 1;
+                        printf("[!] Expecting %d SYN packets on "
+                               "TCP_USER_TIMEOUT=%d, got %d\n",
+                               tests[i].expected_counter, tests[i].user_timeout,
+                               (int)counter);
+                }
+                close(cd);
+        }
+        close(sd);
+        close(stats_map);
+        if (failed == 0)
+                fprintf(stderr, "PASSED\n");
+        return failed;
+}
diff --git a/tools/testing/selftests/net/tcp_user_timeout_syn_sent.sh b/tools/testing/selftests/net/tcp_user_timeout_syn_sent.sh
new file mode 100755
index 000000000000..26765f3a92c6
--- /dev/null
+++ b/tools/testing/selftests/net/tcp_user_timeout_syn_sent.sh
@@ -0,0 +1,4 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+
+./in_netns.sh ./tcp_user_timeout_syn_sent
-- 
2.17.1
* Re: [PATCH net] tcp: better handle TCP_USER_TIMEOUT in SYN_SENT state
  2019-09-27 12:40   ` Marek Majkowski
@ 2019-09-27 13:18     ` Eric Dumazet
  0 siblings, 0 replies; 6+ messages in thread
From: Eric Dumazet @ 2019-09-27 13:18 UTC (permalink / raw)
  To: Marek Majkowski
  Cc: David Miller, Jonathan Maxwell, Eric Dumazet, netdev, Yuchung Cheng, kernel-team

On Fri, Sep 27, 2019 at 5:40 AM Marek Majkowski <marek@cloudflare.com> wrote:
>
> On 9/27/19 10:25 AM, Jonathan Maxwell wrote:
> > Acked-by: Jon Maxwell <jmaxwell37@gmail.com>
> >
> > Thanks for fixing that Eric.
> >
>
> The patch seems to do the job.
>
> Tested-by: Marek Majkowski <marek@cloudflare.com>
>
> Here's a selftest:

Here is the packetdrill test :

`../common/defaults.sh`

    0 socket(..., SOCK_STREAM | SOCK_NONBLOCK, IPPROTO_TCP) = 4
   +0 setsockopt(4, SOL_TCP, TCP_USER_TIMEOUT, [8000], 4) = 0
   +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = -1
   +0 > S 0:0(0) <mss 1460,sackOK,TS val 100 ecr 0,nop,wscale 8>
   +1 > S 0:0(0) <mss 1460,sackOK,TS val 1100 ecr 0,nop,wscale 8>
   +2 > S 0:0(0) <mss 1460,sackOK,TS val 3100 ecr 0,nop,wscale 8>
   +4 > S 0:0(0) <mss 1460,sackOK,TS val 7100 ecr 0,nop,wscale 8>
  +10 write(4, ..., 1000) = -1 ETIMEDOUT (Connection Timed Out)
   +0 %{ assert tcpi_state == TCP_CLOSE }%
   +0 < S. 1234:1234(0) ack 1 win 5840 <mss 1460,sackOK,TS val 987134 ecr 7100,nop,wscale 8>
   +0 > R 1:1(0)

> ---8<---
> From: Marek Majkowski <marek@cloudflare.com>
> Date: Fri, 27 Sep 2019 13:37:52 +0200
> Subject: [PATCH] selftests/net: TCP_USER_TIMEOUT in SYN-SENT state
> [...]
* Re: [PATCH net] tcp: better handle TCP_USER_TIMEOUT in SYN_SENT state
  2019-09-26 22:42 [PATCH net] tcp: better handle TCP_USER_TIMEOUT in SYN_SENT state Eric Dumazet
  2019-09-27  8:25 ` Jonathan Maxwell
@ 2019-09-27 16:35 ` Yuchung Cheng
  2019-09-27 18:42 ` David Miller
  2 siblings, 0 replies; 6+ messages in thread
From: Yuchung Cheng @ 2019-09-27 16:35 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, netdev, Eric Dumazet, Marek Majkowski, Jon Maxwell

On Thu, Sep 26, 2019 at 3:42 PM Eric Dumazet <edumazet@google.com> wrote:
>
> Yuchung Cheng and Marek Majkowski independently reported a weird
> behavior of TCP_USER_TIMEOUT option when used at connect() time.
> [...]
> Link: https://marc.info/?l=linux-netdev&m=156940118307949&w=2
> ---

Acked-by: Yuchung Cheng <ycheng@google.com>

thanks for fixing it!
* Re: [PATCH net] tcp: better handle TCP_USER_TIMEOUT in SYN_SENT state
  2019-09-26 22:42 [PATCH net] tcp: better handle TCP_USER_TIMEOUT in SYN_SENT state Eric Dumazet
  2019-09-27  8:25 ` Jonathan Maxwell
  2019-09-27 16:35 ` Yuchung Cheng
@ 2019-09-27 18:42 ` David Miller
  2 siblings, 0 replies; 6+ messages in thread
From: David Miller @ 2019-09-27 18:42 UTC (permalink / raw)
  To: edumazet; +Cc: netdev, eric.dumazet, ycheng, marek, jmaxwell37

From: Eric Dumazet <edumazet@google.com>
Date: Thu, 26 Sep 2019 15:42:51 -0700

> Yuchung Cheng and Marek Majkowski independently reported a weird
> behavior of TCP_USER_TIMEOUT option when used at connect() time.
> [...]
> Link: https://marc.info/?l=linux-netdev&m=156940118307949&w=2

Applied and queued up for -stable.
end of thread, other threads:[~2019-09-27 18:43 UTC | newest]

Thread overview: 6+ messages
  2019-09-26 22:42 [PATCH net] tcp: better handle TCP_USER_TIMEOUT in SYN_SENT state Eric Dumazet
  2019-09-27  8:25 ` Jonathan Maxwell
  2019-09-27 12:40   ` Marek Majkowski
  2019-09-27 13:18     ` Eric Dumazet
  2019-09-27 16:35 ` Yuchung Cheng
  2019-09-27 18:42 ` David Miller