* [PATCH v2 net 0/3] tcp: fix xmit timer rearming to avoid stalls
From: Neal Cardwell @ 2017-08-03 13:19 UTC
To: David Miller; +Cc: netdev, Neal Cardwell
This patch series is a bug fix for a TCP loss recovery performance bug
reported independently in recent netdev threads:
(i) July 26, 2017: netdev thread "TCP fast retransmit issues"
(ii) July 26, 2017: netdev thread:
"[PATCH V2 net-next] TLP: Don't reschedule PTO when there's one
outstanding TLP retransmission"
Many thanks to Klavs Klavsen and Mao Wenan for the detailed reports,
traces, and packetdrill test cases, which enabled us to root-cause
this issue and verify the fix.
- v1 -> v2:
- In patch 2/3, clarified a pre-existing comment in
tcp_schedule_loss_probe() (thanks to Eric Dumazet for suggesting
the improvement).
Neal Cardwell (3):
tcp: introduce tcp_rto_delta_us() helper for xmit timer fix
tcp: enable xmit timer fix by having TLP use time when RTO should fire
tcp: fix xmit timer to only be reset if data ACKed/SACKed
include/net/tcp.h | 10 ++++++++++
net/ipv4/tcp_input.c | 30 +++++++++++++++++-------------
net/ipv4/tcp_output.c | 23 +++++------------------
3 files changed, 32 insertions(+), 31 deletions(-)
--
2.14.0.rc1.383.gd1ce394fe2-goog
* [PATCH v2 net 1/3] tcp: introduce tcp_rto_delta_us() helper for xmit timer fix
From: Neal Cardwell @ 2017-08-03 13:19 UTC
To: David Miller; +Cc: netdev, Neal Cardwell, Yuchung Cheng, Nandita Dukkipati
Pure refactor. This helper will be required by the xmit timer fix
later in this series, where the TLP logic needs to make this
calculation.
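As a rough illustration of the arithmetic this helper performs, here is
a toy user-space model with made-up example values (the variable names
and numbers below are purely illustrative; this is not kernel code):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        /* Made-up example values, all in microseconds. */
        uint64_t head_sent_us = 1000000; /* skb->skb_mstamp of write queue head */
        uint64_t rto_us       = 200000;  /* icsk_rto converted to usecs */
        uint64_t now_us       = 1150000; /* tp->tcp_mstamp ("now") */

        /* Same arithmetic as tcp_rto_delta_us(): RTO deadline minus now. */
        int64_t delta_us = (int64_t)(head_sent_us + rto_us - now_us);

        /* > 0: the RTO should fire delta_us from now; <= 0: it is overdue. */
        printf("RTO should fire in %lld us\n", (long long)delta_us);
        return 0;
}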
Fixes: 6ba8a3b19e76 ("tcp: Tail loss probe (TLP)")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
---
include/net/tcp.h | 10 ++++++++++
net/ipv4/tcp_input.c | 5 +----
2 files changed, 11 insertions(+), 4 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 70483296157f..ada65e767b28 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1916,6 +1916,16 @@ extern void tcp_rack_advance(struct tcp_sock *tp, u8 sacked, u32 end_seq,
u64 xmit_time);
extern void tcp_rack_reo_timeout(struct sock *sk);
+/* At how many usecs into the future should the RTO fire? */
+static inline s64 tcp_rto_delta_us(const struct sock *sk)
+{
+ const struct sk_buff *skb = tcp_write_queue_head(sk);
+ u32 rto = inet_csk(sk)->icsk_rto;
+ u64 rto_time_stamp_us = skb->skb_mstamp + jiffies_to_usecs(rto);
+
+ return rto_time_stamp_us - tcp_sk(sk)->tcp_mstamp;
+}
+
/*
* Save and compile IPv4 options, return a pointer to it
*/
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index dad026fcfd09..b9ba8d55d8b8 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3004,10 +3004,7 @@ void tcp_rearm_rto(struct sock *sk)
/* Offset the time elapsed after installing regular RTO */
if (icsk->icsk_pending == ICSK_TIME_REO_TIMEOUT ||
icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) {
- struct sk_buff *skb = tcp_write_queue_head(sk);
- u64 rto_time_stamp = skb->skb_mstamp +
- jiffies_to_usecs(rto);
- s64 delta_us = rto_time_stamp - tp->tcp_mstamp;
+ s64 delta_us = tcp_rto_delta_us(sk);
/* delta_us may not be positive if the socket is locked
* when the retrans timer fires and is rescheduled.
*/
--
2.14.0.rc1.383.gd1ce394fe2-goog
* [PATCH v2 net 2/3] tcp: enable xmit timer fix by having TLP use time when RTO should fire
From: Neal Cardwell @ 2017-08-03 13:19 UTC
To: David Miller; +Cc: netdev, Neal Cardwell, Yuchung Cheng, Nandita Dukkipati
Have tcp_schedule_loss_probe() base the TLP scheduling decision on
when the RTO *should* fire. This is to enable the upcoming xmit
timer fix in this series, where tcp_schedule_loss_probe() cannot
assume that the last timer installed was an RTO timer (because we are
no longer doing the "rearm RTO, rearm RTO, rearm TLP" dance on every
ACK). So tcp_schedule_loss_probe() must independently figure out when
an RTO would want to fire.
In the new TLP implementation following in this series, we cannot
assume that icsk_timeout was set based on an RTO; after processing a
cumulative ACK the icsk_timeout we see can be from a previous TLP or
RTO. So we need to independently recalculate the RTO time (instead of
reading it out of icsk_timeout). Removing this dependency on the
nature of icsk_timeout makes things a little easier to reason about
anyway.
Note that the old and new code should be equivalent, since they are
both saying: "if the RTO is in the future, but at an earlier time than
the normal TLP time, then set the TLP timer to fire when the RTO would
have fired".
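To make that equivalence concrete, here is an editorial paraphrase of
the two checks (variable names such as tlp_deadline and rto_deadline
are illustrative; the literal code is in the diff below):

        /* old (jiffies-based): cap the TLP timeout at the RTO deadline
         * when that deadline is later than now but earlier than the TLP. */
        if (tlp_deadline > rto_deadline && rto_deadline > now)
                timeout = rto_deadline - now;

        /* new (usec-based, via tcp_rto_delta_us()): the same cap,
         * expressed as a delta from now. */
        if (rto_delta_us > 0)
                timeout = min_t(u32, timeout, usecs_to_jiffies(rto_delta_us));

Either way, if the RTO deadline is in the future the TLP timer is
capped at that deadline, and if it has already passed the TLP timeout
is left unchanged.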
Fixes: 6ba8a3b19e76 ("tcp: Tail loss probe (TLP)")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
---
net/ipv4/tcp_output.c | 14 +++++---------
1 file changed, 5 insertions(+), 9 deletions(-)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 2f1588bf73da..cd8e257492c4 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2377,8 +2377,8 @@ bool tcp_schedule_loss_probe(struct sock *sk)
{
struct inet_connection_sock *icsk = inet_csk(sk);
struct tcp_sock *tp = tcp_sk(sk);
- u32 timeout, tlp_time_stamp, rto_time_stamp;
u32 rtt = usecs_to_jiffies(tp->srtt_us >> 3);
+ u32 timeout, rto_delta_us;
/* No consecutive loss probes. */
if (WARN_ON(icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)) {
@@ -2417,14 +2417,10 @@ bool tcp_schedule_loss_probe(struct sock *sk)
(rtt + (rtt >> 1) + TCP_DELACK_MAX));
timeout = max_t(u32, timeout, msecs_to_jiffies(10));
- /* If RTO is shorter, just schedule TLP in its place. */
- tlp_time_stamp = tcp_jiffies32 + timeout;
- rto_time_stamp = (u32)inet_csk(sk)->icsk_timeout;
- if ((s32)(tlp_time_stamp - rto_time_stamp) > 0) {
- s32 delta = rto_time_stamp - tcp_jiffies32;
- if (delta > 0)
- timeout = delta;
- }
+ /* If the RTO formula yields an earlier time, then use that time. */
+ rto_delta_us = tcp_rto_delta_us(sk); /* How far in future is RTO? */
+ if (rto_delta_us > 0)
+ timeout = min_t(u32, timeout, usecs_to_jiffies(rto_delta_us));
inet_csk_reset_xmit_timer(sk, ICSK_TIME_LOSS_PROBE, timeout,
TCP_RTO_MAX);
--
2.14.0.rc1.383.gd1ce394fe2-goog
* [PATCH v2 net 3/3] tcp: fix xmit timer to only be reset if data ACKed/SACKed
From: Neal Cardwell @ 2017-08-03 13:19 UTC
To: David Miller; +Cc: netdev, Neal Cardwell, Yuchung Cheng, Nandita Dukkipati
Fix a TCP loss recovery performance bug raised recently on the netdev
list, in two threads:
(i) July 26, 2017: netdev thread "TCP fast retransmit issues"
(ii) July 26, 2017: netdev thread:
"[PATCH V2 net-next] TLP: Don't reschedule PTO when there's one
outstanding TLP retransmission"
The basic problem is that incoming TCP packets that did not indicate
forward progress could cause the xmit timer (TLP or RTO) to be rearmed
and pushed back in time. In certain corner cases this could result in
the following problems noted in these threads:
- Repeated ACKs carrying bogus SACKs corrupted by middleboxes could
cause TCP to schedule TLPs forever. We kept sending TLPs every
~200ms, which elicited more bogus SACKs, which caused more TLPs,
ad infinitum; we never fired an RTO to fill in the holes.
- Incoming data segments could, in some cases, reschedule our RTO or
TLP timer further out in time, for no good reason. In the presence
of packet loss, this meant repeated inbound data could stall
outbound data.
This commit fixes these bugs by changing the TLP and RTO ACK
processing to:
(a) Only reschedule the xmit timer once per ACK.
(b) Only reschedule the xmit timer if tcp_clean_rtx_queue() deems the
ACK indicates sufficient forward progress (a packet was
cumulatively ACKed, or we got a SACK for a packet that was sent
before the most recent retransmit of the write queue head).
This brings us back into closer compliance with the RFCs, since, as
the comment for tcp_rearm_rto() notes, we should only restart the RTO
timer after forward progress on the connection. Previously we were
restarting the xmit timer even in cases where there was no forward
progress.
As a side benefit, this commit simplifies and speeds up the TCP timer
arming logic. We had been calling inet_csk_reset_xmit_timer() three
times on normal ACKs that cumulatively acknowledged some data:
1) Once near the top of tcp_ack() to switch from TLP timer to RTO:
if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
tcp_rearm_rto(sk);
2) Once in tcp_clean_rtx_queue(), to update the RTO:
if (flag & FLAG_ACKED) {
tcp_rearm_rto(sk);
3) Once in tcp_ack() after tcp_fastretrans_alert() to switch from RTO
to TLP:
if (icsk->icsk_pending == ICSK_TIME_RETRANS)
tcp_schedule_loss_probe(sk);
This commit, by only rescheduling the xmit timer once per ACK,
simplifies the code and reduces CPU overhead.
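Condensed, the rearming path after this patch looks roughly like the
following (an editorial paraphrase of the diff below, not the literal
kernel code):

        /* In tcp_clean_rtx_queue(): note that the ACK made forward
         * progress, but do not touch the timer here. */
        if (flag & FLAG_ACKED)
                flag |= FLAG_SET_XMIT_TIMER;

        /* In tcp_ack(), once per ACK, after the rtx queue is cleaned: */
        if (flag & FLAG_SET_XMIT_TIMER)
                tcp_set_xmit_timer(sk); /* TLP if possible, else RTO */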
This commit was tested in an A/B test with Google web server
traffic. SNMP stats and request latency metrics were within noise
levels, substantiating that for normal web traffic patterns this is a
rare issue. This commit was also tested with packetdrill tests to
verify that it fixes the timer behavior in the corner cases discussed
in the netdev threads mentioned above.
This is a bug-fix patch intended to be queued for -stable
releases.
Fixes: 6ba8a3b19e76 ("tcp: Tail loss probe (TLP)")
Reported-by: Klavs Klavsen <kl@vsen.dk>
Reported-by: Mao Wenan <maowenan@huawei.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
---
net/ipv4/tcp_input.c | 25 ++++++++++++++++---------
net/ipv4/tcp_output.c | 9 ---------
2 files changed, 16 insertions(+), 18 deletions(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index b9ba8d55d8b8..53de1424c13c 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -107,6 +107,7 @@ int sysctl_tcp_invalid_ratelimit __read_mostly = HZ/2;
#define FLAG_ORIG_SACK_ACKED 0x200 /* Never retransmitted data are (s)acked */
#define FLAG_SND_UNA_ADVANCED 0x400 /* Snd_una was changed (!= FLAG_DATA_ACKED) */
#define FLAG_DSACKING_ACK 0x800 /* SACK blocks contained D-SACK info */
+#define FLAG_SET_XMIT_TIMER 0x1000 /* Set TLP or RTO timer */
#define FLAG_SACK_RENEGING 0x2000 /* snd_una advanced to a sacked seq */
#define FLAG_UPDATE_TS_RECENT 0x4000 /* tcp_replace_ts_recent() */
#define FLAG_NO_CHALLENGE_ACK 0x8000 /* do not call tcp_send_challenge_ack() */
@@ -3016,6 +3017,13 @@ void tcp_rearm_rto(struct sock *sk)
}
}
+/* Try to schedule a loss probe; if that doesn't work, then schedule an RTO. */
+static void tcp_set_xmit_timer(struct sock *sk)
+{
+ if (!tcp_schedule_loss_probe(sk))
+ tcp_rearm_rto(sk);
+}
+
/* If we get here, the whole TSO packet has not been acked. */
static u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb)
{
@@ -3177,7 +3185,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
ca_rtt_us, sack->rate);
if (flag & FLAG_ACKED) {
- tcp_rearm_rto(sk);
+ flag |= FLAG_SET_XMIT_TIMER; /* set TLP or RTO timer */
if (unlikely(icsk->icsk_mtup.probe_size &&
!after(tp->mtu_probe.probe_seq_end, tp->snd_una))) {
tcp_mtup_probe_success(sk);
@@ -3205,7 +3213,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
* after when the head was last (re)transmitted. Otherwise the
* timeout may continue to extend in loss recovery.
*/
- tcp_rearm_rto(sk);
+ flag |= FLAG_SET_XMIT_TIMER; /* set TLP or RTO timer */
}
if (icsk->icsk_ca_ops->pkts_acked) {
@@ -3577,9 +3585,6 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
if (after(ack, tp->snd_nxt))
goto invalid_ack;
- if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
- tcp_rearm_rto(sk);
-
if (after(ack, prior_snd_una)) {
flag |= FLAG_SND_UNA_ADVANCED;
icsk->icsk_retransmits = 0;
@@ -3644,18 +3649,20 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
flag |= tcp_clean_rtx_queue(sk, prior_fackets, prior_snd_una, &acked,
&sack_state);
+ if (tp->tlp_high_seq)
+ tcp_process_tlp_ack(sk, ack, flag);
+ /* If needed, reset TLP/RTO timer; RACK may later override this. */
+ if (flag & FLAG_SET_XMIT_TIMER)
+ tcp_set_xmit_timer(sk);
+
if (tcp_ack_is_dubious(sk, flag)) {
is_dupack = !(flag & (FLAG_SND_UNA_ADVANCED | FLAG_NOT_DUP));
tcp_fastretrans_alert(sk, acked, is_dupack, &flag, &rexmit);
}
- if (tp->tlp_high_seq)
- tcp_process_tlp_ack(sk, ack, flag);
if ((flag & FLAG_FORWARD_PROGRESS) || !(flag & FLAG_NOT_DUP))
sk_dst_confirm(sk);
- if (icsk->icsk_pending == ICSK_TIME_RETRANS)
- tcp_schedule_loss_probe(sk);
delivered = tp->delivered - delivered; /* freshly ACKed or SACKed */
lost = tp->lost - lost; /* freshly marked lost */
tcp_rate_gen(sk, delivered, lost, sack_state.rate);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index cd8e257492c4..276406a83a37 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2380,21 +2380,12 @@ bool tcp_schedule_loss_probe(struct sock *sk)
u32 rtt = usecs_to_jiffies(tp->srtt_us >> 3);
u32 timeout, rto_delta_us;
- /* No consecutive loss probes. */
- if (WARN_ON(icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)) {
- tcp_rearm_rto(sk);
- return false;
- }
/* Don't do any loss probe on a Fast Open connection before 3WHS
* finishes.
*/
if (tp->fastopen_rsk)
return false;
- /* TLP is only scheduled when next timer event is RTO. */
- if (icsk->icsk_pending != ICSK_TIME_RETRANS)
- return false;
-
/* Schedule a loss probe in 2*RTT for SACK capable connections
* in Open state, that are either limited by cwnd or application.
*/
--
2.14.0.rc1.383.gd1ce394fe2-goog
* Re: [PATCH v2 net 0/3] tcp: fix xmit timer rearming to avoid stalls
From: David Miller @ 2017-08-03 22:39 UTC
To: ncardwell; +Cc: netdev
From: Neal Cardwell <ncardwell@google.com>
Date: Thu, 3 Aug 2017 09:19:51 -0400
> This patch series is a bug fix for a TCP loss recovery performance bug
> reported independently in recent netdev threads:
>
> (i) July 26, 2017: netdev thread "TCP fast retransmit issues"
> (ii) July 26, 2017: netdev thread:
> "[PATCH V2 net-next] TLP: Don't reschedule PTO when there's one
> outstanding TLP retransmission"
>
> Many thanks to Klavs Klavsen and Mao Wenan for the detailed reports,
> traces, and packetdrill test cases, which enabled us to root-cause
> this issue and verify the fix.
>
> - v1 -> v2:
> - In patch 2/3, clarified a pre-existing comment in
> tcp_schedule_loss_probe() (thanks to Eric Dumazet for suggesting
> the improvement).
Series applied, thanks Neal.