[PATCH v2] tcp: Add memory barrier to tcp_push()

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Salvatore Dipietro <dipiets@amazon.com>
To: <edumazet@google.com>
Cc: <alisaidi@amazon.com>, <benh@amazon.com>, <blakgeof@amazon.com>,
	<davem@davemloft.net>, <dipietro.salvatore@gmail.com>,
	<dipiets@amazon.com>, <dsahern@kernel.org>, <kuba@kernel.org>,
	<netdev@vger.kernel.org>, <pabeni@redhat.com>
Subject: [PATCH v2] tcp: Add memory barrier to tcp_push()
Date: Wed, 17 Jan 2024 13:26:48 -0800	[thread overview]
Message-ID: <20240117212648.12572-1-dipiets@amazon.com> (raw)
In-Reply-To: <CANn89iJwokqZC9P3Ycy4ZWpmT1QhC0qD79y1K1eg2UUAcAj-Lw@mail.gmail.com>

On CPUs with weak memory models, reads and updates performed by tcp_push to the
sk variables can get reordered leaving the socket throttled when it should not.
The tasklet running tcp_wfree() may also not observe the memory updates in time
and will skip flushing any packets throttled by tcp_push(), delaying the sending.
This can pathologically cause 40ms extra latency due to bad interactions with
delayed acks.

Modeling the memory access behavior of tcp_push() (P0) and tcp_wfree() (P1)
using the herd7 simulator, proves this behavior can occur. Below is the litmus
model which describes the functions:
```
C MP+tcp
{
  [flag] = 0;
  [sk] = 5;
  [corked] = 0;
}

P0(int *flag, int *sk, int *corked){
    int r0;
    int r1;
    int r2;

    r1 = READ_ONCE(*sk);
    if (r1 == 5) {

        r0 = READ_ONCE(*flag);
        if (r0 == 0) {
            WRITE_ONCE(*flag, 1);
        }

        // memory barrier added in this patch,
        // original code does not order the reads/writes
        smp_mb();

        r2 = READ_ONCE(*sk);
        if (r2 == 5 ) {
            WRITE_ONCE(*corked,1);
        }
    }
}

P1(int *flag, int *sk, int *corked){
    int r0;
    int r1;

    r1 = READ_ONCE(*sk);
    smp_store_release(sk, 0);

    r0 = smp_load_acquire(flag);
    if (r0 == 1) {
        smp_store_release(flag, 0);
    }
}
locations [0:r0; 0:r1; 0:r2; 1:r0; 1:r1; flag; sk; corked; ]
exists ( flag=1 /\ corked=1 )
```

Adding the memory barrier removes the positive witness from the memory model.
smp_mb__after_atomic() is used to not incur in unnecessary overhead on x86
since not affected.
Patch has been tested using an AWS c7g.2xlarge instance with Ubuntu 22.04 and
Apache Tomcat 9.0.83 running the basic servlet below:
```
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class HelloWorldServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
      throws ServletException, IOException {
        response.setContentType("text/html;charset=utf-8");
        OutputStreamWriter osw = new OutputStreamWriter(response.getOutputStream(),"UTF-8");
        String s = "a".repeat(3096);
        osw.write(s,0,s.length());
        osw.flush();
    }
}
```
Load was applied using wrk2 (https://github.com/kinvolk/wrk2) from an AWS
c6i.8xlarge instance. Before the patch an additional 40ms latency from P99.99+
values is observed while, with the patch, the extra latency disappears.

# No patch and tcp_autocorking=1
./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
  ...
 50.000%    0.91ms
 75.000%    1.13ms
 90.000%    1.46ms
 99.000%    1.74ms
 99.900%    1.89ms
 99.990%   41.95ms  <<< 40+ ms extra latency
 99.999%   48.32ms
100.000%   48.96ms

# With patch and tcp_autocorking=1
./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
  ...
 50.000%    0.90ms
 75.000%    1.13ms
 90.000%    1.45ms
 99.000%    1.72ms
 99.900%    1.83ms
 99.990%    2.11ms  <<< no 40+ ms extra latency
 99.999%    2.53ms
100.000%    2.62ms

Patch has been also tested on x86 (m7i.2xlarge instance) which it is not
affected by this issue and the patch doesn't introduce any additional
delay.

Fixes: f54b311142a9 ("tcp: auto corking")
Signed-off-by: Salvatore Dipietro <dipiets@amazon.com>
---
 net/ipv4/tcp.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ff6838ca2e58..ab9e3922393c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -726,6 +726,7 @@ void tcp_push(struct sock *sk, int flags, int mss_now,
 		/* It is possible TX completion already happened
 		 * before we set TSQ_THROTTLED.
 		 */
+		smp_mb__after_atomic();
 		if (refcount_read(&sk->sk_wmem_alloc) > skb->truesize)
 			return;
 	}
-- 
2.42.0

next prev parent reply	other threads:[~2024-01-17 21:29 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-12-08 18:20 [PATCH] tcp: disable tcp_autocorking for socket when TCP_NODELAY flag is set Salvatore Dipietro
2023-12-09 11:03 ` Eric Dumazet
2023-12-11 15:58   ` Hazem Mohamed Abuelfotoh
2023-12-12 10:41     ` Paolo Abeni
2023-12-12 11:29       ` Eric Dumazet
2023-12-12 10:57 ` Paolo Abeni
2023-12-13 14:10   ` Eric Dumazet
2023-12-13 19:00     ` Jakub Kicinski
2023-12-13 19:29       ` Eric Dumazet
2023-12-13 21:30     ` Salvatore Dipietro
2023-12-14  8:40       ` Eric Dumazet
2023-12-14  9:05         ` Eric Dumazet
2023-12-14 14:08       ` Eric Dumazet
2023-12-14 15:52         ` Geoff Blake
2023-12-14 16:07           ` Eric Dumazet
2024-01-17 21:26             ` Salvatore Dipietro [this message]
2024-01-17 21:58               ` [PATCH v2] tcp: Add memory barrier to tcp_push() Eric Dumazet
2024-01-17 23:16                 ` [PATCH v3] " Salvatore Dipietro
2024-01-18 10:38                   ` Paolo Abeni
2024-01-18 10:42                     ` Eric Dumazet
2024-01-18 16:40                       ` Jakub Kicinski
2024-01-18 17:46                     ` Geoff Blake
2024-01-18 18:30                       ` Paolo Abeni
2024-01-18 19:30                         ` Geoff Blake
2024-01-19 13:33                           ` Eric Dumazet
2024-04-17  7:33                           ` Eric Dumazet
2024-04-22 16:16                             ` Salvatore Dipietro
2023-12-13 10:40 ` [PATCH] tcp: disable tcp_autocorking for socket when TCP_NODELAY flag is set patchwork-bot+netdevbpf

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:ff6838ca2e5 dfblob:ab9e3922393 )
 OR (
bs:"[PATCH v2] tcp: Add memory barrier to tcp_push()" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240117212648.12572-1-dipiets@amazon.com \
    --to=dipiets@amazon.com \
    --cc=alisaidi@amazon.com \
    --cc=benh@amazon.com \
    --cc=blakgeof@amazon.com \
    --cc=davem@davemloft.net \
    --cc=dipietro.salvatore@gmail.com \
    --cc=dsahern@kernel.org \
    --cc=edumazet@google.com \
    --cc=kuba@kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).