* [PATCH 0/6] tcp: support preloading data on a listening socket
@ 2025-05-16 15:54 Jeremy Harris
2025-05-16 15:54 ` [PATCH 1/6] tcp: support writing to a socket in listening state Jeremy Harris
` (7 more replies)
0 siblings, 8 replies; 12+ messages in thread
From: Jeremy Harris @ 2025-05-16 15:54 UTC (permalink / raw)
To: netdev; +Cc: linux-api, edumazet, ncardwell, Jeremy Harris
Support write to a listen TCP socket, for immediate
transmission on passive connection establishments.
On a normal connection transmission is triggered by the receipt of
the 3rd-ack. On a fastopen (with accepted cookie) connection the data
is sent in the synack packet.
The data preload is done using sendmsg() with a newly defined flag
(MSG_PRELOAD); the amount of data is limited to a single linear sk_buff.
Note that this definition takes the last-but-two bit available if "int"
is 32 bits.
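As a rough illustration of the userland side, the call described above might look like the sketch below. This is an assumption-laden example, not code from the series: MSG_PRELOAD exists only with these patches applied (the value is taken from patch 1/6), and the helper names preload_greeting() and flag_bit_index() are hypothetical. On a kernel without the series the sendmsg() call is expected to fail rather than queue anything.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Value proposed by this series; not present in released kernel headers. */
#ifndef MSG_PRELOAD
#define MSG_PRELOAD 0x10000000
#endif

/* Index of the single set bit in a flag, or -1 if not exactly one bit is set. */
static int flag_bit_index(unsigned int flag)
{
	int idx = 0;

	if (flag == 0 || (flag & (flag - 1)) != 0)
		return -1;
	while ((flag >>= 1) != 0)
		idx++;
	return idx;
}

/*
 * Hypothetical helper: queue a fixed server greeting on a socket that is
 * already listening. Returns whatever sendmsg() returns.
 */
static ssize_t preload_greeting(int listen_fd, const void *buf, size_t len)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };

	assert(buf != NULL);

	/* MSG_NOSIGNAL: avoid SIGPIPE if the kernel rejects the write. */
	return sendmsg(listen_fd, &msg, MSG_PRELOAD | MSG_NOSIGNAL);
}
```

A server would call preload_greeting() once after listen(), for example with an SMTP banner. flag_bit_index(MSG_PRELOAD) evaluates to 28, with MSG_FASTOPEN at bit 29 and MSG_CMSG_CLOEXEC at bit 30 above it.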
Testing:
 A) packetdrill scripts for
    - normal non-TFO
    - normal TFO
    - synack lost
    - 3rd-ack acks only the SYN
    - 3rd-ack acks partial data
    (NB: packetdrill can only check the data size, not actual content)
 B) Application use, running the application testsuite
    and manual check of specific cases via packet capture
 C) Daily-driver laptop use (not expected to trigger the feature;
    only regression-test)
Jeremy Harris (6):
  tcp: support writing to a socket in listening state
  tcp: copy write-data from listen socket to accept child socket
  tcp: fastopen: add write-data to fastopen synack packet
  tcp: transmit any pending data on receipt of 3rd-ack
  tcp: fastopen: retransmit data when only the SYN of a synack-with-data
    is acked
  tcp: fastopen: extend retransmit-queue trimming to handle linear
    sk_buff
include/linux/socket.h | 1 +
net/ipv4/tcp.c | 17 ++++--
net/ipv4/tcp_fastopen.c | 3 +-
net/ipv4/tcp_input.c | 15 ++++-
net/ipv4/tcp_ipv4.c | 4 +-
net/ipv4/tcp_minisocks.c | 58 +++++++++++++++++--
net/ipv4/tcp_output.c | 50 ++++++++++++++--
.../perf/trace/beauty/include/linux/socket.h | 1 +
tools/perf/trace/beauty/msg_flags.c | 3 +
9 files changed, 135 insertions(+), 17 deletions(-)
base-commit: 2da35e4b4df99d3dd29bacf0c054e6988013d4ec
--
2.49.0
* [PATCH 1/6] tcp: support writing to a socket in listening state
2025-05-16 15:54 [PATCH 0/6] tcp: support preloading data on a listening socket Jeremy Harris
@ 2025-05-16 15:54 ` Jeremy Harris
2025-05-16 15:55 ` [PATCH 2/6] tcp: copy write-data from listen socket to accept child socket Jeremy Harris
` (6 subsequent siblings)
7 siblings, 0 replies; 12+ messages in thread
From: Jeremy Harris @ 2025-05-16 15:54 UTC (permalink / raw)
To: netdev; +Cc: linux-api, edumazet, ncardwell, Jeremy Harris
In the TCP sendmsg handler, permit a write in the LISTEN state if
the MSG_PRELOAD flag is set. Copy from the iovec to a linear sk_buff
for placement on the socket write queue.
Signed-off-by: Jeremy Harris <jgh@exim.org>
---
include/linux/socket.h | 1 +
net/ipv4/tcp.c | 17 +++++++++++++----
tools/perf/trace/beauty/include/linux/socket.h | 1 +
tools/perf/trace/beauty/msg_flags.c | 3 +++
4 files changed, 18 insertions(+), 4 deletions(-)
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 3b262487ec06..b41f4cd4dc97 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -330,6 +330,7 @@ struct ucred {
#define MSG_SOCK_DEVMEM 0x2000000 /* Receive devmem skbs as cmsg */
#define MSG_ZEROCOPY 0x4000000 /* Use user data in kernel path */
#define MSG_SPLICE_PAGES 0x8000000 /* Splice the pages from the iterator in sendmsg() */
+#define MSG_PRELOAD 0x10000000 /* Preload tx data while listening */
#define MSG_FASTOPEN 0x20000000 /* Send data in TCP SYN */
#define MSG_CMSG_CLOEXEC 0x40000000 /* Set close_on_exec for file
descriptor received through
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index b7b6ab41b496..72b5d7cad351 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1136,12 +1136,13 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
tcp_rate_check_app_limited(sk); /* is sending application-limited? */
- /* Wait for a connection to finish. One exception is TCP Fast Open
+ /* Wait for a connection to finish. Exceptions are TCP Fast Open
* (passive side) where data is allowed to be sent before a connection
- * is fully established.
+ * is fully established, and a message marked as preload which is
+ * allowed to be placed in the send queue of a listening socket.
*/
if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) &&
- !tcp_passive_fastopen(sk)) {
+ !tcp_passive_fastopen(sk) && !(flags & MSG_PRELOAD)) {
err = sk_stream_wait_connect(sk, &timeo);
if (err != 0)
goto do_error;
@@ -1226,7 +1227,13 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
if (copy > msg_data_left(msg))
copy = msg_data_left(msg);
- if (zc == 0) {
+ if (unlikely(flags & MSG_PRELOAD)) {
+ copy = min_t(int, copy, skb_tailroom(skb));
+ err = skb_add_data_nocache(sk, skb, &msg->msg_iter,
+ copy);
+ if (err)
+ goto do_error;
+ } else if (zc == 0) {
bool merge = true;
int i = skb_shinfo(skb)->nr_frags;
struct page_frag *pfrag = sk_page_frag(sk);
@@ -1330,6 +1337,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
if (!msg_data_left(msg)) {
if (unlikely(flags & MSG_EOR))
TCP_SKB_CB(skb)->eor = 1;
+ if (unlikely(flags & MSG_PRELOAD))
+ goto out_nopush;
goto out;
}
diff --git a/tools/perf/trace/beauty/include/linux/socket.h b/tools/perf/trace/beauty/include/linux/socket.h
index c3322eb3d686..e9ea498169f3 100644
--- a/tools/perf/trace/beauty/include/linux/socket.h
+++ b/tools/perf/trace/beauty/include/linux/socket.h
@@ -330,6 +330,7 @@ struct ucred {
#define MSG_SOCK_DEVMEM 0x2000000 /* Receive devmem skbs as cmsg */
#define MSG_ZEROCOPY 0x4000000 /* Use user data in kernel path */
#define MSG_SPLICE_PAGES 0x8000000 /* Splice the pages from the iterator in sendmsg() */
+#define MSG_PRELOAD 0x10000000 /* Preload tx data while listening */
#define MSG_FASTOPEN 0x20000000 /* Send data in TCP SYN */
#define MSG_CMSG_CLOEXEC 0x40000000 /* Set close_on_exec for file
descriptor received through
diff --git a/tools/perf/trace/beauty/msg_flags.c b/tools/perf/trace/beauty/msg_flags.c
index 2da581ff0c80..27e40da9b02d 100644
--- a/tools/perf/trace/beauty/msg_flags.c
+++ b/tools/perf/trace/beauty/msg_flags.c
@@ -20,6 +20,9 @@
#ifndef MSG_SPLICE_PAGES
#define MSG_SPLICE_PAGES 0x8000000
#endif
+#ifndef MSG_PRELOAD
+#define MSG_PRELOAD 0x10000000
+#endif
#ifndef MSG_FASTOPEN
#define MSG_FASTOPEN 0x20000000
#endif
--
2.49.0
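The gate changed in tcp_sendmsg_locked() by the patch above can be modelled outside the kernel. The sketch below is only a model under stated assumptions: the ST_* names are hypothetical stand-ins for the kernel's TCP_* state numbers (numbering as in include/net/tcp_states.h), and must_wait_for_connect() is not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the kernel's TCP_* states; same numbering, hypothetical names. */
enum {
	ST_ESTABLISHED = 1,
	ST_SYN_SENT    = 2,
	ST_SYN_RECV    = 3,
	ST_CLOSE_WAIT  = 8,
	ST_LISTEN      = 10,
};

#define MSG_PRELOAD 0x10000000u	/* value proposed in this patch */
#define STATE_BIT(s) (1u << (s))

/* True when sendmsg would block in sk_stream_wait_connect() under the patched test. */
static bool must_wait_for_connect(int state, bool passive_fastopen,
				  unsigned int flags)
{
	unsigned int ready = STATE_BIT(ST_ESTABLISHED) | STATE_BIT(ST_CLOSE_WAIT);

	assert(state > 0 && state < 32);	/* must fit in the bitmask */

	return (STATE_BIT(state) & ~ready) &&
	       !passive_fastopen &&
	       !(flags & MSG_PRELOAD);
}
```

In this model a listening socket skips the wait only when MSG_PRELOAD is set. Note that, as the condition is written in the diff, the flag is tested without regard to state, so the model also skips the wait for a SYN_SENT socket that passes MSG_PRELOAD.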
* [PATCH 2/6] tcp: copy write-data from listen socket to accept child socket
2025-05-16 15:54 [PATCH 0/6] tcp: support preloading data on a listening socket Jeremy Harris
2025-05-16 15:54 ` [PATCH 1/6] tcp: support writing to a socket in listening state Jeremy Harris
@ 2025-05-16 15:55 ` Jeremy Harris
2025-05-16 17:51 ` Eric Dumazet
2025-05-16 15:55 ` [PATCH 3/6] tcp: fastopen: add write-data to fastopen synack packet Jeremy Harris
` (5 subsequent siblings)
7 siblings, 1 reply; 12+ messages in thread
From: Jeremy Harris @ 2025-05-16 15:55 UTC (permalink / raw)
To: netdev; +Cc: linux-api, edumazet, ncardwell, Jeremy Harris
Set the request_sock flag for fastopen earlier, making it available
to the af_ops SYN-handler function.
In that function copy data from the listen socket write queue into an
sk_buff, allocating if needed and adding to the write queue of the
newly-created child socket.
Set sequence number values depending on the fastopen status.
Signed-off-by: Jeremy Harris <jgh@exim.org>
---
net/ipv4/tcp_fastopen.c | 3 ++-
net/ipv4/tcp_ipv4.c | 4 +--
net/ipv4/tcp_minisocks.c | 58 ++++++++++++++++++++++++++++++++++++----
3 files changed, 57 insertions(+), 8 deletions(-)
diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index 9b83d639b5ac..03a86d0b87ba 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -245,6 +245,8 @@ static struct sock *tcp_fastopen_create_child(struct sock *sk,
struct sock *child;
bool own_req;
+ tcp_rsk(req)->tfo_listener = true;
+
child = inet_csk(sk)->icsk_af_ops->syn_recv_sock(sk, skb, req, NULL,
NULL, &own_req);
if (!child)
@@ -261,7 +263,6 @@ static struct sock *tcp_fastopen_create_child(struct sock *sk,
tp = tcp_sk(child);
rcu_assign_pointer(tp->fastopen_rsk, req);
- tcp_rsk(req)->tfo_listener = true;
/* RFC1323: The window in SYN & SYN/ACK segments is never
* scaled. So correct it appropriately.
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 6a14f9e6fef6..e488effdbdb2 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1747,8 +1747,8 @@ EXPORT_IPV6_MOD(tcp_v4_conn_request);
/*
- * The three way handshake has completed - we got a valid synack -
- * now create the new socket.
+ * The three way handshake has completed - we got a valid synack
+ * (or a FASTOPEN syn) - now create the new socket.
*/
struct sock *tcp_v4_syn_recv_sock(const struct sock *sk, struct sk_buff *skb,
struct request_sock *req,
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 43d7852ce07e..d471531b4a78 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -529,7 +529,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
struct inet_connection_sock *newicsk;
const struct tcp_sock *oldtp;
struct tcp_sock *newtp;
- u32 seq;
+ u32 seq, a_seq, n_seq;
if (!newsk)
return NULL;
@@ -550,9 +550,55 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
newtp->segs_in = 1;
seq = treq->snt_isn + 1;
- newtp->snd_sml = newtp->snd_una = seq;
- WRITE_ONCE(newtp->snd_nxt, seq);
- newtp->snd_up = seq;
+ n_seq = seq;
+ a_seq = seq;
+ newtp->write_seq = seq;
+ newtp->snd_una = seq;
+
+ /* If there is write-data sitting on the listen socket, copy it to
+ * the accept socket. If FASTOPEN we will send it on the synack,
+ * otherwise it sits there until 3rd-ack arrives.
+ */
+
+ if (unlikely(!skb_queue_empty(&sk->sk_write_queue))) {
+ struct sk_buff *l_skb = tcp_send_head(sk),
+ *a_skb = tcp_write_queue_tail(newsk);
+ ssize_t copy = 0;
+
+ if (a_skb)
+ copy = l_skb->len - a_skb->len;
+
+ if (copy <= 0 || !tcp_skb_can_collapse_to(a_skb)) {
+ bool first_skb = tcp_rtx_and_write_queues_empty(newsk);
+
+ a_skb = tcp_stream_alloc_skb(newsk,
+ newsk->sk_allocation,
+ first_skb);
+ if (!a_skb) {
+ /* is this the correct free? */
+ bh_unlock_sock(newsk);
+ sk_free(newsk);
+ return NULL;
+ }
+
+ tcp_skb_entail(newsk, a_skb);
+ }
+ copy = min_t(int, l_skb->len, skb_tailroom(a_skb));
+ skb_put_data(a_skb, l_skb->data, copy);
+
+ TCP_SKB_CB(a_skb)->end_seq += copy;
+
+ a_seq += l_skb->len;
+
+ if (treq->tfo_listener)
+ seq = a_seq;
+
+ /* assumes only one skb on the listen write queue */
+ }
+
+ newtp->snd_sml = seq;
+ WRITE_ONCE(newtp->snd_nxt, a_seq);
+ newtp->snd_up = n_seq;
INIT_LIST_HEAD(&newtp->tsq_node);
INIT_LIST_HEAD(&newtp->tsorted_sent_queue);
@@ -567,7 +613,9 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
newtp->total_retrans = req->num_retrans;
tcp_init_xmit_timers(newsk);
- WRITE_ONCE(newtp->write_seq, newtp->pushed_seq = treq->snt_isn + 1);
+
+ newtp->pushed_seq = n_seq;
+ WRITE_ONCE(newtp->write_seq, a_seq);
if (sock_flag(newsk, SOCK_KEEPOPEN))
tcp_reset_keepalive_timer(newsk, keepalive_time_when(newtp));
--
2.49.0
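The sequence-number assignments made in tcp_create_openreq_child() by the patch above can be followed more easily in isolation. The sketch below mirrors those assignments as posted, assuming a single preloaded skb of preload_len bytes; struct child_seqs and child_seqs_init() are hypothetical names, and the model makes no claim that the values are the right ones for the stack.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Subset of tcp_sock sequence fields touched by the patch. */
struct child_seqs {
	uint32_t snd_una, snd_sml, snd_nxt, snd_up, pushed_seq, write_seq;
};

static struct child_seqs child_seqs_init(uint32_t snt_isn, uint32_t preload_len,
					 bool tfo_listener)
{
	uint32_t seq = snt_isn + 1;	/* the SYN consumes one sequence number */
	uint32_t n_seq = seq;		/* sequence before any preloaded data */
	uint32_t a_seq = seq;		/* sequence after the preloaded data */

	if (preload_len) {
		a_seq += preload_len;
		if (tfo_listener)
			seq = a_seq;
	}

	/* Unsigned arithmetic keeps this true across 32-bit wraparound. */
	assert((uint32_t)(a_seq - n_seq) == preload_len);

	return (struct child_seqs){
		.snd_una = n_seq,
		.snd_sml = seq,
		.snd_nxt = a_seq,
		.snd_up = n_seq,
		.pushed_seq = n_seq,
		.write_seq = a_seq,
	};
}
```

With snt_isn = 1000 and 50 preloaded bytes, the model gives snd_una = 1001 and snd_nxt = write_seq = 1051 in both cases; only snd_sml differs (1051 with fastopen, 1001 without).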
* [PATCH 3/6] tcp: fastopen: add write-data to fastopen synack packet
2025-05-16 15:54 [PATCH 0/6] tcp: support preloading data on a listening socket Jeremy Harris
2025-05-16 15:54 ` [PATCH 1/6] tcp: support writing to a socket in listening state Jeremy Harris
2025-05-16 15:55 ` [PATCH 2/6] tcp: copy write-data from listen socket to accept child socket Jeremy Harris
@ 2025-05-16 15:55 ` Jeremy Harris
2025-05-16 15:55 ` [PATCH 4/6] tcp: transmit any pending data on receipt of 3rd-ack Jeremy Harris
` (4 subsequent siblings)
7 siblings, 0 replies; 12+ messages in thread
From: Jeremy Harris @ 2025-05-16 15:55 UTC (permalink / raw)
To: netdev; +Cc: linux-api, edumazet, ncardwell, Jeremy Harris
While building the synack packet for a fastopen socket, copy data
from the write queue into the packet, then move that data from the
write queue to the retransmit queue.
Signed-off-by: Jeremy Harris <jgh@exim.org>
---
net/ipv4/tcp_output.c | 32 ++++++++++++++++++++++++++++++--
1 file changed, 30 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 3ac8d2d17e1f..c50553c1c795 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3702,7 +3702,7 @@ int tcp_send_synack(struct sock *sk)
/**
* tcp_make_synack - Allocate one skb and build a SYNACK packet.
- * @sk: listener socket
+ * @sk: listener socket (or child socket for fastopen)
* @dst: dst entry attached to the SYNACK. It is consumed and caller
* should not use it again.
* @req: request_sock pointer
@@ -3719,6 +3719,7 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
struct inet_request_sock *ireq = inet_rsk(req);
const struct tcp_sock *tp = tcp_sk(sk);
struct tcp_out_options opts;
+ struct sock *fastopen_sk = (struct sock *)sk;
struct tcp_key key = {};
struct sk_buff *skb;
int tcp_header_size;
@@ -3748,7 +3749,7 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
* cpu might call us concurrently.
* sk->sk_wmem_alloc in an atomic, we can promote to rw.
*/
- skb_set_owner_w(skb, (struct sock *)sk);
+ skb_set_owner_w(skb, fastopen_sk);
break;
}
skb_dst_set(skb, dst);
@@ -3831,6 +3832,33 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
th->window = htons(min(req->rsk_rcv_wnd, 65535U));
tcp_options_write(th, NULL, tcp_rsk(req), &opts, &key);
th->doff = (tcp_header_size >> 2);
+
+ /* If this is a FASTOPEN, and there is write-data on the accept socket,
+ * re-copy it to the synack segment. If not FASTOPEN, any data waits
+ * until 3rd-ack arrival.
+ */
+
+ if (synack_type == TCP_SYNACK_FASTOPEN &&
+ !skb_queue_empty(&sk->sk_write_queue)) {
+ struct sk_buff *a_skb = tcp_write_queue_tail(sk);
+ int copy = min_t(int, a_skb->len, skb_tailroom(skb));
+
+ skb_put_data(skb, a_skb->data, copy);
+ TCP_SKB_CB(skb)->end_seq += copy;
+
+ tcp_skb_pcount_set(a_skb, 1);
+ WRITE_ONCE(tcp_sk(fastopen_sk)->write_seq,
+ TCP_SKB_CB(a_skb)->end_seq);
+
+ skb_set_delivery_time(a_skb, now, SKB_CLOCK_MONOTONIC);
+
+ /* Move the data to the retransmit queue.
+ * Code elsewhere implies this is a full child socket and
+ * can be treated as writeable - permitting the cast.
+ */
+ tcp_event_new_data_sent(fastopen_sk, a_skb);
+ }
+
TCP_INC_STATS(sock_net(sk), TCP_MIB_OUTSEGS);
/* Okay, we have all we need - do the md5 hash if needed */
--
2.49.0
* [PATCH 4/6] tcp: transmit any pending data on receipt of 3rd-ack
2025-05-16 15:54 [PATCH 0/6] tcp: support preloading data on a listening socket Jeremy Harris
` (2 preceding siblings ...)
2025-05-16 15:55 ` [PATCH 3/6] tcp: fastopen: add write-data to fastopen synack packet Jeremy Harris
@ 2025-05-16 15:55 ` Jeremy Harris
2025-05-16 15:55 ` [PATCH 5/6] tcp: fastopen: retransmit data when only the SYN of a synack-with-data is acked Jeremy Harris
` (3 subsequent siblings)
7 siblings, 0 replies; 12+ messages in thread
From: Jeremy Harris @ 2025-05-16 15:55 UTC (permalink / raw)
To: netdev; +Cc: linux-api, edumazet, ncardwell, Jeremy Harris
For the non-fastopen case of preload, when the 3rd-ack arrives there
will be data on the write queue. Transmit it immediately
by allowing the SYN_RECV state to run the xmit-recovery code
even when no retransmit was requested.
Signed-off-by: Jeremy Harris <jgh@exim.org>
---
net/ipv4/tcp_input.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 8ec92dec321a..345a08baaf02 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3900,7 +3900,8 @@ static void tcp_xmit_recovery(struct sock *sk, int rexmit)
{
struct tcp_sock *tp = tcp_sk(sk);
- if (rexmit == REXMIT_NONE || sk->sk_state == TCP_SYN_SENT)
+ if ((rexmit == REXMIT_NONE && sk->sk_state != TCP_SYN_RECV) ||
+ sk->sk_state == TCP_SYN_SENT)
return;
if (unlikely(rexmit == REXMIT_NEW)) {
--
2.49.0
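The effect of the one-line condition change in tcp_xmit_recovery() above can be checked by enumerating the cases. The sketch below is a stand-alone model: the REXMIT_* values follow the definitions in net/ipv4/tcp_input.c, while the ST_* names and all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { REXMIT_NONE = 0, REXMIT_LOST = 1, REXMIT_NEW = 2 };
enum { ST_ESTABLISHED = 1, ST_SYN_SENT = 2, ST_SYN_RECV = 3 };

/* Early-return test before the patch. */
static bool returns_early_before(int rexmit, int state)
{
	return rexmit == REXMIT_NONE || state == ST_SYN_SENT;
}

/* Early-return test after the patch. */
static bool returns_early_after(int rexmit, int state)
{
	return (rexmit == REXMIT_NONE && state != ST_SYN_RECV) ||
	       state == ST_SYN_SENT;
}

/* Number of (rexmit, state) pairs for which the patch changes the outcome. */
static int count_changed_cases(void)
{
	static const int states[] = { ST_ESTABLISHED, ST_SYN_SENT, ST_SYN_RECV };
	int changed = 0;

	for (int r = REXMIT_NONE; r <= REXMIT_NEW; r++)
		for (size_t i = 0; i < sizeof(states) / sizeof(states[0]); i++)
			if (returns_early_before(r, states[i]) !=
			    returns_early_after(r, states[i]))
				changed++;
	return changed;
}

static void self_check(void)
{
	/* Only REXMIT_NONE in SYN_RECV behaves differently. */
	assert(returns_early_before(REXMIT_NONE, ST_SYN_RECV));
	assert(!returns_early_after(REXMIT_NONE, ST_SYN_RECV));
	assert(count_changed_cases() == 1);
}
```

Over the three states modelled here, the only pair whose outcome changes is REXMIT_NONE in SYN_RECV, which now falls through to the retransmit-queue code instead of returning.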
* [PATCH 5/6] tcp: fastopen: retransmit data when only the SYN of a synack-with-data is acked
2025-05-16 15:54 [PATCH 0/6] tcp: support preloading data on a listening socket Jeremy Harris
` (3 preceding siblings ...)
2025-05-16 15:55 ` [PATCH 4/6] tcp: transmit any pending data on receipt of 3rd-ack Jeremy Harris
@ 2025-05-16 15:55 ` Jeremy Harris
2025-05-16 15:55 ` [PATCH 6/6] tcp: fastopen: extend retransmit-queue trimming to handle linear sk_buff Jeremy Harris
` (2 subsequent siblings)
7 siblings, 0 replies; 12+ messages in thread
From: Jeremy Harris @ 2025-05-16 15:55 UTC (permalink / raw)
To: netdev; +Cc: linux-api, edumazet, ncardwell, Jeremy Harris
A corner-case for the 3rd-ack after a data-on-synack is for only
the SYN to be acked. Handle this by, in ack processing, when in
SYN_RECV state (the state is not yet updated to ESTABLISHED)
marking the retransmit-queue sk_buff as having been lost.
Signed-off-by: Jeremy Harris <jgh@exim.org>
---
net/ipv4/tcp_input.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 345a08baaf02..a53021edddd5 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4069,6 +4069,18 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
&rexmit);
}
+ /* On receiving a 3rd-ack, if we never sent a packet via
+ * the normal means (which counts them), yet there is data
+ * remaining for retransmit, it was data-on-synack not acked;
+ * mark the skb for retransmission.
+ */
+ if (sk->sk_state == TCP_SYN_RECV && tp->segs_out == 0) {
+ struct sk_buff *skb = tcp_rtx_queue_head(sk);
+
+ if (skb)
+ tcp_mark_skb_lost(sk, skb);
+ }
+
/* If needed, reset TLP/RTO timer when RACK doesn't set. */
if (flag & FLAG_SET_XMIT_TIMER)
tcp_set_xmit_timer(sk);
--
2.49.0
* [PATCH 6/6] tcp: fastopen: extend retransmit-queue trimming to handle linear sk_buff
2025-05-16 15:54 [PATCH 0/6] tcp: support preloading data on a listening socket Jeremy Harris
` (4 preceding siblings ...)
2025-05-16 15:55 ` [PATCH 5/6] tcp: fastopen: retransmit data when only the SYN of a synack-with-data is acked Jeremy Harris
@ 2025-05-16 15:55 ` Jeremy Harris
2025-05-16 16:58 ` [PATCH 0/6] tcp: support preloading data on a listening socket Jeremy Harris
2025-05-16 18:19 ` Neal Cardwell
7 siblings, 0 replies; 12+ messages in thread
From: Jeremy Harris @ 2025-05-16 15:55 UTC (permalink / raw)
To: netdev; +Cc: linux-api, edumazet, ncardwell, Jeremy Harris
A corner-case for the 3rd-ack after a data-on-synack is for
some but not all of the data to be acked. Support this by
extending the retransmit-queue trim routine to handle a
linear sk_buff.
Signed-off-by: Jeremy Harris <jgh@exim.org>
---
net/ipv4/tcp_output.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index c50553c1c795..bff5934ff04b 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1708,8 +1708,22 @@ static int __pskb_trim_head(struct sk_buff *skb, int len)
struct skb_shared_info *shinfo;
int i, k, eat;
- DEBUG_NET_WARN_ON_ONCE(skb_headlen(skb));
- eat = len;
+ eat = skb_headlen(skb);
+ if (unlikely(eat)) {
+ if (len < eat)
+ eat = len;
+ skb->head += eat;
+ skb->len -= eat;
+ if (skb->data_len)
+ skb->data_len -= eat;
+
+ eat = len - eat;
+ if (eat == 0)
+ return len;
+ } else {
+ eat = len;
+ }
+
k = 0;
shinfo = skb_shinfo(skb);
for (i = 0; i < shinfo->nr_frags; i++) {
--
2.49.0
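The arithmetic that the commit message above describes (trim acked bytes from the linear area first, then from the page fragments) can be sketched as a small model. This is an illustration of the stated intent only, under the assumption that an skb is reduced to two byte counts; struct skb_model and trim_head() are hypothetical, and the sketch does not reproduce the pointer and length updates in the diff.

```c
#include <assert.h>

/* Simplified skb: total length is headlen + fraglen. */
struct skb_model {
	unsigned int headlen;	/* bytes in the linear area */
	unsigned int fraglen;	/* bytes held in page fragments */
};

/*
 * Remove len bytes from the front of the model skb: linear bytes first,
 * then fragment bytes. Returns the number of bytes taken from fragments.
 */
static unsigned int trim_head(struct skb_model *skb, unsigned int len)
{
	unsigned int from_head, from_frags;

	assert(len <= skb->headlen + skb->fraglen);

	from_head = len < skb->headlen ? len : skb->headlen;
	from_frags = len - from_head;

	skb->headlen -= from_head;
	skb->fraglen -= from_frags;
	return from_frags;
}
```

For a purely linear skb of 100 bytes, trimming 40 leaves 60 linear bytes and touches no fragments, which is the partial-ack case the patch targets.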
* Re: [PATCH 0/6] tcp: support preloading data on a listening socket
2025-05-16 15:54 [PATCH 0/6] tcp: support preloading data on a listening socket Jeremy Harris
` (5 preceding siblings ...)
2025-05-16 15:55 ` [PATCH 6/6] tcp: fastopen: extend retransmit-queue trimming to handle linear sk_buff Jeremy Harris
@ 2025-05-16 16:58 ` Jeremy Harris
2025-05-16 18:19 ` Neal Cardwell
7 siblings, 0 replies; 12+ messages in thread
From: Jeremy Harris @ 2025-05-16 16:58 UTC (permalink / raw)
To: netdev; +Cc: linux-api, edumazet, ncardwell
On 2025/05/16 4:54 PM, Jeremy Harris wrote:
> Support write to a listen TCP socket, for immediate
> transmission on passive connection establishments.
The patches in this series should have included "net-next" in their Subjects.
--
Apologies,
Jeremy
* Re: [PATCH 2/6] tcp: copy write-data from listen socket to accept child socket
2025-05-16 15:55 ` [PATCH 2/6] tcp: copy write-data from listen socket to accept child socket Jeremy Harris
@ 2025-05-16 17:51 ` Eric Dumazet
2025-05-16 20:11 ` Jeremy Harris
0 siblings, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2025-05-16 17:51 UTC (permalink / raw)
To: Jeremy Harris; +Cc: netdev, linux-api, ncardwell
On Fri, May 16, 2025 at 8:56 AM Jeremy Harris <jgh@exim.org> wrote:
>
> Set the request_sock flag for fastopen earlier, making it available
> to the af_ops SYN-handler function.
>
> In that function copy data from the listen socket write queue into an
> sk_buff, allocating if needed and adding to the write queue of the
> newly-created child socket.
> Set sequence number values depending on the fastopen status.
I do not see any locking. I think you should run a local KASAN/syzbot
instance and you will be shocked.
Honestly we need to be convinced of why adding code in sendmsg() fast
path is worth this.
* Re: [PATCH 0/6] tcp: support preloading data on a listening socket
2025-05-16 15:54 [PATCH 0/6] tcp: support preloading data on a listening socket Jeremy Harris
` (6 preceding siblings ...)
2025-05-16 16:58 ` [PATCH 0/6] tcp: support preloading data on a listening socket Jeremy Harris
@ 2025-05-16 18:19 ` Neal Cardwell
2025-05-16 20:10 ` Jeremy Harris
7 siblings, 1 reply; 12+ messages in thread
From: Neal Cardwell @ 2025-05-16 18:19 UTC (permalink / raw)
To: Jeremy Harris; +Cc: netdev, linux-api, edumazet
On Fri, May 16, 2025 at 11:55 AM Jeremy Harris <jgh@exim.org> wrote:
>
> Support write to a listen TCP socket, for immediate
> transmission on passive connection establishments.
>
> On a normal connection transmission is triggered by the receipt of
> the 3rd-ack. On a fastopen (with accepted cookie) connection the data
> is sent in the synack packet.
>
> The data preload is done using a sendmsg with a newly-defined flag
> (MSG_PRELOAD); the amount of data limited to a single linear sk_buff.
> Note that this definition is the last-but-two bit available if "int"
> is 32 bits.
Can you please add a bit more context, like:
+ What is the motivating use case? (Accelerating Exim?) Is this
targeted for connections using encryption (like TLS/SSL), or just
plain-text connections?
+ What are the exact performance improvements you are seeing in your
benchmarks that (a) motivate this, and (b) justify any performance
impact on the TCP stack?
+ Regarding "Support write to a listen TCP socket, for immediate
transmission on passive connection establishments.": can you please
make it explicitly clear whether the data written to the listening
socket is saved and transmitted on all future successful passive
sockets that are created for the listener, or is just transmitted on
the next connection that is created?
thanks,
neal
* Re: [PATCH 0/6] tcp: support preloading data on a listening socket
2025-05-16 18:19 ` Neal Cardwell
@ 2025-05-16 20:10 ` Jeremy Harris
0 siblings, 0 replies; 12+ messages in thread
From: Jeremy Harris @ 2025-05-16 20:10 UTC (permalink / raw)
To: Neal Cardwell; +Cc: netdev, linux-api, edumazet
Hi Neal,
Thanks for the initial review.
On 2025/05/16 7:19 PM, Neal Cardwell wrote:
> On Fri, May 16, 2025 at 11:55 AM Jeremy Harris <jgh@exim.org> wrote:
>>
>> Support write to a listen TCP socket, for immediate
>> transmission on passive connection establishments.
>>
>> On a normal connection transmission is triggered by the receipt of
>> the 3rd-ack. On a fastopen (with accepted cookie) connection the data
>> is sent in the synack packet.
>>
>> The data preload is done using a sendmsg with a newly-defined flag
>> (MSG_PRELOAD); the amount of data limited to a single linear sk_buff.
>> Note that this definition is the last-but-two bit available if "int"
>> is 32 bits.
>
> Can you please add a bit more context, like:
>
> + What is the motivating use case? (Accelerating Exim?)
Accelerating any server-first ULP, SMTP being the major use I
know of (and yes, Exim is my primary testcase and is operational
against a test kernel with this patch series).
One caveat: the initial server data cannot change from one passive
connection to another.
> Is this
> targeted for connections using encryption (like TLS/SSL), or just
> plain-text connections?
TLS-on-connect cannot benefit, being client-first. SMTP that uses
STARTTLS can take advantage of it, as can plaintext SMTP.
I would not expect https to be able to use it.
> + What are the exact performance improvements you are seeing in your
> benchmarks that (a) motivate this, and (b) justify any performance
> impact on the TCP stack?
Because of the lack of userland roundtrip needed for the initial server
data, there is a latency benefit. This is better for the TFO-C case,
but also significant for the non-TFO case.
Packet capture (laptop, loopback, TFO-C case) for initial SYN to first
client data packet (5 samples):
- baseline TFO-C     1064  1470  1455  1547  1595 usec
- patched non-TFO     140   150   159   144   153 usec
- patched TFO-C       142   149   149   125   125 usec
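For a quick summary of the five samples per row above, the means can be computed as in the sketch below. The helper names are hypothetical and the sample values are copied from the table; nothing here is a new measurement.

```c
#include <assert.h>
#include <stddef.h>

/* Samples in usec, copied from the table above. */
static const unsigned int baseline_tfo_c[]  = { 1064, 1470, 1455, 1547, 1595 };
static const unsigned int patched_non_tfo[] = {  140,  150,  159,  144,  153 };
static const unsigned int patched_tfo_c[]   = {  142,  149,  149,  125,  125 };

static unsigned int sum_usec(const unsigned int *samples, size_t n)
{
	unsigned int sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += samples[i];
	return sum;
}

static double mean_usec(const unsigned int *samples, size_t n)
{
	assert(n > 0);
	return (double)sum_usec(samples, n) / (double)n;
}
```

The means come out at 1426.2 usec for the baseline, 149.2 usec for patched non-TFO and 138.0 usec for patched TFO-C, so on this loopback setup the patched cases are roughly a tenth of the baseline.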
One fewer packet is sent by the server in most packet captures, sometimes
one fewer in each direction. There is one less application kernel entry/exit
on the server.
I'm hoping those differences will add up to both less cpu time (on both
endpoints) and less wire-time. However, I have not run benchmarks looking
for a change in peak rate of connection-handling.
In summary, this is the mirror of TCP Fast Open client data: the latency
benefit is probably the most useful aspect.
> + Regarding "Support write to a listen TCP socket, for immediate
> transmission on passive connection establishments.": can you please
> make it explicitly clear whether the data written to the listening
> socket is saved and transmitted on all future successful passive
> sockets that are created for the listener,
This. The data is copied for each future passive socket from this
listener,
> or is just transmitted on
> the next connection that is created?
(and not this option).
I'll copy these comments into any future v2.
As Eric says, I should run KASAN/syzbot first.
--
Cheers,
Jeremy
* Re: [PATCH 2/6] tcp: copy write-data from listen socket to accept child socket
2025-05-16 17:51 ` Eric Dumazet
@ 2025-05-16 20:11 ` Jeremy Harris
0 siblings, 0 replies; 12+ messages in thread
From: Jeremy Harris @ 2025-05-16 20:11 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, linux-api, ncardwell
On 2025/05/16 6:51 PM, Eric Dumazet wrote:
> I do not see any locking. I think you should run a local KASAN/syzbot
> instance and you will be shocked.
Thanks for the suggestion; I'll look into doing that.
--
Cheers,
Jeremy