* [PATCH v4 net-next 1/3] tcp: Make use of MSG_EOR in tcp_sendmsg
2016-04-25 21:44 [PATCH v4 net-next 0/3] tcp: Make use of MSG_EOR in tcp_sendmsg Martin KaFai Lau
@ 2016-04-25 21:44 ` Martin KaFai Lau
2016-04-25 23:02 ` Eric Dumazet
2016-04-26 0:48 ` Soheil Hassas Yeganeh
2016-04-25 21:44 ` [PATCH v4 net-next 2/3] tcp: Handle eor bit when coalescing skb Martin KaFai Lau
` (3 subsequent siblings)
4 siblings, 2 replies; 12+ messages in thread
From: Martin KaFai Lau @ 2016-04-25 21:44 UTC
To: netdev
Cc: Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
Willem de Bruijn, Yuchung Cheng, Kernel Team
This patch adds an eor bit to TCP_SKB_CB. When MSG_EOR
is passed to tcp_sendmsg, the eor bit is set on the skb
containing the last byte of the userland msg. The eor bit
prevents later data from being appended to that skb.
The change in do_tcp_sendpages is to honor the eor set
during the previous tcp_sendmsg(MSG_EOR) call.
This patch handles the tcp_sendmsg case. The followup patches
will handle other skb coalescing and fragment cases.
One potential use case is to use MSG_EOR with
SOF_TIMESTAMPING_TX_ACK to get more accurate
TCP ack timestamping for application protocols with
multiple outgoing response messages (e.g. HTTP2).
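For illustration only (this is not part of the patch): the append rule described above can be sketched as a small user-space model. The names toy_skb, toy_queue, toy_can_collapse_to and toy_sendmsg are hypothetical stand-ins, and size_goal is an assumed per-buffer cap; the real code works on struct sk_buff under the socket lock and has many more conditions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for an skb on the write queue (not the kernel's sk_buff). */
struct toy_skb {
	size_t len;	/* bytes queued in this buffer */
	bool eor;	/* buffer holds the last byte of a MSG_EOR send */
};

#define TOY_QUEUE_MAX 16

struct toy_queue {
	struct toy_skb skb[TOY_QUEUE_MAX];
	size_t count;
};

/* Same intent as tcp_skb_can_collapse_to(): an eor buffer is closed. */
static bool toy_can_collapse_to(const struct toy_skb *skb)
{
	return !skb->eor;
}

/* Queue len bytes, at most size_goal bytes per buffer.
 * Returns 0 on success, -1 if the toy queue is full.
 */
static int toy_sendmsg(struct toy_queue *q, size_t len, size_t size_goal,
		       bool msg_eor)
{
	assert(size_goal > 0);

	while (len > 0) {
		struct toy_skb *tail = q->count ? &q->skb[q->count - 1] : NULL;
		size_t room = 0;
		size_t copy;

		/* Append to the tail only if it is open and has room. */
		if (tail && toy_can_collapse_to(tail) && tail->len < size_goal)
			room = size_goal - tail->len;

		if (room == 0) {
			/* Start a new buffer, like the new_segment path. */
			if (q->count == TOY_QUEUE_MAX)
				return -1;
			tail = &q->skb[q->count++];
			tail->len = 0;
			tail->eor = false;
			room = size_goal;
		}

		copy = len < room ? len : room;
		tail->len += copy;
		len -= copy;

		/* Mark the buffer that received the last byte of the msg. */
		if (len == 0 && msg_eor)
			tail->eor = true;
	}
	return 0;
}
```

With an assumed size_goal of 7300, queuing 14600 bytes without MSG_EOR and then two 730-byte sends with MSG_EOR leaves four buffers in this model (7300, 7300, 730, 730); the second 730-byte send cannot be appended to the first because that buffer is marked eor.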
Packetdrill script for testing:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
0.200 write(4, ..., 14600) = 14600
0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
0.200 > . 1:7301(7300) ack 1
0.200 > P. 7301:14601(7300) ack 1
0.300 < . 1:1(0) ack 14601 win 257
0.300 > P. 14601:15331(730) ack 1
0.300 > P. 15331:16061(730) ack 1
0.400 < . 1:1(0) ack 16061 win 257
0.400 close(4) = 0
0.400 > F. 16061:16061(0) ack 1
0.400 < F. 1:1(0) ack 16062 win 257
0.400 > . 16062:16062(0) ack 2
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
---
include/net/tcp.h | 8 +++++++-
net/ipv4/tcp.c | 7 +++++--
2 files changed, 12 insertions(+), 3 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7f2553d..ce08038 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -762,7 +762,8 @@ struct tcp_skb_cb {
__u8 ip_dsfield; /* IPv4 tos or IPv6 dsfield */
__u8 txstamp_ack:1, /* Record TX timestamp for ack? */
- unused:7;
+ eor:1, /* Is skb MSG_EOR marked? */
+ unused:6;
__u32 ack_seq; /* Sequence number ACK'd */
union {
struct inet_skb_parm h4;
@@ -809,6 +810,11 @@ static inline int tcp_skb_mss(const struct sk_buff *skb)
return TCP_SKB_CB(skb)->tcp_gso_size;
}
+static inline bool tcp_skb_can_collapse_to(const struct sk_buff *skb)
+{
+ return likely(!TCP_SKB_CB(skb)->eor);
+}
+
/* Events passed to congestion control interface */
enum tcp_ca_event {
CA_EVENT_TX_START, /* first transmit when no packets in flight */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4d73858..ea5364b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -908,7 +908,8 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
int copy, i;
bool can_coalesce;
- if (!tcp_send_head(sk) || (copy = size_goal - skb->len) <= 0) {
+ if (!tcp_send_head(sk) || (copy = size_goal - skb->len) <= 0 ||
+ !tcp_skb_can_collapse_to(skb)) {
new_segment:
if (!sk_stream_memory_free(sk))
goto wait_for_sndbuf;
@@ -1156,7 +1157,7 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
copy = max - skb->len;
}
- if (copy <= 0) {
+ if (copy <= 0 || !tcp_skb_can_collapse_to(skb)) {
new_segment:
/* Allocate new segment. If the interface is SG,
* allocate skb fitting to single page.
@@ -1250,6 +1251,8 @@ new_segment:
copied += copy;
if (!msg_data_left(msg)) {
tcp_tx_timestamp(sk, sockc.tsflags, skb);
+ if (unlikely(flags & MSG_EOR))
+ TCP_SKB_CB(skb)->eor = 1;
goto out;
}
--
2.5.1
* Re: [PATCH v4 net-next 1/3] tcp: Make use of MSG_EOR in tcp_sendmsg
2016-04-25 21:44 ` [PATCH v4 net-next 1/3] " Martin KaFai Lau
@ 2016-04-25 23:02 ` Eric Dumazet
2016-04-26 0:48 ` Soheil Hassas Yeganeh
1 sibling, 0 replies; 12+ messages in thread
From: Eric Dumazet @ 2016-04-25 23:02 UTC
To: Martin KaFai Lau
Cc: netdev, Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
Willem de Bruijn, Yuchung Cheng, Kernel Team
On Mon, 2016-04-25 at 14:44 -0700, Martin KaFai Lau wrote:
> This patch adds an eor bit to TCP_SKB_CB. When MSG_EOR
> is passed to tcp_sendmsg, the eor bit is set on the skb
> containing the last byte of the userland msg. The eor bit
> prevents later data from being appended to that skb.
Acked-by: Eric Dumazet <edumazet@google.com>
Thanks !
* Re: [PATCH v4 net-next 1/3] tcp: Make use of MSG_EOR in tcp_sendmsg
2016-04-25 21:44 ` [PATCH v4 net-next 1/3] " Martin KaFai Lau
2016-04-25 23:02 ` Eric Dumazet
@ 2016-04-26 0:48 ` Soheil Hassas Yeganeh
1 sibling, 0 replies; 12+ messages in thread
From: Soheil Hassas Yeganeh @ 2016-04-26 0:48 UTC
To: Martin KaFai Lau
Cc: netdev, Eric Dumazet, Neal Cardwell, Willem de Bruijn,
Yuchung Cheng, Kernel Team
On Mon, Apr 25, 2016 at 5:44 PM, Martin KaFai Lau <kafai@fb.com> wrote:
> This patch adds an eor bit to TCP_SKB_CB. When MSG_EOR
> is passed to tcp_sendmsg, the eor bit is set on the skb
> containing the last byte of the userland msg. The eor bit
> prevents later data from being appended to that skb.
>
> The change in do_tcp_sendpages is to honor the eor set
> during the previous tcp_sendmsg(MSG_EOR) call.
>
> This patch handles the tcp_sendmsg case. The followup patches
> will handle other skb coalescing and fragment cases.
>
> One potential use case is to use MSG_EOR with
> SOF_TIMESTAMPING_TX_ACK to get more accurate
> TCP ack timestamping for application protocols with
> multiple outgoing response messages (e.g. HTTP2).
>
> Packetdrill script for testing:
> ~~~~~~
> +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
> +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
> +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> +0 bind(3, ..., ...) = 0
> +0 listen(3, 1) = 0
>
> 0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
> 0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
> 0.200 < . 1:1(0) ack 1 win 257
> 0.200 accept(3, ..., ...) = 4
> +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
>
> 0.200 write(4, ..., 14600) = 14600
> 0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
> 0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
>
> 0.200 > . 1:7301(7300) ack 1
> 0.200 > P. 7301:14601(7300) ack 1
>
> 0.300 < . 1:1(0) ack 14601 win 257
> 0.300 > P. 14601:15331(730) ack 1
> 0.300 > P. 15331:16061(730) ack 1
>
> 0.400 < . 1:1(0) ack 16061 win 257
> 0.400 close(4) = 0
> 0.400 > F. 16061:16061(0) ack 1
> 0.400 < F. 1:1(0) ack 16062 win 257
> 0.400 > . 16062:16062(0) ack 2
>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Soheil Hassas Yeganeh <soheil@google.com>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> Suggested-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
> ---
> include/net/tcp.h | 8 +++++++-
> net/ipv4/tcp.c | 7 +++++--
> 2 files changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 7f2553d..ce08038 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -762,7 +762,8 @@ struct tcp_skb_cb {
>
> __u8 ip_dsfield; /* IPv4 tos or IPv6 dsfield */
> __u8 txstamp_ack:1, /* Record TX timestamp for ack? */
> - unused:7;
> + eor:1, /* Is skb MSG_EOR marked? */
> + unused:6;
> __u32 ack_seq; /* Sequence number ACK'd */
> union {
> struct inet_skb_parm h4;
> @@ -809,6 +810,11 @@ static inline int tcp_skb_mss(const struct sk_buff *skb)
> return TCP_SKB_CB(skb)->tcp_gso_size;
> }
>
> +static inline bool tcp_skb_can_collapse_to(const struct sk_buff *skb)
> +{
> + return likely(!TCP_SKB_CB(skb)->eor);
> +}
> +
> /* Events passed to congestion control interface */
> enum tcp_ca_event {
> CA_EVENT_TX_START, /* first transmit when no packets in flight */
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 4d73858..ea5364b 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -908,7 +908,8 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
> int copy, i;
> bool can_coalesce;
>
> - if (!tcp_send_head(sk) || (copy = size_goal - skb->len) <= 0) {
> + if (!tcp_send_head(sk) || (copy = size_goal - skb->len) <= 0 ||
> + !tcp_skb_can_collapse_to(skb)) {
> new_segment:
> if (!sk_stream_memory_free(sk))
> goto wait_for_sndbuf;
> @@ -1156,7 +1157,7 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
> copy = max - skb->len;
> }
>
> - if (copy <= 0) {
> + if (copy <= 0 || !tcp_skb_can_collapse_to(skb)) {
> new_segment:
> /* Allocate new segment. If the interface is SG,
> * allocate skb fitting to single page.
> @@ -1250,6 +1251,8 @@ new_segment:
> copied += copy;
> if (!msg_data_left(msg)) {
> tcp_tx_timestamp(sk, sockc.tsflags, skb);
> + if (unlikely(flags & MSG_EOR))
> + TCP_SKB_CB(skb)->eor = 1;
> goto out;
> }
>
> --
> 2.5.1
>
* [PATCH v4 net-next 2/3] tcp: Handle eor bit when coalescing skb
2016-04-25 21:44 [PATCH v4 net-next 0/3] tcp: Make use of MSG_EOR in tcp_sendmsg Martin KaFai Lau
2016-04-25 21:44 ` [PATCH v4 net-next 1/3] " Martin KaFai Lau
@ 2016-04-25 21:44 ` Martin KaFai Lau
2016-04-25 23:03 ` Eric Dumazet
2016-04-26 0:49 ` Soheil Hassas Yeganeh
2016-04-25 21:44 ` [PATCH v4 net-next 3/3] tcp: Handle eor bit when fragmenting a skb Martin KaFai Lau
` (2 subsequent siblings)
4 siblings, 2 replies; 12+ messages in thread
From: Martin KaFai Lau @ 2016-04-25 21:44 UTC
To: netdev
Cc: Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
Willem de Bruijn, Yuchung Cheng, Kernel Team
This patch:
1. Prevents next_skb from being coalesced into prev_skb if
TCP_SKB_CB(prev_skb)->eor is set.
2. Updates TCP_SKB_CB(prev_skb)->eor from next_skb when
coalescing is
allowed
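For illustration only (not part of the patch): the two coalescing rules can be sketched as a user-space model. toy_skb, toy_can_collapse_to and toy_try_collapse are hypothetical names; the kernel paths touched by the diff (tcp_shift_skb_data, tcp_retrans_try_collapse and friends) check many other conditions that are left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for an skb (not the kernel's sk_buff). */
struct toy_skb {
	size_t len;	/* payload bytes */
	bool eor;	/* buffer ends a MSG_EOR message */
};

/* Same intent as tcp_skb_can_collapse_to(): nothing may follow eor. */
static bool toy_can_collapse_to(const struct toy_skb *to)
{
	return !to->eor;
}

/* Try to merge next into prev. Returns true if the merge happened. */
static bool toy_try_collapse(struct toy_skb *prev, struct toy_skb *next)
{
	assert(prev != NULL && next != NULL);

	/* Rule 1: a buffer marked eor never absorbs following data. */
	if (!toy_can_collapse_to(prev))
		return false;

	/* Rule 2: the merged buffer now ends where next ended, so it
	 * takes over next's eor bit.
	 */
	prev->len += next->len;
	prev->eor = next->eor;

	next->len = 0;
	next->eor = false;
	return true;
}
```

In this model, two 730-byte buffers that are both marked eor stay separate, while an unmarked 1000-byte buffer followed by a marked 730-byte buffer merges into one 1730-byte buffer that carries the eor bit.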
Packetdrill script for testing:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
0.200 write(4, ..., 11680) = 11680
0.200 > P. 1:731(730) ack 1
0.200 > P. 731:1461(730) ack 1
0.200 > . 1461:8761(7300) ack 1
0.200 > P. 8761:13141(4380) ack 1
0.300 < . 1:1(0) ack 1 win 257 <sack 1461:13141,nop,nop>
0.300 > P. 1:731(730) ack 1
0.300 > P. 731:1461(730) ack 1
0.400 < . 1:1(0) ack 13141 win 257
0.400 close(4) = 0
0.400 > F. 13141:13141(0) ack 1
0.500 < F. 1:1(0) ack 13142 win 257
0.500 > . 13142:13142(0) ack 2
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
---
net/ipv4/tcp_input.c | 4 ++++
net/ipv4/tcp_output.c | 4 ++++
2 files changed, 8 insertions(+)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index dcad8f9..65fb708 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1303,6 +1303,7 @@ static bool tcp_shifted_skb(struct sock *sk, struct sk_buff *skb,
}
TCP_SKB_CB(prev)->tcp_flags |= TCP_SKB_CB(skb)->tcp_flags;
+ TCP_SKB_CB(prev)->eor = TCP_SKB_CB(skb)->eor;
if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
TCP_SKB_CB(prev)->end_seq++;
@@ -1368,6 +1369,9 @@ static struct sk_buff *tcp_shift_skb_data(struct sock *sk, struct sk_buff *skb,
if ((TCP_SKB_CB(prev)->sacked & TCPCB_TAGBITS) != TCPCB_SACKED_ACKED)
goto fallback;
+ if (!tcp_skb_can_collapse_to(prev))
+ goto fallback;
+
in_sack = !after(start_seq, TCP_SKB_CB(skb)->seq) &&
!before(end_seq, TCP_SKB_CB(skb)->end_seq);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 9d3b4b3..fa4d17f 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2494,6 +2494,7 @@ static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
* packet counting does not break.
*/
TCP_SKB_CB(skb)->sacked |= TCP_SKB_CB(next_skb)->sacked & TCPCB_EVER_RETRANS;
+ TCP_SKB_CB(skb)->eor = TCP_SKB_CB(next_skb)->eor;
/* changed transmit queue under us so clear hints */
tcp_clear_retrans_hints_partial(tp);
@@ -2545,6 +2546,9 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *to,
if (!tcp_can_collapse(sk, skb))
break;
+ if (!tcp_skb_can_collapse_to(to))
+ break;
+
space -= skb->len;
if (first) {
--
2.5.1
* Re: [PATCH v4 net-next 2/3] tcp: Handle eor bit when coalescing skb
2016-04-25 21:44 ` [PATCH v4 net-next 2/3] tcp: Handle eor bit when coalescing skb Martin KaFai Lau
@ 2016-04-25 23:03 ` Eric Dumazet
2016-04-26 0:49 ` Soheil Hassas Yeganeh
1 sibling, 0 replies; 12+ messages in thread
From: Eric Dumazet @ 2016-04-25 23:03 UTC
To: Martin KaFai Lau
Cc: netdev, Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
Willem de Bruijn, Yuchung Cheng, Kernel Team
On Mon, 2016-04-25 at 14:44 -0700, Martin KaFai Lau wrote:
> This patch:
> 1. Prevents next_skb from being coalesced into prev_skb if
> TCP_SKB_CB(prev_skb)->eor is set.
> 2. Updates TCP_SKB_CB(prev_skb)->eor from next_skb when
> coalescing is allowed.
Acked-by: Eric Dumazet <edumazet@google.com>
* Re: [PATCH v4 net-next 2/3] tcp: Handle eor bit when coalescing skb
2016-04-25 21:44 ` [PATCH v4 net-next 2/3] tcp: Handle eor bit when coalescing skb Martin KaFai Lau
2016-04-25 23:03 ` Eric Dumazet
@ 2016-04-26 0:49 ` Soheil Hassas Yeganeh
1 sibling, 0 replies; 12+ messages in thread
From: Soheil Hassas Yeganeh @ 2016-04-26 0:49 UTC
To: Martin KaFai Lau
Cc: netdev, Eric Dumazet, Neal Cardwell, Willem de Bruijn,
Yuchung Cheng, Kernel Team
On Mon, Apr 25, 2016 at 5:44 PM, Martin KaFai Lau <kafai@fb.com> wrote:
> This patch:
> 1. Prevents next_skb from being coalesced into prev_skb if
> TCP_SKB_CB(prev_skb)->eor is set.
> 2. Updates TCP_SKB_CB(prev_skb)->eor from next_skb when
> coalescing is allowed.
>
> Packetdrill script for testing:
> ~~~~~~
> +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
> +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
> +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> +0 bind(3, ..., ...) = 0
> +0 listen(3, 1) = 0
>
> 0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
> 0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
> 0.200 < . 1:1(0) ack 1 win 257
> 0.200 accept(3, ..., ...) = 4
> +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
>
> 0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
> 0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
> 0.200 write(4, ..., 11680) = 11680
>
> 0.200 > P. 1:731(730) ack 1
> 0.200 > P. 731:1461(730) ack 1
> 0.200 > . 1461:8761(7300) ack 1
> 0.200 > P. 8761:13141(4380) ack 1
>
> 0.300 < . 1:1(0) ack 1 win 257 <sack 1461:13141,nop,nop>
> 0.300 > P. 1:731(730) ack 1
> 0.300 > P. 731:1461(730) ack 1
> 0.400 < . 1:1(0) ack 13141 win 257
>
> 0.400 close(4) = 0
> 0.400 > F. 13141:13141(0) ack 1
> 0.500 < F. 1:1(0) ack 13142 win 257
> 0.500 > . 13142:13142(0) ack 2
>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Soheil Hassas Yeganeh <soheil@google.com>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
> ---
> net/ipv4/tcp_input.c | 4 ++++
> net/ipv4/tcp_output.c | 4 ++++
> 2 files changed, 8 insertions(+)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index dcad8f9..65fb708 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -1303,6 +1303,7 @@ static bool tcp_shifted_skb(struct sock *sk, struct sk_buff *skb,
> }
>
> TCP_SKB_CB(prev)->tcp_flags |= TCP_SKB_CB(skb)->tcp_flags;
> + TCP_SKB_CB(prev)->eor = TCP_SKB_CB(skb)->eor;
> if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
> TCP_SKB_CB(prev)->end_seq++;
>
> @@ -1368,6 +1369,9 @@ static struct sk_buff *tcp_shift_skb_data(struct sock *sk, struct sk_buff *skb,
> if ((TCP_SKB_CB(prev)->sacked & TCPCB_TAGBITS) != TCPCB_SACKED_ACKED)
> goto fallback;
>
> + if (!tcp_skb_can_collapse_to(prev))
> + goto fallback;
> +
> in_sack = !after(start_seq, TCP_SKB_CB(skb)->seq) &&
> !before(end_seq, TCP_SKB_CB(skb)->end_seq);
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 9d3b4b3..fa4d17f 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2494,6 +2494,7 @@ static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
> * packet counting does not break.
> */
> TCP_SKB_CB(skb)->sacked |= TCP_SKB_CB(next_skb)->sacked & TCPCB_EVER_RETRANS;
> + TCP_SKB_CB(skb)->eor = TCP_SKB_CB(next_skb)->eor;
>
> /* changed transmit queue under us so clear hints */
> tcp_clear_retrans_hints_partial(tp);
> @@ -2545,6 +2546,9 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *to,
> if (!tcp_can_collapse(sk, skb))
> break;
>
> + if (!tcp_skb_can_collapse_to(to))
> + break;
> +
> space -= skb->len;
>
> if (first) {
> --
> 2.5.1
>
* [PATCH v4 net-next 3/3] tcp: Handle eor bit when fragmenting a skb
2016-04-25 21:44 [PATCH v4 net-next 0/3] tcp: Make use of MSG_EOR in tcp_sendmsg Martin KaFai Lau
2016-04-25 21:44 ` [PATCH v4 net-next 1/3] " Martin KaFai Lau
2016-04-25 21:44 ` [PATCH v4 net-next 2/3] tcp: Handle eor bit when coalescing skb Martin KaFai Lau
@ 2016-04-25 21:44 ` Martin KaFai Lau
2016-04-25 23:04 ` Eric Dumazet
2016-04-26 0:49 ` Soheil Hassas Yeganeh
2016-04-26 0:50 ` [PATCH v4 net-next 0/3] tcp: Make use of MSG_EOR in tcp_sendmsg Soheil Hassas Yeganeh
2016-04-28 20:14 ` David Miller
4 siblings, 2 replies; 12+ messages in thread
From: Martin KaFai Lau @ 2016-04-25 21:44 UTC
To: netdev
Cc: Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
Willem de Bruijn, Yuchung Cheng, Kernel Team
When fragmenting a skb, the next_skb should carry
the eor from prev_skb. The eor of prev_skb should
also be reset.
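For illustration only (not part of the patch): the eor hand-off on a split can be sketched as a user-space model. toy_skb, toy_fragment_eor and toy_fragment are hypothetical names that loosely mirror tcp_skb_fragment_eor() and its callers; the real tcp_fragment() and tso_fragment() also split data, flags and timestamps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for an skb (not the kernel's sk_buff). */
struct toy_skb {
	size_t len;	/* payload bytes */
	bool eor;	/* buffer ends a MSG_EOR message */
};

/* Same intent as tcp_skb_fragment_eor(): the message now ends in the
 * second half, so the bit moves there and is cleared on the first half.
 */
static void toy_fragment_eor(struct toy_skb *skb, struct toy_skb *skb2)
{
	skb2->eor = skb->eor;
	skb->eor = false;
}

/* Keep the first len bytes in skb and move the rest to buff.
 * Returns 0 on success, -1 if len does not split skb in two.
 */
static int toy_fragment(struct toy_skb *skb, struct toy_skb *buff, size_t len)
{
	assert(skb != NULL && buff != NULL);

	if (len == 0 || len >= skb->len)
		return -1;

	buff->len = skb->len - len;
	skb->len = len;
	toy_fragment_eor(skb, buff);
	return 0;
}
```

Splitting a 15330-byte buffer marked eor at 7300 bytes, and then splitting the remainder at 7300 bytes again, leaves buffers of 7300, 7300 and 730 bytes in this model, with only the final 730-byte buffer still marked eor. Without the hand-off, the bit would stay on the first fragment and later data could be appended to the last one.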
Packetdrill script for testing:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
0.200 sendto(4, ..., 15330, MSG_EOR, ..., ...) = 15330
0.200 sendto(4, ..., 730, 0, ..., ...) = 730
0.200 > . 1:7301(7300) ack 1
0.200 > . 7301:14601(7300) ack 1
0.300 < . 1:1(0) ack 14601 win 257
0.300 > P. 14601:15331(730) ack 1
0.300 > P. 15331:16061(730) ack 1
0.400 < . 1:1(0) ack 16061 win 257
0.400 close(4) = 0
0.400 > F. 16061:16061(0) ack 1
0.400 < F. 1:1(0) ack 16062 win 257
0.400 > . 16062:16062(0) ack 2
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
---
net/ipv4/tcp_output.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index fa4d17f..55a926b 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1128,6 +1128,12 @@ static void tcp_fragment_tstamp(struct sk_buff *skb, struct sk_buff *skb2)
}
}
+static void tcp_skb_fragment_eor(struct sk_buff *skb, struct sk_buff *skb2)
+{
+ TCP_SKB_CB(skb2)->eor = TCP_SKB_CB(skb)->eor;
+ TCP_SKB_CB(skb)->eor = 0;
+}
+
/* Function to create two new TCP segments. Shrinks the given segment
* to the specified size and appends a new segment with the rest of the
* packet to the list. This won't be called frequently, I hope.
@@ -1173,6 +1179,7 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len,
TCP_SKB_CB(skb)->tcp_flags = flags & ~(TCPHDR_FIN | TCPHDR_PSH);
TCP_SKB_CB(buff)->tcp_flags = flags;
TCP_SKB_CB(buff)->sacked = TCP_SKB_CB(skb)->sacked;
+ tcp_skb_fragment_eor(skb, buff);
if (!skb_shinfo(skb)->nr_frags && skb->ip_summed != CHECKSUM_PARTIAL) {
/* Copy and checksum data tail into the new buffer. */
@@ -1733,6 +1740,8 @@ static int tso_fragment(struct sock *sk, struct sk_buff *skb, unsigned int len,
/* This packet was never sent out yet, so no SACK bits. */
TCP_SKB_CB(buff)->sacked = 0;
+ tcp_skb_fragment_eor(skb, buff);
+
buff->ip_summed = skb->ip_summed = CHECKSUM_PARTIAL;
skb_split(skb, buff, len);
tcp_fragment_tstamp(skb, buff);
--
2.5.1
* Re: [PATCH v4 net-next 3/3] tcp: Handle eor bit when fragmenting a skb
2016-04-25 21:44 ` [PATCH v4 net-next 3/3] tcp: Handle eor bit when fragmenting a skb Martin KaFai Lau
@ 2016-04-25 23:04 ` Eric Dumazet
2016-04-26 0:49 ` Soheil Hassas Yeganeh
1 sibling, 0 replies; 12+ messages in thread
From: Eric Dumazet @ 2016-04-25 23:04 UTC
To: Martin KaFai Lau
Cc: netdev, Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
Willem de Bruijn, Yuchung Cheng, Kernel Team
On Mon, 2016-04-25 at 14:44 -0700, Martin KaFai Lau wrote:
> When fragmenting a skb, the next_skb should carry
> the eor from prev_skb. The eor of prev_skb should
> also be reset.
Acked-by: Eric Dumazet <edumazet@google.com>
* Re: [PATCH v4 net-next 3/3] tcp: Handle eor bit when fragmenting a skb
2016-04-25 21:44 ` [PATCH v4 net-next 3/3] tcp: Handle eor bit when fragmenting a skb Martin KaFai Lau
2016-04-25 23:04 ` Eric Dumazet
@ 2016-04-26 0:49 ` Soheil Hassas Yeganeh
1 sibling, 0 replies; 12+ messages in thread
From: Soheil Hassas Yeganeh @ 2016-04-26 0:49 UTC
To: Martin KaFai Lau
Cc: netdev, Eric Dumazet, Neal Cardwell, Willem de Bruijn,
Yuchung Cheng, Kernel Team
On Mon, Apr 25, 2016 at 5:44 PM, Martin KaFai Lau <kafai@fb.com> wrote:
> When fragmenting a skb, the next_skb should carry
> the eor from prev_skb. The eor of prev_skb should
> also be reset.
>
> Packetdrill script for testing:
> ~~~~~~
> +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
> +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
> +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> +0 bind(3, ..., ...) = 0
> +0 listen(3, 1) = 0
>
> 0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
> 0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
> 0.200 < . 1:1(0) ack 1 win 257
> 0.200 accept(3, ..., ...) = 4
> +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
>
> 0.200 sendto(4, ..., 15330, MSG_EOR, ..., ...) = 15330
> 0.200 sendto(4, ..., 730, 0, ..., ...) = 730
>
> 0.200 > . 1:7301(7300) ack 1
> 0.200 > . 7301:14601(7300) ack 1
>
> 0.300 < . 1:1(0) ack 14601 win 257
> 0.300 > P. 14601:15331(730) ack 1
> 0.300 > P. 15331:16061(730) ack 1
>
> 0.400 < . 1:1(0) ack 16061 win 257
> 0.400 close(4) = 0
> 0.400 > F. 16061:16061(0) ack 1
> 0.400 < F. 1:1(0) ack 16062 win 257
> 0.400 > . 16062:16062(0) ack 2
>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Soheil Hassas Yeganeh <soheil@google.com>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
> ---
> net/ipv4/tcp_output.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index fa4d17f..55a926b 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1128,6 +1128,12 @@ static void tcp_fragment_tstamp(struct sk_buff *skb, struct sk_buff *skb2)
> }
> }
>
> +static void tcp_skb_fragment_eor(struct sk_buff *skb, struct sk_buff *skb2)
> +{
> + TCP_SKB_CB(skb2)->eor = TCP_SKB_CB(skb)->eor;
> + TCP_SKB_CB(skb)->eor = 0;
> +}
> +
> /* Function to create two new TCP segments. Shrinks the given segment
> * to the specified size and appends a new segment with the rest of the
> * packet to the list. This won't be called frequently, I hope.
> @@ -1173,6 +1179,7 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len,
> TCP_SKB_CB(skb)->tcp_flags = flags & ~(TCPHDR_FIN | TCPHDR_PSH);
> TCP_SKB_CB(buff)->tcp_flags = flags;
> TCP_SKB_CB(buff)->sacked = TCP_SKB_CB(skb)->sacked;
> + tcp_skb_fragment_eor(skb, buff);
>
> if (!skb_shinfo(skb)->nr_frags && skb->ip_summed != CHECKSUM_PARTIAL) {
> /* Copy and checksum data tail into the new buffer. */
> @@ -1733,6 +1740,8 @@ static int tso_fragment(struct sock *sk, struct sk_buff *skb, unsigned int len,
> /* This packet was never sent out yet, so no SACK bits. */
> TCP_SKB_CB(buff)->sacked = 0;
>
> + tcp_skb_fragment_eor(skb, buff);
> +
> buff->ip_summed = skb->ip_summed = CHECKSUM_PARTIAL;
> skb_split(skb, buff, len);
> tcp_fragment_tstamp(skb, buff);
> --
> 2.5.1
>
* Re: [PATCH v4 net-next 0/3] tcp: Make use of MSG_EOR in tcp_sendmsg
2016-04-25 21:44 [PATCH v4 net-next 0/3] tcp: Make use of MSG_EOR in tcp_sendmsg Martin KaFai Lau
` (2 preceding siblings ...)
2016-04-25 21:44 ` [PATCH v4 net-next 3/3] tcp: Handle eor bit when fragmenting a skb Martin KaFai Lau
@ 2016-04-26 0:50 ` Soheil Hassas Yeganeh
2016-04-28 20:14 ` David Miller
4 siblings, 0 replies; 12+ messages in thread
From: Soheil Hassas Yeganeh @ 2016-04-26 0:50 UTC
To: Martin KaFai Lau
Cc: netdev, Eric Dumazet, Neal Cardwell, Willem de Bruijn,
Yuchung Cheng, Kernel Team
On Mon, Apr 25, 2016 at 5:44 PM, Martin KaFai Lau <kafai@fb.com> wrote:
> v4:
> ~ Do not set eor bit in do_tcp_sendpages() since there is
> no way to pass MSG_EOR from the userland now.
> ~ Avoid rmw by testing MSG_EOR first in tcp_sendmsg().
> ~ Move TCP_SKB_CB(skb)->eor test to a new helper
> tcp_skb_can_collapse_to() (suggested by Soheil).
> ~ Add some packetdrill tests.
Thanks for the nice patches and the tests!
> v3:
> ~ Separate EOR marking from the SKBTX_ANY_TSTAMP logic.
> ~ Move the eor bit test back to the loop in tcp_sendmsg and
> tcp_sendpage because there could be >1 threads doing
> sendmsg.
> ~ Thanks to Eric Dumazet's suggestions on v2.
> ~ The TCP timestamp bug fixes are separated into other threads.
>
> v2:
> ~ Rework based on the recent work
> "add TX timestamping via cmsg" by
> Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
> ~ This version takes the MSG_EOR bit as a signal of
> end-of-response-message and leaves the selective
> timestamping job to the cmsg
> ~ Changes based on the v1 feedback (like avoiding an
> unlikely check in a loop and adding tcp_sendpage
> support)
> ~ The first 3 patches are bug fixes. The fixes in this
> series depend on the newly introduced txstamp_ack in
> net-next. I will make relevant patches against net after
> getting some feedback.
> ~ The test results are based on the recently posted net fix:
> "tcp: Fix SOF_TIMESTAMPING_TX_ACK when handling dup acks"
>
> One potential use case is to use MSG_EOR with
> SOF_TIMESTAMPING_TX_ACK to get more accurate
> TCP ack timestamping for application protocols with
> multiple outgoing response messages (e.g. HTTP2).
>
> One of our use cases is at the webserver. The webserver tracks
> the HTTP2 response latency by measuring from when the webserver
> sends the first byte to the socket until the TCP ACK of the last
> byte is received. In cases where we don't have client-side
> measurement, measuring from the server side is the only option.
> In cases where we have client-side measurement, the server-side
> data can also be used to cross-check the client-side data.
>
* Re: [PATCH v4 net-next 0/3] tcp: Make use of MSG_EOR in tcp_sendmsg
2016-04-25 21:44 [PATCH v4 net-next 0/3] tcp: Make use of MSG_EOR in tcp_sendmsg Martin KaFai Lau
` (3 preceding siblings ...)
2016-04-26 0:50 ` [PATCH v4 net-next 0/3] tcp: Make use of MSG_EOR in tcp_sendmsg Soheil Hassas Yeganeh
@ 2016-04-28 20:14 ` David Miller
4 siblings, 0 replies; 12+ messages in thread
From: David Miller @ 2016-04-28 20:14 UTC
To: kafai; +Cc: netdev, edumazet, ncardwell, soheil, willemb, ycheng, kernel-team
From: Martin KaFai Lau <kafai@fb.com>
Date: Mon, 25 Apr 2016 14:44:47 -0700
...
> One potential use case is to use MSG_EOR with
> SOF_TIMESTAMPING_TX_ACK to get more accurate
> TCP ack timestamping for application protocols with
> multiple outgoing response messages (e.g. HTTP2).
>
> One of our use cases is at the webserver. The webserver tracks
> the HTTP2 response latency by measuring from when the webserver
> sends the first byte to the socket until the TCP ACK of the last
> byte is received. In cases where we don't have client-side
> measurement, measuring from the server side is the only option.
> In cases where we have client-side measurement, the server-side
> data can also be used to cross-check the client-side data.
Looks good, series applied, thanks!