netdev.vger.kernel.org archive mirror
* [RFC PATCH v2 net-next] tcp: remove 64 KByte limit for initial tp->rcv_wnd value
@ 2024-05-18  2:50 Jason Xing
  2024-05-20 16:51 ` Eric Dumazet
  0 siblings, 1 reply; 7+ messages in thread
From: Jason Xing @ 2024-05-18  2:50 UTC (permalink / raw)
  To: edumazet, dsahern, kuba, pabeni, davem, ncardwell
  Cc: netdev, kerneljasonxing, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

In 2018, commit a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin
to around 64KB") capped the initial value of tp->rcv_wnd at 65535. Most CDN
teams do not benefit from this change because they can no longer get a
large window to receive a big chunk of data, which slows transfers down,
especially on long-RTT paths.

RFC 7323 says:
  "The maximum receive window, and therefore the scale factor, is
   determined by the maximum receive buffer space."

So we can get rid of this 64k limitation and let the window be tunable, as
long as the user stays within the limits of the receive buffer space. Then
many companies, I believe, can get the same behaviour back as in the old
days. Besides, many papers describe experiments related to this window and
show good results in some cases, for example the Yahoo! CDN study [1].

To avoid future confusion, the current change doesn't affect the initial
receive window on the wire in a SYN or SYN+ACK packet: those stay within
65535 bytes, as required by RFC 7323, because of the limit in
__tcp_transmit_skb():

    th->window      = htons(min(tp->rcv_wnd, 65535U));

In short, __tcp_transmit_skb() already ensures that this constraint is
respected, no matter how large tp->rcv_wnd is.
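
To make this concrete, here is a minimal, purely illustrative sketch (the
helper name and its arguments are made up; this is not the actual kernel
code path) of how the on-wire window field relates to tp->rcv_wnd before
and after window scaling takes effect:

    /* Illustrative sketch only, not the kernel implementation.
     * "handshake" means the segment carries a SYN, i.e. window scaling
     * is not in effect yet. */
    static u16 advertised_window(u32 rcv_wnd, u8 rcv_wscale, bool handshake)
    {
            u32 win = handshake ? rcv_wnd : rcv_wnd >> rcv_wscale;

            /* The TCP header window field is 16 bits wide, so whatever
             * tp->rcv_wnd holds, the wire value can never exceed 65535. */
            return win > 65535 ? 65535 : win;
    }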

Let me provide one example, with and without the patch:
Before:
client   --- SYN: rwindow=65535 ---> server
client   <--- SYN+ACK: rwindow=65535 ----  server
client   --- ACK: rwindow=65536 ---> server
Note: for the last ACK, the calculation is 512 << 7.

After:
client   --- SYN: rwindow=65535 ---> server
client   <--- SYN+ACK: rwindow=65535 ----  server
client   --- ACK: rwindow=175232 ---> server
Note: I use the following command to make it work:
ip route change default via [ip] dev eth0 metric 100 initrwnd 120
For the last ACK, the calculation is 1369 << 7.

Looking at the last ACK of the 3-way handshake, we can see that with the
patch applied the window can exceed 64 KByte.
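
The window values in the last ACK above come directly from the 16-bit
on-wire field multiplied by the negotiated scale factor (7 in both traces),
per the notes; restating the arithmetic:

    /* Window seen by the peer = on-wire window field << rcv_wscale (7 here) */
    unsigned int before = 512U  << 7;   /*  65536: rcv_wnd capped at 64 KB   */
    unsigned int after  = 1369U << 7;   /* 175232: with "initrwnd 120" set   */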

[1]: https://conferences.sigcomm.org/imc/2011/docs/p569.pdf

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
v2
Link: https://lore.kernel.org/all/20240517085031.18896-1-kerneljasonxing@gmail.com/
1. revise the title and body messages (Neal)
---
 net/ipv4/tcp_output.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 95caf8aaa8be..95618d0e78e4 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -232,7 +232,7 @@ void tcp_select_initial_window(const struct sock *sk, int __space, __u32 mss,
 	if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_workaround_signed_windows))
 		(*rcv_wnd) = min(space, MAX_TCP_WINDOW);
 	else
-		(*rcv_wnd) = min_t(u32, space, U16_MAX);
+		(*rcv_wnd) = space;
 
 	if (init_rcv_wnd)
 		*rcv_wnd = min(*rcv_wnd, init_rcv_wnd * mss);
-- 
2.37.3



* Re: [RFC PATCH v2 net-next] tcp: remove 64 KByte limit for initial tp->rcv_wnd value
  2024-05-18  2:50 [RFC PATCH v2 net-next] tcp: remove 64 KByte limit for initial tp->rcv_wnd value Jason Xing
@ 2024-05-20 16:51 ` Eric Dumazet
  2024-05-21  0:36   ` Jason Xing
  0 siblings, 1 reply; 7+ messages in thread
From: Eric Dumazet @ 2024-05-20 16:51 UTC (permalink / raw)
  To: Jason Xing; +Cc: dsahern, kuba, pabeni, davem, ncardwell, netdev, Jason Xing

On Sat, May 18, 2024 at 4:50 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> From: Jason Xing <kernelxing@tencent.com>
>
> In 2018 commit a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin
> to around 64KB") limited the initial value of tp->rcv_wnd to 65535, most
> CDN team would not benefit from this change because they cannot have a
> large window to receive a big packet, which will be slowed down especially
> in long RTT.
>
> According to RFC 7323, it says:
>   "The maximum receive window, and therefore the scale factor, is
>    determined by the maximum receive buffer space."

This seems not relevant? The wscale factor is not changed in this patch,
and tp->rcv_wnd is also not the maximum receive window.

>
> So we can get rid of this 64k limitation and let the window be tunable if
> the user wants to do it within the control of buffer space. Then many
> companies, I believe, can have the same behaviour as old days.

Not sure this has ever worked, see below.

Also, the "many companies ..." mention has no place in a changelog.


> Besides,
> there are many papers conducting various interesting experiments which
> have something to do with this window and show good outputs in some cases,
> say, paper [1] in Yahoo! CDN.

I think this changelog is trying hard to sell something, but in reality
the nature of the TCP 3WHS makes your claims wrong.

Instead, you should clearly document that this problem can _not_ be solved
for both active _and_ passive connections.

In the first RTT, a client (active connection) cannot send more than 64KB,
if the TCP specs are properly applied.

>
> To avoid future confusion, current change doesn't affect the initial
> receive window on the wire in a SYN or SYN+ACK packet which are set within
> 65535 bytes according to RFC 7323 also due to the limit in
> __tcp_transmit_skb():
>
>     th->window      = htons(min(tp->rcv_wnd, 65535U));
>
> In one word, __tcp_transmit_skb() already ensures that constraint is
> respected, no matter how large tp->rcv_wnd is.
>
> Let me provide one example if with or without the patch:
> Before:
> client   --- SYN: rwindow=65535 ---> server
> client   <--- SYN+ACK: rwindow=65535 ----  server
> client   --- ACK: rwindow=65536 ---> server
> Note: for the last ACK, the calculation is 512 << 7.
>
> After:
> client   --- SYN: rwindow=65535 ---> server
> client   <--- SYN+ACK: rwindow=65535 ----  server
> client   --- ACK: rwindow=175232 ---> server
> Note: I use the following command to make it work:
> ip route change default via [ip] dev eth0 metric 100 initrwnd 120
> For the last ACK, the calculation is 1369 << 7.
>
> We can pay attention to the last ACK in 3-way shakehand and notice that
> with the patch applied the window can reach more than 64 KByte.

You carefully avoided mentioning the asymmetry.
I do not think this is needed in the changelog, because it only adds
confusion.

>
> [1]: https://conferences.sigcomm.org/imc/2011/docs/p569.pdf
>
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> ---
> v2
> Link: https://lore.kernel.org/all/20240517085031.18896-1-kerneljasonxing@gmail.com/
> 1. revise the title and body messages (Neal)
> ---
>  net/ipv4/tcp_output.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 95caf8aaa8be..95618d0e78e4 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -232,7 +232,7 @@ void tcp_select_initial_window(const struct sock *sk, int __space, __u32 mss,
>         if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_workaround_signed_windows))
>                 (*rcv_wnd) = min(space, MAX_TCP_WINDOW);
>         else
> -               (*rcv_wnd) = min_t(u32, space, U16_MAX);
> +               (*rcv_wnd) = space;

This is probably breaking some packetdrill tests, but your change might [1]
be good, especially because it allows DRS behavior to be consistent for
large MTU (e.g. MTU 9000) and a bigger tcp_rmem[1], even without playing
with the initrwnd attribute.

After connection setup, "ss -temoi" would display rcv_space:89600 instead
of a capped value.

[1] This is hard to say; DRS is full of surprises.
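
About that rcv_space:89600: I believe it would simply be TCP_INIT_CWND *
advmss for MTU 9000 (advmss of roughly 8960), no longer clamped by a 64KB
rcv_wnd. Roughly, assuming rcvq_space is still initialized the way I
remember:

    /* Rough sketch, not verbatim kernel code. */
    rcvq_space = min(rcv_wnd, 10 /* TCP_INIT_CWND */ * 8960 /* ~advmss */);
    /* = 89600 once rcv_wnd is no longer capped at 65535 */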


* Re: [RFC PATCH v2 net-next] tcp: remove 64 KByte limit for initial tp->rcv_wnd value
  2024-05-20 16:51 ` Eric Dumazet
@ 2024-05-21  0:36   ` Jason Xing
  2024-05-21  0:58     ` Jason Xing
  2024-05-21  6:56     ` Jason Xing
  0 siblings, 2 replies; 7+ messages in thread
From: Jason Xing @ 2024-05-21  0:36 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: dsahern, kuba, pabeni, davem, ncardwell, netdev, Jason Xing

Hello Eric,

On Tue, May 21, 2024 at 12:51 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Sat, May 18, 2024 at 4:50 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > In 2018 commit a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin
> > to around 64KB") limited the initial value of tp->rcv_wnd to 65535, most
> > CDN team would not benefit from this change because they cannot have a
> > large window to receive a big packet, which will be slowed down especially
> > in long RTT.
> >
> > According to RFC 7323, it says:
> >   "The maximum receive window, and therefore the scale factor, is
> >    determined by the maximum receive buffer space."
>
> This seems not relevant ?  wscale factor is not changed in this patch ?
> tp->rcv_wnd is also not the maximum receive window.

Thanks for your review.

I can remove this part. I was trying to show that I do not break the RFC.

>
> >
> > So we can get rid of this 64k limitation and let the window be tunable if
> > the user wants to do it within the control of buffer space. Then many
> > companies, I believe, can have the same behaviour as old days.
>
> Not sure this has ever worked, see below.
>
> Also, the "many companies ..." mention has nothing to do in a changelog.

Oh, I just copied/translated that from my initial study of rcv_wnd, when I
read many papers saying something like this.

I can also remove this sentence.

>
>
> > Besides,
> > there are many papers conducting various interesting experiments which
> > have something to do with this window and show good outputs in some cases,
> > say, paper [1] in Yahoo! CDN.
>
> I think this changelog is trying hard to sell something, but in
> reality TCP 3WHS nature
> makes your claims wrong.
>
> Instead, you should clearly document that this problem can _not_ be
> solved for both
> active _and_ passive connections.
>
> In the first RTT, a client (active connection) can not send more than
> 64KB, if TCP specs
> are properly applied.

Having a large rcv_wnd, if the user can tweak this knob, can help
transfer data more rapidly. I'm not referring to the first RTT.

>
> >
> > To avoid future confusion, current change doesn't affect the initial
> > receive window on the wire in a SYN or SYN+ACK packet which are set within
> > 65535 bytes according to RFC 7323 also due to the limit in
> > __tcp_transmit_skb():
> >
> >     th->window      = htons(min(tp->rcv_wnd, 65535U));
> >
> > In one word, __tcp_transmit_skb() already ensures that constraint is
> > respected, no matter how large tp->rcv_wnd is.
> >
> > Let me provide one example if with or without the patch:
> > Before:
> > client   --- SYN: rwindow=65535 ---> server
> > client   <--- SYN+ACK: rwindow=65535 ----  server
> > client   --- ACK: rwindow=65536 ---> server
> > Note: for the last ACK, the calculation is 512 << 7.
> >
> > After:
> > client   --- SYN: rwindow=65535 ---> server
> > client   <--- SYN+ACK: rwindow=65535 ----  server
> > client   --- ACK: rwindow=175232 ---> server
> > Note: I use the following command to make it work:
> > ip route change default via [ip] dev eth0 metric 100 initrwnd 120
> > For the last ACK, the calculation is 1369 << 7.
> >
> > We can pay attention to the last ACK in 3-way shakehand and notice that
> > with the patch applied the window can reach more than 64 KByte.
>
> You carefully avoided mentioning the asymmetry.
> I do not think this is needed in the changelog, because this is adding
> confusion.

The case I've met in production is only about whether we can send more
data at the very beginning, so right now I care much more about the
sending side.

>
> >
> > [1]: https://conferences.sigcomm.org/imc/2011/docs/p569.pdf
> >
> > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > ---
> > v2
> > Link: https://lore.kernel.org/all/20240517085031.18896-1-kerneljasonxing@gmail.com/
> > 1. revise the title and body messages (Neal)
> > ---
> >  net/ipv4/tcp_output.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index 95caf8aaa8be..95618d0e78e4 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -232,7 +232,7 @@ void tcp_select_initial_window(const struct sock *sk, int __space, __u32 mss,
> >         if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_workaround_signed_windows))
> >                 (*rcv_wnd) = min(space, MAX_TCP_WINDOW);
> >         else
> > -               (*rcv_wnd) = min_t(u32, space, U16_MAX);
> > +               (*rcv_wnd) = space;
>
> This is probably breaking some  packetdrill tests, but your change
> might [1] be good,

I'll run some packetdrill tests and get back with the results soon.

> especially because it allows DRS behavior to be consistent for large
> MTU (eg MTU 9000) and bigger tcp_rmem[1],
> even without playing with initrwnd attribute.
>
> "ss -temoi " would display after connection setup  rcv_space:89600
> instead of a capped value.
>
> [1] This is hard to say, DRS is full of surprises.

To avoid confusion, I will remove this link and the related statements.

In conclusion, my opinions are:
1) this change doesn't break the rules, by which I mean the various RFCs.
2) this change gives us back the behaviour we had before 2018 in this case.
3) this change helps certain cases, especially CDN workloads.

I'll refine the changelog as much as I can, hoping it will not confuse
readers.

Thanks,
Jason


* Re: [RFC PATCH v2 net-next] tcp: remove 64 KByte limit for initial tp->rcv_wnd value
  2024-05-21  0:36   ` Jason Xing
@ 2024-05-21  0:58     ` Jason Xing
  2024-05-21  6:56     ` Jason Xing
  1 sibling, 0 replies; 7+ messages in thread
From: Jason Xing @ 2024-05-21  0:58 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: dsahern, kuba, pabeni, davem, ncardwell, netdev, Jason Xing

On Tue, May 21, 2024 at 8:36 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> Hello Eric,
>
> On Tue, May 21, 2024 at 12:51 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Sat, May 18, 2024 at 4:50 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > >
> > > From: Jason Xing <kernelxing@tencent.com>
> > >
> > > In 2018 commit a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin
> > > to around 64KB") limited the initial value of tp->rcv_wnd to 65535, most
> > > CDN team would not benefit from this change because they cannot have a
> > > large window to receive a big packet, which will be slowed down especially
> > > in long RTT.
> > >
> > > According to RFC 7323, it says:
> > >   "The maximum receive window, and therefore the scale factor, is
> > >    determined by the maximum receive buffer space."
> >
> > This seems not relevant ?  wscale factor is not changed in this patch ?
> > tp->rcv_wnd is also not the maximum receive window.
>
> Thanks for your review.
>
> I can remove this part. I was trying to claim I do not break RFC.
>
> >
> > >
> > > So we can get rid of this 64k limitation and let the window be tunable if
> > > the user wants to do it within the control of buffer space. Then many
> > > companies, I believe, can have the same behaviour as old days.
> >
> > Not sure this has ever worked, see below.
> >
> > Also, the "many companies ..." mention has nothing to do in a changelog.
>
> Oh, I just copied/translated from my initial studies of this rcv_wnd
> by reading many papers something like this.
>
> I can also remove this sentence.
>
> >
> >
> > > Besides,
> > > there are many papers conducting various interesting experiments which
> > > have something to do with this window and show good outputs in some cases,
> > > say, paper [1] in Yahoo! CDN.
> >
> > I think this changelog is trying hard to sell something, but in
> > reality TCP 3WHS nature
> > makes your claims wrong.
> >
> > Instead, you should clearly document that this problem can _not_ be
> > solved for both
> > active _and_ passive connections.
> >
> > In the first RTT, a client (active connection) can not send more than
> > 64KB, if TCP specs
> > are properly applied.
>
> Having a large rcv_wnd if the user can tweak this knob can help
> transfer data more rapidly. I'm not referring to the first RTT.

For the first RTT, it is surely limited to 64 KB at most, as you said.

For the whole transfer, though, the change can accelerate the sending
process and save some RTTs.

How did I find this? We upgraded some servers to the latest kernel and
noticed that the user-side metrics showed worse results than before.
Because of this, I spent some time digging into this part.

After applying this patch, the metrics are back to the normal/good results
we saw before, which proves that it works.

The CDN team is very sensitive to the latency of sending big data over
long-RTT paths.

Thanks,
Jason

>
> >
> > >
> > > To avoid future confusion, current change doesn't affect the initial
> > > receive window on the wire in a SYN or SYN+ACK packet which are set within
> > > 65535 bytes according to RFC 7323 also due to the limit in
> > > __tcp_transmit_skb():
> > >
> > >     th->window      = htons(min(tp->rcv_wnd, 65535U));
> > >
> > > In one word, __tcp_transmit_skb() already ensures that constraint is
> > > respected, no matter how large tp->rcv_wnd is.
> > >
> > > Let me provide one example if with or without the patch:
> > > Before:
> > > client   --- SYN: rwindow=65535 ---> server
> > > client   <--- SYN+ACK: rwindow=65535 ----  server
> > > client   --- ACK: rwindow=65536 ---> server
> > > Note: for the last ACK, the calculation is 512 << 7.
> > >
> > > After:
> > > client   --- SYN: rwindow=65535 ---> server
> > > client   <--- SYN+ACK: rwindow=65535 ----  server
> > > client   --- ACK: rwindow=175232 ---> server
> > > Note: I use the following command to make it work:
> > > ip route change default via [ip] dev eth0 metric 100 initrwnd 120
> > > For the last ACK, the calculation is 1369 << 7.
> > >
> > > We can pay attention to the last ACK in 3-way shakehand and notice that
> > > with the patch applied the window can reach more than 64 KByte.
> >
> > You carefully avoided mentioning the asymmetry.
> > I do not think this is needed in the changelog, because this is adding
> > confusion.
>
> What kind of case I've met in production is only about whether we're
> capable of sending more data at the same time at the very beginning,
> so I care much more about the sending process right now.
>
> >
> > >
> > > [1]: https://conferences.sigcomm.org/imc/2011/docs/p569.pdf
> > >
> > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > ---
> > > v2
> > > Link: https://lore.kernel.org/all/20240517085031.18896-1-kerneljasonxing@gmail.com/
> > > 1. revise the title and body messages (Neal)
> > > ---
> > >  net/ipv4/tcp_output.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > > index 95caf8aaa8be..95618d0e78e4 100644
> > > --- a/net/ipv4/tcp_output.c
> > > +++ b/net/ipv4/tcp_output.c
> > > @@ -232,7 +232,7 @@ void tcp_select_initial_window(const struct sock *sk, int __space, __u32 mss,
> > >         if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_workaround_signed_windows))
> > >                 (*rcv_wnd) = min(space, MAX_TCP_WINDOW);
> > >         else
> > > -               (*rcv_wnd) = min_t(u32, space, U16_MAX);
> > > +               (*rcv_wnd) = space;
> >
> > This is probably breaking some  packetdrill tests, but your change
> > might [1] be good,
>
> I'll do some packetdrill tests and get back some information soon.
>
> > especially because it allows DRS behavior to be consistent for large
> > MTU (eg MTU 9000) and bigger tcp_rmem[1],
> > even without playing with initrwnd attribute.
> >
> > "ss -temoi " would display after connection setup  rcv_space:89600
> > instead of a capped value.
> >
> > [1] This is hard to say, DRS is full of surprises.
>
> To avoid confusion, I will remove this link and relevant statements.
>
> Here are my opinions in conclusion:
> 1) this change doesn't break the law, I mean, various RFCs.
> 2) this change allows us to have the same behaviour as 2018 in this case.
> 3) this change does some good things to certain cases, especially for
> the CDN team.
>
> I'll refine the changelog as far as I can, hoping it will not confuse
> the readers.
>
> Thanks,
> Jason


* Re: [RFC PATCH v2 net-next] tcp: remove 64 KByte limit for initial tp->rcv_wnd value
  2024-05-21  0:36   ` Jason Xing
  2024-05-21  0:58     ` Jason Xing
@ 2024-05-21  6:56     ` Jason Xing
  2024-05-21  9:43       ` Eric Dumazet
  1 sibling, 1 reply; 7+ messages in thread
From: Jason Xing @ 2024-05-21  6:56 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: dsahern, kuba, pabeni, davem, ncardwell, netdev, Jason Xing

Hello Eric,

On Tue, May 21, 2024 at 8:36 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> Hello Eric,
>
> On Tue, May 21, 2024 at 12:51 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Sat, May 18, 2024 at 4:50 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > >
> > > From: Jason Xing <kernelxing@tencent.com>
> > >
> > > In 2018 commit a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin
> > > to around 64KB") limited the initial value of tp->rcv_wnd to 65535, most
> > > CDN team would not benefit from this change because they cannot have a
> > > large window to receive a big packet, which will be slowed down especially
> > > in long RTT.
> > >
> > > According to RFC 7323, it says:
> > >   "The maximum receive window, and therefore the scale factor, is
> > >    determined by the maximum receive buffer space."
> >
> > This seems not relevant ?  wscale factor is not changed in this patch ?
> > tp->rcv_wnd is also not the maximum receive window.
>
> Thanks for your review.
>
> I can remove this part. I was trying to claim I do not break RFC.
>
> >
> > >
> > > So we can get rid of this 64k limitation and let the window be tunable if
> > > the user wants to do it within the control of buffer space. Then many
> > > companies, I believe, can have the same behaviour as old days.
> >
> > Not sure this has ever worked, see below.
> >
> > Also, the "many companies ..." mention has nothing to do in a changelog.
>
> Oh, I just copied/translated from my initial studies of this rcv_wnd
> by reading many papers something like this.
>
> I can also remove this sentence.
>
> >
> >
> > > Besides,
> > > there are many papers conducting various interesting experiments which
> > > have something to do with this window and show good outputs in some cases,
> > > say, paper [1] in Yahoo! CDN.
> >
> > I think this changelog is trying hard to sell something, but in
> > reality TCP 3WHS nature
> > makes your claims wrong.
> >
> > Instead, you should clearly document that this problem can _not_ be
> > solved for both
> > active _and_ passive connections.
> >
> > In the first RTT, a client (active connection) can not send more than
> > 64KB, if TCP specs
> > are properly applied.
>
> Having a large rcv_wnd if the user can tweak this knob can help
> transfer data more rapidly. I'm not referring to the first RTT.
>
> >
> > >
> > > To avoid future confusion, current change doesn't affect the initial
> > > receive window on the wire in a SYN or SYN+ACK packet which are set within
> > > 65535 bytes according to RFC 7323 also due to the limit in
> > > __tcp_transmit_skb():
> > >
> > >     th->window      = htons(min(tp->rcv_wnd, 65535U));
> > >
> > > In one word, __tcp_transmit_skb() already ensures that constraint is
> > > respected, no matter how large tp->rcv_wnd is.
> > >
> > > Let me provide one example if with or without the patch:
> > > Before:
> > > client   --- SYN: rwindow=65535 ---> server
> > > client   <--- SYN+ACK: rwindow=65535 ----  server
> > > client   --- ACK: rwindow=65536 ---> server
> > > Note: for the last ACK, the calculation is 512 << 7.
> > >
> > > After:
> > > client   --- SYN: rwindow=65535 ---> server
> > > client   <--- SYN+ACK: rwindow=65535 ----  server
> > > client   --- ACK: rwindow=175232 ---> server
> > > Note: I use the following command to make it work:
> > > ip route change default via [ip] dev eth0 metric 100 initrwnd 120
> > > For the last ACK, the calculation is 1369 << 7.
> > >
> > > We can pay attention to the last ACK in 3-way shakehand and notice that
> > > with the patch applied the window can reach more than 64 KByte.
> >
> > You carefully avoided mentioning the asymmetry.
> > I do not think this is needed in the changelog, because this is adding
> > confusion.
>
> What kind of case I've met in production is only about whether we're
> capable of sending more data at the same time at the very beginning,
> so I care much more about the sending process right now.
>
> >
> > >
> > > [1]: https://conferences.sigcomm.org/imc/2011/docs/p569.pdf
> > >
> > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > ---
> > > v2
> > > Link: https://lore.kernel.org/all/20240517085031.18896-1-kerneljasonxing@gmail.com/
> > > 1. revise the title and body messages (Neal)
> > > ---
> > >  net/ipv4/tcp_output.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > > index 95caf8aaa8be..95618d0e78e4 100644
> > > --- a/net/ipv4/tcp_output.c
> > > +++ b/net/ipv4/tcp_output.c
> > > @@ -232,7 +232,7 @@ void tcp_select_initial_window(const struct sock *sk, int __space, __u32 mss,
> > >         if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_workaround_signed_windows))
> > >                 (*rcv_wnd) = min(space, MAX_TCP_WINDOW);
> > >         else
> > > -               (*rcv_wnd) = min_t(u32, space, U16_MAX);
> > > +               (*rcv_wnd) = space;
> >
> > This is probably breaking some  packetdrill tests, but your change
> > might [1] be good,
>
> I'll do some packetdrill tests and get back some information soon.

I'm done with the packetdrill tests [1]. Comparing runs with and without
this patch, two tests fail:
1) ./packetdrill/run_all.py -S -v -L -l tcp/ioctl/ioctl-siocinq-fin.pkt
2) ./packetdrill/run_all.py -S -v -L -l tcp/fastopen/server/client-ack-dropped-then-recovery-ms-timestamps.pkt

For the first one, it shows:
"FAIL [/data/home/kernelxing/source_code/packetdrill/gtests/net/tcp/ioctl/ioctl-siocinq-fin.pkt
(ipv6)]
stdout:
stderr:
ioctl-siocinq-fin.pkt:28: error handling packet: timing error:
expected outbound packet at 0.302321 sec but happened at 0.342759 sec;
tolerance 0.004000 sec
script packet:  0.302321 . 1:1(0) ack 2002
actual packet:  0.342759 . 1:1(0) ack 2002 win 65535"

For the second one, it shows:
"client-ack-dropped-then-recovery-ms-timestamps.pkt:33: error handling
packet: live packet field tcp_window: expected: 256 (0x100) vs actual:
532 (0x214)
script packet:  0.012251 P. 1:5001(5000) ack 1001 win 256 <nop,nop,TS
val 2010 ecr 1000>
actual packet:  0.012242 P. 1:5001(5000) ack 1001 win 532 <nop,nop,TS
val 2010 ecr 1000>
Ran    3 tests:    0 passing,    3 failing,    0 timed out (0.91 sec):
tcp/fastopen/server/client-ack-dropped-then-recovery-ms-timestamps.pkt"

The reason is the unexpected window size. Since I removed the 64KB limit,
this is expected from my point of view.

[1]: https://github.com/google/packetdrill
Running: ./packetdrill/run_all.py -S -v -L -l tcp/

If you are worried that this change might be too unpredictable, how about
this one instead:
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 95caf8aaa8be..3bf7d9fd2d6b 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -231,11 +231,13 @@ void tcp_select_initial_window(const struct sock *sk, int __space, __u32 mss,
         */
        if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_workaround_signed_windows))
                (*rcv_wnd) = min(space, MAX_TCP_WINDOW);
-       else
-               (*rcv_wnd) = min_t(u32, space, U16_MAX);

-       if (init_rcv_wnd)
+       if (init_rcv_wnd) {
+               *rcv_wnd = space;
                *rcv_wnd = min(*rcv_wnd, init_rcv_wnd * mss);
+       } else {
+               *rcv_wnd = min_t(u32, space, U16_MAX);
+       }

        *rcv_wscale = 0;
        if (wscale_ok) {
?

It changes the TCP stack's behaviour only when the user sets initrwnd via
'ip route'.

Thanks,
Jason

>
> > especially because it allows DRS behavior to be consistent for large
> > MTU (eg MTU 9000) and bigger tcp_rmem[1],
> > even without playing with initrwnd attribute.
> >
> > "ss -temoi " would display after connection setup  rcv_space:89600
> > instead of a capped value.
> >
> > [1] This is hard to say, DRS is full of surprises.
>
> To avoid confusion, I will remove this link and relevant statements.
>
> Here are my opinions in conclusion:
> 1) this change doesn't break the law, I mean, various RFCs.
> 2) this change allows us to have the same behaviour as 2018 in this case.
> 3) this change does some good things to certain cases, especially for
> the CDN team.
>
> I'll refine the changelog as far as I can, hoping it will not confuse
> the readers.
>
> Thanks,
> Jason


* Re: [RFC PATCH v2 net-next] tcp: remove 64 KByte limit for initial tp->rcv_wnd value
  2024-05-21  6:56     ` Jason Xing
@ 2024-05-21  9:43       ` Eric Dumazet
  2024-05-21 13:11         ` Jason Xing
  0 siblings, 1 reply; 7+ messages in thread
From: Eric Dumazet @ 2024-05-21  9:43 UTC (permalink / raw)
  To: Jason Xing; +Cc: dsahern, kuba, pabeni, davem, ncardwell, netdev, Jason Xing

On Tue, May 21, 2024 at 8:56 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> Hello Eric,
>
> On Tue, May 21, 2024 at 8:36 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > Hello Eric,
> >
> > On Tue, May 21, 2024 at 12:51 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Sat, May 18, 2024 at 4:50 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > >
> > > > From: Jason Xing <kernelxing@tencent.com>
> > > >
> > > > In 2018 commit a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin
> > > > to around 64KB") limited the initial value of tp->rcv_wnd to 65535, most
> > > > CDN team would not benefit from this change because they cannot have a
> > > > large window to receive a big packet, which will be slowed down especially
> > > > in long RTT.
> > > >
> > > > According to RFC 7323, it says:
> > > >   "The maximum receive window, and therefore the scale factor, is
> > > >    determined by the maximum receive buffer space."
> > >
> > > This seems not relevant ?  wscale factor is not changed in this patch ?
> > > tp->rcv_wnd is also not the maximum receive window.
> >
> > Thanks for your review.
> >
> > I can remove this part. I was trying to claim I do not break RFC.
> >
> > >
> > > >
> > > > So we can get rid of this 64k limitation and let the window be tunable if
> > > > the user wants to do it within the control of buffer space. Then many
> > > > companies, I believe, can have the same behaviour as old days.
> > >
> > > Not sure this has ever worked, see below.
> > >
> > > Also, the "many companies ..." mention has nothing to do in a changelog.
> >
> > Oh, I just copied/translated from my initial studies of this rcv_wnd
> > by reading many papers something like this.
> >
> > I can also remove this sentence.
> >
> > >
> > >
> > > > Besides,
> > > > there are many papers conducting various interesting experiments which
> > > > have something to do with this window and show good outputs in some cases,
> > > > say, paper [1] in Yahoo! CDN.
> > >
> > > I think this changelog is trying hard to sell something, but in
> > > reality TCP 3WHS nature
> > > makes your claims wrong.
> > >
> > > Instead, you should clearly document that this problem can _not_ be
> > > solved for both
> > > active _and_ passive connections.
> > >
> > > In the first RTT, a client (active connection) can not send more than
> > > 64KB, if TCP specs
> > > are properly applied.
> >
> > Having a large rcv_wnd if the user can tweak this knob can help
> > transfer data more rapidly. I'm not referring to the first RTT.
> >
> > >
> > > >
> > > > To avoid future confusion, current change doesn't affect the initial
> > > > receive window on the wire in a SYN or SYN+ACK packet which are set within
> > > > 65535 bytes according to RFC 7323 also due to the limit in
> > > > __tcp_transmit_skb():
> > > >
> > > >     th->window      = htons(min(tp->rcv_wnd, 65535U));
> > > >
> > > > In one word, __tcp_transmit_skb() already ensures that constraint is
> > > > respected, no matter how large tp->rcv_wnd is.
> > > >
> > > > Let me provide one example if with or without the patch:
> > > > Before:
> > > > client   --- SYN: rwindow=65535 ---> server
> > > > client   <--- SYN+ACK: rwindow=65535 ----  server
> > > > client   --- ACK: rwindow=65536 ---> server
> > > > Note: for the last ACK, the calculation is 512 << 7.
> > > >
> > > > After:
> > > > client   --- SYN: rwindow=65535 ---> server
> > > > client   <--- SYN+ACK: rwindow=65535 ----  server
> > > > client   --- ACK: rwindow=175232 ---> server
> > > > Note: I use the following command to make it work:
> > > > ip route change default via [ip] dev eth0 metric 100 initrwnd 120
> > > > For the last ACK, the calculation is 1369 << 7.
> > > >
> > > > We can pay attention to the last ACK in 3-way shakehand and notice that
> > > > with the patch applied the window can reach more than 64 KByte.
> > >
> > > You carefully avoided mentioning the asymmetry.
> > > I do not think this is needed in the changelog, because this is adding
> > > confusion.
> >
> > What kind of case I've met in production is only about whether we're
> > capable of sending more data at the same time at the very beginning,
> > so I care much more about the sending process right now.
> >
> > >
> > > >
> > > > [1]: https://conferences.sigcomm.org/imc/2011/docs/p569.pdf
> > > >
> > > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > > ---
> > > > v2
> > > > Link: https://lore.kernel.org/all/20240517085031.18896-1-kerneljasonxing@gmail.com/
> > > > 1. revise the title and body messages (Neal)
> > > > ---
> > > >  net/ipv4/tcp_output.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > > > index 95caf8aaa8be..95618d0e78e4 100644
> > > > --- a/net/ipv4/tcp_output.c
> > > > +++ b/net/ipv4/tcp_output.c
> > > > @@ -232,7 +232,7 @@ void tcp_select_initial_window(const struct sock *sk, int __space, __u32 mss,
> > > >         if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_workaround_signed_windows))
> > > >                 (*rcv_wnd) = min(space, MAX_TCP_WINDOW);
> > > >         else
> > > > -               (*rcv_wnd) = min_t(u32, space, U16_MAX);
> > > > +               (*rcv_wnd) = space;
> > >
> > > This is probably breaking some  packetdrill tests, but your change
> > > might [1] be good,
> >
> > I'll do some packetdrill tests and get back some information soon.
>
> I'm done with the packetdrill tests[1]. Here are two tests failed
> after comparing with/without this patch:
> 1) ./packetdrill/run_all.py -S -v -L -l tcp/ioctl/ioctl-siocinq-fin.pkt
> 2) ./packetdrill/run_all.py -S -v -L -l
> tcp/fastopen/server/client-ack-dropped-then-recovery-ms-timestamps.pkt
>
> For the first one, it shows:
> "FAIL [/data/home/kernelxing/source_code/packetdrill/gtests/net/tcp/ioctl/ioctl-siocinq-fin.pkt
> (ipv6)]
> stdout:
> stderr:
> ioctl-siocinq-fin.pkt:28: error handling packet: timing error:
> expected outbound packet at 0.302321 sec but happened at 0.342759 sec;
> tolerance 0.004000 sec
> script packet:  0.302321 . 1:1(0) ack 2002
> actual packet:  0.342759 . 1:1(0) ack 2002 win 65535"
>
> For the second one, it shows:
> "client-ack-dropped-then-recovery-ms-timestamps.pkt:33: error handling
> packet: live packet field tcp_window: expected: 256 (0x100) vs actual:
> 532 (0x214)
> script packet:  0.012251 P. 1:5001(5000) ack 1001 win 256 <nop,nop,TS
> val 2010 ecr 1000>
> actual packet:  0.012242 P. 1:5001(5000) ack 1001 win 532 <nop,nop,TS
> val 2010 ecr 1000>
> Ran    3 tests:    0 passing,    3 failing,    0 timed out (0.91 sec):
> tcp/fastopen/server/client-ack-dropped-then-recovery-ms-timestamps.pkt"
>
> The reason is unexpected window size. Since I removed the limit of
> 64KB, It is expected from my view.

I think you misunderstood what I was saying.

Basically, this change will break some packetdrill tests, and this is fine,
because those packetdrill tests were relying on a prior kernel behavior
that was not set in stone (and certainly not documented).

>
> [1]: https://github.com/google/packetdrill
> Running: ./packetdrill/run_all.py -S -v -L -l tcp/
>
> I wonder if you mind this change which might be unpredictable, how
> about this one:
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 95caf8aaa8be..3bf7d9fd2d6b 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -231,11 +231,13 @@ void tcp_select_initial_window(const struct sock
> *sk, int __space, __u32 mss,
>          */
>         if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_workaround_signed_windows))
>                 (*rcv_wnd) = min(space, MAX_TCP_WINDOW);
> -       else
> -               (*rcv_wnd) = min_t(u32, space, U16_MAX);
>
> -       if (init_rcv_wnd)
> +       if (init_rcv_wnd) {
> +               *rcv_wnd = space;
>                 *rcv_wnd = min(*rcv_wnd, init_rcv_wnd * mss);
> +       } else {
> +               *rcv_wnd = min_t(u32, space, U16_MAX);
> +       }
>
>         *rcv_wscale = 0;
>         if (wscale_ok) {
> ?
>
> It affects/changes the TCP stack only when the user tries to use 'ip
> route' to set initrwnd.

I much prefer the prior and simpler version.

Only the changelog was not very good IMO.

Also, I think this is fixing a bug and should target the net tree.

If it took 6 years to discover the unwanted side effects, we should make
sure the fix is backported by stable teams, thanks to an appropriate
Fixes: tag.
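
For reference, assuming the capping commit cited in your changelog is
indeed the one to blame, the tag would be something like:

    Fixes: a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")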


>
> Thanks,
> Jason
>
> >
> > > especially because it allows DRS behavior to be consistent for large
> > > MTU (eg MTU 9000) and bigger tcp_rmem[1],
> > > even without playing with initrwnd attribute.
> > >
> > > "ss -temoi " would display after connection setup  rcv_space:89600
> > > instead of a capped value.
> > >
> > > [1] This is hard to say, DRS is full of surprises.
> >
> > To avoid confusion, I will remove this link and relevant statements.
> >
> > Here are my opinions in conclusion:
> > 1) this change doesn't break the law, I mean, various RFCs.
> > 2) this change allows us to have the same behaviour as 2018 in this case.
> > 3) this change does some good things to certain cases, especially for
> > the CDN team.
> >
> > I'll refine the changelog as far as I can, hoping it will not confuse
> > the readers.
> >
> > Thanks,
> > Jason


* Re: [RFC PATCH v2 net-next] tcp: remove 64 KByte limit for initial tp->rcv_wnd value
  2024-05-21  9:43       ` Eric Dumazet
@ 2024-05-21 13:11         ` Jason Xing
  0 siblings, 0 replies; 7+ messages in thread
From: Jason Xing @ 2024-05-21 13:11 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: dsahern, kuba, pabeni, davem, ncardwell, netdev, Jason Xing

On Tue, May 21, 2024 at 5:43 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, May 21, 2024 at 8:56 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > Hello Eric,
> >
> > On Tue, May 21, 2024 at 8:36 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > >
> > > Hello Eric,
> > >
> > > On Tue, May 21, 2024 at 12:51 AM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Sat, May 18, 2024 at 4:50 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > > >
> > > > > From: Jason Xing <kernelxing@tencent.com>
> > > > >
> > > > > In 2018 commit a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin
> > > > > to around 64KB") limited the initial value of tp->rcv_wnd to 65535, most
> > > > > CDN team would not benefit from this change because they cannot have a
> > > > > large window to receive a big packet, which will be slowed down especially
> > > > > in long RTT.
> > > > >
> > > > > According to RFC 7323, it says:
> > > > >   "The maximum receive window, and therefore the scale factor, is
> > > > >    determined by the maximum receive buffer space."
> > > >
> > > > This seems not relevant ?  wscale factor is not changed in this patch ?
> > > > tp->rcv_wnd is also not the maximum receive window.
> > >
> > > Thanks for your review.
> > >
> > > I can remove this part. I was trying to claim I do not break RFC.
> > >
> > > >
> > > > >
> > > > > So we can get rid of this 64k limitation and let the window be tunable if
> > > > > the user wants to do it within the control of buffer space. Then many
> > > > > companies, I believe, can have the same behaviour as old days.
> > > >
> > > > Not sure this has ever worked, see below.
> > > >
> > > > Also, the "many companies ..." mention has nothing to do in a changelog.
> > >
> > > Oh, I just copied/translated from my initial studies of this rcv_wnd
> > > by reading many papers something like this.
> > >
> > > I can also remove this sentence.
> > >
> > > >
> > > >
> > > > > Besides,
> > > > > there are many papers conducting various interesting experiments which
> > > > > have something to do with this window and show good outputs in some cases,
> > > > > say, paper [1] in Yahoo! CDN.
> > > >
> > > > I think this changelog is trying hard to sell something, but in
> > > > reality TCP 3WHS nature
> > > > makes your claims wrong.
> > > >
> > > > Instead, you should clearly document that this problem can _not_ be
> > > > solved for both
> > > > active _and_ passive connections.
> > > >
> > > > In the first RTT, a client (active connection) can not send more than
> > > > 64KB, if TCP specs
> > > > are properly applied.
> > >
> > > Having a large rcv_wnd if the user can tweak this knob can help
> > > transfer data more rapidly. I'm not referring to the first RTT.
> > >
> > > >
> > > > >
> > > > > To avoid future confusion, current change doesn't affect the initial
> > > > > receive window on the wire in a SYN or SYN+ACK packet which are set within
> > > > > 65535 bytes according to RFC 7323 also due to the limit in
> > > > > __tcp_transmit_skb():
> > > > >
> > > > >     th->window      = htons(min(tp->rcv_wnd, 65535U));
> > > > >
> > > > > In one word, __tcp_transmit_skb() already ensures that constraint is
> > > > > respected, no matter how large tp->rcv_wnd is.
> > > > >
> > > > > Let me provide one example if with or without the patch:
> > > > > Before:
> > > > > client   --- SYN: rwindow=65535 ---> server
> > > > > client   <--- SYN+ACK: rwindow=65535 ----  server
> > > > > client   --- ACK: rwindow=65536 ---> server
> > > > > Note: for the last ACK, the calculation is 512 << 7.
> > > > >
> > > > > After:
> > > > > client   --- SYN: rwindow=65535 ---> server
> > > > > client   <--- SYN+ACK: rwindow=65535 ----  server
> > > > > client   --- ACK: rwindow=175232 ---> server
> > > > > Note: I use the following command to make it work:
> > > > > ip route change default via [ip] dev eth0 metric 100 initrwnd 120
> > > > > For the last ACK, the calculation is 1369 << 7.
> > > > >
> > > > > We can pay attention to the last ACK in 3-way shakehand and notice that
> > > > > with the patch applied the window can reach more than 64 KByte.
> > > >
> > > > You carefully avoided mentioning the asymmetry.
> > > > I do not think this is needed in the changelog, because this is adding
> > > > confusion.
> > >
> > > What kind of case I've met in production is only about whether we're
> > > capable of sending more data at the same time at the very beginning,
> > > so I care much more about the sending process right now.
> > >
> > > >
> > > > >
> > > > > [1]: https://conferences.sigcomm.org/imc/2011/docs/p569.pdf
> > > > >
> > > > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > > > ---
> > > > > v2
> > > > > Link: https://lore.kernel.org/all/20240517085031.18896-1-kerneljasonxing@gmail.com/
> > > > > 1. revise the title and body messages (Neal)
> > > > > ---
> > > > >  net/ipv4/tcp_output.c | 2 +-
> > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > > > > index 95caf8aaa8be..95618d0e78e4 100644
> > > > > --- a/net/ipv4/tcp_output.c
> > > > > +++ b/net/ipv4/tcp_output.c
> > > > > @@ -232,7 +232,7 @@ void tcp_select_initial_window(const struct sock *sk, int __space, __u32 mss,
> > > > >         if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_workaround_signed_windows))
> > > > >                 (*rcv_wnd) = min(space, MAX_TCP_WINDOW);
> > > > >         else
> > > > > -               (*rcv_wnd) = min_t(u32, space, U16_MAX);
> > > > > +               (*rcv_wnd) = space;
> > > >
> > > > This is probably breaking some  packetdrill tests, but your change
> > > > might [1] be good,
> > >
> > > I'll do some packetdrill tests and get back some information soon.
> >
> > I'm done with the packetdrill tests[1]. Here are two tests failed
> > after comparing with/without this patch:
> > 1) ./packetdrill/run_all.py -S -v -L -l tcp/ioctl/ioctl-siocinq-fin.pkt
> > 2) ./packetdrill/run_all.py -S -v -L -l
> > tcp/fastopen/server/client-ack-dropped-then-recovery-ms-timestamps.pkt
> >
> > For the first one, it shows:
> > "FAIL [/data/home/kernelxing/source_code/packetdrill/gtests/net/tcp/ioctl/ioctl-siocinq-fin.pkt
> > (ipv6)]
> > stdout:
> > stderr:
> > ioctl-siocinq-fin.pkt:28: error handling packet: timing error:
> > expected outbound packet at 0.302321 sec but happened at 0.342759 sec;
> > tolerance 0.004000 sec
> > script packet:  0.302321 . 1:1(0) ack 2002
> > actual packet:  0.342759 . 1:1(0) ack 2002 win 65535"
> >
> > For the second one, it shows:
> > "client-ack-dropped-then-recovery-ms-timestamps.pkt:33: error handling
> > packet: live packet field tcp_window: expected: 256 (0x100) vs actual:
> > 532 (0x214)
> > script packet:  0.012251 P. 1:5001(5000) ack 1001 win 256 <nop,nop,TS
> > val 2010 ecr 1000>
> > actual packet:  0.012242 P. 1:5001(5000) ack 1001 win 532 <nop,nop,TS
> > val 2010 ecr 1000>
> > Ran    3 tests:    0 passing,    3 failing,    0 timed out (0.91 sec):
> > tcp/fastopen/server/client-ack-dropped-then-recovery-ms-timestamps.pkt"
> >
> > The reason is unexpected window size. Since I removed the limit of
> > 64KB, It is expected from my view.
>
> I think you misunderstood what I was saying.
>
> Basically, this change will break some packetdrill tests, and this is fine,
> because those packetdrill tests were relying on a prior kernel behavior that
> was not set in stone (certainly not documented)
>
> >
> > [1]: https://github.com/google/packetdrill
> > Running: ./packetdrill/run_all.py -S -v -L -l tcp/
> >
> > I wonder if you mind this change which might be unpredictable, how
> > about this one:
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index 95caf8aaa8be..3bf7d9fd2d6b 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -231,11 +231,13 @@ void tcp_select_initial_window(const struct sock
> > *sk, int __space, __u32 mss,
> >          */
> >         if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_workaround_signed_windows))
> >                 (*rcv_wnd) = min(space, MAX_TCP_WINDOW);
> > -       else
> > -               (*rcv_wnd) = min_t(u32, space, U16_MAX);
> >
> > -       if (init_rcv_wnd)
> > +       if (init_rcv_wnd) {
> > +               *rcv_wnd = space;
> >                 *rcv_wnd = min(*rcv_wnd, init_rcv_wnd * mss);
> > +       } else {
> > +               *rcv_wnd = min_t(u32, space, U16_MAX);
> > +       }
> >
> >         *rcv_wscale = 0;
> >         if (wscale_ok) {
> > ?
> >
> > It affects/changes the TCP stack only when the user tries to use 'ip
> > route' to set initrwnd.
>
> I much prefer the prior and simpler version.
>
> Only the changelog was not very good IMO.
>
> Also, I think this is fixing a bug and should target the net tree.
>
> If it took 6 years to discover the unwanted side effects, we should
> make sure the fix
> is backported by stable teams, thanks to an appropriate Fixes: tag.

Oh, I see. I'll submit a new one soon.

Thanks,
Jason

