* [RFC net-next] tcp: allow larger TSO to be built under overload
From: Jakub Kicinski @ 2022-03-08 3:03 UTC
To: edumazet; +Cc: netdev, willemb, ncardwell, ycheng, Jakub Kicinski
We observed Tx-heavy workloads causing softirq overload: as load,
and therefore latency, increases, pacing rates fall, pushing TCP to
generate smaller and smaller TSO packets.
It seems reasonable to allow larger packets to be built when the
system is under stress. TCP already uses the
this_cpu_ksoftirqd() == current
condition as an indication of overload for TSQ scheduling.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
Sending as an RFC because it seems reasonable, but I haven't run
any large-scale testing yet. Bumping tcp_min_tso_segs to prevent
overloads works, but it seems like we can do better, since we only
need coarser pacing once disaster strikes?
The downsides are that users may have already increased
the value to what's needed during overload, or applied
the same logic in out-of-tree CA algo implementations
(only BBR implements ca_ops->min_tso_segs() upstream).
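For reference, the BBR hook looks roughly like this, paraphrased
from net/ipv4/tcp_bbr.c (details may vary by kernel version; the
~1.2 Mbit/s cutoff is bbr_min_tso_rate, given in bits/sec, hence
the >> 3 to compare against the bytes/sec pacing rate):

	/* Below ~1.2 Mbit/s pace out single-MSS skbs to keep
	 * inter-packet gaps sane; otherwise allow two segments.
	 */
	static u32 bbr_min_tso_segs(struct sock *sk)
	{
		return sk->sk_pacing_rate < (bbr_min_tso_rate >> 3) ? 1 : 2;
	}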
---
net/ipv4/tcp_output.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 2319531267c6..815ef4ffc39d 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1967,7 +1967,13 @@ static u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now,
 	 * This preserves ACK clocking and is consistent
 	 * with tcp_tso_should_defer() heuristic.
 	 */
-	segs = max_t(u32, bytes / mss_now, min_tso_segs);
+	segs = bytes / mss_now;
+	if (segs < min_tso_segs) {
+		segs = min_tso_segs;
+		/* Allow larger packets under stress */
+		if (this_cpu_ksoftirqd() == current)
+			segs *= 2;
+	}
 
 	return segs;
 }
--
2.34.1
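For context, the TSQ use of this overload signal that the changelog
refers to looks roughly like the following, paraphrased from
tcp_wfree() in net/ipv4/tcp_output.c (exact code varies by kernel
version):

	/* If this softirq is serviced by ksoftirqd, the CPU is likely
	 * under stress, so defer further transmits until the queues
	 * (qdisc + device) have drained instead of rescheduling
	 * transmission immediately.
	 */
	if (refcount_read(&sk->sk_wmem_alloc) >= SKB_TRUESIZE(1) &&
	    this_cpu_ksoftirqd() == current)
		goto out;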
* Re: [RFC net-next] tcp: allow larger TSO to be built under overload
From: Eric Dumazet @ 2022-03-08 3:50 UTC
To: Jakub Kicinski; +Cc: netdev, Willem de Bruijn, Neal Cardwell, Yuchung Cheng
On Mon, Mar 7, 2022 at 7:03 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> We observed Tx-heavy workloads causing softirq overload: as load,
> and therefore latency, increases, pacing rates fall, pushing TCP to
> generate smaller and smaller TSO packets.
Yes, we saw this behavior but came up with something more generic,
also helping the common case. Cooking larger TSO is really a function
of the radius (distance between peers).
> It seems reasonable to allow larger packets to be built when the
> system is under stress. TCP already uses the
>
> this_cpu_ksoftirqd() == current
>
> condition as an indication of overload for TSQ scheduling.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> Sending as an RFC because it seems reasonable, but I haven't run
> any large-scale testing yet. Bumping tcp_min_tso_segs to prevent
> overloads works, but it seems like we can do better, since we only
> need coarser pacing once disaster strikes?
>
> The downsides are that users may have already increased
> the value to what's needed during overload, or applied
> the same logic in out-of-tree CA algo implementations
> (only BBR implements ca_ops->min_tso_segs() upstream).
>
Unfortunately this would make packetdrill flaky, and thus break our tests.
Also, I would guess the pacing decreases because CWND is small anyway,
or RTT increases?
What CC are you using?
The issue I see here is that bimodal behavior will cause all kinds of
artifacts.
BBR2 has something to give an extra allowance based on min_rtt.
I think we should adopt this for all CCs, because it is not bimodal,
and even allow full-size TSO packets for hosts in the same rack.
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 2319531267c6830b633768dea7f0b40a46633ee1..02ec5866a05ffc2920ead95e9a65cc1ba77459c7 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1956,20 +1956,34 @@ static bool tcp_nagle_check(bool partial, const struct tcp_sock *tp,
 static u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now,
 			     int min_tso_segs)
 {
-	u32 bytes, segs;
+	/* Use min_rtt to help adapt TSO burst size, with smaller min_rtt
+	 * resulting in bigger TSO bursts. By default we cut the RTT-based
+	 * allowance in half for every 2^9 usec (aka 512 us) of RTT, so that
+	 * the RTT-based allowance is below 1500 bytes after
+	 * 6 * ~500 usec = 3 ms.
+	 */
+	const u32 rtt_shift = 9;
+	unsigned long bytes;
+	u32 r;
+
+	bytes = sk->sk_pacing_rate >> READ_ONCE(sk->sk_pacing_shift);
+	/* Budget a TSO/GSO burst size allowance based on min_rtt. For every
+	 * K = 2^rtt_shift microseconds of min_rtt, halve the burst.
+	 * The min_rtt-based burst allowance is: 64 KBytes / 2^(min_rtt/K)
+	 */
+	r = tcp_min_rtt(tcp_sk(sk)) >> rtt_shift;
+	if (r < BITS_PER_TYPE(u32))
+		bytes += GSO_MAX_SIZE >> r;
+
+	bytes = min_t(unsigned long, bytes, sk->sk_gso_max_size);
-	bytes = min_t(unsigned long,
-		      sk->sk_pacing_rate >> READ_ONCE(sk->sk_pacing_shift),
-		      sk->sk_gso_max_size);
 
 	/* Goal is to send at least one packet per ms,
 	 * not one big TSO packet every 100 ms.
 	 * This preserves ACK clocking and is consistent
 	 * with tcp_tso_should_defer() heuristic.
 	 */
-	segs = max_t(u32, bytes / mss_now, min_tso_segs);
-
-	return segs;
+	return max_t(u32, bytes / mss_now, min_tso_segs);
 }
 
 /* Return the number of segments we want in the skb we are transmitting.
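To make the RTT-based allowance concrete, here is a small user-space
model of the formula above (a sketch for illustration only, not kernel
code; GSO_MAX_SIZE = 64 KB and rtt_shift = 9 assumed, function name
made up):

	#include <stdio.h>

	#define GSO_MAX_SIZE 65536U	/* 64 KB, kernel default */

	/* allowance = GSO_MAX_SIZE >> (min_rtt_us >> rtt_shift) */
	static unsigned int rtt_allowance(unsigned int min_rtt_us)
	{
		unsigned int r = min_rtt_us >> 9; /* halve per 512 us */

		return r < 32 ? GSO_MAX_SIZE >> r : 0;
	}

	int main(void)
	{
		const unsigned int rtts[] = { 50, 512, 1024, 3072 };

		for (int i = 0; i < 4; i++)
			printf("min_rtt = %4u us -> allowance = %u bytes\n",
			       rtts[i], rtt_allowance(rtts[i]));
		return 0;
	}

Same-rack peers (tens of microseconds of min_rtt) keep the full 64 KB
allowance; by ~3 ms it has decayed to 1024 bytes, below one MSS,
leaving only the pacing-rate term.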
* Re: [RFC net-next] tcp: allow larger TSO to be built under overload
From: Jakub Kicinski @ 2022-03-08 4:29 UTC
To: Eric Dumazet; +Cc: netdev, Willem de Bruijn, Neal Cardwell, Yuchung Cheng
On Mon, 7 Mar 2022 19:50:10 -0800 Eric Dumazet wrote:
> On Mon, Mar 7, 2022 at 7:03 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > We observed Tx-heavy workloads causing softirq overload: as load,
> > and therefore latency, increases, pacing rates fall, pushing TCP to
> > generate smaller and smaller TSO packets.
>
> Yes, we saw this behavior but came up with something more generic,
> also helping the common case. Cooking larger TSO is really a function
> of the radius (distance between peers).
Excellent, I was hoping you'd have a better fix :)
> > It seems reasonable to allow larger packets to be built when the
> > system is under stress. TCP already uses the
> >
> > this_cpu_ksoftirqd() == current
> >
> > condition as an indication of overload for TSQ scheduling.
> >
> > Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> > ---
> > Sending as an RFC because it seems reasonable, but I haven't run
> > any large-scale testing yet. Bumping tcp_min_tso_segs to prevent
> > overloads works, but it seems like we can do better, since we only
> > need coarser pacing once disaster strikes?
> >
> > The downsides are that users may have already increased
> > the value to what's needed during overload, or applied
> > the same logic in out-of-tree CA algo implementations
> > (only BBR implements ca_ops->min_tso_segs() upstream).
> >
>
> Unfortunately this would make packetdrill flaky, and thus break our tests.
>
> Also, I would guess the pacing decreases because CWND is small anyway,
> or RTT increases?
Both increase - CWND can go up to the 256-512 bucket (in a histogram),
but latency gets insane as the machine tries to pump out 2kB segments,
doing a lot of splitting and barely servicing the ACKs from the Rx ring.
With an Rx ring of a few thousand packets, latency crosses 250 ms,
in-building. I've seen srtt_us > 1M.
> What CC are you using?
A mix of CUBIC and DCTCP for this application, primarily DCTCP.
> The issue I see here is that bimodal behavior will cause all kinds of
> artifacts.
>
> BBR2 has something to give an extra allowance based on min_rtt.
>
> I think we should adopt this for all CCs, because it is not bimodal,
> and even allow full-size TSO packets for hosts in the same rack.
Using min_rtt makes perfect sense in the case I saw.
> [Eric's tcp_tso_autosize() patch quoted in full - snipped]
* RE: [RFC net-next] tcp: allow larger TSO to be built under overload
From: David Laight @ 2022-03-08 9:07 UTC
To: 'Eric Dumazet', Jakub Kicinski
Cc: netdev, Willem de Bruijn, Neal Cardwell, Yuchung Cheng
From: Eric Dumazet
> Sent: 08 March 2022 03:50
...
> /* Goal is to send at least one packet per ms,
> * not one big TSO packet every 100 ms.
> * This preserves ACK clocking and is consistent
> * with tcp_tso_should_defer() heuristic.
> */
> - segs = max_t(u32, bytes / mss_now, min_tso_segs);
> -
> - return segs;
> + return max_t(u32, bytes / mss_now, min_tso_segs);
> }
Which is the common side of that max_t()?
If it is min_tso_segs, it might be worth avoiding the
divide by coding it as:
return bytes > mss_now * min_tso_segs ? bytes / mss_now : min_tso_segs;
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
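As a self-contained sketch, the suggested transformation looks like
this (function name hypothetical; it assumes mss_now * min_tso_segs
cannot overflow u32, which holds for realistic MSS values and segment
counts):

	/* Trade the unconditional divide for a multiply + compare;
	 * the expensive divide only runs when bytes actually exceeds
	 * the min_tso_segs floor.
	 */
	static u32 tso_autosize_segs(u32 bytes, u32 mss_now, u32 min_tso_segs)
	{
		if (bytes > mss_now * min_tso_segs)
			return bytes / mss_now;
		return min_tso_segs;
	}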
* Re: [RFC net-next] tcp: allow larger TSO to be built under overload
From: Eric Dumazet @ 2022-03-08 19:53 UTC
To: David Laight
Cc: Jakub Kicinski, netdev, Willem de Bruijn, Neal Cardwell,
Yuchung Cheng
On Tue, Mar 8, 2022 at 1:08 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Eric Dumazet
> > Sent: 08 March 2022 03:50
> ...
> > /* Goal is to send at least one packet per ms,
> > * not one big TSO packet every 100 ms.
> > * This preserves ACK clocking and is consistent
> > * with tcp_tso_should_defer() heuristic.
> > */
> > - segs = max_t(u32, bytes / mss_now, min_tso_segs);
> > -
> > - return segs;
> > + return max_t(u32, bytes / mss_now, min_tso_segs);
> > }
>
> Which is the common side of that max_t()?
> If it is min_tso_segs, it might be worth avoiding the
> divide by coding it as:
>
> return bytes > mss_now * min_tso_segs ? bytes / mss_now : min_tso_segs;
>
I think the common case is when the divide must happen.
Not sure if this really matters with current CPUs.
Jakub, Neal, I am going to send a patch for net-next.
In conjunction with BIG TCP, this gives a considerable performance boost.
Before:

otrv5:/home/google/edumazet# nstat -n; ./super_netperf 600 -H otrv6 -l 20 \
    -- -K dctcp -q 20000000; nstat | egrep "TcpInSegs|TcpOutSegs|TcpRetransSegs|Delivered"
96005
TcpInSegs                 15649381     0.0
TcpOutSegs                58659574     0.0   # Average of 3.74 4K segments per TSO packet
TcpExtTCPDelivered        58655240     0.0
TcpExtTCPDeliveredCE      21           0.0
After:

otrv5:/home/google/edumazet# nstat -n; ./super_netperf 600 -H otrv6 -l 20 \
    -- -K dctcp -q 20000000; nstat | egrep "TcpInSegs|TcpOutSegs|TcpRetransSegs|Delivered"
96046
TcpInSegs                 1445864      0.0
TcpOutSegs                58885065     0.0   # Average of 40.72 4K segments per TSO packet
TcpExtTCPDelivered        58880873     0.0
TcpExtTCPDeliveredCE      28           0.0
-> 1,445,864 ACK packets instead of 15,649,381
And about 25% of CPU cycles saved, according to perf stat:
Performance counter stats for './super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000':

      66,895.00 msec task-clock                #     2.886 CPUs utilized
      1,312,687      context-switches          # 19623.389 M/sec
          5,645      cpu-migrations            #    84.387 M/sec
        942,412      page-faults               # 14088.139 M/sec
203,672,224,410      cycles                    # 3044700.936 GHz                (83.40%)
 18,933,350,691      stalled-cycles-frontend   #  9.30% frontend cycles idle    (83.46%)
138,500,001,318      stalled-cycles-backend    # 68.00% backend cycles idle     (83.38%)
 53,694,300,814      instructions              #  0.26 insn per cycle
                                               #  2.58 stalled cycles per insn  (83.30%)
  9,100,155,390      branches                  # 136038439.770 M/sec            (83.26%)
    152,331,123      branch-misses             #  1.67% of all branches         (83.47%)

   23.180309488 seconds time elapsed
-->
Performance counter stats for './super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000':

      48,964.30 msec task-clock                #     2.103 CPUs utilized
        184,903      context-switches          #  3776.305 M/sec
          3,057      cpu-migrations            #    62.434 M/sec
        940,615      page-faults               # 19210.338 M/sec
152,390,738,065      cycles                    # 3112301.652 GHz                (83.61%)
 11,603,675,527      stalled-cycles-frontend   #  7.61% frontend cycles idle    (83.49%)
120,240,493,440      stalled-cycles-backend    # 78.90% backend cycles idle     (83.30%)
 37,106,498,492      instructions              #  0.24 insn per cycle
                                               #  3.24 stalled cycles per insn  (83.47%)
  5,968,256,846      branches                  # 121890712.483 M/sec            (83.25%)
     88,743,145      branch-misses             #  1.49% of all branches         (83.24%)

   23.284583305 seconds time elapsed
* RE: [RFC net-next] tcp: allow larger TSO to be built under overload
From: David Laight @ 2022-03-08 22:12 UTC
To: 'Eric Dumazet'
Cc: Jakub Kicinski, netdev, Willem de Bruijn, Neal Cardwell,
Yuchung Cheng
From: Eric Dumazet
> Sent: 08 March 2022 19:54
..
> > Which is the common side of that max_t()?
> > If it is min_tso_segs, it might be worth avoiding the
> > divide by coding it as:
> >
> > return bytes > mss_now * min_tso_segs ? bytes / mss_now : min_tso_segs;
> >
>
> I think the common case is when the divide must happen.
> Not sure if this really matters with current CPUs.
Last document I looked at still quoted considerable latency
for integer divide on x86-64.
If you get a cmov then all the instructions will just get
queued waiting for the divide to complete.
But a branch could easily get mispredicted.
That is likely to hit ppc - which I don't think has a cmov?
OTOH if the divide is in the ?: bit nothing probably depends
on it for a while - so the latency won't matter.
Latest figures I have are for Skylake-X:

	           u-ops                  latency  1/throughput
	DIV  r8    10 10  p0 p1 p5 p6     23       6
	DIV  r16   10 10  p0 p1 p5 p6     23       6
	DIV  r32   10 10  p0 p1 p5 p6     26       6
	DIV  r64   36 36  p0 p1 p5 p6     35-88    21-83
	IDIV r8    11 11  p0 p1 p5 p6     24       6
	IDIV r16   10 10  p0 p1 p5 p6     23       6
	IDIV r32   10 10  p0 p1 p5 p6     26       6
	IDIV r64   57 57  p0 p1 p5 p6     42-95    24-90
Broadwell is a bit slower.
Note that 64bit divide is really horrid.
I think that one will be 32bit - so 'only' 26 clocks
latency.
AMD Ryzen is a lot better for 64bit divides:

	              u-ops   ltncy   1/thpt
	DIV  r8/m8    1       13-16   13-16
	DIV  r16/m16  2       14-21   14-21
	DIV  r32/m32  2       14-30   14-30
	DIV  r64/m64  2       14-46   14-45
	IDIV r8/m8    1       13-16   13-16
	IDIV r16/m16  2       13-21   14-22
	IDIV r32/m32  2       14-30   14-30
	IDIV r64/m64  2       14-47   14-45
But less pipelining for 32bit ones.
Quite how those tables actually affect real code
is another matter - but they are guidelines about
what is possible (if you can get the u-ops executed
on the right ports).
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
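As a rough user-space illustration of the dependent-divide latencies in
the tables above, here is a micro-benchmark sketch (not from the thread;
the timing methodology is simplistic and results vary by
microarchitecture):

	#include <stdint.h>
	#include <stdio.h>
	#include <time.h>

	static uint64_t now_ns(void)
	{
		struct timespec ts;

		clock_gettime(CLOCK_MONOTONIC, &ts);
		return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
	}

	int main(void)
	{
		volatile uint32_t d32 = 7;	/* volatile: keep real div instructions */
		volatile uint64_t d64 = 7;
		uint32_t x32 = ~0U;
		uint64_t x64 = ~0ULL;
		enum { N = 100000000 };
		uint64_t t;

		t = now_ns();
		for (int i = 0; i < N; i++)	/* dependent 32-bit divides */
			x32 = x32 / d32 + 1000000000u;
		printf("32-bit: %.2f ns/div (x=%u)\n",
		       (now_ns() - t) / (double)N, x32);

		t = now_ns();
		for (int i = 0; i < N; i++)	/* dependent 64-bit divides */
			x64 = x64 / d64 + (1ULL << 62);
		printf("64-bit: %.2f ns/div (x=%llu)\n",
		       (now_ns() - t) / (double)N, (unsigned long long)x64);
		return 0;
	}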
* Re: [RFC net-next] tcp: allow larger TSO to be built under overload
From: Eric Dumazet @ 2022-03-08 22:26 UTC
To: David Laight
Cc: Jakub Kicinski, netdev, Willem de Bruijn, Neal Cardwell,
Yuchung Cheng
On Tue, Mar 8, 2022 at 2:12 PM David Laight <David.Laight@aculab.com> wrote:
> [David's divide-latency analysis and instruction tables - snipped]
>
Thanks, I think I will make sure that we use the 32bit divide then,
because the compiler might not be smart enough to detect that both
operands are < ~0U.
* Re: [RFC net-next] tcp: allow larger TSO to be built under overload
From: Eric Dumazet @ 2022-03-08 22:42 UTC
To: David Laight
Cc: Jakub Kicinski, netdev, Willem de Bruijn, Neal Cardwell,
Yuchung Cheng
On Tue, Mar 8, 2022 at 2:26 PM Eric Dumazet <edumazet@google.com> wrote:
>
>
> Thanks, I think I will make sure that we use the 32bit divide then,
> because the compiler might not be smart enough to detect that both
> operands are < ~0U.
BTW, it seems the compiler (clang for me) is smart enough.

	bytes = min_t(unsigned long, bytes, sk->sk_gso_max_size);
	return max_t(u32, bytes / mss_now, min_tso_segs);

The compiler uses the 32-bit divide operation (div %ecx).

If you remove the min_t() clamping and only keep:

	return max_t(u32, bytes / mss_now, min_tso_segs);

then clang makes a special case if bytes >= (1UL << 32):
    790d:  48 89 c2     mov    %rax,%rdx
    7910:  48 c1 ea 20  shr    $0x20,%rdx
    7914:  74 07        je     791d <tcp_tso_autosize+0x4d>
    7916:  31 d2        xor    %edx,%edx
    7918:  48 f7 f1     div    %rcx         # more expensive 64-bit divide
    791b:  eb 04        jmp    7921 <tcp_tso_autosize+0x51>
    791d:  31 d2        xor    %edx,%edx
    791f:  f7 f1        div    %ecx
    7921:  44 39 c0     cmp    %r8d,%eax
    7924:  44 0f 47 c0  cmova  %eax,%r8d
    7928:  44 89 c0     mov    %r8d,%eax
    792b:  5d           pop    %rbp
    792c:  c3           ret
* Re: [RFC net-next] tcp: allow larger TSO to be built under overload
From: Jakub Kicinski @ 2022-03-09 0:18 UTC
To: Eric Dumazet
Cc: David Laight, netdev, Willem de Bruijn, Neal Cardwell,
Yuchung Cheng
On Tue, 8 Mar 2022 11:53:38 -0800 Eric Dumazet wrote:
> Jakub, Neal, I am going to send a patch for net-next.
>
> In conjunction with BIG TCP, this gives a considerable performance boost.
SGTM! The change may cause increased burstiness, or is that unlikely?
I'm asking to gauge risk / figure out appropriate roll out plan.
* Re: [RFC net-next] tcp: allow larger TSO to be built under overload
From: Eric Dumazet @ 2022-03-09 1:09 UTC
To: Jakub Kicinski
Cc: David Laight, netdev, Willem de Bruijn, Neal Cardwell,
Yuchung Cheng
On Tue, Mar 8, 2022 at 4:18 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 8 Mar 2022 11:53:38 -0800 Eric Dumazet wrote:
> > Jakub, Neal, I am going to send a patch for net-next.
> >
> > In conjunction with BIG TCP, this gives a considerable performance boost.
>
> SGTM! The change may cause increased burstiness, or is that unlikely?
> I'm asking to gauge risk / figure out appropriate roll out plan.
I guess we can make the shift factor a sysctl, defaulting to 9.
If you want to disable the feature, set the sysctl to a small value
like 0 or 1.
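A minimal sketch of such a knob (a hypothetical per-netns sysctl; the
name and the eventual upstream interface may differ):

	/* In the ipv4 sysctl table: a value of 0 or 1 effectively
	 * disables the RTT-based allowance, since it then decays
	 * almost immediately.
	 */
	{
		.procname	= "tcp_tso_rtt_log",
		.data		= &init_net.ipv4.sysctl_tcp_tso_rtt_log,
		.maxlen		= sizeof(u8),
		.mode		= 0644,
		.proc_handler	= proc_dou8vec_minmax,
	},

	/* ... and in tcp_tso_autosize(): */
	r = tcp_min_rtt(tcp_sk(sk)) >>
	    READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_tso_rtt_log);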
* Re: [RFC net-next] tcp: allow larger TSO to be built under overload
From: Neal Cardwell @ 2022-03-09 16:42 UTC
To: Jakub Kicinski
Cc: Eric Dumazet, David Laight, netdev, Willem de Bruijn,
Yuchung Cheng
On Tue, Mar 8, 2022 at 7:18 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 8 Mar 2022 11:53:38 -0800 Eric Dumazet wrote:
> > Jakub, Neal, I am going to send a patch for net-next.
> >
> > In conjunction with BIG TCP, this gives a considerable performance boost.
>
> SGTM! The change may cause increased burstiness, or is that unlikely?
> I'm asking to gauge risk / figure out appropriate roll out plan.
In theory it could cause increased burstiness in some scenarios, but
in practice we have used this min_rtt-based TSO autosizing component
in production for about two years; we see improvements in load tests
and no problems in production.
neal
* Re: [RFC net-next] tcp: allow larger TSO to be built under overload
From: Jakub Kicinski @ 2022-03-09 16:54 UTC
To: Neal Cardwell
Cc: Eric Dumazet, David Laight, netdev, Willem de Bruijn,
Yuchung Cheng
On Wed, 9 Mar 2022 11:42:24 -0500 Neal Cardwell wrote:
> > SGTM! The change may cause increased burstiness, or is that unlikely?
> > I'm asking to gauge risk / figure out appropriate roll out plan.
>
> In theory it could cause increased burstiness in some scenarios, but
> in practice we have used this min_rtt-based TSO autosizing component
> in production for about two years; we see improvements in load tests
> and no problems in production.
Perfect, sounds safe to put it in the next point release here, then.
Thank you!
Thread overview: 12 messages
2022-03-08 3:03 [RFC net-next] tcp: allow larger TSO to be built under overload Jakub Kicinski
2022-03-08 3:50 ` Eric Dumazet
2022-03-08 4:29 ` Jakub Kicinski
2022-03-08 9:07 ` David Laight
2022-03-08 19:53 ` Eric Dumazet
2022-03-08 22:12 ` David Laight
2022-03-08 22:26 ` Eric Dumazet
2022-03-08 22:42 ` Eric Dumazet
2022-03-09 0:18 ` Jakub Kicinski
2022-03-09 1:09 ` Eric Dumazet
2022-03-09 16:42 ` Neal Cardwell
2022-03-09 16:54 ` Jakub Kicinski