* [RFC net-next] tcp: allow larger TSO to be built under overload
From: Jakub Kicinski @ 2022-03-08  3:03 UTC
  To: edumazet; +Cc: netdev, willemb, ncardwell, ycheng, Jakub Kicinski

We observed Tx-heavy workloads causing softirq overload because,
with increased load and therefore latency, the pacing rates fall,
pushing TCP to generate smaller and smaller TSO packets.

It seems reasonable to allow larger packets to be built when
the system is under stress. TCP already uses the

  this_cpu_ksoftirqd() == current

condition as an indication of overload for TSQ scheduling.
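
For reference, the existing TSQ use of that condition (a paraphrased
sketch of the check in net/ipv4/tcp_output.c's tcp_wfree(), not part
of this patch):

	/* If this softirq is serviced by ksoftirqd, we are likely
	 * under stress. Wait until our queues (qdisc + devices) are
	 * drained, batching more work per tcp_write_xmit() call.
	 */
	if (refcount_read(&sk->sk_wmem_alloc) >= SKB_TRUESIZE(1) &&
	    this_cpu_ksoftirqd() == current)
		goto out;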

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
Sending as an RFC because it seems reasonable, but really
I haven't run any large scale testing, yet. Bumping
tcp_min_tso_segs to prevent overloads is okay but it
seems like we can do better since we only need coarser
pacing once disaster strikes?

The downsides are that users may have already increased
the value to what's needed during overload, or applied
the same logic in out-of-tree CA algo implementations
(only BBR implements ca_ops->min_tso_segs() upstream).
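
For context, the hook is consulted by the caller of tcp_tso_autosize();
roughly (a sketch of tcp_tso_segs() quoted from memory, so details may
differ from what is upstream):

	static u32 tcp_tso_segs(struct sock *sk, unsigned int mss_now)
	{
		const struct tcp_congestion_ops *ca_ops = inet_csk(sk)->icsk_ca_ops;
		u32 min_tso, tso_segs;

		/* CA algo may want a floor above the sysctl default */
		min_tso = ca_ops->min_tso_segs ?
				ca_ops->min_tso_segs(sk) :
				sock_net(sk)->ipv4.sysctl_tcp_min_tso_segs;

		tso_segs = tcp_tso_autosize(sk, mss_now, min_tso);
		return min_t(u32, tso_segs, sk->sk_gso_max_segs);
	}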
---
 net/ipv4/tcp_output.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 2319531267c6..815ef4ffc39d 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1967,7 +1967,13 @@ static u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now,
 	 * This preserves ACK clocking and is consistent
 	 * with tcp_tso_should_defer() heuristic.
 	 */
-	segs = max_t(u32, bytes / mss_now, min_tso_segs);
+	segs = bytes / mss_now;
+	if (segs < min_tso_segs) {
+		segs = min_tso_segs;
+		/* Allow larger packets under stress */
+		if (this_cpu_ksoftirqd() == current)
+			segs *= 2;
+	}
 
 	return segs;
 }
-- 
2.34.1



* Re: [RFC net-next] tcp: allow larger TSO to be built under overload
From: Eric Dumazet @ 2022-03-08  3:50 UTC
  To: Jakub Kicinski; +Cc: netdev, Willem de Bruijn, Neal Cardwell, Yuchung Cheng

On Mon, Mar 7, 2022 at 7:03 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> We observed Tx-heavy workloads causing softirq overload because,
> with increased load and therefore latency, the pacing rates fall,
> pushing TCP to generate smaller and smaller TSO packets.

Yes, we saw this behavior but came up with something more generic,
also helping the common case. Cooking larger TSO is really a function
of the radius (distance between peers).

> It seems reasonable to allow larger packets to be built when
> the system is under stress. TCP already uses the
>
>   this_cpu_ksoftirqd() == current
>
> condition as an indication of overload for TSQ scheduling.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> Sending as an RFC because it seems reasonable, but really
> I haven't run any large scale testing, yet. Bumping
> tcp_min_tso_segs to prevent overloads is okay but it
> seems like we can do better since we only need coarser
> pacing once disaster strikes?
>
> The downsides are that users may have already increased
> the value to what's needed during overload, or applied
> the same logic in out-of-tree CA algo implementations
> (only BBR implements ca_ops->min_tso_segs() upstream).
>

Unfortunately this would make packetdrill flaky, thus breaking our tests.

Also, I would guess the pacing decreases because CWND is small anyway,
or RTT increases?
What CC are you using?
The issue I see here is that bimodal behavior will cause all kinds of
artifacts.

BBR2 has something to give an extra allowance based on min_rtt.

I think we should adopt this for all CC, because it is not bimodal,
and even allow full-size TSO packets for hosts in the same rack.
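
To make the scaling concrete, a few illustrative (not measured) values,
assuming GSO_MAX_SIZE = 64KB and the default shift of 9:

	min_rtt =   50 us -> r =   50 >> 9 = 0 -> allowance = 65536 bytes
	min_rtt = 1100 us -> r = 1100 >> 9 = 2 -> allowance = 16384 bytes
	min_rtt = 3100 us -> r = 3100 >> 9 = 6 -> allowance =  1024 bytes (< 1 MSS)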

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 2319531267c6830b633768dea7f0b40a46633ee1..02ec5866a05ffc2920ead95e9a65cc1ba77459c7 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1956,20 +1956,34 @@ static bool tcp_nagle_check(bool partial, const struct tcp_sock *tp,
 static u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now,
                            int min_tso_segs)
 {
-       u32 bytes, segs;
+/* Use min_rtt to help adapt TSO burst size, with smaller min_rtt resulting
+ * in bigger TSO bursts. By default we cut the RTT-based allowance in half
+ * for every 2^9 usec (aka 512 us) of RTT, so that the RTT-based allowance
+ * is below 1500 bytes after 6 * ~500 usec = 3ms.
+ * Default: halve allowance per 2^9 usecs, 512us.
+ */
+       const u32 rtt_shift = 9;
+       unsigned long bytes;
+       u32 r;
+
+       bytes = sk->sk_pacing_rate >> READ_ONCE(sk->sk_pacing_shift);
+       /* Budget a TSO/GSO burst size allowance based on min_rtt. For every
+        * K = 2^tso_rtt_shift microseconds of min_rtt, halve the burst.
+        * The min_rtt-based burst allowance is: 64 KBytes / 2^(min_rtt/K)
+        */
+       r = tcp_min_rtt(tcp_sk(sk)) >> rtt_shift;
+       if (r < BITS_PER_TYPE(u32))
+               bytes += GSO_MAX_SIZE >> r;
+
+       bytes = min_t(unsigned long, bytes, sk->sk_gso_max_size);

-       bytes = min_t(unsigned long,
-                     sk->sk_pacing_rate >> READ_ONCE(sk->sk_pacing_shift),
-                     sk->sk_gso_max_size);

        /* Goal is to send at least one packet per ms,
         * not one big TSO packet every 100 ms.
         * This preserves ACK clocking and is consistent
         * with tcp_tso_should_defer() heuristic.
         */
-       segs = max_t(u32, bytes / mss_now, min_tso_segs);
-
-       return segs;
+       return max_t(u32, bytes / mss_now, min_tso_segs);
 }

 /* Return the number of segments we want in the skb we are transmitting.


* Re: [RFC net-next] tcp: allow larger TSO to be built under overload
From: Jakub Kicinski @ 2022-03-08  4:29 UTC
  To: Eric Dumazet; +Cc: netdev, Willem de Bruijn, Neal Cardwell, Yuchung Cheng

On Mon, 7 Mar 2022 19:50:10 -0800 Eric Dumazet wrote:
> On Mon, Mar 7, 2022 at 7:03 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > We observed Tx-heavy workloads causing softirq overload because,
> > with increased load and therefore latency, the pacing rates fall,
> > pushing TCP to generate smaller and smaller TSO packets.  
> 
> Yes, we saw this behavior but came up with something more generic,
> also helping the common case. Cooking larger TSO is really a function
> of the radius (distance between peers).

Excellent, I was hoping you have a better fix :)

> > It seems reasonable to allow larger packets to be built when
> > the system is under stress. TCP already uses the
> >
> >   this_cpu_ksoftirqd() == current
> >
> > condition as an indication of overload for TSQ scheduling.
> >
> > Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> > ---
> > Sending as an RFC because it seems reasonable, but really
> > I haven't run any large scale testing, yet. Bumping
> > tcp_min_tso_segs to prevent overloads is okay but it
> > seems like we can do better since we only need coarser
> > pacing once disaster strikes?
> >
> > The downsides are that users may have already increased
> > the value to what's needed during overload, or applied
> > the same logic in out-of-tree CA algo implementations
> > (only BBR implements ca_ops->min_tso_segs() upstream).
> >  
> 
> Unfortunately this would make packetdrill flaky, thus breaking our tests.
> 
> Also, I would guess the pacing decreases because CWND is small anyway,
> or RTT increases?

Both increase - CWND can go up to the 256-512 bucket (in a histogram)
but latency gets insane as the machine tries to pump out 2kB segments,
doing a lot of splitting and barely servicing the ACKs from the Rx ring.

With an Rx ring of a few thousand packets, latency crosses 250ms,
in-building. I've seen srtt_us > 1M.

> What CC are you using?

A mix of CUBIC and DCTCP for this application, primarily DCTCP.

> The issue I see here is that bimodal behavior will cause all kinds of
> artifacts.
> 
> BBR2 has something to give an extra allowance based on min_rtt.
> 
> I think we should adopt this for all CC, because it is not bimodal,
> and even allow full-size TSO packets for hosts in the same rack.

Using min_rtt makes perfect sense in the case I saw.

> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 2319531267c6830b633768dea7f0b40a46633ee1..02ec5866a05ffc2920ead95e9a65cc1ba77459c7 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1956,20 +1956,34 @@ static bool tcp_nagle_check(bool partial, const struct tcp_sock *tp,
>  static u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now,
>                             int min_tso_segs)
>  {
> -       u32 bytes, segs;
> +/* Use min_rtt to help adapt TSO burst size, with smaller min_rtt resulting
> + * in bigger TSO bursts. By default we cut the RTT-based allowance in half
> + * for every 2^9 usec (aka 512 us) of RTT, so that the RTT-based allowance
> + * is below 1500 bytes after 6 * ~500 usec = 3ms.
> + * Default: halve allowance per 2^9 usecs, 512us.
> + */
> +       const u32 rtt_shift = 9;
> +       unsigned long bytes;
> +       u32 r;
> +
> +       bytes = sk->sk_pacing_rate >> READ_ONCE(sk->sk_pacing_shift);
> +       /* Budget a TSO/GSO burst size allowance based on min_rtt. For every
> +        * K = 2^tso_rtt_shift microseconds of min_rtt, halve the burst.
> +        * The min_rtt-based burst allowance is: 64 KBytes / 2^(min_rtt/K)
> +        */
> +       r = tcp_min_rtt(tcp_sk(sk)) >> rtt_shift;
> +       if (r < BITS_PER_TYPE(u32))
> +               bytes += GSO_MAX_SIZE >> r;
> +
> +       bytes = min_t(unsigned long, bytes, sk->sk_gso_max_size);
> 
> -       bytes = min_t(unsigned long,
> -                     sk->sk_pacing_rate >> READ_ONCE(sk->sk_pacing_shift),
> -                     sk->sk_gso_max_size);
> 
>         /* Goal is to send at least one packet per ms,
>          * not one big TSO packet every 100 ms.
>          * This preserves ACK clocking and is consistent
>          * with tcp_tso_should_defer() heuristic.
>          */
> -       segs = max_t(u32, bytes / mss_now, min_tso_segs);
> -
> -       return segs;
> +       return max_t(u32, bytes / mss_now, min_tso_segs);
>  }
> 
>  /* Return the number of segments we want in the skb we are transmitting.



* RE: [RFC net-next] tcp: allow larger TSO to be built under overload
From: David Laight @ 2022-03-08  9:07 UTC
  To: 'Eric Dumazet', Jakub Kicinski
  Cc: netdev, Willem de Bruijn, Neal Cardwell, Yuchung Cheng

From: Eric Dumazet
> Sent: 08 March 2022 03:50
...
>         /* Goal is to send at least one packet per ms,
>          * not one big TSO packet every 100 ms.
>          * This preserves ACK clocking and is consistent
>          * with tcp_tso_should_defer() heuristic.
>          */
> -       segs = max_t(u32, bytes / mss_now, min_tso_segs);
> -
> -       return segs;
> +       return max_t(u32, bytes / mss_now, min_tso_segs);
>  }

Which is the common side of that max_t()?
If it is min_tso_segs it might be worth avoiding the
divide by coding it as:

	return bytes > mss_now * min_tso_segs ? bytes / mss_now : min_tso_segs;

	David



* Re: [RFC net-next] tcp: allow larger TSO to be built under overload
From: Eric Dumazet @ 2022-03-08 19:53 UTC
  To: David Laight
  Cc: Jakub Kicinski, netdev, Willem de Bruijn, Neal Cardwell,
	Yuchung Cheng

On Tue, Mar 8, 2022 at 1:08 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Eric Dumazet
> > Sent: 08 March 2022 03:50
> ...
> >         /* Goal is to send at least one packet per ms,
> >          * not one big TSO packet every 100 ms.
> >          * This preserves ACK clocking and is consistent
> >          * with tcp_tso_should_defer() heuristic.
> >          */
> > -       segs = max_t(u32, bytes / mss_now, min_tso_segs);
> > -
> > -       return segs;
> > +       return max_t(u32, bytes / mss_now, min_tso_segs);
> >  }
>
> Which is the common side of that max_t()?
> If it is min_tso_segs it might be worth avoiding the
> divide by coding it as:
>
>         return bytes > mss_now * min_tso_segs ? bytes / mss_now : min_tso_segs;
>

I think the common case is when the divide must happen.
Not sure if this really matters with current cpus.

Jakub, Neal, I am going to send a patch for net-next.

In conjunction with BIG TCP, this gives a considerable boost of performance.


Before:
otrv5:/home/google/edumazet# nstat -n;./super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000;nstat|egrep "TcpInSegs|TcpOutSegs|TcpRetransSegs|Delivered"
  96005
TcpInSegs                       15649381           0.0
TcpOutSegs                      58659574           0.0  # Average of 3.74 4K segments per TSO packet
TcpExtTCPDelivered              58655240           0.0
TcpExtTCPDeliveredCE            21                 0.0

After:
otrv5:/home/google/edumazet# nstat -n;./super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000;nstat|egrep "TcpInSegs|TcpOutSegs|TcpRetransSegs|Delivered"
  96046
TcpInSegs                       1445864            0.0
TcpOutSegs                      58885065           0.0   # Average of 40.72 4K segments per TSO packet
TcpExtTCPDelivered              58880873           0.0
TcpExtTCPDeliveredCE            28                 0.0

-> 1,445,864 ACK packets instead of 15,649,381
And about 25% of CPU cycles saved, according to perf stat

 Performance counter stats for './super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000':

         66,895.00 msec task-clock                #    2.886 CPUs utilized
         1,312,687      context-switches          # 19623.389 M/sec
             5,645      cpu-migrations            #   84.387 M/sec
           942,412      page-faults               # 14088.139 M/sec
   203,672,224,410      cycles                    # 3044700.936 GHz                 (83.40%)
    18,933,350,691      stalled-cycles-frontend   #    9.30% frontend cycles idle   (83.46%)
   138,500,001,318      stalled-cycles-backend    #   68.00% backend cycles idle    (83.38%)
    53,694,300,814      instructions              #    0.26  insn per cycle
                                                  #    2.58  stalled cycles per insn (83.30%)
     9,100,155,390      branches                  # 136038439.770 M/sec             (83.26%)
       152,331,123      branch-misses             #    1.67% of all branches        (83.47%)

      23.180309488 seconds time elapsed

-->

 Performance counter stats for './super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000':

         48,964.30 msec task-clock                #    2.103 CPUs utilized
           184,903      context-switches          # 3776.305 M/sec
             3,057      cpu-migrations            #   62.434 M/sec
           940,615      page-faults               # 19210.338 M/sec
   152,390,738,065      cycles                    # 3112301.652 GHz                 (83.61%)
    11,603,675,527      stalled-cycles-frontend   #    7.61% frontend cycles idle   (83.49%)
   120,240,493,440      stalled-cycles-backend    #   78.90% backend cycles idle    (83.30%)
    37,106,498,492      instructions              #    0.24  insn per cycle
                                                  #    3.24  stalled cycles per insn (83.47%)
     5,968,256,846      branches                  # 121890712.483 M/sec             (83.25%)
        88,743,145      branch-misses             #    1.49% of all branches        (83.24%)

      23.284583305 seconds time elapsed


* RE: [RFC net-next] tcp: allow larger TSO to be built under overload
From: David Laight @ 2022-03-08 22:12 UTC
  To: 'Eric Dumazet'
  Cc: Jakub Kicinski, netdev, Willem de Bruijn, Neal Cardwell,
	Yuchung Cheng

From: Eric Dumazet
> Sent: 08 March 2022 19:54
..
> > Which is the common side of that max_t()?
> > If it is min_tso_segs it might be worth avoiding the
> > divide by coding it as:
> >
> >         return bytes > mss_now * min_tso_segs ? bytes / mss_now : min_tso_segs;
> >
> 
> I think the common case is when the divide must happen.
> Not sure if this really matters with current cpus.

The last document I looked at still quoted considerable latency
for integer divide on x86-64.
If you get a cmov then all the instructions will just get
queued waiting for the divide to complete.
But a branch could easily get mispredicted.
That is likely to hit ppc - which I don't think has a cmov?

OTOH if the divide is in the ?: bit nothing probably depends
on it for a while - so the latency won't matter.

Latest figures I have are for skylakeX
         u-ops  ports         latency  1/throughput
DIV   r8 10 10 p0 p1 p5 p6  23        6
DIV  r16 10 10 p0 p1 p5 p6  23        6
DIV  r32 10 10 p0 p1 p5 p6  26        6
DIV  r64 36 36 p0 p1 p5 p6 35-88    21-83
IDIV  r8 11 11 p0 p1 p5 p6  24        6
IDIV r16 10 10 p0 p1 p5 p6  23        6
IDIV r32 10 10 p0 p1 p5 p6  26        6
IDIV r64 57 57 p0 p1 p5 p6 42-95    24-90

Broadwell is a bit slower.
Note that 64bit divide is really horrid.

I think that one will be 32bit - so 'only' 26 clocks
latency.

AMD Ryzen is a lot better for 64bit divides:
               ltncy  1/thpt
DIV   r8/m8  1 13-16 13-16
DIV  r16/m16 2 14-21 14-21
DIV  r32/m32 2 14-30 14-30
DIV  r64/m64 2 14-46 14-45
IDIV  r8/m8  1 13-16 13-16
IDIV r16/m16 2 13-21 14-22
IDIV r32/m32 2 14-30 14-30
IDIV r64/m64 2 14-47 14-45
But less pipelining for 32bit ones.

Quite how those tables actually affect real code
is another matter - but they are guidelines about
what is possible (if you can get the u-ops executed
on the right ports).

	David



* Re: [RFC net-next] tcp: allow larger TSO to be built under overload
From: Eric Dumazet @ 2022-03-08 22:26 UTC
  To: David Laight
  Cc: Jakub Kicinski, netdev, Willem de Bruijn, Neal Cardwell,
	Yuchung Cheng

On Tue, Mar 8, 2022 at 2:12 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Eric Dumazet
> > Sent: 08 March 2022 19:54
> ..
> > > Which is the common side of that max_t()?
> > > If it is min_tso_segs it might be worth avoiding the
> > > divide by coding it as:
> > >
> > >         return bytes > mss_now * min_tso_segs ? bytes / mss_now : min_tso_segs;
> > >
> >
> > I think the common case is when the divide must happen.
> > Not sure if this really matters with current cpus.
>
> Last document I looked at still quoted considerable latency
> for integer divide on x86-64.
> If you get a cmov then all the instructions will just get
> queued waiting for the divide to complete.
> But a branch could easily get mispredicted.
> That is likely to hit ppc - which I don't think has a cmov?
>
> OTOH if the divide is in the ?: bit nothing probably depends
> on it for a while - so the latency won't matter.
>
> Latest figures I have are for skylakeX
>          u-ops            latency 1/throughput
> DIV   r8 10 10 p0 p1 p5 p6  23        6
> DIV  r16 10 10 p0 p1 p5 p6  23        6
> DIV  r32 10 10 p0 p1 p5 p6  26        6
> DIV  r64 36 36 p0 p1 p5 p6 35-88    21-83
> IDIV  r8 11 11 p0 p1 p5 p6  24        6
> IDIV r16 10 10 p0 p1 p5 p6  23        6
> IDIV r32 10 10 p0 p1 p5 p6  26        6
> IDIV r64 57 57 p0 p1 p5 p6 42-95    24-90
>
> Broadwell is a bit slower.
> Note that 64bit divide is really horrid.
>
> I think that one will be 32bit - so 'only' 26 clocks
> latency.
>
> AMD Ryzen is a lot better for 64bit divides:
>                ltncy  1/thpt
> DIV   r8/m8  1 13-16 13-16
> DIV  r16/m16 2 14-21 14-21
> DIV  r32/m32 2 14-30 14-30
> DIV  r64/m64 2 14-46 14-45
> IDIV  r8/m8  1 13-16 13-16
> IDIV r16/m16 2 13-21 14-22
> IDIV r32/m32 2 14-30 14-30
> IDIV r64/m64 2 14-47 14-45
> But less pipelining for 32bit ones.
>
> Quite how those tables actually affect real code
> is another matter - but they are guidelines about
> what is possible (if you can get the u-ops executed
> on the right ports).
>

Thanks, I think I will make sure that we use the 32bit divide then,
because the compiler might not be smart enough to detect that both
operands are < ~0U.


* Re: [RFC net-next] tcp: allow larger TSO to be built under overload
From: Eric Dumazet @ 2022-03-08 22:42 UTC
  To: David Laight
  Cc: Jakub Kicinski, netdev, Willem de Bruijn, Neal Cardwell,
	Yuchung Cheng

On Tue, Mar 8, 2022 at 2:26 PM Eric Dumazet <edumazet@google.com> wrote:
>

>
> Thanks, I think I will make sure that we use the 32bit divide then,
> because the compiler might not be smart enough to detect that both
> operands are < ~0U.

BTW, it seems the compiler (clang for me) is smart enough.

bytes = min_t(unsigned long, bytes, sk->sk_gso_max_size);
return max_t(u32, bytes / mss_now, min_tso_segs);

The compiler is using the 32bit divide operation (div %ecx).

If you remove the min_t() clamping, and only keep:

return max_t(u32, bytes / mss_now, min_tso_segs);

Then clang makes a special case if bytes >= (1UL<<32):

    790d: 48 89 c2              mov    %rax,%rdx
    7910: 48 c1 ea 20          shr    $0x20,%rdx
    7914: 74 07                je     791d <tcp_tso_autosize+0x4d>
    7916: 31 d2                xor    %edx,%edx
    7918: 48 f7 f1              div    %rcx   # More expensive divide
    791b: eb 04                jmp    7921 <tcp_tso_autosize+0x51>
    791d: 31 d2                xor    %edx,%edx
    791f: f7 f1                div    %ecx

    7921: 44 39 c0              cmp    %r8d,%eax
    7924: 44 0f 47 c0          cmova  %eax,%r8d
    7928: 44 89 c0              mov    %r8d,%eax
    792b: 5d                    pop    %rbp
    792c: c3                    ret


* Re: [RFC net-next] tcp: allow larger TSO to be built under overload
From: Jakub Kicinski @ 2022-03-09  0:18 UTC
  To: Eric Dumazet
  Cc: David Laight, netdev, Willem de Bruijn, Neal Cardwell,
	Yuchung Cheng

On Tue, 8 Mar 2022 11:53:38 -0800 Eric Dumazet wrote:
> Jakub, Neal, I am going to send a patch for net-next.
> 
> In conjunction with BIG TCP, this gives a considerable boost of performance.

SGTM! The change may cause increased burstiness, or is that unlikely?  
I'm asking to gauge risk / figure out appropriate roll out plan.


* Re: [RFC net-next] tcp: allow larger TSO to be built under overload
From: Eric Dumazet @ 2022-03-09  1:09 UTC
  To: Jakub Kicinski
  Cc: David Laight, netdev, Willem de Bruijn, Neal Cardwell,
	Yuchung Cheng

On Tue, Mar 8, 2022 at 4:18 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 8 Mar 2022 11:53:38 -0800 Eric Dumazet wrote:
> > Jakub, Neal, I am going to send a patch for net-next.
> >
> > In conjunction with BIG TCP, this gives a considerable boost of performance.
>
> SGTM! The change may cause increased burstiness, or is that unlikely?
> I'm asking to gauge risk / figure out appropriate roll out plan.

I guess we can make the shift factor a sysctl, default to 9.

If you want to disable the feature, set the sysctl to a small value like 0 or 1.
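
Something along these lines, relative to the earlier patch (the sysctl
field name is illustrative, not final):

	r = tcp_min_rtt(tcp_sk(sk)) >>
	    READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_tso_rtt_log);
	if (r < BITS_PER_TYPE(u32))
		bytes += GSO_MAX_SIZE >> r;

With the shift set to 0, r is simply min_rtt in usec, so any peer more
than ~32 usec away gets no extra allowance.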


* Re: [RFC net-next] tcp: allow larger TSO to be built under overload
From: Neal Cardwell @ 2022-03-09 16:42 UTC
  To: Jakub Kicinski
  Cc: Eric Dumazet, David Laight, netdev, Willem de Bruijn,
	Yuchung Cheng

On Tue, Mar 8, 2022 at 7:18 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 8 Mar 2022 11:53:38 -0800 Eric Dumazet wrote:
> > Jakub, Neal, I am going to send a patch for net-next.
> >
> > In conjunction with BIG TCP, this gives a considerable boost of performance.
>
> SGTM! The change may cause increased burstiness, or is that unlikely?
> I'm asking to gauge risk / figure out appropriate roll out plan.

In theory it could cause increased burstiness in some scenarios, but
in practice we have used this min_rtt-based TSO autosizing component
in production for about two years, where we see improvements in load
tests and no problems in production.

neal


* Re: [RFC net-next] tcp: allow larger TSO to be built under overload
From: Jakub Kicinski @ 2022-03-09 16:54 UTC
  To: Neal Cardwell
  Cc: Eric Dumazet, David Laight, netdev, Willem de Bruijn,
	Yuchung Cheng

On Wed, 9 Mar 2022 11:42:24 -0500 Neal Cardwell wrote:
> > SGTM! The change may cause increased burstiness, or is that unlikely?
> > I'm asking to gauge risk / figure out appropriate roll out plan.  
> 
> In theory it could cause increased burstiness in some scenarios, but
> in practice we have used this min_rtt-based TSO autosizing component
> in production for about two years, where we see improvements in load
> tests and no problems in production.

Perfect, sounds safe to put it in the next point release here, then.
Thank you!

