* splice() performance for TCP socket forwarding
@ 2018-12-13 11:25 Marek Majkowski
From: Marek Majkowski @ 2018-12-13 11:25 UTC
To: netdev
Hi!
I'm basically trying to do TCP splicing in Linux. I'm focusing on
performance of the simplest case: receive data from one TCP socket,
write data to another TCP socket. I get poor performance with splice.
First, the naive code, pretty much:
    while (1) {
            n = read(rs, buf, sizeof(buf));
            write(ws, buf, n);
    }
With GRO enabled, this code does roughly 10 Gbps line rate, hovering
around 50% CPU in the application (mostly sys time).
When replaced with splice version:
    pipe(pfd);
    fcntl(pfd[0], F_SETPIPE_SZ, 1024 * 1024);
    while (1) {
            n = splice(rd, NULL, pfd[1], NULL, 1024 * 1024,
                       SPLICE_F_MOVE);
            splice(pfd[0], NULL, wd, NULL, n, SPLICE_F_MOVE);
    }
Full code:
https://gist.github.com/majek/c58a97b9be7d9217fe3ebd6c1328faaa#file-proxy-splice-c-L59
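
One practical note on this pattern: the second splice() can move fewer
bytes than the first one placed into the pipe (for instance when the
destination socket's send buffer is full), so a robust proxy drains the
pipe in a loop. Below is a minimal sketch of that structure, reusing rd,
wd and pfd from the snippet above; it is an illustration only (not the
code from the gist) and assumes it runs inside the same program, with
_GNU_SOURCE defined and <fcntl.h> included:

    int err = 0;
    while (!err) {
            /* Pull whatever the receive queue currently holds into the pipe. */
            ssize_t n = splice(rd, NULL, pfd[1], NULL, 1024 * 1024,
                               SPLICE_F_MOVE);
            if (n <= 0)
                    break;                  /* EOF or read-side error */

            /* Drain the pipe into the write socket; a single call may
             * move fewer than n bytes, so loop until the pipe is empty. */
            while (n > 0) {
                    ssize_t m = splice(pfd[0], NULL, wd, NULL, n,
                                       SPLICE_F_MOVE);
                    if (m <= 0) {
                            err = 1;        /* write-side error */
                            break;
                    }
                    n -= m;
            }
    }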

I get 100% CPU (sys) and dramatically worse performance (about 1.5x slower).

A naive run of perf record ./proxy-splice shows:
5.73% [k] queued_spin_lock_slowpath
5.23% [k] ipt_do_table
4.72% [k] __splice_segment.part.59
4.72% [k] do_tcp_sendpages
3.47% [k] _raw_spin_lock_bh
3.36% [k] __x86_indirect_thunk_rax
(kernel 4.14.71)
Is it possible to squeeze more out of splice()? Is it possible to force
splice() to block until more data is available, instead of returning
quickly? (SO_RCVLOWAT doesn't seem to work.)
Is there another way of doing TCP splicing? I'm aware of TCP ZEROCOPY
that landed in 4.19.
Cheers,
Marek

* Re: splice() performance for TCP socket forwarding
From: Eric Dumazet @ 2018-12-13 12:49 UTC
To: Marek Majkowski, netdev

On 12/13/2018 03:25 AM, Marek Majkowski wrote:
> When replaced with splice version:
>
>     pipe(pfd);
>     fcntl(pfd[0], F_SETPIPE_SZ, 1024 * 1024);

Why 1 MB?

splice code will be expensive if less than 1 MB is present in the receive
queue.

> Is it possible to squeeze more out of splice()? Is it possible to force
> splice() to block until more data is available, instead of returning
> quickly? (SO_RCVLOWAT doesn't seem to work.)

I believe it should work on recent linux kernels (4.18):

03f45c883c6f391ed4fff8292415b35bd1107519 tcp: avoid extra wakeups for SO_RCVLOWAT users
796f82eafcd96629c2f9a0332dbb4f474854aaf8 tcp: fix delayed acks behavior for SO_RCVLOWAT
d1361840f8c519eaee9a78ffe09e4f0a1b586846 tcp: fix SO_RCVLOWAT and RCVBUF autotuning

> Is there another way of doing TCP splicing? I'm aware of TCP ZEROCOPY
> that landed in 4.19.

TCP zero copy only works if your MSS is exactly 4096 bytes (+ TCP options),
so it might be tricky, as it also requires the NIC driver to perform nice
header splitting.
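
For completeness, the low-water mark Eric refers to is set with a plain
setsockopt() on the receive socket. A minimal sketch, assuming rd is the
receive socket from the earlier snippet and that 128 KiB is just an
example threshold, not a recommendation:

    #include <stdio.h>
    #include <sys/socket.h>

    int lowat = 128 * 1024;  /* only wake the reader once ~128 KiB is queued */
    if (setsockopt(rd, SOL_SOCKET, SO_RCVLOWAT, &lowat, sizeof(lowat)) < 0)
            perror("setsockopt(SO_RCVLOWAT)");

As the rest of the thread suggests, kernels older than the commits above
may accept the option without honoring it for these wakeups.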

* Re: splice() performance for TCP socket forwarding
From: Marek Majkowski @ 2018-12-13 13:17 UTC
To: eric.dumazet; +Cc: netdev

Eric,

On Thu, Dec 13, 2018 at 1:49 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Why 1 MB?
>
> splice code will be expensive if less than 1 MB is present in the receive
> queue.

I'm not sure what you are suggesting. I'm just shuffling data between
two sockets. Is there a better buffer size value? Is it possible to
keep splice() blocked until it succeeds in forwarding N bytes of data?
(I tried this unsuccessfully with SO_RCVLOWAT.)

Here is a snippet from strace:

splice(4, NULL, 11, NULL, 1048576, 0) = 373760 <0.000048>
splice(10, NULL, 5, NULL, 373760, 0) = 373760 <0.000108>
splice(4, NULL, 11, NULL, 1048576, 0) = 335800 <0.000065>
splice(10, NULL, 5, NULL, 335800, 0) = 335800 <0.000202>
splice(4, NULL, 11, NULL, 1048576, 0) = 227760 <0.000029>
splice(10, NULL, 5, NULL, 227760, 0) = 227760 <0.000106>
splice(4, NULL, 11, NULL, 1048576, 0) = 16060 <0.000019>
splice(10, NULL, 5, NULL, 16060, 0) = 16060 <0.000028>
splice(4, NULL, 11, NULL, 1048576, 0) = 7300 <0.000013>
splice(10, NULL, 5, NULL, 7300, 0) = 7300 <0.000021>

> I believe it should work on recent linux kernels (4.18):
>
> 03f45c883c6f391ed4fff8292415b35bd1107519 tcp: avoid extra wakeups for SO_RCVLOWAT users
> 796f82eafcd96629c2f9a0332dbb4f474854aaf8 tcp: fix delayed acks behavior for SO_RCVLOWAT
> d1361840f8c519eaee9a78ffe09e4f0a1b586846 tcp: fix SO_RCVLOWAT and RCVBUF autotuning

I can confirm this. On 4.19 the splice program indeed goes down to the
expected ~50% CPU, with performance comparable to the naive read/write
version.

> TCP zero copy only works if your MSS is exactly 4096 bytes (+ TCP options),
> so it might be tricky, as it also requires the NIC driver to perform nice
> header splitting.

Oh, that's a pity.

Thanks for the help.

Marek

* Re: splice() performance for TCP socket forwarding
From: Marek Majkowski @ 2018-12-13 13:18 UTC
To: eric.dumazet; +Cc: netdev

On Thu, Dec 13, 2018 at 2:17 PM Marek Majkowski <marek@cloudflare.com> wrote:
> I'm not sure what you are suggesting. I'm just shuffling data between
> two sockets. Is there a better buffer size value? Is it possible to
> keep splice() blocked until it succeeds in forwarding N bytes of data?
> (I tried this unsuccessfully with SO_RCVLOWAT.)

I jumped the gun here. Let me re-try SO_RCVLOWAT on 4.19.

* Re: splice() performance for TCP socket forwarding
From: Marek Majkowski @ 2018-12-13 13:33 UTC
To: eric.dumazet; +Cc: netdev

Ok, 4.19 does seem to kinda fix SO_RCVLOWAT with splice, but I don't
fully understand it:

fcntl(8, F_SETPIPE_SZ, 1048576) = 1048576 <0.000033>
setsockopt(4, SOL_SOCKET, SO_RCVLOWAT, [131072], 4) = 0 <0.000014>
splice(4, NULL, 9, NULL, 1048576, SPLICE_F_MOVE) = 121435 <71.039385>
splice(8, NULL, 5, NULL, 121435, SPLICE_F_MOVE) = 121435 <0.000118>
splice(4, NULL, 9, NULL, 1048576, SPLICE_F_MOVE) = 11806 <0.000019>
splice(8, NULL, 5, NULL, 11806, SPLICE_F_MOVE) = 11806 <0.000018>

So, even though I requested a low-water mark of 128 KiB, the first splice
returned only ~121 KiB and the second one ~11 KiB. The first one can be
explained by data plus metadata crossing the 128 KiB threshold. I'm not
sure about the second splice.

* Re: splice() performance for TCP socket forwarding
From: Eric Dumazet @ 2018-12-13 14:03 UTC
To: Marek Majkowski, eric.dumazet; +Cc: netdev

On 12/13/2018 05:33 AM, Marek Majkowski wrote:
> Ok, 4.19 does seem to kinda fix SO_RCVLOWAT with splice, but I don't
> fully understand it:
>
> fcntl(8, F_SETPIPE_SZ, 1048576) = 1048576 <0.000033>
> setsockopt(4, SOL_SOCKET, SO_RCVLOWAT, [131072], 4) = 0 <0.000014>
> splice(4, NULL, 9, NULL, 1048576, SPLICE_F_MOVE) = 121435 <71.039385>
> splice(8, NULL, 5, NULL, 121435, SPLICE_F_MOVE) = 121435 <0.000118>
> splice(4, NULL, 9, NULL, 1048576, SPLICE_F_MOVE) = 11806 <0.000019>
> splice(8, NULL, 5, NULL, 11806, SPLICE_F_MOVE) = 11806 <0.000018>

Good point.

At this moment SO_RCVLOWAT only tries to reduce the number of POLLIN events.

But if your splice() system call is performed while there are already
available skbs in the receive queue, splice() won't block and will deliver
what is available in the queue.

I guess we would need to add some logic in recvmsg() and tcp_splice_read()
to truly implement SO_RCVLOWAT: block until at least sk->sk_rcvlowat bytes
are available in the receive queue.

* Re: splice() performance for TCP socket forwarding
From: Eric Dumazet @ 2018-12-13 14:05 UTC
To: Marek Majkowski; +Cc: netdev

On 12/13/2018 06:03 AM, Eric Dumazet wrote:
> I guess we would need to add some logic in recvmsg() and tcp_splice_read()
> to truly implement SO_RCVLOWAT: block until at least sk->sk_rcvlowat bytes
> are available in the receive queue.

You could also work around the problem by inserting a poll() system call
before the splice(), since poll() would only deliver the POLLIN event when
sk->sk_rcvlowat bytes are present in the queue.
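
A minimal sketch of that workaround, assuming rd already has SO_RCVLOWAT
set as in the earlier setsockopt() snippet, and again reusing rd, wd and
pfd from the original code:

    #include <poll.h>

    struct pollfd pev = { .fd = rd, .events = POLLIN };

    while (1) {
            /* With SO_RCVLOWAT set, poll() only reports POLLIN once the
             * low-water mark's worth of bytes is queued on rd. */
            if (poll(&pev, 1, -1) <= 0)
                    break;
            ssize_t n = splice(rd, NULL, pfd[1], NULL, 1024 * 1024,
                               SPLICE_F_MOVE);
            if (n <= 0)
                    break;
            splice(pfd[0], NULL, wd, NULL, n, SPLICE_F_MOVE);
    }

This only gates when the loop wakes up; how much each splice() actually
moves still depends on what is sitting in the receive queue at that moment.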

* Re: splice() performance for TCP socket forwarding
From: Willy Tarreau @ 2018-12-13 14:04 UTC
To: Marek Majkowski; +Cc: eric.dumazet, netdev

On Thu, Dec 13, 2018 at 02:17:20PM +0100, Marek Majkowski wrote:
> > splice code will be expensive if less than 1 MB is present in the receive
> > queue.
>
> I'm not sure what you are suggesting. I'm just shuffling data between
> two sockets. Is there a better buffer size value? Is it possible to
> keep splice() blocked until it succeeds in forwarding N bytes of data?
> (I tried this unsuccessfully with SO_RCVLOWAT.)

I've personally observed a performance decrease once the pipe is configured
larger than 512 kB. I think it's related to the fact that you're moving 256
pages around on each call, and that it might even start to have some effect
on L1 caches when touching lots of data, though that could be completely
unrelated.

> Here is a snippet from strace:
>
> splice(4, NULL, 11, NULL, 1048576, 0) = 373760 <0.000048>
> splice(10, NULL, 5, NULL, 373760, 0) = 373760 <0.000108>
> [...]

I think your driver is returning one segment per page. Let's do some rough
maths: assuming you have an MSS of 1448 (timestamps enabled), you'll
retrieve 256*1448 = 370688 bytes at once per call, which closely matches
what you're seeing. Hmmm, checking closer, you're in fact running at exactly
1460 (256*1460 = 373760), so you have timestamps disabled. Your numbers seem
normal to me (just the CPU usage doesn't, but maybe it improves when using a
smaller pipe).

Willy
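
For the record, a throwaway snippet checking that arithmetic (the 1 MiB
pipe and 4 KiB page size are taken from the messages above):

    #include <stdio.h>

    int main(void)
    {
            int slots = 1048576 / 4096;    /* 1 MiB pipe = 256 page slots */

            printf("%d\n", slots * 1448);  /* 370688: MSS with timestamps */
            printf("%d\n", slots * 1460);  /* 373760: MSS without timestamps,
                                              matching the strace above */
            return 0;
    }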

* Re: splice() performance for TCP socket forwarding
From: Willy Tarreau @ 2018-12-13 12:55 UTC
To: Marek Majkowski; +Cc: netdev

Hi Marek,

On Thu, Dec 13, 2018 at 12:25:20PM +0100, Marek Majkowski wrote:
> I get 100% CPU (sys) and dramatically worse performance (about 1.5x slower).

It's quite strange; it doesn't match at all what I'm used to. In haproxy
we're using splicing between sockets as well, and for medium to large
objects we always get much better performance with splicing than without.
3 years ago during a test, we reached 60 Gbps on a 4-core machine using two
40G NICs, which is not an exceptional sizing. And between processes on the
loopback, numbers around 100G are totally possible. By the way, this is one
test you should start with, to verify whether the issue is more on the
splice side or on the NIC's side. It might be that your network driver is
totally inefficient when used with GRO/GSO. In my case, multi-10G using
ixgbe and 40G using mlx5 have always shown excellent results.

Regards,
Willy

* Re: splice() performance for TCP socket forwarding
From: Eric Dumazet @ 2018-12-13 13:37 UTC
To: Willy Tarreau, Marek Majkowski; +Cc: netdev

On 12/13/2018 04:55 AM, Willy Tarreau wrote:
> It might be that your network driver is totally inefficient when used with
> GRO/GSO. In my case, multi-10G using ixgbe and 40G using mlx5 have always
> shown excellent results.

Maybe the mlx5 driver is in LRO mode, packing TCP payload into 4K pages?

bnx2x GRO/LRO has this mode, meaning that around 8 pages are used for a GRO
packet of ~32 KB, while mlx4 for instance would use one page frag for every
~1428 bytes of payload.

* Re: splice() performance for TCP socket forwarding
From: Willy Tarreau @ 2018-12-13 13:57 UTC
To: Eric Dumazet; +Cc: Marek Majkowski, netdev

Hi Eric!

On Thu, Dec 13, 2018 at 05:37:11AM -0800, Eric Dumazet wrote:
> Maybe the mlx5 driver is in LRO mode, packing TCP payload into 4K pages?

I could be wrong but I don't think so: I remember being used to LRO on
myri10ge a decade ago, giving me good performance which would degrade with
concurrent connections, up to the point where LRO got deprecated once GRO
started to work quite well. Thus this got me used to always disabling LRO
to be sure to measure something durable ;-)

> bnx2x GRO/LRO has this mode, meaning that around 8 pages are used for a GRO
> packet of ~32 KB, while mlx4 for instance would use one page frag for every
> ~1428 bytes of payload.

I remember that it was the same on myri10ge (1 segment per page), making
splice() return roughly 21 or 22 kB per call for a 64 kB pipe.

BTW, I think I misspoke and that 3 years ago it was mlx4 and not mlx5 that
I've been using.

Willy