* splice() performance for TCP socket forwarding
@ 2018-12-13 11:25 Marek Majkowski
From: Marek Majkowski @ 2018-12-13 11:25 UTC
To: netdev
Hi!
I'm basically trying to do TCP splicing in Linux. I'm focusing on
performance of the simplest case: receive data from one TCP socket,
write data to another TCP socket. I get poor performance with splice.
First, the naive code, pretty much:
    while (1) {
            n = read(rs, buf, sizeof(buf));
            write(ws, buf, n);
    }
With GRO enabled, this code does roughly 10 Gbps line rate, hovering
around 50% CPU in the application (mostly sys time).
When replaced with splice version:
    pipe(pfd);
    fcntl(pfd[0], F_SETPIPE_SZ, 1024 * 1024);
    while (1) {
            n = splice(rd, NULL, pfd[1], NULL, 1024 * 1024,
                       SPLICE_F_MOVE);
            splice(pfd[0], NULL, wd, NULL, n, SPLICE_F_MOVE);
    }
Full code:
https://gist.github.com/majek/c58a97b9be7d9217fe3ebd6c1328faaa#file-proxy-splice-c-L59
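
One practical note on this pattern: the second splice() can move fewer
bytes than the first one placed into the pipe (for instance when the
destination socket's send buffer is full), so a robust proxy drains the
pipe in a loop. Below is a minimal sketch of that structure, reusing rd,
wd and pfd from the snippet above; it is an illustration only (not the
code from the gist) and assumes it runs inside the same program, with
_GNU_SOURCE defined and <fcntl.h> included:

    int err = 0;
    while (!err) {
            /* Pull whatever the receive queue currently holds into the pipe. */
            ssize_t n = splice(rd, NULL, pfd[1], NULL, 1024 * 1024,
                               SPLICE_F_MOVE);
            if (n <= 0)
                    break;                  /* EOF or read-side error */

            /* Drain the pipe into the write socket; a single call may
             * move fewer than n bytes, so loop until the pipe is empty. */
            while (n > 0) {
                    ssize_t m = splice(pfd[0], NULL, wd, NULL, n,
                                       SPLICE_F_MOVE);
                    if (m <= 0) {
                            err = 1;        /* write-side error */
                            break;
                    }
                    n -= m;
            }
    }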

I get 100% CPU (sys) and dramatically worse performance (about 1.5x slower).

A naive run of perf record ./proxy-splice shows:
5.73% [k] queued_spin_lock_slowpath
5.23% [k] ipt_do_table
4.72% [k] __splice_segment.part.59
4.72% [k] do_tcp_sendpages
3.47% [k] _raw_spin_lock_bh
3.36% [k] __x86_indirect_thunk_rax
(kernel 4.14.71)
Is it possible to squeeze more out of splice()? Is it possible to force
splice() to block until more data is available, instead of returning
quickly? (SO_RCVLOWAT doesn't seem to work.)
Is there another way of doing TCP splicing? I'm aware of TCP ZEROCOPY
that landed in 4.19.
Cheers,
Marek

* Re: splice() performance for TCP socket forwarding
From: Eric Dumazet @ 2018-12-13 12:49 UTC
To: Marek Majkowski, netdev

On 12/13/2018 03:25 AM, Marek Majkowski wrote:
> When replaced with splice version:
>
>     pipe(pfd);
>     fcntl(pfd[0], F_SETPIPE_SZ, 1024 * 1024);

Why 1 MB?

splice code will be expensive if less than 1 MB is present in the receive
queue.

> Is it possible to squeeze more out of splice()? Is it possible to force
> splice() to block until more data is available, instead of returning
> quickly? (SO_RCVLOWAT doesn't seem to work.)

I believe it should work on recent linux kernels (4.18):

03f45c883c6f391ed4fff8292415b35bd1107519 tcp: avoid extra wakeups for SO_RCVLOWAT users
796f82eafcd96629c2f9a0332dbb4f474854aaf8 tcp: fix delayed acks behavior for SO_RCVLOWAT
d1361840f8c519eaee9a78ffe09e4f0a1b586846 tcp: fix SO_RCVLOWAT and RCVBUF autotuning

> Is there another way of doing TCP splicing? I'm aware of TCP ZEROCOPY
> that landed in 4.19.

TCP zero copy only works if your MSS is exactly 4096 bytes (+ TCP options),
so it might be tricky, as it also requires the NIC driver to perform nice
header splitting.
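
For completeness, the low-water mark Eric refers to is set with a plain
setsockopt() on the receive socket. A minimal sketch, assuming rd is the
receive socket from the earlier snippet and that 128 KiB is just an
example threshold, not a recommendation:

    #include <stdio.h>
    #include <sys/socket.h>

    int lowat = 128 * 1024;  /* only wake the reader once ~128 KiB is queued */
    if (setsockopt(rd, SOL_SOCKET, SO_RCVLOWAT, &lowat, sizeof(lowat)) < 0)
            perror("setsockopt(SO_RCVLOWAT)");

As the rest of the thread suggests, kernels older than the commits above
may accept the option without honoring it for these wakeups.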

* Re: splice() performance for TCP socket forwarding
From: Marek Majkowski @ 2018-12-13 13:17 UTC
To: eric.dumazet; +Cc: netdev

Eric,

On Thu, Dec 13, 2018 at 1:49 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Why 1 MB?
>
> splice code will be expensive if less than 1 MB is present in the receive
> queue.

I'm not sure what you are suggesting. I'm just shuffling data between
two sockets. Is there a better buffer size value? Is it possible to
keep splice() blocked until it succeeds in forwarding N bytes of data?
(I tried this unsuccessfully with SO_RCVLOWAT.)

Here is a snippet from strace:

splice(4, NULL, 11, NULL, 1048576, 0) = 373760 <0.000048>
splice(10, NULL, 5, NULL, 373760, 0) = 373760 <0.000108>
splice(4, NULL, 11, NULL, 1048576, 0) = 335800 <0.000065>
splice(10, NULL, 5, NULL, 335800, 0) = 335800 <0.000202>
splice(4, NULL, 11, NULL, 1048576, 0) = 227760 <0.000029>
splice(10, NULL, 5, NULL, 227760, 0) = 227760 <0.000106>
splice(4, NULL, 11, NULL, 1048576, 0) = 16060 <0.000019>
splice(10, NULL, 5, NULL, 16060, 0) = 16060 <0.000028>
splice(4, NULL, 11, NULL, 1048576, 0) = 7300 <0.000013>
splice(10, NULL, 5, NULL, 7300, 0) = 7300 <0.000021>

> I believe it should work on recent linux kernels (4.18):
>
> 03f45c883c6f391ed4fff8292415b35bd1107519 tcp: avoid extra wakeups for SO_RCVLOWAT users
> 796f82eafcd96629c2f9a0332dbb4f474854aaf8 tcp: fix delayed acks behavior for SO_RCVLOWAT
> d1361840f8c519eaee9a78ffe09e4f0a1b586846 tcp: fix SO_RCVLOWAT and RCVBUF autotuning

I can confirm this. On 4.19 the splice program indeed goes down to the
expected ~50% CPU, with performance comparable to the naive read/write
version.

> TCP zero copy only works if your MSS is exactly 4096 bytes (+ TCP options),
> so it might be tricky, as it also requires the NIC driver to perform nice
> header splitting.

Oh, that's a pity.

Thanks for the help.

Marek

* Re: splice() performance for TCP socket forwarding
From: Marek Majkowski @ 2018-12-13 13:18 UTC
To: eric.dumazet; +Cc: netdev

On Thu, Dec 13, 2018 at 2:17 PM Marek Majkowski <marek@cloudflare.com> wrote:
> I'm not sure what you are suggesting. I'm just shuffling data between
> two sockets. Is there a better buffer size value? Is it possible to
> keep splice() blocked until it succeeds in forwarding N bytes of data?
> (I tried this unsuccessfully with SO_RCVLOWAT.)

I jumped the gun here. Let me re-try SO_RCVLOWAT on 4.19.

* Re: splice() performance for TCP socket forwarding
From: Marek Majkowski @ 2018-12-13 13:33 UTC
To: eric.dumazet; +Cc: netdev

Ok, 4.19 does seem to kinda fix SO_RCVLOWAT with splice, but I don't
fully understand it:

fcntl(8, F_SETPIPE_SZ, 1048576) = 1048576 <0.000033>
setsockopt(4, SOL_SOCKET, SO_RCVLOWAT, [131072], 4) = 0 <0.000014>
splice(4, NULL, 9, NULL, 1048576, SPLICE_F_MOVE) = 121435 <71.039385>
splice(8, NULL, 5, NULL, 121435, SPLICE_F_MOVE) = 121435 <0.000118>
splice(4, NULL, 9, NULL, 1048576, SPLICE_F_MOVE) = 11806 <0.000019>
splice(8, NULL, 5, NULL, 11806, SPLICE_F_MOVE) = 11806 <0.000018>

So, even though I requested a low-water mark of 128 KiB, the first splice
returned only ~121 KiB and the second one ~11 KiB. The first one can be
explained by data plus metadata crossing the 128 KiB threshold. I'm not
sure about the second splice.

* Re: splice() performance for TCP socket forwarding
From: Eric Dumazet @ 2018-12-13 14:03 UTC
To: Marek Majkowski, eric.dumazet; +Cc: netdev

On 12/13/2018 05:33 AM, Marek Majkowski wrote:
> Ok, 4.19 does seem to kinda fix SO_RCVLOWAT with splice, but I don't
> fully understand it:
>
> fcntl(8, F_SETPIPE_SZ, 1048576) = 1048576 <0.000033>
> setsockopt(4, SOL_SOCKET, SO_RCVLOWAT, [131072], 4) = 0 <0.000014>
> splice(4, NULL, 9, NULL, 1048576, SPLICE_F_MOVE) = 121435 <71.039385>
> splice(8, NULL, 5, NULL, 121435, SPLICE_F_MOVE) = 121435 <0.000118>
> splice(4, NULL, 9, NULL, 1048576, SPLICE_F_MOVE) = 11806 <0.000019>
> splice(8, NULL, 5, NULL, 11806, SPLICE_F_MOVE) = 11806 <0.000018>

Good point.

At this moment SO_RCVLOWAT only tries to reduce the number of POLLIN events.

But if your splice() system call is performed while there are already
available skbs in the receive queue, splice() won't block and will deliver
what is available in the queue.

I guess we would need to add some logic in recvmsg() and tcp_splice_read()
to truly implement SO_RCVLOWAT: block until at least sk->sk_rcvlowat bytes
are available in the receive queue.

* Re: splice() performance for TCP socket forwarding
From: Eric Dumazet @ 2018-12-13 14:05 UTC
To: Marek Majkowski; +Cc: netdev

On 12/13/2018 06:03 AM, Eric Dumazet wrote:
> I guess we would need to add some logic in recvmsg() and tcp_splice_read()
> to truly implement SO_RCVLOWAT: block until at least sk->sk_rcvlowat bytes
> are available in the receive queue.

You could also work around the problem by inserting a poll() system call
before the splice(), since poll() would only deliver the POLLIN event when
sk->sk_rcvlowat bytes are present in the queue.
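
A minimal sketch of that workaround, assuming rd already has SO_RCVLOWAT
set as in the earlier setsockopt() snippet, and again reusing rd, wd and
pfd from the original code:

    #include <poll.h>

    struct pollfd pev = { .fd = rd, .events = POLLIN };

    while (1) {
            /* With SO_RCVLOWAT set, poll() only reports POLLIN once the
             * low-water mark's worth of bytes is queued on rd. */
            if (poll(&pev, 1, -1) <= 0)
                    break;
            ssize_t n = splice(rd, NULL, pfd[1], NULL, 1024 * 1024,
                               SPLICE_F_MOVE);
            if (n <= 0)
                    break;
            splice(pfd[0], NULL, wd, NULL, n, SPLICE_F_MOVE);
    }

This only gates when the loop wakes up; how much each splice() actually
moves still depends on what is sitting in the receive queue at that moment.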

* Re: splice() performance for TCP socket forwarding
From: Willy Tarreau @ 2018-12-13 14:04 UTC
To: Marek Majkowski; +Cc: eric.dumazet, netdev

On Thu, Dec 13, 2018 at 02:17:20PM +0100, Marek Majkowski wrote:
> > splice code will be expensive if less than 1 MB is present in the receive
> > queue.
>
> I'm not sure what you are suggesting. I'm just shuffling data between
> two sockets. Is there a better buffer size value? Is it possible to
> keep splice() blocked until it succeeds in forwarding N bytes of data?
> (I tried this unsuccessfully with SO_RCVLOWAT.)

I've personally observed a performance decrease once the pipe is configured
larger than 512 kB. I think it's related to the fact that you're moving 256
pages around on each call, and that it might even start to have some effect
on L1 caches when touching lots of data, though that could be completely
unrelated.

> Here is a snippet from strace:
>
> splice(4, NULL, 11, NULL, 1048576, 0) = 373760 <0.000048>
> splice(10, NULL, 5, NULL, 373760, 0) = 373760 <0.000108>
> [...]

I think your driver is returning one segment per page. Let's do some rough
maths: assuming you have an MSS of 1448 (timestamps enabled), you'll
retrieve 256*1448 = 370688 bytes at once per call, which closely matches
what you're seeing. Hmmm, checking closer, you're in fact running at exactly
1460 (256*1460 = 373760), so you have timestamps disabled. Your numbers seem
normal to me (just the CPU usage doesn't, but maybe it improves when using a
smaller pipe).

Willy
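
For the record, a throwaway snippet checking that arithmetic (the 1 MiB
pipe and 4 KiB page size are taken from the messages above):

    #include <stdio.h>

    int main(void)
    {
            int slots = 1048576 / 4096;    /* 1 MiB pipe = 256 page slots */

            printf("%d\n", slots * 1448);  /* 370688: MSS with timestamps */
            printf("%d\n", slots * 1460);  /* 373760: MSS without timestamps,
                                              matching the strace above */
            return 0;
    }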

* Re: splice() performance for TCP socket forwarding
From: Willy Tarreau @ 2018-12-13 12:55 UTC
To: Marek Majkowski; +Cc: netdev

Hi Marek,

On Thu, Dec 13, 2018 at 12:25:20PM +0100, Marek Majkowski wrote:
> I get 100% CPU (sys) and dramatically worse performance (about 1.5x slower).

It's quite strange; it doesn't match at all what I'm used to. In haproxy
we're using splicing between sockets as well, and for medium to large
objects we always get much better performance with splicing than without.
3 years ago during a test, we reached 60 Gbps on a 4-core machine using two
40G NICs, which is not an exceptional sizing. And between processes on the
loopback, numbers around 100G are totally possible. By the way, this is one
test you should start with, to verify whether the issue is more on the
splice side or on the NIC's side. It might be that your network driver is
totally inefficient when used with GRO/GSO. In my case, multi-10G using
ixgbe and 40G using mlx5 have always shown excellent results.

Regards,
Willy

* Re: splice() performance for TCP socket forwarding
From: Eric Dumazet @ 2018-12-13 13:37 UTC
To: Willy Tarreau, Marek Majkowski; +Cc: netdev

On 12/13/2018 04:55 AM, Willy Tarreau wrote:
> It might be that your network driver is totally inefficient when used with
> GRO/GSO. In my case, multi-10G using ixgbe and 40G using mlx5 have always
> shown excellent results.

Maybe the mlx5 driver is in LRO mode, packing TCP payload into 4K pages?

bnx2x GRO/LRO has this mode, meaning that around 8 pages are used for a GRO
packet of ~32 KB, while mlx4 for instance would use one page frag for every
~1428 bytes of payload.

* Re: splice() performance for TCP socket forwarding
From: Willy Tarreau @ 2018-12-13 13:57 UTC
To: Eric Dumazet; +Cc: Marek Majkowski, netdev

Hi Eric!

On Thu, Dec 13, 2018 at 05:37:11AM -0800, Eric Dumazet wrote:
> Maybe the mlx5 driver is in LRO mode, packing TCP payload into 4K pages?

I could be wrong but I don't think so: I remember being used to LRO on
myri10ge a decade ago, giving me good performance which would degrade with
concurrent connections, up to the point where LRO got deprecated once GRO
started to work quite well. Thus this got me used to always disabling LRO
to be sure to measure something durable ;-)

> bnx2x GRO/LRO has this mode, meaning that around 8 pages are used for a GRO
> packet of ~32 KB, while mlx4 for instance would use one page frag for every
> ~1428 bytes of payload.

I remember that it was the same on myri10ge (1 segment per page), making
splice() return roughly 21 or 22 kB per call for a 64 kB pipe.

BTW, I think I misspoke and that 3 years ago it was mlx4 and not mlx5 that
I've been using.

Willy