From: Eric Dumazet <eric.dumazet@gmail.com>
To: Jesper Dangaard Brouer <jdb@comx.dk>
Cc: netdev <netdev@vger.kernel.org>, David Miller <davem@davemloft.net>
Subject: Re: Loopback performance from kernel 2.6.12 to 2.6.37
Date: Thu, 18 Nov 2010 18:41:53 +0100 [thread overview]
Message-ID: <1290102113.2781.237.camel@edumazet-laptop> (raw)
In-Reply-To: <1290088353.2781.137.camel@edumazet-laptop>
Le jeudi 18 novembre 2010 à 14:52 +0100, Eric Dumazet a écrit :
> Le mardi 09 novembre 2010 à 15:25 +0100, Eric Dumazet a écrit :
>
> > So far, so good. These are the expected numbers. Now we have to
> > understand why corei7 gets 38 seconds instead of 8 :)
> >
> >
>
> My tests show a problem with backlog processing and overly large TCP
> windows (at least on loopback, with wild senders).
>
> Basically, with the huge TCP windows we have now (4 Mbytes by default),
> the reader process may have to process up to 4 Mbytes of backlogged data
> in __release_sock() before returning from the 'small' read(fd, buffer,
> 1024) done by netcat.
>
> While it processes this backlog, it sends TCP ACKs, allowing the sender
> to send new frames that might be dropped because of sk_rcvqueues_full(),
> or to keep filling the receive queue up to the receiver window, feeding
> the task looping in __release_sock().
>
>
> This blows CPU caches completely [data is queued, and the dequeue is
> done long after], and the latency of a single read() can be very high.
> This eventually stalls the pipeline of user processing.
>
>
> <digress>
> I also understand why UDP latencies are so impacted. If we receive a
> burst of frames on the same socket, the user process reading the first
> frame might be forced to process the backlog before returning to
> userland.
>
> Really, we must zap lock_sock() from the UDP input path.
>
> commit 95766fff6b9a78d1 ([UDP]: Add memory accounting) was a big error.
> </digress>
>
>
>
> On my server machine with 6 Mbytes of L2 cache, you don't see the
> problem, while on my laptop with 3 Mbytes of L2 cache, you can.
>
> I caught this thanks to the new SNMP counter added in 2.6.34
> (TCPBacklogDrop), which could easily increase by 1000 during the
> test.
>
>
> I built a test program, maybe easier to use than the various netcat
> flavors. It also uses only two tasks; that's better if you have a
> Core 2 Duo like me on my laptop ;)
>
> To reproduce the problem, run it with option -l 4M
>
> $ netstat -s|grep TCPBacklogDrop
> TCPBacklogDrop: 788
> $ time ./loopback_transfert -l 1k;netstat -s|grep TCPBacklogDrop
>
> real 0m14.013s
> user 0m0.630s
> sys 0m13.250s
> TCPBacklogDrop: 788
> $ time ./loopback_transfert -l 128k;netstat -s|grep TCPBacklogDrop
>
> real 0m7.447s
> user 0m0.030s
> sys 0m5.490s
> TCPBacklogDrop: 789
> $ time ./loopback_transfert -l 1M;netstat -s|grep TCPBacklogDrop
>
> real 0m11.206s
> user 0m0.020s
> sys 0m7.150s
> TCPBacklogDrop: 793
> $ time ./loopback_transfert -l 4M;netstat -s|grep TCPBacklogDrop
>
> real 0m10.347s
> user 0m0.000s
> sys 0m6.120s
> TCPBacklogDrop: 1510
> $ time ./loopback_transfert -l 16k;netstat -s|grep TCPBacklogDrop
>
> real 0m6.810s
> user 0m0.040s
> sys 0m6.670s
> TCPBacklogDrop: 1511
>
I forgot to include test results for my dev machine (server-class
machine, 8 Mbytes of L2 cache), NUMA, 2 sockets:
Intel(R) Xeon(R) CPU E5540 @ 2.53GHz
# netstat -s|grep TCPBacklogDrop
TCPBacklogDrop: 8891
# time ./loopback_transfert -l 16k;netstat -s|grep TCPBacklogDrop
real 0m7.033s
user 0m0.010s
sys 0m4.580s
TCPBacklogDrop: 9239
# time ./loopback_transfert -l 1M;netstat -s|grep TCPBacklogDrop
real 0m5.408s
user 0m0.000s
sys 0m2.880s
TCPBacklogDrop: 9243
# time ./loopback_transfert -l 4M;netstat -s|grep TCPBacklogDrop
real 0m2.965s
user 0m0.000s
sys 0m2.070s
TCPBacklogDrop: 10485
# time ./loopback_transfert -l 6M;netstat -s|grep TCPBacklogDrop
real 0m7.711s
user 0m0.000s
sys 0m3.180s
TCPBacklogDrop: 13537
# time ./loopback_transfert -l 8M;netstat -s|grep TCPBacklogDrop
real 0m11.497s
user 0m0.020s
sys 0m3.830s
TCPBacklogDrop: 17108
As soon as our working set is larger than the L2 cache, this is very slow.
For the -l 8M bench:
# Events: 7K cycles
#
# Overhead Command Shared Object Symbol
# ........ ....... ................. ...................................
#
40.97% loopback_transf [kernel.kallsyms] [k] copy_user_generic_string
18.57% :8968 [kernel.kallsyms] [k] copy_user_generic_string
3.54% :8968 [kernel.kallsyms] [k] get_page_from_freelist
3.36% :8968 [kernel.kallsyms] [k] tcp_sendmsg
1.17% :8968 [kernel.kallsyms] [k] put_page
0.99% :8968 [kernel.kallsyms] [k] free_hot_cold_page
0.99% :8968 [kernel.kallsyms] [k] __might_sleep
0.88% :8968 [kernel.kallsyms] [k] __ticket_spin_lock
0.81% loopback_transf [kernel.kallsyms] [k] free_pcppages_bulk
0.79% :8968 [kernel.kallsyms] [k] __alloc_pages_nodemask
0.63% loopback_transf [kernel.kallsyms] [k] put_page
0.63% loopback_transf [kernel.kallsyms] [k] __might_sleep
0.63% loopback_transf [kernel.kallsyms] [k] tcp_transmit_skb
0.57% :8968 [kernel.kallsyms] [k] skb_release_data
0.55% loopback_transf [kernel.kallsyms] [k] free_hot_cold_page
0.53% :8968 [kernel.kallsyms] [k] tcp_ack
0.50% loopback_transf [kernel.kallsyms] [k] __inet_lookup_established
0.49% loopback_transf [kernel.kallsyms] [k] skb_copy_datagram_iovec
0.47% :8968 [kernel.kallsyms] [k] __rmqueue
0.45% :8968 [kernel.kallsyms] [k] get_pageblock_flags_group
0.41% :8968 [kernel.kallsyms] [k] zone_watermark_ok
0.41% :8968 [kernel.kallsyms] [k] __inc_zone_state
0.40% loopback_transf [kernel.kallsyms] [k] skb_release_data
0.39% :8968 [kernel.kallsyms] [k] tcp_transmit_skb