netdev.vger.kernel.org archive mirror
* Initial benchmarks of some VJ ideas [mmap memcpy vs copy_to_user].
@ 2006-05-08 12:24 Evgeniy Polyakov
  2006-05-08 19:51 ` Evgeniy Polyakov
  2006-05-10 19:58 ` David S. Miller
  0 siblings, 2 replies; 11+ messages in thread
From: Evgeniy Polyakov @ 2006-05-08 12:24 UTC (permalink / raw)
  To: netdev; +Cc: davem, caitlinb, kelly, rusty, johnpol

[-- Attachment #1: Type: text/plain, Size: 1668 bytes --]

I hope he does not take offence at name shortening :)

I've slightly modified the UDP receive path and run several benchmarks
for the following cases:
1. pure recvfrom() using copy_to_user(), with 4k and 40k buffers.
2. recvfrom() remains the same, but instead of copy_to_user(), skb->data
is copied into a kernel buffer which can be mapped into userspace,
again with 4k and 40k buffers.
3. recvfrom() remains the same, but no data is copied at all; only the
iovec pointer is advanced and its size decreased.

The receiver is a simple single-threaded userspace application which
does blocking reads from a UDP socket with default socket/stack parameters.
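
Roughly, the receiver loop is just the following (a sketch only; the
port number is arbitrary and this is not the exact test program):

#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
	static char buf[40960];		/* the 40k case; 4096 for 4k */
	struct sockaddr_in addr;
	unsigned long long total = 0;
	ssize_t n;
	int s = socket(AF_INET, SOCK_DGRAM, 0);

	if (s < 0)
		return 1;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(5000);	/* arbitrary port */

	if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		return 1;

	/* one thread, blocking reads, default socket/stack parameters */
	while ((n = recvfrom(s, buf, sizeof(buf), 0, NULL, NULL)) > 0)
		total += n;

	printf("received %llu bytes\n", total);
	return 0;
}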

The receiver runs on a 2.4 GHz Xeon (HT enabled) with 1 GB of RAM and an e1000 gigabit NIC.
The sender runs on an amd64 (nvidia nforce4) box with 1 GB of RAM and an r8169 NIC.
The machines are connected through a d-link dgs-1216t gigabit switch.

Performance graph attached.

Conclusions:
at least in the UDP case with a 1 Gbit NIC, throughput did not increase,
but that may be a result of either the NICs (I do not entirely trust the
nvidia and/or realtek hardware) or a broken sender application.
So the only observable result here is the change in CPU usage:
it decreased by 30% for the copy_to_user() -> memcpy() change
with 40k buffers. 4k buffers are too small to show any difference
because of syscall overhead.

Even if we translate the CPU savings into network speed, we still
cannot get a 6x (or even a 2x) performance gain.

Luckily, TCP processing is much more costly, the e1000 interrupt
handler is too big, and there are a lot of context switches and other
cache-unfriendly and locking effects, but I still wonder where the
6x (!) performance gain lives.

-- 
	Evgeniy Polyakov

[-- Attachment #2: netchannel_speed.png --]
[-- Type: image/png, Size: 9120 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Initial benchmarks of some VJ ideas [mmap memcpy vs copy_to_user].
  2006-05-08 12:24 Initial benchmarks of some VJ ideas [mmap memcpy vs copy_to_user] Evgeniy Polyakov
@ 2006-05-08 19:51 ` Evgeniy Polyakov
  2006-05-08 20:15   ` David S. Miller
  2006-05-10 19:58 ` David S. Miller
  1 sibling, 1 reply; 11+ messages in thread
From: Evgeniy Polyakov @ 2006-05-08 19:51 UTC (permalink / raw)
  To: netdev; +Cc: davem, caitlinb, kelly, rusty

On Mon, May 08, 2006 at 04:24:22PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> Luckily, TCP processing is much more costly, the e1000 interrupt
> handler is too big, and there are a lot of context switches and other
> cache-unfriendly and locking effects, but I still wonder where the
> 6x (!) performance gain lives.

Since nocopy is effectively equivalent to DMA into a mapped buffer,
we get something close to 6 times less CPU usage, and if that can be
linearly translated into a performance gain, we have found where the
most significant part of VJ channels lives. Unfortunately it is not
backward compatible with the recv() system call, and requires major
application changes to take advantage of it.
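
To illustrate why (a sketch only - the ring layout, the ready-flag
protocol and every name below are invented, not a real or proposed
API): instead of handing the kernel a buffer via recv(), the
application consumes packets in place from a kernel-mapped ring.

/* All names invented; purely an illustration of in-place consumption. */
struct ring_desc {
	volatile unsigned int ready;	/* kernel sets this when data lands */
	unsigned int len;
	unsigned int off;		/* payload offset within the mapping */
};

static void process(const char *data, unsigned int len)
{
	(void)data; (void)len;		/* application logic goes here */
}

static void consume(const char *map, struct ring_desc *ring, unsigned int nr)
{
	unsigned int i = 0;

	for (;;) {
		struct ring_desc *d = &ring[i];

		while (!d->ready)
			;			/* a real API would sleep here */

		process(map + d->off, d->len);	/* data used in place, no copy */

		d->ready = 0;		/* hand the slot back to the kernel */
		i = (i + 1) % nr;
	}
}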

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Initial benchmarks of some VJ ideas [mmap memcpy vs copy_to_user].
  2006-05-08 19:51 ` Evgeniy Polyakov
@ 2006-05-08 20:15   ` David S. Miller
  0 siblings, 0 replies; 11+ messages in thread
From: David S. Miller @ 2006-05-08 20:15 UTC (permalink / raw)
  To: johnpol; +Cc: netdev, caitlinb, kelly, rusty

From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Date: Mon, 8 May 2006 23:51:32 +0400

> Since nocopy is effectively equivalent to DMA into a mapped buffer,
> we get something close to 6 times less CPU usage, and if that can be
> linearly translated into a performance gain, we have found where the
> most significant part of VJ channels lives.

Van's machines were CPU limited.  And once the CPU limit was removed,
they became bus bandwidth limited.

> Unfortunately it is not backward compatible with the recv() system
> call, and requires major application changes to take advantage of it.

I stopped believing a very long time ago that a compatible API for
getting top performance on networking receive is possible.

An ABI change is an absolute requirement.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Initial benchmarks of some VJ ideas [mmap memcpy vs copy_to_user].
  2006-05-08 12:24 Initial benchmarks of some VJ ideas [mmap memcpy vs copy_to_user] Evgeniy Polyakov
  2006-05-08 19:51 ` Evgeniy Polyakov
@ 2006-05-10 19:58 ` David S. Miller
  2006-05-11  6:40   ` Evgeniy Polyakov
  1 sibling, 1 reply; 11+ messages in thread
From: David S. Miller @ 2006-05-10 19:58 UTC (permalink / raw)
  To: johnpol; +Cc: netdev, caitlinb, kelly, rusty

From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Date: Mon, 8 May 2006 16:24:22 +0400

> I hope he does not take offence at name shortening :)

Perhaps you are still not convinced how truly expensive the code path
from netif_receive_skb() to the protocol receive processing really is.

Van's channels eliminate that entire code path, and all of its data
structure references and locks, completely.  And by foregoing all of
those data references and expensive locks, we free up precious cpu
cache space and cpu cycles for other work.

Because you cannot "simulate" the extra cache lines that become
available by eliminating the code path between netif_receive_skb() and
udp_rcv(), you will really need to compare against a full implementation
of channels to say anything for certain.

And we've known that getting rid of this code path is necessary for
_AGES_.  Ask anyone who has been to any of the yearly Linux networking
conferences, over and over again we talk about a grand unified flow
cache that would turn all of the routing, netfilter, and socket lookups
into one lookup.

All of these lookups touch different data structures, have different
locking rules, and have very poor cache behavior.

Van gets the transformation into a single lookup as a side effect of
how his channels work.
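
In sketch form (all names below are invented for illustration; this is
not the actual channel code): one key, one hash table, and a single
lookup that yields the route, the filter verdict and the destination
socket at once.

#include <stddef.h>
#include <stdint.h>

struct flow_key {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
	uint8_t  proto;
};

struct flow_entry {
	struct flow_key    key;
	struct flow_entry *next;
	void              *route;	/* cached routing decision */
	void              *verdict;	/* cached netfilter verdict */
	void              *sock;	/* owning socket / channel */
};

#define FLOW_HASH_BITS 10
static struct flow_entry *flow_hash[1 << FLOW_HASH_BITS];

static unsigned int flow_hashfn(const struct flow_key *k)
{
	uint32_t h = k->saddr ^ k->daddr ^
		     ((uint32_t)k->sport << 16 | k->dport) ^ k->proto;
	h ^= h >> 16;
	return h & ((1 << FLOW_HASH_BITS) - 1);
}

/* One lookup replaces the separate route, netfilter and socket lookups. */
static struct flow_entry *flow_lookup(const struct flow_key *k)
{
	struct flow_entry *e;

	for (e = flow_hash[flow_hashfn(k)]; e != NULL; e = e->next)
		if (e->key.saddr == k->saddr && e->key.daddr == k->daddr &&
		    e->key.sport == k->sport && e->key.dport == k->dport &&
		    e->key.proto == k->proto)
			return e;
	return NULL;	/* miss: fall back to the classic layered path */
}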

It is absolutely necessary to find ways to get rid of these layering
costs.  "Layering is how you design networking protocols, not how you
implement them."

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Initial benchmarks of some VJ ideas [mmap memcpy vs copy_to_user].
  2006-05-10 19:58 ` David S. Miller
@ 2006-05-11  6:40   ` Evgeniy Polyakov
  2006-05-11  7:07     ` David S. Miller
  0 siblings, 1 reply; 11+ messages in thread
From: Evgeniy Polyakov @ 2006-05-11  6:40 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, caitlinb, kelly, rusty

On Wed, May 10, 2006 at 12:58:48PM -0700, David S. Miller (davem@davemloft.net) wrote:
> From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
> Date: Mon, 8 May 2006 16:24:22 +0400
> 
> > I hope he does not take offence at name shortening :)
> 
> Perhaps you are still not convinced how truly expensive the code path
> from netif_receive_skb() to the protocol receive processing really is.

That is why UDP was selected - by itself it costs almost nothing.
ip_rcv() + netif_receive_skb() will be present in any channel design,
but instead of searching through a unified cache keyed by
src/port/dst/port/proto we search by src/port/dst/port plus a protocol
lookup in ip_rcv().
There are no locks there except disabled preemption, and those code
paths _never_ showed up in profiles.
A grand unified cache is of course a good idea, but it will not bring a
new performance gain to Linux.
It _is_ much more convenient and the code path will be shorter, but only
because the route/dst lookup will be hidden inside the unified cache.

The memory copy and context switch were eliminated in the net channel
test, and those trash the caches much more than the ~50 lines of code
that access parts of skb->data.

> It is absolutely necessary to find ways to get rid of these layering
> costs.  "Layering is how you design networking protocols, not how you
> implement them."

If I provide a patch which allows marking a special socket as
no-protocol-and-upper-layer-lookup, processing skb->data directly
instead (e.g. copying it to userspace, or just letting recv() return
without any copy), and performance does not differ from what we have
with the layers, will that show that abstract cache trashing and the
lookup split into socket/route are not the problem?
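
(Something like the following, where SO_NOLOOKUP is a purely invented
name and option number, shown only to illustrate the intended usage:)

#include <stdio.h>
#include <sys/socket.h>

#define SO_NOLOOKUP 999	/* invented; no such option exists */

int main(void)
{
	int s = socket(AF_INET, SOCK_DGRAM, 0);
	int one = 1;

	/* would mark the socket to skip protocol/upper-layer lookups and
	 * deliver skb->data directly; fails on a real kernel, of course */
	if (setsockopt(s, SOL_SOCKET, SO_NOLOOKUP, &one, sizeof(one)) < 0)
		perror("setsockopt");
	return 0;
}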

Or have you switched from engineering to researching mode? :)

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Initial benchmarks of some VJ ideas [mmap memcpy vs copy_to_user].
  2006-05-11  6:40   ` Evgeniy Polyakov
@ 2006-05-11  7:07     ` David S. Miller
  2006-05-11  8:30       ` Evgeniy Polyakov
  0 siblings, 1 reply; 11+ messages in thread
From: David S. Miller @ 2006-05-11  7:07 UTC (permalink / raw)
  To: johnpol; +Cc: netdev, caitlinb, kelly, rusty

From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Date: Thu, 11 May 2006 10:40:37 +0400

> > It is absolutely necessary to find ways to get rid of these layering
> > costs.  "Layering is how you design networking protocols, not how you
> > implement them."
> 
> If I provide a patch which allows marking a special socket as
> no-protocol-and-upper-layer-lookup, processing skb->data directly
> instead (e.g. copying it to userspace, or just letting recv() return
> without any copy), and performance does not differ from what we have
> with the layers, will that show that abstract cache trashing and the
> lookup split into socket/route are not the problem?
> 
> Or have you switched from engineering to researching mode? :)

You test with a single socket and a single source ID - what do you
expect?  Everything is hot in the cache, as expected.

It is not research; I put cycle counter sampling all over these
spots on sparc64 a long time ago just to familiarize myself with where
the cpu spends most of its time in softint processing when there are
lots of sockets and unique remote addresses.

And most of the time from netif_receive_skb() to the meat of
{udp,tcp}_rcv() is spent touching the routing cache and socket demux
hash tables.  Add bonus costs for netfilter if that is enabled too.
Once you are past that point, for TCP, tcp_ack() is the primary cpu
cycle eater.

You can test with a single stream, but then you are only testing the
in-cache case.  Try several thousand sockets and real load from many
unique source systems; it becomes interesting then.

From profiles of a heavily used web server, what shows up is the bulk
of cpu time being in socket demux and tcp_ack().  The next bubble is
the routing cache.  I have not seen good profiles from a heavy web
server making real use of netfilter; that would be interesting as well.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Initial benchmarks of some VJ ideas [mmap memcpy vs copy_to_user].
  2006-05-11  7:07     ` David S. Miller
@ 2006-05-11  8:30       ` Evgeniy Polyakov
  2006-05-11 16:18         ` Evgeniy Polyakov
  0 siblings, 1 reply; 11+ messages in thread
From: Evgeniy Polyakov @ 2006-05-11  8:30 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, caitlinb, kelly, rusty

On Thu, May 11, 2006 at 12:07:21AM -0700, David S. Miller (davem@davemloft.net) wrote:
> You can test with a single stream, but then you are only testing the
> in-cache case.  Try several thousand sockets and real load from many
> unique source systems; it becomes interesting then.

Route lookup is an _additional_ cost for the system, but, as far as I
understand, netchannels are supposed to help with data processing, not
with destination point selection. There must still be route lookups and
socket (netchannel) lookups; they will just be in another place.

I can test the system with a large number of streams, but unfortunately
only from a small number of different src/dst IP addresses, so I cannot
benchmark route lookup performance in the layered design.

> From profiles of a heavily used web server, what shows up is the bulk
> of cpu time being in socket demux and tcp_ack().  The next bubble is
> the routing cache.  I have not seen good profiles from a heavy web
> server making real use of netfilter; that would be interesting as well.

I have several oprofile runs of a static test web server which does 2.5k
requests/sec with about 3000 sockets created/removed per second. All
connections are very small.
The machines are on a LAN, so there are no heavy route lookups, but the
socket lookup is quite heavy. The most heavyweight network function is
tcp_v4_rcv() (15th place); next are __alloc_skb() (25th place),
__kfree_skb() (35th), netif_receive_skb() (63rd), ip_rcv() (80th), and
tcp_ack() (99th). No *inet_lookup at all.
I do understand that it is a synthetic benchmark, but it is not such a
rare usage case.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Initial benchmarks of some VJ ideas [mmap memcpy vs copy_to_user].
  2006-05-11  8:30       ` Evgeniy Polyakov
@ 2006-05-11 16:18         ` Evgeniy Polyakov
  2006-05-11 18:54           ` David S. Miller
  0 siblings, 1 reply; 11+ messages in thread
From: Evgeniy Polyakov @ 2006-05-11 16:18 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, caitlinb, kelly, rusty

On Thu, May 11, 2006 at 12:30:32PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> On Thu, May 11, 2006 at 12:07:21AM -0700, David S. Miller (davem@davemloft.net) wrote:
> > You can test with a single stream, but then you are only testing the
> > in-cache case.  Try several thousand sockets and real load from many
> > unique source systems; it becomes interesting then.
>
> I can test the system with a large number of streams, but unfortunately
> only from a small number of different src/dst IP addresses, so I cannot
> benchmark route lookup performance in the layered design.

I've run it with 200 UDP sockets in the receive path. There were two
load generator machines with 100 clients each.
There are no copies of skb->data in recvmsg().
Since I only have a 1 Gbit link I'm unable to provide each client with
high bandwidth, so they send 4k chunks.
Performance dropped by half, down to 55 MB/sec, and CPU usage increased
noticeably (a slow drift from 12 to 8%, compared to 2% with one socket),
but I believe it is not because of cache effects but due to the much
higher number of syscalls per second.

Here is the profile result:
1463625  78.0003  poll_idle
19171     1.0217  _spin_lock_irqsave
15887     0.8467  _read_lock
14712     0.7840  kfree
13370     0.7125  ip_frag_queue
11896     0.6340  delay_pmtmr
11811     0.6294  _spin_lock
11723     0.6247  csum_partial
11399     0.6075  ip_frag_destroy
11063     0.5896  serial_in
10533     0.5613  skb_release_data
10524     0.5609  ip_route_input
10319     0.5499  __alloc_skb
9903      0.5278  ip_defrag
9889      0.5270  _read_unlock
9536      0.5082  _write_unlock
8639      0.4604  _write_lock
7557      0.4027  netif_receive_skb
6748      0.3596  ip_frag_intern
6534      0.3482  preempt_schedule
6220      0.3315  __kmalloc
6005      0.3200  schedule
5924      0.3157  irq_entries_start
5823      0.3103  _spin_unlock_irqrestore
5678      0.3026  ip_rcv
5410      0.2883  __kfree_skb
5056      0.2694  kmem_cache_alloc
5014      0.2672  kfree_skb
4900      0.2611  eth_type_trans
4067      0.2167  kmem_cache_free
3532      0.1882  udp_recvmsg
3531      0.1882  ip_frag_reasm
3331      0.1775  _read_lock_irqsave
3327      0.1773  ipq_kill
3304      0.1761  udp_v4_lookup_longway

I'm going to resurrect the zero-copy sniffer project [1] and create a
special socket option which would allow inserting the pages containing
skb->data into the process VMA using VM remapping tricks. Unfortunately
that requires TLB flushing, and there will probably be no significant
performance/CPU gain, if any, but I think it is the only way to provide
zero-copy receive access on hardware which does not support header split.

The other idea, which I will try if I understood you correctly, is to
create a unified cache. I think some interesting results can be
obtained from the following approach: in the softint we do not process
skb->data at all, but only extract the src/dst/sport/dport/protocol
numbers (this requires at most two cache lines; if it is not a
fast-path packet but something like ipsec, it can be processed as
usual) and create an "initial" cache entry keyed on that data. The skb
is then queued onto that "initial" cache entry, and recvmsg() in
process context later processes that entry (a rough sketch follows).
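
In kernel-style sketch form (the entry layout and function names are
invented; the queue primitives are the stock skbuff ones, and nothing
here is compilable outside a kernel tree):

#include <linux/skbuff.h>

struct initial_entry {
	__be32 saddr, daddr;		/* flow key: addresses */
	__be16 sport, dport;		/* flow key: ports */
	__u8   protocol;
	struct sk_buff_head queue;	/* init with skb_queue_head_init() */
};

/* softint path: headers only, the skb->data payload is never touched */
static void initial_enqueue(struct initial_entry *e, struct sk_buff *skb)
{
	skb_queue_tail(&e->queue, skb);
}

/* process context, from recvmsg(): full protocol processing happens here */
static struct sk_buff *initial_dequeue(struct initial_entry *e)
{
	return skb_dequeue(&e->queue);
}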

Back to the drawing board...
Thanks for discussion.

1. zero-copy sniffer
http://tservice.net.ru/~s0mbre/old/?section=projects&item=af_tlb

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Initial benchmarks of some VJ ideas [mmap memcpy vs copy_to_user].
  2006-05-11 16:18         ` Evgeniy Polyakov
@ 2006-05-11 18:54           ` David S. Miller
  2006-05-11 19:30             ` Rick Jones
  2006-05-12  7:54             ` Evgeniy Polyakov
  0 siblings, 2 replies; 11+ messages in thread
From: David S. Miller @ 2006-05-11 18:54 UTC (permalink / raw)
  To: johnpol; +Cc: netdev, caitlinb, kelly, rusty

From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Date: Thu, 11 May 2006 20:18:15 +0400

> Here is the profile result:
> 1463625  78.0003  poll_idle
> 19171     1.0217  _spin_lock_irqsave
> 15887     0.8467  _read_lock
> 14712     0.7840  kfree
> 13370     0.7125  ip_frag_queue
> 11896     0.6340  delay_pmtmr
> 11811     0.6294  _spin_lock
> 11723     0.6247  csum_partial
> 11399     0.6075  ip_frag_destroy
> 11063     0.5896  serial_in
> 10533     0.5613  skb_release_data
> 10524     0.5609  ip_route_input
> 10319     0.5499  __alloc_skb

Too bad spinlocks are not inlined any longer; it makes oprofile
output so much less useful.

Also, since you test UDP with >MTU sized sends, you add fragmentation
into the mix, yet another variable that you won't see with TCP :-)

BTW you make another massively critical error in your analysis of TCP
profiles.

You mention that "tcp_v4_rcv()" shows up in your profiles and not
__inet_lookup().  __inet_lookup() is inlined, and thus its cost
shows up as "tcp_v4_rcv()".  I find such an oversight amazing for
someone as careful about details as you are :-)

I would suggest looking at instruction-level profile hits; that makes
such mistakes in analysis almost impossible :-)
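
(With the oprofile tools of the day that means roughly the following;
the vmlinux path is illustrative:)

opcontrol --vmlinux=/path/to/vmlinux	# uncompressed kernel image
opcontrol --start
# ... run the workload ...
opcontrol --stop
opreport --symbols			# per-symbol hits
opannotate --assembly /path/to/vmlinux	# per-instruction hits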

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Initial benchmarks of some VJ ideas [mmap memcpy vs copy_to_user].
  2006-05-11 18:54           ` David S. Miller
@ 2006-05-11 19:30             ` Rick Jones
  2006-05-12  7:54             ` Evgeniy Polyakov
  1 sibling, 0 replies; 11+ messages in thread
From: Rick Jones @ 2006-05-11 19:30 UTC (permalink / raw)
  To: David S. Miller; +Cc: johnpol, netdev, caitlinb, kelly, rusty

David S. Miller wrote:
> From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
> Date: Thu, 11 May 2006 20:18:15 +0400
> 
> 
>>Here is the profile result:
>>1463625  78.0003  poll_idle
>>19171     1.0217  _spin_lock_irqsave
>>15887     0.8467  _read_lock
>>14712     0.7840  kfree
>>13370     0.7125  ip_frag_queue
>>11896     0.6340  delay_pmtmr
>>11811     0.6294  _spin_lock
>>11723     0.6247  csum_partial
>>11399     0.6075  ip_frag_destroy
>>11063     0.5896  serial_in
>>10533     0.5613  skb_release_data
>>10524     0.5609  ip_route_input
>>10319     0.5499  __alloc_skb
> 
> 
> Too bad spinlocks are not inlined any longer; it makes oprofile
> output so much less useful.

But it is nice to see how much time is being spent in "locking"

rick jones


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Initial benchmarks of some VJ ideas [mmap memcpy vs copy_to_user].
  2006-05-11 18:54           ` David S. Miller
  2006-05-11 19:30             ` Rick Jones
@ 2006-05-12  7:54             ` Evgeniy Polyakov
  1 sibling, 0 replies; 11+ messages in thread
From: Evgeniy Polyakov @ 2006-05-12  7:54 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, caitlinb, kelly, rusty

On Thu, May 11, 2006 at 11:54:09AM -0700, David S. Miller (davem@davemloft.net) wrote:
> BTW you make another massively critical error in your analysis of TCP
> profiles.
> 
> You mention that "tcp_v4_rcv()" shows up in your profiles and not
> __inet_lookup().  __inet_lookup() is inlined, and thus its cost
> shows up as "tcp_v4_rcv()".  I find such an oversight amazing for
> someone as careful about details as you are :-)

Ugh, my fault.
But tcp_v4_rcv() also does a lot of other things, which is more likely
what pushes this function up in the profile statistics :)

> I would suggest looking at instruction-level profile hits; that makes
> such mistakes in analysis almost impossible :-)

It is much more challenging than running oprofile, so it will be
postponed for a while :)

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread

Thread overview: 11+ messages
2006-05-08 12:24 Initial benchmarks of some VJ ideas [mmap memcpy vs copy_to_user] Evgeniy Polyakov
2006-05-08 19:51 ` Evgeniy Polyakov
2006-05-08 20:15   ` David S. Miller
2006-05-10 19:58 ` David S. Miller
2006-05-11  6:40   ` Evgeniy Polyakov
2006-05-11  7:07     ` David S. Miller
2006-05-11  8:30       ` Evgeniy Polyakov
2006-05-11 16:18         ` Evgeniy Polyakov
2006-05-11 18:54           ` David S. Miller
2006-05-11 19:30             ` Rick Jones
2006-05-12  7:54             ` Evgeniy Polyakov
