netdev.vger.kernel.org archive mirror
From: "David S. Miller" <davem@davemloft.net>
To: johnpol@2ka.mipt.ru
Cc: kelly@au1.ibm.com, rusty@rustcorp.com.au, netdev@vger.kernel.org
Subject: Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
Date: Thu, 27 Apr 2006 13:09:18 -0700 (PDT)
Message-ID: <20060427.130918.65400512.davem@davemloft.net>
In-Reply-To: <20060427115126.GA11570@2ka.mipt.ru>

From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Date: Thu, 27 Apr 2006 15:51:26 +0400

> There are some caveats here, found while developing the zero-copy sniffer
> [1]. The project's goal was to remap skbs into userspace in real time.
> While the absolute numbers (posted to netdev@) were really high, this only
> applies to read-only applications. As was shown in the IOAT thread,
> data must be warmed in the caches, so reading from the mapped area will be as
> fast as memcpy() (read+write), and copy_to_user() is actually almost equal
> to memcpy() (benchmarks were posted to netdev@). And we must add the
> remapping overhead.

Yes, all of these issues are related quite strongly.  Thanks for
making the connection explicit.

But the mapping overhead is zero for this net channel stuff, at
least as it is implemented and designed by Kelly.  The ring buffer is
set up ahead of time in the user's address space, and a ring of
buffers within that area is given to the networking card.

We remember the translations here, so no get_user_pages() on each
transfer and garbage like that.  And yes this all harks back to the
issues that are discussed in Chapter 5 of Networking Algorithmics.
But the core thing to understand is that by defining a new API and
setting up the buffer pool ahead of time, we avoid all of the
get_user_pages() overhead while retaining full kernel/user protection.
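To make the "remember the translations" point concrete, here is a small user-space C sketch, not code from Kelly's patch: a region is set up once, and afterwards each buffer address is plain arithmetic on the remembered base, so nothing like get_user_pages() runs per transfer. The names chan_pool, chan_pool_init() and chan_pool_buf() are hypothetical, and calloc() merely stands in for the one-time mapping into the user's address space.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical pool: one contiguous region carved into fixed-size
 * packet buffers. In the real design the region would be mapped into
 * the user's address space once and the buffer addresses handed to
 * the NIC; calloc() stands in for that one-time setup here. */
struct chan_pool {
    unsigned char *base;   /* start of the region, remembered at setup */
    size_t buf_size;       /* size of each packet buffer */
    size_t nbufs;          /* number of buffers in the region */
};

/* One-time setup: allocate the region and remember its layout. */
int chan_pool_init(struct chan_pool *p, size_t buf_size, size_t nbufs)
{
    p->base = calloc(nbufs, buf_size);
    if (!p->base)
        return -1;
    p->buf_size = buf_size;
    p->nbufs = nbufs;
    return 0;
}

/* Per-packet path: the buffer address is computed from the remembered
 * base, with no page lookup or pinning on each transfer. */
unsigned char *chan_pool_buf(const struct chan_pool *p, size_t idx)
{
    if (idx >= p->nbufs)
        return NULL;
    return p->base + idx * p->buf_size;
}

/* Teardown: release the region. */
void chan_pool_destroy(struct chan_pool *p)
{
    free(p->base);
    p->base = NULL;
    p->nbufs = 0;
}
```

The cost moves entirely to setup time; the per-packet path touches only the pool descriptor.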

Evgeniy, the difference between this and your work is that you did not
have an intelligent piece of hardware that could be told to recognize
flows and to only put packets for a specific flow into that flow's
buffer pool.
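The flow recognition the hardware is asked to do can be modeled in software roughly as follows. This is a sketch under stated assumptions: an IPv4 5-tuple key, a small linear-probing table, and hypothetical names (flow_key, flow_add(), flow_lookup()); a real NIC would implement its own classifier.

```c
#include <stdint.h>

/* Hypothetical flow key: the IPv4 5-tuple a NIC could be told to match. */
struct flow_key {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
    uint8_t proto;
};

#define FLOW_SLOTS 64   /* power of two; illustrative size */

struct flow_entry {
    struct flow_key key;
    int channel;        /* index of that flow's buffer pool */
    int used;
};

static struct flow_entry flow_table[FLOW_SLOTS];

static unsigned flow_hash(const struct flow_key *k)
{
    uint32_t h = k->saddr ^ (k->daddr * 2654435761u)
               ^ (((uint32_t)k->sport << 16) | k->dport) ^ k->proto;
    return h & (FLOW_SLOTS - 1);
}

/* Compare field by field so struct padding never matters. */
static int key_eq(const struct flow_key *a, const struct flow_key *b)
{
    return a->saddr == b->saddr && a->daddr == b->daddr &&
           a->sport == b->sport && a->dport == b->dport &&
           a->proto == b->proto;
}

/* Register a flow so its packets go only to its own buffer pool. */
int flow_add(const struct flow_key *k, int channel)
{
    unsigned i = flow_hash(k);
    for (unsigned n = 0; n < FLOW_SLOTS; n++, i = (i + 1) & (FLOW_SLOTS - 1)) {
        if (!flow_table[i].used || key_eq(&flow_table[i].key, k)) {
            flow_table[i].key = *k;
            flow_table[i].channel = channel;
            flow_table[i].used = 1;
            return 0;
        }
    }
    return -1; /* table full */
}

/* Per-packet classification: returns the flow's channel, or -1 when
 * no channel is registered for this packet. */
int flow_lookup(const struct flow_key *k)
{
    unsigned i = flow_hash(k);
    for (unsigned n = 0; n < FLOW_SLOTS; n++, i = (i + 1) & (FLOW_SLOTS - 1)) {
        if (!flow_table[i].used)
            return -1;
        if (key_eq(&flow_table[i].key, k))
            return flow_table[i].channel;
    }
    return -1;
}
```

A lookup miss (-1) stands for a packet with no registered channel, which in this model would take the ordinary stack path.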

> If we want to DMA data from the NIC into a premapped userspace area, we will
> run into problems with message sizes, misalignment, slow reads and so on, so
> preallocation has even more problems.

I do not really think this is an issue: we put the full packet into
user space and tell it where the offset to the actual data is.
We'll do the same things we do today to try and get the data area
aligned.  The user can do whatever is logical and relevant on his end
to deal with strange cases.

In fact we can specify that the card has to take some care to get the
data area of the packet aligned on, say, an 8-byte boundary or something
like that.  When we don't have hardware assist, we are going to be doing
copies.
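A minimal sketch of the "full packet plus offset" scheme with the 8-byte alignment rule, assuming a hypothetical descriptor (chan_desc) and that the header length is known when the buffer position is chosen: the frame starts at a small pad so that the payload lands on the boundary, and the user is given both offsets.

```c
#include <stddef.h>

#define DATA_ALIGN 8  /* the "say, an 8-byte boundary" from the text */

/* Hypothetical per-packet descriptor handed to user space: the whole
 * packet lives in the buffer, and data_off tells the user where the
 * payload starts. */
struct chan_desc {
    size_t pkt_off;   /* where the frame starts within the buffer */
    size_t data_off;  /* where the payload (after all headers) begins */
    size_t len;       /* total frame length */
};

/* Choose where to start the frame so that, after hdr_len bytes of
 * headers, the payload lands on a DATA_ALIGN boundary. */
struct chan_desc place_packet(size_t hdr_len, size_t frame_len)
{
    struct chan_desc d;
    d.pkt_off = (DATA_ALIGN - hdr_len % DATA_ALIGN) % DATA_ALIGN;
    d.data_off = d.pkt_off + hdr_len;
    d.len = frame_len;
    return d;
}
```

For a 54-byte Ethernet + IPv4 + TCP header (14 + 20 + 20), the frame starts 2 bytes into the buffer and the payload begins at offset 56.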

> This change also requires significant changes in applications, at least
> until recv/send are changed, which is not the best thing to do.

This is exactly the point: we can only do a good job and get zero
copy if we can change the interfaces, and that's exactly what we're
doing here.

> I do think that the significant win in VJ's tests comes not from remapping
> and cache-oriented changes, but from moving all protocol processing into
> the process' context.

I partly disagree.  The biggest win is eliminating all of the control
overhead (all of "softint RX + protocol demux + IP route lookup +
socket lookup" is turned into a single flow demux), and the SMP-safe
data structure, which makes it realistic enough to always move the bulk
of the packet work to the socket's home cpu.
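The message does not spell out the SMP-safe structure; one common way to get one, assumed here, is a single-producer/single-consumer ring in which the receive side only advances head and the socket's cpu only advances tail, so neither side takes a lock. A C11 sketch with hypothetical names, not taken from the patch:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 256   /* must be a power of two; illustrative */

/* Single-producer/single-consumer ring: the producer only writes head,
 * the consumer only writes tail, so no lock is needed. */
struct chan_ring {
    _Atomic size_t head;        /* next slot the producer fills */
    _Atomic size_t tail;        /* next slot the consumer drains */
    void *slot[RING_SIZE];
};

void ring_init(struct chan_ring *r)
{
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer side (e.g. the driver's receive path). */
bool ring_push(struct chan_ring *r, void *pkt)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SIZE)
        return false;                       /* full */
    r->slot[head & (RING_SIZE - 1)] = pkt;
    /* Publish the filled slot before advancing head. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side (the socket's home cpu). */
void *ring_pop(struct chan_ring *r)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return NULL;                        /* empty */
    void *pkt = r->slot[tail & (RING_SIZE - 1)];
    /* Release the slot back to the producer only after reading it. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return pkt;
}
```

A production version would probably also place head and tail on separate cache lines so the two CPUs do not bounce a single line back and forth.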

I do not think userspace protocol implementation buys enough to
justify it.  We have to do the protection switch in and out of kernel
space anyways, so why not still do the protected protocol processing
work in the kernel?  It is still being done on the user's behalf,
contributes to his time slice, and avoids all of the terrible issues
of userspace protocol implementations.

So in my mind, the optimal situation from both a protection preservation
and also a performance perspective is net channels to kernel socket
protocol processing, buffers DMA'd directly into userspace if hardware
assist is present.

> I fully agree with Dave that it must be implemented step by step, and
> the most significant step, IMHO, is moving protocol processing into the
> socket's "place". This will force netfilter changes, but I do think that
> for the proof-of-concept code we can turn it off.

And I also want to note that even if the whole idea explodes and
cannot be made to work, there are good arguments for transitioning
to SKB'less drivers for their own sake.  So work will really not
be lost.

Let's have 100 different implementations of net channels! :-)


Thread overview: 90+ messages
2006-04-26 11:47 [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch Kelly Daly
2006-04-26  7:33 ` David S. Miller
2006-04-27  3:31   ` Kelly Daly
2006-04-27  6:25     ` David S. Miller
2006-04-27 11:51       ` Evgeniy Polyakov
2006-04-27 20:09         ` David S. Miller [this message]
2006-04-28  6:05           ` Evgeniy Polyakov
2006-05-04  2:59       ` Kelly Daly
2006-05-04 23:22         ` David S. Miller
2006-05-05  1:31           ` Rusty Russell
2006-04-26  7:59 ` David S. Miller
2006-05-04  7:28   ` Kelly Daly
2006-05-04 23:11     ` David S. Miller
2006-05-05  2:48       ` Kelly Daly
2006-05-16  1:02         ` Kelly Daly
2006-05-16  1:05           ` David S. Miller
2006-05-16  1:15             ` Kelly Daly
2006-05-16  5:16           ` David S. Miller
2006-06-22  2:05             ` Kelly Daly
2006-06-22  3:58               ` James Morris
2006-06-22  4:31                 ` Arnaldo Carvalho de Melo
2006-06-22  4:36                 ` YOSHIFUJI Hideaki / 吉藤英明
2006-07-08  0:05               ` David Miller
2006-05-16  6:19           ` [1/1] netchannel subsystem Evgeniy Polyakov
2006-05-16  6:57             ` David S. Miller
2006-05-16  6:59               ` Evgeniy Polyakov
2006-05-16  7:06                 ` David S. Miller
2006-05-16  7:15                   ` Evgeniy Polyakov
2006-05-16  7:07                 ` Evgeniy Polyakov
2006-05-16 17:34               ` [1/1] Netchannel subsyste Evgeniy Polyakov
2006-05-18 10:34                 ` Netchannel subsystem update Evgeniy Polyakov
2006-05-20 15:52                   ` Evgeniy Polyakov
2006-05-22  6:06                     ` David S. Miller
2006-05-22 16:34                       ` [Netchannel] Full TCP receiving support Evgeniy Polyakov
2006-05-24  9:38                         ` Evgeniy Polyakov
  -- strict thread matches above, loose matches on Subject: below --
2006-04-26 16:57 [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch Caitlin Bestler
2006-04-26 19:23 ` David S. Miller
2006-04-26 19:30 Caitlin Bestler
2006-04-26 19:46 ` Jeff Garzik
2006-04-26 22:40   ` David S. Miller
2006-04-27  3:40 ` Rusty Russell
2006-04-27  4:58   ` James Morris
2006-04-27  6:16     ` David S. Miller
2006-04-27  6:17   ` David S. Miller
2006-04-26 20:20 Caitlin Bestler
2006-04-26 22:35 ` David S. Miller
2006-04-26 22:53 Caitlin Bestler
2006-04-26 22:59 ` David S. Miller
2006-04-27  1:02 Caitlin Bestler
2006-04-27  6:08 ` David S. Miller
2006-04-27  6:17   ` Andi Kleen
2006-04-27  6:27     ` David S. Miller
2006-04-27  6:41       ` Andi Kleen
2006-04-27  7:52         ` David S. Miller
2006-04-27 21:12 Caitlin Bestler
2006-04-28  6:10 ` Evgeniy Polyakov
2006-04-28  7:20   ` David S. Miller
2006-04-28  7:32     ` Evgeniy Polyakov
2006-04-28 18:20       ` David S. Miller
2006-04-28  8:24 ` Rusty Russell
2006-04-28 19:21   ` David S. Miller
2006-04-28 22:04     ` Rusty Russell
2006-04-28 22:38       ` David S. Miller
2006-04-29  0:10         ` Rusty Russell
2006-04-28 15:59 Caitlin Bestler
2006-04-28 16:12 ` Evgeniy Polyakov
2006-04-28 19:09   ` David S. Miller
2006-04-28 17:02 Caitlin Bestler
2006-04-28 17:18 ` Stephen Hemminger
2006-04-28 17:29   ` Evgeniy Polyakov
2006-04-28 17:41     ` Stephen Hemminger
2006-04-28 17:55       ` Evgeniy Polyakov
2006-04-28 19:16         ` David S. Miller
2006-04-28 19:49           ` Stephen Hemminger
2006-04-28 19:59             ` Evgeniy Polyakov
2006-04-28 22:00               ` David S. Miller
2006-04-29 13:54                 ` Evgeniy Polyakov
     [not found]                 ` <20060429124451.GA19810@2ka.mipt.ru>
2006-05-01 21:32                   ` David S. Miller
2006-05-02  7:08                     ` Evgeniy Polyakov
2006-04-28 19:52           ` Evgeniy Polyakov
2006-04-28 19:10   ` David S. Miller
2006-04-28 20:46     ` Brent Cook
2006-04-28 17:25 ` Evgeniy Polyakov
2006-04-28 19:14   ` David S. Miller
2006-04-28 17:55 Caitlin Bestler
2006-04-28 22:17 ` Rusty Russell
2006-04-28 22:40   ` David S. Miller
2006-04-29  0:22     ` Rusty Russell
2006-04-29  6:46       ` David S. Miller
2006-04-28 23:45 Caitlin Bestler
