netdev.vger.kernel.org archive mirror
* zero copy for relay server
@ 2011-03-28 16:27 Viral Mehta
  2011-03-28 16:52 ` Eric Dumazet
  0 siblings, 1 reply; 8+ messages in thread
From: Viral Mehta @ 2011-03-28 16:27 UTC (permalink / raw)
  To: netdev@vger.kernel.org

Hi,
I am implementing an application that acts
as nothing but a relay server.

Relay server accepts connection from machine A.
It also accepts connection from Machine B.

Machines A and B are on different LANs/subnets.
Now, there are two connections.
What the server is supposed to do is recv() data from machine A and send() the
same data to machine B.

Pseudo code is something like:
while (1)
{
recvagain:
   /* buf is a byte array of at least 8192 bytes */
   n = recv(incoming_fd, buf, 8192, ...);
   if (n < 0)
        goto recvagain;
   send(outgoing_fd, buf, n, ...);
}

Now the question is:
I want to avoid the kernel-user copy for such an application.
I found a syscall called "sendfile"; I wanted to know whether anything
similar exists in the kernel that can take 2 socket descriptors.

If not, is it possible? I would like to implement it if someone
can suggest some pointers.

Thanks,
Viral

The contents of this e-mail and any attachment(s) may contain confidential or privileged information for the intended recipient(s). Unintended recipients are prohibited from taking action on the basis of information in this e-mail and  using or disseminating the information,  and must notify the sender and delete it from their system. L&T Infotech will not accept responsibility or liability for the accuracy or completeness of, or the presence of any virus or disabling code in this e-mail"

______________________________________________________________________


* Re: zero copy for relay server
  2011-03-28 16:27 zero copy for relay server Viral Mehta
@ 2011-03-28 16:52 ` Eric Dumazet
  2011-03-28 18:18   ` Viral Mehta
  0 siblings, 1 reply; 8+ messages in thread
From: Eric Dumazet @ 2011-03-28 16:52 UTC (permalink / raw)
  To: Viral Mehta; +Cc: netdev@vger.kernel.org

On Monday 28 March 2011 at 21:57 +0530, Viral Mehta wrote:
> Hi,
> I am implementing a particular application where
> my application acts nothing but like Relay Server.
> 
> Relay server accepts connection from machine A.
> It also accepts connection from Machine B.
> 
> Machine A and B are on different LAN/subnets.
> Now, there are two connections.
> What server is supposed to do is RECV packets from machine A and SEND same
> to machine B.
> 
> Pseudo Code is something like,
> while(1)
> {
> recvagain:
>    n = recv(incoming_fd, &buf, 8192, ...)
>    if(n < 0)
>         goto recvagain;
>    send(outgoing_fd, &buf, n, ...);
> }
> 
> Now the question is,
> I want to avoid kernel-user copy for such application.
> I found that a syscall like "sendfile"; I wanted to know if there is any
> similar thing exists in-kernel which can take 2 socket descriptors....
> 
> If not, is it possible ? I would like to implement the same if someone
> can suggest some pointers.

The Linux way (if you want to avoid netfilter stuff and use userland code)
is to use the splice() system call, with a pipe between the two sockets.


/* skeleton : must add error checking to exit the loop properly */
int fds[2];
pipe(fds);

while (1) {
	splice(incoming_fd, NULL, fds[1], NULL, 65536, 0);
	splice(fds[0], NULL, outgoing_fd, NULL, 65536, 0);
}

This way, messages don't cross the kernel<>user boundary.

The pipe is acting as a buffer between the two sockets.





* RE: zero copy for relay server
  2011-03-28 16:52 ` Eric Dumazet
@ 2011-03-28 18:18   ` Viral Mehta
  2011-03-28 18:34     ` Eric Dumazet
  0 siblings, 1 reply; 8+ messages in thread
From: Viral Mehta @ 2011-03-28 18:18 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev@vger.kernel.org


>From: Eric Dumazet [eric.dumazet@gmail.com]

>linux way (if you want to avoid netfilter stuff and use userland code)
>is to use splice() system call, and a pipe between two sockets.

Yes, I want to avoid netfilter stuff.
I know of it, but I think it is more complicated than an application programmer should need to know.

>/* skeleton : must add error checking to exit the loop properly */
>int fds[2];
>pipe(fds);
>
>while (1) {
>        splice(incoming_fd, NULL, fds[1], NULL, 65536, 0);
>        splice(fds[0], NULL, outgoing_fd, NULL, 65536, 0);
>}

Still, these are two system calls.
In addition to this, there are many things to handle:
1. If incoming_fd is blocking, then it will block until 64K of data is read. Why so?
2. I believe the underlying pipe that we are using will also have some size limit
    (like 4K or 64K in user space, not sure)

So, all in all,
why can't we have just one system call which really transfers "length"
bytes of data from one socket to another? Recv "length" bytes of data
from socket A and send them to socket B.

I wanted to understand whether there are any limitations or concerns that
explain why we still do not have such a system call.

>This way, messages dont cross kernel<>user boundary.

>The pipe is acting as a buffer between the two sockets.





* RE: zero copy for relay server
  2011-03-28 18:18   ` Viral Mehta
@ 2011-03-28 18:34     ` Eric Dumazet
  2011-03-29  2:00       ` Changli Gao
  0 siblings, 1 reply; 8+ messages in thread
From: Eric Dumazet @ 2011-03-28 18:34 UTC (permalink / raw)
  To: Viral Mehta; +Cc: netdev@vger.kernel.org

On Monday 28 March 2011 at 23:48 +0530, Viral Mehta wrote:

> Still, these are two system calls.

Yes. Is it a problem ? What kind ?

> In addition to this, many things to handle,
> 1. if the incoming_fd is blocking, then it will block till 64K data read. Why so ?

I don't think so. Try it; it behaves like read() or recv().
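
One way to try it is the small experiment sketched here. It assumes a kernel where AF_UNIX stream sockets support splice() (used only so the check needs no network setup), and splice_short_read_demo() is a hypothetical name. It queues 5 bytes on a blocking socket and then asks splice() for 65536.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical demo: queue 5 bytes on a blocking stream socket, request
 * 65536 bytes with splice(), and return what splice() reports.
 * Returns -1 if any setup step fails.
 */
static ssize_t splice_short_read_demo(void)
{
	int sk[2], p[2];
	ssize_t n = -1;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sk) < 0)
		return -1;
	if (pipe(p) == 0) {
		if (write(sk[0], "hello", 5) == 5)
			n = splice(sk[1], NULL, p[1], NULL, 65536, 0);
		close(p[0]);
		close(p[1]);
	}
	close(sk[0]);
	close(sk[1]);
	return n;
}
```

If splice() behaves like read() or recv() here, the call returns as soon as the queued bytes have been moved instead of waiting for the full 64K.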

> 2. I believe underlying PIPE that we are using will also have some size limit
>     (like in user space 4K or 64K, not sure)

What kind of socket is able to deliver more than 64K frames ?

> 
> So, all in all
> Why cant we have just one system call which really transfers "length"
> bytes of data from one socket to another ? Recv "length" bytes of data
> from socket A and send to socket B.
> 
> I wanted to understand if there are any limitations or concerns that we still do
> not have any such system call .... ?
> 

The answer is : Once you try to implement this, you'll discover it'll be
splice() based, using pipe as a buffer between the sockets.

sendfile() is based on top of splice(), but it's faster to use splice().





* Re: zero copy for relay server
  2011-03-28 18:34     ` Eric Dumazet
@ 2011-03-29  2:00       ` Changli Gao
  2011-03-29  4:23         ` Eric Dumazet
  0 siblings, 1 reply; 8+ messages in thread
From: Changli Gao @ 2011-03-29  2:00 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Viral Mehta, netdev@vger.kernel.org

On Tue, Mar 29, 2011 at 2:34 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Monday 28 March 2011 at 23:48 +0530, Viral Mehta wrote:
>
>> Still, these are two system calls.
>
> Yes. Is it a problem ? What kind ?

I think he is concerned about the overhead of system calls. To save a
system call, I think you could implement something like this:

splice2(infd, outfd, pipefd, ...)

You would then need to maintain the pipes yourself.

>> 2. I believe underlying PIPE that we are using will also have some size limit
>>     (like in user space 4K or 64K, not sure)
>
> What kind of socket is able to deliver more than 64K frames ?

You can enlarge the size with fcntl(pipefd, F_SETPIPE_SZ,...).
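
A minimal sketch of that call, assuming Linux 2.6.35 or later (where F_SETPIPE_SZ and F_GETPIPE_SZ are available); grow_pipe() is a hypothetical helper name. The kernel may round the request up, and unprivileged processes are capped by /proc/sys/fs/pipe-max-size, so the helper reads the capacity back instead of trusting the requested value.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: request a pipe capacity of at least `bytes`.
 * Returns the capacity the kernel actually reports afterwards,
 * or -1 on failure with errno set.
 */
static int grow_pipe(int pipefd, int bytes)
{
	if (fcntl(pipefd, F_SETPIPE_SZ, bytes) < 0)
		return -1;
	return fcntl(pipefd, F_GETPIPE_SZ);
}
```

Either end of the pipe can be passed as pipefd, for example grow_pipe(fds[1], 1 << 17) to ask for 128KB.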

>
>>
>> So, all in all
>> Why cant we have just one system call which really transfers "length"
>> bytes of data from one socket to another ? Recv "length" bytes of data
>> from socket A and send to socket B.
>>
>> I wanted to understand if there are any limitations or concerns that we still do
>> not have any such system call .... ?
>>
>
> The answer is : Once you try to implement this, you'll discover it'll be
> splice() based, using pipe as a buffer between the sockets.

Yes, but I think the internal buffer of a pipe consists of pages, which limits
its use in a socket context. See skb_splice_bits(); I am afraid a copy
usually happens. Maybe the pipe buffer should be able to hold any data, given
proper teardown functions.

>
> sendfile() is based on top of splice(), but it's faster to use splice().
>
>

Why? Thanks.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)


* Re: zero copy for relay server
  2011-03-29  2:00       ` Changli Gao
@ 2011-03-29  4:23         ` Eric Dumazet
  2011-03-29 11:28           ` Changli Gao
  0 siblings, 1 reply; 8+ messages in thread
From: Eric Dumazet @ 2011-03-29  4:23 UTC (permalink / raw)
  To: Changli Gao; +Cc: Viral Mehta, netdev@vger.kernel.org

On Tuesday 29 March 2011 at 10:00 +0800, Changli Gao wrote:
> On Tue, Mar 29, 2011 at 2:34 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:

> I think he concerns the overhead of system calls. In order to omit a
> system call, I think you can implement sth. like this:
> 
> splice2(infd, outfd, pipefd, ...)
> 

Yes, but since no numbers have been given and no code has been written yet,
I am asking the question.

Giving 4 file descriptors to a single syscall sounds convoluted.


> What you need do is maintaining pipes by yourself.
> 
> >> 2. I believe underlying PIPE that we are using will also have some size limit
> >>     (like in user space 4K or 64K, not sure)
> >
> > What kind of socket is able to deliver more than 64K frames ?
> 
> You can enlarge the size with fcntl(pipefd, F_SETPIPE_SZ,...).
> 

Not really useful, since splice() internals use automatic arrays sized
with PIPE_DEF_BUFFERS.

You can enlarge the size of the pipe, but we are still limited to at most
64KB in skb_splice_bits(), for example [on x86 with its 4KB pages].

This doesn't matter, since skbs are limited to 16 pages (or 64KB) anyway.

F_SETPIPE_SZ can only increase the size of the pipe ring buffer (which should be
empty or contain at most one skb), thereby increasing dcache needs.

 
> >
> > sendfile() is based on top of splice(), but it's faster to use splice().
> >
> >
> 
> Why? Thanks.
> 

The real cost is not syscall overhead, but context switches and cache
misses. Adding a "super syscall" adds kernel text and increases icache
misses on real machines (I am not talking about machines used in micro
benchmarks).

Most likely, GRO can significantly speed up this workload, while avoiding a
syscall won't.





* Re: zero copy for relay server
  2011-03-29  4:23         ` Eric Dumazet
@ 2011-03-29 11:28           ` Changli Gao
  2011-03-29 14:13             ` Eric Dumazet
  0 siblings, 1 reply; 8+ messages in thread
From: Changli Gao @ 2011-03-29 11:28 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Viral Mehta, netdev@vger.kernel.org

On Tue, Mar 29, 2011 at 12:23 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tuesday 29 March 2011 at 10:00 +0800, Changli Gao wrote:
>> On Tue, Mar 29, 2011 at 2:34 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>> I think he concerns the overhead of system calls. In order to omit a
>> system call, I think you can implement sth. like this:
>>
>> splice2(infd, outfd, pipefd, ...)
>>
>
> Yes, but given no numbers are given, and no code yet written, I ask the
> question.
>
> Giving 4 file descriptors to a single syscall sounds convoluted.
>

Using a pipe buffer with two fds is a waste of an fd. In fact, I once
posted a patch which extends pipe(2) to return an O_RDWR fd when
NULL is passed in.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)


* Re: zero copy for relay server
  2011-03-29 11:28           ` Changli Gao
@ 2011-03-29 14:13             ` Eric Dumazet
  0 siblings, 0 replies; 8+ messages in thread
From: Eric Dumazet @ 2011-03-29 14:13 UTC (permalink / raw)
  To: Changli Gao; +Cc: Viral Mehta, netdev@vger.kernel.org

On Tuesday 29 March 2011 at 19:28 +0800, Changli Gao wrote:

> It is a waste of fd using pipe buffer with two fds. In fact, I had
> ever posted a patch, which extends pipe(2) to return a O_RDWR fd when
> NULL is passed in.
> 

Oh well, one extra fd is less than 256 bytes.

Adding a syscall means you force users to have latest kernel.

All this is hypothetical, nobody gave performance numbers demonstrating
splice()/splice() is slow enough we have to optimize it and add kernel
bloat...



