* Re: [PATCH] Implementation of the sendgroup() system call
       [not found] <49FE47A1.7070700@uwaterloo.ca>
@ 2009-05-04  7:13 ` Andi Kleen
  2009-05-04  7:30   ` Avi Kivity
                     ` (2 more replies)
       [not found] ` <49FE9C8C.6090705@cosmosbay.com>
  1 sibling, 3 replies; 12+ messages in thread
From: Andi Kleen @ 2009-05-04  7:13 UTC (permalink / raw)
  To: Elad Lahav; +Cc: linux-kernel, netdev

Elad Lahav <elahav@uwaterloo.ca> writes:

> The attached patch contains an implementation of sendgroup(), a system
> call that allows a UDP packet to be transmitted efficiently to
> multiple recipients. Use cases for this system call include
> live-streaming and multi-player online games.
> The basic idea is that the caller maintains a group - a list of IP
> addresses and UDP ports - and calls sendgroup() with the group list
> and a common payload. Optionally, the call allows for per-recipient
> data to be prepended or appended to the shared block. The data is
> copied once in the kernel into an allocated page, and the
> per-recipient socket buffers point to that page. Savings come from
> avoiding both the multiple calls and the multiple copies of the data
> required with regular socket operations.
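
For reference, the baseline that sendgroup() is meant to replace can be sketched in user space as below. This is a minimal illustration and not code from the patch: the function name `send_to_group` and the `prefix`/`suffix` dictionaries are hypothetical, chosen only to mirror the "per-recipient data prepended or appended" description. Each recipient costs one sendmsg() call, and the kernel copies the shared payload again for every call.

```python
import socket


def send_to_group(sock, group, payload, prefix=None, suffix=None):
    """Send `payload` to every (host, port) in `group`, one sendmsg() each.

    `prefix` and `suffix` are optional dicts mapping a recipient address to
    per-recipient bytes placed before or after the shared payload.
    Returns the number of datagrams handed to the kernel.
    """
    prefix = prefix or {}
    suffix = suffix or {}
    sent = 0
    for addr in group:
        # Gather list: per-recipient header, shared payload, per-recipient
        # footer. Empty pieces are dropped from the list.
        pieces = (prefix.get(addr, b""), payload, suffix.get(addr, b""))
        buffers = [p for p in pieces if p]
        # One system call, and one copy of the payload, per recipient.
        sock.sendmsg(buffers, [], 0, addr)
        sent += 1
    return sent
```

With a group of N addresses this makes N system calls and N payload copies, which are the two costs the proposal tries to reduce to one each.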

My guess is that it's more the copies than the calls? It sounds like
you want sendfile() for UDP. I think that would be a cleaner solution
than such a specific hack for your application. It would
have the advantage of saving the first copy too and be 
truly zero copy on capable NICs.

Or perhaps simply send to a local multicast group and let
some netfilter module turn that into regular UDP.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH] Implementation of the sendgroup() system call
  2009-05-04  7:13 ` [PATCH] Implementation of the sendgroup() system call Andi Kleen
@ 2009-05-04  7:30   ` Avi Kivity
  2009-05-04  9:53     ` Andi Kleen
  2009-05-04  7:42   ` Rémi Denis-Courmont
  2009-05-04 13:44   ` Elad Lahav
  2 siblings, 1 reply; 12+ messages in thread
From: Avi Kivity @ 2009-05-04  7:30 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Elad Lahav, linux-kernel, netdev

Andi Kleen wrote:
> Elad Lahav <elahav@uwaterloo.ca> writes:
>
>   
>> The attached patch contains an implementation of sendgroup(), a system
>> call that allows a UDP packet to be transmitted efficiently to
>> multiple recipients. Use cases for this system call include
>> live-streaming and multi-player online games.
>> The basic idea is that the caller maintains a group - a list of IP
>> addresses and UDP ports - and calls sendgroup() with the group list
>> and a common payload. Optionally, the call allows for per-recipient
>> data to be prepended or appended to the shared block. The data is
>> copied once in the kernel into an allocated page, and the
>> per-recipient socket buffers point to that page. Savings come from
>> avoiding both the multiple calls and the multiple copies of the data
>> required with regular socket operations.
>>     
>
> My guess it's more the copies than the calls? It sounds like
> you want sendfile() for UDP. I think that would be a cleaner solution
> than such a specific hack for your application. It would
> have the advantage of saving the first copy too and be 
> truly zero copy on capable NICs.
>   

An aio udp send could accomplish both multiple packets per call, and 
zero-copy, without adding new syscalls.  You could send the same packet 
to multiple recipients, or multiple packets to the same recipient, or
combinations thereof.

> Or perhaps simple send to a local multicast group and let
> some netfilter module turn that into regular UDP.
>   

Sounds hacky and rooty.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.



* Re: [PATCH] Implementation of the sendgroup() system call
  2009-05-04  7:13 ` [PATCH] Implementation of the sendgroup() system call Andi Kleen
  2009-05-04  7:30   ` Avi Kivity
@ 2009-05-04  7:42   ` Rémi Denis-Courmont
  2009-05-04 13:44   ` Elad Lahav
  2 siblings, 0 replies; 12+ messages in thread
From: Rémi Denis-Courmont @ 2009-05-04  7:42 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Elad Lahav, linux-kernel, netdev


On Monday 04 May 2009 10:13:10 ext Andi Kleen wrote:
> Elad Lahav <elahav@uwaterloo.ca> writes:
> > The attached patch contains an implementation of sendgroup(), a system
> > call that allows a UDP packet to be transmitted efficiently to
> > multiple recipients. Use cases for this system call include
> > live-streaming and multi-player online games.
> > The basic idea is that the caller maintains a group - a list of IP
> > addresses and UDP ports - and calls sendgroup() with the group list
> > and a common payload. Optionally, the call allows for per-recipient
> > data to be prepended or appended to the shared block. The data is
> > copied once in the kernel into an allocated page, and the
> > per-recipient socket buffers point to that page. Savings come from
> > avoiding both the multiple calls and the multiple copies of the data
> > required with regular socket operations.
>
> My guess it's more the copies than the calls? It sounds like
> you want sendfile() for UDP. I think that would be a cleaner solution
> than such a specific hack for your application. It would
> have the advantage of saving the first copy too and be
> truly zero copy on capable NICs.

Say you have a fragmented skbuff-capable NIC.

It is already possible to write() to a pipe, then issue a series of N tee()
from the pipe for each of N connected sockets, and finally splice() to
/dev/null. That would be one copy and N+2 system calls. I guess the one
copy cannot be removed because UDP payloads are not page-sized, so
vmsplice() won't cut it. As a small optimization, you could replace the last
tee() with a splice(), so that's N+1 system calls.

When using a non-connected socket and multiple destinations, you need
one corked sendmsg() to set the destination, followed by one splice() to
push the payload. That's at least 2N+1 system calls.

UDP payload will typically be small, probably 1400 bytes or less. The
system call overhead might well be higher than the memory copy overhead.
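
The call counts above can be written down as a small worked calculation. This is only a restatement of the arithmetic in this message; the function and key names are made up for illustration, and the `sendto_loop` and `sendgroup` rows are added for comparison (N calls for a plain loop, one call for the proposed syscall).

```python
def syscalls_per_packet(n):
    """System calls needed to deliver one payload to n recipients."""
    if n < 1:
        raise ValueError("need at least one recipient")
    return {
        # Plain loop: one sendto()/sendmsg() per recipient.
        "sendto_loop": n,
        # write() to the pipe, n tee() calls, one splice() to /dev/null.
        "tee_then_drain": n + 2,
        # write(), n - 1 tee() calls, final splice() to the last socket.
        "tee_last_splice": n + 1,
        # write(), then per recipient one corked sendmsg() plus one splice().
        "unconnected_cork_splice": 2 * n + 1,
        # The proposed call: one syscall regardless of group size.
        "sendgroup": 1,
    }
```

For n = 1000 this gives 1000, 1002, 1001, 2001 and 1 calls respectively, so none of the splice()-based variants reduces the number of mode switches compared with a plain loop.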

On top of that, the splice() trick only works if the NIC can cope with
fragments. The performance might be worse than with normal sendto(). The
application has no way to check this from the socket. For a reason.
Depending on the routing table, different destinations could use different
NICs with different capabilities.

To sum up, I can see two problems using splice() and friends here, with
regards to the use case:
- no support for destination socket addresses,
- no support for batch tee() operations.

Whether that justifies a new socket call is not for me to say.
Personally, I would definitely use it in the RTSP broadcast output of the
VLC media player once/if it ever hits glibc.

> Or perhaps simple send to a local multicast group and let
> some netfilter module turn that into regular UDP.

Unless netfilter has changed dramatically recently, it is not usable for
applications. A monolithic system-wide configuration paradigm is not usable
for applications, as they cannot know how not to step on another
application's toes.

-- 
Rémi Denis-Courmont



* Re: [PATCH] Implementation of the sendgroup() system call
       [not found] ` <49FE9C8C.6090705@cosmosbay.com>
@ 2009-05-04  9:03   ` Eric Dumazet
  0 siblings, 0 replies; 12+ messages in thread
From: Eric Dumazet @ 2009-05-04  9:03 UTC (permalink / raw)
  To: Elad Lahav; +Cc: linux-kernel, Linux Netdev List

Eric Dumazet wrote:
> Elad Lahav wrote:
>> The attached patch contains an implementation of sendgroup(), a system
>> call that allows a UDP packet to be transmitted efficiently to multiple
>> recipients. Use cases for this system call include live-streaming and
>> multi-player online games.
>> The basic idea is that the caller maintains a group - a list of IP
>> addresses and UDP ports - and calls sendgroup() with the group list and
>> a common payload. Optionally, the call allows for per-recipient data to
>> be prepended or appended to the shared block. The data is copied once in
>> the kernel into an allocated page, and the per-recipient socket buffers
>> point to that page. Savings come from avoiding both the multiple calls
>> and the multiple copies of the data required with regular socket
>> operations. We have measured an improvement of 42% in CPU utilisation
>> when using this system call with the Helix multimedia server (reference:
>> http://simula.no/~griff/nossdav2008/27-32.pdf).
>>
>> The patch includes two implementations: one as described above and one
>> that uses the udp_sendmsg() function in a tight loop inside the kernel
>> (and thus saves on mode switches, but not on data copies). The latter is
>> provided for reference and benchmarking only.
>>
>> Feedback is welcome.
>>
> 
> Hi Elad
> 
> Patch is not inlined, this is really asking for trouble, I doubt many people
> will actually read your patch...
> 
> My comments are :
> 
> 1) Lack of latency checks. Sending UDP to 1000 destinations is expensive.
>   A syscall is not preemptable unless special conditions are met.
> 
> 2) Lack of a 32/64-bit aware API. A 64-bit kernel should be able to 
> run a 32-bit application using a sendgroup() syscall.
> 
> 3) Are the footer/header different for each call? Maybe you need
>   something better to avoid extra copies of them at each sendgroup() system call.
> 
> 4) One expensive thing on UDP sends is the route cache lookups. You could avoid
> this cost using 'connected' group setup (see point 3)
>  
> i.e. using a different syscall to set up the group (and compute/look up all needed routes)
>   (this syscall would be able to add/delete members (with their footer/header) to socket group)
>   
> Then sendgroup() would be really light, since it would provide a group identifier
> (can be a file descriptor -> mapping one group), and the UDP message to send.
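
A user-space approximation of the "connected group" idea in point 4 can be sketched as follows. This is not the proposed kernel interface: the class `ConnectedGroup` and its methods are hypothetical names, and the sketch assumes the usual Linux behaviour that a connected UDP socket has its destination fixed at connect() time, so the per-send route lookup can be avoided. It still costs one system call and one file descriptor per member.

```python
import socket


class ConnectedGroup:
    """Hypothetical stand-in for a kernel-side group object."""

    def __init__(self):
        self._members = {}  # (host, port) -> connected UDP socket

    def add(self, addr):
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        # The destination is resolved once here, at group setup time.
        s.connect(addr)
        self._members[addr] = s

    def remove(self, addr):
        self._members.pop(addr).close()

    def send(self, payload):
        # Still one send() per member; only the destination setup is hoisted.
        for s in self._members.values():
            s.send(payload)
        return len(self._members)

    def close(self):
        for s in self._members.values():
            s.close()
        self._members.clear()
```

The add/remove methods correspond to the separate group-management syscall suggested above; in the kernel version the final send() loop would collapse into a single call taking a group identifier.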

Ah, some other points: you forgot to include netdev (CCed on my message),
as some network guys don't read lkml every day :)

In your experiments, did you change the NIC tx queue length? (The default is 1000.)
Using sendgroup() or sendmsg(), you'll hit the NIC queue limit pretty fast anyway...

Also, since 2.6.25 added memory accounting on UDP sockets, you'll probably need to
increase SO_SNDBUF to avoid blocking in the sendmsg()/sendgroup() call.
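
Raising the send buffer is a one-line setsockopt(); a minimal sketch follows. The helper name `grow_sndbuf` is made up for illustration. On Linux the kernel doubles the requested value for bookkeeping and caps it at net.core.wmem_max, so the value read back is what actually applies and may differ from the request.

```python
import socket


def grow_sndbuf(sock, requested):
    """Ask for a larger send buffer and return the size the kernel reports."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, requested)
    # Read the effective value back; it is not guaranteed to equal `requested`.
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
```

A caller sending to a large group would compare the returned size against roughly (group size x datagram size) to judge whether a burst can be queued without blocking.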


* Re: [PATCH] Implementation of the sendgroup() system call
  2009-05-04  7:30   ` Avi Kivity
@ 2009-05-04  9:53     ` Andi Kleen
  2009-05-04  9:56       ` Eric Dumazet
  2009-05-04  9:58       ` Avi Kivity
  0 siblings, 2 replies; 12+ messages in thread
From: Andi Kleen @ 2009-05-04  9:53 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Andi Kleen, Elad Lahav, linux-kernel, netdev

> >My guess it's more the copies than the calls? It sounds like
> >you want sendfile() for UDP. I think that would be a cleaner solution
> >than such a specific hack for your application. It would
> >have the advantage of saving the first copy too and be 
> >truly zero copy on capable NICs.
> >  
> 
> An aio udp send could accomplish both multiple packets per call, and 

AIO sockets are a lot of work. There have been various attempts
over the years, but they are very difficult. This was mostly
for TCP -- possibly UDP would be a bit easier -- but still
many complications. It would also need a lot of changes and
you would need to convince the network maintainers that they
are a good idea.

> >Or perhaps simple send to a local multicast group and let
> >some netfilter module turn that into regular UDP.
> >  
> 
> Sounds hacky and rooty.

rooty? Everyone can send to all directions anyways.

It wouldn't be perfect, but quite usable as a short term solution
for a production server.


-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH] Implementation of the sendgroup() system call
  2009-05-04  9:53     ` Andi Kleen
@ 2009-05-04  9:56       ` Eric Dumazet
  2009-05-04 10:18         ` Andi Kleen
  2009-05-04  9:58       ` Avi Kivity
  1 sibling, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2009-05-04  9:56 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Avi Kivity, Elad Lahav, linux-kernel, netdev

Andi Kleen wrote:

>>> Or perhaps simple send to a local multicast group and let
>>> some netfilter module turn that into regular UDP.
>>>  
>> Sounds hacky and rooty.
> 
> rooty? Everyone can send to all directions anyways.
> 
> It wouldn't be perfect, but quite usable as a short term solution
> for a production server.

Anyway, a netfilter module doesn't fit the needs (separate header/footer
for each destination)




* Re: [PATCH] Implementation of the sendgroup() system call
  2009-05-04  9:53     ` Andi Kleen
  2009-05-04  9:56       ` Eric Dumazet
@ 2009-05-04  9:58       ` Avi Kivity
  1 sibling, 0 replies; 12+ messages in thread
From: Avi Kivity @ 2009-05-04  9:58 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Elad Lahav, linux-kernel, netdev

Andi Kleen wrote:
>>> My guess it's more the copies than the calls? It sounds like
>>> you want sendfile() for UDP. I think that would be a cleaner solution
>>> than such a specific hack for your application. It would
>>> have the advantage of saving the first copy too and be 
>>> truly zero copy on capable NICs.
>>>  
>>>       
>> An aio udp send could accomplish both multiple packets per call, and 
>>     
>
> AIO sockets are a lot of work. There have been various attempts
> over the years, but they are very difficult. This was mostly
> for TCP -- possibly UDP would be a bit easier -- but still
> many complications. It would also need a lot of changes and
> you would need to convince the network maintainers that they
> are a good idea.
>   

I would love them for kvm.  As far as I understand, the only 
complication is proper socket destructors so we can put_page() the memory.

Right now sendfile() is only usable for read-only files.  It's not 
usable for files that change, or non-file memory.

>   
>>> Or perhaps simple send to a local multicast group and let
>>> some netfilter module turn that into regular UDP.
>>>  
>>>       
>> Sounds hacky and rooty.
>>     
>
> rooty? Everyone can send to all directions anyways.
>
> It wouldn't be perfect, but quite usable as a short term solution
> for a production server.
>   

I meant, you need root to insert that netfilter module.  It's workable as 
a one-off but it's not something reusable.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.



* Re: [PATCH] Implementation of the sendgroup() system call
  2009-05-04  9:56       ` Eric Dumazet
@ 2009-05-04 10:18         ` Andi Kleen
  0 siblings, 0 replies; 12+ messages in thread
From: Andi Kleen @ 2009-05-04 10:18 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Andi Kleen, Avi Kivity, Elad Lahav, linux-kernel, netdev

On Mon, May 04, 2009 at 11:56:25AM +0200, Eric Dumazet wrote:
> Andi Kleen wrote:
> 
> >>> Or perhaps simple send to a local multicast group and let
> >>> some netfilter module turn that into regular UDP.
> >>>  
> >> Sounds hacky and rooty.
> > 
> > rooty? Everyone can send to all directions anyways.
> > 
> > It wouldn't be perfect, but quite usable as a short term solution
> > for a production server.
> 
> Anyway, a netfilter module doesnt fit the needs (separate header/footer
> for each destination)

netfilter modules can create new skbs that reference the main data and replace 
the header.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH] Implementation of the sendgroup() system call
  2009-05-04  7:13 ` [PATCH] Implementation of the sendgroup() system call Andi Kleen
  2009-05-04  7:30   ` Avi Kivity
  2009-05-04  7:42   ` Rémi Denis-Courmont
@ 2009-05-04 13:44   ` Elad Lahav
  2009-05-04 14:50     ` Andi Kleen
  2 siblings, 1 reply; 12+ messages in thread
From: Elad Lahav @ 2009-05-04 13:44 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Elad Lahav, linux-kernel, netdev

> My guess it's more the copies than the calls?
It's a factor of both. This is why we also created the sendgroup()  
implementation that uses a tight loop of in-kernel calls to sendmsg()  
as a means for evaluating the cost of mode switches. It is definitely  
not negligible (exact numbers depend on the size of the group and the  
size of the payload, of course).

> It sounds like you want sendfile() for UDP.
Do you mean by having a per-recipient sendfile() call for the same  
file? Leaving the cost of the system call aside, this solution does  
not work well with the kind of real-time data that we've been working  
with (live streaming, online games). You would have to write the  
payload to the file as it is being generated and call sendfile() after  
each such write.

--Elad


* Re: [PATCH] Implementation of the sendgroup() system call
  2009-05-04 13:44   ` Elad Lahav
@ 2009-05-04 14:50     ` Andi Kleen
  2009-05-05  0:24       ` Elad Lahav
  2009-05-06 11:25       ` Tim Brecht
  0 siblings, 2 replies; 12+ messages in thread
From: Andi Kleen @ 2009-05-04 14:50 UTC (permalink / raw)
  To: Elad Lahav; +Cc: Andi Kleen, Elad Lahav, linux-kernel, netdev

On Mon, May 04, 2009 at 09:44:31AM -0400, Elad Lahav wrote:
> >My guess it's more the copies than the calls?
> It's a factor of both. This is why we also created the sendgroup()  
> implementation that uses a tight loop of in-kernel calls to sendmsg()  
> as a means for evaluating the cost of mode switches. It is definitely  
> not negligible (exact numbers depend on the size of the group and the  
> size of the payload, of course).

How much is non negligible in your case?

> 
> >It sounds like you want sendfile() for UDP.
> Do you mean by having a per-recipient sendfile() call for the same  
> file? Leaving the cost of the system call aside, this solution does  
> not work well with the kind of real-time data that we've been working  
> with (live streaming, online games). You would have to write the  
> payload to the file as it is being generated and call sendfile() after  
> each such write.

You can mmap the file.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH] Implementation of the sendgroup() system call
  2009-05-04 14:50     ` Andi Kleen
@ 2009-05-05  0:24       ` Elad Lahav
  2009-05-06 11:25       ` Tim Brecht
  1 sibling, 0 replies; 12+ messages in thread
From: Elad Lahav @ 2009-05-05  0:24 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Elad Lahav, linux-kernel, netdev

> How much is non negligible in your case?
Please see the following link

http://www.cs.uwaterloo.ca/~elahav/sendgroup/mb_pkt_size.pdf

for a graph showing the (amortised) cost of sending a single packet to a 
group of 1000 recipients, using one of the three following methods:

1. A user-mode loop of sendmsg() calls
2. A kernel-mode loop of udp_sendmsg() calls
3. A single call to sendgroup()

The cost is measured in cycles and was determined using performance 
counters.

In this benchmark, savings introduced by sendgroup() are primarily due 
to the avoidance of multiple system calls, up to a packet size of about 
1000 bytes. Obviously, proportions change for different group sizes 
(though we were able to outperform method 1 with a group size as small 
as 2).

--Elad


* Re: [PATCH] Implementation of the sendgroup() system call
  2009-05-04 14:50     ` Andi Kleen
  2009-05-05  0:24       ` Elad Lahav
@ 2009-05-06 11:25       ` Tim Brecht
  1 sibling, 0 replies; 12+ messages in thread
From: Tim Brecht @ 2009-05-06 11:25 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Elad Lahav, Elad Lahav, linux-kernel, netdev



On Mon, 4 May 2009, Andi Kleen wrote:

> On Mon, May 04, 2009 at 09:44:31AM -0400, Elad Lahav wrote:
>>> My guess it's more the copies than the calls?
>> It's a factor of both. This is why we also created the sendgroup()
>> implementation that uses a tight loop of in-kernel calls to sendmsg()
>> as a means for evaluating the cost of mode switches. It is definitely
>> not negligible (exact numbers depend on the size of the group and the
>> size of the payload, of course).
>
> How much is non negligible in your case?

As you can see from Elad's posting it can be pretty
significant.

>>
>>> It sounds like you want sendfile() for UDP.
>> Do you mean by having a per-recipient sendfile() call for the same
>> file? Leaving the cost of the system call aside, this solution does
>> not work well with the kind of real-time data that we've been working
>> with (live streaming, online games). You would have to write the
>> payload to the file as it is being generated and call sendfile() after
>> each such write.
>
> You can mmap the file.

There are a few problems with using mmap and sendfile:

1) One would really want something like sendfilev where
    one could specify multiple recipients in one syscall
    (in order to save on the mode switches).

2) I don't know what it would be like for UDP but for
    TCP one of the big problems with mmap/sendfile
    for zero copy is that the application
    doesn't know when the kernel has finished sending
    the data. As a result one can only reuse the mmapped buffer
    if there is some way for the application to deduce
    that the kernel is finished sending the data.
    Even if the application can deduce this it can
    often be long after the kernel has sent the data
    and as a result memory buffers can accumulate
    unnecessarily. We've had this problem trying to use
    this approach in a high-performance web server.

3) I think that including recipient specific data
    would be cumbersome and would probably require extra
    system calls. Possibly
       write(for prepend)
       sendfile(for common)
       write(for append)
    Unless one copies the common data into prestaged
    areas in user space ... which results in the copying
    we are trying to avoid.

    Perhaps if writev was able to
    write from an mmapped file with zero copies,
    a single recipient could be sent recipient
    specific and common data with one system call.
    However, this approach would still require one system
    call per recipient.
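
The single-call-per-recipient variant described in point 3 can be approximated today with a gather send that takes the common data straight from a mapping. The sketch below is an assumption-laden illustration, not a zero-copy path: the helper name `send_from_mapping` is hypothetical, and sendmsg() still copies every piece into the kernel. It only shows that header, mapped common data and footer can go out in one system call without first being assembled in a user-space staging buffer.

```python
import mmap
import socket


def send_from_mapping(sock, mapping, length, addr, header=b"", footer=b""):
    """Send header + mapping[:length] + footer to addr as one UDP datagram.

    Returns the number of bytes accepted by the kernel.
    """
    # memoryview slices reference the mapping without a user-space copy;
    # the context managers release them so the mapping can be closed later.
    with memoryview(mapping) as whole, whole[:length] as common:
        parts = [p for p in (header, common, footer) if len(p) > 0]
        return sock.sendmsg(parts, [], 0, addr)
```

As noted above, this still needs one call per recipient, and the application must not overwrite the mapped region until the call returns.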

