* Re: [PATCH] Implementation of the sendgroup() system call [not found] <49FE47A1.7070700@uwaterloo.ca> @ 2009-05-04 7:13 ` Andi Kleen 2009-05-04 7:30 ` Avi Kivity ` (2 more replies) [not found] ` <49FE9C8C.6090705@cosmosbay.com> 1 sibling, 3 replies; 12+ messages in thread From: Andi Kleen @ 2009-05-04 7:13 UTC (permalink / raw) To: Elad Lahav; +Cc: linux-kernel, netdev Elad Lahav <elahav@uwaterloo.ca> writes: > The attached patch contains an implementation of sendgroup(), a system > call that allows a UDP packet to be transmitted efficiently to > multiple recipients. Use cases for this system call include > live-streaming and multi-player online games. > The basic idea is that the caller maintains a group - a list of IP > addresses and UDP ports - and calls sendgroup() with the group list > and a common payload. Optionally, the call allows for per-recipient > data to be prepended or appended to the shared block. The data is > copied once in the kernel into an allocated page, and the > per-recipient socket buffers point to that page. Savings come from > avoiding both the multiple calls and the multiple copies of the data > required with regular socket operations. My guess it's more the copies than the calls? It sounds like you want sendfile() for UDP. I think that would be a cleaner solution than such a specific hack for your application. It would have the advantage of saving the first copy too and be truly zero copy on capable NICs. Or perhaps simple send to a local multicast group and let some netfilter module turn that into regular UDP. -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] Implementation of the sendgroup() system call 2009-05-04 7:13 ` [PATCH] Implementation of the sendgroup() system call Andi Kleen @ 2009-05-04 7:30 ` Avi Kivity 2009-05-04 9:53 ` Andi Kleen 2009-05-04 7:42 ` Rémi Denis-Courmont 2009-05-04 13:44 ` Elad Lahav 2 siblings, 1 reply; 12+ messages in thread From: Avi Kivity @ 2009-05-04 7:30 UTC (permalink / raw) To: Andi Kleen; +Cc: Elad Lahav, linux-kernel, netdev Andi Kleen wrote: > Elad Lahav <elahav@uwaterloo.ca> writes: > > >> The attached patch contains an implementation of sendgroup(), a system >> call that allows a UDP packet to be transmitted efficiently to >> multiple recipients. Use cases for this system call include >> live-streaming and multi-player online games. >> The basic idea is that the caller maintains a group - a list of IP >> addresses and UDP ports - and calls sendgroup() with the group list >> and a common payload. Optionally, the call allows for per-recipient >> data to be prepended or appended to the shared block. The data is >> copied once in the kernel into an allocated page, and the >> per-recipient socket buffers point to that page. Savings come from >> avoiding both the multiple calls and the multiple copies of the data >> required with regular socket operations. >> > > My guess it's more the copies than the calls? It sounds like > you want sendfile() for UDP. I think that would be a cleaner solution > than such a specific hack for your application. It would > have the advantage of saving the first copy too and be > truly zero copy on capable NICs. > An aio udp send could accomplish both multiple packets per call, and zero-copy, without adding new syscalls. You could send the same packet to multiple recipients, or multiple packets to the same recipient, or combinations thereof. > Or perhaps simple send to a local multicast group and let > some netfilter module turn that into regular UDP. > Sounds hacky and rooty. 
-- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] Implementation of the sendgroup() system call 2009-05-04 7:30 ` Avi Kivity @ 2009-05-04 9:53 ` Andi Kleen 2009-05-04 9:56 ` Eric Dumazet 2009-05-04 9:58 ` Avi Kivity 0 siblings, 2 replies; 12+ messages in thread From: Andi Kleen @ 2009-05-04 9:53 UTC (permalink / raw) To: Avi Kivity; +Cc: Andi Kleen, Elad Lahav, linux-kernel, netdev > >My guess it's more the copies than the calls? It sounds like > >you want sendfile() for UDP. I think that would be a cleaner solution > >than such a specific hack for your application. It would > >have the advantage of saving the first copy too and be > >truly zero copy on capable NICs. > > > > An aio udp send could accomplish both multiple packets per call, and AIO sockets are a lot of work. There have been various attempts over the years, but they are very difficult. This was mostly for TCP -- possibly UDP would be a bit easier -- but still many complications. It would also need a lot of changes and you would need to convince the network maintainers that they are a good idea. > >Or perhaps simple send to a local multicast group and let > >some netfilter module turn that into regular UDP. > > > > Sounds hacky and rooty. rooty? Everyone can send to all directions anyways. It wouldn't be perfect, but quite usable as a short term solution for a production server. -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] Implementation of the sendgroup() system call 2009-05-04 9:53 ` Andi Kleen @ 2009-05-04 9:56 ` Eric Dumazet 2009-05-04 10:18 ` Andi Kleen 2009-05-04 9:58 ` Avi Kivity 1 sibling, 1 reply; 12+ messages in thread From: Eric Dumazet @ 2009-05-04 9:56 UTC (permalink / raw) To: Andi Kleen; +Cc: Avi Kivity, Elad Lahav, linux-kernel, netdev Andi Kleen wrote: >>> Or perhaps simple send to a local multicast group and let >>> some netfilter module turn that into regular UDP. >>> >> Sounds hacky and rooty. > > rooty? Everyone can send to all directions anyways. > > It wouldn't be perfect, but quite usable as a short term solution > for a production server. Anyway, a netfilter module doesn't fit the needs (separate header/footer for each destination) ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] Implementation of the sendgroup() system call 2009-05-04 9:56 ` Eric Dumazet @ 2009-05-04 10:18 ` Andi Kleen 0 siblings, 0 replies; 12+ messages in thread From: Andi Kleen @ 2009-05-04 10:18 UTC (permalink / raw) To: Eric Dumazet; +Cc: Andi Kleen, Avi Kivity, Elad Lahav, linux-kernel, netdev On Mon, May 04, 2009 at 11:56:25AM +0200, Eric Dumazet wrote: > Andi Kleen wrote: > >>> Or perhaps simple send to a local multicast group and let > >>> some netfilter module turn that into regular UDP. > >>> > >> Sounds hacky and rooty. > > > > rooty? Everyone can send to all directions anyways. > > > > It wouldn't be perfect, but quite usable as a short term solution > > for a production server. > > Anyway, a netfilter module doesn't fit the needs (separate header/footer > for each destination) netfilter modules can create new skbs that reference the main data and replace the header. -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] Implementation of the sendgroup() system call 2009-05-04 9:53 ` Andi Kleen 2009-05-04 9:56 ` Eric Dumazet @ 2009-05-04 9:58 ` Avi Kivity 1 sibling, 0 replies; 12+ messages in thread From: Avi Kivity @ 2009-05-04 9:58 UTC (permalink / raw) To: Andi Kleen; +Cc: Elad Lahav, linux-kernel, netdev Andi Kleen wrote: >>> My guess it's more the copies than the calls? It sounds like >>> you want sendfile() for UDP. I think that would be a cleaner solution >>> than such a specific hack for your application. It would >>> have the advantage of saving the first copy too and be >>> truly zero copy on capable NICs. >>> >>> >> An aio udp send could accomplish both multiple packets per call, and >> > > AIO sockets are a lot of work. There have been various attempts > over the years, but they are very difficult. This was mostly > for TCP -- possibly UDP would be a bit easier -- but still > many complications. It would also need a lot of changes and > you would need to convince the network maintainers that they > are a good idea. > I would love them for kvm. As far as I understand, the only complication is proper socket destructors so we can put_page() the memory. Right now sendfile() is only usable for read-only files. It's not usable for files that change, or non-file memory. > >>> Or perhaps simple send to a local multicast group and let >>> some netfilter module turn that into regular UDP. >>> >>> >> Sounds hacky and rooty. >> > > rooty? Everyone can send to all directions anyways. > > It wouldn't be perfect, but quite usable as a short term solution > for a production server. > I meant, you need root to insert that netfilter module. It's workable as a one-off but it's not something reusable. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] Implementation of the sendgroup() system call 2009-05-04 7:13 ` [PATCH] Implementation of the sendgroup() system call Andi Kleen 2009-05-04 7:30 ` Avi Kivity @ 2009-05-04 7:42 ` Rémi Denis-Courmont 2009-05-04 13:44 ` Elad Lahav 2 siblings, 0 replies; 12+ messages in thread From: Rémi Denis-Courmont @ 2009-05-04 7:42 UTC (permalink / raw) To: Andi Kleen; +Cc: Elad Lahav, linux-kernel, netdev On Monday 04 May 2009 10:13:10 ext Andi Kleen wrote: > Elad Lahav <elahav@uwaterloo.ca> writes: > > The attached patch contains an implementation of sendgroup(), a system > > call that allows a UDP packet to be transmitted efficiently to > > multiple recipients. Use cases for this system call include > > live-streaming and multi-player online games. > > The basic idea is that the caller maintains a group - a list of IP > > addresses and UDP ports - and calls sendgroup() with the group list > > and a common payload. Optionally, the call allows for per-recipient > > data to be prepended or appended to the shared block. The data is > > copied once in the kernel into an allocated page, and the > > per-recipient socket buffers point to that page. Savings come from > > avoiding both the multiple calls and the multiple copies of the data > > required with regular socket operations. > > My guess it's more the copies than the calls? It sounds like > you want sendfile() for UDP. I think that would be a cleaner solution > than such a specific hack for your application. It would > have the advantage of saving the first copy too and be > truly zero copy on capable NICs. Say you have a fragmented skbuff-capable NIC. It is already possible to write() to a pipe, then issue a series of N tee() from the pipe for each of N connected sockets, and finally splice() to /dev/null. That would be one copy and N+2 system calls. I guess the one copy cannot be removed because UDP payloads are not page-sized, so vmsplice() won't cut it. 
As a small optimization, you could replace the last tee() with a splice(), so that's N+1 system calls. When using a non-connected socket and multiple destinations, then you need one corked sendmsg() to set the destination, followed by one splice() to push the payload. That's at least 2N+1 system calls. UDP payload will typically be small, probably 1400 bytes or less. The system call overhead might well be higher than the memory copy overhead. On top of that, the splice() trick only works if the NIC can cope with fragments. The performance might be worse than with normal sendto(). The application has no way to check this from the socket. For a reason. Depending on the routing table, different destinations could use different NICs with different capabilities. To sum up, I can see two problems using splice() and friends here, with regards to the use case: - no support for destination socket addresses, - no support for batch tee() operations. Whether that justifies a new socket call is not for me to opine. Personally, I would definitely use it in the RTSP broadcast output of the VLC media player once/if it ever hits glibc. > Or perhaps simple send to a local multicast group and let > some netfilter module turn that into regular UDP. Unless netfilter has changed dramatically recently, it is not usable for applications. A monolithic system-wide configuration paradigm is not usable for applications, as they cannot know how not to step on another application's feet. -- Rémi Denis-Courmont ^ permalink raw reply [flat|nested] 12+ messages in thread
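The system-call counts in the message above can be put side by side for a concrete group size. The following is a minimal C sketch, not code from the patch; the function names are made up for illustration, and the plain sendto() loop (one call per recipient) is added here as an assumed baseline for comparison.

```c
#include <assert.h>
#include <stdio.h>

/* System calls needed to deliver one payload to n recipients.
 * The formulas are the ones quoted in the message above; the
 * function names are hypothetical. */

/* write() to a pipe, n tee() calls, one splice() to /dev/null. */
unsigned long calls_tee(unsigned long n)
{
    return n + 2;
}

/* Same, with the last tee() replaced by a splice(). */
unsigned long calls_tee_optimized(unsigned long n)
{
    return n + 1;
}

/* Non-connected socket: a corked sendmsg() plus a splice() per
 * destination, plus the initial write(). This is a lower bound. */
unsigned long calls_unconnected_min(unsigned long n)
{
    return 2 * n + 1;
}

/* Assumed baseline, not stated in the message: one sendto() each. */
unsigned long calls_sendto_loop(unsigned long n)
{
    return n;
}

/* Print all four counts for a given group size. */
void print_call_counts(unsigned long n)
{
    printf("n=%lu tee=%lu tee_opt=%lu unconnected>=%lu sendto=%lu\n",
           n, calls_tee(n), calls_tee_optimized(n),
           calls_unconnected_min(n), calls_sendto_loop(n));
}
```

For a group of 1000 this gives 1002, 1001 and at least 2001 calls, against 1000 for the plain loop, so the pipe-based variants reduce the copies but not the number of mode switches.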
* Re: [PATCH] Implementation of the sendgroup() system call 2009-05-04 7:13 ` [PATCH] Implementation of the sendgroup() system call Andi Kleen 2009-05-04 7:30 ` Avi Kivity 2009-05-04 7:42 ` Rémi Denis-Courmont @ 2009-05-04 13:44 ` Elad Lahav 2009-05-04 14:50 ` Andi Kleen 2 siblings, 1 reply; 12+ messages in thread From: Elad Lahav @ 2009-05-04 13:44 UTC (permalink / raw) To: Andi Kleen; +Cc: Elad Lahav, linux-kernel, netdev > My guess it's more the copies than the calls? It's a factor of both. This is why we also created the sendgroup() implementation that uses a tight loop of in-kernel calls to sendmsg() as a means for evaluating the cost of mode switches. It is definitely not negligible (exact numbers depend on the size of the group and the size of the payload, of course). > It sounds like you want sendfile() for UDP. Do you mean by having a per-recipient sendfile() call for the same file? Leaving the cost of the system call aside, this solution does not work well with the kind of real-time data that we've been working with (live streaming, online games). You would have to write the payload to the file as it is being generated and call sendfile() after each such write. --Elad ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] Implementation of the sendgroup() system call 2009-05-04 13:44 ` Elad Lahav @ 2009-05-04 14:50 ` Andi Kleen 2009-05-05 0:24 ` Elad Lahav 2009-05-06 11:25 ` Tim Brecht 0 siblings, 2 replies; 12+ messages in thread From: Andi Kleen @ 2009-05-04 14:50 UTC (permalink / raw) To: Elad Lahav; +Cc: Andi Kleen, Elad Lahav, linux-kernel, netdev On Mon, May 04, 2009 at 09:44:31AM -0400, Elad Lahav wrote: > >My guess it's more the copies than the calls? > It's a factor of both. This is why we also created the sendgroup() > implementation that uses a tight loop of in-kernel calls to sendmsg() > as a means for evaluating the cost of mode switches. It is definitely > not negligible (exact numbers depend on the size of the group and the > size of the payload, of course). How much is non negligible in your case? > > >It sounds like you want sendfile() for UDP. > Do you mean by having a per-recipient sendfile() call for the same > file? Leaving the cost of the system call aside, this solution does > not work well with the kind of real-time data that we've been working > with (live streaming, online games). You would have to write the > payload to the file as it is being generated and call sendfile() after > each such write. You can mmap the file. -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] Implementation of the sendgroup() system call 2009-05-04 14:50 ` Andi Kleen @ 2009-05-05 0:24 ` Elad Lahav 2009-05-06 11:25 ` Tim Brecht 1 sibling, 0 replies; 12+ messages in thread From: Elad Lahav @ 2009-05-05 0:24 UTC (permalink / raw) To: Andi Kleen; +Cc: Elad Lahav, linux-kernel, netdev > How much is non negligible in your case? Please see the following link http://www.cs.uwaterloo.ca/~elahav/sendgroup/mb_pkt_size.pdf for a graph showing the (amortised) cost of sending a single packet to a group of 1000 recipients, using one of the three following methods: 1. A user-mode loop of sendmsg() calls 2. A kernel-mode loop of udp_sendmsg() calls 3. A single call to sendgroup() The cost is measured in cycles and was determined using performance counters. In this benchmark, savings introduced by sendgroup() are primarily due to the avoidance of multiple system calls, up to a packet size of about 1000 bytes. Obviously, proportions change for different group sizes (though we were able to outperform method 1 with a group size as small as 2). --Elad ^ permalink raw reply [flat|nested] 12+ messages in thread
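Method 1 above, the user-mode loop, could look roughly like the following C sketch. The struct group_member type and both function names are hypothetical and are not taken from the patch or the benchmark code; only sendmsg() and its msghdr/iovec arguments are the real socket API. Each iteration costs one system call and one copy of the payload into the kernel, which is the overhead sendgroup() aims to amortise.

```c
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical group entry: a destination plus an optional
 * per-recipient header to prepend to the shared payload. */
struct group_member {
    struct sockaddr_in addr;
    const void *prefix;
    size_t prefix_len;
};

/* Fill msg so that it sends prefix + payload to the member's address.
 * iov must have room for two entries and must outlive msg. */
void fill_group_msg(struct msghdr *msg, struct iovec iov[2],
                    const struct group_member *m,
                    const void *payload, size_t payload_len)
{
    memset(msg, 0, sizeof(*msg));
    iov[0].iov_base = (void *)m->prefix;   /* per-recipient part */
    iov[0].iov_len = m->prefix_len;
    iov[1].iov_base = (void *)payload;     /* shared part */
    iov[1].iov_len = payload_len;
    msg->msg_name = (void *)&m->addr;
    msg->msg_namelen = sizeof(m->addr);
    msg->msg_iov = iov;
    msg->msg_iovlen = 2;
}

/* One sendmsg() per recipient. Returns how many datagrams were
 * handed to the kernel before the first failure. */
size_t send_to_group(int sock, const struct group_member *group, size_t n,
                     const void *payload, size_t payload_len)
{
    size_t sent = 0;
    for (size_t i = 0; i < n; i++) {
        struct msghdr msg;
        struct iovec iov[2];
        fill_group_msg(&msg, iov, &group[i], payload, payload_len);
        if (sendmsg(sock, &msg, 0) < 0)
            break;
        sent++;
    }
    return sent;
}
```

A caller would create one unconnected UDP socket with socket(AF_INET, SOCK_DGRAM, 0), fill an array of group_member entries, and call send_to_group() once per payload.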
* Re: [PATCH] Implementation of the sendgroup() system call 2009-05-04 14:50 ` Andi Kleen 2009-05-05 0:24 ` Elad Lahav @ 2009-05-06 11:25 ` Tim Brecht 1 sibling, 0 replies; 12+ messages in thread From: Tim Brecht @ 2009-05-06 11:25 UTC (permalink / raw) To: Andi Kleen; +Cc: Elad Lahav, Elad Lahav, linux-kernel, netdev On Mon, 4 May 2009, Andi Kleen wrote: > On Mon, May 04, 2009 at 09:44:31AM -0400, Elad Lahav wrote: >>> My guess it's more the copies than the calls? >> It's a factor of both. This is why we also created the sendgroup() >> implementation that uses a tight loop of in-kernel calls to sendmsg() >> as a means for evaluating the cost of mode switches. It is definitely >> not negligible (exact numbers depend on the size of the group and the >> size of the payload, of course). > > How much is non negligible in your case? As you can see from Elad's posting it can be pretty significant. >> >>> It sounds like you want sendfile() for UDP. >> Do you mean by having a per-recipient sendfile() call for the same >> file? Leaving the cost of the system call aside, this solution does >> not work well with the kind of real-time data that we've been working >> with (live streaming, online games). You would have to write the >> payload to the file as it is being generated and call sendfile() after >> each such write. > > You can mmap the file. There are a few problems with using mmap and sendfile: 1) One would really want something like sendfilev where one could specify multiple recipients in one syscall (in order to save on the mode switches). 2) I don't know what it would be like for UDP but for TCP one of the big problems with mmap/sendfile for zero copy is that the application doesn't know when the kernel has finished sending the data. As a result one can only reuse the mmapped buffer if there is some way for the application to deduce that the kernel is finished sending the data. 
Even if the application can deduce this it can often be long after the kernel has sent the data and as a result memory buffers can accumulate unnecessarily. We've had this problem trying to use this approach in a high-performance web server. 3) I think that including recipient specific data would be cumbersome and would probably require extra system calls. Possibly write() (for the prepend), sendfile() (for the common data), write() (for the append), unless one copies the common data into prestaged areas in user space ... which results in the copying we are trying to avoid. Perhaps if writev was able to write from an mmapped file with zero copies, a single recipient could be sent recipient specific and common data with one system call. However, this approach would still require one system call per recipient. ^ permalink raw reply [flat|nested] 12+ messages in thread
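The write()/sendfile()/write() sequence in point 3 can be collapsed into one call with writev(), as the last paragraph suggests. The sketch below shows only the gather part; write_framed() is a hypothetical helper, and ordinary writev() still copies all three buffers into the kernel, so it illustrates the system-call saving and not the zero-copy behaviour wished for above.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send prefix + common + suffix with a single system call.
 * On a connected datagram socket this produces one datagram;
 * on a pipe or stream it produces one contiguous write. */
ssize_t write_framed(int fd,
                     const void *prefix, size_t prefix_len,
                     const void *common, size_t common_len,
                     const void *suffix, size_t suffix_len)
{
    struct iovec iov[3];

    iov[0].iov_base = (void *)prefix;   /* recipient-specific header */
    iov[0].iov_len = prefix_len;
    iov[1].iov_base = (void *)common;   /* shared payload */
    iov[1].iov_len = common_len;
    iov[2].iov_base = (void *)suffix;   /* recipient-specific trailer */
    iov[2].iov_len = suffix_len;

    return writev(fd, iov, 3);
}
```

This still needs one call per recipient, which is the remaining cost noted in the message.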
[parent not found: <49FE9C8C.6090705@cosmosbay.com>]
* Re: [PATCH] Implementation of the sendgroup() system call [not found] ` <49FE9C8C.6090705@cosmosbay.com> @ 2009-05-04 9:03 ` Eric Dumazet 0 siblings, 0 replies; 12+ messages in thread From: Eric Dumazet @ 2009-05-04 9:03 UTC (permalink / raw) To: Elad Lahav; +Cc: linux-kernel, Linux Netdev List Eric Dumazet wrote: > Elad Lahav wrote: >> The attached patch contains an implementation of sendgroup(), a system >> call that allows a UDP packet to be transmitted efficiently to multiple >> recipients. Use cases for this system call include live-streaming and >> multi-player online games. >> The basic idea is that the caller maintains a group - a list of IP >> addresses and UDP ports - and calls sendgroup() with the group list and >> a common payload. Optionally, the call allows for per-recipient data to >> be prepended or appended to the shared block. The data is copied once in >> the kernel into an allocated page, and the per-recipient socket buffers >> point to that page. Savings come from avoiding both the multiple calls >> and the multiple copies of the data required with regular socket >> operations. We have measured an improvement of 42% in CPU utilisation >> when using this system call with the Helix multimedia server (reference: >> http://simula.no/~griff/nossdav2008/27-32.pdf). >> >> The patch includes two implementations: one as described above and one >> that uses the udp_sendmsg() function in a tight loop inside the kernel >> (and thus saves on mode switches, but not on data copies). The latter is >> provided for reference and benchmarking only. >> >> Feedback is welcome. >> > > Hi Elad > > Patch is not inlined, this is really asking for trouble, I doubt many people > will actually read your patch... > > My comments are: > > 1) Lack of latency checks. Sending UDP on 1000 destinations is expensive. > A syscall is not preemptable unless special conditions are met. > > 2) Lack of a 32/64 bits aware API. 
A 64bit kernel should be able to > run a 32bit application using a sendgroup() syscall. > > 3) Are the footer/header different for each call? Maybe you need > something better to avoid extra copies for them at each sendgroup() system call > > 4) One expensive thing on UDP sends is the route cache lookups. You could avoid > this cost using 'connected' group setup (see point 3) > > ie using a different syscall to setup the group (and compute/lookup all needed routes) > (this syscall would be able to add/delete members (with their footer/header) to socket group) > > Then sendgroup() would be really light, since it would provide a group identifier > (can be a file descriptor -> mapping one group), and the UDP message to send. Ah, some other points: You forgot to include netdev (CCed on my message), as some network guys don't read lkml every day :) In your experiments, did you change the NIC txqueue length? (default being 1000) Using sendgroup() or sendmsg(), you'll hit the NIC queue limit pretty fast anyway... Also, since 2.6.25 added memory accounting on UDP sockets, you'll probably need to increase SO_SNDBUF to avoid being blocked on sendmsg()/sendgroup() call ^ permalink raw reply [flat|nested] 12+ messages in thread
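The SO_SNDBUF adjustment mentioned at the end of the message above can be done with setsockopt(). A minimal sketch follows; set_send_buffer() is a hypothetical helper name, and the size to request depends on the workload. On Linux the kernel doubles the requested value for bookkeeping and caps it at net.core.wmem_max, so the helper reads back what was actually granted.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Ask for a send buffer of `bytes` on sock and return the size the
 * kernel reports afterwards, or -1 if either call fails. */
int set_send_buffer(int sock, int bytes)
{
    int granted = 0;
    socklen_t len = sizeof(granted);

    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes)) < 0)
        return -1;
    if (getsockopt(sock, SOL_SOCKET, SO_SNDBUF, &granted, &len) < 0)
        return -1;
    return granted;
}
```

Calling it right after socket(AF_INET, SOCK_DGRAM, 0), before the first burst of sends, would let the application log whether the granted size is smaller than what the group size requires.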