* GSO/GRO and UDP performance
@ 2013-09-04 10:07 James Yonan
  2013-09-04 11:53 ` Eric Dumazet
  0 siblings, 1 reply; 7+ messages in thread
From: James Yonan @ 2013-09-04 10:07 UTC (permalink / raw)
  To: netdev
I'm looking at ways to improve UDP performance in the kernel. 
Specifically I'd like to take some of the ideas in GSO/GRO for TCP and 
apply them to UDP as well.  Our use case is OpenVPN, but these methods 
should apply to any UDP-based app.
As I understand GSO/GRO for TCP, there are essentially two central features:
(a) it's a way of batching packets with similar headers together so that 
they can efficiently traverse the network stack as a single unit
(b) it explicitly maps the batching of packets to the L4 segmenting 
features of TCP, so that batched packets can be coalesced into TCP segments.
This approach works great for TCP because of its built-in L4 segmenting 
features, but it tends to break down for UDP because of (b) in 
particular -- UDP doesn't have an L4 segmenting model, so the 
gso_segment method for UDP resorts to segmenting the packets with L3 IP 
fragmentation (i.e. UFO).  The problem is that IP fragmentation is 
broken on so many different levels that it can't be relied on for apps 
that need to communicate over the open internet (*).  Most UDP apps do 
their own app-level fragmentation and wouldn't want to be forced to buy 
into IP fragmentation in order to get the performance benefits of GSO/GRO.
So I would like to propose a GSO/GRO implementation for UDP that works 
by batching together separate UDP packets with similar headers into a 
single skb.  There is no tie-in with L3 IP fragmentation -- the packets 
are sent over the wire and received as individual UDP packets.
Here is an example of how this might work in practice:
When I call sendmmsg from userspace with a bunch of UDP packets having 
the same header, the kernel would assemble these packets into a single 
skb via shinfo(skb)->frag_list.  There would need to be a new gso_type 
indicating that frag_list is simply a list of UDP packets having the 
same header that should be transmitted separately.  No IP fragmentation 
would be necessary as long as the app has correctly sized the packets 
for the link MTU.
Once this skb is about to reach the driver, dev_hard_start_xmit could do 
the usual GSO thing and separate out the packets in 
shinfo(skb)->frag_list and pass them individually to the driver's 
ndo_start_xmit method, if the driver doesn't support batched UDP 
packets.  There would need to be a new gso_type for this batching model, 
e.g. "SKB_GSO_UDP_BUNDLE" that drivers could optionally support.
On the receive side, we would define a gro_receive method for UDP (none 
currently exists) that does the same batching in reverse:  UDP packets 
with the same header would be collected into shinfo(skb)->frag_list and 
gso_type would be set to SKB_GSO_UDP_BUNDLE.
The bundle of UDP packets would traverse the stack as a unit until it 
reaches the socket layer, where recvmmsg could pass the whole bundle up 
to userspace in a single transaction (or recvmsg could disaggregate the 
bundle and pass each datagram individually).
This approach should also significantly speed up UDP apps running on VM 
guests, because the skbs of bundled UDP packets could be passed across 
the hypervisor/guest barrier in a single transaction.
Because this technique bundles UDP packets without coalescing or 
modifying them, the approach should be lossless with respect to 
bridging, hypervisor/guest communication, routing, etc.  It also doesn't 
interfere with existing hardware support for L4 checksum offloading 
(unlike UFO).
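
As a rough illustration of the bundling idea described above, here is a small userspace model in Python. It is only a sketch under stated assumptions: it is not kernel code, the names Datagram, bundle and unbundle are hypothetical, and "same header" is approximated by an identical (source, destination) pair. It shows that grouping packets per flow and later splitting them again returns every datagram unmodified, with per-flow order preserved.

```python
from collections import OrderedDict
from typing import NamedTuple


class Datagram(NamedTuple):
    # Hypothetical stand-in for one UDP packet. The (src, dst) pair models
    # the "same header" condition; the payload is carried untouched.
    src: tuple
    dst: tuple
    payload: bytes


def bundle(datagrams, max_per_bundle=16):
    """Group datagrams by flow, preserving arrival order within each flow."""
    open_bundles = OrderedDict()  # flow key -> bundle currently being filled
    done = []
    for d in datagrams:
        key = (d.src, d.dst)
        current = open_bundles.setdefault(key, [])
        current.append(d)
        if len(current) == max_per_bundle:
            # Bundle is full: hand it on and start a fresh one for this flow.
            done.append(open_bundles.pop(key))
    # Flush the partially filled bundles that remain.
    done.extend(open_bundles.values())
    return done


def unbundle(bundles):
    """Split bundles back into individual datagrams (software fallback)."""
    return [d for b in bundles for d in b]
```

Note that the model can change the relative order of datagrams belonging to different flows, while the order inside one flow stays as it arrived.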
Could this work?  Are there problems with this that I'm not considering?
Are there better or existing ways of doing this?
Thanks,
James
---------------------
(*) Well-known issues of UDP/IP fragmentation:
1. Relies on PMTU discovery, which often doesn't work in the real world 
because of inconsistent ICMP forwarding policies.
2. Breaks down on high-bandwidth links because the IPv4 16-bit packet ID 
value can wrap around, causing data corruption.
3. One fragment lost in transit means that the whole superpacket is lost.
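
To put a number on issue 2, the following Python sketch estimates how quickly the 16-bit IPv4 ID space is exhausted. It assumes a sender that consumes one ID per datagram, keeps the link fully saturated towards a single destination, and ignores header overhead; the function name ipid_wrap_seconds is made up for this illustration.

```python
IPV4_ID_VALUES = 2 ** 16  # the IPv4 Identification field is 16 bits wide


def ipid_wrap_seconds(link_bps, datagram_bytes):
    """Seconds until the IPv4 ID wraps, assuming one ID per datagram.

    Assumption: the link is saturated with datagrams of a fixed size and
    header overhead is ignored, so this is an order-of-magnitude estimate.
    """
    datagrams_per_second = link_bps / (datagram_bytes * 8)
    return IPV4_ID_VALUES / datagrams_per_second
```

Under these assumptions, ipid_wrap_seconds(10e9, 1500) comes out at roughly 79 milliseconds on a 10 Gb/s link, and even 64 KB datagrams wrap the ID in about 3.4 seconds.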
* Re: GSO/GRO and UDP performance
  2013-09-04 10:07 GSO/GRO and UDP performance James Yonan
@ 2013-09-04 11:53 ` Eric Dumazet
  2013-09-06  9:22   ` James Yonan
  0 siblings, 1 reply; 7+ messages in thread
From: Eric Dumazet @ 2013-09-04 11:53 UTC (permalink / raw)
  To: James Yonan; +Cc: netdev
On Wed, 2013-09-04 at 04:07 -0600, James Yonan wrote:
> The bundle of UDP packets would traverse the stack as a unit until it 
> reaches the socket layer, where recvmmsg could pass the whole bundle up 
> to userspace in a single transaction (or recvmsg could disaggregate the 
> bundle and pass each datagram individually).
That would require a lot of work, say in netfilter, but also in core
network stack in forwarding, and all UDP users (L2TP, vxlan).
Very unlikely to happen IMHO.
I suspect the performance is coming from aggregation done in user space,
then re-injected into the kernel?
You could use a kernel module, using udp_encap_enable() and friends.
Check vxlan_socket_create() for an example
* Re: GSO/GRO and UDP performance
  2013-09-04 11:53 ` Eric Dumazet
@ 2013-09-06  9:22   ` James Yonan
  2013-09-06 13:07     ` Eric Dumazet
  0 siblings, 1 reply; 7+ messages in thread
From: James Yonan @ 2013-09-06  9:22 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
On 04/09/2013 05:53, Eric Dumazet wrote:
> On Wed, 2013-09-04 at 04:07 -0600, James Yonan wrote:
>
>> The bundle of UDP packets would traverse the stack as a unit until it
>> reaches the socket layer, where recvmmsg could pass the whole bundle up
>> to userspace in a single transaction (or recvmsg could disaggregate the
>> bundle and pass each datagram individually).
>
> That would require a lot of work, say in netfilter, but also in core
> network stack in forwarding, and all UDP users (L2TP, vxlan).
>
> Very unlikely to happen IMHO.
I agree that aggregating packets by chaining multiple packets into a 
single skb would be too disruptive.
However I believe GSO/GRO provides a potential solution here that would 
be transparent to the core network stack and existing in-kernel UDP users.
GSO/GRO already allows any L4 protocol or lower to define their own 
segmentation and aggregation algorithms, as long as the algorithms are 
lossless.
There's no reason why GSO/GRO couldn't operate on L5 or higher protocols 
if segmentation and aggregation algorithms are provided by a kernel 
module that understands the specific app protocol.
It looks like this could be done with minimal changes to the GSO/GRO 
core.  There would need to be a hook where a kernel module could 
register itself as a GSO/GRO provider for UDP.  It could then perform 
segmentation/aggregation on UDP packets that belong to it.  The dispatch 
to the UDP GSO/GRO providers would be done by the existing offload code 
for UDP, so there would be zero added overhead for non-UDP protocols.
>
> I suspect the performance is coming from aggregation done in user space,
> then re-injected into the kernel ?
>
> You could use a kernel module, using udp_encap_enable() and friends.
>
> Check vxlan_socket_create() for an example
I actually put together a test kernel module using udp_encap_enable to 
see if I could accelerate UDP performance that way.  But even with the 
boost of running in kernel space, the packet processing overhead of 
dealing with 1500 byte packets negates most of the gain, while TCP gets 
a 43x performance boost by being able to aggregate up to 64KB per 
superpacket with GSO/GRO.
So I think that playing well with GSO/GRO is essential to get speedup in 
UDP apps because of this 43x multiplier.
James
* Re: GSO/GRO and UDP performance
  2013-09-06  9:22   ` James Yonan
@ 2013-09-06 13:07     ` Eric Dumazet
  2013-09-06 16:42       ` Rick Jones
  0 siblings, 1 reply; 7+ messages in thread
From: Eric Dumazet @ 2013-09-06 13:07 UTC (permalink / raw)
  To: James Yonan; +Cc: netdev
On Fri, 2013-09-06 at 03:22 -0600, James Yonan wrote:
> So I think that playing well with GSO/GRO is essential to get speedup in 
> UDP apps because of this 43x multiplier.
> 
That's not true. GRO cannot aggregate more than 16+1 packets.
I think we cannot aggregate UDP packets, because UDP lacks sequence
numbers, so reorders would be a problem.
You really need something that is not UDP generic.
* Re: GSO/GRO and UDP performance
  2013-09-06 13:07     ` Eric Dumazet
@ 2013-09-06 16:42       ` Rick Jones
  2013-09-06 19:26         ` James Yonan
  0 siblings, 1 reply; 7+ messages in thread
From: Rick Jones @ 2013-09-06 16:42 UTC (permalink / raw)
  To: James Yonan; +Cc: Eric Dumazet, netdev
On 09/06/2013 06:07 AM, Eric Dumazet wrote:
> On Fri, 2013-09-06 at 03:22 -0600, James Yonan wrote:
>
>> So I think that playing well with GSO/GRO is essential to get speedup in
>> UDP apps because of this 43x multiplier.
>>
>
> Thats not true. GRO cannot aggregate more than 16+1 packets.
>
> I think we cannot aggregate UDP packets, because UDP lacks sequence
> numbers, so reorders would be a problem.
>
> You really need something that is not UDP generic.
It may not be as sexy, and it cannot get the 43x multiplier (just what 
*is* the service demand change on a netperf TCP_STREAM test these days 
between GSO/GRO on and off anyway?), but looking for basic path-length 
reductions would be goodness.
rick jones
* Re: GSO/GRO and UDP performance
  2013-09-06 16:42       ` Rick Jones
@ 2013-09-06 19:26         ` James Yonan
  2013-09-06 19:32           ` Eric Dumazet
  0 siblings, 1 reply; 7+ messages in thread
From: James Yonan @ 2013-09-06 19:26 UTC (permalink / raw)
  To: Rick Jones; +Cc: Eric Dumazet, netdev
On 06/09/2013 10:42, Rick Jones wrote:
> On 09/06/2013 06:07 AM, Eric Dumazet wrote:
>> On Fri, 2013-09-06 at 03:22 -0600, James Yonan wrote:
>>
>>> So I think that playing well with GSO/GRO is essential to get speedup in
>>> UDP apps because of this 43x multiplier.
>>>
>>
>> Thats not true. GRO cannot aggregate more than 16+1 packets.
Where does the 16+1 come from?  I'm getting my 43x from the ratio of max 
legal IP packet size (64KB) / internet MTU (1500).  Are you saying that 
GRO cannot aggregate up to 64 KB?
>> I think we cannot aggregate UDP packets, because UDP lacks sequence
>> numbers, so reorders would be a problem.
>> You really need something that is not UDP generic.
Right -- that's why I'm proposing a hook for UDP GSO/GRO providers that 
know about specific app-layer protocols and can provide segmentation and 
aggregation methods for them.  Such a provider would be implemented in a 
kernel module and would know about the specific app-layer protocol, so 
it would be able to losslessly segment and aggregate it (i.e. it could 
use a sequence number from the app-layer protocol).
> It may  not be as sexy, and it cannot get the 43x multiplier (just what
> *is* the service demand change on a netperf TCP_STREAM test these days
> between GSO/GRO on and off anyway?)
That's something I haven't really looked too closely at yet.  With 
MAX_GRO_SKBS set to only 8, how well would this really scale?
> but looking for basic path-length reductions would be goodness.
Path is fairly optimized as-is.
Direction 1: udp_encap_recv -> tunnel decapsulation -> netif_rx
Direction 2: ndo_start_xmit -> tunnel encapsulation -> ip_local_out
I've also looked into getting closer to driver TX by using 
dev_queue_xmit instead of ip_local_out.
Even though this is a virtual driver without interrupts, I'm also 
looking at NAPI as a way of getting packet flows into GRO on the RX side.
Bottom line is that I want to saturate 10 GigE with UDP packets without 
breaking a sweat.  ixgbe or other drivers in that class can handle it if 
the per-packet overhead in the network stack can be reduced enough.
James
* Re: GSO/GRO and UDP performance
  2013-09-06 19:26         ` James Yonan
@ 2013-09-06 19:32           ` Eric Dumazet
  0 siblings, 0 replies; 7+ messages in thread
From: Eric Dumazet @ 2013-09-06 19:32 UTC (permalink / raw)
  To: James Yonan; +Cc: Rick Jones, netdev
On Fri, 2013-09-06 at 13:26 -0600, James Yonan wrote:
> Where does the 16+1 come from?  I'm getting my 43x from the ratio of max 
> legal IP packet size (64KB) / internet MTU (1500).  Are you saying that 
> GRO cannot aggregate up to 64 KB?
> 
Yes, this is what I said.
Hint: MAX_SKB_FRAGS is the maximum number of fragments per skb.
Each aggregated frame consumes at least one fragment.
Hint: some drivers use more than one fragment per datagram.
-> A fragment in an skb does not necessarily contain one and exactly one
datagram.
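
The two numbers being argued over can be laid side by side with a little arithmetic. The Python sketch below is only an illustration: the formula for MAX_SKB_FRAGS is an assumption based on the kernel headers of that period (65536/PAGE_SIZE + 1, with a floor of 16), and the function names are invented here.

```python
MTU = 1500             # typical internet MTU used in the thread
MAX_IP_PACKET = 65535  # largest legal IPv4 packet, in bytes


def max_skb_frags(page_size=4096):
    # Assumption: mirrors the kernel definition of that era,
    # 65536/PAGE_SIZE + 1 with a lower bound of 16.
    return max(16, 65536 // page_size + 1)


def naive_multiplier(mtu=MTU):
    """How many MTU-sized frames fit in one maximum-size IP packet."""
    return MAX_IP_PACKET // mtu


def frag_limited_multiplier(frags_per_frame=1, page_size=4096):
    """Frames one skb can aggregate when each frame costs whole fragments."""
    return max_skb_frags(page_size) // frags_per_frame
```

With 4 KB pages the fragment limit works out to 17, which is the "16+1" quoted earlier, so at best 17 frames of 1500 bytes (about 25 KB) fit in one aggregated skb rather than the 43 that the 64 KB / 1500 ratio suggests. A driver that spends two fragments per frame would halve that again.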
> >> I think we cannot aggregate UDP packets, because UDP lacks sequence
> >> numbers, so reorders would be a problem.
> 
> >> You really need something that is not UDP generic.
> 
> Right -- that's why I'm proposing a hook for UDP GSO/GRO providers that 
> know about specific app-layer protocols and can provide segmentation and 
> aggregation methods for them.  Such a provider would be implemented in a 
> kernel module and would know about the specific app-layer protocol, so 
> it would be able to losslessly segment and aggregate it (i.e. it could 
> use a sequence number from the app-layer protocol).
It's not a choice given by the application.
As I said you'll have to make sure all the stack will understand the
meaning of datagram aggregation.
 
 
 
 
 