netdev.vger.kernel.org archive mirror
* Best way to reduce system call overhead for tun device I/O?
@ 2016-03-29 22:40 Guus Sliepen
  2016-03-31 21:18 ` Tom Herbert
  0 siblings, 1 reply; 10+ messages in thread
From: Guus Sliepen @ 2016-03-29 22:40 UTC (permalink / raw)
  To: netdev

I'm trying to reduce system call overhead when reading/writing to/from a
tun device in userspace. For sockets, one can use sendmmsg()/recvmmsg(),
but a tun fd is not a socket fd, so this doesn't work. I see several
options to allow userspace to read/write multiple packets with one
syscall:

- Implement a TX/RX ring buffer that is mmap()ed, like with AF_PACKET
  sockets.

- Implement an ioctl() to emulate sendmmsg()/recvmmsg().

- Add a flag that can be set using TUNSETIFF that makes regular
  read()/write() calls handle multiple packets in one go.

- Expose a socket fd to userspace, so regular sendmmsg()/recvmmsg() can
  be used. There is tun_get_socket() which is used internally in the
  kernel, but this is not exposed to userspace, and doesn't look trivial
  to do either.
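
For comparison, recvmmsg() already gives sockets exactly the kind of
batching I'm after. A minimal sketch (untested, error handling
omitted):

#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define BATCH 64
#define MTU   1500

/* receive up to BATCH packets with a single system call; a tun fd
 * currently yields only one packet per read() */
int recv_batch(int sock, char bufs[BATCH][MTU])
{
    struct mmsghdr msgs[BATCH];
    struct iovec iovs[BATCH];

    memset(msgs, 0, sizeof(msgs));
    for (int i = 0; i < BATCH; i++) {
        iovs[i].iov_base = bufs[i];
        iovs[i].iov_len = MTU;
        msgs[i].msg_hdr.msg_iov = &iovs[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }
    return recvmmsg(sock, msgs, BATCH, 0, NULL);
}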

What would be the right way to do this?

-- 
Met vriendelijke groet / with kind regards,
     Guus Sliepen <guus@tinc-vpn.org>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Best way to reduce system call overhead for tun device I/O?
  2016-03-29 22:40 Best way to reduce system call overhead for tun device I/O? Guus Sliepen
@ 2016-03-31 21:18 ` Tom Herbert
  2016-03-31 21:20   ` David Miller
  0 siblings, 1 reply; 10+ messages in thread
From: Tom Herbert @ 2016-03-31 21:18 UTC (permalink / raw)
  To: Guus Sliepen; +Cc: Linux Kernel Network Developers

On Tue, Mar 29, 2016 at 6:40 PM, Guus Sliepen <guus@tinc-vpn.org> wrote:
> I'm trying to reduce system call overhead when reading/writing to/from a
> tun device in userspace. For sockets, one can use sendmmsg()/recvmmsg(),
> but a tun fd is not a socket fd, so this doesn't work. I see several
> options to allow userspace to read/write multiple packets with one
> syscall:
>
> - Implement a TX/RX ring buffer that is mmap()ed, like with AF_PACKET
>   sockets.
>
> - Implement an ioctl() to emulate sendmmsg()/recvmmsg().
>
> - Add a flag that can be set using TUNSETIFF that makes regular
>   read()/write() calls handle multiple packets in one go.
>
> - Expose a socket fd to userspace, so regular sendmmsg()/recvmmsg() can
>   be used. There is tun_get_socket() which is used internally in the
>   kernel, but this is not exposed to userspace, and doesn't look trivial
>   to do either.
>
> What would be the right way to do this?
>
Personally, I think tun could benefit greatly if it were implemented as
a socket instead of a character interface. One thing that could be much
better is sending/receiving metadata attached to the skbuff. For
instance, GSO data could be carried as ancillary data on a socket
instead of inline with the packet data, as tun seems to do now.

Tom

> --
> Met vriendelijke groet / with kind regards,
>      Guus Sliepen <guus@tinc-vpn.org>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Best way to reduce system call overhead for tun device I/O?
  2016-03-31 21:18 ` Tom Herbert
@ 2016-03-31 21:20   ` David Miller
  2016-03-31 22:28     ` Guus Sliepen
  0 siblings, 1 reply; 10+ messages in thread
From: David Miller @ 2016-03-31 21:20 UTC (permalink / raw)
  To: tom; +Cc: guus, netdev

From: Tom Herbert <tom@herbertland.com>
Date: Thu, 31 Mar 2016 17:18:48 -0400

> On Tue, Mar 29, 2016 at 6:40 PM, Guus Sliepen <guus@tinc-vpn.org> wrote:
>> I'm trying to reduce system call overhead when reading/writing to/from a
>> tun device in userspace. For sockets, one can use sendmmsg()/recvmmsg(),
>> but a tun fd is not a socket fd, so this doesn't work. I see several
>> options to allow userspace to read/write multiple packets with one
>> syscall:
>>
>> - Implement a TX/RX ring buffer that is mmap()ed, like with AF_PACKET
>>   sockets.
>>
>> - Implement an ioctl() to emulate sendmmsg()/recvmmsg().
>>
>> - Add a flag that can be set using TUNSETIFF that makes regular
>>   read()/write() calls handle multiple packets in one go.
>>
>> - Expose a socket fd to userspace, so regular sendmmsg()/recvmmsg() can
>>   be used. There is tun_get_socket() which is used internally in the
>>   kernel, but this is not exposed to userspace, and doesn't look trivial
>>   to do either.
>>
>> What would be the right way to do this?
>>
> Personally, I think tun could benefit greatly if it were implemented as
> a socket instead of a character interface. One thing that could be much
> better is sending/receiving metadata attached to the skbuff. For
> instance, GSO data could be carried as ancillary data on a socket
> instead of inline with the packet data, as tun seems to do now.

Agreed.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Best way to reduce system call overhead for tun device I/O?
  2016-03-31 21:20   ` David Miller
@ 2016-03-31 22:28     ` Guus Sliepen
  2016-03-31 23:39       ` Stephen Hemminger
  0 siblings, 1 reply; 10+ messages in thread
From: Guus Sliepen @ 2016-03-31 22:28 UTC (permalink / raw)
  To: David Miller; +Cc: tom, netdev

On Thu, Mar 31, 2016 at 05:20:50PM -0400, David Miller wrote:

> >> I'm trying to reduce system call overhead when reading/writing to/from a
> >> tun device in userspace. [...] What would be the right way to do this?
> >>
> > Personally, I think tun could benefit greatly if it were implemented as
> > a socket instead of a character interface. One thing that could be much
> > better is sending/receiving metadata attached to the skbuff. For
> > instance, GSO data could be carried as ancillary data on a socket
> > instead of inline with the packet data, as tun seems to do now.
> 
> Agreed.

Ok. So how should the userspace API work? Creating an AF_PACKET socket
and then using a tun ioctl to create a tun interface and bind it to the
socket?

int fd = socket(AF_PACKET, ...);
struct ifreq ifr = {...};
ioctl(fd, TUNSETIFF, &ifr);

-- 
Met vriendelijke groet / with kind regards,
     Guus Sliepen <guus@tinc-vpn.org>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Best way to reduce system call overhead for tun device I/O?
  2016-03-31 22:28     ` Guus Sliepen
@ 2016-03-31 23:39       ` Stephen Hemminger
  2016-04-03 23:03         ` Willem de Bruijn
  0 siblings, 1 reply; 10+ messages in thread
From: Stephen Hemminger @ 2016-03-31 23:39 UTC (permalink / raw)
  To: Guus Sliepen; +Cc: David Miller, tom, netdev

On Fri, 1 Apr 2016 00:28:57 +0200
Guus Sliepen <guus@tinc-vpn.org> wrote:

> On Thu, Mar 31, 2016 at 05:20:50PM -0400, David Miller wrote:
> 
> > >> I'm trying to reduce system call overhead when reading/writing to/from a
> > >> tun device in userspace. [...] What would be the right way to do this?
> > >>
> > > Personally, I think tun could benefit greatly if it were implemented as
> > > a socket instead of a character interface. One thing that could be much
> > > better is sending/receiving metadata attached to the skbuff. For
> > > instance, GSO data could be carried as ancillary data on a socket
> > > instead of inline with the packet data, as tun seems to do now.
> > 
> > Agreed.
> 
> Ok. So how should the userspace API work? Creating an AF_PACKET socket
> and then using a tun ioctl to create a tun interface and bind it to the
> socket?
> 
> int fd = socket(AF_PACKET, ...);
> struct ifreq ifr = {...};
> ioctl(fd, TUNSETIFF, &ifr);
> 

Rather than bodge AF_PACKET onto TUN, why not just create a new device
type and control it with something modern like netlink?

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Best way to reduce system call overhead for tun device I/O?
  2016-03-31 23:39       ` Stephen Hemminger
@ 2016-04-03 23:03         ` Willem de Bruijn
  2016-04-04 14:40           ` Guus Sliepen
  0 siblings, 1 reply; 10+ messages in thread
From: Willem de Bruijn @ 2016-04-03 23:03 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Guus Sliepen, David Miller, Tom Herbert, Network Development

On Thu, Mar 31, 2016 at 7:39 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Fri, 1 Apr 2016 00:28:57 +0200
> Guus Sliepen <guus@tinc-vpn.org> wrote:
>
>> On Thu, Mar 31, 2016 at 05:20:50PM -0400, David Miller wrote:
>>
>> > >> I'm trying to reduce system call overhead when reading/writing to/from a
>> > >> tun device in userspace. [...] What would be the right way to do this?
>> > >>
>> > > Personally, I think tun could benefit greatly if it were implemented as
>> > > a socket instead of a character interface. One thing that could be much
>> > > better is sending/receiving metadata attached to the skbuff. For
>> > > instance, GSO data could be carried as ancillary data on a socket
>> > > instead of inline with the packet data, as tun seems to do now.
>> >
>> > Agreed.
>>
>> Ok. So how should the userspace API work? Creating an AF_PACKET socket
>> and then using a tun ioctl to create a tun interface and bind it to the
>> socket?
>>
>> int fd = socket(AF_PACKET, ...);
>> struct ifreq ifr = {...};
>> ioctl(fd, TUNSETIFF, &ifr);
>>
>
> Rather than bodge AF_PACKET onto TUN, why not just create a new device
> type and control it with something modern like netlink?

Depending on the use case, it may be sufficient to extend AF_PACKET
with limited tap functionality:

- add a po->xmit mode that reinjects into the kernel receive path,
  analogous to pktgen's M_NETIF_RECEIVE mode.

- optionally drop packets in __netif_receive_skb_core and xmit_one
  if any of the registered packet sockets accepted the packet and has
  a new intercept feature flag enabled.

This could be applied to a dummy device, but it is much more interesting
to interpose on the flow of a normal NIC. It is clearly not a drop-in
replacement for a tap (let alone tun) device. I have some preliminary
code.
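
To make the first point concrete, the userspace side might look
something like this (PACKET_RX_INJECT is a name made up purely for
illustration; no such option exists today):

int one = 1;

/* hypothetical socket option: make frames sent on this packet socket
 * reenter the kernel receive path of the bound device, instead of
 * being transmitted out of it */
setsockopt(fd, SOL_PACKET, PACKET_RX_INJECT, &one, sizeof(one));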

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Best way to reduce system call overhead for tun device I/O?
@ 2016-04-04 13:35 ValdikSS
  2016-04-04 17:28 ` Stephen Hemminger
  0 siblings, 1 reply; 10+ messages in thread
From: ValdikSS @ 2016-04-04 13:35 UTC (permalink / raw)
  To: Guus Sliepen
  Cc: Stephen Hemminger, Willem de Bruijn, David Miller, Tom Herbert,
	netdev

I'm trying to increase OpenVPN throughput by optimizing its tun device
I/O, too. Right now I have more questions than answers.

I get about 800 Mbit/s through OpenVPN with authentication and
encryption disabled on a local machine, with the OpenVPN server and
client running in different network namespaces connected by veth, and
a 1500-byte MTU on the TUN interface. That is rather limiting: low-end
devices like SOHO routers with a 560 MHz CPU can only achieve 15-20
Mbit/s through OpenVPN with encryption. Increasing the MTU reduces the
overhead: you can get more than 5 Gbit/s with a 16000-byte MTU on the
TUN interface. This is not specific to OpenVPN: none of the tunneling
software I tried can reach gigabit speeds without encryption on my
machine at MTU 1500. I didn't test tinc, though.

TUN supports various offloading techniques (GSO, TSO, UFO), just like
hardware NICs. From what I understand, if we used GSO/GRO with TUN, we
could receive and send many small packets combined into one huge packet
per send/recv call while keeping a 1500-byte MTU on the TUN interface,
and performance should increase to what it is now with the large MTU.
But there is very little information on how to use offloading with TUN.
I found some old example code that creates a TUN interface with GSO
support (TUN_VNET_HDR), does NAT, and echoes the TUN data to stdout,
plus a script to run two instances of it connected with a pipe. But it
doesn't work for me: I never see any combined frames (gso_type is
always 0 in the virtio_net_hdr header). I probably did something wrong,
but I'm not sure what.

Here's the application in question: http://ovrload.ru/f/68996_tun.tar.gz
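
For reference, my understanding of the setup is roughly this (a minimal
sketch; untested, error handling omitted):

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/if.h>
#include <linux/if_tun.h>
#include <linux/virtio_net.h>

/* open a tun device with the vnet header enabled and advertise which
 * offloads userspace is willing to handle */
int tun_open_offload(const char *name)
{
    int fd = open("/dev/net/tun", O_RDWR);
    struct ifreq ifr;

    memset(&ifr, 0, sizeof(ifr));
    ifr.ifr_flags = IFF_TUN | IFF_NO_PI | IFF_VNET_HDR;
    strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
    ioctl(fd, TUNSETIFF, &ifr);
    ioctl(fd, TUNSETOFFLOAD, TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6);

    /* every read()/write() now starts with a struct virtio_net_hdr
     * describing the possibly-larger-than-MTU packet that follows */
    return fd;
}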

The questions are as follows:

 1. Do I understand correctly that GSO/GRO would have the same effect
    as increasing the MTU on the TUN interface?
 2. How are GRO/GSO different from TSO and UFO?
 3. Can we get and send combined frames directly from/to a NIC with
    offloading support?
 4. How does one implement GRO/GSO, TSO, and UFO? What should the
    logic behind them be?


Any reply is greatly appreciated.

P.S. this could be helpful: https://ldpreload.com/p/tuntap-notes.txt

> I'm trying to reduce system call overhead when reading/writing to/from a
> tun device in userspace. For sockets, one can use sendmmsg()/recvmmsg(),
> but a tun fd is not a socket fd, so this doesn't work. I see several
> options to allow userspace to read/write multiple packets with one
> syscall:
>
> - Implement a TX/RX ring buffer that is mmap()ed, like with AF_PACKET
>   sockets.
>
> - Implement an ioctl() to emulate sendmmsg()/recvmmsg().
>
> - Add a flag that can be set using TUNSETIFF that makes regular
>   read()/write() calls handle multiple packets in one go.
>
> - Expose a socket fd to userspace, so regular sendmmsg()/recvmmsg() can
>   be used. There is tun_get_socket() which is used internally in the
>   kernel, but this is not exposed to userspace, and doesn't look trivial
>   to do either.
>
> What would be the right way to do this?
>
> -- 
> Met vriendelijke groet / with kind regards,
>      Guus Sliepen <guus@tinc-vpn.org>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Best way to reduce system call overhead for tun device I/O?
       [not found] <57026C8F.8050406@valdikss.org.ru>
@ 2016-04-04 14:31 ` Guus Sliepen
  0 siblings, 0 replies; 10+ messages in thread
From: Guus Sliepen @ 2016-04-04 14:31 UTC (permalink / raw)
  To: ValdikSS
  Cc: Stephen Hemminger, Willem de Bruijn, David Miller, Tom Herbert,
	netdev

On Mon, Apr 04, 2016 at 04:30:55PM +0300, ValdikSS wrote:

> I'm trying to increase OpenVPN throughput by optimizing its tun device
> I/O, too. Right now I have more questions than answers.
> 
> I get about 800 Mbit/s through OpenVPN with authentication and
> encryption disabled on a local machine, with the OpenVPN server and
> client running in different network namespaces connected by veth, and
> a 1500-byte MTU on the TUN interface. That is rather limiting: low-end
> devices like SOHO routers with a 560 MHz CPU can only achieve 15-20
> Mbit/s through OpenVPN with encryption. Increasing the MTU reduces the
> overhead: you can get more than 5 Gbit/s with a 16000-byte MTU on the
> TUN interface. This is not specific to OpenVPN: none of the tunneling
> software I tried can reach gigabit speeds without encryption on my
> machine at MTU 1500. I didn't test tinc, though.

It's exactly the same issue for tinc. But tinc does path MTU discovery,
and actively limits the size of packets inside the tunnel so that the
outer packets are not bigger than the PMTU. Of course this can be
disabled, but experience has shown that transmitting large UDP packets
over the Internet is not ideal, since they will be fragmented, and the
loss of one fragment means the whole packet is dropped. In the case of
OpenVPN, I think many users use --mssfix, so they too are in effect
limiting the size of packets inside the tunnel.

Of course, tinc could fragment packets internally (it actually does so
in some circumstances), but I'd rather avoid that.

Also, GSO and GRO only optimize within a single large UDP packet or a
single TCP stream. If you have many concurrent programs sending data,
or one program sending lots of small UDP packets, those will never be
coalesced.

So I think GSO/GRO is not the way to go, but there really should be a
way to receive and send many individual packets in one system call.

-- 
Met vriendelijke groet / with kind regards,
     Guus Sliepen <guus@tinc-vpn.org>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Best way to reduce system call overhead for tun device I/O?
  2016-04-03 23:03         ` Willem de Bruijn
@ 2016-04-04 14:40           ` Guus Sliepen
  0 siblings, 0 replies; 10+ messages in thread
From: Guus Sliepen @ 2016-04-04 14:40 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Stephen Hemminger, David Miller, Tom Herbert, Network Development

On Sun, Apr 03, 2016 at 07:03:09PM -0400, Willem de Bruijn wrote:

> On Thu, Mar 31, 2016 at 7:39 PM, Stephen Hemminger <stephen@networkplumber.org> wrote:
>
> > Rather than bodge AF_PACKET onto TUN, why not just create a new device
> > type and control it with something modern like netlink?

Do we really want to introduce a whole new device type? The tun device
is working perfectly fine, except for the fact that there is no way to
send/receive multiple packets in one go.

> Depending on the use case, it may be sufficient to extend AF_PACKET
> with limited tap functionality:
> 
> - add a po->xmit mode that reinjects into the kernel receive path,
>   analogous to pktgen's M_NETIF_RECEIVE mode.
> 
> - optionally drop packets in __netif_receive_skb_core and xmit_one
>   if any of the registered packet sockets accepted the packet and has
>   a new intercept feature flag enabled.
> 
> This could be applied to a dummy device, but it is much more interesting
> to interpose on the flow of a normal NIC. It is clearly not a drop-in
> replacement for a tap (let alone tun) device. I have some preliminary
> code.

It's not really tinc's use case, but I did try using a socket(AF_PACKET)
bound to a tun interface, just to see whether sendmmsg()/recvmmsg() work
then. They do, but indeed, packets sent to the socket need to be
reinjected into the kernel receive path. So I'll be happy to test out
your preliminary code.
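
For reference, roughly what I tried (a minimal sketch; error handling
omitted):

#define _GNU_SOURCE
#include <string.h>
#include <arpa/inet.h>
#include <net/if.h>
#include <sys/socket.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>

/* bind a packet socket to an existing tun interface, after which
 * recvmmsg()/sendmmsg() can batch packets on it */
int packet_sock_on_tun(const char *ifname)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    struct sockaddr_ll sll;

    memset(&sll, 0, sizeof(sll));
    sll.sll_family = AF_PACKET;
    sll.sll_protocol = htons(ETH_P_ALL);
    sll.sll_ifindex = if_nametoindex(ifname);
    bind(fd, (struct sockaddr *)&sll, sizeof(sll));
    return fd;
}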

-- 
Met vriendelijke groet / with kind regards,
     Guus Sliepen <guus@tinc-vpn.org>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Best way to reduce system call overhead for tun device I/O?
  2016-04-04 13:35 ValdikSS
@ 2016-04-04 17:28 ` Stephen Hemminger
  0 siblings, 0 replies; 10+ messages in thread
From: Stephen Hemminger @ 2016-04-04 17:28 UTC (permalink / raw)
  To: ValdikSS
  Cc: Guus Sliepen, Willem de Bruijn, David Miller, Tom Herbert, netdev

On Mon, 4 Apr 2016 16:35:13 +0300
ValdikSS <iam@valdikss.org.ru> wrote:

> I'm trying to increase OpenVPN throughput by optimizing its tun device
> I/O, too. Right now I have more questions than answers.
> 
> I get about 800 Mbit/s through OpenVPN with authentication and
> encryption disabled on a local machine, with the OpenVPN server and
> client running in different network namespaces connected by veth, and
> a 1500-byte MTU on the TUN interface. That is rather limiting: low-end
> devices like SOHO routers with a 560 MHz CPU can only achieve 15-20
> Mbit/s through OpenVPN with encryption. Increasing the MTU reduces the
> overhead: you can get more than 5 Gbit/s with a 16000-byte MTU on the
> TUN interface. This is not specific to OpenVPN: none of the tunneling
> software I tried can reach gigabit speeds without encryption on my
> machine at MTU 1500. I didn't test tinc, though.
> 
> TUN supports various offloading techniques (GSO, TSO, UFO), just like
> hardware NICs. From what I understand, if we used GSO/GRO with TUN, we
> could receive and send many small packets combined into one huge packet
> per send/recv call while keeping a 1500-byte MTU on the TUN interface,
> and performance should increase to what it is now with the large MTU.
> But there is very little information on how to use offloading with TUN.
> I found some old example code that creates a TUN interface with GSO
> support (TUN_VNET_HDR), does NAT, and echoes the TUN data to stdout,
> plus a script to run two instances of it connected with a pipe. But it
> doesn't work for me: I never see any combined frames (gso_type is
> always 0 in the virtio_net_hdr header). I probably did something wrong,
> but I'm not sure what.
> 
> Here's the application in question: http://ovrload.ru/f/68996_tun.tar.gz
> 
> The questions are as follows:
> 
>  1. Do I understand correctly that GSO/GRO would have the same effect
>     as increasing the MTU on the TUN interface?
>  2. How are GRO/GSO different from TSO and UFO?
>  3. Can we get and send combined frames directly from/to a NIC with
>     offloading support?
>  4. How does one implement GRO/GSO, TSO, and UFO? What should the
>     logic behind them be?
> 
> 
> Any reply is greatly appreciated.
> 
> P.S. this could be helpful: https://ldpreload.com/p/tuntap-notes.txt
> 
> > I'm trying to reduce system call overhead when reading/writing to/from a
> > tun device in userspace. For sockets, one can use sendmmsg()/recvmmsg(),
> > but a tun fd is not a socket fd, so this doesn't work. I see several
> > options to allow userspace to read/write multiple packets with one
> > syscall:
> >
> > - Implement a TX/RX ring buffer that is mmap()ed, like with AF_PACKET
> >   sockets.
> >
> > - Implement an ioctl() to emulate sendmmsg()/recvmmsg().
> >
> > - Add a flag that can be set using TUNSETIFF that makes regular
> >   read()/write() calls handle multiple packets in one go.
> >
> > - Expose a socket fd to userspace, so regular sendmmsg()/recvmmsg() can
> >   be used. There is tun_get_socket() which is used internally in the
> >   kernel, but this is not exposed to userspace, and doesn't look trivial
> >   to do either.
> >
> > What would be the right way to do this?
> >
> > -- 
> > Met vriendelijke groet / with kind regards,
> >      Guus Sliepen <guus@tinc-vpn.org>

The first step to getting better performance through GRO would be to
modify the TUN device to use NAPI on receive. I tried this once, and it
got more complex than I had patience for, because a TUN device write
obviously happens in user process context.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2016-04-04 17:28 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-29 22:40 Best way to reduce system call overhead for tun device I/O? Guus Sliepen
2016-03-31 21:18 ` Tom Herbert
2016-03-31 21:20   ` David Miller
2016-03-31 22:28     ` Guus Sliepen
2016-03-31 23:39       ` Stephen Hemminger
2016-04-03 23:03         ` Willem de Bruijn
2016-04-04 14:40           ` Guus Sliepen
  -- strict thread matches above, loose matches on Subject: below --
2016-04-04 13:35 ValdikSS
2016-04-04 17:28 ` Stephen Hemminger
     [not found] <57026C8F.8050406@valdikss.org.ru>
2016-04-04 14:31 ` Guus Sliepen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).