From: "Michael S. Tsirkin"
Subject: Re: [PATCH] tuntap: add flow control to support back pressure
Date: Mon, 14 Apr 2014 08:40:31 +0300
Message-ID: <20140414054031.GA15088@redhat.com>
References: <5345FFBC.7020502@adjacentlink.com> <20140410102931.GA12077@redhat.com> <5347487B.4040608@adjacentlink.com> <34F30F7A-5AEF-4E7B-9513-7A008CEE8820@nrl.navy.mil> <20140413141418.GA17839@redhat.com> <534B39D3.8040804@adjacentlink.com>
Cc: Brian Adamson , davem@davemloft.net, jasowang@redhat.com, xemul@parallels.com, wuzhy@linux.vnet.ibm.com, therbert@google.com, yamato@redhat.com, richardcochran@gmail.com, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Joseph Giovatto
To: Steven Galgano
In-Reply-To: <534B39D3.8040804@adjacentlink.com>

On Sun, Apr 13, 2014 at 09:28:51PM -0400, Steven Galgano wrote:
> On 04/13/2014 10:14 AM, Michael S. Tsirkin wrote:
> >
> > Steven, Brian,
> >
> > thanks for reporting this issue.
> > Please see my comments below.
> >
> > On Fri, Apr 11, 2014 at 12:41:42PM -0400, Brian Adamson wrote:
> >> To weigh in on the desire to have support (at least as an optional
> >> behavior) for the legacy flow control behavior, there are many existing
> >> uses of it. Many of these are related to experimental purposes where the
> >> tuntap driver can be used (with a little user-space code) as a surrogate
> >> for a network interface type that may not even exist yet. And in some
> >> cases these experimental purposes have had utility for actual deployment
> >> (e.g. disaster relief wireless networks where the TAP device has
> >> provided some intermediate assistance for routing or other protocols, or
> >> even an underwater acoustic sensor network proposed for reef monitoring,
> >> where a TAP device provides the network interface and the sound card is
> >> used as a modem on an embedded system). Some of these networks have low
> >> data rates, or packet loss and delays, that make TCP (which provides
> >> flow control as part of its usual reliable transport for more typical
> >> networking purposes) not an ideal protocol to use, and so UDP or other
> >> alternatives are used. To keep this short, I'll list a few use cases I
> >> know of (and was involved with implementing some of), with links where
> >> I know them:
> >>
> >> 1) CORE network emulation tool (http://code.google.com/p/coreemu/)
> >>
> >> 2) EMANE network emulation tool (https://github.com/adjacentlink/emane)
> >>
> >> (likely other network emulation tools exist that have used tuntap as a
> >> surrogate for real physical interfaces and expect the same back pressure
> >> to sockets and queues that physical interfaces provide)
> >>
> >> 3) I don't have a link to it, but I implemented an experimental IP
> >> interface / MAC protocol called SLIDE (serial-link internet daemon) that
> >> implemented a user-space CSMA MAC protocol, where an underwater acoustic
> >> modem was connected to the serial port and TAP was used to present a
> >> virtual network interface to the IP stack. Because of the low data rates
> >> involved, the back pressure flow control to application sockets (and to
> >> protocol daemons, with qdiscs applied) was important.
> >>
> >> 4) The user-space implementation of Simplified Multicast Forwarding
> >> (SMF) of RFC 6621 has a "device" option that establishes TAP interfaces
> >> to perform distributed "backpressure" based flow control (and
> >> potentially routing) for MANET wireless networks.
> >> (http://www.nrl.navy.mil/itd/ncs/products/smf)
> >>
> >> There are probably some more, among the more esoteric wireless and other
> >> special networking communities, where hosts (or non-host
> >> routers/gateways/proxies, e.g. special embedded devices based on Linux
> >> such as sensors) have a first-hop network attachment that is _not_ the
> >> typical Ethernet, and may be using tuntap along with a sort of
> >> user-space "driver" to present an IP interface to the network stack.
> >> Some of this stuff, especially embedded systems, tends to lag behind
> >> with respect to kernel versions, so this behavior change in Linux may be
> >> as yet undiscovered even though the change went in a couple of years
> >> ago.
> >>
> >> Several of these are implemented across multiple platforms and, for
> >> example, tuntap on BSD-based systems provides the same flow control
> >> behavior. Even if it was never formally documented, I think this
> >> behavior was fairly well known (at least for these sorts of experimental
> >> purposes) and used. I understand the concern that a single badly
> >> behaving flow can possibly block the flow of others unless
> >> traffic-control queuing disciplines are applied (as is done for other
> >> network interfaces). For the purposes I'm aware of, I think having this
> >> behavior as _optional_ is probably OK ... If accepted, and something is
> >> implemented here, it may be a good opportunity to have it documented
> >> (with the pros and cons of its use) for the more general Linux
> >> community.
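
The user-space "driver" pattern Brian describes comes down to attaching to
a TAP device and shuttling frames between the kernel and some backend
medium. For readers who haven't used it, a minimal sketch (the device name
"tap0" is a placeholder, the "backend" here just counts bytes, and error
handling is trimmed):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/if.h>
#include <linux/if_tun.h>

int main(void)
{
        struct ifreq ifr;
        char frame[2048];
        ssize_t n;

        int fd = open("/dev/net/tun", O_RDWR);
        if (fd < 0) {
                perror("/dev/net/tun");
                return 1;
        }

        memset(&ifr, 0, sizeof(ifr));
        ifr.ifr_flags = IFF_TAP | IFF_NO_PI;    /* raw Ethernet frames */
        strncpy(ifr.ifr_name, "tap0", IFNAMSIZ);
        if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
                perror("TUNSETIFF");
                return 1;
        }

        /* Each read() returns one frame the stack "transmitted" on tap0;
         * a real driver would forward it to the backend medium here. */
        while ((n = read(fd, frame, sizeof(frame))) > 0)
                printf("frame of %zd bytes\n", n);

        return 0;
}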
> >>
> > 
> > Yes, a UDP socket with sufficiently deep qdisc and tun queues
> > would previously get slowed down so that it matches the speed of
> > the interface.
> > 
> > But IIUC this was not really designed to be a flow control measure,
> > so depending on what else is in the qdisc you could easily get
> > into a setup where it behaves exactly as it does now.
> > For example, have several UDP sockets send data out a single
> > interface.
> > 
> > Another problem is that this depends on userspace being
> > well-behaved and consuming packets in a timely manner:
> > a misbehaving userspace operating a tun device can cause other
> > tun devices and/or sockets to get blocked forever and prevent them
> > from communicating with all destinations (not just the misbehaving one)
> > as their wmem limit is exhausted.
> > 
> > It should be possible to reproduce this with an old kernel and your
> > userspace drivers, too - just stop the daemon temporarily.
> > I realize that your daemon normally is well-behaved and
> > simply moves all incoming packets to the backend without
> > delay, but I'd like to find a solution that addresses
> > this without trusting userspace to be responsive.
> > 
> > At the moment, for this use case it's possible to limit the speed of the
> > tun interface using a non-work-conserving qdisc. Make that match the
> > speed of the backend device, and you get back basically the old
> > behaviour without the security problem, in that the rate is ensured by
> > the kernel and all packets queued are eventually consumed.
> > 
> > Does this solve the problem for you?
> > 
> I do not believe this solves the problem.
> 
> The issue is with what happens when a queue is full. When multiqueue
> support was introduced, the tun went from stopping the queue to dropping
> packets. This actually happened previously too, on qdisc queue overrun.
> In the back pressure use case, the application code using the character
> device side of the tun interface stops reading frames in order to cause
> the tun queue to back up and eventually stop. This leads to tx socket
> buffer backup, which provides back pressure to the applications sending
> traffic over the tun interface.
>
> The determination as to when to stop reading frames, or when to throttle
> back from reading frames, is dynamic. It can be based on many things:
> congestion on the backend media, power level, data rate, etc.
> 
> And the application response to back pressure can be as simple as
> blocking on a send(), or more complex, like dynamically throttling data
> rates or changing compression ratios to reduce overall network load, so
> that when congestion clears there is less of a strain on the network.
> 
> Once the tun started dropping packets instead of stopping the queue, it
> removed the ability of the application controlling the tun to apply its
> own queue management policy: read frames and possibly discard them, or
> stop reading frames. That is the problem: the fixed discard policy.
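
In other words, under the legacy behavior the daemon's read loop is itself
the queue management policy. Roughly, as a sketch (backend_congested() and
forward_to_backend() are hypothetical stand-ins for whatever congestion
signal and transmit path the backend media provides):

#include <stdbool.h>
#include <unistd.h>

bool backend_congested(void);                           /* hypothetical */
void forward_to_backend(const char *buf, ssize_t len);  /* hypothetical */

void reader_loop(int tun_fd)
{
        char frame[2048];

        for (;;) {
                if (backend_congested()) {
                        /* Policy 1: stop reading. The tun queue fills;
                         * with legacy flow control the txq is stopped and
                         * back pressure reaches the sending sockets. */
                        usleep(10 * 1000);
                        continue;
                }

                ssize_t n = read(tun_fd, frame, sizeof(frame));
                if (n <= 0)
                        continue;

                /* Policy 2 would be: read but discard here instead. */
                forward_to_backend(frame, n);
        }
}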
>
> I'll be posting another version of the patch that allows you to specify
> the flow control behavior on a queue-by-queue basis. This would let the
> application decide what should happen once a queue is full: discard, or
> netif_tx_stop_queue().
>
> I'm not hopeful this solution will be accepted, but it might be helpful
> to someone else.
>
> >> BTW, when I initially noticed this issue, it _seemed_ that even the
> >> default interface pfifo_fast priority bands were not being properly
> >> enforced for the tap interface without the old flow control behavior?
> >> I need to do a little more "old vs new" comparison testing in this
> >> regard.
> >>
> >> best regards,
> >>
> >> Brian
> > 
> > I think that as the tun is never stopped, nothing is ever queued in the
> > qdisc.
> > 
> >> On Apr 10, 2014, at 9:42 PM, Steven Galgano wrote:
> >>
> >>> On 04/10/2014 06:29 AM, Michael S. Tsirkin wrote:
> >>>> On Wed, Apr 09, 2014 at 10:19:40PM -0400, Steven Galgano wrote:
> >>>>> Add tuntap flow control support for use by back pressure routing
> >>>>> protocols. Setting the new TUNSETIFF flag IFF_FLOW_CONTROL will
> >>>>> signal resources as unavailable when the tx queue limit is reached,
> >>>>> by issuing netif_tx_stop_all_queues() rather than discarding frames.
> >>>>> A netif_tx_wake_all_queues() is issued after reading a frame from
> >>>>> the queue to signal resource availability.
> >>>>>
> >>>>> Back pressure capability was previously supported by the legacy tun
> >>>>> default mode. This change restores that functionality, which was
> >>>>> last present in v3.7.
> >>>>>
> >>>>> Reported-by: Brian Adamson
> >>>>> Tested-by: Joseph Giovatto
> >>>>> Signed-off-by: Steven Galgano
> >>>>
> >>>> I don't think it's a good idea.
> >>>>
> >>>> This trivial flow control really created more problems than it was
> >>>> worth.
> >>>>
> >>>> In particular this blocks all flows, so it's trivially easy for one
> >>>> flow to block and starve all others: just send a bunch of packets to
> >>>> loopback destinations that get queued all over the place.
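
To make the starvation point concrete: under the legacy behavior,
something like the sketch below - two unrelated UDP flows routed out the
same tun device, with the tun reader stopped - ends with *both* senders
blocked once their queued packets exhaust their wmem. The 10.0.0.x
destinations and the routes via the tun device are assumptions; interface
and route setup (and error handling) are not shown.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

static int udp_to(const char *ip)
{
        struct sockaddr_in dst = { .sin_family = AF_INET,
                                   .sin_port = htons(9) }; /* discard */
        int s = socket(AF_INET, SOCK_DGRAM, 0);

        inet_pton(AF_INET, ip, &dst.sin_addr);
        connect(s, (struct sockaddr *)&dst, sizeof(dst));
        return s;
}

int main(void)
{
        char payload[1024] = { 0 };
        int a = udp_to("10.0.0.2");     /* flow 1, routed via the tun */
        int b = udp_to("10.0.0.3");     /* flow 2, routed via the tun */

        for (unsigned long i = 0; ; i++) {
                /* With the tun txq stopped, packets pile up in the qdisc
                 * still charged to the senders; both calls eventually
                 * block, not just the "offending" flow. */
                send(a, payload, sizeof(payload), 0);
                send(b, payload, sizeof(payload), 0);
                fprintf(stderr, "sent %lu pairs\r", i);
        }
}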
> >>>>
> >>>> Luckily it was never documented, so we changed the default and
> >>>> nothing seems to break, but we won't be so lucky if we add an
> >>>> explicit API.
> >>>>
> >>>> One way to implement this would be with the ubuf_info callback: it is
> >>>> already invoked in most places where a packet might get stuck for a
> >>>> long time. It's still incomplete though: it will prevent
> >>>> head-of-queue blocking from lasting literally forever, but a single
> >>>> bad flow can still degrade performance significantly.
> >>>>
> >>>> Another alternative is to try and isolate the flows that we
> >>>> can handle and throttle them.
> >>>>
> >>>> It's all fixable, but we really need to fix the issues *before*
> >>>> exposing the interface to userspace.
> >>>>
> >>> It was only after recent upgrades that we picked up a newer kernel and
> >>> discovered the change to the default tun mode.
> >>>
> >>> The new default behavior has broken applications that depend on the
> >>> legacy behavior. Although not documented, the legacy behavior was well
> >>> known, at least to those working in the back pressure research
> >>> community. The default legacy mode was/is a valid use case, although I
> >>> am not sure it fits with recent multiqueue support.
> >>>
> >>> When back pressure protocols are running over a tun interface, they
> >>> require legacy flow control in order to communicate congestion
> >>> detected on the physical media they are using. Multiqueue does not
> >>> apply here. These protocols only use one queue, so
> >>> netif_tx_stop_all_queues() is the necessary behavior.
> >>>
> >>> I'm not tied to the idea of IFF_FLOW_CONTROL. I was aiming for middle
> >>> ground, where an application controlling the tun interface can state
> >>> that it wants the legacy flow control behavior, understanding its
> >>> limitations concerning multiple queues.
> >>>
> >>> What if we resurrect IFF_ONE_QUEUE and use that as the mechanism to
> >>> indicate legacy flow control? A tun instance initially configured with
> >>> IFF_ONE_QUEUE would not be allowed to attach or detach queues with
> >>> TUNSETQUEUE, and any additional opens with the same device name would
> >>> fail. This mode would use the
> >>> netif_tx_stop_all_queues()/netif_tx_wake_all_queues() flow control
> >>> mechanism.
> >>>
> >>> If a tun application wants the current default behavior with only a
> >>> single queue, it would not set the IFF_ONE_QUEUE flag. Not setting
> >>> IFF_MULTI_QUEUE would not imply IFF_ONE_QUEUE.
> >>>
> >>> I'd be happy to implement this change if it is an acceptable solution.
> >>> This would allow multiqueue tun development to advance while still
> >>> supporting use cases dependent on legacy flow control.
> >>>
> >>> -steve
> >>>
> >>>>> ---
> >>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >>>>> index ee328ba..268130c 100644
> >>>>> --- a/drivers/net/tun.c
> >>>>> +++ b/drivers/net/tun.c
> >>>>> @@ -783,8 +783,19 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>>>>  	 * number of queues.
> >>>>>  	 */
> >>>>>  	if (skb_queue_len(&tfile->socket.sk->sk_receive_queue) * numqueues
> >>>>> -	    >= dev->tx_queue_len)
> >>>>> -		goto drop;
> >>>>> +	    >= dev->tx_queue_len) {
> >>>>> +		if (tun->flags & TUN_FLOW_CONTROL) {
> >>>>> +			/* Resources unavailable, stop transmissions */
> >>>>> +			netif_tx_stop_all_queues(dev);
> >>>>> +
> >>>>> +			/* We won't see all dropped packets individually,
> >>>>> +			 * so an overrun error is more appropriate.
> >>>>> +			 */
> >>>>> +			dev->stats.tx_fifo_errors++;
> >>>>> +		} else {
> >>>>> +			goto drop;
> >>>>> +		}
> >>>>> +	}
> >>>>>
> >>>>>  	if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
> >>>>>  		goto drop;
> >>>>> @@ -1362,6 +1373,9 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
> >>>>>  			continue;
> >>>>>  		}
> >>>>>
> >>>>> +		/* Wake in case resources were previously signaled
> >>>>> +		 * unavailable */
> >>>>> +		netif_tx_wake_all_queues(tun->dev);
> >>>>> +
> >>>>>  		ret = tun_put_user(tun, tfile, skb, iv, len);
> >>>>>  		kfree_skb(skb);
> >>>>>  		break;
> >>>>> @@ -1550,6 +1564,9 @@ static int tun_flags(struct tun_struct *tun)
> >>>>>  	if (tun->flags & TUN_PERSIST)
> >>>>>  		flags |= IFF_PERSIST;
> >>>>>
> >>>>> +	if (tun->flags & TUN_FLOW_CONTROL)
> >>>>> +		flags |= IFF_FLOW_CONTROL;
> >>>>> +
> >>>>>  	return flags;
> >>>>>  }
> >>>>>
> >>>>> @@ -1732,6 +1749,11 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
> >>>>>  	else
> >>>>>  		tun->flags &= ~TUN_TAP_MQ;
> >>>>>
> >>>>> +	if (ifr->ifr_flags & IFF_FLOW_CONTROL)
> >>>>> +		tun->flags |= TUN_FLOW_CONTROL;
> >>>>> +	else
> >>>>> +		tun->flags &= ~TUN_FLOW_CONTROL;
> >>>>> +
> >>>>>  	/* Make sure persistent devices do not get stuck in
> >>>>>  	 * xoff state.
> >>>>>  	 */
> >>>>> @@ -1900,7 +1922,8 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
> >>>>>  	 * This is needed because we never checked for invalid flags on
> >>>>>  	 * TUNSETIFF. */
> >>>>>  	return put_user(IFF_TUN | IFF_TAP | IFF_NO_PI | IFF_ONE_QUEUE |
> >>>>> -			IFF_VNET_HDR | IFF_MULTI_QUEUE,
> >>>>> +			IFF_VNET_HDR | IFF_MULTI_QUEUE |
> >>>>> +			IFF_FLOW_CONTROL,
> >>>>>  			(unsigned int __user*)argp);
> >>>>>  } else if (cmd == TUNSETQUEUE)
> >>>>>  	return tun_set_queue(file, &ifr);
> >>>>> diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
> >>>>> index e9502dd..bcf2790 100644
> >>>>> --- a/include/uapi/linux/if_tun.h
> >>>>> +++ b/include/uapi/linux/if_tun.h
> >>>>> @@ -36,6 +36,7 @@
> >>>>>  #define TUN_PERSIST	0x0100
> >>>>>  #define TUN_VNET_HDR	0x0200
> >>>>>  #define TUN_TAP_MQ	0x0400
> >>>>> +#define TUN_FLOW_CONTROL 0x0800
> >>>>>
> >>>>>  /* Ioctl defines */
> >>>>>  #define TUNSETNOCSUM  _IOW('T', 200, int)
> >>>>> @@ -70,6 +71,7 @@
> >>>>>  #define IFF_MULTI_QUEUE 0x0100
> >>>>>  #define IFF_ATTACH_QUEUE 0x0200
> >>>>>  #define IFF_DETACH_QUEUE 0x0400
> >>>>> +#define IFF_FLOW_CONTROL 0x0010
> >>>>>  /* read-only flag */
> >>>>>  #define IFF_PERSIST	0x0800
> >>>>>  #define IFF_NOFILTER	0x1000
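
For completeness, the application side of the proposed interface would
look something like the sketch below. It is only meaningful with the patch
above applied: IFF_FLOW_CONTROL is the flag the patch introduces, and
TUNGETFEATURES (which the patch extends) is probed first so the program
can bail out cleanly on kernels without it.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/if.h>
#include <linux/if_tun.h>

#ifndef IFF_FLOW_CONTROL
#define IFF_FLOW_CONTROL 0x0010         /* value proposed by the patch */
#endif

int main(void)
{
        unsigned int features = 0;
        struct ifreq ifr;
        int fd = open("/dev/net/tun", O_RDWR);

        if (fd < 0) {
                perror("/dev/net/tun");
                return 1;
        }

        if (ioctl(fd, TUNGETFEATURES, &features) < 0 ||
            !(features & IFF_FLOW_CONTROL)) {
                fprintf(stderr, "kernel lacks IFF_FLOW_CONTROL\n");
                return 1;
        }

        memset(&ifr, 0, sizeof(ifr));
        ifr.ifr_flags = IFF_TUN | IFF_NO_PI | IFF_FLOW_CONTROL;
        strncpy(ifr.ifr_name, "tun0", IFNAMSIZ);
        if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
                perror("TUNSETIFF");
                return 1;
        }

        /* tun0 now stops its tx queue instead of dropping when full. */
        return 0;
}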