From mboxrd@z Thu Jan 1 00:00:00 1970
From: Hannes Frederic Sowa
Subject: Re: [PATCH net-next 0/4] net: allow setting congctl via routing table
Date: Fri, 05 Dec 2014 22:03:33 +0100
Message-ID: <1417813413.6122.7.camel@localhost>
References: <1417793092-6263-1-git-send-email-dborkman@redhat.com> <1417804540.2462.4.camel@localhost>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
Cc: Daniel Borkmann, "davem@davemloft.net", Florian Westphal, "netdev@vger.kernel.org"
To: Dave Taht
Return-path:
Received: from out3-smtp.messagingengine.com ([66.111.4.27]:33138 "EHLO out3-smtp.messagingengine.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751915AbaLEVDg (ORCPT ); Fri, 5 Dec 2014 16:03:36 -0500
Received: from compute3.internal (compute3.nyi.internal [10.202.2.43]) by mailout.nyi.internal (Postfix) with ESMTP id 7817E20809 for ; Fri, 5 Dec 2014 16:03:35 -0500 (EST)
In-Reply-To:
Sender: netdev-owner@vger.kernel.org
List-ID:

Hi Dave,

On Fr, 2014-12-05 at 11:05 -0800, Dave Taht wrote:
> On Fri, Dec 5, 2014 at 10:35 AM, Hannes Frederic Sowa
> wrote:
> > On Fr, 2014-12-05 at 08:35 -0800, Dave Taht wrote:
> >> On Fri, Dec 5, 2014 at 7:24 AM, Daniel Borkmann wrote:
> >> > This is the second part of our work and allows for setting the congestion
> >> > control algorithm via routing table. For details, please see individual
> >> > patches.
> >> >
> >> > Joint work with Florian Westphal, suggested by Hannes Frederic Sowa.
> >> >
> >> > Thanks!
> >> >
> >> > Daniel Borkmann (4):
> >> >   net: tcp: refactor reinitialization of congestion control
> >> >   net: tcp: add key management to congestion control
> >> >   net: tcp: add RTAX_CC_ALGO fib handling
> >> >   net: tcp: add per route congestion control
> >>
> >>
> >> Very interesting. Have you tried something other than dctcp here
> >> (e.g. westwood or lp?)
> >>
> >> Have you considered the case where the route changes underneath
> >> you from one device to another?
> >
> > Notice, there is no way the state of a tcp congestion control algorithm
> > can be converted to be used by a different one, so this would only
> > affect new tcp connections via this interface.
>
> You are missing the point. If the route changes from a path that
> is DCTCP capable to one that is not, (say you fail over to a backup link)

I don't think today's datacenters are designed such that the backup path
performs worse than the primary link (e.g. has different AQM settings).
It is much more important to allow, for example, connections to a
database server to select dctcp as CC while all connections going to the
internet use some "ordinary" tcp congestion algorithm.

> and flows persist, bad things will happen. DCTCP, in particular, depends
> upon a very specific AQM configuration on all the hops in the path, without that
> it can be very aggressive.

That's for sure.

> I do think it is feasible to convert from at least some of the
> core state from one tcp congestion control algorithm to another.

Hmm, I haven't looked at whether that is possible. It might be.

> >> Example, here I am routing everything through eth0, where I
> >> would want cubic, probably...
> >>
> >> root@ganesha:~/git/tinc# ip route
> >> default via 172.26.16.1 dev eth0 proto babel onlink
> >> 69.181.216.0/22 via 172.26.16.1 dev eth0 proto babel onlink
> >> 169.254.0.0/16 dev eth0 scope link metric 1000
> >> 172.26.16.0/24 dev eth0 proto kernel scope link src 172.26.16.177
> >> 172.26.16.1 via 172.26.16.1 dev eth0 proto babel onlink
> >> 172.26.16.112 via 172.26.16.112 dev eth0 proto babel onlink
> >> 172.26.17.0/24 via 172.26.16.1 dev eth0 proto babel onlink
> >> 172.26.17.3 via 172.26.16.1 dev eth0 proto babel onlink
> >> 172.26.17.227 via 172.26.16.1 dev eth0 proto babel onlink
> >> 192.168.7.0/30 dev eth1 proto kernel scope link src 192.168.7.1 metric 1
> >> 192.168.7.2 via 172.26.16.112 dev eth0 proto babel onlink
> >>
> >> And I pull the plug, and everything flips over to wlan0,
> >> where I might want westwood (or something saner than
> >> that. It might be nice to have a per-device cc default
> >> algorithm...)
> >
> > Something like that might be possible with metrics and "via ... dev if0
> > metric xxx" routes, which will be cleaned up as soon as the interface
> > goes down and the fallback will be to a route with a different
> > congestion algorithm.
>
> mmm... I do dynamic routing via various routing protocols, which
> generally don't bother with inserting more than one metric.

I totally understand; they might even remove the routes and re-add them,
thus losing the tcp cc property.

> While we are thinking through this, what happens with tunnels?

Tunnels should behave just like ordinary interfaces, but depending on
how they get routed, this might cause problems with DCTCP.

> This route in my network switches between interfaces and routes
> depending on which is best.
>
> fde5:dfb9:df90:fff0::/64 dev vpn6 proto kernel metric 256
> fde5:dfb9:df90:fff0::/60 via fde5:dfb9:df90:fff0::1 dev vpn6 metric 1024
>
>
> >> root@ganesha:~/git/tinc# ip route
> >> default via 172.26.17.224 dev wlan0 proto babel onlink
> >> 69.181.216.0/22 via 172.26.17.224 dev wlan0 proto babel onlink
> >> 169.254.0.0/16 dev eth0 scope link metric 1000
> >> 172.26.16.0/24 dev eth0 proto kernel scope link src 172.26.16.177
> >> 172.26.16.1 via 172.26.17.227 dev wlan0 proto babel onlink
> >> 172.26.16.112 via 172.26.17.227 dev wlan0 proto babel onlink
> >> 172.26.17.0/24 via 172.26.17.224 dev wlan0 proto babel onlink
> >> 172.26.17.3 via 172.26.17.227 dev wlan0 proto babel onlink
> >> 172.26.17.227 via 172.26.17.227 dev wlan0 proto babel onlink
> >> 192.168.7.0/30 dev eth1 proto kernel scope link src 192.168.7.1 metric 1
> >> 192.168.7.2 via 172.26.17.227 dev wlan0 proto babel onlink

Please note that this is an end-node-only feature. Normally, routers
don't do heavy tcp processing, so we did not consider using this feature
on a router. That's the same problem as with e.g. tcp_quick_ack.

As soon as you have control over the application and it allows you to
bind to an interface via SO_BINDTODEVICE, you are able to select the
congestion control algorithm by using ip rule oif matching. But the
application could also choose the CC itself by using the
'TCP_CONGESTION' setsockopt on a per-socket basis, if you have source
access.

Bye,
Hannes
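
For illustration only, here is a minimal sketch of the two per-socket
options named in the last paragraph, assuming Linux and Python's
standard socket module. The helper names (decode_cc_name,
get_congestion_control, set_congestion_control, bind_to_device) are
hypothetical and not part of the patch set; the numeric fallbacks are
the Linux option values and are only used if the Python build does not
export the constants.

```python
import socket

# Linux option numbers, used only if this Python build lacks the constants.
TCP_CONGESTION = getattr(socket, "TCP_CONGESTION", 13)
SO_BINDTODEVICE = getattr(socket, "SO_BINDTODEVICE", 25)

# The kernel limits congestion control names to 16 bytes (TCP_CA_NAME_MAX).
CC_NAME_MAX = 16


def decode_cc_name(raw: bytes) -> str:
    """Strip the NUL padding that may follow the algorithm name."""
    return raw.split(b"\x00", 1)[0].decode("ascii")


def get_congestion_control(sock: socket.socket) -> str:
    """Return the congestion control algorithm currently set on sock."""
    raw = sock.getsockopt(socket.IPPROTO_TCP, TCP_CONGESTION, CC_NAME_MAX)
    return decode_cc_name(raw)


def set_congestion_control(sock: socket.socket, name: str) -> None:
    """Select a congestion control algorithm for this socket only.

    Raises OSError if the algorithm is not loaded, or if it is restricted
    and the caller is not privileged.
    """
    sock.setsockopt(socket.IPPROTO_TCP, TCP_CONGESTION, name.encode("ascii"))


def bind_to_device(sock: socket.socket, ifname: str) -> None:
    """Bind sock to one interface so that oif-based rules can match it.

    This may require elevated privileges, depending on the kernel.
    """
    sock.setsockopt(socket.SOL_SOCKET, SO_BINDTODEVICE,
                    ifname.encode("ascii") + b"\x00")
```

An application with source access would call
set_congestion_control(sock, "dctcp") before connect(), provided the
dctcp module is available on that host. An application that can only be
told which interface to use would rely on bind_to_device() together with
a matching ip rule instead.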