From: Nicolas de Pesloüan
Subject: Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
Date: Tue, 18 Jan 2011 22:20:08 +0100
Message-ID: <4D360408.1080104@gmail.com>
References: <20110114190714.GA11655@yandex-team.ru> <17405.1295036019@death> <4D30D37B.6090908@yandex-team.ru> <26330.1295049912@death> <4D35060D.5080004@intel.com> <4D358A47.4020009@yandex-team.ru> <4D35A9B4.7030701@gmail.com> <4D35B1B0.2090905@yandex-team.ru> <4D35BED5.7040301@gmail.com> <28837.1295382268@death>
In-Reply-To: <28837.1295382268@death>
To: Jay Vosburgh
Cc: "Oleg V. Ukhno", John Fastabend, "David S. Miller", "netdev@vger.kernel.org", Sébastien Barré, Christophe Paasch

On 18/01/2011 21:24, Jay Vosburgh wrote:
> Nicolas de Pesloüan wrote:
>>>> - it is possible to detect path failure using arp monitoring instead of
>>>> miimon.
>
> 	I don't think this is true, at least not for the case of
> balance-rr.  Using ARP monitoring with any sort of load balance scheme
> is problematic, because the replies may be balanced to a different slave
> than the sender.

Can't we achieve the expected ARP monitoring by using the exact same artifice that Oleg suggested: using a different source MAC per slave for ARP monitoring, so that the return path matches the sending path?
>>>> - changing the destination MAC address of egress packets is not
>>>> necessary, because egress path selection forces ingress path selection
>>>> due to the VLAN.
>
> 	This is true, with one comment: Oleg's proposal we're discussing
> changes the source MAC address of outgoing packets, not the destination.
> The purpose being to manipulate the src-mac balancing algorithm on the
> switch when the packets are hashed at the egress port channel group.
> The packets (for a particular destination) all bear the same destination
> MAC, but (as I understand it) are manually assigned tailored source MAC
> addresses that hash to sequential values.

Yes, you're right.

> 	That's true.  The big problem with the "VLAN tunnel" approach is
> that it's not tolerant of link failures.

Yes, except if we find a way to make ARP monitoring reliable in a load balancing situation.

[snip]

> 	This is essentially the same thing as the diagram I pasted in up
> above, except with VLANs and an additional layer of switches between the
> hosts.  The multiple VLANs take the place of multiple discrete switches.
>
> 	This could also be accomplished via bridge groups (in
> Cisco-speak).  For example, instead of VLAN 100, that could be bridge
> group X, VLAN 200 is bridge group Y, and so on.
>
> 	Neither the VLAN nor the bridge group methods handle link
> failures very well; if, in the above diagram, the link from "switch 2
> vlan 100" to "host B" fails, there's no way for host A to know to stop
> sending to "switch 1 vlan 100," and there's no backup path for VLAN 100
> to "host B."

Can't we imagine "ARP monitoring" the destination MAC address of host B, on both paths? That way, host A would know that a given path is down, because the return path would be the same as the sending path. The target host should send the reply on the slave on which it received the request, which is the normal way to reply to an ARP request.
> 	One item I'd like to see some more data on is the level of
> reordering at the receiver in Oleg's system.

This is exactly the reason why I asked Oleg to do some tests with balance-rr. I cannot find a good reason why a possibly new xmit_hash_policy would provide better throughput than the current balance-rr. If the throughput increases by, let's say, less than 20%, whatever the tcp_reordering value, then it is probably a dead end.

> 	One of the reasons round robin isn't as useful as it once was is
> due to the rise of NAPI and interrupt coalescing, both of which will
> tend to increase the reordering of packets at the receiver when the
> packets are evenly striped.  In the old days, it was one interrupt, one
> packet.  Now, it's one interrupt or NAPI poll, many packets.  With the
> packets striped across interfaces, this will tend to increase
> reordering.  E.g.,
>
> 	slave 1		slave 2		slave 3
> 	Packet 1	P2		P3
> 	P4		P5		P6
> 	P7		P8		P9
>
> 	and so on.  A poll of slave 1 will get packets 1, 4 and 7 (and
> probably several more), then a poll of slave 2 will get 2, 5 and 8, etc.

Any chance to receive P1, P2, P3 on slave 1, P4, P5, P6 on slave 2 and P7, P8, P9 on slave 3, possibly by sending grouped packets, changing the sending slave every N packets instead of every packet? I think we already discussed this possibility a few months or years ago on the bonding-devel ML. As far as I remember, the idea was not developed because it was not easy to find the right number of packets to send through the same slave. Anyway, this might help reduce out-of-order delivery.

> 	Barring evidence to the contrary, I presume that Oleg's system
> delivers out of order at the receiver.  That's not automatically a
> reason to reject it, but this entire proposal is sufficiently complex to
> configure that very explicit documentation will be necessary.

Yes, and this is already true for some bonding modes, in particular for balance-rr.

Nicolas.