Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: "Nicolas de Pesloüan" <nicolas.2p.debian@gmail.com>
To: Jay Vosburgh <fubar@us.ibm.com>
Cc: "Oleg V. Ukhno" <olegu@yandex-team.ru>,
	"John Fastabend" <john.r.fastabend@intel.com>,
	"David S. Miller" <davem@davemloft.net>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"Sébastien Barré" <sebastien.barre@uclouvain.be>,
	"Christophe Paasch" <christoph.paasch@uclouvain.be>
Subject: Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
Date: Tue, 18 Jan 2011 22:20:08 +0100	[thread overview]
Message-ID: <4D360408.1080104@gmail.com> (raw)
In-Reply-To: <28837.1295382268@death>

Le 18/01/2011 21:24, Jay Vosburgh a écrit :
> Nicolas de Pesloüan<nicolas.2p.debian@gmail.com>  wrote:

>>>> - it is possible to detect path failure using arp monitoring instead of
>>>> miimon.
>
> 	I don't think this is true, at least not for the case of
> balance-rr.  Using ARP monitoring with any sort of load balance scheme
> is problematic, because the replies may be balanced to a different slave
> than the sender.

Cannot we achieve the expected arp monitoring by using the exact same artifice that Oleg suggested: 
using a different source MAC per slave for arp monitoring, so that return path match sending path ?

>>>> - changing the destination MAC address of egress packets are not
>>>> necessary, because egress path selection force ingress path selection
>>>> due to the VLAN.
>
> 	This is true, with one comment: Oleg's proposal we're discussing
> changes the source MAC address of outgoing packets, not the destination.
> The purpose being to manipulate the src-mac balancing algorithm on the
> switch when the packets are hashed at the egress port channel group.
> The packets (for a particular destination) all bear the same destination
> MAC, but (as I understand it) are manually assigned tailored source MAC
> addresses that hash to sequential values.

Yes, you're right.

> 	That's true.  The big problem with the "VLAN tunnel" approach is
> that it's not tolerant of link failures.

Yes, except if we find a way to make arp monitoring reliable in load balancing situation.

[snip]

> 	This is essentially the same thing as the diagram I pasted in up
> above, except with VLANs and an additional layer of switches between the
> hosts.  The multiple VLANs take the place of multiple discrete switches.
>
> 	This could also be accomplished via bridge groups (in
> Cisco-speak).  For example, instead of VLAN 100, that could be bridge
> group X, VLAN 200 is bridge group Y, and so on.
>
> 	Neither the VLAN nor the bridge group methods handle link
> failures very well; if, in the above diagram, the link from "switch 2
> vlan 100" to "host B" fails, there's no way for host A to know to stop
> sending to "switch 1 vlan 100," and there's no backup path for VLAN 100
> to "host B."

Can't we imagine to "arp monitor" the destination MAC address of host B, on both paths ? That way, 
host A would know that a given path is down, because return path would be the same. The target host 
should send the reply on the slave on which it receive the request, which is the normal way to reply 
to arp request.

> 	One item I'd like to see some more data on is the level of
> reordering at the receiver in Oleg's system.

This is exactly the reason why I asked Oleg to do some test with balance-rr. I cannot find a good 
reason for a possibly new xmit_hash_policy to provide better throughput than current balance-rr. If 
the throughput increase by, let's say, less than 20%, whatever tcp_reordering value, then it is 
probably a dead end way.

> 	One of the reasons round robin isn't as useful as it once was is
> due to the rise of NAPI and interrupt coalescing, both of which will
> tend to increase the reordering of packets at the receiver when the
> packets are evenly striped.  In the old days, it was one interrupt, one
> packet.  Now, it's one interrupt or NAPI poll, many packets.  With the
> packets striped across interfaces, this will tend to increase
> reordering.  E.g.,
>
> 	slave 1		slave 2		slave 3
> 	Packet 1	P2		P3
> 	P4		P5		P6
> 	P7		P8		P9
>
> 	and so on.  A poll of slave 1 will get packets 1, 4 and 7 (and
> probably several more), then a poll of slave 2 will get 2, 5 and 8, etc.

Any chance to receive P1, P2, P3 on slave 1, P4, P5, P6 on slave 2 et P7, P8, P9 on slave3, possibly 
by sending grouped packets, changing the sending slave every N packets instead of every packet ? I 
think we already discussed this possibility a few months or years ago in bonding-devel ML. For as 
far as I remember, the idea was not developed because it was not easy to find the number of packets 
to send through the same slave. Anyway, this might help reduce out of order delivery.

> 	Barring evidence to the contrary, I presume that Oleg's system
> delivers out of order at the receiver.  That's not automatically a
> reason to reject it, but this entire proposal is sufficiently complex to
> configure that very explicit documentation will be necessary.

Yes, and this is already true for some bonding modes and in particular for balance-rr.

	Nicolas.

next prev parent reply	other threads:[~2011-01-18 21:20 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-01-14 19:07 [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing Oleg V. Ukhno
2011-01-14 20:10 ` John Fastabend
2011-01-14 23:12   ` Oleg V. Ukhno
2011-01-14 20:13 ` Jay Vosburgh
2011-01-14 22:51   ` Oleg V. Ukhno
2011-01-15  0:05     ` Jay Vosburgh
2011-01-15 12:11       ` Oleg V. Ukhno
2011-01-18  3:16       ` John Fastabend
2011-01-18 12:40         ` Oleg V. Ukhno
2011-01-18 14:54           ` Nicolas de Pesloüan
2011-01-18 15:28             ` Oleg V. Ukhno
2011-01-18 16:24               ` Nicolas de Pesloüan
2011-01-18 16:57                 ` Oleg V. Ukhno
2011-01-18 20:24                 ` Jay Vosburgh
2011-01-18 21:20                   ` Nicolas de Pesloüan [this message]
2011-01-19  1:45                     ` Jay Vosburgh
2011-01-18 22:22                   ` Oleg V. Ukhno
2011-01-19 16:13                   ` Oleg V. Ukhno
2011-01-19 20:12                     ` Nicolas de Pesloüan
2011-01-21 13:55                       ` Oleg V. Ukhno
2011-01-22 12:48                         ` Nicolas de Pesloüan
2011-01-24 19:32                           ` Oleg V. Ukhno
2011-01-29  2:28                         ` Jay Vosburgh
2011-02-01 16:25                           ` Oleg V. Ukhno
2011-02-02 17:30                             ` Jay Vosburgh
2011-02-02  9:54                           ` Nicolas de Pesloüan
2011-02-02 17:57                             ` Jay Vosburgh
2011-02-03 14:54                               ` Oleg V. Ukhno
2011-01-18 17:56               ` Kirill Smelkov
2011-01-18 16:41           ` John Fastabend
2011-01-18 17:21             ` Oleg V. Ukhno
2011-01-14 20:41 ` Nicolas de Pesloüan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4D360408.1080104@gmail.com \
    --to=nicolas.2p.debian@gmail.com \
    --cc=christoph.paasch@uclouvain.be \
    --cc=davem@davemloft.net \
    --cc=fubar@us.ibm.com \
    --cc=john.r.fastabend@intel.com \
    --cc=netdev@vger.kernel.org \
    --cc=olegu@yandex-team.ru \
    --cc=sebastien.barre@uclouvain.be \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).