netdev.vger.kernel.org archive mirror
* [PATCH] remove claim balance_rr won't reorder on many to one
@ 2007-10-30 19:48 Rick Jones
  2007-10-30 20:55 ` Jay Vosburgh
  0 siblings, 1 reply; 9+ messages in thread
From: Rick Jones @ 2007-10-30 19:48 UTC (permalink / raw)
  To: netdev

Remove the text which suggests that many balance_rr links feeding into
a single uplink will not experience packet reordering.

More up-to-date tests, with 1G links feeding into a switch with a 10G
uplink, using a 2.6.23-rc8 kernel on the system on which the 1G links
were bonded with balance_rr (mode=0), show that even a many-to-one
link configuration will experience packet reordering and the attendant
TCP issues involving spurious retransmissions and the congestion
window.  This happens even with a single, simple bulk transfer such as
a netperf TCP_STREAM test.  A more complete description of the tests
and results, including tcptrace analysis of packet traces showing the
degree of reordering, can be found at:

http://marc.info/?l=linux-netdev&m=119101513406349&w=2

Also, note that some switches use the term "trunking" in a context
other than link aggregation.

Signed-off-by:  Rick Jones <rick.jones2@hp.com>

---
diff -r 35e54d4beaad Documentation/networking/bonding.txt
--- a/Documentation/networking/bonding.txt	Wed Oct 24 05:06:40 2007 +0000
+++ b/Documentation/networking/bonding.txt	Mon Oct 29 03:47:19 2007 -0700
@@ -1696,23 +1696,6 @@ balance-rr: This mode is the only mode t
 	interface's worth of throughput, even after adjusting
 	tcp_reordering.
 
-	Note that this out of order delivery occurs when both the
-	sending and receiving systems are utilizing a multiple
-	interface bond.  Consider a configuration in which a
-	balance-rr bond feeds into a single higher capacity network
-	channel (e.g., multiple 100Mb/sec ethernets feeding a single
-	gigabit ethernet via an etherchannel capable switch).  In this
-	configuration, traffic sent from the multiple 100Mb devices to
-	a destination connected to the gigabit device will not see
-	packets out of order.  However, traffic sent from the gigabit
-	device to the multiple 100Mb devices may or may not see
-	traffic out of order, depending upon the balance policy of the
-	switch.  Many switches do not support any modes that stripe
-	traffic (instead choosing a port based upon IP or MAC level
-	addresses); for those devices, traffic flowing from the
-	gigabit device to the many 100Mb devices will only utilize one
-	interface.
-
 	If you are utilizing protocols other than TCP/IP, UDP for
 	example, and your application can tolerate out of order
 	delivery, then this mode can allow for single stream datagram
@@ -1720,7 +1703,9 @@ balance-rr: This mode is the only mode t
 	to the bond.
 
 	This mode requires the switch to have the appropriate ports
-	configured for "etherchannel" or "trunking."
+	configured for "etherchannel" or "aggregation." N.B. some
+	switches might use the term "trunking" for something other 
+	than link aggregation.
 
 active-backup: There is not much advantage in this network topology to
 	the active-backup mode, as the inactive backup devices are all


* Re: [PATCH] remove claim balance_rr won't reorder on many to one
  2007-10-30 19:48 [PATCH] remove claim balance_rr won't reorder on many to one Rick Jones
@ 2007-10-30 20:55 ` Jay Vosburgh
  2007-10-30 22:12   ` Rick Jones
  2007-10-31  1:08   ` Rick Jones
  0 siblings, 2 replies; 9+ messages in thread
From: Jay Vosburgh @ 2007-10-30 20:55 UTC (permalink / raw)
  To: Rick Jones; +Cc: netdev

Rick Jones <rick.jones2@hp.com> wrote:
[...]
>-	Note that this out of order delivery occurs when both the
>-	sending and receiving systems are utilizing a multiple
>-	interface bond.  Consider a configuration in which a
>-	balance-rr bond feeds into a single higher capacity network
>-	channel (e.g., multiple 100Mb/sec ethernets feeding a single
>-	gigabit ethernet via an etherchannel capable switch).  In this
>-	configuration, traffic sent from the multiple 100Mb devices to
>-	a destination connected to the gigabit device will not see
>-	packets out of order.  However, traffic sent from the gigabit
>-	device to the multiple 100Mb devices may or may not see
>-	traffic out of order, depending upon the balance policy of the
>-	switch.  Many switches do not support any modes that stripe
>-	traffic (instead choosing a port based upon IP or MAC level
>-	addresses); for those devices, traffic flowing from the
>-	gigabit device to the many 100Mb devices will only utilize one
>-	interface.

	Rather than simply removing this entirely (because I do think
there is value in discussion of the reordering aspects of balance-rr),
I'd rather see something that makes the following points:

	1- the worst reordering is balance-rr to balance-rr, back to
back.  The reordering rate here depends upon (a) the number of slaves
involved and (b) packet reception scheduling behaviors (packet
coalescing, NAPI, etc), and thus will vary significantly, but won't be
better than case #2.

	2- next worst is "balance-rr many slow" to "single fast", with
the reordering rate generally being substantially lower than case #1 (it
looked like your test showed about a 1% reordering rate, if I'm reading
your data correctly).

	3- For the "single fast" to "balance-rr many" case, going
through a switch configured for etherchannel "may or may not see traffic
out of order, depending upon the balance policy of the switch.  Many
switches do not support any modes that stripe traffic (instead choosing
a port based upon IP or MAC level addresses); for those devices, traffic
flowing from the [single fast] device to the [balance-rr many] devices
will only utilize one interface."
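
The contrast in point 3 between striping and address-based port selection
can be sketched as follows.  This is an illustrative Python model only, not
any particular switch's algorithm: the function names, the MAC values, and
the XOR-modulo hash are all assumptions made for the example.

```python
# Illustrative model: how a single flow is spread across aggregated ports.
# All names and values here are hypothetical, chosen only for the example.

def round_robin_port(packet_index: int, n_ports: int) -> int:
    """Striping: each successive packet goes to the next port in turn."""
    return packet_index % n_ports


def hashed_port(src_mac: int, dst_mac: int, n_ports: int) -> int:
    """Address-based selection: the port depends only on the address pair.

    The XOR-modulo hash is an assumed stand-in for whatever a given
    switch actually computes.
    """
    return (src_mac ^ dst_mac) % n_ports


def ports_used(select, n_packets: int) -> set:
    """Return the set of ports chosen for n_packets of one flow."""
    return {select(i) for i in range(n_packets)}


SRC_MAC = 0x001122334455  # hypothetical sender address
DST_MAC = 0x66778899AABB  # hypothetical receiver address
N_PORTS = 4

rr_ports = ports_used(lambda i: round_robin_port(i, N_PORTS), 100)
hash_ports = ports_used(lambda i: hashed_port(SRC_MAC, DST_MAC, N_PORTS), 100)

print("ports used when striping:", sorted(rr_ports))
print("ports used when hashing on addresses:", sorted(hash_ports))
```

Under these assumptions the striped flow touches every port, while the
hashed flow stays on one port: no reordering from the switch, but also no
more than one interface's worth of throughput for that flow.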

[...]
> 	This mode requires the switch to have the appropriate ports
>-	configured for "etherchannel" or "trunking."
>+	configured for "etherchannel" or "aggregation." N.B. some
>+	switches might use the term "trunking" for something other 
>+	than link aggregation.

	If memory serves, Sun uses the term "trunking" to refer to
"etherchannel" compatible behavior.

	I'm also hearing "aggregation" used to describe 802.3ad
specifically.

	Perhaps text of the form:

	This mode requires the switch to have the appropriate ports
configured for "Etherchannel."  Some switches use different terms, so
the configuration may be called "trunking" or "aggregation."  Note that
both of these terms also have other meanings.  For example, "trunking"
is also used to describe a type of switch port, and "aggregation" or
"link aggregation" is often used to refer to 802.3ad link aggregation,
which is compatible with bonding's 802.3ad mode, but not balance-rr.

	Thoughts?

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com


* Re: [PATCH] remove claim balance_rr won't reorder on many to one
  2007-10-30 20:55 ` Jay Vosburgh
@ 2007-10-30 22:12   ` Rick Jones
  2007-10-31  0:22     ` Jay Vosburgh
  2007-10-31  1:08   ` Rick Jones
  1 sibling, 1 reply; 9+ messages in thread
From: Rick Jones @ 2007-10-30 22:12 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: netdev

Jay Vosburgh wrote:
> Rick Jones <rick.jones2@hp.com> wrote:
> [...]
> 
>>-	Note that this out of order delivery occurs when both the
>>-	sending and receiving systems are utilizing a multiple
>>-	interface bond.  Consider a configuration in which a
>>-	balance-rr bond feeds into a single higher capacity network
>>-	channel (e.g., multiple 100Mb/sec ethernets feeding a single
>>-	gigabit ethernet via an etherchannel capable switch).  In this
>>-	configuration, traffic sent from the multiple 100Mb devices to
>>-	a destination connected to the gigabit device will not see
>>-	packets out of order.  However, traffic sent from the gigabit
>>-	device to the multiple 100Mb devices may or may not see
>>-	traffic out of order, depending upon the balance policy of the
>>-	switch.  Many switches do not support any modes that stripe
>>-	traffic (instead choosing a port based upon IP or MAC level
>>-	addresses); for those devices, traffic flowing from the
>>-	gigabit device to the many 100Mb devices will only utilize one
>>-	interface.
> 
> 
> 	Rather than simply removing this entirely (because I do think
> there is value in discussion of the reordering aspects of balance-rr),
> I'd rather see something that makes the following points:
> 
> 	1- the worst reordering is balance-rr to balance-rr, back to
> back.  The reordering rate here depends upon (a) the number of slaves
> involved and (b) packet reception scheduling behaviors (packet
> coalescing, NAPI, etc), and thus will vary significantly, but won't be
> better than case #2.
> 
> 	2- next worst is "balance-rr many slow" to "single fast", with
> the reordering rate generally being substantially lower than case #1 (it
> looked like your test showed about a 1% reordering rate, if I'm reading
> your data correctly).
> 
> 	3- For the "single fast" to "balance-rr many" case, going
> through a switch configured for etherchannel "may or may not see traffic
> out of order, depending upon the balance policy of the switch.  Many
> switches do not support any modes that stripe traffic (instead choosing
> a port based upon IP or MAC level addresses); for those devices, traffic
> flowing from the [single fast] device to the [balance-rr many] devices
> will only utilize one interface."

I have to wonder if the full description of the different versions of being a 
little bit pregnant is worth it.  Just saying that using balance-rr will result 
in reordering seems much more simple to comprehend.  Also, since balance-rr is 
strictly an outbound policy, does case three even enter into it - as you say, 
that will be up to the switch, which will be doing whatever it was told or felt 
like doing regardless of balance-rr on the bond in the host.

> 
> [...]
> 
>>	This mode requires the switch to have the appropriate ports
>>-	configured for "etherchannel" or "trunking."
>>+	configured for "etherchannel" or "aggregation." N.B. some
>>+	switches might use the term "trunking" for something other 
>>+	than link aggregation.
> 
> 
> 	If memory serves, Sun uses the term "trunking" to refer to
> "etherchannel" compatible behavior.

I'm not really all that tied to that part of the change - it is there because I 
noticed in one of the HP ITRC forums someone talking about a switch (Cisco?) 
where trunking meant something with vlans rather than aggregation.

> 
> 	I'm also hearing "aggregation" used to described 802.3ad
> specifically.
> 
> 	Perhaps text of the form:
> 
> 	This mode requires the switch to have the appropriate ports
> configured for "Etherchannel."  Some switches use different terms, so
> the configuration may be called "trunking" or "aggregation."  Note that
> both of these terms also have other meanings.  For example, "trunking"
> is also used to describe a type of switch port, and "aggregation" or
> "link aggregation" is often used to refer to 802.3ad link aggregation,
> which is compatible with bonding's 802.3ad mode, but not balance-rr.
> 
> 	Thoughts?

Even better would be to be able to start to move away from "etherchannel" 
towards the de jure standard's terms, whatever the heck they are :)

rick jones


* Re: [PATCH] remove claim balance_rr won't reorder on many to one
  2007-10-30 22:12   ` Rick Jones
@ 2007-10-31  0:22     ` Jay Vosburgh
  2007-10-31  1:02       ` Rick Jones
  0 siblings, 1 reply; 9+ messages in thread
From: Jay Vosburgh @ 2007-10-31  0:22 UTC (permalink / raw)
  To: Rick Jones; +Cc: netdev


Rick Jones <rick.jones2@hp.com> wrote:

>I have to wonder if the full description of the different versions of
>being a little bit pregnant is worth it.  Just saying that using
>balance-rr will result in reordering seems much more simple to comprehend.

	True, but the different configurations produce very different
levels of reordering.

	There seem to be users out there trying to use balance-rr to
maximize single stream TCP throughput (even with reordering), so I think
the relative badness information is worthwhile.

>Also, since balance-rr is strictly an outbound policy, does case three
>even enter into it - as you say, that will be up to the switch, which will
>be doing whatever it was told or felt like doing regardless of balance-rr
>on the bond in the host.

	Point three provides an answer to a question I've been asked
pretty regularly by customers, so I think it's good information.

[...]
>I'm not really all that tied to that part of the change - it is there
>because I noticed in one of the HP ITRC forums someone talking about a
>switch (Cisco?) where trunking meant something with vlans rather than
>aggregation.

	In Ciscoville, switch ports can be configured as either "access"
or "trunk."  A trunk port accepts all VLANs, an access port is tied to a
specific VLAN (simplifying some here).  The Cisco documentation uses the
term EtherChannel to describe the link aggregation system we're talking
about here in reference to bonding's balance-rr mode.

>Even better would be to be able to start to move away from "etherchannel"
>towards the de jure standard's terms, whatever the heck they are :)

	I believe that EtherChannel is the standard term for what we're
talking about here, but it's a Cisco trademark.  I'd guess that most
switch vendors don't come right out and call their "EtherChannel(tm)
compatible" mode exactly that; they call it something else, but it's
still meant to be compatible with EtherChannel.

	For bonding, this applies to the balance-rr and balance-xor
modes.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com


* Re: [PATCH] remove claim balance_rr won't reorder on many to one
  2007-10-31  0:22     ` Jay Vosburgh
@ 2007-10-31  1:02       ` Rick Jones
  0 siblings, 0 replies; 9+ messages in thread
From: Rick Jones @ 2007-10-31  1:02 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: netdev

Jay Vosburgh wrote:
> Rick Jones <rick.jones2@hp.com> wrote:
> 
> 
>>I have to wonder if the full description of the different versions of
>>being a little bit pregnant is worth it.  Just saying that using
>>balance-rr will result in reordering seems much more simple to comprehend.
> 
> 
> 	True, but the different configurations produce very different
> levels of reordering.
> 
> 	There seem to be users out there trying to use balance-rr to
> maximize single stream TCP throughput (even with reordering), so I think
> the relative badness information is worthwhile.

I will admit to coming from an "if you want a single stream to go faster buy the 
next higher speed NIC" point of view, which means I pretty much lump all the 
degrees of reordering badness together.

The relative badness though is likely a _very_ broad space which couldn't be 
covered adequately in just a paragraph or two.  Notice how long my email 
describing my one experiment ended-up.

For example, I suspect, but have not verified, that the one way one might get 
minimal reordering with many to one would be to have a sender with the many 
interfaces slow enough that it cannot get ahead of the sum of the NICs in the 
bond, so the transmit queues all remain at 0, coupled perhaps with NICs all in 
equal-speed and equal-feed I/O slots.

Once the sender can keep ahead of one or more of the NICs, we are soon 
starting along a rather long continuum which includes whether there are other 
concurrent connections, the distribution of send() sizes used by the 
applications, whether or not various offloads are enabled, and so on.
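
A toy model of the queueing argument above, as a rough Python sketch.  It is
not a reproduction of the netperf tests: the function name, the fixed
per-link delay standing in for transmit-queue backlog, and the constant send
interval are all assumptions made for illustration.

```python
# Toy model: packets are striped round-robin over several links, and each
# link adds a fixed delay that stands in for its transmit-queue backlog.
# Every name and number here is hypothetical.

def count_reordered(n_packets, link_delays, send_interval=1.0):
    """Count packets that arrive after a later-sent packet has arrived."""
    n_links = len(link_delays)
    arrivals = []
    for seq in range(n_packets):
        link = seq % n_links  # round-robin link selection
        arrival_time = seq * send_interval + link_delays[link]
        arrivals.append((arrival_time, seq))

    arrivals.sort()  # order in which the receiver sees the packets

    reordered = 0
    highest_seen = -1
    for _, seq in arrivals:
        if seq < highest_seen:
            reordered += 1  # a later-sent packet got here first
        else:
            highest_seen = seq
    return reordered


# Empty queues everywhere: the links add the same delay.
print(count_reordered(8, [5.0, 5.0, 5.0, 5.0]))
# Skew between links smaller than the gap between sends.
print(count_reordered(8, [0.0, 0.5, 0.0, 0.5]))
# One link backed up by many send intervals.
print(count_reordered(8, [0.0, 0.0, 0.0, 10.0]))
```

In this model, reordering appears only once the delay difference between
links exceeds the spacing between sends, which is the "sender gets ahead of
the NICs" condition described above.  Real behavior also depends on the
other factors listed, none of which are modeled here.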

>>Also, since balance-rr is strictly an outbound policy, does case three
>>even enter into it - as you say, that will be up to the switch, which will
>>be doing whatever it was told or felt like doing regardless of balance-rr
>>on the bond in the host.
> 
> 
> 	Point three provides an answer to a question I've been asked
> pretty regularly by customers, so I think it's good information.

But since it isn't specific to balance_rr it would seem better placed in a 
"Switch Considerations" or "Inbound Considerations" section?

>>Even better would be to be able to start to move away from "etherchannel"
>>towards the de jure standard's terms, whatever the heck they are :)
> 
> 
> 	I believe that EtherChannel is the standard term for what we're
> talking about here, but it's a Cisco trademark.  I'd guess that most
> switch vendors don't come right out and call their "EtherChannel(tm)
> compatible" mode exactly that; they call it something else, but it's
> still meant to be compatible with EtherChannel.

Well, that assumes that many switch vendors are still including EtherChannel.  I 
know of at least one non-trivial switch vendor which has consolidated on 
LACP/802.3ad.  When that vendor was supporting EtherChannel, they called it such.

rick jones


* Re: [PATCH] remove claim balance_rr won't reorder on many to one
  2007-10-30 20:55 ` Jay Vosburgh
  2007-10-30 22:12   ` Rick Jones
@ 2007-10-31  1:08   ` Rick Jones
  2007-11-06 21:40     ` Rick Jones
  1 sibling, 1 reply; 9+ messages in thread
From: Rick Jones @ 2007-10-31  1:08 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: netdev

> 	2- next worst is "balance-rr many slow" to "single fast", with
> the reordering rate generally being substantially lower than case #1 (it
> looked like your test showed about a 1% reordering rate, if I'm reading
> your data correctly).

The percentage of reordering for TCP is likely capped by its effect on the 
congestion window, limiting the number of outstanding segments.

rick jones


* Re: [PATCH] remove claim balance_rr won't reorder on many to one
  2007-10-31  1:08   ` Rick Jones
@ 2007-11-06 21:40     ` Rick Jones
  2007-11-06 22:49       ` Jay Vosburgh
  0 siblings, 1 reply; 9+ messages in thread
From: Rick Jones @ 2007-11-06 21:40 UTC (permalink / raw)
  To: Rick Jones; +Cc: Jay Vosburgh, netdev

Jay -

So, where do you and I stand wrt the proposed changes to bonding.txt?  Are we at 
an impasse?

sincerely,

rick jones


* Re: [PATCH] remove claim balance_rr won't reorder on many to one
  2007-11-06 21:40     ` Rick Jones
@ 2007-11-06 22:49       ` Jay Vosburgh
  2007-11-06 22:59         ` Rick Jones
  0 siblings, 1 reply; 9+ messages in thread
From: Jay Vosburgh @ 2007-11-06 22:49 UTC (permalink / raw)
  To: Rick Jones; +Cc: netdev


Rick Jones <rick.jones2@hp.com> wrote:

>So, where do you and I stand wrt the proposed changes to bonding.txt?  Are
>we at an impasse?

	Nope, I'm doing a doc update next to incorporate several things,
the reordering stuff included (which I plan to change to describe the
levels of badness, as it were).  Needed to fix a couple "no workee"
things first.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com


* Re: [PATCH] remove claim balance_rr won't reorder on many to one
  2007-11-06 22:49       ` Jay Vosburgh
@ 2007-11-06 22:59         ` Rick Jones
  0 siblings, 0 replies; 9+ messages in thread
From: Rick Jones @ 2007-11-06 22:59 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: netdev

Jay Vosburgh wrote:
> Rick Jones <rick.jones2@hp.com> wrote:
> 
> 
>>So, where do you and I stand wrt the proposed changes to bonding.txt?  Are
>>we at an impasse?
> 
> 	Nope, I'm doing a doc update next to incorporate several things,
> the reordering stuff included (which I plan to change to describe the
> levels of badness, as it were).  Needed to fix a couple "no workee"
> things first.

OK.  I look forward to applying my critical eye to that section :)

rick jones

