From: John Fastabend
Subject: Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
Date: Tue, 18 Jan 2011 08:41:59 -0800
Message-ID: <4D35C2D7.6090008@intel.com>
In-Reply-To: <4D358A47.4020009@yandex-team.ru>
References: <20110114190714.GA11655@yandex-team.ru> <17405.1295036019@death> <4D30D37B.6090908@yandex-team.ru> <26330.1295049912@death> <4D35060D.5080004@intel.com> <4D358A47.4020009@yandex-team.ru>
To: "Oleg V. Ukhno"
Cc: Jay Vosburgh, "netdev@vger.kernel.org", "David S. Miller"

On 1/18/2011 4:40 AM, Oleg V. Ukhno wrote:
> On 01/18/2011 06:16 AM, John Fastabend wrote:
>> On 1/14/2011 4:05 PM, Jay Vosburgh wrote:
>>> Can somebody (John?) more knowledgeable than I about dm-multipath
>>> comment on the above?
>>
>> Here I'll give it a go.
>>
>> I don't think detecting L2 link failure this way is very robust. If there
>> is a failure farther away than your immediate link, you're going to break
>> completely. Your bonding hash will continue to round-robin the iSCSI
>> packets and half of them will get dropped on the floor. dm-multipath handles
>> this reasonably gracefully. Also, in this bonding environment you seem to
>> be very sensitive to RTT times on the network. Maybe not outright bad, but
>> I wouldn't consider this robust either.
>
> John, I agree - this bonding mode should be used in a fairly limited number
> of situations, but as for a failure farther away than the immediate link -
> every bonding mode will suffer the same problems in that case - bonding
> detects only L2 failures, the rest is handled by upper-layer mechanisms. And
> almost all bonding modes depend on equal RTT on the slaves. There is also
> already a similar load-balancing mode - balance-alb - what I did is
> approximately the same, but for the 802.3ad bonding mode, and it provides
> "better" (more equal and unconditional layer 2) load striping for tx
> and _rx_.
>
> I think I shouldn't have mentioned the particular use case of this patch - when
> I wrote it I tried to make a more general solution - my goal was "make
> equal or near-equal load striping for TX and (most importantly) RX
> within a single ethernet (layer 2) domain for TCP transmission". This
> bonding mode just introduces the ability to stripe rx and tx load for a
> single TCP connection between hosts inside one ethernet segment.
> iSCSI is just an example. It is possible to stripe load between a
> linux-based router and a linux-based web/ftp/etc server in the same
> manner. I think this feature will be useful in a number of network
> configurations.
>
> Also, I looked into the net-next code - it seems to me that it can be
> implemented (adapted to the net-next bonding code) without any difficulties,
> and the hashing function change poses no problem here.
>
> What I've written below is just my personal experience and opinion after
> 5 years of using Oracle + iSCSI + mpath (later - patched bonding).
>
> From my personal experience I can just say that most iSCSI failures are
> caused by link failures, and also I would never send any significant
> iSCSI traffic via a router - the router would be a bottleneck in that case.
> So in my case iSCSI traffic flows within one ethernet domain, and in
> case of a link failure the bonding driver simply fails one slave (in the
> bonding case), instead of checking and failing hundreds of paths (in the
> mpath case), and the first case is significantly less cpu-, network- and
> time-consuming (if using the default mpath checker - readsector0).
> Mpath is good for me when I use it to "merge" drbd mirrors from
> different hosts, but for just doing simple load striping within a single
> L2 network switch between 2 .. 16 hosts it is overkill (particularly
> in maintaining human-readable device naming) :).
>
> John, what is your opinion on such a load balancing method in general,
> without referring to particular use cases?
>

This seems reasonable to me, but I'll defer to Jay on this. As long as the
limitations are documented - and it looks like they are - this may be fine.
Mostly I was interested to know what led you down this path and why MPIO
was not working as I expected it should. When I get some time I'll see if
we can address at least some of these issues. Even so, it seems like this
bonding mode may still be useful for some use cases, perhaps even
non-storage use cases.

>
>>
>> You could tweak your scsi timeout values and fail_fast values, and set the io
>> retry to 0 to cause the failover to occur faster. I suspect you already
>> did this and still it is too slow? Maybe adding a checker in multipathd to
>> listen for link events would be fast enough. The checker could then fail
>> the path immediately.
>>
>> I'll try to address your comments from the other thread here. In general I
>> wonder if it would be better to solve the problems in dm-multipath rather than
>> add another bonding mode?
> Of course I did this, but mpath is fine when the device count is below
> 30-40 devices with two paths; 150-200 devices with 2+ paths can make
> life far more interesting :)

OK, admittedly this gets ugly fast.

>>
>> OVU - it is slow (I am using iSCSI for Oracle, so I need to minimize latency)
>>
>> The dm-multipath layer is adding latency? How much? If this is really true
>> maybe it's best to address the real issue here and not avoid it by
>> using the bonding layer.
>
> I do not remember the exact number now, but switching one of my databases,
> about 2 years ago, to bonding increased read throughput for the entire db
> from 15-20 Tb/day to approximately 30-35 Tb/day (4 iscsi initiators and
> 8 iscsi targets, 4 ethernet links for iSCSI on each host, all plugged into
> one switch) because of "full" bandwidth use. Also, bonding usage
> simplifies network and application setup greatly (compared to mpath).
>
>>
>> OVU - it handles any link failure badly, because of its command queue
>> limitation (all queued commands above 32 are discarded in case of path
>> failure, as I remember)
>>
>> Maybe true, but only link failures with the immediate peer are handled
>> by a bonding strategy. By working at the block layer we can detect
>> failures throughout the path. I would need to look into this again; I
>> know when we were looking at this some time ago there was some talk about
>> improving this behavior. I need to take some time to go back through the
>> error recovery stuff to remember how this works.
>>
>> OVU - it performs very badly when there are many devices and many paths (I was
>> unable to utilize more than 2Gbps out of 4, even with 100 disks with 4 paths
>> per disk)
>
> Well, I think that behavior can be explained this way:
> when balancing by I/O count per path (rr_min_io) and there is a huge
> number of devices, mpath is doing load balancing per device, and it is
> not possible to guarantee equal use of all devices, so there will be
> imbalance over the network interface (mpath is unaware of its existence,
> etc), and it is likely to become more imbalanced when there are many
> devices. Also, counting I/Os for many devices and paths consumes some
> CPU resources and can also cause excessive context switches.
>

Hmm, I'll get something set up here and see if this is the case.

>>
>> Hmm, well that seems like something is broken. I'll try this setup when
>> I get some time in the next few days. This really shouldn't be the case;
>> dm-multipath should not add a bunch of extra latency or affect throughput
>> significantly. By the way, what are you seeing without mpio?
>
> And one more observation from my 2-year-old tests - reading a device (using
> dd) (rhel 5 update 1 kernel, ramdisk via iSCSI via loopback) as an mpath
> device with a single path ran at approximately 120-150mb/s, and the same
> test on a non-mpath device at 800-900mb/s. Here I am quite sure; it was a
> kind of revelation to me at the time.
>

Similarly, I'll have a look. Thanks for the info.

>>
>> Thanks,
>> John
>>
>
>
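For anyone skimming the thread: the transmit side of the mode under
discussion amounts to ignoring the per-flow hash and rotating the output
slave on every packet, so even a single TCP connection is striped across
all active links (with the peer doing the same for RX). Below is a minimal
sketch of that selection logic - illustrative only, with simplified
stand-in structures, not Oleg's patch or the in-tree bonding code:

	/*
	 * Illustrative sketch, not the submitted patch: a per-bond packet
	 * counter replaces the usual L2/L3+L4 flow hash, so consecutive
	 * packets of a single TCP stream go out different slaves in turn.
	 */
	#include <stdint.h>

	struct slave_stub {
		int link_up;			/* carrier detected on this link */
	};

	struct bond_stub {
		struct slave_stub *slaves;	/* array of enslaved interfaces */
		unsigned int slave_cnt;
		uint64_t tx_counter;		/* bumped once per transmitted packet */
	};

	/*
	 * Pick the next usable slave in round-robin order, skipping slaves
	 * whose link is down.  Returns a slave index, or -1 if no link is up.
	 */
	static int bond_rr_select_slave(struct bond_stub *bond)
	{
		unsigned int i, start, idx;

		if (bond->slave_cnt == 0)
			return -1;

		start = bond->tx_counter++ % bond->slave_cnt;
		for (i = 0; i < bond->slave_cnt; i++) {
			idx = (start + i) % bond->slave_cnt;
			if (bond->slaves[idx].link_up)
				return idx;	/* stripes one flow across links */
		}
		return -1;			/* every slave link is down */
	}

This is also why the mode is sensitive to unequal RTT and to failures
beyond the immediate link, as discussed above: the selector only sees
local carrier state, and per-packet striping reorders segments whenever
the paths are not symmetric.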