From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jay Vosburgh Subject: Re: [PATCH net-next 0/6] bonding: only rely on arp packets if arp monitor is used Date: Thu, 16 Jan 2014 14:38:59 -0800 Message-ID: <21868.1389911939@death.nxdomain> References: <1389837916-5377-1-git-send-email-vfalico@redhat.com> <15764.1389848997@death.nxdomain> <20140116084102.GM1867@redhat.com> Cc: netdev@vger.kernel.org, Andy Gospodarek , "David S. Miller" To: Veaceslav Falico Return-path: Received: from e38.co.us.ibm.com ([32.97.110.159]:32911 "EHLO e38.co.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750775AbaAPWjG (ORCPT ); Thu, 16 Jan 2014 17:39:06 -0500 Received: from /spool/local by e38.co.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Thu, 16 Jan 2014 15:39:05 -0700 Received: from b03cxnp08026.gho.boulder.ibm.com (b03cxnp08026.gho.boulder.ibm.com [9.17.130.18]) by d03dlp02.boulder.ibm.com (Postfix) with ESMTP id BB42A3E40052 for ; Thu, 16 Jan 2014 15:39:02 -0700 (MST) Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167]) by b03cxnp08026.gho.boulder.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id s0GMcoPx10551680 for ; Thu, 16 Jan 2014 23:38:50 +0100 Received: from d03av01.boulder.ibm.com (localhost [127.0.0.1]) by d03av01.boulder.ibm.com (8.14.4/8.14.4/NCO v10.0 AVout) with ESMTP id s0GMd283013742 for ; Thu, 16 Jan 2014 15:39:02 -0700 In-reply-to: <20140116084102.GM1867@redhat.com> Sender: netdev-owner@vger.kernel.org List-ID: Veaceslav Falico wrote: >On Wed, Jan 15, 2014 at 09:09:57PM -0800, Jay Vosburgh wrote: >>Veaceslav Falico wrote: >> >>>Currently, if arp_validate is off (0), slave_last_rx() returns the >>>slave->dev->last_rx, which is always updated on *any* packet received by >>>slave, and not only arps. This means that, if the validation of arps is >>>off, we're treating *any* incoming packet as a proof of slave being up, and >>>not only arps. >> >> The "any incoming packet" part is intentional. >> >>>This might seem logical at the first glance, however it can cause a lot of >>>troubles and false-positives, one example would be: >>> >>>The arp_ip_target is NOT accessible, however someone in the broadcast domain >>>spams with any broadcast traffic. This way bonding will be tricked that the >>>slave is still up (as in - can access arp_ip_target), while it's not. >> >> This type of situation is why arp_validate was added. >> >> The specific situation was when multiple hosts using bonding >>with the ARP monitor were set up behind a common gateway (in the same >>Ethernet broadcast domain). The arp_ip_target is unreachable for >>whatever reason. In that case, the various bonding instances on the >>different hosts will each issue broadcast ARP requests, and (in the >>absence of arp_validate) those requests would trick the other bonds into >>believing that they are up. >> >> I don't think this patch set will resolve that problem, since >>you explicitly permit any incoming ARP to count. > >I've said it was tricky at first glance :). > >For this situation the arp_validate is *indeed* the cure. I'm not disabling >(or even working with) how arp_validate=1/2 works, I'm working with >arp_validate == 0. > >Before the patchset (with arp_validate == 0 ): > >*Any* packet (arp and non-arp) will signal us that the slave is up - >because we use slave->dev->last_rx (updated on *every* incoming packet in >bond_handle_frame). > >After the patchset (with arp_validate == 0 ): > >*ONLY ARP* packets signal us that the slave is up - because we use >slave->last_arp_rx that is updated every time we see an ARP packet. But that's effectively making arp_validate the only setting if the host is on a quiet network segment without much other host ARP traffic. This may not be an issue for the active-backup mode, but the load balance modes will have serious issues (more on that below). Actually, thinking about it, for active-backup, if a backup slave is in a different Ethernet broadcast domain than the active slave, your change will likely break those configurations. With arp_validate enabled or your change applied, the backup slaves in active-backup depend on the ARP request broadcasts to be forwarded by the switch to the backup slave. Currently, without arp_validate, that dependency is not there; the backup slave can receive any traffic (although in most configurations, the ARP broadcast is what it will receive). That would be a legal (if very odd) bonding configuration. I'm not aware of anybody seting things up that way. >The way the modes work with arp_validate > 0 don't change :), as we're >updating slave->last_arp_rx the old way in this case - after validation. > >So that the scenario you've described still works flawlessly, and now >already we won't be tricked by some weird broadcast traffic even with >arp_validate == 0. So why not just enable arp_validate and get the same effect? >>>The documentation for arp_validate also states that *ARPs* will (not) be >>>validated if it's on/off, and that the arp monitoring works on arps as >>>traffic generators. >> >> I wrote most of that text in the documentation, and the intent >>was not to imply that only ARPs should count for "up-ness" even without >>arp_validate enabled. The intent was to distinguish it from >>"non-validate," in which any incoming traffic counted for "up-ness." >> >> The main reason for preserving the non-validate behavior (any >>traffic counts) is for the loadbalance (xor and rr) modes. In those >>modes, the switch decides which slave receives the incoming traffic, and >>so it's to our advantage to permit any incoming traffic to count for >>"up-ness." The arp_validate option is not allowed in these modes >>because it won't work. >> >> With these changes, I suspect that the loadbalance ARP monitor >>will be less reliable with these changes (granted that it's already a >>bit dodgy in its dependence on the switch to hit all slaves with >>incoming packets regularly). Particularly if the switch ports are >>configured into an Etherchannel ("static link aggregation") group, in >>which case only one slave will receive any given frame (broadcast / >>multicast traffic will not be duplicated across all slaves). > >The non-AB modes also gave me a headache, however after thinking a bit I've >decided to change them also (mainly, it's the change of >arp_loadbalance_mon function). I had that same headache when I was implementing the arp_validate stuff. But I disagree with changing this. >The usual usage, however, is to generate traffic via arps. If we don't see >arp replies - this means that arp_ip_target is down, and thus the slave is >down. The issue with the loadbalance (-xor and -rr) modes is that the incoming ARPs will be balanced by the switch, and won't be delivered to all slaves, or at least not at a rate such that each slave sees an ARP every arp_interval or so. Currently, the loadbalance modes will work with the ARP monitor as long as sufficient traffic volume (of any kind) flows across the slaves; with your change, that will no longer be true. In this case, the ARP monitor currently is essentially using "slave can send and receive packets" as the test for availability. >> I'm not sure that this change (the "only count ARPs even without >>arp_validate" bit) won't break existing configurations. Did you test >>the -rr and -xor modes with ARP monitor after your changes (with and >>without configuring a channel group on the switch ports)? > >Sure, all works fine, afaics. Obviously, these were basic tests, and bugs >might exist. I set up bonding to run some tests. I used three slaves, connected to a switch with those ports set for Etherchannel, balancing according to source IP (Cisco port-channel load-balance src-ip). For non-IP packets (like ARP), this will balance by source MAC. I set arp_interval to 1000 (1 second), and used one arp_ip_target. Without your patches, I can get all slaves to stay up with a sufficient traffic load (ping, netperf, etc). This is dependent upon the switch distributing the packets egressing the switch to all ports of the Etherchannel. This is not by any means an ideal situation, but has been the state for some time now. With your patches, one slave stays up (it is receiving the ARP replies from the arp_ip_target). Regular traffic of any kind does not keep the second and third slaves up. In my opinion, this is a regression from previous behavior. How did you test this such that all slaves stayed up? My suspicion is that you did not configure the switch ports for Etherchannel (static link aggregation), and thus all broadcast ARP requests were flooded to all slaves. This would mark all slaves up, but not have any real reliance on ARP replies coming from the arp_ip_target (i.e., if there were no ARP replies, the slaves would still remain up due to the broadcast ARPs generated by the bond itself). >The only possible scenario of breakage for someone, from my POV, is: > >1) arp monitor is used with loadbalance mode >2) arp_ip_targets are set but _any_ arp replies are never received >3) the user relies on that every slave will receive at least one packet per >arp_interval For 2, above, the issue is that, after your changes, each slave must receive an ARP during each arp_interval. >This use case: > >1) contradicts with documentation >2) contradicts with logic (arp monitor, arp ip targets etc. are used >without, actually, meaning something) >3) is really unstable On 1, I'll grant that the documentation is ambiguous, but the acceptance of any incoming packet for the non-validate ARP monitor is functioning as designed. At worst, this means the documentation should be updated to be clearer, not that the code is functioning incorrectly because it can be shown to contradict a reading of the documentation. Heck, looking at the bonding.txt documentation, it still says that device drivers have to update ->last_rx and ->trans_start, which isn't true. So, yah, the documentation needs some work. On 2, sure they mean something; (without arp_validate or this change) the ARP request / replies are one source, but not the only source, of traffic to determine slave state. On 3, that is, indeed, the case, and this is mentioned in the documentation somewhere (that a continuous flow of traffic is necessary to maintain correct up/down status for the slaves). The whole point of accepting any incoming frame (not just ARPs) is to permit this very scenario to function at maximum efficiency, even if that's not 100% reliable in all cases. >In this case, indeed, it won't work. Two points, though: > >1) It shouldn't work in the first place, is unstable etc. >2) Can be easily fixed by the following oneliner (though I *really* woudn't >like to do it, as it's useless and dangerous). Basically, for non-ab mode >we set last_arp_rx for every packet. Again, I wouldn't like to do it, but >if you know any use case scenario for this usage (no working arps but >continuous receive traffic) - I can send it as v2 or 7/6 patch (whatever >suits David better, as he already applied it but didn't push). It's not useless and dangerous, it's preserving the intended behavior for backwards compatibility with existing configurations. Further, the current behavior is what allows ARP monitor for the loadbalance modes (-xor and -rr) to work at all. The use case is any -xor or -rr bond using ARP monitor against a switch configured for Etherchannel (static link aggregation) on the bonded ports. In those cases, the incoming ARP replies from an ARP target will be delivered to only one slave (because switches generally do load balancing by hash), and the other slaves will not see any incoming ARP traffic from the arp_target. Even if the switch did round robin, the slaves still won't see enough ARPs coming in to reliably keep the slaves up, even if there is lots of regular traffic. The fact that "accept only ARP" does not work at all for the loadbalance modes is the reason that arp_validate is not permitted for those modes. I think the bottom line here is pretty simple: Using the ARP monitor with the loadbalance modes is not a common configuration in my experience, and making it work is tricky. However, anyone using it today will be relying on the current behavior, which we therefore must not change. -J >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c >index 0f613ae..87358e5 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -2286,6 +2286,11 @@ int bond_arp_rcv(const struct sk_buff *skb, struct bonding *bond, > __be32 sip, tip; > int alen; > + if (!USES_PRIMARY(bond->params.mode)) { >+ slave->last_arp_rx = jiffies; >+ return RX_HANDLER_ANOTHER; >+ } >+ > if (skb->protocol != __cpu_to_be16(ETH_P_ARP)) > return RX_HANDLER_ANOTHER; > > >> >>>Also, the net_device->last_rx is already used in a lot of drivers (even >>>though the comment states to NOT do it :)), and it's also ugly to modify it >>>from bonding. >> >> I didn't check, but I suspect those are mostly leftovers from >>the distant past, when the drivers were expected to update last_rx, or >>perhaps drivers using it for their own purposes. > >It's really a mix. Somebody just updates them, somebody uses it for their >own purposes etc. > >> >> I don't really see an issue in decoupling bonding from the >>net_device->last_rx; it's pretty much the same thing that was done for >>trans_start some time ago. > >trans_start removal is also in queue :). Though still needs some >polishing... > >> >> -J >> >>--- >> -Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com >> >> >>>So, to fix this, remove the last_rx from bonding, *always* call >>>bond_arp_rcv() in slave's rx_handler (bond_handle_frame), and if we spot an >>>arp there - update the slave->last_arp_rx - and use it instead of >>>net_device->last_rx. Finally, rename slave_last_rx() to slave_last_arp_rx() >>>to reflect the changes. >>> >>>As the changes touch really sensitive parts, I've tried to split them as >>>much as possible, for easier debugging/bisecting. >>> >>>CC: Jay Vosburgh >>>CC: Andy Gospodarek >>>CC: "David S. Miller" >>>CC: netdev@vger.kernel.org >>>Signed-off-by: Veaceslav Falico >>> >>>--- >>> drivers/net/bonding/bond_main.c | 18 ++++++++---------- >>> drivers/net/bonding/bond_options.c | 12 ++---------- >>> drivers/net/bonding/bonding.h | 16 ++++++---------- >>> include/linux/netdevice.h | 8 +------- >>> 4 files changed, 17 insertions(+), 37 deletions(-) --- -Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com