* how to handle bonding failover when using a bridge over the bond?
From: Chris Friesen @ 2013-02-12 23:19 UTC
To: bonding-devel, Jay Vosburgh, netdev

I've got a scenario that seems to be not well handled with the current bonding code in linux, but maybe I'm missing something.

I have a physical host with two ethernet links that are bonded together (active/backup). Each link is connected to a separate L2 switch, which are in turn connected with a crosslink for redundancy.

The physical host is running multiple virtual machines each with a virtual adapter. The virtual adapters and the bond are all bridged together to allow communication between the virtual machines, the host, and the outside world.

Now suppose one of the slave links fails. The bond device will failover to the other slave and send out a gratuitous arp on the newly active slave. This will cause the L2 switches to update their lookup tables for the MAC address associated with the bond (so it now points to the newly active slave), but doesn't update the MAC addresses associated with the various virtual machines. If someone on the network sends a packet to one of the virtual machines, the switch will try to send it over the failed slave.

What's the recommended solution for this? The logical solution would seem to be to have something issue GARPs for each virtual machine when the bond device fails over, but there doesn't seem to be any way to register for notification (via rtnetlink for instance) when the bond fails over. I could monitor for carrier loss, but that wouldn't work for the case where bonding is using arp monitoring.

Any suggestions?

Thanks,
Chris

* Re: how to handle bonding failover when using a bridge over the bond?
From: Jay Vosburgh @ 2013-02-13 0:02 UTC
To: Chris Friesen; +Cc: bonding-devel, netdev

Chris Friesen <chris.friesen@genband.com> wrote:

>I've got a scenario that seems to be not well handled with the current
>bonding code in linux, but maybe I'm missing something.
>
>I have a physical host with two ethernet links that are bonded together
>(active/backup). Each link is connected to a separate L2 switch, which
>are in turn connected with a crosslink for redundancy.
>
>The physical host is running multiple virtual machines each with a virtual
>adapter. The virtual adapters and the bond are all bridged together to
>allow communication between the virtual machines, the host, and the
>outside world.
>
>Now suppose one of the slave links fails. The bond device will failover to
>the other slave and send out a gratuitous arp on the newly active slave.
>This will cause the L2 switches to update their lookup tables for the MAC
>address associated with the bond (so it now points to the newly active
>slave), but doesn't update the MAC addresses associated with the various
>virtual machines. If someone on the network sends a packet to one of the
>virtual machines, the switch will try to send it over the failed slave.

If the link failure is such that there is no carrier on the switch port, the switch will drop the forwarding entry for the virtual machine's MAC address from that port. The traffic for the VM's MAC would then flood to all ports, presumably including the link to the other switch, which wouldn't have a forwarding entry for the MAC, either (or it would be the switch link port), and would also flood it to all ports, one of which is the correct one.

Now, I'm speculating a bit here, as I have not traced out exactly how this works. I have discussed bonding failover with people here who have systems set up in the manner you describe (and did some testing), and it appears to be working for them.

On the other hand, something like a manual change of active slave won't bring down the carrier of the previously-active slave, and in that case there might be a problem with traffic destined for one of the VMs, until the VM sends something that makes it to the new switch.

Is this actually failing for you, or is this a thought experiment?

>What's the recommended solution for this? The logical solution would seem
>to be to have something issue GARPs for each virtual machine when the bond
>device fails over, but there doesn't seem to be any way to register for
>notification (via rtnetlink for instance) when the bond fails over. I
>could monitor for carrier loss, but that wouldn't work for the case where
>bonding is using arp monitoring.

There is a NETDEV_BONDING_FAILOVER notifier that is called for active-backup mode when a new active slave is assigned. The rtnetlink_event function is on that chain, and will send an rtnetlink message, although I don't see that the actual event is included in the message.

The bond doesn't track all of the MACs that go through it, but the bridge presumably does, and could respond to the FAILOVER notifier with something to notify the switch that the port assignments for the various MACs have changed.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

* Re: how to handle bonding failover when using a bridge over the bond?
From: Chris Friesen @ 2013-02-13 0:30 UTC
To: Jay Vosburgh; +Cc: bonding-devel, netdev, Stephen Hemminger, bridge

On 02/12/2013 06:02 PM, Jay Vosburgh wrote:
> Chris Friesen<chris.friesen@genband.com> wrote:
>
>> I've got a scenario that seems to be not well handled with the current
>> bonding code in linux, but maybe I'm missing something.
>>
>> I have a physical host with two ethernet links that are bonded together
>> (active/backup). Each link is connected to a separate L2 switch, which
>> are in turn connected with a crosslink for redundancy.
>>
>> The physical host is running multiple virtual machines each with a virtual
>> adapter. The virtual adapters and the bond are all bridged together to
>> allow communication between the virtual machines, the host, and the
>> outside world.
>>
>> Now suppose one of the slave links fails. The bond device will failover to
>> the other slave and send out a gratuitous arp on the newly active slave.
>> This will cause the L2 switches to update their lookup tables for the MAC
>> address associated with the bond (so it now points to the newly active
>> slave), but doesn't update the MAC addresses associated with the various
>> virtual machines. If someone on the network sends a packet to one of the
>> virtual machines, the switch will try to send it over the failed slave.
>
> If the link failure is such that there is no carrier on the
> switch port, the switch will drop the forwarding entry for the virtual
> machine's MAC address from that port. The traffic for the VM's MAC
> would then flood to all ports, presumably including the link to the
> other switch, which wouldn't have a forwarding entry for the MAC, either
> (or it would be the switch link port), and would also flood it to all
> ports, one of which is the correct one.

This makes sense, though it wouldn't cover the case where the link only loses carrier in one direction, or if the bond is using arp failover and something fails beyond the first hop.

> Is this actually failing for you, or is this a thought
> experiment?

It actually failed. During a customer demo. :)

From what I understand it was a physical link pull, which (based on what you say above) should have caused the switch to react appropriately. I'll see if I can get some more information. Maybe the switches weren't behaving properly or something.

>> What's the recommended solution for this? The logical solution would seem
>> to be to have something issue GARPs for each virtual machine when the bond
>> device fails over, but there doesn't seem to be any way to register for
>> notification (via rtnetlink for instance) when the bond fails over. I
>> could monitor for carrier loss, but that wouldn't work for the case where
>> bonding is using arp monitoring.
>
> There is a NETDEV_BONDING_FAILOVER notifier that is called for
> active-backup mode when a new active slave is assigned. The
> rtnetlink_event function is on that chain, and will send an rtnetlink
> message, although I don't see that the actual event is included in the
> message.

If I'm reading this right it will end up sending an RTM_NEWLINK message, which seems a bit odd.

> The bond doesn't track all of the MACs that go through it, but
> the bridge presumably does, and could respond to the FAILOVER notifier
> with something to notify the switch that the port assignments for the
> various MACs have changed.

That would probably make sense. I've added the bridging folks, maybe they'll have a suggestion how this sort of thing should be handled.

Chris
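
For readers who want to see what a userspace listener for these link messages could look like, here is a minimal Python sketch. It assumes, as suggested above, that a failover surfaces as an ordinary RTM_NEWLINK message with no indication of the cause, so a real daemon would still have to work out for itself whether the active slave changed. The function names `parse_link_messages` and `watch_links` are hypothetical; the constants and struct layouts follow the Linux `nlmsghdr`, `ifinfomsg` and `rtattr` definitions.

```python
import socket
import struct

RTM_NEWLINK = 16      # netlink message type for link updates
RTMGRP_LINK = 1       # multicast group carrying link notifications
IFLA_IFNAME = 3       # attribute type holding the interface name
NLMSG_HDR = struct.Struct("=IHHII")   # len, type, flags, seq, pid
IFINFOMSG = struct.Struct("=BBHiII")  # family, pad, type, index, flags, change


def parse_link_messages(data):
    """Yield (ifindex, ifname) for each RTM_NEWLINK message in one datagram."""
    offset = 0
    while offset + NLMSG_HDR.size <= len(data):
        length, msg_type, _flags, _seq, _pid = NLMSG_HDR.unpack_from(data, offset)
        if length < NLMSG_HDR.size or offset + length > len(data):
            break  # malformed or truncated message
        body = offset + NLMSG_HDR.size
        if msg_type == RTM_NEWLINK and length >= NLMSG_HDR.size + IFINFOMSG.size:
            index = IFINFOMSG.unpack_from(data, body)[3]
            name = None
            attr, end = body + IFINFOMSG.size, offset + length
            while attr + 4 <= end:  # walk the 4-byte-aligned attribute list
                rta_len, rta_type = struct.unpack_from("=HH", data, attr)
                if rta_len < 4:
                    break
                if rta_type == IFLA_IFNAME:
                    name = data[attr + 4:attr + rta_len].rstrip(b"\0").decode()
                attr += (rta_len + 3) & ~3
            yield index, name
        offset += (length + 3) & ~3


def watch_links():
    """Print every link notification (Linux only; runs until interrupted)."""
    with socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, socket.NETLINK_ROUTE) as s:
        s.bind((0, RTMGRP_LINK))
        while True:
            for index, name in parse_link_messages(s.recv(65535)):
                print(f"link change on {name} (ifindex {index})")
```

Calling `watch_links()` on a Linux host prints one line per link notification. Because the message does not say why it was sent, the listener only knows that something about the interface changed.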

* Re: how to handle bonding failover when using a bridge over the bond?
From: Chris Friesen @ 2013-02-13 17:14 UTC
To: Jay Vosburgh; +Cc: bonding-devel, netdev, Stephen Hemminger, bridge

On 02/12/2013 06:30 PM, Chris Friesen wrote:
> On 02/12/2013 06:02 PM, Jay Vosburgh wrote:
>> Chris Friesen<chris.friesen@genband.com> wrote:
>>> I have a physical host with two ethernet links that are bonded
>>> together (active/backup). Each link is connected to a separate L2
>>> switch, which are in turn connected with a crosslink for
>>> redundancy.
>>>
>>> The physical host is running multiple virtual machines each with
>>> a virtual adapter. The virtual adapters and the bond are all
>>> bridged together to allow communication between the virtual
>>> machines, the host, and the outside world.
>>>
>>> Now suppose one of the slave links fails. The bond device will
>>> failover to the other slave and send out a gratuitous arp on the
>>> newly active slave. This will cause the L2 switches to update
>>> their lookup tables for the MAC address associated with the bond
>>> (so it now points to the newly active slave), but doesn't update
>>> the MAC addresses associated with the various virtual machines.
>>> If someone on the network sends a packet to one of the virtual
>>> machines, the switch will try to send it over the failed slave.
>>
>> If the link failure is such that there is no carrier on the switch
>> port, the switch will drop the forwarding entry for the virtual
>> machine's MAC address from that port. The traffic for the VM's MAC
>> would then flood to all ports, presumably including the link to
>> the other switch, which wouldn't have a forwarding entry for the
>> MAC, either (or it would be the switch link port), and would also
>> flood it to all ports, one of which is the correct one.

I talked with our networking guy. Apparently what is happening is that if we pull the link to switch A it drops the forwarding entries for all MACs on the downed link, but switch B still has stale entries pointing to the inter-switch link.

If a packet destined for the VM arrives at switch B, B will send it across to switch A. (Which is pointless since A no longer has a working link to the MAC in question.)

If a packet destined for the VM arrives at switch A, A will broadcast it to all ports, including the inter-switch link to switch B. However, switch B still thinks the MAC address is connected to switch A, so it drops the packet.

Once the VMs send out packets switch B will update its tables, but if the VMs are event-driven and mostly only respond to incoming packets they could end up waiting a long time.

Chris
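
The stale-entry behaviour described above can be reproduced with a toy model of two learning switches. The following Python sketch is an illustration only: the `Switch` class, the port names and the `simulate` function are all hypothetical, and real switches add ageing timers and other behaviour that is not modelled here. It assumes the usual learning-switch rules: learn the source MAC on the ingress port, forward to the known port, flood when the destination is unknown, and drop a frame whose destination lies behind the port it arrived on.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"


class Switch:
    """Toy learning switch: remembers which port each source MAC was seen on."""

    def __init__(self):
        self.fdb = {}    # forwarding table: MAC -> port
        self.ports = {}  # port -> callable(src, dst), or None while the link is down

    def attach(self, port, handler):
        self.ports[port] = handler

    def link_down(self, port):
        # Loss of carrier: disable the port and flush the entries learned on it.
        self.ports[port] = None
        self.fdb = {mac: p for mac, p in self.fdb.items() if p != port}

    def receive(self, in_port, src, dst):
        self.fdb[src] = in_port          # learn where the sender lives
        out = self.fdb.get(dst)
        if out == in_port:
            return                       # destination is behind the ingress port: drop
        targets = [out] if out is not None else [p for p in self.ports if p != in_port]
        for port in targets:             # forward, or flood when the MAC is unknown
            handler = self.ports.get(port)
            if handler is not None:
                handler(src, dst)


def simulate():
    vm = "vm-mac"
    a, b = Switch(), Switch()
    seen = []  # bond slaves on which a frame addressed to the VM arrived

    a.attach("xlink", lambda s, d: b.receive("xlink", s, d))
    b.attach("xlink", lambda s, d: a.receive("xlink", s, d))
    a.attach("host", lambda s, d: seen.append("eth0") if d == vm else None)
    b.attach("host", lambda s, d: seen.append("eth1") if d == vm else None)
    a.attach("client", lambda s, d: None)
    b.attach("client", lambda s, d: None)

    def probe(switch, src):
        """Send one frame from a client to the VM; report which slaves saw it."""
        seen.clear()
        switch.receive("client", src, vm)
        return list(seen)

    results = {}
    a.receive("host", vm, BROADCAST)   # eth0 active: VM traffic teaches both switches
    results["before"] = (probe(a, "client-a"), probe(b, "client-b"))
    a.link_down("host")                # pull eth0; the bond fails over to eth1
    results["after_pull"] = (probe(a, "client-a"), probe(b, "client-b"))
    b.receive("host", vm, BROADCAST)   # the VM (or a GARP sent for it) transmits via eth1
    results["after_garp"] = (probe(a, "client-a"), probe(b, "client-b"))
    return results


if __name__ == "__main__":
    print(simulate())
```

In this model, frames from clients on either switch reach the VM over eth0 before the pull, reach nothing after the pull, and reach the VM over eth1 only once a frame carrying the VM's source MAC has been sent through switch B.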

* Re: how to handle bonding failover when using a bridge over the bond?
From: Cong Wang @ 2013-02-14 8:01 UTC
To: netdev

On Wed, 13 Feb 2013 at 00:30 GMT, Chris Friesen <chris.friesen@genband.com> wrote:
> On 02/12/2013 06:02 PM, Jay Vosburgh wrote:
>> The bond doesn't track all of the MACs that go through it, but
>> the bridge presumably does, and could respond to the FAILOVER notifier
>> with something to notify the switch that the port assignments for the
>> various MACs have changed.
>
> That would probably make sense. I've added the bridging folks, maybe
> they'll have a suggestion how this sort of thing should be handled.
>

It is already handled. When BONDING_FAILOVER is triggered and the MAC has been changed, NETDEV_CHANGEADDR is issued too, then bridge will capture it and update its fdb:

        case NETDEV_CHANGEADDR:
                spin_lock_bh(&br->lock);
                br_fdb_changeaddr(p, dev->dev_addr);
                changed_addr = br_stp_recalculate_bridge_id(br);
                spin_unlock_bh(&br->lock);

                if (changed_addr)
                        call_netdevice_notifiers(NETDEV_CHANGEADDR, br->dev);

                break;

* Re: how to handle bonding failover when using a bridge over the bond?
From: Chris Friesen @ 2013-02-14 16:43 UTC
To: Cong Wang; +Cc: netdev

On 02/14/2013 02:01 AM, Cong Wang wrote:
> On Wed, 13 Feb 2013 at 00:30 GMT, Chris Friesen<chris.friesen@genband.com> wrote:
>> On 02/12/2013 06:02 PM, Jay Vosburgh wrote:
>>> The bond doesn't track all of the MACs that go through it, but
>>> the bridge presumably does, and could respond to the FAILOVER notifier
>>> with something to notify the switch that the port assignments for the
>>> various MACs have changed.
>>
>> That would probably make sense. I've added the bridging folks, maybe
>> they'll have a suggestion how this sort of thing should be handled.
>>
>
> It is already handled. When BONDING_FAILOVER is triggered and the MAC has
> been changed, NETDEV_CHANGEADDR is issued too, then bridge will capture
> it and update its fdb:
>
>         case NETDEV_CHANGEADDR:
>                 spin_lock_bh(&br->lock);
>                 br_fdb_changeaddr(p, dev->dev_addr);
>                 changed_addr = br_stp_recalculate_bridge_id(br);
>                 spin_unlock_bh(&br->lock);
>
>                 if (changed_addr)
>                         call_netdevice_notifiers(NETDEV_CHANGEADDR, br->dev);
>
>                 break;

I'm not familiar with the bridge code, can you elaborate on how this helps?

The problem scenario is this:

I have a host with eth0/eth1 bonded together as bond0. eth0/eth1 are connected to separate L2 switches, which are interconnected.

On the host there are a number of virtual machines, each with a virtual interface.

All the virtual interfaces as well as bond0 are bridged together to allow the VMs, the host, and the outside world to talk to each other.

Currently the host does NOT participate in STP because it is considered an edge node.

Suppose eth0 is the active link and we pull it. The bond will make eth1 active and emit gratuitous arp packets for itself, so the external L2 switches will update the location of the MAC address belonging to the bond. On loss of carrier for the link to eth0 L2 switch "A" will drop the entries for the MAC addresses, including the ones for the virtual machines.

The problem is that L2 switch "B" still thinks that all the virtual machines are accessible via L2 switch "A". Thus any incoming packets destined for a virtual machine will get dropped.

Chris

* Re: how to handle bonding failover when using a bridge over the bond?
From: Jay Vosburgh @ 2013-02-14 18:03 UTC
To: Chris Friesen; +Cc: Cong Wang, netdev

Chris Friesen <chris.friesen@genband.com> wrote:

>On 02/14/2013 02:01 AM, Cong Wang wrote:
>> On Wed, 13 Feb 2013 at 00:30 GMT, Chris Friesen<chris.friesen@genband.com> wrote:
>>> On 02/12/2013 06:02 PM, Jay Vosburgh wrote:
>>>> The bond doesn't track all of the MACs that go through it, but
>>>> the bridge presumably does, and could respond to the FAILOVER notifier
>>>> with something to notify the switch that the port assignments for the
>>>> various MACs have changed.
>>>
>>> That would probably make sense. I've added the bridging folks, maybe
>>> they'll have a suggestion how this sort of thing should be handled.
>>>
>>
>> It is already handled. When BONDING_FAILOVER is triggered and the MAC has
>> been changed, NETDEV_CHANGEADDR is issued too, then bridge will capture
>> it and update its fdb:
>>
>>         case NETDEV_CHANGEADDR:
>>                 spin_lock_bh(&br->lock);
>>                 br_fdb_changeaddr(p, dev->dev_addr);
>>                 changed_addr = br_stp_recalculate_bridge_id(br);
>>                 spin_unlock_bh(&br->lock);
>>
>>                 if (changed_addr)
>>                         call_netdevice_notifiers(NETDEV_CHANGEADDR, br->dev);
>>
>>                 break;
>
>I'm not familiar with the bridge code, can you elaborate on how this helps?

I'm not sure that it does, even if you're using STP (although I'd want to try it with STP to make sure).

This only updates the fdb's MAC for the bond's port. It won't affect the VM's MACs (which it shouldn't, because they don't change), and won't send any gratuitous updates through the bond's port to the switch that would notify the second switch ("B" in Chris's description, below) that the switch port for the VM's MAC(s) has changed.

Also, if the bond has fail_over_mac=follow, then no CHANGEADDR is issued, because the MAC address does not change. This is not common (and not the case in the configuration described below), but does occur.

>The problem scenario is this:
>
>I have a host with eth0/eth1 bonded together as bond0. eth0/eth1 are
>connected to separate L2 switches, which are interconnected.
>
>On the host there are a number of virtual machines, each with a virtual
>interface.
>
>All the virtual interfaces as well as bond0 are bridged together to allow
>the VMs, the host, and the outside world to talk to each other.
>
>Currently the host does NOT participate in STP because it is considered an
>edge node.
>
>Suppose eth0 is the active link and we pull it. The bond will make eth1
>active and emit gratuitous arp packets for itself, so the external L2
>switches will update the location of the MAC address belonging to the
>bond. On loss of carrier for the link to eth0 L2 switch "A" will drop the
>entries for the MAC addresses, including the ones for the virtual
>machines.
>
>The problem is that L2 switch "B" still thinks that all the virtual
>machines are accessible via L2 switch "A". Thus any incoming packets
>destined for a virtual machine will get dropped.

I'm trying to track down the system I tested previously to see exactly how it is set up and why it works when yours does not. It's possible that it doesn't work, and the testing we did simply missed this case.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
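
Since a MAC-change event cannot be relied on in every configuration, one fallback is for userspace to read the bond's state directly. The Python sketch below polls the bonding driver's sysfs attributes `active_slave` and `fail_over_mac` under `/sys/class/net/<bond>/bonding/`. The helper names (`read_bonding_attr`, `bond_snapshot`, `failed_over`) are hypothetical, and taking only the first word of each file is an assumption made so that a value followed by a numeric code is reduced to the name alone. Polling adds latency compared with a notification, so treat this as an illustration rather than a recommended design.

```python
from pathlib import Path


def read_bonding_attr(bond, attr, sysfs_root="/sys/class/net"):
    """Return the first word of <sysfs_root>/<bond>/bonding/<attr>, or None if empty."""
    words = (Path(sysfs_root) / bond / "bonding" / attr).read_text().split()
    return words[0] if words else None


def bond_snapshot(bond, sysfs_root="/sys/class/net"):
    """Collect the attributes relevant to detecting a failover."""
    return {
        "active_slave": read_bonding_attr(bond, "active_slave", sysfs_root),
        "fail_over_mac": read_bonding_attr(bond, "fail_over_mac", sysfs_root),
    }


def failed_over(previous, current):
    """True when the active slave differs between two snapshots."""
    return previous["active_slave"] != current["active_slave"]
```

A daemon could take a snapshot at a fixed interval and act whenever `failed_over` returns true, independent of whether the bond's MAC address changed.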

* Re: how to handle bonding failover when using a bridge over the bond?
From: Chris Friesen @ 2013-02-14 19:29 UTC
To: Jay Vosburgh; +Cc: Cong Wang, netdev

On 02/14/2013 12:03 PM, Jay Vosburgh wrote:
> Chris Friesen<chris.friesen@genband.com> wrote:
>> The problem is that L2 switch "B" still thinks that all the virtual
>> machines are accessible via L2 switch "A". Thus any incoming packets
>> destined for a virtual machine will get dropped.
>
> I'm trying to track down the system I tested previously to see
> exactly how it is set up and why it works when yours does not. It's
> possible that it doesn't work, and the testing we did simply missed this
> case.

After thinking about this for a while, it doesn't seem like a natural thing for either the bond or the bridge to care about--though if someone wanted to fix it generically that would be great.

It might be worth considering sending out a bonding-specific netlink message when a bond fails over, giving the name of the bond as well as the newly active slave device. This would allow for an efficient userspace daemon to issue gratuitous arps for all the VM MAC addresses.

It almost seems like the most elegant way to deal with this would be to forgo the bond completely and just add both links into the bridge, then run STP to make sure there are no loops. That way link pulls get handled immediately and STP will update the other L2 switches appropriately.

Unfortunately this doesn't seem to be an option for us since apparently some customers frown on a server participating in STP. (Not sure why, we're already dealing with much more complicated protocols...)

Chris
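
As a rough illustration of what the userspace daemon mentioned above would have to transmit, the Python sketch below builds a gratuitous ARP request for a given MAC and IPv4 address and sends one per VM on a chosen interface. The function names (`build_gratuitous_arp`, `send_garps`) are hypothetical, and the list of VM MAC/IP pairs is assumed to come from somewhere else, for example the hypervisor's configuration. Sending on the bond device rather than the bridge is also an assumption. The frame uses the common request form in which the sender and target IP are identical.

```python
import socket
import struct

BROADCAST = b"\xff" * 6
ETH_P_ARP = 0x0806
ARP_REQUEST = 1


def mac_to_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' to six raw bytes."""
    return bytes.fromhex(mac.replace(":", ""))


def build_gratuitous_arp(mac, ip):
    """Return a 42-byte Ethernet frame announcing that `ip` is at `mac`."""
    hw = mac_to_bytes(mac)
    addr = socket.inet_aton(ip)
    ethernet = BROADCAST + hw + struct.pack("!H", ETH_P_ARP)
    # htype=1 (Ethernet), ptype=0x0800 (IPv4), hlen=6, plen=4, oper=request
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, ARP_REQUEST)
    arp += hw + addr           # sender hardware and protocol address
    arp += b"\x00" * 6 + addr  # target hardware address unset, target IP = sender IP
    return ethernet + arp


def send_garps(ifname, vms):
    """Send one gratuitous ARP per (mac, ip) pair. Linux only, needs CAP_NET_RAW."""
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as s:
        s.bind((ifname, 0))
        for mac, ip in vms:
            s.send(build_gratuitous_arp(mac, ip))
```

A call such as `send_garps("bond0", [("52:54:00:12:34:56", "192.0.2.10")])` would be issued after each detected failover; the addresses here are placeholders. For the switches, what matters is the Ethernet source MAC of the frame, since that is what their forwarding tables learn from.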

* Re: how to handle bonding failover when using a bridge over the bond?
From: Rick Jones @ 2013-02-14 19:42 UTC
To: Chris Friesen; +Cc: Jay Vosburgh, Cong Wang, netdev

If a VM were to see its link to the bridge module bounce/cycle, might that cause it to send a gratuitous ARP on its own?

It would be something akin to the "if this uplink fails, bring down these downlinks" functionality that is out there in some places. Bond says "hey, I've had a change" then bridge toggles the "downlinks" to the VMs.

rick jones