how to handle bonding failover when using a bridge over the bond?

netdev.vger.kernel.org archive mirror

* how to handle bonding failover when using a bridge over the bond?
@ 2013-02-12 23:19 Chris Friesen
  2013-02-13  0:02 ` Jay Vosburgh
  0 siblings, 1 reply; 9+ messages in thread
From: Chris Friesen @ 2013-02-12 23:19 UTC
  To: bonding-devel, Jay Vosburgh, netdev

I've got a scenario that seems to be not well handled with the current 
bonding code in linux, but maybe I'm missing something.

I have a physical host with two ethernet links that are bonded together 
(active/backup).  Each link is connected to a separate L2 switch, which 
are in turn connected with a crosslink for redundancy.

The physical host is running multiple virtual machines each with a 
virtual adapter.  The virtual adapters and the bond are all bridged 
together to allow communication between the virtual machines, the host, 
and the outside world.

Now suppose one of the slave links fails. The bond device will failover 
to the other slave and send out a gratuitous arp on the newly active 
slave.  This will cause the L2 switches to update their lookup tables 
for the MAC address associated with the bond (so it now points to the 
newly active slave), but doesn't update the MAC addresses associated 
with the various virtual machines.  If someone on the network sends a 
packet to one of the virtual machines, the switch will try to send it 
over the failed slave.

What's the recommended solution for this?  The logical solution would 
seem to be to have something issue GARPs for each virtual machine when 
the bond device fails over, but there doesn't seem to be any way to 
register for notification (via rtnetlink for instance) when the bond 
fails over.  I could monitor for carrier loss, but that wouldn't work 
for the case where bonding is using arp monitoring.

Any suggestions?

Thanks,
Chris
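
For reference, a minimal sketch of the kind of gratuitous ARP frame the message above has in mind, one per virtual machine. The function name build_garp, the constant GARP_FRAME_LEN, and the choice of an ARP request rather than a reply are assumptions made for illustration; none of this is taken from the bonding code. What matters to the L2 switches is the Ethernet source address, since that is the field they learn from.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GARP_FRAME_LEN 42   /* 14-byte Ethernet header + 28-byte ARP payload */

/*
 * Fill `frame` (at least GARP_FRAME_LEN bytes) with a gratuitous ARP request
 * announcing that IPv4 address `ip` (network byte order) is at `mac`.
 * Returns the number of bytes written.
 */
size_t build_garp(uint8_t *frame, const uint8_t mac[6], const uint8_t ip[4])
{
    uint8_t *p = frame;

    /* Ethernet header: broadcast destination, the VM's MAC as source. */
    memset(p, 0xff, 6);  p += 6;
    memcpy(p, mac, 6);   p += 6;
    *p++ = 0x08; *p++ = 0x06;      /* EtherType: ARP */

    /* ARP fixed header. */
    *p++ = 0x00; *p++ = 0x01;      /* hardware type: Ethernet */
    *p++ = 0x08; *p++ = 0x00;      /* protocol type: IPv4 */
    *p++ = 6;                      /* hardware address length */
    *p++ = 4;                      /* protocol address length */
    *p++ = 0x00; *p++ = 0x01;      /* opcode: request */

    /* ARP addresses: sender and target IP are the same (gratuitous). */
    memcpy(p, mac, 6);   p += 6;   /* sender hardware address */
    memcpy(p, ip, 4);    p += 4;   /* sender protocol address */
    memset(p, 0x00, 6);  p += 6;   /* target hardware address: unset */
    memcpy(p, ip, 4);    p += 4;   /* target protocol address */

    return (size_t)(p - frame);
}
```

Actually transmitting the frame would also need a raw socket (for example AF_PACKET) on the bridge or bond device, which is left out of this sketch.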


* Re: how to handle bonding failover when using a bridge over the bond?
  2013-02-12 23:19 how to handle bonding failover when using a bridge over the bond? Chris Friesen
@ 2013-02-13  0:02 ` Jay Vosburgh
  2013-02-13  0:30   ` Chris Friesen
  0 siblings, 1 reply; 9+ messages in thread
From: Jay Vosburgh @ 2013-02-13  0:02 UTC
  To: Chris Friesen; +Cc: bonding-devel, netdev

Chris Friesen <chris.friesen@genband.com> wrote:

>I've got a scenario that seems to be not well handled with the current
>bonding code in linux, but maybe I'm missing something.
>
>I have a physical host with two ethernet links that are bonded together
>(active/backup).  Each link is connected to a separate L2 switch, which
>are in turn connected with a crosslink for redundancy.
>
>The physical host is running multiple virtual machines each with a virtual
>adapter.  The virtual adapters and the bond are all bridged together to
>allow communication between the virtual machines, the host, and the
>outside world.
>
>Now suppose one of the slave links fails. The bond device will failover to
>the other slave and send out a gratuitous arp on the newly active slave.
>This will cause the L2 switches to update their lookup tables for the MAC
>address associated with the bond (so it now points to the newly active
>slave), but doesn't update the MAC addresses associated with the various
>virtual machines.  If someone on the network sends a packet to one of the
>virtual machines, the switch will try to send it over the failed slave.

	If the link failure is such that there is no carrier on the
switch port, the switch will drop the forwarding entry for the virtual
machine's MAC address from that port.  The traffic for the VM's MAC
would then flood to all ports, presumably including the link to the
other switch, which wouldn't have a forwarding entry for the MAC, either
(or it would be the switch link port), and would also flood it to all
ports, one of which is the correct one.

	Now, I'm speculating a bit here, as I have not traced out
exactly how this works.  I have discussed bonding failover with people
here who have systems set up in the manner you describe (and did some
testing), and it appears to be working for them.

	On the other hand, something like a manual change of active
slave won't bring down the carrier of the previously-active slave, and
in that case there might be a problem with traffic destined for one of
the VMs, until the VM sends something that makes it to the new switch.

	Is this actually failing for you, or is this a thought
experiment?

>What's the recommended solution for this?  The logical solution would seem
>to be to have something issue GARPs for each virtual machine when the bond
>device fails over, but there doesn't seem to be any way to register for
>notification (via rtnetlink for instance) when the bond fails over.  I
>could monitor for carrier loss, but that wouldn't work for the case where
>bonding is using arp monitoring.

	There is a NETDEV_BONDING_FAILOVER notifier that is called for
active-backup mode when a new active slave is assigned.  The
rtnetlink_event function is on that chain, and will send an rtnetlink
message, although I don't see that the actual event is included in the
message.

	The bond doesn't track all of the MACs that go through it, but
the bridge presumably does, and could respond to the FAILOVER notifier
with something to notify the switch that the port assignments for the
various MACs have changed.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com


* Re: how to handle bonding failover when using a bridge over the bond?
  2013-02-13  0:02 ` Jay Vosburgh
@ 2013-02-13  0:30   ` Chris Friesen
  2013-02-13 17:14     ` Chris Friesen
  2013-02-14  8:01     ` Cong Wang
  0 siblings, 2 replies; 9+ messages in thread
From: Chris Friesen @ 2013-02-13  0:30 UTC
  To: Jay Vosburgh; +Cc: bonding-devel, netdev, Stephen Hemminger, bridge

On 02/12/2013 06:02 PM, Jay Vosburgh wrote:
> Chris Friesen<chris.friesen@genband.com>  wrote:
>
>> I've got a scenario that seems to be not well handled with the current
>> bonding code in linux, but maybe I'm missing something.
>>
>> I have a physical host with two ethernet links that are bonded together
>> (active/backup).  Each link is connected to a separate L2 switch, which
>> are in turn connected with a crosslink for redundancy.
>>
>> The physical host is running multiple virtual machines each with a virtual
>> adapter.  The virtual adapters and the bond are all bridged together to
>> allow communication between the virtual machines, the host, and the
>> outside world.
>>
>> Now suppose one of the slave links fails. The bond device will failover to
>> the other slave and send out a gratuitous arp on the newly active slave.
>> This will cause the L2 switches to update their lookup tables for the MAC
>> address associated with the bond (so it now points to the newly active
>> slave), but doesn't update the MAC addresses associated with the various
>> virtual machines.  If someone on the network sends a packet to one of the
>> virtual machines, the switch will try to send it over the failed slave.
>
> 	If the link failure is such that there is no carrier on the
> switch port, the switch will drop the forwarding entry for the virtual
> machine's MAC address from that port.  The traffic for the VM's MAC
> would then flood to all ports, presumably including the link to the
> other switch, which wouldn't have a forwarding entry for the MAC, either
> (or it would be the switch link port), and would also flood it to all
> ports, one of which is the correct one.

This makes sense, though it wouldn't cover the case where the link only 
loses carrier in one direction, or if the bond is using arp failover and 
something fails beyond the first hop.

> 	Is this actually failing for you, or is this a thought
> experiment?

It actually failed.  During a customer demo.  :)  From what I understand 
it was a physical link pull, which (based on what you say above) should 
have caused the switch to react appropriately.

I'll see if I can get some more information.  Maybe the switches weren't 
behaving properly or something.

>> What's the recommended solution for this?  The logical solution would seem
>> to be to have something issue GARPs for each virtual machine when the bond
>> device fails over, but there doesn't seem to be any way to register for
>> notification (via rtnetlink for instance) when the bond fails over.  I
>> could monitor for carrier loss, but that wouldn't work for the case where
>> bonding is using arp monitoring.
>
> 	There is a NETDEV_BONDING_FAILOVER notifier that is called for
> active-backup mode when a new active slave is assigned.  The
> rtnetlink_event function is on that chain, and will send an rtnetlink
> message, although I don't see that the actual event is included in the
> message.

If I'm reading this right it will end up sending an RTM_NEWLINK message, 
which seems a bit odd.

> 	The bond doesn't track all of the MACs that go through it, but
> the bridge presumably does, and could respond to the FAILOVER notifier
> with something to notify the switch that the port assignments for the
> various MACs have changed.

That would probably make sense.  I've added the bridging folks, maybe 
they'll have a suggestion how this sort of thing should be handled.

Chris
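
As a rough sketch of what a userspace listener would have to work with: the RTM_NEWLINK message discussed above can be picked out of a netlink receive buffer as shown below. The function name count_newlink is hypothetical, and the buffer is assumed to come from recv() on a NETLINK_ROUTE socket subscribed to RTMGRP_LINK. Since the message does not say which event triggered it, a listener would probably still have to find out separately whether the active slave changed, for example by reading the bond's active_slave attribute in sysfs.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/*
 * Walk a buffer of netlink messages and count the RTM_NEWLINK messages
 * that refer to interface index `ifindex` (e.g. the bond device).
 */
int count_newlink(void *buf, size_t len, int ifindex)
{
    int hits = 0;
    int remaining = (int)len;   /* NLMSG_OK/NLMSG_NEXT operate on an int length */
    struct nlmsghdr *nh;

    for (nh = (struct nlmsghdr *)buf;
         NLMSG_OK(nh, remaining);
         nh = NLMSG_NEXT(nh, remaining)) {
        struct ifinfomsg *ifi;

        if (nh->nlmsg_type != RTM_NEWLINK)
            continue;

        /* The link message payload starts with a struct ifinfomsg. */
        ifi = NLMSG_DATA(nh);
        if (ifi->ifi_index == ifindex)
            hits++;
    }
    return hits;
}
```

Because RTM_NEWLINK is also sent for unrelated link changes, a count greater than zero only means "something about this device changed", not "the bond failed over".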


* Re: how to handle bonding failover when using a bridge over the bond?
  2013-02-13  0:30   ` Chris Friesen
@ 2013-02-13 17:14     ` Chris Friesen
  2013-02-14  8:01     ` Cong Wang
  1 sibling, 0 replies; 9+ messages in thread
From: Chris Friesen @ 2013-02-13 17:14 UTC
  To: Jay Vosburgh; +Cc: bonding-devel, netdev, Stephen Hemminger, bridge

On 02/12/2013 06:30 PM, Chris Friesen wrote:
> On 02/12/2013 06:02 PM, Jay Vosburgh wrote:
>> Chris Friesen<chris.friesen@genband.com> wrote:

>>> I have a physical host with two ethernet links that are bonded
>>> together (active/backup). Each link is connected to a separate L2
>>> switch, which are in turn connected with a crosslink for
>>> redundancy.
>>>
>>> The physical host is running multiple virtual machines each with
>>> a virtual adapter. The virtual adapters and the bond are all
>>> bridged together to allow communication between the virtual
>>> machines, the host, and the outside world.
>>>
>>> Now suppose one of the slave links fails. The bond device will
>>> failover to the other slave and send out a gratuitous arp on the
>>> newly active slave. This will cause the L2 switches to update
>>> their lookup tables for the MAC address associated with the bond
>>> (so it now points to the newly active slave), but doesn't update
>>> the MAC addresses associated with the various virtual machines.
>>> If someone on the network sends a packet to one of the virtual
>>> machines, the switch will try to send it over the failed slave.
>>
>> If the link failure is such that there is no carrier on the switch
>> port, the switch will drop the forwarding entry for the virtual
>> machine's MAC address from that port. The traffic for the VM's MAC
>> would then flood to all ports, presumably including the link to
>> the other switch, which wouldn't have a forwarding entry for the
>> MAC, either (or it would be the switch link port), and would also
>> flood it to all ports, one of which is the correct one.

I talked with our networking guy.  Apparently what is happening is that 
if we pull the link to switch A it drops the forwarding entries for all 
MACs on the downed link, but switch B still has stale entries pointing 
to the inter-switch link.

If a packet destined for the VM arrives at switch B, the switch will 
send it across to switch A.  (Which is pointless since A no longer has a 
working link to the MAC in question.)

If a packet destined for the VM arrives at switch A, the switch will 
broadcast it to all ports, including the inter-switch link to switch B.  
However, switch B still thinks the MAC address is connected to switch 
A, so it drops the packet.

Once the VMs send out packets switch B will update its tables, but if 
the VMs are event-driven and mostly only respond to incoming packets 
they could end up waiting a long time.

Chris
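
The stale-entry behaviour described above can be reproduced with a toy model of two learning switches. This is only a sketch under simplifying assumptions: the port numbering, the names (l2switch, learn, link_down, forward) and the table size are made up for illustration, MAC addresses are reduced to small integers, and entry aging is ignored.

```c
#include <stddef.h>

#define MAX_MACS        8
#define PORT_NONE     (-1)  /* no forwarding entry for this MAC */
#define VERDICT_FLOOD (-2)  /* unknown destination: send out all ports but ingress */
#define VERDICT_DROP  (-3)  /* destination was learned on the ingress port: filter */

/* Hypothetical port numbering, the same on both switches. */
enum { PORT_HOST = 0, PORT_ISL = 1, PORT_UPLINK = 2 };

struct l2switch {
    int fdb[MAX_MACS];      /* fdb[mac] = port the MAC was last seen on */
};

void switch_init(struct l2switch *sw)
{
    for (size_t i = 0; i < MAX_MACS; i++)
        sw->fdb[i] = PORT_NONE;
}

/* Source learning: remember which port a MAC was seen on. */
void learn(struct l2switch *sw, int port, int src_mac)
{
    sw->fdb[src_mac] = port;
}

/* Carrier loss: flush every entry that was learned on that port. */
void link_down(struct l2switch *sw, int port)
{
    for (size_t i = 0; i < MAX_MACS; i++)
        if (sw->fdb[i] == port)
            sw->fdb[i] = PORT_NONE;
}

/* Forwarding decision for a unicast frame: an egress port or a verdict. */
int forward(const struct l2switch *sw, int ingress, int dst_mac)
{
    int port = sw->fdb[dst_mac];

    if (port == PORT_NONE)
        return VERDICT_FLOOD;
    if (port == ingress)
        return VERDICT_DROP;
    return port;
}
```

With switch A having learned the VM on PORT_HOST and switch B on PORT_ISL, calling link_down on A's host port makes A flood frames for the VM, while B filters them when they arrive over the inter-switch link and keeps sending its own copies towards A. Only a learn call on B's host port, which corresponds to the VM transmitting through the newly active slave, makes B forward to the host again.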


* Re: how to handle bonding failover when using a bridge over the bond?
  2013-02-13  0:30   ` Chris Friesen
  2013-02-13 17:14     ` Chris Friesen
@ 2013-02-14  8:01     ` Cong Wang
  2013-02-14 16:43       ` Chris Friesen
  1 sibling, 1 reply; 9+ messages in thread
From: Cong Wang @ 2013-02-14  8:01 UTC
  To: netdev

On Wed, 13 Feb 2013 at 00:30 GMT, Chris Friesen <chris.friesen@genband.com> wrote:
> On 02/12/2013 06:02 PM, Jay Vosburgh wrote:
>> 	The bond doesn't track all of the MACs that go through it, but
>> the bridge presumably does, and could respond to the FAILOVER notifier
>> with something to notify the switch that the port assignments for the
>> various MACs have changed.
>
> That would probably make sense.  I've added the bridging folks, maybe 
> they'll have a suggestion how this sort of thing should be handled.
>

It is already handled. When BONDING_FAILOVER is triggered and the MAC has
been changed, NETDEV_CHANGEADDR is issued too, then bridge will capture
it and update its fdb:

        case NETDEV_CHANGEADDR:
                spin_lock_bh(&br->lock);
                br_fdb_changeaddr(p, dev->dev_addr);
                changed_addr = br_stp_recalculate_bridge_id(br);
                spin_unlock_bh(&br->lock);

                if (changed_addr)
                        call_netdevice_notifiers(NETDEV_CHANGEADDR, br->dev);

                break;
		


* Re: how to handle bonding failover when using a bridge over the bond?
  2013-02-14  8:01     ` Cong Wang
@ 2013-02-14 16:43       ` Chris Friesen
  2013-02-14 18:03         ` Jay Vosburgh
  0 siblings, 1 reply; 9+ messages in thread
From: Chris Friesen @ 2013-02-14 16:43 UTC
  To: Cong Wang; +Cc: netdev

On 02/14/2013 02:01 AM, Cong Wang wrote:
> On Wed, 13 Feb 2013 at 00:30 GMT, Chris Friesen<chris.friesen@genband.com>  wrote:
>> On 02/12/2013 06:02 PM, Jay Vosburgh wrote:
>>> 	The bond doesn't track all of the MACs that go through it, but
>>> the bridge presumably does, and could respond to the FAILOVER notifier
>>> with something to notify the switch that the port assignments for the
>>> various MACs have changed.
>>
>> That would probably make sense.  I've added the bridging folks, maybe
>> they'll have a suggestion how this sort of thing should be handled.
>>
>
> It is already handled. When BONDING_FAILOVER is triggered and the MAC has
> been changed, NETDEV_CHANGEADDR is issued too, then bridge will capture
> it and update its fdb:
>
>          case NETDEV_CHANGEADDR:
>                  spin_lock_bh(&br->lock);
>                  br_fdb_changeaddr(p, dev->dev_addr);
>                  changed_addr = br_stp_recalculate_bridge_id(br);
>                  spin_unlock_bh(&br->lock);
>
>                  if (changed_addr)
>                          call_netdevice_notifiers(NETDEV_CHANGEADDR, br->dev);
>
>                  break;

I'm not familiar with the bridge code, can you elaborate on how this helps?

The problem scenario is this:

I have a host with eth0/eth1 bonded together as bond0.  eth0/eth1 are 
connected to separate L2 switches, which are interconnected.

On the host there are a number of virtual machines, each with a virtual 
interface.

All the virtual interfaces as well as bond0 are bridged together to 
allow the VMs, the host, and the outside world to talk to each other.

Currently the host does NOT participate in STP because it is considered 
an edge node.

Suppose eth0 is the active link and we pull it.  The bond will make eth1 
active and emit gratuitous arp packets for itself, so the external L2 
switches will update the location of the MAC address belonging to the 
bond.  On loss of carrier for the link to eth0 L2 switch "A" will drop 
the entries for the MAC addresses, including the ones for the virtual 
machines.

The problem is that L2 switch "B" still thinks that all the virtual 
machines are accessible via L2 switch "A".  Thus any incoming packets 
destined for a virtual machine will get dropped.

Chris


* Re: how to handle bonding failover when using a bridge over the bond?
  2013-02-14 16:43       ` Chris Friesen
@ 2013-02-14 18:03         ` Jay Vosburgh
  2013-02-14 19:29           ` Chris Friesen
  0 siblings, 1 reply; 9+ messages in thread
From: Jay Vosburgh @ 2013-02-14 18:03 UTC
  To: Chris Friesen; +Cc: Cong Wang, netdev

Chris Friesen <chris.friesen@genband.com> wrote:

>On 02/14/2013 02:01 AM, Cong Wang wrote:
>> On Wed, 13 Feb 2013 at 00:30 GMT, Chris Friesen<chris.friesen@genband.com>  wrote:
>>> On 02/12/2013 06:02 PM, Jay Vosburgh wrote:
>>>> 	The bond doesn't track all of the MACs that go through it, but
>>>> the bridge presumably does, and could respond to the FAILOVER notifier
>>>> with something to notify the switch that the port assignments for the
>>>> various MACs have changed.
>>>
>>> That would probably make sense.  I've added the bridging folks, maybe
>>> they'll have a suggestion how this sort of thing should be handled.
>>>
>>
>> It is already handled. When BONDING_FAILOVER is triggered and the MAC has
>> been changed, NETDEV_CHANGEADDR is issued too, then bridge will capture
>> it and update its fdb:
>>
>>          case NETDEV_CHANGEADDR:
>>                  spin_lock_bh(&br->lock);
>>                  br_fdb_changeaddr(p, dev->dev_addr);
>>                  changed_addr = br_stp_recalculate_bridge_id(br);
>>                  spin_unlock_bh(&br->lock);
>>
>>                  if (changed_addr)
>>                          call_netdevice_notifiers(NETDEV_CHANGEADDR, br->dev);
>>
>>                  break;
>
>I'm not familiar with the bridge code, can you elaborate on how this helps?

	I'm not sure that it does, even if you're using STP (although
I'd want to try it with STP to make sure).  This only updates the fdb's
MAC for the bond's port.  It won't affect the VM's MACs (which it
shouldn't, because they don't change), and won't send any gratuitous
updates through the bond's port to the switch that would notify the
second switch ("B" in Chris's description, below) that the switch port
for the VM's MAC(s) has changed.

	Also, if the bond has fail_over_mac=follow, then no CHANGEADDR
is issued, because the MAC address does not change.  This is not common
(and not the case in the configuration described below), but does occur.

>The problem scenario is this:
>
>I have a host with eth0/eth1 bonded together as bond0.  eth0/eth1 are
>connected to separate L2 switches, which are interconnected.
>
>On the host there are a number of virtual machines, each with a virtual
>interface.
>
>All the virtual interfaces as well as bond0 are bridged together to allow
>the VMs, the host, and the outside world to talk to each other.
>
>Currently the host does NOT participate in STP because it is considered an
>edge node.
>
>Suppose eth0 is the active link and we pull it.  The bond will make eth1
>active and emit gratuitous arp packets for itself, so the external L2
>switches will update the location of the MAC address belonging to the
>bond.  On loss of carrier for the link to eth0 L2 switch "A" will drop the
>entries for the MAC addresses, including the ones for the virtual
>machines.
>
>The problem is that L2 switch "B" still thinks that all the virtual
>machines are accessible via L2 switch "A".  Thus any incoming packets
>destined for a virtual machine will get dropped.

	I'm trying to track down the system I tested previously to see
exactly how it is set up and why it works when yours does not.  It's
possible that it doesn't work, and the testing we did simply missed this
case.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com


* Re: how to handle bonding failover when using a bridge over the bond?
  2013-02-14 18:03         ` Jay Vosburgh
@ 2013-02-14 19:29           ` Chris Friesen
  2013-02-14 19:42             ` Rick Jones
  0 siblings, 1 reply; 9+ messages in thread
From: Chris Friesen @ 2013-02-14 19:29 UTC
  To: Jay Vosburgh; +Cc: Cong Wang, netdev

On 02/14/2013 12:03 PM, Jay Vosburgh wrote:
> Chris Friesen<chris.friesen@genband.com>  wrote:

>> The problem is that L2 switch "B" still thinks that all the virtual
>> machines are accessible via L2 switch "A".  Thus any incoming packets
>> destined for a virtual machine will get dropped.
>
> 	I'm trying to track down the system I tested previously to see
> exactly how it is set up and why it works when yours does not.  It's
> possible that it doesn't work, and the testing we did simply missed this
> case.

After thinking about this for a while, it doesn't seem like a natural 
thing for either the bond or the bridge to care about--though if someone 
wanted to fix it generically that would be great.

It might be worth considering sending out a bonding-specific netlink 
message when a bond fails over, giving the name of the bond as well as 
the newly active slave device.  This would allow for an efficient 
userspace daemon to issue gratuitous arps for all the VM MAC addresses.

It almost seems like the most elegant way to deal with this would be to 
forgo the bond completely and just add both links into the bridge, then 
run STP to make sure there are no loops.  That way link pulls get 
handled immediately and STP will update the other L2 switches 
appropriately.  Unfortunately this doesn't seem to be an option for us 
since apparently some customers frown on a server participating in STP. 
(Not sure why, we're already dealing with much more complicated 
protocols...)

Chris


* Re: how to handle bonding failover when using a bridge over the bond?
  2013-02-14 19:29           ` Chris Friesen
@ 2013-02-14 19:42             ` Rick Jones
  0 siblings, 0 replies; 9+ messages in thread
From: Rick Jones @ 2013-02-14 19:42 UTC
  To: Chris Friesen; +Cc: Jay Vosburgh, Cong Wang, netdev

If a VM were to see its link to the bridge module bounce/cycle, might 
that cause it to send a gratuitous ARP on its own?  It would be 
something akin to the "if this uplink fails, bring down these downlinks" 
functionality that is out there in some places.  Bond says "hey, I've 
had a change" then bridge toggles the "downlinks" to the VMs.

rick jones

