* Receive issues with bonding and vlans
@ 2010-04-12 22:17 Chris Leech
2010-04-12 22:17 ` [PATCH] vlan: remove receive checks for bonding Chris Leech
2010-04-12 23:10 ` Receive issues with bonding and vlans Jay Vosburgh
0 siblings, 2 replies; 12+ messages in thread
From: Chris Leech @ 2010-04-12 22:17 UTC (permalink / raw)
To: netdev, Andy Gospodarek, Patrick McHardy; +Cc: bonding-devel
Quick summary: VLANs and bonding are interacting in strange ways in the
receive path, VLAN devices do not act the same as real Ethernet devices,
hardware accelerated VLANs do not act the same as software tagged VLANs,
and I think frames are incorrectly being passed up to protocols from
inactive bonding links.
I've been looking at high availability configurations for converged LAN
+ SAN networking, trying to see what running FCoE and IP traffic looked
like with bonding and dm_multipath. The goal is to allow sysadmins to
use the tools they are already using with separate LAN and SAN adapters,
now on a single converged adapter.
The setup I'm trying to use looks like this; with IP traffic running on
bond0, storage VLANs created on eth0 and eth1, and FCoE running on the
VLANs. Both switches provide Fiber Channel Forwarder (FCF) services,
and connect to the same LAN and SAN.
.-----------------------------------------.
| .--------------. |
| | dm_multipath | |
| '--------------' |
| ^ |
| .----------. | .----------. |
| | fc_host0 |--------'------| fc_host1 | |
| '----------' '----------' |
| ^ ^ |
| | | |
| .----------. .-------. .----------. |
| | eth0.101 | | bond0 | | eth1.101 | |
| '----------' '-------' '----------' |
| ^ ^ ^ |
| | .------. | .------. | |
| '-| eth0 |---'----| eth1 |-' |
| '------' '------' |
'-------------|---------------|-----------'
| |
v v
.----------. .----------.
| switch A |----| switch B |
'----------' '----------'
| | | |
.--'--'------------'--'-.
| |
v v
.-,( ),-. .-,( ),-.
.-( )-. .-( )-.
( FC SAN ) ( IP LAN )
'-( ).-' '-( ).-'
'-.( ).-' '-.( ).-'
bond0 is in active-backup mode, but FCoE is actively running on both
links providing two different paths into the SAN. This configuration
matches a typical HA setup with separate Ethernet + FC adapters. In
this case I'm interested in software convergence where all traffic
passes through the standard network transmit and receive paths.
The VLANs aren't strictly required by FCoE, but it is the recommended
best practice by switch vendors. The FCF switches map FC VSANs to
VLANs.
Ever since this series of changes to net/core/dev.c
Author: Joe Eykholt <jre@nuovasystems.com>
Date: Wed Jul 2 18:22:02 2008 -0700
net/core: Uninline skb_bond().
net/core: Allow certain receives on inactive slave.
net/core: Allow receive on active slaves.
it has been possible to receive directly on both active and inactive
slave links if the packet_type specifies the slave device. This
combined with the PACKET_ORIGDEV socket option allowed for FCoE to run
on the slave devices (DCB link configuration uses a userspace LLDP
agent, and FCoE includes a VLAN discovery protocol that is implemented
in userspace as well).
The problem is that it doesn't work for hardware accelerated VLAN
devices, because the VLAN receive paths have their own
skb_bond_should_drop calls that were not updated.
>From what I can tell, VLAN receives always end up going through
netif_receive_skb anyway, so skb_bond_should_drop gets called twice if
the frame isn't dropped the first time. I think the bonding checks in
__vlan_hwaccel_rx and vlan_gro_common should just be removed.
@@ -11,9 +11,6 @@ int __vlan_hwaccel_rx(struct sk_buff *skb, struct vlan_group *grp,
if (netpoll_rx(skb))
return NET_RX_DROP;
- if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))
- goto drop;
-
skb->skb_iif = skb->dev->ifindex;
__vlan_hwaccel_put_tag(skb, vlan_tci);
skb->dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK);
@@ -83,9 +80,6 @@ vlan_gro_common(struct napi_struct *napi, struct vlan_group *grp,
{
struct sk_buff *p;
- if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))
- goto drop;
-
skb->skb_iif = skb->dev->ifindex;
__vlan_hwaccel_put_tag(skb, vlan_tci);
skb->dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK);
That fixes my setup ... but thinking about it raised some more
questions. The VLAN discovery tool I wrote shouldn't have worked, I
didn't bother to bind a packet socket to each interface I wanted to use.
So a single unbound packet socket is successfully passing traffic on
both active and inactive slave interfaces, which from my understanding
shouldn't work. It's easier for me this way, but it still seems wrong.
I think the problem was introduced with these changes.
Author: Andy Gospodarek <andy@greyhouse.net>
Date: Wed Jan 6 12:56:37 2010 +0000
fix bonding: allow arp_ip_targets on separate vlans to use arp validation
Date: Mon Dec 14 10:48:58 2009 +0000
bonding: allow arp_ip_targets on separate vlans to use arp validation
The use of null_or_bond in netif_receive_skb looks suspicious to me. In
the presence of both bonding and VLANs it probably does what was
intended. Without VLANs however, it is always set to NULL which matches
unbound packet_types. So unbound packet_types will process all frames
received on an inactive slave link, ignoring the result of
skb_bond_should_drop.
I haven't quite figured out what I think the correct change for
null_or_bond is. I suspect it involves not using NULL at all. I can
see how it addresses the arp_ip_target on a VLAN issue, but this is also
changing the receive matching rules for other traffic in unexpected
ways.
- Chris
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH] vlan: remove receive checks for bonding 2010-04-12 22:17 Receive issues with bonding and vlans Chris Leech @ 2010-04-12 22:17 ` Chris Leech 2010-04-12 23:19 ` Jay Vosburgh 2010-04-12 23:10 ` Receive issues with bonding and vlans Jay Vosburgh 1 sibling, 1 reply; 12+ messages in thread From: Chris Leech @ 2010-04-12 22:17 UTC (permalink / raw) To: netdev, Andy Gospodarek, Patrick McHardy; +Cc: bonding-devel The checks in the hardware accelerated receive path are not up to date with what's in netif_receive_skb, which will get called anyway if the frame is not dropped in the vlan code. Signed-off-by: Chris Leech <christopher.leech@intel.com> --- net/8021q/vlan_core.c | 6 ------ 1 files changed, 0 insertions(+), 6 deletions(-) diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c index c584a0a..7576f9c 100644 --- a/net/8021q/vlan_core.c +++ b/net/8021q/vlan_core.c @@ -11,9 +11,6 @@ int __vlan_hwaccel_rx(struct sk_buff *skb, struct vlan_group *grp, if (netpoll_rx(skb)) return NET_RX_DROP; - if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master))) - goto drop; - skb->skb_iif = skb->dev->ifindex; __vlan_hwaccel_put_tag(skb, vlan_tci); skb->dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK); @@ -83,9 +80,6 @@ vlan_gro_common(struct napi_struct *napi, struct vlan_group *grp, { struct sk_buff *p; - if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master))) - goto drop; - skb->skb_iif = skb->dev->ifindex; __vlan_hwaccel_put_tag(skb, vlan_tci); skb->dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK); ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH] vlan: remove receive checks for bonding 2010-04-12 22:17 ` [PATCH] vlan: remove receive checks for bonding Chris Leech @ 2010-04-12 23:19 ` Jay Vosburgh 0 siblings, 0 replies; 12+ messages in thread From: Jay Vosburgh @ 2010-04-12 23:19 UTC (permalink / raw) To: Chris Leech; +Cc: netdev, Andy Gospodarek, Patrick McHardy, bonding-devel Chris Leech <christopher.leech@intel.com> wrote: >The checks in the hardware accelerated receive path are not up to date >with what's in netif_receive_skb, which will get called anyway if the >frame is not dropped in the vlan code. > >Signed-off-by: Chris Leech <christopher.leech@intel.com> NAK. As I explained in a reply to Chris's separate message detailing the problem he sees, the skb_bond_should_drop logic as implemented is dependent upon knowing the original skb->dev the packet arrived on, prior to VLAN reassigning it. That's not to say there's nothing wrong here, but removing the calls with break other things. -J >--- > > net/8021q/vlan_core.c | 6 ------ > 1 files changed, 0 insertions(+), 6 deletions(-) > >diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c >index c584a0a..7576f9c 100644 >--- a/net/8021q/vlan_core.c >+++ b/net/8021q/vlan_core.c >@@ -11,9 +11,6 @@ int __vlan_hwaccel_rx(struct sk_buff *skb, struct vlan_group *grp, > if (netpoll_rx(skb)) > return NET_RX_DROP; > >- if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master))) >- goto drop; >- > skb->skb_iif = skb->dev->ifindex; > __vlan_hwaccel_put_tag(skb, vlan_tci); > skb->dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK); >@@ -83,9 +80,6 @@ vlan_gro_common(struct napi_struct *napi, struct vlan_group *grp, > { > struct sk_buff *p; > >- if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master))) >- goto drop; >- > skb->skb_iif = skb->dev->ifindex; > __vlan_hwaccel_put_tag(skb, vlan_tci); > skb->dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK); > --- -Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Receive issues with bonding and vlans 2010-04-12 22:17 Receive issues with bonding and vlans Chris Leech 2010-04-12 22:17 ` [PATCH] vlan: remove receive checks for bonding Chris Leech @ 2010-04-12 23:10 ` Jay Vosburgh 2010-04-12 23:35 ` Chris Leech 1 sibling, 1 reply; 12+ messages in thread From: Jay Vosburgh @ 2010-04-12 23:10 UTC (permalink / raw) To: Chris Leech; +Cc: netdev, Andy Gospodarek, Patrick McHardy, bonding-devel Chris Leech <christopher.leech@intel.com> wrote: >Quick summary: VLANs and bonding are interacting in strange ways in the >receive path, VLAN devices do not act the same as real Ethernet devices, >hardware accelerated VLANs do not act the same as software tagged VLANs, >and I think frames are incorrectly being passed up to protocols from >inactive bonding links. > >I've been looking at high availability configurations for converged LAN >+ SAN networking, trying to see what running FCoE and IP traffic looked >like with bonding and dm_multipath. The goal is to allow sysadmins to >use the tools they are already using with separate LAN and SAN adapters, >now on a single converged adapter. > >The setup I'm trying to use looks like this; with IP traffic running on >bond0, storage VLANs created on eth0 and eth1, and FCoE running on the >VLANs. Both switches provide Fiber Channel Forwarder (FCF) services, >and connect to the same LAN and SAN. > > .-----------------------------------------. > | .--------------. | > | | dm_multipath | | > | '--------------' | > | ^ | > | .----------. | .----------. | > | | fc_host0 |--------'------| fc_host1 | | > | '----------' '----------' | > | ^ ^ | > | | | | > | .----------. .-------. .----------. | > | | eth0.101 | | bond0 | | eth1.101 | | > | '----------' '-------' '----------' | > | ^ ^ ^ | > | | .------. | .------. | | > | '-| eth0 |---'----| eth1 |-' | > | '------' '------' | > '-------------|---------------|-----------' > | | > v v > .----------. .----------. > | switch A |----| switch B | > '----------' '----------' > | | | | > .--'--'------------'--'-. > | | > v v > .-,( ),-. .-,( ),-. > .-( )-. .-( )-. > ( FC SAN ) ( IP LAN ) > '-( ).-' '-( ).-' > '-.( ).-' '-.( ).-' > >bond0 is in active-backup mode, but FCoE is actively running on both >links providing two different paths into the SAN. This configuration >matches a typical HA setup with separate Ethernet + FC adapters. In >this case I'm interested in software convergence where all traffic >passes through the standard network transmit and receive paths. > >The VLANs aren't strictly required by FCoE, but it is the recommended >best practice by switch vendors. The FCF switches map FC VSANs to >VLANs. > >Ever since this series of changes to net/core/dev.c > > Author: Joe Eykholt <jre@nuovasystems.com> > Date: Wed Jul 2 18:22:02 2008 -0700 > net/core: Uninline skb_bond(). > net/core: Allow certain receives on inactive slave. > net/core: Allow receive on active slaves. > >it has been possible to receive directly on both active and inactive >slave links if the packet_type specifies the slave device. This >combined with the PACKET_ORIGDEV socket option allowed for FCoE to run >on the slave devices (DCB link configuration uses a userspace LLDP >agent, and FCoE includes a VLAN discovery protocol that is implemented >in userspace as well). Is the FCoE supposed to run over the inactive bonding slave? Or am I misunderstanding what you're saying? I had thought the LLDP, et al, special case in bonding was to permit, essentially, path discovery, not necessarily active use of the inactive slave. Not that this is necessarily bad; the "drop stuff on inactive slaves" is really there for duplicate suppression, but it also can uncover network topology issues, e.g., network layouts that won't work if the devices fail, but appear to work during testing because the "inactive" slave still receives traffic (it hasn't really failed). >The problem is that it doesn't work for hardware accelerated VLAN >devices, because the VLAN receive paths have their own >skb_bond_should_drop calls that were not updated. > >From what I can tell, VLAN receives always end up going through >netif_receive_skb anyway, so skb_bond_should_drop gets called twice if >the frame isn't dropped the first time. I think the bonding checks in >__vlan_hwaccel_rx and vlan_gro_common should just be removed. I'm not so sure. The checks in __vlan_hwaccel_rx are done with the original receiving device in skb->dev; by the time the packet gets to netif_receive_skb, the original slave the packet was received on has been lost (and replaced with the VLAN device). Various things are interested in that, in particular the "arp_validate" and the "inactive slave drop" logic for bonding depend on knowing the real device the packet arrived on. I note that the vlan accel logic doesn't change skb_iif to the VLAN device; it remains as the original device. I suppose one alternative would be to convert the bonding drop, et al, logic to use skb_iif instead of skb->dev; if that works, then I think the VLAN core would not need to call skb_bond_should_drop, which in turn would be a bit more complicated as it would have to look up the dev from the skb_iif. There's already some code in bonding that takes advantage of this property of the VLANs, so maybe this is the way to go. >@@ -11,9 +11,6 @@ int __vlan_hwaccel_rx(struct sk_buff *skb, struct vlan_group *grp, > if (netpoll_rx(skb)) > return NET_RX_DROP; > >- if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master))) >- goto drop; >- > skb->skb_iif = skb->dev->ifindex; > __vlan_hwaccel_put_tag(skb, vlan_tci); > skb->dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK); >@@ -83,9 +80,6 @@ vlan_gro_common(struct napi_struct *napi, struct vlan_group *grp, > { > struct sk_buff *p; > >- if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master))) >- goto drop; >- > skb->skb_iif = skb->dev->ifindex; > __vlan_hwaccel_put_tag(skb, vlan_tci); > skb->dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK); > >That fixes my setup ... but thinking about it raised some more >questions. The VLAN discovery tool I wrote shouldn't have worked, I >didn't bother to bind a packet socket to each interface I wanted to use. >So a single unbound packet socket is successfully passing traffic on >both active and inactive slave interfaces, which from my understanding >shouldn't work. It's easier for me this way, but it still seems wrong. There is no logic to block transmission on bonding slave devices; anything is free to bind to the slave and send whatever it wants. The reception logic (to drop most traffic on the inactive slave in an active-backup bond) was originally put in place to prevent duplicate packets from being received for broadcasts and for unicasts when the switch floods to all ports (which can happen during the interval that a switch is still learning the MAC address). >I think the problem was introduced with these changes. > > Author: Andy Gospodarek <andy@greyhouse.net> > Date: Wed Jan 6 12:56:37 2010 +0000 > fix bonding: allow arp_ip_targets on separate vlans to use arp validation > Date: Mon Dec 14 10:48:58 2009 +0000 > bonding: allow arp_ip_targets on separate vlans to use arp validation > >The use of null_or_bond in netif_receive_skb looks suspicious to me. In >the presence of both bonding and VLANs it probably does what was >intended. Without VLANs however, it is always set to NULL which matches >unbound packet_types. So unbound packet_types will process all frames >received on an inactive slave link, ignoring the result of >skb_bond_should_drop. > >I haven't quite figured out what I think the correct change for >null_or_bond is. I suspect it involves not using NULL at all. I can >see how it addresses the arp_ip_target on a VLAN issue, but this is also >changing the receive matching rules for other traffic in unexpected >ways. I'll hazard a guess that something like this might do it: diff --git a/net/core/dev.c b/net/core/dev.c index b98ddc6..cc665bb 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -2735,7 +2735,7 @@ ncls: &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) { if (ptype->type == type && (ptype->dev == null_or_orig || ptype->dev == skb->dev || ptype->dev == orig_dev || - ptype->dev == null_or_bond)) { + (null_or_bond && (ptype->dev == null_or_bond))) { if (pt_prev) ret = deliver_skb(skb, pt_prev, orig_dev); pt_prev = ptype; I haven't tested this, but the theory is to only test against null_or_bond if null_or_bond isn't NULL, which is only the case for VLAN traffic over bonding. -J --- -Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: Receive issues with bonding and vlans 2010-04-12 23:10 ` Receive issues with bonding and vlans Jay Vosburgh @ 2010-04-12 23:35 ` Chris Leech 2010-04-13 0:08 ` Jay Vosburgh 0 siblings, 1 reply; 12+ messages in thread From: Chris Leech @ 2010-04-12 23:35 UTC (permalink / raw) To: Jay Vosburgh Cc: netdev@vger.kernel.org, Andy Gospodarek, Patrick McHardy, bonding-devel@lists.sourceforge.net On Mon, Apr 12, 2010 at 04:10:51PM -0700, Jay Vosburgh wrote: > Is the FCoE supposed to run over the inactive bonding slave? Or > am I misunderstanding what you're saying? I had thought the LLDP, et > al, special case in bonding was to permit, essentially, path discovery, > not necessarily active use of the inactive slave. That's what I'm trying to do, yes. Mostly because it's a setup that would work if you removed the FCoE traffic from the network data path, and only converged at the driver level and below. It's possible that the answer is "don't do that". > Not that this is necessarily bad; the "drop stuff on inactive > slaves" is really there for duplicate suppression, but it also can > uncover network topology issues, e.g., network layouts that won't work > if the devices fail, but appear to work during testing because the > "inactive" slave still receives traffic (it hasn't really failed). > > >The problem is that it doesn't work for hardware accelerated VLAN > >devices, because the VLAN receive paths have their own > >skb_bond_should_drop calls that were not updated. > > > >From what I can tell, VLAN receives always end up going through > >netif_receive_skb anyway, so skb_bond_should_drop gets called twice if > >the frame isn't dropped the first time. I think the bonding checks in > >__vlan_hwaccel_rx and vlan_gro_common should just be removed. > > I'm not so sure. The checks in __vlan_hwaccel_rx are done with > the original receiving device in skb->dev; by the time the packet gets > to netif_receive_skb, the original slave the packet was received on has > been lost (and replaced with the VLAN device). Various things are > interested in that, in particular the "arp_validate" and the "inactive > slave drop" logic for bonding depend on knowing the real device the > packet arrived on. > > I note that the vlan accel logic doesn't change skb_iif to the > VLAN device; it remains as the original device. I suppose one > alternative would be to convert the bonding drop, et al, logic to use > skb_iif instead of skb->dev; if that works, then I think the VLAN core > would not need to call skb_bond_should_drop, which in turn would be a > bit more complicated as it would have to look up the dev from the > skb_iif. There's already some code in bonding that takes advantage of > this property of the VLANs, so maybe this is the way to go. Thanks, I'll take another look and see if I can come up with something better. > >I haven't quite figured out what I think the correct change for > >null_or_bond is. I suspect it involves not using NULL at all. I can > >see how it addresses the arp_ip_target on a VLAN issue, but this is also > >changing the receive matching rules for other traffic in unexpected > >ways. > > I'll hazard a guess that something like this might do it: > > diff --git a/net/core/dev.c b/net/core/dev.c > index b98ddc6..cc665bb 100644 > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -2735,7 +2735,7 @@ ncls: > &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) { > if (ptype->type == type && (ptype->dev == null_or_orig || > ptype->dev == skb->dev || ptype->dev == orig_dev || > - ptype->dev == null_or_bond)) { > + (null_or_bond && (ptype->dev == null_or_bond))) { > if (pt_prev) > ret = deliver_skb(skb, pt_prev, orig_dev); > pt_prev = ptype; > > > I haven't tested this, but the theory is to only test against > null_or_bond if null_or_bond isn't NULL, which is only the case for VLAN > traffic over bonding. Yes, that should do it. - Chris ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Receive issues with bonding and vlans 2010-04-12 23:35 ` Chris Leech @ 2010-04-13 0:08 ` Jay Vosburgh 2010-05-03 3:04 ` John Fastabend 0 siblings, 1 reply; 12+ messages in thread From: Jay Vosburgh @ 2010-04-13 0:08 UTC (permalink / raw) To: Chris Leech Cc: netdev@vger.kernel.org, Andy Gospodarek, Patrick McHardy, bonding-devel@lists.sourceforge.net Chris Leech <christopher.leech@intel.com> wrote: >On Mon, Apr 12, 2010 at 04:10:51PM -0700, Jay Vosburgh wrote: >> Is the FCoE supposed to run over the inactive bonding slave? Or >> am I misunderstanding what you're saying? I had thought the LLDP, et >> al, special case in bonding was to permit, essentially, path discovery, >> not necessarily active use of the inactive slave. > >That's what I'm trying to do, yes. Mostly because it's a setup that >would work if you removed the FCoE traffic from the network data path, >and only converged at the driver level and below. It's possible that >the answer is "don't do that". So, basically, you want the bond to act like usual for "regular" ethernet traffic, but act like the slaves are independent from the bond for the magic FCoE traffic, right? I'm not really sure if that's a "don't do that" or not. >> Not that this is necessarily bad; the "drop stuff on inactive >> slaves" is really there for duplicate suppression, but it also can >> uncover network topology issues, e.g., network layouts that won't work >> if the devices fail, but appear to work during testing because the >> "inactive" slave still receives traffic (it hasn't really failed). >> >> >The problem is that it doesn't work for hardware accelerated VLAN >> >devices, because the VLAN receive paths have their own >> >skb_bond_should_drop calls that were not updated. >> > >> >From what I can tell, VLAN receives always end up going through >> >netif_receive_skb anyway, so skb_bond_should_drop gets called twice if >> >the frame isn't dropped the first time. I think the bonding checks in >> >__vlan_hwaccel_rx and vlan_gro_common should just be removed. >> >> I'm not so sure. The checks in __vlan_hwaccel_rx are done with >> the original receiving device in skb->dev; by the time the packet gets >> to netif_receive_skb, the original slave the packet was received on has >> been lost (and replaced with the VLAN device). Various things are >> interested in that, in particular the "arp_validate" and the "inactive >> slave drop" logic for bonding depend on knowing the real device the >> packet arrived on. >> >> I note that the vlan accel logic doesn't change skb_iif to the >> VLAN device; it remains as the original device. I suppose one >> alternative would be to convert the bonding drop, et al, logic to use >> skb_iif instead of skb->dev; if that works, then I think the VLAN core >> would not need to call skb_bond_should_drop, which in turn would be a >> bit more complicated as it would have to look up the dev from the >> skb_iif. There's already some code in bonding that takes advantage of >> this property of the VLANs, so maybe this is the way to go. > >Thanks, I'll take another look and see if I can come up with something >better. I looked at the skb_bond_should_drop stuff a bit more after I wrote that; it's not as easy as I had suspected. The big sticking point is that currently the test in netif_receive_skb (now __netif_receive_skb in net-next-2.6) is on skb->dev->master to identify packets arriving on slaves of bonding. The VLAN skb->dev has ->master set to NULL. Doing that test against skb->skb_iif would be much more expensive, as it would require a device lookup for every packet. So, I suspect that something has to happen in the VLAN acceleration path, although I don't know exactly what. I don't know if it would be possible to flag the packets in some special way to indicate that they're "bonding slave" packets, or if it's better to keep the current structure and just fix the calls somehow. -J >> >I haven't quite figured out what I think the correct change for >> >null_or_bond is. I suspect it involves not using NULL at all. I can >> >see how it addresses the arp_ip_target on a VLAN issue, but this is also >> >changing the receive matching rules for other traffic in unexpected >> >ways. >> >> I'll hazard a guess that something like this might do it: >> >> diff --git a/net/core/dev.c b/net/core/dev.c >> index b98ddc6..cc665bb 100644 >> --- a/net/core/dev.c >> +++ b/net/core/dev.c >> @@ -2735,7 +2735,7 @@ ncls: >> &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) { >> if (ptype->type == type && (ptype->dev == null_or_orig || >> ptype->dev == skb->dev || ptype->dev == orig_dev || >> - ptype->dev == null_or_bond)) { >> + (null_or_bond && (ptype->dev == null_or_bond))) { >> if (pt_prev) >> ret = deliver_skb(skb, pt_prev, orig_dev); >> pt_prev = ptype; >> >> >> I haven't tested this, but the theory is to only test against >> null_or_bond if null_or_bond isn't NULL, which is only the case for VLAN >> traffic over bonding. > >Yes, that should do it. > > - Chris --- -Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Receive issues with bonding and vlans 2010-04-13 0:08 ` Jay Vosburgh @ 2010-05-03 3:04 ` John Fastabend 2010-05-03 18:25 ` Jay Vosburgh 0 siblings, 1 reply; 12+ messages in thread From: John Fastabend @ 2010-05-03 3:04 UTC (permalink / raw) To: Jay Vosburgh Cc: Leech, Christopher, netdev@vger.kernel.org, Andy Gospodarek, Patrick McHardy, bonding-devel@lists.sourceforge.net Jay Vosburgh wrote: > Chris Leech <christopher.leech@intel.com> wrote: > >> On Mon, Apr 12, 2010 at 04:10:51PM -0700, Jay Vosburgh wrote: >>> Is the FCoE supposed to run over the inactive bonding slave? Or >>> am I misunderstanding what you're saying? I had thought the LLDP, et >>> al, special case in bonding was to permit, essentially, path discovery, >>> not necessarily active use of the inactive slave. >> That's what I'm trying to do, yes. Mostly because it's a setup that >> would work if you removed the FCoE traffic from the network data path, >> and only converged at the driver level and below. It's possible that >> the answer is "don't do that". > > So, basically, you want the bond to act like usual for "regular" > ethernet traffic, but act like the slaves are independent from the bond > for the magic FCoE traffic, right? > > I'm not really sure if that's a "don't do that" or not. > >>> Not that this is necessarily bad; the "drop stuff on inactive >>> slaves" is really there for duplicate suppression, but it also can >>> uncover network topology issues, e.g., network layouts that won't work >>> if the devices fail, but appear to work during testing because the >>> "inactive" slave still receives traffic (it hasn't really failed). >>> >>>> The problem is that it doesn't work for hardware accelerated VLAN >>>> devices, because the VLAN receive paths have their own >>>> skb_bond_should_drop calls that were not updated. >>>> >>> >From what I can tell, VLAN receives always end up going through >>>> netif_receive_skb anyway, so skb_bond_should_drop gets called twice if >>>> the frame isn't dropped the first time. I think the bonding checks in >>>> __vlan_hwaccel_rx and vlan_gro_common should just be removed. >>> I'm not so sure. The checks in __vlan_hwaccel_rx are done with >>> the original receiving device in skb->dev; by the time the packet gets >>> to netif_receive_skb, the original slave the packet was received on has >>> been lost (and replaced with the VLAN device). Various things are >>> interested in that, in particular the "arp_validate" and the "inactive >>> slave drop" logic for bonding depend on knowing the real device the >>> packet arrived on. >>> >>> I note that the vlan accel logic doesn't change skb_iif to the >>> VLAN device; it remains as the original device. I suppose one >>> alternative would be to convert the bonding drop, et al, logic to use >>> skb_iif instead of skb->dev; if that works, then I think the VLAN core >>> would not need to call skb_bond_should_drop, which in turn would be a >>> bit more complicated as it would have to look up the dev from the >>> skb_iif. There's already some code in bonding that takes advantage of >>> this property of the VLANs, so maybe this is the way to go. >> Thanks, I'll take another look and see if I can come up with something >> better. > > I looked at the skb_bond_should_drop stuff a bit more after I > wrote that; it's not as easy as I had suspected. The big sticking point > is that currently the test in netif_receive_skb (now __netif_receive_skb > in net-next-2.6) is on skb->dev->master to identify packets arriving on > slaves of bonding. The VLAN skb->dev has ->master set to NULL. Doing > that test against skb->skb_iif would be much more expensive, as it would > require a device lookup for every packet. > > So, I suspect that something has to happen in the VLAN > acceleration path, although I don't know exactly what. I don't know if > it would be possible to flag the packets in some special way to indicate > that they're "bonding slave" packets, or if it's better to keep the > current structure and just fix the calls somehow. > > -J > > Jay, It should be OK to allow packets to be received on the VLAN if it is not explicitly in the bond? Or if we want to be more paranoid deliver packets only to handlers with exact matches for the device. For non vlan devices we deliver skb's to packet handlers that match exactly even on inactive slaves so doing this on vlan devices as well makes sense and shouldn't cause any unexpected problems. Also on a somewhat unrelated note I suspect null_or_orig and null_or_bond are not working as expected in __netif_receive_skb(). At least the comment 'deliver only exact match' could be inaccurate. Here's a patch to illustrate what I'm thinking compile tested only. If this sounds reasonable I'll work up an official patch. [PATCH] net: allow vlans on bonded real net_devices For converged I/O it is reasonable to use dm_multipathing to provice failover and load balancing for storage traffic and then use bonding for the LAN failover and load balancing. Currently this works if the multipathed devices are using the real net_device and those devices are enslaved to a bonded device what does not work is creating a vlan on the real device and trying to use it outside the bond for multipathing. This patch adds logic so that if the skb is destined for a vlan that is not in the bond the skb will not be dropped. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> --- net/8021q/vlan_core.c | 31 +++++++++++++++++++++---------- net/core/dev.c | 11 ++++++++--- 2 files changed, 29 insertions(+), 13 deletions(-) diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c index c584a0a..3bce0c3 100644 --- a/net/8021q/vlan_core.c +++ b/net/8021q/vlan_core.c @@ -8,18 +8,24 @@ int __vlan_hwaccel_rx(struct sk_buff *skb, struct vlan_group *grp, u16 vlan_tci, int polling) { + struct net_device *vlan_dev; + if (netpoll_rx(skb)) return NET_RX_DROP; - if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master))) + vlan_dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK); + + if (!vlan_dev) + goto drop; + + if ((vlan_dev->priv_flags & IFF_BONDING || + vlan_dev_real_dev(vlan_dev)->flags & IFF_MASTER) && + skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master))) goto drop; skb->skb_iif = skb->dev->ifindex; __vlan_hwaccel_put_tag(skb, vlan_tci); - skb->dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK); - - if (!skb->dev) - goto drop; + skb->dev = vlan_dev; return (polling ? netif_receive_skb(skb) : netif_rx(skb)); @@ -82,16 +88,21 @@ vlan_gro_common(struct napi_struct *napi, struct vlan_group *grp, unsigned int vlan_tci, struct sk_buff *skb) { struct sk_buff *p; + struct net_device *vlan_dev; - if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master))) + vlan_dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK); + + if (!vlan_dev) + goto drop; + + if ((vlan_dev->priv_flags & IFF_BONDING || + vlan_dev_real_dev(vlan_dev)->flags & IFF_MASTER) && + skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master))) goto drop; skb->skb_iif = skb->dev->ifindex; __vlan_hwaccel_put_tag(skb, vlan_tci); - skb->dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK); - - if (!skb->dev) - goto drop; + skb->dev = vlan_dev; for (p = napi->gro_list; p; p = p->next) { NAPI_GRO_CB(p)->same_flow = diff --git a/net/core/dev.c b/net/core/dev.c index 100dcbd..9ea4550 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -2780,6 +2780,7 @@ static int __netif_receive_skb(struct sk_buff *skb) struct net_device *master; struct net_device *null_or_orig; struct net_device *null_or_bond; + struct net_device *real_dev; int ret = NET_RX_DROP; __be16 type; @@ -2853,9 +2854,13 @@ ncls: * handler may have to adjust skb->dev and orig_dev. */ null_or_bond = NULL; - if ((skb->dev->priv_flags & IFF_802_1Q_VLAN) && - (vlan_dev_real_dev(skb->dev)->priv_flags & IFF_BONDING)) { - null_or_bond = vlan_dev_real_dev(skb->dev); + if ((skb->dev->priv_flags & IFF_802_1Q_VLAN)) { + real_dev = vlan_dev_real_dev(skb->dev); + if (real_dev->priv_flags & IFF_BONDING) + null_or_bond = vlan_dev_real_dev(skb->dev); + if (!(skb->dev->priv_flags & IFF_BONDING) && + real_dev->priv_flags & IFF_SLAVE_INACTIVE) + null_or_orig = skb->dev; } type = skb->protocol; ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: Receive issues with bonding and vlans 2010-05-03 3:04 ` John Fastabend @ 2010-05-03 18:25 ` Jay Vosburgh 2010-05-03 21:17 ` John Fastabend 0 siblings, 1 reply; 12+ messages in thread From: Jay Vosburgh @ 2010-05-03 18:25 UTC (permalink / raw) To: John Fastabend Cc: Leech, Christopher, netdev@vger.kernel.org, Andy Gospodarek, Patrick McHardy, bonding-devel@lists.sourceforge.net John Fastabend <john.r.fastabend@intel.com> wrote: >Jay Vosburgh wrote: >> Chris Leech <christopher.leech@intel.com> wrote: >> >>> On Mon, Apr 12, 2010 at 04:10:51PM -0700, Jay Vosburgh wrote: >>>> Is the FCoE supposed to run over the inactive bonding slave? Or >>>> am I misunderstanding what you're saying? I had thought the LLDP, et >>>> al, special case in bonding was to permit, essentially, path discovery, >>>> not necessarily active use of the inactive slave. >>> That's what I'm trying to do, yes. Mostly because it's a setup that >>> would work if you removed the FCoE traffic from the network data path, >>> and only converged at the driver level and below. It's possible that >>> the answer is "don't do that". >> >> So, basically, you want the bond to act like usual for "regular" >> ethernet traffic, but act like the slaves are independent from the bond >> for the magic FCoE traffic, right? >> >> I'm not really sure if that's a "don't do that" or not. >> >>>> Not that this is necessarily bad; the "drop stuff on inactive >>>> slaves" is really there for duplicate suppression, but it also can >>>> uncover network topology issues, e.g., network layouts that won't work >>>> if the devices fail, but appear to work during testing because the >>>> "inactive" slave still receives traffic (it hasn't really failed). >>>> >>>>> The problem is that it doesn't work for hardware accelerated VLAN >>>>> devices, because the VLAN receive paths have their own >>>>> skb_bond_should_drop calls that were not updated. >>>>> >>>> >From what I can tell, VLAN receives always end up going through >>>>> netif_receive_skb anyway, so skb_bond_should_drop gets called twice if >>>>> the frame isn't dropped the first time. I think the bonding checks in >>>>> __vlan_hwaccel_rx and vlan_gro_common should just be removed. >>>> I'm not so sure. The checks in __vlan_hwaccel_rx are done with >>>> the original receiving device in skb->dev; by the time the packet gets >>>> to netif_receive_skb, the original slave the packet was received on has >>>> been lost (and replaced with the VLAN device). Various things are >>>> interested in that, in particular the "arp_validate" and the "inactive >>>> slave drop" logic for bonding depend on knowing the real device the >>>> packet arrived on. >>>> >>>> I note that the vlan accel logic doesn't change skb_iif to the >>>> VLAN device; it remains as the original device. I suppose one >>>> alternative would be to convert the bonding drop, et al, logic to use >>>> skb_iif instead of skb->dev; if that works, then I think the VLAN core >>>> would not need to call skb_bond_should_drop, which in turn would be a >>>> bit more complicated as it would have to look up the dev from the >>>> skb_iif. There's already some code in bonding that takes advantage of >>>> this property of the VLANs, so maybe this is the way to go. >>> Thanks, I'll take another look and see if I can come up with something >>> better. >> >> I looked at the skb_bond_should_drop stuff a bit more after I >> wrote that; it's not as easy as I had suspected. The big sticking point >> is that currently the test in netif_receive_skb (now __netif_receive_skb >> in net-next-2.6) is on skb->dev->master to identify packets arriving on >> slaves of bonding. The VLAN skb->dev has ->master set to NULL. Doing >> that test against skb->skb_iif would be much more expensive, as it would >> require a device lookup for every packet. >> >> So, I suspect that something has to happen in the VLAN >> acceleration path, although I don't know exactly what. I don't know if >> it would be possible to flag the packets in some special way to indicate >> that they're "bonding slave" packets, or if it's better to keep the >> current structure and just fix the calls somehow. >> >> -J >> >> > >Jay, > >It should be OK to allow packets to be received on the VLAN if it is not >explicitly in the bond? Lemme see if I have this straight, because all of these special cases are making my brain hurt. This one is for a configuration like this: bond0 ----- eth0 / vlan.xxx -/ I.e., a VLAN configured directly atop an ethernet device, said ethernet also being a slave to bonding. Is that correct? Extrapolating from the ASCII art in a prior message in this discussion, would this configuration really be something like this: vlan.xxx -\ \ bond0 ----- eth1 bond0 ----- eth0 / vlan.xxx -/ I.e., two slaves to bonding, with VLAN xxx configured atop both of the slaves? Or would the eth0 and eth1 use discrete VLAN ids? The reason I ask is in regards to duplicate suppression. The whole reason the "inactive" slave drops (most) incoming packets is to eliminate duplicates when the switch floods traffic to both slave ports. This is a bit tricky, because it's not really about broadcasts / multicasts so much, but about traffic that the switch sends to all ports because the switch's MAC address table isn't up to date with the destination MAC of the traffic (which is a transient condition, so this would only happen for perhaps one second or so). That would result in duplicate unicast packets being received by the bond (back in the day before bonding had the "drop inactive traffic" logic). So if the same VLAN is configured atop the two slaves, I wonder if that will open a window for the duplicate unicast packet problem. If the VLAN ids are different, then I'll assume this is some kind of device mapper magic, doing the load balancing elsewhere. >Or if we want to be more paranoid deliver packets only to handlers with >exact matches for the device. For non vlan devices we deliver skb's to >packet handlers that match exactly even on inactive slaves so doing this >on vlan devices as well makes sense and shouldn't cause any unexpected >problems. Yah, the whole concept of "inactive" slave is pretty mutated now; it's kind of become the "active-backup with semi-manual load balance for clever protocols, oh, and duplicate suppression" mode. >Also on a somewhat unrelated note I suspect null_or_orig and null_or_bond >are not working as expected in __netif_receive_skb(). At least the >comment 'deliver only exact match' could be inaccurate. I don't think this is unrelated at all. This code (the packet to device lookup stuff in __netif_receive_skb) has been modified pretty extensively lately for various bonding-related special cases, and I think it is getting hard to follow. Whatever comments are there need to be accurate, and, honestly, I think this code needs comments to explain what exactly is supposed to happen for these special cases. >Here's a patch to illustrate what I'm thinking compile tested only. If >this sounds reasonable I'll work up an official patch. > > >[PATCH] net: allow vlans on bonded real net_devices > >For converged I/O it is reasonable to use dm_multipathing to provice >failover and load balancing for storage traffic and then use bonding >for the LAN failover and load balancing. > >Currently this works if the multipathed devices are using the real >net_device and those devices are enslaved to a bonded device what >does not work is creating a vlan on the real device and trying to >use it outside the bond for multipathing. > >This patch adds logic so that if the skb is destined for a vlan >that is not in the bond the skb will not be dropped. > >Signed-off-by: John Fastabend <john.r.fastabend@intel.com> >--- > > net/8021q/vlan_core.c | 31 +++++++++++++++++++++---------- > net/core/dev.c | 11 ++++++++--- > 2 files changed, 29 insertions(+), 13 deletions(-) > >diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c >index c584a0a..3bce0c3 100644 >--- a/net/8021q/vlan_core.c >+++ b/net/8021q/vlan_core.c >@@ -8,18 +8,24 @@ > int __vlan_hwaccel_rx(struct sk_buff *skb, struct vlan_group *grp, > u16 vlan_tci, int polling) > { >+ struct net_device *vlan_dev; >+ > if (netpoll_rx(skb)) > return NET_RX_DROP; > >- if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master))) >+ vlan_dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK); >+ >+ if (!vlan_dev) >+ goto drop; >+ >+ if ((vlan_dev->priv_flags & IFF_BONDING || >+ vlan_dev_real_dev(vlan_dev)->flags & IFF_MASTER) && >+ skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master))) I'm not sure this will do the right thing if the VLAN device itself is a slave to bonding, e.g., bond0 ---> vlan.xxx ---> eth0. In that case, eth0's dev->master is NULL, and the vlan_dev (vlan.xxx's dev) doesn't have IFF_MASTER (but does have IFF_SLAVE and IFF_BONDING, I believe). I think this will result in all incoming traffic being accepted on such a configuration (leading to duplicates, as described above). I suspect, but have not tested, that something like this might do what you're looking for: if ((vlan_dev->priv_flags & IFF_BONDING || vlan_dev_real_dev(vlan_dev)->flags & (IFF_MASTER | IFF_SLAVE)) && skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master))) I.e., if the VLAN device is either a MASTER (configured above the bond) or a slave (configured below the bond) do the duplicate suppresion. > goto drop; > > skb->skb_iif = skb->dev->ifindex; > __vlan_hwaccel_put_tag(skb, vlan_tci); >- skb->dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK); >- >- if (!skb->dev) >- goto drop; >+ skb->dev = vlan_dev; > > return (polling ? netif_receive_skb(skb) : netif_rx(skb)); > >@@ -82,16 +88,21 @@ vlan_gro_common(struct napi_struct *napi, struct >vlan_group *grp, > unsigned int vlan_tci, struct sk_buff *skb) > { > struct sk_buff *p; >+ struct net_device *vlan_dev; > >- if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master))) >+ vlan_dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK); >+ >+ if (!vlan_dev) >+ goto drop; >+ >+ if ((vlan_dev->priv_flags & IFF_BONDING || >+ vlan_dev_real_dev(vlan_dev)->flags & IFF_MASTER) && >+ skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master))) > goto drop; > > skb->skb_iif = skb->dev->ifindex; > __vlan_hwaccel_put_tag(skb, vlan_tci); >- skb->dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK); >- >- if (!skb->dev) >- goto drop; >+ skb->dev = vlan_dev; > > for (p = napi->gro_list; p; p = p->next) { > NAPI_GRO_CB(p)->same_flow = >diff --git a/net/core/dev.c b/net/core/dev.c >index 100dcbd..9ea4550 100644 >--- a/net/core/dev.c >+++ b/net/core/dev.c >@@ -2780,6 +2780,7 @@ static int __netif_receive_skb(struct sk_buff *skb) > struct net_device *master; > struct net_device *null_or_orig; > struct net_device *null_or_bond; >+ struct net_device *real_dev; > int ret = NET_RX_DROP; > __be16 type; > >@@ -2853,9 +2854,13 @@ ncls: > * handler may have to adjust skb->dev and orig_dev. > */ > null_or_bond = NULL; >- if ((skb->dev->priv_flags & IFF_802_1Q_VLAN) && >- (vlan_dev_real_dev(skb->dev)->priv_flags & IFF_BONDING)) { >- null_or_bond = vlan_dev_real_dev(skb->dev); >+ if ((skb->dev->priv_flags & IFF_802_1Q_VLAN)) { >+ real_dev = vlan_dev_real_dev(skb->dev); >+ if (real_dev->priv_flags & IFF_BONDING) >+ null_or_bond = vlan_dev_real_dev(skb->dev); >+ if (!(skb->dev->priv_flags & IFF_BONDING) && >+ real_dev->priv_flags & IFF_SLAVE_INACTIVE) >+ null_or_orig = skb->dev; > } > > type = skb->protocol; Is this another way of accomplishing what I had suggested at the end of this message: http://marc.info/?l=linux-netdev&m=127111386702765&w=2 The patch part of my suggestion was as follows: >diff --git a/net/core/dev.c b/net/core/dev.c >index b98ddc6..cc665bb 100644 >--- a/net/core/dev.c >+++ b/net/core/dev.c >@@ -2735,7 +2735,7 @@ ncls: > &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) { > if (ptype->type == type && (ptype->dev == null_or_orig || > ptype->dev == skb->dev || ptype->dev == orig_dev || >- ptype->dev == null_or_bond)) { >+ (null_or_bond && (ptype->dev == null_or_bond))) { > if (pt_prev) > ret = deliver_skb(skb, pt_prev, orig_dev); > pt_prev = ptype; > > > I haven't tested this, but the theory is to only test against >null_or_bond if null_or_bond isn't NULL, which is only the case for VLAN >traffic over bonding. Chris Leech said "that should do it" but I don't recall seeing if it actually did so in practice. Or is your change meant to fix something else? -J --- -Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Receive issues with bonding and vlans 2010-05-03 18:25 ` Jay Vosburgh @ 2010-05-03 21:17 ` John Fastabend 2010-05-03 23:17 ` Jay Vosburgh 0 siblings, 1 reply; 12+ messages in thread From: John Fastabend @ 2010-05-03 21:17 UTC (permalink / raw) To: Jay Vosburgh Cc: Leech, Christopher, netdev@vger.kernel.org, Andy Gospodarek, Patrick McHardy, bonding-devel@lists.sourceforge.net Jay Vosburgh wrote: > John Fastabend <john.r.fastabend@intel.com> wrote: > >> Jay Vosburgh wrote: >>> Chris Leech <christopher.leech@intel.com> wrote: >>> >>>> On Mon, Apr 12, 2010 at 04:10:51PM -0700, Jay Vosburgh wrote: >>>>> Is the FCoE supposed to run over the inactive bonding slave? Or >>>>> am I misunderstanding what you're saying? I had thought the LLDP, et >>>>> al, special case in bonding was to permit, essentially, path discovery, >>>>> not necessarily active use of the inactive slave. >>>> That's what I'm trying to do, yes. Mostly because it's a setup that >>>> would work if you removed the FCoE traffic from the network data path, >>>> and only converged at the driver level and below. It's possible that >>>> the answer is "don't do that". >>> So, basically, you want the bond to act like usual for "regular" >>> ethernet traffic, but act like the slaves are independent from the bond >>> for the magic FCoE traffic, right? >>> >>> I'm not really sure if that's a "don't do that" or not. >>> >>>>> Not that this is necessarily bad; the "drop stuff on inactive >>>>> slaves" is really there for duplicate suppression, but it also can >>>>> uncover network topology issues, e.g., network layouts that won't work >>>>> if the devices fail, but appear to work during testing because the >>>>> "inactive" slave still receives traffic (it hasn't really failed). >>>>> >>>>>> The problem is that it doesn't work for hardware accelerated VLAN >>>>>> devices, because the VLAN receive paths have their own >>>>>> skb_bond_should_drop calls that were not updated. >>>>>> >>>>> >From what I can tell, VLAN receives always end up going through >>>>>> netif_receive_skb anyway, so skb_bond_should_drop gets called twice if >>>>>> the frame isn't dropped the first time. I think the bonding checks in >>>>>> __vlan_hwaccel_rx and vlan_gro_common should just be removed. >>>>> I'm not so sure. The checks in __vlan_hwaccel_rx are done with >>>>> the original receiving device in skb->dev; by the time the packet gets >>>>> to netif_receive_skb, the original slave the packet was received on has >>>>> been lost (and replaced with the VLAN device). Various things are >>>>> interested in that, in particular the "arp_validate" and the "inactive >>>>> slave drop" logic for bonding depend on knowing the real device the >>>>> packet arrived on. >>>>> >>>>> I note that the vlan accel logic doesn't change skb_iif to the >>>>> VLAN device; it remains as the original device. I suppose one >>>>> alternative would be to convert the bonding drop, et al, logic to use >>>>> skb_iif instead of skb->dev; if that works, then I think the VLAN core >>>>> would not need to call skb_bond_should_drop, which in turn would be a >>>>> bit more complicated as it would have to look up the dev from the >>>>> skb_iif. There's already some code in bonding that takes advantage of >>>>> this property of the VLANs, so maybe this is the way to go. >>>> Thanks, I'll take another look and see if I can come up with something >>>> better. >>> I looked at the skb_bond_should_drop stuff a bit more after I >>> wrote that; it's not as easy as I had suspected. The big sticking point >>> is that currently the test in netif_receive_skb (now __netif_receive_skb >>> in net-next-2.6) is on skb->dev->master to identify packets arriving on >>> slaves of bonding. The VLAN skb->dev has ->master set to NULL. Doing >>> that test against skb->skb_iif would be much more expensive, as it would >>> require a device lookup for every packet. >>> >>> So, I suspect that something has to happen in the VLAN >>> acceleration path, although I don't know exactly what. I don't know if >>> it would be possible to flag the packets in some special way to indicate >>> that they're "bonding slave" packets, or if it's better to keep the >>> current structure and just fix the calls somehow. >>> >>> -J >>> >>> >> Jay, >> >> It should be OK to allow packets to be received on the VLAN if it is not >> explicitly in the bond? > > Lemme see if I have this straight, because all of these special > cases are making my brain hurt. This one is for a configuration like this: > > bond0 ----- eth0 > / > vlan.xxx -/ > > I.e., a VLAN configured directly atop an ethernet device, said > ethernet also being a slave to bonding. Is that correct? > Yes, this is the correct scenario that we are considering. > Extrapolating from the ASCII art in a prior message in this > discussion, would this configuration really be something like this: > > vlan.xxx -\ > \ > bond0 ----- eth1 > bond0 ----- eth0 > / > vlan.xxx -/ > > I.e., two slaves to bonding, with VLAN xxx configured atop both > of the slaves? Or would the eth0 and eth1 use discrete VLAN ids? The > reason I ask is in regards to duplicate suppression. The whole reason > the "inactive" slave drops (most) incoming packets is to eliminate > duplicates when the switch floods traffic to both slave ports. > These vlan ids could be the same or discrete I think both configurations should be valid. > This is a bit tricky, because it's not really about broadcasts / > multicasts so much, but about traffic that the switch sends to all ports > because the switch's MAC address table isn't up to date with the > destination MAC of the traffic (which is a transient condition, so this > would only happen for perhaps one second or so). That would result in > duplicate unicast packets being received by the bond (back in the day > before bonding had the "drop inactive traffic" logic). > > So if the same VLAN is configured atop the two slaves, I wonder > if that will open a window for the duplicate unicast packet problem. OK, this does appear to open a window for duplicated unicast packets. By only allowing handlers with exact matches at least this issue is less obvious and we are assuming the packet handler can deal with this duplication. This seems to be the current assumption made. The same issue exists today for real device in the following setup, vlan --> bond0 --> eth Specifically for FCoE we use the san mac address so it wouldn't be an issue here. The expectation being that the switch will only ever use the correct san mac on the port. > > If the VLAN ids are different, then I'll assume this is some > kind of device mapper magic, doing the load balancing elsewhere. Correct device mapper handles load balancing and failover for both cases, when the vlan ids are different and when they are the same. > >> Or if we want to be more paranoid deliver packets only to handlers with >> exact matches for the device. For non vlan devices we deliver skb's to >> packet handlers that match exactly even on inactive slaves so doing this >> on vlan devices as well makes sense and shouldn't cause any unexpected >> problems. > > Yah, the whole concept of "inactive" slave is pretty mutated > now; it's kind of become the "active-backup with semi-manual load > balance for clever protocols, oh, and duplicate suppression" mode. > >> Also on a somewhat unrelated note I suspect null_or_orig and null_or_bond >> are not working as expected in __netif_receive_skb(). At least the >> comment 'deliver only exact match' could be inaccurate. > > I don't think this is unrelated at all. This code (the packet > to device lookup stuff in __netif_receive_skb) has been modified pretty > extensively lately for various bonding-related special cases, and I > think it is getting hard to follow. Whatever comments are there need to > be accurate, and, honestly, I think this code needs comments to explain > what exactly is supposed to happen for these special cases. > Agreed. This should be cleaned up and some explanations added. The current behavior in active-backup mode is receiving packets on the bonded real device in active mode fails but putting that same real device in an inactive state will cause it to receive packets. This is an inconsistency, which should probably be fixed by initializing null_or_bond to orig_dev. And also renaming it orig_or_bond at that point. >> Here's a patch to illustrate what I'm thinking compile tested only. If >> this sounds reasonable I'll work up an official patch. >> >> >> [PATCH] net: allow vlans on bonded real net_devices >> >> For converged I/O it is reasonable to use dm_multipathing to provice >> failover and load balancing for storage traffic and then use bonding >> for the LAN failover and load balancing. >> >> Currently this works if the multipathed devices are using the real >> net_device and those devices are enslaved to a bonded device what >> does not work is creating a vlan on the real device and trying to >> use it outside the bond for multipathing. >> >> This patch adds logic so that if the skb is destined for a vlan >> that is not in the bond the skb will not be dropped. >> >> Signed-off-by: John Fastabend <john.r.fastabend@intel.com> >> --- >> >> net/8021q/vlan_core.c | 31 +++++++++++++++++++++---------- >> net/core/dev.c | 11 ++++++++--- >> 2 files changed, 29 insertions(+), 13 deletions(-) >> >> diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c >> index c584a0a..3bce0c3 100644 >> --- a/net/8021q/vlan_core.c >> +++ b/net/8021q/vlan_core.c >> @@ -8,18 +8,24 @@ >> int __vlan_hwaccel_rx(struct sk_buff *skb, struct vlan_group *grp, >> u16 vlan_tci, int polling) >> { >> + struct net_device *vlan_dev; >> + >> if (netpoll_rx(skb)) >> return NET_RX_DROP; >> >> - if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master))) >> + vlan_dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK); >> + >> + if (!vlan_dev) >> + goto drop; >> + >> + if ((vlan_dev->priv_flags & IFF_BONDING || >> + vlan_dev_real_dev(vlan_dev)->flags & IFF_MASTER) && >> + skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master))) > > I'm not sure this will do the right thing if the VLAN device > itself is a slave to bonding, e.g., bond0 ---> vlan.xxx ---> eth0. In > that case, eth0's dev->master is NULL, and the vlan_dev (vlan.xxx's dev) > doesn't have IFF_MASTER (but does have IFF_SLAVE and IFF_BONDING, I > believe). > correct, vlan_dev does have IFF_BONDING and IFF_SLAVE here and doesn't have IFF_MASTER. > I think this will result in all incoming traffic being accepted > on such a configuration (leading to duplicates, as described above). > > I suspect, but have not tested, that something like this might > do what you're looking for: > > if ((vlan_dev->priv_flags & IFF_BONDING || > vlan_dev_real_dev(vlan_dev)->flags & (IFF_MASTER | IFF_SLAVE)) && > skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master))) > > I.e., if the VLAN device is either a MASTER (configured above > the bond) or a slave (configured below the bond) do the duplicate > suppresion. Here are the three basic cases I see, #1. vlanx --> bond0 --> ethx In this case vlanx does not have IFF_BONDING set and real_dev is ethx with IFF_SLAVE set. ethx has master dev->bond0 so this should work. And shows why we need the IFF_SLAVE bit as you pointed out and I dropped. #2. bond --> vlanx --> ethx This case is broke, skb->dev->master is NULL so we would never drop this pkt. As it exists today I suspect this is broken as well. #3 bond0 --> ethx vlanx --> -| Here is the case where adding the IFF_SLAVE bit doesn't work as I hoped. We don't want to run skb_bond_should_drop here. So I think there needs to be a bit of logic here to determine if we need to check skb_bond_should_drop with the vlan device or with the skb->dev->master. Something like might do: should_drop_dev = vlan_dev->master ? vlan_dev->master : skb->dev->master This should fix case #2 without breaking case #1. And the case I want to allow is still not resolved. I'll think about this some more maybe this logic can be fixed for all cases. > >> goto drop; >> >> skb->skb_iif = skb->dev->ifindex; >> __vlan_hwaccel_put_tag(skb, vlan_tci); >> - skb->dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK); >> - >> - if (!skb->dev) >> - goto drop; >> + skb->dev = vlan_dev; >> >> return (polling ? netif_receive_skb(skb) : netif_rx(skb)); >> >> @@ -82,16 +88,21 @@ vlan_gro_common(struct napi_struct *napi, struct >> vlan_group *grp, >> unsigned int vlan_tci, struct sk_buff *skb) >> { >> struct sk_buff *p; >> + struct net_device *vlan_dev; >> >> - if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master))) >> + vlan_dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK); >> + >> + if (!vlan_dev) >> + goto drop; >> + >> + if ((vlan_dev->priv_flags & IFF_BONDING || >> + vlan_dev_real_dev(vlan_dev)->flags & IFF_MASTER) && >> + skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master))) >> goto drop; >> >> skb->skb_iif = skb->dev->ifindex; >> __vlan_hwaccel_put_tag(skb, vlan_tci); >> - skb->dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK); >> - >> - if (!skb->dev) >> - goto drop; >> + skb->dev = vlan_dev; >> >> for (p = napi->gro_list; p; p = p->next) { >> NAPI_GRO_CB(p)->same_flow = >> diff --git a/net/core/dev.c b/net/core/dev.c >> index 100dcbd..9ea4550 100644 >> --- a/net/core/dev.c >> +++ b/net/core/dev.c >> @@ -2780,6 +2780,7 @@ static int __netif_receive_skb(struct sk_buff *skb) >> struct net_device *master; >> struct net_device *null_or_orig; >> struct net_device *null_or_bond; >> + struct net_device *real_dev; >> int ret = NET_RX_DROP; >> __be16 type; >> >> @@ -2853,9 +2854,13 @@ ncls: >> * handler may have to adjust skb->dev and orig_dev. >> */ >> null_or_bond = NULL; >> - if ((skb->dev->priv_flags & IFF_802_1Q_VLAN) && >> - (vlan_dev_real_dev(skb->dev)->priv_flags & IFF_BONDING)) { >> - null_or_bond = vlan_dev_real_dev(skb->dev); >> + if ((skb->dev->priv_flags & IFF_802_1Q_VLAN)) { >> + real_dev = vlan_dev_real_dev(skb->dev); >> + if (real_dev->priv_flags & IFF_BONDING) >> + null_or_bond = vlan_dev_real_dev(skb->dev); >> + if (!(skb->dev->priv_flags & IFF_BONDING) && >> + real_dev->priv_flags & IFF_SLAVE_INACTIVE) >> + null_or_orig = skb->dev; >> } >> >> type = skb->protocol; > > Is this another way of accomplishing what I had suggested at the > end of this message: > > http://marc.info/?l=linux-netdev&m=127111386702765&w=2 > > The patch part of my suggestion was as follows: > I think we need the code you suggested either way, or initialize null_or_bond to orig_dev as I suggested above. This logic was to deliver the skb only to exact matches for this case, bond0 ---> eth0 vlanx ---> -| Here vlanx is not in a bond and the real_dev is an inactive slave. I'll rethink this, but I believe only delivering this packet to handlers with exact matches is a good idea. At least it is consistent with the non vlan case. >> diff --git a/net/core/dev.c b/net/core/dev.c >> index b98ddc6..cc665bb 100644 >> --- a/net/core/dev.c >> +++ b/net/core/dev.c >> @@ -2735,7 +2735,7 @@ ncls: >> &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) { >> if (ptype->type == type && (ptype->dev == null_or_orig || >> ptype->dev == skb->dev || ptype->dev == orig_dev || >> - ptype->dev == null_or_bond)) { >> + (null_or_bond && (ptype->dev == null_or_bond))) { >> if (pt_prev) >> ret = deliver_skb(skb, pt_prev, orig_dev); >> pt_prev = ptype; >> >> >> I haven't tested this, but the theory is to only test against >> null_or_bond if null_or_bond isn't NULL, which is only the case for VLAN >> traffic over bonding. > > Chris Leech said "that should do it" but I don't recall seeing > if it actually did so in practice. > > Or is your change meant to fix something else? > The missing piece with just this bit of code is if its dropped in the vlan_gro_common or __vlan_hwaccel_rx it never gets to the netif_receive_skb path. Also null_or_bond wouldn't be set to the vlan dev that we want so I don't think this gets us there. At this point I think there are two bug fixes that need to be made, one to address null_or_bond and another to check the correct net_device in case #1 and #2 above. I'll try to put together another RFC patch series with all your feedback this evening. With good comments to hopefully explain what is going and at least make it clear where things will work and not work. Thanks for all the good feedback! John. > -J > > --- > -Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Receive issues with bonding and vlans 2010-05-03 21:17 ` John Fastabend @ 2010-05-03 23:17 ` Jay Vosburgh 2010-05-05 5:35 ` John Fastabend 0 siblings, 1 reply; 12+ messages in thread From: Jay Vosburgh @ 2010-05-03 23:17 UTC (permalink / raw) To: John Fastabend Cc: Leech, Christopher, netdev@vger.kernel.org, Andy Gospodarek, Patrick McHardy, bonding-devel@lists.sourceforge.net John Fastabend <john.r.fastabend@intel.com> wrote: >Jay Vosburgh wrote: >> John Fastabend <john.r.fastabend@intel.com> wrote: >> [...] >>> It should be OK to allow packets to be received on the VLAN if it is not >>> explicitly in the bond? >> >> Lemme see if I have this straight, because all of these special >> cases are making my brain hurt. This one is for a configuration like this: >> >> bond0 ----- eth0 >> / >> vlan.xxx -/ >> >> I.e., a VLAN configured directly atop an ethernet device, said >> ethernet also being a slave to bonding. Is that correct? >> > >Yes, this is the correct scenario that we are considering. > >> Extrapolating from the ASCII art in a prior message in this >> discussion, would this configuration really be something like this: >> >> vlan.xxx -\ >> \ >> bond0 ----- eth1 >> bond0 ----- eth0 >> / >> vlan.xxx -/ >> >> I.e., two slaves to bonding, with VLAN xxx configured atop both >> of the slaves? Or would the eth0 and eth1 use discrete VLAN ids? The >> reason I ask is in regards to duplicate suppression. The whole reason >> the "inactive" slave drops (most) incoming packets is to eliminate >> duplicates when the switch floods traffic to both slave ports. >> > >These vlan ids could be the same or discrete I think both configurations >should be valid. > >> This is a bit tricky, because it's not really about broadcasts / >> multicasts so much, but about traffic that the switch sends to all ports >> because the switch's MAC address table isn't up to date with the >> destination MAC of the traffic (which is a transient condition, so this >> would only happen for perhaps one second or so). That would result in >> duplicate unicast packets being received by the bond (back in the day >> before bonding had the "drop inactive traffic" logic). >> >> So if the same VLAN is configured atop the two slaves, I wonder >> if that will open a window for the duplicate unicast packet problem. > >OK, this does appear to open a window for duplicated unicast packets. By >only allowing handlers with exact matches at least this issue is less >obvious and we are assuming the packet handler can deal with this >duplication. This seems to be the current assumption made. The same issue >exists today for real device in the following setup, > >vlan --> bond0 --> eth I just tested this, and I'm not seeing duplicate packets using the test that used to show the problem before the "drop dups" logic went in (clear the switch's mac address-table, ping -c 25 -f [peer on VLAN], compare "packets transmitted" to "packets received"). That doesn't mean there isn't a gap in the logic somewhere, just that the original problem hasn't resurfaced (as far as I can tell). >Specifically for FCoE we use the san mac address so it wouldn't be an >issue here. The expectation being that the switch will only ever use the >correct san mac on the port. The issue arises when the switch does not have the destination MAC in its address table, and as such is transitory, and only occurs after sufficiently long periods of no traffic (or a manual flush of the table). The packets are sent to all ports until the MAC table updates (which seems to take place asynchronously), which is usually about 1 second or so (on the midrange Cisco gear I have). For example, with the switch's mac address table cleared, when starting a "ping -f" I can watch as first every port's light blinks, then all but two stop blinking. During the time that every port is blinking, the switch is sending all the packets to every port because the mac address table hasn't updated the switching logic (however that works under the covers). >> If the VLAN ids are different, then I'll assume this is some >> kind of device mapper magic, doing the load balancing elsewhere. > >Correct device mapper handles load balancing and failover for both cases, >when the vlan ids are different and when they are the same. > >> >>> Or if we want to be more paranoid deliver packets only to handlers with >>> exact matches for the device. For non vlan devices we deliver skb's to >>> packet handlers that match exactly even on inactive slaves so doing this >>> on vlan devices as well makes sense and shouldn't cause any unexpected >>> problems. >> >> Yah, the whole concept of "inactive" slave is pretty mutated >> now; it's kind of become the "active-backup with semi-manual load >> balance for clever protocols, oh, and duplicate suppression" mode. >> >>> Also on a somewhat unrelated note I suspect null_or_orig and null_or_bond >>> are not working as expected in __netif_receive_skb(). At least the >>> comment 'deliver only exact match' could be inaccurate. >> >> I don't think this is unrelated at all. This code (the packet >> to device lookup stuff in __netif_receive_skb) has been modified pretty >> extensively lately for various bonding-related special cases, and I >> think it is getting hard to follow. Whatever comments are there need to >> be accurate, and, honestly, I think this code needs comments to explain >> what exactly is supposed to happen for these special cases. >> > >Agreed. This should be cleaned up and some explanations added. The >current behavior in active-backup mode is receiving packets on the bonded >real device in active mode fails but putting that same real device in an >inactive state will cause it to receive packets. This is an >inconsistency, which should probably be fixed by initializing null_or_bond >to orig_dev. And also renaming it orig_or_bond at that point. > >>> Here's a patch to illustrate what I'm thinking compile tested only. If >>> this sounds reasonable I'll work up an official patch. >>> >>> >>> [PATCH] net: allow vlans on bonded real net_devices >>> >>> For converged I/O it is reasonable to use dm_multipathing to provice >>> failover and load balancing for storage traffic and then use bonding >>> for the LAN failover and load balancing. >>> >>> Currently this works if the multipathed devices are using the real >>> net_device and those devices are enslaved to a bonded device what >>> does not work is creating a vlan on the real device and trying to >>> use it outside the bond for multipathing. >>> >>> This patch adds logic so that if the skb is destined for a vlan >>> that is not in the bond the skb will not be dropped. >>> >>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com> >>> --- >>> >>> net/8021q/vlan_core.c | 31 +++++++++++++++++++++---------- >>> net/core/dev.c | 11 ++++++++--- >>> 2 files changed, 29 insertions(+), 13 deletions(-) >>> >>> diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c >>> index c584a0a..3bce0c3 100644 >>> --- a/net/8021q/vlan_core.c >>> +++ b/net/8021q/vlan_core.c >>> @@ -8,18 +8,24 @@ >>> int __vlan_hwaccel_rx(struct sk_buff *skb, struct vlan_group *grp, >>> u16 vlan_tci, int polling) >>> { >>> + struct net_device *vlan_dev; >>> + >>> if (netpoll_rx(skb)) >>> return NET_RX_DROP; >>> >>> - if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master))) >>> + vlan_dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK); >>> + >>> + if (!vlan_dev) >>> + goto drop; >>> + >>> + if ((vlan_dev->priv_flags & IFF_BONDING || >>> + vlan_dev_real_dev(vlan_dev)->flags & IFF_MASTER) && >>> + skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master))) >> >> I'm not sure this will do the right thing if the VLAN device >> itself is a slave to bonding, e.g., bond0 ---> vlan.xxx ---> eth0. In >> that case, eth0's dev->master is NULL, and the vlan_dev (vlan.xxx's dev) >> doesn't have IFF_MASTER (but does have IFF_SLAVE and IFF_BONDING, I >> believe). >> > >correct, vlan_dev does have IFF_BONDING and IFF_SLAVE here and doesn't >have IFF_MASTER. > > >> I think this will result in all incoming traffic being accepted >> on such a configuration (leading to duplicates, as described above). >> >> I suspect, but have not tested, that something like this might >> do what you're looking for: >> >> if ((vlan_dev->priv_flags & IFF_BONDING || >> vlan_dev_real_dev(vlan_dev)->flags & (IFF_MASTER | IFF_SLAVE)) && >> skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master))) >> >> I.e., if the VLAN device is either a MASTER (configured above >> the bond) or a slave (configured below the bond) do the duplicate >> suppresion. > >Here are the three basic cases I see, > >#1. vlanx --> bond0 --> ethx > >In this case vlanx does not have IFF_BONDING set and real_dev is ethx with >IFF_SLAVE set. ethx has master dev->bond0 so this should work. And shows >why we need the IFF_SLAVE bit as you pointed out and I dropped. > >#2. bond --> vlanx --> ethx > >This case is broke, skb->dev->master is NULL so we would never drop this >pkt. As it exists today I suspect this is broken as well. In the VLAN pass, yes, but the VLAN input path will call into netif_receive_skb, and at that point the skb->dev is the vlan device, and it has a dev->master. I haven't tested this lately, but I'm fairly sure this works. >#3 bond0 --> ethx > vlanx --> -| > >Here is the case where adding the IFF_SLAVE bit doesn't work as I >hoped. We don't want to run skb_bond_should_drop here. Yes, this is tricky because the VLAN device will copy the dev->flags from the device it's placed atop, so the VLAN will inherit the ethx's IFF_SLAVE flag. This happens regardless of the setup order (enslave ethX, then add VLAN, or vice versa). I suspect this case may be testable because the VLAN device has IFF_SLAVE, but has no dev->master. >So I think there needs to be a bit of logic here to determine if we need >to check skb_bond_should_drop with the vlan device or with the >skb->dev->master. Something like might do: > >should_drop_dev = vlan_dev->master ? vlan_dev->master : skb->dev->master > >This should fix case #2 without breaking case #1. And the case I want to >allow is still not resolved. I'll think about this some more maybe this >logic can be fixed for all cases. As I said above, I don't think case #2 is really broken now. > >> >>> goto drop; >>> >>> skb->skb_iif = skb->dev->ifindex; >>> __vlan_hwaccel_put_tag(skb, vlan_tci); >>> - skb->dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK); >>> - >>> - if (!skb->dev) >>> - goto drop; >>> + skb->dev = vlan_dev; >>> >>> return (polling ? netif_receive_skb(skb) : netif_rx(skb)); >>> >>> @@ -82,16 +88,21 @@ vlan_gro_common(struct napi_struct *napi, struct >>> vlan_group *grp, >>> unsigned int vlan_tci, struct sk_buff *skb) >>> { >>> struct sk_buff *p; >>> + struct net_device *vlan_dev; >>> >>> - if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master))) >>> + vlan_dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK); >>> + >>> + if (!vlan_dev) >>> + goto drop; >>> + >>> + if ((vlan_dev->priv_flags & IFF_BONDING || >>> + vlan_dev_real_dev(vlan_dev)->flags & IFF_MASTER) && >>> + skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master))) >>> goto drop; >>> >>> skb->skb_iif = skb->dev->ifindex; >>> __vlan_hwaccel_put_tag(skb, vlan_tci); >>> - skb->dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK); >>> - >>> - if (!skb->dev) >>> - goto drop; >>> + skb->dev = vlan_dev; >>> >>> for (p = napi->gro_list; p; p = p->next) { >>> NAPI_GRO_CB(p)->same_flow = >>> diff --git a/net/core/dev.c b/net/core/dev.c >>> index 100dcbd..9ea4550 100644 >>> --- a/net/core/dev.c >>> +++ b/net/core/dev.c >>> @@ -2780,6 +2780,7 @@ static int __netif_receive_skb(struct sk_buff *skb) >>> struct net_device *master; >>> struct net_device *null_or_orig; >>> struct net_device *null_or_bond; >>> + struct net_device *real_dev; >>> int ret = NET_RX_DROP; >>> __be16 type; >>> >>> @@ -2853,9 +2854,13 @@ ncls: >>> * handler may have to adjust skb->dev and orig_dev. >>> */ >>> null_or_bond = NULL; >>> - if ((skb->dev->priv_flags & IFF_802_1Q_VLAN) && >>> - (vlan_dev_real_dev(skb->dev)->priv_flags & IFF_BONDING)) { >>> - null_or_bond = vlan_dev_real_dev(skb->dev); >>> + if ((skb->dev->priv_flags & IFF_802_1Q_VLAN)) { >>> + real_dev = vlan_dev_real_dev(skb->dev); >>> + if (real_dev->priv_flags & IFF_BONDING) >>> + null_or_bond = vlan_dev_real_dev(skb->dev); >>> + if (!(skb->dev->priv_flags & IFF_BONDING) && >>> + real_dev->priv_flags & IFF_SLAVE_INACTIVE) >>> + null_or_orig = skb->dev; >>> } >>> >>> type = skb->protocol; >> >> Is this another way of accomplishing what I had suggested at the >> end of this message: >> >> http://marc.info/?l=linux-netdev&m=127111386702765&w=2 >> >> The patch part of my suggestion was as follows: >> > >I think we need the code you suggested either way, or initialize >null_or_bond to orig_dev as I suggested above. > >This logic was to deliver the skb only to exact matches for this case, > >bond0 ---> eth0 >vlanx ---> -| > >Here vlanx is not in a bond and the real_dev is an inactive slave. I'll >rethink this, but I believe only delivering this packet to handlers with >exact matches is a good idea. At least it is consistent with the non vlan >case. Yes, the intent is that packets arriving on the inactive slaves should only be delivered to destinations that have explicitly asked for packets on the "inactive" devices, i.e., have bound directly to the device, and are prepared to deal with the duplicates they may receive (in the global sense, if a particular module has bound to multiple bonding slaves). Delivering to wildcards I suspect would permit duplicates to sneak through to unexpecting code. >>> diff --git a/net/core/dev.c b/net/core/dev.c >>> index b98ddc6..cc665bb 100644 >>> --- a/net/core/dev.c >>> +++ b/net/core/dev.c >>> @@ -2735,7 +2735,7 @@ ncls: >>> &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) { >>> if (ptype->type == type && (ptype->dev == null_or_orig || >>> ptype->dev == skb->dev || ptype->dev == orig_dev || >>> - ptype->dev == null_or_bond)) { >>> + (null_or_bond && (ptype->dev == null_or_bond))) { >>> if (pt_prev) >>> ret = deliver_skb(skb, pt_prev, orig_dev); >>> pt_prev = ptype; >>> >>> >>> I haven't tested this, but the theory is to only test against >>> null_or_bond if null_or_bond isn't NULL, which is only the case for VLAN >>> traffic over bonding. >> >> Chris Leech said "that should do it" but I don't recall seeing >> if it actually did so in practice. >> >> Or is your change meant to fix something else? >> > >The missing piece with just this bit of code is if its dropped in the >vlan_gro_common or __vlan_hwaccel_rx it never gets to the >netif_receive_skb path. By this I presume you mean that a drop in the VLAN code won't ever have the opportunity to be evaluated against null_or_bond or null_or_orig. >Also null_or_bond wouldn't be set to the vlan dev that we want so I don't >think this gets us there. > >At this point I think there are two bug fixes that need to be made, one to >address null_or_bond and another to check the correct net_device in case >#1 and #2 above. > >I'll try to put together another RFC patch series with all your feedback >this evening. With good comments to hopefully explain what is going and >at least make it clear where things will work and not work. Thanks for >all the good feedback! Hopefully this will be the last futzing around with this, and won't make it too complicated. -J --- -Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Receive issues with bonding and vlans 2010-05-03 23:17 ` Jay Vosburgh @ 2010-05-05 5:35 ` John Fastabend 2010-05-06 17:42 ` Jay Vosburgh 0 siblings, 1 reply; 12+ messages in thread From: John Fastabend @ 2010-05-05 5:35 UTC (permalink / raw) To: Jay Vosburgh Cc: Leech, Christopher, netdev@vger.kernel.org, Andy Gospodarek, Patrick McHardy, bonding-devel@lists.sourceforge.net Jay Vosburgh wrote: > John Fastabend <john.r.fastabend@intel.com> wrote: > >> Jay Vosburgh wrote: >>> John Fastabend <john.r.fastabend@intel.com> wrote: >>> > [...] >>>> It should be OK to allow packets to be received on the VLAN if it is not >>>> explicitly in the bond? >>> Lemme see if I have this straight, because all of these special >>> cases are making my brain hurt. This one is for a configuration like this: >>> >>> bond0 ----- eth0 >>> / >>> vlan.xxx -/ >>> >>> I.e., a VLAN configured directly atop an ethernet device, said >>> ethernet also being a slave to bonding. Is that correct? >>> >> Yes, this is the correct scenario that we are considering. >> >>> Extrapolating from the ASCII art in a prior message in this >>> discussion, would this configuration really be something like this: >>> >>> vlan.xxx -\ >>> \ >>> bond0 ----- eth1 >>> bond0 ----- eth0 >>> / >>> vlan.xxx -/ >>> >>> I.e., two slaves to bonding, with VLAN xxx configured atop both >>> of the slaves? Or would the eth0 and eth1 use discrete VLAN ids? The >>> reason I ask is in regards to duplicate suppression. The whole reason >>> the "inactive" slave drops (most) incoming packets is to eliminate >>> duplicates when the switch floods traffic to both slave ports. >>> >> These vlan ids could be the same or discrete I think both configurations >> should be valid. >> >>> This is a bit tricky, because it's not really about broadcasts / >>> multicasts so much, but about traffic that the switch sends to all ports >>> because the switch's MAC address table isn't up to date with the >>> destination MAC of the traffic (which is a transient condition, so this >>> would only happen for perhaps one second or so). That would result in >>> duplicate unicast packets being received by the bond (back in the day >>> before bonding had the "drop inactive traffic" logic). >>> >>> So if the same VLAN is configured atop the two slaves, I wonder >>> if that will open a window for the duplicate unicast packet problem. >> OK, this does appear to open a window for duplicated unicast packets. By >> only allowing handlers with exact matches at least this issue is less >> obvious and we are assuming the packet handler can deal with this >> duplication. This seems to be the current assumption made. The same issue >> exists today for real device in the following setup, >> >> vlan --> bond0 --> eth > > I just tested this, and I'm not seeing duplicate packets using > the test that used to show the problem before the "drop dups" logic went > in (clear the switch's mac address-table, ping -c 25 -f [peer on VLAN], > compare "packets transmitted" to "packets received"). > > That doesn't mean there isn't a gap in the logic somewhere, just > that the original problem hasn't resurfaced (as far as I can tell). > >> Specifically for FCoE we use the san mac address so it wouldn't be an >> issue here. The expectation being that the switch will only ever use the >> correct san mac on the port. > > The issue arises when the switch does not have the destination > MAC in its address table, and as such is transitory, and only occurs > after sufficiently long periods of no traffic (or a manual flush of the > table). The packets are sent to all ports until the MAC table updates > (which seems to take place asynchronously), which is usually about 1 > second or so (on the midrange Cisco gear I have). > > For example, with the switch's mac address table cleared, when > starting a "ping -f" I can watch as first every port's light blinks, > then all but two stop blinking. During the time that every port is > blinking, the switch is sending all the packets to every port because > the mac address table hasn't updated the switching logic (however that > works under the covers). > > > >>> If the VLAN ids are different, then I'll assume this is some >>> kind of device mapper magic, doing the load balancing elsewhere. >> Correct device mapper handles load balancing and failover for both cases, >> when the vlan ids are different and when they are the same. >> >>>> Or if we want to be more paranoid deliver packets only to handlers with >>>> exact matches for the device. For non vlan devices we deliver skb's to >>>> packet handlers that match exactly even on inactive slaves so doing this >>>> on vlan devices as well makes sense and shouldn't cause any unexpected >>>> problems. >>> Yah, the whole concept of "inactive" slave is pretty mutated >>> now; it's kind of become the "active-backup with semi-manual load >>> balance for clever protocols, oh, and duplicate suppression" mode. >>> >>>> Also on a somewhat unrelated note I suspect null_or_orig and null_or_bond >>>> are not working as expected in __netif_receive_skb(). At least the >>>> comment 'deliver only exact match' could be inaccurate. >>> I don't think this is unrelated at all. This code (the packet >>> to device lookup stuff in __netif_receive_skb) has been modified pretty >>> extensively lately for various bonding-related special cases, and I >>> think it is getting hard to follow. Whatever comments are there need to >>> be accurate, and, honestly, I think this code needs comments to explain >>> what exactly is supposed to happen for these special cases. >>> >> Agreed. This should be cleaned up and some explanations added. The >> current behavior in active-backup mode is receiving packets on the bonded >> real device in active mode fails but putting that same real device in an >> inactive state will cause it to receive packets. This is an >> inconsistency, which should probably be fixed by initializing null_or_bond >> to orig_dev. And also renaming it orig_or_bond at that point. >> >>>> Here's a patch to illustrate what I'm thinking compile tested only. If >>>> this sounds reasonable I'll work up an official patch. >>>> >>>> >>>> [PATCH] net: allow vlans on bonded real net_devices >>>> >>>> For converged I/O it is reasonable to use dm_multipathing to provice >>>> failover and load balancing for storage traffic and then use bonding >>>> for the LAN failover and load balancing. >>>> >>>> Currently this works if the multipathed devices are using the real >>>> net_device and those devices are enslaved to a bonded device what >>>> does not work is creating a vlan on the real device and trying to >>>> use it outside the bond for multipathing. >>>> >>>> This patch adds logic so that if the skb is destined for a vlan >>>> that is not in the bond the skb will not be dropped. >>>> >>>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com> >>>> --- >>>> >>>> net/8021q/vlan_core.c | 31 +++++++++++++++++++++---------- >>>> net/core/dev.c | 11 ++++++++--- >>>> 2 files changed, 29 insertions(+), 13 deletions(-) >>>> >>>> diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c >>>> index c584a0a..3bce0c3 100644 >>>> --- a/net/8021q/vlan_core.c >>>> +++ b/net/8021q/vlan_core.c >>>> @@ -8,18 +8,24 @@ >>>> int __vlan_hwaccel_rx(struct sk_buff *skb, struct vlan_group *grp, >>>> u16 vlan_tci, int polling) >>>> { >>>> + struct net_device *vlan_dev; >>>> + >>>> if (netpoll_rx(skb)) >>>> return NET_RX_DROP; >>>> >>>> - if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master))) >>>> + vlan_dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK); >>>> + >>>> + if (!vlan_dev) >>>> + goto drop; >>>> + >>>> + if ((vlan_dev->priv_flags & IFF_BONDING || >>>> + vlan_dev_real_dev(vlan_dev)->flags & IFF_MASTER) && >>>> + skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master))) >>> I'm not sure this will do the right thing if the VLAN device >>> itself is a slave to bonding, e.g., bond0 ---> vlan.xxx ---> eth0. In >>> that case, eth0's dev->master is NULL, and the vlan_dev (vlan.xxx's dev) >>> doesn't have IFF_MASTER (but does have IFF_SLAVE and IFF_BONDING, I >>> believe). >>> >> correct, vlan_dev does have IFF_BONDING and IFF_SLAVE here and doesn't >> have IFF_MASTER. >> >> >>> I think this will result in all incoming traffic being accepted >>> on such a configuration (leading to duplicates, as described above). >>> >>> I suspect, but have not tested, that something like this might >>> do what you're looking for: >>> >>> if ((vlan_dev->priv_flags & IFF_BONDING || >>> vlan_dev_real_dev(vlan_dev)->flags & (IFF_MASTER | IFF_SLAVE)) && >>> skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master))) >>> >>> I.e., if the VLAN device is either a MASTER (configured above >>> the bond) or a slave (configured below the bond) do the duplicate >>> suppresion. >> Here are the three basic cases I see, >> >> #1. vlanx --> bond0 --> ethx >> >> In this case vlanx does not have IFF_BONDING set and real_dev is ethx with >> IFF_SLAVE set. ethx has master dev->bond0 so this should work. And shows >> why we need the IFF_SLAVE bit as you pointed out and I dropped. >> >> #2. bond --> vlanx --> ethx >> >> This case is broke, skb->dev->master is NULL so we would never drop this >> pkt. As it exists today I suspect this is broken as well. > > In the VLAN pass, yes, but the VLAN input path will call into > netif_receive_skb, and at that point the skb->dev is the vlan device, > and it has a dev->master. I haven't tested this lately, but I'm fairly > sure this works. > OK, these both seem to work as expected my test was invalid. >> #3 bond0 --> ethx >> vlanx --> -| >> >> Here is the case where adding the IFF_SLAVE bit doesn't work as I >> hoped. We don't want to run skb_bond_should_drop here. > > Yes, this is tricky because the VLAN device will copy the > dev->flags from the device it's placed atop, so the VLAN will inherit > the ethx's IFF_SLAVE flag. This happens regardless of the setup order > (enslave ethX, then add VLAN, or vice versa). > This doesn't appear to be true, adding a VLAN on ethx then enslave ethx doesn't set the IFF_SLAVE flag on the VLAN. Unless I am missing something. > I suspect this case may be testable because the VLAN device has > IFF_SLAVE, but has no dev->master. > >> So I think there needs to be a bit of logic here to determine if we need >> to check skb_bond_should_drop with the vlan device or with the >> skb->dev->master. Something like might do: >> >> should_drop_dev = vlan_dev->master ? vlan_dev->master : skb->dev->master >> >> This should fix case #2 without breaking case #1. And the case I want to >> allow is still not resolved. I'll think about this some more maybe this >> logic can be fixed for all cases. > > As I said above, I don't think case #2 is really broken now. > Seems to be working sorry for the noise. <<snip>> > > Hopefully this will be the last futzing around with this, and > won't make it too complicated. > I currently believe the cleanest way to implement this is to add a pkt_type flag PACKET_DROP to mark skbs that have been received on the inactive slave. I sent out a functional RFC I would like to run a few more tests on it, but otherwise I think it ok. Thanks, John. > -J > > --- > -Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Receive issues with bonding and vlans 2010-05-05 5:35 ` John Fastabend @ 2010-05-06 17:42 ` Jay Vosburgh 0 siblings, 0 replies; 12+ messages in thread From: Jay Vosburgh @ 2010-05-06 17:42 UTC (permalink / raw) To: John Fastabend Cc: Leech, Christopher, netdev@vger.kernel.org, Andy Gospodarek, Patrick McHardy, bonding-devel@lists.sourceforge.net John Fastabend <john.r.fastabend@intel.com> wrote: >Jay Vosburgh wrote: >> John Fastabend <john.r.fastabend@intel.com> wrote: [...] >>> #3 bond0 --> ethx >>> vlanx --> -| >>> >>> Here is the case where adding the IFF_SLAVE bit doesn't work as I >>> hoped. We don't want to run skb_bond_should_drop here. >> >> Yes, this is tricky because the VLAN device will copy the >> dev->flags from the device it's placed atop, so the VLAN will inherit >> the ethx's IFF_SLAVE flag. This happens regardless of the setup order >> (enslave ethX, then add VLAN, or vice versa). >> > >This doesn't appear to be true, adding a VLAN on ethx then enslave ethx >doesn't set the IFF_SLAVE flag on the VLAN. Unless I am missing >something. I tried this again, and yes, the vlan device inherits the flags of the device at the time the vlan is added. I think I was confused because the vlan device doesn't lose IFF_SLAVE if the underlying ethX is taken out of the bond. I suspect both of these behaviors are because netdev_set_master doesn't do a notifier call (just an rtmsg_ifinfo) when it changes dev->flags outside of dev_set_flags. I don't think the vlan device should pick up IFF_SLAVE, though, when the vlan device itself is not a slave, so that part seems correct. -J --- -Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2010-05-06 17:42 UTC | newest] Thread overview: 12+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2010-04-12 22:17 Receive issues with bonding and vlans Chris Leech 2010-04-12 22:17 ` [PATCH] vlan: remove receive checks for bonding Chris Leech 2010-04-12 23:19 ` Jay Vosburgh 2010-04-12 23:10 ` Receive issues with bonding and vlans Jay Vosburgh 2010-04-12 23:35 ` Chris Leech 2010-04-13 0:08 ` Jay Vosburgh 2010-05-03 3:04 ` John Fastabend 2010-05-03 18:25 ` Jay Vosburgh 2010-05-03 21:17 ` John Fastabend 2010-05-03 23:17 ` Jay Vosburgh 2010-05-05 5:35 ` John Fastabend 2010-05-06 17:42 ` Jay Vosburgh
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).