Netdev List


* RE: Bonding
From: Gustavo Pimentel @ 2014-02-11 17:15 UTC (permalink / raw)
  To: Veaceslav Falico; +Cc: netdev@vger.kernel.org
In-Reply-To: <20140211141549.GB29570@redhat.com>

Hi Veaceslav,

It's quite different from broadcast mode: each frame sent through the slaves has a Redundancy Control Trailer (RCT) attached. This trailer is composed of a LAN identifier, a sequence number, an LSDU size and a PRP suffix.
Equipment with PRP capability also has to periodically send a supervision frame to both LANs. Each device on the network has to keep track of the sequence numbers it receives. If it receives a sequence number from a specific device, for instance on LAN A, and that number is not in its internal table, the device should accept the frame and update the table. When it later receives the same sequence number from LAN B, the device should discard the frame, which provides zero-downtime redundancy.
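
A minimal userspace sketch of the accept/discard rule described above, assuming a small fixed-size table keyed by source MAC address and sequence number. The names prp_table and prp_accept and the table size are hypothetical; this is not the bonding patch itself, and a real PRP node would also age entries out and handle sequence-number wraparound.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PRP_TABLE_SIZE 64	/* hypothetical size, chosen for the sketch */

struct prp_entry {
	bool     used;
	uint8_t  src[6];	/* source MAC of the sending node */
	uint16_t seq;		/* sequence number taken from the RCT */
};

struct prp_table {
	struct prp_entry slot[PRP_TABLE_SIZE];
	unsigned int     next;	/* round-robin replacement index */
};

/* Return true if the frame should be accepted, false if it is the
 * duplicate that already arrived through the other LAN. */
static bool prp_accept(struct prp_table *t, const uint8_t src[6], uint16_t seq)
{
	unsigned int i;
	struct prp_entry *e;

	for (i = 0; i < PRP_TABLE_SIZE; i++) {
		e = &t->slot[i];
		if (e->used && e->seq == seq && memcmp(e->src, src, 6) == 0)
			return false;	/* already seen: discard */
	}

	/* First copy: remember it, overwriting the oldest slot. */
	e = &t->slot[t->next];
	t->next = (t->next + 1) % PRP_TABLE_SIZE;
	e->used = true;
	memcpy(e->src, src, 6);
	e->seq = seq;
	return true;
}
```

The receiver calls prp_accept() for every frame regardless of which LAN it arrived on; the LAN identifier in the RCT is not needed for the lookup in this simplified model.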

I can supply you with information about this redundancy protocol, if you like. This type of network redundancy is now being widely deployed in electrical power stations (thermal and hydro, for example) and transmission power stations, instead of teaming / bonding, which depends on RSTP for redundancy.

> -----Original Message-----
> From: Veaceslav Falico [mailto:vfalico@redhat.com]
> Sent: Tuesday, 11 February 2014 14:16
> To: Gustavo Pimentel
> Cc: netdev@vger.kernel.org
> Subject: Re: Bonding
> 
> On Tue, Feb 11, 2014 at 01:53:32PM +0000, Gustavo Pimentel wrote:
> >Hi,
> 
> Hi Gustavo,
> 
> >
> >I'm writing to you because I have implemented a new mode (PRP, Parallel
> Redundancy Protocol) for the bonding kernel driver. This new mode is quite simple. I
> don't know if you have heard about PRP, but it's a new standard that makes it possible to
> overcome any single network failure without affecting data transmission. The
> general idea is to have two separate, very similar LANs (A & B),
> transmit almost the same frame through both LANs, and have the end device
> accept one frame and discard the other according to a known mechanism.
> 
> Isn't that the current 'broadcast' mode, where every packet is transmitted over all
> the slaves? After quick googling/reading I don't see any difference there, though I
> might have missed something.
> 
> >
> >I have implemented this new mode in the bonding driver, but I have some
> difficulties:
> >. Writing Linux drivers is quite new for me. I don't know if guidelines exist for
> driver coding.
> 
> You can find everything under Documentation/, but without the code I can't tell you
> exact documents. CodingStyle and SubmittingPatches might be the first ones.
> 
> Also, try CC-ing relevant people for more feedback, specifically bonding
> maintainers.
> 
> >. I don't know how to submit the code to be included in the kernel repository.
> >. Maybe another pair of eyes could help to improve the code written for this
> mode.
> 
> Try sending an RFC when net-next opens.
> 
> >
> >I think my driver code is 99% complete. I'm currently testing with 3 pieces of equipment (1
> pc + 1 embedded device, both running my modified bonding driver) and a third-party
> device called RedBox.
> >
> >Would you be interested in participating / helping this project?
> >
> >With my best regards,
> >
> >Gustavo Gama da Rocha Pimentel
> >Power Systems Automation / Innovation & Development Efacec Engenharia e
> >Sistemas, S.A.
> >Phone: +351229403391
> >Disclaimer
> >
> >
> >--
> >To unsubscribe from this list: send the line "unsubscribe netdev" in
> >the body of a message to majordomo@vger.kernel.org More majordomo info
> >at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: linux 3.13: problems with isatap tunnel device and UFO
From: Wolfgang Walter @ 2014-02-11 17:42 UTC (permalink / raw)
  To: netdev; +Cc: Hannes Frederic Sowa
In-Reply-To: <20140211024403.GE11150@order.stressinduktion.org>

On Tuesday, 11 February 2014, 03:44:03, Hannes Frederic Sowa wrote:
> On Sun, Feb 09, 2014 at 12:17:15AM +0100, Wolfgang Walter wrote:
> > host A (which shows the problem with kernel 3.13):
> > 
> > $ ip addr ls eth0
> > 4: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state
> > UP group default qlen 1000
> > 
> >     link/ether 11:22:33:44:55:66 brd ff:ff:ff:ff:ff:ff
> >     inet 192.168.1.1/24 brd 192.168.1.255 scope global eth0
> >        valid_lft forever preferred_lft forever
> >     inet6 2001:1111:2222:aaaa:0:5efe:c0a8:101/120 scope global
> >        valid_lft forever preferred_lft forever
> >     inet6 fe80::1322:33ff:fe44:5566/64 scope link
> >        valid_lft forever preferred_lft forever
> 
> What driver does this interface use?
> 
> ethtool -i eth0
> 

driver: r8169
version: 2.3LK-NAPI
firmware-version: rtl_nic/rtl8168e-1.fw
bus-info: 0000:05:00.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: yes
supports-priv-flags: no

I also see this on another machine; there the output is:

driver: forcedeth
version: 0.64
firmware-version: 
bus-info: 0000:00:0a.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: yes
supports-priv-flags: no


Regards,
-- 
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts

^ permalink raw reply

* [PATCH 0/3] net: Generalizations for GRE,GSO,GRO
From: Tom Herbert @ 2014-02-11 17:43 UTC (permalink / raw)
  To: davem, netdev; +Cc: ogerlitz

This patch set contains some preliminary patches for Generic UDP
Encapsulation support. These generalize some uses of GRE, GSO, and
GRO.

^ permalink raw reply

* [PATCH 2/3] net: UDP gro_receive accept csum=0
From: Tom Herbert @ 2014-02-11 17:43 UTC (permalink / raw)
  To: davem, netdev; +Cc: ogerlitz

The checksum validation code in UDP gro_receive explicitly checks
that the driver has set CHECKSUM_COMPLETE. As a result, GRO is not
performed on UDP packets with a checksum of zero (no checksum needed).
This patch adds a condition that allows the UDP checksum to be zero.

Signed-off-by: Tom Herbert <therbert@google.com>
---
 net/ipv4/udp_offload.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 25f5cee..4db7796 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -156,13 +156,9 @@ static struct sk_buff **udp_gro_receive(struct sk_buff **head, struct sk_buff *s
 	unsigned int hlen, off;
 	int flush = 1;
 
-	if (NAPI_GRO_CB(skb)->udp_mark ||
-	    (!skb->encapsulation && skb->ip_summed != CHECKSUM_COMPLETE))
+	if (NAPI_GRO_CB(skb)->udp_mark)
 		goto out;
 
-	/* mark that this skb passed once through the udp gro layer */
-	NAPI_GRO_CB(skb)->udp_mark = 1;
-
 	off  = skb_gro_offset(skb);
 	hlen = off + sizeof(*uh);
 	uh   = skb_gro_header_fast(skb, off);
@@ -172,6 +168,13 @@ static struct sk_buff **udp_gro_receive(struct sk_buff **head, struct sk_buff *s
 			goto out;
 	}
 
+	if (!skb->encapsulation &&
+	    skb->ip_summed != CHECKSUM_COMPLETE && uh->check != 0)
+		goto out;
+
+	/* mark that this skb passed once through the udp gro layer */
+	NAPI_GRO_CB(skb)->udp_mark = 1;
+
 	rcu_read_lock();
 	uo_priv = rcu_dereference(udp_offload_base);
 	for (; uo_priv != NULL; uo_priv = rcu_dereference(uo_priv->next)) {
-- 
1.9.0.rc1.175.g0b1dcb5

^ permalink raw reply related

* Re: [PATCH V3] net/dt: Add support for overriding phy configuration from device tree
From: Florian Fainelli @ 2014-02-11 17:43 UTC (permalink / raw)
  To: Gerlando Falauto
  Cc: Matthew Garrett, netdev,
	devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Kishon Vijay Abraham I
In-Reply-To: <52F9E8E6.1090006-SkAbAL50j+5BDgjK7y7TUQ@public.gmane.org>

Hi Gerlando,

2014-02-11 1:09 GMT-08:00 Gerlando Falauto <gerlando.falauto-SkAbAL50j+6IwRZHo2/mJg@public.gmane.org>:
> Hi Florian,
>
> first of all, thank you for your answer.
>
>
> On 02/10/2014 06:09 PM, Florian Fainelli wrote:
>>
>> Hi Gerlando,
>>
>> On Monday, 10 February 2014, 17:14:59, Gerlando Falauto wrote:
>>>
>>> Hi,
>>>
>>> I'm currently trying to fix an issue for which this patch provides a
>>> partial solution, so apologies in advance for jumping into the
>>> discussion for my own purposes...
>>>
>>> On 02/04/2014 09:39 PM, Florian Fainelli wrote:
>>>   > 2014-01-17 Matthew Garrett <matthew.garrett-05XSO3Yj/JvQT0dZR+AlfA@public.gmane.org>:
>>>   >> Some hardware may be broken in interesting and board-specific ways, such
>>>   >> that various bits of functionality don't work. This patch provides a
>>>   >> mechanism for overriding mii registers during init based on the contents of
>>>   >> the device tree data, allowing board-specific fixups without having to
>>>   >> pollute generic code.
>>>   >
>>>   > It would be good to explain exactly how your hardware is broken.
>>>   > I really do not think that such a fine-grained setting where
>>>   > you could disable, e.g: 100BaseT_Full, but allow 100BaseT_Half to
>>>   > remain usable makes that much sense. In general, Gigabit might be
>>>   > badly broken, but 100 and 10Mbits/sec should work fine. How about the
>>>   > MASTER-SLAVE bit, is overriding it really required?
>>>   >
>>>   > Is not a PHY fixup registered for a specific OUI the solution you are
>>>   > looking for? I am also concerned that this makes PHY troubleshooting
>>>   > issues much harder to debug than before as we may have no idea about
>>>   > how much information has been put in Device Tree to override that.
>>>   >
>>>   > Finally, how about making this more general just like the BCM87xx PHY
>>>   > driver, which is supplied value/reg pairs directly? There are 16
>>>   > common MII registers, and 16 others for vendor specific registers,
>>>   > this is just covering for about 2% of the possible changes.
>>>
>>> Good point. That would easily help me with my current issue, which
>>> requires autoneg to be disabled to begin with (by clearing BMCR_ANENABLE
>>> from register 0).
>>
>>
>> Is there a point in time (e.g: after some specific initial configuration
>> has been made) where BMCR_ANENABLE can be used?
>
>
> What do you mean? In my case, for some HW-related reason (due to the PHY
> counterpart I guess) autoneg needs to be disabled.
> This is currently done by the bootloader code (which clears the bit).
> What I'm looking for is some way for the kernel to either reinforce this
> setting, or just take that into account and skip autoneg.
> On top of that, there's a HW erratum for that particular PHY, which
> requires certain operations to be performed on the PHY as a workaround *WHEN
> AUTONEG IS DISABLED*. That I'd implement in a PHY-specific driver.

Ok.

>
>
>>> This would not however fix it entirely (I tried a quick hardwired
>>> implementation), as the whole PHY machinery would not take that into
>>> account and would re-enable autoneg anyway.
>>> I also tried changing the patch so that phydev->supported gets updated
>>
>>
>> There are multiple things that you could try doing here:
>>
>> - override the PHY state machine in your read_status callback to make sure
>> that you always set phydev->autoneg to AUTONEG_ENABLE
>
>
> [you mean AUTONEG_DISABLE, right?]

Right, I fat fingered here.

> Uhm, but I don't want to implement a driver for that PHY that always
> disables autoneg. I only want to disable autoneg for that particular board.
> I figure I might register a fixup for that board, but that kind of makes
> everything more complicated and less clear. Plus, what should be the
> criterion to determine whether we're running on that particular hardware?

of_machine_is_compatible() plus reading the specific PHY OUI should
provide you with a unique machine + PHY tuple, in case your machine
name alone is too generic.
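
A rough userspace model of that machine + PHY match, assuming the usual 32-bit PHY ID layout in which the upper 22 bits carry the OUI-derived part and the lower bits the model and revision. The struct, the function name, the compatible string and the ID values are all hypothetical; in the kernel the machine check would be of_machine_is_compatible() and the ID match is what a fixup registered with phy_register_fixup_for_uid() does for you.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Upper 22 bits of the PHY ID: OUI-derived part.
 * Lower 10 bits: model (6 bits) and revision (4 bits). */
#define MODEL_PHY_OUI_MASK 0xfffffc00u

/* Hypothetical description of one board-specific quirk. */
struct board_phy_quirk {
	const char *machine;		/* board compatible string */
	uint32_t    phy_uid;		/* PHY ID to match */
	uint32_t    phy_uid_mask;	/* which ID bits are significant */
};

/* True only when both the board and the PHY vendor match. */
static bool quirk_applies(const struct board_phy_quirk *q,
			  const char *machine, uint32_t phy_id)
{
	if (strcmp(q->machine, machine) != 0)
		return false;
	return (phy_id & q->phy_uid_mask) == (q->phy_uid & q->phy_uid_mask);
}
```

Matching on the OUI bits only keeps the quirk independent of PHY model and revision; a narrower mask such as 0xfffffff0 would tie it to one model.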

>
>
>> - clear the SUPPORTED_Autoneg bits from phydev->supported right after PHY
>> registration and before the call to phy_start()
>
>
> I actually tried clearing it by tweaking the patch on this thread, but the
> end result is that it has no effect (see further comments
> below). The only thing that seems to play a role here is explicitly setting
> phydev->autoneg = AUTONEG_DISABLE.
>
>
>> - set the PHY_HAS_MAGICANEG bit in your PHY driver flag
>
>
> Again, this seems to play no role whatsoever here:
>
>                         } else if (0 == phydev->link_timeout--) {
>                                 needs_aneg = 1;
>                                 /* If we have the magic_aneg bit,
>                                  * we try again */
>                                 if (phydev->drv->flags & PHY_HAS_MAGICANEG)
>                                         break;
>                         }
>                         break;
>                 case PHY_NOLINK:
>
> This code might have made sense when it was written in 2006 -- back then,
> the break statement was skipping some fallback code. But now it seems to do
> nothing.
>
>
>>
>>>
>>> (instead of phydev->advertising):
>>>   >> +               if (!of_property_read_u32(np, override->prop, &tmp)) {
>>>   >> +                       if (tmp) {
>>>   >> +                               *val |= override->value;
>>>   >> +                               phydev->advertising |= override->supported;
>>>   >> +                       } else {
>>>   >> +                               phydev->advertising &= ~(override->supported);
>>>   >> +                       }
>>>   >> +
>>>   >> +                       *mask |= override->value;
>>>
>>> What I find weird is that the only way phydev->autoneg could ever be set
>>> to disabled is from here (phy.c):
>>>
>>> static void phy_sanitize_settings(struct phy_device *phydev)
>>> {
>>>         u32 features = phydev->supported;
>>>         int idx;
>>>
>>>         /* Sanitize settings based on PHY capabilities */
>>>         if ((features & SUPPORTED_Autoneg) == 0)
>>>                 phydev->autoneg = AUTONEG_DISABLE;
>>>
>>> which is in turn only called when phydev->autoneg is set to
>>> AUTONEG_DISABLE to begin with:
>>>
>>> int phy_start_aneg(struct phy_device *phydev)
>>> {
>>>         int err;
>>>
>>>         mutex_lock(&phydev->lock);
>>>
>>>         if (AUTONEG_DISABLE == phydev->autoneg)
>>>                 phy_sanitize_settings(phydev);
>>>
>>> So could someone please help me figure out what I'm missing here?
>>
>>
>> At first glance it looks like the PHY driver should be reading the
>> phydev->autoneg value when the PHY driver config_aneg() callback is called
>> to be allowed to set the forced speed and settings.
>>
>> The way phy_sanitize_settings() is coded does not make it return a mask of
>> features, but only the forced supported speed and duplex. Then when the link
>> is forced but we are having some issues getting a link status, libphy tries
>> lower speeds and this function is used again to provide the next speed/duplex
>> pair to try.
>>
>
> What I was trying to say is that phy_sanitize_settings() is only called when
> phydev->autoneg == AUTONEG_DISABLE, and in turn it's the only generic
> function setting phydev->autoneg = AUTONEG_DISABLE.
> So perhaps the condition should read:
>
> -       if (AUTONEG_DISABLE == phydev->autoneg)
> +       if ((features & SUPPORTED_Autoneg) == 0)
>                 phy_sanitize_settings(phydev);
>
> Or else, some other parts of the generic code should take care of setting it
> to AUTONEG_DISABLE, depending on whether the feature is supported or not.
> What I found weird is explicitly setting a value (phydev->autoneg =
> AUTONEG_DISABLE), from a static function which is only called when that
> condition is already true.

I do not think that this change is correct either, let me cook a patch
for you to allow disabling autoneg from the start.
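
The circularity discussed above can be seen in a small userspace model that mirrors only the two snippets quoted in this thread. The MODEL_ names and struct model_phy are stand-ins invented for the sketch, not libphy definitions, and the rest of phy_start_aneg() is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for SUPPORTED_Autoneg and AUTONEG_{DISABLE,ENABLE}. */
#define MODEL_SUPPORTED_AUTONEG (1u << 6)
enum { MODEL_AUTONEG_DISABLE = 0, MODEL_AUTONEG_ENABLE = 1 };

struct model_phy {
	uint32_t supported;
	int      autoneg;
};

/* Mirrors the quoted start of phy_sanitize_settings(). */
static void model_sanitize_settings(struct model_phy *p)
{
	if ((p->supported & MODEL_SUPPORTED_AUTONEG) == 0)
		p->autoneg = MODEL_AUTONEG_DISABLE;
}

/* Mirrors the quoted start of phy_start_aneg(): sanitize only runs
 * when autoneg is already disabled. */
static void model_start_aneg(struct model_phy *p)
{
	if (p->autoneg == MODEL_AUTONEG_DISABLE)
		model_sanitize_settings(p);
}
```

With supported cleared but autoneg still set to enable, model_start_aneg() leaves autoneg enabled, which matches the observation that clearing SUPPORTED_Autoneg alone has no effect.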

>
> BTW, I feel like disabling autoneg from the start has never been a use case
> before, am I right?

Not really no, and that is because most hardware does not need quirks
to work correctly.

>
> Thanks!
> Gerlando



-- 
Florian
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH 3/3] net: GSO encapsulation for IP packets
From: Tom Herbert @ 2014-02-11 17:43 UTC (permalink / raw)
  To: davem, netdev; +Cc: ogerlitz

The UDP GSO code assumes that the only encapsulated packets are Ethernet
frames. This patch fixes that so that we can support IP protocol
encapsulation (GUE, GRE/UDP, etc.)

We overload the inner_protocol field in the skb to store either the
Ethertype or the IP protocol (the latter is indicated by the ip_encapsulation
bit). As far as I can tell this should not adversely affect preexisting
uses of inner_protocol.

Signed-off-by: Tom Herbert <therbert@google.com>
---
 include/linux/skbuff.h |  8 ++++++--
 net/core/skbuff.c      |  1 +
 net/ipv4/udp.c         | 12 +++++++++++-
 3 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 1f689e6..757ed39 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -512,7 +512,11 @@ struct sk_buff {
 	 * headers if needed
 	 */
 	__u8			encapsulation:1;
-	/* 6/8 bit hole (depending on ndisc_nodetype presence) */
+	/* skbuf encpasulates an IP packet, inner_protocol should be
+	 * interpreted as an IP protocol, encapsulation bit is also set
+	 */
+	__u8			ip_encapsulation:1;
+	/* 5/7 bit hole (depending on ndisc_nodetype presence) */
 	kmemcheck_bitfield_end(flags2);
 
 #if defined CONFIG_NET_DMA || defined CONFIG_NET_RX_BUSY_POLL
@@ -530,7 +534,7 @@ struct sk_buff {
 		__u32		reserved_tailroom;
 	};
 
-	__be16			inner_protocol;
+	__u16			inner_protocol;
 	__u16			inner_transport_header;
 	__u16			inner_network_header;
 	__u16			inner_mac_header;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 8f519db..64c6190 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -687,6 +687,7 @@ static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 	new->ooo_okay		= old->ooo_okay;
 	new->no_fcs		= old->no_fcs;
 	new->encapsulation	= old->encapsulation;
+	new->ip_encapsulation	= old->ip_encapsulation;
 #ifdef CONFIG_XFRM
 	new->sp			= secpath_get(old->sp);
 #endif
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 77bd16f..48d8cb2 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2497,7 +2497,17 @@ struct sk_buff *skb_udp_tunnel_segment(struct sk_buff *skb,
 
 	/* segment inner packet. */
 	enc_features = skb->dev->hw_enc_features & netif_skb_features(skb);
-	segs = skb_mac_gso_segment(skb, enc_features);
+
+	if (skb->ip_encapsulation) {
+		const struct net_offload *ops;
+		ops = rcu_dereference(inet_offloads[skb->inner_protocol]);
+		if (likely(ops && ops->callbacks.gso_segment))
+			segs = ops->callbacks.gso_segment(skb, enc_features);
+	} else {
+		skb->protocol = htons(ETH_P_TEB);
+		segs = skb_mac_gso_segment(skb, enc_features);
+	}
+
 	if (!segs || IS_ERR(segs)) {
 		skb_gso_error_unwind(skb, protocol, tnl_hlen, mac_offset,
 				     mac_len);
-- 
1.9.0.rc1.175.g0b1dcb5

^ permalink raw reply related

* [PATCH 1/3] net: Fix GRE RX to use skb_transport_header for GRE
From: Tom Herbert @ 2014-02-11 17:43 UTC (permalink / raw)
  To: davem, netdev; +Cc: ogerlitz

GRE assumes that the GRE header is at skb_network_header +
ip_hdrlen(skb). It is more general to use skb_transport_header,
and this allows the possibility of inserting an additional header
between IP and GRE (which is what will be done in Generic UDP
Encapsulation for GRE).

Signed-off-by: Tom Herbert <therbert@google.com>
---
 net/ipv4/gre_demux.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/gre_demux.c b/net/ipv4/gre_demux.c
index 1863422f..6162269 100644
--- a/net/ipv4/gre_demux.c
+++ b/net/ipv4/gre_demux.c
@@ -118,7 +118,6 @@ static __sum16 check_checksum(struct sk_buff *skb)
 static int parse_gre_header(struct sk_buff *skb, struct tnl_ptk_info *tpi,
 			    bool *csum_err)
 {
-	unsigned int ip_hlen = ip_hdrlen(skb);
 	const struct gre_base_hdr *greh;
 	__be32 *options;
 	int hdr_len;
@@ -126,7 +125,7 @@ static int parse_gre_header(struct sk_buff *skb, struct tnl_ptk_info *tpi,
 	if (unlikely(!pskb_may_pull(skb, sizeof(struct gre_base_hdr))))
 		return -EINVAL;
 
-	greh = (struct gre_base_hdr *)(skb_network_header(skb) + ip_hlen);
+	greh = (struct gre_base_hdr *)skb_transport_header(skb);
 	if (unlikely(greh->flags & (GRE_VERSION | GRE_ROUTING)))
 		return -EINVAL;
 
@@ -136,7 +135,7 @@ static int parse_gre_header(struct sk_buff *skb, struct tnl_ptk_info *tpi,
 	if (!pskb_may_pull(skb, hdr_len))
 		return -EINVAL;
 
-	greh = (struct gre_base_hdr *)(skb_network_header(skb) + ip_hlen);
+	greh = (struct gre_base_hdr *)skb_transport_header(skb);
 	tpi->proto = greh->protocol;
 
 	options = (__be32 *)(greh + 1);
-- 
1.9.0.rc1.175.g0b1dcb5

^ permalink raw reply related

* Re: Bonding
From: Jay Vosburgh @ 2014-02-11 17:55 UTC (permalink / raw)
  To: Gustavo Pimentel; +Cc: Veaceslav Falico, netdev@vger.kernel.org
In-Reply-To: <8532C3BD1ECBC64BA47CE43AFED6D38EC37E5B04@S103.efacec.pt>

Gustavo Pimentel <gustavo.pimentel@efacec.com> wrote:

>Hi Veaceslav,
>
>It's quite different from broadcast mode: each frame sent through the slaves has a Redundancy Control Trailer (RCT) attached. This trailer is composed of a LAN identifier, a sequence number, an LSDU size and a PRP suffix.
>Equipment with PRP capability also has to periodically send a supervision frame to both LANs. Each device on the network has to keep track of the sequence numbers it receives. If it receives a sequence number from a specific device, for instance on LAN A, and that number is not in its internal table, the device should accept the frame and update the table. When it later receives the same sequence number from LAN B, the device should discard the frame, which provides zero-downtime redundancy.
>
>I can supply you with information about this redundancy protocol, if you like. This type of network redundancy is now being widely deployed in electrical power stations (thermal and hydro, for example) and transmission power stations, instead of teaming / bonding, which depends on RSTP for redundancy.

	Are you aware that there is already an implementation of HSR
(High-availability Seamless Redundancy) in the linux kernel?  I believe
HSR and PRP are defined by the same standard (IEC 62439-3), and are
similar enough to interoperate to some degree.  Perhaps PRP would be
better implemented as a variant within the existing net/hsr/ framework.

	-J


>> -----Original Message-----
>> From: Veaceslav Falico [mailto:vfalico@redhat.com]
>> Sent: Tuesday, 11 February 2014 14:16
>> To: Gustavo Pimentel
>> Cc: netdev@vger.kernel.org
>> Subject: Re: Bonding
>> 
>> On Tue, Feb 11, 2014 at 01:53:32PM +0000, Gustavo Pimentel wrote:
>> >Hi,
>> 
>> Hi Gustavo,
>> 
>> >
>> >I'm writing you because because I'm have implemented a new mode (PRP Parallel
>> Redundancy Protocol) for bonding kernel driver. This new mode is quite simple, I
>> don't know if you have heard about PRP, but it's a new standard that allows to
>> overcome any single network failure without affecting the data transmission. The
>> general idea resides on having two separate LAN (A & B) very similar and
>> transmitting the almost the same frame through both LANs and the end device
>> should accept one frame and discard the other according to a known mechanism.
>> 
>> Isn't that the current 'broadcast' mode, where every packet is transmitted over all
>> the slaves? After quick googling/reading I don't see any difference there, though I
>> might have missed something.
>> 
>> >
>> >I have implemented this new mode on bonding driver, but I have some
>> difficulties:
>> >. Writing linux driver is quite new for me. I don't' know if exists guide lines for
>> driver coding.
>> 
>> You can find everything under Documentation/, but without the code I can't tell you
>> exact documents. CodingStyle and SubmittingPatches might be the first ones.
>> 
>> Also, try CC-ing relevant people for more feedback, specifically bonding
>> maintainers.
>> 
>> >. I don't know how to submit the code to be include on kernel repository.
>> >. Maybe another pair of eyes could find help to improve the writing code for this
>> mode.
>> 
>> Try sending an RFC when net-next opens.
>> 
>> >
>> >I think my driver code is 99% complete. I'm currently testing with 3 equipments (1
>> pc + 1 embedded device running both my modify bonding driver) and a third party
>> equipment called RedBox.
>> >
>> >Would you be interested in participating / helping this project?
>> >
>> >With my best regards,
>> >
>> >Gustavo Gama da Rocha Pimentel
>> >Power Systems Automation / Innovation & Development Efacec Engenharia e
>> >Sistemas, S.A.
>> >Phone: +351229403391
>> >Disclaimer

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

^ permalink raw reply

* Re: RFC: bridge get fdb by bridge device
From: Vlad Yasevich @ 2014-02-11 18:21 UTC (permalink / raw)
  To: Jamal Hadi Salim, netdev@vger.kernel.org
  Cc: Stephen Hemminger, Scott Feldman, John Fastabend
In-Reply-To: <52FA58E9.906@mojatatu.com>

On 02/11/2014 12:07 PM, Jamal Hadi Salim wrote:
> On 02/10/14 11:31, Vlad Yasevich wrote:
>> On 02/09/2014 10:06 AM, Jamal Hadi Salim wrote:
> 
> 
>>> +    ndm = nlmsg_data(cb->nlh);
>>> +    if (ndm->ndm_ifindex) {
>>
>> We get really lucky here that ndm_ifindex and ifi_index happen to map to
>> the same location.
>>
> 
> Didnt follow - but I have a feeling you are looking at the reference
> point of a bridge port.
> Note as per my response to John: The target is a bridge device, not
> a bridge port.
> 

No, this was more the point that the current iproute code sends an
ifinfomsg struct down, and you change that to send ndmsg struct.
This is risky, but we luck out since the index is at the same offset
in both structs.
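
The offset coincidence can be checked from userspace. This is a small sketch assuming the Linux uapi headers are installed; the two helper function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/neighbour.h>	/* struct ndmsg */
#include <linux/rtnetlink.h>	/* struct ifinfomsg */

/* Byte offset of the interface index when the netlink payload is
 * interpreted as a struct ifinfomsg. */
static size_t ifinfomsg_index_offset(void)
{
	return offsetof(struct ifinfomsg, ifi_index);
}

/* Byte offset of the interface index when the same payload is
 * interpreted as a struct ndmsg. */
static size_t ndmsg_index_offset(void)
{
	return offsetof(struct ndmsg, ndm_ifindex);
}
```

If the two offsets ever diverged, a kernel reading ndm_ifindex from a request built around struct ifinfomsg would pick up the wrong field.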

> 
> 
>>
>> I agree with both of John's comments on the above code.
>> I think you can use ndo_dflt_fdb_dump() here and remove the first check
>> for IFF_EBRIDGE.
>>
> 
> Same comment i made to John. The goal is to emulate
> brctl showmacs <bridge>
> ndo_dflt_fdb_dump() gives me in theory all the bridge ports
> unicast and multicast MAC addresses. There is a possibility that
> the bridgeport is a bridge - in which case I can find out from
> user space and safely request for it directly instead of via
> its parent.
> 

But that would only happen if the user said:
  # bridge fdb show br eth0

If eth0 in this case is a hw bridge device, getting the device's
version of fdb data is exactly what would be expected, isn't it?

If you mean a 'software bridge' above, then that's not an issue
since that's a disallowed config.  You can't stack software bridges
without something in the middle like bond or vlan.

>> The only odd thing is that it would permit syntax like:
>>   # bridge fdb show br eth0
>> or
>>   # bridge fdb show br macvlan0
>>
>> but I think that's ok.
> 
> Ok, since both you and John point to macvlan - is that
> considered as something with an fdb? It doesnt forward
> packets between two devices.
> 

Yes, macvlan can forward data to other macvlans, but that's
not the interesting thing.
When you configure multiple macvlan devices on top of the
same hw device, one could think of the hw device as a sort
of a bridge.  It's not really, but you could define it in
those terms.  The fdb entries, in this case, contain the mac
addresses of the macvlan devices.

> 
> 
>>> diff --git a/bridge/fdb.c b/bridge/fdb.c
>>> index e2e53f1..f3073d6 100644
>>> --- a/bridge/fdb.c
>>> +++ b/bridge/fdb.c
>>> @@ -33,7 +33,7 @@ static void usage(void)
>>>       fprintf(stderr, "Usage: bridge fdb { add | append | del |
>>> replace }
>> ADDR dev DEV {self|master} [ temp ]\n"
>>>                   "              [router] [ dst IPADDR] [ vlan VID ]\n"
>>>                   "              [ port PORT] [ vni VNI ] [via DEV]\n");
>>> -    fprintf(stderr, "       bridge fdb {show} [ dev DEV ]\n");
>>> +    fprintf(stderr, "       bridge fdb {show} [ br BRDEV ] [ dev DEV
>>> ]\n");
>>
>> 'port' option is now allowed in the show operation
>>
> 
> Thanks - it seems to be already taken by vxlan using the same interface.
> 

Sorry, I wasn't very clear. What I meant was that you now support
  # bridge fdb show port <>

The usage message should reflect it.

-vlad
> 
> cheers,
> jamal
> 

^ permalink raw reply

* RE: Bonding
From: Gustavo Pimentel @ 2014-02-11 18:22 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: Veaceslav Falico, netdev@vger.kernel.org
In-Reply-To: <27459.1392141335@death.nxdomain>

Hi Jay,

I was not aware of that. You are correct, both HSR and PRP are defined in IEC 62439-3. I will try to contact the person in charge of HSR to get more information about the status of his driver.


> -----Original Message-----
> From: Jay Vosburgh [mailto:fubar@us.ibm.com]
> Sent: Tuesday, 11 February 2014 17:56
> To: Gustavo Pimentel
> Cc: Veaceslav Falico; netdev@vger.kernel.org
> Subject: Re: Bonding
> 
> Gustavo Pimentel <gustavo.pimentel@efacec.com> wrote:
> 
> >Hi Veaceslav,
> >
> >It's quite different from broadcast mode: each frame sent through the slaves has
> a Redundancy Control Trailer (RCT) attached. This trailer is composed of a LAN
> identifier, a sequence number, an LSDU size and a PRP suffix.
> >Equipment with PRP capability also has to periodically send a supervision
> frame to both LANs. Each device on the network has to keep track of the
> sequence numbers it receives. If it receives a sequence number from a specific
> device, for instance on LAN A, and that number is not in its internal table, the
> device should accept the frame and update the table. When it later receives the
> same sequence number from LAN B, the device should discard the frame, which
> provides zero-downtime redundancy.
> >
> >I can supply you with information about this redundancy protocol, if you like. This
> type of network redundancy is now being widely deployed in electrical power
> stations (thermal and hydro, for example) and transmission power stations,
> instead of teaming / bonding, which depends on RSTP for redundancy.
> 
> 	Are you aware that there is already an implementation of HSR (High-
> availability Seamless Redundancy) in the linux kernel?  I believe HSR and PRP are
> defined by the same standard (IEC 62439-3), and are similar enough to interoperate
> to some degree.  Perhaps PRP would be better implemented as a variant within the
> existing net/hsr/ framework.
> 
> 	-J
> 
> 
> >> -----Original Message-----
> >> From: Veaceslav Falico [mailto:vfalico@redhat.com]
> >> Sent: Tuesday, 11 February 2014 14:16
> >> To: Gustavo Pimentel
> >> Cc: netdev@vger.kernel.org
> >> Subject: Re: Bonding
> >>
> >> On Tue, Feb 11, 2014 at 01:53:32PM +0000, Gustavo Pimentel wrote:
> >> >Hi,
> >>
> >> Hi Gustavo,
> >>
> >> >
> >> >I'm writing you because because I'm have implemented a new mode (PRP
> >> >Parallel
> >> Redundancy Protocol) for bonding kernel driver. This new mode is
> >> quite simple, I don't know if you have heard about PRP, but it's a
> >> new standard that allows to overcome any single network failure
> >> without affecting the data transmission. The general idea resides on
> >> having two separate LAN (A & B) very similar and transmitting the
> >> almost the same frame through both LANs and the end device should accept
> one frame and discard the other according to a known mechanism.
> >>
> >> Isn't that the current 'broadcast' mode, where every packet is
> >> transmitted over all the slaves? After quick googling/reading I don't
> >> see any difference there, though I might have missed something.
> >>
> >> >
> >> >I have implemented this new mode in the bonding driver, but I have some
> >> >difficulties:
> >> >. Writing Linux drivers is quite new for me. I don't know if there are
> >> >guidelines for driver coding.
> >>
> >> You can find everything under Documentation/, but without the code I
> >> can't tell you the exact documents. CodingStyle and SubmittingPatches
> >> might be the first ones.
> >>
> >> Also, try CC-ing relevant people for more feedback, specifically
> >> bonding maintainers.
> >>
> >> >. I don't know how to submit the code to be included in the kernel repository.
> >> >. Maybe another pair of eyes could help improve the code written for
> >> >this mode.
> >>
> >> Try sending an RFC when net-next opens.
> >>
> >> >
> >> >I think my driver code is 99% complete. I'm currently testing with 3
> >> >pieces of equipment (1 PC + 1 embedded device, both running my modified
> >> >bonding driver) and a third-party device called a RedBox.
> >> >
> >> >Would you be interested in participating / helping this project?
> >> >
> >> >With my best regards,
> >> >
> >> >Gustavo Gama da Rocha Pimentel
> >> >Power Systems Automation / Innovation & Development Efacec
> >> >Engenharia e Sistemas, S.A.
> >> >Phone: +351229403391
> >> >Disclaimer
> 
> ---
> 	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
> 
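
For readers unfamiliar with the "accept one frame and discard the other" mechanism quoted above, here is a minimal userspace sketch of duplicate discard keyed on source address and sequence number. It is only an illustration under stated assumptions: the names (prp_accept, struct dup_table, TABLE_SIZE) are hypothetical, the table is a fixed ring with no ageing, and nothing here is taken from the bonding patch or from net/hsr/.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 64 /* illustrative size; a real table would age entries out */

/* One remembered (source MAC, sequence number) pair. */
struct seen {
	uint8_t src[6];
	uint16_t seq;
	bool used;
};

/* Fixed-size ring of recently seen frames (hypothetical structure). */
struct dup_table {
	struct seen slot[TABLE_SIZE];
	unsigned int next;
};

/*
 * Return true if the frame should be accepted (first copy seen on either
 * LAN), false if the same source/sequence pair was already recorded and
 * the frame is therefore the duplicate from the other LAN.
 */
static bool prp_accept(struct dup_table *t, const uint8_t src[6], uint16_t seq)
{
	struct seen *s;
	unsigned int i;

	for (i = 0; i < TABLE_SIZE; i++) {
		s = &t->slot[i];
		if (s->used && s->seq == seq && memcmp(s->src, src, 6) == 0)
			return false;
	}

	/* Not seen before: record it, overwriting the oldest slot. */
	s = &t->slot[t->next];
	memcpy(s->src, src, 6);
	s->seq = seq;
	s->used = true;
	t->next = (t->next + 1) % TABLE_SIZE;
	return true;
}
```

Calling prp_accept() once per received frame, regardless of which slave it arrived on, accepts the first copy and rejects the second. A linear scan is used purely to keep the sketch short.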



* Re: [PATCH 3/3] net: GSO encapsulation for IP packets
From: Alexei Starovoitov @ 2014-02-11 19:12 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David S. Miller, netdev, ogerlitz
In-Reply-To: <alpine.DEB.2.02.1402110928030.7010@tomh.mtv.corp.google.com>

On Tue, Feb 11, 2014 at 9:43 AM, Tom Herbert <therbert@google.com> wrote:
> The UDP GSO code assumes that encapsulated packets are always Ethernet
> frames. This patch fixes that so that we can support IP protocol
> encapsulation (GUE, GRE/UDP, etc.)
>
> We overload the inner_protocol field in the skb to store either the
> Ethertype or the IP protocol (the latter is indicated by the
> ip_encapsulation bit). As far as I can tell this should not adversely
> affect preexisting uses of inner_protocol.
>
> Signed-off-by: Tom Herbert <therbert@google.com>
> ---
>  include/linux/skbuff.h |  8 ++++++--
>  net/core/skbuff.c      |  1 +
>  net/ipv4/udp.c         | 12 +++++++++++-
>  3 files changed, 18 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 1f689e6..757ed39 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -512,7 +512,11 @@ struct sk_buff {
>          * headers if needed
>          */
>         __u8                    encapsulation:1;
> -       /* 6/8 bit hole (depending on ndisc_nodetype presence) */
> +       /* skb encapsulates an IP packet, inner_protocol should be
> +        * interpreted as an IP protocol, encapsulation bit is also set
> +        */
> +       __u8                    ip_encapsulation:1;
> +       /* 5/7 bit hole (depending on ndisc_nodetype presence) */
>         kmemcheck_bitfield_end(flags2);
>
>  #if defined CONFIG_NET_DMA || defined CONFIG_NET_RX_BUSY_POLL
> @@ -530,7 +534,7 @@ struct sk_buff {
>                 __u32           reserved_tailroom;
>         };
>
> -       __be16                  inner_protocol;
> +       __u16                   inner_protocol;
>         __u16                   inner_transport_header;
>         __u16                   inner_network_header;
>         __u16                   inner_mac_header;
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 8f519db..64c6190 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -687,6 +687,7 @@ static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
>         new->ooo_okay           = old->ooo_okay;
>         new->no_fcs             = old->no_fcs;
>         new->encapsulation      = old->encapsulation;
> +       new->ip_encapsulation   = old->ip_encapsulation;
>  #ifdef CONFIG_XFRM
>         new->sp                 = secpath_get(old->sp);
>  #endif
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 77bd16f..48d8cb2 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -2497,7 +2497,17 @@ struct sk_buff *skb_udp_tunnel_segment(struct sk_buff *skb,
>
>         /* segment inner packet. */
>         enc_features = skb->dev->hw_enc_features & netif_skb_features(skb);
> -       segs = skb_mac_gso_segment(skb, enc_features);
> +
> +       if (skb->ip_encapsulation) {
> +               const struct net_offload *ops;
> +               ops = rcu_dereference(inet_offloads[skb->inner_protocol]);
> +               if (likely(ops && ops->callbacks.gso_segment))
> +                       segs = ops->callbacks.gso_segment(skb, enc_features);
> +       } else {
> +               skb->protocol = htons(ETH_P_TEB);

Duplicate assignment? Do you want to remove line 2496, which did the same,
or does proto=teb apply to the ip_encap case as well?

> +               segs = skb_mac_gso_segment(skb, enc_features);
> +       }
> +
>         if (!segs || IS_ERR(segs)) {
>                 skb_gso_error_unwind(skb, protocol, tnl_hlen, mac_offset,
>                                      mac_len);
> --
> 1.9.0.rc1.175.g0b1dcb5
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
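
To make the overloading described in the commit message concrete, the following is a small userspace sketch of a single 16-bit field that is read as an IP protocol number when a flag bit is set, and as an Ethertype otherwise. Everything in it is illustrative: struct pkt_meta, enum segmenter, ip_offloads and pick_segmenter are hypothetical stand-ins, not kernel API. The bounds check is an addition of this sketch, made because the field is 16 bits wide while the toy table has only 256 slots.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two sk_buff fields the patch touches. */
struct pkt_meta {
	bool ip_encapsulation;   /* true: inner_protocol is an IP protocol number */
	uint16_t inner_protocol; /* Ethertype or IP protocol, per the flag above */
};

enum segmenter { SEG_NONE, SEG_MAC, SEG_IPIP, SEG_GRE };

/* Toy offload table indexed by IP protocol number (illustrative only). */
static const enum segmenter ip_offloads[256] = {
	[4]  = SEG_IPIP, /* IP-in-IP */
	[47] = SEG_GRE,  /* GRE */
};

/* Choose a segmentation path the way the patched branch does. */
static enum segmenter pick_segmenter(const struct pkt_meta *m)
{
	if (m->ip_encapsulation) {
		/* Field is wider than the table, so check the index first. */
		if (m->inner_protocol >= 256)
			return SEG_NONE;
		return ip_offloads[m->inner_protocol];
	}

	/* Not IP-encapsulated: treat the inner packet as an Ethernet frame. */
	return SEG_MAC;
}
```

A lookup that finds no handler returns SEG_NONE here. In the quoted hunk the analogous case leaves segs to be caught by the !segs check that follows, assuming segs starts out NULL, which the hunk does not show.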


* Re: 3.14-mw regression: rtl8169 WARNING: DMA-API: exceeded 7 overlapping mappings of pfn 55ebe
From: Sander Eikelenboom @ 2014-02-11 19:56 UTC (permalink / raw)
  To: Dan Williams
  Cc: Konrad Rzeszutek Wilk, Wei Liu, Francois Romieu,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <CAPcyv4g2EnCLFWfLfSDnjtwrid2tCq2k6wh-8sPYY06eJpM83A@mail.gmail.com>

Hi Dan,

FYI, I just tested with Xen out of the equation (booting bare metal) and the problem still persists.

I tried something else. I don't know if it gives you any more insight, but it's worth a try:

diff --git a/lib/dma-debug.c b/lib/dma-debug.c
index 2defd13..0fe5b75 100644
--- a/lib/dma-debug.c
+++ b/lib/dma-debug.c
@@ -474,11 +474,11 @@ static int active_pfn_set_overlap(unsigned long pfn, int overlap)
        return overlap;
 }

-static void active_pfn_inc_overlap(unsigned long pfn)
+static void active_pfn_inc_overlap(struct dma_debug_entry *ent)
 {
-       int overlap = active_pfn_read_overlap(pfn);
+       int overlap = active_pfn_read_overlap(ent->pfn);

-       overlap = active_pfn_set_overlap(pfn, ++overlap);
+       overlap = active_pfn_set_overlap(ent->pfn, ++overlap);

        /* If we overflowed the overlap counter then we're potentially
         * leaking dma-mappings.  Otherwise, if maps and unmaps are
@@ -486,15 +486,43 @@ static void active_pfn_inc_overlap(unsigned long pfn)
         * debug_dma_assert_idle() as the pfn may be marked idle
         * prematurely.
         */
+
        WARN_ONCE(overlap > ACTIVE_PFN_MAX_OVERLAP,
                  "DMA-API: exceeded %d overlapping mappings of pfn %lx\n",
-                 ACTIVE_PFN_MAX_OVERLAP, pfn);
+                 ACTIVE_PFN_MAX_OVERLAP, ent->pfn);
+
+       if (overlap > ACTIVE_PFN_MAX_OVERLAP) {
+               int idx;
+
+               dev_info(ent->dev, "DMA-API: exceeded %d overlapping mappings of pfn %lx .. start dump\n", ACTIVE_PFN_MAX_OVERLAP, ent->pfn);
+
+               for (idx = 0; idx < HASH_SIZE; idx++) {
+                       struct hash_bucket *bucket = &dma_entry_hash[idx];
+                       struct dma_debug_entry *entry;
+                       unsigned long flags;
+
+                       list_for_each_entry(entry, &bucket->list, list) {
+                               if (entry->pfn == ent->pfn) {
+                                       dev_info(entry->dev, "%s idx %d P=%Lx N=%lx D=%Lx L=%Lx %s %s\n",
+                                                type2name[entry->type], idx,
+                                                phys_addr(entry), entry->pfn,
+                                                entry->dev_addr, entry->size,
+                                                dir2name[entry->direction],
+                                                maperr2str[entry->map_err_type]);
+                               }
+                       }
+               }
+               dev_info(ent->dev, "DMA-API: exceeded %d overlapping mappings of pfn %lx .. end of dump\n", ACTIVE_PFN_MAX_OVERLAP, ent->pfn);
+       }
 }


@@ -505,10 +533,10 @@ static int active_pfn_insert(struct dma_debug_entry *entry)

        spin_lock_irqsave(&radix_lock, flags);
        rc = radix_tree_insert(&dma_active_pfn, entry->pfn, entry);
-       if (rc == -EEXIST)
-               active_pfn_inc_overlap(entry->pfn);
+       if (rc == -EEXIST){
+               active_pfn_inc_overlap(entry);
+       }
        spin_unlock_irqrestore(&radix_lock, flags);
-
        return rc;
 }


This results in:
[   27.708678] r8169 0000:0a:00.0 eth1: link down
[   27.712102] r8169 0000:0a:00.0 eth1: link down
[   28.015340] r8169 0000:0b:00.0 eth0: link down
[   28.015368] r8169 0000:0b:00.0 eth0: link down
[   29.654844] r8169 0000:0b:00.0 eth0: link up
[   30.278542] r8169 0000:0a:00.0 eth1: link up
[   60.829503] EXT4-fs (dm-2): mounted filesystem with ordered data mode. Opts: barrier=1,errors=remount-ro
[   69.708979] EXT4-fs (dm-42): mounted filesystem with ordered data mode. Opts: barrier=1,errors=remount-ro
[   76.128678] EXT4-fs (dm-43): mounted filesystem with ordered data mode. Opts: barrier=1,errors=remount-ro
[   82.922836] EXT4-fs (dm-44): mounted filesystem with ordered data mode. Opts: barrier=1,errors=remount-ro
[   89.232889] EXT4-fs (dm-45): mounted filesystem with ordered data mode. Opts: barrier=1,errors=remount-ro
[   95.359859] EXT4-fs (dm-46): mounted filesystem with ordered data mode. Opts: barrier=1,errors=remount-ro
[  101.638559] EXT4-fs (sdb1): mounted filesystem with ordered data mode. Opts: barrier=1,errors=remount-ro
[  218.073407] ------------[ cut here ]------------
[  218.080983] WARNING: CPU: 5 PID: 0 at lib/dma-debug.c:492 add_dma_entry+0xf1/0x210()
[  218.088550] DMA-API: exceeded 7 overlapping mappings of pfn 3c421
[  218.095988] Modules linked in:
[  218.103270] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G        W    3.14.0-rc2-20140211-pcireset-net-btrevert-xenblock-dmadebug5+ #1
[  218.110712] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS V1.8B1 09/13/2010
[  218.118134]  0000000000000009 ffff88003fd437b8 ffffffff81b809c4 ffff88003e308000
[  218.125556]  ffff88003fd43808 ffff88003fd437f8 ffffffff810c985c 0000000000000000
[  218.132917]  00000000ffffffef 0000000000000036 ffff88003d9d3c00 0000000000000282
[  218.140154] Call Trace:
[  218.147193]  <IRQ>  [<ffffffff81b809c4>] dump_stack+0x46/0x58
[  218.154271]  [<ffffffff810c985c>] warn_slowpath_common+0x8c/0xc0
[  218.161293]  [<ffffffff810c9946>] warn_slowpath_fmt+0x46/0x50
[  218.168227]  [<ffffffff814f2cfa>] ? active_pfn_read_overlap+0x3a/0x70
[  218.175116]  [<ffffffff814f41d1>] add_dma_entry+0xf1/0x210
[  218.181865]  [<ffffffff814f4646>] debug_dma_map_page+0x126/0x150
[  218.188484]  [<ffffffff817aabeb>] rtl8169_start_xmit+0x21b/0xa20
[  218.195042]  [<ffffffff81a01877>] ? dev_queue_xmit_nit+0x1d7/0x260
[  218.201553]  [<ffffffff81a0188f>] ? dev_queue_xmit_nit+0x1ef/0x260
[  218.207965]  [<ffffffff81a016a5>] ? dev_queue_xmit_nit+0x5/0x260
[  218.214290]  [<ffffffff81a0661f>] dev_hard_start_xmit+0x37f/0x590
[  218.220481]  [<ffffffff81a26cae>] sch_direct_xmit+0xfe/0x280
[  218.226529]  [<ffffffff81a06a7f>] __dev_queue_xmit+0x24f/0x660
[  218.232521]  [<ffffffff81a06835>] ? __dev_queue_xmit+0x5/0x660
[  218.238439]  [<ffffffff81ab21b9>] ? ip_output+0x59/0xf0
[  218.244272]  [<ffffffff81a06eb0>] dev_queue_xmit+0x10/0x20
[  218.250043]  [<ffffffff81ab076b>] ip_finish_output+0x2cb/0x670
[  218.255682]  [<ffffffff81ab21b9>] ? ip_output+0x59/0xf0
[  218.261168]  [<ffffffff81ab21b9>] ip_output+0x59/0xf0
[  218.266559]  [<ffffffff81aad596>] ip_forward_finish+0x76/0x1a0
[  218.271883]  [<ffffffff81aad86b>] ip_forward+0x1ab/0x440
[  218.277148]  [<ffffffff81aab380>] ip_rcv_finish+0x150/0x660
[  218.282373]  [<ffffffff81aabe3b>] ip_rcv+0x22b/0x370
[  218.287436]  [<ffffffff81b09bc7>] ? packet_rcv_spkt+0x47/0x190
[  218.292372]  [<ffffffff81a03272>] __netif_receive_skb_core+0x722/0x8f0
[  218.297328]  [<ffffffff81a02c75>] ? __netif_receive_skb_core+0x125/0x8f0
[  218.302304]  [<ffffffff8112ce6e>] ? getnstimeofday+0xe/0x30
[  218.307296]  [<ffffffff819f42c5>] ? __netdev_alloc_frag+0x175/0x1b0
[  218.312166]  [<ffffffff81a03461>] __netif_receive_skb+0x21/0x70
[  218.316904]  [<ffffffff81a034d3>] netif_receive_skb_internal+0x23/0xf0
[  218.321596]  [<ffffffff81a04d2d>] napi_gro_receive+0x8d/0x100
[  218.326219]  [<ffffffff817a7bc3>] rtl8169_poll+0x2d3/0x680
[  218.330754]  [<ffffffff8112e366>] ? update_wall_time+0x356/0x690
[  218.335208]  [<ffffffff81a03a0a>] net_rx_action+0x18a/0x2c0
[  218.339595]  [<ffffffff810ce6f1>] ? __do_softirq+0xc1/0x300
[  218.343890]  [<ffffffff810ce767>] __do_softirq+0x137/0x300
[  218.348085]  [<ffffffff810cec9a>] irq_exit+0xaa/0xd0
[  218.352203]  [<ffffffff81b8e5a7>] do_IRQ+0x67/0x110
[  218.356225]  [<ffffffff81b8b772>] common_interrupt+0x72/0x72
[  218.360156]  <EOI>  [<ffffffff810536e6>] ? native_safe_halt+0x6/0x10
[  218.364087]  [<ffffffff81113a7d>] ? trace_hardirqs_on+0xd/0x10
[  218.367935]  [<ffffffff81020632>] default_idle+0x32/0xd0
[  218.371691]  [<ffffffff8102071e>] amd_e400_idle+0x4e/0x140
[  218.375360]  [<ffffffff81020f86>] arch_cpu_idle+0x36/0x40
[  218.378921]  [<ffffffff81120a01>] cpu_startup_entry+0xa1/0x2a0
[  218.382508]  [<ffffffff810473cf>] start_secondary+0x1af/0x210
[  218.386133] ---[ end trace 0e12f271209e2c18 ]---
[  218.389769] r8169 0000:0b:00.0: DMA-API: exceeded 7 overlapping mappings of pfn 3c421 .. start dump
[  218.393566] r8169 0000:0b:00.0: single idx 563 P=3c421100 N=3c421 D=c66100 L=36 DMA_TO_DEVICE dma map error checked
[  218.397379] r8169 0000:0b:00.0: single idx 563 P=3c4212c0 N=3c421 D=c672c0 L=36 DMA_TO_DEVICE dma map error checked
[  218.401094] r8169 0000:0b:00.0: single idx 564 P=3c421480 N=3c421 D=c68480 L=36 DMA_TO_DEVICE dma map error checked
[  218.404730] r8169 0000:0b:00.0: single idx 564 P=3c421640 N=3c421 D=c69640 L=36 DMA_TO_DEVICE dma map error checked
[  218.408310] r8169 0000:0b:00.0: single idx 565 P=3c421800 N=3c421 D=c6a800 L=36 DMA_TO_DEVICE dma map error checked
[  218.411762] r8169 0000:0b:00.0: single idx 565 P=3c4219c0 N=3c421 D=c6b9c0 L=36 DMA_TO_DEVICE dma map error checked
[  218.415075] r8169 0000:0b:00.0: single idx 566 P=3c421b80 N=3c421 D=c6cb80 L=9b DMA_TO_DEVICE dma map error checked
[  218.418305] r8169 0000:0b:00.0: single idx 566 P=3c421dc0 N=3c421 D=c6ddc0 L=36 DMA_TO_DEVICE dma map error checked
[  218.421502] r8169 0000:0b:00.0: single idx 567 P=3c421f80 N=3c421 D=c6ef80 L=36 DMA_TO_DEVICE dma map error not checked
[  218.424677] r8169 0000:0b:00.0: DMA-API: exceeded 7 overlapping mappings of pfn 3c421 .. end of dump
[  218.429050] r8169 0000:0b:00.0: DMA-API: exceeded 7 overlapping mappings of pfn 3c423 .. start dump
[  218.432225] r8169 0000:0b:00.0: single idx 571 P=3c423040 N=3c423 D=c76040 L=36 DMA_TO_DEVICE dma map error checked
[  218.435408] r8169 0000:0b:00.0: single idx 571 P=3c423200 N=3c423 D=c77200 L=36 DMA_TO_DEVICE dma map error checked
[  218.438578] r8169 0000:0b:00.0: single idx 572 P=3c4233c0 N=3c423 D=c783c0 L=36 DMA_TO_DEVICE dma map error checked
[  218.441695] r8169 0000:0b:00.0: single idx 572 P=3c423580 N=3c423 D=c79580 L=7b DMA_TO_DEVICE dma map error checked
[  218.444783] r8169 0000:0b:00.0: single idx 573 P=3c423780 N=3c423 D=c7a780 L=9b DMA_TO_DEVICE dma map error checked
[  218.447825] r8169 0000:0b:00.0: single idx 573 P=3c4239c0 N=3c423 D=c7b9c0 L=6b DMA_TO_DEVICE dma map error checked
[  218.450844] r8169 0000:0b:00.0: single idx 574 P=3c423bc0 N=3c423 D=c7cbc0 L=7b DMA_TO_DEVICE dma map error checked
[  218.453814] r8169 0000:0b:00.0: single idx 574 P=3c423dc0 N=3c423 D=c7ddc0 L=7b DMA_TO_DEVICE dma map error checked
[  218.456793] r8169 0000:0b:00.0: single idx 575 P=3c423fc0 N=3c423 D=c7efc0 L=7b DMA_TO_DEVICE dma map error not checked
[  218.459772] r8169 0000:0b:00.0: DMA-API: exceeded 7 overlapping mappings of pfn 3c423 .. end of dump
[  218.473504] r8169 0000:0b:00.0: DMA-API: exceeded 7 overlapping mappings of pfn 3c716 .. start dump
[  218.475662] r8169 0000:0b:00.0: single idx 586 P=3c7160c0 N=3c716 D=c940c0 L=36 DMA_TO_DEVICE dma map error checked
[  218.477874] r8169 0000:0b:00.0: single idx 586 P=3c716280 N=3c716 D=c95280 L=36 DMA_TO_DEVICE dma map error checked
[  218.480075] r8169 0000:0b:00.0: single idx 587 P=3c716440 N=3c716 D=c96440 L=36 DMA_TO_DEVICE dma map error checked
[  218.482245] r8169 0000:0b:00.0: single idx 587 P=3c716600 N=3c716 D=c97600 L=36 DMA_TO_DEVICE dma map error checked
[  218.484390] r8169 0000:0b:00.0: single idx 588 P=3c7167c0 N=3c716 D=c987c0 L=42 DMA_TO_DEVICE dma map error checked
[  218.486510] r8169 0000:0b:00.0: single idx 588 P=3c7169c0 N=3c716 D=c999c0 L=36 DMA_TO_DEVICE dma map error checked
[  218.488603] r8169 0000:0b:00.0: single idx 589 P=3c716b80 N=3c716 D=c9ab80 L=42 DMA_TO_DEVICE dma map error checked
[  218.490682] r8169 0000:0b:00.0: single idx 589 P=3c716d80 N=3c716 D=c9bd80 L=42 DMA_TO_DEVICE dma map error checked
[  218.492735] r8169 0000:0b:00.0: single idx 590 P=3c716f80 N=3c716 D=c9cf80 L=42 DMA_TO_DEVICE dma map error not checked
[  218.494788] r8169 0000:0b:00.0: DMA-API: exceeded 7 overlapping mappings of pfn 3c716 .. end of dump
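
A note on reading the dump above: each N= value is the P= physical address shifted right by the page shift, and the first dump lists nine live mappings that land on the same pfn, which is more than the limit of 7 named in the warning. The sketch below reproduces that arithmetic in userspace. It assumes 4 KiB pages (PAGE_SHIFT of 12), and the helper names are hypothetical. It only illustrates how the numbers relate and is not a diagnosis of the driver.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12            /* assumption: 4 KiB pages */
#define ACTIVE_PFN_MAX_OVERLAP 7 /* limit quoted in the warning text */

/* Page frame number of a physical address. */
static uint64_t phys_to_pfn(uint64_t phys)
{
	return phys >> PAGE_SHIFT;
}

/* Count how many of the given physical addresses fall on one pfn. */
static size_t count_mappings_on_pfn(const uint64_t *phys, size_t n,
				    uint64_t pfn)
{
	size_t hits = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (phys_to_pfn(phys[i]) == pfn)
			hits++;
	return hits;
}

/* P= values copied from the first dump above (pfn 3c421). */
static const uint64_t dump_phys[] = {
	0x3c421100, 0x3c4212c0, 0x3c421480, 0x3c421640, 0x3c421800,
	0x3c4219c0, 0x3c421b80, 0x3c421dc0, 0x3c421f80,
};
```

With these inputs every address maps to pfn 0x3c421. The buffers are small (the L= lengths are mostly 0x36 bytes), so several of them fit in one page.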

--
Sander





Thursday, February 6, 2014, 3:26:09 PM, you wrote:

> On Thu, Feb 6, 2014 at 5:09 AM, Sander Eikelenboom <linux@eikelenboom.it> wrote:
>> Hmm ok that last message was false .. sorry for that .. it did happen again without r8169.use_dac=1, it just doesn't seem to happen all the time...
>>
>> Konrad / Wei, do you happen to know of any xen related change that went into 3.14 merge window that relates to dma / xen networking ?
>>
>> --
>> Sander
>>
>> complete stacktrace:
>>
>> [  342.710738] ------------[ cut here ]------------
>> [  342.726890] WARNING: CPU: 0 PID: 0 at lib/dma-debug.c:491 add_dma_entry+0x105/0x130()
>> [  342.743210] DMA-API: exceeded 7 overlapping mappings of pfn 40b00
>> [  342.759510] Modules linked in:
>> [  342.775557] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.14.0-rc1-20140206-pcireset-net-btrevert+ #1
>> [  342.791706] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS V1.8B1 09/13/2010
>> [  342.807627]  0000000000000009 ffff88005f603828 ffffffff81ad29fc ffffffff822134e0
>> [  342.823430]  ffff88005f603878 ffff88005f603868 ffffffff810bdf62 ffff880000000000
>> [  342.839081]  0000000000040b00 00000000ffffffef ffffffff822102e0 ffff8800592b9098
>> [  342.854572] Call Trace:
>> [  342.869748]  <IRQ>  [<ffffffff81ad29fc>] dump_stack+0x46/0x58
>> [  342.884915]  [<ffffffff810bdf62>] warn_slowpath_common+0x82/0xb0
>> [  342.899710]  [<ffffffff810be031>] warn_slowpath_fmt+0x41/0x50
>> [  342.914395]  [<ffffffff8147853a>] ? active_pfn_read_overlap+0x3a/0x70
>> [  342.929166]  [<ffffffff814792c5>] add_dma_entry+0x105/0x130
>> [  342.943733]  [<ffffffff814796c6>] debug_dma_map_page+0x126/0x150
>> [  342.957988]  [<ffffffff8171c8b6>] rtl8169_start_xmit+0x216/0xa20
>> [  342.972306]  [<ffffffff8195f08f>] ? dev_queue_xmit_nit+0x1ef/0x260
>> [  342.986523]  [<ffffffff8195eea0>] ? dev_loopback_xmit+0x1e0/0x1e0
>> [  343.000689]  [<ffffffff819631e6>] dev_hard_start_xmit+0x2e6/0x4a0
>> [  343.014466]  [<ffffffff81980f3e>] sch_direct_xmit+0xfe/0x280
>> [  343.028052]  [<ffffffff819635dc>] __dev_queue_xmit+0x23c/0x630
>> [  343.041338]  [<ffffffff819633a0>] ? dev_hard_start_xmit+0x4a0/0x4a0
>> [  343.054483]  [<ffffffff81a0a334>] ? ip_output+0x54/0xf0
>> [  343.067659]  [<ffffffff819639eb>] dev_queue_xmit+0xb/0x10
>> [  343.080804]  [<ffffffff81a0890b>] ip_finish_output+0x2cb/0x670
>> [  343.093746]  [<ffffffff81a0a334>] ? ip_output+0x54/0xf0
>> [  343.106391]  [<ffffffff81a0a334>] ip_output+0x54/0xf0
>> [  343.118683]  [<ffffffff81a05791>] ip_forward_finish+0x71/0x1a0
>> [  343.130901]  [<ffffffff81a05a63>] ip_forward+0x1a3/0x440
>> [  343.142829]  [<ffffffff810ffebb>] ? lock_is_held+0x8b/0xb0
>> [  343.154346]  [<ffffffff81a035c0>] ip_rcv_finish+0x150/0x660
>> [  343.165748]  [<ffffffff81a0406b>] ip_rcv+0x22b/0x370
>> [  343.176838]  [<ffffffff81a60972>] ? packet_rcv_spkt+0x42/0x190
>> [  343.187659]  [<ffffffff819609d2>] __netif_receive_skb_core+0x6d2/0x8a0
>> [  343.198209]  [<ffffffff81960414>] ? __netif_receive_skb_core+0x114/0x8a0
>> [  343.208819]  [<ffffffff81009010>] ? xen_clocksource_read+0x20/0x30
>> [  343.219471]  [<ffffffff81116e49>] ? getnstimeofday+0x9/0x30
>> [  343.229862]  [<ffffffff81960bbc>] __netif_receive_skb+0x1c/0x70
>> [  343.239953]  [<ffffffff81960c2e>] netif_receive_skb_internal+0x1e/0xf0
>> [  343.249908]  [<ffffffff81962110>] napi_gro_receive+0x70/0xa0
>> [  343.259509]  [<ffffffff817198a3>] rtl8169_poll+0x2d3/0x680
>> [  343.268982]  [<ffffffff81adcd2b>] ? _raw_spin_unlock_irq+0x2b/0x50
>> [  343.278091]  [<ffffffff819610d1>] net_rx_action+0x161/0x260
>> [  343.287056]  [<ffffffff810c28ec>] __do_softirq+0x12c/0x280
>> [  343.295756]  [<ffffffff810c2da2>] irq_exit+0xa2/0xd0
>> [  343.304235]  [<ffffffff814ffd5f>] xen_evtchn_do_upcall+0x2f/0x40
>> [  343.312387]  [<ffffffff81adf15e>] xen_do_hypervisor_callback+0x1e/0x30
>> [  343.320389]  <EOI>  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>> [  343.328171]  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>> [  343.335738]  [<ffffffff81008c70>] ? xen_safe_halt+0x10/0x20
>> [  343.343142]  [<ffffffff81018748>] ? default_idle+0x18/0x20
>> [  343.350202]  [<ffffffff81018f5e>] ? arch_cpu_idle+0x2e/0x40
>> [  343.356994]  [<ffffffff8110b551>] ? cpu_startup_entry+0x91/0x1e0
>> [  343.363658]  [<ffffffff81ac7d87>] ? rest_init+0xb7/0xc0
>> [  343.369924]  [<ffffffff81ac7cd0>] ? csum_partial_copy_generic+0x170/0x170
>> [  343.376057]  [<ffffffff8230ff1c>] ? start_kernel+0x409/0x416
>> [  343.381972]  [<ffffffff8230f912>] ? repair_env_string+0x5e/0x5e
>> [  343.387573]  [<ffffffff8230f5f8>] ? x86_64_start_reservations+0x2a/0x2c
>> [  343.393152]  [<ffffffff82312e28>] ? xen_start_kernel+0x586/0x588
>> [  343.398628] ---[ end trace 8379b598fb7ef5ee ]---
>>
>>
>>
>>
>>
>> Thursday, February 6, 2014, 12:36:31 PM, you wrote:
>>
>>> Hi Dan / Francois,
>>
>>> Didn't have time to test it before, but the patch doesn't seem to help.
>>> I'm still getting the "DMA-API: exceeded 7 overlapping mappings of pfn 55ebe",
>>> but i see now i forgot to mention i use r8169.use_dac=1 ...
>>
>>> Not using it seems to prevent the warning, but before 3.14 i have never seen this (with r8169.use_dac=1)

> If you are still hitting this with the patch:

>   59f2e7df574c dma-debug: fix overlap detection

> ...then I'm more inclined to think it is an actual positive report.

> If you don't mind I'll send some debug patches to narrow this down.


* Re: [PATCH v2] net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer
From: Matija Glavinic Pecotic @ 2014-02-11 19:56 UTC (permalink / raw)
  To: ext Vlad Yasevich
  Cc: linux-sctp@vger.kernel.org, netdev@vger.kernel.org,
	Alexander Sverdlin
In-Reply-To: <52FA3914.90002@gmail.com>

Hello Vlad,

On 02/11/2014 03:52 PM, ext Vlad Yasevich wrote:
> Hi Matija
> 
> On 02/09/2014 02:15 AM, Matija Glavinic Pecotic wrote:
>>
>> Proposed solution:
>>
>> Both problems share the same root cause: improper scaling of the socket
>> buffer with rwnd. A solution in which sizeof(sk_buff) is taken into
>> account while calculating rwnd is not possible, because there is no linear
>> relationship between the amount of data accounted for in increase/decrease
>> and the IP packet in which the payload arrived. Even if such a solution
>> were followed, the complexity of the code would increase. Due to the nature
>> of the current rwnd handling, the slow increase (in
>> sctp_assoc_rwnd_increase) of rwnd after the pressure state is entered is
>> rational, but it gives the sender a false representation of the current
>> buffer space. Furthermore, it implements an additional congestion control
>> mechanism that is defined by the implementation, not by the standard.
>>
>> The proposed solution simplifies the whole algorithm, keeping in mind the
>> definition from the RFC:
>>
>> o  Receiver Window (rwnd): This gives the sender an indication of the space
>>    available in the receiver's inbound buffer.
>>
>> The core of the proposed solution is given by these lines:
>>
>> sctp_assoc_rwnd_update:
>> 	if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
>> 		asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
>> 	else
>> 		asoc->rwnd = 0;
>>
>> We advertise to the sender (half of) the actual space we have. "Half" is
>> in parentheses because it depends on whether you take the size of the
>> socket buffer to be SO_RCVBUF or twice that amount, i.e. the size visible
>> from userspace or the one visible from kernelspace.
>> In this way the sender is given a good approximation of our buffer space,
>> regardless of the buffer policy: we always advertise what we have. The
>> proposed solution fixes the described problems and removes the need for
>> the rwnd restoration algorithm. Finally, as the proposed solution is a
>> simplification, some lines of code, along with some bytes in struct
>> sctp_association, are saved.
>>
>> Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
>> Reviewed-by: Alexander Sverdlin <alexander.sverdlin@nsn.com>
>>
>> --- net-next.orig/net/sctp/associola.c
>> +++ net-next/net/sctp/associola.c
>> @@ -1367,44 +1367,35 @@ static inline bool sctp_peer_needs_updat
>>  	return false;
>>  }
>>
>> -/* Increase asoc's rwnd by len and send any window update SACK if needed. */
>> -void sctp_assoc_rwnd_increase(struct sctp_association *asoc, unsigned int len)
>> +/* Update asoc's rwnd for the approximated state in the buffer,
>> + * and check whether SACK needs to be sent.
>> + */
>> +void sctp_assoc_rwnd_update(struct sctp_association *asoc, bool update_peer)
>>  {
>> +	int rx_count;
>>  	struct sctp_chunk *sack;
>>  	struct timer_list *timer;
>>
>> -	if (asoc->rwnd_over) {
>> -		if (asoc->rwnd_over >= len) {
>> -			asoc->rwnd_over -= len;
>> -		} else {
>> -			asoc->rwnd += (len - asoc->rwnd_over);
>> -			asoc->rwnd_over = 0;
>> -		}
>> -	} else {
>> -		asoc->rwnd += len;
>> -	}
>> +	if (asoc->ep->rcvbuf_policy)
>> +		rx_count = atomic_read(&asoc->rmem_alloc);
>> +	else
>> +		rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);
>>
>> -	/* If we had window pressure, start recovering it
>> -	 * once our rwnd had reached the accumulated pressure
>> -	 * threshold.  The idea is to recover slowly, but up
>> -	 * to the initial advertised window.
>> -	 */
>> -	if (asoc->rwnd_press && asoc->rwnd >= asoc->rwnd_press) {
>> -		int change = min(asoc->pathmtu, asoc->rwnd_press);
>> -		asoc->rwnd += change;
>> -		asoc->rwnd_press -= change;
>> -	}
>> +	if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
>> +		asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
>> +	else
>> +		asoc->rwnd = 0;
>>
>> -	pr_debug("%s: asoc:%p rwnd increased by %d to (%u, %u) - %u\n",
>> -		 __func__, asoc, len, asoc->rwnd, asoc->rwnd_over,
>> -		 asoc->a_rwnd);
>> +	pr_debug("%s: asoc:%p rwnd=%u, rx_count=%d, sk_rcvbuf=%d\n",
>> +		 __func__, asoc, asoc->rwnd, rx_count,
>> +		 asoc->base.sk->sk_rcvbuf);
>>
>>  	/* Send a window update SACK if the rwnd has increased by at least the
>>  	 * minimum of the association's PMTU and half of the receive buffer.
>>  	 * The algorithm used is similar to the one described in
>>  	 * Section 4.2.3.3 of RFC 1122.
>>  	 */
>> -	if (sctp_peer_needs_update(asoc)) {
>> +	if (update_peer && sctp_peer_needs_update(asoc)) {
>>  		asoc->a_rwnd = asoc->rwnd;
>>
>>  		pr_debug("%s: sending window update SACK- asoc:%p rwnd:%u "
>> @@ -1426,45 +1417,6 @@ void sctp_assoc_rwnd_increase(struct sct
>>  	}
>>  }
>>
>> -/* Decrease asoc's rwnd by len. */
>> -void sctp_assoc_rwnd_decrease(struct sctp_association *asoc, unsigned int len)
>> -{
>> -	int rx_count;
>> -	int over = 0;
>> -
>> -	if (unlikely(!asoc->rwnd || asoc->rwnd_over))
>> -		pr_debug("%s: association:%p has asoc->rwnd:%u, "
>> -			 "asoc->rwnd_over:%u!\n", __func__, asoc,
>> -			 asoc->rwnd, asoc->rwnd_over);
>> -
>> -	if (asoc->ep->rcvbuf_policy)
>> -		rx_count = atomic_read(&asoc->rmem_alloc);
>> -	else
>> -		rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);
>> -
>> -	/* If we've reached or overflowed our receive buffer, announce
>> -	 * a 0 rwnd if rwnd would still be positive.  Store the
>> -	 * the potential pressure overflow so that the window can be restored
>> -	 * back to original value.
>> -	 */
>> -	if (rx_count >= asoc->base.sk->sk_rcvbuf)
>> -		over = 1;
>> -
>> -	if (asoc->rwnd >= len) {
>> -		asoc->rwnd -= len;
>> -		if (over) {
>> -			asoc->rwnd_press += asoc->rwnd;
>> -			asoc->rwnd = 0;
>> -		}
>> -	} else {
>> -		asoc->rwnd_over = len - asoc->rwnd;
>> -		asoc->rwnd = 0;
>> -	}
>> -
>> -	pr_debug("%s: asoc:%p rwnd decreased by %d to (%u, %u, %u)\n",
>> -		 __func__, asoc, len, asoc->rwnd, asoc->rwnd_over,
>> -		 asoc->rwnd_press);
>> -}
>>
>>  /* Build the bind address list for the association based on info from the
>>   * local endpoint and the remote peer.
>> --- net-next.orig/include/net/sctp/structs.h
>> +++ net-next/include/net/sctp/structs.h
>> @@ -1653,17 +1653,6 @@ struct sctp_association {
>>  	/* This is the last advertised value of rwnd over a SACK chunk. */
>>  	__u32 a_rwnd;
>>
>> -	/* Number of bytes by which the rwnd has slopped.  The rwnd is allowed
>> -	 * to slop over a maximum of the association's frag_point.
>> -	 */
>> -	__u32 rwnd_over;
>> -
>> -	/* Keeps treack of rwnd pressure.  This happens when we have
>> -	 * a window, but not recevie buffer (i.e small packets).  This one
>> -	 * is releases slowly (1 PMTU at a time ).
>> -	 */
>> -	__u32 rwnd_press;
>> -
>>  	/* This is the sndbuf size in use for the association.
>>  	 * This corresponds to the sndbuf size for the association,
>>  	 * as specified in the sk->sndbuf.
>> @@ -1892,8 +1881,7 @@ void sctp_assoc_update(struct sctp_assoc
>>  __u32 sctp_association_get_next_tsn(struct sctp_association *);
>>
>>  void sctp_assoc_sync_pmtu(struct sock *, struct sctp_association *);
>> -void sctp_assoc_rwnd_increase(struct sctp_association *, unsigned int);
>> -void sctp_assoc_rwnd_decrease(struct sctp_association *, unsigned int);
>> +void sctp_assoc_rwnd_update(struct sctp_association *, bool);
>>  void sctp_assoc_set_primary(struct sctp_association *,
>>  			    struct sctp_transport *);
>>  void sctp_assoc_del_nonprimary_peers(struct sctp_association *,
>> --- net-next.orig/net/sctp/sm_statefuns.c
>> +++ net-next/net/sctp/sm_statefuns.c
>> @@ -6176,7 +6176,7 @@ static int sctp_eat_data(const struct sc
>>  	 * PMTU.  In cases, such as loopback, this might be a rather
>>  	 * large spill over.
>>  	 */
>> -	if ((!chunk->data_accepted) && (!asoc->rwnd || asoc->rwnd_over ||
>> +	if ((!chunk->data_accepted) && (!asoc->rwnd ||
>>  	    (datalen > asoc->rwnd + asoc->frag_point))) {
>>
>>  		/* If this is the next TSN, consider reneging to make
>> --- net-next.orig/net/sctp/socket.c
>> +++ net-next/net/sctp/socket.c
>> @@ -2092,12 +2092,6 @@ static int sctp_recvmsg(struct kiocb *io
>>  		sctp_skb_pull(skb, copied);
>>  		skb_queue_head(&sk->sk_receive_queue, skb);
>>
>> -		/* When only partial message is copied to the user, increase
>> -		 * rwnd by that amount. If all the data in the skb is read,
>> -		 * rwnd is updated when the event is freed.
>> -		 */
>> -		if (!sctp_ulpevent_is_notification(event))
>> -			sctp_assoc_rwnd_increase(event->asoc, copied);
>>  		goto out;
>>  	} else if ((event->msg_flags & MSG_NOTIFICATION) ||
>>  		   (event->msg_flags & MSG_EOR))
>> --- net-next.orig/net/sctp/ulpevent.c
>> +++ net-next/net/sctp/ulpevent.c
>> @@ -989,7 +989,7 @@ static void sctp_ulpevent_receive_data(s
>>  	skb = sctp_event2skb(event);
>>  	/* Set the owner and charge rwnd for bytes received.  */
>>  	sctp_ulpevent_set_owner(event, asoc);
>> -	sctp_assoc_rwnd_decrease(asoc, skb_headlen(skb));
>> +	sctp_assoc_rwnd_update(asoc, false);
>>
>>  	if (!skb->data_len)
>>  		return;
>> @@ -1035,8 +1035,9 @@ static void sctp_ulpevent_release_data(s
>>  	}
>>
>>  done:
>> -	sctp_assoc_rwnd_increase(event->asoc, len);
>> -	sctp_ulpevent_release_owner(event);
>> +	atomic_sub(event->rmem_len, &event->asoc->rmem_alloc);
>> +	sctp_assoc_rwnd_update(event->asoc, true);
>> +	sctp_association_put(event->asoc);
> 
> Can't we simply change the order of window update and release instead
> of open coding it like this?

That was the initial idea, but sctp_ulpevent_release_owner() puts the association and calls sctp_association_destroy() if it is time to do so. IMHO, if we switched the order, we would open a potential race condition.

I agree this doesn't look great. But since we must call sctp_assoc_rwnd_update() after the accounting and before the put, the only other option is to move sctp_assoc_rwnd_update() into sctp_ulpevent_release_owner(). On this path we want to update the peer and generate a SACK, but we certainly do not want that on every path where sctp_ulpevent_release_owner() is used. So I see no alternative but to add a parameter to sctp_ulpevent_release_owner() that is simply passed through to sctp_assoc_rwnd_update(): bool update_peer. On the other hand, I wonder whether sctp_ulpevent_release_owner() would then do more than it should?
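
To make the ordering constraint concrete, here is a minimal userspace sketch. It assumes a plain, non-atomic counter and a made-up struct assoc; none of these names are the real SCTP structures or helpers. It only shows why the update has to come after the accounting and before the put: the put may free the object.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for an association; not struct sctp_association. */
struct assoc {
	int refcnt;       /* number of holders */
	int rmem_alloc;   /* bytes still charged to receive memory */
	int rwnd_updates; /* how often the window was recomputed */
};

/* Drop one reference and free the object when the last holder lets go.
 * Returns true if the object was freed and must not be touched again.
 */
static bool assoc_put(struct assoc *a)
{
	if (--a->refcnt == 0) {
		free(a);
		return true;
	}
	return false;
}

/* Stand-in for the window recalculation; it dereferences the object. */
static void assoc_rwnd_update(struct assoc *a)
{
	a->rwnd_updates++;
}

/* Order argued for above: uncharge, update, and only then put. */
static bool release_data(struct assoc *a, int len)
{
	a->rmem_alloc -= len;  /* 1. accounting */
	assoc_rwnd_update(a);  /* 2. uses 'a', so the reference is still needed */
	return assoc_put(a);   /* 3. may free 'a'; nothing may follow this */
}
```

Swapping steps 2 and 3 in this sketch would touch freed memory whenever the put drops the last reference, which is the single-threaded version of the problem described above.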

> 
> -vlad
> 
>>  }
>>
>>  static void sctp_ulpevent_release_frag_data(struct sctp_ulpevent *event)
> 


* Re: [PATCH v2 1/2] dp83640: Support a configurable number of periodic outputs
From: Richard Cochran @ 2014-02-11 20:09 UTC (permalink / raw)
  To: Stefan Sørensen
  Cc: grant.likely, robh+dt, mark.rutland, netdev, linux-kernel,
	devicetree
In-Reply-To: <1392132562-23644-2-git-send-email-stefan.sorensen@spectralink.com>

On Tue, Feb 11, 2014 at 04:29:21PM +0100, Stefan Sørensen wrote:

> diff --git a/drivers/net/phy/dp83640.c b/drivers/net/phy/dp83640.c
> index 547725f..d4fe95d 100644
> --- a/drivers/net/phy/dp83640.c
> +++ b/drivers/net/phy/dp83640.c
> @@ -38,15 +38,11 @@
>  #define LAYER4		0x02
>  #define LAYER2		0x01
>  #define MAX_RXTS	64
> -#define N_EXT_TS	6
> +#define N_EXT		8
>  #define PSF_PTPVER	2
>  #define PSF_EVNT	0x4000
>  #define PSF_RX		0x2000
>  #define PSF_TX		0x1000
> -#define EXT_EVENT	1

Regarding this EXT_EVENT thing ...

> @@ -430,12 +419,12 @@ static int ptp_dp83640_enable(struct ptp_clock_info *ptp,
>  	switch (rq->type) {
>  	case PTP_CLK_REQ_EXTTS:
>  		index = rq->extts.index;
> -		if (index < 0 || index >= N_EXT_TS)
> +		if (index < 0 || index >= n_ext_ts)
>  			return -EINVAL;
> -		event_num = EXT_EVENT + index;
> +		event_num = index;

there was a mapping between the "event numbers" and the external time
stamp channels. I don't remember off the top of my head why these
two differ by one, but there was a good reason.

Are you sure this is still working with this change?

I am especially wondering about the event decoding here:

> @@ -642,7 +631,7 @@ static void recalibrate(struct dp83640_clock *clock)
>  
>  static inline u16 exts_chan_to_edata(int ch)
>  {
> -	return 1 << ((ch + EXT_EVENT) * 2);
> +	return 1 << ((ch) * 2);
>  }

Maybe I am just paranoid, but can you remind me how these event
numbers are supposed to work, before and after the change?
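
For reference, a small standalone sketch of the arithmetic on both sides of the patch. It only reproduces the two expressions from the quoted hunks (the names with _before/_after suffixes are made up here) and says nothing about which values the hardware actually expects.

```c
#include <stdint.h>

#define EXT_EVENT 1 /* the offset that the patch removes */

/* Event number programmed for external time stamp channel 'index'. */
static int event_num_before(int index) { return EXT_EVENT + index; }
static int event_num_after(int index)  { return index; }

/* Bit selected by exts_chan_to_edata() for channel 'ch'. */
static uint16_t edata_before(int ch)
{
	return (uint16_t)(1u << ((ch + EXT_EVENT) * 2));
}

static uint16_t edata_after(int ch)
{
	return (uint16_t)(1u << (ch * 2));
}
```

With the old code, channel 0 maps to event 1 and bit 0x0004; with the new code it maps to event 0 and bit 0x0001. Every channel shifts down by one event, that is, by two bit positions.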

Thanks,
Richard


* Re: RFC: bridge get fdb by bridge device
From: Jamal Hadi Salim @ 2014-02-11 20:15 UTC (permalink / raw)
  To: vyasevic, netdev@vger.kernel.org
  Cc: Stephen Hemminger, Scott Feldman, John Fastabend
In-Reply-To: <52FA6A24.3030402@redhat.com>

On 02/11/14 13:21, Vlad Yasevich wrote:
> On 02/11/2014 12:07 PM, Jamal Hadi Salim wrote:
>> On 02/10/14 11:31, Vlad Yasevich wrote:

> No, this was more the point that the current iproute code sends an
> ifinfomsg struct down, and you change that to send ndmsg struct.
> This is risky, but we luck out since the index is at the same offset
> in both structs.
>

Ah, ok, thanks for catching that. I should have said something - the
original code was wrong and I felt it was safe to make the change
given that the kernel code never even looked at what was being
sent to it. There is a desired symmetry which is violated:
it doesn't make sense to send an ifm and expect back an ndm.
I should send that separately as a bug fix.
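
A quick way to see the "index is at the same offset" point is the sketch below. The two structs are local mirrors of the uapi layouts as I understand them (an assumption; check linux/rtnetlink.h and linux/neighbour.h for the authoritative definitions), and the helper names are made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Local mirror of struct ifinfomsg (assumed layout). */
struct ifinfomsg_mirror {
	unsigned char  ifi_family;
	unsigned char  ifi_pad;
	unsigned short ifi_type;
	int            ifi_index;
	unsigned int   ifi_flags;
	unsigned int   ifi_change;
};

/* Local mirror of struct ndmsg (assumed layout). */
struct ndmsg_mirror {
	uint8_t  ndm_family;
	uint8_t  ndm_pad1;
	uint16_t ndm_pad2;
	int32_t  ndm_ifindex;
	uint16_t ndm_state;
	uint8_t  ndm_flags;
	uint8_t  ndm_type;
};

/* Byte offset of the interface index in each header. */
static size_t ifm_index_offset(void)
{
	return offsetof(struct ifinfomsg_mirror, ifi_index);
}

static size_t ndm_index_offset(void)
{
	return offsetof(struct ndmsg_mirror, ndm_ifindex);
}
```

Both offsets come out the same on common ABIs, which is why a receiver that only reads the index works with either header. The structs still differ in size and in every field after the index, so relying on this is fragile.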

> But that would only happen if the user said:
>    # bridge fdb show br eth0
>
> If eth0 in this case is a hw bridge device, getting the device's
> version of fdb data is exactly what would be expected, isn't it?
>

Well, if it is a "bridge device" why would it not be tagged as a bridge
device?

> If you mean a 'software bridge' above, then that's not an issue
> since that's a disallowed config.  You can't stack software bridges
> without something in the middle like bond or vlan.
>

Ok, didn't realize that.
So I can't add a bridge as a bridge port to another bridge?

>
> Yes, macvlan can forward data to other macvlans, but that's
> not the interesting thing.

Sample config?

> When you configure multiple macvlan devices on top of the
> same hw device, one could think of the hw device as a sort
> of a bridge.  It's not really, but you could define it in
> those terms.  The fdb entries, in this case, contain the mac
> addresses of the macvlan devices.
>

It certainly has some equivalent semantics (looks at dst MAC then
picks the port). Possible to add VLANs as well?
Why don't we tag such a thing as a bridge then?

>
> Sorry, I wasn't very clear. What I meant was that you now support
>    # bridge fdb show port <>
>
> The usage message should reflect it.
>

Sorry - I noticed the word "port" exactly where your quote came.
So I thought you noticed that "port" was already taken - it is used
for VXLAN fdb entries (for udp ports).

cheers,
jamal


* Re: [PATCH v2 2/2] dp83640: Get pin and master/slave configuration from DT
From: Richard Cochran @ 2014-02-11 20:19 UTC (permalink / raw)
  To: Stefan Sørensen
  Cc: grant.likely, robh+dt, mark.rutland, netdev, linux-kernel,
	devicetree
In-Reply-To: <1392132562-23644-3-git-send-email-stefan.sorensen@spectralink.com>

On Tue, Feb 11, 2014 at 04:29:22PM +0100, Stefan Sørensen wrote:

> diff --git a/Documentation/devicetree/bindings/net/dp83640.txt b/Documentation/devicetree/bindings/net/dp83640.txt
> new file mode 100644
> index 0000000..b9a57c0
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/net/dp83640.txt
> @@ -0,0 +1,29 @@
> +Required properties for the National DP83640 ethernet phy:
> +
> +- compatible : Must contain "national,dp83640"
> +
> +Optional properties:
> +
> +- dp83640,slave: If present, this phy will be slave to another dp83640
> +  on the same mdio bus.

Wouldn't it be more natural to have one "dp83640,master" property
rather than multiple slave properties?

> @@ -949,6 +940,95 @@ static void dp83640_clock_put(struct dp83640_clock *clock)
>  	mutex_unlock(&clock->clock_lock);
>  }
>  
> +#ifdef CONFIG_OF
> +static int dp83640_probe_dt(struct device_node *node,
> +			    struct dp83640_private *dp83640)
> +{
> +	struct dp83640_clock *clock = dp83640->clock;
> +	struct property *prop;
> +	int err, proplen;
> +
> +	dp83640->slave = of_property_read_bool(node, "dp83640,slave");
> +	if (!dp83640->slave && clock->chosen) {
> +		pr_err("dp83640,slave must be set if more than one device on the same bus");

Most of these pr_err lines are _way_ too long for coding style.

> +		return -EINVAL;
> +	}
> +
> +	prop = of_find_property(node, "dp83640,perout-pins", &proplen);
> +	if (prop) {
> +		if (dp83640->slave) {
> +			pr_err("dp83640,perout-pins property can not be set together with dp83640,slave");

(Here especially and in the code that followed.)

Overall the series is looking better. I will try to test the non-DT
case later on this week.

Thanks,
Richard


* Re: xfrm: is pmtu broken with ESP tunneling?
From: Ortwin Glück @ 2014-02-11 20:20 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: linux-kernel, netdev
In-Reply-To: <20140211023258.GC11150@order.stressinduktion.org>

On 02/11/2014 03:32 AM, Hannes Frederic Sowa wrote:
>> net.ipv4.ip_no_pmtu_disc=1.
>
> This setting will shrink the path mtu to min_pmtu when a frag needed icmp is
> received.

The UDP+ESP encapsulation adds 60 bytes to the original packet size.

ifconfig wla0 shows an mtu of 1500.

The size of the first big packet on the interface:
net.ipv4.ip_no_pmtu_disc=1: packet length is 1300
net.ipv4.ip_no_pmtu_disc=0: packet length is 1500

Length is without the ESP wrapper and UDP encapsulation. The packets are so big 
that they can't even leave the wireless interface and never show up on the 
router. So no ICMP packets are received. PMTU can't work with initial packets of 
that size.

Dumb question: which layer discards these packets? The qdisc? Why is there no
notification to the sender?

When I increase the mtu of the interface to 2000 with ifconfig, then I start 
seeing ICMP fragmentation needed from the next hop, indicating 1500 as the mtu 
as response to a 1560 byte UDP[ESP] packet.

The next UDP[ESP] packet is shorter: 1360 bytes. It gets hard to see what's 
going on after that, but the connection is still not working.

So, instead of somehow losing these packets on the way out of the interface,
should the kernel not start with a lower mtu in the first place? Right now it
seems to try the maximum of the interface and expects to scale down with
pmtu, which can never happen.
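
The numbers above can be checked with a tiny sketch. It assumes the 60-byte UDP+ESP overhead quoted earlier in this message; the helper names are made up for illustration.

```c
#define ENCAP_OVERHEAD 60 /* UDP+ESP bytes, figure taken from this message */

/* Size on the wire once an inner packet has been wrapped. */
static int outer_len(int inner_len)
{
	return inner_len + ENCAP_OVERHEAD;
}

/* Does the wrapped packet fit through a link with this MTU? */
static int fits(int inner_len, int link_mtu)
{
	return outer_len(inner_len) <= link_mtu;
}

/* Largest inner packet that still fits after encapsulation. */
static int max_inner_len(int link_mtu)
{
	return link_mtu - ENCAP_OVERHEAD;
}
```

A 1500-byte inner packet becomes 1560 bytes and cannot leave a 1500 MTU interface, while a 1300-byte one becomes 1360 bytes and can. Under this assumption the starting inner MTU would have to be 1440 or less.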

> Can you send a ip route get <ip> to the problematic target to see how
> far off the calculated value is?

That command doesn't return anything useful. No hint on the mtu here.

BTW, instead of disabling pmtu, setting mtu explicitly also helps:
ip route add 10.6.6.0/24 via ${localip} mtu 1300

Thanks,

Ortwin


* Re: RFC: bridge get fdb by bridge device
From: John Fastabend @ 2014-02-11 20:21 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: vyasevic, netdev@vger.kernel.org, Stephen Hemminger,
	Scott Feldman
In-Reply-To: <52FA84FA.2030608@mojatatu.com>

On 2/11/2014 12:15 PM, Jamal Hadi Salim wrote:
>
>>
>> Yes, macvlan can forward data to other macvlans, but that's
>> not the interesting thing.
>
> Sample config?

ip link add link ethx name mv1 type macvlan mode bridge
ip link add link ethx name mv2 type macvlan mode bridge

Now you have a macvlan on ethx that will forward data between
mv1 and mv2.


* Re: RFC: bridge get fdb by bridge device
From: John Fastabend @ 2014-02-11 20:30 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: vyasevic, netdev@vger.kernel.org, Stephen Hemminger,
	Scott Feldman
In-Reply-To: <52FA84FA.2030608@mojatatu.com>

On 2/11/2014 12:15 PM, Jamal Hadi Salim wrote:
> On 02/11/14 13:21, Vlad Yasevich wrote:
>> On 02/11/2014 12:07 PM, Jamal Hadi Salim wrote:
>>> On 02/10/14 11:31, Vlad Yasevich wrote:
>
>> No, this was more the point that the current iproute code sends an
>> ifinfomsg struct down, and you change that to send ndmsg struct.
>> This is risky, but we luck out since the index is at the same offset
>> in both structs.
>>
>
> ah, ok, thanks for catching that. I should have said something - the
> original code was wrong and i felt it was safe to make the change
> given that the kernel code never even looked at what was being
> sent to it. There is asymetry desires which are violated.
> It doesnt make sense to send and ifm and expect back an ndm.
> I should send that separately as a bug fix.
>
>
>> But that would only happen if the user said:
>>    # bridge fdb show br eth0
>>
>> If eth0 in this case is a hw bridge device, getting the device's
>> version of fdb data is exactly what would be expected, isn't it?
>>
>
> Well, if it is a "bridge device" why would it not be tagged as a bridge
> device?

What do you mean by "bridge device"? Are you specifically talking about
the IFF_BRIDGE flag? This flag is used only for ./net/bridge devices. For
example, macvlan uses its own flag. I think there is a good case to be
made for netdevices which are acting as the management interface for a
hardware bridge to set an identifying flag. Perhaps IFF_HWBRIDGE.
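
As a rough sketch of what such a flag would buy: the bit values below are invented for illustration, they are not the kernel's real definitions, and IFF_HWBRIDGE is only the name proposed above.

```c
#include <stdbool.h>

/* All three bit values are made up for this sketch. */
#define SKETCH_IFF_BRIDGE   (1u << 0) /* software bridge from net/bridge */
#define SKETCH_IFF_MACVLAN  (1u << 1) /* macvlan's own flag */
#define SKETCH_IFF_HWBRIDGE (1u << 2) /* proposed: manages a hardware bridge */

/* One flag test instead of per-driver knowledge in the caller. */
static bool manages_bridge(unsigned int flags)
{
	return (flags & (SKETCH_IFF_BRIDGE | SKETCH_IFF_HWBRIDGE)) != 0;
}
```

The point is only that a caller could decide with a single mask whether asking the device for its fdb makes sense.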

>
>> If you mean a 'software bridge' above, then that's not an issue
>> since that's a disallowed config.  You can't stack software bridges
>> without something in the middle like bond or vlan.
>>
>
> Ok, didnt realize that.
> So i cant add a bridge as a bridge port to another bridge?
>

# ip link set dev bridge0 master bridge1
RTNETLINK answers: Too many levels of symbolic links

In the bridge case this doesn't work. But you can stack a macvlan
on top of the bridge port:

# ip link add link bridge0 type macvlan mode vepa

11: macvlan0@bridge0: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop 
state DOWN mode DEFAULT group default

And macvlans on macvlans is OK as well.

# ip link add link macvlan0 type macvlan mode vepa

[...]

>> When you configure multiple macvlan devices on top of the
>> same hw device, one could think of the hw device as a sort
>> of a bridge.  It's not really, but you could define it in
>> those terms.  The fdb entries, in this case, contain the mac
>> addresses of the macvlan devices.
>>
>
> It certainly has some equivalent semantics (looks at dst MAC then
> picks the port). Possible to add Vlans as well?
> Why dont we tag such a thing as a bridge then?
>

If it's useful then we should. You can track them down in userspace
via /sys/class/net/ or looking for offloaded netdevices that point
to the interface but a flag is definitely more direct.

>>
>> Sorry, I wasn't very clear. What I meant was that you now support
>>    # bridge fdb show port <>
>>
>> The usage message should reflect it.
>>
>
> Sorry - I noticed the word "port" at exactly where your quote came.
> So i thought you noticed that "port" was already taken - it is used
> for VXLAN fdb entries (for udp ports).
>
>
> cheers,
> jamal


* Re: [PATCH 1/6] staging: r8188eu: Replace wrapper around _rtw_memcmp()
From: Greg KH @ 2014-02-11 20:40 UTC (permalink / raw)
  To: Larry Finger; +Cc: devel, netdev
In-Reply-To: <1391980559-24288-2-git-send-email-Larry.Finger@lwfinger.net>

On Sun, Feb 09, 2014 at 03:15:54PM -0600, Larry Finger wrote:
> This wrapper is replaced with a simple memcmp(). As the wrapper inverts the
> logic of memcmp(), care needed to be taken.

That's just evil, ugh, nice job...
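
For anyone wondering why care was needed, a minimal sketch of the inversion. wrapped_memcmp() is a hypothetical wrapper written in the style the commit message describes, not the driver's actual code, and the 6-byte length is just an example.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical wrapper: true means "equal", the opposite of memcmp(). */
static bool wrapped_memcmp(const void *a, const void *b, size_t n)
{
	return memcmp(a, b, n) == 0;
}

/* Call site before the cleanup. */
static bool same_addr_old(const unsigned char *a, const unsigned char *b)
{
	return wrapped_memcmp(a, b, 6);
}

/* Call site after the cleanup: the test has to be negated. */
static bool same_addr_new(const unsigned char *a, const unsigned char *b)
{
	return !memcmp(a, b, 6);
}
```

A mechanical rename without the negation would silently flip every comparison.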


* Re: [PATCH 6/6] staging: r8188eu: Remove _func_enter and _func_exit macros
From: Greg KH @ 2014-02-11 20:41 UTC (permalink / raw)
  To: Larry Finger; +Cc: devel, netdev
In-Reply-To: <1391980559-24288-7-git-send-email-Larry.Finger@lwfinger.net>

On Sun, Feb 09, 2014 at 03:15:59PM -0600, Larry Finger wrote:
> These debugging macros are seldom used for debugging once the driver
> is working. If routine tracing is needed, it can be added on an
> individual basis.

No, you can use the in-kernel tracing functionality :)

nice job.

greg k-h


* Re: [PATCH 3/3] net: GSO encapsulation for IP packets
From: Tom Herbert @ 2014-02-11 20:45 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: David S. Miller, Linux Netdev List, Or Gerlitz
In-Reply-To: <CAADnVQ+PcO9-BGVE+ExFZt_sxhN1gDtiXtxMHPfHEX7wB3cWWg@mail.gmail.com>

On Tue, Feb 11, 2014 at 11:12 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Tue, Feb 11, 2014 at 9:43 AM, Tom Herbert <therbert@google.com> wrote:
>> The UDP GSO code assume that only encapsulated packets are Ethernet
>> frames. This patch fixes that so that we can support IP protocol
>> encpasulation (GUE, GRE/UDP, etc.)
>>
>> We overload the inner_protocol field in the skb to store either the
>> Ethertype or the IP protocol (latter is indicated by ip_encapsulation
>> bit). As far as I can tell this should not adversely affect preexiting
>> uses for inner_protocol.
>>
>> Signed-off-by
>> +++ b/net/ipv4/udp.c
>> @@ -2497,7 +2497,17 @@ struct sk_buff *skb_udp_tunnel_segment(struct sk_buff *skb,
>>
>>         /* segment inner packet. */
>>         enc_features = skb->dev->hw_enc_features & netif_skb_features(skb);
>> -       segs = skb_mac_gso_segment(skb, enc_features);
>> +
>> +       if (skb->ip_encapsulation) {
>> +               const struct net_offload *ops;
>> +               ops = rcu_dereference(inet_offloads[skb->inner_protocol]);
>> +               if (likely(ops && ops->callbacks.gso_segment))
>> +                       segs = ops->callbacks.gso_segment(skb, enc_features);
>> +       } else {
>> +               skb->protocol = htons(ETH_P_TEB);
>
> duplicate assignment ? Do you want to remove line 2496 which did the same
> or proto=teb applies to ip_encap case as well?
>

Thanks for catching that!  I think the assignment at 2496 should be removed.

>> +               segs = skb_mac_gso_segment(skb, enc_features);
>> +       }
>> +
>>         if (!segs || IS_ERR(segs)) {
>>                 skb_gso_error_unwind(skb, protocol, tnl_hlen, mac_offset,
>>                                      mac_len);
>> --
>> 1.9.0.rc1.175.g0b1dcb5
>>


* Re: RFC: bridge get fdb by bridge device
From: Vlad Yasevich @ 2014-02-11 21:00 UTC (permalink / raw)
  To: Jamal Hadi Salim, netdev@vger.kernel.org
  Cc: Stephen Hemminger, Scott Feldman, John Fastabend
In-Reply-To: <52FA84FA.2030608@mojatatu.com>

On 02/11/2014 03:15 PM, Jamal Hadi Salim wrote:
> On 02/11/14 13:21, Vlad Yasevich wrote:
>> On 02/11/2014 12:07 PM, Jamal Hadi Salim wrote:
>>> On 02/10/14 11:31, Vlad Yasevich wrote:
> 
>> No, this was more the point that the current iproute code sends an
>> ifinfomsg struct down, and you change that to send ndmsg struct.
>> This is risky, but we luck out since the index is at the same offset
>> in both structs.
>>
> 
> ah, ok, thanks for catching that. I should have said something - the
> original code was wrong and i felt it was safe to make the change
> given that the kernel code never even looked at what was being
> sent to it. There is asymetry desires which are violated.
> It doesnt make sense to send and ifm and expect back an ndm.
> I should send that separately as a bug fix.
> 
> 
>> But that would only happen if the user said:
>>    # bridge fdb show br eth0
>>
>> If eth0 in this case is a hw bridge device, getting the device's
>> version of fdb data is exactly what would be expected, isn't it?
>>
> 
> Well, if it is a "bridge device" why would it not be tagged as a bridge
> device?

Because it is just a multi-function NIC that isn't tagged with any
kind of bridge flag.  As John said, this might be useful, but it is
not done yet.

> 
>> If you mean a 'software bridge' above, then that's not an issue
>> since that's a disallowed config.  You can't stack software bridges
>> without something in the middle like bond or vlan.
>>
> 
> Ok, didnt realize that.
> So i cant add a bridge as a bridge port to another bridge?

Not directly.  However, if you put a layered software device in between
(vlan, bond, macvlan), then you can add that device to another bridge.
In fact, people do that to get GVRP working with VMs.

> 
>>
>> Yes, macvlan can forward data to other macvlans, but that's
>> not the interesting thing.
> 
> Sample config?
> 
>> When you configure multiple macvlan devices on top of the
>> same hw device, one could think of the hw device as a sort
>> of a bridge.  It's not really, but you could define it in
>> those terms.  The fdb entries, in this case, contain the mac
>> addresses of the macvlan devices.
>>
> 
> It certainly has some equivalent semantics (looks at dst MAC then
> picks the port). Possible to add Vlans as well?

I suppose.   You can do things like:
# ip link add link eth0 dev vlan100 type vlan protocol 802.1Q id 100
# ip link add link vlan100 dev mac100 type macvlan

Now, you have a macvlan (mac100) that will only receive vlan100 traffic.
Expressing this in terms of fdb would be a bit difficult since each
interface is separate and eth0 doesn't really know about the stack.
It would require quite a lot of code.

> Why dont we tag such a thing as a bridge then?
> 

Because they are not always a bridge.  It could be just a nic capable of
mac filtering.

>>
>> Sorry, I wasn't very clear. What I meant was that you now support
>>    # bridge fdb show port <>
>>
>> The usage message should reflect it.
>>
> 
> Sorry - I noticed the word "port" at exactly where your quote came.
> So i thought you noticed that "port" was already taken - it is used
> for VXLAN fdb entries (for udp ports).
>

Didn't realize it has a different connotation for vxlan.  Then you probably
don't want to include and support it in the bridge fdb show command.

-vlad

> 
> cheers,
> jamal


* Re: RFC: bridge get fdb by bridge device
From: Jamal Hadi Salim @ 2014-02-11 21:04 UTC (permalink / raw)
  To: John Fastabend
  Cc: vyasevic, netdev@vger.kernel.org, Stephen Hemminger,
	Scott Feldman
In-Reply-To: <52FA8865.1070302@intel.com>

On 02/11/14 15:30, John Fastabend wrote:
> On 2/11/2014 12:15 PM, Jamal Hadi Salim wrote:

Thanks for the example on the other email.


> What do you mean by "bridge device" are you specifically talking about
> IFF_BRIDGE flag? This flag is used only for ./net/bridge devices.

Right - the simple definition is that this thing has an fdb.
Yes, I know we've added vlan filtering and multicast snooping,
but that's all lipstick. If it has an (ethernet) fdb it is a bridge.

>For
> example macvlan uses its own flag. I think there is a good case to be
> made for netdevices which are acting as the management interface for a
> hardware bridge to set an identifying flag. Perhaps IFF_HWBRIDGE.
>

If you introduce IFF_HWBRIDGE - I think that would satisfy the
distinction. The question then is why not just tag it IFF_BRIDGE?

>
> # ip link set dev bridge0 master bridge1
> RTNETLINK answers: Too many levels of symbolic links
>

Why?  If the original rationale was to limit the
broadcast domain scope, it sounds strange that a bridge in
the form of a macvlan is allowed.

> in the bridge case this doesn't work. But you can stack a macvlan
> on top of the bridge port,
>
> # ip link add link bridge0 type macvlan mode vepa
>
> 11: macvlan0@bridge0: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop
> state DOWN mode DEFAULT group default
>
> And macvlans on macvlans is OK as well.
>
> # ip link add link macvlan0 type macvlan mode vepa
>
> [...]
>

Ok, I need to let that sink in. Cool actually.


>
> If its useful then we should. You can track them down in userspace
> via /sys/class/net/ or looking for offloaded netdevices that point
> to the interface but a flag is definitely more direct.
>

I prefer a flag. Then I can deal with it via netlink.

cheers,
jamal


* Re: RFC: bridge get fdb by bridge device
From: Jamal Hadi Salim @ 2014-02-11 21:08 UTC (permalink / raw)
  To: vyasevic, netdev@vger.kernel.org
  Cc: Stephen Hemminger, Scott Feldman, John Fastabend
In-Reply-To: <52FA8F8B.3080500@redhat.com>

On 02/11/14 16:00, Vlad Yasevich wrote:
> On 02/11/2014 03:15 PM, Jamal Hadi Salim wrote:

>
> Because it just a multi-function nic that isn't tagged with any
> kine of bridge flag.  As John said, this might be useful, but not
> done yet.
>

Ok, fair enough. Someone should send a patch - John perhaps.

>
> Not directly.  However, if you put a layered software device in between
> (vlan, bond, macvlan), then you can add that device to another bridge.
> In fact, people do that to get GVRP working with VMs.
>

Do you recall the reasoning behind it?


>> It certainly has some equivalent semantics (looks at dst MAC then
>> picks the port). Possible to add Vlans as well?
>
> I suppose.   You can do things like:
> # ip link add link eth0 dev vlan100 protocol 8021Q id 100
> # ip link add link vlan0 dev mac100 type macvlan
>
> Now, you have a macvlan (mac100) that will only receive vlan100 traffic.
> Expressing this in terms of fdb would be a bit difficult since each
> interface is separate and eth0 doesn't really know about the stack.
> It would require quite a lot of code.
>

nice.

>> Why dont we tag such a thing as a bridge then?
>>
>
> Because they are not always a bridge.  It could be just a nic capable of
> mac filtering.
>

I think in one of the modes it is merely a filter.
But if you turn on this other feature, it is a bridge.

>
> Didn't realize it has different connotation for vxlan.  The you probably
> don't want to include and support in the bridge fdb show command.

That's what I thought you said earlier ;->

cheers,
jamal

