Netdev List
 help / color / mirror / Atom feed
* Re: Lockdep warning in vxlan
From: Stephen Hemminger @ 2012-12-20 18:22 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Yan Burman, netdev
In-Reply-To: <1356027360.21834.2973.camel@edumazet-glaptop>

On Thu, 20 Dec 2012 10:16:00 -0800
Eric Dumazet <erdnetdev@gmail.com> wrote:

> On Thu, 2012-12-20 at 08:34 -0800, Stephen Hemminger wrote:
> > On Thu, 20 Dec 2012 16:00:32 +0200
> > Yan Burman <yanb@mellanox.com> wrote:
> >   
> > > Hi.
> > > 
> > > When working with vxlan from current net-next, I got a lockdep warning 
> > > (below).
> > > It seems to happen when I have host B pinging host A and while the pings 
> > > continue,
> > > I do "ip link del" on the vxlan interface on host A. The lockdep warning 
> > > is on host A.
> > > Tell me if you need some more info.
> > >   
> > 
> > Looks like the case of nested ARP requests, the initial request is coming
> > from neigh_timer (ARP retransmit), but inside neigh_probe the lock
> > is dropped?  
> 
> Bug is from arp_solicit(), releasing the lock after arp_send()
> 
> Its used to protect neigh->ha
> 
> We could instead copy neigh->ha, without taking n->lock but ha_lock
> seqlock, using neigh_ha_snapshot() helper
> 
> Yan, could you test the following patch ?
> 
> Thanks
> diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
> index ce6fbdf..1169ed4 100644
> --- a/net/ipv4/arp.c
> +++ b/net/ipv4/arp.c
> @@ -321,7 +321,7 @@ static void arp_error_report(struct neighbour *neigh, struct sk_buff *skb)
>  static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
>  {
>  	__be32 saddr = 0;
> -	u8  *dst_ha = NULL;
> +	u8 dst_ha[MAX_ADDR_LEN];
>  	struct net_device *dev = neigh->dev;
>  	__be32 target = *(__be32 *)neigh->primary_key;
>  	int probes = atomic_read(&neigh->probes);
> @@ -363,9 +363,9 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
>  	if (probes < 0) {
>  		if (!(neigh->nud_state & NUD_VALID))
>  			pr_debug("trying to ucast probe in NUD_INVALID\n");
> -		dst_ha = neigh->ha;
> -		read_lock_bh(&neigh->lock);
> +		neigh_ha_snapshot(dst_ha, neigh, dev);
>  	} else {
> +		memset(dst_ha, 0, dev->addr_len);
>  		probes -= neigh->parms->app_probes;
>  		if (probes < 0) {
>  #ifdef CONFIG_ARPD
> @@ -377,8 +377,6 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
>  
>  	arp_send(ARPOP_REQUEST, ETH_P_ARP, target, dev, saddr,
>  		 dst_ha, dev->dev_addr, NULL);
> -	if (dst_ha)
> -		read_unlock_bh(&neigh->lock);
>  }
>  
>  static int arp_ignore(struct in_device *in_dev, __be32 sip, __be32 tip)

I like this. Getting rid of yet another read lock

^ permalink raw reply

* Re: Lockdep warning in vxlan
From: Eric Dumazet @ 2012-12-20 18:16 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Yan Burman, netdev
In-Reply-To: <20121220083436.0c7fc33f@nehalam.linuxnetplumber.net>

On Thu, 2012-12-20 at 08:34 -0800, Stephen Hemminger wrote:
> On Thu, 20 Dec 2012 16:00:32 +0200
> Yan Burman <yanb@mellanox.com> wrote:
> 
> > Hi.
> > 
> > When working with vxlan from current net-next, I got a lockdep warning 
> > (below).
> > It seems to happen when I have host B pinging host A and while the pings 
> > continue,
> > I do "ip link del" on the vxlan interface on host A. The lockdep warning 
> > is on host A.
> > Tell me if you need some more info.
> > 
> 
> Looks like the case of nested ARP requests, the initial request is coming
> from neigh_timer (ARP retransmit), but inside neigh_probe the lock
> is dropped?

Bug is from arp_solicit(), releasing the lock after arp_send()

Its used to protect neigh->ha

We could instead copy neigh->ha, without taking n->lock but ha_lock
seqlock, using neigh_ha_snapshot() helper

Yan, could you test the following patch ?

Thanks
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index ce6fbdf..1169ed4 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -321,7 +321,7 @@ static void arp_error_report(struct neighbour *neigh, struct sk_buff *skb)
 static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
 {
 	__be32 saddr = 0;
-	u8  *dst_ha = NULL;
+	u8 dst_ha[MAX_ADDR_LEN];
 	struct net_device *dev = neigh->dev;
 	__be32 target = *(__be32 *)neigh->primary_key;
 	int probes = atomic_read(&neigh->probes);
@@ -363,9 +363,9 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
 	if (probes < 0) {
 		if (!(neigh->nud_state & NUD_VALID))
 			pr_debug("trying to ucast probe in NUD_INVALID\n");
-		dst_ha = neigh->ha;
-		read_lock_bh(&neigh->lock);
+		neigh_ha_snapshot(dst_ha, neigh, dev);
 	} else {
+		memset(dst_ha, 0, dev->addr_len);
 		probes -= neigh->parms->app_probes;
 		if (probes < 0) {
 #ifdef CONFIG_ARPD
@@ -377,8 +377,6 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
 
 	arp_send(ARPOP_REQUEST, ETH_P_ARP, target, dev, saddr,
 		 dst_ha, dev->dev_addr, NULL);
-	if (dst_ha)
-		read_unlock_bh(&neigh->lock);
 }
 
 static int arp_ignore(struct in_device *in_dev, __be32 sip, __be32 tip)

^ permalink raw reply related

* Re: at91sam9260 MACB problem with IP fragmentation
From: Nicolas Ferre @ 2012-12-20 17:51 UTC (permalink / raw)
  To: Erwin Rol
  Cc: linux-kernel, Havard Skinnemoen, linux-arm-kernel, matteo.fortini,
	netdev
In-Reply-To: <50D2D7BD.3030801@erwinrol.com>

On 12/20/2012 10:17 AM, Erwin Rol :
> Hallo Nicolas,
> 
> On 6-12-2012 14:27, Nicolas Ferre wrote:
>> Erwin,
>>
>> On 12/06/2012 12:32 PM, Erwin Rol :
>>> Hello Nicolas, Havard, all,
>>>
>>> I have a very obscure problem with a at91sam9260 board (almost 1 to 1
>>> copy of the Atmel EK).
>>>
>>> The MACB seems to stall when I use large (>2 * MTU) UDP datagrams. The
>>> test case is that a udp echo client (PC) sends datagrams with increasing
>>> length to the AT91 until the max length of the UDP datagram is reached.
>>> When there is no IP fragmentation everything is fine, but when the
>>> datagrams are starting to get fragmented the AT91 will not reply
>>> anymore. But as soon as some network traffic happens it goes on again,
>>> and non of the data is lost.
> 
> <snip>
> 
>>> I tried several kernels including the test version from Nicolas that he
>>> posted on LKML in October. They all show the same effect.
>>
>> [..]
>>
>> It seems that Matteo has the same behavior: check here:
>> http://www.spinics.net/lists/netdev/msg218951.html
> 
> I tried Matteo's patch and it seems to work. But I don't know if the
> patch is really the right solution. I checked again with wireshark and
> it really seems the sending that stalls not the receiving. But as soon
> as a ethernet frame is received the sending "un-stalls". So maybe the
> patch just causes an MACB IRQ at certain moments that causes the sending
> to continue?

Any digging is interesting for me.


>> I am working on the macb driver right now, so I will try to reproduce
>> and track this issue on my side.
> 
> Any luck reproducing it ?

Yes, I see unexpected things happening but as I am connected to a whole
company network so maybe some broadcast packets are unlocking the
interface...
Anyway, I am continuing to investigate.

Best regards,--
Nicolas Ferre

^ permalink raw reply

* Re: [PATCH net-next V4 00/13] Add basic VLAN support to bridges
From: Stephen Hemminger @ 2012-12-20 17:07 UTC (permalink / raw)
  To: Vitalii Demianets
  Cc: Andrew Collins, Vlad Yasevich, netdev, davem, or.gerlitz, jhs,
	mst, erdnetdev, jiri
In-Reply-To: <201212201208.14204.vitas@nppfactor.kiev.ua>

On Thu, 20 Dec 2012 12:08:13 +0200
Vitalii Demianets <vitas@nppfactor.kiev.ua> wrote:

> On Thursday 20 December 2012 00:54:27 Andrew Collins wrote:
> > On Wed, Dec 19, 2012 at 10:48 AM, Vlad Yasevich <vyasevic@redhat.com> wrote:
> > > This series of patches provides an ability to add VLANs to the bridge
> > > ports.  This is similar to what can be found in most switches.  The
> > > bridge port may have any number of VLANs added to it including vlan 0
> > > priority tagged traffic.  When vlans are added to the port, only traffic
> > > tagged with particular vlan will forwarded over this port.  Additionally,
> > > vlan ids are added to FDB entries and become part of the lookup.  This
> > > way we correctly identify the FDB entry.
> >
> > This is likely well beyond the scope of this change, but I figured I'd
> > throw out the question anyway.  This changeset looks to bring the
> > Linux bridging code closer to the 802.1Q-2005 definition of a bridge,
> > which is nice to see, I'm curious if this changeset also opens up the
> > possibility of supporting MSTP in the future?  The big thing I see
> > missing is per-VLAN port state, although I'm not very familiar with
> > the current STP/bridge interactions.  Has anyone put any thought into
> > what other necessary bridge pieces might be missing for MSTP support?
> 
> I think, to be compatible with 802.1Q-2005 we need the following pieces:
> 1) Multiple FIDs (it is 802.1Q term for FDB) support. It means that kernel 
> should support several independent FDBs on a single bridge. The 802.1Q-2005 
> standard requires the number of supported FDBs to be no less than the number 
> of different MSTIs the implementation supports;
> 2) VLAN-to-FDB mapping should be introduced;
> 3) Support of Multiple Spanning Tree Instances (MSTIs);
> 4) FDB-to-MSTI mapping should be introduced;
> 5) And finally, per-MST port states should be implemented.
> 
> > obviously something to handle the MSTP protocol itself would need to exist 
> as well
> 
> Please look here: http://sourceforge.net/projects/mstpd/

A couple of points:
* How does this compare with features/functionality of commercial
  hardware bridges?
* Is this as simple as possible? It looks like there is creeping-featurism
  here. I am all for a simple extension to allow bridge vlan filtering, but
  not the added complexity of "let's teach bridges all about all possible
  things any user might want to do with vlan.s"


 

^ permalink raw reply

* Re: [PATCH net-next V4 03/13] bridge: Validate that vlan is permitted on ingress
From: Vlad Yasevich @ 2012-12-20 16:54 UTC (permalink / raw)
  To: Shmulik Ladkani
  Cc: netdev, shemminger, davem, or.gerlitz, jhs, mst, erdnetdev, jiri
In-Reply-To: <20121220182402.6143fcb1@pixies.home.jungo.com>

On 12/20/2012 11:24 AM, Shmulik Ladkani wrote:
> On Thu, 20 Dec 2012 10:41:28 -0500 Vlad Yasevich <vyasevic@redhat.com> wrote:
>>>> +static bool br_allowed_ingress(struct net_bridge_port *p, struct sk_buff *skb)
>>>> +{
>>>> +	struct net_port_vlan *pve;
>>>> +	u16 vid;
>>>> +
>>>> +	/* If there are no vlan in the permitted list, all packets are
>>>> +	 * permitted.
>>>> +	 */
>>>> +	if (list_empty(&p->vlan_list))
>>>> +		return true;
>>>
>>> I assumed the default policy would be Drop in such case, otherwise
>>> leaking between vlan domains is possible.
>>> Or maybe, ingress policy when port isn't a member of ingress VID should
>>> be configurable (drop/allow).
>>
>> We have have to default to allow since we want to retain original bridge
>> functionality if there is no configuration.
>
> Ok; so having the port not a member of ANY vlan is a "port vlan
> disabled" configuration knob, and as such, it is a member of ANY vlan,
> meaning that:
>
> (1) every "non-vlan port" is connected to any other "non-vlan port"

Technically, it would be connected to every over port.

> (2) frame ingress on a "non-vlan" port may egress on a "vlan enabled"
>    port, depending on the ingress VID and the port-membership map of the
>    egress port
>    (and thus, PVID should be defined even to "non-vlan" ports, for the
>    case where untagged frame is received on the non-vlan port)

Sort of.  The way I did it (testing now), is like this:
    if there is egress policy
	apply policy and forward.
    else if there was ingress policy (pvid)
	apply it and forward
    else
	forward as is (old bridge behavior).

This way if there was a pvid on an ingress port and nothing on egress,
then pvid applies.  If there was nothing configured on ingress port,
but we have a egress policy, we'll apply any vlan information from
the frame to egress policy.  In this case, ingress untagged traffic 
would be assigned vlan 0.


> (3) frame ingress on a "vlan-enabled" port would always egress on
>    "non-vlan" ports

yes and they would egress based on their ingress policy.

>
> Seems ok.
> However this is an additional nuance that might not be expected by the
> user configuring the bridge; maybe this needs some clarification.

I'll try to document things sufficiently.  This hybrid approach may
produce some unintended results.  We could always remove it or introduce
the tunable to change default policy to drop once vlan configuration is
in effect.


>
>>>> +	vid = br_get_vlan(skb);
>>>> +	pve = nbp_vlan_find(p, vid);
>>>
>>> Why search by iterating through NBP's vlan_list?
>>> You know the VID (hence may fetch the net_bridge_vlan from the hash), so
>>> why don't you directly consult the net_bridge_vlan's port_bitmap?
>>
>> It's an alternative...  I am betting that this port isn't in too many
>> vlans and that searching the list might be faster.
>
> I assumed the opposite: finding the hash bucket is just a bitwise mask,
> and number of items in a bucket would rarely be grater than 1.
> I expect such code to be shorter, but this needs to be verified.

I'll try to set something up, but that will probably be next year...

-vlad
>
> Regards,
> Shmulik
>

^ permalink raw reply

* Re: [PATCH net-next] af_unix: MSG_TRUNC support for dgram sockets
From: Michael Kerrisk (man-pages) @ 2012-12-20 16:50 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, piergiorgio.beruto, netdev
In-Reply-To: <1329902695.18384.101.camel@edumazet-laptop>

On Wed, Feb 22, 2012 at 10:24 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Piergiorgio Beruto expressed the need to fetch size of first datagram in
> queue for AF_UNIX sockets and suggested a patch against SIOCINQ ioctl.
>
> I suggested instead to implement MSG_TRUNC support as a recv() input
> flag, as already done for RAW, UDP & NETLINK sockets.
>
> len = recv(fd, &byte, 1, MSG_PEEK | MSG_TRUNC);
>
> MSG_TRUNC asks recv() to return the real length of the packet, even when
> is was longer than the passed buffer.
>
> There is risk that a userland application used MSG_TRUNC by accident
> (since it had no effect on af_unix sockets) and this might break after
> this patch.
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> Tested-by: Piergiorgio Beruto <piergiorgio.beruto@gmail.com>
> CC: Michael Kerrisk <mtk.manpages@gmail.com>

Patch below to man-pages applied.

Thanks for CCing me, Eric.

Cheers,

Michael

--- a/man2/recv.2
+++ b/man2/recv.2
@@ -264,7 +264,7 @@ subsequent receive call will return the same data.
 For raw
 .RB ( AF_PACKET ),
 Internet datagram (since Linux 2.4.27/2.6.8),
-and netlink (since Linux 2.6.22) sockets:
+netlink (since Linux 2.6.22) and UNIX datagram (since Linux 3.4) sockets:
 return the real length of the packet or datagram,
 even when it was longer than the passed buffer.
 Not implemented for UNIX domain

^ permalink raw reply

* Re: Lockdep warning in vxlan
From: Stephen Hemminger @ 2012-12-20 16:34 UTC (permalink / raw)
  To: Yan Burman; +Cc: netdev
In-Reply-To: <50D31A00.7060905@mellanox.com>

On Thu, 20 Dec 2012 16:00:32 +0200
Yan Burman <yanb@mellanox.com> wrote:

> Hi.
> 
> When working with vxlan from current net-next, I got a lockdep warning 
> (below).
> It seems to happen when I have host B pinging host A and while the pings 
> continue,
> I do "ip link del" on the vxlan interface on host A. The lockdep warning 
> is on host A.
> Tell me if you need some more info.
> 

Looks like the case of nested ARP requests, the initial request is coming
from neigh_timer (ARP retransmit), but inside neigh_probe the lock
is dropped?

^ permalink raw reply

* Re: [PATCH net] net/vxlan: Use the underlying device index when joining/leaving multicast groups
From: Stephen Hemminger @ 2012-12-20 16:26 UTC (permalink / raw)
  To: Yan Burman; +Cc: netdev, ogerlitz
In-Reply-To: <1356010568-21644-1-git-send-email-yanb@mellanox.com>

On Thu, 20 Dec 2012 15:36:08 +0200
Yan Burman <yanb@mellanox.com> wrote:

> The socket calls from vxlan to join/leave multicast group aren't
> using the index of the underlying device, as a result the stack uses
> the first interface that is up. This results in vxlan being non functional
> over a device which isn't the 1st to be up.
> Fix this by providing the iflink field to the vxlan instance
> to the multicast calls.
> 
> Signed-off-by: Yan Burman <yanb@mellanox.com>
> ---
>  drivers/net/vxlan.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
> index 3b3fdf6..40f2cc1 100644
> --- a/drivers/net/vxlan.c
> +++ b/drivers/net/vxlan.c
> @@ -505,7 +505,8 @@ static int vxlan_join_group(struct net_device *dev)
>  	struct vxlan_net *vn = net_generic(dev_net(dev), vxlan_net_id);
>  	struct sock *sk = vn->sock->sk;
>  	struct ip_mreqn mreq = {
> -		.imr_multiaddr.s_addr = vxlan->gaddr,
> +		.imr_multiaddr.s_addr	= vxlan->gaddr,
> +		.imr_ifindex		= vxlan->link,
>  	};
>  	int err;
>  
> @@ -532,7 +533,8 @@ static int vxlan_leave_group(struct net_device *dev)
>  	int err = 0;
>  	struct sock *sk = vn->sock->sk;
>  	struct ip_mreqn mreq = {
> -		.imr_multiaddr.s_addr = vxlan->gaddr,
> +		.imr_multiaddr.s_addr	= vxlan->gaddr,
> +		.imr_ifindex		= vxlan->link,
>  	};
>  
>  	/* Only leave group when last vxlan is done. */

Acked-by: Stephen Hemminger <shemminger@vyatta.com>

^ permalink raw reply

* Re: [PATCH net-next V4 03/13] bridge: Validate that vlan is permitted on ingress
From: Shmulik Ladkani @ 2012-12-20 16:24 UTC (permalink / raw)
  To: vyasevic; +Cc: netdev, shemminger, davem, or.gerlitz, jhs, mst, erdnetdev, jiri
In-Reply-To: <50D331A8.3080206@redhat.com>

On Thu, 20 Dec 2012 10:41:28 -0500 Vlad Yasevich <vyasevic@redhat.com> wrote:
> >> +static bool br_allowed_ingress(struct net_bridge_port *p, struct sk_buff *skb)
> >> +{
> >> +	struct net_port_vlan *pve;
> >> +	u16 vid;
> >> +
> >> +	/* If there are no vlan in the permitted list, all packets are
> >> +	 * permitted.
> >> +	 */
> >> +	if (list_empty(&p->vlan_list))
> >> +		return true;
> >
> > I assumed the default policy would be Drop in such case, otherwise
> > leaking between vlan domains is possible.
> > Or maybe, ingress policy when port isn't a member of ingress VID should
> > be configurable (drop/allow).
> 
> We have have to default to allow since we want to retain original bridge 
> functionality if there is no configuration.

Ok; so having the port not a member of ANY vlan is a "port vlan
disabled" configuration knob, and as such, it is a member of ANY vlan,
meaning that:

(1) every "non-vlan port" is connected to any other "non-vlan port"
(2) frame ingress on a "non-vlan" port may egress on a "vlan enabled"
  port, depending on the ingress VID and the port-membership map of the
  egress port
  (and thus, PVID should be defined even to "non-vlan" ports, for the
  case where untagged frame is received on the non-vlan port)
(3) frame ingress on a "vlan-enabled" port would always egress on
  "non-vlan" ports

Seems ok.
However this is an additional nuance that might not be expected by the
user configuring the bridge; maybe this needs some clarification.

> >> +	vid = br_get_vlan(skb);
> >> +	pve = nbp_vlan_find(p, vid);
> >
> > Why search by iterating through NBP's vlan_list?
> > You know the VID (hence may fetch the net_bridge_vlan from the hash), so
> > why don't you directly consult the net_bridge_vlan's port_bitmap?
> 
> It's an alternative...  I am betting that this port isn't in too many 
> vlans and that searching the list might be faster.

I assumed the opposite: finding the hash bucket is just a bitwise mask,
and number of items in a bucket would rarely be grater than 1.
I expect such code to be shorter, but this needs to be verified.

Regards,
Shmulik

^ permalink raw reply

* Re: [PATCH net-next V4 03/13] bridge: Validate that vlan is permitted on ingress
From: Vlad Yasevich @ 2012-12-20 15:41 UTC (permalink / raw)
  To: Shmulik Ladkani
  Cc: netdev, shemminger, davem, or.gerlitz, jhs, mst, erdnetdev, jiri
In-Reply-To: <20121220160713.30cdfc05@pixies.home.jungo.com>

On 12/20/2012 09:07 AM, Shmulik Ladkani wrote:
> Hi Vlad,
>
> On Wed, 19 Dec 2012 12:48:14 -0500 Vlad Yasevich <vyasevic@redhat.com> wrote:
>> +static bool br_allowed_ingress(struct net_bridge_port *p, struct sk_buff *skb)
>> +{
>> +	struct net_port_vlan *pve;
>> +	u16 vid;
>> +
>> +	/* If there are no vlan in the permitted list, all packets are
>> +	 * permitted.
>> +	 */
>> +	if (list_empty(&p->vlan_list))
>> +		return true;
>
> I assumed the default policy would be Drop in such case, otherwise
> leaking between vlan domains is possible.
> Or maybe, ingress policy when port isn't a member of ingress VID should
> be configurable (drop/allow).

We have have to default to allow since we want to retain original bridge 
functionality if there is no configuration.

>
>> +	vid = br_get_vlan(skb);
>> +	pve = nbp_vlan_find(p, vid);
>
> Why search by iterating through NBP's vlan_list?
> You know the VID (hence may fetch the net_bridge_vlan from the hash), so
> why don't you directly consult the net_bridge_vlan's port_bitmap?

It's an alternative...  I am betting that this port isn't in too many 
vlans and that searching the list might be faster.

>
>> @@ -54,6 +74,9 @@ int br_handle_frame_finish(struct sk_buff *skb)
>>   	if (!p || p->state == BR_STATE_DISABLED)
>>   		goto drop;
>>
>> +	if (!br_allowed_ingress(p, skb))
>> +		goto drop;
>> +
>
> This condition should be also encorporated upon "ingress" at the "bridge
> master port" (that is, early at br_dev_xmit).
> Think of the "bridge master port" as yet another port:
> upon "ingress" (meaning, tx packets from the ip stack), we should
> also enforce any ingress permission rules.
>

I've tried that before and now can't think of a reason why I rejected 
it.  I'll try to remember...

Thanks
-vlad

> Regards,
> Shmulik
>

^ permalink raw reply

* Re: [PATCH] xen/netfront: improve truesize tracking
From: Eric Dumazet @ 2012-12-20 15:39 UTC (permalink / raw)
  To: Sander Eikelenboom
  Cc: Ian Campbell, netdev@vger.kernel.org, Konrad Rzeszutek Wilk,
	annie li, xen-devel@lists.xensource.com
In-Reply-To: <1797374383.20121220135139@eikelenboom.it>

On Thu, 2012-12-20 at 13:51 +0100, Sander Eikelenboom wrote:

> Eric:
>      From the warn_on_once, delta should be smaller than len, but probably they should be as close together as possible.
>      When you say "accurate estimation", what would be a acceptable difference between DELTA and LEN ?

I would use the most exact value, which is :

   skb->truesize += nr_frags * PAGE_SIZE;

Then, if we can spot later a regression in some stacks, adapt the
limiting parameters. I did a lot of work in GRO and TCP stack to reduce
the memory, and further changes are possible.

We really want to account memory, because we want to control how memory
is used on our machines and don't let some users use more than the
amount that was allowed to them.

^ permalink raw reply

* Re: [PATCH] net: ipv4: route: fix coding style issues net: ipv4: tcp: fix coding style issues
From: Eric Dumazet @ 2012-12-20 15:31 UTC (permalink / raw)
  To: Stefan Hasko
  Cc: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, netdev, linux-kernel
In-Reply-To: <1356013685-31649-1-git-send-email-hasko.stevo@gmail.com>

On Thu, 2012-12-20 at 15:28 +0100, Stefan Hasko wrote:
> Fix a coding style issues.
> 
> Signed-off-by: Stefan Hasko <hasko.stevo@gmail.com>
> ---
>  net/ipv4/route.c |  119 ++++++++++++++++-------------
>  net/ipv4/tcp.c   |  218 +++++++++++++++++++++++++++++++-----------------------
>  2 files changed, 194 insertions(+), 143 deletions(-)


I Nack this patch and any such patches in net/core, net/ipv4 trees for a
while.

We had too many recent changes and probably need a bunch of real fixes.

Thanks

^ permalink raw reply

* Re: [PATCH net-next V4 02/13] bridge: Add vlan filtering infrastructure
From: Vlad Yasevich @ 2012-12-20 15:31 UTC (permalink / raw)
  To: Shmulik Ladkani
  Cc: netdev, shemminger, davem, or.gerlitz, jhs, mst, erdnetdev, jiri
In-Reply-To: <20121220153913.11a10fd0@pixies.home.jungo.com>

On 12/20/2012 08:39 AM, Shmulik Ladkani wrote:
> Hi Vlad,
>
> On Wed, 19 Dec 2012 12:48:13 -0500 Vlad Yasevich <vyasevic@redhat.com> wrote:
>> +static void nbp_vlan_flush(struct net_bridge_port *p)
>> +{
>> +	struct net_port_vlan *pve;
>> +	struct net_port_vlan *tmp;
>> +
>> +	ASSERT_RTNL();
>> +
>> +	list_for_each_entry_safe(pve, tmp, &p->vlan_list, list)
>> +		nbp_vlan_delete(p, pve->vid, BRIDGE_FLAGS_SELF);
>
> Why would you want to clear "bridge master port" association from this
> vlan, in the event of NBP destruction?
> The "bridge port" may still be a member of this vlan, doesn't it?
> Seems flags argument should be 0.

This ends up getting fixed later, but you are right.  This should be 0.

>
>> +#define BR_VID_HASH_SIZE (1<<6)
>> +#define br_vlan_hash(vid) ((vid) % (BR_VID_HASH_SIZE - 1))
>
> Did you mean:                       & (BR_VID_HASH_SIZE - 1)

yes.

thanks
-vlad

>
> Regards,
> Shmulik
>

^ permalink raw reply

* Re: [PATCH] pkt_sched: act_xt support new Xtables interface
From: Yury Stankevich @ 2012-12-20 14:59 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Hasan Chowdhury, Stephen Hemminger, Jan Engelhardt,
	netdev@vger.kernel.org, pablo, netfilter-devel
In-Reply-To: <50D305FD.7000901@mojatatu.com>

interesting,

#tc -s filter show dev usb0 parent ffff:
filter protocol ip pref 49152 u32
filter protocol ip pref 49152 u32 fh 800: ht divisor 1
filter protocol ip pref 49152 u32 fh 800::800 order 2048 key ht 800 bkt
0 terminal flowid ???  (rule hit 707 success 707)
  match 00000000/00000000 at 0 (success 707 )
	action order 1: tablename: mangle  hook: NF_IP_PRE_ROUTING
	target  CONNMARK restore
	index 5 ref 1 bind 1 installed 394 sec used 11 sec
	Action statistics:
	Sent 783783 bytes 707 pkt (dropped 0, overlimits 0 requeues 0)
	backlog 0b 0p requeues 0

	action order 2: mirred (Egress Redirect to device ifb0) stolen
 	index 5 ref 1 bind 1 installed 394 sec used 11 sec
 	Action statistics:
	Sent 783783 bytes 707 pkt (dropped 0, overlimits 0 requeues 0)
	backlog 0b 0p requeues 0

so, looks like packets was sent to CONNMARK target.

but...
i make a iptables rule to log packets with 0xa mark:

Chain PREROUTING (policy ACCEPT 1308 packets, 848K bytes)
 pkts bytes target     prot opt in     out     source
destination
    0     0 NFLOG      all  --  *      *       0.0.0.0/0
0.0.0.0/0            mark match 0xa nflog-group 1

Chain POSTROUTING (policy ACCEPT 1240 packets, 550K bytes)
 pkts bytes target     prot opt in     out     source
destination
    1    40 CONNMARK   tcp  --  *      *       0.0.0.0/0
0.0.0.0/0            tcp dpt:80 connmark match  0x0 connbytes 204800
connbytes mode bytes connbytes direction both CONNMARK set 0xa

idea is:
i run downloading, rule from POSTROUTING must fire if i download more
than ~200K,
tc filter call to CONNMARK restore, must restore mark (0xa) for packets
belong to this connection.
so i expect, that PREROUTING rule must notice the restored mark, but it
doesn't.
maybe i miss something ?


20.12.2012 16:35, Jamal Hadi Salim пишет:
> 
> Could be your setup. I didnt do a lot of testing but
> from my notes (running different kernel at the moment):
> 
> #try to point to everything (no iptables setup)
> tc filter add dev eth0 parent ffff: protocol ip u32 match u32 0 0 flowid
> 23:23 action xt -j CONNMARK --restore-mark
> #let it run for a 1 sec then display with
> tc -s filter show dev eth0 parent ffff:
> 
> ----
> filter protocol ip pref 49152 u32
> filter protocol ip pref 49152 u32 fh 800: ht divisor 1
> filter protocol ip pref 49152 u32 fh 800::800 order 2048 key ht 800 bkt
> 0 flowid 23:23
>   match 00000000/00000000 at 0
>     action order 1: tablename: mangle  hook: NF_IP_PRE_ROUTING
>     target  CONNMARK restore
>     index 1 ref 1 bind 1 installed 3 sec used 1 sec
>     Action statistics:
>     Sent 280 bytes 4 pkt (dropped 0, overlimits 0 requeues 0)
>     backlog 0b 0p requeues 0
> ----
> 
> cheers,
> jamal
> 
> On 12-12-20 03:54 AM, Yury Stankevich wrote:
>> 19.12.2012 15:56, Jamal Hadi Salim пишет:
>>> Hasan/Yury, if you test this please use the latest iproute2 with only
>>> the first patch I posted (originally from Hasan). Hasan please use that
>>> patch not your version - if theres anything wrong we can find out sooner
>>> before the patch becomes final.
>>
>> Hello,
>> 3.7.1 kernel with 3.7.0 iproute,
>> patch-xt, xt-p1 + linkage fix was applyed
>> command successfully performed, but actually doesn't work.
>>
>> command:
>> tc filter add dev $dev parent ffff: protocol ip u32 match u32 0 0 \
>>              action xt -j CONNMARK --restore-mark \
>>              action mirred egress redirect dev ifb0
>> then i use filter:
>>
>> tc filter add dev ifb0 protocol ip parent 1: prio 2 handle 0xa fw flowid
>> 1:102
>>
>> iptables line:
>> iptable -t mangle -A POSTROUTING -p tcp --dport 80 -m connmark --mark 0
>> -m connbytes --connbytes 204800: --connbytes-dir both --connbytes-mode
>> bytes -j CONNMARK --set-mark 0xa
>>
>> once i run a test to download 300K file,
>> from iptables counters i can see that rule in POSTROUTING is triggered,
>> but from `tc -s qdisc show dev ifb0` i see that no packets was sent to
>> 1:102 flow.
>>
>> btw,
>> tc -p -s filter show dev ifb0 parent 1:
>> do not show stats `(rule hit 416 success 0)` for this (filter protocol
>> ip pref 2 fw handle 0xa classid 1:102) rule.
>>
>>
>>
> 


-- 
Linux registered user #402966 // pub 1024D/E99AF373 <pgp.mit.edu>

^ permalink raw reply

* Re: [PATCH] net: ipv4: route: fixed a coding style issues net: ipv4: tcp: fixed a coding style issues
From: Eric Dumazet @ 2012-12-20 15:23 UTC (permalink / raw)
  To: nicolas.dichtel
  Cc: Stefan Hasko, David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, netdev, linux-kernel
In-Reply-To: <50D2FF86.3000603@6wind.com>

On Thu, 2012-12-20 at 13:07 +0100, Nicolas Dichtel wrote:
> Le 20/12/2012 09:08, Stefan Hasko a écrit :

> > +				"out_hlist_search\n");
> checkpatch will warn you about this one, something like:
> "WARNING: quoted string split across lines".
> Not breaking such line ease to grep the pattern.


Yes.

Could we please leave this file as is for at least 2 years ?

We had a lot of recent changes and probable fixes are expected.

Such "coding style" patches are a real pain when trying to fix bugs,
especially dealing with stable/old kernels.

Thanks

^ permalink raw reply

* Re: [Xen-devel] [PATCH] xen/netfront: improve truesize tracking
From: Sander Eikelenboom @ 2012-12-20 14:58 UTC (permalink / raw)
  To: Sander Eikelenboom
  Cc: Eric Dumazet, netdev@vger.kernel.org, annie li,
	xen-devel@lists.xensource.com, Ian Campbell,
	Konrad Rzeszutek Wilk
In-Reply-To: <1457826869.20121220152326@eikelenboom.it>


Thursday, December 20, 2012, 3:23:26 PM, you wrote:


> Thursday, December 20, 2012, 1:51:39 PM, you wrote:


>> Wednesday, December 19, 2012, 5:17:49 PM, you wrote:

>>> On Wed, 2012-12-19 at 12:34 +0100, Sander Eikelenboom wrote:

>>>> Hi Ian,
>>>> 
>>>> It ran overnight and i haven't seen the warn_once trigger.
>>>> (but i also didn't with the previous patch)
>>>> 

>>> As I said, the miminum value to not trigger the warning was what Ian
>>> patch was doing, but it was still a not accurate estimation.

>>> Doing the real accounting might trigger slow transferts, or dropped
>>> packets because of socket limits (SNDBUF / RCVBUF) being hit sooner.

>>> So the real question was : If accounting for full pages, is your
>>> applications run as smooth as before, with no huge performance
>>> regression ?

>> Ok i have added some extra debug info (see diff's below), the code still uses the old calculation for truesize (in the hope to trigger the warn_on_once again), but also calculates the variants IanC came up with.

>> I haven't got a clear test case to trigger the warn_on_once, it happens just every once in a while during my normal usage and i'm not a netperf expert :-)
>> So at the moment i haven't been able to trigger the warn_on_once yet, but the results so far do seem to shed some light ..

>> - The first variant (current code) seems to be the most effcient and a good estimation *most* of the the, but sometimes triggers the warn_on_once in skb_try_coalesce.
>> - The first variant (current code) seems to always substract from the truesize for small packets.
>> - The second variant always seems keep the truesize as is for most of the small network traffic, but it also seems to work ok for larger packets.
>> - The third variant seems to be a pretty wasteful estimation.

>> So the last variant seems to be rather wasteful, and the second one the most accurate so far.

>> Eric:
>>      From the warn_on_once, delta should be smaller than len, but probably they should be as close together as possible.
>>      When you say "accurate estimation", what would be a acceptable difference between DELTA and LEN ?



>> [  116.965062] eth0: mtu:1500 data_len:42 len before:0 len after:42 truesize before:896 truesize after:682 nr_frags:1 variant1:-214(682) variant2:0(896) variant3:4096(4992)
>> [  117.094538] eth0: mtu:1500 data_len:54 len before:0 len after:54 truesize before:896 truesize after:694 nr_frags:1 variant1:-202(694) variant2:0(896) variant3:4096(4992)
>> [  117.094707] eth0: mtu:1500 data_len:54 len before:0 len after:54 truesize before:896 truesize after:694 nr_frags:1 variant1:-202(694) variant2:0(896) variant3:4096(4992)
>> [  117.094869] eth0: mtu:1500 data_len:54 len before:0 len after:54 truesize before:896 truesize after:694 nr_frags:1 variant1:-202(694) variant2:0(896) variant3:4096(4992)
>> [  117.095058] eth0: mtu:1500 data_len:54 len before:0 len after:54 truesize before:896 truesize after:694 nr_frags:1 variant1:-202(694) variant2:0(896) variant3:4096(4992)
>> [  117.095216] eth0: mtu:1500 data_len:54 len before:0 len after:54 truesize before:896 truesize after:694 nr_frags:1 variant1:-202(694) variant2:0(896) variant3:4096(4992)
>> [  117.096102] eth0: mtu:1500 data_len:54 len before:0 len after:54 truesize before:896 truesize after:694 nr_frags:1 variant1:-202(694) variant2:0(896) variant3:4096(4992)
>> [  117.096311] eth0: mtu:1500 data_len:54 len before:0 len after:54 truesize before:896 truesize after:694 nr_frags:1 variant1:-202(694) variant2:0(896) variant3:4096(4992)
>> [  117.096373] eth0: mtu:1500 data_len:54 len before:0 len after:54 truesize before:896 truesize after:694 nr_frags:1 variant1:-202(694) variant2:0(896) variant3:4096(4992)
>> [  117.150398] eth0: mtu:1500 data_len:54 len before:0 len after:54 truesize before:896 truesize after:694 nr_frags:1 variant1:-202(694) variant2:0(896) variant3:4096(4992)
>> [  117.150459] eth0: mtu:1500 data_len:54 len before:0 len after:54 truesize before:896 truesize after:694 nr_frags:1 variant1:-202(694) variant2:0(896) variant3:4096(4992)
>> [  117.536901] eth0: mtu:1500 data_len:53642 len before:0 len after:53642 truesize before:896 truesize after:54282 nr_frags:14 variant1:53386(54282) variant2:53386(54282) variant3:57344(58240)
>> [  117.537463] eth0: mtu:1500 data_len:15994 len before:0 len after:15994 truesize before:896 truesize after:16634 nr_frags:5 variant1:15738(16634) variant2:15738(16634) variant3:20480(21376)
>> [  117.537915] eth0: mtu:1500 data_len:17442 len before:0 len after:17442 truesize before:896 truesize after:18082 nr_frags:5 variant1:17186(18082) variant2:17186(18082) variant3:20480(21376)
>> [  117.538543] eth0: mtu:1500 data_len:18890 len before:0 len after:18890 truesize before:896 truesize after:19530 nr_frags:6 variant1:18634(19530) variant2:18634(19530) variant3:24576(25472)
>> [  117.539223] eth0: mtu:1500 data_len:13098 len before:0 len after:13098 truesize before:896 truesize after:13738 nr_frags:4 variant1:12842(13738) variant2:12842(13738) variant3:16384(17280)
>> [  117.539283] eth0: mtu:1500 data_len:7306 len before:0 len after:7306 truesize before:896 truesize after:7946 nr_frags:2 variant1:7050(7946) variant2:7050(7946) variant3:8192(9088)
>> [  117.539403] skbuff: to: (null) from: (null)  skb_try_coalesce: DELTA - LEN > 100 delta:7690 len:7240 from->truesize:7946 skb_headlen(from):190 skb_shinfo(to)->nr_frags:5 skb_shinfo(from)->nr_frags:2
>> [  117.540035] eth0: mtu:1500 data_len:4410 len before:0 len after:4410 truesize before:896 truesize after:5050 nr_frags:3 variant1:4154(5050) variant2:4304(5200) variant3:12288(13184)
>> [  117.540153] eth0: mtu:1500 data_len:1018 len before:0 len after:1018 truesize before:896 truesize after:1658 nr_frags:1 variant1:762(1658) variant2:762(1658) variant3:4096(4992)
>> [  121.981917] net_ratelimit: 27 callbacks suppressed
>> [  121.981960] eth0: mtu:1500 data_len:42 len before:0 len after:42 truesize before:896 truesize after:682 nr_frags:1 variant1:-214(682) variant2:0(896) variant3:4096(4992)
>> [  122.985019] eth0: mtu:1500 data_len:42 len before:0 len after:42 truesize before:896 truesize after:682 nr_frags:1 variant1:-214(682) variant2:0(896) variant3:4096(4992)
>> [  123.988308] eth0: mtu:1500 data_len:42 len before:0 len after:42 truesize before:896 truesize after:682 nr_frags:1 variant1:-214(682) variant2:0(896) variant3:4096(4992)
>> [  124.991961] eth0: mtu:1500 data_len:42 len before:0 len after:42 truesize before:896 truesize after:682 nr_frags:1 variant1:-214(682) variant2:0(896) variant3:4096(4992)
>> [  125.995003] eth0: mtu:1500 data_len:42 len before:0 len after:42 truesize before:896 truesize after:682 nr_frags:1 variant1:-214(682) variant2:0(896) variant3:4096(4992)
>> [  126.998324] eth0: mtu:1500 data_len:42 len before:0 len after:42 truesize before:896 truesize after:682 nr_frags:1 variant1:-214(682) variant2:0(896) variant3:4096(4992)



>> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
>> index c26e28b..8833e38 100644
>> --- a/drivers/net/xen-netfront.c
>> +++ b/drivers/net/xen-netfront.c
>> @@ -964,6 +964,7 @@ static int xennet_poll(struct napi_struct *napi, int budget)
>>         struct sk_buff_head tmpq;
>>         unsigned long flags;
>>         int err;
>> +       int tsz,len;

>>         spin_lock(&np->rx_lock);

>> @@ -1037,9 +1038,22 @@ err:
>>                  * receive throughout using the standard receive
>>                  * buffer size was cut by 25%(!!!).
>>                  */
>> -               skb->truesize += skb->data_len - RX_COPY_THRESHOLD;
>> +
>> +
>> +
>> +
>> +                tsz = skb->truesize;
>> +                len = skb->len;
>> +                /* skb->truesize += PAGE_SIZE * skb_shinfo(skb)->nr_frags; */
>> +                skb->truesize += skb->data_len - RX_COPY_THRESHOLD;
>>                 skb->len += skb->data_len;

>> +               net_warn_ratelimited("%s: mtu:%d data_len:%d len before:%d len after:%d truesize before:%d truesize after:%d nr_frags:%d variant1:%d(%d) variant2:%d(%d) variant3:%d(%d) \n",
>> +                        skb->dev->name, skb->dev->mtu, skb->data_len, len,  skb->len,tsz, skb->truesize, skb_shinfo(skb)->nr_frags,
>> +                        skb->data_len - RX_COPY_THRESHOLD, tsz + skb->data_len - RX_COPY_THRESHOLD ,
>> +                        skb->data_len - NETFRONT_SKB_CB(skb)->pull_to, tsz + skb->data_len - NETFRONT_SKB_CB(skb)->pull_to,
>> +                        PAGE_SIZE * skb_shinfo(skb)->nr_frags, tsz + (PAGE_SIZE * skb_shinfo(skb)->nr_frags));
>> +
>>                 if (rx->flags & XEN_NETRXF_csum_blank)
>>                         skb->ip_summed = CHECKSUM_PARTIAL;
>>                 else if (rx->flags & XEN_NETRXF_data_validated)
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index 3ab989b..6d0cd86 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -3471,6 +3471,16 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,

>>         WARN_ON_ONCE(delta < len);

>> +       if(delta < len) {
>> +               net_warn_ratelimited("to: %s from: %s  skb_try_coalesce: DELTA < LEN delta:%d len:%d from->truesize:%d skb_headlen(from):%d skb_shinfo(to)->nr_frags:%d skb_shinfo(from)->nr_frags:%d \n",
>> +                        to->dev->name, from->dev->name, delta, len, from->truesize, skb_headlen(from), skb_shinfo(to)->nr_frags, skb_shinfo(from)->nr_frags);
>> +       }
>> +
+       if (delta >>> len && delta - len > 100) {
>> +               net_warn_ratelimited("to: %s from: %s  skb_try_coalesce: DELTA - LEN > 100 delta:%d len:%d from->truesize:%d skb_headlen(from):%d skb_shinfo(to)->nr_frags:%d skb_shinfo(from)->nr_frags:%d \n",
>> +                        to->dev->name,from->dev->name, delta, len, from->truesize, skb_headlen(from), skb_shinfo(to)->nr_frags, skb_shinfo(from)->nr_frags);
>> +       }
>> +
>>         memcpy(skb_shinfo(to)->frags + skb_shinfo(to)->nr_frags,
>>                skb_shinfo(from)->frags,
>>                skb_shinfo(from)->nr_frags * sizeof(skb_frag_t));



> Ok i succeeded in triggering the warn_on_once, but it seems the extra debug info from netfront was just rate limited away for the offending packet :(

> Dec 20 15:17:33 media kernel: [  393.464062] eth0: mtu:1500 data_len:66 len before:0 len after:66 truesize before:896 truesize after:706 nr_frags:1 variant1:-190(706) variant2:0(896) variant3:4096(4992)
> Dec 20 15:17:33 media kernel: [  393.464438] eth0: mtu:1500 data_len:762 len before:0 len after:762 truesize before:896 truesize after:1402 nr_frags:1 variant1:506(1402) variant2:506(1402) variant3:4096(4992)
> Dec 20 15:17:33 media kernel: [  393.465083] eth0: mtu:1500 data_len:118 len before:0 len after:118 truesize before:896 truesize after:758 nr_frags:1 variant1:-138(758) variant2:0(896) variant3:4096(4992)
> Dec 20 15:17:33 media kernel: [  393.466114] eth0: mtu:1500 data_len:118 len before:0 len after:118 truesize before:896 truesize after:758 nr_frags:1 variant1:-138(758) variant2:0(896) variant3:4096(4992)
> Dec 20 15:17:33 media kernel: [  393.467336] eth0: mtu:1500 data_len:66 len before:0 len after:66 truesize before:896 truesize after:706 nr_frags:1 variant1:-190(706) variant2:0(896) variant3:4096(4992)
> Dec 20 15:17:35 media kernel: [  394.940211] ------------[ cut here ]------------
> Dec 20 15:17:35 media kernel: [  394.940259] WARNING: at net/core/skbuff.c:3472 skb_try_coalesce+0x3fc/0x470()
> Dec 20 15:17:35 media kernel: [  394.940282] Modules linked in:
> Dec 20 15:17:35 media kernel: [  394.940306] Pid: 2632, comm: glusterfs Not tainted 3.7.0-rc0-20121220-netfrontdebug1 #1
> Dec 20 15:17:35 media kernel: [  394.940330] Call Trace:
> Dec 20 15:17:35 media kernel: [  394.940343]  <IRQ>  [<ffffffff8106889a>] warn_slowpath_common+0x7a/0xb0
> Dec 20 15:17:35 media kernel: [  394.940384]  [<ffffffff810688e5>] warn_slowpath_null+0x15/0x20
> Dec 20 15:17:35 media kernel: [  394.940409]  [<ffffffff8184298c>] skb_try_coalesce+0x3fc/0x470
> Dec 20 15:17:35 media kernel: [  394.940434]  [<ffffffff818fb049>] tcp_try_coalesce+0x69/0xc0
> Dec 20 15:17:35 media kernel: [  394.940458]  [<ffffffff818fb0f4>] tcp_queue_rcv+0x54/0x100
> Dec 20 15:17:35 media kernel: [  394.940481]  [<ffffffff8190029f>] ? tcp_mtup_init+0x2f/0x90
> Dec 20 15:17:35 media kernel: [  394.940504]  [<ffffffff818ffbdb>] tcp_rcv_established+0x2bb/0x6a0
> Dec 20 15:17:35 media kernel: [  394.940528]  [<ffffffff8190839f>] ? tcp_v4_rcv+0x6cf/0xb10
> Dec 20 15:17:35 media kernel: [  394.940551]  [<ffffffff81907985>] tcp_v4_do_rcv+0x135/0x480
> Dec 20 15:17:35 media kernel: [  394.940576]  [<ffffffff819b3532>] ? _raw_spin_lock_nested+0x42/0x50
> Dec 20 15:17:35 media kernel: [  394.940600]  [<ffffffff8190839f>] ? tcp_v4_rcv+0x6cf/0xb10
> Dec 20 15:17:35 media kernel: [  394.940623]  [<ffffffff8190862d>] tcp_v4_rcv+0x95d/0xb10
> Dec 20 15:17:35 media kernel: [  394.940666]  [<ffffffff810b5688>] ? lock_acquire+0xd8/0x100
> Dec 20 15:17:35 media kernel: [  394.940694]  [<ffffffff818e4d6a>] ip_local_deliver_finish+0x11a/0x230
> Dec 20 15:17:35 media kernel: [  394.940720]  [<ffffffff818e4c95>] ? ip_local_deliver_finish+0x45/0x230
> Dec 20 15:17:35 media kernel: [  394.940745]  [<ffffffff818e4eb8>] ip_local_deliver+0x38/0x80
> Dec 20 15:17:35 media kernel: [  394.940784]  [<ffffffff818e447a>] ip_rcv_finish+0x15a/0x630
> Dec 20 15:17:35 media kernel: [  394.940807]  [<ffffffff818e4b68>] ip_rcv+0x218/0x300
> Dec 20 15:17:35 media kernel: [  394.940829]  [<ffffffff8184bf2d>] __netif_receive_skb+0x65d/0x8d0
> Dec 20 15:17:35 media kernel: [  394.940853]  [<ffffffff8184ba15>] ? __netif_receive_skb+0x145/0x8d0
> Dec 20 15:17:35 media kernel: [  394.940889]  [<ffffffff810b192d>] ? trace_hardirqs_on+0xd/0x10
> Dec 20 15:17:35 media kernel: [  394.940914]  [<ffffffff810fecbb>] ? free_hot_cold_page+0x1ab/0x1e0
> Dec 20 15:17:35 media kernel: [  394.940939]  [<ffffffff8184e4f8>] netif_receive_skb+0x28/0xf0
> Dec 20 15:17:35 media kernel: [  394.940964]  [<ffffffff81843e83>] ? __pskb_pull_tail+0x253/0x340
> Dec 20 15:17:35 media kernel: [  394.941000]  [<ffffffff8164fbb5>] xennet_poll+0xae5/0xed0
> Dec 20 15:17:35 media kernel: [  394.941024]  [<ffffffff81080081>] ? wake_up_worker+0x1/0x30
> Dec 20 15:17:35 media kernel: [  394.941046]  [<ffffffff810b2fbc>] ? validate_chain+0x13c/0x1300
> Dec 20 15:17:35 media kernel: [  394.941075]  [<ffffffff8184ed66>] net_rx_action+0x136/0x260
> Dec 20 15:17:35 media kernel: [  394.941098]  [<ffffffff81070551>] ? __do_softirq+0x71/0x1a0
> Dec 20 15:17:35 media kernel: [  394.941133]  [<ffffffff810705a9>] __do_softirq+0xc9/0x1a0
> Dec 20 15:17:35 media kernel: [  394.941157]  [<ffffffff819b623c>] call_softirq+0x1c/0x30
> Dec 20 15:17:35 media kernel: [  394.941179]  [<ffffffff8100fdc5>] do_softirq+0x85/0xf0
> Dec 20 15:17:35 media kernel: [  394.941201]  [<ffffffff8107041e>] irq_exit+0x9e/0xd0
> Dec 20 15:17:35 media kernel: [  394.941235]  [<ffffffff81430b1f>] xen_evtchn_do_upcall+0x2f/0x40
> Dec 20 15:17:35 media kernel: [  394.941259]  [<ffffffff819b629e>] xen_do_hypervisor_callback+0x1e/0x30
> Dec 20 15:17:35 media kernel: [  394.941279]  <EOI>  [<ffffffff8100122a>] ? xen_hypercall_xen_version+0xa/0x20
> Dec 20 15:17:35 media kernel: [  394.941318]  [<ffffffff8100122a>] ? xen_hypercall_xen_version+0xa/0x20
> Dec 20 15:17:35 media kernel: [  394.941356]  [<ffffffff8100890d>] ? xen_force_evtchn_callback+0xd/0x10
> Dec 20 15:17:35 media kernel: [  394.941381]  [<ffffffff810092b2>] ? check_events+0x12/0x20
> Dec 20 15:17:35 media kernel: [  394.941405]  [<ffffffff81009259>] ? xen_irq_enable_direct_reloc+0x4/0x4
> Dec 20 15:17:35 media kernel: [  394.941432]  [<ffffffff819b3f6c>] ? _raw_spin_unlock_irq+0x3c/0x70
> Dec 20 15:17:35 media kernel: [  394.941473]  [<ffffffff81095f83>] ? finish_task_switch+0x83/0xe0
> Dec 20 15:17:35 media kernel: [  394.941507]  [<ffffffff81095f46>] ? finish_task_switch+0x46/0xe0
> Dec 20 15:17:35 media kernel: [  394.941533]  [<ffffffff819b2434>] ? __schedule+0x444/0x880
> Dec 20 15:17:35 media kernel: [  394.941555]  [<ffffffff810b2fbc>] ? validate_chain+0x13c/0x1300
> Dec 20 15:17:35 media kernel: [  394.941580]  [<ffffffff810b4c4b>] ? __lock_acquire+0x46b/0xdd0
> Dec 20 15:17:35 media kernel: [  394.941614]  [<ffffffff810b4c4b>] ? __lock_acquire+0x46b/0xdd0
> Dec 20 15:17:35 media kernel: [  394.941638]  [<ffffffff819aff95>] ? __mutex_unlock_slowpath+0x135/0x1d0
> Dec 20 15:17:35 media kernel: [  394.941663]  [<ffffffff819b2904>] ? schedule+0x24/0x70
> Dec 20 15:17:35 media kernel: [  394.941697]  [<ffffffff819b179d>] ? schedule_hrtimeout_range_clock+0x11d/0x140
> Dec 20 15:17:35 media kernel: [  394.941725]  [<ffffffff810b5688>] ? lock_acquire+0xd8/0x100
> Dec 20 15:17:35 media kernel: [  394.941748]  [<ffffffff8118a558>] ? ep_poll+0xf8/0x3a0
> Dec 20 15:17:35 media kernel: [  394.941770]  [<ffffffff819b4015>] ? _raw_spin_unlock_irqrestore+0x75/0xa0
> Dec 20 15:17:35 media kernel: [  394.941808]  [<ffffffff810b1818>] ? trace_hardirqs_on_caller+0xf8/0x200
> Dec 20 15:17:35 media kernel: [  394.941833]  [<ffffffff819b17ce>] ? schedule_hrtimeout_range+0xe/0x10
> Dec 20 15:17:35 media kernel: [  394.941856]  [<ffffffff8118a75a>] ? ep_poll+0x2fa/0x3a0
> Dec 20 15:17:35 media kernel: [  394.941878]  [<ffffffff81098630>] ? try_to_wake_up+0x310/0x310
> Dec 20 15:17:35 media kernel: [  394.941913]  [<ffffffff810b5b17>] ? lock_release+0x117/0x250
> Dec 20 15:17:35 media kernel: [  394.941938]  [<ffffffff81165fd7>] ? fget_light+0xd7/0x140
> Dec 20 15:17:35 media kernel: [  394.941959]  [<ffffffff81165f3a>] ? fget_light+0x3a/0x140
> Dec 20 15:17:35 media kernel: [  394.941981]  [<ffffffff8118a8ce>] ? sys_epoll_wait+0xce/0xe0
> Dec 20 15:17:35 media kernel: [  394.942015]  [<ffffffff819b4e69>] ? system_call_fastpath+0x16/0x1b
> Dec 20 15:17:35 media kernel: [  394.942036] ---[ end trace 6f3a832c9e91c8af ]---
> Dec 20 15:17:35 media kernel: [  394.942056] to: (null) from: (null)  skb_try_coalesce: DELTA < LEN delta:22978 len:23168 from->truesize:23874 skb_headlen(from):0 skb_shinfo(to)->nr_frags:4 skb_shinfo(from)->nr_frags:6
> Dec 20 15:17:35 media kernel: [  394.968199] to: (null) from: (null)  skb_try_coalesce: DELTA < LEN delta:14290 len:14480 from->truesize:15186 skb_headlen(from):0 skb_shinfo(to)->nr_frags:13 skb_shinfo(from)->nr_frags:4
> Dec 20 15:17:35 media kernel: [  395.262814] net_ratelimit: 371 callbacks suppressed
> Dec 20 15:17:35 media kernel: [  395.262858] eth0: mtu:1500 data_len:90 len before:0 len after:90 truesize before:896 truesize after:730 nr_frags:1 variant1:-166(730) variant2:0(896) variant3:4096(4992)
> Dec 20 15:17:35 media kernel: [  395.264767] eth0: mtu:1500 data_len:66 len before:0 len after:66 truesize before:896 truesize after:706 nr_frags:1 variant1:-190(706) variant2:0(896) variant3:4096(4992)
> Dec 20 15:17:35 media kernel: [  395.266193] eth0: mtu:1500 data_len:42 len before:0 len after:42 truesize before:896 truesize after:682 nr_frags:1 variant1:-214(682) variant2:0(896) variant3:4096(4992)
> Dec 20 15:17:35 media kernel: [  395.268422] eth0: mtu:1500 data_len:66 len before:0 len after:66 truesize before:896 truesize after:706 nr_frags:1 variant1:-190(706) variant2:0(896) variant3:4096(4992)
> Dec 20 15:17:35 media kernel: [  395.271617] eth0: mtu:1500 data_len:66 len before:0 len after:66 truesize before:896 truesize after:706 nr_frags:1 variant1:-190(706) variant2:0(896) variant3:4096(4992)
> Dec 20 15:17:35 media kernel: [  395.274794] eth0: mtu:1500 data_len:66 len before:0 len after:66 truesize before:896 truesize after:706 nr_frags:1 variant1:-190(706) variant2:0(896) variant3:4096(4992)
> Dec 20 15:17:35 media kernel: [  395.278104] eth0: mtu:1500 data_len:66 len before:0 len after:66 truesize before:896 truesize after:706 nr_frags:1 variant1:-190(706) variant2:0(896) variant3:4096(4992)
> Dec 20 15:17:35 media kernel: [  395.281319] eth0: mtu:1500 data_len:66 len before:0 len after:66 truesize before:896 truesize after:706 nr_frags:1 variant1:-190(706) variant2:0(896) variant3:4096(4992)
> Dec 20 15:17:35 media kernel: [  395.284454] eth0: mtu:1500 data_len:66 len before:0 len after:66 truesize before:896 truesize after:706 nr_frags:1 variant1:-190(706) variant2:0(896) variant3:4096(4992)
> Dec 20 15:17:35 media kernel: [  395.287797] eth0: mtu:1500 data_len:66 len before:0 len after:66 truesize before:896 truesize after:706 nr_frags:1 variant1:-190(706) variant2:0(896) variant3:4096(4992)
> Dec 20 15:17:35 media kernel: [  395.291121] eth0: mtu:1500 data_len:66 len before:0 len after:66 truesize before:896 truesize after:706 nr_frags:1 variant1:-190(706) variant2:0(896) variant3:4096(4992)


Hmm perhaps a better example, i have indented some perhaps interesting points:

        Dec 20 14:12:57 media kernel: [  794.895136] eth0: mtu:1500 data_len:15994 len before:0 len after:15994 truesize before:896 truesize after:16634 nr_frags:5 variant1:15738(16634) variant2:15738(16634) variant3:20480(21376)
        Dec 20 14:12:57 media kernel: [  794.895431] eth0: mtu:1500 data_len:17442 len before:0 len after:17442 truesize before:896 truesize after:18082 nr_frags:5 variant1:17186(18082) variant2:17186(18082) variant3:20480(21376)
        Dec 20 14:12:57 media kernel: [  794.895616] eth0: mtu:1500 data_len:18890 len before:0 len after:18890 truesize before:896 truesize after:19530 nr_frags:6 variant1:18634(19530) variant2:18824(19720) variant3:24576(25472)
        Dec 20 14:12:57 media kernel: [  794.895804] eth0: mtu:1500 data_len:13098 len before:0 len after:13098 truesize before:896 truesize after:13738 nr_frags:4 variant1:12842(13738) variant2:12842(13738) variant3:16384(17280)
        Dec 20 14:12:57 media kernel: [  794.895823] eth0: mtu:1500 data_len:7306 len before:0 len after:7306 truesize before:896 truesize after:7946 nr_frags:3 variant1:7050(7946) variant2:7050(7946) variant3:12288(13184)
        Dec 20 14:12:57 media kernel: [  794.895868] skbuff: to: (null) from: (null)  skb_try_coalesce: DELTA - LEN > 100 delta:7690 len:7240 from->truesize:7946 skb_headlen(from):190 skb_shinfo(to)->nr_frags:5 skb_shinfo(from)->nr_frags:3
        Dec 20 14:12:57 media kernel: [  794.896133] eth0: mtu:1500 data_len:15994 len before:0 len after:15994 truesize before:896 truesize after:16634 nr_frags:5 variant1:15738(16634) variant2:15738(16634) variant3:20480(21376)
        Dec 20 14:12:57 media kernel: [  794.896152] eth0: mtu:1500 data_len:1018 len before:0 len after:1018 truesize before:896 truesize after:1658 nr_frags:1 variant1:762(1658) variant2:762(1658) variant3:4096(4992)
        Dec 20 14:12:57 media kernel: [  794.896200] skbuff: to: (null) from: (null)  skb_try_coalesce: DELTA - LEN > 100 delta:1402 len:952 from->truesize:1658 skb_headlen(from):190 skb_shinfo(to)->nr_frags:6 skb_shinfo(from)->nr_frags:1
        Dec 20 14:12:57 media kernel: [  794.907232] eth0: mtu:1500 data_len:23234 len before:0 len after:23234 truesize before:896 truesize after:23874 nr_frags:7 variant1:22978(23874) variant2:22978(23874) variant3:28672(29568)
        Dec 20 14:12:57 media kernel: [  794.907517] eth0: mtu:1500 data_len:24682 len before:0 len after:24682 truesize before:896 truesize after:25322 nr_frags:7 variant1:24426(25322) variant2:24426(25322) variant3:28672(29568)
        Dec 20 14:12:57 media kernel: [  794.907693] eth0: mtu:1500 data_len:26130 len before:0 len after:26130 truesize before:896 truesize after:26770 nr_frags:7 variant1:25874(26770) variant2:25874(26770) variant3:28672(29568)
        Dec 20 14:12:57 media kernel: [  794.907882] eth0: mtu:1500 data_len:14546 len before:0 len after:14546 truesize before:896 truesize after:15186 nr_frags:5 variant1:14290(15186) variant2:14290(15186) variant3:20480(21376)
        Dec 20 14:12:57 media kernel: [  794.907901] eth0: mtu:1500 data_len:13098 len before:0 len after:13098 truesize before:896 truesize after:13738 nr_frags:4 variant1:12842(13738) variant2:12842(13738) variant3:16384(17280)
        Dec 20 14:12:57 media kernel: [  794.907938] skbuff: to: (null) from: (null)  skb_try_coalesce: DELTA - LEN > 100 delta:13482 len:13032 from->truesize:13738 skb_headlen(from):190 skb_shinfo(to)->nr_frags:6 skb_shinfo(from)->nr_frags:4
        Dec 20 14:12:57 media kernel: [  794.908191] eth0: mtu:1500 data_len:29026 len before:0 len after:29026 truesize before:896 truesize after:29666 nr_frags:9 variant1:28770(29666) variant2:28880(29776) variant3:36864(37760)
        Dec 20 14:12:57 media kernel: [  794.908386] eth0: mtu:1500 data_len:30474 len before:0 len after:30474 truesize before:896 truesize after:31114 nr_frags:8 variant1:30218(31114) variant2:30218(31114) variant3:32768(33664)

A1) Here we have a packet data_len: 5858 and truesize set to 6498 and nr_frags: 2
        Dec 20 14:12:57 media kernel: [  794.908560] eth0: mtu:1500 data_len:5858 len before:0 len after:5858 truesize before:896 truesize after:6498 nr_frags:2 variant1:5602(6498) variant2:5602(6498) variant3:8192(9088)

        Dec 20 14:12:57 media kernel: [  794.908581] eth0: mtu:1500 data_len:26130 len before:0 len after:26130 truesize before:896 truesize after:26770 nr_frags:7 variant1:25874(26770) variant2:25874(26770) variant3:28672(29568)

A2) That seems to end up in skb_try_coalesce, from->nr_frags is still 2, delta >> LEN in this case, no warning but perhaps wasteful ?
        Dec 20 14:12:57 media kernel: [  794.908616] skbuff: to: (null) from: (null)  skb_try_coalesce: DELTA - LEN > 100 delta:6242 len:5792 from->truesize:6498 skb_headlen(from):190 skb_shinfo(to)->nr_frags:9 skb_shinfo(from)->nr_frags:2

        Dec 20 14:12:57 media kernel: [  794.908834] eth0: mtu:1500 data_len:33370 len before:0 len after:33370 truesize before:896 truesize after:34010 nr_frags:9 variant1:33114(34010) variant2:33114(34010) variant3:36864(37760)

B1) Here we have again a packet data_len: 5858 and truesize set to 6498, but nr_frags: 3 this time.
        Dec 20 14:12:57 media kernel: [  794.908992] eth0: mtu:1500 data_len:5858 len before:0 len after:5858 truesize before:896 truesize after:6498 nr_frags:3 variant1:5602(6498) variant2:5792(6688) variant3:12288(13184)
        Dec 20 14:12:57 media kernel: [  794.909012] eth0: mtu:1500 data_len:29026 len before:0 len after:29026 truesize before:896 truesize after:29666 nr_frags:8 variant1:28770(29666) variant2:28770(29666) variant3:32768(33664)

B2) That seems to end up in skb_try_coalesce, from->nr_frags is now 2 instead of 3, delta < LEN in this case, so it would have triggered the warn_on_once
        Dec 20 14:12:57 media kernel: [  794.909040] skbuff: to: (null) from: (null)  skb_try_coalesce: DELTA < LEN delta:5602 len:5792 from->truesize:6498 skb_headlen(from):0 skb_shinfo(to)->nr_frags:9 skb_shinfo(from)->nr_frags:2

        Dec 20 14:12:57 media kernel: [  794.909673] eth0: mtu:1500 data_len:1514 len before:0 len after:1514 truesize before:896 truesize after:2154 nr_frags:1 variant1:1258(2154) variant2:1258(2154) variant3:4096(4992)
        Dec 20 14:12:57 media kernel: [  794.909692] eth0: mtu:1500 data_len:522 len before:0 len after:522 truesize before:896 truesize after:1162 nr_frags:1 variant1:266(1162) variant2:266(1162) variant3:4096(4992)
        Dec 20 14:12:57 media kernel: [  794.909736] skbuff: to: (null) from: (null)  skb_try_coalesce: DELTA - LEN > 100 delta:906 len:456 from->truesize:1162 skb_headlen(from):190 skb_shinfo(to)->nr_frags:2 skb_shinfo(from)->nr_frags:1
        Dec 20 14:12:57 media kernel: [  794.910205] eth0: mtu:1500 data_len:36266 len before:0 len after:36266 truesize before:896 truesize after:36906 nr_frags:10 variant1:36010(36906) variant2:36010(36906) variant3:40960(41856)
        Dec 20 14:12:57 media kernel: [  794.910706] eth0: mtu:1500 data_len:37714 len before:0 len after:37714 truesize before:896 truesize after:38354 nr_frags:10 variant1:37458(38354) variant2:37458(38354) variant3:40960(41856)
        Dec 20 14:12:57 media kernel: [  794.911472] eth0: mtu:1500 data_len:27578 len before:0 len after:27578 truesize before:896 truesize after:28218 nr_frags:8 variant1:27322(28218) variant2:27322(28218) variant3:32768(33664)
        Dec 20 14:12:57 media kernel: [  794.911695] eth0: mtu:1500 data_len:29026 len before:0 len after:29026 truesize before:896 truesize after:29666 nr_frags:9 variant1:28770(29666) variant2:28770(29666) variant3:36864(37760)
        Dec 20 14:12:57 media kernel: [  795.015511] eth0: mtu:1500 data_len:1018 len before:0 len after:1018 truesize before:896 truesize after:1658 nr_frags:1 variant1:762(1658) variant2:762(1658) variant3:4096(4992)
        Dec 20 14:12:57 media kernel: [  795.015585] skbuff: to: (null) from: (null)  skb_try_coalesce: DELTA - LEN > 100 delta:1402 len:952 from->truesize:1658 skb_headlen(from):190 skb_shinfo(to)->nr_frags:10 skb_shinfo(from)->nr_frags:1
        Dec 20 14:12:57 media kernel: [  795.015641] eth0: mtu:1500 data_len:10202 len before:0 len after:10202 truesize before:896 truesize after:10842 nr_frags:4 variant1:9946(10842) variant2:9946(10842) variant3:16384(17280)
        Dec 20 14:12:57 media kernel: [  795.015657] eth0: mtu:1500 data_len:42 len before:0 len after:42 truesize before:896 truesize after:682 nr_frags:1 variant1:-214(682) variant2:0(896) variant3:4096(4992)
        Dec 20 14:12:58 media kernel: [  795.817824] net_ratelimit: 9 callbacks suppressed

--
Sander

^ permalink raw reply

* [PATCH 3/3] iproute2: make `bridge mdb` output consistent with input
From: Cong Wang @ 2012-12-20 14:31 UTC (permalink / raw)
  To: netdev; +Cc: bridge, Cong Wang, Stephen Hemminger
In-Reply-To: <1356013915-20835-1-git-send-email-amwang@redhat.com>

bridge -> dev
group -> grp

Cc: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
---
 bridge/mdb.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/bridge/mdb.c b/bridge/mdb.c
index 6217c5f..81d479b 100644
--- a/bridge/mdb.c
+++ b/bridge/mdb.c
@@ -53,12 +53,12 @@ static void print_mdb_entry(FILE *f, int ifindex, struct br_mdb_entry *e)
 	SPRINT_BUF(abuf);
 
 	if (e->addr.proto == htons(ETH_P_IP))
-		fprintf(f, "bridge %s port %s group %s %s\n", ll_index_to_name(ifindex),
+		fprintf(f, "dev %s port %s grp %s %s\n", ll_index_to_name(ifindex),
 			ll_index_to_name(e->ifindex),
 			inet_ntop(AF_INET, &e->addr.u.ip4, abuf, sizeof(abuf)),
 			(e->state & MDB_PERMANENT) ? "permanent" : "temp");
 	else
-		fprintf(f, "bridge %s port %s group %s %s\n", ll_index_to_name(ifindex),
+		fprintf(f, "dev %s port %s grp %s %s\n", ll_index_to_name(ifindex),
 			ll_index_to_name(e->ifindex),
 			inet_ntop(AF_INET6, &e->addr.u.ip6, abuf, sizeof(abuf)),
 			(e->state & MDB_PERMANENT) ? "permanent" : "temp");
-- 
1.7.7.6

^ permalink raw reply related

* [PATCH 2/3] iproute2: update help info of bridge command
From: Cong Wang @ 2012-12-20 14:31 UTC (permalink / raw)
  To: netdev; +Cc: bridge, Cong Wang, Stephen Hemminger
In-Reply-To: <1356013915-20835-1-git-send-email-amwang@redhat.com>

Cc: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
---
 bridge/bridge.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/bridge/bridge.c b/bridge/bridge.c
index 1fcd365..1d59a1e 100644
--- a/bridge/bridge.c
+++ b/bridge/bridge.c
@@ -27,7 +27,7 @@ static void usage(void)
 {
 	fprintf(stderr,
 "Usage: bridge [ OPTIONS ] OBJECT { COMMAND | help }\n"
-"where  OBJECT := { fdb |  monitor }\n"
+"where  OBJECT := { fdb |  mdb | monitor }\n"
 "       OPTIONS := { -V[ersion] | -s[tatistics] | -d[etails]\n" );
 	exit(-1);
 }
-- 
1.7.7.6

^ permalink raw reply related

* [PATCH 1/3] iproute2: distinguish permanent and temporary mdb entries
From: Cong Wang @ 2012-12-20 14:31 UTC (permalink / raw)
  To: netdev; +Cc: Stephen Hemminger, bridge, Cong Wang

This patch adds a flag to mdb entries so that we can distinguish
permanent entries with temporary ones.

Cc: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
---
 bridge/mdb.c              |   24 +++++++++++++++---------
 include/linux/if_bridge.h |    3 +++
 2 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/bridge/mdb.c b/bridge/mdb.c
index 121ce9c..6217c5f 100644
--- a/bridge/mdb.c
+++ b/bridge/mdb.c
@@ -28,7 +28,7 @@ int filter_index;
 
 static void usage(void)
 {
-	fprintf(stderr, "Usage: bridge mdb { add | del } dev DEV port PORT grp GROUP\n");
+	fprintf(stderr, "Usage: bridge mdb { add | del } dev DEV port PORT grp GROUP [permanent | temp]\n");
 	fprintf(stderr, "       bridge mdb {show} [ dev DEV ]\n");
 	exit(-1);
 }
@@ -53,13 +53,15 @@ static void print_mdb_entry(FILE *f, int ifindex, struct br_mdb_entry *e)
 	SPRINT_BUF(abuf);
 
 	if (e->addr.proto == htons(ETH_P_IP))
-		fprintf(f, "bridge %s port %s group %s\n", ll_index_to_name(ifindex),
+		fprintf(f, "bridge %s port %s group %s %s\n", ll_index_to_name(ifindex),
 			ll_index_to_name(e->ifindex),
-			inet_ntop(AF_INET, &e->addr.u.ip4, abuf, sizeof(abuf)));
+			inet_ntop(AF_INET, &e->addr.u.ip4, abuf, sizeof(abuf)),
+			(e->state & MDB_PERMANENT) ? "permanent" : "temp");
 	else
-		fprintf(f, "bridge %s port %s group %s\n", ll_index_to_name(ifindex),
+		fprintf(f, "bridge %s port %s group %s %s\n", ll_index_to_name(ifindex),
 			ll_index_to_name(e->ifindex),
-			inet_ntop(AF_INET6, &e->addr.u.ip6, abuf, sizeof(abuf)));
+			inet_ntop(AF_INET6, &e->addr.u.ip6, abuf, sizeof(abuf)),
+			(e->state & MDB_PERMANENT) ? "permanent" : "temp");
 }
 
 static void br_print_mdb_entry(FILE *f, int ifindex, struct rtattr *attr)
@@ -179,11 +181,15 @@ static int mdb_modify(int cmd, int flags, int argc, char **argv)
 		} else if (strcmp(*argv, "grp") == 0) {
 			NEXT_ARG();
 			grp = *argv;
+		} else if (strcmp(*argv, "port") == 0) {
+			NEXT_ARG();
+			p = *argv;
+		} else if (strcmp(*argv, "permanent") == 0) {
+			if (cmd == RTM_NEWMDB)
+				entry.state |= MDB_PERMANENT;
+		} else if (strcmp(*argv, "temp") == 0) {
+			;/* nothing */
 		} else {
-			if (strcmp(*argv, "port") == 0) {
-				NEXT_ARG();
-				p = *argv;
-			}
 			if (matches(*argv, "help") == 0)
 				usage();
 		}
diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
index b3b6a67..aac8b8c 100644
--- a/include/linux/if_bridge.h
+++ b/include/linux/if_bridge.h
@@ -163,6 +163,9 @@ struct br_port_msg {
 
 struct br_mdb_entry {
 	__u32 ifindex;
+#define MDB_TEMPORARY 0
+#define MDB_PERMANENT 1
+	__u8 state;
 	struct {
 		union {
 			__be32	ip4;
-- 
1.7.7.6

^ permalink raw reply related

* Re: [PATCH net-next V4 04/13] bridge: Verify that a vlan is allowed to egress on give port
From: Shmulik Ladkani @ 2012-12-20 14:28 UTC (permalink / raw)
  To: Vlad Yasevich
  Cc: netdev, shemminger, davem, or.gerlitz, jhs, mst, erdnetdev, jiri
In-Reply-To: <1355939304-21804-5-git-send-email-vyasevic@redhat.com>

Hi Vlad,

On Wed, 19 Dec 2012 12:48:15 -0500 Vlad Yasevich <vyasevic@redhat.com> wrote:
>  /* Don't forward packets to originating port or forwarding diasabled */
>  static inline int should_deliver(const struct net_bridge_port *p,
>  				 const struct sk_buff *skb)
>  {
>  	return (((p->flags & BR_HAIRPIN_MODE) || skb->dev != p->dev) &&
> +		br_allowed_egress(p, skb) &&
>  		p->state == BR_STATE_FORWARDING);
>  }

This should be also encorporated into 'br_pass_frame_up' somehow.

Egress permission when leaving the bridge towards IP stack ("egress"
on the "bridge master port" from bridging point-of-view) should be
validated according to master port's membership.

Regards,
Shmulik

^ permalink raw reply

* [PATCH] net: ipv4: route: fix coding style issues net: ipv4: tcp: fix coding style issues
From: Stefan Hasko @ 2012-12-20 14:28 UTC (permalink / raw)
  To: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, netdev
  Cc: linux-kernel, Stefan Hasko

Fix a coding style issues.

Signed-off-by: Stefan Hasko <hasko.stevo@gmail.com>
---
 net/ipv4/route.c |  119 ++++++++++++++++-------------
 net/ipv4/tcp.c   |  218 +++++++++++++++++++++++++++++++-----------------------
 2 files changed, 194 insertions(+), 143 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 844a9ef..29678e5 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -20,7 +20,7 @@
  *		Alan Cox	:	Added BSD route gw semantics
  *		Alan Cox	:	Super /proc >4K
  *		Alan Cox	:	MTU in route table
- *		Alan Cox	: 	MSS actually. Also added the window
+ *		Alan Cox	:	MSS actually. Also added the window
  *					clamper.
  *		Sam Lantinga	:	Fixed route matching in rt_del()
  *		Alan Cox	:	Routing cache support.
@@ -31,30 +31,35 @@
  *	Miquel van Smoorenburg	:	BSD API fixes.
  *	Miquel van Smoorenburg	:	Metrics.
  *		Alan Cox	:	Use __u32 properly
- *		Alan Cox	:	Aligned routing errors more closely with BSD
+ *		Alan Cox	:	Aligned routing errors more
+ *					closely with BSD
  *					our system is still very different.
  *		Alan Cox	:	Faster /proc handling
- *	Alexey Kuznetsov	:	Massive rework to support tree based routing,
+ *	Alexey Kuznetsov	:	Massive rework to support
+ *					tree based routing,
  *					routing caches and better behaviour.
  *
  *		Olaf Erb	:	irtt wasn't being copied right.
  *		Bjorn Ekwall	:	Kerneld route support.
  *		Alan Cox	:	Multicast fixed (I hope)
- * 		Pavel Krauz	:	Limited broadcast fixed
+ *		Pavel Krauz	:	Limited broadcast fixed
  *		Mike McLagan	:	Routing by source
  *	Alexey Kuznetsov	:	End of old history. Split to fib.c and
  *					route.c and rewritten from scratch.
  *		Andi Kleen	:	Load-limit warning messages.
- *	Vitaly E. Lavrov	:	Transparent proxy revived after year coma.
+ *	Vitaly E. Lavrov	:	Transparent proxy revived
+ *					after year coma.
  *	Vitaly E. Lavrov	:	Race condition in ip_route_input_slow.
- *	Tobias Ringstrom	:	Uninitialized res.type in ip_route_output_slow.
+ *	Tobias Ringstrom	:	Uninitialized res.type in
+ *					ip_route_output_slow.
  *	Vladimir V. Ivanov	:	IP rule info (flowid) is really useful.
  *		Marc Boucher	:	routing by fwmark
  *	Robert Olsson		:	Added rt_cache statistics
  *	Arnaldo C. Melo		:	Convert proc stuff to seq_file
- *	Eric Dumazet		:	hashed spinlocks and rt_check_expire() fixes.
- * 	Ilia Sotnikov		:	Ignore TOS on PMTUD and Redirect
- * 	Ilia Sotnikov		:	Removed TOS from hash calculations
+ *	Eric Dumazet		:	hashed spinlocks and
+ *					rt_check_expire() fixes.
+ *	Ilia Sotnikov		:	Ignore TOS on PMTUD and Redirect
+ *	Ilia Sotnikov		:	Removed TOS from hash calculations
  *
  *		This program is free software; you can redistribute it and/or
  *		modify it under the terms of the GNU General Public License
@@ -65,7 +70,7 @@
 #define pr_fmt(fmt) "IPv4: " fmt
 
 #include <linux/module.h>
-#include <asm/uaccess.h>
+#include <linux/uaccess.h>
 #include <linux/bitops.h>
 #include <linux/types.h>
 #include <linux/kernel.h>
@@ -139,7 +144,8 @@ static unsigned int	 ipv4_default_advmss(const struct dst_entry *dst);
 static unsigned int	 ipv4_mtu(const struct dst_entry *dst);
 static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst);
 static void		 ipv4_link_failure(struct sk_buff *skb);
-static void		 ip_rt_update_pmtu(struct dst_entry *dst, struct sock *sk,
+static void		 ip_rt_update_pmtu(struct dst_entry *dst,
+					   struct sock *sk,
 					   struct sk_buff *skb, u32 mtu);
 static void		 ip_do_redirect(struct dst_entry *dst, struct sock *sk,
 					struct sk_buff *skb);
@@ -291,12 +297,11 @@ static int rt_cpu_seq_show(struct seq_file *seq, void *v)
 	struct rt_cache_stat *st = v;
 
 	if (v == SEQ_START_TOKEN) {
-		seq_printf(seq, "entries  in_hit in_slow_tot in_slow_mc in_no_route in_brd in_martian_dst in_martian_src  out_hit out_slow_tot out_slow_mc  gc_total gc_ignored gc_goal_miss gc_dst_overflow in_hlist_search out_hlist_search\n");
+		seq_printf(seq, "entries in_hit in_slow_tot in_slow_mc in_no_route in_brd in_martian_dst in_martian_src out_hit out_slow_tot out_slow_mc gc_total gc_ignored gc_goal_miss gc_dst_overflow in_hlist_search out_hlist_search\n");
 		return 0;
 	}
 
-	seq_printf(seq,"%08x  %08x %08x %08x %08x %08x %08x %08x "
-		   " %08x %08x %08x %08x %08x %08x %08x %08x %08x \n",
+		seq_printf(seq, "%08x  %08x %08x %08x %08x %08x %08x %08x  %08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
 		   dst_entries_get_slow(&ipv4_dst_ops),
 		   st->in_hit,
 		   st->in_slow_tot,
@@ -657,8 +662,8 @@ out_unlock:
 	return;
 }
 
-static void __ip_do_redirect(struct rtable *rt, struct sk_buff *skb, struct flowi4 *fl4,
-			     bool kill_route)
+static void __ip_do_redirect(struct rtable *rt, struct sk_buff *skb,
+			     struct flowi4 *fl4, bool kill_route)
 {
 	__be32 new_gw = icmp_hdr(skb)->un.gateway;
 	__be32 old_gw = ip_hdr(skb)->saddr;
@@ -695,7 +700,8 @@ static void __ip_do_redirect(struct rtable *rt, struct sk_buff *skb, struct flow
 	if (!IN_DEV_SHARED_MEDIA(in_dev)) {
 		if (!inet_addr_onlink(in_dev, new_gw, old_gw))
 			goto reject_redirect;
-		if (IN_DEV_SEC_REDIRECTS(in_dev) && ip_fib_check_default(new_gw, dev))
+		if (IN_DEV_SEC_REDIRECTS(in_dev) &&
+		    ip_fib_check_default(new_gw, dev))
 			goto reject_redirect;
 	} else {
 		if (inet_addr_type(net, new_gw) != RTN_UNICAST)
@@ -737,7 +743,8 @@ reject_redirect:
 	;
 }
 
-static void ip_do_redirect(struct dst_entry *dst, struct sock *sk, struct sk_buff *skb)
+static void ip_do_redirect(struct dst_entry *dst, struct sock *sk,
+			   struct sk_buff *skb)
 {
 	struct rtable *rt;
 	struct flowi4 fl4;
@@ -1202,11 +1209,11 @@ static bool rt_cache_route(struct fib_nh *nh, struct rtable *rt)
 	struct rtable *orig, *prev, **p;
 	bool ret = true;
 
-	if (rt_is_input_route(rt)) {
+	if (rt_is_input_route(rt))
 		p = (struct rtable **)&nh->nh_rth_input;
-	} else {
+	else
 		p = (struct rtable **)__this_cpu_ptr(nh->nh_pcpu_rth_output);
-	}
+
 	orig = *p;
 
 	prev = cmpxchg(p, orig, rt);
@@ -1359,17 +1366,17 @@ static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 #endif
 	rth->dst.output = ip_rt_bug;
 
-	rth->rt_genid	= rt_genid(dev_net(dev));
-	rth->rt_flags	= RTCF_MULTICAST;
-	rth->rt_type	= RTN_MULTICAST;
-	rth->rt_is_input= 1;
-	rth->rt_iif	= 0;
-	rth->rt_pmtu	= 0;
-	rth->rt_gateway	= 0;
+	rth->rt_genid = rt_genid(dev_net(dev));
+	rth->rt_flags = RTCF_MULTICAST;
+	rth->rt_type = RTN_MULTICAST;
+	rth->rt_is_input = 1;
+	rth->rt_iif = 0;
+	rth->rt_pmtu = 0;
+	rth->rt_gateway = 0;
 	rth->rt_uses_gateway = 0;
 	INIT_LIST_HEAD(&rth->rt_uncached);
 	if (our) {
-		rth->dst.input= ip_local_deliver;
+		rth->dst.input = ip_local_deliver;
 		rth->rt_flags |= RTCF_LOCAL;
 	}
 
@@ -1488,8 +1495,8 @@ static int __mkroute_input(struct sk_buff *skb,
 	rth->rt_flags = flags;
 	rth->rt_type = res->type;
 	rth->rt_is_input = 1;
-	rth->rt_iif 	= 0;
-	rth->rt_pmtu	= 0;
+	rth->rt_iif = 0;
+	rth->rt_pmtu = 0;
 	rth->rt_gateway	= 0;
 	rth->rt_uses_gateway = 0;
 	INIT_LIST_HEAD(&rth->rt_uncached);
@@ -1649,25 +1656,25 @@ local_input:
 	if (!rth)
 		goto e_nobufs;
 
-	rth->dst.input= ip_local_deliver;
-	rth->dst.output= ip_rt_bug;
+	rth->dst.input = ip_local_deliver;
+	rth->dst.output = ip_rt_bug;
 #ifdef CONFIG_IP_ROUTE_CLASSID
 	rth->dst.tclassid = itag;
 #endif
 
 	rth->rt_genid = rt_genid(net);
-	rth->rt_flags 	= flags|RTCF_LOCAL;
-	rth->rt_type	= res.type;
+	rth->rt_flags = flags|RTCF_LOCAL;
+	rth->rt_type = res.type;
 	rth->rt_is_input = 1;
-	rth->rt_iif	= 0;
-	rth->rt_pmtu	= 0;
+	rth->rt_iif = 0;
+	rth->rt_pmtu = 0;
 	rth->rt_gateway	= 0;
 	rth->rt_uses_gateway = 0;
 	INIT_LIST_HEAD(&rth->rt_uncached);
 	if (res.type == RTN_UNREACHABLE) {
-		rth->dst.input= ip_error;
-		rth->dst.error= -err;
-		rth->rt_flags 	&= ~RTCF_LOCAL;
+		rth->dst.input = ip_error;
+		rth->dst.error = -err;
+		rth->rt_flags &= ~RTCF_LOCAL;
 	}
 	if (do_cache)
 		rt_cache_route(&FIB_RES_NH(res), rth);
@@ -1772,7 +1779,8 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 		return ERR_PTR(-EINVAL);
 
 	if (likely(!IN_DEV_ROUTE_LOCALNET(in_dev)))
-		if (ipv4_is_loopback(fl4->saddr) && !(dev_out->flags & IFF_LOOPBACK))
+		if (ipv4_is_loopback(fl4->saddr) &&
+		    !(dev_out->flags & IFF_LOOPBACK))
 			return ERR_PTR(-EINVAL);
 
 	if (ipv4_is_lbcast(fl4->daddr))
@@ -1919,7 +1927,9 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
 		if (fl4->flowi4_oif == 0 &&
 		    (ipv4_is_multicast(fl4->daddr) ||
 		     ipv4_is_lbcast(fl4->daddr))) {
-			/* It is equivalent to inet_addr_type(saddr) == RTN_LOCAL */
+			/* It is equivalent to
+			 * inet_addr_type(saddr) == RTN_LOCAL
+			 */
 			dev_out = __ip_dev_find(net, fl4->saddr, false);
 			if (dev_out == NULL)
 				goto out;
@@ -1944,7 +1954,9 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
 		}
 
 		if (!(fl4->flowi4_flags & FLOWI_FLAG_ANYSRC)) {
-			/* It is equivalent to inet_addr_type(saddr) == RTN_LOCAL */
+			/* It is equivalent to
+			 * inet_addr_type(saddr) == RTN_LOCAL
+			 */
 			if (!__ip_dev_find(net, fl4->saddr, false))
 				goto out;
 		}
@@ -1972,7 +1984,7 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
 		if (fl4->saddr) {
 			if (ipv4_is_multicast(fl4->daddr))
 				fl4->saddr = inet_select_addr(dev_out, 0,
-							      fl4->flowi4_scope);
+							     fl4->flowi4_scope);
 			else if (!fl4->daddr)
 				fl4->saddr = inet_select_addr(dev_out, 0,
 							      RT_SCOPE_HOST);
@@ -2061,7 +2073,8 @@ out:
 }
 EXPORT_SYMBOL_GPL(__ip_route_output_key);
 
-static struct dst_entry *ipv4_blackhole_dst_check(struct dst_entry *dst, u32 cookie)
+static struct dst_entry *ipv4_blackhole_dst_check(struct dst_entry *dst,
+						  u32 cookie)
 {
 	return NULL;
 }
@@ -2073,7 +2086,8 @@ static unsigned int ipv4_blackhole_mtu(const struct dst_entry *dst)
 	return mtu ? : dst->dev->mtu;
 }
 
-static void ipv4_rt_blackhole_update_pmtu(struct dst_entry *dst, struct sock *sk,
+static void ipv4_rt_blackhole_update_pmtu(struct dst_entry *dst,
+					  struct sock *sk,
 					  struct sk_buff *skb, u32 mtu)
 {
 }
@@ -2101,7 +2115,8 @@ static struct dst_ops ipv4_dst_blackhole_ops = {
 	.neigh_lookup		=	ipv4_neigh_lookup,
 };
 
-struct dst_entry *ipv4_blackhole_route(struct net *net, struct dst_entry *dst_orig)
+struct dst_entry *ipv4_blackhole_route(struct net *net,
+				       struct dst_entry *dst_orig)
 {
 	struct rtable *ort = (struct rtable *) dst_orig;
 	struct rtable *rt;
@@ -2265,7 +2280,8 @@ nla_put_failure:
 	return -EMSGSIZE;
 }
 
-static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh, void *arg)
+static int inet_rtm_getroute(struct sk_buff *in_skb,
+			     struct nlmsghdr *nlh, void *arg)
 {
 	struct net *net = sock_net(in_skb->sk);
 	struct rtmsg *rtm;
@@ -2297,7 +2313,9 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh, void
 	skb_reset_mac_header(skb);
 	skb_reset_network_header(skb);
 
-	/* Bugfix: need to give ip_route_input enough of an IP header to not gag. */
+	/* Bugfix: need to give ip_route_input enough
+	 * of an IP header to not gag.
+	 */
 	ip_hdr(skb)->protocol = IPPROTO_ICMP;
 	skb_reserve(skb, MAX_HEADER + sizeof(struct iphdr));
 
@@ -2596,7 +2614,8 @@ int __init ip_rt_init(void)
 	int rc = 0;
 
 #ifdef CONFIG_IP_ROUTE_CLASSID
-	ip_rt_acct = __alloc_percpu(256 * sizeof(struct ip_rt_acct), __alignof__(struct ip_rt_acct));
+	ip_rt_acct = __alloc_percpu(256 * sizeof(struct ip_rt_acct),
+				    __alignof__(struct ip_rt_acct));
 	if (!ip_rt_acct)
 		panic("IP: failed to allocate ip_rt_acct\n");
 #endif
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 1ca2536..12fadb2 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -45,7 +45,7 @@
  *					escape still
  *		Alan Cox	:	Fixed another acking RST frame bug.
  *					Should stop LAN workplace lockups.
- *		Alan Cox	: 	Some tidyups using the new skb list
+ *		Alan Cox	:	Some tidyups using the new skb list
  *					facilities
  *		Alan Cox	:	sk->keepopen now seems to work
  *		Alan Cox	:	Pulls options out correctly on accepts
@@ -160,7 +160,8 @@
  *					generates them.
  *		Alan Cox	:	Cache last socket.
  *		Alan Cox	:	Per route irtt.
- *		Matt Day	:	poll()->select() match BSD precisely on error
+ *		Matt Day	:	poll()->select() match BSD precisely
+ *					on error
  *		Alan Cox	:	New buffers
  *		Marc Tamsky	:	Various sk->prot->retransmits and
  *					sk->retransmits misupdating fixed.
@@ -168,9 +169,9 @@
  *					and TCP syn retries gets used now.
  *		Mark Yarvis	:	In tcp_read_wakeup(), don't send an
  *					ack if state is TCP_CLOSED.
- *		Alan Cox	:	Look up device on a retransmit - routes may
- *					change. Doesn't yet cope with MSS shrink right
- *					but it's a start!
+ *		Alan Cox	:	Look up device on a retransmit - routes
+ *					may change. Doesn't yet cope with MSS
+ *					shrink right but it's a start!
  *		Marc Tamsky	:	Closing in closing fixes.
  *		Mike Shaver	:	RFC1122 verifications.
  *		Alan Cox	:	rcv_saddr errors.
@@ -199,7 +200,7 @@
  *					tcp_do_sendmsg to avoid burstiness.
  *		Eric Schenk	:	Fix fast close down bug with
  *					shutdown() followed by close().
- *		Andi Kleen 	:	Make poll agree with SIGIO
+ *		Andi Kleen	:	Make poll agree with SIGIO
  *	Salvatore Sanfilippo	:	Support SO_LINGER with linger == 1 and
  *					lingertime == 0 (RFC 793 ABORT Call)
  *	Hirokazu Takahashi	:	Use copy_from_user() instead of
@@ -268,6 +269,7 @@
 #include <linux/crypto.h>
 #include <linux/time.h>
 #include <linux/slab.h>
+#include <linux/uaccess.h>
 
 #include <net/icmp.h>
 #include <net/inet_common.h>
@@ -277,7 +279,6 @@
 #include <net/netdma.h>
 #include <net/sock.h>
 
-#include <asm/uaccess.h>
 #include <asm/ioctls.h>
 
 int sysctl_tcp_fin_timeout __read_mostly = TCP_FIN_TIMEOUT;
@@ -286,22 +287,20 @@ struct percpu_counter tcp_orphan_count;
 EXPORT_SYMBOL_GPL(tcp_orphan_count);
 
 int sysctl_tcp_wmem[3] __read_mostly;
-int sysctl_tcp_rmem[3] __read_mostly;
+EXPORT_SYMBOL(sysctl_tcp_wmem);
 
+int sysctl_tcp_rmem[3] __read_mostly;
 EXPORT_SYMBOL(sysctl_tcp_rmem);
-EXPORT_SYMBOL(sysctl_tcp_wmem);
 
 atomic_long_t tcp_memory_allocated;	/* Current allocated memory. */
 EXPORT_SYMBOL(tcp_memory_allocated);
 
-/*
- * Current number of TCP sockets.
+/* Current number of TCP sockets.
  */
 struct percpu_counter tcp_sockets_allocated;
 EXPORT_SYMBOL(tcp_sockets_allocated);
 
-/*
- * TCP splice context
+/* TCP splice context
  */
 struct tcp_splice_state {
 	struct pipe_inode_info *pipe;
@@ -309,8 +308,7 @@ struct tcp_splice_state {
 	unsigned int flags;
 };
 
-/*
- * Pressure flag: try to collapse.
+/* Pressure flag: try to collapse.
  * Technical note: it is used by multiple contexts non atomically.
  * All the __sk_mem_schedule() is of this nature: accounting
  * is strict, actions are advisory and have some latency.
@@ -430,8 +428,7 @@ void tcp_init_sock(struct sock *sk)
 }
 EXPORT_SYMBOL(tcp_init_sock);
 
-/*
- *	Wait for a TCP event.
+/*	Wait for a TCP event.
  *
  *	Note that we don't need to lock the socket, as the upper poll layers
  *	take care of normal races (between the test and the event) and we don't
@@ -454,8 +451,7 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
 
 	mask = 0;
 
-	/*
-	 * POLLHUP is certainly not done right. But poll() doesn't
+	/* POLLHUP is certainly not done right. But poll() doesn't
 	 * have a notion of HUP in just one direction, and for a
 	 * socket the read side is more interesting.
 	 *
@@ -498,7 +494,8 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
 
 		/* Potential race condition. If read of tp below will
 		 * escape above sk->sk_state, we can be illegally awaken
-		 * in SYN_* states. */
+		 * in SYN_* states.
+		 */
 		if (tp->rcv_nxt - tp->copied_seq >= target)
 			mask |= POLLIN | POLLRDNORM;
 
@@ -509,14 +506,15 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
 				set_bit(SOCK_ASYNC_NOSPACE,
 					&sk->sk_socket->flags);
 				set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
-
-				/* Race breaker. If space is freed after
-				 * wspace test but before the flags are set,
-				 * IO signal will be lost.
-				 */
-				if (sk_stream_wspace(sk) >= sk_stream_min_wspace(sk))
-					mask |= POLLOUT | POLLWRNORM;
 			}
+
+			/* Race breaker. If space is freed after
+			 * wspace test but before the flags are set,
+			 * IO signal will be lost.
+			 */
+			if (sk_stream_wspace(sk) >= sk_stream_min_wspace(sk)
+			 && sk_stream_wspace(sk) >= sk_stream_min_wspace(sk))
+				mask |= POLLOUT | POLLWRNORM;
 		} else
 			mask |= POLLOUT | POLLWRNORM;
 
@@ -634,7 +632,7 @@ static inline void tcp_push(struct sock *sk, int flags, int mss_now,
 
 		tcp_mark_urg(tp, flags);
 		__tcp_push_pending_frames(sk, mss_now,
-					  (flags & MSG_MORE) ? TCP_NAGLE_CORK : nonagle);
+				(flags & MSG_MORE) ? TCP_NAGLE_CORK : nonagle);
 	}
 }
 
@@ -839,6 +837,7 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 	int err;
 	ssize_t copied;
 	long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
+	int ass_res = 0;
 
 	/* Wait for a connection to finish. One exception is TCP Fast Open
 	 * (passive side) where data is allowed to be sent before a connection
@@ -846,7 +845,8 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 	 */
 	if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) &&
 	    !tcp_passive_fastopen(sk)) {
-		if ((err = sk_stream_wait_connect(sk, &timeo)) != 0)
+		ass_res = (err = sk_stream_wait_connect(sk, &timeo));
+		if (ass_res != 0)
 			goto out_err;
 	}
 
@@ -864,7 +864,8 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 		int copy, i;
 		bool can_coalesce;
 
-		if (!tcp_send_head(sk) || (copy = size_goal - skb->len) <= 0) {
+		ass_res = (copy = size_goal - skb->len);
+		if (!tcp_send_head(sk) || ass_res <= 0) {
 new_segment:
 			if (!sk_stream_memory_free(sk))
 				goto wait_for_sndbuf;
@@ -911,7 +912,9 @@ new_segment:
 
 		copied += copy;
 		offset += copy;
-		if (!(size -= copy))
+
+		ass_res = (size -= copy);
+		if (!ass_res)
 			goto out;
 
 		if (skb->len < size_goal || (flags & MSG_OOB))
@@ -929,7 +932,8 @@ wait_for_sndbuf:
 wait_for_memory:
 		tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
 
-		if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
+		ass_res = (err = sk_stream_wait_memory(sk, &timeo));
+		if (ass_res != 0)
 			goto do_error;
 
 		mss_now = tcp_send_mss(sk, &size_goal, flags);
@@ -1029,6 +1033,7 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	int mss_now = 0, size_goal, copied_syn = 0, offset = 0;
 	bool sg;
 	long timeo;
+	int ass_res = 0;
 
 	lock_sock(sk);
 
@@ -1050,7 +1055,8 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	 */
 	if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) &&
 	    !tcp_passive_fastopen(sk)) {
-		if ((err = sk_stream_wait_connect(sk, &timeo)) != 0)
+		ass_res = (err = sk_stream_wait_connect(sk, &timeo));
+		if (ass_res != 0)
 			goto do_error;
 	}
 
@@ -1099,7 +1105,7 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		}
 
 		while (seglen > 0) {
-			int copy = 0;
+			int copy = 0, ass_res = 0;
 			int max = size_goal;
 
 			skb = tcp_write_queue_tail(sk);
@@ -1123,8 +1129,7 @@ new_segment:
 				if (!skb)
 					goto wait_for_memory;
 
-				/*
-				 * Check whether we can use HW checksum.
+				/* Check whether we can use HW checksum.
 				 */
 				if (sk->sk_route_caps & NETIF_F_ALL_CSUM)
 					skb->ip_summed = CHECKSUM_PARTIAL;
@@ -1162,7 +1167,8 @@ new_segment:
 					merge = false;
 				}
 
-				copy = min_t(int, copy, pfrag->size - pfrag->offset);
+				copy = min_t(int, copy,
+					pfrag->size - pfrag->offset);
 
 				if (!sk_wmem_schedule(sk, copy))
 					goto wait_for_memory;
@@ -1176,7 +1182,8 @@ new_segment:
 
 				/* Update the skb. */
 				if (merge) {
-					skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
+					skb_frag_size_add(
+					  &skb_shinfo(skb)->frags[i - 1], copy);
 				} else {
 					skb_fill_page_desc(skb, i, pfrag->page,
 							   pfrag->offset, copy);
@@ -1194,15 +1201,19 @@ new_segment:
 
 			from += copy;
 			copied += copy;
-			if ((seglen -= copy) == 0 && iovlen == 0)
+			ass_res = (seglen -= copy);
+			if (ass_res == 0 && iovlen == 0)
 				goto out;
 
-			if (skb->len < max || (flags & MSG_OOB) || unlikely(tp->repair))
+			if (skb->len < max ||
+			   (flags & MSG_OOB) ||
+			   unlikely(tp->repair))
 				continue;
 
 			if (forced_push(tp)) {
 				tcp_mark_push(tp, skb);
-				__tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_PUSH);
+				__tcp_push_pending_frames(sk, mss_now,
+					TCP_NAGLE_PUSH);
 			} else if (skb == tcp_send_head(sk))
 				tcp_push_one(sk, mss_now);
 			continue;
@@ -1211,9 +1222,11 @@ wait_for_sndbuf:
 			set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
 wait_for_memory:
 			if (copied)
-				tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
+				tcp_push(sk, flags & ~MSG_MORE,
+					mss_now, TCP_NAGLE_PUSH);
 
-			if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
+			ass_res = (err = sk_stream_wait_memory(sk, &timeo));
+			if (ass_res != 0)
 				goto do_error;
 
 			mss_now = tcp_send_mss(sk, &size_goal, flags);
@@ -1246,8 +1259,7 @@ out_err:
 }
 EXPORT_SYMBOL(tcp_sendmsg);
 
-/*
- *	Handle reading urgent data. BSD has very simple semantics for
+/*	Handle reading urgent data. BSD has very simple semantics for
  *	this, no blocking and very strange errors 8)
  */
 
@@ -1333,7 +1345,8 @@ void tcp_cleanup_rbuf(struct sock *sk, int copied)
 	if (inet_csk_ack_scheduled(sk)) {
 		const struct inet_connection_sock *icsk = inet_csk(sk);
 		   /* Delayed ACKs frequently hit locked sockets during bulk
-		    * receive. */
+		    * receive.
+		    */
 		if (icsk->icsk_ack.blocked ||
 		    /* Once-per-two-segments ACK was not sent by tcp_input.c */
 		    tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss ||
@@ -1366,7 +1379,8 @@ void tcp_cleanup_rbuf(struct sock *sk, int copied)
 
 			/* Send ACK now, if this read freed lots of space
 			 * in our buffer. Certainly, new_window is new window.
-			 * We can advertise it now, if it is not less than current one.
+			 * We can advertise it now, if it is not less than
+			 * current one.
 			 * "Lots" means "at least twice" here.
 			 */
 			if (new_window && new_window >= 2 * rcv_window_now)
@@ -1385,7 +1399,8 @@ static void tcp_prequeue_process(struct sock *sk)
 	NET_INC_STATS_USER(sock_net(sk), LINUX_MIB_TCPPREQUEUED);
 
 	/* RX process wants to run with disabled BHs, though it is not
-	 * necessary */
+	 * necessary
+	 */
 	local_bh_disable();
 	while ((skb = __skb_dequeue(&tp->ucopy.prequeue)) != NULL)
 		sk_backlog_rcv(sk, skb);
@@ -1445,8 +1460,7 @@ static inline struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
 	return NULL;
 }
 
-/*
- * This routine provides an alternative to tcp_recvmsg() for routines
+/* This routine provides an alternative to tcp_recvmsg() for routines
  * that would like to handle copying from skbuffs directly in 'sendfile'
  * fashion.
  * Note:
@@ -1526,8 +1540,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 }
 EXPORT_SYMBOL(tcp_read_sock);
 
-/*
- *	This routine copies from a sock struct into the user buffer.
+/*	This routine copies from a sock struct into the user buffer.
  *
  *	Technical note: in 2.3 we work on _locked_ socket, so that
  *	tricks with *seq access order and skb->users are not required.
@@ -1610,12 +1623,15 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	do {
 		u32 offset;
 
-		/* Are we at urgent data? Stop if we have read anything or have SIGURG pending. */
+		/* Are we at urgent data? Stop if we have read
+		 * anything or have SIGURG pending.
+		 */
 		if (tp->urg_data && tp->urg_seq == *seq) {
 			if (copied)
 				break;
 			if (signal_pending(current)) {
-				copied = timeo ? sock_intr_errno(timeo) : -EAGAIN;
+				copied = timeo ?
+					sock_intr_errno(timeo) : -EAGAIN;
 				break;
 			}
 		}
@@ -1744,7 +1760,8 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 				tcp_service_net_dma(sk, true);
 				tcp_cleanup_rbuf(sk, copied);
 			} else
-				dma_async_memcpy_issue_pending(tp->ucopy.dma_chan);
+				dma_async_memcpy_issue_pending(
+					tp->ucopy.dma_chan);
 		}
 #endif
 		if (copied >= target) {
@@ -1760,12 +1777,15 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 #endif
 
 		if (user_recv) {
-			int chunk;
+			int chunk, ass_res = 0;
 
 			/* __ Restore normal policy in scheduler __ */
 
-			if ((chunk = len - tp->ucopy.len) != 0) {
-				NET_ADD_STATS_USER(sock_net(sk), LINUX_MIB_TCPDIRECTCOPYFROMBACKLOG, chunk);
+			ass_res = (chunk = len - tp->ucopy.len);
+			if (ass_res != 0) {
+				NET_ADD_STATS_USER(sock_net(sk),
+					LINUX_MIB_TCPDIRECTCOPYFROMBACKLOG,
+					chunk);
 				len -= chunk;
 				copied += chunk;
 			}
@@ -1775,8 +1795,11 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 do_prequeue:
 				tcp_prequeue_process(sk);
 
-				if ((chunk = len - tp->ucopy.len) != 0) {
-					NET_ADD_STATS_USER(sock_net(sk), LINUX_MIB_TCPDIRECTCOPYFROMPREQUEUE, chunk);
+				ass_res = (chunk = len - tp->ucopy.len);
+				if (ass_res != 0) {
+					NET_ADD_STATS_USER(sock_net(sk),
+					    LINUX_MIB_TCPDIRECTCOPYFROMPREQUEUE,
+					    chunk);
 					len -= chunk;
 					copied += chunk;
 				}
@@ -1791,7 +1814,7 @@ do_prequeue:
 		}
 		continue;
 
-	found_ok_skb:
+found_ok_skb:
 		/* Ok so how much can we use? */
 		used = skb->len - offset;
 		if (len < used)
@@ -1800,19 +1823,18 @@ do_prequeue:
 		/* Do we have urgent data here? */
 		if (tp->urg_data) {
 			u32 urg_offset = tp->urg_seq - *seq;
-			if (urg_offset < used) {
-				if (!urg_offset) {
-					if (!sock_flag(sk, SOCK_URGINLINE)) {
-						++*seq;
-						urg_hole++;
-						offset++;
-						used--;
-						if (!used)
-							goto skip_copy;
-					}
-				} else
-					used = urg_offset;
+			if (urg_offset < used && !urg_offset) {
+				if (!sock_flag(sk, SOCK_URGINLINE)) {
+					++*seq;
+					urg_hole++;
+					offset++;
+					used--;
+					if (!used)
+						goto skip_copy;
+				}
 			}
+			if (urg_offset < used && urg_offset)
+				used = urg_offset;
 		}
 
 		if (!(flags & MSG_TRUNC)) {
@@ -1821,7 +1843,9 @@ do_prequeue:
 				tp->ucopy.dma_chan = net_dma_find_channel();
 
 			if (tp->ucopy.dma_chan) {
-				tp->ucopy.dma_cookie = dma_skb_copy_datagram_iovec(
+				tp->ucopy.dma_cookie =
+					dma_skb_copy_datagram_iovec(
+
 					tp->ucopy.dma_chan, skb, offset,
 					msg->msg_iov, used,
 					tp->ucopy.pinned_list);
@@ -1837,7 +1861,8 @@ do_prequeue:
 					break;
 				}
 
-				dma_async_memcpy_issue_pending(tp->ucopy.dma_chan);
+				dma_async_memcpy_issue_pending(
+					tp->ucopy.dma_chan);
 
 				if ((offset + used) == skb->len)
 					copied_early = true;
@@ -1878,7 +1903,7 @@ skip_copy:
 		}
 		continue;
 
-	found_fin_ok:
+found_fin_ok:
 		/* Process the FIN. */
 		++*seq;
 		if (!(flags & MSG_PEEK)) {
@@ -1890,14 +1915,17 @@ skip_copy:
 
 	if (user_recv) {
 		if (!skb_queue_empty(&tp->ucopy.prequeue)) {
-			int chunk;
+			int chunk, ass_res = 0;
 
 			tp->ucopy.len = copied > 0 ? len : 0;
 
 			tcp_prequeue_process(sk);
 
-			if (copied > 0 && (chunk = len - tp->ucopy.len) != 0) {
-				NET_ADD_STATS_USER(sock_net(sk), LINUX_MIB_TCPDIRECTCOPYFROMPREQUEUE, chunk);
+			ass_res = (chunk = len - tp->ucopy.len);
+			if (copied > 0 && ass_res != 0) {
+				NET_ADD_STATS_USER(sock_net(sk),
+					LINUX_MIB_TCPDIRECTCOPYFROMPREQUEUE,
+					chunk);
 				len -= chunk;
 				copied += chunk;
 			}
@@ -1971,13 +1999,13 @@ void tcp_set_state(struct sock *sk, int state)
 	sk->sk_state = state;
 
 #ifdef STATE_TRACE
-	SOCK_DEBUG(sk, "TCP sk=%p, State %s -> %s\n", sk, statename[oldstate], statename[state]);
+	SOCK_DEBUG(sk, "TCP sk=%p, State %s -> %s\n", sk,
+		statename[oldstate], statename[state]);
 #endif
 }
 EXPORT_SYMBOL_GPL(tcp_set_state);
 
-/*
- *	State processing on a close. This implements the state shift for
+/*	State processing on a close. This implements the state shift for
  *	sending our FIN frame. Note that we only send a FIN for some
  *	states. A shutdown() may have already sent the FIN, or we may be
  *	closed.
@@ -2009,8 +2037,7 @@ static int tcp_close_state(struct sock *sk)
 	return next & TCP_ACTION_FIN;
 }
 
-/*
- *	Shutdown the sending side of a connection. Much like close except
+/*	Shutdown the sending side of a connection. Much like close except
  *	that we don't receive shut down or sock_set_flag(sk, SOCK_DEAD).
  */
 
@@ -2125,7 +2152,7 @@ void tcp_close(struct sock *sk, long timeout)
 		 * required by specs (TCP_ESTABLISHED, TCP_CLOSE_WAIT, when
 		 * they look as CLOSING or LAST_ACK for Linux)
 		 * Probably, I missed some more holelets.
-		 * 						--ANK
+		 *                                             --ANK
 		 * XXX (TFO) - To start off we don't support SYN+ACK+FIN
 		 * in a single packet! (May consider it later but will
 		 * probably need API support or TCP_CORK SYN-ACK until
@@ -2235,6 +2262,7 @@ int tcp_disconnect(struct sock *sk, int flags)
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
 	int err = 0;
+	int ass_res = 0;
 	int old_state = sk->sk_state;
 
 	if (old_state != TCP_CLOSE)
@@ -2272,7 +2300,8 @@ int tcp_disconnect(struct sock *sk, int flags)
 	sk->sk_shutdown = 0;
 	sock_reset_flag(sk, SOCK_DONE);
 	tp->srtt = 0;
-	if ((tp->write_seq += tp->max_window + 2) == 0)
+	ass_res = (tp->write_seq += tp->max_window + 2);
+	if (ass_res == 0)
 		tp->write_seq = 1;
 	icsk->icsk_backoff = 0;
 	tp->snd_cwnd = 2;
@@ -2358,8 +2387,7 @@ static int tcp_repair_options_est(struct tcp_sock *tp,
 	return 0;
 }
 
-/*
- *	Socket option code for TCP.
+/*	Socket option code for TCP.
  */
 static int do_tcp_setsockopt(struct sock *sk, int level,
 		int optname, char __user *optval, unsigned int optlen)
@@ -2491,7 +2519,9 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 	case TCP_MAXSEG:
 		/* Values greater than interface MTU won't take effect. However
 		 * at the point when this call is done we typically don't yet
-		 * know which interface is going to be used */
+		 * know which interface is going to be used
+		 */
+
 		if (val < TCP_MIN_MSS || val > MAX_TCP_WINDOW) {
 			err = -EINVAL;
 			break;
@@ -2509,6 +2539,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 			 * an explicit push, which overrides even TCP_CORK
 			 * for currently queued segments.
 			 */
+
 			tp->nonagle |= TCP_NAGLE_OFF|TCP_NAGLE_PUSH;
 			tcp_push_pending_frames(sk);
 		} else {
@@ -2786,7 +2817,8 @@ void tcp_get_info(const struct sock *sk, struct tcp_info *info)
 	info->tcpi_fackets = tp->fackets_out;
 
 	info->tcpi_last_data_sent = jiffies_to_msecs(now - tp->lsndtime);
-	info->tcpi_last_data_recv = jiffies_to_msecs(now - icsk->icsk_ack.lrcvtime);
+	info->tcpi_last_data_recv =
+		 jiffies_to_msecs(now - icsk->icsk_ack.lrcvtime);
 	info->tcpi_last_ack_recv = jiffies_to_msecs(now - tp->rcv_tstamp);
 
 	info->tcpi_pmtu = icsk->icsk_pmtu_cookie;
@@ -3378,12 +3410,12 @@ int tcp_md5_hash_skb_data(struct tcp_md5sig_pool *hp,
 }
 EXPORT_SYMBOL(tcp_md5_hash_skb_data);
 
-int tcp_md5_hash_key(struct tcp_md5sig_pool *hp, const struct tcp_md5sig_key *key)
+int tcp_md5_hash_key(struct tcp_md5sig_pool *hp, const struct tcp_md5sig_key *k)
 {
 	struct scatterlist sg;
 
-	sg_init_one(&sg, key->key, key->keylen);
-	return crypto_hash_update(&hp->md5_desc, &sg, key->keylen);
+	sg_init_one(&sg, k->key, k->keylen);
+	return crypto_hash_update(&hp->md5_desc, &sg, k->keylen);
 }
 EXPORT_SYMBOL(tcp_md5_hash_key);
 
-- 
1.7.10.4

^ permalink raw reply related

* Re: [Xen-devel] [PATCH] xen/netfront: improve truesize tracking
From: Sander Eikelenboom @ 2012-12-20 14:23 UTC (permalink / raw)
  To: Sander Eikelenboom
  Cc: Eric Dumazet, netdev@vger.kernel.org, annie li,
	xen-devel@lists.xensource.com, Ian Campbell,
	Konrad Rzeszutek Wilk
In-Reply-To: <1797374383.20121220135139@eikelenboom.it>


Thursday, December 20, 2012, 1:51:39 PM, you wrote:


> Wednesday, December 19, 2012, 5:17:49 PM, you wrote:

>> On Wed, 2012-12-19 at 12:34 +0100, Sander Eikelenboom wrote:

>>> Hi Ian,
>>> 
>>> It ran overnight and i haven't seen the warn_once trigger.
>>> (but i also didn't with the previous patch)
>>> 

>> As I said, the miminum value to not trigger the warning was what Ian
>> patch was doing, but it was still a not accurate estimation.

>> Doing the real accounting might trigger slow transferts, or dropped
>> packets because of socket limits (SNDBUF / RCVBUF) being hit sooner.

>> So the real question was : If accounting for full pages, is your
>> applications run as smooth as before, with no huge performance
>> regression ?

> Ok i have added some extra debug info (see diff's below), the code still uses the old calculation for truesize (in the hope to trigger the warn_on_once again), but also calculates the variants IanC came up with.

> I haven't got a clear test case to trigger the warn_on_once, it happens just every once in a while during my normal usage and i'm not a netperf expert :-)
> So at the moment i haven't been able to trigger the warn_on_once yet, but the results so far do seem to shed some light ..

> - The first variant (current code) seems to be the most effcient and a good estimation *most* of the the, but sometimes triggers the warn_on_once in skb_try_coalesce.
> - The first variant (current code) seems to always substract from the truesize for small packets.
> - The second variant always seems keep the truesize as is for most of the small network traffic, but it also seems to work ok for larger packets.
> - The third variant seems to be a pretty wasteful estimation.

> So the last variant seems to be rather wasteful, and the second one the most accurate so far.

> Eric:
>      From the warn_on_once, delta should be smaller than len, but probably they should be as close together as possible.
>      When you say "accurate estimation", what would be a acceptable difference between DELTA and LEN ?



> [  116.965062] eth0: mtu:1500 data_len:42 len before:0 len after:42 truesize before:896 truesize after:682 nr_frags:1 variant1:-214(682) variant2:0(896) variant3:4096(4992)
> [  117.094538] eth0: mtu:1500 data_len:54 len before:0 len after:54 truesize before:896 truesize after:694 nr_frags:1 variant1:-202(694) variant2:0(896) variant3:4096(4992)
> [  117.094707] eth0: mtu:1500 data_len:54 len before:0 len after:54 truesize before:896 truesize after:694 nr_frags:1 variant1:-202(694) variant2:0(896) variant3:4096(4992)
> [  117.094869] eth0: mtu:1500 data_len:54 len before:0 len after:54 truesize before:896 truesize after:694 nr_frags:1 variant1:-202(694) variant2:0(896) variant3:4096(4992)
> [  117.095058] eth0: mtu:1500 data_len:54 len before:0 len after:54 truesize before:896 truesize after:694 nr_frags:1 variant1:-202(694) variant2:0(896) variant3:4096(4992)
> [  117.095216] eth0: mtu:1500 data_len:54 len before:0 len after:54 truesize before:896 truesize after:694 nr_frags:1 variant1:-202(694) variant2:0(896) variant3:4096(4992)
> [  117.096102] eth0: mtu:1500 data_len:54 len before:0 len after:54 truesize before:896 truesize after:694 nr_frags:1 variant1:-202(694) variant2:0(896) variant3:4096(4992)
> [  117.096311] eth0: mtu:1500 data_len:54 len before:0 len after:54 truesize before:896 truesize after:694 nr_frags:1 variant1:-202(694) variant2:0(896) variant3:4096(4992)
> [  117.096373] eth0: mtu:1500 data_len:54 len before:0 len after:54 truesize before:896 truesize after:694 nr_frags:1 variant1:-202(694) variant2:0(896) variant3:4096(4992)
> [  117.150398] eth0: mtu:1500 data_len:54 len before:0 len after:54 truesize before:896 truesize after:694 nr_frags:1 variant1:-202(694) variant2:0(896) variant3:4096(4992)
> [  117.150459] eth0: mtu:1500 data_len:54 len before:0 len after:54 truesize before:896 truesize after:694 nr_frags:1 variant1:-202(694) variant2:0(896) variant3:4096(4992)
> [  117.536901] eth0: mtu:1500 data_len:53642 len before:0 len after:53642 truesize before:896 truesize after:54282 nr_frags:14 variant1:53386(54282) variant2:53386(54282) variant3:57344(58240)
> [  117.537463] eth0: mtu:1500 data_len:15994 len before:0 len after:15994 truesize before:896 truesize after:16634 nr_frags:5 variant1:15738(16634) variant2:15738(16634) variant3:20480(21376)
> [  117.537915] eth0: mtu:1500 data_len:17442 len before:0 len after:17442 truesize before:896 truesize after:18082 nr_frags:5 variant1:17186(18082) variant2:17186(18082) variant3:20480(21376)
> [  117.538543] eth0: mtu:1500 data_len:18890 len before:0 len after:18890 truesize before:896 truesize after:19530 nr_frags:6 variant1:18634(19530) variant2:18634(19530) variant3:24576(25472)
> [  117.539223] eth0: mtu:1500 data_len:13098 len before:0 len after:13098 truesize before:896 truesize after:13738 nr_frags:4 variant1:12842(13738) variant2:12842(13738) variant3:16384(17280)
> [  117.539283] eth0: mtu:1500 data_len:7306 len before:0 len after:7306 truesize before:896 truesize after:7946 nr_frags:2 variant1:7050(7946) variant2:7050(7946) variant3:8192(9088)
> [  117.539403] skbuff: to: (null) from: (null)  skb_try_coalesce: DELTA - LEN > 100 delta:7690 len:7240 from->truesize:7946 skb_headlen(from):190 skb_shinfo(to)->nr_frags:5 skb_shinfo(from)->nr_frags:2
> [  117.540035] eth0: mtu:1500 data_len:4410 len before:0 len after:4410 truesize before:896 truesize after:5050 nr_frags:3 variant1:4154(5050) variant2:4304(5200) variant3:12288(13184)
> [  117.540153] eth0: mtu:1500 data_len:1018 len before:0 len after:1018 truesize before:896 truesize after:1658 nr_frags:1 variant1:762(1658) variant2:762(1658) variant3:4096(4992)
> [  121.981917] net_ratelimit: 27 callbacks suppressed
> [  121.981960] eth0: mtu:1500 data_len:42 len before:0 len after:42 truesize before:896 truesize after:682 nr_frags:1 variant1:-214(682) variant2:0(896) variant3:4096(4992)
> [  122.985019] eth0: mtu:1500 data_len:42 len before:0 len after:42 truesize before:896 truesize after:682 nr_frags:1 variant1:-214(682) variant2:0(896) variant3:4096(4992)
> [  123.988308] eth0: mtu:1500 data_len:42 len before:0 len after:42 truesize before:896 truesize after:682 nr_frags:1 variant1:-214(682) variant2:0(896) variant3:4096(4992)
> [  124.991961] eth0: mtu:1500 data_len:42 len before:0 len after:42 truesize before:896 truesize after:682 nr_frags:1 variant1:-214(682) variant2:0(896) variant3:4096(4992)
> [  125.995003] eth0: mtu:1500 data_len:42 len before:0 len after:42 truesize before:896 truesize after:682 nr_frags:1 variant1:-214(682) variant2:0(896) variant3:4096(4992)
> [  126.998324] eth0: mtu:1500 data_len:42 len before:0 len after:42 truesize before:896 truesize after:682 nr_frags:1 variant1:-214(682) variant2:0(896) variant3:4096(4992)



> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
> index c26e28b..8833e38 100644
> --- a/drivers/net/xen-netfront.c
> +++ b/drivers/net/xen-netfront.c
> @@ -964,6 +964,7 @@ static int xennet_poll(struct napi_struct *napi, int budget)
>         struct sk_buff_head tmpq;
>         unsigned long flags;
>         int err;
> +       int tsz,len;

>         spin_lock(&np->rx_lock);

> @@ -1037,9 +1038,22 @@ err:
>                  * receive throughout using the standard receive
>                  * buffer size was cut by 25%(!!!).
>                  */
> -               skb->truesize += skb->data_len - RX_COPY_THRESHOLD;
> +
> +
> +
> +
> +                tsz = skb->truesize;
> +                len = skb->len;
> +                /* skb->truesize += PAGE_SIZE * skb_shinfo(skb)->nr_frags; */
> +                skb->truesize += skb->data_len - RX_COPY_THRESHOLD;
>                 skb->len += skb->data_len;

> +               net_warn_ratelimited("%s: mtu:%d data_len:%d len before:%d len after:%d truesize before:%d truesize after:%d nr_frags:%d variant1:%d(%d) variant2:%d(%d) variant3:%d(%d) \n",
> +                        skb->dev->name, skb->dev->mtu, skb->data_len, len,  skb->len,tsz, skb->truesize, skb_shinfo(skb)->nr_frags,
> +                        skb->data_len - RX_COPY_THRESHOLD, tsz + skb->data_len - RX_COPY_THRESHOLD ,
> +                        skb->data_len - NETFRONT_SKB_CB(skb)->pull_to, tsz + skb->data_len - NETFRONT_SKB_CB(skb)->pull_to,
> +                        PAGE_SIZE * skb_shinfo(skb)->nr_frags, tsz + (PAGE_SIZE * skb_shinfo(skb)->nr_frags));
> +
>                 if (rx->flags & XEN_NETRXF_csum_blank)
>                         skb->ip_summed = CHECKSUM_PARTIAL;
>                 else if (rx->flags & XEN_NETRXF_data_validated)
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 3ab989b..6d0cd86 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3471,6 +3471,16 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,

>         WARN_ON_ONCE(delta < len);

> +       if(delta < len) {
> +               net_warn_ratelimited("to: %s from: %s  skb_try_coalesce: DELTA < LEN delta:%d len:%d from->truesize:%d skb_headlen(from):%d skb_shinfo(to)->nr_frags:%d skb_shinfo(from)->nr_frags:%d \n",
> +                        to->dev->name, from->dev->name, delta, len, from->truesize, skb_headlen(from), skb_shinfo(to)->nr_frags, skb_shinfo(from)->nr_frags);
> +       }
> +
+       if (delta >> len && delta - len > 100) {
> +               net_warn_ratelimited("to: %s from: %s  skb_try_coalesce: DELTA - LEN > 100 delta:%d len:%d from->truesize:%d skb_headlen(from):%d skb_shinfo(to)->nr_frags:%d skb_shinfo(from)->nr_frags:%d \n",
> +                        to->dev->name,from->dev->name, delta, len, from->truesize, skb_headlen(from), skb_shinfo(to)->nr_frags, skb_shinfo(from)->nr_frags);
> +       }
> +
>         memcpy(skb_shinfo(to)->frags + skb_shinfo(to)->nr_frags,
>                skb_shinfo(from)->frags,
>                skb_shinfo(from)->nr_frags * sizeof(skb_frag_t));



Ok i succeeded in triggering the warn_on_once, but it seems the extra debug info from netfront was just rate limited away for the offending packet :(

Dec 20 15:17:33 media kernel: [  393.464062] eth0: mtu:1500 data_len:66 len before:0 len after:66 truesize before:896 truesize after:706 nr_frags:1 variant1:-190(706) variant2:0(896) variant3:4096(4992)
Dec 20 15:17:33 media kernel: [  393.464438] eth0: mtu:1500 data_len:762 len before:0 len after:762 truesize before:896 truesize after:1402 nr_frags:1 variant1:506(1402) variant2:506(1402) variant3:4096(4992)
Dec 20 15:17:33 media kernel: [  393.465083] eth0: mtu:1500 data_len:118 len before:0 len after:118 truesize before:896 truesize after:758 nr_frags:1 variant1:-138(758) variant2:0(896) variant3:4096(4992)
Dec 20 15:17:33 media kernel: [  393.466114] eth0: mtu:1500 data_len:118 len before:0 len after:118 truesize before:896 truesize after:758 nr_frags:1 variant1:-138(758) variant2:0(896) variant3:4096(4992)
Dec 20 15:17:33 media kernel: [  393.467336] eth0: mtu:1500 data_len:66 len before:0 len after:66 truesize before:896 truesize after:706 nr_frags:1 variant1:-190(706) variant2:0(896) variant3:4096(4992)
Dec 20 15:17:35 media kernel: [  394.940211] ------------[ cut here ]------------
Dec 20 15:17:35 media kernel: [  394.940259] WARNING: at net/core/skbuff.c:3472 skb_try_coalesce+0x3fc/0x470()
Dec 20 15:17:35 media kernel: [  394.940282] Modules linked in:
Dec 20 15:17:35 media kernel: [  394.940306] Pid: 2632, comm: glusterfs Not tainted 3.7.0-rc0-20121220-netfrontdebug1 #1
Dec 20 15:17:35 media kernel: [  394.940330] Call Trace:
Dec 20 15:17:35 media kernel: [  394.940343]  <IRQ>  [<ffffffff8106889a>] warn_slowpath_common+0x7a/0xb0
Dec 20 15:17:35 media kernel: [  394.940384]  [<ffffffff810688e5>] warn_slowpath_null+0x15/0x20
Dec 20 15:17:35 media kernel: [  394.940409]  [<ffffffff8184298c>] skb_try_coalesce+0x3fc/0x470
Dec 20 15:17:35 media kernel: [  394.940434]  [<ffffffff818fb049>] tcp_try_coalesce+0x69/0xc0
Dec 20 15:17:35 media kernel: [  394.940458]  [<ffffffff818fb0f4>] tcp_queue_rcv+0x54/0x100
Dec 20 15:17:35 media kernel: [  394.940481]  [<ffffffff8190029f>] ? tcp_mtup_init+0x2f/0x90
Dec 20 15:17:35 media kernel: [  394.940504]  [<ffffffff818ffbdb>] tcp_rcv_established+0x2bb/0x6a0
Dec 20 15:17:35 media kernel: [  394.940528]  [<ffffffff8190839f>] ? tcp_v4_rcv+0x6cf/0xb10
Dec 20 15:17:35 media kernel: [  394.940551]  [<ffffffff81907985>] tcp_v4_do_rcv+0x135/0x480
Dec 20 15:17:35 media kernel: [  394.940576]  [<ffffffff819b3532>] ? _raw_spin_lock_nested+0x42/0x50
Dec 20 15:17:35 media kernel: [  394.940600]  [<ffffffff8190839f>] ? tcp_v4_rcv+0x6cf/0xb10
Dec 20 15:17:35 media kernel: [  394.940623]  [<ffffffff8190862d>] tcp_v4_rcv+0x95d/0xb10
Dec 20 15:17:35 media kernel: [  394.940666]  [<ffffffff810b5688>] ? lock_acquire+0xd8/0x100
Dec 20 15:17:35 media kernel: [  394.940694]  [<ffffffff818e4d6a>] ip_local_deliver_finish+0x11a/0x230
Dec 20 15:17:35 media kernel: [  394.940720]  [<ffffffff818e4c95>] ? ip_local_deliver_finish+0x45/0x230
Dec 20 15:17:35 media kernel: [  394.940745]  [<ffffffff818e4eb8>] ip_local_deliver+0x38/0x80
Dec 20 15:17:35 media kernel: [  394.940784]  [<ffffffff818e447a>] ip_rcv_finish+0x15a/0x630
Dec 20 15:17:35 media kernel: [  394.940807]  [<ffffffff818e4b68>] ip_rcv+0x218/0x300
Dec 20 15:17:35 media kernel: [  394.940829]  [<ffffffff8184bf2d>] __netif_receive_skb+0x65d/0x8d0
Dec 20 15:17:35 media kernel: [  394.940853]  [<ffffffff8184ba15>] ? __netif_receive_skb+0x145/0x8d0
Dec 20 15:17:35 media kernel: [  394.940889]  [<ffffffff810b192d>] ? trace_hardirqs_on+0xd/0x10
Dec 20 15:17:35 media kernel: [  394.940914]  [<ffffffff810fecbb>] ? free_hot_cold_page+0x1ab/0x1e0
Dec 20 15:17:35 media kernel: [  394.940939]  [<ffffffff8184e4f8>] netif_receive_skb+0x28/0xf0
Dec 20 15:17:35 media kernel: [  394.940964]  [<ffffffff81843e83>] ? __pskb_pull_tail+0x253/0x340
Dec 20 15:17:35 media kernel: [  394.941000]  [<ffffffff8164fbb5>] xennet_poll+0xae5/0xed0
Dec 20 15:17:35 media kernel: [  394.941024]  [<ffffffff81080081>] ? wake_up_worker+0x1/0x30
Dec 20 15:17:35 media kernel: [  394.941046]  [<ffffffff810b2fbc>] ? validate_chain+0x13c/0x1300
Dec 20 15:17:35 media kernel: [  394.941075]  [<ffffffff8184ed66>] net_rx_action+0x136/0x260
Dec 20 15:17:35 media kernel: [  394.941098]  [<ffffffff81070551>] ? __do_softirq+0x71/0x1a0
Dec 20 15:17:35 media kernel: [  394.941133]  [<ffffffff810705a9>] __do_softirq+0xc9/0x1a0
Dec 20 15:17:35 media kernel: [  394.941157]  [<ffffffff819b623c>] call_softirq+0x1c/0x30
Dec 20 15:17:35 media kernel: [  394.941179]  [<ffffffff8100fdc5>] do_softirq+0x85/0xf0
Dec 20 15:17:35 media kernel: [  394.941201]  [<ffffffff8107041e>] irq_exit+0x9e/0xd0
Dec 20 15:17:35 media kernel: [  394.941235]  [<ffffffff81430b1f>] xen_evtchn_do_upcall+0x2f/0x40
Dec 20 15:17:35 media kernel: [  394.941259]  [<ffffffff819b629e>] xen_do_hypervisor_callback+0x1e/0x30
Dec 20 15:17:35 media kernel: [  394.941279]  <EOI>  [<ffffffff8100122a>] ? xen_hypercall_xen_version+0xa/0x20
Dec 20 15:17:35 media kernel: [  394.941318]  [<ffffffff8100122a>] ? xen_hypercall_xen_version+0xa/0x20
Dec 20 15:17:35 media kernel: [  394.941356]  [<ffffffff8100890d>] ? xen_force_evtchn_callback+0xd/0x10
Dec 20 15:17:35 media kernel: [  394.941381]  [<ffffffff810092b2>] ? check_events+0x12/0x20
Dec 20 15:17:35 media kernel: [  394.941405]  [<ffffffff81009259>] ? xen_irq_enable_direct_reloc+0x4/0x4
Dec 20 15:17:35 media kernel: [  394.941432]  [<ffffffff819b3f6c>] ? _raw_spin_unlock_irq+0x3c/0x70
Dec 20 15:17:35 media kernel: [  394.941473]  [<ffffffff81095f83>] ? finish_task_switch+0x83/0xe0
Dec 20 15:17:35 media kernel: [  394.941507]  [<ffffffff81095f46>] ? finish_task_switch+0x46/0xe0
Dec 20 15:17:35 media kernel: [  394.941533]  [<ffffffff819b2434>] ? __schedule+0x444/0x880
Dec 20 15:17:35 media kernel: [  394.941555]  [<ffffffff810b2fbc>] ? validate_chain+0x13c/0x1300
Dec 20 15:17:35 media kernel: [  394.941580]  [<ffffffff810b4c4b>] ? __lock_acquire+0x46b/0xdd0
Dec 20 15:17:35 media kernel: [  394.941614]  [<ffffffff810b4c4b>] ? __lock_acquire+0x46b/0xdd0
Dec 20 15:17:35 media kernel: [  394.941638]  [<ffffffff819aff95>] ? __mutex_unlock_slowpath+0x135/0x1d0
Dec 20 15:17:35 media kernel: [  394.941663]  [<ffffffff819b2904>] ? schedule+0x24/0x70
Dec 20 15:17:35 media kernel: [  394.941697]  [<ffffffff819b179d>] ? schedule_hrtimeout_range_clock+0x11d/0x140
Dec 20 15:17:35 media kernel: [  394.941725]  [<ffffffff810b5688>] ? lock_acquire+0xd8/0x100
Dec 20 15:17:35 media kernel: [  394.941748]  [<ffffffff8118a558>] ? ep_poll+0xf8/0x3a0
Dec 20 15:17:35 media kernel: [  394.941770]  [<ffffffff819b4015>] ? _raw_spin_unlock_irqrestore+0x75/0xa0
Dec 20 15:17:35 media kernel: [  394.941808]  [<ffffffff810b1818>] ? trace_hardirqs_on_caller+0xf8/0x200
Dec 20 15:17:35 media kernel: [  394.941833]  [<ffffffff819b17ce>] ? schedule_hrtimeout_range+0xe/0x10
Dec 20 15:17:35 media kernel: [  394.941856]  [<ffffffff8118a75a>] ? ep_poll+0x2fa/0x3a0
Dec 20 15:17:35 media kernel: [  394.941878]  [<ffffffff81098630>] ? try_to_wake_up+0x310/0x310
Dec 20 15:17:35 media kernel: [  394.941913]  [<ffffffff810b5b17>] ? lock_release+0x117/0x250
Dec 20 15:17:35 media kernel: [  394.941938]  [<ffffffff81165fd7>] ? fget_light+0xd7/0x140
Dec 20 15:17:35 media kernel: [  394.941959]  [<ffffffff81165f3a>] ? fget_light+0x3a/0x140
Dec 20 15:17:35 media kernel: [  394.941981]  [<ffffffff8118a8ce>] ? sys_epoll_wait+0xce/0xe0
Dec 20 15:17:35 media kernel: [  394.942015]  [<ffffffff819b4e69>] ? system_call_fastpath+0x16/0x1b
Dec 20 15:17:35 media kernel: [  394.942036] ---[ end trace 6f3a832c9e91c8af ]---
Dec 20 15:17:35 media kernel: [  394.942056] to: (null) from: (null)  skb_try_coalesce: DELTA < LEN delta:22978 len:23168 from->truesize:23874 skb_headlen(from):0 skb_shinfo(to)->nr_frags:4 skb_shinfo(from)->nr_frags:6
Dec 20 15:17:35 media kernel: [  394.968199] to: (null) from: (null)  skb_try_coalesce: DELTA < LEN delta:14290 len:14480 from->truesize:15186 skb_headlen(from):0 skb_shinfo(to)->nr_frags:13 skb_shinfo(from)->nr_frags:4
Dec 20 15:17:35 media kernel: [  395.262814] net_ratelimit: 371 callbacks suppressed
Dec 20 15:17:35 media kernel: [  395.262858] eth0: mtu:1500 data_len:90 len before:0 len after:90 truesize before:896 truesize after:730 nr_frags:1 variant1:-166(730) variant2:0(896) variant3:4096(4992)
Dec 20 15:17:35 media kernel: [  395.264767] eth0: mtu:1500 data_len:66 len before:0 len after:66 truesize before:896 truesize after:706 nr_frags:1 variant1:-190(706) variant2:0(896) variant3:4096(4992)
Dec 20 15:17:35 media kernel: [  395.266193] eth0: mtu:1500 data_len:42 len before:0 len after:42 truesize before:896 truesize after:682 nr_frags:1 variant1:-214(682) variant2:0(896) variant3:4096(4992)
Dec 20 15:17:35 media kernel: [  395.268422] eth0: mtu:1500 data_len:66 len before:0 len after:66 truesize before:896 truesize after:706 nr_frags:1 variant1:-190(706) variant2:0(896) variant3:4096(4992)
Dec 20 15:17:35 media kernel: [  395.271617] eth0: mtu:1500 data_len:66 len before:0 len after:66 truesize before:896 truesize after:706 nr_frags:1 variant1:-190(706) variant2:0(896) variant3:4096(4992)
Dec 20 15:17:35 media kernel: [  395.274794] eth0: mtu:1500 data_len:66 len before:0 len after:66 truesize before:896 truesize after:706 nr_frags:1 variant1:-190(706) variant2:0(896) variant3:4096(4992)
Dec 20 15:17:35 media kernel: [  395.278104] eth0: mtu:1500 data_len:66 len before:0 len after:66 truesize before:896 truesize after:706 nr_frags:1 variant1:-190(706) variant2:0(896) variant3:4096(4992)
Dec 20 15:17:35 media kernel: [  395.281319] eth0: mtu:1500 data_len:66 len before:0 len after:66 truesize before:896 truesize after:706 nr_frags:1 variant1:-190(706) variant2:0(896) variant3:4096(4992)
Dec 20 15:17:35 media kernel: [  395.284454] eth0: mtu:1500 data_len:66 len before:0 len after:66 truesize before:896 truesize after:706 nr_frags:1 variant1:-190(706) variant2:0(896) variant3:4096(4992)
Dec 20 15:17:35 media kernel: [  395.287797] eth0: mtu:1500 data_len:66 len before:0 len after:66 truesize before:896 truesize after:706 nr_frags:1 variant1:-190(706) variant2:0(896) variant3:4096(4992)
Dec 20 15:17:35 media kernel: [  395.291121] eth0: mtu:1500 data_len:66 len before:0 len after:66 truesize before:896 truesize after:706 nr_frags:1 variant1:-190(706) variant2:0(896) variant3:4096(4992)

^ permalink raw reply

* Re: [PATCH net-next V4 03/13] bridge: Validate that vlan is permitted on ingress
From: Shmulik Ladkani @ 2012-12-20 14:07 UTC (permalink / raw)
  To: Vlad Yasevich
  Cc: netdev, shemminger, davem, or.gerlitz, jhs, mst, erdnetdev, jiri
In-Reply-To: <1355939304-21804-4-git-send-email-vyasevic@redhat.com>

Hi Vlad,

On Wed, 19 Dec 2012 12:48:14 -0500 Vlad Yasevich <vyasevic@redhat.com> wrote:
> +static bool br_allowed_ingress(struct net_bridge_port *p, struct sk_buff *skb)
> +{
> +	struct net_port_vlan *pve;
> +	u16 vid;
> +
> +	/* If there are no vlan in the permitted list, all packets are
> +	 * permitted.
> +	 */
> +	if (list_empty(&p->vlan_list))
> +		return true;

I assumed the default policy would be Drop in such case, otherwise
leaking between vlan domains is possible.
Or maybe, ingress policy when port isn't a member of ingress VID should
be configurable (drop/allow).

> +	vid = br_get_vlan(skb);
> +	pve = nbp_vlan_find(p, vid);

Why search by iterating through NBP's vlan_list?
You know the VID (hence may fetch the net_bridge_vlan from the hash), so
why don't you directly consult the net_bridge_vlan's port_bitmap?

> @@ -54,6 +74,9 @@ int br_handle_frame_finish(struct sk_buff *skb)
>  	if (!p || p->state == BR_STATE_DISABLED)
>  		goto drop;
>  
> +	if (!br_allowed_ingress(p, skb))
> +		goto drop;
> +

This condition should be also encorporated upon "ingress" at the "bridge
master port" (that is, early at br_dev_xmit).
Think of the "bridge master port" as yet another port:
upon "ingress" (meaning, tx packets from the ip stack), we should
also enforce any ingress permission rules.

Regards,
Shmulik

^ permalink raw reply

* Lockdep warning in vxlan
From: Yan Burman @ 2012-12-20 14:00 UTC (permalink / raw)
  To: shemminger, netdev, Yan Burman

Hi.

When working with vxlan from current net-next, I got a lockdep warning 
(below).
It seems to happen when I have host B pinging host A and while the pings 
continue,
I do "ip link del" on the vxlan interface on host A. The lockdep warning 
is on host A.
Tell me if you need some more info.

=============================================
[ INFO: possible recursive locking detected ]
3.7.0+ #24 Not tainted
---------------------------------------------
swapper/1/0 is trying to acquire lock:
  (&n->lock){++--..}, at: [<ffffffff8139f56e>] __neigh_event_send+0x2e/0x2f0

but task is already holding lock:
  (&n->lock){++--..}, at: [<ffffffff813f63f4>] arp_solicit+0x1d4/0x280

other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&n->lock);
   lock(&n->lock);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

4 locks held by swapper/1/0:
  #0:  (((&n->timer))){+.-...}, at: [<ffffffff8104b350>] 
call_timer_fn+0x0/0x1c0
  #1:  (&n->lock){++--..}, at: [<ffffffff813f63f4>] arp_solicit+0x1d4/0x280
  #2:  (rcu_read_lock_bh){.+....}, at: [<ffffffff81395400>] 
dev_queue_xmit+0x0/0x5d0
  #3:  (rcu_read_lock_bh){.+....}, at: [<ffffffff813cb41e>] 
ip_finish_output+0x13e/0x640

stack backtrace:
Pid: 0, comm: swapper/1 Not tainted 3.7.0+ #24
Call Trace:
  <IRQ>  [<ffffffff8108c7ac>] validate_chain+0xdcc/0x11f0
  [<ffffffff8108d570>] ? __lock_acquire+0x440/0xc30
  [<ffffffff81120565>] ? kmem_cache_free+0xe5/0x1c0
  [<ffffffff8108d570>] __lock_acquire+0x440/0xc30
  [<ffffffff813c3570>] ? inet_getpeer+0x40/0x600
  [<ffffffff8108d570>] ? __lock_acquire+0x440/0xc30
  [<ffffffff8139f56e>] ? __neigh_event_send+0x2e/0x2f0
  [<ffffffff8108ddf5>] lock_acquire+0x95/0x140
  [<ffffffff8139f56e>] ? __neigh_event_send+0x2e/0x2f0
  [<ffffffff8108d570>] ? __lock_acquire+0x440/0xc30
  [<ffffffff81448d4b>] _raw_write_lock_bh+0x3b/0x50
  [<ffffffff8139f56e>] ? __neigh_event_send+0x2e/0x2f0
  [<ffffffff8139f56e>] __neigh_event_send+0x2e/0x2f0
  [<ffffffff8139f99b>] neigh_resolve_output+0x16b/0x270
  [<ffffffff813cb62d>] ip_finish_output+0x34d/0x640
  [<ffffffff813cb41e>] ? ip_finish_output+0x13e/0x640
  [<ffffffffa046f146>] ? vxlan_xmit+0x556/0xbec [vxlan]
  [<ffffffff813cb9a0>] ip_output+0x80/0xf0
  [<ffffffff813ca368>] ip_local_out+0x28/0x80
  [<ffffffffa046f25a>] vxlan_xmit+0x66a/0xbec [vxlan]
  [<ffffffffa046f146>] ? vxlan_xmit+0x556/0xbec [vxlan]
  [<ffffffff81394a50>] ? skb_gso_segment+0x2b0/0x2b0
  [<ffffffff81449355>] ? _raw_spin_unlock_irqrestore+0x65/0x80
  [<ffffffff81394c57>] ? dev_queue_xmit_nit+0x207/0x270
  [<ffffffff813950c8>] dev_hard_start_xmit+0x298/0x5d0
  [<ffffffff813956f3>] dev_queue_xmit+0x2f3/0x5d0
  [<ffffffff81395400>] ? dev_hard_start_xmit+0x5d0/0x5d0
  [<ffffffff813f5788>] arp_xmit+0x58/0x60
  [<ffffffff813f59db>] arp_send+0x3b/0x40
  [<ffffffff813f6424>] arp_solicit+0x204/0x280
  [<ffffffff813a1a70>] ? neigh_add+0x310/0x310
  [<ffffffff8139f515>] neigh_probe+0x45/0x70
  [<ffffffff813a1c10>] neigh_timer_handler+0x1a0/0x2a0
  [<ffffffff8104b3cf>] call_timer_fn+0x7f/0x1c0
  [<ffffffff8104b350>] ? detach_if_pending+0x120/0x120
  [<ffffffff8104b748>] run_timer_softirq+0x238/0x2b0
  [<ffffffff813a1a70>] ? neigh_add+0x310/0x310
  [<ffffffff81043e51>] __do_softirq+0x101/0x280
  [<ffffffff814518cc>] call_softirq+0x1c/0x30
  [<ffffffff81003b65>] do_softirq+0x85/0xc0
  [<ffffffff81043a7e>] irq_exit+0x9e/0xc0
  [<ffffffff810264f8>] smp_apic_timer_interrupt+0x68/0xa0
  [<ffffffff8145122f>] apic_timer_interrupt+0x6f/0x80
  <EOI>  [<ffffffff8100a054>] ? mwait_idle+0xa4/0x1c0
  [<ffffffff8100a04b>] ? mwait_idle+0x9b/0x1c0
  [<ffffffff8100a6a9>] cpu_idle+0x89/0xe0
  [<ffffffff81441127>] start_secondary+0x1b2/0x1b6

Hope this helps
Yan

^ permalink raw reply

* skb->cb size checks (was Re: [PATCH 00/17] ATM fixes for pppoatm/br2684)
From: David Woodhouse @ 2012-12-20 14:03 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20121201.204906.1703696018528746748.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 1996 bytes --]

On Sat, 2012-12-01 at 20:49 -0500, David Miller wrote:
> From: David Woodhouse <dwmw2@infradead.org>
> Date: Sun, 02 Dec 2012 00:40:47 +0000
> 
> > On Sat, 2012-12-01 at 17:33 +0000, David Woodhouse wrote:
> >> 
> >> Very glad I added the BUILD_BUG_ON on the cb struct size now. Perhaps
> >> there should be a generic helper for that? Something like
> >>  skb_cb_cast(struct foo_cb, skb) could do it automatically...?
> > 
> > Something like this, perhaps? Using skb_cast_cb() would then make it
> > fairly much impossible to accidentally overflow the size of the skb cb.
> 
> I actually prefer what we do now, which is do the BUILD_BUG_ON()
> once in the subsystem specific code, usually the initializer.
> 
> It's part of creating a new SKB cb, adding that assertion somewhere.

I looked harder at this, and should follow up before it actually does
fall out of the cracks in my brain and get completely forgotten.

Basically, you lie :)

What we *actually* do now, in about two-thirds of cases¹ even in net/
code (I didn't even look at drivers, which I expect to be worse), is use
skb->cb without any form of automatic size check at all. No manual
BUILD_BUG_ON() or anything.

Admittedly, in almost all cases that *isn't* a real problem, because the
structure *isn't* too big for skb->cb and it's all fine. But as a matter
of principle we probably *should* be doing those checks. Just in *case*
someone comes along and adds something stupid to the structure.

So... should we:
 - Ignore the "problem" and leave things as they are.

 - Go through and fix the 2/3 of offending net/ code and then the
   drivers too, *without* making the generic 'deference and automatic
   check' macro that I think would simplify that and help to keep us
   honest in future.
or
 - Let me add something like the skb_cast_cb() macro I wanted, then use
   it in all the offending code I can find.

-- 
dwmw2

¹ http://www.spinics.net/lists/netdev/msg218642.html


[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 6171 bytes --]

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox