[PATCH] bridge: avoid ptype_all packet handling

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] bridge: avoid ptype_all packet handling
@ 2007-03-01  1:18 Stephen Hemminger
  2007-03-01  1:28 ` Ben Greear
  2007-03-02 21:26 ` David Miller
  0 siblings, 2 replies; 25+ messages in thread
From: Stephen Hemminger @ 2007-03-01  1:18 UTC (permalink / raw)
  To: David Miller; +Cc: bridge, netdev

I was measuring bridging/routing performance and noticed this.

The current code runs the "all packet" type handlers before calling the
bridge hook.  If an application (like some DHCP clients) is using AF_PACKET,
this means that each received packet gets run through the Berkeley Packet Filter
code in sk_run_filter (slow).

By moving the bridging hook to run first, the packets flowing through
the bridge get filtered out there. This results in a 14%
improvement in performance, but it does mean that some snooping applications
would miss packets if being used on a bridge.  The correct way to see all
packets on a bridge is to set the bridge pseudo-device to promiscuous
mode.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
---
 net/core/dev.c |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index cf71614..dc2cda6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1792,6 +1792,10 @@ int netif_receive_skb(struct sk_buff *skb)

 	rcu_read_lock();

+	if (handle_bridge(&skb, &pt_prev, &ret, orig_dev))
+		goto out;
+
+
 #ifdef CONFIG_NET_CLS_ACT
 	if (skb->tc_verd & TC_NCLS) {
 		skb->tc_verd = CLR_TC_NCLS(skb->tc_verd);
@@ -1826,9 +1830,6 @@ int netif_receive_skb(struct sk_buff *skb)
 ncls:
 #endif

-	if (handle_bridge(&skb, &pt_prev, &ret, orig_dev))
-		goto out;
-
 	type = skb->protocol;
 	list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type)&15], list) {
 		if (ptype->type == type &&
-- 
1.4.4.2

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH] bridge: avoid ptype_all packet handling
  2007-03-01  1:18 [PATCH] bridge: avoid ptype_all packet handling Stephen Hemminger
@ 2007-03-01  1:28 ` Ben Greear
  2007-03-01  3:56   ` Stephen Hemminger
  2007-03-02 21:26 ` David Miller
  1 sibling, 1 reply; 25+ messages in thread
From: Ben Greear @ 2007-03-01  1:28 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, bridge, netdev

Stephen Hemminger wrote:
> I was measuring bridging/routing performance and noticed this.
> 
> The current code runs the "all packet" type handlers before calling the
> bridge hook.  If an application (like some DHCP clients) is using AF_PACKET,
> this means that each received packet gets run through the Berkeley Packet Filter
> code in sk_run_filter (slow).
> 
> By moving the bridging hook to run first, the packets flowing through
> the bridge get filtered out there. This results in a 14%
> improvement in performance, but it does mean that some snooping applications
> would miss packets if being used on a bridge.  The correct way to see all
> packets on a bridge is to set the bridge pseudo-device to promiscuous
> mode.

Seems it would be better to fix these clients to be more selective as to
where they bind.

This breaks the case where you want to see packets on a particular interface,
not just the entire bridge, right?

Thanks,
Ben

> 
> Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
> ---
>  net/core/dev.c |    7 ++++---
>  1 files changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index cf71614..dc2cda6 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1792,6 +1792,10 @@ int netif_receive_skb(struct sk_buff *skb)
>  
>  	rcu_read_lock();
>  
> +	if (handle_bridge(&skb, &pt_prev, &ret, orig_dev))
> +		goto out;
> +
> +
>  #ifdef CONFIG_NET_CLS_ACT
>  	if (skb->tc_verd & TC_NCLS) {
>  		skb->tc_verd = CLR_TC_NCLS(skb->tc_verd);
> @@ -1826,9 +1830,6 @@ int netif_receive_skb(struct sk_buff *skb)
>  ncls:
>  #endif
>  
> -	if (handle_bridge(&skb, &pt_prev, &ret, orig_dev))
> -		goto out;
> -
>  	type = skb->protocol;
>  	list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type)&15], list) {
>  		if (ptype->type == type &&


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] bridge: avoid ptype_all packet handling
  2007-03-01  1:28 ` Ben Greear
@ 2007-03-01  3:56   ` Stephen Hemminger
  2007-03-01  4:05     ` Ben Greear
  0 siblings, 1 reply; 25+ messages in thread
From: Stephen Hemminger @ 2007-03-01  3:56 UTC (permalink / raw)
  To: Ben Greear; +Cc: David Miller, bridge, netdev

On Wed, 28 Feb 2007 17:28:09 -0800
Ben Greear <greearb@candelatech.com> wrote:

> Stephen Hemminger wrote:
> > I was measuring bridging/routing performance and noticed this.
> > 
> > The current code runs the "all packet" type handlers before calling
> > the bridge hook.  If an application (like some DHCP clients) is
> > using AF_PACKET, this means that each received packet gets run
> > through the Berkeley Packet Filter code in sk_run_filter (slow).
> > 
> > By moving the bridging hook to run first, the packets flowing
> > through the bridge get filtered out there. This results in a 14%
> > improvement in performance, but it does mean that some snooping
> > applications would miss packets if being used on a bridge.  The
> > correct way to see all packets on a bridge is to set the bridge
> > pseudo-device to promiscuous mode.
> 
> Seems it would be better to fix these clients to be more selective as
> to where they bind.

The problem is any use of BPF is a lose, if it has to be done to all
traffic.

> This breaks the case where you want to see packets on a particular
> interface, not just the entire bridge, right?

It might be possible to use promisc counter to handle this.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] bridge: avoid ptype_all packet handling
  2007-03-01  3:56   ` Stephen Hemminger
@ 2007-03-01  4:05     ` Ben Greear
  2007-03-01  7:04       ` Stephen Hemminger
  0 siblings, 1 reply; 25+ messages in thread
From: Ben Greear @ 2007-03-01  4:05 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, bridge, netdev

Stephen Hemminger wrote:
> On Wed, 28 Feb 2007 17:28:09 -0800
> Ben Greear <greearb@candelatech.com> wrote:
>
>   
>> Stephen Hemminger wrote:
>>     
>>> I was measuring bridging/routing performance and noticed this.
>>>
>>> The current code runs the "all packet" type handlers before calling
>>> the bridge hook.  If an application (like some DHCP clients) is
>>> using AF_PACKET, this means that each received packet gets run
>>> through the Berkeley Packet Filter code in sk_run_filter (slow).
>>>
>>> By moving the bridging hook to run first, the packets flowing
>>> through the bridge get filtered out there. This results in a 14%
>>> improvement in performance, but it does mean that some snooping
>>> applications would miss packets if being used on a bridge.  The
>>> correct way to see all packets on a bridge is to set the bridge
>>> pseudo-device to promiscuous mode.
>>>       
>> Seems it would be better to fix these clients to be more selective as
>> to where they bind.
>>     
>
> The problem is any use of BPF is a lose, if it has to be done to all
> traffic.
>   
Right, but couldn't you have the dhcp client bind to eth0, eth7, and br0 
(ie, skipping the eth1-6 that comprise the bridge group?)

The only difficulty I see is having the client know when new devices 
come and go, but there are probably
ways to know that without keeping a whole lot of state or probing the 
/proc/net/dev (like my own bloated app does :))

I envision the client args to be something like --skip-devices "eth1 
eth2 eth3 ..."

I know you can bind raw packet sockets to individual devices, though I 
don't know much about BPF, so it's
possible I'm wrong...

>> This breaks the case where you want to see packets on a particular
>> interface, not just the entire bridge, right?
>>     
>
> It might be possible to use promisc counter to handle this.
>   
Not really, it's perfectly valid to sniff a port in non-promiscuous mode...

Ben

> -
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>   


-- 
Ben Greear <greearb@candelatech.com> 
Candela Technologies Inc  http://www.candelatech.com



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] bridge: avoid ptype_all packet handling
  2007-03-01  4:05     ` Ben Greear
@ 2007-03-01  7:04       ` Stephen Hemminger
  2007-03-01  7:22         ` David Miller
  0 siblings, 1 reply; 25+ messages in thread
From: Stephen Hemminger @ 2007-03-01  7:04 UTC (permalink / raw)
  To: Ben Greear; +Cc: David Miller, bridge, netdev

Ben Greear wrote:
> Stephen Hemminger wrote:
>> On Wed, 28 Feb 2007 17:28:09 -0800
>> Ben Greear <greearb@candelatech.com> wrote:
>>
>>  
>>> Stephen Hemminger wrote:
>>>    
>>>> I was measuring bridging/routing performance and noticed this.
>>>>
>>>> The current code runs the "all packet" type handlers before calling
>>>> the bridge hook.  If an application (like some DHCP clients) is
>>>> using AF_PACKET, this means that each received packet gets run
>>>> through the Berkeley Packet Filter code in sk_run_filter (slow).
>>>>
>>>> By moving the bridging hook to run first, the packets flowing
>>>> through the bridge get filtered out there. This results in a 14%
>>>> improvement in performance, but it does mean that some snooping
>>>> applications would miss packets if being used on a bridge.  The
>>>> correct way to see all packets on a bridge is to set the bridge
>>>> pseudo-device to promiscuous mode.
>>>>       
>>> Seems it would be better to fix these clients to be more selective as
>>> to where they bind.
>>>     
>>
>> The problem is any use of BPF is a lose, if it has to be done to all
>> traffic.
>>   
> Right, but couldn't you have the dhcp client bind to eth0, eth7, and 
> br0 (ie, skipping the eth1-6 that comprise the bridge group?)
>
> The only difficulty I see is having the client know when new devices 
> come and go, but there are probably
> ways to know that without keeping a whole lot of state or probing the 
> /proc/net/dev (like my own bloated app does :))
>
> I envision the client args to be something like --skip-devices "eth1 
> eth2 eth3 ..."
>
> I know you can bind raw packet sockets to individual devices, though I 
> don't know much about BPF, so it's
> possible I'm wrong...
>
The kernel has to deal with busted applications all the time. And each 
damn distro and configuration
seems to invent it's own new way of doing network configuration. If an 
normal application has
to use something like raw packet filtering, it seems there is a missing 
API.


>>> This breaks the case where you want to see packets on a particular
>>> interface, not just the entire bridge, right?
>>>     
>>
>> It might be possible to use promisc counter to handle this.
>>   
> Not really, it's perfectly valid to sniff a port in non-promiscuous 
> mode...
>
The non-promiscuous mode packets still make it in through the normal 
receive path.
The only packets that don't make up the stack are those that are being 
bridged.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] bridge: avoid ptype_all packet handling
  2007-03-01  7:04       ` Stephen Hemminger
@ 2007-03-01  7:22         ` David Miller
  2007-03-01  7:26           ` Stephen Hemminger
  0 siblings, 1 reply; 25+ messages in thread
From: David Miller @ 2007-03-01  7:22 UTC (permalink / raw)
  To: shemminger; +Cc: greearb, bridge, netdev

From: Stephen Hemminger <shemminger@linux-foundation.org>
Date: Wed, 28 Feb 2007 23:04:36 -0800

> If an normal application has to use something like raw packet
> filtering, it seems there is a missing API.

I'm loosely following this discussion, but Ben mentions DHCP
and I remember learning the other month that DHCP uses AF_PACKET
and filtering instead of IP RAW sockets because it needs to get
the MAC address and RAW sockets don't provide that.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] bridge: avoid ptype_all packet handling
  2007-03-01  7:22         ` David Miller
@ 2007-03-01  7:26           ` Stephen Hemminger
  2007-03-01  7:30             ` David Miller
  0 siblings, 1 reply; 25+ messages in thread
From: Stephen Hemminger @ 2007-03-01  7:26 UTC (permalink / raw)
  To: David Miller; +Cc: greearb, bridge, netdev

David Miller wrote:
> From: Stephen Hemminger <shemminger@linux-foundation.org>
> Date: Wed, 28 Feb 2007 23:04:36 -0800
>
>   
>> If an normal application has to use something like raw packet
>> filtering, it seems there is a missing API.
>>     
>
> I'm loosely following this discussion, but Ben mentions DHCP
> and I remember learning the other month that DHCP uses AF_PACKET
> and filtering instead of IP RAW sockets because it needs to get
> the MAC address and RAW sockets don't provide that.
>   
sounds like a socket option would help, the data is already there. Then 
the normal
UDP receive path would work.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] bridge: avoid ptype_all packet handling
  2007-03-01  7:26           ` Stephen Hemminger
@ 2007-03-01  7:30             ` David Miller
  2007-03-01 11:47               ` jamal
  2007-03-03  2:14               ` Andi Kleen
  0 siblings, 2 replies; 25+ messages in thread
From: David Miller @ 2007-03-01  7:30 UTC (permalink / raw)
  To: shemminger; +Cc: greearb, bridge, netdev

From: Stephen Hemminger <shemminger@linux-foundation.org>
Date: Wed, 28 Feb 2007 23:26:36 -0800

> sounds like a socket option would help, the data is already there. Then 
> the normal
> UDP receive path would work.

That would be perfect for new applications.

But we have to support all the old ones, so we're stuck
providing correctly functioning AF_PACKET handling on
all devices, sorry.

And in fact that effectively makes the new socket option
pointless, since it doesn't buy us anything since we have
to support the old stuff fully anyways.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] bridge: avoid ptype_all packet handling
  2007-03-01  7:30             ` David Miller
@ 2007-03-01 11:47               ` jamal
  2007-03-03  2:14               ` Andi Kleen
  1 sibling, 0 replies; 25+ messages in thread
From: jamal @ 2007-03-01 11:47 UTC (permalink / raw)
  To: David Miller; +Cc: shemminger, greearb, bridge, netdev

On Wed, 2007-28-02 at 23:30 -0800, David Miller wrote:

> That would be perfect for new applications.
> But we have to support all the old ones, so we're stuck
> providing correctly functioning AF_PACKET handling on
> all devices, sorry.
> 

It also breaks all the ingress tc code by making that change.

The two scenarios i see in respect to performance are:
- run a sniffer and you will see a substantial performance degradation
as the pps goes up.  There is no rocket science to that. This has
nothing to do with bridging and will happen still even if that patch
went in. 

- dont run any tap with the current codepath and you still see the
_substantial_ performance drop then we have an opportunity
to optimize. Not _by avoiding the code_ as in Stephens patch but by
tunning that tap code path. [For example, you should still see a _tiny_
performance degradation if you turned on TC actions on ingress at
compile time but had zero policies at run time - but that code path has
been reduce to a single compare]. 


> And in fact that effectively makes the new socket option
> pointless, since it doesn't buy us anything since we have
> to support the old stuff fully anyways.

agreed.

I have been tied up elsewhere so havent been following netdev closely:
There seem to be complaints of bridging performance going down in recent
kernels - or is that someone misconfiguring their drivers?

cheers,
jamal


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] bridge: avoid ptype_all packet handling
  2007-03-01  1:18 [PATCH] bridge: avoid ptype_all packet handling Stephen Hemminger
  2007-03-01  1:28 ` Ben Greear
@ 2007-03-02 21:26 ` David Miller
  2007-03-02 22:09   ` [RFC 1/2] " Stephen Hemminger
                     ` (2 more replies)
  1 sibling, 3 replies; 25+ messages in thread
From: David Miller @ 2007-03-02 21:26 UTC (permalink / raw)
  To: shemminger; +Cc: bridge, netdev

From: Stephen Hemminger <shemminger@linux-foundation.org>
Date: Wed, 28 Feb 2007 17:18:46 -0800

> I was measuring bridging/routing performance and noticed this.
> 
> The current code runs the "all packet" type handlers before calling the
> bridge hook.  If an application (like some DHCP clients) is using AF_PACKET,
> this means that each received packet gets run through the Berkeley Packet Filter
> code in sk_run_filter (slow).

I know we closed this out by saying that even though performance
sucks, we can't really apply this without breaking things.

What would be broken is if the DHCP client isn't specifying
a device ifindex when it binds the AF_PACKET socket.  That
would be an easy way to fix this performance problem at the
application level.

The DHCP client should only care about a particular interface's
traffic, the one it wants to listen on.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [RFC 1/2] bridge: avoid ptype_all packet handling
  2007-03-02 21:26 ` David Miller
@ 2007-03-02 22:09   ` Stephen Hemminger
  2007-03-02 22:14     ` [RFC 2/2] bridge: per device promiscious taps Stephen Hemminger
  2007-03-02 22:48     ` [RFC 1/2] bridge: avoid ptype_all packet handling David Miller
  2007-03-02 22:15   ` Stephen Hemminger
  2007-03-03 12:04   ` [PATCH] " Stefan Rompf
  2 siblings, 2 replies; 25+ messages in thread
From: Stephen Hemminger @ 2007-03-02 22:09 UTC (permalink / raw)
  To: David Miller; +Cc: bridge, netdev

On Fri, 02 Mar 2007 13:26:38 -0800 (PST)
David Miller <davem@davemloft.net> wrote:

> From: Stephen Hemminger <shemminger@linux-foundation.org>
> Date: Wed, 28 Feb 2007 17:18:46 -0800
> 
> > I was measuring bridging/routing performance and noticed this.
> > 
> > The current code runs the "all packet" type handlers before calling the
> > bridge hook.  If an application (like some DHCP clients) is using AF_PACKET,
> > this means that each received packet gets run through the Berkeley Packet Filter
> > code in sk_run_filter (slow).
> 
> I know we closed this out by saying that even though performance
> sucks, we can't really apply this without breaking things.

wrong.

> What would be broken is if the DHCP client isn't specifying
> a device ifindex when it binds the AF_PACKET socket.  That
> would be an easy way to fix this performance problem at the
> application level.
> 
> The DHCP client should only care about a particular interface's
> traffic, the one it wants to listen on.


My assumption is that when bridging, the normal stack path only has
to receive those packets that it would receive if it was not doing
bridging.

A better version of the patch is:
==============

The current code runs the "all packet" type handlers before calling the
bridge hook.  If an application (like some DHCP clients) is using AF_PACKET,
this means that each received packet gets run through the Berkeley Packet Filter
code in sk_run_filter. This is significant overhead.

By moving the bridging hook to run first, the packets flowing through
the bridge get filtered out there first. This results in a 14%
improvement in performance.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
---
 net/core/dev.c |   24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

--- netem.orig/net/core/dev.c
+++ netem/net/core/dev.c
@@ -1702,9 +1702,12 @@ struct net_bridge_fdb_entry *(*br_fdb_ge
 						unsigned char *addr);
 void (*br_fdb_put_hook)(struct net_bridge_fdb_entry *ent);
 
-static __inline__ int handle_bridge(struct sk_buff **pskb,
-				    struct packet_type **pt_prev, int *ret,
-				    struct net_device *orig_dev)
+/*
+ * If bridge module is loaded call bridging hook.
+ * when it returns 1, this is a non-local packet
+ */
+int (*br_handle_frame_hook)(struct net_bridge_port *p, struct sk_buff **pskb) __read_mostly;
+static int handle_bridge(struct sk_buff **pskb)
 {
 	struct net_bridge_port *port;
 
@@ -1712,15 +1715,10 @@ static __inline__ int handle_bridge(stru
 	    (port = rcu_dereference((*pskb)->dev->br_port)) == NULL)
 		return 0;
 
-	if (*pt_prev) {
-		*ret = deliver_skb(*pskb, *pt_prev, orig_dev);
-		*pt_prev = NULL;
-	}
-
 	return br_handle_frame_hook(port, pskb);
 }
 #else
-#define handle_bridge(skb, pt_prev, ret, orig_dev)	(0)
+#define handle_bridge(pskb)	0
 #endif
 
 #ifdef CONFIG_NET_CLS_ACT
@@ -1799,6 +1797,9 @@ int netif_receive_skb(struct sk_buff *sk
 	}
 #endif
 
+	if (handle_bridge(&skb))
+		goto out;
+
 	list_for_each_entry_rcu(ptype, &ptype_all, list) {
 		if (!ptype->dev || ptype->dev == skb->dev) {
 			if (pt_prev)
@@ -1826,9 +1827,6 @@ int netif_receive_skb(struct sk_buff *sk
 ncls:
 #endif
 
-	if (handle_bridge(&skb, &pt_prev, &ret, orig_dev))
-		goto out;
-
 	type = skb->protocol;
 	list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type)&15], list) {
 		if (ptype->type == type &&



-- 
Stephen Hemminger <shemminger@linux-foundation.org>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [RFC 2/2] bridge: per device promiscious taps
  2007-03-02 22:09   ` [RFC 1/2] " Stephen Hemminger
@ 2007-03-02 22:14     ` Stephen Hemminger
  2007-03-02 22:48     ` [RFC 1/2] bridge: avoid ptype_all packet handling David Miller
  1 sibling, 0 replies; 25+ messages in thread
From: Stephen Hemminger @ 2007-03-02 22:14 UTC (permalink / raw)
  To: David Miller; +Cc: bridge, netdev

Part of the next set of bridge patches includes this.

It allows packet capture by interface on a bridge:
	tcpdump -i eth0

will work as expected.

@@ -128,34 +125,45 @@ static inline int is_link_local(const un
 int br_handle_frame(struct net_bridge_port *p, struct sk_buff **pskb)
 {
 	struct sk_buff *skb = *pskb;
+	struct sk_buff *skb2 = NULL;
 	const unsigned char *dest = eth_hdr(skb)->h_dest;
 
	if (!is_valid_ether_addr(eth_hdr(skb)->h_source))
		goto err;

 	if (unlikely(is_link_local(dest))) {
 		skb->pkt_type = PACKET_HOST;
 		return NF_HOOK(PF_BRIDGE, NF_BR_LOCAL_IN, skb, skb->dev,
 			       NULL, br_handle_local_finish) != 0;
 	}
+
+	if (unlikely(p->dev->promiscuity > 1))
+		skb2 = skb_clone(skb, GFP_ATOMIC);
 
-	if (p->state == BR_STATE_FORWARDING || p->state == BR_STATE_LEARNING) {
+	switch (p->state) {
+	case BR_STATE_FORWARDING:
 		if (br_should_route_hook) {
-			if (br_should_route_hook(pskb))
+			if (br_should_route_hook(pskb)) {
+				kfree_skb(skb2);
 				return 0;
+			}
 			skb = *pskb;
 			dest = eth_hdr(skb)->h_dest;
 		}
 
 		if (!compare_ether_addr(p->br->dev->dev_addr, dest))
 			skb->pkt_type = PACKET_HOST;
+		/* fall thru */
 
+	case BR_STATE_LEARNING:
 		NF_HOOK(PF_BRIDGE, NF_BR_PRE_ROUTING, skb, skb->dev, NULL,
 			br_handle_frame_finish);
-		return 1;
+		break;
+
+	default:
+		kfree_skb(skb);
 	}
 
-err:
-	kfree_skb(skb);
-	return 1;
+	if (likely(!skb2))
+		return 1;
+
+	*pskb = skb2;
+	return 0;
 }

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [RFC 1/2] bridge: avoid ptype_all packet handling
  2007-03-02 21:26 ` David Miller
  2007-03-02 22:09   ` [RFC 1/2] " Stephen Hemminger
@ 2007-03-02 22:15   ` Stephen Hemminger
  2007-03-03 12:04   ` [PATCH] " Stefan Rompf
  2 siblings, 0 replies; 25+ messages in thread
From: Stephen Hemminger @ 2007-03-02 22:15 UTC (permalink / raw)
  To: David Miller; +Cc: bridge, netdev

On Fri, 02 Mar 2007 13:26:38 -0800 (PST)
David Miller <davem@davemloft.net> wrote:

> From: Stephen Hemminger <shemminger@linux-foundation.org>
> Date: Wed, 28 Feb 2007 17:18:46 -0800
> 
> > I was measuring bridging/routing performance and noticed this.
> > 
> > The current code runs the "all packet" type handlers before calling the
> > bridge hook.  If an application (like some DHCP clients) is using AF_PACKET,
> > this means that each received packet gets run through the Berkeley Packet Filter
> > code in sk_run_filter (slow).
> 
> I know we closed this out by saying that even though performance
> sucks, we can't really apply this without breaking things.

wrong.

> What would be broken is if the DHCP client isn't specifying
> a device ifindex when it binds the AF_PACKET socket.  That
> would be an easy way to fix this performance problem at the
> application level.
> 
> The DHCP client should only care about a particular interface's
> traffic, the one it wants to listen on.


My assumption is that when bridging, the normal stack path only has
to receive those packets that it would receive if it was not doing
bridging.

A better version of the patch is:
==============

The current code runs the "all packet" type handlers before calling the
bridge hook.  If an application (like some DHCP clients) is using AF_PACKET,
this means that each received packet gets run through the Berkeley Packet Filter
code in sk_run_filter. This is significant overhead.

By moving the bridging hook to run first, the packets flowing through
the bridge get filtered out there first. This results in a 14%
improvement in performance.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
---
 net/core/dev.c |   24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

--- netem.orig/net/core/dev.c
+++ netem/net/core/dev.c
@@ -1702,9 +1702,12 @@ struct net_bridge_fdb_entry *(*br_fdb_ge
 						unsigned char *addr);
 void (*br_fdb_put_hook)(struct net_bridge_fdb_entry *ent);
 
-static __inline__ int handle_bridge(struct sk_buff **pskb,
-				    struct packet_type **pt_prev, int *ret,
-				    struct net_device *orig_dev)
+/*
+ * If bridge module is loaded call bridging hook.
+ * when it returns 1, this is a non-local packet
+ */
+int (*br_handle_frame_hook)(struct net_bridge_port *p, struct sk_buff **pskb) __read_mostly;
+static int handle_bridge(struct sk_buff **pskb)
 {
 	struct net_bridge_port *port;
 
@@ -1712,15 +1715,10 @@ static __inline__ int handle_bridge(stru
 	    (port = rcu_dereference((*pskb)->dev->br_port)) == NULL)
 		return 0;
 
-	if (*pt_prev) {
-		*ret = deliver_skb(*pskb, *pt_prev, orig_dev);
-		*pt_prev = NULL;
-	}
-
 	return br_handle_frame_hook(port, pskb);
 }
 #else
-#define handle_bridge(skb, pt_prev, ret, orig_dev)	(0)
+#define handle_bridge(pskb)	0
 #endif
 
 #ifdef CONFIG_NET_CLS_ACT
@@ -1799,6 +1797,9 @@ int netif_receive_skb(struct sk_buff *sk
 	}
 #endif
 
+	if (handle_bridge(&skb))
+		goto out;
+
 	list_for_each_entry_rcu(ptype, &ptype_all, list) {
 		if (!ptype->dev || ptype->dev == skb->dev) {
 			if (pt_prev)
@@ -1826,9 +1827,6 @@ int netif_receive_skb(struct sk_buff *sk
 ncls:
 #endif
 
-	if (handle_bridge(&skb, &pt_prev, &ret, orig_dev))
-		goto out;
-
 	type = skb->protocol;
 	list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type)&15], list) {
 		if (ptype->type == type &&



-- 
Stephen Hemminger <shemminger@linux-foundation.org>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC 1/2] bridge: avoid ptype_all packet handling
  2007-03-02 22:09   ` [RFC 1/2] " Stephen Hemminger
  2007-03-02 22:14     ` [RFC 2/2] bridge: per device promiscious taps Stephen Hemminger
@ 2007-03-02 22:48     ` David Miller
  2007-03-02 23:18       ` David Miller
  1 sibling, 1 reply; 25+ messages in thread
From: David Miller @ 2007-03-02 22:48 UTC (permalink / raw)
  To: shemminger; +Cc: bridge, netdev

From: Stephen Hemminger <shemminger@linux-foundation.org>
Date: Fri, 2 Mar 2007 14:09:29 -0800

> On Fri, 02 Mar 2007 13:26:38 -0800 (PST)
> David Miller <davem@davemloft.net> wrote:
> 
> > From: Stephen Hemminger <shemminger@linux-foundation.org>
> > Date: Wed, 28 Feb 2007 17:18:46 -0800
> > 
> > > I was measuring bridging/routing performance and noticed this.
> > > 
> > > The current code runs the "all packet" type handlers before calling the
> > > bridge hook.  If an application (like some DHCP clients) is using AF_PACKET,
> > > this means that each received packet gets run through the Berkeley Packet Filter
> > > code in sk_run_filter (slow).
> > 
> > I know we closed this out by saying that even though performance
> > sucks, we can't really apply this without breaking things.
> 
> wrong.

I disagee, and your patch is still broken because as Jamal
pointed out (which you didn't address in any way) this breaks
traffic classification of bridged traffic as well.

If someone wants their network tap to hear all traffic, they do mean
all traffic, and this includes potentially seeing it multiple times
when things like bridging and virtual devices decap incoming frames.

We can't apply this.

Back to a workable solution, why doesn't DHCP specify a specific
device?  It would fix this performance problem completely, at the
application level.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC 1/2] bridge: avoid ptype_all packet handling
  2007-03-02 22:48     ` [RFC 1/2] bridge: avoid ptype_all packet handling David Miller
@ 2007-03-02 23:18       ` David Miller
  2007-03-02 23:34         ` Stephen Hemminger
  2007-03-03  5:38         ` Herbert Xu
  0 siblings, 2 replies; 25+ messages in thread
From: David Miller @ 2007-03-02 23:18 UTC (permalink / raw)
  To: shemminger; +Cc: bridge, netdev

From: David Miller <davem@davemloft.net>
Date: Fri, 02 Mar 2007 14:48:18 -0800 (PST)

> Back to a workable solution, why doesn't DHCP specify a specific
> device?  It would fix this performance problem completely, at the
> application level.

Since nobody seems to be able to be bothered to actually look
at what DHCP clients are doing, I actually did and it's no
surprise that broken stuff is happening here.

Here is how dhcp3-3.0.3 binds AF_PACKET sockets, in common/lpf.c:

	struct sockaddr sa;
 ...
	/* Bind to the interface name */
	memset (&sa, 0, sizeof sa);
	sa.sa_family = AF_PACKET;
	strncpy (sa.sa_data, (const char *)info -> ifp, sizeof sa.sa_data);
	if (bind (sock, &sa, sizeof sa)) {
		if (errno == ENOPROTOOPT || errno == EPROTONOSUPPORT ||
		    errno == ESOCKTNOSUPPORT || errno == EPFNOSUPPORT ||
		    errno == EAFNOSUPPORT || errno == EINVAL) {
			log_error ("socket: %m - make sure");
			log_error ("CONFIG_PACKET (Packet socket) %s",
				   "and CONFIG_FILTER");
			log_error ("(Socket Filtering) are enabled %s",
				   "in your kernel");
			log_fatal ("configuration!");
		}
		log_fatal ("Bind socket to interface: %m");
	}

So it puts a string into the sockaddr data, and there
is no mention of sockaddr_ll, which is what is supposed to be
provided as the socket address here, in the entire DHCP tree.

I'm tempted to say I must be missing something here, since I can't see
how this could possible work at all.  The string passed in should
be interpreted as the ifindex value, and thus trigger a -ENODEV
return from AF_PACKET's bind() implementation.

My suspicions are confirmed by the patch here:

http://kernel.org/pub/linux/kernel/people/chuyee/patches/dhcp-3.0/dhcp-3.0-linux_cooked_packet.patch

Really, this bogus bind() explains everything.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC 1/2] bridge: avoid ptype_all packet handling
  2007-03-02 23:18       ` David Miller
@ 2007-03-02 23:34         ` Stephen Hemminger
  2007-03-02 23:41           ` David Miller
  2007-03-03  5:38         ` Herbert Xu
  1 sibling, 1 reply; 25+ messages in thread
From: Stephen Hemminger @ 2007-03-02 23:34 UTC (permalink / raw)
  To: David Miller; +Cc: bridge, netdev

On Fri, 02 Mar 2007 15:18:03 -0800 (PST)
David Miller <davem@davemloft.net> wrote:

> From: David Miller <davem@davemloft.net>
> Date: Fri, 02 Mar 2007 14:48:18 -0800 (PST)
> 
> > Back to a workable solution, why doesn't DHCP specify a specific
> > device?  It would fix this performance problem completely, at the
> > application level.
> 
> Since nobody seems to be able to be bothered to actually look
> at what DHCP clients are doing, I actually did and it's no
> surprise that broken stuff is happening here.

I was in middle of checking that..

> Here is how dhcp3-3.0.3 binds AF_PACKET sockets, in common/lpf.c:
> 
> 	struct sockaddr sa;
>  ...
> 	/* Bind to the interface name */
> 	memset (&sa, 0, sizeof sa);
> 	sa.sa_family = AF_PACKET;
> 	strncpy (sa.sa_data, (const char *)info -> ifp, sizeof sa.sa_data);
> 	if (bind (sock, &sa, sizeof sa)) {
> 		if (errno == ENOPROTOOPT || errno == EPROTONOSUPPORT ||
> 		    errno == ESOCKTNOSUPPORT || errno == EPFNOSUPPORT ||
> 		    errno == EAFNOSUPPORT || errno == EINVAL) {
> 			log_error ("socket: %m - make sure");
> 			log_error ("CONFIG_PACKET (Packet socket) %s",
> 				   "and CONFIG_FILTER");
> 			log_error ("(Socket Filtering) are enabled %s",
> 				   "in your kernel");
> 			log_fatal ("configuration!");
> 		}
> 		log_fatal ("Bind socket to interface: %m");
> 	}
> 
> So it puts a string into the sockaddr data, and there
> is no mention of sockaddr_ll, which is what is supposed to be
> provided as the socket address here, in the entire DHCP tree.
> 
> I'm tempted to say I must be missing something here, since I can't see
> how this could possible work at all.  The string passed in should
> be interpreted as the ifindex value, and thus trigger a -ENODEV
> return from AF_PACKET's bind() implementation.
> 
> My suspicions are confirmed by the patch here:
> 
> http://kernel.org/pub/linux/kernel/people/chuyee/patches/dhcp-3.0/dhcp-3.0-linux_cooked_packet.patch

Can you get FC fixed?

> Really, this bogus bind() explains everything.

Should we add a warning to kernel log, to make distro's fix it?

It might make sense to add a per-device ptype_dev list in network device?



-- 
Stephen Hemminger <shemminger@linux-foundation.org>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC 1/2] bridge: avoid ptype_all packet handling
  2007-03-02 23:34         ` Stephen Hemminger
@ 2007-03-02 23:41           ` David Miller
  0 siblings, 0 replies; 25+ messages in thread
From: David Miller @ 2007-03-02 23:41 UTC (permalink / raw)
  To: shemminger; +Cc: bridge, netdev

From: Stephen Hemminger <shemminger@linux-foundation.org>
Date: Fri, 2 Mar 2007 15:34:14 -0800

> Can you get FC fixed?

I am not the DHCP package maintainer. :-)

I'm up to my earfulls already dealing with people trying
to slug broken patches into the kernel networking that paper
around application bugs. ;)

> Should we add a warning to kernel log, to make distro's fix it?

Unfortunately it looks like a properly formed sockaddr_ll,
the ifindex is in fact zero, so there is nothing we can do
to warn about this case.

The sockaddr_ll sits after the first sockaddr string in the ifreq, and
the rest remains initialized to zeros, thus the bind() succeeds.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] bridge: avoid ptype_all packet handling
  2007-03-01  7:30             ` David Miller
  2007-03-01 11:47               ` jamal
@ 2007-03-03  2:14               ` Andi Kleen
  2007-03-03  4:22                 ` David Miller
  1 sibling, 1 reply; 25+ messages in thread
From: Andi Kleen @ 2007-03-03  2:14 UTC (permalink / raw)
  To: David Miller; +Cc: shemminger, greearb, bridge, netdev

David Miller <davem@davemloft.net> writes:
> 
> And in fact that effectively makes the new socket option
> pointless, since it doesn't buy us anything since we have
> to support the old stuff fully anyways.

I don't think it's pointless because it would still allow
newer DHCP clients to have less impact on other packets
when they are active. 

This can matter when you have a system with multiple
interfaces where DHCP doesn't get a address on one.

That's pretty common with many x86 server boards because 
they come with two NICs by default but must people only
plug the cable into one. However the distro installers
run DHCP on all.

When this happens all packets are always forced through
ptype_all chains before being rejected by AF_PACKETs device
bind, which adds some overhead to them. 

-Andi

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] bridge: avoid ptype_all packet handling
  2007-03-03  2:14               ` Andi Kleen
@ 2007-03-03  4:22                 ` David Miller
  2007-03-03  7:09                   ` Stephen Hemminger
  2007-03-03 12:30                   ` Andi Kleen
  0 siblings, 2 replies; 25+ messages in thread
From: David Miller @ 2007-03-03  4:22 UTC (permalink / raw)
  To: andi; +Cc: shemminger, greearb, bridge, netdev

From: Andi Kleen <andi@firstfloor.org>
Date: 03 Mar 2007 03:14:29 +0100

> That's pretty common with many x86 server boards because 
> they come with two NICs by default but must people only
> plug the cable into one. However the distro installers
> run DHCP on all.

Nope, that's not what I've seen them do, instead they run dhcp on
interfaces that report a link being present.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC 1/2] bridge: avoid ptype_all packet handling
  2007-03-02 23:18       ` David Miller
  2007-03-02 23:34         ` Stephen Hemminger
@ 2007-03-03  5:38         ` Herbert Xu
  2007-03-03  5:59           ` David Miller
  1 sibling, 1 reply; 25+ messages in thread
From: Herbert Xu @ 2007-03-03  5:38 UTC (permalink / raw)
  To: David Miller; +Cc: shemminger, bridge, netdev

David Miller <davem@davemloft.net> wrote:
> 
> I'm tempted to say I must be missing something here, since I can't see
> how this could possible work at all.  The string passed in should
> be interpreted as the ifindex value, and thus trigger a -ENODEV
> return from AF_PACKET's bind() implementation.

This is using packet_bind_spkt which uses a name instead of ifindex.

As you may recall, I've made a patch to convert it to use the new
(actually it's not-so-new anymore) AF_PACKET interface.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC 1/2] bridge: avoid ptype_all packet handling
  2007-03-03  5:38         ` Herbert Xu
@ 2007-03-03  5:59           ` David Miller
  2007-03-03  6:42             ` Herbert Xu
  0 siblings, 1 reply; 25+ messages in thread
From: David Miller @ 2007-03-03  5:59 UTC (permalink / raw)
  To: herbert; +Cc: shemminger, bridge, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sat, 03 Mar 2007 16:38:45 +1100

> This is using packet_bind_spkt which uses a name instead of ifindex.

So it should be just fine, it should be binding to a specific
device (by name instead of ifindex) and therefore it should
only trigger the pt_all hook when the packet arrives on that
specific device.

> As you may recall, I've made a patch to convert it to use the new
> (actually it's not-so-new anymore) AF_PACKET interface.

That's right.

So it's still a mystery why dhcp is causing bridge devices
to trigger the network tap paths on Stephen's machine.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC 1/2] bridge: avoid ptype_all packet handling
  2007-03-03  5:59           ` David Miller
@ 2007-03-03  6:42             ` Herbert Xu
  0 siblings, 0 replies; 25+ messages in thread
From: Herbert Xu @ 2007-03-03  6:42 UTC (permalink / raw)
  To: David Miller; +Cc: shemminger, bridge, netdev

On Fri, Mar 02, 2007 at 09:59:05PM -0800, David Miller wrote:
> 
> So it's still a mystery why dhcp is causing bridge devices
> to trigger the network tap paths on Stephen's machine.

If this is the ISC DHCP daemon then perhaps it's because Stephen
didn't specify an interface for it to listen on? By default it'll
enumerate all broadcast interfaces and listen to each one of them.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] bridge: avoid ptype_all packet handling
  2007-03-03  4:22                 ` David Miller
@ 2007-03-03  7:09                   ` Stephen Hemminger
  2007-03-03 12:30                   ` Andi Kleen
  1 sibling, 0 replies; 25+ messages in thread
From: Stephen Hemminger @ 2007-03-03  7:09 UTC (permalink / raw)
  To: David Miller; +Cc: andi, greearb, bridge, netdev

David Miller wrote:
> From: Andi Kleen <andi@firstfloor.org>
> Date: 03 Mar 2007 03:14:29 +0100
>
>   
>> That's pretty common with many x86 server boards because 
>> they come with two NICs by default but must people only
>> plug the cable into one. However the distro installers
>> run DHCP on all.
>>     
>
> Nope, that's not what I've seen them do, instead they run dhcp on
> interfaces that report a link being present.
>   

Actually, It may be even simpler... I start bridge with a script and 
there was still a dhclient
left over running on the original interface.  It was an interesting 
exercise, and I have new
tools to help, but still no magic bullet to get up to full line rate.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] bridge: avoid ptype_all packet handling
  2007-03-02 21:26 ` David Miller
  2007-03-02 22:09   ` [RFC 1/2] " Stephen Hemminger
  2007-03-02 22:15   ` Stephen Hemminger
@ 2007-03-03 12:04   ` Stefan Rompf
  2 siblings, 0 replies; 25+ messages in thread
From: Stefan Rompf @ 2007-03-03 12:04 UTC (permalink / raw)
  To: David Miller; +Cc: shemminger, bridge, netdev

Am Freitag, 2. März 2007 22:26 schrieb David Miller:

> The DHCP client should only care about a particular interface's
> traffic, the one it wants to listen on.

Also, a DHCP client should close the socket between address acquisition and 
renewal. The only interesting events in that period are operstate changes.

Stefan

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] bridge: avoid ptype_all packet handling
  2007-03-03  4:22                 ` David Miller
  2007-03-03  7:09                   ` Stephen Hemminger
@ 2007-03-03 12:30                   ` Andi Kleen
  1 sibling, 0 replies; 25+ messages in thread
From: Andi Kleen @ 2007-03-03 12:30 UTC (permalink / raw)
  To: David Miller; +Cc: andi, shemminger, greearb, bridge, netdev

On Fri, Mar 02, 2007 at 08:22:11PM -0800, David Miller wrote:
> From: Andi Kleen <andi@firstfloor.org>
> Date: 03 Mar 2007 03:14:29 +0100
> 
> > That's pretty common with many x86 server boards because 
> > they come with two NICs by default but must people only
> > plug the cable into one. However the distro installers
> > run DHCP on all.
> 
> Nope, that's not what I've seen them do, instead they run dhcp on
> interfaces that report a link being present.

I've seen otherwise.

And that's also bad. It means that when the user moves the machine
and happens to plug the Ethernet into the other port network
will be a notwork until the configuration is manually changed.
Similar when the cable is not plugged in yet at install time.
All not good.

Allowing low overhead DHCP is useful imho. The main problem
with running it always is that it will use more power because
it will IFF_UP the interface. But longer term that can be only
properly solved by real idle network power management I think.

-Andi

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2007-03-03 12:31 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-03-01  1:18 [PATCH] bridge: avoid ptype_all packet handling Stephen Hemminger
2007-03-01  1:28 ` Ben Greear
2007-03-01  3:56   ` Stephen Hemminger
2007-03-01  4:05     ` Ben Greear
2007-03-01  7:04       ` Stephen Hemminger
2007-03-01  7:22         ` David Miller
2007-03-01  7:26           ` Stephen Hemminger
2007-03-01  7:30             ` David Miller
2007-03-01 11:47               ` jamal
2007-03-03  2:14               ` Andi Kleen
2007-03-03  4:22                 ` David Miller
2007-03-03  7:09                   ` Stephen Hemminger
2007-03-03 12:30                   ` Andi Kleen
2007-03-02 21:26 ` David Miller
2007-03-02 22:09   ` [RFC 1/2] " Stephen Hemminger
2007-03-02 22:14     ` [RFC 2/2] bridge: per device promiscious taps Stephen Hemminger
2007-03-02 22:48     ` [RFC 1/2] bridge: avoid ptype_all packet handling David Miller
2007-03-02 23:18       ` David Miller
2007-03-02 23:34         ` Stephen Hemminger
2007-03-02 23:41           ` David Miller
2007-03-03  5:38         ` Herbert Xu
2007-03-03  5:59           ` David Miller
2007-03-03  6:42             ` Herbert Xu
2007-03-02 22:15   ` Stephen Hemminger
2007-03-03 12:04   ` [PATCH] " Stefan Rompf

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).