Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next] cdc_ncm: return -ENOMEM if kzalloc fails
From: David Miller @ 2013-03-30 21:31 UTC (permalink / raw)
  To: devendra.aaru; +Cc: oliver, linux-usb, netdev
In-Reply-To: <1364634202-10341-1-git-send-email-devendra.aaru@gmail.com>

From: Devendra Naga <devendra.aaru@gmail.com>
Date: Sat, 30 Mar 2013 14:33:22 +0530

> return -ENOMEM instead if kzalloc of cdc_ncm_ctx structure is failed.
> 
> and also remove the comparision of ctx structure with NULL and make
> it as !ctx.
> 
> Signed-off-by: Devendra Naga <devendra.aaru@gmail.com>

Applied.

^ permalink raw reply

* BUG: IPv6: ipv6_bind fails on implicitly-assigned address
From: ko-zu @ 2013-03-30 21:29 UTC (permalink / raw)
  To: netdev

Current Any-IP support for IPv6, introduced with this commit
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=ab79ad14a2d51e95f0ac3cef7cd116a57089ba82
, allows to assign whole subnet to host but current implemantation
does not allow binding assigned address to socket without IP_FREEBIND
sockopt.
When an ipv6 address was assigned as Any-IP, ipv6_bind() fails with
EADDRNOTAVAIL error because ipv6_chk_addr() does not consider Any-IP
as locally hosted.

To reproduce this bug,

# assign whole subnet to host
ip -4 route add local 192.0.2.0/24 dev lo
ip -6 route add local 2001:db8::/32 dev lo

python -c '
import socket
IP_FREEBIND = 15

print "IPv4 without IP_FREEBIND"
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.bind(("192.0.2.123", 10000))

print "IPv6 with IP_FREEBIND"
s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
s.setsockopt(socket.SOL_IP, IP_FREEBIND, 1)
s.bind(("2001:db8::1234", 10000))

print "IPv6 without IP_FREEBIND"
s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
s.bind(("2001:db8::5678", 10000))
print "should success without error."
'

^ permalink raw reply

* Re: [PATCH] ip_gre: don't overwrite iflink during net_dev init
From: David Miller @ 2013-03-30 21:29 UTC (permalink / raw)
  To: ordex; +Cc: netdev, linux-kernel, fengguang.wu, pshelar
In-Reply-To: <1364663257-14717-1-git-send-email-ordex@autistici.org>

From: Antonio Quartulli <ordex@autistici.org>
Date: Sat, 30 Mar 2013 18:07:37 +0100

> iflink is currently set to 0 in __gre_tunnel_init(). This
> function is invoked in gre_tap_init() and
> ipgre_tunnel_init() which are both used to initialise the
> ndo_init field of the respective net_device_ops structs
> (ipgre.. and gre_tap..) used by GRE interfaces.
> 
> However, in netdevice_register() iflink is first set to -1,
> then ndo_init is invoked and then iflink is assigned to a
> proper value if and only if it still was -1.
> 
> Assigning 0 to iflink in ndo_init is therefore first
> preventing netdev_register() to correctly assign it a proper
> value and then breaking iflink at all since 0 has not
> correct meaning.
> 
> Fix this by removing the iflink assignment in
> __gre_tunnel_init().
> 
> Introduced by c54419321455631079c7d6e60bc732dd0c5914c5
> ("GRE: Refactor GRE tunneling code.")
> 
> Reported-by: Fengguang Wu <fengguang.wu@intel.com>
> Signed-off-by: Antonio Quartulli <ordex@autistici.org>

Please always indicate, in the subject line inside of the [] brackets,
what tree the patch is targetted at.

I can use "git describe" on the guilty commit, but why take a chance?

Applied to net-next, thanks.

^ permalink raw reply

* Re: [PATCH] net/wireless/sme: sched_scan_mtx lock in cfg80211_conn_work()
From: David Miller @ 2013-03-30 21:12 UTC (permalink / raw)
  To: artem.savkov; +Cc: johannes.berg, linux-wireless, netdev, linux-kernel
In-Reply-To: <1364664014-12304-1-git-send-email-artem.savkov@gmail.com>

From: Artem Savkov <artem.savkov@gmail.com>
Date: Sat, 30 Mar 2013 21:20:14 +0400

> Introduced in f9f475292dbb0e7035fb6661d1524761ea0888d9

When referencing commits, always provide the commit message header
line, in parenthesis and double quotes, so that it is unambiguous.

Otherwise, when changes are backported into -stable trees, the commit
ID changes and the commit in question can no longer be indentified
easily for the reader.

^ permalink raw reply

* [PATCH net-next] macvlan: use the right RCU api
From: Eric Dumazet @ 2013-03-30 20:08 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

From: Eric Dumazet <edumazet@google.com>

Make sure we use proper API to fetch dev->rx_handler_data,
instead of ugly casts.

Rename macvlan_port_get() to macvlan_port_get_rtnl() to document fact
that we hold RTNL when needed, with lockdep support.

This removes sparse warnings as well (CONFIG_SPARSE_RCU_POINTER=y)

CHECK   drivers/net/macvlan.c
drivers/net/macvlan.c:706:37: warning: cast removes address space of expression
drivers/net/macvlan.c:775:16: warning: cast removes address space of expression
drivers/net/macvlan.c:924:16: warning: cast removes address space of expression

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 drivers/net/macvlan.c |   19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 73abbc1..70af6dc 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -46,9 +46,16 @@ struct macvlan_port {
 
 static void macvlan_port_destroy(struct net_device *dev);
 
-#define macvlan_port_get_rcu(dev) \
-	((struct macvlan_port *) rcu_dereference(dev->rx_handler_data))
-#define macvlan_port_get(dev) ((struct macvlan_port *) dev->rx_handler_data)
+static struct macvlan_port *macvlan_port_get_rcu(const struct net_device *dev)
+{
+	return rcu_dereference(dev->rx_handler_data);
+}
+
+static struct macvlan_port *macvlan_port_get_rtnl(const struct net_device *dev)
+{
+	return rtnl_dereference(dev->rx_handler_data);
+}
+
 #define macvlan_port_exists(dev) (dev->priv_flags & IFF_MACVLAN_PORT)
 
 static struct macvlan_dev *macvlan_hash_lookup(const struct macvlan_port *port,
@@ -703,7 +710,7 @@ static int macvlan_port_create(struct net_device *dev)
 
 static void macvlan_port_destroy(struct net_device *dev)
 {
-	struct macvlan_port *port = macvlan_port_get(dev);
+	struct macvlan_port *port = macvlan_port_get_rtnl(dev);
 
 	dev->priv_flags &= ~IFF_MACVLAN_PORT;
 	netdev_rx_handler_unregister(dev);
@@ -772,7 +779,7 @@ int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
 		if (err < 0)
 			return err;
 	}
-	port = macvlan_port_get(lowerdev);
+	port = macvlan_port_get_rtnl(lowerdev);
 
 	/* Only 1 macvlan device can be created in passthru mode */
 	if (port->passthru)
@@ -921,7 +928,7 @@ static int macvlan_device_event(struct notifier_block *unused,
 	if (!macvlan_port_exists(dev))
 		return NOTIFY_DONE;
 
-	port = macvlan_port_get(dev);
+	port = macvlan_port_get_rtnl(dev);
 
 	switch (event) {
 	case NETDEV_CHANGE:

^ permalink raw reply related

* Re: [patch net-next] net: squash ->rx_handler and ->rx_handler_data into single rcu pointer
From: Eric Dumazet @ 2013-03-30 19:32 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: alexander.h.duyck-ral2JQCrhuEAvxtiuMwx3w,
	dev-yBygre7rU0TnMu66kgdUjQ, paulmck-r/Jw6+rmf7HQT0dZR+AlfA,
	streeter-H+wXaHxf7aLQT0dZR+AlfA,
	nicolas.2p.debian-Re5JQEeQqe8AvxtiuMwx3w,
	netdev-u79uwXL29TY76Z2rM5mHXA, fubar-r/Jw6+rmf7HQT0dZR+AlfA,
	rostedt-nx8X9YLhiw1AfugRpC6u6w, kaber-dcUjhNyLwpNeoWH0uzbU5w,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, ivecera-H+wXaHxf7aLQT0dZR+AlfA,
	tglx-hfZtesqFncYOwBW4kG4KsQ, davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	andy-QlMahl40kYEqcZcGjlUOXw
In-Reply-To: <1364671863.5113.122.camel@edumazet-glaptop>

On Sat, 2013-03-30 at 12:31 -0700, Eric Dumazet wrote:

> By the way, only dev->rx_handler needs to be RCU protected.
> 
> The patch send yesterday make the second rcu_dereference() (to get
> rx_handler_data) totally irrelevant.

I'll send patch when yesterday fix is merged into net-next.

^ permalink raw reply

* Re: [patch net-next] net: squash ->rx_handler and ->rx_handler_data into single rcu pointer
From: Jiri Pirko @ 2013-03-30 19:32 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: alexander.h.duyck-ral2JQCrhuEAvxtiuMwx3w,
	dev-yBygre7rU0TnMu66kgdUjQ, paulmck-r/Jw6+rmf7HQT0dZR+AlfA,
	streeter-H+wXaHxf7aLQT0dZR+AlfA,
	nicolas.2p.debian-Re5JQEeQqe8AvxtiuMwx3w,
	netdev-u79uwXL29TY76Z2rM5mHXA, fubar-r/Jw6+rmf7HQT0dZR+AlfA,
	kaber-dcUjhNyLwpNeoWH0uzbU5w, edumazet-hpIqsD4AKlfQT0dZR+AlfA,
	ivecera-H+wXaHxf7aLQT0dZR+AlfA, tglx-hfZtesqFncYOwBW4kG4KsQ,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, andy-QlMahl40kYEqcZcGjlUOXw
In-Reply-To: <f72ccc9e-b09e-4abb-a0ce-a6516138b25c-2ueSQiBKiTY7tOexoI0I+QC/G2K4zDHf@public.gmane.org>

Sat, Mar 30, 2013 at 08:24:04PM CET, rostedt-nx8X9YLhiw1AfugRpC6u6w@public.gmane.org wrote:
>
>
>Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org> wrote:
>
>>No need to have two pointers in struct netdevice for rx_handler func
>>and
>>priv data. Just embed rx_handler structure into driver port_priv and
>>have ->func pointer there. This introduces no performance penalty,
>>reduces struct netdevice by one pointer and reduces number of needed
>>rcu_dereference calls from 2 to 1.
>>
>>Note this also fixes the race bug pointed out by Steven Rostedt and
>>fixed by patch "[PATCH] net: add a synchronize_net() in
>>netdev_rx_handler_unregister()" by Eric Dumazet. This is based on
>>current net-next tree where the patch is not applied yet.
>>I can rebase it on whatever tree/state, just say so.
>>
>>Smoke tested with all drivers who use rx_handler.
>>
>>Reported-by: Steven Rostedt <rostedt-nx8X9YLhiw1AfugRpC6u6w@public.gmane.org>
>>Signed-off-by: Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
>>---
> +
>>+
>>--- a/net/core/dev.c
>>+++ b/net/core/dev.c
>>@@ -3296,10 +3296,23 @@ out:
>> #endif
>> 
>> /**
>>+ *	netdev_rx_handler_init - init receive handler structure
>>+ *	@rx_handler: receive handler to init
>>+ *	@func: receive handler function
>>+ *
>>+ *	Inits receive handler structure.
>>+ */
>>+void netdev_rx_handler_init(struct netdev_rx_handler *rx_handler,
>>+			    rx_handler_func_t *func)
>>+{
>>+	rx_handler->func = func;
>>+}
>>+EXPORT_SYMBOL_GPL(netdev_rx_handler_init);
>>+
>>+/**
>>  *	netdev_rx_handler_register - register receive handler
>>  *	@dev: device to register a handler for
>>  *	@rx_handler: receive handler to register
>>- *	@rx_handler_data: data pointer that is used by rx handler
>>  *
>>  *	Register a receive hander for a device. This handler will then be
>>  *	called from __netif_receive_skb. A negative errno code is returned
>>@@ -3310,15 +3323,13 @@ out:
>> *	For a general description of rx_handler, see enum rx_handler_result.
>>  */
>> int netdev_rx_handler_register(struct net_device *dev,
>>-			       rx_handler_func_t *rx_handler,
>>-			       void *rx_handler_data)
>>+			       struct netdev_rx_handler *rx_handler)
>> {
>> 	ASSERT_RTNL();
>> 
>> 	if (dev->rx_handler)
>> 		return -EBUSY;
>> 
>>-	rcu_assign_pointer(dev->rx_handler_data, rx_handler_data);
>> 	rcu_assign_pointer(dev->rx_handler, rx_handler);
>> 
>> 	return 0;
>>@@ -3338,7 +3349,6 @@ void netdev_rx_handler_unregister(struct
>>net_device *dev)
>> 
>> 	ASSERT_RTNL();
>> 	RCU_INIT_POINTER(dev->rx_handler, NULL);
>>-	RCU_INIT_POINTER(dev->rx_handler_data, NULL);
>> }
>> EXPORT_SYMBOL_GPL(netdev_rx_handler_unregister);
>> 
>>@@ -3362,7 +3372,7 @@ static bool skb_pfmemalloc_protocol(struct
>>sk_buff *skb)
>>static int __netif_receive_skb_core(struct sk_buff *skb, bool
>>pfmemalloc)
>> {
>> 	struct packet_type *ptype, *pt_prev;
>>-	rx_handler_func_t *rx_handler;
>>+	struct netdev_rx_handler *rx_handler;
>> 	struct net_device *orig_dev;
>> 	struct net_device *null_or_dev;
>> 	bool deliver_exact = false;
>>@@ -3445,7 +3455,7 @@ ncls:
>> 			ret = deliver_skb(skb, pt_prev, orig_dev);
>> 			pt_prev = NULL;
>> 		}
>>-		switch (rx_handler(&skb)) {
>>+		switch (rx_handler->func(&skb, rx_handler)) {
>
>
>This doesn't solve anything. The CPU can be executing func when you set it to null.  Then you have the same problem.  This patch shows you still don't understand the bug.  

rx_handler->func is never set to NULL and is valid for all rx_handler (port_priv)
lifetime.

>
>
>-- Steve
>
>
>> 		case RX_HANDLER_CONSUMED:
>> 			ret = NET_RX_SUCCESS;
>> 			goto unlock;
>>diff --git a/net/openvswitch/dp_notify.c b/net/openvswitch/dp_notify.c
>>index 5558350..5248f8e 100644
>>--- a/net/openvswitch/dp_notify.c
>>+++ b/net/openvswitch/dp_notify.c
>>@@ -32,7 +32,7 @@ static int dp_device_event(struct notifier_block
>>*unused, unsigned long event,
>> 	if (ovs_is_internal_dev(dev))
>> 		vport = ovs_internal_dev_get_vport(dev);
>> 	else
>>-		vport = ovs_netdev_get_vport(dev);
>>+		vport = ovs_netdev_get_vport_rtnl(dev);
>> 
>> 	if (!vport)
>> 		return NOTIFY_DONE;
>>diff --git a/net/openvswitch/vport-netdev.c
>>b/net/openvswitch/vport-netdev.c
>>index 2130d61..0ff6aa1 100644
>>--- a/net/openvswitch/vport-netdev.c
>>+++ b/net/openvswitch/vport-netdev.c
>>@@ -35,9 +35,6 @@
>> /* Must be called with rcu_read_lock. */
>>static void netdev_port_receive(struct vport *vport, struct sk_buff
>>*skb)
>> {
>>-	if (unlikely(!vport))
>>-		goto error;
>>-
>> 	if (unlikely(skb_warn_if_lro(skb)))
>> 		goto error;
>> 
>>@@ -57,7 +54,8 @@ error:
>> }
>> 
>> /* Called with rcu_read_lock and bottom-halves disabled. */
>>-static rx_handler_result_t netdev_frame_hook(struct sk_buff **pskb)
>>+static rx_handler_result_t
>>+netdev_frame_hook(struct sk_buff **pskb, struct netdev_rx_handler
>>*rx_handler)
>> {
>> 	struct sk_buff *skb = *pskb;
>> 	struct vport *vport;
>>@@ -65,7 +63,7 @@ static rx_handler_result_t netdev_frame_hook(struct
>>sk_buff **pskb)
>> 	if (unlikely(skb->pkt_type == PACKET_LOOPBACK))
>> 		return RX_HANDLER_PASS;
>> 
>>-	vport = ovs_netdev_get_vport(skb->dev);
>>+	vport = ovs_netdev_get_vport_by_rx_handler(rx_handler);
>> 
>> 	netdev_port_receive(vport, skb);
>> 
>>@@ -100,8 +98,8 @@ static struct vport *netdev_create(const struct
>>vport_parms *parms)
>> 		goto error_put;
>> 	}
>> 
>>-	err = netdev_rx_handler_register(netdev_vport->dev,
>>netdev_frame_hook,
>>-					 vport);
>>+	netdev_rx_handler_init(&vport->rx_handler, netdev_frame_hook);
>>+	err = netdev_rx_handler_register(netdev_vport->dev,
>>&vport->rx_handler);
>> 	if (err)
>> 		goto error_put;
>> 
>>@@ -185,16 +183,6 @@ error:
>> 	return 0;
>> }
>> 
>>-/* Returns null if this device is not attached to a datapath. */
>>-struct vport *ovs_netdev_get_vport(struct net_device *dev)
>>-{
>>-	if (likely(dev->priv_flags & IFF_OVS_DATAPATH))
>>-		return (struct vport *)
>>-			rcu_dereference_rtnl(dev->rx_handler_data);
>>-	else
>>-		return NULL;
>>-}
>>-
>> const struct vport_ops ovs_netdev_vport_ops = {
>> 	.type		= OVS_VPORT_TYPE_NETDEV,
>> 	.create		= netdev_create,
>>diff --git a/net/openvswitch/vport-netdev.h
>>b/net/openvswitch/vport-netdev.h
>>index 6478079..d3f1471 100644
>>--- a/net/openvswitch/vport-netdev.h
>>+++ b/net/openvswitch/vport-netdev.h
>>@@ -24,7 +24,21 @@
>> 
>> #include "vport.h"
>> 
>>-struct vport *ovs_netdev_get_vport(struct net_device *dev);
>>+#define ovs_netdev_vport_exists(dev) (dev->priv_flags &
>>IFF_OVS_DATAPATH)
>>+
>>+static inline struct vport *
>>+ovs_netdev_get_vport_by_rx_handler(const struct netdev_rx_handler
>>*rx_handler)
>>+{
>>+	return container_of(rx_handler, struct vport, rx_handler);
>>+}
>>+
>>+static inline struct vport *
>>+ovs_netdev_get_vport_rtnl(const struct net_device *dev)
>>+{
>>+	if (!ovs_netdev_vport_exists(dev))
>>+		return NULL;
>>+	return
>>ovs_netdev_get_vport_by_rx_handler(rtnl_dereference(dev->rx_handler));
>>+}
>> 
>> struct netdev_vport {
>> 	struct rcu_head rcu;
>>diff --git a/net/openvswitch/vport.h b/net/openvswitch/vport.h
>>index aee7d43..a7a07c0 100644
>>--- a/net/openvswitch/vport.h
>>+++ b/net/openvswitch/vport.h
>>@@ -67,6 +67,7 @@ struct vport_err_stats {
>> 
>> /**
>>  * struct vport - one port within a datapath
>>+ * @rx_handler: RX handler structure.
>>  * @rcu: RCU callback head for deferred destruction.
>>  * @dp: Datapath to which this port belongs.
>>* @upcall_portid: The Netlink port to use for packets received on this
>>port that
>>@@ -80,6 +81,7 @@ struct vport_err_stats {
>>  * @err_stats: Points to error statistics used and maintained by vport
>>  */
>> struct vport {
>>+	struct netdev_rx_handler rx_handler;
>> 	struct rcu_head rcu;
>> 	struct datapath	*dp;
>> 	u32 upcall_portid;
>
>-- 
>Sent from my Android phone with K-9 Mail. Please excuse my brevity.

^ permalink raw reply

* Re: [patch net-next] net: squash ->rx_handler and ->rx_handler_data into single rcu pointer
From: Eric Dumazet @ 2013-03-30 19:31 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: alexander.h.duyck-ral2JQCrhuEAvxtiuMwx3w,
	dev-yBygre7rU0TnMu66kgdUjQ, paulmck-r/Jw6+rmf7HQT0dZR+AlfA,
	streeter-H+wXaHxf7aLQT0dZR+AlfA,
	nicolas.2p.debian-Re5JQEeQqe8AvxtiuMwx3w,
	netdev-u79uwXL29TY76Z2rM5mHXA, fubar-r/Jw6+rmf7HQT0dZR+AlfA,
	rostedt-nx8X9YLhiw1AfugRpC6u6w, kaber-dcUjhNyLwpNeoWH0uzbU5w,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, ivecera-H+wXaHxf7aLQT0dZR+AlfA,
	tglx-hfZtesqFncYOwBW4kG4KsQ, davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	andy-QlMahl40kYEqcZcGjlUOXw
In-Reply-To: <1364671688.5113.119.camel@edumazet-glaptop>

On Sat, 2013-03-30 at 12:28 -0700, Eric Dumazet wrote:
> On Sat, 2013-03-30 at 18:13 +0100, Jiri Pirko wrote:
> 
> > Well, not entirely true, depends on arch.
> > 
> 
> Are you really trying to obfuscate stack because of Alpha architecture ?


By the way, only dev->rx_handler needs to be RCU protected.

The patch send yesterday make the second rcu_dereference() (to get
rx_handler_data) totally irrelevant.

^ permalink raw reply

* Re: [patch net-next] net: squash ->rx_handler and ->rx_handler_data into single rcu pointer
From: Eric Dumazet @ 2013-03-30 19:28 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: alexander.h.duyck-ral2JQCrhuEAvxtiuMwx3w,
	dev-yBygre7rU0TnMu66kgdUjQ, paulmck-r/Jw6+rmf7HQT0dZR+AlfA,
	streeter-H+wXaHxf7aLQT0dZR+AlfA,
	nicolas.2p.debian-Re5JQEeQqe8AvxtiuMwx3w,
	netdev-u79uwXL29TY76Z2rM5mHXA, fubar-r/Jw6+rmf7HQT0dZR+AlfA,
	rostedt-nx8X9YLhiw1AfugRpC6u6w, kaber-dcUjhNyLwpNeoWH0uzbU5w,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, ivecera-H+wXaHxf7aLQT0dZR+AlfA,
	tglx-hfZtesqFncYOwBW4kG4KsQ, davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	andy-QlMahl40kYEqcZcGjlUOXw
In-Reply-To: <20130330171325.GC1544-RDzucLLXGGI88b5SBfVpbw@public.gmane.org>

On Sat, 2013-03-30 at 18:13 +0100, Jiri Pirko wrote:

> Well, not entirely true, depends on arch.
> 

Are you really trying to obfuscate stack because of Alpha architecture ?

Really, a bit of stability in this code is welcome.

Lets fix existing bugs instead of possibly add new ones.

^ permalink raw reply

* Re: [patch net-next] net: squash ->rx_handler and ->rx_handler_data into single rcu pointer
From: Steven Rostedt @ 2013-03-30 19:24 UTC (permalink / raw)
  To: Jiri Pirko, netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: alexander.h.duyck-ral2JQCrhuEAvxtiuMwx3w,
	dev-yBygre7rU0TnMu66kgdUjQ, paulmck-r/Jw6+rmf7HQT0dZR+AlfA,
	streeter-H+wXaHxf7aLQT0dZR+AlfA,
	nicolas.2p.debian-Re5JQEeQqe8AvxtiuMwx3w,
	fubar-r/Jw6+rmf7HQT0dZR+AlfA, kaber-dcUjhNyLwpNeoWH0uzbU5w,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, ivecera-H+wXaHxf7aLQT0dZR+AlfA,
	tglx-hfZtesqFncYOwBW4kG4KsQ, davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	andy-QlMahl40kYEqcZcGjlUOXw
In-Reply-To: <1364659034-7049-1-git-send-email-jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>



Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org> wrote:

>No need to have two pointers in struct netdevice for rx_handler func
>and
>priv data. Just embed rx_handler structure into driver port_priv and
>have ->func pointer there. This introduces no performance penalty,
>reduces struct netdevice by one pointer and reduces number of needed
>rcu_dereference calls from 2 to 1.
>
>Note this also fixes the race bug pointed out by Steven Rostedt and
>fixed by patch "[PATCH] net: add a synchronize_net() in
>netdev_rx_handler_unregister()" by Eric Dumazet. This is based on
>current net-next tree where the patch is not applied yet.
>I can rebase it on whatever tree/state, just say so.
>
>Smoke tested with all drivers who use rx_handler.
>
>Reported-by: Steven Rostedt <rostedt-nx8X9YLhiw1AfugRpC6u6w@public.gmane.org>
>Signed-off-by: Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
>---
 +
>+
>--- a/net/core/dev.c
>+++ b/net/core/dev.c
>@@ -3296,10 +3296,23 @@ out:
> #endif
> 
> /**
>+ *	netdev_rx_handler_init - init receive handler structure
>+ *	@rx_handler: receive handler to init
>+ *	@func: receive handler function
>+ *
>+ *	Inits receive handler structure.
>+ */
>+void netdev_rx_handler_init(struct netdev_rx_handler *rx_handler,
>+			    rx_handler_func_t *func)
>+{
>+	rx_handler->func = func;
>+}
>+EXPORT_SYMBOL_GPL(netdev_rx_handler_init);
>+
>+/**
>  *	netdev_rx_handler_register - register receive handler
>  *	@dev: device to register a handler for
>  *	@rx_handler: receive handler to register
>- *	@rx_handler_data: data pointer that is used by rx handler
>  *
>  *	Register a receive hander for a device. This handler will then be
>  *	called from __netif_receive_skb. A negative errno code is returned
>@@ -3310,15 +3323,13 @@ out:
> *	For a general description of rx_handler, see enum rx_handler_result.
>  */
> int netdev_rx_handler_register(struct net_device *dev,
>-			       rx_handler_func_t *rx_handler,
>-			       void *rx_handler_data)
>+			       struct netdev_rx_handler *rx_handler)
> {
> 	ASSERT_RTNL();
> 
> 	if (dev->rx_handler)
> 		return -EBUSY;
> 
>-	rcu_assign_pointer(dev->rx_handler_data, rx_handler_data);
> 	rcu_assign_pointer(dev->rx_handler, rx_handler);
> 
> 	return 0;
>@@ -3338,7 +3349,6 @@ void netdev_rx_handler_unregister(struct
>net_device *dev)
> 
> 	ASSERT_RTNL();
> 	RCU_INIT_POINTER(dev->rx_handler, NULL);
>-	RCU_INIT_POINTER(dev->rx_handler_data, NULL);
> }
> EXPORT_SYMBOL_GPL(netdev_rx_handler_unregister);
> 
>@@ -3362,7 +3372,7 @@ static bool skb_pfmemalloc_protocol(struct
>sk_buff *skb)
>static int __netif_receive_skb_core(struct sk_buff *skb, bool
>pfmemalloc)
> {
> 	struct packet_type *ptype, *pt_prev;
>-	rx_handler_func_t *rx_handler;
>+	struct netdev_rx_handler *rx_handler;
> 	struct net_device *orig_dev;
> 	struct net_device *null_or_dev;
> 	bool deliver_exact = false;
>@@ -3445,7 +3455,7 @@ ncls:
> 			ret = deliver_skb(skb, pt_prev, orig_dev);
> 			pt_prev = NULL;
> 		}
>-		switch (rx_handler(&skb)) {
>+		switch (rx_handler->func(&skb, rx_handler)) {


This doesn't solve anything. The CPU can be executing func when you set it to null.  Then you have the same problem.  This patch shows you still don't understand the bug.  


-- Steve


> 		case RX_HANDLER_CONSUMED:
> 			ret = NET_RX_SUCCESS;
> 			goto unlock;
>diff --git a/net/openvswitch/dp_notify.c b/net/openvswitch/dp_notify.c
>index 5558350..5248f8e 100644
>--- a/net/openvswitch/dp_notify.c
>+++ b/net/openvswitch/dp_notify.c
>@@ -32,7 +32,7 @@ static int dp_device_event(struct notifier_block
>*unused, unsigned long event,
> 	if (ovs_is_internal_dev(dev))
> 		vport = ovs_internal_dev_get_vport(dev);
> 	else
>-		vport = ovs_netdev_get_vport(dev);
>+		vport = ovs_netdev_get_vport_rtnl(dev);
> 
> 	if (!vport)
> 		return NOTIFY_DONE;
>diff --git a/net/openvswitch/vport-netdev.c
>b/net/openvswitch/vport-netdev.c
>index 2130d61..0ff6aa1 100644
>--- a/net/openvswitch/vport-netdev.c
>+++ b/net/openvswitch/vport-netdev.c
>@@ -35,9 +35,6 @@
> /* Must be called with rcu_read_lock. */
>static void netdev_port_receive(struct vport *vport, struct sk_buff
>*skb)
> {
>-	if (unlikely(!vport))
>-		goto error;
>-
> 	if (unlikely(skb_warn_if_lro(skb)))
> 		goto error;
> 
>@@ -57,7 +54,8 @@ error:
> }
> 
> /* Called with rcu_read_lock and bottom-halves disabled. */
>-static rx_handler_result_t netdev_frame_hook(struct sk_buff **pskb)
>+static rx_handler_result_t
>+netdev_frame_hook(struct sk_buff **pskb, struct netdev_rx_handler
>*rx_handler)
> {
> 	struct sk_buff *skb = *pskb;
> 	struct vport *vport;
>@@ -65,7 +63,7 @@ static rx_handler_result_t netdev_frame_hook(struct
>sk_buff **pskb)
> 	if (unlikely(skb->pkt_type == PACKET_LOOPBACK))
> 		return RX_HANDLER_PASS;
> 
>-	vport = ovs_netdev_get_vport(skb->dev);
>+	vport = ovs_netdev_get_vport_by_rx_handler(rx_handler);
> 
> 	netdev_port_receive(vport, skb);
> 
>@@ -100,8 +98,8 @@ static struct vport *netdev_create(const struct
>vport_parms *parms)
> 		goto error_put;
> 	}
> 
>-	err = netdev_rx_handler_register(netdev_vport->dev,
>netdev_frame_hook,
>-					 vport);
>+	netdev_rx_handler_init(&vport->rx_handler, netdev_frame_hook);
>+	err = netdev_rx_handler_register(netdev_vport->dev,
>&vport->rx_handler);
> 	if (err)
> 		goto error_put;
> 
>@@ -185,16 +183,6 @@ error:
> 	return 0;
> }
> 
>-/* Returns null if this device is not attached to a datapath. */
>-struct vport *ovs_netdev_get_vport(struct net_device *dev)
>-{
>-	if (likely(dev->priv_flags & IFF_OVS_DATAPATH))
>-		return (struct vport *)
>-			rcu_dereference_rtnl(dev->rx_handler_data);
>-	else
>-		return NULL;
>-}
>-
> const struct vport_ops ovs_netdev_vport_ops = {
> 	.type		= OVS_VPORT_TYPE_NETDEV,
> 	.create		= netdev_create,
>diff --git a/net/openvswitch/vport-netdev.h
>b/net/openvswitch/vport-netdev.h
>index 6478079..d3f1471 100644
>--- a/net/openvswitch/vport-netdev.h
>+++ b/net/openvswitch/vport-netdev.h
>@@ -24,7 +24,21 @@
> 
> #include "vport.h"
> 
>-struct vport *ovs_netdev_get_vport(struct net_device *dev);
>+#define ovs_netdev_vport_exists(dev) (dev->priv_flags &
>IFF_OVS_DATAPATH)
>+
>+static inline struct vport *
>+ovs_netdev_get_vport_by_rx_handler(const struct netdev_rx_handler
>*rx_handler)
>+{
>+	return container_of(rx_handler, struct vport, rx_handler);
>+}
>+
>+static inline struct vport *
>+ovs_netdev_get_vport_rtnl(const struct net_device *dev)
>+{
>+	if (!ovs_netdev_vport_exists(dev))
>+		return NULL;
>+	return
>ovs_netdev_get_vport_by_rx_handler(rtnl_dereference(dev->rx_handler));
>+}
> 
> struct netdev_vport {
> 	struct rcu_head rcu;
>diff --git a/net/openvswitch/vport.h b/net/openvswitch/vport.h
>index aee7d43..a7a07c0 100644
>--- a/net/openvswitch/vport.h
>+++ b/net/openvswitch/vport.h
>@@ -67,6 +67,7 @@ struct vport_err_stats {
> 
> /**
>  * struct vport - one port within a datapath
>+ * @rx_handler: RX handler structure.
>  * @rcu: RCU callback head for deferred destruction.
>  * @dp: Datapath to which this port belongs.
>* @upcall_portid: The Netlink port to use for packets received on this
>port that
>@@ -80,6 +81,7 @@ struct vport_err_stats {
>  * @err_stats: Points to error statistics used and maintained by vport
>  */
> struct vport {
>+	struct netdev_rx_handler rx_handler;
> 	struct rcu_head rcu;
> 	struct datapath	*dp;
> 	u32 upcall_portid;

-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

^ permalink raw reply

* Re: [PATCH] net/wireless/sme: sched_scan_mtx lock in cfg80211_conn_work()
From: Johannes Berg @ 2013-03-30 18:13 UTC (permalink / raw)
  To: Artem Savkov; +Cc: linux-wireless, netdev, linux-kernel
In-Reply-To: <1364664014-12304-1-git-send-email-artem.savkov@gmail.com>

On Sat, 2013-03-30 at 21:20 +0400, Artem Savkov wrote:
> Introduced in f9f475292dbb0e7035fb6661d1524761ea0888d9
> 
> cfg80211_conn_scan() which requires sched_scan_mtx to be held can be called
> from cfg80211_conn_work(). Without this we are hitting multiple warnings like
> the following:
> 
> [   51.996440] ------------[ cut here ]------------
> [   51.996484] WARNING: at net/wireless/sme.c:88 cfg80211_conn_scan+0x1dc/0x3a0 [cfg80211]()

Applied, thanks. I guess I need to add

iw wlan0 scan trigger ; iw wlan0 connect -w NO_SUCH_SSID

to my tests ...

johannes

^ permalink raw reply

* [PATCH] ip_gre: don't overwrite iflink during net_dev init
From: Antonio Quartulli @ 2013-03-30 17:07 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, Fengguang Wu, Antonio Quartulli, Pravin B Shelar,
	David S. Miller
In-Reply-To: <20130330115245.GA4024@ritirata.org>

iflink is currently set to 0 in __gre_tunnel_init(). This
function is invoked in gre_tap_init() and
ipgre_tunnel_init() which are both used to initialise the
ndo_init field of the respective net_device_ops structs
(ipgre.. and gre_tap..) used by GRE interfaces.

However, in netdevice_register() iflink is first set to -1,
then ndo_init is invoked and then iflink is assigned to a
proper value if and only if it still was -1.

Assigning 0 to iflink in ndo_init is therefore first
preventing netdev_register() to correctly assign it a proper
value and then breaking iflink at all since 0 has not
correct meaning.

Fix this by removing the iflink assignment in
__gre_tunnel_init().

Introduced by c54419321455631079c7d6e60bc732dd0c5914c5
("GRE: Refactor GRE tunneling code.")

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Pravin B Shelar <pshelar@nicira.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Antonio Quartulli <ordex@autistici.org>
---
 net/ipv4/ip_gre.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index ad662e9..e5dfd28 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -660,7 +660,6 @@ static void __gre_tunnel_init(struct net_device *dev)

 	dev->needed_headroom	= LL_MAX_HEADER + sizeof(struct iphdr) + 4;
 	dev->mtu		= ETH_DATA_LEN - sizeof(struct iphdr) - 4;
-	dev->iflink		= 0;

 	dev->features		|= NETIF_F_NETNS_LOCAL | GRE_FEATURES;
 	dev->hw_features	|= GRE_FEATURES;
-- 
1.8.1.5

^ permalink raw reply related

* [PATCH] net/wireless/sme: sched_scan_mtx lock in cfg80211_conn_work()
From: Artem Savkov @ 2013-03-30 17:20 UTC (permalink / raw)
  To: johannes.berg; +Cc: linux-wireless, netdev, linux-kernel, Artem Savkov

Introduced in f9f475292dbb0e7035fb6661d1524761ea0888d9

cfg80211_conn_scan() which requires sched_scan_mtx to be held can be called
from cfg80211_conn_work(). Without this we are hitting multiple warnings like
the following:

[   51.996440] ------------[ cut here ]------------
[   51.996484] WARNING: at net/wireless/sme.c:88 cfg80211_conn_scan+0x1dc/0x3a0 [cfg80211]()
[   51.996489] Hardware name: 0578A21
[   51.996492] Modules linked in: iwldvm mac80211 uvcvideo videobuf2_vmalloc
videobuf2_memops iwlwifi videobuf2_core videodev media cfg80211 ums_realtek
kvm_intel kvm intel_ips r8169 lpc_ich joydev crc32c_intel acpi_cpufreq mperf thinkpad_acpi
[   51.996533] Pid: 620, comm: kworker/3:1 Not tainted 3.9.0-rc4-next-20130328+ #326
[   51.996537] Call Trace:
[   51.996550]  [<c1036992>] warn_slowpath_common+0x72/0xa0
[   51.996583]  [<faa4b0ec>] ? cfg80211_conn_scan+0x1dc/0x3a0 [cfg80211]
[   51.996615]  [<faa4b0ec>] ? cfg80211_conn_scan+0x1dc/0x3a0 [cfg80211]
[   51.996622]  [<c10369e2>] warn_slowpath_null+0x22/0x30
[   51.996654]  [<faa4b0ec>] cfg80211_conn_scan+0x1dc/0x3a0 [cfg80211]
[   51.996662]  [<c109076c>] ? __lock_is_held+0x3c/0x60
[   51.996694]  [<faa4b344>] cfg80211_conn_do_work+0x94/0x380 [cfg80211]
[   51.996700]  [<c109175a>] ? __lock_acquire.isra.25+0x38a/0xc80
[   51.996711]  [<c109e679>] ? is_module_text_address+0x19/0x30
[   51.996720]  [<c10577cf>] ? __kernel_text_address+0x4f/0x70
[   51.996728]  [<c10052b1>] ? print_context_stack+0x41/0xd0
[   51.996763]  [<faa4c3b2>] cfg80211_conn_work+0xa2/0x130 [cfg80211]
[   51.996771]  [<c1051858>] process_one_work+0x198/0x450
[   51.996779]  [<c10517d0>] ? process_one_work+0x110/0x450
[   51.996787]  [<c10527c9>] worker_thread+0xf9/0x320
[   51.996795]  [<c10526d0>] ? manage_workers.isra.17+0x260/0x260
[   51.996803]  [<c105a3c4>] kthread+0x94/0xa0
[   51.996812]  [<c1060000>] ? profiling_show+0x10/0x30
[   51.996823]  [<c170c277>] ret_from_kernel_thread+0x1b/0x28
[   51.996830]  [<c105a330>] ? __init_kthread_worker+0x60/0x60
[   51.996835] ---[ end trace 6c56f7f4ff21820b ]---

Signed-off-by: Artem Savkov <artem.savkov@gmail.com>
---
 net/wireless/sme.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/wireless/sme.c b/net/wireless/sme.c
index 818ad63..a9dc5c7 100644
--- a/net/wireless/sme.c
+++ b/net/wireless/sme.c
@@ -228,6 +228,7 @@ void cfg80211_conn_work(struct work_struct *work)
 	rtnl_lock();
 	cfg80211_lock_rdev(rdev);
 	mutex_lock(&rdev->devlist_mtx);
+	mutex_lock(&rdev->sched_scan_mtx);
 
 	list_for_each_entry(wdev, &rdev->wdev_list, list) {
 		wdev_lock(wdev);
@@ -252,6 +253,7 @@ void cfg80211_conn_work(struct work_struct *work)
 		wdev_unlock(wdev);
 	}
 
+	mutex_unlock(&rdev->sched_scan_mtx);
 	mutex_unlock(&rdev->devlist_mtx);
 	cfg80211_unlock_rdev(rdev);
 	rtnl_unlock();
-- 
1.8.2

^ permalink raw reply related

* Re: [patch net-next] net: squash ->rx_handler and ->rx_handler_data into single rcu pointer
From: Jiri Pirko @ 2013-03-30 17:13 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, davem, edumazet, fubar, andy, kaber, stephen, jesse,
	alexander.h.duyck, xiyou.wangcong, dev, rostedt,
	nicolas.2p.debian, tglx, streeter, paulmck, ivecera
In-Reply-To: <1364660588.5113.110.camel@edumazet-glaptop>

Sat, Mar 30, 2013 at 05:23:08PM CET, eric.dumazet@gmail.com wrote:
>On Sat, 2013-03-30 at 16:57 +0100, Jiri Pirko wrote:
>> No need to have two pointers in struct netdevice for rx_handler func and
>> priv data. Just embed rx_handler structure into driver port_priv and
>> have ->func pointer there. This introduces no performance penalty,
>> reduces struct netdevice by one pointer and reduces number of needed
>> rcu_dereference calls from 2 to 1.
>> 
>
>Thats not true.
>
>> Note this also fixes the race bug pointed out by Steven Rostedt and
>> fixed by patch "[PATCH] net: add a synchronize_net() in
>> netdev_rx_handler_unregister()" by Eric Dumazet. This is based on
>> current net-next tree where the patch is not applied yet.
>> I can rebase it on whatever tree/state, just say so.
>> 
>> Smoke tested with all drivers who use rx_handler.
>> 
>> Reported-by: Steven Rostedt <rostedt@goodmis.org>
>> Signed-off-by: Jiri Pirko <jiri@resnulli.us>
>> ---
>
>I see no value for this patch.
>
>It obfuscates things for no good reason.

Well, I like this what you call obfuscation. It looks similar to
list_head and rcu_head. I do not think that this makes the readibility
more difficult.

>
>Once again rcu_dereference(dev->field) has no cost, its a memory read,
>like dev->field.

Well, not entirely true, depends on arch.

>
>I fear you don't understand enough RCU to make so invasive changes.
>
>Your patch adds a double dereference on fast path, and its more
>expensive than two single deref.
>
>dev->rx_handler actually gets the function pointer, and after your patch
>we would have to do dev->rx_handler->func instead, which is bad on many
>cpus.

dev->rx_handler == port_priv and it is treated as such and accessed in
handle_frame driver functions. Is it really that bad to access deref
port_priv->pointer (rx_handler->func) couple of instructions earlier?
I wonder what the difference is in compare with the original code.


Thanks, Jiri

>
>I'll send a patch reordering some fields of net_device, because as time
>passed, it seems a lot of junk broke work done in commit
>cd13539b8bc9ae884 (net: shrinks struct net_device)
>
>offsetof(struct net_device,dev_addr)=0x258
>offsetof(struct net_device,rx_handler)=0x2b8
>offsetof(struct net_device,ingress_queue)=0x2c8
>offsetof(struct net_device,broadcast)=0x278
>
>

^ permalink raw reply

* [PATCH net-next] net: reorder some fields of net_device
From: Eric Dumazet @ 2013-03-30 16:31 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

From: Eric Dumazet <edumazet@google.com>

As time passed, some fields were added in net_device, and not
at sensible offsets.

Lets reorder some fields to reduce number of cache lines in RX path.
Fields not used in data path should be moved out of this critical cache
line.

In particular, move broadcast[] to the end of the rx section,
as it is less used, and ethernet uses only the beginning of the 32bytes
field.

Before patch :

offsetof(struct net_device,dev_addr)=0x258
offsetof(struct net_device,rx_handler)=0x2b8
offsetof(struct net_device,ingress_queue)=0x2c8
offsetof(struct net_device,broadcast)=0x278

After :

offsetof(struct net_device,dev_addr)=0x280
offsetof(struct net_device,rx_handler)=0x298
offsetof(struct net_device,ingress_queue)=0x2a8
offsetof(struct net_device,broadcast)=0x2b0


Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/netdevice.h |   33 +++++++++++++++++----------------
 1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 1dbb02c..4491414 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1071,6 +1071,8 @@ struct net_device {
 	struct list_head	dev_list;
 	struct list_head	napi_list;
 	struct list_head	unreg_list;
+	struct list_head	upper_dev_list; /* List of upper devices */
+
 
 	/* currently active device features */
 	netdev_features_t	features;
@@ -1143,6 +1145,13 @@ struct net_device {
 	spinlock_t		addr_list_lock;
 	struct netdev_hw_addr_list	uc;	/* Unicast mac addresses */
 	struct netdev_hw_addr_list	mc;	/* Multicast mac addresses */
+	struct netdev_hw_addr_list	dev_addrs; /* list of device
+						    * hw addresses
+						    */
+#ifdef CONFIG_SYSFS
+	struct kset		*queues_kset;
+#endif
+
 	bool			uc_promisc;
 	unsigned int		promiscuity;
 	unsigned int		allmulti;
@@ -1175,21 +1184,11 @@ struct net_device {
 						 * avoid dirtying this cache line.
 						 */
 
-	struct list_head	upper_dev_list; /* List of upper devices */
-
 	/* Interface address info used in eth_type_trans() */
 	unsigned char		*dev_addr;	/* hw address, (before bcast
 						   because most packets are
 						   unicast) */
 
-	struct netdev_hw_addr_list	dev_addrs; /* list of device
-						      hw addresses */
-
-	unsigned char		broadcast[MAX_ADDR_LEN];	/* hw bcast add	*/
-
-#ifdef CONFIG_SYSFS
-	struct kset		*queues_kset;
-#endif
 
 #ifdef CONFIG_RPS
 	struct netdev_rx_queue	*_rx;
@@ -1200,18 +1199,14 @@ struct net_device {
 	/* Number of RX queues currently active in device */
 	unsigned int		real_num_rx_queues;
 
-#ifdef CONFIG_RFS_ACCEL
-	/* CPU reverse-mapping for RX completion interrupts, indexed
-	 * by RX queue number.  Assigned by driver.  This must only be
-	 * set if the ndo_rx_flow_steer operation is defined. */
-	struct cpu_rmap		*rx_cpu_rmap;
-#endif
 #endif
 
 	rx_handler_func_t __rcu	*rx_handler;
 	void __rcu		*rx_handler_data;
 
 	struct netdev_queue __rcu *ingress_queue;
+	unsigned char		broadcast[MAX_ADDR_LEN];	/* hw bcast add	*/
+
 
 /*
  * Cache lines mostly used on transmit path
@@ -1233,6 +1228,12 @@ struct net_device {
 #ifdef CONFIG_XPS
 	struct xps_dev_maps __rcu *xps_maps;
 #endif
+#ifdef CONFIG_RFS_ACCEL
+	/* CPU reverse-mapping for RX completion interrupts, indexed
+	 * by RX queue number.  Assigned by driver.  This must only be
+	 * set if the ndo_rx_flow_steer operation is defined. */
+	struct cpu_rmap		*rx_cpu_rmap;
+#endif
 
 	/* These may be needed for future network-power-down code. */
 

^ permalink raw reply related

* Re: [patch net-next] net: squash ->rx_handler and ->rx_handler_data into single rcu pointer
From: Eric Dumazet @ 2013-03-30 16:23 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: alexander.h.duyck-ral2JQCrhuEAvxtiuMwx3w,
	dev-yBygre7rU0TnMu66kgdUjQ, paulmck-r/Jw6+rmf7HQT0dZR+AlfA,
	streeter-H+wXaHxf7aLQT0dZR+AlfA,
	nicolas.2p.debian-Re5JQEeQqe8AvxtiuMwx3w,
	netdev-u79uwXL29TY76Z2rM5mHXA, fubar-r/Jw6+rmf7HQT0dZR+AlfA,
	rostedt-nx8X9YLhiw1AfugRpC6u6w, kaber-dcUjhNyLwpNeoWH0uzbU5w,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, ivecera-H+wXaHxf7aLQT0dZR+AlfA,
	tglx-hfZtesqFncYOwBW4kG4KsQ, davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	andy-QlMahl40kYEqcZcGjlUOXw
In-Reply-To: <1364659034-7049-1-git-send-email-jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>

On Sat, 2013-03-30 at 16:57 +0100, Jiri Pirko wrote:
> No need to have two pointers in struct netdevice for rx_handler func and
> priv data. Just embed rx_handler structure into driver port_priv and
> have ->func pointer there. This introduces no performance penalty,
> reduces struct netdevice by one pointer and reduces number of needed
> rcu_dereference calls from 2 to 1.
> 

Thats not true.

> Note this also fixes the race bug pointed out by Steven Rostedt and
> fixed by patch "[PATCH] net: add a synchronize_net() in
> netdev_rx_handler_unregister()" by Eric Dumazet. This is based on
> current net-next tree where the patch is not applied yet.
> I can rebase it on whatever tree/state, just say so.
> 
> Smoke tested with all drivers who use rx_handler.
> 
> Reported-by: Steven Rostedt <rostedt-nx8X9YLhiw1AfugRpC6u6w@public.gmane.org>
> Signed-off-by: Jiri Pirko <jiri-rHqAuBHg3fBzbRFIqnYvSA@public.gmane.org>
> ---

I see no value for this patch.

It obfuscates things for no good reason.

Once again rcu_dereference(dev->field) has no cost, its a memory read,
like dev->field.

I fear you don't understand enough RCU to make so invasive changes.

Your patch adds a double dereference on fast path, and its more
expensive than two single deref.

dev->rx_handler actually gets the function pointer, and after your patch
we would have to do dev->rx_handler->func instead, which is bad on many
cpus.

I'll send a patch reordering some fields of net_device, because as time
passed, it seems a lot of junk broke work done in commit
cd13539b8bc9ae884 (net: shrinks struct net_device)

offsetof(struct net_device,dev_addr)=0x258
offsetof(struct net_device,rx_handler)=0x2b8
offsetof(struct net_device,ingress_queue)=0x2c8
offsetof(struct net_device,broadcast)=0x278

^ permalink raw reply

* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
From: Benoit Lourdelet @ 2013-03-30 16:07 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric W. Biederman, Stephen Hemminger, Serge Hallyn,
	netdev@vger.kernel.org
In-Reply-To: <1364654674.5113.87.camel@edumazet-glaptop>

Sorry Eric,

This is not an lxc-start perf report.This is an "ip" report".

Will run an "lxc-start" perf  ASAP.

Regards

Benoit

On 30/03/2013 15:44, "Eric Dumazet" <eric.dumazet@gmail.com> wrote:

>On Sat, 2013-03-30 at 10:09 +0000, Benoit Lourdelet wrote:
>> Hello,
>> 
>> Here are my tests  of the last patches on 3 different platforms all
>> running 3.8.5 :
>> 
>> Time are in seconds :
>> 
>> 8x 3.7Ghz virtual cores
>> 
>> # veth 	create 	delete
>> 1000 	14 	18
>> 2000 	39 	56
>> 5000 	256 	161
>> 10000 	1200 	399
>> 
>> 
>> 8x 3.2Ghz virtual cores
>> 
>> # veth 	create 	delete
>> 
>> 1000 	19 	40
>> 2000 	118 	66
>> 5000 	305 	251
>> 
>> 
>> 
>> 32x 2Ghz virtual cores , 2 sockets
>> # veth 	create 	delete
>> 1000 	35 	86
>> 
>> 2000 	120 	90
>> 
>> 5000 	724 	245
>> 
>> 
>> Compared to initial iproute2 performance on this 32 virtual core
>>system :
>> 5000 1143  1185
>> 
>> 
>> 
>> "perf record" for creation of 5000 veth on the 32 core system :
>> 
>> # captured on: Fri Mar 29 14:03:35 2013
>> # hostname : ieng-serv06
>> # os release : 3.8.5
>> # perf version : 3.8.5
>> # arch : x86_64
>> # nrcpus online : 32
>> # nrcpus avail : 32
>> # cpudesc : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
>> # cpuid : GenuineIntel,6,45,7
>> # total memory : 264124548 kB
>> # cmdline : /usr/src/linux-3.8.5/tools/perf/perf record -a
>>./test3.script
>> # event : name = cycles, type = 0, config = 0x0, config1 = 0x0, config2
>>=
>> 0x0, excl_usr = 0, excl_kern = 0, excl_host = 0, excl_guest = 1,
>> precise_ip = 0, id = { 36, 37, 38, 39, 40, 41, 42,
>> # HEADER_CPU_TOPOLOGY info available, use -I to display
>> # HEADER_NUMA_TOPOLOGY info available, use -I to display
>> # pmu mappings: cpu = 4, software = 1, uncore_pcu = 15, tracepoint = 2,
>> uncore_imc_0 = 17, uncore_imc_1 = 18, uncore_imc_2 = 19, uncore_imc_3 =
>> 20, uncore_qpi_0 = 21, uncore_qpi_1 = 22, unco
>> # ========
>> #
>> # Samples: 9M of event 'cycles'
>> # Event count (approx.): 2894480238483
>> #
>> # Overhead          Command                  Shared Object
>>                           Symbol
>> # ........  ...............  .............................
>> ...............................................
>> #
>>     15.17%             sudo  [kernel.kallsyms]              [k]
>> snmp_fold_field 
>>      5.94%             sudo  libc-2.15.so                   [.]
>> 0x00000000000802cd
>>      5.64%             sudo  [kernel.kallsyms]              [k]
>> find_next_bit   
>>      3.21%             init  libnih.so.1.0.0                [.]
>> nih_list_add_after
>>      2.12%          swapper  [kernel.kallsyms]              [k]
>>intel_idle
>>                 
>>      1.94%             init  [kernel.kallsyms]              [k]
>>page_fault
>>                 
>>      1.93%              sed  libc-2.15.so                   [.]
>> 0x00000000000a1368
>>      1.93%             sudo  [kernel.kallsyms]              [k]
>> rtnl_fill_ifinfo
>>      1.92%             sudo  [veth]                         [k]
>> veth_get_stats64
>>      1.78%             sudo  [kernel.kallsyms]              [k] memcpy
>>                 
>>      1.53%          ifquery  libc-2.15.so                   [.]
>> 0x000000000007f52b
>>      1.24%             init  libc-2.15.so                   [.]
>> 0x000000000008918f
>>      1.05%             sudo  [kernel.kallsyms]              [k]
>> inet6_fill_ifla6_attrs
>>      0.98%             init  [kernel.kallsyms]              [k]
>> copy_pte_range  
>>      0.88%       irqbalance  libc-2.15.so                   [.]
>> 0x00000000000802cd
>>      0.85%             sudo  [kernel.kallsyms]              [k] memset
>>                 
>>      0.72%              sed  ld-2.15.so                     [.]
>> 0x000000000000a226
>>      0.68%          ifquery  ld-2.15.so                     [.]
>> 0x00000000000165a0
>>      0.64%             init  libnih.so.1.0.0                [.]
>> nih_tree_next_post_full
>>      0.61%  bridge-network-  libc-2.15.so                   [.]
>> 0x0000000000131e2a
>>      0.59%             init  [kernel.kallsyms]              [k]
>>do_wp_page
>>                 
>>      0.59%          ifquery  [kernel.kallsyms]              [k]
>>page_fault
>>                 
>>      0.54%              sed  [kernel.kallsyms]              [k]
>>page_fault
>>                 
>> 
>> 
>> 
>> 
>> Regards
>> 
>> Benoit
>> 
>> 
>> 
>> 
>
>This means lxc-start does the same thing than ip :
>
>It fetches the whole device list.
>
>You could strace it to have a confirmation.
>
>
>
>
>

^ permalink raw reply

* [patch net-next] net: squash ->rx_handler and ->rx_handler_data into single rcu pointer
From: Jiri Pirko @ 2013-03-30 15:57 UTC (permalink / raw)
  To: netdev
  Cc: davem, edumazet, fubar, andy, kaber, stephen, jesse,
	alexander.h.duyck, xiyou.wangcong, dev, rostedt,
	nicolas.2p.debian, tglx, streeter, paulmck, ivecera

No need to have two pointers in struct netdevice for rx_handler func and
priv data. Just embed rx_handler structure into driver port_priv and
have ->func pointer there. This introduces no performance penalty,
reduces struct netdevice by one pointer and reduces number of needed
rcu_dereference calls from 2 to 1.

Note this also fixes the race bug pointed out by Steven Rostedt and
fixed by patch "[PATCH] net: add a synchronize_net() in
netdev_rx_handler_unregister()" by Eric Dumazet. This is based on
current net-next tree where the patch is not applied yet.
I can rebase it on whatever tree/state, just say so.

Smoke tested with all drivers who use rx_handler.

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 drivers/net/bonding/bond_main.c |  9 +++++----
 drivers/net/bonding/bonding.h   | 16 +++++++++++-----
 drivers/net/macvlan.c           | 30 +++++++++++++++++++++---------
 drivers/net/team/team.c         | 22 +++++++++++-----------
 include/linux/if_team.h         |  1 +
 include/linux/netdevice.h       | 17 ++++++++++++-----
 net/bridge/br_if.c              |  3 ++-
 net/bridge/br_input.c           |  5 +++--
 net/bridge/br_private.h         | 26 ++++++++++++++++++--------
 net/core/dev.c                  | 24 +++++++++++++++++-------
 net/openvswitch/dp_notify.c     |  2 +-
 net/openvswitch/vport-netdev.c  | 22 +++++-----------------
 net/openvswitch/vport-netdev.h  | 16 +++++++++++++++-
 net/openvswitch/vport.h         |  2 ++
 14 files changed, 124 insertions(+), 71 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 11a8cb3..03e9895 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1440,7 +1440,8 @@ static bool bond_should_deliver_exact_match(struct sk_buff *skb,
 	return false;
 }
 
-static rx_handler_result_t bond_handle_frame(struct sk_buff **pskb)
+static rx_handler_result_t
+bond_handle_frame(struct sk_buff **pskb, struct netdev_rx_handler *rx_handler)
 {
 	struct sk_buff *skb = *pskb;
 	struct slave *slave;
@@ -1455,7 +1456,7 @@ static rx_handler_result_t bond_handle_frame(struct sk_buff **pskb)
 
 	*pskb = skb;
 
-	slave = bond_slave_get_rcu(skb->dev);
+	slave = bond_slave_get_by_rx_handler(rx_handler);
 	bond = slave->bond;
 
 	if (bond->params.arp_interval)
@@ -1880,8 +1881,8 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 	if (res)
 		goto err_detach;
 
-	res = netdev_rx_handler_register(slave_dev, bond_handle_frame,
-					 new_slave);
+	netdev_rx_handler_init(&new_slave->rx_handler, bond_handle_frame);
+	res = netdev_rx_handler_register(slave_dev, &new_slave->rx_handler);
 	if (res) {
 		pr_debug("Error %d calling netdev_rx_handler_register\n", res);
 		goto err_dest_symlinks;
diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index 2baec24..cd3ccb4 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -172,7 +172,8 @@ struct vlan_entry {
 };
 
 struct slave {
-	struct net_device *dev; /* first - useful for panic debug */
+	struct netdev_rx_handler rx_handler;
+	struct net_device *dev;
 	struct slave *next;
 	struct slave *prev;
 	struct bonding *bond; /* our master */
@@ -256,11 +257,16 @@ static inline bool bond_vlan_used(struct bonding *bond)
 	return !list_empty(&bond->vlan_list);
 }
 
-#define bond_slave_get_rcu(dev) \
-	((struct slave *) rcu_dereference(dev->rx_handler_data))
+static inline struct slave *
+bond_slave_get_by_rx_handler(const struct netdev_rx_handler *rx_handler)
+{
+	return container_of(rx_handler, struct slave, rx_handler);
+}
 
-#define bond_slave_get_rtnl(dev) \
-	((struct slave *) rtnl_dereference(dev->rx_handler_data))
+static inline struct slave *bond_slave_get_rtnl(const struct net_device *dev)
+{
+	return bond_slave_get_by_rx_handler(rtnl_dereference(dev->rx_handler));
+}
 
 /**
  * Returns NULL if the net_device does not belong to any of the bond's slaves
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 73abbc1..9a065e4 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -36,6 +36,7 @@
 #define MACVLAN_HASH_SIZE	(1 << BITS_PER_BYTE)
 
 struct macvlan_port {
+	struct netdev_rx_handler rx_handler;
 	struct net_device	*dev;
 	struct hlist_head	vlan_hash[MACVLAN_HASH_SIZE];
 	struct list_head	vlans;
@@ -46,9 +47,17 @@ struct macvlan_port {
 
 static void macvlan_port_destroy(struct net_device *dev);
 
-#define macvlan_port_get_rcu(dev) \
-	((struct macvlan_port *) rcu_dereference(dev->rx_handler_data))
-#define macvlan_port_get(dev) ((struct macvlan_port *) dev->rx_handler_data)
+static struct macvlan_port *
+macvlan_port_get_by_rx_handler(const struct netdev_rx_handler *rx_handler)
+{
+	return container_of(rx_handler, struct macvlan_port, rx_handler);
+}
+
+static struct macvlan_port *macvlan_port_get_rtnl(const struct net_device *dev)
+{
+	return macvlan_port_get_by_rx_handler(rtnl_dereference(dev->rx_handler));
+}
+
 #define macvlan_port_exists(dev) (dev->priv_flags & IFF_MACVLAN_PORT)
 
 static struct macvlan_dev *macvlan_hash_lookup(const struct macvlan_port *port,
@@ -174,7 +183,9 @@ static void macvlan_broadcast(struct sk_buff *skb,
 }
 
 /* called under rcu_read_lock() from netif_receive_skb */
-static rx_handler_result_t macvlan_handle_frame(struct sk_buff **pskb)
+static rx_handler_result_t
+macvlan_handle_frame(struct sk_buff **pskb,
+		     struct netdev_rx_handler *rx_handler)
 {
 	struct macvlan_port *port;
 	struct sk_buff *skb = *pskb;
@@ -185,7 +196,7 @@ static rx_handler_result_t macvlan_handle_frame(struct sk_buff **pskb)
 	unsigned int len = 0;
 	int ret = NET_RX_DROP;
 
-	port = macvlan_port_get_rcu(skb->dev);
+	port = macvlan_port_get_by_rx_handler(rx_handler);
 	if (is_multicast_ether_addr(eth->h_dest)) {
 		skb = ip_check_defrag(skb, IP_DEFRAG_MACVLAN);
 		if (!skb)
@@ -693,7 +704,8 @@ static int macvlan_port_create(struct net_device *dev)
 	for (i = 0; i < MACVLAN_HASH_SIZE; i++)
 		INIT_HLIST_HEAD(&port->vlan_hash[i]);
 
-	err = netdev_rx_handler_register(dev, macvlan_handle_frame, port);
+	netdev_rx_handler_init(&port->rx_handler, macvlan_handle_frame);
+	err = netdev_rx_handler_register(dev, &port->rx_handler);
 	if (err)
 		kfree(port);
 	else
@@ -703,7 +715,7 @@ static int macvlan_port_create(struct net_device *dev)
 
 static void macvlan_port_destroy(struct net_device *dev)
 {
-	struct macvlan_port *port = macvlan_port_get(dev);
+	struct macvlan_port *port = macvlan_port_get_rtnl(dev);
 
 	dev->priv_flags &= ~IFF_MACVLAN_PORT;
 	netdev_rx_handler_unregister(dev);
@@ -772,7 +784,7 @@ int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
 		if (err < 0)
 			return err;
 	}
-	port = macvlan_port_get(lowerdev);
+	port = macvlan_port_get_rtnl(lowerdev);
 
 	/* Only 1 macvlan device can be created in passthru mode */
 	if (port->passthru)
@@ -921,7 +933,7 @@ static int macvlan_device_event(struct notifier_block *unused,
 	if (!macvlan_port_exists(dev))
 		return NOTIFY_DONE;
 
-	port = macvlan_port_get(dev);
+	port = macvlan_port_get_rtnl(dev);
 
 	switch (event) {
 	case NETDEV_CHANGE:
diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index 621c1bd..86dcbf1 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -40,18 +40,17 @@
 
 #define team_port_exists(dev) (dev->priv_flags & IFF_TEAM_PORT)
 
-static struct team_port *team_port_get_rcu(const struct net_device *dev)
+static struct team_port *
+team_port_get_by_rx_handler(const struct netdev_rx_handler *rx_handler)
 {
-	struct team_port *port = rcu_dereference(dev->rx_handler_data);
-
-	return team_port_exists(dev) ? port : NULL;
+	return container_of(rx_handler, struct team_port, rx_handler);
 }
 
 static struct team_port *team_port_get_rtnl(const struct net_device *dev)
 {
-	struct team_port *port = rtnl_dereference(dev->rx_handler_data);
-
-	return team_port_exists(dev) ? port : NULL;
+	if (!team_port_exists(dev))
+		return NULL;
+	return team_port_get_by_rx_handler(rtnl_dereference(dev->rx_handler));
 }
 
 /*
@@ -632,7 +631,8 @@ static int team_change_mode(struct team *team, const char *kind)
  ************************/
 
 /* note: already called with rcu_read_lock */
-static rx_handler_result_t team_handle_frame(struct sk_buff **pskb)
+static rx_handler_result_t
+team_handle_frame(struct sk_buff **pskb, struct netdev_rx_handler *rx_handler)
 {
 	struct sk_buff *skb = *pskb;
 	struct team_port *port;
@@ -645,7 +645,7 @@ static rx_handler_result_t team_handle_frame(struct sk_buff **pskb)
 
 	*pskb = skb;
 
-	port = team_port_get_rcu(skb->dev);
+	port = team_port_get_by_rx_handler(rx_handler);
 	team = port->team;
 	if (!team_port_enabled(port)) {
 		/* allow exact match delivery for disabled ports */
@@ -1076,8 +1076,8 @@ static int team_port_add(struct team *team, struct net_device *port_dev)
 		goto err_set_upper_link;
 	}
 
-	err = netdev_rx_handler_register(port_dev, team_handle_frame,
-					 port);
+	netdev_rx_handler_init(&port->rx_handler, team_handle_frame);
+	err = netdev_rx_handler_register(port_dev, &port->rx_handler);
 	if (err) {
 		netdev_err(dev, "Device %s failed to register rx_handler\n",
 			   portname);
diff --git a/include/linux/if_team.h b/include/linux/if_team.h
index 4474557..72fd12e 100644
--- a/include/linux/if_team.h
+++ b/include/linux/if_team.h
@@ -29,6 +29,7 @@ struct team_pcpu_stats {
 struct team;
 
 struct team_port {
+	struct netdev_rx_handler rx_handler;
 	struct net_device *dev;
 	struct hlist_node hlist; /* node in enabled ports hash list */
 	struct list_head list; /* node in ordinary list */
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 1dbb02c..738226e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -387,8 +387,15 @@ enum rx_handler_result {
 	RX_HANDLER_EXACT,
 	RX_HANDLER_PASS,
 };
+
 typedef enum rx_handler_result rx_handler_result_t;
-typedef rx_handler_result_t rx_handler_func_t(struct sk_buff **pskb);
+struct netdev_rx_handler;
+typedef rx_handler_result_t rx_handler_func_t(struct sk_buff **pskb,
+					      struct netdev_rx_handler *rx_handler);
+
+struct netdev_rx_handler {
+	rx_handler_func_t *func;
+};
 
 extern void __napi_schedule(struct napi_struct *n);
 
@@ -1208,8 +1215,7 @@ struct net_device {
 #endif
 #endif
 
-	rx_handler_func_t __rcu	*rx_handler;
-	void __rcu		*rx_handler_data;
+	struct netdev_rx_handler __rcu *rx_handler;
 
 	struct netdev_queue __rcu *ingress_queue;
 
@@ -2212,9 +2218,10 @@ static inline void napi_free_frags(struct napi_struct *napi)
 	napi->skb = NULL;
 }
 
+extern void netdev_rx_handler_init(struct netdev_rx_handler *rx_handler,
+				   rx_handler_func_t *func);
 extern int netdev_rx_handler_register(struct net_device *dev,
-				      rx_handler_func_t *rx_handler,
-				      void *rx_handler_data);
+				      struct netdev_rx_handler *rx_handler);
 extern void netdev_rx_handler_unregister(struct net_device *dev);
 
 extern bool		dev_valid_name(const char *name);
diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index ef1b914..f855c07 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -370,7 +370,8 @@ int br_add_if(struct net_bridge *br, struct net_device *dev)
 	if (err)
 		goto err4;
 
-	err = netdev_rx_handler_register(dev, br_handle_frame, p);
+	netdev_rx_handler_init(&p->rx_handler, br_handle_frame);
+	err = netdev_rx_handler_register(dev, &p->rx_handler);
 	if (err)
 		goto err5;
 
diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index 828e2bc..a8bc5a6 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -150,7 +150,8 @@ static int br_handle_local_finish(struct sk_buff *skb)
  * Return NULL if skb is handled
  * note: already called with rcu_read_lock
  */
-rx_handler_result_t br_handle_frame(struct sk_buff **pskb)
+rx_handler_result_t br_handle_frame(struct sk_buff **pskb,
+				    struct netdev_rx_handler *rx_handler)
 {
 	struct net_bridge_port *p;
 	struct sk_buff *skb = *pskb;
@@ -167,7 +168,7 @@ rx_handler_result_t br_handle_frame(struct sk_buff **pskb)
 	if (!skb)
 		return RX_HANDLER_CONSUMED;
 
-	p = br_port_get_rcu(skb->dev);
+	p = br_port_get_by_rx_handler(rx_handler);
 
 	if (unlikely(is_link_local_ether_addr(dest))) {
 		/*
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 3cbf5be..727cc51 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -127,6 +127,7 @@ struct net_bridge_mdb_htable
 
 struct net_bridge_port
 {
+	struct netdev_rx_handler	rx_handler;
 	struct net_bridge		*br;
 	struct net_device		*dev;
 	struct list_head		list;
@@ -180,18 +181,26 @@ struct net_bridge_port
 
 #define br_port_exists(dev) (dev->priv_flags & IFF_BRIDGE_PORT)
 
-static inline struct net_bridge_port *br_port_get_rcu(const struct net_device *dev)
+static inline struct net_bridge_port *
+br_port_get_by_rx_handler(const struct netdev_rx_handler *rx_handler)
 {
-	struct net_bridge_port *port =
-			rcu_dereference_rtnl(dev->rx_handler_data);
+	return container_of(rx_handler, struct net_bridge_port, rx_handler);
+}
 
-	return br_port_exists(dev) ? port : NULL;
+static inline struct net_bridge_port *
+br_port_get_rcu(const struct net_device *dev)
+{
+	if (!br_port_exists(dev))
+		return NULL;
+	return br_port_get_by_rx_handler(rcu_dereference_rtnl(dev->rx_handler));
 }
 
-static inline struct net_bridge_port *br_port_get_rtnl(struct net_device *dev)
+static inline struct net_bridge_port *
+br_port_get_rtnl(const struct net_device *dev)
 {
-	return br_port_exists(dev) ?
-		rtnl_dereference(dev->rx_handler_data) : NULL;
+	if (!br_port_exists(dev))
+		return NULL;
+	return br_port_get_by_rx_handler(rtnl_dereference(dev->rx_handler));
 }
 
 struct br_cpu_netstats {
@@ -429,7 +438,8 @@ extern netdev_features_t br_features_recompute(struct net_bridge *br,
 
 /* br_input.c */
 extern int br_handle_frame_finish(struct sk_buff *skb);
-extern rx_handler_result_t br_handle_frame(struct sk_buff **pskb);
+extern rx_handler_result_t br_handle_frame(struct sk_buff **pskb,
+					   struct netdev_rx_handler *rx_handler);
 
 /* br_ioctl.c */
 extern int br_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd);
diff --git a/net/core/dev.c b/net/core/dev.c
index 2db88df..88e839c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3296,10 +3296,23 @@ out:
 #endif
 
 /**
+ *	netdev_rx_handler_init - init receive handler structure
+ *	@rx_handler: receive handler to init
+ *	@func: receive handler function
+ *
+ *	Inits receive handler structure.
+ */
+void netdev_rx_handler_init(struct netdev_rx_handler *rx_handler,
+			    rx_handler_func_t *func)
+{
+	rx_handler->func = func;
+}
+EXPORT_SYMBOL_GPL(netdev_rx_handler_init);
+
+/**
  *	netdev_rx_handler_register - register receive handler
  *	@dev: device to register a handler for
  *	@rx_handler: receive handler to register
- *	@rx_handler_data: data pointer that is used by rx handler
  *
  *	Register a receive hander for a device. This handler will then be
  *	called from __netif_receive_skb. A negative errno code is returned
@@ -3310,15 +3323,13 @@ out:
  *	For a general description of rx_handler, see enum rx_handler_result.
  */
 int netdev_rx_handler_register(struct net_device *dev,
-			       rx_handler_func_t *rx_handler,
-			       void *rx_handler_data)
+			       struct netdev_rx_handler *rx_handler)
 {
 	ASSERT_RTNL();
 
 	if (dev->rx_handler)
 		return -EBUSY;
 
-	rcu_assign_pointer(dev->rx_handler_data, rx_handler_data);
 	rcu_assign_pointer(dev->rx_handler, rx_handler);
 
 	return 0;
@@ -3338,7 +3349,6 @@ void netdev_rx_handler_unregister(struct net_device *dev)
 
 	ASSERT_RTNL();
 	RCU_INIT_POINTER(dev->rx_handler, NULL);
-	RCU_INIT_POINTER(dev->rx_handler_data, NULL);
 }
 EXPORT_SYMBOL_GPL(netdev_rx_handler_unregister);
 
@@ -3362,7 +3372,7 @@ static bool skb_pfmemalloc_protocol(struct sk_buff *skb)
 static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)
 {
 	struct packet_type *ptype, *pt_prev;
-	rx_handler_func_t *rx_handler;
+	struct netdev_rx_handler *rx_handler;
 	struct net_device *orig_dev;
 	struct net_device *null_or_dev;
 	bool deliver_exact = false;
@@ -3445,7 +3455,7 @@ ncls:
 			ret = deliver_skb(skb, pt_prev, orig_dev);
 			pt_prev = NULL;
 		}
-		switch (rx_handler(&skb)) {
+		switch (rx_handler->func(&skb, rx_handler)) {
 		case RX_HANDLER_CONSUMED:
 			ret = NET_RX_SUCCESS;
 			goto unlock;
diff --git a/net/openvswitch/dp_notify.c b/net/openvswitch/dp_notify.c
index 5558350..5248f8e 100644
--- a/net/openvswitch/dp_notify.c
+++ b/net/openvswitch/dp_notify.c
@@ -32,7 +32,7 @@ static int dp_device_event(struct notifier_block *unused, unsigned long event,
 	if (ovs_is_internal_dev(dev))
 		vport = ovs_internal_dev_get_vport(dev);
 	else
-		vport = ovs_netdev_get_vport(dev);
+		vport = ovs_netdev_get_vport_rtnl(dev);
 
 	if (!vport)
 		return NOTIFY_DONE;
diff --git a/net/openvswitch/vport-netdev.c b/net/openvswitch/vport-netdev.c
index 2130d61..0ff6aa1 100644
--- a/net/openvswitch/vport-netdev.c
+++ b/net/openvswitch/vport-netdev.c
@@ -35,9 +35,6 @@
 /* Must be called with rcu_read_lock. */
 static void netdev_port_receive(struct vport *vport, struct sk_buff *skb)
 {
-	if (unlikely(!vport))
-		goto error;
-
 	if (unlikely(skb_warn_if_lro(skb)))
 		goto error;
 
@@ -57,7 +54,8 @@ error:
 }
 
 /* Called with rcu_read_lock and bottom-halves disabled. */
-static rx_handler_result_t netdev_frame_hook(struct sk_buff **pskb)
+static rx_handler_result_t
+netdev_frame_hook(struct sk_buff **pskb, struct netdev_rx_handler *rx_handler)
 {
 	struct sk_buff *skb = *pskb;
 	struct vport *vport;
@@ -65,7 +63,7 @@ static rx_handler_result_t netdev_frame_hook(struct sk_buff **pskb)
 	if (unlikely(skb->pkt_type == PACKET_LOOPBACK))
 		return RX_HANDLER_PASS;
 
-	vport = ovs_netdev_get_vport(skb->dev);
+	vport = ovs_netdev_get_vport_by_rx_handler(rx_handler);
 
 	netdev_port_receive(vport, skb);
 
@@ -100,8 +98,8 @@ static struct vport *netdev_create(const struct vport_parms *parms)
 		goto error_put;
 	}
 
-	err = netdev_rx_handler_register(netdev_vport->dev, netdev_frame_hook,
-					 vport);
+	netdev_rx_handler_init(&vport->rx_handler, netdev_frame_hook);
+	err = netdev_rx_handler_register(netdev_vport->dev, &vport->rx_handler);
 	if (err)
 		goto error_put;
 
@@ -185,16 +183,6 @@ error:
 	return 0;
 }
 
-/* Returns null if this device is not attached to a datapath. */
-struct vport *ovs_netdev_get_vport(struct net_device *dev)
-{
-	if (likely(dev->priv_flags & IFF_OVS_DATAPATH))
-		return (struct vport *)
-			rcu_dereference_rtnl(dev->rx_handler_data);
-	else
-		return NULL;
-}
-
 const struct vport_ops ovs_netdev_vport_ops = {
 	.type		= OVS_VPORT_TYPE_NETDEV,
 	.create		= netdev_create,
diff --git a/net/openvswitch/vport-netdev.h b/net/openvswitch/vport-netdev.h
index 6478079..d3f1471 100644
--- a/net/openvswitch/vport-netdev.h
+++ b/net/openvswitch/vport-netdev.h
@@ -24,7 +24,21 @@
 
 #include "vport.h"
 
-struct vport *ovs_netdev_get_vport(struct net_device *dev);
+#define ovs_netdev_vport_exists(dev) (dev->priv_flags & IFF_OVS_DATAPATH)
+
+static inline struct vport *
+ovs_netdev_get_vport_by_rx_handler(const struct netdev_rx_handler *rx_handler)
+{
+	return container_of(rx_handler, struct vport, rx_handler);
+}
+
+static inline struct vport *
+ovs_netdev_get_vport_rtnl(const struct net_device *dev)
+{
+	if (!ovs_netdev_vport_exists(dev))
+		return NULL;
+	return ovs_netdev_get_vport_by_rx_handler(rtnl_dereference(dev->rx_handler));
+}
 
 struct netdev_vport {
 	struct rcu_head rcu;
diff --git a/net/openvswitch/vport.h b/net/openvswitch/vport.h
index aee7d43..a7a07c0 100644
--- a/net/openvswitch/vport.h
+++ b/net/openvswitch/vport.h
@@ -67,6 +67,7 @@ struct vport_err_stats {
 
 /**
  * struct vport - one port within a datapath
+ * @rx_handler: RX handler structure.
  * @rcu: RCU callback head for deferred destruction.
  * @dp: Datapath to which this port belongs.
  * @upcall_portid: The Netlink port to use for packets received on this port that
@@ -80,6 +81,7 @@ struct vport_err_stats {
  * @err_stats: Points to error statistics used and maintained by vport
  */
 struct vport {
+	struct netdev_rx_handler rx_handler;
 	struct rcu_head rcu;
 	struct datapath	*dp;
 	u32 upcall_portid;
-- 
1.8.1.2

^ permalink raw reply related

* Re: [PATCH] Fix IXP4xx coherent allocations
From: Russell King - ARM Linux @ 2013-03-30 15:31 UTC (permalink / raw)
  To: Krzysztof Halasa
  Cc: Ben Hutchings, linux-arm-kernel, netdev, David Miller,
	linux-kernel, c.aeschlimann
In-Reply-To: <m3d2uhdl1l.fsf@intrepid.localdomain>

On Sat, Mar 30, 2013 at 03:22:46PM +0100, Krzysztof Halasa wrote:
> Russell King - ARM Linux <linux@arm.linux.org.uk> writes:
> 
> > I'm having a hard time understanding what is an ARM issue here, what is
> > an ARM bug, and what the DMA API requires.  The DMA API documentation
> > is extremely sparse in describing what's required of the DMA masks,
> > what these functions are supposed to do, and what determines whether
> > a mask is "possible" or not.
> >
> > Moreover, I'm also having a hard time understanding what broke in 3.7,
> > and why this fixes it.
> >
> > In other words, I'm completely failing to understand everything about
> > this patch.
> 
> The ARM issue here is that the coherent DMA mask, by default, is zero
> (i.e. coherent allocations are impossible by default UNLESS the device
> pointer supplied is NULL - since DMA masks are bound to devices).

So why is this the case - and on what devices?

If it's the case for platform devices, and those platform devices are
created by board code, that is the fault of whoever wrote the board
code for those platform devices.

And the reason this will happen is because programmers are human and
lazy.  If something isn't required for something to function, it won't
be thought about, and it seems from your description that the DMA masks
being set correctly wasn't required.

> On most other platforms, the default DMA mask is 0xFFFFFFFF = (u32)-1
> and this is also required by the DMA API.
> 
> In v3.7 (between v3.7-rc6 and v3.7-rc7), Xi Wang noticed that IXP4xx net
> drivers call dma_pool_create() with NULL dev argument, and changed them
> to use &port->netdev->dev. This broke coherent allocations since now the
> device supplied to dma_pool_create() is not NULL and the (zero) mask
> prevents coherent allocations. This means the Ethernet and HSS drivers
> are since then unusable.
> 
> 
> The first part of my patch makes dma_set_coherent_mask (IXP4xx-only
> code) actually set the mask. This is needed as on IXP4xx we have (at
> least) two different situations:
> 
> - PCI devices want "DMA zone" memory (26 bits = 64 MiB)
> - built-in devices can use any RAM (32 bits = 4 GiB).

As I understand it (bear in mind that I have nothing to do with IXP4xx,
so I'm talking from purely what I understand of the platform) the reason
this isn't done is because of this code:

static int ixp4xx_pci_platform_notify(struct device *dev)
{
        if(dev->bus == &pci_bus_type) {
                *dev->dma_mask =  SZ_64M - 1;
                dev->coherent_dma_mask = SZ_64M - 1;
                dmabounce_register_dev(dev, 2048, 4096, ixp4xx_needs_bounce);
        }
        return 0;
}

which pre-sets the DMA and coherent masks for PCI devices.  However,
this doesn't make your patch wrong - I'm just pointing out that the
setting of the mask should already be done for PCI devices at creation
time.

> Without the patch, dma_set_coherent_mask only returns 0 or -EIO, it
> doesn't change the actual coherent DMA mask and it's then impossible for
> coherent allocators to differentiate between the above two cases.
> 
> +++ b/arch/arm/mach-ixp4xx/common-pci.c
> @@ -454,10 +454,15 @@ int ixp4xx_setup(int nr, struct pci_sys_data *sys)
>  
>  int dma_set_coherent_mask(struct device *dev, u64 mask)
>  {
> -	if (mask >= SZ_64M - 1)
> -		return 0;
> +	if ((mask & DMA_BIT_MASK(26)) != DMA_BIT_MASK(26))
> +		return -EIO;
> +
> +	/* PCI devices are limited to 64 MiB */
> +	if (dev_is_pci(dev))
> +		mask &= DMA_BIT_MASK(26); /* Use DMA region */
>  
> -	return -EIO;
> +	dev->coherent_dma_mask = mask;
> +	return 0;
>  }
> 
> 
> The second part of my patch sets the coherent DMA masks of the Ethernet
> and HSS devices to 4 GiB (the value which should already be the
> default). This also seems to be a recommended action.
> 
> +++ b/drivers/net/ethernet/xscale/ixp4xx_eth.c
> @@ -1423,7 +1423,7 @@ static int eth_init_one(struct platform_device *pdev)
>  	dev->ethtool_ops = &ixp4xx_ethtool_ops;
>  	dev->tx_queue_len = 100;
> -
> +	dma_set_coherent_mask(&dev->dev, DMA_BIT_MASK(32));
>  	netif_napi_add(dev, &port->napi, eth_poll, NAPI_WEIGHT);
>  
> +++ b/drivers/net/wan/ixp4xx_hss.c
> @@ -1367,6 +1367,7 @@ static int hss_init_one(struct platform_device *pdev)
>  	port->dev = &pdev->dev;
>  	port->plat = pdev->dev.platform_data;
> +	dma_set_coherent_mask(&dev->dev, DMA_BIT_MASK(32));
>  	netif_napi_add(dev, &port->napi, hss_hdlc_poll, NAPI_WEIGHT);
> 

Right, so, the answer is - yes you are talking about platform devices,
and the reason that these aren't already set is because if you grep for
ixp4xx_eth or ixp4xx_hss in arch/arm/mach-ixp4xx, you'll notice that
_none_ of the device declarations set either of the DMA masks at all.
They don't even set the dev->dma_mask pointer.  That is why the masks
are zero.  For a device which does DMA, that is wrong.

If you look at the PCI code, it pre-initializes the DMA mask to be 4GiB:
drivers/pci/probe.c:        dev->dev.coherent_dma_mask = 0xffffffffull;

And that is what is missing from the IXP4xx platform code.

However, avoiding the call to dma_set_coherent_mask() from within the
driver also seems to be questionable as it bypasses the "check if the
mask is possible" part of the DMA API.

So, I think your patch(es) are fine, but I would also suggest adding
the initializations for the coherent DMA mask on those platform device
declarations in arch/arm/mach-ixp4xx, and setting the dma_mask pointer
properly as well - espeically as these drivers use the streaming DMA
API as well.  That will bring IXP4xx into line with stuff like the PCI
layer.

^ permalink raw reply

* Re: [PATCH net] net: fq_codel: Fix off-by-one error
From: Eric Dumazet @ 2013-03-30 15:28 UTC (permalink / raw)
  To: Markus Trippelsdorf; +Cc: Vijay Subramanian, netdev, davem
In-Reply-To: <20130330150852.GA11315@x4>

On Sat, 2013-03-30 at 16:08 +0100, Markus Trippelsdorf wrote:

> This is exactly the setup that I'm using right now (prio + 4 fq_codel
> with bittorent set to low).
> And setting the fq_codel limit to 1024 improves latency in this
> situation. 
> That's all I wanted to communicate. If the result doesn't interest you,
> just ignore my mail.
> 

I didn't ignore your mail, I spent time from my Saturday to answer you.

If your prio setting was right, a limit of 10 should be enough for the
high prio queue, and a mere pfifo would be ok.

By definition, high prio packets should have a minimum latency (assuming
of course that BQL is enabled on your device)

Then, if all your packets land into he same prio queue, classification
is not correct.

If your link is oversubscribed (and Bittorent tends to push links to
over subscribed situation), then you want to increase drops by reducing
queue lengths.

fq_codel default limit is only a hint, like all defaults.

Some users want to increase it, others want to decrease it.

In your case, I suspect the number of flows is too large (and you get
hash collisions in fq), _and_ your ping packets land the crowded
fq_codel qdisc, instead of a small queue reserved for high prio packets.

^ permalink raw reply

* Re: [PATCH net] net: fq_codel: Fix off-by-one error
From: Markus Trippelsdorf @ 2013-03-30 15:08 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Vijay Subramanian, netdev, davem
In-Reply-To: <1364654567.5113.85.camel@edumazet-glaptop>

On 2013.03.30 at 07:42 -0700, Eric Dumazet wrote:
> On Sat, 2013-03-30 at 07:53 +0100, Markus Trippelsdorf wrote:
> > On 2013.03.29 at 08:01 -0700, Eric Dumazet wrote:
> > > 
> > > Just curious, did you play changing the default limit (10240 packets) ?
> > 
> > I did some tests on my home router (running OpenWrt trunk) that is rate-
> > limited with hfsc to the speed of the cable modem.
> > 
> > My tests seem to indicate that lowering the default limit to 1024
> > packets results in much better latency behavior when using bittorrent.
> > 
> > With the default limit (10240 packets) I would get huge ping latencies
> > from 600-1200ms when downloading e.g.:
> > http://download.opensuse.org/distribution/12.3/iso/openSUSE-12.3-DVD-x86_64.iso.torrent
> > with hundreds of peers.
> > 
> > Setting the limit to 1024 did get the latencies back in check (20-30ms
> > with occasional spikes of ~100ms).
> 
> Bittorent uses its own rate limiting technique, defeating
> current cwnd control done in the TCP stack, because of a very known
> problem
> 
> ( http://www.ietf.org/id/draft-ietf-tcpm-newcwv-00.txt )
> 
> So if your goal is reducing latencies for a _given_ class of flows, just
> use prio + 3 fq_codel,  and classify your packets to make sure your
> lovely ping packets are not dropped or behind long packets.

This is exactly the setup that I'm using right now (prio + 4 fq_codel
with bittorent set to low).
And setting the fq_codel limit to 1024 improves latency in this
situation. 
That's all I wanted to communicate. If the result doesn't interest you,
just ignore my mail.

-- 
Markus

^ permalink raw reply

* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
From: Eric Dumazet @ 2013-03-30 14:44 UTC (permalink / raw)
  To: Benoit Lourdelet
  Cc: Eric W. Biederman, Stephen Hemminger, Serge Hallyn,
	netdev@vger.kernel.org
In-Reply-To: <CD7BAAE6.79D5%blourdel@juniper.net>

On Sat, 2013-03-30 at 10:09 +0000, Benoit Lourdelet wrote:
> Hello,
> 
> Here are my tests  of the last patches on 3 different platforms all
> running 3.8.5 :
> 
> Time are in seconds :
> 
> 8x 3.7Ghz virtual cores
> 
> # veth 	create 	delete
> 1000 	14 	18
> 2000 	39 	56
> 5000 	256 	161
> 10000 	1200 	399
> 
> 
> 8x 3.2Ghz virtual cores
> 
> # veth 	create 	delete
> 
> 1000 	19 	40
> 2000 	118 	66
> 5000 	305 	251
> 
> 
> 
> 32x 2Ghz virtual cores , 2 sockets
> # veth 	create 	delete
> 1000 	35 	86
> 
> 2000 	120 	90
> 
> 5000 	724 	245
> 
> 
> Compared to initial iproute2 performance on this 32 virtual core  system :
> 5000 1143  1185
> 
> 
> 
> "perf record" for creation of 5000 veth on the 32 core system :
> 
> # captured on: Fri Mar 29 14:03:35 2013
> # hostname : ieng-serv06
> # os release : 3.8.5
> # perf version : 3.8.5
> # arch : x86_64
> # nrcpus online : 32
> # nrcpus avail : 32
> # cpudesc : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
> # cpuid : GenuineIntel,6,45,7
> # total memory : 264124548 kB
> # cmdline : /usr/src/linux-3.8.5/tools/perf/perf record -a ./test3.script
> # event : name = cycles, type = 0, config = 0x0, config1 = 0x0, config2 =
> 0x0, excl_usr = 0, excl_kern = 0, excl_host = 0, excl_guest = 1,
> precise_ip = 0, id = { 36, 37, 38, 39, 40, 41, 42,
> # HEADER_CPU_TOPOLOGY info available, use -I to display
> # HEADER_NUMA_TOPOLOGY info available, use -I to display
> # pmu mappings: cpu = 4, software = 1, uncore_pcu = 15, tracepoint = 2,
> uncore_imc_0 = 17, uncore_imc_1 = 18, uncore_imc_2 = 19, uncore_imc_3 =
> 20, uncore_qpi_0 = 21, uncore_qpi_1 = 22, unco
> # ========
> #
> # Samples: 9M of event 'cycles'
> # Event count (approx.): 2894480238483
> #
> # Overhead          Command                  Shared Object
>                           Symbol
> # ........  ...............  .............................
> ...............................................
> #
>     15.17%             sudo  [kernel.kallsyms]              [k]
> snmp_fold_field    
>      5.94%             sudo  libc-2.15.so                   [.]
> 0x00000000000802cd 
>      5.64%             sudo  [kernel.kallsyms]              [k]
> find_next_bit      
>      3.21%             init  libnih.so.1.0.0                [.]
> nih_list_add_after 
>      2.12%          swapper  [kernel.kallsyms]              [k] intel_idle
>                    
>      1.94%             init  [kernel.kallsyms]              [k] page_fault
>                    
>      1.93%              sed  libc-2.15.so                   [.]
> 0x00000000000a1368 
>      1.93%             sudo  [kernel.kallsyms]              [k]
> rtnl_fill_ifinfo   
>      1.92%             sudo  [veth]                         [k]
> veth_get_stats64   
>      1.78%             sudo  [kernel.kallsyms]              [k] memcpy
>                    
>      1.53%          ifquery  libc-2.15.so                   [.]
> 0x000000000007f52b 
>      1.24%             init  libc-2.15.so                   [.]
> 0x000000000008918f 
>      1.05%             sudo  [kernel.kallsyms]              [k]
> inet6_fill_ifla6_attrs
>      0.98%             init  [kernel.kallsyms]              [k]
> copy_pte_range     
>      0.88%       irqbalance  libc-2.15.so                   [.]
> 0x00000000000802cd 
>      0.85%             sudo  [kernel.kallsyms]              [k] memset
>                    
>      0.72%              sed  ld-2.15.so                     [.]
> 0x000000000000a226 
>      0.68%          ifquery  ld-2.15.so                     [.]
> 0x00000000000165a0 
>      0.64%             init  libnih.so.1.0.0                [.]
> nih_tree_next_post_full
>      0.61%  bridge-network-  libc-2.15.so                   [.]
> 0x0000000000131e2a 
>      0.59%             init  [kernel.kallsyms]              [k] do_wp_page
>                    
>      0.59%          ifquery  [kernel.kallsyms]              [k] page_fault
>                    
>      0.54%              sed  [kernel.kallsyms]              [k] page_fault
>                    
> 
> 
> 
> 
> Regards
> 
> Benoit
> 
> 
> 
> 

This means lxc-start does the same thing than ip :

It fetches the whole device list.

You could strace it to have a confirmation.

^ permalink raw reply

* Re: [PATCH net] net: fq_codel: Fix off-by-one error
From: Eric Dumazet @ 2013-03-30 14:42 UTC (permalink / raw)
  To: Markus Trippelsdorf; +Cc: Vijay Subramanian, netdev, davem
In-Reply-To: <20130330065345.GA23938@x4>

On Sat, 2013-03-30 at 07:53 +0100, Markus Trippelsdorf wrote:
> On 2013.03.29 at 08:01 -0700, Eric Dumazet wrote:
> > 
> > Just curious, did you play changing the default limit (10240 packets) ?
> 
> I did some tests on my home router (running OpenWrt trunk) that is rate-
> limited with hfsc to the speed of the cable modem.
> 
> My tests seem to indicate that lowering the default limit to 1024
> packets results in much better latency behavior when using bittorrent.
> 
> With the default limit (10240 packets) I would get huge ping latencies
> from 600-1200ms when downloading e.g.:
> http://download.opensuse.org/distribution/12.3/iso/openSUSE-12.3-DVD-x86_64.iso.torrent
> with hundreds of peers.
> 
> Setting the limit to 1024 did get the latencies back in check (20-30ms
> with occasional spikes of ~100ms).

Hi Markus

I am very bored having to explain {fq_}codel principles each time
someone does this kind of experiments.

Codel principle is _allowing_ bursts, as long as the queue is
controlled. Read Codel paper for details. TCP can be slow to lower the
queues, it takes several RTT. So observing large queues is quite normal
in your case.

Bittorent uses its own rate limiting technique, defeating
current cwnd control done in the TCP stack, because of a very known
problem

( http://www.ietf.org/id/draft-ietf-tcpm-newcwv-00.txt )

So if your goal is reducing latencies for a _given_ class of flows, just
use prio + 3 fq_codel,  and classify your packets to make sure your
lovely ping packets are not dropped or behind long packets.

fq_codel by itself is not universal.

My question about fq_codel limit was related to something completely
different.

The default is 10240 packets. The logic behind is to control the queue
at dequeue, not enqueue. But we needed an safety limit to avoid OOM in
case rate of enqueue() is insane.

It could theoretically hurt a low end machine, in case a burst fills the
queue with big GSO packets. But then these low end machines should not
use GRO/TSO anyway (as this is against anti bufferbloat techniques)

Probably a better choice would have been to limit sum(skb->truesize), or
sum(skb->len) (aka current sch->qstats.backlog)

^ permalink raw reply

* Re: [PATCH] Fix IXP4xx coherent allocations
From: Krzysztof Halasa @ 2013-03-30 14:22 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Ben Hutchings, linux-arm-kernel, netdev, David Miller,
	linux-kernel, c.aeschlimann
In-Reply-To: <20130330132915.GB17995@n2100.arm.linux.org.uk>

Russell King - ARM Linux <linux@arm.linux.org.uk> writes:

> I'm having a hard time understanding what is an ARM issue here, what is
> an ARM bug, and what the DMA API requires.  The DMA API documentation
> is extremely sparse in describing what's required of the DMA masks,
> what these functions are supposed to do, and what determines whether
> a mask is "possible" or not.
>
> Moreover, I'm also having a hard time understanding what broke in 3.7,
> and why this fixes it.
>
> In other words, I'm completely failing to understand everything about
> this patch.

The ARM issue here is that the coherent DMA mask, by default, is zero
(i.e. coherent allocations are impossible by default UNLESS the device
pointer supplied is NULL - since DMA masks are bound to devices).
On most other platforms, the default DMA mask is 0xFFFFFFFF = (u32)-1
and this is also required by the DMA API.

In v3.7 (between v3.7-rc6 and v3.7-rc7), Xi Wang noticed that IXP4xx net
drivers call dma_pool_create() with NULL dev argument, and changed them
to use &port->netdev->dev. This broke coherent allocations since now the
device supplied to dma_pool_create() is not NULL and the (zero) mask
prevents coherent allocations. This means the Ethernet and HSS drivers
are since then unusable.

The first part of my patch makes dma_set_coherent_mask (IXP4xx-only
code) actually set the mask. This is needed as on IXP4xx we have (at
least) two different situations:

- PCI devices want "DMA zone" memory (26 bits = 64 MiB)
- built-in devices can use any RAM (32 bits = 4 GiB).

Without the patch, dma_set_coherent_mask only returns 0 or -EIO, it
doesn't change the actual coherent DMA mask and it's then impossible for
coherent allocators to differentiate between the above two cases.

+++ b/arch/arm/mach-ixp4xx/common-pci.c
@@ -454,10 +454,15 @@ int ixp4xx_setup(int nr, struct pci_sys_data *sys)

 int dma_set_coherent_mask(struct device *dev, u64 mask)
 {
-	if (mask >= SZ_64M - 1)
-		return 0;
+	if ((mask & DMA_BIT_MASK(26)) != DMA_BIT_MASK(26))
+		return -EIO;
+
+	/* PCI devices are limited to 64 MiB */
+	if (dev_is_pci(dev))
+		mask &= DMA_BIT_MASK(26); /* Use DMA region */

-	return -EIO;
+	dev->coherent_dma_mask = mask;
+	return 0;
 }

The second part of my patch sets the coherent DMA masks of the Ethernet
and HSS devices to 4 GiB (the value which should already be the
default). This also seems to be a recommended action.

+++ b/drivers/net/ethernet/xscale/ixp4xx_eth.c
@@ -1423,7 +1423,7 @@ static int eth_init_one(struct platform_device *pdev)
 	dev->ethtool_ops = &ixp4xx_ethtool_ops;
 	dev->tx_queue_len = 100;
-
+	dma_set_coherent_mask(&dev->dev, DMA_BIT_MASK(32));
 	netif_napi_add(dev, &port->napi, eth_poll, NAPI_WEIGHT);

+++ b/drivers/net/wan/ixp4xx_hss.c
@@ -1367,6 +1367,7 @@ static int hss_init_one(struct platform_device *pdev)
 	port->dev = &pdev->dev;
 	port->plat = pdev->dev.platform_data;
+	dma_set_coherent_mask(&dev->dev, DMA_BIT_MASK(32));
 	netif_napi_add(dev, &port->napi, hss_hdlc_poll, NAPI_WEIGHT);

-- 
Krzysztof Halasa

^ permalink raw reply

* Re: [PATCH] Fix IXP4xx coherent allocations
From: Russell King - ARM Linux @ 2013-03-30 13:29 UTC (permalink / raw)
  To: Krzysztof Halasa
  Cc: Ben Hutchings, linux-arm-kernel, netdev, David Miller,
	linux-kernel, c.aeschlimann
In-Reply-To: <m3r4j4sdmv.fsf@intrepid.localdomain>

On Sun, Mar 24, 2013 at 10:15:36PM +0100, Krzysztof Halasa wrote:
> The problem on ARM (and probably on powerpc, and on something called
> "metag" - grep -r 'coherent DMA mask is unset' arch) is that the default
> coherent DMA mask is zero. IOW, coherent DMA allocations are, by
> default, disabled. A driver has to dma_set_coherent_mask() or, as many
> drivers do, set dev->coherent_dma_mask directly (IMHO
> dev->coherent_dma_mask along with dev->dma_mask are private DMA API
> stuff and e.g. device drivers have no interest there).
> 
> The zero default is IMHO, WRT the actual DMA API, an ARM bug (and
> powerpc's etc). Nevertheless, the patch I posted does everything as
> required by the API. Specifically, the IXP4xx arch part makes
> IXP4xx's dma_set_coherent_mask() compliant with DMA API, and the actual
> dma_set_coherent_mask() calls in drivers are both valid and I guess
> recommended by the API.
> 
> The patch doesn't touch the core ARM issue, that's right.

I'm having a hard time understanding what is an ARM issue here, what is
an ARM bug, and what the DMA API requires.  The DMA API documentation
is extremely sparse in describing what's required of the DMA masks,
what these functions are supposed to do, and what determines whether
a mask is "possible" or not.

Moreover, I'm also having a hard time understanding what broke in 3.7,
and why this fixes it.

In other words, I'm completely failing to understand everything about
this patch.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox