Netdev List
 help / color / mirror / Atom feed
* [net-2.6 PATCH 2/2] netlink: bug fix: wrong size was calculated for vfinfo list blob
From: Scott Feldman @ 2010-05-28  7:15 UTC (permalink / raw)
  To: davem; +Cc: chrisw, netdev, kaber, arnd
In-Reply-To: <20100528071546.4058.1332.stgit@localhost.localdomain>

From: Scott Feldman <scofeldm@cisco.com>

The wrong size was being calculated for vfinfo.  In one case, it was over-
calculating using nlmsg_total_size on attrs, in another case, it was
under-calculating by assuming ifla_vf_* structs are packed together, but
each struct is it's own attr w/ hdr (and padding).

Signed-off-by: Scott Feldman <scofeldm@cisco.com>
---
 net/core/rtnetlink.c |   11 ++++++-----
 1 files changed, 6 insertions(+), 5 deletions(-)


diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 7331bb2..1a2af24 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -650,11 +650,12 @@ static inline int rtnl_vfinfo_size(const struct net_device *dev)
 	if (dev->dev.parent && dev_is_pci(dev->dev.parent)) {
 
 		int num_vfs = dev_num_vf(dev->dev.parent);
-		size_t size = nlmsg_total_size(sizeof(struct nlattr));
-		size += nlmsg_total_size(num_vfs * sizeof(struct nlattr));
-		size += num_vfs * (sizeof(struct ifla_vf_mac) +
-				  sizeof(struct ifla_vf_vlan) +
-				  sizeof(struct ifla_vf_tx_rate));
+		size_t size = nla_total_size(sizeof(struct nlattr));
+		size += nla_total_size(num_vfs * sizeof(struct nlattr));
+		size += num_vfs *
+			(nla_total_size(sizeof(struct ifla_vf_mac)) +
+			 nla_total_size(sizeof(struct ifla_vf_vlan)) +
+			 nla_total_size(sizeof(struct ifla_vf_tx_rate)));
 		return size;
 	} else
 		return 0;


^ permalink raw reply related

* Re: [PATCH net-next-2.6] net: replace hooks in __netif_receive_skb V3
From: Eric Dumazet @ 2010-05-28  7:02 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: netdev, davem, shemminger, kaber
In-Reply-To: <20100528061241.GC2823@psychotron.redhat.com>

Le vendredi 28 mai 2010 à 08:12 +0200, Jiri Pirko a écrit :

> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index a1bff65..f54b97d 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -254,6 +254,16 @@ struct netdev_hw_addr_list {
>  #define netdev_for_each_mc_addr(ha, dev) \
>  	netdev_hw_addr_list_for_each(ha, &(dev)->mc)
>  
> +
> +struct netdev_rx_handler {
> +	struct list_head	list;
> +	unsigned int		order;
> +#define NETDEV_RX_HANDLER_ORDER_BRIDGE	1
> +#define NETDEV_RX_HANDLER_ORDER_MACVLAN	2
> +	struct sk_buff		*(*callback)(struct sk_buff *skb);
> +	struct rcu_head		rcu;
> +};
> +
>  struct hh_cache {
>  	struct hh_cache *hh_next;	/* Next entry			     */
>  	atomic_t	hh_refcnt;	/* number of users                   */
> @@ -1031,6 +1041,9 @@ struct net_device {
>  	/* GARP */
>  	struct garp_port	*garp_port;
>  
> +	/* receive handlers (hooks) list */
> +	struct list_head	rx_handlers;
> +
>  	/* class/net/name entry */
>  	struct device		dev;
>  	/* space for optional device, statistics, and wireless sysfs groups */
> @@ -1685,6 +1698,11 @@ static inline void napi_free_frags(struct napi_struct *napi)
>  	napi->skb = NULL;
>  }
>  

Please chose another place to hold rx_handlers, to keep rx path memory
needs to minimum cache lines. Somewhere in the following section :

/*
 * Cache line mostly used on receive path (including eth_type_trans())
 */
        unsigned long           last_rx;        /* Time of last Rx      */
        /* Interface address info used in eth_type_trans() */
        unsigned char           *dev_addr;      /* hw address, (before bcast
                                                   because most packets are
                                                   unicast) */

        struct netdev_hw_addr_list      dev_addrs; /* list of device
                                                      hw addresses */

        unsigned char           broadcast[MAX_ADDR_LEN];        /* hw bcast add */

#ifdef CONFIG_RPS
        struct kset             *queues_kset;

        struct netdev_rx_queue  *_rx;

        /* Number of RX queues allocated at alloc_netdev_mq() time  */
        unsigned int            num_rx_queues;
#endif

        struct netdev_queue     rx_queue;


and before the :

	struct netdev_queue     *_tx ____cacheline_aligned_in_smp;




^ permalink raw reply

* Re: fn_trie_lookup and fn_hash_lookup
From: ratheesh k @ 2010-05-28  6:13 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, linux-net
In-Reply-To: <20100527204055.406343eb@nehalam>

On Fri, May 28, 2010 at 9:10 AM, Stephen Hemminger
<shemminger@vyatta.com> wrote:
> On Fri, 28 May 2010 08:53:11 +0530
> ratheesh k <ratheesh.ksz@gmail.com> wrote:
>
>> hi ,
>>
>> I was looking into 2.6.18 kernel . I can see two functions
>> fn_trie_lookup , fn_hash_lookup
>>
>> ipv4/fib_trie.c:      tb->tb_lookup = fn_trie_lookup;
>> ipv4/fib_hash.c:      tb->tb_lookup = fn_hash_lookup;
>>
>> What is the differnce between these two functions ?
>
> They are the two possible FIB algorithms configurable.
>
> net/ipv4/Kconfig:
>
> choice
>        prompt "Choose IP: FIB lookup algorithm (choose FIB_HASH if unsure)"
>        depends on IP_ADVANCED_ROUTER
>        default ASK_IP_FIB_HASH
>
> config ASK_IP_FIB_HASH
>        bool "FIB_HASH"
>        ---help---
>          Current FIB is very proven and good enough for most users.
>
> config IP_FIB_TRIE
>        bool "FIB_TRIE"
>        ---help---
>          Use new experimental LC-trie as FIB lookup algorithm.
>          This improves lookup performance if you have a large
>          number of routes.
>
>          LC-trie is a longest matching prefix lookup algorithm which
>          performs better than FIB_HASH for large routing tables.
>          But, it consumes more memory and is more complex.
>
>          LC-trie is described in:
>
>          IP-address lookup using LC-tries. Stefan Nilsson and Gunnar Karlsson
>          IEEE Journal on Selected Areas in Communications, 17(6):1083-1092,
>          June 1999
>
>          An experimental study of compression methods for dynamic tries
>          Stefan Nilsson and Matti Tikkanen. Algorithmica, 33(1):19-33, 2002.
>          http://www.nada.kth.se/~snilsson/public/papers/dyntrie2/
>
>
> Also see Documentation/networking/fib_trie.txt
>
>
>

Thanks ..

^ permalink raw reply

* [PATCH net-next-2.6] net: replace hooks in __netif_receive_skb V3
From: Jiri Pirko @ 2010-05-28  6:12 UTC (permalink / raw)
  To: netdev; +Cc: davem, shemminger, kaber, eric.dumazet
In-Reply-To: <20100528055154.GB2823@psychotron.redhat.com>

changelog:
v2->v3
	- used GPL-ed exports
v1->v2
	- writers are locked by rtnl_lock (removed unnecessary spinlock)
	- using call_rcu in unregister
	- nicer macvlan_port_create cleanup
	- struct rx_hanler is made const in funtion parameters

What this patch does is it removes two receive frame hooks (for bridge and for
macvlan) from __netif_receive_skb. These are replaced them with a general
list of rx_handlers which is iterated thru instead.

Then a network driver (of virtual netdev like macvlan or bridge) can register
an rx_handler for needed net device.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
---
 drivers/net/macvlan.c      |   27 ++++++--
 include/linux/if_bridge.h  |    2 -
 include/linux/if_macvlan.h |    4 -
 include/linux/netdevice.h  |   18 +++++
 net/bridge/br.c            |    2 -
 net/bridge/br_if.c         |   14 ++++
 net/bridge/br_input.c      |   12 +++-
 net/bridge/br_private.h    |    3 +-
 net/core/dev.c             |  160 ++++++++++++++++++++++++++------------------
 9 files changed, 160 insertions(+), 82 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 87e8d4c..f6b7924 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -145,15 +145,16 @@ static void macvlan_broadcast(struct sk_buff *skb,
 }
 
 /* called under rcu_read_lock() from netif_receive_skb */
-static struct sk_buff *macvlan_handle_frame(struct macvlan_port *port,
-					    struct sk_buff *skb)
+static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
 {
+	struct macvlan_port *port;
 	const struct ethhdr *eth = eth_hdr(skb);
 	const struct macvlan_dev *vlan;
 	const struct macvlan_dev *src;
 	struct net_device *dev;
 	unsigned int len;
 
+	port = rcu_dereference(skb->dev->macvlan_port);
 	if (is_multicast_ether_addr(eth->h_dest)) {
 		src = macvlan_hash_lookup(port, eth->h_source);
 		if (!src)
@@ -511,10 +512,16 @@ static void macvlan_setup(struct net_device *dev)
 	dev->tx_queue_len	= 0;
 }
 
+static const struct netdev_rx_handler macvlan_rx_handler = {
+	.order		= NETDEV_RX_HANDLER_ORDER_MACVLAN,
+	.callback	= macvlan_handle_frame,
+};
+
 static int macvlan_port_create(struct net_device *dev)
 {
 	struct macvlan_port *port;
 	unsigned int i;
+	int err;
 
 	if (dev->type != ARPHRD_ETHER || dev->flags & IFF_LOOPBACK)
 		return -EINVAL;
@@ -528,13 +535,26 @@ static int macvlan_port_create(struct net_device *dev)
 	for (i = 0; i < MACVLAN_HASH_SIZE; i++)
 		INIT_HLIST_HEAD(&port->vlan_hash[i]);
 	rcu_assign_pointer(dev->macvlan_port, port);
+
+	err = netdev_rx_handler_register(dev, &macvlan_rx_handler);
+	if (err)
+		goto cleanup;
+
 	return 0;
+
+cleanup:
+	rcu_assign_pointer(dev->macvlan_port, NULL);
+	synchronize_rcu();
+	kfree(port);
+
+	return err;
 }
 
 static void macvlan_port_destroy(struct net_device *dev)
 {
 	struct macvlan_port *port = dev->macvlan_port;
 
+	netdev_rx_handler_unregister(dev, &macvlan_rx_handler);
 	rcu_assign_pointer(dev->macvlan_port, NULL);
 	synchronize_rcu();
 	kfree(port);
@@ -767,14 +787,12 @@ static int __init macvlan_init_module(void)
 	int err;
 
 	register_netdevice_notifier(&macvlan_notifier_block);
-	macvlan_handle_frame_hook = macvlan_handle_frame;
 
 	err = macvlan_link_register(&macvlan_link_ops);
 	if (err < 0)
 		goto err1;
 	return 0;
 err1:
-	macvlan_handle_frame_hook = NULL;
 	unregister_netdevice_notifier(&macvlan_notifier_block);
 	return err;
 }
@@ -782,7 +800,6 @@ err1:
 static void __exit macvlan_cleanup_module(void)
 {
 	rtnl_link_unregister(&macvlan_link_ops);
-	macvlan_handle_frame_hook = NULL;
 	unregister_netdevice_notifier(&macvlan_notifier_block);
 }
 
diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
index 938b7e8..0d241a5 100644
--- a/include/linux/if_bridge.h
+++ b/include/linux/if_bridge.h
@@ -102,8 +102,6 @@ struct __fdb_entry {
 #include <linux/netdevice.h>
 
 extern void brioctl_set(int (*ioctl_hook)(struct net *, unsigned int, void __user *));
-extern struct sk_buff *(*br_handle_frame_hook)(struct net_bridge_port *p,
-					       struct sk_buff *skb);
 extern int (*br_should_route_hook)(struct sk_buff *skb);
 
 #endif
diff --git a/include/linux/if_macvlan.h b/include/linux/if_macvlan.h
index 9ea047a..c26a0e4 100644
--- a/include/linux/if_macvlan.h
+++ b/include/linux/if_macvlan.h
@@ -84,8 +84,4 @@ extern int macvlan_link_register(struct rtnl_link_ops *ops);
 extern netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
 				      struct net_device *dev);
 
-
-extern struct sk_buff *(*macvlan_handle_frame_hook)(struct macvlan_port *,
-						    struct sk_buff *);
-
 #endif /* _LINUX_IF_MACVLAN_H */
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index a1bff65..f54b97d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -254,6 +254,16 @@ struct netdev_hw_addr_list {
 #define netdev_for_each_mc_addr(ha, dev) \
 	netdev_hw_addr_list_for_each(ha, &(dev)->mc)
 
+
+struct netdev_rx_handler {
+	struct list_head	list;
+	unsigned int		order;
+#define NETDEV_RX_HANDLER_ORDER_BRIDGE	1
+#define NETDEV_RX_HANDLER_ORDER_MACVLAN	2
+	struct sk_buff		*(*callback)(struct sk_buff *skb);
+	struct rcu_head		rcu;
+};
+
 struct hh_cache {
 	struct hh_cache *hh_next;	/* Next entry			     */
 	atomic_t	hh_refcnt;	/* number of users                   */
@@ -1031,6 +1041,9 @@ struct net_device {
 	/* GARP */
 	struct garp_port	*garp_port;
 
+	/* receive handlers (hooks) list */
+	struct list_head	rx_handlers;
+
 	/* class/net/name entry */
 	struct device		dev;
 	/* space for optional device, statistics, and wireless sysfs groups */
@@ -1685,6 +1698,11 @@ static inline void napi_free_frags(struct napi_struct *napi)
 	napi->skb = NULL;
 }
 
+extern int netdev_rx_handler_register(struct net_device *dev,
+				      const struct netdev_rx_handler *rh);
+extern void netdev_rx_handler_unregister(struct net_device *dev,
+					 const struct netdev_rx_handler *rh);
+
 extern void		netif_nit_deliver(struct sk_buff *skb);
 extern int		dev_valid_name(const char *name);
 extern int		dev_ioctl(struct net *net, unsigned int cmd, void __user *);
diff --git a/net/bridge/br.c b/net/bridge/br.c
index 76357b5..c8436fa 100644
--- a/net/bridge/br.c
+++ b/net/bridge/br.c
@@ -63,7 +63,6 @@ static int __init br_init(void)
 		goto err_out4;
 
 	brioctl_set(br_ioctl_deviceless_stub);
-	br_handle_frame_hook = br_handle_frame;
 
 #if defined(CONFIG_ATM_LANE) || defined(CONFIG_ATM_LANE_MODULE)
 	br_fdb_test_addr_hook = br_fdb_test_addr;
@@ -100,7 +99,6 @@ static void __exit br_deinit(void)
 	br_fdb_test_addr_hook = NULL;
 #endif
 
-	br_handle_frame_hook = NULL;
 	br_fdb_fini();
 }
 
diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index 18b245e..45270e5 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -119,6 +119,11 @@ static void destroy_nbp_rcu(struct rcu_head *head)
 	destroy_nbp(p);
 }
 
+static const struct netdev_rx_handler br_rx_handler = {
+	.order		= NETDEV_RX_HANDLER_ORDER_BRIDGE,
+	.callback	= br_handle_frame,
+};
+
 /* Delete port(interface) from bridge is done in two steps.
  * via RCU. First step, marks device as down. That deletes
  * all the timers and stops new packets from flowing through.
@@ -147,6 +152,7 @@ static void del_nbp(struct net_bridge_port *p)
 
 	list_del_rcu(&p->list);
 
+	netdev_rx_handler_unregister(dev, &br_rx_handler);
 	rcu_assign_pointer(dev->br_port, NULL);
 
 	br_multicast_del_port(p);
@@ -429,6 +435,11 @@ int br_add_if(struct net_bridge *br, struct net_device *dev)
 		goto err2;
 
 	rcu_assign_pointer(dev->br_port, p);
+
+	err = netdev_rx_handler_register(dev, &br_rx_handler);
+	if (err)
+		goto err3;
+
 	dev_disable_lro(dev);
 
 	list_add_rcu(&p->list, &br->port_list);
@@ -451,6 +462,9 @@ int br_add_if(struct net_bridge *br, struct net_device *dev)
 	br_netpoll_enable(br, dev);
 
 	return 0;
+err3:
+	rcu_assign_pointer(dev->br_port, NULL);
+	synchronize_rcu();
 err2:
 	br_fdb_delete_by_port(br, p, 1);
 err1:
diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index d36e700..99647d8 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -131,15 +131,19 @@ static inline int is_link_local(const unsigned char *dest)
 }
 
 /*
- * Called via br_handle_frame_hook.
  * Return NULL if skb is handled
- * note: already called with rcu_read_lock (preempt_disabled)
+ * note: already called with rcu_read_lock (preempt_disabled) from
+ * netif_receive_skb
  */
-struct sk_buff *br_handle_frame(struct net_bridge_port *p, struct sk_buff *skb)
+struct sk_buff *br_handle_frame(struct sk_buff *skb)
 {
+	struct net_bridge_port *p;
 	const unsigned char *dest = eth_hdr(skb)->h_dest;
 	int (*rhook)(struct sk_buff *skb);
 
+	if (skb->pkt_type == PACKET_LOOPBACK)
+		return skb;
+
 	if (!is_valid_ether_addr(eth_hdr(skb)->h_source))
 		goto drop;
 
@@ -147,6 +151,8 @@ struct sk_buff *br_handle_frame(struct net_bridge_port *p, struct sk_buff *skb)
 	if (!skb)
 		return NULL;
 
+	p = rcu_dereference(skb->dev->br_port);
+
 	if (unlikely(is_link_local(dest))) {
 		/* Pause frames shouldn't be passed up by driver anyway */
 		if (skb->protocol == htons(ETH_P_PAUSE))
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 0f4a74b..c83519b 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -331,8 +331,7 @@ extern void br_features_recompute(struct net_bridge *br);
 
 /* br_input.c */
 extern int br_handle_frame_finish(struct sk_buff *skb);
-extern struct sk_buff *br_handle_frame(struct net_bridge_port *p,
-				       struct sk_buff *skb);
+extern struct sk_buff *br_handle_frame(struct sk_buff *skb);
 
 /* br_ioctl.c */
 extern int br_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd);
diff --git a/net/core/dev.c b/net/core/dev.c
index 6c82065..d2ab71a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2585,70 +2585,14 @@ static inline int deliver_skb(struct sk_buff *skb,
 	return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
 }
 
-#if defined(CONFIG_BRIDGE) || defined (CONFIG_BRIDGE_MODULE)
-
-#if defined(CONFIG_ATM_LANE) || defined(CONFIG_ATM_LANE_MODULE)
+#if (defined(CONFIG_BRIDGE) || defined(CONFIG_BRIDGE_MODULE)) && \
+    (defined(CONFIG_ATM_LANE) || defined(CONFIG_ATM_LANE_MODULE))
 /* This hook is defined here for ATM LANE */
 int (*br_fdb_test_addr_hook)(struct net_device *dev,
 			     unsigned char *addr) __read_mostly;
 EXPORT_SYMBOL_GPL(br_fdb_test_addr_hook);
 #endif
 
-/*
- * If bridge module is loaded call bridging hook.
- *  returns NULL if packet was consumed.
- */
-struct sk_buff *(*br_handle_frame_hook)(struct net_bridge_port *p,
-					struct sk_buff *skb) __read_mostly;
-EXPORT_SYMBOL_GPL(br_handle_frame_hook);
-
-static inline struct sk_buff *handle_bridge(struct sk_buff *skb,
-					    struct packet_type **pt_prev, int *ret,
-					    struct net_device *orig_dev)
-{
-	struct net_bridge_port *port;
-
-	if (skb->pkt_type == PACKET_LOOPBACK ||
-	    (port = rcu_dereference(skb->dev->br_port)) == NULL)
-		return skb;
-
-	if (*pt_prev) {
-		*ret = deliver_skb(skb, *pt_prev, orig_dev);
-		*pt_prev = NULL;
-	}
-
-	return br_handle_frame_hook(port, skb);
-}
-#else
-#define handle_bridge(skb, pt_prev, ret, orig_dev)	(skb)
-#endif
-
-#if defined(CONFIG_MACVLAN) || defined(CONFIG_MACVLAN_MODULE)
-struct sk_buff *(*macvlan_handle_frame_hook)(struct macvlan_port *p,
-					     struct sk_buff *skb) __read_mostly;
-EXPORT_SYMBOL_GPL(macvlan_handle_frame_hook);
-
-static inline struct sk_buff *handle_macvlan(struct sk_buff *skb,
-					     struct packet_type **pt_prev,
-					     int *ret,
-					     struct net_device *orig_dev)
-{
-	struct macvlan_port *port;
-
-	port = rcu_dereference(skb->dev->macvlan_port);
-	if (!port)
-		return skb;
-
-	if (*pt_prev) {
-		*ret = deliver_skb(skb, *pt_prev, orig_dev);
-		*pt_prev = NULL;
-	}
-	return macvlan_handle_frame_hook(port, skb);
-}
-#else
-#define handle_macvlan(skb, pt_prev, ret, orig_dev)	(skb)
-#endif
-
 #ifdef CONFIG_NET_CLS_ACT
 /* TODO: Maybe we should just force sch_ingress to be compiled in
  * when CONFIG_NET_CLS_ACT is? otherwise some useless instructions
@@ -2744,6 +2688,85 @@ void netif_nit_deliver(struct sk_buff *skb)
 	rcu_read_unlock();
 }
 
+static bool rx_handlers_equal(const struct netdev_rx_handler *rh1,
+			      const struct netdev_rx_handler *rh2)
+{
+	return (rh1->order == rh2->order) && (rh1->callback == rh2->callback);
+}
+
+/**
+ *	netdev_rx_handler_register - register receive handler
+ *	@dev: device to register a handler for
+ *	@rh: receive handler to register
+ *
+ *	Register a receive hander for a device. This handler will then be
+ *	called from __netif_receive_skb. A negative errno code is returned
+ *	on a failure.
+ *
+ *	The caller must hold the rtnl_mutex.
+ */
+int netdev_rx_handler_register(struct net_device *dev,
+			       const struct netdev_rx_handler *rh)
+{
+	struct list_head *iter, *add_after;
+	struct netdev_rx_handler *rh1;
+	int err = 0;
+
+	ASSERT_RTNL();
+	add_after = &dev->rx_handlers;
+	list_for_each(iter, &dev->rx_handlers) {
+		rh1 = list_entry(iter, struct netdev_rx_handler, list);
+		if (rx_handlers_equal(rh, rh1))
+			return -EEXIST;
+		if (rh1->order > rh->order)
+			break;
+		add_after = iter;
+	}
+	rh1 = kzalloc(sizeof(*rh), GFP_KERNEL);
+	if (!rh1)
+		return -ENOMEM;
+
+	rh1->order = rh->order;
+	rh1->callback = rh->callback;
+	list_add_rcu(&rh1->list, add_after);
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(netdev_rx_handler_register);
+
+static void rx_handler_rcu_free(struct rcu_head *head)
+{
+	struct netdev_rx_handler *rh;
+
+	rh = container_of(head, struct netdev_rx_handler, rcu);
+	kfree(rh);
+}
+
+/**
+ *	netdev_rx_handler_unregister - unregister receive handler
+ *	@dev: device to unregister a handler from
+ *	@rh: receive handler to unregister
+ *
+ *	Unregister a receive hander from a device.
+ *
+ *	The caller must hold the rtnl_mutex.
+ */
+void netdev_rx_handler_unregister(struct net_device *dev,
+				  const struct netdev_rx_handler *rh)
+{
+	struct netdev_rx_handler *rh1;
+
+	ASSERT_RTNL();
+	list_for_each_entry(rh1, &dev->rx_handlers, list) {
+		if (rx_handlers_equal(rh, rh1)) {
+			list_del_rcu(&rh1->list);
+			call_rcu(&rh1->rcu, rx_handler_rcu_free);
+			return;
+		}
+	}
+}
+EXPORT_SYMBOL_GPL(netdev_rx_handler_unregister);
+
 static inline void skb_bond_set_mac_by_master(struct sk_buff *skb,
 					      struct net_device *master)
 {
@@ -2796,6 +2819,7 @@ EXPORT_SYMBOL(__skb_bond_should_drop);
 static int __netif_receive_skb(struct sk_buff *skb)
 {
 	struct packet_type *ptype, *pt_prev;
+	struct netdev_rx_handler *rh;
 	struct net_device *orig_dev;
 	struct net_device *master;
 	struct net_device *null_or_orig;
@@ -2859,12 +2883,18 @@ static int __netif_receive_skb(struct sk_buff *skb)
 ncls:
 #endif
 
-	skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
-	if (!skb)
-		goto out;
-	skb = handle_macvlan(skb, &pt_prev, &ret, orig_dev);
-	if (!skb)
-		goto out;
+	/*
+	 * Go through various rx handlers, like bridge, macvlan etc.
+	 */
+	list_for_each_entry_rcu(rh, &skb->dev->rx_handlers, list) {
+		if (pt_prev) {
+			ret = deliver_skb(skb, pt_prev, orig_dev);
+			pt_prev = NULL;
+		}
+		skb = rh->callback(skb);
+		if (!skb)
+			goto out;
+	}
 
 	/*
 	 * Make sure frames received on VLAN interfaces stacked on
@@ -5371,6 +5401,8 @@ struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
 	dev_mc_init(dev);
 	dev_uc_init(dev);
 
+	INIT_LIST_HEAD(&dev->rx_handlers);
+
 	dev_net_set(dev, &init_net);
 
 	dev->_tx = tx;
-- 
1.6.6.1


^ permalink raw reply related

* Re: [PATCH net-next-2.6] net: replace hooks in __netif_receive_skb V2
From: Jiri Pirko @ 2010-05-28  5:51 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, davem, kaber, eric.dumazet
In-Reply-To: <20100527130822.02cb1661@nehalam>

Thu, May 27, 2010 at 10:08:22PM CEST, shemminger@vyatta.com wrote:
>On Thu, 27 May 2010 20:08:24 +0200
>Jiri Pirko <jpirko@redhat.com> wrote:
>
>> changelog:
>> v1->v2
>> 	- writers are locked by rtnl_lock (removed unnecessary spinlock)
>> 	- using call_rcu in unregister
>> 	- nicer macvlan_port_create cleanup
>> 	- struct rx_hanler is made const in funtion parameters
>> 
>> What this patch does is it removes two receive frame hooks (for bridge and for
>> macvlan) from __netif_receive_skb. These are replaced them with a general
>> list of rx_handlers which is iterated thru instead.
>> 
>> Then a network driver (of virtual netdev like macvlan or bridge) can register
>> an rx_handler for needed net device.
>> 
>> Signed-off-by: Jiri Pirko <jpirko@redhat.com>
>
>I wonder if we really need the chaining.  What about allowing only one
>hook per device (and return -EBUSY)?  That also gets rid of the allocation,
>and the extra overhead.

Hmm, that's a good question. I thought about it already. But I'm also wondering
if there is a possible scenario, when the chaining can be necessary. Also I do
not see any -significant- downside of having multiple handlers in a list, so I
would probably go this way now. Opinions?

>
>The handler hook, has to be export GPL, since we really don't want more
>network stack abuse by 3rd parties.

Noted, will be in the next patch version.

Jirka

^ permalink raw reply

* Re: [PATCH 2/2] netdev/fec: fix ifconfig eth0 down hang issue
From: Bryan Wu @ 2010-05-28  5:26 UTC (permalink / raw)
  To: afleming, davem
  Cc: Sascha Hauer, Greg Ungerer, Amit Kucheria, netdev, linux-kernel
In-Reply-To: <1273199239-11057-3-git-send-email-bryan.wu@canonical.com>

Since this patch is for another bug and doesn't depend on my first one patch.
Could you please review this?

Thanks,
-Bryan

On 05/07/2010 10:27 AM, Bryan Wu wrote:
> BugLink: http://bugs.launchpad.net/bugs/559065
> 
> In fec open/close function, we need to use phy_connect and phy_disconnect
> operation before we start/stop phy. Otherwise it will cause system hang.
> 
> Only call fec_enet_mii_probe() in open function, because the first open
> action will cause NULL pointer error.
> 
> Signed-off-by: Bryan Wu <bryan.wu@canonical.com>
> ---
>  drivers/net/fec.c |   28 ++++++++++++++++------------
>  1 files changed, 16 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/net/fec.c b/drivers/net/fec.c
> index 9c58f6b..af4243f 100644
> --- a/drivers/net/fec.c
> +++ b/drivers/net/fec.c
> @@ -678,6 +678,8 @@ static int fec_enet_mii_probe(struct net_device *dev)
>  	struct phy_device *phy_dev = NULL;
>  	int phy_addr;
>  
> +	fep->phy_dev = NULL;
> +
>  	/* find the first phy */
>  	for (phy_addr = 0; phy_addr < PHY_MAX_ADDR; phy_addr++) {
>  		if (fep->mii_bus->phy_map[phy_addr]) {
> @@ -708,6 +710,11 @@ static int fec_enet_mii_probe(struct net_device *dev)
>  	fep->link = 0;
>  	fep->full_duplex = 0;
>  
> +	printk(KERN_INFO "%s: Freescale FEC PHY driver [%s] "
> +		"(mii_bus:phy_addr=%s, irq=%d)\n", dev->name,
> +		fep->phy_dev->drv->name, dev_name(&fep->phy_dev->dev),
> +		fep->phy_dev->irq);
> +
>  	return 0;
>  }
>  
> @@ -753,13 +760,8 @@ static int fec_enet_mii_init(struct platform_device *pdev)
>  	if (mdiobus_register(fep->mii_bus))
>  		goto err_out_free_mdio_irq;
>  
> -	if (fec_enet_mii_probe(dev) != 0)
> -		goto err_out_unregister_bus;
> -
>  	return 0;
>  
> -err_out_unregister_bus:
> -	mdiobus_unregister(fep->mii_bus);
>  err_out_free_mdio_irq:
>  	kfree(fep->mii_bus->irq);
>  err_out_free_mdiobus:
> @@ -912,7 +914,12 @@ fec_enet_open(struct net_device *dev)
>  	if (ret)
>  		return ret;
>  
> -	/* schedule a link state check */
> +	/* Probe and connect to PHY when open the interface */
> +	ret = fec_enet_mii_probe(dev);
> +	if (ret) {
> +		fec_enet_free_buffers(dev);
> +		return ret;
> +	}
>  	phy_start(fep->phy_dev);
>  	netif_start_queue(dev);
>  	fep->opened = 1;
> @@ -926,10 +933,12 @@ fec_enet_close(struct net_device *dev)
>  
>  	/* Don't know what to do yet. */
>  	fep->opened = 0;
> -	phy_stop(fep->phy_dev);
>  	netif_stop_queue(dev);
>  	fec_stop(dev);
>  
> +	if (fep->phy_dev)
> +		phy_disconnect(fep->phy_dev);
> +
>          fec_enet_free_buffers(dev);
>  
>  	return 0;
> @@ -1293,11 +1302,6 @@ fec_probe(struct platform_device *pdev)
>  	if (ret)
>  		goto failed_register;
>  
> -	printk(KERN_INFO "%s: Freescale FEC PHY driver [%s] "
> -		"(mii_bus:phy_addr=%s, irq=%d)\n", ndev->name,
> -		fep->phy_dev->drv->name, dev_name(&fep->phy_dev->dev),
> -		fep->phy_dev->irq);
> -
>  	return 0;
>  
>  failed_register:

^ permalink raw reply

* Re: [RFC] netfilter: WIP: Xtables idletimer target implementation
From: Luciano Coelho @ 2010-05-28  5:25 UTC (permalink / raw)
  To: ext Jan Engelhardt
  Cc: netfilter-devel@vger.kernel.org, netdev@vger.kernel.org,
	kaber@trash.net, Timo Teras
In-Reply-To: <alpine.LSU.2.01.1005280057190.14077@obet.zrqbmnf.qr>

Hi Jan,

Thanks a lot for your comments!

On Fri, 2010-05-28 at 01:17 +0200, ext Jan Engelhardt wrote:
> On Thursday 2010-05-27 22:54, Luciano Coelho wrote:
> >There are still some issues to be resolved:
> >
> >How to treat several rules for the same interface?
> >
> >We need a key for the timer list.  I'm using the targinfo pointer for that,
> >but this looks shaky, because there is no assurance that this pointer will
> >live for the entire lifetime of the rule.
> 
> Well xt_quota for example has its targinfo around at all times,
> as have other modules.

Great, so this will work.  I had checked the x_tables code and it seemed
that the lifetime of targinfo was sufficient, but I was not sure this
would be safe in the future.  Now, if this changes in the future, my
module won't be the only one to break ;)


> >This is now an x_tables target, so other protocols need to be implemented.
> 
> Huh? You're not looking at packets, so why does it need proto-specific
> stuff?

I need to associate the timers with specific interfaces, because it's
the idle time of the interfaces that the userspace in interested in.  I
didn't find any other way to associate the timers with them, except by
looking at the iniface and outiface values in ipt_ip (and eventually,
with IPv6 properly implemented, in ip6t_ip6).  This is what Patrick
suggested when he checked my previous patch [1] and triggered me to do a
major rework on my module ;)

Do you have any other suggestion on how I can associate the rules to
specific interfaces?


> >+static
> >+struct utimer_t *__utimer_find_by_info(const struct xt_idletimer_info *info)
> >+{
> >+	struct utimer_t *entry;
> >+
> >+	list_for_each_entry(entry, &active_utimer_head, entry) {
> >+		if (info == entry->info) {
> >+			return entry;
> >+		}
> >+	}
> >+
> >+	return NULL;
> >+}
> 
> Can do with less braces.

Sure.  These remained there after I removed some traces.  I'll clean
this up.


> >+static void utimer_expired(unsigned long data)
> >+{
> >+	struct utimer_t *timer = (struct utimer_t *) data;
> >+
> >+	pr_debug("xt_idletimer: timer '%s' ns %p expired\n",
> >+		 timer->name, timer->net);
> >+
> >+	schedule_work(&timer->work);
> >+}
> 
> You don't need xt_idletimer, because pr_debug already prints
> that (with #define pr_fmt(fmt) KBUILD_MODNAME ": " as many
> other modules do)

Ok, I'll change it.  Thanks for pointing out.


> >+
> >+static struct utimer_t *utimer_create(const char *name,
> >+				      struct net *net,
> >+				      const struct xt_idletimer_info *info)
> >+{
> >+	struct utimer_t *timer;
> >+
> >+	timer = kmalloc(sizeof(struct utimer_t), GFP_ATOMIC);
> >+	if (timer == NULL)
> >+		return NULL;
> 
> This is called from user context, so GFP_KERNEL will perfectly suffice.

Yup, will change.


> >+static int xt_idletimer_checkentry(const struct xt_tgchk_param *par)
> >+{
> >+	const struct xt_idletimer_info *info = par->targinfo;
> >+	const struct ipt_entry *entryinfo = par->entryinfo;
> >+	const struct ipt_ip *ip = &entryinfo->ip;
> 
> I'm not sure spying on ipt_ip is a long-term viable solution.

Do you have any other suggestions on how I could get an interface
associated with the rule? I thought about having the userspace pass the
interface as an option to the rule (like I already do for the timeout
value), but that looked ugly to me, since the interface can already be
defined as part of the ruleset.


> >+	pr_debug("xt_idletimer: checkentry targinfo %p\n", par->targinfo);
> >+
> >+	if (info->timeout == 0) {
> >+		pr_debug("xt_idletimer: timeout value is zero\n");
> >+		return -EINVAL;
> >+	}
> >+
> >+	/* FIXME: implement support for other protocol families */
> >+	switch (par->family) {
> >+	case NFPROTO_IPV4   :
> >+		pr_debug("xt_idletimer: NFPROTO_IPV4\n");
> >+		break;
> >+
> >+	default:
> >+		pr_debug("xt_idletimer: Unsupported protocol family family!\n");
> >+		return -EINVAL;
> >+	}
> >+
> >+	if (strlen(ip->iniface) == 0 && strlen(ip->outiface) == 0) {
> >+		pr_debug("xt_idletimer: in or out interface must "
> >+			 "be specified\n");
> >+		return -EINVAL;
> >+	}
> >+
> >+	if (utimer_create(ip->iniface, par->net, info) == NULL) {
> >+		pr_debug("xt_idletimer: failed to create timer\n");
> >+		return -ENOMEM;
> >+	}
> 
> What about outiface?
> What blows up when iniface is empty?

Ooops! I've redone this part of the code so many times and in this
version I completely forgot to include the outiface.  I'll fix it.


> >+static void xt_idletimer_destroy(const struct xt_tgdtor_param *par)
> >+{
> >+	const struct xt_idletimer_info *info = par->targinfo;
> >+
> >+	pr_debug("xt_idletimer: destroy targinfo %p\n", par->targinfo);
> >+
> >+	utimer_delete(info);
> >+}
> >+
> >+static int __init init(void)
> >+{
> >+	int ret;
> >+
> >+	ret = utimer_init();
> >+	if (ret)
> >+		return ret;
> >+
> >+	ret =  xt_register_target(&xt_idletimer);
> >+	if (ret < 0) {
> >+		utimer_fini();
> >+		return ret;
> >+	}
> >+
> >+	return 0;
> >+}
> >+
> >+static void __exit fini(void)
> >+{
> >+	xt_unregister_target(&xt_idletimer);
> >+	utimer_fini();
> >+}
> >+
> >+module_init(init);
> >+module_exit(fini);
> 
> Call it just exit?

Yes.


> Also give the functions better names (see other modules), that is going
> to be unrecognizable in stacktraces otherwise.

I agree.  These names are coming from the original code.  I thought
about changing them to something clearer, but didn't bother to do it
yet, because I was focusing on the actual functionality.  I'll change
the names.

Again, thanks for your comments.  I'll rework and submit v2 soon.

Ah, and please excuse my lameness of mistyping the netdev email address
when I submitted the patch.  I fixed it now.

[1]
http://article.gmane.org/gmane.comp.security.firewalls.netfilter.devel/33934


-- 
Cheers,
Luca.


^ permalink raw reply

* Re: [PATCH 1/2] netdev/fec: fix performance impact from mdio poll operation
From: Bryan Wu @ 2010-05-28  5:23 UTC (permalink / raw)
  To: David Miller; +Cc: s.hauer, gerg, amit.kucheria, netdev, linux-kernel
In-Reply-To: <20100516.002818.133869940.davem@davemloft.net>

On 05/16/2010 03:28 PM, David Miller wrote:
> From: Bryan Wu <bryan.wu@canonical.com>
> Date: Fri,  7 May 2010 10:27:18 +0800
> 
>> BugLink: http://bugs.launchpad.net/bugs/546649
>> BugLink: http://bugs.launchpad.net/bugs/457878
>>
>> After introducing phylib supporting, users experienced performace drop. That is
>> because of the mdio polling operation of phylib. Use msleep to replace the busy
>> waiting cpu_relax() and remove the warning message.
>>
>> Signed-off-by: Bryan Wu <bryan.wu@canonical.com>
>> Acked-by: Andy Whitcroft <apw@canonical.com>
> 
> As you've already been told, making these MDIO interfaces fail silently
> is not acceptable.
> 
> Please fix this bug properly and resubmit this patch series.
> 

So sorry for the delay. My board's broken for several weeks. After I got a new,
I will try to fix this and resubmit this patch.

Thanks
-- 
Bryan Wu <bryan.wu@canonical.com>
Kernel Developer    +86.138-1617-6545 Mobile
Ubuntu Kernel Team | Hardware Enablement Team
Canonical Ltd.      www.canonical.com
Ubuntu - Linux for human beings | www.ubuntu.com

^ permalink raw reply

* Re: fn_trie_lookup and fn_hash_lookup
From: Stephen Hemminger @ 2010-05-28  3:40 UTC (permalink / raw)
  To: ratheesh k; +Cc: netdev, linux-net
In-Reply-To: <AANLkTikmIzPaOKdlXtCwosfJXzaKY1vxib_MdqI0vN_5@mail.gmail.com>

On Fri, 28 May 2010 08:53:11 +0530
ratheesh k <ratheesh.ksz@gmail.com> wrote:

> hi ,
> 
> I was looking into 2.6.18 kernel . I can see two functions
> fn_trie_lookup , fn_hash_lookup
> 
> ipv4/fib_trie.c:	tb->tb_lookup = fn_trie_lookup;
> ipv4/fib_hash.c:	tb->tb_lookup = fn_hash_lookup;
> 
> What is the differnce between these two functions ?

They are the two possible FIB algorithms configurable.

net/ipv4/Kconfig:

choice
	prompt "Choose IP: FIB lookup algorithm (choose FIB_HASH if unsure)"
	depends on IP_ADVANCED_ROUTER
	default ASK_IP_FIB_HASH

config ASK_IP_FIB_HASH
	bool "FIB_HASH"
	---help---
	  Current FIB is very proven and good enough for most users.

config IP_FIB_TRIE
	bool "FIB_TRIE"
	---help---
	  Use new experimental LC-trie as FIB lookup algorithm.
	  This improves lookup performance if you have a large
	  number of routes.

	  LC-trie is a longest matching prefix lookup algorithm which
	  performs better than FIB_HASH for large routing tables.
	  But, it consumes more memory and is more complex.

	  LC-trie is described in:

	  IP-address lookup using LC-tries. Stefan Nilsson and Gunnar Karlsson
	  IEEE Journal on Selected Areas in Communications, 17(6):1083-1092,
	  June 1999

	  An experimental study of compression methods for dynamic tries
	  Stefan Nilsson and Matti Tikkanen. Algorithmica, 33(1):19-33, 2002.
	  http://www.nada.kth.se/~snilsson/public/papers/dyntrie2/


Also see Documentation/networking/fib_trie.txt



^ permalink raw reply

* Re: [net-next-2.6 V9 PATCH 1/2] Add netlink support for virtual port management (was iovnl)
From: Scott Feldman @ 2010-05-28  3:24 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: davem, netdev, chrisw, arnd
In-Reply-To: <4BF54616.1020508@trash.net>

On 5/20/10 7:24 AM, "Patrick McHardy" <kaber@trash.net> wrote:

> Scott Feldman wrote:
>> +static int rtnl_vf_ports_fill(struct sk_buff *skb, struct net_device *dev)
>> +{
>> + struct nlattr *vf_ports;
>> + struct nlattr *vf_port;
>> + int vf;
>> + int err;
>> +
>> + vf_ports = nla_nest_start(skb, IFLA_VF_PORTS);
>> + if (!vf_ports)
>> +  return -EMSGSIZE;
>> +
>> + for (vf = 0; vf < dev_num_vf(dev->dev.parent); vf++) {
>> +  vf_port = nla_nest_start(skb, IFLA_VF_PORT);
>> +  if (!vf_port) {
>> +   nla_nest_cancel(skb, vf_ports);
>> +   return -EMSGSIZE;
>> +  }
>> +  NLA_PUT_U32(skb, IFLA_PORT_VF, vf);
>> +  err = dev->netdev_ops->ndo_get_vf_port(dev, vf, skb);
>> +  if (err) {
>> +nla_put_failure:
>> +   nla_nest_cancel(skb, vf_port);
>> +   continue;
> 
> Are you sure you want to continue here? During a full dump
> this means that the last VF port fitting in the skb will most
> likely be incomplete since the higher layer code won't
> notice that the size was exceeded and the entry should be
> dumped into another skb.

Drats, that's a bug.  I'll get that fixed.  Thanks Patrick for catching it.

-scott


^ permalink raw reply

* fn_trie_lookup and fn_hash_lookup
From: ratheesh k @ 2010-05-28  3:23 UTC (permalink / raw)
  To: netdev, linux-net

hi ,

I was looking into 2.6.18 kernel . I can see two functions
fn_trie_lookup , fn_hash_lookup

ipv4/fib_trie.c:	tb->tb_lookup = fn_trie_lookup;
ipv4/fib_hash.c:	tb->tb_lookup = fn_hash_lookup;

What is the differnce between these two functions ?


Thanks,
Ratheesh

^ permalink raw reply

* Re: [v5 Patch 1/3] netpoll: add generic support for bridge and bonding devices
From: Cong Wang @ 2010-05-28  2:47 UTC (permalink / raw)
  To: Flavio Leitner
  Cc: linux-kernel, Matt Mackall, netdev, bridge, Andy Gospodarek,
	Neil Horman, Jeff Moyer, Stephen Hemminger, bonding-devel,
	Jay Vosburgh, David Miller
In-Reply-To: <20100527180545.GA2345@sysclose.org>

On 05/28/10 02:05, Flavio Leitner wrote:
>
> Hi guys!
>
> I finally could test this to see if an old problem reported on bugzilla[1] was
> fixed now, but unfortunately it is still there.
>
> The ticket is private I guess, but basically the problem happens when bonding
> driver tries to print something after it had taken the write_lock (monitor
> functions, enslave/de-enslave), so the printk() will pass through netpoll, then
> on bonding again which no matter what mode you use, it will try to read_lock()
> the lock again. The result is a deadlock and the entire system hangs.


This is true, I already fixed some similar issues.

>
> I manage to get a fresh backtrace with mode 1, see below:
>
>
> [   93.167079] Call Trace:
> [   93.167079]  [<ffffffff81034cf9>] warn_slowpath_common+0x77/0x8f
> [   93.167079]  [<ffffffff81034d5e>] warn_slowpath_fmt+0x3c/0x3e
> [   93.167079]  [<ffffffff81366aef>] ? _raw_read_trylock+0x11/0x4b
> [   93.167079]  [<ffffffffa02a2c42>] ? bond_start_xmit+0x12b/0x401 [bonding]
> ->  read_lock fails
> [   93.167079]  [<ffffffffa02a2c9f>] bond_start_xmit+0x188/0x401 [bonding]
> [   93.167079]  [<ffffffff81055b37>] ? trace_hardirqs_off+0xd/0xf
> [   93.167079]  [<ffffffff812dfdb9>] netpoll_send_skb+0xbd/0x1f3
> [   93.167079]  [<ffffffff812e00ed>] netpoll_send_udp+0x1fe/0x20d
> [   93.167079]  [<ffffffffa02c017c>] write_msg+0x89/0xcd [netconsole]
> [   93.167079]  [<ffffffff81034e65>] __call_console_drivers+0x67/0x79
> [   93.167079]  [<ffffffff81034ed0>] _call_console_drivers+0x59/0x5d
> [   93.167079]  [<ffffffff810352d3>] release_console_sem+0x121/0x1d7
> [   93.167079]  [<ffffffff8103590a>] vprintk+0x35d/0x393
> [   93.167079]  [<ffffffff8103f947>] ? add_timer+0x17/0x19
> [   93.167079]  [<ffffffff81046ddf>] ? queue_delayed_work_on+0xa2/0xa9
> [   93.167079]  [<ffffffff81363bb8>] printk+0x3c/0x44
> [   93.167079]  [<ffffffffa02a3b17>] bond_select_active_slave+0x105/0x109 [bonding]
> ->  write_locked
> [   93.167079]  [<ffffffffa02a4798>] bond_mii_monitor+0x479/0x4ed [bonding]
> [   93.167079]  [<ffffffff81046009>] worker_thread+0x1ef/0x2e2
>
> In this case, the message should be
>      "bonding: bond0: making interface eth0 the new active one"


Hmm, you triggered a warning here, let me check the source code
and try to reproduce it here.

>
> I did the following patch to discard the packet if it was IN_NETPOLL
> and the read_lock() fails, so I could go ahead testing it:
>
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index 5e12462..a3b8bad 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -4258,8 +4258,19 @@ static int bond_xmit_activebackup(struct sk_buff *skb, struct net_device *bond_d
>   	struct bonding *bond = netdev_priv(bond_dev);
>   	int res = 1;
>
> -	read_lock(&bond->lock);
> -	read_lock(&bond->curr_slave_lock);
> +	if (read_trylock(&bond->lock) == 0&&
> +		(bond_dev->flags&  IFF_IN_NETPOLL)) {
> +			dev_kfree_skb(skb);
> +			return NETDEV_TX_OK;
> +	}
> +
> +	if (read_trylock(&bond->curr_slave_lock) == 0&&
> +		(bond_dev->flags&  IFF_IN_NETPOLL)) {
> +			read_unlock(&bond->lock);
> +			dev_kfree_skb(skb);
> +			return NETDEV_TX_OK;
> +	}
> +			
>
>   	if (!BOND_IS_OK(bond))
>   		goto out;
>


This looks like a workaround, not a fix. :)

>
> and I found another problem.  The function netpoll_send_skb() checks
> if the npinfo's queue length is zero and if it's not, it will queue
> the packet to make sure it's in order and then schedule the thread
> to run. Later, the thread wakes up running queue_process() which disables
> interrupts before calling ndo_start_xmit().  However, dev_queue_xmit()
> uses rcu_*_bh() and before return, it will enable the interrupts again,
> spitting this:
>
> ------------[ cut here ]------------
> WARNING: at kernel/softirq.c:143 local_bh_enable+0x3c/0x86()
> Hardware name: Precision WorkStation 490
> Modules linked in: netconsole bonding sunrpc ip6t_REJECT xt_tcpudp nf_conntrack_ipv6]
> Pid: 17, comm: events/2 Not tainted 2.6.34-04700-gd938a70 #21
> Call Trace:
>   [<ffffffff810381d6>] warn_slowpath_common+0x77/0x8f
>   [<ffffffff810381fd>] warn_slowpath_null+0xf/0x11
>   [<ffffffff8103d691>] local_bh_enable+0x3c/0x86
>   [<ffffffff812e4d85>] dev_queue_xmit+0x462/0x493
>   [<ffffffffa018805f>] bond_dev_queue_xmit+0x1bd/0x1e3 [bonding]
>   [<ffffffffa01881dd>] bond_start_xmit+0x158/0x37b [bonding]
> ->  interrupts disabled
>   [<ffffffff812f3fca>] queue_process+0x9d/0xf9
>   [<ffffffff8104d022>] worker_thread+0x19d/0x224
>   [<ffffffff812f3f2d>] ? queue_process+0x0/0xf9
>   [<ffffffff81050819>] ? autoremove_wake_function+0x0/0x34
>   [<ffffffff8104ce85>] ? worker_thread+0x0/0x224
>   [<ffffffff8105040b>] kthread+0x7a/0x82
>   [<ffffffff810036d4>] kernel_thread_helper+0x4/0x10
>   [<ffffffff81050391>] ? kthread+0x0/0x82
>   [<ffffffff810036d0>] ? kernel_thread_helper+0x0/0x10
> ---[ end trace 74e3904503fdb632 ]---
>
> kernel/softirq.c:
> 141 static inline void _local_bh_enable_ip(unsigned long ip)
> 142 {
> 143         WARN_ON_ONCE(in_irq() || irqs_disabled());
> 144 #ifdef CONFIG_TRACE_IRQFLAGS
> 145         local_irq_disable();
> 146 #endif
> 147         /*
> 148          * Are softirqs going to be turned on now:
> 149          */
>
>

I am wondering if this was caused by the previous issue.


> The git is updated up to:
>    d938a70 be2net: increase POST timeout for EEH recovery
>
> Two slave interfaces, bonding mode 1, netconsole over bond0.
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=248374#c5

How did you reproduce this?
I will check that BZ to see if I can find how to reproduce this.

Thanks.


^ permalink raw reply

* RE: issue about virtio-net driver to suppoprt vhost mergeable buffer with zero-copy to support PS mode
From: Xin, Xiaohui @ 2010-05-28  1:13 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev@vger.kernel.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, Herbert Xu
In-Reply-To: <20100527082025.GA5579@redhat.com>

Michael,
What's you have suggested could avoid to taint the guest virtio-net driver. Really thanks!
For the two alternative, the first will do more trick with driver or net-core stuff. So currently, I prefer to try the second one. Anyway, let me have a good think of it. Thanks!

Thanks
Xiaohui

>-----Original Message-----
>From: kvm-owner@vger.kernel.org [mailto:kvm-owner@vger.kernel.org] On Behalf Of
>Michael S. Tsirkin
>Sent: Thursday, May 27, 2010 4:20 PM
>To: Xin, Xiaohui
>Cc: netdev@vger.kernel.org; kvm@vger.kernel.org; linux-kernel@vger.kernel.org; Herbert
>Xu
>Subject: Re: issue about virtio-net driver to suppoprt vhost mergeable buffer with zero-copy
>to support PS mode
>
>On Thu, May 27, 2010 at 09:21:02AM +0800, Xin, Xiaohui wrote:
>> Michael,
>> I'm now looking into the vhost mergeable buffer, and I tried to use it to support PS mode
>with zero-copy. And I found an issue there that I have to modify the guest virito-net driver.
>>
>> When guest virtio-net driver submits mergeable buffers, it submits multiple pages outside.
>In zero-copy case, vhost cannot know which page is used to put header, and which page is
>used to put payload. Then vhost can only reserves 12 bytes for each page. That means, the
>page_offset of the payload DMAed into the guest buffer is always 12 bytes. But guest
>virtio-net driver always use offset 0 to put the data (See receive_mergeable()). That's where
>the zero-copy use mergeable buffer must modify.
>>
>> Have I missed something here? And how do you think about it?
>>
>> Thanks
>> Xiaohui
>
>Maybe you can teach the hardware skip the first 12 bytes: qemu will
>call an ioctl telling hardware what the virtio header size is.
>This is how we plan to do it for tap.
>
>Alternatively, buffers can be used in any order.
>So we can have hardware use N buffers for the packet, and then
>have vhost put the header in buffer N+1.
>
>--
>MST
>--
>To unsubscribe from this list: send the line "unsubscribe kvm" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] cnic: Fix context memory init. on 5709.
From: David Miller @ 2010-05-27 23:31 UTC (permalink / raw)
  To: mchan; +Cc: netdev
In-Reply-To: <1274991858-6201-1-git-send-email-mchan@broadcom.com>

From: "Michael Chan" <mchan@broadcom.com>
Date: Thu, 27 May 2010 13:24:18 -0700

> We need to zero context memory on 5709 in the function cnic_init_context().
> Without this, iscsid restart on 5709 will not work because of stale data.
> TX context blocks should not be initialized by cnic_init_context() because
> of the special remapping on 5709.
> 
> Update version to 2.1.2.
> 
> Signed-off-by: Michael Chan <mchan@broadcom.com>

Applied, thanks Michael.

^ permalink raw reply

* Re: [PATCH 6/11] drivers/net: Eliminate a NULL pointer dereference
From: David Miller @ 2010-05-27 23:30 UTC (permalink / raw)
  To: julia; +Cc: netdev, linux-kernel, kernel-janitors
In-Reply-To: <Pine.LNX.4.64.1005271433180.5422@ask.diku.dk>

From: Julia Lawall <julia@diku.dk>
Date: Thu, 27 May 2010 14:33:32 +0200 (CEST)

> At the point of the print, dev is NULL.
 ...
> Signed-off-by: Julia Lawall <julia@diku.dk>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH 4/11] drivers/net/hamradio: Eliminate a NULL pointer dereference
From: David Miller @ 2010-05-27 23:29 UTC (permalink / raw)
  To: julia; +Cc: jpr, linux-hams, netdev, linux-kernel, kernel-janitors
In-Reply-To: <Pine.LNX.4.64.1005271432450.5422@ask.diku.dk>

From: Julia Lawall <julia@diku.dk>
Date: Thu, 27 May 2010 14:33:00 +0200 (CEST)

> At the point of the print, dev is NULL.
> 
> A simplified version of the semantic match that finds this problem is as
> follows: (http://coccinelle.lip6.fr/)
 ...
> Signed-off-by: Julia Lawall <julia@diku.dk>

Applied.

^ permalink raw reply

* Re: [PATCH net-2.6] be2net: Patch removes redundant while statement in loop.
From: David Miller @ 2010-05-27 23:28 UTC (permalink / raw)
  To: sarveshwarb; +Cc: netdev
In-Reply-To: <20100527093208.GA13740@serverengines.com>

From: Sarveshwar Bandi <sarveshwarb@serverengines.com>
Date: Thu, 27 May 2010 15:02:19 +0530

> Signed-off-by: Sarveshwar Bandi <sarveshwarb@serverengines.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH]: bnx2: Fix uninitialized variable warning in bnx2_disable_forced_2g5()
From: David Miller @ 2010-05-27 23:26 UTC (permalink / raw)
  To: prarit; +Cc: netdev, mchan
In-Reply-To: <20100527220856.25418.86593.sendpatchset@prarit.bos.redhat.com>

From: Prarit Bhargava <prarit@redhat.com>
Date: Thu, 27 May 2010 18:13:01 -0400

> Fix warning:
> 
> drivers/net/bnx2.c: In function 'bnx2_disable_forced_2g5':
> drivers/net/bnx2.c:1489: warning: 'bmcr' may be used uninitialized in this function
> 
> Signed-off-by: Prarit Bhargava <prarit@redhat.com>

This is another bad patch, you're hiding a bug by making this warning
go away.

If bnx2_read_phy() in fact does not initialize "bmcr" we might
want to do something about that instead of just blindly using
whatever garbage is in that local variable.

Please stop making these patches without applying at least some
intelligence to the analysis of why these warnings are being
printed at all.

None of these cases you're posting patches for are situations where
the compiler simply can't see the control flow properly, they are real
issues.

Thanks.

^ permalink raw reply

* Re: [PATCH]: niu: fix uninitialized variable warning
From: David Miller @ 2010-05-27 23:23 UTC (permalink / raw)
  To: prarit; +Cc: netdev
In-Reply-To: <20100527183100.24251.12228.sendpatchset@prarit.bos.redhat.com>

From: Prarit Bhargava <prarit@redhat.com>
Date: Thu, 27 May 2010 14:35:03 -0400

> Fixes warning:
> 
> drivers/net/niu.c: In function ‘niu_process_rx_pkt’:
> drivers/net/niu.c:3457: error: ‘link’ may be used uninitialized in this function
> 
> Signed-off-by: Prarit Bhargava <prarit@redhat.com>

These blind uninitialized_var() annotations are almost always the wrong
thing to do.  And that is the case here.

The code that looks up the page here can return NULL if there is some
error and the page hashes aren't setup properly, and then the callers
go and blindly assume that 'page' is non-NULL without any error
checking.

Applying this patch does more harm than good because it hides this
issue.  So I'm definitely not applying this patch.


^ permalink raw reply

* Re: ipv6: Add GSO support on forwarding path
From: David Miller @ 2010-05-27 23:14 UTC (permalink / raw)
  To: herbert; +Cc: ralf, netdev
In-Reply-To: <20100527115440.GA9952@gondor.apana.org.au>

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Thu, 27 May 2010 21:54:40 +1000

> On Thu, May 27, 2010 at 08:24:29PM +1000, Herbert Xu wrote:
>> 
>> Actually, this patch is still not quite right for the GSO case.
>> Right now it's only adding the IP header size, not the full header
>> size.
>> 
>> I need to think a bit more about this.
> 
> Let's do it the old way first, so that it at least works for the
> common case.  I'll fix it properly later.
> 
> ipv6: Add GSO support on forwarding path

Ok, applied, thanks Herbert.

^ permalink raw reply

* Re: [GIT PULL net-2.6] vhost-net: error handling fixes
From: David Miller @ 2010-05-27 23:13 UTC (permalink / raw)
  To: mst; +Cc: kvm, virtualization, netdev, linux-kernel, krkumar2
In-Reply-To: <20100527115714.GA7899@redhat.com>

From: "Michael S. Tsirkin" <mst@redhat.com>
Date: Thu, 27 May 2010 14:57:14 +0300

> David,
> The following tree includes fixes dealing with error handling
> in vhost-net. It is on top of net-2.6.
> Please merge it for 2.6.35.
> Thanks!
> 
> The following changes since commit 8a74ad60a546b13bd1096b2a61a7a5c6fd9ae17c:
> 
>   net: fix lock_sock_bh/unlock_sock_bh (2010-05-27 00:30:53 -0700)
> 
> are available in the git repository at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost-net

Pulled, thanks Michael.

^ permalink raw reply

* Re: boot crash in arp_error_report()
From: David Miller @ 2010-05-27 23:10 UTC (permalink / raw)
  To: eric.dumazet; +Cc: torvalds, mingo, tglx, akpm, netdev, linux-kernel
In-Reply-To: <1274991504.2446.13.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 27 May 2010 22:18:24 +0200

> [PATCH] net: fix __neigh_event_send()
> 
> commit 7fee226ad23 (net: add a noref bit on skb dst) missed one spot
> where an skb is enqueued, with a possibly not refcounted dst entry.
> 
> __neigh_event_send() inserts skb into arp_queue, so we must make sure
> dst entry is refcounted, or dst entry can be freed by garbage collector
> after caller exits from rcu protected section.
> 
> Reported-by: Ingo Molnar <mingo@elte.hu>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied, thanks Eric.

Ingo can we get a confirmation that this fixes the bootup crash?

Thanks!

^ permalink raw reply

* [PATCH]: bnx2: Fix uninitialized variable warning in bnx2_disable_forced_2g5()
From: Prarit Bhargava @ 2010-05-27 22:13 UTC (permalink / raw)
  To: netdev, mchan; +Cc: Prarit Bhargava

Fix warning:

drivers/net/bnx2.c: In function 'bnx2_disable_forced_2g5':
drivers/net/bnx2.c:1489: warning: 'bmcr' may be used uninitialized in this function

Signed-off-by: Prarit Bhargava <prarit@redhat.com>

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index 667f419..7e2f8ac 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -1486,7 +1486,7 @@ bnx2_enable_forced_2g5(struct bnx2 *bp)
 static void
 bnx2_disable_forced_2g5(struct bnx2 *bp)
 {
-	u32 bmcr;
+	u32 uninitialized_var(bmcr);
 
 	if (!(bp->phy_flags & BNX2_PHY_FLAG_2_5G_CAPABLE))
 		return;

^ permalink raw reply related

* Re: UDP Fragmentation and DF bit..
From: Vincent, Pradeep @ 2010-05-27 21:58 UTC (permalink / raw)
  To: Andi Kleen; +Cc: netdev@vger.kernel.org
In-Reply-To: <87bpc11xyw.fsf@basil.nowhere.org>

Thanks Andi.

> I don't understand your mtu-small/big concept. PMTU is per IP and per
> flow. So there's only always a single PMTU, not small and big.

Example Setup: One part of the network supports MTU-big and another part of
the network supports MTU-small. The NIC interfaces are configured with MTU
corresponding to the part of the network they are connected to.

When a host in MTU-big part of the network sends a packet to a host in
MTU-small part of the network, the router bordering these two sections of
the network sends ICMP messages to the host in MTU-big section of the
network to enable PMTU discovery.

Everything works great for TCP. UDP's PMTU discovery seems to be available
for some datagram sizes ( size < MTU-big) but not otherwise.

>DF=1 on fragments would mean the application has to do pmtu discovery
> even with fragments for the case when the kernel does not know
> the path mtu yet.

Is there a reason kernel can't learn the PMTU using ICMP messages generated
due to large size fragments with DF=1 ?


>But if the app supports pmtu discovery it's better
> to not use fragments in the first place.

Absolutely. 

Thanks,

- Pradeep Vincent

On 5/27/10 2:43 AM, "Andi Kleen" <andi@firstfloor.org> wrote:

> "Vincent, Pradeep" <pradeepv@amazon.com> writes:
> 
>> OMan 7 ip¹ declares that ³The  system-wide  default  is  controlled  by  the
>> ip_no_pmtu_disc  sysctl  for SOCK_STREAM  sockets,  and  disabled  on all
>> others.² which led me to think ODF¹ bit will not be set for UDP packets.
>> But..
> 
> One should add ip.7 is not really a spec, just documentation
> how things were quite a few years ago. It unfortunately
> does often not get updated when things change.
> 
>> 
>> In a network environment where MTU-big and MTU-small co-exist (and have
>> router¹s fragmentation turned off in favor of PMTU discovery), UDP packets
>> that are > MTU-small and < MTU-big find the PMTU effectively but UDP
>> packets
> 
> 
> I don't understand your mtu-small/big concept. PMTU is per IP and per
> flow. So there's only always a single PMTU, not small and big.
> 
> Or do you refer to a single IP NAT situation where a single IP
> shares different MTUs?
> 
>> Is there a reason ODF¹ bit cannot be set on fragmented packets on UDP
>> transmission ? I couldn¹t find anything in RFC for IP protocol that
>> prohibited DF bit on fragmented packets. Did I miss
>> something here ?
> 
>> Would it be reasonable if PMTU discovery is performed (DF bit set +
>> appropriate icmp logic) even for locally fragmented packets ? I think
>> this
> 
> 
> DF=1 on fragments would mean the application has to do pmtu discovery
> even with fragments for the case when the kernel does not know
> the path mtu yet. But if the app supports pmtu discovery it's better
> to not use fragments in the first place.
> 
> -Andi
> 
> --
> ak@linux.intel.com -- Speaking for myself only.
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply

* Re: [REGRESSION,BISECTED] MIPv6 support broken by f4f914b58019f0
From: Scott C Otto @ 2010-05-27 21:31 UTC (permalink / raw)
  To: Brian Haley
  Cc: Arnaud Ebalard, David Miller,
	YOSHIFUJI Hideaki / 吉藤英明, Jiri Olsa,
	netdev
In-Reply-To: <4BFECA65.7030102@hp.com>

All,
Thanks for looking into this.

The behavior of SO_BINDTODEVICE, certainly with IPV4, is to identify a specific
interface to use for sending/receiving AF_INET packets upon.   And that's how it
has behaved with IPV4.   Tools like ping, traceroute (and ping6, traceroute6)
make use of it with the -i (interface) options.

We ran into this issue when updating an IPV4 application to IPV6.   The report
provided one example of the issue.

The original change was based on the semantics of flowi.oif used in IPV4.  There
flowi.oif (as it is with IPV6 as well) is also set to sk->sk_bound_dev_if when
using ip_route_connect() and routing simply took a non-zero oif to enforce an
interface.  Looking at IPV4 tunneling, flowi.oif is set the same way from
tunnel's parms.link as with IPV6 so would follow those semantics as well.

Obviously, with IPV6, the semantics of flowi.oif are different and perhaps even
vary with the users of IPV6.

If that variability is desired, the proposed change from Brian (using
sk->sk_bound_dev_if) would be an equivalent means of supporting SO_BINDTODEVICE
while limiting the impact.

Scott

On 5/27/2010 2:39 PM, Brian Haley wrote:
> Hi Arnoud,
> 
> On 05/27/2010 11:14 AM, Arnaud Ebalard wrote:
> 
>>Hi,
>>
>>Thanks for your reply Brian and sorry for the length of this response. If
>>Hideaki and David can comment on the IPv6/XFRM and SO_BINDTODEVICE
>>aspects discussed below that would be helpful, IMHO.
> 
> 
> Thanks for all the analysis, comments below.
> 
> 
>>>On 05/26/2010 01:01 PM, Arnaud Ebalard wrote:
>>>
>>>>Hi,
>>>>
>>>>I just updated my laptop's kernel to 2.6.34 (previously running .33 and
>>>>configured to act as an IPsec/IKE-protected MIPv6 Mobile Node using
>>>>racoon and umip): after rebooting on the new kernel, the transport mode
>>>>SA protecting MIPv6 signaling traffic are missing.
>>>>
>>>>I bisected the issue down to f4f914b58019f0e50d521bbbadfaee260d766f95
>>>>(net: ipv6 bind to device issue) which was added after 2.6.34-rc5: 
>>>>
>>>>diff --git a/net/ipv6/route.c b/net/ipv6/route.c
>>>>index c2438e8..05ebd78 100644
>>>>--- a/net/ipv6/route.c
>>>>+++ b/net/ipv6/route.c
>>>>@@ -815,7 +815,7 @@ struct dst_entry * ip6_route_output(struct net *net, struct sock *sk,
>>>> {
>>>>        int flags = 0;
>>>> 
>>>>-       if (rt6_need_strict(&fl->fl6_dst))
>>>>+       if (fl->oif || rt6_need_strict(&fl->fl6_dst))
>>>>                flags |= RT6_LOOKUP_F_IFACE;
>>>
>>>Can you see if fl->oif is at least a sane value here?  Maybe there's some
>>>partially un-initialized flowi getting passed-in, a quick source code check
>>>didn't find anything obvious.
>>
>>When it's not 0, fl->oif is a sane value: it is set to the index of the
>>interface on which the current *Care-of Address* is configured. All the
>>traffic is expected to leave the host via this interface. 
> 
> 
> Ok, so it's not un-initialized data causing this.
> 
> 
>>In previous debug outputs, the content of the fl->oif is ok, i.e. it is
>>set to the interface on which the CoA is configured, i.e. the output
>>interface. But the commit results in flags |= RT6_LOOKUP_F_IFACE.
>>Later, in rt6_score_route(), the call to rt6_check_dev() returns 0
>>(dev->ifindex is ip6tnl1 but oif is wlan0). Because of the change to flags 
>>flags, we quickly return -1 in rt6_score_route():
> 
> 
> Ok, so the call to ip6_route_output() was from the tunnel code, which is
> using it's cached flowi, which has oif set to the tunnel.  The XFRM code
> swaps the addresses, which should invalidate the oif, but it doesn't.
> 
> 
>>static int rt6_score_route(struct rt6_info *rt, int oif,
>>			   int strict)
>>{
>>	int m, n;
>>
>>	m = rt6_check_dev(rt, oif);
>>	if (!m && (strict & RT6_LOOKUP_F_IFACE))
>>                return -1;
>>        ...
>>
>>Now, I wonder if the following is correct. Don't hesitate to correct me
>>if I am wrong:
>>
>>Initially (before f4f914b58019f0), the purpose of the test using
>>rt6_need_strict() in ip6_route_output() (introduced by c71099ac) was to
>>allow the multiple routing table logic to be applied to all global
>>addresses but to preserve the addresses for which it would not make
>>sense (link-local, multicast, ). The change introduced by f4f914b58019f0
>>basically reduces the ability to route traffic as you want and forces
>>the traffic to leave the device by the interface on which it is
>>configured (if fl->oif is set).
> 
> 
> The problem is we assumed the caller's would only set fl->oif if they
> wanted it enforced (multicast, link-local, SO_BINDTODEVICE), but it
> didn't take into account the tunnel code.  I guess the easy answer
> would be to revert this until we can fix it correctly.
> 
> 
>>From my (very limited and possibly wrong) understanding, the change
>>introduced by f4f914b58019f0 looks like a workaround for the 
>>SO_BINDTODEVICE issue. Looking at the code, there is something I don't
>>understand: if SO_BINDTODEVICE has been used on a socket, the socket
>>should have its sk_bound_dev_if attribute set to the correct ifindex
>>value. Hence the following (naive) question: why is that information not
>>used to inflect the selection of the route cached for the socket? And
>>why would the fix be at the adress level instead of being at the
>>interface level (ifindex)?
> 
> 
> I guess I always believed setting SO_BINDTODEVICE should always force
> traffic out that interface, but from Yoshifuji's email it seems that
> maybe wasn't the intention, at least for things that don't meet
> the rt_need_strict() criteria like globals.  I don't know the history
> behind the setsockopt.
> 
> The below might actually be what was actually intended, triggering
> on what the user forced, rather than assuming all callers require
> strict behavior.
> 
> -Brian
> 
> 
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index 294cbe8..252d761 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -814,7 +814,7 @@ struct dst_entry * ip6_route_output(struct net *net, struct sock *sk,
>  {
>  	int flags = 0;
>  
> -	if (fl->oif || rt6_need_strict(&fl->fl6_dst))
> +	if ((sk && sk->sk_bound_dev_if) || rt6_need_strict(&fl->fl6_dst))
>  		flags |= RT6_LOOKUP_F_IFACE;
>  
>  	if (!ipv6_addr_any(&fl->fl6_src))



^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox