Netdev List
 help / color / mirror / Atom feed
* Re: TCP-MD5 checksum failure on x86_64 SMP
From: Stephen Hemminger @ 2010-05-07 17:14 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, bhaskie, bhutchings, netdev
In-Reply-To: <1273210329.2222.42.camel@edumazet-laptop>

On Fri, 07 May 2010 07:32:09 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Le jeudi 06 mai 2010 à 22:04 -0700, David Miller a écrit :
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> 
> > > This code should be completely rewritten for linux-2.6.35, its very ugly
> > > and over complex, yet it is not scalable.
> > > 
> > > It could use true percpu data, with no central lock or refcount.
> > 
> > Yes I've always disliked the way the TCP MD5 pool stuff is coded, it's
> > frankly crap and this is like the 5th major bug that had to get fixed
> > in it. :-)
> > 
> > But let's fix this as simply as possible in net-2.6 and -stable, so Eric
> > let me know when you have a tested patch for me to apply.
> 
> There are so many things ...
> 
> I am comtemplating commit aa133076
> 
> __tcp_alloc_md5sig_pool() now do a : 
> 
> p = kzalloc(sizeof(*p), sk->sk_allocation);
> 
> But later call :
> 
> hash = crypto_alloc_hash("md5", 0, CRYPTO_ALG_ASYNC);
> 
> (GFP_KERNEL allocations for sure)
> 
> 
> tcp_v4_parse_md5_keys() also use :
> 
> p = kzalloc(sizeof(*p), sk->sk_allocation);
> 
> right after a (possibly sleeping) copy_from_user()
> 
> Oh well...
> 
> 
> I'll test the already suggested patch before official submission,
> thanks.

Forget the per cpu data; the pool should just be scrapped.

The only reason the pool exists is that the crypto hash state which
should just be moved into the md5_info (88 bytes).  The pseudo
header can just be generated on the stack before passing to the crypto
code.


^ permalink raw reply

* Re: TCP-MD5 checksum failure on x86_64 SMP
From: Eric Dumazet @ 2010-05-07 17:21 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, bhaskie, bhutchings, netdev
In-Reply-To: <20100507101451.1b4286b7@nehalam>

Le vendredi 07 mai 2010 à 10:14 -0700, Stephen Hemminger a écrit :

> Forget the per cpu data; the pool should just be scrapped.
> 
> The only reason the pool exists is that the crypto hash state which
> should just be moved into the md5_info (88 bytes).  The pseudo
> header can just be generated on the stack before passing to the crypto
> code.


Sure, but I'm afraid there is no generic API do do that (if we want to
reuse crypto/md5.c code)




^ permalink raw reply

* Re: TCP-MD5 checksum failure on x86_64 SMP
From: Stephen Hemminger @ 2010-05-07 17:36 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, bhaskie, bhutchings, netdev
In-Reply-To: <1273252893.2261.84.camel@edumazet-laptop>

On Fri, 07 May 2010 19:21:33 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Le vendredi 07 mai 2010 à 10:14 -0700, Stephen Hemminger a écrit :
> 
> > Forget the per cpu data; the pool should just be scrapped.
> > 
> > The only reason the pool exists is that the crypto hash state which
> > should just be moved into the md5_info (88 bytes).  The pseudo
> > header can just be generated on the stack before passing to the crypto
> > code.
> 
> 
> Sure, but I'm afraid there is no generic API do do that (if we want to
> reuse crypto/md5.c code).

It looks like the pool is just an optimization to avoid opening too
many crypto API connections.  This should only be an issue if offloading
MD5.

^ permalink raw reply

* RE: ixgbe and mac-vlans problem
From: Tantilov, Emil S @ 2010-05-07 17:40 UTC (permalink / raw)
  To: Ben Greear; +Cc: Arnd Bergmann, NetDev, Patrick McHardy
In-Reply-To: <4BE38502.2000507@candelatech.com>

Ben Greear wrote:
> On 05/06/2010 05:06 PM, Tantilov, Emil S wrote:
>> Ben Greear wrote:
>>> On 05/06/2010 10:51 AM, Tantilov, Emil S wrote:
>>> 
>>>> Hi Ben,
>>>> 
>>>> We do have a patch in testing (see attached). It may not apply
>>>> cleanly as it is on top of some other patches currently in
>>>> validation. Let me know if it works for you.
>>> 
>>> It wasn't difficult to backport this patch to 2.6.31.12....
>>> 
>>> I just tested this on an 85998 NIC and 50 MAC-VLANs worked fine.
>>> 
>>> The NIC doesn't show as PROMISC in any way I can detect, but I guess
>>> it must actually be in PROMISC mode:
>> 
>> Yes the interface is in promisc mode. The driver sets the FCTRL.UPE
>> bit (unicast promisc mode) when the number of allowed rar_entries is
>> exceeded. 
> 
> Is there any way to get this setting from ethtool or similar?  It
> would be nice to know the actual PROMISC state of the NIC regardless of
> what user-space has or has not configured.

ethtool -d eth11 | grep -i promisc 

Thanks,
Emil


^ permalink raw reply

* [PATCH 1/1] ibmveth: Add suspend/resume support
From: Brian King @ 2010-05-07 18:56 UTC (permalink / raw)
  To: netdev; +Cc: brking


Adds support for resuming from suspend for IBM virtual ethernet devices.
We may have lost an interrupt over the suspend, so we just kick the
interrupt handler to process anything that is outstanding.

Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
---

 drivers/net/ibmveth.c |   12 ++++++++++++
 1 file changed, 12 insertions(+)

diff -puN drivers/net/ibmveth.c~ibmveth_suspend drivers/net/ibmveth.c
--- linux-2.6/drivers/net/ibmveth.c~ibmveth_suspend	2010-05-07 13:49:19.000000000 -0500
+++ linux-2.6-bjking1/drivers/net/ibmveth.c	2010-05-07 13:49:19.000000000 -0500
@@ -45,6 +45,7 @@
 #include <linux/init.h>
 #include <linux/delay.h>
 #include <linux/mm.h>
+#include <linux/pm.h>
 #include <linux/ethtool.h>
 #include <linux/proc_fs.h>
 #include <linux/in.h>
@@ -1589,6 +1590,12 @@ static struct kobj_type ktype_veth_pool
 	.default_attrs  = veth_pool_attrs,
 };
 
+static int ibmveth_resume(struct device *dev)
+{
+	struct net_device *netdev = dev_get_drvdata(dev);
+	ibmveth_interrupt(netdev->irq, netdev);
+	return 0;
+}
 
 static struct vio_device_id ibmveth_device_table[] __devinitdata= {
 	{ "network", "IBM,l-lan"},
@@ -1596,6 +1603,10 @@ static struct vio_device_id ibmveth_devi
 };
 MODULE_DEVICE_TABLE(vio, ibmveth_device_table);
 
+static struct dev_pm_ops ibmveth_pm_ops = {
+	.resume = ibmveth_resume
+};
+
 static struct vio_driver ibmveth_driver = {
 	.id_table	= ibmveth_device_table,
 	.probe		= ibmveth_probe,
@@ -1604,6 +1615,7 @@ static struct vio_driver ibmveth_driver
 	.driver		= {
 		.name	= ibmveth_driver_name,
 		.owner	= THIS_MODULE,
+		.pm = &ibmveth_pm_ops,
 	}
 };
 
_

^ permalink raw reply

* Re: [PATCH] ipv4: remove ip_rt_secret timer (v3)
From: Neil Horman @ 2010-05-07 19:55 UTC (permalink / raw)
  To: netdev; +Cc: davem, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <20100506171639.GA5063@hmsreliant.think-freely.org>

Hey-
	Sorry for the delay, but I got interested in Erics suggestion of
changing how we update rt_genid, and had a few thoughts.  Heres version 3 of
this patch:

Change Notes:

1) Removed the secret_interval binary interface entry in the list (forgot to do
that before)

2) Took Erics Suggestion to change the update for net->ipv4.rt_genid.  Now
instead of doing a small incremental change, we simply grab 32 new random bits.

3) The change in (2) got me thinking that part of the reason we used the Jenkins
hash in rt_genid was to ensure non-repetion of id's in a short time period
(which is important, so as not to inadvertently reuse route cache entries that
should be getting expired).  While extra randomness makes it harder for
attackers to guess the secret, it makes it possible to return to previously
guessed values as well (if they're lucky).  As such, I created a configurable
option, CONFIG_GENID_DUPCHECK.  With this option on, the low order 8 bits of the
genid are replaced with a rolling counter, that increments on each new genid.
This creates in effect, a 256 deep list of previously used genid values.  In
rt_drop we compare the genids for duplicates.  If we find that 2 genid values
have different indexes, but idential remaining bits, they are noted as a repeat
genid, and we call rt_cache_invalidate to generate a new genid and avoid the
duplication problem.


I've done some testing with this patch here, and it seems to work well.

Summary

A while back there was a discussion regarding the rt_secret_interval timer.
Given that we've had the ability to do emergency route cache rebuilds for awhile
now, based on a statistical analysis of the various hash chain lengths in the
cache, the use of the flush timer is somewhat redundant.  This patch removes the
rt_secret_interval sysctl, allowing us to rely solely on the statistical
analysis mechanism to determine the need for route cache flushes.


Signed-off-by: Neil Horman <nhorman@tuxdriver.com>


 include/net/netns/ipv4.h |    5 -
 kernel/sysctl_binary.c   |    1 
 net/ipv4/Kconfig         |   10 ++
 net/ipv4/route.c         |  164 ++++++++++++++++-------------------------------
 4 files changed, 69 insertions(+), 111 deletions(-)


diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index ae07fee..fb53b3c 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -55,9 +55,10 @@ struct netns_ipv4 {
 	int sysctl_rt_cache_rebuild_count;
 	int current_rt_cache_rebuild_count;
 
-	struct timer_list rt_secret_timer;
 	atomic_t rt_genid;
-
+#ifdef CONFIG_GENID_DUPCHECK
+	atomic_t rt_genidx;
+#endif
 #ifdef CONFIG_IP_MROUTE
 #ifndef CONFIG_IP_MROUTE_MULTIPLE_TABLES
 	struct mr_table		*mrt;
diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
index 5903057..937d31d 100644
--- a/kernel/sysctl_binary.c
+++ b/kernel/sysctl_binary.c
@@ -224,7 +224,6 @@ static const struct bin_table bin_net_ipv4_route_table[] = {
 	{ CTL_INT,	NET_IPV4_ROUTE_MTU_EXPIRES,		"mtu_expires" },
 	{ CTL_INT,	NET_IPV4_ROUTE_MIN_PMTU,		"min_pmtu" },
 	{ CTL_INT,	NET_IPV4_ROUTE_MIN_ADVMSS,		"min_adv_mss" },
-	{ CTL_INT,	NET_IPV4_ROUTE_SECRET_INTERVAL,		"secret_interval" },
 	{}
 };
 
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index 8e3a1fd..ad934a9 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -98,6 +98,16 @@ config IP_FIB_TRIE_STATS
 	  Keep track of statistics on structure of FIB TRIE table.
 	  Useful for testing and measuring TRIE performance.
 
+config GENID_DUPCHECK
+	bool "Duplicate generation id checking in the route cache"
+	---help---
+	  Add checks to the route cache to detect duplicate route cache
+	  generation id.  Duplicate id's can lead to inappropriate reuse of
+	  route caches entries, which can degrate performance.  This check
+	  adds an index count of 8 bits to the genid which can determine if
+	  the same genid is reused within the last 256 iterations.  Turn this
+	  on
+
 config IP_MULTIPLE_TABLES
 	bool "IP: policy routing"
 	depends on IP_ADVANCED_ROUTER
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index a947428..57e51d4 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -112,6 +112,10 @@
 #define RT_FL_TOS(oldflp) \
     ((u32)(oldflp->fl4_tos & (IPTOS_RT_MASK | RTO_ONLINK)))
 
+#ifdef CONFIG_GENID_DUPCHECK
+#define GENID_IDX_MASK 0xffffff00
+#endif
+
 #define IP_MAX_MTU	0xFFF0
 
 #define RT_GC_TIMEOUT (300*HZ)
@@ -129,7 +133,6 @@ static int ip_rt_gc_elasticity __read_mostly	= 8;
 static int ip_rt_mtu_expires __read_mostly	= 10 * 60 * HZ;
 static int ip_rt_min_pmtu __read_mostly		= 512 + 20 + 20;
 static int ip_rt_min_advmss __read_mostly	= 256;
-static int ip_rt_secret_interval __read_mostly	= 10 * 60 * HZ;
 static int rt_chain_length_max __read_mostly	= 20;
 
 static struct delayed_work expires_work;
@@ -147,7 +150,7 @@ static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst);
 static void		 ipv4_link_failure(struct sk_buff *skb);
 static void		 ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu);
 static int rt_garbage_collect(struct dst_ops *ops);
-
+static void rt_cache_invalidate(struct net *net);
 
 static struct dst_ops ipv4_dst_ops = {
 	.family =		AF_INET,
@@ -270,6 +273,34 @@ static inline int rt_genid(struct net *net)
 	return atomic_read(&net->ipv4.rt_genid);
 }
 
+#ifdef CONFIG_GENID_DUPCHECK
+static inline int rt_genid_dup(int genid1, int genid2)
+{
+	int idx1, idx2;
+	int rid1, rid2;
+
+	idx1 = genid1 & ~GENID_IDX_MASK;
+	idx2 = genid2 & ~GENID_IDX_MASK;
+
+	rid1 = genid1 & GENID_IDX_MASK;
+	rid2 = genid2 & GENID_IDX_MASK;
+
+	/*
+	 * Duplicate genids are ids in which
+	 * the idx value is different, but the
+	 * remainder of the genid is identical
+	 */
+	if ((idx1 != idx2) && (rid1 == rid2))
+		return 1;
+	return 0;
+}
+#else
+static inline int rt_genid_dup(genid1, int genid2)
+{
+	return 0;
+}
+#endif
+
 #ifdef CONFIG_PROC_FS
 struct rt_cache_iter_state {
 	struct seq_net_private p;
@@ -615,6 +646,10 @@ static inline void rt_free(struct rtable *rt)
 
 static inline void rt_drop(struct rtable *rt)
 {
+	if (rt_genid_dup(rt->rt_genid, rt_genid(dev_net(rt->u.dst.dev)))) {
+		printk(KERN_WARNING "Duplicate route genid found!\n");
+		rt_cache_invalidate(dev_net(rt->u.dst.dev));
+	}
 	ip_rt_put(rt);
 	call_rcu_bh(&rt->u.dst.rcu_head, dst_rcu_free);
 }
@@ -888,17 +923,18 @@ static void rt_worker_func(struct work_struct *work)
 }
 
 /*
- * Pertubation of rt_genid by a small quantity [1..256]
- * Using 8 bits of shuffling ensure we can call rt_cache_invalidate()
- * many times (2^24) without giving recent rt_genid.
- * Jenkins hash is strong enough that litle changes of rt_genid are OK.
+ * invalidate the cache by making a new rt_genid.
  */
 static void rt_cache_invalidate(struct net *net)
 {
-	unsigned char shuffle;
+	unsigned int new;
 
-	get_random_bytes(&shuffle, sizeof(shuffle));
-	atomic_add(shuffle + 1U, &net->ipv4.rt_genid);
+	get_random_bytes(&new, sizeof(new));
+#ifdef CONFIG_GENID_DUPCHECK
+	new &= GENID_IDX_MASK;
+	new |= (atomic_inc_return(&net->ipv4.rt_genidx) & ~GENID_IDX_MASK);
+#endif
+	atomic_set(&net->ipv4.rt_genid, new);
 }
 
 /*
@@ -918,32 +954,11 @@ void rt_cache_flush_batch(void)
 	rt_do_flush(!in_softirq());
 }
 
-/*
- * We change rt_genid and let gc do the cleanup
- */
-static void rt_secret_rebuild(unsigned long __net)
-{
-	struct net *net = (struct net *)__net;
-	rt_cache_invalidate(net);
-	mod_timer(&net->ipv4.rt_secret_timer, jiffies + ip_rt_secret_interval);
-}
-
-static void rt_secret_rebuild_oneshot(struct net *net)
-{
-	del_timer_sync(&net->ipv4.rt_secret_timer);
-	rt_cache_invalidate(net);
-	if (ip_rt_secret_interval)
-		mod_timer(&net->ipv4.rt_secret_timer, jiffies + ip_rt_secret_interval);
-}
-
 static void rt_emergency_hash_rebuild(struct net *net)
 {
-	if (net_ratelimit()) {
+	if (net_ratelimit())
 		printk(KERN_WARNING "Route hash chain too long!\n");
-		printk(KERN_WARNING "Adjust your secret_interval!\n");
-	}
-
-	rt_secret_rebuild_oneshot(net);
+	rt_cache_invalidate(net);
 }
 
 /*
@@ -3101,48 +3116,6 @@ static int ipv4_sysctl_rtcache_flush(ctl_table *__ctl, int write,
 	return -EINVAL;
 }
 
-static void rt_secret_reschedule(int old)
-{
-	struct net *net;
-	int new = ip_rt_secret_interval;
-	int diff = new - old;
-
-	if (!diff)
-		return;
-
-	rtnl_lock();
-	for_each_net(net) {
-		int deleted = del_timer_sync(&net->ipv4.rt_secret_timer);
-		long time;
-
-		if (!new)
-			continue;
-
-		if (deleted) {
-			time = net->ipv4.rt_secret_timer.expires - jiffies;
-
-			if (time <= 0 || (time += diff) <= 0)
-				time = 0;
-		} else
-			time = new;
-
-		mod_timer(&net->ipv4.rt_secret_timer, jiffies + time);
-	}
-	rtnl_unlock();
-}
-
-static int ipv4_sysctl_rt_secret_interval(ctl_table *ctl, int write,
-					  void __user *buffer, size_t *lenp,
-					  loff_t *ppos)
-{
-	int old = ip_rt_secret_interval;
-	int ret = proc_dointvec_jiffies(ctl, write, buffer, lenp, ppos);
-
-	rt_secret_reschedule(old);
-
-	return ret;
-}
-
 static ctl_table ipv4_route_table[] = {
 	{
 		.procname	= "gc_thresh",
@@ -3251,13 +3224,6 @@ static ctl_table ipv4_route_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
-	{
-		.procname	= "secret_interval",
-		.data		= &ip_rt_secret_interval,
-		.maxlen		= sizeof(int),
-		.mode		= 0644,
-		.proc_handler	= ipv4_sysctl_rt_secret_interval,
-	},
 	{ }
 };
 
@@ -3336,34 +3302,18 @@ static __net_initdata struct pernet_operations sysctl_route_ops = {
 };
 #endif
 
-
-static __net_init int rt_secret_timer_init(struct net *net)
+static __net_init int rt_genid_init(struct net *net)
 {
-	atomic_set(&net->ipv4.rt_genid,
-			(int) ((num_physpages ^ (num_physpages>>8)) ^
-			(jiffies ^ (jiffies >> 7))));
-
-	net->ipv4.rt_secret_timer.function = rt_secret_rebuild;
-	net->ipv4.rt_secret_timer.data = (unsigned long)net;
-	init_timer_deferrable(&net->ipv4.rt_secret_timer);
-
-	if (ip_rt_secret_interval) {
-		net->ipv4.rt_secret_timer.expires =
-			jiffies + net_random() % ip_rt_secret_interval +
-			ip_rt_secret_interval;
-		add_timer(&net->ipv4.rt_secret_timer);
-	}
+	/*
+	 * This just serves to start off each new net namespace
+	 * with a non-zero rt_genid value, making it harder to guess
+	 */
+	rt_cache_invalidate(net);
 	return 0;
 }
 
-static __net_exit void rt_secret_timer_exit(struct net *net)
-{
-	del_timer_sync(&net->ipv4.rt_secret_timer);
-}
-
-static __net_initdata struct pernet_operations rt_secret_timer_ops = {
-	.init = rt_secret_timer_init,
-	.exit = rt_secret_timer_exit,
+static __net_initdata struct pernet_operations rt_genid_ops = {
+	.init = rt_genid_init,
 };
 
 
@@ -3424,9 +3374,6 @@ int __init ip_rt_init(void)
 	schedule_delayed_work(&expires_work,
 		net_random() % ip_rt_gc_interval + ip_rt_gc_interval);
 
-	if (register_pernet_subsys(&rt_secret_timer_ops))
-		printk(KERN_ERR "Unable to setup rt_secret_timer\n");
-
 	if (ip_rt_proc_init())
 		printk(KERN_ERR "Unable to create route proc files\n");
 #ifdef CONFIG_XFRM
@@ -3438,6 +3385,7 @@ int __init ip_rt_init(void)
 #ifdef CONFIG_SYSCTL
 	register_pernet_subsys(&sysctl_route_ops);
 #endif
+	register_pernet_subsys(&rt_genid_ops);
 	return rc;
 }
 

^ permalink raw reply related

* [PATCH] fix multi-buffer logging with mergeable buffers
From: David L Stevens @ 2010-05-07 20:11 UTC (permalink / raw)
  To: mst; +Cc: netdev

This patch fixes the multibuffer case of logging with
mergeable buffers.

Signed-off-by: David L Stevens <dlstevens@us.ibm.com>

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 85519b4..d526b68 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -236,8 +236,9 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
 	int seg = 0;
 	int headcount = 0;
 	unsigned d;
-	int r;
+	int r, nlogs;
 
+	nlogs = 0;
 	while (datalen > 0) {
 		if (unlikely(headcount >= VHOST_NET_MAX_SG)) {
 			r = -ENOBUFS;
@@ -245,7 +246,8 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
 		}
 		d = vhost_get_desc(vq->dev, vq, vq->iov + seg,
 				   ARRAY_SIZE(vq->iov) - seg, &out,
-				   &in, log, log_num);
+				   &in, log+nlogs, log_num);
+		nlogs += log ? *log_num : 0;
 		if (d == vq->num) {
 			r = 0;
 			goto err;
@@ -263,6 +265,8 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
 		seg += in;
 	}
 	*iovcount = seg;
+	if (unlikely(log))
+		*log_num = nlogs;
 	return headcount;
 err:
 	vhost_discard_desc(vq, headcount);



^ permalink raw reply related

* Re: [PATCH] ipv4: remove ip_rt_secret timer (v3)
From: Eric Dumazet @ 2010-05-07 21:04 UTC (permalink / raw)
  To: Neil Horman; +Cc: netdev, davem, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <20100507195528.GA3752@hmsreliant.think-freely.org>

Le vendredi 07 mai 2010 à 15:55 -0400, Neil Horman a écrit :
> Hey-
> 	Sorry for the delay, but I got interested in Erics suggestion of
> changing how we update rt_genid, and had a few thoughts.  Heres version 3 of
> this patch:
> 

We have time, no hurry Neil.

> Change Notes:
> 
> 1) Removed the secret_interval binary interface entry in the list (forgot to do
> that before)
> 
> 2) Took Erics Suggestion to change the update for net->ipv4.rt_genid.  Now
> instead of doing a small incremental change, we simply grab 32 new random bits.
> 

My suggestion was to initialize _once_ at boot time, the _full_ 32bits.

Not to change the perturbations, they are very fine, and need no extra
CONFIG_SOME_MAGICAL_SWITCH.

We have a guarantee that no duplicates are delivered unless you perform
2^24 generations in a short period of time.

But because you want to change full 32bits, you need a complex dupcheck
thing ?

> 3) The change in (2) got me thinking that part of the reason we used the Jenkins
> hash in rt_genid was to ensure non-repetion of id's in a short time period
> (which is important, so as not to inadvertently reuse route cache entries that
> should be getting expired).  While extra randomness makes it harder for
> attackers to guess the secret, it makes it possible to return to previously
> guessed values as well (if they're lucky).  As such, I created a configurable
> option, CONFIG_GENID_DUPCHECK.  With this option on, the low order 8 bits of the
> genid are replaced with a rolling counter, that increments on each new genid.
> This creates in effect, a 256 deep list of previously used genid values.  In
> rt_drop we compare the genids for duplicates.  If we find that 2 genid values
> have different indexes, but idential remaining bits, they are noted as a repeat
> genid, and we call rt_cache_invalidate to generate a new genid and avoid the
> duplication problem.
> 
> 

This is not necessary and over engineering if you ask me.

You now rely on probabilistic rules, and depends on get_random_bytes()
be really random, or a new CONFIG setting...

What exact problem do you want to solve Neil ?

Please submit your initial patch, with the small changes :

1) Remove the secret_interval binary interface entry in the list 

2) Initialize full 32bits at struct net init time.

Thanks




^ permalink raw reply

* Re: TCP-MD5 checksum failure on x86_64 SMP
From: Eric Dumazet @ 2010-05-07 21:18 UTC (permalink / raw)
  To: Bhaskar Dutta; +Cc: Stephen Hemminger, Ben Hutchings, netdev, David Miller
In-Reply-To: <1273247090.2261.81.camel@edumazet-laptop>

Le vendredi 07 mai 2010 à 17:44 +0200, Eric Dumazet a écrit :
> Le vendredi 07 mai 2010 à 17:18 +0200, Eric Dumazet a écrit :
> > OK, I found the second problem.
> > 
> > if/when IP route cache is invalidated, ip_queue_xmit() has to refetch a
> > route and calls sk_setup_caps(sk, &rt->u.dst), destroying the 
> > 
> > sk->sk_route_caps &= ~NETIF_F_GSO_MASK
> > 
> > that MD5 desesperatly try to make all over its way (from
> > tcp_transmit_skb() for example)
> > 
> > So we send few bad packets, and everything is fine when
> > tcp_transmit_skb() is called again.
> > 
> > You get many errors on remote peer if you do
> > 
> > ip route flush cache
> > 


Patch solves the problem for me. I tested it with 200 MD5 sockets
established between two 16 cpus machine, with a multiqueue NIC. Trafic
of 100.000 pps per second, and "ip route flush cache" every minute on
both machines. After five hours, not a single frame had a bad hash
value.

Here is the official submission.

[PATCH] net: Introduce sk_route_nocaps

TCP-MD5 sessions have intermittent failures, when route cache is
invalidated. ip_queue_xmit() has to find a new route, calls
sk_setup_caps(sk, &rt->u.dst), destroying the 

sk->sk_route_caps &= ~NETIF_F_GSO_MASK

that MD5 desperately try to make all over its way (from
tcp_transmit_skb() for example)

So we send few bad packets, and everything is fine when
tcp_transmit_skb() is called again for this socket.

Since ip_queue_xmit() is at a lower level than TCP-MD5, I chose to use a
socket field, sk_route_nocaps, containing bits to mask on sk_route_caps.

Reported-by: Bhaskar Dutta <bhaskie@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/sock.h    |    8 ++++++++
 net/core/sock.c       |    1 +
 net/ipv4/tcp_ipv4.c   |    6 +++---
 net/ipv4/tcp_output.c |    2 +-
 net/ipv6/tcp_ipv6.c   |    4 ++--
 5 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 1ad6435..abfadfe 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -177,6 +177,7 @@ struct sock_common {
   *		   %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
   *	@sk_no_check: %SO_NO_CHECK setting, wether or not checkup packets
   *	@sk_route_caps: route capabilities (e.g. %NETIF_F_TSO)
+  *	@sk_route_nocaps: forbidden route capabilities (e.g NETIF_F_GSO_MASK)
   *	@sk_gso_type: GSO type (e.g. %SKB_GSO_TCPV4)
   *	@sk_gso_max_size: Maximum GSO segment size to build
   *	@sk_lingertime: %SO_LINGER l_linger setting
@@ -276,6 +277,7 @@ struct sock {
 	int			sk_forward_alloc;
 	gfp_t			sk_allocation;
 	int			sk_route_caps;
+	int			sk_route_nocaps;
 	int			sk_gso_type;
 	unsigned int		sk_gso_max_size;
 	int			sk_rcvlowat;
@@ -1257,6 +1259,12 @@ static inline int sk_can_gso(const struct sock *sk)
 
 extern void sk_setup_caps(struct sock *sk, struct dst_entry *dst);
 
+static inline void sk_nocaps_add(struct sock *sk, int flags)
+{
+	sk->sk_route_nocaps |= flags;
+	sk->sk_route_caps &= ~flags;
+}
+
 static inline int skb_copy_to_page(struct sock *sk, char __user *from,
 				   struct sk_buff *skb, struct page *page,
 				   int off, int copy)
diff --git a/net/core/sock.c b/net/core/sock.c
index c5812bb..5056a6a 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1227,6 +1227,7 @@ void sk_setup_caps(struct sock *sk, struct dst_entry *dst)
 	sk->sk_route_caps = dst->dev->features;
 	if (sk->sk_route_caps & NETIF_F_GSO)
 		sk->sk_route_caps |= NETIF_F_GSO_SOFTWARE;
+	sk->sk_route_caps &= ~sk->sk_route_nocaps;
 	if (sk_can_gso(sk)) {
 		if (dst->header_len) {
 			sk->sk_route_caps &= ~NETIF_F_GSO_MASK;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 3c23e70..f1a1dd9 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -894,7 +894,7 @@ int tcp_v4_md5_do_add(struct sock *sk, __be32 addr,
 				kfree(newkey);
 				return -ENOMEM;
 			}
-			sk->sk_route_caps &= ~NETIF_F_GSO_MASK;
+			sk_nocaps_add(sk, NETIF_F_GSO_MASK);
 		}
 		if (tcp_alloc_md5sig_pool(sk) == NULL) {
 			kfree(newkey);
@@ -1024,7 +1024,7 @@ static int tcp_v4_parse_md5_keys(struct sock *sk, char __user *optval,
 			return -EINVAL;
 
 		tp->md5sig_info = p;
-		sk->sk_route_caps &= ~NETIF_F_GSO_MASK;
+		sk_nocaps_add(sk, NETIF_F_GSO_MASK);
 	}
 
 	newkey = kmemdup(cmd.tcpm_key, cmd.tcpm_keylen, sk->sk_allocation);
@@ -1465,7 +1465,7 @@ struct sock *tcp_v4_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 		if (newkey != NULL)
 			tcp_v4_md5_do_add(newsk, newinet->inet_daddr,
 					  newkey, key->keylen);
-		newsk->sk_route_caps &= ~NETIF_F_GSO_MASK;
+		sk_nocaps_add(newsk, NETIF_F_GSO_MASK);
 	}
 #endif
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0dda86e..0193a39 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -872,7 +872,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 #ifdef CONFIG_TCP_MD5SIG
 	/* Calculate the MD5 hash, as we have all we need now */
 	if (md5) {
-		sk->sk_route_caps &= ~NETIF_F_GSO_MASK;
+		sk_nocaps_add(sk, NETIF_F_GSO_MASK);
 		tp->af_specific->calc_md5_hash(opts.hash_location,
 					       md5, sk, NULL, skb);
 	}
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 075f540..bf34893 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -600,7 +600,7 @@ static int tcp_v6_md5_do_add(struct sock *sk, struct in6_addr *peer,
 				kfree(newkey);
 				return -ENOMEM;
 			}
-			sk->sk_route_caps &= ~NETIF_F_GSO_MASK;
+			sk_nocaps_add(sk, NETIF_F_GSO_MASK);
 		}
 		if (tcp_alloc_md5sig_pool(sk) == NULL) {
 			kfree(newkey);
@@ -737,7 +737,7 @@ static int tcp_v6_parse_md5_keys (struct sock *sk, char __user *optval,
 			return -ENOMEM;
 
 		tp->md5sig_info = p;
-		sk->sk_route_caps &= ~NETIF_F_GSO_MASK;
+		sk_nocaps_add(sk, NETIF_F_GSO_MASK);
 	}
 
 	newkey = kmemdup(cmd.tcpm_key, cmd.tcpm_keylen, GFP_KERNEL);



^ permalink raw reply related

* [PATCH 1/1] iproute2: fix addrlabel interface names handling
From: Florian Westphal @ 2010-05-07 21:31 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal

ip addrlabel outputs if%d names due to missing init call:
$ ip addrlabel s
prefix a::42/128 dev if4 label 1000

Also, ip did not accept "if%d" interfaces on input.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 ip/ipaddrlabel.c |    2 ++
 lib/ll_map.c     |    6 +++++-
 2 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/ip/ipaddrlabel.c b/ip/ipaddrlabel.c
index cf438d6..95e813d 100644
--- a/ip/ipaddrlabel.c
+++ b/ip/ipaddrlabel.c
@@ -252,6 +252,8 @@ static int ipaddrlabel_flush(int argc, char **argv)
 
 int do_ipaddrlabel(int argc, char **argv)
 {
+	ll_init_map(&rth);
+
 	if (argc < 1) {
 		return ipaddrlabel_list(0, NULL);
 	} else if (matches(argv[0], "list") == 0 ||
diff --git a/lib/ll_map.c b/lib/ll_map.c
index 5addf4a..b8b49aa 100644
--- a/lib/ll_map.c
+++ b/lib/ll_map.c
@@ -161,6 +161,7 @@ unsigned ll_name_to_index(const char *name)
 	static int icache;
 	struct idxmap *im;
 	int i;
+	unsigned idx;
 
 	if (name == NULL)
 		return 0;
@@ -176,7 +177,10 @@ unsigned ll_name_to_index(const char *name)
 		}
 	}
 
-	return if_nametoindex(name);
+	idx = if_nametoindex(name);
+	if (idx == 0)
+		sscanf(name, "if%u", &idx);
+	return idx;
 }
 
 int ll_init_map(struct rtnl_handle *rth)
-- 
1.6.4.4


^ permalink raw reply related

* [PATCH 1/1] ipv6 addrlabel: permit deletion of labels assigned to removed dev
From: Florian Westphal @ 2010-05-07 21:31 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal

as addrlabels with an interface index are left alone when the
interface gets removed this results in addrlabels that can no
longer be removed.

Restrict validation of index to adding new addrlabels.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/ipv6/addrlabel.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/addrlabel.c b/net/ipv6/addrlabel.c
index 6ff73c4..8342296 100644
--- a/net/ipv6/addrlabel.c
+++ b/net/ipv6/addrlabel.c
@@ -421,10 +421,6 @@ static int ip6addrlbl_newdel(struct sk_buff *skb, struct nlmsghdr *nlh,
 	    ifal->ifal_prefixlen > 128)
 		return -EINVAL;
 
-	if (ifal->ifal_index &&
-	    !__dev_get_by_index(net, ifal->ifal_index))
-		return -EINVAL;
-
 	if (!tb[IFAL_ADDRESS])
 		return -EINVAL;
 
@@ -440,6 +436,10 @@ static int ip6addrlbl_newdel(struct sk_buff *skb, struct nlmsghdr *nlh,
 
 	switch(nlh->nlmsg_type) {
 	case RTM_NEWADDRLABEL:
+		if (ifal->ifal_index &&
+		    !__dev_get_by_index(net, ifal->ifal_index))
+			return -EINVAL;
+
 		err = ip6addrlbl_add(net, pfx, ifal->ifal_prefixlen,
 				     ifal->ifal_index, label,
 				     nlh->nlmsg_flags & NLM_F_REPLACE);
-- 
1.6.4.4


^ permalink raw reply related

* Re: TCP-MD5 checksum failure on x86_64 SMP
From: Eric Dumazet @ 2010-05-07 21:40 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, bhaskie, bhutchings, netdev
In-Reply-To: <20100507103639.4f1a51fa@nehalam>

Le vendredi 07 mai 2010 à 10:36 -0700, Stephen Hemminger a écrit :
> On Fri, 07 May 2010 19:21:33 +0200
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > Le vendredi 07 mai 2010 à 10:14 -0700, Stephen Hemminger a écrit :
> > 
> > > Forget the per cpu data; the pool should just be scrapped.
> > > 
> > > The only reason the pool exists is that the crypto hash state which
> > > should just be moved into the md5_info (88 bytes).  The pseudo
> > > header can just be generated on the stack before passing to the crypto
> > > code.
> > 
> > 
> > Sure, but I'm afraid there is no generic API do do that (if we want to
> > reuse crypto/md5.c code).
> 
> It looks like the pool is just an optimization to avoid opening too
> many crypto API connections.  This should only be an issue if offloading
> MD5.

You mean we could allocate two contexts per socket, one for tx path, one
for rx path, but TCP-MD5 implementors chose to use percpu allocations to
factorize them. They should have allocated two contexts per cpu (one for
process context, preemption disabled, one for BH context)

As you said, this could be allocated on stack, with some changes to
crypto API I guess. Since TCP is not a module, md5 is also static, so
there is no module loading involved.

struct crypto_tfm *__crypto_alloc_tfm(struct crypto_alg *alg, u32
type,u32 mask)

-->

struct crypto_tfm *__crypto_alloc_tfm_onstack(struct crypto_alg *alg,
u32 type, u32 mask, void *storage, size_t storage_max_length)


Or a direct plug to crypto/md5.c functions and hand crafted context.




^ permalink raw reply

* Re: [stable]  ixgbe: Fix return of invalid txq
From: Greg KH @ 2010-05-07 22:25 UTC (permalink / raw)
  To: Brandeburg, Jesse; +Cc: stable, netdev, linux-kernel, brandon, David Miller
In-Reply-To: <alpine.WNT.2.00.1005031333200.4376@jbrandeb-desk1.amr.corp.intel.com>

On Mon, May 03, 2010 at 01:56:57PM -0700, Brandeburg, Jesse wrote:
> Please consider commit fdd3d631cddad20ad9d3e1eb7dbf26825a8a121f for 
> inclusion in 2.6.32.y (it is already in 2.6.33.y)
> 
> Here is the commit message, it fixes a panic on machines with a larger 
> number of cpus than ixgbe has tx queues (64).
> 
> commit fdd3d631cddad20ad9d3e1eb7dbf26825a8a121f

This doesn't apply at all on the .32-stable tree, as the logic has
changed there.  Care to properly backport it if you really need it
there?

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH] ipv4: remove ip_rt_secret timer (v3)
From: Neil Horman @ 2010-05-07 23:15 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, davem, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <1273266271.2325.16.camel@edumazet-laptop>

On Fri, May 07, 2010 at 11:04:31PM +0200, Eric Dumazet wrote:
> Le vendredi 07 mai 2010 à 15:55 -0400, Neil Horman a écrit :
> > Hey-
> > 	Sorry for the delay, but I got interested in Erics suggestion of
> > changing how we update rt_genid, and had a few thoughts.  Heres version 3 of
> > this patch:
> > 
> 
> We have time, no hurry Neil.
> 
Yeah, but if I don't keep on top of they slip off my todo list pretty quick :)

> > Change Notes:
> > 
> > 1) Removed the secret_interval binary interface entry in the list (forgot to do
> > that before)
> > 
> > 2) Took Erics Suggestion to change the update for net->ipv4.rt_genid.  Now
> > instead of doing a small incremental change, we simply grab 32 new random bits.
> > 
> 
> My suggestion was to initialize _once_ at boot time, the _full_ 32bits.
> 
Apologies, my read of your statement was that you wanted to randomize the genid
every iteration, not just the first, to avoid the genid n+1 being within 256 of
the last genid.

> Not to change the perturbations, they are very fine, and need no extra
> CONFIG_SOME_MAGICAL_SWITCH.
> 
> We have a guarantee that no duplicates are delivered unless you perform
> 2^24 generations in a short period of time.
> 
Yes, I mentioned that, thats why I added the index check.

> But because you want to change full 32bits, you need a complex dupcheck
> thing ?
> 
We don't _need_ it, thats why I made it configurable.

> > 3) The change in (2) got me thinking that part of the reason we used the Jenkins
> > hash in rt_genid was to ensure non-repetion of id's in a short time period
> > (which is important, so as not to inadvertently reuse route cache entries that
> > should be getting expired).  While extra randomness makes it harder for
> > attackers to guess the secret, it makes it possible to return to previously
> > guessed values as well (if they're lucky).  As such, I created a configurable
> > option, CONFIG_GENID_DUPCHECK.  With this option on, the low order 8 bits of the
> > genid are replaced with a rolling counter, that increments on each new genid.
> > This creates in effect, a 256 deep list of previously used genid values.  In
> > rt_drop we compare the genids for duplicates.  If we find that 2 genid values
> > have different indexes, but idential remaining bits, they are noted as a repeat
> > genid, and we call rt_cache_invalidate to generate a new genid and avoid the
> > duplication problem.
> > 
> > 
> 
> This is not necessary and over engineering if you ask me.
> 
I can't say I disagree, but I was looking at this change based on your
suggestion.

> You now rely on probabilistic rules, and depends on get_random_bytes()
> be really random, or a new CONFIG setting...
> 
> What exact problem do you want to solve Neil ?
> 
You know good and well what I'm trying to do here, don't be thick.  The only
reason I was making changes to the genid in the first place was because you were
asking for them.  I'm more than happy to make a simpler version of this.
Apologies for not interpreting your previous request the way you had intended.

> Please submit your initial patch, with the small changes :
> 
> 1) Remove the secret_interval binary interface entry in the list 
> 
> 2) Initialize full 32bits at struct net init time.
> 
Yeah, ok.  I'll repost on monday.


Thanks
Neil




^ permalink raw reply

* Re: [PATCH][v4] tcp: fix ICMP-RTO war
From: Jerry Chu @ 2010-05-07 23:25 UTC (permalink / raw)
  To: damian; +Cc: ilpo.jarvinen, netdev
In-Reply-To: <o2z349f35ee1005071622z38fcd66ek398402d7512542ae@mail.gmail.com>

Hi,

I'm working on a patch that tries to measure and use the RTT for the passive
open side when the TS option is NOT enabled. My code conflicts with your
recently added "tcp_ack_update_rtt(sk, 0, 0);" Could you tell me why do you
force this call for the no-TS case when obviously "0" is not a measured RTT?
If you try to force icsk_rto to be initialized correctly, it is
already initialized to
TCP_TIMEOUT_INIT by tcp_create_openreq_child(). What am I missing?

Thanks,

Jerry

>From: David Miller <davem@davemloft.net>
>
> Date: Sun, Feb 21, 2010 at 7:10 PM
> Subject: Re: [PATCH][v4] tcp: fix ICMP-RTO war
> To: ilpo.jarvinen@helsinki.fi
> Cc: damian@tvk.rwth-aachen.de, netdev@vger.kernel.org
>
>
> From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
> Date: Tue, 16 Feb 2010 14:45:25 +0200 (EET)
>
> > On Wed, 10 Feb 2010, Damian Lukowski wrote:
> >
> >> @@ -5783,12 +5783,10 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
> >>
> >>                              /* tcp_ack considers this ACK as duplicate
> >>                               * and does not calculate rtt.
> >> -                             * Fix it at least with timestamps.
> >> +                             * Force it here.
> >>                               */
> >> -                            if (tp->rx_opt.saw_tstamp &&
> >> -                                tp->rx_opt.rcv_tsecr && !tp->srtt)
> >> -                                    tcp_ack_saw_tstamp(sk, 0);
> >> -
> >> +                            tcp_ack_update_rtt(sk, 0, 0);
> >> +
> >
> > ...Here a zero seq_rtt is given to RTT estimator (it will be effective
> > only in the case w/o timestamps, TS case recalculates it from the stored
> > timestamps). Maybe we could use some field (timestamp related one comes to
> > my mind) in request sock to get a real RTT estimate for non-timestamp case
> > too. ...It seems possible to me, though tricky because the request_sock is
> > no longer that easily available here so some parameter passing would be
> > needed.
>
> Agreed.
>
> But even more simply I think we should make even the current
> tcp_ack_update_rtt() call here conditional on at least
> tp->srtt being zero.
>
> Damian do you at least agree with that?
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: Fw: [Bug 15907] New: IP_ADD_SOURCE_MEMBERSHIP after IP_ADD_MEMBERSHIP join on same multicast-group dont return EINVAL
From: David Stevens @ 2010-05-07 23:50 UTC (permalink / raw)
  To: David Miller; +Cc: Herbert Xu, netdev, netdev-owner, Stephen Hemminger
In-Reply-To: <OFD4CF7550.98F2D787-ON8825771A.005CD3AE-8825771A.005E4045@us.ibm.com>

Dave,
        Do you have an opinion on this one? Linux code (for many years 
now)
allows you to do SSM calls after an ordinary ASM join, and vice-versa. The 
bugzilla
basically wants to see mixing the calls return EINVAL.
        That's not so hard to do, but if existing apps are mixing them, 
they, of
course, will fail. That's the intent of the RFC, but the code predates it 
and
happily converts between ASM and SSM automatically.
        So, I'm guessing the answer is "will not fix," but
it's your call. If you think we want to enforce this, I'll do
a patch.
        The original post is here:

http://marc.info/?l=linux-netdev&m=127307292215281&w=2

                                                        +-DLS


^ permalink raw reply

* Re: [PATCH] net: deliver skbs on inactive slaves to exact matches
From: John Fastabend @ 2010-05-08  0:15 UTC (permalink / raw)
  To: Jay Vosburgh
  Cc: bonding-devel@lists.sourceforge.net, netdev@vger.kernel.org,
	Leech, Christopher, andy@greyhouse.net, kaber@trash.net
In-Reply-To: <4BE30C67.4030104@intel.com>

John Fastabend wrote:
> Jay Vosburgh wrote:
>> John Fastabend <john.r.fastabend@intel.com> wrote:
>>
>>> Currently, the accelerated receive path for VLAN's will
>>> drop packets if the real device is an inactive slave and
>>> is not one of the special pkts tested for in
>>> skb_bond_should_drop().  This behavior is different then
>>> the non-accelerated path and for pkts over a bonded vlan.
>>>
>>> For example,
>>>
>>> vlanx -> bond0 -> ethx
>>>
>>> will be dropped in the vlan path and not delivered to any
>>> packet handlers.  However,
>>>
>>> bond0 -> vlanx -> ethx
>>>
>>> will be delivered to handlers that match the exact dev,
>>> because the VLAN path checks the real_dev which is not a
>>> slave and netif_recv_skb() doesn't drop frames but only
>>> delivers them to exact matches.
>>>
>>> This patch adds a pkt_type PACKET_DROP which is now used
>>> to identify skbs that would previously been dropped and
>>> allows the skb to continue to skb_netif_recv().  Here we
>>> add logic to check for PACKET_DROP and if so only deliver
>>> to handlers that match exactly.  IMHO this is more
>>> consistent and gives pkt handlers a way to identify skbs
>>> that come from inactive slaves.
>> 	I was looking at this some yesterday and this morning, trying to
>> figure out if a packet going up the IP protocol stack with pkt_type ==
>> PACKET_DROP would break anything.  For example:
>>
>> net/ipv4/ip_options.c:
>> int ip_options_rcv_srr(struct sk_buff *skb)
>> {
>> [...]
>>         if (!opt->srr)
>>                 return 0;
>>
>>         if (skb->pkt_type != PACKET_HOST)
>>                 return -EINVAL;
>>
>> 	or:
>>
>> net/ipv4/tcp_ipv4.c:
>> int tcp_v4_rcv(struct sk_buff *skb)
>> {
>>         const struct iphdr *iph;
>>         struct tcphdr *th;
>>         struct sock *sk;
>>         int ret;
>>         struct net *net = dev_net(skb->dev);
>>
>>         if (skb->pkt_type != PACKET_HOST)
>>                 goto discard_it;
>>
>> 	Is pkt_type == PACKET_DROP instead going to break things for
>> these, or the other similar cases?
>>
>> 	I think what you're looking for is really the usual PACKET_HOST,
>> PACKET_OTHERHOST, et al, upper protocol behavior, with the exception
>> that at the __netif_receive_skb level, wildcard matches in the ptypes
>> are not done.  Setting the pkt_type to PACKET_DROP loses the _HOST,
>> _OTHERHOST, etc, information, but sends the packet up the stack anyway,
>> and I'm not sure that's really the right thing to do.
> 
> Because we wouldn't be doing wildcard matches the skb shouldn't be 
> passed to the IP protocol stack.  Really what we want is the ip_rcv() to 
> drop the skb in this case,
> 
> net/ipv4/ip_input.c
> int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct 
> packet_type *pt, struct net_device *orig_dev)
> {
> [...]
> 
> if (skb->pkt_type == PACKET_OTHERHOST ||
>      skb->pkt_type == PACKET_DROP)
> 	goto drop;
> 
> 
> We do lose some information about the packet _HOST, _OTHERHOST, etc, but 
> we also gain something namely that the packet was received on an 
> inactive slave.  Currently, we pass these frames up the stack (for non 
> wildcard matches) without any indication that they were received on an 
> inactive slave.
> 
> What we need is a way for the VLAN path to flag skbs so that wildcard 
> matches in the ptyes are not done.  Also I think it may be valuable for 
> the packet handler to be able to determine if the skb is coming from an 
> inactive slave.  I think using the pkt_type field might be OK any ideas 
> for a better field?
> 
> Thanks,
> John
> 

Another possibility which would keep the pkt_type info would be to add a 
  flag to the flags2 field in the sk_buff struct.  Seeing as there is 
already a u8 there for ndisc_nodetype this should work.

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 82f5116..bb1bc22 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -374,8 +374,13 @@ struct sk_buff {

         kmemcheck_bitfield_begin(flags2);
         __u16                   queue_mapping:16;
-#ifdef CONFIG_IPV6_NDISC_NODETYPE
+#if defined(CONTIF_IPV6_NDISC_NODETYPE) && defined(CONFIG_BONDING)
+       __u8                    ndisc_nodetype:2,
+                               bond_should_drop:1;
+#elif defined(CONFIG_IPV6_NDISC_NODETYPE)
         __u8                    ndisc_nodetype:2;
+#elif defined(CONFIG_BONDING)
+       __u8                    bond_should_drop:1;
  #endif
         kmemcheck_bitfield_end(flags2);



Thanks,
John

^ permalink raw reply related

* Re: [PATCH] ipv4: remove ip_rt_secret timer (v4)
From: Neil Horman @ 2010-05-08  1:01 UTC (permalink / raw)
  To: netdev; +Cc: davem, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <20100506171639.GA5063@hmsreliant.think-freely.org>

Hey, Had a moment tonight so I spun version 4 of the patch.

Change notes:
1) Redid the initialization, in light of Erics clarification.  We randomly seed
all 32 bits of the rt_genid for a netns, but still just do 256 byte
modifications on cache invalidations

2) got rid of the dup checking crap.


Summary:

A while back there was a discussion regarding the rt_secret_interval timer.
Given that we've had the ability to do emergency route cache rebuilds for awhile
now, based on a statistical analysis of the various hash chain lengths in the
cache, the use of the flush timer is somewhat redundant.  This patch removes the
rt_secret_interval sysctl, allowing us to rely solely on the statistical
analysis mechanism to determine the need for route cache flushes.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>


 include/net/netns/ipv4.h |    1 
 kernel/sysctl_binary.c   |    1 
 net/ipv4/route.c         |  108 +++--------------------------------------------
 3 files changed, 8 insertions(+), 102 deletions(-)


diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index ae07fee..d68c3f1 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -55,7 +55,6 @@ struct netns_ipv4 {
 	int sysctl_rt_cache_rebuild_count;
 	int current_rt_cache_rebuild_count;
 
-	struct timer_list rt_secret_timer;
 	atomic_t rt_genid;
 
 #ifdef CONFIG_IP_MROUTE
diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
index 5903057..937d31d 100644
--- a/kernel/sysctl_binary.c
+++ b/kernel/sysctl_binary.c
@@ -224,7 +224,6 @@ static const struct bin_table bin_net_ipv4_route_table[] = {
 	{ CTL_INT,	NET_IPV4_ROUTE_MTU_EXPIRES,		"mtu_expires" },
 	{ CTL_INT,	NET_IPV4_ROUTE_MIN_PMTU,		"min_pmtu" },
 	{ CTL_INT,	NET_IPV4_ROUTE_MIN_ADVMSS,		"min_adv_mss" },
-	{ CTL_INT,	NET_IPV4_ROUTE_SECRET_INTERVAL,		"secret_interval" },
 	{}
 };
 
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index a947428..dea3f92 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -129,7 +129,6 @@ static int ip_rt_gc_elasticity __read_mostly	= 8;
 static int ip_rt_mtu_expires __read_mostly	= 10 * 60 * HZ;
 static int ip_rt_min_pmtu __read_mostly		= 512 + 20 + 20;
 static int ip_rt_min_advmss __read_mostly	= 256;
-static int ip_rt_secret_interval __read_mostly	= 10 * 60 * HZ;
 static int rt_chain_length_max __read_mostly	= 20;
 
 static struct delayed_work expires_work;
@@ -918,32 +917,11 @@ void rt_cache_flush_batch(void)
 	rt_do_flush(!in_softirq());
 }
 
-/*
- * We change rt_genid and let gc do the cleanup
- */
-static void rt_secret_rebuild(unsigned long __net)
-{
-	struct net *net = (struct net *)__net;
-	rt_cache_invalidate(net);
-	mod_timer(&net->ipv4.rt_secret_timer, jiffies + ip_rt_secret_interval);
-}
-
-static void rt_secret_rebuild_oneshot(struct net *net)
-{
-	del_timer_sync(&net->ipv4.rt_secret_timer);
-	rt_cache_invalidate(net);
-	if (ip_rt_secret_interval)
-		mod_timer(&net->ipv4.rt_secret_timer, jiffies + ip_rt_secret_interval);
-}
-
 static void rt_emergency_hash_rebuild(struct net *net)
 {
-	if (net_ratelimit()) {
+	if (net_ratelimit())
 		printk(KERN_WARNING "Route hash chain too long!\n");
-		printk(KERN_WARNING "Adjust your secret_interval!\n");
-	}
-
-	rt_secret_rebuild_oneshot(net);
+	rt_cache_invalidate(net);
 }
 
 /*
@@ -3101,48 +3079,6 @@ static int ipv4_sysctl_rtcache_flush(ctl_table *__ctl, int write,
 	return -EINVAL;
 }
 
-static void rt_secret_reschedule(int old)
-{
-	struct net *net;
-	int new = ip_rt_secret_interval;
-	int diff = new - old;
-
-	if (!diff)
-		return;
-
-	rtnl_lock();
-	for_each_net(net) {
-		int deleted = del_timer_sync(&net->ipv4.rt_secret_timer);
-		long time;
-
-		if (!new)
-			continue;
-
-		if (deleted) {
-			time = net->ipv4.rt_secret_timer.expires - jiffies;
-
-			if (time <= 0 || (time += diff) <= 0)
-				time = 0;
-		} else
-			time = new;
-
-		mod_timer(&net->ipv4.rt_secret_timer, jiffies + time);
-	}
-	rtnl_unlock();
-}
-
-static int ipv4_sysctl_rt_secret_interval(ctl_table *ctl, int write,
-					  void __user *buffer, size_t *lenp,
-					  loff_t *ppos)
-{
-	int old = ip_rt_secret_interval;
-	int ret = proc_dointvec_jiffies(ctl, write, buffer, lenp, ppos);
-
-	rt_secret_reschedule(old);
-
-	return ret;
-}
-
 static ctl_table ipv4_route_table[] = {
 	{
 		.procname	= "gc_thresh",
@@ -3251,13 +3187,6 @@ static ctl_table ipv4_route_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
-	{
-		.procname	= "secret_interval",
-		.data		= &ip_rt_secret_interval,
-		.maxlen		= sizeof(int),
-		.mode		= 0644,
-		.proc_handler	= ipv4_sysctl_rt_secret_interval,
-	},
 	{ }
 };
 
@@ -3336,34 +3265,15 @@ static __net_initdata struct pernet_operations sysctl_route_ops = {
 };
 #endif
 
-
-static __net_init int rt_secret_timer_init(struct net *net)
+static __net_init int rt_genid_init(struct net *net)
 {
-	atomic_set(&net->ipv4.rt_genid,
-			(int) ((num_physpages ^ (num_physpages>>8)) ^
-			(jiffies ^ (jiffies >> 7))));
-
-	net->ipv4.rt_secret_timer.function = rt_secret_rebuild;
-	net->ipv4.rt_secret_timer.data = (unsigned long)net;
-	init_timer_deferrable(&net->ipv4.rt_secret_timer);
-
-	if (ip_rt_secret_interval) {
-		net->ipv4.rt_secret_timer.expires =
-			jiffies + net_random() % ip_rt_secret_interval +
-			ip_rt_secret_interval;
-		add_timer(&net->ipv4.rt_secret_timer);
-	}
+	get_random_bytes(&net->ipv4.rt_genid,
+			 sizeof(net->ipv4.rt_genid));
 	return 0;
 }
 
-static __net_exit void rt_secret_timer_exit(struct net *net)
-{
-	del_timer_sync(&net->ipv4.rt_secret_timer);
-}
-
-static __net_initdata struct pernet_operations rt_secret_timer_ops = {
-	.init = rt_secret_timer_init,
-	.exit = rt_secret_timer_exit,
+static __net_initdata struct pernet_operations rt_genid_ops = {
+	.init = rt_genid_init,
 };
 
 
@@ -3424,9 +3334,6 @@ int __init ip_rt_init(void)
 	schedule_delayed_work(&expires_work,
 		net_random() % ip_rt_gc_interval + ip_rt_gc_interval);
 
-	if (register_pernet_subsys(&rt_secret_timer_ops))
-		printk(KERN_ERR "Unable to setup rt_secret_timer\n");
-
 	if (ip_rt_proc_init())
 		printk(KERN_ERR "Unable to create route proc files\n");
 #ifdef CONFIG_XFRM
@@ -3438,6 +3345,7 @@ int __init ip_rt_init(void)
 #ifdef CONFIG_SYSCTL
 	register_pernet_subsys(&sysctl_route_ops);
 #endif
+	register_pernet_subsys(&rt_genid_ops);
 	return rc;
 }
 

^ permalink raw reply related

* Re: [Bug 15907] New: IP_ADD_SOURCE_MEMBERSHIP after IP_ADD_MEMBERSHIP join on same multicast-group dont return EINVAL
From: David Miller @ 2010-05-08  1:49 UTC (permalink / raw)
  To: dlstevens; +Cc: herbert, netdev, netdev-owner, shemminger
In-Reply-To: <OFF6B69D4E.4BD995AD-ON8825771C.0081F4BF-8825771C.0082FBC7@us.ibm.com>

From: David Stevens <dlstevens@us.ibm.com>
Date: Fri, 7 May 2010 16:50:44 -0700

>         Do you have an opinion on this one? Linux code (for many
> years now) allows you to do SSM calls after an ordinary ASM join,
> and vice-versa. The bugzilla basically wants to see mixing the calls
> return EINVAL.

We can't do that.

It will obviously break existing applications.

Someone please close the bug.

^ permalink raw reply

* Re: linux-next: build failure after merge of the suspend tree
From: Rafael J. Wysocki @ 2010-05-08  2:13 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: linux-next, linux-kernel, Helmut Schaa, John W. Linville,
	David Miller, netdev
In-Reply-To: <20100507130821.3fb72c9d.sfr@canb.auug.org.au>

On Friday 07 May 2010, Stephen Rothwell wrote:
> Hi Rafael,
> 
> After merging the suspend tree, today's linux-next build (x86_64
> allmodconfig) failed like this:
> 
> net/mac80211/scan.c: In function 'ieee80211_scan_state_decision':
> net/mac80211/scan.c:510: error: implicit declaration of function 'pm_qos_requirement'
> 
> Caused by commit 62bad14fc6e0911a99882c261390968977d43283 ("PM QOS
> update") from the suspend tree interacting with commit
> df13cce53a7b28a81460e6bfc4857e9df4956141 ("mac80211: Improve software
> scan timing") from the net tree.
> 
> I have added the following merge fixup patch and can carry it as
> necessary:

Thanks a lot, please do so if that's not a problem.

Both trees are based on Linus' current and I don't see a good way of fixing
this issue in any of them individually.

Rafael


> From: Stephen Rothwell <sfr@canb.auug.org.au>
> Date: Fri, 7 May 2010 13:02:54 +1000
> Subject: [PATCH] wireless: update for pm_qos_requirement to pm_qos_request rename
> 
> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
> ---
>  net/mac80211/scan.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/net/mac80211/scan.c b/net/mac80211/scan.c
> index e14c441..e1b0be7 100644
> --- a/net/mac80211/scan.c
> +++ b/net/mac80211/scan.c
> @@ -510,7 +510,7 @@ static int ieee80211_scan_state_decision(struct ieee80211_local *local,
>  		bad_latency = time_after(jiffies +
>  				ieee80211_scan_get_channel_time(next_chan),
>  				local->leave_oper_channel_time +
> -				usecs_to_jiffies(pm_qos_requirement(PM_QOS_NETWORK_LATENCY)));
> +				usecs_to_jiffies(pm_qos_request(PM_QOS_NETWORK_LATENCY)));
>  
>  		listen_int_exceeded = time_after(jiffies +
>  				ieee80211_scan_get_channel_time(next_chan),
> 

^ permalink raw reply

* [RFC-PATCH] caif: Add debug connection type for CAIF.
From: sjur.brandeland @ 2010-05-08  7:23 UTC (permalink / raw)
  To: netdev; +Cc: marcel, sjurbr, Sjur Braendeland

From: Sjur Braendeland <sjur.brandeland@stericsson.com>

Added new CAIF protocol type CAIFPROTO_DEBUG for accessing
CAIF debug on the ST Ericsson modems.

There are two debug servers on the modem, one for radio related
debug (CAIF_RADIO_DEBUG_SERVICE) and the other for
communication/application related debug (CAIF_COM_DEBUG_SERVICE).

The debug connection can contain trace debug printouts or
interactive debug used for debugging and test.

Debug connections can be of type STREAM or SEQPACKET.

Feedback is welcomed.
---
 include/linux/caif/caif_socket.h |   36 ++++++++++++++++++++++++++++++++++++
 net/caif/caif_config_util.c      |    5 +++++
 net/caif/caif_socket.c           |    5 -----
 3 files changed, 41 insertions(+), 5 deletions(-)

diff --git a/include/linux/caif/caif_socket.h b/include/linux/caif/caif_socket.h
index 2a61eb1..1a8fd1f 100644
--- a/include/linux/caif/caif_socket.h
+++ b/include/linux/caif/caif_socket.h
@@ -62,6 +62,7 @@ enum caif_channel_priority {
  * @CAIFPROTO_DATAGRAM_LOOP:	Datagram loopback channel, used for testing.
  * @CAIFPROTO_UTIL:		Utility (Psock) channel.
  * @CAIFPROTO_RFM:		Remote File Manager
+ * @CAIFPROTO_DEBUG:		Debug link
  *
  * This enum defines the CAIF Channel type to be used. This defines
  * the service to connect to on the modem.
@@ -72,6 +73,7 @@ enum caif_protocol_type {
 	CAIFPROTO_DATAGRAM_LOOP,
 	CAIFPROTO_UTIL,
 	CAIFPROTO_RFM,
+	CAIFPROTO_DEBUG,
 	_CAIFPROTO_MAX
 };
 #define	CAIFPROTO_MAX _CAIFPROTO_MAX
@@ -85,6 +87,29 @@ enum caif_at_type {
 };
 
 /**
+ * enum caif_debug_type - Content selection for debug connection
+ * @CAIF_DEBUG_TRACE_INTERACTIVE: Connection will contain
+ *				both trace and interactive debug.
+ * @CAIF_DEBUG_TRACE:		Connection contains trace only.
+ * @CAIF_DEBUG_INTERACTIVE:	Connection to interactive debug.
+ */
+enum caif_debug_type {
+	CAIF_DEBUG_TRACE_INTERACTIVE = 0,
+	CAIF_DEBUG_TRACE,
+	CAIF_DEBUG_INTERACTIVE,
+};
+
+/**
+ * enum caif_debug_service - Debug Service to connect.
+ * @CAIF_RADIO_DEBUG_SERVICE:	Debug on the Radio sub-system
+ * @CAIF_COM_DEBUG_SERVICE:	Debug on the communication sub-system
+ */
+enum caif_debug_service {
+	CAIF_RADIO_DEBUG_SERVICE = 1,
+	CAIF_COM_DEBUG_SERVICE
+};
+
+/**
  * struct sockaddr_caif - the sockaddr structure for CAIF sockets.
  * @family:		     Address family number, must be AF_CAIF.
  * @u:			     Union of address data 'switched' by family.
@@ -109,6 +134,13 @@ enum caif_at_type {
  *
  * @u.rfm.volume:            Volume to mount.
  *
+ * @u.dbg:                    Applies when family = CAIFPROTO_DEBUG.
+ *
+ * @u.dbg.type:		      Type of debug connection to set up
+ *			       (caif_debug_type).
+ *
+ * @u.dbg.service:            Service sub-system to connect (caif_debug_service)
+ *
  * Description:
  * This structure holds the connect parameters used for setting up a
  * CAIF Channel. It defines the service to connect to on the modem.
@@ -130,6 +162,10 @@ struct sockaddr_caif {
 			__u32 connection_id;
 			char	  volume[16];
 		} rfm;				/* CAIFPROTO_RFM */
+		struct {
+			__u8  type;		/* type:enum caif_debug_type */
+			__u8  service;		/* service:caif_debug_service */
+		} dbg;				/* CAIFPROTO_DEBUG */
 	} u;
 };
 
diff --git a/net/caif/caif_config_util.c b/net/caif/caif_config_util.c
index 6f36580..76ae683 100644
--- a/net/caif/caif_config_util.c
+++ b/net/caif/caif_config_util.c
@@ -80,6 +80,11 @@ int connect_req_to_link_param(struct cfcnfg *cnfg,
 		       l->u.utility.paramlen);
 
 		break;
+	case CAIFPROTO_DEBUG:
+		l->linktype = CFCTRL_SRV_DBG;
+		l->endpoint = s->sockaddr.u.dbg.service;
+		l->chtype = s->sockaddr.u.dbg.type;
+		break;
 	default:
 		return -EINVAL;
 	}
diff --git a/net/caif/caif_socket.c b/net/caif/caif_socket.c
index c3a70c5..4e6c610 100644
--- a/net/caif/caif_socket.c
+++ b/net/caif/caif_socket.c
@@ -843,11 +843,6 @@ static int caif_connect(struct socket *sock, struct sockaddr *uaddr,
 	if (uaddr->sa_family != AF_CAIF)
 		goto out;
 
-	err = -ESOCKTNOSUPPORT;
-	if (unlikely(!(sk->sk_type == SOCK_STREAM &&
-		       cf_sk->sk.sk_protocol == CAIFPROTO_AT) &&
-		       sk->sk_type != SOCK_SEQPACKET))
-		goto out;
 	switch (sock->state) {
 	case SS_UNCONNECTED:
 		/* Normal case, a fresh connect */
-- 
1.6.3.3


^ permalink raw reply related

* Re: [PATCH] ipv4: remove ip_rt_secret timer (v4)
From: Eric Dumazet @ 2010-05-08  6:36 UTC (permalink / raw)
  To: Neil Horman; +Cc: netdev, davem, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <20100508010108.GA3028@localhost.localdomain>

Le vendredi 07 mai 2010 à 21:01 -0400, Neil Horman a écrit :
> Hey, Had a moment tonight so I spun version 4 of the patch.
> 
> Change notes:
> 1) Redid the initialization, in light of Erics clarification.  We randomly seed
> all 32 bits of the rt_genid for a netns, but still just do 256 byte
> modifications on cache invalidations
> 
> 2) got rid of the dup checking crap.
> 
> 
> Summary:
> 
> A while back there was a discussion regarding the rt_secret_interval timer.
> Given that we've had the ability to do emergency route cache rebuilds for awhile
> now, based on a statistical analysis of the various hash chain lengths in the
> cache, the use of the flush timer is somewhat redundant.  This patch removes the
> rt_secret_interval sysctl, allowing us to rely solely on the statistical
> analysis mechanism to determine the need for route cache flushes.
> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> 

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

Sorry for the confusion Neil, thanks !



^ permalink raw reply

* RE: [RFC][PATCH v4 00/18] Provide a zero-copy method on KVM virtio-net.
From: Xin, Xiaohui @ 2010-05-08  7:55 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev@vger.kernel.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, mingo@elte.hu, davem@davemloft.net,
	jdike@linux.intel.com
In-Reply-To: <20100425121420.GB10238@redhat.com>

Michael,
Sorry, somehow I missed this mail. :-(

>> Here, we have ever considered 2 ways to utilize the page constructor
>> API to dispense the user buffers.
>> 
>> One:	Modify __alloc_skb() function a bit, it can only allocate a 
>> 	structure of sk_buff, and the data pointer is pointing to a 
>> 	user buffer which is coming from a page constructor API.
>> 	Then the shinfo of the skb is also from guest.
>> 	When packet is received from hardware, the skb->data is filled
>> 	directly by h/w. What we have done is in this way.
>> 
>> 	Pros:	We can avoid any copy here.
>> 	Cons:	Guest virtio-net driver needs to allocate skb as almost
>> 		the same method with the host NIC drivers, say the size
>> 		of netdev_alloc_skb() and the same reserved space in the
>> 		head of skb. Many NIC drivers are the same with guest and
>> 		ok for this. But some lastest NIC drivers reserves special
>> 		room in skb head. To deal with it, we suggest to provide
>> 		a method in guest virtio-net driver to ask for parameter
>> 		we interest from the NIC driver when we know which device 
>> 		we have bind to do zero-copy. Then we ask guest to do so.
>> 		Is that reasonable?

>Do you still do this?

Currently, we still use the first way. But we now ignore the room which 
host skb_reserve() required when device is doing zero-copy. Now we don't 
taint guest virtio-net driver with a new method by this way.

>> Two:	Modify driver to get user buffer allocated from a page constructor
>> 	API(to substitute alloc_page()), the user buffer are used as payload
>> 	buffers and filled by h/w directly when packet is received. Driver
>> 	should associate the pages with skb (skb_shinfo(skb)->frags). For 
>> 	the head buffer side, let host allocates skb, and h/w fills it. 
>> 	After that, the data filled in host skb header will be copied into
>> 	guest header buffer which is submitted together with the payload buffer.
>> 
>> 	Pros:	We could less care the way how guest or host allocates their
>> 		buffers.
>> 	Cons:	We still need a bit copy here for the skb header.
>> 
>> We are not sure which way is the better here. This is the first thing we want
>> to get comments from the community. We wish the modification to the network
>> part will be generic which not used by vhost-net backend only, but a user
>> application may use it as well when the zero-copy device may provides async
>> read/write operations later.

>I commented on this in the past. Do you still want comments?

Now we continue with the first way and try to push it. But any comments about the two methods are still welcome.

>That's nice. The thing to do is probably to enable GSO/TSO
>and see what we get this way. Also, mergeable buffer support
>was recently posted and I hope to merge it for 2.6.35.
>You might want to take a look.

I'm looking at the mergeable buffer. I think GSO/GRO support with zero-copy also needs it.
Currently, GSO/TSO is still not supported by vhost-net?
-- 
MST

^ permalink raw reply

* Re: [PATCH] ipv4: remove ip_rt_secret timer (v4)
From: David Miller @ 2010-05-08  8:58 UTC (permalink / raw)
  To: eric.dumazet; +Cc: nhorman, netdev, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <1273300597.2325.55.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 08 May 2010 08:36:37 +0200

> Le vendredi 07 mai 2010 à 21:01 -0400, Neil Horman a écrit :
>> A while back there was a discussion regarding the rt_secret_interval timer.
>> Given that we've had the ability to do emergency route cache rebuilds for awhile
>> now, based on a statistical analysis of the various hash chain lengths in the
>> cache, the use of the flush timer is somewhat redundant.  This patch removes the
>> rt_secret_interval sysctl, allowing us to rely solely on the statistical
>> analysis mechanism to determine the need for route cache flushes.
>> 
>> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
>> 
> 
> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied, thanks guys!

^ permalink raw reply

* Re: [PATCH][v4] tcp: fix ICMP-RTO war
From: Damian Lukowski @ 2010-05-08  8:30 UTC (permalink / raw)
  To: Jerry Chu; +Cc: ilpo.jarvinen, netdev
In-Reply-To: <n2gd1c2719f1005071625nd3df3d72t9655a1f64f2fd825@mail.gmail.com>

> I'm working on a patch that tries to measure and use the RTT for the passive
> open side when the TS option is NOT enabled. My code conflicts with your
> recently added "tcp_ack_update_rtt(sk, 0, 0);" Could you tell me why do you
> force this call for the no-TS case when obviously "0" is not a measured RTT?
> If you try to force icsk_rto to be initialized correctly, it is
> already initialized to
> TCP_TIMEOUT_INIT by tcp_create_openreq_child(). What am I missing?

Hi,
the backoff reversion code uses __tcp_set_rto() to recalculate icsk_rto,
which itself relies on tp->srtt and rttvar.
srtt is explicitly set to 0 in tcp_create_openreq_child(), so I didn't change it.
Initializing it with TCP_TIMEOUT_INIT should also fix that specific bug,
but I don't know if there are other impacts.

Regards
 Damian

> Thanks,
> 
> Jerry
> 
>> From: David Miller <davem@davemloft.net>
>>
>> Date: Sun, Feb 21, 2010 at 7:10 PM
>> Subject: Re: [PATCH][v4] tcp: fix ICMP-RTO war
>> To: ilpo.jarvinen@helsinki.fi
>> Cc: damian@tvk.rwth-aachen.de, netdev@vger.kernel.org
>>
>>
>> From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
>> Date: Tue, 16 Feb 2010 14:45:25 +0200 (EET)
>>
>>> On Wed, 10 Feb 2010, Damian Lukowski wrote:
>>>
>>>> @@ -5783,12 +5783,10 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
>>>>
>>>>                              /* tcp_ack considers this ACK as duplicate
>>>>                               * and does not calculate rtt.
>>>> -                             * Fix it at least with timestamps.
>>>> +                             * Force it here.
>>>>                               */
>>>> -                            if (tp->rx_opt.saw_tstamp &&
>>>> -                                tp->rx_opt.rcv_tsecr && !tp->srtt)
>>>> -                                    tcp_ack_saw_tstamp(sk, 0);
>>>> -
>>>> +                            tcp_ack_update_rtt(sk, 0, 0);
>>>> +
>>>
>>> ...Here a zero seq_rtt is given to RTT estimator (it will be effective
>>> only in the case w/o timestamps, TS case recalculates it from the stored
>>> timestamps). Maybe we could use some field (timestamp related one comes to
>>> my mind) in request sock to get a real RTT estimate for non-timestamp case
>>> too. ...It seems possible to me, though tricky because the request_sock is
>>> no longer that easily available here so some parameter passing would be
>>> needed.
>>
>> Agreed.
>>
>> But even more simply I think we should make even the current
>> tcp_ack_update_rtt() call here conditional on at least
>> tp->srtt being zero.
>>
>> Damian do you at least agree with that?
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox