Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: KVM induced panic on 2.6.38[2367] & 2.6.39
From: Brad Campbell @ 2011-06-08  0:15 UTC (permalink / raw)
  To: Bart De Schuymer
  Cc: Patrick McHardy, kvm, linux-mm, linux-kernel, netdev,
	netfilter-devel
In-Reply-To: <4DEE6815.7040504@pandora.be>

On 08/06/11 02:04, Bart De Schuymer wrote:

> If the bug is easily triggered with your guest os, then you could try to
> capture the traffic with wireshark (or something else) in a
> configuration that doesn't crash your system. Save the traffic in a pcap
> file. Then you can see if resending that traffic in the vulnerable
> configuration triggers the bug (I don't know if something in Windows
> exists, but tcpreplay should work for Linux). Once you have such a
> capture , chances are the bug is even easily reproducible by us (unless
> it's hardware-specific). Success isn't guaranteed, but I think it's
> worth a shot...

The issue with this is I don't have a configuration that does not crash 
the system. This only happens under the specific circumstance that 
traffic from VM A is being DNAT'd to VM B. If I disable 
CONFIG_BRIDGE_NETFILTER, or I leave out the DNAT then I can't replicate 
the problem as I don't seem to be able to get the packets to go where I 
want them to go.

Let me try and explain it a little more clearly with made up IP 
addresses to illustrate the problem.

I have VM A (1.1.1.2) and VM B (1.1.1.3) on br1 (1.1.1.1)
I have public IP on ppp0 (2.2.2.2).

VM B can talk to VM A using its host address (1.1.1.2) and there is no 
problem.

The DNAT says anything destined for PPP0 that is on port 443 and coming 
from anywhere other than PPP0 (ie inside the network) is to be DNAT'd to 
1.1.1.3.

So VM B (1.1.1.3) tries to connect to ppp0 (2.2.2.2) on port 443, and 
this is redirected to VM B on 1.1.1.2.

Only under this specific circumstance does the problem occur. I can get 
VM B (1.1.1.3) to talk directly to VM A (1.1.1.2) all day long and there 
is no problem, it's only when VM B tries to talk to ppp0 that there is 
an issue (and it happens within seconds of the initial connection).

All these tests have been performed with VM B being a Windows XP guest. 
Tonight I'll try it with a Linux guest and see if I can make it happen. 
If that works I might be able to come up with some reproducible test 
case for you. I have a desktop machine that has Intel VT extensions, so 
I'll work toward making a portable test case.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [RFC] should we care of COMPAT mode in bridge ?
From: Stephen Hemminger @ 2011-06-07 23:54 UTC (permalink / raw)
  To: David Miller; +Cc: arnd, eric.dumazet, netdev
In-Reply-To: <20110606.132315.1222364671200899036.davem@davemloft.net>

On Mon, 06 Jun 2011 13:23:15 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:

> From: Arnd Bergmann <arnd@arndb.de>
> Date: Mon, 6 Jun 2011 22:19:33 +0200
> 
> > I think it would be good to first understand why it doesn't work with the
> > new style ioctls that are in the kernel. The source code of brctl that
> > I'm looking at here contains:
> 
> Eric's 32-bit binary was built against older headers that didn't
> define the new ioctls.
> 
> That makes it pretty clear to me that we have to support those
> older ioctls in compat mode even after all these years.

That is really old, even Debian Lenny isn't that out of date.

^ permalink raw reply

* Re: [RFC] should we care of COMPAT mode in bridge ?
From: Stephen Hemminger @ 2011-06-07 23:51 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev
In-Reply-To: <20110606.130309.847468074358617705.davem@davemloft.net>

On Mon, 06 Jun 2011 13:03:09 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:

> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 06 Jun 2011 21:45:40 +0200
> 
> > While trying Alexander Holler patch, I found a 32bit brctl program was
> > not able to work on a 64bit kernel. So I had to switch to another
> > machine for my tests.
> > 
> > brctl addbr mybridge
> > ->
> > socket(PF_FILE, SOCK_STREAM, 0)         = 3
> > ioctl(3, SIOCSIFBR, 0xffd509c0)         = -1 EINVAL (Invalid argument)
> > 
> > Should we care or not ?
> 
> I think we should make an effort to fix this.

The problem is that most of the other ioctl's won't work because of use of
SIOCDEVPRIVATE.

^ permalink raw reply

* Re: KVM induced panic on 2.6.38[2367] & 2.6.39
From: Brad Campbell @ 2011-06-07 23:43 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Bart De Schuymer, kvm, linux-mm, linux-kernel, netdev,
	netfilter-devel
In-Reply-To: <4DEE4538.1020404@trash.net>

On 07/06/11 23:35, Patrick McHardy wrote:

> The main suspects would be NAT and TCPMSS. Did you also try whether
> the crash occurs with only one of these these rules?

To be honest I'm actually having trouble finding where TCPMSS is 
actually set in that ruleset. This is a production machine so I can only 
take it down after about 9PM at night. I'll have another crack at it 
tonight.

>> I've just compiled out CONFIG_BRIDGE_NETFILTER and can no longer access
>> the address the way I was doing it, so that's a no-go for me.
>
> That's really weird since you're apparently not using any bridge
> netfilter features. It shouldn't have any effect besides changing
> at which point ip_tables is invoked. How are your network devices
> configured (specifically any bridges)?
>

I have one bridge with all my virtual machines on it.

In this particular instance the packets leave VM A destined for the IP 
address of ppp0 (the external interface). This is intercepted by the 
DNAT PREROUTING rule above and shunted back to VM B.

The VM's are on br1 and the external address is ppp0. Without 
CONFIG_BRIDGE_NETFILTER compiled in I can see the traffic entering and 
leaving VM B with tcpdump, but the packets never seem to get back to VM A.

VM A is XP 32 bit, VM B is Linux. I have some other Linux VM's, so I'll 
do some more testing tonight between those to see where the packets are 
going without CONFIG_BRIDGE_NETFILTER set.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH] netfilter: nf_nat: avoid double nat for loopback
From: Patrick McHardy @ 2011-06-07 22:59 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: Pablo Neira Ayuso, netfilter-devel, netdev
In-Reply-To: <alpine.LFD.2.00.1106072234140.1619@ja.ssi.bg>

On 07.06.2011 21:46, Julian Anastasov wrote:
> 
> 	Hello,
> 
> On Tue, 7 Jun 2011, Patrick McHardy wrote:
> 
>> On 04.06.2011 16:02, Julian Anastasov wrote:
>>>
>>> 	Avoid double NAT and seq adjustment for loopback
>>> traffic because it causes silent repetition of TCP data. One
>>> example is passive FTP with DNAT rule and difference in the
>>> length of IP addresses.
>>>
>>> 	This patch adds checks if packet is sent and
>>> received via loopback device. As the same conntrack is used
>>> both for outgoing and incoming direction, we restrict NAT,
>>> seq adjustment and confirmation to happen only in
>>> outgoing direction (OUTPUT and POSTROUTING).
>>>
>>> Signed-off-by: Julian Anastasov <ja@ssi.bg>
>>> ---
>>>
>>> 	As the check is not so cheap, another alternative
>>> is to add new skb flag, eg. "loopback", that can be set in 
>>> drivers/net/loopback.c, loopback_xmit(). May be there is space
>>> for it in flags2?
>>
>> I don't think we should be adding code specifically needed for netfilter
>> to the loopback driver if we can avoid it. I don't think we need to
>> actually avoid calling nf_nat_packet twice, that shouldn't do any harm,
>> just the sequence number adjustment. So we could add the loopback check
> 
> 	Yes, may be calling nf_nat_packet is not fatal.
> 
>> to the IPS_SEQ_ADJUST_BIT case to at least avoid it in some cases.
>> Would that work or am I missing something?
> 
> 	Logically, the new check can be after
> test_bit(IPS_SEQ_ADJUST_BIT, &ct->status). But I suspect
> some modules adjust seqs in the helper->help call,
> for example, sip_help_tcp if I'm correctly reading the
> code.

Yes, you're right. But it's the only one since it's the only helper
doing possibly many modifications on a single TCP packet, which can't
be handled by the generic code properly. So if you're worried about
performance costs, I'd have no problems adding this check to the SIP
helper.

^ permalink raw reply

* Re: [RFC] breakage in sysfs_readdir() and s_instances abuse in sysfs
From: Al Viro @ 2011-06-07 22:59 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-fsdevel, Linus Torvalds, netdev, Linux Containers
In-Reply-To: <20110607215834.GR11521@ZenIV.linux.org.uk>

On Tue, Jun 07, 2011 at 10:58:34PM +0100, Al Viro wrote:
> On Mon, Jun 06, 2011 at 12:03:52PM -0700, Eric W. Biederman wrote:
> 
> > Other pieces of information that should be helpful to know.
> > - All sysfs directory entries for a network namespace should be
> >   removed from sysfs by the time sysfs_exit_ns is called.
> 
> Then why do we need to do _anything_ with ->ns[...]?  Is there any problem
> with postponing actual freeing of that sucker until after umount, so that
> memory doesn't get reused?  Since everything stale should be gone by the
> point when sysfs_exit_ns() would put NULL into ->ns[...], we can as well
> keep the old pointer in there - it won't match anything else for as long
> as the struct net is not freed...

OK, how about the following:
	* extra refcount in struct net, initialized to 1.
	* new method in kobj_ns_type_operations: put_ns.  For struct net
that'd be "decrement that refcount by 1, if it hits zero call net_free()".
Existing callers of net_free() replaced by calls of that function.
	* current_ns semantics change: for struct net it'd bump that refcount.
	* sysfs_kill_sb() does corresponding put_ns on everything it finds in
its ->ns[...]
	* so does sysfs_mount() if it decides to discard the sysfs_super_info
after having found a duplicate
	* sysfs_exit_ns() is gone, along with kobj_ns_exit(), net_kobj_ns_exit()
and kobj_net_ops.

Completely untested patch follows:

diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index 2668957..1e7a2ee 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -111,8 +111,11 @@ static struct dentry *sysfs_mount(struct file_system_type *fs_type,
 		info->ns[type] = kobj_ns_current(type);
 
 	sb = sget(fs_type, sysfs_test_super, sysfs_set_super, info);
-	if (IS_ERR(sb) || sb->s_fs_info != info)
+	if (IS_ERR(sb) || sb->s_fs_info != info) {
+		for (type = KOBJ_NS_TYPE_NONE; type < KOBJ_NS_TYPES; type++)
+			kobj_ns_put(type, info->ns[type]);
 		kfree(info);
+	}
 	if (IS_ERR(sb))
 		return ERR_CAST(sb);
 	if (!sb->s_root) {
@@ -131,11 +134,14 @@ static struct dentry *sysfs_mount(struct file_system_type *fs_type,
 static void sysfs_kill_sb(struct super_block *sb)
 {
 	struct sysfs_super_info *info = sysfs_info(sb);
+	int type;
 
 	/* Remove the superblock from fs_supers/s_instances
 	 * so we can't find it, before freeing sysfs_super_info.
 	 */
 	kill_anon_super(sb);
+	for (type = KOBJ_NS_TYPE_NONE; type < KOBJ_NS_TYPES; type++)
+		kobj_ns_put(type, info->ns[type]);
 	kfree(info);
 }
 
@@ -145,28 +151,6 @@ static struct file_system_type sysfs_fs_type = {
 	.kill_sb	= sysfs_kill_sb,
 };
 
-void sysfs_exit_ns(enum kobj_ns_type type, const void *ns)
-{
-	struct super_block *sb;
-
-	mutex_lock(&sysfs_mutex);
-	spin_lock(&sb_lock);
-	list_for_each_entry(sb, &sysfs_fs_type.fs_supers, s_instances) {
-		struct sysfs_super_info *info = sysfs_info(sb);
-		/*
-		 * If we see a superblock on the fs_supers/s_instances
-		 * list the unmount has not completed and sb->s_fs_info
-		 * points to a valid struct sysfs_super_info.
-		 */
-		/* Ignore superblocks with the wrong ns */
-		if (info->ns[type] != ns)
-			continue;
-		info->ns[type] = NULL;
-	}
-	spin_unlock(&sb_lock);
-	mutex_unlock(&sysfs_mutex);
-}
-
 int __init sysfs_init(void)
 {
 	int err = -ENOMEM;
diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h
index 3d28af3..2ed2404 100644
--- a/fs/sysfs/sysfs.h
+++ b/fs/sysfs/sysfs.h
@@ -136,7 +136,7 @@ struct sysfs_addrm_cxt {
  * instance).
  */
 struct sysfs_super_info {
-	const void *ns[KOBJ_NS_TYPES];
+	void *ns[KOBJ_NS_TYPES];
 };
 #define sysfs_info(SB) ((struct sysfs_super_info *)(SB->s_fs_info))
 extern struct sysfs_dirent sysfs_root;
diff --git a/include/linux/kobject_ns.h b/include/linux/kobject_ns.h
index 82cb5bf..5fa481c 100644
--- a/include/linux/kobject_ns.h
+++ b/include/linux/kobject_ns.h
@@ -38,9 +38,10 @@ enum kobj_ns_type {
  */
 struct kobj_ns_type_operations {
 	enum kobj_ns_type type;
-	const void *(*current_ns)(void);
+	void *(*current_ns)(void);
 	const void *(*netlink_ns)(struct sock *sk);
 	const void *(*initial_ns)(void);
+	void (*put_ns)(void *);
 };
 
 int kobj_ns_type_register(const struct kobj_ns_type_operations *ops);
@@ -48,9 +49,9 @@ int kobj_ns_type_registered(enum kobj_ns_type type);
 const struct kobj_ns_type_operations *kobj_child_ns_ops(struct kobject *parent);
 const struct kobj_ns_type_operations *kobj_ns_ops(struct kobject *kobj);
 
-const void *kobj_ns_current(enum kobj_ns_type type);
+void *kobj_ns_current(enum kobj_ns_type type);
 const void *kobj_ns_netlink(enum kobj_ns_type type, struct sock *sk);
 const void *kobj_ns_initial(enum kobj_ns_type type);
-void kobj_ns_exit(enum kobj_ns_type type, const void *ns);
+void kobj_ns_put(enum kobj_ns_type type, void *ns);
 
 #endif /* _LINUX_KOBJECT_NS_H */
diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index c3acda6..e2696d7 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -177,9 +177,6 @@ struct sysfs_dirent *sysfs_get_dirent(struct sysfs_dirent *parent_sd,
 struct sysfs_dirent *sysfs_get(struct sysfs_dirent *sd);
 void sysfs_put(struct sysfs_dirent *sd);
 
-/* Called to clear a ns tag when it is no longer valid */
-void sysfs_exit_ns(enum kobj_ns_type type, const void *tag);
-
 int __must_check sysfs_init(void);
 
 #else /* CONFIG_SYSFS */
@@ -338,10 +335,6 @@ static inline void sysfs_put(struct sysfs_dirent *sd)
 {
 }
 
-static inline void sysfs_exit_ns(int type, const void *tag)
-{
-}
-
 static inline int __must_check sysfs_init(void)
 {
 	return 0;
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 2bf9ed9..415af64 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -35,8 +35,11 @@ struct netns_ipvs;
 #define NETDEV_HASHENTRIES (1 << NETDEV_HASHBITS)
 
 struct net {
+	atomic_t		passive;	/* To decided when the network
+						 * namespace should be freed.
+						 */
 	atomic_t		count;		/* To decided when the network
-						 *  namespace should be freed.
+						 *  namespace should be shut down.
 						 */
 #ifdef NETNS_REFCNT_DEBUG
 	atomic_t		use_count;	/* To track references we
@@ -154,6 +157,9 @@ int net_eq(const struct net *net1, const struct net *net2)
 {
 	return net1 == net2;
 }
+
+extern void net_put_ns(void *);
+
 #else
 
 static inline struct net *get_net(struct net *net)
@@ -175,6 +181,8 @@ int net_eq(const struct net *net1, const struct net *net2)
 {
 	return 1;
 }
+
+#define net_put_ns NULL
 #endif
 
 
diff --git a/lib/kobject.c b/lib/kobject.c
index 82dc34c..f584cd4 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -948,9 +948,9 @@ const struct kobj_ns_type_operations *kobj_ns_ops(struct kobject *kobj)
 }
 
 
-const void *kobj_ns_current(enum kobj_ns_type type)
+void *kobj_ns_current(enum kobj_ns_type type)
 {
-	const void *ns = NULL;
+	void *ns = NULL;
 
 	spin_lock(&kobj_ns_type_lock);
 	if ((type > KOBJ_NS_TYPE_NONE) && (type < KOBJ_NS_TYPES) &&
@@ -987,23 +987,15 @@ const void *kobj_ns_initial(enum kobj_ns_type type)
 	return ns;
 }
 
-/*
- * kobj_ns_exit - invalidate a namespace tag
- *
- * @type: the namespace type (i.e. KOBJ_NS_TYPE_NET)
- * @ns: the actual namespace being invalidated
- *
- * This is called when a tag is no longer valid.  For instance,
- * when a network namespace exits, it uses this helper to
- * make sure no sb's sysfs_info points to the now-invalidated
- * netns.
- */
-void kobj_ns_exit(enum kobj_ns_type type, const void *ns)
+void kobj_ns_put(enum kobj_ns_type type, void *ns)
 {
-	sysfs_exit_ns(type, ns);
+	spin_lock(&kobj_ns_type_lock);
+	if ((type > KOBJ_NS_TYPE_NONE) && (type < KOBJ_NS_TYPES) &&
+	    kobj_ns_ops_tbl[type])
+		kobj_ns_ops_tbl[type]->put_ns(ns);
+	spin_unlock(&kobj_ns_type_lock);
 }
 
-
 EXPORT_SYMBOL(kobject_get);
 EXPORT_SYMBOL(kobject_put);
 EXPORT_SYMBOL(kobject_del);
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 11b98bc..b002c71 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1179,9 +1179,12 @@ static void remove_queue_kobjects(struct net_device *net)
 #endif
 }
 
-static const void *net_current_ns(void)
+static void *net_current_ns(void)
 {
-	return current->nsproxy->net_ns;
+	struct net *ns = current->nsproxy->net_ns;
+	if (ns)
+		atomic_inc(&ns->passive);
+	return ns;
 }
 
 static const void *net_initial_ns(void)
@@ -1199,19 +1202,10 @@ struct kobj_ns_type_operations net_ns_type_operations = {
 	.current_ns = net_current_ns,
 	.netlink_ns = net_netlink_ns,
 	.initial_ns = net_initial_ns,
+	.put_ns = net_put_ns,
 };
 EXPORT_SYMBOL_GPL(net_ns_type_operations);
 
-static void net_kobj_ns_exit(struct net *net)
-{
-	kobj_ns_exit(KOBJ_NS_TYPE_NET, net);
-}
-
-static struct pernet_operations kobj_net_ops = {
-	.exit = net_kobj_ns_exit,
-};
-
-
 #ifdef CONFIG_HOTPLUG
 static int netdev_uevent(struct device *d, struct kobj_uevent_env *env)
 {
@@ -1339,6 +1333,5 @@ EXPORT_SYMBOL(netdev_class_remove_file);
 int netdev_kobject_init(void)
 {
 	kobj_ns_type_register(&net_ns_type_operations);
-	register_pernet_subsys(&kobj_net_ops);
 	return class_register(&net_class);
 }
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 6c6b86d..19feb20 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -128,6 +128,7 @@ static __net_init int setup_net(struct net *net)
 	LIST_HEAD(net_exit_list);
 
 	atomic_set(&net->count, 1);
+	atomic_set(&net->passive, 1);
 
 #ifdef NETNS_REFCNT_DEBUG
 	atomic_set(&net->use_count, 0);
@@ -210,6 +211,13 @@ static void net_free(struct net *net)
 	kmem_cache_free(net_cachep, net);
 }
 
+void net_put_ns(void *p)
+{
+	struct net *ns = p;
+	if (ns && atomic_dec_and_test(&ns->passive))
+		net_free(ns);
+}
+
 struct net *copy_net_ns(unsigned long flags, struct net *old_net)
 {
 	struct net *net;
@@ -230,7 +238,7 @@ struct net *copy_net_ns(unsigned long flags, struct net *old_net)
 	}
 	mutex_unlock(&net_mutex);
 	if (rv < 0) {
-		net_free(net);
+		net_put_ns(net);
 		return ERR_PTR(rv);
 	}
 	return net;
@@ -286,7 +294,7 @@ static void cleanup_net(struct work_struct *work)
 	/* Finally it is safe to free my network namespace structure */
 	list_for_each_entry_safe(net, tmp, &net_exit_list, exit_list) {
 		list_del_init(&net->exit_list);
-		net_free(net);
+		net_put_ns(net);
 	}
 }
 static DECLARE_WORK(net_cleanup_work, cleanup_net);

^ permalink raw reply related

* Re: KVM induced panic on 2.6.38[2367] & 2.6.39
From: Patrick McHardy @ 2011-06-07 22:57 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Brad Campbell, Bart De Schuymer, kvm, linux-mm, linux-kernel,
	netdev, netfilter-devel
In-Reply-To: <1307471484.3091.43.camel@edumazet-laptop>

On 07.06.2011 20:31, Eric Dumazet wrote:
> Le mardi 07 juin 2011 à 17:35 +0200, Patrick McHardy a écrit :
> 
>> The main suspects would be NAT and TCPMSS. Did you also try whether
>> the crash occurs with only one of these these rules?
>>
>>> I've just compiled out CONFIG_BRIDGE_NETFILTER and can no longer access
>>> the address the way I was doing it, so that's a no-go for me.
>>
>> That's really weird since you're apparently not using any bridge
>> netfilter features. It shouldn't have any effect besides changing
>> at which point ip_tables is invoked. How are your network devices
>> configured (specifically any bridges)?
> 
> Something in the kernel does 
> 
> u16 *ptr = addr (given by kmalloc())
> 
> ptr[-1] = 0;
> 
> Could be an off-one error in a memmove()/memcopy() or loop...
> 
> I cant see a network issue here.

So far me neither, but netfilter appears to trigger the bug.

> I checked arch/x86/lib/memmove_64.S and it seems fine.

I was thinking it might be a missing skb_make_writable() combined
with vhost_net specifics in the netfilter code (TCPMSS and NAT are
both suspect), but was unable to find something. I also went
through the dst_metrics() conversion to see whether anything could
cause problems with the bridge fake_rttable, but also nothing
so far.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 1/2] vlan: only create special VLAN 0 once
From: Patrick McHardy @ 2011-06-07 22:50 UTC (permalink / raw)
  To: Jiri Bohac; +Cc: David Miller, netdev, pedro.netdev
In-Reply-To: <20110607164136.GB5018@midget.suse.cz>

On 07.06.2011 18:41, Jiri Bohac wrote:
> On Tue, Jun 07, 2011 at 05:17:27PM +0200, Patrick McHardy wrote:
>> On 05.06.2011 23:28, David Miller wrote:
>>> I am pretty sure that the hardware device driver methods that
>>> implement ndo_vlan_rx_add_vid() assume that the device is up.
>>> Because most drivers completely reset the chip when the
>>> interface is brought up and this will likely clear out the
>>> VLAN ID tables in the chip.
>>>
>>
>> Good point.
>>
>> I don't think this approach works very well at all since
>> some drivers don't do incremental updates, but iterate over
>> the registered VLAN group when constructing filters. The
>> group is not created until the first real VLAN device is
>> registered however.
>>
>> Based on a quick grep (may have missed some):
>>
>> - via_velocity, mlx4, starfire: will do nothing
>>
>> - benet, igb, vxge, igbvf, ixgbevf, e1000e: would oops on
>>   rx_kill_vid due to unnecessary vlan_group_set_device()
> 
> which approach do you mean? David's or mine? I suppose you mean
> David's, because I did not call rx_kill_vid().

Neither, the approach how adding VID 0 was implemented. At
least the first list of drivers mentioned above will do
nothing, the second one would oops if we removed the VID
on NETDEV_DOWN without adding a real VLAN device.

>> The assumption of the drivers that a VLAN group exists
>> before the first VID is configured is reasonable in my
>> opinion, a lot of them also don't even configure VLAN
>> filtering until the VLAN group is registered.
> 
> So this is broken already since ad1afb00 :(

Exactly.

> The assumption broke the bonding driver and this got fixed by
> f35188fa, btw.
> 
>> So I think a good solution would be to make sure all
>> drivers don't enable VLAN filtering before the first
>> VLAN is actually registered and do the automatic
>> registration of VID 0 once the first real VLAN device
>> is created.
> 
> But this behaviour is not what was intended by ad1afb00.
> The VID 0 needs to be registered by default, to make 8021p work.
> Even without any real VLAN devices created.

I'm not sure what exactly you're referring to. HW VLAN filters
are (or should be) usually only activated when the first
VLAN device is registered, so this change has no effect.

If drivers behave that way *and* the VID 0 is only registered
automatically when the first real device is configured,
this should provide the desired behaviour.


^ permalink raw reply

* Re: [net-next 30/40] ixgbe: add support for nfc addition and removal of filters
From: Alexander Duyck @ 2011-06-07 22:39 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Jeff Kirsher, davem, netdev, gospo
In-Reply-To: <1307481879.2908.25.camel@bwh-desktop>

On 06/07/2011 02:24 PM, Ben Hutchings wrote:
> On Tue, 2011-06-07 at 05:33 -0700, Jeff Kirsher wrote:
>> From: Alexander Duyck<alexander.h.duyck@intel.com>
>>
>> This change is meant to allow for nfc to insert and remove filters in order
>> to test the ethtool interface which includes it's own rules manager.
>>
>> Signed-off-by: Alexander Duyck<alexander.h.duyck@intel.com>
>> Tested-by: Ross Brattain<ross.b.brattain@intel.com>
>> Signed-off-by: Jeff Kirsher<jeffrey.t.kirsher@intel.com>
>> ---
>>   drivers/net/ixgbe/ixgbe_ethtool.c |  245 +++++++++++++++++++++++++++++++++++++
>>   drivers/net/ixgbe/ixgbe_main.c    |   45 +++++++
>>   2 files changed, 290 insertions(+), 0 deletions(-)
>>
>> diff --git a/drivers/net/ixgbe/ixgbe_ethtool.c b/drivers/net/ixgbe/ixgbe_ethtool.c
>> index 2c70363..f37c9b8 100644
>> --- a/drivers/net/ixgbe/ixgbe_ethtool.c
>> +++ b/drivers/net/ixgbe/ixgbe_ethtool.c
> [...]
>> +	/* Copy input into formatted structures */
>> +	input->filter.formatted.src_ip[0] = fsp->h_u.tcp_ip4_spec.ip4src;
>> +	mask.formatted.src_ip[0] = fsp->m_u.tcp_ip4_spec.ip4src;
>> +	input->filter.formatted.dst_ip[0] = fsp->h_u.tcp_ip4_spec.ip4dst;
>> +	mask.formatted.dst_ip[0] = fsp->m_u.tcp_ip4_spec.ip4dst;
>> +	input->filter.formatted.src_port = fsp->h_u.tcp_ip4_spec.psrc;
>> +	mask.formatted.src_port = fsp->m_u.tcp_ip4_spec.psrc;
>> +	input->filter.formatted.dst_port = fsp->h_u.tcp_ip4_spec.pdst;
>> +	mask.formatted.dst_port = fsp->m_u.tcp_ip4_spec.pdst;
>> +
>> +	if (fsp->flow_type&  FLOW_EXT) {
>> +		input->filter.formatted.vm_pool =
>> +				(unsigned char)ntohl(fsp->h_ext.data[1]);
>> +		mask.formatted.vm_pool =
>> +				(unsigned char)ntohl(fsp->m_ext.data[1]);
>> +		input->filter.formatted.vlan_id = fsp->h_ext.vlan_tci;
>> +		mask.formatted.vlan_id = fsp->m_ext.vlan_tci;
>> +		input->filter.formatted.flex_bytes =
>> +						fsp->h_ext.vlan_etype;
>> +		mask.formatted.flex_bytes = fsp->m_ext.vlan_etype;
>> +	}
> [...]
>
> Sure you don't need any byte-swapping for the IP, port, and VLAN tag?
>
> Also is the 'vlan_id' field used to match the whole VLAN/priority tag or
> only the VID?  If it only matches the VID then you need to check that
> the priority and CF bits are not set in the given mask.
>
> Ben.
I'm pretty sure that is all correct.  The input filter for the Flow 
Director hashes has everything in big-endian byte order.  Most of the 
input fields in the hardware for the perfect filters also accept things 
in big-endian byte order.  So it is in my best interest to just leave 
things in that byte order since that way I have to do as few byte swaps 
as possible.

The VLAN field can include the VLAN priority.  There is a check of the 
input masks done that verifies that the valid VLAN masks can be 0xEFFF, 
0xE000, 0x0FFF, and 0x0000.  This mask is anded against the input data 
when setting up the filter so there isn't any way that a input with a CF 
bit would be valid.

Thanks,

Alex

^ permalink raw reply

* Re: [net-next 38/40] ixgbe: Update feature flags so that LRO and Ntuple are restricted
From: Alexander Duyck @ 2011-06-07 22:32 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Jeff Kirsher, davem, netdev, gospo
In-Reply-To: <1307452559.2908.15.camel@bwh-desktop>

On 06/07/2011 06:15 AM, Ben Hutchings wrote:
> On Tue, 2011-06-07 at 05:33 -0700, Jeff Kirsher wrote:
>> From: Alexander Duyck<alexander.h.duyck@intel.com>
>>
>> This change makes it so that LRO and Ntuple flags are correctly restricted
>> to only devices that support those features.  Currently we weren't
>> enforcing any of those restrictions and as such it was possible to do
>> things such as enable LRO without it actually being supported on the
>> hardware.
>>
>> This change also makes a slight modification to the code that assumes the
>> ETH_FLAG_RXVLAN is the same as the netdev flag. I corrected it by just adding
>> a !! to cast the result of the flag&  to a bool in order to guarantee the two
>> checks are compared as boolean values.
> [...]
>
> You could BUILD_BUG_ON(ETH_FLAG_RXVLAN != NETIF_F_HW_VLAN_RX).
>
> Or implement the new features interface like every other driver...
>
> Ben.

I hadn't looked at what was going on outside of the Intel drivers and 
didn't realize that the get/set_flags calls had been deprecated.

For now I would appreciate it if the existing patch does get accepted, 
but I will see what can be done to move things over to using the new 
features interface.

Thanks,

Alex

^ permalink raw reply

* Re: [net-next 31/40] ethtool: remove support for ETHTOOL_GRXNTUPLE
From: Alexander Duyck @ 2011-06-07 22:20 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Jeff Kirsher, davem, netdev, gospo
In-Reply-To: <1307451997.2908.13.camel@bwh-desktop>

On 06/07/2011 06:06 AM, Ben Hutchings wrote:
> On Tue, 2011-06-07 at 05:33 -0700, Jeff Kirsher wrote:
>> From: Alexander Duyck<alexander.h.duyck@intel.com>
>>
>> This change is meant to remove all support for displaying an ntuple as
>> strings via ETHTOOL_GRXNTUPLE.  The reason for this change is due to the
>> fact that multiple issues have been found including:
>>   - Multiple buffer overruns for strings being displayed.
>>   - Incorrect filters displayed, cleared filters with ring of -2 are displayed
>>   - Setting get_rx_ntuple displays no rules if defined.
>>   - Endianess wrong on displayed values.
>>   - Hard limit of 1024 filters makes display functionality extremely limited
>>
>> The only driver that had supported this interface was ixgbe.  Since it no
>> longer uses the interface and due to the issues mentioned above I am
>> submitting this patch to remove it.
>>
>> Signed-off-by: Alexander Duyck<alexander.h.duyck@intel.com>
>> Tested-by: Ross Brattain<ross.b.brattain@intel.com>
>> Signed-off-by: Jeff Kirsher<jeffrey.t.kirsher@intel.com>
>> ---
>>   include/linux/ethtool.h   |    8 +-
>>   include/linux/netdevice.h |    3 -
>>   net/core/dev.c            |    5 -
>>   net/core/ethtool.c        |  299 ---------------------------------------------
>>   4 files changed, 2 insertions(+), 313 deletions(-)
>>
>> diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
>> index c6a850a..3310ab6 100644
>> --- a/include/linux/ethtool.h
>> +++ b/include/linux/ethtool.h
>> @@ -287,7 +287,7 @@ enum ethtool_stringset {
>>   	ETH_SS_TEST		= 0,
>>   	ETH_SS_STATS,
>>   	ETH_SS_PRIV_FLAGS,
>> -	ETH_SS_NTUPLE_FILTERS,
>> +	ETH_SS_DO_NOT_USE,		/* was ETH_SS_NTUPLE_FILTERS */
>>   	ETH_SS_FEATURES,
>>   };
>>
> Since this feature didn't work properly, any code that tried to use it
> didn't really work, but it still feels kind of wrong to turn that into a
> compile error.  And it does no harm to leave the definition here, though
> you may want to comment that it is no longer supported.
>
Ok, if Jeff can pull that patch I will redo it with the definition left 
behind and a comment added.
>> @@ -720,8 +720,6 @@ struct ethtool_rx_ntuple_flow_spec_container {
>>   };
>>
>>   struct ethtool_rx_ntuple_list {
>> -#define ETHTOOL_MAX_NTUPLE_LIST_ENTRY 1024
>> -#define ETHTOOL_MAX_NTUPLE_STRING_PER_ENTRY 14
>>   	struct list_head	list;
>>   	unsigned int		count;
>>   };
> You can remove struct ethtool_rx_ntuple_flow_spec_container and struct
> ethtool_rx_ntuple_list as they were not exposed to userland.
>
> [...]
>> @@ -1017,7 +1013,7 @@ struct ethtool_ops {
>>   #define ETHTOOL_FLASHDEV	0x00000033 /* Flash firmware to device */
>>   #define ETHTOOL_RESET		0x00000034 /* Reset hardware */
>>   #define ETHTOOL_SRXNTUPLE	0x00000035 /* Add an n-tuple filter to device */
>> -#define ETHTOOL_GRXNTUPLE	0x00000036 /* Get n-tuple filters from device */
>> +/* ETHTOOL_GRXNTUPLE		0x00000036 disabled due to multiple issues */
>>   #define ETHTOOL_GSSET_INFO	0x00000037 /* Get string set info */
>>   #define ETHTOOL_GRXFHINDIR	0x00000038 /* Get RX flow hash indir'n table */
>>   #define ETHTOOL_SRXFHINDIR	0x00000039 /* Set RX flow hash indir'n table */
> [...]
>
> Same here; the command number needs to be reserved forever and the
> definition does no harm.
>
> Ben.
Same for this.  I will update it so that the definition remains, but 
comment that it should not be used.

Thanks,

Alex


^ permalink raw reply

* Re: [RFC] breakage in sysfs_readdir() and s_instances abuse in sysfs
From: Al Viro @ 2011-06-07 21:58 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-fsdevel, Linus Torvalds, netdev, Linux Containers
In-Reply-To: <m1k4cy6auf.fsf@fess.ebiederm.org>

On Mon, Jun 06, 2011 at 12:03:52PM -0700, Eric W. Biederman wrote:

> Other pieces of information that should be helpful to know.
> - All sysfs directory entries for a network namespace should be
>   removed from sysfs by the time sysfs_exit_ns is called.

Then why do we need to do _anything_ with ->ns[...]?  Is there any problem
with postponing actual freeing of that sucker until after umount, so that
memory doesn't get reused?  Since everything stale should be gone by the
point when sysfs_exit_ns() would put NULL into ->ns[...], we can as well
keep the old pointer in there - it won't match anything else for as long
as the struct net is not freed...

^ permalink raw reply

* Re: [net-next 30/40] ixgbe: add support for nfc addition and removal of filters
From: Ben Hutchings @ 2011-06-07 21:24 UTC (permalink / raw)
  To: Jeff Kirsher; +Cc: davem, Alexander Duyck, netdev, gospo
In-Reply-To: <1307449995-9458-31-git-send-email-jeffrey.t.kirsher@intel.com>

On Tue, 2011-06-07 at 05:33 -0700, Jeff Kirsher wrote:
> From: Alexander Duyck <alexander.h.duyck@intel.com>
> 
> This change is meant to allow for nfc to insert and remove filters in order
> to test the ethtool interface which includes it's own rules manager.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> Tested-by: Ross Brattain <ross.b.brattain@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> ---
>  drivers/net/ixgbe/ixgbe_ethtool.c |  245 +++++++++++++++++++++++++++++++++++++
>  drivers/net/ixgbe/ixgbe_main.c    |   45 +++++++
>  2 files changed, 290 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/net/ixgbe/ixgbe_ethtool.c b/drivers/net/ixgbe/ixgbe_ethtool.c
> index 2c70363..f37c9b8 100644
> --- a/drivers/net/ixgbe/ixgbe_ethtool.c
> +++ b/drivers/net/ixgbe/ixgbe_ethtool.c
[...]
> +	/* Copy input into formatted structures */
> +	input->filter.formatted.src_ip[0] = fsp->h_u.tcp_ip4_spec.ip4src;
> +	mask.formatted.src_ip[0] = fsp->m_u.tcp_ip4_spec.ip4src;
> +	input->filter.formatted.dst_ip[0] = fsp->h_u.tcp_ip4_spec.ip4dst;
> +	mask.formatted.dst_ip[0] = fsp->m_u.tcp_ip4_spec.ip4dst;
> +	input->filter.formatted.src_port = fsp->h_u.tcp_ip4_spec.psrc;
> +	mask.formatted.src_port = fsp->m_u.tcp_ip4_spec.psrc;
> +	input->filter.formatted.dst_port = fsp->h_u.tcp_ip4_spec.pdst;
> +	mask.formatted.dst_port = fsp->m_u.tcp_ip4_spec.pdst;
> +
> +	if (fsp->flow_type & FLOW_EXT) {
> +		input->filter.formatted.vm_pool =
> +				(unsigned char)ntohl(fsp->h_ext.data[1]);
> +		mask.formatted.vm_pool =
> +				(unsigned char)ntohl(fsp->m_ext.data[1]);
> +		input->filter.formatted.vlan_id = fsp->h_ext.vlan_tci;
> +		mask.formatted.vlan_id = fsp->m_ext.vlan_tci;
> +		input->filter.formatted.flex_bytes =
> +						fsp->h_ext.vlan_etype;
> +		mask.formatted.flex_bytes = fsp->m_ext.vlan_etype;
> +	}
[...]

Sure you don't need any byte-swapping for the IP, port, and VLAN tag?

Also is the 'vlan_id' field used to match the whole VLAN/priority tag or
only the VID?  If it only matches the VID then you need to check that
the priority and CF bits are not set in the given mask.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [net-2.6] igb: fix i350 SR-IOV failture
From: David Miller @ 2011-06-07 21:23 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: mitch.a.williams, netdev, gospo, stable
In-Reply-To: <1307450339-9889-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Tue,  7 Jun 2011 05:38:59 -0700

> From: "Williams, Mitch A" <mitch.a.williams@intel.com>
> 
> When SR-IOV is enabled, i350 devices fail to pass traffic. This is due to
> the driver attempting to enable RSS on the PF device, which is not
> supported by the i350.
> 
> When max_vfs is specified on an i350 adapter, set the number of RSS queues
> to 1.
> 
> This issue affects 2.6.39 as well.
> 
> CC: stable@kernel.org
> Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
> Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied.

^ permalink raw reply

* [PATCH] net/mac80211/debugfs: Convert to kstrou8_from_user
From: Peter Huewe @ 2011-06-07 20:36 UTC (permalink / raw)
  To: Johannes Berg
  Cc: John W. Linville, David S. Miller, linux-wireless, netdev,
	linux-kernel, kernel-janitors, Peter Huewe

This patch replaces the code for getting an number from a
userspace buffer by a simple call to kstrou8_from_user.
This makes it easier to read and less error prone.

Since the old buffer was only 10 bytes long and the value is masked by a
nibble-mask anyway, we don't need to use kstrtoul but rather kstrtou8.

Kernel Version: v3.0-rc2

Signed-off-by: Peter Huewe <peterhuewe@gmx.de>
---
 net/mac80211/debugfs.c |   14 +++-----------
 1 files changed, 3 insertions(+), 11 deletions(-)

diff --git a/net/mac80211/debugfs.c b/net/mac80211/debugfs.c
index 186e02f..267ed45 100644
--- a/net/mac80211/debugfs.c
+++ b/net/mac80211/debugfs.c
@@ -195,20 +195,12 @@ static ssize_t uapsd_queues_write(struct file *file,
 				  size_t count, loff_t *ppos)
 {
 	struct ieee80211_local *local = file->private_data;
-	unsigned long val;
-	char buf[10];
-	size_t len;
+	u8 val;
 	int ret;
 
-	len = min(count, sizeof(buf) - 1);
-	if (copy_from_user(buf, user_buf, len))
-		return -EFAULT;
-	buf[len] = '\0';
-
-	ret = strict_strtoul(buf, 0, &val);
-
+	ret = kstrtou8_from_user(user_buf, count, 0, &val);
 	if (ret)
-		return -EINVAL;
+		return ret;
 
 	if (val & ~IEEE80211_WMM_IE_STA_QOSINFO_AC_MASK)
 		return -ERANGE;
-- 
1.7.3.4

^ permalink raw reply related

* Re: [PATCH] netfilter: nf_nat: avoid double nat for loopback
From: Julian Anastasov @ 2011-06-07 19:46 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Pablo Neira Ayuso, netfilter-devel, netdev
In-Reply-To: <4DEDF167.1010202@trash.net>


	Hello,

On Tue, 7 Jun 2011, Patrick McHardy wrote:

> On 04.06.2011 16:02, Julian Anastasov wrote:
> > 
> > 	Avoid double NAT and seq adjustment for loopback
> > traffic because it causes silent repetition of TCP data. One
> > example is passive FTP with DNAT rule and difference in the
> > length of IP addresses.
> > 
> > 	This patch adds checks if packet is sent and
> > received via loopback device. As the same conntrack is used
> > both for outgoing and incoming direction, we restrict NAT,
> > seq adjustment and confirmation to happen only in
> > outgoing direction (OUTPUT and POSTROUTING).
> > 
> > Signed-off-by: Julian Anastasov <ja@ssi.bg>
> > ---
> > 
> > 	As the check is not so cheap, another alternative
> > is to add new skb flag, eg. "loopback", that can be set in 
> > drivers/net/loopback.c, loopback_xmit(). May be there is space
> > for it in flags2?
> 
> I don't think we should be adding code specifically needed for netfilter
> to the loopback driver if we can avoid it. I don't think we need to
> actually avoid calling nf_nat_packet twice, that shouldn't do any harm,
> just the sequence number adjustment. So we could add the loopback check

	Yes, may be calling nf_nat_packet is not fatal.

> to the IPS_SEQ_ADJUST_BIT case to at least avoid it in some cases.
> Would that work or am I missing something?

	Logically, the new check can be after
test_bit(IPS_SEQ_ADJUST_BIT, &ct->status). But I suspect
some modules adjust seqs in the helper->help call,
for example, sip_help_tcp if I'm correctly reading the
code.

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply

* Re: KVM induced panic on 2.6.38[2367] & 2.6.39
From: Eric Dumazet @ 2011-06-07 18:31 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Brad Campbell, Bart De Schuymer, kvm, linux-mm, linux-kernel,
	netdev, netfilter-devel
In-Reply-To: <4DEE4538.1020404@trash.net>

Le mardi 07 juin 2011 à 17:35 +0200, Patrick McHardy a écrit :

> The main suspects would be NAT and TCPMSS. Did you also try whether
> the crash occurs with only one of these these rules?
> 
> > I've just compiled out CONFIG_BRIDGE_NETFILTER and can no longer access
> > the address the way I was doing it, so that's a no-go for me.
> 
> That's really weird since you're apparently not using any bridge
> netfilter features. It shouldn't have any effect besides changing
> at which point ip_tables is invoked. How are your network devices
> configured (specifically any bridges)?

Something in the kernel does 

u16 *ptr = addr (given by kmalloc())

ptr[-1] = 0;

Could be an off-one error in a memmove()/memcopy() or loop...

I cant see a network issue here.

I checked arch/x86/lib/memmove_64.S and it seems fine.



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: KVM induced panic on 2.6.38[2367] & 2.6.39
From: Bart De Schuymer @ 2011-06-07 18:04 UTC (permalink / raw)
  To: Brad Campbell
  Cc: Patrick McHardy, kvm, linux-mm, linux-kernel, netdev,
	netfilter-devel
In-Reply-To: <4DEE3859.6070808@fnarfbargle.com>

Op 7/06/2011 16:40, Brad Campbell schreef:
> On 07/06/11 21:30, Patrick McHardy wrote:
>> On 07.06.2011 05:33, Brad Campbell wrote:
>>> On 07/06/11 04:10, Bart De Schuymer wrote:
>>>> Hi Brad,
>>>>
>>>> This has probably nothing to do with ebtables, so please rmmod in case
>>>> it's loaded.
>>>> A few questions I didn't directly see an answer to in the threads I
>>>> scanned...
>>>> I'm assuming you actually use the bridging firewall functionality. So,
>>>> what iptables modules do you use? Can you reduce your iptables 
>>>> rules to
>>>> a core that triggers the bug?
>>>> Or does it get triggered even with an empty set of firewall rules?
>>>> Are you using a stock .35 kernel or is it patched?
>>>> Is this something I can trigger on a poor guy's laptop or does it
>>>> require specialized hardware (I'm catching up on qemu/kvm...)?
>>>
>>> Not specialised hardware as such, I've just not been able to reproduce
>>> it outside of this specific operating scenario.
>>
>> The last similar problem we've had was related to the 32/64 bit compat
>> code. Are you running 32 bit userspace on a 64 bit kernel?
>
> No, 32 bit Guest OS, but a completely 64 bit userspace on a 64 bit 
> kernel.
>
> Userspace is current Debian Stable. Kernel is Vanilla and qemu-kvm is 
> current git
>
If the bug is easily triggered with your guest os, then you could try to 
capture the traffic with wireshark (or something else) in a 
configuration that doesn't crash your system. Save the traffic in a pcap 
file. Then you can see if resending that traffic in the vulnerable 
configuration triggers the bug (I don't know if something in Windows 
exists, but tcpreplay should work for Linux). Once you have such a 
capture , chances are the bug is even easily reproducible by us (unless 
it's hardware-specific). Success isn't guaranteed, but I think it's 
worth a shot...

cheers,
Bart


-- 
Bart De Schuymer
www.artinalgorithms.be

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH] usbnet/cdc_ncm: add missing .reset_resume hook
From: Greg KH @ 2011-06-07 17:35 UTC (permalink / raw)
  To: Stefan (metze) Metzmacher
  Cc: David Miller, oliver, linux-usb, netdev, linux-kernel
In-Reply-To: <4DEE5AEE.8070901@samba.org>

On Tue, Jun 07, 2011 at 07:07:58PM +0200, Stefan (metze) Metzmacher wrote:
> Am 07.06.2011 16:43, schrieb Greg KH:
> > On Tue, Jun 07, 2011 at 09:38:12AM +0200, Stefan (metze) Metzmacher wrote:
> >> Am 06.06.2011 17:06, schrieb Greg KH:
> >>> On Mon, Jun 06, 2011 at 02:23:16PM +0200, Stefan (metze) Metzmacher wrote:
> >>>> Hi David,
> >>>>
> >>>>> From: Stefan Metzmacher <metze@samba.org>
> >>>>> Date: Wed,  1 Jun 2011 14:01:41 +0200
> >>>>>
> >>>>>> This avoids messages like this after suspend:
> >>>>>>
> >>>>>>    cdc_ncm 2-1.4:1.6: no reset_resume for driver cdc_ncm?
> >>>>>>    cdc_ncm 2-1.4:1.7: no reset_resume for driver cdc_ncm?
> >>>>>>    cdc_ncm 2-1.4:1.6: usb0: unregister 'cdc_ncm' usb-0000:00:1d.0-1.4, CDC NCM
> >>>>>>
> >>>>>> This is important for the Ericsson F5521gw GSM/UMTS modem.
> >>>>>> Otherwise modemmanager looses the fact that the cdc_ncm and cdc_acm devices
> >>>>>> belong together.
> >>>>>>
> >>>>>> The cdc_ether module does the same.
> >>>>>>
> >>>>>> Signed-off-by: Stefan Metzmacher <metze@samba.org>
> >>>>>
> >>>>> Applied and queued up for -stable, thanks.
> >>>>
> >>>> It seems to be part of 3.0-rc2, but I'm not seeing it in any stable tree
> >>>> yet...
> >>>>
> >>>> When can I expect it in stable trees like 2.6.38.y?
> >>>
> >>> The .38.y tree is closed and will not have new releases, so you will
> >>> never see it there, sorry.
> >>
> >> Ok, are there chances for .39.y?
> > 
> > There are, it requires the patch to be in Linus's tree first,
> 
> It's already there:
> 
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=85e3c65fa3a1d0542c181510a950a2be7733ff29
> 
> > please see Documentation/stable_kernel_rules.txt for the details.
> 
> Thanks for the pointer. It seems I've missed the Cc: <stable@kernel.org>
> line(s).
> 
> Would this be the correct syntax in order to get it into 2.6.38.x and
> 2.6.39.x?
> (2.6.38 wasn't closed at the time I submitted the patch)
> 
> Cc: <stable@kernel.org> # .38.x
> Cc: <stable@kernel.org> # .39.x
> 
> Should I resubmit the patch, or can you just cherry pick
> 85e3c65fa3a1d0542c181510a950a2be7733ff29
> from Linus's tree?

I've queued it up now, thanks.

greg k-h

^ permalink raw reply

* Re: [PATCH] usbnet/cdc_ncm: add missing .reset_resume hook
From: Stefan (metze) Metzmacher @ 2011-06-07 17:07 UTC (permalink / raw)
  To: Greg KH; +Cc: David Miller, oliver, linux-usb, netdev, linux-kernel
In-Reply-To: <20110607144348.GB19246@suse.de>

[-- Attachment #1: Type: text/plain, Size: 2033 bytes --]

Am 07.06.2011 16:43, schrieb Greg KH:
> On Tue, Jun 07, 2011 at 09:38:12AM +0200, Stefan (metze) Metzmacher wrote:
>> Am 06.06.2011 17:06, schrieb Greg KH:
>>> On Mon, Jun 06, 2011 at 02:23:16PM +0200, Stefan (metze) Metzmacher wrote:
>>>> Hi David,
>>>>
>>>>> From: Stefan Metzmacher <metze@samba.org>
>>>>> Date: Wed,  1 Jun 2011 14:01:41 +0200
>>>>>
>>>>>> This avoids messages like this after suspend:
>>>>>>
>>>>>>    cdc_ncm 2-1.4:1.6: no reset_resume for driver cdc_ncm?
>>>>>>    cdc_ncm 2-1.4:1.7: no reset_resume for driver cdc_ncm?
>>>>>>    cdc_ncm 2-1.4:1.6: usb0: unregister 'cdc_ncm' usb-0000:00:1d.0-1.4, CDC NCM
>>>>>>
>>>>>> This is important for the Ericsson F5521gw GSM/UMTS modem.
>>>>>> Otherwise modemmanager looses the fact that the cdc_ncm and cdc_acm devices
>>>>>> belong together.
>>>>>>
>>>>>> The cdc_ether module does the same.
>>>>>>
>>>>>> Signed-off-by: Stefan Metzmacher <metze@samba.org>
>>>>>
>>>>> Applied and queued up for -stable, thanks.
>>>>
>>>> It seems to be part of 3.0-rc2, but I'm not seeing it in any stable tree
>>>> yet...
>>>>
>>>> When can I expect it in stable trees like 2.6.38.y?
>>>
>>> The .38.y tree is closed and will not have new releases, so you will
>>> never see it there, sorry.
>>
>> Ok, are there chances for .39.y?
> 
> There are, it requires the patch to be in Linus's tree first,

It's already there:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=85e3c65fa3a1d0542c181510a950a2be7733ff29

> please see Documentation/stable_kernel_rules.txt for the details.

Thanks for the pointer. It seems I've missed the Cc: <stable@kernel.org>
line(s).

Would this be the correct syntax in order to get it into 2.6.38.x and
2.6.39.x?
(2.6.38 wasn't closed at the time I submitted the patch)

Cc: <stable@kernel.org> # .38.x
Cc: <stable@kernel.org> # .39.x

Should I resubmit the patch, or can you just cherry pick
85e3c65fa3a1d0542c181510a950a2be7733ff29
from Linus's tree?

metze


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply

* RE: Multicast IP packet routed between 2 ports nic on the same host
From: David Stevens @ 2011-06-07 17:03 UTC (permalink / raw)
  To: BONNEAU Guy; +Cc: netdev@vger.kernel.org, netdev-owner@vger.kernel.org
In-Reply-To: <24665DDC0D7CF047BD6471A56E615EA628ABF94D@CA-OPS-MAILBOX.miranda.com>

netdev-owner@vger.kernel.org wrote on 06/07/2011 06:57:32 AM:

> From: BONNEAU Guy <gbonneau@miranda.com>
 
> Its works! Thanks a lot David! Googling for SO_BINDTODEVICE helped 
> me to get insight into my issue. This link was really useful : 
> http://codingrelic.geekhold.com/2009/10/code-snippet-sobindtodevice.html
> . However I'm still somewhat confused because I always assumed that 
> this was one of the purpose of IP_ADD_MEMBERSHIP beside joining a 
> multicast group. To constrain the multicast socket to a specific 
> "device" interface. In which case I wonder if the interface member 
> of structure ip_mreq is still relevant? Can you comment!

Guy,
        There are 2 steps in multicast delivery:

1) group membership, which is controlled by IP_ADD_MEMBERSHIP (et alia).
        This is per-interface and determines whether multicast packets
        on that interface will be delivered to the machine. If you don't
        join a group _on_that_interface_, the interface itself will drop
        the packets. This is what the interface specification in ip_mreq
        and ip_mreqn is for; if you want that group on all interfaces,
        you have to do the join on all interfaces, one at a time. If you
        don't specify an interface, it'll pick one, which isn't 
necessarily
        the one you want.
2) socket binding.
        This determines whether a particular socket receives a multicast
        packet or not. Clearly, all multicast packets that are not 
delivered
        to the machine will not be delivered to any socket, so _somebody_ 
has
        to have done a group join for sockets bound to multicast addresses
        to receive packets. However, the socket binding is entirely 
independent
        of the group membership. A binding matching the group and port 
(which
        also includes INADDR_ANY bindings) will receive matching packets
        whether or not you did a join on that socket (or a join at all) as
        long as some process on the machine has done a join on an 
interface
        that allows delivery to the machine.

In your test program, you were using the same group and port for both
instances so they will both receive the same set of packets. If any join
on any interface would allow delivery of matching packets to the machine,
they would both receive them, even if neither of your programs did the
IP_ADD_MEMBERSHIP. And if no process had done an join on an interface
with matching multicast traffic, neither of the programs would receive
anything.

                                                                +-DLS


^ permalink raw reply

* Re: [PATCH 1/2] vlan: only create special VLAN 0 once
From: Jiri Bohac @ 2011-06-07 16:41 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: David Miller, jbohac, netdev, pedro.netdev
In-Reply-To: <4DEE4107.1080706@trash.net>

On Tue, Jun 07, 2011 at 05:17:27PM +0200, Patrick McHardy wrote:
> On 05.06.2011 23:28, David Miller wrote:
> > I am pretty sure that the hardware device driver methods that
> > implement ndo_vlan_rx_add_vid() assume that the device is up.
> > Because most drivers completely reset the chip when the
> > interface is brought up and this will likely clear out the
> > VLAN ID tables in the chip.
> > 
> 
> Good point.
> 
> I don't think this approach works very well at all since
> some drivers don't do incremental updates, but iterate over
> the registered VLAN group when constructing filters. The
> group is not created until the first real VLAN device is
> registered however.
> 
> Based on a quick grep (may have missed some):
> 
> - via_velocity, mlx4, starfire: will do nothing
> 
> - benet, igb, vxge, igbvf, ixgbevf, e1000e: would oops on
>   rx_kill_vid due to unnecessary vlan_group_set_device()

which approach do you mean? David's or mine? I suppose you mean
David's, because I did not call rx_kill_vid().

> The assumption of the drivers that a VLAN group exists
> before the first VID is configured is reasonable in my
> opinion, a lot of them also don't even configure VLAN
> filtering until the VLAN group is registered.

So this is broken already since ad1afb00 :(
The assumption broke the bonding driver and this got fixed by
f35188fa, btw.

> So I think a good solution would be to make sure all
> drivers don't enable VLAN filtering before the first
> VLAN is actually registered and do the automatic
> registration of VID 0 once the first real VLAN device
> is created.

But this behaviour is not what was intended by ad1afb00.
The VID 0 needs to be registered by default, to make 8021p work.
Even without any real VLAN devices created.

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ


^ permalink raw reply

* Re: [PATCH 1/2] vlan: only create special VLAN 0 once
From: Jiri Bohac @ 2011-06-07 16:18 UTC (permalink / raw)
  To: David Miller; +Cc: jbohac, kaber, netdev, pedro.netdev
In-Reply-To: <20110605.142823.1727360496050285755.davem@davemloft.net>

Hi David,

On Sun, Jun 05, 2011 at 02:28:23PM -0700, David Miller wrote:
> From: Jiri Bohac <jbohac@suse.cz>
> Date: Fri, 3 Jun 2011 22:07:38 +0200
> 
> > Commit ad1afb00 registers a VLAN with vid == 0 for every device to handle
> > 802.1p frames.  This is currently done on every NETDEV_UP event and the special
> > vlan is never unregistered.  This may have strange effects on drivers
> > implementning ndo_vlan_rx_add_vid(). E.g. bonding will allocate a linked-list
> > element each time, causing a memory leak.
> > 
> > Only register the special VLAN once on NETDEV_REGISTER.
> > 
> > Signed-off-by: Jiri Bohac <jbohac@suse.cz>
> 
> I recognize the problem, but this solution isn't all that good.
> 
> I am pretty sure that the hardware device driver methods that
> implement ndo_vlan_rx_add_vid() assume that the device is up.
> Because most drivers completely reset the chip when the
> interface is brought up and this will likely clear out the
> VLAN ID tables in the chip.

Really? In that case, we have a much bigger problem: the vlan
code allows registering a new vlan on an interface that is down.
And it only registers the VID with ndo_vlan_rx_add_vid() in
register_vlan_dev() during the registration of the new vlan
interface -- it never re-registers the VIDs on a NETDEV_UP.

That would mean doing:

	ip link set down eth0
	ip link add link eth0 name eth0.1 type vlan id 1
	ip link set up eth0

... should result in a non-working setup, right? I would expect
an -EINVAL somewhere along the way.
However, at least for e1000, I just tested the above setup works.

Could you be wrong about this? Or is this supposed to fail with
other chips?  In that case, the vlan code or the drivers need
fixing. Something should either disallow adding vlans when down
and disallow putting the interface down then vlans are
configured, or it should re-register the VIDs on every NETDEV_UP.

> Second, now even devices which don't ever get brought up will
> have the VLAN ID 0 thing allocated.

Why is this a problem?

> Probably the thing to do is to remove the VLAN ID 0 entry on
> NETDEV_DOWN.

OK, if you prefer fixing it this way, why not...

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ

^ permalink raw reply

* Re: [PATCHv2 RFC 0/4] virtio and vhost-net capacity handling
From: Michael S. Tsirkin @ 2011-06-07 16:08 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Krishna Kumar, Carsten Otte, lguest-uLR06cmDAlY/bJ5BZ2RsiQ,
	Shirley Ma, kvm-u79uwXL29TY76Z2rM5mHXA,
	linux-s390-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	habanero-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Heiko Carstens,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	steved-r/Jw6+rmf7HQT0dZR+AlfA, Christian Borntraeger,
	Tom Lendacky, Martin Schwidefsky, linux390-tA70FqPdS9bQT0dZR+AlfA
In-Reply-To: <cover.1307029008.git.mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On Thu, Jun 02, 2011 at 06:42:35PM +0300, Michael S. Tsirkin wrote:
> OK, here's a new attempt to use the new capacity api.  I also added more
> comments to clarify the logic.  Hope this is more readable.  Let me know
> pls.
> 
> This is on top of the patches applied by Rusty.
> 
> Warning: untested. Posting now to give people chance to
> comment on the API.

OK, this seems to have survived some testing so far,
after I dropped patch 4 and fixed build for patch 3
(build fixup patch sent in reply to the original).

I'll be mostly offline until Sunday, would appreciate
testing reports.

git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git
virtio-net-xmit-polling-v2
git://git.kernel.org/pub/scm/linux/kernel/git/mst/qemu-kvm.git
virtio-net-event-idx-v3

Thanks!

> Changes from v1:
> - fix comment in patch 2 to correct confusion noted by Rusty
> - rewrite patch 3 along the lines suggested by Rusty
>   note: it's not exactly the same but I hope it's close
>   enough, the main difference is that mine does limited
>   polling even in the unlikely xmit failure case.
> - added a patch to not return capacity from add_buf
>   it always looked like a weird hack
> 
> Michael S. Tsirkin (4):
>   virtio_ring: add capacity check API
>   virtio_net: fix tx capacity checks using new API
>   virtio_net: limit xmit polling
>   Revert "virtio: make add_buf return capacity remaining:
> 
>  drivers/net/virtio_net.c     |  111 ++++++++++++++++++++++++++----------------
>  drivers/virtio/virtio_ring.c |   10 +++-
>  include/linux/virtio.h       |    7 ++-
>  3 files changed, 84 insertions(+), 44 deletions(-)
> 
> -- 
> 1.7.5.53.gc233e

^ permalink raw reply

* Re: [PATCHv2 RFC 3/4] virtio_net: limit xmit polling
From: Michael S. Tsirkin @ 2011-06-07 15:59 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Krishna Kumar, Carsten Otte, lguest-uLR06cmDAlY/bJ5BZ2RsiQ,
	Shirley Ma, kvm-u79uwXL29TY76Z2rM5mHXA,
	linux-s390-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	habanero-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Heiko Carstens,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	steved-r/Jw6+rmf7HQT0dZR+AlfA, Christian Borntraeger,
	Tom Lendacky, Martin Schwidefsky, linux390-tA70FqPdS9bQT0dZR+AlfA
In-Reply-To: <a80199422de16ae355e56ee1b2abc9b2bf91a7f6.1307029009.git.mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On Thu, Jun 02, 2011 at 06:43:17PM +0300, Michael S. Tsirkin wrote:
> Current code might introduce a lot of latency variation
> if there are many pending bufs at the time we
> attempt to transmit a new one. This is bad for
> real-time applications and can't be good for TCP either.
> 
> Free up just enough to both clean up all buffers
> eventually and to be able to xmit the next packet.
> 
> Signed-off-by: Michael S. Tsirkin <mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>


I've been testing this patch and it seems to work fine
so far. The following fixups are needed to make it
build though:


diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index b25db1c..77cdf34 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -529,11 +529,8 @@ static bool free_old_xmit_skb(struct virtnet_info *vi)
  * virtqueue_add_buf will succeed. */
 static bool free_xmit_capacity(struct virtnet_info *vi)
 {
-	struct sk_buff *skb;
-	unsigned int len;
-
 	while (virtqueue_min_capacity(vi->svq) < MAX_SKB_FRAGS + 2)
-		if (unlikely(!free_old_xmit_skb))
+		if (unlikely(!free_old_xmit_skb(vi)))
 			return false;
 	return true;
 }
@@ -628,7 +625,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	 * Doing this after kick means there's a chance we'll free
 	 * the skb we have just sent, which is hot in cache. */
 	for (i = 0; i < 2; i++)
-		free_old_xmit_skb(v);
+		free_old_xmit_skb(vi);
 
 	if (likely(free_xmit_capacity(vi)))
 		return NETDEV_TX_OK;

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox