Netdev List
* [PATCH] ieee802154: pass source address in dgram_recvmsg
From: Stephen Röttger @ 2012-05-25 12:14 UTC (permalink / raw)
  To: dbaryshkov, slapin
  Cc: davem, linux-zigbee-devel, netdev, linux-kernel,
	Stephen Röttger

This patch lets dgram_recvmsg fill in the sockaddr struct in
msg->msg_name with the source address of the packet.
This is used by the userland functions recvmsg and recvfrom to get the
sender's address.
The patch is based on the devel branch of
git://linux-zigbee.git.sourceforge.net/gitroot/linux-zigbee/kernel

Signed-off-by: Stephen Röttger <stephen.roettger@zero-entropy.de>
---
 net/ieee802154/dgram.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/net/ieee802154/dgram.c b/net/ieee802154/dgram.c
index 7883fa6..d0a6ebc 100644
--- a/net/ieee802154/dgram.c
+++ b/net/ieee802154/dgram.c
@@ -290,6 +290,9 @@ static int dgram_recvmsg(struct kiocb *iocb, struct sock *sk,
 	size_t copied = 0;
 	int err = -EOPNOTSUPP;
 	struct sk_buff *skb;
+	struct sockaddr_ieee802154 *saddr;
+
+	saddr = (struct sockaddr_ieee802154 *)msg->msg_name;
 
 	skb = skb_recv_datagram(sk, flags, noblock, &err);
 	if (!skb)
@@ -308,6 +311,13 @@ static int dgram_recvmsg(struct kiocb *iocb, struct sock *sk,
 
 	sock_recv_ts_and_drops(msg, sk, skb);
 
+	if (saddr) {
+		saddr->family = AF_IEEE802154;
+		saddr->addr = mac_cb(skb)->sa;
+	}
+	if (addr_len)
+		*addr_len = sizeof(*saddr);
+
 	if (flags & MSG_TRUNC)
 		copied = skb->len;
 done:
-- 
1.7.8
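
The userland side of the contract described above (recvfrom receiving the
sender's address through its address and length arguments) can be sketched
as follows. This is only an illustration and not part of the patch. Because
struct sockaddr_ieee802154 is not assumed to be available in standard
userland headers, the sketch uses AF_UNIX datagram sockets with Linux
autobind as a stand-in; the helper name recvfrom_reports_sender() is
hypothetical.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only. Sends one datagram between
 * two AF_UNIX sockets and checks that recvfrom() hands back the sender's
 * address. Returns 1 if it does, 0 if it does not, -1 on error.
 */
int recvfrom_reports_sender(void)
{
	struct sockaddr_un rx_name, tx_name, src;
	socklen_t rx_len = sizeof(rx_name);
	socklen_t tx_len = sizeof(tx_name);
	socklen_t src_len = sizeof(src);
	char buf[16];
	int rx, tx, ret = -1;

	rx = socket(AF_UNIX, SOCK_DGRAM, 0);
	tx = socket(AF_UNIX, SOCK_DGRAM, 0);
	if (rx < 0 || tx < 0)
		goto out;

	/* Linux autobind: a bind() that passes only the family field makes
	 * the kernel pick a unique abstract address for the socket. */
	memset(&rx_name, 0, sizeof(rx_name));
	rx_name.sun_family = AF_UNIX;
	tx_name = rx_name;
	if (bind(rx, (struct sockaddr *)&rx_name, sizeof(sa_family_t)) < 0 ||
	    bind(tx, (struct sockaddr *)&tx_name, sizeof(sa_family_t)) < 0)
		goto out;

	/* Read back the addresses the kernel assigned. */
	if (getsockname(rx, (struct sockaddr *)&rx_name, &rx_len) < 0 ||
	    getsockname(tx, (struct sockaddr *)&tx_name, &tx_len) < 0)
		goto out;

	if (sendto(tx, "ping", 4, 0,
		   (struct sockaddr *)&rx_name, rx_len) != 4)
		goto out;

	/* The kernel fills src and src_len, which is the msg_name and
	 * addr_len handling a recvmsg implementation has to provide. */
	memset(&src, 0, sizeof(src));
	if (recvfrom(rx, buf, sizeof(buf), 0,
		     (struct sockaddr *)&src, &src_len) != 4)
		goto out;

	ret = src_len == tx_len && memcmp(&src, &tx_name, tx_len) == 0;
out:
	if (rx >= 0)
		close(rx);
	if (tx >= 0)
		close(tx);
	return ret;
}
```

With the patch applied, a caller on an AF_IEEE802154 datagram socket would
presumably do the same thing with a struct sockaddr_ieee802154 buffer.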


* Re: [PATCH] ip.7: Improve explanation about calling listen or connect
From: Peter Schiffer @ 2012-05-25 11:02 UTC (permalink / raw)
  To: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w
  Cc: Flavio Leitner, linux-man-u79uwXL29TY76Z2rM5mHXA, netdev
In-Reply-To: <1336566636-14713-1-git-send-email-fbl-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Hi Michael,

do you have any comments for this update? Or do you need some supporting 
info?

peter

On 05/09/2012 02:30 PM, Flavio Leitner wrote:
> Signed-off-by: Flavio Leitner<fbl-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
>   man7/ip.7 |   15 +++++++++------
>   1 files changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/man7/ip.7 b/man7/ip.7
> index 9f560df..84fe32d 100644
> --- a/man7/ip.7
> +++ b/man7/ip.7
> @@ -69,12 +69,11 @@ For
>   you may specify a valid IANA IP protocol defined in
>   RFC\ 1700 assigned numbers.
>   .PP
> -.\" FIXME ip current does an autobind in listen, but I'm not sure
> -.\" if that should be documented.
>   When a process wants to receive new incoming packets or connections, it
>   should bind a socket to a local interface address using
>   .BR bind (2).
> -Only one IP socket may be bound to any given local (address, port) pair.
> +In this case, only one IP socket may be bound to any given local
> +(address, port) pair.
>   When
>   .B INADDR_ANY
>   is specified in the bind call, the socket will be bound to
> @@ -82,10 +81,14 @@ is specified in the bind call, the socket will be bound to
>   local interfaces.
>   When
>   .BR listen (2)
> -or
> +is called on an unbound socket, the socket is automatically bound
> +to a random free port with the local address set to
> +.BR INADDR_ANY .
> +When
>   .BR connect (2)
> -are called on an unbound socket, it is automatically bound to a
> -random free port with the local address set to
> +is called on an unbound socket, the socket is automatically bound
> +to a random free port or an usable shared port with the local address
> +set to
>   .BR INADDR_ANY .
>
>   A TCP local socket address that has been bound is unavailable for
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
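
The autobind behaviour that the proposed ip.7 text documents for listen(2)
can be observed with a short program. This is a minimal sketch, assuming a
Linux system with IPv4 enabled; the helper name listen_autobind() is
hypothetical and error handling is reduced to a return code.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only. Calls listen(2) on a TCP
 * socket that was never bound and stores the local address the kernel
 * picked in *local. Returns 0 on success, -1 on error.
 */
int listen_autobind(struct sockaddr_in *local)
{
	socklen_t len = sizeof(*local);
	int fd, ret = -1;

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	memset(local, 0, sizeof(*local));

	/* No bind(2) here: listen(2) has to pick a local port itself. */
	if (listen(fd, 1) == 0 &&
	    getsockname(fd, (struct sockaddr *)local, &len) == 0)
		ret = 0;

	close(fd);
	return ret;
}
```

On success, local->sin_addr is expected to be INADDR_ANY and local->sin_port
a nonzero port chosen by the kernel, which matches the wording in the patch.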


* Re: [PATCH 1/2] can: Added constants containing length of CAN identifiers
From: Rostislav Lisovy @ 2012-05-25 10:44 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-can, pisa, sojkam1, oliver
In-Reply-To: <20120525.052256.2147003730285745711.davem@davemloft.net>

On Fri, 2012-05-25 at 05:22 -0400, David Miller wrote: 
> It is not appropriate to submit new features at this time,
> as I described in detail in:
> 
> http://marc.info/?l=netfilter-devel&m=133763475402372&w=2
> 
> I used a subject line with BIG CAPITAL LETTERS in that posting so
> there is really no reason you should have overlooked it.


I am very sorry for not going through the mailing list history
thoroughly enough and thus overlooking your announcement. However, this
was meant more as an [RFC]. If anybody has any comments, please
send them to me.
I will resend the patches as soon as net-next is open.

Best regards,
Rostislav Lisovy



* Re: [PATCH v7 1/2] Always free struct memcg through schedule_work()
From: Glauber Costa @ 2012-05-25  9:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA, devel-GEFAQzZX7r8dnm+yROfE0A,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
	netdev-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, Li Zefan, David Miller,
	Johannes Weiner
In-Reply-To: <20120525095007.GA30527-VqjxzfR4DlwKmadIfiO5sKVXKuFTiq87@public.gmane.org>

On 05/25/2012 01:50 PM, Michal Hocko wrote:
> On Fri 25-05-12 13:32:07, Glauber Costa wrote:
>> Right now we free struct memcg with kfree right after a
>> rcu grace period, but defer it if we need to use vfree() to get
>> rid of that memory area. We do that by need, because we need vfree
>> to be called in a process context.
>>
>> This patch unifies this behavior, by ensuring that even kfree will
>> happen in a separate thread. The goal is to have a stable place to
>> call the upcoming jump label destruction function outside the realm
>> of the complicated and quite far-reaching cgroup lock (that can't be
>> held when calling neither the cpu_hotplug.lock nor the jump_label_mutex)
>>
>> Signed-off-by: Glauber Costa<glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
>> Acked-by: Kamezawa Hiroyuki<kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
>
> Acked-by: Michal Hocko<mhocko-AlSwsSmVLrQ@public.gmane.org>
>
> Just one comment below
>
>> CC: Tejun Heo<tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
>> CC: Li Zefan<lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
>> CC: Johannes Weiner<hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
>> CC: Michal Hocko<mhocko-AlSwsSmVLrQ@public.gmane.org>
>> CC: Andrew Morton<akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
>> ---
>>   mm/memcontrol.c |   24 +++++++++++++-----------
>>   1 files changed, 13 insertions(+), 11 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 932a734..0b4b4c8 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
> [...]
>> @@ -4826,23 +4826,28 @@ out_free:
>>   }
>>
>>   /*
>> - * Helpers for freeing a vzalloc()ed mem_cgroup by RCU,
>> + * Helpers for freeing a kmalloc()ed/vzalloc()ed mem_cgroup by RCU,
>>    * but in process context.  The work_freeing structure is overlaid
>>    * on the rcu_freeing structure, which itself is overlaid on memsw.
>>    */
>> -static void vfree_work(struct work_struct *work)
>> +static void free_work(struct work_struct *work)
>>   {
>>   	struct mem_cgroup *memcg;
>> +	int size = sizeof(struct mem_cgroup);
>>
>>   	memcg = container_of(work, struct mem_cgroup, work_freeing);
>> -	vfree(memcg);
>> +	if (size<  PAGE_SIZE)
>
> What about
> 	if (is_vmalloc_addr(memcg))
>> +		kfree(memcg);
>> +	else
>> +		vfree(memcg);
>>   }
>
Could be, but I believe this one is already in Andrew's tree from the last
submission (I might be wrong).
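
For readers following the is_vmalloc_addr() suggestion: the idea is to pick
the free routine from the address itself instead of from the object size.
Below is a userspace sketch of that idea, not kernel code. A static buffer
stands in for the vmalloc range, and all fake_* names are hypothetical. Note
that with an address test the vmalloc branch comes first, so substituting
is_vmalloc_addr(memcg) for the size test in the quoted hunk would also
require swapping the kfree() and vfree() branches.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the vmalloc address range; purely illustrative. */
static unsigned char fake_vmalloc_area[4096];
static size_t fake_vmalloc_used;

enum free_path { FREED_KFREE, FREED_VFREE };

/* Models is_vmalloc_addr(): a plain range check on the pointer value. */
bool is_fake_vmalloc_addr(const void *p)
{
	uintptr_t a = (uintptr_t)p;
	uintptr_t start = (uintptr_t)fake_vmalloc_area;

	return a >= start && a < start + sizeof(fake_vmalloc_area);
}

/* Models vzalloc(): bump allocation from the fake range. */
void *fake_vzalloc(size_t size)
{
	void *p;

	if (size > sizeof(fake_vmalloc_area) - fake_vmalloc_used)
		return NULL;
	p = fake_vmalloc_area + fake_vmalloc_used;
	fake_vmalloc_used += size;
	return memset(p, 0, size);
}

/* Models kzalloc(): ordinary zeroed heap memory. */
void *fake_kzalloc(size_t size)
{
	return calloc(1, size);
}

/* Chooses the release path from the address, not from the size. */
enum free_path fake_free(void *p)
{
	if (is_fake_vmalloc_addr(p)) {
		/* The bump allocator has nothing to release per object. */
		return FREED_VFREE;
	}
	free(p);
	return FREED_KFREE;
}
```

The address test keeps working even if the allocation side later changes its
size threshold, which a repeated sizeof() comparison would not.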


* Re: [PATCH v7 1/2] Always free struct memcg through schedule_work()
From: Michal Hocko @ 2012-05-25  9:50 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Andrew Morton, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA, devel-GEFAQzZX7r8dnm+yROfE0A,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
	netdev-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, Li Zefan, David Miller,
	Johannes Weiner
In-Reply-To: <1337938328-11537-2-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>

On Fri 25-05-12 13:32:07, Glauber Costa wrote:
> Right now we free struct memcg with kfree right after a
> rcu grace period, but defer it if we need to use vfree() to get
> rid of that memory area. We do that by need, because we need vfree
> to be called in a process context.
> 
> This patch unifies this behavior, by ensuring that even kfree will
> happen in a separate thread. The goal is to have a stable place to
> call the upcoming jump label destruction function outside the realm
> of the complicated and quite far-reaching cgroup lock (that can't be
> held when calling neither the cpu_hotplug.lock nor the jump_label_mutex)
> 
> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>

Acked-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>

Just one comment below

> CC: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> CC: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
> CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> CC: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
> ---
>  mm/memcontrol.c |   24 +++++++++++++-----------
>  1 files changed, 13 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 932a734..0b4b4c8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
[...]
> @@ -4826,23 +4826,28 @@ out_free:
>  }
>  
>  /*
> - * Helpers for freeing a vzalloc()ed mem_cgroup by RCU,
> + * Helpers for freeing a kmalloc()ed/vzalloc()ed mem_cgroup by RCU,
>   * but in process context.  The work_freeing structure is overlaid
>   * on the rcu_freeing structure, which itself is overlaid on memsw.
>   */
> -static void vfree_work(struct work_struct *work)
> +static void free_work(struct work_struct *work)
>  {
>  	struct mem_cgroup *memcg;
> +	int size = sizeof(struct mem_cgroup);
>  
>  	memcg = container_of(work, struct mem_cgroup, work_freeing);
> -	vfree(memcg);
> +	if (size < PAGE_SIZE)

What about
	if (is_vmalloc_addr(memcg)) 
> +		kfree(memcg);
> +	else
> +		vfree(memcg);
>  }

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


* [PATCH v7 2/2] decrement static keys on real destroy time
From: Glauber Costa @ 2012-05-25  9:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, cgroups, devel, kamezawa.hiroyu, netdev, Tejun Heo,
	Li Zefan, David Miller, Glauber Costa, Johannes Weiner,
	Michal Hocko
In-Reply-To: <1337938328-11537-1-git-send-email-glommer@parallels.com>

We call the destroy function when a cgroup starts to be removed,
such as by a rmdir event.

However, because of our reference counters, some objects are still
in flight. Right now, we are decrementing the static_keys at destroy()
time, meaning that if we get rid of the last static_key reference,
some objects will still have charges, but the code to properly
uncharge them won't be run.

This becomes a problem especially if it is ever enabled again, because
new charges will then be added to the stale charges, making them
pretty much impossible to keep track of.

We just need to be careful with the static branch activation:
since there is no particular preferred order of their activation,
we need to make sure that we only start using it after all
call sites are active. This is achieved by having a per-memcg
flag that is only updated after static_key_slow_inc() returns.
At this time, we are sure all sites are active.

This is made per-memcg, not global, for a reason:
it also has the effect of making socket accounting more
consistent. The first memcg to be limited will trigger static_key()
activation and, therefore, accounting. But all the others will then be
accounted no matter what. After this patch, only limited memcgs
will have their sockets accounted.

[v2: changed a tcp limited flag for a generic proto limited flag ]
[v3: update the current active flag only after the static_key update ]
[v4: disarm_static_keys() inside free_work ]
[v5: got rid of tcp_limit_mutex, now in the static_key interface ]
[v6: changed active and activated to a flags field, as suggested by akpm ]
[v7: merged more comments from akpm ]

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Tejun Heo <tj@kernel.org>
CC: Li Zefan <lizefan@huawei.com>
CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: Johannes Weiner <hannes@cmpxchg.org>
CC: Michal Hocko <mhocko@suse.cz>
CC: Andrew Morton <akpm@linux-foundation.org>
---
 include/net/sock.h        |   22 ++++++++++++++++++++++
 mm/memcontrol.c           |   31 +++++++++++++++++++++++++++++--
 net/ipv4/tcp_memcontrol.c |   34 +++++++++++++++++++++++++++-------
 3 files changed, 78 insertions(+), 9 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index b3ebe6b..d6a8ae3 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -46,6 +46,7 @@
 #include <linux/list_nulls.h>
 #include <linux/timer.h>
 #include <linux/cache.h>
+#include <linux/bitops.h>
 #include <linux/lockdep.h>
 #include <linux/netdevice.h>
 #include <linux/skbuff.h>	/* struct sk_buff */
@@ -907,12 +908,23 @@ struct proto {
 #endif
 };
 
+/*
+ * Bits in struct cg_proto.flags
+ */
+enum cg_proto_flags {
+	/* Currently active and new sockets should be assigned to cgroups */
+	MEMCG_SOCK_ACTIVE,
+	/* It was ever activated; we must disarm static keys on destruction */
+	MEMCG_SOCK_ACTIVATED,
+};
+
 struct cg_proto {
 	void			(*enter_memory_pressure)(struct sock *sk);
 	struct res_counter	*memory_allocated;	/* Current allocated memory. */
 	struct percpu_counter	*sockets_allocated;	/* Current number of sockets. */
 	int			*memory_pressure;
 	long			*sysctl_mem;
+	unsigned long		flags;
 	/*
 	 * memcg field is used to find which memcg we belong directly
 	 * Each memcg struct can hold more than one cg_proto, so container_of
@@ -928,6 +940,16 @@ struct cg_proto {
 extern int proto_register(struct proto *prot, int alloc_slab);
 extern void proto_unregister(struct proto *prot);
 
+static inline bool memcg_proto_active(struct cg_proto *cg_proto)
+{
+	return test_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags);
+}
+
+static inline bool memcg_proto_activated(struct cg_proto *cg_proto)
+{
+	return test_bit(MEMCG_SOCK_ACTIVATED, &cg_proto->flags);
+}
+
 #ifdef SOCK_REFCNT_DEBUG
 static inline void sk_refcnt_debug_inc(struct sock *sk)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0b4b4c8..788be2e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -404,6 +404,7 @@ void sock_update_memcg(struct sock *sk)
 {
 	if (mem_cgroup_sockets_enabled) {
 		struct mem_cgroup *memcg;
+		struct cg_proto *cg_proto;
 
 		BUG_ON(!sk->sk_prot->proto_cgroup);
 
@@ -423,9 +424,10 @@ void sock_update_memcg(struct sock *sk)
 
 		rcu_read_lock();
 		memcg = mem_cgroup_from_task(current);
-		if (!mem_cgroup_is_root(memcg)) {
+		cg_proto = sk->sk_prot->proto_cgroup(memcg);
+		if (!mem_cgroup_is_root(memcg) && memcg_proto_active(cg_proto)) {
 			mem_cgroup_get(memcg);
-			sk->sk_cgrp = sk->sk_prot->proto_cgroup(memcg);
+			sk->sk_cgrp = cg_proto;
 		}
 		rcu_read_unlock();
 	}
@@ -454,6 +456,19 @@ EXPORT_SYMBOL(tcp_proto_cgroup);
 #endif /* CONFIG_INET */
 #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
 
+#if defined(CONFIG_INET) && defined(CONFIG_CGROUP_MEM_RES_CTLR_KMEM)
+static void disarm_sock_keys(struct mem_cgroup *memcg)
+{
+	if (!memcg_proto_activated(&memcg->tcp_mem.cg_proto))
+		return;
+	static_key_slow_dec(&memcg_socket_limit_enabled);
+}
+#else
+static void disarm_sock_keys(struct mem_cgroup *memcg)
+{
+}
+#endif
+
 static void drain_all_stock_async(struct mem_cgroup *memcg);
 
 static struct mem_cgroup_per_zone *
@@ -4836,6 +4851,18 @@ static void free_work(struct work_struct *work)
 	int size = sizeof(struct mem_cgroup);
 
 	memcg = container_of(work, struct mem_cgroup, work_freeing);
+	/*
+	 * We need to make sure that (at least for now), the jump label
+	 * destruction code runs outside of the cgroup lock. This is because
+	 * get_online_cpus(), which is called from the static_branch update,
+	 * can't be called inside the cgroup_lock. cpusets are the ones
+	 * enforcing this dependency, so if they ever change, we might as well.
+	 *
+	 * schedule_work() will guarantee this happens. Be careful if you need
+	 * to move this code around, and make sure it is outside
+	 * the cgroup_lock.
+	 */
+	disarm_sock_keys(memcg);
 	if (size < PAGE_SIZE)
 		kfree(memcg);
 	else
diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c
index 1517037..b6f3583 100644
--- a/net/ipv4/tcp_memcontrol.c
+++ b/net/ipv4/tcp_memcontrol.c
@@ -74,9 +74,6 @@ void tcp_destroy_cgroup(struct mem_cgroup *memcg)
 	percpu_counter_destroy(&tcp->tcp_sockets_allocated);
 
 	val = res_counter_read_u64(&tcp->tcp_memory_allocated, RES_LIMIT);
-
-	if (val != RESOURCE_MAX)
-		static_key_slow_dec(&memcg_socket_limit_enabled);
 }
 EXPORT_SYMBOL(tcp_destroy_cgroup);
 
@@ -107,10 +104,33 @@ static int tcp_update_limit(struct mem_cgroup *memcg, u64 val)
 		tcp->tcp_prot_mem[i] = min_t(long, val >> PAGE_SHIFT,
 					     net->ipv4.sysctl_tcp_mem[i]);
 
-	if (val == RESOURCE_MAX && old_lim != RESOURCE_MAX)
-		static_key_slow_dec(&memcg_socket_limit_enabled);
-	else if (old_lim == RESOURCE_MAX && val != RESOURCE_MAX)
-		static_key_slow_inc(&memcg_socket_limit_enabled);
+	if (val == RESOURCE_MAX)
+		clear_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags);
+	else if (val != RESOURCE_MAX) {
+		/*
+		 * The active bit needs to be written after the static_key
+		 * update. This is what guarantees that the socket activation
+		 * function is the last one to run. See sock_update_memcg() for
+		 * details, and note that we don't mark any socket as belonging
+		 * to this memcg until that flag is up.
+		 *
+		 * We need to do this, because static_keys will span multiple
+		 * sites, but we can't control their order. If we mark a socket
+		 * as accounted, but the accounting functions are not patched in
+		 * yet, we'll lose accounting.
+		 *
+		 * We never race with the readers in sock_update_memcg(),
+		 * because when this value changes, the code to process it is not
+		 * patched in yet.
+		 *
+		 * The activated bit is used to guarantee that no two writers
+		 * will do the update in the same memcg. Without that, we can't
+		 * properly shutdown the static key.
+		 */
+		if (!test_and_set_bit(MEMCG_SOCK_ACTIVATED, &cg_proto->flags))
+			static_key_slow_inc(&memcg_socket_limit_enabled);
+		set_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags);
+	}
 
 	return 0;
 }
-- 
1.7.7.6
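
The flag ordering described in this patch (set ACTIVATED once and bump the
key, only then set ACTIVE, and drop the key reference at real destroy time)
can be modelled in userspace. The sketch below is an illustration under
stated assumptions, not the kernel implementation: C11 atomics stand in for
the kernel bitops, a plain counter stands in for the static key, and all
model_* names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum {
	SOCK_ACTIVE    = 1u << 0,	/* new sockets get accounted */
	SOCK_ACTIVATED = 1u << 1,	/* key was taken at least once */
};

struct proto_model {
	atomic_uint flags;
};

/* Stand-in for the reference count held on the static key. */
static atomic_int key_refs;

void model_update_limit(struct proto_model *p, bool limited)
{
	if (!limited) {
		/* Removing the limit only stops new accounting. The key
		 * reference stays until the group is really destroyed. */
		atomic_fetch_and(&p->flags, ~(unsigned int)SOCK_ACTIVE);
		return;
	}

	/* Take the key reference at most once per group. */
	if (!(atomic_fetch_or(&p->flags, SOCK_ACTIVATED) & SOCK_ACTIVATED))
		atomic_fetch_add(&key_refs, 1);

	/* Publish ACTIVE only after the key update has finished. */
	atomic_fetch_or(&p->flags, SOCK_ACTIVE);
}

/* Models disarm_sock_keys(), called from the final free path. */
void model_destroy(struct proto_model *p)
{
	if (atomic_load(&p->flags) & SOCK_ACTIVATED)
		atomic_fetch_sub(&key_refs, 1);
}

bool model_active(struct proto_model *p)
{
	return atomic_load(&p->flags) & SOCK_ACTIVE;
}

int model_key_refs(void)
{
	return atomic_load(&key_refs);
}
```

Setting a limit twice on the same group takes one reference, clearing the
limit takes none away, and each activated group gives its reference back
exactly once when it is destroyed.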


* [PATCH v7 1/2] Always free struct memcg through schedule_work()
From: Glauber Costa @ 2012-05-25  9:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, cgroups, devel, kamezawa.hiroyu, netdev, Tejun Heo,
	Li Zefan, David Miller, Glauber Costa, Johannes Weiner,
	Michal Hocko
In-Reply-To: <1337938328-11537-1-git-send-email-glommer@parallels.com>

Right now we free struct memcg with kfree right after an
RCU grace period, but defer it if we need to use vfree() to get
rid of that memory area. We do that out of necessity, because vfree
must be called in process context.

This patch unifies this behavior by ensuring that even kfree will
happen in a separate thread. The goal is to have a stable place to
call the upcoming jump label destruction function outside the realm
of the complicated and quite far-reaching cgroup lock (which can't be
held while taking either the cpu_hotplug.lock or the jump_label_mutex).

Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: Tejun Heo <tj@kernel.org>
CC: Li Zefan <lizefan@huawei.com>
CC: Johannes Weiner <hannes@cmpxchg.org>
CC: Michal Hocko <mhocko@suse.cz>
CC: Andrew Morton <akpm@linux-foundation.org>
---
 mm/memcontrol.c |   24 +++++++++++++-----------
 1 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 932a734..0b4b4c8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -245,8 +245,8 @@ struct mem_cgroup {
 		 */
 		struct rcu_head rcu_freeing;
 		/*
-		 * But when using vfree(), that cannot be done at
-		 * interrupt time, so we must then queue the work.
+		 * We also need some space for a worker in deferred freeing.
+		 * By the time we call it, rcu_freeing is no longer in use.
 		 */
 		struct work_struct work_freeing;
 	};
@@ -4826,23 +4826,28 @@ out_free:
 }
 
 /*
- * Helpers for freeing a vzalloc()ed mem_cgroup by RCU,
+ * Helpers for freeing a kmalloc()ed/vzalloc()ed mem_cgroup by RCU,
  * but in process context.  The work_freeing structure is overlaid
  * on the rcu_freeing structure, which itself is overlaid on memsw.
  */
-static void vfree_work(struct work_struct *work)
+static void free_work(struct work_struct *work)
 {
 	struct mem_cgroup *memcg;
+	int size = sizeof(struct mem_cgroup);
 
 	memcg = container_of(work, struct mem_cgroup, work_freeing);
-	vfree(memcg);
+	if (size < PAGE_SIZE)
+		kfree(memcg);
+	else
+		vfree(memcg);
 }
-static void vfree_rcu(struct rcu_head *rcu_head)
+
+static void free_rcu(struct rcu_head *rcu_head)
 {
 	struct mem_cgroup *memcg;
 
 	memcg = container_of(rcu_head, struct mem_cgroup, rcu_freeing);
-	INIT_WORK(&memcg->work_freeing, vfree_work);
+	INIT_WORK(&memcg->work_freeing, free_work);
 	schedule_work(&memcg->work_freeing);
 }
 
@@ -4868,10 +4873,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
 		free_mem_cgroup_per_zone_info(memcg, node);
 
 	free_percpu(memcg->stat);
-	if (sizeof(struct mem_cgroup) < PAGE_SIZE)
-		kfree_rcu(memcg, rcu_freeing);
-	else
-		call_rcu(&memcg->rcu_freeing, vfree_rcu);
+	call_rcu(&memcg->rcu_freeing, free_rcu);
 }
 
 static void mem_cgroup_get(struct mem_cgroup *memcg)
-- 
1.7.7.6
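
The two-stage free in this patch (an RCU callback that only queues work,
followed by a worker that performs the actual free, with both bookkeeping
structures sharing storage in a union) can be modelled in userspace. This is
a single-threaded sketch under loud assumptions: the "grace period" elapses
immediately, a linked list stands in for the workqueue, and every *_model
name is hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct rcu_head_model {
	void (*func)(struct rcu_head_model *head);
};

struct work_model {
	void (*func)(struct work_model *work);
	struct work_model *next;
};

struct memcg_model {
	int id;
	union {
		/* Used while waiting for the grace period. */
		struct rcu_head_model rcu_freeing;
		/* Reuses the same storage once the RCU stage is over. */
		struct work_model work_freeing;
	};
};

static struct work_model *pending;	/* stand-in for the workqueue */
static int freed_count;

/* Second stage: runs later, in "process context". */
static void free_work(struct work_model *work)
{
	free(container_of(work, struct memcg_model, work_freeing));
	freed_count++;
}

/* First stage: must not free here, so it only queues the work item. */
static void free_rcu(struct rcu_head_model *head)
{
	struct memcg_model *m;

	m = container_of(head, struct memcg_model, rcu_freeing);
	m->work_freeing.func = free_work;
	m->work_freeing.next = pending;
	pending = &m->work_freeing;
}

/* Stands in for call_rcu(); the grace period is assumed to be over. */
void model_free(struct memcg_model *m)
{
	m->rcu_freeing.func = free_rcu;
	m->rcu_freeing.func(&m->rcu_freeing);
}

/* Stands in for the worker thread; returns the number of items run. */
int run_pending_work(void)
{
	int n = 0;

	while (pending) {
		struct work_model *w = pending;

		pending = w->next;
		w->func(w);
		n++;
	}
	return n;
}

int model_freed_count(void)
{
	return freed_count;
}
```

The object is still allocated after model_free() returns and is released
only when run_pending_work() is called, which is the "stable place" the
commit message refers to.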


* [PATCH v7 0/2] fixes for sock memcg static branch disablement
From: Glauber Costa @ 2012-05-25  9:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, cgroups, devel, kamezawa.hiroyu, netdev, Tejun Heo,
	Li Zefan, David Miller

Hi Andrew,

I believe this one addresses all of your previous comments.

Besides merging your patch, I tried to improve the comments so they would
be more informative. 

The first patch, I believe, is already merged in your tree, but I am including
it here for completeness. I made no changes since the last submission, so feel free
to pick the second - or if there are still missing changes you'd like to see,
point me to them.

Thanks

Glauber Costa (2):
  Always free struct memcg through schedule_work()
  decrement static keys on real destroy time

 include/net/sock.h        |   22 ++++++++++++++++++
 mm/memcontrol.c           |   55 ++++++++++++++++++++++++++++++++++----------
 net/ipv4/tcp_memcontrol.c |   34 ++++++++++++++++++++++-----
 3 files changed, 91 insertions(+), 20 deletions(-)

-- 
1.7.7.6


* Re: [PATCH 1/2] can: Added constants containing length of CAN identifiers
From: David Miller @ 2012-05-25  9:22 UTC (permalink / raw)
  To: lisovy; +Cc: netdev, linux-can, pisa, sojkam1, oliver
In-Reply-To: <1337937157-7680-1-git-send-email-lisovy@gmail.com>


It is not appropriate to submit new features at this time,
as I described in detail in:

http://marc.info/?l=netfilter-devel&m=133763475402372&w=2

I used a subject line with BIG CAPITAL LETTERS in that posting so
there is really no reason you should have overlooked it.


* [PATCH 2/2] net/sched: CAN Filter/Classifier
From: Rostislav Lisovy @ 2012-05-25  9:12 UTC (permalink / raw)
  To: netdev; +Cc: linux-can, pisa, sojkam1, oliver, Rostislav Lisovy
In-Reply-To: <1337937157-7680-1-git-send-email-lisovy@gmail.com>

The CAN classifier may be used with any available qdisc on Controller
Area Network (CAN) frames passed through the AF_CAN networking subsystem.
The classifier classifies CAN frames according to their identifiers.
It can be used on CAN frames with either SFF or EFF identifiers.

The filtering rules for EFF frames are stored in an array, which
is traversed during classification. A bitmap is used to store SFF
rules -- one bit for each ID.

More info about the project:
http://rtime.felk.cvut.cz/can/socketcan-qdisc-final.pdf
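
The two lookup structures described above (an array scanned for EFF rules
and a bitmap with one bit per SFF ID) can be sketched in userspace as
follows. This is an illustration, not the code from the patch: the constants
mirror the values in linux/can.h, while the clf_* names and the fixed rule
limit are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SFF_ID_BITS	11
#define SFF_MASK	0x000007FFu	/* 11-bit standard frame ID */
#define EFF_MASK	0x1FFFFFFFu	/* 29-bit extended frame ID */
#define EFF_FLAG	0x80000000u	/* marks an extended frame */
#define MAX_EFF_RULES	8		/* arbitrary limit for the sketch */

struct clf_rule {
	uint32_t id;
	uint32_t mask;
};

struct clf {
	uint8_t sff_bitmap[(1u << SFF_ID_BITS) / 8];	/* one bit per ID */
	struct clf_rule eff_rules[MAX_EFF_RULES];
	int eff_count;
};

/* Adds one id/mask rule. Returns false if the EFF table is full. */
bool clf_add(struct clf *c, uint32_t id, uint32_t mask)
{
	uint32_t i;

	if ((id & EFF_FLAG) && (mask & EFF_FLAG)) {
		if (c->eff_count == MAX_EFF_RULES)
			return false;
		c->eff_rules[c->eff_count].id = id;
		c->eff_rules[c->eff_count].mask = mask;
		c->eff_count++;
		return true;
	}

	/* SFF rule: precompute every ID it matches into the bitmap. */
	mask &= SFF_MASK;
	id &= mask;
	for (i = 0; i < (1u << SFF_ID_BITS); i++)
		if ((i & mask) == id)
			c->sff_bitmap[i / 8] |= (uint8_t)(1u << (i % 8));
	return true;
}

/* Classifies one frame ID: linear scan for EFF, one bit test for SFF. */
bool clf_match(const struct clf *c, uint32_t can_id)
{
	int i;

	if (can_id & EFF_FLAG) {
		for (i = 0; i < c->eff_count; i++)
			if (!((c->eff_rules[i].id ^ can_id) &
			      c->eff_rules[i].mask & EFF_MASK))
				return true;
		return false;
	}

	can_id &= SFF_MASK;
	return (c->sff_bitmap[can_id / 8] & (1u << (can_id % 8))) != 0;
}
```

For example, an SFF rule with ID 0x120 and mask 0x7F0 sets the bits for IDs
0x120 through 0x12F, so classifying an SFF frame costs a single bit test no
matter how many SFF rules were added.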

Signed-off-by: Rostislav Lisovy <lisovy@gmail.com>
---
 net/sched/Kconfig   |   10 +
 net/sched/Makefile  |    1 +
 net/sched/cls_can.c |  571 +++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 582 insertions(+)
 create mode 100644 net/sched/cls_can.c

diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index e7a8976..aeb3c29 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -323,6 +323,16 @@ config NET_CLS_BASIC
 	  To compile this code as a module, choose M here: the
 	  module will be called cls_basic.
 
+config NET_CLS_CAN
+	tristate "Controller Area Network classifier (CAN)"
+	select NET_CLS
+	---help---
+	  Say Y here if you want to be able to classify CAN frames according
+	  to their CAN identifiers (can_id).
+
+	  To compile this code as a module, choose M here: the
+	  module will be called cls_can.
+
 config NET_CLS_TCINDEX
 	tristate "Traffic-Control Index (TCINDEX)"
 	select NET_CLS
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 5940a19..0217341 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -47,6 +47,7 @@ obj-$(CONFIG_NET_CLS_RSVP)	+= cls_rsvp.o
 obj-$(CONFIG_NET_CLS_TCINDEX)	+= cls_tcindex.o
 obj-$(CONFIG_NET_CLS_RSVP6)	+= cls_rsvp6.o
 obj-$(CONFIG_NET_CLS_BASIC)	+= cls_basic.o
+obj-$(CONFIG_NET_CLS_CAN)	+= cls_can.o
 obj-$(CONFIG_NET_CLS_FLOW)	+= cls_flow.o
 obj-$(CONFIG_NET_CLS_CGROUP)	+= cls_cgroup.o
 obj-$(CONFIG_NET_EMATCH)	+= ematch.o
diff --git a/net/sched/cls_can.c b/net/sched/cls_can.c
new file mode 100644
index 0000000..111668e
--- /dev/null
+++ b/net/sched/cls_can.c
@@ -0,0 +1,571 @@
+/*
+ * cls_can.c  -- Controller Area Network classifier.
+ * Makes decisions according to Controller Area Network identifiers (can_id).
+ *
+ *             This program is free software; you can distribute it and/or
+ *             modify it under the terms of the GNU General Public License
+ *             as published by the Free Software Foundation; version 2 of
+ *             the License.
+ *
+ * Idea:       Oliver Hartkopp <oliver.hartkopp@volkswagen.de>
+ * Copyright:  (c) 2011 Czech Technical University in Prague
+ *             (c) 2011 Volkswagen Group Research
+ * Authors:    Michal Sojka <sojkam1@fel.cvut.cz>
+ *             Pavel Pisa <pisa@cmp.felk.cvut.cz>
+ *             Rostislav Lisovy <lisovy@gmail.cz>
+ * Funded by:  Volkswagen Group Research
+ *
+ * Some function descriptions are heavily inspired by article "Linux Network
+ * Traffic Control -- Implementation Overview" by Werner Almesberger
+ */
+
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/rtnetlink.h>
+#include <linux/skbuff.h>
+#include <net/netlink.h>
+#include <net/act_api.h>
+#include <net/pkt_cls.h>
+#include <linux/bitmap.h>
+#include <linux/spinlock.h>
+#include <linux/rcupdate.h>
+#include <linux/can.h>
+
+/* Definition of Netlink message parts */
+enum {
+	TCA_CANFLTR_UNSPEC,
+	TCA_CANFLTR_CLASSID,
+	TCA_CANFLTR_RULES,	/* Array of can_filter structs; We are able
+				to determine the length after receiving */
+	__TCA_CANFLTR_MAX
+};
+#define TCA_CANFLTR_MAX (__TCA_CANFLTR_MAX - 1)
+
+static const struct nla_policy canfltr_policy[TCA_CANFLTR_MAX + 1] = {
+	[TCA_CANFLTR_CLASSID]    = { .type = NLA_U32 }, /* Be aware of possible
+						problems with 64bit kernel and
+						32bit userspace etc. */
+	[TCA_CANFLTR_RULES]      = { .type = NLA_NESTED }
+};
+
+struct canfltr_rules {
+	struct can_filter *rules_raw;	/* Raw rules copied from netlink
+					message; Used for sending information
+					to userspace (when 'tc filter show' is
+					invoked) AND when matching EFF frames*/
+	DECLARE_BITMAP(match_sff, (1 << CAN_SFF_ID_BITS)); /* For each SFF CAN
+					ID (11 bit) there is one record in this
+					bitfield */
+	int rules_count;
+	int eff_rules_count;
+	int sff_rules_count;
+
+	struct rcu_head rcu;
+};
+
+struct canfltr_head {
+	u32 hgenerator;
+	struct list_head flist;
+};
+
+struct canfltr_state {
+	u32 handle;
+	struct canfltr_rules *rules;	/* All rules necessary for
+					classification */
+	struct tcf_result res;		/* Class ID (flow id) the instance
+					of a filter is bound to */
+	struct list_head link;
+};
+
+/*
+ * ----------------------------------------------------------------------------
+ */
+
+static void canfltr_sff_match_add(struct canfltr_rules *rls,
+				u32 can_id, u32 can_mask)
+{
+	int i;
+
+	/* Limit can_mask and can_id to SFF range to
+	protect against write after end of array */
+	can_mask &= CAN_SFF_MASK;
+	can_id &= can_mask;
+
+	/* single frame */
+	if (can_mask == CAN_SFF_MASK) {
+		set_bit(can_id, rls->match_sff);
+		return;
+	}
+
+	/* all frames */
+	if (can_mask == 0) {
+		bitmap_fill(rls->match_sff, (1 << CAN_SFF_ID_BITS));
+		return;
+	}
+
+	/* individual frame filter */
+	/* Add record (set bit to 1) for each ID that
+	conforms particular rule */
+	for (i = 0; i < (1 << CAN_SFF_ID_BITS); i++) {
+		if ((i & can_mask) == can_id)
+			set_bit(i, rls->match_sff);
+	}
+}
+
+/**
+ * canfltr_get_id() - Extracts Can ID out of the sk_buff structure.
+ */
+static canid_t canfltr_get_id(struct sk_buff *skb)
+{
+	/* Can ID is inside of data field */
+	struct can_frame *cf = (struct can_frame *)skb->data;
+
+	return cf->can_id;
+}
+
+/**
+ * canfltr_classify() - Performs the classification.
+ *
+ * @skb: Socket buffer
+ * @tp:
+ * @res: Is used for setting Class ID as a result of classification
+ *
+ * Iterates over all instances of filter, checking for CAN ID match.
+ *
+ * Returns value relevant for policing. Used return values:
+ *   TC_POLICE_OK if successfully classified (without regard to policing rules)
+ *   TC_POLICE_UNSPEC if no matching rule was found
+ */
+static int canfltr_classify(struct sk_buff *skb, const struct tcf_proto *tp,
+			  struct tcf_result *res)
+{
+	struct canfltr_head *head = (struct canfltr_head *)tp->root;
+	struct canfltr_state *f;
+	struct canfltr_rules *r;
+	canid_t can_id;
+	int i;
+
+	can_id = canfltr_get_id(skb);
+
+	rcu_read_lock();
+	list_for_each_entry(f, &head->flist, link) {
+		bool match = false;
+		r = rcu_dereference(f->rules);
+
+
+		if (can_id & CAN_EFF_FLAG) {
+			can_id &= CAN_EFF_MASK;
+
+			for (i = 0; i < r->eff_rules_count; i++) {
+				if (!(((r->rules_raw[i].can_id ^ can_id) &
+				r->rules_raw[i].can_mask) & CAN_EFF_MASK)) {
+					match = true;
+					break;
+				}
+			}
+		} else { /* SFF */
+			can_id &= CAN_SFF_MASK;
+			match = test_bit(can_id, r->match_sff);
+		}
+
+		if (match) {
+			*res = f->res;
+			rcu_read_unlock();
+			return TC_POLICE_OK;
+		}
+	}
+
+	rcu_read_unlock();
+	return TC_POLICE_UNSPEC;
+}
+
+/**
+ * canfltr_get() - Looks up a filter element by its handle and returns the
+ * internal filter ID (i.e. pointer)
+ */
+static unsigned long canfltr_get(struct tcf_proto *tp, u32 handle)
+{
+	struct canfltr_head *head = (struct canfltr_head *)tp->root;
+	struct canfltr_state *f;
+
+	if (head == NULL)
+		return 0UL;
+
+	list_for_each_entry(f, &head->flist, link) {
+		if (f->handle == handle)
+			return (unsigned long) f;
+	}
+
+	return 0UL;
+}
+
+/**
+ * canfltr_put() - Is invoked when a filter element previously referenced
+ * with get() is no longer used
+ */
+static void canfltr_put(struct tcf_proto *tp, unsigned long f)
+{
+}
+
+/**
+ * canfltr_gen_handle() - Generate handle for newly created filter
+ *
+ * This code is heavily inspired by handle generator in cls_basic.c
+ */
+static unsigned int canfltr_gen_handle(struct tcf_proto *tp)
+{
+	struct canfltr_head *head = (struct canfltr_head *)tp->root;
+	unsigned int i = 0x80000000;
+
+	do {
+		if (++head->hgenerator == 0x7FFFFFFF)
+			head->hgenerator = 1;
+	} while (--i > 0 && canfltr_get(tp, head->hgenerator));
+
+	if (i == 0)
+		return 0;
+
+	return head->hgenerator;
+}
+
+static void canfltr_rules_free_rcu(struct rcu_head *rcu)
+{
+	kfree(container_of(rcu, struct canfltr_rules, rcu));
+}
+
+static int canfltr_set_parms(struct tcf_proto *tp, struct canfltr_state *f,
+				unsigned long base, struct nlattr **tb,
+				struct nlattr *est)
+{
+	struct can_filter *canfltr_nl_rules;
+	struct canfltr_rules *rules_tmp;
+	int err;
+	int i;
+
+	rules_tmp = kzalloc(sizeof(*rules_tmp), GFP_KERNEL);
+	if (!rules_tmp)
+		return -ENOBUFS;
+
+	err = -EINVAL;
+	if (tb[TCA_CANFLTR_CLASSID] == NULL)
+		goto errout;
+
+	if (tb[TCA_CANFLTR_RULES]) {
+		canfltr_nl_rules = nla_data(tb[TCA_CANFLTR_RULES]);
+		rules_tmp->sff_rules_count = 0;
+		rules_tmp->eff_rules_count = 0;
+		rules_tmp->rules_count = (nla_len(tb[TCA_CANFLTR_RULES]) /
+			sizeof(struct can_filter));
+
+		rules_tmp->rules_raw = kzalloc(sizeof(struct can_filter) *
+			rules_tmp->rules_count, GFP_KERNEL);
+		err = -ENOMEM;
+		if (rules_tmp->rules_raw == NULL)
+			goto errout;
+
+		/* We need two for() loops for copying rules into
+		two contiguous areas in rules_raw */
+
+		/* Process EFF frame rules */
+		for (i = 0; i < rules_tmp->rules_count; i++) {
+			if ((canfltr_nl_rules[i].can_id & CAN_EFF_FLAG) &&
+			    (canfltr_nl_rules[i].can_mask & CAN_EFF_FLAG)) {
+				memcpy(rules_tmp->rules_raw +
+					rules_tmp->eff_rules_count,
+					&canfltr_nl_rules[i],
+					sizeof(struct can_filter));
+				rules_tmp->eff_rules_count++;
+			} else {
+				continue;
+			}
+		}
+
+		/* Process SFF frame rules */
+		for (i = 0; i < rules_tmp->rules_count; i++) {
+			if ((canfltr_nl_rules[i].can_id & CAN_EFF_FLAG) &&
+			    (canfltr_nl_rules[i].can_mask & CAN_EFF_FLAG)) {
+				continue;
+			} else {
+				memcpy(rules_tmp->rules_raw +
+					rules_tmp->eff_rules_count +
+					rules_tmp->sff_rules_count,
+					&canfltr_nl_rules[i],
+					sizeof(struct can_filter));
+				rules_tmp->sff_rules_count++;
+				canfltr_sff_match_add(rules_tmp,
+					canfltr_nl_rules[i].can_id,
+					canfltr_nl_rules[i].can_mask);
+			}
+		}
+	}
+
+
+	/* Setting parameters for newly created filter */
+	if (f->rules == NULL) {
+		rcu_assign_pointer(f->rules, rules_tmp);
+	} else { /* Changing existing filter */
+		struct canfltr_rules *rules_old;
+
+		rules_old = xchg(&f->rules, rules_tmp);
+		call_rcu(&rules_old->rcu, canfltr_rules_free_rcu);
+	}
+
+	return 0;
+
+errout:
+	kfree(rules_tmp);
+	return err;
+}
+
+/**
+ * canfltr_change() - Called for changing properties of an existing filter or
+ * after addition of a new filter to a class (by calling bind_tcf which binds
+ * an instance of a filter to the class).
+ *
+ * @tp:     Structure representing instance of a filter.
+ *          Part of a linked list of all filters.
+ * @base:
+ * @handle:
+ * @tca:    Messages passed through the Netlink from userspace.
+ * @arg:
+ */
+static int canfltr_change(struct tcf_proto *tp, unsigned long base, u32 handle,
+			  struct nlattr **tca, unsigned long *arg)
+{
+	struct canfltr_head *head = (struct canfltr_head *)tp->root;
+	struct canfltr_state *f = (struct canfltr_state *)*arg;
+	struct nlattr *tb[TCA_CANFLTR_MAX + 1];
+	int err;
+
+	if (tca[TCA_OPTIONS] == NULL)
+		return -EINVAL;
+
+	/* Parses a stream of attributes and stores a pointer to each
+	attribute in the tb array accessible via the attribute type.
+	Policy may be set to NULL if no validation is required.*/
+	err = nla_parse_nested(tb, TCA_CANFLTR_MAX, tca[TCA_OPTIONS],
+		canfltr_policy);
+	if (err < 0)
+		return err;
+	/* Change existing filter (remove all settings and add
+	them thereafter as if filter was newly created) */
+	if (f != NULL) {
+		if (handle && f->handle != handle)
+			return -EINVAL;
+
+		return canfltr_set_parms(tp, f, base, tb, tca[TCA_RATE]);
+	}
+
+	/* Create new filter */
+	err = -ENOBUFS;
+	f = kzalloc(sizeof(*f), GFP_KERNEL);
+	if (f == NULL)
+		goto errout;
+
+	if (tb[TCA_CANFLTR_CLASSID]) {
+		f->res.classid = nla_get_u32(tb[TCA_CANFLTR_CLASSID]);
+		tcf_bind_filter(tp, &f->res, base);
+	}
+
+	err = -EINVAL;
+	if (handle) /* handle passed from userspace */
+		f->handle = handle;
+	else {
+		f->handle = canfltr_gen_handle(tp);
+		if (f->handle == 0)
+			goto errout;
+	}
+
+	/* Configure filter */
+	err = canfltr_set_parms(tp, f, base, tb, tca[TCA_RATE]);
+	if (err < 0)
+		goto errout;
+
+	/* Add newly created filter to list of all filters */
+	tcf_tree_lock(tp);
+	list_add(&f->link, &head->flist);
+	tcf_tree_unlock(tp);
+	*arg = (unsigned long) f;
+
+	return 0;
+
+errout:
+	if (*arg == 0UL && f)
+		kfree(f);
+
+	return err;
+}
+
+
+static void canfltr_delete_filter(struct tcf_proto *tp,
+				struct canfltr_state *f)
+{
+	tcf_unbind_filter(tp, &f->res);
+
+	rcu_barrier();
+	kfree(f->rules->rules_raw);
+	kfree(f->rules);
+	kfree(f);
+}
+
+/**
+ * canfltr_destroy() - Remove whole filter.
+ */
+static void canfltr_destroy(struct tcf_proto *tp)
+{
+	struct canfltr_head *head = tp->root;
+	struct canfltr_state *f, *n;
+
+	list_for_each_entry_safe(f, n, &head->flist, link) {
+		list_del(&f->link);
+		canfltr_delete_filter(tp, f);
+	}
+	kfree(head);
+}
+
+/**
+ * canfltr_delete() - Delete one instance of a filter.
+ */
+static int canfltr_delete(struct tcf_proto *tp, unsigned long arg)
+{
+	struct canfltr_head *head = (struct canfltr_head *)tp->root;
+	struct canfltr_state *t;
+	struct canfltr_state *f = (struct canfltr_state *)arg;
+
+	rcu_barrier(); /* Wait for completion of call_rcu()'s */
+
+	list_for_each_entry(t, &head->flist, link)
+		if (t == f) {
+			tcf_tree_lock(tp);
+			list_del(&t->link);
+			tcf_tree_unlock(tp);
+			canfltr_delete_filter(tp, t);
+			return 0;
+		}
+
+	return -ENOENT;
+}
+
+
+/**
+ * canfltr_init() - Initialize filter
+ */
+static int canfltr_init(struct tcf_proto *tp)
+{
+	struct canfltr_head *head;
+
+	if ((tp->protocol != htons(ETH_P_ALL)) &&
+	    (tp->protocol != htons(ETH_P_CAN)))
+		return -1;
+
+	/* Work only on CAN frames */
+	if (tp->protocol == htons(ETH_P_ALL))
+		tp->protocol = htons(ETH_P_CAN);
+
+	head = kzalloc(sizeof(*head), GFP_KERNEL);
+	if (head == NULL)
+		return -ENOBUFS;
+
+	INIT_LIST_HEAD(&head->flist);
+	tp->root = head;
+
+	return 0;
+}
+
+/**
+ * canfltr_walk() - Iterates over all elements of a filter and invokes a
+ * callback function for each of them. This is used to obtain diagnostic data.
+ */
+static void canfltr_walk(struct tcf_proto *tp, struct tcf_walker *arg)
+{
+	struct canfltr_head *head = (struct canfltr_head *) tp->root;
+	struct canfltr_state *f;
+
+	list_for_each_entry(f, &head->flist, link) {
+		if (arg->count < arg->skip)
+			goto skip;
+
+		if (arg->fn(tp, (unsigned long) f, arg) < 0) {
+			arg->stop = 1;
+			break;
+		}
+skip:
+		arg->count++;
+	}
+}
+
+/**
+ * canfltr_dump() - Returns diagnostic data for a filter or one of its elements.
+ */
+static int canfltr_dump(struct tcf_proto *tp, unsigned long fh,
+			struct sk_buff *skb, struct tcmsg *t)
+{
+	struct canfltr_state *f = (struct canfltr_state *) fh;
+	struct nlattr *nest;
+	struct canfltr_rules *r;
+
+	if (f == NULL)
+		return skb->len;
+
+	rcu_read_lock();
+	r = rcu_dereference(f->rules);
+	t->tcm_handle = f->handle;
+
+	nest = nla_nest_start(skb, TCA_OPTIONS);
+	if (nest == NULL)
+		goto nla_put_failure;
+
+	if (f->res.classid)
+		NLA_PUT_U32(skb, TCA_CANFLTR_CLASSID, f->res.classid);
+
+	NLA_PUT(skb, TCA_CANFLTR_RULES, r->rules_count *
+		sizeof(struct can_filter), r->rules_raw);
+
+
+	nla_nest_end(skb, nest);
+
+	rcu_read_unlock();
+	return skb->len;
+
+nla_put_failure:
+	nla_nest_cancel(skb, nest);
+	rcu_read_unlock();
+	return -1;
+}
+
+
+static struct tcf_proto_ops cls_canfltr_ops __read_mostly = {
+	.kind           =       "can",
+	.classify       =       canfltr_classify,
+	.init           =       canfltr_init,
+	.destroy        =       canfltr_destroy,
+	.get            =       canfltr_get,
+	.put            =       canfltr_put,
+	.change         =       canfltr_change,
+	.delete         =       canfltr_delete,
+	.walk           =       canfltr_walk,
+	.dump           =       canfltr_dump,
+	.owner          =       THIS_MODULE,
+};
+
+static int __init init_canfltr(void)
+{
+	pr_debug("canfltr: CAN filter loaded\n");
+	return register_tcf_proto_ops(&cls_canfltr_ops);
+}
+
+static void __exit exit_canfltr(void)
+{
+	pr_debug("canfltr: CAN filter removed\n");
+	unregister_tcf_proto_ops(&cls_canfltr_ops);
+}
+
+module_init(init_canfltr);
+module_exit(exit_canfltr);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Rostislav Lisovy <lisovy@gmail.cz>");
+MODULE_DESCRIPTION("Controller Area Network classifier");
-- 
1.7.9.5

^ permalink raw reply related

* [PATCH 1/2] can: Added constants containing length of CAN identifiers
From: Rostislav Lisovy @ 2012-05-25  9:12 UTC (permalink / raw)
  To: netdev; +Cc: linux-can, pisa, sojkam1, oliver, Rostislav Lisovy

The same information could be derived from the CAN_*_MASK constants,
but it is undesirable to misuse masks for that purpose.
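
A minimal user-space sketch of the difference, assuming only standard C
headers. The constants mirror the ones in the patch and in linux/can.h;
bits_from_mask() and sff_table_size() are hypothetical helpers written
for illustration and are not part of the kernel.

```c
#include <assert.h>
#include <stdint.h>

#define CAN_SFF_MASK    0x000007FFU
#define CAN_EFF_MASK    0x1FFFFFFFU
#define CAN_SFF_ID_BITS 11
#define CAN_EFF_ID_BITS 29

/* Hypothetical helper: the indirect way, counting the set bits of a mask. */
static unsigned int bits_from_mask(uint32_t mask)
{
	unsigned int n = 0;

	while (mask) {
		n += mask & 1U;
		mask >>= 1;
	}
	return n;
}

/* Hypothetical helper: number of entries in a table with one slot per SFF ID. */
static uint32_t sff_table_size(void)
{
	/* With the explicit constant, the intent is visible at the call site. */
	return 1U << CAN_SFF_ID_BITS;
}

/* Sanity check: the constants agree with what the masks imply. */
static void can_id_bits_selfcheck(void)
{
	assert(bits_from_mask(CAN_SFF_MASK) == CAN_SFF_ID_BITS);
	assert(bits_from_mask(CAN_EFF_MASK) == CAN_EFF_ID_BITS);
	assert(sff_table_size() == CAN_SFF_MASK + 1U);
}
```

Both routes give the same numbers; the constants simply avoid treating a
mask as a bit count.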

Signed-off-by: Rostislav Lisovy <lisovy@gmail.com>
---
 include/linux/can.h |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/can.h b/include/linux/can.h
index 9a19bcb..08d1610 100644
--- a/include/linux/can.h
+++ b/include/linux/can.h
@@ -38,6 +38,9 @@
  */
 typedef __u32 canid_t;
 
+#define CAN_SFF_ID_BITS		11
+#define CAN_EFF_ID_BITS		29
+
 /*
  * Controller Area Network Error Frame Mask structure
  *
-- 
1.7.9.5

^ permalink raw reply related

* [PATCH iproute2 2/3] CAN Filter/Classifier -- Source code
From: Rostislav Lisovy @ 2012-05-25  9:11 UTC (permalink / raw)
  To: netdev; +Cc: linux-can, pisa, sojkam1, oliver, Rostislav Lisovy
In-Reply-To: <1337937106-7640-1-git-send-email-lisovy@gmail.com>

The CAN classifier may be used with any available qdisc on Controller
Area Network (CAN) frames passed through the AF_CAN networking subsystem.
The classifier classifies CAN frames according to their identifiers.
It can be used on CAN frames with both SFF and EFF identifiers.

The filtering rules for EFF frames are stored in an array, which
is traversed during classification. A bitmap is used to store SFF
rules -- one bit for each ID.
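
The two lookup strategies can be sketched in user-space C roughly as
follows. This is an illustration under stated assumptions, not the kernel
code: struct can_filter_sketch, struct rules_sketch, sketch_sff_add() and
sketch_match() are hypothetical names, and the bitmap is a plain byte
array rather than the kernel bitmap helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CAN_EFF_FLAG 0x80000000U
#define CAN_SFF_MASK 0x000007FFU
#define CAN_EFF_MASK 0x1FFFFFFFU
#define SFF_IDS      (CAN_SFF_MASK + 1U)	/* 2048 possible 11-bit IDs */

struct can_filter_sketch {
	uint32_t can_id;
	uint32_t can_mask;
};

struct rules_sketch {
	const struct can_filter_sketch *eff;	/* EFF rules, scanned linearly */
	size_t eff_count;
	uint8_t sff_bitmap[SFF_IDS / 8];	/* one bit per SFF ID */
};

/* Set the bit of every SFF ID that the rule (id, mask) accepts. */
static void sketch_sff_add(struct rules_sketch *r, uint32_t id, uint32_t mask)
{
	uint32_t i;

	id &= CAN_SFF_MASK;
	mask &= CAN_SFF_MASK;
	for (i = 0; i < SFF_IDS; i++)
		if (((i ^ id) & mask) == 0)
			r->sff_bitmap[i / 8] |= (uint8_t)(1U << (i % 8));
}

/* EFF: O(number of rules) scan. SFF: a single bit test. */
static bool sketch_match(const struct rules_sketch *r, uint32_t can_id)
{
	size_t i;

	if (can_id & CAN_EFF_FLAG) {
		uint32_t id = can_id & CAN_EFF_MASK;

		for (i = 0; i < r->eff_count; i++)
			if ((((r->eff[i].can_id ^ id) &
			      r->eff[i].can_mask) & CAN_EFF_MASK) == 0)
				return true;
		return false;
	}

	can_id &= CAN_SFF_MASK;
	return (r->sff_bitmap[can_id / 8] >> (can_id % 8)) & 1U;
}
```

The cost of an SFF rule is paid once, when the rule is added, which is
why SFF lookup time does not grow with the number of rules.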

More info about the project:
http://rtime.felk.cvut.cz/can/socketcan-qdisc-final.pdf

Signed-off-by: Rostislav Lisovy <lisovy@gmail.com>
---
 include/linux/pkt_cls.h |   10 ++
 tc/Makefile             |    1 +
 tc/f_can.c              |  238 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 249 insertions(+)
 create mode 100644 tc/f_can.c

diff --git a/include/linux/pkt_cls.h b/include/linux/pkt_cls.h
index defbde2..83f9241 100644
--- a/include/linux/pkt_cls.h
+++ b/include/linux/pkt_cls.h
@@ -375,6 +375,16 @@ enum {
 
 #define TCA_BASIC_MAX (__TCA_BASIC_MAX - 1)
 
+/* CAN filter */
+
+enum {
+	TCA_CANFLTR_UNSPEC,
+	TCA_CANFLTR_CLASSID,
+	TCA_CANFLTR_RULES,
+	__TCA_CANFLTR_MAX
+};
+
+#define TCA_CANFLTR_MAX (__TCA_CANFLTR_MAX - 1)
 
 /* Cgroup classifier */
 
diff --git a/tc/Makefile b/tc/Makefile
index 64d93ad..1281568 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -22,6 +22,7 @@ TCMODULES += f_u32.o
 TCMODULES += f_route.o
 TCMODULES += f_fw.o
 TCMODULES += f_basic.o
+TCMODULES += f_can.o
 TCMODULES += f_flow.o
 TCMODULES += f_cgroup.o
 TCMODULES += q_dsmark.o
diff --git a/tc/f_can.c b/tc/f_can.c
new file mode 100644
index 0000000..208625f
--- /dev/null
+++ b/tc/f_can.c
@@ -0,0 +1,238 @@
+/*
+ * f_can.c  Filter for CAN packets
+ *
+ *		This program is free software; you can distribute it and/or
+ *		modify it under the terms of the GNU General Public License
+ *		as published by the Free Software Foundation; either version
+ *		2 of the License, or (at your option) any later version.
+ *
+ * Idea:       Oliver Hartkopp <oliver.hartkopp@volkswagen.de>
+ * Copyright:  (c) 2011 Czech Technical University in Prague
+ *             (c) 2011 Volkswagen Group Research
+ * Authors:    Michal Sojka <sojkam1@fel.cvut.cz>
+ *             Pavel Pisa <pisa@cmp.felk.cvut.cz>
+ *             Rostislav Lisovy <lisovy@gmail.cz>
+ * Funded by:  Volkswagen Group Research
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <syslog.h>
+#include <fcntl.h>
+#include <sys/socket.h>
+#include <netinet/in.h>
+#include <arpa/inet.h>
+#include <string.h>
+#include <linux/if.h>
+#include <limits.h>
+#include <inttypes.h>
+#include "utils.h"
+#include "tc_util.h"
+#include "linux/can.h"
+
+#define RULES_SIZE		128 /* Maximum number of rules sent via the
+				netlink message during creation/configuration */
+
+
+static void canfltr_explain(void)
+{
+	fprintf(stderr, "Usage: ... can [ MATCHSPEC ] [ flowid FLOWID ]\n"
+			"\n"
+			"Where: MATCHSPEC := { sffid FILTERID | effid FILTERID |\n"
+			"                   MATCHSPEC ... }\n"
+			"       FILTERID := CANID[:MASK]\n"
+			"\n"
+			"NOTE: FLOWID, CANID and MASK are parsed as hexadecimal input.\n");
+}
+
+static int canfltr_parse_opt(struct filter_util *qu, char *handle,
+			 int argc, char **argv, struct nlmsghdr *n)
+{
+	struct tcmsg *t = NLMSG_DATA(n);
+	struct rtattr *tail;
+	struct can_filter canfltr_rules[RULES_SIZE];
+	int rules_count = 0;
+	long h = 0;
+	canid_t can_id;
+	canid_t can_mask;
+
+	if (!argc)
+		return 0;
+
+	if (handle) {
+		h = strtol(handle, NULL, 0);
+		if (h == LONG_MIN || h == LONG_MAX) {
+			fprintf(stderr, "Illegal handle \"%s\", must be numeric.\n",
+				handle);
+			return -1;
+		}
+	}
+
+	t->tcm_handle = h;
+
+	tail = NLMSG_TAIL(n);
+	addattr_l(n, MAX_MSG, TCA_OPTIONS, NULL, 0);
+
+	while (argc > 0) {
+		if (matches(*argv, "sffid") == 0) {
+			/* parse SFF CAN ID optionally with mask */
+			if (rules_count >= RULES_SIZE) {
+				fprintf(stderr, "Too many rules on input. "
+					"Maximum number of rules is: %d\n",
+					RULES_SIZE);
+				return -1;
+			}
+
+			NEXT_ARG();
+
+			if (sscanf(*argv, "%"SCNx32 ":" "%"SCNx32,
+				&can_id, &can_mask) != 2) {
+				if (sscanf(*argv, "%"SCNx32, &can_id) != 1) {
+					fprintf(stderr, "Improperly formed CAN "
+						"ID & mask '%s'\n", *argv);
+					return -1;
+				} else
+					can_mask = CAN_SFF_MASK;
+			}
+
+			/* we do not support extra handling for RTR frames
+			due to the bitmap approach */
+			if (can_id & ~CAN_SFF_MASK) {
+				fprintf(stderr, "ID 0x%lx exceeded standard CAN ID range.\n",
+					(unsigned long)can_id);
+				return -1;
+			}
+
+			canfltr_rules[rules_count].can_id = can_id;
+			canfltr_rules[rules_count].can_mask =
+				(can_mask & CAN_SFF_MASK);
+			rules_count++;
+
+		} else if (matches(*argv, "effid") == 0) {
+			/* parse EFF CAN ID optionally with mask */
+			if (rules_count >= RULES_SIZE) {
+				fprintf(stderr, "Too many rules on input. "
+					"Maximum number of rules is: %d\n",
+					RULES_SIZE);
+				return -1;
+			}
+
+			NEXT_ARG();
+
+			if (sscanf(*argv, "%"SCNx32 ":" "%"SCNx32, &can_id, &can_mask) != 2) {
+				if (sscanf(*argv, "%"SCNx32, &can_id) != 1) {
+					fprintf(stderr, "Improperly formed CAN ID & mask '%s'\n", *argv);
+					return -1;
+				} else
+					can_mask = CAN_EFF_MASK;
+			}
+
+			if (can_id & ~CAN_EFF_MASK) {
+				fprintf(stderr, "ID 0x%lx exceeded extended CAN ID range.\n",
+					(unsigned long)can_id);
+				return -1;
+			}
+
+			canfltr_rules[rules_count].can_id =
+				can_id | CAN_EFF_FLAG;
+			canfltr_rules[rules_count].can_mask =
+				(can_mask & CAN_EFF_MASK) | CAN_EFF_FLAG;
+			rules_count++;
+
+		} else if (matches(*argv, "classid") == 0 || strcmp(*argv, "flowid") == 0) {
+			unsigned handle;
+			NEXT_ARG();
+			if (get_tc_classid(&handle, *argv)) {
+				fprintf(stderr, "Illegal \"classid\"\n");
+				return -1;
+			}
+			addattr_l(n, MAX_MSG, TCA_CANFLTR_CLASSID, &handle, 4);
+
+		} else if (strcmp(*argv, "help") == 0) {
+			canfltr_explain();
+			return -1;
+
+		} else {
+			fprintf(stderr, "What is \"%s\"?\n", *argv);
+			canfltr_explain();
+			return -1;
+		}
+		argc--; argv++;
+	}
+
+	addattr_l(n, MAX_MSG, TCA_CANFLTR_RULES, &canfltr_rules,
+		sizeof(struct can_filter) * rules_count);
+
+	tail->rta_len = (void *)NLMSG_TAIL(n) - (void *)tail;
+	return 0;
+}
+
+/* When "tc filter show dev XY" is executed, function canfltr_walk() (in
+ * kernel) is called (which calls canfltr_dump() for each instance of a
+ * filter) which sends information about each instance of a filter to
+ * userspace -- to this function which parses the message and prints it.
+ */
+static int canfltr_print_opt(struct filter_util *qu, FILE *f,
+			struct rtattr *opt, __u32 handle)
+{
+	struct rtattr *tb[TCA_CANFLTR_MAX+1];
+	struct can_filter *canfltr_rules = NULL;
+	int rules_count = 0;
+	int i;
+
+	if (opt == NULL)
+		return 0;
+
+	parse_rtattr_nested(tb, TCA_CANFLTR_MAX, opt);
+
+	if (handle)
+		fprintf(f, "handle 0x%x ", handle);
+
+
+	if (tb[TCA_CANFLTR_CLASSID]) {
+		SPRINT_BUF(b1); /* allocates buffer b1 */
+		fprintf(f, "flowid %s ",
+			sprint_tc_classid(*(__u32 *)RTA_DATA(tb[TCA_CANFLTR_CLASSID]), b1));
+	}
+
+	if (tb[TCA_CANFLTR_RULES]) {
+		if (RTA_PAYLOAD(tb[TCA_CANFLTR_RULES]) < sizeof(struct can_filter))
+			return -1;
+
+		canfltr_rules = RTA_DATA(tb[TCA_CANFLTR_RULES]);
+		rules_count = (RTA_PAYLOAD(tb[TCA_CANFLTR_RULES]) /
+			sizeof(struct can_filter));
+
+		for (i = 0; i < rules_count; i++) {
+			struct can_filter *pcfltr = &canfltr_rules[i];
+
+			if (pcfltr->can_id & CAN_EFF_FLAG) {
+				if (pcfltr->can_mask == (CAN_EFF_FLAG|CAN_EFF_MASK))
+					fprintf(f, "effid 0x%"PRIX32" ",
+						pcfltr->can_id & CAN_EFF_MASK);
+				else
+					fprintf(f, "effid 0x%"PRIX32":0x%"PRIX32" ",
+						pcfltr->can_id & CAN_EFF_MASK,
+						pcfltr->can_mask & CAN_EFF_MASK);
+			} else {
+				if (pcfltr->can_mask == CAN_SFF_MASK)
+					fprintf(f, "sffid 0x%"PRIX32" ",
+						pcfltr->can_id);
+				else
+					fprintf(f, "sffid 0x%"PRIX32":0x%"PRIX32" ",
+						pcfltr->can_id,
+						pcfltr->can_mask);
+			}
+		}
+	}
+
+	return 0;
+}
+
+struct filter_util can_filter_util = {
+	.id = "can",
+	.parse_fopt = canfltr_parse_opt,
+	.print_fopt = canfltr_print_opt,
+};
+
-- 
1.7.9.5

^ permalink raw reply related

* [PATCH iproute2 3/3] CAN Filter/Classifier -- Documentation
From: Rostislav Lisovy @ 2012-05-25  9:11 UTC (permalink / raw)
  To: netdev; +Cc: linux-can, pisa, sojkam1, oliver, Rostislav Lisovy
In-Reply-To: <1337937106-7640-1-git-send-email-lisovy@gmail.com>

Added manpage describing usage of CAN Filter.

Signed-off-by: Rostislav Lisovy <lisovy@gmail.com>
---
 man/man8/tc-can.8 |   97 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 97 insertions(+)
 create mode 100644 man/man8/tc-can.8

diff --git a/man/man8/tc-can.8 b/man/man8/tc-can.8
new file mode 100644
index 0000000..54ee96a
--- /dev/null
+++ b/man/man8/tc-can.8
@@ -0,0 +1,97 @@
+.TH CAN 8 "8 May 2012" "iproute2" "Linux"
+.SH NAME
+CAN \- Controller Area Network classifier
+.SH SYNOPSIS
+.B tc filter ... dev
+DEV
+.B parent
+CLASSID
+.B [ prio
+PRIORITY
+.B ] [ protocol can ] [ handle 
+HANDLE
+.B ] can [ 
+MATCHSPEC
+.B ] [ flowid 
+FLOWID
+.B ]
+
+.B CLASSID := major:minor
+.br
+.B FLOWID := major:minor
+.br
+.B MATCHSPEC := { sffid 
+FILTERID
+.B | effid 
+FILTERID
+.B | MATCHSPEC ... }
+.br
+.B FILTERID := canid[:mask]
+
+.BR CLASSID ,
+.BR FLOWID ,
+.BR canid
+and
+.B mask
+are parsed as hexadecimal input.
+
+
+.SH DESCRIPTION
+The CAN classifier may be used with any available
+.B qdisc
+on Controller Area Network (CAN) frames passed through the AF_CAN
+networking subsystem. The classifier classifies CAN frames according
+to their identifiers. It can be used on CAN frames with both SFF and
+EFF identifiers.
+
+It is possible to add the CAN classifier to any qdisc configured on any
+networking device; however, it will ignore non-CAN packets.
+
+
+.SH CLASSIFICATION
+The filtering rules for EFF frames are stored in an array, which is traversed
+during classification. This means that the worst-case time needed for
+classification of EFF frames increases with the number of configured rules.
+
+The filter implements an optimization for matching SFF frames using a bitmap
+with one bit for every ID. With this optimization, the classification time
+for SFF frames is nearly constant, regardless of the number of rules.
+
+.SH EXAMPLE
+This example shows how to set up a
+.B prio qdisc
+with the
+.B CAN
+classifier.
+
+.nf
+tc qdisc add dev can0 root handle 1: prio
+
+tc filter add dev can0 parent 1:0 prio 1 handle 0xa \\
+    can sffid 0x7ff:0xf flowid 1:1
+tc filter add dev can0 parent 1:0 prio 2 handle 0xb \\
+    can sffid 0xC0:0x7ff effid 0x80:0x7ff flowid 1:2
+tc filter add dev can0 parent 1:0 prio 3 \\
+    can sffid 0x80:0x7ff flowid 1:2
+tc filter add dev can0 parent 1:0 prio 4 \\
+    can sffid 0x0:0x0 effid 0x0:0x0 flowid 1:3
+.fi
+
+
+.SH BUGS
+The maximum number of rules passed from the
+.BR tc(8)
+utility to the CAN classifier is fixed. The limit is set at compilation time
+(default is 128).
+
+
+.SH SEE ALSO
+.BR tc(8)
+
+
+.SH AUTHORS
+Michal Sojka <sojkam1@fel.cvut.cz>, Pavel Pisa <pisa@cmp.felk.cvut.cz>,
+Rostislav Lisovy <lisovy@gmail.cz>.
+
+This manpage is maintained by Rostislav Lisovy <lisovy@gmail.com>
+
-- 
1.7.9.5


^ permalink raw reply related

* [PATCH iproute2 1/3] Added missing can.h
From: Rostislav Lisovy @ 2012-05-25  9:11 UTC (permalink / raw)
  To: netdev; +Cc: linux-can, pisa, sojkam1, oliver, Rostislav Lisovy
In-Reply-To: <1337937106-7640-1-git-send-email-lisovy@gmail.com>

This header file is a slightly modified copy of the one in
Linux kernel v3.3. It contains the definitions necessary for AF_CAN
communication.

Signed-off-by: Rostislav Lisovy <lisovy@gmail.com>
---
 include/linux/can.h |  112 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 112 insertions(+)
 create mode 100644 include/linux/can.h

diff --git a/include/linux/can.h b/include/linux/can.h
new file mode 100644
index 0000000..08d1610
--- /dev/null
+++ b/include/linux/can.h
@@ -0,0 +1,112 @@
+/*
+ * linux/can.h
+ *
+ * Definitions for CAN network layer (socket addr / CAN frame / CAN filter)
+ *
+ * Authors: Oliver Hartkopp <oliver.hartkopp@volkswagen.de>
+ *          Urs Thuermann   <urs.thuermann@volkswagen.de>
+ * Copyright (c) 2002-2007 Volkswagen Group Electronic Research
+ * All rights reserved.
+ *
+ */
+
+#ifndef CAN_H
+#define CAN_H
+
+#include <linux/types.h>
+#include <linux/socket.h>
+
+/* controller area network (CAN) kernel definitions */
+
+/* special address description flags for the CAN_ID */
+#define CAN_EFF_FLAG 0x80000000U /* EFF/SFF is set in the MSB */
+#define CAN_RTR_FLAG 0x40000000U /* remote transmission request */
+#define CAN_ERR_FLAG 0x20000000U /* error frame */
+
+/* valid bits in CAN ID for frame formats */
+#define CAN_SFF_MASK 0x000007FFU /* standard frame format (SFF) */
+#define CAN_EFF_MASK 0x1FFFFFFFU /* extended frame format (EFF) */
+#define CAN_ERR_MASK 0x1FFFFFFFU /* omit EFF, RTR, ERR flags */
+
+/*
+ * Controller Area Network Identifier structure
+ *
+ * bit 0-28	: CAN identifier (11/29 bit)
+ * bit 29	: error frame flag (0 = data frame, 1 = error frame)
+ * bit 30	: remote transmission request flag (1 = rtr frame)
+ * bit 31	: frame format flag (0 = standard 11 bit, 1 = extended 29 bit)
+ */
+typedef __u32 canid_t;
+
+#define CAN_SFF_ID_BITS		11
+#define CAN_EFF_ID_BITS		29
+
+/*
+ * Controller Area Network Error Frame Mask structure
+ *
+ * bit 0-28	: error class mask (see include/linux/can/error.h)
+ * bit 29-31	: set to zero
+ */
+typedef __u32 can_err_mask_t;
+
+/**
+ * struct can_frame - basic CAN frame structure
+ * @can_id:  the CAN ID of the frame and CAN_*_FLAG flags, see above.
+ * @can_dlc: the data length field of the CAN frame
+ * @data:    the CAN frame payload.
+ */
+struct can_frame {
+	canid_t can_id;  /* 32 bit CAN_ID + EFF/RTR/ERR flags */
+	__u8    can_dlc; /* data length code: 0 .. 8 */
+	__u8    data[8] __attribute__((aligned(8)));
+};
+
+/* particular protocols of the protocol family PF_CAN */
+#define CAN_RAW		1 /* RAW sockets */
+#define CAN_BCM		2 /* Broadcast Manager */
+#define CAN_TP16	3 /* VAG Transport Protocol v1.6 */
+#define CAN_TP20	4 /* VAG Transport Protocol v2.0 */
+#define CAN_MCNET	5 /* Bosch MCNet */
+#define CAN_ISOTP	6 /* ISO 15765-2 Transport Protocol */
+#define CAN_NPROTO	7
+
+#define SOL_CAN_BASE 100
+
+/**
+ * struct sockaddr_can - the sockaddr structure for CAN sockets
+ * @can_family:  address family number AF_CAN.
+ * @can_ifindex: CAN network interface index.
+ * @can_addr:    protocol specific address information
+ */
+struct sockaddr_can {
+	__kernel_sa_family_t can_family;
+	int         can_ifindex;
+	union {
+		/* transport protocol class address information (e.g. ISOTP) */
+		struct { canid_t rx_id, tx_id; } tp;
+
+		/* reserved for future CAN protocols address information */
+	} can_addr;
+};
+
+/**
+ * struct can_filter - CAN ID based filter in can_register().
+ * @can_id:   relevant bits of CAN ID which are not masked out.
+ * @can_mask: CAN mask (see description)
+ *
+ * Description:
+ * A filter matches, when
+ *
+ *          <received_can_id> & mask == can_id & mask
+ *
+ * The filter can be inverted (CAN_INV_FILTER bit set in can_id) or it can
+ * filter for error frames (CAN_ERR_FLAG bit set in mask).
+ */
+struct can_filter {
+	canid_t can_id;
+	canid_t can_mask;
+};
+
+#define CAN_INV_FILTER 0x20000000U /* to be set in can_filter.can_id */
+
+#endif /* CAN_H */
-- 
1.7.9.5


^ permalink raw reply related

* [PATCH iproute2 0/3] CAN Filter/Classifier
From: Rostislav Lisovy @ 2012-05-25  9:11 UTC (permalink / raw)
  To: netdev; +Cc: linux-can, pisa, sojkam1, oliver, Rostislav Lisovy

The CAN classifier may be used with any available qdisc on Controller
Area Network (CAN) frames passed through the AF_CAN networking subsystem.
The classifier classifies CAN frames according to their identifiers.
It can be used on CAN frames with both SFF and EFF identifiers.

The filtering rules for EFF frames are stored in an array, which
is traversed during classification. A bitmap is used to store SFF
rules -- one bit for each ID.

More info about the project:
http://rtime.felk.cvut.cz/can/socketcan-qdisc-final.pdf


Rostislav Lisovy (3):
  Added missing can.h
  CAN Filter/Classifier -- Source code
  CAN Filter/Classifier -- Documentation

 include/linux/can.h     |  112 ++++++++++++++++++++++
 include/linux/pkt_cls.h |   10 ++
 man/man8/tc-can.8       |   97 +++++++++++++++++++
 tc/Makefile             |    1 +
 tc/f_can.c              |  238 +++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 458 insertions(+)
 create mode 100644 include/linux/can.h
 create mode 100644 man/man8/tc-can.8
 create mode 100644 tc/f_can.c

-- 
1.7.9.5


^ permalink raw reply

* Re: IPv6 flapping with kernel 3.3 (regression from 3.2.9)
From: Alexey Ivanov @ 2012-05-25  9:02 UTC (permalink / raw)
  To: Marc Haber; +Cc: linux-kernel@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <20120322073428.GA11510@torres.zugschlus.de>

On Thursday, March 22, 2012 12:20:02 PM UTC+4, Marc Haber wrote:
> Hi,
> 
> I have a host which has IPv6 misbehaving when running with Linux 3.3.
> It is flawlessly working with Linux 3.2.9.
> 
> The host
> - is running Debian stable (x64_64) with a few locally built and/or
>   backported packages, including the kernel.
> - has native IPv6 connectivity on eth0
> - is not doing SLAAC on eth0, both IP address (from 2a01/16) and
>   default gateway (fe80::1) are statically configured
> - is running a handful of VMs using KVM/libvirt
> - has IPv6 forwarding enabled
> - does IPv4 NAT
> - has a handful of iptables rules, both for v4 and v6. ICMP and ICMPv6
>   are fully open
> 
> - the gateway is not under my control
> - the VMs are either bridged to br0 or to br1
> - both br0 and br1 have an IPv6 /64 and radvd running to provide IPv6
>   to the VMs
> 
> This setup is unique in my machine list, my other machines either are
> no KVM hosts or do only have IPv6 tunneled.
> 
> When I run the box with kernel 3.3, it drops off the IPv6 network
> every few minutes and is not responding to pings any more. This state
> stays like 30 seconds to a minute and then IPv6 resumes. It looks to
> me that the box does not lose its default route though. Once in a
> while, I see "fe80::1 dev eth0  router FAILED" in the ip neigh output.

I think I am seeing a similar problem on some of our boxes:
the IPv6 default router on a VLAN gets stuck in the FAILED state until I ping it.


I'm pinging some host on vlan763 and keep getting "Destination unreachable: No route":

user@host:~$ ping6 2a02:0000:0:a00::4d58:1602
PING 2a02:0000:0:a00::4d58:1602(2a02:0000:0:a00::4d58:1602) 56 data bytes
From 2a02:0000:0:200::5 icmp_seq=1 Destination unreachable: No route
From 2a02:0000:0:200::5 icmp_seq=2 Destination unreachable: No route
^C
--- 2a02:0000:0:a00::4d58:1602 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1001ms


Somehow the route ends up on eth2:

user@host:~$ ip route get 2a02:0000:0:a00::4d58:1602
2a02:0000:0:a00::4d58:1602 from :: via 2a02:0000:0:88d:: dev eth2  src 2a02:0000:0:88d:215:17ff:feb8:3b62  metric 0 
    cache  mtu 1450 advmss 1390


But routing table says it should be on vlan763:

user@host:~$ ip -6 route 
.. snip ...
2a02:0000:0:a0b::/64 dev vlan763  proto kernel  metric 256 
2a02:0000:0:a00::/59 via 2a02:0000:0:a0b::1 dev vlan763  metric 1024  mtu 8950 advmss 8890
.. snip ...
default via 2a02:0000:0:88d:: dev eth2  metric 1024  mtu 1450 advmss 1390
default via fe80::225:90ff:fe06:223c dev eth2  proto kernel  metric 1024  expires 0sec


Here is our culprit: the default router on vlan763 is marked as FAILED:

user@host:~$ ip -6 neigh
fe80::225:90ff:fe06:223c dev eth2 lladdr 00:25:90:06:22:3c router REACHABLE
2a02:0000:0:88d:: dev eth2 lladdr 00:25:90:06:22:3c router STALE
2a02:0000:0:a0b::1 dev vlan763  router FAILED


Now let's ping it:

user@host:~$ ping6 2a02:0000:0:a0b::1
PING 2a02:0000:0:a0b::1(2a02:0000:0:a0b::1) 56 data bytes
64 bytes from 2a02:0000:0:a0b::1: icmp_seq=1 ttl=64 time=2.62 ms
64 bytes from 2a02:0000:0:a0b::1: icmp_seq=2 ttl=64 time=47.8 ms
64 bytes from 2a02:0000:0:a0b::1: icmp_seq=3 ttl=64 time=0.341 ms
..snip...
^C
--- 2a02:0000:0:a0b::1 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4003ms
rtt min/avg/max/mdev = 0.330/10.303/47.801/18.769 ms


Now everything is back to normal.
The router is REACHABLE:

user@host:~$ ip -6 neigh
fe80::225:90ff:fe06:223c dev eth2 lladdr 00:25:90:06:22:3c router STALE
fe80::224:50ff:fe5b:e400 dev vlan763 lladdr 00:24:50:5b:e4:00 DELAY
2a02:0000:0:88d:: dev eth2 lladdr 00:25:90:06:22:3c router STALE
2a02:0000:0:a0b::1 dev vlan763 lladdr 00:24:50:5b:e4:00 router REACHABLE


The route is on vlan763:

user@host:~$ ip route get 2a02:0000:0:a00::4d58:1602
2a02:0000:0:a00::4d58:1602 from :: via 2a02:0000:0:a0b::1 dev vlan763  src 2a02:0000:0:a0b::5f6c:9c1a  metric 0 
    cache  mtu 8950 advmss 8890


And I can finally ping hosts on the other side of the router:

user@host:~$ ping6 2a02:0000:0:a00::4d58:1602
PING 2a02:0000:0:a00::4d58:1602(2a02:0000:0:a00::4d58:1602) 56 data bytes
64 bytes from 2a02:0000:0:a00::4d58:1602: icmp_seq=1 ttl=62 time=1.82 ms
64 bytes from 2a02:0000:0:a00::4d58:1602: icmp_seq=2 ttl=62 time=1.77 ms
64 bytes from 2a02:0000:0:a00::4d58:1602: icmp_seq=3 ttl=62 time=1.83 ms
..snip...
^C
--- 2a02:0000:0:a00::4d58:1602 ping statistics ---
9 packets transmitted, 9 received, 0% packet loss, time 8014ms
rtt min/avg/max/mdev = 1.511/1.671/1.852/0.141 ms


The kernel is:
Linux 3.2.0-23-server x86_64
This is a copy of the Ubuntu 12.04 generic flavor with a set of TCP patches that do not affect ND/routing: https://gist.github.com/2407652

user@host:~$ grep IPV6 /boot/config-3.2.0-23-server 
CONFIG_IPV6=y
CONFIG_IPV6_PRIVACY=y
CONFIG_IPV6_ROUTER_PREF=y
CONFIG_IPV6_ROUTE_INFO=y
# CONFIG_IPV6_OPTIMISTIC_DAD is not set
CONFIG_IPV6_MIP6=m
CONFIG_IPV6_SIT=m
CONFIG_IPV6_SIT_6RD=y
CONFIG_IPV6_NDISC_NODETYPE=y
CONFIG_IPV6_TUNNEL=m
CONFIG_IPV6_MULTIPLE_TABLES=y
CONFIG_IPV6_SUBTREES=y
CONFIG_IPV6_MROUTE=y
CONFIG_IPV6_MROUTE_MULTIPLE_TABLES=y
CONFIG_IPV6_PIMSM_V2=y

Sysctls:
net.ipv6.conf.all.accept_ra = 0
net.ipv6.conf.all.autoconf = 0

> 
> Running a continuous ping in either direction doesn't seem to help.
> 
> Booting the box back to 3.2.9 immediately fixes the issue.
> 
> I have not yet re-tried going back to 3.3 since a few of the VMs are
> too important to reboot again today. I tried running tcpdump on eth0
> over night but hit br1 instead, so I don't have any packet dumps to
> show.
> 
> I guess that something goes wrong with neighbor detection regarding
> the IPv6 gateway.
> 
> Was there a relevant change between 3.2.9 and 3.3? Where do I look for
> the issue?
> 
> Greetings
> Marc
> 
> -- 
> -----------------------------------------------------------------------------
> Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
> Mannheim, Germany  |  lose things."    Winona Ryder | Fon: *49 621 31958061
> Nordisch by Nature |  How to make an American Quilt | Fax: *49 621 31958062
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
Alexey Ivanov
Yandex Search Admin Team
*************
tel.: +7 (985) 120-35-83 (int. 7176)
http://staff.yandex-team.ru/rbtz
*************

^ permalink raw reply

* [RFC] mac80211: Use correct originator sequence number in a Path Reply
From: Qasim Javed @ 2012-05-25  8:21 UTC (permalink / raw)
  To: linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	devel-ZwoEplunGu1xMJw8dq7oimD2FQJk+8+b
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, ravip-DNmUmOh1Rg72fBVCVOL8/A

Sorry for the duplicate email. As suggested by Julian I have marked this as RFC (was PATCH previously), so that people can comment on it.

I have been doing some experiments using the 802.11s functionality in the mac80211 stack. Today I stumbled across something which I believe is a critical bug in the usage of originator sequence number for a Path Reply message upon the reception of a Path Request message.

Consider the following topology:

                    +---+
                    | S |
                    +---+
                  /      \
                 /        \
            +---+          +---+
            | A |          | B |
            +---+          +---+
                 \         /
                  \       /
                    +---+
                    | D |
                    +---+

Node S is the source node and D the destination. There are two possible paths from S to D, namely S->A->D and S->B->D. When S wants to communicate with D, it broadcasts a Path Request (PREQ) in which the originator is S and the target is D. On receiving the PREQ, both A and B rebroadcast it, so it reaches D. Let us assume that the aggregate value of the metric for path D->B->S, denoted cost(DBS), is greater than cost(DAS). Note that according to HWMP operation, while the PREQ propagates from S to D, it is the cost of the "reverse" path that is aggregated; that is why I use cost(DB) + cost(BS) for cost(DBS) and do not consider cost(SBD). Suppose also that a smaller metric is better, which is the case for the default airtime link metric used by the 802.11s stack.

Let us suppose that the PREQ that passes through B arrives at D first and, as mentioned earlier, carries a larger (worse) metric than the PREQ that will soon arrive through the intermediate hop A. When D receives the PREQ from B, since it has not received any other PREQ, it generates a Path Reply (PREP). More specifically, the function hwmp_preq_frame_process generates the PREP. The PREQ contains originator and target sequence numbers, which are used to avoid loops and to ascertain the freshness of route information. On receiving a PREQ at D, the above-mentioned function checks whether dot11MeshHWMPnetDiameterTraversalTime has elapsed since the last sequence number update (stored in ifmsh->last_sn_update). Suppose this is true when the first PREQ, via B, is received at D. In this case the originator sequence number in the PREP is incremented (that is, it becomes one more than the target sequence number in the PREQ).

Let us look at an example at this point. Suppose the originator sequence number in the PREQ is 1 and the target sequence number is 2. When this PREQ is received at D via B, and given that dot11MeshHWMPnetDiameterTraversalTime has passed since the last sequence number update, we increment the target sequence number, which now becomes 3. For the PREP, the originator sequence number of the PREQ (1 in this case) becomes the target sequence number of the PREP, and the target sequence number of the PREQ (which has been updated to 3) becomes the originator sequence number of the PREP.

As the PREQ received at D via B has a larger metric, we know that when the PREQ from S is received via A, it will have a lower (better) metric; hence we will also generate a PREP for that PREQ. Suppose the second PREQ, via A, is received at D within dot11MeshHWMPnetDiameterTraversalTime (currently 50ms). This is a reasonable assumption, since the PREQ is broadcast by S and rebroadcast by A and B in some order. It is very unlikely that the time difference between the PREQs from A and B would exceed 50ms, since this is a long time in 802.11 terms, where nodes typically contend for the channel on the order of hundreds of microseconds. In short, it is very likely (confirmed through experiments) that this difference is less than 50ms.

So, when the PREQ from S arrives at D via A, most likely within 50ms of the PREQ via B, the target sequence number will not be updated. Therefore, the originator sequence number stays at 1 and the target sequence number remains 2. It is very important to note that in this case the code in hwmp_preq_frame_process just "swaps" the originator and target sequence numbers for use in the PREP. That is, the second PREP will have an originator sequence number of 2 and a target sequence number of 1.
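
The choice of sequence number at D can be sketched in user-space C. This is only a simplified model of the logic described above, not the kernel code: the struct, the function name prep_orig_sn, the millisecond clock and the `patched` switch are all hypothetical, and the jiffies wrap-around check is left out. It compares the current behaviour with the extra else branch proposed in the patch at the end of this message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the mesh interface state at D. */
struct mesh_state {
	uint32_t sn;             /* D's own HWMP sequence number */
	uint64_t last_sn_update; /* time of the last increment, in ms */
};

/* dot11MeshHWMPnetDiameterTraversalTime, 50ms as stated above. */
#define NET_TRAVERSAL_MS 50

/*
 * Returns the sequence number that D would place in the originator
 * field of the PREP. With patched == false, a PREQ arriving inside the
 * traversal window simply echoes the target SN carried by the PREQ.
 */
static uint32_t prep_orig_sn(struct mesh_state *m, uint32_t preq_target_sn,
			     uint64_t now_ms, bool patched)
{
	uint32_t target_sn = preq_target_sn;

	if (now_ms > m->last_sn_update + NET_TRAVERSAL_MS) {
		/* Window elapsed: bump our own SN and use it. */
		target_sn = ++m->sn;
		m->last_sn_update = now_ms;
	} else if (patched) {
		/* Proposed fix: reuse the SN we already advertised. */
		target_sn = m->sn;
	}
	return target_sn;
}
```

With sn = 2 and a stale last_sn_update, the first PREQ yields 3 in both variants; a second PREQ 10ms later yields 2 without the else branch and 3 with it.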

At this point, we have two PREPs in flight, one via B and one via A.

PREP via B: originator sequence number = 3, target sequence number = 1
PREP via A: originator sequence number = 2, target sequence number = 1

The net effect is that when these PREPs reach S, irrespective of the order in which they arrive, the PREP via A will be ignored! This is very wrong since the reason we sent the PREP via A in the first place was that it had a better metric (albeit on the reverse path).
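
To make the effect at S concrete, here is a small user-space C sketch. It assumes that S applies the usual HWMP freshness rule (path information is dropped if its sequence number is older, or equal with a metric that is not better); the struct, the function name prep_accept and the metric values 200 and 100 are hypothetical and chosen only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical path table entry kept by S for destination D. */
struct path_entry {
	bool valid;
	uint32_t sn;     /* sequence number of the stored path info */
	uint32_t metric; /* smaller is better */
	char next_hop;   /* 'A' or 'B' */
};

/*
 * Applies the assumed freshness rule: reject the PREP if the stored
 * SN is newer, or if the SNs are equal and the new metric is not
 * better. Returns true if the path entry was updated.
 */
static bool prep_accept(struct path_entry *p, uint32_t sn,
			uint32_t metric, char via)
{
	bool stored_is_newer = (int32_t)(sn - p->sn) < 0;

	if (p->valid && (stored_is_newer ||
			 (p->sn == sn && metric >= p->metric)))
		return false;

	p->valid = true;
	p->sn = sn;
	p->metric = metric;
	p->next_hop = via;
	return true;
}
```

Under this model, if the PREP via B (SN 3, metric 200) arrives first, the PREP via A (SN 2, metric 100) is rejected as stale; if the PREP via A arrives first, it is overwritten by the PREP via B. Either way S ends up routing via B. If both PREPs carried SN 3, the better metric via A would win.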

I have not tested the patch yet. This is more of a heads up email to let everyone know.

Signed-off-by: Qasim Javed <qasimj-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 net/mac80211/mesh_hwmp.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/net/mac80211/mesh_hwmp.c b/net/mac80211/mesh_hwmp.c
index 70ac7d1..a13b593 100644
--- a/net/mac80211/mesh_hwmp.c
+++ b/net/mac80211/mesh_hwmp.c
@@ -543,6 +543,8 @@ static void hwmp_preq_frame_process(struct ieee80211_sub_if_data *sdata,
 		    time_before(jiffies, ifmsh->last_sn_update)) {
 			target_sn = ++ifmsh->sn;
 			ifmsh->last_sn_update = jiffies;
+		} else {
+			target_sn = ifmsh->sn;
 		}
 	} else {
 		rcu_read_lock();
-- 
1.7.1

^ permalink raw reply related

* [PATCH iproute2] tc-codel: Fix typos in manpage
From: Jan Ceuleers @ 2012-05-25  7:43 UTC (permalink / raw)
  To: Stephen Hemminger, netdev, Vijay Subramanian; +Cc: Jan Ceuleers

Signed-off-by: Jan Ceuleers <jan.ceuleers@computer.org>
---
 man/man8/tc-codel.8 |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/man/man8/tc-codel.8 b/man/man8/tc-codel.8
index 605e498..61f163f 100644
--- a/man/man8/tc-codel.8
+++ b/man/man8/tc-codel.8
@@ -105,7 +105,7 @@ interval 30.0ms ecn
 .BR tc-red (8)
 
 .SH SOURCES
-o   Kathleen Nicols and Van Jaconson, "Controlling Queue Delay", ACM Queue,
+o   Kathleen Nichols and Van Jacobson, "Controlling Queue Delay", ACM Queue,
 http://queue.acm.org/detail.cfm?id=2209336
 
 .SH AUTHORS
-- 
1.7.9.5

^ permalink raw reply related

* Re: [PATCH IPROUTE2] tc-codel: Add manpage
From: Jan Ceuleers @ 2012-05-25  7:15 UTC (permalink / raw)
  To: Vijay Subramanian; +Cc: netdev, Stephen Hemminger, Eric Dumazet, Dave Taht
In-Reply-To: <4FBF2C66.8090706@computer.org>

On 05/25/2012 08:53 AM, Jan Ceuleers wrote:
> On 05/24/2012 06:33 AM, Vijay Subramanian wrote:
>> This patch adds the manpage for the CoDel (Controlled-Delay) AQM.
> ...
> 
>> +.SH SOURCES
>> +o   Kathleen Nicols and Van Jaconson, "Controlling Queue Delay", ACM Queue,
> 
> s/Nicols/Nichols/
> s/Jaconson/Jacobson/

Already applied, so I'll send a patch.

^ permalink raw reply

* Re: [PATCH IPROUTE2] tc-codel: Add manpage
From: Jan Ceuleers @ 2012-05-25  6:53 UTC (permalink / raw)
  To: Vijay Subramanian; +Cc: netdev, Stephen Hemminger, Eric Dumazet, Dave Taht
In-Reply-To: <1337834034-27803-1-git-send-email-subramanian.vijay@gmail.com>

On 05/24/2012 06:33 AM, Vijay Subramanian wrote:
> This patch adds the manpage for the CoDel (Controlled-Delay) AQM.
...

> +.SH SOURCES
> +o   Kathleen Nicols and Van Jaconson, "Controlling Queue Delay", ACM Queue,

s/Nicols/Nichols/
s/Jaconson/Jacobson/

Jan

^ permalink raw reply

* Re: Strange latency spikes/TX network stalls on Sun Fire X4150(x86) and e1000e
From: Eric Dumazet @ 2012-05-25  6:22 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Denys Fedoryshchenko, netdev
In-Reply-To: <CA+mtBx-gOcfOE91LnUpdq4PqM5KQusTn5WKcq0t2xvU7XM80Pg@mail.gmail.com>

On Thu, 2012-05-24 at 23:01 -0700, Tom Herbert wrote:
> I think there is a potential problem in that netdev_tx_completed could
> be called multiple times for the same interrupt, for example if napi
> poll routine completes its budget and is scheduled again and some new
> packets are completed.  We're looking at a solution to this.
> 
> Denys, can you try to increase the netdev budget to see if that has an effect?

TX completion has no budget, I am not sure what you mean.

The e1000e driver does have a limit: it cannot clean more than
tx_ring->count frames per e1000_clean_tx_irq() invocation.

But with BQL, this should not happen?

# ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX:		4096
RX Mini:	0
RX Jumbo:	0
TX:		4096
Current hardware settings:
RX:		256
RX Mini:	0
RX Jumbo:	0
TX:		256

^ permalink raw reply

* Re: [PATCH 05/17] netfilter: add namespace support for l4proto_tcp
From: Gao feng @ 2012-05-25  6:05 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netfilter-devel, netdev, serge.hallyn, ebiederm, dlezcano
In-Reply-To: <20120525030015.GB21076@1984>

On 2012-05-25 11:00, Pablo Neira Ayuso wrote:
> Hi Gao,
> 
> While having a look at this again, I have two new requests:
> 
> On Mon, May 14, 2012 at 04:52:15PM +0800, Gao feng wrote:
> [...]
>> diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
>> index 4dfbfa8..dd19350 100644
>> --- a/net/netfilter/nf_conntrack_proto_tcp.c
>> +++ b/net/netfilter/nf_conntrack_proto_tcp.c
> [...]
>> @@ -1549,10 +1532,80 @@ static struct ctl_table tcp_compat_sysctl_table[] = {
>>  #endif /* CONFIG_NF_CONNTRACK_PROC_COMPAT */
>>  #endif /* CONFIG_SYSCTL */
>>
>> +static int tcp_init_net(struct net *net, u_int8_t compat)
>> +{
>> +	int i;
>> +	struct nf_tcp_net *tn = tcp_pernet(net);
>> +	struct nf_proto_net *pn = (struct nf_proto_net *)tn;
>> +#ifdef CONFIG_SYSCTL
>> +#ifdef CONFIG_NF_CONNTRACK_PROC_COMPAT
>> +	if (compat) {
>> +		pn->ctl_compat_table = kmemdup(tcp_compat_sysctl_table,
>> +					       sizeof(tcp_compat_sysctl_table),
>> +					       GFP_KERNEL);
>> +		if (!pn->ctl_compat_table)
>> +			return -ENOMEM;
>> +
>> +		pn->ctl_compat_table[0].data = &tn->timeouts[TCP_CONNTRACK_SYN_SENT];
>> +		pn->ctl_compat_table[1].data = &tn->timeouts[TCP_CONNTRACK_SYN_SENT2];
>> +		pn->ctl_compat_table[2].data = &tn->timeouts[TCP_CONNTRACK_SYN_RECV];
>> +		pn->ctl_compat_table[3].data = &tn->timeouts[TCP_CONNTRACK_ESTABLISHED];
>> +		pn->ctl_compat_table[4].data = &tn->timeouts[TCP_CONNTRACK_FIN_WAIT];
>> +		pn->ctl_compat_table[5].data = &tn->timeouts[TCP_CONNTRACK_CLOSE_WAIT];
>> +		pn->ctl_compat_table[6].data = &tn->timeouts[TCP_CONNTRACK_LAST_ACK];
>> +		pn->ctl_compat_table[7].data = &tn->timeouts[TCP_CONNTRACK_TIME_WAIT];
>> +		pn->ctl_compat_table[8].data = &tn->timeouts[TCP_CONNTRACK_CLOSE];
>> +		pn->ctl_compat_table[9].data = &tn->timeouts[TCP_CONNTRACK_RETRANS];
>> +		pn->ctl_compat_table[10].data = &tn->tcp_loose;
>> +		pn->ctl_compat_table[11].data = &tn->tcp_be_liberal;
>> +		pn->ctl_compat_table[12].data = &tn->tcp_max_retrans;
> 
> You can make a generic function to set the ctl_data that you can
> reuse for this code above and the one below.
> 

Actually, I would like to reuse this code too, but unfortunately the
ctl_data entries have a different order and size in the two tables:
ctl_compat_table[1].data = &tn->timeouts[TCP_CONNTRACK_SYN_SENT2];
but
ctl_table[1].data = &tn->timeouts[TCP_CONNTRACK_SYN_RECV];
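
One possible way to share the code despite the different ordering is a table-driven helper that takes the slot order as a parameter. The following is only a user-space sketch under that assumption: the enum, the structs and the names set_ctl_data, compat_order and plain_order are hypothetical stand-ins, not the kernel's definitions, and only the first three slots of each table are modelled.

```c
#include <stddef.h>

/* Hypothetical subset of the TCP conntrack timeout indices. */
enum st {
	ST_SYN_SENT,
	ST_SYN_RECV,
	ST_ESTABLISHED,
	ST_SYN_SENT2,
	ST_MAX
};

/* Simplified stand-ins for struct ctl_table and struct nf_tcp_net. */
struct ctl_entry {
	void *data;
};

struct tcp_net {
	unsigned int timeouts[ST_MAX];
};

/* Slot order of each table; the two tables differ from slot 1 on. */
static const enum st compat_order[] = {
	ST_SYN_SENT, ST_SYN_SENT2, ST_SYN_RECV
};
static const enum st plain_order[] = {
	ST_SYN_SENT, ST_SYN_RECV, ST_ESTABLISHED
};

/* Points table[i].data at the timeout selected by order[i]. */
static void set_ctl_data(struct ctl_entry *table, const enum st *order,
			 size_t n, struct tcp_net *tn)
{
	for (size_t i = 0; i < n; i++)
		table[i].data = &tn->timeouts[order[i]];
}
```

The same helper then fills both tables, and the per-table differences live in the two small order arrays rather than in duplicated assignment lists.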


>> +	}
>> +#endif
>> +	if (!pn->ctl_table) {
>> +#else
>> +	if (!pn->user++) {
>> +#endif
>> +		for (i = 0; i < TCP_CONNTRACK_TIMEOUT_MAX; i++)
>> +			tn->timeouts[i] = tcp_timeouts[i];
>> +		tn->tcp_loose = nf_ct_tcp_loose;
>> +		tn->tcp_be_liberal = nf_ct_tcp_be_liberal;
>> +		tn->tcp_max_retrans = nf_ct_tcp_max_retrans;
>> +#ifdef CONFIG_SYSCTL
>> +		pn->ctl_table = kmemdup(tcp_sysctl_table,
>> +					sizeof(tcp_sysctl_table),
>> +					GFP_KERNEL);
>> +		if (!pn->ctl_table) {
>> +#ifdef CONFIG_NF_CONNTRACK_PROC_COMPAT
>> +			if (compat) {
>> +				kfree(pn->ctl_compat_table);
>> +				pn->ctl_compat_table = NULL;
>> +			}
>> +#endif
>> +			return -ENOMEM;
>> +		}
>> +		pn->ctl_table[0].data = &tn->timeouts[TCP_CONNTRACK_SYN_SENT];
>> +		pn->ctl_table[1].data = &tn->timeouts[TCP_CONNTRACK_SYN_RECV];
>> +		pn->ctl_table[2].data = &tn->timeouts[TCP_CONNTRACK_ESTABLISHED];
>> +		pn->ctl_table[3].data = &tn->timeouts[TCP_CONNTRACK_FIN_WAIT];
>> +		pn->ctl_table[4].data = &tn->timeouts[TCP_CONNTRACK_CLOSE_WAIT];
>> +		pn->ctl_table[5].data = &tn->timeouts[TCP_CONNTRACK_LAST_ACK];
>> +		pn->ctl_table[6].data = &tn->timeouts[TCP_CONNTRACK_TIME_WAIT];
>> +		pn->ctl_table[7].data = &tn->timeouts[TCP_CONNTRACK_CLOSE];
>> +		pn->ctl_table[8].data = &tn->timeouts[TCP_CONNTRACK_RETRANS];
>> +		pn->ctl_table[9].data = &tn->timeouts[TCP_CONNTRACK_UNACK];
>> +		pn->ctl_table[10].data = &tn->tcp_loose;
>> +		pn->ctl_table[11].data = &tn->tcp_be_liberal;
>> +		pn->ctl_table[12].data = &tn->tcp_max_retrans;
>> +#endif
> 
> I have bad experience with code that has lots of #ifdef's.
> 
> Please, split all *_init_net into smaller functions.

It does look ugly; I will try my best to make the code clearer. ;)

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 01/17] netfilter: add struct nf_proto_net for register l4proto sysctl
From: Gao feng @ 2012-05-25  6:02 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netfilter-devel, netdev, serge.hallyn, ebiederm, dlezcano,
	Gao feng
In-Reply-To: <20120525025451.GA21076@1984>

On 2012-05-25 10:54, Pablo Neira Ayuso wrote:
> On Fri, May 25, 2012 at 09:05:34AM +0800, Gao feng wrote:
>> On 2012-05-24 22:38, Pablo Neira Ayuso wrote:
>>> On Thu, May 24, 2012 at 06:54:42PM +0800, Gao feng wrote:
>>> [...]
>>>>>>> I don't see why we need this new field.
>>>>>>>
>>>>>>> It seems to be set to 1 in each structure that has set:
>>>>>>>
>>>>>>> .ctl_compat_table
>>>>>>>
>>>>>>> to non-NULL. So, it's redundant.
>>>>>>>
>>>>>>> Moreover, you already know from the protocol tracker itself if you
>>>>>>> have to allocate the compat ctl table or not.
>>>>>>>
>>>>>>> In other words: You set compat to 1 for nf_conntrack_l4proto_generic.
>>>>>>> Then, you pass that compat value to generic_init_net via ->inet_net
>>>>>>> again, but this information (that determines if the compat has to be
>>>>>>> done or not) is already in the scope of the protocol tracker.
>>>>>>>
>>>>>>
>>>>>> Because some protocols, such as l4proto_tcp6 and l4proto_tcp, use the same
>>>>>> init_net function. l4proto_tcp6 doesn't need the compat sysctl, so we should use
>>>>>> this new field to decide whether to kmemdup compat_sysctl_table.
>>>>>
>>>>> Then, could you use two init_net functions? one for TCP for IPv4 and another
>>>>> for TCP for IPv6?
>>>>
>>>> Of course, if you prefer to implement it this way.
>>>
>>> If this removes the .compat field that you added, then use two
>>> init_net functions, yes.
>>
>> Sorry, I missed something.
>>
>> nf_ct_l4proto_unregister_sysctl also uses .compat to decide whether we
>> can unregister the compat sysctl.
>>
>> If we register both l4proto_tcp and l4proto_tcp6, then without .compat,
>> unregistering l4proto_tcp6 will unregister the compat sysctl too.
>>
>> So maybe we have to use .compat.
> 
> Could you resolve this by checking pn->ctl_compat_header != NULL ?

pn->ctl_table_header and pn->ctl_compat_header are shared by l4proto_tcp and l4proto_tcp6.
If we register both l4proto_tcp and l4proto_tcp6, then when we unregister l4proto_tcp6,
pn->ctl_compat_header will still be non-NULL, so that check cannot tell the two apart.


^ permalink raw reply

* Re: Strange latency spikes/TX network stalls on Sun Fire X4150(x86) and e1000e
From: Tom Herbert @ 2012-05-25  6:01 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Denys Fedoryshchenko, netdev
In-Reply-To: <1337707497.3361.233.camel@edumazet-glaptop>

I think there is a potential problem in that netdev_tx_completed could
be called multiple times for the same interrupt, for example if napi
poll routine completes its budget and is scheduled again and some new
packets are completed.  We're looking at a solution to this.

Denys, can you try to increase the netdev budget to see if that has an effect?

Thanks,
Tom

On Tue, May 22, 2012 at 10:24 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2012-05-22 at 20:11 +0300, Denys Fedoryshchenko wrote:
>
>> By the way, if BQL limit is going lower than MTU, is it considered as a
>> bug?
>> If yes, i can try to upload 3.4 to some servers and add condition to
>> WARN_ON if limit < 1500.
>
> There is no problem with BQL limit going lower than the max packet size.
>
> (With TSO it can be 64K)
>
> Remember BQL allows one packet to be sent to device, regardless of its
> size.
>
> Next packet might be blocked/stay in Qdisc
>
> If your workload is mostly idle, but sending bursts of 3 packets, then
> only one is immediately sent.
>
> Next packets shall wait the TX completion of first packet.
>
>
>
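
The BQL behaviour described in the quoted text above (one packet is always let through, the following ones wait for TX completion) can be illustrated with a small user-space C model. This is a deliberately simplified sketch: the struct and the functions bql_try_send and bql_complete are hypothetical, the limit is fixed rather than dynamically adjusted, and the byte values are only examples.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of one TX queue under BQL. */
struct bql_model {
	size_t limit;    /* byte limit, may be below one packet */
	size_t inflight; /* bytes handed to the device, not completed */
	bool stopped;    /* queue stopped, packets stay in the qdisc */
};

/*
 * The limit is checked after the packet is accounted, so a packet is
 * always accepted while the queue is running, whatever its size.
 * Returns false if the packet has to wait in the qdisc.
 */
static bool bql_try_send(struct bql_model *q, size_t len)
{
	if (q->stopped)
		return false;
	q->inflight += len;
	if (q->inflight > q->limit)
		q->stopped = true;
	return true;
}

/* TX completion: release bytes and restart the queue if allowed. */
static void bql_complete(struct bql_model *q, size_t len)
{
	q->inflight -= len;
	if (q->inflight <= q->limit)
		q->stopped = false;
}
```

With a limit of 1000 bytes and a burst of three 1500-byte packets, the first call to bql_try_send succeeds and stops the queue, the next two fail, and the second packet only goes out after bql_complete has been called for the first.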

^ permalink raw reply

* [PATCH] mac80211: Use correct originator sequence number in a Path Reply
From: Qasim Javed @ 2012-05-25  5:02 UTC (permalink / raw)
  To: linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	devel-ZwoEplunGu1xMJw8dq7oimD2FQJk+8+b
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, ravip-DNmUmOh1Rg72fBVCVOL8/A

Hi,

I have been doing some experiments using the 802.11s functionality in the mac80211 stack. Today I stumbled across something which I believe is a critical bug in the usage of originator sequence number for a Path Reply message upon the reception of a Path Request message.

Consider the following topology:

                    +---+
                    | S |
                    +---+
                  /      \
                 /        \
            +---+          +---+
            | A |          | B |
            +---+          +---+
                 \         /
                  \       /
                    +---+
                    | D |
                    +---+

Node S is the source node and D the destination. There are two possible paths from S to D, namely S->A->D and S->B->D. When S wants to communicate with D, it broadcasts a Path Request (PREQ) in which the originator is S and the target is D. On receiving the PREQ, both A and B rebroadcast it, so it reaches D. Let us assume that the aggregate value of the metric for path D->B->S, denoted cost(DBS), is greater than cost(DAS). Note that according to HWMP operation, while the PREQ propagates from S to D, it is the cost of the "reverse" path that is aggregated; that is why I use cost(DB) + cost(BS) for cost(DBS) and do not consider cost(SBD). Suppose also that a smaller metric is better, which is the case for the default airtime link metric used by the 802.11s stack.

Let us suppose that the PREQ that passes through B arrives at D first and, as mentioned earlier, carries a larger (worse) metric than the PREQ that will soon arrive through the intermediate hop A. When D receives the PREQ from B, since it has not received any other PREQ, it generates a Path Reply (PREP). More specifically, the function hwmp_preq_frame_process generates the PREP. The PREQ contains originator and target sequence numbers, which are used to avoid loops and to ascertain the freshness of route information. On receiving a PREQ at D, the above-mentioned function checks whether dot11MeshHWMPnetDiameterTraversalTime has elapsed since the last sequence number update (stored in ifmsh->last_sn_update). Suppose this is true when the first PREQ, via B, is received at D. In this case the originator sequence number in the PREP is incremented (that is, it becomes one more than the target sequence number in the PREQ).

Let us look at an example at this point. Suppose the originator sequence number in the PREQ is 1 and the target sequence number is 2. When this PREQ is received at D via B, and given that dot11MeshHWMPnetDiameterTraversalTime has passed since the last sequence number update, we increment the target sequence number, which now becomes 3. For the PREP, the originator sequence number of the PREQ (1 in this case) becomes the target sequence number of the PREP, and the target sequence number of the PREQ (which has been updated to 3) becomes the originator sequence number of the PREP.

As the PREQ received at D via B has a larger metric, we know that when the PREQ from S is received via A, it will have a lower (better) metric; hence we will also generate a PREP for that PREQ. Suppose the second PREQ, via A, is received at D within dot11MeshHWMPnetDiameterTraversalTime (currently 50ms). This is a reasonable assumption, since the PREQ is broadcast by S and rebroadcast by A and B in some order. It is very unlikely that the time difference between the PREQs from A and B would exceed 50ms, since this is a long time in 802.11 terms, where nodes typically contend for the channel on the order of hundreds of microseconds. In short, it is very likely (confirmed through experiments) that this difference is less than 50ms.

So, when the PREQ from S arrives at D via A, most likely within 50ms of the PREQ via B, the target sequence number will not be updated. Therefore, the originator sequence number stays at 1 and the target sequence number remains 2. It is very important to note that in this case the code in hwmp_preq_frame_process just "swaps" the originator and target sequence numbers for use in the PREP. That is, the second PREP will have an originator sequence number of 2 and a target sequence number of 1.

At this point, we have two PREPs in flight, one via B and one via A.

PREP via B: originator sequence number = 3, target sequence number = 1
PREP via A: originator sequence number = 2, target sequence number = 1

The net effect is that when these PREPs reach S, irrespective of the order in which they arrive, the PREP via A will be ignored! This is very wrong since the reason we sent the PREP via A in the first place was that it had a better metric (albeit on the reverse path).

I have not tested the patch yet. This is more of a heads up email to let everyone know.

Signed-off-by: Qasim Javed <qasimj-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 net/mac80211/mesh_hwmp.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/net/mac80211/mesh_hwmp.c b/net/mac80211/mesh_hwmp.c
index 70ac7d1..a13b593 100644
--- a/net/mac80211/mesh_hwmp.c
+++ b/net/mac80211/mesh_hwmp.c
@@ -543,6 +543,8 @@ static void hwmp_preq_frame_process(struct ieee80211_sub_if_data *sdata,
 		    time_before(jiffies, ifmsh->last_sn_update)) {
 			target_sn = ++ifmsh->sn;
 			ifmsh->last_sn_update = jiffies;
+		} else {
+			target_sn = ifmsh->sn;
 		}
 	} else {
 		rcu_read_lock();
-- 
1.7.1

^ permalink raw reply related

