Netdev List
* Re: [PATCH] tcp: take care of overlaps in tcp_try_coalesce()
From: David Miller @ 2012-05-24  4:27 UTC
  To: eric.dumazet; +Cc: lists, netdev
In-Reply-To: <1337831497.3361.3888.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 24 May 2012 05:51:37 +0200

> [PATCH] tcp: take care of overlaps in tcp_try_coalesce()
> 
> Sergio Correia reported the following warning:
> 
> WARNING: at net/ipv4/tcp.c:1301 tcp_cleanup_rbuf+0x4f/0x110()
> 
> WARN(skb && !before(tp->copied_seq, TCP_SKB_CB(skb)->end_seq),
>      "cleanup rbuf bug: copied %X seq %X rcvnxt %X\n",
>      tp->copied_seq, TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt);
> 
> It appears that TCP coalescing, and more specifically commit b081f85c297
> (net: implement tcp coalescing in tcp_queue_rcv()), should take care of
> possible segment overlaps in the receive queue. This was properly done in
> the case of out_of_order_queue by the caller.
> 
> For example, the segment at the tail of the queue has sequence 1000-2000,
> and we add a segment with sequence 1500-2500.
> This can happen in case of retransmits.
> 
> In this case, just don't do the coalescing.
> 
> Reported-by: Sergio Correia <lists@uece.net>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Tested-by: Sergio Correia <lists@uece.net>

Applied, thanks.


* Re: [PATCH V2] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
From: David Miller @ 2012-05-24  4:27 UTC
  To: eric.dumazet; +Cc: kunx.jiang, netdev, linux-kernel, yanmin_zhang
In-Reply-To: <1337827362.3361.3800.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 24 May 2012 04:42:42 +0200

> On Thu, 2012-05-24 at 09:39 +0800, kun.jiang wrote:
>> From: Yanmin Zhang <yanmin_zhang@linux.intel.com>
>> 
>> We hit a kernel OOPS.
>> 
> ...
>> 
>> Function free_fib_info resets nexthop_nh->nh_dev to NULL before releasing
>> fi. Another CPU might still be accessing fi. Fix it by delaying the release.
>> 
>> With the patch, we ran MTBF testing on an Android mobile device for 12 hours
>> and didn't trigger the issue.
>> 
>> Thanks to Eric for the very detailed review and for checking the issue.
>> 
>> Signed-off-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
>> Signed-off-by: Kun Jiang <kunx.jiang@intel.com>
>> ---
>>  net/ipv4/fib_semantics.c |   12 ++++++------
>>  1 files changed, 6 insertions(+), 6 deletions(-)
> 
> Acked-by: Eric Dumazet <edumazet@google.com>

Applied, thanks everyone.


* Re: [PATCH v2] mm: add a low limit to alloc_large_system_hash
From: David Miller @ 2012-05-24  4:27 UTC
  To: tim.bird; +Cc: eric.dumazet, paul.gortmaker, linux-kernel, netdev
In-Reply-To: <4FBD73CF.2000306@am.sony.com>

From: Tim Bird <tim.bird@am.sony.com>
Date: Wed, 23 May 2012 16:33:35 -0700

> This patch seems to have fallen through the cracks:
> 
> https://lkml.org/lkml/2012/2/27/20
> 
> The last message on the thread was an ACK by David Miller, and
> the question "Who wants to take this?"

I'll take it in via my 'net' tree, thanks Tim.


* [PATCH] tcp: take care of overlaps in tcp_try_coalesce()
From: Eric Dumazet @ 2012-05-24  3:51 UTC
  To: Sergio Correia, David Miller; +Cc: netdev
In-Reply-To: <CAJyhjX1UW=N0B9GFMt_u_gFEsLH2=8OJx0N7zJWVCrpic5Z9kQ@mail.gmail.com>

From: Eric Dumazet <edumazet@google.com>

On Thu, 2012-05-24 at 00:21 -0300, Sergio Correia wrote:

> With your patch applied, the warning hasn't shown up in the last few
> hours. Without it, dmesg would have been completely filled with these
> warnings by now.
> I might be able to reproduce your test setup tomorrow, and if so, I will retest.
> 

I am going to send an official patch right now.

My own testing confirmed the origin of the bug, and I am confident the
patch is OK.

Thanks !

[PATCH] tcp: take care of overlaps in tcp_try_coalesce()

Sergio Correia reported the following warning:

WARNING: at net/ipv4/tcp.c:1301 tcp_cleanup_rbuf+0x4f/0x110()

WARN(skb && !before(tp->copied_seq, TCP_SKB_CB(skb)->end_seq),
     "cleanup rbuf bug: copied %X seq %X rcvnxt %X\n",
     tp->copied_seq, TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt);

It appears that TCP coalescing, and more specifically commit b081f85c297
(net: implement tcp coalescing in tcp_queue_rcv()), should take care of
possible segment overlaps in the receive queue. This was properly done in
the case of out_of_order_queue by the caller.

For example, the segment at the tail of the queue has sequence 1000-2000,
and we add a segment with sequence 1500-2500.
This can happen in case of retransmits.

In this case, just don't do the coalescing.

Reported-by: Sergio Correia <lists@uece.net>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Sergio Correia <lists@uece.net>
---
 net/ipv4/tcp_input.c |    5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index cfa2aa1..b224eb8 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4555,6 +4555,11 @@ static bool tcp_try_coalesce(struct sock *sk,
 
 	if (tcp_hdr(from)->fin)
 		return false;
+
+	/* Its possible this segment overlaps with prior segment in queue */
+	if (TCP_SKB_CB(from)->seq != TCP_SKB_CB(to)->end_seq)
+		return false;
+
 	if (!skb_try_coalesce(to, from, fragstolen, &delta))
 		return false;
 

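
The overlap case described in the tcp_try_coalesce() patch above can be illustrated with a minimal userspace sketch. It is not the kernel code: struct seg, seq_before(), can_coalesce() and overlaps_tail() are hypothetical names standing in for the seq/end_seq fields of TCP_SKB_CB(skb), the kernel's before() helper, and the new check added by the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the seq/end_seq fields of TCP_SKB_CB(skb). */
struct seg {
	uint32_t seq;      /* first sequence number carried */
	uint32_t end_seq;  /* one past the last sequence number carried */
};

/* Wraparound-safe "a is before b", the same idea as the kernel's before(). */
static bool seq_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/* Coalescing is only safe when 'from' starts exactly where 'to' ends. */
static bool can_coalesce(const struct seg *to, const struct seg *from)
{
	assert(to && from);
	return from->seq == to->end_seq;
}

/* True when 'from' starts before the end of 'to', e.g. after a retransmit. */
static bool overlaps_tail(const struct seg *to, const struct seg *from)
{
	assert(to && from);
	return seq_before(from->seq, to->end_seq);
}
```

With a tail segment of 1000-2000 and a new segment of 1500-2500, can_coalesce() is false and overlaps_tail() is true, so the new segment would be queued on its own rather than merged.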

* Re: WARNING: at net/ipv4/tcp.c:1301 tcp_cleanup_rbuf+0x4f/0x110()
From: Sergio Correia @ 2012-05-24  3:21 UTC
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1337798274.3361.3234.camel@edumazet-glaptop>

On Wed, May 23, 2012 at 3:37 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2012-05-23 at 15:30 -0300, Sergio Correia wrote:
>> On Wed, May 23, 2012 at 1:36 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > On Wed, 2012-05-23 at 12:56 -0300, Sergio Correia wrote:
>> >> On Wed, May 23, 2012 at 12:08 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> >> > On Tue, 2012-05-22 at 11:47 -0300, Sergio Correia wrote:
>> >> >> Hi Eric,
>> >> > ...
>> >> >> Yes, it's an Atheros AR9285 adapter.
>> >> >> This morning I did a make mrproper before rebuilding the kernel
>> >> >> (should I always do that?), but the warning has just appeared again.
>> >> >
>> >> > OK, I am taking a look at this problem, thanks.
>> >> >
>> >>
>> >> Thanks. Let me know if you need additional info. As of now, my dmesg
>> >> basically shows only those warnings.
>> >
>> > I believe I found the bug and am testing a fix right now.
>> >
>> > By the way, we might have the same problem in tcp collapses.
>> >
>> > TCP coalescing (introduced in linux-3.5) triggers the problem faster.
>> >
>> > Please test following patch :
>> >
>>
>> I reverted back to 471368557a734c6c486ee757952c902b36e7fd01 and it
>> took almost one hour to trigger the warning. Now I have applied your
>> patch and will report back how it went after a few hours of testing.
>
> Thanks
>
> I triggered it very fast in my lab using the following setup:
>
> Sender machine :
>
> # tc qdisc add dev eth0 root netem delay 1ms 3ms 20 reorder 10 20
> for i in `seq 1 8`
> do
>  netperf -t OMNI  -C -c -H 172.30.42.8 -l 60 &
> done
> wait
> # tc -s -d qd
> qdisc netem 8002: dev eth0 root refcnt 2 limit 1000 delay 1.0ms  3.0ms
> 20% reorder 10% 20% gap 1
>  Sent 66030032010 bytes 43992779 pkt (dropped 13846, overlimits 0
> requeues 2712184)
>  backlog 0b 0p requeues 2712184
>
> The receiver machine runs a netserver and triggers the bug in a few seconds.
>
> (receiver being a slow machine, with r8169 NIC)
>

With your patch applied, the warning hasn't shown up in the last few
hours. Without it, dmesg would have been completely filled with these
warnings by now.
I might be able to reproduce your test setup tomorrow, and if so, I will retest.

thanks,
sergio


* Re: [PATCH V2] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
From: Eric Dumazet @ 2012-05-24  2:42 UTC
  To: kun.jiang; +Cc: netdev, LKML, YMZHANG, davem
In-Reply-To: <4FBD9161.3040206@intel.com>

On Thu, 2012-05-24 at 09:39 +0800, kun.jiang wrote:
> From: Yanmin Zhang <yanmin_zhang@linux.intel.com>
> 
> We hit a kernel OOPS.
> 
...
> 
> Function free_fib_info resets nexthop_nh->nh_dev to NULL before releasing
> fi. Another CPU might still be accessing fi. Fix it by delaying the release.
> 
> With the patch, we ran MTBF testing on an Android mobile device for 12 hours
> and didn't trigger the issue.
> 
> Thanks to Eric for the very detailed review and for checking the issue.
> 
> Signed-off-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
> Signed-off-by: Kun Jiang <kunx.jiang@intel.com>
> ---
>  net/ipv4/fib_semantics.c |   12 ++++++------
>  1 files changed, 6 insertions(+), 6 deletions(-)

Acked-by: Eric Dumazet <edumazet@google.com>

Thanks !


* Re: [PATCH 03/17] netfilter: add namespace support for l3proto
From: Gao feng @ 2012-05-24  1:58 UTC
  To: Pablo Neira Ayuso
  Cc: netfilter-devel, netdev, serge.hallyn, ebiederm, dlezcano
In-Reply-To: <20120523102910.GC2836@1984>

On 2012-05-23 18:29, Pablo Neira Ayuso wrote:
> On Mon, May 14, 2012 at 04:52:13PM +0800, Gao feng wrote:
>> -Add struct net as a parameter of nf_conntrack_l3proto_(un)register,
>>  and register or unregister the l3proto only when the net is init_net.
>>
>> -The new struct nf_ip_net stores the sysctl header and data of
>>  l3proto_ipv4, l4proto_tcp(6), l4proto_udp(6) and l4proto_icmp(v6).
>>  Because protocols such as tcp and tcp6 use the same data, making
>>  nf_ip_net a field of netns_ct is the easiest way to manage it.
>>
>> -nf_ct_l3proto_register_sysctl calls init_net to initialize the pernet
>>  data of l3proto.
>>
>> -nf_ct_l3proto_net is used to get the pernet data of l3proto.
>>
>> -Export nf_conntrack_l3proto_(un)register.
>>
>> -Use init_net as the parameter of nf_conntrack_l3proto_(un)register.
>>
>> Acked-by: Eric W. Biederman <ebiederm@xmission.com>
>> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
>> ---
>>  include/net/netfilter/nf_conntrack_l3proto.h   |    6 +-
>>  include/net/netns/conntrack.h                  |    8 ++
>>  net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c |    6 +-
>>  net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c |    6 +-
>>  net/netfilter/nf_conntrack_proto.c             |  127 +++++++++++++++---------
>>  5 files changed, 97 insertions(+), 56 deletions(-)
>>
>> diff --git a/include/net/netfilter/nf_conntrack_l3proto.h b/include/net/netfilter/nf_conntrack_l3proto.h
>> index 9766005..d6df8c7 100644
>> --- a/include/net/netfilter/nf_conntrack_l3proto.h
>> +++ b/include/net/netfilter/nf_conntrack_l3proto.h
>> @@ -79,8 +79,10 @@ struct nf_conntrack_l3proto {
>>  extern struct nf_conntrack_l3proto __rcu *nf_ct_l3protos[AF_MAX];
>>  
>>  /* Protocol registration. */
>> -extern int nf_conntrack_l3proto_register(struct nf_conntrack_l3proto *proto);
>> -extern void nf_conntrack_l3proto_unregister(struct nf_conntrack_l3proto *proto);
>> +extern int nf_conntrack_l3proto_register(struct net *net,
>> +					 struct nf_conntrack_l3proto *proto);
>> +extern void nf_conntrack_l3proto_unregister(struct net *net,
>> +					    struct nf_conntrack_l3proto *proto);
>>  extern struct nf_conntrack_l3proto *nf_ct_l3proto_find_get(u_int16_t l3proto);
>>  extern void nf_ct_l3proto_put(struct nf_conntrack_l3proto *p);
>>  
>> diff --git a/include/net/netns/conntrack.h b/include/net/netns/conntrack.h
>> index 1f53038..94992e9 100644
>> --- a/include/net/netns/conntrack.h
>> +++ b/include/net/netns/conntrack.h
>> @@ -20,6 +20,13 @@ struct nf_proto_net {
>>  	unsigned int		users;
>>  };
>>  
>> +struct nf_ip_net {
>> +#if defined(CONFIG_SYSCTL) && defined(CONFIG_NF_CONNTRACK_PROC_COMPAT)
>> +	struct ctl_table_header *ctl_table_header;
>> +	struct ctl_table	*ctl_table;
>> +#endif
>> +};
>> +
>>  struct netns_ct {
>>  	atomic_t		count;
>>  	unsigned int		expect_count;
>> @@ -40,6 +47,7 @@ struct netns_ct {
>>  	unsigned int		sysctl_log_invalid; /* Log invalid packets */
>>  	int			sysctl_auto_assign_helper;
>>  	bool			auto_assign_helper_warned;
>> +	struct nf_ip_net	proto;
>                                 ^^^^^
> please, rename this to something like nf_ct_proto.

Got it ;)

> 
>>  #ifdef CONFIG_SYSCTL
>>  	struct ctl_table_header	*sysctl_header;
>>  	struct ctl_table_header	*acct_sysctl_header;
>> diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
>> index 46ec515..0c0fb90 100644
>> --- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
>> +++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
>> @@ -409,7 +409,7 @@ static int __init nf_conntrack_l3proto_ipv4_init(void)
>>  		goto cleanup_udp;
>>  	}
>>  
>> -	ret = nf_conntrack_l3proto_register(&nf_conntrack_l3proto_ipv4);
>> +	ret = nf_conntrack_l3proto_register(&init_net, &nf_conntrack_l3proto_ipv4);
>>  	if (ret < 0) {
>>  		pr_err("nf_conntrack_ipv4: can't register ipv4\n");
>>  		goto cleanup_icmp;
>> @@ -432,7 +432,7 @@ static int __init nf_conntrack_l3proto_ipv4_init(void)
>>  	nf_unregister_hooks(ipv4_conntrack_ops, ARRAY_SIZE(ipv4_conntrack_ops));
>>  #endif
>>   cleanup_ipv4:
>> -	nf_conntrack_l3proto_unregister(&nf_conntrack_l3proto_ipv4);
>> +	nf_conntrack_l3proto_unregister(&init_net, &nf_conntrack_l3proto_ipv4);
>>   cleanup_icmp:
>>  	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_icmp);
>>   cleanup_udp:
>> @@ -451,7 +451,7 @@ static void __exit nf_conntrack_l3proto_ipv4_fini(void)
>>  	nf_conntrack_ipv4_compat_fini();
>>  #endif
>>  	nf_unregister_hooks(ipv4_conntrack_ops, ARRAY_SIZE(ipv4_conntrack_ops));
>> -	nf_conntrack_l3proto_unregister(&nf_conntrack_l3proto_ipv4);
>> +	nf_conntrack_l3proto_unregister(&init_net, &nf_conntrack_l3proto_ipv4);
>>  	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_icmp);
>>  	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_udp4);
>>  	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_tcp4);
>> diff --git a/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c b/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
>> index 55f379f..6cfbe7b 100644
>> --- a/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
>> +++ b/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
>> @@ -359,7 +359,7 @@ static int __init nf_conntrack_l3proto_ipv6_init(void)
>>  		goto cleanup_udp;
>>  	}
>>  
>> -	ret = nf_conntrack_l3proto_register(&nf_conntrack_l3proto_ipv6);
>> +	ret = nf_conntrack_l3proto_register(&init_net, &nf_conntrack_l3proto_ipv6);
>>  	if (ret < 0) {
>>  		pr_err("nf_conntrack_ipv6: can't register ipv6\n");
>>  		goto cleanup_icmpv6;
>> @@ -375,7 +375,7 @@ static int __init nf_conntrack_l3proto_ipv6_init(void)
>>  	return ret;
>>  
>>   cleanup_ipv6:
>> -	nf_conntrack_l3proto_unregister(&nf_conntrack_l3proto_ipv6);
>> +	nf_conntrack_l3proto_unregister(&init_net, &nf_conntrack_l3proto_ipv6);
>>   cleanup_icmpv6:
>>  	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_icmpv6);
>>   cleanup_udp:
>> @@ -389,7 +389,7 @@ static void __exit nf_conntrack_l3proto_ipv6_fini(void)
>>  {
>>  	synchronize_net();
>>  	nf_unregister_hooks(ipv6_conntrack_ops, ARRAY_SIZE(ipv6_conntrack_ops));
>> -	nf_conntrack_l3proto_unregister(&nf_conntrack_l3proto_ipv6);
>> +	nf_conntrack_l3proto_unregister(&init_net, &nf_conntrack_l3proto_ipv6);
>>  	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_icmpv6);
>>  	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_udp6);
>>  	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_tcp6);
>> diff --git a/net/netfilter/nf_conntrack_proto.c b/net/netfilter/nf_conntrack_proto.c
>> index 6d68727..7ee6653 100644
>> --- a/net/netfilter/nf_conntrack_proto.c
>> +++ b/net/netfilter/nf_conntrack_proto.c
>> @@ -170,85 +170,116 @@ static int kill_l4proto(struct nf_conn *i, void *data)
>>  	       nf_ct_l3num(i) == l4proto->l3proto;
>>  }
>>  
>> -static int nf_ct_l3proto_register_sysctl(struct nf_conntrack_l3proto *l3proto)
>> +static struct nf_ip_net *nf_ct_l3proto_net(struct net *net,
>> +					   struct nf_conntrack_l3proto *l3proto)
>> +{
>> +	if (l3proto->l3proto == PF_INET)
>> +		return &net->ct.proto;
>> +	else
>> +		return NULL;
>> +}
>> +
>> +static int nf_ct_l3proto_register_sysctl(struct net *net,
>> +					 struct nf_conntrack_l3proto *l3proto)
>>  {
>>  	int err = 0;
>> +	struct nf_ip_net *in = nf_ct_l3proto_net(net, l3proto);
>>  
>> -#ifdef CONFIG_SYSCTL
>> -	if (l3proto->ctl_table != NULL) {
>> -		err = nf_ct_register_sysctl(&init_net,
>> -					    &l3proto->ctl_table_header,
>> +	if (in == NULL)
>> +		return 0;
> 
> Under what circumstances can 'in' be NULL?

Because l3proto_ipv6 doesn't need sysctl, its nf_ip_net is NULL.
Please see the function nf_ct_l3proto_net above.

> 
>> +
>> +#if defined(CONFIG_SYSCTL) && defined(CONFIG_NF_CONNTRACK_PROC_COMPAT)
>> +	if (in->ctl_table != NULL) {
>> +		err = nf_ct_register_sysctl(net,
>> +					    &in->ctl_table_header,
>>  					    l3proto->ctl_table_path,
>> -					    l3proto->ctl_table, NULL);
>> +					    in->ctl_table,
>> +					    NULL);
>> +		if (err < 0) {
>> +			kfree(in->ctl_table);
>> +			in->ctl_table = NULL;
> 
> do we need this extra NULL assignment?
> 
>> +		}
>>  	}
>>  #endif
>>  	return err;
>>  }
>>  
>> -static void nf_ct_l3proto_unregister_sysctl(struct nf_conntrack_l3proto *l3proto)
>> +static void nf_ct_l3proto_unregister_sysctl(struct net *net,
>> +					    struct nf_conntrack_l3proto *l3proto)
>>  {
>> -#ifdef CONFIG_SYSCTL
>> -	if (l3proto->ctl_table_header != NULL)
>> -		nf_ct_unregister_sysctl(&l3proto->ctl_table_header,
>> -					&l3proto->ctl_table, NULL);
>> +	struct nf_ip_net *in = nf_ct_l3proto_net(net, l3proto);
>> +
>> +	if (in == NULL)
>> +		return;
>> +#if defined(CONFIG_SYSCTL) && defined(CONFIG_NF_CONNTRACK_PROC_COMPAT)
>> +	if (in->ctl_table_header != NULL)
>> +		nf_ct_unregister_sysctl(&in->ctl_table_header,
>> +					&in->ctl_table,
>> +					NULL);
>>  #endif
>>  }
>>  
>> -int nf_conntrack_l3proto_register(struct nf_conntrack_l3proto *proto)
>> +int nf_conntrack_l3proto_register(struct net *net,
>> +				  struct nf_conntrack_l3proto *proto)
>>  {
>>  	int ret = 0;
>> -	struct nf_conntrack_l3proto *old;
>> -
>> -	if (proto->l3proto >= AF_MAX)
>> -		return -EBUSY;
>>  
>> -	if (proto->tuple_to_nlattr && !proto->nlattr_tuple_size)
>> -		return -EINVAL;
>> +	if (net == &init_net) {
> 
> Same things as in previous patch. Move...
> 
> if (net == &init_net) {
>      ... this code ...
> }
> 
> into some static int nf_conntrack_l3proto_register_net function.
> 

Got it.
thanks
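
The refactor requested above (moving the body of the `if (net == &init_net)` branch into its own static function) could look roughly like the userspace sketch below. Everything in it is a hypothetical stand-in: struct net, init_net, struct l3proto and both function names only mirror the shape of the kernel code, and the global registration logic is reduced to a flag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's struct net and init_net. */
struct net { int id; };
static struct net init_net;

/* Hypothetical stand-in for struct nf_conntrack_l3proto. */
struct l3proto {
	int l3proto;     /* address family */
	int registered;  /* set once the global registration has run */
};

/* Helper holding what used to sit inside "if (net == &init_net) { ... }". */
static int l3proto_register_net(struct l3proto *proto)
{
	if (proto->registered)
		return -1;	/* stands in for -EBUSY */
	proto->registered = 1;
	return 0;
}

static int l3proto_register(struct net *net, struct l3proto *proto)
{
	int ret;

	assert(net && proto);
	if (net == &init_net) {
		ret = l3proto_register_net(proto);
		if (ret < 0)
			return ret;
	}
	/* Per-namespace setup (sysctl registration in the patch) would go here. */
	return 0;
}
```

The caller keeps a single level of indentation, which is the point of the review comment: the global registration code no longer needs the extra tab inside the branch.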


* Re: [PATCH 02/17] netfilter: add namespace support for l4proto
From: Gao feng @ 2012-05-24  1:52 UTC
  To: Pablo Neira Ayuso
  Cc: netfilter-devel, netdev, serge.hallyn, ebiederm, dlezcano,
	Gao feng
In-Reply-To: <20120523102557.GB2836@1984>

On 2012-05-23 18:25, Pablo Neira Ayuso wrote:
> On Mon, May 14, 2012 at 04:52:12PM +0800, Gao feng wrote:
>> From: Gao feng <gaofeng@cn.fujitus.com>
>>
>> -nf_ct_(un)register_sysctl are changed to support net namespaces:
>>  (un)register_net_sysctl_table replaces (un)register_sysctl_paths,
>>  and nf_ct_unregister_sysctl kfrees the table only when users is 0.
>>
>> -Add struct net as a parameter of nf_conntrack_l4proto_(un)register,
>>  and register or unregister the l4proto only when the net is init_net.
>>
>> -nf_conntrack_l4proto_register calls init_net to initialize the pernet
>>  data of l4proto.
>>
>> -nf_ct_l4proto_net is used to get the pernet data of l4proto.
>>
>> -Use init_net as a parameter of nf_conntrack_l4proto_(un)register.
>>
>> Acked-by: Eric W. Biederman <ebiederm@xmission.com>
>> Signed-off-by: Gao feng <gaofeng@cn.fujitus.com>
>> ---
>>  include/net/netfilter/nf_conntrack_l4proto.h   |   13 +-
>>  net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c |   18 +-
>>  net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c |   18 +-
>>  net/netfilter/nf_conntrack_proto.c             |  245 ++++++++++++++----------
>>  net/netfilter/nf_conntrack_proto_dccp.c        |   10 +-
>>  net/netfilter/nf_conntrack_proto_gre.c         |    6 +-
>>  net/netfilter/nf_conntrack_proto_sctp.c        |   10 +-
>>  net/netfilter/nf_conntrack_proto_udplite.c     |   10 +-
>>  8 files changed, 191 insertions(+), 139 deletions(-)
>>
>> diff --git a/include/net/netfilter/nf_conntrack_l4proto.h b/include/net/netfilter/nf_conntrack_l4proto.h
>> index a90eab5..a93dcd5 100644
>> --- a/include/net/netfilter/nf_conntrack_l4proto.h
>> +++ b/include/net/netfilter/nf_conntrack_l4proto.h
>> @@ -12,7 +12,7 @@
>>  #include <linux/netlink.h>
>>  #include <net/netlink.h>
>>  #include <net/netfilter/nf_conntrack.h>
>> -
>> +#include <net/netns/generic.h>
> 
> Minor nitpick: make sure there's still one line between this structure
> below and the include headers.

Thanks! I will fix it.

> 
>>  struct seq_file;
>>  
>>  struct nf_conntrack_l4proto {
>> @@ -129,8 +129,15 @@ nf_ct_l4proto_find_get(u_int16_t l3proto, u_int8_t l4proto);
>>  extern void nf_ct_l4proto_put(struct nf_conntrack_l4proto *p);
>>  
>>  /* Protocol registration. */
>> -extern int nf_conntrack_l4proto_register(struct nf_conntrack_l4proto *proto);
>> -extern void nf_conntrack_l4proto_unregister(struct nf_conntrack_l4proto *proto);
>> +extern int nf_conntrack_l4proto_register(struct net *net,
>> +					 struct nf_conntrack_l4proto *proto);
>> +extern void nf_conntrack_l4proto_unregister(struct net *net,
>> +					    struct nf_conntrack_l4proto *proto);
>> +
>> +extern int nf_ct_l4proto_register_sysctl(struct net *net,
>> +					 struct nf_conntrack_l4proto *l4proto);
>> +extern void nf_ct_l4proto_unregister_sysctl(struct net *net,
>> +					    struct nf_conntrack_l4proto *l4proto);
>>  
>>  /* Generic netlink helpers */
>>  extern int nf_ct_port_tuple_to_nlattr(struct sk_buff *skb,
>> diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
>> index 91747d4..46ec515 100644
>> --- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
>> +++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
>> @@ -391,19 +391,19 @@ static int __init nf_conntrack_l3proto_ipv4_init(void)
>>  		return ret;
>>  	}
>>  
>> -	ret = nf_conntrack_l4proto_register(&nf_conntrack_l4proto_tcp4);
>> +	ret = nf_conntrack_l4proto_register(&init_net, &nf_conntrack_l4proto_tcp4);
>>  	if (ret < 0) {
>>  		pr_err("nf_conntrack_ipv4: can't register tcp.\n");
>>  		goto cleanup_sockopt;
>>  	}
>>  
>> -	ret = nf_conntrack_l4proto_register(&nf_conntrack_l4proto_udp4);
>> +	ret = nf_conntrack_l4proto_register(&init_net, &nf_conntrack_l4proto_udp4);
>>  	if (ret < 0) {
>>  		pr_err("nf_conntrack_ipv4: can't register udp.\n");
>>  		goto cleanup_tcp;
>>  	}
>>  
>> -	ret = nf_conntrack_l4proto_register(&nf_conntrack_l4proto_icmp);
>> +	ret = nf_conntrack_l4proto_register(&init_net, &nf_conntrack_l4proto_icmp);
>>  	if (ret < 0) {
>>  		pr_err("nf_conntrack_ipv4: can't register icmp.\n");
>>  		goto cleanup_udp;
>> @@ -434,11 +434,11 @@ static int __init nf_conntrack_l3proto_ipv4_init(void)
>>   cleanup_ipv4:
>>  	nf_conntrack_l3proto_unregister(&nf_conntrack_l3proto_ipv4);
>>   cleanup_icmp:
>> -	nf_conntrack_l4proto_unregister(&nf_conntrack_l4proto_icmp);
>> +	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_icmp);
>>   cleanup_udp:
>> -	nf_conntrack_l4proto_unregister(&nf_conntrack_l4proto_udp4);
>> +	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_udp4);
>>   cleanup_tcp:
>> -	nf_conntrack_l4proto_unregister(&nf_conntrack_l4proto_tcp4);
>> +	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_tcp4);
>>   cleanup_sockopt:
>>  	nf_unregister_sockopt(&so_getorigdst);
>>  	return ret;
>> @@ -452,9 +452,9 @@ static void __exit nf_conntrack_l3proto_ipv4_fini(void)
>>  #endif
>>  	nf_unregister_hooks(ipv4_conntrack_ops, ARRAY_SIZE(ipv4_conntrack_ops));
>>  	nf_conntrack_l3proto_unregister(&nf_conntrack_l3proto_ipv4);
>> -	nf_conntrack_l4proto_unregister(&nf_conntrack_l4proto_icmp);
>> -	nf_conntrack_l4proto_unregister(&nf_conntrack_l4proto_udp4);
>> -	nf_conntrack_l4proto_unregister(&nf_conntrack_l4proto_tcp4);
>> +	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_icmp);
>> +	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_udp4);
>> +	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_tcp4);
>>  	nf_unregister_sockopt(&so_getorigdst);
>>  }
>>  
>> diff --git a/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c b/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
>> index fe925e4..55f379f 100644
>> --- a/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
>> +++ b/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
>> @@ -341,19 +341,19 @@ static int __init nf_conntrack_l3proto_ipv6_init(void)
>>  	need_conntrack();
>>  	nf_defrag_ipv6_enable();
>>  
>> -	ret = nf_conntrack_l4proto_register(&nf_conntrack_l4proto_tcp6);
>> +	ret = nf_conntrack_l4proto_register(&init_net, &nf_conntrack_l4proto_tcp6);
>>  	if (ret < 0) {
>>  		pr_err("nf_conntrack_ipv6: can't register tcp.\n");
>>  		return ret;
>>  	}
>>  
>> -	ret = nf_conntrack_l4proto_register(&nf_conntrack_l4proto_udp6);
>> +	ret = nf_conntrack_l4proto_register(&init_net, &nf_conntrack_l4proto_udp6);
>>  	if (ret < 0) {
>>  		pr_err("nf_conntrack_ipv6: can't register udp.\n");
>>  		goto cleanup_tcp;
>>  	}
>>  
>> -	ret = nf_conntrack_l4proto_register(&nf_conntrack_l4proto_icmpv6);
>> +	ret = nf_conntrack_l4proto_register(&init_net, &nf_conntrack_l4proto_icmpv6);
>>  	if (ret < 0) {
>>  		pr_err("nf_conntrack_ipv6: can't register icmpv6.\n");
>>  		goto cleanup_udp;
>> @@ -377,11 +377,11 @@ static int __init nf_conntrack_l3proto_ipv6_init(void)
>>   cleanup_ipv6:
>>  	nf_conntrack_l3proto_unregister(&nf_conntrack_l3proto_ipv6);
>>   cleanup_icmpv6:
>> -	nf_conntrack_l4proto_unregister(&nf_conntrack_l4proto_icmpv6);
>> +	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_icmpv6);
>>   cleanup_udp:
>> -	nf_conntrack_l4proto_unregister(&nf_conntrack_l4proto_udp6);
>> +	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_udp6);
>>   cleanup_tcp:
>> -	nf_conntrack_l4proto_unregister(&nf_conntrack_l4proto_tcp6);
>> +	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_tcp6);
>>  	return ret;
>>  }
>>  
>> @@ -390,9 +390,9 @@ static void __exit nf_conntrack_l3proto_ipv6_fini(void)
>>  	synchronize_net();
>>  	nf_unregister_hooks(ipv6_conntrack_ops, ARRAY_SIZE(ipv6_conntrack_ops));
>>  	nf_conntrack_l3proto_unregister(&nf_conntrack_l3proto_ipv6);
>> -	nf_conntrack_l4proto_unregister(&nf_conntrack_l4proto_icmpv6);
>> -	nf_conntrack_l4proto_unregister(&nf_conntrack_l4proto_udp6);
>> -	nf_conntrack_l4proto_unregister(&nf_conntrack_l4proto_tcp6);
>> +	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_icmpv6);
>> +	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_udp6);
>> +	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_tcp6);
>>  }
>>  
>>  module_init(nf_conntrack_l3proto_ipv6_init);
>> diff --git a/net/netfilter/nf_conntrack_proto.c b/net/netfilter/nf_conntrack_proto.c
>> index 8b631b0..6d68727 100644
>> --- a/net/netfilter/nf_conntrack_proto.c
>> +++ b/net/netfilter/nf_conntrack_proto.c
>> @@ -35,30 +35,39 @@ EXPORT_SYMBOL_GPL(nf_ct_l3protos);
>>  static DEFINE_MUTEX(nf_ct_proto_mutex);
>>  
>>  #ifdef CONFIG_SYSCTL
>> -static int
>> -nf_ct_register_sysctl(struct ctl_table_header **header, const char *path,
>> -		      struct ctl_table *table, unsigned int *users)
>> +int
>> +nf_ct_register_sysctl(struct net *net,
>> +		      struct ctl_table_header **header,
>> +		      const char *path,
>> +		      struct ctl_table *table,
>> +		      unsigned int *users)
>>  {
>>  	if (*header == NULL) {
>> -		*header = register_net_sysctl(&init_net, path, table);
>> +		*header = register_net_sysctl(net, path, table);
>>  		if (*header == NULL)
>>  			return -ENOMEM;
>>  	}
>>  	if (users != NULL)
>>  		(*users)++;
>> +
>>  	return 0;
>>  }
>> +EXPORT_SYMBOL_GPL(nf_ct_register_sysctl);
> 
> I don't see why we need to export nf_ct_register_sysctl. I think this
> is a left-over from the previous patchset.

I missed it...
thanks

> 
>> -static void
>> +void
>>  nf_ct_unregister_sysctl(struct ctl_table_header **header,
>> -			struct ctl_table *table, unsigned int *users)
>> +			struct ctl_table **table,
>> +			unsigned int *users)
>>  {
>>  	if (users != NULL && --*users > 0)
>>  		return;
>>  
>>  	unregister_net_sysctl_table(*header);
>> +	kfree(*table);
>>  	*header = NULL;
>> +	*table = NULL;
>>  }
>> +EXPORT_SYMBOL_GPL(nf_ct_unregister_sysctl);
> 
> Same thing. I don't find any external user of this new exported
> function in your entire patchset.
> 
> You have to fix this.
> 
>>  #endif
>>  
>>  struct nf_conntrack_l4proto *
>> @@ -167,7 +176,8 @@ static int nf_ct_l3proto_register_sysctl(struct nf_conntrack_l3proto *l3proto)
>>  
>>  #ifdef CONFIG_SYSCTL
>>  	if (l3proto->ctl_table != NULL) {
>> -		err = nf_ct_register_sysctl(&l3proto->ctl_table_header,
>> +		err = nf_ct_register_sysctl(&init_net,
>> +					    &l3proto->ctl_table_header,
>>  					    l3proto->ctl_table_path,
>>  					    l3proto->ctl_table, NULL);
>>  	}
>> @@ -180,7 +190,7 @@ static void nf_ct_l3proto_unregister_sysctl(struct nf_conntrack_l3proto *l3proto
>>  #ifdef CONFIG_SYSCTL
>>  	if (l3proto->ctl_table_header != NULL)
>>  		nf_ct_unregister_sysctl(&l3proto->ctl_table_header,
>> -					l3proto->ctl_table, NULL);
>> +					&l3proto->ctl_table, NULL);
>>  #endif
>>  }
>>  
>> @@ -243,137 +253,172 @@ void nf_conntrack_l3proto_unregister(struct nf_conntrack_l3proto *proto)
>>  }
>>  EXPORT_SYMBOL_GPL(nf_conntrack_l3proto_unregister);
>>  
>> -static int nf_ct_l4proto_register_sysctl(struct nf_conntrack_l4proto *l4proto)
>> +static struct nf_proto_net *nf_ct_l4proto_net(struct net *net,
>> +					      struct nf_conntrack_l4proto *l4proto)
>>  {
>> -	int err = 0;
>> +	if (l4proto->net_id)
>> +		return net_generic(net, *l4proto->net_id);
>> +	else
>> +		return NULL;
>> +}
>>  
>> +int nf_ct_l4proto_register_sysctl(struct net *net,
>> +				  struct nf_conntrack_l4proto *l4proto)
>> +{
>> +	int err = 0;
>> +	struct nf_proto_net *pn = nf_ct_l4proto_net(net, l4proto);
>> +	if (pn == NULL)
>> +		return 0;
>>  #ifdef CONFIG_SYSCTL
>> -	if (l4proto->ctl_table != NULL) {
>> -		err = nf_ct_register_sysctl(l4proto->ctl_table_header,
>> +	if (pn->ctl_table != NULL) {
>> +		err = nf_ct_register_sysctl(net,
>> +					    &pn->ctl_table_header,
>>  					    "net/netfilter",
>> -					    l4proto->ctl_table,
>> -					    l4proto->ctl_table_users);
>> -		if (err < 0)
>> +					    pn->ctl_table,
>> +					    &pn->users);
>> +		if (err < 0) {
>> +			kfree(pn->ctl_table);
>> +			pn->ctl_table = NULL;
>                                ^^^^^^^^^^^
> Do you really need to set this above to NULL? Is there any existing
> bug trap? If not, it's superfluous, please, remove it.
> 
Yes. The ctl_table of l4proto_tcp (and of udp and icmp) is stored in
netns_ct.proto, so if registering l4proto_tcp's sysctl fails, ctl_table
would still point to the kfreed memory. That would cause a panic the next
time we register l4proto_tcp's sysctl.
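
To make the dangling-pointer argument concrete, here is a minimal userspace sketch under stated assumptions: struct proto_net, register_table() and proto_register_sysctl() are hypothetical names, malloc/free stand in for the kernel allocator, and the registration step is hard-wired to fail so that the error path runs.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for per-namespace protocol data that outlives a call. */
struct proto_net {
	int *ctl_table;   /* allocated copy of the sysctl table */
};

/* Hypothetical registration step that always fails, to exercise the error path. */
static int register_table(int *table)
{
	(void)table;
	return -1;
}

static int proto_register_sysctl(struct proto_net *pn)
{
	int err;

	assert(pn);
	if (pn->ctl_table == NULL)
		return 0;	/* nothing to register */

	err = register_table(pn->ctl_table);
	if (err < 0) {
		free(pn->ctl_table);
		/* pn is reused later, so do not leave a stale pointer behind. */
		pn->ctl_table = NULL;
	}
	return err;
}
```

Without the NULL assignment, a second call would pass the freed table to register_table() and free it again; with it, the second call sees NULL and returns early.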

>>  			goto out;
>> +		}
>>  	}
>>  #ifdef CONFIG_NF_CONNTRACK_PROC_COMPAT
>> -	if (l4proto->ctl_compat_table != NULL) {
>> -		err = nf_ct_register_sysctl(&l4proto->ctl_compat_table_header,
>> +	if (l4proto->compat && pn->ctl_compat_table != NULL) {
>> +		err = nf_ct_register_sysctl(net,
>> +					    &pn->ctl_compat_header,
>>  					    "net/ipv4/netfilter",
>> -					    l4proto->ctl_compat_table, NULL);
>> +					    pn->ctl_compat_table,
>> +					    NULL);
>>  		if (err == 0)
>>  			goto out;
>> -		nf_ct_unregister_sysctl(l4proto->ctl_table_header,
>> -					l4proto->ctl_table,
>> -					l4proto->ctl_table_users);
>> +
>> +		kfree(pn->ctl_compat_table);
>> +		pn->ctl_compat_table = NULL;
>> +		nf_ct_unregister_sysctl(&pn->ctl_table_header,
>> +					&pn->ctl_table,
>> +					&pn->users);
>>  	}
>>  #endif /* CONFIG_NF_CONNTRACK_PROC_COMPAT */
>>  out:
>>  #endif /* CONFIG_SYSCTL */
>>  	return err;
>>  }
>> +EXPORT_SYMBOL_GPL(nf_ct_l4proto_register_sysctl);
>>  
>> -static void nf_ct_l4proto_unregister_sysctl(struct nf_conntrack_l4proto *l4proto)
>> +void nf_ct_l4proto_unregister_sysctl(struct net *net,
>> +				     struct nf_conntrack_l4proto *l4proto)
>>  {
>> +	struct nf_proto_net *pn = nf_ct_l4proto_net(net, l4proto);
>> +	if (pn == NULL)
>> +		return;
>>  #ifdef CONFIG_SYSCTL
>> -	if (l4proto->ctl_table_header != NULL &&
>> -	    *l4proto->ctl_table_header != NULL)
>> -		nf_ct_unregister_sysctl(l4proto->ctl_table_header,
>> -					l4proto->ctl_table,
>> -					l4proto->ctl_table_users);
>> +	if (pn->ctl_table_header != NULL)
>> +		nf_ct_unregister_sysctl(&pn->ctl_table_header,
>> +					&pn->ctl_table,
>> +					&pn->users);
>> +
>>  #ifdef CONFIG_NF_CONNTRACK_PROC_COMPAT
>> -	if (l4proto->ctl_compat_table_header != NULL)
>> -		nf_ct_unregister_sysctl(&l4proto->ctl_compat_table_header,
>> -					l4proto->ctl_compat_table, NULL);
>> +	if (l4proto->compat && pn->ctl_compat_header != NULL)
>> +		nf_ct_unregister_sysctl(&pn->ctl_compat_header,
>> +					&pn->ctl_compat_table,
>> +					NULL);
>>  #endif /* CONFIG_NF_CONNTRACK_PROC_COMPAT */
>> +#else
>> +	pn->users--;
>>  #endif /* CONFIG_SYSCTL */
>>  }
>> +EXPORT_SYMBOL_GPL(nf_ct_l4proto_unregister_sysctl);
>>  
>>  /* FIXME: Allow NULL functions and sub in pointers to generic for
>>     them. --RR */
>> -int nf_conntrack_l4proto_register(struct nf_conntrack_l4proto *l4proto)
>> +int nf_conntrack_l4proto_register(struct net *net,
>> +				  struct nf_conntrack_l4proto *l4proto)
>>  {
>>  	int ret = 0;
> 
> Minor nitpick: you can save the edits in this function that
> result from the extra tabbing by moving all ...
> 
> if (net == &init_net) {
>     ... this code ...
> }
> 
> into some new static int nf_conntrack_l4proto_register_net(...) that
> will be called by nf_conntrack_l4proto_register.
> 
> It will result in more maintainable code. We still stick to 80-char
> columns, so saving that extra tabbing makes the code more readable.
> 

Yes, it will be more readable. I will do it.
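
A rough sketch of the shape being suggested, with stub types so that it
compiles on its own. All names here (struct l4proto, l4proto_register,
l4proto_register_net) are hypothetical and simplified; the real function
takes struct nf_conntrack_l4proto and does considerably more work.

```c
#include <stddef.h>

/* Stub types standing in for the kernel structures. */
struct net { int id; };
struct l4proto { int registered; };

static struct net init_net;

/* Hypothetical helper: holds what used to sit inside the if-block,
 * so its body is indented one level less. */
static int l4proto_register_net(struct l4proto *p)
{
	if (p == NULL)
		return -1;
	p->registered = 1;
	return 0;
}

/* The outer function keeps only the init_net check and delegates. */
static int l4proto_register(struct net *net, struct l4proto *p)
{
	int ret = 0;

	if (net == &init_net)
		ret = l4proto_register_net(p);
	return ret;
}
```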


* [PATCH V2] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
From: kun.jiang @ 2012-05-24  1:39 UTC (permalink / raw)
  To: netdev, LKML, YMZHANG, davem, eric.dumazet

From: Yanmin Zhang <yanmin_zhang@linux.intel.com>

We hit a kernel OOPS.

<3>[23898.789643] BUG: sleeping function called from invalid context at /data/buildbot/workdir/ics/hardware/intel/linux-2.6/arch/x86/mm/fault.c:1103
<3>[23898.862215] in_atomic(): 0, irqs_disabled(): 0, pid: 10526, name: Thread-6683
<4>[23898.967805] HSU serial 0000:00:05.1: 0000:00:05.2:HSU serial prevented me to suspend...
<4>[23899.258526] Pid: 10526, comm: Thread-6683 Tainted: G        W 3.0.8-137685-ge7742f9 #1
<4>[23899.357404] HSU serial 0000:00:05.1: 0000:00:05.2:HSU serial prevented me to suspend...
<4>[23899.904225] Call Trace:
<4>[23899.989209]  [<c1227f50>] ? pgtable_bad+0x130/0x130
<4>[23900.000416]  [<c1238c2a>] __might_sleep+0x10a/0x110
<4>[23900.007357]  [<c1228021>] do_page_fault+0xd1/0x3c0
<4>[23900.013764]  [<c18e9ba9>] ? restore_all+0xf/0xf
<4>[23900.024024]  [<c17c007b>] ? napi_complete+0x8b/0x690
<4>[23900.029297]  [<c1227f50>] ? pgtable_bad+0x130/0x130
<4>[23900.123739]  [<c1227f50>] ? pgtable_bad+0x130/0x130
<4>[23900.128955]  [<c18ea0c3>] error_code+0x5f/0x64
<4>[23900.133466]  [<c1227f50>] ? pgtable_bad+0x130/0x130
<4>[23900.138450]  [<c17f6298>] ? __ip_route_output_key+0x698/0x7c0
<4>[23900.144312]  [<c17f5f8d>] ? __ip_route_output_key+0x38d/0x7c0
<4>[23900.150730]  [<c17f63df>] ip_route_output_flow+0x1f/0x60
<4>[23900.156261]  [<c181de58>] ip4_datagram_connect+0x188/0x2b0
<4>[23900.161960]  [<c18e981f>] ? _raw_spin_unlock_bh+0x1f/0x30
<4>[23900.167834]  [<c18298d6>] inet_dgram_connect+0x36/0x80
<4>[23900.173224]  [<c14f9e88>] ? _copy_from_user+0x48/0x140
<4>[23900.178817]  [<c17ab9da>] sys_connect+0x9a/0xd0
<4>[23900.183538]  [<c132e93c>] ? alloc_file+0xdc/0x240
<4>[23900.189111]  [<c123925d>] ? sub_preempt_count+0x3d/0x50

free_fib_info() resets nexthop_nh->nh_dev to NULL before releasing fi,
but another CPU might still be accessing fi at that point. Fix it by
delaying the release until the RCU callback, free_fib_info_rcu().

With the patch applied, we ran MTBF testing on an Android mobile device for
12 hours and did not trigger the issue.

Thanks to Eric for the very detailed review and for checking the issue.

Signed-off-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Signed-off-by: Kun Jiang <kunx.jiang@intel.com>
---
 net/ipv4/fib_semantics.c |   12 ++++++------
 1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 5063fa3..8861f91 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -145,6 +145,12 @@ static void free_fib_info_rcu(struct rcu_head *head)
 {
 	struct fib_info *fi = container_of(head, struct fib_info, rcu);
 
+	change_nexthops(fi) {
+		if (nexthop_nh->nh_dev)
+			dev_put(nexthop_nh->nh_dev);
+	} endfor_nexthops(fi);
+
+	release_net(fi->fib_net);
 	if (fi->fib_metrics != (u32 *) dst_default_metrics)
 		kfree(fi->fib_metrics);
 	kfree(fi);
@@ -156,13 +162,7 @@ void free_fib_info(struct fib_info *fi)
 		pr_warn("Freeing alive fib_info %p\n", fi);
 		return;
 	}
-	change_nexthops(fi) {
-		if (nexthop_nh->nh_dev)
-			dev_put(nexthop_nh->nh_dev);
-		nexthop_nh->nh_dev = NULL;
-	} endfor_nexthops(fi);
 	fib_info_cnt--;
-	release_net(fi->fib_net);
 	call_rcu(&fi->rcu, free_fib_info_rcu);
 }
 
-- 
1.7.1
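
The ordering change in the patch above can be modelled in plain userspace
C. This is a single-threaded sketch and not real RCU: the names (struct
info, free_info_early, free_info_deferred, grace_period_end) are
hypothetical, and the "grace period" is just an explicit function call. It
only shows why a reader that still holds fi must not find nh_dev already
cleared.

```c
#include <stddef.h>

struct dev { int refcnt; };

struct info {
	struct dev *nh_dev;
	void (*deferred)(struct info *);	/* models the RCU callback */
};

/* Models a reader that still holds a pointer to fi. */
static int reader_sees_dev(const struct info *fi)
{
	return fi->nh_dev != NULL;
}

/* Drops the device reference; models dev_put(). */
static void put_dev(struct info *fi)
{
	if (fi->nh_dev)
		fi->nh_dev->refcnt--;
}

/* Old ordering: drop and clear the device right away. */
static void free_info_early(struct info *fi)
{
	put_dev(fi);
	fi->nh_dev = NULL;
	fi->deferred = NULL;
}

/* New ordering: leave fi untouched and defer the drop. */
static void free_info_deferred(struct info *fi)
{
	fi->deferred = put_dev;
}

/* Models the end of the grace period: run the deferred work. */
static void grace_period_end(struct info *fi)
{
	if (fi->deferred)
		fi->deferred(fi);
}
```

With free_info_early() a reader observes a NULL nh_dev immediately, while
with free_info_deferred() the device stays visible and referenced until
grace_period_end() runs.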

* Re: NETDEV WATCHDOG: %s (%s): transmit queue %u timed out
From: George Spelvin @ 2012-05-24  1:39 UTC (permalink / raw)
  To: linux, romieu; +Cc: davej, kernel-team, netdev
In-Reply-To: <20120523223210.GA20536@electric-eye.fr.zoreil.com>

> You may try the attached patches on top of current -git. A complete dmesg
> will be welcome. So will an 'ethtool -d eth0' if the device stops working.
>
> You did not label the problem as a serious one. Does that mean that the
> driver automatically recovers?

It does indeed recover fine.  (Of course, there may be lost packets or
memory leaks.) The machine is the main office file server, which also
serves as an outside network firewall, so anything serious would be
noted VERY quickly.

> I'll add some ring and registers debug stuff tomorrow.

The main thing is I don't like to reboot it very often; it's not
a dev machine.  But I can, as long as it's outside normal work hours.

So thank you; I'll give your patch a try.


Here's a complete dmesg for the current boot, which includes the message being
complained about.  If there's some subset that's of interest for future test
reports, tell me and I won't spam the mailing list as much.

Initializing cgroup subsys cpu
Linux version 3.4.0-00017-g3df9c78 ($USER@$HOST) (gcc version 4.7.0 (Debian 4.7.0-8) ) #152 SMP Mon May 21 22:46:39 UTC 2012
Command line: auto BOOT_IMAGE=Amd64 ro root=905 libata.fua=1 acpi_enforce_resources=lax k10temp.force=1
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 0000000000093800 (usable)
 BIOS-e820: 0000000000093800 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000e4000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 00000000dffa0000 (usable)
 BIOS-e820: 00000000dffc0000 - 00000000dffce000 (ACPI data)
 BIOS-e820: 00000000dffce000 - 00000000dfff0000 (ACPI NVS)
 BIOS-e820: 00000000dfff0000 - 00000000dfffe000 (reserved)
 BIOS-e820: 00000000fff00000 - 0000000100000000 (reserved)
 BIOS-e820: 0000000100000000 - 0000000220000000 (usable)
NX (Execute Disable) protection: active
DMI present.
DMI: MICRO-STAR INTERNATIONAL CO.,LTD MS-7376/MS-7376, BIOS V1.7 01/13/2009
e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)
e820 remove range: 00000000000a0000 - 0000000000100000 (usable)
No AGP bridge found
last_pfn = 0x220000 max_arch_pfn = 0x400000000
MTRR default type: uncachable
MTRR fixed ranges enabled:
  00000-9FFFF write-back
  A0000-EFFFF uncachable
  F0000-FFFFF write-protect
MTRR variable ranges enabled:
  0 base 000000000000 mask FFFF80000000 write-back
  1 base 000080000000 mask FFFFC0000000 write-back
  2 base 0000C0000000 mask FFFFE0000000 write-back
  3 disabled
  4 disabled
  5 disabled
  6 disabled
  7 disabled
TOM2: 0000000220000000 aka 8704M
x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
e820 update range: 00000000e0000000 - 0000000100000000 (usable) ==> (reserved)
last_pfn = 0xdffa0 max_arch_pfn = 0x400000000
found SMP MP-table at [ffff8800000ff780] ff780
initial memory mapped : 0 - 20000000
Base memory trampoline at [ffff880000091000] 91000 size 8192
Using GB pages for direct mapping
init_memory_mapping: 0000000000000000-00000000dffa0000
 0000000000 - 00c0000000 page 1G
 00c0000000 - 00dfe00000 page 2M
 00dfe00000 - 00dffa0000 page 4k
kernel direct mapping tables up to dffa0000 @ 1fffd000-20000000
init_memory_mapping: 0000000100000000-0000000220000000
 0100000000 - 0200000000 page 1G
 0200000000 - 0220000000 page 2M
kernel direct mapping tables up to 220000000 @ dff9e000-dffa0000
ACPI: RSDP 00000000000f9e30 00014 (v00 ACPIAM)
ACPI: RSDT 00000000dffc0000 00038 (v01 011309 RSDT1044 20090113 MSFT 00000097)
ACPI: FACP 00000000dffc0200 00084 (v02 011309 FACP1044 20090113 MSFT 00000097)
ACPI: DSDT 00000000dffc0440 06E9B (v01  1ADNC 1ADNC001 00000001 INTL 20051117)
ACPI: FACS 00000000dffce000 00040
ACPI: APIC 00000000dffc0390 0006C (v01 011309 APIC1044 20090113 MSFT 00000097)
ACPI: MCFG 00000000dffc0400 0003C (v01 011309 OEMMCFG  20090113 MSFT 00000097)
ACPI: OEMB 00000000dffce040 00071 (v01 011309 OEMB1044 20090113 MSFT 00000097)
ACPI: HPET 00000000dffc72e0 00038 (v01 011309 OEMHPET  20090113 MSFT 00000097)
ACPI: Local APIC address 0xfee00000
 [ffffea0000000000-ffffea00087fffff] PMD -> [ffff880217600000-ffff88021f5fffff] on node 0
Zone PFN ranges:
  DMA      0x00000010 -> 0x00001000
  DMA32    0x00001000 -> 0x00100000
  Normal   0x00100000 -> 0x00220000
Movable zone start PFN for each node
Early memory PFN ranges
    0: 0x00000010 -> 0x00000093
    0: 0x00000100 -> 0x000dffa0
    0: 0x00100000 -> 0x00220000
On node 0 totalpages: 2096931
  DMA zone: 64 pages used for memmap
  DMA zone: 2 pages reserved
  DMA zone: 3905 pages, LIFO batch:0
  DMA32 zone: 16320 pages used for memmap
  DMA32 zone: 896992 pages, LIFO batch:31
  Normal zone: 18432 pages used for memmap
  Normal zone: 1161216 pages, LIFO batch:31
ACPI: PM-Timer IO Port: 0x808
ACPI: Local APIC address 0xfee00000
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x02] enabled)
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x03] enabled)
ACPI: IOAPIC (id[0x04] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 4, version 33, address 0xfec00000, GSI 0-23
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 low level)
ACPI: IRQ0 used by override.
ACPI: IRQ2 used by override.
ACPI: IRQ9 used by override.
Using ACPI (MADT) for SMP configuration information
ACPI: HPET id: 0x8300 base: 0xfed00000
SMP: Allowing 4 CPUs, 0 hotplug CPUs
nr_irqs_gsi: 40
Allocating PCI resources starting at dfffe000 (gap: dfffe000:1ff02000)
setup_percpu: NR_CPUS:4 nr_cpumask_bits:4 nr_cpu_ids:4 nr_node_ids:1
PERCPU: Embedded 23 pages/cpu @ffff88021fc00000 s72704 r0 d21504 u524288
pcpu-alloc: s72704 r0 d21504 u524288 alloc=1*2097152
pcpu-alloc: [0] 0 1 2 3 
Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 2062113
Kernel command line: auto BOOT_IMAGE=Amd64 ro root=905 libata.fua=1 acpi_enforce_resources=lax k10temp.force=1
PID hash table entries: 4096 (order: 3, 32768 bytes)
Dentry cache hash table entries: 1048576 (order: 11, 8388608 bytes)
Inode-cache hash table entries: 524288 (order: 10, 4194304 bytes)
Checking aperture...
No AGP bridge found
Node 0: aperture @ d4000000 size 32 MB
Aperture pointing to e820 RAM. Ignoring.
Your BIOS doesn't leave a aperture memory hole
Please enable the IOMMU option in the BIOS setup
This costs you 64 MB of RAM
Mapping aperture over 65536 KB of RAM @ d4000000
Memory: 8105120k/8912896k available (4123k kernel code, 525172k absent, 282604k reserved, 1959k data, 476k init)
SLUB: Genslabs=15, HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
Hierarchical RCU implementation.
NR_IRQS:4352 nr_irqs:712 16
Console: colour VGA+ 80x50
console [tty0] enabled
hpet clockevent registered
Fast TSC calibration failed
TSC: Unable to calibrate against PIT
TSC: using HPET reference calibration
Detected 2500.119 MHz processor.
Calibrating delay loop (skipped), value calculated using timer frequency.. 5000.23 BogoMIPS (lpj=25001190)
pid_max: default: 32768 minimum: 301
Mount-cache hash table entries: 256
tseg: 0000000000
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 0
mce: CPU supports 6 MCE banks
LVT offset 0 assigned for vector 0xf9
using AMD E400 aware idle routine
Freeing SMP alternatives: 12k freed
ACPI: Core revision 20120320
..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
CPU0: AMD Phenom(tm) 9850 Quad-Core Processor stepping 03
Performance Events: AMD PMU driver.
... version:                0
... bit width:              48
... generic registers:      4
... value mask:             0000ffffffffffff
... max period:             00007fffffffffff
... fixed-purpose events:   0
... event mask:             000000000000000f
MCE: In-kernel MCE decoding enabled.
NMI watchdog: enabled, takes one hw-pmu counter.
Booting Node   0, Processors  #1
NMI watchdog: enabled, takes one hw-pmu counter.
System has AMD C1E enabled
Switch to broadcast mode on CPU1
 #2
NMI watchdog: enabled, takes one hw-pmu counter.
Switch to broadcast mode on CPU2
 #3 Ok.
NMI watchdog: enabled, takes one hw-pmu counter.
Switch to broadcast mode on CPU3
Brought up 4 CPUs
Total of 4 processors activated (20000.95 BogoMIPS).
Switch to broadcast mode on CPU0
xor: automatically using best checksumming function: generic_sse
   generic_sse:  9959.200 MB/sec
xor: using function: generic_sse (9959.200 MB/sec)
NET: Registered protocol family 16
node 0 link 0: io port [1000, ffffff]
TOM: 00000000e0000000 aka 3584M
Fam 10h mmconf [mem 0xe0000000-0xefffffff]
node 0 link 0: mmio [e0000000, efffffff] ==> none
node 0 link 0: mmio [f0000000, ffffffff]
node 0 link 0: mmio [a0000, bffff]
node 0 link 0: mmio [e0000000, dfffffff] ==> none
TOM2: 0000000220000000 aka 8704M
bus: [00, 07] on node 0 link 0
bus: 00 index 0 [io  0x0000-0xffff]
bus: 00 index 1 [mem 0xf0000000-0xffffffff]
bus: 00 index 2 [mem 0x000a0000-0x000bffff]
bus: 00 index 3 [mem 0x220000000-0xfcffffffff]
ACPI: bus type pci registered
PCI: MMCONFIG for domain 0000 [bus 00-ff] at [mem 0xe0000000-0xefffffff] (base 0xe0000000)
PCI: not using MMCONFIG
PCI: Using configuration type 1 for base access
PCI: Using configuration type 1 for extended access
bio: create slab <bio-0> at 0
raid6: int64x1   2430 MB/s
raid6: int64x2   2128 MB/s
raid6: int64x4   2234 MB/s
raid6: int64x8   1473 MB/s
raid6: sse2x1    3432 MB/s
raid6: sse2x2    5661 MB/s
raid6: sse2x4    6710 MB/s
raid6: using algorithm sse2x4 (6710 MB/s)
ACPI: Added _OSI(Module Device)
ACPI: Added _OSI(Processor Device)
ACPI: Added _OSI(3.0 _SCP Extensions)
ACPI: Added _OSI(Processor Aggregator Device)
ACPI: EC: Detected MSI hardware, enabling workarounds.
ACPI: EC: Look up EC in DSDT
ACPI: Executed 3 blocks of module-level executable AML code
ACPI: Interpreter enabled
ACPI: (supports S0 S5)
ACPI: Using IOAPIC for interrupt routing
PCI: MMCONFIG for domain 0000 [bus 00-ff] at [mem 0xe0000000-0xefffffff] (base 0xe0000000)
PCI: MMCONFIG at [mem 0xe0000000-0xefffffff] reserved in ACPI motherboard resources
ACPI: No dock devices found.
PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug
ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-ff])
pci_root PNP0A03:00: host bridge window [io  0x0000-0x0cf7]
pci_root PNP0A03:00: host bridge window [io  0x0d00-0xffff]
pci_root PNP0A03:00: host bridge window [mem 0x000a0000-0x000bffff]
pci_root PNP0A03:00: host bridge window [mem 0x000d0000-0x000dffff]
pci_root PNP0A03:00: host bridge window [mem 0xf0000000-0xfebfffff]
PCI host bridge to bus 0000:00
pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7]
pci_bus 0000:00: root bus resource [io  0x0d00-0xffff]
pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff]
pci_bus 0000:00: root bus resource [mem 0x000d0000-0x000dffff]
pci_bus 0000:00: root bus resource [mem 0xf0000000-0xfebfffff]
pci 0000:00:00.0: [1002:5956] type 00 class 0x060000
pci 0000:00:00.0: reg 1c: [mem 0xe0000000-0xffffffff 64bit]
pci 0000:00:02.0: [1002:5978] type 01 class 0x060400
pci 0000:00:02.0: PME# supported from D0 D3hot D3cold
pci 0000:00:05.0: [1002:597b] type 01 class 0x060400
pci 0000:00:05.0: PME# supported from D0 D3hot D3cold
pci 0000:00:09.0: [1002:597e] type 01 class 0x060400
pci 0000:00:09.0: PME# supported from D0 D3hot D3cold
pci 0000:00:12.0: [1002:4380] type 00 class 0x010601
pci 0000:00:12.0: reg 10: [io  0x7000-0x7007]
pci 0000:00:12.0: reg 14: [io  0x6000-0x6003]
pci 0000:00:12.0: reg 18: [io  0x5000-0x5007]
pci 0000:00:12.0: reg 1c: [io  0x4000-0x4003]
pci 0000:00:12.0: reg 20: [io  0x3000-0x300f]
pci 0000:00:12.0: reg 24: [mem 0xfe5ff800-0xfe5ffbff]
pci 0000:00:13.0: [1002:4387] type 00 class 0x0c0310
pci 0000:00:13.0: reg 10: [mem 0xfe5fe000-0xfe5fefff]
pci 0000:00:13.1: [1002:4388] type 00 class 0x0c0310
pci 0000:00:13.1: reg 10: [mem 0xfe5fd000-0xfe5fdfff]
pci 0000:00:13.2: [1002:4389] type 00 class 0x0c0310
pci 0000:00:13.2: reg 10: [mem 0xfe5fc000-0xfe5fcfff]
pci 0000:00:13.3: [1002:438a] type 00 class 0x0c0310
pci 0000:00:13.3: reg 10: [mem 0xfe5fb000-0xfe5fbfff]
pci 0000:00:13.4: [1002:438b] type 00 class 0x0c0310
pci 0000:00:13.4: reg 10: [mem 0xfe5fa000-0xfe5fafff]
pci 0000:00:13.5: [1002:4386] type 00 class 0x0c0320
pci 0000:00:13.5: reg 10: [mem 0xfe5ff000-0xfe5ff0ff]
pci 0000:00:13.5: supports D1 D2
pci 0000:00:13.5: PME# supported from D0 D1 D2 D3hot
pci 0000:00:14.0: [1002:4385] type 00 class 0x0c0500
pci 0000:00:14.0: reg 10: [io  0x0b00-0x0b0f]
pci 0000:00:14.1: [1002:438c] type 00 class 0x01018a
pci 0000:00:14.1: reg 10: [io  0x0000-0x0007]
pci 0000:00:14.1: reg 14: [io  0x0000-0x0003]
pci 0000:00:14.1: reg 18: [io  0x0000-0x0007]
pci 0000:00:14.1: reg 1c: [io  0x0000-0x0003]
pci 0000:00:14.1: reg 20: [io  0xff00-0xff0f]
pci 0000:00:14.2: [1002:4383] type 00 class 0x040300
pci 0000:00:14.2: reg 10: [mem 0xfe5f4000-0xfe5f7fff 64bit]
pci 0000:00:14.2: PME# supported from D0 D3hot D3cold
pci 0000:00:14.3: [1002:438d] type 00 class 0x060100
pci 0000:00:14.4: [1002:4384] type 01 class 0x060401
pci 0000:00:18.0: [1022:1200] type 00 class 0x060000
pci 0000:00:18.1: [1022:1201] type 00 class 0x060000
pci 0000:00:18.2: [1022:1202] type 00 class 0x060000
pci 0000:00:18.3: [1022:1203] type 00 class 0x060000
pci 0000:00:18.4: [1022:1204] type 00 class 0x060000
pci 0000:01:00.0: [1002:5b60] type 00 class 0x030000
pci 0000:01:00.0: reg 10: [mem 0xfc000000-0xfdffffff 64bit pref]
pci 0000:01:00.0: reg 18: [mem 0xfe6f0000-0xfe6fffff 64bit]
pci 0000:01:00.0: reg 20: [io  0x8000-0x80ff]
pci 0000:01:00.0: reg 30: [mem 0xfe6c0000-0xfe6dffff pref]
pci 0000:01:00.0: supports D1 D2
pci 0000:01:00.1: [1002:5b70] type 00 class 0x038000
pci 0000:01:00.1: reg 10: [mem 0xfe6e0000-0xfe6effff 64bit]
pci 0000:01:00.1: supports D1 D2
pci 0000:00:02.0: PCI bridge to [bus 01-01]
pci 0000:00:02.0:   bridge window [io  0x8000-0x8fff]
pci 0000:00:02.0:   bridge window [mem 0xfe600000-0xfe6fffff]
pci 0000:00:02.0:   bridge window [mem 0xfc000000-0xfdffffff 64bit pref]
pci 0000:02:00.0: [10ec:8168] type 00 class 0x020000
pci 0000:02:00.0: reg 10: [io  0x9800-0x98ff]
pci 0000:02:00.0: reg 18: [mem 0xfe7ff000-0xfe7fffff 64bit]
pci 0000:02:00.0: reg 30: [mem 0xfe7c0000-0xfe7dffff pref]
pci 0000:02:00.0: supports D1 D2
pci 0000:02:00.0: PME# supported from D1 D2 D3hot D3cold
pci 0000:00:05.0: PCI bridge to [bus 02-02]
pci 0000:00:05.0:   bridge window [io  0x9000-0x9fff]
pci 0000:00:05.0:   bridge window [mem 0xfe700000-0xfe7fffff]
pci 0000:03:00.0: [105a:3f20] type 00 class 0x010400
pci 0000:03:00.0: reg 10: [io  0xa800-0xa87f]
pci 0000:03:00.0: reg 18: [io  0xa400-0xa4ff]
pci 0000:03:00.0: reg 1c: [mem 0xfe8ff000-0xfe8fffff]
pci 0000:03:00.0: reg 20: [mem 0xfe8c0000-0xfe8dffff]
pci 0000:03:00.0: reg 24: [mem 0xfe8fc000-0xfe8fdfff]
pci 0000:03:00.0: supports D1
pci 0000:00:09.0: PCI bridge to [bus 03-03]
pci 0000:00:09.0:   bridge window [io  0xa000-0xafff]
pci 0000:00:09.0:   bridge window [mem 0xfe800000-0xfe8fffff]
pci 0000:04:00.0: [1106:3044] type 00 class 0x0c0010
pci 0000:04:00.0: reg 10: [mem 0xfe9ff800-0xfe9fffff]
pci 0000:04:00.0: reg 14: [io  0xc800-0xc87f]
pci 0000:04:00.0: supports D2
pci 0000:04:00.0: PME# supported from D2 D3hot D3cold
pci 0000:04:02.0: [1415:9501] type 00 class 0x070006
pci 0000:04:02.0: reg 10: [io  0xc400-0xc41f]
pci 0000:04:02.0: reg 14: [mem 0xfe9fe000-0xfe9fefff]
pci 0000:04:02.0: reg 18: [io  0xc000-0xc01f]
pci 0000:04:02.0: reg 1c: [mem 0xfe9fd000-0xfe9fdfff]
pci 0000:04:02.0: supports D2
pci 0000:04:02.0: PME# supported from D0 D2 D3hot
pci 0000:04:02.1: [1415:9513] type 00 class 0x070101
pci 0000:04:02.1: reg 10: [io  0xb800-0xb807]
pci 0000:04:02.1: reg 14: [io  0xb400-0xb407]
pci 0000:04:02.1: reg 18: [io  0xb000-0xb01f]
pci 0000:04:02.1: reg 1c: [mem 0xfe9fc000-0xfe9fcfff]
pci 0000:04:02.1: supports D2
pci 0000:04:02.1: PME# supported from D0 D2 D3hot
pci 0000:04:03.0: [1011:0024] type 01 class 0x060400
pci 0000:00:14.4: PCI bridge to [bus 04-05] (subtractive decode)
pci 0000:00:14.4:   bridge window [io  0xb000-0xefff]
pci 0000:00:14.4:   bridge window [mem 0xfe900000-0xfebfffff]
pci 0000:00:14.4:   bridge window [io  0x0000-0x0cf7] (subtractive decode)
pci 0000:00:14.4:   bridge window [io  0x0d00-0xffff] (subtractive decode)
pci 0000:00:14.4:   bridge window [mem 0x000a0000-0x000bffff] (subtractive decode)
pci 0000:00:14.4:   bridge window [mem 0x000d0000-0x000dffff] (subtractive decode)
pci 0000:00:14.4:   bridge window [mem 0xf0000000-0xfebfffff] (subtractive decode)
pci 0000:05:04.0: [1011:0019] type 00 class 0x020000
pci 0000:05:04.0: reg 10: [io  0xe800-0xe87f]
pci 0000:05:04.0: reg 14: [mem 0xfebffc00-0xfebfffff]
pci 0000:05:04.0: reg 30: [mem 0xfeb80000-0xfebbffff pref]
pci 0000:05:05.0: [1011:0019] type 00 class 0x020000
pci 0000:05:05.0: reg 10: [io  0xe400-0xe47f]
pci 0000:05:05.0: reg 14: [mem 0xfebff800-0xfebffbff]
pci 0000:05:05.0: reg 30: [mem 0xfeb40000-0xfeb7ffff pref]
pci 0000:05:06.0: [1011:0019] type 00 class 0x020000
pci 0000:05:06.0: reg 10: [io  0xe000-0xe07f]
pci 0000:05:06.0: reg 14: [mem 0xfebff400-0xfebff7ff]
pci 0000:05:06.0: reg 30: [mem 0xfeb00000-0xfeb3ffff pref]
pci 0000:05:07.0: [1011:0019] type 00 class 0x020000
pci 0000:05:07.0: reg 10: [io  0xd800-0xd87f]
pci 0000:05:07.0: reg 14: [mem 0xfebff000-0xfebff3ff]
pci 0000:05:07.0: reg 30: [mem 0xfeac0000-0xfeafffff pref]
pci 0000:04:03.0: PCI bridge to [bus 05-05]
pci 0000:04:03.0:   bridge window [io  0xd000-0xefff]
pci 0000:04:03.0:   bridge window [mem 0xfea00000-0xfebfffff]
pci_bus 0000:00: on NUMA node 0
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PCE2._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PCE5._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PCE9._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.P0PC._PRT]
 pci0000:00: Unable to request _OSC control (_OSC support mask: 0x19)
ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 5 7 10 11 12 14 *15)
ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 5 7 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 5 7 *10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 7 *10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKE] (IRQs 3 4 5 *7 10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKF] (IRQs *9)
ACPI: PCI Interrupt Link [LNKG] (IRQs 3 4 *5 7 10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKH] (IRQs *3 4 5 7 10 11 12 14 15)
vgaarb: device added: PCI:0000:01:00.0,decodes=io+mem,owns=io+mem,locks=none
vgaarb: loaded
vgaarb: bridge control possible 0000:01:00.0
SCSI subsystem initialized
libata version 3.00 loaded.
usbcore: registered new interface driver usbfs
usbcore: registered new interface driver hub
usbcore: registered new device driver usb
pps_core: LinuxPPS API ver. 1 registered
pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti <giometti@linux.it>
PTP clock support registered
Advanced Linux Sound Architecture Driver Version 1.0.25.
PCI: Using ACPI for IRQ routing
PCI: pci_cache_line_size set to 64 bytes
pci 0000:00:00.0: no compatible bridge window for [mem 0xe0000000-0xffffffff 64bit]
reserve RAM buffer: 0000000000093800 - 000000000009ffff 
reserve RAM buffer: 00000000dffa0000 - 00000000dfffffff 
hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0, 0
hpet0: 4 comparators, 32-bit 14.318180 MHz counter
Switching to clocksource hpet
pnp: PnP ACPI init
ACPI: bus type pnp registered
pnp 00:00: [bus 00-ff]
pnp 00:00: [io  0x0cf8-0x0cff]
pnp 00:00: [io  0x0000-0x0cf7 window]
pnp 00:00: [io  0x0d00-0xffff window]
pnp 00:00: [mem 0x000a0000-0x000bffff window]
pnp 00:00: [mem 0x000d0000-0x000dffff window]
pnp 00:00: [mem 0xe0000000-0xdfffffff window disabled]
pnp 00:00: [mem 0xf0000000-0xfebfffff window]
pnp 00:00: Plug and Play ACPI device, IDs PNP0a03 (active)
pnp 00:01: [dma 4]
pnp 00:01: [io  0x0000-0x000f]
pnp 00:01: [io  0x0081-0x0083]
pnp 00:01: [io  0x0087]
pnp 00:01: [io  0x0089-0x008b]
pnp 00:01: [io  0x008f]
pnp 00:01: [io  0x00c0-0x00df]
pnp 00:01: Plug and Play ACPI device, IDs PNP0200 (active)
pnp 00:02: [io  0x0070-0x0071]
pnp 00:02: [irq 8]
pnp 00:02: Plug and Play ACPI device, IDs PNP0b00 (active)
pnp 00:03: [io  0x0061]
pnp 00:03: Plug and Play ACPI device, IDs PNP0800 (active)
pnp 00:04: [io  0x00f0-0x00ff]
pnp 00:04: [irq 13]
pnp 00:04: Plug and Play ACPI device, IDs PNP0c04 (active)
pnp 00:05: [io  0x03f8-0x03ff]
pnp 00:05: [irq 4]
pnp 00:05: [dma 0 disabled]
pnp 00:05: Plug and Play ACPI device, IDs PNP0501 (active)
pnp 00:06: [io  0x03f0-0x03f5]
pnp 00:06: [io  0x03f7]
pnp 00:06: [irq 6]
pnp 00:06: [dma 2]
pnp 00:06: Plug and Play ACPI device, IDs PNP0700 (active)
pnp 00:07: [mem 0xfed00000-0xfed003ff]
pnp 00:07: Plug and Play ACPI device, IDs PNP0103 (active)
pnp 00:08: [mem 0xfec00000-0xfec00fff]
pnp 00:08: [mem 0xfee00000-0xfee00fff]
system 00:08: [mem 0xfec00000-0xfec00fff] could not be reserved
system 00:08: [mem 0xfee00000-0xfee00fff] has been reserved
system 00:08: Plug and Play ACPI device, IDs PNP0c02 (active)
pnp 00:09: [io  0x0010-0x001f]
pnp 00:09: [io  0x0022-0x003f]
pnp 00:09: [io  0x0062-0x0063]
pnp 00:09: [io  0x0065-0x006f]
pnp 00:09: [io  0x0072-0x007f]
pnp 00:09: [io  0x0080]
pnp 00:09: [io  0x0084-0x0086]
pnp 00:09: [io  0x0088]
pnp 00:09: [io  0x008c-0x008e]
pnp 00:09: [io  0x0090-0x009f]
pnp 00:09: [io  0x00a2-0x00bf]
pnp 00:09: [io  0x00b1]
pnp 00:09: [io  0x00e0-0x00ef]
pnp 00:09: [io  0x04d0-0x04d1]
pnp 00:09: [io  0x040b]
pnp 00:09: [io  0x04d6]
pnp 00:09: [io  0x0c00-0x0c01]
pnp 00:09: [io  0x0c14]
pnp 00:09: [io  0x0c50-0x0c51]
pnp 00:09: [io  0x0c52]
pnp 00:09: [io  0x0c6c]
pnp 00:09: [io  0x0c6f]
pnp 00:09: [io  0x0cd0-0x0cd1]
pnp 00:09: [io  0x0cd2-0x0cd3]
pnp 00:09: [io  0x0cd4-0x0cd5]
pnp 00:09: [io  0x0cd6-0x0cd7]
pnp 00:09: [io  0x0cd8-0x0cdf]
pnp 00:09: [io  0x0800-0x089f]
pnp 00:09: [io  0x0b10-0x0b1f]
pnp 00:09: [io  0x0000-0xffffffffffffffff disabled]
pnp 00:09: [io  0x0900-0x090f]
pnp 00:09: [io  0x0910-0x091f]
pnp 00:09: [io  0xfe00-0xfefe]
pnp 00:09: [mem 0xffb80000-0xffbfffff]
system 00:09: [io  0x04d0-0x04d1] has been reserved
system 00:09: [io  0x040b] has been reserved
system 00:09: [io  0x04d6] has been reserved
system 00:09: [io  0x0c00-0x0c01] has been reserved
system 00:09: [io  0x0c14] has been reserved
system 00:09: [io  0x0c50-0x0c51] has been reserved
system 00:09: [io  0x0c52] has been reserved
system 00:09: [io  0x0c6c] has been reserved
system 00:09: [io  0x0c6f] has been reserved
system 00:09: [io  0x0cd0-0x0cd1] has been reserved
system 00:09: [io  0x0cd2-0x0cd3] has been reserved
system 00:09: [io  0x0cd4-0x0cd5] has been reserved
system 00:09: [io  0x0cd6-0x0cd7] has been reserved
system 00:09: [io  0x0cd8-0x0cdf] has been reserved
system 00:09: [io  0x0800-0x089f] has been reserved
system 00:09: [io  0x0b10-0x0b1f] has been reserved
system 00:09: [io  0x0900-0x090f] has been reserved
system 00:09: [io  0x0910-0x091f] has been reserved
system 00:09: [io  0xfe00-0xfefe] has been reserved
system 00:09: [mem 0xffb80000-0xffbfffff] has been reserved
system 00:09: Plug and Play ACPI device, IDs PNP0c02 (active)
pnp 00:0a: [io  0x0060]
pnp 00:0a: [io  0x0064]
pnp 00:0a: [irq 1]
pnp 00:0a: Plug and Play ACPI device, IDs PNP0303 PNP030b (active)
pnp 00:0b: [irq 12]
pnp 00:0b: Plug and Play ACPI device, IDs PNP0f03 PNP0f13 (active)
pnp 00:0c: [io  0x0000-0xffffffffffffffff disabled]
pnp 00:0c: [io  0x0600-0x06df]
pnp 00:0c: [io  0x0ae0-0x0aef]
system 00:0c: [io  0x0600-0x06df] has been reserved
system 00:0c: [io  0x0ae0-0x0aef] has been reserved
system 00:0c: Plug and Play ACPI device, IDs PNP0c02 (active)
pnp 00:0d: [mem 0xe0000000-0xefffffff]
system 00:0d: [mem 0xe0000000-0xefffffff] has been reserved
system 00:0d: Plug and Play ACPI device, IDs PNP0c02 (active)
pnp 00:0e: [mem 0x00000000-0x0009ffff]
pnp 00:0e: [mem 0x000c0000-0x000cffff]
pnp 00:0e: [mem 0x000e0000-0x000fffff]
pnp 00:0e: [mem 0x00100000-0xdfffffff]
pnp 00:0e: [mem 0xfec00000-0xffffffff]
pnp 00:0e: disabling [mem 0x00000000-0x0009ffff] because it overlaps 0000:00:00.0 BAR 3 [mem 0x00000000-0x1fffffff 64bit]
pnp 00:0e: disabling [mem 0x000c0000-0x000cffff] because it overlaps 0000:00:00.0 BAR 3 [mem 0x00000000-0x1fffffff 64bit]
pnp 00:0e: disabling [mem 0x000e0000-0x000fffff] because it overlaps 0000:00:00.0 BAR 3 [mem 0x00000000-0x1fffffff 64bit]
pnp 00:0e: disabling [mem 0x00100000-0xdfffffff] because it overlaps 0000:00:00.0 BAR 3 [mem 0x00000000-0x1fffffff 64bit]
system 00:0e: [mem 0xfec00000-0xffffffff] could not be reserved
system 00:0e: Plug and Play ACPI device, IDs PNP0c01 (active)
pnp: PnP ACPI: found 15 devices
ACPI: ACPI bus type pnp unregistered
pci 0000:00:02.0: PCI bridge to [bus 01-01]
pci 0000:00:02.0:   bridge window [io  0x8000-0x8fff]
pci 0000:00:02.0:   bridge window [mem 0xfe600000-0xfe6fffff]
pci 0000:00:02.0:   bridge window [mem 0xfc000000-0xfdffffff 64bit pref]
pci 0000:00:05.0: PCI bridge to [bus 02-02]
pci 0000:00:05.0:   bridge window [io  0x9000-0x9fff]
pci 0000:00:05.0:   bridge window [mem 0xfe700000-0xfe7fffff]
pci 0000:00:09.0: PCI bridge to [bus 03-03]
pci 0000:00:09.0:   bridge window [io  0xa000-0xafff]
pci 0000:00:09.0:   bridge window [mem 0xfe800000-0xfe8fffff]
pci 0000:04:03.0: PCI bridge to [bus 05-05]
pci 0000:04:03.0:   bridge window [io  0xd000-0xefff]
pci 0000:04:03.0:   bridge window [mem 0xfea00000-0xfebfffff]
pci 0000:00:14.4: PCI bridge to [bus 04-05]
pci 0000:00:14.4:   bridge window [io  0xb000-0xefff]
pci 0000:00:14.4:   bridge window [mem 0xfe900000-0xfebfffff]
pci_bus 0000:00: resource 4 [io  0x0000-0x0cf7]
pci_bus 0000:00: resource 5 [io  0x0d00-0xffff]
pci_bus 0000:00: resource 6 [mem 0x000a0000-0x000bffff]
pci_bus 0000:00: resource 7 [mem 0x000d0000-0x000dffff]
pci_bus 0000:00: resource 8 [mem 0xf0000000-0xfebfffff]
pci_bus 0000:01: resource 0 [io  0x8000-0x8fff]
pci_bus 0000:01: resource 1 [mem 0xfe600000-0xfe6fffff]
pci_bus 0000:01: resource 2 [mem 0xfc000000-0xfdffffff 64bit pref]
pci_bus 0000:02: resource 0 [io  0x9000-0x9fff]
pci_bus 0000:02: resource 1 [mem 0xfe700000-0xfe7fffff]
pci_bus 0000:03: resource 0 [io  0xa000-0xafff]
pci_bus 0000:03: resource 1 [mem 0xfe800000-0xfe8fffff]
pci_bus 0000:04: resource 0 [io  0xb000-0xefff]
pci_bus 0000:04: resource 1 [mem 0xfe900000-0xfebfffff]
pci_bus 0000:04: resource 4 [io  0x0000-0x0cf7]
pci_bus 0000:04: resource 5 [io  0x0d00-0xffff]
pci_bus 0000:04: resource 6 [mem 0x000a0000-0x000bffff]
pci_bus 0000:04: resource 7 [mem 0x000d0000-0x000dffff]
pci_bus 0000:04: resource 8 [mem 0xf0000000-0xfebfffff]
pci_bus 0000:05: resource 0 [io  0xd000-0xefff]
pci_bus 0000:05: resource 1 [mem 0xfea00000-0xfebfffff]
NET: Registered protocol family 2
IP route cache hash table entries: 262144 (order: 9, 2097152 bytes)
TCP established hash table entries: 262144 (order: 10, 4194304 bytes)
TCP bind hash table entries: 65536 (order: 8, 1048576 bytes)
TCP: Hash tables configured (established 262144 bind 65536)
TCP: reno registered
UDP hash table entries: 4096 (order: 5, 131072 bytes)
UDP-Lite hash table entries: 4096 (order: 5, 131072 bytes)
NET: Registered protocol family 1
RPC: Registered named UNIX socket transport module.
RPC: Registered udp transport module.
RPC: Registered tcp transport module.
RPC: Registered tcp NFSv4.1 backchannel transport module.
pci 0000:01:00.0: Boot video device
PCI: CLS 64 bytes, default 64
PCI-DMA: Disabling AGP.
PCI-DMA: aperture base @ d4000000 size 65536 KB
PCI-DMA: using GART IOMMU.
PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture
LVT offset 1 assigned for vector 0x400
IBS: LVT offset 1 assigned
perf: AMD IBS detected (0x00000007)
Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
msgmni has been set to 15959
alg: No test for stdrng (krng)
io scheduler noop registered
io scheduler deadline registered
io scheduler cfq registered (default)
ACPI: duty_cycle spans bit 4
ACPI: processor limited to max C-state 1
Serial: 8250/16550 driver, 5 ports, IRQ sharing enabled
serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
00:05: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
ttyS4: detected caps 00000700 should be 00000500
0000:04:02.0: ttyS4 at I/O 0xc400 (irq = 22) is a 16C950/954
ttyS1: detected caps 00000700 should be 00000500
0000:04:02.0: ttyS1 at I/O 0xc408 (irq = 22) is a 16C950/954
ttyS2: detected caps 00000700 should be 00000500
0000:04:02.0: ttyS2 at I/O 0xc410 (irq = 22) is a 16C950/954
ttyS3: detected caps 00000700 should be 00000500
0000:04:02.0: ttyS3 at I/O 0xc418 (irq = 22) is a 16C950/954
lp: driver loaded but no devices found
Linux agpgart interface v0.103
PCI parallel port detected: 1415:9513, I/O at 0xb800(0x0), IRQ 23
parport0: PC-style at 0xb800, irq 23 [PCSPP]
lp0: using parport0 (interrupt-driven).
Floppy drive(s): fd0 is 1.44M
FDC 0 is a post-1991 82077
loop: module loaded
Uniform Multi-Platform E-IDE driver
atiixp 0000:00:14.1: IDE controller (0x1002:0x438c rev 0x00)
atiixp 0000:00:14.1: not 100% native mode: will probe irqs later
    ide0: BM-DMA at 0xff00-0xff07
Probing IDE interface ide0...
Refined TSC clocksource calibration: 2500.175 MHz.
Switching to clocksource tsc
hda: _NEC DVD_RW ND-3540A, ATAPI CD/DVD-ROM drive
hda: host max PIO4 wanted PIO255(auto-tune) selected PIO4
hda: UDMA/33 mode selected
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
ide_generic: please use "probe_mask=0x3f" module parameter for probing all legacy ISA IDE ports
ide-gd driver 1.18
ide-cd driver 5.00
ide-cd: hda: ATAPI 48X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache
cdrom: Uniform CD-ROM driver Revision: 3.20
ahci 0000:00:12.0: version 3.0
ahci 0000:00:12.0: MSI K9A2 Platinum: enabling 64bit DMA
ahci 0000:00:12.0: AHCI 0001.0100 32 slots 4 ports 3 Gbps 0xf impl SATA mode
ahci 0000:00:12.0: flags: 64bit ncq sntf ilck pm led clo pmp pio slum part ccc 
scsi0 : ahci
scsi1 : ahci
scsi2 : ahci
scsi3 : ahci
ata1: SATA max UDMA/133 abar m1024@0xfe5ff800 port 0xfe5ff900 irq 22
ata2: SATA max UDMA/133 abar m1024@0xfe5ff800 port 0xfe5ff980 irq 22
ata3: SATA max UDMA/133 abar m1024@0xfe5ff800 port 0xfe5ffa00 irq 22
ata4: SATA max UDMA/133 abar m1024@0xfe5ff800 port 0xfe5ffa80 irq 22
ahci 0000:03:00.0: PDC42819 can only drive SATA devices with this driver
ahci 0000:03:00.0: AHCI 0001.0100 32 slots 4 ports 3 Gbps 0xf impl RAID mode
ahci 0000:03:00.0: flags: 64bit ncq sntf ilck pm led clo pmp pio slum part ccc 
scsi4 : ahci
scsi5 : ahci
scsi6 : ahci
scsi7 : ahci
ata5: SATA max UDMA/133 abar m8192@0xfe8fc000 port 0xfe8fc100 irq 17
ata6: SATA max UDMA/133 abar m8192@0xfe8fc000 port 0xfe8fc180 irq 17
ata7: SATA max UDMA/133 abar m8192@0xfe8fc000 port 0xfe8fc200 irq 17
ata8: SATA max UDMA/133 abar m8192@0xfe8fc000 port 0xfe8fc280 irq 17
Linux Tulip driver version 1.1.15-NAPI (Feb 27, 2007)
tulip0: EEPROM default media type Autosense
tulip0: Index #0 - Media MII (#11) described by a 21142 MII PHY (3) block
tulip0:  MII transceiver #1 config 3100 status 7869 advertising 01e1
net eth0: Digital DS21142/43 Tulip rev 65 at MMIO 0xfebffc00, 00:80:c8:b9:c1:d5, IRQ 21
tulip1: EEPROM default media type Autosense
tulip1: Index #0 - Media MII (#11) described by a 21142 MII PHY (3) block
tulip1:  MII transceiver #1 config 3100 status 7869 advertising 01e1
net eth1: Digital DS21142/43 Tulip rev 65 at MMIO 0xfebff800, 00:80:c8:b9:c1:d6, IRQ 22
tulip2: EEPROM default media type Autosense
tulip2: Index #0 - Media MII (#11) described by a 21142 MII PHY (3) block
tulip2:  MII transceiver #1 config 3100 status 7849 advertising 01e1
net eth2: Digital DS21142/43 Tulip rev 65 at MMIO 0xfebff400, 00:80:c8:b9:c1:d7, IRQ 23
tulip3: EEPROM default media type Autosense
tulip3: Index #0 - Media MII (#11) described by a 21142 MII PHY (3) block
tulip3:  MII transceiver #1 config 3100 status 7869 advertising 01e1
net eth3: Digital DS21142/43 Tulip rev 65 at MMIO 0xfebff000, 00:80:c8:b9:c1:d8, IRQ 20
r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
r8169 0000:02:00.0: irq 40 for MSI/MSI-X
r8169 0000:02:00.0: eth4: RTL8168b/8111b at 0xffffc90000020000, 00:21:85:16:51:7f, XID 18000000 IRQ 40
r8169 0000:02:00.0: eth4: jumbo features [frames: 4080 bytes, tx checksumming: ko]
ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
ehci_hcd 0000:00:13.5: EHCI Host Controller
ehci_hcd 0000:00:13.5: new USB bus registered, assigned bus number 1
ehci_hcd 0000:00:13.5: applying AMD SB600/SB700 USB freeze workaround
ehci_hcd 0000:00:13.5: debug port 1
ehci_hcd 0000:00:13.5: irq 19, io mem 0xfe5ff000
ehci_hcd 0000:00:13.5: USB 2.0 started, EHCI 1.00
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 10 ports detected
ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
ohci_hcd 0000:00:13.0: OHCI Host Controller
ohci_hcd 0000:00:13.0: new USB bus registered, assigned bus number 2
ohci_hcd 0000:00:13.0: irq 16, io mem 0xfe5fe000
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 2 ports detected
ohci_hcd 0000:00:13.1: OHCI Host Controller
ohci_hcd 0000:00:13.1: new USB bus registered, assigned bus number 3
ohci_hcd 0000:00:13.1: irq 17, io mem 0xfe5fd000
hub 3-0:1.0: USB hub found
hub 3-0:1.0: 2 ports detected
ohci_hcd 0000:00:13.2: OHCI Host Controller
ohci_hcd 0000:00:13.2: new USB bus registered, assigned bus number 4
ohci_hcd 0000:00:13.2: irq 18, io mem 0xfe5fc000
hub 4-0:1.0: USB hub found
hub 4-0:1.0: 2 ports detected
ohci_hcd 0000:00:13.3: OHCI Host Controller
ohci_hcd 0000:00:13.3: new USB bus registered, assigned bus number 5
ohci_hcd 0000:00:13.3: irq 17, io mem 0xfe5fb000
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata8: SATA link down (SStatus 0 SControl 300)
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata7: SATA link down (SStatus 0 SControl 300)
ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata6.00: ATA-8: Hitachi HDS5C3020ALA632, ML6OA580, max UDMA/133
ata6.00: 3907029168 sectors, multi 16: LBA48 NCQ (depth 31/32), AA
ata5.00: ATA-8: ST3750330AS, SD1A, max UDMA/133
ata2.00: ATA-8: ST3750330AS, SD1A, max UDMA/133
ata2.00: 1465149168 sectors, multi 16: LBA48 NCQ (depth 31/32)
ata2.00: SB600 AHCI: limiting to 255 sectors per cmd
ata1.00: ATA-8: ST3750330AS, SD1A, max UDMA/133
ata1.00: 1465149168 sectors, multi 16: LBA48 NCQ (depth 31/32)
ata1.00: SB600 AHCI: limiting to 255 sectors per cmd
ata4.00: ATA-8: ST3750330AS, SD1A, max UDMA/133
ata4.00: 1465149168 sectors, multi 16: LBA48 NCQ (depth 31/32)
ata4.00: SB600 AHCI: limiting to 255 sectors per cmd
ata3.00: ATA-8: ST3750330AS, SD1A, max UDMA/133
ata3.00: 1465149168 sectors, multi 16: LBA48 NCQ (depth 31/32)
ata3.00: SB600 AHCI: limiting to 255 sectors per cmd
ata5.00: 1465149168 sectors, multi 16: LBA48 NCQ (depth 31/32)
ata6.00: configured for UDMA/133
ata2.00: SB600 AHCI: limiting to 255 sectors per cmd
ata2.00: configured for UDMA/133
ata1.00: SB600 AHCI: limiting to 255 sectors per cmd
ata1.00: configured for UDMA/133
ata4.00: SB600 AHCI: limiting to 255 sectors per cmd
ata4.00: configured for UDMA/133
ata5.00: configured for UDMA/133
ata3.00: SB600 AHCI: limiting to 255 sectors per cmd
scsi 0:0:0:0: Direct-Access     ATA      ST3750330AS      SD1A PQ: 0 ANSI: 5
ata3.00: configured for UDMA/133
sd 0:0:0:0: [sda] 1465149168 512-byte logical blocks: (750 GB/698 GiB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
scsi 1:0:0:0: Direct-Access     ATA      ST3750330AS      SD1A PQ: 0 ANSI: 5
sd 1:0:0:0: [sdb] 1465149168 512-byte logical blocks: (750 GB/698 GiB)
scsi 2:0:0:0: Direct-Access     ATA      ST3750330AS      SD1A PQ: 0 ANSI: 5
sd 1:0:0:0: [sdb] Write Protect is off
sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 2:0:0:0: [sdc] 1465149168 512-byte logical blocks: (750 GB/698 GiB)
sd 2:0:0:0: [sdc] Write Protect is off
sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
scsi 3:0:0:0: Direct-Access     ATA      ST3750330AS      SD1A PQ: 0 ANSI: 5
sd 3:0:0:0: [sdd] 1465149168 512-byte logical blocks: (750 GB/698 GiB)
sd 3:0:0:0: [sdd] Write Protect is off
sd 3:0:0:0: [sdd] Mode Sense: 00 3a 00 00
scsi 4:0:0:0: Direct-Access     ATA      ST3750330AS      SD1A PQ: 0 ANSI: 5
sd 3:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 4:0:0:0: [sde] 1465149168 512-byte logical blocks: (750 GB/698 GiB)
scsi 5:0:0:0: Direct-Access     ATA      Hitachi HDS5C302 ML6O PQ: 0 ANSI: 5
sd 4:0:0:0: [sde] Write Protect is off
sd 4:0:0:0: [sde] Mode Sense: 00 3a 00 00
sd 4:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 5:0:0:0: [sdf] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
sd 5:0:0:0: [sdf] Write Protect is off
sd 5:0:0:0: [sdf] Mode Sense: 00 3a 00 10
sd 5:0:0:0: [sdf] Write cache: enabled, read cache: enabled, supports DPO and FUA
 sdb: sdb1 sdb2 sdb3 sdb4
sd 1:0:0:0: [sdb] Attached SCSI disk
 sdf: sdf1
sd 5:0:0:0: [sdf] Attached SCSI disk
 sdd: sdd1 sdd2 sdd3 sdd4
sd 3:0:0:0: [sdd] Attached SCSI disk
 sde: sde1 sde2 sde3 sde4
sd 4:0:0:0: [sde] Attached SCSI disk
hub 5-0:1.0: USB hub found
hub 5-0:1.0: 2 ports detected
ohci_hcd 0000:00:13.4: OHCI Host Controller
ohci_hcd 0000:00:13.4: new USB bus registered, assigned bus number 6
ohci_hcd 0000:00:13.4: irq 18, io mem 0xfe5fa000
 sda: sda1 sda2 sda3 sda4
sd 0:0:0:0: [sda] Attached SCSI disk
 sdc: sdc1 sdc2 sdc3 sdc4
sd 2:0:0:0: [sdc] Attached SCSI disk
hub 6-0:1.0: USB hub found
hub 6-0:1.0: 2 ports detected
Initializing USB Mass Storage driver...
usbcore: registered new interface driver usb-storage
USB Mass Storage support registered.
usbcore: registered new interface driver usbserial
usbserial: USB Serial Driver core
usbcore: registered new interface driver pl2303
USB Serial support registered for pl2303
i8042: PNP: PS/2 Controller [PNP0303:PS2K,PNP0f03:PS2M] at 0x60,0x64 irq 1,12
serio: i8042 KBD port at 0x60,0x64 irq 1
serio: i8042 AUX port at 0x60,0x64 irq 12
mousedev: PS/2 mouse device common for all mice
rtc_cmos 00:02: RTC can wake from S4
rtc_cmos 00:02: rtc core: registered rtc_cmos as rtc0
rtc0: alarms up to one month, y3k, 114 bytes nvram, hpet irqs
i2c /dev entries driver
ACPI Warning: 0x0000000000000b00-0x0000000000000b07 SystemIO conflicts with Region \SOR1 1 (20120320/utaddress-251)
ACPI: This conflict may cause random problems and system instability
ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
piix4_smbus 0000:00:14.0: SMBus Host Controller at 0xb00, revision 0
input: AT Translated Set 2 keyboard as /devices/platform/i8042/serio0/input/input0
pps_ldisc: PPS line discipline registered
f71882fg: Found f71882fg chip at 0x600, revision 32
ACPI Warning: 0x0000000000000600-0x0000000000000607 SystemIO conflicts with Region \HMOR 1 (20120320/utaddress-251)
ACPI: This conflict may cause random problems and system instability
ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
f71882fg f71882fg.1536: Fan: 1 is in duty-cycle mode
f71882fg f71882fg.1536: Fan: 2 is in duty-cycle mode
f71882fg f71882fg.1536: Fan: 3 is in duty-cycle mode
f71882fg f71882fg.1536: Fan: 4 is in duty-cycle mode
k10temp 0000:00:18.3: unreliable CPU thermal sensor; check erratum 319
md: raid0 personality registered for level 0
md: raid1 personality registered for level 1
md: raid10 personality registered for level 10
md: raid6 personality registered for level 6
md: raid5 personality registered for level 5
md: raid4 personality registered for level 4
EDAC MC: Ver: 2.1.0
AMD64 EDAC driver v3.4.0
EDAC amd64: DRAM ECC disabled.
EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
 Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
 (Note that use of the override may cause unknown side effects.)
cpuidle: using governor ladder
cpuidle: using governor menu
usbcore: registered new interface driver usbhid
usbhid: USB HID core driver
GACT probability on
netem: version 1.3
u32 classifier
    Actions configured
Netfilter messages via NETLINK v0.30.
nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
ctnetlink v0.93: registering with nfnetlink.
NF_TPROXY: Transparent proxy support initialized, version 4.1.0
NF_TPROXY: Copyright (c) 2006-2007 BalaBit IT Ltd.
ip_set: protocol 6
IPv4 over IPv4 tunneling driver
ip_tables: (C) 2000-2006 Netfilter Core Team
TCP: bic registered
TCP: cubic registered
TCP: westwood registered
TCP: highspeed registered
TCP: hybla registered
TCP: htcp registered
TCP: vegas registered
TCP: veno registered
TCP: scalable registered
TCP: lp registered
TCP: yeah registered
TCP: illinois registered
Initializing XFRM netlink socket
NET: Registered protocol family 10
ip6_tables: (C) 2000-2006 Netfilter Core Team
IPv6 over IPv4 tunneling driver
NET: Registered protocol family 17
NET: Registered protocol family 15
Bridge firewalling registered
8021q: 802.1Q VLAN Support v1.8
registered taskstats version 1
rtc_cmos 00:02: setting system clock to 2012-05-21 23:13:48 UTC (1337642028)
powernow-k8: Found 1 AMD Phenom(tm) 9850 Quad-Core Processor (4 cpu cores) (version 2.20.00)
[Firmware Bug]: powernow-k8: No compatible ACPI _PSS objects found.
[Firmware Bug]: powernow-k8: Try again with latest BIOS.
ALSA device list:
  #0: HDA ATI SB at 0xfe5f4000 irq 16
usb 6-2: new low-speed USB device number 2 using ohci_hcd
input: HID 0430:0100 as /devices/pci0000:00/0000:00:13.4/usb6/6-2/6-2:1.0/input/input1
generic-usb 0003:0430:0100.0001: input: USB HID v1.00 Mouse [HID 0430:0100] on usb-0000:00:13.4-2/input0
input: PS/2 Generic Mouse as /devices/platform/i8042/serio1/input/input2
md: Waiting for all devices to be available before autodetect
md: If you don't use raid, use raid=noautodetect
md: Autodetecting RAID arrays.
md: Scanned 20 and added 20 devices.
md: autorun ...
md: considering sdc4 ...
md:  adding sdc4 ...
md: sdc3 has different UUID to sdc4
md: sdc2 has different UUID to sdc4
md: sdc1 has different UUID to sdc4
md:  adding sda4 ...
md: sda3 has different UUID to sdc4
md: sda2 has different UUID to sdc4
md: sda1 has different UUID to sdc4
md:  adding sde4 ...
md: sde3 has different UUID to sdc4
md: sde2 has different UUID to sdc4
md: sde1 has different UUID to sdc4
md:  adding sdd4 ...
md: sdd3 has different UUID to sdc4
md: sdd2 has different UUID to sdc4
md: sdd1 has different UUID to sdc4
md:  adding sdb4 ...
md: sdb3 has different UUID to sdc4
md: sdb2 has different UUID to sdc4
md: sdb1 has different UUID to sdc4
md: created md6
md: bind<sdb4>
md: bind<sdd4>
md: bind<sde4>
md: bind<sda4>
md: bind<sdc4>
md: running: <sdc4><sda4><sde4><sdd4><sdb4>
bio: create slab <bio-1> at 1
md/raid:md6: device sdc4 operational as raid disk 2
md/raid:md6: device sda4 operational as raid disk 0
md/raid:md6: device sde4 operational as raid disk 4
md/raid:md6: device sdd4 operational as raid disk 3
md/raid:md6: device sdb4 operational as raid disk 1
md/raid:md6: allocated 5350kB
md/raid:md6: raid level 5 active with 5 out of 5 devices, algorithm 2
RAID conf printout:
 --- level:5 rd:5 wd:5
 disk 0, o:1, dev:sda4
 disk 1, o:1, dev:sdb4
 disk 2, o:1, dev:sdc4
 disk 3, o:1, dev:sdd4
 disk 4, o:1, dev:sde4
created bitmap (173 pages) for device md6
md6: bitmap initialized from disk: read 11/11 pages, set 1 of 354293 bits
md6: detected capacity change from 0 to 1486011498496
md: considering sdc3 ...
md:  adding sdc3 ...
md: sdc2 has different UUID to sdc3
md: sdc1 has different UUID to sdc3
md: sda3 has different UUID to sdc3
md: sda2 has different UUID to sdc3
md: sda1 has different UUID to sdc3
md: sde3 has different UUID to sdc3
md: sde2 has different UUID to sdc3
md: sde1 has different UUID to sdc3
md:  adding sdd3 ...
md: sdd2 has different UUID to sdc3
md: sdd1 has different UUID to sdc3
md: sdb3 has different UUID to sdc3
md: sdb2 has different UUID to sdc3
md: sdb1 has different UUID to sdc3
md: created md8
md: bind<sdd3>
md: bind<sdc3>
md: running: <sdc3><sdd3>
md/raid1:md8: active with 2 out of 2 mirrors
md8: detected capacity change from 0 to 6144196608
md: considering sdc2 ...
md:  adding sdc2 ...
md: sdc1 has different UUID to sdc2
md: sda3 has different UUID to sdc2
md:  adding sda2 ...
md: sda1 has different UUID to sdc2
md: sde3 has different UUID to sdc2
md:  adding sde2 ...
md: sde1 has different UUID to sdc2
md:  adding sdd2 ...
md: sdd1 has different UUID to sdc2
md: sdb3 has different UUID to sdc2
md:  adding sdb2 ...
md: sdb1 has different UUID to sdc2
md: created md5
md: bind<sdb2>
md: bind<sdd2>
md: bind<sde2>
md: bind<sda2>
md: bind<sdc2>
md: running: <sdc2><sda2><sde2><sdd2><sdb2>
md/raid10:md5: active with 4 out of 4 devices
created bitmap (173 pages) for device md5
md5: bitmap initialized from disk: read 11/11 pages, set 0 of 354293 bits
md5: detected capacity change from 0 to 743005749248
md: considering sdc1 ...
RAID10 conf printout:
 --- wd:4 rd:4
 disk 0, wo:0, o:1, dev:sda2
 disk 1, wo:0, o:1, dev:sdb2
 disk 2, wo:0, o:1, dev:sdc2
 disk 3, wo:0, o:1, dev:sdd2
md:  adding sdc1 ...
md: sda3 has different UUID to sdc1
md:  adding sda1 ...
md: sde3 has different UUID to sdc1
md:  adding sde1 ...
md:  adding sdd1 ...
md: sdb3 has different UUID to sdc1
md:  adding sdb1 ...
md: created md0
md: bind<sdb1>
md: bind<sdd1>
md: bind<sde1>
md: bind<sda1>
md: bind<sdc1>
md: running: <sdc1><sda1><sde1><sdd1><sdb1>
md/raid1:md0: active with 5 out of 5 mirrors
md0: detected capacity change from 0 to 1003356160
md: considering sda3 ...
md:  adding sda3 ...
md:  adding sde3 ...
md:  adding sdb3 ...
md: created md7
md: bind<sdb3>
md: bind<sde3>
md: bind<sda3>
md: running: <sda3><sde3><sdb3>
md/raid1:md7: active with 2 out of 2 mirrors
md7: detected capacity change from 0 to 6144196608
md: ... autorun DONE.
RAID1 conf printout:
 --- wd:2 rd:2
 disk 0, wo:0, o:1, dev:sda3
 disk 1, wo:0, o:1, dev:sdb3
 md5: unknown partition table
kjournald starting.  Commit interval 5 seconds
EXT3-fs (md5): mounted filesystem with ordered data mode
VFS: Mounted root (ext3 filesystem) readonly on device 9:5.
Freeing unused kernel memory: 476k freed
mdadm: sending ioctl 1261 to a partition!
mdadm: sending ioctl 1261 to a partition!
 md7: unknown partition table
 md8: unknown partition table
mdadm: sending ioctl 1261 to a partition!
mdadm: sending ioctl 1261 to a partition!
mdadm: sending ioctl 1261 to a partition!
mdadm: sending ioctl 1261 to a partition!
mdadm: sending ioctl 1261 to a partition!
mdadm: sending ioctl 1261 to a partition!
mdadm: sending ioctl 1261 to a partition!
mdadm: sending ioctl 1261 to a partition!
 md6: unknown partition table
 md0:
udevd[1126]: renamed network interface eth4 to inside
udevd[1202]: renamed network interface eth1 to dmz
udevd[1184]: renamed network interface eth2 to t1
udevd[1125]: renamed network interface eth3 to spare
udevd[1132]: renamed network interface eth0 to cable
hda: UDMA/33 mode selected
ata1.00: SB600 AHCI: limiting to 255 sectors per cmd
ata1.00: SB600 AHCI: limiting to 255 sectors per cmd
ata1.00: configured for UDMA/133
ata1: EH complete
ata1.00: SB600 AHCI: limiting to 255 sectors per cmd
ata1.00: SB600 AHCI: limiting to 255 sectors per cmd
ata1.00: configured for UDMA/133
ata1: EH complete
ata2.00: SB600 AHCI: limiting to 255 sectors per cmd
ata1.00: SB600 AHCI: limiting to 255 sectors per cmd
ata2.00: SB600 AHCI: limiting to 255 sectors per cmd
ata2.00: configured for UDMA/133
ata2: EH complete
ata1.00: SB600 AHCI: limiting to 255 sectors per cmd
ata1.00: configured for UDMA/133
ata1: EH complete
ata3.00: SB600 AHCI: limiting to 255 sectors per cmd
ata2.00: SB600 AHCI: limiting to 255 sectors per cmd
ata1.00: SB600 AHCI: limiting to 255 sectors per cmd
ata3.00: SB600 AHCI: limiting to 255 sectors per cmd
ata3.00: configured for UDMA/133
ata3: EH complete
ata2.00: SB600 AHCI: limiting to 255 sectors per cmd
ata2.00: configured for UDMA/133
ata2: EH complete
ata1.00: SB600 AHCI: limiting to 255 sectors per cmd
ata1.00: configured for UDMA/133
ata1: EH complete
ata4.00: SB600 AHCI: limiting to 255 sectors per cmd
ata3.00: SB600 AHCI: limiting to 255 sectors per cmd
ata2.00: SB600 AHCI: limiting to 255 sectors per cmd
ata4.00: SB600 AHCI: limiting to 255 sectors per cmd
ata4.00: configured for UDMA/133
ata4: EH complete
ata3.00: SB600 AHCI: limiting to 255 sectors per cmd
ata3.00: configured for UDMA/133
ata3: EH complete
ata1.00: SB600 AHCI: limiting to 255 sectors per cmd
ata2.00: SB600 AHCI: limiting to 255 sectors per cmd
ata2.00: configured for UDMA/133
ata2: EH complete
ata1.00: SB600 AHCI: limiting to 255 sectors per cmd
ata1.00: configured for UDMA/133
ata1: EH complete
ata4.00: SB600 AHCI: limiting to 255 sectors per cmd
ata3.00: SB600 AHCI: limiting to 255 sectors per cmd
ata2.00: SB600 AHCI: limiting to 255 sectors per cmd
ata5.00: configured for UDMA/133
ata5: EH complete
ata4.00: SB600 AHCI: limiting to 255 sectors per cmd
ata4.00: configured for UDMA/133
ata4: EH complete
ata3.00: SB600 AHCI: limiting to 255 sectors per cmd
ata3.00: configured for UDMA/133
ata3: EH complete
ata2.00: SB600 AHCI: limiting to 255 sectors per cmd
ata2.00: configured for UDMA/133
ata2: EH complete
ata1.00: SB600 AHCI: limiting to 255 sectors per cmd
ata6.00: configured for UDMA/133
ata6: EH complete
ata2.00: SB600 AHCI: limiting to 255 sectors per cmd
ata4.00: SB600 AHCI: limiting to 255 sectors per cmd
ata3.00: SB600 AHCI: limiting to 255 sectors per cmd
ata1.00: SB600 AHCI: limiting to 255 sectors per cmd
ata1.00: configured for UDMA/133
ata1: EH complete
ata5.00: configured for UDMA/133
ata5: EH complete
ata2.00: SB600 AHCI: limiting to 255 sectors per cmd
ata2.00: configured for UDMA/133
ata2: EH complete
ata4.00: SB600 AHCI: limiting to 255 sectors per cmd
ata4.00: configured for UDMA/133
ata4: EH complete
ata3.00: SB600 AHCI: limiting to 255 sectors per cmd
ata3.00: configured for UDMA/133
ata3: EH complete
ata2.00: SB600 AHCI: limiting to 255 sectors per cmd
ata6.00: configured for UDMA/133
ata6: EH complete
ata4.00: SB600 AHCI: limiting to 255 sectors per cmd
ata2.00: SB600 AHCI: limiting to 255 sectors per cmd
ata2.00: configured for UDMA/133
ata2: EH complete
ata4.00: SB600 AHCI: limiting to 255 sectors per cmd
ata4.00: configured for UDMA/133
ata4: EH complete
ata5.00: configured for UDMA/133
ata5: EH complete
ata3.00: SB600 AHCI: limiting to 255 sectors per cmd
ata6.00: configured for UDMA/133
ata6: EH complete
ata3.00: SB600 AHCI: limiting to 255 sectors per cmd
ata3.00: configured for UDMA/133
ata3: EH complete
ata5.00: configured for UDMA/133
ata5: EH complete
ata3.00: SB600 AHCI: limiting to 255 sectors per cmd
ata3.00: SB600 AHCI: limiting to 255 sectors per cmd
ata3.00: configured for UDMA/133
ata3: EH complete
ata6.00: configured for UDMA/133
ata6: EH complete
ata4.00: SB600 AHCI: limiting to 255 sectors per cmd
ata4.00: SB600 AHCI: limiting to 255 sectors per cmd
ata4.00: configured for UDMA/133
ata4: EH complete
ata4.00: SB600 AHCI: limiting to 255 sectors per cmd
ata4.00: SB600 AHCI: limiting to 255 sectors per cmd
ata4.00: configured for UDMA/133
ata4: EH complete
ata5.00: configured for UDMA/133
ata5: EH complete
ata6.00: configured for UDMA/133
ata6: EH complete
ata5.00: configured for UDMA/133
ata5: EH complete
ata6.00: configured for UDMA/133
ata6: EH complete
hda: UDMA/33 mode selected
ata1.00: SB600 AHCI: limiting to 255 sectors per cmd
ata1.00: SB600 AHCI: limiting to 255 sectors per cmd
ata1.00: configured for UDMA/133
ata1: EH complete
ata2.00: SB600 AHCI: limiting to 255 sectors per cmd
ata2.00: SB600 AHCI: limiting to 255 sectors per cmd
ata2.00: configured for UDMA/133
ata2: EH complete
ata3.00: SB600 AHCI: limiting to 255 sectors per cmd
ata3.00: SB600 AHCI: limiting to 255 sectors per cmd
ata3.00: configured for UDMA/133
ata3: EH complete
ata4.00: SB600 AHCI: limiting to 255 sectors per cmd
ata4.00: SB600 AHCI: limiting to 255 sectors per cmd
ata4.00: configured for UDMA/133
ata4: EH complete
ata5.00: configured for UDMA/133
ata5: EH complete
ata6.00: configured for UDMA/133
ata6: EH complete
Adding 6000188k swap on /dev/md7.  Priority:1 extents:1 across:6000188k 
Adding 6000188k swap on /dev/md8.  Priority:1 extents:1 across:6000188k 
scsi_verify_blk_ioctl: 832 callbacks suppressed
mdadm: sending ioctl 1261 to a partition!
mdadm: sending ioctl 1261 to a partition!
EXT3-fs (md5): using internal journal
mdadm: sending ioctl 1261 to a partition!
mdadm: sending ioctl 1261 to a partition!
mdadm: sending ioctl 1261 to a partition!
mdadm: sending ioctl 1261 to a partition!
kjournald starting.  Commit interval 5 seconds
EXT3-fs (md6): using internal journal
EXT3-fs (md6): mounted filesystem with writeback data mode
EXT4-fs (sdf1): mounted filesystem with writeback data mode. Opts: data=writeback
r8169 0000:02:00.0: inside: link down
r8169 0000:02:00.0: inside: link down
NOHZ: local_softirq_pending 08
ADDRCONF(NETDEV_UP): inside: link is not ready
r8169 0000:02:00.0: inside: link up
ADDRCONF(NETDEV_CHANGE): inside: link becomes ready
net dmz: Setting full-duplex based on MII#1 link partner capability of 45e1
net cable: Setting full-duplex based on MII#1 link partner capability of 41e1
cable: no IPv6 routers present
dmz: no IPv6 routers present
inside: no IPv6 routers present
postgres (7418): /proc/7418/oom_adj is deprecated, please use /proc/7418/oom_score_adj instead.
pps pps0: new PPS source serial3 at ID 0
pps pps0: source "/dev/ttyS3" added
pps pps1: new PPS source serial4 at ID 1
pps pps1: source "/dev/ttyS4" added
device cable entered promiscuous mode
device dmz entered promiscuous mode
------------[ cut here ]------------
WARNING: at net/sched/sch_generic.c:256 dev_watchdog+0xe9/0x15c()
Hardware name: MS-7376
NETDEV WATCHDOG: inside (r8169): transmit queue 0 timed out
Pid: 0, comm: swapper/3 Not tainted 3.4.0-00017-g3df9c78 #152
Call Trace:
 <IRQ>  [<ffffffff81311aba>] ? dev_watchdog+0xe9/0x15c
 [<ffffffff81024499>] ? warn_slowpath_common+0x71/0x85
 [<ffffffff813119d1>] ? netif_tx_lock+0x7a/0x7a
 [<ffffffff81024511>] ? warn_slowpath_fmt+0x45/0x4a
 [<ffffffff813119be>] ? netif_tx_lock+0x67/0x7a
 [<ffffffff81311aba>] ? dev_watchdog+0xe9/0x15c
 [<ffffffff810345ab>] ? __queue_work+0x20a/0x20a
 [<ffffffff8102c908>] ? run_timer_softirq+0x17e/0x20b
 [<ffffffff81028889>] ? __do_softirq+0x80/0x102
 [<ffffffff81404b8c>] ? call_softirq+0x1c/0x30
 [<ffffffff81003044>] ? do_softirq+0x2c/0x60
 [<ffffffff81028abc>] ? irq_exit+0x3a/0x91
 [<ffffffff81002e91>] ? do_IRQ+0x81/0x97
 [<ffffffff81403327>] ? common_interrupt+0x67/0x67
 <EOI>  [<ffffffff810079d8>] ? default_idle+0x1e/0x32
 [<ffffffff81007afc>] ? amd_e400_idle+0xb7/0xd4
 [<ffffffff810081a5>] ? cpu_idle+0x58/0x98
---[ end trace 7d5a7d21f604b0d8 ]---
r8169 0000:02:00.0: inside: link up
r8169 0000:02:00.0: inside: link up
r8169 0000:02:00.0: inside: link up
r8169 0000:02:00.0: inside: link up
r8169 0000:02:00.0: inside: link up
r8169 0000:02:00.0: inside: link up
r8169 0000:02:00.0: inside: link up
r8169 0000:02:00.0: inside: link up
r8169 0000:02:00.0: inside: link up
r8169 0000:02:00.0: inside: link up
TCP: TCP: Possible SYN flooding on port 25. Sending cookies.  Check SNMP counters.
kjournald starting.  Commit interval 5 seconds
EXT3-fs (md0): using internal journal
EXT3-fs (md0): mounted filesystem with journal data mode
r8169 0000:02:00.0: inside: link up
r8169 0000:02:00.0: inside: link up
UDP: short packet: From yyy.yy.yyy.yy:10995 32496/70 to xx.xxx.xx.xxx:6881
r8169 0000:02:00.0: inside: link up
r8169 0000:02:00.0: inside: link up
r8169 0000:02:00.0: inside: link up
UDP: short packet: From zzz.zzz.zz.zz:29440 21248/181 to xx.xxx.xx.xxx:23552

* Re: [PATCH 01/17] netfilter: add struct nf_proto_net for register l4proto sysctl
From: Gao feng @ 2012-05-24  1:35 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netfilter-devel, netdev, serge.hallyn, ebiederm, dlezcano,
	Gao feng
In-Reply-To: <20120523101200.GA2836@1984>

Hi Pablo,

On 2012-05-23 18:12, Pablo Neira Ayuso wrote:
> On Mon, May 14, 2012 at 04:52:11PM +0800, Gao feng wrote:
>> From: Gao feng <gaofeng@cn.fujitus.com>
>>
>> struct nf_proto_net stores the proto's ctl_table_header and ctl_table;
>> nf_ct_l4proto_(un)register_sysctl uses it to register sysctls.
>>
>> There are some changes to struct nf_conntrack_l4proto:
>> - add a compat field to identify whether this proto should do compat.
>> - the net_id field stores the pernet_operations id
>>   that belongs to the l4proto.
>> - init_net will be used to initialize the proto's pernet data.
>>
>> Also add init_net to struct nf_conntrack_l3proto.
> 
> This patchset looks better, but there are still things that we have to
> resolve.
> 
> First, regarding this patch (1/17), the changes in:
> * include/net/netfilter/nf_conntrack_l4proto.h
> * include/net/netns/conntrack.h
> 
> should be included in:
> [PATCH] netfilter: add namespace support for l4proto
> 
> And changes in:
> * include/net/netfilter/nf_conntrack_l3proto.h
> 
> should be included in:
> [PATCH] netfilter: add namespace support for l3proto
> 
> I already told you: a patch that adds a structure without using it
> is not good. The structure has to go together with the code that uses it.
> 

It seems this patch should be merged into "netfilter: add namespace support for l4proto",
since struct nf_proto_net is first used there.

> More comments below.
> 
>> Acked-by: Eric W. Biederman <ebiederm@xmission.com>
>> Signed-off-by: Gao feng <gaofeng@cn.fujitus.com>
>> ---
>>  include/net/netfilter/nf_conntrack_l3proto.h |    3 +++
>>  include/net/netfilter/nf_conntrack_l4proto.h |    6 ++++++
>>  include/net/netns/conntrack.h                |   12 ++++++++++++
>>  3 files changed, 21 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/net/netfilter/nf_conntrack_l3proto.h b/include/net/netfilter/nf_conntrack_l3proto.h
>> index 9699c02..9766005 100644
>> --- a/include/net/netfilter/nf_conntrack_l3proto.h
>> +++ b/include/net/netfilter/nf_conntrack_l3proto.h
>> @@ -69,6 +69,9 @@ struct nf_conntrack_l3proto {
>>  	struct ctl_table	*ctl_table;
>>  #endif /* CONFIG_SYSCTL */
>>  
>> +	/* Init l3proto pernet data */
>> +	int (*init_net)(struct net *net);
>> +
>>  	/* Module (if any) which this is connected to. */
>>  	struct module *me;
>>  };
>> diff --git a/include/net/netfilter/nf_conntrack_l4proto.h b/include/net/netfilter/nf_conntrack_l4proto.h
>> index 3b572bb..a90eab5 100644
>> --- a/include/net/netfilter/nf_conntrack_l4proto.h
>> +++ b/include/net/netfilter/nf_conntrack_l4proto.h
>> @@ -22,6 +22,8 @@ struct nf_conntrack_l4proto {
>>  	/* L4 Protocol number. */
>>  	u_int8_t l4proto;
>>  
>> +	u_int8_t compat;
> 
> I don't see why we need this new field.
> 
> It seems to be set to 1 in each structure that has set:
> 
> .ctl_compat_table
> 
> to non-NULL. So, it's redundant.
> 
> Moreover, you already know from the protocol tracker itself if you
> have to allocate the compat ctl table or not.
> 
> In other words: You set compat to 1 for nf_conntrack_l4proto_generic.
> Then, you pass that compat value to generic_init_net via ->init_net
> again, but this information (that determines if the compat has to be
> done or not) is already in the scope of the protocol tracker.
> 

Because some protocols, such as l4proto_tcp6 and l4proto_tcp, use the same init_net
function. l4proto_tcp6 doesn't need the compat sysctl, so we should use this new
field to identify whether we should kmemdup compat_sysctl_table.

And because protocols will have a pernet ctl_compat_table and ctl_table, the .ctl_compat_table
field will be deleted in patch 15/17, so we need the new compat field.

Actually, we don't need to pass the compat value to generic_init_net, because
we know l4proto_generic needs compat. But considering l4proto_tcp(6), and in order to keep
the code readable, I prefer to add the compat field and pass it to init_net.
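
To illustrate the point about a shared init_net, here is a minimal userspace
sketch, not the patch itself. The names struct proto_net, dup_table and
tcp_init_net are hypothetical stand-ins (dup_table plays the role of kmemdup);
only the idea that a compat flag decides whether the compat table is
duplicated comes from the discussion above.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct ctl_table; illustration only. */
struct table { int dummy[4]; };

/* Hypothetical stand-in for the per-net protocol data. */
struct proto_net {
	struct table *ctl_table;
	struct table *ctl_compat_table;	/* stays NULL when compat is 0 */
};

static const struct table tcp_sysctl_template = { {0} };
static const struct table tcp_compat_template = { {0} };

/* Userspace substitute for kmemdup(): copy a template table. */
static struct table *dup_table(const struct table *src)
{
	struct table *t = malloc(sizeof(*t));

	if (t)
		memcpy(t, src, sizeof(*t));
	return t;
}

/* One init function shared by the IPv4 and IPv6 flavours of a tracker.
 * The caller passes compat=1 only where the compat sysctl is wanted. */
static int tcp_init_net(struct proto_net *pn, unsigned char compat)
{
	pn->ctl_compat_table = NULL;
	pn->ctl_table = dup_table(&tcp_sysctl_template);
	if (!pn->ctl_table)
		return -1;

	if (compat) {
		pn->ctl_compat_table = dup_table(&tcp_compat_template);
		if (!pn->ctl_compat_table) {
			free(pn->ctl_table);
			pn->ctl_table = NULL;
			return -1;
		}
	}
	return 0;
}
```

In this sketch the IPv4 tracker would call tcp_init_net(pn, 1) and the IPv6
one tcp_init_net(pn, 0), so the same function serves both.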

> You have to fix this.
> 
>> +
>>  	/* Try to fill in the third arg: dataoff is offset past network protocol
>>             hdr.  Return true if possible. */
>>  	bool (*pkt_to_tuple)(const struct sk_buff *skb, unsigned int dataoff,
>> @@ -103,6 +105,10 @@ struct nf_conntrack_l4proto {
>>  	struct ctl_table	*ctl_compat_table;
>>  #endif
>>  #endif
>> +	int	*net_id;
>> +	/* Init l4proto pernet data */
>> +	int (*init_net)(struct net *net, u_int8_t compat);
>> +
>>  	/* Protocol name */
>>  	const char *name;
>>  
>> diff --git a/include/net/netns/conntrack.h b/include/net/netns/conntrack.h
>> index a053a19..1f53038 100644
>> --- a/include/net/netns/conntrack.h
>> +++ b/include/net/netns/conntrack.h
>> @@ -8,6 +8,18 @@
>>  struct ctl_table_header;
>>  struct nf_conntrack_ecache;
>>  
>> +struct nf_proto_net {
>> +#ifdef CONFIG_SYSCTL
>> +	struct ctl_table_header *ctl_table_header;
>> +	struct ctl_table        *ctl_table;
>> +#ifdef CONFIG_NF_CONNTRACK_PROC_COMPAT
>> +	struct ctl_table_header *ctl_compat_header;
>> +	struct ctl_table        *ctl_compat_table;
>> +#endif
>> +#endif
>> +	unsigned int		users;
>> +};
>> +
>>  struct netns_ct {
>>  	atomic_t		count;
>>  	unsigned int		expect_count;
> --
> To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 



^ permalink raw reply

* Re: [PATCH 04/17] netfilter: add namespace support for l4proto_generic
From: Gao feng @ 2012-05-24  1:13 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netfilter-devel, netdev, serge.hallyn, ebiederm, dlezcano
In-Reply-To: <20120523103232.GD2836@1984>

On 2012-05-23 18:32, Pablo Neira Ayuso wrote:
> On Mon, May 14, 2012 at 04:52:14PM +0800, Gao feng wrote:
>> implement and export nf_conntrack_proto_generic_[init,fini],
>> nf_conntrack_[init,cleanup]_net call them to register or unregister
>> the sysctl of generic proto.
>>
>> implement generic_net_init, which is used to initialize the pernet
>> data for generic proto.
>>
>> and use nf_generic_net.timeout to replace nf_ct_generic_timeout in
>> get_timeouts function.
>>
>> Acked-by: Eric W. Biederman <ebiederm@xmission.com>
>> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
>> ---
>>  include/net/netfilter/nf_conntrack_l4proto.h |    2 +
>>  include/net/netns/conntrack.h                |    6 +++
>>  net/netfilter/nf_conntrack_core.c            |    8 +++-
>>  net/netfilter/nf_conntrack_proto.c           |   21 +++++-----
>>  net/netfilter/nf_conntrack_proto_generic.c   |   55 ++++++++++++++++++++++++-
>>  5 files changed, 76 insertions(+), 16 deletions(-)
>>
>> diff --git a/include/net/netfilter/nf_conntrack_l4proto.h b/include/net/netfilter/nf_conntrack_l4proto.h
>> index a93dcd5..0d329b9 100644
>> --- a/include/net/netfilter/nf_conntrack_l4proto.h
>> +++ b/include/net/netfilter/nf_conntrack_l4proto.h
>> @@ -118,6 +118,8 @@ struct nf_conntrack_l4proto {
>>  
>>  /* Existing built-in generic protocol */
>>  extern struct nf_conntrack_l4proto nf_conntrack_l4proto_generic;
>> +extern int nf_conntrack_proto_generic_init(struct net *net);
>> +extern void nf_conntrack_proto_generic_fini(struct net *net);
>>  
>>  #define MAX_NF_CT_PROTO 256
>>  
>> diff --git a/include/net/netns/conntrack.h b/include/net/netns/conntrack.h
>> index 94992e9..3381b80 100644
>> --- a/include/net/netns/conntrack.h
>> +++ b/include/net/netns/conntrack.h
>> @@ -20,7 +20,13 @@ struct nf_proto_net {
>>  	unsigned int		users;
>>  };
>>  
>> +struct nf_generic_net {
>> +	struct nf_proto_net pn;
>> +	unsigned int timeout;
>> +};
>> +
>>  struct nf_ip_net {
>> +	struct nf_generic_net   generic;
>>  #if defined(CONFIG_SYSCTL) && defined(CONFIG_NF_CONNTRACK_PROC_COMPAT)
>>  	struct ctl_table_header *ctl_table_header;
>>  	struct ctl_table	*ctl_table;
>> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
>> index 32c5909..fd33e91 100644
>> --- a/net/netfilter/nf_conntrack_core.c
>> +++ b/net/netfilter/nf_conntrack_core.c
>> @@ -1353,6 +1353,7 @@ static void nf_conntrack_cleanup_net(struct net *net)
>>  	}
>>  
>>  	nf_ct_free_hashtable(net->ct.hash, net->ct.htable_size);
>> +	nf_conntrack_proto_generic_fini(net);
>>  	nf_conntrack_helper_fini(net);
>>  	nf_conntrack_timeout_fini(net);
>>  	nf_conntrack_ecache_fini(net);
>> @@ -1586,9 +1587,12 @@ static int nf_conntrack_init_net(struct net *net)
>>  	ret = nf_conntrack_helper_init(net);
>>  	if (ret < 0)
>>  		goto err_helper;
>> -
>> +	ret = nf_conntrack_proto_generic_init(net);
>> +	if (ret < 0)
>> +		goto err_generic;
>>  	return 0;
>> -
>> +err_generic:
>> +	nf_conntrack_helper_fini(net);
>>  err_helper:
>>  	nf_conntrack_timeout_fini(net);
>>  err_timeout:
>> diff --git a/net/netfilter/nf_conntrack_proto.c b/net/netfilter/nf_conntrack_proto.c
>> index 7ee6653..9b4bf6d 100644
>> --- a/net/netfilter/nf_conntrack_proto.c
>> +++ b/net/netfilter/nf_conntrack_proto.c
>> @@ -287,10 +287,16 @@ EXPORT_SYMBOL_GPL(nf_conntrack_l3proto_unregister);
>>  static struct nf_proto_net *nf_ct_l4proto_net(struct net *net,
>>  					      struct nf_conntrack_l4proto *l4proto)
>>  {
>> -	if (l4proto->net_id)
>> -		return net_generic(net, *l4proto->net_id);
>> -	else
>> -		return NULL;
>> +	switch (l4proto->l4proto) {
>> +	case 255: /* l4proto_generic */
>> +		return (struct nf_proto_net *)&net->ct.proto.generic;
>> +	default:
>> +		if (l4proto->net_id)
>> +			return net_generic(net, *l4proto->net_id);
>> +		else
>> +			return NULL;
>> +	}
>> +	return NULL;
>>  }
>>  
>>  int nf_ct_l4proto_register_sysctl(struct net *net,
>> @@ -457,11 +463,6 @@ EXPORT_SYMBOL_GPL(nf_conntrack_l4proto_unregister);
>>  int nf_conntrack_proto_init(void)
>>  {
>>  	unsigned int i;
>> -	int err;
>> -
>> -	err = nf_ct_l4proto_register_sysctl(&init_net, &nf_conntrack_l4proto_generic);
>> -	if (err < 0)
>> -		return err;
> 
> I like that all protocols sysctl are registered by
> nf_conntrack_proto_init. Can you keep using that?

You mean that the per-net generic_proto sysctls are registered by
nf_conntrack_proto_init?

Such as:

int nf_conntrack_proto_init(struct net *net)
{
	...
	err = nf_ct_l4proto_register_sysctl(net, &nf_conntrack_l4proto_generic);
	...
}

If my understanding is right, my answer is yes, we can ;)


^ permalink raw reply

* Re: [PATCH 16/17] netfilter: add namespace support for cttimeout
From: Gao feng @ 2012-05-24  1:04 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netfilter-devel, netdev, serge.hallyn, ebiederm, dlezcano
In-Reply-To: <20120523104110.GF2836@1984>

On 2012-05-23 18:41, Pablo Neira Ayuso wrote:
> On Mon, May 14, 2012 at 04:52:26PM +0800, Gao feng wrote:
>> add struct net as a param of ctnl_timeout.nlattr_to_obj,
>>
>> modify ctnl_timeout_parse_policy and cttimeout_new_timeout
>> to transmit struct net to nlattr_to_obj.
> 
> Please, merge your patch 16 and 17 into one single patch.
> 
>>  	unsigned int *timeouts = data;
>>  
>> diff --git a/net/netfilter/nf_conntrack_proto_sctp.c b/net/netfilter/nf_conntrack_proto_sctp.c
>> index 291cef4..a28f3c4 100644
>> --- a/net/netfilter/nf_conntrack_proto_sctp.c
>> +++ b/net/netfilter/nf_conntrack_proto_sctp.c
>> @@ -562,7 +562,8 @@ static int sctp_nlattr_size(void)
>>  #include <linux/netfilter/nfnetlink.h>
>>  #include <linux/netfilter/nfnetlink_cttimeout.h>
>>  
>> -static int sctp_timeout_nlattr_to_obj(struct nlattr *tb[], void *data)
>> +static int sctp_timeout_nlattr_to_obj(struct nlattr *tb[],
>> +				      struct net *net, void *data)
> 
> The interface modification and the use of the new *net parameter
> should go together, ie. merge patch 16 and 17 :-).

Got it, thanks ;)


^ permalink raw reply

* Re: [PATCH 15/17] netfilter: cleanup sysctl for l4proto and l3proto
From: Gao feng @ 2012-05-24  0:59 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netfilter-devel, netdev, serge.hallyn, ebiederm, dlezcano
In-Reply-To: <20120523103810.GE2836@1984>

Hi Pablo:

On 2012-05-23 18:38, Pablo Neira Ayuso wrote:
> On Mon, May 14, 2012 at 04:52:25PM +0800, Gao feng wrote:
>> delete the now-useless sysctl data for l4proto and l3proto.
>>
>> Acked-by: Eric W. Biederman <ebiederm@xmission.com>
>> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
>> ---
>>  include/net/netfilter/nf_conntrack_l3proto.h   |    2 --
>>  include/net/netfilter/nf_conntrack_l4proto.h   |   10 ----------
>>  net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c |    1 -
>>  net/ipv4/netfilter/nf_conntrack_proto_icmp.c   |    8 --------
>>  net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c |    5 -----
>>  net/netfilter/nf_conntrack_proto_generic.c     |    8 --------
>>  net/netfilter/nf_conntrack_proto_sctp.c        |   15 ---------------
>>  net/netfilter/nf_conntrack_proto_tcp.c         |   15 ---------------
>>  net/netfilter/nf_conntrack_proto_udp.c         |   15 ---------------
>>  net/netfilter/nf_conntrack_proto_udplite.c     |   12 ------------
>>  10 files changed, 0 insertions(+), 91 deletions(-)
>>
>> diff --git a/include/net/netfilter/nf_conntrack_l3proto.h b/include/net/netfilter/nf_conntrack_l3proto.h
>> index d6df8c7..6f7c13f 100644
>> --- a/include/net/netfilter/nf_conntrack_l3proto.h
>> +++ b/include/net/netfilter/nf_conntrack_l3proto.h
>> @@ -64,9 +64,7 @@ struct nf_conntrack_l3proto {
>>  	size_t nla_size;
>>  
>>  #ifdef CONFIG_SYSCTL
>> -	struct ctl_table_header	*ctl_table_header;
>>  	const char		*ctl_table_path;
>> -	struct ctl_table	*ctl_table;
>>  #endif /* CONFIG_SYSCTL */
>>  
>>  	/* Init l3proto pernet data */
>> diff --git a/include/net/netfilter/nf_conntrack_l4proto.h b/include/net/netfilter/nf_conntrack_l4proto.h
>> index 0d329b9..4881df34 100644
>> --- a/include/net/netfilter/nf_conntrack_l4proto.h
>> +++ b/include/net/netfilter/nf_conntrack_l4proto.h
>> @@ -95,16 +95,6 @@ struct nf_conntrack_l4proto {
>>  		const struct nla_policy *nla_policy;
>>  	} ctnl_timeout;
>>  #endif
>> -
>> -#ifdef CONFIG_SYSCTL
>> -	struct ctl_table_header	**ctl_table_header;
>> -	struct ctl_table	*ctl_table;
>> -	unsigned int		*ctl_table_users;
>> -#ifdef CONFIG_NF_CONNTRACK_PROC_COMPAT
>> -	struct ctl_table_header	*ctl_compat_table_header;
>> -	struct ctl_table	*ctl_compat_table;
>> -#endif
>> -#endif
> 
> Interesting. This structure is added in patch 1/17, then it's removed
> in patch 15/17.
> 
> Probably I'm missing something, but why are you doing it like that?

By "this structure", do you mean ctl_table_header, ctl_table and so on?

I added these fields to struct nf_proto_net in patch 1/17, so the ones in
struct nf_conntrack_l4proto are now useless; this patch is just cleanup.

The same goes for nf_conntrack_l3proto.

^ permalink raw reply

* Re: [PATCH v2] mm: add a low limit to alloc_large_system_hash
From: Tim Bird @ 2012-05-23 23:33 UTC (permalink / raw)
  To: eric.dumazet, Paul Gortmaker, David Miller, linux kernel, netdev

This patch seems to have fallen through the cracks:

https://lkml.org/lkml/2012/2/27/20

The last message on the thread was an ACK by David Miller, and
the question "Who wants to take this?"
  -- Tim

[message and patch follow]

UDP stack needs a minimum hash size value for proper operation and also
uses alloc_large_system_hash() for proper NUMA distribution of its hash
tables and automatic sizing depending on available system memory.

In some low memory situations, udp_table_init() must ignore the
alloc_large_system_hash() result and realloc a bigger memory area.

As we cannot easily free the old hash table, we leak it and kmemleak can
issue a warning.

This patch adds a low limit parameter to alloc_large_system_hash() to
solve this problem.

We then specify UDP_HTABLE_SIZE_MIN for UDP/UDPLite hash table
allocation.
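
As a rough illustration of the clamping this adds, here is a userspace
sketch. clamp_entries is a hypothetical helper, the value 256 mentioned
below is only an assumed stand-in for UDP_HTABLE_SIZE_MIN, and treating a
high limit of 0 as "use the 0x80000000 cap" is a simplification of the
memory-based sizing done by the real function.

```c
/* Hypothetical helper mirroring the order of checks in the patch:
 * raise to the low limit first, then cap at the high limit. */
static unsigned long clamp_entries(unsigned long numentries,
				   unsigned long low_limit,
				   unsigned long high_limit)
{
	unsigned long long max = high_limit;

	/* Simplification: 0 means "no caller-imposed cap" in this sketch. */
	if (max == 0 || max > 0x80000000ULL)
		max = 0x80000000ULL;

	if (numentries < low_limit)
		numentries = low_limit;
	if (numentries > max)
		numentries = max;

	return numentries;
}
```

With a low limit of 256 and a high limit of 64 * 1024, a table that would
otherwise be sized at 64 entries comes back as 256, so the caller does not
need a separate fallback allocation.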

Reported-by: Mark Asselstine <mark.asselstine@windriver.com>
Reported-by: Tim Bird <tim.bird@am.sony.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
---
V2: no 16 minimum value for pid hash
 fs/dcache.c             |    2 ++
 fs/inode.c              |    2 ++
 include/linux/bootmem.h |    3 ++-
 kernel/pid.c            |    3 ++-
 mm/page_alloc.c         |    7 +++++--
 net/ipv4/route.c        |    1 +
 net/ipv4/tcp.c          |    2 ++
 net/ipv4/udp.c          |   30 ++++++++++--------------------
 8 files changed, 26 insertions(+), 24 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index fe19ac1..ef5e72e 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2984,6 +2984,7 @@ static void __init dcache_init_early(void)
 					HASH_EARLY,
 					&d_hash_shift,
 					&d_hash_mask,
+					0,
 					0);

 	for (loop = 0; loop < (1U << d_hash_shift); loop++)
@@ -3014,6 +3015,7 @@ static void __init dcache_init(void)
 					0,
 					&d_hash_shift,
 					&d_hash_mask,
+					0,
 					0);

 	for (loop = 0; loop < (1U << d_hash_shift); loop++)
diff --git a/fs/inode.c b/fs/inode.c
index d3ebdbe..7acee4c 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1667,6 +1667,7 @@ void __init inode_init_early(void)
 					HASH_EARLY,
 					&i_hash_shift,
 					&i_hash_mask,
+					0,
 					0);

 	for (loop = 0; loop < (1U << i_hash_shift); loop++)
@@ -1697,6 +1698,7 @@ void __init inode_init(void)
 					0,
 					&i_hash_shift,
 					&i_hash_mask,
+					0,
 					0);

 	for (loop = 0; loop < (1U << i_hash_shift); loop++)
diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index 66d3e95..1a0cd27 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -154,7 +154,8 @@ extern void *alloc_large_system_hash(const char *tablename,
 				     int flags,
 				     unsigned int *_hash_shift,
 				     unsigned int *_hash_mask,
-				     unsigned long limit);
+				     unsigned long low_limit,
+				     unsigned long high_limit);

 #define HASH_EARLY	0x00000001	/* Allocating during early boot? */
 #define HASH_SMALL	0x00000002	/* sub-page allocation allowed, min
diff --git a/kernel/pid.c b/kernel/pid.c
index 9f08dfa..e86b291 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -547,7 +547,8 @@ void __init pidhash_init(void)

 	pid_hash = alloc_large_system_hash("PID", sizeof(*pid_hash), 0, 18,
 					   HASH_EARLY | HASH_SMALL,
-					   &pidhash_shift, NULL, 4096);
+					   &pidhash_shift, NULL,
+					   0, 4096);
 	pidhash_size = 1U << pidhash_shift;

 	for (i = 0; i < pidhash_size; i++)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a13ded1..b9afccb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5198,9 +5198,10 @@ void *__init alloc_large_system_hash(const char *tablename,
 				     int flags,
 				     unsigned int *_hash_shift,
 				     unsigned int *_hash_mask,
-				     unsigned long limit)
+				     unsigned long low_limit,
+				     unsigned long high_limit)
 {
-	unsigned long long max = limit;
+	unsigned long long max = high_limit;
 	unsigned long log2qty, size;
 	void *table = NULL;

@@ -5238,6 +5239,8 @@ void *__init alloc_large_system_hash(const char *tablename,
 	}
 	max = min(max, 0x80000000ULL);

+	if (numentries < low_limit)
+		numentries = low_limit;
 	if (numentries > max)
 		numentries = max;

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index bcacf54..0a41e38 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -3475,6 +3475,7 @@ int __init ip_rt_init(void)
 					0,
 					&rt_hash_log,
 					&rt_hash_mask,
+					0,
 					rhash_entries ? 0 : 512 * 1024);
 	memset(rt_hash_table, 0, (rt_hash_mask + 1) * sizeof(struct rt_hash_bucket));
 	rt_hash_lock_init();
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 22ef5f9..e61a498 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3267,6 +3267,7 @@ void __init tcp_init(void)
 					0,
 					NULL,
 					&tcp_hashinfo.ehash_mask,
+					0,
 					thash_entries ? 0 : 512 * 1024);
 	for (i = 0; i <= tcp_hashinfo.ehash_mask; i++) {
 		INIT_HLIST_NULLS_HEAD(&tcp_hashinfo.ehash[i].chain, i);
@@ -3283,6 +3284,7 @@ void __init tcp_init(void)
 					0,
 					&tcp_hashinfo.bhash_size,
 					NULL,
+					0,
 					64 * 1024);
 	tcp_hashinfo.bhash_size = 1U << tcp_hashinfo.bhash_size;
 	for (i = 0; i < tcp_hashinfo.bhash_size; i++) {
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 5d075b5..dc68ed2 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2182,26 +2182,16 @@ void __init udp_table_init(struct udp_table *table, const char *name)
 {
 	unsigned int i;

-	if (!CONFIG_BASE_SMALL)
-		table->hash = alloc_large_system_hash(name,
-			2 * sizeof(struct udp_hslot),
-			uhash_entries,
-			21, /* one slot per 2 MB */
-			0,
-			&table->log,
-			&table->mask,
-			64 * 1024);
-	/*
-	 * Make sure hash table has the minimum size
-	 */
-	if (CONFIG_BASE_SMALL || table->mask < UDP_HTABLE_SIZE_MIN - 1) {
-		table->hash = kmalloc(UDP_HTABLE_SIZE_MIN *
-				      2 * sizeof(struct udp_hslot), GFP_KERNEL);
-		if (!table->hash)
-			panic(name);
-		table->log = ilog2(UDP_HTABLE_SIZE_MIN);
-		table->mask = UDP_HTABLE_SIZE_MIN - 1;
-	}
+	table->hash = alloc_large_system_hash(name,
+					      2 * sizeof(struct udp_hslot),
+					      uhash_entries,
+					      21, /* one slot per 2 MB */
+					      0,
+					      &table->log,
+					      &table->mask,
+					      UDP_HTABLE_SIZE_MIN,
+					      64 * 1024);
+
 	table->hash2 = table->hash + (table->mask + 1);
 	for (i = 0; i <= table->mask; i++) {
 		INIT_HLIST_NULLS_HEAD(&table->hash[i].head, i);

^ permalink raw reply related

* Re: [PATCH] fec: Add support for Coldfire M5441x enet-mac.
From: David Miller @ 2012-05-23 23:02 UTC (permalink / raw)
  To: sfking; +Cc: netdev, uclinux-dev, gerg
In-Reply-To: <201205231435.50857.sfking@fdwdc.com>


Please resubmit this when the net-next tree is open again.

Now is not an appropriate time to submit patches that are
not bug fixes.

^ permalink raw reply

* Re: [PATCH 06/15] batman-adv: Distributed ARP Table - add snooping functions for ARP messages
From: David Miller @ 2012-05-23 23:01 UTC (permalink / raw)
  To: simon.wunderlich-Y4E02TeZ33kaBlGTGt4zH4SGEyLTKazZ
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	b.a.t.m.a.n-ZwoEplunGu2X36UT3dwllkB+6BGkLq7r,
	lindner_marek-LWAfsSFWpa4
In-Reply-To: <20120523214802.GA12222@pandem0nium>


It can't be all on me to answer your question, I cannot be
the choke point.

You must lean on the entire networking developer community
for help, otherwise it simply will not scale.

^ permalink raw reply

* Re: NETDEV WATCHDOG: %s (%s): transmit queue %u timed out
From: Francois Romieu @ 2012-05-23 22:32 UTC (permalink / raw)
  To: George Spelvin; +Cc: davej, kernel-team, netdev
In-Reply-To: <20120523015312.14731.qmail@science.horizon.com>

[-- Attachment #1: Type: text/plain, Size: 517 bytes --]

George Spelvin <linux@horizon.com> :
[...]
> Unfortunately, the last reboot was on Feb. 26, and the kernel logs don't
> go back that far, so I'm not sure if 3.3-rc5 reported this problem or not.

You may try the attached patches on top of current -git. A complete dmesg
will be welcome. So will an 'ethtool -d eth0' if the device stops working.
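
For readers skimming the first attachment, its one-line change amounts to
masking opts1 instead of zeroing it, so the end-of-ring marker survives a
descriptor reset. A minimal userspace sketch follows; struct tx_desc and
reset_tx_desc are hypothetical names, the bit positions are assumed to match
DescOwn and RingEnd in r8169.c, and the cpu_to_le32() conversion is left out
for brevity.

```c
#include <stdint.h>

/* Assumed bit layout (illustration only). */
#define DESC_OWN  (1u << 31)	/* descriptor is owned by the NIC */
#define RING_END  (1u << 30)	/* last descriptor of the ring */

/* Hypothetical stand-in for the driver's Tx descriptor. */
struct tx_desc {
	uint32_t opts1;
	uint32_t opts2;
	uint64_t addr;
};

/* Reset a completed descriptor but keep the end-of-ring marker. */
static void reset_tx_desc(struct tx_desc *desc)
{
	desc->opts1 &= RING_END;	/* was: desc->opts1 = 0 */
	desc->opts2 = 0;
	desc->addr = 0;
}
```

Only the last descriptor of the ring carries the marker, so every other
descriptor still ends up with opts1 equal to zero.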

You did not label the problem as a serious one. Does it mean that the
driver automatically recovers?

I'll add some ring and registers debug stuff tomorrow.

-- 
Ueimor

[-- Attachment #2: 0001-r8169-avoid-clearing-the-end-of-Tx-descriptor-ring-m.patch --]
[-- Type: text/plain, Size: 1040 bytes --]

>From 4fafb314defc9e0cc5da1b58e84bba87686ea2ba Mon Sep 17 00:00:00 2001
Message-Id: <4fafb314defc9e0cc5da1b58e84bba87686ea2ba.1337810811.git.romieu@fr.zoreil.com>
From: Francois Romieu <romieu@fr.zoreil.com>
Date: Wed, 23 May 2012 22:18:35 +0200
Subject: [PATCH 1/2] r8169: avoid clearing the end of Tx descriptor ring
 marker bit.
X-Organisation: Land of Sunshine Inc.

Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
---
 drivers/net/ethernet/realtek/r8169.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 00b4f56..01f3367 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -5345,7 +5345,7 @@ static void rtl8169_unmap_tx_skb(struct device *d, struct ring_info *tx_skb,
 
 	dma_unmap_single(d, le64_to_cpu(desc->addr), len, DMA_TO_DEVICE);
 
-	desc->opts1 = 0x00;
+	desc->opts1 &= cpu_to_le32(RingEnd);
 	desc->opts2 = 0x00;
 	desc->addr = 0x00;
 	tx_skb->len = 0;
-- 
1.7.7.6


[-- Attachment #3: 0002-r8169-TxPoll-hack-rework.patch --]
[-- Type: text/plain, Size: 4285 bytes --]

>From 4cadfce6ba362b5d8d5c193c4896443be575b2b4 Mon Sep 17 00:00:00 2001
Message-Id: <4cadfce6ba362b5d8d5c193c4896443be575b2b4.1337810811.git.romieu@fr.zoreil.com>
In-Reply-To: <4fafb314defc9e0cc5da1b58e84bba87686ea2ba.1337810811.git.romieu@fr.zoreil.com>
References: <4fafb314defc9e0cc5da1b58e84bba87686ea2ba.1337810811.git.romieu@fr.zoreil.com>
From: Francois Romieu <romieu@fr.zoreil.com>
Date: Wed, 23 May 2012 23:21:13 +0200
Subject: [PATCH 2/2] r8169: TxPoll hack rework.
X-Organisation: Land of Sunshine Inc.

I don't want to try and convince myself that it is completely safe to
issue a TxPoll request from the NAPI handler right in the middle of
a start_xmit, whence the tx_lock.

Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
---
 drivers/net/ethernet/realtek/r8169.c |   59 +++++++++++++++++++--------------
 1 files changed, 34 insertions(+), 25 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 01f3367..655b293 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -5544,19 +5544,20 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
 	status = opts[0] | len | (RingEnd * !((entry + 1) % NUM_TX_DESC));
 	txd->opts1 = cpu_to_le32(status);
 
+	smp_wmb();
+
 	tp->cur_tx += frags + 1;
 
-	wmb();
+	smp_wmb();
 
 	RTL_W8(TxPoll, NPQ);
 
 	mmiowb();
 
 	if (!TX_FRAGS_READY_FOR(tp, MAX_SKB_FRAGS)) {
-		/* Avoid wrongly optimistic queue wake-up: rtl_tx thread must
-		 * not miss a ring update when it notices a stopped queue.
-		 */
-		smp_wmb();
+		/* rtl_tx thread must not miss a ring update when it notices
+		 * a stopped queue. The TxPoll hack requires the smp_wmb
+		 * above so we can go ahead. */
 		netif_stop_queue(dev);
 		/* Sync with rtl_tx:
 		 * - publish queue status and cur_tx ring index (write barrier)
@@ -5640,22 +5641,36 @@ struct rtl_txc {
 static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp)
 {
 	struct rtl8169_stats *tx_stats = &tp->tx_stats;
-	unsigned int dirty_tx, tx_left;
+	unsigned int dirty_tx, cur_tx;
 	struct rtl_txc txc = { 0, 0 };
 
 	dirty_tx = tp->dirty_tx;
-	smp_rmb();
-	tx_left = tp->cur_tx - dirty_tx;
-
-	while (tx_left > 0) {
+xmit_race:
+	for (cur_tx = tp->cur_tx; dirty_tx != cur_tx; dirty_tx++) {
 		unsigned int entry = dirty_tx % NUM_TX_DESC;
 		struct ring_info *tx_skb = tp->tx_skb + entry;
 		u32 status;
 
-		rmb();
 		status = le32_to_cpu(tp->TxDescArray[entry].opts1);
-		if (status & DescOwn)
+
+		/* 8168 (only ?) hack: TxPoll requests are lost when the Tx
+		 * packets are too close. Let's kick an extra TxPoll request
+		 * when a burst of start_xmit activity is detected (if it is
+		 * not detected, it is slow enough).
+		 * The NPQ bit is cleared automatically by the chipset.
+		 * The code assumes that the chipset is sane enough to clear
+		 * it at a sensible time.*/
+		if (unlikely(status & DescOwn)) {
+			void __iomem *ioaddr = tp->mmio_addr;
+
+			if (!(RTL_R8(TxPoll) & NPQ)) {
+				netif_tx_lock(dev);
+				RTL_W8(TxPoll, NPQ);
+				netif_tx_unlock(dev);
+				goto done;
+			}
 			break;
+		}
 
 		rtl8169_unmap_tx_skb(&tp->pci_dev->dev, tx_skb,
 				     tp->TxDescArray + entry);
@@ -5667,10 +5682,15 @@ static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp)
 			dev_kfree_skb(skb);
 			tx_skb->skb = NULL;
 		}
-		dirty_tx++;
-		tx_left--;
 	}
 
+	/* Rationale: if chipset stopped DMAing, enforce TxPoll write either
+	 * here or in start_xmit. If chipset is still DMAing, this code
+	 * will be run later anyway. */
+	smp_mb();
+	if (cur_tx != tp->cur_tx)
+		goto xmit_race;
+done:
 	u64_stats_update_begin(&tx_stats->syncp);
 	tx_stats->packets += txc.packets;
 	tx_stats->bytes += txc.bytes;
@@ -5692,17 +5712,6 @@ static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp)
 		    TX_FRAGS_READY_FOR(tp, MAX_SKB_FRAGS)) {
 			netif_wake_queue(dev);
 		}
-		/*
-		 * 8168 hack: TxPoll requests are lost when the Tx packets are
-		 * too close. Let's kick an extra TxPoll request when a burst
-		 * of start_xmit activity is detected (if it is not detected,
-		 * it is slow enough). -- FR
-		 */
-		if (tp->cur_tx != dirty_tx) {
-			void __iomem *ioaddr = tp->mmio_addr;
-
-			RTL_W8(TxPoll, NPQ);
-		}
 	}
 }
 
-- 
1.7.7.6


^ permalink raw reply related

* Re: TCPBacklogDrops during aggressive bursts of traffic
From: Alexander Duyck @ 2012-05-23 22:03 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Kieran Mansley, Jeff Kirsher, Ben Hutchings, netdev
In-Reply-To: <1337809034.3361.3487.camel@edumazet-glaptop>

On 05/23/2012 02:37 PM, Eric Dumazet wrote:
> On Wed, 2012-05-23 at 14:19 -0700, Alexander Duyck wrote:
>> On 05/23/2012 10:10 AM, Alexander Duyck wrote:
>>> On 05/23/2012 09:39 AM, Eric Dumazet wrote:
>>>> On Wed, 2012-05-23 at 18:12 +0200, Eric Dumazet wrote:
>>>>
>>>>> With current driver, a MTU=1500 frame uses :
>>>>>
>>>>> sk_buff (256 bytes)
>>>>> skb->head : 1024 bytes  (or more exactly now : 512 + 384)
>>>> By the way, NET_SKB_PAD adds 64 bytes so it's 64 + 512 + 384 = 960
>>> Actually pahole seems to be indicating to me the size of skb_shared_info
>>> is 320, unless something has changed in the last few days.
>>>
>>> When I get a chance I will try to remember to reduce the ixgbe header
>>> size to 256 which should also help.  The only reason it is set to 512
>>> was to deal with the fact that the old alloc_skb code wasn't aligning
>>> the shared info with the end of whatever size was allocated and so the
>>> 512 was an approximation to make better use of the 1K slab allocation
>>> back when we still were using hardware packet split.  That should help
>>> to improve the page utilization for the headers since that would
>>> increase the uses of a page from 4 to 6 for the skb head frag, and it
>>> would drop truesize by another 256 bytes.
>>>
>>> Thanks,
>>>
>>> Alex
>> Here is the patch for review.  I have submitted the official patch to Jeff 
>> so that it can go through his tree for testing, validation, and submission 
>> once Dave's tree opens back up.
>>
>> ---
>>
>> The recent changes to netdev_alloc_skb make it so that the size of
>> the buffer now has a more direct input on the truesize.  So in
>> order to make best use of the piece of a page we are allocated I am
>> reducing the IXGBE_RX_HDR_SIZE to 256 so that our truesize will be reduced
>> by 256 bytes as well.
>>
>> This should result in performance improvements since the number of uses per
>> page should increase from 4 to 6 in the case of a 4K page.  In addition we
>> should see socket performance improvements due to the truesize dropping
>> to less than 1K for buffers less than 256 bytes.
>>
>> Not-Yet-Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>> ---
>>
>>  drivers/net/ethernet/intel/ixgbe/ixgbe.h      |   15 ++++++++-------
>>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |    4 ++--
>>  2 files changed, 10 insertions(+), 9 deletions(-)
>>
>>
>> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
>> index 402dd66..468e4ab 100644
>> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
>> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
>> @@ -77,17 +77,18 @@
>>  #define IXGBE_MAX_FCPAUSE		 0xFFFF
>>  
>>  /* Supported Rx Buffer Sizes */
>> -#define IXGBE_RXBUFFER_512   512    /* Used for packet split */
>> +#define IXGBE_RXBUFFER_256    256  /* Used for skb receive header */
>>  #define IXGBE_MAX_RXBUFFER  16384  /* largest size for a single descriptor */
>>  
>>  /*
>> - * NOTE: netdev_alloc_skb reserves up to 64 bytes, NET_IP_ALIGN mans we
>> - * reserve 2 more, and skb_shared_info adds an additional 384 bytes more,
>> - * this adds up to 512 bytes of extra data meaning the smallest allocation
>> - * we could have is 1K.
>> - * i.e. RXBUFFER_512 --> size-1024 slab
>> + * NOTE: netdev_alloc_skb reserves up to 64 bytes, NET_IP_ALIGN means we
>> + * reserve 64 more, and skb_shared_info adds an additional 320 bytes more,
>> + * this adds up to 448 bytes of extra data.
>> + *
>> + * Since netdev_alloc_skb now allocates a page fragment we can use a value
>> + * of 256 and the resultant skb will have a truesize of 960 or less.
>>   */
>> -#define IXGBE_RX_HDR_SIZE IXGBE_RXBUFFER_512
>> +#define IXGBE_RX_HDR_SIZE IXGBE_RXBUFFER_256
>>  
>>  #define MAXIMUM_ETHERNET_VLAN_SIZE (ETH_FRAME_LEN + ETH_FCS_LEN + VLAN_HLEN)
>>  
>> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>> index 7f92e40..f92b31a 100644
>> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>> @@ -1520,8 +1520,8 @@ static bool ixgbe_cleanup_headers(struct ixgbe_ring *rx_ring,
>>  	 * 60 bytes if the skb->len is less than 60 for skb_pad.
>>  	 */
>>  	pull_len = skb_frag_size(frag);
>> -	if (pull_len > 256)
>> -		pull_len = ixgbe_get_headlen(va, pull_len);
>> +	if (pull_len > IXGBE_RX_HDR_SIZE)
>> +		pull_len = ixgbe_get_headlen(va, IXGBE_RX_HDR_SIZE);
>>  
>>  	/* align pull length to size of long to optimize memcpy performance */
>>  	skb_copy_to_linear_data(skb, va, ALIGN(pull_len, sizeof(long)));
>>
>
> By the way you should reword the comment about NET_IP_ALIGN
>
> On x86 NET_IP_ALIGN is 0, so we don't 'reserve 64 bytes more'
On x86 yes.  On other platforms that don't clear the value it adds 2
more bytes, which after cache alignment will likely cost another 64.  I
am really just stating the worst case scenario here.  This is why I end
the comment with "960 or less".

> -> 896 bytes
.. or 832 bytes if you really tweak settings and get the sk_buff down to
192 bytes.  :-)
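
To make the arithmetic behind those numbers explicit, here is a rough
sketch.  It is not kernel code: the ex_* names are made up for
illustration, and the constants are just the figures quoted in this
thread (64 byte cache lines, 64 bytes of NET_SKB_PAD, a 320 byte
skb_shared_info, 4K pages), all of which depend on arch and config.

```c
#include <stddef.h>

/* Sizes taken from this thread; they vary with arch and kernel config. */
enum {
	EX_CACHE_LINE  = 64,   /* assumed cache line size */
	EX_NET_SKB_PAD = 64,   /* headroom reserved by netdev_alloc_skb */
	EX_SHINFO_SIZE = 320,  /* sizeof(struct skb_shared_info) per pahole */
	EX_PAGE_SIZE   = 4096, /* assumed 4K page */
};

/* Round x up to the next multiple of the cache line size. */
static size_t ex_align(size_t x)
{
	return (x + EX_CACHE_LINE - 1) & ~(size_t)(EX_CACHE_LINE - 1);
}

/* Bytes of page fragment consumed by one skb head. */
static size_t ex_head_frag_size(size_t hdr_size, size_t net_ip_align)
{
	return ex_align(EX_NET_SKB_PAD + net_ip_align + hdr_size) +
	       ex_align(EX_SHINFO_SIZE);
}

/* truesize is the head fragment plus the sk_buff structure itself. */
static size_t ex_truesize(size_t hdr_size, size_t net_ip_align,
			  size_t sk_buff_size)
{
	return ex_head_frag_size(hdr_size, net_ip_align) +
	       ex_align(sk_buff_size);
}

/* Number of skb heads that fit in a single page. */
static size_t ex_heads_per_page(size_t hdr_size, size_t net_ip_align)
{
	return EX_PAGE_SIZE / ex_head_frag_size(hdr_size, net_ip_align);
}
```

With these inputs ex_truesize(256, 0, 256) comes to 896,
ex_truesize(256, 2, 256) to 960 and ex_truesize(256, 0, 192) to 832,
while ex_heads_per_page() goes from 4 at a header size of 512 to 6 at
256.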

> Also, are you sure :
>
> srrctl |= (IXGBE_RX_HDR_SIZE << IXGBE_SRRCTL_BSIZEHDRSIZE_SHIFT) &  
> 	IXGBE_SRRCTL_BSIZEHDR_MASK; 
>
>
> is still needed in ixgbe_configure_srrctl() , since it uses
> IXGBE_SRRCTL_DESCTYPE_ADV_ONEBUF (non packet split)
That bit is needed for RSC.  Basically we have to specify the maximum
size of a header, even if we are not using packet split.  The value has
to be at least 128.  Since we are already using IXGBE_RX_HDR_SIZE to
limit how much header we pull, I figure it is probably a good idea just
to leave this here.
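
As a rough illustration of what that statement programs, here is a
sketch of the header size encoding.  The EX_* values are assumptions:
they are my recollection of IXGBE_SRRCTL_BSIZEHDRSIZE_SHIFT and
IXGBE_SRRCTL_BSIZEHDR_MASK in ixgbe_type.h and should be checked
against the tree, and the helper names are made up for this example.

```c
#include <stdint.h>

/* Assumed values, recalled from ixgbe_type.h; verify before relying on them. */
#define EX_SRRCTL_BSIZEHDRSIZE_SHIFT 2           /* 64 byte units, field at bit 8 */
#define EX_SRRCTL_BSIZEHDR_MASK      0x00003F00u /* header size field */

/* Encode a header buffer size in bytes into the SRRCTL header field. */
static uint32_t ex_srrctl_hdr_bits(uint32_t hdr_size)
{
	return (hdr_size << EX_SRRCTL_BSIZEHDRSIZE_SHIFT) &
	       EX_SRRCTL_BSIZEHDR_MASK;
}

/* Decode the header field back into bytes (64 byte granularity). */
static uint32_t ex_srrctl_hdr_bytes(uint32_t srrctl)
{
	return ((srrctl & EX_SRRCTL_BSIZEHDR_MASK) >> 8) * 64;
}
```

Under those assumptions a header size of 256 encodes to 0x400, which is
4 units of 64 bytes, so the value stays above the 128 byte minimum.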

Thanks,

Alex


* Re: [PATCH 06/15] batman-adv: Distributed ARP Table - add snooping functions for ARP messages
From: Simon Wunderlich @ 2012-05-23 21:48 UTC (permalink / raw)
  To: David Miller
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	b.a.t.m.a.n-ZwoEplunGu2X36UT3dwllkB+6BGkLq7r, Marek Lindner
In-Reply-To: <201205171953.54891.lindner_marek-LWAfsSFWpa4@public.gmane.org>

Hello David, 

we are in a bit of a pinch here - the DAT feature sent with this
patchset has been in development for a long time, and we need your
decision to move on, as more and more patches depend on it:

 * should we rewrite DAT to use our own ARP table/backend or
 * can we use the ARP neighbor table in another way, maybe after your changes?

We thought that re-using existing infrastructure would be smarter, but if
you disagree, please tell us so - we would like to finally get this feature
upstream and need your input to make the necessary changes.

Thanks
	Simon


On Thu, May 17, 2012 at 07:53:54PM +0800, Marek Lindner wrote:
> 
> David,
> 
> > On Tuesday, May 01, 2012 08:59:04 David Miller wrote:
> > > From: Antonio Quartulli <ordex-GaUfNO9RBHfsrOwW+9ziJQ@public.gmane.org>
> > > Date: Tue, 1 May 2012 00:22:30 +0200
> > > 
> > > > However this patch also contains a procedure which queries the neigh
> > > > table in order to understand whether a given host is known or not.
> > > > Would it be possible to do that in another way (Without manually
> > > > touching the table)?
> > > > 
> > > > Instead, in the next patch (patch 06/15) batman-adv manually increase
> > > > the neigh timeouts. Do you think we should avoid doing that as well?
> > > > If we are allowed to do that, how can we perform the same operation in
> > > > a cleaner way?
> > > > 
> > > > Last question: why can't other modules use exported functions? Are you
> > > > going to change them as well?
> > > 
> > > I really don't have time to discuss your neigh issues right now as I'm
> > > busy speaking at conferences and dealing with the backlog of other
> > > patches.
> > > 
> > > You'll need to find someone else to discuss it with you, sorry.
> > 
> > I hope now is a good moment to bring the questions back onto the table. We
> > still are not sure how to proceed because we have no clear picture of what
> > is going to come and how the exported functions are supposed to be used.
> > 
> > David, if you don't have the time to discuss the ARP handling with us could
> > you name someone who knows your plans and the code equally well ? So far,
> > nobody has stepped up.
> 
> let me add another piece of information: The distributed ARP table does not 
> really depend on the kernel's ARP table. We can easily write our own backend 
> to be totally independent of the kernel's ARP table. Initially, we thought it 
> might be considered a smart move if the code made use of existing kernel 
> infrastructure instead of writing our own storage / user space API / etc, 
> hence duplicating what is already there. But if you feel this is the better 
> way forward we certainly will make the necessary changes.
> 
> Regards,
> Marek
> 



* Re: TCPBacklogDrops during aggressive bursts of traffic
From: Eric Dumazet @ 2012-05-23 21:37 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Kieran Mansley, Jeff Kirsher, Ben Hutchings, netdev
In-Reply-To: <4FBD546A.1030504@intel.com>

On Wed, 2012-05-23 at 14:19 -0700, Alexander Duyck wrote:
> On 05/23/2012 10:10 AM, Alexander Duyck wrote:
> > On 05/23/2012 09:39 AM, Eric Dumazet wrote:
> >> On Wed, 2012-05-23 at 18:12 +0200, Eric Dumazet wrote:
> >>
> >>> With current driver, a MTU=1500 frame uses :
> >>>
> >>> sk_buff (256 bytes)
> >>> skb->head : 1024 bytes  (or more exactly now : 512 + 384)
> >> By the way, NET_SKB_PAD adds 64 bytes so its 64 + 512 + 384 = 960
> > Actually pahole seems to be indicating to me the size of skb_shared_info
> > is 320, unless something has changed in the last few days.
> >
> > When I get a chance I will try to remember to reduce the ixgbe header
> > size to 256 which should also help.  The only reason it is set to 512
> > was to deal with the fact that the old alloc_skb code wasn't aligning
> > the shared info with the end of whatever size was allocated and so the
> > 512 was an approximation to make better use of the 1K slab allocation
> > back when we still were using hardware packet split.  That should help
> > to improve the page utilization for the headers since that would
> > increase the uses of a page from 4 to 6 for the skb head frag, and it
> > would drop truesize by another 256 bytes.
> >
> > Thanks,
> >
> > Alex
> Here is the patch for review.  I have submitted the official patch to Jeff 
> so that it can go through his tree for testing, validation, and submission 
> once Dave's tree opens back up.
> 
> ---
> 
> The recent changes to netdev_alloc_skb actually make it so that the size of
> the buffer now actually has a more direct input on the truesize.  So in
> order to make best use of the piece of a page we are allocated I am
> reducing the IXGBE_RX_HDR_SIZE to 256 so that our truesize will be reduced
> by 256 bytes as well.
> 
> This should result in performance improvements since the number of uses per
> page should increase from 4 to 6 in the case of a 4K page.  In addition we
> should see socket performance improvements due to the truesize dropping
> to less than 1K for buffers less than 256 bytes.
> 
> Not-Yet-Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> ---
> 
>  drivers/net/ethernet/intel/ixgbe/ixgbe.h      |   15 ++++++++-------
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |    4 ++--
>  2 files changed, 10 insertions(+), 9 deletions(-)
> 
> 
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> index 402dd66..468e4ab 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> @@ -77,17 +77,18 @@
>  #define IXGBE_MAX_FCPAUSE		 0xFFFF
>  
>  /* Supported Rx Buffer Sizes */
> -#define IXGBE_RXBUFFER_512   512    /* Used for packet split */
> +#define IXGBE_RXBUFFER_256    256  /* Used for skb receive header */
>  #define IXGBE_MAX_RXBUFFER  16384  /* largest size for a single descriptor */
>  
>  /*
> - * NOTE: netdev_alloc_skb reserves up to 64 bytes, NET_IP_ALIGN mans we
> - * reserve 2 more, and skb_shared_info adds an additional 384 bytes more,
> - * this adds up to 512 bytes of extra data meaning the smallest allocation
> - * we could have is 1K.
> - * i.e. RXBUFFER_512 --> size-1024 slab
> + * NOTE: netdev_alloc_skb reserves up to 64 bytes, NET_IP_ALIGN means we
> + * reserve 64 more, and skb_shared_info adds an additional 320 bytes more,
> + * this adds up to 448 bytes of extra data.
> + *
> + * Since netdev_alloc_skb now allocates a page fragment we can use a value
> + * of 256 and the resultant skb will have a truesize of 960 or less.
>   */
> -#define IXGBE_RX_HDR_SIZE IXGBE_RXBUFFER_512
> +#define IXGBE_RX_HDR_SIZE IXGBE_RXBUFFER_256
>  
>  #define MAXIMUM_ETHERNET_VLAN_SIZE (ETH_FRAME_LEN + ETH_FCS_LEN + VLAN_HLEN)
>  
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index 7f92e40..f92b31a 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -1520,8 +1520,8 @@ static bool ixgbe_cleanup_headers(struct ixgbe_ring *rx_ring,
>  	 * 60 bytes if the skb->len is less than 60 for skb_pad.
>  	 */
>  	pull_len = skb_frag_size(frag);
> -	if (pull_len > 256)
> -		pull_len = ixgbe_get_headlen(va, pull_len);
> +	if (pull_len > IXGBE_RX_HDR_SIZE)
> +		pull_len = ixgbe_get_headlen(va, IXGBE_RX_HDR_SIZE);
>  
>  	/* align pull length to size of long to optimize memcpy performance */
>  	skb_copy_to_linear_data(skb, va, ALIGN(pull_len, sizeof(long)));
> 


By the way you should reword the comment about NET_IP_ALIGN

On x86 NET_IP_ALIGN is 0, so we don't 'reserve 64 bytes more'


-> 896 bytes

Also, are you sure :

srrctl |= (IXGBE_RX_HDR_SIZE << IXGBE_SRRCTL_BSIZEHDRSIZE_SHIFT) &  
	IXGBE_SRRCTL_BSIZEHDR_MASK; 


is still needed in ixgbe_configure_srrctl() , since it uses
IXGBE_SRRCTL_DESCTYPE_ADV_ONEBUF (non packet split)


* [PATCH] fec: Add support for Coldfire M5441x enet-mac.
From: Steven King @ 2012-05-23 21:35 UTC (permalink / raw)
  To: netdev; +Cc: uClinux development list, gerg

Add support for the Freescale Coldfire M5441x; as these parts have an enet-mac,
add a quirk to distinguish them from the other Coldfire parts so we can use 
the existing enet-mac support.

Signed-off-by: Steven King <sfking@fdwdc.com>
---
 drivers/net/ethernet/freescale/Kconfig |    5 ++---
 drivers/net/ethernet/freescale/fec.c   |    6 +++++-
 drivers/net/ethernet/freescale/fec.h   |    3 ++-
 3 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/freescale/Kconfig b/drivers/net/ethernet/freescale/Kconfig
index 3574e14..f9aa244 100644
--- a/drivers/net/ethernet/freescale/Kconfig
+++ b/drivers/net/ethernet/freescale/Kconfig
@@ -7,7 +7,7 @@ config NET_VENDOR_FREESCALE
 	default y
 	depends on FSL_SOC || QUICC_ENGINE || CPM1 || CPM2 || PPC_MPC512x || \
 		   M523x || M527x || M5272 || M528x || M520x || M532x || \
-		   ARCH_MXC || ARCH_MXS || (PPC_MPC52xx && PPC_BESTCOMM)
+		   M5441x || ARCH_MXC || ARCH_MXS || (PPC_MPC52xx && PPC_BESTCOMM)
 	---help---
 	  If you have a network (Ethernet) card belonging to this class, say Y
 	  and read the Ethernet-HOWTO, available from
@@ -22,8 +22,7 @@ if NET_VENDOR_FREESCALE
 
 config FEC
 	tristate "FEC ethernet controller (of ColdFire and some i.MX CPUs)"
-	depends on (M523x || M527x || M5272 || M528x || M520x || M532x || \
-		   ARCH_MXC || SOC_IMX28)
+	depends on (M523x || M527x || M5272 || M528x || M520x || M532x || M5441x || ARCH_MXC || SOC_IMX28)
 	default ARCH_MXC || SOC_IMX28 if ARM
 	select PHYLIB
 	---help---
diff --git a/drivers/net/ethernet/freescale/fec.c b/drivers/net/ethernet/freescale/fec.c
index a12b3f5..4cb1c90 100644
--- a/drivers/net/ethernet/freescale/fec.c
+++ b/drivers/net/ethernet/freescale/fec.c
@@ -93,6 +93,9 @@ static struct platform_device_id fec_devtype[] = {
 		.name = "imx6q-fec",
 		.driver_data = FEC_QUIRK_ENET_MAC | FEC_QUIRK_HAS_GBIT,
 	}, {
+		.name = "enet-fec",
+		.driver_data = FEC_QUIRK_ENET_MAC,
+	}, {
 		/* sentinel */
 	}
 };
@@ -186,7 +189,8 @@ MODULE_PARM_DESC(macaddr, "FEC Ethernet MAC address");
  * account when setting it.
  */
 #if defined(CONFIG_M523x) || defined(CONFIG_M527x) || defined(CONFIG_M528x) || \
-    defined(CONFIG_M520x) || defined(CONFIG_M532x) || defined(CONFIG_ARM)
+    defined(CONFIG_M520x) || defined(CONFIG_M532x) || defined(CONFIG_ARM) || \
+    defined(CONFIG_M5441x)
 #define	OPT_FRAME_SIZE	(PKT_MAXBUF_SIZE << 16)
 #else
 #define	OPT_FRAME_SIZE	0
diff --git a/drivers/net/ethernet/freescale/fec.h b/drivers/net/ethernet/freescale/fec.h
index 8408c62..298cfb7 100644
--- a/drivers/net/ethernet/freescale/fec.h
+++ b/drivers/net/ethernet/freescale/fec.h
@@ -15,7 +15,8 @@
 
 #if defined(CONFIG_M523x) || defined(CONFIG_M527x) || defined(CONFIG_M528x) || \
     defined(CONFIG_M520x) || defined(CONFIG_M532x) || \
-    defined(CONFIG_ARCH_MXC) || defined(CONFIG_SOC_IMX28)
+    defined(CONFIG_ARCH_MXC) || defined(CONFIG_SOC_IMX28) || \
+    defined(CONFIG_M5441x)
 /*
  *	Just figures, Motorola would have to change the offsets for
  *	registers in the same peripheral device on different models


* Re: TCPBacklogDrops during aggressive bursts of traffic
From: Alexander Duyck @ 2012-05-23 21:19 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Kieran Mansley, Jeff Kirsher, Ben Hutchings, netdev
In-Reply-To: <4FBD1A0C.4070606@intel.com>

On 05/23/2012 10:10 AM, Alexander Duyck wrote:
> On 05/23/2012 09:39 AM, Eric Dumazet wrote:
>> On Wed, 2012-05-23 at 18:12 +0200, Eric Dumazet wrote:
>>
>>> With current driver, a MTU=1500 frame uses :
>>>
>>> sk_buff (256 bytes)
>>> skb->head : 1024 bytes  (or more exactly now : 512 + 384)
>> By the way, NET_SKB_PAD adds 64 bytes so its 64 + 512 + 384 = 960
> Actually pahole seems to be indicating to me the size of skb_shared_info
> is 320, unless something has changed in the last few days.
>
> When I get a chance I will try to remember to reduce the ixgbe header
> size to 256 which should also help.  The only reason it is set to 512
> was to deal with the fact that the old alloc_skb code wasn't aligning
> the shared info with the end of whatever size was allocated and so the
> 512 was an approximation to make better use of the 1K slab allocation
> back when we still were using hardware packet split.  That should help
> to improve the page utilization for the headers since that would
> increase the uses of a page from 4 to 6 for the skb head frag, and it
> would drop truesize by another 256 bytes.
>
> Thanks,
>
> Alex
Here is the patch for review.  I have submitted the official patch to Jeff 
so that it can go through his tree for testing, validation, and submission 
once Dave's tree opens back up.

---

The recent changes to netdev_alloc_skb make the size of the buffer a
more direct input to the truesize.  So in order to make best use of the
piece of a page we are allocated I am reducing IXGBE_RX_HDR_SIZE to 256
so that our truesize will be reduced by 256 bytes as well.

This should result in performance improvements since the number of uses per
page should increase from 4 to 6 in the case of a 4K page.  In addition we
should see socket performance improvements due to the truesize dropping
to less than 1K for buffers less than 256 bytes.

Not-Yet-Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

 drivers/net/ethernet/intel/ixgbe/ixgbe.h      |   15 ++++++++-------
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |    4 ++--
 2 files changed, 10 insertions(+), 9 deletions(-)


diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 402dd66..468e4ab 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -77,17 +77,18 @@
 #define IXGBE_MAX_FCPAUSE		 0xFFFF
 
 /* Supported Rx Buffer Sizes */
-#define IXGBE_RXBUFFER_512   512    /* Used for packet split */
+#define IXGBE_RXBUFFER_256    256  /* Used for skb receive header */
 #define IXGBE_MAX_RXBUFFER  16384  /* largest size for a single descriptor */
 
 /*
- * NOTE: netdev_alloc_skb reserves up to 64 bytes, NET_IP_ALIGN mans we
- * reserve 2 more, and skb_shared_info adds an additional 384 bytes more,
- * this adds up to 512 bytes of extra data meaning the smallest allocation
- * we could have is 1K.
- * i.e. RXBUFFER_512 --> size-1024 slab
+ * NOTE: netdev_alloc_skb reserves up to 64 bytes, NET_IP_ALIGN means we
+ * reserve 64 more, and skb_shared_info adds an additional 320 bytes more,
+ * this adds up to 448 bytes of extra data.
+ *
+ * Since netdev_alloc_skb now allocates a page fragment we can use a value
+ * of 256 and the resultant skb will have a truesize of 960 or less.
  */
-#define IXGBE_RX_HDR_SIZE IXGBE_RXBUFFER_512
+#define IXGBE_RX_HDR_SIZE IXGBE_RXBUFFER_256
 
 #define MAXIMUM_ETHERNET_VLAN_SIZE (ETH_FRAME_LEN + ETH_FCS_LEN + VLAN_HLEN)
 
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 7f92e40..f92b31a 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1520,8 +1520,8 @@ static bool ixgbe_cleanup_headers(struct ixgbe_ring *rx_ring,
 	 * 60 bytes if the skb->len is less than 60 for skb_pad.
 	 */
 	pull_len = skb_frag_size(frag);
-	if (pull_len > 256)
-		pull_len = ixgbe_get_headlen(va, pull_len);
+	if (pull_len > IXGBE_RX_HDR_SIZE)
+		pull_len = ixgbe_get_headlen(va, IXGBE_RX_HDR_SIZE);
 
 	/* align pull length to size of long to optimize memcpy performance */
 	skb_copy_to_linear_data(skb, va, ALIGN(pull_len, sizeof(long)));


* [PATCH v7] tilegx network driver: initial support
From: Chris Metcalf @ 2012-05-23 20:42 UTC (permalink / raw)
  To: bhutchings, arnd, David Miller, linux-kernel, netdev
In-Reply-To: <20120520.165546.1211013675964130504.davem@davemloft.net>

This change adds support for the tilegx network driver based on the
GXIO IORPC support in the tilegx software stack, using the on-chip
mPIPE packet processing engine.

Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
---
I worked with the original author of this driver to refactor the
code and conform more closely to conventional Linux coding style.
I'd appreciate any additional feedback - thanks!

 drivers/net/ethernet/tile/Kconfig  |    1 +
 drivers/net/ethernet/tile/Makefile |    4 +-
 drivers/net/ethernet/tile/tilegx.c | 1798 ++++++++++++++++++++++++++++++++++++
 3 files changed, 1801 insertions(+), 2 deletions(-)
 create mode 100644 drivers/net/ethernet/tile/tilegx.c

diff --git a/drivers/net/ethernet/tile/Kconfig b/drivers/net/ethernet/tile/Kconfig
index 2d9218f..9184b61 100644
--- a/drivers/net/ethernet/tile/Kconfig
+++ b/drivers/net/ethernet/tile/Kconfig
@@ -7,6 +7,7 @@ config TILE_NET
 	depends on TILE
 	default y
 	select CRC32
+	select TILE_GXIO_MPIPE if TILEGX
 	---help---
 	  This is a standard Linux network device driver for the
 	  on-chip Tilera Gigabit Ethernet and XAUI interfaces.
diff --git a/drivers/net/ethernet/tile/Makefile b/drivers/net/ethernet/tile/Makefile
index f634f14..0ef9eef 100644
--- a/drivers/net/ethernet/tile/Makefile
+++ b/drivers/net/ethernet/tile/Makefile
@@ -4,7 +4,7 @@
 
 obj-$(CONFIG_TILE_NET) += tile_net.o
 ifdef CONFIG_TILEGX
-tile_net-objs := tilegx.o mpipe.o iorpc_mpipe.o dma_queue.o
+tile_net-y := tilegx.o
 else
-tile_net-objs := tilepro.o
+tile_net-y := tilepro.o
 endif
diff --git a/drivers/net/ethernet/tile/tilegx.c b/drivers/net/ethernet/tile/tilegx.c
new file mode 100644
index 0000000..abfff7f
--- /dev/null
+++ b/drivers/net/ethernet/tile/tilegx.c
@@ -0,0 +1,1798 @@
+/*
+ * Copyright 2012 Tilera Corporation. All Rights Reserved.
+ *
+ *   This program is free software; you can redistribute it and/or
+ *   modify it under the terms of the GNU General Public License
+ *   as published by the Free Software Foundation, version 2.
+ *
+ *   This program is distributed in the hope that it will be useful, but
+ *   WITHOUT ANY WARRANTY; without even the implied warranty of
+ *   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
+ *   NON INFRINGEMENT.  See the GNU General Public License for
+ *   more details.
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/moduleparam.h>
+#include <linux/sched.h>
+#include <linux/kernel.h>      /* printk() */
+#include <linux/slab.h>        /* kmalloc() */
+#include <linux/errno.h>       /* error codes */
+#include <linux/types.h>       /* size_t */
+#include <linux/interrupt.h>
+#include <linux/in.h>
+#include <linux/irq.h>
+#include <linux/netdevice.h>   /* struct device, and other headers */
+#include <linux/etherdevice.h> /* eth_type_trans */
+#include <linux/skbuff.h>
+#include <linux/ioctl.h>
+#include <linux/cdev.h>
+#include <linux/hugetlb.h>
+#include <linux/in6.h>
+#include <linux/timer.h>
+#include <linux/io.h>
+#include <linux/ctype.h>
+#include <linux/ip.h>
+#include <linux/tcp.h>
+
+#include <asm/checksum.h>
+#include <asm/homecache.h>
+#include <gxio/mpipe.h>
+#include <arch/sim.h>
+
+/* Default transmit lockup timeout period, in jiffies. */
+#define TILE_NET_TIMEOUT (5 * HZ)
+
+/* The maximum number of distinct channels (idesc.channel is 5 bits). */
+#define TILE_NET_CHANNELS 32
+
+/* Maximum number of idescs to handle per "poll". */
+#define TILE_NET_BATCH 128
+
+/* Maximum number of packets to handle per "poll". */
+#define TILE_NET_WEIGHT 64
+
+/* Number of entries in each iqueue. */
+#define IQUEUE_ENTRIES 512
+
+/* Number of entries in each equeue. */
+#define EQUEUE_ENTRIES 2048
+
+/* Total header bytes per equeue slot.  Must be big enough for 2 bytes
+ * of NET_IP_ALIGN alignment, plus 14 bytes (?) of L2 header, plus up to
+ * 60 bytes of actual TCP header.  We round up to align to cache lines.
+ */
+#define HEADER_BYTES 128
+
+/* Maximum completions per cpu per device (must be a power of two).
+ * ISSUE: What is the right number here?  If this is too small, then
+ * egress might block waiting for free space in a completions array.
+ * ISSUE: At the least, allocate these only for initialized echannels.
+ */
+#define TILE_NET_MAX_COMPS 64
+
+#define MAX_FRAGS (MAX_SKB_FRAGS + 1)
+
+/* Size of completions data to allocate.
+ * ISSUE: Probably more than needed since we don't use all the channels.
+ */
+#define COMPS_SIZE (TILE_NET_CHANNELS * sizeof(struct tile_net_comps))
+
+/* Size of NotifRing data to allocate. */
+#define NOTIF_RING_SIZE (IQUEUE_ENTRIES * sizeof(gxio_mpipe_idesc_t))
+
+MODULE_AUTHOR("Tilera Corporation");
+MODULE_LICENSE("GPL");
+
+/* A "packet fragment" (a chunk of memory). */
+struct frag {
+	void *buf;
+	size_t length;
+};
+
+/* A single completion. */
+struct tile_net_comp {
+	/* The "complete_count" when the completion will be complete. */
+	s64 when;
+	/* The buffer to be freed when the completion is complete. */
+	struct sk_buff *skb;
+};
+
+/* The completions for a given cpu and device. */
+struct tile_net_comps {
+	/* The completions. */
+	struct tile_net_comp comp_queue[TILE_NET_MAX_COMPS];
+	/* The number of completions used. */
+	unsigned long comp_next;
+	/* The number of completions freed. */
+	unsigned long comp_last;
+};
+
+/* Info for a specific cpu. */
+struct tile_net_info {
+	/* The NAPI struct. */
+	struct napi_struct napi;
+	/* Packet queue. */
+	gxio_mpipe_iqueue_t iqueue;
+	/* Our cpu. */
+	int my_cpu;
+	/* True if iqueue is valid. */
+	bool has_iqueue;
+	/* NAPI flags. */
+	bool napi_added;
+	bool napi_enabled;
+	/* Number of small sk_buffs which must still be provided. */
+	unsigned int num_needed_small_buffers;
+	/* Number of large sk_buffs which must still be provided. */
+	unsigned int num_needed_large_buffers;
+	/* A timer for handling egress completions. */
+	struct timer_list egress_timer;
+	/* True if "egress_timer" is scheduled. */
+	bool egress_timer_scheduled;
+	/* Comps for each egress channel. */
+	struct tile_net_comps *comps_for_echannel[TILE_NET_CHANNELS];
+};
+
+/* Info for egress on a particular egress channel. */
+struct tile_net_egress {
+	/* The "equeue". */
+	gxio_mpipe_equeue_t *equeue;
+	/* The headers for TSO. */
+	unsigned char *headers;
+};
+
+/* Info for a specific device. */
+struct tile_net_priv {
+	/* Our network device. */
+	struct net_device *dev;
+	/* The primary link. */
+	gxio_mpipe_link_t link;
+	/* The primary channel, if open, else -1. */
+	int channel;
+	/* The "loopify" egress link, if needed. */
+	gxio_mpipe_link_t loopify_link;
+	/* The "loopify" egress channel, if open, else -1. */
+	int loopify_channel;
+	/* The egress channel (channel or loopify_channel). */
+	int echannel;
+	/* Total stats. */
+	struct net_device_stats stats;
+};
+
+/* Egress info, indexed by "priv->echannel" (lazily created as needed). */
+static struct tile_net_egress egress_for_echannel[TILE_NET_CHANNELS];
+
+/* Devices currently associated with each channel.
+ * NOTE: The array entry can become NULL after ifconfig down, but
+ * we do not free the underlying net_device structures, so it is
+ * safe to use a pointer after reading it from this array.
+ */
+static struct net_device *tile_net_devs_for_channel[TILE_NET_CHANNELS];
+
+/* A mutex for "tile_net_devs_for_channel". */
+static DEFINE_MUTEX(tile_net_devs_for_channel_mutex);
+
+/* The per-cpu info. */
+static DEFINE_PER_CPU(struct tile_net_info, per_cpu_info);
+
+/* The "context" for all devices. */
+static gxio_mpipe_context_t context;
+
+/* The small/large "buffer stacks". */
+static int small_buffer_stack = -1;
+static int large_buffer_stack = -1;
+
+/* Amount of memory allocated for each buffer stack. */
+static size_t buffer_stack_size;
+
+/* The actual memory allocated for the buffer stacks. */
+static void *small_buffer_stack_va;
+static void *large_buffer_stack_va;
+
+/* The buckets. */
+static int first_bucket = -1;
+static int num_buckets = 1;
+
+/* The ingress irq. */
+static int ingress_irq = -1;
+
+/* Text value of tile_net.cpus if passed as a module parameter. */
+static char *network_cpus_string;
+
+/* The actual cpus in "network_cpus". */
+static struct cpumask network_cpus_map;
+
+/* If "loopify=LINK" was specified, this is "LINK". */
+static char *loopify_link_name;
+
+/* If "tile_net.custom" was specified, this is non-NULL. */
+static char *custom_str;
+
+/* The "tile_net.cpus" argument specifies the cpus that are dedicated
+ * to handle ingress packets.
+ *
+ * The parameter should be in the form "tile_net.cpus=m-n[,x-y]", where
+ * m, n, x, y are integer numbers that represent the cpus that can be
+ * neither a dedicated cpu nor a dataplane cpu.
+ */
+static bool network_cpus_init(void)
+{
+	char buf[1024];
+	int rc;
+
+	if (network_cpus_string == NULL)
+		return false;
+
+	rc = cpulist_parse_crop(network_cpus_string, &network_cpus_map);
+	if (rc != 0) {
+		pr_warn("tile_net.cpus=%s: malformed cpu list\n",
+			network_cpus_string);
+		return false;
+	}
+
+	/* Remove dedicated cpus. */
+	cpumask_and(&network_cpus_map, &network_cpus_map, cpu_possible_mask);
+
+	if (cpumask_empty(&network_cpus_map)) {
+		pr_warn("Ignoring empty tile_net.cpus='%s'.\n",
+			network_cpus_string);
+		return false;
+	}
+
+	cpulist_scnprintf(buf, sizeof(buf), &network_cpus_map);
+	pr_info("Linux network CPUs: %s\n", buf);
+	return true;
+}
+
+module_param_named(cpus, network_cpus_string, charp, 0444);
+MODULE_PARM_DESC(cpus, "cpulist of cores that handle network interrupts");
+
+/* The "tile_net.loopify=LINK" argument causes the named device to
+ * actually use "loop0" for ingress, and "loop1" for egress.  This
+ * allows an app to sit between the actual link and linux, passing
+ * (some) packets along to linux, and forwarding (some) packets sent
+ * out by linux.
+ */
+module_param_named(loopify, loopify_link_name, charp, 0444);
+MODULE_PARM_DESC(loopify, "name the device to use loop0/1 for ingress/egress");
+
+/* The "tile_net.custom" argument causes us to ignore the "conventional"
+ * classifier metadata, in particular, the "l2_offset".
+ */
+module_param_named(custom, custom_str, charp, 0444);
+MODULE_PARM_DESC(custom, "indicates a (heavily) customized classifier");
+
+/* Atomically update a statistics field.
+ * Note that on TILE-Gx, this operation is fire-and-forget on the
+ * issuing core (single-cycle dispatch) and takes only a few cycles
+ * longer than a regular store when the request reaches the home cache.
+ * No expensive bus management overhead is required.
+ */
+static void tile_net_stats_add(unsigned long value, unsigned long *field)
+{
+	BUILD_BUG_ON(sizeof(atomic_long_t) != sizeof(unsigned long));
+	atomic_long_add(value, (atomic_long_t *)field);
+}
+
+/* Allocate and push a buffer. */
+static bool tile_net_provide_buffer(bool small)
+{
+	int stack = small ? small_buffer_stack : large_buffer_stack;
+	const unsigned long buffer_alignment = 128;
+	struct sk_buff *skb;
+	int len;
+
+	len = sizeof(struct sk_buff **) + buffer_alignment;
+	len += (small ? 128 : 1664);
+	skb = dev_alloc_skb(len);
+	if (skb == NULL)
+		return false;
+
+	/* Make room for a back-pointer to 'skb' and guarantee alignment. */
+	skb_reserve(skb, sizeof(struct sk_buff **));
+	skb_reserve(skb, -(long)skb->data & (buffer_alignment - 1));
+
+	/* Save a back-pointer to 'skb'. */
+	*(struct sk_buff **)(skb->data - sizeof(struct sk_buff **)) = skb;
+
+	/* Make sure "skb" and the back-pointer have been flushed. */
+	wmb();
+
+	gxio_mpipe_push_buffer(&context, stack,
+			       (void *)va_to_tile_io_addr(skb->data));
+
+	return true;
+}
+
+static void tile_net_pop_all_buffers(int stack)
+{
+	void *va;
+	while ((va = gxio_mpipe_pop_buffer(&context, stack)) != NULL) {
+		struct sk_buff **skb_ptr = va - sizeof(*skb_ptr);
+		struct sk_buff *skb = *skb_ptr;
+		dev_kfree_skb_irq(skb);
+	}
+}
+
+/* Provide linux buffers to mPIPE. */
+static void tile_net_provide_needed_buffers(struct tile_net_info *info)
+{
+	while (info->num_needed_small_buffers != 0) {
+		if (!tile_net_provide_buffer(true))
+			goto oops;
+		info->num_needed_small_buffers--;
+	}
+
+	while (info->num_needed_large_buffers != 0) {
+		if (!tile_net_provide_buffer(false))
+			goto oops;
+		info->num_needed_large_buffers--;
+	}
+
+	return;
+
+oops:
+	/* Add a description to the page allocation failure dump. */
+	pr_notice("Tile %d still needs some buffers\n", info->my_cpu);
+}
+
+static inline bool filter_packet(struct net_device *dev, void *buf)
+{
+	/* Filter packets received before we're up. */
+	if (dev == NULL || !(dev->flags & IFF_UP))
+		return true;
+
+	/* Filter out packets that aren't for us. */
+	if (!(dev->flags & IFF_PROMISC) &&
+	    !is_multicast_ether_addr(buf) &&
+	    compare_ether_addr(dev->dev_addr, buf) != 0)
+		return true;
+
+	return false;
+}
+
+/* Convert a raw mpipe buffer to its matching skb pointer. */
+static struct sk_buff *mpipe_buf_to_skb(void *va)
+{
+	/* Acquire the associated "skb". */
+	struct sk_buff **skb_ptr = va - sizeof(*skb_ptr);
+	struct sk_buff *skb = *skb_ptr;
+
+	/* Paranoia. */
+	if (skb->data != va) {
+		/* Panic here since there's a reasonable chance
+		 * that corrupt buffers means generic memory
+		 * corruption, with unpredictable system effects.
+		 */
+		panic("Corrupt linux buffer! "
+		      "va=%p, skb=%p, skb->data=%p",
+		      va, skb, skb->data);
+	}
+
+	return skb;
+}
+
+static void tile_net_receive_skb(struct net_device *dev, struct sk_buff *skb,
+				 struct tile_net_info *info,
+				 gxio_mpipe_idesc_t *idesc, unsigned long len)
+{
+	struct tile_net_priv *priv = netdev_priv(dev);
+
+	/* Encode the actual packet length. */
+	skb_put(skb, len);
+
+	skb->protocol = eth_type_trans(skb, dev);
+
+	/* Acknowledge "good" hardware checksums. */
+	if (idesc->cs && idesc->csum_seed_val == 0xFFFF)
+		skb->ip_summed = CHECKSUM_UNNECESSARY;
+
+	netif_receive_skb(skb);
+
+	/* Update stats. */
+	tile_net_stats_add(1, &priv->stats.rx_packets);
+	tile_net_stats_add(len, &priv->stats.rx_bytes);
+
+	/* Need a new buffer. */
+	if (idesc->size == GXIO_MPIPE_BUFFER_SIZE_128)
+		info->num_needed_small_buffers++;
+	else
+		info->num_needed_large_buffers++;
+}
+
+/* Handle a packet.  Return true if "processed", false if "filtered". */
+static bool tile_net_handle_packet(struct tile_net_info *info,
+				   gxio_mpipe_idesc_t *idesc)
+{
+	struct net_device *dev = tile_net_devs_for_channel[idesc->channel];
+	uint8_t l2_offset;
+	void *va;
+	void *buf;
+	unsigned long len;
+	bool filter;
+
+	/* Drop packets for which no buffer was available.
+	 * NOTE: This happens under heavy load.
+	 */
+	if (idesc->be) {
+		gxio_mpipe_iqueue_consume(&info->iqueue, idesc);
+		if (net_ratelimit())
+			pr_info("Dropping packet (insufficient buffers).\n");
+		return false;
+	}
+
+	/* Get the "l2_offset", if allowed. */
+	l2_offset = custom_str ? 0 : gxio_mpipe_idesc_get_l2_offset(idesc);
+
+	/* Get the raw buffer VA (includes "headroom"). */
+	va = tile_io_addr_to_va((unsigned long)(long)idesc->va);
+
+	/* Get the actual packet start/length. */
+	buf = va + l2_offset;
+	len = idesc->l2_size - l2_offset;
+
+	/* Point "va" at the raw buffer. */
+	va -= NET_IP_ALIGN;
+
+	filter = filter_packet(dev, buf);
+	if (filter) {
+		/* FIXME: Update "drop" statistics. */
+		gxio_mpipe_iqueue_drop(&info->iqueue, idesc);
+	} else {
+		struct sk_buff *skb = mpipe_buf_to_skb(va);
+
+		/* Skip headroom, and any custom header. */
+		skb_reserve(skb, NET_IP_ALIGN + l2_offset);
+
+		tile_net_receive_skb(dev, skb, info, idesc, len);
+	}
+
+	gxio_mpipe_iqueue_consume(&info->iqueue, idesc);
+	return !filter;
+}
+
+/* Handle some packets for the current CPU.
+ *
+ * This function handles up to TILE_NET_BATCH idescs per call.
+ *
+ * ISSUE: Since we do not provide new buffers until this function is
+ * complete, we must initially provide enough buffers for each network
+ * cpu to fill its iqueue and also its batched idescs.
+ *
+ * ISSUE: The "rotting packet" race condition occurs if a packet
+ * arrives after the queue appears to be empty, and before the
+ * hypervisor interrupt is re-enabled.
+ */
+static int tile_net_poll(struct napi_struct *napi, int budget)
+{
+	struct tile_net_info *info = &__get_cpu_var(per_cpu_info);
+	unsigned int work = 0;
+	gxio_mpipe_idesc_t *idesc;
+	int i, n;
+
+	/* Process packets. */
+	while ((n = gxio_mpipe_iqueue_try_peek(&info->iqueue, &idesc)) > 0) {
+		for (i = 0; i < n; i++) {
+			if (i == TILE_NET_BATCH)
+				goto done;
+			if (tile_net_handle_packet(info, idesc + i)) {
+				if (++work >= budget)
+					goto done;
+			}
+		}
+	}
+
+	/* There are no packets left. */
+	napi_complete(&info->napi);
+
+	/* Re-enable hypervisor interrupts. */
+	gxio_mpipe_enable_notif_ring_interrupt(&context, info->iqueue.ring);
+
+	/* HACK: Avoid the "rotting packet" problem. */
+	if (gxio_mpipe_iqueue_try_peek(&info->iqueue, &idesc) > 0)
+		napi_schedule(&info->napi);
+
+	/* ISSUE: Handle completions? */
+
+done:
+	tile_net_provide_needed_buffers(info);
+
+	return work;
+}
+
+/* Handle an ingress interrupt on the current cpu. */
+static irqreturn_t tile_net_handle_ingress_irq(int irq, void *unused)
+{
+	struct tile_net_info *info = &__get_cpu_var(per_cpu_info);
+	napi_schedule(&info->napi);
+	return IRQ_HANDLED;
+}
+
+/* Free some completions.  This must be called with interrupts blocked. */
+static void tile_net_free_comps(gxio_mpipe_equeue_t *equeue,
+				struct tile_net_comps *comps,
+				int limit, bool force_update)
+{
+	int n = 0;
+	while (comps->comp_last < comps->comp_next) {
+		unsigned int cid = comps->comp_last % TILE_NET_MAX_COMPS;
+		struct tile_net_comp *comp = &comps->comp_queue[cid];
+		if (!gxio_mpipe_equeue_is_complete(equeue, comp->when,
+						   force_update || n == 0))
+			return;
+		dev_kfree_skb_irq(comp->skb);
+		comps->comp_last++;
+		if (++n == limit)
+			return;
+	}
+}
+
+/* Add a completion.  This must be called with interrupts blocked.
+ *
+ * FIXME: We should probably have stopped the queue earlier rather
+ * than having to wait here.
+ */
+static void add_comp(gxio_mpipe_equeue_t *equeue,
+		     struct tile_net_comps *comps,
+		     uint64_t when, struct sk_buff *skb)
+{
+	int cid;
+
+	/* Wait for a free completion entry, if needed. */
+	while (comps->comp_next - comps->comp_last >= TILE_NET_MAX_COMPS - 1)
+		tile_net_free_comps(equeue, comps, 32, false);
+
+	/* Update the completions array. */
+	cid = comps->comp_next % TILE_NET_MAX_COMPS;
+	comps->comp_queue[cid].when = when;
+	comps->comp_queue[cid].skb = skb;
+	comps->comp_next++;
+}
+
+/* Make sure the egress timer is scheduled.
+ *
+ * Note that we use "schedule if not scheduled" logic instead of the more
+ * obvious "reschedule" logic, because "reschedule" is fairly expensive.
+ */
+static void tile_net_schedule_egress_timer(struct tile_net_info *info)
+{
+	if (!info->egress_timer_scheduled) {
+		mod_timer_pinned(&info->egress_timer, jiffies + 1);
+		info->egress_timer_scheduled = true;
+	}
+}
+
+/* The "function" for "info->egress_timer".
+ *
+ * This timer will reschedule itself as long as there are any pending
+ * completions expected for this tile.
+ */
+static void tile_net_handle_egress_timer(unsigned long arg)
+{
+	struct tile_net_info *info = (struct tile_net_info *)arg;
+	unsigned long irqflags;
+	bool pending = false;
+	int i;
+
+	local_irq_save(irqflags);
+
+	/* The timer is no longer scheduled. */
+	info->egress_timer_scheduled = false;
+
+	/* Free all possible comps for this tile. */
+	for (i = 0; i < TILE_NET_CHANNELS; i++) {
+		struct tile_net_egress *egress = &egress_for_echannel[i];
+		struct tile_net_comps *comps = info->comps_for_echannel[i];
+		if (comps->comp_last >= comps->comp_next)
+			continue;
+		tile_net_free_comps(egress->equeue, comps, -1, true);
+		pending = pending || (comps->comp_last < comps->comp_next);
+	}
+
+	/* Reschedule timer if needed. */
+	if (pending)
+		tile_net_schedule_egress_timer(info);
+
+	local_irq_restore(irqflags);
+}
+
+/* Helper function for "tile_net_update()".
+ * "dev" (i.e. arg) is the device being brought up or down,
+ * or NULL if all devices are now down.
+ */
+static void tile_net_update_cpu(void *arg)
+{
+	struct net_device *dev = arg;
+	struct tile_net_info *info = &__get_cpu_var(per_cpu_info);
+
+	if (!info->has_iqueue)
+		return;
+
+	if (dev != NULL) {
+		if (!info->napi_added) {
+			netif_napi_add(dev, &info->napi,
+				       tile_net_poll, TILE_NET_WEIGHT);
+			info->napi_added = true;
+		}
+		if (!info->napi_enabled) {
+			napi_enable(&info->napi);
+			info->napi_enabled = true;
+		}
+		enable_percpu_irq(ingress_irq, 0);
+	} else {
+		disable_percpu_irq(ingress_irq);
+		if (info->napi_enabled) {
+			napi_disable(&info->napi);
+			info->napi_enabled = false;
+		}
+		/* FIXME: Drain the iqueue. */
+	}
+}
+
+/* Helper function for tile_net_open() and tile_net_stop().
+ * Always called under tile_net_devs_for_channel_mutex.
+ */
+static int tile_net_update(struct net_device *dev)
+{
+	static gxio_mpipe_rules_t rules;  /* too big to fit on the stack */
+	bool saw_channel = false;
+	int channel;
+	int rc;
+	int cpu;
+
+	gxio_mpipe_rules_init(&rules, &context);
+
+	for (channel = 0; channel < TILE_NET_CHANNELS; channel++) {
+		if (tile_net_devs_for_channel[channel] == NULL)
+			continue;
+		if (!saw_channel) {
+			saw_channel = true;
+			gxio_mpipe_rules_begin(&rules, first_bucket,
+					       num_buckets, NULL);
+			gxio_mpipe_rules_set_headroom(&rules, NET_IP_ALIGN);
+		}
+		gxio_mpipe_rules_add_channel(&rules, channel);
+	}
+
+	/* NOTE: This can fail if there is no classifier.
+	 * ISSUE: Can anything else cause it to fail?
+	 */
+	rc = gxio_mpipe_rules_commit(&rules);
+	if (rc != 0) {
+		netdev_warn(dev, "gxio_mpipe_rules_commit failed: %d\n", rc);
+		return -EIO;
+	}
+
+	/* Update all cpus, sequentially (to protect "netif_napi_add()"). */
+	for_each_online_cpu(cpu)
+		smp_call_function_single(cpu, tile_net_update_cpu,
+					 (saw_channel ? dev : NULL), 1);
+
+	/* HACK: Allow packets to flow in the simulator. */
+	if (saw_channel)
+		sim_enable_mpipe_links(0, -1);
+
+	return 0;
+}
+
+/* Allocate and initialize mpipe buffer stacks, and register them in
+ * the mPIPE TLBs, for both small and large packet sizes.
+ * This routine supports tile_net_init_mpipe(), below.
+ */
+static int init_buffer_stacks(struct net_device *dev, int num_buffers)
+{
+	pte_t hash_pte = pte_set_home((pte_t) { 0 }, PAGE_HOME_HASH);
+	int rc;
+
+	/* Compute stack bytes; we round up to 64KB and then use
+	 * alloc_pages_exact() so we get the required 64KB alignment as well.
+	 */
+	buffer_stack_size =
+		ALIGN(gxio_mpipe_calc_buffer_stack_bytes(num_buffers),
+		      64 * 1024);
+
+	/* Allocate two buffer stack indices. */
+	rc = gxio_mpipe_alloc_buffer_stacks(&context, 2, 0, 0);
+	if (rc < 0) {
+		netdev_err(dev, "gxio_mpipe_alloc_buffer_stacks failed: %d\n",
+			   rc);
+		return rc;
+	}
+	small_buffer_stack = rc;
+	large_buffer_stack = rc + 1;
+
+	/* Allocate the small buffer stack. */
+	small_buffer_stack_va =
+		alloc_pages_exact(buffer_stack_size, GFP_KERNEL);
+	if (small_buffer_stack_va == NULL) {
+		netdev_err(dev,
+			   "Could not alloc %zd bytes for buffer stacks\n",
+			   buffer_stack_size);
+		return -ENOMEM;
+	}
+	rc = gxio_mpipe_init_buffer_stack(&context, small_buffer_stack,
+					  GXIO_MPIPE_BUFFER_SIZE_128,
+					  small_buffer_stack_va,
+					  buffer_stack_size, 0);
+	if (rc != 0) {
+		netdev_err(dev, "gxio_mpipe_init_buffer_stack failed: %d\n",
+			   rc);
+		return rc;
+	}
+	rc = gxio_mpipe_register_client_memory(&context, small_buffer_stack,
+					       hash_pte, 0);
+	if (rc != 0) {
+		netdev_err(dev,
+			   "gxio_mpipe_register_client_memory failed: %d\n",
+			   rc);
+		return rc;
+	}
+
+	/* Allocate the large buffer stack. */
+	large_buffer_stack_va =
+		alloc_pages_exact(buffer_stack_size, GFP_KERNEL);
+	if (large_buffer_stack_va == NULL) {
+		netdev_err(dev,
+			   "Could not alloc %zd bytes for buffer stacks\n",
+			   buffer_stack_size);
+		return -ENOMEM;
+	}
+	rc = gxio_mpipe_init_buffer_stack(&context, large_buffer_stack,
+					  GXIO_MPIPE_BUFFER_SIZE_1664,
+					  large_buffer_stack_va,
+					  buffer_stack_size, 0);
+	if (rc != 0) {
+		netdev_err(dev, "gxio_mpipe_init_buffer_stack failed: %d\n",
+			   rc);
+		return rc;
+	}
+	rc = gxio_mpipe_register_client_memory(&context, large_buffer_stack,
+					       hash_pte, 0);
+	if (rc != 0) {
+		netdev_err(dev,
+			   "gxio_mpipe_register_client_memory failed: %d\n",
+			   rc);
+		return rc;
+	}
+
+	return 0;
+}
+
+/* Allocate per-cpu resources (memory for completions and idescs).
+ * This routine supports tile_net_init_mpipe(), below.
+ */
+static int alloc_percpu_mpipe_resources(struct net_device *dev,
+					int cpu, int ring)
+{
+	struct tile_net_info *info = &per_cpu(per_cpu_info, cpu);
+	int order, i, rc;
+	struct page *page;
+	void *addr;
+
+	/* Allocate the "comps". */
+	order = get_order(COMPS_SIZE);
+	page = homecache_alloc_pages(GFP_KERNEL, order, cpu);
+	if (page == NULL) {
+		netdev_err(dev, "Failed to alloc %zd bytes comps memory\n",
+			   COMPS_SIZE);
+		return -ENOMEM;
+	}
+	addr = pfn_to_kaddr(page_to_pfn(page));
+	memset(addr, 0, COMPS_SIZE);
+	for (i = 0; i < TILE_NET_CHANNELS; i++)
+		info->comps_for_echannel[i] =
+			addr + i * sizeof(struct tile_net_comps);
+
+	/* If this is a network cpu, create an iqueue. */
+	if (cpu_isset(cpu, network_cpus_map)) {
+		order = get_order(NOTIF_RING_SIZE);
+		page = homecache_alloc_pages(GFP_KERNEL, order, cpu);
+		if (page == NULL) {
+			netdev_err(dev,
+				   "Failed to alloc %zd bytes iqueue memory\n",
+				   NOTIF_RING_SIZE);
+			return -ENOMEM;
+		}
+		addr = pfn_to_kaddr(page_to_pfn(page));
+		rc = gxio_mpipe_iqueue_init(&info->iqueue, &context, ring,
+					    addr, NOTIF_RING_SIZE, 0);
+		if (rc != 0) {
+			netdev_err(dev,
+				   "gxio_mpipe_iqueue_init failed: %d\n", rc);
+			return rc;
+		}
+		info->has_iqueue = true;
+	}
+
+	return 0;
+}
+
+/* Initialize NotifGroup and buckets.
+ * This routine supports tile_net_init_mpipe(), below.
+ */
+static int init_notif_group_and_buckets(struct net_device *dev,
+					int ring, int network_cpus_count)
+{
+	int group, rc;
+
+	/* Allocate one NotifGroup. */
+	rc = gxio_mpipe_alloc_notif_groups(&context, 1, 0, 0);
+	if (rc < 0) {
+		netdev_err(dev, "gxio_mpipe_alloc_notif_groups failed: %d\n",
+			   rc);
+		return rc;
+	}
+	group = rc;
+
+	/* Initialize global num_buckets value. */
+	if (network_cpus_count > 4)
+		num_buckets = 256;
+	else if (network_cpus_count > 1)
+		num_buckets = 16;
+
+	/* Allocate some buckets, and set global first_bucket value. */
+	rc = gxio_mpipe_alloc_buckets(&context, num_buckets, 0, 0);
+	if (rc < 0) {
+		netdev_err(dev, "gxio_mpipe_alloc_buckets failed: %d\n", rc);
+		return rc;
+	}
+	first_bucket = rc;
+
+	/* Init group and buckets. */
+	rc = gxio_mpipe_init_notif_group_and_buckets(
+		&context, group, ring, network_cpus_count,
+		first_bucket, num_buckets,
+		GXIO_MPIPE_BUCKET_STICKY_FLOW_LOCALITY);
+	if (rc != 0) {
+		netdev_err(
+			dev,
+			"gxio_mpipe_init_notif_group_and_buckets failed: %d\n",
+			rc);
+		return rc;
+	}
+
+	return 0;
+}
+
+/* Create an irq and register it, then activate the irq and request
+ * interrupts on all cores.  Note that "ingress_irq" being initialized
+ * is how we know not to call tile_net_init_mpipe() again.
+ * This routine supports tile_net_init_mpipe(), below.
+ */
+static int tile_net_setup_interrupts(struct net_device *dev)
+{
+	int cpu, rc;
+
+	rc = create_irq();
+	if (rc < 0) {
+		netdev_err(dev, "create_irq failed: %d\n", rc);
+		return rc;
+	}
+	ingress_irq = rc;
+	tile_irq_activate(ingress_irq, TILE_IRQ_PERCPU);
+	rc = request_irq(ingress_irq, tile_net_handle_ingress_irq,
+			 0, NULL, NULL);
+	if (rc != 0) {
+		netdev_err(dev, "request_irq failed: %d\n", rc);
+		destroy_irq(ingress_irq);
+		ingress_irq = -1;
+		return rc;
+	}
+
+	for_each_online_cpu(cpu) {
+		struct tile_net_info *info = &per_cpu(per_cpu_info, cpu);
+		if (info->has_iqueue) {
+			gxio_mpipe_request_notif_ring_interrupt(
+				&context, cpu_x(cpu), cpu_y(cpu),
+				1, ingress_irq, info->iqueue.ring);
+		}
+	}
+
+	return 0;
+}
+
+/* Undo any state set up partially by a failed call to tile_net_init_mpipe. */
+static void tile_net_init_mpipe_fail(void)
+{
+	int cpu;
+
+	/* Do cleanups that require the mpipe context first. */
+	if (small_buffer_stack >= 0)
+		tile_net_pop_all_buffers(small_buffer_stack);
+	if (large_buffer_stack >= 0)
+		tile_net_pop_all_buffers(large_buffer_stack);
+
+	/* Destroy mpipe context so the hardware no longer owns any memory. */
+	gxio_mpipe_destroy(&context);
+
+	for_each_online_cpu(cpu) {
+		struct tile_net_info *info = &per_cpu(per_cpu_info, cpu);
+		free_pages((unsigned long)(info->comps_for_echannel[0]),
+			   get_order(COMPS_SIZE));
+		info->comps_for_echannel[0] = NULL;
+		free_pages((unsigned long)(info->iqueue.idescs),
+			   get_order(NOTIF_RING_SIZE));
+		info->iqueue.idescs = NULL;
+	}
+
+	if (small_buffer_stack_va)
+		free_pages_exact(small_buffer_stack_va, buffer_stack_size);
+	if (large_buffer_stack_va)
+		free_pages_exact(large_buffer_stack_va, buffer_stack_size);
+
+	small_buffer_stack_va = NULL;
+	large_buffer_stack_va = NULL;
+	large_buffer_stack = -1;
+	small_buffer_stack = -1;
+	first_bucket = -1;
+}
+
+/* The first time any tilegx network device is opened, we initialize
+ * the global mpipe state.  If this step fails, we fail to open the
+ * device, but if it succeeds, we never need to do it again, and since
+ * tile_net can't be unloaded, we never undo it.
+ *
+ * Note that some resources in this path (buffer stack indices,
+ * bindings from init_buffer_stack, etc.) are hypervisor resources
+ * that are freed implicitly by gxio_mpipe_destroy().
+ */
+static int tile_net_init_mpipe(struct net_device *dev)
+{
+	int i, num_buffers, rc;
+	int cpu;
+	int first_ring, ring;
+	int network_cpus_count = cpus_weight(network_cpus_map);
+
+	if (!hash_default) {
+		netdev_err(dev, "Networking requires hash_default!\n");
+		return -EIO;
+	}
+
+	rc = gxio_mpipe_init(&context, 0);
+	if (rc != 0) {
+		netdev_err(dev, "gxio_mpipe_init failed: %d\n", rc);
+		return -EIO;
+	}
+
+	/* Set up the buffer stacks. */
+	num_buffers =
+		network_cpus_count * (IQUEUE_ENTRIES + TILE_NET_BATCH);
+	rc = init_buffer_stacks(dev, num_buffers);
+	if (rc != 0)
+		goto fail;
+
+	/* Provide initial buffers. */
+	rc = -ENOMEM;
+	for (i = 0; i < num_buffers; i++) {
+		if (!tile_net_provide_buffer(true)) {
+			netdev_err(dev, "Cannot allocate initial sk_buffs!\n");
+			goto fail;
+		}
+	}
+	for (i = 0; i < num_buffers; i++) {
+		if (!tile_net_provide_buffer(false)) {
+			netdev_err(dev, "Cannot allocate initial sk_buffs!\n");
+			goto fail;
+		}
+	}
+
+	/* Allocate one NotifRing for each network cpu. */
+	rc = gxio_mpipe_alloc_notif_rings(&context, network_cpus_count, 0, 0);
+	if (rc < 0) {
+		netdev_err(dev, "gxio_mpipe_alloc_notif_rings failed: %d\n",
+			   rc);
+		goto fail;
+	}
+
+	/* Init NotifRings per-cpu. */
+	first_ring = rc;
+	ring = first_ring;
+	for_each_online_cpu(cpu) {
+		rc = alloc_percpu_mpipe_resources(dev, cpu, ring++);
+		if (rc != 0)
+			goto fail;
+	}
+
+	/* Initialize NotifGroup and buckets. */
+	rc = init_notif_group_and_buckets(dev, first_ring, network_cpus_count);
+	if (rc != 0)
+		goto fail;
+
+	/* Create and enable interrupts. */
+	rc = tile_net_setup_interrupts(dev);
+	if (rc != 0)
+		goto fail;
+
+	return 0;
+
+fail:
+	tile_net_init_mpipe_fail();
+	return rc;
+}
+
+/* Create persistent egress info for a given egress channel.
+ * Note that this may be shared between, say, "gbe0" and "xgbe0".
+ * ISSUE: Defer header allocation until TSO is actually needed?
+ */
+static int tile_net_init_egress(struct net_device *dev, int echannel)
+{
+	struct page *headers_page, *edescs_page, *equeue_page;
+	gxio_mpipe_edesc_t *edescs;
+	gxio_mpipe_equeue_t *equeue;
+	unsigned char *headers;
+	int headers_order, edescs_order, equeue_order;
+	size_t edescs_size;
+	int edma;
+	int rc = -ENOMEM;
+
+	/* Only initialize once. */
+	if (egress_for_echannel[echannel].equeue != NULL)
+		return 0;
+
+	/* Allocate memory for the "headers". */
+	headers_order = get_order(EQUEUE_ENTRIES * HEADER_BYTES);
+	headers_page = alloc_pages(GFP_KERNEL, headers_order);
+	if (headers_page == NULL) {
+		netdev_warn(dev,
+			    "Could not alloc %zd bytes for TSO headers.\n",
+			    PAGE_SIZE << headers_order);
+		goto fail;
+	}
+	headers = pfn_to_kaddr(page_to_pfn(headers_page));
+
+	/* Allocate memory for the "edescs". */
+	edescs_size = EQUEUE_ENTRIES * sizeof(*edescs);
+	edescs_order = get_order(edescs_size);
+	edescs_page = alloc_pages(GFP_KERNEL, edescs_order);
+	if (edescs_page == NULL) {
+		netdev_warn(dev,
+			    "Could not alloc %zd bytes for eDMA ring.\n",
+			    edescs_size);
+		goto fail_headers;
+	}
+	edescs = pfn_to_kaddr(page_to_pfn(edescs_page));
+
+	/* Allocate memory for the "equeue". */
+	equeue_order = get_order(sizeof(*equeue));
+	equeue_page = alloc_pages(GFP_KERNEL, equeue_order);
+	if (equeue_page == NULL) {
+		netdev_warn(dev,
+			    "Could not alloc %zd bytes for equeue info.\n",
+			    PAGE_SIZE << equeue_order);
+		goto fail_edescs;
+	}
+	equeue = pfn_to_kaddr(page_to_pfn(equeue_page));
+
+	/* Allocate an edma ring.  Note that in practice this can't
+	 * fail, which is good, because we will leak an edma ring if so.
+	 */
+	rc = gxio_mpipe_alloc_edma_rings(&context, 1, 0, 0);
+	if (rc < 0) {
+		netdev_warn(dev, "gxio_mpipe_alloc_edma_rings failed: %d\n",
+			    rc);
+		goto fail_equeue;
+	}
+	edma = rc;
+
+	/* Initialize the equeue. */
+	rc = gxio_mpipe_equeue_init(equeue, &context, edma, echannel,
+				    edescs, edescs_size, 0);
+	if (rc != 0) {
+		netdev_err(dev, "gxio_mpipe_equeue_init failed: %d\n", rc);
+		goto fail_equeue;
+	}
+
+	/* Done. */
+	egress_for_echannel[echannel].equeue = equeue;
+	egress_for_echannel[echannel].headers = headers;
+	return 0;
+
+fail_equeue:
+	__free_pages(equeue_page, equeue_order);
+
+fail_edescs:
+	__free_pages(edescs_page, edescs_order);
+
+fail_headers:
+	__free_pages(headers_page, headers_order);
+
+fail:
+	return rc;
+}
+
+/* Return channel number for a newly-opened link. */
+static int tile_net_link_open(struct net_device *dev, gxio_mpipe_link_t *link,
+			      const char *link_name)
+{
+	int rc = gxio_mpipe_link_open(link, &context, link_name, 0);
+	if (rc < 0) {
+		netdev_err(dev, "Failed to open '%s'\n", link_name);
+		return rc;
+	}
+	rc = gxio_mpipe_link_channel(link);
+	if (rc < 0 || rc >= TILE_NET_CHANNELS) {
+		netdev_err(dev, "gxio_mpipe_link_channel bad value: %d\n", rc);
+		gxio_mpipe_link_close(link);
+		return -EINVAL;
+	}
+	return rc;
+}
+
+/* Help the kernel activate the given network interface. */
+static int tile_net_open(struct net_device *dev)
+{
+	struct tile_net_priv *priv = netdev_priv(dev);
+	int rc;
+
+	mutex_lock(&tile_net_devs_for_channel_mutex);
+
+	/* Do one-time initialization the first time any device is opened. */
+	if (ingress_irq < 0) {
+		rc = tile_net_init_mpipe(dev);
+		if (rc != 0)
+			goto fail;
+	}
+
+	/* Determine if this is the "loopify" device. */
+	if (unlikely((loopify_link_name != NULL) &&
+		     !strcmp(dev->name, loopify_link_name))) {
+		rc = tile_net_link_open(dev, &priv->link, "loop0");
+		if (rc < 0)
+			goto fail;
+		priv->channel = rc;
+		rc = tile_net_link_open(dev, &priv->loopify_link, "loop1");
+		if (rc < 0)
+			goto fail;
+		priv->loopify_channel = rc;
+		priv->echannel = rc;
+	} else {
+		rc = tile_net_link_open(dev, &priv->link, dev->name);
+		if (rc < 0)
+			goto fail;
+		priv->channel = rc;
+		priv->echannel = rc;
+	}
+
+	/* Initialize egress info (if needed).  Once ever, per echannel. */
+	rc = tile_net_init_egress(dev, priv->echannel);
+	if (rc != 0)
+		goto fail;
+
+	tile_net_devs_for_channel[priv->channel] = dev;
+
+	rc = tile_net_update(dev);
+	if (rc != 0)
+		goto fail;
+
+	mutex_unlock(&tile_net_devs_for_channel_mutex);
+
+	netif_start_queue(dev);
+	netif_carrier_on(dev);
+	return 0;
+
+fail:
+	if (priv->loopify_channel >= 0) {
+		if (gxio_mpipe_link_close(&priv->loopify_link) != 0)
+			netdev_warn(dev, "Failed to close loopify link!\n");
+		priv->loopify_channel = -1;
+	}
+	if (priv->channel >= 0) {
+		tile_net_devs_for_channel[priv->channel] = NULL;
+		if (gxio_mpipe_link_close(&priv->link) != 0)
+			netdev_warn(dev, "Failed to close link!\n");
+		priv->channel = -1;
+	}
+	priv->echannel = -1;
+	mutex_unlock(&tile_net_devs_for_channel_mutex);
+
+	/* Don't return raw gxio error codes to generic Linux. */
+	return (rc > -512) ? rc : -EIO;
+}
+
+/* Help the kernel deactivate the given network interface. */
+static int tile_net_stop(struct net_device *dev)
+{
+	struct tile_net_priv *priv = netdev_priv(dev);
+
+	netif_stop_queue(dev);
+
+	mutex_lock(&tile_net_devs_for_channel_mutex);
+	tile_net_devs_for_channel[priv->channel] = NULL;
+	(void)tile_net_update(dev);
+	if (priv->loopify_channel >= 0) {
+		if (gxio_mpipe_link_close(&priv->loopify_link) != 0)
+			netdev_warn(dev, "Failed to close loopify link!\n");
+		priv->loopify_channel = -1;
+	}
+	if (priv->channel >= 0) {
+		if (gxio_mpipe_link_close(&priv->link) != 0)
+			netdev_warn(dev, "Failed to close link!\n");
+		priv->channel = -1;
+	}
+	priv->echannel = -1;
+	mutex_unlock(&tile_net_devs_for_channel_mutex);
+
+	return 0;
+}
+
+/* Determine the VA for a fragment. */
+static inline void *tile_net_frag_buf(skb_frag_t *f)
+{
+	unsigned long pfn = page_to_pfn(skb_frag_page(f));
+	return pfn_to_kaddr(pfn) + f->page_offset;
+}
+
+/* Determine how many edesc's are needed for TSO.
+ *
+ * Sometimes, if "sendfile()" requires copying, we will be called with
+ * "data" containing the header and payload, with "frags" being empty.
+ * Sometimes, for example when using NFS over TCP, a single segment can
+ * span 3 fragments.  This requires special care.
+ */
+static int tso_count_edescs(struct sk_buff *skb)
+{
+	struct skb_shared_info *sh = skb_shinfo(skb);
+	unsigned int len = skb->len;
+	unsigned int p_len = sh->gso_size;
+	long f_id = -1;    /* id of the current fragment */
+	long f_size = -1;  /* size of the current fragment */
+	long f_used = -1;  /* bytes used from the current fragment */
+	long n;            /* size of the current piece of payload */
+	int num_edescs = 0;
+	int segment;
+
+	for (segment = 0; segment < sh->gso_segs; segment++) {
+
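+		unsigned int p_used = 0;
+		unsigned int sh_len =
+			skb_transport_offset(skb) + tcp_hdrlen(skb);
+
+		/* The last segment may be less than gso_size.  Note that
+		 * "len" still includes the "sh_len" bytes of headers.
+		 */
+		if (len - sh_len < p_len)
+			p_len = len - sh_len;
+		len -= p_len;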
+
+		/* One edesc for header and for each piece of the payload. */
+		for (num_edescs++; p_used < p_len; num_edescs++) {
+
+			/* Advance as needed. */
+			while (f_used >= f_size) {
+				f_id++;
+				f_size = sh->frags[f_id].size;
+				f_used = 0;
+			}
+
+			/* Use bytes from the current fragment. */
+			n = p_len - p_used;
+			if (n > f_size - f_used)
+				n = f_size - f_used;
+			f_used += n;
+			p_used += n;
+		}
+	}
+
+	return num_edescs;
+}
+
+/* Prepare modified copies of the skbuff headers.
+ * FIXME (bug 11489): add support for IPv6.
+ */
+static void tso_headers_prepare(struct sk_buff *skb, unsigned char *headers,
+				s64 slot)
+{
+	struct skb_shared_info *sh = skb_shinfo(skb);
+	struct iphdr *ih;
+	struct tcphdr *th;
+	unsigned int len = skb->len;
+	unsigned char *data = skb->data;
+	unsigned int ih_off, th_off, sh_len, total_len, p_len;
+	unsigned int isum_start, tsum_start, id, seq;
+	long f_id = -1;    /* id of the current fragment */
+	long f_size = -1;  /* size of the current fragment */
+	long f_used = -1;  /* bytes used from the current fragment */
+	long n;            /* size of the current piece of payload */
+	int segment;
+
+	/* Locate original headers and compute various lengths. */
+	ih = ip_hdr(skb);
+	th = tcp_hdr(skb);
+	ih_off = (unsigned char *)ih - data;
+	th_off = (unsigned char *)th - data;
+	sh_len = th_off + tcp_hdrlen(skb);
+	p_len = sh->gso_size;
+	total_len = p_len + sh_len;
+
+	/* Set up seed values for IP and TCP csum and initialize id and seq. */
+	isum_start = ((0xFFFF - ih->check) +
+		      (0xFFFF - ih->tot_len) +
+		      (0xFFFF - ih->id));
+	tsum_start = th->check + (0xFFFF ^ htons(len));
+	id = ntohs(ih->id);
+	seq = ntohl(th->seq);
+
+	/* Prepare all the headers. */
+	for (segment = 0; segment < sh->gso_segs; segment++) {
+		unsigned char *buf;
+		unsigned int p_used = 0;
+
+		/* The last segment may be less than gso_size.  Note that
+		 * "len" still includes the "sh_len" bytes of headers.
+		 */
+		if (len - sh_len < p_len) {
+			p_len = len - sh_len;
+			total_len = p_len + sh_len;
+		}
+		len -= p_len;
+
+		/* Copy to the header memory for this segment. */
+		buf = headers + (slot % EQUEUE_ENTRIES) * HEADER_BYTES +
+			NET_IP_ALIGN;
+		memcpy(buf, data, sh_len);
+
+		/* Update copied ip header. */
+		ih = (struct iphdr *)(buf + ih_off);
+		ih->tot_len = htons(total_len - ih_off);
+		ih->id = htons(id);
+		ih->check = csum_long(isum_start + htons(total_len - ih_off) +
+				      htons(id)) ^ 0xffff;
+
+		/* Update copied tcp header. */
+		th = (struct tcphdr *)(buf + th_off);
+		th->seq = htonl(seq);
+		th->check = csum_long(tsum_start + htons(total_len));
+		if (segment != sh->gso_segs - 1) {
+			th->fin = 0;
+			th->psh = 0;
+		}
+
+		/* Skip past the header. */
+		slot++;
+
+		/* Skip past the payload. */
+		while (p_used < p_len) {
+
+			/* Advance as needed. */
+			while (f_used >= f_size) {
+				f_id++;
+				f_size = sh->frags[f_id].size;
+				f_used = 0;
+			}
+
+			/* Use bytes from the current fragment. */
+			n = p_len - p_used;
+			if (n > f_size - f_used)
+				n = f_size - f_used;
+			f_used += n;
+			p_used += n;
+
+			slot++;
+		}
+
+		id++;
+		seq += p_len;
+	}
+
+	/* Flush the headers so they are ready for hardware DMA. */
+	wmb();
+}
+
+/* Pass all the data to mpipe for egress. */
+static void tso_egress(struct net_device *dev, gxio_mpipe_equeue_t *equeue,
+		       struct sk_buff *skb, unsigned char *headers, s64 slot)
+{
+	struct tile_net_priv *priv = netdev_priv(dev);
+	struct skb_shared_info *sh = skb_shinfo(skb);
+	unsigned int len = skb->len;
+	unsigned int p_len = sh->gso_size;
+	gxio_mpipe_edesc_t edesc_head = { { 0 } };
+	gxio_mpipe_edesc_t edesc_body = { { 0 } };
+	long f_id = -1;    /* id of the current fragment */
+	long f_size = -1;  /* size of the current fragment */
+	long f_used = -1;  /* bytes used from the current fragment */
+	long n;            /* size of the current piece of payload */
+	unsigned long tx_packets = 0, tx_bytes = 0;
+	unsigned int csum_start, sh_len;
+	int segment;
+
+	/* Prepare to egress the headers: set up header edesc. */
+	csum_start = skb_checksum_start_offset(skb);
+	sh_len = skb_transport_offset(skb) + tcp_hdrlen(skb);
+	edesc_head.csum = 1;
+	edesc_head.csum_start = csum_start;
+	edesc_head.csum_dest = csum_start + skb->csum_offset;
+	edesc_head.xfer_size = sh_len;
+
+	/* This is only used to specify the TLB. */
+	edesc_head.stack_idx = large_buffer_stack;
+	edesc_body.stack_idx = large_buffer_stack;
+
+	/* Egress all the edescs. */
+	for (segment = 0; segment < sh->gso_segs; segment++) {
+		void *va;
+		unsigned char *buf;
+		unsigned int p_used = 0;
+
+		/* The last segment may be less than gso_size.  Note that
+		 * "len" still includes the "sh_len" bytes of headers.
+		 */
+		if (len - sh_len < p_len)
+			p_len = len - sh_len;
+		len -= p_len;
+
+		/* Egress the header. */
+		buf = headers + (slot % EQUEUE_ENTRIES) * HEADER_BYTES +
+			NET_IP_ALIGN;
+		edesc_head.va = va_to_tile_io_addr(buf);
+		gxio_mpipe_equeue_put_at(equeue, edesc_head, slot);
+		slot++;
+
+		/* Egress the payload. */
+		while (p_used < p_len) {
+
+			/* Advance as needed. */
+			while (f_used >= f_size) {
+				f_id++;
+				f_size = sh->frags[f_id].size;
+				f_used = 0;
+			}
+
+			va = tile_net_frag_buf(&sh->frags[f_id]) + f_used;
+
+			/* Use bytes from the current fragment. */
+			n = p_len - p_used;
+			if (n > f_size - f_used)
+				n = f_size - f_used;
+			f_used += n;
+			p_used += n;
+
+			/* Egress a piece of the payload. */
+			edesc_body.va = va_to_tile_io_addr(va);
+			edesc_body.xfer_size = n;
+			edesc_body.bound = !(p_used < p_len);
+			gxio_mpipe_equeue_put_at(equeue, edesc_body, slot);
+			slot++;
+		}
+
+		tx_packets++;
+		tx_bytes += sh_len + p_len;
+	}
+
+	/* Update stats. */
+	tile_net_stats_add(tx_packets, &priv->stats.tx_packets);
+	tile_net_stats_add(tx_bytes, &priv->stats.tx_bytes);
+}
+
+/* Do TSO handling for egress. */
+static int tile_net_tx_tso(struct sk_buff *skb, struct net_device *dev)
+{
+	struct tile_net_priv *priv = netdev_priv(dev);
+	struct tile_net_info *info = &__get_cpu_var(per_cpu_info);
+	int channel = priv->echannel;
+	struct tile_net_egress *egress = &egress_for_echannel[channel];
+	gxio_mpipe_equeue_t *equeue = egress->equeue;
+	unsigned int num_edescs;
+	unsigned long irqflags;
+	s64 slot;
+
+	/* Determine how many mpipe edesc's are needed. */
+	num_edescs = tso_count_edescs(skb);
+
+	local_irq_save(irqflags);
+
+	/* Set first reserved egress slot; see comment in tile_net_tx(). */
+	slot = gxio_mpipe_equeue_try_reserve(equeue, num_edescs);
+	if (slot < 0) {
+		local_irq_restore(irqflags);
+		return NETDEV_TX_BUSY;
+	}
+
+	/* Set up copies of header data properly. */
+	tso_headers_prepare(skb, egress->headers, slot);
+
+	/* Actually pass the data to the network hardware. */
+	tso_egress(dev, equeue, skb, egress->headers, slot);
+
+	/* Add a completion record. */
+	add_comp(equeue, info->comps_for_echannel[channel],
+		 slot + num_edescs - 1, skb);
+
+	local_irq_restore(irqflags);
+
+	/* Make sure the egress timer is scheduled. */
+	tile_net_schedule_egress_timer(info);
+
+	return NETDEV_TX_OK;
+}
+
+/* Analyze the body and frags for a transmit request. */
+static unsigned int tile_net_tx_frags(struct frag *frags,
+				       struct sk_buff *skb,
+				       void *b_data, unsigned int b_len)
+{
+	unsigned int i, n = 0;
+
+	struct skb_shared_info *sh = skb_shinfo(skb);
+
+	if (b_len != 0) {
+		frags[n].buf = b_data;
+		frags[n++].length = b_len;
+	}
+
+	for (i = 0; i < sh->nr_frags; i++) {
+		skb_frag_t *f = &sh->frags[i];
+		frags[n].buf = tile_net_frag_buf(f);
+		frags[n++].length = skb_frag_size(f);
+	}
+
+	return n;
+}
+
+/* Help the kernel transmit a packet. */
+static int tile_net_tx(struct sk_buff *skb, struct net_device *dev)
+{
+	struct tile_net_priv *priv = netdev_priv(dev);
+	struct tile_net_info *info = &__get_cpu_var(per_cpu_info);
+	struct tile_net_egress *egress = &egress_for_echannel[priv->echannel];
+	gxio_mpipe_equeue_t *equeue = egress->equeue;
+	struct tile_net_comps *comps =
+		info->comps_for_echannel[priv->echannel];
+	unsigned int len = skb->len;
+	unsigned char *data = skb->data;
+	unsigned int num_frags;
+	struct frag frags[MAX_FRAGS];
+	gxio_mpipe_edesc_t edescs[MAX_FRAGS];
+	unsigned long irqflags;
+	gxio_mpipe_edesc_t edesc = { { 0 } };
+	unsigned int i;
+	s64 slot;
+
+	/* Save the timestamp. */
+	dev->trans_start = jiffies;
+
+	if (skb_is_gso(skb))
+		return tile_net_tx_tso(skb, dev);
+
+	/* NOTE: This is usually 2, sometimes 3, for big writes. */
+	num_frags = tile_net_tx_frags(frags, skb, data, skb_headlen(skb));
+
+	/* This is only used to specify the TLB. */
+	edesc.stack_idx = large_buffer_stack;
+
+	/* Prepare the edescs. */
+	for (i = 0; i < num_frags; i++) {
+		edesc.xfer_size = frags[i].length;
+		edesc.va = va_to_tile_io_addr(frags[i].buf);
+		edescs[i] = edesc;
+	}
+
+	/* Mark the final edesc. */
+	edescs[num_frags - 1].bound = 1;
+
+	/* Add checksum info to the initial edesc, if needed. */
+	if (skb->ip_summed == CHECKSUM_PARTIAL) {
+		unsigned int csum_start = skb_checksum_start_offset(skb);
+		edescs[0].csum = 1;
+		edescs[0].csum_start = csum_start;
+		edescs[0].csum_dest = csum_start + skb->csum_offset;
+	}
+
+	local_irq_save(irqflags);
+
+	/* Try to reserve slots for egress.  If we fail due to the
+	 * queue being full, we return NETDEV_TX_BUSY.  This may lead
+	 * to "Virtual device xxx asks to queue packet" warnings.
+	 *
+	 * We might consider retrying briefly here since we expect in
+	 * principle that egress slots become available quickly as the
+	 * hardware engine drains packets into the network.
+	 *
+	 * FIXME (bug# 11479): We should stop queues when they're full.
+	 * We may want to consider making tile_net be multiqueue with
+	 * one TX queue per CPU and ndo_select_queue defined
+	 * accordingly.  Initially we saw bad things happen when
+	 * stopping the queue, so we are continuing to work on this
+	 * for a future fix.
+	 */
+	slot = gxio_mpipe_equeue_try_reserve(equeue, num_frags);
+	if (slot < 0) {
+		local_irq_restore(irqflags);
+		return NETDEV_TX_BUSY;
+	}
+
+	for (i = 0; i < num_frags; i++)
+		gxio_mpipe_equeue_put_at(equeue, edescs[i], slot++);
+
+	/* Add a completion record. */
+	add_comp(equeue, comps, slot - 1, skb);
+
+	/* NOTE: Count at least ETH_ZLEN bytes for short packets
+	 * (e.g. a 42-byte packet is counted as 60).
+	 */
+	tile_net_stats_add(1, &priv->stats.tx_packets);
+	tile_net_stats_add(max(len, (unsigned int)ETH_ZLEN),
+			   &priv->stats.tx_bytes);
+
+	local_irq_restore(irqflags);
+
+	/* Make sure the egress timer is scheduled. */
+	tile_net_schedule_egress_timer(info);
+
+	return NETDEV_TX_OK;
+}
+
+/* Deal with a transmit timeout. */
+static void tile_net_tx_timeout(struct net_device *dev)
+{
+	netif_wake_queue(dev);
+}
+
+/* Ioctl commands. */
+static int tile_net_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
+{
+	return -EOPNOTSUPP;
+}
+
+/* Get system network statistics for device. */
+static struct net_device_stats *tile_net_get_stats(struct net_device *dev)
+{
+	struct tile_net_priv *priv = netdev_priv(dev);
+	return &priv->stats;
+}
+
+/* Change the MTU. */
+static int tile_net_change_mtu(struct net_device *dev, int new_mtu)
+{
+	if ((new_mtu < 68) || (new_mtu > 1500))
+		return -EINVAL;
+	dev->mtu = new_mtu;
+	return 0;
+}
+
+/* Change the Ethernet address of the NIC.
+ *
+ * The hypervisor driver does not support changing MAC address.  However,
+ * the hardware does not do anything with the MAC address, so the address
+ * which gets used on outgoing packets, and which is accepted on incoming
+ * packets, is completely up to us.
+ *
+ * Returns 0 on success, negative on failure.
+ */
+static int tile_net_set_mac_address(struct net_device *dev, void *p)
+{
+	struct sockaddr *addr = p;
+
+	if (!is_valid_ether_addr(addr->sa_data))
+		return -EINVAL;
+	memcpy(dev->dev_addr, addr->sa_data, dev->addr_len);
+	return 0;
+}
+
+#ifdef CONFIG_NET_POLL_CONTROLLER
+/* Polling 'interrupt' - used by things like netconsole to send skbs
+ * without having to re-enable interrupts. It's not called while
+ * the interrupt routine is executing.
+ */
+static void tile_net_netpoll(struct net_device *dev)
+{
+	disable_percpu_irq(ingress_irq);
+	tile_net_handle_ingress_irq(ingress_irq, NULL);
+	enable_percpu_irq(ingress_irq, 0);
+}
+#endif
+
+static const struct net_device_ops tile_net_ops = {
+	.ndo_open = tile_net_open,
+	.ndo_stop = tile_net_stop,
+	.ndo_start_xmit = tile_net_tx,
+	.ndo_do_ioctl = tile_net_ioctl,
+	.ndo_get_stats = tile_net_get_stats,
+	.ndo_change_mtu = tile_net_change_mtu,
+	.ndo_tx_timeout = tile_net_tx_timeout,
+	.ndo_set_mac_address = tile_net_set_mac_address,
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	.ndo_poll_controller = tile_net_netpoll,
+#endif
+};
+
+/* The setup function.
+ *
+ * This uses ether_setup() to assign various fields in dev, including
+ * setting IFF_BROADCAST and IFF_MULTICAST, then sets some extra fields.
+ */
+static void tile_net_setup(struct net_device *dev)
+{
+	ether_setup(dev);
+	dev->netdev_ops = &tile_net_ops;
+	dev->watchdog_timeo = TILE_NET_TIMEOUT;
+	dev->features |= NETIF_F_LLTX;
+	dev->features |= NETIF_F_HW_CSUM;
+	dev->features |= NETIF_F_SG;
+	dev->features |= NETIF_F_TSO;
+	dev->tx_queue_len = 0;
+	dev->mtu = 1500;
+}
+
+/* Allocate the device structure, register the device, and obtain the
+ * MAC address from the hypervisor.
+ */
+static void tile_net_dev_init(const char *name, const uint8_t *mac)
+{
+	int ret;
+	int i;
+	int nz_addr = 0;
+	struct net_device *dev;
+	struct tile_net_priv *priv;
+
+	/* HACK: Ignore "loop" links. */
+	if (strncmp(name, "loop", 4) == 0)
+		return;
+
+	/* Allocate the device structure.  Normally, "name" is a
+	 * template, instantiated by register_netdev(), but not for us.
+	 */
+	dev = alloc_netdev(sizeof(*priv), name, tile_net_setup);
+	if (!dev) {
+		pr_err("alloc_netdev(%s) failed\n", name);
+		return;
+	}
+
+	/* Initialize "priv". */
+	priv = netdev_priv(dev);
+	memset(priv, 0, sizeof(*priv));
+	priv->dev = dev;
+	priv->channel = -1;
+	priv->loopify_channel = -1;
+	priv->echannel = -1;
+
+	/* Get the MAC address and set it in the device struct; this must
+	 * be done before the device is opened.  If the MAC is all zeroes,
+	 * we use a random address, since we're probably on the simulator.
+	 */
+	for (i = 0; i < 6; i++)
+		nz_addr |= mac[i];
+
+	if (nz_addr) {
+		memcpy(dev->dev_addr, mac, 6);
+		dev->addr_len = 6;
+	} else {
+		random_ether_addr(dev->dev_addr);
+	}
+
+	/* Register the network device. */
+	ret = register_netdev(dev);
+	if (ret) {
+		netdev_err(dev, "register_netdev failed %d\n", ret);
+		free_netdev(dev);
+		return;
+	}
+}
+
+/* Per-cpu module initialization. */
+static void tile_net_init_module_percpu(void *unused)
+{
+	struct tile_net_info *info = &__get_cpu_var(per_cpu_info);
+	int my_cpu = smp_processor_id();
+
+	info->has_iqueue = false;
+
+	info->my_cpu = my_cpu;
+
+	/* Initialize the egress timer. */
+	init_timer(&info->egress_timer);
+	info->egress_timer.data = (unsigned long)info;
+	info->egress_timer.function = tile_net_handle_egress_timer;
+}
+
+/* Module initialization. */
+static int __init tile_net_init_module(void)
+{
+	int i;
+	char name[GXIO_MPIPE_LINK_NAME_LEN];
+	uint8_t mac[6];
+
+	pr_info("Tilera Network Driver\n");
+
+	mutex_init(&tile_net_devs_for_channel_mutex);
+
+	/* Initialize each CPU. */
+	on_each_cpu(tile_net_init_module_percpu, NULL, 1);
+
+	/* Find out what devices we have, and initialize them. */
+	for (i = 0; gxio_mpipe_link_enumerate_mac(i, name, mac) >= 0; i++)
+		tile_net_dev_init(name, mac);
+
+	if (!network_cpus_init())
+		network_cpus_map = *cpu_online_mask;
+
+	return 0;
+}
+
+module_init(tile_net_init_module);
-- 
1.6.5.2
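
For readers following the non-TSO transmit path in the patch above, the
gather-then-describe steps can be sketched in plain userspace C. This is
only an illustration under stated assumptions: every "sketch_" name is
hypothetical, the structs are simplified stand-ins for the driver's
"struct frag" and gxio_mpipe_edesc_t, and the bounds check and the n > 0
guard are additions made by the sketch, not behaviour taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the kernel or gxio definitions. */
#define SKETCH_MAX_FRAGS 8

struct sketch_frag {
	const void *buf;
	size_t length;
};

struct sketch_desc {
	const void *va;
	size_t xfer_size;
	unsigned int bound;	/* 1 only on the last descriptor of a packet */
};

/* Collect the linear head (if non-empty), then the paged fragments. */
static unsigned int sketch_gather(struct sketch_frag *out,
				  const void *head, size_t head_len,
				  const struct sketch_frag *paged,
				  unsigned int nr_paged)
{
	unsigned int i, n = 0;

	/* Added by the sketch: refuse to overrun the output array. */
	assert((head_len != 0) + nr_paged <= SKETCH_MAX_FRAGS);

	if (head_len != 0) {
		out[n].buf = head;
		out[n++].length = head_len;
	}
	for (i = 0; i < nr_paged; i++)
		out[n++] = paged[i];

	return n;
}

/* Build one descriptor per fragment and flag the final one. */
static void sketch_build_descs(struct sketch_desc *descs,
			       const struct sketch_frag *frags,
			       unsigned int n)
{
	unsigned int i;

	for (i = 0; i < n; i++) {
		descs[i].va = frags[i].buf;
		descs[i].xfer_size = frags[i].length;
		descs[i].bound = 0;
	}
	/* Added by the sketch: skip the marker when there are no fragments. */
	if (n > 0)
		descs[n - 1].bound = 1;
}
```

A caller would run sketch_gather() first, then pass the returned count to
sketch_build_descs(), mirroring the order used in tile_net_tx().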

^ permalink raw reply related

* Re: [PATCH v6 2/2] decrement static keys on real destroy time
From: Andrew Morton @ 2012-05-23 20:33 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, devel, kamezawa.hiroyu, netdev, Tejun Heo,
	Li Zefan, Johannes Weiner, Michal Hocko, David Miller
In-Reply-To: <4FBCAAF4.4030803@parallels.com>

On Wed, 23 May 2012 13:16:36 +0400
Glauber Costa <glommer@parallels.com> wrote:

> On 05/23/2012 02:46 AM, Andrew Morton wrote:
> > Here, we're open-coding kinda-test_bit().  Why do that?  These flags are
> > modified with set_bit() and friends, so we should read them with the
> > matching test_bit()?
> 
> My reasoning was to be as cheap as possible, as you noted yourself two
> paragraphs below.

These aren't on any fast path, are they?

Plus: you failed in that objective!  The C compiler's internal
scalar->bool conversion makes these functions no more efficient than
test_bit().
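
A rough userspace sketch of the comparison, for illustration only: the
flag numbers and all "sketch_" names below are hypothetical, and
sketch_test_bit() merely approximates a non-atomic test_bit(); it is not
the kernel's implementation. Both helpers return bool, so the masked
scalar typically has to be compared against zero in either case.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical flag numbers, for illustration only. */
enum { SKETCH_FLAG_ACTIVE = 0, SKETCH_FLAG_ACTIVATED = 1 };

/* Open-coded test: the masked value is a scalar, and returning it as
 * bool still requires a conversion to 0 or 1.
 */
static bool sketch_open_coded(const unsigned long *flags, unsigned int nr)
{
	return *flags & (1UL << nr);
}

/* Approximation of a non-atomic test_bit(): pick the word, shift, mask. */
static bool sketch_test_bit(unsigned int nr, const unsigned long *addr)
{
	return (addr[nr / SKETCH_BITS_PER_LONG] >>
		(nr % SKETCH_BITS_PER_LONG)) & 1UL;
}
```

For any bit number below the width of one long, the two helpers return
the same value, so the open-coded form buys nothing in readability.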

> > So here are suggested changes from *some* of the above discussion.
> > Please consider, incorporate, retest and send us a v7?
> 
> How do you want me to do it? Should I add your patch on top of mine,
> and then another one that tweaks whatever else is left, or should I just
> merge those changes into the patches I have?

A brand new patch, I guess.  I can sort out the what-did-he-change view
at this end.


^ permalink raw reply

