Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH RFC] [INET]: Get cirtical word in first 64bit of cache line
From: Eric Dumazet @ 2012-11-26  6:44 UTC (permalink / raw)
  To: ling.ma.program; +Cc: linux-kernel, netdev
In-Reply-To: <1353900555-5966-1-git-send-email-ling.ma.program@gmail.com>

On Mon, 2012-11-26 at 11:29 +0800, ling.ma.program@gmail.com wrote:
> From: Ma Ling <ling.ma.program@gmail.com>
> 
> In order to reduce memory latency when last level cache miss occurs,
> modern CPUs i.e. x86 and arm introduced Critical Word First(CWF) or
> Early Restart(ER) to get data ASAP. For CWF if critical word is first member
> in cache line, memory feed CPU with critical word, then fill others
> data in cache line one by one, otherwise after critical word it must
> cost more cycle to fill the remaining cache line. For Early First CPU will
> restart until critical word in cache line reaches.
> 
> Hash value is critical word, so in this patch we place it as first member
> in cache line(sock address is cache-line aligned), and it is also good for
> Early Restart platform as well .
> 
> Thanks
> Ling

networking patches should be sent to netdev.

(I understand this patch is more a generic one, but at least CC netdev)

You give no performance numbers for this change...

I never heard of this CWF/ER, where are the official Intel documents
about this, and what models really benefit from it ?

Also, why not moving skc_net as well ?

BTW, skc_daddr & skc_rcv_saddr are 'critical' as well, we use them in
INET_MATCH()

It seems we have a 32bit hole on 64bit arches, so we probably should
move inet_dport/inet_num in it. It could well remove a full cache line
miss (I'll send a patch for this after tests)

Thanks

^ permalink raw reply

* Re: [net-next:master 20/25] warning: (NET_DSA_TAG_DSA && NET_DSA_TAG_EDSA && NET_DSA_TAG_TRAILER) selects NET_DSA which has unmet direct dependencies (NET && EXPERIMENTAL && NETDEVICES && !S390)
From: Viresh Kumar @ 2012-11-26  6:13 UTC (permalink / raw)
  To: kbuild test robot, Ben Hutchings, davem; +Cc: netdev
In-Reply-To: <50b2f66a.czw+uqRm3Q/zfbER%fengguang.wu@intel.com>

Hi Ben/Dave,

On 26 November 2012 10:26, kbuild test robot <fengguang.wu@intel.com> wrote:
> tree:   git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
> head:   53d6841d225b93c20d561878637c3cd307c11648
> commit: 82167cb8c6b2f8166d5c7532e5ef4b5e0cc46a72 [20/25] net: dsa/slave: Fix compilation warnings
> config: make ARCH=s390 allmodconfig
>
> All warnings:
>
> warning: (NET_DSA_TAG_DSA && NET_DSA_TAG_EDSA && NET_DSA_TAG_TRAILER) selects NET_DSA which has unmet direct dependencies (NET && EXPERIMENTAL && NETDEVICES && !S390)
> warning: (NET_DSA_TAG_DSA && NET_DSA_TAG_EDSA && NET_DSA_TAG_TRAILER) selects NET_DSA which has unmet direct dependencies (NET && EXPERIMENTAL && NETDEVICES && !S390)
> --
> warning: (NET_DSA_TAG_DSA && NET_DSA_TAG_EDSA && NET_DSA_TAG_TRAILER) selects NET_DSA which has unmet direct dependencies (NET && EXPERIMENTAL && NETDEVICES && !S390)

I am not really sure how to fix this up. Can you help?

--
viresh

^ permalink raw reply

* Re: [RFC net-next PATCH V1 9/9] net: frag remove readers-writer lock (hack)
From: Stephen Hemminger @ 2012-11-26  6:03 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Eric Dumazet, David S. Miller, Florian Westphal, netdev,
	Pablo Neira Ayuso, Thomas Graf, Cong Wang, Patrick McHardy,
	Paul E. McKenney, Herbert Xu
In-Reply-To: <20121123130847.18764.87682.stgit@dragon>

On Fri, 23 Nov 2012 14:08:47 +0100
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> Do NOT APPLY this patch.
> 
> After all the other patches, the rw_lock is now the contention point.

Reader-writer locks are significantly slower than a simple spinlock.
Unless the reader holds the lock for "significant" number of cycles,
a spin lock will be faster. 

Did you try just changing it to a spinlock?

^ permalink raw reply

* Re: [PATCH net-next] net: clean up locking in inet_frag_find()
From: Cong Wang @ 2012-11-26  5:53 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Patrick McHardy, Pablo Neira Ayuso, David S. Miller
In-Reply-To: <1353908592.30446.1136.camel@edumazet-glaptop>

On Sun, 2012-11-25 at 21:43 -0800, Eric Dumazet wrote:
> On Mon, 2012-11-26 at 13:18 +0800, Cong Wang wrote:
> > It is weird to take the read lock outside of inet_frag_find()
> > but release it inside... Also, the hash function doesn't
> > need this lock.
> 
> This patch is wrong. hash needs the lock.
> 
> 

Indeed, I missed .rnd...

I am cooking a better patch now.

Thanks!

^ permalink raw reply

* Re: [PATCH net-next] net: clean up locking in inet_frag_find()
From: Eric Dumazet @ 2012-11-26  5:43 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev, Patrick McHardy, Pablo Neira Ayuso, David S. Miller
In-Reply-To: <1353907085-30824-1-git-send-email-amwang@redhat.com>

On Mon, 2012-11-26 at 13:18 +0800, Cong Wang wrote:
> It is weird to take the read lock outside of inet_frag_find()
> but release it inside... Also, the hash function doesn't
> need this lock.

This patch is wrong. hash needs the lock.

^ permalink raw reply

* [PATCH net-next] net: clean up locking in inet_frag_find()
From: Cong Wang @ 2012-11-26  5:18 UTC (permalink / raw)
  To: netdev; +Cc: Patrick McHardy, Pablo Neira Ayuso, David S. Miller, Cong Wang

It is weird to take the read lock outside of inet_frag_find()
but release it inside... Also, the hash function doesn't
need this lock.

Cc: Patrick McHardy <kaber@trash.net>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>

---
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 4750d2b..caf1807 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -272,11 +272,11 @@ static struct inet_frag_queue *inet_frag_create(struct netns_frags *nf,
 
 struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
 		struct inet_frags *f, void *key, unsigned int hash)
-	__releases(&f->lock)
 {
 	struct inet_frag_queue *q;
 	struct hlist_node *n;
 
+	read_lock(&f->lock);
 	hlist_for_each_entry(q, n, &f->hash[hash], list) {
 		if (q->net == nf && f->match(q, key)) {
 			atomic_inc(&q->refcnt);
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 1cf6a76..d9541f4 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -295,9 +295,7 @@ static inline struct ipq *ip_find(struct net *net, struct iphdr *iph, u32 user)
 	arg.iph = iph;
 	arg.user = user;
 
-	read_lock(&ip4_frags.lock);
 	hash = ipqhashfn(iph->id, iph->saddr, iph->daddr, iph->protocol);
-
 	q = inet_frag_find(&net->ipv4.frags, &ip4_frags, &arg, hash);
 	if (q == NULL)
 		goto out_nomem;
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 22c8ea9..b778423 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -175,9 +175,8 @@ static inline struct frag_queue *fq_find(struct net *net, __be32 id,
 	arg.src = src;
 	arg.dst = dst;
 
-	read_lock_bh(&nf_frags.lock);
 	hash = inet6_hash_frag(id, src, dst, nf_frags.rnd);
-
+	local_bh_disable();
 	q = inet_frag_find(&net->nf_frag.frags, &nf_frags, &arg, hash);
 	local_bh_enable();
 	if (q == NULL)
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index e5253ec..15e09a6 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -193,9 +193,7 @@ fq_find(struct net *net, __be32 id, const struct in6_addr *src, const struct in6
 	arg.src = src;
 	arg.dst = dst;
 
-	read_lock(&ip6_frags.lock);
 	hash = inet6_hash_frag(id, src, dst, ip6_frags.rnd);
-
 	q = inet_frag_find(&net->ipv6.frags, &ip6_frags, &arg, hash);
 	if (q == NULL)
 		return NULL;

^ permalink raw reply related

* [net-next:master 20/25] warning: (NET_DSA_TAG_DSA && NET_DSA_TAG_EDSA && NET_DSA_TAG_TRAILER) selects NET_DSA which has unmet direct dependencies (NET && EXPERIMENTAL && NETDEVICES && !S390)
From: kbuild test robot @ 2012-11-26  4:56 UTC (permalink / raw)
  To: viresh kumar; +Cc: netdev

tree:   git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
head:   53d6841d225b93c20d561878637c3cd307c11648
commit: 82167cb8c6b2f8166d5c7532e5ef4b5e0cc46a72 [20/25] net: dsa/slave: Fix compilation warnings
config: make ARCH=s390 allmodconfig

All warnings:

warning: (NET_DSA_TAG_DSA && NET_DSA_TAG_EDSA && NET_DSA_TAG_TRAILER) selects NET_DSA which has unmet direct dependencies (NET && EXPERIMENTAL && NETDEVICES && !S390)
warning: (NET_DSA_TAG_DSA && NET_DSA_TAG_EDSA && NET_DSA_TAG_TRAILER) selects NET_DSA which has unmet direct dependencies (NET && EXPERIMENTAL && NETDEVICES && !S390)
--
warning: (NET_DSA_TAG_DSA && NET_DSA_TAG_EDSA && NET_DSA_TAG_TRAILER) selects NET_DSA which has unmet direct dependencies (NET && EXPERIMENTAL && NETDEVICES && !S390)

---
0-DAY kernel build testing backend         Open Source Technology Center
Fengguang Wu, Yuanhan Liu                              Intel Corporation

^ permalink raw reply

* Re: [PATCH 080/493] fddi: remove use of __devexit_p
From: Maciej W. Rozycki @ 2012-11-26  4:35 UTC (permalink / raw)
  To: Greg KH; +Cc: Bill Pemberton, netdev
In-Reply-To: <20121122002337.GA16151@kroah.com>

On Wed, 21 Nov 2012, Greg KH wrote:

> >  TURBOchannel, although valid, is an old exotic case that might not be 
> > worth arguing for, except for purity maybe.  But there are surely many 
> > contemporary systems out there that are known they are never going to 
> > support hot device replacement.  Consider most of the embedded systems for 
> > example, where devices may even physically be cast into a single SOC (with 
> > no prospect of chipping off any pieces ever ;) ), that certainly could not 
> > care less of device replacement, but they do care a lot about memory 
> > consumption.
> 
> Even those don't care about less than 5k of memory, do they?

 I guess you're right.  As long as it's not 5k per driver + 
who_knows_how_much per platform for some generic stuff that is.  Of course 
this is still a waste, but I can accept it as a compromise between the use 
of machine resources and the cost of maintenance.

> >  Was this implication considered, discussed and diregarded as not 
> > important enough compared to benefits from hardcoding HOTPLUG support?  
> 
> Yes, I don't know of any modern system that does not enable
> CONFIG_HOTPLUG, do you?

 I've never enabled it outside x86 to be honest.  None of the platforms I 
care of supports it in hardware.

> > I'm seriously asking for a pointer, not trying to cause any stir-up -- 
> > regrettably I fail to follow most discussions these days, but I would like 
> > to know what the background behind this decision was.  Thanks a lot!
> 
> See Russell's response in this thread for details if you are curious.

 Hmm, not in my inbox for some reason (unless I'm blind), but thanks for 
the pointer, I'll see if I can chase it online in some mailing list 
archive.

  Maciej

^ permalink raw reply

* linux-next: manual merge of the cgroup tree with the net-next tree
From: Stephen Rothwell @ 2012-11-26  4:31 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-next-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Daniel Wagner, David Miller,
	netdev-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 1203 bytes --]

Hi Tejun,

Today's linux-next merge of the cgroup tree got a conflict in
net/sched/cls_cgroup.c between commit 6a328d8c6f03 ("cgroup: net_cls:
Rework update socket logic") from the net-next tree and commits
92fb97487a7e ("cgroup: rename ->create/post_create/pre_destroy/destroy()
to ->css_alloc/online/offline/free()") and 0ba18f7a5e26 ("netcls_cgroup:
move config inheritance to ->css_online() and remove .broken_hierarchy
marking") from the cgroup tree.

I fixed it up (I think - see below) and can carry the fix as necessary
(no action is required).

-- 
Cheers,
Stephen Rothwell                    sfr-3FnU+UHB4dNDw9hX6IcOSA@public.gmane.org

diff --cc net/sched/cls_cgroup.c
index 709b0fb,31f06b6..0000000
--- a/net/sched/cls_cgroup.c
+++ b/net/sched/cls_cgroup.c
@@@ -98,9 -79,9 +102,10 @@@ static struct cftype ss_files[] = 

  struct cgroup_subsys net_cls_subsys = {
  	.name		= "net_cls",
- 	.create		= cgrp_create,
- 	.destroy	= cgrp_destroy,
+ 	.css_alloc	= cgrp_css_alloc,
+ 	.css_online	= cgrp_css_online,
+ 	.css_free	= cgrp_css_free,
 +	.attach		= cgrp_attach,
  	.subsys_id	= net_cls_subsys_id,
  	.base_cftypes	= ss_files,
  	.module		= THIS_MODULE,

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* [PATCH] ip6mr: Add sizeof verification to MRT6_ASSERT and MT6_PIM
From: Joe Perches @ 2012-11-26  4:26 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev
In-Reply-To: <20121125.164036.1283541826012498001.davem@davemloft.net>

Verify the length of the user-space arguments.

Signed-off-by: Joe Perches <joe@perches.com>
---
 net/ipv6/ip6mr.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index 79bb490..926ea54 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -1646,6 +1646,9 @@ int ip6_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, uns
 	case MRT6_ASSERT:
 	{
 		int v;
+
+		if (optlen != sizeof(v))
+			return -EINVAL;
 		if (get_user(v, (int __user *)optval))
 			return -EFAULT;
 		mrt->mroute_do_assert = v;
@@ -1656,6 +1659,9 @@ int ip6_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, uns
 	case MRT6_PIM:
 	{
 		int v;
+
+		if (optlen != sizeof(v))
+			return -EINVAL;
 		if (get_user(v, (int __user *)optval))
 			return -EFAULT;
 		v = !!v;

^ permalink raw reply related

* Re: [PATCH] net: ipmr: limit MRT_TABLE identifiers
From: Chen Gang @ 2012-11-26  3:13 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1353898520.30446.868.camel@edumazet-glaptop>

于 2012年11月26日 10:55, Eric Dumazet 写道:
> On Mon, 2012-11-26 at 10:30 +0800, Chen Gang wrote:
> 
>>   maybe a hacker can do ?  (I just guess).
>>
>>   for my experience:
>>     It is necessary to think of more coding ways, when we have to give comments inside a function.
> 
> I have absolutely no idea of what you are trying to say.
> 

excuse me, my English is not quite well.
I will try to say it as clear as I can.

what I want to say are:
  for os api, we need describe them as full as we can (including limitations, although it seems minor).
  I do not suggest to give comments inside a fuction (especially coding with hard code number).
  they are just my suggestion, not mean, it should be regression.

at least, I think this patch is valuable.

-- 
Chen Gang

Asianux Corporation

^ permalink raw reply

* Re: [RFC net-next PATCH V1 5/9] net: frag per CPU mem limit and LRU list accounting
From: Cong Wang @ 2012-11-26  2:54 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Eric Dumazet, David S. Miller, Florian Westphal, netdev,
	Pablo Neira Ayuso, Thomas Graf, Patrick McHardy, Paul E. McKenney,
	Herbert Xu
In-Reply-To: <20121123130826.18764.66507.stgit@dragon>

On Fri, 2012-11-23 at 14:08 +0100, Jesper Dangaard Brouer wrote:
> +       // QUESTION: Please advice in a better alloc solution than
> NR_CPUS
> +       struct cpu_resource     percpu[NR_CPUS]; 

alloc_percpu(struct cpu_resource) ?

BTW, struct cpu_resource is not a good name here, it is too generic.

^ permalink raw reply

* Re: [PATCH] net: ipmr: limit MRT_TABLE identifiers
From: Eric Dumazet @ 2012-11-26  2:55 UTC (permalink / raw)
  To: Chen Gang; +Cc: David Miller, netdev
In-Reply-To: <50B2D42A.6090804@asianux.com>

On Mon, 2012-11-26 at 10:30 +0800, Chen Gang wrote:

>   maybe a hacker can do ?  (I just guess).
> 
>   for my experience:
>     It is necessary to think of more coding ways, when we have to give comments inside a function.

I have absolutely no idea of what you are trying to say.

^ permalink raw reply

* Re: [PATCH] net: ipmr: limit MRT_TABLE identifiers
From: Chen Gang @ 2012-11-26  2:30 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1353896372.30446.866.camel@edumazet-glaptop>

于 2012年11月26日 10:19, Eric Dumazet 写道:
> On Mon, 2012-11-26 at 09:34 +0800, Chen Gang wrote:
> 
>>     if "pimre%u" (or another format), will not hurt the functional features, I suggest to use it
>>       since, we need try our best to not touch the OS API. 
>>       ("pimreg%u" seems an internal format, not OS API Level)
> 
> Have you taken a look at user code base before suggesting such a
> change ?

  maybe a hacker can do ?  (I just guess).

  for my experience:
    It is necessary to think of more coding ways, when we have to give comments inside a function.


> 
> My patch is the safest change: Just make sure machine doesnt crash if
> user ask stupid things.
> 
> No possible regression.
> 
> 
> 
> 
> 


-- 
Chen Gang

Asianux Corporation

^ permalink raw reply

* Re: [PATCH] net: ipmr: limit MRT_TABLE identifiers
From: Eric Dumazet @ 2012-11-26  2:19 UTC (permalink / raw)
  To: Chen Gang; +Cc: David Miller, netdev
In-Reply-To: <50B2C72F.9000100@asianux.com>

On Mon, 2012-11-26 at 09:34 +0800, Chen Gang wrote:

>     if "pimre%u" (or another format), will not hurt the functional features, I suggest to use it
>       since, we need try our best to not touch the OS API. 
>       ("pimreg%u" seems an internal format, not OS API Level)

Have you taken a look at user code base before suggesting such a
change ?

My patch is the safest change: Just make sure machine doesnt crash if
user ask stupid things.

No possible regression.

^ permalink raw reply

* Re: [PATCH] net: ipmr: limit MRT_TABLE identifiers
From: Chen Gang @ 2012-11-26  1:34 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <50B2C470.5090802@asianux.com>

于 2012年11月26日 09:22, Chen Gang 写道:
> 于 2012年11月26日 03:44, Eric Dumazet 写道:
>> From: Eric Dumazet <edumazet@google.com>
>>
>> Name of pimreg devices are built from following format :
>>
>> char name[IFNAMSIZ]; // IFNAMSIZ == 16
>>
>> sprintf(name, "pimreg%u", mrt->id);
>>
>> We must therefore limit mrt->id to 9 decimal digits
>> or risk a buffer overflow and a crash.
>>
>> Restrict table identifiers in [0 ... 999999999] interval.
>>

    if "pimre%u" (or another format), will not hurt the functional features, I suggest to use it
      since, we need try our best to not touch the OS API. 
      ("pimreg%u" seems an internal format, not OS API Level)

> 
>   if we have to stick to "pimreg%u" (or will hurt the functional features)
>   suggest to let user mode know this limitation. 
>     define a macro in public header (user mode can know it) and give comments.
>     use macro instead of number.
>     remove the comments which is inside internal function.
> 
>   thanks.
> 
> gchen.
> 
> 
>> Reported-by: Chen Gang <gang.chen@asianux.com>
>> Signed-off-by: Eric Dumazet <edumazet@google.com>
>> ---
>>  net/ipv4/ipmr.c |    4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
>> index 6168c4d..3eab2b2 100644
>> --- a/net/ipv4/ipmr.c
>> +++ b/net/ipv4/ipmr.c
>> @@ -1318,6 +1318,10 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, unsi
>>  		if (get_user(v, (u32 __user *)optval))
>>  			return -EFAULT;
>>  
>> +		/* "pimreg%u" should not exceed 16 bytes (IFNAMSIZ) */
>> +		if (v != RT_TABLE_DEFAULT && v >= 1000000000)
>> +			return -EINVAL;
>> +
>>  		rtnl_lock();
>>  		ret = 0;
>>  		if (sk == rtnl_dereference(mrt->mroute_sk)) {
>>
>>
>>
>>
> 
> 


-- 
Chen Gang

Asianux Corporation

^ permalink raw reply

* Re: [PATCH] net: ipmr: limit MRT_TABLE identifiers
From: Chen Gang @ 2012-11-26  1:22 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1353872669.30446.863.camel@edumazet-glaptop>

于 2012年11月26日 03:44, Eric Dumazet 写道:
> From: Eric Dumazet <edumazet@google.com>
> 
> Name of pimreg devices are built from following format :
> 
> char name[IFNAMSIZ]; // IFNAMSIZ == 16
> 
> sprintf(name, "pimreg%u", mrt->id);
> 
> We must therefore limit mrt->id to 9 decimal digits
> or risk a buffer overflow and a crash.
> 
> Restrict table identifiers in [0 ... 999999999] interval.
> 

  if we have to stick to "pimreg%u" (or will hurt the functional features)
  suggest to let user mode know this limitation. 
    define a macro in public header (user mode can know it) and give comments.
    use macro instead of number.
    remove the comments which is inside internal function.

  thanks.

gchen.


> Reported-by: Chen Gang <gang.chen@asianux.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  net/ipv4/ipmr.c |    4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
> index 6168c4d..3eab2b2 100644
> --- a/net/ipv4/ipmr.c
> +++ b/net/ipv4/ipmr.c
> @@ -1318,6 +1318,10 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, unsi
>  		if (get_user(v, (u32 __user *)optval))
>  			return -EFAULT;
>  
> +		/* "pimreg%u" should not exceed 16 bytes (IFNAMSIZ) */
> +		if (v != RT_TABLE_DEFAULT && v >= 1000000000)
> +			return -EINVAL;
> +
>  		rtnl_lock();
>  		ret = 0;
>  		if (sk == rtnl_dereference(mrt->mroute_sk)) {
> 
> 
> 
> 


-- 
Chen Gang

Asianux Corporation

^ permalink raw reply

* Re: [PATCH] atm: br2684: Fix excessive queue bloat
From: Krzysztof Mazur @ 2012-11-25 23:52 UTC (permalink / raw)
  To: David Woodhouse
  Cc: netdev, John Crispin, Dave Täht, Chas Williams (CONTRACTOR)
In-Reply-To: <1353880892.26346.300.camel@shinybook.infradead.org>

On Sun, Nov 25, 2012 at 10:01:32PM +0000, David Woodhouse wrote:
> 
> Yeah. This is fairly much the same conversation I ended up having when I
> did the same for PPPoATM.
> 
> Some day *perhaps* we might look at doing something adaptive, so it'll
> detect a TX underrun and increase the amount of buffering. But this
> seems perfectly good for now.

The biggest problem is the existence of additional, mostly unnecessary
huge software queue that we have between us and the hardware on "ATM sockets".
Most of ATM drivers have some hardware queue, but accept and queue
additional frames in software and it just adds unnecessary delay
- it's much better to queue those frames in netdev queue. With this patch
we limit this queue by limiting the number of frames in this queue and
hardware queue to 2 frames, but the hardware may require larger *hardware*
queue to saturate the link. If we had only the hardware queue, the queue
would be probably sized adequately in ATM driver or could be tuned by user.

I think that the 2 frame queue is much better than the huge queue,
that we have now.

> 
> > Maybe this magic "2" and the comment should be moved to some #define.
> 
> Doesn't make it any less magic. I'm happier with it as it is, with a
> clear comment describing why it's done that way.
> 

In PPPoATM we have NONE_INFLIGHT.

Krzysiek

^ permalink raw reply

* Re: [PATCH] 8139cp: re-enable interrupts after tx timeout
From: Francois Romieu @ 2012-11-25 22:43 UTC (permalink / raw)
  To: David Woodhouse; +Cc: netdev, jasowang, gilboad, jgarzik
In-Reply-To: <1353799650.26346.289.camel@shinybook.infradead.org>

David Woodhouse <dwmw2@infradead.org> :
> On Sat, 2012-11-24 at 23:58 +0100, Francois Romieu wrote:
> > While you are at it, do you see what would prevent the watchdog
> > timer and the rx napi handler to race on two different CPUs
> 
> Hm, good point. Is it sufficient just to stick napi_disable() and
> napi_enable() in there?

No, napi_disable() may sleep.

> And should we try to reset the TX state machine without stomping on
> RX at all ?

Either that or the watchdog timer will have to queue some user context
work. Things may change later when you go lockless (hint, hint).

-- 
Ueimor

^ permalink raw reply

* [PATCH v2] atm: br2684: Fix excessive queue bloat
From: David Woodhouse @ 2012-11-25 22:06 UTC (permalink / raw)
  To: netdev, John Crispin, Dave Täht, Chas Williams (CONTRACTOR)

[-- Attachment #1: Type: text/plain, Size: 5330 bytes --]

There's really no excuse for an additional wmem_default of buffering
between the netdev queue and the ATM device. Two packets (one in-flight,
and one ready to send) ought to be fine. It's not as if it should take
long to get another from the netdev queue when we need it.

If necessary we can make the queue space configurable later, but I don't
think it's likely to be necessary.

cf. commit 9d02daf754238adac48fa075ee79e7edd3d79ed3 (pppoatm: Fix
excessive queue bloat) which did something very similar for PPPoATM.

Note that there is a tremendously unlikely race condition which may
result in qspace temporarily going negative. If a CPU running the
br2684_pop() function goes off into the weeds for a long period of time
after incrementing qspace to 1, but before calling netdev_wake_queue()...
and another CPU ends up calling br2684_start_xmit() and *stopping* the
queue again before the first CPU comes back, the netdev queue could
end up being woken when qspace has already reached zero.

An alternative approach to coping with this race would be to check in
br2684_start_xmit() for qspace==0 and return NETDEV_TX_BUSY, but just
using '> 0' and '< 1' for comparison instead of '== 0' and '!= 0' is
simpler. It just warranted a mention of *why* we do it that way...

Move the call to atmvcc->send() to happen *after* the accounting and
potentially stopping the netdev queue, in br2684_xmit_vcc(). This matters
if the ->send() call suffers an immediate failure, because it'll call
br2684_pop() with the offending skb before returning. We want that to
happen *after* we've done the initial accounting for the packet in
question. Also make it return an appropriate success/failure indication
while we're at it.

Tested by running 'ping -l 1000 bottomless.aaisp.net.uk' from within my
network, with only a single PPPoE-over-BR2684 link running. And after
setting txqueuelen on the nas0 interface to something low (5, in fact).
Before the patch, we'd see about 15 packets being queued and a resulting
latency of ~56ms being reached. After the patch, we see only about 8,
which is fairly much what we expect. And a max latency of ~36ms. On this
OpenWRT box, wmem_default is 163840.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Reviewed-by: Krzysztof Mazur <krzysiek@podlesie.net>
---
[v2: Comment format fixed]

 net/atm/br2684.c | 36 ++++++++++++++++++++++--------------
 1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/net/atm/br2684.c b/net/atm/br2684.c
index 4819d315..8eb6fbe 100644
--- a/net/atm/br2684.c
+++ b/net/atm/br2684.c
@@ -74,6 +74,7 @@ struct br2684_vcc {
 	struct br2684_filter filter;
 #endif /* CONFIG_ATM_BR2684_IPFILTER */
 	unsigned int copies_needed, copies_failed;
+	atomic_t qspace;
 };

 struct br2684_dev {
@@ -181,18 +182,15 @@ static struct notifier_block atm_dev_notifier = {
 static void br2684_pop(struct atm_vcc *vcc, struct sk_buff *skb)
 {
 	struct br2684_vcc *brvcc = BR2684_VCC(vcc);
-	struct net_device *net_dev = skb->dev;

-	pr_debug("(vcc %p ; net_dev %p )\n", vcc, net_dev);
+	pr_debug("(vcc %p ; net_dev %p )\n", vcc, brvcc->device);
 	brvcc->old_pop(vcc, skb);

-	if (!net_dev)
-		return;
-
-	if (atm_may_send(vcc, 0))
-		netif_wake_queue(net_dev);
-
+	/* If the queue space just went up from zero, wake */
+	if (atomic_inc_return(&brvcc->qspace) == 1)
+		netif_wake_queue(brvcc->device);
 }
+
 /*
  * Send a packet out a particular vcc.  Not to useful right now, but paves
  * the way for multiple vcc's per itf.  Returns true if we can send,
@@ -256,16 +254,19 @@ static int br2684_xmit_vcc(struct sk_buff *skb, struct net_device *dev,
 	ATM_SKB(skb)->atm_options = atmvcc->atm_options;
 	dev->stats.tx_packets++;
 	dev->stats.tx_bytes += skb->len;
-	atmvcc->send(atmvcc, skb);

-	if (!atm_may_send(atmvcc, 0)) {
+	if (atomic_dec_return(&brvcc->qspace) < 1) {
+		/* No more please! */
 		netif_stop_queue(brvcc->device);
-		/*check for race with br2684_pop*/
-		if (atm_may_send(atmvcc, 0))
-			netif_start_queue(brvcc->device);
+		/* We might have raced with br2684_pop() */
+		if (unlikely(atomic_read(&brvcc->qspace) > 0))
+			netif_wake_queue(brvcc->device);
 	}

-	return 1;
+	/* If this fails immediately, the skb will be freed and br2684_pop()
+	   will wake the queue if appropriate. Just return an error so that
+	   the stats are updated correctly */
+	return !atmvcc->send(atmvcc, skb);
 }

 static inline struct br2684_vcc *pick_outgoing_vcc(const struct sk_buff *skb,
@@ -504,6 +505,13 @@ static int br2684_regvcc(struct atm_vcc *atmvcc, void __user * arg)
 	brvcc = kzalloc(sizeof(struct br2684_vcc), GFP_KERNEL);
 	if (!brvcc)
 		return -ENOMEM;
+	/*
+	 * Allow two packets in the ATM queue. One actually being sent, and one
+	 * for the ATM 'TX done' handler to send. It shouldn't take long to get
+	 * the next one from the netdev queue, when we need it. More than that
+	 * would be bufferbloat.
+	 */
+	atomic_set(&brvcc->qspace, 2);
 	write_lock_irq(&devs_lock);
 	net_dev = br2684_find_dev(&be.ifspec);
 	if (net_dev == NULL) {
-- 
1.8.0

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 6171 bytes --]

^ permalink raw reply related

* Re: [PATCH] atm: br2684: Fix excessive queue bloat
From: David Woodhouse @ 2012-11-25 22:01 UTC (permalink / raw)
  To: Krzysztof Mazur
  Cc: netdev, John Crispin, Dave Täht, Chas Williams (CONTRACTOR)
In-Reply-To: <20121125214332.GA2722@shrek.podlesie.net>

[-- Attachment #1: Type: text/plain, Size: 1193 bytes --]

On Sun, 2012-11-25 at 22:43 +0100, Krzysztof Mazur wrote:
> On Sat, Nov 24, 2012 at 12:01:32AM +0000, David Woodhouse wrote:
> > There's really no excuse for an additional wmem_default of buffering
> > between the netdev queue and the ATM device. Two packets (one in-flight,
> > and one ready to send) ought to be fine. It's not as if it should take
> > long to get another from the netdev queue when we need it.
> > 
> > If necessary we can make the queue space configurable later, but I don't
> > think it's likely to be necessary.
> 
> Maybe some high-speed devices will require larger queue, especially for
> smaller packets, but 2 packet queue should be sufficient in almost all cases.

Yeah. This is fairly much the same conversation I ended up having when I
did the same for PPPoATM.

Some day *perhaps* we might look at doing something adaptive, so it'll
detect a TX underrun and increase the amount of buffering. But this
seems perfectly good for now.

> Maybe this magic "2" and the comment should be moved to some #define.

Doesn't make it any less magic. I'm happier with it as it is, with a
clear comment describing why it's done that way.

-- 
dwmw2


[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 6171 bytes --]

^ permalink raw reply

* Re: [PATCH] atm: br2684: Fix excessive queue bloat
From: Krzysztof Mazur @ 2012-11-25 21:43 UTC (permalink / raw)
  To: David Woodhouse
  Cc: netdev, John Crispin, Dave Täht, Chas Williams (CONTRACTOR)
In-Reply-To: <1353715292.26346.237.camel@shinybook.infradead.org>

On Sat, Nov 24, 2012 at 12:01:32AM +0000, David Woodhouse wrote:
> There's really no excuse for an additional wmem_default of buffering
> between the netdev queue and the ATM device. Two packets (one in-flight,
> and one ready to send) ought to be fine. It's not as if it should take
> long to get another from the netdev queue when we need it.
> 
> If necessary we can make the queue space configurable later, but I don't
> think it's likely to be necessary.

Maybe some high-speed devices will require larger queue, especially for
smaller packets, but 2 packet queue should be sufficient in almost all cases.

>  static inline struct br2684_vcc *pick_outgoing_vcc(const struct sk_buff *skb,
> @@ -504,6 +505,11 @@ static int br2684_regvcc(struct atm_vcc *atmvcc, void __user * arg)
>  	brvcc = kzalloc(sizeof(struct br2684_vcc), GFP_KERNEL);
>  	if (!brvcc)
>  		return -ENOMEM;
> +	/* Allow two packets in the ATM queue. One actually being sent, and one
> +	   for the ATM 'TX done' handler to send. It shouldn't take long to get
> +	   the next one from the netdev queue, when we need it. More than that
> +	   would be bufferbloat. */
> +	atomic_set(&brvcc->qspace, 2);

Maybe this magic "2" and the comment should be moved to some #define.

>  	write_lock_irq(&devs_lock);
>  	net_dev = br2684_find_dev(&be.ifspec);
>  	if (net_dev == NULL) {

Looks good,

Reviewed-by: Krzysztof Mazur <krzysiek@podlesie.net>

Krzysiek

^ permalink raw reply

* Re: [PATCH] ipv4/ipmr and ipv6/ip6mr: Convert int mroute_do_<foo> to bool
From: David Miller @ 2012-11-25 21:40 UTC (permalink / raw)
  To: joe; +Cc: eric.dumazet, netdev
In-Reply-To: <1353872130.6558.17.camel@joe-AO722>

From: Joe Perches <joe@perches.com>
Date: Sun, 25 Nov 2012 11:35:30 -0800

> Save a few bytes per table by convert mroute_do_assert and
> mroute_do_pim from int to bool.
> 
> Remove !! as the compiler does that when assigning int to bool.
> 
> Signed-off-by: Joe Perches <joe@perches.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next] ipv4: ipmr: various fixes and cleanups
From: David Miller @ 2012-11-25 21:40 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1353861705.30446.675.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 25 Nov 2012 08:41:45 -0800

> From: Eric Dumazet <edumazet@google.com>
> 
> 1) ip_mroute_setsockopt() & ip_mroute_getsockopt() should not
>    access/set raw_sk(sk)->ipmr_table before making sure the socket
>    is a raw socket, and protocol is IGMP
> 
> 2) MRT_INIT should return -EINVAL if optlen != sizeof(int), not
>    -ENOPROTOOPT
> 
> 3) MRT_ASSERT & MRT_PIM should validate optlen
> 
> 4) " (v) ? 1 : 0 " can be written as " !!v "
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied.

^ permalink raw reply

* Re: [PATCH] MAINTAINERS: Add Q: patchwork entries for net/ and drivers/net/
From: David Miller @ 2012-11-25 21:17 UTC (permalink / raw)
  To: joe; +Cc: netdev
In-Reply-To: <1353791905.6558.7.camel@joe-AO722>

From: Joe Perches <joe@perches.com>
Date: Sat, 24 Nov 2012 13:18:25 -0800

> Add the netdev patchwork entries for networking drivers.
> Change the W: entry to Q:for networking.
> 
> Signed-off-by: Joe Perches <joe@perches.com>

Applied to net-next, thanks.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox