Netdev List
 help / color / mirror / Atom feed
* Re: linux-next: build failure after merge of the final tree (net tree related)
From: David Miller @ 2010-04-27 17:18 UTC (permalink / raw)
  To: sfr; +Cc: netdev, linux-next, linux-kernel, yoshfuji
In-Reply-To: <20100427.093430.258110898.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Tue, 27 Apr 2010 09:34:30 -0700 (PDT)

> From: Stephen Rothwell <sfr@canb.auug.org.au>
> Date: Tue, 27 Apr 2010 15:25:16 +1000
> 
>> After merging the bkl-ioctl tree, today's linux-next build (powerpc
>> ppc44x_defconfig) failed like this:
>> 
>> net/bridge/br_multicast.c: In function 'br_ip6_multicast_alloc_query':
>> net/bridge/br_multicast.c:469: error: implicit declaration of function 'csum_ipv6_magic'
>> 
>> Introduced by commit 08b202b6726459626c73ecfa08fcdc8c3efc76c2 ("bridge
>> br_multicast: IPv6 MLD support") from the net tree.
>> 
>> csum_ipv6_magic is declared in net/ip6_checksum.h ...
> 
> Bummer, powerpc is one of the few platforms that doesn't get the header
> file implicitly so you always trip over this whereas we never see it in
> x86 and sparc64 builds :-)
> 
> I'll fix this, thanks!

I just committed the following for this:

bridge: Fix build of ipv6 multicast code.

Based upon a report from Stephen Rothwell:

--------------------
net/bridge/br_multicast.c: In function 'br_ip6_multicast_alloc_query':
net/bridge/br_multicast.c:469: error: implicit declaration of function 'csum_ipv6_magic'

Introduced by commit 08b202b6726459626c73ecfa08fcdc8c3efc76c2 ("bridge
br_multicast: IPv6 MLD support") from the net tree.

csum_ipv6_magic is declared in net/ip6_checksum.h ...
--------------------

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/bridge/br_multicast.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index e481dbd..2048ef0 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -28,6 +28,7 @@
 #include <net/ipv6.h>
 #include <net/mld.h>
 #include <net/addrconf.h>
+#include <net/ip6_checksum.h>
 #endif
 
 #include "br_private.h"
-- 
1.7.0.4

^ permalink raw reply related

* Re: [PATCH] bnx2x: add support for receive hashing
From: Eric Dumazet @ 2010-04-27 17:13 UTC (permalink / raw)
  To: David Miller; +Cc: bmb, therbert, netdev, rick.jones2
In-Reply-To: <20100427.095108.68126984.davem@davemloft.net>

Le mardi 27 avril 2010 à 09:51 -0700, David Miller a écrit :
> From: Brian Bloniarz <bmb@athenacr.com>
> Date: Tue, 27 Apr 2010 09:37:11 -0400
> 
> > David Miller wrote:
> >> How damn hard is it to add two 16-bit ports to the hash regardless of
> >> protocol?
> >>   
> > Come to think of it, for UDP the hash must ignore
> > the srcport and srcaddr, because a single bound
> > socket is going to wildcard both those fields.
> 
> For load distribution we don't care if the local socket is wildcard
> bounded on source.
> 
> It's going to be fully specified in the packet, and that's enough.
> 
> Sure, for full RFS some amends might be necessary in this area, but
> for RPS and adapter based hw steering, using all of the ports is
> entirely sufficient.

Well well well...

I was doing some pktgen tests, with :

 pgset "src_min 192.168.0.10"
 pgset "src_max 192.168.0.110"
 pgset "dst_min 192.168.0.2"
 pgset "dst_max 192.168.0.2"
 pgset "udp_dst_min 4000"
 pgset "udp_dst_max 4000"

So I simulate 100 remote IPS bombarding a single port on target machine.
pktgen injects about 930.000 pps

sofirq of my target received on cpu0, and RPS spread packets to 7 other
cpus.

And my receiver is stuck (he can read about 50 pps !!!)


As soon as I disable rps, my receiver can catch 850.000 pps


RPS OFF: perf top of cpu 0

------------------------------------------------------------------------------------------------------------------------------
   PerfTop:    1001 irqs/sec  kernel:100.0% [1000Hz cycles],  (all, cpu: 0)
------------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function               DSO
             _______ _____ ______________________ _______

              385.00 10.2% __udp4_lib_lookup      vmlinux
              322.00  8.5% ip_route_input         vmlinux
              312.00  8.3% sock_queue_rcv_skb     vmlinux
              262.00  6.9% do_raw_spin_lock       vmlinux
              251.00  6.6% __alloc_skb            vmlinux
              239.00  6.3% sock_put               vmlinux
              207.00  5.5% eth_type_trans         vmlinux
              202.00  5.4% __slab_alloc           vmlinux
              159.00  4.2% __kmalloc_track_caller vmlinux
              149.00  3.9% __sk_mem_schedule      vmlinux
              125.00  3.3% kmem_cache_alloc       vmlinux
              116.00  3.1% ipt_do_table           vmlinux
              115.00  3.0% do_raw_read_lock       vmlinux
               71.00  1.9% tg3_poll_work          vmlinux
               65.00  1.7% __netdev_alloc_skb     vmlinux
               64.00  1.7% skb_pull               vmlinux
               58.00  1.5% ip_rcv                 vmlinux
               58.00  1.5% __slab_free            vmlinux
               53.00  1.4% udp_queue_rcv_skb      vmlinux
               47.00  1.2% nf_iterate             vmlinux
               44.00  1.2% __netif_receive_skb    vmlinux
               29.00  0.8% sock_def_readable      vmlinux
               28.00  0.7% do_raw_spin_unlock     vmlinux
               26.00  0.7% kfree                  vmlinux
               25.00  0.7% __udp4_lib_rcv         vmlinux
               24.00  0.6% ip_rcv_finish          vmlinux
               24.00  0.6% __list_add             vmlinux


RPS, on, a perf top of a slave CPU :

------------------------------------------------------------------------------------------------------------------------------
   PerfTop:    1000 irqs/sec  kernel:100.0% [1000Hz cycles],  (all, cpu: 1)
------------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function            DSO
             _______ _____ ___________________ _______

             2411.00 62.0% do_raw_spin_lock    vmlinux
              690.00 17.7% delay_tsc           vmlinux
              234.00  6.0% __udp4_lib_lookup   vmlinux
              174.00  4.5% sock_put            vmlinux
               72.00  1.9% ip_rcv              vmlinux
               51.00  1.3% __netif_receive_skb vmlinux
               43.00  1.1% do_raw_spin_unlock  vmlinux
               39.00  1.0% __delay             vmlinux
               38.00  1.0% sock_queue_rcv_skb  vmlinux
               36.00  0.9% udp_queue_rcv_skb   vmlinux
               31.00  0.8% ip_route_input      vmlinux
               15.00  0.4% __slab_free         vmlinux
               12.00  0.3% ipt_do_table        vmlinux
               11.00  0.3% skb_release_data    vmlinux
                7.00  0.2% kfree               vmlinux
                5.00  0.1% nf_iterate          vmlinux

So we have a BIG problem :

All cpus are fighting to get the socket lock,
and very litle progress is done.

Note this problem has nothing to do with RPS, we could have 
it with multiqueue as well.

Oh well...




^ permalink raw reply

* [PATCH net-next] bridge: multicast router list manipulation
From: Stephen Hemminger @ 2010-04-27 17:13 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David S. Miller, netdev
In-Reply-To: <E1NlbuX-00021m-Bq@gondolin.me.apana.org.au>

I prefer that the hlist be only accessed through the hlist macro
objects. Explicit twiddling of links (especially with RCU) exposes
the code to future bugs.

Compile tested only.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>


--- a/net/bridge/br_multicast.c	2010-04-27 09:54:02.180531924 -0700
+++ b/net/bridge/br_multicast.c	2010-04-27 10:07:19.188688664 -0700
@@ -1041,21 +1041,21 @@ static int br_ip6_multicast_mld2_report(
 static void br_multicast_add_router(struct net_bridge *br,
 				    struct net_bridge_port *port)
 {
-	struct hlist_node *p;
-	struct hlist_node **h;
+	struct net_bridge_port *p;
+	struct hlist_node *n, *last = NULL;
 
-	for (h = &br->router_list.first;
-	     (p = *h) &&
-	     (unsigned long)container_of(p, struct net_bridge_port, rlist) >
-	     (unsigned long)port;
-	     h = &p->next)
-		;
-
-	port->rlist.pprev = h;
-	port->rlist.next = p;
-	rcu_assign_pointer(*h, &port->rlist);
-	if (p)
-		p->pprev = &port->rlist.next;
+	hlist_for_each_entry(p, n, &br->router_list, rlist) {
+		if ((unsigned long) port >= (unsigned long) p) {
+			hlist_add_before_rcu(n, &port->rlist);
+			return;
+		}
+		last = n;
+	}
+
+	if (last)
+		hlist_add_after_rcu(last, &port->rlist);
+	else
+		hlist_add_head_rcu(&port->rlist, &br->router_list);
 }
 
 static void br_multicast_mark_router(struct net_bridge *br,


^ permalink raw reply

* [PATCH net-next] bridge: use is_multicast_ether_addr
From: Stephen Hemminger @ 2010-04-27 17:13 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller; +Cc: netdev
In-Reply-To: <E1NlbuW-00021d-8E@gondolin.me.apana.org.au>

Use existing inline function.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

--- a/net/bridge/br_device.c	2010-04-27 09:49:30.059258391 -0700
+++ b/net/bridge/br_device.c	2010-04-27 09:50:21.439878721 -0700
@@ -36,7 +36,7 @@ netdev_tx_t br_dev_xmit(struct sk_buff *
 	skb_reset_mac_header(skb);
 	skb_pull(skb, ETH_HLEN);
 
-	if (dest[0] & 1) {
+	if (is_multicast_ether_addr(dest)) {
 		if (br_multicast_rcv(br, NULL, skb))
 			goto out;
 

^ permalink raw reply

* Re: [PATCH] bnx2x: add support for receive hashing
From: David Miller @ 2010-04-27 17:06 UTC (permalink / raw)
  To: bmb; +Cc: therbert, eric.dumazet, netdev, rick.jones2
In-Reply-To: <4BD71890.2050606@athenacr.com>

From: Brian Bloniarz <bmb@athenacr.com>
Date: Tue, 27 Apr 2010 13:02:08 -0400

> Maybe I'm misunderstanding... won't it distribute the
> packet handling load to multiple cores, but then all
> those cores will contend trying to deliver those packets
> to the single socket?
> 
> I was assuming that this'd be a net loss over just doing
> all the protocol handling on a single core. I haven't
> done any benchmarks yet.

Whether it's a new loss depends upon the application.

Also, on the non-application side f.e. a router or firewall, this is
exactly the behavior you want.

^ permalink raw reply

* Re: [PATCH 0/4] net: ipmr netlink interface for route dumping
From: David Miller @ 2010-04-27 17:03 UTC (permalink / raw)
  To: Patrick, McHardy, kaber; +Cc: netdev
In-Reply-To: <1272374785-3858-1-git-send-email-kaber@trash.net>


Whoa, there are three of you now?!?!?!

:-)

^ permalink raw reply

* Re: [PATCH] bnx2x: add support for receive hashing
From: Brian Bloniarz @ 2010-04-27 17:02 UTC (permalink / raw)
  To: David Miller; +Cc: therbert, eric.dumazet, netdev, rick.jones2
In-Reply-To: <20100427.095108.68126984.davem@davemloft.net>

David Miller wrote:
> From: Brian Bloniarz <bmb@athenacr.com>
> Date: Tue, 27 Apr 2010 09:37:11 -0400
> 
>> David Miller wrote:
>>> How damn hard is it to add two 16-bit ports to the hash regardless of
>>> protocol?
>>>   
>> Come to think of it, for UDP the hash must ignore
>> the srcport and srcaddr, because a single bound
>> socket is going to wildcard both those fields.
> 
> For load distribution we don't care if the local socket is wildcard
> bounded on source.
> 
> It's going to be fully specified in the packet, and that's enough.

Maybe I'm misunderstanding... won't it distribute the
packet handling load to multiple cores, but then all
those cores will contend trying to deliver those packets
to the single socket?

I was assuming that this'd be a net loss over just doing
all the protocol handling on a single core. I haven't
done any benchmarks yet.

^ permalink raw reply

* Re: [patch v2] sctp: cleanup: remove duplicate assignment
From: David Miller @ 2010-04-27 16:58 UTC (permalink / raw)
  To: vladislav.yasevich
  Cc: error27, sri, yjwei, cdischino, linux-sctp, netdev,
	kernel-janitors
In-Reply-To: <4BD6F582.9030804@hp.com>

From: Vlad Yasevich <vladislav.yasevich@hp.com>
Date: Tue, 27 Apr 2010 10:32:34 -0400

> 
> 
> Dan Carpenter wrote:
>> This assignment isn't needed because we did it earlier already.
>> 
>> Also another reason to delete the assignment is because it triggers a
>> Smatch warning about checking for NULL pointers after a dereference.
>> 
>> Reported-by: Vlad Yasevich <vladislav.yasevich@hp.com>
>> Signed-off-by: Dan Carpenter <error27@gmail.com>
> 
> Thanks.  I'll take this one.

And when will I get this from you? :-)

^ permalink raw reply

* Re: [net-2.6 PATCH] ixgbevf: Fix link speed display
From: David Miller @ 2010-04-27 16:57 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo, gregory.v.rose
In-Reply-To: <20100427103834.23338.22213.stgit@localhost.localdomain>


Not appropriate this late in the -RC series.

I don't see this in the regression list, and reported link speed being
incorrect is not a catastropic crash and/or failure.

^ permalink raw reply

* Re: [net-2.6 PATCH] ixgbe: cleanup ethtool autoneg input
From: David Miller @ 2010-04-27 16:56 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo, donald.c.skidmore
In-Reply-To: <20100427112513.24196.70511.stgit@localhost.localdomain>


This is also not appropriate for net-2.6, it doesn't fix a
regression in the regression list and it doesn't fix a catastropic
crash or failure.

You really have to be kidding me if you thing a patch like this
is fine this late in the -RC series.

You also aren't even numbering your patches, which is quite a
transgression when a submission of a set of patches all to the same
driver and/or files.

^ permalink raw reply

* Re: [net-2.6 PATCH] ixgbe: Properly display 1 gig downshift warning for backplane
From: David Miller @ 2010-04-27 16:54 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo, anjali.singhai
In-Reply-To: <20100427103724.23338.53872.stgit@localhost.localdomain>


Not fixing a regression, not fixing a catastropic error, just adding
a warning message.

Therefore not appropriate for net-2.6

^ permalink raw reply

* Re: [PATCH] bnx2x: add support for receive hashing
From: David Miller @ 2010-04-27 16:51 UTC (permalink / raw)
  To: bmb; +Cc: therbert, eric.dumazet, netdev, rick.jones2
In-Reply-To: <4BD6E887.3000804@athenacr.com>

From: Brian Bloniarz <bmb@athenacr.com>
Date: Tue, 27 Apr 2010 09:37:11 -0400

> David Miller wrote:
>> How damn hard is it to add two 16-bit ports to the hash regardless of
>> protocol?
>>   
> Come to think of it, for UDP the hash must ignore
> the srcport and srcaddr, because a single bound
> socket is going to wildcard both those fields.

For load distribution we don't care if the local socket is wildcard
bounded on source.

It's going to be fully specified in the packet, and that's enough.

Sure, for full RFS some amends might be necessary in this area, but
for RPS and adapter based hw steering, using all of the ports is
entirely sufficient.

^ permalink raw reply

* Re: linux-next: build failure after merge of the final tree (net tree related)
From: David Miller @ 2010-04-27 16:34 UTC (permalink / raw)
  To: sfr; +Cc: netdev, linux-next, linux-kernel, yoshfuji
In-Reply-To: <20100427152516.580ee620.sfr@canb.auug.org.au>

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Tue, 27 Apr 2010 15:25:16 +1000

> After merging the bkl-ioctl tree, today's linux-next build (powerpc
> ppc44x_defconfig) failed like this:
> 
> net/bridge/br_multicast.c: In function 'br_ip6_multicast_alloc_query':
> net/bridge/br_multicast.c:469: error: implicit declaration of function 'csum_ipv6_magic'
> 
> Introduced by commit 08b202b6726459626c73ecfa08fcdc8c3efc76c2 ("bridge
> br_multicast: IPv6 MLD support") from the net tree.
> 
> csum_ipv6_magic is declared in net/ip6_checksum.h ...

Bummer, powerpc is one of the few platforms that doesn't get the header
file implicitly so you always trip over this whereas we never see it in
x86 and sparc64 builds :-)

I'll fix this, thanks!

^ permalink raw reply

* Re: [PATCH] RCU: don't turn off lockdep when find suspicious rcu_dereference_check() usage
From: Eric Dumazet @ 2010-04-27 16:33 UTC (permalink / raw)
  To: paulmck
  Cc: Eric W. Biederman, Miles Lane, Vivek Goyal, Eric Paris,
	Lai Jiangshan, Ingo Molnar, Peter Zijlstra, LKML, nauman, netdev,
	Jens Axboe, Gui Jianfeng, Li Zefan, Johannes Berg
In-Reply-To: <20100427162201.GA5826@linux.vnet.ibm.com>

Le mardi 27 avril 2010 à 09:22 -0700, Paul E. McKenney a écrit :
> On Mon, Apr 26, 2010 at 09:27:44PM -0700, Paul E. McKenney wrote:
> > On Mon, Apr 26, 2010 at 11:35:10AM -0700, Eric W. Biederman wrote:
> > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
> > > 
> > > > Eric Dumazet traced these down to a commit from Eric Biederman.
> > > >
> > > > If I don't hear from Eric Biederman in a few days, I will attempt a
> > > > patch, but it would be more likely to be correct coming from someone
> > > > with a better understanding of the code.  ;-)
> > > 
> > > I already replied.
> > > 
> > > http://lkml.org/lkml/2010/4/21/420
> > 
> > You did indeed!!!  This experience is giving me an even better appreciation
> > of the maintainers' ability to keep all their patches straight!
> > 
> > I will put together something based on your suggestion.
> 
> How about the following?
> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> commit 85fa42bd568ab99c375f018761ae6345249942cd
> Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Date:   Mon Apr 26 21:40:05 2010 -0700
> 
>     net: suppress RCU lockdep false positive in twsk_net()
>     
>     Calls to twsk_net() are in some cases protected by reference counting
>     as an alternative to RCU protection.  Cases covered by reference counts
>     include __inet_twsk_kill(), inet_twsk_free(), inet_twdr_do_twkill_work(),
>     inet_twdr_twcal_tick(), and tcp_timewait_state_process().  RCU is used
>     by inet_twsk_purge().  Locking is used by established_get_first()
>     and established_get_next().  Finally, __inet_twsk_hashdance() is an
>     initialization case.
>     
>     It appears to be non-trivial to locate the appropriate locks and
>     reference counts from within twsk_net(), so used rcu_dereference_raw().
>     
>     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> 
> diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
> index 79f67ea..a066fdd 100644
> --- a/include/net/inet_timewait_sock.h
> +++ b/include/net/inet_timewait_sock.h
> @@ -224,7 +224,9 @@ static inline
>  struct net *twsk_net(const struct inet_timewait_sock *twsk)
>  {
>  #ifdef CONFIG_NET_NS
> -	return rcu_dereference(twsk->tw_net);
> +	return rcu_dereference_raw(twsk->tw_net); /* protected by locking, */
> +						  /* reference counting, */
> +						  /* initialization, or RCU. */
>  #else
>  	return &init_net;
>  #endif
> --

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

^ permalink raw reply

* Re: [PATCH 0/3] [RFC] ptp: IEEE 1588 clock support
From: Wolfgang Grandegger @ 2010-04-27 16:20 UTC (permalink / raw)
  To: Richard Cochran; +Cc: netdev
In-Reply-To: <4BD6E837.2040505@grandegger.com>

Wolfgang Grandegger wrote:
> Hi Richard,
> 
> Richard Cochran wrote:
>> BTW,
>>
>> To show how the API works from the user space, I also patches For the
>> popular ptpd program to that project.
>>
>>    https://sourceforge.net/tracker/?func=detail&aid=2992845&group_id=139814&atid=744634
> 
> Cool stuff, I'm giving it a try on my MPC8313 PTP setup.

Do you have also a patch adding support for hardware timestamping to ptpd?

Thanks,

Wolfgang.

^ permalink raw reply

* Re: [PATCH] RCU: don't turn off lockdep when find suspicious rcu_dereference_check() usage
From: Paul E. McKenney @ 2010-04-27 16:22 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Miles Lane, Vivek Goyal, Eric Paris, Lai Jiangshan, Ingo Molnar,
	Peter Zijlstra, LKML, nauman, eric.dumazet, netdev, Jens Axboe,
	Gui Jianfeng, Li Zefan, Johannes Berg
In-Reply-To: <20100427042744.GD2510@linux.vnet.ibm.com>

On Mon, Apr 26, 2010 at 09:27:44PM -0700, Paul E. McKenney wrote:
> On Mon, Apr 26, 2010 at 11:35:10AM -0700, Eric W. Biederman wrote:
> > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
> > 
> > > Eric Dumazet traced these down to a commit from Eric Biederman.
> > >
> > > If I don't hear from Eric Biederman in a few days, I will attempt a
> > > patch, but it would be more likely to be correct coming from someone
> > > with a better understanding of the code.  ;-)
> > 
> > I already replied.
> > 
> > http://lkml.org/lkml/2010/4/21/420
> 
> You did indeed!!!  This experience is giving me an even better appreciation
> of the maintainers' ability to keep all their patches straight!
> 
> I will put together something based on your suggestion.

How about the following?

							Thanx, Paul

------------------------------------------------------------------------

commit 85fa42bd568ab99c375f018761ae6345249942cd
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date:   Mon Apr 26 21:40:05 2010 -0700

    net: suppress RCU lockdep false positive in twsk_net()
    
    Calls to twsk_net() are in some cases protected by reference counting
    as an alternative to RCU protection.  Cases covered by reference counts
    include __inet_twsk_kill(), inet_twsk_free(), inet_twdr_do_twkill_work(),
    inet_twdr_twcal_tick(), and tcp_timewait_state_process().  RCU is used
    by inet_twsk_purge().  Locking is used by established_get_first()
    and established_get_next().  Finally, __inet_twsk_hashdance() is an
    initialization case.
    
    It appears to be non-trivial to locate the appropriate locks and
    reference counts from within twsk_net(), so used rcu_dereference_raw().
    
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index 79f67ea..a066fdd 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -224,7 +224,9 @@ static inline
 struct net *twsk_net(const struct inet_timewait_sock *twsk)
 {
 #ifdef CONFIG_NET_NS
-	return rcu_dereference(twsk->tw_net);
+	return rcu_dereference_raw(twsk->tw_net); /* protected by locking, */
+						  /* reference counting, */
+						  /* initialization, or RCU. */
 #else
 	return &init_net;
 #endif

^ permalink raw reply related

* Re: [PATCH net-next-2.6] net: fix a lockdep rcu warning in __sk_dst_set()
From: Paul E. McKenney @ 2010-04-27 16:17 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1272350443.4861.9.camel@edumazet-laptop>

On Tue, Apr 27, 2010 at 08:40:43AM +0200, Eric Dumazet wrote:
> __sk_dst_set() might be called while no state can be integrated in a
> rcu_dereference_check() condition.
> 
> So use rcu_dereference_raw() to shutup lockdep warnings (if
> CONFIG_PROVE_RCU is set)

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 86a8ca1..94dbdbc 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1236,8 +1236,11 @@ __sk_dst_set(struct sock *sk, struct dst_entry *dst)
>  	struct dst_entry *old_dst;
> 
>  	sk_tx_queue_clear(sk);
> -	old_dst = rcu_dereference_check(sk->sk_dst_cache,
> -					lockdep_is_held(&sk->sk_dst_lock));
> +	/*
> +	 * This can be called while sk is owned by the caller only,
> +	 * with no state that can be checked in a rcu_dereference_check() cond
> +	 */
> +	old_dst = rcu_dereference_raw(sk->sk_dst_cache);
>  	rcu_assign_pointer(sk->sk_dst_cache, dst);
>  	dst_release(old_dst);
>  }
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH RFC: linux-next 1/2] irq: Add CPU mask affinity hint callback framework
From: Peter P Waskiewicz Jr @ 2010-04-27 16:04 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: davem@davemloft.net, arjan@linux.jf.intel.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <alpine.LFD.2.00.1004271425140.2951@localhost.localdomain>

On Tue, 27 Apr 2010, Thomas Gleixner wrote:

> On Sun, 18 Apr 2010, Peter P Waskiewicz Jr wrote:
>> +/**
>> + * struct irqaffinityhint - per interrupt affinity helper
>> + * @callback:	device driver callback function
>> + * @dev:	reference for the affected device
>> + * @irq:	interrupt number
>> + */
>> +struct irqaffinityhint {
>> +	irq_affinity_hint_t callback;
>> +	void *dev;
>> +	int irq;
>> +};
>
> Why do you need that extra data structure ? The device and the irq
> number are known, so all you need is the callback itself. So no need
> for allocating memory ....

When I register the function callback with the interrupt layer, I need to 
know what device structures to reference back in the driver.  In other 
words, if I call into an underlying driver with just an interrupt number, 
then I have no way at getting at the dev structures (netdevice for me, 
plus my private adapter structures), unless I declare them globally 
(yuck).

I had a different approach before this one where I assumed the device from 
the irq handler callback was safe to use for the device in this new 
callback.  I didn't feel really great about that, since it's an implicit 
assumption that could cause things to go sideways really quickly.

Let me know what you think either way.  I'm certainly willing to make a 
change, I just don't know at this point what's the safest approach from 
what I currently have.

>
>> +static ssize_t irq_affinity_hint_proc_write(struct file *file,
>> +		const char __user *buffer, size_t count, loff_t *pos)
>> +{
>> +	/* affinity_hint is read-only from proc */
>> +	return -EOPNOTSUPP;
>> +}
>> +
>
> Why do you want a write function when the file is read only ?

It's leftover paranoia.  I put it in early on, then changed the mode 
later.  I can remove this function.  I'll re-send something once we agree 
on how the code in your first comment should look.

Thanks Thomas!

-PJ

^ permalink raw reply

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
From: Herbert Xu @ 2010-04-27 15:55 UTC (permalink / raw)
  To: Jiri Bohac; +Cc: David Miller, yoshfuji, netdev, shemminger
In-Reply-To: <20100427155034.GA11157@midget.suse.cz>

On Tue, Apr 27, 2010 at 05:50:34PM +0200, Jiri Bohac wrote:
>
>   I think that what Herbert is doing is only going to enforce that
>   the ip6_del_rt() is never going to fail, so the dst_free()
>   won't ever be called anyway, right?

Yes I think you're right.  But let's fix the bigger problem first,
and then we can audit this and possibly turn it into a WARN_ON.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
From: Jiri Bohac @ 2010-04-27 15:50 UTC (permalink / raw)
  To: David Miller; +Cc: jbohac, herbert, yoshfuji, netdev, shemminger
In-Reply-To: <20100422.185400.71096585.davem@davemloft.net>

On Thu, Apr 22, 2010 at 06:54:00PM -0700, David Miller wrote:
> From: Jiri Bohac <jbohac@suse.cz>
> Date: Thu, 22 Apr 2010 17:49:08 +0200
> 
> > I still don't see why __ipv6_ifa_notify() needs to call
> > dst_free(). Shouldn't that be dst_release() instead, to drop the
> > reference obtained by dst_hold(&ifp->rt->u.dst)?
> 
> It likely wants to do both.
> 
> Just doing dst_release() doesn't mark the 'dst' object as obsolete,
> and therefore it won't get force garbage collected.

Sure. So If I understand it correctly, there are two problems:

- the reference taken by dst_hold() just above the ip6_del_rt()
  is never dropped if ip6_del_rt() fails; so shouldn't the code
  be like this?:

	dst_hold(&ifp->rt->u.dst);
	if (ip6_del_rt(ifp->rt)) {
		dst_release(&ifp->rt->u.dst);
		dst_free(&ifp->rt->u.dst);
	}

- if ip6_del_rt() fails because it races with something else
  deleting the address, dst_free() will be called twice. This is
  what Herbert is fixing with additional locking. However -- even
  when he fixes that -- how can ip6_del_rt() fail with the
  ifp->rt still needing a dst_free()?
  AFAICS, it can fail by:
  	- __ip6_del_rt() finding that (rt ==
	  net->ipv6.ip6_null_entry); we don't want to call
	  dst_free() on net->ipv6.ip6_null_entry, do we?

	- fib6_del() returning -ENOENT for multiple reasons;
	  but doesn't that mean that something else has removed
	  the route already and called dst_free on it?

  In either case, the dst_free() looks like not being needed. And it only
  does no harm in most cases, because these events are rare and
  it usually finds (obsolete > 2) and does nothing.
  
  I think that what Herbert is doing is only going to enforce that
  the ip6_del_rt() is never going to fail, so the dst_free()
  won't ever be called anyway, right?

Thanks,

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ


^ permalink raw reply

* Re: [PATCH] bnx2x: add support for receive hashing
From: Brian Bloniarz @ 2010-04-27 15:44 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, therbert, netdev, rick.jones2
In-Reply-To: <1272378659.2295.193.camel@edumazet-laptop>

Eric Dumazet wrote:
> Le mardi 27 avril 2010 à 09:37 -0400, Brian Bloniarz a écrit :
>> David Miller wrote:
>>> How damn hard is it to add two 16-bit ports to the hash regardless of
>>> protocol?
>>>   
>> Come to think of it, for UDP the hash must ignore
>> the srcport and srcaddr, because a single bound
>> socket is going to wildcard both those fields.
>>
> 
> For your application maybe ;)
> 
> Here, I have thousand of RTP flows to big mediagateways, so the
> (srcaddr, dstaddr) is shared by all these flows.
> 
> Tom Herbert also wants a threaded DNS server.
> 
> For UDP, we could have a bitmap (system level ?) to say if a particular
> destination port wants multi-cpu (RPS) spreading or not, even if the NIC
> decided to use a single queue to submit frames to the host (ie mask the
> src_addr and src_port). In this case, RFS on non conected UDP sockets
> could be activated as well.
> 
> Or just a global sysctl to be able to mask the src_addr and/or src_port
> in our software rxhash.

This is all good to know. Here are 3 other alternatives/suggestions:

1) Leave things as they are, if it really hurts that badly
for our workload we can just disable RPS entirely.
2) Add a global sysctl to disable RPS for datagram-based protocols.
3) Pluggable SW rxhashes :)

I haven't benchmarked anything for our workloads yet, so
I can't say for sure it's even a big issue.

Has anyone benchmarked RPS + single-threaded DNS servers? It'd
be a pessimization in that case, right? Even the multi-threaded
DNS server wouldn't be workable unless there was something like
the soreuseport patch you & Tom had been been discussing.

^ permalink raw reply

* Re: linux-next: manual merge of the tty tree with the net tree
From: Greg KH @ 2010-04-27 15:14 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: linux-next, linux-kernel, Sjur Braendeland, David Miller, netdev,
	Pavan Savoy
In-Reply-To: <20100427135706.259fa9c9.sfr@canb.auug.org.au>

On Tue, Apr 27, 2010 at 01:57:06PM +1000, Stephen Rothwell wrote:
> Hi Greg,
> 
> Today's linux-next merge of the tty tree got a conflict in
> include/linux/tty.h between commit
> 9b27105b4a44c54bf91ecd7d0315034ae75684f7 ("net-caif-driver: add CAIF
> serial driver (ldisc)") from the net tree and commit
> f3150d54db0e0caa1871f66b2424f463a34a7e7b ("serial: TTY: new ldiscs for
> staging") from the tty tree.
> 
> I fixed it up (they both changed NR_LDISCS - I used the higher number) and
> can carry the fix as necessary.

Thanks, the higher number should be what is needed here.

greg k-h

^ permalink raw reply

* Re: linux-next: manual merge of the staging-next tree with the net tree
From: Greg KH @ 2010-04-27 15:13 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: linux-next, linux-kernel, Jiri Pirko, David Miller, netdev,
	John Sheehan
In-Reply-To: <20100427142655.49673ff1.sfr@canb.auug.org.au>

On Tue, Apr 27, 2010 at 02:26:55PM +1000, Stephen Rothwell wrote:
> Hi Greg,
> 
> Today's linux-next merge of the staging-next tree got a conflict in
> drivers/staging/arlan/arlan-main.c between commit
> 22bedad3ce112d5ca1eaf043d4990fa2ed698c87 ("net: convert multicast list to
> list_head") from the net tree and commit
> 8db2022b08e232bb237b2162f03ff5f6e7c0c35e ("staging: arlan: fix errors
> reported by checkpatch.pl tool") from the staging-next tree.
> 
> I fixed it up (see below) and can carry the fix as necessary.

That looks fine, thanks.

greg k-h

^ permalink raw reply

* Re: [patch v2] sctp: cleanup: remove duplicate assignment
From: Vlad Yasevich @ 2010-04-27 14:32 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Sridhar Samudrala, David S. Miller, Wei Yongjun, Chris Dischino,
	linux-sctp, netdev, kernel-janitors
In-Reply-To: <20100424171805.GO29093@bicker>



Dan Carpenter wrote:
> This assignment isn't needed because we did it earlier already.
> 
> Also another reason to delete the assignment is because it triggers a
> Smatch warning about checking for NULL pointers after a dereference.
> 
> Reported-by: Vlad Yasevich <vladislav.yasevich@hp.com>
> Signed-off-by: Dan Carpenter <error27@gmail.com>

Thanks.  I'll take this one.

-vlad

> ---
> Thanks Vlad.  I came so close to seeing that myself if only I had openned
> my eyes a tiny bit more.  :P
> 
> diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
> index 17cb400..33aed1c 100644
> --- a/net/sctp/sm_make_chunk.c
> +++ b/net/sctp/sm_make_chunk.c
> @@ -419,10 +419,17 @@ struct sctp_chunk *sctp_make_init_ack(const struct sctp_association *asoc,
>  	if (!retval)
>  		goto nomem_chunk;
>  
> -	/* Per the advice in RFC 2960 6.4, send this reply to
> -	 * the source of the INIT packet.
> +	/* RFC 2960 6.4 Multi-homed SCTP Endpoints
> +	 *
> +	 * An endpoint SHOULD transmit reply chunks (e.g., SACK,
> +	 * HEARTBEAT ACK, * etc.) to the same destination transport
> +	 * address from which it received the DATA or control chunk
> +	 * to which it is replying.
> +	 *
> +	 * [INIT ACK back to where the INIT came from.]
>  	 */
>  	retval->transport = chunk->transport;
> +
>  	retval->subh.init_hdr =
>  		sctp_addto_chunk(retval, sizeof(initack), &initack);
>  	retval->param_hdr.v = sctp_addto_chunk(retval, addrs_len, addrs.v);
> @@ -461,18 +468,6 @@ struct sctp_chunk *sctp_make_init_ack(const struct sctp_association *asoc,
>  	/* We need to remove the const qualifier at this point.  */
>  	retval->asoc = (struct sctp_association *) asoc;
>  
> -	/* RFC 2960 6.4 Multi-homed SCTP Endpoints
> -	 *
> -	 * An endpoint SHOULD transmit reply chunks (e.g., SACK,
> -	 * HEARTBEAT ACK, * etc.) to the same destination transport
> -	 * address from which it received the DATA or control chunk
> -	 * to which it is replying.
> -	 *
> -	 * [INIT ACK back to where the INIT came from.]
> -	 */
> -	if (chunk)
> -		retval->transport = chunk->transport;
> -
>  nomem_chunk:
>  	kfree(cookie);
>  nomem_cookie:
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* Re: [PATCH] bnx2x: add support for receive hashing
From: Eric Dumazet @ 2010-04-27 14:30 UTC (permalink / raw)
  To: Brian Bloniarz; +Cc: David Miller, therbert, netdev, rick.jones2
In-Reply-To: <4BD6E887.3000804@athenacr.com>

Le mardi 27 avril 2010 à 09:37 -0400, Brian Bloniarz a écrit :
> David Miller wrote:
> > How damn hard is it to add two 16-bit ports to the hash regardless of
> > protocol?
> >   
> Come to think of it, for UDP the hash must ignore
> the srcport and srcaddr, because a single bound
> socket is going to wildcard both those fields.
> 

For your application maybe ;)

Here, I have thousand of RTP flows to big mediagateways, so the
(srcaddr, dstaddr) is shared by all these flows.

Tom Herbert also wants a threaded DNS server.

For UDP, we could have a bitmap (system level ?) to say if a particular
destination port wants multi-cpu (RPS) spreading or not, even if the NIC
decided to use a single queue to submit frames to the host (ie mask the
src_addr and src_port). In this case, RFS on non conected UDP sockets
could be activated as well.

Or just a global sysctl to be able to mask the src_addr and/or src_port
in our software rxhash.




> So the best we can hope for is for the hash to
> include destport and destaddr? From looking at
> my BCM5709's multiq behavior, I think broadcom's
> hash includes the destaddr but not the destport.


Toepliz hash for UDP is a hash(src_addr, dst_addr)

Both src port and dst port are ignored.



^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox