Netdev List
* Re: BUG: using smp_processor_id() in preemptible [00000000] code: avahi-daemon:  caller is netif_rx
From: Eric Dumazet @ 2010-04-15 19:07 UTC (permalink / raw)
  To: Eric Paris, David Miller; +Cc: netdev, Tom Herbert
In-Reply-To: <1271337401.16881.2563.camel@edumazet-laptop>

On Thursday 15 April 2010 at 15:16 +0200, Eric Dumazet wrote:
> On Monday 12 April 2010 at 21:40 +0200, Eric Dumazet wrote:
> > Good spot, RPS changed the netif_rx() requirements a bit.
> > 
> > I would change ip_dev_loopback_xmit() to call netif_rx_ni() instead...
> > 
> > David, Tom ?
> > 
> > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> > index c65f18e..d1bcc9f 100644
> > --- a/net/ipv4/ip_output.c
> > +++ b/net/ipv4/ip_output.c
> > @@ -120,7 +120,7 @@ static int ip_dev_loopback_xmit(struct sk_buff *newskb)
> >  	newskb->pkt_type = PACKET_LOOPBACK;
> >  	newskb->ip_summed = CHECKSUM_UNNECESSARY;
> >  	WARN_ON(!skb_dst(newskb));
> > -	netif_rx(newskb);
> > +	netif_rx_ni(newskb);
> >  	return 0;
> >  }
> >  
> 
> After some confusion, it seems this was the right fix after all :)
> 
> [PATCH] ip: Fix ip_dev_loopback_xmit()
> 
> Eric Paris got the following trace with a linux-next kernel:
> 
> [   14.203970] BUG: using smp_processor_id() in preemptible [00000000] code: avahi-daemon/2093
> [   14.204025] caller is netif_rx+0xfa/0x110
> [   14.204035] Call Trace:
> [   14.204064]  [<ffffffff81278fe5>] debug_smp_processor_id+0x105/0x110
> [   14.204070]  [<ffffffff8142163a>] netif_rx+0xfa/0x110
> [   14.204090]  [<ffffffff8145b631>] ip_dev_loopback_xmit+0x71/0xa0
> [   14.204095]  [<ffffffff8145b892>] ip_mc_output+0x192/0x2c0
> [   14.204099]  [<ffffffff8145d610>] ip_local_out+0x20/0x30
> [   14.204105]  [<ffffffff8145d8ad>] ip_push_pending_frames+0x28d/0x3d0
> [   14.204119]  [<ffffffff8147f1cc>] udp_push_pending_frames+0x14c/0x400
> [   14.204125]  [<ffffffff814803fc>] udp_sendmsg+0x39c/0x790
> [   14.204137]  [<ffffffff814891d5>] inet_sendmsg+0x45/0x80
> [   14.204149]  [<ffffffff8140af91>] sock_sendmsg+0xf1/0x110
> [   14.204189]  [<ffffffff8140dc6c>] sys_sendmsg+0x20c/0x380
> [   14.204233]  [<ffffffff8100ad82>] system_call_fastpath+0x16/0x1b
> 
> While the current linux-2.6 kernel doesn't emit this warning, the bug is
> latent and might cause unexpected failures.
> 
> ip_dev_loopback_xmit() runs in process context with preemption enabled, so
> it must call netif_rx_ni() instead of netif_rx() to make sure that pending
> software interrupts are processed.
> 
> Reported-by: Eric Paris <eparis@redhat.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index c65f18e..d1bcc9f 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -120,7 +120,7 @@ static int ip_dev_loopback_xmit(struct sk_buff *newskb)
>  	newskb->pkt_type = PACKET_LOOPBACK;
>  	newskb->ip_summed = CHECKSUM_UNNECESSARY;
>  	WARN_ON(!skb_dst(newskb));
> -	netif_rx(newskb);
> +	netif_rx_ni(newskb);
>  	return 0;
>  }
>  

Oops, silly me, I forgot IPv6.

Updated patch in a couple of minutes.
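
For context, the difference between the two calls can be modelled in user
space. The sketch below is not kernel code: preempt_disable(),
preempt_enable(), local_softirq_pending() and do_softirq() are stand-in stubs
whose names are borrowed from the kernel and whose bodies are invented for
illustration. The wrapper follows the shape netif_rx_ni() is understood to
have in this era: disable preemption, queue the packet, run any pending
softirq, re-enable preemption.

```c
#include <assert.h>

/* User-space stand-ins for kernel state; everything here is a model. */
static int preempt_count;     /* > 0 means "preemption disabled" */
static int softirq_pending;   /* set when a packet has been queued */
static int packets_processed; /* incremented by the fake softirq */

static void preempt_disable(void) { preempt_count++; }
static void preempt_enable(void)  { preempt_count--; }
static int  local_softirq_pending(void) { return softirq_pending; }

static void do_softirq(void)
{
	softirq_pending = 0;
	packets_processed++;
}

/*
 * Model of netif_rx(): it touches per-CPU data, so it expects a
 * non-preemptible caller. It only queues the packet and raises the softirq.
 */
int netif_rx_model(void)
{
	assert(preempt_count > 0);
	softirq_pending = 1;
	return 0;
}

/* Model of netif_rx_ni(): safe to call from process context. */
int netif_rx_ni_model(void)
{
	int err;

	preempt_disable();
	err = netif_rx_model();
	if (local_softirq_pending())
		do_softirq();
	preempt_enable();
	return err;
}
```

Calling netif_rx_model() directly with preempt_count at zero trips the
assertion, which plays the role of the debug_smp_processor_id() warning in the
trace above.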





* Re: [PATCH net-next] net/l2tp/l2tp_debugfs.c: Convert NIPQUAD to %pI4
From: James Chapman @ 2010-04-15 18:44 UTC (permalink / raw)
  To: Joe Perches; +Cc: David Miller, LKML, netdev
In-Reply-To: <1271349703.1726.62.camel@Joe-Laptop.home>

Joe Perches wrote:
> Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: James Chapman <jchapman@katalix.com>

> ---
>  net/l2tp/l2tp_debugfs.c |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/l2tp/l2tp_debugfs.c b/net/l2tp/l2tp_debugfs.c
> index 908f10f..104ec3b 100644
> --- a/net/l2tp/l2tp_debugfs.c
> +++ b/net/l2tp/l2tp_debugfs.c
> @@ -122,8 +122,8 @@ static void l2tp_dfs_seq_tunnel_show(struct seq_file *m, void *v)
>  	seq_printf(m, "\nTUNNEL %u peer %u", tunnel->tunnel_id, tunnel->peer_tunnel_id);
>  	if (tunnel->sock) {
>  		struct inet_sock *inet = inet_sk(tunnel->sock);
> -		seq_printf(m, " from " NIPQUAD_FMT " to " NIPQUAD_FMT "\n",
> -			   NIPQUAD(inet->inet_saddr), NIPQUAD(inet->inet_daddr));
> +		seq_printf(m, " from %pI4 to %pI4\n",
> +			   &inet->inet_saddr, &inet->inet_daddr);
>  		if (tunnel->encap == L2TP_ENCAPTYPE_UDP)
>  			seq_printf(m, " source port %hu, dest port %hu\n",
>  				   ntohs(inet->inet_sport), ntohs(inet->inet_dport));
> 
> 
> 


* Re: Network multiqueue question
From: Eric Dumazet @ 2010-04-15 18:41 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: George B., netdev
In-Reply-To: <21433.1271354986@death.nxdomain.ibm.com>

On Thursday 15 April 2010 at 11:09 -0700, Jay Vosburgh wrote:
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> >Vlan is multiqueue aware, but unfortunately bonding is not at the
> >moment.
> >
> >We could make it 'multiqueue' (a patch was submitted by Oleg A.
> >Arkhangelsky a while ago), but the bonding xmit routine needs to take a
> >central lock, shared by all queues, so it won't be very efficient...
> 
> 	The lock is a read lock, so theoretically it should be possible
> to enter the bonding transmit function on multiple CPUs at the same
> time.  The lock may thrash around, though.
> 

Yes, and with 10Gb cards this is a limiting factor if you want to send
14 million packets per second ;)

read_lock() is one atomic op, dirtying the cache line.
read_unlock() is one atomic op, dirtying the cache line again (if contended).

In active-passive mode, using RCU should be really easy, given that
netdevices are already RCU compatible. This way, each CPU only reads the
bonding state, without any memory changes.
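
To make the cost difference concrete, here is a user-space sketch with C11
atomics. It is only a model under stated assumptions: struct bond_state,
reader_count and the function names are hypothetical, the "read lock" is
reduced to its reader counter, and the RCU-style reader omits what real RCU
also needs (read-side markers and deferred freeing of old state after a grace
period).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical bonding state, read on every transmit. */
struct bond_state {
	int active_slave; /* index of the slave currently used for xmit */
};

static atomic_int reader_count;                    /* shared by every CPU */
static _Atomic(struct bond_state *) current_state; /* published by the writer */

/* Writer side: publish a fully initialised state. */
void publish_state(struct bond_state *st)
{
	atomic_store_explicit(&current_state, st, memory_order_release);
}

/* Read-lock style reader: two atomic writes to a shared cache line. */
int pick_slave_read_lock(void)
{
	struct bond_state *st;
	int slave;

	atomic_fetch_add(&reader_count, 1);  /* like read_lock() */
	st = atomic_load(&current_state);
	assert(st != NULL);
	slave = st->active_slave;
	atomic_fetch_sub(&reader_count, 1);  /* like read_unlock() */
	return slave;
}

/* RCU style reader: loads only, no shared memory is written. */
int pick_slave_rcu_style(void)
{
	struct bond_state *st =
		atomic_load_explicit(&current_state, memory_order_acquire);

	assert(st != NULL);
	return st->active_slave;
}
```

Both readers return the same answer; the difference is that the first one
makes every transmitting CPU write to reader_count, while the second leaves
the shared cache lines clean.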


> >Since this bothers me a bit, I will probably work on this in the near
> >future (adding real multiqueue capability and RCU to bonding fast
> >paths).
> >
> >Ref: http://permalink.gmane.org/gmane.linux.network/152987
> 
> 	The question I have about it (and the above patch), is: what
> does multi-queue "awareness" really mean for a bonding device?  How does
> allocating a bunch of TX queues help, given that the determination of
> the transmitting device hasn't necessarily been made?
> 

Well, it is a problem that was also taken into account with vlan, you
might take a look at this commit :

commit 669d3e0babb40018dd6e78f4093c13a2eac73866
Author: Vasu Dev <vasu.dev@intel.com>
Date:   Tue Mar 23 14:41:45 2010 +0000

    vlan: adds vlan_dev_select_queue
    
    This is required to correctly select vlan tx queue for a driver
    supporting multi tx queue with ndo_select_queue implemented since
    currently selected vlan tx queue is unaligned to selected queue by
    real net_device ndo_select_queue.
    
    Unaligned vlan tx queue selection causes thrash with higher vlan
    tx lock contention for least fcoe traffic and wrong socket tx
    queue_mapping for ixgbe having ndo_select_queue implemented.
    
    -v2
    
    As per Eric Dumazet<eric.dumazet@gmail.com> comments, mirrored
    vlan net_device_ops to have them with and without
    vlan_dev_select_queue and then select according to real dev
    ndo_select_queue present or not for a vlan net_device. This is to
    completely skip vlan_dev_select_queue calling for real net_device
    not supporting ndo_select_queue.
    
    Signed-off-by: Vasu Dev <vasu.dev@intel.com>
    Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
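
The idea in that commit message can be sketched in user space as follows.
This is a reduced model, not the kernel code: struct net_device and
struct net_device_ops are cut down to the fields needed here, the queue is
chosen from a plain flow hash instead of an skb, and hw_select_queue() is a
hypothetical driver policy.

```c
#include <assert.h>
#include <stddef.h>

struct net_device;

/* Reduced stand-in for the kernel's net_device_ops. */
struct net_device_ops {
	unsigned (*ndo_select_queue)(struct net_device *dev, unsigned flow_hash);
};

struct net_device {
	const struct net_device_ops *ops;
	struct net_device *real_dev; /* NULL for a physical device */
	unsigned num_tx_queues;
};

/* Hypothetical driver policy: spread flows over the hardware queues. */
static unsigned hw_select_queue(struct net_device *dev, unsigned flow_hash)
{
	return flow_hash % dev->num_tx_queues;
}

/* The vlan device asks the real device, so both pick the same queue. */
static unsigned vlan_select_queue(struct net_device *dev, unsigned flow_hash)
{
	struct net_device *rdev = dev->real_dev;

	return rdev->ops->ndo_select_queue(rdev, flow_hash);
}

/* Ops tables for real devices with and without a select_queue hook. */
const struct net_device_ops hw_ops_sq    = { hw_select_queue };
const struct net_device_ops hw_ops_plain = { NULL };

/* "Mirrored" vlan ops tables: one with the hook, one without. */
const struct net_device_ops vlan_ops_sq  = { vlan_select_queue };
const struct net_device_ops vlan_ops     = { NULL };

/* Choose the vlan ops table once, at setup time. */
void vlan_setup(struct net_device *vlan, struct net_device *real)
{
	assert(real != NULL && real->num_tx_queues > 0);
	vlan->real_dev = real;
	vlan->num_tx_queues = real->num_tx_queues;
	vlan->ops = real->ops->ndo_select_queue ? &vlan_ops_sq : &vlan_ops;
}
```

A bonding device would face the extra question raised below: with several
slaves there is no single real device to delegate to at the time the queue is
selected.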


> 	I haven't had the chance to acquire some multi-queue network
> cards and check things out with bonding, so I'm not really sure how it
> should work.  Should the bond look, from a multi-queue perspective, like
> the largest slave, or should it look like the sum of the slaves?  Some
> of this may be mode-specific, as well.
> 





* Re: Network multiqueue question
From: Jay Vosburgh @ 2010-04-15 18:09 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: George B., netdev
In-Reply-To: <1271353637.16881.2846.camel@edumazet-laptop>

Eric Dumazet <eric.dumazet@gmail.com> wrote:

>On Thursday 15 April 2010 at 09:58 -0700, George B. wrote:
>> I am in need of a little education on multiqueue and was wondering if
>> someone here might be able to help me.
>> 
>> Given the Intel igb network driver, it appears I can do something like:
>> 
>>  tc qdisc add dev eth0 root handle 1: multiq
>> 
>> which works and reports 4 bands:  dev eth0 root refcnt 4 bands 4/4
>> 
>> But our network is a little more complicated.  Above the ethernet we
>> have the bonding driver which is using mode 2 bonding with two
>> ethernet slaves.  Then we have vlans on the bond interface.  Our
>> production traffic is on a vlan and resource contention is an issue as
>> these are busy machines.
>> 
>> It is my understanding that the vlan driver became multiqueue aware in
>> 2.6.32 (we are currently using 2.6.31).
>> 
>> It would seem that the first thing the kernel would encounter with
>> traffic headed out would be the vlan interface, and then the bond
>> interface, and then the physical ethernet interface.  Is that correct?
>>  So with my kernel, I would seem to get no utility from multiq on the
>> ethernet interface if the vlan interface is going to be a
>> single-threaded bottleneck.  What about the bond driver?  Is it
>> currently multiqueue aware?
>> 
>> I am trying to get some sort of logical picture of how all these things
>> interact with each other to get things a little more efficient and
>> reduce resource contention in the application while still trying to be
>> efficient in use of network ports/interfaces.
>> 
>> If someone feels up to the task of sending a little education my way,
>> I would be most appreciative.  There doesn't seem to be a whole lot of
>> documentation floating around about multiqueue other than a blurb of
>> text in the kernel and David's presentation of last year.
>
>Hi George
>
>Vlan is multiqueue aware, but unfortunately bonding is not at the
>moment.
>
>We could make it 'multiqueue' (a patch was submitted by Oleg A.
>Arkhangelsky a while ago), but the bonding xmit routine needs to take a
>central lock, shared by all queues, so it won't be very efficient...

	The lock is a read lock, so theoretically it should be possible
to enter the bonding transmit function on multiple CPUs at the same
time.  The lock may thrash around, though.

>Since this bothers me a bit, I will probably work on this in the near
>future (adding real multiqueue capability and RCU to bonding fast
>paths).
>
>Ref: http://permalink.gmane.org/gmane.linux.network/152987

	The question I have about it (and the above patch), is: what
does multi-queue "awareness" really mean for a bonding device?  How does
allocating a bunch of TX queues help, given that the determination of
the transmitting device hasn't necessarily been made?

	I haven't had the chance to acquire some multi-queue network
cards and check things out with bonding, so I'm not really sure how it
should work.  Should the bond look, from a multi-queue perspective, like
the largest slave, or should it look like the sum of the slaves?  Some
of this may be mode-specific, as well.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

* Re: [PATCH 2/2] igbvf: dobule increment nr_frags
From: Jeff Kirsher @ 2010-04-15 18:05 UTC (permalink / raw)
  To: Koki Sanagi
  Cc: e1000-devel, netdev, bruce.w.allan, jesse.brandeburg,
	john.ronciak, davem
In-Reply-To: <4BC6BEFC.3010309@jp.fujitsu.com>

2010/4/15 Koki Sanagi <sanagi.koki@jp.fujitsu.com>:
> There is no need to increment nr_frags because skb_fill_page_desc increments
> it.
>
> Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
> ---
>  drivers/net/igbvf/netdev.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>

Thanks, I have added this patch to my queue of patches.

-- 
Cheers,
Jeff

* Re: [PATCH 1/2 resend] igb: dobule increment nr_frags
From: Jeff Kirsher @ 2010-04-15 18:03 UTC (permalink / raw)
  To: Koki Sanagi
  Cc: netdev, e1000-devel, davem, jesse.brandeburg, bruce.w.allan,
	john.ronciak, Taku Izumi
In-Reply-To: <4BC6D705.80601@jp.fujitsu.com>

On Thu, Apr 15, 2010 at 02:06, Koki Sanagi <sanagi.koki@jp.fujitsu.com> wrote:
> The previous patch had a mail format problem.
> I believe I've fixed it and am re-sending.
>
> There is no need to increment nr_frags because skb_fill_page_desc increments
> it.
>
> Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
> ---
>  drivers/net/igb/igb_main.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>

Thanks, I have added the patch to my queue of patches.

-- 
Cheers,
Jeff

* Re: Network multiqueue question
From: Eric Dumazet @ 2010-04-15 17:47 UTC (permalink / raw)
  To: George B.; +Cc: netdev
In-Reply-To: <i2tb65cae941004150958n5c66dc42j26724bbb075125a0@mail.gmail.com>

On Thursday 15 April 2010 at 09:58 -0700, George B. wrote:
> I am in need of a little education on multiqueue and was wondering if
> someone here might be able to help me.
> 
> Given the Intel igb network driver, it appears I can do something like:
> 
>  tc qdisc add dev eth0 root handle 1: multiq
> 
> which works and reports 4 bands:  dev eth0 root refcnt 4 bands 4/4
> 
> But our network is a little more complicated.  Above the ethernet we
> have the bonding driver which is using mode 2 bonding with two
> ethernet slaves.  Then we have vlans on the bond interface.  Our
> production traffic is on a vlan and resource contention is an issue as
> these are busy machines.
> 
> It is my understanding that the vlan driver became multiqueue aware in
> 2.6.32 (we are currently using 2.6.31).
> 
> It would seem that the first thing the kernel would encounter with
> traffic headed out would be the vlan interface, and then the bond
> interface, and then the physical ethernet interface.  Is that correct?
>  So with my kernel, I would seem to get no utility from multiq on the
> ethernet interface if the vlan interface is going to be a
> single-threaded bottleneck.  What about the bond driver?  Is it
> currently multiqueue aware?
> 
> I am trying to get some sort of logical picture of how all these things
> interact with each other to get things a little more efficient and
> reduce resource contention in the application while still trying to be
> efficient in use of network ports/interfaces.
> 
> If someone feels up to the task of sending a little education my way,
> I would be most appreciative.  There doesn't seem to be a whole lot of
> documentation floating around about multiqueue other than a blurb of
> text in the kernel and David's presentation of last year.

Hi George

Vlan is multiqueue aware, but unfortunately bonding is not at the
moment.

We could make it 'multiqueue' (a patch was submitted by Oleg A.
Arkhangelsky a while ago), but the bonding xmit routine needs to take a
central lock, shared by all queues, so it won't be very efficient...

Since this bothers me a bit, I will probably work on this in the near
future (adding real multiqueue capability and RCU to bonding fast
paths).

Ref: http://permalink.gmane.org/gmane.linux.network/152987



* Network multiqueue question
From: George B. @ 2010-04-15 16:58 UTC (permalink / raw)
  To: netdev

I am in need of a little education on multiqueue and was wondering if
someone here might be able to help me.

Given the Intel igb network driver, it appears I can do something like:

 tc qdisc add dev eth0 root handle 1: multiq

which works and reports 4 bands:  dev eth0 root refcnt 4 bands 4/4

But our network is a little more complicated.  Above the ethernet we
have the bonding driver which is using mode 2 bonding with two
ethernet slaves.  Then we have vlans on the bond interface.  Our
production traffic is on a vlan and resource contention is an issue as
these are busy machines.

It is my understanding that the vlan driver became multiqueue aware in
2.6.32 (we are currently using 2.6.31).

It would seem that the first thing the kernel would encounter with
traffic headed out would be the vlan interface, and then the bond
interface, and then the physical ethernet interface.  Is that correct?
 So with my kernel, I would seem to get no utility from multiq on the
ethernet interface if the vlan interface is going to be a
single-threaded bottleneck.  What about the bond driver?  Is it
currently multiqueue aware?

I am trying to get some sort of logical picture of how all these things
interact with each other to get things a little more efficient and
reduce resource contention in the application while still trying to be
efficient in use of network ports/interfaces.

If someone feels up to the task of sending a little education my way,
I would be most appreciative.  There doesn't seem to be a whole lot of
documentation floating around about multiqueue other than a blurb of
text in the kernel and David's presentation of last year.

Thanks!

George

* [PATCH net-next] net/l2tp/l2tp_debugfs.c: Convert NIPQUAD to %pI4
From: Joe Perches @ 2010-04-15 16:41 UTC (permalink / raw)
  To: David Miller; +Cc: James Chapman, LKML, netdev

Signed-off-by: Joe Perches <joe@perches.com>
---
 net/l2tp/l2tp_debugfs.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/l2tp/l2tp_debugfs.c b/net/l2tp/l2tp_debugfs.c
index 908f10f..104ec3b 100644
--- a/net/l2tp/l2tp_debugfs.c
+++ b/net/l2tp/l2tp_debugfs.c
@@ -122,8 +122,8 @@ static void l2tp_dfs_seq_tunnel_show(struct seq_file *m, void *v)
 	seq_printf(m, "\nTUNNEL %u peer %u", tunnel->tunnel_id, tunnel->peer_tunnel_id);
 	if (tunnel->sock) {
 		struct inet_sock *inet = inet_sk(tunnel->sock);
-		seq_printf(m, " from " NIPQUAD_FMT " to " NIPQUAD_FMT "\n",
-			   NIPQUAD(inet->inet_saddr), NIPQUAD(inet->inet_daddr));
+		seq_printf(m, " from %pI4 to %pI4\n",
+			   &inet->inet_saddr, &inet->inet_daddr);
 		if (tunnel->encap == L2TP_ENCAPTYPE_UDP)
 			seq_printf(m, " source port %hu, dest port %hu\n",
 				   ntohs(inet->inet_sport), ntohs(inet->inet_dport));

* Re: rps perfomance WAS(Re: rps: question
From: Rick Jones @ 2010-04-15 16:41 UTC (permalink / raw)
  To: hadi; +Cc: David Miller, eric.dumazet, therbert, netdev, robert, xiaosuo,
	andi
In-Reply-To: <1271332528.4567.150.camel@bigi>

> 
> I speculate again that it may be too costly to run rps on something like
> a tigerton or intel clovertown where you have cores sharing/contending
> for an FSB. If I can get answers to the question: "What h/ware are
> people running?" I could be proven wrong.
> [Note: I am not against RPS - I think it has its place; so I hope my
> desire to find out when to use rps doesn't show as hostility towards
> rps.]

IPS (~= RPS) was running on shared FSB HP9000's.  Now, that was also a BSD 
networking stack with netisrq's and the like.  TOPS (~= RFS) was also run on 
shared FSB HP9000s, as well as CC-NUMA HP9000s and Integrity systems.  TOPS was 
implemented in a Streams-based stack tracing its history to a common ancestor 
with Solaris (Mentat).

rick jones

* [RFC][PATCH] xfrm6 refcnt problem in bundle creation
From: Nicolas Dichtel @ 2010-04-15 16:32 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 1383 bytes --]

Hi all,

I have a refcount problem in the xfrm IPv6 code, but I don't really know the
best way to fix it.

When xfrm6_fill_dst() is called, a dev is given as parameter:

static int xfrm6_fill_dst(struct xfrm_dst *xdst, struct net_device *dev,
                           struct flowi *fl)
{
         struct rt6_info *rt = (struct rt6_info*)xdst->route;

         xdst->u.dst.dev = dev;
         dev_hold(dev);

         xdst->u.rt6.rt6i_idev = in6_dev_get(rt->u.dst.dev);
         if (!xdst->u.rt6.rt6i_idev)
                 return -ENODEV;
[snip]

In my case, dev points to an ethernet device and the route (rt->u.dst.dev) 
points to a tunnel interface (ip6 over ip6). This function will get a ref on the 
idev of the tunnel (xdst->u.rt6.rt6i_idev = in6_dev_get(rt->u.dst.dev)), but dev 
of the dst is set to the ethernet interface (xdst->u.dst.dev = dev).
Later, when we try to remove the tunnel interface, the xfrm gc function never
checks rt6i_idev; it only checks u.dst.dev, hence it does not remove the dst.
The consequence is that the interface cannot be removed.

The IPv4 code takes the same dev to get idev, rather than using rt->u.dst.dev.
Is it right to do the same in IPv6?
A proposed patch is attached.

Before the bundle creation merge patch, the code used 'rt->u.dst.dev' both to
get idev and to set dst.dev.

Suggestions are welcome.


Regards,
Nicolas

[-- Attachment #2: 0001-xfrm6-ensure-to-use-the-same-dev-when-building-a-bu.patch --]
[-- Type: text/x-diff, Size: 950 bytes --]

From 80432d47369925d4e9e38bcb1068ebf923de3a8f Mon Sep 17 00:00:00 2001
From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Date: Thu, 15 Apr 2010 18:27:30 +0200
Subject: [PATCH] xfrm6: ensure to use the same dev when building a bundle

When building a bundle, we set dst.dev and rt6.rt6i_idev.
We must ensure to set the same device for both fields.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 net/ipv6/xfrm6_policy.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index ae18165..00bf7c9 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -124,7 +124,7 @@ static int xfrm6_fill_dst(struct xfrm_dst *xdst, struct net_device *dev,
 	xdst->u.dst.dev = dev;
 	dev_hold(dev);
 
-	xdst->u.rt6.rt6i_idev = in6_dev_get(rt->u.dst.dev);
+	xdst->u.rt6.rt6i_idev = in6_dev_get(dev);
 	if (!xdst->u.rt6.rt6i_idev)
 		return -ENODEV;
 
-- 
1.5.4.5


* Poor localhost net performance on recent stable kernel
From: Kelly Burkhart @ 2010-04-15 15:44 UTC (permalink / raw)
  To: netdev, linux-kernel

Hello,

While working on upgrading distributions, I've noticed that local
network communication is much slower on 2.6.33.2 than on our old
kernel 2.6.16.60 (sles 10.2).

Running netperf UDP_RR against localhost, I get around 150000 tps
on the new kernel vs. 290000 tps with the old kernel.  The netperf
command:

netperf -T 1 -H 127.0.0.1 -t UDP_RR -c -C -- -r 100

TCP_RR had similar results.  The problem did not exist with TCP_STREAM.

While trying to track this down, I wrote a test program that writes
then reads a 32 bit integer to a pipe:

static void tst_pipe0( int sleep_us )
{
    int pipefd[2];
    int idx;
    uint32_t tarr[ITERS];

    printf("tst_pipe0 -- sleep %dus\n", sleep_us);

    if (pipe(pipefd) < 0)
        err_exit("pipe");

    for(idx=0; idx<ITERS; ++idx) {
        uint32_t btsc;
        uint32_t rtsc;
        uint32_t etsc;
        get_tscl(btsc);
        write(pipefd[1], (char *)&btsc, sizeof(btsc));
        read(pipefd[0], (char *)&rtsc, sizeof(rtsc));
        get_tscl(etsc);
        tarr[idx] = etsc-btsc;
        do_sleep(sleep_us);
    }
    prt_avg(tarr, ITERS);
    close(pipefd[0]);
    close(pipefd[1]);
    printf("\n");
}

There's a dramatic difference if there's a sleep between iterations on
the new kernel.  On the old kernel the write/read round trip takes
1100-1300 cycles with or without sleep.  On the new kernel, with no
sleep the round trip is about 1400 cycles.  It doubles with a 1us
sleep then gradually increases to 12000-14000 cycles then stabilizes
as I increase the sleep time to 1500us.  I'm not sure if this is
related to the netperf difference or is a completely different
scheduling issue.
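
For anyone who wants to reproduce the measurement without the helpers elided
above (get_tscl, do_sleep, prt_avg, err_exit), here is a self-contained
variant. It is a sketch, not the original program: it times with
clock_gettime(CLOCK_MONOTONIC) instead of the TSC, so results are in
nanoseconds rather than cycles, and the function name pipe_round_trip_ns is
made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

/* Current monotonic time in nanoseconds. */
static int64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Average write+read round trip over a pipe, in nanoseconds, sleeping
 * sleep_us microseconds between iterations. Returns -1.0 on error.
 */
double pipe_round_trip_ns(int iters, long sleep_us)
{
	int pipefd[2];
	int64_t total = 0;
	int i, failed = 0;

	assert(iters > 0 && sleep_us >= 0);
	if (pipe(pipefd) < 0)
		return -1.0;

	for (i = 0; i < iters && !failed; ++i) {
		uint32_t out = (uint32_t)i, in = 0;
		int64_t begin = now_ns();

		if (write(pipefd[1], &out, sizeof(out)) != (ssize_t)sizeof(out) ||
		    read(pipefd[0], &in, sizeof(in)) != (ssize_t)sizeof(in))
			failed = 1;
		else
			total += now_ns() - begin;

		if (!failed && sleep_us > 0) {
			struct timespec req;

			req.tv_sec = sleep_us / 1000000;
			req.tv_nsec = (sleep_us % 1000000) * 1000;
			nanosleep(&req, NULL);
		}
	}

	close(pipefd[0]);
	close(pipefd[1]);
	return failed ? -1.0 : (double)total / iters;
}
```

Calling it with sleep_us of 0, 1, 100 and 1500 should show whether the
round-trip time grows with the sleep in the way described above.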

I'm running on an Intel Xeon X5570 @ 2.93GHz.  Different values of the
tick/no-tick, preemption, and HZ kernel config options don't substantially
change the magnitude of the difference.

Does anyone have any ideas regarding what could be causing the netperf
issue?  And is the pipe microbenchmark meaningful and if so what does
it mean?

Thanks,

-Kelly

* Re: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
From: Arnd Bergmann @ 2010-04-15 15:06 UTC (permalink / raw)
  To: Xin, Xiaohui
  Cc: netdev@vger.kernel.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, mst@redhat.com, mingo@elte.hu,
	davem@davemloft.net, jdike@linux.intel.com
In-Reply-To: <97F6D3BD476C464182C1B7BABF0B0AF5C18969A5@shzsmsx502.ccr.corp.intel.com>

On Thursday 15 April 2010, Xin, Xiaohui wrote:
> 
> >It seems that you are duplicating a lot of functionality that
> >is already in macvtap. I've asked about this before but then
> >didn't look at your newer versions. Can you explain the value
> >of introducing another interface to user land?
> 
> >I'm still planning to add zero-copy support to macvtap,
> >hopefully reusing parts of your code, but do you think there
> >is value in having both?
> 
> I have not looked into your macvtap code in detail before.
> Are the two interfaces exactly the same? We just want to create a simple
> way to do zero-copy. Now it can only support vhost, but in future
> we also want it to support direct read/write operations from user space too.

Right now, the features are mostly distinct. Macvtap first of all provides
a "tap" style interface for users, and can also be used by vhost-net.
It also provides a way to share a NIC among a number of guests by software,
though I intend to add support for VMDq and SR-IOV as well. Zero-copy
is also not yet done in macvtap but should be added.

mpassthru right now does not allow sharing a NIC between guests, and
does not have a tap interface for non-vhost operation, but does the
zero-copy that is missing in macvtap.

> Basically, compared to the interface, I'm more worried about the modifications
> to the net core we have made to implement zero-copy now. If this hardest part
> can be done, then any user space interface modifications or integrations can
> be done more easily after that.

I agree that the network stack modifications are the hard part for zero-copy,
and your work on that looks very promising and is complementary to what I've
done with macvtap. Your current user interface looks good for testing this out,
but I think we should not merge it (the interface) upstream if we can get the
same or better result by integrating your buffer management code into macvtap.

I can try to merge your code into macvtap myself if you agree, so you
can focus on getting the internals right.

> >Not sure what I'm missing, but who calls the vq->receiver? This seems
> >to be neither in the upstream version of vhost nor introduced by your
> >patch.
> 
> See Patch v3 2/3 I have sent out, it is called by handle_rx() in vhost.

Ok, I see. As a general rule, it's preferred to split a patch series
in a way that makes it possible to apply each patch separately and still
get a working kernel, ideally with more features than the version before
the patch. I believe you could get there by reordering your patches to
make the actual driver the last one in the series.

Not a big problem though, I was mostly looking in the wrong place.

> >> +		ifr.ifr_name[IFNAMSIZ-1] = '\0';
> >> +
> >> +		ret = -EBUSY;
> >> +
> >> +		if (ifr.ifr_flags & IFF_MPASSTHRU_EXCL)
> >> +			break;
> 
> >Your current use of the IFF_MPASSTHRU* flags does not seem to make
> >any sense whatsoever. You check that this flag is never set, but set
> >it later yourself and then ignore all flags.
> 
> That flag is meant to prevent another user from binding the same device
> again. But I will check whether it really ignores all other flags.

The ifr variable is on the stack of the mp_chr_ioctl function, and you never
look at the value after setting it. In order to prevent multiple opens
of that device, you probably need to lock out any other users as well,
and make it a property of the underlying device. E.g. you also want to
prevent users on the host from setting an IP address on the NIC and
using it to send and receive data there.
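
One way to picture "make it a property of the underlying device" is the
user-space sketch below. It is an illustration only: struct nic_model,
nic_bind_exclusive() and nic_unbind() are hypothetical names, and a real
implementation would also have to keep the host network stack off the device,
which this model does not attempt.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical per-device state: the flag lives with the device itself,
 * not in a stack variable of the ioctl handler. */
struct nic_model {
	atomic_flag bound;
};

#define NIC_MODEL_INIT { ATOMIC_FLAG_INIT }

/* Returns 0 for the first binder and -EBUSY for everyone after it. */
int nic_bind_exclusive(struct nic_model *nic)
{
	assert(nic != NULL);
	if (atomic_flag_test_and_set(&nic->bound))
		return -EBUSY;
	return 0;
}

/* Releases the device so that it can be bound again. */
void nic_unbind(struct nic_model *nic)
{
	assert(nic != NULL);
	atomic_flag_clear(&nic->bound);
}
```

Because the test-and-set is atomic, two concurrent binders cannot both see
the device as free.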

	Arnd

* Re: "kernel:nf_ct_icmp: bad HW ICMP checksum" too noisy
From: Patrick McHardy @ 2010-04-15 14:34 UTC (permalink / raw)
  To: Benny Amorsen; +Cc: netdev, Netfilter Development Mailinglist
In-Reply-To: <m31vegztav.fsf@ursa.amorsen.dk>

Benny Amorsen wrote:
> Would it be possible to lower the log level of "kernel:nf_ct_icmp: bad
> HW ICMP checksum" so that they don't show up on the console? Obviously I
> could configure rsyslog to never send anything to the console, but then
> I might miss something which is actually critical.

Yeah, I guess defaulting to KERN_EMERG wasn't the best choice :)
I'll lower it to something reasonable - I guess KERN_NOTICE would
be appropriate.

> It's obviously nice to know about corrupted ICMP on a controlled LAN,
> but on the open Internet you can't really do anything about it.

You should only see that message when nf_conntrack_log_invalid is
active.

* "kernel:nf_ct_icmp: bad HW ICMP checksum" too noisy
From: Benny Amorsen @ 2010-04-15 14:25 UTC (permalink / raw)
  To: netdev

Would it be possible to lower the log level of "kernel:nf_ct_icmp: bad
HW ICMP checksum" so that they don't show up on the console? Obviously I
could configure rsyslog to never send anything to the console, but then
I might miss something which is actually critical.

It's obviously nice to know about corrupted ICMP on a controlled LAN,
but on the open Internet you can't really do anything about it.


/Benny



* Re: Strange packet drops with heavy firewalling
From: Eric Dumazet @ 2010-04-15 13:42 UTC (permalink / raw)
  To: Benny Amorsen; +Cc: Changli Gao, zhigang gong, netdev
In-Reply-To: <m3633szw61.fsf@ursa.amorsen.dk>

On Thursday 15 April 2010 at 15:23 +0200, Benny Amorsen wrote:
> Benny Amorsen <benny+usenet@amorsen.dk> writes:
> 
> > I'll keep monitoring the server, and if it starts dropping packets again
> > or load increases I'll check whether irqbalanced does the right thing,
> > and if not I'll implement your suggestion.
> 
> It did start dropping packets (although very few, a few packets dropped
> at once perhaps every ten minutes). Irqbalanced didn't move the
> interrupts.
> 
> Doing
> 
> echo 01 >/proc/irq/99/smp_affinity
> echo 02 >/proc/irq/100/smp_affinity
> echo 04 >/proc/irq/101/smp_affinity
> 
> and so on like Eric Dumazet suggested seems to have helped, but not
> entirely solved the problem.
> 
> The problem now manifests itself this way in ethtool -S:
>      rx_no_buffer_count: 270
>      rx_queue_drop_packet_count: 270
> 
> I can't be sure that I'm not just getting hit by a 1Gbps traffic spike,
> of course, but it is a bit strange that a machine which can do 200Mbps
> at 92% idle can't handle subsecond peaks close to 1Gbps...
> 

Even with multiqueue, it's quite possible that one queue gets more than one
packet per microsecond. The time to process a packet might be greater than
1 us even on recent hardware. So a burst of 1000 small packets with the same
flow information hits one queue, one CPU, and fills the rx ring.
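
A rough model of such a burst, as a user-space sketch: packets arrive at a
fixed interval, one CPU needs a fixed time per packet, and the ring holds a
fixed number of packets. The function name and every number fed to it are
assumptions for illustration (for instance 672 ns between minimum-size frames
at 1Gbps, counting 64 bytes of frame plus 20 bytes of preamble and
inter-frame gap, and a 256-entry ring).

```c
#include <assert.h>

/*
 * Count how many packets of a burst are dropped when one CPU drains one rx
 * ring. Packets arrive every arrival_ns; the CPU needs service_ns per
 * packet; the ring holds at most ring_size packets, including the one
 * currently being processed.
 */
unsigned burst_drops(unsigned burst, unsigned arrival_ns,
		     unsigned service_ns, unsigned ring_size)
{
	unsigned long long next_done = 0; /* when the head packet completes */
	unsigned queued = 0, dropped = 0, i;

	assert(ring_size > 0 && service_ns > 0);

	for (i = 0; i < burst; i++) {
		unsigned long long t = (unsigned long long)i * arrival_ns;

		/* Retire every packet the CPU finished before this arrival. */
		while (queued > 0 && next_done <= t) {
			queued--;
			next_done += service_ns;
		}

		if (queued == ring_size) {
			dropped++; /* ring full: the packet is lost */
		} else {
			if (queued == 0)
				next_done = t + service_ns; /* CPU was idle */
			queued++;
		}
	}
	return dropped;
}
```

With burst_drops(1000, 672, 1000, 256) the backlog grows by roughly one
packet for every three arrivals, so the ring fills before the burst ends and
part of the tail is dropped, even though the average load on the machine is
low.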

Losing these packets is OK; it's very likely an attack :)

> I wish ifstat could report errors so I could see what the traffic rate
> was when the problem occurred...

Yes, it could be added, I guess.



* Re: Strange packet drops with heavy firewalling
From: Benny Amorsen @ 2010-04-15 13:23 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Changli Gao, zhigang gong, netdev
In-Reply-To: <m3hbnfzsye.fsf@ursa.amorsen.dk>

Benny Amorsen <benny+usenet@amorsen.dk> writes:

> I'll keep monitoring the server, and if it starts dropping packets again
> or load increases I'll check whether irqbalanced does the right thing,
> and if not I'll implement your suggestion.

It did start dropping packets (although very few, a few packets dropped
at once perhaps every ten minutes). Irqbalanced didn't move the
interrupts.

Doing

echo 01 >/proc/irq/99/smp_affinity
echo 02 >/proc/irq/100/smp_affinity
echo 04 >/proc/irq/101/smp_affinity

and so on like Eric Dumazet suggested seems to have helped, but not
entirely solved the problem.

The problem now manifests itself this way in ethtool -S:
     rx_no_buffer_count: 270
     rx_queue_drop_packet_count: 270

I can't be sure that I'm not just getting hit by a 1Gbps traffic spike,
of course, but it is a bit strange that a machine which can do 200Mbps
at 92% idle can't handle subsecond peaks close to 1Gbps...

I wish ifstat could report errors so I could see what the traffic rate
was when the problem occurred...


/Benny

^ permalink raw reply

* Re: BUG: using smp_processor_id() in preemptible [00000000] code: avahi-daemon:  caller is netif_rx
From: Eric Dumazet @ 2010-04-15 13:16 UTC (permalink / raw)
  To: Eric Paris, David Miller; +Cc: netdev, Tom Herbert
In-Reply-To: <1271101251.16881.135.camel@edumazet-laptop>

On Monday, 12 April 2010 at 21:40 +0200, Eric Dumazet wrote:
> Good spot, RPS changed the netif_rx() requirements a bit.
> 
> I would change ip_dev_loopback_xmit() to call netif_rx_ni() instead...
> 
> David, Tom ?
> 
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index c65f18e..d1bcc9f 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -120,7 +120,7 @@ static int ip_dev_loopback_xmit(struct sk_buff *newskb)
>  	newskb->pkt_type = PACKET_LOOPBACK;
>  	newskb->ip_summed = CHECKSUM_UNNECESSARY;
>  	WARN_ON(!skb_dst(newskb));
> -	netif_rx(newskb);
> +	netif_rx_ni(newskb);
>  	return 0;
>  }
>  

After some confusion, it seems this was the right fix after all :)

[PATCH] ip: Fix ip_dev_loopback_xmit()

Eric Paris got the following trace with a linux-next kernel:

[   14.203970] BUG: using smp_processor_id() in preemptible [00000000] code: avahi-daemon/2093
[   14.204025] caller is netif_rx+0xfa/0x110
[   14.204035] Call Trace:
[   14.204064]  [<ffffffff81278fe5>] debug_smp_processor_id+0x105/0x110
[   14.204070]  [<ffffffff8142163a>] netif_rx+0xfa/0x110
[   14.204090]  [<ffffffff8145b631>] ip_dev_loopback_xmit+0x71/0xa0
[   14.204095]  [<ffffffff8145b892>] ip_mc_output+0x192/0x2c0
[   14.204099]  [<ffffffff8145d610>] ip_local_out+0x20/0x30
[   14.204105]  [<ffffffff8145d8ad>] ip_push_pending_frames+0x28d/0x3d0
[   14.204119]  [<ffffffff8147f1cc>] udp_push_pending_frames+0x14c/0x400
[   14.204125]  [<ffffffff814803fc>] udp_sendmsg+0x39c/0x790
[   14.204137]  [<ffffffff814891d5>] inet_sendmsg+0x45/0x80
[   14.204149]  [<ffffffff8140af91>] sock_sendmsg+0xf1/0x110
[   14.204189]  [<ffffffff8140dc6c>] sys_sendmsg+0x20c/0x380
[   14.204233]  [<ffffffff8100ad82>] system_call_fastpath+0x16/0x1b

While the current linux-2.6 kernel doesn't emit this warning, the bug is
latent and might cause unexpected failures.

ip_dev_loopback_xmit() runs in process context with preemption enabled, so
it must call netif_rx_ni() instead of netif_rx() to make sure that pending
software interrupts are processed.

Reported-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index c65f18e..d1bcc9f 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -120,7 +120,7 @@ static int ip_dev_loopback_xmit(struct sk_buff *newskb)
 	newskb->pkt_type = PACKET_LOOPBACK;
 	newskb->ip_summed = CHECKSUM_UNNECESSARY;
 	WARN_ON(!skb_dst(newskb));
-	netif_rx(newskb);
+	netif_rx_ni(newskb);
 	return 0;
 }
 



^ permalink raw reply related

* [net-next-2.6 PATCH 1/2] ipv6: cancel to setting local_df in ip6_xmit()
From: Shan Wei @ 2010-04-15 13:04 UTC (permalink / raw)
  To: David Miller, Herbert Xu, emils.tantilov
  Cc: kuznet, pekkas, jmorris,
	yoshfuji@linux-ipv6.org >> YOSHIFUJI Hideaki,
	Patrick McHardy, eric.dumazet, sri, netdev@vger.kernel.org,
	Shan Wei

Commit f88037 (sctp: Drop ipfargok in sctp_xmit function)
dropped ipfragok and set the local_df value properly.

So the change made by commit 77e2f1 (ipv6: Fix ip6_xmit to
send fragments if ipfragok is true) is no longer needed.
This patch removes it.

Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
---
 net/ipv6/ip6_output.c |    4 ----
 1 files changed, 0 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 16c4391..f3a847e 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -231,10 +231,6 @@ int ip6_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl,
 	skb_reset_network_header(skb);
 	hdr = ipv6_hdr(skb);
 
-	/* Allow local fragmentation. */
-	if (ipfragok)
-		skb->local_df = 1;
-
 	/*
 	 *	Fill in the IPv6 header
 	 */
--
1.6.3.3 

^ permalink raw reply related

* Re: rps perfomance WAS(Re: rps: question
From: jamal @ 2010-04-15 12:50 UTC (permalink / raw)
  To: Changli Gao; +Cc: Eric Dumazet, Tom Herbert, netdev
In-Reply-To: <s2p412e6f7f1004150532y4b13a0bfgadab3e6f2f4aecd@mail.gmail.com>

On Thu, 2010-04-15 at 20:32 +0800, Changli Gao wrote:

> For historical reasons, we use Linux-2.6.18. Our company has several
> products with Xen, P4, or i7 CPUs. Some of them are SMP, multi-core and
> multi-threaded.

Thanks for sharing. How much more can you say? ;-> Do you have a paper
or description of some sort somewhere?

> We use a mechanism similar to dynamic weighted
> RPS. The total throughput increases nearly linearly with the number
> of worker threads (one worker thread per CPU).

Other than the i7, have you tried to run RPS on the P4?

cheers,
jamal



^ permalink raw reply

* [PATCH 2/3] ipv4: ipmr: fix invalid cache resolving when adding a non-matching entry
From: Patrick McHardy @ 2010-04-15 12:47 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1271335678-20961-1-git-send-email-kaber@trash.net>

The patch to convert struct mfc_cache to list_heads (ipv4: ipmr: convert
struct mfc_cache to struct list_head) introduced a bug when adding new
cache entries that don't match any unresolved entries.

The unres queue is searched for a matching entry, which is then resolved.
When no matching entry is present, the iterator points to the head of the
list, but is treated as a matching entry. Use a separate variable to
indicate that a matching entry was found.

Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 net/ipv4/ipmr.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 5df5fd7..0643fb6 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1089,12 +1089,14 @@ static int ipmr_mfc_add(struct net *net, struct mr_table *mrt,
 	 *	Check to see if we resolved a queued list. If so we
 	 *	need to send on the frames and tidy up.
 	 */
+	found = false;
 	spin_lock_bh(&mfc_unres_lock);
 	list_for_each_entry(uc, &mrt->mfc_unres_queue, list) {
 		if (uc->mfc_origin == c->mfc_origin &&
 		    uc->mfc_mcastgrp == c->mfc_mcastgrp) {
 			list_del(&uc->list);
 			atomic_dec(&mrt->cache_resolve_queue_len);
+			found = true;
 			break;
 		}
 	}
@@ -1102,7 +1104,7 @@ static int ipmr_mfc_add(struct net *net, struct mr_table *mrt,
 		del_timer(&mrt->ipmr_expire_timer);
 	spin_unlock_bh(&mfc_unres_lock);
 
-	if (uc) {
+	if (found) {
 		ipmr_cache_resolve(net, mrt, uc, c);
 		ipmr_cache_free(uc);
 	}
-- 
1.7.0.4


^ permalink raw reply related

* [PATCH 0/3]: fixes for multicast routing rules
From: Patrick McHardy @ 2010-04-15 12:47 UTC (permalink / raw)
  To: davem; +Cc: netdev

Hi Dave,

the following three patches fix a few bugs introduced by the multicast routing
rule patches:

- a missing Kconfig dependency on IP_MROUTE

- a bug introduced by the list conversion, causing the list head to be treated
  as an element

- a NULL pointer dereference: the net pointer in ipmr_destroy_unres() was
  initialized to NULL. This patch was actually intended to be folded into
  the patch introducing multicast routing rules, but I missed it when
  rebasing to the current tree.

Please apply or pull from:

git://git.kernel.org/pub/scm/linux/kernel/git/kaber/ipmr-2.6.git master

Thanks!

^ permalink raw reply

* [PATCH 3/3] ipv4: ipmr: fix NULL pointer deref during unres queue destruction
From: Patrick McHardy @ 2010-04-15 12:47 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1271335678-20961-1-git-send-email-kaber@trash.net>

Fix an oversight in ipmr_destroy_unres() - the net pointer is
unconditionally initialized to NULL, resulting in a NULL pointer
dereference later on.

Fix by adding a net pointer to struct mr_table and using it in
ipmr_destroy_unres().

Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 net/ipv4/ipmr.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 0643fb6..7d8a2bc 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -71,6 +71,9 @@
 
 struct mr_table {
 	struct list_head	list;
+#ifdef CONFIG_NET_NS
+	struct net		*net;
+#endif
 	u32			id;
 	struct sock		*mroute_sk;
 	struct timer_list	ipmr_expire_timer;
@@ -308,6 +311,7 @@ static struct mr_table *ipmr_new_table(struct net *net, u32 id)
 	mrt = kzalloc(sizeof(*mrt), GFP_KERNEL);
 	if (mrt == NULL)
 		return NULL;
+	write_pnet(&mrt->net, net);
 	mrt->id = id;
 
 	/* Forwarding cache */
@@ -580,7 +584,7 @@ static inline void ipmr_cache_free(struct mfc_cache *c)
 
 static void ipmr_destroy_unres(struct mr_table *mrt, struct mfc_cache *c)
 {
-	struct net *net = NULL; //mrt->net;
+	struct net *net = read_pnet(&mrt->net);
 	struct sk_buff *skb;
 	struct nlmsgerr *e;
 
-- 
1.7.0.4


^ permalink raw reply related

* [PATCH 1/3] ipv4: ipmr: fix IP_MROUTE_MULTIPLE_TABLES Kconfig dependencies
From: Patrick McHardy @ 2010-04-15 12:47 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1271335678-20961-1-git-send-email-kaber@trash.net>

IP_MROUTE_MULTIPLE_TABLES should depend on IP_MROUTE.

Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 net/ipv4/Kconfig |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index be59774..8e3a1fd 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -252,7 +252,7 @@ config IP_MROUTE
 
 config IP_MROUTE_MULTIPLE_TABLES
 	bool "IP: multicast policy routing"
-	depends on IP_ADVANCED_ROUTER
+	depends on IP_MROUTE && IP_ADVANCED_ROUTER
 	select FIB_RULES
 	help
 	  Normally, a multicast router runs a userspace daemon and decides
-- 
1.7.0.4


^ permalink raw reply related

* Re: rps perfomance WAS(Re: rps: question
From: Changli Gao @ 2010-04-15 12:32 UTC (permalink / raw)
  To: hadi
  Cc: Eric Dumazet, Tom Herbert, Stephen Hemminger, netdev, robert,
	David Miller, Andi Kleen
In-Reply-To: <1271333428.23780.3.camel@bigi>

On Thu, Apr 15, 2010 at 8:10 PM, jamal <hadi@cyberus.ca> wrote:
> On Wed, 2010-04-14 at 22:57 +0200, Eric Dumazet wrote:
>
>> On my Nehalem machine (16 logical cpus), its NetXtreme II BCM57711E
>> 10Gigabit has 16 queues. It might be good to use fewer queues according
>> to your results on some workloads, and possibly use RPS as a second
>> layer.
>

For historical reasons, we use Linux-2.6.18. Our company has several
products with Xen, P4, or i7 CPUs. Some of them are SMP, multi-core and
multi-threaded. We use a mechanism similar to dynamic weighted
RPS. The total throughput increases nearly linearly with the number
of worker threads (one worker thread per CPU).

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

