Netdev List
* Re: rps perfomance WAS(Re: rps: question
From: Eric Dumazet @ 2010-04-17  9:23 UTC (permalink / raw)
  To: Tom Herbert
  Cc: hadi, Changli Gao, Rick Jones, David Miller, netdev, robert, andi
In-Reply-To: <z2q65634d661004170143nb35ec784mbedd003565410cfb@mail.gmail.com>

On Saturday, 17 April 2010 at 01:43 -0700, Tom Herbert wrote:
> > So the cost of queuing the packet into our own queue (netif_receive_skb
> > -> enqueue_to_backlog) is about 0.74 us  (74 ms / 100000)
> >
> > I personally think we should process the packet instead of queuing it, but
> > Tom disagrees with me.
> >
> You could do that, but then the packet processing becomes HOL blocking
> on all the packets that are being sent to other queues for
> processing-- remember the IPIs are only sent at the end of the NAPI.
> So unless the upper stack processing is <0.74us in your case, I think
> processing packets directly on the local queue would improve best case
> latency, but would increase average latency and even more likely worst
> case latency on loads with multiple flows.

Anyway, a big part of this 0.74 us overhead comes from get_rps_cpu()
itself, computing skb->rxhash and all. We should review how many cache
lines we exchange per skb, and try to reduce this number.





* Re: HTB - What's the minimal value for 'rate' parameter?
From: Benny Amorsen @ 2010-04-17  9:19 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Antonio Almeida, netdev, kaber, davem, devik
In-Reply-To: <4BC63766.5080104@gmail.com>

Jarek Poplawski <jarkao2@gmail.com> writes:

> As I wrote before, the minimal (overflow safe) rate depends on max
> packet size, and for 1500 byte it would be something around:
> 1500b/2min, so if your clients can wait so long, try this:

Wouldn't it be nice if either tc or the kernel warned about wrong
configurations, or possibly rejected them completely?


/Benny



* Re: rps perfomance WAS(Re: rps: question
From: Tom Herbert @ 2010-04-17  8:43 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: hadi, Changli Gao, Rick Jones, David Miller, netdev, robert, andi
In-Reply-To: <1271489739.16881.4586.camel@edumazet-laptop>

> So the cost of queuing the packet into our own queue (netif_receive_skb
> -> enqueue_to_backlog) is about 0.74 us  (74 ms / 100000)
>
> I personally think we should process the packet instead of queuing it, but
> Tom disagrees with me.
>
You could do that, but then the packet processing becomes HOL blocking
on all the packets that are being sent to other queues for
processing-- remember the IPIs are only sent at the end of the NAPI.
So unless the upper stack processing is <0.74us in your case, I think
processing packets directly on the local queue would improve best case
latency, but would increase average latency and even more likely worst
case latency on loads with multiple flows.
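
To put rough numbers on the head-of-line argument above, here is a minimal
sketch in Python. It is a toy model, not kernel code: it assumes a single NAPI
batch handled on cpu0, reuses the 0.74 us enqueue cost measured in this thread,
and takes a hypothetical 5 us upper-stack cost and a made-up batch of target
CPUs. All names are invented for illustration. It only shows how long packets
bound for other CPUs wait before the end-of-batch IPIs can go out.

```python
ENQUEUE_US = 0.74  # local enqueue cost measured earlier in this thread


def time_until_ipis_us(batch, local_cpu, stack_us, process_inline):
    """Elapsed time for one NAPI batch; IPIs are only sent once it ends."""
    elapsed = 0.0
    for target_cpu in batch:
        if process_inline and target_cpu == local_cpu:
            elapsed += stack_us    # run the upper stack right away
        else:
            elapsed += ENQUEUE_US  # just queue the packet for later
    return elapsed


# Hypothetical batch: half of the packets hash to the local cpu0.
batch = [0, 1, 0, 2, 0, 3, 0, 1]
queued_us = time_until_ipis_us(batch, 0, stack_us=5.0, process_inline=False)
inline_us = time_until_ipis_us(batch, 0, stack_us=5.0, process_inline=True)
cheap_inline_us = time_until_ipis_us(batch, 0, stack_us=0.5, process_inline=True)

print(f"always queue: IPIs sent after {queued_us:.2f} us")
print(f"inline local: IPIs sent after {inline_us:.2f} us")
```

In this toy model inline processing only shortens the wait for remote packets
when stack_us is below the enqueue cost, which is the condition stated above.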

> RPS on, directed on cpu1 (other socket)
> (echo 02 > /sys/class/net/eth3/queues/rx-0/rps_cpus)
> 100000 packets transmitted, 100000 received, 0% packet loss, time 4542ms
>
> So the extra cost to enqueue to a remote cpu queue, IPI, softirq handling...
> is 3 us. Note this is the cost when we receive a single packet.
>
> I suspect the IPI itself is in the 1.5 us range, not very far from the
> queuing-to-ourselves case.
>
> For me the RPS use cases are:
>
> 1) Value-added apps handling a lot of TCP data, where the cost of cache
> misses in the tcp stack easily justifies spending 3 us to gain much more.
>
> 2) Network appliances, where a single cpu is 100% busy handling one
> device's hardware and software/RPS interrupts, delegating all higher-level
> work to a pool of cpus.
>
> I'll try to do these tests on a Nehalem target.
>
>
>
>


* Re: [PATCH net-next-2.6] rps: rps_sock_flow_table is mostly read
From: David Miller @ 2010-04-17  7:57 UTC (permalink / raw)
  To: eric.dumazet; +Cc: therbert, netdev
In-Reply-To: <1271490733.16881.4588.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 17 Apr 2010 09:52:13 +0200

> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied, thanks Eric.


* [PATCH net-next-2.6] rps: rps_sock_flow_table is mostly read
From: Eric Dumazet @ 2010-04-17  7:52 UTC (permalink / raw)
  To: David Miller; +Cc: therbert, netdev
In-Reply-To: <1271404097.16881.3827.camel@edumazet-laptop>


Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/core/dev.c b/net/core/dev.c
index d7107ac..7abf959 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2205,7 +2205,7 @@ DEFINE_PER_CPU(struct netif_rx_stats, netdev_rx_stat) = { 0, };
 #ifdef CONFIG_RPS
 
 /* One global table that all flow-based protocols share. */
-struct rps_sock_flow_table *rps_sock_flow_table;
+struct rps_sock_flow_table *rps_sock_flow_table __read_mostly;
 EXPORT_SYMBOL(rps_sock_flow_table);
 
 /*




* Re: rps perfomance WAS(Re: rps: question
From: Eric Dumazet @ 2010-04-17  7:35 UTC (permalink / raw)
  To: hadi; +Cc: Changli Gao, Rick Jones, David Miller, therbert, netdev, robert,
	andi
In-Reply-To: <1271424065.4606.31.camel@bigi>

On Friday, 16 April 2010 at 09:21 -0400, jamal wrote:
> On Fri, 2010-04-16 at 07:18 +0200, Eric Dumazet wrote:
> 
> > 
> > A kernel module might do this; it could be integrated in perf bench so
> > that we can regression-test upcoming kernels.
> 
> Perf would be good - but even softnet_stat cleaner than the nasty
> hack i use (attached) would be a good start; the ping with and without
> rps gives me a ballpark number.
> 
> IPI is important to me because I tried it before and failed
> miserably. I was thinking the improvement may be due to hardware used
> but i am having a hard time to get people to tell me what hardware they
> used! I am old school - I need data;-> The RFS patch commit seems to
> have more info but still vague, example: 
> "The benefits of RFS are dependent on cache hierarchy, application
> load, and other factors"
> Also, what does a "simple" or "complex" benchmark mean?;->
> I think it is only fair to get this info, no?
> 
> Please don't consider what i say above as being anti-RPS.
> 5 microsec extra latency is not bad if it can be amortized.
> Unfortunately, the best traffic i could generate was < 20Kpps of
> ping which still manages to get 1 IPI/packet on Nehalem. I am going
> to write up some app (lots of cycles available tomorrow). I still think
> it is valuable.

I did some tests on a dual quad core machine (E5450  @ 3.00GHz), not
Nehalem. So a 3-4 year old design.

For all tests, I use the best time of 3 runs of "ping -f -q -c 100000
192.168.0.2". Yes, ping is not very good, but it's available ;)

Note: I make sure all 8 cpus of the target are busy, eating cpu cycles in
user land. I don't want to tweak acpi or whatever smart power saving
mechanisms.

When RPS off
100000 packets transmitted, 100000 received, 0% packet loss, time 4160ms

RPS on, but directed on the cpu0 handling device interrupts (tg3, napi)
(echo 01 > /sys/class/net/eth3/queues/rx-0/rps_cpus)
100000 packets transmitted, 100000 received, 0% packet loss, time 4234ms

So the cost of queuing the packet into our own queue (netif_receive_skb
-> enqueue_to_backlog) is about 0.74 us  (74 ms / 100000)

I personally think we should process the packet instead of queuing it, but
Tom disagrees with me.

RPS on, directed on cpu1 (other socket)
(echo 02 > /sys/class/net/eth3/queues/rx-0/rps_cpus)
100000 packets transmitted, 100000 received, 0% packet loss, time 4542ms

So the extra cost to enqueue to a remote cpu queue, IPI, softirq handling...
is 3 us. Note this is the cost when we receive a single packet.

I suspect the IPI itself is in the 1.5 us range, not very far from the
queuing-to-ourselves case.

For me the RPS use cases are:

1) Value-added apps handling a lot of TCP data, where the cost of cache
misses in the tcp stack easily justifies spending 3 us to gain much more.

2) Network appliances, where a single cpu is 100% busy handling one
device's hardware and software/RPS interrupts, delegating all higher-level
work to a pool of cpus.

I'll try to do these tests on a Nehalem target.
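
The per-packet figures above can be re-derived from the three ping totals. A
small Python sketch of that arithmetic, using only the numbers quoted in this
message (the function and variable names are made up for illustration):

```python
PACKETS = 100_000  # ping -c 100000


def per_packet_us(total_ms, baseline_ms, packets=PACKETS):
    """Extra cost per packet, in microseconds, relative to a baseline run."""
    return (total_ms - baseline_ms) * 1000.0 / packets


rps_off_ms = 4160     # RPS off
rps_local_ms = 4234   # RPS on, steered to cpu0 (the interrupt cpu)
rps_remote_ms = 4542  # RPS on, steered to cpu1 (other socket)

local_enqueue_us = per_packet_us(rps_local_ms, rps_off_ms)
remote_extra_us = per_packet_us(rps_remote_ms, rps_local_ms)
remote_total_us = per_packet_us(rps_remote_ms, rps_off_ms)

print(f"local enqueue:            {local_enqueue_us:.2f} us/packet")
print(f"remote vs local enqueue:  {remote_extra_us:.2f} us/packet")
print(f"remote vs RPS off:        {remote_total_us:.2f} us/packet")
```

Since these are flood-ping totals, each figure is the average added to one
request/reply round trip, not to a one-way packet.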





* Re: Duplicate IP false alerts from arping
From: unni krishnan @ 2010-04-17  4:11 UTC (permalink / raw)
  To: netdev; +Cc: linux-net
In-Reply-To: <m2v1f8bbe3c1004152351o580b37ebx55e428fb106b09bd@mail.gmail.com>

Hi,

I am trying to find a duplicate IP in the network using arping.

 -------------------------
 [root@vps1 ~]# ping -c 3 192.168.1.212
 PING 192.168.1.212 (192.168.1.212) 56(84) bytes of data.
 64 bytes from 192.168.1.212: icmp_seq=1 ttl=64 time=1.33 ms
 64 bytes from 192.168.1.212: icmp_seq=2 ttl=64 time=0.280 ms
 64 bytes from 192.168.1.212: icmp_seq=3 ttl=64 time=0.306 ms

 --- 192.168.1.212 ping statistics ---
 3 packets transmitted, 3 received, 0% packet loss, time 1999ms
 rtt min/avg/max/mdev = 0.280/0.641/1.339/0.494 ms
 [root@vps1 ~]# arping -D -I eth0 -c 5 192.168.1.212 ; echo $?
 ARPING 192.168.1.212 from 0.0.0.0 eth0
 0
 -------------------------


 As per arping that IP is a duplicate. But if I go ahead and ifdown the
 IP in the known location I can't ping that IP (does that mean the IP is
 not duplicated?). This is the result after shutting down the IP.

 --------------------------
 [root@vps1 ~]# ping -c 3 192.168.1.212
 PING 192.168.1.212 (192.168.1.212) 56(84) bytes of data.
 From 192.168.1.63 icmp_seq=1 Destination Host Unreachable
 From 192.168.1.63 icmp_seq=2 Destination Host Unreachable
 From 192.168.1.63 icmp_seq=3 Destination Host Unreachable

 --- 192.168.1.212 ping statistics ---
 3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2001ms
 , pipe 3
 [root@vps1 ~]# arping -D -I eth0 -c 5 192.168.1.212 ; echo $?
 ARPING 192.168.1.212 from 0.0.0.0 eth0
 Sent 5 probes (5 broadcast(s))
 Received 0 response(s)
 0
 [root@vps1 ~]#
 --------------------------

 My question is: in this case IP 192.168.1.212 is not duplicated, but
 arping still reports duplicate status. Why is that?

 --
 Regards,
 Unni
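
 One possible reading of the exit codes above, as a hedged sketch: as far as
 I recall, the iputils arping man page says that in -D (duplicate address
 detection) mode the command returns 0 when DAD succeeded, i.e. when no
 replies were received. Under that assumption, the "0" printed by "echo $?"
 would mean "no duplicate seen", not "duplicate". The helper below is
 hypothetical and only encodes that reading; it does not run arping.

```python
def dad_verdict(exit_status: int) -> str:
    """Interpret the exit status of `arping -D`, assuming 0 means no replies."""
    if exit_status == 0:
        return "no reply seen: address looks free"
    return "reply seen (or error): address may already be in use"


# Both runs in this message printed 0.
for label, status in (("host up", 0), ("host down", 0)):
    print(f"{label}: {dad_verdict(status)}")
```

 This does not explain why the first run, with the host up, also exited 0 and
 printed no probe summary; that would still need looking into.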


* [PATCH] mac8390: change an error return code and some cleanup, take 3
From: Finn Thain @ 2010-04-17  3:16 UTC (permalink / raw)
  To: David Miller; +Cc: joe, p_gortmaker, netdev, linux-kernel, linux-m68k
In-Reply-To: <alpine.OSX.2.00.1004162347390.271@localhost>


Change an error return code from -EAGAIN to -EBUSY since the former is 
misleading.

Nubus slots are geographically addressed and their irqs are equally 
inflexible. -EAGAIN is misleading because retrying will not help fix 
whatever bug it was that made the irq unavailable.

Also promote the log message. Likewise some other KERN_INFO log messages.

Signed-off-by: Finn Thain <fthain@telegraphics.com.au>

--- a/drivers/net/mac8390.c	2010-04-16 13:31:04.000000000 +1000
+++ b/drivers/net/mac8390.c	2010-04-16 23:50:39.000000000 +1000
@@ -554,7 +554,7 @@
 	case MAC8390_APPLE:
 		switch (mac8390_testio(dev->mem_start)) {
 		case ACCESS_UNKNOWN:
-			pr_info("Don't know how to access card memory!\n");
+			pr_err("Don't know how to access card memory!\n");
 			return -ENODEV;
 			break;
 
@@ -643,8 +643,8 @@
 {
 	__ei_open(dev);
 	if (request_irq(dev->irq, __ei_interrupt, 0, "8390 Ethernet", dev)) {
-		pr_info("%s: unable to get IRQ %d.\n", dev->name, dev->irq);
-		return -EAGAIN;
+		pr_err("%s: unable to get IRQ %d.\n", dev->name, dev->irq);
+		return -EBUSY;
 	}
 	return 0;
 }
@@ -660,7 +660,7 @@
 {
 	ei_status.txing = 0;
 	if (ei_debug > 1)
-		pr_info("reset not supported\n");
+		printk(KERN_DEBUG pr_fmt("reset not supported\n"));
 	return;
 }
 
@@ -668,7 +668,7 @@
 {
 	unsigned char *target = nubus_slot_addr(IRQ2SLOT(dev->irq));
 	if (ei_debug > 1)
-		pr_info("Need to reset the NS8390 t=%lu...", jiffies);
+		printk(KERN_DEBUG pr_fmt("Need to reset the NS8390 t=%lu..."), jiffies);
 	ei_status.txing = 0;
 	target[0xC0000] = 0;
 	if (ei_debug > 1)
 


* Re: [PATCH] mac8390: fix pr_info() calls and change return code
From: Finn Thain @ 2010-04-17  2:28 UTC (permalink / raw)
  To: David Miller; +Cc: Joe Perches, p_gortmaker, netdev, linux-kernel, linux-m68k
In-Reply-To: <20100416.132855.230883346.davem@davemloft.net>


On Fri, 16 Apr 2010, David Miller wrote:

> 
> I just want to point out that with all the trouble you gave about Joe's 
> work, you're having one heck of a time even submitting your changes 
> properly. :-)

You are a thorough reviewer, despite the regrettable implication to the 
contrary in my first message in this thread. And you are quite right, I 
did not understand the pr_* macros at the time. I apologise.

I am hypersensitive to bit rot. I guess that's what happens when one makes 
it one's job to fix regressions in Linux.

Finn


* Re: [PATCH v5] rfs: Receive Flow Steering
From: David Miller @ 2010-04-17  0:58 UTC (permalink / raw)
  To: therbert; +Cc: eric.dumazet, netdev
In-Reply-To: <i2s65634d661004161722hece6f9d4naf528c37b63fffbc@mail.gmail.com>

From: Tom Herbert <therbert@google.com>
Date: Fri, 16 Apr 2010 17:22:49 -0700

> Ugh, vmalloc.h must be sneaking in through some other header file for
> me :-(  Sorry about that.  Do you need me to respin the patch?

No, I took care of it and am about to push things out to net-next-2.6
on kernel.org


* [PATCH] KS8851: NULL pointer dereference if list is empty
From: Abraham Arce @ 2010-04-17  0:48 UTC (permalink / raw)
  To: netdev

Fix NULL pointer dereference in ks8851_tx_work by checking if dequeued
list is already empty before writing the packet to TX FIFO

 Unable to handle kernel NULL pointer dereference at virtual address 00000050
 PC is at ks8851_tx_work+0xdc/0x1b0
 LR is at wait_for_common+0x148/0x164
 pc : [<c01c0df4>]    lr : [<c025a980>]    psr: 20000013
 Backtrace:
  ks8851_tx_work+0x0/0x1b0
  worker_thread+0x0/0x190
  kthread+0x0/0x90

Signed-off-by: Abraham Arce <x0066660@ti.com>
---
 drivers/net/ks8851.c |   12 +++++++-----
 1 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ks8851.c b/drivers/net/ks8851.c
index 13cc1ca..9e9f9b3 100644
--- a/drivers/net/ks8851.c
+++ b/drivers/net/ks8851.c
@@ -722,12 +722,14 @@ static void ks8851_tx_work(struct work_struct *work)
 		txb = skb_dequeue(&ks->txq);
 		last = skb_queue_empty(&ks->txq);

-		ks8851_wrreg16(ks, KS_RXQCR, ks->rc_rxqcr | RXQCR_SDA);
-		ks8851_wrpkt(ks, txb, last);
-		ks8851_wrreg16(ks, KS_RXQCR, ks->rc_rxqcr);
-		ks8851_wrreg16(ks, KS_TXQCR, TXQCR_METFE);
+		if (txb != NULL) {
+			ks8851_wrreg16(ks, KS_RXQCR, ks->rc_rxqcr | RXQCR_SDA);
+			ks8851_wrpkt(ks, txb, last);
+			ks8851_wrreg16(ks, KS_RXQCR, ks->rc_rxqcr);
+			ks8851_wrreg16(ks, KS_TXQCR, TXQCR_METFE);

-		ks8851_done_tx(ks, txb);
+			ks8851_done_tx(ks, txb);
+		}
 	}

 	mutex_unlock(&ks->lock);
-- 
1.5.4.3
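
The guard added by this patch can be modelled outside the kernel. The Python
sketch below is an illustration only: the loop shape around the hunk is an
assumption (the diff shows just the dequeue and the emptiness check), and
dequeue/queue_empty/write_packet are hypothetical stand-ins for skb_dequeue(),
skb_queue_empty() and the register/FIFO writes. It shows that a dequeue
returning nothing is skipped instead of being written out.

```python
from collections import deque


def tx_work(dequeue, queue_empty, write_packet):
    """Drain a TX queue, skipping the write when dequeue yields nothing."""
    written = 0
    last = queue_empty()
    while not last:
        txb = dequeue()
        last = queue_empty()
        if txb is not None:  # the check this patch introduces
            write_packet(txb, last)
            written += 1
    return written


# Normal case: two packets are queued and both get written.
txq = deque(["pkt0", "pkt1"])
log = []
normal_count = tx_work(
    lambda: txq.popleft() if txq else None,
    lambda: not txq,
    lambda pkt, last: log.append((pkt, last)),
)

# Racy case: the queue looked non-empty, but dequeue found nothing.
empties = iter([False, True])
racy_log = []
racy_count = tx_work(
    lambda: None,
    lambda: next(empties),
    lambda pkt, last: racy_log.append((pkt, last)),
)
```

Without the None check, the racy case would pass an empty buffer to
write_packet, which is the analogue of the NULL dereference in the oops above.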


* Re: [PATCH v5] rfs: Receive Flow Steering
From: Tom Herbert @ 2010-04-17  0:22 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev
In-Reply-To: <20100416.155707.66748057.davem@davemloft.net>

Ugh, vmalloc.h must be sneaking in through some other header file for
me :-(  Sorry about that.  Do you need me to respin the patch?

Tom

On Fri, Apr 16, 2010 at 3:57 PM, David Miller <davem@davemloft.net> wrote:
> From: David Miller <davem@davemloft.net>
> Date: Fri, 16 Apr 2010 15:53:40 -0700 (PDT)
>
>> From: David Miller <davem@davemloft.net>
>> Date: Fri, 16 Apr 2010 15:49:32 -0700 (PDT)
>>
>>> Great, I'll add this to net-next-2.6 right now.
>>
>> I had to add an include of linux/vmalloc.h to net/core/sysctl_net_core.c
>> to fix the build while committing this.
>
> net/core/net-sysfs.c needed it too :-/
>


* Re: Network protocol (IP,IPv6,...) and TC actions (ACT_CSUM)
From: Grégoire Baron @ 2010-04-16 23:10 UTC (permalink / raw)
  To: netdev; +Cc: Jan Ceuleers
In-Reply-To: <4BC4AA5B.8040901@computer.org>

Thanks Jan, for the suggestion.

Hi,

I will re-explain my situation.

I started to write a new TC action (ACT_CSUM) in order to be able to
force, especially when ACT_PEDIT is used, the update of common checksums:
 * the IPv4 header checksum,
 * the ICMP/IGMP and ICMPv6 checksums,
 * the TCP/UDP checksums,
 * and why not, more ...
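
For reference, the IPv4 header checksum in the list above is the standard
16-bit one's-complement sum (RFC 1071). A userspace Python sketch follows. It
only illustrates the arithmetic and is not the proposed ACT_CSUM code; the
function name and the sample header bytes are arbitrary examples.

```python
import struct


def ipv4_header_checksum(header: bytes) -> int:
    """One's-complement checksum of an IPv4 header (checksum field zeroed)."""
    if len(header) % 2:
        raise ValueError("IPv4 header length must be a multiple of 2 bytes")
    # Sum the header as big-endian 16-bit words.
    total = sum(word for (word,) in struct.iter_unpack("!H", header))
    # Fold the carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# 20-byte sample header with the checksum field (bytes 10-11) set to zero.
sample = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
checksum = ipv4_header_checksum(sample)

# Re-running over the header with the checksum filled in should give zero.
filled = sample[:10] + struct.pack("!H", checksum) + sample[12:]
residual = ipv4_header_checksum(filled)
print(hex(checksum), residual)
```

After a pedit-style rewrite of any header field, recomputing over the modified
bytes with the checksum field zeroed gives the new value to store.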

Also, the idea is to support IPv4 and IPv6 directly.
The best user interface (via iproute2/tc) would not ask the end
user to specify a network protocol, but would let the action
discover it.

With this aim, I would like to know if someone could confirm that the
struct sk_buff .protocol member is a good candidate for discovering whether
I have an IPv4 packet, an IPv6 packet or any other network protocol in the
skb received by the TC action code (supporting INGRESS and EGRESS).

Indeed, this struct sk_buff member could contain something like
ETH_P_8021Q, which isn't a network protocol Id ...

I think this kind of content isn't seen by the TC actions, which work
at the network level (even if their filter protocol flag accepts all).
If someone could confirm, thanks in advance.
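
To make the ETH_P_8021Q concern concrete, here is a userspace Python sketch
that finds the network protocol in a raw Ethernet frame by skipping 802.1Q
tags. It is an illustration on raw bytes only and says nothing about what
skb->protocol actually holds when a TC action runs; the function name and the
test frames are made up. The EtherType constants are the usual values from
linux/if_ether.h.

```python
import struct

ETH_P_IP = 0x0800
ETH_P_IPV6 = 0x86DD
ETH_P_8021Q = 0x8100


def network_protocol(frame: bytes):
    """Return (ethertype, payload_offset), looking past any 802.1Q tags."""
    offset = 12  # skip destination and source MAC addresses
    (ethertype,) = struct.unpack_from("!H", frame, offset)
    offset += 2
    while ethertype == ETH_P_8021Q:
        # A VLAN tag is 2 bytes of TCI followed by the inner EtherType.
        (ethertype,) = struct.unpack_from("!H", frame, offset + 2)
        offset += 4
    return ethertype, offset


macs = bytes(12)  # dummy addresses
untagged = macs + struct.pack("!H", ETH_P_IP) + b"payload"
tagged = macs + struct.pack("!HHH", ETH_P_8021Q, 1, ETH_P_IPV6) + b"payload"

untagged_result = network_protocol(untagged)
tagged_result = network_protocol(tagged)
print(untagged_result, tagged_result)
```

An action that only compares the outer value against ETH_P_IP or ETH_P_IPV6
would treat the tagged frame as "unknown", which is the case asked about here.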

In the same way, I've wondered if the struct sk_buff .len member could
be used to avoid having to "discover" the network packet length in the TC
action code, especially in the case of IPv6 packets (and jumbograms ;-).
But I think not, because in my view that may not hold during INGRESS TC
action execution, since the packet hasn't been delivered to
the network protocol yet. Is my analysis right?

Thanks again for your help.

Best Regards,

Grégoire Baron

On Tue, Apr 13, 2010 at 07:31:07PM +0200, Jan Ceuleers wrote:
> Grégoire Baron wrote:
> > As this .protocol member seems to be used at different moments when a
> > packet is received, forwarded or sent, and could contain something like
> > ETH_P_8021Q which isn't a network protocol Id, can we say the struct
> > sk_buff .protocol member is guaranteed to contain a network protocol Id
> > in the struct sk_buff used in the TC action executions ?
> 
> Grégoire,
> 
> I suggest that you ask your question on the netdev mailing list (netdev@vger.kernel.org).
> 
> Cheers, Jan


* Re: [PATCH v5] rfs: Receive Flow Steering
From: David Miller @ 2010-04-16 22:57 UTC (permalink / raw)
  To: eric.dumazet; +Cc: therbert, netdev
In-Reply-To: <20100416.155340.256882855.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Fri, 16 Apr 2010 15:53:40 -0700 (PDT)

> From: David Miller <davem@davemloft.net>
> Date: Fri, 16 Apr 2010 15:49:32 -0700 (PDT)
> 
>> Great, I'll add this to net-next-2.6 right now.
> 
> I had to add an include of linux/vmalloc.h to net/core/sysctl_net_core.c
> to fix the build while committing this.

net/core/net-sysfs.c needed it too :-/


* Re: [PATCH v5] rfs: Receive Flow Steering
From: David Miller @ 2010-04-16 22:53 UTC (permalink / raw)
  To: eric.dumazet; +Cc: therbert, netdev
In-Reply-To: <20100416.154932.147279343.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Fri, 16 Apr 2010 15:49:32 -0700 (PDT)

> Great, I'll add this to net-next-2.6 right now.

I had to add an include of linux/vmalloc.h to net/core/sysctl_net_core.c
to fix the build while committing this.


* Re: [PATCH v5] rfs: Receive Flow Steering
From: David Miller @ 2010-04-16 22:49 UTC (permalink / raw)
  To: eric.dumazet; +Cc: therbert, netdev
In-Reply-To: <1271446679.16881.4298.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 16 Apr 2010 21:37:59 +0200

> On Thursday, 15 April 2010 at 23:33 -0700, David Miller wrote:
>> From: Tom Herbert <therbert@google.com>
>> Date: Thu, 15 Apr 2010 22:47:08 -0700 (PDT)
>> 
>> > Version 5 of RFS:
>> > - Moved rps_sock_flow_sysctl into net/core/sysctl_net_core.c as a
>> > static function.
>> > - Apply limits to rps_sock_flow_entries sysctl and rps_flow_count
>> > sysfs variable.
>> 
>> I've read this over a few times and I think it's ready to go into
>> net-next-2.6, we can tweak things as-needed from here on out.
>> 
>> Eric, what do you think?
> 
> I think I can give my Sob, and we have time to fully test it and tweak
> it if necessary.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Great, I'll add this to net-next-2.6 right now.

Thanks!


* Re: [PATCH net-2.6] packet : remove init_net restriction
From: David Miller @ 2010-04-16 22:41 UTC (permalink / raw)
  To: daniel.lezcano; +Cc: netdev
In-Reply-To: <1271322674-21726-1-git-send-email-daniel.lezcano@free.fr>

From: Daniel Lezcano <daniel.lezcano@free.fr>
Date: Thu, 15 Apr 2010 11:11:14 +0200

> The af_packet protocol is used by Perl to do ioctls as reported by
> Stephane Riviere:
> 
> "Net::RawIP relies on SIOCGIFADDR and SIOCGIFHWADDR to get the IP and MAC
> addresses of the network interface."
> 
> But in a new network namespace these ioctls fail because they are disabled
> for any namespace other than the init_net_ns.
> 
> These two lines should not be there, as af_inet and af_packet have been
> namespace aware for a long time now. I suppose we forgot to remove these
> lines because we sent the af_packet first, before af_inet was supported.
> 
> Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
> Reported-by: Stephane Riviere <stephane.riviere@regis-dgac.net>

Applied, thanks!


* Re: [PATCH] WAN: flush tx_queue in hdlc_ppp to prevent panic on rmmod hw_driver.
From: David Miller @ 2010-04-16 22:41 UTC (permalink / raw)
  To: khc; +Cc: netdev
In-Reply-To: <m3mxx5mv8v.fsf@intrepid.localdomain>

From: Krzysztof Halasa <khc@pm.waw.pl>
Date: Thu, 15 Apr 2010 02:09:52 +0200

> tx_queue is used as a temporary queue when not allowed to queue skb
> directly to the hw device driver (which may sleep). Most paths flush
> it before returning, but ppp_start() currently cannot. Make sure we
> don't leave skbs pointing to a non-existent device.
> 
> Thanks to Michael Barkowski for reporting this problem.
> 
> Signed-off-by: Krzysztof Hałasa <khc@pm.waw.pl>

Applied, thank you.


* [PATCH net-next-2.6] net: Introduce skb_orphan_try()
From: Eric Dumazet @ 2010-04-16 22:18 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20100415.143321.200497785.davem@davemloft.net>

On Thursday, 15 April 2010 at 14:33 -0700, David Miller wrote:

> If it's not legal to skb_orphan() here then it would not be legal for
> the drivers to unconditionally skb_orphan(), which they do.
> 
> So either your test is unnecessary, or we have a big existing problem
> :-)

I cooked up the following patch, introducing a skb_orphan_try() helper, to
document all known exceptions.

I have a possible followup for this patch:

Orphaning skbs earlier could also make dev_kfree_skb_irq() faster.
Instead of queuing the skb into completion_queue and triggering
NET_TX_SOFTIRQ, we would directly free an orphaned skb?



[PATCH net-next-2.6] net: Introduce skb_orphan_try()

Transmitted skb might be attached to a socket and a destructor, for
memory accounting purposes.

Traditionally, this destructor is called at tx completion time, when skb
is freed.

When tx completion is performed by a cpu other than the sender, this
forces some cache lines to change ownership. XPS was an attempt to give
tx completion to the initial cpu.

David's idea is to call the destructor right before giving the skb to the
device (call to ndo_start_xmit()). Because device queues are usually small,
orphaning the skb before tx completion is not a big deal. Some drivers
already do this; we could do it at the upper level.

There is one known exception to this early orphaning, called tx
timestamping. It needs to keep a reference to socket until device can
give a hardware or software timestamp.

This patch adds a skb_orphan_try() helper, to centralize all exceptions
to early orphaning in one spot, and use it in dev_hard_start_xmit().

"tbench 16" results on a Nehalem machine (2 X5570  @ 2.93GHz)
before: Throughput 4428.9 MB/sec 16 procs
after: Throughput 4448.14 MB/sec 16 procs

UDP should get even better results, its destructor being more complex,
since SOCK_USE_WRITE_QUEUE is not set (four atomic ops instead of one)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/core/dev.c b/net/core/dev.c
index e8041eb..acae5fe 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1880,6 +1880,17 @@ static int dev_gso_segment(struct sk_buff *skb)
 	return 0;
 }
 
+/*
+ * Try to orphan skb early, right before transmission by the device.
+ * We cannot orphan skb if tx timestamp is requested, since
+ * drivers need to call skb_tstamp_tx() to send the timestamp.
+ */
+static inline void skb_orphan_try(struct sk_buff *skb)
+{
+	if (!skb_tx(skb)->flags)
+		skb_orphan(skb);
+}
+
 int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 			struct netdev_queue *txq)
 {
@@ -1904,23 +1915,10 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 		if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
 			skb_dst_drop(skb);
 
+		skb_orphan_try(skb);
 		rc = ops->ndo_start_xmit(skb, dev);
 		if (rc == NETDEV_TX_OK)
 			txq_trans_update(txq);
-		/*
-		 * TODO: if skb_orphan() was called by
-		 * dev->hard_start_xmit() (for example, the unmodified
-		 * igb driver does that; bnx2 doesn't), then
-		 * skb_tx_software_timestamp() will be unable to send
-		 * back the time stamp.
-		 *
-		 * How can this be prevented? Always create another
-		 * reference to the socket before calling
-		 * dev->hard_start_xmit()? Prevent that skb_orphan()
-		 * does anything in dev->hard_start_xmit() by clearing
-		 * the skb destructor before the call and restoring it
-		 * afterwards, then doing the skb_orphan() ourselves?
-		 */
 		return rc;
 	}
 
@@ -1938,6 +1936,7 @@ gso:
 		if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
 			skb_dst_drop(nskb);
 
+		skb_orphan_try(nskb);
 		rc = ops->ndo_start_xmit(nskb, dev);
 		if (unlikely(rc != NETDEV_TX_OK)) {
 			if (rc & ~NETDEV_TX_MASK)




* [PATCH] gigaset: include cleanup cleanup
From: Tilman Schmidt @ 2010-04-16 22:08 UTC (permalink / raw)
  To: Karsten Keil, David Miller
  Cc: Tejun Heo, Hansjoerg Lipp, i4ldeveloper, netdev, linux-kernel

Commit 5a0e3ad causes slab.h to be included twice in many of the
Gigaset driver's source files, first via the common include file
gigaset.h and then a second time directly. Drop the spares, and
use the opportunity to clean up a few more similar cases.

Impact: cleanup, no functional change
Signed-off-by: Tilman Schmidt <tilman@imap.cc>
CC: Tejun Heo <tj@kernel.org>
---
Seeing that the "include cleanup" patch triggering this was accepted
after the merge window, I have hopes this one will be accepted, too.

 drivers/isdn/gigaset/bas-gigaset.c |    5 -----
 drivers/isdn/gigaset/capi.c        |    2 --
 drivers/isdn/gigaset/common.c      |    2 --
 drivers/isdn/gigaset/gigaset.h     |    2 +-
 drivers/isdn/gigaset/i4l.c         |    1 -
 drivers/isdn/gigaset/interface.c   |    1 -
 drivers/isdn/gigaset/proc.c        |    1 -
 drivers/isdn/gigaset/ser-gigaset.c |    3 ---
 drivers/isdn/gigaset/usb-gigaset.c |    4 ----
 9 files changed, 1 insertions(+), 20 deletions(-)

diff --git a/drivers/isdn/gigaset/bas-gigaset.c b/drivers/isdn/gigaset/bas-gigaset.c
index 0be15c7..47a5ffe 100644
--- a/drivers/isdn/gigaset/bas-gigaset.c
+++ b/drivers/isdn/gigaset/bas-gigaset.c
@@ -14,11 +14,6 @@
  */
 
 #include "gigaset.h"
-
-#include <linux/errno.h>
-#include <linux/init.h>
-#include <linux/slab.h>
-#include <linux/timer.h>
 #include <linux/usb.h>
 #include <linux/module.h>
 #include <linux/moduleparam.h>
diff --git a/drivers/isdn/gigaset/capi.c b/drivers/isdn/gigaset/capi.c
index eb7e271..964a55f 100644
--- a/drivers/isdn/gigaset/capi.c
+++ b/drivers/isdn/gigaset/capi.c
@@ -12,8 +12,6 @@
  */
 
 #include "gigaset.h"
-#include <linux/slab.h>
-#include <linux/ctype.h>
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
 #include <linux/isdn/capilli.h>
diff --git a/drivers/isdn/gigaset/common.c b/drivers/isdn/gigaset/common.c
index 0b39b38..f6f45f2 100644
--- a/drivers/isdn/gigaset/common.c
+++ b/drivers/isdn/gigaset/common.c
@@ -14,10 +14,8 @@
  */
 
 #include "gigaset.h"
-#include <linux/ctype.h>
 #include <linux/module.h>
 #include <linux/moduleparam.h>
-#include <linux/slab.h>
 
 /* Version Information */
 #define DRIVER_AUTHOR "Hansjoerg Lipp <hjlipp@web.de>, Tilman Schmidt <tilman@imap.cc>, Stefan Eilers"
diff --git a/drivers/isdn/gigaset/gigaset.h b/drivers/isdn/gigaset/gigaset.h
index 9ef5b04..d32efb6 100644
--- a/drivers/isdn/gigaset/gigaset.h
+++ b/drivers/isdn/gigaset/gigaset.h
@@ -22,9 +22,9 @@
 #include <linux/kernel.h>
 #include <linux/compiler.h>
 #include <linux/types.h>
+#include <linux/ctype.h>
 #include <linux/slab.h>
 #include <linux/spinlock.h>
-#include <linux/usb.h>
 #include <linux/skbuff.h>
 #include <linux/netdevice.h>
 #include <linux/ppp_defs.h>
diff --git a/drivers/isdn/gigaset/i4l.c b/drivers/isdn/gigaset/i4l.c
index c99fb97..c22e5ac 100644
--- a/drivers/isdn/gigaset/i4l.c
+++ b/drivers/isdn/gigaset/i4l.c
@@ -15,7 +15,6 @@
 
 #include "gigaset.h"
 #include <linux/isdnif.h>
-#include <linux/slab.h>
 
 #define HW_HDR_LEN	2	/* Header size used to store ack info */
 
diff --git a/drivers/isdn/gigaset/interface.c b/drivers/isdn/gigaset/interface.c
index f0dc6c9..c9f28dd 100644
--- a/drivers/isdn/gigaset/interface.c
+++ b/drivers/isdn/gigaset/interface.c
@@ -13,7 +13,6 @@
 
 #include "gigaset.h"
 #include <linux/gigaset_dev.h>
-#include <linux/tty.h>
 #include <linux/tty_flip.h>
 
 /*** our ioctls ***/
diff --git a/drivers/isdn/gigaset/proc.c b/drivers/isdn/gigaset/proc.c
index b69f73a..b943efb 100644
--- a/drivers/isdn/gigaset/proc.c
+++ b/drivers/isdn/gigaset/proc.c
@@ -14,7 +14,6 @@
  */
 
 #include "gigaset.h"
-#include <linux/ctype.h>
 
 static ssize_t show_cidmode(struct device *dev,
 			    struct device_attribute *attr, char *buf)
diff --git a/drivers/isdn/gigaset/ser-gigaset.c b/drivers/isdn/gigaset/ser-gigaset.c
index 8b0afd2..e96c058 100644
--- a/drivers/isdn/gigaset/ser-gigaset.c
+++ b/drivers/isdn/gigaset/ser-gigaset.c
@@ -11,13 +11,10 @@
  */
 
 #include "gigaset.h"
-
 #include <linux/module.h>
 #include <linux/moduleparam.h>
 #include <linux/platform_device.h>
-#include <linux/tty.h>
 #include <linux/completion.h>
-#include <linux/slab.h>
 
 /* Version Information */
 #define DRIVER_AUTHOR "Tilman Schmidt"
diff --git a/drivers/isdn/gigaset/usb-gigaset.c b/drivers/isdn/gigaset/usb-gigaset.c
index 9430a2b..76dbb20 100644
--- a/drivers/isdn/gigaset/usb-gigaset.c
+++ b/drivers/isdn/gigaset/usb-gigaset.c
@@ -16,10 +16,6 @@
  */
 
 #include "gigaset.h"
-
-#include <linux/errno.h>
-#include <linux/init.h>
-#include <linux/slab.h>
 #include <linux/usb.h>
 #include <linux/module.h>
 #include <linux/moduleparam.h>
-- 
1.6.5.3.298.g39add


* Re: [PATCH v5] rfs: Receive Flow Steering
From: Eric Dumazet @ 2010-04-16 21:25 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David Miller, netdev
In-Reply-To: <1271452358.16881.4486.camel@edumazet-laptop>

On Friday, 16 April 2010 at 23:12 +0200, Eric Dumazet wrote:
> On Friday, 16 April 2010 at 13:42 -0700, Tom Herbert wrote:
> > On Fri, Apr 16, 2010 at 11:53 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > On Friday, 16 April 2010 at 11:35 -0700, Tom Herbert wrote:
> > >> Results with "tbench 16" on an 8 core Intel machine.
> > >>
> > >> No RPS/RFS:  2155 MB/sec
> > >> RPS (0ff mask): 1700 MB/sec
> > >> RFS: 1097
> > >>
> > 
> > Blah, I mistakenly reported that... should have been:
> > 
> > No RPS/RFS:  2155 MB/sec
> > RPS (0ff mask): 1097 MB/sec
> > RFS: 1700 MB/sec
> > 
> > Sorry about that!
> 
> > This was my expectation too, and what my "corrected" numbers show :-)
> > But, I take it this is different in your results?
> 
> 
> My results are on a "tbench 16" on a dual X5570  @ 2.93GHz.
> (16 logical cpus)
> 
> No RPS , no RFS : 4448.14 MB/sec 
> RPS : 2298.00 MB/sec (but a lot of variation)
> RFS : 2600 MB/sec
> 
> Maybe my RFS setup is bad ?
> (8192 flows)
> 

Very strange, a second tbench-16 RFS=y run gave me 2134.08 MB/sec 

A third run gave me 1813.21 MB/sec 
A fourth run gave me 2472.91 MB/sec 

Hmm...
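
To put a number on the run-to-run variation, the four RFS results
reported in this thread (2600, 2134.08, 1813.21 and 2472.91 MB/sec) can
be summarized with a short script. This is only an illustrative sketch
and not part of the original mail: the helper name `summarize` is made
up here, and four samples are too few to draw firm conclusions from.

```python
from statistics import mean, stdev

# RFS throughput in MB/sec for the four "tbench 16" runs quoted above.
rfs_runs = [2600.0, 2134.08, 1813.21, 2472.91]


def summarize(samples):
    """Return (mean, sample standard deviation, max - min) for the runs."""
    return mean(samples), stdev(samples), max(samples) - min(samples)


avg, sd, spread = summarize(rfs_runs)
print(f"RFS mean   : {avg:.2f} MB/sec")
print(f"RFS stddev : {sd:.2f} MB/sec ({100 * sd / avg:.1f}% of mean)")
print(f"RFS spread : {spread:.2f} MB/sec")
```

A spread this large relative to the mean suggests that a single run is
not enough to compare the RPS and RFS figures reliably.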






* Re: [PATCH v5] rfs: Receive Flow Steering
From: Eric Dumazet @ 2010-04-16 21:12 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David Miller, netdev
In-Reply-To: <u2t65634d661004161342zeadb5602w73c369ec717dc6e1@mail.gmail.com>

On Friday 16 April 2010 at 13:42 -0700, Tom Herbert wrote:
> On Fri, Apr 16, 2010 at 11:53 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Friday 16 April 2010 at 11:35 -0700, Tom Herbert wrote:
> >> Results with "tbench 16" on an 8 core Intel machine.
> >>
> >> No RPS/RFS:  2155 MB/sec
> >> RPS (0ff mask): 1700 MB/sec
> >> RFS: 1097
> >>
> 
> Blah, I mistakenly reported that... should have been:
> 
> No RPS/RFS:  2155 MB/sec
> RPS (0ff mask): 1097 MB/sec
> RFS: 1700 MB/sec
> 
> Sorry about that!

> This was my expectation too, and what my "corrected" numbers show :-)
> But, I take it this is different in your results?


My results are on a "tbench 16" on a dual X5570  @ 2.93GHz.
(16 logical cpus)

No RPS , no RFS : 4448.14 MB/sec 
RPS : 2298.00 MB/sec (but a lot of variation)
RFS : 2600 MB/sec

Maybe my RFS setup is bad ?
(8192 flows)




* Re: [PATCH v5] rfs: Receive Flow Steering
From: Tom Herbert @ 2010-04-16 20:42 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1271443994.16881.4249.camel@edumazet-laptop>

On Fri, Apr 16, 2010 at 11:53 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Friday 16 April 2010 at 11:35 -0700, Tom Herbert wrote:
>> Results with "tbench 16" on an 8 core Intel machine.
>>
>> No RPS/RFS:  2155 MB/sec
>> RPS (0ff mask): 1700 MB/sec
>> RFS: 1097
>>

Blah, I mistakenly reported that... should have been:

No RPS/RFS:  2155 MB/sec
RPS (0ff mask): 1097 MB/sec
RFS: 1700 MB/sec

Sorry about that!

>> I am not particularly surprised by the results, using loopback
>> interface already provides good parallelism and RPS/RFS really would
>> only add overhead and more trips between CPUs (last part is why RPS <
>> RFS I suspect)-- I guess this is why we've never enabled RPS on
>> loopback :-)
>>
>> Eric, do you have a particular concern that this could affect a real workload?
>>
>
> I was expecting RFS to be better than RPS at least, for this particular
> workload (tcp over loopback)
>
This was my expectation too, and what my "corrected" numbers show :-)
But, I take it this is different in your results?

Tom


* Re: [PATCH] rdma/cm: Randomize local port allocation.
From: David Miller @ 2010-04-16 20:30 UTC (permalink / raw)
  To: penguin-kernel-JPay3/Yim36HaxMnTkn67Xf5DAMn2ifp
  Cc: amwang-H+wXaHxf7aLQT0dZR+AlfA, sean.hefty-ral2JQCrhuEAvxtiuMwx3w,
	opurdila-+zzKsuq53OdBDgjK7y7TUQ,
	eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w,
	netdev-u79uwXL29TY76Z2rM5mHXA, nhorman-2XuSBdqkA4R54TAoqtyWWQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	rolandd-FYB4Gu1CFyUAvxtiuMwx3w, linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <201004162254.FJF73478.SHOOMOFtQFVJLF-JPay3/Yim36HaxMnTkn67Xf5DAMn2ifp@public.gmane.org>

From: Tetsuo Handa <penguin-kernel-JPay3/Yim36HaxMnTkn67Xf5DAMn2ifp@public.gmane.org>
Date: Fri, 16 Apr 2010 22:54:22 +0900

> Cong Wang wrote:
>> Sean Hefty wrote:
>> > I like this version, thanks!  I'm not sure which tree to merge it through.
>> > Are you needing this for 2.6.34, or is 2.6.35 okay?
>> > 
>> 
>> As soon as possible, so 2.6.34. :)
>> 
> Cong, the merge window for 2.6.34 has already closed.
> You need to base your patchset on 2.6.35 (using the net-next-2.6 tree)
> rather than 2.6.34 (using the linux-2.6 tree). Therefore, this patch being
> queued for 2.6.35 (through the net-next-2.6 tree) should be okay for you.

I don't take RDMA patches into net-next-2.6. The less I touch this
stack-avoiding stuff the better, and Roland has been taking it
into his own tree for some time now.


* Re: [PATCH] mac8390: fix pr_info() calls and change return code
From: David Miller @ 2010-04-16 20:28 UTC (permalink / raw)
  To: fthain; +Cc: joe, p_gortmaker, netdev, linux-kernel, linux-m68k
In-Reply-To: <alpine.OSX.2.00.1004162340370.271@localhost>

From: Finn Thain <fthain@telegraphics.com.au>
Date: Fri, 16 Apr 2010 23:57:34 +1000 (EST)

> 
> On Thu, 15 Apr 2010, Joe Perches wrote:
> 
>> ...Why is it better to use -EBUSY?
> 
> Nubus slots are geographically addressed and their irqs are equally 
> inflexible. -EAGAIN is misleading because retrying will not help fix 
> whatever bug caused the irq to be unavailable.

This is exactly the kind of background information and verbose
explanation that belongs in the commit message.

Yet in your recent version of the patch, you're still being extremely
terse about the reasoning for using -EBUSY.

Just saying it's "misleading" doesn't tell anyone anything if they
have to go back in the commit history and try to figure out why this
change was made if it's causing problems later.

Please put the verbose and complete explanation in your commit
message, and resubmit your patch.

I just want to point out that with all the trouble you gave about
Joe's work, you're having one heck of a time even submitting your
changes properly. :-)

Thanks.


