Netdev List
* Re: [PATCH v5] rfs: Receive Flow Steering
From: David Miller @ 2010-04-17  0:58 UTC (permalink / raw)
  To: therbert; +Cc: eric.dumazet, netdev
In-Reply-To: <i2s65634d661004161722hece6f9d4naf528c37b63fffbc@mail.gmail.com>

From: Tom Herbert <therbert@google.com>
Date: Fri, 16 Apr 2010 17:22:49 -0700

> Ugh, vmalloc.h must be sneaking in through some other header file for
> me :-(  Sorry about that.  Do you need me to respin the patch?

No, I took care of it and am about to push things out to net-next-2.6
on kernel.org.

^ permalink raw reply

* Re: [PATCH] mac8390: fix pr_info() calls and change return code
From: Finn Thain @ 2010-04-17  2:28 UTC (permalink / raw)
  To: David Miller; +Cc: Joe Perches, p_gortmaker, netdev, linux-kernel, linux-m68k
In-Reply-To: <20100416.132855.230883346.davem@davemloft.net>


On Fri, 16 Apr 2010, David Miller wrote:

> 
> I just want to point out that with all the trouble you gave about Joe's 
> work, you're having one heck of a time even submitting your changes 
> properly. :-)

You are a thorough reviewer, despite the regrettable implication to the 
contrary in my first message in this thread. And you are quite right, I 
did not understand the pr_* macros at the time. I apologise.

I am hypersensitive to bit rot. I guess that's what happens when one makes 
it one's job to fix regressions in Linux.

Finn

^ permalink raw reply

* [PATCH] mac8390: change an error return code and some cleanup, take 3
From: Finn Thain @ 2010-04-17  3:16 UTC (permalink / raw)
  To: David Miller; +Cc: joe, p_gortmaker, netdev, linux-kernel, linux-m68k
In-Reply-To: <alpine.OSX.2.00.1004162347390.271@localhost>


Change an error return code from -EAGAIN to -EBUSY since the former is 
misleading.

Nubus slots are geographically addressed and their irqs are equally 
inflexible. -EAGAIN is misleading because retrying will not help fix 
whatever bug it was that made the irq unavailable.

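Also promote the corresponding log message from KERN_INFO to KERN_ERR, as
well as the "Don't know how to access card memory" message. Demote the
reset debugging messages from KERN_INFO to KERN_DEBUG.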

Signed-off-by: Finn Thain <fthain@telegraphics.com.au>

--- a/drivers/net/mac8390.c	2010-04-16 13:31:04.000000000 +1000
+++ b/drivers/net/mac8390.c	2010-04-16 23:50:39.000000000 +1000
@@ -554,7 +554,7 @@
 	case MAC8390_APPLE:
 		switch (mac8390_testio(dev->mem_start)) {
 		case ACCESS_UNKNOWN:
-			pr_info("Don't know how to access card memory!\n");
+			pr_err("Don't know how to access card memory!\n");
 			return -ENODEV;
 			break;
 
@@ -643,8 +643,8 @@
 {
 	__ei_open(dev);
 	if (request_irq(dev->irq, __ei_interrupt, 0, "8390 Ethernet", dev)) {
-		pr_info("%s: unable to get IRQ %d.\n", dev->name, dev->irq);
-		return -EAGAIN;
+		pr_err("%s: unable to get IRQ %d.\n", dev->name, dev->irq);
+		return -EBUSY;
 	}
 	return 0;
 }
@@ -660,7 +660,7 @@
 {
 	ei_status.txing = 0;
 	if (ei_debug > 1)
-		pr_info("reset not supported\n");
+		printk(KERN_DEBUG pr_fmt("reset not supported\n"));
 	return;
 }
 
@@ -668,7 +668,7 @@
 {
 	unsigned char *target = nubus_slot_addr(IRQ2SLOT(dev->irq));
 	if (ei_debug > 1)
-		pr_info("Need to reset the NS8390 t=%lu...", jiffies);
+		printk(KERN_DEBUG pr_fmt("Need to reset the NS8390 t=%lu..."), jiffies);
 	ei_status.txing = 0;
 	target[0xC0000] = 0;
 	if (ei_debug > 1)
 

^ permalink raw reply

* Re: Duplicate IP false alerts from arping
From: unni krishnan @ 2010-04-17  4:11 UTC (permalink / raw)
  To: netdev; +Cc: linux-net
In-Reply-To: <m2v1f8bbe3c1004152351o580b37ebx55e428fb106b09bd@mail.gmail.com>

Hi,

I am trying to find a duplicate IP in the network using arping.

 -------------------------
 [root@vps1 ~]# ping -c 3 192.168.1.212
 PING 192.168.1.212 (192.168.1.212) 56(84) bytes of data.
 64 bytes from 192.168.1.212: icmp_seq=1 ttl=64 time=1.33 ms
 64 bytes from 192.168.1.212: icmp_seq=2 ttl=64 time=0.280 ms
 64 bytes from 192.168.1.212: icmp_seq=3 ttl=64 time=0.306 ms

 --- 192.168.1.212 ping statistics ---
 3 packets transmitted, 3 received, 0% packet loss, time 1999ms
 rtt min/avg/max/mdev = 0.280/0.641/1.339/0.494 ms
 [root@vps1 ~]# arping -D -I eth0 -c 5 192.168.1.212 ; echo $?
 ARPING 192.168.1.212 from 0.0.0.0 eth0
 0
 -------------------------


 As per arping that IP is a duplicate. But if I go ahead and ifdown the
 IP in the known location I can't ping that IP (does that mean the IP is
 not duplicated?). This is the result after shutting down the IP.

 --------------------------
 [root@vps1 ~]# ping -c 3 192.168.1.212
 PING 192.168.1.212 (192.168.1.212) 56(84) bytes of data.
 From 192.168.1.63 icmp_seq=1 Destination Host Unreachable
 From 192.168.1.63 icmp_seq=2 Destination Host Unreachable
 From 192.168.1.63 icmp_seq=3 Destination Host Unreachable

 --- 192.168.1.212 ping statistics ---
 3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2001ms, pipe 3
 [root@vps1 ~]# arping -D -I eth0 -c 5 192.168.1.212 ; echo $?
 ARPING 192.168.1.212 from 0.0.0.0 eth0
 Sent 5 probes (5 broadcast(s))
 Received 0 response(s)
 0
 [root@vps1 ~]#
 --------------------------

 My question is: in this case IP 192.168.1.212 is not duplicated, but
 arping still reports it as a duplicate. Why is that?

 --
 Regards,
 Unni

^ permalink raw reply

* Re: rps perfomance WAS(Re: rps: question
From: Eric Dumazet @ 2010-04-17  7:35 UTC (permalink / raw)
  To: hadi; +Cc: Changli Gao, Rick Jones, David Miller, therbert, netdev, robert,
	andi
In-Reply-To: <1271424065.4606.31.camel@bigi>

Le vendredi 16 avril 2010 à 09:21 -0400, jamal a écrit :
> On Fri, 2010-04-16 at 07:18 +0200, Eric Dumazet wrote:
> 
> > 
> > A kernel module might do this, this could be integrated in perf bench so
> > that we can regression-test upcoming kernels.
> 
> Perf would be good - but even a softnet_stat cleaner than the nasty
> hack i use (attached) would be a good start; the ping with and without
> rps gives me a ballpark number.
> 
> IPI is important to me because i tried it before and failed
> miserably. I was thinking the improvement may be due to the hardware used
> but i am having a hard time getting people to tell me what hardware they
> used! I am old school - I need data;-> The RFS patch commit seems to
> have more info but is still vague, for example: 
> "The benefits of RFS are dependent on cache hierarchy, application
> load, and other factors"
> Also, what does a "simple" or "complex" benchmark mean?;->
> I think it is only fair to get this info, no?
> 
> Please don't consider what i say above as being anti-RPS.
> 5 microsec extra latency is not bad if it can be amortized.
> Unfortunately, the best traffic i could generate was < 20Kpps of
> ping which still manages to get 1 IPI/packet on Nehalem. I am going
> to write up some app (lots of cycles available tomorrow). I still think
> it is valuable.

I did some tests on a dual quad core machine (E5450  @ 3.00GHz), not
Nehalem. So a 3-4 year old design.

For all tests, I use the best time of 3 runs of "ping -f -q -c 100000
192.168.0.2". Yes, ping is not very good, but it's available ;)

Note: I make sure all 8 cpus of the target are busy, eating cpu cycles in
user land. I don't want to tweak acpi or whatever smart power saving
mechanisms.

With RPS off:
100000 packets transmitted, 100000 received, 0% packet loss, time 4160ms

RPS on, but directed on the cpu0 handling device interrupts (tg3, napi)
(echo 01 > /sys/class/net/eth3/queues/rx-0/rps_cpus)
100000 packets transmitted, 100000 received, 0% packet loss, time 4234ms

So the cost of queuing the packet into our own queue (netif_receive_skb
-> enqueue_to_backlog) is about 0.74 us  (74 ms / 100000)

I personally think we should process the packet instead of queuing it, but
Tom disagrees with me.

RPS on, directed on cpu1 (other socket)
(echo 02 > /sys/class/net/eth3/queues/rx-0/rps_cpus)
100000 packets transmitted, 100000 received, 0% packet loss, time 4542ms

So the extra cost to enqueue to a remote cpu queue, IPI, softirq handling...
is 3 us. Note this cost is for the case where we receive a single packet.

I suspect the IPI itself is in the 1.5 us range, not very far from the
queuing-to-ourselves case.
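
[Editorial sketch, not part of the original mail.] The per-packet figures
above come from dividing the difference between two ping run times by the
number of packets. A minimal illustration of that arithmetic follows; the
function name is hypothetical, and it assumes both runs sent the same
number of packets and that the comparison baseline is chosen as in the
mail (RPS off for the local-queue cost, the cpu0 run for the remote-cpu
cost).

```c
/*
 * Extra cost per packet, in microseconds, between two ping runs.
 * baseline_ms and measured_ms are the total run times reported by ping;
 * packets is the number of packets sent in each run.
 * Hypothetical helper, for illustration only.
 */
static double per_packet_overhead_us(long baseline_ms, long measured_ms,
				     long packets)
{
	/* 1 ms = 1000 us, so scale the difference before dividing. */
	return (double)(measured_ms - baseline_ms) * 1000.0 / (double)packets;
}
```

With the numbers quoted above, per_packet_overhead_us(4160, 4234, 100000)
gives about 0.74 us, and per_packet_overhead_us(4234, 4542, 100000) gives
about 3.08 us, which the mail rounds to 3 us.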

For me RPS use cases are :

1) Value-added apps handling a lot of TCP data, where the costs of cache
misses in the tcp stack easily justify spending 3 us to gain much more.

2) Network appliances, where a single cpu is filled 100% to handle one
device's hardware and software/RPS interrupts, delegating all higher-level
work to a pool of cpus.

I'll try to do these tests on a Nehalem target.




^ permalink raw reply

* [PATCH net-next-2.6] rps: rps_sock_flow_table is mostly read
From: Eric Dumazet @ 2010-04-17  7:52 UTC (permalink / raw)
  To: David Miller; +Cc: therbert, netdev
In-Reply-To: <1271404097.16881.3827.camel@edumazet-laptop>


Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/core/dev.c b/net/core/dev.c
index d7107ac..7abf959 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2205,7 +2205,7 @@ DEFINE_PER_CPU(struct netif_rx_stats, netdev_rx_stat) = { 0, };
 #ifdef CONFIG_RPS
 
 /* One global table that all flow-based protocols share. */
-struct rps_sock_flow_table *rps_sock_flow_table;
+struct rps_sock_flow_table *rps_sock_flow_table __read_mostly;
 EXPORT_SYMBOL(rps_sock_flow_table);
 
 /*



^ permalink raw reply related

* Re: [PATCH net-next-2.6] rps: rps_sock_flow_table is mostly read
From: David Miller @ 2010-04-17  7:57 UTC (permalink / raw)
  To: eric.dumazet; +Cc: therbert, netdev
In-Reply-To: <1271490733.16881.4588.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 17 Apr 2010 09:52:13 +0200

> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied, thanks Eric.

^ permalink raw reply

* Re: rps perfomance WAS(Re: rps: question
From: Tom Herbert @ 2010-04-17  8:43 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: hadi, Changli Gao, Rick Jones, David Miller, netdev, robert, andi
In-Reply-To: <1271489739.16881.4586.camel@edumazet-laptop>

> So the cost of queuing the packet into our own queue (netif_receive_skb
> -> enqueue_to_backlog) is about 0.74 us  (74 ms / 100000)
>
> I personally think we should process the packet instead of queuing it, but
> Tom disagrees with me.
>
You could do that, but then the packet processing becomes HOL blocking
on all the packets that are being sent to other queues for
processing -- remember the IPIs are only sent at the end of the NAPI.
So unless the upper stack processing is <0.74us in your case, I think
processing packets directly on the local queue would improve best-case
latency, but would increase average latency and even more likely
worst-case latency on loads with multiple flows.

> RPS on, directed on cpu1 (other socket)
> (echo 02 > /sys/class/net/eth3/queues/rx-0/rps_cpus)
> 100000 packets transmitted, 100000 received, 0% packet loss, time 4542ms
>
> So the extra cost to enqueue to a remote cpu queue, IPI, softirq handling...
> is 3 us. Note this cost is for the case where we receive a single packet.
>
> I suspect the IPI itself is in the 1.5 us range, not very far from the
> queuing-to-ourselves case.
>
> For me RPS use cases are :
>
> 1) Value-added apps handling a lot of TCP data, where the costs of cache
> misses in the tcp stack easily justify spending 3 us to gain much more.
>
> 2) Network appliances, where a single cpu is filled 100% to handle one
> device's hardware and software/RPS interrupts, delegating all higher-level
> work to a pool of cpus.
>
> I'll try to do these tests on a Nehalem target.
>
>
>
>

^ permalink raw reply

* Re: HTB - What's the minimal value for 'rate' parameter?
From: Benny Amorsen @ 2010-04-17  9:19 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Antonio Almeida, netdev, kaber, davem, devik
In-Reply-To: <4BC63766.5080104@gmail.com>

Jarek Poplawski <jarkao2@gmail.com> writes:

> As I wrote before, the minimal (overflow safe) rate depends on max
> packet size, and for 1500 byte it would be something around:
> 1500b/2min, so if your clients can wait so long, try this:

Wouldn't it be nice of either tc or the kernel to warn about wrong
configurations, or possibly reject them completely?


/Benny


^ permalink raw reply

* Re: rps perfomance WAS(Re: rps: question
From: Eric Dumazet @ 2010-04-17  9:23 UTC (permalink / raw)
  To: Tom Herbert
  Cc: hadi, Changli Gao, Rick Jones, David Miller, netdev, robert, andi
In-Reply-To: <z2q65634d661004170143nb35ec784mbedd003565410cfb@mail.gmail.com>

Le samedi 17 avril 2010 à 01:43 -0700, Tom Herbert a écrit :
> > So the cost of queuing the packet into our own queue (netif_receive_skb
> > -> enqueue_to_backlog) is about 0.74 us  (74 ms / 100000)
> >
> > I personally think we should process the packet instead of queuing it, but
> > Tom disagrees with me.
> >
> You could do that, but then the packet processing becomes HOL blocking
> on all the packets that are being sent to other queues for
> processing -- remember the IPIs are only sent at the end of the NAPI.
> So unless the upper stack processing is <0.74us in your case, I think
> processing packets directly on the local queue would improve best-case
> latency, but would increase average latency and even more likely
> worst-case latency on loads with multiple flows.

Anyway, a big part of this 0.74 us overhead comes from get_rps_cpu()
itself, computing skb->rxhash and all. We should review how many cache
lines we exchange per skb, and try to reduce this number.




^ permalink raw reply

* [PATCH net-next-2.6] net: remove time limit in process_backlog()
From: Eric Dumazet @ 2010-04-17 14:17 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <z2q65634d661004170143nb35ec784mbedd003565410cfb@mail.gmail.com>

- There is no point in enforcing a time limit in process_backlog(), since
other napi instances don't follow the same rule. With the limit, we can exit
after only one packet has been processed...
The normal quota of 64 packets per napi instance should be the norm, and
net_rx_action() already has its own time limit.
Note : /proc/sys/net/core/dev_weight can be used to tune this 64 default
value.

- Use DEFINE_PER_CPU_ALIGNED for softnet_data definition.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/core/dev.c b/net/core/dev.c
index 7abf959..8092f01 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -264,7 +264,7 @@ static RAW_NOTIFIER_HEAD(netdev_chain);
  *	queue in the local softnet handler.
  */
 
-DEFINE_PER_CPU(struct softnet_data, softnet_data);
+DEFINE_PER_CPU_ALIGNED(struct softnet_data, softnet_data);
 EXPORT_PER_CPU_SYMBOL(softnet_data);
 
 #ifdef CONFIG_LOCKDEP
@@ -3232,7 +3232,6 @@ static int process_backlog(struct napi_struct *napi, int quota)
 {
 	int work = 0;
 	struct softnet_data *queue = &__get_cpu_var(softnet_data);
-	unsigned long start_time = jiffies;
 
 	napi->weight = weight_p;
 	do {
@@ -3252,7 +3251,7 @@ static int process_backlog(struct napi_struct *napi, int quota)
 		local_irq_enable();
 
 		__netif_receive_skb(skb);
-	} while (++work < quota && jiffies == start_time);
+	} while (++work < quota);
 
 	return work;
 }
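
[Editorial sketch, not part of the original patch.] After this change the
loop in process_backlog() is bounded only by the quota and by the queue
running empty. The following user-space model shows that control flow
under simplifying assumptions: struct backlog and process_backlog_sketch()
are hypothetical names, and a counter stands in for the per-cpu skb queue
and for the call to __netif_receive_skb().

```c
/* Hypothetical stand-in for the per-cpu input queue: just a packet count. */
struct backlog {
	int pending;
};

/*
 * Process at most 'quota' packets from the backlog and return how many
 * were handled. There is no jiffies check: the only exits are an empty
 * queue or an exhausted quota.
 */
static int process_backlog_sketch(struct backlog *q, int quota)
{
	int work = 0;

	do {
		if (q->pending == 0)
			break;		/* queue drained, stop early */
		q->pending--;		/* models dequeue + packet processing */
	} while (++work < quota);

	return work;
}
```

With 100 packets pending and a quota of 64, one call handles 64 packets
and leaves 36 for the next poll; with 10 pending it handles 10 and returns.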



^ permalink raw reply related

* Re: rps perfomance WAS(Re: rps: question
From: Eric Dumazet @ 2010-04-17 14:27 UTC (permalink / raw)
  To: Tom Herbert
  Cc: hadi, Changli Gao, Rick Jones, David Miller, netdev, robert, andi
In-Reply-To: <1271496229.16881.4602.camel@edumazet-laptop>

Le samedi 17 avril 2010 à 11:23 +0200, Eric Dumazet a écrit :
> Le samedi 17 avril 2010 à 01:43 -0700, Tom Herbert a écrit :
> > > So the cost of queuing the packet into our own queue (netif_receive_skb
> > > -> enqueue_to_backlog) is about 0.74 us  (74 ms / 100000)
> > >
> > > I personally think we should process the packet instead of queuing it, but
> > > Tom disagrees with me.
> > >
> > You could do that, but then the packet processing becomes HOL blocking
> > on all the packets that are being sent to other queues for
> > processing -- remember the IPIs are only sent at the end of the NAPI.
> > So unless the upper stack processing is <0.74us in your case, I think
> > processing packets directly on the local queue would improve best-case
> > latency, but would increase average latency and even more likely
> > worst-case latency on loads with multiple flows.


Tom, I am not sure what you describe is even respected for NAPI devices.
(I hope you use napi devices in your company ;) )

If we enqueue a skb to backlog, we also link our backlog napi into our
poll_list, if not already there.

So the loop in net_rx_action() will make us handle our backlog napi a
bit after this network device napi (if the time limit of 2 jiffies has not
elapsed) and *before* sending IPIs to remote cpus anyway.





^ permalink raw reply

* Re: [PATCH v5] rfs: Receive Flow Steering
From: Eric Dumazet @ 2010-04-17 16:10 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David Miller, netdev
In-Reply-To: <1271452358.16881.4486.camel@edumazet-laptop>

Le vendredi 16 avril 2010 à 23:12 +0200, Eric Dumazet a écrit :
> Le vendredi 16 avril 2010 à 13:42 -0700, Tom Herbert a écrit :
> > On Fri, Apr 16, 2010 at 11:53 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > Le vendredi 16 avril 2010 à 11:35 -0700, Tom Herbert a écrit :
> > >> Results with "tbench 16" on an 8 core Intel machine.
> > >>
> > >> No RPS/RFS:  2155 MB/sec
> > >> RPS (0ff mask): 1700 MB/sec
> > >> RFS: 1097
> > >>
> > 
> > Blah, I mistakenly reported that... should have been:
> > 
> > No RPS/RFS:  2155 MB/sec
> > RPS (0ff mask): 1097 MB/sec
> > RFS: 1700 MB/sec
> > 
> > Sorry about that!
> 
> > This was my expectation too, and what my "corrected" numbers show :-)
> > But, I take it this is different in your results?
> 
> 
> My results are on a "tbench 16" on a dual X5570  @ 2.93GHz.
> (16 logical cpus)
> 
> No RPS , no RFS : 4448.14 MB/sec 
> RPS : 2298.00 MB/sec (but a lot of variation)
> RFS : 2600 MB/sec
> 
> Maybe my RFS setup is bad ?
> (8192 flows)
> 

With attached patch, I reached 

Throughput 4465.13 MB/sec 16 procs

RFS better than no RPS/RFS :)

So, the old idea to make rxhash consistent (same value in both
directions) is a win for some workloads (Consider connection tracking /
firewalling) 

port1 = ...
port2 = ...
addr1 = ...
addr2 = ...
if (addr1 > addr2)
	exchange(addr1, addr2)
if (port1 > port2)
	exchange(port1, port2)

hash = jhash(addr1, addr2, (port1<<16)+port2, ...)
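
[Editorial sketch, not part of the original mail.] The pseudo-code above
orders the two addresses and the two ports before hashing, so that both
directions of a flow produce the same value. A self-contained user-space
illustration follows. It assumes IPv4 addresses and 16-bit ports, and
mix32() is a hypothetical stand-in mixer, not the kernel's jhash; the
function names are invented for this example.

```c
#include <stdint.h>

/* Hypothetical stand-in for jhash: a simple multiplicative mixer. */
static uint32_t mix32(uint32_t h, uint32_t v)
{
	h ^= v;
	h *= 0x9e3779b1u;
	h ^= h >> 15;
	return h;
}

/*
 * Direction-independent flow hash: swapping (addr1, port1) with
 * (addr2, port2) yields the same result, because both pairs are put
 * into a canonical order before being mixed.
 */
static uint32_t symmetric_flow_hash(uint32_t addr1, uint32_t addr2,
				    uint16_t port1, uint16_t port2)
{
	uint32_t h = 0;

	if (addr1 > addr2) {
		uint32_t tmp = addr1;
		addr1 = addr2;
		addr2 = tmp;
	}
	if (port1 > port2) {
		uint16_t tmp = port1;
		port1 = port2;
		port2 = tmp;
	}

	h = mix32(h, addr1);
	h = mix32(h, addr2);
	h = mix32(h, ((uint32_t)port1 << 16) + port2);
	return h;
}
```

Because addresses and ports are ordered independently, a flow and the
flow with only its ports swapped also hash to the same value. That is a
collision cost of the symmetry, not a property of the mixer.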

diff --git a/net/core/dev.c b/net/core/dev.c
index 7abf959..6b757ff 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2280,8 +2280,10 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
        case IPPROTO_AH:
        case IPPROTO_SCTP:
        case IPPROTO_UDPLITE:
-               if (pskb_may_pull(skb, (ihl * 4) + 4))
-                       ports = *((u32 *) (skb->data + (ihl * 4)));
+               if (pskb_may_pull(skb, (ihl * 4) + 4)) {
+                       u16 *_ports = (u16 *)(skb->data + (ihl * 4));
+                       ports = _ports[0] ^ _ports[1];
+               }
                break;
 
        default:



^ permalink raw reply related

* Re: rps perfomance WAS(Re: rps: question
From: Tom Herbert @ 2010-04-17 17:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: hadi, Changli Gao, Rick Jones, David Miller, netdev, robert, andi
In-Reply-To: <1271514445.16881.4746.camel@edumazet-laptop>

> Tom, I am not sure what you describe is even respected for NAPI devices.
> (I hope you use napi devices in your company ;) )
>
> If we enqueue a skb to backlog, we also link our backlog napi into our
> poll_list, if not already there.
>
> So the loop in net_rx_action() will make us handle our backlog napi a
> bit after this network device napi (if the time limit of 2 jiffies has not
> elapsed) and *before* sending IPIs to remote cpus anyway.
>
Then I think that's a bug you've identified ;-)

>
>
>
>

^ permalink raw reply

* [PATCH] TCP: avoid to send keepalive probes if it is receiving data
From: Flavio Leitner @ 2010-04-17 17:28 UTC (permalink / raw)
  To: netdev; +Cc: Flavio Leitner
In-Reply-To: <20100416150644.GA2641@sysclose.org>

RFC 1122 says the following:
...
  Keep-alive packets MUST only be sent when no data or
  acknowledgement packets have been received for the
  connection within an interval.
...

Fix this by storing the timestamp of last received data
packet and checking for it when the keepalive timer expires.

Signed-off-by: Flavio Leitner <fleitner@redhat.com>
---
 include/linux/tcp.h  |    1 +
 net/ipv4/tcp_input.c |    3 +++
 net/ipv4/tcp_timer.c |    8 ++++++++
 3 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index a778ee0..405678f 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -314,6 +314,7 @@ struct tcp_sock {
  	u32	snd_sml;	/* Last byte of the most recently transmitted small packet */
 	u32	rcv_tstamp;	/* timestamp of last received ACK (for keepalives) */
 	u32	lsndtime;	/* timestamp of last sent data packet (for restart window) */
+	u32	lrcvtime;	/* timestamp of last received data packet (for keepalives) */
 
 	/* Data for direct copy to user */
 	struct {
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index f240f57..60d2980 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5391,6 +5391,8 @@ no_ack:
 				__kfree_skb(skb);
 			else
 				sk->sk_data_ready(sk, 0);
+
+			tp->lrcvtime = tcp_time_stamp;
 			return 0;
 		}
 	}
@@ -5421,6 +5423,7 @@ step5:
 
 	tcp_data_snd_check(sk);
 	tcp_ack_snd_check(sk);
+	tp->lrcvtime = tcp_time_stamp;
 	return 0;
 
 csum_error:
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 8a0ab29..74dd804 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -554,6 +554,14 @@ static void tcp_keepalive_timer (unsigned long data)
 	if (tp->packets_out || tcp_send_head(sk))
 		goto resched;
 
+	elapsed = tcp_time_stamp - tp->lrcvtime;
+	
+	/* receiving data means alive */
+	if (elapsed < keepalive_time_when(tp)) {
+		elapsed = keepalive_time_when(tp) - elapsed;
+		goto resched;
+	}
+
 	elapsed = tcp_time_stamp - tp->rcv_tstamp;
 
 	if (elapsed >= keepalive_time_when(tp)) {
-- 
1.6.6.1
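
[Editorial sketch, not part of the original patch.] The check added to
tcp_keepalive_timer() can be read as: if data arrived less than one
keepalive interval ago, do not probe, and re-arm the timer for the
remaining time. A simplified user-space model of that decision follows;
keepalive_delay() is a hypothetical name, times are 32-bit tick counters
as with tcp_time_stamp, and the subsequent rcv_tstamp check from the real
function is left out.

```c
#include <stdint.h>

/*
 * Returns 0 if a keepalive probe may be due now, otherwise the number of
 * ticks to wait before the timer should fire again.
 */
static uint32_t keepalive_delay(uint32_t now, uint32_t last_rcv,
				uint32_t keepalive_time)
{
	/* Unsigned subtraction stays correct when the counter wraps. */
	uint32_t elapsed = now - last_rcv;

	if (elapsed < keepalive_time)
		return keepalive_time - elapsed;	/* data seen recently */
	return 0;
}
```

For example, with a keepalive time of 7200 ticks and data last received
600 ticks ago, the timer is re-armed for another 6600 ticks.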


^ permalink raw reply related

* Re: rps perfomance WAS(Re: rps: question
From: jamal @ 2010-04-17 17:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Changli Gao, Rick Jones, David Miller, therbert, netdev, robert,
	andi
In-Reply-To: <1271489739.16881.4586.camel@edumazet-laptop>

On Sat, 2010-04-17 at 09:35 +0200, Eric Dumazet wrote:

> I did some tests on a dual quad core machine (E5450  @ 3.00GHz), not
> nehalem. So a 3-4 years old design.

Eric, I thank you kind sir for going out of your way to do this - it is
certainly a good processor to compare against.

> For all test, I use the best time of 3 runs of "ping -f -q -c 100000
> 192.168.0.2". Yes ping is not very good, but its available ;)

It is a reasonable quick test, no fancy setup required ;->

> Note: I make sure all 8 cpus of target are busy, eating cpu cycles in
> user land. 

I didn't keep the cpus busy. I should re-run with such a setup; any
specific app that you used to keep them busy? Keeping them busy could
have consequences; I am speculating you probably ended up with a ratio
greater than one packet per IPI, i.e. an amortization benefit.
  
> I dont want to tweak acpi or whatever smart power saving
> mechanisms.

I should mention i turned off acpi as well in the bios; it was consuming
more cpu cycles than net-processing and was interfering in my tests.

> When RPS off
> 100000 packets transmitted, 100000 received, 0% packet loss, time 4160ms
> 
> RPS on, but directed on the cpu0 handling device interrupts (tg3, napi)
> (echo 01 > /sys/class/net/eth3/queues/rx-0/rps_cpus)
> 100000 packets transmitted, 100000 received, 0% packet loss, time 4234ms
> 
> So the cost of queuing the packet into our own queue (netif_receive_skb
> -> enqueue_to_backlog) is about 0.74 us  (74 ms / 100000)
> 

Excellent analysis.

> I personally think we should process the packet instead of queuing it, but
> Tom disagrees with me.

Sorry - I am gonna have to turn on some pedagogy and offer my
Canadian 2 cents;->
I would lean towards agreeing with Tom, but maybe go one step further (sans
packet-reordering): we should never process packets to the socket layer on
the demuxing cpu.
Enqueue everything you receive on a different cpu - so somehow the receiving
cpu becomes part of a hashing decision ...

The reason is derived from queueing theory - of which i know dangerously
little - but refer you to mr. little his-self[1] (pun fully
intended;->):
i.e. a fixed service time provides more predictable results as opposed to
a spike once in a while as you receive packets destined to "our cpu".
Queueing packets and later allocating cycles to processing them adds to
variability, but is not as bad as processing to completion to the socket
layer.

> RPS on, directed on cpu1 (other socket)
> (echo 02 > /sys/class/net/eth3/queues/rx-0/rps_cpus)
> 100000 packets transmitted, 100000 received, 0% packet loss, time 4542ms

Good test - should be the worst case scenario. But there are two other 
scenarios which will give different results in my opinion.
On your setup i think each socket has two dies, each with two cores. So
my feeling is you will get different numbers if you go within the same die
and across dies within the same socket. If i am not mistaken, the mapping
would be something like socket0/die0{core0/2}, socket0/die1{core4/6},
socket1/die0{core1/3}, socket1/die1{core5/7}.
If you have cycles can you try the same socket+die but different cores
and same socket but different die tests?

> So the extra cost to enqueue to a remote cpu queue, IPI, softirq handling...
> is 3 us. Note this cost is for the case where we receive a single packet.

Which is not too bad if amortized. Were you able to check if you
processed a packet/IPI? One way to achieve that is just standard ping.
On the nehalem my number for going to a different core was in the range
of a 5 microsecond effect on RTT when the system was not busy. I think it
would be higher going across QPI.

> I suspect the IPI itself is in the 1.5 us range, not very far from the
> queuing-to-ourselves case.

Sounds about right, maybe 2 us in my case. I am still mystified by "what
damage does an IPI make?" to the system harmony. I have to do some
reading. Andi mentioned the APIC connection - but my gut feeling is you
probably end up going to main memory and invalidating the cache.

> For me RPS use cases are :
> 
> 1) Value-added apps handling a lot of TCP data, where the costs of cache
> misses in the tcp stack easily justify spending 3 us to gain much more.
> 
> 2) Network appliances, where a single cpu is filled 100% to handle one
> device's hardware and software/RPS interrupts, delegating all higher-level
> work to a pool of cpus.
> 

Agreed on both. 
The caveats to note:
- what hardware would be reasonable
- within the same hardware, what setups would be good to use 
- when it doesn't benefit even with everything correct (eg low tcp
throughput)

> I'll try to do these tests on a Nehalem target.

Thanks again Eric.

cheers,
jamal 

[1]http://en.wikipedia.org/wiki/Little's_law


^ permalink raw reply

* Re: [PATCH v5] rfs: Receive Flow Steering
From: Tom Herbert @ 2010-04-17 17:38 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1271520633.16881.4754.camel@edumazet-laptop>

>
> With attached patch, I reached
>
> Throughput 4465.13 MB/sec 16 procs
>
> RFS better than no RPS/RFS :)
>
> So, the old idea to make rxhash consistent (same value in both
> directions) is a win for some workloads (Consider connection tracking /
> firewalling)
>
> port1 = ...
> port2 = ...
> addr1 = ...
> addr2 = ...
> if (addr1 > addr2)
>        exchange(addr1, addr2)
> if (port1 > port2)
>        exchange(port1, port2)
>
> hash = jhash(addr1, addr2, (port1<<16)+port2, ...)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 7abf959..6b757ff 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2280,8 +2280,10 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
>        case IPPROTO_AH:
>        case IPPROTO_SCTP:
>        case IPPROTO_UDPLITE:
> -               if (pskb_may_pull(skb, (ihl * 4) + 4))
> -                       ports = *((u32 *) (skb->data + (ihl * 4)));
> +               if (pskb_may_pull(skb, (ihl * 4) + 4)) {
> +                       u16 *_ports = (u16 *)(skb->data + (ihl * 4));
> +                       ports = _ports[0] ^ _ports[1];
> +               }
>                break;
>
>        default:
>
That's cool! But I still like the idea that this hash is treated as
an opaque value; getting the hash from the device to avoid the jhash
or cache misses on the packet can also be a win...  Maybe connection
tracking/firewall could use the skb->rxhash which provides the
consistency and also eliminates the need to do more jhashes.


>
>

^ permalink raw reply

* Re: Duplicate IP false alerts from arping
From: unni krishnan @ 2010-04-17 19:39 UTC (permalink / raw)
  To: Pascal Hambourg; +Cc: linux-net, netdev
In-Reply-To: <4BC9784B.3020103@plouf.fr.eu.org>

Ok, then what is the best method to find a duplicate IP (the same IP
address assigned to different machines)?

On Sat, Apr 17, 2010 at 2:28 PM, Pascal Hambourg
<pascal.mail@plouf.fr.eu.org> wrote:
> Hello,
>
> unni krishnan a écrit :
>>
>> I am trying to find a duplicate IP in the network using arping.
>>
>>  -------------------------
>>  [root@vps1 ~]# ping -c 3 192.168.1.212
>>  PING 192.168.1.212 (192.168.1.212) 56(84) bytes of data.
>>  64 bytes from 192.168.1.212: icmp_seq=1 ttl=64 time=1.33 ms
>>  64 bytes from 192.168.1.212: icmp_seq=2 ttl=64 time=0.280 ms
>>  64 bytes from 192.168.1.212: icmp_seq=3 ttl=64 time=0.306 ms
>>
>>  --- 192.168.1.212 ping statistics ---
>>  3 packets transmitted, 3 received, 0% packet loss, time 1999ms
>>  rtt min/avg/max/mdev = 0.280/0.641/1.339/0.494 ms
>>  [root@vps1 ~]# arping -D -I eth0 -c 5 192.168.1.212 ; echo $?
>>  ARPING 192.168.1.212 from 0.0.0.0 eth0
>>  0
>>  -------------------------
>>
>>  As per arping that IP is a duplicate.
>
> I disagree. According to man arping :
>
>    -D  Duplicate  address  detection  mode  (DAD).  See RFC2131, 4.4.1.
>        Returns 0, if DAD succeeded i.e. no replies are received
>                                         ^^^^^^^^^^^^^^^^^^^^^^^
> -D (DAD) is meant for DHCP to find out if the proposed IP address is not
> already assigned to another host. Its purpose is not to find out if
> multiple hosts have the same IP address. Besides, a return value of 0
> means that no ARP replies were received (IOW -D inverts the return value
> logic), which is weird since the target IP address replies to ICMP ping
> unless that address is assigned to the local host.
>
> Here :
>
> # arping -DI eth0 -c 1 192.168.0.246 ; echo result=$?
> ARPING 192.168.0.246 from 0.0.0.0 eth0
> Unicast reply from 192.168.0.246 [xx:xx:xx:xx:xx:xx]  0.964ms
> Sent 1 probes (1 broadcast(s))
> Received 1 response(s)
> result=1
>
> # arping -DI eth0 -c 1 192.168.0.24 ; echo result=$?
> ARPING 192.168.0.24 from 0.0.0.0 eth0
> Sent 1 probes (1 broadcast(s))
> Received 0 response(s)
> result=0
>
>> But if I go ahead and ifdown the
>>  IP in the known location I can't ping that IP (does that mean the IP is
>>  not duplicated?). This is the result after shutting down the IP.
>>
>>  --------------------------
>>  [root@vps1 ~]# ping -c 3 192.168.1.212
>>  PING 192.168.1.212 (192.168.1.212) 56(84) bytes of data.
>>  From 192.168.1.63 icmp_seq=1 Destination Host Unreachable
>>  From 192.168.1.63 icmp_seq=2 Destination Host Unreachable
>>  From 192.168.1.63 icmp_seq=3 Destination Host Unreachable
>
> Ok, that means no ARP reply.
>
>>  [root@vps1 ~]# arping -D -I eth0 -c 5 192.168.1.212 ; echo $?
>>  ARPING 192.168.1.212 from 0.0.0.0 eth0
>>  Sent 5 probes (5 broadcast(s))
>>  Received 0 response(s)
>>  0
>
> Same as above.
>
>>  My question is: in this case IP 192.168.1.212 is not duplicated, but
>>  arping still reports it as a duplicate. Why is that?
>
> A situation of real duplicate ARP replies may occur when the address is
> assigned to a host which has multiple interfaces connected to the same
> network, so it receives and replies to ARP queries on each interface.
>



-- 
Regards,
Unni
http://mutexes.org/
http://twitter.com/webofunni

^ permalink raw reply

* Re: HTB - What's the minimal value for 'rate' parameter?
From: Jarek Poplawski @ 2010-04-17 21:01 UTC (permalink / raw)
  To: Benny Amorsen; +Cc: Antonio Almeida, netdev, kaber, davem, devik
In-Reply-To: <m3bpdi8mhw.fsf@ursa.amorsen.dk>

Benny Amorsen wrote, On 04/17/2010 11:19 AM:

> Jarek Poplawski <jarkao2@gmail.com> writes:
> 
>> As I wrote before, the minimal (overflow safe) rate depends on max
>> packet size, and for 1500 byte it would be something around:
>> 1500b/2min, so if your clients can wait so long, try this:
> 
> Wouldn't it be nice of either tc or the kernel to warn about wrong
> configurations, or possibly reject them completely?

...or have it documented etc.

Acked-by: Jarek P. ;-)
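
To put the quoted figure into numbers: reading "1500b/2min" as 1500 bytes per two minutes (an assumption about the notation), the rate comes to 12.5 bytes/s, or 100 bit/s. A small C sketch of that arithmetic follows; the helper name min_rate_bps is hypothetical, and the two-minute window is simply the figure quoted above, not something derived here.

```c
/*
 * Rate, in bits per second, needed to send one packet of pkt_bytes
 * within window_secs. Pure arithmetic: it does not model HTB internals
 * or explain where the two-minute bound comes from.
 */
static double min_rate_bps(unsigned int pkt_bytes, unsigned int window_secs)
{
	return (double)pkt_bytes * 8.0 / (double)window_secs;
}
```

For example, min_rate_bps(1500, 120) evaluates to 100.0.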

^ permalink raw reply

* Re: [PATCH v5] rfs: Receive Flow Steering
From: Changli Gao @ 2010-04-18  0:06 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Eric Dumazet, David Miller, netdev
In-Reply-To: <h2h65634d661004171038g75160e7avae118dfd1cb1441d@mail.gmail.com>

On Sun, Apr 18, 2010 at 1:38 AM, Tom Herbert <therbert@google.com> wrote:
> That's cool, but I still like the idea that this hash is treated as
> an opaque value; getting the hash from the device to avoid the jhash
> or cache misses on the packet can also be a win...  Maybe connection
> tracking/firewall could use the skb->rxhash which provides the
> consistency and also eliminates the need to do more jhashes.
>

A consistent rxhash only adds the risk of hash collisions, and I
don't think that is a big problem. For connection tracking/firewall use,
I am afraid that we have to recompute this value after defrag, so we
have to export the hash function we use in RPS.

As the NIC's hash function can be changed dynamically, the rxhash isn't
consistent, so it can't be used by connection tracking, socket
lookup and others that come later.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: [PATCH] Fix SCTP failure with ipv6 source address routing
From: Paul Gortmaker @ 2010-04-18  0:17 UTC (permalink / raw)
  To: Vlad Yasevich; +Cc: netdev
In-Reply-To: <4BC510A2.1070105@hp.com>

On 10-04-13 08:47 PM, Vlad Yasevich wrote:
> 
> 
> Paul Gortmaker wrote:
>> From: Weixing Shi <Weixing.Shi@windriver.com>
>>
>> Given the below test case, using source address routing, SCTP
>> does not work.
>>
>> Node-A:
>>    1)ifconfig eth0 inet6 add 2001:1::1/64
>>    2)ip -6 rule add from 2001:1::1 table 100 pref 100
>>    3)ip -6 route add 2001:2::1 dev eth0 table 100
>>    4)sctp_darn -H 2001:1::1 -P 250 -l&
>>
>> Node-B:
>>    1)ifconfig eth0 inet6 add 2001:2::1/64
>>    2)ip -6 rule add from 2001:2::1 table 100 pref 100
>>    3)ip -6 route add 2001:1::1 dev eth0 table 100
>>    4)sctp_darn -H 2001:2::1 -P 250 -h 2001:1::1 -p 250 -s
>>
>> Root cause:
>>    Node-A and Node-B use source address routing, and in the
>>    beginning, the source address will be NULL.  So SCTP will search
>>    the routing table by the destination address (because it is using
>>    the source address routing table), and hence the resulting dst_entry
>>    will be NULL.
>>
>> Solution:
>>    After SCTP gets the correct source address, then we search for
>>    dst_entry again, and then we will get the correct value.
> 
> The problem here is that ipv6 route lookup code in sctp doesn't bother
> searching for the source address, unlike the v4 route lookup code.
> 
> Compare sctp_v4_get_dst() and sctp_v6_get_dst().  The v4 version bends over
> backwards trying to get the correct route, while the v6 version simply does
> a single lookup and returns the result.
> 
> The v6 route lookup code needs to be fixed to take into account the bound
> address list.

Thanks for the feedback -- we'll take a look and see if we can
fix it as per your recommendation and re-test.

Paul.

> 
> -vlad
> 	
>>
>> Signed-off-by: Weixing Shi <Weixing.Shi@windriver.com>
>> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
>> ---
>>   net/sctp/transport.c |   11 +++++++++--
>>   1 files changed, 9 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/sctp/transport.c b/net/sctp/transport.c
>> index be4d63d..b5ae18c 100644
>> --- a/net/sctp/transport.c
>> +++ b/net/sctp/transport.c
>> @@ -295,9 +295,16 @@ void sctp_transport_route(struct sctp_transport *transport,
>>
>>   	if (saddr)
>>   		memcpy(&transport->saddr, saddr, sizeof(union sctp_addr));
>> -	else
>> +	else {
>>   		af->get_saddr(opt, asoc, dst, daddr, &transport->saddr);
>> -
>> +		/* When using source address routing, since dst was
>> +		 * looked up prior to filling in the source address, dst
>> +		 * needs to be looked up again to get the correct dst
>> +		 */
>> +		if (dst)
>> +			dst_release(dst);
>> +		dst = af->get_dst(asoc, daddr, &transport->saddr);
>> +	}
>>   	transport->dst = dst;
>>   	if ((transport->param_flags & SPP_PMTUD_DISABLE) && transport->pathmtu) {
>>   		return;
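
The root cause described in the quoted patch can be illustrated with a toy model of source-address policy routing. This is a sketch under stated assumptions, not kernel code: the tables, the exact-match lookup and every name below (route_lookup, table100, table_main, rule_from) are hypothetical and only mirror Node-A's configuration from the test case.

```c
#include <stddef.h>
#include <string.h>

struct route {
	const char *dst;
	const char *dev;
};

/* Node-A: "ip -6 rule add from 2001:1::1 table 100" */
static const char *rule_from = "2001:1::1";
/* Node-A: "ip -6 route add 2001:2::1 dev eth0 table 100" */
static const struct route table100[] = { { "2001:2::1", "eth0" } };
/* Hypothetical main table: holds no route to 2001:2::1 */
static const struct route table_main[] = { { "2001:1::2", "eth0" } };

static const struct route *lookup_in(const struct route *tbl, size_t n,
				     const char *dst)
{
	for (size_t i = 0; i < n; i++)
		if (strcmp(tbl[i].dst, dst) == 0)
			return &tbl[i];
	return NULL;
}

/* Table 100 is consulted only when the source matches the rule. */
static const struct route *route_lookup(const char *src, const char *dst)
{
	if (src && strcmp(src, rule_from) == 0) {
		const struct route *rt = lookup_in(table100, 1, dst);

		if (rt)
			return rt;
	}
	return lookup_in(table_main, 1, dst);
}
```

In this model route_lookup(NULL, "2001:2::1") finds nothing, while route_lookup("2001:1::1", "2001:2::1") returns the eth0 route. That is why the lookup has to be repeated (or done correctly the first time) once the source address is known.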


^ permalink raw reply

* [PATCH] X25 fix dead unaccepted sockets
From: Andrew Hendry @ 2010-04-18  0:17 UTC (permalink / raw)
  To: netdev


1. An X25 program binds and listens.
2. Calls arrive and wait to be accepted.
3. The program exits without accepting them.
4. The sockets time out but are not cleaned up correctly.
5. cat /proc/net/x25/socket shows the dead sockets with bad inode fields.

This line, borrowed from AX25, marks the dying socket as listening
(TCP_LISTEN) so that the timers clean it up later.

Signed-off-by: Andrew Hendry <andrew.hendry@gmail.com>

---
 net/x25/af_x25.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/net/x25/af_x25.c b/net/x25/af_x25.c
index cbddd0c..36e84e1 100644
--- a/net/x25/af_x25.c
+++ b/net/x25/af_x25.c
@@ -402,6 +402,7 @@ static void __x25_destroy_socket(struct sock *sk)
 			/*
 			 * Queue the unaccepted socket for death
 			 */
+			skb->sk->sk_state = TCP_LISTEN;
 			sock_set_flag(skb->sk, SOCK_DEAD);
 			x25_start_heartbeat(skb->sk);
 			x25_sk(skb->sk)->state = X25_STATE_0;
-- 
1.5.6.5
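
Why setting the state helps can be sketched with a toy model of the cleanup check. The rule below (reap a socket that is flagged for destruction, or one that is both dead and in the listening state) is an assumption about what the heartbeat timer tests, and the struct, enum and function names are hypothetical stand-ins rather than the kernel's definitions.

```c
#include <stdbool.h>

enum toy_state { TOY_CLOSE, TOY_LISTEN };	/* stand-ins for TCP_CLOSE / TCP_LISTEN */

struct toy_sock {
	enum toy_state state;
	bool dead;	/* stand-in for SOCK_DEAD */
	bool destroy;	/* stand-in for SOCK_DESTROY */
};

/* Assumed heartbeat rule: reap if flagged for destroy, or dead AND listening. */
static bool heartbeat_reaps(const struct toy_sock *sk)
{
	return sk->destroy || (sk->state == TOY_LISTEN && sk->dead);
}

/* Mirrors the order of the patched code path for an unaccepted socket. */
static void queue_for_death(struct toy_sock *sk)
{
	sk->state = TOY_LISTEN;	/* the one-line change in the patch */
	sk->dead = true;
}
```

Under that assumed rule, a socket that is only marked dead is never reaped, whereas one passed through queue_for_death() is picked up on the next heartbeat.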



^ permalink raw reply related

* Re: [PATCH] gigaset: include cleanup cleanup
From: Tejun Heo @ 2010-04-18  2:16 UTC (permalink / raw)
  To: Tilman Schmidt
  Cc: Karsten Keil, David Miller, Hansjoerg Lipp, i4ldeveloper, netdev,
	linux-kernel
In-Reply-To: <20100416220858.A19F540123@xenon.ts.pxnet.com>

Hello,

On 04/17/2010 07:08 AM, Tilman Schmidt wrote:
> Commit 5a0e3ad causes slab.h to be included twice in many of the
> Gigaset driver's source files, first via the common include file
> gigaset.h and then a second time directly. Drop the spares, and
> use the opportunity to clean up a few more similar cases.
> 
> Impact: cleanup, no functional change
> Signed-off-by: Tilman Schmidt <tilman@imap.cc>
> CC: Tejun Heo <tj@kernel.org>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks for the clean up.

> Seeing that the "include cleanup" patch triggering this was accepted
> after the merge window, I have hopes this one will be accepted, too.

Hmm... which tree should this go through?  I can route it
through percpu but maybe taking the usual isdn patch path would be
better?

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCH] TCP: avoid to send keepalive probes if it is receiving data
From: Eric Dumazet @ 2010-04-18  9:06 UTC (permalink / raw)
  To: Flavio Leitner; +Cc: netdev
In-Reply-To: <1271525305-28423-1-git-send-email-fleitner@redhat.com>

On Saturday, 17 April 2010 at 14:28 -0300, Flavio Leitner wrote:
> RFC 1122 says the following:
> ...
>   Keep-alive packets MUST only be sent when no data or
>   acknowledgement packets have been received for the
>   connection within an interval.
> ...
> 
> Fix this by storing the timestamp of last received data
> packet and checking for it when the keepalive timer expires.
> 
> Signed-off-by: Flavio Leitner <fleitner@redhat.com>

Thanks Flavio !

Shouldn't you also change the TCP_KEEPIDLE case in do_tcp_setsockopt()
for consistency?

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0f8caf6..a4048d7 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2298,7 +2298,10 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 			if (sock_flag(sk, SOCK_KEEPOPEN) &&
 			    !((1 << sk->sk_state) &
 			      (TCPF_CLOSE | TCPF_LISTEN))) {
-				__u32 elapsed = tcp_time_stamp - tp->rcv_tstamp;
+				u32 elapsed = min_t(u32,
+						      tcp_time_stamp - tp->rcv_tstamp,
+						      tcp_time_stamp - tp->lrcvtime);
+
 				if (tp->keepalive_time > elapsed)
 					elapsed = tp->keepalive_time - elapsed;
 				else
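
The computation in the hunk above can be tried out in userspace. This is a minimal sketch, not the kernel function: keepalive_time_left is a hypothetical name, plain uint32_t values stand in for tcp_time_stamp and the tcp_sock fields, and returning 0 once the idle time is used up is an assumption about the branch that the hunk cuts off.

```c
#include <stdint.h>

/*
 * Time left before the first keepalive probe is due, taking the more
 * recent of "last ACK received" and "last data received" as the
 * reference point. Unsigned subtraction wraps modulo 2^32, so the
 * result stays correct when the timestamp counter rolls over.
 */
static uint32_t keepalive_time_left(uint32_t now, uint32_t last_ack,
				    uint32_t last_data, uint32_t keepalive_time)
{
	uint32_t since_ack = now - last_ack;
	uint32_t since_data = now - last_data;
	uint32_t elapsed = since_ack < since_data ? since_ack : since_data;

	/* Assumption: an exhausted idle time means "fire immediately" (0). */
	return keepalive_time > elapsed ? keepalive_time - elapsed : 0;
}
```

For instance, with now = 1000, last_ack = 100, last_data = 900 and keepalive_time = 500, the more recent event is the data packet 100 ticks ago, so 400 ticks remain.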






> ---
>  include/linux/tcp.h  |    1 +
>  net/ipv4/tcp_input.c |    3 +++
>  net/ipv4/tcp_timer.c |    8 ++++++++
>  3 files changed, 12 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index a778ee0..405678f 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -314,6 +314,7 @@ struct tcp_sock {
>   	u32	snd_sml;	/* Last byte of the most recently transmitted small packet */
>  	u32	rcv_tstamp;	/* timestamp of last received ACK (for keepalives) */
>  	u32	lsndtime;	/* timestamp of last sent data packet (for restart window) */
> +	u32	lrcvtime;	/* timestamp of last received data packet (for keepalives) */
>  
>  	/* Data for direct copy to user */
>  	struct {
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index f240f57..60d2980 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -5391,6 +5391,8 @@ no_ack:
>  				__kfree_skb(skb);
>  			else
>  				sk->sk_data_ready(sk, 0);
> +
> +			tp->lrcvtime = tcp_time_stamp;
>  			return 0;
>  		}
>  	}
> @@ -5421,6 +5423,7 @@ step5:
>  
>  	tcp_data_snd_check(sk);
>  	tcp_ack_snd_check(sk);
> +	tp->lrcvtime = tcp_time_stamp;
>  	return 0;
>  
>  csum_error:
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index 8a0ab29..74dd804 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -554,6 +554,14 @@ static void tcp_keepalive_timer (unsigned long data)
>  	if (tp->packets_out || tcp_send_head(sk))
>  		goto resched;
>  
> +	elapsed = tcp_time_stamp - tp->lrcvtime;
> +	
> +	/* receiving data means alive */
> +	if (elapsed < keepalive_time_when(tp)) {
> +		elapsed = keepalive_time_when(tp) - elapsed;
> +		goto resched;
> +	}
> +
>  	elapsed = tcp_time_stamp - tp->rcv_tstamp;
>  
>  	if (elapsed >= keepalive_time_when(tp)) {
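
The decision made in the patched timer handler can be modelled as follows. This is a sketch only: the names (ka_action, ka_decision, keepalive_timer_step) are hypothetical, the probe counting and socket state checks of the real handler are left out, and the behaviour after the final comparison is an assumption because the quoted hunk ends there.

```c
#include <stdint.h>

enum ka_action { KA_RESCHED, KA_PROBE };

struct ka_decision {
	enum ka_action action;
	uint32_t delay;	/* ticks until the timer should fire again */
};

static struct ka_decision keepalive_timer_step(uint32_t now, uint32_t last_ack,
					       uint32_t last_data,
					       uint32_t keepalive_time)
{
	struct ka_decision d = { KA_RESCHED, 0 };
	uint32_t elapsed = now - last_data;

	/* Recently received data means the peer is alive: no probe. */
	if (elapsed < keepalive_time) {
		d.delay = keepalive_time - elapsed;
		return d;
	}

	/* Otherwise fall back to the time since the last received ACK. */
	elapsed = now - last_ack;
	if (elapsed >= keepalive_time)
		d.action = KA_PROBE;	/* assumed: idle long enough, probe now */
	else
		d.delay = keepalive_time - elapsed;
	return d;
}
```

A probe is therefore chosen only when neither data nor an ACK has arrived within the keepalive interval, which is the behaviour the RFC 1122 excerpt asks for.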



^ permalink raw reply related

* Re: [PATCH] gigaset: include cleanup cleanup
From: David Miller @ 2010-04-18  9:13 UTC (permalink / raw)
  To: tj; +Cc: tilman, isdn, hjlipp, i4ldeveloper, netdev, linux-kernel
In-Reply-To: <4BCA6B8E.9000408@kernel.org>

From: Tejun Heo <tj@kernel.org>
Date: Sun, 18 Apr 2010 11:16:46 +0900

> 
>> Seeing that the "include cleanup" patch triggering this was accepted
>> after the merge window, I have hopes this one will be accepted, too.
> 
> Hmm... which tree should this go through?  I can route it
> through percpu but maybe taking the usual isdn patch path would be
> better?

I'll take it into net-2.6, no worries.

^ permalink raw reply

