Netdev List


* Re: Duplicate IP false alerts from arping
From: unni krishnan @ 2010-04-17 19:39 UTC (permalink / raw)
  To: Pascal Hambourg; +Cc: linux-net, netdev
In-Reply-To: <4BC9784B.3020103@plouf.fr.eu.org>

OK, then what is the best method to find a duplicate IP (the same IP
address assigned to different machines)?

On Sat, Apr 17, 2010 at 2:28 PM, Pascal Hambourg
<pascal.mail@plouf.fr.eu.org> wrote:
> Hello,
>
> unni krishnan wrote:
>>
>> I am trying to find a duplicate IP in the network using arping.
>>
>>  -------------------------
>>  [root@vps1 ~]# ping -c 3 192.168.1.212
>>  PING 192.168.1.212 (192.168.1.212) 56(84) bytes of data.
>>  64 bytes from 192.168.1.212: icmp_seq=1 ttl=64 time=1.33 ms
>>  64 bytes from 192.168.1.212: icmp_seq=2 ttl=64 time=0.280 ms
>>  64 bytes from 192.168.1.212: icmp_seq=3 ttl=64 time=0.306 ms
>>
>>  --- 192.168.1.212 ping statistics ---
>>  3 packets transmitted, 3 received, 0% packet loss, time 1999ms
>>  rtt min/avg/max/mdev = 0.280/0.641/1.339/0.494 ms
>>  [root@vps1 ~]# arping -D -I eth0 -c 5 192.168.1.212 ; echo $?
>>  ARPING 192.168.1.212 from 0.0.0.0 eth0
>>  0
>>  -------------------------
>>
>>  As per arping that IP is duplicate.
>
> I disagree. According to man arping :
>
>    -D  Duplicate  address  detection  mode  (DAD).  See RFC2131, 4.4.1.
>        Returns 0, if DAD succeeded i.e. no replies are received
>                                         ^^^^^^^^^^^^^^^^^^^^^^^
> -D (DAD) is meant for DHCP to find out if the proposed IP address is not
> already assigned to another host. Its purpose is not to find out if
> multiple hosts have the same IP address. Besides, a return value of 0
> means that no ARP replies were received (IOW -D inverts the return value
> logic), which is weird since the target IP address replies to ICMP ping
> unless that address is assigned to the local host.
>
> Here :
>
> # arping -DI eth0 -c 1 192.168.0.246 ; echo result=$?
> ARPING 192.168.0.246 from 0.0.0.0 eth0
> Unicast reply from 192.168.0.246 [xx:xx:xx:xx:xx:xx]  0.964ms
> Sent 1 probes (1 broadcast(s))
> Received 1 response(s)
> result=1
>
> # arping -DI eth0 -c 1 192.168.0.24 ; echo result=$?
> ARPING 192.168.0.24 from 0.0.0.0 eth0
> Sent 1 probes (1 broadcast(s))
> Received 0 response(s)
> result=0
>
>> But if I go ahead and ifdown the
>>  IP in the known location I cant ping that IP ( That means that IP is
>>  not duplicated ? ). This is the result after shutting down the IP.
>>
>>  --------------------------
>>  [root@vps1 ~]# ping -c 3 192.168.1.212
>>  PING 192.168.1.212 (192.168.1.212) 56(84) bytes of data.
>>  From 192.168.1.63 icmp_seq=1 Destination Host Unreachable
>>  From 192.168.1.63 icmp_seq=2 Destination Host Unreachable
>>  From 192.168.1.63 icmp_seq=3 Destination Host Unreachable
>
> Ok, that means no ARP reply.
>
>>  [root@vps1 ~]# arping -D -I eth0 -c 5 192.168.1.212 ; echo $?
>>  ARPING 192.168.1.212 from 0.0.0.0 eth0
>>  Sent 5 probes (5 broadcast(s))
>>  Received 0 response(s)
>>  0
>
> Same as above.
>
>>  My question is, in this case IP 192.168.1.212 is not duplicated. But
>>  still arping gives duplicate status. Why it is like that ?
>
> A situation of real duplicate ARP replies may occur when the address is
> assigned to a host which has multiple interfaces connected to the same
> network, so it receives and replies to ARP queries on each interface.
>
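One way to approach the question at the top of this message, offered as a sketch and not as something proposed in the thread: send ordinary ARP requests (arping without -D) and check whether the replies come from more than one hardware address. The helper below counts distinct bracketed MAC addresses in captured arping output. The function name and the fixed table size are assumptions made for illustration, and the parsing assumes the "Unicast reply from ... [xx:xx:xx:xx:xx:xx]" line format shown above.

```c
#include <string.h>
#include <strings.h>

#define MAX_MACS 16
#define MAC_LEN  17    /* length of "xx:xx:xx:xx:xx:xx" */

/*
 * Count distinct bracketed MAC addresses in captured arping output.
 * Hypothetical helper for illustration; it is not part of arping.
 */
int count_distinct_macs(const char *output)
{
    char seen[MAX_MACS][MAC_LEN + 1];
    int nseen = 0;
    const char *p = output;

    while ((p = strchr(p, '[')) != NULL) {
        const char *end = strchr(p, ']');
        int i, known = 0;

        if (end == NULL)
            break;
        /* Only consider bracketed fields that are MAC-sized. */
        if (end - p - 1 == MAC_LEN) {
            for (i = 0; i < nseen; i++)
                if (strncasecmp(seen[i], p + 1, MAC_LEN) == 0)
                    known = 1;
            if (!known && nseen < MAX_MACS) {
                memcpy(seen[nseen], p + 1, MAC_LEN);
                seen[nseen][MAC_LEN] = '\0';
                nseen++;
            }
        }
        p = end + 1;
    }
    return nseen;
}
```

A result greater than 1 suggests that more than one interface answers for the address. As noted above, that can also happen when a single host has several interfaces on the same network.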



-- 
Regards,
Unni
http://mutexes.org/
http://twitter.com/webofunni

^ permalink raw reply

* Re: [PATCH v5] rfs: Receive Flow Steering
From: Tom Herbert @ 2010-04-17 17:38 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1271520633.16881.4754.camel@edumazet-laptop>

>
> With attached patch, I reached
>
> Throughput 4465.13 MB/sec 16 procs
>
> RFS better than no RPS/RFS :)
>
> So, the old idea to make rxhash consistent (same value in both
> directions) is a win for some workloads (Consider connection tracking /
> firewalling)
>
> port1 = ...
> port2 = ...
> addr1 = ...
> addr2 = ...
> if (addr1 > addr2)
>        exchange(addr1, addr2)
> if (port1 > port2)
>        exchange(port1, port2)
>
> hash = jhash(addr1, addr2, (port1<<16)+port2, ...)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 7abf959..6b757ff 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2280,8 +2280,10 @@ static int get_rps_cpu(struct net_device *dev,
> struct sk_buff *skb,
>        case IPPROTO_AH:
>        case IPPROTO_SCTP:
>        case IPPROTO_UDPLITE:
> -               if (pskb_may_pull(skb, (ihl * 4) + 4))
> -                       ports = *((u32 *) (skb->data + (ihl * 4)));
> +               if (pskb_may_pull(skb, (ihl * 4) + 4)) {
> +                       u16 *_ports = (u16 *)(skb->data + (ihl * 4));
> +                       ports = _ports[0] ^ _ports[1];
> +               }
>                break;
>
>        default:
>
That's cool! But I still like the idea that this hash is treated as
an opaque value; getting the hash from the device to avoid the jhash
or cache misses on the packet can also be a win. Maybe connection
tracking/firewalling could use skb->rxhash, which provides the
consistency and also eliminates the need to do more jhashes.


>
>

^ permalink raw reply

* Re: rps perfomance WAS(Re: rps: question
From: jamal @ 2010-04-17 17:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Changli Gao, Rick Jones, David Miller, therbert, netdev, robert,
	andi
In-Reply-To: <1271489739.16881.4586.camel@edumazet-laptop>

On Sat, 2010-04-17 at 09:35 +0200, Eric Dumazet wrote:

> I did some tests on a dual quad core machine (E5450  @ 3.00GHz), not
> nehalem. So a 3-4 years old design.

Eric, I thank you kind sir for going out of your way to do this - it is
certainly a good processor to compare against 

> For all test, I use the best time of 3 runs of "ping -f -q -c 100000
> 192.168.0.2". Yes ping is not very good, but its available ;)

It is a reasonable quick test, no fancy setup required ;->

> Note: I make sure all 8 cpus of target are busy, eating cpu cycles in
> user land. 

I didn't keep the cpus busy. I should re-run with such a setup; any
specific app that you used to keep them busy? Keeping them busy could
have consequences; I am speculating you probably ended up with a
greater-than-one packet/IPI ratio, i.e. an amortization benefit.

> I dont want to tweak acpi or whatever smart power saving
> mechanisms.

I should mention i turned off acpi as well in the bios; it was consuming
more cpu cycles than net-processing and was interfering in my tests.

> When RPS off
> 100000 packets transmitted, 100000 received, 0% packet loss, time 4160ms
> 
> RPS on, but directed on the cpu0 handling device interrupts (tg3, napi)
> (echo 01 > /sys/class/net/eth3/queues/rx-0/rps_cpus)
> 100000 packets transmitted, 100000 received, 0% packet loss, time 4234ms
> 
> So the cost of queing the packet into our own queue (netif_receive_skb
> -> enqueue_to_backlog) is about 0.74 us  (74 ms / 100000)
> 

Excellent analysis.

> I personally think we should process packet instead of queeing it, but
> Tom disagree with me.

Sorry - I am gonna have to turn on some pedagogy and offer my
Canadian 2 cents;->
I would lean on agreeing with Tom, but maybe go one step further (sans
packet-reordering): we should never process packets to socket layer on
the demuxing cpu.
enqueue everything you receive on a different cpu - so somehow receiving
cpu becomes part of a hashing decision ...

The reason is derived from queueing theory - of which i know dangerously
little - but refer you to mr. little his-self[1] (pun fully
intended;->):
i.e fixed serving time provides more predictable results as opposed to
once in a while a spike as you receive packets destined to "our cpu".
Queueing packets and later allocating cycles to processing them adds to
variability, but is not as bad as processing to completion to socket
layer.

> RPS on, directed on cpu1 (other socket)
> (echo 02 > /sys/class/net/eth3/queues/rx-0/rps_cpus)
> 100000 packets transmitted, 100000 received, 0% packet loss, time 4542ms

Good test - should be worst case scenario. But there are two other 
scenarios which will give different results in my opinion.
On your setup i think each socket has two dies, each with two cores. So
my feeling is you will get different numbers if you go within same die
and across dies within same socket. If i am not mistaken, the mapping
would be something like socket0/die0{core0/2}, socket0/die1{core4/6},
socket1/die0{core1/3}, socket1/die1{core5/7}.
If you have cycles can you try the same socket+die but different cores
and same socket but different die test?
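For reference, rps_cpus takes a hexadecimal CPU bitmask, which is why "echo 01" selects cpu0 and "echo 02" selects cpu1 in Eric's runs. The sketch below builds such a mask from a list of CPU numbers; the function name is made up for illustration and it handles only CPUs 0 to 63. If the core numbering guessed above is right (an assumption, not a measured fact), the same-die test would target cpu2 (mask 04) and the other-die test cpu4 (mask 10).

```c
#include <stdint.h>

/*
 * Build a bitmask with one bit set per listed CPU number.
 * Hypothetical helper; CPU numbers outside 0..63 are ignored.
 */
uint64_t rps_cpu_mask(const int *cpus, int ncpus)
{
    uint64_t mask = 0;
    int i;

    for (i = 0; i < ncpus; i++)
        if (cpus[i] >= 0 && cpus[i] < 64)
            mask |= (uint64_t)1 << cpus[i];
    return mask;
}
```

Printing the result with "%llx" gives the string to write into /sys/class/net/<dev>/queues/rx-0/rps_cpus.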

> So extra cost to enqueue to a remote cpu queue, IPI, softirq handling...
> is 3 us. Note this cost is in case we receive a single packet.

Which is not too bad if amortized. Were you able to check if you
processed a packet/IPI? One way to achieve that is just standard ping.
In the nehalem my number for going to a different core was in the range
of 5 microseconds effect on RTT when system was not busy. I think it
would be higher going across QPI.

> I suspect IPI itself is in the 1.5 us range, not very far from the
> queing to ourself case.

Sounds about right, maybe 2 us in my case. I am still mystified by "what
damage does an IPI do?" to the system harmony. I have to do some
reading. Andi mentioned the APIC connection - but my gut feeling is you
probably end up going to main memory and invalidating cache.

> For me RPS use cases are :
> 
> 1) Value added apps handling lot of TCP data, where the costs of cache
> misses in tcp stack easily justify to spend 3 us to gain much more.
> 
> 2) Network appliance, where a single cpu is filled 100% to handle one
> device hardware and software/RPS interrupts, delegating all higher level
> works to a pool of cpus.
> 

Agreed on both. 
The caveats to note:
- what hardware would be reasonable
- within the same hardware, what setups would be good to use
- when it doesn't benefit even with everything correct (e.g. low tcp
throughput)

> I'll try to do these tests on a Nehalem target.

Thanks again Eric.

cheers,
jamal 

[1]http://en.wikipedia.org/wiki/Little's_law

^ permalink raw reply

* [PATCH] TCP: avoid to send keepalive probes if it is receiving data
From: Flavio Leitner @ 2010-04-17 17:28 UTC (permalink / raw)
  To: netdev; +Cc: Flavio Leitner
In-Reply-To: <20100416150644.GA2641@sysclose.org>

RFC 1122 says the following:
...
  Keep-alive packets MUST only be sent when no data or
  acknowledgement packets have been received for the
  connection within an interval.
...

Fix this by storing the timestamp of last received data
packet and checking for it when the keepalive timer expires.

Signed-off-by: Flavio Leitner <fleitner@redhat.com>
---
 include/linux/tcp.h  |    1 +
 net/ipv4/tcp_input.c |    3 +++
 net/ipv4/tcp_timer.c |    8 ++++++++
 3 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index a778ee0..405678f 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -314,6 +314,7 @@ struct tcp_sock {
  	u32	snd_sml;	/* Last byte of the most recently transmitted small packet */
 	u32	rcv_tstamp;	/* timestamp of last received ACK (for keepalives) */
 	u32	lsndtime;	/* timestamp of last sent data packet (for restart window) */
+	u32	lrcvtime;	/* timestamp of last received data packet (for keepalives) */
 
 	/* Data for direct copy to user */
 	struct {
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index f240f57..60d2980 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5391,6 +5391,8 @@ no_ack:
 				__kfree_skb(skb);
 			else
 				sk->sk_data_ready(sk, 0);
+
+			tp->lrcvtime = tcp_time_stamp;
 			return 0;
 		}
 	}
@@ -5421,6 +5423,7 @@ step5:
 
 	tcp_data_snd_check(sk);
 	tcp_ack_snd_check(sk);
+	tp->lrcvtime = tcp_time_stamp;
 	return 0;
 
 csum_error:
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 8a0ab29..74dd804 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -554,6 +554,14 @@ static void tcp_keepalive_timer (unsigned long data)
 	if (tp->packets_out || tcp_send_head(sk))
 		goto resched;
 
+	elapsed = tcp_time_stamp - tp->lrcvtime;
+	
+	/* receiving data means alive */
+	if (elapsed < keepalive_time_when(tp)) {
+		elapsed = keepalive_time_when(tp) - elapsed;
+		goto resched;
+	}
+
 	elapsed = tcp_time_stamp - tp->rcv_tstamp;
 
 	if (elapsed >= keepalive_time_when(tp)) {
-- 
1.6.6.1
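The timer decision in the tcp_timer.c hunk can be modelled in user space as follows. This is a simplified sketch under stated assumptions, not the kernel code: the function name is hypothetical, timestamps are plain 32-bit tick counters, and the packets_out / send-head check that precedes this logic in the kernel is left out.

```c
#include <stdint.h>

/*
 * Returns 0 when a keepalive probe is due, otherwise the number of
 * ticks to wait before the timer should fire again.
 * Unsigned subtraction keeps the result correct across counter wrap.
 */
uint32_t keepalive_ticks_to_wait(uint32_t now, uint32_t lrcvtime,
                                 uint32_t rcv_tstamp, uint32_t keepalive_time)
{
    uint32_t elapsed;

    /* Data was received recently: the peer is alive, so just re-arm. */
    elapsed = now - lrcvtime;
    if (elapsed < keepalive_time)
        return keepalive_time - elapsed;

    /* Otherwise fall back to the time since the last received ACK. */
    elapsed = now - rcv_tstamp;
    if (elapsed < keepalive_time)
        return keepalive_time - elapsed;

    return 0;
}
```

With now=1000, lrcvtime=900 and keepalive_time=500, the function returns 400: no probe is sent and the timer is re-armed for the remaining interval.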


^ permalink raw reply related

* Re: rps perfomance WAS(Re: rps: question
From: Tom Herbert @ 2010-04-17 17:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: hadi, Changli Gao, Rick Jones, David Miller, netdev, robert, andi
In-Reply-To: <1271514445.16881.4746.camel@edumazet-laptop>

> Tom, I am not sure what you describe is even respected for NAPI devices.
> (I hope you use napi devices in your company ;) )
>
> If we enqueue a skb to backlog, we also link our backlog napi into our
> poll_list, if not already there.
>
> So the loop in net_rx_action() will make us handle our backlog napi a
> bit after this network device napi (if time limit of 2 jiffies not
> elapsed) and *before* sending IPIS to remote cpus anyway.
>
Then I think that's a bug you've identified ;-)

>
>
>
>

^ permalink raw reply

* Re: [PATCH v5] rfs: Receive Flow Steering
From: Eric Dumazet @ 2010-04-17 16:10 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David Miller, netdev
In-Reply-To: <1271452358.16881.4486.camel@edumazet-laptop>

On Friday 16 April 2010 at 23:12 +0200, Eric Dumazet wrote:
> On Friday 16 April 2010 at 13:42 -0700, Tom Herbert wrote:
> > On Fri, Apr 16, 2010 at 11:53 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > On Friday 16 April 2010 at 11:35 -0700, Tom Herbert wrote:
> > >> Results with "tbench 16" on an 8 core Intel machine.
> > >>
> > >> No RPS/RFS:  2155 MB/sec
> > >> RPS (0ff mask): 1700 MB/sec
> > >> RFS: 1097
> > >>
> > 
> > Blah, I mistakingly reported that... should have been:
> > 
> > No RPS/RFS:  2155 MB/sec
> > RPS (0ff mask): 1097 MB/sec
> > RFS: 1700 MB/sec
> > 
> > Sorry about that!
> 
> > This was my expectation too, and what my "corrected" numbers show :-)
> > But, I take it this is different in your results?
> 
> 
> My results are on a "tbench 16" on an dual X5570  @ 2.93GHz.
> (16 logical cpus)
> 
> No RPS , no RFS : 4448.14 MB/sec 
> RPS : 2298.00 MB/sec (but lot of variation)
> RFS : 2600 MB/sec
> 
> Maybe my RFS setup is bad ?
> (8192 flows)
> 

With attached patch, I reached 

Throughput 4465.13 MB/sec 16 procs

RFS better than no RPS/RFS :)

So, the old idea to make rxhash consistent (same value in both
directions) is a win for some workloads (Consider connection tracking /
firewalling) 

port1 = ...
port2 = ...
addr1 = ...
addr2 = ...
if (addr1 > addr2)
	exchange(addr1, addr2)
if (port1 > port2)
	exchange(port1, port2)

hash = jhash(addr1, addr2, (port1<<16)+port2, ...)
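The pseudocode above can be written out as a small user-space sketch. It assumes a stand-in mixing function in place of the kernel's jhash, and the function names are made up for illustration. The point it shows is only that ordering the addresses and ports before hashing gives the same value for both directions of a flow.

```c
#include <stdint.h>

/* Stand-in 32-bit mixer (assumption: the kernel would use jhash here). */
static uint32_t mix32(uint32_t h, uint32_t v)
{
    h ^= v;
    h *= 0x9e3779b1u;
    h ^= h >> 15;
    return h;
}

/* Direction-independent flow hash: order the pairs, then hash. */
uint32_t symmetric_flow_hash(uint32_t saddr, uint32_t daddr,
                             uint16_t sport, uint16_t dport)
{
    uint32_t addr1 = saddr, addr2 = daddr;
    uint16_t port1 = sport, port2 = dport;
    uint32_t h = 0;

    if (addr1 > addr2) {
        uint32_t tmp = addr1;
        addr1 = addr2;
        addr2 = tmp;
    }
    if (port1 > port2) {
        uint16_t tmp = port1;
        port1 = port2;
        port2 = tmp;
    }

    h = mix32(h, addr1);
    h = mix32(h, addr2);
    h = mix32(h, ((uint32_t)port1 << 16) + port2);
    return h;
}
```

Because addresses and ports are ordered independently, the flows a:p1 -> b:p2 and a:p2 -> b:p1 also hash alike. For steering purposes that extra collision is probably harmless.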

diff --git a/net/core/dev.c b/net/core/dev.c
index 7abf959..6b757ff 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2280,8 +2280,10 @@ static int get_rps_cpu(struct net_device *dev,
struct sk_buff *skb,
        case IPPROTO_AH:
        case IPPROTO_SCTP:
        case IPPROTO_UDPLITE:
-               if (pskb_may_pull(skb, (ihl * 4) + 4))
-                       ports = *((u32 *) (skb->data + (ihl * 4)));
+               if (pskb_may_pull(skb, (ihl * 4) + 4)) {
+                       u16 *_ports = (u16 *)(skb->data + (ihl * 4));
+                       ports = _ports[0] ^ _ports[1];
+               }
                break;
 
        default:



^ permalink raw reply related

* Re: rps perfomance WAS(Re: rps: question
From: Eric Dumazet @ 2010-04-17 14:27 UTC (permalink / raw)
  To: Tom Herbert
  Cc: hadi, Changli Gao, Rick Jones, David Miller, netdev, robert, andi
In-Reply-To: <1271496229.16881.4602.camel@edumazet-laptop>

On Saturday 17 April 2010 at 11:23 +0200, Eric Dumazet wrote:
> On Saturday 17 April 2010 at 01:43 -0700, Tom Herbert wrote:
> > > So the cost of queing the packet into our own queue (netif_receive_skb
> > > -> enqueue_to_backlog) is about 0.74 us  (74 ms / 100000)
> > >
> > > I personally think we should process packet instead of queeing it, but
> > > Tom disagree with me.
> > >
> > You could do that, but then the packet processing becomes HOL blocking
> > on all the packets that are being sent to other queues for
> > processing-- remember the IPIs is only sent at the end of the NAPI.
> > So unless the upper stack processing is <0.74us in your case, I think
> > processing packets directly on the local queue would improve best case
> > latency, but would increase average latency and even more likely worse
> > case latency on loads with multiple flows.


Tom, I am not sure what you describe is even respected for NAPI devices.
(I hope you use napi devices in your company ;) )

If we enqueue a skb to backlog, we also link our backlog napi into our
poll_list, if not already there.

So the loop in net_rx_action() will make us handle our backlog napi a
bit after this network device napi (if the time limit of 2 jiffies has
not elapsed) and *before* sending IPIs to remote cpus anyway.





^ permalink raw reply

* [PATCH net-next-2.6] net: remove time limit in process_backlog()
From: Eric Dumazet @ 2010-04-17 14:17 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <z2q65634d661004170143nb35ec784mbedd003565410cfb@mail.gmail.com>

- There is no point in enforcing a time limit in process_backlog(), since
other napi instances don't follow the same rule. With the limit, we can
exit after only one packet has been processed...
The normal quota of 64 packets per napi instance should be the norm, and
net_rx_action() already has its own time limit.
Note: /proc/sys/net/core/dev_weight can be used to tune this default
value of 64.

- Use DEFINE_PER_CPU_ALIGNED for softnet_data definition.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/core/dev.c b/net/core/dev.c
index 7abf959..8092f01 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -264,7 +264,7 @@ static RAW_NOTIFIER_HEAD(netdev_chain);
  *	queue in the local softnet handler.
  */
 
-DEFINE_PER_CPU(struct softnet_data, softnet_data);
+DEFINE_PER_CPU_ALIGNED(struct softnet_data, softnet_data);
 EXPORT_PER_CPU_SYMBOL(softnet_data);
 
 #ifdef CONFIG_LOCKDEP
@@ -3232,7 +3232,6 @@ static int process_backlog(struct napi_struct *napi, int quota)
 {
 	int work = 0;
 	struct softnet_data *queue = &__get_cpu_var(softnet_data);
-	unsigned long start_time = jiffies;
 
 	napi->weight = weight_p;
 	do {
@@ -3252,7 +3251,7 @@ static int process_backlog(struct napi_struct *napi, int quota)
 		local_irq_enable();
 
 		__netif_receive_skb(skb);
-	} while (++work < quota && jiffies == start_time);
+	} while (++work < quota);
 
 	return work;
 }
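The effect of the removed loop condition can be illustrated with a toy user-space model. Everything here is an assumption made for the sketch: a fake tick counter stands in for jiffies, "processing" a packet is just a counter decrement, and the function name is hypothetical.

```c
/* Fake tick counter standing in for the kernel's jiffies. */
static unsigned long fake_jiffies;

/*
 * Toy model of the process_backlog() loop.
 * tick_after: the fake clock ticks while this packet number is being
 *             processed (0 means the clock never ticks).
 * time_limit: non-zero selects the old loop condition.
 * Returns the number of packets processed in one call.
 */
int backlog_work_done(int queued, int quota, int tick_after, int time_limit)
{
    unsigned long start_time = fake_jiffies;
    int work = 0;

    do {
        if (queued == 0)
            break;              /* queue empty: nothing left to do */
        queued--;               /* "process" one packet */
        if (tick_after && work + 1 == tick_after)
            fake_jiffies++;     /* a timer tick lands mid-loop */
    } while (++work < quota &&
             (!time_limit || fake_jiffies == start_time));

    return work;
}
```

With 100 queued packets, a quota of 64 and a tick during the first packet, the old condition stops after 1 packet while the new one processes the full 64.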



^ permalink raw reply related

* Re: rps perfomance WAS(Re: rps: question
From: Eric Dumazet @ 2010-04-17  9:23 UTC (permalink / raw)
  To: Tom Herbert
  Cc: hadi, Changli Gao, Rick Jones, David Miller, netdev, robert, andi
In-Reply-To: <z2q65634d661004170143nb35ec784mbedd003565410cfb@mail.gmail.com>

On Saturday 17 April 2010 at 01:43 -0700, Tom Herbert wrote:
> > So the cost of queing the packet into our own queue (netif_receive_skb
> > -> enqueue_to_backlog) is about 0.74 us  (74 ms / 100000)
> >
> > I personally think we should process packet instead of queeing it, but
> > Tom disagree with me.
> >
> You could do that, but then the packet processing becomes HOL blocking
> on all the packets that are being sent to other queues for
> processing-- remember the IPIs is only sent at the end of the NAPI.
> So unless the upper stack processing is <0.74us in your case, I think
> processing packets directly on the local queue would improve best case
> latency, but would increase average latency and even more likely worse
> case latency on loads with multiple flows.

Anyway, a big part of this 0.74 us overhead comes from get_rps_cpu()
itself, computing skb->rxhash and all. We should review how many cache
lines we exchange per skb, and try to reduce this number.




^ permalink raw reply

* Re: HTB - What's the minimal value for 'rate' parameter?
From: Benny Amorsen @ 2010-04-17  9:19 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Antonio Almeida, netdev, kaber, davem, devik
In-Reply-To: <4BC63766.5080104@gmail.com>

Jarek Poplawski <jarkao2@gmail.com> writes:

> As I wrote before, the minimal (overflow safe) rate depends on max
> packet size, and for 1500 byte it would be something around:
> 1500b/2min, so if your clients can wait so long, try this:

Wouldn't it be nice of either tc or the kernel to warn about wrong
configurations, or possibly reject them completely?

/Benny

^ permalink raw reply

* Re: rps perfomance WAS(Re: rps: question
From: Tom Herbert @ 2010-04-17  8:43 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: hadi, Changli Gao, Rick Jones, David Miller, netdev, robert, andi
In-Reply-To: <1271489739.16881.4586.camel@edumazet-laptop>

> So the cost of queing the packet into our own queue (netif_receive_skb
> -> enqueue_to_backlog) is about 0.74 us  (74 ms / 100000)
>
> I personally think we should process packet instead of queeing it, but
> Tom disagree with me.
>
You could do that, but then the packet processing becomes HOL blocking
on all the packets that are being sent to other queues for
processing-- remember the IPIs is only sent at the end of the NAPI.
So unless the upper stack processing is <0.74us in your case, I think
processing packets directly on the local queue would improve best case
latency, but would increase average latency and even more likely worse
case latency on loads with multiple flows.

> RPS on, directed on cpu1 (other socket)
> (echo 02 > /sys/class/net/eth3/queues/rx-0/rps_cpus)
> 100000 packets transmitted, 100000 received, 0% packet loss, time 4542ms
>
> So extra cost to enqueue to a remote cpu queue, IPI, softirq handling...
> is 3 us. Note this cost is in case we receive a single packet.
>
> I suspect IPI itself is in the 1.5 us range, not very far from the
> queing to ourself case.
>
> For me RPS use cases are :
>
> 1) Value added apps handling lot of TCP data, where the costs of cache
> misses in tcp stack easily justify to spend 3 us to gain much more.
>
> 2) Network appliance, where a single cpu is filled 100% to handle one
> device hardware and software/RPS interrupts, delegating all higher level
> works to a pool of cpus.
>
> I'll try to do these tests on a Nehalem target.
>
>
>
>

^ permalink raw reply

* Re: [PATCH net-next-2.6] rps: rps_sock_flow_table is mostly read
From: David Miller @ 2010-04-17  7:57 UTC (permalink / raw)
  To: eric.dumazet; +Cc: therbert, netdev
In-Reply-To: <1271490733.16881.4588.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 17 Apr 2010 09:52:13 +0200

> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied, thanks Eric.

^ permalink raw reply

* [PATCH net-next-2.6] rps: rps_sock_flow_table is mostly read
From: Eric Dumazet @ 2010-04-17  7:52 UTC (permalink / raw)
  To: David Miller; +Cc: therbert, netdev
In-Reply-To: <1271404097.16881.3827.camel@edumazet-laptop>


Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/core/dev.c b/net/core/dev.c
index d7107ac..7abf959 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2205,7 +2205,7 @@ DEFINE_PER_CPU(struct netif_rx_stats, netdev_rx_stat) = { 0, };
 #ifdef CONFIG_RPS
 
 /* One global table that all flow-based protocols share. */
-struct rps_sock_flow_table *rps_sock_flow_table;
+struct rps_sock_flow_table *rps_sock_flow_table __read_mostly;
 EXPORT_SYMBOL(rps_sock_flow_table);
 
 /*



^ permalink raw reply related

* Re: rps perfomance WAS(Re: rps: question
From: Eric Dumazet @ 2010-04-17  7:35 UTC (permalink / raw)
  To: hadi; +Cc: Changli Gao, Rick Jones, David Miller, therbert, netdev, robert,
	andi
In-Reply-To: <1271424065.4606.31.camel@bigi>

On Friday 16 April 2010 at 09:21 -0400, jamal wrote:
> On Fri, 2010-04-16 at 07:18 +0200, Eric Dumazet wrote:
> 
> > 
> > A kernel module might do this, this could be integrated in perf bench so
> > that we can regression tests upcoming kernels.
> 
> Perf would be good - but even softnet_stat cleaner than the the nasty
> hack i use (attached) would be a good start; the ping with and without
> rps gives me a ballpark number.
> 
> IPI is important to me because having tried it before it and failed
> miserably. I was thinking the improvement may be due to hardware used
> but i am having a hard time to get people to tell me what hardware they
> used! I am old school - I need data;-> The RFS patch commit seems to
> have more info but still vague, example: 
> "The benefits of RFS are dependent on cache hierarchy, application
> load, and other factors"
> Also, what does a "simple" or "complex" benchmark mean?;->
> I think it is only fair to get this info, no?
> 
> Please dont consider what i say above as being anti-RPS.
> 5 microsec extra latency is not bad if it can be amortized.
> Unfortunately, the best traffic i could generate was < 20Kpps of
> ping which still manages to get 1 IPI/packet on Nehalem. I am going
> to write up some app (lots of cycles available tommorow). I still think
> it is valueable.

I did some tests on a dual quad core machine (E5450  @ 3.00GHz), not
nehalem. So a 3-4 years old design.

For all tests, I use the best time of 3 runs of "ping -f -q -c 100000
192.168.0.2". Yes, ping is not very good, but it's available ;)

Note: I make sure all 8 cpus of the target are busy, eating cpu cycles in
user land. I don't want to tweak acpi or whatever smart power saving
mechanisms.

When RPS off
100000 packets transmitted, 100000 received, 0% packet loss, time 4160ms

RPS on, but directed on the cpu0 handling device interrupts (tg3, napi)
(echo 01 > /sys/class/net/eth3/queues/rx-0/rps_cpus)
100000 packets transmitted, 100000 received, 0% packet loss, time 4234ms

So the cost of queuing the packet into our own queue (netif_receive_skb
-> enqueue_to_backlog) is about 0.74 us  (74 ms / 100000)

I personally think we should process the packet instead of queuing it, but
Tom disagrees with me.

RPS on, directed on cpu1 (other socket)
(echo 02 > /sys/class/net/eth3/queues/rx-0/rps_cpus)
100000 packets transmitted, 100000 received, 0% packet loss, time 4542ms

So extra cost to enqueue to a remote cpu queue, IPI, softirq handling...
is 3 us. Note this cost is in case we receive a single packet.
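The per-packet figures follow from dividing the difference between two run times by the packet count. A small sketch of that arithmetic (the function name is made up for illustration):

```c
/*
 * Per-packet overhead in microseconds between two flood-ping runs,
 * given the total run times in milliseconds and the packet count.
 */
double per_packet_overhead_us(double base_ms, double test_ms, long packets)
{
    return (test_ms - base_ms) * 1000.0 / (double)packets;
}
```

With the numbers above, (4160, 4234) gives 0.74 us and (4234, 4542) gives 3.08 us, so the "3 us" appears to be measured against the local-queue run. Measured against the RPS-off run (4160 ms), the remote case costs about 3.82 us per packet.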

I suspect the IPI itself is in the 1.5 us range, not very far from the
queuing-to-ourselves case.

For me, RPS use cases are:

1) Value-added apps handling lots of TCP data, where the cost of cache
misses in the tcp stack easily justifies spending 3 us to gain much more.

2) Network appliances, where a single cpu is filled 100% handling one
device's hardware and software/RPS interrupts, delegating all higher-level
work to a pool of cpus.

I'll try to do these tests on a Nehalem target.

^ permalink raw reply

* Re: Duplicate IP false alerts from arping
From: unni krishnan @ 2010-04-17  4:11 UTC (permalink / raw)
  To: netdev; +Cc: linux-net
In-Reply-To: <m2v1f8bbe3c1004152351o580b37ebx55e428fb106b09bd@mail.gmail.com>

Hi,

I am trying to find a duplicate IP in the network using arping.

 -------------------------
 [root@vps1 ~]# ping -c 3 192.168.1.212
 PING 192.168.1.212 (192.168.1.212) 56(84) bytes of data.
 64 bytes from 192.168.1.212: icmp_seq=1 ttl=64 time=1.33 ms
 64 bytes from 192.168.1.212: icmp_seq=2 ttl=64 time=0.280 ms
 64 bytes from 192.168.1.212: icmp_seq=3 ttl=64 time=0.306 ms

 --- 192.168.1.212 ping statistics ---
 3 packets transmitted, 3 received, 0% packet loss, time 1999ms
 rtt min/avg/max/mdev = 0.280/0.641/1.339/0.494 ms
 [root@vps1 ~]# arping -D -I eth0 -c 5 192.168.1.212 ; echo $?
 ARPING 192.168.1.212 from 0.0.0.0 eth0
 0
 -------------------------


 As per arping that IP is duplicate. But if I go ahead and ifdown the
 IP in the known location I can't ping that IP (does that mean the IP is
 not duplicated?). This is the result after shutting down the IP.

 --------------------------
 [root@vps1 ~]# ping -c 3 192.168.1.212
 PING 192.168.1.212 (192.168.1.212) 56(84) bytes of data.
 From 192.168.1.63 icmp_seq=1 Destination Host Unreachable
 From 192.168.1.63 icmp_seq=2 Destination Host Unreachable
 From 192.168.1.63 icmp_seq=3 Destination Host Unreachable

 --- 192.168.1.212 ping statistics ---
 3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2001ms
 , pipe 3
 [root@vps1 ~]# arping -D -I eth0 -c 5 192.168.1.212 ; echo $?
 ARPING 192.168.1.212 from 0.0.0.0 eth0
 Sent 5 probes (5 broadcast(s))
 Received 0 response(s)
 0
 [root@vps1 ~]#
 --------------------------

 My question is: in this case IP 192.168.1.212 is not duplicated, but
 arping still gives duplicate status. Why is that?

 --
 Regards,
 Unni

^ permalink raw reply

* [PATCH] mac8390: change an error return code and some cleanup, take 3
From: Finn Thain @ 2010-04-17  3:16 UTC (permalink / raw)
  To: David Miller; +Cc: joe, p_gortmaker, netdev, linux-kernel, linux-m68k
In-Reply-To: <alpine.OSX.2.00.1004162347390.271@localhost>


Change an error return code from -EAGAIN to -EBUSY since the former is 
misleading.

Nubus slots are geographically addressed and their irqs are equally 
inflexible. -EAGAIN is misleading because retrying will not help fix 
whatever bug it was that made the irq unavailable.

Also promote the log message. Likewise some other KERN_INFO log messages.

Signed-off-by: Finn Thain <fthain@telegraphics.com.au>

--- a/drivers/net/mac8390.c	2010-04-16 13:31:04.000000000 +1000
+++ b/drivers/net/mac8390.c	2010-04-16 23:50:39.000000000 +1000
@@ -554,7 +554,7 @@
 	case MAC8390_APPLE:
 		switch (mac8390_testio(dev->mem_start)) {
 		case ACCESS_UNKNOWN:
-			pr_info("Don't know how to access card memory!\n");
+			pr_err("Don't know how to access card memory!\n");
 			return -ENODEV;
 			break;
 
@@ -643,8 +643,8 @@
 {
 	__ei_open(dev);
 	if (request_irq(dev->irq, __ei_interrupt, 0, "8390 Ethernet", dev)) {
-		pr_info("%s: unable to get IRQ %d.\n", dev->name, dev->irq);
-		return -EAGAIN;
+		pr_err("%s: unable to get IRQ %d.\n", dev->name, dev->irq);
+		return -EBUSY;
 	}
 	return 0;
 }
@@ -660,7 +660,7 @@
 {
 	ei_status.txing = 0;
 	if (ei_debug > 1)
-		pr_info("reset not supported\n");
+		printk(KERN_DEBUG pr_fmt("reset not supported\n"));
 	return;
 }
 
@@ -668,7 +668,7 @@
 {
 	unsigned char *target = nubus_slot_addr(IRQ2SLOT(dev->irq));
 	if (ei_debug > 1)
-		pr_info("Need to reset the NS8390 t=%lu...", jiffies);
+		printk(KERN_DEBUG pr_fmt("Need to reset the NS8390 t=%lu..."), jiffies);
 	ei_status.txing = 0;
 	target[0xC0000] = 0;
 	if (ei_debug > 1)
 

^ permalink raw reply

* Re: [PATCH] mac8390: fix pr_info() calls and change return code
From: Finn Thain @ 2010-04-17  2:28 UTC (permalink / raw)
  To: David Miller; +Cc: Joe Perches, p_gortmaker, netdev, linux-kernel, linux-m68k
In-Reply-To: <20100416.132855.230883346.davem@davemloft.net>

On Fri, 16 Apr 2010, David Miller wrote:

> 
> I just want to point out that with all the trouble you gave about Joe's 
> work, you're having one heck of a time even submitting your changes 
> properly. :-)

You are a thorough reviewer, despite the regrettable implication to the 
contrary in my first message in this thread. And you are quite right, I 
did not understand the pr_* macros at the time. I apologise.

I am hypersensitive to bit rot. I guess that's what happens when one makes 
it one's job to fix regressions in Linux.

Finn

^ permalink raw reply

* Re: [PATCH v5] rfs: Receive Flow Steering
From: David Miller @ 2010-04-17  0:58 UTC (permalink / raw)
  To: therbert; +Cc: eric.dumazet, netdev
In-Reply-To: <i2s65634d661004161722hece6f9d4naf528c37b63fffbc@mail.gmail.com>

From: Tom Herbert <therbert@google.com>
Date: Fri, 16 Apr 2010 17:22:49 -0700

> Ugh, vmalloc.h must be sneaking in through some other header file for
> me :-(  Sorry about that.  Do you need me to respin the patch?

No, I took care of it and am about to push things out to net-next-2.6
on kernel.org

^ permalink raw reply

* [PATCH] KS8851: NULL pointer dereference if list is empty
From: Abraham Arce @ 2010-04-17  0:48 UTC (permalink / raw)
  To: netdev

Fix a NULL pointer dereference in ks8851_tx_work by checking whether the
dequeued skb is NULL (the list is already empty) before writing the packet
to the TX FIFO.

 Unable to handle kernel NULL pointer dereference at virtual address 00000050
 PC is at ks8851_tx_work+0xdc/0x1b0
 LR is at wait_for_common+0x148/0x164
 pc : [<c01c0df4>]    lr : [<c025a980>]    psr: 20000013
 Backtrace:
  ks8851_tx_work+0x0/0x1b0
  worker_thread+0x0/0x190
  kthread+0x0/0x90

Signed-off-by: Abraham Arce <x0066660@ti.com>
---
 drivers/net/ks8851.c |   12 +++++++-----
 1 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ks8851.c b/drivers/net/ks8851.c
index 13cc1ca..9e9f9b3 100644
--- a/drivers/net/ks8851.c
+++ b/drivers/net/ks8851.c
@@ -722,12 +722,14 @@ static void ks8851_tx_work(struct work_struct *work)
 		txb = skb_dequeue(&ks->txq);
 		last = skb_queue_empty(&ks->txq);

-		ks8851_wrreg16(ks, KS_RXQCR, ks->rc_rxqcr | RXQCR_SDA);
-		ks8851_wrpkt(ks, txb, last);
-		ks8851_wrreg16(ks, KS_RXQCR, ks->rc_rxqcr);
-		ks8851_wrreg16(ks, KS_TXQCR, TXQCR_METFE);
+		if (txb != NULL) {
+			ks8851_wrreg16(ks, KS_RXQCR, ks->rc_rxqcr | RXQCR_SDA);
+			ks8851_wrpkt(ks, txb, last);
+			ks8851_wrreg16(ks, KS_RXQCR, ks->rc_rxqcr);
+			ks8851_wrreg16(ks, KS_TXQCR, TXQCR_METFE);

-		ks8851_done_tx(ks, txb);
+			ks8851_done_tx(ks, txb);
+		}
 	}

 	mutex_unlock(&ks->lock);
-- 
1.5.4.3
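
The pattern the hunk above guards against can be shown outside the kernel. What follows is only a minimal userspace sketch, not the driver code: `struct pkt`, `struct pkt_queue`, `enqueue()`, `dequeue()`, `queue_empty()` and `drain()` are hypothetical stand-ins for the skb queue API, and the byte counter stands in for the TX FIFO write. The only kernel behaviour it relies on is that skb_dequeue() returns NULL when the list is empty.

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct sk_buff and struct sk_buff_head. */
struct pkt {
	struct pkt *next;
	int len;
};

struct pkt_queue {
	struct pkt *head;
	struct pkt *tail;
};

static int queue_empty(const struct pkt_queue *q)
{
	return q->head == NULL;
}

static void enqueue(struct pkt_queue *q, struct pkt *p)
{
	p->next = NULL;
	if (q->tail)
		q->tail->next = p;
	else
		q->head = p;
	q->tail = p;
}

/* Like skb_dequeue(): returns NULL when the queue is empty. */
static struct pkt *dequeue(struct pkt_queue *q)
{
	struct pkt *p = q->head;

	if (p) {
		q->head = p->next;
		if (!q->head)
			q->tail = NULL;
	}
	return p;
}

/* Drain the queue; returns the number of bytes "written". */
static int drain(struct pkt_queue *q)
{
	int written = 0;
	int last = queue_empty(q);

	while (!last) {
		struct pkt *p = dequeue(q);

		last = queue_empty(q);
		if (p != NULL)			/* same guard as the patch adds */
			written += p->len;	/* stand-in for the TX FIFO write */
	}
	return written;
}
```

In this single-threaded sketch dequeue() cannot return NULL inside the loop; in the driver the queue may presumably be emptied between the emptiness test and the dequeue, which is why the result is checked before it is used.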

^ permalink raw reply related

* Re: [PATCH v5] rfs: Receive Flow Steering
From: Tom Herbert @ 2010-04-17  0:22 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev
In-Reply-To: <20100416.155707.66748057.davem@davemloft.net>

Ugh, vmalloc.h must be sneaking in through some other header file for
me :-(  Sorry about that.  Do you need me to respin the patch?

Tom

On Fri, Apr 16, 2010 at 3:57 PM, David Miller <davem@davemloft.net> wrote:
> From: David Miller <davem@davemloft.net>
> Date: Fri, 16 Apr 2010 15:53:40 -0700 (PDT)
>
>> From: David Miller <davem@davemloft.net>
>> Date: Fri, 16 Apr 2010 15:49:32 -0700 (PDT)
>>
>>> Great, I'll add this to net-next-2.6 right now.
>>
>> I had to add an include of linux/vmalloc.h to net/core/sysctl_net_core.c
>> to fix the build while committing this.
>
> net/core/net-sysfs.c needed it too :-/
>

^ permalink raw reply

* Re: Network protocol (IP,IPv6,...) and TC actions (ACT_CSUM)
From: Grégoire Baron @ 2010-04-16 23:10 UTC (permalink / raw)
  To: netdev; +Cc: Jan Ceuleers
In-Reply-To: <4BC4AA5B.8040901@computer.org>

Thanks Jan, for the suggestion.

Hi,

I will re-explain my situation.

I started writing a new TC action (ACT_CSUM) to force an update of the
common checksums, especially when ACT_PEDIT is used:
 * the IPv4 header checksum,
 * the ICMP/IGMP and ICMPv6 checksums,
 * the TCP/UDP checksums,
 * and, why not, more ...

The idea is also to support IPv4 and IPv6 directly.
The best user interface (via iproute2/tc) would not ask the end user
to assume a specific network protocol, but would let the action
discover it.

With this aim, I would like to know whether someone could confirm that
the struct sk_buff .protocol member is a good candidate for discovering
whether the skb received by the TC action code (supporting INGRESS and
EGRESS) holds an IPv4 packet, an IPv6 packet or any other network protocol.

Indeed, this struct sk_buff member could contain something like
ETH_P_8021Q, which isn't a network protocol Id ...

I think this kind of content isn't seen by the TC actions, which work
at the network level (even if their filter protocol flag accepts all).
If someone could confirm, thanks in advance.
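
If the member can hold ETH_P_8021Q after all, one defensive option is to walk past any VLAN tags before deciding between IPv4 and IPv6. Below is a minimal userspace sketch of that walk on a raw Ethernet frame. It is not kernel code and does not answer what the TC layer actually passes in; the function name `l3_protocol()` is hypothetical, and the constants are redefined locally with their well-known values.

```c
#include <stddef.h>
#include <stdint.h>

#define ETH_P_IP    0x0800	/* IPv4 */
#define ETH_P_IPV6  0x86DD	/* IPv6 */
#define ETH_P_8021Q 0x8100	/* 802.1Q VLAN tag, not a network protocol */
#define ETH_HLEN    14		/* dst MAC + src MAC + EtherType */
#define VLAN_HLEN   4		/* TCI + encapsulated EtherType */

/*
 * Return the network protocol of an Ethernet frame in host byte order,
 * skipping any 802.1Q tags, and store the offset of the network header
 * in *l3_off. Returns 0 if the frame is too short.
 */
static uint16_t l3_protocol(const uint8_t *frame, size_t len, size_t *l3_off)
{
	size_t off = ETH_HLEN;
	uint16_t proto;

	if (len < ETH_HLEN)
		return 0;

	proto = (uint16_t)((frame[12] << 8) | frame[13]);
	while (proto == ETH_P_8021Q) {
		if (len < off + VLAN_HLEN)
			return 0;
		/* Skip the 2-byte TCI, read the encapsulated EtherType. */
		proto = (uint16_t)((frame[off + 2] << 8) | frame[off + 3]);
		off += VLAN_HLEN;
	}

	*l3_off = off;
	return proto;
}
```

A caller would compare the result against ETH_P_IP or ETH_P_IPV6 and leave the packet alone for anything else.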

In the same way, I've wondered whether the struct sk_buff .len member
could be used to avoid having to "discover" the network packet length in
the TC action code, especially in the case of IPv6 packets (and
jumbograms ;-). But I think not: in my view it might not hold during
INGRESS TC action execution, because the packet hasn't been delivered to
the network protocol yet. Is my analysis right?

Thanks again for your help.

Best Regards,

Grégoire Baron

On Tue, Apr 13, 2010 at 07:31:07PM +0200, Jan Ceuleers wrote:
> Grégoire Baron wrote:
> > As this .protocol member seems to be used at different moments when a
> > packet is received, forwarded or sent, and could contain something like
> > ETH_P_8021Q which isn't a network protocol Id, can we say the struct
> > sk_buff .protocol member is guaranteed to contain a network protocol Id
> > in the struct sk_buff used in the TC action executions ?
> 
> Grégoire,
> 
> I suggest that you ask your question on the netdev mailing list (netdev@vger.kernel.org).
> 
> Cheers, Jan

^ permalink raw reply

* Re: [PATCH v5] rfs: Receive Flow Steering
From: David Miller @ 2010-04-16 22:57 UTC (permalink / raw)
  To: eric.dumazet; +Cc: therbert, netdev
In-Reply-To: <20100416.155340.256882855.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Fri, 16 Apr 2010 15:53:40 -0700 (PDT)

> From: David Miller <davem@davemloft.net>
> Date: Fri, 16 Apr 2010 15:49:32 -0700 (PDT)
> 
>> Great, I'll add this to net-next-2.6 right now.
> 
> I had to add an include of linux/vmalloc.h to net/core/sysctl_net_core.c
> to fix the build while committing this.

net/core/net-sysfs.c needed it too :-/

^ permalink raw reply

* Re: [PATCH v5] rfs: Receive Flow Steering
From: David Miller @ 2010-04-16 22:53 UTC (permalink / raw)
  To: eric.dumazet; +Cc: therbert, netdev
In-Reply-To: <20100416.154932.147279343.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Fri, 16 Apr 2010 15:49:32 -0700 (PDT)

> Great, I'll add this to net-next-2.6 right now.

I had to add an include of linux/vmalloc.h to net/core/sysctl_net_core.c
to fix the build while committing this.

^ permalink raw reply

* Re: [PATCH v5] rfs: Receive Flow Steering
From: David Miller @ 2010-04-16 22:49 UTC (permalink / raw)
  To: eric.dumazet; +Cc: therbert, netdev
In-Reply-To: <1271446679.16881.4298.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 16 Apr 2010 21:37:59 +0200

> On Thursday 15 April 2010 at 23:33 -0700, David Miller wrote:
>> From: Tom Herbert <therbert@google.com>
>> Date: Thu, 15 Apr 2010 22:47:08 -0700 (PDT)
>> 
>> > Version 5 of RFS:
>> > - Moved rps_sock_flow_sysctl into net/core/sysctl_net_core.c as a
>> > static function.
>> > - Apply limits to rps_sock_flow_entries sysctl and rps_flow_count
>> > sysfs variable.
>> 
>> I've read this over a few times and I think it's ready to go into
>> net-next-2.6, we can tweak things as-needed from here on out.
>> 
>> Eric, what do you think?
> 
> I think I can give my Sob, and we have time to fully test it and tweak
> it if necessary.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Great, I'll add this to net-next-2.6 right now.

Thanks!

^ permalink raw reply

* Re: [PATCH net-2.6] packet : remove init_net restriction
From: David Miller @ 2010-04-16 22:41 UTC (permalink / raw)
  To: daniel.lezcano; +Cc: netdev
In-Reply-To: <1271322674-21726-1-git-send-email-daniel.lezcano@free.fr>

From: Daniel Lezcano <daniel.lezcano@free.fr>
Date: Thu, 15 Apr 2010 11:11:14 +0200

> The af_packet protocol is used by Perl to do ioctls as reported by
> Stephane Riviere:
> 
> "Net::RawIP relies on SIOCGIFADDR and SIOCGIFHWADDR to get the IP and MAC
> addresses of the network interface."
> 
> But in a new network namespace these ioctls fail because they are disabled
> for any namespace other than the init_net_ns.
> 
> These two lines should not be there, as af_inet and af_packet have been
> namespace aware for a long time now. I suppose we forgot to remove these
> lines because we sent the af_packet support first, before af_inet was
> supported.
> 
> Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
> Reported-by: Stephane Riviere <stephane.riviere@regis-dgac.net>

Applied, thanks!

^ permalink raw reply
