Netdev List


* Duplicate IP false alerts from arping
From: unni krishnan @ 2010-04-16  6:51 UTC (permalink / raw)
  To: netdev

Hi,

I am trying to find a duplicate IP in the network using arping.

-------------------------
[root@vps1 ~]# ping -c 3 192.168.1.212
PING 192.168.1.212 (192.168.1.212) 56(84) bytes of data.
64 bytes from 192.168.1.212: icmp_seq=1 ttl=64 time=1.33 ms
64 bytes from 192.168.1.212: icmp_seq=2 ttl=64 time=0.280 ms
64 bytes from 192.168.1.212: icmp_seq=3 ttl=64 time=0.306 ms

--- 192.168.1.212 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.280/0.641/1.339/0.494 ms
[root@vps1 ~]# arping -D -I eth0 -c 5 192.168.1.212 ; echo $?
ARPING 192.168.1.212 from 0.0.0.0 eth0
0
-------------------------


As per arping, that IP is a duplicate. But if I go ahead and ifdown the
IP in the known location, I can't ping that IP (does that mean the IP is
not duplicated?). This is the result after shutting down the IP.

--------------------------
[root@vps1 ~]# ping -c 3 192.168.1.212
PING 192.168.1.212 (192.168.1.212) 56(84) bytes of data.
From 192.168.1.63 icmp_seq=1 Destination Host Unreachable
From 192.168.1.63 icmp_seq=2 Destination Host Unreachable
From 192.168.1.63 icmp_seq=3 Destination Host Unreachable

--- 192.168.1.212 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2001ms
, pipe 3
[root@vps1 ~]# arping -D -I eth0 -c 5 192.168.1.212 ; echo $?
ARPING 192.168.1.212 from 0.0.0.0 eth0
Sent 5 probes (5 broadcast(s))
Received 0 response(s)
0
[root@vps1 ~]#
--------------------------

My question is: in this case the IP 192.168.1.212 is not duplicated, but
arping still gives a duplicate status. Why is that?

-- 
Regards,
Unni
http://mutexes.org/
http://twitter.com/webofunni


* Re: [PATCH v5] rfs: Receive Flow Steering
From: Eric Dumazet @ 2010-04-16  6:56 UTC (permalink / raw)
  To: David Miller; +Cc: therbert, netdev
In-Reply-To: <20100415.233334.242114544.davem@davemloft.net>

On Thursday, 15 April 2010 at 23:33 -0700, David Miller wrote:
> From: Tom Herbert <therbert@google.com>
> Date: Thu, 15 Apr 2010 22:47:08 -0700 (PDT)
> 
> > Version 5 of RFS:
> > - Moved rps_sock_flow_sysctl into net/core/sysctl_net_core.c as a
> > static function.
> > - Apply limits to rps_sock_flow_entries sysctl and rps_flow_count
> > sysfs variable.
> 
> I've read this over a few times and I think it's ready to go into
> net-next-2.6, we can tweak things as-needed from here on out.
> 
> Eric, what do you think?

I read the patch and found no errors.

I booted a test machine and ran some tests.

I am a bit worried about a tbench regression I am looking at right now.

With RFS disabled, tbench 16   ->  4408.63 MB/sec


# grep . /sys/class/net/lo/queues/rx-0/*
/sys/class/net/lo/queues/rx-0/rps_cpus:00000000
/sys/class/net/lo/queues/rx-0/rps_flow_cnt:8192
# cat /proc/sys/net/core/rps_sock_flow_entries
8192


echo ffff >/sys/class/net/lo/queues/rx-0/rps_cpus

tbench 16 -> 2336.32 MB/sec


-----------------------------------------------------------------------------------------------------------------------------------------------------
   PerfTop:   14561 irqs/sec  kernel:86.3% [1000Hz cycles],  (all, 16 CPUs)
-----------------------------------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                       DSO
             _______ _____ ______________________________ __________________________________________________________

             2664.00  5.1% copy_user_generic_string       /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
             2323.00  4.4% acpi_os_read_port              /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
             1641.00  3.1% _raw_spin_lock_irqsave         /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
             1260.00  2.4% schedule                       /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
             1159.00  2.2% _raw_spin_lock                 /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
             1051.00  2.0% tcp_ack                        /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
              991.00  1.9% tcp_sendmsg                    /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
              922.00  1.8% tcp_recvmsg                    /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
              821.00  1.6% child_run                      /usr/bin/tbench                                           
              766.00  1.5% all_string_sub                 /usr/bin/tbench                                           
              630.00  1.2% __switch_to                    /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
              608.00  1.2% __GI_strchr                    /lib/tls/libc-2.3.4.so                                    
              606.00  1.2% ipt_do_table                   /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
              600.00  1.1% __GI_strstr                    /lib/tls/libc-2.3.4.so                                    
              556.00  1.1% __netif_receive_skb            /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
              504.00  1.0% tcp_transmit_skb               /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
              502.00  1.0% tick_nohz_stop_sched_tick      /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
              481.00  0.9% _raw_spin_unlock_irqrestore    /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
              473.00  0.9% next_token                     /usr/bin/tbench                                           
              449.00  0.9% ip_rcv                         /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
              423.00  0.8% call_function_single_interrupt /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
              422.00  0.8% ia32_sysenter_target           /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
              420.00  0.8% compat_sys_socketcall          /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
              401.00  0.8% mod_timer                      /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
              400.00  0.8% process_backlog                /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
              399.00  0.8% ip_queue_xmit                  /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
              387.00  0.7% select_task_rq_fair            /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
              377.00  0.7% _raw_spin_lock_bh              /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
              360.00  0.7% tcp_v4_rcv                     /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux

But if RFS is on, why does activating rps_cpus change tbench?
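
The `ffff` written to rps_cpus above is a hexadecimal CPU bitmask; on this 16-CPU machine it selects every CPU. A minimal sketch of decoding such a mask, assuming it fits in 64 bits and has no comma-separated groups (the helper name is hypothetical, not kernel code):

```c
#include <stdlib.h>

/* Count the CPUs enabled by a hexadecimal rps_cpus-style mask string.
 * Assumption: the mask fits in 64 bits and contains no comma-separated
 * groups. Hypothetical helper for illustration only. */
int rps_mask_cpu_count(const char *hex_mask)
{
	unsigned long long mask = strtoull(hex_mask, NULL, 16);
	int count = 0;

	while (mask) {
		mask &= mask - 1;	/* clear the lowest set bit */
		count++;
	}
	return count;
}
```

Calling `rps_mask_cpu_count("ffff")` counts the sixteen low bits, while the default `00000000` mask enables no CPUs, which is the RPS-off case measured first.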





* Re: rps perfomance WAS(Re: rps: question
From: Andi Kleen @ 2010-04-16  7:15 UTC (permalink / raw)
  To: Changli Gao
  Cc: Eric Dumazet, hadi, Rick Jones, David Miller, therbert, netdev,
	robert, andi
In-Reply-To: <o2v412e6f7f1004152302j1aca5edam9d53d01781ddbe9d@mail.gmail.com>

> > Come on Changli.
> >
> > How do you wake up a thread on a remote cpu ?
> >
> 
> resched IPI, apparently. But it is async absolutely. and its IRQ
> handler is lighter.

It shouldn't be a lot lighter than the new fancy "queued smp_call_function"
that's in the tree for a few releases. So it would surprise me if it made
much difference. In the old days when there was only a single lock for
s_c_f() perhaps...

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH v5] rfs: Receive Flow Steering
From: Eric Dumazet @ 2010-04-16  7:18 UTC (permalink / raw)
  To: David Miller; +Cc: therbert, netdev
In-Reply-To: <1271401007.16881.3762.camel@edumazet-laptop>

On Friday, 16 April 2010 at 08:56 +0200, Eric Dumazet wrote:

> I read the patch and found no error.
> 
> I booted a test machine and performed some tests
> 
> I am a bit worried of a tbench regression I am looking at right now.
> 
> if RFS disabled , tbench 16   ->  4408.63 MB/sec 
> 
> 
> # grep . /sys/class/net/lo/queues/rx-0/*
> /sys/class/net/lo/queues/rx-0/rps_cpus:00000000
> /sys/class/net/lo/queues/rx-0/rps_flow_cnt:8192
> # cat /proc/sys/net/core/rps_sock_flow_entries
> 8192
> 
> 
> echo ffff >/sys/class/net/lo/queues/rx-0/rps_cpus
> 
> tbench 16 -> 2336.32 MB/sec
> 
> 
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>    PerfTop:   14561 irqs/sec  kernel:86.3% [1000Hz cycles],  (all, 16 CPUs)
> -----------------------------------------------------------------------------------------------------------------------------------------------------
> 
>              samples  pcnt function                       DSO
>              _______ _____ ______________________________ __________________________________________________________
> 
>              2664.00  5.1% copy_user_generic_string       /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              2323.00  4.4% acpi_os_read_port              /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              1641.00  3.1% _raw_spin_lock_irqsave         /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              1260.00  2.4% schedule                       /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              1159.00  2.2% _raw_spin_lock                 /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>              1051.00  2.0% tcp_ack                        /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>               991.00  1.9% tcp_sendmsg                    /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>               922.00  1.8% tcp_recvmsg                    /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>               821.00  1.6% child_run                      /usr/bin/tbench                                           
>               766.00  1.5% all_string_sub                 /usr/bin/tbench                                           
>               630.00  1.2% __switch_to                    /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>               608.00  1.2% __GI_strchr                    /lib/tls/libc-2.3.4.so                                    
>               606.00  1.2% ipt_do_table                   /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>               600.00  1.1% __GI_strstr                    /lib/tls/libc-2.3.4.so                                    
>               556.00  1.1% __netif_receive_skb            /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>               504.00  1.0% tcp_transmit_skb               /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>               502.00  1.0% tick_nohz_stop_sched_tick      /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>               481.00  0.9% _raw_spin_unlock_irqrestore    /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>               473.00  0.9% next_token                     /usr/bin/tbench                                           
>               449.00  0.9% ip_rcv                         /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>               423.00  0.8% call_function_single_interrupt /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>               422.00  0.8% ia32_sysenter_target           /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>               420.00  0.8% compat_sys_socketcall          /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>               401.00  0.8% mod_timer                      /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>               400.00  0.8% process_backlog                /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>               399.00  0.8% ip_queue_xmit                  /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>               387.00  0.7% select_task_rq_fair            /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>               377.00  0.7% _raw_spin_lock_bh              /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
>               360.00  0.7% tcp_v4_rcv                     /lib/modules/2.6.34-rc3-03375-ga4fbf84-dirty/build/vmlinux
> 
> But if RFS is on, why activating rps_cpus change tbench ?
> 

Hmm, I wonder if it's not an artifact of net-next-2.6 being a bit old
(versus linux-2.6). I know the scheduler guys did some tweaks.

Because apparently, some cpus are idle part of the time (30% ???)

Or a new bug in cpu accounting, reporting idle time while cpus are
busy....

# vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
16  0      0 5670264  13280  63392    0    0     2     1 1512  227 12 47 41  0
18  0      0 5669396  13280  63392    0    0     0     0 657952 1606102 14 58 28  0
17  0      0 5668776  13288  63392    0    0     0    12 656701 1606369 14 58 28  0
18  0      0 5669644  13288  63392    0    0     0     0 657636 1603960 15 57 28  0
17  0      0 5670900  13288  63392    0    0     0     0 666425 1584847 15 56 29  0
15  0      0 5669164  13288  63392    0    0     0     0 682578 1472616 14 56 30  0
16  0      0 5669412  13288  63392    0    0     0     0 695767 1506302 14 54 32  0
14  0      0 5668916  13296  63396    0    0     4   148 685286 1482897 14 56 30  0
17  0      0 5669784  13296  63396    0    0     0     0 683910 1477994 14 56 30  0
18  0      0 5670032  13296  63396    0    0     0     0 692023 1497195 14 55 31  0
16  0      0 5669040  13296  63396    0    0     0     0 677477 1468157 14 56 30  0
16  0      0 5668916  13312  63396    0    0     0    32 489358 1048553 14 57 30  0
18  0      0 5667924  13320  63396    0    0     0    12 424787 897145 15 55 29  0

RFS off :

# vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
24  0      0 5669624  13632  63476    0    0     2     1  261   82 12 48 40  0
26  0      0 5669492  13632  63476    0    0     0     0 4223 1740651 21 71  7  0
23  0      0 5669864  13640  63476    0    0     0    12 4205 1731882 21 71  8  0
23  0      0 5670484  13640  63476    0    0     0     0 4176 1733448 21 71  8  0
24  0      0 5670588  13640  63476    0    0     0     0 4176 1733845 21 72  7  0
21  0      0 5671084  13640  63476    0    0     0     0 4200 1734990 20 73  7  0
23  0      0 5671580  13640  63476    0    0     0     0 4168 1735100 21 71  8  0
23  0      0 5671704  13640  63480    0    0     4   132 4221 1733428 21 72  7  0
22  0      0 5671952  13640  63480    0    0     0     0 4190 1730370 21 72  8  0
20  0      0 5672292  13640  63480    0    0     0     0 4212 1732084 22 70  8  0





* Re: [PATCH v5] rfs: Receive Flow Steering
From: David Miller @ 2010-04-16  7:26 UTC (permalink / raw)
  To: eric.dumazet; +Cc: therbert, netdev
In-Reply-To: <1271402283.16881.3791.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 16 Apr 2010 09:18:03 +0200

> Hmm, I wonder if its not an artifact of net-next-2.6 being a bit old
> (versus linux-2.6). I know scheduler guys did some tweaks.

I synced net-next-2.6 up with Linus's current tree just a day
or two ago when I pulled net-2.6 into net-next-2.6.


* Re: Network multiqueue question
From: George B. @ 2010-04-16  7:28 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1271393633.16881.3606.camel@edumazet-laptop>

On Thu, Apr 15, 2010 at 9:53 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thursday, 15 April 2010 at 21:00 -0700, George B. wrote:

> What kind of traffic do your machines manage exactly ?

Content to mobile devices (cell phones and such). More detail sent privately.

> On server, you use two ports of the same kind (same number of queues) ?

Yes, same kind.  We try to make everything identical.  Fewer problems that way.

George


* Re: [PATCH v5] rfs: Receive Flow Steering
From: Eric Dumazet @ 2010-04-16  7:48 UTC (permalink / raw)
  To: David Miller; +Cc: therbert, netdev
In-Reply-To: <20100416.002632.83844236.davem@davemloft.net>

On Friday, 16 April 2010 at 00:26 -0700, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Fri, 16 Apr 2010 09:18:03 +0200
> 
> > Hmm, I wonder if its not an artifact of net-next-2.6 being a bit old
> > (versus linux-2.6). I know scheduler guys did some tweaks.
> 
> I synced net-next-2.6 up with Linus's current tree just a day
> or two ago when I pulled net-2.6 into net-next-2.6.

OK thanks :)

Tom, please add __read_mostly to rps_sock_flow_table:

struct rps_sock_flow_table *rps_sock_flow_table __read_mostly;

I'll spend some hours today tracking down the problem.
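
For readers unfamiliar with the annotation: a read-mostly marker asks the toolchain to group rarely written variables into their own section, so they do not share cache lines with frequently written data. A rough userspace sketch of the idea, assuming GCC or Clang on an ELF target; the macro, section name, struct, and function below are all hypothetical stand-ins, not the kernel's definitions:

```c
#include <stddef.h>

/* Hypothetical userspace stand-in for a read-mostly annotation.
 * Assumption: GCC or Clang on an ELF target; the section name is
 * illustrative and is not the one the kernel uses. */
#define read_mostly_sketch __attribute__((section(".data.read_mostly_sketch")))

struct flow_table_sketch {
	unsigned int mask;
};

/* Written once at setup time, then only read on the hot path. */
struct flow_table_sketch *flow_table_ptr read_mostly_sketch = NULL;

static struct flow_table_sketch table = { .mask = 8191 };

/* Publish the table pointer; later readers only dereference it. */
struct flow_table_sketch *flow_table_setup(void)
{
	flow_table_ptr = &table;
	return flow_table_ptr;
}
```

The annotation changes only where the pointer is placed in memory, not how it is used, which is why it can be added to an existing declaration without touching its callers.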





* Re: HTB - What's the minimal value for 'rate' parameter?
From: Antonio Almeida @ 2010-04-16 11:56 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: netdev, kaber, davem, devik
In-Reply-To: <4BC63766.5080104@gmail.com>

Now I understand. It makes sense - totally! Thanks for your endurance
trying to open my eyes :)
I've been trying rates bigger than 100bit for a while and it's working fine.
Thanks a lot for your illustration!

Regards
  Antonio Almeida



On Wed, Apr 14, 2010 at 10:45 PM, Jarek Poplawski wrote:
> Antonio Almeida wrote, On 04/14/2010 12:22 PM:
>
>> What do you mean with "1:2 has grandchildren with overflown rate tables"?
>> I couldn't understand your idea. Is there any mistake in the
>> configuration I sent?
>> How would you set rates for this particular example?
>
>
> class htb 1:1 root rate 1000Mbit ceil 1000Mbit
> class htb 1:2 parent 1:1 rate 4096Kbit ceil 4096Kbit
> class htb 1:10 parent 1:2 rate 1024Kbit ceil 4096Kbit
> class htb 1:11 parent 1:2 rate 1024Kbit ceil 4096Kbit
> class htb 1:101 parent 1:10 prio 3 rate 8bit ceil 4096Kbit
> class htb 1:111 parent 1:11 prio 3 rate 8bit ceil 4096Kbit
>
> Classes 1:101 and 1:111 have too low rates, which causes wrong (overflowed!)
> values in their rate tables, so their rates could be practically
> uncontrollable. They are limited by their ceils instead, so something like:
>
> class htb 1:101 parent 1:10 leaf 101: prio 3 rate 4096Kbit ceil 4096Kbit
> class htb 1:111 parent 1:11 leaf 111: prio 3 rate 4096Kbit ceil 4096Kbit
>
> But then their guaranteed rates are higher than their parents, and the
> sum is higher than grandparent's rate, which means the config is wrong.
> (You have to control these sums - HTB doesn't.)
>
> As I wrote before, the minimal (overflow safe) rate depends on max
> packet size, and for 1500 byte it would be something around:
> 1500b/2min, so if your clients can wait so long, try this:
>
> class htb 1:101 parent 1:10 leaf 101: prio 3 rate 100bit ceil 4096Kbit
> class htb 1:111 parent 1:11 leaf 111: prio 3 rate 100bit ceil 4096Kbit
>
> Regards,
> Jarek P.
>
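
The estimate quoted above (1500 bytes per 2 minutes) can be checked with a few lines of arithmetic. This is only a sketch: the two-minute bound is taken from Jarek's message as given and is not derived here, and the function names are hypothetical.

```c
/* Lowest rate, in bits per second, at which one packet of pkt_bytes
 * can still be sent within max_seconds. */
unsigned long min_rate_bits_per_sec(unsigned long pkt_bytes,
				    unsigned long max_seconds)
{
	return (pkt_bytes * 8) / max_seconds;
}

/* Time, in seconds, needed to send one packet of pkt_bytes at
 * rate_bits_per_sec. */
unsigned long seconds_per_packet(unsigned long pkt_bytes,
				 unsigned long rate_bits_per_sec)
{
	return (pkt_bytes * 8) / rate_bits_per_sec;
}
```

With 1500-byte packets and a 120-second bound, the first function gives 100 bit/s, matching the suggested `rate 100bit`. At the original `rate 8bit`, the second function gives 1500 seconds per packet, far beyond that bound.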


* Re: [PATCH v5] rfs: Receive Flow Steering
From: Andi Kleen @ 2010-04-16 11:57 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev, eric.dumazet
In-Reply-To: <alpine.DEB.1.00.1004152243470.15102@pokey.mtv.corp.google.com>

Tom Herbert <therbert@google.com> writes:
> +
> +		/*
> +		 * If the desired CPU (where last recvmsg was done) is
> +		 * different from current CPU (one in the rx-queue flow
> +		 * table entry), switch if one of the following holds:
> +		 *   - Current CPU is unset (equal to RPS_NO_CPU).
> +		 *   - Current CPU is offline.
> +		 *   - The current CPU's queue tail has advanced beyond the
> +		 *     last packet that was enqueued using this table entry.
> +		 *     This guarantees that all previous packets for the flow
> +		 *     have been dequeued, thus preserving in order delivery.
> +		 */
> +		if (unlikely(tcpu != next_cpu) &&
> +		    (tcpu == RPS_NO_CPU || !cpu_online(tcpu) ||
> +		     ((int)(per_cpu(softnet_data, tcpu).input_queue_head -

One thing I've been wondering while reading is whether this should be made
socket or SMT aware.

If you're on a hyperthreaded system, sending an IPI
to your core sibling, which has a completely shared cache hierarchy,
might not be the best use of cycles.

The same could potentially be true for a shared L2 or shared L3 cache
(e.g. only redirect flows between different sockets)

Have you ever considered that?

This is of course something that could be addressed post-merge, not
a blocker.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
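
The in-order condition in the quoted patch comment casts a difference of queue counters to `int`. A minimal sketch of that kind of wraparound-safe comparison follows. The quoted excerpt is cut off before the second operand, so the parameter names here are assumptions, not the patch's identifiers.

```c
#include <stdbool.h>

/* Wraparound-safe "has the queue advanced past this point?" check on
 * free-running 32-bit counters.
 * Assumption: queue_head counts packets dequeued so far on the old CPU,
 * and last_enqueued_tail is the counter value recorded when the flow's
 * most recent packet was queued there. Both names are hypothetical. */
bool flow_drained(unsigned int queue_head, unsigned int last_enqueued_tail)
{
	/* Unsigned subtraction wraps modulo 2^32; reading the difference
	 * as signed tells which counter is ahead, provided the two are
	 * less than 2^31 apart. The unsigned-to-int conversion is
	 * implementation-defined in C; GCC and Clang wrap as assumed. */
	return (int)(queue_head - last_enqueued_tail) >= 0;
}
```

Comparing the signed difference rather than the raw values keeps the check correct after the counters wrap, for example with a head of 3 and a recorded tail of 0xfffffffe.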


* Re: [PATCH] drivers/net/pcmcia/3c574_cs: fixing stats.tx_bytes counter
From: Dominik Brodowski @ 2010-04-16 13:01 UTC (permalink / raw)
  To: Alexander Kurz, David S. Miller; +Cc: netdev
In-Reply-To: <alpine.DEB.1.10.1003312014120.9974@blala.de>

David,

as this is more netdev-related than PCMCIA-related, could you pick it up?
Else, I'm willing to take it upstream, but would prefer your ACK on this.

Thanks & best wishes,

	Dominik

From: Alexander Kurz <akurz@blala.de>
Date: Wed, 31 Mar 2010 20:21:29 +0400
Subject: [PATCH] net: 3c574_cs fix stats.tx_bytes counter

Update the stats counter calculation in 3c574_cs, similar
to the method used in 3c589_cs. This corrects the contents
of the counter on tests using a "Megahertz 574B" card.

[linux@dominikbrodowski.net: clean up commit message]
Signed-off-by: Alexander Kurz <linux@kbdbabel.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>

diff --git a/drivers/net/pcmcia/3c574_cs.c b/drivers/net/pcmcia/3c574_cs.c
index 727bb38..30b7cf7 100644
--- a/drivers/net/pcmcia/3c574_cs.c
+++ b/drivers/net/pcmcia/3c574_cs.c
@@ -772,8 +772,13 @@ static netdev_tx_t el3_start_xmit(struct sk_buff *skb,
 		  inw(ioaddr + EL3_STATUS));

 	spin_lock_irqsave(&lp->window_lock, flags);
+
+	dev->stats.tx_bytes += skb->len;
+
+	/* Put out the doubleword header... */
 	outw(skb->len, ioaddr + TX_FIFO);
 	outw(0, ioaddr + TX_FIFO);
+	/* ... and the packet rounded to a doubleword. */
 	outsl(ioaddr + TX_FIFO, skb->data, (skb->len+3)>>2);

 	dev->trans_start = jiffies;
@@ -1012,8 +1017,6 @@ static void update_stats(struct net_device *dev)
 	/* BadSSD */				   inb(ioaddr + 12);
 	up					 = inb(ioaddr + 13);

-	dev->stats.tx_bytes 			+= tx + ((up & 0xf0) << 12);
-
 	EL3WINDOW(1);
 }
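
The patch above counts `skb->len` in tx_bytes while the FIFO write is padded to whole doublewords. A small sketch of that rounding, mirroring the `(skb->len+3)>>2` expression in the diff (the function name is hypothetical):

```c
/* Number of 32-bit doublewords needed to hold len bytes, rounding up.
 * Mirrors the (len + 3) >> 2 expression used for the FIFO transfer. */
unsigned int dwords_for_len(unsigned int len)
{
	return (len + 3) >> 2;
}
```

A 61-byte packet is transferred as 16 doublewords (64 bytes), but the statistics counter should still advance by 61, which is why the patch adds the packet length itself and not the padded transfer size.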


* [PATCH] tg3: Fix INTx fallback when MSI fails
From: Andre Detsch @ 2010-04-16 13:15 UTC (permalink / raw)
  To: netdev, Matt Carlson

tg3: Fix INTx fallback when MSI fails

MSI setup changes the value of some key attributes of struct tg3 *tp.
These attributes must be taken into account and restored before
we try to do a new request_irq for INTx fallback.

In powerpc, the original code was leading to an EINVAL return within
request_irq, because the driver was trying to use the disabled MSI
virtual irq number instead of tp->pdev->irq.

Signed-off-by: Andre Detsch <adetsch@br.ibm.com>

---
Tested on powerpc, but should be safe for other architectures as well.

Index: linux-2.6.34-rc4/drivers/net/tg3.c
===================================================================
--- linux-2.6.34-rc4.orig/drivers/net/tg3.c	2010-04-12 21:41:35.000000000 -0400
+++ linux-2.6.34-rc4/drivers/net/tg3.c	2010-04-15 20:37:41.000000000 -0400
@@ -8633,6 +8633,9 @@ static int tg3_test_msi(struct tg3 *tp)
 	pci_disable_msi(tp->pdev);

 	tp->tg3_flags2 &= ~TG3_FLG2_USING_MSI;
+	tp->irq_cnt = 1;
+	tp->napi[0].irq_vec = tp->pdev->irq;
+	tp->dev->real_num_tx_queues = 1;

 	err = tg3_request_irq(tp, 0);
 	if (err)


* Re: rps perfomance WAS(Re: rps: question
From: jamal @ 2010-04-16 13:21 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Changli Gao, Rick Jones, David Miller, therbert, netdev, robert,
	andi
In-Reply-To: <1271395106.16881.3645.camel@edumazet-laptop>

[-- Attachment #1: Type: text/plain, Size: 1231 bytes --]

On Fri, 2010-04-16 at 07:18 +0200, Eric Dumazet wrote:

> 
> A kernel module might do this, this could be integrated in perf bench so
> that we can regression tests upcoming kernels.

Perf would be good - but even a softnet_stat counter cleaner than the nasty
hack i use (attached) would be a good start; the ping with and without
rps gives me a ballpark number.

IPI is important to me because i tried it before and failed
miserably. I was thinking the improvement may be due to the hardware used
but i am having a hard time getting people to tell me what hardware they
used! I am old school - I need data;-> The RFS patch commit seems to
have more info but is still vague, for example:
"The benefits of RFS are dependent on cache hierarchy, application
load, and other factors"
Also, what does a "simple" or "complex" benchmark mean?;->
I think it is only fair to get this info, no?

Please don't consider what i say above as being anti-RPS.
5 microsec extra latency is not bad if it can be amortized.
Unfortunately, the best traffic i could generate was < 20Kpps of
ping which still manages to get 1 IPI/packet on Nehalem. I am going
to write up some app (lots of cycles available tomorrow). I still think
it is valuable.

cheers,
jamal

[-- Attachment #2: p1 --]
[-- Type: text/x-patch, Size: 1551 bytes --]

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d1a21b5..f8267fc 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -224,6 +224,7 @@ struct netif_rx_stats {
 	unsigned time_squeeze;
 	unsigned cpu_collision;
 	unsigned received_rps;
+	unsigned ipi_rps;
 };

 DECLARE_PER_CPU(struct netif_rx_stats, netdev_rx_stat);
diff --git a/kernel/smp.c b/kernel/smp.c
index 9867b6b..8c5dcb7 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -11,6 +11,7 @@
 #include <linux/init.h>
 #include <linux/smp.h>
 #include <linux/cpu.h>
+#include <linux/netdevice.h>

 static struct {
 	struct list_head	queue;
@@ -158,7 +159,10 @@ void generic_exec_single(int cpu, struct call_single_data *data, int wait)
 	 * equipped to do the right thing...
 	 */
 	if (ipi)
+{
 		arch_send_call_function_single_ipi(cpu);
+		__get_cpu_var(netdev_rx_stat).ipi_rps++;
+}

 	if (wait)
 		csd_lock_wait(data);
diff --git a/net/core/dev.c b/net/core/dev.c
index b98ddc6..0bbbdcf 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3563,10 +3563,12 @@ static int softnet_seq_show(struct seq_file *seq, void *v)
 {
 	struct netif_rx_stats *s = v;

-	seq_printf(seq, "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
+	seq_printf(seq, "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
 		   s->total, s->dropped, s->time_squeeze, 0,
 		   0, 0, 0, 0, /* was fastroute */
-		   s->cpu_collision, s->received_rps);
+		   s->cpu_collision, s->received_rps, s->ipi_rps);
+	s->ipi_rps = 0;
+	s->received_rps = 0;
 	return 0;
 }
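
With the patch above applied, each line of the softnet statistics output would carry eleven hexadecimal columns, the last being the new ipi_rps counter. A userspace sketch of reading one such line, assuming exactly the format string shown in the diff (the struct and function names are hypothetical):

```c
#include <stdio.h>

/* Fields of interest from one line of the patched output. */
struct softnet_sample {
	unsigned int total, dropped, time_squeeze;
	unsigned int cpu_collision, received_rps, ipi_rps;
};

/* Parse one line in the 11-column format printed by the patched
 * softnet_seq_show(). Columns 4-8 are always printed as zero and are
 * skipped. Returns 0 on success, -1 if fewer than 11 fields are found. */
int parse_softnet_line(const char *line, struct softnet_sample *out)
{
	int n = sscanf(line, "%x %x %x %*x %*x %*x %*x %*x %x %x %x",
		       &out->total, &out->dropped, &out->time_squeeze,
		       &out->cpu_collision, &out->received_rps,
		       &out->ipi_rps);

	return n == 6 ? 0 : -1;
}
```

Because the patch zeroes received_rps and ipi_rps after printing them, each read would report the counts accumulated since the previous read, not running totals.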


* Re: rps perfomance WAS(Re: rps: question
From: jamal @ 2010-04-16 13:27 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Changli Gao, Eric Dumazet, Rick Jones, David Miller, therbert,
	netdev, robert
In-Reply-To: <20100416071522.GY18855@one.firstfloor.org>

On Fri, 2010-04-16 at 09:15 +0200, Andi Kleen wrote:

> > resched IPI, apparently. But it is async absolutely. and its IRQ
> > handler is lighter.
> 
> It shouldn't be a lot lighter than the new fancy "queued smp_call_function"
> that's in the tree for a few releases. So it would surprise me if it made
> much difference. In the old days when there was only a single lock for
> s_c_f() perhaps...

So you are saying that the old implementation of IPI (likely what i
tried pre-napi and as recent as 2-3 years ago) was bad because of a
single lock?

BTW, I directed some questions to you earlier but didn't get a response,
to quote:
---
On IPIs:
Is anyone familiar with what is going on with Nehalem? Why is it this
good? I expect things will get a lot nastier with other hardware like
xeon based or even Nehalem with rps going across QPI.
Here's why i think IPIs are bad, please correct me if i am wrong:
- they are synchronous. i.e an IPI issuer has to wait for an ACK (which
is in the form of an IPI).
- data cache has to be synced to main memory
- the instruction pipeline is flushed
- what else did i miss? Andi?
---

Do you know of any specs i could read that would tell me a little more?

cheers,
jamal




* Re: [PATCH v5] rfs: Receive Flow Steering
From: jamal @ 2010-04-16 13:32 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Tom Herbert, davem, netdev, eric.dumazet
In-Reply-To: <87r5mf8va9.fsf@basil.nowhere.org>

On Fri, 2010-04-16 at 13:57 +0200, Andi Kleen wrote:

> One thing I've been wondering while reading if this should be made
> socket or SMT aware.
> 
> If you're on a hyperthreaded system and sending a IPI
> to your core sibling, which has a completely shared cache hierarchy,
> might not be the best use of cycles.
> 
> The same could potentially true for shared L2 or shared L3 cache
> (e.g. only redirect flows between different sockets)
> 
> Have you ever considered that?
> 

How are you going to schedule the net softirq on an empty queue if you
do this?
BTW, in my tests sending an IPI to an SMT sibling or to another core
didn't make any difference in terms of latency - still 5 microsecs.
I don't have dual Nehalem where we have to cross QPI - there i suspect
it will be longer than 5 microsecs.

cheers,
jamal


* Re: rps perfomance WAS(Re: rps: question
From: Changli Gao @ 2010-04-16 13:34 UTC (permalink / raw)
  To: hadi; +Cc: Eric Dumazet, Rick Jones, David Miller, therbert, netdev, robert,
	andi
In-Reply-To: <1271424065.4606.31.camel@bigi>

On Fri, Apr 16, 2010 at 9:21 PM, jamal <hadi@cyberus.ca> wrote:
> On Fri, 2010-04-16 at 07:18 +0200, Eric Dumazet wrote:
>
>>
>> A kernel module might do this, this could be integrated in perf bench so
>> that we can regression tests upcoming kernels.
>
> Perf would be good - but even a softnet_stat counter cleaner than the nasty
> hack i use (attached) would be a good start; the ping with and without
> rps gives me a ballpark number.
>
> IPI is important to me because i tried it before and failed
> miserably. I was thinking the improvement may be due to the hardware used
> but i am having a hard time getting people to tell me what hardware they
> used! I am old school - I need data;-> The RFS patch commit seems to
> have more info but is still vague, for example:
> "The benefits of RFS are dependent on cache hierarchy, application
> load, and other factors"
> Also, what does a "simple" or "complex" benchmark mean?;->
> I think it is only fair to get this info, no?
>
> Please don't consider what i say above as being anti-RPS.
> 5 microsec extra latency is not bad if it can be amortized.
> Unfortunately, the best traffic i could generate was < 20Kpps of
> ping which still manages to get 1 IPI/packet on Nehalem. I am going
> to write up some app (lots of cycles available tomorrow). I still think
> it is valuable.
>

+	seq_printf(seq, "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
 		   s->total, s->dropped, s->time_squeeze, 0,
 		   0, 0, 0, 0, /* was fastroute */
-		   s->cpu_collision, s->received_rps);
+		   s->cpu_collision, s->received_rps, s->ipi_rps);

Do you mean that received_rps is equal to ipi_rps? received_rps is the
number of IPIs used by RPS, and ipi_rps is the number of IPIs sent by
the function generic_exec_single(). If there is no other user of
generic_exec_single(), received_rps should be equal to ipi_rps.

@@ -158,7 +159,10 @@ void generic_exec_single(int cpu, struct call_single_data *data, int wait)
 	 * equipped to do the right thing...
 	 */
 	if (ipi)
+{
 		arch_send_call_function_single_ipi(cpu);
+		__get_cpu_var(netdev_rx_stat).ipi_rps++;
+}


-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: "kernel:nf_ct_icmp: bad HW ICMP checksum" too noisy
From: Benny Amorsen @ 2010-04-16 13:36 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: netdev, Netfilter Development Mailinglist
In-Reply-To: <4BC723EE.9090504@trash.net>

Patrick McHardy <kaber@trash.net> writes:

> You should only see that message when nf_conntrack_log_invalid is
> active.

True, I can disable that one as a work-around.


/Benny


^ permalink raw reply

* Re: rps perfomance WAS(Re: rps: question
From: Andi Kleen @ 2010-04-16 13:37 UTC (permalink / raw)
  To: jamal
  Cc: Andi Kleen, Changli Gao, Eric Dumazet, Rick Jones, David Miller,
	therbert, netdev, robert
In-Reply-To: <1271424455.4606.39.camel@bigi>

On Fri, Apr 16, 2010 at 09:27:35AM -0400, jamal wrote:
> On Fri, 2010-04-16 at 09:15 +0200, Andi Kleen wrote:
> 
> > > resched IPI, apparently. But it is async absolutely. and its IRQ
> > > handler is lighter.
> > 
> > It shouldn't be a lot lighter than the new fancy "queued smp_call_function"
> > that's in the tree for a few releases. So it would surprise me if it made
> > much difference. In the old days when there was only a single lock for
> > s_c_f() perhaps...
> 
> So you are saying that the old implementation of IPI (likely what i
> tried pre-napi and as recent as 2-3 years ago) was bad because of a
> single lock?

Yes.

The old implementation of smp_call_function. Also in the really old
days there was no smp_call_function_single() so you tended to broadcast.

Jens did a lot of work on this for the IPI implementation in his block device work.

> On IPIs:
> Is anyone familiar with what is going on with Nehalem? Why is it this
> good? I expect things will get a lot nastier with other hardware like
> xeon based or even Nehalem with rps going across QPI.

Nehalem is just fast. I don't know why it's fast in your specific
case. It might be simply because it has lots of bandwidth everywhere.
Atomic operations are also faster than on previous Intel CPUs.


> Here's why i think IPIs are bad, please correct me if i am wrong:
> - they are synchronous. i.e an IPI issuer has to wait for an ACK (which
> is in the form of an IPI).

In the hardware there's no ack, but in the Linux implementation there
usually is one (because the sender needs to know when to free the stack
state used to pass information).

However there's also now support for queued IPI
with a special API (I believe Tom is using that)

> - data cache has to be synced to main memory
> - the instruction pipeline is flushed

At least on Nehalem, data transfer can often go through the cache.

IPIs involve APIC accesses which are not very fast (so overall
it's far more than a pipeline worth of work), but it's still
not an incredibly expensive operation.

There's also X2APIC now, which should be slightly faster, but it's
likely not in your Nehalem (it is only in the high-end Xeon versions).

> Do you know any specs i could read up which will tell me a little more?

If you're just interested in IPI and cache line transfer performance it's
probably best to just measure it.

Some general information is always in the Intel optimization guide.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply

* Re: rps perfomance WAS(Re: rps: question
From: jamal @ 2010-04-16 13:42 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Changli Gao, Rick Jones, David Miller, therbert, netdev, robert,
	andi
In-Reply-To: <1271399547.16881.3724.camel@edumazet-laptop>

On Fri, 2010-04-16 at 08:32 +0200, Eric Dumazet wrote:
> Le vendredi 16 avril 2010 à 14:02 +0800, Changli Gao a écrit :
> 
> > resched IPI, apparently. But it is async absolutely. and its IRQ
> > handler is lighter.
> > 
> 
> You still dont answer to the question, and your claims are not grounded
> by hard facts, but by your interpretation of code.

My understanding of the current scheduler is that it does use IPIs to
migrate tasks around - so that's why things may be working for Changli,
i.e. it is scheduler magic if you use kthreads. It is hard to say if this
would work better...

cheers,
jamal


^ permalink raw reply

* Re: [PATCH v5] rfs: Receive Flow Steering
From: Andi Kleen @ 2010-04-16 13:42 UTC (permalink / raw)
  To: jamal; +Cc: Andi Kleen, Tom Herbert, davem, netdev, eric.dumazet
In-Reply-To: <1271424726.4606.42.camel@bigi>

On Fri, Apr 16, 2010 at 09:32:06AM -0400, jamal wrote:
> How are you going to schedule the net softirq on an empty queue if you
> do this?

Sorry don't understand the question? 

You can always do the flow as if rps was not there.

> BTW, in my tests sending an IPI to an SMT sibling or to another core
> didnt make any difference in terms of latency - still 5 microsecs.
> I dont have dual Nehalem where we have to cross QPI - there i suspect
> it will be longer than 5 microsecs.

I meant an IPI to a sibling is not useful. You send the IPI
to get cache locality in the target, but if the target has the same
cache locality as you, you might as well avoid the cost of the IPI
and process directly.

For a thread sibling I'm pretty sure it's useless. Not fully sure about
a socket sibling. Maybe.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply

* Re: rps perfomance WAS(Re: rps: question
From: jamal @ 2010-04-16 13:49 UTC (permalink / raw)
  To: Changli Gao
  Cc: Eric Dumazet, Rick Jones, David Miller, therbert, netdev, robert,
	andi
In-Reply-To: <k2h412e6f7f1004160634v88440075p6e4cb2404abdb006@mail.gmail.com>

On Fri, 2010-04-16 at 21:34 +0800, Changli Gao wrote:

> 
> +	seq_printf(seq, "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
>  		   s->total, s->dropped, s->time_squeeze, 0,
>  		   0, 0, 0, 0, /* was fastroute */
> -		   s->cpu_collision, s->received_rps);
> +		   s->cpu_collision, s->received_rps, s->ipi_rps);
> 
> Do you mean that received_rps is equal to ipi_rps? received_rps is the
> number of IPIs used by RPS. And ipi_rps is the number of IPIs sent by
> function generic_exec_single(). If there is no other user of
> generic_exec_single(), received_rps should be equal to ipi_rps.
> 

my observation is:
s->total is the sum of all packets received by the cpu (some directly
from ethernet).
s->received_rps is the count of packets the receiving cpu saw that were
sent by another cpu.
s->ipi_rps is the number of times we tried to enqueue to a remote cpu,
found its queue empty, and had to send an IPI.
ipi_rps can be < received_rps if we receive > 1 packet without
generating an IPI. What did i miss?

cheers,
jamal

^ permalink raw reply

* Re: [PATCH] rdma/cm: Randomize local port allocation.
From: Tetsuo Handa @ 2010-04-16 13:54 UTC (permalink / raw)
  To: amwang-H+wXaHxf7aLQT0dZR+AlfA, sean.hefty-ral2JQCrhuEAvxtiuMwx3w
  Cc: opurdila-+zzKsuq53OdBDgjK7y7TUQ,
	eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w,
	netdev-u79uwXL29TY76Z2rM5mHXA, nhorman-2XuSBdqkA4R54TAoqtyWWQ,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	rolandd-FYB4Gu1CFyUAvxtiuMwx3w, linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <4BC7C9CF.20403-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Cong Wang wrote:
> Sean Hefty wrote:
> > I like this version, thanks!  I'm not sure which tree to merge it through.
> > Are you needing this for 2.6.34, or is 2.6.35 okay?
> > 
> 
> As soon as possible, so 2.6.34. :)
> 
Cong, the merge window for 2.6.34 has already closed.
You need to target your patchset at 2.6.35 (using the net-next-2.6 tree)
rather than 2.6.34 (using the linux-2.6 tree). Therefore, having this patch
queued for 2.6.35 (through the net-next-2.6 tree) should be okay for you.

^ permalink raw reply

* Re: [PATCH] mac8390: fix pr_info() calls and change return code
From: Finn Thain @ 2010-04-16 13:57 UTC (permalink / raw)
  To: Joe Perches; +Cc: David Miller, p_gortmaker, netdev, linux-kernel, linux-m68k
In-Reply-To: <1271392454.2298.37.camel@Joe-Laptop.home>

On Thu, 15 Apr 2010, Joe Perches wrote:

> ...Why is it better to use -EBUSY?

Nubus slots are geographically addressed and their irqs are equally
inflexible. -EAGAIN is misleading because retrying will not help fix
whatever bug caused the irq to be unavailable.

> ...It'd be better to prefix this with the driver name
> or use something like netdev_dbg with #define DEBUG
> otherwise it's "huh? what device emits this message?"
> when reading the logs.
> 
> Something like:
> 	printk(KERN_DEBUG pr_fmt("reset not supported\n"));

Thanks for the suggestion. I'll resend.

> ...unnecessary conversion.

I guess some prefer consistency, some prefer symmetry.

Finn

^ permalink raw reply

* Re: rps perfomance WAS(Re: rps: question
From: jamal @ 2010-04-16 13:58 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Changli Gao, Eric Dumazet, Rick Jones, David Miller, therbert,
	netdev, robert
In-Reply-To: <20100416133707.GZ18855@one.firstfloor.org>

On Fri, 2010-04-16 at 15:37 +0200, Andi Kleen wrote:
> On Fri, Apr 16, 2010 at 09:27:35AM -0400, jamal wrote:

> > So you are saying that the old implementation of IPI (likely what i
> > tried pre-napi and as recent as 2-3 years ago) was bad because of a
> > single lock?
> 
> Yes.

> The old implementation of smp_call_function. Also in the really old
> days there was no smp_call_function_single() so you tended to broadcast.
> 
> Jens did a lot of work on this for his block device work IPI implementation.

Nice - thanks for that info! So not only has h/ware improved, but
implementation as well..

> > On IPIs:
> > Is anyone familiar with what is going on with Nehalem? Why is it this
> > good? I expect things will get a lot nastier with other hardware like
> > xeon based or even Nehalem with rps going across QPI.
> 
> Nehalem is just fast. I don't know why it's fast in your specific
> case. It might be simply because it has lots of bandwidth everywhere.
> Atomic operations are also faster than on previous Intel CPUs.

Well, the cache architecture is nicer. The on-die MC is nice. No more
shared MC hub/FSB. The 3 MC channels are nice. Intel finally beating
AMD ;-> Someone did a measurement of the memory timings (L1, L2, L3, MM)
and the results were impressive; i have the numbers somewhere.

> 
> > Here's why i think IPIs are bad, please correct me if i am wrong:
> > - they are synchronous. i.e an IPI issuer has to wait for an ACK (which
> > is in the form of an IPI).
> 
> In the hardware there's no ack, but in the Linux implementation there
> usually is one (because the sender needs to know when to free the stack
> state used to pass information).
>
> However there's also now support for queued IPI
> with a special API (I believe Tom is using that)
> 

Which is the non-queued-IPI call?

> > - data cache has to be synced to main memory
> > - the instruction pipeline is flushed
> 
> At least on Nehalem, data transfer can often go through the cache.

I thought you have to go all the way to MM in case of IPIs.

> IPIs involve APIC accesses which are not very fast (so overall
> it's far more than a pipeline worth of work), but it's still
> not an incredibly expensive operation.
> 
> There's also X2APIC now, which should be slightly faster, but it's
> likely not in your Nehalem (it is only in the high-end Xeon versions).
> 

Ok, true - forgot about the APIC as well...

> > Do you know any specs i could read up which will tell me a little more?
> 
> If you're just interested in IPI and cache line transfer performance it's
> probably best to just measure it.

There are tools like benchit which would give me L1,2,3,MM measurements;
for IPI the ping + rps test i did may be sufficient.

> Some general information is always in the Intel optimization guide.

Thanks Andi!

cheers,
jamal


^ permalink raw reply

* Re: [PATCH v5] rfs: Receive Flow Steering
From: jamal @ 2010-04-16 14:05 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andi Kleen, Tom Herbert, davem, netdev, eric.dumazet
In-Reply-To: <20100416134236.GA18855@one.firstfloor.org>

On Fri, 2010-04-16 at 15:42 +0200, Andi Kleen wrote:
> On Fri, Apr 16, 2010 at 09:32:06AM -0400, jamal wrote:
> > How are you going to schedule the net softirq on an empty queue if you
> > do this?
> 
> Sorry don't understand the question? 
> 
> You can always do the flow as if rps was not there.

Meaning you schedule the other side netrx softirq if queue is empty?

> I meant an IPI to a sibling is not useful. You send the IPI
> to get cache locality in the target, but if the target has the same
> cache locality as you, you might as well avoid the cost of the IPI
> and process directly.
> 

Isn't the purpose of the IPI to signal the remote side that there's
something for it to do? Does it also sync the remote cache?

> For a thread sibling I'm pretty sure it's useless. Not fully sure about
> a socket sibling. Maybe.
> 

Agreed, the SMT threads share L2. All the cores share L3. And it is
inclusive, so if a line is in the L1 of one thread it must also be
present in the shared L2 as well as in L3. Across the QPI i don't think
that is true.
But if you special-case this - aren't you being specific to Nehalem?

cheers,
jamal

^ permalink raw reply

* Re: rps perfomance WAS(Re: rps: question
From: Changli Gao @ 2010-04-16 14:10 UTC (permalink / raw)
  To: hadi; +Cc: Eric Dumazet, Rick Jones, David Miller, therbert, netdev, robert,
	andi
In-Reply-To: <1271425753.4606.65.camel@bigi>

On Fri, Apr 16, 2010 at 9:49 PM, jamal <hadi@cyberus.ca> wrote:
> On Fri, 2010-04-16 at 21:34 +0800, Changli Gao wrote:
>
>
> my observation is:
> s->total is the sum of all packets received by cpu (some directly from
> ethernet)

It is currently meaningless. If rps is enabled, it may be twice the
number of packets received, because one packet may be counted twice:
once in enqueue_to_backlog() and once in __netif_receive_skb(). I have
posted a patch to solve this problem.

http://patchwork.ozlabs.org/patch/50217/

If you don't apply my patch, you'd better refer to /proc/net/dev for
the total number.

> s->received_rps was what the count receiver cpu saw incoming if they
> were sent by another cpu.

Maybe its name confused you.

/* Called from hardirq (IPI) context */
static void trigger_softirq(void *data)
{
        struct softnet_data *queue = data;
        __napi_schedule(&queue->backlog);
        __get_cpu_var(netdev_rx_stat).received_rps++;
}

The function above is called from the IPI's IRQ handler. It counts the
number of IPIs received, so it is actually the ipi_rps you need.

> s-> ipi_rps is the times we tried to enq to remote cpu but found it to
> be empty and had to send an IPI.
> ipi_rps can be < received_rps if we receive > 1 packet without
> generating an IPI. What did i miss?
>

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply
