From mboxrd@z Thu Jan 1 00:00:00 1970 From: Eric Dumazet Subject: Re: udp ping pong measurements from 2.6.22 to .30 with various cpu affinities Date: Thu, 23 Apr 2009 19:04:04 +0200 Message-ID: <49F09F84.10108@cosmosbay.com> References: <49EF85E3.8060703@cosmosbay.com> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: David Miller , Michael Chan , Ben Hutchings , netdev@vger.kernel.org To: Christoph Lameter Return-path: Received: from gw1.cosmosbay.com ([212.99.114.194]:39435 "EHLO gw1.cosmosbay.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752547AbZDWRE5 convert rfc822-to-8bit (ORCPT ); Thu, 23 Apr 2009 13:04:57 -0400 In-Reply-To: Sender: netdev-owner@vger.kernel.org List-ID: Christoph Lameter a =E9crit : > On Wed, 22 Apr 2009, Eric Dumazet wrote: >=20 >> Check /proc/cpuinfo, and check it doesnt change between kernel versi= on. >=20 > Hmmm.... It does not since it depends on the way that the machine is > configured by the firmware. >=20 >>> Results follow (a nice diagram is available from >>> http://gentwo.org/results/udpping-tests-2.pdf >> Nice graphs, but lack of documentation of test conditions. >=20 > What would you like to know? =46or example, "cat /proc/cpuinfo", just to be sure you did not invert = cpus, since your results are not that logical. "size vmlinux-2.6.2[2-9] vmlinux-2.6.30" to check how kernel text size = increase. We add new features, security knobs, fixes, ... and this probably incre= ase number of instructions taken in a particular workload, who knows ? What was the size of packets used on this last test ? iptables active or not ? conntrack ? Overall, many unknown factors that mean we have to guess what really is happening. >=20 >> We already pointed it was probably scheduling. Since ICMP pings dont= use >> processes and no regression here. Patching kernel to implement udppi= ng >> at softirq level should be quite easy if you really want to check UD= P stack. >> >> Last network improvements focused on scalability more than latencies= =2E >> (multi-flows, not single flow !) >=20 > Latencies have priority here. Multi-flows are secondary. Amen :) >=20 > So I guess this means that you are okay with the network stacks laten= cy > creep? Even if its only 1.5usec: In practice you tune the NIC to perf= orm > much better and this is a latency increase likely occurring in every > packet transmission. You keep mentioning "network latency" increases, while your test obviou= sly stress both network stack, scheduler, nic driver, memory allocator,=20 time infrastructure, and many kernel services. 1.5 us is about 4500 cycles, -> ~1000 instructions or so, not counting hardware issues. Thats would not be a surprise given vmlinux text size. number of cycles per network udpping round-trip can also be estimated w= ith oprofile (one NMI every 10000 CPU_CLK_UNHALTED events) # egrep NMI|eth1 /proc/interrupts 34: 6371754 0 0 0 0 0 = 0 0 PCI-MSI-edge eth1 NMI: 8831219 141645 96321 113255 26232 44575 = 64762 23292 Non-maskable interrupts # udpping -n 100000 192.168.20.110 udp ping 0.0.0.0:9001 -> 192.168.20.110:9000 100000 samples .... 10286138.24us (93.16us/102.86us/10896.49us) # egrep "Non-maskable interrupts|eth1" /proc/interrupts 34: 6581203 0 0 0 0 0 = 0 0 PCI-MSI-edge eth1 NMI: 9167306 150313 97909 119027 31965 46523 = 68006 25070 Non-maskable interrupts delta of eth1 irq =3D 6581203 - 6371754 =3D 209449 (2*100000 + residu= al trafic on this host) and delta of NMI on CPU0 =3D 336087 -> 336087*10000 CPU_CLK_UNHALTED / 100000 packets -> 33608 cycles per packet (only on ring 0 : kernel) Shouldnt it be about 10 us ? BTW on my machines, latest linux-2.6.git, with bnx2.msi_disabled=3D1, I= have : # udpping -l 40 -n 100000 192.168.20.110 udp ping 0.0.0.0:9001 -> 192.168.20.110:9000 100000 samples ... 7891442.22us (72.22us/78.91us/501.14us) with bnx2.msi_disable=3D0 : # udpping -l 40 -n 100000 192.168.20.110 udp ping 0.0.0.0:9001 -> 192.168.20.110:9000 100000 samples .... 7644318.91us (66.90us/76.44us/4036.34us) So its quite a bit less than your numbers. (and I do use netfiler, bond= ing, vlan...) # opreport -la vmlinux-2.6.30-rc3 | head -n 50 CPU: Core 2, speed 3000.36 MHz (estimated) Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a u= nit mask of 0x00 (Unhalted core cycles) count 100000 samples cum. samples % cum. % symbol name 294498 294498 18.5543 18.5543 mwait_idle 63149 357647 3.9786 22.5329 __schedule 50071 407718 3.1546 25.6875 irq_entries_start 49798 457516 3.1374 28.8249 bnx2_poll_work 38037 495553 2.3964 31.2214 apic_timer_interrupt 36896 532449 2.3246 33.5459 sched_clock_cpu 27671 560120 1.7434 35.2893 find_busiest_group 25694 585814 1.6188 36.9081 bnx2_start_xmit 25516 611330 1.6076 38.5157 ipt_do_table 24312 635642 1.5317 40.0474 scheduler_tick 22728 658370 1.4319 41.4794 __switch_to 21672 680042 1.3654 42.8448 bnx2_msi 20957 700999 1.3204 44.1651 native_sched_clock 20156 721155 1.2699 45.4350 read_tsc 18813 739968 1.1853 46.6203 handle_edge_irq 17396 757364 1.0960 47.7163 __skb_recv_datagram 15010 772374 0.9457 48.6620 copy_to_user 14954 787328 0.9421 49.6041 udp_recvmsg 14557 801885 0.9171 50.5213 dev_queue_xmit 13727 815612 0.8648 51.3861 udp_sendmsg 12849 828461 0.8095 52.1956 cpu_idle 12632 841093 0.7959 52.9915 dequeue_task_fair 12145 853238 0.7652 53.7567 __alloc_skb 12135 865373 0.7645 54.5212 __qdisc_run 11528 876901 0.7263 55.2475 ip_append_data 10847 887748 0.6834 55.9309 getnstimeofday 10733 898481 0.6762 56.6071 netif_receive_skb 10598 909079 0.6677 57.2748 sysenter_past_esp 10557 919636 0.6651 57.9399 prepare_to_wait_exclusive 10162 929798 0.6402 58.5802 finish_task_switch 10120 939918 0.6376 59.2178 __do_softirq 9778 949696 0.6160 59.8338 ip_push_pending_frames 9594 959290 0.6045 60.4383 hrtimer_interrupt 9199 968489 0.5796 61.0178 csum_partial_copy_generic 9181 977670 0.5784 61.5963 copy_from_user 8641 986311 0.5444 62.1407 ktime_get_ts 8513 994824 0.5363 62.6770 __udp4_lib_rcv 8479 1003303 0.5342 63.2112 tick_sched_timer 8445 1011748 0.5321 63.7433 tick_check_oneshot_broadcast 8319 1020067 0.5241 64.2674 ip_rcv 8267 1028334 0.5208 64.7882 enqueue_task_fair 8010 1036344 0.5047 65.2929 rb_erase 7937 1044281 0.5001 65.7930 kfree 7923 1052204 0.4992 66.2921 ktime_get 7891 1060095 0.4972 66.7893 bnx2_poll 7702 1067797 0.4853 67.2745 update_curr 7629 1075426 0.4807 67.7552 try_to_wake_up opannotate extracts : c0202f40 : /* irq_entries_start total: 55704 3.146= 7 */ =2E.. 55361 3.1273 :c0203000: push $0x35 251 0.0142 :c0203002: jmp c020301a =2E.. 80 0.0045 :c020301a: jmp c0203340 Using NO_HZ mode gives higher latencies [root@svivoipvnx002 ~]# udpping -n 1000000 192.168.20.110 udp ping 0.0.0.0:9001 -> 192.168.20.110:9000 1000000 samples .... 101270858.22us (88.74us/101.27us/18833.35us) and in this NO_HZ case, 'network part' is obviously not the more consum= ing area Using opcontrol -p=3Dcpu to separate events for each cpu, and giving he= re CPU0 only events : CPU: Core 2, speed 3000.03 MHz (estimated) Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a u= nit mask of 0x00 (Unhalted core cycles) count 10000 Samples on CPU 0 samples cum. samples % cum. % symbol name 34593 34593 7.3643 7.3643 mwait_idle 21077 55670 4.4870 11.8513 __schedule 17648 73318 3.7570 15.6083 irq_entries_start 14354 87672 3.0557 18.6640 bnx2_poll_work 13873 101545 2.9533 21.6174 getnstimeofday 12146 113691 2.5857 24.2031 rb_get_reader_page 11810 125501 2.5142 26.7172 get_next_timer_interrupt 11062 136563 2.3549 29.0722 copy_to_user (ud= pping recvfrom()) 9980 146543 2.1246 31.1968 __switch_to 9011 155554 1.9183 33.1151 sched_clock_cpu 8081 163635 1.7203 34.8354 ring_buffer_consume (opro= file) 8045 171680 1.7127 36.5480 read_tsc 6717 178397 1.4299 37.9780 tick_nohz_stop_sched_tick 6696 185093 1.4255 39.4035 udp_recvmsg 6679 191772 1.4219 40.8253 __skb_recv_datagram 6322 198094 1.3459 42.1712 bnx2_msi 5821 203915 1.2392 43.4104 native_sched_clock 5269 209184 1.1217 44.5321 udp_sendmsg 4584 213768 0.9759 45.5079 dev_queue_xmit 3918 217686 0.8341 46.3420 copy_from_user 3887 221573 0.8275 47.1695 ip_append_data 3838 225411 0.8171 47.9865 sched_clock_tick 3806 229217 0.8102 48.7968 __alloc_skb 3680 232897 0.7834 49.5802 dequeue_task_fair 3635 236532 0.7738 50.3540 rb_erase 3330 239862 0.7089 51.0629 ipt_do_table 3300 243162 0.7025 51.7655 pick_next_task_fair 3259 246421 0.6938 52.4592 __hrtimer_start_range_ns 3164 249585 0.6736 53.1328 tick_check_idle 3107 252692 0.6614 53.7942 bnx2_start_xmit 3088 255780 0.6574 54.4516 prepare_to_wait_exclusive 3046 258826 0.6484 55.1001 sysenter_past_esp 2940 261766 0.6259 55.7260 finish_task_switch 2903 264669 0.6180 56.3440 handle_edge_irq 2848 267517 0.6063 56.9503 netif_receive_skb 2747 270264 0.5848 57.5351 __qdisc_run 2714 272978 0.5778 58.1128 rb_advance_reader 2649 275627 0.5639 58.6768 update_curr 2481 278108 0.5282 59.2049 sys_socketcall 2467 280575 0.5252 59.7301 ktime_get 2396 282971 0.5101 60.2402 kfree 2372 285343 0.5050 60.7451 enqueue_task_fair If using INST_RETIRED_ANY_P (number of instructions retired) oprofiling= : CPU: Core 2, speed 3000.03 MHz (estimated) Counted INST_RETIRED_ANY_P events (number of instructions retired) with= a unit mask of 0x00 (No unit mask) count 10000 Samples on CPU 0 samples cum. samples % cum. % symbol name 45654 45654 12.6972 12.6972 get_next_timer_interrupt (!= !!) 14583 60237 4.0558 16.7530 ipt_do_table 11819 72056 3.2871 20.0400 getnstimeofday 9096 81152 2.5298 22.5698 sched_clock_cpu 7534 88686 2.0953 24.6651 bnx2_poll_work 7163 95849 1.9922 26.6573 ring_buffer_consume 7089 102938 1.9716 28.6289 read_tsc 7006 109944 1.9485 30.5774 __schedule 6012 115956 1.6720 32.2494 csum_partial_copy_generic 5870 121826 1.6326 33.8820 rb_get_reader_page 5718 127544 1.5903 35.4722 tick_nohz_stop_sched_tick 5373 132917 1.4943 36.9666 ktime_get 5003 137920 1.3914 38.3580 dev_queue_xmit 4402 142322 1.2243 39.5823 set_normalized_timespec 4332 146654 1.2048 40.7871 clockevents_program_event 4263 150917 1.1856 41.9727 udp_sendmsg 4175 155092 1.1611 43.1338 copy_to_user 4113 159205 1.1439 44.2777 native_sched_clock 3920 163125 1.0902 45.3679 ip_append_data 3710 166835 1.0318 46.3998 tick_check_idle 3541 170376 0.9848 47.3846 handle_edge_irq 3335 173711 0.9275 48.3121 __hrtimer_start_range_ns 3319 177030 0.9231 49.2352 sched_clock_tick 3296 180326 0.9167 50.1519 nf_iterate 3225 183551 0.8969 51.0488 ktime_get_ts 3041 186592 0.8458 51.8945 rb_event_length 2954 189546 0.8216 52.7161 __qdisc_run 2840 192386 0.7899 53.5060 udp_recvmsg 2838 195224 0.7893 54.2952 copy_from_user 2780 198004 0.7732 55.0684 bnx2_start_xmit 2778 200782 0.7726 55.8410 ip_push_pending_frames 2672 203454 0.7431 56.5842 __switch_to 2423 205877 0.6739 57.2580 native_apic_mem_write 2349 208226 0.6533 57.9113 __do_softirq 2344 210570 0.6519 58.5632 net_rx_action 2309 212879 0.6422 59.2054 kfree 2226 215105 0.6191 59.8245 udp_push_pending_frames 2212 217317 0.6152 60.4397 __ip_route_output_key 2081 219398 0.5788 61.0185 __kmalloc_track_caller 2071 221469 0.5760 61.5944 dequeue_task_fair 2068 223537 0.5751 62.1696 ip_finish_output 2044 225581 0.5685 62.7381 __alloc_skb I believe all this *time* services seem really expensive.