From mboxrd@z Thu Jan  1 00:00:00 1970
From: Neil Horman
Subject: Re: Multicast packet loss
Date: Mon, 2 Feb 2009 13:22:12 -0500
Message-ID: <20090202182212.GA17950@hmsreliant.think-freely.org>
References: <49833DBC.7040607@athenacr.com>
 <20090130200330.GA12659@hmsreliant.think-freely.org>
 <49837F56.2020502@athenacr.com>
 <49838213.90700@cosmosbay.com>
 <49859847.9010206@cosmosbay.com>
 <20090202134523.GA13369@hmsreliant.think-freely.org>
 <498725F4.2010205@cosmosbay.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: Kenny Chang, netdev@vger.kernel.org
To: Eric Dumazet
Return-path:
Received: from charlotte.tuxdriver.com ([70.61.120.58]:39754 "EHLO smtp.tuxdriver.com"
 rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755437AbZBBSWZ
 (ORCPT); Mon, 2 Feb 2009 13:22:25 -0500
Content-Disposition: inline
In-Reply-To: <498725F4.2010205@cosmosbay.com>
Sender: netdev-owner@vger.kernel.org
List-ID:

On Mon, Feb 02, 2009 at 05:57:24PM +0100, Eric Dumazet wrote:
> Neil Horman wrote:
> > On Sun, Feb 01, 2009 at 01:40:39PM +0100, Eric Dumazet wrote:
> >> Eric Dumazet wrote:
> >>> Kenny Chang wrote:
> >>>> Ah, sorry, here's the test program attached.
> >>>>
> >>>> We've tried 2.6.28.1, but no, we haven't tried the 2.6.28.2 or the
> >>>> 2.6.29-rcX.
> >>>>
> >>>> Right now, we are trying to step through the kernel versions until we
> >>>> see where the performance drops significantly. We'll try 2.6.29-rc soon
> >>>> and post the result.
> >> I tried your program on my dev machines and 2.6.29 (each machine: two quad-core cpus, 32-bit kernel)
> >>
> >> With 8 clients, about 10% packet loss.
> >>
> >> Might be a scheduling problem, not sure... 50,000 packets per second x 8 cpus = 400,000
> >> wakeups per second... But at least the UDP receive path seems OK.
> >>
> >> Thing is, the receiver (the softirq that queues the packet) seems to fight over the socket lock
> >> with the readers...
> >>
> >> I tried to set up IRQ affinities, but it doesn't work any more on bnx2 (unless using msi_disable=1)
> >>
> >> I tried playing with ethtool -C|c G|g params...
> >> And /proc/sys/net/core/rmem_max (and setsockopt(SO_RCVBUF) to set bigger receive buffers in your program)
> >>
> >> I can have 0% packet loss if booting with msi_disable and
> >>
> >> echo 1 >/proc/irq/16/smp_affinity
> >>
> >> (16 being the interrupt of the eth0 NIC)
> >>
> >> then, a second run gave me errors, about 2%, oh well...
> >>
> >>
> >> oprofile numbers without playing with IRQ affinities:
> >>
> >> CPU: Core 2, speed 2999.89 MHz (estimated)
> >> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
> >> samples  %        symbol name
> >> 327928  10.1427  schedule
> >> 259625   8.0301  mwait_idle
> >> 187337   5.7943  __skb_recv_datagram
> >> 109854   3.3977  lock_sock_nested
> >> 104713   3.2387  tick_nohz_stop_sched_tick
> >> 98831    3.0568  select_nohz_load_balancer
> >> 88163    2.7268  skb_release_data
> >> 78552    2.4296  update_curr
> >> 75241    2.3272  getnstimeofday
> >> 71400    2.2084  set_next_entity
> >> 67629    2.0917  get_next_timer_interrupt
> >> 67375    2.0839  sched_clock_tick
> >> 58112    1.7974  enqueue_entity
> >> 56462    1.7463  udp_recvmsg
> >> 55049    1.7026  copy_to_user
> >> 54277    1.6788  sched_clock_cpu
> >> 54031    1.6712  __copy_skb_header
> >> 51859    1.6040  __slab_free
> >> 51786    1.6017  prepare_to_wait_exclusive
> >> 51776    1.6014  sock_def_readable
> >> 50062    1.5484  try_to_wake_up
> >> 42182    1.3047  __switch_to
> >> 41631    1.2876  read_tsc
> >> 38337    1.1857  tick_nohz_restart_sched_tick
> >> 34358    1.0627  cpu_idle
> >> 34194    1.0576  native_sched_clock
> >> 33812    1.0458  pick_next_task_fair
> >> 33685    1.0419  resched_task
> >> 33340    1.0312  sys_recvfrom
> >> 33287    1.0296  dst_release
> >> 32439    1.0033  kmem_cache_free
> >> 32131    0.9938  hrtimer_start_range_ns
> >> 29807    0.9219  udp_queue_rcv_skb
> >> 27815    0.8603  task_rq_lock
> >> 26875    0.8312  __update_sched_clock
> >> 23912    0.7396  sock_queue_rcv_skb
> >> 21583    0.6676  __wake_up_sync
> >> 21001    0.6496  effective_load
> >> 20531    0.6350  hrtick_start_fair
> >>
> >>
> >> With IRQ affinities and msi_disable (no packet drops):
> >>
> >> CPU: Core 2, speed 3000.13 MHz (estimated)
> >> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
> >> samples  %        symbol name
> >> 79788   10.3815  schedule
> >> 69422    9.0328  mwait_idle
> >> 44877    5.8391  __skb_recv_datagram
> >> 28629    3.7250  tick_nohz_stop_sched_tick
> >> 27252    3.5459  select_nohz_load_balancer
> >> 24320    3.1644  lock_sock_nested
> >> 20833    2.7107  getnstimeofday
> >> 20666    2.6889  skb_release_data
> >> 18612    2.4217  set_next_entity
> >> 17785    2.3141  get_next_timer_interrupt
> >> 17691    2.3018  udp_recvmsg
> >> 17271    2.2472  sched_clock_tick
> >> 16032    2.0860  copy_to_user
> >> 14785    1.9237  update_curr
> >> 12512    1.6280  prepare_to_wait_exclusive
> >> 12498    1.6262  __slab_free
> >> 11380    1.4807  read_tsc
> >> 11145    1.4501  sched_clock_cpu
> >> 10598    1.3789  __switch_to
> >> 9588     1.2475  pick_next_task_fair
> >> 9480     1.2335  cpu_idle
> >> 9218     1.1994  sys_recvfrom
> >> 9008     1.1721  tick_nohz_restart_sched_tick
> >> 8977     1.1680  dst_release
> >> 8930     1.1619  native_sched_clock
> >> 8392     1.0919  kmem_cache_free
> >> 8124     1.0570  hrtimer_start_range_ns
> >> 7274     0.9464  bnx2_interrupt
> >> 7175     0.9336  __copy_skb_header
> >> 7006     0.9116  try_to_wake_up
> >> 6949     0.9042  sock_def_readable
> >> 6787     0.8831  enqueue_entity
> >> 6772     0.8811  __update_sched_clock
> >> 6349     0.8261  finish_task_switch
> >> 6164     0.8020  copy_from_user
> >> 5096     0.6631  resched_task
> >> 5007     0.6515  sysenter_past_esp
> >>
> >> I will try to investigate a little bit more in the following days if time permits.
> >>
> > I'm not 100% versed on this, but IIRC, some hardware simply can't set irq
> > affinity when operating in msi interrupt mode.
> > If this is the case with this particular bnx2 card, then I would expect some
> > packet loss, simply due to the constant cache misses. It would be interesting
> > to re-run your oprofile cases, counting L2 cache hits/misses (if your cpu
> > supports that class of counter) for bnx2 running in both msi-enabled and
> > msi-disabled mode. It would also be interesting to use a different card that
> > can set irq affinity, and compare loss with irqbalance on, and with irqbalance
> > off and irq affinity set to all cpus.
>
> booted with msi_disable=1, IRQ of eth0 handled by CPU0 only, so that
> oprofile results are sorted on CPU0 numbers.
>
> We can see the scheduler has a hard time coping with this workload with more
> than two CPUs.
>
> OK up to 30,000 (* 8 sockets) packets per second.
>
> CPU0 is 100% handling softirq (ksoftirqd/0)
>

This explains a lot. If the application is scheduled to run on the same cpu
that has the irq for the NIC bound to it, you get a perf boost from not having
to warm up two caches (one for the app cpu and one for the irq & softirq work),
but you lose that gain, and then some, fighting for cpu time. If both the app
and the irq are on the same cpu, and we spend this much time in softirq context,
we will eventually overflow higher up the network stack, as the application
doesn't have enough time to dequeue frames.

It may also speak to the need to make the bnx2 napi routine more efficient :)

Neil
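
P.S. For anyone who wants to poke at this from the application side, below is
a rough, untested sketch (not from Kenny's test program; the cpu number and
buffer size are arbitrary placeholders) of pinning the receiver to one cpu
with sched_setaffinity() and bumping its socket receive buffer with SO_RCVBUF.
Whether that cpu should be the one servicing the NIC interrupt, or a different
one, is exactly the tradeoff described above.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void)
{
	cpu_set_t mask;
	int sock;
	int rcvbuf = 4 * 1024 * 1024;   /* arbitrary size, capped by rmem_max */

	/* Pin this process to cpu 1 -- pick whichever cpu you want to test
	 * relative to the one /proc/irq/<N>/smp_affinity points the NIC at. */
	CPU_ZERO(&mask);
	CPU_SET(1, &mask);
	if (sched_setaffinity(0, sizeof(mask), &mask) < 0)
		perror("sched_setaffinity");

	sock = socket(AF_INET, SOCK_DGRAM, 0);
	if (sock < 0) {
		perror("socket");
		return 1;
	}

	/* Ask for a larger receive queue; the kernel caps the request at
	 * /proc/sys/net/core/rmem_max unless SO_RCVBUFFORCE is used. */
	if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)) < 0)
		perror("setsockopt(SO_RCVBUF)");

	/* ... bind(), join the multicast group with IP_ADD_MEMBERSHIP, and
	 * recvfrom() in a loop, as in the original test program ... */

	close(sock);
	return 0;
}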