From: Kenny Chang
Subject: Re: Multicast packet loss
Date: Thu, 26 Feb 2009 12:15:37 -0500
Message-ID: <49A6CE39.5050200@athenacr.com>
In-Reply-To: <20090204012144.GC3650@localhost.localdomain>
To: netdev@vger.kernel.org

Neil Horman wrote:
> On Tue, Feb 03, 2009 at 12:34:54PM -0500, Kenny Chang wrote:
>> Eric Dumazet wrote:
>>> Wes Chow a écrit :
>>>> Eric Dumazet wrote:
>>>>> Wes Chow a écrit :
>>>>>> (I'm Kenny's colleague, and I've been doing the kernel builds)
>>>>>>
>>>>>> First I'd like to note that there were a lot of bnx2 NAPI changes
>>>>>> between 2.6.21 and 2.6.22. As a reminder, 2.6.21 shows tiny amounts
>>>>>> of packet loss, whereas loss in 2.6.22 is significant.
>>>>>>
>>>>>> Second, some CPU affinity info: if I do like Eric and pin all of the
>>>>>> apps onto a single CPU, I see no packet loss. Also, I do *not* see
>>>>>> ksoftirqd show up in top at all!
>>>>>>
>>>>>> If I pin half the processes on one CPU and the other half on another
>>>>>> CPU, one ksoftirqd process shows up in top and completely pegs one
>>>>>> CPU. My packet loss in that case is significant (25%).
>>>>>>
>>>>>> Now, the strange case: if I pin 3 processes to one CPU and 1 process
>>>>>> to another, I get about 25% packet loss and ksoftirqd pins one CPU.
>>>>>> However, one of the apps takes significantly less CPU than the others,
>>>>>> and all apps lose the *exact same number of packets*. In all other
>>>>>> situations where we see packet loss, the actual number lost per
>>>>>> application instance appears random.
>>>>> You see the same number of packets lost because they are lost at the NIC level.
>>>> Understood.
>>>>
>>>> I have a new observation: if I pin processes to just CPUs 0 and 1, I see
>>>> no packet loss. Pinning to 0 and 2, I do see packet loss. Pinning to 2 and
>>>> 3, no packet loss. 4 & 5: no packet loss; 6 & 7: no packet loss. Any
>>>> other combination appears to produce loss (though I have not tried all
>>>> 28 combinations, this seems to be the case).
>>>>
>>>> At first I thought maybe it had to do with processes pinned to the same
>>>> CPU, but different cores. The machine is a dual quad core, which means
>>>> that CPUs 0-3 should be one physical CPU, correct? Pinning to 0/2 and 0/3
>>>> produces packet loss.
>>> a quad core is really a 2 x 2 core
>>>
>>> The L2 cache is split into two blocks, one block used by CPU0/1, the other by
>>> CPU2/3.
>>>
>>> You are at the limit of the machine with such a workload, so as soon as your
>>> CPUs have to transfer 64-byte lines between those two L2 blocks, you lose.
>>>
>>>> I've also noticed that it does not matter which of the working pairs I
>>>> pin to.
>>>> For example, pinning 5 processes in any combination on
>>>> 0/1 produces no packet loss, and pinning all 5 to just CPU 0 also produces no
>>>> packet loss.
>>>>
>>>> The failures are also sudden. In all of the working cases mentioned
>>>> above, I don't see ksoftirqd in top at all. But when I run 6 processes
>>>> on a single CPU, ksoftirqd shoots up to 100% and I lose a huge number of
>>>> packets.
>>>>
>>>>> Normally, softirq runs on the same cpu (the one handling the hard irq)
>>>> What determines which CPU the hard irq occurs on?
>>>>
>>> Check /proc/irq/{irqnumber}/smp_affinity
>>>
>>> If you want IRQ16 only served by CPU0:
>>>
>>> echo 1 >/proc/irq/16/smp_affinity
>>>
>> Hi everyone,
>>
>> -snip-
>>
>> Correct me if I'm wrong, but from what we've seen, it looks like it's
>> pointing to some inefficiency in the softirq handling. The question is
>> whether it's something in the driver or the kernel. If we can isolate
>> that, maybe we can take some action to have it fixed.
>>
> I don't think it's softirq inefficiencies (oprofile would have shown that). I
> know I keep harping on this, but I still think irq affinity is your problem.
> I'd be interested in knowing what your /proc/interrupts file looked like on
> each of the above kernels. Perhaps it's not that the bnx2 card you have can't
> handle the setting of MSI interrupt affinities, but rather that something
> changed to break irq affinity on this card.
>
> Neil
>

It's been a while since I updated this thread. We've been running
through the different suggestions and tabulating their effects, as well
as trying out an Intel card. The short story is that setting affinity
and MSI works to some extent, and the Intel card doesn't change
things significantly. The results aren't consistent enough for us
to point to a smoking gun.
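As a side note for anyone reproducing the affinity experiments above:
/proc/irq/<n>/smp_affinity takes a hex bitmask of CPUs, so Eric's
"echo 1" means "CPU0 only". A tiny helper for building that mask from a
CPU list might look like this (the function name is mine, not something
from this thread):

```shell
# Build the hex bitmask that /proc/irq/<n>/smp_affinity expects from a
# list of CPU ids. (Helper name is my own; it does not appear above.)
cpus_to_mask() {
    local mask=0 cpu
    for cpu in "$@"; do
        # Set the bit corresponding to each CPU id.
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}

cpus_to_mask 0      # -> 1 (CPU0 only, matching Eric's echo example)
cpus_to_mask 0 1    # -> 3 (CPU0 and CPU1)
```

So `echo "$(cpus_to_mask 0 1)" > /proc/irq/16/smp_affinity` would
restrict IRQ 16 (the IRQ number is just Eric's example) to CPUs 0-1.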
It does look like the 2.6.29-rc4 kernel performs okay with the Intel
card, but that is not a real-time build, and it's not likely to appear in a
supported Ubuntu distribution any time soon. We've reached the point where
we'd like to find an expert dedicated to working on this problem for a
period of time, the end result being some sort of solution that produces
a real-time configuration on a reasonably "aged" kernel (.24-.28) with
multicast performance greater than or equal to that of 2.6.15.

If anybody is interested in devoting some compensated time to this
issue, we're offering up a bounty:
http://www.athenacr.com/bounties/multicast-performance/

For completeness, here's the table of our experiment results:

====================  ===============  ============  ============  ============  ============  ============
Kernel                flavor           IRQ affinity  4x mcasttest  5x mcasttest  6x mcasttest  Mtools2 [4]_
====================  ===============  ============  ============  ============  ============  ============
*Intel e1000e*
2.6.24.19             rt               any           OK            Maybe         X
2.6.24.19             rt               CPU0          OK            OK            X
2.6.24.19             generic          any           X
2.6.24.19             generic          CPU0          OK
2.6.29-rc3            vanilla-server   any           X
2.6.29-rc3            vanilla-server   CPU0          OK
2.6.29-rc4            vanilla-generic  any           X                                         OK
2.6.29-rc4            vanilla-generic  CPU0          OK            OK            OK [5]_       OK
--------------------  ---------------  ------------  ------------  ------------  ------------  ------------
*Broadcom BNX2*
2.6.24-19             rt               MSI any       OK            OK            X
2.6.24-19             rt               MSI CPU0      OK            Maybe         X
2.6.24-19             rt               APIC any      OK            OK            X
2.6.24-19             rt               APIC CPU0     OK            Maybe         X
2.6.24-19-bnx-latest  rt               APIC CPU0     OK            X
2.6.24-19             server           MSI any       X
2.6.24-19             server           MSI CPU0      OK
2.6.24-19             generic          APIC any      X
2.6.24-19             generic          APIC CPU0     OK
2.6.27-11             generic          APIC any      X
2.6.27-11             generic          APIC CPU0     OK            10% drop
2.6.28-8              generic          APIC any      OK            X
2.6.28-8              generic          APIC CPU0     OK            OK            0.5% drop
2.6.29-rc3            vanilla-server   MSI any       X
2.6.29-rc3            vanilla-server   MSI CPU0      X
2.6.29-rc3            vanilla-server   APIC any      OK            X
2.6.29-rc3            vanilla-server   APIC CPU0     OK            OK
2.6.29-rc4            vanilla-generic  APIC any      X
2.6.29-rc4            vanilla-generic  APIC CPU0     OK            3% drop       10% drop      X
====================  ===============  ============  ============  ============  ============  ============

* [4] MTools2 is a test from 29West: http://www.29west.com/docs/TestNet/
* [5] In 5 trials, 1 trial dropped 2% and 4 trials dropped nothing.

Kenny