From: Andrew Dickinson
To: "Brandeburg, Jesse"
Cc: "netdev@vger.kernel.org"
Subject: Re: receive-side performance issue (ixgbe, core-i7, softirq cpu%)
Date: Thu, 28 Jan 2010 22:06:08 -0800

Short response: CONFIG_HPET was the dirty little bastard!

Answering your questions below in case somebody else stumbles across
this thread...

On Thu, Jan 28, 2010 at 4:18 PM, Brandeburg, Jesse wrote:
>
> On Thu, 28 Jan 2010, Andrew Dickinson wrote:
>> I'm running into some unexpected performance issues.  I say
>> "unexpected" because I was running the same tests on this same box 5
>> months ago and getting very different (and much better) results.
>
> can you try turning off cpuspeed service, C-States in BIOS, and GV3 (aka
> speedstep) support in BIOS?

Yup, everything's on "maximum performance" in my BIOS's vernacular (HP
DL360 G6), no C-states, etc.

> Have you upgraded your BIOS since before?

Not that I'm aware of, but our provisioning folks might have done
something crazy.

> I agree you should be able to see better numbers, I suspect that you are
> getting cross-cpu traffic that is limiting your throughput.

That's what I would have suspected as well.

> How many flows are you pushing?

I'm pushing two streams of traffic, one in each direction.  Each
stream is defined as follows:

North-bound:
  L2: a0a0a0a0a0a0 -> b0b0b0b0b0b0
  L3: RAND(10.0.0.0/16) -> RAND(100.0.0.0/16)
  L4: UDP with random data
South-bound is the reverse.

where "RAND(CIDR)" is a random address within that CIDR (I'm using a
hardware traffic generator).

> Another idea is to compile the "perf" tool in the tools/perf directory of
> the kernel and run "perf record -a -- sleep 10" while running at steady
> state.  then show output of perf report to get an idea of which functions
> are eating all the cpu time.
>
> did you change to the "tickless" kernel?  We've also found that routing
> performance improves dramatically by disabling tickless, preemptive kernel
> and setting HZ=100.  What about CONFIG_HPET?

yes, yes, yes, and no... changed CONFIG_HPET to n, rebooted and
retested.... ta-da!
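
For anybody who finds this thread later, the timer-related bits of my
.config now look roughly like this (typed from memory, so double-check
the exact option names against your own tree):

    # CONFIG_NO_HZ is not set
    CONFIG_HZ_100=y
    CONFIG_HZ=100
    CONFIG_PREEMPT_NONE=y
    # CONFIG_PREEMPT is not set
    # CONFIG_HPET is not set

i.e. no tickless, HZ=100, no kernel preemption, and HPET off -- that
last one being the change that actually made the difference on this box.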

> You should try the kernel that the scheduler fixes went into (maybe 31?)
> or at least try 2.6.32.6 so you've tried something fully up to date.

I'll give it a whirl :D

>> === Background ===
>>
>> The box is a dual Core i7 box with a pair of Intel 82598EBs.  I'm
>> running 2.6.30 with the in-kernel ixgbe driver.  My tests 5 months ago
>> were using 2.6.30-rc3 (with a tiny patch from David Miller as seen
>> here: http://kerneltrap.org/mailarchive/linux-netdev/2009/4/30/5605924).
>> The box is configured with both NICs in a bridge; normally I'm doing
>> some packet processing using ebtables, but for the sake of keeping
>> things simple, I'm not doing anything special, just straight bridging
>> (no ebtables rules, etc).  I'm not running irqbalance and am instead
>> pinning my interrupts, one per core.  I've re-read and double-checked
>> various settings based on Intel's README (i.e. gso off, tso off, etc).
>>
>> In my previous tests, I was able to pass 3+ Mpps regardless of how that
>> was divided across the two NICs (i.e. 3 Mpps all in one direction,
>> 1.5 Mpps in each direction simultaneously, etc).  Now I'm hardly able
>> to exceed about 750 kpps x 2 (i.e. 750 kpps in both directions), and I
>> can't do more than 750 kpps in one direction even with the other
>> direction carrying no traffic.
>>
>> Unfortunately, I didn't take very good notes when I did this last time,
>> so I don't have my previous .config and I'm not 100% positive I've got
>> identical ethtool settings, etc.  That being said, I've worked through
>> seemingly every combination of factors that I can think of and I'm
>> still unable to see the old performance (NUMA on/off, hyperthreading
>> on/off, various irq coalescing settings, etc).
>>
>> I have two identical boxes and they both see the same thing, so a
>> hardware issue seems unlikely.  My next step is to grab 2.6.30-rc3 and
>> see if I can repro the good performance with that kernel again and
>> determine if there was a regression between 2.6.30-rc3 and 2.6.30...
>> but I'm skeptical that that's the issue since I'm sure other people
>> would have noticed this as well.
>>
>> === What I'm seeing ===
>>
>> CPU% (almost entirely softirq time, which is expected) ramps extremely
>> quickly as packet rate increases.  The following table shows the packet
>> rate on the left ("150 x 2" means 150 kpps in each direction
>> simultaneously) and the cpu utilization (as measured by %si in top) on
>> the right:
>>
>> 150 x 2:   4%
>> 300 x 2:   8%
>> 450 x 2:  18%
>> 483 x 2:  50%
>> 525 x 2:  66%
>> 600 x 2:  85%
>> 750 x 2: 100% (and dropping frames)
>>
>> I _am_ seeing interrupts getting spread nicely across cores, so in the
>> "150 x 2" case that's about 4% soft-interrupt time on each of the 16
>> cores.  The CPUs are otherwise idle bar a small amount of hardware
>> interrupt time (less than 1%).
>>
>> === Where it gets weird... ===
>>
>> Trying to isolate the problem, I added an ebtables rule to drop
>> everything on the forward chain.  I was expecting to see the CPU
>> utilization drop since I'd no longer be dealing with the TX side... no
>> change.
>>
>> I then decided to switch from a bridge to a route-based solution.  I
>> tore down the bridge, enabled ip_forward, set up some IPs and route
>> entries, etc.  Nothing changes.  CPU performance is identical to
>> what's shown above.  Additionally, if I add an iptables drop on
>> FORWARD, the CPU utilization remains unchanged (just like in the
>> bridging case above).
>>
>> The point that [I think] I'm driving at is that there's something
>> fishy going on with the receive side of the packets.  I wish I could
>> point to something more specific or a section of code, but I haven't
>> been able to pare this down to anything more granular in my testing.
>>
>> === Questions ===
>>
>> Has anybody seen this before?  If so, what was wrong?
>> Do you have any recommendations on things to try (either as guesses
>> or, even better, to help eliminate possibilities)?
>> And along those lines... can anybody think of any possible reasons for this?
>
> hope the above helped.
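
It did.  And in case anyone else wants to reproduce the receive-side
isolation tests from the "Where it gets weird" section above, they
amounted to nothing fancier than roughly this (interface names are made
up and the exact rules are from memory, so treat it as a sketch):

    # normal config: both ports in a bridge
    brctl addbr br0
    brctl addif br0 eth2
    brctl addif br0 eth3
    ip link set eth2 up
    ip link set eth3 up
    ip link set br0 up

    # test 1: drop everything in the bridge forward path
    ebtables -A FORWARD -j DROP

    # test 2: tear down the bridge and route (then drop) instead
    ip link set br0 down
    brctl delbr br0
    ip addr add 10.0.0.1/16 dev eth2
    ip addr add 100.0.0.1/16 dev eth3
    echo 1 > /proc/sys/net/ipv4/ip_forward
    iptables -A FORWARD -j DROP

In both cases the softirq load was unchanged, which is what pointed at
the receive side rather than forwarding or transmit.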
>
>> This is so frustrating since I _know_ this hardware is capable of so
>> much more.  It's relatively painless for me to re-run tests in my lab,
>> so feel free to throw something at me that you think will stick :D
>
> last I checked, I recall with 82599 I was pushing ~4.5 million 64 byte
> packets a second (bidirectional, no drop), after disabling irqbalance and
> 16 tx/rx queues set with set_irq_affinity.sh script (available in our
> ixgbe-foo.tar.gz from sourceforge).  82598 should be a bit lower, but
> probably can get close to that number.
>
> I haven't run the test lately though, but at that point I was likely on
> 2.6.30 ish
>
> Jesse

Thank you so much... I wish I'd sent this email out a week ago ;-P

-A
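
P.S. Since several of the suggestions above boil down to "kill
irqbalance and pin the queue interrupts yourself", here's the shape of
what I'm doing.  It's a home-grown equivalent of the set_irq_affinity.sh
script Jesse mentions, so treat it as an illustrative sketch (the
interface name and the /proc/interrupts matching are assumptions, not
the exact script I run):

    # stop irqbalance so it can't move the IRQs around afterwards
    /etc/init.d/irqbalance stop

    # one queue IRQ per core: walk the NIC's queue interrupts and give
    # each one a one-hot CPU mask in /proc/irq/<irq>/smp_affinity
    cpu=0
    for irq in $(awk -F: '/eth2-/ { gsub(/ /, "", $1); print $1 }' /proc/interrupts); do
        printf "%x" $((1 << cpu)) > /proc/irq/$irq/smp_affinity
        cpu=$((cpu + 1))
    done

    # plus the offload settings from Intel's README
    ethtool -K eth2 gso off tso off

The real set_irq_affinity.sh in Intel's ixgbe tarball does the same
thing more carefully (and for every port), so use that if you have it.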