From: Andrew Dickinson
To: "Brandeburg, Jesse"
Cc: "netdev@vger.kernel.org"
Subject: Re: receive-side performance issue (ixgbe, core-i7, softirq cpu%)
Date: Thu, 28 Jan 2010 22:06:08 -0800

Short response: CONFIG_HPET was the dirty little bastard!

Answering your questions below in case somebody else stumbles across
this thread...

On Thu, Jan 28, 2010 at 4:18 PM, Brandeburg, Jesse wrote:
>
> On Thu, 28 Jan 2010, Andrew Dickinson wrote:
>> I'm running into some unexpected performance issues.  I say
>> "unexpected" because I was running the same tests on this same box 5
>> months ago and getting very different (and much better) results.
>
> can you try turning off cpuspeed service, C-States in BIOS, and GV3 (aka
> speedstep) support in BIOS?

Yup, everything's on "maximum performance" in my BIOS's vernacular (HP
DL360 G6), no C-states, etc.

> Have you upgraded your BIOS since before?

Not that I'm aware of, but our provisioning folks might have done
something crazy.

> I agree you should be able to see better numbers, I suspect that you are
> getting cross-cpu traffic that is limiting your throughput.

That's what I would have suspected as well.

> How many flows are you pushing?

I'm pushing two streams of traffic, one in each direction.  Each
stream is defined as follows:

North-bound:
  L2: a0a0a0a0a0a0 -> b0b0b0b0b0b0
  L3: RAND(10.0.0.0/16) -> RAND(100.0.0.0/16)
  L4: UDP with random data
South-bound is the reverse.

where "RAND(CIDR)" is a random address within that CIDR (I'm using a
hardware traffic generator).

> Another idea is to compile the "perf" tool in the tools/perf directory of
> the kernel and run "perf record -a -- sleep 10" while running at steady
> state.  then show output of perf report to get an idea of which functions
> are eating all the cpu time.
>
> did you change to the "tickless" kernel?  We've also found that routing
> performance improves dramatically by disabling tickless, preemptive kernel
> and setting HZ=100.  What about CONFIG_HPET?

yes, yes, yes, and no... changed CONFIG_HPET to n, rebooted and
retested.... ta-da!
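
For anybody who finds this thread later, the timer-related bits of my
.config now look roughly like this (typed from memory, so double-check
the exact option names against your own tree):

    # CONFIG_NO_HZ is not set
    CONFIG_HZ_100=y
    CONFIG_HZ=100
    CONFIG_PREEMPT_NONE=y
    # CONFIG_PREEMPT is not set
    # CONFIG_HPET is not set

i.e. no tickless, HZ=100, no kernel preemption, and HPET off -- that
last one being the change that actually made the difference on this box.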

> You should try the kernel that the scheduler fixes went into (maybe 31?)
> or at least try 2.6.32.6 so you've tried something fully up to date.

I'll give it a whirl :D

>> === Background ===
>>
>> The box is a dual Core i7 box with a pair of Intel 82598EBs.  I'm
>> running 2.6.30 with the in-kernel ixgbe driver.  My tests 5 months ago
>> were using 2.6.30-rc3 (with a tiny patch from David Miller as seen
>> here: http://kerneltrap.org/mailarchive/linux-netdev/2009/4/30/5605924).
>> The box is configured with both NICs in a bridge; normally I'm doing
>> some packet processing using ebtables, but for the sake of keeping
>> things simple, I'm not doing anything special, just straight bridging
>> (no ebtables rules, etc).  I'm not running irqbalance and am instead
>> pinning my interrupts, one per core.  I've re-read and double-checked
>> various settings based on Intel's README (i.e. gso off, tso off, etc).
>>
>> In my previous tests, I was able to pass 3+ Mpps regardless of how that
>> was divided across the two NICs (i.e. 3 Mpps all in one direction,
>> 1.5 Mpps in each direction simultaneously, etc).  Now I'm hardly able
>> to exceed about 750 kpps x 2 (i.e. 750 kpps in both directions), and I
>> can't do more than 750 kpps in one direction even with the other
>> direction carrying no traffic.
>>
>> Unfortunately, I didn't take very good notes when I did this last time,
>> so I don't have my previous .config and I'm not 100% positive I've got
>> identical ethtool settings, etc.  That being said, I've worked through
>> seemingly every combination of factors that I can think of and I'm
>> still unable to see the old performance (NUMA on/off, hyperthreading
>> on/off, various irq coalescing settings, etc).
>>
>> I have two identical boxes and they both see the same thing, so a
>> hardware issue seems unlikely.  My next step is to grab 2.6.30-rc3 and
>> see if I can repro the good performance with that kernel again and
>> determine if there was a regression between 2.6.30-rc3 and 2.6.30...
>> but I'm skeptical that that's the issue since I'm sure other people
>> would have noticed this as well.
>>
>> === What I'm seeing ===
>>
>> CPU% (almost entirely softirq time, which is expected) ramps extremely
>> quickly as packet rate increases.  The following table shows the packet
>> rate on the left ("150 x 2" means 150 kpps in each direction
>> simultaneously) and the cpu utilization (as measured by %si in top) on
>> the right:
>>
>> 150 x 2:   4%
>> 300 x 2:   8%
>> 450 x 2:  18%
>> 483 x 2:  50%
>> 525 x 2:  66%
>> 600 x 2:  85%
>> 750 x 2: 100% (and dropping frames)
>>
>> I _am_ seeing interrupts getting spread nicely across cores, so in the
>> "150 x 2" case that's about 4% soft-interrupt time on each of the 16
>> cores.  The CPUs are otherwise idle bar a small amount of hardware
>> interrupt time (less than 1%).
>>
>> === Where it gets weird... ===
>>
>> Trying to isolate the problem, I added an ebtables rule to drop
>> everything on the forward chain.  I was expecting to see the CPU
>> utilization drop since I'd no longer be dealing with the TX side... no
>> change.
>>
>> I then decided to switch from a bridge to a route-based solution.  I
>> tore down the bridge, enabled ip_forward, set up some IPs and route
>> entries, etc.  Nothing changes.  CPU performance is identical to
>> what's shown above.  Additionally, if I add an iptables drop on
>> FORWARD, the CPU utilization remains unchanged (just like in the
>> bridging case above).
>>
>> The point that [I think] I'm driving at is that there's something
>> fishy going on with the receive side of the packets.  I wish I could
>> point to something more specific or a section of code, but I haven't
>> been able to pare this down to anything more granular in my testing.
>>
>> === Questions ===
>>
>> Has anybody seen this before?  If so, what was wrong?
>> Do you have any recommendations on things to try (either as guesses
>> or, even better, to help eliminate possibilities)?
>> And along those lines... can anybody think of any possible reasons for this?
>
> hope the above helped.
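
It did.  And in case anyone else wants to reproduce the receive-side
isolation tests from the "Where it gets weird" section above, they
amounted to nothing fancier than roughly this (interface names are made
up and the exact rules are from memory, so treat it as a sketch):

    # normal config: both ports in a bridge
    brctl addbr br0
    brctl addif br0 eth2
    brctl addif br0 eth3
    ip link set eth2 up
    ip link set eth3 up
    ip link set br0 up

    # test 1: drop everything in the bridge forward path
    ebtables -A FORWARD -j DROP

    # test 2: tear down the bridge and route (then drop) instead
    ip link set br0 down
    brctl delbr br0
    ip addr add 10.0.0.1/16 dev eth2
    ip addr add 100.0.0.1/16 dev eth3
    echo 1 > /proc/sys/net/ipv4/ip_forward
    iptables -A FORWARD -j DROP

In both cases the softirq load was unchanged, which is what pointed at
the receive side rather than forwarding or transmit.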
>
>> This is so frustrating since I _know_ this hardware is capable of so
>> much more.  It's relatively painless for me to re-run tests in my lab,
>> so feel free to throw something at me that you think will stick :D
>
> last I checked, I recall with 82599 I was pushing ~4.5 million 64 byte
> packets a second (bidirectional, no drop), after disabling irqbalance and
> 16 tx/rx queues set with set_irq_affinity.sh script (available in our
> ixgbe-foo.tar.gz from sourceforge).  82598 should be a bit lower, but
> probably can get close to that number.
>
> I haven't run the test lately though, but at that point I was likely on
> 2.6.30 ish
>
> Jesse

Thank you so much... I wish I'd sent this email out a week ago ;-P

-A
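
P.S. Since several of the suggestions above boil down to "kill
irqbalance and pin the queue interrupts yourself", here's the shape of
what I'm doing.  It's a home-grown equivalent of the set_irq_affinity.sh
script Jesse mentions, so treat it as an illustrative sketch (the
interface name and the /proc/interrupts matching are assumptions, not
the exact script I run):

    # stop irqbalance so it can't move the IRQs around afterwards
    /etc/init.d/irqbalance stop

    # one queue IRQ per core: walk the NIC's queue interrupts and give
    # each one a one-hot CPU mask in /proc/irq/<irq>/smp_affinity
    cpu=0
    for irq in $(awk -F: '/eth2-/ { gsub(/ /, "", $1); print $1 }' /proc/interrupts); do
        printf "%x" $((1 << cpu)) > /proc/irq/$irq/smp_affinity
        cpu=$((cpu + 1))
    done

    # plus the offload settings from Intel's README
    ethtool -K eth2 gso off tso off

The real set_irq_affinity.sh in Intel's ixgbe tarball does the same
thing more carefully (and for every port), so use that if you have it.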