* receive-side performance issue (ixgbe, core-i7, softirq cpu%)
  From: Andrew Dickinson @ 2010-01-28  8:23 UTC
  To: netdev

Hi,

I'm running into some unexpected performance issues. I say "unexpected"
because I was running the same tests on this same box 5 months ago and
getting very different (and much better) results.

=== Background ===

The box is a dual Core i7 box with a pair of Intel 82598EBs. I'm running
2.6.30 with the in-kernel ixgbe driver. My tests 5 months ago were using
2.6.30-rc3 (with a tiny patch from David Miller, as seen here:
http://kerneltrap.org/mailarchive/linux-netdev/2009/4/30/5605924).
The box is configured with both NICs in a bridge; normally I'm doing some
packet processing using ebtables, but for the sake of keeping things simple
I'm not doing anything special... just straight bridging (no ebtables rules,
etc.). I'm not running irqbalance; instead I'm pinning my interrupts, one
per core. I've re-read and double-checked various settings based on Intel's
README (i.e. gso off, tso off, etc.).

In my previous tests, I was able to pass 3+ Mpps regardless of how that was
divided across the two NICs (i.e. 3 Mpps all in one direction, 1.5 Mpps in
each direction simultaneously, etc.). Now I'm hardly able to exceed about
750 kpps x 2 (i.e. 750 kpps in both directions), and I can't do more than
750 kpps in one direction even with the other direction carrying no traffic.

Unfortunately, I didn't take very good notes when I did this last time, so
I don't have my previous .config and I'm not 100% positive I've got
identical ethtool settings, etc. That being said, I've worked through
seemingly every combination of factors that I can think of and I'm still
unable to see the old performance (NUMA on/off, Hyperthreading on/off,
various irq coalescing settings, etc.).

I have two identical boxes and they both see the same thing, so a hardware
issue seems unlikely. My next step is to grab 2.6.30-rc3 and see if I can
repro the good performance with that kernel again and determine whether
there was a regression between 2.6.30-rc3 and 2.6.30... but I'm skeptical
that that's the issue, since I'm sure other people would have noticed it as
well.

=== What I'm seeing ===

CPU% (almost entirely softirq time, which is expected) ramps extremely
quickly as the packet rate increases. The following table shows the packet
rate on the left ("150 x 2" means 150 kpps in each direction simultaneously)
and the CPU utilization on the right (as measured by %si in top):

    150 x 2:   4%
    300 x 2:   8%
    450 x 2:  18%
    483 x 2:  50%
    525 x 2:  66%
    600 x 2:  85%
    750 x 2: 100% (and dropping frames)

I _am_ seeing interrupts getting spread nicely across cores, so in the
"150 x 2" case that's about 4% soft-interrupt time on each of the 16 cores.
The CPUs are otherwise idle bar a small amount of hardware interrupt time
(less than 1%).

=== Where it gets weird... ===

Trying to isolate the problem, I added an ebtables rule to drop everything
on the FORWARD chain. I was expecting to see the CPU utilization drop since
I'd no longer be dealing with the TX side... no change.

I then decided to switch from a bridge to a route-based solution. I tore
down the bridge, enabled ip_forward, set up some IPs and route entries, etc.
Nothing changed: CPU utilization is identical to what's shown above.
Additionally, if I add an iptables drop on FORWARD, the CPU utilization
remains unchanged (just like in the bridging case above).
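Roughly, the bridged baseline described above and the two isolation
experiments look like the following. This is a sketch only: the interface
names, bridge name, and addresses are placeholders, not the exact
configuration used.

    # baseline: plain bridge across the two NICs, offloads off, no rules
    brctl addbr br0
    brctl addif br0 eth2
    brctl addif br0 eth3
    ip link set br0 up
    ethtool -K eth2 tso off gso off
    ethtool -K eth3 tso off gso off

    # experiment 1: drop everything the bridge would forward
    ebtables -A FORWARD -j DROP

    # experiment 2: tear the bridge down and route/drop instead
    ebtables -F
    ip link set br0 down
    brctl delbr br0
    sysctl -w net.ipv4.ip_forward=1
    ip addr add 10.0.1.1/16 dev eth2        # placeholder addressing
    ip addr add 100.0.1.1/16 dev eth3
    iptables -A FORWARD -j DROP             # same result: softirq load unchanged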
The point that [I think] I'm driving at is that there's something fishy
going on with the receive side of the packets. I wish I could point to
something more specific or a section of code, but I haven't been able to
pare this down to anything more granular in my testing.

=== Questions ===

Has anybody seen this before? If so, what was wrong?
Do you have any recommendations on things to try (either as guesses or,
even better, to help eliminate possibilities)?
And along those lines... can anybody think of any possible reasons for this?

This is so frustrating since I _know_ this hardware is capable of so much
more. It's relatively painless for me to re-run tests in my lab, so feel
free to throw something at me that you think will stick :D

-Andrew
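The per-core %si figures in the table above can be watched with the usual
tools; for example (assuming sysstat's mpstat is installed):

    # per-CPU utilization every second; the %soft column is softirq time
    mpstat -P ALL 1
    # or run top and press '1' to show %si for each individual CPU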
* Re: receive-side performance issue (ixgbe, core-i7, softirq cpu%)
  From: Brandeburg, Jesse @ 2010-01-29  0:18 UTC
  To: Andrew Dickinson; +Cc: netdev@vger.kernel.org, jesse.brandeburg

On Thu, 28 Jan 2010, Andrew Dickinson wrote:
> I'm running into some unexpected performance issues. I say "unexpected"
> because I was running the same tests on this same box 5 months ago and
> getting very different (and much better) results.

Can you try turning off the cpuspeed service, C-states in the BIOS, and
GV3 (aka SpeedStep) support in the BIOS? Have you upgraded your BIOS since
before?

I agree you should be able to see better numbers; I suspect that you are
getting cross-CPU traffic that is limiting your throughput. How many flows
are you pushing?

Another idea is to compile the "perf" tool in the tools/perf directory of
the kernel and run "perf record -a -- sleep 10" while running at steady
state, then show the output of "perf report" to get an idea of which
functions are eating all the CPU time.

Did you change to the "tickless" kernel? We've also found that routing
performance improves dramatically by disabling tickless and the preemptive
kernel and setting HZ=100. What about CONFIG_HPET?

You should try the kernel that the scheduler fixes went into (maybe .31?),
or at least try 2.6.32.6 so you've tried something fully up to date.
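Concretely, those suggestions translate into something like the following.
This is a sketch only; tools/perf ships with recent kernel source trees, and
the .config option names below are the usual ones for kernels of this era
and may differ slightly per version.

    # profile ~10 seconds of steady-state forwarding, then look at hot functions
    cd tools/perf && make
    ./perf record -a -- sleep 10
    ./perf report

    # .config along the lines suggested above: tickless off, no preemption,
    # HZ=100, HPET disabled
    # CONFIG_NO_HZ is not set
    CONFIG_PREEMPT_NONE=y
    CONFIG_HZ_100=y
    CONFIG_HZ=100
    # CONFIG_HPET is not set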
> [...]
>
> === Questions ===
>
> Has anybody seen this before? If so, what was wrong?
> Do you have any recommendations on things to try (either as guesses or,
> even better, to help eliminate possibilities)?
> And along those lines... can anybody think of any possible reasons for this?

Hope the above helped.

> This is so frustrating since I _know_ this hardware is capable of so much
> more. It's relatively painless for me to re-run tests in my lab, so feel
> free to throw something at me that you think will stick :D

Last I checked, I recall that with the 82599 I was pushing ~4.5 million
64-byte packets a second (bidirectional, no drop), after disabling
irqbalance and with 16 tx/rx queues set up with the set_irq_affinity.sh
script (available in our ixgbe-foo.tar.gz from SourceForge). The 82598
should be a bit lower, but can probably get close to that number.

I haven't run the test lately, though; at that point I was likely on
2.6.30-ish.

Jesse
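For reference, the per-queue affinity setup described above (which the
set_irq_affinity.sh script automates) comes down to something like this
when done by hand. The IRQ numbers and queue names here are illustrative;
the actual vector names in /proc/interrupts vary by driver version.

    # keep irqbalance from rewriting the affinity masks
    service irqbalance stop

    # list the per-queue MSI-X vectors for the interface
    grep eth2 /proc/interrupts

    # pin queue 0's vector to CPU0, queue 1's to CPU1, ... (hex CPU bitmasks)
    echo 1 > /proc/irq/33/smp_affinity      # eth2-rx-0 -> CPU0
    echo 2 > /proc/irq/34/smp_affinity      # eth2-rx-1 -> CPU1
    echo 4 > /proc/irq/35/smp_affinity      # eth2-rx-2 -> CPU2
    # ...and so on for the remaining RX/TX vectors on both ports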
* Re: receive-side performance issue (ixgbe, core-i7, softirq cpu%)
  From: Andrew Dickinson @ 2010-01-29  6:06 UTC
  To: Brandeburg, Jesse; +Cc: netdev@vger.kernel.org

Short response: CONFIG_HPET was the dirty little bastard!

Answering your questions below in case somebody else stumbles across this
thread...

On Thu, Jan 28, 2010 at 4:18 PM, Brandeburg, Jesse
<jesse.brandeburg@intel.com> wrote:
> Can you try turning off the cpuspeed service, C-states in the BIOS, and
> GV3 (aka SpeedStep) support in the BIOS?

Yup, everything's on "maximum performance" in my BIOS's vernacular (HP
DL360 G6); no C-states, etc.

> Have you upgraded your BIOS since before?

Not that I'm aware of, but our provisioning folks might have done something
crazy.

> I agree you should be able to see better numbers; I suspect that you are
> getting cross-CPU traffic that is limiting your throughput.

That's what I would have suspected as well.

> How many flows are you pushing?

I'm pushing two streams of traffic, one in each direction. Each stream is
defined as follows:

    North-bound:
      L2: a0a0a0a0a0a0 -> b0b0b0b0b0b0
      L3: RAND(10.0.0.0/16) -> RAND(100.0.0.0/16)
      L4: UDP with random data
    South-bound is the reverse.

where "RAND(CIDR)" is a random address within that CIDR (I'm using a
hardware traffic generator; a rough pktgen approximation is sketched
below).

> Another idea is to compile the "perf" tool in the tools/perf directory of
> the kernel and run "perf record -a -- sleep 10" while running at steady
> state, then show the output of "perf report" to get an idea of which
> functions are eating all the CPU time.
>
> Did you change to the "tickless" kernel? We've also found that routing
> performance improves dramatically by disabling tickless and the preemptive
> kernel and setting HZ=100. What about CONFIG_HPET?

Yes, yes, yes, and no... changed CONFIG_HPET to n, rebooted and retested...

Ta-da!

> You should try the kernel that the scheduler fixes went into (maybe .31?),
> or at least try 2.6.32.6 so you've tried something fully up to date.

I'll give it a whirl :D
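A rough software approximation of one such stream (random source and
destination IPs, UDP payload) can be generated with the kernel's pktgen
module. This is a sketch only; the interface name, MAC, and packet size are
illustrative, and it is not the hardware generator configuration actually
used.

    modprobe pktgen
    echo "rem_device_all"            > /proc/net/pktgen/kpktgend_0
    echo "add_device eth2"           > /proc/net/pktgen/kpktgend_0
    echo "count 0"                   > /proc/net/pktgen/eth2   # run until stopped
    echo "pkt_size 60"               > /proc/net/pktgen/eth2
    echo "dst_mac b0:b0:b0:b0:b0:b0" > /proc/net/pktgen/eth2
    echo "src_min 10.0.0.0"          > /proc/net/pktgen/eth2
    echo "src_max 10.0.255.255"      > /proc/net/pktgen/eth2
    echo "dst_min 100.0.0.0"         > /proc/net/pktgen/eth2
    echo "dst_max 100.0.255.255"     > /proc/net/pktgen/eth2
    echo "flag IPSRC_RND"            > /proc/net/pktgen/eth2
    echo "flag IPDST_RND"            > /proc/net/pktgen/eth2
    echo "start"                     > /proc/net/pktgen/pgctrl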
> [...]

Thank you so much... I wish I'd sent this email out a week ago ;-P

-A
* Re: receive-side performance issue (ixgbe, core-i7, softirq cpu%)
  From: Andrew Dickinson @ 2010-01-29  8:02 UTC
  To: Brandeburg, Jesse; +Cc: netdev@vger.kernel.org

I might have misspoken about HPET. The 4.6 Mpps figure is with 2.6.32.4
vanilla, HPET on. Either way, I'm happy now ;-P

-A

On Thu, Jan 28, 2010 at 10:06 PM, Andrew Dickinson <andrew@whydna.net> wrote:
> Short response: CONFIG_HPET was the dirty little bastard!
> [...]
Thread overview: 4+ messages (newest: 2010-01-29  8:02 UTC)
  2010-01-28  8:23  receive-side performance issue (ixgbe, core-i7, softirq cpu%) - Andrew Dickinson
  2010-01-29  0:18  ` Brandeburg, Jesse
  2010-01-29  6:06    ` Andrew Dickinson
  2010-01-29  8:02      ` Andrew Dickinson