netdev.vger.kernel.org archive mirror
* Re: >10% performance degradation since 2.6.18
       [not found]                   ` <20090704084430.GO2041@one.firstfloor.org>
@ 2009-07-04  9:19                     ` Jeff Garzik
  2009-07-05  4:01                       ` Herbert Xu
  0 siblings, 1 reply; 15+ messages in thread
From: Jeff Garzik @ 2009-07-04  9:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Arjan van de Ven, Matthew Wilcox, Jens Axboe, linux-kernel,
	Styner, Douglas W, Chinang Ma, Prickett, Terry O, Matthew Wilcox,
	Eric.Moore, DL-MPTFusionLinux, NetDev

Andi Kleen wrote:
>> for networking, especially for incoming data such as new connections,
>> that isn't the case.. that's more or less randomly (well hash based)
>> distributed.
> 
> Ok. Still binding them all to a single CPU is quite dumb. It
> makes MSI-X quite useless and probably even harmful.
> 
> We don't default to socket power saving for normal scheduling either, 
> but only when you specify a special knob. I don't see why interrupts
> should be different.

In the pre-MSI-X days, you'd have cachelines bouncing all over the place 
if you distributed networking interrupts across CPUs, particularly given 
that NAPI would run some things on a single CPU anyway.

Today, machines are faster, we have multiple interrupts per device, and 
we have multiple RX/TX queues.  I would be interested to see hard 
numbers (as opposed to guesses) about various new ways to distribute 
interrupts across CPUs.

What's the best setup for power usage?
What's the best setup for performance?
Are they the same?
Is it most optimal to have the interrupt for socket $X occur on the same 
CPU as where the app is running?
If yes, how to best handle when the scheduler moves app to another CPU?
Should we reprogram the NIC hardware flow steering mechanism at that point?

Interesting questions, and I hope we'll see some hard-number comparisons 
before solutions start flowing into the kernel.

	Jeff

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: >10% performance degradation since 2.6.18
  2009-07-04  9:19                     ` >10% performance degradation since 2.6.18 Jeff Garzik
@ 2009-07-05  4:01                       ` Herbert Xu
  2009-07-05 13:09                         ` Matthew Wilcox
                                           ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Herbert Xu @ 2009-07-05  4:01 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: andi, arjan, matthew, jens.axboe, linux-kernel, douglas.w.styner,
	chinang.ma, terry.o.prickett, matthew.r.wilcox, Eric.Moore,
	DL-MPTFusionLinux, netdev

Jeff Garzik <jeff@garzik.org> wrote:
> 
> What's the best setup for power usage?
> What's the best setup for performance?
> Are they the same?

Yes.

> Is it most optimal to have the interrupt for socket $X occur on the same 
> CPU as where the app is running?

Yes.

> If yes, how to best handle when the scheduler moves app to another CPU?
> Should we reprogram the NIC hardware flow steering mechanism at that point?

Not really.  For now the best thing to do is to pin everything
down and not move at all, because we can't afford to move.

The only way for moving to work is if we had the ability to get
the sockets to follow the processes.  That means, we must have
one RX queue per socket.
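
As a rough illustration of "pin everything down" on the userspace side, a
process can bind itself to one CPU with sched_setaffinity(2); this is only
a minimal sketch (the CPU number is an arbitrary example), and the NIC
interrupt still has to be pinned separately via its irq affinity mask:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling process to a single CPU (illustrative only). */
static int pin_self_to_cpu(int cpu)
{
	cpu_set_t mask;

	CPU_ZERO(&mask);
	CPU_SET(cpu, &mask);		/* allow only this one CPU */

	/* pid 0 == the calling process */
	if (sched_setaffinity(0, sizeof(mask), &mask) < 0) {
		perror("sched_setaffinity");
		return -1;
	}
	return 0;
}

int main(void)
{
	/* e.g. CPU1, the CPU that services the NIC's interrupt */
	return pin_self_to_cpu(1) ? 1 : 0;
}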

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: >10% performance degradation since 2.6.18
  2009-07-05  4:01                       ` Herbert Xu
@ 2009-07-05 13:09                         ` Matthew Wilcox
  2009-07-05 16:11                           ` Herbert Xu
  2009-07-06  8:38                           ` Andi Kleen
  2009-07-05 20:44                         ` Jeff Garzik
  2009-07-06 17:00                         ` Rick Jones
  2 siblings, 2 replies; 15+ messages in thread
From: Matthew Wilcox @ 2009-07-05 13:09 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Jeff Garzik, andi, arjan, jens.axboe, linux-kernel,
	douglas.w.styner, chinang.ma, terry.o.prickett, matthew.r.wilcox,
	netdev, Jesse Brandeburg

On Sun, Jul 05, 2009 at 12:01:37PM +0800, Herbert Xu wrote:
> > If yes, how to best handle when the scheduler moves app to another CPU?
> > Should we reprogram the NIC hardware flow steering mechanism at that point?
> 
> Not really.  For now the best thing to do is to pin everything
> down and not move at all, because we can't afford to move.
> 
> The only way for moving to work is if we had the ability to get
> the sockets to follow the processes.  That means, we must have
> one RX queue per socket.

Maybe not one RX queue per socket -- sockets belonging to the same
thread could share the same RX queue.  I'm fairly ignorant of the way
networking works these days; is it possible to dynamically reassign a
socket between RX queues, so we'd only need one RX queue per CPU?

It seems the 82575 device has four queues per port, and it's a dual-port
card, so that's eight queues in the system.  We'd need hundreds of queues
to get one queue per client process.  The 82576 has sixteen queues per
port, but that's still not enough (funnily, the driver still limits you
to four per port).

For what it's worth, I believe the current setup pins the client tasks to
a package, but they are allowed to move between cores on that package.
My information may be out of date; hopefully Doug, Chinang or Terry can
clarify how the tasks are currently bound.

I know this test still uses SCHED_RR for the client tasks (using
SCHED_OTHER results in another couple of percentage points off the
overall performance).

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: >10% performance degradation since 2.6.18
  2009-07-05 13:09                         ` Matthew Wilcox
@ 2009-07-05 16:11                           ` Herbert Xu
  2009-07-06  8:38                           ` Andi Kleen
  1 sibling, 0 replies; 15+ messages in thread
From: Herbert Xu @ 2009-07-05 16:11 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jeff Garzik, andi, arjan, jens.axboe, linux-kernel,
	douglas.w.styner, chinang.ma, terry.o.prickett, matthew.r.wilcox,
	netdev, Jesse Brandeburg

On Sun, Jul 05, 2009 at 07:09:26AM -0600, Matthew Wilcox wrote:
> 
> Maybe not one RX queue per socket -- sockets belonging to the same
> thread could share the same RX queue.  I'm fairly ignorant of the way
> networking works these days; is it possible to dynamically reassign a
> socket between RX queues, so we'd only need one RX queue per CPU?

Not reliably.  You can tweak the hash in the NIC to redirect
traffic (aka flow director) but ultimately if you've got more
sockets than queues then it's a losing game.

A better strategy for now is to pin everything down and try to
get user-space involved by using threads and distributing the
sockets based on which queue they're associated with.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: >10% performance degradation since 2.6.18
  2009-07-05  4:01                       ` Herbert Xu
  2009-07-05 13:09                         ` Matthew Wilcox
@ 2009-07-05 20:44                         ` Jeff Garzik
  2009-07-06  1:19                           ` Herbert Xu
  2009-07-06  8:45                           ` Andi Kleen
  2009-07-06 17:00                         ` Rick Jones
  2 siblings, 2 replies; 15+ messages in thread
From: Jeff Garzik @ 2009-07-05 20:44 UTC (permalink / raw)
  To: Herbert Xu
  Cc: andi, arjan, matthew, jens.axboe, linux-kernel, douglas.w.styner,
	chinang.ma, terry.o.prickett, matthew.r.wilcox, Eric.Moore,
	DL-MPTFusionLinux, netdev

Herbert Xu wrote:
> Jeff Garzik <jeff@garzik.org> wrote:
>> What's the best setup for power usage?
>> What's the best setup for performance?
>> Are they the same?
> 
> Yes.

Is this a blind guess, or is there real world testing across multiple 
setups behind this answer?

Consider a 2-package, quad-core system with 3 userland threads actively 
performing network communication, plus periodic, low levels of network 
activity from OS utilities (such as nightly 'yum upgrade').

That is essentially an under-utilized 8-CPU system.  For such a case, it 
seems like a power win to idle or power down a few cores, or maybe even 
an entire package.

Efficient power usage means scaling _down_ when activity decreases.  A 
blind "distribute network activity across all CPUs" policy does not 
appear to be responsive to that type of situation.


>> Is it most optimal to have the interrupt for socket $X occur on the same 
>> CPU as where the app is running?
> 
> Yes.

Same question:  blind guess, or do you have numbers?

Consider two competing CPU hogs:  a kernel with tons of netfilter tables 
and rules, plus an application that uses a lot of CPU.

Can you not reach a threshold where it makes more sense to split kernel 
and userland work onto different CPUs?


>> If yes, how to best handle when the scheduler moves app to another CPU?
>> Should we reprogram the NIC hardware flow steering mechanism at that point?
> 
> Not really.  For now the best thing to do is to pin everything
> down and not move at all, because we can't afford to move.
> 
> The only way for moving to work is if we had the ability to get
> the sockets to follow the processes.  That means, we must have
> one RX queue per socket.

That seems to presume it is impossible to reprogram the NIC RX queue 
selection rules?

If you can add a new 'flow' to a NIC hardware RX queue, surely you can 
imagine a remove + add operation for a migrated 'flow'...  and thus, at 
least on the NIC hardware level, flows can follow processes.
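
To make that concrete, here is a purely conceptual sketch of the remove +
add operation described above.  None of these names exist in any driver;
nic_flow_del()/nic_flow_add() are stand-ins for whatever per-NIC mechanism
(if any) can steer an individual flow to a given RX queue:

#include <stdint.h>
#include <stdio.h>

struct flow_key {			/* the usual TCP/IP 4-tuple */
	uint32_t saddr, daddr;
	uint16_t sport, dport;
};

/* Hypothetical hardware hooks -- stubs that only log what they would do. */
static void nic_flow_del(const struct flow_key *k)
{
	printf("del steering rule %u -> %u\n", k->sport, k->dport);
}

static void nic_flow_add(const struct flow_key *k, int rxq)
{
	printf("add steering rule %u -> %u on rxq %d\n",
	       k->sport, k->dport, rxq);
}

/* When the scheduler migrates the owning task, re-steer its flow so the
 * RX queue (and hence the interrupt) follows the process. */
static void nic_flow_migrate(const struct flow_key *k, int new_rxq)
{
	nic_flow_del(k);
	nic_flow_add(k, new_rxq);
}

int main(void)
{
	struct flow_key k = { .saddr = 0, .daddr = 0,
			      .sport = 54321, .dport = 80 };

	nic_flow_migrate(&k, 2);	/* app now runs on the CPU behind rxq 2 */
	return 0;
}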

	Jeff

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: >10% performance degradation since 2.6.18
  2009-07-05 20:44                         ` Jeff Garzik
@ 2009-07-06  1:19                           ` Herbert Xu
  2009-07-06  8:45                           ` Andi Kleen
  1 sibling, 0 replies; 15+ messages in thread
From: Herbert Xu @ 2009-07-06  1:19 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: andi, arjan, matthew, jens.axboe, linux-kernel, douglas.w.styner,
	chinang.ma, terry.o.prickett, matthew.r.wilcox, Eric.Moore,
	DL-MPTFusionLinux, netdev

On Sun, Jul 05, 2009 at 04:44:41PM -0400, Jeff Garzik wrote:
>
> Efficient power usage means scaling _down_ when activity decreases.  A  
> blind "distribute network activity across all CPUs" policy does not  
> appear to be responsive to that type of situation.

I didn't suggest distributing network activity across all CPUs.
It should be distributed across all active CPUs.

> Consider two competing CPU hogs:  a kernel with tons of netfilter tables  
> and rules, plus an application that uses a lot of CPU.
>
> Can you not reach a threshold where it makes more sense to split kernel  
> and userland work onto different CPUs?

In that case the best split would be to split the application into
different threads which can then move onto different CPUs.  Doing
what you said above will probably work assuming the CPUs share
cache; otherwise it will suck.

> That seems to presume it is impossible to reprogram the NIC RX queue  
> selection rules?
>
> If you can add a new 'flow' to a NIC hardware RX queue, surely you can  
> imagine a remove + add operation for a migrated 'flow'...  and thus, at  
> least on the NIC hardware level, flows can follow processes.

Right, that would work as long as you can add a rule for each socket
you care about.  Though it would be interesting to know whether the
number of rules can keep up with the number of sockets that we usually
encounter on busy servers.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: >10% performance degradation since 2.6.18
  2009-07-05 13:09                         ` Matthew Wilcox
  2009-07-05 16:11                           ` Herbert Xu
@ 2009-07-06  8:38                           ` Andi Kleen
  1 sibling, 0 replies; 15+ messages in thread
From: Andi Kleen @ 2009-07-06  8:38 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Herbert Xu, Jeff Garzik, andi, arjan, jens.axboe, linux-kernel,
	douglas.w.styner, chinang.ma, terry.o.prickett, matthew.r.wilcox,
	netdev, Jesse Brandeburg

On Sun, Jul 05, 2009 at 07:09:26AM -0600, Matthew Wilcox wrote:
> On Sun, Jul 05, 2009 at 12:01:37PM +0800, Herbert Xu wrote:
> > > If yes, how to best handle when the scheduler moves app to another CPU?
> > > Should we reprogram the NIC hardware flow steering mechanism at that point?
> > 
> > Not really.  For now the best thing to do is to pin everything
> > down and not move at all, because we can't afford to move.
> > 
> > The only way for moving to work is if we had the ability to get
> > the sockets to follow the processes.  That means, we must have
> > one RX queue per socket.
> 
> Maybe not one RX queue per socket -- sockets belonging to the same
> thread could share the same RX queue.  I'm fairly ignorant of the way
> networking works these days; is it possible to dynamically reassign a
> socket between RX queues, so we'd only need one RX queue per CPU?

That is how it is supposed to work (ignoring some special setups 
with QoS) in theory.

a You have per-CPU RX queues (or, if the NIC has fewer queues than CPUs, 
  queues on a subset of the CPUs)
b The NIC uses a hash function on the stream (= socket) to map an 
  incoming packet to a specific RX queue.
c The interrupt handler is supposed to be bound to a specific CPU.
d The CPU then does wakeups, and the scheduler biases the process/thread
  using the socket towards the CPU that always does the wakeups.

Ideally then the process/thread doing the socket IO should be on 
the receiving CPU.  It doesn't always work out like this in practice, 
but it should.

(c) seems to be the part that is broken right now.
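
To make step (b) above concrete, here is a rough model of the hash-based
steering.  Real NICs typically use something like a Toeplitz hash over the
4-tuple plus an indirection table; the toy hash below is only for
illustration:

#include <stdint.h>
#include <stdio.h>

/* Toy stand-in for the NIC's receive-side hash. */
static uint32_t rx_hash(uint32_t saddr, uint32_t daddr,
			uint16_t sport, uint16_t dport)
{
	uint32_t h = saddr ^ daddr ^ ((uint32_t)sport << 16 | dport);

	h ^= h >> 16;			/* mix the bits a little */
	h *= 0x45d9f3b;
	return h ^ (h >> 13);
}

/* All packets of one stream hash identically, so they always land in the
 * same RX queue -- and, via (c) and (d), end up on the same CPU. */
static unsigned int pick_rx_queue(uint32_t saddr, uint32_t daddr,
				  uint16_t sport, uint16_t dport,
				  unsigned int nr_queues)
{
	return rx_hash(saddr, daddr, sport, dport) % nr_queues;
}

int main(void)
{
	/* Same flow twice -> same queue; a different flow may differ. */
	printf("%u %u %u\n",
	       pick_rx_queue(0x0a000001, 0x0a000002, 40000, 80, 8),
	       pick_rx_queue(0x0a000001, 0x0a000002, 40000, 80, 8),
	       pick_rx_queue(0x0a000001, 0x0a000002, 40001, 80, 8));
	return 0;
}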

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: >10% performance degradation since 2.6.18
  2009-07-05 20:44                         ` Jeff Garzik
  2009-07-06  1:19                           ` Herbert Xu
@ 2009-07-06  8:45                           ` Andi Kleen
  1 sibling, 0 replies; 15+ messages in thread
From: Andi Kleen @ 2009-07-06  8:45 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Herbert Xu, andi, arjan, matthew, jens.axboe, linux-kernel,
	douglas.w.styner, chinang.ma, terry.o.prickett, matthew.r.wilcox,
	Eric.Moore, DL-MPTFusionLinux, netdev

> That seems to presume it is impossible to reprogram the NIC RX queue 
> selection rules?
> 
> If you can add a new 'flow' to a NIC hardware RX queue, surely you can 
> imagine a remove + add operation for a migrated 'flow'...  and thus, at 
> least on the NIC hardware level, flows can follow processes.

The standard on modern NIC hardware is stateless hashing: you don't
program in flows; the hardware just uses a fixed hash to map the packet
headers to an RX queue.  You can't really significantly influence the
hash on a per-flow basis there (in theory you could choose specific
local port numbers, but you can't do that for the remote ports or for
well-known sockets).

There are a few high-end NICs that optionally support arbitrary flow
matching, but they typically only support a very limited number of flows
(so you need some fallback anyway), and of course it would be costly to
reprogram the NIC on every socket state change.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: >10% performance degradation since 2.6.18
  2009-07-05  4:01                       ` Herbert Xu
  2009-07-05 13:09                         ` Matthew Wilcox
  2009-07-05 20:44                         ` Jeff Garzik
@ 2009-07-06 17:00                         ` Rick Jones
  2009-07-06 17:36                           ` Ma, Chinang
  2 siblings, 1 reply; 15+ messages in thread
From: Rick Jones @ 2009-07-06 17:00 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Jeff Garzik, andi, arjan, matthew, jens.axboe, linux-kernel,
	douglas.w.styner, chinang.ma, terry.o.prickett, matthew.r.wilcox,
	Eric.Moore, DL-MPTFusionLinux, netdev

Herbert Xu wrote:
> Jeff Garzik <jeff@garzik.org> wrote:
> 
>>What's the best setup for power usage?
>>What's the best setup for performance?
>>Are they the same?
> 
> 
> Yes.
> 
> 
>>Is it most optimal to have the interrupt for socket $X occur on the same 
>>CPU as where the app is running?
> 
> 
> Yes.

Well...  Yes, if the goal is lowest service demand/latency, but not always if 
the goal is to have highest throughput.  For example, basic netperf TCP_RR 
between a pair of systems with NIC interrupts pinned to CPU0 for my convenience :)

Pin netperf/netserver to CPU0 as well:
sbs133b15:~ # netperf -H sbs133b16 -t TCP_RR -T 0 -c -C
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 
sbs133b16.west (10.208.1.50) port 0 AF_INET : first burst 0 : cpu bind
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

16384  87380  1       1      10.00   16396.22  0.39   0.55   3.846   5.364
16384  87380

Now pin it to the peer thread in that same core:

sbs133b15:~ # netperf -H sbs133b16 -t TCP_RR -T 8 -c -C
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 
sbs133b16.west (10.208.1.50) port 0 AF_INET : first burst 0 : cpu bind
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

16384  87380  1       1      10.00   14078.23  0.67   0.87   7.604   9.863
16384  87380

Now pin it to another core in that same processor:

sbs133b15:~ # netperf -H sbs133b16 -t TCP_RR -T 2 -c -C
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 
sbs133b16.west (10.208.1.50) port 0 AF_INET : first burst 0 : cpu bind
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

16384  87380  1       1      10.00   14649.57  1.76   0.64   19.213  7.036
16384  87380

Certainly seems to support "run on the same core as interrupts."  Now, though, 
let's look at bulk throughput:

sbs133b15:~ # netperf -H sbs133b16 -T 0 -c -C
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to sbs133b16.west 
(10.208.1.50) port 0 AF_INET : cpu bind
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

  87380  16384  16384    10.00      9384.11   3.39     2.19     0.474   0.306

In this case, I'm running on Nehalems (two quad-cores with threads enabled) so I 
have enough "oomph" to hit link-rate on a classic throughput test, so all the 
next two runs will show is the CPU hit and some of the run-to-run variability:

sbs133b15:~ # for t in 8 2; do netperf -P 0 -H sbs133b16 -T $t -c -C -B "bind to 
core $t"; done
  87380  16384  16384    10.00      9383.67   4.23     5.21     0.591   0.728 
bind to core 8
  87380  16384  16384    10.00      9383.12   3.03     5.35     0.423   0.747 
bind to core 2

So apart from the thing on the top of my head what is my point?  Let's look at a 
less conventional but still important case - bulk small packet throughput. 
First, find the limit for a single connection when bound to the interrupt core:

sbs133b15:~ # for b in 0 4 16 64 128 256; do netperf -P 0 -t TCP_RR -T 0 -H 
sbs133b16 -c -C -B "$b added simultaneous trans" -- -D -b $b; done
16384  87380  1       1      10.00   16336.52  0.69   0.91   6.715   8.944  0 
added simultaneous trans
16384  87380
16384  87380  1       1      10.00   61324.84  2.23   2.27   5.825   5.910  4 
added simultaneous trans
16384  87380
16384  87380  1       1      10.00   152221.78  2.81   3.49   2.956   3.664  16 
added simultaneous trans
16384  87380
16384  87380  1       1      10.00   291247.72  4.86   5.07   2.670   2.788  64 
added simultaneous trans
16384  87380
16384  87380  1       1      10.00   292257.59  3.99   5.91   2.183   3.236  128 
added simultaneous trans
16384  87380
16384  87380  1       1      10.00   291734.00  5.55   5.32   3.043   2.920  256 
added simultaneous trans
16384  87380

Now, when bound to the peer thread:
sbs133b15:~ # for b in 0 4 16 64 128 256; do netperf -P 0 -t TCP_RR -T 8 -H 
sbs133b16 -c -C -B "$b added simultaneous trans" -- -D -b $b; done
16384  87380  1       1      10.00   14367.40  0.78   1.75   8.652   19.477 0 
added simultaneous trans
16384  87380
16384  87380  1       1      10.00   54820.22  2.73   4.78   7.956   13.948 4 
added simultaneous trans
16384  87380
16384  87380  1       1      10.00   159305.92  4.61   6.84   4.627   6.874  16 
added simultaneous trans
16384  87380
16384  87380  1       1      10.00   260227.55  6.26   8.36   3.851   5.140  64 
added simultaneous trans
16384  87380
16384  87380  1       1      10.00   256336.50  6.23   8.00   3.891   4.993  128 
added simultaneous trans
16384  87380
16384  87380  1       1      10.00   250543.92  6.24   6.29   3.985   4.014  256 
added simultaneous trans
16384  87380

Things still don't look good for running on another CPU, but wait :)  Bind to 
another core in the same processor:

sbs133b15:~ # for b in 0 4 16 64 128 256; do netperf -P 0 -t TCP_RR -T 2 -H 
sbs133b16 -c -C -B "$b added simultaneous trans" -- -D -b $b; done
16384  87380  1       1      10.00   14697.98  0.89   1.53   9.689   16.700 0 
added simultaneous trans
16384  87380
16384  87380  1       1      10.00   58201.08  2.11   4.21   5.804   11.585 4 
added simultaneous trans
16384  87380
16384  87380  1       1      10.00   158999.50  3.87   6.20   3.899   6.240  16 
added simultaneous trans
16384  87380
16384  87380  1       1      10.00   379243.72  6.24   9.04   2.634   3.815  64 
added simultaneous trans
16384  87380
16384  87380  1       1      10.00   384823.34  6.15   9.50   2.556   3.949  128 
added simultaneous trans
16384  87380
16384  87380  1       1      10.00   375001.50  6.07   9.63   2.588   4.109  256 
added simultaneous trans
16384  87380

When the CPU does not have enough "oomph" for link-rate 10G, then what we see 
above with the aggregate TCP_RR holds true for a plain TCP_STREAM test as well - 
getting the second core involved, while indeed increasing CPU utilization, also 
provides the additional cycles required to get higher throughput.  So what is 
optimal depends on what one wishes to optimize.

> 
>>If yes, how to best handle when the scheduler moves app to another CPU?
>>Should we reprogram the NIC hardware flow steering mechanism at that point?
> 
> 
> Not really.  For now the best thing to do is to pin everything
> down and not move at all, because we can't afford to move.
> 
> The only way for moving to work is if we had the ability to get
> the sockets to follow the processes.  That means, we must have
> one RX queue per socket.

Well, or assign sockets to per-core RX queues and be able to move them around. 
If it weren't for all the smarts in the NICs getting in the way :), we'd 
probably do the "lookup where the socket was last accessed and run there" thing 
somewhere in the inbound path a la TOPS.

rick jones

> 
> Cheers,


^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: >10% performance degradation since 2.6.18
  2009-07-06 17:00                         ` Rick Jones
@ 2009-07-06 17:36                           ` Ma, Chinang
  2009-07-06 17:42                             ` Matthew Wilcox
  0 siblings, 1 reply; 15+ messages in thread
From: Ma, Chinang @ 2009-07-06 17:36 UTC (permalink / raw)
  To: Rick Jones, Herbert Xu
  Cc: Jeff Garzik, andi@firstfloor.org, arjan@infradead.org,
	matthew@wil.cx, jens.axboe@oracle.com,
	linux-kernel@vger.kernel.org, Styner, Douglas W,
	Prickett, Terry O, Wilcox, Matthew R, Eric.Moore@lsi.com,
	DL-MPTFusionLinux@lsi.com, netdev@vger.kernel.org

For OLTP workloads we are not pushing much network throughput; lower network latency is more important for OLTP performance. For the original Nehalem 2-socket OLTP result in this mail thread, we bound the two NIC interrupts to cpu1 and cpu9 (one NIC per socket). Database processes are divided into two groups, each pinned to a socket, and each process only receives requests from the NIC it is bound to. This binding scheme gave us a >1% performance boost on pre-Nehalem systems, and we also see a positive impact on this NHM system.
-Chinang

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: >10% performance degradation since 2.6.18
  2009-07-06 17:36                           ` Ma, Chinang
@ 2009-07-06 17:42                             ` Matthew Wilcox
  2009-07-06 17:57                               ` Ma, Chinang
  0 siblings, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2009-07-06 17:42 UTC (permalink / raw)
  To: Ma, Chinang
  Cc: Rick Jones, Herbert Xu, Jeff Garzik, andi@firstfloor.org,
	arjan@infradead.org, jens.axboe@oracle.com,
	linux-kernel@vger.kernel.org, Styner, Douglas W,
	Prickett, Terry O, Wilcox, Matthew R, netdev@vger.kernel.org,
	jesse.brandeburg

On Mon, Jul 06, 2009 at 10:36:11AM -0700, Ma, Chinang wrote:
> For OLTP workload we are not pushing much network throughput. Lower network latency is more important for OLTP performance. For the original Nehalem 2 sockets OLTP result in this mail thread, we bound the two NIC interrupts to cpu1 and cpu9 (one NIC per sockets). Database processes are divided into two groups and pinned to socket and each processe only received request from the NIC it bound to. This binding scheme gave us >1% performance boost pre-Nehalem date. We also see positive impact on this NHM system.

So you've tried spreading the four RX and TX interrupts for each card
out over, say, CPUs 1, 3, 5, 7 for eth1 and then 9, 11, 13, 15 for eth0,
and it produces worse performance than having all four tied to CPUs 1
and 9?  Interesting.
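
For reference, a minimal sketch of how such an experiment can be driven:
each IRQ's CPU mask lives in /proc/irq/<irq>/smp_affinity and takes a hex
CPU bitmask.  The IRQ numbers below are placeholders; the real ones come
from /proc/interrupts:

#include <stdio.h>

/* Write a CPU bitmask to one IRQ's smp_affinity file. */
static int set_irq_affinity(int irq, unsigned int cpu_mask)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%x\n", cpu_mask);
	fclose(f);
	return 0;
}

int main(void)
{
	/* e.g. steer eth1's four vectors to CPUs 1, 3, 5, 7 (the IRQ
	 * numbers 70-73 are made up for this example). */
	int irq[] = { 70, 71, 72, 73 };
	int cpu[] = { 1, 3, 5, 7 };

	for (int i = 0; i < 4; i++)
		set_irq_affinity(irq[i], 1u << cpu[i]);
	return 0;
}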

Can you try changing IGB_MAX_RX_QUEUES (in drivers/net/igb/igb.h, about
line 60) to 1, and seeing if performance improves that way?
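
For reference, the suggested experiment is a one-line change; the exact
line number and surrounding context in drivers/net/igb/igb.h vary by
driver version, so treat this as a sketch:

/* limit the igb driver to a single RX queue for this experiment;
 * the driver reportedly allows four per port today */
#define IGB_MAX_RX_QUEUES	1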

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: >10% performance degradation since 2.6.18
  2009-07-06 17:42                             ` Matthew Wilcox
@ 2009-07-06 17:57                               ` Ma, Chinang
  2009-07-06 18:05                                 ` Matthew Wilcox
  0 siblings, 1 reply; 15+ messages in thread
From: Ma, Chinang @ 2009-07-06 17:57 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Rick Jones, Herbert Xu, Jeff Garzik, andi@firstfloor.org,
	arjan@infradead.org, jens.axboe@oracle.com,
	linux-kernel@vger.kernel.org, Styner, Douglas W,
	Prickett, Terry O, Wilcox, Matthew R, netdev@vger.kernel.org,
	Brandeburg, Jesse



>-----Original Message-----
>From: Matthew Wilcox [mailto:matthew@wil.cx]
>Sent: Monday, July 06, 2009 10:42 AM
>To: Ma, Chinang
>Cc: Rick Jones; Herbert Xu; Jeff Garzik; andi@firstfloor.org;
>arjan@infradead.org; jens.axboe@oracle.com; linux-kernel@vger.kernel.org;
>Styner, Douglas W; Prickett, Terry O; Wilcox, Matthew R;
>netdev@vger.kernel.org; Brandeburg, Jesse
>Subject: Re: >10% performance degradation since 2.6.18
>
>On Mon, Jul 06, 2009 at 10:36:11AM -0700, Ma, Chinang wrote:
>> For OLTP workload we are not pushing much network throughput. Lower
>network latency is more important for OLTP performance. For the original
>Nehalem 2 sockets OLTP result in this mail thread, we bound the two NIC
>interrupts to cpu1 and cpu9 (one NIC per sockets). Database processes are
>divided into two groups and pinned to socket and each processe only
>received request from the NIC it bound to. This binding scheme gave us >1%
>performance boost pre-Nehalem date. We also see positive impact on this NHM
>system.
>
>So you've tried spreading the four RX and TX interrupts for each card
>out over, say, CPUs 1, 3, 5, 7 for eth1 and then 9, 11, 13, 15 for eth0,
>and it produces worse performance than having all four tied to CPUs 1
>and 9?  Interesting.

I was comparing two NICs on two sockets to two NICs on the same socket. I have not tried spreading out the interrupts for a NIC across CPUs in the same socket. Is there a good reason for trying this?

 
>
>Can you try changing IGB_MAX_RX_QUEUES (in drivers/net/igb/igb.h, about
>line 60) to 1, and seeing if performance improves that way?
>

I suppose this should wait until we find out whether spreading out the NIC interrupts within a socket helps or not.

>--
>Matthew Wilcox				Intel Open Source Technology Centre
>"Bill, look, we understand that you're interested in selling us this
>operating system, but compare it to ours.  We can't possibly take such
>a retrograde step."

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: >10% performance degradation since 2.6.18
  2009-07-06 17:57                               ` Ma, Chinang
@ 2009-07-06 18:05                                 ` Matthew Wilcox
  2009-07-06 18:48                                   ` Ma, Chinang
  0 siblings, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2009-07-06 18:05 UTC (permalink / raw)
  To: Ma, Chinang
  Cc: Rick Jones, Herbert Xu, Jeff Garzik, andi@firstfloor.org,
	arjan@infradead.org, jens.axboe@oracle.com,
	linux-kernel@vger.kernel.org, Styner, Douglas W,
	Prickett, Terry O, Wilcox, Matthew R, netdev@vger.kernel.org,
	Brandeburg, Jesse

On Mon, Jul 06, 2009 at 10:57:09AM -0700, Ma, Chinang wrote:
> >-----Original Message-----
> >From: Matthew Wilcox [mailto:matthew@wil.cx]
> >Sent: Monday, July 06, 2009 10:42 AM
> >To: Ma, Chinang
> >Cc: Rick Jones; Herbert Xu; Jeff Garzik; andi@firstfloor.org;
> >arjan@infradead.org; jens.axboe@oracle.com; linux-kernel@vger.kernel.org;
> >Styner, Douglas W; Prickett, Terry O; Wilcox, Matthew R;
> >netdev@vger.kernel.org; Brandeburg, Jesse
> >Subject: Re: >10% performance degradation since 2.6.18
> >
> >On Mon, Jul 06, 2009 at 10:36:11AM -0700, Ma, Chinang wrote:
> >> For OLTP workload we are not pushing much network throughput. Lower
> >network latency is more important for OLTP performance. For the original
> >Nehalem 2 sockets OLTP result in this mail thread, we bound the two NIC
> >interrupts to cpu1 and cpu9 (one NIC per sockets). Database processes are
> >divided into two groups and pinned to socket and each processe only
> >received request from the NIC it bound to. This binding scheme gave us >1%
> >performance boost pre-Nehalem date. We also see positive impact on this NHM
> >system.
> >
> >So you've tried spreading the four RX and TX interrupts for each card
> >out over, say, CPUs 1, 3, 5, 7 for eth1 and then 9, 11, 13, 15 for eth0,
> >and it produces worse performance than having all four tied to CPUs 1
> >and 9?  Interesting.
> 
> I was comparing 2 NIC on 2 sockets to 2 NIC on the same socket. I have not tried spreading out the interrupt for a NIC to cpus in the same sockets. Is there good reason for trying this?

If spreading the network load from one CPU to two CPUs increases the
performance, spreading the network load from two to eight might get even
better performance.

Are the clients now pinned to the CPU package on which they receive their
network interrupts?

> >Can you try changing IGB_MAX_RX_QUEUES (in drivers/net/igb/igb.h, about
> >line 60) to 1, and seeing if performance improves that way?
> 
> I suppose this should wait until we find out whether spread out NIC interrupt in socket helps or not.

Yes, let's try having the driver work the way that LAD designed it
first ;-)

They may not have been optimising for the database client workload,
of course.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: >10% performance degradation since 2.6.18
  2009-07-06 18:05                                 ` Matthew Wilcox
@ 2009-07-06 18:48                                   ` Ma, Chinang
  2009-07-06 18:53                                     ` Matthew Wilcox
  0 siblings, 1 reply; 15+ messages in thread
From: Ma, Chinang @ 2009-07-06 18:48 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Rick Jones, Herbert Xu, Jeff Garzik, andi@firstfloor.org,
	arjan@infradead.org, jens.axboe@oracle.com,
	linux-kernel@vger.kernel.org, Styner, Douglas W,
	Prickett, Terry O, Wilcox, Matthew R, netdev@vger.kernel.org,
	Brandeburg, Jesse



>-----Original Message-----
>From: Matthew Wilcox [mailto:matthew@wil.cx]
>Sent: Monday, July 06, 2009 11:06 AM
>To: Ma, Chinang
>Cc: Rick Jones; Herbert Xu; Jeff Garzik; andi@firstfloor.org;
>arjan@infradead.org; jens.axboe@oracle.com; linux-kernel@vger.kernel.org;
>Styner, Douglas W; Prickett, Terry O; Wilcox, Matthew R;
>netdev@vger.kernel.org; Brandeburg, Jesse
>Subject: Re: >10% performance degradation since 2.6.18
>
>On Mon, Jul 06, 2009 at 10:57:09AM -0700, Ma, Chinang wrote:
>> >-----Original Message-----
>> >From: Matthew Wilcox [mailto:matthew@wil.cx]
>> >Sent: Monday, July 06, 2009 10:42 AM
>> >To: Ma, Chinang
>> >Cc: Rick Jones; Herbert Xu; Jeff Garzik; andi@firstfloor.org;
>> >arjan@infradead.org; jens.axboe@oracle.com; linux-kernel@vger.kernel.org;
>> >Styner, Douglas W; Prickett, Terry O; Wilcox, Matthew R;
>> >netdev@vger.kernel.org; Brandeburg, Jesse
>> >Subject: Re: >10% performance degradation since 2.6.18
>> >
>> >On Mon, Jul 06, 2009 at 10:36:11AM -0700, Ma, Chinang wrote:
>> >> For OLTP workload we are not pushing much network throughput. Lower
>> >network latency is more important for OLTP performance. For the original
>> >Nehalem 2 sockets OLTP result in this mail thread, we bound the two NIC
>> >interrupts to cpu1 and cpu9 (one NIC per sockets). Database processes
>are
>> >divided into two groups and pinned to socket and each processe only
>> >received request from the NIC it bound to. This binding scheme gave us
>>1%
>> >performance boost pre-Nehalem date. We also see positive impact on this
>NHM
>> >system.
>> >
>> >So you've tried spreading the four RX and TX interrupts for each card
>> >out over, say, CPUs 1, 3, 5, 7 for eth1 and then 9, 11, 13, 15 for eth0,
>> >and it produces worse performance than having all four tied to CPUs 1
>> >and 9?  Interesting.
>>
>> I was comparing 2 NIC on 2 sockets to 2 NIC on the same socket. I have
>not tried spreading out the interrupt for a NIC to cpus in the same sockets.
>Is there good reason for trying this?
>
>If spreading the network load from one CPU to two CPUs increases the
>performance, spreading the network load from two to eight might get even
>better performance.
>

On the subject of distributing interrupt load: can we do the same thing with the IOC interrupts? We have four LSI 3801 controllers, and the number of I/O interrupts is huge on four of the CPUs. Is there a way to divide the IOC IRQs so we can spread the I/O interrupts across more CPUs?


>Are the clients now pinned to the CPU package on which they receive their
>network interrupts?

Yes. The database processes on the server are pinned to the socket on which they receive their network interrupts.

>
>> >Can you try changing IGB_MAX_RX_QUEUES (in drivers/net/igb/igb.h, about
>> >line 60) to 1, and seeing if performance improves that way?
>>
>> I suppose this should wait until we find out whether spread out NIC
>interrupt in socket helps or not.
>
>Yes, let's try having the driver work the way that LAD designed it
>first ;-)
>
>They may not have been optimising for the database client workload,
>of course.
>
>--
>Matthew Wilcox				Intel Open Source Technology Centre
>"Bill, look, we understand that you're interested in selling us this
>operating system, but compare it to ours.  We can't possibly take such
>a retrograde step."

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: >10% performance degradation since 2.6.18
  2009-07-06 18:48                                   ` Ma, Chinang
@ 2009-07-06 18:53                                     ` Matthew Wilcox
  0 siblings, 0 replies; 15+ messages in thread
From: Matthew Wilcox @ 2009-07-06 18:53 UTC (permalink / raw)
  To: Ma, Chinang
  Cc: Rick Jones, Herbert Xu, Jeff Garzik, andi@firstfloor.org,
	arjan@infradead.org, jens.axboe@oracle.com,
	linux-kernel@vger.kernel.org, Styner, Douglas W,
	Prickett, Terry O, Wilcox, Matthew R, netdev@vger.kernel.org,
	Brandeburg, Jesse

On Mon, Jul 06, 2009 at 11:48:47AM -0700, Ma, Chinang wrote:
> On the distributing interrupt load subject.  Can we do the same thing with the IOC interrupts. We have 4 LSI 3801, The number of i/o interrupt is huge on 4 of the cpus. Is there a way to divide the IOC irq number so we can spread out the i/o interrupt to more cpu?

I think the LSI 3801 only has one interrupt per IOC, but they could
perhaps be better spread out.  Right now, they're delivering interrupts
to CPUs 2, 3, 4 and 5.  Possibly spreading them out to CPUs 2, 6, 10
and 14 would help.  Or maybe it would hurt ...

(nb: the IOC interrupts are similarly tied to CPUs 2, 3, 4 and 5 with
2.6.18, so this isn't a likely cause of/solution to the regression;
it may just be a path to better numbers).

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2009-07-06 18:53 UTC | newest]

Thread overview: 15+ messages
     [not found] <20090703025607.GK5480@parisc-linux.org>
     [not found] ` <87skhdaaub.fsf@basil.nowhere.org>
     [not found]   ` <20090703185414.GP23611@kernel.dk>
     [not found]     ` <20090703191321.GO5480@parisc-linux.org>
     [not found]       ` <20090703192235.GV23611@kernel.dk>
     [not found]         ` <20090703194557.GQ5480@parisc-linux.org>
     [not found]           ` <20090703195458.GK2041@one.firstfloor.org>
     [not found]             ` <20090703130421.646fe5cb@infradead.org>
     [not found]               ` <20090703233505.GL2041@one.firstfloor.org>
     [not found]                 ` <20090703230408.4433ee39@infradead.org>
     [not found]                   ` <20090704084430.GO2041@one.firstfloor.org>
2009-07-04  9:19                     ` >10% performance degradation since 2.6.18 Jeff Garzik
2009-07-05  4:01                       ` Herbert Xu
2009-07-05 13:09                         ` Matthew Wilcox
2009-07-05 16:11                           ` Herbert Xu
2009-07-06  8:38                           ` Andi Kleen
2009-07-05 20:44                         ` Jeff Garzik
2009-07-06  1:19                           ` Herbert Xu
2009-07-06  8:45                           ` Andi Kleen
2009-07-06 17:00                         ` Rick Jones
2009-07-06 17:36                           ` Ma, Chinang
2009-07-06 17:42                             ` Matthew Wilcox
2009-07-06 17:57                               ` Ma, Chinang
2009-07-06 18:05                                 ` Matthew Wilcox
2009-07-06 18:48                                   ` Ma, Chinang
2009-07-06 18:53                                     ` Matthew Wilcox
