* pipe performance regression on ia64
@ 2005-01-18 17:41 Luck, Tony
2005-01-18 18:11 ` Linus Torvalds
0 siblings, 1 reply; 12+ messages in thread
From: Luck, Tony @ 2005-01-18 17:41 UTC (permalink / raw)
To: torvalds; +Cc: linux-ia64, linux-kernel
David Mosberger pointed out to me that 2.6.11-rc1 kernel scores
very badly on ia64 in lmbench pipe throughput test (bw_pipe) compared
with earlier kernels.
Nanhai Zou looked into this, and found that the performance loss
began with Linus' patch to speed up pipe performance by allocating
a circular list of pages.
Here's his analysis:
>OK, I know the reason now.
>
>The regression we saw comes from the scheduler load balancer.
>
>Pipe is a kind of workload where the writer and reader never run at the
>same time. They are synchronized by a semaphore; one is always sleeping
>while the other end is working.
>
>To keep the cache hot, we do not want the writer and reader to be
>balanced onto 2 cpus. That is why in fs/pipe.c the kernel uses
>wake_up_interruptible_sync() instead of wake_up_interruptible() to wake
>up the process.
>
>Now, the load balancer still balances the processes if any other cpu is
>idle. Note that on an HT-enabled x86 the load balancer will first
>balance the process to a cpu in the SMT domain, without a cache miss
>penalty.
>
>So, when we run bw_pipe on a low-load SMP machine, the load balancer is
>always trying to spread the 2 processes out while
>wake_up_interruptible_sync() is always trying to draw them back onto
>1 cpu.
>
>Linus's patch will reduce the chance to call wake_up_interruptible_sync()
>a lot.
>
>For the bw_pipe writer or reader, the buffer size is 64k. In a 16k-page
>kernel, the old kernel will call wake_up_interruptible_sync 4 times but
>the new kernel will call wakeup only 1 time.
>
>Now the load balancer wins; the processes run on 2 cpus most of the
>time, and they take a lot of cache miss penalties.
>
>To prove this, just run 4 instances of bw_pipe on a 4-way Tiger so the
>load balancer is not so active.
>
>Or simply add some code at the top of main() in bw_pipe.c
>(Linux-specific; the raw syscall takes a bitmask):
>
>{
>	unsigned long affinity = 1;	/* bitmask: CPU0 only */
>	sched_setaffinity(getpid(), sizeof(affinity), &affinity);
>}
>
>then make and run bw_pipe again.
>
>Now I get a throughput of 5GB...
-Tony
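Nanhai's workaround can also be reproduced without patching bw_pipe.c at all, e.g. from Python, whose `os.sched_setaffinity` wraps the same Linux-only interface. A minimal sketch (the `pin_to_cpu` helper name is ours, not lmbench's; assumes a Linux kernel with affinity support):

```python
import os

def pin_to_cpu(cpu):
    """Pin the calling process (pid 0 == self) to a single CPU and
    return the resulting affinity mask as a set of CPU numbers.
    Any pipe reader/writer forked afterwards inherits the mask, so
    both ends share one cache, as in the C hack above."""
    os.sched_setaffinity(0, {cpu})
    return os.sched_getaffinity(0)

if __name__ == "__main__":
    original = os.sched_getaffinity(0)     # remember the inherited mask
    first = sorted(original)[0]            # lowest CPU we may run on
    print(pin_to_cpu(first))
    os.sched_setaffinity(0, original)      # restore for anything after us
```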
^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: pipe performance regression on ia64
  2005-01-18 17:41 pipe performance regression on ia64 Luck, Tony
@ 2005-01-18 18:11 ` Linus Torvalds
  2005-01-18 18:31   ` David Mosberger
  ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Linus Torvalds @ 2005-01-18 18:11 UTC (permalink / raw)
To: Luck, Tony; +Cc: linux-ia64, linux-kernel

On Tue, 18 Jan 2005, Luck, Tony wrote:
> David Mosberger:
> >
> >So, when we run bw_pipe on a low-load SMP machine, the load balancer is
> >always trying to spread the 2 processes out while
> >wake_up_interruptible_sync() is always trying to draw them back onto
> >1 cpu.
> >
> >Linus's patch will reduce the chance to call wake_up_interruptible_sync()
> >a lot.
> >
> >For the bw_pipe writer or reader, the buffer size is 64k. In a 16k-page
> >kernel, the old kernel will call wake_up_interruptible_sync 4 times but
> >the new kernel will call wakeup only 1 time.

Yes, it will depend on the buffer size, and on whether the writer actually
does any _work_ to fill it, or just writes it.

The thing is, in real life, the "wake_up()" tends to be preferable,
because even though we are totally synchronized on the pipe semaphore
(which is a locking issue in itself that might be worth looking into),
most real loads will actually do something to _generate_ the write data in
the first place, and thus you actually want to spread the load out over
CPUs.

The lmbench pipe benchmark is kind of special, since the writer literally
does nothing but write and the reader does nothing but read, so there is
nothing to parallelize.

The "wake_up_sync()" hack only helps for the special case where we know
the writer is going to write more. Of course, we could make the pipe code
use that "synchronous" write unconditionally, and benchmarks would look
better, but I suspect it would hurt real life.
The _normal_ use of a pipe, after all, is having a writer that does real
work to generate the data (like 'cc1'), and a sink that actually does real
work with it (like 'as'), and having less synchronization is a _good_
thing.

I don't know how to make the benchmark look repeatable and good, though.
The CPU affinity thing may be the right thing.

For example, if somebody blocks on a semaphore, we actually do have some
code to try to wake it up on the same CPU that released the semaphore (in
"try_to_wake_up()"), but again, in this case that tends to be fought by
the idle balancing there too. And again, that does tend to be the right
thing to do if the process has _other_ data than the stuff protected by
the semaphore. It's just that pipe_bw doesn't have that..

(pipe_bw also makes zero-copy pipes with VM tricks look really good,
because it never does a store operation to the buffer it uses to write
the data, so VM tricks never see any COW faults, and can just move pages
around without any cost. Again, that is not what real life does, so
optimizing for the benchmark does the wrong thing).

		Linus
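The benchmark Linus describes — a writer that does nothing but write and a reader that does nothing but read — is small enough to sketch in a few lines. The following is a rough Python analogue of bw_pipe, not lmbench's actual code; the transfer total and buffer size are arbitrary illustration values:

```python
import os
import time

def pipe_bandwidth(total=1 << 22, bufsize=1 << 16):
    """Fork a writer child, stream `total` bytes through a pipe in
    `bufsize` chunks, and return (bytes_received, MB_per_second).
    The writer does no work to generate the data -- exactly the
    degenerate case the thread discusses."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:                      # child: pure writer
        os.close(r)
        chunk = b"x" * bufsize
        sent = 0
        while sent < total:
            sent += os.write(w, chunk)
        os.close(w)
        os._exit(0)
    os.close(w)                       # parent: pure reader
    received = 0
    start = time.perf_counter()
    while True:
        data = os.read(r, bufsize)
        if not data:                  # EOF: writer closed its end
            break
        received += len(data)
    elapsed = time.perf_counter() - start
    os.close(r)
    os.waitpid(pid, 0)
    return received, received / elapsed / (1 << 20)
```

Running this pinned to one CPU versus unpinned on an idle SMP box shows the same modality the thread is about: the scheduler's placement of the two processes, not the pipe code, dominates the number.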
* Re: pipe performance regression on ia64
  2005-01-18 18:11 ` Linus Torvalds
@ 2005-01-18 18:31   ` David Mosberger
  2005-01-18 20:17     ` Linus Torvalds
  2005-01-18 23:34   ` Nick Piggin
  2005-01-19 12:52   ` Ingo Molnar
  2 siblings, 1 reply; 12+ messages in thread
From: David Mosberger @ 2005-01-18 18:31 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Luck, Tony, linux-ia64, linux-kernel

>>>>> On Tue, 18 Jan 2005 10:11:26 -0800 (PST), Linus Torvalds <torvalds@osdl.org> said:

  Linus> I don't know how to make the benchmark look repeatable and
  Linus> good, though. The CPU affinity thing may be the right thing.

Perhaps it should be split up into three cases:

 - producer/consumer pinned to the same CPU
 - producer/consumer pinned to different CPUs
 - producer/consumer left under control of the scheduler

The first two would let us observe any changes in the actual pipe code,
whereas the 3rd case would tell us which case the scheduler is leaning
towards (or if it starts doing something really crazy, like rescheduling
the tasks on different CPUs each time, we'd see a bandwidth lower than
case 2, and that should ring alarm bells).

	--david
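David's three-way split can be prototyped with `os.sched_setaffinity`, since a child forked after pinning inherits the parent's mask, while pinning parent and child to distinct CPUs gives the cross-CPU case. A sketch of the pinning logic only (the `case` names and the helper are ours, hypothetical; the measurement loop is omitted, and the "different" case assumes at least 2 online CPUs):

```python
import os

def set_affinity_for_case(case, child_pid):
    """Apply one of the three pinning policies from the proposal above.
    case is 'same', 'different', or 'free'; child_pid is the producer."""
    cpus = sorted(os.sched_getaffinity(0))   # CPUs we may run on
    if case == "same":
        os.sched_setaffinity(0, {cpus[0]})          # consumer (self)
        os.sched_setaffinity(child_pid, {cpus[0]})  # producer
    elif case == "different":
        os.sched_setaffinity(0, {cpus[0]})
        os.sched_setaffinity(child_pid, {cpus[1]})
    # case == "free": leave both under scheduler control
```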
* Re: pipe performance regression on ia64
  2005-01-18 18:31 ` David Mosberger
@ 2005-01-18 20:17   ` Linus Torvalds
  2005-01-19  3:05     ` [Lmbench-users] " Larry McVoy
  2005-01-19 16:40     ` Larry McVoy
  0 siblings, 2 replies; 12+ messages in thread
From: Linus Torvalds @ 2005-01-18 20:17 UTC (permalink / raw)
To: davidm, carl.staelin
Cc: Luck, Tony, lmbench-users, linux-ia64, Kernel Mailing List

On Tue, 18 Jan 2005, David Mosberger wrote:
>
> >>>>> On Tue, 18 Jan 2005 10:11:26 -0800 (PST), Linus Torvalds <torvalds@osdl.org> said:
>
>   Linus> I don't know how to make the benchmark look repeatable and
>   Linus> good, though. The CPU affinity thing may be the right thing.
>
> Perhaps it should be split up into three cases:
>
>  - producer/consumer pinned to the same CPU
>  - producer/consumer pinned to different CPUs
>  - producer/consumer left under control of the scheduler
>
> The first two would let us observe any changes in the actual pipe
> code, whereas the 3rd case would tell us which case the scheduler is
> leaning towards (or if it starts doing something really crazy, like
> rescheduling the tasks on different CPUs each time, we'd see a
> bandwidth lower than case 2, and that should ring alarm bells).

Yes, that would be good.

However, I don't know who (if anybody) maintains lmbench any more. It
might be Carl Staelin (added to cc), and there used to be a mailing list
which may or may not be active any more..

[ Background for Carl (and/or lmbench-users):

  The "pipe bandwidth" test ends up giving wildly fluctuating numbers
  (and even when stable, pretty nonsensical ones, since they depend very
  strongly on the size of the buffer being used to do the writes vs the
  buffer size in the kernel) purely depending on where the reader/writer
  got scheduled.

  So a recent kernel buffer management change made lmbench numbers vary
  radically, ranging from huge improvements to big decreases.

  It would be useful to see the numbers as a function of CPU selection on
  SMP (the same is probably true also for the scheduling latency
  benchmark, which is also extremely unstable on SMP).

  It's not just that it has big variance - you can't just average out
  many runs. It has very "modal" operation, making averages meaningless.

  A trivial thing that would work for most cases is just a simple (change
  the "1" to whatever CPU mask you want for some case)

	long affinity = 1;	/* bitmask: CPU0 only */
	sched_setaffinity(0, sizeof(long), &affinity);

  but I don't know what other OS's do, so it's obviously not portable ]

Hmm?

		Linus
* Re: [Lmbench-users] Re: pipe performance regression on ia64
  2005-01-18 20:17 ` Linus Torvalds
@ 2005-01-19  3:05   ` Larry McVoy
  2005-01-19  3:20     ` Linus Torvalds
  2005-01-19 16:40   ` Larry McVoy
  1 sibling, 1 reply; 12+ messages in thread
From: Larry McVoy @ 2005-01-19 3:05 UTC (permalink / raw)
To: Linus Torvalds
Cc: davidm, carl.staelin, Luck, Tony, lmbench-users, linux-ia64, Kernel Mailing List

It would be good if you copied me directly since I don't read the kernel
list anymore (I'd love to but don't have the bandwidth) and I rarely read
the lmbench list. But only if you want to drag me into it, of course.

Carl and I both work on LMbench but not very actively. I had really hoped
that once people saw how small the benchmarks are they would create their
own:

	work ~/LMbench2/src wc bw_pipe.c
	     120     340    2399 bw_pipe.c

I'm very unthrilled with the idea of adding stuff to the release benchmark
which is OS-specific. That said, there is nothing to say that you can't
grab the benchmark and tweak your own test case in there to prove or
disprove your theory.

If you want to take LMbench and turn it into LinuxBench or something like
that, so that it is clear that it is just a regression test for Linux,
then hacking in a bunch of tests would make a ton of sense. But if you
keep it generic I can give you output on a pile of different OS's on
relatively recent hardware, since we just upgraded our build cluster:

Welcome to redhat52.bitmover.com, a 2.1Ghz Athlon running Red Hat 5.2.
Welcome to redhat62.bitmover.com, a 2.16Ghz Athlon running Red Hat 6.2.
Welcome to redhat71.bitmover.com, a 2.1Ghz Athlon running Red Hat 7.1.
Welcome to redhat9.bitmover.com, a 2.1Ghz Athlon running Red Hat 9.
Welcome to amd64.bitmover.com, a 2Ghz AMD 64 running Fedora Core 1.
Welcome to parisc.bitmover.com, a 552Mhz PA8600 running Debian 3.1.
Welcome to ppc.bitmover.com, a 400Mhz PowerPC running Yellow Dog 1.2.
Welcome to macos.bitmover.com, a dual 1.2Ghz G4 running MacOS 10.2.8.
Welcome to sparc.bitmover.com, a 440Mhz Sun Netra T1 running Debian 3.1.
Welcome to alpha.bitmover.com, a 500Mhz AlphaPC running Red Hat 7.2.
Welcome to ia64.bitmover.com, a dual 800Mhz Itanium running Red Hat 7.2.
Welcome to freebsd.bitmover.com, a 2.17Ghz Athlon running FreeBSD 2.2.8.
Welcome to freebsd3.bitmover.com, a 1.8Ghz Athlon running FreeBSD 3.2.
Welcome to freebsd4.bitmover.com, a 1.8Ghz Athlon running FreeBSD 4.1.
Welcome to freebsd5.bitmover.com, a 1.6Ghz Athlon running FreeBSD 5.1.
Welcome to openbsd.bitmover.com, a 2.17Ghz Athlon running OpenBSD 3.4.
Welcome to netbsd.bitmover.com, a 1Ghz Athlon running NetBSD 1.6.1.
Welcome to sco.bitmover.com, a 1.8Ghz Athlon running SCO OpenServer R5.
Welcome to sun.bitmover.com, a 440Mhz Sun Ultra 10 running Solaris 2.6.
Welcome to sunx86.bitmover.com, a dual 1Ghz PIII running Solaris 2.7.
Welcome to sgi.bitmover.com, a 195Mhz MIPS IP28 running IRIX 6.5.
Welcome to sibyte.bitmover.com, a dual 800Mhz MIPS running Debian 3.0.
Welcome to hp.bitmover.com, a 552Mhz PA8600 running HP-UX 10.20.
Welcome to hp11.bitmover.com, a dual 550Mhz PA8500 running HP-UX 11.11.
Welcome to hp11-32bit.bitmover.com, a 400Mhz PA8500 running HP-UX 11.11.
Welcome to aix.bitmover.com, a 332Mhz PowerPC running AIX 4.1.5.
Welcome to qube.bitmover.com, a 250Mhz MIPS running Linux 2.0.34.
Welcome to arm.bitmover.com, a 233Mhz StrongARM running Linux 2.2.
Welcome to tru64.bitmover.com, a 600Mhz Alpha running Tru64 5.1B.
Welcome to winxp2.bitmover.com, a 2.1Ghz Athlon running Windows XP.

On Tue, Jan 18, 2005 at 12:17:11PM -0800, Linus Torvalds wrote:
> On Tue, 18 Jan 2005, David Mosberger wrote:
> >
> > Perhaps it should be split up into three cases:
> >
> >  - producer/consumer pinned to the same CPU
> >  - producer/consumer pinned to different CPUs
> >  - producer/consumer left under control of the scheduler
> >
> > The first two would let us observe any changes in the actual pipe
> > code, whereas the 3rd case would tell us which case the scheduler is
> > leaning towards (or if it starts doing something really crazy, like
> > rescheduling the tasks on different CPUs each time, we'd see a
> > bandwidth lower than case 2, and that should ring alarm bells).
>
> Yes, that would be good.
>
> However, I don't know who (if anybody) maintains lmbench any more. It
> might be Carl Staelin (added to cc), and there used to be a mailing list
> which may or may not be active any more..
>
> [ Background for Carl (and/or lmbench-users):
>
>   The "pipe bandwidth" test ends up giving wildly fluctuating numbers
>   (and even when stable, pretty nonsensical ones, since they depend very
>   strongly on the size of the buffer being used to do the writes vs the
>   buffer size in the kernel) purely depending on where the reader/writer
>   got scheduled.
>
>   So a recent kernel buffer management change made lmbench numbers vary
>   radically, ranging from huge improvements to big decreases. It would
>   be useful to see the numbers as a function of CPU selection on SMP
>   (the same is probably true also for the scheduling latency benchmark,
>   which is also extremely unstable on SMP).
>
>   It's not just that it has big variance - you can't just average out
>   many runs. It has very "modal" operation, making averages meaningless.
>
>   A trivial thing that would work for most cases is just a simple
>   (change the "1" to whatever CPU mask you want for some case)
>
>	long affinity = 1;	/* bitmask: CPU0 only */
>	sched_setaffinity(0, sizeof(long), &affinity);
>
>   but I don't know what other OS's do, so it's obviously not portable ]
>
> Hmm?
>
> 		Linus
> _______________________________________________
> Lmbench-users mailing list
> Lmbench-users@bitmover.com
> http://bitmover.com/mailman/listinfo/lmbench-users
-- 
---
Larry McVoy              lm at bitmover.com          http://www.bitkeeper.com
* Re: [Lmbench-users] Re: pipe performance regression on ia64
  2005-01-19  3:05 ` [Lmbench-users] " Larry McVoy
@ 2005-01-19  3:20   ` Linus Torvalds
  0 siblings, 0 replies; 12+ messages in thread
From: Linus Torvalds @ 2005-01-19 3:20 UTC (permalink / raw)
To: Larry McVoy
Cc: davidm, carl.staelin, Luck, Tony, lmbench-users, linux-ia64, Kernel Mailing List

On Tue, 18 Jan 2005, Larry McVoy wrote:
>
> I'm very unthrilled with the idea of adding stuff to the release benchmark
> which is OS-specific. That said, there is nothing to say that you can't
> grab the benchmark and tweak your own test case in there to prove or
> disprove your theory.

Hmm.. The notion of SMP and CPU pinning is certainly not OS-specific (and
I bet you'll see all the same issues everywhere else too), but the
interfaces do tend to be, which makes it a bit uncomfortable..

		Linus
* Re: [Lmbench-users] Re: pipe performance regression on ia64
  2005-01-18 20:17 ` Linus Torvalds
  2005-01-19  3:05 ` [Lmbench-users] " Larry McVoy
@ 2005-01-19 16:40 ` Larry McVoy
  1 sibling, 0 replies; 12+ messages in thread
From: Larry McVoy @ 2005-01-19 16:40 UTC (permalink / raw)
To: Linus Torvalds
Cc: davidm, carl.staelin, Luck, Tony, lmbench-users, linux-ia64, Kernel Mailing List

On Tue, Jan 18, 2005 at 12:17:11PM -0800, Linus Torvalds wrote:
> On Tue, 18 Jan 2005, David Mosberger wrote:
> >
> > Perhaps it should be split up into three cases:
> >
> >  - producer/consumer pinned to the same CPU
> >  - producer/consumer pinned to different CPUs
> >  - producer/consumer left under control of the scheduler
> >
> > The first two would let us observe any changes in the actual pipe
> > code, whereas the 3rd case would tell us which case the scheduler is
> > leaning towards (or if it starts doing something really crazy, like
> > rescheduling the tasks on different CPUs each time, we'd see a
> > bandwidth lower than case 2, and that should ring alarm bells).
>
> Yes, that would be good.

You're revisiting a pile of work I did back at SGI. I'm pretty sure all
of this has been thought through before, but it's worth going over again.
I have some pretty strong opinions about this that schedulers tend not to
like.

It's certainly true that you can increase the performance of this sort of
problem by pinning the processes to a CPU and/or different CPUs. For
specific applications that is a fine thing to do; I did that for the bulk
data server that was moving NFS traffic over a TCP socket over HIPPI (if
you look at images from space that came from the military, it is pretty
likely that they passed through that code).

Pinning the processes to a particular _cache_ (not CPU, the CPU has
nothing to do with it) gave me around 20% better throughput.

The problem is that schedulers tend to be too smart: they try to figure
out where to put the process each time they schedule. In general that is
the wrong answer, for two reasons:

 a) It's more work on each context switch
 b) It only works for processes that use up a substantial fraction of
    their time slice (because the calculation is typically based in part
    on the idea that if you ran on this cache for a long time then you
    want to stay here).

The problem with the "thinking scheduler" is that it doesn't work for I/O
loads. That sort of approach will believe that it is fine to move a
process which hasn't run for a long time. That's false - you are
invalidating its cache, and that hurts. That's the 20% gain I got.

You are far better off, in my opinion, having a scheduler that thinks at
process creation time and then only when the load gets unbalanced. Other
than that, it always puts the process back on the CPU where it last ran.
If the scheduler guesses wrong and puts two processes on the same CPU and
they fight, one will get moved. But it shouldn't be moved right away;
leave it there and let things settle a bit.

If someone coded up this policy and tried it, I think it would go a long
way towards making the LMbench timings more stable. I could be wrong, but
it would be interesting to compare this approach with a manual placement
approach. Manual placement will always do better, but it should be in the
5% range, not in the 20% range.
-- 
---
Larry McVoy              lm at bitmover.com          http://www.bitkeeper.com
* Re: pipe performance regression on ia64
  2005-01-18 18:11 ` Linus Torvalds
  2005-01-18 18:31 ` David Mosberger
@ 2005-01-18 23:34 ` Nick Piggin
  2005-01-19  5:11   ` David Mosberger
  2005-01-19 12:52 ` Ingo Molnar
  2 siblings, 1 reply; 12+ messages in thread
From: Nick Piggin @ 2005-01-18 23:34 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Luck, Tony, linux-ia64, linux-kernel

Linus Torvalds wrote:
>
> On Tue, 18 Jan 2005, Luck, Tony wrote:
>
>> David Mosberger:
>>
>>> So, when we run bw_pipe on a low-load SMP machine, the load balancer
>>> is always trying to spread the 2 processes out while
>>> wake_up_interruptible_sync() is always trying to draw them back onto
>>> 1 cpu.
>>>
>>> Linus's patch will reduce the chance to call
>>> wake_up_interruptible_sync() a lot.
>>>
>>> For the bw_pipe writer or reader, the buffer size is 64k. In a
>>> 16k-page kernel, the old kernel will call wake_up_interruptible_sync
>>> 4 times but the new kernel will call wakeup only 1 time.
>
> Yes, it will depend on the buffer size, and on whether the writer
> actually does any _work_ to fill it, or just writes it.
>
> The thing is, in real life, the "wake_up()" tends to be preferable,
> because even though we are totally synchronized on the pipe semaphore
> (which is a locking issue in itself that might be worth looking into),
> most real loads will actually do something to _generate_ the write data
> in the first place, and thus you actually want to spread the load out
> over CPUs.
>
> The lmbench pipe benchmark is kind of special, since the writer
> literally does nothing but write and the reader does nothing but read,
> so there is nothing to parallelize.
>
> The "wake_up_sync()" hack only helps for the special case where we know
> the writer is going to write more. Of course, we could make the pipe
> code use that "synchronous" write unconditionally, and benchmarks would
> look better, but I suspect it would hurt real life.
>
> The _normal_ use of a pipe, after all, is having a writer that does
> real work to generate the data (like 'cc1'), and a sink that actually
> does real work with it (like 'as'), and having less synchronization is
> a _good_ thing.
>
> I don't know how to make the benchmark look repeatable and good,
> though. The CPU affinity thing may be the right thing.

Regarding scheduler balancing behaviour: the problem could also be
magnified in recent -bk kernels by the "wake up to an idle CPU" code in
sched.c:try_to_wake_up(). To turn this off, remove SD_WAKE_IDLE from
include/linux/topology.h:SD_CPU_INIT and
include/asm/topology.h:SD_NODE_INIT.

David, I remember you reporting a pipe bandwidth regression, and I had a
patch for it, but that hurt other workloads, so I don't think we ever
really got anywhere. I've recently begun having another look at the
multiprocessor balancer, so hopefully I can get a bit further with it
this time.
* Re: pipe performance regression on ia64
  2005-01-18 23:34 ` Nick Piggin
@ 2005-01-19  5:11   ` David Mosberger
  2005-01-19 12:43     ` Nick Piggin
  0 siblings, 1 reply; 12+ messages in thread
From: David Mosberger @ 2005-01-19 5:11 UTC (permalink / raw)
To: Nick Piggin; +Cc: Linus Torvalds, Luck, Tony, linux-ia64, linux-kernel

>>>>> On Wed, 19 Jan 2005 10:34:30 +1100, Nick Piggin <nickpiggin@yahoo.com.au> said:

  Nick> David, I remember you reporting a pipe bandwidth regression,
  Nick> and I had a patch for it, but that hurt other workloads, so I
  Nick> don't think we ever really got anywhere. I've recently begun
  Nick> having another look at the multiprocessor balancer, so
  Nick> hopefully I can get a bit further with it this time.

While it may be worthwhile to improve the scheduler, it's clear that
there isn't going to be a trivial "fix" for this issue, especially since
it's not even clear that anything is really broken. Independent of the
scheduler work, it would be very useful to have a pipe benchmark which at
least made the dependencies on the scheduler obvious. So I think
improving the scheduler and improving the LMbench pipe benchmark are
entirely complementary.

	--david
* Re: pipe performance regression on ia64
  2005-01-19  5:11 ` David Mosberger
@ 2005-01-19 12:43   ` Nick Piggin
  2005-01-19 17:31     ` David Mosberger
  0 siblings, 1 reply; 12+ messages in thread
From: Nick Piggin @ 2005-01-19 12:43 UTC (permalink / raw)
To: davidm; +Cc: Linus Torvalds, Luck, Tony, linux-ia64, linux-kernel

David Mosberger wrote:
> While it may be worthwhile to improve the scheduler, it's clear that
> there isn't going to be a trivial "fix" for this issue, especially
> since it's not even clear that anything is really broken. Independent
> of the scheduler work, it would be very useful to have a pipe
> benchmark which at least made the dependencies on the scheduler
> obvious. So I think improving the scheduler and improving the LMbench
> pipe benchmark are entirely complementary.

Oh, that's quite true. A bad score on SMP on the pipe benchmark does not
mean anything is broken.

And IMO, probably many (most?) lmbench tests should be run with all
processes bound to the same CPU on SMP systems, to get the best
repeatability and an indication of the basic serial speed of the
operation (which AFAIK is what they aim to measure).

Having the scheduler take care of process placement is interesting too,
of course. But it adds a new variable to the tests, which IMO doesn't
always suit lmbench too well.
* Re: pipe performance regression on ia64
  2005-01-19 12:43 ` Nick Piggin
@ 2005-01-19 17:31   ` David Mosberger
  0 siblings, 0 replies; 12+ messages in thread
From: David Mosberger @ 2005-01-19 17:31 UTC (permalink / raw)
To: Nick Piggin; +Cc: davidm, Linus Torvalds, Luck, Tony, linux-ia64, linux-kernel

>>>>> On Wed, 19 Jan 2005 23:43:45 +1100, Nick Piggin <nickpiggin@yahoo.com.au> said:

  Nick> Oh, that's quite true. A bad score on SMP on the pipe benchmark
  Nick> does not mean anything is broken.

  Nick> And IMO, probably many (most?) lmbench tests should be run
  Nick> with all processes bound to the same CPU on SMP systems, to get
  Nick> the best repeatability and an indication of the basic serial
  Nick> speed of the operation (which AFAIK is what they aim to
  Nick> measure).

We need to keep an eye on both the intra- and the inter-cpu
pipe-bandwidth, and should measure them explicitly. The problem is that
at the moment we get one, the other, or a mixture of the two, subject to
the vagaries of the scheduler. If we could reliably measure both the
intra- and inter-cpu cases, we may well find new optimization
opportunities (I'm almost certain that's the case for the cross-cpu
case, which is probably the more important case, actually).

	--david
* Re: pipe performance regression on ia64
  2005-01-18 18:11 ` Linus Torvalds
  2005-01-18 18:31 ` David Mosberger
  2005-01-18 23:34 ` Nick Piggin
@ 2005-01-19 12:52 ` Ingo Molnar
  2 siblings, 0 replies; 12+ messages in thread
From: Ingo Molnar @ 2005-01-19 12:52 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Luck, Tony, linux-ia64, linux-kernel

* Linus Torvalds <torvalds@osdl.org> wrote:

> The "wake_up_sync()" hack only helps for the special case where we
> know the writer is going to write more. Of course, we could make the
> pipe code use that "synchronous" write unconditionally, and benchmarks
> would look better, but I suspect it would hurt real life.

not just that, it's incorrect scheduling, because it introduces the
potential to delay the woken-up task by a long time, amounting to a
missed wakeup.

> I don't know how to make the benchmark look repeatable and good,
> though. The CPU affinity thing may be the right thing.

the fundamental bw_pipe scenario is this: the wakeup happens earlier
than the waker suspends (because it's userspace that decides about
suspension). So the kernel rightfully notifies another, idle CPU to run
the freshly woken task.

If the message passing across CPUs is fast enough and the target CPU
manages to 'grab' the task, then we'll get the "slow" benchmark case:
waker remaining on this CPU, wakee running on another CPU. If this CPU
happens to suspend fast enough, before that other CPU has had the chance
to grab the task (we 'steal the task back'), then we'll see the "fast"
benchmark scenario.

i've seen traces where a single bw_pipe testrun showed _both_ variants
in chunks of 100s of milliseconds, probably due to cacheline placement
putting the overhead sometimes above the critical latency, sometimes
below it.

so there will always be this 'latency and tendency to reschedule on
another CPU' thing that will act as a barrier between 'really good' and
'really bad' numbers, and if a test happens to be around that boundary
it will fluctuate back and forth.

and this property also has another effect: _worse_ scheduling decisions
(not waking up an idle CPU when we could) can result in _better_ bw_pipe
numbers. Also, a _slower_ scheduler can sometimes move the bw_pipe
workload below the threshold, resulting in _better_ numbers. So as far
as SMP systems are concerned, bw_pipe numbers have to be considered very
carefully.

this is a generic thing: message-passing latency always scales inversely
with the quality of distribution of SMP tasks. The better we are at
spreading out tasks, the worse message-passing latency gets. (nothing
will beat passive, work-less 'message passing' between two tasks on the
same CPU.)

	Ingo