public inbox for linux-kernel@vger.kernel.org
* pipe performance regression on ia64
@ 2005-01-18 17:41 Luck, Tony
  2005-01-18 18:11 ` Linus Torvalds
  0 siblings, 1 reply; 15+ messages in thread
From: Luck, Tony @ 2005-01-18 17:41 UTC (permalink / raw)
  To: torvalds; +Cc: linux-ia64, linux-kernel

David Mosberger pointed out to me that 2.6.11-rc1 kernel scores
very badly on ia64 in lmbench pipe throughput test (bw_pipe) compared
with earlier kernels.

Nanhai Zou looked into this, and found that the performance loss
began with Linus' patch to speed up pipe performance by allocating
a circular list of pages.

Here's his analysis:

>OK, I know the reason now.
>
>This regression we saw comes from scheduler load balancer.
>
>Pipe is a kind of workload in which the writer and reader never run
>at the same time.  They are synchronized by a semaphore: one is
>always sleeping while the other end is working.
>
>To keep the cache hot, we do not want the writer and reader
>to be balanced onto 2 cpus.  That is why in fs/pipe.c the kernel uses
>wake_up_interruptible_sync() instead of wake_up_interruptible() to
>wake up the process.
>
>However, the load balancer still spreads the processes out if any
>other cpu is idle.  Note that on an HT-enabled x86 the load balancer
>will first balance the process to a cpu in the SMT domain, where
>there is no cache miss penalty.
>
>So, when we run bw_pipe on a lightly loaded SMP machine, the load
>balancer keeps trying to spread the 2 processes out while
>wake_up_interruptible_sync() keeps trying to draw them back onto
>1 cpu.
>
>Linus's patch greatly reduces the chances of calling
>wake_up_interruptible_sync().
>
>For the bw_pipe writer and reader, the buffer size is 64k.  On a
>kernel with 16k pages, the old kernel called
>wake_up_interruptible_sync() 4 times per buffer, but the new kernel
>calls it only once.
>
>Now the load balancer wins: the processes run on 2 cpus most of the
>time, and they pay a lot of cache miss penalty.
>
>To prove this, just run 4 instances of bw_pipe on a 4-way Tiger so
>that the load balancer is less active.
>
>Or simply add some code at the top of main() in bw_pipe.c
>(Linux-specific; needs #define _GNU_SOURCE and #include <sched.h>):
>
>{
>  cpu_set_t mask;
>  CPU_ZERO(&mask);
>  CPU_SET(0, &mask);  /* pin to cpu 0 */
>  sched_setaffinity(getpid(), sizeof(mask), &mask);
>}
>then make and run bw_pipe again.
>
>Now I get a throughput of 5GB...

-Tony

^ permalink raw reply	[flat|nested] 15+ messages in thread
* RE: [Lmbench-users] Re: pipe performance regression on ia64
@ 2005-01-19  3:24 Zou, Nanhai
  0 siblings, 0 replies; 15+ messages in thread
From: Zou, Nanhai @ 2005-01-19  3:24 UTC (permalink / raw)
  To: Larry McVoy, Linus Torvalds
  Cc: davidm, carl.staelin, Luck, Tony, lmbench-users, linux-ia64,
	Kernel Mailing List

> -----Original Message-----
> From: linux-ia64-owner@vger.kernel.org
> [mailto:linux-ia64-owner@vger.kernel.org] On Behalf Of Larry McVoy
> Sent: Wednesday, January 19, 2005 11:05 AM
> To: Linus Torvalds
> Cc: davidm@hpl.hp.com; carl.staelin@hp.com; Luck, Tony;
> lmbench-users@bitmover.com; linux-ia64@vger.kernel.org; Kernel Mailing
List
> Subject: Re: [Lmbench-users] Re: pipe performance regression on ia64
> 
> I'm very unthrilled with the idea of adding stuff to the release
> benchmark which is OS specific.  That said, there is nothing to say
> that you can't grab the benchmark and tweak your own test case in
> there to prove or disprove your theory.
> 

Maybe lmbench could add a feature so that bw_pipe forks as many
children as there are CPUs and measures the average throughput.

This would give a much more reasonable result when running bw_pipe on
an SMP box, at least for Linux.

Zou Nan hai

* RE: [Lmbench-users] Re: pipe performance regression on ia64
@ 2005-01-19  6:35 Luck, Tony
  0 siblings, 0 replies; 15+ messages in thread
From: Luck, Tony @ 2005-01-19  6:35 UTC (permalink / raw)
  To: Zou, Nanhai, Larry McVoy, Linus Torvalds
  Cc: davidm, carl.staelin, lmbench-users, linux-ia64,
	Kernel Mailing List

>Maybe lmbench could add a feature so that bw_pipe forks as many
>children as there are CPUs and measures the average throughput.
>
>This would give a much more reasonable result when running bw_pipe
>on an SMP box, at least for Linux.

bw_pipe (along with most/all of the lmbench tools) already has
a "-P" argument to specify the degree of parallelism.

-Tony

* RE: [Lmbench-users] Re: pipe performance regression on ia64
@ 2005-01-19  9:23 Staelin, Carl
  0 siblings, 0 replies; 15+ messages in thread
From: Staelin, Carl @ 2005-01-19  9:23 UTC (permalink / raw)
  To: Luck, Tony, Zou, Nanhai, Larry McVoy, Linus Torvalds
  Cc: Mosberger, David, lmbench-users, linux-ia64, Kernel Mailing List

One problem is that on SMPs "average" doesn't really
make sense.  Statistically, "average" (mean()) only
really makes sense when you have a Gaussian distribution
of results.  The benchmark results for SMPs tend to
be strongly modal, i.e. very tight clusters around
a few values.  In this environment "average" is
generally meaningless.

That being said, the '-P' flag exists on most lmbench
version 3 benchmarks and allows one to have a given
number of jobs running in parallel.  It is intended
to measure performance under load.  However, even
in this case one may see modal results.  Please see
the recent lmbench paper [2] for an example.


Cheers,

Carl

References
[1] Larry McVoy and Carl Staelin.  lmbench: Portable
    tools for performance analysis.  Proceedings 1996
    USENIX Annual Technical Conference (San Diego, CA),
    pages 279--284.  January 1996.
    http://www.usenix.org/publications/library/proceedings/sd96/mcvoy.html
[2] Carl Staelin.  lmbench --- an extensible micro-
    benchmark suite.  HPL-2004-213.  December 2004.
    Also to appear in Software Practice and Experience.
    http://www.hpl.hp.com/techreports/2004/HPL-2004-213.pdf



_________________________________________________
[(hp)]	Carl Staelin
	Senior Research Scientist
	Hewlett-Packard Laboratories
	Technion City
	Haifa, 32000
	ISRAEL
	+972(4)823-1237x305	+972(4)822-0407 fax
	carl.staelin@hp.com
______ http://www.hpl.hp.com/personal/Carl_Staelin ______

-----Original Message-----
From: Luck, Tony [mailto:tony.luck@intel.com] 
Sent: Wednesday, January 19, 2005 8:35 AM
To: Zou, Nanhai; Larry McVoy; Linus Torvalds
Cc: Mosberger, David; Staelin, Carl; lmbench-users@bitmover.com;
linux-ia64@vger.kernel.org; Kernel Mailing List
Subject: RE: [Lmbench-users] Re: pipe performance regression on ia64

>Maybe lmbench could add a feature so that bw_pipe forks as many
>children as there are CPUs and measures the average throughput.
>
>This would give a much more reasonable result when running bw_pipe on
>an SMP box, at least for Linux.

bw_pipe (along with most/all of the lmbench tools) already has a "-P"
argument to specify the degree of parallelism.

-Tony



end of thread, other threads:[~2005-01-19 17:34 UTC | newest]

Thread overview: 15+ messages
2005-01-18 17:41 pipe performance regression on ia64 Luck, Tony
2005-01-18 18:11 ` Linus Torvalds
2005-01-18 18:31   ` David Mosberger
2005-01-18 20:17     ` Linus Torvalds
2005-01-19  3:05       ` [Lmbench-users] " Larry McVoy
2005-01-19  3:20         ` Linus Torvalds
2005-01-19 16:40       ` Larry McVoy
2005-01-18 23:34   ` Nick Piggin
2005-01-19  5:11     ` David Mosberger
2005-01-19 12:43       ` Nick Piggin
2005-01-19 17:31         ` David Mosberger
2005-01-19 12:52   ` Ingo Molnar
  -- strict thread matches above, loose matches on Subject: below --
2005-01-19  3:24 [Lmbench-users] " Zou, Nanhai
2005-01-19  6:35 Luck, Tony
2005-01-19  9:23 Staelin, Carl
