public inbox for linux-kernel@vger.kernel.org
* Re: O(1) scheduler gives big boost to tbench 192
@ 2002-05-06  8:20 rwhron
  2002-05-06 16:42 ` Andrea Arcangeli
  0 siblings, 1 reply; 24+ messages in thread
From: rwhron @ 2002-05-06  8:20 UTC (permalink / raw)
  To: andrea; +Cc: linux-kernel

> BTW, Randy, I've seen my tree run slower with tiobench; that's probably
> because I made the elevator anti-starvation logic more aggressive than
> mainline and the other kernel trees (to help interactive usage). Could
> you try to run tiobench on -aa after elvtune -r 8192 -w 16384
> /dev/hd[abcd] to verify? Thanks for the great benchmarking effort.

I will have results on the big machine in a couple days.  On the 
small machine, elvtune increases tiobench sequential reads by
30-50%, and lowers worst case latency a little.

More -aa at:
http://home.earthlink.net/~rwhron/kernel/aa.html

> And for the reason fork is faster in -aa that's partly thanks to the
> reschedule-child-first logic, that can be easily merged in mainline,
> it's just in 2.5.

Is that part of the parent_timeslice patch?  parent_timeslice helped 
fork a little when I tried isolating patches to find what
makes fork faster in -aa.  It is more than one patch as far as 
I can tell.

On uniprocessor, in the unixbench execl test, all -aa kernels going back 
at least to 2.4.15aa1 are about 20% faster than other trees, even those 
like jam and akpm's split VM.  Fork in -aa on a more "real world" 
test (autoconf build) is about 8-10% faster than other kernel trees.

On the quad Xeon, with its bigger L2 cache, the autoconf (fork test)
difference between mainline and -aa is smaller.  The -aa based VMs in
aa, jam, and mainline have about a 15% edge over the rmap VM in ac and
rmap.  jam has a slight advantage for the autoconf build, possibly
because of the O(1) effect, which is more likely to show up since more
processes execute on the 4 way box.

More quad Xeon at:
http://home.earthlink.net/~rwhron/kernel/bigbox.html


-- 
Randy Hron


^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: O(1) scheduler gives big boost to tbench 192
@ 2002-05-20 12:46 rwhron
  0 siblings, 0 replies; 24+ messages in thread
From: rwhron @ 2002-05-20 12:46 UTC (permalink / raw)
  To: linux-kernel; +Cc: kravetz, jamagallon, rml

> On Tue, May 07, 2002 at 04:39:34PM -0700, Robert Love wrote:
> > It is just for pipes we previously used sync, no?

On Tue, 7 May 2002 16:48:57 -0700, Mike Kravetz wrote
> That's the only thing I know of that used it.

> I'd really like to know if there are any real workloads that
> benefited from this feature, rather than just some benchmark.
> I can do some research, but was hoping someone on this list
> might remember.  If there is a valid workload, I'll propose
> a patch.  

On Mon, 13 May 2002 02:06:31 +0200, J.A. Magallon wrote: 
> - Re-introduction of wake_up_sync to make pipes run fast again. No idea
> whether this is useful or not; that is the point, to test it

2.4.19-pre8-jam2 showed slightly better performance on the quad Xeon
for most benchmarks with 25-wake_up_sync backed out.  However, it's
not clear to me that 25-wake_up_sync was the proper patch to back out
for this test, as there wasn't a dramatic change in Pipe latency or
bandwidth without it.  

There was a >300% improvement in lmbench Pipe bandwidth and latency 
comparing pre8-jam2 to pre7-jam6.  

Average of 25 lmbench runs on jam2 kernels, 12 on the others:
2.4.19-pre8-jam2-nowuos (backed out 25-wake_up_sync patch)

*Local* Communication latencies in microseconds - smaller is better
                                 AF     
kernel                   Pipe   UNIX   
-----------------------  -----  -----  
2.4.19-pre7-jam6         29.51  42.37  
2.4.19-pre8              10.73  29.94  
2.4.19-pre8-aa2          12.45  29.53  
2.4.19-pre8-ac1          35.39  45.59  
2.4.19-pre8-jam2          7.70  15.27  
2.4.19-pre8-jam2-nowuos   7.74  14.93  


*Local* Communication bandwidths in MB/s - bigger is better
                                   AF  
kernel                    Pipe    UNIX 
-----------------------  ------  ------
2.4.19-pre7-jam6          66.41  260.39
2.4.19-pre8              468.57  273.32
2.4.19-pre8-aa2          418.09  273.59
2.4.19-pre8-ac1          110.62  241.06
2.4.19-pre8-jam2         545.66  233.68
2.4.19-pre8-jam2-nowuos  544.57  246.53

The kernel build test, which applies patches through a pipe
and compiles with -pipe, didn't reflect an improvement.

kernel                   average  min_time  max_time  runs  notes
2.4.19-pre7-jam6           237.0       235       239     3  All successful
2.4.19-pre8                239.7       238       241     3  All successful
2.4.19-pre8-aa2            237.7       237       238     3  All successful
2.4.19-pre8-ac1            239.3       238       241     3  All successful
2.4.19-pre8-jam2           240.0       238       241     3  All successful
2.4.19-pre8-jam2-nowuos    238.7       236       241     3  All successful

I don't know how much of the kernel build test is dependent on
pipe performance.  There is probably a better "real world"
measurement.  

On a single processor box, there was an improvement on kernel build
between pre7-jam6 and pre8-jam2.  That was only on one sample though.

Xeon page:
http://home.earthlink.net/~rwhron/kernel/bigbox.html

Latest on uniproc:
http://home.earthlink.net/~rwhron/kernel/latest.html

-- 
Randy Hron


* Re: O(1) scheduler gives big boost to tbench 192
@ 2002-05-08 16:39 Bill Davidsen
  0 siblings, 0 replies; 24+ messages in thread
From: Bill Davidsen @ 2002-05-08 16:39 UTC (permalink / raw)
  To: Linux-Kernel Mailing List

Forgive me if you feel I've clipped too much from your posting; I'm trying
to capture the points made by various folks without responding to each
message.

---------- Forwarded message ----------
From: Mike Kravetz <kravetz@us.ibm.com>
Date: Tue, 7 May 2002 15:13:56 -0700

I have experimented with reintroducing '__wake_up_sync' support
into the O(1) scheduler.  The modifications are limited to the
'try_to_wake_up' routine, as they were before.  If the 'synchronous'
flag is set, then 'try_to_wake_up' tries to put the awakened task
on the same runqueue as the caller without forcing a reschedule.
If the task is not already on a runqueue, this is easy.  If it is,
we give up.  The result: the previous bandwidth numbers are restored.

BEFORE
------
Pipe latency:    6.5185 microseconds
Pipe bandwidth: 86.35 MB/sec

AFTER
-----
Pipe latency:     6.5723 microseconds
Pipe bandwidth: 540.13 MB/sec

---------- Forwarded message ----------
From: Andrea Arcangeli <andrea@suse.de>

So my hypothesis about the sync wakeup in the email below has proven to be right:

	http://marc.theaimsgroup.com/?l=linux-kernel&m=102050009725367&w=2

Many thanks for verifying this.

Personally, if the two tasks end up blocking waiting on each other, then
I prefer them to be on the same cpu. That was the whole point of the
optimization. If the pipe buffer is large enough not to require the reader
or writer to block, then we don't do the sync wakeup just now (there's a
detail with the reader that may block simply because the writer is slow
at writing, but it probably doesn't matter much). There are many cases
where a PAGE_SIZE of buffer gets filled in much less than a timeslice,
and for all those cases rescheduling the two tasks one after the other
on the same cpu is a win, just like the benchmark shows.  Think of the
normal pipes we do from the shell, like a "| grep something"; they are
very common and they all want to be handled as sync wakeups.  In
short, when loads of data pass through the pipe at max bandwidth, the
sync wakeup is a definitive win. If the pipe never gets filled, the
writer never does a sync wakeup; the write call just returns
asynchronously. But of course the pipe doesn't get filled because it's not a
max-bandwidth scenario, and so the producer and the consumer are allowed
to scale to multiple cpus by the design of the workload.

Comments?

I would like it if you could pass over your changes to the O(1)
scheduler to resurrect the sync-wakeup.

---------- Forwarded message ----------
From: Mike Kravetz <kravetz@us.ibm.com>
Date: Tue, 7 May 2002 15:43:22 -0700

I'm not sure if 'synchronous' is still being passed all the way
down to try_to_wake_up in your tree (since it was removed in 2.5).
This is based off a back port of O(1) to 2.4.18 that Robert Love
did.  The rest of try_to_wake_up (the normal/common path) remains
the same.

---------- Forwarded message ----------
From: Robert Love <rml@tech9.net>
Date: 07 May 2002 16:39:34 -0700

Hm, interesting.  When Ingo removed the sync variants of wake_up he did
it believing the load balancer would handle the case.  Apparently, at
least in this case, that assumption was wrong.

I agree with your earlier statement, though - this benchmark may be a
case where it shows up negatively but in general the balancing is
preferred.  I can think of plenty of workloads where that is the case. 
I also wonder if over time the load balancer would end up putting the
tasks on the same CPU.  That is something the quick pipe benchmark would
not show.

---------- Forwarded message ----------
From: Mike Kravetz <kravetz@us.ibm.com>
Date: Tue, 7 May 2002 16:48:57 -0700

On Tue, May 07, 2002 at 04:39:34PM -0700, Robert Love wrote:
> It is just for pipes we previously used sync, no?

That's the only thing I know of that used it.

I'd really like to know if there are any real workloads that
benefited from this feature, rather than just some benchmark.
I can do some research, but was hoping someone on this list
might remember.  If there is a valid workload, I'll propose
a patch.  However, I don't think we should be adding patches/
features just to help some benchmark that is unrelated to
real world use.

==== start original material ====

Got to change mailers...

Consider the command line:
  grep pattern huge_log_file | cut -f1-2,5,7 | sed 's/stuff/things/' |
  tee extract.tmp | less

Ideally I would like the pipes to run as fast as possible since I'm
waiting for results, using cache and one CPU where that is best, and using
all the CPUs needed if the machine is SMP and processing is complex. I
believe that the original code came closer to that ideal than the recent
code, and obviously I think the example is "valid workload" since I do
stuff like that every time I look for/at server problems.

I believe the benchmark shows a performance issue which will occur in
normal usage.

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.


* Re: O(1) scheduler gives big boost to tbench 192
@ 2002-05-03 16:37 John Hawkes
  0 siblings, 0 replies; 24+ messages in thread
From: John Hawkes @ 2002-05-03 16:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: rwhron

From: <rwhron@earthlink.net>
...
> tbench 192 is an anomaly test too.  AIM looks like a nice
> "mixed" bench.  Do you have any scripts for it?  I'd like 
> to use AIM too.

Try http://www.caldera.com/developers/community/contrib/aim.html for a tarball
with everything you'll need.

The "Multiuser Shared System Mix" (aka "workfile.shared") is the one I use.
You'll need several disk spindles to keep it compute-bound, though.  Several
of the disk subtests, especially the sync_* tests, quickly drive one or two
spindles to their max transaction rates, and from that point AIM7 will be
I/O-bound and produce a largely idle system, which isn't very interesting if
you're trying to examine CPU scheduler performance with high process counts.

One thing you can do is to comment-out the three sync_* tests in the
workfile.shared configuration file, and then watch your idle time with
something like vmstat.  Experiment with commenting-out more disk subtests,
like creat-clo, disk_cp, and disk_src, one by one, until AIM7 becomes
compute-bound.
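The comment-out step can be scripted. A minimal sketch, assuming the AIM7 workfile lists one subtest per line (the real workfile.shared format may carry weights alongside each name, so adjust the pattern to match):

```shell
# Stand-in workfile (format assumed: one subtest name per line),
# then comment out the sync_* disk subtests as suggested above.
printf 'disk_rd\nsync_disk_rw\nsync_disk_wrt\nsync_disk_cp\n' > workfile.shared
sed -i 's/^sync_/# sync_/' workfile.shared
cat workfile.shared
# While AIM7 runs against the edited workfile, watch idle time with:
#   vmstat 5
```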

John Hawkes

* Re: O(1) scheduler gives big boost to tbench 192
@ 2002-05-03 13:38 rwhron
  2002-05-03 20:29 ` Gerrit Huizenga
  2002-05-07 22:13 ` Mike Kravetz
  0 siblings, 2 replies; 24+ messages in thread
From: rwhron @ 2002-05-03 13:38 UTC (permalink / raw)
  To: gh; +Cc: linux-kernel, alan

> > > Rumor is that on some workloads MQ outperforms O(1), but it
> > > may be that the latest (post K3?) O(1) is catching up?

Is MQ based on the Davide Libenzi scheduler? 
(a version of Davide's scheduler is in the -aa tree).

> > I'd be interested to know what workloads ?

> AIM on large CPU count machines was the most significant I had heard
> about.  Haven't measured recently on database load - we made a cut to
> O(1) some time back for simplicity.  Supposedly volanomark was doing
> better for a while but again we haven't cut back to MQ in quite a while;
> trying instead to refine O(1).  Volanomark is something of a scheduling
> anomaly though - sender/receiver timing on loopback affects scheduling
> decisions and overall throughput in ways that may or may not be consistent
> with real workloads.  AIM is probably a better workload for "real life"
> random scheduling testing.

tbench 192 is an anomaly test too.  AIM looks like a nice
"mixed" bench.  Do you have any scripts for it?  I'd like 
to use AIM too.

A side effect of O(1) in ac2 and jam6 on the 4 way box is a decrease 
in pipe bandwidth and an increase in pipe latency, as measured by lmbench:

kernel                    Pipe bandwidth in MB/s - bigger is better
-----------------------  ------
2.4.16                   383.93
2.4.19-pre3aa2           316.88
2.4.19-pre5              385.56
2.4.19-pre5-aa1          345.93
2.4.19-pre5-aa1-2g-hio   371.87
2.4.19-pre5-aa1-3g-hio   355.97
2.4.19-pre7              462.80
2.4.19-pre7-aa1          382.90
2.4.19-pre7-ac2           85.66
2.4.19-pre7-jam6          66.41
2.4.19-pre7-rl           464.60
2.4.19-pre7-rmap13       453.24

kernel                   Pipe latency in microseconds - smaller is better
-----------------------  -----
2.4.16                   12.73
2.4.19-pre3aa2           13.58
2.4.19-pre5              12.98
2.4.19-pre5-aa1          13.46
2.4.19-pre5-aa1-2g-hio   12.83
2.4.19-pre5-aa1-3g-hio   13.08
2.4.19-pre7              10.71
2.4.19-pre7-aa1          13.32
2.4.19-pre7-ac2          31.95
2.4.19-pre7-jam6         29.51
2.4.19-pre7-rl           10.71
2.4.19-pre7-rmap13       10.75

More at:
http://home.earthlink.net/~rwhron/kernel/bigbox.html

-- 
Randy Hron


* O(1) scheduler gives big boost to tbench 192
@ 2002-05-02 21:36 rwhron
  2002-05-03  0:09 ` Gerrit Huizenga
  0 siblings, 1 reply; 24+ messages in thread
From: rwhron @ 2002-05-02 21:36 UTC (permalink / raw)
  To: linux-kernel

On an OSDL 4 way x86 box the O(1) scheduler effect 
becomes obvious as the run queue gets large.  

2.4.19-pre7-ac2 and 2.4.19-pre7-jam6 have the O(1) scheduler.  

At 192 processes, O(1) shows about 340% improvement in throughput.
The dyn-sched in -aa appears to be somewhat improved over the
standard scheduler.

Numbers are in MB/second.

tbench 192 processes
2.4.16                    29.39
2.4.17                    29.70
2.4.19-pre5               29.01
2.4.19-pre5-aa1           29.22
2.4.19-pre5-aa1-2g-hio    29.94
2.4.19-pre5-aa1-3g-hio    28.66
2.4.19-pre7               29.93
2.4.19-pre7-aa1           32.75
2.4.19-pre7-ac2          103.98
2.4.19-pre7-rmap13        29.46
2.4.19-pre7-jam6         104.98
2.4.19-pre7-rl            29.74

At 64 processes, O(1) helps a little.  ac2 and jam6 have
the highest numbers here too.

tbench 64 processes
2.4.16                    101.99
2.4.17                    103.49
2.4.19-pre5-aa1           102.43
2.4.19-pre5-aa1-2g-hio    104.30
2.4.19-pre5-aa1-3g-hio    104.60
2.4.19-pre7               100.86
2.4.19-pre7-aa1           101.76
2.4.19-pre7-ac2           105.89
2.4.19-pre7-rmap13        100.94
2.4.19-pre7-rl             99.65
2.4.19-pre7-jam6          108.23

I've seen some benefit on a uniprocessor box running tbench 32 
for kernels with O(1).  Hmm, have to try tbench 192 on uniproc 
and see if the difference is all scheduler overhead.

I'm putting together a page with more results on this machine.
It will be growing at:
http://home.earthlink.net/~rwhron/kernel/bigbox.html

-- 
Randy Hron



Thread overview: 24+ messages
2002-05-06  8:20 O(1) scheduler gives big boost to tbench 192 rwhron
2002-05-06 16:42 ` Andrea Arcangeli
  -- strict thread matches above, loose matches on Subject: below --
2002-05-20 12:46 rwhron
2002-05-08 16:39 Bill Davidsen
2002-05-03 16:37 John Hawkes
2002-05-03 13:38 rwhron
2002-05-03 20:29 ` Gerrit Huizenga
2002-05-04  8:13   ` Andrea Arcangeli
2002-05-07 22:13 ` Mike Kravetz
2002-05-07 22:44   ` Alan Cox
2002-05-07 22:43     ` Mike Kravetz
2002-05-07 23:39       ` Robert Love
2002-05-07 23:48         ` Mike Kravetz
2002-05-08 15:34           ` Jussi Laako
2002-05-08 16:31             ` Robert Love
2002-05-08 17:02               ` Mike Kravetz
2002-05-09  0:26                 ` Jussi Laako
2002-05-08  8:50   ` Andrea Arcangeli
2002-05-09 23:18     ` Mike Kravetz
2002-05-02 21:36 rwhron
2002-05-03  0:09 ` Gerrit Huizenga
2002-05-02 23:17   ` J.A. Magallon
2002-05-03  0:14   ` Alan Cox
2002-05-03  1:08     ` Gerrit Huizenga
