public inbox for linux-kernel@vger.kernel.org
* Linux 2.4.2 fails to merge mmap areas, 700% slowdown.
@ 2001-03-20 18:28 Serge Orlov
  2001-03-20 18:43 ` Linus Torvalds
  2001-03-20 18:43 ` Jakob Østergaard
  0 siblings, 2 replies; 32+ messages in thread
From: Serge Orlov @ 2001-03-20 18:28 UTC (permalink / raw)
  To: linux-kernel; +Cc: Linus Torvalds, sorlov

 Hi,
I upgraded one of our computers, happily running a 2.2.13 kernel,
to 2.4.2. Everything was OK, but the compilation time of our C++
project greatly increased (1.4 times slower). I investigated the
issue and found that g++ spends 7 times more time in the kernel.
The reason for this is a big vm map:

cat /proc/15677/maps |wc -l
   2238

15677 is the cc1plus process; its map looks like this:
.....
4014a000-4014b000 rw-p 00000000 00:00 0
4014b000-4014c000 rw-p 00000000 00:00 0
4014c000-4014d000 rw-p 00000000 00:00 0
4014d000-4014e000 rw-p 00000000 00:00 0
4014e000-4014f000 rw-p 00000000 00:00 0
4014f000-40150000 rw-p 00000000 00:00 0
40150000-40151000 rw-p 00000000 00:00 0
40151000-40152000 rw-p 00000000 00:00 0
......
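
The listing above shows page-sized anonymous regions that touch and have
identical permissions. As a rough illustration of what "merging" would mean
for such entries, here is a toy sketch in C. The struct and function names
are hypothetical and stand in for one line of the maps file; this is not the
kernel's data structure or its merge code.

```c
#include <stddef.h>

/* Hypothetical stand-in for one line of /proc/<pid>/maps (assumption,
 * not the kernel's vm_area_struct). */
struct region {
    unsigned long start;  /* first byte of the mapping */
    unsigned long end;    /* one past the last byte */
    unsigned int  prot;   /* protection bits, e.g. read|write */
    int           anon;   /* nonzero for an anonymous mapping */
};

/* Two regions can be coalesced when they touch and share attributes. */
static int can_merge(const struct region *a, const struct region *b)
{
    return a->end == b->start && a->prot == b->prot && a->anon && b->anon;
}

/* Coalesce an address-sorted array in place; returns the new count. */
size_t merge_regions(struct region *r, size_t n)
{
    size_t out = 0;

    if (n == 0)
        return 0;
    for (size_t i = 1; i < n; i++) {
        if (can_merge(&r[out], &r[i]))
            r[out].end = r[i].end;   /* extend the current region */
        else
            r[++out] = r[i];         /* start a new region */
    }
    return out + 1;
}
```

Fed the eight entries from 4014a000 to 40152000, this collapses them into a
single region, which is the effect on the map length being asked for.
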
strace:
.....
15677 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40019000
15677 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40019000
15677 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001a000
15677 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001b000
15677 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001c000
15677 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001d000
.......

2.4.2:
time g++ -g -Wall -I. -c t_instr.cpp -o t_instr.o

real    0m3.434s
user    0m2.790s
sys     0m0.530s

2.2.13:
time g++ -g -Wall -I.  -c t_instr.cpp -o t_instr.o

real    0m3.167s
user    0m2.950s
sys     0m0.070s

I noticed a recent thread (Re: Kernel is unstable) in the archives that
ended with this from Linus:
--- quote ---
Ehh.. If the merging doesn't actually happen, it's always a loss. We've
just spent CPU cycles on doing something useless. And in my tests, that
was the case a lot more than not.

Also, in the expense of taking a page fault, looking one or two levels
deeper in the AVL tree is pretty much not noticeable.

Show me numbers for real applications, and I might care. I saw barely
measurable speedups (and more importantly to me - real simplification)
by removing it.

Don't bother arguing with "it might.."

                Linus
--- end of quote ----

OK, the numbers are here. g++ is 2.96 from RedHat 7.0.
Please CC me, as I'm not on the list.

   Serge.
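
For anyone who wants to reproduce the pattern without a large C++ project, a
minimal sketch in C follows. It issues the same kind of one-page anonymous
mmap calls seen in the strace and counts the lines of /proc/self/maps. The
function names are made up for this example, and the page size of 4096 is
taken from the strace, not queried from the system.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* expose MAP_ANONYMOUS under strict -std modes */
#endif
#include <stdio.h>
#include <sys/mman.h>

/* Map n separate one-page anonymous regions, like the old_mmap calls
 * in the strace.  Returns how many mappings succeeded. */
int map_single_pages(int n)
{
    int ok = 0;

    for (int i = 0; i < n; i++) {
        void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p != MAP_FAILED)
            ok++;
    }
    return ok;
}

/* Count the lines of /proc/self/maps; returns -1 if it cannot be read. */
long count_map_lines(void)
{
    FILE *f = fopen("/proc/self/maps", "r");
    long lines = 0;
    int c;

    if (!f)
        return -1;
    while ((c = fgetc(f)) != EOF)
        if (c == '\n')
            lines++;
    fclose(f);
    return lines;
}
```

Calling count_map_lines() before and after map_single_pages(2000) shows
whether the running kernel merges the new regions: if it does, the count
should barely move; if it does not, it should grow by roughly one line per
call.
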



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.
  2001-03-20 18:28 Serge Orlov
@ 2001-03-20 18:43 ` Linus Torvalds
  2001-03-20 18:59   ` Jakob Østergaard
  2001-03-21  1:20   ` Kevin Buhr
  2001-03-20 18:43 ` Jakob Østergaard
  1 sibling, 2 replies; 32+ messages in thread
From: Linus Torvalds @ 2001-03-20 18:43 UTC (permalink / raw)
  To: Serge Orlov; +Cc: linux-kernel, sorlov



On Tue, 20 Mar 2001, Serge Orlov wrote:
>
> I upgraded one of our computers, happily running a 2.2.13 kernel,
> to 2.4.2. Everything was OK, but compilation time of our C++
> project greatly increased (1.4 times slower). I investigated the
> issue and found that g++ spends 7 times more time in kernel.
> The reason for this is big vm map:

Cool. Somebody actually found a real case.

I'll fix the mmap case asap. It's not hard, I just waited to see if it
ever actually triggers. Something like g++ certainly counts as major.

Are you willing to test out patches?

		Linus



* Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.
  2001-03-20 18:28 Serge Orlov
  2001-03-20 18:43 ` Linus Torvalds
@ 2001-03-20 18:43 ` Jakob Østergaard
  1 sibling, 0 replies; 32+ messages in thread
From: Jakob Østergaard @ 2001-03-20 18:43 UTC (permalink / raw)
  To: Serge Orlov; +Cc: linux-kernel, Linus Torvalds, sorlov

On Tue, Mar 20, 2001 at 09:28:57PM +0300, Serge Orlov wrote:
>  Hi,
> I upgraded one of our computers, happily running a 2.2.13 kernel,
> to 2.4.2. Everything was OK, but compilation time of our C++
> project greatly increased (1.4 times slower). I investigated the
> issue and found that g++ spends 7 times more time in kernel.

I see the *exact* same problem. Large C++ code bases, and gcc spending most
of its CPU time in the kernel.

> The reason for this is big vm map:
> 
> cat /proc/15677/maps |wc -l
>    2238

Exactly what I see too.   200 MB of memory allocated in 4K maps...

There is an easy fix:  In libiberty in GCC we could change xmalloc()
to do real malloc instead of calloc().   I think that would fix it.

Or glibc could be fixed to make calloc() behave more reasonably
when it's called with tons and tons of 4K allocations.

Or the kernel could be fixed to merge maps.
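
To make the user-space option concrete, here is a minimal sketch in C of an
allocator that asks the kernel for one large region and carves pages out of
it, so the map grows by one line per chunk, not one per page. This is only
an illustration of the idea: it is not libiberty's xmalloc() or glibc's
calloc(), the names and the 1 MiB chunk size are arbitrary assumptions, and
pages are never returned to the system.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* expose MAP_ANONYMOUS under strict -std modes */
#endif
#include <stddef.h>
#include <sys/mman.h>

#define PAGE_SZ   4096UL
#define CHUNK_SZ  (256UL * PAGE_SZ)  /* 1 MiB per mmap call; arbitrary */

static char         *chunk_base;  /* current chunk, NULL before first use */
static size_t        chunk_used;  /* bytes already handed out from it */
static unsigned long mmap_calls;  /* how often the kernel was asked */

/* Return one zero-filled page, calling mmap once per CHUNK_SZ bytes. */
void *alloc_page(void)
{
    void *page;

    if (chunk_base == NULL || chunk_used == CHUNK_SZ) {
        void *p = mmap(NULL, CHUNK_SZ, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return NULL;
        chunk_base = p;
        chunk_used = 0;
        mmap_calls++;
    }
    page = chunk_base + chunk_used;
    chunk_used += PAGE_SZ;
    return page;
}

/* Number of mmap calls made so far by alloc_page(). */
unsigned long alloc_page_mmap_calls(void)
{
    return mmap_calls;
}
```

With these numbers, the first 256 calls to alloc_page() cost a single mmap,
and the 257th triggers the second one.
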

...
> .....
> 15677 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40019000

hear hear !   

...
> 
> OK, the numbers are here. g++ is 2.96 from RedHat 7.0.
> Please, CC me, as I'm not on the list.

gcc 2.96 here too.

Should we take this up with the glibc or gcc folks, or should
someone fix the kernel ?

This *is* a very significant performance problem for a standard tool.

-- 
................................................................
:   jakob@unthought.net   : And I see the elder races,         :
:.........................: putrid forms of man                :
:   Jakob Østergaard      : See him rise and claim the earth,  :
:        OZ9ABN           : his downfall is at hand.           :
:.........................:............{Konkhra}...............:


* Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.
  2001-03-20 18:43 ` Linus Torvalds
@ 2001-03-20 18:59   ` Jakob Østergaard
  2001-03-21  1:20   ` Kevin Buhr
  1 sibling, 0 replies; 32+ messages in thread
From: Jakob Østergaard @ 2001-03-20 18:59 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Serge Orlov, linux-kernel, sorlov

On Tue, Mar 20, 2001 at 10:43:33AM -0800, Linus Torvalds wrote:
> 
> 
> On Tue, 20 Mar 2001, Serge Orlov wrote:
> >
> > I upgraded one of our computers, happily running a 2.2.13 kernel,
> > to 2.4.2. Everything was OK, but compilation time of our C++
> > project greatly increased (1.4 times slower). I investigated the
> > issue and found that g++ spends 7 times more time in kernel.
> > The reason for this is big vm map:
> 
> Cool. Somebody actually found a real case.
> 
> I'll fix the mmap case asap. It's not hard, I just waited to see if it
> ever actually triggers. Something like g++ certainly counts as major.

Uber-cool !  :)

> 
> Are you willing to test out patches?

Definitely.

-- 
................................................................
:   jakob@unthought.net   : And I see the elder races,         :
:.........................: putrid forms of man                :
:   Jakob Østergaard      : See him rise and claim the earth,  :
:        OZ9ABN           : his downfall is at hand.           :
:.........................:............{Konkhra}...............:


* Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.
  2001-03-20 18:43 ` Linus Torvalds
  2001-03-20 18:59   ` Jakob Østergaard
@ 2001-03-21  1:20   ` Kevin Buhr
  2001-03-21  1:38     ` David S. Miller
  2001-03-21  6:41     ` Mike Galbraith
  1 sibling, 2 replies; 32+ messages in thread
From: Kevin Buhr @ 2001-03-21  1:20 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Serge Orlov, linux-kernel, Jakob Østergaard

Linus Torvalds <torvalds@transmeta.com> writes:
> 
> Cool. Somebody actually found a real case.
> 
> I'll fix the mmap case asap. It's not hard, I just waited to see if it
> ever actually triggers. Something like g++ certainly counts as major.

I frequently build Mozilla from scratch on my (aging) dual Celeron
machine.  That's about 65 megs of actual C++ source, and it takes
about an hour of real time to compile.  I see times for the whole
build like this:

    real    60m4.574s
    user    101m18.260s
    sys     3m23.520s

with gcc 2.95.2 20000220 (Debian GNU/Linux) under Linux 2.4.2.

The sys-to-user ratio seems much closer to Serge's 2.2.13 numbers than
his 2.4.2 numbers, and I'm wondering why.

If I recall correctly, RedHat's 2.96 was a modified development
snapshot of GCC 3.0, not an official GCC release.  If this is just a
quirk in 2.96 that can be fixed before the official release of 3.0 by
a trivial patch to libiberty, maybe your original hunch was right and
the kernel should be left as-is.

> Are you willing to test out patches?

I'm willing to help test out the patch; I'd be curious to see what
effect it has on the performance of 2.95.2.

Kevin <buhr@stat.wisc.edu>


* Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.
  2001-03-21  1:20   ` Kevin Buhr
@ 2001-03-21  1:38     ` David S. Miller
  2001-03-21 20:19       ` Kevin Buhr
  2001-03-21  6:41     ` Mike Galbraith
  1 sibling, 1 reply; 32+ messages in thread
From: David S. Miller @ 2001-03-21  1:38 UTC (permalink / raw)
  To: Kevin Buhr
  Cc: Linus Torvalds, Serge Orlov, linux-kernel, Jakob Østergaard


Kevin Buhr writes:
 > If I recall correctly, RedHat's 2.96 was a modified development
 > snapshot of GCC 3.0, not an official GCC release.  If this is just a
 > quirk in 2.96 that can be fixed before the official release of 3.0 by
 > a trivial patch to libiberty, maybe your original hunch was right and
 > the kernel should be left as-is.

It is the garbage collector scheme used for memory allocation in gcc
>=2.96 that triggers the bad cases seen by Serge.

Later,
David S. Miller
davem@redhat.com


* Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.
@ 2001-03-21  2:02 Dieter Nützel
  0 siblings, 0 replies; 32+ messages in thread
From: Dieter Nützel @ 2001-03-21  2:02 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel List

Linus Torvalds <torvalds@transmeta.com> writes:
> 
> Cool. Somebody actually found a real case.
> 
> I'll fix the mmap case asap. It's not hard, I just waited to see if it
> ever actually triggers. Something like g++ certainly counts as major.

I do daily builds of the VTK CVS tree (The Visualization Toolkit, 
www.kitware.com/vtk.html, a huge 3D app).

~33 MB C++ source

It took ~1 hour on my K7 550, 256 MB, IBM DTL-307030, glibc-2.2 and 
gcc-2.95.2 ( 19991024 (release)) under most of the 2.4-test kernels (all with 
ReiserFS) for a whole rebuild.
Now it takes nearly one and a half hours with 2.4.2-ac20.
BTW, my mouse (PS/2) is very sluggish during C++ compilations now.

I am open to all of your patches. Or should I better say most :-)))

Cheers,
	Dieter
-- 
Dieter Nützel
Graduate Student, Computer Science

University of Hamburg
Department of Computer Science
Cognitive Systems Group
Vogt-Kölln-Straße 30
D-22527 Hamburg, Germany

email: nuetzel@kogs.informatik.uni-hamburg.de
@home: Dieter.Nuetzel@hamburg.de


* Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.
  2001-03-21  1:20   ` Kevin Buhr
  2001-03-21  1:38     ` David S. Miller
@ 2001-03-21  6:41     ` Mike Galbraith
  2001-03-21 14:56       ` Matthias Urlichs
  2001-03-21 15:59       ` Kurt Garloff
  1 sibling, 2 replies; 32+ messages in thread
From: Mike Galbraith @ 2001-03-21  6:41 UTC (permalink / raw)
  To: linux-kernel

On 20 Mar 2001, Kevin Buhr wrote:

> Linus Torvalds <torvalds@transmeta.com> writes:
> >
> > Cool. Somebody actually found a real case.
> >
> > I'll fix the mmap case asap. It's not hard, I just waited to see if it
> > ever actually triggers. Something like g++ certainly counts as major.
>
> I frequently build Mozilla from scratch on my (aging) dual Celeron
> machine.  That's about 65 megs of actual C++ source, and it takes
> about an hour of real time to compile.  I see times for the whole
> build like this:
>
>     real    60m4.574s
>     user    101m18.260s  <-- impossible no?
>     sys     3m23.520s

Why do numbers like this show up?  I noticed some of this after having
enabled SMP on my UP box.

	-Mike



* Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.
  2001-03-21  6:41     ` Mike Galbraith
@ 2001-03-21 14:56       ` Matthias Urlichs
  2001-03-21 15:05         ` Mike Galbraith
  2001-03-21 15:59       ` Kurt Garloff
  1 sibling, 1 reply; 32+ messages in thread
From: Matthias Urlichs @ 2001-03-21 14:56 UTC (permalink / raw)
  To: Mike Galbraith, linux-kernel

> > I frequently build Mozilla from scratch on my (aging) dual Celeron
> > machine.  [...]
> >     real    60m4.574s
> >     user    101m18.260s  <-- impossible no?
> >     sys     3m23.520s
> 
> Why do numbers like this show up?  I noticed some of this after having
> enabled SMP on my UP box.
> 
Now why would that be impossible on a two-CPU system?


* Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.
  2001-03-21 14:56       ` Matthias Urlichs
@ 2001-03-21 15:05         ` Mike Galbraith
  0 siblings, 0 replies; 32+ messages in thread
From: Mike Galbraith @ 2001-03-21 15:05 UTC (permalink / raw)
  To: Matthias Urlichs; +Cc: linux-kernel

On Wed, 21 Mar 2001, Matthias Urlichs wrote:

> > > I frequently build Mozilla from scratch on my (aging) dual Celeron
> > > machine.  [...]
> > >     real    60m4.574s
> > >     user    101m18.260s  <-- impossible no?
> > >     sys     3m23.520s
> >
> > Why do numbers like this show up?  I noticed some of this after having
> > enabled SMP on my UP box.
> >
> Now why would that be impossible on a two-CPU system?

zzzt.  Right.. impossible on a UP box.

	-Mike



* Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.
  2001-03-21  6:41     ` Mike Galbraith
  2001-03-21 14:56       ` Matthias Urlichs
@ 2001-03-21 15:59       ` Kurt Garloff
  2001-03-21 16:45         ` Mike Galbraith
  1 sibling, 1 reply; 32+ messages in thread
From: Kurt Garloff @ 2001-03-21 15:59 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel


On Wed, Mar 21, 2001 at 07:41:55AM +0100, Mike Galbraith wrote:
> On 20 Mar 2001, Kevin Buhr wrote:
> >     real    60m4.574s
> >     user    101m18.260s  <-- impossible no?
> >     sys     3m23.520s
> 
> Why do numbers like this show up?  I noticed some of this after having
> enabled SMP on my UP box.

As you have two CPUs, you can spend more time in CPU than your wall clock
shows if you time multithreaded processes or multiple processes. At most
(ideal case) twice as much.
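
As a quick check of that bound against the Mozilla build numbers quoted
above, here is a small arithmetic sketch in C. The helper names are made up
for this example.

```c
/* Convert a `time`-style minutes/seconds reading to seconds. */
double to_seconds(double minutes, double seconds)
{
    return minutes * 60.0 + seconds;
}

/* CPU seconds consumed per wall-clock second.  On a machine with N CPUs
 * the ceiling is N, so 2.0 on a dual-processor box. */
double cpu_per_wall(double real, double user, double sys)
{
    return (user + sys) / real;
}
```

For real 60m4.574s, user 101m18.260s and sys 3m23.520s this comes out at
roughly 1.74, comfortably under the two-CPU ceiling of 2.
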

Regards,
-- 
Kurt Garloff  <garloff@suse.de>                          Eindhoven, NL
GPG key: See mail header, key servers         Linux kernel development
SuSE GmbH, Nuernberg, FRG                               SCSI, Security



* Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.
  2001-03-21 15:59       ` Kurt Garloff
@ 2001-03-21 16:45         ` Mike Galbraith
  2001-03-21 20:16           ` Kevin Buhr
  0 siblings, 1 reply; 32+ messages in thread
From: Mike Galbraith @ 2001-03-21 16:45 UTC (permalink / raw)
  To: Kurt Garloff; +Cc: linux-kernel

On Wed, 21 Mar 2001, Kurt Garloff wrote:

> On Wed, Mar 21, 2001 at 07:41:55AM +0100, Mike Galbraith wrote:
> > On 20 Mar 2001, Kevin Buhr wrote:
> > >     real    60m4.574s
> > >     user    101m18.260s  <-- impossible no?
> > >     sys     3m23.520s
> >
> > Why do numbers like this show up?  I noticed some of this after having
> > enabled SMP on my UP box.
>
> As you have two CPUs, you can spend more time in CPU than your wall clock
> shows if you time multithreaded processes or multiple processes. At most
> (ideal case) twice as much.

Yes.  I'm so used to UP numbers I didn't think.  I saw user larger than
real on my UP box yesterday during some testing, and then seeing this
post... oops.

	-Mike



* Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.
  2001-03-21 16:45         ` Mike Galbraith
@ 2001-03-21 20:16           ` Kevin Buhr
  2001-03-22  9:04             ` Mike Galbraith
  0 siblings, 1 reply; 32+ messages in thread
From: Kevin Buhr @ 2001-03-21 20:16 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel

Mike Galbraith <mikeg@wen-online.de> writes:
> 
> Yes.  I'm so used to UP numbers I didn't think.  I saw user larger than
> real on my UP box yesterday during some testing, and then seeing this
> post... oops.

Okay, so you see "user > real" on a UP box running an SMP kernel.

First, I'm not really familiar with this part of the kernel, but as I
understand things (and others will correct me if I'm wrong) ...

The "real time" is calculated by subtracting the "gettimeofday" before
and after running the process.  The "user" and "system" times are
sampled times updated every timer tick.

A discrepancy of a hundredth of a second is perfectly normal.
"gettimeofday" uses a neat trick to get microsecond accuracy, but the
user and system times only have one timer tick (1/HZ=.01sec on i386)
resolution.  For this reason, any CPU intensive program can give
slightly (within .01sec or so) higher user than real:

    buhr@saurus:~/src/cpuburn/cpuburn-1.2$ time ./burnP6
    real    0m6.438s
    user    0m6.440s
    sys     0m0.000s
    ^C
    buhr@saurus:~/src/cpuburn/cpuburn-1.2$
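
The same comparison can be made from inside a program. The sketch below, in
C, brackets a busy loop with gettimeofday() for the real time and
getrusage() for the user and system times, which is one plausible way to
take the two kinds of measurement described above. The struct and function
names are assumptions for this example, and kernels other than the ones
discussed here may account CPU time more finely than once per timer tick.

```c
#include <stddef.h>
#include <sys/time.h>
#include <sys/resource.h>

/* Elapsed times in seconds for one measured run (hypothetical type). */
struct timing {
    double real;  /* wall clock, from gettimeofday() */
    double user;  /* user CPU time, from getrusage() */
    double sys;   /* system CPU time, from getrusage() */
};

static double tv_seconds(struct timeval tv)
{
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Spin for the given number of iterations and report all three times. */
struct timing time_busy_loop(unsigned long iterations)
{
    struct timeval t0, t1;
    struct rusage r0, r1;
    volatile unsigned long sink = 0;  /* volatile keeps the loop alive */
    struct timing t;

    gettimeofday(&t0, NULL);
    getrusage(RUSAGE_SELF, &r0);
    for (unsigned long i = 0; i < iterations; i++)
        sink += i;
    getrusage(RUSAGE_SELF, &r1);
    gettimeofday(&t1, NULL);

    t.real = tv_seconds(t1) - tv_seconds(t0);
    t.user = tv_seconds(r1.ru_utime) - tv_seconds(r0.ru_utime);
    t.sys  = tv_seconds(r1.ru_stime) - tv_seconds(r0.ru_stime);
    return t;
}
```

Comparing t.user + t.sys with t.real over a run of several seconds gives the
same picture as the shell's "time" output.
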

If your discrepancy is bigger than a couple of hundredths of a second, it
gets more complicated.

In an SMP kernel, the jiffies are updated by the "do_timer" function,
and the timer bottom half uses the jiffies to update the time of day.
On the other hand, the user and system times are updated by the
"smp_local_timer_interrupt".

On an SMP motherboard (one with an APIC), "do_timer" is invoked by
timer ticks from the dedicated timer chip, but
"smp_local_timer_interrupt" is invoked by a timer on the APIC chip.
These two timers will run at nearly the same speed (HZ times per
second), but not exactly.  If the APIC timer is significantly faster,
you can have user+system>real on an SMP motherboard, even though it
only has one processor installed!

So, the first question is, does your "UP" box really have a UP-only
motherboard?  That is, in your bootup messages, do you see a line like
this:

   Mar  5 15:32:28 mozart kernel: SMP motherboard not detected. Using
   dummy APIC emulation.

If you don't see such a line, this might be the problem: the real time
is based on a different timer than the user and system times.  
I believe the APIC timer is based on bus frequency.  If you're over-
or under-clocking your board, you may see huge discrepancies.

If you *do* see the emulation message, then "do_timer" and
"smp_local_timer_interrupt" are both called exactly once on every
timer tick, so there is no discrepancy possible there.

However, the "gettimeofday" time isn't just based on the jiffies
count.  The time adjustment parameters (set by the adjtimex(2) system
call) can modify the "gettimeofday" time away from what would normally
be calculated from jiffies alone.  If you are running a time daemon,
like NTP, if you've run "ntpdate" at bootup and a time adjustment is
in progress, or if you've used the "adjtimex" utility directly to make
your system clock more accurate, then that could also account for the
discrepancy.

In any event, if the discrepancy is large: if user, for a
single-threaded process, exceeds the real time by more than 1% (or a
few hundredths of a second, whichever is greater) on any system, I
think this indicates a serious problem.

Kevin <buhr@stat.wisc.edu>


* Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.
  2001-03-21  1:38     ` David S. Miller
@ 2001-03-21 20:19       ` Kevin Buhr
  2001-03-22 18:23         ` Kevin Buhr
  0 siblings, 1 reply; 32+ messages in thread
From: Kevin Buhr @ 2001-03-21 20:19 UTC (permalink / raw)
  To: David S. Miller
  Cc: Linus Torvalds, Serge Orlov, linux-kernel, Jakob Østergaard

"David S. Miller" <davem@redhat.com> writes:
> 
> It is the garbage collector scheme used for memory allocation in gcc
> >=2.96 that triggers the bad cases seen by Serge.

Ahhh!  Thanks for the info.

I'm still happy to help test out the patch, but I guess it's not
likely to affect my 2.95.2 numbers much at all.  Maybe I can get a
snapshot of GCC 3.0 up and running, though, and test that out.

Thanks.

Kevin <buhr@stat.wisc.edu>


* Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.
  2001-03-21 20:16           ` Kevin Buhr
@ 2001-03-22  9:04             ` Mike Galbraith
  2001-03-22 22:19               ` Kevin Buhr
  0 siblings, 1 reply; 32+ messages in thread
From: Mike Galbraith @ 2001-03-22  9:04 UTC (permalink / raw)
  To: Kevin Buhr; +Cc: linux-kernel, Alan Cox

On 21 Mar 2001, Kevin Buhr wrote:

> Mike Galbraith <mikeg@wen-online.de> writes:
> >
> > Yes.  I'm so used to UP numbers I didn't think.  I saw user larger than
> > real on my UP box yesterday during some testing, and then seeing this
> > post... oops.
>
> Okay, so you see "user > real" on a UP box running an SMP kernel.

On ac20 I see this (has rw_mmap_sem patch in place tho..), but not on
2.4.3-pre6 with Linus' deadlock fix.

[snip nice explanation.. thanks]  box is genuine UP btw.

> In any event, if the discrepancy is large: if user, for a
> single-threaded process, exceeds the real time by more than 1% (or a
> few hundredths of a second, whichever is greater) on any system, I
> think this indicates a serious problem.

Let me check virgin ac20 and see what it does.

2.4.2.ac20.virgin   2.4.3-pre6
real    11m0.708s   11m58.617s
user    15m8.720s   7m29.970s
sys     1m31.410s   0m41.590s

It looks like ac20 is doing some double accounting.

	-Mike

(fwiw, the smp/up numbers suck rocks compared to up/up)



* Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.
  2001-03-21 20:19       ` Kevin Buhr
@ 2001-03-22 18:23         ` Kevin Buhr
  2001-03-22 18:35           ` Jakob Østergaard
  0 siblings, 1 reply; 32+ messages in thread
From: Kevin Buhr @ 2001-03-22 18:23 UTC (permalink / raw)
  To: David S. Miller
  Cc: Linus Torvalds, Serge Orlov, linux-kernel, Jakob Østergaard

buhr@cs.wisc.edu (Kevin Buhr) writes:
> 
> "David S. Miller" <davem@redhat.com> writes:
> > 
> > It is the garbage collector scheme used for memory allocation in gcc
> > >=2.96 that triggers the bad cases seen by Serge.
> 
> Ahhh!  Thanks for the info.
> 
> I'm still happy to help test out the patch, but I guess it's not
> likely to affect my 2.95.2 numbers much at all.  Maybe I can get a
> snapshot of GCC 3.0 up and running, though, and test that out.

I pulled the "gcc-3_0-branch" of GCC from CVS and compiled Mozilla
under a 2.4.2 kernel.  The numbers I saw were:

    real    57m26.850s
    user    96m57.490s
    sys     3m16.780s

which are almost exactly the same as my GCC 2.95.2 numbers.  When I
peeked at "/proc/<cc1plus>/maps" a few times, I counted ~150 lines,
not ~2000.  On another, much smaller block of C++ code, I get similar
results: no dramatic change in kernel time.

Either the Mozilla codebase and my other test case don't tickle the
problem, or something has changed in 3.0's allocation scheme since
RedHat 2.96 was built.

Kevin <buhr@stat.wisc.edu>


* Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.
  2001-03-22 18:23         ` Kevin Buhr
@ 2001-03-22 18:35           ` Jakob Østergaard
  2001-03-23  4:32             ` Kevin Buhr
  2001-03-23 20:43             ` James Lewis Nance
  0 siblings, 2 replies; 32+ messages in thread
From: Jakob Østergaard @ 2001-03-22 18:35 UTC (permalink / raw)
  To: Kevin Buhr; +Cc: David S. Miller, Linus Torvalds, Serge Orlov, linux-kernel

On Thu, Mar 22, 2001 at 12:23:15PM -0600, Kevin Buhr wrote:
> buhr@cs.wisc.edu (Kevin Buhr) writes:
...
> I pulled the "gcc-3_0-branch" of GCC from CVS and compiled Mozilla
> under a 2.4.2 kernel.  The numbers I saw were:
> 
>     real    57m26.850s
>     user    96m57.490s
>     sys     3m16.780s
> 
> which are almost exactly the same as my GCC 2.95.2 numbers.  When I
> peeked at "/proc/<cc1plus>/maps" a few times, I counted ~150 lines,
> not ~2000.  On another, much smaller block of C++ code, I get similar
> results: no dramatic change in kernel time.
> 
> Either the Mozilla codebase and my other test case don't tickle the
> problem, or something has changed in 3.0's allocation scheme since
> RedHat 2.96 was built.

Mozilla uses C++ mainly as "extended C" - due to compatibility concerns.

Try compiling something like Qt/KDE/gtk-- which are really heavy on
templates (with all the benefits and drawbacks of that).

My code here is quite template heavy, and I suspect that's what's triggering
it.  In fact, I can't compile our development code with optimization, because
GCC runs out of memory (it only allocates some 300-500 MB, but each page has
its own map in /proc/pid/maps, and a wc -l /proc/pid/maps doesn't finish for
minutes).  My typical GCC eats 100-200 MB and runs for several minutes.

You should benchmark this particular case with code that makes GCC eat
lots of memory, 100MB or more.  I've never seen Mozilla really make GCC
eat that much memory  -  other projects do.

-- 
................................................................
:   jakob@unthought.net   : And I see the elder races,         :
:.........................: putrid forms of man                :
:   Jakob Østergaard      : See him rise and claim the earth,  :
:        OZ9ABN           : his downfall is at hand.           :
:.........................:............{Konkhra}...............:


* Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.
  2001-03-22  9:04             ` Mike Galbraith
@ 2001-03-22 22:19               ` Kevin Buhr
  2001-03-23  7:44                 ` Mike Galbraith
  0 siblings, 1 reply; 32+ messages in thread
From: Kevin Buhr @ 2001-03-22 22:19 UTC (permalink / raw)
  To: Alan Cox, Mike Galbraith; +Cc: linux-kernel

Mike Galbraith <mikeg@wen-online.de> writes:
> 
> 2.4.2.ac20.virgin   2.4.3-pre6
> real    11m0.708s   11m58.617s
> user    15m8.720s   7m29.970s
> sys     1m31.410s   0m41.590s
> 
> It looks like ac20 is doing some double accounting.

Alan:

In "2.4.2-ac20", the check in "apic.c" in the "APIC_init_uniprocessor"
function to avoid initializing the APIC is:

        if (!smp_found_config && !cpu_has_apic)
                return -1;

However, in "arch/i386/kernel/time.c", we use the following check:

        if (!smp_found_config)
                smp_local_timer_interrupt(regs);

to see if we need to emulate an smp_local_timer_interrupt from
"do_timer_interrupt".

In Mike's case, I think we have smp_found_config == 0 but cpu_has_apic
== 1, so we're telling the CPU APIC to generate smp_local_timer_interrupts,
and then we're also emulating them on normal timer ticks.  That
doubles the rate at which "smp_local_timer_interrupt" is called,
doubling the process user and system time accounting.
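
The mismatch between the two checks can be modelled in a few lines of C.
This is a deliberately simplified toy, not kernel code: the struct and
function names are invented, and it assumes the APIC timer runs whenever
the uniprocessor init check does not bail out.

```c
/* Toy configuration; the field names mirror the identifiers above. */
struct apic_cfg {
    int smp_found_config;
    int cpu_has_apic;
};

/* Assumption: the init check returns early only when both flags are
 * zero, so in every other case the APIC timer ends up running. */
static int apic_timer_running(struct apic_cfg c)
{
    return c.smp_found_config || c.cpu_has_apic;
}

/* Old time.c check: emulate whenever no SMP configuration was found. */
int calls_per_tick_old(struct apic_cfg c)
{
    int emulate = !c.smp_found_config;

    return apic_timer_running(c) + emulate;
}

/* Proposed check: emulate only when the APIC timer is not in use. */
int calls_per_tick_new(struct apic_cfg c)
{
    int using_apic_timer = apic_timer_running(c);
    int emulate = !using_apic_timer;

    return using_apic_timer + emulate;
}
```

With smp_found_config == 0 and cpu_has_apic == 1, the old check yields two
accounting calls per tick and the proposed one yields one, which is in line
with Mike's user time being roughly double on ac20.
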

Mike, would you like to try out the following (untested) patch against
vanilla ac20 to see if it does the trick?

Kevin <buhr@stat.wisc.edu>

                        *       *       *

diff -ru linux-2.4.2-ac20/arch/i386/kernel/apic.c linux-2.4.2-ac20-local/arch/i386/kernel/apic.c
--- linux-2.4.2-ac20/arch/i386/kernel/apic.c	Thu Mar 22 12:36:02 2001
+++ linux-2.4.2-ac20-local/arch/i386/kernel/apic.c	Thu Mar 22 15:59:08 2001
@@ -30,6 +30,9 @@
 #include <asm/mpspec.h>
 #include <asm/pgalloc.h>
 
+/* Using APIC to generate smp_local_timer_interrupt? */
+int using_apic_timer = 0;
+
 int prof_multiplier[NR_CPUS] = { 1, };
 int prof_old_multiplier[NR_CPUS] = { 1, };
 int prof_counter[NR_CPUS] = { 1, };
@@ -884,6 +887,8 @@
 
 	/* and update all other cpus */
 	smp_call_function(setup_APIC_timer, (void *)calibration_result, 1, 1);
+
+	using_apic_timer = 1;
 }
 
 /*
diff -ru linux-2.4.2-ac20/arch/i386/kernel/time.c linux-2.4.2-ac20-local/arch/i386/kernel/time.c
--- linux-2.4.2-ac20/arch/i386/kernel/time.c	Thu Mar 22 12:36:03 2001
+++ linux-2.4.2-ac20-local/arch/i386/kernel/time.c	Thu Mar 22 16:03:02 2001
@@ -422,7 +422,7 @@
 	if (!user_mode(regs))
 		x86_do_profile(regs->eip);
 #else
-	if (!smp_found_config)
+	if (!using_apic_timer)
 		smp_local_timer_interrupt(regs);
 #endif
 
diff -ru linux-2.4.2-ac20/include/asm-i386/smp.h linux-2.4.2-ac20-local/include/asm-i386/smp.h
--- linux-2.4.2-ac20/include/asm-i386/smp.h	Sun Mar  4 21:35:03 2001
+++ linux-2.4.2-ac20-local/include/asm-i386/smp.h	Thu Mar 22 16:07:28 2001
@@ -34,6 +34,7 @@
 extern unsigned long cpu_online_map;
 extern volatile unsigned long smp_invalidate_needed;
 extern int pic_mode;
+extern int using_apic_timer;
 extern void smp_flush_tlb(void);
 extern void smp_message_irq(int cpl, void *dev_id, struct pt_regs *regs);
 extern void smp_send_reschedule(int cpu);


* Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.
  2001-03-22 18:35           ` Jakob Østergaard
@ 2001-03-23  4:32             ` Kevin Buhr
  2001-03-24  4:11               ` Zack Weinberg
                                 ` (3 more replies)
  2001-03-23 20:43             ` James Lewis Nance
  1 sibling, 4 replies; 32+ messages in thread
From: Kevin Buhr @ 2001-03-23  4:32 UTC (permalink / raw)
  To: Jakob Østergaard, Linus Torvalds
  Cc: Serge Orlov, linux-kernel, David S. Miller

Jakob Østergaard <jakob@unthought.net> writes:
>
> Try compiling something like Qt/KDE/gtk-- which are really heavy on
> templates (with all the benefits and drawbacks of that).

Okay, I just compiled gtk-- 1.0.3 (with CFLAGS = "-O2 -g") under three
versions of GCC (Debian 2.95.3, RedHat 2.96, and a CVS pull of the
"gcc-3_0-branch") on the same Debian machine running kernel 2.4.2.

In all cases, the "cc1plus" processes appeared to max out around 25M
total size.  The "maps" pseudofiles for the 2.95.3 and 3.0
compiles never grew past 250 lines, but the "maps" pseudofiles for the
RedHat 2.96 compile were gigantic, jumping to 3000 or 5000 lines at
times.

The results speak for themselves:

                  CVS gcc 3.0    Debian gcc 2.95.3    RedHat gcc 2.96

    real          16m8.423s      8m2.417s             12m24.939s
    user          15m23.710s     7m22.200s            10m14.420s
    sys           0m48.730s      0m41.040s            2m13.910s
    maps lines    <250           <250                 >3000
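
One way to compare the three runs is the share of CPU time spent in the
kernel. The sketch below, in C, does that arithmetic on the figures in the
table; the helper names are made up for this example.

```c
/* Convert a minutes/seconds reading such as 10m14.420s to seconds. */
double mmss(double minutes, double seconds)
{
    return minutes * 60.0 + seconds;
}

/* Fraction of total CPU time (user + sys) that was spent in the kernel. */
double sys_share(double user, double sys)
{
    return sys / (user + sys);
}
```

This gives a kernel share of roughly 5% for the CVS gcc 3.0 build, 8% for
Debian gcc 2.95.3 and 18% for RedHat gcc 2.96.
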

Obviously, the *real* problem is RedHat GCC 2.96.  If Linus bothers to
write this patch (he probably already has), its only proven benefit so
far is that it improves the performance of a RedHat-specific, orphaned
G++ development snapshot that everyone (the people of RedHat most of
all) will be glad to be rid of as soon as possible.

The numbers above suggest that the patch is unlikely to have a
positive impact on the performance of either officially released GCC
versions or the upcoming 3.0 release.

Drifting off topic...

> Mozilla uses C++ mainly as "extended C" - due to compatibility concerns.

This statement is potentially misleading.

I think most people will believe you to mean "using C++ as a better C"
in the sense of Stroustrup: using the small, conventional-language
subset of C++ that looks like C but has stronger type checking,
function and operator overloading, default arguments, "//" style
comments, reference types, and other syntactic and semantic sugar.

Mozilla does not use C++ as "extended C" in this sense.  While it does
use a *subset* of C++ for compatibility reasons, the subset includes
extensive use of class lattices and polymorphism as well as extensive
(albeit simple and carefully constructed) uses of templates for its
utility classes, including string and component-autoreferencing
template classes and functions that are used throughout the source.
The only major C++ facilities that are not used are the standard
library, RTTI, namespaces, and exception handling, but other than that
it's a good, real-world C++ test case.

Kevin <buhr@stat.wisc.edu>

* Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.
  2001-03-22 22:19               ` Kevin Buhr
@ 2001-03-23  7:44                 ` Mike Galbraith
  0 siblings, 0 replies; 32+ messages in thread
From: Mike Galbraith @ 2001-03-23  7:44 UTC (permalink / raw)
  To: Kevin Buhr; +Cc: Alan Cox, linux-kernel

On 22 Mar 2001, Kevin Buhr wrote:

> Mike Galbraith <mikeg@wen-online.de> writes:
> >
> > 2.4.2.ac20.virgin   2.4.3-pre6
> > real    11m0.708s   11m58.617s
> > user    15m8.720s   7m29.970s
> > sys     1m31.410s   0m41.590s
> >
> > It looks like ac20 is doing some double accounting.

[snip]

> Mike, would you like to try out the following (untested) patch against
> vanilla ac20 to see if it does the trick?

Yes, that fixed it.

	-Mike



* Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.
  2001-03-22 18:35           ` Jakob Østergaard
  2001-03-23  4:32             ` Kevin Buhr
@ 2001-03-23 20:43             ` James Lewis Nance
  1 sibling, 0 replies; 32+ messages in thread
From: James Lewis Nance @ 2001-03-23 20:43 UTC (permalink / raw)
  To: linux-kernel

On Thu, Mar 22, 2001 at 07:35:49PM +0100, Jakob Østergaard wrote:

> My code here is quite template heavy, and I suspect that's what's triggering
> it.  In fact, I can't compile our development code with optimization, because
> GCC runs out of memory (it only allocates some 300-500 MB, but each page has
> its own map in /proc/pid/maps, and a wc -l /proc/pid/maps doesn't finish for
> minutes).  My typical GCC eats 100-200 MB and runs for several minutes.

Would it be possible for you to post the preprocessor output to this list?
It would be quite nice to have this testcase.

Jim

* Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.
  2001-03-23  4:32             ` Kevin Buhr
@ 2001-03-24  4:11               ` Zack Weinberg
  2001-03-24 21:46                 ` Kevin Buhr
  2001-03-24  5:02               ` Linus Torvalds
                                 ` (2 subsequent siblings)
  3 siblings, 1 reply; 32+ messages in thread
From: Zack Weinberg @ 2001-03-24  4:11 UTC (permalink / raw)
  To: linux-kernel; +Cc: Kevin Buhr

Kevin Buhr wrote:
> Jakob Østergaard <jakob@unthought.net> writes:
> >
> > Try compiling something like Qt/KDE/gtk-- which are really heavy on
> > templates (with all the benefits and drawbacks of that).
> 
> Okay, I just compiled gtk-- 1.0.3 (with CFLAGS = "-O2 -g") under three
> versions of GCC (Debian 2.95.3, RedHat 2.96, and a CVS pull of the
> "gcc-3_0-branch") on the same Debian machine running kernel 2.4.2.
> 
> In all cases, the "cc1plus" processes appeared to max out around 25M
> total size. The "maps" pseudofiles for the 2.95.3 and 3.0
> compiles never grew past 250 lines, but the "maps" pseudofiles for the
> RedHat 2.96 compile were gigantic, jumping to 3000 or 5000 lines at
> times.
> 
> The results speak for themselves:
> 
>     CVS gcc 3.0:          Debian gcc 2.95.3:   RedHat gcc 2.96:
>
>     real    16m8.423s     real    8m2.417s     real    12m24.939s
>     user    15m23.710s    user    7m22.200s    user    10m14.420s
>     sys     0m48.730s     sys     0m41.040s    sys     2m13.910s
> maps:    <250 lines           <250 lines          >3000 lines

Let me inject some information about what gcc's doing in each version.

2.95.3 allocates its memory via a bunch of 'obstacks' which,
underneath, get memory from malloc, and therefore brk(2).  I'm very
surprised to see it had ~250 vmas; it should be more like 10.

2.96 and later versions use a garbage collecting allocator instead; it
was becoming much too hard to decide which obstack to use when.  The
garbage collector allocates memory with mmap(..., MAP_ANON, ...).
This is to avoid interfering with malloc, which is still used in many
places; and to get page-aligned memory without wasting tons of space,
as valloc(3) does.

In Red Hat's 2.96, that allocator gets memory from mmap one page at a
time.  If I understand what's going on in the kernel correctly, that
means each page is its own vma.  25 megs of GC arena is roughly 6400
vmas in that regime.

In CVS 3.0-to-be (and trunk), the allocator gets memory in 32-page
chunks instead.  So 25 megs of GC arena is only 200 vmas.

However, 25 megs of GC arena is small as these things go.  GCC's
memory consumption can _easily_ get up to 200 or 300 megs.  The
example I'm familiar with is insn-attrtab.c from GCC's own sources
(it's machine-generated code, with several huge functions).  256 megs
of GC arena, in 32-page chunks, is 2048 vmas.  Yes, at this point the
machine is swapping... but if I understand the issue, it's just when
we're swapping that having thousands of vmas causes problems.

In conclusion, I think that GCC's allocator still makes a good case
for merging vmas.
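
The vma counts quoted above can be checked with a short sketch. It is an illustration only, not GCC or kernel code: the 4 KB page size is an assumption (typical for i386), and `vma_count` is a hypothetical helper that assumes every chunk stays a separate, unmerged mapping.

```python
PAGE_SIZE = 4096  # assumed page size in bytes (typical for i386)
MB = 1024 * 1024


def vma_count(arena_bytes, pages_per_chunk):
    """Mappings needed if every chunk remains its own, unmerged vma."""
    chunk_bytes = pages_per_chunk * PAGE_SIZE
    # Round up: a partially used chunk still costs a whole mapping.
    return -(-arena_bytes // chunk_bytes)


# (arena size in MB, pages requested per mmap call)
for arena_mb, pages in [(25, 1), (25, 32), (256, 32)]:
    count = vma_count(arena_mb * MB, pages)
    print(f"{arena_mb:>4} MB arena, {pages:>2}-page chunks: {count} vmas")
```

Under these assumptions the three cases reproduce the 6400, 200 and 2048 figures given in the message.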

zw

* Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.
  2001-03-23  4:32             ` Kevin Buhr
  2001-03-24  4:11               ` Zack Weinberg
@ 2001-03-24  5:02               ` Linus Torvalds
  2001-03-24  9:31                 ` Jakob Østergaard
  2001-03-24  9:48               ` Jakob Østergaard
       [not found]               ` <200103240502.VAA02673@penguin.transmeta.com>
  3 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2001-03-24  5:02 UTC (permalink / raw)
  To: linux-kernel

In article <vbawv9hyuj0.fsf@mozart.stat.wisc.edu>,
Kevin Buhr <buhr@stat.wisc.edu> wrote:
>
>The results speak for themselves:
>
>    CVS gcc 3.0:          Debian gcc 2.95.3:   RedHat gcc 2.96:
>                      
>    real    16m8.423s     real    8m2.417s     real    12m24.939s
>    user    15m23.710s    user    7m22.200s    user    10m14.420s
>    sys     0m48.730s     sys     0m41.040s    sys     2m13.910s 
>maps:    <250 lines           <250 lines          >3000 lines
>
>Obviously, the *real* problem is RedHat GCC 2.96.  If Linus bothers to
>write this patch (he probably already has),

Check out 2.4.3-pre7, I'd be interested to hear what the system time is
for that one.

It does seem like gcc-2.96 is kind of special, but considering how easy
it is to merge anonymous memory (most of the changes were cosmetic ones
to get nice ordering to make the merge trivial without having to
allocate a vma that never gets used etc), it's certainly worth doing.
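
As a rough userspace model of what merging anonymous mappings means (this is not the 2.4.3-pre7 patch, and the real kernel has more conditions to check than this sketch does), adjacent regions with identical permissions can be coalesced into one. The `Vma` tuple and `merge_anonymous` function are hypothetical names used only for illustration.

```python
from typing import List, NamedTuple


class Vma(NamedTuple):
    start: int
    end: int    # exclusive end address
    perms: str  # permission string as shown in /proc/<pid>/maps, e.g. "rw-p"


def merge_anonymous(vmas: List[Vma]) -> List[Vma]:
    """Coalesce neighbours that touch and share the same permissions."""
    merged: List[Vma] = []
    for vma in sorted(vmas):  # sorts by start address first
        last = merged[-1] if merged else None
        if last and last.end == vma.start and last.perms == vma.perms:
            # Extend the previous region instead of adding a new entry.
            merged[-1] = last._replace(end=vma.end)
        else:
            merged.append(vma)
    return merged


# Eight one-page anonymous mappings, laid out back to back.
pages = [Vma(a, a + 0x1000, "rw-p")
         for a in range(0x4014a000, 0x40152000, 0x1000)]
for vma in merge_anonymous(pages):
    print(f"{vma.start:08x}-{vma.end:08x} {vma.perms}")
```

In this model the eight single-page entries collapse into a single region, which is the effect the "maps" line counts elsewhere in the thread reflect.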

		Linus

* Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.
  2001-03-24  5:02               ` Linus Torvalds
@ 2001-03-24  9:31                 ` Jakob Østergaard
  0 siblings, 0 replies; 32+ messages in thread
From: Jakob Østergaard @ 2001-03-24  9:31 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Fri, Mar 23, 2001 at 09:02:30PM -0800, Linus Torvalds wrote:
> In article <vbawv9hyuj0.fsf@mozart.stat.wisc.edu>,
> Kevin Buhr <buhr@stat.wisc.edu> wrote:
> >
> >The results speak for themselves:
> >
> >    CVS gcc 3.0:          Debian gcc 2.95.3:   RedHat gcc 2.96:
> >                      
> >    real    16m8.423s     real    8m2.417s     real    12m24.939s
> >    user    15m23.710s    user    7m22.200s    user    10m14.420s
> >    sys     0m48.730s     sys     0m41.040s    sys     2m13.910s 
> >maps:    <250 lines           <250 lines          >3000 lines
> >
> >Obviously, the *real* problem is RedHat GCC 2.96.  If Linus bothers to
> >write this patch (he probably already has),
> 
> Check out 2.4.3-pre7, I'd be interested to hear what the system time is
> for that one.

I was unable to compile gcc-3.0 from CVS this morning - so no tests there
for now...

First the "small" test case:
-----------------------------
2.4.2:
  gcc-2.96:  -O6 -felide-constructors -fPIC
  real    7m31.748s
  user    3m52.340s
  sys     3m38.180s
  Memory consumption:  ~200MB

2.4.3-pre7:
  gcc-2.96:  -O6 -felide-constructors -fPIC
  real    3m52.347s
  user    3m46.120s
  sys     0m3.370s

That's pretty darn impressive Linus !  3m38 -> 3sec !  Now if the GCC people
could only repeat that trick   ;)


Then the bigger one:
-----------------------------
2.4.2:
  gcc-2.96:  -O6 -felide-constructors -fPIC
  Fails compilation with "Virtual memory exhausted!" after
  real    37m28.305s
  user    23m39.930s
  sys     13m44.900s
  Memory consumption:  ~300MB before failure

Note - there are no ulimits set, and the machine has more than enough memory

2.4.3-pre7:
  gcc-2.96:  -O6 -felide-constructors -fPIC
  real    31m48.898s
  user    31m21.460s
  sys     0m26.980s
  Memory consumption:  ~400MB - successful completion

Cool !  I can work again   ;)
 
> 
> It does seem like gcc-2.96 is kind of special, but considering how easy
> it is to merge anonymous memory (most of the changes were cosmetic ones
> to get nice ordering to make the merge trivial without having to
> allocate a vma that never gets used etc), it's certainly worth doing.

Beautiful !

Also, the speedup gained here is ~70 times, which may be more than the changed
allocation in gcc-3 will buy us (was that 32 times?).  And,  after all,  there
_has_ to be some other case out there which is not as easily fixed as the GCC
one.
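
The speedup figure can be recomputed from the "sys" lines of the small test case with a short sketch. The helper name `to_seconds` is hypothetical, and it only handles the "NmS.SSSs" format that time(1) prints in the output quoted above.

```python
import re


def to_seconds(stamp: str) -> float:
    """Convert a `time`-style stamp such as '3m38.180s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time stamp: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


before = to_seconds("3m38.180s")  # sys time on 2.4.2, small test case
after = to_seconds("0m3.370s")    # sys time on 2.4.3-pre7, small test case
ratio = before / after
print(f"sys: {before:.2f}s -> {after:.2f}s (ratio {ratio:.1f})")
```

The ratio comes out somewhat below 70. The bigger test case is not directly comparable this way, since the 2.4.2 run failed before completing.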

> 		Linus

-- 
................................................................
:   jakob@unthought.net   : And I see the elder races,         :
:.........................: putrid forms of man                :
:   Jakob Østergaard      : See him rise and claim the earth,  :
:        OZ9ABN           : his downfall is at hand.           :
:.........................:............{Konkhra}...............:

* Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.
  2001-03-23  4:32             ` Kevin Buhr
  2001-03-24  4:11               ` Zack Weinberg
  2001-03-24  5:02               ` Linus Torvalds
@ 2001-03-24  9:48               ` Jakob Østergaard
  2001-03-24 19:54                 ` Kevin Buhr
       [not found]               ` <200103240502.VAA02673@penguin.transmeta.com>
  3 siblings, 1 reply; 32+ messages in thread
From: Jakob Østergaard @ 2001-03-24  9:48 UTC (permalink / raw)
  To: Kevin Buhr; +Cc: Linus Torvalds, Serge Orlov, linux-kernel, David S. Miller

On Thu, Mar 22, 2001 at 10:32:51PM -0600, Kevin Buhr wrote:
> Jakob Østergaard <jakob@unthought.net> writes:
> >
> > Try compiling something like Qt/KDE/gtk-- which are really heavy on
> > templates (with all the benefits and drawbacks of that).
> 
> Okay, I just compiled gtk-- 1.0.3 (with CFLAGS = "-O2 -g") under three
> versions of GCC (Debian 2.95.3, RedHat 2.96, and a CVS pull of the
> "gcc-3_0-branch") on the same Debian machine running kernel 2.4.2.

It's important that you use at least -O3 to get inlining too.

> 
> In all cases, the "cc1plus" processes appeared to max out around 25M
> total size.  The "maps" pseudofiles for the 2.95.3 and 3.0
> compiles never grew past 250 lines, but the "maps" pseudofiles for the
> RedHat 2.96 compile were gigantic, jumping to 3000 or 5000 lines at
> times.

25 MB doesn't count  ;)

> 
> The results speak for themselves:
> 
>     CVS gcc 3.0:          Debian gcc 2.95.3:   RedHat gcc 2.96:
>                       
>     real    16m8.423s     real    8m2.417s     real    12m24.939s
>     user    15m23.710s    user    7m22.200s    user    10m14.420s
>     sys     0m48.730s     sys     0m41.040s    sys     2m13.910s 
> maps:    <250 lines           <250 lines          >3000 lines
> 
> Obviously, the *real* problem is RedHat GCC 2.96.  If Linus bothers to
> write this patch (he probably already has), its only proven benefit so
> far is that it improves the performance of a RedHat-specific, orphaned
> G++ development snapshot that everyone (the people of RedHat most of
> all) will be glad to be rid of as soon as possible.

No, map merging is obviously a good idea if it can be done at little cost.
There has to be other cases out there than GCC 2.96 (which is still the
best damn C++ compiler to ship on any GNU/Linux distribution in history)

As someone else already pointed out, GCC-3.0 will improve its allocation,
but it *still* allocates many maps - less than before, but still potentially
lots...

> 
> The numbers above suggest that the patch is unlikely to have a
> positive impact on the performance of either officially released GCC
> versions or the upcoming 3.0 release.

It will still have the 70x performance increase on kernel memory map
handling as demonstrated in my benchmark just posted.  However, it will
be 70x of much less than with 2.96.

Granted, the impact will be much smaller on GCC-3.0 in terms of wall clock
ticks, but I bet there is some other code out there that also triggers the
map-nightmare.

> 
> Drifting off topic...

We can continue on /.  ;)

> 
> > Mozilla uses C++ mainly as "extended C" - due to compatibility concerns.
> 
> This statement is potentially misleading.
> 
> I think most people will believe you to mean "using C++ as a better C"
> in the sense of Stroustrup: using the small, conventional-language
> subset of C++ that looks like C but has stronger type checking,
> function and operator overloading, default arguments, "//" style
> comments, reference types, and other syntactic and semantic sugar.

Yes

> 
> Mozilla does not use C++ as "extended C" in this sense.  While it does
> use a *subset* of C++ for compatibility reasons, the subset includes
> extensive use of class lattices and polymorphism as well as extensive
> (albeit simple and carefully constructed) uses of templates for its
> utility classes, including string and component-autoreferencing
> template classes and functions that are used throughout the source.
> The only major C++ facilities that are not used are the standard
> library, RTTI, namespaces, and exception handling, but other than that
> it's a good, real-world C++ test case.

Ok - I just read the coding guidelines for Mozilla, that's where I got
my information from...  In general (except for the exceptions I guess),
rule number one is "Don't use templates". Rule 5 is "Don't use the namespace
facility". Rule 16 is "Don't put constructors in header files".   All
stuff that leads to much much shorter symbols (type names) and less code
inlining - something that makes the job a lot easier for GCC.

Putting template classes and functions in header files with heavy inlining
is something that makes GCC memory usage explode.  I once managed to write a
few hundred lines that I couldn't compile because GCC couldn't allocate
more than 800-900 MB (old glibc and GCC).   The code was badly designed
and easily fixed, but it demonstrated this feature in GCC nicely.

-- 
................................................................
:   jakob@unthought.net   : And I see the elder races,         :
:.........................: putrid forms of man                :
:   Jakob Østergaard      : See him rise and claim the earth,  :
:        OZ9ABN           : his downfall is at hand.           :
:.........................:............{Konkhra}...............:

* Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.
  2001-03-24  9:48               ` Jakob Østergaard
@ 2001-03-24 19:54                 ` Kevin Buhr
  2001-03-25  3:17                   ` Jakob Østergaard
  0 siblings, 1 reply; 32+ messages in thread
From: Kevin Buhr @ 2001-03-24 19:54 UTC (permalink / raw)
  To: Jakob Østergaard; +Cc: linux-kernel

Jakob Østergaard <jakob@unthought.net> writes:
> 
> It's important that you use at least -O3 to get inlining too.
[ . . . ]
> 25 MB doesn't count  ;)

Aggh!  I feel like I'm in a comedy sketch.  You tell me "do that".  
I do that.  You tell me, "you should try this instead", so I do this.
Then, you tell me, "but you should really do the other."

You're the one who suggested "gtk--" as a test case.  Built out of the
box, it uses "-O2".  If there were magical settings or sekret
incantations, I wish you'd mentioned them when you suggested it.

> No, map merging is obviously a good idea if it can be done at little cost.
> There has to be other cases out there than GCC 2.96 (which is still the
> best damn C++ compiler to ship on any GNU/Linux distribution in history)

If something has a cost, even a little cost, and no one can find a
benefit, then implementing it is not "obviously" a good idea.  That's
why Linus asked for a real-world example before he spent time
complicating the algorithms and adding checks that incur a cost for
every process, even those that won't get any benefit.

> As someone else already pointed out GCC-3.0 will improve its allocation,
> but it *still* allocates many maps - less than before, but still potentially
> lots...

Yes.  Zack's explanation is the first thing I've seen that makes a
case for some benefit (besides babysitting GCC 2.96) without
conflicting with the data I'm getting.

As I've noted elsewhere, I see no change at all in system time for GCC
3.0 between 2.4.2 and 2.4.3-pre7.  Given Zack's explanation, I'm
prepared to believe there might be a difference with, say, a 500meg
arena (or perhaps something as small as a 100meg arena).

> It will still have the 70x performance increase on kernel memory map
> handling as demonstrated in my benchmark just posted.  However, it will
> be 70x of much less than with 2.96.

For my test cases under 3.0, it looks like 70 times zero.  However,
I'm now prepared to believe that it could be 70 times something
non-zero for certain very hairy source files.

Kevin <buhr@stat.wisc.edu>

* Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.
       [not found]               ` <200103240502.VAA02673@penguin.transmeta.com>
@ 2001-03-24 21:22                 ` Kevin Buhr
  2001-03-25  3:37                   ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread
From: Kevin Buhr @ 2001-03-24 21:22 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Linus Torvalds <torvalds@transmeta.com> writes:
> >
[ under kernel 2.4.2 ]
> >
> >    CVS gcc 3.0:          Debian gcc 2.95.3:   RedHat gcc 2.96:
> >                      
> >    real    16m8.423s     real    8m2.417s     real    12m24.939s
> >    user    15m23.710s    user    7m22.200s    user    10m14.420s
> >    sys     0m48.730s     sys     0m41.040s    sys     2m13.910s 
> >maps:    <250 lines           <250 lines          >3000 lines
> >
> >Obviously, the *real* problem is RedHat GCC 2.96.  If Linus bothers to
> >write this patch (he probably already has),
> 
> Check out 2.4.3-pre7, I'd be interested to hear what the system time is
> for that one.

Okay.  One note about the above results: as Zack pointed out, my
2.95.3 number for "maps" was wrong.  I must have forgotten to collect
the data but thought I had.  In fact, there are ~10 lines in "maps"
for the 2.95.3 "cc1plus" process.  The other "maps" numbers for 3.0
and 2.96 are correct, at least within an order of magnitude.

Under 2.4.3-pre7, I get the following disappointing numbers:

    CVS gcc 3.0:          Debian gcc 2.95.3:   RedHat gcc 2.96:

    real    16m10.660s    real    7m58.874s    real    10m36.368s
    user    15m27.900s    user    7m23.090s    user    10m0.290s 
    sys     0m48.400s     sys     0m40.350s    sys     0m40.790s 
maps:   <20 lines             ~10 lines            ~10 lines

A huge win for 2.96 and absolutely no benefit whatsoever for 3.0, even
though it obviously had a 10-fold effect on maps counts.  On the
positive side, there was no performance *hit* either.

As a blind "have not looked at relevant kernel source" guess, this
looks like a hash scaling problem to me: the hash size works great for
~300 maps and falls apart in a major way at ~3000 maps, presumably
when we get multiple hits per hash bin and start walking 10-member
lists.

How this translates into a course of action---some combination of
keeping your patch, enlarging the hash, and performance tweaking the
list-walking---I'm not sure.

Kevin <buhr@stat.wisc.edu>

* Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.
  2001-03-24  4:11               ` Zack Weinberg
@ 2001-03-24 21:46                 ` Kevin Buhr
  0 siblings, 0 replies; 32+ messages in thread
From: Kevin Buhr @ 2001-03-24 21:46 UTC (permalink / raw)
  To: Zack Weinberg; +Cc: linux-kernel

"Zack Weinberg" <zackw@stanford.edu> writes:
> 
> Let me inject some information about what gcc's doing in each version.

Thanks...  very useful information.

> 2.95.3 allocates its memory via a bunch of 'obstacks' which,
> underneath, get memory from malloc, and therefore brk(2).  I'm very
> surprised to see it had ~250 vmas; it should be more like 10.

You are correct.  My "maps" numbers for 2.96 and 3.0 are correct (at
least within an order of magnitude), but I must have plucked the
number for 2.95.3 out of thin air---there are only ~10 maps, as you
predict.

> In conclusion, I think that GCC's allocator still makes a good case
> for merging vmas.

Maybe.  It looks like the performance drop is quite sharp as a
function of vma count.  In another note to the list, I observed no
system time change (not even a half a second) using GCC 3.0 on my
gtk-- test case between 2.4.2 and 2.4.3-pre7, even though the vma
count dropped from ~200 to ~15.  On the other hand, 2.96 dropped from
>3000 to ~10 and dropped from a system time of 2m13s to a system time
of 41sec (in line with the 3.0 and 2.95.3 system times).

Given your data, it'll really depend on where the performance hit is
taken.  If it's taken at 4000 vmas, then it'll take a 500 meg arena
under 3.0 before the patch makes a difference.  If it's taken at 1000
vmas, then we'll see it around 125 megs, and it'll really make a big
difference in some of the test cases people are talking about.
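
The two break-even points above follow from Zack's 32-page chunk size. A small sketch, assuming 4 KB pages and no merging (`arena_mb_at` is a hypothetical helper):

```python
PAGE_SIZE = 4096      # assumed page size in bytes
PAGES_PER_CHUNK = 32  # GCC 3.0 chunk size, per Zack's description


def arena_mb_at(vma_threshold):
    """Arena size in MB at which unmerged chunks reach the given vma count."""
    return vma_threshold * PAGES_PER_CHUNK * PAGE_SIZE / (1024 * 1024)


for threshold in (1000, 4000):
    print(f"{threshold} vmas -> {arena_mb_at(threshold):.0f} MB arena")
```

This gives 125 MB for 1000 vmas and 500 MB for 4000 vmas, matching the estimates in the paragraph above.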

Kevin <buhr@stat.wisc.edu>

* Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.
  2001-03-24 19:54                 ` Kevin Buhr
@ 2001-03-25  3:17                   ` Jakob Østergaard
  2001-03-25 16:47                     ` Jamie Lokier
  0 siblings, 1 reply; 32+ messages in thread
From: Jakob Østergaard @ 2001-03-25  3:17 UTC (permalink / raw)
  To: Kevin Buhr; +Cc: linux-kernel

On Sat, Mar 24, 2001 at 01:54:39PM -0600, Kevin Buhr wrote:
> Jakob Østergaard <jakob@unthought.net> writes:
> > 
> > It's important that you use at least -O3 to get inlining too.
> [ . . . ]
> > 25 MB doesn't count  ;)
> 
> Aggh!  I feel like I'm in a comedy sketch.  You tell me "do that".  
> I do that.  You tell me, "you should try this instead", so I do this.
> Then, you tell me, "but you should really do the other."

I'm sorry, I was wrong about gtk-- being hairy enough, and I should have
apologized earlier.

> 
> You're the one who suggested "gtk--" as a test case.  Built out of the
> box, it uses "-O2".  If there were magical settings or sekret
> incantations, I wish you'd mentioned them when you suggested it.

Yes, yes.  I guess even Qt won't do the trick either. I know at least one of
the KDE packages will, it uses Qt and if you set compilation options to -O6 it
will grow to ~100MB.

A few years ago when I compiled Mico, that one would make GCC chew up a few
hundred megs as well, if compilation options were set to use heavy
optimization.

But never mind about C++ test cases now...

> 
> > No, map merging is obviously a good idea if it can be done at little cost.
> > There has to be other cases out there than GCC 2.96 (which is still the
> > best damn C++ compiler to ship on any GNU/Linux distribution in history)
> 
> If something has a cost, even a little cost, and no one can find a
> benefit, then implementing it is not "obviously" a good idea.  That's
> why Linus asked for a real-world example before he spent time
> complicating the algorithms and adding checks that incur a cost for
> every process, even those that won't get any benefit.

I just felt that many other parts of the kernel try hard to make it as
inexpensive as possible to use kernel functionality, and that the VM naturally
should do the same (to a reasonable extent, of course, as with the other
layers).

For example, if I use thousands of TCP connections, the network layer folks
have been working hard to ensure that I can actually do that with good
performance.

It would feel "wrong" - I think - if the VM had this special rule that "you can
use MMAP, but if you do it a lot the kernel becomes horribly inefficient".
Especially because it was just proved that such behaviour could be completely
eliminated without a big performance overhead on other, simpler users of
the VM.

It's just my opinion - of course - but I think it's very nice that under Linux
you can actually use the system calls to do lots of neat tricks (such as the
GCC mmap one, or having a thousand TCP connections open), without being
penalized too heavily.  Using lots of system calls is not necessarily always
bad design.

> 
> > As someone else already pointed out GCC-3.0 will improve its allocation,
> > but it *still* allocates many maps - less than before, but still potentially
> > lots...
> 
> Yes.  Zack's explanation is the first thing I've seen that makes a
> case for some benefit (besides babysitting GCC 2.96) without
> conflicting with the data I'm getting.

But the bad case was a garbage collector in GCC.  The mmap tricks seem like
some you may be inclined to actually use in something like garbage collectors.
Are we sure that the developers of all other garbage collectors out there
foresaw this problem and didn't do mmap tricks ?

When running the Haskell interpreter "Hugs", I see lots of lines like this from
strace:
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40017000
But I don't have any "big" Haskell programs, so I don't know if Haskell does actually
exhibit the gcc-2.96 pattern too...

Maybe some Haskell / ML / Java folks could comment further on this ?

> 
> As I've noted elsewhere, I see no change at all in system time for GCC
> 3.0 between 2.4.2 and 2.4.3-pre7.  Given Zack's explanation, I'm
> prepared to believe there might be a difference with, say, a 500meg
> arena (or perhaps something as small as a 100meg arena).
> 
> > It will still have the 70x performance increase on kernel memory map
> > handling as demonstrated in my benchmark just posted.  However, it will
> > be 70x of much less than with 2.96.
> 
> For my test cases under 3.0, it looks like 70 times zero.  However,
> I'm now prepared to believe that it could be 70 times something
> non-zero for certain very hairy source files.

Or maybe 70x something large for some case we just don't know about yet ?

-- 
................................................................
:   jakob@unthought.net   : And I see the elder races,         :
:.........................: putrid forms of man                :
:   Jakob Østergaard      : See him rise and claim the earth,  :
:        OZ9ABN           : his downfall is at hand.           :
:.........................:............{Konkhra}...............:

* Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.
  2001-03-24 21:22                 ` Kevin Buhr
@ 2001-03-25  3:37                   ` Linus Torvalds
  2001-03-26  4:22                     ` Kevin Buhr
  0 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2001-03-25  3:37 UTC (permalink / raw)
  To: Kevin Buhr; +Cc: linux-kernel



On 24 Mar 2001, Kevin Buhr wrote:
>
> A huge win for 2.96 and absolutely no benefit whatsoever for 3.0, even
> though it obviously had a 10-fold effect on maps counts.  On the
> positive side, there was no performance *hit* either.

I don't think the system time in 3.0 has anything to do with the mmap size.

The 40 seconds of system time you see is probably mostly something else.
It's not as if gcc _only_ does mmap's.

Do a kernel profile, and I bet that the mmap stuff is pretty low in the
noise, and the 40 seconds are for things like clearing pages in
do_anonymous_page() and for actually reading and writing to the file. Note
how the sys numbers are now all pretty much the same across the board for
different gcc versions - regardless of whether they use mmap  for the
memory management or not.

(Well, gcc-2.95 and 2.96 are the same. Gcc-3.0 is higher, but it was
higher already before, and that's probably not the memory management per
se. I suspect it's because of other things, like bigger memory footprint
or similar. Or maybe the integrated preprocessor tends to do IO in smaller
chunks or something).

		Linus



* Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.
  2001-03-25  3:17                   ` Jakob Østergaard
@ 2001-03-25 16:47                     ` Jamie Lokier
  0 siblings, 0 replies; 32+ messages in thread
From: Jamie Lokier @ 2001-03-25 16:47 UTC (permalink / raw)
  To: Jakob Østergaard, Kevin Buhr, linux-kernel

Jakob Østergaard wrote:
> But the bad case was a garbage collector in GCC.  The mmap tricks seem like
> some you may be inclined to actually use in something like garbage collectors.
> Are we sure that the developers of all other garbage collectors out there
> foresaw this problem and didn't do mmap tricks ?

On this theme, some garbage collectors like to write-protect individual
pages, to detect which pages are modified between generations.  The
kernel has never handled this especially well.  It could be argued that
mprotect() and signal() aren't the right way to get this information
though, and it would be better to add a different mechanism.

-- Jamie

* Re: Linux 2.4.2 fails to merge mmap areas, 700% slowdown.
  2001-03-25  3:37                   ` Linus Torvalds
@ 2001-03-26  4:22                     ` Kevin Buhr
  0 siblings, 0 replies; 32+ messages in thread
From: Kevin Buhr @ 2001-03-26  4:22 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Linus Torvalds <torvalds@transmeta.com> writes:
> 
> On 24 Mar 2001, Kevin Buhr wrote:
> >
> > A huge win for 2.96 and absolutely no benefit whatsoever for 3.0, even
> > though it obviously had a 10-fold effect on maps counts.  On the
> > positive side, there was no performance *hit* either.
> 
> I don't think the system time in 3.0 has anything to do with the mmap size.
> 
> The 40 seconds of system time you see is probably mostly something else.
> It's not as if gcc _only_ does mmap's.

Yes, that's what I meant.  I was assuming that there was 40sec of
baseline system time in each compilation representing everything
*except* searching lists of unmerged mmaps.

Before doing the pre7 test, I figured that---given Zack's 32-factor
observation---my benchmarks indicated that 2.96 was spending 2m14s - 40s
= 94 sec doing unmerged mmaps while 3.0 was spending 49 - 40 = 9 sec
doing unmerged mmaps.  This roughly 10-fold ratio is at least the same
order of magnitude as the 32-fold difference in number of maps
predicted by Zack, given an uncertainty of a couple of seconds in the
3.0 figure.

That is, I was assuming that the total time wasted because of unmerged
mmaps was roughly linear in the number of vmas.  Actually, it'll be
O(n log n)---the number of maps times the O(log n) search time once
the AVL tree gets big enough to matter.  Anyway, the factor should
still be around 30-50 or so.
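
As a rough check on that estimate, the sketch below models the cost as
the number of maps times the depth of an ideally balanced binary tree. A
real AVL tree can be somewhat deeper, and the baseline map count passed
in is an assumption, not a measurement. The helper names are
hypothetical.

```c
#include <assert.h>

/* Levels in a perfectly balanced binary tree holding n nodes:
 * floor(log2(n)) + 1, computed without libm. */
static unsigned tree_levels(unsigned long n)
{
    unsigned levels = 0;

    while (n > 0) {
        levels++;
        n >>= 1;
    }
    return levels;
}

/* Modelled cost of n lookups in a tree of n vmas: n * levels. */
static double lookup_cost(unsigned long n)
{
    return (double)n * tree_levels(n);
}

/* Ratio of modelled costs when the map count grows by `factor`
 * from an assumed baseline of n_small maps. */
static double cost_ratio(unsigned long n_small, unsigned long factor)
{
    assert(n_small > 0 && factor > 0);
    return lookup_cost(n_small * factor) / lookup_cost(n_small);
}

/* Ratio of measured excess system time, e.g. 214 sec against 9 sec. */
static double measured_ratio(double slow_excess, double fast_excess)
{
    return slow_excess / fast_excess;
}
```

Under this model the ratio is always somewhat above the raw factor,
because the larger tree is also deeper. How far above depends on the
assumed baseline, so comparing cost_ratio(n, 32) for a few plausible n
against measured_ratio(214, 9) shows how sensitive the estimate is.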

When I did the test and 2.96 fell right in line with 2.95.3, I was
disappointed that 3.0 *didn't* fall right in line with the other
two---I thought I'd get those extra 8 seconds back.

> Do a kernel profile, and I bet that the mmap stuff is pretty low in the
> noise,

I'll bet you're right---that's why I was disappointed.  I thought 3.0's
mmap overhead would be higher than it turned out to be.

As it is, it looks like only the most extreme cases (thousands or tens
of thousands of mergeable maps) will benefit from the patch.
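
To see whether a given kernel falls into that extreme case, something
like the following sketch can reproduce the allocation pattern from the
original report (many one-page anonymous mmaps) and count the resulting
vmas. It assumes Linux with /proc mounted; count_vmas and
map_single_pages are hypothetical helper names.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count lines in /proc/self/maps, one per vma; -1 if it cannot be read. */
static long count_vmas(void)
{
    FILE *f = fopen("/proc/self/maps", "r");
    long lines = 0;
    int c;

    if (!f)
        return -1;
    while ((c = getc(f)) != EOF)
        if (c == '\n')
            lines++;
    fclose(f);
    return lines;
}

/* Map n separate one-page anonymous regions and touch each one.
 * Returns how many mappings succeeded. */
static long map_single_pages(long n)
{
    long page = sysconf(_SC_PAGESIZE);
    long ok = 0;
    long i;

    assert(page > 0);
    for (i = 0; i < n; i++) {
        char *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            break;
        p[0] = 1;                        /* fault the page in */
        ok++;
    }
    return ok;
}
```

Calling count_vmas() before and after map_single_pages(2000) gives the
comparison: a kernel that merges adjacent anonymous mappings should show
little growth, while one that does not should show roughly one new
entry per call.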

Kevin <buhr@stat.wisc.edu>


Thread overview: 32+ messages
2001-03-21  2:02 Linux 2.4.2 fails to merge mmap areas, 700% slowdown Dieter Nützel
  -- strict thread matches above, loose matches on Subject: below --
2001-03-20 18:28 Serge Orlov
2001-03-20 18:43 ` Linus Torvalds
2001-03-20 18:59   ` Jakob Østergaard
2001-03-21  1:20   ` Kevin Buhr
2001-03-21  1:38     ` David S. Miller
2001-03-21 20:19       ` Kevin Buhr
2001-03-22 18:23         ` Kevin Buhr
2001-03-22 18:35           ` Jakob Østergaard
2001-03-23  4:32             ` Kevin Buhr
2001-03-24  4:11               ` Zack Weinberg
2001-03-24 21:46                 ` Kevin Buhr
2001-03-24  5:02               ` Linus Torvalds
2001-03-24  9:31                 ` Jakob Østergaard
2001-03-24  9:48               ` Jakob Østergaard
2001-03-24 19:54                 ` Kevin Buhr
2001-03-25  3:17                   ` Jakob Østergaard
2001-03-25 16:47                     ` Jamie Lokier
     [not found]               ` <200103240502.VAA02673@penguin.transmeta.com>
2001-03-24 21:22                 ` Kevin Buhr
2001-03-25  3:37                   ` Linus Torvalds
2001-03-26  4:22                     ` Kevin Buhr
2001-03-23 20:43             ` James Lewis Nance
2001-03-21  6:41     ` Mike Galbraith
2001-03-21 14:56       ` Matthias Urlichs
2001-03-21 15:05         ` Mike Galbraith
2001-03-21 15:59       ` Kurt Garloff
2001-03-21 16:45         ` Mike Galbraith
2001-03-21 20:16           ` Kevin Buhr
2001-03-22  9:04             ` Mike Galbraith
2001-03-22 22:19               ` Kevin Buhr
2001-03-23  7:44                 ` Mike Galbraith
2001-03-20 18:43 ` Jakob Østergaard
