Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)

From: Brian Gerst <bgerst@didntduck.org>
To: Linus Torvalds <torvalds@transmeta.com>
Cc: Kevin Pedretti <ktpedre@sandia.gov>, linux-kernel@vger.kernel.org
Subject: Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)
Date: Tue, 18 Mar 2003 13:30:54 -0500
Message-ID: <3E7765DE.10609@didntduck.org>
In-Reply-To: <Pine.LNX.4.44.0303180809190.11381-100000@home.transmeta.com>

Linus Torvalds wrote:
> On Tue, 18 Mar 2003, Kevin Pedretti wrote:
> 
>>    I wasn't aware of what you state below but it makes sense.  What I 
>>haven't been able to figure out, and nobody seems to know, is why the 
>>rodata section of an executable is placed in the text section and is not 
>>page aligned.  This seems to be a mixing of code and data on the same 
>>page.  Maybe it doesn't matter since it is read only?
> 
> 
> It's a bad idea to share even read-only data, but the impact of read-only 
> data is much less than read-write. In particular, you should avoid sharing 
> _any_ code and data in the same physical L1 cache-line, since that will be 
> a big problem for any CPU with exclusion between the I$ and D$.
> 
> HOWEVER, modern x86 CPU's tend to have the I$ be part of the cache 
> coherency protocol, so instead of having exclusion they allow sharing as 
> long as the D$ isn't actually dirty. In that case it's fine to share 
> read-only data and code, although the cache utilization goes down if you 
> do a lot of it.
> 
> Anyway, as long as they are in separate cache-lines, you should be ok even 
> on something with cache exclusion.
> 
> When it comes to actually _writing_ to the data, at least on the P4 you
> don't want to have read-write data anywhere _near_ the I$ (somebody
> reported half-page granularity). This is true on crusoe too, btw (at a
> 128-byte granularity).
> 
> Anyway, I think gcc should make sure that even the ro-data section is at
> least cacheline-aligned so that it stays away from cachelines used for I$.  
> That makes sense even on CPU's that don't have exclusion, since it
> actually gives slightly better L1 cache utilization.
> 
> You can run this (stupid) test-program to try. On my P4 I get
> 
> 	empty overhead=320 cycles
> 	load overhead=0 cycles
> 	I$ load overhead=0 cycles
> 	I$ load overhead=0 cycles
> 	I$ store overhead=264 cycles
> 
> and on my PIII I get
> 
> 	empty overhead=74 cycles
> 	load overhead=8 cycles
> 	I$ load overhead=8 cycles
> 	I$ load overhead=8 cycles
> 	I$ store overhead=103 cycles
> 
> and (just for fun) on an old crusoe I get
> 
> 	empty overhead=67 cycles
> 	load overhead=-9 cycles
> 	I$ load overhead=-14 cycles
> 	I$ load overhead=-14 cycles
> 	I$ store overhead=12 cycles
> 
> where that "negative overhead" just shows that we do some strange things to
> scheduling, and the loop actually ends up faster if it has a load in it
> than without the load..
> 
> But you can see that storing to code is a really bad idea. Especially on a 
> P4, where the overhead for a store was 264 cycles! (You can also see the 
> cost of doing just the empty synchronization and rdtsc - 320 cycles for a 
> rdtsc and two locked memory accesses on a P4).
> 
> I don't have access to an old Pentium - I think that was the one that had 
> the strict exclusion between the L1 I$ and D$, and then you should see the 
> I$ load overhead go up.
> 
> 			Linus

Here are a few more data points:

vendor_id       : AuthenticAMD
cpu family      : 5
model           : 8
model name      : AMD-K6(tm) 3D processor
stepping        : 12
cpu MHz         : 451.037
empty overhead=105 cycles
load overhead=-2 cycles
I$ load overhead=30 cycles
I$ load overhead=90 cycles
I$ store overhead=95 cycles


vendor_id       : GenuineIntel
cpu family      : 6
model           : 3
model name      : Pentium II (Klamath)
stepping        : 3
cpu MHz         : 265.913
empty overhead=73 cycles
load overhead=10 cycles
I$ load overhead=10 cycles
I$ load overhead=10 cycles
I$ store overhead=2 cycles


vendor_id       : AuthenticAMD
cpu family      : 6
model           : 6
model name      : AMD Athlon(tm) Processor
stepping        : 2
cpu MHz         : 1409.946
empty overhead=11 cycles
load overhead=5 cycles
I$ load overhead=5 cycles
I$ load overhead=5 cycles
I$ store overhead=826 cycles

The Athlon XP shows really bad behavior when you store to the text area:
826 cycles, versus 95 on the K6 and 2 on the Pentium II.

--
				Brian Gerst