Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)

From: Brian Gerst <bgerst@didntduck.org>
To: Linus Torvalds <torvalds@transmeta.com>
Cc: Kevin Pedretti <ktpedre@sandia.gov>, linux-kernel@vger.kernel.org
Subject: Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)
Date: Tue, 18 Mar 2003 13:30:54 -0500
Message-ID: <3E7765DE.10609@didntduck.org>
In-Reply-To: <Pine.LNX.4.44.0303180809190.11381-100000@home.transmeta.com>

Linus Torvalds wrote:
> On Tue, 18 Mar 2003, Kevin Pedretti wrote:
> 
>>    I wasn't aware of what you state below but it makes sense.  What I 
>>haven't been able to figure out, and nobody seems to know, is why the 
>>rodata section of an executable is placed in the text section and is not 
>>page aligned.  This seems to be a mixing of code and data on the same 
>>page.  Maybe it doesn't matter since it is read only?
> 
> 
> It's a bad idea to share even read-only data, but the impact of read-only 
> data is much less than that of read-write. In particular, you should avoid sharing 
> _any_ code and data in the same physical L1 cache-line, since that will be 
> a big problem for any CPU with exclusion between the I$ and D$.
> 
> HOWEVER, modern x86 CPU's tend to have the I$ be part of the cache 
> coherency protocol, so instead of having exclusion they allow sharing as 
> long as the D$ isn't actually dirty. In that case it's fine to share 
> read-only data and code, although the cache utilization goes down if you 
> do a lot of it.
> 
> Anyway, as long as they are in separate cache-lines, you should be ok even 
> on something with cache exclusion.
> 
> When it comes to actually _writing_ to the data, at least on the P4 you
> don't want to have read-write data anywhere _near_ the I$ (somebody
> reported half-page granularity). This is true on crusoe too, btw (at a
> 128-byte granularity).
> 
> Anyway, I think gcc should make sure that even the ro-data section is at
> least cacheline-aligned so that it stays away from cachelines used for I$.  
> That makes sense even on CPU's that don't have exclusion, since it
> actually gives slightly better L1 cache utilization.
> 
> You can run this (stupid) test-program to try. On my P4 I get
> 
> 	empty overhead=320 cycles
> 	load overhead=0 cycles
> 	I$ load overhead=0 cycles
> 	I$ load overhead=0 cycles
> 	I$ store overhead=264 cycles
> 
> and on my PIII I get
> 
> 	empty overhead=74 cycles
> 	load overhead=8 cycles
> 	I$ load overhead=8 cycles
> 	I$ load overhead=8 cycles
> 	I$ store overhead=103 cycles
> 
> and (just for fun) on an old crusoe I get
> 
> 	empty overhead=67 cycles
> 	load overhead=-9 cycles
> 	I$ load overhead=-14 cycles
> 	I$ load overhead=-14 cycles
> 	I$ store overhead=12 cycles
> 
> where that "negative overhead" just shows that we do some strange things to
> scheduling, and the loop actually ends up faster if it has a load in it
> than without the load..
> 
> But you can see that storing to code is a really bad idea. Especially on a 
> P4, where the overhead for a store was 264 cycles! (You can also see the 
> cost of doing just the empty synchronization and rdtsc - 320 cycles for a 
> rdtsc and two locked memory accesses on a P4).
> 
> I don't have access to an old Pentium - I think that was the one that had 
> the strict exclusion between the L1 I$ and D$, and then you should see the 
> I$ load overhead go up.
> 
> 			Linus

Here are a few more data points:

vendor_id       : AuthenticAMD
cpu family      : 5
model           : 8
model name      : AMD-K6(tm) 3D processor
stepping        : 12
cpu MHz         : 451.037
empty overhead=105 cycles
load overhead=-2 cycles
I$ load overhead=30 cycles
I$ load overhead=90 cycles
I$ store overhead=95 cycles


vendor_id       : GenuineIntel
cpu family      : 6
model           : 3
model name      : Pentium II (Klamath)
stepping        : 3
cpu MHz         : 265.913
empty overhead=73 cycles
load overhead=10 cycles
I$ load overhead=10 cycles
I$ load overhead=10 cycles
I$ store overhead=2 cycles


vendor_id       : AuthenticAMD
cpu family      : 6
model           : 6
model name      : AMD Athlon(tm) Processor
stepping        : 2
cpu MHz         : 1409.946
empty overhead=11 cycles
load overhead=5 cycles
I$ load overhead=5 cycles
I$ load overhead=5 cycles
I$ store overhead=826 cycles

The Athlon XP shows really bad behavior when you store to the text area:
826 cycles of overhead, versus 5 for the loads.
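
For anyone who wants to check whether a data address sits too close to
code, here is a minimal sketch in C of the alignment arithmetic behind
the advice quoted above. The helper names align_up and same_block are
hypothetical and are not part of Linus's test program. The granularity
is a parameter because the right value depends on the CPU: a cache
line, 128 bytes (the figure quoted for crusoe), or half a page (the
figure reported for the P4).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Round addr up to the next multiple of align.
 * Assumption: align is a nonzero power of two.
 */
static uintptr_t align_up(uintptr_t addr, uintptr_t align)
{
    return (addr + align - 1) & ~(align - 1);
}

/*
 * Return true if addresses a and b fall inside the same naturally
 * aligned block of gran bytes (gran must be a power of two).
 * With gran set to the L1 line size this answers "do these two
 * objects share a cache line?".
 */
static bool same_block(uintptr_t a, uintptr_t b, uintptr_t gran)
{
    return (a & ~(gran - 1)) == (b & ~(gran - 1));
}
```

Calling same_block() with the end address of the text and the start
address of the data, at each granularity of interest, tells you whether
the two overlap in a block; align_up() gives the first address that
would be safe to place data at.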

--
				Brian Gerst
