* Patch for optimize context switch
@ 2000-02-21 10:49 FASSINO Jean-Philippe
2000-02-21 23:12 ` Paul Mackerras
0 siblings, 1 reply; 10+ messages in thread
From: FASSINO Jean-Philippe @ 2000-02-21 10:49 UTC (permalink / raw)
To: linuxppc-dev@lists.linuxppc.org
Cc: Benjamin Herrenschmidt, paulus@linuxcare.com
The aim of this patch is to optimize the context switch on PPC.
It allows better use of the pipeline and removes nearly 30 instructions per
context switch.
I'm using it on my computer and it works well, please test it!
The patch was made against kernel 2.2.15pre7 and modifies _switch
and set_context in arch/ppc/kernel/head.S.
On kernel 2.3.X only set_context needs to be modified.
Jean-Philippe
--- head.S Mon Feb 21 10:16:41 2000
+++ arch/ppc/kernel/head.S Mon Feb 21 11:31:38 2000
@@ -2335,13 +2335,29 @@
/* Set up segment registers for new task */
rlwinm r5,r5,4,8,27 /* VSID = context << 4 */
addis r5,r5,0x6000 /* Set Ks, Ku bits */
- li r0,12 /* TASK_SIZE / SEGMENT_SIZE */
- mtctr r0
- li r9,0
-3: mtsrin r5,r9
- addi r5,r5,1 /* next VSID */
- addis r9,r9,0x1000 /* address of next segment */
- bdnz 3b
+ addi r9,r5,1
+ mtsr SR0,r5 /* update SR0 .. SR[TASK_SIZE/SEGMENT_SIZE-1] */
+ addi r5,r9,1
+ mtsr SR1,r9
+ addi r9,r5,1
+ mtsr SR2,r5
+ addi r5,r9,1
+ mtsr SR3,r9
+ addi r9,r5,1
+ mtsr SR4,r5
+ addi r5,r9,1
+ mtsr SR5,r9
+ addi r9,r5,1
+ mtsr SR6,r5
+ addi r5,r9,1
+ mtsr SR7,r9
+ addi r9,r5,1
+ mtsr SR8,r5
+ addi r5,r9,1
+ mtsr SR9,r9
+ addi r9,r5,1
+ mtsr SR10,r5
+ mtsr SR11,r9
#else
/* On the MPC8xx, we place the physical address of the new task
* page directory loaded into the MMU base register, and set the
@@ -2500,13 +2516,29 @@
_GLOBAL(set_context)
rlwinm r3,r3,4,8,27 /* VSID = context << 4 */
addis r3,r3,0x6000 /* Set Ks, Ku bits */
- li r0,12 /* TASK_SIZE / SEGMENT_SIZE */
- mtctr r0
- li r4,0
-3: mtsrin r3,r4
- addi r3,r3,1 /* next VSID */
- addis r4,r4,0x1000 /* address of next segment */
- bdnz 3b
+ addi r4,r3,1
+ mtsr SR0,r3 /* update SR0 .. SR[TASK_SIZE/SEGMENT_SIZE-1] */
+ addi r3,r4,1
+ mtsr SR1,r4
+ addi r4,r3,1
+ mtsr SR2,r3
+ addi r3,r4,1
+ mtsr SR3,r4
+ addi r4,r3,1
+ mtsr SR4,r3
+ addi r3,r4,1
+ mtsr SR5,r4
+ addi r4,r3,1
+ mtsr SR6,r3
+ addi r3,r4,1
+ mtsr SR7,r4
+ addi r4,r3,1
+ mtsr SR8,r3
+ addi r3,r4,1
+ mtsr SR9,r4
+ addi r4,r3,1
+ mtsr SR10,r3
+ mtsr SR11,r4
SYNC
blr
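
As a sanity check on the patch above, here is a small Python model (not kernel code) of the segment-register setup before and after the change. It is only a sketch: the function names are made up for illustration, and the register behaviour is modelled from the instruction comments in the diff (rlwinm as a rotate-and-mask, mtsrin selecting the segment register from the top four address bits). Under those assumptions both versions should leave the same values in SR0..SR11.

```python
MASK32 = 0xFFFFFFFF
NUM_USER_SEGMENTS = 12  # TASK_SIZE / SEGMENT_SIZE in the patch


def base_vsid(context):
    """Model rlwinm rX,rX,4,8,27 followed by addis rX,rX,0x6000."""
    rotated = ((context << 4) | (context >> 28)) & MASK32  # rotate left by 4
    masked = rotated & 0x00FFFFF0  # keep bits 8..27 (bit 0 is the MSB)
    return (masked + 0x60000000) & MASK32  # set the Ks and Ku bits


def segments_loop(context):
    """Original code: mtsrin inside a bdnz loop."""
    sr = [None] * 16
    value, addr = base_vsid(context), 0
    for _ in range(NUM_USER_SEGMENTS):  # li r0,12; mtctr r0; ... bdnz 3b
        sr[(addr >> 28) & 0xF] = value  # mtsrin: SR picked by top 4 address bits
        value = (value + 1) & MASK32  # addi: next VSID
        addr = (addr + 0x10000000) & MASK32  # addis: address of next segment
    return sr


def segments_unrolled(context):
    """Patched code: mtsr SRn, with two registers used alternately."""
    sr = [None] * 16
    a = base_vsid(context)
    b = (a + 1) & MASK32  # addi r9,r5,1
    for n in range(0, NUM_USER_SEGMENTS - 2, 2):
        sr[n] = a  # mtsr SRn,r5
        a = (b + 1) & MASK32  # addi r5,r9,1
        sr[n + 1] = b  # mtsr SRn+1,r9
        b = (a + 1) & MASK32  # addi r9,r5,1
    sr[NUM_USER_SEGMENTS - 2] = a  # mtsr SR10,r5
    sr[NUM_USER_SEGMENTS - 1] = b  # mtsr SR11,r9
    return sr


if __name__ == "__main__":
    for ctx in (0, 1, 0x1234, 0xFFFFF):
        assert segments_loop(ctx) == segments_unrolled(ctx)
    print("loop and unrolled versions agree")
```

The model says nothing about timing; it only checks that the unrolled sequence is a faithful replacement for the loop.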
--
--------------------------------------------------------------------------
Jean-Philippe FASSINO Tel : 04 76 76 45 52
CNET : DTL/ASR mailto:jeanphilippe.fassino@cnet.francetelecom.fr
--------------------------------------------------------------------------
** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Patch for optimize context switch
2000-02-21 10:49 Patch for optimize context switch FASSINO Jean-Philippe
@ 2000-02-21 23:12 ` Paul Mackerras
2000-02-22 9:27 ` Gabriel Paubert
2000-02-22 9:30 ` FASSINO Jean-Philippe
0 siblings, 2 replies; 10+ messages in thread
From: Paul Mackerras @ 2000-02-21 23:12 UTC (permalink / raw)
To: FASSINO Jean-Philippe, linuxppc-dev@lists.linuxppc.org
On Mon, 21 Feb 2000, FASSINO Jean-Philippe wrote:
> The aim of this patch is to optimize the context switch on PPC.
> It allows better use of the pipeline and removes nearly 30 instructions per
> context switch.
> I'm using it on my computer and it works well, please test it!
Interesting. How much does it reduce the context switch time? Did you
run lmbench or something to see if it makes it go faster?
The reason I ask is that it is possible that unrolling the loop as you
have done could actually make it go slower due to increased i-cache
misses. The bdnz instruction on PPC has essentially zero overhead since it
is pulled out of the instruction stream in the fetch/decode unit by the
branch processing unit. Also, it is very easy to predict whether a bdnz
will branch or not.
--
Paul Mackerras, Senior Open Source Researcher, Linuxcare, Inc.
+61 2 6262 8990 tel, +61 2 6262 8991 fax
paulus@linuxcare.com.au, http://www.linuxcare.com.au/
Linuxcare. Support for the revolution.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Patch for optimize context switch
2000-02-21 23:12 ` Paul Mackerras
@ 2000-02-22 9:27 ` Gabriel Paubert
2000-02-22 10:50 ` Benjamin Herrenschmidt
2000-02-22 9:30 ` FASSINO Jean-Philippe
1 sibling, 1 reply; 10+ messages in thread
From: Gabriel Paubert @ 2000-02-22 9:27 UTC (permalink / raw)
To: Paul Mackerras; +Cc: FASSINO Jean-Philippe, linuxppc-dev@lists.linuxppc.org
On Tue, 22 Feb 2000, Paul Mackerras wrote:
>
> On Mon, 21 Feb 2000, FASSINO Jean-Philippe wrote:
>
> > The aim of this patch is to optimize the context switch on PPC.
> > It allows better use of the pipeline and removes nearly 30 instructions per
> > context switch.
> > I'm using it on my computer and it works well, please test it!
>
> Interesting. How much does it reduce the context switch time? Did you
> run lmbench or something to see if it makes it go faster?
>
> The reason I ask is that it is possible that unrolling the loop as you
> have done could actually make it go slower due to increased i-cache
> misses. The bdnz instruction on PPC has essentially zero overhead since it
> is pulled out of the instruction stream in the fetch/decode unit by the
> branch processing unit. Also, it is very easy to predict whether a bdnz
> will branch or not.
Exactly, context switches are infrequent and should be benchmarked at
least after having invalidated the code from the instruction cache
(whether it should also be pushed out of the L2 cache is more questionable,
but I would also push it out since the L2 cache is only direct-mapped or 2-way
set associative on most processors). Besides, the time of the loop is
dominated by the execution-synchronized mtsrin. Actually, the only
processor on which it might be a clear win is the 601.
Gabriel
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Patch for optimize context switch
2000-02-22 9:27 ` Gabriel Paubert
@ 2000-02-22 10:50 ` Benjamin Herrenschmidt
2000-02-22 11:13 ` Gabriel Paubert
0 siblings, 1 reply; 10+ messages in thread
From: Benjamin Herrenschmidt @ 2000-02-22 10:50 UTC (permalink / raw)
To: Gabriel Paubert, paulus, linuxppc-dev
On Tue, Feb 22, 2000, Gabriel Paubert <paubert@iram.es> wrote:
>Exactly, context switches are infrequent and should be benchmarked at
>least after having invalidated the code from the instruction cache
>(whether it should also be pushed out of the L2 cache is more questionable,
>but I would also push it out since the L2 cache is only direct-mapped or 2-way
>set associative on most processors). Besides, the time of the loop is
>dominated by the execution-synchronized mtsrin. Actually, the only
>processor on which it might be a clear win is the 601.
BTW, there's an idea that has been idling in my mind for some time:
Do you think there would be any interest in adding code to some drivers
to invalidate the data cache for buffers before doing DMA-read I/Os to
them? (For example, invalidating the block buffers before or just after
having started a DMA read in the IDE driver.)
That data will be replaced by new data, so invalidating it before
(or at the beginning of) the transfer will help avoid snoop hits
during the transfer itself, and possibly help keep more useful
things in the cache.
I'm not sure this would have any measurable impact; I may just set up
lmbench and try it out once I'm finished with my current stuff.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Patch for optimize context switch
2000-02-22 10:50 ` Benjamin Herrenschmidt
@ 2000-02-22 11:13 ` Gabriel Paubert
0 siblings, 0 replies; 10+ messages in thread
From: Gabriel Paubert @ 2000-02-22 11:13 UTC (permalink / raw)
To: Benjamin Herrenschmidt; +Cc: paulus, linuxppc-dev
On Tue, 22 Feb 2000, Benjamin Herrenschmidt wrote:
> BTW, there's an idea that has been idling in my mind for some time:
>
> Do you think there would be any interest in adding code to some drivers
> to invalidate the data cache for buffers before doing DMA-read I/Os to
> them? (For example, invalidating the block buffers before or just after
> having started a DMA read in the IDE driver.)
I'd rather let the hardware invalidate the caches. Most bridges now
gather stores from the PCI until they reach one full cache line of data,
which they burst with a "write-with-kill" transaction on the bus. This
means that the data is invalidated in the cache if necessary, but the snoop
does not cause a data cast-out. In the end you would end up doing twice
the work and increasing bus traffic, especially on processors designed for
SMP, since the cache invalidation (even if done with dcbi, which will
not push data) will also cause bus broadcasts.
Gabriel.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Patch for optimize context switch
2000-02-21 23:12 ` Paul Mackerras
2000-02-22 9:27 ` Gabriel Paubert
@ 2000-02-22 9:30 ` FASSINO Jean-Philippe
2000-02-22 11:23 ` FASSINO Jean-Philippe
2000-02-22 11:40 ` Gabriel Paubert
1 sibling, 2 replies; 10+ messages in thread
From: FASSINO Jean-Philippe @ 2000-02-22 9:30 UTC (permalink / raw)
To: Paul Mackerras; +Cc: linuxppc-dev@lists.linuxppc.org
Paul Mackerras wrote:
> On Mon, 21 Feb 2000, FASSINO Jean-Philippe wrote:
>
> > The aim of this patch is to optimize the context switch on PPC.
> > It allows better use of the pipeline and removes nearly 30 instructions per
> > context switch.
> > I'm using it on my computer and it works well, please test it!
>
> Interesting. How much does it reduce the context switch time? Did you
> run lmbench or something to see if it makes it go faster?
>
> The reason I ask is that it is possible that unrolling the loop as you
> have done could actually make it go slower due to increased i-cache
> misses. The bdnz instruction on PPC has essentially zero overhead since it
> is pulled out of the instruction stream in the fetch/decode unit by the
> branch processing unit. Also, it is very easy to predict whether a bdnz
> will branch or not.
There are two advantages to this patch:
- unrolling the loop (removes the bdnz instructions),
- statically designating the segment register (removes one add per iteration).
The main disadvantage is:
- possibly more i-cache misses (depending on function alignment)
To conclude, I'm trying to run lmbench and will send the results when I have them.
Jean-Philippe
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Patch for optimize context switch
2000-02-22 9:30 ` FASSINO Jean-Philippe
@ 2000-02-22 11:23 ` FASSINO Jean-Philippe
2000-02-22 11:33 ` FASSINO Jean-Philippe
2000-02-22 11:40 ` Gabriel Paubert
1 sibling, 1 reply; 10+ messages in thread
From: FASSINO Jean-Philippe @ 2000-02-22 11:23 UTC (permalink / raw)
To: Paul Mackerras
Cc: linuxppc-dev@lists.linuxppc.org, Benjamin Herrenschmidt,
Gabriel Paubert
[-- Attachment #1: Type: text/plain, Size: 1362 bytes --]
FASSINO Jean-Philippe wrote:
> Paul Mackerras wrote:
>
> > Interesting. How much does it reduce the context switch time? Did you
> > run lmbench or something to see if it makes it go faster?
I now have lmbench results for context switching (init level 1).
I ran 5 benchmark runs each with and without the patch on a PBG3/400.
The full results are attached. The averages are:
Context switching - times in microseconds - smaller is better
-------------------------------------------------------------
Host      OS            2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                        ctxsw  ctxsw  ctxsw  ctxsw  ctxsw   ctxsw   ctxsw
--------- ------------- ----- ------ ------ ------ ------ ------- -------
With patch ======
ppc-linux Linux 2.2.15p   0.6      8     91     26    117      28     239
Without patch ======
ppc-linux Linux 2.2.15p     1      7    102     26    121      28     240
What do you think about these results?
I think many more benchmark runs are needed before we can conclude anything!
Here, performance varies too much between two runs to really say anything.
Jean-Philippe
[-- Attachment #2: res.txt --]
[-- Type: text/plain, Size: 1065 bytes --]
Context switching - times in microseconds - smaller is better
-------------------------------------------------------------
Host      OS            2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                        ctxsw  ctxsw  ctxsw  ctxsw  ctxsw   ctxsw   ctxsw
--------- ------------- ----- ------ ------ ------ ------ ------- -------
ppc-linux Linux 2.2.15p     1      7     90     26    107      31     222
ppc-linux Linux 2.2.15p     1      7     91     26    110      29     257
ppc-linux Linux 2.2.15p     1      7     90     25    112      26     233
ppc-linux Linux 2.2.15p     1      7     92     25    126      27     243
ppc-linux Linux 2.2.15p     1      7    147     26    150      27     245
ppc-linux Linux 2.2.15p     0      7     92     25    136      28     270
ppc-linux Linux 2.2.15p     1      8     90     27    103      27     211
ppc-linux Linux 2.2.15p     0      9     90     30    112      32     228
ppc-linux Linux 2.2.15p     1      7     91     25    130      29     277
ppc-linux Linux 2.2.15p     1      7     90     25    105      26     211
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Patch for optimize context switch
2000-02-22 11:23 ` FASSINO Jean-Philippe
@ 2000-02-22 11:33 ` FASSINO Jean-Philippe
2000-02-22 11:50 ` Gabriel Paubert
0 siblings, 1 reply; 10+ messages in thread
From: FASSINO Jean-Philippe @ 2000-02-22 11:33 UTC (permalink / raw)
To: Paul Mackerras
Cc: linuxppc-dev@lists.linuxppc.org, Benjamin Herrenschmidt,
Gabriel Paubert
FASSINO Jean-Philippe wrote:
> FASSINO Jean-Philippe wrote:
>
> ------------------------------------------------------------------------
> Context switching - times in microseconds - smaller is better
> -------------------------------------------------------------
> Host      OS            2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
>                         ctxsw  ctxsw  ctxsw  ctxsw  ctxsw   ctxsw   ctxsw
> --------- ------------- ----- ------ ------ ------ ------ ------- -------
> ppc-linux Linux 2.2.15p     1      7     90     26    107      31     222
> ppc-linux Linux 2.2.15p     1      7     91     26    110      29     257
> ppc-linux Linux 2.2.15p     1      7     90     25    112      26     233
> ppc-linux Linux 2.2.15p     1      7     92     25    126      27     243
> ppc-linux Linux 2.2.15p     1      7    147     26    150      27     245
> ppc-linux Linux 2.2.15p     0      7     92     25    136      28     270
> ppc-linux Linux 2.2.15p     1      8     90     27    103      27     211
> ppc-linux Linux 2.2.15p     0      9     90     30    112      32     228
> ppc-linux Linux 2.2.15p     1      7     91     25    130      29     277
> ppc-linux Linux 2.2.15p     1      7     90     25    105      26     211
Sorry, I forgot to say:
- the first 5 runs are without the patch
- the last 5 runs are with the patch
Jean-Philippe
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Patch for optimize context switch
2000-02-22 11:33 ` FASSINO Jean-Philippe
@ 2000-02-22 11:50 ` Gabriel Paubert
0 siblings, 0 replies; 10+ messages in thread
From: Gabriel Paubert @ 2000-02-22 11:50 UTC (permalink / raw)
To: FASSINO Jean-Philippe
Cc: Paul Mackerras, linuxppc-dev@lists.linuxppc.org,
Benjamin Herrenschmidt
On Tue, 22 Feb 2000, FASSINO Jean-Philippe wrote:
> FASSINO Jean-Philippe wrote:
>
> > FASSINO Jean-Philippe wrote:
> >
> > ------------------------------------------------------------------------
> > Context switching - times in microseconds - smaller is better
> > -------------------------------------------------------------
> > Host      OS            2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
> >                         ctxsw  ctxsw  ctxsw  ctxsw  ctxsw   ctxsw   ctxsw
> > --------- ------------- ----- ------ ------ ------ ------ ------- -------
> > ppc-linux Linux 2.2.15p     1      7     90     26    107      31     222
> > ppc-linux Linux 2.2.15p     1      7     91     26    110      29     257
> > ppc-linux Linux 2.2.15p     1      7     90     25    112      26     233
> > ppc-linux Linux 2.2.15p     1      7     92     25    126      27     243
> > ppc-linux Linux 2.2.15p     1      7    147     26    150      27     245
>
> > ppc-linux Linux 2.2.15p     0      7     92     25    136      28     270
> > ppc-linux Linux 2.2.15p     1      8     90     27    103      27     211
> > ppc-linux Linux 2.2.15p     0      9     90     30    112      32     228
> > ppc-linux Linux 2.2.15p     1      7     91     25    130      29     277
> > ppc-linux Linux 2.2.15p     1      7     90     25    105      26     211
>
> Sorry, I forgot to say:
> - the first 5 runs are without the patch
> - the last 5 runs are with the patch
That's largely in the noise: results with processes which do not pollute
the cache are slightly better with the patch. But I don't consider this a
real-life situation. Processes which actually do some work don't see any
practical difference. The 147 microseconds in the 3rd column is very
probably a bogus point due to a collision in the L2 cache, which is only 2-way
set associative.
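
One rough way to see how much of this is noise is to put a spread next to each average. The sketch below recomputes the per-column means from the ten posted runs and adds a sample standard deviation. The numbers are copied from the quoted table; the split into "first five without, last five with" follows the note above; the variable and function names are only illustrative. With five runs per side this is not a significance test, just a way to compare the gap between the means with the run-to-run scatter.

```python
from statistics import mean, stdev

COLUMNS = ["2p/0K", "2p/16K", "2p/64K", "8p/16K", "8p/64K", "16p/16K", "16p/64K"]

# Rows copied from the posted results: first five without the patch,
# last five with it.
WITHOUT_PATCH = [
    [1, 7, 90, 26, 107, 31, 222],
    [1, 7, 91, 26, 110, 29, 257],
    [1, 7, 90, 25, 112, 26, 233],
    [1, 7, 92, 25, 126, 27, 243],
    [1, 7, 147, 26, 150, 27, 245],
]
WITH_PATCH = [
    [0, 7, 92, 25, 136, 28, 270],
    [1, 8, 90, 27, 103, 27, 211],
    [0, 9, 90, 30, 112, 32, 228],
    [1, 7, 91, 25, 130, 29, 277],
    [1, 7, 90, 25, 105, 26, 211],
]


def column_stats(runs):
    """Return {column: (mean, sample standard deviation)} for a list of runs."""
    return {name: (mean(col), stdev(col)) for name, col in zip(COLUMNS, zip(*runs))}


without = column_stats(WITHOUT_PATCH)
with_patch = column_stats(WITH_PATCH)

for name in COLUMNS:
    (m0, s0), (m1, s1) = without[name], with_patch[name]
    print(f"{name:>8}: without {m0:6.1f} +/- {s0:5.1f}   with {m1:6.1f} +/- {s1:5.1f}")

# 2p/64K mean without the patch once the suspicious 147 us run is set aside.
trimmed_2p64k = mean([row[2] for row in WITHOUT_PATCH if row[2] != 147])
print(f"2p/64K without patch, excluding the 147 us run: {trimmed_2p64k:.2f}")
```

Setting the 147 microsecond run aside brings the 2p/64K column without the patch to within a fraction of a microsecond of the patched figure.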
Gabriel.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Patch for optimize context switch
2000-02-22 9:30 ` FASSINO Jean-Philippe
2000-02-22 11:23 ` FASSINO Jean-Philippe
@ 2000-02-22 11:40 ` Gabriel Paubert
1 sibling, 0 replies; 10+ messages in thread
From: Gabriel Paubert @ 2000-02-22 11:40 UTC (permalink / raw)
To: FASSINO Jean-Philippe; +Cc: Paul Mackerras, linuxppc-dev@lists.linuxppc.org
On Tue, 22 Feb 2000, FASSINO Jean-Philippe wrote:
> There are two advantages of this patch :
> - unrolling the loop (suppress the bdnz instructions),
The cost of bdnz is virtually zero (one slot in the completion queue).
> - statically designate segment register (suppress one add per loop).
The cost of the add is negligible.
> The main disadvantage is :
> - possibly increase i-cache misses (depend of function alignment)
Transforming a 4-instruction loop executed 12 times into straight-line code
needing 24 or so instructions, you add something like 2
cache lines to the footprint. Instruction issue in the loop is not a
problem on 603/G3/G4 (2 clocks) or 604 (1 or 2 clocks depending on
alignment).
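
The footprint estimate can be checked with a few lines of arithmetic. The sketch below counts instructions straight from the posted diff; the 32-byte cache line is an assumption about 603/604/G3-class parts rather than something stated in this thread, and the constant names are only illustrative.

```python
INSTRUCTION_BYTES = 4   # every PowerPC instruction is 4 bytes long
CACHE_LINE_BYTES = 32   # assumed L1 line size for 603/604/G3-class parts
ITERATIONS = 12         # TASK_SIZE / SEGMENT_SIZE

# Static instruction counts taken from the diff.
LOOP_SETUP = 3          # li r0,12; mtctr r0; li r9,0
LOOP_BODY = 4           # mtsrin; addi; addis; bdnz
UNROLLED = 11 + 12      # 11 addi plus 12 mtsr

static_loop = LOOP_SETUP + LOOP_BODY
extra_bytes = (UNROLLED - static_loop) * INSTRUCTION_BYTES
extra_cache_lines = extra_bytes / CACHE_LINE_BYTES

# Dynamic counts: instructions actually executed per context switch.
executed_loop = LOOP_SETUP + LOOP_BODY * ITERATIONS
executed_unrolled = UNROLLED
instructions_saved = executed_loop - executed_unrolled

print(f"static size: {static_loop} -> {UNROLLED} instructions "
      f"(+{extra_bytes} bytes, about {extra_cache_lines:g} cache lines)")
print(f"executed: {executed_loop} -> {executed_unrolled} instructions "
      f"({instructions_saved} fewer per context switch)")
```

Under these assumptions the unrolled code is 16 instructions longer, which is where the figure of about 2 extra cache lines comes from, while the executed instruction count drops by 28, in line with the "nearly 30" claimed for the patch.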
Instruction completion is often the problem and the limiting factor
actually on all processors except the 604 (the documentation clearly
states that the second completed instruction must be an integer or load,
so that the bdnz which writes back the ctr is bad since it takes an
additional clock in the completion queue):
If I interpret correctly the G3/G4 docs
- t=0, previous instruction completed, mtsrin starts, which takes 2 clocks,
- t=1, mtsrin + add complete
- t=2, second add complete
- t=3, bdnz complete
- t=4, previous instructions completed, mtsrin starts
that's 4 clocks per iteration, which is more than the 2 clocks we can get
by interleaving mtsr/add. The extra cost over 12 iterations is 24 clocks, which is
still cheaper than 2 cache line fetches IMHO. However, changing the loop
to:
	rlwinm	r3,r3,4,8,27	/* VSID = context << 4 */
	addis	r3,r3,0x6000	/* Set Ks, Ku bits */
	lis	r4,0xc000
	lis	r5,0xf000
	addi	r3,r3,12	/* Last segment to write */
3:	add.	r4,r4,r5	/* address of next segment */
	addi	r3,r3,-1	/* next VSID */
	mtsrin	r3,r4
	bne	3b
transforms the branch into a folded branch which saves one clock in the
completion unit:
- t=0: previous instructions complete, mtsrin starts, takes 2 clocks
- t=1: mtsrin and first add complete, branch has been folded
- t=2: addi complete, branch has been
- t=3: previous instruction completed, mtsrin starts
however this will only save 12 clocks from each context switch. I think
that there are other areas to focus on to improve performance.
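
For anyone who wants to convince themselves that the count-down loop still fills SR0..SR11 with the right values, here is a small Python model of it, followed by the clock arithmetic. This is a sketch, not kernel code: the function names are made up, the register semantics are modelled from the instruction comments, and the clocks-per-iteration figures are simply the estimates from the timelines above (4 for the bdnz loop, 3 for the folded-branch loop, 2 for interleaved mtsr/add). The 24-clock figure is read here as the bdnz loop's extra cost over the interleaved version, which is an interpretation.

```python
MASK32 = 0xFFFFFFFF
ITERATIONS = 12


def base_vsid(context):
    """Model rlwinm r3,r3,4,8,27 followed by addis r3,r3,0x6000."""
    rotated = ((context << 4) | (context >> 28)) & MASK32
    return ((rotated & 0x00FFFFF0) + 0x60000000) & MASK32


def segments_counting_down(context):
    """Model the add./addi/mtsrin/bne loop, which walks SR11 down to SR0."""
    sr = {}
    r4 = 0xC0000000                          # lis r4,0xc000
    r5 = 0xF0000000                          # lis r5,0xf000
    r3 = (base_vsid(context) + 12) & MASK32  # addi r3,r3,12
    while True:
        r4 = (r4 + r5) & MASK32              # add. r4,r4,r5 (also sets CR0)
        r3 = (r3 - 1) & MASK32               # addi r3,r3,-1
        sr[r4 >> 28] = r3                    # mtsrin r3,r4
        if r4 == 0:                          # bne 3b falls through at zero
            break
    return sr


# Clocks per iteration, taken from the estimates in this message.
CLOCKS_BDNZ_LOOP = 4
CLOCKS_FOLDED_LOOP = 3
CLOCKS_INTERLEAVED = 2

saved_by_folding = (CLOCKS_BDNZ_LOOP - CLOCKS_FOLDED_LOOP) * ITERATIONS
saved_by_unrolling = (CLOCKS_BDNZ_LOOP - CLOCKS_INTERLEAVED) * ITERATIONS

if __name__ == "__main__":
    sr = segments_counting_down(1)
    assert all(sr[n] == base_vsid(1) + n for n in range(ITERATIONS))
    print(f"folded-branch loop saves {saved_by_folding} clocks per switch")
    print(f"full unrolling saves {saved_by_unrolling} clocks per switch")
```

The model only checks which value lands in which segment register; the clock numbers are not measured, they are the per-iteration estimates multiplied out.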
Gabriel.
^ permalink raw reply [flat|nested] 10+ messages in thread