linuxppc-dev.lists.ozlabs.org archive mirror
* Patch for optimize context switch
@ 2000-02-21 10:49 FASSINO Jean-Philippe
  2000-02-21 23:12 ` Paul Mackerras
  0 siblings, 1 reply; 10+ messages in thread
From: FASSINO Jean-Philippe @ 2000-02-21 10:49 UTC (permalink / raw)
  To: linuxppc-dev@lists.linuxppc.org
  Cc: Benjamin Herrenschmidt, paulus@linuxcare.com


The aim of this patch is to optimize the context switch on PPC.
It improves pipeline usage and saves nearly 30 instructions per
context switch.
I'm using it on my computer and it works well; please test it!

The patch is against kernel 2.2.15pre7 and modifies _switch
and set_context in arch/ppc/kernel/head.S.
On 2.3.x kernels only set_context needs to be modified.

Jean-Philippe

--- head.S      Mon Feb 21 10:16:41 2000
+++ arch/ppc/kernel/head.S      Mon Feb 21 11:31:38 2000
@@ -2335,13 +2335,29 @@
        /* Set up segment registers for new task */
        rlwinm  r5,r5,4,8,27    /* VSID = context << 4 */
        addis   r5,r5,0x6000    /* Set Ks, Ku bits */
-       li      r0,12           /* TASK_SIZE / SEGMENT_SIZE */
-       mtctr   r0
-       li      r9,0
-3:     mtsrin  r5,r9
-       addi    r5,r5,1         /* next VSID */
-       addis   r9,r9,0x1000    /* address of next segment */
-       bdnz    3b
+       addi    r9,r5,1
+       mtsr    SR0,r5          /* update SR0 .. SR[TASK_SIZE/SEGMENT_SIZE-1] */
+       addi    r5,r9,1
+       mtsr    SR1,r9
+       addi    r9,r5,1
+       mtsr    SR2,r5
+       addi    r5,r9,1
+       mtsr    SR3,r9
+       addi    r9,r5,1
+       mtsr    SR4,r5
+       addi    r5,r9,1
+       mtsr    SR5,r9
+       addi    r9,r5,1
+       mtsr    SR6,r5
+       addi    r5,r9,1
+       mtsr    SR7,r9
+       addi    r9,r5,1
+       mtsr    SR8,r5
+       addi    r5,r9,1
+       mtsr    SR9,r9
+       addi    r9,r5,1
+       mtsr    SR10,r5
+       mtsr    SR11,r9
 #else
 /* On the MPC8xx, we place the physical address of the new task
  * page directory loaded into the MMU base register, and set the
@@ -2500,13 +2516,29 @@
 _GLOBAL(set_context)
        rlwinm  r3,r3,4,8,27    /* VSID = context << 4 */
        addis   r3,r3,0x6000    /* Set Ks, Ku bits */
-       li      r0,12           /* TASK_SIZE / SEGMENT_SIZE */
-       mtctr   r0
-       li      r4,0
-3:     mtsrin  r3,r4
-       addi    r3,r3,1         /* next VSID */
-       addis   r4,r4,0x1000    /* address of next segment */
-       bdnz    3b
+       addi    r4,r3,1
+       mtsr    SR0,r3          /* update SR0 .. SR[TASK_SIZE/SEGMENT_SIZE-1] */
+       addi    r3,r4,1
+       mtsr    SR1,r4
+       addi    r4,r3,1
+       mtsr    SR2,r3
+       addi    r3,r4,1
+       mtsr    SR3,r4
+       addi    r4,r3,1
+       mtsr    SR4,r3
+       addi    r3,r4,1
+       mtsr    SR5,r4
+       addi    r4,r3,1
+       mtsr    SR6,r3
+       addi    r3,r4,1
+       mtsr    SR7,r4
+       addi    r4,r3,1
+       mtsr    SR8,r3
+       addi    r3,r4,1
+       mtsr    SR9,r4
+       addi    r4,r3,1
+       mtsr    SR10,r3
+       mtsr    SR11,r4
        SYNC
        blr
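
As a cross-check of the patch above, the following is a small Python model (not kernel code) of the VSID setup and of both the original bdnz loop and the unrolled sequence. Register names follow the _switch hunk; the helper names (rlwinm, first_vsid, loop_version, unrolled_version) and the example context value 1 are assumptions introduced only for illustration.

```python
MASK32 = 0xFFFFFFFF


def rlwinm(rs, sh, mb, me):
    """Rotate a 32-bit word left by sh, then keep bits mb..me
    (bit 0 is the most significant bit; assumes mb <= me)."""
    rot = ((rs << sh) | (rs >> (32 - sh))) & MASK32
    mask = (MASK32 >> mb) & (MASK32 << (31 - me)) & MASK32
    return rot & mask


def first_vsid(context):
    r5 = rlwinm(context, 4, 8, 27)         # VSID = context << 4
    return (r5 + (0x6000 << 16)) & MASK32  # addis r5,r5,0x6000: key bits


def loop_version(r5):
    """Original code: 12 iterations of mtsrin/addi/addis/bdnz."""
    sr = {}
    r9 = 0                                  # li r9,0
    for _ in range(12):                     # li r0,12 ; mtctr r0 ... bdnz 3b
        sr[r9 >> 28] = r5                   # mtsrin r5,r9 (top 4 bits pick the SR)
        r5 = (r5 + 1) & MASK32              # addi r5,r5,1
        r9 = (r9 + 0x10000000) & MASK32     # addis r9,r9,0x1000
    return sr


def unrolled_version(r5):
    """Patched code: r5 and r9 alternate so each addi feeds a later mtsr."""
    sr = {}
    r9 = (r5 + 1) & MASK32                  # addi r9,r5,1
    sr[0] = r5                              # mtsr SR0,r5
    for n in range(1, 11, 2):
        r5 = (r9 + 1) & MASK32              # addi r5,r9,1
        sr[n] = r9                          # mtsr SRn,r9
        r9 = (r5 + 1) & MASK32              # addi r9,r5,1
        sr[n + 1] = r5                      # mtsr SRn+1,r5
    sr[11] = r9                             # mtsr SR11,r9
    return sr


# Executed instruction counts for the replaced region.
LOOP_EXECUTED = 3 + 12 * 4                  # 3 setup instructions + 4 per iteration
UNROLLED_EXECUTED = 11 + 12                 # 11 addi + 12 mtsr

vsid = first_vsid(1)
print(hex(vsid), loop_version(vsid) == unrolled_version(vsid))
print("instructions saved:", LOOP_EXECUTED - UNROLLED_EXECUTED)
```

Under this model both versions load the same twelve values, and the difference in executed instructions is consistent with the "nearly 30" figure quoted in the description.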



--
--------------------------------------------------------------------------
Jean-Philippe FASSINO  Tel :  04 76 76 45 52
CNET : DTL/ASR         mailto:jeanphilippe.fassino@cnet.francetelecom.fr
--------------------------------------------------------------------------


** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/


* Re: Patch for optimize context switch
  2000-02-21 10:49 Patch for optimize context switch FASSINO Jean-Philippe
@ 2000-02-21 23:12 ` Paul Mackerras
  2000-02-22  9:27   ` Gabriel Paubert
  2000-02-22  9:30   ` FASSINO Jean-Philippe
  0 siblings, 2 replies; 10+ messages in thread
From: Paul Mackerras @ 2000-02-21 23:12 UTC (permalink / raw)
  To: FASSINO Jean-Philippe, linuxppc-dev@lists.linuxppc.org


On Mon, 21 Feb 2000, FASSINO Jean-Philippe wrote:

> The aim of this patch is to optimize the context switch on PPC.
> It improves pipeline usage and saves nearly 30 instructions per
> context switch.
> I'm using it on my computer and it works well; please test it!

Interesting.  How much does it reduce the context switch time?  Did you
run lmbench or something to see if it makes it go faster?

The reason I ask is that it is possible that unrolling the loop as you
have done could actually make it go slower due to increased i-cache
misses.  The bdnz instruction on PPC has essentially zero overhead since it
is pulled out of the instruction stream in the fetch/decode unit by the
branch processing unit.  Also, it is very easy to predict whether a bdnz
will branch or not.

--
Paul Mackerras, Senior Open Source Researcher, Linuxcare, Inc.
+61 2 6262 8990 tel, +61 2 6262 8991 fax
paulus@linuxcare.com.au, http://www.linuxcare.com.au/
Linuxcare.  Support for the revolution.


* Re: Patch for optimize context switch
  2000-02-21 23:12 ` Paul Mackerras
@ 2000-02-22  9:27   ` Gabriel Paubert
  2000-02-22 10:50     ` Benjamin Herrenschmidt
  2000-02-22  9:30   ` FASSINO Jean-Philippe
  1 sibling, 1 reply; 10+ messages in thread
From: Gabriel Paubert @ 2000-02-22  9:27 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: FASSINO Jean-Philippe, linuxppc-dev@lists.linuxppc.org




On Tue, 22 Feb 2000, Paul Mackerras wrote:

>
> On Mon, 21 Feb 2000, FASSINO Jean-Philippe wrote:
>
> > The aim of this patch is to optimize the context switch on PPC.
> > It improves pipeline usage and saves nearly 30 instructions per
> > context switch.
> > I'm using it on my computer and it works well; please test it!
>
> Interesting.  How much does it reduce the context switch time?  Did you
> run lmbench or something to see if it makes it go faster?
>
> The reason I ask is that it is possible that unrolling the loop as you
> have done could actually make it go slower due to increased i-cache
> misses.  The bdnz instruction on PPC has essentially zero overhead since it
> is pulled out of the instruction stream in the fetch/decode unit by the
> branch processing unit.  Also, it is very easy to predict whether a bdnz
> will branch or not.

Exactly, context switches are infrequent and should be benchmarked at
least after having invalidated the code from the instruction cache
(whether it should also be pushed out of the L2 cache is more questionable
but I would also push it out since L2 cache is only direct mapped or 2 way
set associative on most processors). Besides the time of the loop is
dominated by the execution synchronized mtsrin. Actually the only
processor on which it might be a clear win is the 601.

	Gabriel



* Re: Patch for optimize context switch
  2000-02-21 23:12 ` Paul Mackerras
  2000-02-22  9:27   ` Gabriel Paubert
@ 2000-02-22  9:30   ` FASSINO Jean-Philippe
  2000-02-22 11:23     ` FASSINO Jean-Philippe
  2000-02-22 11:40     ` Gabriel Paubert
  1 sibling, 2 replies; 10+ messages in thread
From: FASSINO Jean-Philippe @ 2000-02-22  9:30 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev@lists.linuxppc.org


Paul Mackerras wrote:

> On Mon, 21 Feb 2000, FASSINO Jean-Philippe wrote:
>
> > The aim of this patch is to optimize the context switch on PPC.
> > It improves pipeline usage and saves nearly 30 instructions per
> > context switch.
> > I'm using it on my computer and it works well; please test it!
>
> Interesting.  How much does it reduce the context switch time?  Did you
> run lmbench or something to see if it makes it go faster?
>
> The reason I ask is that it is possible that unrolling the loop as you
> have done could actually make it go slower due to increased i-cache
> misses.  The bdnz instruction on PPC has essentially zero overhead since it
> is pulled out of the instruction stream in the fetch/decode unit by the
> branch processing unit.  Also, it is very easy to predict whether a bdnz
> will branch or not.

This patch has two advantages:
    - it unrolls the loop (removing the bdnz instructions),
    - it names each segment register statically (removing one add per
      iteration).
The main disadvantage is:
    - a possible increase in i-cache misses (depending on function
      alignment)

To conclude, I'm trying to run lmbench, and I will send the results when
I have them.

Jean-Philippe

--
--------------------------------------------------------------------------
Jean-Philippe FASSINO  Tel :  04 76 76 45 52
CNET : DTL/ASR         mailto:jeanphilippe.fassino@cnet.francetelecom.fr
--------------------------------------------------------------------------



* Re: Patch for optimize context switch
  2000-02-22  9:27   ` Gabriel Paubert
@ 2000-02-22 10:50     ` Benjamin Herrenschmidt
  2000-02-22 11:13       ` Gabriel Paubert
  0 siblings, 1 reply; 10+ messages in thread
From: Benjamin Herrenschmidt @ 2000-02-22 10:50 UTC (permalink / raw)
  To: Gabriel Paubert, paulus, linuxppc-dev


On Tue, Feb 22, 2000, Gabriel Paubert <paubert@iram.es> wrote:

>Exactly, context switches are infrequent and should be benchmarked at
>least after having invalidated the code from the instruction cache
>(whether it should also be pushed out of the L2 cache is more questionable
>but I would also push it out since L2 cache is only direct mapped or 2 way
>set associative on most processors). Besides the time of the loop is
>dominated by the execution synchronized mtsrin. Actually the only
>processor on which it might be a clear win is the 601.

BTW, there's an idea that has been idling in my mind for some time:

Do you think there would be any interest in adding code to some drivers
to invalidate the data cache for buffers before doing DMA-read I/Os to
them? (For example, invalidating the block buffers just before or just
after starting a DMA read in the IDE driver.)

That data will be replaced by new data, so invalidating it before
(or at the beginning of) the transfer will help avoid snoop hits
during the transfer itself, and possibly help keep more useful
things in the cache.

I'm not sure this would have any measurable impact. I may just set up
lmbench and try it out once I'm finished with my current work.



* Re: Patch for optimize context switch
  2000-02-22 10:50     ` Benjamin Herrenschmidt
@ 2000-02-22 11:13       ` Gabriel Paubert
  0 siblings, 0 replies; 10+ messages in thread
From: Gabriel Paubert @ 2000-02-22 11:13 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: paulus, linuxppc-dev




On Tue, 22 Feb 2000, Benjamin Herrenschmidt wrote:

> BTW, there's an idea that has been idling in my mind for some time:
>
> Do you think there would be any interest in adding code to some drivers
> to invalidate the data cache for buffers before doing DMA-read I/Os to
> them? (For example, invalidating the block buffers just before or just
> after starting a DMA read in the IDE driver.)

I'd rather let the hardware invalidate the caches. Most bridges now
gather stores from the PCI until they reach one full cache line of data,
which they burst with a "write-with-kill" transaction on the bus, which
means that the data is invalidated in the cache if necessary but the snoop
does not cause a data cast-out. In the end you would end up doing twice
the work and increase bus traffic, especially on processors designed for
SMP since the cache invalidation (even if done with dcbi which will
not push data) will also cause bus broadcasts.

	Gabriel.



* Re: Patch for optimize context switch
  2000-02-22  9:30   ` FASSINO Jean-Philippe
@ 2000-02-22 11:23     ` FASSINO Jean-Philippe
  2000-02-22 11:33       ` FASSINO Jean-Philippe
  2000-02-22 11:40     ` Gabriel Paubert
  1 sibling, 1 reply; 10+ messages in thread
From: FASSINO Jean-Philippe @ 2000-02-22 11:23 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: linuxppc-dev@lists.linuxppc.org, Benjamin Herrenschmidt,
	Gabriel Paubert

[-- Attachment #1: Type: text/plain, Size: 1362 bytes --]

FASSINO Jean-Philippe wrote:

> Paul Mackerras wrote:
>
> > Interesting.  How much does it reduce the context switch time?  Did you
> > run lmbench or something to see if it makes it go faster?

I have the lmbench results for context switching (initlevel 1).
I ran 5 benchmarks each with and without the patch on a PBG3/400.
The results are attached. The averages are:

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------
Host                 OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                        ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
--------- ------------- ----- ------ ------ ------ ------ ------- -------

With patch. ======
ppc-linux Linux 2.2.15p  0.6      8     91    26    117      28     239

Without patch ======
ppc-linux Linux 2.2.15p    1      7    102    26    121      28     240


What do you think about these results?
I think many more benchmark runs are needed before anything can be said.
Here, performance varies too much between two runs to draw a real conclusion.

Jean-Philippe

--
--------------------------------------------------------------------------
Jean-Philippe FASSINO  Tel :  04 76 76 45 52
CNET : DTL/ASR         mailto:jeanphilippe.fassino@cnet.francetelecom.fr
--------------------------------------------------------------------------



[-- Attachment #2: res.txt --]
[-- Type: text/plain, Size: 1065 bytes --]

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------
Host                 OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                        ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
--------- ------------- ----- ------ ------ ------ ------ ------- -------
ppc-linux Linux 2.2.15p    1      7     90    26    107      31     222
ppc-linux Linux 2.2.15p    1      7     91    26    110      29     257
ppc-linux Linux 2.2.15p    1      7     90    25    112      26     233
ppc-linux Linux 2.2.15p    1      7     92    25    126      27     243
ppc-linux Linux 2.2.15p    1      7    147    26    150      27     245
ppc-linux Linux 2.2.15p    0      7     92    25    136      28     270
ppc-linux Linux 2.2.15p    1      8     90    27    103      27     211
ppc-linux Linux 2.2.15p    0      9     90    30    112      32     228
ppc-linux Linux 2.2.15p    1      7     91    25    130      29     277
ppc-linux Linux 2.2.15p    1      7     90    25    105      26     211


* Re: Patch for optimize context switch
  2000-02-22 11:23     ` FASSINO Jean-Philippe
@ 2000-02-22 11:33       ` FASSINO Jean-Philippe
  2000-02-22 11:50         ` Gabriel Paubert
  0 siblings, 1 reply; 10+ messages in thread
From: FASSINO Jean-Philippe @ 2000-02-22 11:33 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: linuxppc-dev@lists.linuxppc.org, Benjamin Herrenschmidt,
	Gabriel Paubert


FASSINO Jean-Philippe wrote:

> FASSINO Jean-Philippe wrote:
>
>   ------------------------------------------------------------------------
> Context switching - times in microseconds - smaller is better
> -------------------------------------------------------------
> Host                 OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
>                         ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
> --------- ------------- ----- ------ ------ ------ ------ ------- -------
> ppc-linux Linux 2.2.15p    1      7     90    26    107      31     222
> ppc-linux Linux 2.2.15p    1      7     91    26    110      29     257
> ppc-linux Linux 2.2.15p    1      7     90    25    112      26     233
> ppc-linux Linux 2.2.15p    1      7     92    25    126      27     243
> ppc-linux Linux 2.2.15p    1      7    147    26    150      27     245

> ppc-linux Linux 2.2.15p    0      7     92    25    136      28     270
> ppc-linux Linux 2.2.15p    1      8     90    27    103      27     211
> ppc-linux Linux 2.2.15p    0      9     90    30    112      32     228
> ppc-linux Linux 2.2.15p    1      7     91    25    130      29     277
> ppc-linux Linux 2.2.15p    1      7     90    25    105      26     211

Sorry, I forgot to say:
    - the first 5 runs are without the patch
    - the last 5 runs are with the patch
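
The averages reported earlier in the thread can be reproduced from these ten rows. The sketch below assumes the split stated just above (first five rows unpatched, last five patched); the variable and function names are illustrative only.

```python
# Columns are the seven lmbench context-switch configurations,
# times in microseconds, copied from res.txt.
COLUMNS = ["2p/0K", "2p/16K", "2p/64K", "8p/16K", "8p/64K", "16p/16K", "16p/64K"]

WITHOUT_PATCH = [
    [1, 7, 90, 26, 107, 31, 222],
    [1, 7, 91, 26, 110, 29, 257],
    [1, 7, 90, 25, 112, 26, 233],
    [1, 7, 92, 25, 126, 27, 243],
    [1, 7, 147, 26, 150, 27, 245],
]

WITH_PATCH = [
    [0, 7, 92, 25, 136, 28, 270],
    [1, 8, 90, 27, 103, 27, 211],
    [0, 9, 90, 30, 112, 32, 228],
    [1, 7, 91, 25, 130, 29, 277],
    [1, 7, 90, 25, 105, 26, 211],
]


def column_means(rows):
    """Arithmetic mean of each column across all runs."""
    return [sum(col) / len(col) for col in zip(*rows)]


means_without = column_means(WITHOUT_PATCH)
means_with = column_means(WITH_PATCH)

# Print one line per configuration with the signed difference.
for name, a, b in zip(COLUMNS, means_without, means_with):
    print(f"{name:>8}  without={a:6.1f}  with={b:6.1f}  delta={b - a:+6.1f}")
```

Rounded to whole microseconds, these means match the summary table posted with the attachment, apart from the 2p/0K patched value, which that table gives unrounded as 0.6.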

Jean-Philippe

--
--------------------------------------------------------------------------
Jean-Philippe FASSINO  Tel :  04 76 76 45 52
CNET : DTL/ASR         mailto:jeanphilippe.fassino@cnet.francetelecom.fr
--------------------------------------------------------------------------



* Re: Patch for optimize context switch
  2000-02-22  9:30   ` FASSINO Jean-Philippe
  2000-02-22 11:23     ` FASSINO Jean-Philippe
@ 2000-02-22 11:40     ` Gabriel Paubert
  1 sibling, 0 replies; 10+ messages in thread
From: Gabriel Paubert @ 2000-02-22 11:40 UTC (permalink / raw)
  To: FASSINO Jean-Philippe; +Cc: Paul Mackerras, linuxppc-dev@lists.linuxppc.org




On Tue, 22 Feb 2000, FASSINO Jean-Philippe wrote:

> This patch has two advantages:
>     - it unrolls the loop (removing the bdnz instructions),

The cost of bdnz is virtually zero (one slot in the completion queue).

>     - it names each segment register statically (removing one add per
>       iteration).

The cost of the add is negligible.


> The main disadvantage is:
>     - a possible increase in i-cache misses (depending on function
>       alignment)

Transforming a 4-instruction loop executed 12 times into straight-line
code of 24 or so instructions adds something like 2 cache lines to the
footprint. Instruction issue in the loop is not a problem on
603/G3/G4 (2 clocks) or 604 (1 or 2 clocks depending on
alignment).
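
The footprint estimate can be checked with a few lines of arithmetic. This sketch counts the instructions in the replaced region of the patch and assumes 32-byte L1 instruction cache lines, which is an assumption about the 603/604/G3 class of processors and not something stated in this thread.

```python
import math

INSN_BYTES = 4          # every PowerPC instruction is 4 bytes long
CACHE_LINE_BYTES = 32   # assumption: 32-byte L1 cache lines

LOOP_INSNS = 7          # li, mtctr, li, mtsrin, addi, addis, bdnz
UNROLLED_INSNS = 23     # 11 addi + 12 mtsr in the patch

# Extra code bytes added by unrolling, and the cache lines they occupy
# when the code happens to start on a line boundary.
extra_bytes = (UNROLLED_INSNS - LOOP_INSNS) * INSN_BYTES
extra_lines = math.ceil(extra_bytes / CACHE_LINE_BYTES)

print(extra_bytes, "extra bytes ->", extra_lines, "extra cache lines")
```

With an unlucky alignment the unrolled code can straddle one more line than this, which is why the alignment of the function matters.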

Instruction completion is often the problem and the limiting factor
actually on all processors except the 604 (the documentation clearly
states that the second completed instruction must be an integer or load,
so that the bdnz which writes back the ctr is bad since it takes an
additional clock in the completion queue):

If I interpret the G3/G4 docs correctly:
- t=0, previous instruction completed, mtsrin starts, which takes 2 clocks,
- t=1, mtsrin + add complete
- t=2, second add complete
- t=3, bdnz complete
- t=4, previous instructions completed, mtsrin starts

That's 4 clocks per iteration, which is more than the 2 clocks we can get
by interleaving mtsr/add. The extra cost over 12 iterations is 24 clocks,
which is still cheaper than 2 cache line fetches IMHO. However, changing
the loop to:

        rlwinm  r3,r3,4,8,27    /* VSID = context << 4 */
        addis   r3,r3,0x6000    /* Set Ks, Ku bits */
        lis     r4,0xc000       /* start one segment past the last one */
        lis     r5,0xf000       /* step = -0x10000000 */
        addi    r3,r3,12        /* Last segment to write */
3:      add.    r4,r4,r5        /* address of next segment */
        addi    r3,r3,-1        /* next VSID */
        mtsrin  r3,r4
        bne     3b

transforms the branch into a folded branch which saves one clock in the
completion unit:

- t=0: previous instructions complete, mtsrin starts, takes 2 clocks
- t=1: mtsrin and first add complete, branch has been folded
- t=2: addi complete, branch has been
- t=3: previous instruction completed, mtsrin starts

however this will only save 12 clocks from each context switch. I think
that there are other areas to focus on to improve performance.
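
To make the countdown loop and the clock arithmetic concrete, here is a small Python model. It is only a sketch: the register names follow the assembly above, the function name countdown_loop is invented for illustration, and the per-iteration clock counts are the estimates given in this message, not measurements.

```python
MASK32 = 0xFFFFFFFF


def countdown_loop(first_vsid):
    """Model of the rewritten loop: walks from SR11 down to SR0 and
    exits when the segment address reaches zero."""
    sr = {}
    r4 = 0xC0000000                       # lis r4,0xc000
    r5 = 0xF0000000                       # lis r5,0xf000 (acts as -0x10000000)
    r3 = (first_vsid + 12) & MASK32       # addi r3,r3,12
    while True:
        r4 = (r4 + r5) & MASK32           # add. r4,r4,r5 (records whether r4 == 0)
        r3 = (r3 - 1) & MASK32            # addi r3,r3,-1
        sr[r4 >> 28] = r3                 # mtsrin r3,r4 (top 4 bits pick the SR)
        if r4 == 0:                       # bne 3b falls through on zero
            break
    return sr


ITERATIONS = 12
CLOCKS_BDNZ = 4        # estimate above for the original bdnz loop
CLOCKS_FOLDED = 3      # estimate above for the add./bne loop
CLOCKS_UNROLLED = 2    # estimate above for interleaved mtsr/add

saved_by_folding = ITERATIONS * (CLOCKS_BDNZ - CLOCKS_FOLDED)
extra_over_unrolled = ITERATIONS * (CLOCKS_BDNZ - CLOCKS_UNROLLED)

vsid = 0x60000010      # example value only
print(sorted(countdown_loop(vsid).items())[:3])
print("clocks saved by folding:", saved_by_folding)
print("extra clocks of bdnz loop over unrolled code:", extra_over_unrolled)
```

In this model the countdown loop fills SR0 to SR11 with the same values as the original loop, only in the opposite order.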

	Gabriel.



* Re: Patch for optimize context switch
  2000-02-22 11:33       ` FASSINO Jean-Philippe
@ 2000-02-22 11:50         ` Gabriel Paubert
  0 siblings, 0 replies; 10+ messages in thread
From: Gabriel Paubert @ 2000-02-22 11:50 UTC (permalink / raw)
  To: FASSINO Jean-Philippe
  Cc: Paul Mackerras, linuxppc-dev@lists.linuxppc.org,
	Benjamin Herrenschmidt




On Tue, 22 Feb 2000, FASSINO Jean-Philippe wrote:

> FASSINO Jean-Philippe wrote:
>
> > FASSINO Jean-Philippe wrote:
> >
> >   ------------------------------------------------------------------------
> > Context switching - times in microseconds - smaller is better
> > -------------------------------------------------------------
> > Host                 OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
> >                         ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
> > --------- ------------- ----- ------ ------ ------ ------ ------- -------
> > ppc-linux Linux 2.2.15p    1      7     90    26    107      31     222
> > ppc-linux Linux 2.2.15p    1      7     91    26    110      29     257
> > ppc-linux Linux 2.2.15p    1      7     90    25    112      26     233
> > ppc-linux Linux 2.2.15p    1      7     92    25    126      27     243
> > ppc-linux Linux 2.2.15p    1      7    147    26    150      27     245
>
> > ppc-linux Linux 2.2.15p    0      7     92    25    136      28     270
> > ppc-linux Linux 2.2.15p    1      8     90    27    103      27     211
> > ppc-linux Linux 2.2.15p    0      9     90    30    112      32     228
> > ppc-linux Linux 2.2.15p    1      7     91    25    130      29     277
> > ppc-linux Linux 2.2.15p    1      7     90    25    105      26     211
>
> Sorry, I forgot to say:
>     - the first 5 runs are without the patch
>     - the last 5 runs are with the patch

That's largely in the noise: results with processes which do not pollute
the cache are slightly better with the patch. But I don't consider this a
real-life situation. Processes which actually do some work don't see any
practical difference. The 147 microseconds in the 3rd column is very
probably a bogus point due to a collision in the L2 cache, which is only
2-way set associative.
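
One way to see how much the single 147 microsecond run drives the 2p/64K average is to compare the mean with the median and with the mean after dropping that run. This is only an illustrative sketch on the quoted numbers; treating 147 as an outlier is the assumption made in the paragraph above.

```python
import statistics

# 2p/64K column (3rd column), times in microseconds.
without_patch = [90, 91, 90, 92, 147]
with_patch = [92, 90, 90, 91, 90]

mean_without = statistics.mean(without_patch)
mean_with = statistics.mean(with_patch)

# The median is barely affected by one extreme run.
median_without = statistics.median(without_patch)
median_with = statistics.median(with_patch)

# Mean of the unpatched runs with the suspect 147 us run removed.
trimmed = [t for t in without_patch if t != 147]
trimmed_mean = statistics.mean(trimmed)

# Sample standard deviation of the unpatched runs, for scale.
spread_without = statistics.stdev(without_patch)

print("means:", mean_without, mean_with)
print("medians:", median_without, median_with)
print("unpatched mean without the 147 us run:", trimmed_mean)
print("unpatched sample stdev:", round(spread_without, 1))
```

With the suspect run removed, the two means differ by well under one microsecond, and the gap between the raw means is smaller than the standard deviation of the unpatched runs.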

	Gabriel.


