Fwd: Re: still no accelerated X ($#!$*)

linuxppc-dev.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed

* Fwd: Re: still no accelerated X ($#!$*)
@ 2000-01-20 18:12 ` Kevin Hendricks
  2000-01-20 18:26   ` David Edelsohn
  2000-01-20 18:46   ` Franz Sirl
  0 siblings, 2 replies; 29+ messages in thread
From: Kevin Hendricks @ 2000-01-20 18:12 UTC (permalink / raw)
  To: linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 2408 bytes --]

Hi,

Can anyone explain this to me?

> Finally I got it!

>   asm("stwbrx %0,%1,%2": : "r"(regdata), "r"(regindex), "r"(base_addr));

>   asm("lwbrx %0,%1,%2": "=r"(val):"r"(regindex), "r"(base_addr));

>   asm("stwbrx %0,%1,%2": : "r"(regdata), "b"(regindex), "r"(base_addr));

>   asm("lwbrx %0,%1,%2": "=r"(val):"b"(regindex), "r"(base_addr));


> Don't know if this is correct (no clue about ppc assembly), but it works...

Well I did the following with the attached sample program:

gcc -O0 -S testit.c

then I looked at testit.s  (the assembler).

old_regw:
        stwu 1,-32(1)
        stw 31,28(1)
        mr 31,1
        stw 3,8(31)
        mr 0,4
        lis 11,mach64MemReg@ha
        lwz 9,mach64MemReg@l(11)
        lwz 11,8(31)
        stwbrx 0,11,9
.L2:
        lwz 11,0(1)
        lwz 31,-4(11)
        mr 1,11
        blr
.Lfe2:
        .size    old_regw,.Lfe2-old_regw
        .align 2
        .type    old_regr,@function
old_regr:
        stwu 1,-32(1)
        stw 31,28(1)
        mr 31,1
        stw 3,8(31)
        lis 9,mach64MemReg@ha
        lwz 0,mach64MemReg@l(9)
        lwz 11,8(31)
        lwbrx 9,11,0
        mr 3,9
        b .L3
.L3:
        lwz 11,0(1)
        lwz 31,-4(11)
        mr 1,11
        blr
.Lfe3:
        .size    old_regr,.Lfe3-old_regr
        .align 2
:regw:
        stwu 1,-32(1)
        stw 31,28(1)
        mr 31,1
        stw 3,8(31)
        mr 0,4
        lis 11,mach64MemReg@ha
        lwz 9,mach64MemReg@l(11)
        lwz 11,8(31)
        stwbrx 0,11,9
.L4:
        lwz 11,0(1)
        lwz 31,-4(11)
        mr 1,11
        blr
.Lfe4:
        .size    regw,.Lfe4-regw
        .align 2
        .type    regr,@function
regr:
        stwu 1,-32(1)
        stw 31,28(1)
        mr 31,1
        stw 3,8(31)
        lis 9,mach64MemReg@ha
        lwz 0,mach64MemReg@l(9)
        lwz 11,8(31)
        lwbrx 9,11,0
        mr 3,9
        b .L5
.L5:
        lwz 11,0(1)
        lwz 31,-4(11)
        mr 1,11
        blr

And I simply can not see any difference in the actual code produced by each
bunch of asm statements which leads me to believe that there is something else
going on here.

I would love to know exactly what.

Will you please try compiling the code I attached to get the assembler out and
compare old_regr and regr and old_rew and regw and see if you find any
differences.


Are you sure you haven't changed *anything* else?

Thanks,

Kevin


[-- Attachment #2: testit.c --]
[-- Type: text/plain, Size: 1144 bytes --]

#if defined (__powerpc__)

unsigned int mach64MemReg = 0xDEADBEAF;

static inline void old_regw(volatile unsigned long regindex, unsigned long regdata)
{
  register unsigned long base_addr = (unsigned long)mach64MemReg;

  asm("stwbrx %0,%1,%2": : "r"(regdata), "r"(regindex), "r"(base_addr));
}

static inline unsigned long old_regr(volatile unsigned long regindex)
{
  register unsigned long base_addr = (unsigned long)mach64MemReg, val;

  asm("lwbrx %0,%1,%2": "=r"(val):"r"(regindex), "r"(base_addr));
  return(val);
}

static inline void regw(volatile unsigned long regindex, unsigned long regdata)
{
  register unsigned long base_addr = (unsigned long)mach64MemReg;

  asm("stwbrx %0,%1,%2": : "r"(regdata), "b"(regindex), "r"(base_addr));
}

static inline unsigned long regr(volatile unsigned long regindex)
{
  register unsigned long base_addr = (unsigned long)mach64MemReg, val;

  asm("lwbrx %0,%1,%2": "=r"(val):"b"(regindex), "r"(base_addr));
  return(val);
}
#endif

int main() {
 int offset=10;
 int data=12;
 int input, input2;
 
 input = regr(offset);
 input = old_regr(offset);

 regw(offset,data);
 old_regw(offset,data);

}

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Re: still no accelerated X ($#!$*)
  2000-01-20 18:12 ` Kevin Hendricks
@ 2000-01-20 18:26   ` David Edelsohn
  2000-01-20 18:45     ` Benjamin Herrenschmidt
  2000-01-20 18:52     ` Franz Sirl
  2000-01-20 18:46   ` Franz Sirl
  1 sibling, 2 replies; 29+ messages in thread
From: David Edelsohn @ 2000-01-20 18:26 UTC (permalink / raw)
  To: khendricks; +Cc: linuxppc-dev


	The "b" constraint should be associated with "base_addr", not with
"regindex":

	asm("stwbrx %0,%1,%2": : "r"(regdata), "r"(regindex), "b"(base_addr));

David

** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Re: still no accelerated X ($#!$*)
  2000-01-20 18:26   ` David Edelsohn
@ 2000-01-20 18:45     ` Benjamin Herrenschmidt
  2000-01-20 18:51       ` David Edelsohn
  2000-01-20 18:52     ` Franz Sirl
  1 sibling, 1 reply; 29+ messages in thread
From: Benjamin Herrenschmidt @ 2000-01-20 18:45 UTC (permalink / raw)
  To: David Edelsohn, linuxppc-dev

On Thu, Jan 20, 2000, David Edelsohn <dje@watson.ibm.com> wrote:

>	The "b" constraint should be associated with "base_addr", not with
>"regindex":
>
>	asm("stwbrx %0,%1,%2": : "r"(regdata), "r"(regindex), "b"(base_addr));

Hum... I still have to check what gcc/ppc specific constraints are. But
in this specific case, it's the index who should not be assigned to r0.
Both base and regdata can be r0. So either I'm mising something, or the
"b" constraint is actually wrong semanticall, or we need yet-another
constraint for the index.

** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Re: still no accelerated X ($#!$*)
  2000-01-20 18:12 ` Kevin Hendricks
  2000-01-20 18:26   ` David Edelsohn
@ 2000-01-20 18:46   ` Franz Sirl
  1 sibling, 0 replies; 29+ messages in thread
From: Franz Sirl @ 2000-01-20 18:46 UTC (permalink / raw)
  To: khendricks; +Cc: linuxppc-dev


At 19:12 20.01.00 , Kevin Hendricks wrote:
>Hi,
>
>Can anyone explain this to me?
>
> > Finally I got it!
>
> >   asm("stwbrx %0,%1,%2": : "r"(regdata), "r"(regindex), "r"(base_addr));
>
> >   asm("lwbrx %0,%1,%2": "=r"(val):"r"(regindex), "r"(base_addr));
>
> >   asm("stwbrx %0,%1,%2": : "r"(regdata), "b"(regindex), "r"(base_addr));
>
> >   asm("lwbrx %0,%1,%2": "=r"(val):"b"(regindex), "r"(base_addr));
>
>
> > Don't know if this is correct (no clue about ppc assembly), but it works...
>
>Well I did the following with the attached sample program:
>
>gcc -O0 -S testit.c
>
>then I looked at testit.s  (the assembler).
>
>old_regw:
>         stwu 1,-32(1)
>         stw 31,28(1)
>         mr 31,1
>         stw 3,8(31)
>         mr 0,4
>         lis 11,mach64MemReg@ha
>         lwz 9,mach64MemReg@l(11)
>         lwz 11,8(31)
>         stwbrx 0,11,9
>.L2:
>         lwz 11,0(1)
>         lwz 31,-4(11)
>         mr 1,11
>         blr
>.Lfe2:
>         .size    old_regw,.Lfe2-old_regw
>         .align 2
>         .type    old_regr,@function
>old_regr:
>         stwu 1,-32(1)
>         stw 31,28(1)
>         mr 31,1
>         stw 3,8(31)
>         lis 9,mach64MemReg@ha
>         lwz 0,mach64MemReg@l(9)
>         lwz 11,8(31)
>         lwbrx 9,11,0
>         mr 3,9
>         b .L3
>.L3:
>         lwz 11,0(1)
>         lwz 31,-4(11)
>         mr 1,11
>         blr
>.Lfe3:
>         .size    old_regr,.Lfe3-old_regr
>         .align 2
>:regw:
>         stwu 1,-32(1)
>         stw 31,28(1)
>         mr 31,1
>         stw 3,8(31)
>         mr 0,4
>         lis 11,mach64MemReg@ha
>         lwz 9,mach64MemReg@l(11)
>         lwz 11,8(31)
>         stwbrx 0,11,9
>.L4:
>         lwz 11,0(1)
>         lwz 31,-4(11)
>         mr 1,11
>         blr
>.Lfe4:
>         .size    regw,.Lfe4-regw
>         .align 2
>         .type    regr,@function
>regr:
>         stwu 1,-32(1)
>         stw 31,28(1)
>         mr 31,1
>         stw 3,8(31)
>         lis 9,mach64MemReg@ha
>         lwz 0,mach64MemReg@l(9)
>         lwz 11,8(31)
>         lwbrx 9,11,0
>         mr 3,9
>         b .L5
>.L5:
>         lwz 11,0(1)
>         lwz 31,-4(11)
>         mr 1,11
>         blr
>
>And I simply can not see any difference in the actual code produced by each
>bunch of asm statements which leads me to believe that there is something else
>going on here.
>
>I would love to know exactly what.
>
>Will you please try compiling the code I attached to get the assembler out and
>compare old_regr and regr and old_rew and regw and see if you find any
>differences.

Kevin,
the fix is correct, you cannot use "r" (allow r0-r31) as a base register
constraint, you have to use "b" (allow r1-r31). This is not easy to
reproduce with a small testprogram, because it will only fail if r0 is
assigned for the inline assembly by the compiler, which depends on a lot of
factors. In practice the outdated egcs-1.1.2 seems to have a lower
probability to trigger this bug than the current gcc-2.95.2, probably due
to the better optimizers in gcc-2.95.2 producing higher register pressure.

Additionally with inline assembly you should always be as explicit as
possible, so for stuff possible relying on ordering, you should add
'volatile', and a "memory" clobber for writes:

asm volatile ("stwbrx %0,%1,%2" : :"r"(regdata), "b"(regindex),
"r"(base_addr) : "memory");
asm volatile ("lwbrx %0,%1,%2" : "=r"(val) : "b"(regindex), "r"(base_addr));

Franz.


** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Re: still no accelerated X ($#!$*)
  2000-01-20 18:45     ` Benjamin Herrenschmidt
@ 2000-01-20 18:51       ` David Edelsohn
  0 siblings, 0 replies; 29+ messages in thread
From: David Edelsohn @ 2000-01-20 18:51 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Kevin Hendricks
  Cc: Moritz Thomas, linuxppc-dev, linuxppc-user


> Hum... I still have to check what gcc/ppc specific constraints are. But
> in this specific case, it's the index who should not be assigned to r0.
> Both base and regdata can be r0. So either I'm mising something, or the
> "b" constraint is actually wrong semanticall, or we need yet-another
> constraint for the index.

	Sorry, you are right.  I am multitasking on too many things.  The
way that you have written the inlined assembly, regindex should be the one
with the "b" constraint.

	Also, Geert's comment was correct that there is no difference in
the example function simply because of luck.

Sorry, David
===============================================================================
David Edelsohn                                      T.J. Watson Research Center
dje@watson.ibm.com                                  P.O. Box 218
+1 914 945 4364 (TL 862)                            Yorktown Heights, NY 10598

** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Re: still no accelerated X ($#!$*)
  2000-01-20 18:26   ` David Edelsohn
  2000-01-20 18:45     ` Benjamin Herrenschmidt
@ 2000-01-20 18:52     ` Franz Sirl
  2000-01-20 19:31       ` Gabriel Paubert
  1 sibling, 1 reply; 29+ messages in thread
From: Franz Sirl @ 2000-01-20 18:52 UTC (permalink / raw)
  To: David Edelsohn; +Cc: khendricks, linuxppc-dev


At 19:26 20.01.00 , David Edelsohn wrote:

>         The "b" constraint should be associated with "base_addr", not with
>"regindex":
>
>         asm("stwbrx %0,%1,%2": : "r"(regdata), "r"(regindex),
> "b"(base_addr));

Uhm, David, that's wrong. "b" has to be assigned to %1!

Franz.


** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Re: still no accelerated X ($#!$*)
  2000-01-20 18:52     ` Franz Sirl
@ 2000-01-20 19:31       ` Gabriel Paubert
  2000-01-20 19:36         ` Kevin Hendricks
  0 siblings, 1 reply; 29+ messages in thread
From: Gabriel Paubert @ 2000-01-20 19:31 UTC (permalink / raw)
  To: Franz Sirl; +Cc: David Edelsohn, khendricks, linuxppc-dev

On Thu, 20 Jan 2000, Franz Sirl wrote:

>
> At 19:26 20.01.00 , David Edelsohn wrote:
>
> >         The "b" constraint should be associated with "base_addr", not with
> >"regindex":
> >
> >         asm("stwbrx %0,%1,%2": : "r"(regdata), "r"(regindex),
> > "b"(base_addr));
>
> Uhm, David, that's wrong. "b" has to be assigned to %1!

Actually if base_addr can be reused by the compiler for other accesses
to the same area (byte or big endian), it should be written as:

"stwbrx %0,%1,%2": : "r" (regdata), "b" (base_addr), "r" (regindex)

with a volatile qualifier on the asm statement but I disagree on the
"memory" clobber if this does not access areas the compiler will ever
touch and does not have side effect.

There are already too many memory clobbers out there, they are bad because
they basically tell the compiler that it can not keep a single variable
in a register. Have a look at the effect of in and out instructions
in the linux kernel which reload isa_io_base repeatedly when there is no
need for it, that's not exactly a speed issue but a icache footprint
and bloat issue.

This is an effect which is important on all machines which have a
resonable number of registers (all modern ones in practice and perhaps
even the 68k).

	Gabriel.

** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Re: still no accelerated X ($#!$*)
  2000-01-20 19:31       ` Gabriel Paubert
@ 2000-01-20 19:36         ` Kevin Hendricks
  2000-01-20 19:51           ` Geert Uytterhoeven
  2000-01-20 19:59           ` Gabriel Paubert
  0 siblings, 2 replies; 29+ messages in thread
From: Kevin Hendricks @ 2000-01-20 19:36 UTC (permalink / raw)
  To: Gabriel Paubert, Franz Sirl; +Cc: David Edelsohn, khendricks, linuxppc-dev


Hi,

> Actually if base_addr can be reused by the compiler for other accesses
> to the same area (byte or big endian), it should be written as:
>
> "stwbrx %0,%1,%2": : "r" (regdata), "b" (base_addr), "r" (regindex)
>
> with a volatile qualifier on the asm statement but I disagree on the
> "memory" clobber if this does not access areas the compiler will ever
> touch and does not have side effect.
>
> There are already too many memory clobbers out there, they are bad because
> they basically tell the compiler that it can not keep a single variable
> in a register.

In this particular case, the base address can change (but very very rarely
such as writing to one Aperature or Another on the Rage 128 card) and all of
the writes are made to either the card memory mapped io or the frame buffer
itself.

Should I not include the : "memory" clobber in this case?  Will it hurt
performance much?

Thanks,

Kevin

** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Re: still no accelerated X ($#!$*)
  2000-01-20 19:36         ` Kevin Hendricks
@ 2000-01-20 19:51           ` Geert Uytterhoeven
  2000-01-20 19:59           ` Gabriel Paubert
  1 sibling, 0 replies; 29+ messages in thread
From: Geert Uytterhoeven @ 2000-01-20 19:51 UTC (permalink / raw)
  To: Kevin Hendricks; +Cc: Gabriel Paubert, Franz Sirl, David Edelsohn, linuxppc-dev


On Thu, 20 Jan 2000, Kevin Hendricks wrote:
> > Actually if base_addr can be reused by the compiler for other accesses
> > to the same area (byte or big endian), it should be written as:
> >
> > "stwbrx %0,%1,%2": : "r" (regdata), "b" (base_addr), "r" (regindex)
> >
> > with a volatile qualifier on the asm statement but I disagree on the
> > "memory" clobber if this does not access areas the compiler will ever
> > touch and does not have side effect.
> >
> > There are already too many memory clobbers out there, they are bad because
> > they basically tell the compiler that it can not keep a single variable
> > in a register.
>
> In this particular case, the base address can change (but very very rarely
> such as writing to one Aperature or Another on the Rage 128 card) and all of
> the writes are made to either the card memory mapped io or the frame buffer
> itself.

The base_addr may change in between calls, but not while executing the asm
code. The memory clobber means that memory contents may change while
executing the asm code without gcc knowing about it.

Gr{oetje,eeting}s,
--
Geert Uytterhoeven -- Linux/{m68k~Amiga,PPC~CHRP} -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds


** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Re: still no accelerated X ($#!$*)
  2000-01-20 19:36         ` Kevin Hendricks
  2000-01-20 19:51           ` Geert Uytterhoeven
@ 2000-01-20 19:59           ` Gabriel Paubert
  2000-01-20 20:08             ` David Edelsohn
  2000-01-20 22:34             ` Franz Sirl
  1 sibling, 2 replies; 29+ messages in thread
From: Gabriel Paubert @ 2000-01-20 19:59 UTC (permalink / raw)
  To: Kevin Hendricks; +Cc: Franz Sirl, David Edelsohn, linuxppc-dev

	Hi,

> In this particular case, the base address can change (but very very rarely
> such as writing to one Aperature or Another on the Rage 128 card) and all of
> the writes are made to either the card memory mapped io or the frame buffer
> itself.

Then the memory clobber would force the compiler to reload base_addr
between 2 writes to the frame buffer.

> Should I not include the : "memory" clobber in this case?  Will it hurt
> performance much?

I think that it is not necessary: the best thing with a compiler which
performs alias analysis might be to tell the truth

asm ("stwbrx %1,%2,%3"
     : "=m" (*(volatile unsigned *)(base_addr+regindex))
     : "r" (regdata), "b" (base_addr), "r" (regindex));

Note we don't use %0, and it won't produce any aditional code. You may
want to check what the compiler would have generated as addressing mode
by appending " # %0" at the end of the code string.

I truly wish there were a constraint for indexed addressing modes (anyway
this will be necessary for proper Altivec support).

	Gabriel.

** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Re: still no accelerated X ($#!$*)
  2000-01-20 19:59           ` Gabriel Paubert
@ 2000-01-20 20:08             ` David Edelsohn
  2000-01-20 22:34             ` Franz Sirl
  1 sibling, 0 replies; 29+ messages in thread
From: David Edelsohn @ 2000-01-20 20:08 UTC (permalink / raw)
  To: Gabriel Paubert; +Cc: Kevin Hendricks, linuxppc-dev


>>>>> Gabriel Paubert writes:

Gabriel> I think that it is not necessary: the best thing with a compiler which
Gabriel> performs alias analysis might be to tell the truth

Gabriel> asm ("stwbrx %1,%2,%3"
Gabriel> : "=m" (*(volatile unsigned *)(base_addr+regindex))
Gabriel> : "r" (regdata), "b" (base_addr), "r" (regindex));

Gabriel> Note we don't use %0, and it won't produce any aditional code. You may
Gabriel> want to check what the compiler would have generated as addressing mode
Gabriel> by appending " # %0" at the end of the code string.

	Yes, this type of inlined assembly describing the actual memory
operation as an output constraint is better than clobbering all of memory,
if possible.

David

** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Re: still no accelerated X ($#!$*)
  2000-01-20 19:59           ` Gabriel Paubert
  2000-01-20 20:08             ` David Edelsohn
@ 2000-01-20 22:34             ` Franz Sirl
  2000-01-21  0:05               ` Gabriel Paubert
  1 sibling, 1 reply; 29+ messages in thread
From: Franz Sirl @ 2000-01-20 22:34 UTC (permalink / raw)
  To: Gabriel Paubert, Kevin Hendricks; +Cc: David Edelsohn, linuxppc-dev


Am Don, 20 Jan 2000 schrieb Gabriel Paubert:
>Hi,
>
>> In this particular case, the base address can change (but very very rarely
>> such as writing to one Aperature or Another on the Rage 128 card) and all of
>> the writes are made to either the card memory mapped io or the frame buffer
>> itself.
>
>Then the memory clobber would force the compiler to reload base_addr
>between 2 writes to the frame buffer.
>
>> Should I not include the : "memory" clobber in this case?  Will it hurt
>> performance much?
>
>I think that it is not necessary: the best thing with a compiler which
>performs alias analysis might be to tell the truth
>
>asm ("stwbrx %1,%2,%3"
>     : "=m" (*(volatile unsigned *)(base_addr+regindex))
>     : "r" (regdata), "b" (base_addr), "r" (regindex));
>
>Note we don't use %0, and it won't produce any aditional code. You may
>want to check what the compiler would have generated as addressing mode
>by appending " # %0" at the end of the code string.

It depends a little bit on the usage of the asm's if the memory (either
global or local) clobber is needed or not. If you use them for read/writes to HW
registers needing ordering (which is very likely here since we talk about
graphics HW), the compiler can only decide on the memory usage defined by the
clobbers/memory inputs on how to order the inlines (volatile has no effect on
this).

Actually the load instructions need a memory input too:

asm volatile ("lwbrx %0,%1,%2" : "=r"(val) : "b"(regindex), "r"(base_addr),
"m" (*(volatile unsigned *)(base_addr+regindex)));

And to insure ordering on processor level you still need the eieio (with a
memory clobber) as usual.

Franz.

** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Re: still no accelerated X ($#!$*)
  2000-01-20 22:34             ` Franz Sirl
@ 2000-01-21  0:05               ` Gabriel Paubert
  2000-01-21  0:35                 ` Kevin Hendricks
  2000-01-21 15:47                 ` Franz Sirl
  0 siblings, 2 replies; 29+ messages in thread
From: Gabriel Paubert @ 2000-01-21  0:05 UTC (permalink / raw)
  To: Franz Sirl; +Cc: Kevin Hendricks, David Edelsohn, linuxppc-dev

On Thu, 20 Jan 2000, Franz Sirl wrote:

> It depends a little bit on the usage of the asm's if the memory (either
> global or local) clobber is needed or not. If you use them for read/writes to HW
> registers needing ordering (which is very likely here since we talk about
> graphics HW), the compiler can only decide on the memory usage defined by the
> clobbers/memory inputs on how to order the inlines (volatile has no effect on
> this).

Do you mean that a volatile memory reference can be reordered wit an asm
volatile statement ? I thought theat this was not possible.

>From GCC's documentation:

   "You can prevent an `asm' instruction from being deleted, moved
significantly, or combined, by writing the keyword `volatile' after the
`asm'."

You might disagree, but I consider that moving across a volatile memory
reference is a _significant_ move, a very significant one.

> Actually the load instructions need a memory input too:
>
> asm volatile ("lwbrx %0,%1,%2" : "=r"(val) : "b"(regindex), "r"(base_addr),
> "m" (*(volatile unsigned *)(base_addr+regindex)));

Indeed, perhaps even more than the store since asm without output operands
are actually assumed to have side effects.

>
> And to insure ordering on processor level you still need the eieio (with a
> memory clobber) as usual.

The best thing to do is probably to put the eieio inside the asm
statement, perhaps with 2 macros, one that includes the eieio and one that
does not include it.

Memory clobbers should be used extremely sparingly, I view them as the
last resort when _nothing_ else would work. They are used far too often
for my taste in the kernel, probably because they don't affect
significantly machines with very few registers like Intel which have to
permanently spill registers to memory or stack.

	Gabriel.

** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Re: still no accelerated X ($#!$*)
  2000-01-21  0:05               ` Gabriel Paubert
@ 2000-01-21  0:35                 ` Kevin Hendricks
  2000-01-21  1:53                   ` Gabriel Paubert
  2000-01-21 11:54                   ` Benjamin Herrenschmidt
  2000-01-21 15:47                 ` Franz Sirl
  1 sibling, 2 replies; 29+ messages in thread
From: Kevin Hendricks @ 2000-01-21  0:35 UTC (permalink / raw)
  To: Gabriel Paubert, Franz Sirl; +Cc: David Edelsohn, linuxppc-dev

Hi,

I don't want to be dense here but why two macros (one with eieio and one without
eieio).  Can two processors in an SMP setting be trying to drive the same
video card at the same time?   I didn't think that was possible.  I saw the
eieio usage in the kernel versions and in aty128fb.c but thought that multiple
processors might use those macros at the same time and multiple processors
might have different fbdev drivers running in multi-head applications.  But I
thought only one processor could drive a video hardware card.  Is this a bad
assumption?

So exactly what is the best way to write these macros for Xpmac?  Using
the output constraints approach with eieio following it or is all of this
overkill.

>From the various posts (given the operand ordering done in the original post),
here is what I have tried to piece together.

asm volatile ("stwbrx %1,%2,%3; eieio" : "=m" (*(volatile unsigned
*)(base_addr+regindex))       : "r" (regdata), "b" (regindex), "r" (base_addr));

asm volatile ("lwbrx %0,%1,%2; eieio" : "=r"(val) : "b"(regindex),
"r"(base_addr), "m" (*(volatile unsigned *)(base_addr+regindex)));

Please let me know how to change the above so that I get it right this time.

Thanks,

Kevin

** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Re: still no accelerated X ($#!$*)
  2000-01-21  0:35                 ` Kevin Hendricks
@ 2000-01-21  1:53                   ` Gabriel Paubert
  2000-01-21  2:19                     ` Kevin Hendricks
  2000-01-21 11:54                   ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 29+ messages in thread
From: Gabriel Paubert @ 2000-01-21  1:53 UTC (permalink / raw)
  To: Kevin Hendricks; +Cc: Franz Sirl, David Edelsohn, linuxppc-dev

	Hi,

> I don't want to be dense here but why two macros (one with eieio and one without
> eieio).  Can two processors in an SMP setting be trying to drive the same
> video card at the same time?   I didn't think that was possible.  I saw the
> eieio usage in the kernel versions and in aty128fb.c but thought that multiple
> processors might use those macros at the same time and multiple processors
> might have different fbdev drivers running in multi-head applications.  But I
> thought only one processor could drive a video hardware card.  Is this a bad
> assumption?

Multiple processors attempting to access the video hardware should be
serialized one way or anoter (semaphores or whatever).

> So exactly what is the best way to write these macros for Xpmac?  Using
> the output constraints approach with eieio following it or is all of this
> overkill.

eieio may be useful and even necessary when accessing the registers, but
in the frame buffer section it prevents the processor from performing
store merging to non guarded memory and other optimizations (like the
bridge to merge the writes and transforming them into bursts to the
framebuffer).

eieio is also necessary to avoid load from being moved before stores when
accessing IO registers: if your intent is to write to register A and then
read register B and that the first write actually changes the contents of
register B (for example your trigger an operation with the write and wait
for it to finish by reading register B which contains a busy bit), then
the eieio is necessary (it is not necessary if the registers are the
same).

> >From the various posts (given the operand ordering done in the original post),
> here is what I have tried to piece together.
>
> asm volatile ("stwbrx %1,%2,%3; eieio" : "=m" (*(volatile unsigned
> *)(base_addr+regindex))       : "r" (regdata), "b" (regindex), "r" (base_addr));
>
> asm volatile ("lwbrx %0,%1,%2; eieio" : "=r"(val) : "b"(regindex),
> "r"(base_addr), "m" (*(volatile unsigned *)(base_addr+regindex)));

I actually doubt that the eieio are necessary but then I'm not a
specialist on this kind of hardware. Every eieio is a bus broadcast
operation (except on 603, on G3 it is IIRC an option controlled by a bit
in HID0) and actually has a cost comparable to a write posted I/O access
but the other consequences (preventing bursts on the I/O bus) may actually
cause a significant performance hit.  So it should be used only when
necessary...

> Please let me know how to change the above so that I get it right this time.

Try to determine first whether the eieio are necessary; for access to the
frame buffer I'm almost sure that they are superfluous and potentially
very costly in terms of performance. For the MMIO I suspect that they may
be necessary at some places, but adding them systematically will have less
impact.

	Gabriel.

** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Re: still no accelerated X ($#!$*)
  2000-01-21  1:53                   ` Gabriel Paubert
@ 2000-01-21  2:19                     ` Kevin Hendricks
  2000-01-21  7:58                       ` Geert Uytterhoeven
  2000-01-21 14:15                       ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 29+ messages in thread
From: Kevin Hendricks @ 2000-01-21  2:19 UTC (permalink / raw)
  To: Gabriel Paubert; +Cc: Franz Sirl, David Edelsohn, linuxppc-dev

Hi,

> I actually doubt that the eieio are necessary but then I'm not a
> specialist on this kind of hardware. Every eieio is a bus broadcast
> operation (except on 603, on G3 it is IIRC an option controlled by a bit
> in HID0) and actually has a cost comparable to a write posted I/O access
> but the other consequences (preventing bursts on the I/O bus) may actually
> cause a significant performance hit.  So it should be used only when
> necessary...
>
> > Please let me know how to change the above so that I get it right this time.
>
> Try to determine first whether the eieio are necessary; for access to the
> frame buffer I'm almost sure that they are superfluous and potentially
> very costly in terms of performance. For the MMIO I suspect that they may
> be necessary at some places, but adding them systematically will have less
> impact.

Okay, I went and looked at the latest aty128fb.c code and it does not use eieio
anywhere.  I looked at ealier verions of this file and it at one time had eieio
but they have since been removed.

I also looked and the endian conversion routines do not use the output
contraint approach you took but do include the memory clobber on the writes.

I think I will go with the output constraint version given above without the
eieio until or unless the kernel driver begins to use them too.

Thanks for all of your help with this everyone.  I have learned alot.

Kevin

** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Re: still no accelerated X ($#!$*)
  2000-01-21  2:19                     ` Kevin Hendricks
@ 2000-01-21  7:58                       ` Geert Uytterhoeven
  2000-01-21 14:15                       ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 29+ messages in thread
From: Geert Uytterhoeven @ 2000-01-21  7:58 UTC (permalink / raw)
  To: Kevin Hendricks; +Cc: Gabriel Paubert, Franz Sirl, David Edelsohn, linuxppc-dev


On Thu, 20 Jan 2000, Kevin Hendricks wrote:
> > I actually doubt that the eieio are necessary but then I'm not a
> > specialist on this kind of hardware. Every eieio is a bus broadcast
> > operation (except on 603, on G3 it is IIRC an option controlled by a bit
> > in HID0) and actually has a cost comparable to a write posted I/O access
> > but the other consequences (preventing bursts on the I/O bus) may actually
> > cause a significant performance hit.  So it should be used only when
> > necessary...
> >
> > > Please let me know how to change the above so that I get it right this time.
> >
> > Try to determine first whether the eieio are necessary; for access to the
> > frame buffer I'm almost sure that they are superfluous and potentially
> > very costly in terms of performance. For the MMIO I suspect that they may
> > be necessary at some places, but adding them systematically will have less
> > impact.
>
> Okay, I went and looked at the latest aty128fb.c code and it does not use eieio
> anywhere.  I looked at ealier verions of this file and it at one time had eieio
> but they have since been removed.

Perhaps they were replaced by the platform independent wmb()?

Gr{oetje,eeting}s,
--
Geert Uytterhoeven -- Linux/{m68k~Amiga,PPC~CHRP} -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds


** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Re: still no accelerated X ($#!$*)
  2000-01-21  0:35                 ` Kevin Hendricks
  2000-01-21  1:53                   ` Gabriel Paubert
@ 2000-01-21 11:54                   ` Benjamin Herrenschmidt
  2000-01-21 13:34                     ` Gabriel Paubert
  1 sibling, 1 reply; 29+ messages in thread
From: Benjamin Herrenschmidt @ 2000-01-21 11:54 UTC (permalink / raw)
  To: khendricks, linuxppc-dev

On Thu, Jan 20, 2000, Kevin Hendricks <khendricks@ivey.uwo.ca> wrote:

>From the various posts (given the operand ordering done in the original
post),
>here is what I have tried to piece together.
>
>asm volatile ("stwbrx %1,%2,%3; eieio" : "=m" (*(volatile unsigned
>*)(base_addr+regindex))       : "r" (regdata), "b" (regindex), "r"
>(base_addr));
>
>asm volatile ("lwbrx %0,%1,%2; eieio" : "=r"(val) : "b"(regindex),
>"r"(base_addr), "m" (*(volatile unsigned *)(base_addr+regindex)));

Hi Kevin !

A good rule is to use eieio() when accessing a register (that means doing
an access that actually performs an action and whose ordering is
important relative to other accesses of the same type) and not use it
when filling the framebuffer. There are usually few enough register
accesses for this to work. it may be optimal to skip eieio's when writing
to a bunch "parameters" registers where ordering is not important, in
this case you just need to put an eieio() between those, and the register
write that triggers the engine operation that will use those parameters.
However, when doing that, the PCI bridge is allowed to combine your
register writes in a burst, and I know some cards who don't handle burst
access to MMIO registers very well.

Basically, eieio() will make sure that all previous memory accesses will
have been finished before memory accesses after the eieio are done.

It may be important to make sure that the last bit of framebuffer has
been written before "starting" an engine operation. So one eieio between
frame buffer filling and engine register access may be useful in the case
where you use eieio after the write in the asm.

I personally tend to prefer doing the eieio _before_ the read/write in
the asm code, but there are some rare cases where you mix eieio and
non-eieio accesses (like with your framebuffer) where special care must
be taken and may require both eieio before and after the register access.

Another thing to take care of is PCI write posting: Basically, when you
write, let's say, a MMIO register, you are not guaranteed that this write
have actually been done unless you do a read from the same io space. For
example: If you write an interrupt mask register to disable an interrupt
followed by critical code in which this interrupt _must not_ happen, you
need absolutely to do a read (typically to re-read the mask you just
wrote to) after the write, and before the critical code.

** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Re: still no accelerated X ($#!$*)
  2000-01-21 11:54                   ` Benjamin Herrenschmidt
@ 2000-01-21 13:34                     ` Gabriel Paubert
  2000-01-21 14:06                       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 29+ messages in thread
From: Gabriel Paubert @ 2000-01-21 13:34 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: khendricks, linuxppc-dev




On Fri, 21 Jan 2000, Benjamin Herrenschmidt wrote:

> Hi Kevin !
>
> A good rule is to use eieio() when accessing a register (that means doing
> an access that actually performs an action and whose ordering is
> important relative to other accesses of the same type) and not use it
> when filling the framebuffer. There are usually few enough register
> accesses for this to work. it may be optimal to skip eieio's when writing
> to a bunch "parameters" registers where ordering is not important, in
> this case you just need to put an eieio() between those, and the register
> write that triggers the engine operation that will use those parameters.

Indeed, that's what I do in my VME driver. I did not like the fact that
readl/writel always insert an eieio. This is safe but almost doubles the
number of bus cycles. Very often it translates into:

- a bunch or register accesses,
- eieio,
- the access which triggers the operation (which was a read in some cases
  for the VME driver).
- eieio again, so you don't have problems with setting up the next
operation...

> However, when doing that, the PCI bridge is allowed to combine your
> register writes in a burst, and I know some cards who don't handle burst
> access to MMIO registers very well.

Broken HW exists but in a device specific driver, you should know whether
it is necessary or not from the errata...

> Basically, eieio() will make sure that all previous memory accesses will
> have been finished before memory accesses after the eieio are done.
>
> It may be important to make sure that the last bit of framebuffer has
> been written before "starting" an engine operation. So one eieio between
> frame buffer filling and engine register access may be useful in the case
> where you use eieio after the write in the asm.

If there is one before the register access that triggers the engine
operation, it should be enough.

>
> I personally tend to prefer doing the eieio _before_ the read/write in
> the asm code, but there are some rare cases where you mix eieio and
> non-eieio accesses (like with your framebuffer) where special care must
> be taken and may require both eieio before and after the register access.

That's a good quetion, does anybody have any final conclusion about
whether it is better to ensure ordering before or after the access ?

> Another thing to take care of is PCI write posting: Basically, when you
> write, let's say, a MMIO register, you are not guaranteed that this write
> have actually been done unless you do a read from the same io space. For
> example: If you write an interrupt mask register to disable an interrupt
> followed by critical code in which this interrupt _must not_ happen, you
> need absolutely to do a read (typically to re-read the mask you just
> wrote to) after the write, and before the critical code.

In this case, I think that you need the following:

- write,
- eieio,
- read,
- isync to make sure that the read has reached the registers and is not in
a load pending queue or whatever which can be quite deep especially if the
processor does never need the result of the read...

	Gabriel.


** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Re: still no accelerated X ($#!$*)
@ 2000-01-21 13:53 Kevin_Hendricks
  0 siblings, 0 replies; 29+ messages in thread
From: Kevin_Hendricks @ 2000-01-21 13:53 UTC (permalink / raw)
  To: bh40, paubert; +Cc: khendricks, linuxppc-dev

Hi,

>> Another thing to take care of is PCI write posting: Basically, when you
>> write, let's say, a MMIO register, you are not guaranteed that this write
>> have actually been done unless you do a read from the same io space. For
>> example: If you write an interrupt mask register to disable an interrupt
>> followed by critical code in which this interrupt _must not_ happen, you
>> need absolutely to do a read (typically to re-read the mask you just
>> wrote to) after the write, and before the critical code.
>
>In this case, I think that you need the following:
>
>- write,
>- eieio,
>- read,
>- isync to make sure that the read has reached the registers and is not in
>a load pending queue or whatever which can be quite deep especially if the
>processor does never need the result of the read...

Okay, please let me be specific here to make sure I understand.

In fixing the xf3.9.17 code in the r128 module, there is a routine that

1. reads from an MMIO register to get the current value of the hardware cursor
enable bit

2. writes to that same MMIO register to turn off the hardware cursor

3. copies an new image for the cursor into framebuffer memory

4. reloads the MMIO register to put back the original value for the hardware
cursor enable bit.

There are no eieios or isyncs anywhere in the current implementation.  The
problem is that sometimes that hardware cursor flashes garabage as if the
hardware cursor disable was never done.

So from what I have read here, I should rewrite this routine to look as follows:

1. read the current value of the MMIO register for the hardware cursor enable

2. write to that same register to disable tha hardware cursor

3. eieio

4. read back in from the hardware cursor enable register to make sure a PCI
write post has been done  (does this need to be in a loop?)

5. isync

6. write the new cursor image into framebuffer memory with no eieios to allow
for burst writing

7. eieio (to make sure the complete image had been written)

8. write back to the MMIO register to put back its original value

9. eieio

10.  read the current value from the MMIO register to make sure the write post
was done properly

11. isync

This seems like extreme overkill.

Am I simply misunderstanding this PCI write post thing and the need for eieio
and isync?

I really thought I only needed an sync or isync when I was reading from value
(say a semaphore or spin lock) and wanted to make sure absolutely *no*
pre-fetches or other accesses happened in the code or datastructures protected
by the spinlock or semaphore.

I always though all other forms of access on PPC did enforce in order of
completion (i.e. allowed pre-fetching of operands, data, etc) but that a write
to memory really would complete before any later read or write to any location
which came from a later instruction.

If that is true, then you only need eieio or sync/isync when you are writing a
value that somehow will impact other later accesses and you need to be sure it
is complete before the later accesses do any pre-fetching of data.

Am I messed up here?  I really thought I understood this but now I am very
unsure.

Thanks,

Kevin

** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Re: still no accelerated X ($#!$*)
  2000-01-21 13:34                     ` Gabriel Paubert
@ 2000-01-21 14:06                       ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 29+ messages in thread
From: Benjamin Herrenschmidt @ 2000-01-21 14:06 UTC (permalink / raw)
  To: Gabriel Paubert; +Cc: linuxppc-dev

On Fri, Jan 21, 2000, Gabriel Paubert <paubert@iram.es> wrote:

>- write,
>- eieio,
>- read,
>- isync to make sure that the read has reached the registers and is not in
>a load pending queue or whatever which can be quite deep especially if the
>processor does never need the result of the read...

Indeed. Some time ago, I fixed the pmac-pic mask/unmask routines this way:

 - setup new mask in cached_mask (variable)
 - write_mask(cached_mask)
 - do {
 -   sync();
 - } while (read_mask() != cached_mask)

Note that both read_mask and write_mask will do eieio.

I beleive the sync could be replaced by an isync, I'm just not 100% sure
of the SMP behaviour but the mask/unmask routines should be fully
synchronized anyway. (And they are called with EE off).

Without this fix, we occasionally had bogus interrupts coming from the
IDE and possibly other rare problems.

I added a smiliar fix to the openpic code in my recent kernels too.

** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Re: still no accelerated X ($#!$*)
  2000-01-21  2:19                     ` Kevin Hendricks
  2000-01-21  7:58                       ` Geert Uytterhoeven
@ 2000-01-21 14:15                       ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 29+ messages in thread
From: Benjamin Herrenschmidt @ 2000-01-21 14:15 UTC (permalink / raw)
  To: khendricks; +Cc: linuxppc-dev, linux-fbdev

On Thu, Jan 20, 2000, Kevin Hendricks <khendricks@ivey.uwo.ca> wrote:

>Okay, I went and looked at the latest aty128fb.c code and it does not use
>eieio
>anywhere.  I looked at ealier verions of this file and it at one time had
>eieio
>but they have since been removed.
>
>I also looked and the endian conversion routines do not use the output
>contraint approach you took but do include the memory clobber on the writes.

I just looked at atyfb.c and aty128fb.c in my source tree (atyfb is
2.2.14 one and aty128fb is the latest backport done by atong) and neither
uses eieio nor mb(), wmb(), ...

This looks bogus to me. I've spotted a few cases where those calls should
be in.

We can either put the eieio back in the access functions (less optimal,
but we can also fix the constraints to get rid of the memory clobber as
discussed previously), or we can fill the code with carefuly placed mb()
and wmb() but this requires more knowledge of the chipset than I actually
have.

I'll put back eieio() in the access macros for my kernels until a
definitive answer pops up on this issue.

** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Re: still no accelerated X ($#!$*)
       [not found] <200001211355.NAA05477@granada.iram.es>
@ 2000-01-21 15:13 ` Gabriel Paubert
  2000-01-21 15:29   ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 29+ messages in thread
From: Gabriel Paubert @ 2000-01-21 15:13 UTC (permalink / raw)
  To: Kevin_Hendricks; +Cc: bh40, khendricks, linuxppc-dev

	Hi,

> Okay, please let me be specific here to make sure I understand.
>
> In fixing the xf3.9.17 code in the r128 module, there is a routine that
>
> 1. reads from an MMIO register to get the current value of the hardware cursor
> enable bit
>
> 2. writes to that same MMIO register to turn off the hardware cursor

Read modify write of a single register should not need any eieio, but I'm
not so sure when reading or wrting a FIFO for example. Could it be that
lwz r3,reg followed by lwz r4,reg appear reversed on the bus ?

I can't think of any good reason to do it: if you have to go to the bus
anyway, keeping it in order will minimize the number of instructions in
fly in the HW. The only thing that may improve performance when you have
to go to the bus is to move loads ahead of stores, reordering between
noncacheable reads or between noncacheable writes does not make sense, but
weird hardware designs do exist.

>
> 3. copies an new image for the cursor into framebuffer memory
>
> 4. reloads the MMIO register to put back the original value for the hardware
> cursor enable bit.
>
>
> There are no eieios or isyncs anywhere in the current implementation.  The
> problem is that sometimes that hardware cursor flashes garabage as if the
> hardware cursor disable was never done.

No the isync was tere when you have to protect against interrupts, I don't
tink that it is your case.

> So from what I have read here, I should rewrite this routine to look as follows:
>
>
> 1. read the current value of the MMIO register for the hardware cursor enable
>
> 2. write to that same register to disable tha hardware cursor
>
> 3. eieio

Yes.

>
> 4. read back in from the hardware cursor enable register to make sure a PCI
> write post has been done  (does this need to be in a loop?)
>
> 5. isync

I tink that you can skip 4) and 5)

> 6. write the new cursor image into framebuffer memory with no eieios to allow
> for burst writing
>
> 7. eieio (to make sure the complete image had been written)
>
> 8. write back to the MMIO register to put back its original value
>
> 9. eieio
>
> 10.  read the current value from the MMIO register to make sure the write post
> was done properly
>
> 11. isync

You can skip 10 and 11 I think.

> This seems like extreme overkill.
>
> Am I simply misunderstanding this PCI write post thing and the need for eieio
> and isync?

No the isync was there for the case of interrupt masking. But even then
that might not be enough with an interrupt controller which is too slow
to remove the interrupt request from the CPU, or at least you might get
spurious interrupts. But I digress...

> I really thought I only needed an sync or isync when I was reading from value
> (say a semaphore or spin lock) and wanted to make sure absolutely *no*
> pre-fetches or other accesses happened in the code or datastructures protected
> by the spinlock or semaphore.

Indeed, basically the sync ans isync are mosty interrupt and SMP
synchronizatin issues, this is not your case AFAICT.

> I always though all other forms of access on PPC did enforce in order of
> completion (i.e. allowed pre-fetching of operands, data, etc) but that a write
> to memory really would complete before any later read or write to any location
> which came from a later instruction.

No, a read may be moved before a write as long as there are no address
conflicts (overlaps). That's pretty common on all high end processors.

> If that is true, then you only need eieio or sync/isync when you are writing a
> value that somehow will impact other later accesses and you need to be sure it
> is complete before the later accesses do any pre-fetching of data.

Bridges also may reorder writes and burst them, eieio can be used to
prevent this. That's not only the processor effect...

> Am I messed up here?  I really thought I understood this but now I am very
> unsure.

I've often believed that I understood it and later came to the conclusion
that I was wrong, maybe I'm still wrong but hopefully I'm coming closer
to a full understanding of these things every time it is discussed on the
mailing lists :-).

	Gabriel.

** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Re: still no accelerated X ($#!$*)
  2000-01-21 15:13 ` Fwd: Re: still no accelerated X ($#!$*) Gabriel Paubert
@ 2000-01-21 15:29   ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 29+ messages in thread
From: Benjamin Herrenschmidt @ 2000-01-21 15:29 UTC (permalink / raw)
  To: Gabriel Paubert, linuxppc-dev

On Fri, Jan 21, 2000, Gabriel Paubert <paubert@iram.es> wrote:

>Bridges also may reorder writes and burst them, eieio can be used to
>prevent this. That's not only the processor effect...

My understanding of the PCI spec is that bridges are not allowed to
re-order. They an combine for burst but they are forbidden to do any
reordering.

** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Re: still no accelerated X ($#!$*)
  2000-01-21  0:05               ` Gabriel Paubert
  2000-01-21  0:35                 ` Kevin Hendricks
@ 2000-01-21 15:47                 ` Franz Sirl
  2000-01-21 19:08                   ` Gabriel Paubert
  1 sibling, 1 reply; 29+ messages in thread
From: Franz Sirl @ 2000-01-21 15:47 UTC (permalink / raw)
  To: Gabriel Paubert; +Cc: Kevin Hendricks, David Edelsohn, linuxppc-dev


At 01:05 21.01.00 , Gabriel Paubert wrote:



>On Thu, 20 Jan 2000, Franz Sirl wrote:
>
> > It depends a little bit on the usage of the asm's if the memory (either
> > global or local) clobber is needed or not. If you use them for
> read/writes to HW
> > registers needing ordering (which is very likely here since we talk about
> > graphics HW), the compiler can only decide on the memory usage defined
> by the
> > clobbers/memory inputs on how to order the inlines (volatile has no
> effect on
> > this).
>
>Do you mean that a volatile memory reference can be reordered wit an asm
>volatile statement ? I thought theat this was not possible.
>
> >From GCC's documentation:
>
>    "You can prevent an `asm' instruction from being deleted, moved
>significantly, or combined, by writing the keyword `volatile' after the
>`asm'."
>
>You might disagree, but I consider that moving across a volatile memory
>reference is a _significant_ move, a very significant one.

asm volatile doesn't declare any volatile memory references! What makes you
think it does?
There was a lengthy discussion about this with Richard Henderson on the
gcc/egcs lists a while ago. The result was a rewrite of
linux/include/asm-ppc/io.h in the kernel to it's current state.

Franz.


** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Re: still no accelerated X ($#!$*)
@ 2000-01-21 17:32 David Edelsohn
  0 siblings, 0 replies; 29+ messages in thread
From: David Edelsohn @ 2000-01-21 17:32 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: khendricks

asm volatile ("stwbrx %1,%2,%3; eieio"
	: "=m" (*(volatile unsigned *)(base_addr+regindex))
	: "r" (regdata), "b" (regindex), "r" (base_addr));

asm volatile ("lwbrx %0,%1,%2; eieio"
	: "=r"(val)
	: "b"(regindex), "r"(base_addr),
	  "m" (*(volatile unsigned *)(base_addr+regindex)));

BTW, one of the main stylistic points which confused me yesterday is that
the inlined assembly statement associates variable "regindex" with the
second parameter of the instruction and varable "base_addr" with the third
parameter of the instruction.  As long as the second parameter is not
assigned to register r0, the two are commucative in their function, so
this is not an error.

	In the instruction architecture, the second parameter is intended
to be the base and the third as the index, so it is confusing for someone
familiar with the POWER and PowerPC instruction set to read the order in
which the parameters are assigned.  Instructions loading from addresses
with displacements, e.g.,

	lwz %0,%1(%2)

specify the displacement second and the base register last, so the
ordering can get confused.  That was what threw me off yesterday in my
haste.

David
===============================================================================
David Edelsohn                                      T.J. Watson Research Center
dje@watson.ibm.com                                  P.O. Box 218
+1 914 945 4364 (TL 862)                            Yorktown Heights, NY 10598

** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Re: still no accelerated X ($#!$*)
  2000-01-21 15:47                 ` Franz Sirl
@ 2000-01-21 19:08                   ` Gabriel Paubert
  0 siblings, 0 replies; 29+ messages in thread
From: Gabriel Paubert @ 2000-01-21 19:08 UTC (permalink / raw)
  To: Franz Sirl; +Cc: Kevin Hendricks, David Edelsohn, linuxppc-dev

On Fri, 21 Jan 2000, Franz Sirl wrote:

> >    "You can prevent an `asm' instruction from being deleted, moved
> >significantly, or combined, by writing the keyword `volatile' after the
> >`asm'."
> >
> >You might disagree, but I consider that moving across a volatile memory
> >reference is a _significant_ move, a very significant one.
>
> asm volatile doesn't declare any volatile memory references! What makes you
> think it does?

No it does not declare any volatile memory reference, but it is
"guaranteed not to be moved significantly". What does this exactly mean ?

Anyway asm volatile statements are guranteed not to reordered wrt each
other. So let us illustrate it with examples; if you write:

asm volatile("stwbrx...", ...);

asm volatile("eieio");

asm volatile("lwbrx...", ...);

the compiler should respect the order even without nasty memory clobbers.
The funny thing is that if the accesses are big endian or byte, then you
don't need the asm and a reordering might perhaps take place. In this
case:

*(volatile unsigned *)output_reg = data;

asm volatile("eieio");

result = *(volatile unsigned *) input_reg;

then the eieio might move become the first or last statement according to
what you say. I consider that this does not reflect the documentation
since moving across a volatile memory reference is a significant move
in my book.

Note that the documentation also states clearly that such an asm without
any parameters is considered as having undeterminate side effects so it
might actually not be moved, but it might be as bad as a memory clobber.

> There was a lengthy discussion about this with Richard Henderson on the
> gcc/egcs lists a while ago. The result was a rewrite of
> linux/include/asm-ppc/io.h in the kernel to it's current state.

I remember, but my goal is a memory-clobber-free kernel, although there
are many other efficiency issues right now in Linux/PPC...

	Gabriel
	(desperately trying to build gcc_latest_snapshot)

** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Re: still no accelerated X ($#!$*)
@ 2000-01-21 20:25 jlquinn
  2000-01-23 13:06 ` Gabriel Paubert
  0 siblings, 1 reply; 29+ messages in thread
From: jlquinn @ 2000-01-21 20:25 UTC (permalink / raw)
  To: Gabriel Paubert; +Cc: linuxppc-dev

Can you list some of these efficiency issues?  In case anybody wants to
tackle them.

In fact, it would be interesting to have a page of potential ppc-related
kernel projects (others as well) that people could refer to if they're
looking for something to do.  Is there such a page already?  If not, I'd be
willing to assemble one.  However I don't have a place to host it.

Thanks,
Jerry Quinn
jlquinn@us.ibm.com

Gabriel Paubert <paubert@iram.es>@lists.linuxppc.org on 01/21/2000 02:08:30
PM
[snip]
I remember, but my goal is a memory-clobber-free kernel, although there
are many other efficiency issues right now in Linux/PPC...

     Gabriel
     (desperately trying to build gcc_latest_snapshot)

** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Fwd: Re: still no accelerated X ($#!$*)
  2000-01-21 20:25 jlquinn
@ 2000-01-23 13:06 ` Gabriel Paubert
  0 siblings, 0 replies; 29+ messages in thread
From: Gabriel Paubert @ 2000-01-23 13:06 UTC (permalink / raw)
  To: jlquinn; +Cc: linuxppc-dev

On Fri, 21 Jan 2000 jlquinn@us.ibm.com wrote:

> Can you list some of these efficiency issues?  In case anybody wants to
> tackle them.

Well, there are many things which need improvements IMHO:

1) interrupt handling, I hate to say this but it has recently bloated to
an incredible level through this new (and I am firmly convinced absolutely
unnecessary int_control structure). Every save_flags/cli/sti/restore flags
is implemented with an indirect function call. So the common sequence:

	save_flags(flags);
	cli();
	...do some work...
	restore_flags;

takes about 4 instructions per line (lis+lwz+mtctr+bctrl). And because
"some chips have problems", _every_ mtmsr is preceded with a sync
instuction which are very costly in every implementation in which the sync
actually goes to the bus (and on an SMP G4 with the long memory queues we
may be speaking of something like 100+ processor cycles).  Multiply the
number of cli+sti+restore_flags per second on a busy SMP system by the
number of processors by the time a sync actually takes and you might end
up with a significant proportion of the bus activity (AFAIU a sync will be
retried as long as one device on the bus, be it processor or host bridge,
has something pending in its buffers, and this may take a very long time
with the deep buffers of current processors).

I always thought the sync argument was bogus at least for the cli case,
but since you work for IBM, you might have better access than me to the
errata :-).

Furthermore the fact that these heavily used functions are external
functions means that many leaf functions are transformed into non-leaf
functions with the associated code bloat and overhead needed to establish
a stack frame, save GPR/LR and restore them on exit (this is indirect code
bloat but it may be even more significant than the direct one). The cost
of leaf functions on most PPC processors is extremely small thanks to
early detection of branches in the pieline and folding (actually even
fairly short functions can improve performance by reducing instruction
cache footprint).

Ok, enough rant on this one, I have an idea which that I'm fairly sure
will work on UP (and very probably on SMP) but I want to make some tests
before publishing it.

2) system call entry optimization, saving all registers for a system call
and most interrupts is ridiculous. The API says that r14-r31 are preserved
across function calls, so please don't save them on the kernel stack on
every entry through real mode. System calls should not even need to save
lr/ctr/xer/r11/r12/cr, I'd rather have syscalls use a convention similar
to standard function calls, you need to save some registers (r0 and
r3-r10)  in case it is restarted but that's about all. The problem is that
we might have to save now some of these registers because of backwards
compatibility issues. Anyway saving something like 18 to 24 registers less
will speed up many system calls and reduce icache and dcache footprint
significantly for simple system calls.

Also not saving all these registers will reduce stack usage in case of
nested interrupts, don't forget that the available stack size has just
been reduced by half a kilobyte with the addition of the Altivec state.
Although the right solution might be an interrupt stack (one per processor
on SMP), which can be fairly large since it does not need to scale with
the number of processes running in the system.

3) inline many simple functions, in my source tree I have already removed
some functions in arch/ppc/kernel/misc.S (abs for example, by adding
-Dabs=__builtin_abs after the -fno-builtin), other simple functions are
now inline. I'm also going to test a version of bitops.h and atomic.h
which inlines virtually all functions (and generates optimal or very close
to optimal assembly code in every case). For example:

clear_bit(constant, &word) generates:

1:	lwarx	tmp,0,ptr
	rlwinm	tmp,tmp,0,"correct mask for constant"
	stwcx.	tmp
	bne-	1b

where prt is the address of word + constant>>5
but if it is a variable the rlwinm is replaced with andc tmp,tmp,mask
where mask is generated by:

	li 	x,1
	rlwnm	mask,x,variable

(x and mask may or may not be the same register, the compiler chooses).

 This works well with recent compilers and could probably be introduced
when the minimal compiler version is raised, or using inline versions
conditionally depending on compiler version. Here once again the issue
is direct and indirect code bloat (actually the code may be slightly
larger than a function call in some cases, but increasing the number
of leaf functions should more than compensate for it).

4) Add support for machines with 1Gb and more of RAM. But please do not
copy the way it's done on Linux/Intel: it's a kludge.

5) Increase the address space available to applications, the current limit
of 2 Gb is mostly due to the fact that I/O space on PreP is mapped at
0x80000000 and that most firmware allocates I/O space in an extremely
sparse way. Here my bootloader (which reorganizes PCI I/O space to compact
it without taking any risk of conflicts) might help.

5) Allow to use BATs from user mode for large mappings, this will be
especially beneficial for X servers.

6) Many other things which I forget right now, and this post is already
long enough.

Right now I'm more interested in improving GCC code generation. Then I'll
be back to kernel hacking when I get my PPC notenook around Easter I hope
(it has to be an Apple since IBM has dropped them AFAICT).

> In fact, it would be interesting to have a page of potential ppc-related
> kernel projects (others as well) that people could refer to if they're
> looking for something to do.  Is there such a page already?  If not, I'd be
> willing to assemble one.  However I don't have a place to host it.

Good idea, let us follow Hollis' suggestions.

	Gabriel.

** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2000-01-23 13:06 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <200001211355.NAA05477@granada.iram.es>
2000-01-21 15:13 ` Fwd: Re: still no accelerated X ($#!$*) Gabriel Paubert
2000-01-21 15:29   ` Benjamin Herrenschmidt
2000-01-21 20:25 jlquinn
2000-01-23 13:06 ` Gabriel Paubert
  -- strict thread matches above, loose matches on Subject: below --
2000-01-21 17:32 David Edelsohn
2000-01-21 13:53 Kevin_Hendricks
     [not found] <Message from Kevin Hendricks <khendricks@ivey.uwo.ca>
2000-01-20 18:12 ` Kevin Hendricks
2000-01-20 18:26   ` David Edelsohn
2000-01-20 18:45     ` Benjamin Herrenschmidt
2000-01-20 18:51       ` David Edelsohn
2000-01-20 18:52     ` Franz Sirl
2000-01-20 19:31       ` Gabriel Paubert
2000-01-20 19:36         ` Kevin Hendricks
2000-01-20 19:51           ` Geert Uytterhoeven
2000-01-20 19:59           ` Gabriel Paubert
2000-01-20 20:08             ` David Edelsohn
2000-01-20 22:34             ` Franz Sirl
2000-01-21  0:05               ` Gabriel Paubert
2000-01-21  0:35                 ` Kevin Hendricks
2000-01-21  1:53                   ` Gabriel Paubert
2000-01-21  2:19                     ` Kevin Hendricks
2000-01-21  7:58                       ` Geert Uytterhoeven
2000-01-21 14:15                       ` Benjamin Herrenschmidt
2000-01-21 11:54                   ` Benjamin Herrenschmidt
2000-01-21 13:34                     ` Gabriel Paubert
2000-01-21 14:06                       ` Benjamin Herrenschmidt
2000-01-21 15:47                 ` Franz Sirl
2000-01-21 19:08                   ` Gabriel Paubert
2000-01-20 18:46   ` Franz Sirl

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).