Re: nasm over gas?

public inbox for linux-kernel@vger.kernel.org

* Re: nasm over gas?
       [not found]       ` <tcVB.rs.3@gated-at.bofh.it>
@ 2003-09-08 12:03         ` Ihar 'Philips' Filipau
  2003-09-08 13:53           ` Richard B. Johnson
                             ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Ihar 'Philips' Filipau @ 2003-09-08 12:03 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linux Kernel Mailing List

Eric W. Biederman wrote:
> insecure <insecure@mail.od.ua> writes:
>>        movl    $0, 20(%esp)
>>        movl    $1000000, %edi      <----
>>        movl    $1000000, 16(%esp)  <----
>>        movl    $0, 12(%esp)
>>
>>No sane human will do that.
>>main:
>>        movl    $1000000, %edi
>>        movl    %edi, 16(%esp)	<-- save 4 bytes
>>        movl    %ebp, 12(%esp)  <-- save 4 bytes
>>        movl    $.LC27, 8(%esp)
>>
>>And this is only from a cursory examination.
> 
> Actually it is not as simple as that.  With the instruction that uses
> %edi following immediately after the instruction that populates it you cannot
> execute those two instructions in parallel.  So the code may be slower.  The
> exact rules depend on the architecture of the cpu.
> 

   It will depend on the CPU architecture only if you have unlimited i$ size.
   Servers with 8MB of cache - yes, it is faster.
   Celeron with 128k of cache - +4 bytes == higher probability of an i$ miss 
== lower performance.
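
The 4-byte figure can be checked from the IA-32 encodings of the two store forms discussed above. What follows is only a sketch in C, assuming the standard 32-bit encodings with an 8-bit displacement off %esp; the array and function names are made up for illustration and do not come from any tool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* movl $1000000, 16(%esp): opcode C7 /0, ModRM 0x44, SIB 0x24,
   disp8 0x10, then the 32-bit immediate in little-endian order. */
static const unsigned char mov_imm_to_stack[] = {
    0xC7, 0x44, 0x24, 0x10, 0x40, 0x42, 0x0F, 0x00
};

/* movl %edi, 16(%esp): opcode 89 /r, ModRM 0x7C, SIB 0x24, disp8 0x10. */
static const unsigned char mov_reg_to_stack[] = {
    0x89, 0x7C, 0x24, 0x10
};

/* Decode a little-endian 32-bit immediate (hypothetical helper). */
uint32_t imm32_le(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Immediate carried by the longer form. */
uint32_t stored_immediate(void)
{
    return imm32_le(mov_imm_to_stack + 4);
}

/* Bytes saved by storing from a register instead of an immediate. */
size_t bytes_saved(void)
{
    return sizeof mov_imm_to_stack - sizeof mov_reg_to_stack;
}
```

The saving is exactly the size of the 32-bit immediate, which the register form does not have to carry.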

> 
>>What gives you an impression that anyone is going to rewrite linux in asm?
>>I _only_ saying that compiler-generated asm is not 'good'. It's mediocre.
>>Nothing more. I am not asm zealot.
> 
> 
> I think I would agree with that statement most compiler-generated assembly
> code is mediocre in general.  At the same time I would add most human
> generated assembly is poor, and a pain to maintain.
> 
> If you concentrate on those handful of places where you need to
> optimize that is reasonable.  Beyond that there simply are not the
> developer resources to do good assembly.  And things like algorithmic
> transformations in assembly are an absolute nightmare.  Where they are
> quite simple in C.
> 
> And if the average generated code quality bothers you enough with C
> the compiler can be fixed, or another compiler can be written that
> does a better job, and the benefit applies to a lot more code.
> 

   e.g. the C-- project: something like C, where you can operate on 
registers just like any other variables. Under DOS it produced .com 
files without any overhead: a program containing only 'int main() { return 0; }' 
was optimized to a one-byte 'ret' ;-) But sure, it was not a complete C 
implementation.

   Sure, I would prefer to have nasm used for the kernel asm parts - but 
obviously gas has already become the standard.

P.S. Also, having a good macro processor for an assembler is a must: CPP is 
terribly stupid by design. I believe gas has no preprocessor comparable 
to masm's? I bet they are using C's cpp. This is a degradation: macros 
are the major feature of any translator I have worked with. They can save 
you a lot of time and make code much cleaner, more readable and maintainable. 
CPP is just too dumb for asm...
Good old times, when people were responsible for _every_ byte of their 
programs... Yeah... Memory/programmers are cheap nowadays...


* Re: nasm over gas?
  2003-09-08 12:03         ` nasm over gas? Ihar 'Philips' Filipau
@ 2003-09-08 13:53           ` Richard B. Johnson
  2003-09-08 16:10             ` Jamie Lokier
  2003-09-08 16:17           ` Jamie Lokier
  2003-09-08 17:59           ` William Lee Irwin III
  2 siblings, 1 reply; 12+ messages in thread
From: Richard B. Johnson @ 2003-09-08 13:53 UTC (permalink / raw)
  To: Ihar 'Philips' Filipau
  Cc: Eric W. Biederman, Linux Kernel Mailing List

[-- Attachment #1: Type: TEXT/PLAIN, Size: 8204 bytes --]

On Mon, 8 Sep 2003, Ihar 'Philips' Filipau wrote:

> Eric W. Biederman wrote:
> > insecure <insecure@mail.od.ua> writes:
> >>        movl    $0, 20(%esp)
> >>        movl    $1000000, %edi      <----
> >>        movl    $1000000, 16(%esp)  <----
> >>        movl    $0, 12(%esp)
> >>
> >>No sane human will do that.
> >>main:
> >>        movl    $1000000, %edi
> >>        movl    %edi, 16(%esp)	<-- save 4 bytes
> >>        movl    %ebp, 12(%esp)  <-- save 4 bytes
> >>        movl    $.LC27, 8(%esp)
> >>
> >>And this is only from a cursory examination.
> >
> > Actually it is not as simple as that.  With the instruction that uses
> > %edi following immediately after the instruction that populates it you
> > cannot
> > execute those two instructions in parallel.

With a single-CPU ix86, the only instructions that operate in
parallel are the instructions that calculate the next address, and
this only if you use 'leal'. However, there is an instruction
pipeline, so many memory accesses may seem to be unrelated to the
current execution context and are therefore assumed to be 'parallel'.

> >  So the code may be slower.  The
> > exact rules depend on the architecture of the cpu.
> >
>
>    It will depend on the CPU architecture only if you have unlimited i$ size.
>    Servers with 8MB of cache - yes, it is faster.
>    Celeron with 128k of cache - +4 bytes == higher probability of an i$ miss
> == lower performance.
>
> >
> >>What gives you an impression that anyone is going to rewrite linux in asm?
> >>I _only_ saying that compiler-generated asm is not 'good'. It's mediocre.
> >>Nothing more. I am not asm zealot.
> >
> >
> > I think I would agree with that statement most compiler-generated assembly
> > code is mediocre in general.  At the same time I would add most human
> > generated assembly is poor, and a pain to maintain.
> >

The compiler-generated assembly is, by design, "universal" so that
any legal 'C' statement may follow any other legal 'C' statement.
This means that, at each sequence-point, the assembly generation
is complete. This results in a lot of code duplication, etc. A
really good optimizer could perform a fix-up that, based upon
the current 'C' code context, removes a lot of redundancy. Currently,
gcc does some such optimization, such as loop unrolling.

A really good project would be an assembly-optimizer operated
like:

	gcc -O2 -S -o -  prog.c | optimizer | as -o prog.o -

Just make that optimizer and away you go!  I hate parsers and
other text-based stuff so I'm not a candidate to make one of
these things.

> > If you concentrate on those handful of places where you need to
> > optimize that is reasonable.  Beyond that there simply are not the
> > developer resources to do good assembly.  And things like algorithmic
> > transformations in assembly are an absolute nightmare.  Where they are
> > quite simple in C.
> >
> > And if the average generated code quality bothers you enough with C
> > the compiler can be fixed, or another compiler can be written that
> > does a better job, and the benefit applies to a lot more code.
> >
>
>    e.g. the C-- project: something like C, where you can operate on
> registers just like any other variables. Under DOS it produced .com
> files without any overhead: a program containing only 'int main() { return 0; }'
> was optimized to a one-byte 'ret' ;-) But sure, it was not a complete C
> implementation.
>
>    Sure, I would prefer to have nasm used for the kernel asm parts - but
> obviously gas has already become the standard.
>
> P.S. Also, having a good macro processor for an assembler is a must: CPP is
> terribly stupid by design. I believe gas has no preprocessor comparable
> to masm's? I bet they are using C's cpp. This is a degradation: macros
> are the major feature of any translator I have worked with. They can save
> you a lot of time and make code much cleaner, more readable and maintainable.
> CPP is just too dumb for asm...
> Good old times, when people were responsible for _every_ byte of their
> programs... Yeah... Memory/programmers are cheap nowadays...

This is for information only. I certainly don't advocate
writing everything in assembly language.

Attached is a tar file containing source and a Makefile.
It generates two tiny programs, "hello" and "world".
Both write "Hello world!" to standard-output. One is
written in assembly and the other is written in 'C'.
The one written in 'C' uses your installed shared
runtime library as is normal for such programs. Even
then, it is 2,948 bytes in length. The one written
in assembly results in a complete executable that
doesn't require any runtime support, i.e., static.
It is only 456 bytes in length.

gcc -Wall -O4 -o hello hello.c
strip hello
as -o world.o world.S
ld -o world world.o
strip world
ls -la hello world
-rwxr-xr-x   1 root     root         2948 Sep  8 08:34 hello
-rwxr-xr-x   1 root     root          456 Sep  8 08:34 world

The point is that if you really need to save some application
size, in many cases you can do the work in assembly. It is
a very useful tool. Also, if you have critical sections of
code you need to pipe-line for speed, you can do it in assembly
and make sure the optimization doesn't disappear the next
time somebody updates (improves) your tools. What you write
in assembly is what you get.

I don't like "in-line" assembly. Sometimes you don't have
much choice because you can't call some assembly-language
function to perform the work. However, when you can afford
the overhead of calling a function written in assembly, the
following applies.

Assume you have:

 extern int funct(int one, int two, int three);

Your assembly would obtain parameters as:

one   = 0x04
two   = 0x08
three = 0x0c

funct:	movl	one(%esp), %eax		# Get first passed parameter
	movl	two(%esp), %ebx		# Get second parameter
	movl	three(%esp), %ecx	# Get third parameter
	...etc

Now, gcc requires that your function not destroy any index
registers, %ebp, or any segment registers so, in the case
above, we need to save %ebx (an index register) before we
modify its value. To do this, we push it onto the stack.
This will alter the stack offsets where we obtain our input
parameters.

one   = 0x08
two   = 0x0c
three = 0x10

funct:	pushl	%ebx			# Save index register
	movl	one(%esp), %eax		# Get first passed parameter
	movl	two(%esp), %ebx		# Get second parameter
	movl	three(%esp), %ecx	# Get third parameter
	...etc
	popl	%ebx			# Restore index register

So, we could define a macro that allows us to adjust the offsets
based upon the number of registers saved. I won't bother
here.
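
The offset arithmetic behind such a macro is simple enough to sketch in C. This is only an illustration under the assumptions already made above (32-bit stack slots, the return address on top of the stack at entry, arguments pushed right to left); the names WORD_SIZE, ARG_OFFSET and arg_offset are hypothetical.

```c
#include <assert.h>

#define WORD_SIZE 4  /* bytes per stack slot on 32-bit x86 */

/* Offset from %esp of the n-th argument (n = 0 for the first) after
   `saved` registers have been pushed since function entry. The extra
   "+ 1" skips the slot holding the return address. */
#define ARG_OFFSET(n, saved) (WORD_SIZE * ((saved) + 1 + (n)))

/* Function wrapper so the arithmetic can be checked at run time. */
int arg_offset(int n, int saved)
{
    assert(n >= 0 && saved >= 0);
    return ARG_OFFSET(n, saved);
}
```

With no saved registers this gives 0x04, 0x08 and 0x0c for the three parameters, and with one saved register 0x08, 0x0c and 0x10, matching the two tables above. Since a .S file is normally run through cpp, a definition of this shape could in principle be used in the assembly source directly.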

In most all cases, any value returned from the function is returned
in the %eax register. If you need to return a 'long long' both
%edx and %eax are used. Some functions may return values in the
floating-point unit so, when replacing existing 'C' code, you
need to see what the convention was.

When I write assembly-language functions I usually do it to
replace 'C' functions that (usually) somebody else has written.
Those 'C' functions are known to work. In other words, they
perform the correct mathematics. However, they need to be
sped up or they need to be pared down to a more reasonable
size to fit in some embedded system.

Recently we had a function that calculated the RMS value of
an array of floating-point (double) numbers. With a particular
array size, the time necessary was something like 300 milliseconds.
By rewriting in assembly, and using the knowledge that the
array will never be less than 512 doubles in length, plus always
a power-of-two, the execution time went way down to 40 milliseconds.
Also, you can't "cheat" with a FP unit. There are always memory-
accesses that eat valuable CPU time. You can't put temporary float
values in registers.
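
The kind of size knowledge described above can be illustrated in C. This is not the assembly routine mentioned here, only a sketch of one way to exploit the stated preconditions (length at least 512 and a power of two): the loop is unrolled by four with no remainder handling. The function name mean_square_pow2 is hypothetical, and it returns the mean of the squares; the RMS value is the square root of that, e.g. via sqrt() from <math.h>.

```c
#include <assert.h>
#include <stddef.h>

/* Mean of the squares of x[0..n-1].
   Preconditions taken from the text: n >= 512 and n is a power of two,
   so n is always a multiple of 4 and no remainder loop is needed. */
double mean_square_pow2(const double *x, size_t n)
{
    assert(n >= 512 && (n & (n - 1)) == 0);

    /* Four independent accumulators, so consecutive additions do not
       all depend on the same running sum. */
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;

    for (size_t i = 0; i < n; i += 4) {
        s0 += x[i]     * x[i];
        s1 += x[i + 1] * x[i + 1];
        s2 += x[i + 2] * x[i + 2];
        s3 += x[i + 3] * x[i + 3];
    }
    return ((s0 + s1) + (s2 + s3)) / (double)n;
}
```

Whether a compiler turns this into code as fast as a hand-written version depends on the compiler and the CPU, so any gain would have to be measured.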

I strongly suggest that if you have an interest in assembly, you
cultivate that interest. Soon most all mundane coding will be
performed by machine from a specification written by "Sales".
The only "real" programming will be done by those who can make
the interface between the hardware and the "coding machine". That's
assembly!

Cheers,
Dick Johnson
Penguin : Linux version 2.4.22 on an i686 machine (794.73 BogoMips).
            Note 96.31% of all statistics are fiction.

[-- Attachment #2: Type: APPLICATION/octet-stream, Size: 626 bytes --]


* Re: nasm over gas?
  2003-09-08 13:53           ` Richard B. Johnson
@ 2003-09-08 16:10             ` Jamie Lokier
  0 siblings, 0 replies; 12+ messages in thread
From: Jamie Lokier @ 2003-09-08 16:10 UTC (permalink / raw)
  To: Richard B. Johnson
  Cc: Ihar 'Philips' Filipau, Eric W. Biederman,
	Linux Kernel Mailing List

Richard B. Johnson wrote:
> > > Actually it is not as simple as that.  With the instruction that uses
> > > %edi following immediately after the instruction that populates it you
> > > cannot
> > > execute those two instructions in parallel.
> 
> With a single-CPU ix86, the only instructions that operate in
> parallel are the instructions that calculate the next address, and
> this only if you use 'leal'. However, there is an instruction
> pipeline, so many memory accesses may seem to be unrelated to the
> current execution context and are therefore assumed to be 'parallel'.

That was true on the 486.  The Pentium famously executed one or two
instructions per cycle, depending on whether they are "pairable".  The
Pentium Pro and later can issue up to 3 instructions per cycle,
depending on the instruction types.  If they are the right
instructions, it will sustain that rate over multiple cycles.

Nowadays all the major x86 CPUs issue multiple instructions per clock cycle.

-- Jamie







* Re: nasm over gas?
  2003-09-08 12:03         ` nasm over gas? Ihar 'Philips' Filipau
  2003-09-08 13:53           ` Richard B. Johnson
@ 2003-09-08 16:17           ` Jamie Lokier
  2003-09-08 16:45             ` Ihar 'Philips' Filipau
  2003-09-08 17:59           ` William Lee Irwin III
  2 siblings, 1 reply; 12+ messages in thread
From: Jamie Lokier @ 2003-09-08 16:17 UTC (permalink / raw)
  To: Ihar 'Philips' Filipau
  Cc: Eric W. Biederman, Linux Kernel Mailing List

Ihar 'Philips' Filipau wrote:
>   It will depend on the CPU architecture only if you have unlimited i$ size.
>   Servers with 8MB of cache - yes, it is faster.
>   Celeron with 128k of cache - +4 bytes == higher probability of an i$ miss 
> == lower performance.

Higher probability != optimal performance.

It depends on your execution context.  If it's part of a tight loop
which is executed often, then saving a cycle in the loop gains more
performance than saving icache, even on a 128k Celeron.

The execution context can depend on the input to the program, in which
case the faster of the two code sequences can depend on the program's
input too.  Then, for optimal performance, you need to profile the
"expected" inputs.

> P.S. Also, having a good macro processor for an assembler is a must: CPP is 
> terribly stupid by design. I believe gas has no preprocessor comparable 
> to masm's? I bet they are using C's cpp.

You obviously have not read the GAS documentation.

It has quite a good macro facility built in.

-- Jamie


* Re: nasm over gas?
  2003-09-08 16:17           ` Jamie Lokier
@ 2003-09-08 16:45             ` Ihar 'Philips' Filipau
  2003-09-08 16:58               ` Jamie Lokier
  0 siblings, 1 reply; 12+ messages in thread
From: Ihar 'Philips' Filipau @ 2003-09-08 16:45 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Linux Kernel Mailing List

Jamie Lokier wrote:
> Ihar 'Philips' Filipau wrote:
> 
>>  It will depend on the CPU architecture only if you have unlimited i$ size.
>>  Servers with 8MB of cache - yes, it is faster.
>>  Celeron with 128k of cache - +4 bytes == higher probability of an i$ miss 
>>== lower performance.
> 
> Higher probability != optimal performance.
> 
> It depends on your execution context.  If it's part of a tight loop
> which is executed often, then saving a cycle in the loop gains more
> performance than saving icache, even on a 128k Celeron.
> 

   You think as a system programmer.
   Every bit of wasted i$ hits user-space applications too.
   The 128k of $ is shared by every app.

   If you gained one cycle by polluting one more cache line - do not 
forget that this cache line probably contained some info which could have 
avoided a cache miss for another application. So you gained a cycle here - 
and lost it immediately in another app. Not good.

   If you can improve performance by NOT polluting the cache - that would be 
another story :-)))

> The execution context can depend on the input to the program, in which
> case the faster of the two code sequences can depend on the program's
> input too.  Then, for optimal performance, you need to profile the
> "expected" inputs.
> 
> 
> You obviously have not read the GAS documentation.
> 
> It has quite a good macro facility built in.
> 

   Indeed. A quick RTFM showed some good examples.

   But still, I have never seen this kind of thing being used in the kernel. 
Instead of writing normal asm we have something like i386/mmx.c. And 
i386/checksum.S is not the best sample of asm in the kernel either. Sad.

-- 
Ihar 'Philips' Filipau  / with best regards from Saarbruecken.
   -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -
   * Please avoid sending me Word/PowerPoint/Excel attachments.
   * See http://www.fsf.org/philosophy/no-word-attachments.html
   -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -
    There should be some SCO's source code in Linux -
       my servers sometimes are crashing.      -- People



* Re: nasm over gas?
  2003-09-08 16:45             ` Ihar 'Philips' Filipau
@ 2003-09-08 16:58               ` Jamie Lokier
  0 siblings, 0 replies; 12+ messages in thread
From: Jamie Lokier @ 2003-09-08 16:58 UTC (permalink / raw)
  To: Ihar 'Philips' Filipau; +Cc: Linux Kernel Mailing List

Ihar 'Philips' Filipau wrote:
> Jamie Lokier wrote:
> >Ihar 'Philips' Filipau wrote:
> >
> >> It will depend on the CPU architecture only if you have unlimited i$ size.
> >> Servers with 8MB of cache - yes, it is faster.
> >> Celeron with 128k of cache - +4 bytes == higher probability of an i$ miss 
> >>== lower performance.
> >
> >Higher probability != optimal performance.
> >
> >It depends on your execution context.  If it's part of a tight loop
> >which is executed often, then saving a cycle in the loop gains more
> >performance than saving icache, even on a 128k Celeron.
> >
> 
>   You think as a system programmer.
>   Every bit of wasted i$ hits user-space applications too.
>   The 128k of $ is shared by every app.
> 
>   If you gained one cycle by polluting one more cache line - do not 
> forget that this cache line probably contained some info which could have 
> avoided a cache miss for another application. So you gained a cycle here - 
> and lost it immediately in another app. Not good.

Usually the whole L1 cache is flushed between application context
switches anyway, so the cost of a miss is borne by the application
which causes it.

And that _still_ doesn't change the truth of my statement.  One cycle
saved in a loop which executes 1000 times is worth more than an L1
i-cache miss, always.
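
As a rough piece of arithmetic, the trade-off can be written down in C. This is only a sketch: the function name and parameters are hypothetical, and the miss penalty in particular is an assumed input, since the real cost of an L1 i-cache miss depends on the CPU and on where the line is found.

```c
#include <assert.h>

/* Net cycles gained by a code sequence that saves some cycles per loop
   iteration but costs some extra i-cache misses.
   miss_penalty is an assumed figure, in cycles per miss. */
long net_cycles_gained(long iterations, long cycles_saved_per_iter,
                       long extra_misses, long miss_penalty)
{
    assert(iterations >= 0 && extra_misses >= 0 && miss_penalty >= 0);
    return iterations * cycles_saved_per_iter
         - extra_misses * miss_penalty;
}
```

For the case above, net_cycles_gained(1000, 1, 1, p) stays positive for any assumed penalty p below 1000 cycles. With only a handful of iterations the sign can flip, which is why the execution context matters.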

>   If you can improve performance by NOT polluting the cache - that would be 
> another story :-)))

Yes, of course that is better when it is possible.

Modern OOO CPUs are subtle beasts.  Like I said, I added a single
"nop" (one-byte instruction) to a tight graphics loop once, and the
loop went significantly faster.  I could not explain it, except that I
know the Pentium Pro instruction decode stage has many quirks.

>   But still, I have never seen this kind of thing being used in the kernel. 
> Instead of writing normal asm we have something like i386/mmx.c. And 
> i386/checksum.S is not the best sample of asm in the kernel either. Sad.

Those were written before GAS had a macro facility.  I agree with you,
it should be used more in the kernel.

The two examples you gave have been carefully tuned on particular
CPUs, by trial and error.  Changing the instruction order makes a big
difference to their performance.

-- Jamie


* Re: nasm over gas?
  2003-09-08 12:03         ` nasm over gas? Ihar 'Philips' Filipau
  2003-09-08 13:53           ` Richard B. Johnson
  2003-09-08 16:17           ` Jamie Lokier
@ 2003-09-08 17:59           ` William Lee Irwin III
  2 siblings, 0 replies; 12+ messages in thread
From: William Lee Irwin III @ 2003-09-08 17:59 UTC (permalink / raw)
  To: Ihar 'Philips' Filipau
  Cc: Eric W. Biederman, Linux Kernel Mailing List

On Mon, Sep 08, 2003 at 02:03:21PM +0200, Ihar 'Philips' Filipau wrote:
>   e.g. the C-- project: something like C, where you can operate on 
> registers just like any other variables. Under DOS it produced .com 
> files without any overhead: a program containing only 'int main() { return 0; }' 
> was optimized to a one-byte 'ret' ;-) But sure, it was not a complete C 
> implementation.

There is already a C-- project and it is unrelated to your suggestion.

c.f. http://cminusminus.org/


-- wli



* stack alignment in the kernel was Re: nasm over gas?
       [not found]           ` <vog2.7k4.23@gated-at.bofh.it>
@ 2003-09-13 23:57             ` Andi Kleen
  2003-09-14 13:54               ` Jamie Lokier
  0 siblings, 1 reply; 12+ messages in thread
From: Andi Kleen @ 2003-09-13 23:57 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel, jh

Jamie Lokier <jamie@shareable.org> writes:

> Obviously the _intent_ of -O2 is to compile for speed, but it's clear
> that GCC often emits trivially redundant instructions (like stack
> adjustments) that don't serve to speed up the program at all.

The stack adjustments are for getting good performance with floating
point code. Most x86 CPUs require 16 byte alignment for floating point
stores/loads on the stack. It can make a dramatic difference in some 
FP intensive programs.

But obviously that's completely useless for the kernel which never
uses floating point.

A compiler option to turn it off would make sense to save .text space
and eliminate these useless instructions. Especially since the kernel
entry points make no attempt to align the stack to 16 byte anyways,
so most likely the stack adjustments do not even work.

(this option could also warn for floating point usage which is usually illegal,
although you can already get the same effect by compiling with -msoft-float)

-Andi


* Re: stack alignment in the kernel was Re: nasm over gas?
  2003-09-13 23:57             ` stack alignment in the kernel was " Andi Kleen
@ 2003-09-14 13:54               ` Jamie Lokier
  2003-09-14 14:13                 ` Andi Kleen
  2003-09-14 22:27                 ` Jan Hubicka
  0 siblings, 2 replies; 12+ messages in thread
From: Jamie Lokier @ 2003-09-14 13:54 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, jh

Andi Kleen wrote:
> The stack adjustments are for getting good performance with floating
> point code. Most x86 CPUs require 16 byte alignment for floating point
> stores/loads on the stack. It can make a dramatic difference in some 
> FP intensive programs.

You're right.

> A compiler option to turn it off would make sense to save .text space
> and eliminate these useless instructions. Especially since the kernel
> entry points make no attempt to align the stack to 16 byte anyways,
> so most likely the stack adjustments do not even work.

There is an option:

	-mpreferred-stack-boundary=2
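
The argument to this option is an exponent: GCC tries to keep the stack aligned to 2 raised to that number of bytes, so 2 means 4 bytes and the x86 default of 4 means 16 bytes. A small C sketch of what that does to the space reserved for a frame follows; the function names are hypothetical and the rounding is a simplification of what the compiler really does with a frame.

```c
#include <assert.h>
#include <stddef.h>

/* Boundary in bytes for -mpreferred-stack-boundary=num. */
size_t stack_boundary_bytes(unsigned num)
{
    assert(num < 8 * sizeof(size_t));
    return (size_t)1 << num;
}

/* Round a frame size up to the next multiple of a power-of-two
   boundary. This simplified model only shows the padding effect. */
size_t align_up(size_t size, size_t boundary)
{
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
    return (size + boundary - 1) & ~(boundary - 1);
}
```

In this model a 20-byte frame stays at 20 bytes with a boundary of 4 but grows to 32 bytes with a boundary of 16. Stack adjustment instructions of that kind are what the option avoids.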

-- Jamie


* Re: stack alignment in the kernel was Re: nasm over gas?
  2003-09-14 13:54               ` Jamie Lokier
@ 2003-09-14 14:13                 ` Andi Kleen
  2003-09-14 15:56                   ` Jamie Lokier
  2003-09-14 22:27                 ` Jan Hubicka
  1 sibling, 1 reply; 12+ messages in thread
From: Andi Kleen @ 2003-09-14 14:13 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Andi Kleen, linux-kernel, jh

On Sun, Sep 14, 2003 at 02:54:31PM +0100, Jamie Lokier wrote:
> > A compiler option to turn it off would make sense to save .text space
> > and eliminate these useless instructions. Especially since the kernel
> > entry points make no attempt to align the stack to 16 byte anyways,
> > so most likely the stack adjustments do not even work.
> 
> There is an option:
> 
> 	-mpreferred-stack-boundary=2

Hmm. The i386 Makefile sets that already. Where exactly did you see
bogus stack adjustments in kernel code?

-Andi


* Re: stack alignment in the kernel was Re: nasm over gas?
  2003-09-14 14:13                 ` Andi Kleen
@ 2003-09-14 15:56                   ` Jamie Lokier
  0 siblings, 0 replies; 12+ messages in thread
From: Jamie Lokier @ 2003-09-14 15:56 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, jh

Andi Kleen wrote:
> Hmm. The i386 Makefile sets that already. Where exactly did you see
> bogus stack adjustments in kernel code?

I didn't.  I saw them in a test program for __builtin_expect() in the
"oops_in_progress is unlikely()" thread.

I'm used to seeing redundant "mov" instructions and such from GCC, so
when I saw the stack adjustments with -O2 go away with -Os I thought
they were more of the same - not realising that -Os turns off the stack
alignment.  My error.

-- Jamie


* Re: stack alignment in the kernel was Re: nasm over gas?
  2003-09-14 13:54               ` Jamie Lokier
  2003-09-14 14:13                 ` Andi Kleen
@ 2003-09-14 22:27                 ` Jan Hubicka
  1 sibling, 0 replies; 12+ messages in thread
From: Jan Hubicka @ 2003-09-14 22:27 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Andi Kleen, linux-kernel, jh

> 
> > A compiler option to turn it off would make sense to save .text space
> > and eliminate these useless instructions. Especially since the kernel
> > entry points make no attempt to align the stack to 16 byte anyways,
> > so most likely the stack adjustments do not even work.
> 
> There is an option:
> 
> 	-mpreferred-stack-boundary=2

Note that this won't work for x86-64 where ABI compliant varargs require
it.

Honza
> 
> -- Jamie

