public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* Using unsigned int for loop counters - better performance for  Architectures - urban hacker legend?
@ 2010-07-30 23:53 Luis R. Rodriguez
  2010-07-31  9:38 ` David Newall
  0 siblings, 1 reply; 4+ messages in thread
From: Luis R. Rodriguez @ 2010-07-30 23:53 UTC (permalink / raw)
  To: linux-kernel; +Cc: doug.dahlby, Jiri Slaby

For good while now I have been using unsigned int for every looper
iterator I implement on the kernel. I remember I picked this practice
up from Jiri but I just tried to look for actual documentation of this
online and could not find much. A simple test would be to run a series
of loop tests on different architectures with a regular int and then
with an unsigned int to see if performance improves but I obviously
don't have access to many machines with varying architectures.

So if you also follow this practice have you actually found evidence
for this practice, have you actually measured performance metrics? Or
is this just a urban hacker legend?

Please CC me as i am not subscribed to lkml.

  Luis

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Using unsigned int for loop counters - better performance for Architectures - urban hacker legend?
  2010-07-30 23:53 Using unsigned int for loop counters - better performance for Architectures - urban hacker legend? Luis R. Rodriguez
@ 2010-07-31  9:38 ` David Newall
       [not found]   ` <B7132A25476D334D9130FE7532F2A563109FED150D@SC1EXMB-MBCL.global.atheros.com>
  0 siblings, 1 reply; 4+ messages in thread
From: David Newall @ 2010-07-31  9:38 UTC (permalink / raw)
  To: Luis R. Rodriguez; +Cc: linux-kernel, doug.dahlby, Jiri Slaby

I don't know about unsigned int; but looping down to zero should be 
faster on many architectures, as they automatically test for zero (or 
negative) on decrement.  Counting up requires an additional test.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Using unsigned int for loop counters - better performance for Architectures - urban hacker legend?
       [not found]       ` <B7132A25476D334D9130FE7532F2A563109FED151A@SC1EXMB-MBCL.global.atheros.com>
@ 2010-08-02 20:48         ` Luis R. Rodriguez
  2010-08-03 10:06           ` David Newall
  0 siblings, 1 reply; 4+ messages in thread
From: Luis R. Rodriguez @ 2010-08-02 20:48 UTC (permalink / raw)
  To: Doug Dahlby; +Cc: Luis Rodriguez, davidn, linux-kernel, mcgrof, jirislaby

Doug, I'm adding your response to lkml as its the best answer I've gotten so far.

On Mon, Aug 02, 2010 at 01:10:01PM -0700, Doug Dahlby wrote:
> Luis,
> 
> Just out of curiousity, I looked at what gcc does on my own x86 computer.
> When compiled regularly, the loop bodies are practically identical:
> 
> $ more loop_test1.c loop_test1.s loop_test2.c loop_test2.s
> ::::::::::::::
> loop_test1.c
> ::::::::::::::
> int foo(int limit)
> {
>     int i = 0;
>     for (; limit > 0; limit--) {
>         i += 1;
>     }
>     return i;
> }
> ::::::::::::::
> loop_test1.s
> ::::::::::::::
>         .file   "loop_test1.c"
>         .text
> .globl _foo
>         .def    _foo;   .scl    2;      .type   32;     .endef
> _foo:
>         pushl   %ebp
>         movl    %esp, %ebp
>         subl    $4, %esp
>         movl    $0, -4(%ebp)
> L2:
>         cmpl    $0, 8(%ebp)
>         jle     L3
>         leal    -4(%ebp), %eax
>         incl    (%eax)
>         decl    8(%ebp)
>         jmp     L2
> L3:
>         movl    -4(%ebp), %eax
>         leave
>         ret
> ::::::::::::::
> loop_test2.c
> ::::::::::::::
> int foo(unsigned limit)
> {
>     int i = 0;
>     for (; limit > 0; limit--) {
>         i += 1;
>     }
>     return i;
> }
> ::::::::::::::
> loop_test2.s
> ::::::::::::::
>         .file   "loop_test2.c"
>         .text
> .globl _foo
>         .def    _foo;   .scl    2;      .type   32;     .endef
> _foo:
>         pushl   %ebp
>         movl    %esp, %ebp
>         subl    $4, %esp
>         movl    $0, -4(%ebp)
> L2:
>         cmpl    $0, 8(%ebp)
>         je      L3
>         leal    -4(%ebp), %eax
>         incl    (%eax)
>         decl    8(%ebp)
>         jmp     L2
> L3:
>         movl    -4(%ebp), %eax
>         leave
>         ret
> 
> but when I compile with -O3, there is a little difference:
> 
> ::::::::::::::
> loop_test1.s
> ::::::::::::::
>         .file   "loop_test1.c"
>         .text
>         .p2align 4,,15
> .globl _foo
>         .def    _foo;   .scl    2;      .type   32;     .endef
> _foo:
>         pushl   %ebp
>         xorl    %eax, %eax
>         movl    %esp, %ebp
>         movl    8(%ebp), %edx
>         jmp     L10
>         .p2align 4,,7
> L12:
>         incl    %eax
>         decl    %edx
> L10:
>         testl   %edx, %edx
>         jg      L12
>         popl    %ebp
>         ret
> ::::::::::::::
> loop_test2.s
> ::::::::::::::
>         .file   "loop_test2.c"
>         .text
>         .p2align 4,,15
> .globl _foo
>         .def    _foo;   .scl    2;      .type   32;     .endef
> _foo:
>         pushl   %ebp
>         xorl    %eax, %eax
>         movl    %esp, %ebp
>         movl    8(%ebp), %edx
>         testl   %edx, %edx
>         jmp     L10
>         .p2align 4,,7
> L12:
>         incl    %eax
>         decl    %edx
> L10:
>         jne     L12
>         popl    %ebp
>         ret
> 
> Looks like the compiler is explicity testing the unsigned counter
> against zero, but uses the status bits set as a byproduct of the
> loop counter decrement for the unsigned case.  When I run these
> 2 functions repeatedly, the unsigned counter takes about 70% of
> the time of the signed counter.  This roughly matches the ratio
> of the 3 loop body statements in the unsigned case to the 4
> statements in the signed case.  This is not a rigorous test, and
> this may be specific to my architecture and my compiler settings
> (default + -O3), but it appears that there is some validity to
> make a general habit of using unsigned loop counters rather
> than signed. That being said, I'd be surprised if we have loops that
>
> (a) are dominated by the looping overhead rather than the operations
> in the loop body, and
> (b) iterate such a large number of times that they take up an non-negligible
> amount of the driver's CPU use. 
>
> So it looks to me like this is a good policy to recommend, but not one
> that needs across-the-board adherence.

Awesome, thanks!

  Luis

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Using unsigned int for loop counters - better performance for Architectures - urban hacker legend?
  2010-08-02 20:48         ` Luis R. Rodriguez
@ 2010-08-03 10:06           ` David Newall
  0 siblings, 0 replies; 4+ messages in thread
From: David Newall @ 2010-08-03 10:06 UTC (permalink / raw)
  To: Luis R. Rodriguez; +Cc: Doug Dahlby, linux-kernel, mcgrof, jirislaby

Luis R. Rodriguez wrote:
> Doug, I'm adding your response to lkml as its the best answer I've gotten so far.

If I may point out, Doug's result is correct for his combination of CPU 
and compiler, but not necessarily for other combinations. It is not 
something that should be promulgated unless you caveat by architecture 
and even version of compiler. As Linux is architecture neutral, the 
question about performance of signed versus unsigned is irrelevant 
except in architecture-specific code, and you would code it in 
assembler, not in C.

Don't recommend signed versus unsigned for reason of efficiency, only 
for reason of clarity or accuracy.

When I compile Doug's function for ia32 using gcc (Ubunut 
4.3.3-5ubuntu4) with -O3, I get totally different code emitted than he:

signed_count_down:
	pushl	%ebp
	movl	%esp, %ebp
	movl	8(%ebp), %eax
	popl	%ebp
	movl	%eax, %edx
	sarl	$31, %edx
	notl	%edx
	andl	%edx, %eax
	ret

The exact same code is emitted if the loop is changed to count up (i.e; 
int j; for (j = 0; j < limit; j++) ...)

By contrast, the code emitted when using unsigned, regardless of whether 
counting up or down, is:

unsigned_count_up:
	pushl	%ebp
	movl	%esp, %ebp
	movl	8(%ebp), %eax
	popl	%ebp
	ret

I find this code interesting because it contains no loop and no test. It 
works because, in the case of unsigned loop, the function returns its 
input. When limit is signed, the function returns its input if that is 
positive, otherwise it returns zero. Observe that the unsigned code is 
identical to the signed but for the addition of the four instructions 
before the (unsigned code's) ret.

I think all instructions on IA32 take multiple stages to execute. These 
stages occur inside a "pipeline;" op-codes are fed in one end and, a few 
cycles later, the result pops out the other. To make the machine, IA32 
uses multiple execution units in parallel, i.e. multiple pipelines, with 
each unit running one stage behind, and executing the instruction 
following, the previous execution unit. Some instructions, however, 
cannot be executed in parallel. Consider a conditional-branch: should 
the next instruction be the target of the branch or the one that follows 
the branch? Whichever answer you think is best, some of the time it will 
be wrong and when that occurs the second and subsequent pipelines must 
be flushed, thus stalling the CPU (that is, it produces no results) for 
some number of cycles. So it can be understood that elimination of the 
branches gives a big performance boost on ia32.

Consider the four extra instructions in the signed code: These load 
limit into a register and then shift right by 31 bits. If limit is 
negative, it's 32nd bit will be a one, otherwise a zero. The shift moves 
that 32nd bit into the remaining 31, giving -1 for negative limit, 
otherwise zero. The notl instruction inverts this value; thus the 
function returns 0&limit (which is always zero) for negative limit, 
otherwise -1&limit (which is always limit.) It's a clever optimisation; 
but one which would be wrong to write except in very limited circumstances.

When coding, if you must choose between the two, you should usually 
write something which is easily understood rather than something which 
is fast. Fast-but-confusing code is likely to bite someone down the 
track, and as we've seen, might not even produce the efficient machine 
code that you expect.

Here is my favourite example of what not to do. Your task is to name it.

void ???(int i, int j) { i ^= j ^= i ^= j; }

As Knuth (probably) said, "Premature optimization is the root of all evil."

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2010-08-03 10:06 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-07-30 23:53 Using unsigned int for loop counters - better performance for Architectures - urban hacker legend? Luis R. Rodriguez
2010-07-31  9:38 ` David Newall
     [not found]   ` <B7132A25476D334D9130FE7532F2A563109FED150D@SC1EXMB-MBCL.global.atheros.com>
     [not found]     ` <20100802193712.GB8920@tux>
     [not found]       ` <B7132A25476D334D9130FE7532F2A563109FED151A@SC1EXMB-MBCL.global.atheros.com>
2010-08-02 20:48         ` Luis R. Rodriguez
2010-08-03 10:06           ` David Newall

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox