public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Avi Kivity <avi@argo.co.il>
To: Willy Tarreau <w@1wt.eu>
Cc: Andi Kleen <andi@firstfloor.org>,
	Kyle Moffett <mrmacman_g4@mac.com>,
	Lennart Sorensen <lsorense@csclub.uwaterloo.ca>,
	Ben Crowhurst <Ben.Crowhurst@stellatravel.co.uk>,
	linux-kernel@vger.kernel.org
Subject: Re: Kernel Development & Objective-C
Date: Tue, 04 Dec 2007 23:07:05 +0200	[thread overview]
Message-ID: <4755C179.8050101@argo.co.il> (raw)
In-Reply-To: <20071203211353.GA4009@1wt.eu>

Willy Tarreau wrote:
>>>>    
>>>>         
>>> With 10Gbit/s ethernet working you start to care about every cycle.
>>>  
>>>       
>> If you have 10M packets/sec no amount of cycle-saving will help you.  
>> You need high level optimizations like TSO.  I'm not saying we should 
>> sacrifice cycles like there's no tomorrow, but the big wins are elsewhere.
>>     
>
> Huh? At 4 GHz, you have 400 cycles to process each packet. If you need to
> route those packets, those cycles may just be what you need to lookup a
> forwarding table and perform a few MMIO on an accelerated chip which will
> take care of the transfer. But you need those cycles. If you start to waste
> them 30 by 30, the performance can drop by a critical factor.
>
>   

I really doubt Linux spends 400 cycles routing a packet.  Look what an 
skbuff looks like.

A flood ping to localhost on a 2GHz system takes 8 microseconds, that's 
16,000 cycles.  Sure it involves userspace, but you're about two orders 
of magnitude off.  And the localhost interface is nicely cached in L1 
without mmio at all, unlike real devices.

>>> Another simple noticeable case is Unix
>>> sockets and your X server communication.
>>>       
>> Your reflexes are *much* better than mine if you can measure half a 
>> nanosecond on X.
>>     
>
> It just depends how many times a second it happens. For instance, consider
> this trivial loop (fct is a two-function array which just return 1 or 2) :
>
>         i = 0;
>         for (j = 0; j < (1 << 28); j++) {
>                 k = (j >> 8) & 1;
>                 i += fct[k]();
>         }
>
> It takes 1.6 seconds to execute on my athlon-xp 1.5 GHz. If, instead of
> changing the function once every 256 calls, you change it to every call :
>
>         i = 0;
>         for (j = 0; j < (1 << 28); j++) {
>                 k = (j >> 0) & 1;
>                 i += fct[k]();
>         }
>
> Then it only takes 4.3 seconds, which is about 3 times slower. The number
> of calls per function remains the same (128M calls each), it's just the
> branch prediction which is wrong every time. The very few nanoseconds added
> at each call are enough to slow down a program from 1.6 to 4.3 seconds while
> it executes the exact same code (it may even save one shift). If you have
> such stupid code, say, to compute the color or alpha of each pixel in an
> image, you will certainly notice the difference.
>
>   

This happens very often in HPC, and when it does, it is often worthwhile 
to invest in manual optimizations or even assembly coding.  
Unfortunately it is very rare in the kernel (memcmp, raid xor, what 
else?).  Loops with high iteration counts are very rare, so any 
attention you give to the loop body is not amortized over a large number 
of executions.

> And such poorly efficient code may happen very often when you blindly rely
> on function pointers instead of explicit calls.
>   

Using an indirect call where a direct call is sufficient will also 
reduce the compiler's optimization opportunities.  However, I don't see 
anyone recommending it in the context of systems programming.

It is not true that the number of indirect calls necessarily increases 
if you use a language other than C.

(Actually, with templates you can reduce the number of indirect calls)

>> Here, it's scheduling that matters, avoiding large transfers, and 
>> avoiding ping-pongs, not some cycles on the unix domain socket.  You 
>> already paid 150 cycles or so by issuing the syscall and thousands for 
>> copying the data, 50 more won't be noticeable except in nanobenchmarks.
>>     
>
> You are forgetting something very important : once you start stacking
> functions to perform the dirty work for you, you end up with so much
> abstraction that even new stupid code cannot be written at all without
> relying on them, and it's where the problem takes its roots, because
> when you need to write a fast function and you notice that you cannot
> touch a variable without passing through a slow pinhole, your fast
> function will remain slow whatever you do, and the worst of all is that
> you will think that it is normally fast and that it cannot be written
> faster.
>
>   

I don't understand.  Can you give an example?

There are two cases where abstraction hurts performance: the first is 
where the mechanisms used to achieve the abstraction (functions instead 
of direct access to variables, function pointers instead of duplicating 
the caller) introduce performance overhead.  I don't think C has any 
advantage here -- actually a disadvantage as it lacks templates and is 
forced to use function pointers for nontrivial cases.  Usually the 
abstraction penalty is nil with modern compilers.

The second case is where too much abstraction clouds the programmer's 
mind.  But this is independent of the programming language.


>>> And there are some special cases where block IO is also pretty critical.
>>> A popular one is TPC-* benchmarking, but there are also others and it 
>>> looks likely in the future that this will become more critical
>>> as block devices become faster (e.g. highend SSDs) 
>>>  
>>>       
>> And again the key is batching, improving cpu affinity, and caching, not 
>> looking for a faster instruction sequence.
>>     
>
> Every cycle burned is definitely lost. The time cannot go backwards. So
> for each cycle that you lose to laziness, you have to become more and more
> clever to find out how to write an alternative. Lazy people simply put
> caches everywhere and after that they find normal that "hello world" requires
> 2 Gigs of RAM to be displayed. 

A 100 byte program will print "hello world" on a UART and stop.  A 
modern program will load a vector description of a font, scale it to the 
desired size, render it using anti aliasing and sub-pixel positioning, 
lay it out according to the language rules of whereever you live, and 
place it on a multi-megabyte frame buffer.  Yes it needs hundreds of 
megabytes and lots of nasty algorithms to do that.

> The only true solution is to create better
> algorithms, but you will find even less people capable of creating efficient
> algorithms than you will find capable of coding correctly.
>
>   

That is true, that is why we see a lot more microoptimizations than 
algorithmic progress.

But if you want a fast streaming filesystem you choose XFS over ext3, 
even though the latter is much smaller and easier to optimize.  If you 
write a network server you choose epoll() instead of trying to optimize 
select() somehow.  True algorithmic improvements are rare but they are 
the ones that are actually measurable.


>>> For example there are some CPUs who are relatively slow at indirect
>>> function calls and there are actually cases where this can be measured.
>>>       
>> That is true.  But any self-respecting systems language will let you 
>> choose between direct and indirect calls.
>>
>> If adding an indirect call allows you to avoid even 1% of I/O, you save 
>> much more than you lose, so again the high level optimizations win.
>>     
>
> It depends which type of I/O. If the I/O is non-blocking, you end up doing
> something else instead of actively burning cycles.
>
>   

Unless you are I/O bound, which is usually the case when you have 2GHz 
cpus driving 200Hz disks.

>> Nanooptimizations are fun (I do them myself, I admit) but that's not 
>> where performance as measured by the end user lies.
>>     
>
> I do not agree. It's not uncommon to find 2- or 3-fold performance factors
> between equivalent components when one is carefully optimized and the other
> one is not. Granted it takes an awful lot of time doing all those nano-opts
> at the beginning, but the more you learn about how the hardware reacts to
> your code, the more efficiently you write future code, with the fewest bloat.
> End users notice bloat a lot (especially when CPU and RAM are excessively
> wasted).
>   

Can you give an example of a 2- or 3- fold factor on an end-user 
workload achieved by microopts?

I agree about bloat.


  parent reply	other threads:[~2007-12-04 21:17 UTC|newest]

Thread overview: 57+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-11-29 12:14 Kernel Development & Objective-C Ben Crowhurst
2007-11-30 10:02 ` Xavier Bestel
2007-11-30 10:09   ` KOSAKI Motohiro
2007-11-30 10:20     ` Xavier Bestel
2007-11-30 10:54       ` Jan Engelhardt
2007-11-30 14:21         ` David Newall
2007-11-30 23:31           ` Bill Davidsen
2007-11-30 23:40             ` Alan Cox
2007-12-01  0:05               ` Arnaldo Carvalho de Melo
2007-12-01 18:27               ` Bill Davidsen
2007-12-01 18:18                 ` Alan Cox
2007-12-03  1:23                   ` Bill Davidsen
2007-11-30 22:52     ` J.A. Magallón
2007-11-30 10:29 ` Loïc Grenié
2007-11-30 11:16   ` Ben Crowhurst
2007-11-30 11:36     ` Karol Swietlicki
2007-11-30 14:37     ` Lennart Sorensen
2007-12-08  8:54     ` Rogelio M. Serrano Jr.
2007-11-30 23:19   ` J.A. Magallón
2007-11-30 23:53     ` Nicholas Miell
2007-12-01  0:31     ` Al Viro
2007-12-01  0:34       ` Al Viro
2007-12-01  1:09       ` J.A. Magallón
2007-12-01 19:55       ` Avi Kivity
2007-12-04 17:54     ` Lennart Sorensen
2007-12-04 21:10       ` Avi Kivity
2007-12-04 21:24       ` J.A. Magallón
2007-11-30 11:37 ` Matti Aarnio
2007-11-30 14:34 ` Lennart Sorensen
2007-11-30 15:26   ` Kyle Moffett
2007-11-30 18:40     ` H. Peter Anvin
2007-11-30 19:35       ` Kyle Moffett
2007-12-01 20:03     ` Avi Kivity
2007-12-02 19:01       ` Andi Kleen
2007-12-03  5:12         ` Avi Kivity
2007-12-03  9:50           ` Andi Kleen
2007-12-03 11:46             ` Avi Kivity
2007-12-03 11:50               ` Andi Kleen
2007-12-03 21:13               ` Willy Tarreau
2007-12-03 21:39                 ` J.A. Magallón
2007-12-03 21:57                   ` Alan Cox
2007-12-04 21:47                     ` J.A. Magallón
2007-12-04 22:20                       ` Diego Calleja
2007-12-05 10:59                         ` Giacomo A. Catenazzi
2007-12-04 21:07                 ` Avi Kivity [this message]
2007-12-04 22:43                   ` Willy Tarreau
2007-12-05 17:05                     ` Micro vs macro optimizations (was: Re: Kernel Development & Objective-C) Avi Kivity
2007-12-03 12:35           ` Kernel Development & Objective-C Gilboa Davara
2007-12-03 12:44             ` Gilboa Davara
2007-12-03 16:28             ` Casey Schaufler
2007-12-04 17:50             ` Lennart Sorensen
2007-12-05 10:31               ` Gilboa Davara
2007-12-01 19:59   ` Avi Kivity
2007-12-02 19:44     ` Jörn Engel
2007-12-03 16:53     ` Lennart Sorensen
2007-11-30 15:00 ` Chris Snook
2007-12-01  9:50   ` David Newall

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4755C179.8050101@argo.co.il \
    --to=avi@argo.co.il \
    --cc=Ben.Crowhurst@stellatravel.co.uk \
    --cc=andi@firstfloor.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lsorense@csclub.uwaterloo.ca \
    --cc=mrmacman_g4@mac.com \
    --cc=w@1wt.eu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox