Re: [PATCH] tcp_cubic: use 32 bit math

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Stephen Hemminger <shemminger@linux-foundation.org>
To: Willy Tarreau <w@1wt.eu>
Cc: David Miller <davem@davemloft.net>,
	rkuhn@e18.physik.tu-muenchen.de, andi@firstfloor.org,
	dada1@cosmosbay.com, jengelh@linux01.gwdg.de,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org
Subject: Re: [PATCH] tcp_cubic: use 32 bit math
Date: Wed, 21 Mar 2007 11:54:19 -0700	[thread overview]
Message-ID: <20070321115419.483a655a@freekitty> (raw)
In-Reply-To: <20070313205020.GA5418@1wt.eu>

On Tue, 13 Mar 2007 21:50:20 +0100
Willy Tarreau <w@1wt.eu> wrote:

> Hi Stephen,
> 
> On Mon, Mar 12, 2007 at 02:11:56PM -0700, Stephen Hemminger wrote:
> > > Oh BTW, I have a newer version with a first approximation of the
> > > cbrt() before the div64_64, which allows us to reduce from 3 div64
> > > to only 2 div64. This results in a version which is twice as fast
> > > as the initial one (ncubic), but with slightly less accuracy (0.286%
> > > compared to 0.247). But I see that other functions such as hcbrt()
> > > had a 1.5% avg error, so I think this is not dramatic.
> > 
> > Ignore my hcbrt() it was a less accurate version of andi's stuff.
> 
> OK.
> 
> > > Also, I managed to remove all other divides, to be kind with CPUs
> > > having a slow divide instruction or no divide at all. Since we compute
> > > on limited range (22 bits), we can multiply then shift right. It shows
> > > me even slightly better time on pentium-m and athlon, with a slightly
> > > higher avg error (0.297% compared to 0.286%), and slightly smaller
> > > code.
> > 
> > What does the code look like?
> 
> Well, I have cleaned it a little bit, there were more comments and ifdefs
> than code ! I've appended it to the end of this mail.
> 
> I have changed it a bit, because I noticed that integer divide precision
> was so coarse that there were other possibilities to play with the bits.
> 
> I have experimented with combinations of several methods :
>   - replace integer divides with multiplies/shifts where possible.
> 
>   - compensation for divide imprecisions by adding/removing small
>     values bofore/after them. Often, the integer result of 1/(x*(x-1))
>     is closer to (float)1/(float)x^2 than 1/(x*x). This is because
>     the divide always truncates the result.
> 
>   - use direct result lookup for small values. Small inputs give small
>     outputs which have very few moving bits. Many different values fit
>     in a 32bit integer, so we use a shift offset to lookup the value.
>     I used this in an fls function I wrote a while ago, that I should
>     also post because it is up to twice as fast as the kernel's.
>     Sometimes it seems faster to lookup in from memory, sometimes it
>     is faster from an immediate value. Maybe more visible differences
>     would show up on RISC CPUs where loading 32 bits immediate needs
>     two instructions. I don't know yet, I've not tested on my sparc
>     yet.
> 
>   - use small lookup tables (64 bytes) with 6 bits inputs and at least
>     as many on output. We only lookup the 6 MSB and return the 2-3 MSB
>     of the result.
> 
>   - iterative search and manual refinment of the lookup tables for best
>     accuracy. The avg error rate can easily be halved this way.
> 
> I have duplicated tried several functions with 0, 1, 2 and 3 divides.
> Several of them offer better accuracy over what we currently have, in
> less cycles. Others offer faster results (up to 5 times) with slightly
> less accuracy.
> 
> There is one function which is not to be used, but is just here for
> comparison (ncubic_0div). It does no divide but has awful avg error.
> 
> But one which is interesting is the ncubic_tab0. It does not use any
> divide at all, even not any div64. It shows a 0.6% avg error, which I'm
> not sure is enough or not. It is 6.7 times faster than initial ncubic()
> with less accuracy, and 4 times smaller. I suspect that it can differ
> more on architectures which have no divide instruction.
> 
> Is 0.6% avg error rate is too much, ncubic_tab1() uses one single div64
> and is twice slower (still nearly 3 times faster than ncubic). It show
> 0.195% avg error, which is better than initial ncubic. I think that it
> is a good tradeoff.
> 
> If best accuracy is an absolute requirement, then I have a variation of
> ncubic (ncubic_3div) which does 0.17% in 2/3 of the time (compared to
> 0.247%), and which is slightly smaller.
> 
> I have also added a "size" column, indicating approximative function
> size, provided that the compiler does not reorder the code. On gcc 3.4,
> it's OK, but 4.1 returns garbage. That does not matter, it's just a
> rough estimate anyway.
> 
> Here are the results classed by speed :
> 
> /* Sample output on a Pentium-M 600 MHz :
> 
> Function          clocks mean(us)  max(us)  std(us) Avg err size
> ncubic_tab0           79     0.66     7.20     1.04  0.613%  160
> ncubic_0div           84     0.70     7.64     1.57  4.521%  192
> ncubic_1div          178     1.48    16.27     1.81  0.443%  336
> ncubic_tab1          179     1.49    16.34     1.85  0.195%  320
> ncubic_ndiv3         263     2.18    24.04     3.59  0.250%  512
> ncubic_2div          270     2.24    24.70     2.77  0.187%  512
> ncubic32_1           359     2.98    32.81     3.59  0.238%  544
> ncubic_3div          361     2.99    33.08     3.79  0.170%  656
> ncubic32             364     3.02    33.29     3.51  0.247%  544
> ncubic               529     4.39    48.39     4.92  0.247%  720
> hcbrt                539     4.47    49.25     5.98  1.580%   96
> ocubic               732     4.93    61.83     7.22  0.274%  320
> acbrt                842     6.98    76.73     8.55  0.275%  192
> bictcp              1032     6.95    86.30     9.04  0.172%  768
> 
> And now by avg error :
> 
> ncubic_3div          361     2.99    33.08     3.79  0.170%  656
> bictcp              1032     6.95    86.30     9.04  0.172%  768
> ncubic_2div          270     2.24    24.70     2.77  0.187%  512
> ncubic_tab1          179     1.49    16.34     1.85  0.195%  320
> ncubic32_1           359     2.98    32.81     3.59  0.238%  544
> ncubic               529     4.39    48.39     4.92  0.247%  720
> ncubic32             364     3.02    33.29     3.51  0.247%  544
> ncubic_ndiv3         263     2.18    24.04     3.59  0.250%  512
> ocubic               732     4.93    61.83     7.22  0.274%  320
> acbrt                842     6.98    76.73     8.55  0.275%  192
> ncubic_1div          178     1.48    16.27     1.81  0.443%  336
> ncubic_tab0           79     0.66     7.20     1.04  0.613%  160
> hcbrt                539     4.47    49.25     5.98  1.580%   96
> ncubic_0div           84     0.70     7.64     1.57  4.521%  192
> 
> 
> And here comes the code. I have tried to document it a bit, as much
> as can be done on experimentation code. It is often easier to use
> a pencil and paper to understand how the bits move.
> 
> Regards,
> Willy
> 

The following version of div64_64 is faster because do_div already
optimized for the 32 bit case..

I get the following results on ULV Core Solo (ie slow current processor)
and the following on 64bit Core Duo. ncubic_tab1 seems like
the best (no additional error and about as fast)

ULV Core Solo

Function          clocks mean(us)  max(us)  std(us) Avg err size
ncubic_tab0          192    11.24    45.10    15.28  0.450% -2262
ncubic_0div          201    11.77    47.23    27.40  3.357% -2404
ncubic_1div          324    19.02    76.32    25.82  0.189% -2567
ncubic_tab1          326    19.13    76.73    23.71  0.043% -2059
ncubic_2div          456    26.72   108.92   493.16  0.028% -2790
ncubic_ndiv3         463    27.15   133.37  1889.39  0.104% -3344
ncubic32             549    32.18   130.59   508.97  0.041% -3794
ncubic32_1           574    33.66   138.32   548.48  0.029% -3604
ncubic_3div          581    34.04   140.24   608.55  0.018% -3050
ncubic               733    42.92   173.35   523.19  0.041%  299
ocubic              1046    61.25   283.68  3305.65  0.027% -2232
acbrt               1149    67.32   284.91  1941.55  0.029%  168
bictcp              1663    97.41   394.29   604.86  0.017%  628

Core 2 Duo

Function          clocks mean(us)  max(us)  std(us) Avg err size
ncubic_0div           74     0.03     1.60     0.07  3.357% -2101
ncubic_tab0           74     0.03     1.60     0.04  0.450% -2029
ncubic_1div          142     0.07     3.11     1.05  0.189% -2195
ncubic_tab1          144     0.07     3.18     1.02  0.043% -1638
ncubic_2div          216     0.10     4.74     1.07  0.028% -2326
ncubic_ndiv3         219     0.10     4.76     1.04  0.104% -2709
ncubic32             269     0.13     5.87     1.13  0.041% -1500
ncubic32_1           272     0.13     5.92     1.10  0.029% -2881
ncubic               273     0.13     5.96     1.13  0.041% -1763
ncubic_3div          290     0.14     6.32     1.01  0.018% -2499
acbrt                430     0.20     9.42     1.18  0.029%   77
ocubic               444     0.21     9.82     1.82  0.027% -1924
bictcp               549     0.26    12.06     1.68  0.017%  236



-- 
Stephen Hemminger <shemminger@linux-foundation.org>

next prev parent reply	other threads:[~2007-03-21 18:59 UTC|newest]

Thread overview: 62+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-02-24  1:05 [RFC] div64_64 support Stephen Hemminger
2007-02-24 16:19 ` Sami Farin
2007-02-26 19:28   ` Stephen Hemminger
2007-02-26 19:39     ` David Miller
2007-02-26 20:09 ` Jan Engelhardt
2007-02-26 21:28   ` Stephen Hemminger
2007-02-27  1:20     ` H. Peter Anvin
2007-02-27  3:45       ` Segher Boessenkool
2007-02-26 22:31   ` Stephen Hemminger
2007-02-26 23:02     ` Jan Engelhardt
2007-02-26 23:44       ` Stephen Hemminger
2007-02-27  0:05         ` Jan Engelhardt
2007-02-27  0:07           ` Stephen Hemminger
2007-02-27  0:14             ` Jan Engelhardt
2007-02-27  6:21     ` Dan Williams
2007-03-03  2:31     ` Andi Kleen
2007-03-05 23:57       ` Stephen Hemminger
2007-03-06  0:25         ` David Miller
2007-03-06 13:36           ` Andi Kleen
2007-03-06 14:04           ` [RFC] div64_64 support II Andi Kleen
2007-03-06 17:43             ` Dagfinn Ilmari Mannsåker
2007-03-06 18:25               ` David Miller
2007-03-06 18:48             ` H. Peter Anvin
2007-03-06 13:34         ` [RFC] div64_64 support Andi Kleen
2007-03-06 14:19           ` Eric Dumazet
2007-03-06 14:45             ` Andi Kleen
2007-03-06 15:10               ` Roland Kuhn
2007-03-06 18:29                 ` Stephen Hemminger
2007-03-06 19:48                   ` Andi Kleen
2007-03-06 20:04                     ` Stephen Hemminger
2007-03-06 21:53                   ` Sami Farin
2007-03-06 22:24                     ` Sami Farin
2007-03-07  0:00                       ` Stephen Hemminger
2007-03-07  0:05                         ` David Miller
2007-03-07  0:05                         ` Sami Farin
2007-03-07 16:11                       ` Chuck Ebbert
2007-03-07 18:32                         ` Sami Farin
2007-03-08 18:23                       ` asm volatile [Was: [RFC] div64_64 support] Sami Farin
2007-03-08 22:01                         ` asm volatile David Miller
2007-03-06 21:58                   ` [RFC] div64_64 support David Miller
2007-03-06 22:47                     ` [PATCH] tcp_cubic: faster cube root Stephen Hemminger
2007-03-06 22:58                       ` cube root benchmark code Stephen Hemminger
2007-03-07  6:08                         ` Update to " Willy Tarreau
2007-03-08  1:07                           ` [PATCH] tcp_cubic: use 32 bit math Stephen Hemminger
2007-03-08  2:55                             ` David Miller
2007-03-08  3:10                               ` Stephen Hemminger
2007-03-08  3:51                                 ` David Miller
2007-03-10 11:48                                   ` Willy Tarreau
2007-03-12 21:11                                     ` Stephen Hemminger
2007-03-13 20:50                                       ` Willy Tarreau
2007-03-21 18:54                                         ` Stephen Hemminger [this message]
2007-03-21 19:15                                           ` Willy Tarreau
2007-03-21 19:58                                             ` Stephen Hemminger
2007-03-21 20:15                                             ` [PATCH 1/2] div64_64 optimization Stephen Hemminger
2007-03-21 20:17                                               ` [PATCH 2/2] tcp: cubic optimization Stephen Hemminger
2007-03-22 19:11                                                 ` David Miller
2007-03-22 19:11                                               ` [PATCH 1/2] div64_64 optimization David Miller
2007-03-08  4:16                                 ` [PATCH] tcp_cubic: use 32 bit math Willy Tarreau
2007-03-07  4:20                       ` [PATCH] tcp_cubic: faster cube root David Miller
2007-03-07 12:12                         ` Andi Kleen
2007-03-07 19:33                           ` David Miller
2007-03-06 18:50               ` [RFC] div64_64 support H. Peter Anvin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20070321115419.483a655a@freekitty \
    --to=shemminger@linux-foundation.org \
    --cc=andi@firstfloor.org \
    --cc=dada1@cosmosbay.com \
    --cc=davem@davemloft.net \
    --cc=jengelh@linux01.gwdg.de \
    --cc=linux-kernel@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=rkuhn@e18.physik.tu-muenchen.de \
    --cc=w@1wt.eu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.