crush: straw is dead, long live straw2

All of lore.kernel.org
 help / color / mirror / Atom feed

* crush: straw is dead, long live straw2
@ 2014-12-08 23:48 Sage Weil
  2014-12-09  0:21 ` Josh Durgin
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Sage Weil @ 2014-12-08 23:48 UTC (permalink / raw)
  To: ceph-devel

Hey everyone,

When I was writing the original CRUSH code ages ago, I made several 
different bucket types, each using a different 'choose' algorithm for 
pseudorandomly selecting an item.  Most of these were modeled after the 
original RUSH algorithms from RJ Honicky, but there was one new bucket 
type I called 'straw' with some new properties that I was rather proud of:

 1- items could have any weight.
 2- O(n) to choose an item.
 3- if an item's weight was adjusted up or down, mappings would either 
move to or from the adjusted item, but never between other unmodified 
items in the bucket.

The basic motivation for the process to look like drawing straws, where 
the longest straw would win, except that each item would have their random 
straw lengths scaled based on their weight.

This is all fine and good, except that it turns out the third property 
doesn't actually hold true!  The amount that the random straw lengths got 
scaled turned out to be a complicated function of the other weights in the 
bucket.  Although the placement calculation is pretty clearly independent 
for each item, a change in the weights affects the scaling factors for 
other items, which means that a weight change could affect the placement 
of items between other items.

In retrospect this should have been obvious, but it turns out nobody has 
looked too carefully at this bit of code or the underlying algorithm for 
some 8 years.  It took a customer making a deliberately miniscule change 
and seeing a huge data movement to dig into this bit of behavior.  Also in 
retrospect, a well written test that verified the third property held true 
for various changes would have caught it, but my testing at the time was 
limited to strange combinations of weights.

We suspect that this has actually been noticed by lots of people who have 
complained about more data moving after a weight change than they 
expected.  These sorts of reports are hard to quantify and investigate and 
are easy to ignore.  Sigh.

Anyway, that's the bad news.

The good news is that we figured out how to fix the placement algorithm to 
get all three properties.  The new bucket type will be called 'straw2' 
(unless someone has a better suggestion).  Here's the difference.

For straw buckets, the choose function looks roughly like this:

 max_x = -1
 max_item = -1
 for each item:
    x = random value from 0..65535
    x *= scaling factor
    if x > max_x:
       max_x = x
       max_item = item
 return item

The issue, as I mentioned, is that the scaling factor is a strange 
function of *all* of the other item weights, which means a change in a 
single item A's weight could affect the relative distribution of items 
previously on B and C.

The new straw2 bucket works like this:

 max_x = -1
 max_item = -1
 for each item:
    x = random value from 0..65535
    x = ln(x / 65536) / weight
    if x > max_x:
       max_x = x
       max_item = item
 return item

That ln() is a natural log (well, a 16-bit fixed-point approximation of 
it) and as you can see it's a simple function of the weight of that item.  
That means that changing one item's weight won't have any effect on the 
straw lengths for other items, so a change will either make the item win 
or not win but won't change who the other winner is in the not-win case.

Somewhat embarassingly, I stumbled onto this solution half by accident.  
Sam found the underlying principle that makes it work:

 http://en.wikipedia.org/wiki/Exponential_distribution#Distribution_of_the_minimum_of_exponential_random_variables

Calculating the ln() function is a bit annoying because it is a floating 
point function and CRUSH is all fixed-point arithmetic (integer-based).  
The current draft implementation uses a 128 KB lookup table (2 bytes per 
entry for 16 bits of input precision).  It seems to be about 25% slower 
than the original straw code in my simple microbenchmark, but I'm not sure 
that's a number we should trust since it loops through a zillion inputs 
and will pull most of the lookup table into the processor caches.

We could probably try a few others things:

 - Lose half the precision to reduce the table size.  I don't think this 
will work well, and in most cases we'll still miss the CPU cache so the 
benefit is marginal at best.

 - Use floating point log function.  This is problematic for the kernel 
implementation (no floating point), is slower than the lookup table, and 
makes me worry about whether the floating point calculations are 
consistent across architectures (the mapping has to be completely 
deterministic).

 - Use some approximation of the logarithm with fixed-point arithmetic.  
I spent a bit of time search and found

 http://www.researchgate.net/publication/230668515_A_fixed-point_implementation_of_the_natural_logarithm_based_on_a_expanded_hyperbolic_CORDIC_algorithm

but it also involves a couple of lookup tables and (judging by figure 1) 
is probably slower than the 256 KB table.

We could probably expand out taylor series and get something half decent, 
but any precision we lose will translate to OSD utilizations that are off 
from the input weights--something that costs real disk space and we 
probably want to avoid.

 - Stick with the 128KB lookup table.  Performance sensitive clients can 
precalculate all PG mappings when they get OSDMap updates if they are 
concerned about leave the CRUSH calculation in the IO path.

Any other suggestions?

Here is my implementation of the lookup-table based approach:

	https://github.com/ceph/ceph/commit/1d462c9f6a262de3a51533193ed2dff34c730727
	https://github.com/ceph/ceph/commits/wip-crush-straw2

You'll notice that included in there is a unit test that verifies that 
changing a single item's weight does not effect the distribution of inputs 
among other items in the bucket.  :)

Thanks-
sage

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: crush: straw is dead, long live straw2
  2014-12-08 23:48 crush: straw is dead, long live straw2 Sage Weil
@ 2014-12-09  0:21 ` Josh Durgin
  2014-12-12  9:14 ` Thorsten Behrens
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Josh Durgin @ 2014-12-09  0:21 UTC (permalink / raw)
  To: Sage Weil, ceph-devel

On 12/08/2014 03:48 PM, Sage Weil wrote:
>   - Use floating point log function.  This is problematic for the kernel
> implementation (no floating point), is slower than the lookup table, and
> makes me worry about whether the floating point calculations are
> consistent across architectures (the mapping has to be completely
> deterministic).

This also won't work for QEMU, which may not restore floating point
modes while doing I/O (leading to crashes like 
http://tracker.ceph.com/issues/3521).

>   - Use some approximation of the logarithm with fixed-point arithmetic.
> I spent a bit of time search and found
>
>   http://www.researchgate.net/publication/230668515_A_fixed-point_implementation_of_the_natural_logarithm_based_on_a_expanded_hyperbolic_CORDIC_algorithm
>
> but it also involves a couple of lookup tables and (judging by figure 1)
> is probably slower than the 256 KB table.
>
> We could probably expand out taylor series and get something half decent,
> but any precision we lose will translate to OSD utilizations that are off
> from the input weights--something that costs real disk space and we
> probably want to avoid.
>
>   - Stick with the 128KB lookup table.  Performance sensitive clients can
> precalculate all PG mappings when they get OSDMap updates if they are
> concerned about leave the CRUSH calculation in the IO path.
>
> Any other suggestions?

It could be a lookup table generated (or chosen) at runtime to a
particular size configured by the crushmap, but that's probably more
complex than it's worth.

> Here is my implementation of the lookup-table based approach:
>
> 	https://github.com/ceph/ceph/commit/1d462c9f6a262de3a51533193ed2dff34c730727
> 	https://github.com/ceph/ceph/commits/wip-crush-straw2
>
> You'll notice that included in there is a unit test that verifies that
> changing a single item's weight does not effect the distribution of inputs
> among other items in the bucket.  :)


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: crush: straw is dead, long live straw2
  2014-12-08 23:48 crush: straw is dead, long live straw2 Sage Weil
  2014-12-09  0:21 ` Josh Durgin
@ 2014-12-12  9:14 ` Thorsten Behrens
  2014-12-12 14:46   ` Sage Weil
  2014-12-12 17:27 ` Yehuda Sadeh
  2017-01-27  0:26 ` Loic Dachary
  3 siblings, 1 reply; 13+ messages in thread
From: Thorsten Behrens @ 2014-12-12  9:14 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 976 bytes --]

Sage Weil wrote:
> Calculating the ln() function is a bit annoying because it is a floating 
> point function and CRUSH is all fixed-point arithmetic (integer-based).  
> The current draft implementation uses a 128 KB lookup table (2 bytes per 
> entry for 16 bits of input precision).  It seems to be about 25% slower 
> than the original straw code in my simple microbenchmark, but I'm not sure 
> that's a number we should trust since it loops through a zillion inputs 
> and will pull most of the lookup table into the processor caches.
> 
> We could probably try a few others things:
> 
[snip]

My Hacker's Delight copy has a number of further variants (on
pp. 215), unfortunately only covering log_2 (which is trivially the
number of leading zeros in the binary representation) and log_10. But
since the logarithms are simply related by a multiplicative constant,
something clever might be doable here with a bit of thinking.

Cheers,

-- Thorsten

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 966 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: crush: straw is dead, long live straw2
  2014-12-12  9:14 ` Thorsten Behrens
@ 2014-12-12 14:46   ` Sage Weil
  2014-12-12 15:39     ` Joe Landman
  2014-12-12 15:43     ` Milosz Tanski
  0 siblings, 2 replies; 13+ messages in thread
From: Sage Weil @ 2014-12-12 14:46 UTC (permalink / raw)
  To: Thorsten Behrens; +Cc: ceph-devel

On Fri, 12 Dec 2014, Thorsten Behrens wrote:
> Sage Weil wrote:
> > Calculating the ln() function is a bit annoying because it is a floating 
> > point function and CRUSH is all fixed-point arithmetic (integer-based).  
> > The current draft implementation uses a 128 KB lookup table (2 bytes per 
> > entry for 16 bits of input precision).  It seems to be about 25% slower 
> > than the original straw code in my simple microbenchmark, but I'm not sure 
> > that's a number we should trust since it loops through a zillion inputs 
> > and will pull most of the lookup table into the processor caches.
> > 
> > We could probably try a few others things:
> > 
> [snip]
> 
> My Hacker's Delight copy has a number of further variants (on
> pp. 215), unfortunately only covering log_2 (which is trivially the
> number of leading zeros in the binary representation) and log_10. But
> since the logarithms are simply related by a multiplicative constant,
> something clever might be doable here with a bit of thinking.

Any log will definitely do; the next steps are to do a division and then 
take the max.  Simply counting 0's isn't precise enough, though.. we need 
a lot more precision than that.

The table lookup is (I think) on the order of 50-100 cycles with a cold 
cache; that's what we need to beat!

sage


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: crush: straw is dead, long live straw2
  2014-12-12 14:46   ` Sage Weil
@ 2014-12-12 15:39     ` Joe Landman
  2014-12-12 16:20       ` Sage Weil
  2014-12-12 15:43     ` Milosz Tanski
  1 sibling, 1 reply; 13+ messages in thread
From: Joe Landman @ 2014-12-12 15:39 UTC (permalink / raw)
  To: Sage Weil, Thorsten Behrens; +Cc: ceph-devel

On 12/12/2014 09:46 AM, Sage Weil wrote:
> On Fri, 12 Dec 2014, Thorsten Behrens wrote:
>> Sage Weil wrote:
>>> Calculating the ln() function is a bit annoying because it is a floating
>>> point function and CRUSH is all fixed-point arithmetic (integer-based).
>>> The current draft implementation uses a 128 KB lookup table (2 bytes per
>>> entry for 16 bits of input precision).  It seems to be about 25% slower
>>> than the original straw code in my simple microbenchmark, but I'm not sure
>>> that's a number we should trust since it loops through a zillion inputs
>>> and will pull most of the lookup table into the processor caches.
>>>
>>> We could probably try a few others things:
>>>
>> [snip]
>>
>> My Hacker's Delight copy has a number of further variants (on
>> pp. 215), unfortunately only covering log_2 (which is trivially the
>> number of leading zeros in the binary representation) and log_10. But
>> since the logarithms are simply related by a multiplicative constant,
>> something clever might be doable here with a bit of thinking.
> Any log will definitely do; the next steps are to do a division and then
> take the max.  Simply counting 0's isn't precise enough, though.. we need
> a lot more precision than that.

You want integer log?  Looks like ln(x * 2^-16) from your original 
description., with x = ( 0 .. 2^16 )
So your log is going from undefined to 0 (ln 0 is not defined). Fire up 
gnuplot and do a quick "plot [0:1] (log(x)) " to see the shape of the 
function.

It can be a real valued function scaled in a particular way, then we can 
use a nice set of series to calculate it in FP (single precision if you 
like).    Conversion to int should be fast, and we could do this in 
SSE2/AVX to do multiple calcs at the same time.  I would think the 
random value portion of the code would be the slow part. Where are you 
getting your random source?  Could you not start that out with an 
exponential distribution rather than generate your own?

Aside from that, I'd suggest doing something like this (as you don't 
really need the ln(x/65536)/weight term, just the ln(x) term ... as you 
are effectively throwing away the value and just using side effects

  max_x = -1
  max_item = -1
  mult_const = ln(1/65536)/weight
  for each item:
     x = random value from 0..65535
     x = ln(x)
     if x > max_x:
        max_x = x
        max_item = item
  return item

The random value is likely to be an expensive portion of the algorithm 
... possibly have a thread to pre-queue a few thousand random values 
that you queue up in a ring buffer somewhere?

I'd be happy to code up a quick log calculator if that really is the 
most expensive portion of this.  I think we might be able to do the logs 
and even the comparisons in SSEx/AVX if you are open for that (exploit 
some loop unrolling), though we'd need a serial finishing loop to deal 
with the unrollable portion.

>
> The table lookup is (I think) on the order of 50-100 cycles with a cold
> cache; that's what we need to beat!
>
> sage
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: crush: straw is dead, long live straw2
  2014-12-12 14:46   ` Sage Weil
  2014-12-12 15:39     ` Joe Landman
@ 2014-12-12 15:43     ` Milosz Tanski
  1 sibling, 0 replies; 13+ messages in thread
From: Milosz Tanski @ 2014-12-12 15:43 UTC (permalink / raw)
  To: Sage Weil; +Cc: Thorsten Behrens, ceph-devel

On Fri, Dec 12, 2014 at 9:46 AM, Sage Weil <sweil@redhat.com> wrote:
> On Fri, 12 Dec 2014, Thorsten Behrens wrote:
>> Sage Weil wrote:
>> > Calculating the ln() function is a bit annoying because it is a floating
>> > point function and CRUSH is all fixed-point arithmetic (integer-based).
>> > The current draft implementation uses a 128 KB lookup table (2 bytes per
>> > entry for 16 bits of input precision).  It seems to be about 25% slower
>> > than the original straw code in my simple microbenchmark, but I'm not sure
>> > that's a number we should trust since it loops through a zillion inputs
>> > and will pull most of the lookup table into the processor caches.
>> >
>> > We could probably try a few others things:
>> >
>> [snip]
>>
>> My Hacker's Delight copy has a number of further variants (on
>> pp. 215), unfortunately only covering log_2 (which is trivially the
>> number of leading zeros in the binary representation) and log_10. But
>> since the logarithms are simply related by a multiplicative constant,
>> something clever might be doable here with a bit of thinking.
>
> Any log will definitely do; the next steps are to do a division and then
> take the max.  Simply counting 0's isn't precise enough, though.. we need
> a lot more precision than that.
>
> The table lookup is (I think) on the order of 50-100 cycles with a cold
> cache; that's what we need to beat!

I believe there are a few different options with respect to natural
log in 16bit fix point. First place to look I believe would be DSP/FFT
code for 16bit microprocessors (think atmel); folks have been able to
do real time fft on those chips for a long time. You might have to
translate some assembly code for 16 bit micro to C code but that's
easy enough.

A quick google search reveals a there's a portable C library for 16bit
fixpoint calculations called libfixmath at
https://code.google.com/p/libfixmath/ that has a natural log
implementation. The code is here:
https://code.google.com/p/libfixmath/source/browse/trunk/libfixmath/fix16_exp.c#61.
The documentation is here:
https://code.google.com/p/libfixmath/wiki/FunctionReference

There's more:
https://github.com/BurntBrunch/rockbox-fft/blob/master/apps/fixedpoint.c#L239

My knowledge is somewhat limited since last time I did this was years
ago, best to ask somebody with a EE degree who does DSP applications.


>
> sage
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: crush: straw is dead, long live straw2
  2014-12-12 15:39     ` Joe Landman
@ 2014-12-12 16:20       ` Sage Weil
  2014-12-12 16:39         ` Joe Landman
  2014-12-12 18:18         ` Joe Landman
  0 siblings, 2 replies; 13+ messages in thread
From: Sage Weil @ 2014-12-12 16:20 UTC (permalink / raw)
  To: Joe Landman; +Cc: Thorsten Behrens, ceph-devel

On Fri, 12 Dec 2014, Joe Landman wrote:
> On 12/12/2014 09:46 AM, Sage Weil wrote:
> > On Fri, 12 Dec 2014, Thorsten Behrens wrote:
> > > Sage Weil wrote:
> > > > Calculating the ln() function is a bit annoying because it is a floating
> > > > point function and CRUSH is all fixed-point arithmetic (integer-based).
> > > > The current draft implementation uses a 128 KB lookup table (2 bytes per
> > > > entry for 16 bits of input precision).  It seems to be about 25% slower
> > > > than the original straw code in my simple microbenchmark, but I'm not
> > > > sure
> > > > that's a number we should trust since it loops through a zillion inputs
> > > > and will pull most of the lookup table into the processor caches.
> > > > 
> > > > We could probably try a few others things:
> > > > 
> > > [snip]
> > > 
> > > My Hacker's Delight copy has a number of further variants (on
> > > pp. 215), unfortunately only covering log_2 (which is trivially the
> > > number of leading zeros in the binary representation) and log_10. But
> > > since the logarithms are simply related by a multiplicative constant,
> > > something clever might be doable here with a bit of thinking.
> > Any log will definitely do; the next steps are to do a division and then
> > take the max.  Simply counting 0's isn't precise enough, though.. we need
> > a lot more precision than that.
> 
> You want integer log?  Looks like ln(x * 2^-16) from your original
> description., with x = ( 0 .. 2^16 )
> So your log is going from undefined to 0 (ln 0 is not defined). Fire up
> gnuplot and do a quick "plot [0:1] (log(x)) " to see the shape of the
> function.
> 
> It can be a real valued function scaled in a particular way, then we can use a
> nice set of series to calculate it in FP (single precision if you like).
> Conversion to int should be fast, and we could do this in SSE2/AVX to do
> multiple calcs at the same time.

We can't use floating point. The code needs to run in the kernel.  We also 
need the rseults to be perfectly deterministic and consistent across all 
architectures; I'm not sure if all floating point implementations (and log 
implementations) will do that?

> I would think the random value portion of
> the code would be the slow part. Where are you getting your random source?
> Could you not start that out with an exponential distribution rather than
> generate your own?

That would simplify things, but I'm not sure how practical it is.  In 
reality, the "random" value is pseudorandom, and looks like this:

 rjhash(crush_input_x, bucket_id, item_id, attempt)

The input 'x' is basically the PG id.  Attempt is incremented as CRUSH 
makes multiple samples to find a mapping that is suitable. Note that the 
position of item in the bucket is not an input, so this result is stable 
when other bucket members change.  The current hash function we're using 
is based on Richard Jenkin's 32-bit mix function.  Here:

	https://github.com/ceph/ceph/blob/master/src/crush/hash.c

> Aside from that, I'd suggest doing something like this (as you don't really
> need the ln(x/65536)/weight term, just the ln(x) term ... as you are
> effectively throwing away the value and just using side effects

The weight is actually item_weight and is the key to getting more mappings 
on more heavily weighted items.  But it's a simple integer divide, so 
cheap.

If SSEx/AVX could be leveraged that'd be great, but I think it needs to be 
fixed-point arithmetic...

Thanks!
sage


>  max_x = -1
>  max_item = -1
>  mult_const = ln(1/65536)/weight
>  for each item:
>     x = random value from 0..65535
>     x = ln(x)
>     if x > max_x:
>        max_x = x
>        max_item = item
>  return item
> 
> The random value is likely to be an expensive portion of the algorithm ...
> possibly have a thread to pre-queue a few thousand random values that you
> queue up in a ring buffer somewhere?
> 
> I'd be happy to code up a quick log calculator if that really is the most
> expensive portion of this.  I think we might be able to do the logs and even
> the comparisons in SSEx/AVX if you are open for that (exploit some loop
> unrolling), though we'd need a serial finishing loop to deal with the
> unrollable portion.
> 
> 
> 
> > 
> > The table lookup is (I think) on the order of 50-100 cycles with a cold
> > cache; that's what we need to beat!
> > 
> > sage
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: crush: straw is dead, long live straw2
  2014-12-12 16:20       ` Sage Weil
@ 2014-12-12 16:39         ` Joe Landman
  2014-12-12 18:18         ` Joe Landman
  1 sibling, 0 replies; 13+ messages in thread
From: Joe Landman @ 2014-12-12 16:39 UTC (permalink / raw)
  To: Sage Weil; +Cc: Thorsten Behrens, ceph-devel

On 12/12/2014 11:20 AM, Sage Weil wrote:
>
> We can't use floating point. The code needs to run in the kernel.  We also
> need the rseults to be perfectly deterministic and consistent across all
> architectures; I'm not sure if all floating point implementations (and log
> implementations) will do that?

Nope ... that is a whole different subject.  Ok, why not start looking 
at what some of the GFX folks do?

https://graphics.stanford.edu/~seander/bithacks.html#IntegerLog



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: crush: straw is dead, long live straw2
  2014-12-08 23:48 crush: straw is dead, long live straw2 Sage Weil
  2014-12-09  0:21 ` Josh Durgin
  2014-12-12  9:14 ` Thorsten Behrens
@ 2014-12-12 17:27 ` Yehuda Sadeh
  2014-12-12 21:42   ` Mark Nelson
  2017-01-27  0:26 ` Loic Dachary
  3 siblings, 1 reply; 13+ messages in thread
From: Yehuda Sadeh @ 2014-12-12 17:27 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Mon, Dec 8, 2014 at 3:48 PM, Sage Weil <sweil@redhat.com> wrote:
> Hey everyone,
>
> When I was writing the original CRUSH code ages ago, I made several
> different bucket types, each using a different 'choose' algorithm for
> pseudorandomly selecting an item.  Most of these were modeled after the
> original RUSH algorithms from RJ Honicky, but there was one new bucket
> type I called 'straw' with some new properties that I was rather proud of:
>
>  1- items could have any weight.
>  2- O(n) to choose an item.
>  3- if an item's weight was adjusted up or down, mappings would either
> move to or from the adjusted item, but never between other unmodified
> items in the bucket.
>
> The basic motivation for the process to look like drawing straws, where
> the longest straw would win, except that each item would have their random
> straw lengths scaled based on their weight.
>
> This is all fine and good, except that it turns out the third property
> doesn't actually hold true!  The amount that the random straw lengths got
> scaled turned out to be a complicated function of the other weights in the
> bucket.  Although the placement calculation is pretty clearly independent
> for each item, a change in the weights affects the scaling factors for
> other items, which means that a weight change could affect the placement
> of items between other items.
>
> In retrospect this should have been obvious, but it turns out nobody has
> looked too carefully at this bit of code or the underlying algorithm for
> some 8 years.  It took a customer making a deliberately miniscule change
> and seeing a huge data movement to dig into this bit of behavior.  Also in
> retrospect, a well written test that verified the third property held true
> for various changes would have caught it, but my testing at the time was
> limited to strange combinations of weights.
>
> We suspect that this has actually been noticed by lots of people who have
> complained about more data moving after a weight change than they
> expected.  These sorts of reports are hard to quantify and investigate and
> are easy to ignore.  Sigh.
>
> Anyway, that's the bad news.
>
> The good news is that we figured out how to fix the placement algorithm to
> get all three properties.  The new bucket type will be called 'straw2'
> (unless someone has a better suggestion).  Here's the difference.
>
> For straw buckets, the choose function looks roughly like this:
>
>  max_x = -1
>  max_item = -1
>  for each item:
>     x = random value from 0..65535
>     x *= scaling factor
>     if x > max_x:
>        max_x = x
>        max_item = item
>  return item
>
> The issue, as I mentioned, is that the scaling factor is a strange
> function of *all* of the other item weights, which means a change in a
> single item A's weight could affect the relative distribution of items
> previously on B and C.
>
> The new straw2 bucket works like this:
>
>  max_x = -1
>  max_item = -1
>  for each item:
>     x = random value from 0..65535
>     x = ln(x / 65536) / weight
>     if x > max_x:
>        max_x = x
>        max_item = item
>  return item
>
> That ln() is a natural log (well, a 16-bit fixed-point approximation of
> it) and as you can see it's a simple function of the weight of that item.
> That means that changing one item's weight won't have any effect on the
> straw lengths for other items, so a change will either make the item win
> or not win but won't change who the other winner is in the not-win case.
>
> Somewhat embarassingly, I stumbled onto this solution half by accident.
> Sam found the underlying principle that makes it work:
>
>  http://en.wikipedia.org/wiki/Exponential_distribution#Distribution_of_the_minimum_of_exponential_random_variables
>
> Calculating the ln() function is a bit annoying because it is a floating
> point function and CRUSH is all fixed-point arithmetic (integer-based).
> The current draft implementation uses a 128 KB lookup table (2 bytes per
> entry for 16 bits of input precision).  It seems to be about 25% slower
> than the original straw code in my simple microbenchmark, but I'm not sure
> that's a number we should trust since it loops through a zillion inputs
> and will pull most of the lookup table into the processor caches.
>
> We could probably try a few others things:
>
>  - Lose half the precision to reduce the table size.  I don't think this
> will work well, and in most cases we'll still miss the CPU cache so the
> benefit is marginal at best.
>
>  - Use floating point log function.  This is problematic for the kernel
> implementation (no floating point), is slower than the lookup table, and
> makes me worry about whether the floating point calculations are
> consistent across architectures (the mapping has to be completely
> deterministic).
>
>  - Use some approximation of the logarithm with fixed-point arithmetic.
> I spent a bit of time search and found
>
>  http://www.researchgate.net/publication/230668515_A_fixed-point_implementation_of_the_natural_logarithm_based_on_a_expanded_hyperbolic_CORDIC_algorithm
>
> but it also involves a couple of lookup tables and (judging by figure 1)
> is probably slower than the 256 KB table.
>
> We could probably expand out taylor series and get something half decent,
> but any precision we lose will translate to OSD utilizations that are off
> from the input weights--something that costs real disk space and we
> probably want to avoid.
>
>  - Stick with the 128KB lookup table.  Performance sensitive clients can
> precalculate all PG mappings when they get OSDMap updates if they are
> concerned about leave the CRUSH calculation in the IO path.
>
> Any other suggestions?
>
> Here is my implementation of the lookup-table based approach:
>
>         https://github.com/ceph/ceph/commit/1d462c9f6a262de3a51533193ed2dff34c730727
>         https://github.com/ceph/ceph/commits/wip-crush-straw2
>
> You'll notice that included in there is a unit test that verifies that
> changing a single item's weight does not effect the distribution of inputs
> among other items in the bucket.  :)
>
> Thanks-
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



We discussed it off line, but should probably bring it to the mailing
list. One approach would be to have some kind of linear approximation
of the ln() function. Here's my take on it, which is very raw:
https://github.com/ceph/ceph/commit/224b921d6c680a8ef1fe39c9e6e183171cefd299

The idea is to find segments that are roughly linear (up to some
degree of accuracy), and then to generate some kind of lookup into
these segments. Then you could calculate the actual result using the
linear params.
The current code searches for lines that gives up to +-10 points
accuracy. The built code can be structured differently, allowing to do
the lookup using some kind of binary search (didn't make any effort in
this direction currently as you can notice). The generated code might
not be as nice as a simple lookup table, and may prove to be slower
(due to instruction cache misses), but maybe a combination of a
smaller lookup table with something like this could do the trick.

Yehuda

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: crush: straw is dead, long live straw2
  2014-12-12 16:20       ` Sage Weil
  2014-12-12 16:39         ` Joe Landman
@ 2014-12-12 18:18         ` Joe Landman
  2014-12-12 21:29           ` Sage Weil
  1 sibling, 1 reply; 13+ messages in thread
From: Joe Landman @ 2014-12-12 18:18 UTC (permalink / raw)
  To: Sage Weil; +Cc: Thorsten Behrens, ceph-devel

On 12/12/2014 11:20 AM, Sage Weil wrote:

> We can't use floating point. The code needs to run in the kernel. We 
> also need the rseults to be perfectly deterministic and consistent 
> across all architectures; I'm not sure if all floating point 
> implementations (and log implementations) will do that? 

Have a look at this (using the code I pointed to before): 
https://gist.github.com/joelandman/ec6f3abef9bc5f1c7b0e

Running on my desktop box, library double function (with casts) is about 
3x slower than the local log_2 version.



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: crush: straw is dead, long live straw2
  2014-12-12 18:18         ` Joe Landman
@ 2014-12-12 21:29           ` Sage Weil
  0 siblings, 0 replies; 13+ messages in thread
From: Sage Weil @ 2014-12-12 21:29 UTC (permalink / raw)
  To: Joe Landman; +Cc: Thorsten Behrens, ceph-devel

On Fri, 12 Dec 2014, Joe Landman wrote:
> On 12/12/2014 11:20 AM, Sage Weil wrote:
> 
> > We can't use floating point. The code needs to run in the kernel. We also
> > need the rseults to be perfectly deterministic and consistent across all
> > architectures; I'm not sure if all floating point implementations (and log
> > implementations) will do that? 
> 
> Have a look at this (using the code I pointed to before):
> https://gist.github.com/joelandman/ec6f3abef9bc5f1c7b0e
> 
> Running on my desktop box, library double function (with casts) is about 3x
> slower than the local log_2 version.

It looks like the log_2 is only giving 5 or 6 bits of precision (counting 
0's), but we want a fixed-point output with more like 16 bits. The link 
Milosz sent looks like the most promising candidate I've seen so far...

sage

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: crush: straw is dead, long live straw2
  2014-12-12 17:27 ` Yehuda Sadeh
@ 2014-12-12 21:42   ` Mark Nelson
  0 siblings, 0 replies; 13+ messages in thread
From: Mark Nelson @ 2014-12-12 21:42 UTC (permalink / raw)
  To: Yehuda Sadeh, Sage Weil; +Cc: ceph-devel



On 12/12/2014 11:27 AM, Yehuda Sadeh wrote:
> On Mon, Dec 8, 2014 at 3:48 PM, Sage Weil <sweil@redhat.com> wrote:
>> Hey everyone,
>>
>> When I was writing the original CRUSH code ages ago, I made several
>> different bucket types, each using a different 'choose' algorithm for
>> pseudorandomly selecting an item.  Most of these were modeled after the
>> original RUSH algorithms from RJ Honicky, but there was one new bucket
>> type I called 'straw' with some new properties that I was rather proud of:
>>
>>   1- items could have any weight.
>>   2- O(n) to choose an item.
>>   3- if an item's weight was adjusted up or down, mappings would either
>> move to or from the adjusted item, but never between other unmodified
>> items in the bucket.
>>
>> The basic motivation for the process to look like drawing straws, where
>> the longest straw would win, except that each item would have their random
>> straw lengths scaled based on their weight.
>>
>> This is all fine and good, except that it turns out the third property
>> doesn't actually hold true!  The amount that the random straw lengths got
>> scaled turned out to be a complicated function of the other weights in the
>> bucket.  Although the placement calculation is pretty clearly independent
>> for each item, a change in the weights affects the scaling factors for
>> other items, which means that a weight change could affect the placement
>> of items between other items.
>>
>> In retrospect this should have been obvious, but it turns out nobody has
>> looked too carefully at this bit of code or the underlying algorithm for
>> some 8 years.  It took a customer making a deliberately miniscule change
>> and seeing a huge data movement to dig into this bit of behavior.  Also in
>> retrospect, a well written test that verified the third property held true
>> for various changes would have caught it, but my testing at the time was
>> limited to strange combinations of weights.
>>
>> We suspect that this has actually been noticed by lots of people who have
>> complained about more data moving after a weight change than they
>> expected.  These sorts of reports are hard to quantify and investigate and
>> are easy to ignore.  Sigh.
>>
>> Anyway, that's the bad news.
>>
>> The good news is that we figured out how to fix the placement algorithm to
>> get all three properties.  The new bucket type will be called 'straw2'
>> (unless someone has a better suggestion).  Here's the difference.
>>
>> For straw buckets, the choose function looks roughly like this:
>>
>>   max_x = -1
>>   max_item = -1
>>   for each item:
>>      x = random value from 0..65535
>>      x *= scaling factor
>>      if x > max_x:
>>         max_x = x
>>         max_item = item
>>   return item
>>
>> The issue, as I mentioned, is that the scaling factor is a strange
>> function of *all* of the other item weights, which means a change in a
>> single item A's weight could affect the relative distribution of items
>> previously on B and C.
>>
>> The new straw2 bucket works like this:
>>
>>   max_x = -1
>>   max_item = -1
>>   for each item:
>>      x = random value from 0..65535
>>      x = ln(x / 65536) / weight
>>      if x > max_x:
>>         max_x = x
>>         max_item = item
>>   return item
>>
>> That ln() is a natural log (well, a 16-bit fixed-point approximation of
>> it) and as you can see it's a simple function of the weight of that item.
>> That means that changing one item's weight won't have any effect on the
>> straw lengths for other items, so a change will either make the item win
>> or not win but won't change who the other winner is in the not-win case.
>>
>> Somewhat embarassingly, I stumbled onto this solution half by accident.
>> Sam found the underlying principle that makes it work:
>>
>>   http://en.wikipedia.org/wiki/Exponential_distribution#Distribution_of_the_minimum_of_exponential_random_variables
>>
>> Calculating the ln() function is a bit annoying because it is a floating
>> point function and CRUSH is all fixed-point arithmetic (integer-based).
>> The current draft implementation uses a 128 KB lookup table (2 bytes per
>> entry for 16 bits of input precision).  It seems to be about 25% slower
>> than the original straw code in my simple microbenchmark, but I'm not sure
>> that's a number we should trust since it loops through a zillion inputs
>> and will pull most of the lookup table into the processor caches.
>>
>> We could probably try a few others things:
>>
>>   - Lose half the precision to reduce the table size.  I don't think this
>> will work well, and in most cases we'll still miss the CPU cache so the
>> benefit is marginal at best.
>>
>>   - Use floating point log function.  This is problematic for the kernel
>> implementation (no floating point), is slower than the lookup table, and
>> makes me worry about whether the floating point calculations are
>> consistent across architectures (the mapping has to be completely
>> deterministic).
>>
>>   - Use some approximation of the logarithm with fixed-point arithmetic.
>> I spent a bit of time search and found
>>
>>   http://www.researchgate.net/publication/230668515_A_fixed-point_implementation_of_the_natural_logarithm_based_on_a_expanded_hyperbolic_CORDIC_algorithm
>>
>> but it also involves a couple of lookup tables and (judging by figure 1)
>> is probably slower than the 256 KB table.
>>
>> We could probably expand out taylor series and get something half decent,
>> but any precision we lose will translate to OSD utilizations that are off
>> from the input weights--something that costs real disk space and we
>> probably want to avoid.
>>
>>   - Stick with the 128KB lookup table.  Performance sensitive clients can
>> precalculate all PG mappings when they get OSDMap updates if they are
>> concerned about leave the CRUSH calculation in the IO path.
>>
>> Any other suggestions?
>>
>> Here is my implementation of the lookup-table based approach:
>>
>>          https://github.com/ceph/ceph/commit/1d462c9f6a262de3a51533193ed2dff34c730727
>>          https://github.com/ceph/ceph/commits/wip-crush-straw2
>>
>> You'll notice that included in there is a unit test that verifies that
>> changing a single item's weight does not effect the distribution of inputs
>> among other items in the bucket.  :)
>>
>> Thanks-
>> sage
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>
> We discussed it off line, but should probably bring it to the mailing
> list. One approach would be to have some kind of linear approximation
> of the ln() function. Here's my take on it, which is very raw:
> https://github.com/ceph/ceph/commit/224b921d6c680a8ef1fe39c9e6e183171cefd299
>
> The idea is to find segments that are roughly linear (up to some
> degree of accuracy), and then to generate some kind of lookup into
> these segments. Then you could calculate the actual result using the
> linear params.
> The current code searches for lines that gives up to +-10 points
> accuracy. The built code can be structured differently, allowing to do
> the lookup using some kind of binary search (didn't make any effort in
> this direction currently as you can notice). The generated code might
> not be as nice as a simple lookup table, and may prove to be slower
> (due to instruction cache misses), but maybe a combination of a
> smaller lookup table with something like this could do the trick.

Impressive Yehuda! :)

>
> Yehuda
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: crush: straw is dead, long live straw2
  2014-12-08 23:48 crush: straw is dead, long live straw2 Sage Weil
                   ` (2 preceding siblings ...)
  2014-12-12 17:27 ` Yehuda Sadeh
@ 2017-01-27  0:26 ` Loic Dachary
  3 siblings, 0 replies; 13+ messages in thread
From: Loic Dachary @ 2017-01-27  0:26 UTC (permalink / raw)
  To: Sage Weil, ceph-devel

Hi Sage,

On 12/09/2014 12:48 AM, Sage Weil wrote:
> The new straw2 bucket works like this:
> 
>  max_x = -1
>  max_item = -1
>  for each item:
>     x = random value from 0..65535
>     x = ln(x / 65536) / weight
>     if x > max_x:
>        max_x = x
>        max_item = item
>  return item
> 
> That ln() is a natural log (well, a 16-bit fixed-point approximation of 
> it) and as you can see it's a simple function of the weight of that item.  
> That means that changing one item's weight won't have any effect on the 
> straw lengths for other items, so a change will either make the item win 
> or not win but won't change who the other winner is in the not-win case.
> 
> Somewhat embarassingly, I stumbled onto this solution half by accident.  
> Sam found the underlying principle that makes it work:
> 
>  http://en.wikipedia.org/wiki/Exponential_distribution#Distribution_of_the_minimum_of_exponential_random_variables

I dont' understand what that works :-) Is there some documentation I could read to better understand the principle ?

Cheers

[1] bucket_straw2_choose http://libcrush.org/main/libcrush/blob/master/crush/mapper.c#L309

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2017-01-27  0:27 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-12-08 23:48 crush: straw is dead, long live straw2 Sage Weil
2014-12-09  0:21 ` Josh Durgin
2014-12-12  9:14 ` Thorsten Behrens
2014-12-12 14:46   ` Sage Weil
2014-12-12 15:39     ` Joe Landman
2014-12-12 16:20       ` Sage Weil
2014-12-12 16:39         ` Joe Landman
2014-12-12 18:18         ` Joe Landman
2014-12-12 21:29           ` Sage Weil
2014-12-12 15:43     ` Milosz Tanski
2014-12-12 17:27 ` Yehuda Sadeh
2014-12-12 21:42   ` Mark Nelson
2017-01-27  0:26 ` Loic Dachary

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.