Re: crush: straw is dead, long live straw2

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Mark Nelson <mark.nelson@inktank.com>
To: Yehuda Sadeh <yehuda@redhat.com>, Sage Weil <sweil@redhat.com>
Cc: ceph-devel <ceph-devel@vger.kernel.org>
Subject: Re: crush: straw is dead, long live straw2
Date: Fri, 12 Dec 2014 15:42:29 -0600	[thread overview]
Message-ID: <548B6145.2070303@redhat.com> (raw)
In-Reply-To: <CABBk=J96oyvpKjR=Z+Mk8oqvHf+6-c2+P0WVdDETX5AA_TQGdA@mail.gmail.com>



On 12/12/2014 11:27 AM, Yehuda Sadeh wrote:
> On Mon, Dec 8, 2014 at 3:48 PM, Sage Weil <sweil@redhat.com> wrote:
>> Hey everyone,
>>
>> When I was writing the original CRUSH code ages ago, I made several
>> different bucket types, each using a different 'choose' algorithm for
>> pseudorandomly selecting an item.  Most of these were modeled after the
>> original RUSH algorithms from RJ Honicky, but there was one new bucket
>> type I called 'straw' with some new properties that I was rather proud of:
>>
>>   1- items could have any weight.
>>   2- O(n) to choose an item.
>>   3- if an item's weight was adjusted up or down, mappings would either
>> move to or from the adjusted item, but never between other unmodified
>> items in the bucket.
>>
>> The basic motivation for the process to look like drawing straws, where
>> the longest straw would win, except that each item would have their random
>> straw lengths scaled based on their weight.
>>
>> This is all fine and good, except that it turns out the third property
>> doesn't actually hold true!  The amount that the random straw lengths got
>> scaled turned out to be a complicated function of the other weights in the
>> bucket.  Although the placement calculation is pretty clearly independent
>> for each item, a change in the weights affects the scaling factors for
>> other items, which means that a weight change could affect the placement
>> of items between other items.
>>
>> In retrospect this should have been obvious, but it turns out nobody has
>> looked too carefully at this bit of code or the underlying algorithm for
>> some 8 years.  It took a customer making a deliberately miniscule change
>> and seeing a huge data movement to dig into this bit of behavior.  Also in
>> retrospect, a well written test that verified the third property held true
>> for various changes would have caught it, but my testing at the time was
>> limited to strange combinations of weights.
>>
>> We suspect that this has actually been noticed by lots of people who have
>> complained about more data moving after a weight change than they
>> expected.  These sorts of reports are hard to quantify and investigate and
>> are easy to ignore.  Sigh.
>>
>> Anyway, that's the bad news.
>>
>> The good news is that we figured out how to fix the placement algorithm to
>> get all three properties.  The new bucket type will be called 'straw2'
>> (unless someone has a better suggestion).  Here's the difference.
>>
>> For straw buckets, the choose function looks roughly like this:
>>
>>   max_x = -1
>>   max_item = -1
>>   for each item:
>>      x = random value from 0..65535
>>      x *= scaling factor
>>      if x > max_x:
>>         max_x = x
>>         max_item = item
>>   return item
>>
>> The issue, as I mentioned, is that the scaling factor is a strange
>> function of *all* of the other item weights, which means a change in a
>> single item A's weight could affect the relative distribution of items
>> previously on B and C.
>>
>> The new straw2 bucket works like this:
>>
>>   max_x = -1
>>   max_item = -1
>>   for each item:
>>      x = random value from 0..65535
>>      x = ln(x / 65536) / weight
>>      if x > max_x:
>>         max_x = x
>>         max_item = item
>>   return item
>>
>> That ln() is a natural log (well, a 16-bit fixed-point approximation of
>> it) and as you can see it's a simple function of the weight of that item.
>> That means that changing one item's weight won't have any effect on the
>> straw lengths for other items, so a change will either make the item win
>> or not win but won't change who the other winner is in the not-win case.
>>
>> Somewhat embarassingly, I stumbled onto this solution half by accident.
>> Sam found the underlying principle that makes it work:
>>
>>   http://en.wikipedia.org/wiki/Exponential_distribution#Distribution_of_the_minimum_of_exponential_random_variables
>>
>> Calculating the ln() function is a bit annoying because it is a floating
>> point function and CRUSH is all fixed-point arithmetic (integer-based).
>> The current draft implementation uses a 128 KB lookup table (2 bytes per
>> entry for 16 bits of input precision).  It seems to be about 25% slower
>> than the original straw code in my simple microbenchmark, but I'm not sure
>> that's a number we should trust since it loops through a zillion inputs
>> and will pull most of the lookup table into the processor caches.
>>
>> We could probably try a few others things:
>>
>>   - Lose half the precision to reduce the table size.  I don't think this
>> will work well, and in most cases we'll still miss the CPU cache so the
>> benefit is marginal at best.
>>
>>   - Use floating point log function.  This is problematic for the kernel
>> implementation (no floating point), is slower than the lookup table, and
>> makes me worry about whether the floating point calculations are
>> consistent across architectures (the mapping has to be completely
>> deterministic).
>>
>>   - Use some approximation of the logarithm with fixed-point arithmetic.
>> I spent a bit of time search and found
>>
>>   http://www.researchgate.net/publication/230668515_A_fixed-point_implementation_of_the_natural_logarithm_based_on_a_expanded_hyperbolic_CORDIC_algorithm
>>
>> but it also involves a couple of lookup tables and (judging by figure 1)
>> is probably slower than the 256 KB table.
>>
>> We could probably expand out taylor series and get something half decent,
>> but any precision we lose will translate to OSD utilizations that are off
>> from the input weights--something that costs real disk space and we
>> probably want to avoid.
>>
>>   - Stick with the 128KB lookup table.  Performance sensitive clients can
>> precalculate all PG mappings when they get OSDMap updates if they are
>> concerned about leave the CRUSH calculation in the IO path.
>>
>> Any other suggestions?
>>
>> Here is my implementation of the lookup-table based approach:
>>
>>          https://github.com/ceph/ceph/commit/1d462c9f6a262de3a51533193ed2dff34c730727
>>          https://github.com/ceph/ceph/commits/wip-crush-straw2
>>
>> You'll notice that included in there is a unit test that verifies that
>> changing a single item's weight does not effect the distribution of inputs
>> among other items in the bucket.  :)
>>
>> Thanks-
>> sage
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>
> We discussed it off line, but should probably bring it to the mailing
> list. One approach would be to have some kind of linear approximation
> of the ln() function. Here's my take on it, which is very raw:
> https://github.com/ceph/ceph/commit/224b921d6c680a8ef1fe39c9e6e183171cefd299
>
> The idea is to find segments that are roughly linear (up to some
> degree of accuracy), and then to generate some kind of lookup into
> these segments. Then you could calculate the actual result using the
> linear params.
> The current code searches for lines that gives up to +-10 points
> accuracy. The built code can be structured differently, allowing to do
> the lookup using some kind of binary search (didn't make any effort in
> this direction currently as you can notice). The generated code might
> not be as nice as a simple lookup table, and may prove to be slower
> (due to instruction cache misses), but maybe a combination of a
> smaller lookup table with something like this could do the trick.

Impressive Yehuda! :)

>
> Yehuda
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

next prev parent reply	other threads:[~2014-12-12 21:42 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-12-08 23:48 crush: straw is dead, long live straw2 Sage Weil
2014-12-09  0:21 ` Josh Durgin
2014-12-12  9:14 ` Thorsten Behrens
2014-12-12 14:46   ` Sage Weil
2014-12-12 15:39     ` Joe Landman
2014-12-12 16:20       ` Sage Weil
2014-12-12 16:39         ` Joe Landman
2014-12-12 18:18         ` Joe Landman
2014-12-12 21:29           ` Sage Weil
2014-12-12 15:43     ` Milosz Tanski
2014-12-12 17:27 ` Yehuda Sadeh
2014-12-12 21:42   ` Mark Nelson [this message]
2017-01-27  0:26 ` Loic Dachary

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=548B6145.2070303@redhat.com \
    --to=mark.nelson@inktank.com \
    --cc=ceph-devel@vger.kernel.org \
    --cc=mnelson@redhat.com \
    --cc=sweil@redhat.com \
    --cc=yehuda@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.