From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mark Nelson Subject: Re: crush: straw is dead, long live straw2 Date: Fri, 12 Dec 2014 15:42:29 -0600 Message-ID: <548B6145.2070303@redhat.com> References: Reply-To: mnelson@redhat.com Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Return-path: Received: from mail-ie0-f179.google.com ([209.85.223.179]:63890 "EHLO mail-ie0-f179.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751647AbaLLVmc (ORCPT ); Fri, 12 Dec 2014 16:42:32 -0500 Received: by mail-ie0-f179.google.com with SMTP id rp18so7725363iec.38 for ; Fri, 12 Dec 2014 13:42:31 -0800 (PST) In-Reply-To: Sender: ceph-devel-owner@vger.kernel.org List-ID: To: Yehuda Sadeh , Sage Weil Cc: ceph-devel On 12/12/2014 11:27 AM, Yehuda Sadeh wrote: > On Mon, Dec 8, 2014 at 3:48 PM, Sage Weil wrote: >> Hey everyone, >> >> When I was writing the original CRUSH code ages ago, I made several >> different bucket types, each using a different 'choose' algorithm for >> pseudorandomly selecting an item. Most of these were modeled after the >> original RUSH algorithms from RJ Honicky, but there was one new bucket >> type I called 'straw' with some new properties that I was rather proud of: >> >> 1- items could have any weight. >> 2- O(n) to choose an item. >> 3- if an item's weight was adjusted up or down, mappings would either >> move to or from the adjusted item, but never between other unmodified >> items in the bucket. >> >> The basic motivation for the process to look like drawing straws, where >> the longest straw would win, except that each item would have their random >> straw lengths scaled based on their weight. >> >> This is all fine and good, except that it turns out the third property >> doesn't actually hold true! The amount that the random straw lengths got >> scaled turned out to be a complicated function of the other weights in the >> bucket. Although the placement calculation is pretty clearly independent >> for each item, a change in the weights affects the scaling factors for >> other items, which means that a weight change could affect the placement >> of items between other items. >> >> In retrospect this should have been obvious, but it turns out nobody has >> looked too carefully at this bit of code or the underlying algorithm for >> some 8 years. It took a customer making a deliberately miniscule change >> and seeing a huge data movement to dig into this bit of behavior. Also in >> retrospect, a well written test that verified the third property held true >> for various changes would have caught it, but my testing at the time was >> limited to strange combinations of weights. >> >> We suspect that this has actually been noticed by lots of people who have >> complained about more data moving after a weight change than they >> expected. These sorts of reports are hard to quantify and investigate and >> are easy to ignore. Sigh. >> >> Anyway, that's the bad news. >> >> The good news is that we figured out how to fix the placement algorithm to >> get all three properties. The new bucket type will be called 'straw2' >> (unless someone has a better suggestion). Here's the difference. >> >> For straw buckets, the choose function looks roughly like this: >> >> max_x = -1 >> max_item = -1 >> for each item: >> x = random value from 0..65535 >> x *= scaling factor >> if x > max_x: >> max_x = x >> max_item = item >> return item >> >> The issue, as I mentioned, is that the scaling factor is a strange >> function of *all* of the other item weights, which means a change in a >> single item A's weight could affect the relative distribution of items >> previously on B and C. >> >> The new straw2 bucket works like this: >> >> max_x = -1 >> max_item = -1 >> for each item: >> x = random value from 0..65535 >> x = ln(x / 65536) / weight >> if x > max_x: >> max_x = x >> max_item = item >> return item >> >> That ln() is a natural log (well, a 16-bit fixed-point approximation of >> it) and as you can see it's a simple function of the weight of that item. >> That means that changing one item's weight won't have any effect on the >> straw lengths for other items, so a change will either make the item win >> or not win but won't change who the other winner is in the not-win case. >> >> Somewhat embarassingly, I stumbled onto this solution half by accident. >> Sam found the underlying principle that makes it work: >> >> http://en.wikipedia.org/wiki/Exponential_distribution#Distribution_of_the_minimum_of_exponential_random_variables >> >> Calculating the ln() function is a bit annoying because it is a floating >> point function and CRUSH is all fixed-point arithmetic (integer-based). >> The current draft implementation uses a 128 KB lookup table (2 bytes per >> entry for 16 bits of input precision). It seems to be about 25% slower >> than the original straw code in my simple microbenchmark, but I'm not sure >> that's a number we should trust since it loops through a zillion inputs >> and will pull most of the lookup table into the processor caches. >> >> We could probably try a few others things: >> >> - Lose half the precision to reduce the table size. I don't think this >> will work well, and in most cases we'll still miss the CPU cache so the >> benefit is marginal at best. >> >> - Use floating point log function. This is problematic for the kernel >> implementation (no floating point), is slower than the lookup table, and >> makes me worry about whether the floating point calculations are >> consistent across architectures (the mapping has to be completely >> deterministic). >> >> - Use some approximation of the logarithm with fixed-point arithmetic. >> I spent a bit of time search and found >> >> http://www.researchgate.net/publication/230668515_A_fixed-point_implementation_of_the_natural_logarithm_based_on_a_expanded_hyperbolic_CORDIC_algorithm >> >> but it also involves a couple of lookup tables and (judging by figure 1) >> is probably slower than the 256 KB table. >> >> We could probably expand out taylor series and get something half decent, >> but any precision we lose will translate to OSD utilizations that are off >> from the input weights--something that costs real disk space and we >> probably want to avoid. >> >> - Stick with the 128KB lookup table. Performance sensitive clients can >> precalculate all PG mappings when they get OSDMap updates if they are >> concerned about leave the CRUSH calculation in the IO path. >> >> Any other suggestions? >> >> Here is my implementation of the lookup-table based approach: >> >> https://github.com/ceph/ceph/commit/1d462c9f6a262de3a51533193ed2dff34c730727 >> https://github.com/ceph/ceph/commits/wip-crush-straw2 >> >> You'll notice that included in there is a unit test that verifies that >> changing a single item's weight does not effect the distribution of inputs >> among other items in the bucket. :) >> >> Thanks- >> sage >> -- >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > We discussed it off line, but should probably bring it to the mailing > list. One approach would be to have some kind of linear approximation > of the ln() function. Here's my take on it, which is very raw: > https://github.com/ceph/ceph/commit/224b921d6c680a8ef1fe39c9e6e183171cefd299 > > The idea is to find segments that are roughly linear (up to some > degree of accuracy), and then to generate some kind of lookup into > these segments. Then you could calculate the actual result using the > linear params. > The current code searches for lines that gives up to +-10 points > accuracy. The built code can be structured differently, allowing to do > the lookup using some kind of binary search (didn't make any effort in > this direction currently as you can notice). The generated code might > not be as nice as a simple lookup table, and may prove to be slower > (due to instruction cache misses), but maybe a combination of a > smaller lookup table with something like this could do the trick. Impressive Yehuda! :) > > Yehuda > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >