Re: SIMD accelerated crush_do_rule proof of concept

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Loic Dachary <loic@dachary.org>
To: Sage Weil <sage@newdream.net>
Cc: Ceph Development <ceph-devel@vger.kernel.org>
Subject: Re: SIMD accelerated crush_do_rule proof of concept
Date: Mon, 29 Aug 2016 16:03:27 +0200	[thread overview]
Message-ID: <57C440AF.1010905@dachary.org> (raw)
In-Reply-To: <alpine.DEB.2.11.1608291352290.24244@piezo.us.to>

Hi Sage,

On 29/08/2016 15:55, Sage Weil wrote:
> On Mon, 29 Aug 2016, Loic Dachary wrote:
>> Hi,
>>
>> TL;DR: crush_do_rule using SIMD goes twice faster, the implementation is 
>> straightforward and would help with crushmap validation, is there any 
>> reason not to do it ?
>>
>> When resolving a crush rule (crush_do_rule in mapper.c), the straw2 
>> function (bucket_straw2_choose) calls the hashing function 
>> (crush_hash32_3) for each item in a bucket and keeps the best match. 
>> When a bucket has four items, the hash function can be run using SIMD 
>> instructions. Each item value is 32 bits and four can fit in a __m128i.
>>
>> I tried to inline the hash function when the conditions are right[1] and 
>> run a test to measure the difference.
>>
>> crushtool -o /tmp/t.map --num_osds 1024 --build node straw2 8 datacenter straw2 4 root straw2 0
>> time crushtool -i /tmp/t.map --show-bad-mappings --show-statistics --test --rule 0 --min-x 1 --max-x 2048000 --num-rep 4
>> rule 0 (replicated_ruleset), x = 1..2048000, numrep = 4..4
>> rule 0 (replicated_ruleset) num_rep 4 result size == 4:	2048000/2048000
>>
>> With SIMD
>>
>> real	0m10.433s
>> user	0m10.428s
>> sys	0m0.000s
>>
>> Without SIMD
>>
>> real	0m19.344s
>> user	0m19.340s
>> sys	0m0.004s
>>
>> Callgrind estimated cycles for each crush_do_rule are in the same range:
>>
>> rm crush.callgrind ; valgrind --tool=callgrind --callgrind-out-file=crush.callgrind crushtool -i /tmp/t.map --show-bad-mappings --show-statistics --test --rule 0 --min-x 1 --max-x 204800 --num-rep 4
>> kcachegrind crush.callgrind
>>
>> With SIMD    : crush_do_rule is estimated to use 21 205 cycles
>> Without SIMD : crush_do_rule is estimated to use 53 068 cycles
>>
>> This proof of concept relies on instructions that are available on all 
>> ARM & Intel processors, nothing complicated is going on. It is 
>> beneficial to crush maps that have more than four disks per host, more 
>> than four hosts per rack etc. It probably is a small win for an OSD or 
>> even a client. For crushmap validation it helps significantly since the 
>> MON are not able to run crushtool asynchronously and it needs to run 
>> within a few seconds (because it blocks the MON).
>>
>> The implementation is straightforward: it needs sub/xor/lshift/rshift. 
>> The only relatively tricky part is runtime / compile time detection of 
>> the SIMD instructions for both Intel and ARM processors. Luckily this 
>> has already been taken care of when integrating with the jerasure 
>> erasure code plugin.
>>
>> Is there any reason why it would not be good to implement this ?
> 
> This is really cool!  I agree that the straw2 O(n) calculation on each 
> node is the place to apply this.
> 
> To answer your question, the only real risk/problem I see is that we need 
> to keep the perfectly in sync with the non-optimized variant since the 
> result has to be deterministic.  The maintenance burden is small, I think, 
> since for that reason the code behavior doesn't really change, but we do 
> need to pretty exhaustively verify that the new implementation matches the 
> old one.  Perhaps a set of unit tests that compile both variants and feed 
> it randomly sized and weighted straw2 buckets and feed lots of values 
> through?

Right: the implementation is likely to be simple but it needs serious testing. I'll give it a try.

Cheers

> sage
> 
>>
>> Cheers
>>
>> [1] https://github.com/dachary/ceph/commit/71ae4584d9ed57f70aad718d0ffe206a01e91fef
>>
>> -- 
>> Loïc Dachary, Artisan Logiciel Libre
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>

-- 
Loïc Dachary, Artisan Logiciel Libre

next prev parent reply	other threads:[~2016-08-29 14:05 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-08-29 11:42 SIMD accelerated crush_do_rule proof of concept Loic Dachary
2016-08-29 13:15 ` Mark Nelson
2016-08-29 13:57   ` Sage Weil
2016-08-29 13:58   ` Loic Dachary
2016-08-29 13:55 ` Sage Weil
2016-08-29 14:03   ` Loic Dachary [this message]
2016-08-29 15:07     ` Ilya Dryomov
2016-08-29 14:08   ` Vincent JARDIN
2016-08-29 14:54     ` Loic Dachary
2016-08-30 13:24 ` Piotr Dałek
2016-08-30 16:53   ` Sage Weil

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=57C440AF.1010905@dachary.org \
    --to=loic@dachary.org \
    --cc=ceph-devel@vger.kernel.org \
    --cc=sage@newdream.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.