All of lore.kernel.org
 help / color / mirror / Atom feed
* SIMD accelerated crush_do_rule proof of concept
@ 2016-08-29 11:42 Loic Dachary
  2016-08-29 13:15 ` Mark Nelson
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Loic Dachary @ 2016-08-29 11:42 UTC (permalink / raw)
  To: Ceph Development

Hi,

TL;DR: crush_do_rule using SIMD goes twice faster, the implementation is straightforward and would help with crushmap validation, is there any reason not to do it ?

When resolving a crush rule (crush_do_rule in mapper.c), the straw2 function (bucket_straw2_choose) calls the hashing function (crush_hash32_3) for each item in a bucket and keeps the best match. When a bucket has four items, the hash function can be run using SIMD instructions. Each item value is 32 bits and four can fit in a __m128i.

I tried to inline the hash function when the conditions are right[1] and run a test to measure the difference.

crushtool -o /tmp/t.map --num_osds 1024 --build node straw2 8 datacenter straw2 4 root straw2 0
time crushtool -i /tmp/t.map --show-bad-mappings --show-statistics --test --rule 0 --min-x 1 --max-x 2048000 --num-rep 4
rule 0 (replicated_ruleset), x = 1..2048000, numrep = 4..4
rule 0 (replicated_ruleset) num_rep 4 result size == 4:	2048000/2048000

With SIMD

real	0m10.433s
user	0m10.428s
sys	0m0.000s

Without SIMD

real	0m19.344s
user	0m19.340s
sys	0m0.004s

Callgrind estimated cycles for each crush_do_rule are in the same range:

rm crush.callgrind ; valgrind --tool=callgrind --callgrind-out-file=crush.callgrind crushtool -i /tmp/t.map --show-bad-mappings --show-statistics --test --rule 0 --min-x 1 --max-x 204800 --num-rep 4
kcachegrind crush.callgrind

With SIMD    : crush_do_rule is estimated to use 21 205 cycles
Without SIMD : crush_do_rule is estimated to use 53 068 cycles

This proof of concept relies on instructions that are available on all ARM & Intel processors, nothing complicated is going on. It is beneficial to crush maps that have more than four disks per host, more than four hosts per rack etc. It probably is a small win for an OSD or even a client. For crushmap validation it helps significantly since the MON are not able to run crushtool asynchronously and it needs to run within a few seconds (because it blocks the MON).

The implementation is straightforward: it needs sub/xor/lshift/rshift. The only relatively tricky part is runtime / compile time detection of the SIMD instructions for both Intel and ARM processors. Luckily this has already been taken care of when integrating with the jerasure erasure code plugin.

Is there any reason why it would not be good to implement this ?

Cheers

[1] https://github.com/dachary/ceph/commit/71ae4584d9ed57f70aad718d0ffe206a01e91fef

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: SIMD accelerated crush_do_rule proof of concept
  2016-08-29 11:42 SIMD accelerated crush_do_rule proof of concept Loic Dachary
@ 2016-08-29 13:15 ` Mark Nelson
  2016-08-29 13:57   ` Sage Weil
  2016-08-29 13:58   ` Loic Dachary
  2016-08-29 13:55 ` Sage Weil
  2016-08-30 13:24 ` Piotr Dałek
  2 siblings, 2 replies; 11+ messages in thread
From: Mark Nelson @ 2016-08-29 13:15 UTC (permalink / raw)
  To: Loic Dachary, Ceph Development

Anything we can do to help on the CPU usage front is a win IMHO, though 
I would be interested in seeing an example where we are spending a lot 
of time on crush in a real usage scenario?

Mark

On 08/29/2016 06:42 AM, Loic Dachary wrote:
> Hi,
>
> TL;DR: crush_do_rule using SIMD goes twice faster, the implementation is straightforward and would help with crushmap validation, is there any reason not to do it ?
>
> When resolving a crush rule (crush_do_rule in mapper.c), the straw2 function (bucket_straw2_choose) calls the hashing function (crush_hash32_3) for each item in a bucket and keeps the best match. When a bucket has four items, the hash function can be run using SIMD instructions. Each item value is 32 bits and four can fit in a __m128i.
>
> I tried to inline the hash function when the conditions are right[1] and run a test to measure the difference.
>
> crushtool -o /tmp/t.map --num_osds 1024 --build node straw2 8 datacenter straw2 4 root straw2 0
> time crushtool -i /tmp/t.map --show-bad-mappings --show-statistics --test --rule 0 --min-x 1 --max-x 2048000 --num-rep 4
> rule 0 (replicated_ruleset), x = 1..2048000, numrep = 4..4
> rule 0 (replicated_ruleset) num_rep 4 result size == 4:	2048000/2048000
>
> With SIMD
>
> real	0m10.433s
> user	0m10.428s
> sys	0m0.000s
>
> Without SIMD
>
> real	0m19.344s
> user	0m19.340s
> sys	0m0.004s
>
> Callgrind estimated cycles for each crush_do_rule are in the same range:
>
> rm crush.callgrind ; valgrind --tool=callgrind --callgrind-out-file=crush.callgrind crushtool -i /tmp/t.map --show-bad-mappings --show-statistics --test --rule 0 --min-x 1 --max-x 204800 --num-rep 4
> kcachegrind crush.callgrind
>
> With SIMD    : crush_do_rule is estimated to use 21 205 cycles
> Without SIMD : crush_do_rule is estimated to use 53 068 cycles
>
> This proof of concept relies on instructions that are available on all ARM & Intel processors, nothing complicated is going on. It is beneficial to crush maps that have more than four disks per host, more than four hosts per rack etc. It probably is a small win for an OSD or even a client. For crushmap validation it helps significantly since the MON are not able to run crushtool asynchronously and it needs to run within a few seconds (because it blocks the MON).
>
> The implementation is straightforward: it needs sub/xor/lshift/rshift. The only relatively tricky part is runtime / compile time detection of the SIMD instructions for both Intel and ARM processors. Luckily this has already been taken care of when integrating with the jerasure erasure code plugin.
>
> Is there any reason why it would not be good to implement this ?
>
> Cheers
>
> [1] https://github.com/dachary/ceph/commit/71ae4584d9ed57f70aad718d0ffe206a01e91fef
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: SIMD accelerated crush_do_rule proof of concept
  2016-08-29 11:42 SIMD accelerated crush_do_rule proof of concept Loic Dachary
  2016-08-29 13:15 ` Mark Nelson
@ 2016-08-29 13:55 ` Sage Weil
  2016-08-29 14:03   ` Loic Dachary
  2016-08-29 14:08   ` Vincent JARDIN
  2016-08-30 13:24 ` Piotr Dałek
  2 siblings, 2 replies; 11+ messages in thread
From: Sage Weil @ 2016-08-29 13:55 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Ceph Development

[-- Attachment #1: Type: TEXT/PLAIN, Size: 3513 bytes --]

On Mon, 29 Aug 2016, Loic Dachary wrote:
> Hi,
> 
> TL;DR: crush_do_rule using SIMD goes twice faster, the implementation is 
> straightforward and would help with crushmap validation, is there any 
> reason not to do it ?
> 
> When resolving a crush rule (crush_do_rule in mapper.c), the straw2 
> function (bucket_straw2_choose) calls the hashing function 
> (crush_hash32_3) for each item in a bucket and keeps the best match. 
> When a bucket has four items, the hash function can be run using SIMD 
> instructions. Each item value is 32 bits and four can fit in a __m128i.
> 
> I tried to inline the hash function when the conditions are right[1] and 
> run a test to measure the difference.
> 
> crushtool -o /tmp/t.map --num_osds 1024 --build node straw2 8 datacenter straw2 4 root straw2 0
> time crushtool -i /tmp/t.map --show-bad-mappings --show-statistics --test --rule 0 --min-x 1 --max-x 2048000 --num-rep 4
> rule 0 (replicated_ruleset), x = 1..2048000, numrep = 4..4
> rule 0 (replicated_ruleset) num_rep 4 result size == 4:	2048000/2048000
> 
> With SIMD
> 
> real	0m10.433s
> user	0m10.428s
> sys	0m0.000s
> 
> Without SIMD
> 
> real	0m19.344s
> user	0m19.340s
> sys	0m0.004s
> 
> Callgrind estimated cycles for each crush_do_rule are in the same range:
> 
> rm crush.callgrind ; valgrind --tool=callgrind --callgrind-out-file=crush.callgrind crushtool -i /tmp/t.map --show-bad-mappings --show-statistics --test --rule 0 --min-x 1 --max-x 204800 --num-rep 4
> kcachegrind crush.callgrind
> 
> With SIMD    : crush_do_rule is estimated to use 21 205 cycles
> Without SIMD : crush_do_rule is estimated to use 53 068 cycles
> 
> This proof of concept relies on instructions that are available on all 
> ARM & Intel processors, nothing complicated is going on. It is 
> beneficial to crush maps that have more than four disks per host, more 
> than four hosts per rack etc. It probably is a small win for an OSD or 
> even a client. For crushmap validation it helps significantly since the 
> MON are not able to run crushtool asynchronously and it needs to run 
> within a few seconds (because it blocks the MON).
> 
> The implementation is straightforward: it needs sub/xor/lshift/rshift. 
> The only relatively tricky part is runtime / compile time detection of 
> the SIMD instructions for both Intel and ARM processors. Luckily this 
> has already been taken care of when integrating with the jerasure 
> erasure code plugin.
> 
> Is there any reason why it would not be good to implement this ?

This is really cool!  I agree that the straw2 O(n) calculation on each 
node is the place to apply this.

To answer your question, the only real risk/problem I see is that we need 
to keep the perfectly in sync with the non-optimized variant since the 
result has to be deterministic.  The maintenance burden is small, I think, 
since for that reason the code behavior doesn't really change, but we do 
need to pretty exhaustively verify that the new implementation matches the 
old one.  Perhaps a set of unit tests that compile both variants and feed 
it randomly sized and weighted straw2 buckets and feed lots of values 
through?

sage

> 
> Cheers
> 
> [1] https://github.com/dachary/ceph/commit/71ae4584d9ed57f70aad718d0ffe206a01e91fef
> 
> -- 
> Loïc Dachary, Artisan Logiciel Libre
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: SIMD accelerated crush_do_rule proof of concept
  2016-08-29 13:15 ` Mark Nelson
@ 2016-08-29 13:57   ` Sage Weil
  2016-08-29 13:58   ` Loic Dachary
  1 sibling, 0 replies; 11+ messages in thread
From: Sage Weil @ 2016-08-29 13:57 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Loic Dachary, Ceph Development

On Mon, 29 Aug 2016, Mark Nelson wrote:
> Anything we can do to help on the CPU usage front is a win IMHO, though I
> would be interested in seeing an example where we are spending a lot of time
> on crush in a real usage scenario?

The monitor prime_pg_temp has to calculate a crush mapping for every PG 
when the osdmap changes in significant ways.  A 2x improvement there is a 
big help since the work has to be timeboxed and aborted if it runs too 
long.  Same goes for the crushtool test that runs whenever the crush map 
changes.

sage


> 
> Mark
> 
> On 08/29/2016 06:42 AM, Loic Dachary wrote:
> > Hi,
> > 
> > TL;DR: crush_do_rule using SIMD goes twice faster, the implementation is
> > straightforward and would help with crushmap validation, is there any reason
> > not to do it ?
> > 
> > When resolving a crush rule (crush_do_rule in mapper.c), the straw2 function
> > (bucket_straw2_choose) calls the hashing function (crush_hash32_3) for each
> > item in a bucket and keeps the best match. When a bucket has four items, the
> > hash function can be run using SIMD instructions. Each item value is 32 bits
> > and four can fit in a __m128i.
> > 
> > I tried to inline the hash function when the conditions are right[1] and run
> > a test to measure the difference.
> > 
> > crushtool -o /tmp/t.map --num_osds 1024 --build node straw2 8 datacenter
> > straw2 4 root straw2 0
> > time crushtool -i /tmp/t.map --show-bad-mappings --show-statistics --test
> > --rule 0 --min-x 1 --max-x 2048000 --num-rep 4
> > rule 0 (replicated_ruleset), x = 1..2048000, numrep = 4..4
> > rule 0 (replicated_ruleset) num_rep 4 result size == 4:
> > 2048000/2048000
> > 
> > With SIMD
> > 
> > real	0m10.433s
> > user	0m10.428s
> > sys	0m0.000s
> > 
> > Without SIMD
> > 
> > real	0m19.344s
> > user	0m19.340s
> > sys	0m0.004s
> > 
> > Callgrind estimated cycles for each crush_do_rule are in the same range:
> > 
> > rm crush.callgrind ; valgrind --tool=callgrind
> > --callgrind-out-file=crush.callgrind crushtool -i /tmp/t.map
> > --show-bad-mappings --show-statistics --test --rule 0 --min-x 1 --max-x
> > 204800 --num-rep 4
> > kcachegrind crush.callgrind
> > 
> > With SIMD    : crush_do_rule is estimated to use 21 205 cycles
> > Without SIMD : crush_do_rule is estimated to use 53 068 cycles
> > 
> > This proof of concept relies on instructions that are available on all ARM &
> > Intel processors, nothing complicated is going on. It is beneficial to crush
> > maps that have more than four disks per host, more than four hosts per rack
> > etc. It probably is a small win for an OSD or even a client. For crushmap
> > validation it helps significantly since the MON are not able to run
> > crushtool asynchronously and it needs to run within a few seconds (because
> > it blocks the MON).
> > 
> > The implementation is straightforward: it needs sub/xor/lshift/rshift. The
> > only relatively tricky part is runtime / compile time detection of the SIMD
> > instructions for both Intel and ARM processors. Luckily this has already
> > been taken care of when integrating with the jerasure erasure code plugin.
> > 
> > Is there any reason why it would not be good to implement this ?
> > 
> > Cheers
> > 
> > [1]
> > https://github.com/dachary/ceph/commit/71ae4584d9ed57f70aad718d0ffe206a01e91fef
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: SIMD accelerated crush_do_rule proof of concept
  2016-08-29 13:15 ` Mark Nelson
  2016-08-29 13:57   ` Sage Weil
@ 2016-08-29 13:58   ` Loic Dachary
  1 sibling, 0 replies; 11+ messages in thread
From: Loic Dachary @ 2016-08-29 13:58 UTC (permalink / raw)
  To: Mark Nelson, Ceph Development

Hi Mark,

On 29/08/2016 15:15, Mark Nelson wrote:
> Anything we can do to help on the CPU usage front is a win IMHO, though I would be interested in seeing an example where we are spending a lot of time on crush in a real usage scenario?

When the ceph setcrushmap sends a new map to the monitor, it runs crushtool --test[1] and blocks until it's done. When the crushmap is large or the rules are complex it can timeout because the monitor won't let it run for longer than the mon lease time.

Cheers

[1] https://github.com/ceph/ceph/blob/master/src/crush/CrushTester.cc#L361

> Mark
> 
> On 08/29/2016 06:42 AM, Loic Dachary wrote:
>> Hi,
>>
>> TL;DR: crush_do_rule using SIMD goes twice faster, the implementation is straightforward and would help with crushmap validation, is there any reason not to do it ?
>>
>> When resolving a crush rule (crush_do_rule in mapper.c), the straw2 function (bucket_straw2_choose) calls the hashing function (crush_hash32_3) for each item in a bucket and keeps the best match. When a bucket has four items, the hash function can be run using SIMD instructions. Each item value is 32 bits and four can fit in a __m128i.
>>
>> I tried to inline the hash function when the conditions are right[1] and run a test to measure the difference.
>>
>> crushtool -o /tmp/t.map --num_osds 1024 --build node straw2 8 datacenter straw2 4 root straw2 0
>> time crushtool -i /tmp/t.map --show-bad-mappings --show-statistics --test --rule 0 --min-x 1 --max-x 2048000 --num-rep 4
>> rule 0 (replicated_ruleset), x = 1..2048000, numrep = 4..4
>> rule 0 (replicated_ruleset) num_rep 4 result size == 4:    2048000/2048000
>>
>> With SIMD
>>
>> real    0m10.433s
>> user    0m10.428s
>> sys    0m0.000s
>>
>> Without SIMD
>>
>> real    0m19.344s
>> user    0m19.340s
>> sys    0m0.004s
>>
>> Callgrind estimated cycles for each crush_do_rule are in the same range:
>>
>> rm crush.callgrind ; valgrind --tool=callgrind --callgrind-out-file=crush.callgrind crushtool -i /tmp/t.map --show-bad-mappings --show-statistics --test --rule 0 --min-x 1 --max-x 204800 --num-rep 4
>> kcachegrind crush.callgrind
>>
>> With SIMD    : crush_do_rule is estimated to use 21 205 cycles
>> Without SIMD : crush_do_rule is estimated to use 53 068 cycles
>>
>> This proof of concept relies on instructions that are available on all ARM & Intel processors, nothing complicated is going on. It is beneficial to crush maps that have more than four disks per host, more than four hosts per rack etc. It probably is a small win for an OSD or even a client. For crushmap validation it helps significantly since the MON are not able to run crushtool asynchronously and it needs to run within a few seconds (because it blocks the MON).
>>
>> The implementation is straightforward: it needs sub/xor/lshift/rshift. The only relatively tricky part is runtime / compile time detection of the SIMD instructions for both Intel and ARM processors. Luckily this has already been taken care of when integrating with the jerasure erasure code plugin.
>>
>> Is there any reason why it would not be good to implement this ?
>>
>> Cheers
>>
>> [1] https://github.com/dachary/ceph/commit/71ae4584d9ed57f70aad718d0ffe206a01e91fef
>>
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: SIMD accelerated crush_do_rule proof of concept
  2016-08-29 13:55 ` Sage Weil
@ 2016-08-29 14:03   ` Loic Dachary
  2016-08-29 15:07     ` Ilya Dryomov
  2016-08-29 14:08   ` Vincent JARDIN
  1 sibling, 1 reply; 11+ messages in thread
From: Loic Dachary @ 2016-08-29 14:03 UTC (permalink / raw)
  To: Sage Weil; +Cc: Ceph Development

Hi Sage,

On 29/08/2016 15:55, Sage Weil wrote:
> On Mon, 29 Aug 2016, Loic Dachary wrote:
>> Hi,
>>
>> TL;DR: crush_do_rule using SIMD goes twice faster, the implementation is 
>> straightforward and would help with crushmap validation, is there any 
>> reason not to do it ?
>>
>> When resolving a crush rule (crush_do_rule in mapper.c), the straw2 
>> function (bucket_straw2_choose) calls the hashing function 
>> (crush_hash32_3) for each item in a bucket and keeps the best match. 
>> When a bucket has four items, the hash function can be run using SIMD 
>> instructions. Each item value is 32 bits and four can fit in a __m128i.
>>
>> I tried to inline the hash function when the conditions are right[1] and 
>> run a test to measure the difference.
>>
>> crushtool -o /tmp/t.map --num_osds 1024 --build node straw2 8 datacenter straw2 4 root straw2 0
>> time crushtool -i /tmp/t.map --show-bad-mappings --show-statistics --test --rule 0 --min-x 1 --max-x 2048000 --num-rep 4
>> rule 0 (replicated_ruleset), x = 1..2048000, numrep = 4..4
>> rule 0 (replicated_ruleset) num_rep 4 result size == 4:	2048000/2048000
>>
>> With SIMD
>>
>> real	0m10.433s
>> user	0m10.428s
>> sys	0m0.000s
>>
>> Without SIMD
>>
>> real	0m19.344s
>> user	0m19.340s
>> sys	0m0.004s
>>
>> Callgrind estimated cycles for each crush_do_rule are in the same range:
>>
>> rm crush.callgrind ; valgrind --tool=callgrind --callgrind-out-file=crush.callgrind crushtool -i /tmp/t.map --show-bad-mappings --show-statistics --test --rule 0 --min-x 1 --max-x 204800 --num-rep 4
>> kcachegrind crush.callgrind
>>
>> With SIMD    : crush_do_rule is estimated to use 21 205 cycles
>> Without SIMD : crush_do_rule is estimated to use 53 068 cycles
>>
>> This proof of concept relies on instructions that are available on all 
>> ARM & Intel processors, nothing complicated is going on. It is 
>> beneficial to crush maps that have more than four disks per host, more 
>> than four hosts per rack etc. It probably is a small win for an OSD or 
>> even a client. For crushmap validation it helps significantly since the 
>> MON are not able to run crushtool asynchronously and it needs to run 
>> within a few seconds (because it blocks the MON).
>>
>> The implementation is straightforward: it needs sub/xor/lshift/rshift. 
>> The only relatively tricky part is runtime / compile time detection of 
>> the SIMD instructions for both Intel and ARM processors. Luckily this 
>> has already been taken care of when integrating with the jerasure 
>> erasure code plugin.
>>
>> Is there any reason why it would not be good to implement this ?
> 
> This is really cool!  I agree that the straw2 O(n) calculation on each 
> node is the place to apply this.
> 
> To answer your question, the only real risk/problem I see is that we need 
> to keep the perfectly in sync with the non-optimized variant since the 
> result has to be deterministic.  The maintenance burden is small, I think, 
> since for that reason the code behavior doesn't really change, but we do 
> need to pretty exhaustively verify that the new implementation matches the 
> old one.  Perhaps a set of unit tests that compile both variants and feed 
> it randomly sized and weighted straw2 buckets and feed lots of values 
> through?

Right: the implementation is likely to be simple but it needs serious testing. I'll give it a try.

Cheers

> sage
> 
>>
>> Cheers
>>
>> [1] https://github.com/dachary/ceph/commit/71ae4584d9ed57f70aad718d0ffe206a01e91fef
>>
>> -- 
>> Loïc Dachary, Artisan Logiciel Libre
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: SIMD accelerated crush_do_rule proof of concept
  2016-08-29 13:55 ` Sage Weil
  2016-08-29 14:03   ` Loic Dachary
@ 2016-08-29 14:08   ` Vincent JARDIN
  2016-08-29 14:54     ` Loic Dachary
  1 sibling, 1 reply; 11+ messages in thread
From: Vincent JARDIN @ 2016-08-29 14:08 UTC (permalink / raw)
  To: Sage Weil, Loic Dachary; +Cc: Ceph Development

Le 29/08/2016 à 15:55, Sage Weil a écrit :
> To answer your question, the only real risk/problem I see is that we need
> to keep the perfectly in sync with the non-optimized variant

I do propose a generic implementation that allows to share SIMD on ARM, 
Intel and others (Altivec),

 
https://github.com/dachary/ceph/commit/71ae4584d9ed57f70aad718d0ffe206a01e91fef

You can try the following,
For instance,
#include <stdint.h>
#include <immintrin.h>
{
__v32qi va, vb;
va = (__v32qi) { 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 
17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 4, 1, 0 };
vb = (__v32qi) { 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 
17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 };

__v32qi res = va ^ vb;
}

it will produce the optimized Neon or AVX, AVX2 according to each targets.



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: SIMD accelerated crush_do_rule proof of concept
  2016-08-29 14:08   ` Vincent JARDIN
@ 2016-08-29 14:54     ` Loic Dachary
  0 siblings, 0 replies; 11+ messages in thread
From: Loic Dachary @ 2016-08-29 14:54 UTC (permalink / raw)
  To: Vincent JARDIN; +Cc: Ceph Development

Hi Vincent,

On 29/08/2016 16:08, Vincent JARDIN wrote:
> Le 29/08/2016 à 15:55, Sage Weil a écrit :
>> To answer your question, the only real risk/problem I see is that we need
>> to keep the perfectly in sync with the non-optimized variant
> 
> I do propose a generic implementation that allows to share SIMD on ARM, Intel and others (Altivec),
> 
> 
> https://github.com/dachary/ceph/commit/71ae4584d9ed57f70aad718d0ffe206a01e91fef
> 
> You can try the following,
> For instance,
> #include <stdint.h>
> #include <immintrin.h>
> {
> __v32qi va, vb;
> va = (__v32qi) { 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 4, 1, 0 };
> vb = (__v32qi) { 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 };
> 
> __v32qi res = va ^ vb;
> }
> 
> it will produce the optimized Neon or AVX, AVX2 according to each targets.

Generic code that relies on the compiler optimizations is terse, which is nice. But the code is not generic: it needs to be written specifically for the optimizer, which is self defeating. The http://locklessinc.com/articles/vectorize/ article illustrate that in a fun way. Instead of maintaining code with SIMD instructions, you need to understand each optimizer by reading assembly language, which is more complicated.

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: SIMD accelerated crush_do_rule proof of concept
  2016-08-29 14:03   ` Loic Dachary
@ 2016-08-29 15:07     ` Ilya Dryomov
  0 siblings, 0 replies; 11+ messages in thread
From: Ilya Dryomov @ 2016-08-29 15:07 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Sage Weil, Ceph Development

On Mon, Aug 29, 2016 at 4:03 PM, Loic Dachary <loic@dachary.org> wrote:
> Hi Sage,
>
> On 29/08/2016 15:55, Sage Weil wrote:
>> On Mon, 29 Aug 2016, Loic Dachary wrote:
>>> Hi,
>>>
>>> TL;DR: crush_do_rule using SIMD goes twice faster, the implementation is
>>> straightforward and would help with crushmap validation, is there any
>>> reason not to do it ?
>>>
>>> When resolving a crush rule (crush_do_rule in mapper.c), the straw2
>>> function (bucket_straw2_choose) calls the hashing function
>>> (crush_hash32_3) for each item in a bucket and keeps the best match.
>>> When a bucket has four items, the hash function can be run using SIMD
>>> instructions. Each item value is 32 bits and four can fit in a __m128i.
>>>
>>> I tried to inline the hash function when the conditions are right[1] and
>>> run a test to measure the difference.
>>>
>>> crushtool -o /tmp/t.map --num_osds 1024 --build node straw2 8 datacenter straw2 4 root straw2 0
>>> time crushtool -i /tmp/t.map --show-bad-mappings --show-statistics --test --rule 0 --min-x 1 --max-x 2048000 --num-rep 4
>>> rule 0 (replicated_ruleset), x = 1..2048000, numrep = 4..4
>>> rule 0 (replicated_ruleset) num_rep 4 result size == 4:      2048000/2048000
>>>
>>> With SIMD
>>>
>>> real 0m10.433s
>>> user 0m10.428s
>>> sys  0m0.000s
>>>
>>> Without SIMD
>>>
>>> real 0m19.344s
>>> user 0m19.340s
>>> sys  0m0.004s
>>>
>>> Callgrind estimated cycles for each crush_do_rule are in the same range:
>>>
>>> rm crush.callgrind ; valgrind --tool=callgrind --callgrind-out-file=crush.callgrind crushtool -i /tmp/t.map --show-bad-mappings --show-statistics --test --rule 0 --min-x 1 --max-x 204800 --num-rep 4
>>> kcachegrind crush.callgrind
>>>
>>> With SIMD    : crush_do_rule is estimated to use 21 205 cycles
>>> Without SIMD : crush_do_rule is estimated to use 53 068 cycles
>>>
>>> This proof of concept relies on instructions that are available on all
>>> ARM & Intel processors, nothing complicated is going on. It is
>>> beneficial to crush maps that have more than four disks per host, more
>>> than four hosts per rack etc. It probably is a small win for an OSD or
>>> even a client. For crushmap validation it helps significantly since the
>>> MON are not able to run crushtool asynchronously and it needs to run
>>> within a few seconds (because it blocks the MON).
>>>
>>> The implementation is straightforward: it needs sub/xor/lshift/rshift.
>>> The only relatively tricky part is runtime / compile time detection of
>>> the SIMD instructions for both Intel and ARM processors. Luckily this
>>> has already been taken care of when integrating with the jerasure
>>> erasure code plugin.
>>>
>>> Is there any reason why it would not be good to implement this ?
>>
>> This is really cool!  I agree that the straw2 O(n) calculation on each
>> node is the place to apply this.
>>
>> To answer your question, the only real risk/problem I see is that we need
>> to keep the perfectly in sync with the non-optimized variant since the
>> result has to be deterministic.  The maintenance burden is small, I think,
>> since for that reason the code behavior doesn't really change, but we do
>> need to pretty exhaustively verify that the new implementation matches the
>> old one.  Perhaps a set of unit tests that compile both variants and feed
>> it randomly sized and weighted straw2 buckets and feed lots of values
>> through?
>
> Right: the implementation is likely to be simple but it needs serious testing. I'll give it a try.

We should also make sure it's neatly hidden under !__KERNEL__.  Since
the benefit to the clients is assumed to be minimal it's probably not
worth bothering with bringing SSE into the kernel implementation.

(However, if it turns out otherwise we can look into it - IIRC raid6
in-kernel library is using SSE, and probably some crypto code too.)

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: SIMD accelerated crush_do_rule proof of concept
  2016-08-29 11:42 SIMD accelerated crush_do_rule proof of concept Loic Dachary
  2016-08-29 13:15 ` Mark Nelson
  2016-08-29 13:55 ` Sage Weil
@ 2016-08-30 13:24 ` Piotr Dałek
  2016-08-30 16:53   ` Sage Weil
  2 siblings, 1 reply; 11+ messages in thread
From: Piotr Dałek @ 2016-08-30 13:24 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Ceph Development

On Mon, Aug 29, 2016 at 01:42:22PM +0200, Loic Dachary wrote:
> Hi,
> 
> TL;DR: crush_do_rule using SIMD goes twice faster, the implementation is straightforward and would help with crushmap validation, is there any reason not to do it ?
> 
> When resolving a crush rule (crush_do_rule in mapper.c), the straw2 function (bucket_straw2_choose) calls the hashing function (crush_hash32_3) for each item in a bucket and keeps the best match. When a bucket has four items, the hash function can be run using SIMD instructions. Each item value is 32 bits and four can fit in a __m128i.
> 
> I tried to inline the hash function when the conditions are right[1] and run a test to measure the difference.
> 
> crushtool -o /tmp/t.map --num_osds 1024 --build node straw2 8 datacenter straw2 4 root straw2 0
> time crushtool -i /tmp/t.map --show-bad-mappings --show-statistics --test --rule 0 --min-x 1 --max-x 2048000 --num-rep 4
> rule 0 (replicated_ruleset), x = 1..2048000, numrep = 4..4
> rule 0 (replicated_ruleset) num_rep 4 result size == 4:	2048000/2048000
> 
> With SIMD
> 
> real	0m10.433s
> user	0m10.428s
> sys	0m0.000s
> 
> Without SIMD
> 
> real	0m19.344s
> user	0m19.340s
> sys	0m0.004s
> 
> Callgrind estimated cycles for each crush_do_rule are in the same range:
> 
> rm crush.callgrind ; valgrind --tool=callgrind --callgrind-out-file=crush.callgrind crushtool -i /tmp/t.map --show-bad-mappings --show-statistics --test --rule 0 --min-x 1 --max-x 204800 --num-rep 4
> kcachegrind crush.callgrind
> 
> With SIMD    : crush_do_rule is estimated to use 21 205 cycles
> Without SIMD : crush_do_rule is estimated to use 53 068 cycles
> 
> This proof of concept relies on instructions that are available on all ARM & Intel processors, nothing complicated is going on. It is beneficial to crush maps that have more than four disks per host, more than four hosts per rack etc. It probably is a small win for an OSD or even a client. For crushmap validation it helps significantly since the MON are not able to run crushtool asynchronously and it needs to run within a few seconds (because it blocks the MON).
> 
> The implementation is straightforward: it needs sub/xor/lshift/rshift. The only relatively tricky part is runtime / compile time detection of the SIMD instructions for both Intel and ARM processors. Luckily this has already been taken care of when integrating with the jerasure erasure code plugin.
> 
> Is there any reason why it would not be good to implement this ?

I like this and I hope you'll be able to get it into master. I was wondering
if straw2 can be optimized further and realized that crush_ln() looks quite
expensive. I did some basic statistics and it looks like that crush_ln
accepts inputs in 0-65535 range, making lookup table solution feasible.
I tried it out:

Original:

[branch@localhost bin]$ time ./crushtool -i /tmp/t.map --show-bad-mappings --show-statistics --test --rule 0 --min-x 1 --max-x 2048000 --num-rep 4                                                                                                                
rule 0 (replicated_ruleset), x = 1..2048000, numrep = 4..4                                                                       
rule 0 (replicated_ruleset) num_rep 4 result size == 4: 2048000/2048000                                                          
                                                                                                                                 
real    0m25.635s                                                                                                                
user    0m25.553s                                                                                                                
sys     0m0.072s                                                                                                                 

Loic's SIMD:

[branch@localhost bin]$ time ./crushtool -i /tmp/t.map --show-bad-mappings --show-statistics --test --rule 0 --min-x 1 --max-x 2048000 --num-rep 4                                                                                                                
rule 0 (replicated_ruleset), x = 1..2048000, numrep = 4..4                                                                       
rule 0 (replicated_ruleset) num_rep 4 result size == 4: 2048000/2048000                                                          
                                                                                                                                 
real    0m15.292s                                                                                                                
user    0m15.227s                                                                                                                
sys     0m0.056s        

+Crush LN cache:

[branch@localhost bin]$ time ./crushtool -i /tmp/t.map --show-bad-mappings --show-statistics --test --rule 0 --min-x 1 --max-x 2048000 --num-rep 4                                                                                                                
rule 0 (replicated_ruleset), x = 1..2048000, numrep = 4..4                                                                       
rule 0 (replicated_ruleset) num_rep 4 result size == 4: 2048000/2048000                                                          
                                                                                                                                 
real    0m11.828s                                                                                                                
user    0m11.746s                                                                                                                
sys     0m0.078s

There's a drawback, too - 65536 of__u64's take 512KB of ram, so the question is
- is it worth it...

-- 
Piotr Dałek
branch@predictor.org.pl
http://blog.predictor.org.pl

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: SIMD accelerated crush_do_rule proof of concept
  2016-08-30 13:24 ` Piotr Dałek
@ 2016-08-30 16:53   ` Sage Weil
  0 siblings, 0 replies; 11+ messages in thread
From: Sage Weil @ 2016-08-30 16:53 UTC (permalink / raw)
  To: Piotr Dałek; +Cc: Loic Dachary, Ceph Development

[-- Attachment #1: Type: TEXT/PLAIN, Size: 6451 bytes --]

On Tue, 30 Aug 2016, Piotr Dałek wrote:
> On Mon, Aug 29, 2016 at 01:42:22PM +0200, Loic Dachary wrote:
> > Hi,
> > 
> > TL;DR: crush_do_rule using SIMD goes twice faster, the implementation is straightforward and would help with crushmap validation, is there any reason not to do it ?
> > 
> > When resolving a crush rule (crush_do_rule in mapper.c), the straw2 function (bucket_straw2_choose) calls the hashing function (crush_hash32_3) for each item in a bucket and keeps the best match. When a bucket has four items, the hash function can be run using SIMD instructions. Each item value is 32 bits and four can fit in a __m128i.
> > 
> > I tried to inline the hash function when the conditions are right[1] and run a test to measure the difference.
> > 
> > crushtool -o /tmp/t.map --num_osds 1024 --build node straw2 8 datacenter straw2 4 root straw2 0
> > time crushtool -i /tmp/t.map --show-bad-mappings --show-statistics --test --rule 0 --min-x 1 --max-x 2048000 --num-rep 4
> > rule 0 (replicated_ruleset), x = 1..2048000, numrep = 4..4
> > rule 0 (replicated_ruleset) num_rep 4 result size == 4:	2048000/2048000
> > 
> > With SIMD
> > 
> > real	0m10.433s
> > user	0m10.428s
> > sys	0m0.000s
> > 
> > Without SIMD
> > 
> > real	0m19.344s
> > user	0m19.340s
> > sys	0m0.004s
> > 
> > Callgrind estimated cycles for each crush_do_rule are in the same range:
> > 
> > rm crush.callgrind ; valgrind --tool=callgrind --callgrind-out-file=crush.callgrind crushtool -i /tmp/t.map --show-bad-mappings --show-statistics --test --rule 0 --min-x 1 --max-x 204800 --num-rep 4
> > kcachegrind crush.callgrind
> > 
> > With SIMD    : crush_do_rule is estimated to use 21 205 cycles
> > Without SIMD : crush_do_rule is estimated to use 53 068 cycles
> > 
> > This proof of concept relies on instructions that are available on all ARM & Intel processors, nothing complicated is going on. It is beneficial to crush maps that have more than four disks per host, more than four hosts per rack etc. It probably is a small win for an OSD or even a client. For crushmap validation it helps significantly since the MON are not able to run crushtool asynchronously and it needs to run within a few seconds (because it blocks the MON).
> > 
> > The implementation is straightforward: it needs sub/xor/lshift/rshift. The only relatively tricky part is runtime / compile time detection of the SIMD instructions for both Intel and ARM processors. Luckily this has already been taken care of when integrating with the jerasure erasure code plugin.
> > 
> > Is there any reason why it would not be good to implement this ?
> 
> I like this and I hope you'll be able to get it into master. I was wondering
> if straw2 can be optimized further and realized that crush_ln() looks quite
> expensive. I did some basic statistics and it looks like that crush_ln
> accepts inputs in 0-65535 range, making lookup table solution feasible.
> I tried it out:
> 
> Original:
> 
> [branch@localhost bin]$ time ./crushtool -i /tmp/t.map --show-bad-mappings --show-statistics --test --rule 0 --min-x 1 --max-x 2048000 --num-rep 4                                                                                                                
> rule 0 (replicated_ruleset), x = 1..2048000, numrep = 4..4                                                                       
> rule 0 (replicated_ruleset) num_rep 4 result size == 4: 2048000/2048000                                                          
>                                                                                                                                  
> real    0m25.635s                                                                                                                
> user    0m25.553s                                                                                                                
> sys     0m0.072s                                                                                                                 
> 
> Loic's SIMD:
> 
> [branch@localhost bin]$ time ./crushtool -i /tmp/t.map --show-bad-mappings --show-statistics --test --rule 0 --min-x 1 --max-x 2048000 --num-rep 4                                                                                                                
> rule 0 (replicated_ruleset), x = 1..2048000, numrep = 4..4                                                                       
> rule 0 (replicated_ruleset) num_rep 4 result size == 4: 2048000/2048000                                                          
>                                                                                                                                  
> real    0m15.292s                                                                                                                
> user    0m15.227s                                                                                                                
> sys     0m0.056s        
> 
> +Crush LN cache:
> 
> [branch@localhost bin]$ time ./crushtool -i /tmp/t.map --show-bad-mappings --show-statistics --test --rule 0 --min-x 1 --max-x 2048000 --num-rep 4                                                                                                                
> rule 0 (replicated_ruleset), x = 1..2048000, numrep = 4..4                                                                       
> rule 0 (replicated_ruleset) num_rep 4 result size == 4: 2048000/2048000                                                          
>                                                                                                                                  
> real    0m11.828s                                                                                                                
> user    0m11.746s                                                                                                                
> sys     0m0.078s
> 
> There's a drawback, too - 65536 of__u64's take 512KB of ram, so the question is
> - is it worth it...

I think it probably isn't, since we'll miss the CPU cache and have to 
fetch from DRAM.  But we have a pretty long discussion about this here

http://www.spinics.net/lists/ceph-devel/msg21635.html

and then Xiaoxi at Intel came up with the current implementation 
(replacing my original lookup table).

http://www.spinics.net/lists/ceph-devel/msg22094.html

I don't remember the specifics unfortunately...
sage

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2016-08-30 16:53 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-08-29 11:42 SIMD accelerated crush_do_rule proof of concept Loic Dachary
2016-08-29 13:15 ` Mark Nelson
2016-08-29 13:57   ` Sage Weil
2016-08-29 13:58   ` Loic Dachary
2016-08-29 13:55 ` Sage Weil
2016-08-29 14:03   ` Loic Dachary
2016-08-29 15:07     ` Ilya Dryomov
2016-08-29 14:08   ` Vincent JARDIN
2016-08-29 14:54     ` Loic Dachary
2016-08-30 13:24 ` Piotr Dałek
2016-08-30 16:53   ` Sage Weil

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.