* controlling erasure code chunk size
@ 2014-02-02 15:15 Loic Dachary
2014-02-02 16:18 ` Andreas Joachim Peters
0 siblings, 1 reply; 10+ messages in thread
From: Loic Dachary @ 2014-02-02 15:15 UTC (permalink / raw)
To: Samuel Just; +Cc: Ceph Development, Andreas Joachim Peters
[-- Attachment #1: Type: text/plain, Size: 953 bytes --]
[cc' ceph-devel]
Hi Sam,
Here is how chunks are expected to be aligned:
https://github.com/ceph/ceph/blob/4c4e1d0d470beba7690d1c0e39bfd1146a25f465/src/osd/ErasureCodePluginJerasure/ErasureCodeJerasure.cc#L365
  unsigned alignment = k*w*packetsize*sizeof(int);
  if ( ((w*packetsize*sizeof(int))%LARGEST_VECTOR_WORDSIZE) )
    alignment = k*w*packetsize*LARGEST_VECTOR_WORDSIZE;
  return alignment;
If you are going to encode small objects, it may very well lead to oversized chunks if packetsize is large. At the moment the default is 3072
https://github.com/ceph/ceph/blob/4c4e1d0d470beba7690d1c0e39bfd1146a25f465/src/common/config_opts.h#L406
A value I picked when experimenting with 1MB objects encoding ( http://dachary.org/?p=2594 ).
I'm not entirely sure why the alignment is calculated the way it is. Andreas certainly has a better understanding on this topic.
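To make the effect of packetsize concrete, here is a minimal sketch of the computation above as a standalone function. The function name jerasure_alignment, the value 16 for LARGEST_VECTOR_WORDSIZE, and the parameters k=4, w=8 are assumptions made for illustration; they are not taken from the linked source.

```cpp
#include <cstddef>

// Assumed value for illustration; the real constant lives in the plugin sources.
constexpr std::size_t LARGEST_VECTOR_WORDSIZE = 16;

// Standalone transcription of the alignment computation quoted above.
constexpr std::size_t jerasure_alignment(std::size_t k, std::size_t w,
                                         std::size_t packetsize) {
  std::size_t alignment = k * w * packetsize * sizeof(int);
  if ((w * packetsize * sizeof(int)) % LARGEST_VECTOR_WORDSIZE)
    alignment = k * w * packetsize * LARGEST_VECTOR_WORDSIZE;
  return alignment;
}

// Hypothetical k=4, w=8 with the default packetsize and with a small one.
constexpr std::size_t default_alignment = jerasure_alignment(4, 8, 3072);
constexpr std::size_t small_alignment = jerasure_alignment(4, 8, 128);
```

With a 4-byte int, default_alignment is 393216 bytes and small_alignment is 16384 bytes, so under these assumptions an object of a few kilobytes would be padded to several hundred kilobytes with the default packetsize.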
Cheers
--
Loïc Dachary, Artisan Logiciel Libre
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 263 bytes --]
^ permalink raw reply [flat|nested] 10+ messages in thread
* RE: controlling erasure code chunk size
2014-02-02 15:15 controlling erasure code chunk size Loic Dachary
@ 2014-02-02 16:18 ` Andreas Joachim Peters
2014-02-02 22:45 ` Samuel Just
0 siblings, 1 reply; 10+ messages in thread
From: Andreas Joachim Peters @ 2014-02-02 16:18 UTC (permalink / raw)
To: Loic Dachary, Samuel Just; +Cc: Ceph Development
Hi Loic et al.,
I think there is now some confusion about chunk_size, alignment, packetsize and the stripe_size to be used upstream.
Algorithms with a bit-matrix require that the size per device is a multiple of (packetsize*w). Moreover, the size per device and packetsize itself must each be a multiple of sizeof(long/int). For other algorithms you can assume the same with packetsize=1.
Both packetsize and w influence performance. In addition, a stripe_size that is too small hurts performance because of the bufferlist preparation, the internal buffer checks, and the extra loop iterations needed for the same amount of data. We could measure this, but the current benchmark would probably not reflect it, since it measures the algorithmic part and not the bufferlist preparation part.
If you want to define a stripe_size, it has to be a multiple of the value returned by get_chunksize, possibly a large multiple, but in total no larger than the processor caches. The plugin cannot define the stripe_size; it only defines the alignment to be used for it. The stripe_size itself is defined outside the plugin, which may complicate the understanding. We should carefully check the Jerasure alignment requirements and our current implementation once more.
To get rid of the platform dependency we could add a generic requirement that chunksize also has to be 64-byte aligned.
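As a sketch of how these requirements could combine, the helper below rounds a desired per-device size up to the least common multiple of (packetsize*w), sizeof(int) and the proposed 64 bytes. The function names are hypothetical and this is not the plugin's code; it also does not check that packetsize itself is a multiple of sizeof(int).

```cpp
#include <cstddef>

// Greatest common divisor, written out so the sketch does not need C++17.
constexpr std::size_t gcd_sz(std::size_t a, std::size_t b) {
  while (b != 0) {
    const std::size_t t = a % b;
    a = b;
    b = t;
  }
  return a;
}

constexpr std::size_t lcm_sz(std::size_t a, std::size_t b) {
  return a / gcd_sz(a, b) * b;
}

// Smallest unit a per-device size must be a multiple of, given the
// constraints described above plus the proposed generic 64-byte rule.
constexpr std::size_t chunk_unit(std::size_t w, std::size_t packetsize) {
  return lcm_sz(lcm_sz(w * packetsize, sizeof(int)), 64);
}

// Round a desired per-device size up to the next multiple of that unit.
constexpr std::size_t round_up_chunk(std::size_t desired, std::size_t w,
                                     std::size_t packetsize) {
  const std::size_t unit = chunk_unit(w, packetsize);
  return ((desired + unit - 1) / unit) * unit;
}
```

For example, with w=8 and packetsize=128 the unit is 1024 bytes, so a desired size of 1000 bytes would become 1024 and a desired size of 4096 bytes would stay unchanged.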
Cheers Andreas.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: controlling erasure code chunk size
2014-02-02 16:18 ` Andreas Joachim Peters
@ 2014-02-02 22:45 ` Samuel Just
2014-02-02 23:27 ` Andreas Joachim Peters
2014-02-03 11:35 ` Loic Dachary
0 siblings, 2 replies; 10+ messages in thread
From: Samuel Just @ 2014-02-02 22:45 UTC (permalink / raw)
To: Andreas Joachim Peters; +Cc: Loic Dachary, Ceph Development
I assume we will use get_chunksize(desired_chunksize) *
get_data_chunk_count() on the mon to define the stripe width (the size
of the buffer which will be presented to the plugin for encoding) for
the pool. At the moment, get_chunksize(4*(2<<10)) *
get_data_chunk_count() = 393216 using the jerasure plugin where
get_data_chunk_count() = 4. This seems a bit big?
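The figure can be reproduced with a small sketch, under two assumptions that are not stated in this thread: that get_chunksize pads the requested size up to the next multiple of the plugin alignment and divides by k, and that the plugin runs with w=8 and the default packetsize of 3072. The function names below are hypothetical.

```cpp
#include <cstddef>

// Alignment from the formula quoted earlier in the thread. The
// LARGEST_VECTOR_WORDSIZE branch is left out because it does not
// trigger for these parameters.
constexpr std::size_t alignment(std::size_t k, std::size_t w,
                                std::size_t packetsize) {
  return k * w * packetsize * sizeof(int);
}

// Assumed behaviour: pad the object up to a multiple of the alignment,
// then split it into k equally sized chunks.
constexpr std::size_t chunk_size(std::size_t object_size, std::size_t k,
                                 std::size_t w, std::size_t packetsize) {
  const std::size_t a = alignment(k, w, packetsize);
  const std::size_t padded = ((object_size + a - 1) / a) * a;
  return padded / k;
}

constexpr std::size_t k = 4, w = 8, packetsize = 3072;
constexpr std::size_t requested = 4 * (2 << 10);  // 8192 bytes, as written above
constexpr std::size_t stripe_width = chunk_size(requested, k, w, packetsize) * k;
```

With a 4-byte int, stripe_width evaluates to 393216: the 8192-byte request is padded up to one full alignment unit.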
-Sam
^ permalink raw reply [flat|nested] 10+ messages in thread
* RE: controlling erasure code chunk size
2014-02-02 22:45 ` Samuel Just
@ 2014-02-02 23:27 ` Andreas Joachim Peters
2014-02-02 23:33 ` Samuel Just
2014-02-03 10:57 ` Loic Dachary
2014-02-03 11:35 ` Loic Dachary
1 sibling, 2 replies; 10+ messages in thread
From: Andreas Joachim Peters @ 2014-02-02 23:27 UTC (permalink / raw)
To: Samuel Just; +Cc: Loic Dachary, Ceph Development
If you want 4k stripe_size, you have to configure the cauchy plugin with w=8 packetsize=128 for a k=4 configuration.
For w=(multiple of 8) we could probably skip the (*sizeof(int)) and get the chunksize factor 4 down ... Loic we should check if this is ok with the Jerasure implementation .... I wonder if we should have 'packetsize' as a plugin parameter or we should just adjust the packetsize based on the desired chunk_size to get it close.
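As a quick check of the arithmetic, here is a sketch that evaluates the stripe alignment with and without the sizeof(int) factor. It assumes the alignment formula quoted at the start of the thread; the function names are hypothetical.

```cpp
#include <cstddef>

// Stripe alignment including the sizeof(int) factor, as in the quoted formula.
constexpr std::size_t with_int_factor(std::size_t k, std::size_t w,
                                      std::size_t packetsize) {
  return k * w * packetsize * sizeof(int);
}

// Stripe alignment if the sizeof(int) factor were dropped, as proposed
// above for w being a multiple of 8.
constexpr std::size_t without_int_factor(std::size_t k, std::size_t w,
                                         std::size_t packetsize) {
  return k * w * packetsize;
}
```

For k=4, w=8, packetsize=128 the second function gives 4096 bytes, and the first gives 4096*sizeof(int), which is 16384 bytes with a 4-byte int, or 4096 bytes per chunk.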
Cheers Andreas.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: controlling erasure code chunk size
2014-02-02 23:27 ` Andreas Joachim Peters
@ 2014-02-02 23:33 ` Samuel Just
2014-02-04 16:17 ` Loic Dachary
2014-02-03 10:57 ` Loic Dachary
1 sibling, 1 reply; 10+ messages in thread
From: Samuel Just @ 2014-02-02 23:33 UTC (permalink / raw)
To: Andreas Joachim Peters; +Cc: Loic Dachary, Ceph Development
Adjusting deterministically based on the desired chunk_size seems like
it would be the simplest thing, if only to avoid having one more knob
to mis-adjust. How large does packetsize need to be before making it
bigger no longer provides a benefit?
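One way such a deterministic adjustment could look, as a sketch only: pick the largest packetsize, rounded down to a multiple of sizeof(int), for which w*packetsize*sizeof(int) still fits in the desired chunk size. The function name and the rounding policy are assumptions, not an agreed design.

```cpp
#include <cstddef>

// Derive packetsize from the desired chunk size instead of exposing it
// as a separate knob. Falls back to sizeof(int) for very small chunks.
constexpr std::size_t pick_packetsize(std::size_t desired_chunk_size,
                                      std::size_t w) {
  const std::size_t word = sizeof(int);
  std::size_t p = desired_chunk_size / (w * word);
  p -= p % word;  // keep packetsize a multiple of sizeof(int)
  return p < word ? word : p;
}
```

With a 4-byte int and w=8, a desired chunk size of 4096 bytes yields packetsize=128. An upper bound could be added once the point of diminishing returns has been measured.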
-Sam
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: controlling erasure code chunk size
2014-02-02 23:27 ` Andreas Joachim Peters
2014-02-02 23:33 ` Samuel Just
@ 2014-02-03 10:57 ` Loic Dachary
1 sibling, 0 replies; 10+ messages in thread
From: Loic Dachary @ 2014-02-03 10:57 UTC (permalink / raw)
To: Andreas Joachim Peters, Samuel Just; +Cc: Ceph Development
[-- Attachment #1: Type: text/plain, Size: 4266 bytes --]
Hi Andreas,
I better understand what we're after. Can you join the irc.oftc.net#ceph-devel irc channel to discuss the details ? We have a few hours ahead of us before Los Angeles wakes up ;-)
Cheers
--
Loïc Dachary, Artisan Logiciel Libre
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 263 bytes --]
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: controlling erasure code chunk size
2014-02-02 22:45 ` Samuel Just
2014-02-02 23:27 ` Andreas Joachim Peters
@ 2014-02-03 11:35 ` Loic Dachary
2014-02-03 18:15 ` Samuel Just
1 sibling, 1 reply; 10+ messages in thread
From: Loic Dachary @ 2014-02-03 11:35 UTC (permalink / raw)
To: Samuel Just, Andreas Joachim Peters; +Cc: Ceph Development
[-- Attachment #1: Type: text/plain, Size: 4212 bytes --]
Hi Sam,
The argument to get_chunk_size is the stripe width. It is named object_size because the API knows nothing about stripes; striping is a concept for the caller to implement. Say you have a desired chunk size in mind, you would:
  object_size = desired_chunk_size * get_data_chunk_count()
  actual_chunk_size = get_chunk_size(object_size)
If you have a desired stripe width / object size in mind you would:
  object_size = desired_stripe_width
  chunk_size = get_chunk_size(object_size)
Following Andreas' suggestions, controlling the size of the actual chunk is a matter of tweaking the alignment constraints via the erasure code plugin parameters.
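The two call patterns can be sketched as follows. ToyPlugin is a hypothetical stand-in that models only the two calls named above; its padding rule (round the object up to the alignment, then divide by k) is an assumption, and the alignment is expected to be a multiple of k.

```cpp
#include <cstddef>

// Hypothetical stand-in for an erasure code plugin.
struct ToyPlugin {
  std::size_t k;          // number of data chunks
  std::size_t alignment;  // stripe alignment imposed by the plugin

  constexpr std::size_t get_data_chunk_count() const { return k; }

  // Assumed behaviour: pad to the alignment, then split into k chunks.
  constexpr std::size_t get_chunk_size(std::size_t object_size) const {
    const std::size_t padded =
        ((object_size + alignment - 1) / alignment) * alignment;
    return padded / k;
  }
};

// Case 1: start from a desired chunk size.
constexpr std::size_t chunk_from_desired_chunk(const ToyPlugin& p,
                                               std::size_t desired_chunk_size) {
  const std::size_t object_size = desired_chunk_size * p.get_data_chunk_count();
  return p.get_chunk_size(object_size);
}

// Case 2: start from a desired stripe width / object size.
constexpr std::size_t chunk_from_stripe_width(const ToyPlugin& p,
                                              std::size_t desired_stripe_width) {
  return p.get_chunk_size(desired_stripe_width);
}
```

With k=4 and an alignment of 4096 bytes, a desired chunk size of 1000 bytes yields an actual chunk size of 1024 bytes, and a desired stripe width of 8192 bytes yields chunks of 2048 bytes.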
Cheers
--
Loïc Dachary, Artisan Logiciel Libre
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 263 bytes --]
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: controlling erasure code chunk size
2014-02-03 11:35 ` Loic Dachary
@ 2014-02-03 18:15 ` Samuel Just
0 siblings, 0 replies; 10+ messages in thread
From: Samuel Just @ 2014-02-03 18:15 UTC (permalink / raw)
To: Loic Dachary; +Cc: Andreas Joachim Peters, Ceph Development
Yes, I figured we might as well match stripe_width to chunk_size *
get_data_chunk_count().
-Sam
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: controlling erasure code chunk size
2014-02-02 23:33 ` Samuel Just
@ 2014-02-04 16:17 ` Loic Dachary
2014-02-04 17:01 ` Andreas Joachim Peters
0 siblings, 1 reply; 10+ messages in thread
From: Loic Dachary @ 2014-02-04 16:17 UTC (permalink / raw)
To: Andreas Joachim Peters; +Cc: Ceph Development
Hi Andreas,
> For w=(multiple of 8) we could probably skip the (*sizeof(int)) and get the chunksize factor 4 down ... Loic we should check if this is ok with the Jerasure implementation .... I wonder if we should have 'packetsize' as a plugin parameter or we should just adjust the packetsize based on the desired chunk_size to get it close.
You are correct: the packet size is best adapted to the object size (or stripe size) rather than being set once and for all. However, Sam wants to use a fixed stripe size and we don't need this flexibility right now.
I don't fully understand the alignment requirements of Jerasure. Since we're using Cauchy because it is the fastest, here is how I understand its alignment constraints. I copied them from the original encode/decode methods found in jerasure into the get_alignment method without understanding the details.
* each chunk memory address must be aligned to allow
https://github.com/ceph/ceph/blob/v0.76/src/osd/ErasureCodePluginJerasure/vectorop.h to be used by https://github.com/ceph/ceph/blob/v0.76/src/osd/ErasureCodePluginJerasure/galois.c#L748 . This is done without reading from get_alignment() because each buffer is created with https://github.com/ceph/ceph/blob/v0.76/src/common/buffer.cc#L519 buffer::create_page_aligned which calls https://github.com/ceph/ceph/blob/v0.76/src/common/buffer.cc#L235 posix_memalign with an alignment of CEPH_PAGE_SIZE which is large enough. It is implicit though and it would be better to explicitly set this constraint.
https://github.com/ceph/ceph/blob/v0.76/src/osd/ErasureCodePluginJerasure/ErasureCodeJerasure.cc#L288
* each chunk size must be a multiple of get_alignment() and in the case of the Cauchy techniques it means:
** being a multiple of sizeof(int) (why?)
** being a multiple of LARGEST_VECTOR_WORDSIZE (because https://github.com/ceph/ceph/blob/v0.76/src/osd/ErasureCodePluginJerasure/galois.c#L748)
** being a multiple of k*w*packetsize (because each chunk contains k packets of size packetsize and each packet is made of words of size w)
I would be grateful if you could explain what the sizeof(int) is about. Also, I understand that k*w*packetsize should be a multiple of LARGEST_VECTOR_WORDSIZE but I don't understand why you would multiply the alignment to achieve this. Would it be enough to do if (alignment % LARGEST_VECTOR_WORDSIZE) alignment += alignment % LARGEST_VECTOR_WORDSIZE ?
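For reference, the get_alignment() logic quoted at the start of this thread can be written as a standalone function. This is only a sketch for experimenting with values, not the plugin code: the free function is hypothetical, and LARGEST_VECTOR_WORDSIZE = 16 is an assumed value.

```cpp
#include <cassert>
#include <cstddef>

// Assumed value for illustration; the real constant is defined in vectorop.h.
const std::size_t LARGEST_VECTOR_WORDSIZE = 16;

// Hypothetical free function mirroring the get_alignment() snippet quoted
// at the start of the thread.
std::size_t get_alignment(std::size_t k, std::size_t w, std::size_t packetsize) {
  assert(k > 0 && w > 0 && packetsize > 0);
  std::size_t alignment = k * w * packetsize * sizeof(int);
  // If the per-chunk unit is not a multiple of the vector word size,
  // multiply by the vector word size instead of sizeof(int).
  if ((w * packetsize * sizeof(int)) % LARGEST_VECTOR_WORDSIZE)
    alignment = k * w * packetsize * LARGEST_VECTOR_WORDSIZE;
  return alignment;
}
```

With the default packetsize of 3072, hypothetical values k=2 and w=8, and a 4-byte int, this evaluates to 2*8*3072*4 = 196608 bytes, which shows how a large packetsize inflates the alignment for small objects.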
Thanks in advance for your patience :-)
--
Loïc Dachary, Artisan Logiciel Libre
^ permalink raw reply [flat|nested] 10+ messages in thread
* RE: controlling erasure code chunk size
2014-02-04 16:17 ` Loic Dachary
@ 2014-02-04 17:01 ` Andreas Joachim Peters
0 siblings, 0 replies; 10+ messages in thread
From: Andreas Joachim Peters @ 2014-02-04 17:01 UTC (permalink / raw)
To: Loic Dachary; +Cc: Ceph Development
Hi Loic,
for the sizeof(int)... the reason is that Jerasure internally uses long* addresses and operates on them, e.g. if you XOR two chunks of size 3 you access illegal memory, because a long* XOR also touches byte 4.
The PDF documentation says this:
int packetsize: The packet size as defined in section 1. This must be a multiple of sizeof(long).
int size: The total number of bytes per device to encode/decode. This must be a multiple of sizeof(long). If a bit-matrix is being employed, then it must be a multiple of packetsize * w. If one desires to encode data blocks that do not conform to these restrictions, than one must pad the data blocks with zeroes so that the restrictions are met.
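The restrictions quoted above can be expressed as a small check plus a padding helper. This is a sketch under my reading of the quoted text; size_is_valid and padded_size are hypothetical names, not part of the Jerasure API.

```cpp
#include <cassert>
#include <cstddef>

// Returns true if `size` (bytes per device) meets the quoted restrictions.
// `bitmatrix` enables the additional packetsize * w rule.
bool size_is_valid(std::size_t size, std::size_t packetsize, std::size_t w,
                   bool bitmatrix) {
  if (size % sizeof(long) != 0) return false;
  if (bitmatrix) {
    if (packetsize == 0 || w == 0) return false;
    if (packetsize % sizeof(long) != 0) return false;
    if (size % (packetsize * w) != 0) return false;
  }
  return true;
}

// Smallest multiple of `unit` that is >= size; the caller pads the
// remaining bytes with zeroes.
std::size_t padded_size(std::size_t size, std::size_t unit) {
  assert(unit > 0);
  return ((size + unit - 1) / unit) * unit;
}
```

For a bit-matrix technique, padding to a multiple of packetsize * w is enough whenever packetsize is itself a multiple of sizeof(long), since the product then is as well.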
You cannot just do the modulo adjustment because this breaks the requirement that len is a multiple of packetsize*w!
Imagine w=3, packetsize=4: a modulo add would adjust it to 16, and you cannot divide 16 by 12, so the smallest proper adjustment here is 48! So the simplest approach is to just add another " * VECTOR_WORD_SIZE", since the condition will then always be fulfilled.
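The w=3, packetsize=4 example can be checked with a few lines. This sketch assumes a vector word size of 16, and the three helper names are hypothetical. It compares rounding up, multiplying, and taking the least common multiple.

```cpp
#include <cassert>
#include <cstddef>

const std::size_t VECTOR_WORD_SIZE = 16;  // assumed value for illustration

// Round `unit` up to the next multiple of the vector word size.
std::size_t round_up_to_vector(std::size_t unit) {
  return ((unit + VECTOR_WORD_SIZE - 1) / VECTOR_WORD_SIZE) * VECTOR_WORD_SIZE;
}

// The simple approach: multiply by the vector word size.
std::size_t multiply_by_vector(std::size_t unit) {
  return unit * VECTOR_WORD_SIZE;
}

// Greatest common divisor by Euclid's algorithm.
std::size_t gcd_of(std::size_t a, std::size_t b) {
  while (b != 0) {
    std::size_t t = a % b;
    a = b;
    b = t;
  }
  return a;
}

// Smallest value that is a multiple of both `unit` and the vector word size.
std::size_t smallest_common_multiple(std::size_t unit) {
  assert(unit > 0);
  return unit / gcd_of(unit, VECTOR_WORD_SIZE) * VECTOR_WORD_SIZE;
}
```

For unit = w*packetsize = 12, rounding up gives 16, which is not divisible by 12. Multiplying gives 192, which satisfies both constraints but is larger than necessary. The least common multiple gives 48, the smallest proper adjustment mentioned above.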
Cheers Andreas.
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2014-02-04 17:01 UTC | newest]
Thread overview: 10+ messages
2014-02-02 15:15 controlling erasure code chunk size Loic Dachary
2014-02-02 16:18 ` Andreas Joachim Peters
2014-02-02 22:45 ` Samuel Just
2014-02-02 23:27 ` Andreas Joachim Peters
2014-02-02 23:33 ` Samuel Just
2014-02-04 16:17 ` Loic Dachary
2014-02-04 17:01 ` Andreas Joachim Peters
2014-02-03 10:57 ` Loic Dachary
2014-02-03 11:35 ` Loic Dachary
2014-02-03 18:15 ` Samuel Just