* controlling erasure code chunk size
@ 2014-02-02 15:15 Loic Dachary
2014-02-02 16:18 ` Andreas Joachim Peters
0 siblings, 1 reply; 10+ messages in thread
From: Loic Dachary @ 2014-02-02 15:15 UTC (permalink / raw)
To: Samuel Just; +Cc: Ceph Development, Andreas Joachim Peters
[cc' ceph-devel]
Hi Sam,
Here is how chunks are expected to be aligned:
https://github.com/ceph/ceph/blob/4c4e1d0d470beba7690d1c0e39bfd1146a25f465/src/osd/ErasureCodePluginJerasure/ErasureCodeJerasure.cc#L365
    unsigned alignment = k*w*packetsize*sizeof(int);
    if ( ((w*packetsize*sizeof(int))%LARGEST_VECTOR_WORDSIZE) )
      alignment = k*w*packetsize*LARGEST_VECTOR_WORDSIZE;
    return alignment;
If you are going to encode small objects, a large packetsize may very well lead to oversized chunks. At the moment the default packetsize is 3072:
https://github.com/ceph/ceph/blob/4c4e1d0d470beba7690d1c0e39bfd1146a25f465/src/common/config_opts.h#L406
I picked that value when experimenting with the encoding of 1MB objects ( http://dachary.org/?p=2594 ).
I'm not entirely sure why the alignment is calculated the way it is. Andreas certainly has a better understanding of this topic.
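To make the effect of a large packetsize concrete, here is a minimal sketch of the same arithmetic as a free function. It is not the plugin code: the function name is made up, LARGEST_VECTOR_WORDSIZE is assumed to be 16 bytes, sizeof(int) is passed in as a parameter defaulting to 4, and w=8 is an assumption (the default w is not stated in this message).

```cpp
#include <cassert>
#include <cstddef>

// Assumption: 16 bytes (one SSE register). The real constant lives in the
// plugin's vectorop.h and may differ.
constexpr std::size_t kLargestVectorWordsize = 16;

// Hypothetical free-function version of the arithmetic quoted above.
// int_size stands in for sizeof(int) so the result is platform independent.
inline std::size_t cauchy_alignment(std::size_t k, std::size_t w,
                                    std::size_t packetsize,
                                    std::size_t int_size = 4) {
  std::size_t alignment = k * w * packetsize * int_size;
  // If one chunk's worth is not a multiple of the vector word size,
  // fall back to multiplying by the vector word size instead.
  if ((w * packetsize * int_size) % kLargestVectorWordsize != 0) {
    alignment = k * w * packetsize * kLargestVectorWordsize;
  }
  return alignment;
}
```

Under these assumptions, k=4, w=8 and packetsize=3072 give an alignment of 393216 bytes, so even a tiny object is padded to 98304 bytes per data chunk.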
Cheers
--
Loïc Dachary, Artisan Logiciel Libre
* RE: controlling erasure code chunk size
2014-02-02 15:15 controlling erasure code chunk size Loic Dachary
@ 2014-02-02 16:18 ` Andreas Joachim Peters
2014-02-02 22:45 ` Samuel Just
0 siblings, 1 reply; 10+ messages in thread
From: Andreas Joachim Peters @ 2014-02-02 16:18 UTC (permalink / raw)
To: Loic Dachary, Samuel Just; +Cc: Ceph Development
Hi Loic et al.,
I think there is now some confusion about chunk_size, alignment, packetsize and the stripe_size to be used upstream.
Algorithms with a bit-matrix require that the size per device is a multiple of (packetsize*w). Moreover, the size per device and packetsize itself must be a multiple of sizeof(long/int). For other algorithms you can assume the same with packetsize=1.
packetsize and w influence the performance, and a too small stripe_size on top of that will hurt performance because of the preparation of the bufferlist, internal buffer checks and more loops to execute for the same amount of data. We can also do some measurements for this, but the current benchmark would probably not reflect it, since it measures the algorithmic part and not the bufferlist preparation part.
If you want to define a stripe_size, it has to be a multiple of the value returned by get_chunksize, possibly a large multiple, but in total not larger than the processor caches. The plugin cannot define the stripe_size; it defines only the alignment to be used for stripe_size. stripe_size is defined outside the plugin, which maybe complicates the understanding. We should carefully check once more the Jerasure alignment requirements and our current implementation.
To get rid of the platform dependency we could add a generic alignment requirement that chunksize also has to be 64-byte aligned.
Cheers Andreas.
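The bit-matrix constraints listed above can be written down as a small predicate. This is only a sketch: the function name is hypothetical (it is neither a Jerasure nor a Ceph call), and word_size is a parameter standing in for sizeof(long) so the check can be evaluated for a given platform.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: true when a per-device size satisfies the
// constraints stated for bit-matrix techniques:
//   - packetsize is a multiple of the machine word size,
//   - the size per device is a multiple of the machine word size,
//   - the size per device is a multiple of packetsize * w.
inline bool bitmatrix_size_ok(std::size_t size_per_device, std::size_t w,
                              std::size_t packetsize,
                              std::size_t word_size = sizeof(long)) {
  if (w == 0 || packetsize == 0 || word_size == 0) {
    return false;  // avoid dividing by zero on nonsensical input
  }
  return packetsize % word_size == 0 &&
         size_per_device % word_size == 0 &&
         size_per_device % (packetsize * w) == 0;
}
```

For instance, with an 8-byte word, a 4096-byte chunk passes for w=8 and packetsize=128, while a 16-byte chunk fails for w=3 and packetsize=4 because 16 is not a multiple of 12.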
* Re: controlling erasure code chunk size
2014-02-02 16:18 ` Andreas Joachim Peters
@ 2014-02-02 22:45 ` Samuel Just
2014-02-02 23:27 ` Andreas Joachim Peters
2014-02-03 11:35 ` Loic Dachary
0 siblings, 2 replies; 10+ messages in thread
From: Samuel Just @ 2014-02-02 22:45 UTC (permalink / raw)
To: Andreas Joachim Peters; +Cc: Loic Dachary, Ceph Development
I assume we will use get_chunksize(desired_chunksize) * get_data_chunk_count() on the mon to define the stripe width (the size of the buffer which will be presented to the plugin for encoding) for the pool. At the moment, get_chunksize(4*(2<<10)) * get_data_chunk_count() = 393216 using the jerasure plugin where get_data_chunk_count() = 4. This seems a bit big?
-Sam
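The 393216 figure can be reproduced with a few lines, assuming that get_chunksize() pads its argument up to the plugin alignment and divides the result by the number of data chunks. That behaviour, the function name below, and the alignment value of 393216 (k=4, w=8, packetsize=3072, sizeof(int)=4) are assumptions made for this sketch, not facts stated in the thread.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for get_chunksize(): round object_size up to the
// next multiple of alignment, then split it across k data chunks.
// Preconditions: alignment > 0 and k > 0.
inline std::size_t padded_chunk_size(std::size_t object_size,
                                     std::size_t alignment, std::size_t k) {
  const std::size_t tail = object_size % alignment;
  const std::size_t padded = object_size + (tail ? alignment - tail : 0);
  return padded / k;
}
```

Note that 4*(2<<10) evaluates to 8192 rather than 4096. Either value is far below the assumed alignment, so both are padded to one full alignment unit and the stripe width comes out as 393216.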
* RE: controlling erasure code chunk size
2014-02-02 22:45 ` Samuel Just
@ 2014-02-02 23:27 ` Andreas Joachim Peters
2014-02-02 23:33 ` Samuel Just
2014-02-03 10:57 ` Loic Dachary
2014-02-03 11:35 ` Loic Dachary
1 sibling, 2 replies; 10+ messages in thread
From: Andreas Joachim Peters @ 2014-02-02 23:27 UTC (permalink / raw)
To: Samuel Just; +Cc: Loic Dachary, Ceph Development
If you want 4k stripe_size, you have to configure the cauchy plugin with w=8 packetsize=128 for a k=4 configuration.
For w=(multiple of 8) we could probably skip the (*sizeof(int)) and bring the chunksize down by a factor of 4. Loic, we should check whether this is OK with the Jerasure implementation. I wonder if we should have 'packetsize' as a plugin parameter or if we should just adjust the packetsize based on the desired chunk_size to get it close.
Cheers Andreas.
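One way to read the suggestion of deriving packetsize from the desired chunk_size is sketched below. Everything here is hypothetical: the function, the cap parameter and the rounding policy are illustrations, and they assume the current formula in which one chunk is a multiple of w * packetsize * sizeof(int).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical policy: pick the largest packetsize such that
// w * packetsize * int_size does not exceed desired_chunk_size,
// capped at max_packetsize and rounded down to a multiple of word_size
// (standing in for sizeof(long)). Never returns less than word_size.
// Preconditions: w > 0, int_size > 0, word_size > 0.
inline std::size_t pick_packetsize(std::size_t desired_chunk_size,
                                   std::size_t w, std::size_t max_packetsize,
                                   std::size_t int_size = 4,
                                   std::size_t word_size = 8) {
  std::size_t p = desired_chunk_size / (w * int_size);
  if (p > max_packetsize) {
    p = max_packetsize;
  }
  p -= p % word_size;  // round down to a multiple of the word size
  return p < word_size ? word_size : p;
}
```

With w=8 and a desired chunk of 4096 bytes this yields packetsize=128, which matches the configuration mentioned above if the 4k is read as the per-chunk size. What the cap should be is exactly the open question of how large packetsize can usefully get.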
* Re: controlling erasure code chunk size
2014-02-02 23:27 ` Andreas Joachim Peters
@ 2014-02-02 23:33 ` Samuel Just
2014-02-04 16:17 ` Loic Dachary
2014-02-03 10:57 ` Loic Dachary
1 sibling, 1 reply; 10+ messages in thread
From: Samuel Just @ 2014-02-02 23:33 UTC (permalink / raw)
To: Andreas Joachim Peters; +Cc: Loic Dachary, Ceph Development
Adjusting deterministically based on the desired chunk_size seems like it would be the simplest thing, if only to avoid having one more knob to mis-adjust. How large does packetsize need to be before making it bigger no longer provides a benefit?
-Sam
* Re: controlling erasure code chunk size
2014-02-02 23:33 ` Samuel Just
@ 2014-02-04 16:17 ` Loic Dachary
2014-02-04 17:01 ` Andreas Joachim Peters
0 siblings, 1 reply; 10+ messages in thread
From: Loic Dachary @ 2014-02-04 16:17 UTC (permalink / raw)
To: Andreas Joachim Peters; +Cc: Ceph Development
Hi Andreas,
> For w=(multiple of 8) we could probably skip the (*sizeof(int)) and get the chunksize factor 4 down ... Loic we should check if this is ok with the Jerasure implementation .... I wonder if we should have 'packetsize' as a plugin parameter or we should just adjust the packetsize based on the desired chunk_size to get it close.
You are correct: the packet size is best adapted to the object size (or stripe size) rather than being set once and for all. However, Sam wants to use a fixed stripe size and we don't need this flexibility right now.
I don't fully understand the alignment requirements of Jerasure. Since we're using Cauchy because it is the fastest, here is how I understand its alignment constraints. I copied them from the original encode/decode methods found in jerasure into the get_alignment method without understanding the details.
* each chunk memory address must be aligned to allow https://github.com/ceph/ceph/blob/v0.76/src/osd/ErasureCodePluginJerasure/vectorop.h to be used by https://github.com/ceph/ceph/blob/v0.76/src/osd/ErasureCodePluginJerasure/galois.c#L748 . This is done without reading from get_alignment() because each buffer is created with https://github.com/ceph/ceph/blob/v0.76/src/common/buffer.cc#L519 buffer::create_page_aligned which calls https://github.com/ceph/ceph/blob/v0.76/src/common/buffer.cc#L235 posix_memalign with an alignment of CEPH_PAGE_SIZE which is large enough. It is implicit though and it would be better to explicitly set this constraint. https://github.com/ceph/ceph/blob/v0.76/src/osd/ErasureCodePluginJerasure/ErasureCodeJerasure.cc#L288
* each chunk size must be a multiple of get_alignment() and in the case of the Cauchy techniques it means:
** being a multiple of sizeof(int) (why?)
** being a multiple of LARGEST_VECTOR_WORDSIZE (because https://github.com/ceph/ceph/blob/v0.76/src/osd/ErasureCodePluginJerasure/galois.c#L748)
** being a multiple of k*w*packetsize (because each chunk contains k packets of packetsize and each packet is made of words of size w)
I would be grateful if you could explain what the sizeof(int) is about. Also, I understand that k*w*packetsize should be a multiple of LARGEST_VECTOR_WORDSIZE but I don't understand why you would multiply the alignment to achieve this. Would it be enough to do
    if (alignment % LARGEST_VECTOR_WORDSIZE)
      alignment += alignment % LARGEST_VECTOR_WORDSIZE
?
Thanks in advance for your patience :-)
--
Loïc Dachary, Artisan Logiciel Libre
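The point about the address alignment being implicit can be illustrated with a short sketch that allocates with posix_memalign and then checks the address explicitly. The helper names are made up, and the two constants are assumptions standing in for CEPH_PAGE_SIZE (4096) and LARGEST_VECTOR_WORDSIZE (16); this is not the Ceph buffer code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign (POSIX), free

constexpr std::size_t kPageSize = 4096;      // assumed CEPH_PAGE_SIZE
constexpr std::size_t kVectorWordsize = 16;  // assumed LARGEST_VECTOR_WORDSIZE

// True when the address p is a multiple of alignment (alignment > 0).
inline bool is_aligned(const void* p, std::size_t alignment) {
  return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Allocates len bytes at a page-aligned address, or returns nullptr on
// failure. The caller releases the memory with free().
inline void* alloc_page_aligned(std::size_t len) {
  void* p = nullptr;
  if (posix_memalign(&p, kPageSize, len) != 0) {
    return nullptr;
  }
  return p;
}
```

Because the page size is a multiple of the vector word size, a page-aligned buffer is automatically vector aligned. Making the constraint explicit would amount to asserting is_aligned(chunk, kVectorWordsize) before the vector operations run.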
* RE: controlling erasure code chunk size
2014-02-04 16:17 ` Loic Dachary
@ 2014-02-04 17:01 ` Andreas Joachim Peters
0 siblings, 0 replies; 10+ messages in thread
From: Andreas Joachim Peters @ 2014-02-04 17:01 UTC (permalink / raw)
To: Loic Dachary; +Cc: Ceph Development
Hi Loic,
for the sizeof(int): the reason is that Jerasure internally uses long* addresses and operates on them. For example, if you XOR two chunks of size 3 you access illegal memory, because a long* XOR also touches byte 4.
The PDF documentation says this:
int packetsize: The packet size as defined in section 1. This must be a multiple of sizeof(long).
int size: The total number of bytes per device to encode/decode. This must be a multiple of sizeof(long). If a bit-matrix is being employed, then it must be a multiple of packetsize * w. If one desires to encode data blocks that do not conform to these restrictions, than one must pad the data blocks with zeroes so that the restrictions are met.
You cannot just do the modulo adjustment because this breaks the requirement that len is a multiple of packetsize*w! Imagine w=3, packetsize=4: a modulo add would adjust it to 16, and you cannot divide 16 by 12, so the smallest proper adjustment here is 48!
So the simplest approach is to multiply by another "* LARGEST_VECTOR_WORDSIZE", since the condition will then always be fulfilled.
Cheers Andreas.
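The w=3, packetsize=4 case can be checked numerically. The sketch below compares three ways of turning the 12-byte unit into something that is also a multiple of a 16-byte vector word: rounding up, multiplying, and taking the least common multiple. The 16-byte vector word and all function names are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>

// Round n up to the next multiple of m (m > 0).
inline std::size_t round_up(std::size_t n, std::size_t m) {
  return (n + m - 1) / m * m;
}

// Greatest common divisor (Euclid) and least common multiple.
inline std::size_t gcd_sz(std::size_t a, std::size_t b) {
  return b == 0 ? a : gcd_sz(b, a % b);
}
inline std::size_t lcm_sz(std::size_t a, std::size_t b) {
  return a / gcd_sz(a, b) * b;
}

// True when value is a multiple of both the packet unit and the vector word.
inline bool satisfies_both(std::size_t value, std::size_t unit,
                           std::size_t vector_word) {
  return value % unit == 0 && value % vector_word == 0;
}
```

Rounding 12 up gives 16, which is no longer a multiple of 12. Multiplying gives 192, which always satisfies both constraints. The least common multiple, 48, is the smallest value that does, at the cost of a slightly more involved computation.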
* Re: controlling erasure code chunk size
2014-02-02 23:27 ` Andreas Joachim Peters
2014-02-02 23:33 ` Samuel Just
@ 2014-02-03 10:57 ` Loic Dachary
1 sibling, 0 replies; 10+ messages in thread
From: Loic Dachary @ 2014-02-03 10:57 UTC (permalink / raw)
To: Andreas Joachim Peters, Samuel Just; +Cc: Ceph Development
Hi Andreas,
I better understand what we're after. Can you join the irc.oftc.net#ceph-devel irc channel to discuss the details? We have a few hours ahead of us before Los Angeles wakes up ;-)
Cheers
--
Loïc Dachary, Artisan Logiciel Libre
* Re: controlling erasure code chunk size 2014-02-02 22:45 ` Samuel Just 2014-02-02 23:27 ` Andreas Joachim Peters @ 2014-02-03 11:35 ` Loic Dachary 2014-02-03 18:15 ` Samuel Just 1 sibling, 1 reply; 10+ messages in thread From: Loic Dachary @ 2014-02-03 11:35 UTC (permalink / raw) To: Samuel Just, Andreas Joachim Peters; +Cc: Ceph Development [-- Attachment #1: Type: text/plain, Size: 4212 bytes --] Hi Sam, The argument to get_chunk_size is the stripe width, named object_size because the API knows nothing about stripes, it is a concept for the caller to implement. Say you have a desired chunk size in mind, you would: object_size = desired_chunk_size * get_data_chunk_count() actual_chunk_size = get_chunk_size(object_size) If you have a desired stripe width / object size in mind you would: object_size = desired_stripe_width chunk_size = get_chunk_size(object_size) Following Andreas suggestions, controlling the size of the actual chunk is a matter of tweaking the alignment constraints via the erasure code plugin parameters. Cheers On 02/02/2014 23:45, Samuel Just wrote: > I assume we will use get_chunksize(desired_chunksize) * > get_data_chunk_count() on the mon to define the stripe width (the size > of the buffer which will be presented to the plugin for encoding) for > the pool. At the moment, get_chunksize(4*(2<<10)) * > get_data_chunk_count() = 393216 using the jerasure plugin where > get_data_chunk_count() = 4. This seems a bit big? > -Sam > > On Sun, Feb 2, 2014 at 8:18 AM, Andreas Joachim Peters > <Andreas.Joachim.Peters@cern.ch> wrote: >> Hi Loic et.al. >> >> I think there is now some confusion about chunk_size, alignment, packetsize and the stripe_size to be used upstream. >> >> Algorithms with a bit-matrix require that the size per device is a multiple of (packetsize*w). Moreover the size per device and packetsize itself must be a multiple of sizeof(long/int). For other algorithms you can assume the same with packetsize=1. 
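For reference, the 393216 figure in Sam's message can be reproduced from the alignment snippet quoted in this thread. The following is a minimal C++14 sketch, not the plugin's code. It assumes w = 8, sizeof(int) = 4 and LARGEST_VECTOR_WORDSIZE = 16, none of which the thread states, and chunk_size() is a hypothetical stand-in for get_chunk_size() that assumes the object size is padded up to the next multiple of the alignment and then divided by k.

```cpp
#include <cstddef>

// Assumed constants; the thread does not give their values.
constexpr std::size_t kSizeofInt = 4;               // sizeof(int) on common platforms
constexpr std::size_t kLargestVectorWordsize = 16;  // assumed LARGEST_VECTOR_WORDSIZE

// Stripe alignment, following the snippet quoted in the thread:
// k*w*packetsize*sizeof(int), unless w*packetsize*sizeof(int) is not a
// multiple of the vector word size, in which case the word size is used.
constexpr std::size_t alignment(std::size_t k, std::size_t w,
                                std::size_t packetsize) {
  return (w * packetsize * kSizeofInt) % kLargestVectorWordsize
             ? k * w * packetsize * kLargestVectorWordsize
             : k * w * packetsize * kSizeofInt;
}

// Hypothetical stand-in for get_chunk_size(): round object_size up to the
// next multiple of the alignment, then split it evenly over k data chunks.
constexpr std::size_t chunk_size(std::size_t object_size, std::size_t k,
                                 std::size_t w, std::size_t packetsize) {
  const std::size_t a = alignment(k, w, packetsize);
  const std::size_t padded = ((object_size + a - 1) / a) * a;
  return padded / k;
}

// k = 4, w = 8 (assumed), packetsize = 3072, request of 4*(2<<10) = 8192 bytes.
static_assert(alignment(4, 8, 3072) == 393216, "stripe alignment");
static_assert(chunk_size(4 * (2 << 10), 4, 8, 3072) == 98304, "padded chunk");
static_assert(chunk_size(4 * (2 << 10), 4, 8, 3072) * 4 == 393216,
              "stripe width from Sam's message");
```

Under these assumptions an 8192-byte request is padded to one full 393216-byte stripe, that is 98304 bytes per data chunk, which is consistent with the concern that a large packetsize inflates small objects.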
* Re: controlling erasure code chunk size
2014-02-03 11:35 ` Loic Dachary
@ 2014-02-03 18:15 ` Samuel Just
0 siblings, 0 replies; 10+ messages in thread
From: Samuel Just @ 2014-02-03 18:15 UTC (permalink / raw)
To: Loic Dachary; +Cc: Andreas Joachim Peters, Ceph Development
Yes, I figured we might as well match stripe_width to chunk_size * get_data_chunk_count().
-Sam
On Mon, Feb 3, 2014 at 3:35 AM, Loic Dachary <loic@dachary.org> wrote:
> Hi Sam,
>
> The argument to get_chunk_size is the stripe width, named object_size because the API knows nothing about stripes, it is a concept for the caller to implement. Say you have a desired chunk size in mind, you would:
>
> object_size = desired_chunk_size * get_data_chunk_count()
> actual_chunk_size = get_chunk_size(object_size)
>
> If you have a desired stripe width / object size in mind you would:
>
> object_size = desired_stripe_width
> chunk_size = get_chunk_size(object_size)
>
> Following Andreas suggestions, controlling the size of the actual chunk is a matter of tweaking the alignment constraints via the erasure code plugin parameters.
>
> Cheers
>
> On 02/02/2014 23:45, Samuel Just wrote:
>> I assume we will use get_chunksize(desired_chunksize) *
>> get_data_chunk_count() on the mon to define the stripe width (the size
>> of the buffer which will be presented to the plugin for encoding) for
>> the pool. At the moment, get_chunksize(4*(2<<10)) *
>> get_data_chunk_count() = 393216 using the jerasure plugin where
>> get_data_chunk_count() = 4. This seems a bit big?
>> -Sam
>>
>> On Sun, Feb 2, 2014 at 8:18 AM, Andreas Joachim Peters
>> <Andreas.Joachim.Peters@cern.ch> wrote:
>>> Hi Loic et.al.
>>>
>>> I think there is now some confusion about chunk_size, alignment, packetsize and the stripe_size to be used upstream.
>>>
>>> Algorithms with a bit-matrix require that the size per device is a multiple of (packetsize*w). Moreover the size per device and packetsize itself must be a multiple of sizeof(long/int).
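Loic's first recipe (object_size = desired_chunk_size * get_data_chunk_count()) and Sam's choice of stripe_width = chunk_size * get_data_chunk_count() can be sketched for several packetsize values to see how the plugin parameter drives the result. This is again only a C++14 sketch under assumptions: sizeof(int) = 4, LARGEST_VECTOR_WORDSIZE = 16, round-up padding to the alignment, and the names stripe_alignment and stripe_width are hypothetical, not part of the plugin API.

```cpp
#include <cstddef>

// Alignment formula from the snippet quoted in the thread, with
// sizeof(int) = 4 and LARGEST_VECTOR_WORDSIZE = 16 assumed.
constexpr std::size_t stripe_alignment(std::size_t k, std::size_t w,
                                       std::size_t packetsize) {
  return (w * packetsize * 4) % 16 ? k * w * packetsize * 16
                                   : k * w * packetsize * 4;
}

// Hypothetical helper: start from a desired chunk size, build the object
// size as desired_chunk_size * k, and pad it to the alignment. The result
// is the stripe width, i.e. actual_chunk_size * k.
constexpr std::size_t stripe_width(std::size_t desired_chunk_size,
                                   std::size_t k, std::size_t w,
                                   std::size_t packetsize) {
  const std::size_t a = stripe_alignment(k, w, packetsize);
  const std::size_t object_size = desired_chunk_size * k;
  return ((object_size + a - 1) / a) * a;
}

// Desired chunk size of 4096 bytes with k = 4 data chunks.
static_assert(stripe_width(4096, 4, 8, 3072) == 393216, "default packetsize");
static_assert(stripe_width(4096, 4, 8, 32) == 16384, "packetsize = 32");
static_assert(stripe_width(4096, 4, 8, 8) == 16384, "packetsize = 8");
// w = 7 takes the LARGEST_VECTOR_WORDSIZE branch of the alignment formula.
static_assert(stripe_width(4096, 4, 7, 1) == 16576, "w = 7, packetsize = 1");
```

With these assumed values, a packetsize of 32 or 8 keeps the stripe at exactly 4 * 4096 bytes, while the default of 3072 forces a 393216-byte stripe. In each case the resulting chunk stays a multiple of packetsize*w, in line with the requirement Andreas describes for bit-matrix algorithms.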
end of thread, other threads:[~2014-02-04 17:01 UTC | newest]
Thread overview: 10+ messages
2014-02-02 15:15 controlling erasure code chunk size Loic Dachary
2014-02-02 16:18 ` Andreas Joachim Peters
2014-02-02 22:45 ` Samuel Just
2014-02-02 23:27 ` Andreas Joachim Peters
2014-02-02 23:33 ` Samuel Just
2014-02-04 16:17 ` Loic Dachary
2014-02-04 17:01 ` Andreas Joachim Peters
2014-02-03 10:57 ` Loic Dachary
2014-02-03 11:35 ` Loic Dachary
2014-02-03 18:15 ` Samuel Just