From mboxrd@z Thu Jan 1 00:00:00 1970 From: Loic Dachary Subject: Re: CEPH Erasure Encoding + OSD Scalability Date: Sat, 14 Sep 2013 20:04:39 +0200 Message-ID: <5234A537.609@dachary.org> References: <3472A07E6605974CBC9BC573F1BC02E494B06990@PLOXCHG04.cern.ch> <51D73960.3070303@dachary.org> <51D8827E.8030906@dachary.org> <3472A07E6605974CBC9BC573F1BC02E494B06E64@PLOXCHG04.cern.ch> <5211F508.3030706@dachary.org> <521698D8.5020009@dachary.org> <52190C65.2090204@dachary.org>,<5219EF44.7050603@dachary.org> <3472A07E6605974CBC9BC573F1BC02E4A525F387@PLOXCHG03.cern.ch> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="------------enig31A93348AB6EBF90CD36E07B" Return-path: Received: from smtp.dmail.dachary.org ([91.121.254.229]:34824 "EHLO smtp.dmail.dachary.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753554Ab3INSEn (ORCPT ); Sat, 14 Sep 2013 14:04:43 -0400 In-Reply-To: <3472A07E6605974CBC9BC573F1BC02E4A525F387@PLOXCHG03.cern.ch> Sender: ceph-devel-owner@vger.kernel.org List-ID: To: Andreas Joachim Peters Cc: Ceph Development This is an OpenPGP/MIME signed message (RFC 2440 and 3156) --------------enig31A93348AB6EBF90CD36E07B Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Hi Andreas, On 14/09/2013 16:59, Andreas Joachim Peters wrote:> Hi Loic, > > I finally run/read the code of the erasure encoding. Great ! > What I noticed is, that in your implementation you always copy the data= to encode once because you add a padding block to the bufferlist and the= n call "out.c_str()", which calls bufferlist::rebuild and then new with t= he full size of all chunks and the it copies the input data. Please corre= ct me if I am wrong ... couldn't you just allocate the additional redunda= ncy chunks and return bufferptr pointing into the 'in' bufferlist ? I assume you're referring to https://github.com/ceph/ceph/blob/e9e53912503259326a7877bda31c4360302c2c3= 4/src/osd/ErasureCodePluginJerasure/ErasureCodeJerasure.cc#L78 and indeed it implies an extra copy because of the padding. The optimizat= ion you're suggesting, if I get it right would only require an extra copy= of the last data chunk. The code would extract the char * from in with c= _str() before padding ( hence no rebuild ). Feed data[] with pointers to = this but the last one if it requires padding. Allocate + copy a chunk + p= ad the last chunk if necessary. The allocated area can be made big enough= to accomodate for the coding chunks. That would reduce the copy to the m= inimum. It also means that the returned bufferlist has to properly refere= nce the input buffer and that the caller must not modify the content of *= in* after calling encode otherwise it may have a side effect on the *enco= ded* result because they really are the same pointer. > > Another question is, why 'in' in the encode function is a list of buffe= rs? Maybe this is the natural interface object in CEPH IO, don't know ...= the implementation would concatenate them and produce chunks for the mer= ged block ... You guess right. Initially I had the encode function accept a bufferptr i= nstead of a bufferlist but it's not the preferred API data structure to c= onvey a buffer. > I will try to run a benchmark to see, if the additional copy has a visi= ble impact on the performance, however it looks unnecessary. Indeed there should be a way to avoid this extra copy. > I am also more or less finished with the 3 + (3XOR) implementation ... = will do also a benchmark with this and let you know the result. Cool ! > Last question a little bit out of context, I did some benchmark about = librados and latency. I see a latency of 1ms to read/stat objects of very= small size (5 bytes in this case). If we (re-)write such an object with = a 3-fold replica configuration on a 10 GBit setup with 1000 disks I see a= latency of 80 ms per object. If I append it is 75 ms. If we do a massive= test with the benchmark tool the total object creation rate saturates at= 20kHz which is ok however the individual latency is higher than I would = expect ? > > Is there something in the OSD delaying communication since I don't beli= eve it takes 80 ms to sync 5 bytes on an idle pool to a harddisk with a n= etwork roundtrip time of far less than a ms ? I suggest you start a separate thread for this, chances are your question= will not be noticed otherwise. Cheers > Cheers, Andreas. > > > > > > > > > > > > > > > > ________________________________________ > From: Loic Dachary [loic@dachary.org] > Sent: 25 August 2013 13:49 > To: Andreas Joachim Peters > Cc: Ceph Development > Subject: Re: CEPH Erasure Encoding + OSD Scalability > > On 24/08/2013 21:41, Loic Dachary wrote: >> >> >> On 24/08/2013 15:30, Andreas-Joachim Peters wrote: >>> Hi Loic, >>> I will start to review >> >> Cool :-) >> >> ...maybe you can briefly explain few things beforehand: >>> >>> 1) the buffer management .... who allocates the output buffers for t= he encoding? Are they always malloced or does it use some generic CEPH bu= ffer recyling functionality? >> >> The output bufferlist is allocated by the pluing and it is the respons= ibility of the caller to deallocate them. I will write doxygen documentat= ion >> https://github.com/ceph/ceph/pull/518/files#r5966727 > > Hi Andreas, > > The documentation added today in > https://github.com/dachary/ceph/blob/wip-5878/src/osd/ErasureCodeInterf= ace.h > will hopefully clarify things. It requires an understanding of https://= github.com/ceph/ceph/blob/master/src/include/buffer.h > > Let me know if you have more questions. > >> >>> 2) do you support to retrieve partial blocks or only the full 4M bloc= k? are decoded blocks cached for some time? >> >> This is outside of the scope of https://github.com/ceph/ceph/pull/518/= files : the plugin can handle encode/decode of 128 bytes or 4M in the sam= e way. >> >>> 3) do you want to tune the 2+1 basic code for performance or is it ju= st proof of concept? If yes, then you should move over the encoding buffe= r with *ptr++ and use the largest available vector size for the used plat= form to perform XOR operations. I will send you an improved version of th= e loop if you want ... >> >> The 2+1 is just a proof of concept. I completed a first implementation= of the jerasure plugin https://github.com/ceph/ceph/pull/538/files which= is meant to be used as a default. >> >>> 4) if you are interested I can write also code for a (3+3) plugin whi= ch tolerates 2-3 lost stripes. (one has to add P3=3DA^B^C to my [3,2] pro= posal). Atleast it reduces the overhead from 3-fold replication from 300%= =3D> 200% ... >> >> It would be great to have such a plugin :-) >> >>> 5) will you add CRC32C checksums to the blocks (4M block or 4k pages?= ) or will this be a CEPH generic functionality for any kind of block? >> >> The idea is to have a CRC32C checksum per object / shard ( as describe= d in http://ceph.com/docs/master/dev/osd_internals/erasure_coding/#glossa= ry ) : it is the only way for scrubbing to figure out if a given shard is= not corrupted and not too expensive since erasure coded pool only suppor= t full writes + append and not partial writes that would require to re-ca= lculate the CRC32C for the whole shard each time one byte is changed. >> >>> 6) do you put a kind of header or magic into the encoded blocks to ve= rify that your input blocks are actually corresponding? >> >> This has not been decided yet but I think it would be sensible to use = the object attributes ( either xattr or leveldb ) to store meta informati= on instead of creating a file format specifically designed for erasure co= de. >> >> Cheers >> >>> Cheers Andreas. >>> >>> >>> >>> >>> On Fri, Aug 23, 2013 at 1:03 AM, Loic Dachary > wrote: >>> >>> >>> >>> On 22/08/2013 23:42, Andreas-Joachim Peters wrote:> Hi Loic, >>> > sorry for the late reply, I was on vacation ... you are right,= I did a simple logical mistake since I assumed you loose only the data s= tripes but never the parity stripes which is a very wrong assumption. >>> > >>> > So for testing you probably could just implement (2+1) and then= move to jerasure or dual parity (4+2) where you build horizontal and dia= gonal parities. >>> > >>> >>> Hi Andreas, >>> >>> That's what I did :-) It would be great if you could review the p= roposed implementation at https://github.com/ceph/ceph/pull/518/files . I= 'll keep working on https://github.com/dachary/ceph/commit/83845a66ae1cba= 63c122c0ef7658b97b474c2bd2 tomorrow to create the jerasure plugin but it'= s not ready for review yet. >>> >>> Cheers >>> >>> > Cheers Andreas. >>> > >>> > >>> > >>> > >>> > >>> > On Mon, Aug 19, 2013 at 12:35 PM, Loic Dachary >> wrote: >>> > >>> > Hi Andreas, >>> > >>> > Trying to write minimal code as you suggested, for an examp= le plugin. My first attempt at writing an erasure coding function. I don'= t get how you can rebuild P1 + A from P2 + B + C. I must be missing somet= hing obvious :-) >>> > >>> > Cheers >>> > >>> > On 07/07/2013 23:04, Andreas Joachim Peters wrote: >>> > > >>> > > Hi Loic, >>> > > I don't think there is a better generic implementation. J= ust made a benchmark .. the Jerasure library with algorithm 'cauchy_good'= gives 1.1 GB/s (Xeon 2.27 GHz) on a single core for a 4+2 encoding w=3D3= 2. Just to give a feeling if you do 10+4 it is 300 MB/s .... there is a s= pecialized implementation in QFS (Hadoop in C++) for (M+3) ... for curios= ity I will make a benchmark with this to compare with Jerasure ... >>> > > >>> > > In any case I would do an optimized implementation for 3+= 2 which would be probably the most performant implementation having the s= ame reliability like standard 3-fold replication in CEPH using only 53% o= f the space. >>> > > >>> > > 3+2 is trivial since you encode (A,B,C) with only two par= ity operations >>> > > P1 =3D A^B >>> > > P2 =3D B^C >>> > > and reconstruct with one or two parity operations: >>> > > A =3D P1^B >>> > > B =3D P1^A >>> > > B =3D P2^C >>> > > C =3D P2^B >>> > > aso. >>> > > >>> > > You can write this as a simple loop using advanced vector= extensions on Intel (AVX). I can paste a benchmark tomorrow. >>> > > >>> > > Considering the crc32c-intel code you added ... I would p= rovide a function which provides a crc32c checksum and detects if it can = do it using SSE4.2 or implements just the standard algorithm e.g if you r= un in a virtual machine you need this emulation ... >>> > > >>> > > Cheers Andreas. >>> > > ________________________________________ >>> > > From: Loic Dachary [loic@dachary.org >] >>> > > Sent: 06 July 2013 22:47 >>> > > To: Andreas Joachim Peters >>> > > Cc: ceph-devel@vger.kernel.org > >>> > > Subject: Re: CEPH Erasure Encoding + OSD Scalability >>> > > >>> > > Hi Andreas, >>> > > >>> > > Since it looks like we're going to use jerasure-1.2, we w= ill be able to try (C)RS using >>> > > >>> > > https://github.com/tsuraan/Jerasure/blob/master/src/cauch= y.c >>> > > https://github.com/tsuraan/Jerasure/blob/master/src/cauch= y.h >>> > > >>> > > Do you know of a better / faster implementation ? Is ther= e a tradeoff between (C)RS and RS ? >>> > > >>> > > Cheers >>> > > >>> > > On 06/07/2013 15:43, Andreas-Joachim Peters wrote: >>> > >> HI Loic, >>> > >> (C)RS stands for the Cauchy Reed-Solomon codes which are= based on pure parity operations, while the standard Reed-Solomon codes n= eed more multiplications and are slower. >>> > >> >>> > >> Considering the checksumming ... for comparison the CRC3= 2 code from libz run's on a 8-core Xeon at ~730 MB/s for small block size= s while SSE4.2 CRC32C checksum run's at ~2GByte/s. >>> > >> >>> > >> Cheers Andreas. >>> > >> >>> > >> >>> > >> >>> > >> >>> > >> On Fri, Jul 5, 2013 at 11:23 PM, Loic Dachary > >>> wrote: >>> > >> >>> > >> Hi Andreas, >>> > >> >>> > >> On 04/07/2013 23:01, Andreas Joachim Peters wrote:> = Hi Loic, >>> > >> > thanks for the responses! >>> > >> > >>> > >> > Maybe this is useful for your erasure code discuss= ion: >>> > >> > >>> > >> > as an example in our RS implementation we chunk a = data block of e.g. 4M into 4 data chunks of 1M. Then we create a 2 parity= chunks. >>> > >> > >>> > >> > Data & parity chunks are split into 4k blocks and = these 4k blocks get a CRC32C block checksum each (SSE4.2 CPU extension =3D= > MIT library or BTRFS). This creates 0.1% volume overhead (4 bytes per 4= 096 bytes) - nothing compared to the parity overhead ... >>> > >> > >>> > >> > You can now easily detect data corruption using th= e local checksums and avoid to read any parity information and (C)RS deco= ding if there is no corruption detected. Moreover CRC32C computation is d= istributed over several (in this case 4) machines while (C)RS decoding wo= uld run on a single machine where you assemble a block ... and CRC32C is = faster than (C)RS decoding (with SSE4.2) ... >>> > >> >>> > >> What does (C)RS mean ? (C)Reed-Solomon ? >>> > >> >>> > >> > In our case we write this checksum information sep= arate from the original data ... while in a block-based storage like CEPH= it would be probably inlined in the data chunk. >>> > >> > If an OSD detects to run on BRTFS or ZFS one could= disable automatically the CRC32C code. >>> > >> >>> > >> Nice. I did not know that was built-in :-) >>> > >> https://github.com/dachary/ceph/blob/wip-4929/doc/de= v/osd_internals/erasure-code.rst#scrubbing >>> > >> >>> > >> > (wouldn't CRC32C be also useful for normal CEPH bl= ock replication? ) >>> > >> >>> > >> I don't know the details of scrubbing but it seems C= RC is already used by deep scrubbing >>> > >> >>> > >> https://github.com/ceph/ceph/blob/master/src/osd/PG.= cc#L2731 >>> > >> >>> > >> Cheers >>> > >> >>> > >> > As far as I know with the RS CODEC we use you can = either miss stripes (data =3D0) in the decoding process but you cannot in= ject corrupted stripes into the decoding process, so the block checksummi= ng is important. >>> > >> > >>> > >> > Cheers Andreas. >>> > >> >>> > >> -- >>> > >> Lo=EFc Dachary, Artisan Logiciel Libre >>> > >> All that is necessary for the triumph of evil is tha= t good people do nothing. >>> > >> >>> > >> >>> > > >>> > > -- >>> > > Lo=EFc Dachary, Artisan Logiciel Libre >>> > > All that is necessary for the triumph of evil is that goo= d people do nothing. >>> > > >>> > > -- >>> > > To unsubscribe from this list: send the line "unsubscribe= ceph-devel" in >>> > > the body of a message to majordomo@vger.kernel.org > >>> > > More majordomo info at http://vger.kernel.org/majordomo-= info.html >>> > > >>> > >>> > -- >>> > Lo=EFc Dachary, Artisan Logiciel Libre >>> > All that is necessary for the triumph of evil is that good = people do nothing. >>> > >>> > >>> >>> -- >>> Lo=EFc Dachary, Artisan Logiciel Libre >>> All that is necessary for the triumph of evil is that good people= do nothing. >>> >>> >> > > -- > Lo=EFc Dachary, Artisan Logiciel Libre > All that is necessary for the triumph of evil is that good people do no= thing. > --=20 Lo=EFc Dachary, Artisan Logiciel Libre All that is necessary for the triumph of evil is that good people do noth= ing. --------------enig31A93348AB6EBF90CD36E07B Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (GNU/Linux) Comment: Using GnuPG with undefined - http://www.enigmail.net/ iEYEARECAAYFAlI0pTcACgkQ8dLMyEl6F21edgCfZcCiIwSiUKL66/N7IesAb7BC lHIAnAqEWGT4FyzB1sdueZnja0qyQM3V =xRgY -----END PGP SIGNATURE----- --------------enig31A93348AB6EBF90CD36E07B--