From mboxrd@z Thu Jan 1 00:00:00 1970 From: Loic Dachary Subject: Re: Erasure coding library API Date: Fri, 05 Jul 2013 18:34:53 +0200 Message-ID: <51D6F5AD.2050007@dachary.org> References: <51D20A03.1030204@dachary.org> <68444BDD-2068-4850-998D-C65449C20353@ornl.gov> <14B8F72D-C792-4829-9F9B-1175CB27F3EA@ornl.gov> <622F4407872BA447A16110F65453358C01A4D8067419@FMSAMAIL.fmsa.local> <487BCFAF-6093-4E0D-BE80-26D8D919EF4D@ornl.gov> <622F4407872BA447A16110F65453358C01A4D80674CF@FMSAMAIL.fmsa.local> <51D57784.8080005@dachary.org> <155A8CED-D1FA-4642-BEBB-FCD13F81A021@ornl.gov> <51D6D2F4.5050108@dachary.org> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="------------enig4D6E6AD87DE28F395559583A" Return-path: Received: from smtp.dmail.dachary.org ([86.65.39.20]:47937 "EHLO smtp.dmail.dachary.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756153Ab3GEQe5 (ORCPT ); Fri, 5 Jul 2013 12:34:57 -0400 In-Reply-To: Sender: ceph-devel-owner@vger.kernel.org List-ID: To: "Atchley, Scott" Cc: Ceph Development This is an OpenPGP/MIME signed message (RFC 2440 and 3156) --------------enig4D6E6AD87DE28F395559583A Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Hi Scott, That's unfortunate indeed and I now better understand why it is necessary= to add a signature to the chunks. I added the following to: https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasu= re-code.rst Bit flips happen. Not often, but it is possible. Here is an article from = 2011 also search for "bit rot" and "bit error rate". To detect corrupted = chunks, a signature (SHA1 for instance) should be added as an attribute o= f the file containing the chunk so that deep scrubbing can check that the= chunk is valid by rehashing the content of the chunk and compare it with= the signature. While writing a more detailed version of the API taking your comments and= Sam's comment in account, I realized that I don't know if it is possible= to recover a parity chunk. If a data chunk is missing, it is enough to r= ecover with a decode operation and write it back ( assuming a systematic = code is used ). Would it be possible to do something similar with parity = chunks ? Or would we need to re-encode all the parity chunks to write jus= t one of them ? I assume the later but ... I'm not sure ;-) Cheers On 05/07/2013 17:02, Atchley, Scott wrote: > On Jul 5, 2013, at 10:06 AM, Loic Dachary wrote: >=20 >> On 05/07/2013 14:13, Atchley, Scott wrote:> Loic, >>> >>> Erasure codes take what ever you give them. You need to verify the ch= unk before using it. Perhaps storing the checksum in the metadata/context= that describes the parity object? >> >> Hi Scott, >> >> Does that mean that if I give the chunks A + B + C to decode() and B i= s corrupted but A and C are ok it will return an incorrectly decoded cont= ent ? >=20 > Unfortunately, yes. >=20 >> I'm curious to know the answer but I don't think it is an actual probl= em. A corrupted chunk would mean that the underlying file system corrupte= d the content of one of its file. I can't remember the last time I saw th= at happen ;-) >=20 > Bit flips happen. Not often, but it is possible. Here is an article fro= m 2011: >=20 > http://www.linux-mag.com/id/8794/ >=20 > also search for "bit rot" and "bit error rate". >=20 > Scott >=20 >> >> Cheers >> >>> >>> Scott >>> >>> On Jul 4, 2013, at 9:24 AM, Loic Dachary wrote: >>> >>>> Hi, >>>> >>>> I was thinking about scrubbing of erasure coded chunks and realized = I don't know the answer to this very simple question : what happens when = a chunk is corrupted ? I.e. if AB is coded with 2+1 into A + B ( data ) += Z (parity ) and Z is replaced with Q. Would reed-solomon ignore/discard = the corrupted chunk ? If that's the case I think it slightly changes what= the API should be. >>>> >>>> Cheers >>>> >>>> On 04/07/2013 05:06, Paul Von-Stamwitz wrote: >>>>> Scott, et al. >>>>> >>>>> Here is an interesting paper from Usenix HotStorage Conference whic= h provides local codes without additional capacity overhead. >>>>> >>>>> Check it out. (abstract with links to paper and slides) >>>>> https://www.usenix.org/conference/hotstorage13/solution-network-cha= llenges-data-recovery-erasure-coded-distributed-storage >>>>> >>>>> Cheers, >>>>> pvs >>>>> >>>>>> On Jul 3, 2013, at 11:19 AM, Paul Von-Stamwitz wrote: >>>>>> >>>>>> Hi Scott, >>>>>> >>>>>> Point taken. >>>>>> >>>>>> I was thinking about Loic's decode description where k+m was reque= sted and >>>>>> data was decoded when k blocks were received. But he was referring= to full >>>>>> stripe reads where all the memory is allocated. >>>>>> >>>>>> Degraded reads and block repair are a different matter. >>>>>> >>>>>> pvs >>>>>> >>>>>>> On Jul 3, 2013, at 4:53 AM Scott Atchley wrote: >>>>>>> >>>>>>> On Jul 2, 2013, at 10:12 PM, Paul Von-Stamwitz >>>>>>> wrote: >>>>>>> >>>>>>>> Scott, >>>>>>>> >>>>>>>> You make a good point comparing (5/3) RS with Xorbas, but a smal= l nit: >>>>>>>> >>>>>>>> "The I/O to recover from a single failure for both schemes is 5 = blocks >>>>>>> so it is as efficient as Xorbas." >>>>>>>> >>>>>>>> Maybe not. You would probably issue I/O to all the remaining 7 b= locks >>>>>> to >>>>>>> cover for the possibility of double errors. The time to reconstru= ct >>>>>> would >>>>>>> be about the same, but there could be more disk and network I/O. = (LRC >>>>>> will >>>>>>> need to issue I/O to the rest of the global stripe if it detected= >>>>>> multiple >>>>>>> local errors.) >>>>>>> >>>>>>> Why would you request more than five? If one cannot be read, requ= est >>>>>>> another. >>>>>>> >>>>>>> Also, I am not sure that you want to request five at once since i= t will >>>>>>> lead to spikes in network traffic and require memory for all five= blocks. >>>>>>> You will need at least two buffers. Request the first two and sta= rt the >>>>>>> decoding. You may want a third buffer to overlap the decoding of = the >>>>>>> current block with the communication for the next block. It may b= e that >>>>>>> the decode time is less than the communication and, in that case,= you >>>>>> will >>>>>>> want to request all of the blocks at once. >>>>>>> >>>>>>>> What I like about Xorbas is that it is an extension of a (x,y) R= S. You >>>>>>> can start with traditional RS. If degraded reads and repaired blo= cks are >>>>>>> causing a problem, you can add the LRC. If capacity is an issue, = you can >>>>>>> take it out. >>>>>>> >>>>>>> I like it too and Microsoft has something similar with Pyramid co= des. >>>>>> That >>>>>>> said, my example using traditional RS can provide more fault-tole= rance >>>>>> on >>>>>>> average given the same amount of storage overhead. >>>>>>> >>>>>>>> >>>>>>>> Best, >>>>>>>> Paul >>>>>>>> >>>>>>>> On Tue, Jul 2, 2013 at 2:33 PM, Samuel Just wrote: >>>>>>>>> I think we should be able to cover most cases by adding an inte= rface >>>>>>> like: >>>>>>>>> >>>>>>>>> set minimum_to_read(const set &want_to_read, const se= t >>>>>>>>> &available_chunks); >>>>>>>>> >>>>>>>>> which returns the smallest set required to read/rebuild the chu= nks in >>>>>>>>> want_to_read given the chunks in available_chunks. Alternately= , we >>>>>>> might >>>>>>>>> include a "cost" for reading each chunk like >>>>>>>>> >>>>>>>>> set minimum_to_read_with_cost(const set &want_to_read= , >>>>>> const >>>>>>>>> map &available) >>>>>>>>> >>>>>>>>> which returns the minimum cost set required to read the specifi= ed >>>>>>> chunks >>>>>>>>> given a mapping of available chunks to costs. The costs might = allow >>>>>> us >>>>>>> to >>>>>>>>> consider the difference between reading local chunks vs remote = chunks. >>>>>>>>> This should be sufficient to cover the read case (esp the degra= ded >>>>>> read >>>>>>>>> case) and the repair case. >>>>>>>>> -Sam >>>>>>>>> >>>>>>>>> On Tue, Jul 2, 2013 at 10:14 AM, Atchley, Scott >>>>>>>>> wrote: >>>>>>>>>> On Jul 2, 2013, at 10:07 AM, "Atchley, Scott" >>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> On Jul 1, 2013, at 7:00 PM, Loic Dachary w= rote: >>>>>>>>>>> >>>>>>>>>>>> Hi, >>>>>>>>>>>> >>>>>>>>>>>> Today Sam pointed out that the API for LRC ( Xorbas Hadoop P= roject >>>>>>>>> Page, Locally Repairable Codes (LRC) http://smahesh.com/HadoopU= SC/ >>>>>> for >>>>>>>>> instance ) would need to be different from the one initialy pro= posed: >>>>>>>>>>> >>>>>>>>>>> An interesting video. Not as entertaining as Jim Plank's vide= o. ;-) >>>>>>>>>>> >>>>>>>>>>> While Plank's focused on the processor requirements for >>>>>>>>> encoding/decoding, this video focuses on the network and disk I= /O >>>>>>>>> requirements. >>>>>>>>>>> >>>>>>>>>>>> context(k, m, reed-solomon|...) =3D> context* c >>>>>>>>>>>> encode(context* c, void* data) =3D> void* chunks[k+m] >>>>>>>>>>>> decode(context* c, void* chunk[k+m], int* >>>>>>>>>>>> indices_of_erased_chunks) =3D> void* data // erased chunks a= re not >>>>>>> used >>>>>>>>>>>> repair(context* c, void* chunk[k+m], int* >>>>>>>>>>>> indices_of_erased_chunks) =3D> void* chunks[k+m] // erased c= hunks >>>>>> are >>>>>>>>>>>> rebuilt >>>>>>>>>>>> >>>>>>>>>>>> The decode function must allow for partial read: >>>>>>>>>>>> >>>>>>>>>>>> decode(context* c, int offset, int length, void* chunk[k+m],= int* >>>>>>>>>>>> indices_of_erased_chunks, int* missing_chunks) =3D> void* da= ta >>>>>>>>>>>> >>>>>>>>>>>> If there are not enough chunks to recover the desired data r= ange >>>>>>>>> [offset, offset+length) the function returns NULL and sets >>>>>>> missing_chunks >>>>>>>>> to the list of chunks that must be retrieved in order to be abl= e to >>>>>>> read >>>>>>>>> the desired data. >>>>>>>>>>>> >>>>>>>>>>>> If decode is called to read just 1 chunk and it is missing, = reed- >>>>>>>>> solomon would return on error and ask for all other chunks to r= epair. >>>>>>> If >>>>>>>>> the underlying library implements LRC, it would ask for a subse= t of >>>>>> the >>>>>>>>> chunks. >>>>>>>>>>>> >>>>>>>>>>>> An implementation allowing only full reads and using jerasur= e >>>>>>> ( which >>>>>>>>> does not do LRC ) requires that offset is always zero, length i= s the >>>>>>> size >>>>>>>>> of the object and returns a copy of indices_of_erased_chunks if= there >>>>>>> are >>>>>>>>> not enough chunks to rebuild the missing ones. >>>>>>>>>>>> >>>>>>>>>>>> Comments are welcome :-) >>>>>>>>>>> >>>>>>>>>>> I have loosely followed this discussion and I have not looked= >>>>>> closely >>>>>>>>> at the proposed API nor at the jerasure interface. My apologies= if >>>>>> this >>>>>>>>> has already been addressed. >>>>>>>>>>> >>>>>>>>>>> It is not clear to me from the above proposed API (ignoring t= he >>>>>>> partial >>>>>>>>> read) what it would do. Was the original intent to encode the e= ntire >>>>>>> file >>>>>>>>> using k+m blocks irregardless of the file size and of the rados= >>>>>> object >>>>>>>>> size? >>>>>>>>>>> >>>>>>>>>>> If so, how will you map rados objects to the logical k+m obje= cts >>>>>> and >>>>>>>>> vice versa? >>>>>>>>>>> >>>>>>>>>>> If not, then the initial API needed an offset and length (eit= her >>>>>>>>> logical or rados object). >>>>>>>>>>> >>>>>>>>>>> I would assume that you would want to operate on rados sized >>>>>> objects. >>>>>>>>> Given a fixed k+m, then you may have more than one set of k+m o= bjects >>>>>>> per >>>>>>>>> file. This is ignoring the LRC "local" parity blocks. For examp= le, if >>>>>>> the >>>>>>>>> rados object size if 1 MB and k =3D 10 and m =3D 4 (as in the X= orbas >>>>>> video), >>>>>>>>> then for a 20 MB file one would need two sets of encoding block= s. The >>>>>>>>> first for objects 1-10 and the second for objects 11-20. >>>>>>>>>>> >>>>>>>>>>> Perhaps, this is what the context is above. If so, it should = have >>>>>> the >>>>>>>>> logical offset and rados object size, no? >>>>>>>>>>> >>>>>>>>>>> I see value in the Xorbas concept and I wonder if the jerasur= e >>>>>>> library >>>>>>>>> can be modified to generate the local parity blocks such that t= hey >>>>>> can >>>>>>> be >>>>>>>>> used to generate the global parity blocks. That would be a ques= tion >>>>>> for >>>>>>>>> Jim Plank. >>>>>>>>>> >>>>>>>>>> The benefits of the Xorbas concept is reduced network and disk= I/O >>>>>> for >>>>>>>>> failures while maintaining traditional RS's higher fault-tolera= nce >>>>>> than >>>>>>> 3x >>>>>>>>> replication while using less space. >>>>>>>>>> >>>>>>>>>> You can do almost the same thing with jerasure without modifyi= ng it >>>>>> at >>>>>>>>> all. Using the values from the Xorbas video, they have k data b= locks, >>>>>> m >>>>>>>>> global parity blocks, and 2 local parity blocks (generated from= k/2 >>>>>>> data >>>>>>>>> blocks) for a total of k+m+2 blocks on disk that can tolerate a= ny m >>>>>>>>> failures. In their example, k =3D 10 and m =3D 4. They store 16= blocks >>>>>> for >>>>>>>>> each 10 data blocks. >>>>>>>>>> >>>>>>>>>> If you use traditional RS encoding via jerasure and used the s= ame >>>>>>> amount >>>>>>>>> of storage (16 blocks for each 10 data blocks), you could encod= e 3 >>>>>>> parity >>>>>>>>> blocks for each 5 data blocks. This would consume 16 data block= s for >>>>>>> each >>>>>>>>> 10 data blocks and the fault-tolerance would be variable from 3= -6 >>>>>>> failures >>>>>>>>> depending on how the failures fell between the two groups of 5 = blocks >>>>>>>>> which is higher than the static 4 failures for the Xorbas code.= The >>>>>> I/O >>>>>>> to >>>>>>>>> recover from a single failure for both schemes is 5 blocks so i= t is >>>>>> as >>>>>>>>> efficient as Xorbas. On average, it provides more fault-toleran= ce, >>>>>> but >>>>>>> it >>>>>>>>> can be less (four failures from one group of 5 data + 3 parity >>>>>> blocks), >>>>>>>>> but that worst case is the same as 3x replication. >>>>>>>>>> >>>>>>>>>> Scott-- >>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph= - >>>>>> devel" >>>>>>>>>> in the body of a message to majordomo@vger.kernel.org More maj= ordomo >>>>>>>>>> info at http://vger.kernel.org/majordomo-info.html >>>>>>>>> -- >>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-= devel" >>>>>>> in >>>>>>>>> the body of a message to majordomo@vger.kernel.org More majordo= mo >>>>>> info >>>>>>> at >>>>>>>>> http://vger.kernel.org/majordomo-info.html >>>>> >>>>> -- >>>>> To unsubscribe from this list: send the line "unsubscribe ceph-deve= l" in >>>>> the body of a message to majordomo@vger.kernel.org >>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>>> >>>> >>>> --=20 >>>> Lo=EFc Dachary, Artisan Logiciel Libre >>>> All that is necessary for the triumph of evil is that good people do= nothing. >>>> >>> >> >> --=20 >> Lo=EFc Dachary, Artisan Logiciel Libre >> All that is necessary for the triumph of evil is that good people do n= othing. >> >=20 --=20 Lo=EFc Dachary, Artisan Logiciel Libre All that is necessary for the triumph of evil is that good people do noth= ing. --------------enig4D6E6AD87DE28F395559583A Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (GNU/Linux) Comment: Using GnuPG with undefined - http://www.enigmail.net/ iEYEARECAAYFAlHW9a0ACgkQ8dLMyEl6F22lFQCdFvE96x+p/9XJQY3Hpt8jYv7m 2foAnRbC5a8qEAuO02CjRuXbWL11f4ic =4lmJ -----END PGP SIGNATURE----- --------------enig4D6E6AD87DE28F395559583A--