From mboxrd@z Thu Jan 1 00:00:00 1970 From: Loic Dachary Subject: Re: Erasure coding library API Date: Fri, 05 Jul 2013 16:06:44 +0200 Message-ID: <51D6D2F4.5050108@dachary.org> References: <51D20A03.1030204@dachary.org> <68444BDD-2068-4850-998D-C65449C20353@ornl.gov> <14B8F72D-C792-4829-9F9B-1175CB27F3EA@ornl.gov> <622F4407872BA447A16110F65453358C01A4D8067419@FMSAMAIL.fmsa.local> <487BCFAF-6093-4E0D-BE80-26D8D919EF4D@ornl.gov> <622F4407872BA447A16110F65453358C01A4D80674CF@FMSAMAIL.fmsa.local> <51D57784.8080005@dachary.org> <155A8CED-D1FA-4642-BEBB-FCD13F81A021@ornl.gov> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="------------enigAF09C61A923B0DFD0513C2CD" Return-path: Received: from smtp.dmail.dachary.org ([86.65.39.20]:52365 "EHLO smtp.dmail.dachary.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751582Ab3GEOGs (ORCPT ); Fri, 5 Jul 2013 10:06:48 -0400 In-Reply-To: <155A8CED-D1FA-4642-BEBB-FCD13F81A021@ornl.gov> Sender: ceph-devel-owner@vger.kernel.org List-ID: To: "Atchley, Scott" Cc: Ceph Development This is an OpenPGP/MIME signed message (RFC 2440 and 3156) --------------enigAF09C61A923B0DFD0513C2CD Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable On 05/07/2013 14:13, Atchley, Scott wrote:> Loic, >=20 > Erasure codes take what ever you give them. You need to verify the chun= k before using it. Perhaps storing the checksum in the metadata/context t= hat describes the parity object? Hi Scott, Does that mean that if I give the chunks A + B + C to decode() and B is c= orrupted but A and C are ok it will return an incorrectly decoded content= ? I'm curious to know the answer but I don't think it is an actual probl= em. A corrupted chunk would mean that the underlying file system corrupte= d the content of one of its file. I can't remember the last time I saw th= at happen ;-) Cheers >=20 > Scott >=20 > On Jul 4, 2013, at 9:24 AM, Loic Dachary wrote: >=20 >> Hi, >> >> I was thinking about scrubbing of erasure coded chunks and realized I = don't know the answer to this very simple question : what happens when a = chunk is corrupted ? I.e. if AB is coded with 2+1 into A + B ( data ) + Z= (parity ) and Z is replaced with Q. Would reed-solomon ignore/discard th= e corrupted chunk ? If that's the case I think it slightly changes what t= he API should be. >> >> Cheers >> >> On 04/07/2013 05:06, Paul Von-Stamwitz wrote: >>> Scott, et al. >>> >>> Here is an interesting paper from Usenix HotStorage Conference which = provides local codes without additional capacity overhead. >>> >>> Check it out. (abstract with links to paper and slides) >>> https://www.usenix.org/conference/hotstorage13/solution-network-chall= enges-data-recovery-erasure-coded-distributed-storage >>> >>> Cheers, >>> pvs >>> >>>> On Jul 3, 2013, at 11:19 AM, Paul Von-Stamwitz wrote: >>>> >>>> Hi Scott, >>>> >>>> Point taken. >>>> >>>> I was thinking about Loic's decode description where k+m was request= ed and >>>> data was decoded when k blocks were received. But he was referring t= o full >>>> stripe reads where all the memory is allocated. >>>> >>>> Degraded reads and block repair are a different matter. >>>> >>>> pvs >>>> >>>>> On Jul 3, 2013, at 4:53 AM Scott Atchley wrote: >>>>> >>>>> On Jul 2, 2013, at 10:12 PM, Paul Von-Stamwitz >>>>> wrote: >>>>> >>>>>> Scott, >>>>>> >>>>>> You make a good point comparing (5/3) RS with Xorbas, but a small = nit: >>>>>> >>>>>> "The I/O to recover from a single failure for both schemes is 5 bl= ocks >>>>> so it is as efficient as Xorbas." >>>>>> >>>>>> Maybe not. You would probably issue I/O to all the remaining 7 blo= cks >>>> to >>>>> cover for the possibility of double errors. The time to reconstruct= >>>> would >>>>> be about the same, but there could be more disk and network I/O. (L= RC >>>> will >>>>> need to issue I/O to the rest of the global stripe if it detected >>>> multiple >>>>> local errors.) >>>>> >>>>> Why would you request more than five? If one cannot be read, reques= t >>>>> another. >>>>> >>>>> Also, I am not sure that you want to request five at once since it = will >>>>> lead to spikes in network traffic and require memory for all five b= locks. >>>>> You will need at least two buffers. Request the first two and start= the >>>>> decoding. You may want a third buffer to overlap the decoding of th= e >>>>> current block with the communication for the next block. It may be = that >>>>> the decode time is less than the communication and, in that case, y= ou >>>> will >>>>> want to request all of the blocks at once. >>>>> >>>>>> What I like about Xorbas is that it is an extension of a (x,y) RS.= You >>>>> can start with traditional RS. If degraded reads and repaired block= s are >>>>> causing a problem, you can add the LRC. If capacity is an issue, yo= u can >>>>> take it out. >>>>> >>>>> I like it too and Microsoft has something similar with Pyramid code= s. >>>> That >>>>> said, my example using traditional RS can provide more fault-tolera= nce >>>> on >>>>> average given the same amount of storage overhead. >>>>> >>>>>> >>>>>> Best, >>>>>> Paul >>>>>> >>>>>> On Tue, Jul 2, 2013 at 2:33 PM, Samuel Just wrote: >>>>>>> I think we should be able to cover most cases by adding an interf= ace >>>>> like: >>>>>>> >>>>>>> set minimum_to_read(const set &want_to_read, const set<= int> >>>>>>> &available_chunks); >>>>>>> >>>>>>> which returns the smallest set required to read/rebuild the chunk= s in >>>>>>> want_to_read given the chunks in available_chunks. Alternately, = we >>>>> might >>>>>>> include a "cost" for reading each chunk like >>>>>>> >>>>>>> set minimum_to_read_with_cost(const set &want_to_read, >>>> const >>>>>>> map &available) >>>>>>> >>>>>>> which returns the minimum cost set required to read the specified= >>>>> chunks >>>>>>> given a mapping of available chunks to costs. The costs might al= low >>>> us >>>>> to >>>>>>> consider the difference between reading local chunks vs remote ch= unks. >>>>>>> This should be sufficient to cover the read case (esp the degrade= d >>>> read >>>>>>> case) and the repair case. >>>>>>> -Sam >>>>>>> >>>>>>> On Tue, Jul 2, 2013 at 10:14 AM, Atchley, Scott >>>>>>> wrote: >>>>>>>> On Jul 2, 2013, at 10:07 AM, "Atchley, Scott" >>>>>>> wrote: >>>>>>>> >>>>>>>>> On Jul 1, 2013, at 7:00 PM, Loic Dachary wro= te: >>>>>>>>> >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> Today Sam pointed out that the API for LRC ( Xorbas Hadoop Pro= ject >>>>>>> Page, Locally Repairable Codes (LRC) http://smahesh.com/HadoopUSC= / >>>> for >>>>>>> instance ) would need to be different from the one initialy propo= sed: >>>>>>>>> >>>>>>>>> An interesting video. Not as entertaining as Jim Plank's video.= ;-) >>>>>>>>> >>>>>>>>> While Plank's focused on the processor requirements for >>>>>>> encoding/decoding, this video focuses on the network and disk I/O= >>>>>>> requirements. >>>>>>>>> >>>>>>>>>> context(k, m, reed-solomon|...) =3D> context* c >>>>>>>>>> encode(context* c, void* data) =3D> void* chunks[k+m] >>>>>>>>>> decode(context* c, void* chunk[k+m], int* >>>>>>>>>> indices_of_erased_chunks) =3D> void* data // erased chunks are= not >>>>> used >>>>>>>>>> repair(context* c, void* chunk[k+m], int* >>>>>>>>>> indices_of_erased_chunks) =3D> void* chunks[k+m] // erased chu= nks >>>> are >>>>>>>>>> rebuilt >>>>>>>>>> >>>>>>>>>> The decode function must allow for partial read: >>>>>>>>>> >>>>>>>>>> decode(context* c, int offset, int length, void* chunk[k+m], i= nt* >>>>>>>>>> indices_of_erased_chunks, int* missing_chunks) =3D> void* data= >>>>>>>>>> >>>>>>>>>> If there are not enough chunks to recover the desired data ran= ge >>>>>>> [offset, offset+length) the function returns NULL and sets >>>>> missing_chunks >>>>>>> to the list of chunks that must be retrieved in order to be able = to >>>>> read >>>>>>> the desired data. >>>>>>>>>> >>>>>>>>>> If decode is called to read just 1 chunk and it is missing, re= ed- >>>>>>> solomon would return on error and ask for all other chunks to rep= air. >>>>> If >>>>>>> the underlying library implements LRC, it would ask for a subset = of >>>> the >>>>>>> chunks. >>>>>>>>>> >>>>>>>>>> An implementation allowing only full reads and using jerasure >>>>> ( which >>>>>>> does not do LRC ) requires that offset is always zero, length is = the >>>>> size >>>>>>> of the object and returns a copy of indices_of_erased_chunks if t= here >>>>> are >>>>>>> not enough chunks to rebuild the missing ones. >>>>>>>>>> >>>>>>>>>> Comments are welcome :-) >>>>>>>>> >>>>>>>>> I have loosely followed this discussion and I have not looked >>>> closely >>>>>>> at the proposed API nor at the jerasure interface. My apologies i= f >>>> this >>>>>>> has already been addressed. >>>>>>>>> >>>>>>>>> It is not clear to me from the above proposed API (ignoring the= >>>>> partial >>>>>>> read) what it would do. Was the original intent to encode the ent= ire >>>>> file >>>>>>> using k+m blocks irregardless of the file size and of the rados >>>> object >>>>>>> size? >>>>>>>>> >>>>>>>>> If so, how will you map rados objects to the logical k+m object= s >>>> and >>>>>>> vice versa? >>>>>>>>> >>>>>>>>> If not, then the initial API needed an offset and length (eithe= r >>>>>>> logical or rados object). >>>>>>>>> >>>>>>>>> I would assume that you would want to operate on rados sized >>>> objects. >>>>>>> Given a fixed k+m, then you may have more than one set of k+m obj= ects >>>>> per >>>>>>> file. This is ignoring the LRC "local" parity blocks. For example= , if >>>>> the >>>>>>> rados object size if 1 MB and k =3D 10 and m =3D 4 (as in the Xor= bas >>>> video), >>>>>>> then for a 20 MB file one would need two sets of encoding blocks.= The >>>>>>> first for objects 1-10 and the second for objects 11-20. >>>>>>>>> >>>>>>>>> Perhaps, this is what the context is above. If so, it should ha= ve >>>> the >>>>>>> logical offset and rados object size, no? >>>>>>>>> >>>>>>>>> I see value in the Xorbas concept and I wonder if the jerasure >>>>> library >>>>>>> can be modified to generate the local parity blocks such that the= y >>>> can >>>>> be >>>>>>> used to generate the global parity blocks. That would be a questi= on >>>> for >>>>>>> Jim Plank. >>>>>>>> >>>>>>>> The benefits of the Xorbas concept is reduced network and disk I= /O >>>> for >>>>>>> failures while maintaining traditional RS's higher fault-toleranc= e >>>> than >>>>> 3x >>>>>>> replication while using less space. >>>>>>>> >>>>>>>> You can do almost the same thing with jerasure without modifying= it >>>> at >>>>>>> all. Using the values from the Xorbas video, they have k data blo= cks, >>>> m >>>>>>> global parity blocks, and 2 local parity blocks (generated from k= /2 >>>>> data >>>>>>> blocks) for a total of k+m+2 blocks on disk that can tolerate any= m >>>>>>> failures. In their example, k =3D 10 and m =3D 4. They store 16 b= locks >>>> for >>>>>>> each 10 data blocks. >>>>>>>> >>>>>>>> If you use traditional RS encoding via jerasure and used the sam= e >>>>> amount >>>>>>> of storage (16 blocks for each 10 data blocks), you could encode = 3 >>>>> parity >>>>>>> blocks for each 5 data blocks. This would consume 16 data blocks = for >>>>> each >>>>>>> 10 data blocks and the fault-tolerance would be variable from 3-6= >>>>> failures >>>>>>> depending on how the failures fell between the two groups of 5 bl= ocks >>>>>>> which is higher than the static 4 failures for the Xorbas code. T= he >>>> I/O >>>>> to >>>>>>> recover from a single failure for both schemes is 5 blocks so it = is >>>> as >>>>>>> efficient as Xorbas. On average, it provides more fault-tolerance= , >>>> but >>>>> it >>>>>>> can be less (four failures from one group of 5 data + 3 parity >>>> blocks), >>>>>>> but that worst case is the same as 3x replication. >>>>>>>> >>>>>>>> Scott-- >>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph- >>>> devel" >>>>>>>> in the body of a message to majordomo@vger.kernel.org More major= domo >>>>>>>> info at http://vger.kernel.org/majordomo-info.html >>>>>>> -- >>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-de= vel" >>>>> in >>>>>>> the body of a message to majordomo@vger.kernel.org More majordomo= >>>> info >>>>> at >>>>>>> http://vger.kernel.org/majordomo-info.html >>> >>> -- >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"= in >>> the body of a message to majordomo@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>> >> >> --=20 >> Lo=EFc Dachary, Artisan Logiciel Libre >> All that is necessary for the triumph of evil is that good people do n= othing. >> >=20 --=20 Lo=EFc Dachary, Artisan Logiciel Libre All that is necessary for the triumph of evil is that good people do noth= ing. --------------enigAF09C61A923B0DFD0513C2CD Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (GNU/Linux) Comment: Using GnuPG with undefined - http://www.enigmail.net/ iEYEARECAAYFAlHW0vQACgkQ8dLMyEl6F22gTwCfZ7wozM1owxJ0POg5JeVhS0rI QQYAn0KQ8monxfjme6kLC0nCIXO119IV =G380 -----END PGP SIGNATURE----- --------------enigAF09C61A923B0DFD0513C2CD--