From: Jason Dillaman
Subject: Re: RBD journal draft design
Date: Wed, 3 Jun 2015 12:13:33 -0400 (EDT)
Message-ID: <1679134333.10270211.1433348013379.JavaMail.zimbra@redhat.com>
References: <1574383603.9391063.1433257824183.JavaMail.zimbra@redhat.com>
 <1346174854.9391994.1433257911870.JavaMail.zimbra@redhat.com>
To: Gregory Farnum
Cc: Ceph Development

> > In contrast to the current journal code used by CephFS, the new journal
> > code will use sequence numbers to identify journal entries, instead of
> > offsets within the journal.
>
> Am I misremembering what actually got done with our journal v2 format?
> I think this is done, or at least we made a move in this direction.

Assuming journal v2 is the code in osdc/Journaler.cc, there is a new
"resilient" format that helps in detecting corruption, but it appears to
be still largely based upon offsets and using the Filer/Striper for I/O.
This does remind me that I probably want to include a magic preamble
value at the start of each journal entry to facilitate recovery.

> > A new journal object class method will be used to submit journal entry
> > append requests. This will act as a gatekeeper for the concurrent client
> > case.
>
> The object class is going to be a big barrier to using EC pools;
> unless you want to block the use of EC pools on EC pools supporting
> object classes. :(

Josh mentioned (via Sam) that reads were not currently supported by
object classes on EC pools. Are appends not supported either?
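
To make the magic preamble idea above concrete, here is a rough sketch of
how a per-entry preamble could let a recovery tool resynchronize after a
damaged entry. Everything in it is an assumption for illustration only:
the MAGIC value, the header layout (preamble, sequence number, payload
length), the trailing CRC32, and the names encode_entry and scan_entries
are hypothetical and not part of the proposed design.

```python
import struct
import zlib

# Hypothetical 8-byte preamble; the real value and width are undecided.
MAGIC = b"\x3a\x4f\x1d\x8c\x7e\x52\x9b\x06"
# Assumed header layout: preamble, sequence number, payload length.
HEADER = struct.Struct("<8sQI")
CRC = struct.Struct("<I")


def encode_entry(seq, payload):
    """Frame one journal entry as header + payload + CRC32."""
    header = HEADER.pack(MAGIC, seq, len(payload))
    crc = zlib.crc32(header + payload) & 0xFFFFFFFF
    return header + payload + CRC.pack(crc)


def scan_entries(buf):
    """Yield (seq, payload) for each intact entry in a journal object.

    A damaged entry is skipped by searching for the next preamble
    instead of trusting offsets or lengths stored in the bad entry.
    """
    pos = 0
    while True:
        pos = buf.find(MAGIC, pos)
        if pos < 0:
            return
        end_header = pos + HEADER.size
        if end_header > len(buf):
            return  # not enough bytes left for a complete header
        _, seq, length = HEADER.unpack_from(buf, pos)
        end_payload = end_header + length
        end = end_payload + CRC.size
        if end <= len(buf):
            (crc,) = CRC.unpack_from(buf, end_payload)
            if zlib.crc32(buf[pos:end_payload]) & 0xFFFFFFFF == crc:
                yield seq, bytes(buf[end_header:end_payload])
                pos = end
                continue
        pos += 1  # corrupt or truncated entry: look for the next preamble
```

The checksum is there because a preamble alone cannot tell a real entry
from payload bytes that happen to contain the same pattern.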
> > A successful append will indicate whether or not the journal is now full
> > (larger than the max object size), indicating to the client that a new
> > journal object should be used. If the journal is too large, an error code
> > response would alert the client that it needs to write to the current
> > active journal object. In practice, the only time the journaler should
> > expect to see such a response would be in the case where multiple clients
> > are using the same journal and the active object update notification has
> > yet to be received.
>
> I'm confused. How does this work with the splay count thing you
> mentioned above? Can you define ?

Similar to the stripe width.

> What happens if users submit sequenced entries substantially out of
> order? It sounds like if you have multiple writers (or even just a
> misbehaving client) it would not be hard for one of them to grab
> sequence value N, for another to fill up one of the journal entry
> objects with sequences in the range [N+1]...[N+x] and then for the
> user of N to get an error response.

I was thinking that when a client submits their journal entry payload,
the journaler will allocate the next available sequence number, compute
which active journal object that sequence should be submitted to, and
start an AIO append op to write the journal entry. The next journal
entry to be appended to the same journal object would be entries later.

This does bring up a good point that if you are generating journal
entries fast enough, the delayed response saying the object is full
could cause multiple later journal entry ops to need to be resent to the
new (non-full) object. Given that, it might be best to scrap the hard
error when the journal object gets full and just let the journaler
eventually switch to a new object when it receives a response saying the
object is now full.
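
As a rough illustration of the flow described above (allocate a sequence
number, pick the journal object, append, and only switch objects once a
"now full" response arrives), here is a small in-memory sketch. The
modulo mapping from sequence number to object, the object numbering by
"active set", and the names Journaler, object_for, and on_full are
assumptions made for this example. The append is simulated synchronously,
whereas the real thing would be an AIO op with a delayed response.

```python
from dataclasses import dataclass, field


@dataclass
class Journaler:
    splay_count: int            # number of journal objects written in rotation
    soft_max_object_size: int   # soft limit: exceeding it is not an error
    active_set: int = 0         # assumed: objects are grouped into sets
    next_seq: int = 0
    object_sizes: dict = field(default_factory=dict)  # stand-in for OSD state

    def object_for(self, seq):
        """Assumed mapping: rotate sequences across the active objects."""
        return self.active_set * self.splay_count + seq % self.splay_count

    def append(self, payload):
        """Allocate a sequence number and append to the computed object."""
        seq = self.next_seq
        self.next_seq += 1
        obj = self.object_for(seq)
        size = self.object_sizes.get(obj, 0) + len(payload)
        self.object_sizes[obj] = size
        # The append always succeeds; the response only reports fullness.
        full = size >= self.soft_max_object_size
        if full:
            self.on_full(obj)
        return seq, obj, full

    def on_full(self, obj):
        """Switch to the next set of objects for later appends."""
        if obj // self.splay_count == self.active_set:
            self.active_set += 1
```

With no hard error, entries already in flight to the full object are
simply accepted, so nothing has to be resent when the switch happens.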
> >
> > Since the journal is designed to be append-only, there needs to be support
> > for cases where a journal entry needs to be updated out-of-band (e.g. fixing
> > a corrupt entry similar to CephFS's current journal recovery tools). The
> > proposed solution is to just append a new journal entry with the same
> > sequence number as the record to be replaced to the end of the journal
> > (i.e. last entry for a given sequence number wins). This also protects
> > against accidental replays of the original append operation. An
> > alternative suggestion would be to use a compare-and-swap mechanism to
> > update the full journal object with the updated contents.
>
> I'm confused by this bit. It seems to imply that fetching a single
> entry requires checking the entire object to make sure there's no
> replacement. Certainly if we were doing replay we couldn't just apply
> each entry sequentially any more because an overwritten entry might
> have its value replaced by a later (by sequence number) entry that
> occurs earlier (by offset) in the journal.

The goal would be to use prefetching on the replay. Since the whole
object is already in-memory, scanning for duplicates would be fairly
trivial. If there is a way to prevent the OSDs from potentially
replaying a duplicate append journal entry message, the CAS update
technique could be used.

> I'd also like it if we could organize a single Journal implementation
> within the Ceph project, or at least have a blessed one going forward
> that we use for new stuff and might plausibly migrate existing users
> to. The big things I see different from osdc/Journaler are:

Agreed. While librbd will be the first user of this, I wasn't planning
to locate it within the librbd library.

> 1) (design) class-based
> 2) (design) uses librados instead of Objecter (hurray)
> 3) (need) should allow multiple writers
> 4) (fallout of other choices?) does not stripe entries across multiple
> objects

For striping, I assume this is a function of how large MDS journal
entries are expected to be. The largest RBD journal entries would be
block write operations, so in the low kilobytes. It would be possible to
add a higher layer to this design that could break up large client
journal entries into multiple, smaller entries.

> Using librados instead of the Objecter might make this tough to use in
> the MDS, but we've already got journaling happening in a separate
> thread and it's one of the more isolated bits of code so we might be
> able to handle it. I'm not sure if we'd want to stripe across objects
> or not, but the possibility does appeal to me.
>
> >
> > Journal Header
> > ~~~~~~~~~~~~~~
> >
> > omap
> > * soft max object size
> > * journal objects splay count
> > * min object number
> > * most recent active journal objects (could be out-of-date)
> > * registered clients
> >   * client description (i.e. zone)
> >   * journal entry tag
> >   * last committed sequence number
>
> omap definitely doesn't go in EC pools; I'm not sure how blue-sky you
> were thinking when you mentioned those. :)

Did not realize that. Good to know.

> More generally the naive client implementation would be pretty slow to
> commit something (go to header for sequence number, write data out).
> Do you expect to always have a queue of sequence numbers available in
> case you need to do an immediate commit of data? What makes the single
> header sequence assignment be not a bottleneck on its own for multiple
> clients? It will need to do a write each time...

There is no need to go to the header for a sequence number. Multiple
(out-of-process) writers to the same journal would need to use a
different tag so that they would have their own sequence number set.
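
To sketch how per-tag sequence numbers could combine with the "last
entry for a given sequence number wins" rule discussed earlier, here is a
small in-memory model. The names TaggedSequencer and replay_order, and
the choice to key replay on the (tag, sequence) pair, are assumptions
for illustration. The entries list stands in for a prefetched journal
object scanned in append order.

```python
from collections import defaultdict


class TaggedSequencer:
    """Each tag has its own sequence number set, allocated locally."""

    def __init__(self):
        self._next = defaultdict(int)

    def allocate(self, tag):
        seq = self._next[tag]
        self._next[tag] += 1
        return seq


def replay_order(entries):
    """Resolve duplicates, then order entries for replay.

    `entries` is an iterable of (tag, seq, payload) in append order.
    A later append with the same (tag, seq) replaces the earlier one.
    """
    latest = {}
    for tag, seq, payload in entries:
        latest[(tag, seq)] = payload
    return [(tag, seq, latest[(tag, seq)]) for tag, seq in sorted(latest)]
```

For example, a writer tagged "a" and a writer tagged "b" can both start
at sequence 0 without touching the header, and an out-of-band fix for
("a", 1) only needs to be appended after the corrupt copy.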
> -Greg