From: Joao Eduardo Luis
Subject: Comments on Ceph.com's blog article 'Ceph's New Monitor Changes'
Date: Mon, 11 Mar 2013 12:04:03 +0000
Message-ID: <513DC833.9090806@inktank.com>
To: "ceph-devel@vger.kernel.org"
Cc: florian@hastexo.com, Greg Farnum

Last Friday, Florian Haas (CC'ed) commented on the Monitor changes
blog post [1] on Google+ [2]. This is a transcription of the
resulting thread, in which Greg (CC'ed) also participated. I am
cross-posting it to the list for the benefit of the larger community on
ceph-devel that might not stumble upon the post (although it is public
and should not require a G+ account).

[1] - http://ceph.com/dev-notes/cephs-new-monitor-changes/
[2] - https://plus.google.com/110443614427234590648/posts/iuxSyCC5aJp

  -Joao

> Florian Haas on Mar 8, 2013 wrote:
>
> Good to see more Ceph developers providing their insight on recent
> and ongoing codebase changes.
>
> +Joao Eduardo Luis, I have a comment about the transition from the
> file-backed mon store to a leveldb k/v store. As the original
> reporter of http://tracker.ceph.com/issues/2752, I'm always
> wondering how best to recover if any issue simultaneously bringing
> down all mons were ever to happen again.
> At the time, +Gregory Farnum's suggestion for recovery
> (http://marc.info/?l=ceph-devel&m=134151077312444&w=2) involved
> manipulating files in the mon data directory manually; I wonder if
> the leveldb approach makes this easier or harder.
>
> It tends to be a common source of discomfort among potential Ceph
> users that if their mons ever become unrecoverable, it's almost
> impossible to recover their data (compare to GlusterFS, where you can
> always pull data out of Gluster bricks unharmed, at least as long as
> you don't use striping volumes). With a file-backed mon store, I had
> hoped that eventually this might tie into btrfs snapshots such that
> you would have been able to roll back to a known good configuration
> in an emergency. With the switch to leveldb, I no longer foresee that
> ever happening. Mind sharing your thoughts on that?

> Gregory Farnum on Mar 8, 2013 wrote:
>
> Actually LevelDB snapshots just fine.
>
> I'll leave the rest for Joao, or me after I've been awake for more
> than a minute. :)

> Florian Haas on Mar 8, 2013 wrote:
>
> Thanks. So leveldb updates are always atomic, consistent, isolated
> and durable?
>
> Also, do the mons open the leveldb fd with O_SYNC, or do they
> periodically do fsync() or fdatasync()?

> Gregory Farnum on Mar 8, 2013 wrote:
>
> I'm not entirely sure -- leveldb handles all that stuff on its own
> and provides the right guarantees at the interface level, so I assume
> it's doing those things on its own. ;)
> (In particular it's actually a hierarchy, not a single file.)

> Joao Eduardo Luis on Mar 9, 2013 wrote:
>
> When it comes to manipulating the contents of the store, yeah, it's
> harder now. You'll need a tool that speaks leveldb for that.
> Which IMO can actually be nice if we have a tool that allows one to
> perform minor incursions in the store in a somewhat automated fashion
> (say, revert to an older osdmap) -- and creating a tool to change the
> store's contents isn't hard to do either, we just didn't get around
> to putting it together (there is one, however, to read the store's
> contents; adapting it should be easy).
>
> And I can see the discomfort in having the mon store on leveldb
> instead of the FS. I had never thought about backing up the mon store
> by resorting to, say, btrfs snapshots, as I've been approaching it
> with a 'distributed is the path to success' mindset and keeping
> multiple mons around. So I guess that it also might make things
> harder, if btrfs snapshots (or any other snapshot tools for that
> matter) don't play nice with leveldb (or vice-versa). I looked on the
> internets for a potential solution, and it looks like someone at
> stackoverflow [1] recommended performing some copies and creating
> some hard links of some of leveldb's contents to achieve that -- it's
> not a pretty solution and would require the monitor to block accesses
> to the store while performing such an operation.
>
> Also, it is worth mentioning that leveldb does support snapshots, but
> they will all be lost after leveldb is closed, so there's no gain
> from such support when attempting to create checkpoints (unless
> there's something I've missed altogether and this is in fact
> possible!).
>
> Furthermore, operations are applied in batches (we have a nifty
> interface that abstracts them as transactions, but they're not really
> ACID transactions, although they benefit from Atomicity, Durability
> and some form of Consistency and limited Isolation [2]), and we force
> them to be written synchronously to leveldb.
> This basically means that if the whole batch successfully reaches
> the disk, everything should be okay; if the system crashes sometime
> in the middle of it, leveldb will automatically ignore partial
> writes, and for a batch of operations that will mean the whole batch.
>
> By the way, I believe this discussion should be moved to ceph-devel,
> as it can be beneficial to other members of the community. If no one
> objects, I will do so later today.
>
> [1] - http://goo.gl/7HSCH
> [2] - Atomicity is internally guaranteed by leveldb, resorting to
> their own magic; Consistency as well, guaranteed by ignoring partial
> writes if a system should fail, but leveldb mostly leaves it up to
> the user wrt how things are flushed to disk; Isolation is fairly
> limited: you can only have one process accessing leveldb at a time,
> but multiple threads can do as they please, and although leveldb will
> take care of most of the required synchronization, some thought
> should still go into that; and Durability is said to be configurable,
> but I have no idea what that means -- I have yet to find a way to
> make it not durable, not that I really need it, but it would be nice
> to know. :-)
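
The batch behaviour described in the thread (a batch that only partly
reaches the disk is ignored as a whole on recovery) can be illustrated
with a toy model. The sketch below is an assumption-laden illustration
in pure Python, not leveldb's API or on-disk format: it frames each
batch with a length and a CRC32 in an append-only log, fsyncs after
every batch, and on recovery stops at the first torn or corrupt record.
All function names, and the keys "osdmap:1" and "last_committed", are
hypothetical and chosen only for the example.

```python
import os
import struct
import tempfile
import zlib

# Hypothetical record header: payload length, then CRC32 of the payload.
HEADER = struct.Struct("<II")


def encode_batch(ops):
    """Serialize a list of (key, value) byte pairs into one payload."""
    parts = []
    for key, value in ops:
        parts.append(struct.pack("<II", len(key), len(value)))
        parts.append(key)
        parts.append(value)
    return b"".join(parts)


def decode_batch(payload):
    """Inverse of encode_batch; only called on CRC-verified payloads."""
    ops, pos = [], 0
    while pos < len(payload):
        klen, vlen = struct.unpack_from("<II", payload, pos)
        pos += 8
        key = payload[pos:pos + klen]
        pos += klen
        value = payload[pos:pos + vlen]
        pos += vlen
        ops.append((key, value))
    return ops


def append_batch(path, ops):
    """Append one batch as a single framed record and force it to disk."""
    payload = encode_batch(ops)
    with open(path, "ab") as log:
        log.write(HEADER.pack(len(payload), zlib.crc32(payload)))
        log.write(payload)
        log.flush()
        os.fsync(log.fileno())  # synchronous write: return once on disk


def recover(path):
    """Replay complete batches; ignore a torn batch in its entirety."""
    store = {}
    with open(path, "rb") as log:
        data = log.read()
    pos = 0
    while pos + HEADER.size <= len(data):
        length, crc = HEADER.unpack_from(data, pos)
        start = pos + HEADER.size
        payload = data[start:start + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # partial or corrupt record: drop the whole batch
        for key, value in decode_batch(payload):
            store[key] = value
        pos = start + length
    return store


def demo():
    """Write two batches, tear the second one, and recover."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy.log")
        append_batch(path, [(b"osdmap:1", b"v1"), (b"last_committed", b"1")])
        append_batch(path, [(b"osdmap:2", b"v2"), (b"last_committed", b"2")])
        # Simulate a crash midway through the second batch.
        size = os.path.getsize(path)
        with open(path, "r+b") as log:
            log.truncate(size - 3)
        return recover(path)
```

Calling demo() recovers only the keys from the first batch:
"last_committed" is still b"1" and "osdmap:2" is absent, because the
second record fails its length check and none of its operations are
applied. A real store would also have to deal with log rotation and
compaction, which this sketch deliberately leaves out.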