* Comments on Ceph.com's blog article 'Ceph's New Monitor Changes'
@ 2013-03-11 12:04 Joao Eduardo Luis
  2013-03-12 17:54 ` Mark Kampe
  0 siblings, 1 reply; 2+ messages in thread
From: Joao Eduardo Luis @ 2013-03-11 12:04 UTC
  To: ceph-devel@vger.kernel.org; +Cc: florian, Greg Farnum

Last Friday, Florian Haas (CC'ed) commented with regard to the Monitor 
changes blog post [1] on Google+ [2].  This is a transcription of the 
resulting thread, in which Greg (CC'ed) also participated, and I am now 
cross-posting it to the list for the benefit of the larger community on 
ceph-devel that might not stumble upon the post (although it is public 
and should not require a G+ account).


[1] - http://ceph.com/dev-notes/cephs-new-monitor-changes/
[2] - https://plus.google.com/110443614427234590648/posts/iuxSyCC5aJp


   -Joao


> Florian Haas on Mar 8, 2013 wrote:
>
> Good to see more Ceph developers providing their insight on recent
> and ongoing codebase changes.
>
> +Joao Eduardo Luis, I have a comment about the transition from the
> file-backed mon store to a leveldb k/v store. As the original
> reporter of http://tracker.ceph.com/issues/2752, I'm always
> wondering how to best recover if any issue simultaneously bringing
> down all mons were ever to happen again. At the time, +Gregory
> Farnum's suggestion for recovery
> (http://marc.info/?l=ceph-devel&m=134151077312444&w=2) involved
> manipulating files in the mon data directory manually; I wonder if
> the leveldb approach makes this easier or harder.
>
> It tends to be a common source of discomfort among potential Ceph
> users that if their mons ever become unrecoverable, it's almost
> impossible to recover their data (compare to GlusterFS, where you can
> always pull data out of Gluster bricks unharmed, at least as long as
> you don't use striping volumes). With a file-backed mon store, I had
> hoped that eventually this might tie into btrfs snapshots such that
> you would have been able to roll back to a known good configuration
> in an emergency. With the switch to leveldb, I no longer foresee that
> ever happening. Mind sharing your thoughts on that?


> Gregory Farnum on Mar 8, 2013 wrote:
>
> Actually LevelDB snapshots just fine.
>
> I'll leave the rest for Joao, or me after I've been awake for more
> than a minute. :)


> Florian Haas on Mar 8, 2013 wrote:
>
> Thanks. So leveldb updates are always atomic, consistent, isolated
> and durable?
>
> Also, do the mons open the leveldb fd with O_SYNC, or do they
> periodically do fsync() or fdatasync()?
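
For readers less familiar with the distinction Florian is asking about, here
is a minimal sketch of the two approaches using plain files. It says nothing
about what leveldb or the mons actually do; the function names are
hypothetical, and os.O_SYNC and os.fdatasync() are assumed to be available
(they are on Linux, with a fallback where they are not).

```python
import os


def _write_all(fd, data):
    # os.write() may write fewer bytes than requested, so loop until done.
    view = memoryview(data)
    while len(view) > 0:
        written = os.write(fd, view)
        view = view[written:]


def write_with_o_sync(path, data):
    """Open with O_SYNC: each write() returns only once the data is on
    stable storage, at the cost of a synchronous wait on every write."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | getattr(os, "O_SYNC", 0)
    fd = os.open(path, flags, 0o600)
    try:
        _write_all(fd, data)
    finally:
        os.close(fd)


def write_then_fsync(path, data, data_only=False):
    """Buffered writes followed by one explicit flush. fdatasync() may skip
    metadata that is not needed to read the data back (e.g. mtime)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        _write_all(fd, data)
        if data_only and hasattr(os, "fdatasync"):
            os.fdatasync(fd)
        else:
            os.fsync(fd)
    finally:
        os.close(fd)
```

The second form lets the caller batch several writes behind a single flush,
which is the usual reason to prefer it over O_SYNC.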


> Gregory Farnum on Mar 8, 2013 wrote:
>
> I'm not entirely sure -- leveldb handles all that stuff on its own
> and provides the right guarantees at the interface level, so I assume
> it's doing those things on its own. ;)
> (In particular it's actually a hierarchy, not a single file.)


> Joao Eduardo Luis on Mar 9, 2013 wrote:
>
> When it comes to manipulating the contents of the store, yeah, it's
> harder now. You'll need a tool that speaks leveldb for that. Which IMO
> can actually be nice if we have a tool that allows one to perform minor
> incursions in the store in a somewhat automated fashion (say, revert to
> an older osdmap) -- and creating a tool to change the store's contents
> isn't hard to do either, we just didn't get around to putting it
> together (there is one, however, to read the store's contents; adapting
> it should be easy).
>
> And I can see the discomfort in having the mon store on leveldb instead
> of the FS. I had never thought about backing up the mon store by
> resorting to, say, btrfs snapshots, as I've been approaching it with a
> 'distributed is the path to success' mindset and keeping multiple mons
> around. So I guess
> that it also might make things harder, if btrfs snapshots (or any other
> snapshot tools for that matter) don't play nice with leveldb (or
> vice-versa). I looked on the internets for a potential solution, and it
> looks like someone at stackoverflow [1] recommended performing some
> copies and creating some hard links of some of leveldb's contents to
> achieve that -- it's not a pretty solution and would require the monitor
> to block accesses to the store while performing such operation.
>
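
A rough sketch of what such a copy-and-hard-link checkpoint could look like
follows. It is not taken from the linked answer. It assumes that leveldb's
table files (*.sst, or *.ldb in later versions) are never modified once
written, so they can be hard-linked, and that everything else in the
directory may still change and must be copied. It also assumes the caller
has already blocked all writes to the store and that both directories are on
the same filesystem. checkpoint_store() is a hypothetical helper, not part of
leveldb or Ceph.

```python
import os
import shutil

# Assumption: files with these suffixes are immutable once created.
IMMUTABLE_SUFFIXES = (".sst", ".ldb")


def checkpoint_store(store_dir, checkpoint_dir):
    """Create a point-in-time copy of a quiesced store directory.

    Immutable table files are hard-linked (cheap, no data copied); all
    other regular files are copied. Returns (linked, copied) name lists.
    """
    os.makedirs(checkpoint_dir)  # fails if the checkpoint already exists
    linked, copied = [], []
    for name in sorted(os.listdir(store_dir)):
        src = os.path.join(store_dir, name)
        dst = os.path.join(checkpoint_dir, name)
        if not os.path.isfile(src):
            continue  # skip subdirectories and anything unusual
        if name.endswith(IMMUTABLE_SUFFIXES):
            os.link(src, dst)  # requires the same filesystem
            linked.append(name)
        else:
            shutil.copy2(src, dst)
            copied.append(name)
    return linked, copied
```

The store has to stay blocked for the whole call; otherwise the copied files
and the linked table files may not describe the same state.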
> Also, it is worth mentioning that leveldb does support snapshots, but
> they will all be lost after leveldb is closed, so there's no gain from
> such support when attempting to create checkpoints (unless there's
> something I've missed altogether and this is in fact possible!).
> Furthermore, operations are applied in batches (we have a nifty
> interface that abstracts them as transactions, but they're not really
> ACID transactions, although they benefit from Atomicity, Durability and
> some form of Consistency and limited Isolation [2]), and we force them
> to be written synchronously to leveldb. This basically means that if the
> whole batch successfully reaches the disk, everything should be okay; if
> the system crashes sometime in the middle of it, leveldb will
> automatically ignore partial writes, and for a batch of operations that
> will mean the whole batch.
>
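
To illustrate the "a partial write means the whole batch is ignored"
behaviour described above, here is a toy model of a checksummed append-only
log. It is not leveldb's on-disk format or API; the record layout and the
names encode_batch() and replay() are made up for this example. Each batch
is one record, so a torn record drops every operation in it.

```python
import struct
import zlib

# Hypothetical record header: payload length, CRC32 of the payload.
HEADER = struct.Struct("<II")
# Hypothetical per-operation header: key length, value length.
OP_HEADER = struct.Struct("<II")


def encode_batch(ops):
    """Serialise a list of (key, value) byte pairs as one log record."""
    payload = b"".join(
        OP_HEADER.pack(len(k), len(v)) + k + v for k, v in ops
    )
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def replay(log):
    """Rebuild the key/value state from a log, stopping at the first
    record that is truncated or fails its checksum."""
    store = {}
    pos = 0
    while pos + HEADER.size <= len(log):
        length, crc = HEADER.unpack_from(log, pos)
        start = pos + HEADER.size
        payload = log[start:start + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # torn write: the whole batch is discarded
        off = 0
        while off < length:
            klen, vlen = OP_HEADER.unpack_from(payload, off)
            off += OP_HEADER.size
            key = payload[off:off + klen]
            off += klen
            store[key] = payload[off:off + vlen]
            off += vlen
        pos = start + length
    return store
```

For example, if the log holds one complete batch followed by a second batch
missing its last byte, replay() returns only the state from the first batch.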
> By the way, I believe this discussion should be moved to ceph-devel, as
> it can be beneficial to other members of the community. If no one
> objects, I will do as such later today.
>
> [1] - http://goo.gl/7HSCH
> [2] - Atomicity is internally guaranteed by leveldb, resorting to their
> own magic; Consistency as well, guaranteed by ignoring partial writes if
> a system should fail, but leveldb mostly leaves it up to the user wrt
> how things are flushed to disk; Isolation is fairly limited: you can
> only have one process accessing leveldb at each time, but multiple
> threads can do as they please, and although leveldb will take care of
> most of the required synchronization, some thought should still go into
> that; and Durability is said to be configurable, but I have no idea what
> that means -- I have yet to find a way to make it not durable, not that
> I really need to, but it would be nice to know. :-)
