Comments on Ceph.com's blog article 'Ceph's New Monitor Changes'

* Comments on Ceph.com's blog article 'Ceph's New Monitor Changes'
@ 2013-03-11 12:04 Joao Eduardo Luis
  2013-03-12 17:54 ` Mark Kampe
  0 siblings, 1 reply; 2+ messages in thread
From: Joao Eduardo Luis @ 2013-03-11 12:04 UTC
  To: ceph-devel@vger.kernel.org; +Cc: florian, Greg Farnum

Last Friday, Florian Haas (CC'ed) commented with regard to the Monitor 
changes blog post [1] on Google+ [2].  This is a transcription of the 
resulting thread, in which Greg (CC'ed) also participated, and I am now 
cross-posting it to the list for the benefit of the larger community on 
ceph-devel that might not stumble upon the post (although it is public 
and should not require a G+ account).


[1] - http://ceph.com/dev-notes/cephs-new-monitor-changes/
[2] - https://plus.google.com/110443614427234590648/posts/iuxSyCC5aJp


   -Joao


> Florian Haas on Mar 8, 2013 wrote:
>
> Good to see more Ceph developers providing their insight on recent
> and ongoing codebase changes.
>
> +Joao Eduardo Luis, I have a comment about the transition from the
> file-backed mon store to a leveldb k/v store. As the original
> reporter of http://tracker.ceph.com/issues/2752, I'm always
> wondering how to best recover if any issue simultaneously bringing
> down all mons were ever to happen again. At the time, +Gregory
> Farnum's suggestion for recovery
> (http://marc.info/?l=ceph-devel&m=134151077312444&w=2) involved
> manipulating files in the mon data directory manually; I wonder if
> the leveldb approach makes this easier or harder.
>
> It tends to be a common source of discomfort among potential Ceph
> users that if their mons ever become unrecoverable, it's almost
> impossible to recover your data (compare to GlusterFS, where you can
> always pull data out of Gluster bricks unharmed, at least as long as
> you don't use striping volumes). With a file backed mon store, I had
> hoped that eventually this might tie into btrfs snapshots such that
> you would have been able to roll back to a known good configuration
> in an emergency. With the switch to leveldb, I no longer foresee that
> ever happening. Mind sharing your thoughts on that?


> Gregory Farnum on Mar 8, 2013 wrote:
>
> Actually LevelDB snapshots just fine.
>
> I'll leave the rest for Joao, or me after I've been awake for more
> than a minute. :)


> Florian Haas on Mar 8, 2013 wrote:
>
> Thanks. So leveldb updates are always atomic, consistent, isolated
> and durable?
>
> Also, do the mons open the leveldb fd with O_SYNC, or do they
> periodically do fsync() or fdatasync()?
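
The two options named in the question can be sketched side by side. This is
a minimal illustration using plain POSIX file calls from Python; it does not
touch leveldb and says nothing about what the mons actually do. The function
names are made up for the example. (For reference, leveldb's C++ API exposes
the choice per write through the sync field of WriteOptions.)

```python
import os
import tempfile


def write_with_o_sync(path, data):
    """Open with O_SYNC so each write() returns only once the data is on stable storage."""
    # O_SYNC is not defined on every platform; fall back to 0 (no flag) if missing.
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | getattr(os, "O_SYNC", 0)
    fd = os.open(path, flags, 0o600)
    try:
        os.write(fd, data)  # small payload assumed; a robust writer would loop on short writes
    finally:
        os.close(fd)


def write_then_fsync(path, data):
    """Write through the page cache, then flush explicitly with fsync()."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)  # durability point: returns after the kernel has flushed the file
    finally:
        os.close(fd)
```

With O_SYNC every write pays the flush cost; with an explicit fsync() the
caller decides where the durability points fall, which is what makes
batching several updates behind one flush possible.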


> Gregory Farnum on Mar 8, 2013 wrote:
>
> I'm not entirely sure -- leveldb handles all that stuff on its own
> and provides the right guarantees at the interface level, so I assume it's
> doing those things on its own. ;)
> (In particular it's actually a hierarchy, not a single file.)


> Joao Eduardo Luis on Mar 9, 2013 wrote:
>
> When it comes to manipulating the contents of the store, yeah, it's
> harder now. You'll need a tool that speaks leveldb for that. Which IMO
> can actually be nice if we have a tool that allows one to perform minor
> incursions in the store in a somewhat automated fashion (say, revert to
> an older osdmap) -- and creating a tool to change the store's contents
> isn't hard to do either, we just didn't get around to putting it together
> (there's one however to read the store's contents, adapting it should be
> easy).
>
> And I can see the discomfort in having the mon store on leveldb instead
> of the FS. I had never thought about backing up the mon store by resorting
> to, say, btrfs snapshots, as I've been approaching it with a 'distributed
> is the path to success' mindset and keeping multiple mons around. So I guess
> that it also might make things harder, if btrfs snapshots (or any other
> snapshot tools for that matter) don't play nice with leveldb (or
> vice-versa). I looked on the internets for a potential solution, and it
> looks like someone at stackoverflow [1] recommended performing some
> copies and creating some hard links of some of leveldb's contents to
> achieve that -- it's not a pretty solution and would require the monitor
> to block accesses to the store while performing such operation.
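
A rough sketch of the copy-plus-hard-link idea, assuming the store is
quiesced while it runs (as noted above) and that leveldb's table files are
immutable once written. The suffix list and the function name are
assumptions made for the example, not something taken from the
stackoverflow answer or from the monitor code.

```python
import os
import shutil

# Assumption: table files (.sst in older leveldb, .ldb in newer releases) never
# change after creation, so a hard link is as good as a copy.
IMMUTABLE_SUFFIXES = (".sst", ".ldb")


def snapshot_store(src_dir, dst_dir):
    """Checkpoint a quiesced store directory: hard-link table files, copy the rest.

    Returns (linked, copied) lists of file names. dst_dir must not exist yet
    and must be on the same filesystem as src_dir for os.link() to work.
    """
    os.makedirs(dst_dir)
    linked, copied = [], []
    for name in sorted(os.listdir(src_dir)):
        src = os.path.join(src_dir, name)
        dst = os.path.join(dst_dir, name)
        if not os.path.isfile(src):
            continue
        if name.endswith(IMMUTABLE_SUFFIXES):
            os.link(src, dst)       # shares the inode, no data copied
            linked.append(name)
        else:
            shutil.copy2(src, dst)  # mutable files (manifest, log, CURRENT) need a real copy
            copied.append(name)
    return linked, copied
```

The hard links keep the checkpoint cheap, since the bulk of the data lives
in the table files; only the small mutable files are duplicated.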
>
> Also, it is worth mentioning that leveldb does support snapshots, but
> they will all be lost after leveldb is closed, so there's no gain from
> such support when attempting to create checkpoints (unless there's
> something I've missed altogether and this is in fact possible!).
> Furthermore, operations are applied in batches (we have a nifty
> interface that abstracts them as transactions, but they're not really
> ACID transactions, although they benefit from Atomicity, Durability and
> some form of Consistency and limited Isolation [2]), and we force them
> to be written synchronously to leveldb. This basically means that if the
> whole batch successfully reaches the disk, everything should be okay; if
> the system crashes sometime in the middle of it, leveldb will
> automatically ignore partial writes, and for a batch of operations that
> will mean the whole batch.
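
The all-or-nothing behaviour of a batch can be illustrated with a toy
append-only log. This is not leveldb's on-disk format; the record layout
(length plus CRC32 in front of each batch) and all names are assumptions
chosen to show the principle: a batch is one record, and a torn record is
dropped whole on replay.

```python
import struct
import zlib

HEADER = struct.Struct("<II")  # payload length, CRC32 of payload


def encode_batch(ops):
    """Serialize a list of (key, value) byte pairs as ONE checksummed log record."""
    payload = b"".join(struct.pack("<II", len(k), len(v)) + k + v for k, v in ops)
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def decode_ops(payload):
    """Split a record payload back into its (key, value) pairs."""
    ops, pos = [], 0
    while pos < len(payload):
        klen, vlen = struct.unpack_from("<II", payload, pos)
        pos += 8
        ops.append((payload[pos:pos + klen], payload[pos + klen:pos + klen + vlen]))
        pos += klen + vlen
    return ops


def replay(log):
    """Rebuild the key/value state, stopping at the first torn or corrupt record."""
    state, pos = {}, 0
    while pos + HEADER.size <= len(log):
        length, crc = HEADER.unpack_from(log, pos)
        payload = log[pos + HEADER.size:pos + HEADER.size + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # partial write: the whole batch is ignored, none of its ops apply
        for key, value in decode_ops(payload):
            state[key] = value
        pos += HEADER.size + length
    return state
```

If a crash cuts the log in the middle of the last record, replay() returns
the state as of the previous batch, which is the behaviour described above.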
>
> By the way, I believe this discussion should be moved to ceph-devel, as
> it can be beneficial to other members of the community. If no one
> objects, I will do as such later today.
>
> [1] - http://goo.gl/7HSCH
> [2] - Atomicity is internally guaranteed by leveldb, resorting to their
> own magic; Consistency as well, guaranteed by ignoring partial writes if
> a system should fail, but leveldb mostly leaves it up to the user wrt
> how things are flushed to disk; Isolation is fairly limited: you can
> only have one process accessing leveldb at each time, but multiple
> threads can do as they please, and although leveldb will take care of
> most of the required synchronization, some thought should still go into
> that; and Durability is said to be configurable, but I have no idea what
> that means -- I have yet to find a way to make it not durable, not that I
> really need it, but it would be nice to know. :-)


* Re: Comments on Ceph.com's blog article 'Ceph's New Monitor Changes'
  2013-03-11 12:04 Comments on Ceph.com's blog article 'Ceph's New Monitor Changes' Joao Eduardo Luis
@ 2013-03-12 17:54 ` Mark Kampe
  0 siblings, 0 replies; 2+ messages in thread
From: Mark Kampe @ 2013-03-12 17:54 UTC
  To: Joao Eduardo Luis; +Cc: ceph-devel@vger.kernel.org

It seems to me that the surviving OSDs still remember all of
the osdmap and pgmap history back to "last epoch started"
for all of their PGs.  Isn't this enough to enable reconstruction
of all of the pgmaps and osdmaps required to find any copy of
any currently stored object?

My history has given me biases, but I prefer reconstruction over
snapshots because:

  (a) it enables recovery from more catastrophic incidents
      (e.g. a bug has corrupted all of the monitor stores
      or a fire has reduced all monitor nodes to slag)

  (b) it is less likely to result in inconsistencies involving
      object updates after the last snapshot

  (c) the ability to reconstruct is a superset of the ability
      to audit, so we get consistency audits for free
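
Point (c) can be made concrete with a toy model. Assume, purely for
illustration, that each surviving OSD can hand back a dictionary mapping
epoch number to an opaque map blob; real osdmaps and pgmaps are encoded
quite differently, and every name below is hypothetical. Merging the
histories is the reconstruction; the conflicts and gaps noticed along the
way are the audit.

```python
def reconstruct_history(osd_histories):
    """Merge per-OSD {epoch: map_blob} histories into one.

    Returns (merged, conflicts, missing):
      merged    -- {epoch: blob}, first blob seen for each epoch
      conflicts -- sorted epochs where two OSDs disagree on the blob
      missing   -- sorted epochs absent between the oldest and newest seen
    """
    merged, conflicts = {}, set()
    for history in osd_histories.values():
        for epoch, blob in history.items():
            if epoch in merged and merged[epoch] != blob:
                conflicts.add(epoch)  # audit finding: copies of this epoch differ
            merged.setdefault(epoch, blob)
    if merged:
        expected = set(range(min(merged), max(merged) + 1))
        missing = sorted(expected - set(merged))
    else:
        missing = []
    return merged, sorted(conflicts), missing
```

An empty conflicts list and an empty missing list would mean the surviving
OSDs agree on a gap-free history; anything else is exactly what a
consistency audit would report.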


>> It tends to be a common source of discomfort among potential Ceph
>> users that if their mons ever become unrecoverable, it's almost
>> impossible to recover your data (compare to GlusterFS, where you can
>> always pull data out of Gluster bricks unharmed, at least as long as
>> you don't use striping volumes). With a file backed mon store, I had
>> hoped that eventually this might tie into btrfs snapshots such that
>> you would have been able to roll back to a known good configuration
>> in an emergency. With the switch to leveldb, I no longer foresee that
>> ever happening. Mind sharing your thoughts on that?
