Questions about OSD recovery

* Questions about OSD recovery
@ 2012-02-08  2:54 Henry C Chang
  2012-02-09  3:14 ` Josh Durgin
  0 siblings, 1 reply; 5+ messages in thread
From: Henry C Chang @ 2012-02-08  2:54 UTC
  To: ceph-devel

Hi all,

I did some experiments on the OSD and had some questions about it.

I removed one object directly from the osd data store. As expected,
the osd didn't notice it until I manually scrubbed the pg. However,
the scrubbing doesn't trigger the recovery automatically. I had to do
'ceph pg repair' to fix it.

So, my first question is: can the recovery process be triggered
automatically once the scrubbing has detected the inconsistency?

Then, I tried again and removed another object. But this time, I
didn't scrub the pg. I restarted the osd. As expected, the osd didn't
notice that, either.

My second question is: is it possible to check the existence of the
objects when scanning the pg during osd startup? Does it make sense to
do so?

Regards,
Henry

* Re: Questions about OSD recovery
  2012-02-08  2:54 Questions about OSD recovery Henry C Chang
@ 2012-02-09  3:14 ` Josh Durgin
  2012-02-09 17:28   ` Tommi Virtanen
  2012-02-10  8:26   ` Henry C Chang
  0 siblings, 2 replies; 5+ messages in thread
From: Josh Durgin @ 2012-02-09  3:14 UTC
  To: Henry C Chang; +Cc: ceph-devel

On 02/07/2012 06:54 PM, Henry C Chang wrote:
> Hi all,
>
> I did some experiments on the OSD and had some questions about it.
>
> I removed one object directly from the osd data store. As expected,
> the osd didn't notice it until I manually scrubbed the pg. However,
> the scrubbing doesn't trigger the recovery automatically. I had to do
> 'ceph pg repair' to fix it.
>
> So, my first question is: can the recovery process be triggered
> automatically once the scrubbing has detected the inconsistency?

It's possible to do what the current repair code does
automatically, but this would be a bad idea since it just takes
the first osd (with primary before replicas) to have the object
as authoritative, and copies it to all the relevant osds. If the
primary has a corrupt copy, this corruption will spread to other
osds. In your case, since you removed the object entirely, repair
could correct it.
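
To make the risk concrete, here is a minimal sketch of the "first osd that
has the object wins" policy described above. It is an illustration only, not
Ceph's repair code: the function name `naive_repair` and the list-of-tuples
representation of the copies are assumptions made for this example.

```python
def naive_repair(copies):
    """Apply a 'first holder is authoritative' repair policy.

    copies: list of (osd_name, data_or_None), primary first.
    Returns a dict osd_name -> data after the repair.
    """
    # Take the first copy that exists at all; its content is never verified.
    authoritative = next((data for _, data in copies if data is not None), None)
    # Push that copy to every osd in the set.
    return {osd: authoritative for osd, _ in copies}


# Case 1: the primary lost the object entirely. A replica becomes the
# source, so the repair restores the good data.
missing_on_primary = [("osd.0", None), ("osd.1", b"good"), ("osd.2", b"good")]

# Case 2: the primary holds a corrupt copy. It is still first in line,
# so the corruption is copied over both good replicas.
corrupt_on_primary = [("osd.0", b"bad!"), ("osd.1", b"good"), ("osd.2", b"good")]
```

Running the policy automatically after every scrub would be safe in the first
case and harmful in the second, and nothing in the policy itself can tell the
two apart.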

In general, if an object is corrupted, there's no way to tell
which one is correct right now. You could use btrfs checksumming
underneath the osd to protect against this, but the osds don't
checksum the objects themselves. Scrub/repair could certainly be
a lot smarter. It's been on the todo list for a while, but we
haven't gotten to it yet.

> Then, I tried again and removed another object. But this time, I
> didn't scrub the pg. I restarted the osd. As expected, the osd didn't
> notice that, either.
>
> My second question is: is it possible to check the existence of the
> objects when scanning the pg during osd startup? Does it make sense to
> do so?

Detecting missing objects on startup is possible by looking at
the pg log and comparing it to the objects on disk, but this can
be a pretty expensive operation. The osd might also be out of
date, so its log might be useless (for example it could have
divergent history that was not acked). It can't know how many
current objects that should be there aren't until it goes through
peering (to get an up to date and authoritative log) and
recovery (to get missing data the logs say should be there). This
is why scrub skips pgs that aren't active+clean. More details of
peering can be found at http://ceph.newdream.net/docs/latest/dev/peering/.
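
As a rough sketch of the startup check being discussed, the comparison could
look like the following. The log format here (ordered `(op, name)` pairs with
`"modify"` and `"delete"` operations) is a simplification invented for the
example; the real pg log carries versions and other state, and the function
names are hypothetical.

```python
def expected_objects(pg_log):
    """Replay log entries in order and return the names that should exist."""
    present = set()
    for op, name in pg_log:
        if op == "modify":
            present.add(name)
        elif op == "delete":
            present.discard(name)
    return present


def missing_on_disk(pg_log, on_disk):
    """Return objects the log says should exist but the store does not have."""
    return expected_objects(pg_log) - set(on_disk)


log = [("modify", "a"), ("modify", "b"), ("delete", "a"), ("modify", "c")]
print(missing_on_disk(log, ["b"]))  # objects to recover
```

The sketch also shows the limits mentioned above: the answer is only as good
as the log it replays, so a stale or divergent local log gives a wrong list,
and objects that never appear in the log are not checked at all.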

* Re: Questions about OSD recovery
  2012-02-09  3:14 ` Josh Durgin
@ 2012-02-09 17:28   ` Tommi Virtanen
  2012-02-10  8:26   ` Henry C Chang
  1 sibling, 0 replies; 5+ messages in thread
From: Tommi Virtanen @ 2012-02-09 17:28 UTC
  To: Josh Durgin; +Cc: Henry C Chang, ceph-devel

On Wed, Feb 8, 2012 at 19:14, Josh Durgin <josh.durgin@dreamhost.com> wrote:
> It's possible to do what the current repair code does
> automatically, but this would be a bad idea since it just takes
> the first osd (with primary before replicas) to have the object
> as authoritative, and copies it to all the relevant osds. If the
> primary has a corrupt copy, this corruption will spread to other
> osds. In your case, since you removed the object entirely, repair
> could correct it.

At the risk of saying the obvious.. If you have >=3 copies, you could
hash them all, and let the majority decide which is the "good" copy.

An admin could do this manually, just deleting the bad one and letting
scrub repair it, and later on we might be able to automate it.

I'm not sure if Dynamo's/Cassandra's anti-entropy feature does this,
or if it's a simple "master overwrites slaves", and I realize the
multi-party communication is sort of hard to coordinate, but it's
definitely possible. I loves me some Merkle trees.

Of course, there might be cases where e.g. all 3 replicas have
different content.

In many ways, getting a hash stored alongside the object is
significantly better, and might be a better route to go -- our objects
are big enough, as opposed to typical Dynamo/Cassandra cells that are
often smaller than a sha1.
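
A minimal sketch of the majority idea, assuming each replica's content is
available as bytes: hash every copy and accept a digest only if a strict
majority agrees on it. The function name `majority_copy` and the dict layout
are assumptions for illustration; nothing like this exists in the repair code
described in this thread.

```python
import hashlib
from collections import Counter


def majority_copy(replicas):
    """Pick the 'good' copy by majority vote over content hashes.

    replicas: dict osd_name -> bytes.
    Returns (digest, good_osds, bad_osds), or None when no digest is
    held by a strict majority (e.g. all three replicas differ).
    """
    if not replicas:
        return None
    digests = {osd: hashlib.sha1(data).hexdigest() for osd, data in replicas.items()}
    digest, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(replicas):
        return None  # a tie or a three-way split: nothing can be decided
    good = sorted(osd for osd, d in digests.items() if d == digest)
    bad = sorted(osd for osd, d in digests.items() if d != digest)
    return digest, good, bad


result = majority_copy({"osd.0": b"data", "osd.1": b"dat4", "osd.2": b"data"})
```

With two replicas a single mismatch is always a tie, which is why the
suggestion only applies with three or more copies.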

* Re: Questions about OSD recovery
  2012-02-09  3:14 ` Josh Durgin
  2012-02-09 17:28   ` Tommi Virtanen
@ 2012-02-10  8:26   ` Henry C Chang
  2012-02-10 22:34     ` Sage Weil
  1 sibling, 1 reply; 5+ messages in thread
From: Henry C Chang @ 2012-02-10  8:26 UTC
  To: Josh Durgin; +Cc: ceph-devel

On February 9, 2012 at 11:14 AM, Josh Durgin <josh.durgin@dreamhost.com> wrote:
> Detecting missing objects on startup is possible by looking at
> the pg log and comparing it to the objects on disk, but this can
> be a pretty expensive operation. The osd might also be out of

Yeah. It can be pretty expensive, but we only do it once on startup.
Also, since the osd has not yet joined the cluster, it shouldn't
affect the cluster performance.

> date, so its log might be useless (for example it could have
> divergent history that was not acked). It can't know how many
> current objects that should be there aren't until it goes through
> peering (to get an up to date and authoritative log) and
> recovery (to get missing data the logs say should be there). This
> is why scrub skips pgs that aren't active+clean. More details of
> peering can be found at http://ceph.newdream.net/docs/latest/dev/peering/.

Since peering only compares logs, I was thinking at least the osd should
check the existence of the objects the log claims to have. Then, we
would have the chance to recover the object before the pg goes active.

Also, I like the idea of storing crc/hash alongside the object as Tv said.
With that, we can even prevent the client from reading the corrupt data
by checking the crc/hash on each read. (Though, the read performance
will surely degrade.)
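
The read-side check could be sketched like this. It is a toy in-memory store,
not the osd's object store: the class `ChecksummedStore`, its methods, and the
use of CRC32 from Python's `zlib` are all assumptions chosen to keep the
example small.

```python
import zlib


class ChecksummedStore:
    """Toy object store that keeps a CRC32 next to each object."""

    def __init__(self):
        self._objects = {}  # name -> (data, crc)

    def write(self, name, data):
        data = bytes(data)
        self._objects[name] = (data, zlib.crc32(data))

    def read(self, name):
        data, crc = self._objects[name]
        # Every read pays for one checksum pass over the data.
        if zlib.crc32(data) != crc:
            raise IOError(f"checksum mismatch on {name!r}")
        return data

    def corrupt(self, name, data):
        """Test helper: change the data behind the store's back."""
        _, crc = self._objects[name]
        self._objects[name] = (bytes(data), crc)


store = ChecksummedStore()
store.write("obj", b"hello")
```

A read of an intact object returns the data; after `store.corrupt("obj",
b"jello")` the same read raises instead of handing bad bytes to the client.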

* Re: Questions about OSD recovery
  2012-02-10  8:26   ` Henry C Chang
@ 2012-02-10 22:34     ` Sage Weil
  0 siblings, 0 replies; 5+ messages in thread
From: Sage Weil @ 2012-02-10 22:34 UTC
  To: Henry C Chang; +Cc: Josh Durgin, ceph-devel

On Fri, 10 Feb 2012, Henry C Chang wrote:
> On February 9, 2012 at 11:14 AM, Josh Durgin <josh.durgin@dreamhost.com> wrote:
> > Detecting missing objects on startup is possible by looking at
> > the pg log and comparing it to the objects on disk, but this can
> > be a pretty expensive operation. The osd might also be out of
> 
> Yeah. It can be pretty expensive, but we only do it once on startup.
> Also, since the osd has not yet joined the cluster, it shouldn't
> affect the cluster performance.
> 
> > date, so its log might be useless (for example it could have
> > divergent history that was not acked). It can't know how many
> > current objects that should be there aren't until it goes through
> > peering (to get an up to date and authoritative log) and
> > recovery (to get missing data the logs say should be there). This
> > is why scrub skips pgs that aren't active+clean. More details of
> > peering can be found at http://ceph.newdream.net/docs/latest/dev/peering/.
> 
> Since peering only compares logs, I was thinking at least the osd should
> check the existence of the objects the log claims to have. Then, we
> would have the chance to recover the object before the pg goes active.
> 
> Also, I like the idea of storing crc/hash alongside the object as Tv said.
> With that, we can even prevent the client from reading the corrupt data
> by checking the crc/hash on each read. (Though, the read performance
> will surely degrade.)

It would be nice.  There would be a fair bit of additional complexity to 
do it, though.  We'd need crcs for smallish blocks, for example, to 
minimize reading adjacent data when we modify things.  It's also sort of 
frustrating that btrfs is doing exactly this one layer down.
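
To illustrate why block-sized crcs matter for writes, here is a small sketch
under assumed parameters: a 4096-byte block size and CRC32 via Python's
`zlib`, neither of which comes from this thread. With one crc over the whole
object, a small overwrite would force rereading the entire object to
recompute it; with one crc per block, only the blocks the write touches are
reread.

```python
import zlib

BLOCK = 4096  # hypothetical block size, chosen only for the example


def block_crcs(data, block=BLOCK):
    """Return one CRC32 per block of the object."""
    return [zlib.crc32(data[i:i + block]) for i in range(0, len(data), block)]


def write_at(data, crcs, offset, payload, block=BLOCK):
    """Overwrite part of a bytearray and refresh only the affected CRCs.

    Returns the indices of the blocks whose CRC was recomputed.
    """
    if not payload or offset < 0 or offset + len(payload) > len(data):
        raise ValueError("sketch only handles non-empty in-place overwrites")
    data[offset:offset + len(payload)] = payload
    first = offset // block
    last = (offset + len(payload) - 1) // block
    for i in range(first, last + 1):
        crcs[i] = zlib.crc32(data[i * block:(i + 1) * block])
    return list(range(first, last + 1))


obj = bytearray(3 * BLOCK)
crcs = block_crcs(obj)
touched = write_at(obj, crcs, 4090, b"x" * 10)  # straddles blocks 0 and 1
```

The 10-byte write above rereads two blocks rather than all three; the smaller
the block, the less adjacent data each modification has to read, at the cost
of storing more crcs.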

On a semi-related note, we should be using hashes for scrub to avoid 
shipping a lot of metadata over the wire for comparison.

sage
