Unfound objects and inconsistent reports

* Unfound objects and inconsistent reports
@ 2016-05-11 13:46 Ana Aviles
  2016-05-11 22:47 ` Victor Denisov
  0 siblings, 1 reply; 2+ messages in thread
From: Ana Aviles @ 2016-05-11 13:46 UTC
  To: ceph-users, ceph-devel

Hello everyone,

Last week we experienced a strange scenario involving unfound objects and
inconsistent reports from the ceph tools. We solved it with help from
Sage, and we wanted to share our experience in case it is of any use to
developers too.

After OSDs segfaulted randomly, our cluster ended up with one OSD down
and unfound objects, probably due to a combination of inopportune
crashes. We tried to start that OSD again, but it crashed when reading a
specific PG from the log (see http://pastebin.com/u9WFJnMR).

Sage pointed out that some metadata looked corrupted. The funny thing
is that this PG no longer belonged to that OSD. Once we had made sure of
that, we removed the PG from that OSD:

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-51/ --pgid 2.1fd \
    --op remove --journal-path /var/lib/ceph/osd/ceph-51/journal
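Since this procedure had to be repeated for several PGs, a small helper can make it less error-prone. The following is only a sketch, not what was actually run on this cluster: it builds (and only prints) the command lines, and it adds an `--op export` step before each removal so the PG can be re-imported if the removal turns out to be a mistake. The export step, the function name `objectstore_tool_cmds`, and the backup directory `/root/pg-backup` are assumptions added for illustration; the OSD must be stopped while ceph-objectstore-tool runs.

```python
from typing import List


def objectstore_tool_cmds(osd_id: int, pgid: str, backup_dir: str) -> List[List[str]]:
    """Return [export_cmd, remove_cmd] as argv lists for one PG on one stopped OSD."""
    data_path = f"/var/lib/ceph/osd/ceph-{osd_id}"
    base = [
        "ceph-objectstore-tool",
        "--data-path", data_path,
        "--journal-path", f"{data_path}/journal",
        "--pgid", pgid,
    ]
    # Assumption: keep a copy of the PG first. The thread only mentions the removal.
    export = base + ["--op", "export", "--file", f"{backup_dir}/osd{osd_id}-pg{pgid}.export"]
    remove = base + ["--op", "remove"]
    return [export, remove]


if __name__ == "__main__":
    # 2.1fd is the PG named in the message; add further PG IDs as the OSD reports them.
    for pgid in ["2.1fd"]:
        for cmd in objectstore_tool_cmds(51, pgid, "/root/pg-backup"):
            print(" ".join(cmd))
```

Printing instead of executing leaves the final decision with the operator, who should first confirm (as was done here) that the PG really no longer belongs to the OSD.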

We had to repeat this procedure for other PGs on that same OSD, as it
kept crashing on startup. Finally the OSD was up and in, but the
recovery process was stuck with 10 unfound objects. We marked them as
lost, deleting them, in each of their PGs with:

ceph pg 2.481 mark_unfound_lost delete

Right after that, recovery completed successfully, but the ceph reports
were a bit inconsistent. ceph -s was reporting 7 unfound objects, while
ceph health detail didn't report which PGs those unfound objects
belonged to. Sage pointed us to ceph pg dump, which indeed showed which
PGs owned those objects (the crashed OSD was a member of all of them).
However, when we listed the missing objects on those PGs, they reported none:

{
    "offset": {
        "oid": "",
        "key": "",
        "snapid": 0,
        "hash": 0,
        "max": 0,
        "pool": -9223372036854775808,
        "namespace": ""
    },
    "num_missing": 0,
    "num_unfound": 0,
    "objects": [],
    "more": 0
}
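The mismatch described above (per-PG statistics claiming unfound objects while the PG's own listing is empty) can be spotted mechanically. The sketch below is an illustration under assumptions, not a tool used in this thread: it expects the parsed output of `ceph pg dump --format json` and, per PG, a parsed listing shaped like the JSON above (presumably from `ceph pg <pgid> list_missing`). The field names `pg_stats`, `stat_sum` and `num_objects_unfound` are assumed from the pg dump JSON layout and may differ between releases; the PG IDs in the example are made up.

```python
import json
from typing import Dict


def unfound_in_pg_stats(pg_dump: dict) -> Dict[str, int]:
    """Map pgid -> unfound count according to the per-PG statistics."""
    # Assumption: entries sit under "pg_stats"; some releases nest them in "pg_map".
    stats = pg_dump.get("pg_stats")
    if stats is None:
        stats = pg_dump.get("pg_map", {}).get("pg_stats", [])
    out = {}
    for pg in stats:
        count = pg.get("stat_sum", {}).get("num_objects_unfound", 0)
        if count > 0:
            out[pg["pgid"]] = count
    return out


def stale_unfound_pgs(pg_dump: dict, listings: Dict[str, dict]) -> Dict[str, int]:
    """PGs whose stats claim unfound objects while their own listing reports none."""
    claimed = unfound_in_pg_stats(pg_dump)
    return {
        pgid: count
        for pgid, count in claimed.items()
        # Only judge PGs for which a listing was actually collected.
        if pgid in listings and listings[pgid].get("num_unfound", 0) == 0
    }


if __name__ == "__main__":
    # Illustrative data only; not taken from the cluster in this thread.
    dump = {"pg_stats": [
        {"pgid": "1.0", "stat_sum": {"num_objects_unfound": 7}},
        {"pgid": "1.1", "stat_sum": {"num_objects_unfound": 0}},
    ]}
    listing = json.loads('{"num_missing": 0, "num_unfound": 0, "objects": [], "more": 0}')
    print(stale_unfound_pgs(dump, {"1.0": listing}))
```

A PG flagged by `stale_unfound_pgs` matches the situation in this message, where the counters only cleared after the OSDs were restarted.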

We then decided to restart the OSDs that were members of those PGs, and
the unfound objects disappeared from the ceph -s report.

It may be important to mention that we had four nodes running the OSDs.
Two nodes ran v9.2.0 and another ran v9.2.1. Our OSDs were apparently
crashing because of an issue in v9.2.1. We shared this on the
ceph-devel list, whose members were very helpful in solving it
(http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/31123).
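Because mixed versions played a role here, it can help to see at a glance which OSDs run which release. The following is a minimal sketch under assumptions: it takes a mapping from OSD name to the version string each daemon reports (for example, collected by hand from `ceph tell osd.N version`) and groups the OSDs by release. The function name, the regular expression, and the sample strings, including the placeholder hashes, are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Matches the release number in strings such as "ceph version 9.2.1 (<hash>)".
VERSION_RE = re.compile(r"ceph version (\d+\.\d+\.\d+)")


def group_by_version(reported: Dict[str, str]) -> Dict[str, List[str]]:
    """Group OSD names by the release number found in their version string."""
    groups = defaultdict(list)
    for osd, text in reported.items():
        match = VERSION_RE.search(text)
        groups[match.group(1) if match else "unknown"].append(osd)
    return {version: sorted(osds) for version, osds in groups.items()}


if __name__ == "__main__":
    # Hypothetical sample; the hashes are placeholders, not real build IDs.
    sample = {
        "osd.1": "ceph version 9.2.0 (aaaa)",
        "osd.51": "ceph version 9.2.1 (bbbb)",
    }
    for version, osds in group_by_version(sample).items():
        print(version, osds)
```

More than one key in the result means the cluster is running mixed releases, as was the case in this report.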

Greetings,

-- 
Ana Avilés
Greenhost - sustainable hosting & digital security
E: ana@greenhost.nl
T: +31 20 4890444
W: https://greenhost.nl
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: Unfound objects and inconsistent reports
  2016-05-11 13:46 Unfound objects and inconsistent reports Ana Aviles
@ 2016-05-11 22:47 ` Victor Denisov
  0 siblings, 0 replies; 2+ messages in thread
From: Victor Denisov @ 2016-05-11 22:47 UTC
  To: Ana Aviles; +Cc: ceph-users, ceph-devel

Thank you, Ana, for sharing this.
Please don't hesitate to share if you happen to have more cases and solutions.

