* [PATCH] Fix for corrupted ceph cluster
@ 2014-02-11 6:48 Daniel Poelzleithner
2014-03-05 23:03 ` Daniel Poelzleithner
0 siblings, 1 reply; 3+ messages in thread
From: Daniel Poelzleithner @ 2014-02-11 6:48 UTC (permalink / raw)
To: ceph-devel
Hi,
I wrote a small patch that ignores trim_object requests when it cannot
find the object context for the request.
We have a node that permanently fails to start, and there is no way to
get all nodes back up.
As far as I understand, deleting something that does not exist should
not trigger an assert. It is weird, but it should not cause an abort.
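The idea of reporting a missing object to the caller instead of aborting can be sketched in isolation. This is only a toy under stated assumptions: `ObjectContext`, `Store`, and `trim_all` are hypothetical names invented for illustration, and `trim_object` here is a simplified stand-in, not the Ceph function touched by the patch.

```cpp
#include <cassert>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for an object context; not a Ceph type.
struct ObjectContext {
  std::string name;
  bool registered;
};

// Toy object store keyed by object name (hypothetical).
using Store = std::map<std::string, ObjectContext>;

// Look up the object to trim. A missing object is logged and reported to
// the caller as nullptr instead of aborting the whole process.
const ObjectContext *trim_object(const Store &store, const std::string &id) {
  auto it = store.find(id);
  if (it == store.end()) {
    std::cerr << __func__ << ": could not find " << id << '\n';
    return nullptr;  // an assert(0) here would take the process down
  }
  // Invariants about objects that do exist are still worth asserting.
  assert(it->second.registered);
  return &it->second;
}

// Caller: skip entries that cannot be found rather than asserting on them.
// Returns the number of entries that were actually trimmed.
int trim_all(const Store &store, const std::vector<std::string> &ids) {
  int trimmed = 0;
  for (const auto &id : ids) {
    if (trim_object(store, id) == nullptr) {
      continue;  // nothing to delete; carry on with the next entry
    }
    ++trimmed;
  }
  return trimmed;
}
```

With a store holding "a" and "b", calling `trim_all(store, {"a", "missing", "b"})` logs one message for the missing entry and trims the other two. Whether skipping is the right recovery in a real OSD is a separate question; the sketch only shows the control flow.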
This is regarding bug http://tracker.ceph.com/issues/6101
Any help is highly appreciated.
kind regards
Daniel
---
src/osd/ReplicatedPG.cc | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/src/osd/ReplicatedPG.cc b/src/osd/ReplicatedPG.cc
index 90d3e1d..d7e0b62 100644
--- a/src/osd/ReplicatedPG.cc
+++ b/src/osd/ReplicatedPG.cc
@@ -1491,7 +1491,7 @@ ReplicatedPG::RepGather *ReplicatedPG::trim_object(const hobject_t &coid)
int r = find_object_context(coid, &obc, false, NULL);
if (r == -ENOENT || coid.snap != obc->obs.oi.soid.snap) {
derr << __func__ << "could not find coid " << coid << dendl;
- assert(0);
+ return NULL;
}
assert(r == 0);
assert(obc->registered);
@@ -7866,7 +7866,10 @@ boost::statechart::result ReplicatedPG::TrimmingObjects::react(const SnapTrim&)
dout(10) << "TrimmingObjects react trimming " << pos << dendl;
RepGather *repop = pg->trim_object(pos);
- assert(repop);
+ if (!repop) {
+ derr << "TrimmingObjects failed " << pos << dendl;
+ return discard_event();
+ }
repop->queue_snap_trimmer = true;
eversion_t old_last_update = pg->pg_log.get_head();
--
1.8.5.3
* Re: [PATCH] Fix for corrupted ceph cluster
From: Daniel Poelzleithner @ 2014-03-05 23:03 UTC (permalink / raw)
To: ceph-devel
On 02/11/2014 07:48 AM, Daniel Poelzleithner wrote:
> I wrote a small patch that ignores trim_object requests when it cannot
> find the object context for the request.
> We have a node that permanently fails to start, and there is no way to
> get all nodes back up.
[...]
> This is regarding bug http://tracker.ceph.com/issues/6101
The patch has now been running for 2 weeks and the 4th node is working again.
I think this patch is safe to apply, but it does not fix the underlying problem.
Some state in Ceph causes the delete event to be triggered every few
seconds, and each time a log entry is generated.
Do you need more information to find the cause? This is definitely
some weird internal state and not a race condition.
kind regards
Daniel
* Re: [PATCH] Fix for corrupted ceph cluster
From: Sage Weil @ 2014-03-05 23:12 UTC (permalink / raw)
To: Daniel Poelzleithner; +Cc: ceph-devel
On Thu, 6 Mar 2014, Daniel Poelzleithner wrote:
> On 02/11/2014 07:48 AM, Daniel Poelzleithner wrote:
>
> > I wrote a small patch that ignores trim_object requests when it cannot
> > find the object context for the request.
> > We have a node that permanently fails to start, and there is no way to
> > get all nodes back up.
> [...]
> > This is regarding bug http://tracker.ceph.com/issues/6101
>
> The patch has now been running for 2 weeks and the 4th node is working again.
> I think this patch is safe to apply, but it does not fix the underlying problem.
> Some state in Ceph causes the delete event to be triggered every few
> seconds, and each time a log entry is generated.
>
> Do you need more information to find the cause? This is definitely
> some weird internal state and not a race condition.
Can you try, instead of the discard_event(), doing

    post_event(SnapTrim());
    return transit< WaitingOnReplicas >();

and see if that lets it move past the bad entry?
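To make the difference between the two options concrete, here is a toy state machine that uses neither Boost.Statechart nor any Ceph code. Every name in it (`ToyTrimmer`, `State`, `OnMissing`) is hypothetical, and it rests on one loud assumption that the thread does not confirm: that the waiting state advances to the next object when it handles the re-posted event. Under that assumption, discarding the event leaves the machine parked on the bad entry, while re-posting and transitioning lets it continue.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <set>
#include <utility>

// All names are hypothetical; this is not Ceph or Boost.Statechart code.
enum class State { Trimming, Waiting };
enum class OnMissing { Discard, RepostAndTransit };

class ToyTrimmer {
 public:
  ToyTrimmer(std::size_t count, std::set<std::size_t> missing, OnMissing policy)
      : count_(count), missing_(std::move(missing)), policy_(policy) {}

  // Deliver one external trim event, then drain any events that were
  // posted internally while handling it.
  void trigger() {
    react();
    while (!posted_.empty()) {
      posted_.pop();
      react();
    }
  }

  std::size_t pos() const { return pos_; }

 private:
  void react() {
    if (state_ == State::Waiting) {
      // ASSUMPTION of this sketch: the waiting state moves on to the
      // next object and returns to trimming.
      ++pos_;
      state_ = State::Trimming;
      return;
    }
    if (pos_ >= count_) return;  // nothing left to trim
    if (missing_.count(pos_) != 0) {
      if (policy_ == OnMissing::Discard) {
        return;  // event dropped: same state, same position next time
      }
      posted_.push(0);          // re-post the trim event ...
      state_ = State::Waiting;  // ... and hand it to the waiting state
      return;
    }
    // Normal case: the object was trimmed; wait before moving on.
    state_ = State::Waiting;
  }

  std::size_t count_;              // total number of objects
  std::set<std::size_t> missing_;  // positions whose context is missing
  OnMissing policy_;
  State state_ = State::Trimming;
  std::size_t pos_ = 0;            // position of the current object
  std::queue<int> posted_;         // internally posted events (value unused)
};
```

With three objects and position 1 missing, ten calls to `trigger()` under `OnMissing::Discard` leave `pos()` at 1, which resembles the repeating log entry described above. Under `OnMissing::RepostAndTransit` the same ten calls reach position 3.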
Thanks-
sage