* [PATCH] Fix for corrupted ceph cluster
From: Daniel Poelzleithner @ 2014-02-11  6:48 UTC
  To: ceph-devel

Hi,

I wrote a small patch that makes the OSD ignore a trim_object request when it
cannot find the object context for that request.
We have a node that permanently fails to start, and there is no way to
get all nodes back up.

As far as I understand, deleting something that does not exist should
not trigger an assert. It is weird, but it should not cause an abort.
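
The idea can be sketched outside of Ceph as follows. This is only a minimal illustration of "report a missing object instead of aborting"; `ObjectContext`, `Store`, `find_for_trim` and this `trim_object` are hypothetical stand-ins and not the real Ceph types or functions.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-in for an object context; NOT the Ceph ObjectContext.
struct ObjectContext {
  int snap;
};

// Hypothetical in-memory store keyed by object id.
using Store = std::map<std::string, ObjectContext>;

// Look up the object to trim. Returns std::nullopt when the object is
// missing or its snap does not match, instead of asserting.
std::optional<ObjectContext> find_for_trim(const Store& store,
                                           const std::string& coid,
                                           int snap) {
  auto it = store.find(coid);
  if (it == store.end() || it->second.snap != snap) {
    return std::nullopt;
  }
  return it->second;
}

// Trim one object. Returns true if something was removed and false if
// there was nothing to trim; the caller decides how to react to false.
bool trim_object(Store& store, const std::string& coid, int snap) {
  if (!find_for_trim(store, coid, snap)) {
    return false;
  }
  store.erase(coid);
  return true;
}
```

Trimming the same object twice then returns true the first time and false the second time, and the process keeps running in both cases.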

This is regarding bug http://tracker.ceph.com/issues/6101

Any help is highly appreciated.

kind regards
 Daniel


---
 src/osd/ReplicatedPG.cc | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/src/osd/ReplicatedPG.cc b/src/osd/ReplicatedPG.cc
index 90d3e1d..d7e0b62 100644
--- a/src/osd/ReplicatedPG.cc
+++ b/src/osd/ReplicatedPG.cc
@@ -1491,7 +1491,7 @@ ReplicatedPG::RepGather *ReplicatedPG::trim_object(const hobject_t &coid)
   int r = find_object_context(coid, &obc, false, NULL);
   if (r == -ENOENT || coid.snap != obc->obs.oi.soid.snap) {
     derr << __func__ << "could not find coid " << coid << dendl;
-    assert(0);
+    return NULL;
   }
   assert(r == 0);
   assert(obc->registered);
@@ -7866,7 +7866,10 @@ boost::statechart::result ReplicatedPG::TrimmingObjects::react(const SnapTrim&)

   dout(10) << "TrimmingObjects react trimming " << pos << dendl;
   RepGather *repop = pg->trim_object(pos);
-  assert(repop);
+  if (!repop) {
+    derr << "TrimmingObjects failed " << pos << dendl;
+    return discard_event();
+  }

   repop->queue_snap_trimmer = true;
   eversion_t old_last_update = pg->pg_log.get_head();
-- 
1.8.5.3


* Re: [PATCH] Fix for corrupted ceph cluster
From: Daniel Poelzleithner @ 2014-03-05 23:03 UTC
  To: ceph-devel

On 02/11/2014 07:48 AM, Daniel Poelzleithner wrote:

> I wrote a small patch that makes the OSD ignore a trim_object request when it
> cannot find the object context for that request.
> We have a node that permanently fails to start, and there is no way to
> get all nodes back up.
[...]
> This is regarding bug http://tracker.ceph.com/issues/6101

The patch has now been running for 2 weeks and the 4th node is working again.
I think this patch is safe to apply, but it does not fix the underlying problem.
Some state in Ceph causes the delete event to be triggered every few
seconds, and each time a log entry is generated.

Do you need more information to find the cause? This is definitely
some weird internal state and not a race condition.


kind regards
 Daniel


* Re: [PATCH] Fix for corrupted ceph cluster
From: Sage Weil @ 2014-03-05 23:12 UTC
  To: Daniel Poelzleithner; +Cc: ceph-devel

On Thu, 6 Mar 2014, Daniel Poelzleithner wrote:
> On 02/11/2014 07:48 AM, Daniel Poelzleithner wrote:
> 
> > I wrote a small patch that makes the OSD ignore a trim_object request when it
> > cannot find the object context for that request.
> > We have a node that permanently fails to start, and there is no way to
> > get all nodes back up.
> [...]
> > This is regarding bug http://tracker.ceph.com/issues/6101
> 
> The patch has now been running for 2 weeks and the 4th node is working again.
> I think this patch is safe to apply, but it does not fix the underlying problem.
> Some state in Ceph causes the delete event to be triggered every few
> seconds, and each time a log entry is generated.
> 
> Do you need more information to find the cause? This is definitely
> some weird internal state and not a race condition.

Can you try, instead of the discard_event, to do

    post_event(SnapTrim());
    return transit< WaitingOnReplicas >();

and see if that lets it move past the bad entry?
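
To make the difference between the two options concrete, here is a toy model. It is not the Ceph state machine and does not use boost::statechart; `ToyTrimmer`, `OnFailure` and the rule that the waiting state moves on to the next entry are assumptions made only for illustration. With the discard policy every later event hits the same bad entry again and logs again, which would match the repeated log entries described above. With the post-and-transit policy the event is handled in the waiting state, so the trimmer can get past the entry.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

enum class State { Trimming, WaitingOnReplicas, Done };
enum class OnFailure { Discard, PostAndTransit };

// Toy model only; names and behavior are hypothetical, not Ceph's.
struct ToyTrimmer {
  std::vector<bool> trimmable;  // false models an entry with no object context
  OnFailure policy;
  State state = State::Trimming;
  std::size_t pos = 0;
  int failures_logged = 0;

  ToyTrimmer(std::vector<bool> entries, OnFailure p)
      : trimmable(std::move(entries)), policy(p) {}

  // Handle one SnapTrim-like event.
  void on_snap_trim() {
    if (state == State::Done) {
      return;
    }
    if (state == State::WaitingOnReplicas) {
      // Assumption: the waiting state finishes the current entry and moves on.
      ++pos;
      state = pos < trimmable.size() ? State::Trimming : State::Done;
      return;
    }
    if (pos >= trimmable.size()) {
      state = State::Done;
      return;
    }
    if (trimmable[pos]) {
      ++pos;
      if (pos == trimmable.size()) {
        state = State::Done;
      }
      return;
    }
    ++failures_logged;  // stands in for the derr log line
    if (policy == OnFailure::Discard) {
      return;  // event dropped; state and pos are unchanged
    }
    // Models post_event(SnapTrim()) followed by a transition: the posted
    // event is handled right away in the new state.
    state = State::WaitingOnReplicas;
    on_snap_trim();
  }
};
```

Feeding five events into a trimmer whose second entry is bad leaves the discard variant stuck on that entry with four logged failures, while the post-and-transit variant logs one failure and finishes.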

Thanks-
sage

