Re: 7915 is not resolved - Dmitry Borodaenko

From: Dmitry Borodaenko <dborodaenko@mirantis.com>
To: Boris Lukashev <blukashev@sempervictus.com>
Cc: Sage Weil <sage@newdream.net>, ceph-devel@vger.kernel.org
Subject: Re: 7915 is not resolved
Date: Mon, 11 Jan 2016 11:06:19 -0800
Message-ID: <20160111190619.GA22809@localhost>
In-Reply-To: <CAFUG7CfS4uNp=Jxxc5bgf1s_X72Ezb8w5wg2zSdr=UnWCpYCQw@mail.gmail.com>

Fuel 8.0 will support Hammer, you can grab the packages from:
http://mirror.fuel-infra.org/mos-repos/ubuntu/8.0/pool/main/c/ceph/

or, if you build your own packages with the extra patches, grab the
Debian build scripts from:
https://review.fuel-infra.org/#/c/13879/

That will ensure that your packages work with Fuel.

-- 
Dmitry Borodaenko

On Mon, Jan 11, 2016 at 01:15:50PM -0500, Boris Lukashev wrote:
> Thank you, pulling those into my branch currently and kicking off a build.
> In terms of upgrading to Hammer - the documentation looks
> straightforward enough, but given that this is a Fuel-based OpenStack
> deployment, I'm wondering if you've heard of any potential
> compatibility issues from doing so.
> 
> -Boris
> 
> On Mon, Jan 11, 2016 at 12:25 PM, Sage Weil <sage@newdream.net> wrote:
> > On Mon, 11 Jan 2016, Boris Lukashev wrote:
> >> I ran into an incredibly unpleasant loss of a 5 node, 10 OSD ceph
> >> cluster backing our openstack glance and cinder services by just
> >> asking RBD to snapshot one of the volumes.
> >> The conditions under which this occurred are as follows: a bash script
> >> asking cinder to snapshot RBD volumes in rapid succession (2 of them)
> >> either caused a nova host (and ceph OSD holder) to crash, or
> >> simply coincided with the crash. On reboot of the host, RBD
> >> started throwing errors. Once all OSDs were restarted, they all failed,
> >> crashing with the following:
> >>
> >>     -1> 2016-01-11 16:37:35.401002 7f16f8449700  5 osd.6 pg_epoch:
> >> 84269 pg[2.2c( empty local-les=84219 n=0 ec=1 les/c 84219/84219
> >> 84218/84218/84193) [6,8] r=0 lpr=84261 crt=0'0 mlcod 0'0 peering]
> >> enter Started/Primary/Peering/GetInfo
> >>      0> 2016-01-11 16:37:35.401057 7f16f7c48700 -1
> >> ./include/interval_set.h: In function 'void interval_set<T>::erase(T,
> >> T) [with T = snapid_t]' thread 7f16f7c48700 time 2016-01-11
> >> 16:37:35.398335
> >> ./include/interval_set.h: 386: FAILED assert(_size >= 0)
> >>
> >>  ceph version 0.80.11-19-g130b0f7 (130b0f748332851eb2e3789e2b2fa4d3d08f3006)
> >>  1: (interval_set<snapid_t>::subtract(interval_set<snapid_t>
> >> const&)+0xb0) [0x79d140]
> >>  2: (PGPool::update(std::tr1::shared_ptr<OSDMap const>)+0x656) [0x772856]
> >>  3: (PG::handle_advance_map(std::tr1::shared_ptr<OSDMap const>,
> >> std::tr1::shared_ptr<OSDMap const>, std::vector<int,
> >> std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&,
> >> int, PG::RecoveryCtx*)+0x282) [0x772c22]
> >>  4: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&,
> >> PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>,
> >> std::less<boost::intrusive_ptr<PG> >,
> >> std::allocator<boost::intrusive_ptr<PG> > >*)+0x292) [0x6548e2]
> >>  5: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> >
> >> const&, ThreadPool::TPHandle&)+0x20c) [0x6553cc]
> >>  6: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> >
> >> const&, ThreadPool::TPHandle&)+0x18) [0x69c858]
> >>  7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb01) [0xa5ac71]
> >>  8: (ThreadPool::WorkThread::entry()+0x10) [0xa5bb60]
> >>  9: (()+0x8182) [0x7f170def5182]
> >>  10: (clone()+0x6d) [0x7f170c51447d]
> >>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> >> needed to interpret this.
> >>
> >> To me, this looks like the snapshot which was being created when the
> >> nova host died is causing the assert to fail since the snap was never
> >> completed and is broken.
> >>
> >> http://tracker.ceph.com/issues/11493, which appears very similar, is
> >> marked as resolved, but with current firefly (deployed via Fuel and
> >> updated in place with 0.80.11 debs) this issue hit us on Saturday.
> >
> > You can try cherry-picking the two commits in wip-11493-b which make the
> > OSD semi-gracefully tolerate this situation.  This is a bug that's been
> > fixed in hammer, but since the inconsistency has already been introduced
> > simply upgrading probably won't resolve it.  Nevertheless, after working
> > around this, I'd encourage you to move to hammer, as firefly is at end of
> > life.
> >
> > sage
> >
> >>
> >> What's the way around this? I imagine commenting out that assert may
> >> cause more damage, but we need to get our OSDs and the RBD data in
> >> them back online. Is there a permanent fix in any branch we can
> >> backport? We built this cluster using Fuel so this affects every
> >> Mirantis user if not every ceph user out there, and the vector into
> >> this catastrophic bug is normal daily operations (snapshot
> >> apparently)....
> >>
> >> Thank you all for looking over this, advice would be greatly appreciated.
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>
> >>

Thread overview: 15+ messages
2016-01-11 16:52 7915 is not resolved Boris Lukashev
2016-01-11 17:25 ` Sage Weil
2016-01-11 18:15   ` Boris Lukashev
2016-01-11 19:06     ` Dmitry Borodaenko [this message]
2016-01-11 20:13       ` Boris Lukashev
2016-01-12  2:00         ` Boris Lukashev
2016-01-12  6:53           ` Mykola Golub
2016-01-12 13:24             ` Sage Weil
2016-01-12 17:43               ` Boris Lukashev
2016-01-12 17:47                 ` Sage Weil
2016-01-12 18:03                   ` Boris Lukashev
2016-01-12 20:21                     ` Boris Lukashev
2016-01-13  1:06                       ` Boris Lukashev
2016-01-13 11:35                         ` Mykola Golub
2016-01-13 16:49               ` Alexey Sheplyakov
