From: "Piotr Dałek" <branch@predictor.org.pl>
To: ceph-devel@vger.kernel.org
Subject: Re: recovery priority preemption
Date: Wed, 13 Sep 2017 20:02:01 +0200
Message-ID: <20170913180201.GB9903@mailware>
In-Reply-To: <alpine.DEB.2.11.1709131458410.24068@piezo.novalocal>
On Wed, Sep 13, 2017 at 03:03:22PM +0000, Sage Weil wrote:
> I recently observed a problem on the lab cluster while doing a lot of
> rebalancing (filestore->bluestore conversion):
>
> - lots of pgs in backfill_wait
> - a few pgs that need pg log recovery, but these appear after backfills
> are already in progress, so they end up in backfill_wait too (confusing
> state name!)
> - ongoing write activity extends pg logs for those pgs, but they cannot
> trim
> - pg logs reach 5x-10x the max
> - OSDs OOM

Why are we keeping so many pg logs in RAM anyway? We could dump them to storage
and, once things stabilize, reload them a few hundred entries at a time.
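
To make the spill-and-reload idea concrete, here is a minimal sketch. It is
not Ceph code: the SpillingLog class, its one-JSON-entry-per-line file format
and the default sizes are assumptions made up for illustration. It keeps only
the newest entries in memory, appends older ones to a file, and reads them
back in batches of a few hundred.

```python
import json
from collections import deque


class SpillingLog:
    """Hypothetical log that keeps only the newest entries in RAM."""

    def __init__(self, path, max_in_memory=1000):
        self.path = path                  # spill file, one JSON entry per line
        self.max_in_memory = max_in_memory
        self.recent = deque()             # newest entries, oldest first
        self.spilled = 0                  # entries written to the spill file

    def append(self, entry):
        self.recent.append(entry)
        if len(self.recent) > self.max_in_memory:
            # Move the oldest entries out of RAM and onto storage.
            with open(self.path, "a", encoding="utf-8") as f:
                while len(self.recent) > self.max_in_memory:
                    f.write(json.dumps(self.recent.popleft()) + "\n")
                    self.spilled += 1

    def reload(self, batch_size=300):
        """Yield spilled entries, oldest first, in lists of <= batch_size."""
        if self.spilled == 0:
            return
        batch = []
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                batch.append(json.loads(line))
                if len(batch) == batch_size:
                    yield batch
                    batch = []
        if batch:
            yield batch
```

With max_in_memory=100 and 1000 appended entries, 900 end up in the file and
reload(300) hands them back in three batches, so the reader never holds more
than one batch plus the in-memory window at a time.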
> I think what is needed is for the recovery priority scheduling to allow
> preemption. If we are currently working on recovery/backfill for PG X,
> but PG Y appears with a higher priority, we should suspend work on X and
> switch to Y.
>
> Piotr, I didn't look too closely at forced recovery changes you folks
> recently did, but I'm guessing that it was added to address this sort of
> situation, right?

Not exactly that, but close.

The problem we wanted to solve (or at least reduce) was the risk of SLA
reduction for at least some of our customers when cascading failures occur.
We host multiple Ceph clusters that are used by even more VMs, which means
a lot of data to recover, and recovery sometimes takes days to finish. So if
one rack fails for some reason, the risk of failures in the remaining two racks
is very real and very scary.

Because pgs are recovered in pretty much random order (or at least not in
any order that would recover entire images one by one), we wanted to add some
predictability to that, so that if cascading failures do occur, fewer customers
are impacted because at least some of them managed to recover.

> Would a general solution that preempts and always works
> on the highest priority PG resolve the problem you've observed?

Preempting (in the exact meaning of the word: "drop whatever you're doing and
focus on that instead") any ongoing recovery with force-recovery wasn't exactly
a priority, and we didn't want to make it even more complex than it already was.
If an entire rack needs recovery, waiting for a single pg that is currently
in progress isn't a *big* issue in the case mentioned above. Still, that would
be very welcome! Well, except if it would mean stopping recovery of a pg that's
90% done and then restarting it from the beginning later.
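
The preemption rule discussed above, together with the "don't throw away a pg
that's 90% done" caveat, could be sketched roughly as follows. This is not how
the OSD schedules recovery: PGWork, RecoveryScheduler and the 0.9 threshold
are hypothetical names and values chosen for illustration, and the sketch
assumes a suspended pg can later resume from where it stopped.

```python
from dataclasses import dataclass


@dataclass
class PGWork:
    pgid: str
    priority: int
    progress: float = 0.0  # fraction of recovery done, 0.0 to 1.0


class RecoveryScheduler:
    """Hypothetical single-slot scheduler with priority preemption."""

    def __init__(self, keep_if_progress_at_least=0.9):
        self.threshold = keep_if_progress_at_least
        self.current = None   # the pg being recovered right now
        self.waiting = []     # suspended or not-yet-started pgs

    def submit(self, work):
        """Queue work; preempt the running pg only if that is worthwhile."""
        cur = self.current
        if cur is None:
            self.current = work
        elif work.priority > cur.priority and cur.progress < self.threshold:
            # Suspend rather than restart: cur keeps its progress value.
            self.waiting.append(cur)
            self.current = work
        else:
            # Lower or equal priority, or the running pg is nearly done.
            self.waiting.append(work)

    def finish_current(self):
        """Mark the running pg done and switch to the best waiting one."""
        done = self.current
        self.current = None
        if self.waiting:
            best = max(self.waiting, key=lambda w: w.priority)
            self.waiting.remove(best)
            self.current = best
        return done
```

A pg at 30% is suspended when a higher-priority pg arrives and resumes at 30%
afterwards, while a pg at 95% is allowed to finish first even if something
more urgent shows up.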
--
Piotr Dałek
branch@predictor.org.pl
http://blog.predictor.org.pl