recovery priority preemption

* recovery priority preemption
@ 2017-09-13 15:03 Sage Weil
  2017-09-13 18:02 ` Piotr Dałek
  0 siblings, 1 reply; 2+ messages in thread
From: Sage Weil @ 2017-09-13 15:03 UTC
  To: ceph-devel; +Cc: piotr.dalek

I recently observed a problem on the lab cluster while doing a lot of
rebalancing (filestore->bluestore conversion):

 - lots of pgs in backfill_wait
 - a few pgs that need pg log recovery, but these appear after backfills 
are already in progress, so they end up in backfill_wait too (confusing 
state name!)
 - ongoing write activity extends pg logs for those pgs, but they cannot
trim
 - pg logs reach 5x-10x the max
 - OSDs OOM

I think what is needed is for the recovery priority scheduling to allow 
preemption.  If we are currently working on recovery/backfill for PG X, 
but PG Y appears with a higher priority, we should suspend work on X and 
switch to Y.
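
A minimal sketch of the preemption idea, assuming a single recovery slot and
a plain integer priority per PG. The class name, method names, and priority
values here are hypothetical and are not taken from the OSD code; the sketch
only illustrates "suspend X, switch to Y" when Y outranks X.

```python
import heapq


class RecoveryScheduler:
    """Toy model: one recovery slot; the highest-priority PG always runs."""

    def __init__(self):
        self._waiting = []   # min-heap of (-priority, seq, pgid)
        self._seq = 0        # arrival order, breaks ties between equal priorities
        self.active = None   # heap entry of the PG currently being recovered

    def submit(self, pgid, priority):
        """Queue a PG; preempt the active PG if the new one outranks it."""
        entry = (-priority, self._seq, pgid)
        self._seq += 1
        if self.active is None:
            self.active = entry
        elif entry[0] < self.active[0]:
            # New PG has strictly higher priority: suspend the active PG.
            # It keeps its original arrival order, so it resumes before
            # later arrivals of the same priority.
            heapq.heappush(self._waiting, self.active)
            self.active = entry
        else:
            heapq.heappush(self._waiting, entry)

    def finish_active(self):
        """Mark the active PG done and start the next highest-priority one."""
        self.active = heapq.heappop(self._waiting) if self._waiting else None

    def active_pg(self):
        return None if self.active is None else self.active[2]


# Usage: two backfills are queued, then a log-recovery PG arrives with a
# higher (made-up) priority and takes over the slot.
sched = RecoveryScheduler()
sched.submit("1.a", priority=100)
sched.submit("1.b", priority=100)
sched.submit("1.c", priority=180)
print(sched.active_pg())
```

With equal priorities nothing is preempted, so a steady stream of
same-priority backfills cannot keep bumping each other.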

Piotr, I didn't look too closely at the forced recovery changes you folks
recently made, but I'm guessing that they were added to address this sort of
situation, right?  Would a general solution that preempts and always works 
on the highest priority PG resolve the problem you've observed?

Thanks-
sage

* Re: recovery priority preemption
  2017-09-13 15:03 recovery priority preemption Sage Weil
@ 2017-09-13 18:02 ` Piotr Dałek
  0 siblings, 0 replies; 2+ messages in thread
From: Piotr Dałek @ 2017-09-13 18:02 UTC
  To: ceph-devel

On Wed, Sep 13, 2017 at 03:03:22PM +0000, Sage Weil wrote:
> I recently observed a problem on the lab cluster while doing a lot of
> rebalancing (filestore->bluestore conversion):
> 
>  - lots of pgs in backfill_wait
>  - a few pgs that need pg log recovery, but these appear after backfills 
> are already in progress, so they end up in backfill_wait too (confusing 
> state name!)
>  - ongoing write activity extends pg logs for those pgs, but they cannot
> trim
>  - pg logs reach 5x-10x the max
>  - OSDs OOM

Why are we keeping so many pg logs in RAM anyway? We could dump them to storage
and, once things stabilize, reload them a few hundred entries at a time.

> I think what is needed is for the recovery priority scheduling to allow 
> preemption.  If we are currently working on recovery/backfill for PG X, 
> but PG Y appears with a higher priority, we should suspend work on X and 
> switch to Y.
> 
> Piotr, I didn't look too closely at forced recovery changes you folks 
> recently did, but I'm guessing that it was added to address this sort of 
> situation, right? 

Not exactly that, but close.
The problem we wanted to solve (or at least reduce) was the risk of SLA
reduction for at least some of our customers when cascading failures occur.
We host multiple Ceph clusters that are used by even more VMs, meaning
a lot of data to recover, which sometimes takes days to finish. So if
one rack fails for some reason, the risk of failures in the remaining two racks
is very real and very scary.
Because pgs are recovered in pretty much random order (or at least not in
any order that would recover entire images one by one), we wanted to add some
predictability to that, so that if cascading failures do occur, fewer customers
are impacted because at least some of them have already been recovered.

> Would a general solution that preempts and always works 
> on the highest priority PG resolve the problem you've observed?

Preempting (in the exact meaning of the word: "drop whatever you're doing and
focus on that instead") any ongoing recovery with force-recovery wasn't exactly
a priority, and we didn't want to make it even more complex than it had already
become. If an entire rack needs recovery, waiting for a single pg that is
currently in progress isn't a *big* issue in the case mentioned above. Still,
that would be very welcome! Well, except if it meant stopping recovery of a pg
that's 90% done and then restarting it from the beginning later.
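
To illustrate that last concern, here is a rough sketch of a recovery job that
can be suspended without losing its progress. Everything in it is hypothetical
(the class, the sorted object list, the should_yield callback); it assumes
objects are recovered in a fixed order so that a single index is enough to
remember where to resume.

```python
class ResumableRecovery:
    """Toy recovery job that checkpoints progress so preemption is cheap."""

    def __init__(self, pgid, objects):
        self.pgid = pgid
        self.objects = sorted(objects)  # fixed order makes the checkpoint valid
        self.next_index = 0             # everything before this is recovered

    @property
    def done(self):
        return self.next_index >= len(self.objects)

    def run(self, should_yield):
        """Recover objects until finished or should_yield() returns True.

        Returns the objects recovered during this call. The checkpoint
        survives between calls, so a later run() continues where this
        one stopped instead of starting over.
        """
        recovered = []
        while not self.done:
            if should_yield():
                break
            recovered.append(self.objects[self.next_index])
            self.next_index += 1
        return recovered


# Usage: the job is preempted after 9 of 10 objects, then resumed later.
job = ResumableRecovery("1.a", ["obj%02d" % i for i in range(10)])
signals = iter([False] * 9 + [True])
first = job.run(lambda: next(signals))   # preempted at 90%
second = job.run(lambda: False)          # resumes, recovers only the rest
print(len(first), second, job.done)
```

The point is only that preemption and "restart from the beginning" do not
have to go together, as long as the suspended pg remembers how far it got.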

-- 
Piotr Dałek
branch@predictor.org.pl
http://blog.predictor.org.pl
