PG recovery reservation state chart


* PG recovery reservation state chart
@ 2012-10-02 19:48 Mike Ryan
  2012-10-02 20:02 ` Gregory Farnum
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Mike Ryan @ 2012-10-02 19:48 UTC (permalink / raw)
  To: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 1161 bytes --]

Tried sending this earlier but it seems the list doesn't like PNGs.
dotty or dot -Tpng will make short work of the .dot file I've attached.

These are the changes to the Active state of the PG state chart in order
to support recovery reservations. This is Important Stuff, so please
criticize mercilessly.

Here's a prose version:

When the PG activates, it determines whether it needs to do recovery. If
it does, it grabs its local reservation, then grabs a remote reservation
from each replica in order of OSD ID (to prevent deadlock). Once all
remotes are reserved, it starts recovering.

After recovery, all remote reservations are dropped. If no backfill is
necessary, the local reservation is dropped and we jump to Clean.

If we need to backfill, we request a remote backfill reservation from
the replica. If this reservation is rejected (due to the OSD being too
full) we drop our local reservation and wait for a while in
NotBackfilling. We then grab our local reservation and try again on the
remote reservation. Once we have the remote reservation, we backfill.
After Backfilling we drop the local and remote backfill reservation and
jump to Clean.

[-- Attachment #2: pg_recovery_reservation.dot --]
[-- Type: text/plain, Size: 876 bytes --]

digraph G {
    Activating -> Clean [label="AllReplicasClean"];
    Activating -> LocalReserving [label="DoRecovery"];
    LocalReserving -> WaitRemoteRecoveryReserved [label="LocalRecoveryReserved"];
    WaitRemoteRecoveryReserved -> WaitRemoteRecoveryReserved [label="RemoteReserved"];
    WaitRemoteRecoveryReserved -> Recovering [label="AllRemotesReserved"];
    Recovering -> Clean [label="AllReplicasClean"];
    Recovering -> WaitRemoteBackfillReserved [label="RequestBackfill"];
    WaitRemoteBackfillReserved -> NotBackfilling [label="RemoteReservationRejected"];
    NotBackfilling -> WaitLocalBackfillReservation [label="RequestBackfill"];
    WaitLocalBackfillReservation -> WaitRemoteBackfillReserved [label="LocalBackfillReserved"];
    WaitRemoteBackfillReserved -> Backfilling [label="RemoteBackfillReserved"];
    Backfilling -> Clean [label="Backfilled"];
}
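
For readers without Graphviz at hand, the edges of the attached graph can also be exercised as a lookup table. This is only a sketch in Python (Ceph itself is C++); the state and event names are copied from the .dot file, while `TRANSITIONS` and `run` are hypothetical helper names invented here.

```python
# (state, event) -> next state, one entry per edge in the attached graph.
# Illustrative only; this is not Ceph's actual state machine code.
TRANSITIONS = {
    ("Activating", "AllReplicasClean"): "Clean",
    ("Activating", "DoRecovery"): "LocalReserving",
    ("LocalReserving", "LocalRecoveryReserved"): "WaitRemoteRecoveryReserved",
    ("WaitRemoteRecoveryReserved", "RemoteReserved"): "WaitRemoteRecoveryReserved",
    ("WaitRemoteRecoveryReserved", "AllRemotesReserved"): "Recovering",
    ("Recovering", "AllReplicasClean"): "Clean",
    ("Recovering", "RequestBackfill"): "WaitRemoteBackfillReserved",
    ("WaitRemoteBackfillReserved", "RemoteReservationRejected"): "NotBackfilling",
    ("NotBackfilling", "RequestBackfill"): "WaitLocalBackfillReservation",
    ("WaitLocalBackfillReservation", "LocalBackfillReserved"): "WaitRemoteBackfillReserved",
    ("WaitRemoteBackfillReserved", "RemoteBackfillReserved"): "Backfilling",
    ("Backfilling", "Backfilled"): "Clean",
}


def run(events, state="Activating"):
    """Feed events through the table; raise on an edge the graph lacks."""
    for event in events:
        key = (state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"no edge from {state} on {event}")
        state = TRANSITIONS[key]
    return state
```

Feeding it the full backfill path, including one rejected remote reservation, ends in Clean; an event with no matching edge raises, which makes gaps in the chart easy to spot.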


* Re: PG recovery reservation state chart
  2012-10-02 19:48 PG recovery reservation state chart Mike Ryan
@ 2012-10-02 20:02 ` Gregory Farnum
  2012-10-02 20:21   ` Mike Ryan
  2012-10-02 20:31 ` Josh Durgin
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: Gregory Farnum @ 2012-10-02 20:02 UTC (permalink / raw)
  To: Mike Ryan; +Cc: ceph-devel

On Tue, Oct 2, 2012 at 12:48 PM, Mike Ryan <mike.ryan@inktank.com> wrote:
> Tried sending this earlier but it seems the list doesn't like PNGs.
> dotty or dot -Tpng will make short work of the .dot file I've attached.
>
>
> These are the changes to the Active state of the PG state chart in order
> to support recovery reservations. This is Important Stuff, so please
> criticize mercilessly.
>
> Here's a prose version:
>
> When the PG activates, it determines whether it needs to do recovery. If
> it does, it grabs its local reservation, then grabs a remote reservation
> from each replica in order of OSD ID (to prevent deadlock). Once all
> remotes are reserved, it starts recovering.

Do remote and local reservations come out of different pools?


> After recovery, all remote reservations are dropped. If no backfill is
> necessary, the local reservation is dropped and we jump to Clean.
>
> If we need to backfill, we request a remote backfill reservation from
> the replica. If this reservation is rejected (due to the OSD being too
> full) we drop our local reservation and wait for a while in
> NotBackfilling. We then grab our local reservation and try again on the
> remote reservation. Once we have the remote reservation, we backfill.
> After Backfilling we drop the local and remote backfill reservation and
> jump to Clean.


I think I know what you're talking about here, but can you provide a
bit more background on the reservations and stuff?
-Greg


* Re: PG recovery reservation state chart
  2012-10-02 20:02 ` Gregory Farnum
@ 2012-10-02 20:21   ` Mike Ryan
  0 siblings, 0 replies; 11+ messages in thread
From: Mike Ryan @ 2012-10-02 20:21 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel

On Tue, Oct 02, 2012 at 01:02:06PM -0700, Gregory Farnum wrote:
> Do remote and local reservations come out of different pools?

Yes. This simplifies deadlock prevention.

> I think I know what you're talking about here, but can you provide a
> bit more background on the reservations and stuff?

This is an attempt to limit the number of recovery operations occurring
at the same time.

Each OSD has a finite number of reservation slots. Reservation requests
are made by PGs to the OSD. A reservation request succeeds immediately if
there are slots available. If none are available, it will succeed after
a reservation is released (freeing a slot).
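
The slot behaviour just described can be sketched roughly as below. This is a Python illustration, not the OSD's real code: the class name `Reserver`, its methods, and the callback style are made up, and serving waiters in FIFO order is an assumption, since nothing above says which waiting request is granted first.

```python
from collections import deque


class Reserver:
    """Fixed number of slots; extra requests wait (FIFO is an assumption)."""

    def __init__(self, max_slots):
        self.max_slots = max_slots
        self.held = set()        # PGs currently holding a slot
        self.waiting = deque()   # (pg, on_granted) pairs waiting for a slot

    def request(self, pg, on_granted):
        if len(self.held) < self.max_slots:
            self.held.add(pg)
            on_granted(pg)       # a slot is free: succeed immediately
        else:
            self.waiting.append((pg, on_granted))

    def release(self, pg):
        self.held.discard(pg)
        # Hand the freed slot to the oldest waiter, if there is one.
        if self.waiting and len(self.held) < self.max_slots:
            next_pg, on_granted = self.waiting.popleft()
            self.held.add(next_pg)
            on_granted(next_pg)
```

With one slot, a second request is granted only when the first holder calls `release`.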

Before a recovery op may proceed, the primary collects reservations from
itself and all its replicas. If one of the OSDs is busy, the reservation
process will wait until a reservation is available before continuing.
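
A rough Python sketch of that collection step, combined with the OSD ID ordering from the original message: the local reservation comes first, then the replicas in ascending OSD ID. The function names and the `try_reserve` callback are hypothetical, and a real implementation would be event driven rather than a loop.

```python
def reservation_plan(primary_osd, replica_osds):
    """Order in which the primary asks: itself, then replicas by OSD ID."""
    plan = [("local", primary_osd)]
    plan += [("remote", osd) for osd in sorted(replica_osds)]
    return plan


def collect(plan, try_reserve):
    """Walk the plan in order and stop at the first OSD that is busy.

    try_reserve(kind, osd) -> bool is a hypothetical callback. Returns the
    reservations held so far and whether recovery may start.
    """
    held = []
    for kind, osd in plan:
        if not try_reserve(kind, osd):
            return held, False   # wait here; later OSDs are not asked yet
        held.append((kind, osd))
    return held, True
```

Because every primary asks in the same global order, two PGs contending for the same pair of OSDs cannot each end up holding the slot the other is waiting for.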


* Re: PG recovery reservation state chart
  2012-10-02 19:48 PG recovery reservation state chart Mike Ryan
  2012-10-02 20:02 ` Gregory Farnum
@ 2012-10-02 20:31 ` Josh Durgin
  2012-10-02 20:40   ` Mike Ryan
  2012-10-02 20:35 ` Tommi Virtanen
  2012-10-02 21:36 ` Sage Weil
  3 siblings, 1 reply; 11+ messages in thread
From: Josh Durgin @ 2012-10-02 20:31 UTC (permalink / raw)
  To: Mike Ryan; +Cc: ceph-devel

On 10/02/2012 12:48 PM, Mike Ryan wrote:
> Tried sending this earlier but it seems the list doesn't like PNGs.
> dotty or dot -Tpng will make short work of the .dot file I've attached.
>
>
> These are the changes to the Active state of the PG state chart in order
> to support recovery reservations. This is Important Stuff, so please
> criticize mercilessly.
>
> Here's a prose version:
>
> When the PG activates, it determines whether it needs to do recovery. If
> it does, it grabs its local reservation, then grabs a remote reservation
> from each replica in order of OSD ID (to prevent deadlock). Once all
> remotes are reserved, it starts recovering.

Is the local reservation taken in OSD ID order with the remote
reservations as well? What's the difference between local and remote
reservations? Are there different limits on remote and local
reservations?

> After recovery, all remote reservations are dropped. If no backfill is
> necessary, the local reservation is dropped and we jump to Clean.
>
> If we need to backfill, we request a remote backfill reservation from
> the replica. If this reservation is rejected (due to the OSD being too
> full) we drop our local reservation and wait for a while in
> NotBackfilling. We then grab our local reservation and try again on the
> remote reservation. Once we have the remote reservation, we backfill.
> After Backfilling we drop the local and remote backfill reservation and
> jump to Clean.

If there's more than one possible replica to backfill from could we try
to reserve others if the first is busy instead of waiting?

Why would a remote backfill reservation fail if the OSD is full (disk 
space)? Backfill doesn't write to the replica, right? Or by full, do
you mean out of reservations?



* Re: PG recovery reservation state chart
  2012-10-02 19:48 PG recovery reservation state chart Mike Ryan
  2012-10-02 20:02 ` Gregory Farnum
  2012-10-02 20:31 ` Josh Durgin
@ 2012-10-02 20:35 ` Tommi Virtanen
  2012-10-02 20:42   ` Mike Ryan
  2012-10-02 21:36 ` Sage Weil
  3 siblings, 1 reply; 11+ messages in thread
From: Tommi Virtanen @ 2012-10-02 20:35 UTC (permalink / raw)
  To: Mike Ryan; +Cc: ceph-devel

On Tue, Oct 2, 2012 at 12:48 PM, Mike Ryan <mike.ryan@inktank.com> wrote:
> Tried sending this earlier but it seems the list doesn't like PNGs.
> dotty or dot -Tpng will make short work of the .dot file I've attached.

vger discards messages with attachments. It's old school mailing list
software. It's also used by many old school communities that consider
this a valuable anti-spam tactic, so they're not interested in
changing it.

Once this becomes less a design hypothetical and more a description of
how the code works, please please please put the dot in doc/dev/


* Re: PG recovery reservation state chart
  2012-10-02 20:31 ` Josh Durgin
@ 2012-10-02 20:40   ` Mike Ryan
  0 siblings, 0 replies; 11+ messages in thread
From: Mike Ryan @ 2012-10-02 20:40 UTC (permalink / raw)
  To: Josh Durgin; +Cc: ceph-devel

On Tue, Oct 02, 2012 at 01:31:13PM -0700, Josh Durgin wrote:
> Is the local reservation taken in OSD ID order with the remote
> reservations as well? What's the difference between local and remote
> reservations? Are there different limits on remote and local
> reservations?

They come from separate pools. Each pool has a finite number of
reservations, but if one pool has no more slots the other may still
grant reservations.
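
As a small illustration of the two pools being independent, here is a Python sketch. The names `Pool` and `OsdReservations` and the slot counts are invented for the example; the real limits are not given in this thread.

```python
class Pool:
    """A counter of free reservation slots."""

    def __init__(self, slots):
        self.free = slots

    def try_reserve(self):
        if self.free == 0:
            return False
        self.free -= 1
        return True

    def release(self):
        self.free += 1


class OsdReservations:
    """One OSD with separate pools for local and remote reservations."""

    def __init__(self, local_slots, remote_slots):
        self.local = Pool(local_slots)
        self.remote = Pool(remote_slots)
```

Exhausting `local` leaves `remote` untouched, so an OSD that is busy with its own PGs can still grant reservations to other primaries.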

> If there's more than one possible replica to backfill from could we try
> to reserve others if the first is busy instead of waiting?

I think you may have your backfill terminology backward. We don't
backfill from a replica, we backfill to a replica.

There will never be more than one replica that needs to be backfilled
to.

> Why would a remote backfill reservation fail if the OSD is full
> (disk space)? Backfill doesn't write to the replica, right? Or by
> full, do you mean out of reservations?

If the disk on the OSD is near full we reject backfills. I implemented
this change a few weeks ago, and it was merged last week.

Backfill does write to the replica.
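
The near-full check on the replica side might look something like this Python sketch. The function name and the `full_ratio` default are placeholders chosen for illustration; the actual threshold and how it is configured are not stated in this thread.

```python
def accept_backfill(used_bytes, total_bytes, full_ratio=0.85):
    """Return True if the replica should grant a backfill reservation.

    full_ratio is a placeholder threshold, not a value from this thread.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes < full_ratio
```

A False result corresponds to the RemoteReservationRejected edge in the chart, which sends the primary to NotBackfilling.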


* Re: PG recovery reservation state chart
  2012-10-02 20:35 ` Tommi Virtanen
@ 2012-10-02 20:42   ` Mike Ryan
  2012-10-02 22:00     ` Josh Durgin
  0 siblings, 1 reply; 11+ messages in thread
From: Mike Ryan @ 2012-10-02 20:42 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: ceph-devel

On Tue, Oct 02, 2012 at 01:35:34PM -0700, Tommi Virtanen wrote:
> On Tue, Oct 2, 2012 at 12:48 PM, Mike Ryan <mike.ryan@inktank.com> wrote:
> > Tried sending this earlier but it seems the list doesn't like PNGs.
> > dotty or dot -Tpng will make short work of the .dot file I've attached.
> 
> vger discards messages with attachments. It's old school mailing list
> software. It's also used by many old school communities that consider
> this a valuable anti-spam tactic, so they're not interested in
> changing it.

I figured as much. It would have been nice to receive a notification
that it was dropped rather than having it silently fall on the floor,
especially since a copy of the message is not sent to the sender upon
list acceptance. c'est la vie

> Once this becomes less a design hypothetical and more a description of
> how the code works, please please please put the dot in doc/dev/

This is unnecessary, as the doc scripts will automatically generate a
full peering state chart (of which this is just a sub state). Major
kudos to Sam Just for making that happen!


* Re: PG recovery reservation state chart
  2012-10-02 19:48 PG recovery reservation state chart Mike Ryan
                   ` (2 preceding siblings ...)
  2012-10-02 20:35 ` Tommi Virtanen
@ 2012-10-02 21:36 ` Sage Weil
  2012-10-02 21:43   ` Mike Ryan
  3 siblings, 1 reply; 11+ messages in thread
From: Sage Weil @ 2012-10-02 21:36 UTC (permalink / raw)
  To: Mike Ryan; +Cc: ceph-devel

On Tue, 2 Oct 2012, Mike Ryan wrote:
> Tried sending this earlier but it seems the list doesn't like PNGs.
> dotty or dot -Tpng will make short work of the .dot file I've attached.
> 
> 
> These are the changes to the Active state of the PG state chart in order
> to support recovery reservations. This is Important Stuff, so please
> criticize mercilessly.
> 
> Here's a prose version:
> 
> When the PG activates, it determines whether it needs to do recovery. If
> it does, it grabs its local reservation, then grabs a remote reservation
> from each replica in order of OSD ID (to prevent deadlock). Once all
> remotes are reserved, it starts recovering.
> 
> After recovery, all remote reservations are dropped. If no backfill is
> necessary, the local reservation is dropped and we jump to Clean.
> 
> If we need to backfill, we request a remote backfill reservation from
> the replica. If this reservation is rejected (due to the OSD being too
> full) we drop our local reservation and wait for a while in
> NotBackfilling. We then grab our local reservation and try again on the
> remote reservation. Once we have the remote reservation, we backfill.
> After Backfilling we drop the local and remote backfill reservation and
> jump to Clean.

This all looks right to me.  I only have one concern: if, at some future 
point, we decide it's necessary or worthwhile to avoid non-backfill 
recovery due to targets being full, does this approach preclude an elegant 
solution?

sage


* Re: PG recovery reservation state chart
  2012-10-02 21:36 ` Sage Weil
@ 2012-10-02 21:43   ` Mike Ryan
  0 siblings, 0 replies; 11+ messages in thread
From: Mike Ryan @ 2012-10-02 21:43 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

> This all looks right to me.  I only have one concern: if, at some future 
> point, we decide it's necessary or worthwhile to avoid non-backfill 
> recovery due to targets being full, does this approach preclude an elegant 
> solution?

I believe this encourages an elegant solution:

We add a new state if a remote reservation is rejected:

   WaitRemoteRecoveryReserved -> SleepALittle -> LocalReserving
  
In the new state we drop the remote reservations we acquired and wait
until a timer goes off before transitioning back into LocalReserving.
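
A Python sketch of how that extra state could slot into the recovery half of the chart. The event names `RemoteReservationRejected` (borrowed from the backfill edges) and `RetryTimerExpired` are assumptions, as is the class name; whether the local reservation is also dropped while sleeping is not specified above and is not modelled here.

```python
RECOVERY_TRANSITIONS = {
    ("LocalReserving", "LocalRecoveryReserved"): "WaitRemoteRecoveryReserved",
    ("WaitRemoteRecoveryReserved", "RemoteReserved"): "WaitRemoteRecoveryReserved",
    ("WaitRemoteRecoveryReserved", "AllRemotesReserved"): "Recovering",
    # Proposed additions; both event names are assumptions for this sketch.
    ("WaitRemoteRecoveryReserved", "RemoteReservationRejected"): "SleepALittle",
    ("SleepALittle", "RetryTimerExpired"): "LocalReserving",
}


class RecoveryReservation:
    """Tracks the state and the remote reservations acquired so far."""

    def __init__(self):
        self.state = "LocalReserving"
        self.remote_held = []

    def handle(self, event, osd=None):
        next_state = RECOVERY_TRANSITIONS[(self.state, event)]
        if event == "RemoteReserved":
            self.remote_held.append(osd)
        if next_state == "SleepALittle":
            self.remote_held.clear()   # drop the remotes acquired so far
        self.state = next_state
        return next_state
```

After a rejection the PG holds no remote reservations, so other PGs can make progress while it sleeps.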


* Re: PG recovery reservation state chart
  2012-10-02 20:42   ` Mike Ryan
@ 2012-10-02 22:00     ` Josh Durgin
  2012-10-02 22:39       ` Mike Ryan
  0 siblings, 1 reply; 11+ messages in thread
From: Josh Durgin @ 2012-10-02 22:00 UTC (permalink / raw)
  To: Mike Ryan; +Cc: Tommi Virtanen, ceph-devel

On 10/02/2012 01:42 PM, Mike Ryan wrote:
> On Tue, Oct 02, 2012 at 01:35:34PM -0700, Tommi Virtanen wrote:
>> On Tue, Oct 2, 2012 at 12:48 PM, Mike Ryan <mike.ryan@inktank.com> wrote:
>>> Tried sending this earlier but it seems the list doesn't like PNGs.
>>> dotty or dot -Tpng will make short work of the .dot file I've attached.
>>
>> vger discards messages with attachments. It's old school mailing list
>> software. It's also used by many old school communities that consider
>> this a valuable anti-spam tactic, so they're not interested in
>> changing it.
>
> I figured as much. It would have been nice to receive a notification
> that it was dropped rather than having it silently fall on the floor,
> especially since a copy of the message is not sent to the sender upon
> list acceptance. c'est la vie
>
>> Once this becomes less a design hypothetical and more a description of
>> how the code works, please please please put the dot in doc/dev/
>
> This is unnecessary, as the doc scripts will automatically generate a
> full peering state chart (of which this is just a sub state). Major
> kudos to Sam Just for making that happen!

It'd be good to update doc/dev/osd_internals with a description of the
reservations though, maybe expanding 
doc/dev/osd_internals/backfill_reservation.

One other thing I'd like to see made explicit:

How does this handle upgrades? i.e., what will happen when some OSDs
have this reservation mechanism and some do not?


* Re: PG recovery reservation state chart
  2012-10-02 22:00     ` Josh Durgin
@ 2012-10-02 22:39       ` Mike Ryan
  0 siblings, 0 replies; 11+ messages in thread
From: Mike Ryan @ 2012-10-02 22:39 UTC (permalink / raw)
  To: Josh Durgin; +Cc: Tommi Virtanen, ceph-devel

> It'd be good to update doc/dev/osd_internals with a description of the
> reservations though, maybe expanding
> doc/dev/osd_internals/backfill_reservation.

Will do.

> One other thing I'd like to see made explicit:
> 
> How does this handle upgrades? i.e., what will happen when some OSDs
> have this reservation mechanism and some do not?

I will introduce a feature bit for recovery reservations.

If a replica lacks the mechanism then the primary "grants" itself a
reservation from that replica. There will obviously be no need to
release the reservation.

If the primary lacks the mechanism, recovery proceeds as though none of
the replicas have the mechanism (which is to say, exactly how it
proceeds today).

In either case, an OSD can become heavily loaded if too many recovery
operations are occurring at the same time, but it's no worse than the
status quo.
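
The mixed-version handling could be sketched in Python as follows. The constant `FEATURE_RECOVERY_RESERVATION`, its bit position, and the function name are hypothetical; only the behaviour (self-granting for old replicas, no reservations at all on an old primary) comes from the description above.

```python
FEATURE_RECOVERY_RESERVATION = 1 << 0   # hypothetical bit position


def plan_remote_reservations(primary_features, replica_features):
    """Split replicas into those to ask and those the primary self-grants.

    replica_features maps OSD ID -> feature bitmask. Returns two lists of
    OSD IDs: (to_request, self_granted).
    """
    if not primary_features & FEATURE_RECOVERY_RESERVATION:
        # Old primary: no reservations at all, recovery proceeds as today.
        return [], []
    to_request, self_granted = [], []
    for osd in sorted(replica_features):
        if replica_features[osd] & FEATURE_RECOVERY_RESERVATION:
            to_request.append(osd)
        else:
            self_granted.append(osd)   # nothing to release later
    return to_request, self_granted
```

For example, a new primary with replicas 1 (old), 2 and 3 (new) would request reservations from 2 and 3 and treat 1 as already reserved.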


