* Reducing backfilling/recovery long tail
@ 2014-12-12 15:16 Loic Dachary
  2014-12-12 16:12 ` Sage Weil
  0 siblings, 1 reply; 3+ messages in thread
From: Loic Dachary @ 2014-12-12 15:16 UTC (permalink / raw)
  To: Samuel Just, Sage Weil; +Cc: Ceph Development

Hi Sam & Sage,

In the context of http://tracker.ceph.com/issues/9566, I'm inclined to think the best solution would be for the AsyncReserver to choose a PG when a slot frees up, instead of just picking the next one in the list. It would always choose a PG that must move to/from the OSD that has more PGs waiting in the AsyncReserver than any other OSD. The sort involved does not seem too expensive.
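
A minimal sketch of this selection rule, assuming each waiting PG is tagged with the single OSD it must move to or from. The function name and the (pg, osd) tuples are hypothetical illustrations, not the AsyncReserver interface:

```python
from collections import Counter


def pick_next_pg(waiting):
    """Pick the PG to schedule when a slot becomes free.

    waiting: list of (pg_id, osd_id) pairs in queue order (hypothetical layout).
    Returns the first queued PG whose OSD has the most waiting PGs,
    or None if nothing is waiting.
    """
    if not waiting:
        return None
    # Count how many waiting PGs each OSD is involved in.
    load = Counter(osd for _, osd in waiting)
    busiest = max(load.values())
    # Keep queue order among the PGs of the busiest OSD(s).
    for pg, osd in waiting:
        if load[osd] == busiest:
            return pg, osd


waiting = [("1.0", 1), ("1.1", 0), ("1.2", 0)]
print(pick_next_pg(waiting))  # ('1.1', 0)
```

The count is recomputed each time a slot frees up, so the choice always reflects the current content of the queue.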

Calculating priorities before adding the PG to the AsyncReserver seems wrong because the state of the system will change significantly while the PG is waiting to be processed. For instance, the first PGs to be added get a low priority, while later ones get increasing priorities as they accumulate. If reservations are canceled because the OSD map changed again (maybe another OSD is decommissioned before recovery of the first one completes), you may end up with high priorities for PGs that are no longer associated with busy OSDs. That could backfire and create even more frequent long tails.
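
The stale-priority concern can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the priority of a PG is taken to be the number of PGs already queued for the same OSD, and all names are hypothetical rather than taken from the Ceph code:

```python
from collections import Counter


def enqueue_static(queue, pg, osd):
    """Append (pg, osd, prio); prio is frozen at enqueue time."""
    prio = sum(1 for _, o, _ in queue if o == osd)
    queue.append((pg, osd, prio))


def pick_static(queue):
    """Pick by the priority computed when the PG was queued."""
    return max(queue, key=lambda e: e[2])[:2]


def pick_dynamic(queue):
    """Pick by the number of PGs waiting for the OSD right now."""
    load = Counter(o for _, o, _ in queue)
    return max(queue, key=lambda e: load[e[1]])[:2]


queue = []
for pg in ("a", "b", "c"):
    enqueue_static(queue, pg, 0)   # priorities 0, 1, 2 for osd.0
for pg in ("d", "e"):
    enqueue_static(queue, pg, 1)   # priorities 0, 1 for osd.1

# The OSD map changes and the reservations for "a" and "b" are canceled.
queue = [e for e in queue if e[0] not in ("a", "b")]

print(pick_static(queue))   # ('c', 0): osd.0 now has a single PG waiting
print(pick_dynamic(queue))  # ('d', 1): osd.1 has the most PGs waiting
```

In this model the frozen priority still favors osd.0 after most of its PGs have left the queue, while recomputing the load at selection time favors osd.1.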

What do you think?

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre



* Re: Reducing backfilling/recovery long tail
  2014-12-12 15:16 Reducing backfilling/recovery long tail Loic Dachary
@ 2014-12-12 16:12 ` Sage Weil
  2014-12-12 17:59   ` Loic Dachary
  0 siblings, 1 reply; 3+ messages in thread
From: Sage Weil @ 2014-12-12 16:12 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Samuel Just, Ceph Development

On Fri, 12 Dec 2014, Loic Dachary wrote:
> Hi Sam & Sage,
> 
> In the context of http://tracker.ceph.com/issues/9566 I'm inclined to 
> think the best solution would be that the AsyncReserver choose a PG 
> instead of just picking the next one in the list when there is a free 
> slot. It would always choose a PG that must move to/from an OSD for
> which there are more PGs waiting in the AsyncReserver than any other
> OSD. The sort involved does not seem too expensive.
> 
> Calculating priorities before adding the PG to the AsyncReserver seems 
> wrong because the state of the system will change significantly while 
> the PG is waiting to be processed. For instance the first PGs to be 
> added have a low priority while the next have increasing priorities when 
> they accumulate. If reservations are canceled because the OSD map 
> changed again (maybe another OSD is decommissioned before recovery of 
> the first one completes), you may end up having high priorities for PGs 
> that are no longer associated with busy OSDs. That could backfire and 
> create even more frequent long tails.
> 
> What do you think ?

That makes sense.  In order to make that decision, it means that the OSDs 
need to be sharing the level of recovery work they have pending on a 
regular basis, right?

sage



* Re: Reducing backfilling/recovery long tail
  2014-12-12 16:12 ` Sage Weil
@ 2014-12-12 17:59   ` Loic Dachary
  0 siblings, 0 replies; 3+ messages in thread
From: Loic Dachary @ 2014-12-12 17:59 UTC (permalink / raw)
  To: Sage Weil; +Cc: Samuel Just, Ceph Development


On 12/12/2014 17:12, Sage Weil wrote:
> That makes sense.  In order to make that decision, it means that the OSDs 
> need to be sharing the level of recovery work they have pending on a 
> regular basis, right?
>  

It may not be necessary. The local_reserver is populated with all PGs that need to move. Say 50 of them are for osd.0 and 10 are for osd.1. The decision is made to schedule a PG for osd.0 because it has more PGs to go. This PG will then try to get a remote_reserver slot on osd.0: if it turns out that osd.0 is already busy, it will be queued. Up to osd_max_backfill PGs can be queued for a given OSD in the remote_reserver in this way, because only osd_max_backfill PGs will get a slot in the local_reserver.

Since the remote_reserver queue is capped by osd_max_backfill, its length does not accurately reflect the workload associated with an OSD. For this reason the priority could be modified when asking for the remote reservation (the priority field that we currently have) to reflect the workload. If the workload changes while PGs are waiting in the remote_reserver queue, these PGs could end up with a sub-optimal priority. It is probably an acceptable tradeoff since it impacts only osd_max_backfill PGs per OSD. In contrast, hundreds of PGs could be queued in the local_reserver, and setting a priority for them at the time they are queued could have lasting undesirable side effects.
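
The 50/10 example can be walked through with a rough sketch. It assumes the local side grants its free slots one at a time to a PG of the OSD with the most PGs still waiting, and attaches that count as the priority of the remote reservation request. The function, the tuples and the way the priority is derived are hypothetical, not the existing reserver code:

```python
from collections import Counter


def schedule_local(waiting, free_slots):
    """Grant up to free_slots local slots.

    waiting: list of (pg_id, osd_id) pairs (hypothetical layout).
    Returns the remote reservation requests as (pg_id, osd_id, priority),
    where priority is the number of PGs waiting for that OSD at grant time.
    """
    waiting = list(waiting)
    requests = []
    while waiting and len(requests) < free_slots:
        load = Counter(osd for _, osd in waiting)
        # Index of the first PG belonging to the OSD with the most waiting PGs.
        idx = max(range(len(waiting)), key=lambda i: load[waiting[i][1]])
        pg, osd = waiting.pop(idx)
        requests.append((pg, osd, load[osd]))
    return requests


# 50 PGs for osd.0, 10 PGs for osd.1, and a cap of 10 local slots.
waiting = [(f"0.{i}", 0) for i in range(50)] + [(f"1.{i}", 1) for i in range(10)]
requests = schedule_local(waiting, free_slots=10)
print(len(requests))                # 10
print({osd for _, osd, _ in requests})  # {0}
print(requests[0][2], requests[-1][2])  # 50 41
```

osd.0 only ever sees 10 requests from this OSD, which is why the length of its remote queue says little about the 50 PGs behind them; in this sketch the priority field is what carries that information.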

I should probably enumerate the steps of an actual situation to clarify my thinking :-)

Cheers

> sage
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

