Re: Ceph backfilling explained ( maybe )

From: Loic Dachary <loic@dachary.org>
To: Ceph Development <ceph-devel@vger.kernel.org>
Subject: Re: Ceph backfilling explained ( maybe )
Date: Sun, 26 May 2013 13:45:16 +0200
Message-ID: <51A1F5CC.2010309@dachary.org>
In-Reply-To: <51A10DD3.7000609@dachary.org>

Hi,

Although I have yet to fully understand the logic of placement group recovery (I'm eager to read Sam's doc/dev/osd_internals/pg_recovery.rst :-), I wrote down my understanding of backfilling: http://dachary.org/?p=2009

Cheers

On 05/25/2013 09:15 PM, Loic Dachary wrote:
> Hi !
> 
> On 05/25/2013 08:06 PM, Samuel Just wrote:
>> Hi, thanks for taking the time to try to get all this documented!
>>
>> Placement groups are assigned to a set of OSDs by crush.
>>
>> (4.1, osdmap(e 1)) --CRUSH--> [3,1,2]
>>
>> where the primary is 3.  When 3 dies, the osdmap is updated to reflect this
>> and we get a new mapping for pg 4.1:
>>
>> (4.1, osdmap(e 2)) --CRUSH--> [1,2,4]
>>
>> Here, 1 and 2 already have up-to-date copies of 4.1.  osd 4, however, needs
>> to be brought up to date.  During peering, osd 1 will learn that osd 4
>> falls into 1 of 2 cases.
>>
>> Case 1 is that osd 4 already had an old copy of pg 4.1 AND its pg log for pg
>> 4.1 happens to overlap osd 1's pg log for pg 4.1.  In that case, by running
>> through the log of operations, we can determine exactly which objects need
>> to be copied over.  We usually refer to this as just "recovery" (or log based
>> recovery).
>>
>> In case 2, either osd 4 has no old copy of pg 4.1 or its pg log does not
>> overlap that of osd 1.  In this case,
>> we cannot determine from the log which objects need to be copied over.
>> To bring osd 4 up to date, we therefore need to backfill.
>>
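To make the two cases concrete, here is a minimal Python sketch of the decision, not Ceph code. It assumes a pg log can be summarized by the range of versions it still holds, and that a peer with no copy of the pg has no log at all. The names `PgLog`, `logs_overlap` and `choose_sync_method` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PgLog:
    """Hypothetical summary of a pg log: the version range it still covers."""
    tail: int  # oldest version still present (older entries were trimmed)
    head: int  # newest version present


def logs_overlap(a: PgLog, b: PgLog) -> bool:
    # Two closed ranges overlap when each one starts before the other ends.
    return a.tail <= b.head and b.tail <= a.head


def choose_sync_method(primary: PgLog, peer: Optional[PgLog]) -> str:
    # Case 1: the peer has an old copy AND its log overlaps the primary's,
    # so replaying the log tells us exactly which objects to copy.
    if peer is not None and logs_overlap(primary, peer):
        return "recovery"
    # Case 2: no old copy, or the logs do not overlap: scan and copy.
    return "backfill"


print(choose_sync_method(PgLog(100, 200), PgLog(50, 120)))  # recovery
print(choose_sync_method(PgLog(100, 200), PgLog(10, 40)))   # backfill
print(choose_sync_method(PgLog(100, 200), None))            # backfill
```

Integer versions are a simplification made for the sketch; only the overlap test matters here.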
>> Backfill involves the primary and the backfill peer (there is only ever one in
>> the acting set at a time, see PG::choose_acting) scanning over their pg stores
>> and copying the objects which are different or missing from the primary to the
>> backfill peer.  Because this may take a long time, we track a last_backfill
>> attribute for each local pg copy indicating how far the local copy has been
>> backfilled.  In the case that the copy is complete, last_backfill is
>> hobject_t::max().
> 
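The role of last_backfill can be sketched as a resumable scan. The following is a toy Python model under loud assumptions: pg stores are plain dicts mapping object name to version, objects are ordered by name, and `HOBJECT_MAX` stands in for hobject_t::max(). The function `backfill_step` and its `batch` parameter are hypothetical, and the sketch does not model removing objects that exist only on the peer.

```python
HOBJECT_MAX = object()  # stand-in for hobject_t::max(): the copy is complete


def backfill_step(primary, peer, last_backfill, batch=2):
    """Scan up to `batch` objects past last_backfill and return the new marker.

    `primary` and `peer` are dicts of object name -> version.
    `last_backfill` is "" before any progress, or the last name processed.
    """
    if last_backfill is HOBJECT_MAX:
        return HOBJECT_MAX  # nothing left to do
    pending = sorted(name for name in primary if name > last_backfill)
    for name in pending[:batch]:
        # Copy only objects that are missing or different on the peer.
        if peer.get(name) != primary[name]:
            peer[name] = primary[name]
        last_backfill = name
    # If this batch reached the end of the scan, the copy is complete.
    return HOBJECT_MAX if len(pending) <= batch else last_backfill


primary = {"a": 1, "b": 2, "c": 3}
peer = {"a": 1, "b": 1}
marker = backfill_step(primary, peer, "")      # processes "a" and "b"
marker = backfill_step(primary, peer, marker)  # processes "c", then done
print(marker is HOBJECT_MAX, peer == primary)  # True True
```

Because the marker is returned after every batch, an interrupted backfill can pick up from the last recorded position instead of rescanning from the beginning.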
> Is it true that if two OSDs briefly disconnect while backfilling, they may fall into case 1 above (i.e. log-based recovery) and then resume backfilling when done, starting from last_backfill?
> 
>> More exactly, a local pg copy is described by a few pieces of information:
>> 1) the local pg log
> 
> pg_log_t https://github.com/ceph/ceph/blob/master/src/osd/osd_types.h#L1371
> pg_log_entry_t https://github.com/ceph/ceph/blob/master/src/osd/osd_types.h#L1277
> 
>> 2) the local last_backfill
> 
> pg_info_t::last_backfill https://github.com/ceph/ceph/blob/master/src/osd/osd_types.h#L1102
> 
>> 3) the local last_complete
> 
> pg_info_t::last_complete https://github.com/ceph/ceph/blob/master/src/osd/osd_types.h#L1089
> 
>> 4) the local missing set
> 
> pg_missing_t https://github.com/ceph/ceph/blob/master/src/osd/osd_types.h#L1468
> 
>> The local pg store reflects all updates up to version last_complete on all
> 
> I assume you mean 'local pg log' instead of 'local pg store'.
> 
>> hobject_ts hoid such that hoid < last_backfill AND hoid is not in the missing
>> set.  Comparing the pg logs is used to fill in the missing set for OSDs which
>> were only down for a brief period thus avoiding a costly backfill in many cases.
> 
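The invariant above, and the way a log comparison fills in the missing set, can be sketched in a few lines of Python. This is an illustration only: it assumes integer versions, object names ordered as plain strings, and a primary log given as a list of (version, name) pairs in ascending version order. Ceph's actual types are richer than these, and the names `missing_from_log` and `reflects_last_complete` are hypothetical.

```python
def missing_from_log(primary_log, peer_last_update):
    """Objects the peer must fetch: every object touched after its last update.

    `primary_log` is a list of (version, object_name), oldest entry first.
    Returns a dict of object_name -> newest version the peer still needs.
    """
    missing = {}
    for version, name in primary_log:
        if version > peer_last_update:
            missing[name] = version  # a later entry replaces an earlier one
    return missing


def reflects_last_complete(name, last_backfill, missing):
    """True if the local store holds this object up to last_complete."""
    return name < last_backfill and name not in missing


log = [(1, "a"), (2, "b"), (3, "a"), (4, "c")]
missing = missing_from_log(log, peer_last_update=2)
print(missing)                                     # {'a': 3, 'c': 4}
print(reflects_last_complete("b", "z", missing))   # True
print(reflects_last_complete("a", "z", missing))   # False: in the missing set
print(reflects_last_complete("c", "b", {}))        # False: not yet backfilled
```

This only works while the primary's log still reaches back to the peer's last update, which is the overlap condition discussed earlier in the thread.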
> The pg logs are trimmed ( https://github.com/ceph/ceph/blob/master/src/osd/PG.cc#L216 ). Is this why the pg logs of two OSDs that have been disconnected for too long are unlikely to overlap, and why a backfill is then required because the two pg logs cannot be compared?
> 
>> This is a bit of a rough brain dump and may be somewhat misleading/wrong.
> 
> It is very helpful as it is, thanks :-)
> 
>> I'll get it cleaned up and put it into
>> doc/dev/osd_internals/pg_recovery.rst next week.
>>
> 
> That would be great. 
> 
>> Also, rados objects currently have three pieces:
>> 1) data - read, write, writefull, etc.
>> 2) xattrs
>> 3) omap
>> The omap is much like the xattrs except that it can generally store a much
>> larger number of keys and support efficient scans.  It's used at the moment
>> for a few things including rgw bucket indices.  The omap entries are copied
>> over along with the rest of the object in recovery.  Behind the scenes, all
>> omap entries for all objects stored on an OSD are stored prefixed in a single
>> big leveldb instance.
>>
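As a rough picture of "stored prefixed in a single big leveldb instance", here is a Python sketch that uses a sorted in-memory key list as a stand-in for an ordered key-value store. The key layout (object name, a NUL byte, then the omap key) is an assumption made for this example and is not the OSD's real encoding; the class `PrefixedOmapStore` is hypothetical, and object names are assumed not to contain NUL.

```python
import bisect


class PrefixedOmapStore:
    """Toy ordered store holding the omap entries of many objects."""

    def __init__(self):
        self._keys = []    # all full keys, kept sorted
        self._values = {}  # full key -> value

    @staticmethod
    def _full_key(obj, key):
        # Assumed layout: "<object>\0<key>", so one object's keys are adjacent.
        return f"{obj}\x00{key}"

    def set(self, obj, key, value):
        full = self._full_key(obj, key)
        if full not in self._values:
            bisect.insort(self._keys, full)
        self._values[full] = value

    def scan(self, obj):
        """Return this object's (key, value) pairs in key order."""
        prefix = f"{obj}\x00"
        start = bisect.bisect_left(self._keys, prefix)
        result = []
        for full in self._keys[start:]:
            if not full.startswith(prefix):
                break  # left this object's key range
            result.append((full[len(prefix):], self._values[full]))
        return result


store = PrefixedOmapStore()
store.set("bucket.index", "b", "2")
store.set("bucket.index", "a", "1")
store.set("other", "x", "9")
print(store.scan("bucket.index"))  # [('a', '1'), ('b', '2')]
```

Keeping each object's keys contiguous in one sorted keyspace is what makes the scan a single seek followed by a sequential read.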
>> omap operations probably shouldn't be supported on objects in an
>> ErasureCodedPG :)
> 
> I thought omap / xattrs were mutually exclusive. I did not realize both could be used at the same time.
> 
> Cheers
> 
>> -Sam
>>
>> On Sat, May 25, 2013 at 10:37 AM, Loic Dachary <loic@dachary.org> wrote:
>>>
>>>
>>> On 05/25/2013 04:48 PM, Leen Besselink wrote:
>>>> On Sat, May 25, 2013 at 04:27:16PM +0200, Loic Dachary wrote:
>>>>>
>>>>>
>>>>> On 05/25/2013 02:33 PM, Leen Besselink wrote:
>>>>> Hi Leen,
>>>>>
>>>>>> - a Ceph object can store keys/values, not just data
>>>>>
>>>>> I did not know that. Could you explain or give me the URL ?
>>>>>
>>>>
>>>> Well, I got that impression from some of the earlier talks and from this blog post:
>>>>
>>>> http://ceph.com/community/my-first-impressions-of-ceph-as-a-summer-intern/
>>>>
>>>> But I haven't read it in a while.
>>>>
>>>> But at this time I only see something like:
>>>>
>>>> http://ceph.com/docs/master/rados/api/librados/?highlight=rados_getxattr#rados_getxattr
>>>>
>>>> Which looks like it is storing it in filesystem attributes.
>>>>
>>>> So maybe an object can be a piece of data or a key/value store.
>>>
>>> Thanks for explaining: I did not know about the work of Eleanor Cawthon. I knew about object xattrs but I thought you meant that the data inside the object could be structured as key/value pairs. My bad :-)
>>>
>>> Cheers
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>> All that is necessary for the triumph of evil is that good people do nothing.
>>>
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.

