From: Loic Dachary <loic@dachary.org>
To: Ceph Development <ceph-devel@vger.kernel.org>
Subject: Ceph backfilling explained ( maybe )
Date: Sat, 25 May 2013 13:55:30 +0200
Message-ID: <51A0A6B2.9060105@dachary.org>
Hi,
Here is a draft of my current understanding of backfilling. Disclaimer: it is possible that I completely misunderstood ;-)
Cheers
Ceph stores objects in pools, which are divided into placement groups.
+---------------------------- pool a ----+
|+----- placement group 1 -------------+ |
||+-------+ +-------+                  | |
|||object | |object |                  | |
||+-------+ +-------+                  | |
|+-------------------------------------+ |
|+----- placement group 2 -------------+ |
||+-------+ +-------+                  | |
|||object | |object | ...              | |
||+-------+ +-------+                  | |
|+-------------------------------------+ |
|                  ....                  |
|                                        |
+----------------------------------------+
+---------------------------- pool b ----+
|+----- placement group 1 -------------+ |
||+-------+ +-------+                  | |
|||object | |object |                  | |
||+-------+ +-------+                  | |
|+-------------------------------------+ |
|+----- placement group 2 -------------+ |
||+-------+ +-------+                  | |
|||object | |object | ...              | |
||+-------+ +-------+                  | |
|+-------------------------------------+ |
|                  ....                  |
|                                        |
+----------------------------------------+
...
The placement group relies on OSDs to store its objects. OSDs are daemons running on the machines where the storage resides. For instance, a placement group configured for three replicas has three OSDs at its disposal: one OSD is the primary and the other two store copies of each object.
+-------- placement group -------------+
|+----------------+ +----------------+ |
||    object A    | |    object B    | |
|+----------------+ +----------------+ |
+---+-------------+-----------+--------+
    |             |           |
    |             |           |
  OSD 0         OSD 1       OSD 2
 +------+      +------+    +------+
 |+---+ |      |+---+ |    |+---+ |
 || A | |      || A | |    || A | |
 |+---+ |      |+---+ |    |+---+ |
 |+---+ |      |+---+ |    |+---+ |
 || B | |      || B | |    || B | |
 |+---+ |      |+---+ |    |+---+ |
 +------+      +------+    +------+
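To make the primary / replica layout concrete, here is a minimal Python sketch of a placement group writing objects A and B to three OSDs. The names OSD, PlacementGroup and write are hypothetical and are not Ceph's API; treating the first OSD in the list as the primary is an assumption made for the illustration.

```python
class OSD:
    """Hypothetical stand-in for an OSD daemon: a named in-memory object store."""

    def __init__(self, osd_id):
        self.osd_id = osd_id
        self.objects = {}


class PlacementGroup:
    """Toy placement group backed by a list of OSDs."""

    def __init__(self, osds):
        if not osds:
            raise ValueError("a placement group needs at least one OSD")
        self.osds = list(osds)

    @property
    def primary(self):
        # Assumption: the first OSD in the list acts as the primary.
        return self.osds[0]

    @property
    def replicas(self):
        # Every other OSD only stores copies.
        return self.osds[1:]

    def write(self, name, data):
        # Each OSD of the group ends up with a copy of the object.
        for osd in self.osds:
            osd.objects[name] = data


pg = PlacementGroup([OSD(0), OSD(1), OSD(2)])
pg.write("A", b"payload A")
pg.write("B", b"payload B")
```

After the two writes, each of the three OSDs holds both A and B, as in the diagram.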
The OSDs are not for the exclusive use of one placement group: multiple placement groups can use the same OSDs to store their objects. However, the colocation of objects from various placement groups on the same OSD is transparent and is not discussed here.
The placement group does not run as a single daemon, as the diagram above might suggest. Instead it is distributed and resides within each OSD. Whenever an OSD dies, the part of the placement group that lived on this OSD is gone and needs to be reconstructed using another OSD.
                     OSD 0                        OSD 1 ...
+----------------+---- placement group --------+  +------
|+--- object --+ |+--------------------------+ |  |
|| name : B    | || pg_log_entry_t MODIFY    | |  |
|| key : 2     | || pg_log_entry_t DELETE    | |  |
|+-------------+ |+--------------------------+ |  |
|+--- object --+ >------ last_backfill         |  | ....
|| name : A    | |                             |  |
|| key : 5     | |                             |  |
|+-------------+ |                             |  |
|                |                             |  |
|   ....         |                             |  |
+----------------+-----------------------------+  +-----
When an object is deleted or modified in the placement group, the operation is recorded in a log so that it can be replayed if needed. In the simplest case, if an OSD gets disconnected, reconnects and needs to catch up with the other OSDs, copies of the log entries will be sent to it. However, the logs have a limited size and it may be more efficient, in some cases, to just copy the objects over instead of replaying the logs.
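The choice between replaying the log and copying objects can be sketched as follows. This is only an illustration under my own assumptions: the integer version field, the BoundedLog class and the entries_after method are hypothetical and are not taken from the Ceph source. The idea is that a log of limited size eventually forgets old entries, and an OSD that has been away for too long can no longer be caught up from the log alone.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    version: int  # hypothetical, strictly increasing by one per operation
    op: str       # "MODIFY" or "DELETE"
    name: str     # object name


class BoundedLog:
    """Toy log that keeps only the most recent max_entries entries."""

    def __init__(self, max_entries):
        self.entries = deque(maxlen=max_entries)

    def append(self, entry):
        # When the deque is full, the oldest entry is dropped silently.
        self.entries.append(entry)

    def entries_after(self, version):
        """Entries newer than version, or None if the log does not reach
        back far enough and the objects must be copied instead."""
        if not self.entries:
            return []  # nothing recorded, nothing to replay
        oldest = self.entries[0].version
        if version < oldest - 1:
            return None  # at least one needed entry has been forgotten
        return [e for e in self.entries if e.version > version]


log = BoundedLog(max_entries=3)
for v in range(1, 6):
    log.append(LogEntry(v, "MODIFY", "A"))
```

With five operations and room for three, the log holds versions 3 to 5. An OSD that stopped at version 2 or later can be caught up with log entries; one that stopped at version 1 gets None, meaning the objects have to be copied.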
Each object name is hashed into an integer that can be used to order the objects. For instance, object B above has been hashed to key 2 and object A to key 5. The last_backfill pointer of the placement group marks the boundary between the objects that have already been copied from other OSDs and those that are still to be copied. The objects that are lower than last_backfill have been copied (that would be object B above) and the objects that are greater than last_backfill are going to be copied.
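Here is a small sketch of that ordering, using the keys from the diagram (B is 2, A is 5) instead of a real hash function. The function names are hypothetical. I also assume that last_backfill holds the key of the last object copied, so an object whose key equals last_backfill counts as copied; the text above only says "lower than".

```python
# Keys taken from the diagram; a real hash function would compute them.
KEYS = {"B": 2, "A": 5}


def backfill_order(names, keys=KEYS):
    """Objects sorted by hash key, i.e. the order in which they are copied."""
    return sorted(names, key=lambda name: keys[name])


def split_at_last_backfill(names, last_backfill, keys=KEYS):
    """Return (copied, pending) relative to the last_backfill boundary."""
    ordered = backfill_order(names, keys)
    # Assumption: a key equal to last_backfill is already copied.
    copied = [n for n in ordered if keys[n] <= last_backfill]
    pending = [n for n in ordered if keys[n] > last_backfill]
    return copied, pending


copied, pending = split_at_last_backfill(["A", "B"], last_backfill=2)
```

With last_backfill at 2, copied contains B and pending contains A, which matches the position of the pointer in the diagram.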
It may take time for an OSD to catch up, so it is useful to allow replaying the logs while backfilling. Log entries related to objects lower than last_backfill are applied. However, log entries related to objects greater than last_backfill are discarded because those objects are scheduled to be copied at a later time anyway.
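The filtering of log entries during backfill could look like the sketch below. Again this is my own illustration, not Ceph code: the Entry class, the dictionary used as an object store and the function name are hypothetical, and a key equal to last_backfill is assumed to be on the already-copied side.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    op: str         # "MODIFY" or "DELETE"
    key: int        # hash key of the object the entry refers to
    data: str = ""  # new content, only meaningful for MODIFY


def replay_during_backfill(store, entries, last_backfill):
    """Apply entries for already-copied objects to store (a dict keyed by
    hash key) and return the entries that were discarded."""
    discarded = []
    for entry in entries:
        if entry.key > last_backfill:
            # The object will be copied later in its up-to-date state,
            # so replaying this entry would be wasted work.
            discarded.append(entry)
            continue
        if entry.op == "MODIFY":
            store[entry.key] = entry.data
        elif entry.op == "DELETE":
            store.pop(entry.key, None)
    return discarded


store = {2: "old B"}  # object B (key 2) has already been backfilled
entries = [Entry("MODIFY", 2, "new B"), Entry("DELETE", 5)]
discarded = replay_during_backfill(store, entries, last_backfill=2)
```

The MODIFY on key 2 is applied because B is already on the OSD, while the DELETE on key 5 is discarded because A has not been copied yet.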
--
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.