RBD format changes and layering

All of lore.kernel.org
 help / color / mirror / Atom feed

* RBD format changes and layering
@ 2012-05-24 23:05 Josh Durgin
  2012-05-24 23:39 ` Yehuda Sadeh
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Josh Durgin @ 2012-05-24 23:05 UTC (permalink / raw)
  To: ceph-devel

RBD object format changes
=========================

To enable us to add more features to rbd, including copy-on-write
cloning via layering, we need to change to rbd header object
format. Since this won't be backwards compatible, the old format will
still be used by default. Once layering is implemented, the old format
will be deprecated, but still usable with an extra option (something
like rbd create --legacy ...). Clients will still be able to read the
old format, and images can be converted by exporting and importing them.

While we're making these changes, we can clean up the way librbd and
the rbd kernel module access the header, so that they don't have to
change each time we change the header format. Instead of reading the
header directly, they can use the OSD class mechanism to interact with
it. librbd already does this for snapshots, but kernel rbd reads the
entire header directly. Making them both use a well-defined api will
make later format additions much simpler. I'll describe the changes
needed in general, and then those that are needed for rbd layering.

New format, pre-layering
========================

Right now the header object is name $image_name.rbd, and the data
objects are named rb.$image_id_lowbits.$image_id_highbits.$object_number.
Since we're making other incompatible changes, we have a chance to
rename these to be less likely to collide with other objects. Prefixing
them with a more specific string will help, and will work well with
a new security feature for layering discussed later. The new
names are:

rbd_header.$image_name
rbd_data.$id.$object_number

The new header will have the existing (used) fields of the old format as
key/value pairs in an omap (this is the rados interface that stores
key/value pairs in leveldb). Specifically, the existing fields are:

  * object_prefix // previously known as block_name
  * order         // bit shift to determine size of the data objects
  * size          // total size of the image in bytes
  * snap_seq      // latest snapshot id used with the image
  * snapshots     // list of (snap_name, snap_id, image_size) tuples

To make adding new things easier, there will be an additional
'features' field, which is a mask of the features used by the image.
Clients will know whether they can use an image by checking if they
support all the features the image uses that the osd reports as being
incompatible (see get_info() below).

RBD class interface
===================

Here's a proposed basic interface - new features will
add more functions and data to existing ones.

/**
  * Initialize the header with basic metadata.
  * Extra features may initialize more fields in the future.
  * Everything is stored as key/value pairs as omaps in the header object.
  *
  * If features the OSD does not understand are requested, -ENOSYS is
  * returned.
  */
create(__le64 size, __le32 order, __le64 features)

/**
  * Get the metadata about the image required to do I/O
  * to it. In the future this may include extra information for
  * features that require it, like encryption/compression type.
  * This extra data will be added at the end of the response, so
  * clients that don't support it don't interpret it.
  *
  * Features that would require clients to be updated to access
  * the image correctly (such as image bitmaps) are set in
  * the incompat_features field. A client that doesn't understand
  * those features will return an error when they try to open
  * the image.
  *
  * The size and any extra information is read from the appropriate
  * snapshot metadata, if snapid is not CEPH_NOSNAP.
  *
  * Returns __le64 size, __le64 order, __le64 features,
  *         __le64 incompat_features, __le64 snapseq and
  *         list of __le64 snapids
  */
get_info(__le64 snapid)

/**
  * Used when resizing the image. Sets the size in bytes.
  */
set_size(__le64 size)

/**
  * The same as the existing snap_add/snap_remove methods, but using the
  * new format.
  */
snapshot_add(string snap_name, __le64 snap_id)
snapshot_remove(string snap_name)

/**
  * list snapshots - like the existing snap_list, but
  * can return a subset of them.
  *
  * Returns __le64 snap_seq, __le64 snap_count, and a list of tuples
  * (snap_id, snap_size) just like the current snap_list
  */
snapshot_list(__le64 max_len)

/**
  * The same as the existing method. Should only be called
  * on the rbd_info object.
  * Returns an id number to use for a new image.
  */
assign_bid()

RBD layering
============

The first step is to implement trivial layering, i.e.
layering without bitmaps, as described at:

http://marc.info/?l=ceph-devel&m=129867273303846&w=2

There are a couple of things that complicate the implementation:

1) making sure parent images are not deleted when children still
    refer to them

A simple way to solve this is to add a reference count to the parent
image. This can cause issues with partially deleted images, if the
reference count is decremented more than once because the child
image's header was only deleted the second time 'rbd rm' was run.

To prevent this, a full list of children can be used. When an image is
cloned, the new image is added to the list of children. When a child is
deleted, it is removed from the list. Keeping this all in the parent
image's header leads to the second issue:

2) cloning an image into a different pool without giving the cloner
    write access to the parent image's pool

The current capabilities implemented with cephx only allow you to
restrict users to reading, writing or executing class methods on a
per-pool basis.

For the child image in rbd, we need to be able to read the data
objects of the parent image, but only interact with the parent image
header through certain class methods, namely add_child and
remove_child during cloning and deletion.

One way to do this is adding a whitelist of class methods to the
capabilities system, but this would be hard to manage as more class
methods are added. A more manageable way is to give classes some
string they can interpret as permissions however they wish. Combined
with allowing clients to access objects matching certain prefixes,
this can restrict access to the image header to going through the rbd
class, but still allow allow read-only access to the data objects.

If we change the names of the rbd header and data objects to start
with rbd_header and rbd_data, respectively, we have something like:

allow prefix rbd_header class rbd image-child pool=templates
allow prefix rbd_data r pool=templates

where 'image-child' is interpreted by the rbd class to mean 'only
allow adding or removing a child'.

The problem with this is that the restricted client can still remove
any child, not just images it has access to. To get around this, we
can give each image a randomly generated uuid, and store that in the
child header and the parent's list of children. Then when someone
calls remove_child, they must pass the uuid in addition to their pool,
name, and snapshot, and it will only be processed if it matches the
uuid in the parent header.

One thing that's not addressed in the earlier design is how to make
images read-only. The simplest way would be to only support layering
on top of snapshots, which are read-only by definition.

Another way would be to allow images to be set read-only or
read-write, and disallow setting images with children read-write. Are
there many use cases that would justify this second, more complicated
way?

Copy-up
=======

Another feature we want to include with layering is the ability to
copy all remaining data from the parent image to the child image, to
break the dependency of the latter on the former. This does not change
snapshots that were taken earlier though - they still rely on the
parent image. Thus, the children of a parent image will need to
include snapshots as well, and the reference to the parent image will
be needed to interact with snapshots. Thus, we can't just remove the
information pointing the parent. Instead, we can add a boolean
has_parent field that is stored in the header and with each snapshot,
since some snapshots may be taken when the parent was still used, and
some after all the data has been copied to the child.

Renaming
========

In order to support renaming layered images, we can use the id
assigned to each image in place of the name. We just need to store a
mapping from ids to names in each pool. Eventually this can replace
rbd_directory, when we stop supporting the old format. This can't
happen right now because clients assume rbd_directory is a tmap.

Thus, the parent and child image lists would contain (pool name, image
id, snapshot name) tuples. Pools and snapshots can't be renamed, so
they don't have this problem. Image ids are unique within a pool, so
(pool name, image id) uniquely identifies an image.

Resizing
========

To support resizing of layered images, we need to keep track of the
minimum size the image ever was, so that if a child image is shrunk
and then expanded, the re-expanded space is treated as unused instead
of being read from the parent image. Since this can change over time,
we need to store this for each snapshot as well.

In summary, the format changes specific to adding layering are:

New object
==========

rbd_images_names // stores a mapping from image ids to image names

New header fields
=================

* parent_pool, parent_image_id, parent_snapshot
* uuid
* children - tuples of (pool, image_id, snapshot)
* min_size
* has_parent
* new fields in snapshots:
   - min_size
   - has_parent

New rbd class methods
=====================

/**
  * Sets the parent, min_size, and has_parent keys.
  * Fails if any of these keys exist, since the image already
  * had a parent.
  */
set_parent(string pool_name, __le64 image_id, string snap_name)

/**
  * Sets has_parent to false.
  */
remove_parent() // after all parent data is copied to the child

/**
  * uuid is required here to prevent malicious users from
  * removing children they don't have access to.
  */
add_child(string pool, __le64 image_id, string snapname, string uuid)
remove_child(string pool_name, __le64 image_id, string snapname, string 
uuid)

/**
  * to be run on the rbd_image_names object.
  */
get_name(image_id)
set_name(image_id)

Changes to existing class methods
=================================

The new snapshot fields will be added to the return value of snapshot_list.
snapshot_add will need to fill them in.

create will generate a uuid for the image.

Does anyone have any thoughts on the design? Any ways to make it simpler?

Josh

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RBD format changes and layering
  2012-05-24 23:05 RBD format changes and layering Josh Durgin
@ 2012-05-24 23:39 ` Yehuda Sadeh
  2012-05-25 17:33   ` Josh Durgin
  2012-05-25 14:57 ` Alex Elder
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: Yehuda Sadeh @ 2012-05-24 23:39 UTC (permalink / raw)
  To: Josh Durgin; +Cc: ceph-devel

On Thu, May 24, 2012 at 4:05 PM, Josh Durgin <josh.durgin@inktank.com> wrote:
> RBD object format changes
> =========================
>
> To enable us to add more features to rbd, including copy-on-write
> cloning via layering, we need to change to rbd header object
> format. Since this won't be backwards compatible, the old format will
> still be used by default. Once layering is implemented, the old format
> will be deprecated, but still usable with an extra option (something
> like rbd create --legacy ...). Clients will still be able to read the
> old format, and images can be converted by exporting and importing them.
>
> While we're making these changes, we can clean up the way librbd and
> the rbd kernel module access the header, so that they don't have to
> change each time we change the header format. Instead of reading the
> header directly, they can use the OSD class mechanism to interact with
> it. librbd already does this for snapshots, but kernel rbd reads the
> entire header directly. Making them both use a well-defined api will
> make later format additions much simpler. I'll describe the changes
> needed in general, and then those that are needed for rbd layering.
>
> New format, pre-layering
> ========================
>
> Right now the header object is name $image_name.rbd, and the data
> objects are named rb.$image_id_lowbits.$image_id_highbits.$object_number.
> Since we're making other incompatible changes, we have a chance to
> rename these to be less likely to collide with other objects. Prefixing
> them with a more specific string will help, and will work well with
> a new security feature for layering discussed later. The new
> names are:
>
> rbd_header.$image_name
> rbd_data.$id.$object_number
>
> The new header will have the existing (used) fields of the old format as
> key/value pairs in an omap (this is the rados interface that stores
> key/value pairs in leveldb). Specifically, the existing fields are:
>
>  * object_prefix // previously known as block_name
>  * order         // bit shift to determine size of the data objects
>  * size          // total size of the image in bytes
>  * snap_seq      // latest snapshot id used with the image
>  * snapshots     // list of (snap_name, snap_id, image_size) tuples
>
> To make adding new things easier, there will be an additional
> 'features' field, which is a mask of the features used by the image.
> Clients will know whether they can use an image by checking if they
> support all the features the image uses that the osd reports as being
> incompatible (see get_info() below).
>
> RBD class interface
> ===================
>
> Here's a proposed basic interface - new features will
> add more functions and data to existing ones.
>
> /**
>  * Initialize the header with basic metadata.
>  * Extra features may initialize more fields in the future.
>  * Everything is stored as key/value pairs as omaps in the header object.
>  *
>  * If features the OSD does not understand are requested, -ENOSYS is
>  * returned.
>  */
> create(__le64 size, __le32 order, __le64 features)
>
> /**
>  * Get the metadata about the image required to do I/O
>  * to it. In the future this may include extra information for
>  * features that require it, like encryption/compression type.
>  * This extra data will be added at the end of the response, so
>  * clients that don't support it don't interpret it.
>  *
>  * Features that would require clients to be updated to access
>  * the image correctly (such as image bitmaps) are set in
>  * the incompat_features field. A client that doesn't understand
>  * those features will return an error when they try to open
>  * the image.
>  *
>  * The size and any extra information is read from the appropriate
>  * snapshot metadata, if snapid is not CEPH_NOSNAP.
>  *
>  * Returns __le64 size, __le64 order, __le64 features,
>  *         __le64 incompat_features, __le64 snapseq and
>  *         list of __le64 snapids
>  */
> get_info(__le64 snapid)
>
> /**
>  * Used when resizing the image. Sets the size in bytes.
>  */
> set_size(__le64 size)
>
> /**
>  * The same as the existing snap_add/snap_remove methods, but using the
>  * new format.
>  */
> snapshot_add(string snap_name, __le64 snap_id)
> snapshot_remove(string snap_name)
>
> /**
>  * list snapshots - like the existing snap_list, but
>  * can return a subset of them.
>  *
>  * Returns __le64 snap_seq, __le64 snap_count, and a list of tuples
>  * (snap_id, snap_size) just like the current snap_list
>  */
> snapshot_list(__le64 max_len)
>
> /**
>  * The same as the existing method. Should only be called
>  * on the rbd_info object.
>  * Returns an id number to use for a new image.
>  */
> assign_bid()
>
>
> RBD layering
> ============
>
> The first step is to implement trivial layering, i.e.
> layering without bitmaps, as described at:
>
> http://marc.info/?l=ceph-devel&m=129867273303846&w=2
>
> There are a couple of things that complicate the implementation:
>
> 1) making sure parent images are not deleted when children still
>   refer to them
>
> A simple way to solve this is to add a reference count to the parent
> image. This can cause issues with partially deleted images, if the
> reference count is decremented more than once because the child
> image's header was only deleted the second time 'rbd rm' was run.
>
> To prevent this, a full list of children can be used. When an image is
> cloned, the new image is added to the list of children. When a child is
> deleted, it is removed from the list. Keeping this all in the parent
> image's header leads to the second issue:

I don't like the idea of writing anything to the parent. We can have
an orthogonal directory in a different location.

>
> 2) cloning an image into a different pool without giving the cloner
>   write access to the parent image's pool
>
> The current capabilities implemented with cephx only allow you to
> restrict users to reading, writing or executing class methods on a
> per-pool basis.
>
> For the child image in rbd, we need to be able to read the data
> objects of the parent image, but only interact with the parent image
> header through certain class methods, namely add_child and
> remove_child during cloning and deletion.
>
> One way to do this is adding a whitelist of class methods to the
> capabilities system, but this would be hard to manage as more class
> methods are added. A more manageable way is to give classes some
> string they can interpret as permissions however they wish. Combined
> with allowing clients to access objects matching certain prefixes,
> this can restrict access to the image header to going through the rbd
> class, but still allow allow read-only access to the data objects.

That sounds useful feature, however, it also sounds like a much bigger
hammer than you need for that specific problem.
As I said, I don't think we should modify the parent. You can keep
that list on a different object that relates to the parent (e.g.,
rbd_ref.$image_name), or in a central place. You can put that object
in a different pool, where the client has write permissions.

>
> If we change the names of the rbd header and data objects to start
> with rbd_header and rbd_data, respectively, we have something like:
>
> allow prefix rbd_header class rbd image-child pool=templates
> allow prefix rbd_data r pool=templates
>
> where 'image-child' is interpreted by the rbd class to mean 'only
> allow adding or removing a child'.
>
> The problem with this is that the restricted client can still remove
> any child, not just images it has access to. To get around this, we
> can give each image a randomly generated uuid, and store that in the
> child header and the parent's list of children. Then when someone
> calls remove_child, they must pass the uuid in addition to their pool,
> name, and snapshot, and it will only be processed if it matches the
> uuid in the parent header.
>
> One thing that's not addressed in the earlier design is how to make
> images read-only. The simplest way would be to only support layering
> on top of snapshots, which are read-only by definition.
>
> Another way would be to allow images to be set read-only or
> read-write, and disallow setting images with children read-write. Are
> there many use cases that would justify this second, more complicated
> way?

I think that explicitly setting a read-only flag on the parent (if not
set) is enough. No need to do other magic (see my above comments).

>
> Copy-up
> =======
>
> Another feature we want to include with layering is the ability to
> copy all remaining data from the parent image to the child image, to
> break the dependency of the latter on the former. This does not change
> snapshots that were taken earlier though - they still rely on the
> parent image. Thus, the children of a parent image will need to
> include snapshots as well, and the reference to the parent image will
> be needed to interact with snapshots. Thus, we can't just remove the
> information pointing the parent. Instead, we can add a boolean
> has_parent field that is stored in the header and with each snapshot,
> since some snapshots may be taken when the parent was still used, and
> some after all the data has been copied to the child.

To generalize: a snapshot context would contain the source rbd image.

>
> Renaming
> ========
>
> In order to support renaming layered images, we can use the id
> assigned to each image in place of the name. We just need to store a
> mapping from ids to names in each pool. Eventually this can replace
> rbd_directory, when we stop supporting the old format. This can't
> happen right now because clients assume rbd_directory is a tmap.
>
> Thus, the parent and child image lists would contain (pool name, image
> id, snapshot name) tuples. Pools and snapshots can't be renamed, so
> they don't have this problem. Image ids are unique within a pool, so
> (pool name, image id) uniquely identifies an image.
>
> Resizing
> ========
>
> To support resizing of layered images, we need to keep track of the
> minimum size the image ever was, so that if a child image is shrunk
> and then expanded, the re-expanded space is treated as unused instead
> of being read from the parent image. Since this can change over time,
> we need to store this for each snapshot as well.
>
> In summary, the format changes specific to adding layering are:
>
> New object
> ==========
>
> rbd_images_names // stores a mapping from image ids to image names
>
> New header fields
> =================
>
> * parent_pool, parent_image_id, parent_snapshot
> * uuid
> * children - tuples of (pool, image_id, snapshot)
> * min_size
> * has_parent
> * new fields in snapshots:
>  - min_size
>  - has_parent
>
> New rbd class methods
> =====================
>
> /**
>  * Sets the parent, min_size, and has_parent keys.
>  * Fails if any of these keys exist, since the image already
>  * had a parent.
>  */
> set_parent(string pool_name, __le64 image_id, string snap_name)
>
> /**
>  * Sets has_parent to false.
>  */
> remove_parent() // after all parent data is copied to the child
>
> /**
>  * uuid is required here to prevent malicious users from
>  * removing children they don't have access to.
>  */
> add_child(string pool, __le64 image_id, string snapname, string uuid)
> remove_child(string pool_name, __le64 image_id, string snapname, string
> uuid)
>
> /**
>  * to be run on the rbd_image_names object.
>  */
> get_name(image_id)
> set_name(image_id)
>
> Changes to existing class methods
> =================================
>
> The new snapshot fields will be added to the return value of snapshot_list.
> snapshot_add will need to fill them in.
>
> create will generate a uuid for the image.
>
> Does anyone have any thoughts on the design? Any ways to make it simpler?
>
> Josh
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RBD format changes and layering
  2012-05-24 23:05 RBD format changes and layering Josh Durgin
  2012-05-24 23:39 ` Yehuda Sadeh
@ 2012-05-25 14:57 ` Alex Elder
  2012-05-25 20:21   ` Josh Durgin
  2012-05-25 20:55 ` Greg Farnum
  2012-05-25 23:07 ` Josh Durgin
  3 siblings, 1 reply; 11+ messages in thread
From: Alex Elder @ 2012-05-25 14:57 UTC (permalink / raw)
  To: Josh Durgin; +Cc: ceph-devel

On 05/24/2012 06:05 PM, Josh Durgin wrote:
> RBD object format changes
> =========================

In this message I respond only to the first part of your message.
I'll look at the layering stuff separately.

> To enable us to add more features to rbd, including copy-on-write
> cloning via layering, we need to change to rbd header object
> format. Since this won't be backwards compatible, the old format will
> still be used by default. Once layering is implemented, the old format
> will be deprecated, but still usable with an extra option (something
> like rbd create --legacy ...). Clients will still be able to read the
> old format, and images can be converted by exporting and importing them.

Are you saying here that the clients will recognize both types and will
work with either?  Or will clients need to specify how to interpret it?
Is there a plan to deprecate clients' support for the old format?

> While we're making these changes, we can clean up the way librbd and
> the rbd kernel module access the header, so that they don't have to
> change each time we change the header format. Instead of reading the
> header directly, they can use the OSD class mechanism to interact with
> it. librbd already does this for snapshots, but kernel rbd reads the
> entire header directly. Making them both use a well-defined api will
> make later format additions much simpler. I'll describe the changes
> needed in general, and then those that are needed for rbd layering.
>
> New format, pre-layering
> ========================
>
> Right now the header object is name $image_name.rbd, and the data
> objects are named rb.$image_id_lowbits.$image_id_highbits.$object_number.
> Since we're making other incompatible changes, we have a chance to
> rename these to be less likely to collide with other objects. Prefixing
> them with a more specific string will help, and will work well with
> a new security feature for layering discussed later. The new
> names are:
>
> rbd_header.$image_name
> rbd_data.$id.$object_number

This is an improvement.

> The new header will have the existing (used) fields of the old format as
> key/value pairs in an omap (this is the rados interface that stores
> key/value pairs in leveldb). Specifically, the existing fields are:
>
> * object_prefix // previously known as block_name
> * order // bit shift to determine size of the data objects
> * size // total size of the image in bytes

Is the size restricted to be a multiple of (2 ^ (object order))?
If so I would like to see the size expressed in units of
object-order-sized blocks.

> * snap_seq // latest snapshot id used with the image
> * snapshots // list of (snap_name, snap_id, image_size) tuples

Can you remind me the distinction between the snap_name and the
snap_id?  Are they both unique identifiers?  Do both need to be
exposed, or just the snap_name?  Is the name a user-supplied name
and the id an indication of sequence?  (Sorry, haven't been looking
at this stuff in a while.)

> To make adding new things easier, there will be an additional
> 'features' field, which is a mask of the features used by the image.
> Clients will know whether they can use an image by checking if they
> support all the features the image uses that the osd reports as being
> incompatible (see get_info() below).

Does a feature vector of all zeroes (returned by the interface) indicate
old-style RBD?  Maybe the first feature bit should simply indicate that
the new format is in used, which allows the new interfaces to be used
with old images, and lets the old/new question to be determined using
the interfaces you define below.

> RBD class interface
> ===================
>
> Here's a proposed basic interface - new features will
> add more functions and data to existing ones.
>
> /**
> * Initialize the header with basic metadata.
> * Extra features may initialize more fields in the future.
> * Everything is stored as key/value pairs as omaps in the header object.
> *
> * If features the OSD does not understand are requested, -ENOSYS is
> * returned.
> */
> create(__le64 size, __le32 order, __le64 features)

Maybe I don't know where this interface lies in the software
stack.  But all integer parameters here and elsewhere should
be expressed in CPU order, not little-endian.  (And unless
there is a reason to make them signed, they should always
be unsigned.)

The only place the conversion between CPU and little-endian
byte order should occur is at the interface with the wire
or persistent storage.  This is not where these interfaces
sit (right?).  (There may be some internal code that keeps
things in little-endian for various reasons, but that should
not be specified in an external interface like this.)

The fact that everything sent over-the-wire and on-disk is
in little-endian format is significant, but just needs to
be stated in a single blanket statement.

> /**
> * Get the metadata about the image required to do I/O
> * to it. In the future this may include extra information for
> * features that require it, like encryption/compression type.
> * This extra data will be added at the end of the response, so
> * clients that don't support it don't interpret it.
> *
> * Features that would require clients to be updated to access
> * the image correctly (such as image bitmaps) are set in
> * the incompat_features field. A client that doesn't understand
> * those features will return an error when they try to open
> * the image.
> *
> * The size and any extra information is read from the appropriate
> * snapshot metadata, if snapid is not CEPH_NOSNAP.
> *
> * Returns __le64 size, __le64 order, __le64 features,
> * __le64 incompat_features, __le64 snapseq and
> * list of __le64 snapids
> */
> get_info(__le64 snapid)

I think I would prefer to see these bits of information broken
out into a few routines that group related information, or to
separate what's supplied based on the time or frequency it might
need to be accessed, or the "effort" involved in collecting it.

For example:

get_features() would supply the features and incompat_features.
It might not hurt to add "ible" to that second name too.  As I
understand it "features" would be things that allow additional
functionality but which neither affect nor are affected by the
way old code (that does not support it) might manipulate the
persistent state of the RBD storage.

Does it make sense for a snapshot to have features encoded in
it?  Can you envision a feature that might affect snapshots,
and if so, would one want to have snapshots in the same RBD
device that have different feature sets?  If the answer to
these is "no," then there's no need for a "snapid" to be
supplied for this interface, it would apply to an RBD device
itself.

get_size() might supply the block size order and the size in
blocks of the RBD device (or a snapshot of the device).

I don't remember what the snapshot sequence number is for
at the moment, but get_snapseq() might provide that, and
would (I think) apply to a device, not a snapshot.

get_snapids() would supply all the snapshot id's recorded for
a device.  This too is a device operation only, not a snapshot
operation.  Is there a need for this interface *and* the
list_snapshots() you specify below?

Rather than implicitly appending additional information to
the response, I think separate new entry points should be added
when features are added, extending the interface.  Old code,
or new code operating on old-format RBD devices, would not need
to access the new entry points.

> /**
> * Used when resizing the image. Sets the size in bytes.
> */
> set_size(__le64 size)

Maybe define the size in units of (order-dependent) blocks.

> /**
> * The same as the existing snap_add/snap_remove methods, but using the
> * new format.
> */
> snapshot_add(string snap_name, __le64 snap_id)

This probably reflects my ignorance, but does it make sense for
something "external" to the RBD implementation to specify the
snapshot id?

> snapshot_remove(string snap_name)

Can you remove by id?

> /**
> * list snapshots - like the existing snap_list, but
> * can return a subset of them.
> *
> * Returns __le64 snap_seq, __le64 snap_count, and a list of tuples
> * (snap_id, snap_size) just like the current snap_list
> */
> snapshot_list(__le64 max_len)
>
> /**
> * The same as the existing method. Should only be called
> * on the rbd_info object.
> * Returns an id number to use for a new image.
> */
> assign_bid()

Why bid?  I guess this is on the server side so I'm just not
familiar with it.

> RBD layering
> ============

I'll look at this part later.

					-Alex

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RBD format changes and layering
  2012-05-24 23:39 ` Yehuda Sadeh
@ 2012-05-25 17:33   ` Josh Durgin
  0 siblings, 0 replies; 11+ messages in thread
From: Josh Durgin @ 2012-05-25 17:33 UTC (permalink / raw)
  To: Yehuda Sadeh; +Cc: ceph-devel

On 05/24/2012 04:39 PM, Yehuda Sadeh wrote:
> On Thu, May 24, 2012 at 4:05 PM, Josh Durgin<josh.durgin@inktank.com>  wrote:
>> RBD object format changes
>> =========================
>>
>> To enable us to add more features to rbd, including copy-on-write
>> cloning via layering, we need to change to rbd header object
>> format. Since this won't be backwards compatible, the old format will
>> still be used by default. Once layering is implemented, the old format
>> will be deprecated, but still usable with an extra option (something
>> like rbd create --legacy ...). Clients will still be able to read the
>> old format, and images can be converted by exporting and importing them.
>>
>> While we're making these changes, we can clean up the way librbd and
>> the rbd kernel module access the header, so that they don't have to
>> change each time we change the header format. Instead of reading the
>> header directly, they can use the OSD class mechanism to interact with
>> it. librbd already does this for snapshots, but kernel rbd reads the
>> entire header directly. Making them both use a well-defined api will
>> make later format additions much simpler. I'll describe the changes
>> needed in general, and then those that are needed for rbd layering.
>>
>> New format, pre-layering
>> ========================
>>
>> Right now the header object is name $image_name.rbd, and the data
>> objects are named rb.$image_id_lowbits.$image_id_highbits.$object_number.
>> Since we're making other incompatible changes, we have a chance to
>> rename these to be less likely to collide with other objects. Prefixing
>> them with a more specific string will help, and will work well with
>> a new security feature for layering discussed later. The new
>> names are:
>>
>> rbd_header.$image_name
>> rbd_data.$id.$object_number
>>
>> The new header will have the existing (used) fields of the old format as
>> key/value pairs in an omap (this is the rados interface that stores
>> key/value pairs in leveldb). Specifically, the existing fields are:
>>
>>   * object_prefix // previously known as block_name
>>   * order         // bit shift to determine size of the data objects
>>   * size          // total size of the image in bytes
>>   * snap_seq      // latest snapshot id used with the image
>>   * snapshots     // list of (snap_name, snap_id, image_size) tuples
>>
>> To make adding new things easier, there will be an additional
>> 'features' field, which is a mask of the features used by the image.
>> Clients will know whether they can use an image by checking if they
>> support all the features the image uses that the osd reports as being
>> incompatible (see get_info() below).
>>
>> RBD class interface
>> ===================
>>
>> Here's a proposed basic interface - new features will
>> add more functions and data to existing ones.
>>
>> /**
>>   * Initialize the header with basic metadata.
>>   * Extra features may initialize more fields in the future.
>>   * Everything is stored as key/value pairs as omaps in the header object.
>>   *
>>   * If features the OSD does not understand are requested, -ENOSYS is
>>   * returned.
>>   */
>> create(__le64 size, __le32 order, __le64 features)
>>
>> /**
>>   * Get the metadata about the image required to do I/O
>>   * to it. In the future this may include extra information for
>>   * features that require it, like encryption/compression type.
>>   * This extra data will be added at the end of the response, so
>>   * clients that don't support it don't interpret it.
>>   *
>>   * Features that would require clients to be updated to access
>>   * the image correctly (such as image bitmaps) are set in
>>   * the incompat_features field. A client that doesn't understand
>>   * those features will return an error when they try to open
>>   * the image.
>>   *
>>   * The size and any extra information is read from the appropriate
>>   * snapshot metadata, if snapid is not CEPH_NOSNAP.
>>   *
>>   * Returns __le64 size, __le64 order, __le64 features,
>>   *         __le64 incompat_features, __le64 snapseq and
>>   *         list of __le64 snapids
>>   */
>> get_info(__le64 snapid)
>>
>> /**
>>   * Used when resizing the image. Sets the size in bytes.
>>   */
>> set_size(__le64 size)
>>
>> /**
>>   * The same as the existing snap_add/snap_remove methods, but using the
>>   * new format.
>>   */
>> snapshot_add(string snap_name, __le64 snap_id)
>> snapshot_remove(string snap_name)
>>
>> /**
>>   * list snapshots - like the existing snap_list, but
>>   * can return a subset of them.
>>   *
>>   * Returns __le64 snap_seq, __le64 snap_count, and a list of tuples
>>   * (snap_id, snap_size) just like the current snap_list
>>   */
>> snapshot_list(__le64 max_len)
>>
>> /**
>>   * The same as the existing method. Should only be called
>>   * on the rbd_info object.
>>   * Returns an id number to use for a new image.
>>   */
>> assign_bid()
>>
>>
>> RBD layering
>> ============
>>
>> The first step is to implement trivial layering, i.e.
>> layering without bitmaps, as described at:
>>
>> http://marc.info/?l=ceph-devel&m=129867273303846&w=2
>>
>> There are a couple of things that complicate the implementation:
>>
>> 1) making sure parent images are not deleted when children still
>>    refer to them
>>
>> A simple way to solve this is to add a reference count to the parent
>> image. This can cause issues with partially deleted images, if the
>> reference count is decremented more than once because the child
>> image's header was only deleted the second time 'rbd rm' was run.
>>
>> To prevent this, a full list of children can be used. When an image is
>> cloned, the new image is added to the list of children. When a child is
>> deleted, it is removed from the list. Keeping this all in the parent
>> image's header leads to the second issue:
>
> I don't like the idea of writing anything to the parent. We can have
> an orthogonal directory in a different location.
>
>>
>> 2) cloning an image into a different pool without giving the cloner
>>    write access to the parent image's pool
>>
>> The current capabilities implemented with cephx only allow you to
>> restrict users to reading, writing or executing class methods on a
>> per-pool basis.
>>
>> For the child image in rbd, we need to be able to read the data
>> objects of the parent image, but only interact with the parent image
>> header through certain class methods, namely add_child and
>> remove_child during cloning and deletion.
>>
>> One way to do this is adding a whitelist of class methods to the
>> capabilities system, but this would be hard to manage as more class
>> methods are added. A more manageable way is to give classes some
>> string they can interpret as permissions however they wish. Combined
>> with allowing clients to access objects matching certain prefixes,
>> this can restrict access to the image header to going through the rbd
>> class, but still allow allow read-only access to the data objects.
>
> That sounds useful feature, however, it also sounds like a much bigger
> hammer than you need for that specific problem.
> As I said, I don't think we should modify the parent. You can keep
> that list on a different object that relates to the parent (e.g.,
> rbd_ref.$image_name), or in a central place. You can put that objectheader
> in a different pool, where the client has write permissions.

This doesn't solve the problem of a user of the child image being able
to make the parent have no children. No matter where it is stored, if
they have write permissions to the parent's list of children, they can
delete other children by simply writing an empty list. Being in a
separate object or pool doesn't help.

The problem is that simple rwx isn't granular enough - we want to
prevent the users of the child image from doing anything other than
add_child/remove_child, and possibly other class methods in the future.

Prefix matching with class-enforced restrictions is simple to implement,
and generally useful for future problems like this.

>> If we change the names of the rbd header and data objects to start
>> with rbd_header and rbd_data, respectively, we have something like:
>>
>> allow prefix rbd_header class rbd image-child pool=templates
>> allow prefix rbd_data r pool=templates
>>
>> where 'image-child' is interpreted by the rbd class to mean 'only
>> allow adding or removing a child'.
>>
>> The problem with this is that the restricted client can still remove
>> any child, not just images it has access to. To get around this, we
>> can give each image a randomly generated uuid, and store that in the
>> child header and the parent's list of children. Then when someone
>> calls remove_child, they must pass the uuid in addition to their pool,
>> name, and snapshot, and it will only be processed if it matches the
>> uuid in the parent header.
>>
>> One thing that's not addressed in the earlier design is how to make
>> images read-only. The simplest way would be to only support layering
>> on top of snapshots, which are read-only by definition.
>>
>> Another way would be to allow images to be set read-only or
>> read-write, and disallow setting images with children read-write. Are
>> there many use cases that would justify this second, more complicated
>> way?
>
> I think that explicitly setting a read-only flag on the parent (if not
> set) is enough. No need to do other magic (see my above comments).
>
>>
>> Copy-up
>> =======
>>
>> Another feature we want to include with layering is the ability to
>> copy all remaining data from the parent image to the child image, to
>> break the dependency of the latter on the former. This does not change
>> snapshots that were taken earlier though - they still rely on the
>> parent image. Thus, the children of a parent image will need to
>> include snapshots as well, and the reference to the parent image will
>> be needed to interact with snapshots. Thus, we can't just remove the
>> information pointing the parent. Instead, we can add a boolean
>> has_parent field that is stored in the header and with each snapshot,
>> since some snapshots may be taken when the parent was still used, and
>> some after all the data has been copied to the child.
>
> To generalize: a snapshot context would contain the source rbd image.

Yeah, but I'd call it snapshot metadata to avoid confusion with the 
rados snapshot context.

>
>>
>> Renaming
>> ========
>>
>> In order to support renaming layered images, we can use the id
>> assigned to each image in place of the name. We just need to store a
>> mapping from ids to names in each pool. Eventually this can replace
>> rbd_directory, when we stop supporting the old format. This can't
>> happen right now because clients assume rbd_directory is a tmap.
>>
>> Thus, the parent and child image lists would contain (pool name, image
>> id, snapshot name) tuples. Pools and snapshots can't be renamed, so
>> they don't have this problem. Image ids are unique within a pool, so
>> (pool name, image id) uniquely identifies an image.
>>
>> Resizing
>> ========
>>
>> To support resizing of layered images, we need to keep track of the
>> minimum size the image ever was, so that if a child image is shrunk
>> and then expanded, the re-expanded space is treated as unused instead
>> of being read from the parent image. Since this can change over time,
>> we need to store this for each snapshot as well.
>>
>> In summary, the format changes specific to adding layering are:
>>
>> New object
>> ==========
>>
>> rbd_images_names // stores a mapping from image ids to image names
>>
>> New header fields
>> =================
>>
>> * parent_pool, parent_image_id, parent_snapshot
>> * uuid
>> * children - tuples of (pool, image_id, snapshot)
>> * min_size
>> * has_parent
>> * new fields in snapshots:
>>   - min_size
>>   - has_parent
>>
>> New rbd class methods
>> =====================
>>
>> /**
>>   * Sets the parent, min_size, and has_parent keys.
>>   * Fails if any of these keys exist, since the image already
>>   * had a parent.
>>   */
>> set_parent(string pool_name, __le64 image_id, string snap_name)
>>
>> /**
>>   * Sets has_parent to false.
>>   */
>> remove_parent() // after all parent data is copied to the child
>>
>> /**
>>   * uuid is required here to prevent malicious users from
>>   * removing children they don't have access to.
>>   */
>> add_child(string pool, __le64 image_id, string snapname, string uuid)
>> remove_child(string pool_name, __le64 image_id, string snapname, string
>> uuid)
>>
>> /**
>>   * to be run on the rbd_image_names object.
>>   */
>> get_name(image_id)
>> set_name(image_id)
>>
>> Changes to existing class methods
>> =================================
>>
>> The new snapshot fields will be added to the return value of snapshot_list.
>> snapshot_add will need to fill them in.
>>
>> create will generate a uuid for the image.
>>
>> Does anyone have any thoughts on the design? Any ways to make it simpler?
>>
>> Josh
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RBD format changes and layering
  2012-05-25 14:57 ` Alex Elder
@ 2012-05-25 20:21   ` Josh Durgin
  2012-05-25 22:26     ` Sage Weil
  0 siblings, 1 reply; 11+ messages in thread
From: Josh Durgin @ 2012-05-25 20:21 UTC (permalink / raw)
  To: elder; +Cc: ceph-devel

On 05/25/2012 07:57 AM, Alex Elder wrote:
> On 05/24/2012 06:05 PM, Josh Durgin wrote:
>> RBD object format changes
>> =========================
>
> In this message I respond only to the first part of your message.
> I'll look at the layering stuff separately.
>
>> To enable us to add more features to rbd, including copy-on-write
>> cloning via layering, we need to change to rbd header object
>> format. Since this won't be backwards compatible, the old format will
>> still be used by default. Once layering is implemented, the old format
>> will be deprecated, but still usable with an extra option (something
>> like rbd create --legacy ...). Clients will still be able to read the
>> old format, and images can be converted by exporting and importing them.
>
> Are you saying here that the clients will recognize both types and will
> work with either? Or will clients need to specify how to interpret it?
> Is there a plan to deprecate clients' support for the old format?

I'm saying clients will recognize and read both types automatically.
We should remove the old format eventually, but we probably want to wait
a while so users of older kernels can upgrade.

>> While we're making these changes, we can clean up the way librbd and
>> the rbd kernel module access the header, so that they don't have to
>> change each time we change the header format. Instead of reading the
>> header directly, they can use the OSD class mechanism to interact with
>> it. librbd already does this for snapshots, but kernel rbd reads the
>> entire header directly. Making them both use a well-defined api will
>> make later format additions much simpler. I'll describe the changes
>> needed in general, and then those that are needed for rbd layering.
>>
>> New format, pre-layering
>> ========================
>>
>> Right now the header object is name $image_name.rbd, and the data
>> objects are named rb.$image_id_lowbits.$image_id_highbits.$object_number.
>> Since we're making other incompatible changes, we have a chance to
>> rename these to be less likely to collide with other objects. Prefixing
>> them with a more specific string will help, and will work well with
>> a new security feature for layering discussed later. The new
>> names are:
>>
>> rbd_header.$image_name
>> rbd_data.$id.$object_number
>
> This is an improvement.
>
>> The new header will have the existing (used) fields of the old format as
>> key/value pairs in an omap (this is the rados interface that stores
>> key/value pairs in leveldb). Specifically, the existing fields are:
>>
>> * object_prefix // previously known as block_name
>> * order // bit shift to determine size of the data objects
>> * size // total size of the image in bytes
>
> Is the size restricted to be a multiple of (2 ^ (object order))?
> If so I would like to see the size expressed in units of
> object-order-sized blocks.

No, it can be any number of bytes.

>> * snap_seq // latest snapshot id used with the image
>> * snapshots // list of (snap_name, snap_id, image_size) tuples
>
> Can you remind me the distinction between the snap_name and the
> snap_id? Are they both unique identifiers? Do both need to be
> exposed, or just the snap_name? Is the name a user-supplied name
> and the id an indication of sequence? (Sorry, haven't been looking
> at this stuff in a while.)

snap_id is unique to the pool, and is an indication of sequence managed
by the monitiors. snap_name is provided by the user, and
is unique to the image.

snap_id doesn't need to be exposed to the user, but the client does
need it for the rados snapshot context that gets sent with every
rados operation.

>> To make adding new things easier, there will be an additional
>> 'features' field, which is a mask of the features used by the image.
>> Clients will know whether they can use an image by checking if they
>> support all the features the image uses that the osd reports as being
>> incompatible (see get_info() below).
>
> Does a feature vector of all zeroes (returned by the interface) indicate
> old-style RBD? Maybe the first feature bit should simply indicate that
> the new format is in used, which allows the new interfaces to be used
> with old images, and lets the old/new question to be determined using
> the interfaces you define below.

Old images have the header stored in a differently named object, so
existence of the old-style header ($image_name.rbd) means it's in the
old format. I think it's simpler to keep the existing header
interaction in clients when using the old format. This is needed anyway
if the clients are upgraded before the osds.

A zeroed feature vector means no new features (like layering) are
enabled. I'm inclined to continue using the old format until layering
is implemented, since the new format doesn't add much until
layering is there. There would be an option to use it for testing,
before layering is done, of course.

>> RBD class interface
>> ===================
>>
>> Here's a proposed basic interface - new features will
>> add more functions and data to existing ones.
>>
>> /**
>> * Initialize the header with basic metadata.
>> * Extra features may initialize more fields in the future.
>> * Everything is stored as key/value pairs as omaps in the header object.
>> *
>> * If features the OSD does not understand are requested, -ENOSYS is
>> * returned.
>> */
>> create(__le64 size, __le32 order, __le64 features)
>
> Maybe I don't know where this interface lies in the software
> stack. But all integer parameters here and elsewhere should
> be expressed in CPU order, not little-endian. (And unless
> there is a reason to make them signed, they should always
> be unsigned.)

Yeah, all of these should be unsigned.

> The only place the conversion between CPU and little-endian
> byte order should occur is at the interface with the wire
> or persistent storage. This is not where these interfaces
> sit (right?). (There may be some internal code that keeps
> things in little-endian for various reasons, but that should
> not be specified in an external interface like this.)

Yeah, just replace all of the numerical types with uint64_t or uint32_t.

> The fact that everything sent over-the-wire and on-disk is
> in little-endian format is significant, but just needs to
> be stated in a single blanket statement.
>
>> /**
>> * Get the metadata about the image required to do I/O
>> * to it. In the future this may include extra information for
>> * features that require it, like encryption/compression type.
>> * This extra data will be added at the end of the response, so
>> * clients that don't support it don't interpret it.
>> *
>> * Features that would require clients to be updated to access
>> * the image correctly (such as image bitmaps) are set in
>> * the incompat_features field. A client that doesn't understand
>> * those features will return an error when they try to open
>> * the image.
>> *
>> * The size and any extra information is read from the appropriate
>> * snapshot metadata, if snapid is not CEPH_NOSNAP.
>> *
>> * Returns __le64 size, __le64 order, __le64 features,
>> * __le64 incompat_features, __le64 snapseq and
>> * list of __le64 snapids
>> */
>> get_info(__le64 snapid)
>
> I think I would prefer to see these bits of information broken
> out into a few routines that group related information, or to
> separate what's supplied based on the time or frequency it might
> need to be accessed, or the "effort" involved in collecting it.

I was thinking that we might want these all in one operation for
atomicity, but we could add support for multi-operation transactions to
the kernel instead. These were added to userspace a few months ago.

> For example:
>
> get_features() would supply the features and incompat_features.
> It might not hurt to add "ible" to that second name too. As I
> understand it "features" would be things that allow additional
> functionality but which neither affect nor are affected by the
> way old code (that does not support it) might manipulate the
> persistent state of the RBD storage.

Exactly.

> Does it make sense for a snapshot to have features encoded in
> it? Can you envision a feature that might affect snapshots,
> and if so, would one want to have snapshots in the same RBD
> device that have different feature sets? If the answer to
> these is "no," then there's no need for a "snapid" to be
> supplied for this interface, it would apply to an RBD device
> itself.

Features like allocation maps (of the entire image or for individual
objects) might be added to existing images.

> get_size() might supply the block size order and the size in
> blocks of the RBD device (or a snapshot of the device).
>
> I don't remember what the snapshot sequence number is for
> at the moment, but get_snapseq() might provide that, and
> would (I think) apply to a device, not a snapshot.
>
> get_snapids() would supply all the snapshot id's recorded for
> a device. This too is a device operation only, not a snapshot
> operation. Is there a need for this interface *and* the
> list_snapshots() you specify below?

Together the snapids and the snap_seq form the snapshot context for a
rados operation, and are always used together, so if we go for mulitple
methods instead of get_info(), a get_snapcontext() that returns both
snapids and snap_seq might make more sense.

This snapshot context is different from rbd-level snapshot metadata,
which is currently just snapshot name and size, but would include other
fields like has_parent and features/incompatible_features in the future.

> Rather than implicitly appending additional information to
> the response, I think separate new entry points should be added
> when features are added, extending the interface. Old code,
> or new code operating on old-format RBD devices, would not need
> to access the new entry points.

This is fine as long as we add multi-operation transaction support to
the kernel to make accessing all the image metadata needed for I/O
atomic. This probably isn't too hard, but what do you think? The
userspace client side parts were done in 
fe077832b915175b8ed7880c1fa285309c642563.

>> /**
>> * Used when resizing the image. Sets the size in bytes.
>> */
>> set_size(__le64 size)
>
> Maybe define the size in units of (order-dependent) blocks.

I don't see any reason to restrict the size to a multiple of the object
size.

>> /**
>> * The same as the existing snap_add/snap_remove methods, but using the
>> * new format.
>> */
>> snapshot_add(string snap_name, __le64 snap_id)
>
> This probably reflects my ignorance, but does it make sense for
> something "external" to the RBD implementation to specify the
> snapshot id?

The snapshot id is allocated by the monitors. Clients create
snapshots by getting a new snap id from the monitors, and adding
it to the header with the user-specified name.

>> snapshot_remove(string snap_name)
>
> Can you remove by id?

That would be better. It would prevent some unlikely races if
someone deletes a snapshot while another client deleted one and created
another with the same name.

>> /**
>> * list snapshots - like the existing snap_list, but
>> * can return a subset of them.
>> *
>> * Returns __le64 snap_seq, __le64 snap_count, and a list of tuples
>> * (snap_id, snap_size) just like the current snap_list
>> */
>> snapshot_list(__le64 max_len)
>>
>> /**
>> * The same as the existing method. Should only be called
>> * on the rbd_info object.
>> * Returns an id number to use for a new image.
>> */
>> assign_bid()
>
> Why bid? I guess this is on the server side so I'm just not
> familiar with it.

bid stands for block id, but we could certainly change the name. It's
only used when creating a new image, to get a unique id for use in the
object names.

Josh

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RBD format changes and layering
  2012-05-24 23:05 RBD format changes and layering Josh Durgin
  2012-05-24 23:39 ` Yehuda Sadeh
  2012-05-25 14:57 ` Alex Elder
@ 2012-05-25 20:55 ` Greg Farnum
  2012-05-25 21:25   ` Josh Durgin
  2012-05-25 23:07 ` Josh Durgin
  3 siblings, 1 reply; 11+ messages in thread
From: Greg Farnum @ 2012-05-25 20:55 UTC (permalink / raw)
  To: Josh Durgin; +Cc: ceph-devel

On Thursday, May 24, 2012 at 4:05 PM, Josh Durgin wrote:

<snip> 
> 
> One thing that's not addressed in the earlier design is how to make
> images read-only. The simplest way would be to only support layering
> on top of snapshots, which are read-only by definition.
> 
> Another way would be to allow images to be set read-only or
> read-write, and disallow setting images with children read-write. Are
> there many use cases that would justify this second, more complicated
> way?

I'm pretty sure we want to require images to be based on snapshots. It's actually more flexible than read-write flags: service providers could provide several Ubuntu 12.04 installs with different packages available by simply snapshotting as they go through the install procedure. If they instead had to go to an endpoint and then mark the image read-only, they would need to duplicate all the shared data.

> Copy-up
> =======
> 
> Another feature we want to include with layering is the ability to
> copy all remaining data from the parent image to the child image, to
> break the dependency of the latter on the former. This does not change
> snapshots that were taken earlier though - they still rely on the
> parent image. Thus, the children of a parent image will need to
> include snapshots as well, and the reference to the parent image will
> be needed to interact with snapshots. Thus, we can't just remove the
> information pointing the parent. Instead, we can add a boolean
> has_parent field that is stored in the header and with each snapshot,
> since some snapshots may be taken when the parent was still used, and
> some after all the data has been copied to the child.

I understand why you're maintaining a reference to the parent image for old snapshots, but it makes me a little uneasy. This limitation means that you either need to delete snapshots or you need to maintain access to the parent image, which makes me a sad panda.
Have you looked into options for doing a full "local" copy of the needed parent data? I realize there are several tricky problems, but given some of the usage scenarios for layering (ie, migration) it would be an advantage.

My last question is about recursive layering. I know it's been discussed some, and *hopefully* it won't impact the actual on-disk layout of RBD images; do you have enough of a design sketched out to be sure? (One example: given the security concerns you've raised, I think layered images are going to need to list themselves as a child of each of their ancestors, rather than letting that information be absorbed by the intermediate image. Can the plan for storing parent pointers handle that?)
-Greg

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RBD format changes and layering
  2012-05-25 20:55 ` Greg Farnum
@ 2012-05-25 21:25   ` Josh Durgin
  0 siblings, 0 replies; 11+ messages in thread
From: Josh Durgin @ 2012-05-25 21:25 UTC (permalink / raw)
  To: Greg Farnum; +Cc: ceph-devel

On 05/25/2012 01:55 PM, Greg Farnum wrote:
> On Thursday, May 24, 2012 at 4:05 PM, Josh Durgin wrote:
>
> <snip>
>>
>> One thing that's not addressed in the earlier design is how to make
>> images read-only. The simplest way would be to only support layering
>> on top of snapshots, which are read-only by definition.
>>
>> Another way would be to allow images to be set read-only or
>> read-write, and disallow setting images with children read-write. Are
>> there many use cases that would justify this second, more complicated
>> way?
>
> I'm pretty sure we want to require images to be based on snapshots. It's actually more flexible than read-write flags: service providers could provide several Ubuntu 12.04 installs with different packages available by simply snapshotting as they go through the install procedure. If they instead had to go to an endpoint and then mark the image read-only, they would need to duplicate all the shared data.

I like this approach better myself, since it's less confusing and has
a smaller potential for errors.

>
>> Copy-up
>> =======
>>
>> Another feature we want to include with layering is the ability to
>> copy all remaining data from the parent image to the child image, to
>> break the dependency of the latter on the former. This does not change
>> snapshots that were taken earlier though - they still rely on the
>> parent image. Thus, the children of a parent image will need to
>> include snapshots as well, and the reference to the parent image will
>> be needed to interact with snapshots. Thus, we can't just remove the
>> information pointing the parent. Instead, we can add a boolean
>> has_parent field that is stored in the header and with each snapshot,
>> since some snapshots may be taken when the parent was still used, and
>> some after all the data has been copied to the child.
>
> I understand why you're maintaining a reference to the parent image for old snapshots, but it makes me a little uneasy. This limitation means that you either need to delete snapshots or you need to maintain access to the parent image, which makes me a sad panda.
> Have you looked into options for doing a full "local" copy of the needed parent data? I realize there are several tricky problems, but given some of the usage scenarios for layering (ie, migration) it would be an advantage.

I'm not quite sure what you mean by a full "local" copy.

My plan was to treat snapshots as extra children of the parent if they
reference it. That is, snapshotting a cloned image would include
calling add_child on the parent. This ensures that the parent won't be 
deleted if child images or snapshots still need it.

> My last question is about recursive layering. I know it's been discussed some, and *hopefully* it won't impact the actual on-disk layout of RBD images; do you have enough of a design sketched out to be sure? (One example: given the security concerns you've raised, I think layered images are going to need to list themselves as a child of each of their ancestors, rather than letting that information be absorbed by the intermediate image. Can the plan for storing parent pointers handle that?)

Since parents maintain a list of children, there's no need to add
references at more than one level. If you have images A, B, and C,
with each a child of the last, A can't be deleted until B is, and B
can't be deleted until C is.

The only restriction is that the client needs to have read access to
all the parent pools.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RBD format changes and layering
  2012-05-25 20:21   ` Josh Durgin
@ 2012-05-25 22:26     ` Sage Weil
  2012-05-26  1:43       ` Josh Durgin
  0 siblings, 1 reply; 11+ messages in thread
From: Sage Weil @ 2012-05-25 22:26 UTC (permalink / raw)
  To: Josh Durgin; +Cc: elder, ceph-devel

On Fri, 25 May 2012, Josh Durgin wrote:
> On 05/25/2012 07:57 AM, Alex Elder wrote:
> > > /**
> > > * Get the metadata about the image required to do I/O
> > > * to it. In the future this may include extra information for
> > > * features that require it, like encryption/compression type.
> > > * This extra data will be added at the end of the response, so
> > > * clients that don't support it don't interpret it.
> > > *
> > > * Features that would require clients to be updated to access
> > > * the image correctly (such as image bitmaps) are set in
> > > * the incompat_features field. A client that doesn't understand
> > > * those features will return an error when they try to open
> > > * the image.
> > > *
> > > * The size and any extra information is read from the appropriate
> > > * snapshot metadata, if snapid is not CEPH_NOSNAP.
> > > *
> > > * Returns __le64 size, __le64 order, __le64 features,
> > > * __le64 incompat_features, __le64 snapseq and
> > > * list of __le64 snapids
> > > */
> > > get_info(__le64 snapid)
> > 
> > I think I would prefer to see these bits of information broken
> > out into a few routines that group related information, or to
> > separate what's supplied based on the time or frequency it might
> > need to be accessed, or the "effort" involved in collecting it.
> 
> I was thinking that we might want these all in one operation for
> atomicity, but we could add support for multi-operation transactions to
> the kernel instead. These were added to userspace a few months ago.

I would prefer separate operations too (e.g., get-size, get-order, 
get-features, etc.).  IIRC there is already some infrastructure to handle 
compound operations already.  Atomicity shouldn't be a concern, either 
way.  This makes it simple to expand the header with other infos without 
creating a get-info2 command or something similar.

A couple other comments:

- The pools currently can't be renamed, but there isn't any reason why 
they couldn't be... at least until we start refering to them by name in 
the rbd parent pointers.  I'd rather use the pool ids to keep our options 
open.

- Requiring parents be snapshots seems fine to me.  It just means the 
child lists need to be per-snapshot, so that we know when it is safe to 
remove snaps on the parent.

- I don't think that creating snapshots on the child needs to touch the 
parent (if that is still the plan).  The child can remove itself as a 
child one the final reference (head or snap) is removed; no need to bother 
the parent with that information.  (It could also cause a lot of noise for 
the parent 12.04 image with 10,000 children getting snapped regularly.)

- I wonder if it makes sense to create an 'open' method (and maybe 
corresponding 'close').  I'm imagining future *compat* features (e.g., 
bitmaps), where a new client creates some bitmaps, and then an old client 
mounts the image.  The bitmap doesn't have to be incompat if the old 
client invalidates it (e.g., via open with old feature set).  

This might be useful also when we add locking (so that clients get EBUSY 
if multiple hosts try to map).

- Will we have class methods for rbd_directory as well?  That seems like 
the simplest way to maintain backwards compatibility.  Also, if we keep 
the name, maybe rbd_header.* and rbd_data.* are more consistent.

sage

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RBD format changes and layering
  2012-05-24 23:05 RBD format changes and layering Josh Durgin
                   ` (2 preceding siblings ...)
  2012-05-25 20:55 ` Greg Farnum
@ 2012-05-25 23:07 ` Josh Durgin
  2012-05-29 22:08   ` Tommi Virtanen
  3 siblings, 1 reply; 11+ messages in thread
From: Josh Durgin @ 2012-05-25 23:07 UTC (permalink / raw)
  To: ceph-devel

On 05/24/2012 04:05 PM, Josh Durgin wrote:
> 1) making sure parent images are not deleted when children still
> refer to them

Yehuda and Tv and I talked about this more, and came up with a simpler
design that doesn't require the security changes or writing anything
to the parent.

Each image (and snapshot) has a 'preserved' flag that means it is 
read-only and cannot be deleted without explicitly declaring it
deletable. Something like:

   $ rbd preserve pool/image
   $ rbd rm pool/image
   Error: image is preserved. If you really know what you're doing, 
unpreserve it.
   $ rbd unpreserve pool/image
   $ rbd rm pool/image

Images or snapshots that don't have the preserved flag set can't be
cloned. Images or snapshots that do have it set can't be deleted until
it is unset.

To answer the question 'does this image have children', we can have an
object in the child's pool that maintains info about which children
exist in that pool (rbd_clones). This could be an omap with keys of
(parent pool id, parent image id, parent snap name, child image id,
child snap name) and empty values, or keys with the parent info and
values consisting of lists of child info.

To check whether children exist, you can iterate over
all the pools and check the rbd_clones object in each one.
Since the number of pools is relatively small, this isn't
very expensive. If the pool is deleted, by definition all the children 
in it are deleted.

With separate namespaces in the future, this will be a bit more
expensive, but it's only needed at base image deletion time,
which is relatively rare. Deleting the image itself already
requires an I/O per object, so this is probably not the slow
part anyway.

Yehuda, Tv, did I miss anything?

Josh

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RBD format changes and layering
  2012-05-25 22:26     ` Sage Weil
@ 2012-05-26  1:43       ` Josh Durgin
  0 siblings, 0 replies; 11+ messages in thread
From: Josh Durgin @ 2012-05-26  1:43 UTC (permalink / raw)
  To: Sage Weil; +Cc: elder, ceph-devel

On 05/25/2012 03:26 PM, Sage Weil wrote:
> On Fri, 25 May 2012, Josh Durgin wrote:
>> On 05/25/2012 07:57 AM, Alex Elder wrote:
>>>> /**
>>>> * Get the metadata about the image required to do I/O
>>>> * to it. In the future this may include extra information for
>>>> * features that require it, like encryption/compression type.
>>>> * This extra data will be added at the end of the response, so
>>>> * clients that don't support it don't interpret it.
>>>> *
>>>> * Features that would require clients to be updated to access
>>>> * the image correctly (such as image bitmaps) are set in
>>>> * the incompat_features field. A client that doesn't understand
>>>> * those features will return an error when they try to open
>>>> * the image.
>>>> *
>>>> * The size and any extra information is read from the appropriate
>>>> * snapshot metadata, if snapid is not CEPH_NOSNAP.
>>>> *
>>>> * Returns __le64 size, __le64 order, __le64 features,
>>>> * __le64 incompat_features, __le64 snapseq and
>>>> * list of __le64 snapids
>>>> */
>>>> get_info(__le64 snapid)
>>>
>>> I think I would prefer to see these bits of information broken
>>> out into a few routines that group related information, or to
>>> separate what's supplied based on the time or frequency it might
>>> need to be accessed, or the "effort" involved in collecting it.
>>
>> I was thinking that we might want these all in one operation for
>> atomicity, but we could add support for multi-operation transactions to
>> the kernel instead. These were added to userspace a few months ago.
>
> I would prefer separate operations too (e.g., get-size, get-order,
> get-features, etc.).  IIRC there is already some infrastructure to handle
> compound operations already.  Atomicity shouldn't be a concern, either
> way.  This makes it simple to expand the header with other infos without
> creating a get-info2 command or something similar.
>
> A couple other comments:
>
> - The pools currently can't be renamed, but there isn't any reason why
> they couldn't be... at least until we start refering to them by name in
> the rbd parent pointers.  I'd rather use the pool ids to keep our options
> open.

Sounds good.

> - Requiring parents be snapshots seems fine to me.  It just means the
> child lists need to be per-snapshot, so that we know when it is safe to
> remove snaps on the parent.
>
> - I don't think that creating snapshots on the child needs to touch the
> parent (if that is still the plan).  The child can remove itself as a
> child one the final reference (head or snap) is removed; no need to bother
> the parent with that information.  (It could also cause a lot of noise for
> the parent 12.04 image with 10,000 children getting snapped regularly.)
>
> - I wonder if it makes sense to create an 'open' method (and maybe
> corresponding 'close').  I'm imagining future *compat* features (e.g.,
> bitmaps), where a new client creates some bitmaps, and then an old client
> mounts the image.  The bitmap doesn't have to be incompat if the old
> client invalidates it (e.g., via open with old feature set).

This sounds like a good idea too. I imagine when we add compat features
like this, we might want extra methods to add them to existing images
too.

> This might be useful also when we add locking (so that clients get EBUSY
> if multiple hosts try to map).
>
> - Will we have class methods for rbd_directory as well?  That seems like
> the simplest way to maintain backwards compatibility.  Also, if we keep
> the name, maybe rbd_header.* and rbd_data.* are more consistent.

Not sure what you mean about the object names.

We can add a class method for rbd_directory too, so we can change its
format when the old format is removed.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: RBD format changes and layering
  2012-05-25 23:07 ` Josh Durgin
@ 2012-05-29 22:08   ` Tommi Virtanen
  0 siblings, 0 replies; 11+ messages in thread
From: Tommi Virtanen @ 2012-05-29 22:08 UTC (permalink / raw)
  To: Josh Durgin; +Cc: ceph-devel

On Fri, May 25, 2012 at 4:07 PM, Josh Durgin <josh.durgin@inktank.com> wrote:
> To check whether children exist, you can iterate over
> all the pools and check the rbd_clones object in each one.
> Since the number of pools is relatively small, this isn't
> very expensive. If the pool is deleted, by definition all the children in it
> are deleted.
>
> With separate namespaces in the future, this will be a bit more
> expensive, but it's only needed at base image deletion time,
> which is relatively rare. Deleting the image itself already
> requires an I/O per object, so this is probably not the slow
> part anyway.
>
> Yehuda, Tv, did I miss anything?

One thing: that's still racy, and we discussed a solution.

1. A: walk through all pools, look for clones, find none
2. B: create a clone
3. A: rbd unpreserve parent
4. A: rbd rm parent

Oopsie.

To avoid that, I proposed a "deleting" flag. Clones can only be
created when parent is preserved && !deleting. Now,

1. A: rbd deleting parent
2. A: walk through all pools, look for clones, find none
3. B: attempt to create a clone, fails
...

Now, that doesn't have to be strictly "deleting".. "going_unpreserved"
or something; instead of deletion, the intended operation might be
starting a vm against the parent image to e.g. add security updates.

And, as we discussed, these flags would be per snapshot (or also on
the master image, if you want to support that). Thus, one snapshot can
preserved while an older one is scheduled for deletion.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2012-05-29 22:09 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-05-24 23:05 RBD format changes and layering Josh Durgin
2012-05-24 23:39 ` Yehuda Sadeh
2012-05-25 17:33   ` Josh Durgin
2012-05-25 14:57 ` Alex Elder
2012-05-25 20:21   ` Josh Durgin
2012-05-25 22:26     ` Sage Weil
2012-05-26  1:43       ` Josh Durgin
2012-05-25 20:55 ` Greg Farnum
2012-05-25 21:25   ` Josh Durgin
2012-05-25 23:07 ` Josh Durgin
2012-05-29 22:08   ` Tommi Virtanen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.