* RADOS translator for GlusterFS
[not found] <980181538.654650.1399300829103.JavaMail.zimbra@redhat.com>
@ 2014-05-05 15:21 ` Jeff Darcy
2014-05-05 15:37 ` Dan van der Ster
` (3 more replies)
0 siblings, 4 replies; 12+ messages in thread
From: Jeff Darcy @ 2014-05-05 15:21 UTC (permalink / raw)
To: ceph-devel; +Cc: gluster-devel
Now that we're all one big happy family, I've been mulling over
different ways that the two technology stacks could work together. One
idea would be to use some of the GlusterFS upper layers for their
interface and integration possibilities, but then fall through to RADOS
instead of GlusterFS's own distribution and replication. I must
emphasize that I don't necessarily think this is The Right Way for
anything real, but I think it's an important experiment just to see what
the problems are and how well it performs. So here's what I'm thinking.
For the Ceph folks, I'll describe just a tiny bit of how GlusterFS
works. The core concept in GlusterFS is a "translator" which accepts
file system requests and generates file system requests in exactly the
same form. This allows them to be stacked in arbitrary orders, moved
back and forth across the server/client divide, etc. There are several
broad classes of translators:
* Some, such as FUSE or GFAPI, inject new requests into the translator
stack.
* Some, such as "posix", satisfy requests by calling a server-local FS.
* The "client" and "server" translators together get requests from one
machine to another.
* Some translators *route* requests (one in to one of several out).
* Some translators *fan out* requests (one in to all of several out).
* Most are one in, one out, to add e.g. locks or caching etc.
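The translator classes listed above can be sketched in a few lines of Python. This is only an illustration of the stacking idea, not GlusterFS code: the class names (Store, Route, FanOut, Counting) and the Request shape are made up for the example, and hashing the path with CRC32 is an assumption standing in for whatever placement function a real routing translator uses.

```python
import zlib
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    op: str            # "write" or "read"
    path: str
    data: bytes = b""


class Translator:
    """Every layer accepts a Request and returns a reply of the same form."""

    def handle(self, req: Request) -> bytes:
        raise NotImplementedError


class Store(Translator):
    """Bottom of the stack: stands in for "posix" (a server-local FS)."""

    def __init__(self) -> None:
        self.files: Dict[str, bytes] = {}

    def handle(self, req: Request) -> bytes:
        if req.op == "write":
            self.files[req.path] = req.data
            return b"ok"
        return self.files.get(req.path, b"")


class Route(Translator):
    """One in, one of several out (the routing/distribution shape)."""

    def __init__(self, children: List[Translator]) -> None:
        self.children = children

    def handle(self, req: Request) -> bytes:
        index = zlib.crc32(req.path.encode()) % len(self.children)
        return self.children[index].handle(req)


class FanOut(Translator):
    """One in, all of several out for writes (the replication shape)."""

    def __init__(self, children: List[Translator]) -> None:
        self.children = children

    def handle(self, req: Request) -> bytes:
        if req.op == "write":
            replies = [child.handle(req) for child in self.children]
            return replies[0]
        return self.children[0].handle(req)


class Counting(Translator):
    """One in, one out: adds a side feature (here, a request counter)."""

    def __init__(self, child: Translator) -> None:
        self.child = child
        self.count = 0

    def handle(self, req: Request) -> bytes:
        self.count += 1
        return self.child.handle(req)


# counter -> route over two replica pairs -> four stores
stores = [Store() for _ in range(4)]
stack = Counting(Route([FanOut(stores[0:2]), FanOut(stores[2:4])]))
stack.handle(Request("write", "/a.txt", b"hello"))
reply = stack.handle(Request("read", "/a.txt"))
holders = sum(1 for s in stores if "/a.txt" in s.files)
```

Because every layer has the same handle() shape, the layers can be reordered or replaced freely, which is what makes swapping the bottom of the stack for something else conceivable.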
Of particular interest here are the DHT (routing/distribution) and AFR
(fan-out/replication) translators, which mirror functionality in RADOS.
My idea is to cut out everything from these on below, in favor of a
translator based on librados instead. How this works is pretty obvious
for file data - just read and write to RADOS objects instead of to
files. It's a bit less obvious for metadata, especially directory
entries. One really simple idea is to store metadata as data, in some
format defined by the translator itself, and have it handle the
read/modify/write for adding/deleting entries and such. That would be
enough to get some basic performance tests done. A slightly more
sophisticated idea might be to use OSD class methods to do the
read/modify/write, but I don't know much about that mechanism so I'm not
sure that's even feasible.
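A rough Python sketch of the "metadata as data" idea follows. It keeps a directory as one JSON-encoded object and does the read/modify/write on the client side. Everything here is hypothetical: ObjectStore is an in-memory stand-in for a pool rather than librados, the JSON layout is invented, and the version check used to detect a concurrent writer is an added assumption, since a plain read/modify/write would lose updates under contention.

```python
import json
from typing import Callable, Dict, Tuple


class ObjectStore:
    """In-memory stand-in for a pool: object name -> (version, bytes)."""

    def __init__(self) -> None:
        self._objects: Dict[str, Tuple[int, bytes]] = {}

    def read(self, oid: str) -> Tuple[int, bytes]:
        return self._objects.get(oid, (0, b""))

    def write_if_version(self, oid: str, expected: int, data: bytes) -> bool:
        """Write only if nobody else has written since `expected` was read."""
        current, _ = self._objects.get(oid, (0, b""))
        if current != expected:
            return False
        self._objects[oid] = (current + 1, data)
        return True


def load_dir(store: ObjectStore, dir_oid: str) -> Tuple[int, Dict[str, str]]:
    version, raw = store.read(dir_oid)
    entries = json.loads(raw) if raw else {}
    return version, entries


def update_dir(store: ObjectStore, dir_oid: str,
               mutate: Callable[[Dict[str, str]], None],
               max_retries: int = 10) -> None:
    """Read the directory object, modify it, write it back; retry on conflict."""
    for _ in range(max_retries):
        version, entries = load_dir(store, dir_oid)
        mutate(entries)
        payload = json.dumps(entries, sort_keys=True).encode()
        if store.write_if_version(dir_oid, version, payload):
            return
    raise RuntimeError("directory object is too contended")


def add_entry(store: ObjectStore, dir_oid: str, name: str, file_oid: str) -> None:
    def mutate(entries: Dict[str, str]) -> None:
        if name in entries:
            raise FileExistsError(name)
        entries[name] = file_oid
    update_dir(store, dir_oid, mutate)


def remove_entry(store: ObjectStore, dir_oid: str, name: str) -> None:
    def mutate(entries: Dict[str, str]) -> None:
        if name not in entries:
            raise FileNotFoundError(name)
        del entries[name]
    update_dir(store, dir_oid, mutate)


store = ObjectStore()
add_entry(store, "dir:/", "a.txt", "obj-1")
add_entry(store, "dir:/", "b.txt", "obj-2")
remove_entry(store, "dir:/", "a.txt")
version, entries = load_dir(store, "dir:/")
stale_write_accepted = store.write_if_version("dir:/", 1, b"{}")
```

The cost of this approach is visible in the sketch: every entry change rewrites the whole directory object, which is fine for basic performance tests but scales poorly with directory size.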
This is not something I'm going to be working on as part of my main job,
but I'd like to get the experiment started in some of my "spare" time.
Is there anyone else interested in collaborating, or are there any other
obvious ideas I'm missing?
* Re: RADOS translator for GlusterFS
2014-05-05 15:21 ` RADOS translator for GlusterFS Jeff Darcy
@ 2014-05-05 15:37 ` Dan van der Ster
2014-05-05 16:39 ` Yehuda Sadeh
` (2 subsequent siblings)
3 siblings, 0 replies; 12+ messages in thread
From: Dan van der Ster @ 2014-05-05 15:37 UTC (permalink / raw)
To: Jeff Darcy, ceph-devel; +Cc: gluster-devel
Hi,
On 05/05/14 17:21, Jeff Darcy wrote:
> Now that we're all one big happy family, I've been mulling over
> different ways that the two technology stacks could work together. One
> idea would be to use some of the GlusterFS upper layers for their
> interface and integration possibilities, but then falling down to RADOS
> instead of GlusterFS's own distribution and replication. I must
> emphasize that I don't necessarily think this is The Right Way for
> anything real, but I think it's an important experiment just to see what
> the problems are and how well it performs. So here's what I'm thinking.
>
> For the Ceph folks, I'll describe just a tiny bit of how GlusterFS
> works. The core concept in GlusterFS is a "translator" which accepts
> file system requests and generates file system requests in exactly the
> same form. This allows them to be stacked in arbitrary orders, moved
> back and forth across the server/client divide, etc. There are several
> broad classes of translators:
>
> * Some, such as FUSE or GFAPI, inject new requests into the translator
> stack.
>
> * Some, such as "posix", satisfy requests by calling a server-local FS.
>
> * The "client" and "server" translators together get requests from one
> machine to another.
>
> * Some translators *route* requests (one in to one of several out).
>
> * Some translators *fan out* requests (one in to all of several out).
>
> * Most are one in, one out, to add e.g. locks or caching etc.
>
> Of particular interest here are the DHT (routing/distribution) and AFR
> (fan-out/replication) translators, which mirror functionality in RADOS.
> My idea is to cut out everything from these on below, in favor of a
> translator based on librados instead. How this works is pretty obvious
> for file data - just read and write to RADOS objects instead of to
> files. It's a bit less obvious for metadata, especially directory
> entries. One really simple idea is to store metadata as data, in some
> format defined by the translator itself, and have it handle the
> read/modify/write for adding/deleting entries and such. That would be
> enough to get some basic performance tests done. A slightly more
> sophisticated idea might be to use OSD class methods to do the
> read/modify/write, but I don't know much about that mechanism so I'm not
> sure that's even feasible.
>
> This is not something I'm going to be working on as part of my main job,
> but I'd like to get the experiment started in some of my "spare" time.
> Is there anyone else interested in collaborating, or are there any other
> obvious ideas I'm missing?
Regarding obvious ideas, FWIW, I've been testing GlusterFS volumes which
distribute over a few VMs with locally attached RBDs. That seems to be
usable today and shouldn't lose data, but I guess it would do something
bad when individual VMs/RBDs go down.
I'm very new to gluster, but I can't think of a way to make this HA
without either replication at the gluster level (expensive) or making
gluster speak to RADOS directly.
Cheers, Dan
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: RADOS translator for GlusterFS
2014-05-05 15:21 ` RADOS translator for GlusterFS Jeff Darcy
2014-05-05 15:37 ` Dan van der Ster
@ 2014-05-05 16:39 ` Yehuda Sadeh
2014-05-05 17:08 ` Jeff Darcy
[not found] ` <355696287.706122.1399303290204.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-05-05 16:43 ` John Spray
3 siblings, 1 reply; 12+ messages in thread
From: Yehuda Sadeh @ 2014-05-05 16:39 UTC (permalink / raw)
To: Jeff Darcy; +Cc: ceph-devel, gluster-devel
On Mon, May 5, 2014 at 8:21 AM, Jeff Darcy <jdarcy@redhat.com> wrote:
> Now that we're all one big happy family, I've been mulling over
> different ways that the two technology stacks could work together. One
> idea would be to use some of the GlusterFS upper layers for their
> interface and integration possibilities, but then falling down to RADOS
> instead of GlusterFS's own distribution and replication. I must
> emphasize that I don't necessarily think this is The Right Way for
> anything real, but I think it's an important experiment just to see what
> the problems are and how well it performs. So here's what I'm thinking.
>
> For the Ceph folks, I'll describe just a tiny bit of how GlusterFS
> works. The core concept in GlusterFS is a "translator" which accepts
> file system requests and generates file system requests in exactly the
> same form. This allows them to be stacked in arbitrary orders, moved
> back and forth across the server/client divide, etc. There are several
> broad classes of translators:
>
> * Some, such as FUSE or GFAPI, inject new requests into the translator
> stack.
>
> * Some, such as "posix", satisfy requests by calling a server-local FS.
>
> * The "client" and "server" translators together get requests from one
> machine to another.
>
> * Some translators *route* requests (one in to one of several out).
>
> * Some translators *fan out* requests (one in to all of several out).
>
> * Most are one in, one out, to add e.g. locks or caching etc.
>
> Of particular interest here are the DHT (routing/distribution) and AFR
> (fan-out/replication) translators, which mirror functionality in RADOS.
> My idea is to cut out everything from these on below, in favor of a
> translator based on librados instead. How this works is pretty obvious
> for file data - just read and write to RADOS objects instead of to
> files. It's a bit less obvious for metadata, especially directory
Sorry if I'm missing something obvious, but how are reads / writes
actually done? Do you keep an open file descriptor and work on that
(e.g., are there open() / close() operations), or do operations not
require any state? With RADOS it's the latter case, so we don't
provide certain guarantees and there are no file-state operations
(like open(), close(), lock(), etc.). Anything like that needs to be
implemented on top of it.
> entries. One really simple idea is to store metadata as data, in some
> format defined by the translator itself, and have it handle the
> read/modify/write for adding/deleting entries and such. That would be
Maybe integrate it with the mds (which by itself stores metadata as
data and does all the relevant work)?
> enough to get some basic performance tests done. A slightly more
> sophisticated idea might be to use OSD class methods to do the
> read/modify/write, but I don't know much about that mechanism so I'm not
> sure that's even feasible.
I don't see why it wouldn't work. The rados gateway does things
similarly for handling the bucket index.
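To illustrate why a server-side method helps, here is a small Python model of the idea. It is not the Ceph object class API: FakeOsd, register(), call() and dir_add are hypothetical names, and a single in-process lock stands in for whatever serialization the OSD provides per object. The point is only that the read/modify/write runs next to the data, so clients send one request and cannot interleave with each other.

```python
import json
import threading
from typing import Callable, Dict, Tuple

# A method takes (old object bytes, argument bytes) and returns (new bytes, reply).
Method = Callable[[bytes, bytes], Tuple[bytes, bytes]]


class FakeOsd:
    """In-memory model of one OSD holding objects and registered methods."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}
        self._methods: Dict[str, Method] = {}
        self._lock = threading.Lock()

    def register(self, name: str, method: Method) -> None:
        self._methods[name] = method

    def call(self, oid: str, name: str, arg: bytes) -> bytes:
        """Run the whole read/modify/write under one lock, next to the data."""
        with self._lock:
            old = self._objects.get(oid, b"")
            new, reply = self._methods[name](old, arg)
            self._objects[oid] = new
            return reply

    def read(self, oid: str) -> bytes:
        with self._lock:
            return self._objects.get(oid, b"")


def dir_add(old: bytes, arg: bytes) -> Tuple[bytes, bytes]:
    """Add one directory entry unless the name is already present."""
    entries = json.loads(old) if old else {}
    request = json.loads(arg)
    if request["name"] in entries:
        return old, b"exists"
    entries[request["name"]] = request["oid"]
    return json.dumps(entries).encode(), b"ok"


osd = FakeOsd()
osd.register("dir_add", dir_add)


def worker(worker_id: int) -> None:
    for i in range(50):
        arg = json.dumps({"name": f"w{worker_id}-f{i}",
                          "oid": f"obj-{worker_id}-{i}"}).encode()
        osd.call("dir:/", "dir_add", arg)


threads = [threading.Thread(target=worker, args=(n,)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

entries = json.loads(osd.read("dir:/"))
duplicate_reply = osd.call(
    "dir:/", "dir_add", json.dumps({"name": "w0-f0", "oid": "other"}).encode())
```

Four concurrent writers end up with all 200 entries present and no client-side retry loop, which is the property one would hope to get from doing the update inside the OSD.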
>
> This is not something I'm going to be working on as part of my main job,
> but I'd like to get the experiment started in some of my "spare" time.
> Is there anyone else interested in collaborating, or are there any other
> obvious ideas I'm missing?
* Re: RADOS translator for GlusterFS
[not found] ` <355696287.706122.1399303290204.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2014-05-05 16:41 ` John Spray
0 siblings, 0 replies; 12+ messages in thread
From: John Spray @ 2014-05-05 16:41 UTC (permalink / raw)
To: Jeff Darcy; +Cc: Ceph Development, gluster-devel
In terms of making something work really quickly, one approach would be to
base off the existing POSIX translator, use a local FS backed by an RBD
volume for the metadata, and store the file content directly using
librados. That would avoid the need to invent a way to map
filesystem-style metadata to librados calls, while still getting reasonably
efficient data operations through to rados.
I would doubt this would be very slick, but it could be a fun hack!
John
On Mon, May 5, 2014 at 4:21 PM, Jeff Darcy <jdarcy@redhat.com> wrote:
> Now that we're all one big happy family, I've been mulling over
> different ways that the two technology stacks could work together. One
> idea would be to use some of the GlusterFS upper layers for their
> interface and integration possibilities, but then falling down to RADOS
> instead of GlusterFS's own distribution and replication. I must
> emphasize that I don't necessarily think this is The Right Way for
> anything real, but I think it's an important experiment just to see what
> the problems are and how well it performs. So here's what I'm thinking.
>
> For the Ceph folks, I'll describe just a tiny bit of how GlusterFS
> works. The core concept in GlusterFS is a "translator" which accepts
> file system requests and generates file system requests in exactly the
> same form. This allows them to be stacked in arbitrary orders, moved
> back and forth across the server/client divide, etc. There are several
> broad classes of translators:
>
> * Some, such as FUSE or GFAPI, inject new requests into the translator
> stack.
>
> * Some, such as "posix", satisfy requests by calling a server-local FS.
>
> * The "client" and "server" translators together get requests from one
> machine to another.
>
> * Some translators *route* requests (one in to one of several out).
>
> * Some translators *fan out* requests (one in to all of several out).
>
> * Most are one in, one out, to add e.g. locks or caching etc.
>
> Of particular interest here are the DHT (routing/distribution) and AFR
> (fan-out/replication) translators, which mirror functionality in RADOS.
> My idea is to cut out everything from these on below, in favor of a
> translator based on librados instead. How this works is pretty obvious
> for file data - just read and write to RADOS objects instead of to
> files. It's a bit less obvious for metadata, especially directory
> entries. One really simple idea is to store metadata as data, in some
> format defined by the translator itself, and have it handle the
> read/modify/write for adding/deleting entries and such. That would be
> enough to get some basic performance tests done. A slightly more
> sophisticated idea might be to use OSD class methods to do the
> read/modify/write, but I don't know much about that mechanism so I'm not
> sure that's even feasible.
>
> This is not something I'm going to be working on as part of my main job,
> but I'd like to get the experiment started in some of my "spare" time.
> Is there anyone else interested in collaborating, or are there any other
> obvious ideas I'm missing?
* Re: RADOS translator for GlusterFS
2014-05-05 15:21 ` RADOS translator for GlusterFS Jeff Darcy
` (2 preceding siblings ...)
[not found] ` <355696287.706122.1399303290204.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2014-05-05 16:43 ` John Spray
3 siblings, 0 replies; 12+ messages in thread
From: John Spray @ 2014-05-05 16:43 UTC (permalink / raw)
To: Ceph Development
In terms of making something work really quickly, one approach would
be to base off the existing POSIX translator, use a local FS backed by
an RBD volume for the metadata, and store the file content directly
using librados. That would avoid the need to invent a way to map
filesystem-style metadata to librados calls, while still getting
reasonably efficient data operations through to rados.
I would doubt this would be very slick, but it could be a fun hack!
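A toy Python version of that split, to show where each kind of state would live. All names are hypothetical: HybridStore is invented for the example, a temporary directory stands in for the local FS on an RBD volume, and a dict stands in for objects written through librados. Each metadata file holds nothing but the ID of the object carrying the file's content.

```python
import os
import tempfile
import uuid
from typing import Dict, List


class HybridStore:
    """Names and directories live in a local FS; file content lives in objects."""

    def __init__(self, metadata_root: str) -> None:
        self.root = metadata_root            # would be a local FS on an RBD volume
        self.objects: Dict[str, bytes] = {}  # stand-in for objects in a RADOS pool

    def _meta_path(self, path: str) -> str:
        return os.path.join(self.root, path.lstrip("/"))

    def _oid(self, path: str) -> str:
        with open(self._meta_path(path)) as f:
            return f.read()

    def create(self, path: str) -> str:
        oid = uuid.uuid4().hex
        meta = self._meta_path(path)
        os.makedirs(os.path.dirname(meta), exist_ok=True)
        with open(meta, "x") as f:           # "x" fails if the name already exists
            f.write(oid)
        self.objects[oid] = b""
        return oid

    def write(self, path: str, data: bytes) -> None:
        self.objects[self._oid(path)] = data

    def read(self, path: str) -> bytes:
        return self.objects[self._oid(path)]

    def listdir(self, path: str) -> List[str]:
        return sorted(os.listdir(self._meta_path(path)))


with tempfile.TemporaryDirectory() as root:
    fs = HybridStore(root)
    fs.create("/docs/a.txt")
    fs.write("/docs/a.txt", b"payload")
    listing = fs.listdir("/docs")
    content = fs.read("/docs/a.txt")
```

Directory listing, renames and permissions come for free from the local FS, while reads and writes never touch it beyond one small lookup.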
John
On Mon, May 5, 2014 at 4:21 PM, Jeff Darcy <jdarcy@redhat.com> wrote:
>
> Now that we're all one big happy family, I've been mulling over
> different ways that the two technology stacks could work together. One
> idea would be to use some of the GlusterFS upper layers for their
> interface and integration possibilities, but then falling down to RADOS
> instead of GlusterFS's own distribution and replication. I must
> emphasize that I don't necessarily think this is The Right Way for
> anything real, but I think it's an important experiment just to see what
> the problems are and how well it performs. So here's what I'm thinking.
>
> For the Ceph folks, I'll describe just a tiny bit of how GlusterFS
> works. The core concept in GlusterFS is a "translator" which accepts
> file system requests and generates file system requests in exactly the
> same form. This allows them to be stacked in arbitrary orders, moved
> back and forth across the server/client divide, etc. There are several
> broad classes of translators:
>
> * Some, such as FUSE or GFAPI, inject new requests into the translator
> stack.
>
> * Some, such as "posix", satisfy requests by calling a server-local FS.
>
> * The "client" and "server" translators together get requests from one
> machine to another.
>
> * Some translators *route* requests (one in to one of several out).
>
> * Some translators *fan out* requests (one in to all of several out).
>
> * Most are one in, one out, to add e.g. locks or caching etc.
>
> Of particular interest here are the DHT (routing/distribution) and AFR
> (fan-out/replication) translators, which mirror functionality in RADOS.
> My idea is to cut out everything from these on below, in favor of a
> translator based on librados instead. How this works is pretty obvious
> for file data - just read and write to RADOS objects instead of to
> files. It's a bit less obvious for metadata, especially directory
> entries. One really simple idea is to store metadata as data, in some
> format defined by the translator itself, and have it handle the
> read/modify/write for adding/deleting entries and such. That would be
> enough to get some basic performance tests done. A slightly more
> sophisticated idea might be to use OSD class methods to do the
> read/modify/write, but I don't know much about that mechanism so I'm not
> sure that's even feasible.
>
> This is not something I'm going to be working on as part of my main job,
> but I'd like to get the experiment started in some of my "spare" time.
> Is there anyone else interested in collaborating, or are there any other
> obvious ideas I'm missing?
* Re: RADOS translator for GlusterFS
2014-05-05 16:39 ` Yehuda Sadeh
@ 2014-05-05 17:08 ` Jeff Darcy
2014-05-05 17:30 ` Samuel Just
0 siblings, 1 reply; 12+ messages in thread
From: Jeff Darcy @ 2014-05-05 17:08 UTC (permalink / raw)
To: Yehuda Sadeh; +Cc: ceph-devel, gluster-devel
> > Of particular interest here are the DHT (routing/distribution) and AFR
> > (fan-out/replication) translators, which mirror functionality in RADOS.
> > My idea is to cut out everything from these on below, in favor of a
> > translator based on librados instead. How this works is pretty obvious
> > for file data - just read and write to RADOS objects instead of to
> > files. It's a bit less obvious for metadata, especially directory
>
> Sorry if I'm missing something obvious, but how are reads / writes
> actually done? Do you keep an open file descriptor and work on that
> (e.g., are there open() / close() operations), or do operations not
> require any state? With RADOS it's the latter case, so we don't
> provide certain guarantees and there are no file-state operations
> (like open(), close(), lock(), etc.). Anything like that needs to be
> implemented on top of it.
We'd have an open file descriptor on the client side, and associated with
that we would keep the OID for the corresponding RADOS object. In the
simplest case, we could just use those for rados_read/rados_write and not
worry about consistency. For stronger consistency, we'd need something
more. Would that be rados_watch/rados_notify or something else?
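The simplest case described above might look like the following Python sketch. It is deliberately not the librados API: FakeRados and its read/write methods are in-memory stand-ins, and FdTable is a hypothetical name. What it shows is that open/close state exists only on the client, while the object store sees stateless reads and writes addressed by OID.

```python
import itertools
from typing import Dict


class FakeRados:
    """In-memory stand-in for a pool; stateless reads and writes by OID."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytearray] = {}

    def write(self, oid: str, data: bytes, offset: int) -> None:
        obj = self.objects.setdefault(oid, bytearray())
        if len(obj) < offset:
            obj.extend(b"\0" * (offset - len(obj)))   # zero-fill any hole
        obj[offset:offset + len(data)] = data

    def read(self, oid: str, length: int, offset: int) -> bytes:
        return bytes(self.objects.get(oid, bytearray())[offset:offset + length])


class FdTable:
    """Client-side table: file descriptor -> OID. No state reaches the store."""

    def __init__(self, rados: FakeRados) -> None:
        self.rados = rados
        self._next_fd = itertools.count(3)
        self._open: Dict[int, str] = {}

    def open(self, oid: str) -> int:
        fd = next(self._next_fd)
        self._open[fd] = oid
        return fd

    def pwrite(self, fd: int, data: bytes, offset: int) -> int:
        self.rados.write(self._open[fd], data, offset)
        return len(data)

    def pread(self, fd: int, length: int, offset: int) -> bytes:
        return self.rados.read(self._open[fd], length, offset)

    def close(self, fd: int) -> None:
        del self._open[fd]


table = FdTable(FakeRados())
fd = table.open("file-0001")
table.pwrite(fd, b"hello world", 0)
table.pwrite(fd, b"RADOS", 6)
result = table.pread(fd, 11, 0)
table.close(fd)
```

Nothing here coordinates two clients that have the same object open, which is exactly the consistency gap the question about watch/notify is aimed at.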
> > entries. One really simple idea is to store metadata as data, in some
> > format defined by the translator itself, and have it handle the
> > read/modify/write for adding/deleting entries and such. That would be
>
> Maybe integrate it with the mds (which by itself stores metadata as
> data and does all the relevant work)?
Well, part of the point is not to go through the Ceph file system layer,
since that's almost guaranteed to be worse than using the Ceph file
system client. The question to be answered here is whether there's
something to be gained by mixing and matching somewhere in the middle,
as opposed to just layering one file system implementation on top of
the other.
> > enough to get some basic performance tests done. A slightly more
> > sophisticated idea might be to use OSD class methods to do the
> > read/modify/write, but I don't know much about that mechanism so I'm not
> > sure that's even feasible.
>
> I don't see why it wouldn't work. The rados gateway does things
> similarly for handling the bucket index.
Good to know. I'll take a look at how it does that. Thanks!
* Re: RADOS translator for GlusterFS
2014-05-05 17:08 ` Jeff Darcy
@ 2014-05-05 17:30 ` Samuel Just
2014-05-05 17:38 ` Jeff Darcy
0 siblings, 1 reply; 12+ messages in thread
From: Samuel Just @ 2014-05-05 17:30 UTC (permalink / raw)
To: Jeff Darcy; +Cc: Yehuda Sadeh, ceph-devel, gluster-devel
rados_watch/notify could probably be used for coordinating client access.
One important caveat is that rados objects should be limited in size
(4MB for rbd blocks), so you'll want to chunk files somewhere before
rados.
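As a sketch of what chunking before rados could mean, the Python function below maps a byte range of a file onto fixed-size objects. The 4 MiB size comes from the rbd figure above; the object naming scheme (file ID plus a hex chunk index) is an assumption made up for the example.

```python
from typing import List, Tuple

CHUNK_SIZE = 4 * 1024 * 1024   # 4 MiB, the rbd figure quoted above


def chunk_extents(file_id: str, offset: int, length: int,
                  chunk_size: int = CHUNK_SIZE) -> List[Tuple[str, int, int]]:
    """Split a file byte range into (object name, offset in object, length)."""
    extents = []
    end = offset + length
    while offset < end:
        index = offset // chunk_size            # which chunk object
        within = offset % chunk_size            # where inside that object
        span = min(chunk_size - within, end - offset)
        extents.append((f"{file_id}.{index:08x}", within, span))
        offset += span
    return extents


small = chunk_extents("f", 0, 10)
straddle = chunk_extents("f", CHUNK_SIZE - 1, 2)
large = chunk_extents("f", 0, 10 * 1024 * 1024)
```

A 10 MiB write from offset 0 turns into three object writes (4 MiB, 4 MiB, 2 MiB), so no single object grows past the chunk size regardless of file size.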
-Sam
On Mon, May 5, 2014 at 10:08 AM, Jeff Darcy <jdarcy@redhat.com> wrote:
>> > Of particular interest here are the DHT (routing/distribution) and AFR
>> > (fan-out/replication) translators, which mirror functionality in RADOS.
>> > My idea is to cut out everything from these on below, in favor of a
>> > translator based on librados instead. How this works is pretty obvious
>> > for file data - just read and write to RADOS objects instead of to
>> > files. It's a bit less obvious for metadata, especially directory
>>
>> Sorry if I'm missing something obvious, but how are reads / writes
>> actually done? Do you keep an open file descriptor and work on that
>> (e.g., are there open() / close() operations), or do operations not
>> require any state? With RADOS it's the latter case, so we don't
>> provide certain guarantees and there are no file-state operations
>> (like open(), close(), lock(), etc.). Anything like that needs to be
>> implemented on top of it.
>
> We'd have an open file descriptor on the client side, and associated with
> that we would keep the OID for the corresponding RADOS object. In the
> simplest case, we could just use those for rados_read/rados_write and not
> worry about consistency. For stronger consistency, we'd need something
> more. Would that be rados_watch/rados_notify or something else?
>
>> > entries. One really simple idea is to store metadata as data, in some
>> > format defined by the translator itself, and have it handle the
>> > read/modify/write for adding/deleting entries and such. That would be
>>
>> Maybe integrate it with the mds (which by itself stores metadata as
>> data and does all the relevant work)?
>
> Well, part of the point is not to go through the Ceph file system layer,
> since that's almost guaranteed to be worse than using the Ceph file
> system client. The question to be answered here is whether there's
> something to be gained by mixing and matching somewhere in the middle,
> as opposed to just layering one file system implementation on top of
> the other.
>
>> > enough to get some basic performance tests done. A slightly more
>> > sophisticated idea might be to use OSD class methods to do the
>> > read/modify/write, but I don't know much about that mechanism so I'm not
>> > sure that's even feasible.
>>
>> I don't see why it wouldn't work. The rados gateway does things
>> similarly for handling the bucket index.
>
> Good to know. I'll take a look at how it does that. Thanks!
* Re: RADOS translator for GlusterFS
2014-05-05 17:30 ` Samuel Just
@ 2014-05-05 17:38 ` Jeff Darcy
[not found] ` <1666953774.790843.1399311496408.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 12+ messages in thread
From: Jeff Darcy @ 2014-05-05 17:38 UTC (permalink / raw)
To: Samuel Just; +Cc: Yehuda Sadeh, ceph-devel, gluster-devel
> One important caveat is that rados objects should be limited in size
> (4MB for rbd blocks), so you'll want to chunk files somewhere before
> rados.
OK, that's going to make life interesting. How dire are the results
of not chunking like this? Is it just that the data won't be
distributed across multiple OSDs and therefore you'll only get one
OSD's worth of throughput? Or is it something worse? There might
be a way that we can use the GlusterFS striping translator on top
of the RADOS translator, but since it currently sits one level lower
(still above replication but below distribution) there might be some
issues there.
* Re: RADOS translator for GlusterFS
[not found] ` <1666953774.790843.1399311496408.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2014-05-05 17:46 ` Samuel Just
2014-05-05 18:07 ` Jeff Darcy
0 siblings, 1 reply; 12+ messages in thread
From: Samuel Just @ 2014-05-05 17:46 UTC (permalink / raw)
To: Jeff Darcy; +Cc: ceph-devel, gluster-devel, Yehuda Sadeh
It's very important: several kinds of blocking are done at object
granularity. Off the top of my head, large objects would cause deep
scrub and recovery to stall requests for longer. Elephant objects
would also be able to skew data distribution.
-Sam
On Mon, May 5, 2014 at 10:38 AM, Jeff Darcy <jdarcy@redhat.com> wrote:
>> One important caveat is that rados objects should be limited in size
>> (4MB for rbd blocks), so you'll want to chunk files somewhere before
>> rados.
>
> OK, that's going to make life interesting. How dire are the results
> of not chunking like this? Is it just that the data won't be
> distributed across multiple OSDs and therefore you'll only get one
> OSD's worth of throughput? Or is it something worse? There might
> be a way that we can use the GlusterFS striping translator on top
> of the RADOS translator, but since it currently sits one level lower
> (still above replication but below distribution) there might be some
> issues there.
* Re: RADOS translator for GlusterFS
2014-05-05 17:46 ` Samuel Just
@ 2014-05-05 18:07 ` Jeff Darcy
2014-05-05 18:23 ` Samuel Just
[not found] ` <324933830.809209.1399313264579.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
0 siblings, 2 replies; 12+ messages in thread
From: Jeff Darcy @ 2014-05-05 18:07 UTC (permalink / raw)
To: Samuel Just; +Cc: Yehuda Sadeh, ceph-devel, gluster-devel
> It's very important: several kinds of blocking are done at object
> granularity. Off the top of my head, large objects would cause deep
> scrub and recovery to stall requests for longer. Elephant objects
> would also be able to skew data distribution.
There are some definite parallels here to discussions we've had in
Gluster-land, which we might as well go through because people from
either "parent" won't have heard the other. The data distribution
issue has turned out to be a practical non-issue for GlusterFS
users. Sure, if you have very few "elephant objects" on very few
small-ish bricks (our equivalent of OSDs) then you can get skewed
distribution. On the other hand, that problem *very* quickly
solves itself for even moderate object and brick counts, to the
point that almost no users have found it useful to enable striping.
Has your experience been different, or do you not know because
striping is mandatory instead of optional?
The "deep scrub and recovery" point brings up a whole different
set of memories. We used to have a problem in GlusterFS where
self-heal would lock an entire file while it ran, so other access
to that file would be blocked for a long time. This would cause
VMs to hang, for example. In either 3.3 or 3.4 (can't remember)
we added "granular self-heal" which would only lock the portion
of the file that was currently under repair, in a sort of rolling
fashion. From your comment, it sounds like RADOS still locks the
entire object. Is that correct? If so, I posit that it's
something we wouldn't need to solve in a prototype. If/when that
starts turning into something real, then we'd have two options.
One is to do striping as you suggest, which means solving all of
the associated coordination problems. Another would be to do
something like what GlusterFS did, with locking at the sub-object
level. That does make repair less atomic, which some would
consider a consistency problem, but we do have some evidence that
it's a violation users don't seem to care about.
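The rolling, sub-object locking described above can be sketched in Python as follows. This is a simplified model, not the GlusterFS self-heal code: RangeLocks and heal() are hypothetical names, both copies are assumed to have the same length, and a busy region raises an error where a real healer would wait. The probe callback exists only to show what other clients could or could not lock while one region is under repair.

```python
import threading
from typing import Callable, List, Optional, Tuple


class RangeLocks:
    """Tracks locked byte ranges of one file; overlapping requests are refused."""

    def __init__(self) -> None:
        self._held: List[Tuple[int, int]] = []
        self._mutex = threading.Lock()

    def try_lock(self, start: int, end: int) -> bool:
        with self._mutex:
            for held_start, held_end in self._held:
                if start < held_end and held_start < end:   # half-open overlap
                    return False
            self._held.append((start, end))
            return True

    def unlock(self, start: int, end: int) -> None:
        with self._mutex:
            self._held.remove((start, end))


def heal(good: bytes, stale: bytearray, locks: RangeLocks, region: int,
         on_region: Optional[Callable[[int, int], None]] = None) -> None:
    """Copy `good` into `stale` one region at a time, locking only that region."""
    for start in range(0, len(good), region):
        end = min(start + region, len(good))
        if not locks.try_lock(start, end):
            raise RuntimeError("region busy")   # a real healer would wait here
        try:
            if on_region is not None:
                on_region(start, end)
            stale[start:end] = good[start:end]
        finally:
            locks.unlock(start, end)


good = b"AAAABBBBCCCC"
stale = bytearray(b"AAAAxxxxCCCC")
locks = RangeLocks()
probes: List[bool] = []


def probe(start: int, end: int) -> None:
    """While bytes 4..8 are under repair, see what another client could lock."""
    if start == 4:
        outside = locks.try_lock(0, 2)
        probes.append(outside)
        if outside:
            locks.unlock(0, 2)
        probes.append(locks.try_lock(5, 6))


heal(good, stale, locks, region=4, on_region=probe)
```

While the middle region is being repaired, a lock on bytes 0..2 succeeds and a lock on bytes 5..6 is refused, so only I/O that overlaps the region currently being healed has to wait.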
* Re: RADOS translator for GlusterFS
2014-05-05 18:07 ` Jeff Darcy
@ 2014-05-05 18:23 ` Samuel Just
[not found] ` <324933830.809209.1399313264579.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
1 sibling, 0 replies; 12+ messages in thread
From: Samuel Just @ 2014-05-05 18:23 UTC (permalink / raw)
To: Jeff Darcy; +Cc: Yehuda Sadeh, ceph-devel, gluster-devel
We do essentially lock entire objects for many purposes. This isn't
generally a problem (and greatly simplifies many bits of the
implementation) because all existing rados users employ some form of
chunking/striping. That said, it's probably a good thing to punt on
for a prototype.
-Sam
On Mon, May 5, 2014 at 11:07 AM, Jeff Darcy <jdarcy@redhat.com> wrote:
>> It's very important: several kinds of blocking are done at object
>> granularity. Off the top of my head, large objects would cause deep
>> scrub and recovery to stall requests for longer. Elephant objects
>> would also be able to skew data distribution.
>
> There are some definite parallels here to discussions we've had in
> Gluster-land, which we might as well go through because people from
> either "parent" won't have heard the other. The data distribution
> issue has turned out to be a practical non-issue for GlusterFS
> users. Sure, if you have very few "elephant objects" on very few
> small-ish bricks (our equivalent of OSDs) then you can get skewed
> distribution. On the other hand, that problem *very* quickly
> solves itself for even moderate object and brick counts, to the
> point that almost no users have found it useful to enable striping.
> Has your experience been different, or do you not know because
> striping is mandatory instead of optional?
>
> The "deep scrub and recovery" point brings up a whole different
> set of memories. We used to have a problem in GlusterFS where
> self-heal would lock an entire file while it ran, so other access
> to that file would be blocked for a long time. This would cause
> VMs to hang, for example. In either 3.3 or 3.4 (can't remember)
> we added "granular self-heal" which would only lock the portion
> of the file that was currently under repair, in a sort of rolling
> fashion. From your comment, it sounds like RADOS still locks the
> entire object. Is that correct? If so, I posit that it's
> something we wouldn't need to solve in a prototype. If/when that
> starts turning into something real, then we'd have two options.
> One is to do striping as you suggest, which means solving all of
> the associated coordination problems. Another would be to do
> something like what GlusterFS did, with locking at the sub-object
> level. That does make repair less atomic, which some would
> consider a consistency problem, but we do have some evidence that
> it's a violation users don't seem to care about.
* Re: RADOS translator for GlusterFS
[not found] ` <324933830.809209.1399313264579.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2014-05-05 20:25 ` Sebastien Ponce
0 siblings, 0 replies; 12+ messages in thread
From: Sebastien Ponce @ 2014-05-05 20:25 UTC (permalink / raw)
To: Jeff Darcy
Cc: ceph-devel, gluster-devel, Yehuda Sadeh,
Samuel Just
> The data distribution
> issue has turned out to be a practical non-issue for GlusterFS
> users. Sure, if you have very few "elephant objects" on very few
> small-ish bricks (our equivalent of OSDs) then you can get skewed
> distribution. On the other hand, that problem *very* quickly
> solves itself for even moderate object and brick counts, to the
> point that almost no users have found it useful to enable striping.
> Has your experience been different, or do you not know because
> striping is mandatory instead of optional?
We do have an example of a case where striping is needed here at
CERN: we are starting to test rados as a backend for the disk cache of
our mass storage system (read: tape backend). There, files can
be really big (up to the TB level) and we need parallel access to
be able to feed our tape drives at their limit of > 250MB/s.
The solution has been to implement a layer of striping on top of rados
that "hides" the striping while basically keeping the rados interface
and (most of) the consistency and locking.
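For readers unfamiliar with striping, here is a generic RAID-0 style layout in Python. It is not taken from the blueprint or the pull request: the parameter names (stripe_unit, stripe_count, object_size) and the tiny sizes are assumptions chosen so the mapping can be checked by hand. Consecutive stripe units go round-robin across stripe_count objects, and a new set of objects starts once each object in the set is full, which is what lets several objects be read or written in parallel.

```python
from typing import List, Tuple


def locate(offset: int, stripe_unit: int, stripe_count: int,
           object_size: int) -> Tuple[int, int]:
    """Map a logical byte offset to (object number, offset inside that object)."""
    units_per_object = object_size // stripe_unit
    block = offset // stripe_unit                 # which stripe unit overall
    stripe = block // stripe_count                # which row of stripe units
    position = block % stripe_count               # which object within the set
    object_set = stripe // units_per_object       # which group of objects
    object_no = object_set * stripe_count + position
    in_object = (stripe % units_per_object) * stripe_unit + offset % stripe_unit
    return object_no, in_object


def split(offset: int, length: int, stripe_unit: int, stripe_count: int,
          object_size: int) -> List[Tuple[int, int, int]]:
    """Split a logical range into (object number, object offset, length) pieces."""
    pieces = []
    end = offset + length
    while offset < end:
        object_no, in_object = locate(offset, stripe_unit, stripe_count, object_size)
        span = min(stripe_unit - offset % stripe_unit, end - offset)
        pieces.append((object_no, in_object, span))
        offset += span
    return pieces


# Tiny sizes for readability: 4-byte stripe units, 2 objects per set, 8-byte objects.
layout = dict(stripe_unit=4, stripe_count=2, object_size=8)
first = locate(0, **layout)
second_object = locate(5, **layout)
wrapped = locate(9, **layout)
next_set = locate(16, **layout)
pieces = split(2, 8, **layout)
```

An 8-byte read starting at offset 2 touches object 0, then object 1, then object 0 again, so with realistic sizes a large sequential transfer keeps stripe_count objects busy at once.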
This is currently not integrated in mainline Ceph but is available
and ready for merge. See the blueprint at
https://wiki.ceph.com/Planning/Blueprints/Firefly/Object_striping_in_librados#section_4
and the implementation/pull request at
https://github.com/ceph/ceph/pull/1186.
By the way, it needs some review :-)
Sebastien