* ceph and efficient access of distributed resources
@ 2013-04-12  3:59 Matthias Urlichs
  2013-04-12 16:08 ` Mark Nelson
  0 siblings, 1 reply; 15+ messages in thread
From: Matthias Urlichs @ 2013-04-12  3:59 UTC (permalink / raw)
  To: ceph-devel

As I understand it, in Ceph one can cluster storage nodes, but otherwise
every node is essentially identical, so if three storage nodes have a file,
ceph randomly uses one of them.

This is not efficient use of network resources in a distributed data center.
Or even in a multi-rack situation.

I want to prefer accessing nodes which are "local".
The client in rack A should prefer to read from the storage nodes that are
also in rack A.
Ditto for rack B.
Ditto for s/rack/data center/.

As far as I understand, the Ceph clients can't do that.
(Nor can Ceph nodes among each other, but I care less about that, as most
traffic is reading data.)

I think this is an important feature for many high-reliability situations.

What would be the next steps to get this feature, assuming I don't have time
to implement it myself? Persistently annoy this mailing list that people
need it? Offer to pay for implementing it? Shut up and look for some other
solution -- which I already did, but I didn't find any that's as good as
Ceph, otherwise?

I opened a feature request for this half a year ago, which hasn't seen
any comments yet: http://tracker.ceph.com/issues/3249

-- Matthias Urlichs



* Re: ceph and efficient access of distributed resources
  2013-04-12  3:59 ceph and efficient access of distributed resources Matthias Urlichs
@ 2013-04-12 16:08 ` Mark Nelson
  2013-04-12 16:20   ` Gregory Farnum
  2013-04-15 20:06   ` Gandalf Corvotempesta
  0 siblings, 2 replies; 15+ messages in thread
From: Mark Nelson @ 2013-04-12 16:08 UTC (permalink / raw)
  To: Matthias Urlichs; +Cc: ceph-devel@vger.kernel.org

On 04/11/2013 10:59 PM, Matthias Urlichs wrote:
> As I understand it, in Ceph one can cluster storage nodes, but otherwise
> every node is essentially identical, so if three storage nodes have a file,
> ceph randomly uses one of them.

Ceph clusters have the concept of pools, where each pool has a certain 
number of placement groups.  Placement groups are just collections of 
mappings to OSDs.  Each PG has a primary OSD and a number of secondary 
ones, based on the replication level you set when you make the pool. 
When an object gets written to the cluster, CRUSH will determine which 
PG the data should be sent to.  The data will first hit the primary OSD
and then be replicated out to the other OSDs in the same placement group.

Currently reads always come from the primary OSD in the placement group 
rather than a secondary even if the secondary is closer to the client. 
I'm guessing there are probably some tricks that could be played here to 
best determine which machines should service which clients, but it's not 
exactly an easy problem.  In many cases spreading reads out over all of 
the OSDs in the cluster is better than trying to optimize reads to only 
hit local OSDs.  Ideally you probably want to prefer local OSDs first, 
but not exclusively.
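
A minimal sketch of the placement path described above, for illustration
only. It hashes the object name to a PG and uses a toy hash ranking as a
stand-in for CRUSH when mapping the PG to OSDs; the function names, the
pool name "rbd", and the 12-OSD cluster are assumptions made for this
example, not Ceph's actual algorithm or API.

```python
import hashlib


def stable_hash(text):
    """Deterministic hash; Python's built-in hash() is salted per process."""
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")


def object_to_pg(object_name, pg_num):
    """Map an object name to one of the pool's pg_num placement groups."""
    return stable_hash(object_name) % pg_num


def pg_to_osds(pool, pg, osds, replicas):
    """Choose `replicas` distinct OSDs for a PG; the first is the primary.

    Toy stand-in for CRUSH: rank OSDs by a per-PG hash and keep the top ones.
    """
    ranked = sorted(osds, key=lambda osd: stable_hash(f"{pool}/{pg}/{osd}"))
    return ranked[:replicas]


def locate(pool, object_name, pg_num, osds, replicas):
    """Return (pg, primary, secondaries) for an object."""
    pg = object_to_pg(object_name, pg_num)
    acting = pg_to_osds(pool, pg, osds, replicas)
    return pg, acting[0], acting[1:]


if __name__ == "__main__":
    # Hypothetical cluster: 12 OSDs, a pool with 128 PGs, replication level 3.
    pg, primary, secondaries = locate("rbd", "style.css", 128, list(range(12)), 3)
    print(f"pg={pg} primary=osd.{primary} secondaries={secondaries}")
```

Every client computes the same mapping, so all reads of one object land
on the same primary, which is the behaviour discussed in this thread.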

>
> This is not efficient use of network resources in a distributed data center.
> Or even in a multi-rack situation.
>
> I want to prefer accessing nodes which are "local".
> The client in rack A should prefer to read from the storage nodes that are
> also in rack A.
> Ditto for rack B.
> Ditto for s/rack/data center/.
>
> As far as I understand, the Ceph clients can't do that.
> (Nor can Ceph nodes among each other, but I care less about that, as most
> traffic is reading data.)
>
> I think this is an important feature for many high-reliability situations.
>
> What would be the next steps to get this feature, assuming I don't have time
> to implement it myself? Persistently annoy this mailing list that people
> need it? Offer to pay for implementing it? Shut up and look for some other
> solution -- which I already did, but I didn't find any that's as good as
> Ceph, otherwise?

I don't really have that much insight into the product roadmap, but I 
assume that if you spoke to some of our business folks about paying for 
development work you'd at least get a response.

>
> I opened a feature request for this half a year ago, which hasn't seen
> any comments yet: http://tracker.ceph.com/issues/3249

Sadly there are a lot of things we'd like to do and not enough time to do
them. :(  If we get a lot of requests for this from other people too, it 
might bump the priority up.

>
> -- Matthias Urlichs
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



* Re: ceph and efficient access of distributed resources
  2013-04-12 16:08 ` Mark Nelson
@ 2013-04-12 16:20   ` Gregory Farnum
  2013-04-13  2:32     ` Chen, Xiaoxi
  2013-04-15 20:06   ` Gandalf Corvotempesta
  1 sibling, 1 reply; 15+ messages in thread
From: Gregory Farnum @ 2013-04-12 16:20 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Matthias Urlichs, ceph-devel@vger.kernel.org

I was in the middle of writing a response to this when Mark's email
came in, so I'll just add a few things:

On Fri, Apr 12, 2013 at 9:08 AM, Mark Nelson <mark.nelson@inktank.com> wrote:
> On 04/11/2013 10:59 PM, Matthias Urlichs wrote:
>>
>> As I understand it, in Ceph one can cluster storage nodes, but otherwise
>> every node is essentially identical, so if three storage nodes have a
>> file,
>> ceph randomly uses one of them.
>
>
> Ceph clusters have the concept of pools, where each pool has a certain
> number of placement groups.  Placement groups are just collections of
> mappings to OSDs.  Each PG has a primary OSD and a number of secondary ones,
> based on the replication level you set when you make the pool. When an
> object gets written to the cluster, CRUSH will determine which PG the data
> should be sent to.  The data will first hit the primary OSD and then
> be replicated out to the other OSDs in the same placement group.
>
> Currently reads always come from the primary OSD in the placement group
> rather than a secondary even if the secondary is closer to the client. I'm
> guessing there are probably some tricks that could be played here to best
> determine which machines should service which clients, but it's not exactly
> an easy problem.  In many cases spreading reads out over all of the OSDs in
> the cluster is better than trying to optimize reads to only hit local OSDs.
> Ideally you probably want to prefer local OSDs first, but not exclusively.

In addition to just determining the locality (which we've started on
via external interfaces), this has a number of consistency challenges
associated with it. The infrastructure we have to allow reading from
non-primaries tends to involve clients having different consistency
expectations, and it's not fully explored yet or set up so that
clients can choose to read from a specific non-primary — the options
currently are "local if available and we can tell", "random", and
"primary".

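The three read options named above ("local if available and we can
tell", "random", and "primary") can be pictured with the following
sketch. It is purely illustrative: the function, its parameters, and the
rack map are hypothetical and do not correspond to any Ceph client API,
and it ignores the consistency questions raised in the same paragraph.

```python
import random


def choose_read_target(acting, policy, client_rack=None, osd_rack=None, rng=random):
    """Pick the OSD to read from.

    acting      -- OSD ids holding the object, primary first
    policy      -- "primary", "random", or "local"
    client_rack -- rack of the client, or None if unknown
    osd_rack    -- dict mapping OSD id to rack name, or None if unknown
    """
    if policy == "primary":
        return acting[0]
    if policy == "random":
        return rng.choice(acting)
    if policy == "local":
        # "Local if available and we can tell": only use locality when both
        # the client's rack and the OSDs' racks are known.
        if client_rack is not None and osd_rack:
            for osd in acting:
                if osd_rack.get(osd) == client_rack:
                    return osd
        return acting[0]  # fall back to the primary
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    racks = {3: "A", 1: "B", 2: "C"}  # hypothetical OSD-to-rack map
    print(choose_read_target([3, 1, 2], "local", client_rack="B", osd_rack=racks))
```
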

>> This is not efficient use of network resources in a distributed data
>> center.
>> Or even in a multi-rack situation.
>>
>> I want to prefer accessing nodes which are "local".
>> The client in rack A should prefer to read from the storage nodes that are
>> also in rack A.
>> Ditto for rack B.
>> Ditto for s/rack/data center/.

I do want to ask if you're sure this is as useful as you think it is.
There are use cases where it would be, but since writes have to
traverse these links (at a multiple of the actual write count) as well
you should be very certain. :)

>> As far as I understand, the Ceph clients can't do that.
>> (Nor can Ceph nodes among each other, but I care less about that, as most
>> traffic is reading data.)
>>
>> I think this is an important feature for many high-reliability situations.
>>
>> What would be the next steps to get this feature, assuming I don't have
>> time
>> to implement it myself? Persistently annoy this mailing list that people
>> need it? Offer to pay for implementing it? Shut up and look for some other
>> solution -- which I already did, but I didn't find any that's as good as
>> Ceph, otherwise?
>
>
> I don't really have that much insight into the product roadmap, but I assume
> that if you spoke to some of our business folks about paying for development
> work you'd at least get a response.

Yeah. It's not a feature in large enough demand right now that we can
see to be worth bumping up over other things, but I don't think
anybody's opposed to it existing. As with Mark I have no idea if
you're best off asking us or others to do things for money, but it
would certainly have to go through business channels. (If somebody
outside Inktank did want to implement this feature, I'd love to talk
to them about it on an informal but ongoing basis during development.)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

* Re: ceph and efficient access of distributed resources
  2013-04-12 16:20   ` Gregory Farnum
@ 2013-04-13  2:32     ` Chen, Xiaoxi
  2013-04-15 16:42       ` Gregory Farnum
  0 siblings, 1 reply; 15+ messages in thread
From: Chen, Xiaoxi @ 2013-04-13  2:32 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Mark Nelson, Matthias Urlichs, ceph-devel@vger.kernel.org

We are also discussing this internally and have come up with an idea to work around it (only for the RBD case; we haven't thought about the object store yet), but it is not yet tested. If Mark and Greg could provide some feedback, that would be great.

We are trying to write a script that generates one pool per rack: for rack A there is a pool A whose CRUSH ruleset chooses an OSD in rack A as the primary. So with 10 racks we would have 10 pools and 10 rules.

When a VM is migrated to another rack, or a volume is detached and attached to a VM hosted in another rack, a data migration is needed. We are thinking about how to make such a migration smooth.
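
A rough sketch of what such a generator script might look like, under several assumptions: the rack bucket names (rackA, rackB, ...), the root bucket "default", the pool naming scheme, the ruleset numbering, and pg_num are all hypothetical. The rule text follows the "take the first replica from one subtree, the rest from another" pattern of the CRUSH map text format; the second step does not guarantee that the remaining replicas avoid the primary's rack, so a real rule would need testing. The `ceph osd pool` commands use the `crush_ruleset` setting as named in Ceph releases of this period.

```python
def rack_primary_rule(rack, ruleset, root="default"):
    """Return CRUSH rule text whose first (primary) replica comes from `rack`."""
    return "\n".join([
        f"rule {rack}-primary {{",
        f"\truleset {ruleset}",
        "\ttype replicated",
        "\tmin_size 1",
        "\tmax_size 10",
        f"\tstep take {rack}",                      # primary from the local rack
        "\tstep chooseleaf firstn 1 type host",
        "\tstep emit",
        f"\tstep take {root}",                      # remaining replicas from anywhere
        "\tstep chooseleaf firstn -1 type rack",
        "\tstep emit",
        "}",
    ])


def pool_commands(rack, ruleset, pg_num=128):
    """Return the commands that create the rack's pool and bind it to the rule."""
    pool = f"pool-{rack}"
    return [
        f"ceph osd pool create {pool} {pg_num}",
        f"ceph osd pool set {pool} crush_ruleset {ruleset}",
    ]


def generate(racks, first_ruleset=10):
    """Return one (rule_text, commands) pair per rack."""
    plan = []
    for i, rack in enumerate(racks):
        ruleset = first_ruleset + i
        plan.append((rack_primary_rule(rack, ruleset), pool_commands(rack, ruleset)))
    return plan


if __name__ == "__main__":
    for rule, commands in generate(["rackA", "rackB"]):
        print(rule)
        print("\n".join(commands))
        print()
```

The script only prints rule text and commands; the rules would still have to be merged into the decompiled CRUSH map and compiled back before the pools could use them.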

Sent from my iPhone


* Re: ceph and efficient access of distributed resources
  2013-04-13  2:32     ` Chen, Xiaoxi
@ 2013-04-15 16:42       ` Gregory Farnum
  2013-04-15 23:14         ` Chen, Xiaoxi
  0 siblings, 1 reply; 15+ messages in thread
From: Gregory Farnum @ 2013-04-15 16:42 UTC (permalink / raw)
  To: Chen, Xiaoxi; +Cc: Mark Nelson, Matthias Urlichs, ceph-devel@vger.kernel.org

Yeah, this is very much like what DreamHost is doing with their
DreamCompute installation (you can find some talks about it online, I
believe, though I'm not sure how much detail they include there versus
in the Q&As).

On Fri, Apr 12, 2013 at 7:32 PM, Chen, Xiaoxi <xiaoxi.chen@intel.com> wrote:
> We are also discussing this internally and have come up with an idea to work around it (only for the RBD case; we haven't thought about the object store yet), but it is not yet tested. If Mark and Greg could provide some feedback, that would be great.
>
> We are trying to write a script that generates one pool per rack: for rack A there is a pool A whose CRUSH ruleset chooses an OSD in rack A as the primary. So with 10 racks we would have 10 pools and 10 rules.
>
> When a VM is migrated to another rack, or a volume is detached and attached to a VM hosted in another rack, a data migration is needed. We are thinking about how to make such a migration smooth.

This is one of the use cases that layering is designed to handle (in
addition to standard cloning and snapshots). Just create a clone that
lives in the new pool, and either let it copy-up to the new position
lazily or run the command at a time when you know your network is less
busy.
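
A sketch of the clone-based move described above, assuming the "command"
meant is `rbd flatten` and that the image uses the format that supports
layering (format 2). The pool and image names and the snapshot name
"migrate" are hypothetical; the helper only builds the command lines and
does not run them.

```python
def migration_commands(src_pool, image, dst_pool, snap="migrate", flatten=False):
    """Return rbd command lines that clone an image into another pool.

    With flatten=False the clone keeps reading unmodified data from its
    parent and copies it up lazily; flatten=True adds the explicit copy
    step, to be run when the network is less busy.
    """
    src_snap = f"{src_pool}/{image}@{snap}"
    dst = f"{dst_pool}/{image}"
    commands = [
        ["rbd", "snap", "create", src_snap],   # snapshot the source image
        ["rbd", "snap", "protect", src_snap],  # clones need a protected snapshot
        ["rbd", "clone", src_snap, dst],       # the clone lives in the new pool
    ]
    if flatten:
        commands.append(["rbd", "flatten", dst])  # copy all parent data now
    return commands


if __name__ == "__main__":
    for cmd in migration_commands("pool-rackA", "vm-disk-1", "pool-rackB", flatten=True):
        print(" ".join(cmd))
```
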
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


* Re: ceph and efficient access of distributed resources
  2013-04-12 16:08 ` Mark Nelson
  2013-04-12 16:20   ` Gregory Farnum
@ 2013-04-15 20:06   ` Gandalf Corvotempesta
  2013-04-15 22:25     ` Dan Mick
  1 sibling, 1 reply; 15+ messages in thread
From: Gandalf Corvotempesta @ 2013-04-15 20:06 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Matthias Urlichs, ceph-devel@vger.kernel.org

2013/4/12 Mark Nelson <mark.nelson@inktank.com>

> Currently reads always come from the primary OSD in the placement group
> rather than a secondary even if the secondary is closer to the client.
>

This way, only one OSD is involved in reading an object, which will
become a bottleneck if multiple clients need to access the same file.

For example, a 3KB CSS file served by a web server to 400 users will be
read from just one OSD. 400 users are directed to 1 OSD while (with
3 replicas) 2 other OSDs are available?


* Re: ceph and efficient access of distributed resources
  2013-04-15 20:06   ` Gandalf Corvotempesta
@ 2013-04-15 22:25     ` Dan Mick
  2013-04-15 22:38       ` Mark Kampe
  0 siblings, 1 reply; 15+ messages in thread
From: Dan Mick @ 2013-04-15 22:25 UTC (permalink / raw)
  To: Gandalf Corvotempesta
  Cc: Mark Nelson, Matthias Urlichs, ceph-devel@vger.kernel.org


On 04/15/2013 01:06 PM, Gandalf Corvotempesta wrote:
> 2013/4/12 Mark Nelson <mark.nelson@inktank.com>
>
>> Currently reads always come from the primary OSD in the placement group
>> rather than a secondary even if the secondary is closer to the client.
>>
>
> This way, only one OSD is involved in reading an object, which will
> become a bottleneck if multiple clients need to access the same file.
>
> For example, a 3KB CSS file served by a web server to 400 users will be
> read from just one OSD. 400 users are directed to 1 OSD while (with
> 3 replicas) 2 other OSDs are available?

Yes.  Consistency across the cluster is dependent on this scheme, currently.



* Re: ceph and efficient access of distributed resources
  2013-04-15 22:25     ` Dan Mick
@ 2013-04-15 22:38       ` Mark Kampe
  2013-04-16  7:20         ` Gandalf Corvotempesta
  0 siblings, 1 reply; 15+ messages in thread
From: Mark Kampe @ 2013-04-15 22:38 UTC (permalink / raw)
  To: Gandalf Corvotempesta; +Cc: Matthias Urlichs, ceph-devel@vger.kernel.org

If I correctly understand the discussion, you are correct
that I/O could be saved by doing this ... were it not for
the fact that the I/O in question is already being saved much
more effectively by someone else.

The entire web is richly festooned with cache servers whose
sole raison d'etre is to solve precisely this problem.  They
are so good at it that back-bone providers often find it more
cash-efficient to buy more cache servers than to lay more
fiber.  Cache servers don't merely save disk I/O, they catch
these requests before they reach the server (or even the
backbone).



* Re: ceph and efficient access of distributed resources
  2013-04-15 16:42       ` Gregory Farnum
@ 2013-04-15 23:14         ` Chen, Xiaoxi
  0 siblings, 0 replies; 15+ messages in thread
From: Chen, Xiaoxi @ 2013-04-15 23:14 UTC (permalink / raw)
  To: Gregory Farnum
  Cc: Mark Nelson, Matthias Urlichs, ceph-devel@vger.kernel.org,
	Huang, Zhiteng

Thanks, Greg.
Technically speaking, this would still work even if someone wanted to apply such a policy per node.



* Re: ceph and efficient access of distributed resources
  2013-04-15 22:38       ` Mark Kampe
@ 2013-04-16  7:20         ` Gandalf Corvotempesta
  2013-04-16 13:59           ` Sage Weil
  2013-04-16 14:18           ` Mark Kampe
  0 siblings, 2 replies; 15+ messages in thread
From: Gandalf Corvotempesta @ 2013-04-16  7:20 UTC (permalink / raw)
  To: Mark Kampe; +Cc: Matthias Urlichs, ceph-devel@vger.kernel.org

2013/4/16 Mark Kampe <mark.kampe@inktank.com>:
> The entire web is richly festooned with cache servers whose
> sole raison d'etre is to solve precisely this problem.  They
> are so good at it that back-bone providers often find it more
> cash-efficient to buy more cache servers than to lay more
> fiber.  Cache servers don't merely save disk I/O, they catch
> these requests before they reach the server (or even the
> backbone).

Mine was just an example; there are many other cases where a frontend
cache is not possible.
I think that Ceph should spread reads across the whole cluster by
default (like a big RAID-1) to achieve better bandwidth.

Gluster does this, and so does MooseFS.

What happens with a big file (for example, 100MB) made of multiple
chunks? Is Ceph smart enough to read multiple chunks from multiple
servers simultaneously, or will the whole file be served by just one
OSD?


* Re: ceph and efficient access of distributed resources
  2013-04-16  7:20         ` Gandalf Corvotempesta
@ 2013-04-16 13:59           ` Sage Weil
  2013-04-16 14:18           ` Mark Kampe
  1 sibling, 0 replies; 15+ messages in thread
From: Sage Weil @ 2013-04-16 13:59 UTC (permalink / raw)
  To: Gandalf Corvotempesta
  Cc: Mark Kampe, Matthias Urlichs, ceph-devel@vger.kernel.org

On Tue, 16 Apr 2013, Gandalf Corvotempesta wrote:
> 2013/4/16 Mark Kampe <mark.kampe@inktank.com>:
> > The entire web is richly festooned with cache servers whose
> > sole raison d'etre is to solve precisely this problem.  They
> > are so good at it that back-bone providers often find it more
> > cash-efficient to buy more cache servers than to lay more
> > fiber.  Cache servers don't merely save disk I/O, they catch
> > these requests before they reach the server (or even the
> > backbone).
> 
> Mine was just an example; there are many other cases where a frontend
> cache is not possible.
> I think that ceph should spread reads across the whole cluster by
> default (like a big RAID-1), to achieve a bandwidth improvement.
> 
> Gluster does this, and so does MooseFS.
> 
> What happens in case of a big file (for example, 100MB) with multiple
> chunks? Is ceph smart enough to read multiple chunks from multiple
> servers simultaneously, or will the whole file be served by a single
> OSD?

Yes.  The readahead window grows to include a few objects to take 
advantage of parallelism for reads.

The problem with reading from random/multiple replicas by default is cache 
efficiency.  If every reader picks a random replica, then there are 
effectively N locations that may have an object cached in RAM (instead of 
on disk), and the caches for each OSD will be about 1/Nth as effective.  
The only time it makes sense to read from replicas is when you are CPU or 
network limited; the rest of the time it is better to read from the 
primary's cache than a replica's disk.
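
The cache argument above can be illustrated with a small simulation. This
is only a sketch under stated assumptions, not Ceph code: every OSD is
modelled as a plain LRU cache, every object is assumed to be replicated on
every OSD, reads are uniformly random, and the names LRUCache and hit_rate
are hypothetical.

```python
import random
from collections import OrderedDict


class LRUCache:
    """Toy stand-in for one OSD's RAM cache (an assumption, not Ceph code)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def access(self, key):
        """Return True on a cache hit; on a miss, load the key and evict the LRU entry."""
        if key in self.items:
            self.items.move_to_end(key)
            return True
        self.items[key] = None
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)
        return False


def hit_rate(num_osds, num_objects, cache_size, requests, random_replica, seed=0):
    """Overall cache hit rate for uniformly random reads.

    Assumes every object has a copy on every OSD (replica count == num_osds).
    """
    rng = random.Random(seed)
    caches = [LRUCache(cache_size) for _ in range(num_osds)]
    hits = 0
    for _ in range(requests):
        obj = rng.randrange(num_objects)
        if random_replica:
            osd = rng.randrange(num_osds)   # any replica may serve the read
        else:
            osd = obj % num_osds            # always the same "primary" OSD
        hits += caches[osd].access(obj)
    return hits / requests


primary_only = hit_rate(3, 3000, 300, 60000, random_replica=False)
any_replica = hit_rate(3, 3000, 300, 60000, random_replica=True)
print(f"primary only: {primary_only:.3f}  random replica: {any_replica:.3f}")
```

In this toy model each cache has to cover the whole object set when reads
go to random replicas, but only its own share when reads go to the primary,
so the primary-only hit rate should come out close to N times higher.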

Unfortunately at the librados level, the client doesn't generally know 
that.  The infrastructure is in place for the MDS (or librados user) to 
indicate when reads from replicas are safe, but a bit more work is needed 
to make the client code utilize that information.  It's not a difficult 
improvement, and loadiness could also be communicated back to clients on a 
per-osd session basis, but it's not implemented yet.

sage



* Re: ceph and efficient access of distributed resources
  2013-04-16  7:20         ` Gandalf Corvotempesta
  2013-04-16 13:59           ` Sage Weil
@ 2013-04-16 14:18           ` Mark Kampe
  2013-04-16 20:06             ` Gandalf Corvotempesta
  1 sibling, 1 reply; 15+ messages in thread
From: Mark Kampe @ 2013-04-16 14:18 UTC (permalink / raw)
  To: Gandalf Corvotempesta; +Cc: Matthias Urlichs, ceph-devel@vger.kernel.org

On 04/16/13 00:20, Gandalf Corvotempesta wrote:
> 2013/4/16 Mark Kampe <mark.kampe@inktank.com>:
>> The entire web is richly festooned with cache servers whose
>> sole raison d'etre is to solve precisely this problem.  They
>> are so good at it that back-bone providers often find it more
>> cash-efficient to buy more cache servers than to lay more
>> fiber.  Cache servers don't merely save disk I/O, they catch
>> these requests before they reach the server (or even the
>> backbone).
>
> Mine was just an example; there are many other cases where a frontend
> cache is not possible.
> I think that ceph should spread reads across the whole cluster by
> default (like a big RAID-1), to achieve a bandwidth improvement.

At my previous distributed storage start-up (Parascale) we had the
ability to distribute reads across copies for load distribution
purposes and everybody we talked to said "who cares!".  Why?

    For hot-spot situations (as in your original example)
    higher level caching is far more effective than random
    traffic distribution.

    For lower level (e.g. coincidental) reuse, sending all the
    requests to a single server will usually perform better.
    Network I/O is much faster than disk I/O, and a single
    recipient will have N * the cache hit rate that N servers
    would have.

> What happens in case of a big file (for example, 100MB) with multiple
> chunks? Is ceph smart enough to read multiple chunks from multiple
> servers simultaneously, or will the whole file be served by a single
> OSD?

RADOS is the underlying storage cluster, but the access methods (block,
object, and file) stripe their data across many RADOS objects, which
CRUSH very effectively distributes across all of the servers.  A 100MB
read or write turns into dozens of parallel operations to servers all
over the cluster.
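
As a rough illustration of how one large read turns into many object
operations, here is a minimal sketch. It assumes the simplest possible
layout, where the file is cut into consecutive fixed-size objects of 4MB
(the chunk size mentioned elsewhere in this thread); real striping
parameters can differ, and split_read is a hypothetical helper, not a
Ceph API.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # assumed 4MB objects


def split_read(offset, length, object_size=OBJECT_SIZE):
    """Map a byte range onto (object_index, offset_in_object, length) extents."""
    extents = []
    end = offset + length
    while offset < end:
        index = offset // object_size
        offset_in_object = offset % object_size
        # Read up to the end of this object or the end of the request,
        # whichever comes first.
        count = min(object_size - offset_in_object, end - offset)
        extents.append((index, offset_in_object, count))
        offset += count
    return extents


MB = 1024 * 1024
print(len(split_read(0, 100 * MB)), "object reads for a 100MB request")
```

Each extent can be sent to whichever OSD holds that object, independently
of the others, which is where the parallelism comes from.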



* Re: ceph and efficient access of distributed resources
  2013-04-16 14:18           ` Mark Kampe
@ 2013-04-16 20:06             ` Gandalf Corvotempesta
  2013-04-16 20:44               ` Mark Kampe
  0 siblings, 1 reply; 15+ messages in thread
From: Gandalf Corvotempesta @ 2013-04-16 20:06 UTC (permalink / raw)
  To: Mark Kampe; +Cc: Matthias Urlichs, ceph-devel@vger.kernel.org

2013/4/16 Mark Kampe <mark.kampe@inktank.com>:
> RADOS is the underlying storage cluster, but the access methods (block,
> object, and file) stripe their data across many RADOS objects, which
> CRUSH very effectively distributes across all of the servers.  A 100MB
> read or write turns into dozens of parallel operations to servers all
> over the cluster.

Let me try to explain.
AFAIK Ceph will split data into chunks of 4MB each, so a single
12MB file will be stored as 3 different chunks across multiple OSDs
and then replicated many times (based on the value of the replica count).

Let's assume a 12MB file and a 3x replica.
RADOS will create 3x3 chunks for the same file, stored on 9 OSDs.

When reading, AFAIK replicas are not used, so all reads are done to the
"master copy".
But are these 3 chunks read in parallel from multiple OSDs, or are all
read requests done through a single OSD? In the first case we will have
3x bandwidth for read operations directed to a file with at least 3
chunks; in the latter we have a big bottleneck.


* Re: ceph and efficient access of distributed resources
  2013-04-16 20:06             ` Gandalf Corvotempesta
@ 2013-04-16 20:44               ` Mark Kampe
  2013-04-17  7:22                 ` Gandalf Corvotempesta
  0 siblings, 1 reply; 15+ messages in thread
From: Mark Kampe @ 2013-04-16 20:44 UTC (permalink / raw)
  To: Gandalf Corvotempesta; +Cc: Matthias Urlichs, ceph-devel@vger.kernel.org

The client does a 12MB read, which (because of the striping)
gets broken into 3 separate 4MB reads, each of which is sent,
all in parallel, to 3 distinct OSDs.  The only bottle-neck
in such an operation is the client-NIC.
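
A minimal sketch of that dispatch pattern, for illustration only: the
three "OSDs" are just in-memory byte strings, read_object is a
hypothetical stand-in for a network read, and a thread pool plays the
role of the client issuing the three requests at once.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK = 4 * 1024 * 1024  # 4MB per object, as in the example above


def read_object(osd_id, store):
    """Hypothetical stand-in for one 4MB network read from one OSD."""
    return store[osd_id]


def parallel_read(store):
    """Issue one read per OSD concurrently and reassemble the data in order."""
    osd_ids = sorted(store)
    with ThreadPoolExecutor(max_workers=len(osd_ids)) as pool:
        # pool.map returns results in the order of osd_ids, so the
        # chunks can simply be concatenated.
        parts = list(pool.map(lambda osd: read_object(osd, store), osd_ids))
    return b"".join(parts)


# Three fake OSDs, each holding one distinct 4MB object of the 12MB file.
store = {osd: bytes([osd]) * CHUNK for osd in range(3)}
data = parallel_read(store)
print(len(data) // (1024 * 1024), "MB read from", len(store), "OSDs")
```
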

On 04/16/2013 01:06 PM, Gandalf Corvotempesta wrote:
> 2013/4/16 Mark Kampe <mark.kampe@inktank.com>:
>> RADOS is the underlying storage cluster, but the access methods (block,
>> object, and file) stripe their data across many RADOS objects, which
>> CRUSH very effectively distributes across all of the servers.  A 100MB
>> read or write turns into dozens of parallel operations to servers all
>> over the cluster.
>
> Let me try to explain.
> AFAIK Ceph will split data into chunks of 4MB each, so a single
> 12MB file will be stored as 3 different chunks across multiple OSDs
> and then replicated many times (based on the value of the replica count).
>
> Let's assume a 12MB file and a 3x replica.
> RADOS will create 3x3 chunks for the same file, stored on 9 OSDs.
>
> When reading, AFAIK replicas are not used, so all reads are done to the
> "master copy".
> But are these 3 chunks read in parallel from multiple OSDs, or are all
> read requests done through a single OSD? In the first case we will have
> 3x bandwidth for read operations directed to a file with at least 3
> chunks; in the latter we have a big bottleneck.


* Re: ceph and efficient access of distributed resources
  2013-04-16 20:44               ` Mark Kampe
@ 2013-04-17  7:22                 ` Gandalf Corvotempesta
  0 siblings, 0 replies; 15+ messages in thread
From: Gandalf Corvotempesta @ 2013-04-17  7:22 UTC (permalink / raw)
  To: Mark Kampe; +Cc: Matthias Urlichs, ceph-devel@vger.kernel.org

On 16 Apr 2013 22:44, "Mark Kampe" <mark.kampe@inktank.com> wrote:
>
> The client does a 12MB read, which (because of the striping)
> gets broken into 3 separate 4MB reads, each of which is sent,
> all in parallel, to 3 distinct OSDs.  The only bottle-neck
> in such an operation is the client-NIC.


Thank you, now it's clear.
