* ceph and efficient access of distributed resources @ 2013-04-12 3:59 Matthias Urlichs 2013-04-12 16:08 ` Mark Nelson 0 siblings, 1 reply; 15+ messages in thread From: Matthias Urlichs @ 2013-04-12 3:59 UTC (permalink / raw) To: ceph-devel As I understand it, in Ceph one can cluster storage nodes, but otherwise every node is essentially identical, so if three storage nodes have a file, ceph randomly uses one of them. This is not efficient use of network resources in a distributed data center. Or even in a multi-rack situation. I want to prefer accessing nodes which are "local". The client in rack A should prefer to read from the storage nodes that are also in rack A. Ditto for rack B. Ditto for s/rack/data center/. As far as I understand, the Ceph clients can't do that. (Nor can Ceph nodes among each other, but I care less about that, as most traffic is reading data.) I think this is an important feature for many high-reliability situations. What would be the next steps to get this feature, assuming I don't have time to implement it myself? Persistently annoy this mailing list that people need it? Offer to pay for implementing it? Shut up and look for some other solution -- which I already did, but I didn't find any that's as good as Ceph, otherwise? I've opened a feature request for this, half a year ago, which hasn't seen any comments yet: http://tracker.ceph.com/issues/3249 -- Matthias Urlichs ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: ceph and efficient access of distributed resources 2013-04-12 3:59 ceph and efficient access of distributed resources Matthias Urlichs @ 2013-04-12 16:08 ` Mark Nelson 2013-04-12 16:20 ` Gregory Farnum 2013-04-15 20:06 ` Gandalf Corvotempesta 0 siblings, 2 replies; 15+ messages in thread From: Mark Nelson @ 2013-04-12 16:08 UTC (permalink / raw) To: Matthias Urlichs; +Cc: ceph-devel@vger.kernel.org On 04/11/2013 10:59 PM, Matthias Urlichs wrote: > As I understand it, in Ceph one can cluster storage nodes, but otherwise > every node is essentially identical, so if three storage nodes have a file, > ceph randomly uses one of them. Ceph clusters have the concept of pools, where each pool has a certain number of placement groups. Placement groups are just collections of mappings to OSDs. Each PG has a primary OSD and a number of secondary ones, based on the replication level you set when you make the pool. When an object gets written to the cluster, CRUSH will determine which PG the data should be sent to. The data will first hit the primary OSD and then be replicated out to the other OSDs in the same placement group. Currently reads always come from the primary OSD in the placement group rather than a secondary even if the secondary is closer to the client. I'm guessing there are probably some tricks that could be played here to best determine which machines should service which clients, but it's not exactly an easy problem. In many cases spreading reads out over all of the OSDs in the cluster is better than trying to optimize reads to only hit local OSDs. Ideally you probably want to prefer local OSDs first, but not exclusively. > > This is not efficient use of network resources in a distributed data center. > Or even in a multi-rack situation. > > I want to prefer accessing nodes which are "local". > The client in rack A should prefer to read from the storage nodes that are > also in rack A. > Ditto for rack B. > Ditto for s/rack/data center/.
> > As far as I understand, the Ceph clients can't do that. > (Nor can Ceph nodes among each other, but I care less about that, as most > traffic is reading data.) > > I think this is an important feature for many high-reliability situations. > > What would be the next steps to get this feature, assuming I don't have time > to implement it myself? Persistently annoy this mailing list that people > need it? Offer to pay for implementing it? Shut up and look for some other > solution -- which I already did, but I didn't find any that's as good as > Ceph, otherwise? I don't really have that much insight into the product roadmap, but I assume that if you spoke to some of our business folks about paying for development work you'd at least get a response. > > I've opened a feature request for this, half a year ago, which hasn't seen > any comments yet: http://tracker.ceph.com/issues/3249 Sadly there's a lot of things we'd like to do and not enough time to do them. :( If we get a lot of requests for this from other people too, it might bump the priority up. > > -- Matthias Urlichs > > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 15+ messages in thread
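The placement path described in the message above (object to placement group, placement group to an ordered set of OSDs, reads served by the first one) can be illustrated with a toy model. This is only a sketch under stated assumptions: the function names are hypothetical, a plain hash picks the PG, and a second hash ranks the OSDs. Real CRUSH is hierarchy-aware and weighted, and none of that is reproduced here.

```python
import hashlib


def stable_hash(text):
    """Deterministic 64-bit hash (the built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")


def object_to_pg(object_name, pg_num):
    """Map an object name to one of pg_num placement groups in a pool."""
    return stable_hash(object_name) % pg_num


def pg_to_osds(pool, pg, osds, replicas):
    """Pick `replicas` distinct OSDs for a PG; the first entry is the primary.

    Toy stand-in for CRUSH (an assumption, not the real algorithm):
    rank every OSD by a hash of (pool, pg, osd) and keep the lowest ranks.
    """
    ranked = sorted(osds, key=lambda osd: stable_hash(f"{pool}/{pg}/{osd}"))
    return ranked[:replicas]


def read_target(pool, object_name, pg_num, osds, replicas):
    """Reads go to the primary, i.e. the first OSD of the PG's set."""
    pg = object_to_pg(object_name, pg_num)
    return pg_to_osds(pool, pg, osds, replicas)[0]
```

Every client that computes `read_target` for the same object gets the same OSD, which is the behaviour the thread is discussing: the other replicas exist but are not consulted for reads.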
* Re: ceph and efficient access of distributed resources 2013-04-12 16:08 ` Mark Nelson @ 2013-04-12 16:20 ` Gregory Farnum 2013-04-13 2:32 ` Chen, Xiaoxi 2013-04-15 20:06 ` Gandalf Corvotempesta 1 sibling, 1 reply; 15+ messages in thread From: Gregory Farnum @ 2013-04-12 16:20 UTC (permalink / raw) To: Mark Nelson; +Cc: Matthias Urlichs, ceph-devel@vger.kernel.org I was in the middle of writing a response to this when Mark's email came in, so I'll just add a few things: On Fri, Apr 12, 2013 at 9:08 AM, Mark Nelson <mark.nelson@inktank.com> wrote: > On 04/11/2013 10:59 PM, Matthias Urlichs wrote: >> >> As I understand it, in Ceph one can cluster storage nodes, but otherwise >> every node is essentially identical, so if three storage nodes have a >> file, >> ceph randomly uses one of them. > > > Ceph clusters have the concept of pools, where each pool has a certain > number of placement groups. Placement groups are just collections of > mappings to OSDs. Each PG has a primary OSD and a number of secondary ones, > based on the replication level you set when you make the pool. When an > object gets written to the cluster, CRUSH will determine which PG the data > should be sent to. The data will first hit the primary OSD and then > replicated out to the other OSDs in the same placement group. > > Currently reads always come from the primary OSD in the placement group > rather than a secondary even if the secondary is closer to the client. I'm > guessing there are probably some tricks that could be played here to best > determine which machines should service which clients, but it's not exactly > an easy problem. In many cases spreading reads out over all of the OSDs in > the cluster is better than trying to optimize reads to only hit local OSDs. > Ideally you probably want to prefer local OSDs first, but not exclusively. 
In addition to just determining the locality (which we've started on via external interfaces), this has a number of consistency challenges associated with it. The infrastructure we have to allow reading from non-primaries tends to involve clients having different consistency expectations, and it's not fully explored yet or set up so that clients can choose to read from a specific non-primary — the options currently are "local if available and we can tell", "random", and "primary". >> This is not efficient use of network resources in a distributed data >> center. >> Or even in a multi-rack situation. >> >> I want to prefer accessing nodes which are "local". >> The client in rack A should prefer to read from the storage nodes that are >> also in rack A. >> Ditto for rack B. >> Ditto for s/rack/data center/. I do want to ask if you're sure this is as useful as you think it is. There are use cases where it would be, but since writes have to traverse these links (at a multiple of the actual write count) as well you should be very certain. :) >> As far as I understand, the Ceph clients can't do that. >> (Nor can Ceph nodes among each other, but I care less about that, as most >> traffic is reading data.) >> >> I think this is an important feature for many high-reliability situations. >> >> What would be the next steps to get this feature, assuming I don't have >> time >> to implement it myself? Persistently annoy this mailing list that people >> need it? Offer to pay for implementing it? Shut up and look for some other >> solution -- which I already did, but I didn't find any that's as good as >> Ceph, otherwise? > > > I don't really have that much insight into the product roadmap, but I assume > that if you spoke to some of our business folks about paying for development > work you'd at least get a response. Yeah. 
It's not a feature in large enough demand right now that we can see to be worth bumping up over other things, but I don't think anybody's opposed to it existing. As with Mark I have no idea if you're best off asking us or others to do things for money, but it would certainly have to go through business channels. (If somebody outside Inktank did want to implement this feature, I'd love to talk to them about it on an informal but ongoing basis during development.) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 15+ messages in thread
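To make the three read options named above ("local if available and we can tell", "random", "primary") and the point about write traffic concrete, here is a small sketch. The function names, the rack map, and the fallback to the primary when no replica is local are all assumptions made for illustration; this is not the Ceph client's actual interface.

```python
import random


def choose_read_osd(acting, osd_rack, client_rack, policy, rng=random):
    """Pick the OSD to read from; acting[0] is the primary.

    osd_rack maps OSD id -> rack name (hypothetical locality information).
    """
    if policy == "primary":
        return acting[0]
    if policy == "random":
        return rng.choice(acting)
    if policy == "local":
        for osd in acting:
            if osd_rack.get(osd) == client_rack:
                return osd
        # Assumption: with no local replica, fall back to the primary.
        return acting[0]
    raise ValueError(f"unknown policy: {policy}")


def cross_rack_write_transfers(acting, osd_rack, client_rack):
    """Count how many copies of one write cross a rack boundary.

    The client sends to the primary, and the primary forwards the data to
    every secondary, so a single write can cross inter-rack links several times.
    """
    primary = acting[0]
    transfers = int(osd_rack[primary] != client_rack)
    transfers += sum(osd_rack[s] != osd_rack[primary] for s in acting[1:])
    return transfers
```

With one replica per rack, a local read saves at most one inter-rack transfer per read, while each write still crosses racks once per remote replica. That is the trade-off Greg asks about: whether the workload is read-heavy enough for locality to pay off.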
* Re: ceph and efficient access of distributed resources 2013-04-12 16:20 ` Gregory Farnum @ 2013-04-13 2:32 ` Chen, Xiaoxi 2013-04-15 16:42 ` Gregory Farnum 0 siblings, 1 reply; 15+ messages in thread From: Chen, Xiaoxi @ 2013-04-13 2:32 UTC (permalink / raw) To: Gregory Farnum; +Cc: Mark Nelson, Matthias Urlichs, ceph-devel@vger.kernel.org We are also discussing this internally and have come up with an idea to work around it (only for the RBD case; we haven't thought about the object store yet), but it is not yet tested. If Mark and Greg can provide some feedback, that would be great. We are trying to write a script to generate some pools: for rack A there is a pool A, whose CRUSH ruleset chooses an OSD in rack A as the primary. So if we have 10 racks, we will have 10 pools and 10 rules. When a VM is migrated to another rack, or a volume is detached and attached to another VM hosted in another rack, a data migration is needed. We are thinking about how to smooth such a migration. Sent from my iPhone On 2013-04-13, at 0:20, "Gregory Farnum" <greg@inktank.com> wrote: > I was in the middle of writing a response to this when Mark's email > came in, so I'll just add a few things: > > On Fri, Apr 12, 2013 at 9:08 AM, Mark Nelson <mark.nelson@inktank.com> wrote: >> On 04/11/2013 10:59 PM, Matthias Urlichs wrote: >>> >>> As I understand it, in Ceph one can cluster storage nodes, but otherwise >>> every node is essentially identical, so if three storage nodes have a >>> file, >>> ceph randomly uses one of them. >> >> >> Ceph clusters have the concept of pools, where each pool has a certain >> number of placement groups. Placement groups are just collections of >> mappings to OSDs. Each PG has a primary OSD and a number of secondary ones, >> based on the replication level you set when you make the pool. When an >> object gets written to the cluster, CRUSH will determine which PG the data >> should be sent to. The data will first hit the primary OSD and then >> replicated out to the other OSDs in the same placement group.
^ permalink raw reply [flat|nested] 15+ messages in thread
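The per-rack pool idea in the message above could be scripted roughly as follows. This is a hedged sketch, not the poster's script: it only generates CRUSH rule text, following the two-step take/emit pattern in which the first step picks one OSD from the preferred rack (which then becomes the primary) and the second step picks the remaining replicas. The bucket names (`rack0` and so on, `default`), the rule and pool naming, and the size limits are hypothetical. Note also that the second `take` draws from the whole tree, so nothing in this sketch stops it from choosing the preferred rack again; a real map would need a root that excludes that rack.

```python
RULE_TEMPLATE = """rule {name} {{
    ruleset {ruleset}
    type replicated
    min_size 1
    max_size 10
    step take {rack}
    step chooseleaf firstn 1 type host
    step emit
    step take {root}
    step chooseleaf firstn -1 type host
    step emit
}}"""


def generate_rules(racks, first_ruleset=1, root="default"):
    """Return one CRUSH rule (as text) per rack, preferring that rack's OSDs."""
    rules = []
    for offset, rack in enumerate(racks):
        rules.append(
            RULE_TEMPLATE.format(
                name=f"{rack}_primary",
                ruleset=first_ruleset + offset,
                rack=rack,
                root=root,
            )
        )
    return rules


def pool_names(racks):
    """One pool per rack (hypothetical naming scheme)."""
    return [f"pool_{rack}" for rack in racks]
```

For ten racks, `generate_rules([f"rack{i}" for i in range(10)])` yields ten rules, matching the "10 pools and 10 rules" described above. The text would still have to be merged into a decompiled CRUSH map and each pool pointed at its ruleset.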
* Re: ceph and efficient access of distributed resources 2013-04-13 2:32 ` Chen, Xiaoxi @ 2013-04-15 16:42 ` Gregory Farnum 2013-04-15 23:14 ` Chen, Xiaoxi 0 siblings, 1 reply; 15+ messages in thread From: Gregory Farnum @ 2013-04-15 16:42 UTC (permalink / raw) To: Chen, Xiaoxi; +Cc: Mark Nelson, Matthias Urlichs, ceph-devel@vger.kernel.org Yeah, this is very much like DreamHost is doing with their DreamCompute installation (you can find some talks about it online, I believe, though I'm not sure how much detail they include there versus in the Q&As). On Fri, Apr 12, 2013 at 7:32 PM, Chen, Xiaoxi <xiaoxi.chen@intel.com> wrote: > We are also discussing this internally, and come out with an idea to walk around it(Only for RBD case,havent think about Obj store),but not yet tested. If Mark and Greg can provide some feedback,that would be great. > > We are trying to write a script to generate some pools,for rack A,there is a pool A,which defined the crush ruleset to choose Osd in rackA as the primary.so if we have 10 racks,we will have 10 pools and 10 rules. > > When the VM migrated to other rack,or the volume be detached and attached to another VM hosted in other rack,a data migration is needed.we are thinking about how to smooth such migration This is one of the use cases that layering is designed to handle (in addition to standard cloning and snapshots). Just create a clone that lives in the new pool, and either let it copy-up to the new position lazily or run the command at a time when you know your network is less busy. 
-Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ^ permalink raw reply [flat|nested] 15+ messages in thread
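The layering approach suggested in the reply above (clone into the new pool, then either let copy-up happen lazily or copy everything at a quiet time) might look like the command sequence below. This is a sketch under assumptions: the pool, image, and snapshot names are placeholders, the "command" mentioned in the reply is taken to be `rbd flatten`, and the helper only builds the argument lists rather than running them.

```python
def clone_to_pool(src_pool, image, snap, dst_pool, flatten_now=False):
    """Build the rbd commands that move an image to another pool via layering.

    Returns a list of argv lists; nothing is executed here.
    """
    src = f"{src_pool}/{image}@{snap}"
    dst = f"{dst_pool}/{image}"
    commands = [
        ["rbd", "snap", "create", src],   # snapshot the source image
        ["rbd", "snap", "protect", src],  # clones require a protected snapshot
        ["rbd", "clone", src, dst],       # child image lives in the new pool
    ]
    if flatten_now:
        # Copy all data into the child immediately instead of on demand.
        commands.append(["rbd", "flatten", dst])
    return commands
```

Each entry could be passed to `subprocess.run(cmd, check=True)` on a host with the `rbd` tool installed. Leaving `flatten_now` off corresponds to the lazy copy-up path, where blocks move to the new pool only as they are written.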
* Re: ceph and efficient access of distributed resources 2013-04-15 16:42 ` Gregory Farnum @ 2013-04-15 23:14 ` Chen, Xiaoxi 0 siblings, 0 replies; 15+ messages in thread From: Chen, Xiaoxi @ 2013-04-15 23:14 UTC (permalink / raw) To: Gregory Farnum Cc: Mark Nelson, Matthias Urlichs, ceph-devel@vger.kernel.org, Huang, Zhiteng Thanks, Greg. Technically speaking, this would still work even if someone wanted to apply such a policy "per node". On 2013-04-16, at 0:42, "Gregory Farnum" <greg@inktank.com> wrote: > Yeah, this is very much like DreamHost is doing with their > DreamCompute installation (you can find some talks about it online, I > believe, though I'm not sure how much detail they include there versus > in the Q&As). > > On Fri, Apr 12, 2013 at 7:32 PM, Chen, Xiaoxi <xiaoxi.chen@intel.com> wrote: >> We are also discussing this internally, and come out with an idea to walk around it(Only for RBD case,havent think about Obj store),but not yet tested. If Mark and Greg can provide some feedback,that would be great. >> >> We are trying to write a script to generate some pools,for rack A,there is a pool A,which defined the crush ruleset to choose Osd in rackA as the primary.so if we have 10 racks,we will have 10 pools and 10 rules. >> >> When the VM migrated to other rack,or the volume be detached and attached to another VM hosted in other rack,a data migration is needed.we are thinking about how to smooth such migration > > This is one of the use cases that layering is designed to handle (in > addition to standard cloning and snapshots). Just create a clone that > lives in the new pool, and either let it copy-up to the new position > lazily or run the command at a time when you know your network is less > busy.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: ceph and efficient access of distributed resources 2013-04-12 16:08 ` Mark Nelson 2013-04-12 16:20 ` Gregory Farnum @ 2013-04-15 20:06 ` Gandalf Corvotempesta 2013-04-15 22:25 ` Dan Mick 1 sibling, 1 reply; 15+ messages in thread From: Gandalf Corvotempesta @ 2013-04-15 20:06 UTC (permalink / raw) To: Mark Nelson; +Cc: Matthias Urlichs, ceph-devel@vger.kernel.org 2013/4/12 Mark Nelson <mark.nelson@inktank.com> > Currently reads always come from the primary OSD in the placement group > rather than a secondary even if the secondary is closer to the client. > This way, only one OSD is involved in reading an object, which will become a bottleneck if multiple clients need to access the same file. For example, a 3KB CSS file served by a webserver to 400 users will be read from just one OSD. 400 users directed to 1 OSD while (in the case of replica 3) the other 2 OSDs are available? ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: ceph and efficient access of distributed resources 2013-04-15 20:06 ` Gandalf Corvotempesta @ 2013-04-15 22:25 ` Dan Mick 2013-04-15 22:38 ` Mark Kampe 0 siblings, 1 reply; 15+ messages in thread From: Dan Mick @ 2013-04-15 22:25 UTC (permalink / raw) To: Gandalf Corvotempesta Cc: Mark Nelson, Matthias Urlichs, ceph-devel@vger.kernel.org On 04/15/2013 01:06 PM, Gandalf Corvotempesta wrote: > 2013/4/12 Mark Nelson <mark.nelson@inktank.com> > >> Currently reads always come from the primary OSD in the placement group >> rather than a secondary even if the secondary is closer to the client. >> > > In this way, only one OSD will be involved in reading an object, this will > result in a bottleneck if multiple clients needs to access to the same file. > > For example, a 3KB CSS file served by a webserver to 400 users, will be > read just from one OSD. 400 users directed to 1 OSD while (in case of > replica 3) other 2 OSDs are available? Yes. Consistency across the cluster is dependent on this scheme, currently. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: ceph and efficient access of distributed resources 2013-04-15 22:25 ` Dan Mick @ 2013-04-15 22:38 ` Mark Kampe 2013-04-16 7:20 ` Gandalf Corvotempesta 0 siblings, 1 reply; 15+ messages in thread From: Mark Kampe @ 2013-04-15 22:38 UTC (permalink / raw) To: Gandalf Corvotempesta; +Cc: Matthias Urlichs, ceph-devel@vger.kernel.org If I correctly understand the discussion, you are correct that I/O could be saved by doing this ... were it not for the fact the I/O in question is already being saved much more effectively by someone else. The entire web is richly festooned with cache servers whose sole raison d'etre is to solve precisely this problem. They are so good at it that back-bone providers often find it more cash-efficient to buy more cache servers than to lay more fiber. Cache servers don't merely save disk I/O, they catch these requests before they reach the server (or even the backbone). > On 04/15/2013 01:06 PM, Gandalf Corvotempesta wrote: >> >>> Currently reads always come from the primary OSD in the placement group >>> rather than a secondary even if the secondary is closer to the client. >>> >> >> In this way, only one OSD will be involved in reading an object, this >> will >> result in a bottleneck if multiple clients needs to access to the same >> file. >> >> For example, a 3KB CSS file served by a webserver to 400 users, will be >> read just from one OSD. 400 users directed to 1 OSD while (in case of >> replica 3) other 2 OSDs are available? ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: ceph and efficient access of distributed resources 2013-04-15 22:38 ` Mark Kampe @ 2013-04-16 7:20 ` Gandalf Corvotempesta 2013-04-16 13:59 ` Sage Weil 2013-04-16 14:18 ` Mark Kampe 0 siblings, 2 replies; 15+ messages in thread From: Gandalf Corvotempesta @ 2013-04-16 7:20 UTC (permalink / raw) To: Mark Kampe; +Cc: Matthias Urlichs, ceph-devel@vger.kernel.org 2013/4/16 Mark Kampe <mark.kampe@inktank.com>: > The entire web is richly festooned with cache servers whose > sole raison d'etre is to solve precisely this problem. They > are so good at it that back-bone providers often find it more > cash-efficient to buy more cache servers than to lay more > fiber. Cache servers don't merely save disk I/O, they catch > these requests before they reach the server (or even the > backbone). Mine was just an example; there are many other cases where a frontend cache is not possible. I think that Ceph should spread reads across the whole cluster by default (like a big RAID-1) to achieve a bandwidth improvement. Gluster does this, and so does MooseFS. What happens in the case of a big file (for example, 100MB) with multiple chunks? Is Ceph smart enough to read multiple chunks from multiple servers simultaneously, or will the whole file be served by just one OSD? ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: ceph and efficient access of distributed resources 2013-04-16 7:20 ` Gandalf Corvotempesta @ 2013-04-16 13:59 ` Sage Weil 2013-04-16 14:18 ` Mark Kampe 1 sibling, 0 replies; 15+ messages in thread From: Sage Weil @ 2013-04-16 13:59 UTC (permalink / raw) To: Gandalf Corvotempesta Cc: Mark Kampe, Matthias Urlichs, ceph-devel@vger.kernel.org On Tue, 16 Apr 2013, Gandalf Corvotempesta wrote: > 2013/4/16 Mark Kampe <mark.kampe@inktank.com>: > > The entire web is richly festooned with cache servers whose > > sole raison d'etre is to solve precisely this problem. They > > are so good at it that back-bone providers often find it more > > cash-efficient to buy more cache servers than to lay more > > fiber. Cache servers don't merely save disk I/O, they catch > > these requests before they reach the server (or even the > > backbone). > > Mine was just an example; there are many other cases where a frontend > cache is not possible. > I think that Ceph should spread reads across the whole cluster by > default (like a big RAID-1) to achieve a bandwidth improvement. > > Gluster does this, and so does MooseFS. > > What happens in the case of a big file (for example, 100MB) with multiple > chunks? Is Ceph smart enough to read multiple chunks from multiple > servers simultaneously, or will the whole file be served by just one OSD? Yes. The readahead window grows to include a few objects to take advantage of parallelism for reads. The problem with reading from random/multiple replicas by default is cache efficiency. If every reader picks a random replica, then there are effectively N locations that may have an object cached in RAM (instead of on disk), and the caches for each OSD will be about 1/Nth as effective. The only time it makes sense to read from replicas is when you are CPU or network limited; the rest of the time it is better to read from the primary's cache than a replica's disk. Unfortunately at the librados level, the client doesn't generally know that.
The infrastructure is in place for the MDS (or librados user) to indicate when reads from replicas are safe, but a bit more work is needed to make the client code utilize that information. It's not a difficult improvement, and loadiness could also be communicated back to clients on a per-osd session basis, but it's not implemented yet. sage ^ permalink raw reply [flat|nested] 15+ messages in thread
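The cache-efficiency argument above can be illustrated with a small simulation. This is only a sketch under loud assumptions: a toy cluster of 3 OSDs that each hold every object, a tiny LRU cache per OSD, uniformly random reads, and `obj % num_osds` as a stand-in for primary selection (it is not CRUSH). All names here (`LRUCache`, `hit_rate`, `primary_only`, `random_replica`) are hypothetical and do not come from Ceph.

```python
import random
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache that only tracks which object ids are resident."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def access(self, key):
        """Return True on a hit; on a miss, load the key and evict the oldest."""
        if key in self.items:
            self.items.move_to_end(key)
            return True
        self.items[key] = None
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)
        return False


def primary_only(obj, num_osds, rng):
    # Assumption: a fixed "primary" per object (toy stand-in, not CRUSH).
    return obj % num_osds


def random_replica(obj, num_osds, rng):
    # Every reader picks any replica at random.
    return rng.randrange(num_osds)


def hit_rate(pick_osd, num_osds=3, num_objects=300, cache_size=100,
             num_reads=30000, seed=0):
    """Fraction of reads served from an OSD's RAM cache under a read policy."""
    rng = random.Random(seed)
    caches = [LRUCache(cache_size) for _ in range(num_osds)]
    hits = 0
    for _ in range(num_reads):
        obj = rng.randrange(num_objects)
        osd = pick_osd(obj, num_osds, rng)
        hits += caches[osd].access(obj)
    return hits / num_reads


primary_rate = hit_rate(primary_only)
random_rate = hit_rate(random_replica)
print(f"primary-only hit rate:   {primary_rate:.2f}")
print(f"random-replica hit rate: {random_rate:.2f}")
```

In this toy setup each OSD's cache can hold exactly the objects it is primary for, so primary-only reads miss only while the caches warm up, whereas random-replica reads make every cache compete for the whole object set. Real workloads are skewed rather than uniform, so the gap in practice will differ.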
* Re: ceph and efficient access of distributed resources
From: Mark Kampe @ 2013-04-16 14:18 UTC
To: Gandalf Corvotempesta
Cc: Matthias Urlichs, ceph-devel@vger.kernel.org

On 04/16/13 00:20, Gandalf Corvotempesta wrote:
> 2013/4/16 Mark Kampe <mark.kampe@inktank.com>:
>> The entire web is richly festooned with cache servers whose
>> sole raison d'etre is to solve precisely this problem. They
>> are so good at it that back-bone providers often find it more
>> cash-efficient to buy more cache servers than to lay more
>> fiber. Cache servers don't merely save disk I/O, they catch
>> these requests before they reach the server (or even the
>> backbone).
>
> Mine was just an example; there are many other cases where a frontend
> cache is not possible.
> I think that Ceph should spread reads across the whole cluster by
> default (like a big RAID-1), to achieve a bandwidth improvement.

At my previous distributed storage start-up (Parascale) we had
the ability to distribute reads across copies for load distribution
purposes, and everybody we talked to said "who cares!". Why?

   For hot-spot situations (as in your original example), higher
   level caching is far more effective than random traffic distribution.

   For lower level (e.g. coincidental) reuse, sending all the
   requests to a single server will usually perform better.
   Network I/O is much faster than disk I/O, and a single
   recipient will have roughly N times the cache hit rate that
   N servers would have.

> What happens in the case of a big file (for example, 100MB) with multiple
> chunks? Is Ceph smart enough to read multiple chunks from multiple
> servers simultaneously, or will the whole file be served by just one OSD?

RADOS is the underlying storage cluster, but the access methods (block,
object, and file) stripe their data across many RADOS objects, which
CRUSH very effectively distributes across all of the servers. A 100MB
read or write turns into dozens of parallel operations to servers all
over the cluster.
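To make the striping description above concrete, here is a minimal sketch of how a byte range could be split into per-object extents. It assumes the simplest possible layout (fixed 4 MiB objects, one stripe unit per object) and uses a CRC-based `toy_placement` as a stand-in for CRUSH, which it does not implement. The function names and the `file.%08x` object naming are hypothetical, chosen only for illustration.

```python
import zlib

MiB = 1024 * 1024
OBJECT_SIZE = 4 * MiB  # assumed object size for this sketch


def extents_for_range(offset, length, object_size=OBJECT_SIZE):
    """Split a byte range into (object_index, offset_in_object, length) extents."""
    extents = []
    end = offset + length
    while offset < end:
        index = offset // object_size
        within = offset % object_size
        chunk = min(object_size - within, end - offset)
        extents.append((index, within, chunk))
        offset += chunk
    return extents


def toy_placement(object_name, num_osds):
    """Hash an object name onto an OSD id (illustrative only, NOT CRUSH)."""
    return zlib.crc32(object_name.encode()) % num_osds


# A 100 MiB read touches 25 objects, each of which can be fetched independently.
extents = extents_for_range(0, 100 * MiB)
osds = {toy_placement(f"file.{index:08x}", 12) for index, _, _ in extents}
print(f"{len(extents)} object reads spread over {len(osds)} of 12 OSDs")
```

Each extent is an independent request, which is why a single large read can fan out to many servers at once.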
* Re: ceph and efficient access of distributed resources
From: Gandalf Corvotempesta @ 2013-04-16 20:06 UTC
To: Mark Kampe
Cc: Matthias Urlichs, ceph-devel@vger.kernel.org

2013/4/16 Mark Kampe <mark.kampe@inktank.com>:
> RADOS is the underlying storage cluster, but the access methods (block,
> object, and file) stripe their data across many RADOS objects, which
> CRUSH very effectively distributes across all of the servers. A 100MB
> read or write turns into dozens of parallel operations to servers all
> over the cluster.

Let me try to explain.
AFAIK Ceph will split data into chunks of 4MB each, so a single
12MB file will be stored as 3 different chunks across multiple OSDs,
and each chunk is then replicated according to the replica count.

Let's assume a 12MB file and 3x replication.
RADOS will create 3x3 chunks for the same file, stored on 9 OSDs.

When reading, AFAIK replicas are not used, so all reads go to the
"master copy".
But are these 3 chunks read in parallel from multiple OSDs, or are all
read requests done through a single OSD? In the first case we will have
3x bandwidth for read operations on a file with at least 3 chunks;
in the latter we have a big bottleneck.
* Re: ceph and efficient access of distributed resources
From: Mark Kampe @ 2013-04-16 20:44 UTC
To: Gandalf Corvotempesta
Cc: Matthias Urlichs, ceph-devel@vger.kernel.org

The client does a 12MB read, which (because of the striping)
gets broken into 3 separate 4MB reads, each of which is sent,
all in parallel, to 3 distinct OSDs. The only bottleneck
in such an operation is the client NIC.

On 04/16/2013 01:06 PM, Gandalf Corvotempesta wrote:
> 2013/4/16 Mark Kampe <mark.kampe@inktank.com>:
>> RADOS is the underlying storage cluster, but the access methods (block,
>> object, and file) stripe their data across many RADOS objects, which
>> CRUSH very effectively distributes across all of the servers. A 100MB
>> read or write turns into dozens of parallel operations to servers all
>> over the cluster.
>
> Let me try to explain.
> AFAIK Ceph will split data into chunks of 4MB each, so a single
> 12MB file will be stored as 3 different chunks across multiple OSDs,
> and each chunk is then replicated according to the replica count.
>
> Let's assume a 12MB file and 3x replication.
> RADOS will create 3x3 chunks for the same file, stored on 9 OSDs.
>
> When reading, AFAIK replicas are not used, so all reads go to the
> "master copy".
> But are these 3 chunks read in parallel from multiple OSDs, or are all
> read requests done through a single OSD? In the first case we will have
> 3x bandwidth for read operations on a file with at least 3 chunks;
> in the latter we have a big bottleneck.
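The 12MB example above can be sketched as three concurrent 4 MiB fetches that are reassembled in order. Everything here is an assumption made for illustration: `FAKE_OBJECTS` is an in-memory stand-in for three objects living on three different primary OSDs, `read_object` simulates per-request latency with a sleep, and a thread pool plays the role of the client issuing requests in parallel. None of this is the Ceph client API.

```python
import time
from concurrent.futures import ThreadPoolExecutor

MiB = 1024 * 1024
OBJECT_SIZE = 4 * MiB  # assumed object size, matching the 4MB chunks in the thread

# Hypothetical cluster: object index -> payload held by that object's primary OSD.
FAKE_OBJECTS = {i: bytes([i]) * OBJECT_SIZE for i in range(3)}


def read_object(index, delay=0.05):
    """Simulate one read request to the OSD that holds object `index`."""
    time.sleep(delay)  # stand-in for network plus disk latency
    return FAKE_OBJECTS[index]


def parallel_read(num_objects):
    """Issue all object reads at once and join the results in file order."""
    with ThreadPoolExecutor(max_workers=num_objects) as pool:
        chunks = list(pool.map(read_object, range(num_objects)))
    return b"".join(chunks)


start = time.perf_counter()
data = parallel_read(3)
elapsed = time.perf_counter() - start
print(f"read {len(data) // MiB} MiB in {elapsed:.2f}s")
```

Because the three requests overlap, the wall-clock time in this sketch is close to one simulated request rather than three, which mirrors the point that the client's own link becomes the limiting factor.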
* Re: ceph and efficient access of distributed resources
From: Gandalf Corvotempesta @ 2013-04-17 7:22 UTC
To: Mark Kampe
Cc: Matthias Urlichs, ceph-devel@vger.kernel.org

On 16 Apr 2013 22:44, "Mark Kampe" <mark.kampe@inktank.com> wrote:
>
> The client does a 12MB read, which (because of the striping)
> gets broken into 3 separate 4MB reads, each of which is sent,
> all in parallel, to 3 distinct OSDs. The only bottleneck
> in such an operation is the client NIC.

Thank you, now it's clear.