* production ready?
From: Gandalf Corvotempesta @ 2012-10-26 21:52 UTC
To: ceph-devel

Hi all, I'm new to Ceph.
Are RBD and the REST API production ready?
Do you have any use cases to share? We are looking for distributed
block storage for an HP C7000 blade enclosure with 16 dual-processor
X5675 blades with 64/128GB RAM each.
* Re: production ready?
From: Dan Mick @ 2012-10-29 20:21 UTC
To: Gandalf Corvotempesta; +Cc: ceph-devel

On 10/26/2012 02:52 PM, Gandalf Corvotempesta wrote:
> Hi all, I'm new to Ceph.
> Are RBD and the REST API production ready?

There are sites using them in production now.

> Do you have any use cases to share? We are looking for distributed
> block storage for an HP C7000 blade enclosure with 16 dual-processor
> X5675 blades with 64/128GB RAM each.

Ceph should certainly be able to handle that hardware.
* Re: production ready?
From: Gandalf Corvotempesta @ 2012-10-30 11:35 UTC
To: Dan Mick; +Cc: ceph-devel

2012/10/29 Dan Mick <dan.mick@inktank.com>:
> There are sites using them in production now.

Are there any docs about infrastructure size, topology, performance, and so on?

I'm evaluating several distributed systems and would like to have some
feedback and use cases.
* Re: production ready?
From: Gregory Farnum @ 2012-10-30 13:15 UTC
To: Gandalf Corvotempesta; +Cc: Dan Mick, ceph-devel

Not a lot of people are publicly discussing their sizes on things like
that, unfortunately. I believe DreamHost is still the most open. They
have an (RGW-based) object storage service which is backed by ~800
OSDs and are currently beta-testing a compute service using RBD, which
you can see described here:
http://www.youtube.com/watch?v=l_8Y988fO44&feature=plcp
* Re: production ready?
From: Gandalf Corvotempesta @ 2012-10-30 13:36 UTC
To: Gregory Farnum; +Cc: Dan Mick, ceph-devel

2012/10/30 Gregory Farnum <greg@inktank.com>:
> Not a lot of people are publicly discussing their sizes on things like
> that, unfortunately. I believe DreamHost is still the most open. They
> have an (RGW-based) object storage service which is backed by ~800
> OSDs and are currently beta-testing a compute service using RBD, which
> you can see described here:
> http://www.youtube.com/watch?v=l_8Y988fO44&feature=plcp

I'm watching it right now. It seems interesting.
Please let me know if I understand Ceph properly:
RADOS is the block storage.
RADOS can be accessed through RGW (a REST gateway) or through librbd.

In the first case we will have an object store; in the second case
we will have a block device connected directly to a server (like an
iSCSI block device).

But in the first case, should I create a filesystem on RBD and then
manage that FS with the gateway?
* Re: production ready?
From: Stefan Priebe - Profihost AG @ 2012-10-30 13:38 UTC
To: Gandalf Corvotempesta; +Cc: Gregory Farnum, Dan Mick, ceph-devel

On 30.10.2012 14:36, Gandalf Corvotempesta wrote:
> 2012/10/30 Gregory Farnum <greg@inktank.com>:
>> Not a lot of people are publicly discussing their sizes on things like
>> that, unfortunately. I believe DreamHost is still the most open. They
>> have an (RGW-based) object storage service which is backed by ~800
>> OSDs and are currently beta-testing a compute service using RBD, which
>> you can see described here:
>> http://www.youtube.com/watch?v=l_8Y988fO44&feature=plcp

But there's still the problem of slow random write IOPS. At least I
haven't seen any good benchmarks.

Stefan
* Re: production ready?
From: Gregory Farnum @ 2012-10-30 13:45 UTC
To: Stefan Priebe - Profihost AG; +Cc: Dan Mick, ceph-devel

On Tue, Oct 30, 2012 at 2:38 PM, Stefan Priebe - Profihost AG
<s.priebe@profihost.ag> wrote:
> But there's still the problem of slow random write IOPS. At least I
> haven't seen any good benchmarks.

It's not magic; I haven't done extensive testing, but I believe people
see aggregate IOPS of about what you can calculate:
(number of storage disks * IOPS per disk) / (replication level)
The journaling bumps that up a little bit for bursts, of course;
similarly, if you're doing it on a brand new RBD image it can be a bit
slower since you need to create all the objects as well as write data
to them. You need to architect your storage system to match your
requirements. If you want to run write-heavy databases on RBD, there
are people doing that. They're using SSDs and are very pleased with
its performance. *shrug*
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: production ready?
From: Stefan Priebe @ 2012-10-30 20:32 UTC
To: Gregory Farnum; +Cc: Dan Mick, ceph-devel

On 30.10.2012 14:45, Gregory Farnum wrote:
>> But there's still the problem of slow random write IOPS. At least I
>> haven't seen any good benchmarks.
>
> It's not magic; I haven't done extensive testing, but I believe people
> see aggregate IOPS of about what you can calculate:
> (number of storage disks * IOPS per disk) / (replication level)
> The journaling bumps that up a little bit for bursts, of course;
> similarly, if you're doing it on a brand new RBD image it can be a bit
> slower since you need to create all the objects as well as write data
> to them. You need to architect your storage system to match your
> requirements. If you want to run write-heavy databases on RBD, there
> are people doing that. They're using SSDs and are very pleased with
> its performance. *shrug*

My last test was with 0.49, so I can't speak about 0.52, but as far as I
know nothing has changed in this respect.

I had 6 dedicated servers, each with 4x Intel 520 series SSDs running
4 OSDs (one OSD per disk). I had the journal in a 1GB tmpfs to be sure
it wasn't the bottleneck. Replication was set to 2.

Each SSD is capable of 30,000 random 4k IOPS. With RBD I wasn't able
to get more than 20,000 IOPS, although overall I had:
6 dedicated servers * 4 SSDs => 24 OSDs/SSDs * 30,000 IOPS / replication 2
=> 360,000 IOPS theoretical overall performance

But I didn't get more than 20,000 while using 3.6GHz Xeon CPUs and dual
10GbE.

Greets,
Stefan
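
The estimate discussed in the two messages above can be reproduced with a few lines of Python. This is only a sketch of the back-of-the-envelope formula, (number of storage disks * IOPS per disk) / (replication level), fed with the figures reported for the test cluster; the function and variable names are illustrative and are not part of any Ceph tool.

```python
def theoretical_iops(num_disks, iops_per_disk, replication):
    """Upper-bound estimate: (disks * IOPS per disk) / replication level."""
    if replication < 1:
        raise ValueError("replication must be >= 1")
    return num_disks * iops_per_disk / replication


# Figures taken from the test described in the thread.
servers = 6
ssds_per_server = 4        # one OSD per SSD
iops_per_ssd = 30_000      # random 4k IOPS per SSD
replication = 2
observed_iops = 20_000     # best result reported with RBD

num_osds = servers * ssds_per_server
estimate = theoretical_iops(num_osds, iops_per_ssd, replication)
ratio = observed_iops / estimate

print(f"OSDs: {num_osds}")
print(f"theoretical aggregate IOPS: {estimate:,.0f}")
print(f"observed / theoretical: {ratio:.1%}")
```

With these inputs the estimate is 360,000 IOPS, so the reported 20,000 IOPS is roughly 5.6% of the theoretical figure. The formula ignores CPU, network, and software overhead, so it is best read as a ceiling rather than a prediction.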
* Re: production ready?
From: Gregory Farnum @ 2012-10-30 13:40 UTC
To: Gandalf Corvotempesta; +Cc: Dan Mick, ceph-devel

On Tue, Oct 30, 2012 at 2:36 PM, Gandalf Corvotempesta
<gandalf.corvotempesta@gmail.com> wrote:
> I'm watching it right now. It seems interesting.
> Please let me know if I understand Ceph properly:
> RADOS is the block storage.
> RADOS can be accessed through RGW (a REST gateway) or through librbd.

Not exactly. RADOS is natively a (powerful) object store. RGW takes S3
and Swift REST requests and translates them into RADOS requests,
stored in a "custom" format. RBD is a client-side library which takes
a logical block device and stripes it over RADOS objects (by default,
the first 4MB is one object, the second 4MB is another object, etc.).
Make sense?
-Greg

> In the first case we will have an object store; in the second case
> we will have a block device connected directly to a server (like an
> iSCSI block device).
>
> But in the first case, should I create a filesystem on RBD and then
> manage that FS with the gateway?
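
The default striping described above (first 4MB in one object, second 4MB in the next, and so on) amounts to integer division of the byte offset by the object size. The following Python sketch models only that mapping. It assumes "4MB" means 4 MiB, and it does not reproduce how RBD actually names or places its objects; `locate` and `objects_touched` are hypothetical helper names.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # assumed default: 4 MiB per RADOS object


def locate(offset, object_size=OBJECT_SIZE):
    """Return (object index, offset inside that object) for an image byte offset."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return divmod(offset, object_size)


def objects_touched(offset, length, object_size=OBJECT_SIZE):
    """Return the range of object indexes covered by an I/O of `length` bytes."""
    if length <= 0:
        return range(0)
    first = offset // object_size
    last = (offset + length - 1) // object_size
    return range(first, last + 1)


MIB = 1024 * 1024
print(locate(0))                                # start of the image
print(locate(5 * MIB))                          # 1 MiB into the second object
print(list(objects_touched(3 * MIB, 2 * MIB)))  # a 2 MiB I/O crossing a boundary
```

Because consecutive 4MB extents land in different objects, a large sequential or scattered workload on one image is spread over many objects, and therefore potentially over many OSDs.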
* Re: production ready?
From: Gandalf Corvotempesta @ 2012-10-30 13:57 UTC
To: Gregory Farnum; +Cc: Dan Mick, ceph-devel

2012/10/30 Gregory Farnum <greg@inktank.com>:
> Not exactly. RADOS is natively a (powerful) object store. RGW takes S3
> and Swift REST requests and translates them into RADOS requests,
> stored in a "custom" format. RBD is a client-side library which takes
> a logical block device and stripes it over RADOS objects (by default,
> the first 4MB is one object, the second 4MB is another object, etc.).
> Make sense?

So, a Ceph cluster is made up of multiple OSDs.
These OSDs are combined by RADOS, an object store that stripes over
multiple OSDs.

This store can be accessed through RGW (for S3 and Swift API
compatibility, if needed) or directly by a server as a block device
with librbd.

This should be the architecture:

OSD -> RADOS -> RGW/librbd -> Customer/Server

Nothing prevents me from offering a service based directly on the RADOS
API, if S3 compatibility is not needed, right?

What I don't understand is how I can access a single file from RGW. If
librbd and RGW are 'gateways' to a RADOS store, I'll have access to a
block device, not to a single file.
Should I create a filesystem on the block device before using it with
RGW?
* Re: production ready?
From: 袁冬 @ 2012-10-30 14:17 UTC
To: Gandalf Corvotempesta; +Cc: Gregory Farnum, Dan Mick, ceph-devel

> Nothing prevents me from offering a service based directly on the RADOS
> API, if S3 compatibility is not needed, right?

Correct. That is librados.

> What I don't understand is how I can access a single file from RGW. If
> librbd and RGW are 'gateways' to a RADOS store, I'll have access to a
> block device, not to a single file.
> Should I create a filesystem on the block device before using it with
> RGW?

RGW and librbd do not use the same pool; you can't access an RBD volume
with RGW. RADOS treats the RBD volume as just a large object.

--
袁冬
Email:yuandong1222@gmail.com
QQ:10200230
* Re: production ready?
From: Gandalf Corvotempesta @ 2012-10-30 14:31 UTC
To: 袁冬; +Cc: Gregory Farnum, Dan Mick, ceph-devel

2012/10/30 袁冬 <yuandong1222@gmail.com>:
> RGW and librbd do not use the same pool; you can't access an RBD volume
> with RGW. RADOS treats the RBD volume as just a large object.

OK, I think I have understood. RADOS only stores objects on an existing
filesystem (this is why I have to create an FS to use RADOS/Ceph).
Now, if that object is accessed through RGW, it will be a single
file stored on the FS, but if I'm accessing it with RBD, RBD presents
a very large object, stored on the FS, as if it were a single block
device.
In this case, can a single block device (for example a huge virtual
machine image) be striped across many OSDs to achieve better
read performance?
An image striped across 3 disks should get 3x the IOPS when reading.

Another question: in a standard RGW/RBD infrastructure (no CephFS), I
only have to configure "mon" and "osd" nodes, right?
How many monitor nodes are suggested?
* Re: production ready?
From: 袁冬 @ 2012-10-30 14:38 UTC
To: Gandalf Corvotempesta; +Cc: Gregory Farnum, Dan Mick, ceph-devel

> In this case, can a single block device (for example a huge virtual
> machine image) be striped across many OSDs to achieve better
> read performance?
> An image striped across 3 disks should get 3x the IOPS when reading.

Yes, but the network (and many other issues) must be considered.

> Another question: in a standard RGW/RBD infrastructure (no CephFS), I
> only have to configure "mon" and "osd" nodes, right?

Yes.

> How many monitor nodes are suggested?

3 is suggested.

--
袁冬
Email:yuandong1222@gmail.com
QQ:10200230
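
One way to see why 3 monitors is a sensible minimum: Ceph monitors need a strict majority of their number to form a quorum. The thread itself does not spell this out, so the Python sketch below states it as an assumption and simply tabulates the consequence; the function names are illustrative.

```python
def quorum_size(num_monitors):
    """Smallest strict majority of the monitor set (assumed quorum rule)."""
    if num_monitors < 1:
        raise ValueError("need at least one monitor")
    return num_monitors // 2 + 1


def tolerated_failures(num_monitors):
    """How many monitors can be down while a majority is still reachable."""
    return num_monitors - quorum_size(num_monitors)


for n in range(1, 6):
    print(f"{n} monitor(s): quorum={quorum_size(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
```

Under this rule, 1 or 2 monitors tolerate no failures, 3 tolerate one, and 4 tolerate no more than 3 do, which is why odd counts are the usual choice.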
* Re: production ready?
From: Gandalf Corvotempesta @ 2012-10-30 14:59 UTC
To: 袁冬; +Cc: Gregory Farnum, Dan Mick, ceph-devel

2012/10/30 袁冬 <yuandong1222@gmail.com>:
> Yes, but the network (and many other issues) must be considered.

Obviously.

> 3 is suggested.

Is there any contraindication to running a mon on the same server as an
OSD?
* Re: production ready?
From: Dan Mick @ 2012-10-30 21:06 UTC
To: Gandalf Corvotempesta; +Cc: 袁冬, Gregory Farnum, ceph-devel

On 10/30/2012 07:59 AM, Gandalf Corvotempesta wrote:
> 2012/10/30 袁冬 <yuandong1222@gmail.com>:
>> Yes, but the network (and many other issues) must be considered.
>
> Obviously.
>
>> 3 is suggested.
>
> Is there any contraindication to running a mon on the same server as an
> OSD?

Generally that's considered OK. ceph-mon doesn't use very much disk or CPU
or network bandwidth.
* Re: production ready?
From: Gandalf Corvotempesta @ 2012-10-30 21:17 UTC
To: Dan Mick; +Cc: 袁冬, Gregory Farnum, ceph-devel

2012/10/30 Dan Mick <dan.mick@inktank.com>:
> Generally that's considered OK. ceph-mon doesn't use very much disk or CPU
> or network bandwidth.

In this case, should I reserve some space for ceph-mon (a partition or
a dedicated disk), or is ceph-mon able to 'share' the OSD disk space
automatically (for example using a directory)?
* Re: production ready?
From: Sage Weil @ 2012-10-30 21:21 UTC
To: Gandalf Corvotempesta; +Cc: Dan Mick, 袁冬, Gregory Farnum, ceph-devel

On Tue, 30 Oct 2012, Gandalf Corvotempesta wrote:
> 2012/10/30 Dan Mick <dan.mick@inktank.com>:
> > Generally that's considered OK. ceph-mon doesn't use very much disk or CPU
> > or network bandwidth.
>
> In this case, should I reserve some space for ceph-mon (a partition or
> a dedicated disk), or is ceph-mon able to 'share' the OSD disk space
> automatically (for example using a directory)?

A common pattern is to give it a directory on the OS/boot volume. This
can be a dedicated disk (lots of space for logs) or something carved out
of another disk (more space for ceph data, but can interfere with
ceph-osd performance).

sage
Thread overview: 17+ messages
2012-10-26 21:52 production ready? Gandalf Corvotempesta
2012-10-29 20:21 ` Dan Mick
2012-10-30 11:35   ` Gandalf Corvotempesta
2012-10-30 13:15     ` Gregory Farnum
2012-10-30 13:36       ` Gandalf Corvotempesta
2012-10-30 13:38         ` Stefan Priebe - Profihost AG
2012-10-30 13:45           ` Gregory Farnum
2012-10-30 20:32             ` Stefan Priebe
2012-10-30 13:40         ` Gregory Farnum
2012-10-30 13:57           ` Gandalf Corvotempesta
2012-10-30 14:17             ` 袁冬
2012-10-30 14:31               ` Gandalf Corvotempesta
2012-10-30 14:38                 ` 袁冬
2012-10-30 14:59                   ` Gandalf Corvotempesta
2012-10-30 21:06                     ` Dan Mick
2012-10-30 21:17                       ` Gandalf Corvotempesta
2012-10-30 21:21                         ` Sage Weil