* is rados block cluster production ready ?
From: Alexandre DERUMIER @ 2012-05-18 6:08 UTC
To: ceph-devel

Hi,

I'm going to build a RADOS block cluster for my KVM hypervisors.

Is it production ready yet? (stable, no crashes)
I have read about some btrfs bugs on this mailing list, so I'm a bit scared...

Also, what performance can I expect?
I'm trying to build a fast cluster with fast SSDs.
Each node: 8 OSDs on "OCZ Talos" SAS drives + a STEC ZeusRAM drive (8GB NVRAM) for the journal + 10Gb Ethernet.
Do you think I can saturate the 10Gb link?

I also have some questions about performance over time.
I have had some problems with my ZFS SAN: ZFS fragmentation and the metaslab problem.
How does btrfs perform over time?

About the network: does the RADOS protocol support some kind of multipathing, or do I need to use bonding/LACP?

I have submitted the contact form on the Inktank website, as we would like to pay for support/expertise to build this cluster, but I haven't had a response yet.

Regards,

Alexandre Derumier
System Engineer
aderumier@odiso.com

* Re: is rados block cluster production ready ?
From: Christian Brunner @ 2012-05-18 8:45 UTC
To: Alexandre DERUMIER; +Cc: ceph-devel

2012/5/18 Alexandre DERUMIER <aderumier@odiso.com>:
> Hi,
> I'm going to build a RADOS block cluster for my KVM hypervisors.
>
> Is it production ready yet? (stable, no crashes)

We are using 0.45 in production. Recent ceph versions are quite stable (although we had some trouble lately with excessive logging, which filled the log partition and caused our cluster to halt).

> I have read about some btrfs bugs on this mailing list, so I'm a bit scared...

For the moment I would definitely recommend using XFS as the underlying filesystem, at least until there is a fix for the orphan_commit_root problem. XFS comes with a slight performance penalty, but it seems to be the only filesystem that can handle a heavy ceph workload for the moment.

> Also, what performance can I expect?

We are running a small ceph cluster (4 servers with 4 OSDs each) on a 10GE network. The servers are spread across two datacenters with a 5 km (3 mile) 10GE fibre link for data replication. Our servers are equipped with 80GB Fusion-IO drives (for the journal) and traditional 3.5" SAS drives in a RAID5 configuration (but I would not recommend this setup).

From a guest we can get a throughput of ~500MB/s.

> I'm trying to build a fast cluster with fast SSDs.
> Each node: 8 OSDs on "OCZ Talos" SAS drives + a STEC ZeusRAM drive (8GB NVRAM) for the journal + 10Gb Ethernet.
> Do you think I can saturate the 10Gb link?

This is probably the best hardware for a ceph cluster money can buy. Are you planning a single SAS drive per OSD?

I still don't know the exact cause, but we are not able to saturate 10GE (maybe it's the latency on the WAN link or some network configuration problem).

> I also have some questions about performance over time.
> I have had some problems with my ZFS SAN: ZFS fragmentation and the metaslab problem.
> How does btrfs perform over time?

I did some artificial tests with btrfs with large metadata enabled (e.g. mkfs.btrfs -l 64k -n 64k), and the performance degradation seems to be gone.

> About the network: does the RADOS protocol support some kind of multipathing, or do I need to use bonding/LACP?

We are using bonding. The rados client fails over to another OSD node after a few seconds when there is no response from the OSD. (You should read about CRUSH in the ceph docs.)

Regards,
Christian

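To put the ~500MB/s figure next to the 10GE link it runs over, the arithmetic can be sketched as below. This is only a back-of-the-envelope calculation: it uses the raw line rate, ignores Ethernet/TCP overhead, and does not account for the extra replication traffic between OSDs, so the function names and the simplifications are assumptions made for illustration.

```python
def link_capacity_mb_s(gbit_per_s: float) -> float:
    """Raw line rate in decimal MB/s, ignoring all protocol overhead."""
    return gbit_per_s * 1000 / 8


def utilisation(throughput_mb_s: float, gbit_per_s: float) -> float:
    """Fraction of the raw line rate used by the given throughput."""
    return throughput_mb_s / link_capacity_mb_s(gbit_per_s)


if __name__ == "__main__":
    print(f"10GE raw capacity: {link_capacity_mb_s(10):.0f} MB/s")
    print(f"500 MB/s uses {utilisation(500, 10):.0%} of the link")
```

Under these assumptions a single guest at 500MB/s uses well under half of the raw link capacity, which matches the observation that the link is not saturated.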
* Re: is rados block cluster production ready ?
From: Alexandre DERUMIER @ 2012-05-18 9:58 UTC
To: Christian Brunner; +Cc: ceph-devel

Hi Christian,
thanks for your response.

>>We are using 0.45 in production. Recent ceph versions are quite stable
>>(although we had some trouble lately with excessive logging, which filled
>>the log partition and caused our cluster to halt).

Was the excessive logging caused by a configuration error?

>>For the moment I would definitely recommend using XFS as the
>>underlying filesystem, at least until there is a fix for the
>>orphan_commit_root problem. XFS comes with a slight performance
>>penalty, but it seems to be the only filesystem that can handle
>>a heavy ceph workload for the moment.

What's the benefit of using btrfs? Snapshots?
(I would like to be able to do snapshots, and maybe clones.)

>>We are running a small ceph cluster (4 servers with 4 OSDs each) on a
>>10GE network. The servers are spread across two datacenters with a 5 km (3
>>mile) 10GE fibre link for data replication. Our servers are
>>equipped with 80GB Fusion-IO drives (for the journal) and traditional
>>3.5" SAS drives in a RAID5 configuration (but I would not recommend
>>this setup).

>>From a guest we can get a throughput of ~500MB/s.

Great! (And from multiple guests? Do you get more throughput?)

Also, about latencies: do you get good latencies with your Fusion-IO journal?
I currently use ZFS storage, where writes go to a fast NVRAM journal and are then flushed to 15K disks.
Is the behaviour the same with ceph?

>>This is probably the best hardware for a ceph cluster money can buy.
>>Are you planning a single SAS drive per OSD?

Yes, one OSD per drive.
So if something goes wrong with btrfs or XFS, I'll have only one failed disk and not the whole RAID.
Is this the right way to set up OSDs?

>>I still don't know the exact cause, but we are not able to saturate
>>10GE (maybe it's the latency on the WAN link or some network
>>configuration problem).

Yes, maybe. (I wish I had the money for this kind of setup ;)

>>I did some artificial tests with btrfs with large metadata enabled
>>(e.g. mkfs.btrfs -l 64k -n 64k), and the performance degradation seems to
>>be gone.

Great! (I'm very scared of this kind of bug.)

>>We are using bonding. The rados client fails over to another
>>OSD node after a few seconds when there is no response from the OSD.
>>(You should read about CRUSH in the ceph docs.)

Thanks again for all your responses.
(The Ceph community seems to be great :)

Regards,

Alexandre

* Re: is rados block cluster production ready ?
From: Christian Brunner @ 2012-05-18 10:43 UTC
To: Alexandre DERUMIER; +Cc: ceph-devel

2012/5/18 Alexandre DERUMIER <aderumier@odiso.com>:
> Hi Christian,
> thanks for your response.
>
>>>We are using 0.45 in production. Recent ceph versions are quite stable
>>>(although we had some trouble lately with excessive logging, which filled
>>>the log partition and caused our cluster to halt).
>
> Was the excessive logging caused by a configuration error?

0.45 had some debug messages enabled by default, which we didn't realize when doing the update. They can easily be disabled in the config. (I haven't checked whether this is still the case in 0.46.)

>>>For the moment I would definitely recommend using XFS as the
>>>underlying filesystem, at least until there is a fix for the
>>>orphan_commit_root problem. XFS comes with a slight performance
>>>penalty, but it seems to be the only filesystem that can handle
>>>a heavy ceph workload for the moment.
>
> What's the benefit of using btrfs? Snapshots?
> (I would like to be able to do snapshots, and maybe clones.)

RBD snapshots are handled independently of the underlying filesystem, so you wouldn't lose that feature. (AFAIK clones, i.e. RBD layering, are still on the roadmap.)

When enabled, ceph uses btrfs snapshots internally for consistent writes. This gives you some performance advantages. With other filesystems you can only use "write ahead journaling". See http://ceph.com/docs/master/dev/filestore-filesystem-compat/

>>>We are running a small ceph cluster (4 servers with 4 OSDs each) on a
>>>10GE network. The servers are spread across two datacenters with a 5 km (3
>>>mile) 10GE fibre link for data replication. Our servers are
>>>equipped with 80GB Fusion-IO drives (for the journal) and traditional
>>>3.5" SAS drives in a RAID5 configuration (but I would not recommend
>>>this setup).
>
>>>From a guest we can get a throughput of ~500MB/s.
>
> Great! (And from multiple guests? Do you get more throughput?)

Yes. We were even able to increase that from a single guest with a RAID0 over multiple rbd volumes.

> Also, about latencies: do you get good latencies with your Fusion-IO journal?

Latencies are OK, but I don't like the proprietary driver. Not that the driver is causing any problems, but it is always a bit tricky when doing kernel updates.

> I currently use ZFS storage, where writes go to a fast NVRAM journal and are then flushed to 15K disks.
> Is the behaviour the same with ceph?

That is quite similar to the ceph journal.

>>>This is probably the best hardware for a ceph cluster money can buy.
>>>Are you planning a single SAS drive per OSD?
>
> Yes, one OSD per drive.
> So if something goes wrong with btrfs or XFS, I'll have only one failed disk and not the whole RAID.
> Is this the right way to set up OSDs?

You can do it this way. We decided to put the ceph storage on a local RAID5 because we didn't want to re-replicate over the network when a single disk has to be swapped. There has been a discussion on the list about the best way to set up the OSDs, but I think there was no final consensus.

Regards,
Christian

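The "RAID0 over multiple rbd volumes" trick mentioned above works because consecutive chunks of the guest's logical device land on different volumes, so large sequential I/O is spread over several RBD images at once. A minimal sketch of that address mapping follows; it shows generic RAID0 striping, not anything specific to ceph or to Christian's setup, and the 64 KiB chunk size and the volume count of 4 are hypothetical values chosen for the example.

```python
from typing import Tuple


def raid0_locate(offset: int, n_volumes: int, chunk_size: int) -> Tuple[int, int]:
    """Map a logical byte offset to (volume index, byte offset inside that volume)."""
    if offset < 0 or n_volumes < 1 or chunk_size < 1:
        raise ValueError("offset must be >= 0; n_volumes and chunk_size must be >= 1")
    chunk = offset // chunk_size      # index of the chunk on the logical device
    volume = chunk % n_volumes        # chunks rotate round-robin across volumes
    stripe = chunk // n_volumes       # number of full stripes before this chunk
    return volume, stripe * chunk_size + offset % chunk_size


if __name__ == "__main__":
    chunk_size = 64 * 1024            # hypothetical chunk size
    for logical in (0, chunk_size, 4 * chunk_size + 10):
        print(logical, "->", raid0_locate(logical, n_volumes=4, chunk_size=chunk_size))
```

With four volumes, a 256 KiB sequential write touches all four of them once, which is why a single guest can get more throughput than it would from one volume.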
* Re: is rados block cluster production ready ?
From: Tommi Virtanen @ 2012-05-18 16:13 UTC
To: Alexandre DERUMIER; +Cc: ceph-devel

Thanks Christian for doing an awesome job, you answered some of the questions better than I personally could have ;)

On Thu, May 17, 2012 at 11:08 PM, Alexandre DERUMIER <aderumier@odiso.com> wrote:
> About the network: does the RADOS protocol support some kind of multipathing, or do I need to use bonding/LACP?

Bonding is almost always good if you have the switch ports & bandwidth to spare. Even without bonding, one of the tricks RADOS does is that each chunk of the RBD image is stored on a different OSD, so your hypervisor will actually talk to many OSDs, possibly even thousands, when the VM accesses the disk. This way, the network link of any individual storage node is less likely to become a performance bottleneck for the VM.

> I have submitted the contact form on the Inktank website, as we would like to pay for support/expertise to build this cluster, but I haven't had a response yet.

You should receive a response soon.

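The chunking Tommi describes can be illustrated with a small calculation of which image chunks (RADOS objects) a given read or write touches. This is a sketch under a stated assumption: the chunk size is taken to be 4 MiB, which is not given in the thread, and the function name is made up for the example. Which OSD each object ends up on is decided separately by CRUSH and is not modelled here.

```python
from typing import List

OBJECT_SIZE = 4 * 1024 * 1024  # assumed chunk size of 4 MiB, not stated in the thread


def objects_touched(offset: int, length: int, object_size: int = OBJECT_SIZE) -> List[int]:
    """Return the indices of the image chunks covered by [offset, offset + length)."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    first = offset // object_size
    last = (offset + length - 1) // object_size
    return list(range(first, last + 1))


if __name__ == "__main__":
    mib = 1024 * 1024
    # An 8 MiB request starting at 10 MiB spans three chunks.
    print(objects_touched(10 * mib, 8 * mib))
```

Because each of those chunks is a separate object, a large or scattered I/O pattern fans out over many storage nodes rather than hitting a single one.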
* Re: is rados block cluster production ready ?
From: Stefan Priebe - Profihost AG @ 2012-05-21 8:09 UTC
To: Tommi Virtanen; +Cc: Alexandre DERUMIER, ceph-devel

On 18.05.2012 18:13, Tommi Virtanen wrote:
> <aderumier@odiso.com> wrote:
>> About the network: does the RADOS protocol support some kind of multipathing, or do I need to use bonding/LACP?
>
> Bonding is almost always good if you have the switch ports & bandwidth
> to spare. Even without bonding, one of the tricks RADOS does is that
> each chunk of the RBD image is stored on a different OSD, so your
> hypervisor will actually talk to many OSDs, possibly even thousands, when
> the VM accesses the disk. This way, the network link of any individual
> storage node is less likely to become a performance bottleneck for the
> VM.

That's great, but if you go the way of having, for example, 8 OSDs per server with 8 single disks, how can I make sure that ceph spreads the copies of my data across different servers for redundancy?

Stefan

* Re: is rados block cluster production ready ?
From: Tim O'Donovan @ 2012-05-21 8:18 UTC
To: ceph-devel@vger.kernel.org

> That's great, but if you go the way of having, for example, 8 OSDs per
> server with 8 single disks, how can I make sure that ceph spreads the
> copies of my data across different servers for redundancy?

I believe this is handled by CRUSH:

http://ceph.com/wiki/Custom_data_placement_with_CRUSH

Regards,
Tim O'Donovan

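The idea behind the answer, that replicas of one object should land on different servers even when each server runs 8 OSDs, can be shown with a toy placement function. To be clear, this is NOT the CRUSH algorithm: it is a simplified hash-ranking scheme written only to illustrate host-level separation, and the cluster layout, names (node0, osd.0, ...) and helper functions are all hypothetical.

```python
import hashlib
from typing import Dict, List


def _score(*parts: str) -> int:
    """Deterministic pseudo-random score derived from the given strings."""
    digest = hashlib.sha256("/".join(parts).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(obj: str, cluster: Dict[str, List[str]], replicas: int = 2) -> List[str]:
    """Pick one OSD on each of `replicas` distinct hosts for the given object."""
    if replicas > len(cluster):
        raise ValueError("not enough hosts for the requested replica count")
    # Rank hosts by a per-object score and keep the best `replicas` of them,
    # so no two copies of the object can share a host.
    hosts = sorted(cluster, key=lambda h: _score(obj, h), reverse=True)[:replicas]
    # Within each chosen host, pick the OSD with the highest per-object score.
    return [max(cluster[h], key=lambda o: _score(obj, o)) for h in hosts]


# Hypothetical layout: 3 servers with 8 OSDs each.
cluster = {f"node{n}": [f"osd.{n * 8 + i}" for i in range(8)] for n in range(3)}

if __name__ == "__main__":
    print(place("some-object", cluster, replicas=2))
```

In a real cluster the equivalent constraint is expressed in the CRUSH map and its placement rules, which the linked wiki page describes.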
* Re: is rados block cluster production ready ?
From: Stefan Priebe - Profihost AG @ 2012-05-21 8:25 UTC
To: Tim O'Donovan; +Cc: ceph-devel@vger.kernel.org

On 21.05.2012 10:18, Tim O'Donovan wrote:
>> That's great, but if you go the way of having, for example, 8 OSDs per
>> server with 8 single disks, how can I make sure that ceph spreads the
>> copies of my data across different servers for redundancy?
>
> I believe this is handled by CRUSH:
>
> http://ceph.com/wiki/Custom_data_placement_with_CRUSH

Thanks!

Stefan