is rados block cluster production ready ?

* is rados block cluster production ready ?
       [not found] <531ff5f3-11d5-488e-adfb-68000f3c36c0@mailpro>
@ 2012-05-18  6:08 ` Alexandre DERUMIER
  2012-05-18  8:45   ` Christian Brunner
  2012-05-18 16:13   ` Tommi Virtanen
  0 siblings, 2 replies; 8+ messages in thread
From: Alexandre DERUMIER @ 2012-05-18  6:08 UTC (permalink / raw)
  To: ceph-devel

Hi,
I'm going to build a RADOS block device cluster for my KVM hypervisors.

Is it production ready yet? (stable, no crashes)

I have read about some btrfs bugs on this mailing list, so I'm a bit scared...

Also, what performance can I expect?
I am trying to build a fast cluster with fast SSDs.
Each node: 8 OSDs with "OCZ Talos" SAS drives + a STEC ZeusRAM drive (8GB NVRAM) for the journal + 10Gb Ethernet.
Do you think I can saturate the 10Gb link?

I also have some questions about performance over time.
I have had some problems with my ZFS SAN: ZFS fragmentation and the metaslab problem.
How does btrfs perform over time?

About the network: does the RADOS protocol support some kind of multipathing? Or do I need to use bonding/LACP?

I have submitted the contact form on the Inktank website, as we would like to pay for support/expertise to build this cluster, but I have not had a response yet.

Regards,

Alexandre Derumier
System Engineer
aderumier@odiso.com

* Re: is rados block cluster production ready ?
  2012-05-18  6:08 ` is rados block cluster production ready ? Alexandre DERUMIER
@ 2012-05-18  8:45   ` Christian Brunner
  2012-05-18  9:58     ` Alexandre DERUMIER
  2012-05-18 16:13   ` Tommi Virtanen
  1 sibling, 1 reply; 8+ messages in thread
From: Christian Brunner @ 2012-05-18  8:45 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: ceph-devel

2012/5/18 Alexandre DERUMIER <aderumier@odiso.com>:
> Hi,
> I'm going to build a RADOS block device cluster for my KVM hypervisors.
>
> Is it production ready yet? (stable, no crashes)

We are using 0.45 in production. Recent ceph versions are quite stable
(although we had some trouble lately with excessive logging and a full
log partition, which caused our cluster to halt).

> I have read about some btrfs bugs on this mailing list, so I'm a bit scared...

For the moment I would definitely recommend using XFS as the
underlying filesystem. At least until there is a fix for the
orphan_commit_root problem. XFS comes with a slight performance
impact, but it seems to be the only filesystem that is able to handle
heavy ceph workload for the moment.

> Also, what performance can I expect?

We are running a small ceph cluster (4 Servers with 4 OSDs each) on a
10GE network. Servers are spread across two datacenters with a 5km (3
mile) long 10GE fibre-link for data replication. Our servers are
equipped with 80GB Fusion-IO drives (for the journal) and traditional
3.5" SAS drives in a RAID5 configuration (but I would not recommend
this setup).

From a guest we can get a throughput of ~500MB/s.

> I am trying to build a fast cluster with fast SSDs.
> Each node: 8 OSDs with "OCZ Talos" SAS drives + a STEC ZeusRAM drive (8GB NVRAM) for the journal + 10Gb Ethernet.
> Do you think I can saturate the 10Gb link?

This is probably the best hardware for a ceph cluster money can buy.
Are you planning a single SAS drive per OSD?

I still don't know the cause exactly, but we are not able to saturate
10GE (maybe it's the latency on the WAN link or some network
configuration problem).
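
For scale, here is a rough sketch of the arithmetic behind that comparison (not a measurement). It converts the 10GE line rate to MB/s and relates the ~500MB/s guest figure mentioned above to it. Decimal units (1 MB = 10^6 bytes) and zero protocol overhead are assumptions, and the function names are made up for illustration.

```python
# Back-of-the-envelope check: what share of a 10 Gbit/s link is 500 MB/s?
# Assumes decimal units and ignores Ethernet/TCP/replication overhead.


def link_capacity_mb_per_s(gbit_per_s: float) -> float:
    """Raw line rate in MB/s: giga -> mega (x1000), bits -> bytes (/8)."""
    return gbit_per_s * 1000 / 8


def utilisation(observed_mb_per_s: float, gbit_per_s: float) -> float:
    """Observed throughput as a fraction of the raw line rate."""
    return observed_mb_per_s / link_capacity_mb_per_s(gbit_per_s)


if __name__ == "__main__":
    print(f"10GE raw capacity: {link_capacity_mb_per_s(10):.0f} MB/s")  # 1250 MB/s
    print(f"500 MB/s is {utilisation(500, 10):.0%} of the link")        # 40%
```

So a single guest at ~500MB/s leaves more than half of the raw link unused, before any overhead is counted.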

> I also have some questions about performance over time.
> I have had some problems with my ZFS SAN: ZFS fragmentation and the metaslab problem.
> How does btrfs perform over time?

I did some artificial tests with btrfs. With large metadata blocks
enabled (e.g. mkfs.btrfs -l 64k -n 64k), the performance degradation
seems to be gone.

> About the network: does the RADOS protocol support some kind of multipathing? Or do I need to use bonding/LACP?

We are using bonding. The rados client fails over to another OSD node
after a few seconds when there is no response from an OSD.
(You should read about CRUSH in the ceph docs).

Regards,
Christian

* Re: is rados block cluster production ready ?
  2012-05-18  8:45   ` Christian Brunner
@ 2012-05-18  9:58     ` Alexandre DERUMIER
  2012-05-18 10:43       ` Christian Brunner
  0 siblings, 1 reply; 8+ messages in thread
From: Alexandre DERUMIER @ 2012-05-18  9:58 UTC (permalink / raw)
  To: Christian Brunner; +Cc: ceph-devel

Hi Christian,
thanks for your response.

>>We are using 0.45 in production. Recent ceph versions are quite stable
>>(although we had some trouble lately with excessive logging and a full
>>log partition, which caused our cluster to halt).

Was the excessive logging caused by a configuration error?

>>For the moment I would definitely recommend using XFS as the 
>>underlying filesystem. At least until there is a fix for the 
>>orphan_commit_root problem. XFS comes with a slight performance 
>>impact, but it seems to be the only filesystem that is able to handle 
>>heavy ceph workload for the moment. 

What's the benefit of using btrfs? Snapshots? (I would like to be able to do snapshots, and maybe clones.)

>>We are running a small ceph cluster (4 Servers with 4 OSDs each) on a 
>>10GE network. Servers are spread across two datacenters with a 5km (3 
>>mile) long 10GE fibre-link for data replication. Our servers are 
>>equipped with 80GB Fusion-IO drives (for the journal) and traditional 
>>3.5" SAS drives in a RAID5 configuration (but I would not recommend
>>this setup).

>>From a guest we can get a throughput of ~500MB/s.

Great! (And from multiple guests? Do you get more throughput?)

Also, about latencies: do you get good latencies with your Fusion-IO journal?
I currently use ZFS storage, where writes go to a fast NVRAM journal first and are then flushed to 15K disks.
Is the behaviour the same with ceph?

>>This is probably the best hardware for a ceph cluster money can buy. 
>>Are you planning a single SAS drive per OSD? 

Yes, one OSD per drive.
So if something goes wrong with btrfs or XFS, I'll have only one failed disk and not the whole RAID.
Is that the right way to set up OSDs?

>>I still don't know the cause exactly, but we are not able to saturate 
>>10GE (maybe it's the latency on the WAN link or some network 
>>configuration problem). 

Yes, maybe. (I wish I had the money for this kind of setup ;)

>>I did some artificial tests with btrfs. With large metadata blocks
>>enabled (e.g. mkfs.btrfs -l 64k -n 64k), the performance degradation
>>seems to be gone.

Great! (I'm very scared of this kind of bug.)

>>We are using bonding. The rados client fails over to another OSD node
>>after a few seconds when there is no response from an OSD.
>>(You should read about CRUSH in the ceph docs).

Thanks again for all your responses. (The Ceph community seems to be great :)

Regards,

Alexandre

* Re: is rados block cluster production ready ?
  2012-05-18  9:58     ` Alexandre DERUMIER
@ 2012-05-18 10:43       ` Christian Brunner
  0 siblings, 0 replies; 8+ messages in thread
From: Christian Brunner @ 2012-05-18 10:43 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: ceph-devel

2012/5/18 Alexandre DERUMIER <aderumier@odiso.com>:
> Hi Christian,
> thanks for your response.
>
>>>We are using 0.45 in production. Recent ceph versions are quite stable
>>>(although we had some trouble lately with excessive logging and a full
>>>log partition, which caused our cluster to halt).
>
> Was the excessive logging caused by a configuration error?

0.45 had some debug messages enabled by default, which we didn't
realize when doing the update. It can be easily disabled in the
config. (Haven't checked if this is still the case in 0.46).
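
Independent of the debug setting, one generic way to notice a filling log partition before it halts anything is a periodic free-space check. The following is a minimal sketch using only the Python standard library. The path and the 80% threshold are assumptions chosen for illustration, not ceph settings.

```python
import os
import shutil


def used_fraction(path: str) -> float:
    """Fraction (0.0-1.0) of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def partition_is_filling(path: str, threshold: float = 0.80) -> bool:
    """True when usage is at or above the (assumed) alert threshold."""
    return used_fraction(path) >= threshold


if __name__ == "__main__":
    log_dir = "/var/log"  # hypothetical: point this at the OSD log partition
    if os.path.isdir(log_dir) and partition_is_filling(log_dir):
        print(f"WARNING: {log_dir} is {used_fraction(log_dir):.0%} full")
```

Run from cron or a monitoring agent, a check like this would only warn. It would not stop a daemon from logging.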

>>>For the moment I would definitely recommend using XFS as the
>>>underlying filesystem. At least until there is a fix for the
>>>orphan_commit_root problem. XFS comes with a slight performance
>>>impact, but it seems to be the only filesystem that is able to handle
>>>heavy ceph workload for the moment.
>
> What's the benefit of using btrfs? Snapshots? (I would like to be able to do snapshots, and maybe clones.)

RBD snapshots are handled independently of the underlying filesystem,
so you wouldn't lose that feature. (AFAIK clones are still on the
roadmap - RBD layering).

When enabled, ceph uses btrfs snapshots internally for consistent
writes. This gives you some performance advantages. With other
filesystems you can only use "write-ahead journaling".

see http://ceph.com/docs/master/dev/filestore-filesystem-compat/
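
To make the term concrete, here is a toy sketch of the general write-ahead journaling idea: record the change durably in a journal first, apply it to the slower store afterwards, and replay the journal after a crash. This is not how the ceph filestore is implemented. The class, the file name and the JSON record format are all invented for illustration.

```python
import json
import os
import tempfile


class ToyJournaledStore:
    """Toy key/value store: every write hits the journal before the store."""

    def __init__(self, directory: str) -> None:
        self.journal_path = os.path.join(directory, "journal.log")
        self.data = {}   # stands in for the slower backing filesystem
        self._replay()   # recover anything journaled before a crash

    def write(self, key: str, value) -> None:
        # Step 1: append the intent to the journal and force it to disk.
        with open(self.journal_path, "a", encoding="utf-8") as journal:
            journal.write(json.dumps({"key": key, "value": value}) + "\n")
            journal.flush()
            os.fsync(journal.fileno())
        # Step 2: only now apply the change to the backing store.
        self.data[key] = value

    def _replay(self) -> None:
        """Re-apply journal entries in order; later entries win."""
        if not os.path.exists(self.journal_path):
            return
        with open(self.journal_path, encoding="utf-8") as journal:
            for line in journal:
                entry = json.loads(line)
                self.data[entry["key"]] = entry["value"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = ToyJournaledStore(tmp)
        store.write("block-0", "v1")
        store.write("block-0", "v2")
        recovered = ToyJournaledStore(tmp)  # simulates a restart
        print(recovered.data)
```

The sketch never trims its journal. A real journal has a bounded size and discards entries once the backing store has safely caught up.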

>>>We are running a small ceph cluster (4 Servers with 4 OSDs each) on a
>>>10GE network. Servers are spread across two datacenters with a 5km (3
>>>mile) long 10GE fibre-link for data replication. Our servers are
>>>equipped with 80GB Fusion-IO drives (for the journal) and traditional
>>>3.5" SAS drives in a RAID5 configuration (but I would not recommend
>>>this setup).
>
>>>From a guest we can get a throughput of ~500MB/s.
>
> Great! (And from multiple guests? Do you get more throughput?)

Yes. We were able to increase that even from a single guest with a
RAID0 over multiple rbd volumes.

> Also, about latencies: do you get good latencies with your Fusion-IO journal?

Latencies are ok, but I don't like the proprietary driver. Not that
the driver is causing any problems, but it is always a bit tricky when
doing kernel updates.

> I currently use ZFS storage, where writes go to a fast NVRAM journal first and are then flushed to 15K disks.
> Is the behaviour the same with ceph?

That is quite similar to the ceph journal.

>>>This is probably the best hardware for a ceph cluster money can buy.
>>>Are you planning a single SAS drive per OSD?
>
> Yes, one OSD per drive.
> So if something goes wrong with btrfs or XFS, I'll have only one failed disk and not the whole RAID.
> Is that the right way to set up OSDs?

You can do it this way. We decided to put the ceph storage on a local
RAID5 because we didn't want to re-replicate over the network when a
single disk has to be swapped. There has been a discussion on the list
about the best way to set up the OSDs, but I think there was no final
consensus.

Regards,
Christian

* Re: is rados block cluster production ready ?
  2012-05-18  6:08 ` is rados block cluster production ready ? Alexandre DERUMIER
  2012-05-18  8:45   ` Christian Brunner
@ 2012-05-18 16:13   ` Tommi Virtanen
  2012-05-21  8:09     ` Stefan Priebe - Profihost AG
  1 sibling, 1 reply; 8+ messages in thread
From: Tommi Virtanen @ 2012-05-18 16:13 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: ceph-devel

Thanks Christian for doing an awesome job, you answered some of the
questions better than I personally could have ;)

On Thu, May 17, 2012 at 11:08 PM, Alexandre DERUMIER
<aderumier@odiso.com> wrote:
> About the network: does the RADOS protocol support some kind of multipathing? Or do I need to use bonding/LACP?

Bonding is almost always good if you have the switch ports & bandwidth
to spare. Even without bonding, one of the tricks RADOS does is that
each chunk of the RBD image is stored on a different OSD, so your
hypervisor will actually talk to even thousands of OSDs when the vm
accesses the disk. This way, the network link of any individual
storage node is less likely to become a performance bottleneck for the
vm.
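
A rough sketch of that chunking idea: byte offsets in the image map to fixed-size objects, and each object lands on some OSD. The 4 MiB object size is an assumption here, and the hash-modulo placement is a deliberately simple stand-in, not the CRUSH algorithm. All names are hypothetical.

```python
import hashlib

OBJECT_SIZE = 4 * 1024 * 1024  # assumed object (chunk) size: 4 MiB


def object_index(offset: int, object_size: int = OBJECT_SIZE) -> int:
    """Which chunk of the image a byte offset falls into."""
    return offset // object_size


def toy_osd_for(image: str, index: int, num_osds: int) -> int:
    """Toy placement: hash of (image, chunk index) modulo OSD count. NOT CRUSH."""
    digest = hashlib.sha256(f"{image}.{index}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_osds


def osds_touched(image: str, start: int, length: int, num_osds: int) -> set:
    """Set of OSDs a contiguous read or write of `length` bytes would hit."""
    first = object_index(start)
    last = object_index(start + length - 1)
    return {toy_osd_for(image, i, num_osds) for i in range(first, last + 1)}


if __name__ == "__main__":
    # A 4 GiB sequential read over a hypothetical 32-OSD cluster.
    hit = osds_touched("vm-disk-1", 0, 1024 * OBJECT_SIZE, 32)
    print(f"{len(hit)} of 32 OSDs serve this read")
```

Even with this naive placement, a large sequential access fans out over many OSDs, which is the effect described above.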

> I have submitted the contact form on the Inktank website, as we would like to pay for support/expertise to build this cluster, but I have not had a response yet.

You should receive a response soon.

* Re: is rados block cluster production ready ?
  2012-05-18 16:13   ` Tommi Virtanen
@ 2012-05-21  8:09     ` Stefan Priebe - Profihost AG
  2012-05-21  8:18       ` Tim O'Donovan
  0 siblings, 1 reply; 8+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-05-21  8:09 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: Alexandre DERUMIER, ceph-devel

On 18.05.2012 18:13, Tommi Virtanen wrote:
> <aderumier@odiso.com> wrote:
>> About the network: does the RADOS protocol support some kind of multipathing? Or do I need to use bonding/LACP?
> 
> Bonding is almost always good if you have the switch ports & bandwidth
> to spare. Even without bonding, one of the tricks RADOS does is that
> each chunk of the RBD image is stored on a different OSD, so your
> hypervisor will actually talk to even thousands of OSDs when the vm
> accesses the disk. This way, the network link of any individual
> storage node is less likely to become a performance bottleneck for the
> vm.
That's great, but if you go with, for example, 8 OSDs per server on 8
single disks, how can I make sure that ceph spreads my files across
different servers for redundancy?

Stefan

* Re: is rados block cluster production ready ?
  2012-05-21  8:09     ` Stefan Priebe - Profihost AG
@ 2012-05-21  8:18       ` Tim O'Donovan
  2012-05-21  8:25         ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 8+ messages in thread
From: Tim O'Donovan @ 2012-05-21  8:18 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org

> That's great, but if you go with, for example, 8 OSDs per server on 8
> single disks, how can I make sure that ceph spreads my files across
> different servers for redundancy?

I believe this is handled by CRUSH:

http://ceph.com/wiki/Custom_data_placement_with_CRUSH
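
The property being asked about (replicas never sharing a server) can be illustrated with a toy placement function. This is a sketch using rendezvous (highest-random-weight) hashing, not the CRUSH algorithm. The host names and the 4 x 8 layout are hypothetical.

```python
import hashlib
from typing import Dict, List


def _score(key: str, item: str) -> int:
    """Deterministic pseudo-random score for a (key, item) pair."""
    digest = hashlib.sha256(f"{key}/{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place_replicas(obj: str, hosts: Dict[str, List[int]], replicas: int = 2) -> List[int]:
    """Pick `replicas` OSD ids for `obj`, each on a different host."""
    if replicas > len(hosts):
        raise ValueError("not enough hosts for the requested replica count")
    # First choose distinct hosts (the failure domain), highest score wins.
    chosen_hosts = sorted(hosts, key=lambda h: _score(obj, h), reverse=True)[:replicas]
    # Then choose one OSD inside each chosen host.
    return [max(hosts[h], key=lambda osd: _score(obj, f"osd.{osd}")) for h in chosen_hosts]


if __name__ == "__main__":
    # Hypothetical layout: 4 servers with 8 OSDs each (ids 0-31).
    cluster = {f"node{n}": list(range(n * 8, n * 8 + 8)) for n in range(4)}
    print(place_replicas("rbd-object-42", cluster, replicas=2))
```

Because hosts are chosen before OSDs, losing one server can take out at most one copy of any object. In a real cluster this kind of rule is expressed in the CRUSH map rather than in code like this.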


Regards,
Tim O'Donovan

* Re: is rados block cluster production ready ?
  2012-05-21  8:18       ` Tim O'Donovan
@ 2012-05-21  8:25         ` Stefan Priebe - Profihost AG
  0 siblings, 0 replies; 8+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-05-21  8:25 UTC (permalink / raw)
  To: Tim O'Donovan; +Cc: ceph-devel@vger.kernel.org

On 21.05.2012 10:18, Tim O'Donovan wrote:
>> That's great, but if you go with, for example, 8 OSDs per server on 8
>> single disks, how can I make sure that ceph spreads my files across
>> different servers for redundancy?
> 
> I believe this is handled by CRUSH:
> 
> http://ceph.com/wiki/Custom_data_placement_with_CRUSH
Thanks!

Stefan
