Ideal hardware spec?


* Ideal hardware spec?
@ 2012-08-22 13:55 Jonathan Proulx
  2012-08-22 14:17 ` Wido den Hollander
  2012-08-22 14:41 ` Mark Nelson
  0 siblings, 2 replies; 22+ messages in thread
From: Jonathan Proulx @ 2012-08-22 13:55 UTC (permalink / raw)
  To: ceph-devel

Hi All,

Yes, I'm asking the impossible question: what is the "best" hardware
config?

I'm looking at (possibly) using ceph as backing store for images and
volumes on OpenStack as well as exposing at least the object store for
direct use.  

The OpenStack cluster exists and is currently in the early stages of
use by researchers here, approx 1500 vCPU (counting hyperthreads;
actually 768 physical cores) and 3T of RAM across 64 physical nodes.

On the object store side it would be a new resource for us and hard to
say what people would do with it except that it would be many
different things and the use profile would be constantly changing
(which is true of all our existing storage).

In this sense, even though it's a "private cloud", the somewhat
unpredictable usage profile gives it some characteristics of a small
public cloud.

Size wise I'm hoping to start out with 3 monitors  and  5(+) OSD nodes
to end up with a 20-30T 3x replicated storage (call me paranoid).
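
As a rough sanity check on those numbers, the raw capacity implied by 3x
replication can be worked out directly. This is only a sketch: it assumes
raw capacity is simply usable capacity times the replica count, spread
evenly over the OSD nodes, and it ignores filesystem overhead, journal
space and free-space headroom. The function names are made up for
illustration.

```python
def raw_capacity_tb(usable_tb: float, replicas: int) -> float:
    """Raw disk capacity needed to store usable_tb with N-way replication."""
    return usable_tb * replicas


def raw_per_node_tb(usable_tb: float, replicas: int, osd_nodes: int) -> float:
    """Raw capacity each OSD node must hold if data is spread evenly."""
    return raw_capacity_tb(usable_tb, replicas) / osd_nodes


# The 20-30T usable, 3x replicated, 5 OSD node case from the message.
for usable in (20, 30):
    total = raw_capacity_tb(usable, replicas=3)
    per_node = raw_per_node_tb(usable, replicas=3, osd_nodes=5)
    print(f"{usable}T usable: {total:.0f}T raw, {per_node:.0f}T raw per node")
```

With five nodes that works out to somewhere between 12T and 18T of raw
disk per node, before any headroom is added.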

So the monitor specs seem relatively easy to come up with.  For the
OSDs it looks like
http://ceph.com/docs/master/install/hardware-recommendations suggests
1 drive, 1 core and  2G RAM per OSD (with multiple OSDs per storage
node).  On-list discussions seem to frequently include an SSD for
journaling (which is similar to what we do for our current ZFS-backed
NFS storage).

I'm hoping to wrap the hardware in a grant and willing to experiment a
bit with different software configurations to tune it up when/if I get
the hardware in.  So my immediate concern is a hardware spec that will
have a reasonable processor:memory:disk ratio and opinions (or better
data) on the utility of SSD.

First, is the documented core-to-disk ratio still current best
practice?  Given a platform with more drive slots, could 8 cores handle
more disks?  Would that need/like more memory?

Have SSDs been shown to improve performance with this architecture?

If so, given the 8-drive-slot example with seven OSDs presented in the
docs, what is the likelihood that using a high-performance SSD for the
OS image, and also cutting journal/log partitions out of it for the
remaining seven 2-3T nearline SAS drives, would work well?

Thanks,
-Jon


* Re: Ideal hardware spec?
  2012-08-22 13:55 Ideal hardware spec? Jonathan Proulx
@ 2012-08-22 14:17 ` Wido den Hollander
  2012-08-22 14:39   ` Stephen Perkins
  2012-08-22 15:46   ` Jonathan Proulx
  2012-08-22 14:41 ` Mark Nelson
  1 sibling, 2 replies; 22+ messages in thread
From: Wido den Hollander @ 2012-08-22 14:17 UTC (permalink / raw)
  To: Jonathan Proulx; +Cc: ceph-devel

Hi,

On 08/22/2012 03:55 PM, Jonathan Proulx wrote:
> Hi All,
>
> Yes, I'm asking the impossible question: what is the "best" hardware
> config?
>
> I'm looking at (possibly) using ceph as backing store for images and
> volumes on OpenStack as well as exposing at least the object store for
> direct use.
>
> The openstack cluster exists and is currently in the early stages of
> use by researchers here, approx 1500 vCPU (counting hyperthreads;
> actually 768 physical cores) and 3T of RAM across 64 physical nodes.
>
> On the object store side it would be a new resource for us and hard to
> say what people would do with it except that it would be many
> different things and the use profile would be constantly changing
> (which is true of all our existing storage).
>
> In this sense, even though it's a "private cloud", the somewhat
> unpredictable usage profile gives it some characteristics of a small
> public cloud.
>
> Size wise I'm hoping to start out with 3 monitors  and  5(+) OSD nodes
> to end up with a 20-30T 3x replicated storage (call me paranoid).
>

I prefer 3x replication as well. I've seen the "wrong" OSDs die on me 
too often.

> So the monitor specs seem relatively easy to come up with.  For the
> OSDs it looks like
> http://ceph.com/docs/master/install/hardware-recommendations suggests
> 1 drive, 1 core and  2G RAM per OSD (with multiple OSDs per storage
> node).  On-list discussions seem to frequently include an SSD for
> journaling (which is similar to what we do for our current ZFS-backed
> NFS storage).
>
> I'm hoping to wrap the hardware in a grant and willing to experiment a
> bit with different software configurations to tune it up when/if I get
> the hardware in.  So my immediate concern is a hardware spec that will
> have a reasonable processor:memory:disk ratio and opinions (or better
> data) on the utility of SSD.
>
> First, is the documented core-to-disk ratio still current best
> practice?  Given a platform with more drive slots, could 8 cores handle
> more disks?  Would that need/like more memory?
>

I'd still suggest about 2GB of RAM per OSD. The more RAM you have in the 
OSD machines, the more the kernel can buffer, which will always be a 
performance gain.

You should, however, ask yourself whether you want a lot of OSDs per
server, or would rather go for smaller machines with fewer disks.

For example

- 1U
- 4 cores
- 8GB RAM
- 4 disks
- 1 SSD

Or

- 2U
- 8 cores
- 16GB RAM
- 8 disks
- 1|2 SSDs

Both will give you the same amount of storage per rack unit, but the
impact of losing one physical machine will be larger with the 2U machine.

If you take 1TB disks you'd lose 8TB of storage; that is a lot of
recovery to be done.

Since btrfs (assuming you are going to use that) is still in development,
it can't be ruled out that your machine goes down due to a kernel panic or
other problems.

My personal preference is having multiple small(er) machines rather than
a couple of large machines.
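
To put numbers on that trade-off, here is a small sketch comparing the two
example builds. The 64TB raw target is an assumption chosen purely for
illustration, as are the class and function names; the disk counts and
the 1TB disk size come from the examples above.

```python
import math
from dataclasses import dataclass


@dataclass
class NodeSpec:
    name: str
    rack_units: int
    disks: int
    disk_tb: float = 1.0  # 1TB disks, as in the example above

    @property
    def capacity_tb(self) -> float:
        return self.disks * self.disk_tb


def nodes_needed(raw_tb: float, spec: NodeSpec) -> int:
    """Number of nodes of this type needed to reach raw_tb of raw capacity."""
    return math.ceil(raw_tb / spec.capacity_tb)


def loss_fraction(raw_tb: float, spec: NodeSpec) -> float:
    """Share of the cluster's raw capacity that disappears with one node."""
    return spec.capacity_tb / (nodes_needed(raw_tb, spec) * spec.capacity_tb)


small = NodeSpec("1U", rack_units=1, disks=4)
large = NodeSpec("2U", rack_units=2, disks=8)

TARGET_RAW_TB = 64  # hypothetical cluster size, not from the thread
for spec in (small, large):
    n = nodes_needed(TARGET_RAW_TB, spec)
    print(f"{spec.name}: {n} nodes, {n * spec.rack_units}U of rack space, "
          f"{spec.capacity_tb:.0f}TB to recover per failed node "
          f"({loss_fraction(TARGET_RAW_TB, spec):.1%} of the cluster)")
```

Both builds take the same rack space for the same capacity, but a failed
2U node takes twice as much data with it and so triggers twice as much
recovery traffic.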

> Have SSDs been shown to improve performance with this architecture?
>

I've seen an improvement in performance indeed. Make sure, however, that
you have a recent version of glibc with syncfs support.

> If so, given the 8-drive-slot example with seven OSDs presented in the
> docs, what is the likelihood that using a high-performance SSD for the
> OS image, and also cutting journal/log partitions out of it for the
> remaining seven 2-3T nearline SAS drives, would work well?
>

You should make sure your SSD is capable of sustaining the line speed of
your network.

If you are connecting the machines with 4G trunks, make sure the SSD is 
capable of doing around 400MB/sec of sustained writes.
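
The conversion behind that figure can be sketched as follows. The 0.8
efficiency factor is an assumption picked so that a 4Gbit trunk lands on
roughly 400MB/sec; it stands in for protocol overhead and is not a
measured value.

```python
def journal_write_target_mb_s(link_gbit_s: float, efficiency: float = 0.8) -> float:
    """Sustained write rate (MB/s) a journal SSD needs to keep up with the network.

    link_gbit_s: aggregate network bandwidth into the node, in Gbit/s.
    efficiency:  assumed fraction of raw line rate that arrives as payload.
    """
    raw_mb_s = link_gbit_s * 1000 / 8  # Gbit/s -> MB/s (decimal units)
    return raw_mb_s * efficiency


for gbit in (1, 4, 10):
    print(f"{gbit:>2} Gbit/s link: about "
          f"{journal_write_target_mb_s(gbit):.0f} MB/s of sustained journal writes")
```

Since every incoming write passes through the journal first, an SSD slower
than this becomes the bottleneck before the network does.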

I'd recommend the Intel 520 SSDs, and changing their available capacity
with hdparm to about 20% of their original capacity. This way the SSD
always has a lot of free cells available for writing. Reprogramming
cells is expensive on an SSD.
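
A sketch of the arithmetic for that over-provisioning step, assuming
512-byte sectors. The sector count below is only an example value for a
240GB-class drive; read the real number from your own device. To my
knowledge the limit is applied through hdparm's -N (host protected area)
option, but check the hdparm man page before running anything against a
disk.

```python
SECTOR_BYTES = 512  # assumed logical sector size


def visible_sectors(total_sectors: int, keep_percent: int) -> int:
    """Sector count to expose so that only keep_percent of the drive is usable."""
    return total_sectors * keep_percent // 100


def sectors_to_gb(sectors: int) -> float:
    """Decimal gigabytes represented by a number of sectors."""
    return sectors * SECTOR_BYTES / 1e9


total = 468_862_128  # example value only, not taken from the thread
limit = visible_sectors(total, keep_percent=20)
print(f"expose {limit} of {total} sectors "
      f"(about {sectors_to_gb(limit):.0f} GB of {sectors_to_gb(total):.0f} GB)")
```

The remaining 80% is never written by the host, which leaves the
controller plenty of spare cells to work with.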

You can run the OS on the same SSD since that won't do that much I/O. 
I'd recommend not logging locally though, since that will also write to 
the same SSD. Try using remote syslog.

You can also use the USB sticks[0] from Stec; they have server-grade
onboard USB sticks for this kind of application.

A couple of questions still need to be answered though:
* Which OS are you planning on using? Ubuntu 12.04 is recommended
* Which filesystem do you want to use underneath the OSDs?

Wido

[0]: http://www.stec-inc.com/product/ufm.php

> Thanks,
> -Jon


* RE: Ideal hardware spec?
  2012-08-22 14:17 ` Wido den Hollander
@ 2012-08-22 14:39   ` Stephen Perkins
  2012-08-23  8:24     ` Wido den Hollander
  2012-08-22 15:46   ` Jonathan Proulx
  1 sibling, 1 reply; 22+ messages in thread
From: Stephen Perkins @ 2012-08-22 14:39 UTC (permalink / raw)
  To: 'Wido den Hollander', 'Jonathan Proulx'; +Cc: ceph-devel

Hi all,

Is there a place we can set up a group of hardware recipes that people can
query and modify over time?  It would be good if people could submit and
"group modify" the recipes.   I would envision "hypothetical" configurations
and "deployed/tested" configurations.  

Trekking back through email exchanges like this becomes hard for people who
join later.

I'd like to see a "best" hardware config as well... however, I'm interested
in a SAS switching fabric where the nodes do not have any storage (except
possibly onboard boot drive/USB as listed below).  Each node would have a
SAS HBA that allows it to access a LARGE jbod  provided by a HA set of SAS
Switches (http://www.lsi.com/solutions/Pages/SwitchedSAS.aspx). The drives
are lun masked for each host.

The thought here is that you can add compute nodes, storage shelves, and
disks all independently.  With proper masking, you could provide redundancy
to cover drive, node, and shelf failures.    You could also add disks
"horizontally" if you have spare slots in a shelf, and you could add shelves
"vertically" and increase the disk count available to existing nodes.

My goal is to be able to scale without having to draw the enormous power of
lots of 1U devices or buy lots of disks and shelves each time I want to
add a little capacity.

Anybody looked at Atom processors?

- Steve



* Re: Ideal hardware spec?
  2012-08-22 14:39   ` Stephen Perkins
@ 2012-08-23  8:24     ` Wido den Hollander
  2012-08-24 14:17       ` Stephen Perkins
  0 siblings, 1 reply; 22+ messages in thread
From: Wido den Hollander @ 2012-08-23  8:24 UTC (permalink / raw)
  To: Stephen Perkins; +Cc: 'Jonathan Proulx', ceph-devel

On 08/22/2012 04:39 PM, Stephen Perkins wrote:
> Hi all,
>
> Is there a place we can set up a group of hardware recipes that people can
> query and modify over time?  It would be good if people could submit and
> "group modify" the recipes.   I would envision "hypothetical" configurations
> and "deployed/tested" configurations.
>
> Trekking back through email exchanges like this becomes hard for people who
> join later.
>

At the moment there isn't, but yes, a "show your setup" would be useful.
I don't know if there is any real reference material right now, but at
a later stage some showcases could be a great reference.

> I'd like to see a "best" hardware config as well... however, I'm interested
> in a SAS switching fabric where the nodes do not have any storage (except
> possibly onboard boot drive/USB as listed below).  Each node would have a
> SAS HBA that allows it to access a LARGE jbod  provided by a HA set of SAS
> Switches (http://www.lsi.com/solutions/Pages/SwitchedSAS.aspx). The drives
> are lun masked for each host.
>
> The thought here is that you can add compute nodes, storage shelves, and
> disks all independently.  With proper masking, you could provide redundancy
> to cover drive, node, and shelf failures.    You could also add disks
> "horizontally" if you have spare slots in a shelf, and you could add shelves
> "vertically" and increase the disk count available to existing nodes.
>

What would the benefit be of building such a complex SAS environment?
You'd be spending a lot of money on SAS switches, JBODs and cabling.

Your SPOF would still be your whole SAS setup.

And what is the benefit for having Ceph run on top of that? If you have 
all the disks available to all the nodes, why not run ZFS? ZFS would 
give you better performance since what you are building would actually 
be a local filesystem.

For risk spreading you should not interconnect all the nodes.

The more complexity you add to the whole setup, the more likely it is to
go down completely at some point in time.

I'm just trying to understand why you would want to run a distributed 
filesystem on top of a bunch of direct attached disks.

Again, if all the disks are attached locally you'd be better off using
ZFS.

> My goal is to be able to scale without having to draw the enormous power of
> lots of 1U devices or buy lots of disks and shelves each time I want to
> add a little capacity.
>

You can do that: scale by adding a 1U node with 2, 3 or 4 disks at a
time. Depending on your crushmap, you might need to add 3 machines at once.

If you have three "racks" in your crushmap each containing 5 nodes, you 
need to add a new node to each rack when expanding capacity to keep the 
racks balanced.

This way you would add three nodes when expanding.
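
The rack-balancing rule can be sketched like this. It is a toy model and
not Ceph's CRUSH implementation: a cluster is just a mapping of rack name
to node count, and the rack names are made up.

```python
def is_balanced(racks: dict[str, int]) -> bool:
    """True when every rack holds the same number of nodes."""
    return len(set(racks.values())) <= 1


def expand(racks: dict[str, int], nodes_per_rack: int = 1) -> dict[str, int]:
    """Grow the cluster by the same number of nodes in every rack."""
    return {name: count + nodes_per_rack for name, count in racks.items()}


# Three racks of five nodes each, as in the example above.
cluster = {"rack-a": 5, "rack-b": 5, "rack-c": 5}
grown = expand(cluster)
added = sum(grown.values()) - sum(cluster.values())

print(f"added {added} nodes, balanced: {is_balanced(grown)}")

# Adding a single node to one rack leaves the racks uneven.
lopsided = dict(cluster, **{"rack-a": 6})
print(f"one node in rack-a only, balanced: {is_balanced(lopsided)}")
```

The smallest balanced expansion step is therefore one node per rack, which
is where the three machines come from.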

> Anybody looked at atom processors?
>

Yes, I have.

I'm running Atom D525 (SuperMicro X7SPA-HF) nodes with 4GB of RAM and 4 
2TB disks and a 80GB SSD (old X25-M) for journaling.

That works, but what I notice is that under heavy recovery the Atoms
can't cope.

I'm thinking about building a couple of nodes with the AMD Brazos
mainboard, something like an Asus E35M1-I.

That is not a serverboard, but it would just be a reference to see what 
it does.

One of the problems with the Atoms is the 4GB memory limitation; with
the AMD Brazos you can use 8GB.

I'm trying to figure out a way to have a really large number of small
nodes for a low price to have a massive cluster where the impact of
losing one node is very small.

Wido




* RE: Ideal hardware spec?
  2012-08-23  8:24     ` Wido den Hollander
@ 2012-08-24 14:17       ` Stephen Perkins
  2012-08-24 14:41         ` Joe Landman
                           ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Stephen Perkins @ 2012-08-24 14:17 UTC (permalink / raw)
  To: 'Wido den Hollander'; +Cc: ceph-devel

Morning Wido (and all),

>> I'd like to see a "best" hardware config as well... however, I'm
>> interested in a SAS switching fabric where the nodes do not have any
>> storage (except possibly onboard boot drive/USB as listed below).
>> Each node would have a SAS HBA that allows it to access a LARGE jbod
>> provided by a HA set of SAS Switches
>> (http://www.lsi.com/solutions/Pages/SwitchedSAS.aspx). The drives are
>> lun masked for each host.
>>
>> The thought here is that you can add compute nodes, storage shelves,
>> and disks all independently.  With proper masking, you could provide
>> redundancy to cover drive, node, and shelf failures.  You could also
>> add disks "horizontally" if you have spare slots in a shelf, and you
>> could add shelves "vertically" and increase the disk count available
>> to existing nodes.
>>
>
>What would the benefit be from building such a complex SAS environment? 
>You'd be spending a lot of money on SAS switch, JBODs and cabling.

Density.

>Your SPOF would still be your whole SAS setup.

Well... I'm not sure I would consider it a single point of failure...  a
pair of cross-connected switches and 3-5 disk shelves.  Shelves can be
purchased with fully redundant internals (dual data paths etc to SAS
drives).  That is not even that important. If each shelf is just looked at
as JBOD, then you can group disks from different shelves into btrfs or
hardware RAID groups.  Or... you can look at each disk as its own storage
with its own OSD.

A SAS switch going offline would have no impact since everything is cross
connected.

A whole shelf can go offline and it would only appear as a single drive
failure in a RAID group (if disk groups are distributed properly).
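
One way to picture "distributed properly" is to build each group from one
disk per shelf. The sketch below is illustrative only: the shelf and disk
names are made up, and a real layout would also depend on the RAID level
and on how the LUN masking is done.

```python
def build_groups(shelves: dict[str, list[str]]) -> list[list[tuple[str, str]]]:
    """Group i takes the i-th disk from every shelf.

    Each group therefore has at most one disk on any given shelf.
    zip() stops at the shortest shelf, so leftover disks stay unassigned.
    """
    per_shelf = [[(name, disk) for disk in disks] for name, disks in shelves.items()]
    return [list(group) for group in zip(*per_shelf)]


def disks_lost_per_group(groups: list[list[tuple[str, str]]], failed_shelf: str) -> list[int]:
    """How many members each group loses when one whole shelf goes offline."""
    return [sum(1 for shelf, _ in group if shelf == failed_shelf) for group in groups]


# Hypothetical layout: three shelves with four disks each.
shelves = {
    "shelf1": ["d0", "d1", "d2", "d3"],
    "shelf2": ["d0", "d1", "d2", "d3"],
    "shelf3": ["d0", "d1", "d2", "d3"],
}
groups = build_groups(shelves)
print(f"{len(groups)} groups of {len(groups[0])} disks")
print("disks lost per group if shelf2 fails:", disks_lost_per_group(groups, "shelf2"))
```

With this layout a shelf failure costs every group exactly one member,
which is the single-drive-failure case described above.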

You can then get compute nodes fairly densely packed by purchasing
SuperMicro 2uTwin enclosures:
	http://www.supermicro.com/products/nfo/2UTwin2.cfm

You can get 3 - 4 of those compute enclosures with dual SAS connectors (each
enclosure not necessarily fully populated initially). The beauty is that the
SAS interconnect is fast.   Much faster than Ethernet.

Please bear in mind that I am looking to create a highly available and
scalable storage system that will fit in as small an area as possible and
draw as little power as possible.  The reasoning is that we co-locate all
our equipment at remote data centers.  Each rack (along with its associated
power and any needed cross connects) represents a significant ongoing
operational expense.  Therefore, for me, density and incremental scalability
are important.

>And what is the benefit for having Ceph run on top of that? If you have
>all the disks available to all the nodes, why not run ZFS?  ZFS would
>give you better performance since what you are building would actually
>be a local filesystem.

There is no high availability here.  Yes... You can try to do old school
magic with SAN file systems, complicated clustering, and synchronous
replication, but a RAIN approach appeals to me.  That is what I see in Ceph.
Don't get me wrong... I love ZFS... but am trying to figure out a scalable
HA solution that looks like RAIN. (Am I missing a feature of ZFS?)

>For risk spreading you should not interconnect all the nodes.

I do understand this.  However, our operational setup will not allow
multiple racks at the beginning.  So... given the constraints of 1 rack
(with dual power and dual WAN links), I do not see that a pair of cross
connected SAS switches is any less reliable than a pair of cross connected
ethernet switches...

As storage scales and we outgrow the single rack at a location, we can
overflow into a second rack etc.

>The more complexity you add to the whole setup, the more likely it is to go
>down completely at some point in time.
>
>I'm just trying to understand why you would want to run a distributed
>filesystem on top of a bunch of direct attached disks.

I guess I don't consider a SAN a bunch of direct attached disks.  The SAS
infrastructure is a SAN with SAS interconnects  (versus fiber,  iscsi or
infiniband)...  The disks are accessed via JBOD if desired... or you can put
RAID on top of a group of them.  The multiple shelves of drives are a way to
attempt to reduce the dependence on a single piece of hardware (i.e. it
becomes RAIN).

>Again, if all the disks are attached locally you'd be better off using
>ZFS.

This is not highly available, and AFAICT, the compute load would not scale
with the storage.

>> My goal is to be able to scale without having to draw the enormous
>> power of lots of 1U devices or buy lots of disks and shelves each time
>> I want to add a little capacity.
>>
>
>You can do that: scale by adding a 1U node with 2, 3 or 4 disks at a
>time. Depending on your crushmap, you might need to add 3 machines at once.

Adding three machines at once is what I was trying to avoid (I believe that
I need 3 replicas to make things reasonably redundant).  From first glance,
it does not seem like a very dense solution to try to add a bunch of 1U
servers with a few disks.  The associated cost of a bunch of 1U Servers over
JBOD, plus (and more importantly) the rack space and power draw, can cause
OPEX problems.  I can purchase multiple enclosures, but not fully populate
them with disks/cpus.  This gives me a redundant array of nodes (RAIN).
Then, as needed, I can add drives or compute cards to the existing
enclosures for little incremental cost.

In your 3 1U server case above, I can add 12 disks to existing 4 enclosures
(in groups of three) instead of three 1U servers with 4 disks each.  I can
then either run more OSDs on existing compute nodes or I can add one more
compute node and it can handle the new drives with one or more OSDs.  If I
run out of space in enclosures, I can add one more shelf (just one) and
start adding drives.  I can then "include" the new drives into existing OSDs
such that each existing OSD has a little more storage it needs to worry
about.  (The specifics of growing an existing OSD by adding a disk is still
a little fuzzy to me).

>> Anybody looked at atom processors?
>>
>
>Yes, I have.
>
>I'm running Atom D525 (SuperMicro X7SPA-HF) nodes with 4GB of RAM and 4 2TB
>disks and a 80GB SSD (old X25-M) for journaling.
>
>That works, but what I notice is that under heavy recovery the Atoms can't
>cope.
>
>I'm thinking about building a couple of nodes with the AMD Brazos
>mainboard, something like an Asus E35M1-I.
>
>That is not a serverboard, but it would just be a reference to see what it
>does.
>
>One of the problems with the Atoms is the 4GB memory limitation; with the
>AMD Brazos you can use 8GB.
>
>I'm trying to figure out a way to have a really large number of small nodes
>for a low price to have a massive cluster where the impact of losing one
>node is very small.

Given that "massive" is a relative term, I am as well... but I'm also trying
to reduce the footprint (power and space) of that "massive" cluster.  I also
want to start small (1/2 rack) and scale as needed.

- Steve


* Re: Ideal hardware spec?
  2012-08-24 14:17       ` Stephen Perkins
@ 2012-08-24 14:41         ` Joe Landman
  2012-08-24 15:05         ` Mark Nelson
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 22+ messages in thread
From: Joe Landman @ 2012-08-24 14:41 UTC (permalink / raw)
  To: Stephen Perkins; +Cc: 'Wido den Hollander', ceph-devel

On 08/24/2012 10:17 AM, Stephen Perkins wrote:

>>> The thought here is that you can add compute nodes, storage shelves,
>>> and disks all independently.  With proper masking, you could provide
>>> redundancy to cover drive, node, and shelf failures.  You could also
>>> add disks "horizontally" if you have spare slots in a shelf, and you
>>> could add shelves "vertically" and increase the disk count available
>>> to existing nodes.
>>>
>>
>> What would the benefit be from building such a complex SAS environment?
>> You'd be spending a lot of money on SAS switch, JBODs and cabling.
>
> Density.

As a solutions vendor, we try to stay out of these discussions in 
general, as we are biased (of course).

Your discussion of being able to scale up density, fabric, and other 
relevant things is rather precisely what one of our products is meant to 
do, though we take a different route on the fabric.

Rather than using SAS switching and SAS targets, we use iSCSI and iSER 
transports over 10 and 40GbE and IB.  Our targets are iSCSI/iSER.  Put 
these underneath what we call the presentation layer, where the Ceph 
OSDs, MDSs, etc will live.

Otherwise they are quite similar.

I don't want to pollute this discussion with a commercial.  Just wanted 
to chime in here to let Stephen know that we've been doing that sort of 
design for a while.

>> Your SPOF would still be your whole SAS setup.

Actually no.  This design is, when well implemented, more resilient than 
many others.

>
> Well... I'm not sure I would consider it a single point of failure...  a
> pair of cross-connected switches and 3-5 disk shelves.  Shelves can be
> purchased with fully redundant internals (dual data paths etc to SAS
> drives).  That is not even that important. If each shelf is just looked at
> as JBOD, then you can group disks from different shelves into btrfs or
> hardware RAID groups.  Or... you can look at each disk as its own storage
> with its own OSD.
>
> A SAS switch going offline would have no impact since everything is cross
> connected.
>
> A whole shelf can go offline and it would only appear as a single drive
> failure in a RAID group (if disks groups are distributed properly).
>
> You can then get compute nodes fairly densely packed by purchasing
> SuperMicro 2uTwin enclosures:
> 	http://www.supermicro.com/products/nfo/2UTwin2.cfm
>
> You can get 3 - 4 of those compute enclosure with dual SAS connectors (each
> enclosure not necessarily fully populated initially). The beauty is that the
> SAS interconnect is fast.   Much faster than Ethernet.

You remove SPOFs by accepting the reality that it's effectively 
impossible to have truly redundant power/data pathways on single 
backplane boards (literally the definition of a single point of 
failure).  If your redundant power supplies have a single power path to 
your backplane, is that redundant power (in the event of a short on the 
backplane)?  No, not even close.  And if your expander unit completely 
fails and locks hard ..., do you have a completely electrically separate 
pathway to your data?  With the single backplane/data path units, no you 
don't have this.  So putting multiple RAID cards into these units 
provides you with something akin to "security theatre".

>
> Please bear in mind that I am looking to create a highly available and
> scalable storage system that will fit in as small an area as possible and
> draw as little power as possible.  The reasoning is that we co-locate all
> our equipment at remote data centers.  Each rack (along with its associated
> power and any needed cross connects) represents a significant ongoing
> operational expense.  Therefore, for me, density and incremental scalability
> are important.

Not trying to be a commercial:  Think multi PB per 42U rack without heroics.

>
>> And what is the benefit for having Ceph run on top of that? If you have all
> the disks available to all the nodes, why not run ZFS?
>> ZFS would give you better performance since what you are building would
> actually be a local filesystem.
>
> There is no high availability here.  Yes... You can try to do old school
> magic with SAN file systems, complicated clustering, and synchronous
> replication, but a RAIN approach appeals to me.  That is what I see in Ceph.
> Don't get me wrong... I love ZFS... but am trying to figure out a scalable
> HA solution that looks like RAIN. (Am I missing a feature of ZFS)?

RAIN has some use cases, but rebuild times for a limited number of RAIDs 
and a huge number of drives will be HUGE, especially if your 
distributed LUNs start looking like multiple tens to hundreds of TB. 
Really, you'd have to go Ceph at this point.

>
>> For risk spreading you should not interconnect all the nodes.
>
> I do understand this.  However, our operational setup will not allow
> multiple racks at the beginning.  So... given the constraints of 1 rack
> (with dual power and dual WAN links), I do not see that a pair of cross
> connected SAS switches is any less reliable than a pair of cross connected
> ethernet switches...
>
> As storage scales and we outgrow the single rack at a location, we can
> overflow into a second rack etc.
>
>> The more complexity you add to the whole setup, the more likely it's to go
> down completely at some point in time.
>>
>> I'm just trying to understand why you would want to run a distributed
> filesystem on top of a bunch of direct attached disks.
>
> I guess I don't consider a SAN a bunch of direct attached disks.  The SAS
> infrastructure is a SAN with SAS interconnects  (versus fiber,  iscsi or
> infiniband)...  The disks are accessed via JBOD if desired... or you can put
> RAID on top of a group of them.  The multiple shelves of drives are a way to
> attempt to reduce the dependence on a single piece of hardware (i.e. it
> becomes RAIN).
>
>> Again, if all the disks are attached locally you'd be better of by using
> ZFS.
>
> This is not highly available, and AFAICT, the compute load would not scale
> with the storage.
>
>>> My goal is to be able to scale without having to draw the enormous
>>> power of lots of 1U devices or buy lots of disks and shelves each time
>>> I wasn't to add a little capacity.
>>>
>>
>> You can do that, scale by adding a 1U node with 2, 3 of 4 disks at the
> time, depending on your crushmap you might need to add 3 machines at a once.
>
> Adding three machines at once is what I was trying to avoid (I believe that
> I need 3 replicas to make things reasonably redundant).  From first glance,
> it does not seem like a very dense solution to try to add a bunch of 1U
> servers with a few disks.  The associated cost of a bunch of 1U Servers over
> JBOD, plus (and more importantly) the rack space and power draw, can cause
> OPEX problems.  I can purchase multiple enclosures, but not fully populate
> them with disks/cpus.  This gives me a redundant array of nodes (RAIN).
> Then. as needed, I can add drives or compute cards to the existing
> enclosures for little incremental cost.
>
> In your 3 1U server case above, I can add 12 disks to existing 4 enclosures
> (in groups of three) instead of three 1U servers with 4 disks each.  I can
> then either run more OSDs on existing compute nodes or I can add one more
> compute node and it can handle the new drives with one or more OSDs.  If I
> run out of space in enclosures, I can add one more shelf (just one) and
> start adding drives.  I can then "include" the new drives into existing OSDs
> such that each existing OSD has a little more storage it needs to worry
> about.  (The specifics of growing an existing OSD by adding a disk is still
> a little fuzzy to me).
>
>>> Anybody looked at atom processors?
>>>
>>
>> Yes, I have..
>>
>> I'm running Atom D525 (SuperMicro X7SPA-HF) nodes with 4GB of RAM and 4 2TB
> disks and a 80GB SSD (old X25-M) for journaling.
>>
>> That works, but what I notice is that under heavy recover the Atoms can't
> cope with it.
>>
>> I'm thinking about building a couple of nodes with the AMD Brazos
> mainboard, somelike like an Asus E35M1-I.
>>
>> That is not a serverboard, but it would just be a reference to see what it
> does.
>>
>> One of the problems with the Atoms is the 4GB memory limitation, with the
> AMD Brazos you can use 8GB.
>>
>> I'm trying to figure out a way to have a really large amount of small nodes
> for a low price to have
>> a massive cluster where the impact of loosing one node is very small.
>
> Given that "massive" is a relative term, I am as well... but I'm also trying
> to reduce the footprint (power and space) of that "massive" cluster.  I also
> want to start small (1/2 rack) and scale as needed.

Again, not a commercial:  Think 1PB in less than 1/2 a 42U rack, with a 
little more than 1 ton of AC.
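For a rough sense of scale, the figures above can be turned into density and
heat-load numbers. This is only a sketch: it assumes 1PB = 1000TB, takes
"1/2 a 42U rack" as exactly 21U, and uses the standard definition of one ton
of refrigeration (12,000 BTU/h). The function names are made up for
illustration and are not from any vendor tool.

```python
# Rough density and cooling arithmetic for "1PB in less than 1/2 a 42U rack,
# with a little more than 1 ton of AC".
# Assumptions (not from the thread): 1 PB = 1000 TB, half a rack = 21U.

BTU_PER_HOUR_PER_TON = 12_000          # definition of one ton of refrigeration
WATTS_PER_BTU_PER_HOUR = 0.29307107    # 1 BTU/h expressed in watts


def tb_per_rack_unit(capacity_tb: float, rack_units: float) -> float:
    """Storage density in TB per rack unit."""
    return capacity_tb / rack_units


def tons_to_kilowatts(tons: float) -> float:
    """Heat load, in kW, that the given cooling capacity can remove."""
    return tons * BTU_PER_HOUR_PER_TON * WATTS_PER_BTU_PER_HOUR / 1000.0


density = tb_per_rack_unit(1000, 21)
heat_kw = tons_to_kilowatts(1.0)

print(f"density:   {density:.1f} TB per U (lower bound, since it is 'less than' 21U)")
print(f"heat load: {heat_kw:.2f} kW per ton of AC")
```

Since the claim is "less than" half a rack and "a little more than" one ton,
both printed figures are approximate bounds rather than measured values.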

>
> - Steve
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman@scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615


* Re: Ideal hardware spec?
  2012-08-24 14:17       ` Stephen Perkins
  2012-08-24 14:41         ` Joe Landman
@ 2012-08-24 15:05         ` Mark Nelson
  2012-08-24 16:30           ` Sławomir Skowron
  2012-08-24 18:12           ` Wido den Hollander
  2012-08-24 16:12         ` Tommi Virtanen
  2012-08-24 18:09         ` Wido den Hollander
  3 siblings, 2 replies; 22+ messages in thread
From: Mark Nelson @ 2012-08-24 15:05 UTC (permalink / raw)
  To: Stephen Perkins; +Cc: 'Wido den Hollander', ceph-devel

On 08/24/2012 09:17 AM, Stephen Perkins wrote:
> Morning Wido (and all),
>
>>> I'd like to see a "best" hardware config as well... however, I'm
>>> interested in a SAS switching fabric where the nodes do not have any
>>> storage (except possibly onboard boot drive/USB as listed below).
>>> Each node would have a SAS HBA that allows it to access a LARGE jbod
>>> provided by a HA set of SAS Switches
>>> (http://www.lsi.com/solutions/Pages/SwitchedSAS.aspx). The drives are lun
> masked for each host.
>>>
>>> The thought here is that you can add compute nodes, storage shelves,
>>> and disks all independently.  With proper masking, you could provide
> redundancy
>>> to cover drive, node, and shelf failures.    You could also add disks
>>> "horizontally" if you have spare slots in a shelf, and you could add
>>> shelves "vertically" and increase the disk count available to existing
> nodes.
>>>
>>
>> What would the benefit be from building such a complex SAS environment?
>> You'd be spending a lot of money on SAS switch, JBODs and cabling.
>
> Density.
>

Trying to balance between dense solutions with more failure points vs 
cheap low density solutions is always tough.  Though not the densest 
solution out there, we are starting to investigate performance on an 
SC847a chassis with 36 hotswap drives in 4U (along with internal drives 
for the system).  Our setup doesn't use SAS expanders, which is a nice 
bonus, though it does require a lot of controllers.

>> Your SPOF would still be your whole SAS setup.
>
> Well... I'm not sure I would consider it a single point of failure...  a
> pair of cross-connected switches and 3-5 disk shelves.  Shelves can be
> purchased with fully redundant internals (dual data paths etc to SAS
> drives).  That is not even that important. If each shelf is just looked at
> as JBOD, then you can group disks from different shelves into btrfs or
> hardware RAID groups.  Or... you can look at each disk as its own storage
> with its own OSD.
>
> A SAS switch going offline would have no impact since everything is cross
> connected.
>
> A whole shelf can go offline and it would only appear as a single drive
> failure in a RAID group (if disks groups are distributed properly).
>
> You can then get compute nodes fairly densely packed by purchasing
> SuperMicro 2uTwin enclosures:
> 	http://www.supermicro.com/products/nfo/2UTwin2.cfm
>
> You can get 3 - 4 of those compute enclosure with dual SAS connectors (each
> enclosure not necessarily fully populated initially). The beauty is that the
> SAS interconnect is fast.   Much faster than Ethernet.
>
> Please bear in mind that I am looking to create a highly available and
> scalable storage system that will fit in as small an area as possible and
> draw as little power as possible.  The reasoning is that we co-locate all
> our equipment at remote data centers.  Each rack (along with its associated
> power and any needed cross connects) represents a significant ongoing
> operational expense.  Therefore, for me, density and incremental scalability
> are important.

There are some pretty interesting solutions on the horizon from various 
vendors that achieve a pretty decent amount of density.  Should be 
interesting times ahead. :)

>
>> And what is the benefit for having Ceph run on top of that? If you have all
> the disks available to all the nodes, why not run ZFS?
>> ZFS would give you better performance since what you are building would
> actually be a local filesystem.
>
> There is no high availability here.  Yes... You can try to do old school
> magic with SAN file systems, complicated clustering, and synchronous
> replication, but a RAIN approach appeals to me.  That is what I see in Ceph.
> Don't get me wrong... I love ZFS... but am trying to figure out a scalable
> HA solution that looks like RAIN. (Am I missing a feature of ZFS)?
>
>> For risk spreading you should not interconnect all the nodes.
>
> I do understand this.  However, our operational setup will not allow
> multiple racks at the beginning.  So... given the constraints of 1 rack
> (with dual power and dual WAN links), I do not see that a pair of cross
> connected SAS switches is any less reliable than a pair of cross connected
> ethernet switches...
>
> As storage scales and we outgrow the single rack at a location, we can
> overflow into a second rack etc.
>
>> The more complexity you add to the whole setup, the more likely it's to go
> down completely at some point in time.
>>
>> I'm just trying to understand why you would want to run a distributed
> filesystem on top of a bunch of direct attached disks.
>
> I guess I don't consider a SAN a bunch of direct attached disks.  The SAS
> infrastructure is a SAN with SAS interconnects  (versus fiber,  iscsi or
> infiniband)...  The disks are accessed via JBOD if desired... or you can put
> RAID on top of a group of them.  The multiple shelves of drives are a way to
> attempt to reduce the dependence on a single piece of hardware (i.e. it
> becomes RAIN).
>
>> Again, if all the disks are attached locally you'd be better of by using
> ZFS.
>
> This is not highly available, and AFAICT, the compute load would not scale
> with the storage.
>
>>> My goal is to be able to scale without having to draw the enormous
>>> power of lots of 1U devices or buy lots of disks and shelves each time
>>> I wasn't to add a little capacity.
>>>
>>
>> You can do that, scale by adding a 1U node with 2, 3 of 4 disks at the
> time, depending on your crushmap you might need to add 3 machines at a once.
>
> Adding three machines at once is what I was trying to avoid (I believe that
> I need 3 replicas to make things reasonably redundant).  From first glance,
> it does not seem like a very dense solution to try to add a bunch of 1U
> servers with a few disks.  The associated cost of a bunch of 1U Servers over
> JBOD, plus (and more importantly) the rack space and power draw, can cause
> OPEX problems.  I can purchase multiple enclosures, but not fully populate
> them with disks/cpus.  This gives me a redundant array of nodes (RAIN).
> Then. as needed, I can add drives or compute cards to the existing
> enclosures for little incremental cost.
>
> In your 3 1U server case above, I can add 12 disks to existing 4 enclosures
> (in groups of three) instead of three 1U servers with 4 disks each.  I can
> then either run more OSDs on existing compute nodes or I can add one more
> compute node and it can handle the new drives with one or more OSDs.  If I
> run out of space in enclosures, I can add one more shelf (just one) and
> start adding drives.  I can then "include" the new drives into existing OSDs
> such that each existing OSD has a little more storage it needs to worry
> about.  (The specifics of growing an existing OSD by adding a disk is still
> a little fuzzy to me).
>
>>> Anybody looked at atom processors?
>>>
>>
>> Yes, I have..
>>
>> I'm running Atom D525 (SuperMicro X7SPA-HF) nodes with 4GB of RAM and 4 2TB
> disks and a 80GB SSD (old X25-M) for journaling.
>>
>> That works, but what I notice is that under heavy recover the Atoms can't
> cope with it.
>>
>> I'm thinking about building a couple of nodes with the AMD Brazos
> mainboard, somelike like an Asus E35M1-I.
>>
>> That is not a serverboard, but it would just be a reference to see what it
> does.
>>
>> One of the problems with the Atoms is the 4GB memory limitation, with the
> AMD Brazos you can use 8GB.
>>
>> I'm trying to figure out a way to have a really large amount of small nodes
> for a low price to have
>> a massive cluster where the impact of loosing one node is very small.
>
> Given that "massive" is a relative term, I am as well... but I'm also trying
> to reduce the footprint (power and space) of that "massive" cluster.  I also
> want to start small (1/2 rack) and scale as needed.

If you do end up testing Brazos processors, please post your results!  I 
think it really depends on what kind of performance you are aiming for. 
Our stock 2U test boxes have 6-core Opterons, and our SC847a has dual 
6-core low power Xeon E5s.  At 10GbE+ these are probably going to be 
pushed pretty hard, especially during recovery.

>
> - Steve
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



* Re: Ideal hardware spec?
  2012-08-24 15:05         ` Mark Nelson
@ 2012-08-24 16:30           ` Sławomir Skowron
  2012-08-24 18:12           ` Wido den Hollander
  1 sibling, 0 replies; 22+ messages in thread
From: Sławomir Skowron @ 2012-08-24 16:30 UTC (permalink / raw)
  To: Mark Nelson
  Cc: Stephen Perkins, Wido den Hollander, ceph-devel@vger.kernel.org

On 24 Aug 2012, at 17:05, Mark Nelson <mark.nelson@inktank.com> wrote:

> On 08/24/2012 09:17 AM, Stephen Perkins wrote:
>> Morning Wido (and all),
>>
>>>> I'd like to see a "best" hardware config as well... however, I'm
>>>> interested in a SAS switching fabric where the nodes do not have any
>>>> storage (except possibly onboard boot drive/USB as listed below).
>>>> Each node would have a SAS HBA that allows it to access a LARGE jbod
>>>> provided by a HA set of SAS Switches
>>>> (http://www.lsi.com/solutions/Pages/SwitchedSAS.aspx). The drives are lun
>> masked for each host.
>>>>
>>>> The thought here is that you can add compute nodes, storage shelves,
>>>> and disks all independently.  With proper masking, you could provide
>> redundancy
>>>> to cover drive, node, and shelf failures.    You could also add disks
>>>> "horizontally" if you have spare slots in a shelf, and you could add
>>>> shelves "vertically" and increase the disk count available to existing
>> nodes.
>>>>
>>>
>>> What would the benefit be from building such a complex SAS environment?
>>> You'd be spending a lot of money on SAS switch, JBODs and cabling.
>>
>> Density.
>>
>
> Trying to balance between dense solutions with more failure points vs cheap low density solutions is always tough.  Though not the densest solution out there, we are starting to investigate performance on an SC847a chassis with 36 hotswap drives in 4U (along with internal drives for the system).  Our setup doesn't use SAS expanders which is nice bonus, though it does require a lot of controllers.
>
>>> Your SPOF would still be your whole SAS setup.
>>
>> Well... I'm not sure I would consider it a single point of failure...  a
>> pair of cross-connected switches and 3-5 disk shelves.  Shelves can be
>> purchased with fully redundant internals (dual data paths etc to SAS
>> drives).  That is not even that important. If each shelf is just looked at
>> as JBOD, then you can group disks from different shelves into btrfs or
>> hardware RAID groups.  Or... you can look at each disk as its own storage
>> with its own OSD.
>>
>> A SAS switch going offline would have no impact since everything is cross
>> connected.
>>
>> A whole shelf can go offline and it would only appear as a single drive
>> failure in a RAID group (if disks groups are distributed properly).
>>
>> You can then get compute nodes fairly densely packed by purchasing
>> SuperMicro 2uTwin enclosures:
>>   http://www.supermicro.com/products/nfo/2UTwin2.cfm
>>
>> You can get 3 - 4 of those compute enclosure with dual SAS connectors (each
>> enclosure not necessarily fully populated initially). The beauty is that the
>> SAS interconnect is fast.   Much faster than Ethernet.
>>
>> Please bear in mind that I am looking to create a highly available and
>> scalable storage system that will fit in as small an area as possible and
>> draw as little power as possible.  The reasoning is that we co-locate all
>> our equipment at remote data centers.  Each rack (along with its associated
>> power and any needed cross connects) represents a significant ongoing
>> operational expense.  Therefore, for me, density and incremental scalability
>> are important.
>
> There are some pretty interesting solutions on the horizon from various vendors that achieve a pretty decent amount of density.  Should be interesting times ahead. :)

LSI/NetApp have a nice 4U solution holding 60 NL-SAS drives on a SAS
backplane, but this is always a balance between price and performance
with elasticity: low- to mid-priced hardware versus
midrange/enterprise solutions.

I think Ceph was created to be a cheaper solution: to give us a chance
to use storage servers built from commodity hardware and fast 10Gb
Ethernet, without a pricey SAN infrastructure behind them. That gives
more scalability, and the ability to scale out rather than scale up.
Software like Ceph does the job that hardware solutions used to do.

>
>>
>>> And what is the benefit for having Ceph run on top of that? If you have all
>> the disks available to all the nodes, why not run ZFS?
>>> ZFS would give you better performance since what you are building would
>> actually be a local filesystem.
>>
>> There is no high availability here.  Yes... You can try to do old school
>> magic with SAN file systems, complicated clustering, and synchronous
>> replication, but a RAIN approach appeals to me.  That is what I see in Ceph.
>> Don't get me wrong... I love ZFS... but am trying to figure out a scalable
>> HA solution that looks like RAIN. (Am I missing a feature of ZFS)?
>>
>>> For risk spreading you should not interconnect all the nodes.
>>
>> I do understand this.  However, our operational setup will not allow
>> multiple racks at the beginning.  So... given the constraints of 1 rack
>> (with dual power and dual WAN links), I do not see that a pair of cross
>> connected SAS switches is any less reliable than a pair of cross connected
>> ethernet switches...
>>
>> As storage scales and we outgrow the single rack at a location, we can
>> overflow into a second rack etc.
>>
>>> The more complexity you add to the whole setup, the more likely it's to go
>> down completely at some point in time.
>>>
>>> I'm just trying to understand why you would want to run a distributed
>> filesystem on top of a bunch of direct attached disks.
>>
>> I guess I don't consider a SAN a bunch of direct attached disks.  The SAS
>> infrastructure is a SAN with SAS interconnects  (versus fiber,  iscsi or
>> infiniband)...  The disks are accessed via JBOD if desired... or you can put
>> RAID on top of a group of them.  The multiple shelves of drives are a way to
>> attempt to reduce the dependence on a single piece of hardware (i.e. it
>> becomes RAIN).
>>
>>> Again, if all the disks are attached locally you'd be better of by using
>> ZFS.
>>
>> This is not highly available, and AFAICT, the compute load would not scale
>> with the storage.
>>
>>>> My goal is to be able to scale without having to draw the enormous
>>>> power of lots of 1U devices or buy lots of disks and shelves each time
>>>> I wasn't to add a little capacity.
>>>>
>>>
>>> You can do that, scale by adding a 1U node with 2, 3 of 4 disks at the
>> time, depending on your crushmap you might need to add 3 machines at a once.
>>
>> Adding three machines at once is what I was trying to avoid (I believe that
>> I need 3 replicas to make things reasonably redundant).  From first glance,
>> it does not seem like a very dense solution to try to add a bunch of 1U
>> servers with a few disks.  The associated cost of a bunch of 1U Servers over
>> JBOD, plus (and more importantly) the rack space and power draw, can cause
>> OPEX problems.  I can purchase multiple enclosures, but not fully populate
>> them with disks/cpus.  This gives me a redundant array of nodes (RAIN).
>> Then. as needed, I can add drives or compute cards to the existing
>> enclosures for little incremental cost.
>>
>> In your 3 1U server case above, I can add 12 disks to existing 4 enclosures
>> (in groups of three) instead of three 1U servers with 4 disks each.  I can
>> then either run more OSDs on existing compute nodes or I can add one more
>> compute node and it can handle the new drives with one or more OSDs.  If I
>> run out of space in enclosures, I can add one more shelf (just one) and
>> start adding drives.  I can then "include" the new drives into existing OSDs
>> such that each existing OSD has a little more storage it needs to worry
>> about.  (The specifics of growing an existing OSD by adding a disk is still
>> a little fuzzy to me).
>>
>>>> Anybody looked at atom processors?
>>>>
>>>
>>> Yes, I have..
>>>
>>> I'm running Atom D525 (SuperMicro X7SPA-HF) nodes with 4GB of RAM and 4 2TB
>> disks and a 80GB SSD (old X25-M) for journaling.
>>>
>>> That works, but what I notice is that under heavy recover the Atoms can't
>> cope with it.
>>>
>>> I'm thinking about building a couple of nodes with the AMD Brazos
>> mainboard, somelike like an Asus E35M1-I.
>>>
>>> That is not a serverboard, but it would just be a reference to see what it
>> does.
>>>
>>> One of the problems with the Atoms is the 4GB memory limitation, with the
>> AMD Brazos you can use 8GB.
>>>
>>> I'm trying to figure out a way to have a really large amount of small nodes
>> for a low price to have
>>> a massive cluster where the impact of loosing one node is very small.
>>
>> Given that "massive" is a relative term, I am as well... but I'm also trying
>> to reduce the footprint (power and space) of that "massive" cluster.  I also
>> want to start small (1/2 rack) and scale as needed.
>
> If you do end up testing Brazos processes, please post your results!  I think it really depends on what kind of performance you are aiming for.  Our stock 2U test boxes have 6-core opterons, and our SC847a has dual 6-core low power Xeon E5s.  At 10GbE+ these are probably going to be pushed pretty hard, especially during recovery.

Today I saw 500MB/s in a cluster with 10Gb Ethernet during
recovery. Each machine, with 12 cores of Xeon E5600, hit a system
load of 50!!
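To put those recovery numbers in perspective, here is a small sketch of the
arithmetic. It assumes MB means 10^6 bytes, ignores Ethernet/TCP overhead,
and treats the 500MB/s as traffic over a single 10Gb link, which the message
does not actually state. The helper names are illustrative only.

```python
# Back-of-the-envelope view of "500MB/s on 10Gb Ethernet, load 50 on 12 cores".
# Assumptions (not from the thread): MB = 10**6 bytes, no protocol overhead,
# and the throughput is measured on one 10Gb link.


def link_utilization(throughput_mb_s: float, link_gbit_s: float) -> float:
    """Fraction of the link's line rate used by the given throughput."""
    throughput_gbit_s = throughput_mb_s * 8 / 1000
    return throughput_gbit_s / link_gbit_s


def load_per_core(system_load: float, cores: int) -> float:
    """Runnable (or blocked) tasks per core implied by the load average."""
    return system_load / cores


utilization = link_utilization(500, 10)
per_core = load_per_core(50, 12)

print(f"link utilization: {utilization:.0%}")
print(f"load per core:    {per_core:.1f}")
```

Under those assumptions the link is well below line rate while the load
average is several times the core count, which is consistent with the
concern that recovery is CPU-heavy. On Linux the load average also counts
tasks waiting on I/O, so this is a hint rather than proof of CPU saturation.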

>
>>
>> - Steve
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Ideal hardware spec?
  2012-08-24 15:05         ` Mark Nelson
  2012-08-24 16:30           ` Sławomir Skowron
@ 2012-08-24 18:12           ` Wido den Hollander
  2012-08-24 18:23             ` Mark Nelson
       [not found]             ` <00ae01cd823e$84e2ed20$8ea8c760$@netmass.com>
  1 sibling, 2 replies; 22+ messages in thread
From: Wido den Hollander @ 2012-08-24 18:12 UTC (permalink / raw)
  To: Mark Nelson; +Cc: ceph-devel



On 08/24/2012 05:05 PM, Mark Nelson wrote:
>>>
>>> I'm running Atom D525 (SuperMicro X7SPA-HF) nodes with 4GB of RAM and
>>> 4 2TB
>> disks and a 80GB SSD (old X25-M) for journaling.
>>>
>>> That works, but what I notice is that under heavy recover the Atoms
>>> can't
>> cope with it.
>>>
>>> I'm thinking about building a couple of nodes with the AMD Brazos
>> mainboard, somelike like an Asus E35M1-I.
>>>
>>> That is not a serverboard, but it would just be a reference to see
>>> what it
>> does.
>>>
>>> One of the problems with the Atoms is the 4GB memory limitation, with
>>> the
>> AMD Brazos you can use 8GB.
>>>
>>> I'm trying to figure out a way to have a really large amount of small
>>> nodes
>> for a low price to have
>>> a massive cluster where the impact of loosing one node is very small.
>>
>> Given that "massive" is a relative term, I am as well... but I'm also
>> trying
>> to reduce the footprint (power and space) of that "massive" cluster.
>> I also
>> want to start small (1/2 rack) and scale as needed.
>
> If you do end up testing Brazos processes, please post your results!  I
> think it really depends on what kind of performance you are aiming for.
>   Our stock 2U test boxes have 6-core opterons, and our SC847a has dual
> 6-core low power Xeon E5s.  At 10GbE+ these are probably going to be
> pushed pretty hard, especially during recovery.
>

I'm aiming for a Ceph cluster of a couple of hundred TB consisting 
of 5 or 6 racks full of 1U machines, each with 4x 1TB.

That means about 200 of these nodes, none of them doing that much work.

If one fails I'd lose 0.5% of my cluster and recovery shouldn't be that 
hard. Assuming here that the node crashes due to hardware failure, not 
being plagued by some Ceph or BTRFS bug cluster-wide :)

Wido
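The sizing above can be checked with a short sketch. The node count, drive
count and drive size come from the message; the 3x replication factor is an
assumption borrowed from earlier in the thread, and the per-survivor figure
assumes the failed node was completely full and that recovery is spread
evenly over the remaining nodes. The function name is made up for
illustration.

```python
# Impact of losing one node in a cluster of many small nodes.
# Assumptions (not stated in this message): 3 replicas, the failed node was
# full, and re-replication is spread evenly over the surviving nodes.


def node_failure_impact(nodes, drives_per_node, drive_tb, replicas):
    raw_tb = nodes * drives_per_node * drive_tb
    usable_tb = raw_tb / replicas
    fraction_lost = 1 / nodes
    # Worst-case data each surviving node has to absorb, in GB (1 TB = 1000 GB).
    node_raw_tb = drives_per_node * drive_tb
    per_survivor_gb = node_raw_tb * 1000 / (nodes - 1)
    return raw_tb, usable_tb, fraction_lost, per_survivor_gb


raw_tb, usable_tb, fraction_lost, per_survivor_gb = node_failure_impact(
    nodes=200, drives_per_node=4, drive_tb=1.0, replicas=3
)

print(f"raw capacity:        {raw_tb:.0f} TB")
print(f"usable at 3x:        {usable_tb:.0f} TB")
print(f"lost per node:       {fraction_lost:.1%}")
print(f"recovery per node:   {per_survivor_gb:.1f} GB (worst case)")
```

With these inputs the usable capacity lands in the "couple of hundred TB"
range and a single node failure is 0.5% of the cluster, matching the figures
in the message.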


* Re: Ideal hardware spec?
  2012-08-24 18:12           ` Wido den Hollander
@ 2012-08-24 18:23             ` Mark Nelson
  2012-08-27 18:05               ` Stephen Perkins
       [not found]             ` <00ae01cd823e$84e2ed20$8ea8c760$@netmass.com>
  1 sibling, 1 reply; 22+ messages in thread
From: Mark Nelson @ 2012-08-24 18:23 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: ceph-devel

On 08/24/2012 01:12 PM, Wido den Hollander wrote:
>
>
> On 08/24/2012 05:05 PM, Mark Nelson wrote:
>>>>
>>>> I'm running Atom D525 (SuperMicro X7SPA-HF) nodes with 4GB of RAM and
>>>> 4 2TB
>>> disks and a 80GB SSD (old X25-M) for journaling.
>>>>
>>>> That works, but what I notice is that under heavy recover the Atoms
>>>> can't
>>> cope with it.
>>>>
>>>> I'm thinking about building a couple of nodes with the AMD Brazos
>>> mainboard, somelike like an Asus E35M1-I.
>>>>
>>>> That is not a serverboard, but it would just be a reference to see
>>>> what it
>>> does.
>>>>
>>>> One of the problems with the Atoms is the 4GB memory limitation, with
>>>> the
>>> AMD Brazos you can use 8GB.
>>>>
>>>> I'm trying to figure out a way to have a really large amount of small
>>>> nodes
>>> for a low price to have
>>>> a massive cluster where the impact of loosing one node is very small.
>>>
>>> Given that "massive" is a relative term, I am as well... but I'm also
>>> trying
>>> to reduce the footprint (power and space) of that "massive" cluster.
>>> I also
>>> want to start small (1/2 rack) and scale as needed.
>>
>> If you do end up testing Brazos processes, please post your results! I
>> think it really depends on what kind of performance you are aiming for.
>> Our stock 2U test boxes have 6-core opterons, and our SC847a has dual
>> 6-core low power Xeon E5s. At 10GbE+ these are probably going to be
>> pushed pretty hard, especially during recovery.
>>
>
> I'm aiming for a Ceph cluster of a couple of hundred TB consisting
> of 5 or 6 racks full of 1U machines, each with 4x 1TB.
>
> Having about ~200 of these nodes all doing not that much work.
>
> If one fails I'd lose 0.5% of my cluster and recovery shouldn't be that
> hard. Assuming here that the node crashes due to hardware failure, not
> being plagued by some Ceph or BTRFS bug cluster-wide :)
>
> Wido

Just based on past experience, I figure the most common causes of 
failure are going to be drive "failure", and controller failure.  Your 
solution mitigates that by just going with tons of 1U nodes with few 
drives.  I'm hoping we can also mitigate it by skipping expanders and 
doing no more than 8 drives per controller.  It does mean you top out at 
like 40-48 drives per node max on most server boards.

Mark

^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: Ideal hardware spec?
  2012-08-24 18:23             ` Mark Nelson
@ 2012-08-27 18:05               ` Stephen Perkins
  2012-08-27 22:33                 ` Wido den Hollander
  0 siblings, 1 reply; 22+ messages in thread
From: Stephen Perkins @ 2012-08-27 18:05 UTC (permalink / raw)
  To: ceph-devel; +Cc: 'Mark Nelson'

>>> Given that "massive" is a relative term, I am as well... but I'm 
>>> also trying to reduce the footprint (power and space) of that 
>>> "massive" cluster.
>>> I also
>>> want to start small (1/2 rack) and scale as needed.
>>
>> If you do end up testing Brazos processors, please post your results!
>> I think it really depends on what kind of performance you are aiming for.
>> Our stock 2U test boxes have 6-core opterons, and our SC847a has dual 
>> 6-core low power Xeon E5s. At 10GbE+ these are probably going to be 
>> pushed pretty hard, especially during recovery.
>>
>
> I'm aiming for a Ceph cluster of a couple of hundred TB consisting
> of 5 or 6 racks full of 1U machines, each with 4x 1TB.

Thinking along the lines of the approach of many 1U, 4-drive hosts (as
above) with no hardware RAID... what are the thoughts on SATA II (3Gb/s)
vs SATA III (6Gb/s), and on 1G Ethernet versus 10G Ethernet?

- Steve

P.S.  I will be assuming a replication level of 3 copies and would probably
be looking at 10 nodes or less initially.  Maybe populating with 6 drives
instead of 4 (if I can find the right chassis).





^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Ideal hardware spec?
  2012-08-27 18:05               ` Stephen Perkins
@ 2012-08-27 22:33                 ` Wido den Hollander
  0 siblings, 0 replies; 22+ messages in thread
From: Wido den Hollander @ 2012-08-27 22:33 UTC (permalink / raw)
  To: Stephen Perkins; +Cc: ceph-devel

On 08/27/2012 08:05 PM, Stephen Perkins wrote:
>>>> Given that "massive" is a relative term, I am as well... but I'm
>>>> also trying to reduce the footprint (power and space) of that
>>>> "massive" cluster.
>>>> I also
>>>> want to start small (1/2 rack) and scale as needed.
>>>
>>> If you do end up testing Brazos processors, please post your results!
>>> I think it really depends on what kind of performance you are aiming for.
>>> Our stock 2U test boxes have 6-core opterons, and our SC847a has dual
>>> 6-core low power Xeon E5s. At 10GbE+ these are probably going to be
>>> pushed pretty hard, especially during recovery.
>>>
>>
>> I'm aiming for a Ceph cluster of a couple of hundred TB consisting
>> of 5 or 6 racks full of 1U machines, each with 4x 1TB.
>
> Thinking along the lines of the approach of many 1U, 4-drive hosts (as
> above) with no hardware RAID... what are the thoughts on SATA II (3Gb/s)
> vs SATA III (6Gb/s), and on 1G Ethernet versus 10G Ethernet?
>

While SATA3 offers more bandwidth, you won't benefit that much with 
7200RPM disks.

Buffered writes might go a bit faster, but it won't be shocking.

You will, however, notice the difference when using an SSD for journaling, 
since the new SSDs are able to utilize the SATA3 bandwidth much better.

I think that 10G would be overkill for a node with just 4 OSDs running 
on 4 disks in total, but you might want to look at trunking two 1Gb NICs 
with LACP.
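
A rough way to sanity-check the 1GbE versus 10GbE question is to compare
aggregate disk throughput with link capacity. The sketch below is only a
back-of-the-envelope estimate: the 100 MB/s per 7200RPM disk is an assumed
sequential figure, not something measured in this thread, and the function
names are made up for illustration.

```python
# Back-of-the-envelope comparison of disk and network bandwidth per node.
# ASSUMPTION: ~100 MB/s sequential throughput per 7200RPM disk (not measured).

def link_mb_per_s(gigabits: float) -> float:
    """Raw link capacity in MB/s (1 Gb/s = 125 MB/s, ignoring protocol overhead)."""
    return gigabits * 1000 / 8


def bottleneck(disks: int, disk_mb_per_s: float, link_gbit: float) -> str:
    """Return which side limits a node: 'network' or 'disks'."""
    disk_total = disks * disk_mb_per_s
    return "network" if link_mb_per_s(link_gbit) < disk_total else "disks"


if __name__ == "__main__":
    # Compare the three link options discussed above for a 4-disk node.
    for label, gbit in [("1GbE", 1), ("2x1GbE LACP", 2), ("10GbE", 10)]:
        limit = bottleneck(4, 100.0, gbit)
        print(f"{label:12s} {link_mb_per_s(gbit):7.0f} MB/s -> limited by {limit}")
```

Keep in mind that real OSD workloads are rarely purely sequential, so the
per-disk figure is optimistic, and that a bonded pair typically spreads
separate flows over the links rather than speeding up a single flow.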

> - Steve
>
> P.S.  I will be assuming a replication level of 3 copies and would probably
> be looking at 10 nodes or less initially.  Maybe populating with 6 drives
> instead of 4 (if I can find the right chassis).
>

I'd go with 3 as well. Going with 2 would cause you to limp whenever 
just one machine/disk fails.

If you want to go for 6 drives in 1U you'd be looking at 2.5" drives. 
It's a bummer they are still so expensive when looking at price per GB.

Wido

^ permalink raw reply	[flat|nested] 22+ messages in thread

[parent not found: <00ae01cd823e$84e2ed20$8ea8c760$@netmass.com>]

* Re: Ideal hardware spec?
       [not found]             ` <00ae01cd823e$84e2ed20$8ea8c760$@netmass.com>
@ 2012-08-25 11:48               ` Wido den Hollander
  0 siblings, 0 replies; 22+ messages in thread
From: Wido den Hollander @ 2012-08-25 11:48 UTC (permalink / raw)
  To: Stephen Perkins; +Cc: ceph-devel@vger.kernel.org

(CC back to the list)

On 08/24/2012 11:22 PM, Stephen Perkins wrote:
> Hi Wido,
>
> Why 4 x 1TB?  I get the 4 (many boards seem to have  4 sata connectors so
> you don't need a separate controller).  However... why not 2TB or 3TB
> drives?  Is recovery time too long?
>

Yes, mainly due to recovery time. With 4x 1TB I'd lose about 3.2TB of 
data (85% full) at most, which is recoverable for the cluster.

If I increased that to 2TB or 3TB disks, recovery would indeed get 
harder on the CPU and memory.

I could use fewer nodes to get the same amount of storage, but in this 
setup I also get more IOPS since I have more spindles running.
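
To make the recovery trade-off concrete, here is a minimal sketch of the
arithmetic. The recovery rate is a purely hypothetical aggregate number
picked for illustration, not a Ceph benchmark; only the 4-disk, 85%-full
case comes from the message above (straight multiplication gives 3.4TB,
in the same range as the rough figure quoted).

```python
# Rough estimate of what the cluster must re-replicate when one node dies.
# ASSUMPTION: the recovery rate below is a made-up aggregate figure, not a
# measured Ceph number.

def data_at_risk_tb(disks: int, disk_tb: float, fill: float) -> float:
    """Data held by one node, in TB, given disk count, size and fill ratio."""
    return disks * disk_tb * fill


def recovery_hours(data_tb: float, recovery_mb_per_s: float) -> float:
    """Hours needed to re-replicate data_tb at an aggregate recovery rate."""
    return data_tb * 1_000_000 / recovery_mb_per_s / 3600


ASSUMED_RECOVERY_MB_PER_S = 200.0  # hypothetical cluster-wide recovery rate

for disk_tb in (1, 2, 3):
    tb = data_at_risk_tb(4, disk_tb, 0.85)
    hours = recovery_hours(tb, ASSUMED_RECOVERY_MB_PER_S)
    print(f"4x {disk_tb}TB at 85% full: {tb:.1f} TB to recover, ~{hours:.1f} h")
```

The point is only the scaling: at a fixed recovery rate, doubling the disk
size doubles the time the cluster spends degraded after a node failure.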

> I'm guessing no RAID and one OSD process per disk?
>

Correct. RAID is expensive and the Ceph replication already provides the 
data redundancy here.

> I'm still evaluating your "looking at things differently" to see about a
> bunch of cheap 1Us.
>
> Would your 1Us have redundant power and be redundantly Ethernet connected?
> Or... cheaper single power and single Ethernet (reduced cabling)?
>
> ECC memory?
>

No redundant power, no redundant Ethernet (or switches) and no ECC memory.

I'm quoting here from the CRUSH publication Sage wrote [0]:

"Data safety is of critical importance in large storage systems,
where the large number of devices makes hardware failure
the rule rather than the exception." (4.4 Reliability)

I've been designing by that rule.

I'm relying on CRUSH to do all the redundancy work for me. By 
strategically placing nodes on different power feeds and different 
switches I can mitigate hardware failure. You just have to make sure 
that your CRUSH map matches the physical layout of your cluster.

Make sure that two copies of your data never end up in the same rack or 
on the same switch.
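
As an illustration of that rule, the sketch below checks whether a set of
replicas shares a failure domain. It is not CRUSH and does not use any Ceph
API: the layout table, the OSD names and the function are all hypothetical,
standing in for whatever your CRUSH map describes.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical layout: OSD name -> failure domains. Not read from a real
# CRUSH map; purely illustrative.
LAYOUT: Dict[str, Dict[str, str]] = {
    "osd.0": {"rack": "rack1", "switch": "sw1", "power": "feedA"},
    "osd.1": {"rack": "rack1", "switch": "sw1", "power": "feedA"},
    "osd.2": {"rack": "rack2", "switch": "sw2", "power": "feedB"},
    "osd.3": {"rack": "rack3", "switch": "sw3", "power": "feedA"},
}


def shared_domains(replicas: List[str], domain: str) -> List[str]:
    """Return the values of `domain` that hold more than one replica."""
    counts = Counter(LAYOUT[osd][domain] for osd in replicas)
    return sorted(name for name, n in counts.items() if n > 1)


if __name__ == "__main__":
    placement = ["osd.0", "osd.2", "osd.3"]
    for domain in ("rack", "switch", "power"):
        clash = shared_domains(placement, domain)
        print(f"{domain}: {'OK' if not clash else 'shared ' + ', '.join(clash)}")
```

A placement that looks fine per rack can still share a power feed, which is
why the map has to reflect every failure domain you care about.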

Wido

[0]: http://ceph.newdream.net/papers/weil-crush-sc06.pdf

> - Steve
>
>
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Ideal hardware spec?
  2012-08-24 14:17       ` Stephen Perkins
  2012-08-24 14:41         ` Joe Landman
  2012-08-24 15:05         ` Mark Nelson
@ 2012-08-24 16:12         ` Tommi Virtanen
  2012-08-24 18:09         ` Wido den Hollander
  3 siblings, 0 replies; 22+ messages in thread
From: Tommi Virtanen @ 2012-08-24 16:12 UTC (permalink / raw)
  To: Stephen Perkins; +Cc: Wido den Hollander, ceph-devel

On Fri, Aug 24, 2012 at 7:17 AM, Stephen Perkins <perkins@netmass.com> wrote:
> Adding three machines at once is what I was trying to avoid (I believe that
> I need 3 replicas to make things reasonably redundant).  From first glance,

You need 3 machines to *start with*, to have 3 truly independent
replicas. After that point, there's nothing preventing you from
growing one machine -- or one disk -- at a time.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Ideal hardware spec?
  2012-08-24 14:17       ` Stephen Perkins
                           ` (2 preceding siblings ...)
  2012-08-24 16:12         ` Tommi Virtanen
@ 2012-08-24 18:09         ` Wido den Hollander
  3 siblings, 0 replies; 22+ messages in thread
From: Wido den Hollander @ 2012-08-24 18:09 UTC (permalink / raw)
  To: Stephen Perkins; +Cc: ceph-devel

On 08/24/2012 04:17 PM, Stephen Perkins wrote:
>
>> Your SPOF would still be your whole SAS setup.
>
> Well... I'm not sure I would consider it a single point of failure...  a
> pair of cross-connected switches and 3-5 disk shelves.  Shelves can be
> purchased with fully redundant internals (dual data paths etc to SAS
> drives).  That is not even that important. If each shelf is just looked at
> as JBOD, then you can group disks from different shelves into btrfs or
> hardware RAID groups.  Or... you can look at each disk as its own storage
> with its own OSD.
>
> A SAS switch going offline would have no impact since everything is cross
> connected.
>
> A whole shelf can go offline and it would only appear as a single drive
> failure in a RAID group (if disks groups are distributed properly).
>

I'm not against your idea and I get the reasoning. However, in my 
opinion a distributed filesystem should not have SAS-based interconnects 
between OSD nodes.

There are many roads to Rome, I know, but I'm just trying to view 
this from another perspective.

> You can then get compute nodes fairly densely packed by purchasing
> SuperMicro 2uTwin enclosures:
> 	http://www.supermicro.com/products/nfo/2UTwin2.cfm
>
> You can get 3 - 4 of those compute enclosure with dual SAS connectors (each
> enclosure not necessarily fully populated initially). The beauty is that the
> SAS interconnect is fast.   Much faster than Ethernet.

Yes, SAS is faster than Ethernet, but all the replication traffic 
between OSDs will still go over Ethernet. The OSD in its turn will write 
the data over SAS.

I'd actually think your SAS bus (although they are beefy) could become a 
bottleneck at some point.

>
> Please bear in mind that I am looking to create a highly available and
> scalable storage system that will fit in as small an area as possible and
> draw as little power as possible.  The reasoning is that we co-locate all
> our equipment at remote data centers.  Each rack (along with its associated
> power and any needed cross connects) represents a significant ongoing
> operational expense.  Therefore, for me, density and incremental scalability
> are important.
>

Got ya. Operational costs in datacenters are getting higher and higher; 
sometimes it's worth investing more upfront so you can save operationally.

>
> There is no high availability here.  Yes... You can try to do old school
> magic with SAN file systems, complicated clustering, and synchronous
> replication, but a RAIN approach appeals to me.  That is what I see in Ceph.
> Don't get me wrong... I love ZFS... but am trying to figure out a scalable
> HA solution that looks like RAIN. (Am I missing a feature of ZFS)?
>

I'm managing a couple of 50TB ZFS systems with Nexenta. The two nodes 
have 96GB of RAM each and all the disks are in LSI 630J JBODs with LSI 
SAS switches. This way both nodes have access to the disks and thus the 
ZFS pool.

Expansion can be done by adding extra disks or creating a second pool 
and running that pool on a different node.

Since you are staying inside one rack I don't think you'll be doing that 
many IOPS. A decent ZFS system can do 100k IOPS without any issues; I 
don't think you'll do that with Ceph very soon in one rack (assuming 
your clients are in the same rack).

Don't get me wrong, I'm not trying to scare you away from Ceph, just 
trying to view it from a different perspective.

>> For risk spreading you should not interconnect all the nodes.
>
> I do understand this.  However, our operational setup will not allow
> multiple racks at the beginning.  So... given the constraints of 1 rack
> (with dual power and dual WAN links), I do not see that a pair of cross
> connected SAS switches is any less reliable than a pair of cross connected
> ethernet switches...
>

The problem with interconnected SAS switches is that IF something goes 
wrong your filesystem loses its connection to the disks, risking 
valuable data which could still be in transit from buffers.

The risk would be that all the OSDs lose access to their disks all 
at once.

Yes, it is redundant, but you wouldn't be the first to suffer from a 
firmware glitch somewhere.

By keeping this physically separated you don't run the risk of all OSDs 
losing disk access at once.

> As storage scales and we outgrow the single rack at a location, we can
> overflow into a second rack etc.
>

True, that is something that you won't do with a ZFS setup that fast. 
The question you have to ask yourself: Do you want all your data on one 
system? Do you want to bet everything on one horse?

Wido

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Ideal hardware spec?
  2012-08-22 14:17 ` Wido den Hollander
  2012-08-22 14:39   ` Stephen Perkins
@ 2012-08-22 15:46   ` Jonathan Proulx
  2012-08-23  9:59     ` Wido den Hollander
  1 sibling, 1 reply; 22+ messages in thread
From: Jonathan Proulx @ 2012-08-22 15:46 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: ceph-devel

On Wed, Aug 22, 2012 at 04:17:23PM +0200, Wido den Hollander wrote:

:On 08/22/2012 03:55 PM, Jonathan Proulx wrote:

:You can also use the USB sticks[0] from Stec, they have servergrade
:onboard USB sticks for these kind of applications.

Those look quite interesting.

:A couple of questions still need to be answered though:
:* Which OS are you planning on using? Ubuntu 12.04 is recommended

Ubuntu 12.04 is our current preferred OS

:* Which filesystem do you want to use underneath the OSDs?

Whatever I can get to work best in testing :)

Since this is for a research platform, not a product, I'd likely start with
BTRFS and see if it is "stable enough" and "performant enough", with a
fallback to XFS if needed.

-Jon

:Wido
:
:[0]: http://www.stec-inc.com/product/ufm.php

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Ideal hardware spec?
  2012-08-22 15:46   ` Jonathan Proulx
@ 2012-08-23  9:59     ` Wido den Hollander
       [not found]       ` <CABYiri_-73UyTKHcHWDZdjqb=rozjraVzxd166NZV2ir53tduA@mail.gmail.com>
  0 siblings, 1 reply; 22+ messages in thread
From: Wido den Hollander @ 2012-08-23  9:59 UTC (permalink / raw)
  To: Jonathan Proulx; +Cc: ceph-devel

On 08/22/2012 05:46 PM, Jonathan Proulx wrote:
> On Wed, Aug 22, 2012 at 04:17:23PM +0200, Wido den Hollander wrote:
>
> :On 08/22/2012 03:55 PM, Jonathan Proulx wrote:
>
> :You can also use the USB sticks[0] from Stec, they have servergrade
> :onboard USB sticks for these kind of applications.
>
> Those look quite interesting.
>

They should be much more reliable than regular USB sticks due to their 
SLC memory.

You could also take a look at these: 
http://www.transcend-info.com/industry/products_details.asp?CatNo=2&SerNo=14&ModNo=28&Func1No=1

> :A couple of questions still need to be answered though:
> :* Which OS are you planning on using? Ubuntu 12.04 is recommended
>
> Ubuntu 12.04 is our current preferred OS
>

That should work fine.

> :* Which filesystem do you want to use underneath the OSDs?
>
> Whatever I can get to work best in testing :)
>
> Since this is for a research platform not a product I'd likely start with
> BTRFS and see if it is "stable enough" and "performant enough" with
> fall back to XFS if needed
>

BTRFS is indeed the best in terms of features. I'd recommend using a 
recent kernel like 3.5.

Wido

> -Jon
>
> :Wido
> :
> :[0]: http://www.stec-inc.com/product/ufm.php
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

[parent not found: <CABYiri_-73UyTKHcHWDZdjqb=rozjraVzxd166NZV2ir53tduA@mail.gmail.com>]

* Re: Ideal hardware spec?
       [not found]       ` <CABYiri_-73UyTKHcHWDZdjqb=rozjraVzxd166NZV2ir53tduA@mail.gmail.com>
@ 2012-08-26 11:15         ` Wido den Hollander
  2012-08-26 13:29           ` Mark Nelson
  0 siblings, 1 reply; 22+ messages in thread
From: Wido den Hollander @ 2012-08-26 11:15 UTC (permalink / raw)
  To: Andrey Korolyov; +Cc: ceph-devel@vger.kernel.org

CC'ing this one back to the list.

On 08/25/2012 09:58 PM, Andrey Korolyov wrote:
>>
>> They should be much more reliable than regular USB sticks due to their SLC
>> memory.
>>
>> You could also take a look at these:
>> http://www.transcend-info.com/industry/products_details.asp?CatNo=2&SerNo=14&ModNo=28&Func1No=1
>>
>>
>
> Have you tried those or similar sticks for the Ceph journal yet? Right now
> I am using Intel 313s, which are very fast and have a durability/price
> ratio far higher than any imaginable MLC, but they occupy one HDD
> slot, which is quite impractical.
>

No, I haven't tried, but I think it won't work.

These kinds of SLC chips don't do random writes that well; you'll 
probably get something like 4MB/sec in random writes.

Bigger SSDs have more cells to spread the writes over; those small 
sticks don't.

The Intel 3XX or 5XX series should work just fine for journaling. I'd 
however recommend you set the Host Protected Area to ~50% of the 
available capacity to prevent write degradation over time.

Wido

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Ideal hardware spec?
  2012-08-26 11:15         ` Wido den Hollander
@ 2012-08-26 13:29           ` Mark Nelson
  0 siblings, 0 replies; 22+ messages in thread
From: Mark Nelson @ 2012-08-26 13:29 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: Andrey Korolyov, ceph-devel@vger.kernel.org

On 08/26/2012 06:15 AM, Wido den Hollander wrote:
> CC'ing this one back to the list.
>
> On 08/25/2012 09:58 PM, Andrey Korolyov wrote:
>>>
>>> They should be much more reliable than regular USB sticks due to
>>> their SLC
>>> memory.
>>>
>>> You could also take a look at these:
>>> http://www.transcend-info.com/industry/products_details.asp?CatNo=2&SerNo=14&ModNo=28&Func1No=1
>>>
>>>
>>>
>>
>> Have you tried those or similar sticks for the Ceph journal yet? Right now
>> I am using Intel 313s, which are very fast and have a durability/price
>> ratio far higher than any imaginable MLC, but they occupy one HDD
>> slot, which is quite impractical.
>>
>
> No, I haven't tried, but I think it won't work.
>
> These kinds of SLC chips don't do random writes that well; you'll
> probably get something like 4MB/sec in random writes.
>
> Bigger SSDs have more cells to spread the writes over; those small
> sticks don't.
>
> The Intel 3XX or 5XX series should work just fine for journaling. I'd
> however recommend you set the Host Protected Area to ~50% of the
> available capacity to prevent write degradation over time.
>
> Wido

It's not just write degradation: undersubscribing the SSDs should 
also help them last a little longer under such a heavy write 
workload.  We are doing 3 10GB journals per 180GB Intel 520 on our 
Supermicro test node.
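
For a sense of how much headroom that layout leaves, here is a small
sketch of the arithmetic. The function name is made up for illustration;
the 180GB SSD and three 10GB journals are the figures given above.

```python
from typing import Tuple


def ssd_allocation(ssd_gb: float, journals: int, journal_gb: float) -> Tuple[float, float, float]:
    """Return (GB used by journals, GB left unallocated, fraction of SSD used)."""
    used = journals * journal_gb
    return used, ssd_gb - used, used / ssd_gb


if __name__ == "__main__":
    used, free, fraction = ssd_allocation(180, 3, 10)
    print(f"journals: {used} GB, unallocated: {free} GB, used: {fraction:.0%}")
```

Leaving most of the device unallocated gives the SSD controller more spare
area to work with, which is the undersubscription being described.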

Mark

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Ideal hardware spec?
  2012-08-22 13:55 Ideal hardware spec? Jonathan Proulx
  2012-08-22 14:17 ` Wido den Hollander
@ 2012-08-22 14:41 ` Mark Nelson
  2012-08-28  0:02   ` Curtis C.
  1 sibling, 1 reply; 22+ messages in thread
From: Mark Nelson @ 2012-08-22 14:41 UTC (permalink / raw)
  To: Jonathan Proulx; +Cc: ceph-devel

On 08/22/2012 08:55 AM, Jonathan Proulx wrote:
> Hi All,

Hi Jonathan!

>
> Yes I'm asking the impossible question, what is the "best" hardware
> config.

That is the impossible question. :)

>
> I'm looking at (possibly) using ceph as backing store for images and
> volumes on OpenStack as well as exposing at least the object store for
> direct use.
>
> The openstack cluster exists and is currently in the early stages of
> use by researchers here, approx 1500 vCPU (counts hyperthreads
> actually 768 physical cores) and 3T of RAM across 64 physical nodes.
>
> On the object store side it would be a new resource for us and hard to
> say what people would do with it except that it would be many
> different things and the use profile would be constantly changing
> (which is true of all our existing storage).
>
> In this sense, even though it's a "private cloud" the somewhat
> unpredictable usage profile gives it some characteristics of a small
> public cloud.
>
> Size wise I'm hoping to start out with 3 monitors  and  5(+) OSD nodes
> to end up with a 20-30T 3x replicated storage (call me paranoid).
>
> So the monitor specs seem relatively easy to come up with.  For the
> OSDs it looks like
> http://ceph.com/docs/master/install/hardware-recommendations suggests
> 1 drive, 1 core and  2G RAM per OSD (with multiple OSDs per storage
> node).  On list discussions seem to frequently include an SSD for
> journaling (which is similar to what we do for our current ZFS-backed
> NFS storage).
>
> I'm hoping to wrap the hardware in a grant and willing to experiment a
> bit with different software configurations to tune it up when/if I get
> the hardware in.  So my immediate concern is a hardware spec that will
> have a reasonable processor:memory:disk ratio and opinions (or better
> data) on the utility of SSD.

Before I joined up with Inktank, I was prototyping a private OpenStack 
cloud for HPC applications at a supercomputing site.  We were similarly 
pursuing grant funding.  I know how it goes!

>
> First is the documented core to disk ratio still current best
> practice?  Given a platform with more drive slots could 8 cores handle
> more disk? would that need/like more memory?

The big thing is the CPU and memory needed during recovery.  During 
standard operation you shouldn't be pushing the CPU too hard unless you 
are really pushing data through fast and have many drives per node, or 
have severely underspecced the CPU.

Given that you are only shooting for around 90TB of space across 5+ OSD 
nodes, you should be able to get away with 2U boxes holding 12 2TB+ 
drives each.  That's probably the closest thing we have right now to a 
"standard" configuration.  We use a single 6-core 2.8GHz AMD Opteron chip 
in each node with 16GB of memory.  It might be worth bumping that up to 
24-32GB of memory for very large deployments with lots of OSDs.
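
The sizing above can be checked with a few lines of arithmetic. This is
only a sketch: the helper names are invented, and the per-OSD guideline
(1 core and 2GB of RAM per OSD) is the one quoted from the hardware
recommendations page earlier in this thread, not a hard requirement.

```python
from typing import Tuple


def raw_tb_needed(usable_tb: float, replicas: int) -> float:
    """Raw capacity required to store usable_tb with N-way replication."""
    return usable_tb * replicas


def cluster_raw_tb(nodes: int, drives_per_node: int, drive_tb: float) -> float:
    """Raw capacity provided by the cluster."""
    return nodes * drives_per_node * drive_tb


def per_osd_guideline(osds: int, cores_per_osd: int = 1, ram_gb_per_osd: int = 2) -> Tuple[int, int]:
    """(cores, GB of RAM) per node under the quoted per-OSD guideline."""
    return osds * cores_per_osd, osds * ram_gb_per_osd


if __name__ == "__main__":
    need = raw_tb_needed(30, 3)          # upper end of the 20-30T target, 3x
    have = cluster_raw_tb(5, 12, 2)      # five 12-bay nodes with 2TB drives
    cores, ram = per_osd_guideline(12)   # one OSD per drive
    print(f"raw needed: {need} TB, raw available: {have} TB")
    print(f"guideline per 12-OSD node: {cores} cores, {ram} GB RAM")
```

Note that the node described above (6 cores, 16GB) sits below what the
strict per-OSD guideline would suggest for 12 OSDs, which matches the
remark that 24-32GB may be worth it for larger deployments.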

In terms of controller we are using Dell H700 cards which are similar to 
LSI 9260s, but I think there is a good chance that it may actually be 
better to use H200s (i.e. LSI 9211-8i or similar) with the IT/JBOD mode 
firmware.  That's one of the commonly used cards in ZFS builds too and 
has a pretty good reputation.

I've actually got a Supermicro SC847a chassis and a whole bunch of 
various SATA/SAS/RAID controllers I'm testing now in different 
configurations.  Hopefully I should have some data soon.  For now, our 
best tested configuration is with 12 drive nodes.  Smaller 1U nodes may 
be an option as well, but not very dense.

>
> Have SSD been shown to speed performance with this architecture?

Yes, but in different ways depending on how you use them.  SSDs for data 
storage tend to help mitigate some of the seek behavior issues we've 
seen on the filestore.  This isn't really a reasonable solution for a 
lot of people though.

In terms of the journal, the biggest benefit that SSDs provide is high 
throughput, so you can load multiple journals onto 1 SSD and cram more 
OSDs into one box.  Depending on how much you trust your SSDs, you could 
try either a 10 disk + 2 SSD or a 9 disk + SSD configuration.  Keep in 
mind that this will be writing a lot of data to the SSDs, so you should 
try to undersubscribe them to lengthen the lifespan.  For testing I'm 
doing 3 journals per 180GB Intel 520 SSD.

>
> If so given the 8 drive slot example with seven OSDs presented in the
> docs what is the likelihood that using a high performance SSD for the
> OS image and also cutting journal/log partitions out of it for the
> remaining 7 2-3T near line SAS drives?

Just keep in mind that in this case your total throughput will likely 
be limited by the SSD unless you get a very fast one (or are using 1GbE 
or have some other bottleneck).
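
One way to reason about this is to treat the node's write ceiling as the
smallest of three limits: the shared journal SSD, the data disks, and the
network. The sketch below assumes every write passes through the journal
before reaching a data disk; all throughput numbers are placeholders
chosen for illustration, not measurements, and the function name is
hypothetical.

```python
from typing import Tuple


def node_write_ceiling(ssd_mb_s: float, disks: int, disk_mb_s: float,
                       net_mb_s: float) -> Tuple[str, float]:
    """Return (limiting component, MB/s) for a node with one shared journal SSD."""
    limits = {
        "journal SSD": ssd_mb_s,            # all journals share this device
        "data disks": disks * disk_mb_s,    # aggregate of the spinning disks
        "network": net_mb_s,                # ingest capacity of the NIC
    }
    name = min(limits, key=limits.get)
    return name, limits[name]


if __name__ == "__main__":
    # ASSUMED figures: 250 MB/s SSD writes, 100 MB/s per disk, 7 data disks.
    for label, net in [("1GbE", 125.0), ("10GbE", 1250.0)]:
        name, rate = node_write_ceiling(250.0, 7, 100.0, net)
        print(f"{label}: limited by {name} at {rate:.0f} MB/s")
```

With placeholder numbers like these the network is the limit on 1GbE and
the single SSD becomes the limit on 10GbE, which is the situation the
paragraph above warns about.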

>
> Thanks,
> -Jon

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Ideal hardware spec?
  2012-08-22 14:41 ` Mark Nelson
@ 2012-08-28  0:02   ` Curtis C.
  2012-08-28  1:18     ` Mark Nelson
  0 siblings, 1 reply; 22+ messages in thread
From: Curtis C. @ 2012-08-28  0:02 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Jonathan Proulx, ceph-devel

On Wed, Aug 22, 2012 at 8:41 AM, Mark Nelson <mark.nelson@inktank.com> wrote:
> On 08/22/2012 08:55 AM, Jonathan Proulx wrote:
>>
>> Hi All,
>
>
> Hi Jonathan!
>
>
>>
>> Yes I'm asking the impossible question, what is the "best" hardware
>> config.
>
>
> That is the impossible question. :)
>
>
>>
>> I'm looking at (possibly) using ceph as backing store for images and
>> volumes on OpenStack as well as exposing at least the object store for
>> direct use.
>>
>> The openstack cluster exists and is currently in the early stages of
>> use by researchers here, approx 1500 vCPU (counts hyperthreads
>> actually 768 physical cores) and 3T of RAM across 64 physical nodes.
>>
>> On the object store side it would be a new resource for us and hard to
>> say what people would do with it except that it would be many
>> different things and the use profile would be constantly changing
>> (which is true of all our existing storage).
>>
>> In this sense, even though it's a "private cloud" the somewhat
>> unpredictable usage profile gives it some characteristics of a small
>> public cloud.
>>
>> Size wise I'm hoping to start out with 3 monitors  and  5(+) OSD nodes
>> to end up with a 20-30T 3x replicated storage (call me paranoid).
>>
>> So the monitor specs seem relatively easy to come up with.  For the
>> OSDs it looks like
>> http://ceph.com/docs/master/install/hardware-recommendations suggests
>> 1 drive, 1 core and  2G RAM per OSD (with multiple OSDs per storage
>> node).  On list discussions seem to frequently include an SSD for
>> journaling (which is similar to what we do for our current ZFS back
>> NFS storage).
>>
>> I'm hoping to wrap the hardware in a grant and willing to experiment a
>> bit with different software configurations to tune it up when/if I get
>> the hardware in.  So my immediate concern is a hardware spec that will
>> have a reasonable processor:memory:disk ratio and opinions (or better
>> data) on the utility of SSD.
>
>
> Before I joined up with Inktank, I was prototyping a private openstack cloud
> for HPC applications at a supercomputing site.  We similarly were pursuing
> grant funding.  I know how it goes!
>
>
>>
>> First is the documented core to disk ratio still current best
>> practice?  Given a platform with more drive slots could 8 cores handle
>> more disk? would that need/like more memory?
>
>
> The big thing is the CPU and memory needed during recovery.  During standard
> operation you shouldn't be pushing the CPU too hard unless you are really
> pushing data through fast and have many drives per node, or have severely
> underspecced the CPU.
>
> Given that you are only shooting for around 90TB of space across 5+ osd
> nodes, you should be able to get away with 12 2TB+ drive 2U boxes. That's
> probably the closest thing we have right now to a "standard" configuration.
> We use a single 6-core 2.8GHz AMD Opteron chip in each node with 16GB of
> memory.  It might be worth bumping that up to 24-32GB of memory for very
> large deployments with lots of OSDs.
>
> In terms of controller we are using Dell H700 cards which are similar to LSI
> 9260s, but I think there is a good chance that it may actually be better to
> use H200s (ie LSI 9211-8i or similar) with the IT/JBOD mode firmware.
> That's one of the commonly used cards in ZFS builds too and has a pretty
> good reputation.
>
> I've actually got a supermicro SC847a chassis and a whole bunch of various
> SATA/SAS/RAID controllers I'm testing now in different configurations.
> Hopefully I should have some data soon.  For now, our best tested
> configuration is with 12 drive nodes.  Smaller 1U nodes may be an option as
> well, but not very dense.
>

I've worked a bit with a Supermicro 36 drive bay chassis, though I've
since moved on from the organization we had them in place at. I quite
liked them. Wrote a bit of a blog post about them too
(http://serverascode.com/2012/06/07/36-hot-swappable-day-supermicro-chassis.html)
so I'm excited to see Inktank trying them out.

The place I currently work at is a big OpenStack user and thinking
about Ceph, but is not, as of yet, interested in a chassis like the
Supermicro, so please post about your findings. :)

Thanks,
Curtis.

>
>>
>> Have SSD been shown to speed performance with this architecture?
>
>
> Yes, but in different ways depending on how you use them.  SSDs for data
> storage tend to help mitigate some of the seek behavior issues we've seen on
> the filestore.  This isn't really a reasonable solution for a lot of people
> though.
>
> In terms of the journal, the biggest benefit that SSDs provide is high
> throughput, so you can load multiple journals onto 1 SSD and cram more OSDs
> into one box.  Depending on how much you trust your SSDs, you could try
> either a 10 disk + 2 SSD or a 9 disk + SSD configuration.  Keep in mind that
> this will be writing a lot of data to the SSDs, so you should try to
> undersubscribe them to lengthen the lifespan.  For testing I'm doing 3
> journals per 180GB Intel 520 SSD.
>
>
>>
>> If so given the 8 drive slot example with seven OSDs presented in the
>> docs what is the likelihood that using a high performance SSD for the
>> OS image and also cutting journal/log partitions out of it for the
>> remaining 7 2-3T near line SAS drives?
>
>
> Just keep in mind that in this case your total throughput will likely be
> limited by the SSD unless you get a very fast one (or are using 1GbE or have
> some other bottleneck).
>
>
>>
>> Thanks,
>> -Jon
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: Ideal hardware spec?
  2012-08-28  0:02   ` Curtis C.
@ 2012-08-28  1:18     ` Mark Nelson
  0 siblings, 0 replies; 22+ messages in thread
From: Mark Nelson @ 2012-08-28  1:18 UTC (permalink / raw)
  To: Curtis C.; +Cc: Jonathan Proulx, ceph-devel

On 08/27/2012 07:02 PM, Curtis C. wrote:
> On Wed, Aug 22, 2012 at 8:41 AM, Mark Nelson<mark.nelson@inktank.com>  wrote:
>> On 08/22/2012 08:55 AM, Jonathan Proulx wrote:
>>>
>>> Hi All,
>>
>>
>> Hi Jonathan!
>>
>>
>>>
>>> Yes I'm asking the impossible question, what is the "best" hardware
>>> config.
>>
>>
>> That is the impossible question. :)
>>
>>
>>>
>>> I'm looking at (possibly) using ceph as backing store for images and
>>> volumes on OpenStack as well as exposing at least the object store for
>>> direct use.
>>>
>>> The OpenStack cluster exists and is currently in the early stages of
>>> use by researchers here, approx 1500 vCPU (counting hyperthreads;
>>> actually 768 physical cores) and 3T of RAM across 64 physical nodes.
>>>
>>> On the object store side it would be a new resource for us and hard to
>>> say what people would do with it except that it would be many
>>> different things and the use profile would be constantly changing
>>> (which is true of all our existing storage).
>>>
>>> In this sense, even though it's a "private cloud" the somewhat
>>> unpredictable usage profile gives it some characteristics of a small
>>> public cloud.
>>>
>>> Size wise I'm hoping to start out with 3 monitors  and  5(+) OSD nodes
>>> to end up with a 20-30T 3x replicated storage (call me paranoid).
>>>
>>> So the monitor specs seem relatively easy to come up with.  For the
>>> OSDs it looks like
>>> http://ceph.com/docs/master/install/hardware-recommendations suggests
>>> 1 drive, 1 core and  2G RAM per OSD (with multiple OSDs per storage
>>> node).  On list discussions seem to frequently include an SSD for
>>> journaling (which is similar to what we do for our current ZFS-backed
>>> NFS storage).
>>>
>>> I'm hoping to wrap the hardware in a grant and willing to experiment a
>>> bit with different software configurations to tune it up when/if I get
>>> the hardware in.  So my immediate concern is a hardware spec that will
>>> have a reasonable processor:memory:disk ratio and opinions (or better
>>> data) on the utility of SSD.
>>
>>
>> Before I joined up with Inktank, I was prototyping a private openstack cloud
>> for HPC applications at a supercomputing site.  We similarly were pursuing
>> grant funding.  I know how it goes!
>>
>>
>>>
>>> First is the documented core to disk ratio still current best
>>> practice?  Given a platform with more drive slots could 8 cores handle
>>> more disk? would that need/like more memory?
>>
>>
>> The big thing is the CPU and memory needed during recovery.  During standard
>> operation you shouldn't be pushing the CPU too hard unless you are really
>> pushing data through fast and have many drives per node, or have severely
>> underspecced the CPU.
>>
>> Given that you are only shooting for around 90TB of space across 5+ osd
>> nodes, you should be able to get away with 12 2TB+ drive 2U boxes. That's
>> probably the closest thing we have right now to a "standard" configuration.
>> We use a single 6-core 2.8GHz AMD Opteron chip in each node with 16GB of
>> memory.  It might be worth bumping that up to 24-32GB of memory for very
>> large deployments with lots of OSDs.
>>
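
As a rough check on the sizing quoted above, here is a minimal sketch of
the raw-versus-usable arithmetic. It assumes 5 nodes of 12 x 2TB drives
and 3x replication, and it ignores filesystem overhead and the free-space
headroom needed for recovery. The function names are only illustrative.

```python
def raw_capacity_tb(nodes, drives_per_node, drive_tb):
    """Total raw disk capacity in TB, before replication."""
    return nodes * drives_per_node * drive_tb


def usable_capacity_tb(raw_tb, replicas):
    """Capacity left for data when every object is stored `replicas` times."""
    return raw_tb / replicas


def raw_needed_tb(usable_tb, replicas):
    """Raw capacity required to hold `usable_tb` of data at `replicas` copies."""
    return usable_tb * replicas


# Assumed layout: 5 OSD nodes, 12 drives each, 2TB per drive, 3 replicas.
raw = raw_capacity_tb(nodes=5, drives_per_node=12, drive_tb=2.0)
usable = usable_capacity_tb(raw, replicas=3)

print(f"raw capacity: {raw:.0f} TB, usable at 3x: {usable:.0f} TB")
print(f"raw needed for 20-30 TB usable: "
      f"{raw_needed_tb(20, 3)}-{raw_needed_tb(30, 3)} TB")
```

Under those assumptions the five boxes hold more raw space than the
20-30T target strictly needs, which leaves room for the overheads the
sketch leaves out.
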
>> In terms of controller we are using Dell H700 cards which are similar to LSI
>> 9260s, but I think there is a good chance that it may actually be better to
>> use H200s (ie LSI 9211-8i or similar) with the IT/JBOD mode firmware.
>> That's one of the commonly used cards in ZFS builds too and has a pretty
>> good reputation.
>>
>> I've actually got a Supermicro SC847a chassis and a whole bunch of various
>> SATA/SAS/RAID controllers I'm testing now in different configurations.
>> Hopefully I should have some data soon.  For now, our best tested
>> configuration is with 12 drive nodes.  Smaller 1U nodes may be an option as
>> well, but not very dense.
>>
>
> I've worked a bit with a Supermicro 36 drive bay chassis, though I've
> since moved on from the organization we had them in place at. I quite
> liked them. Wrote a bit of a blog post about them too
> (http://serverascode.com/2012/06/07/36-hot-swappable-day-supermicro-chassis.html)
> so I'm excited to see Inktank trying them out.
>

I really like this chassis.  It's one of the nicer ones that I've worked 
with.  The drives in the back could be a deal breaker for some, but I 
think it's a decent trade-off for what you get.

> The place I currently work at is a big OpenStack user and thinking
> about Ceph, but is not, as of yet, interested in a chassis like the
> Supermicro, so please post about your findings. :)
>
> Thanks,
> Curtis.
>

So far I've only been doing single controller tests with an onboard LSI 
SAS2208 and an external SAS2008 card (9211-8i).  The SAS2008 is actually 
slightly faster.  With 6 7200rpm SATA drives and 2 Intel 520 SSDs for 
journals I can do nearly 600MB/s with 1x replication and 4MB requests 
via rados bench.
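
To put that number in context, here is a small back-of-the-envelope
sketch. It assumes writes are spread evenly over the 6 drives and 2 SSDs,
that every write passes through a journal before reaching its data disk,
and that link speeds are raw line rates in decimal units with no protocol
overhead. The helper names are made up for the example.

```python
def per_drive_mb_s(total_mb_s, data_drives):
    """Average write rate each data drive sustains, assuming an even spread."""
    return total_mb_s / data_drives


def per_ssd_journal_mb_s(total_mb_s, ssds):
    """Average journal write rate per SSD, assuming all writes are journaled."""
    return total_mb_s / ssds


def link_mb_s(gbit_per_s):
    """Raw line rate of a network link in MB/s (decimal, no overhead)."""
    return gbit_per_s * 1000 / 8


total = 600.0  # approximate aggregate rados bench write rate, MB/s

print(f"per data drive : {per_drive_mb_s(total, 6):.0f} MB/s")
print(f"per journal SSD: {per_ssd_journal_mb_s(total, 2):.0f} MB/s")
print(f"share of 10GbE : {total / link_mb_s(10):.0%}")
print(f"1GbE line rate : {link_mb_s(1):.0f} MB/s")
```
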

I've got a couple of other cards to test (an Areca 1680, LSI SAS2308,
and a Marvell-based HighPoint RocketRAID card).  After that I'll start in
on multiple controllers and more drives.  I also got the bracket I 
needed in for my 1U client node so I should be able to start in on 2x 
bonded 10GbE tests.

Hopefully I can convince the powers that be to let me fill out the 
SC847a chassis and maybe buy another one if the tests look good. ;)

>>
>>>
>>> Have SSD been shown to speed performance with this architecture?
>>
>>
>> Yes, but in different ways depending on how you use them.  SSDs for data
>> storage tend to help mitigate some of the seek behavior issues we've seen on
>> the filestore.  This isn't really a reasonable solution for a lot of people
>> though.
>>
>> In terms of the journal, the biggest benefit that SSDs provide is high
>> throughput, so you can load multiple journals onto 1 SSD and cram more OSDs
>> into one box.  Depending on how much you trust your SSDs, you could try
>> either a 10 disk + 2 SSD or a 9 disk + SSD configuration.  Keep in mind that
>> this will be writing a lot of data to the SSDs, so you should try to
>> undersubscribe them to lengthen the lifespan.  For testing I'm doing 3
>> journals per 180GB Intel 520 SSD.
>>
>>
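
A quick sketch of the journal layout arithmetic, for anyone weighing
these configurations. It assumes journals are spread as evenly as
possible across the SSDs. The `reserve_fraction` parameter is a
hypothetical knob for leaving part of each SSD unpartitioned as one way
to undersubscribe it, not a Ceph setting.

```python
import math


def journals_per_ssd(data_disks, ssds):
    """Largest number of journals any one SSD carries with an even spread."""
    return math.ceil(data_disks / ssds)


def max_journal_gb(ssd_gb, journals, reserve_fraction=0.0):
    """Upper bound on each journal partition's size in GB.

    reserve_fraction is the (hypothetical) share of the SSD left unused.
    """
    return ssd_gb * (1 - reserve_fraction) / journals


# 10 data disks + 2 SSDs, as in the first layout mentioned above.
print(journals_per_ssd(10, 2))

# The test setup: 6 data disks + 2 SSDs gives 3 journals per 180GB SSD.
n = journals_per_ssd(6, 2)
print(n, max_journal_gb(180, n), max_journal_gb(180, n, reserve_fraction=0.5))
```
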
>>>
>>> If so given the 8 drive slot example with seven OSDs presented in the
>>> docs what is the likelihood that using a high performance SSD for the
>>> OS image and also cutting journal/log partitions out of it for the
>>> remaining 7 2-3T near line SAS drives?
>>
>>
>> Just keep in mind that in this case your total throughput will likely be
>> limited by the SSD unless you get a very fast one (or are using 1GbE or have
>> some other bottleneck).
>>
>>
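
The bottleneck point can be sketched as a simple minimum over the three
limits. This is a crude model that assumes every write goes through the
single journal SSD and then to a data disk. The 100 MB/s per disk and
400 MB/s SSD figures are placeholder assumptions, not measurements of any
particular hardware.

```python
def write_bottleneck(disks, disk_mb_s, ssd_mb_s, net_mb_s):
    """Return (name, MB/s) of the component limiting aggregate node writes."""
    limits = {
        "data disks": disks * disk_mb_s,  # all data disks writing in parallel
        "journal SSD": ssd_mb_s,          # one SSD carries every journal
        "network": net_mb_s,              # client-facing link
    }
    name = min(limits, key=limits.get)
    return name, limits[name]


# 7 data disks behind one journal SSD on 10GbE (1250 MB/s raw line rate).
print(write_bottleneck(disks=7, disk_mb_s=100, ssd_mb_s=400, net_mb_s=1250))

# The same node on 1GbE (125 MB/s): the network limits first.
print(write_bottleneck(disks=7, disk_mb_s=100, ssd_mb_s=400, net_mb_s=125))
```

With these placeholder numbers the SSD would need to sustain roughly the
combined rate of all seven disks before it stops being the limit on a
10GbE node.
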
>>>
>>> Thanks,
>>> -Jon
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Thanks,
Mark

end of thread, other threads:[~2012-08-28  1:18 UTC | newest]

Thread overview: 22+ messages
2012-08-22 13:55 Ideal hardware spec? Jonathan Proulx
2012-08-22 14:17 ` Wido den Hollander
2012-08-22 14:39   ` Stephen Perkins
2012-08-23  8:24     ` Wido den Hollander
2012-08-24 14:17       ` Stephen Perkins
2012-08-24 14:41         ` Joe Landman
2012-08-24 15:05         ` Mark Nelson
2012-08-24 16:30           ` Sławomir Skowron
2012-08-24 18:12           ` Wido den Hollander
2012-08-24 18:23             ` Mark Nelson
2012-08-27 18:05               ` Stephen Perkins
2012-08-27 22:33                 ` Wido den Hollander
     [not found]             ` <00ae01cd823e$84e2ed20$8ea8c760$@netmass.com>
2012-08-25 11:48               ` Wido den Hollander
2012-08-24 16:12         ` Tommi Virtanen
2012-08-24 18:09         ` Wido den Hollander
2012-08-22 15:46   ` Jonathan Proulx
2012-08-23  9:59     ` Wido den Hollander
     [not found]       ` <CABYiri_-73UyTKHcHWDZdjqb=rozjraVzxd166NZV2ir53tduA@mail.gmail.com>
2012-08-26 11:15         ` Wido den Hollander
2012-08-26 13:29           ` Mark Nelson
2012-08-22 14:41 ` Mark Nelson
2012-08-28  0:02   ` Curtis C.
2012-08-28  1:18     ` Mark Nelson
