* Ideal hardware spec?
From: Jonathan Proulx @ 2012-08-22 13:55 UTC
To: ceph-devel

Hi All,

Yes, I'm asking the impossible question: what is the "best" hardware
config?

I'm looking at (possibly) using Ceph as the backing store for images and
volumes on OpenStack, as well as exposing at least the object store for
direct use.

The OpenStack cluster exists and is currently in the early stages of
use by researchers here: approx 1500 vCPU (that counts hyperthreads,
actually 768 physical cores) and 3T of RAM across 64 physical nodes.

On the object store side it would be a new resource for us, and it is
hard to say what people would do with it, except that it would be many
different things and the use profile would be constantly changing
(which is true of all our existing storage).

In this sense, even though it's a "private cloud", the somewhat
unpredictable usage profile gives it some characteristics of a small
public cloud.

Size-wise I'm hoping to start out with 3 monitors and 5(+) OSD nodes
to end up with 20-30T of 3x replicated storage (call me paranoid).

So the monitor specs seem relatively easy to come up with.  For the
OSDs it looks like
http://ceph.com/docs/master/install/hardware-recommendations suggests
1 drive, 1 core and 2G RAM per OSD (with multiple OSDs per storage
node).  On-list discussions seem to frequently include an SSD for
journaling (which is similar to what we do for our current ZFS-backed
NFS storage).

I'm hoping to wrap the hardware in a grant and am willing to experiment
a bit with different software configurations to tune it up when/if I
get the hardware in.  So my immediate concern is a hardware spec that
will have a reasonable processor:memory:disk ratio, and opinions (or
better, data) on the utility of SSD.

First, is the documented core-to-disk ratio still current best
practice?  Given a platform with more drive slots, could 8 cores handle
more disks?  Would that need/like more memory?

Have SSDs been shown to speed performance with this architecture?

If so, given the 8 drive slot example with seven OSDs presented in the
docs, how likely is it that using a high performance SSD for the OS
image, and also cutting journal/log partitions out of it for the
remaining 7 2-3T near-line SAS drives, would work?

Thanks,
-Jon
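
The sizing in the message above can be turned into rough per-node
numbers. This is only a back-of-envelope sketch: the function name, the
3T disk size and the even spread of OSDs across nodes are assumptions
for illustration, and it ignores filesystem overhead and free-space
headroom. The 1 core and 2G RAM per OSD figures are the ones quoted
from the hardware recommendations page.

```python
import math

def size_cluster(usable_tb, replicas, disk_tb, osd_nodes,
                 cores_per_osd=1, ram_gb_per_osd=2):
    """Rough per-node sizing from a usable-capacity target.

    Assumes one OSD per data disk and OSDs spread evenly over the nodes.
    """
    raw_tb = usable_tb * replicas                  # capacity before replication
    total_osds = math.ceil(raw_tb / disk_tb)       # one OSD per disk
    osds_per_node = math.ceil(total_osds / osd_nodes)
    return {
        "raw_tb": raw_tb,
        "total_osds": total_osds,
        "osds_per_node": osds_per_node,
        "cores_per_node": osds_per_node * cores_per_osd,
        "ram_gb_per_node": osds_per_node * ram_gb_per_osd,
    }

# Upper end of the 20-30T target, 3x replication, 5 OSD nodes,
# hypothetical 3T drives.
plan = size_cluster(usable_tb=30, replicas=3, disk_tb=3, osd_nodes=5)
print(plan)
```

With these assumed inputs each of the 5 nodes carries 6 data disks,
which is close to the 8-slot, seven-OSD example mentioned above.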
* Re: Ideal hardware spec?
From: Wido den Hollander @ 2012-08-22 14:17 UTC
To: Jonathan Proulx; +Cc: ceph-devel

Hi,

On 08/22/2012 03:55 PM, Jonathan Proulx wrote:
> Hi All,
>
> Yes, I'm asking the impossible question: what is the "best" hardware
> config?
>
> I'm looking at (possibly) using Ceph as the backing store for images and
> volumes on OpenStack, as well as exposing at least the object store for
> direct use.
>
> The OpenStack cluster exists and is currently in the early stages of
> use by researchers here: approx 1500 vCPU (that counts hyperthreads,
> actually 768 physical cores) and 3T of RAM across 64 physical nodes.
>
> On the object store side it would be a new resource for us, and it is
> hard to say what people would do with it, except that it would be many
> different things and the use profile would be constantly changing
> (which is true of all our existing storage).
>
> In this sense, even though it's a "private cloud", the somewhat
> unpredictable usage profile gives it some characteristics of a small
> public cloud.
>
> Size-wise I'm hoping to start out with 3 monitors and 5(+) OSD nodes
> to end up with 20-30T of 3x replicated storage (call me paranoid).
>

I prefer 3x replication as well. I've seen the "wrong" OSDs die on me
too often.

> So the monitor specs seem relatively easy to come up with.  For the
> OSDs it looks like
> http://ceph.com/docs/master/install/hardware-recommendations suggests
> 1 drive, 1 core and 2G RAM per OSD (with multiple OSDs per storage
> node).  On-list discussions seem to frequently include an SSD for
> journaling (which is similar to what we do for our current ZFS-backed
> NFS storage).
>
> I'm hoping to wrap the hardware in a grant and am willing to experiment
> a bit with different software configurations to tune it up when/if I
> get the hardware in.  So my immediate concern is a hardware spec that
> will have a reasonable processor:memory:disk ratio, and opinions (or
> better, data) on the utility of SSD.
>
> First, is the documented core-to-disk ratio still current best
> practice?  Given a platform with more drive slots, could 8 cores handle
> more disks?  Would that need/like more memory?
>

I'd still suggest about 2GB of RAM per OSD. The more RAM you have in
the OSD machines, the more the kernel can buffer, which will always be
a performance gain.

You should, however, ask yourself whether you want a lot of OSDs per
server rather than smaller machines with fewer disks.

For example:

- 1U
- 4 cores
- 8GB RAM
- 4 disks
- 1 SSD

Or:

- 2U
- 8 cores
- 16GB RAM
- 8 disks
- 1|2 SSDs

Per rack unit, both give you the same amount of storage, but the impact
of losing one physical machine is larger with the 2U machine.

If you take 1TB disks you'd lose 8TB of storage; that is a lot of
recovery to be done.

Since btrfs (assuming you are going to use that) is still in
development, it can't be ruled out that a machine goes down due to a
kernel panic or other problems.

My personal preference is multiple small(er) machines rather than a
couple of large ones.

> Have SSDs been shown to speed performance with this architecture?
>

I've indeed seen an improvement in performance. Make sure, however,
that you have a recent version of glibc with syncfs support.

> If so, given the 8 drive slot example with seven OSDs presented in the
> docs, how likely is it that using a high performance SSD for the OS
> image, and also cutting journal/log partitions out of it for the
> remaining 7 2-3T near-line SAS drives, would work?
>

You should make sure your SSD is capable of doing the line speed of
your network.

If you are connecting the machines with 4G trunks, make sure the SSD is
capable of doing around 400MB/sec of sustained writes.

I'd recommend the Intel 520 SSDs, with their available capacity reduced
with hdparm to about 20% of the original capacity. This way the SSD
always has a lot of free cells available for writing. Reprogramming
cells is expensive on an SSD.

You can run the OS on the same SSD since that won't do that much I/O.
I'd recommend not logging locally though, since that will also write to
the same SSD. Try using remote syslog.

You can also use the USB sticks[0] from Stec; they have server-grade
onboard USB sticks for this kind of application.

A couple of questions still need to be answered though:
* Which OS are you planning on using? Ubuntu 12.04 is recommended.
* Which filesystem do you want to use underneath the OSDs?

Wido

[0]: http://www.stec-inc.com/product/ufm.php

> Thanks,
> -Jon
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
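
Two of the rules of thumb in the reply above lend themselves to quick
arithmetic: whether a journal SSD can keep up with the network, and how
much data has to be re-replicated when one node dies. The sketch below
is illustrative only. The function names are made up, the 80% link
efficiency is an assumption chosen because it maps a 4 Gbit/s trunk to
roughly the 400MB/sec figure given, and the recovery estimate assumes
the node's disks are full.

```python
def line_rate_mb_s(link_gbit, efficiency=0.8):
    """Approximate payload throughput of a link in MB/s.

    efficiency is an assumed factor for protocol overhead.
    """
    return link_gbit * 1000 / 8 * efficiency

def journal_keeps_up(ssd_write_mb_s, link_gbit, efficiency=0.8):
    """True if the SSD's sustained write rate covers the network rate."""
    return ssd_write_mb_s >= line_rate_mb_s(link_gbit, efficiency)

def data_to_recover_tb(disks_per_node, disk_tb, fill_ratio=1.0):
    """Data the cluster must re-replicate after losing one node."""
    return disks_per_node * disk_tb * fill_ratio

print(line_rate_mb_s(4))            # target write rate for a 4G trunk
print(journal_keeps_up(450, 4))     # hypothetical SSD at 450 MB/s
print(journal_keeps_up(250, 4))     # hypothetical SSD at 250 MB/s

# The two example chassis with 1TB disks.
print(data_to_recover_tb(4, 1))     # 1U, 4 disks
print(data_to_recover_tb(8, 1))     # 2U, 8 disks
```

The recovery figure is the reason given above for preferring smaller
nodes: the 2U box doubles the amount of data in flight after a single
machine failure.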
* RE: Ideal hardware spec?
From: Stephen Perkins @ 2012-08-22 14:39 UTC
To: 'Wido den Hollander', 'Jonathan Proulx'; +Cc: ceph-devel

Hi all,

Is there a place we can set up a group of hardware recipes that people
can query and modify over time?  It would be good if people could
submit and "group modify" the recipes.  I would envision "hypothetical"
configurations and "deployed/tested" configurations.

Trekking back through email exchanges like this becomes hard for people
who join later.

I'd like to see a "best" hardware config as well... however, I'm
interested in a SAS switching fabric where the nodes do not have any
storage (except possibly an onboard boot drive/USB stick, as Wido
mentioned).  Each node would have a SAS HBA that allows it to access a
LARGE JBOD provided by an HA set of SAS switches
(http://www.lsi.com/solutions/Pages/SwitchedSAS.aspx).  The drives are
LUN masked for each host.

The thought here is that you can add compute nodes, storage shelves,
and disks all independently.  With proper masking, you could provide
redundancy to cover drive, node, and shelf failures.  You could also
add disks "horizontally" if you have spare slots in a shelf, and you
could add shelves "vertically" and increase the disk count available to
existing nodes.

My goal is to be able to scale without having to draw the enormous
power of lots of 1U devices or buy lots of disks and shelves each time
I want to add a little capacity.

Anybody looked at Atom processors?

- Steve
* Re: Ideal hardware spec?
From: Wido den Hollander @ 2012-08-23 8:24 UTC
To: Stephen Perkins; +Cc: 'Jonathan Proulx', ceph-devel

On 08/22/2012 04:39 PM, Stephen Perkins wrote:
> Hi all,
>
> Is there a place we can set up a group of hardware recipes that people
> can query and modify over time?  It would be good if people could
> submit and "group modify" the recipes.  I would envision "hypothetical"
> configurations and "deployed/tested" configurations.
>
> Trekking back through email exchanges like this becomes hard for people
> who join later.
>

At the moment there isn't, but yes, a "show your setup" would be
useful.

I don't know if there is any real reference material right now, but at
a later stage some showcases could be a great reference.

> I'd like to see a "best" hardware config as well... however, I'm
> interested in a SAS switching fabric where the nodes do not have any
> storage (except possibly an onboard boot drive/USB stick, as Wido
> mentioned).  Each node would have a SAS HBA that allows it to access a
> LARGE JBOD provided by an HA set of SAS switches
> (http://www.lsi.com/solutions/Pages/SwitchedSAS.aspx).  The drives are
> LUN masked for each host.
>
> The thought here is that you can add compute nodes, storage shelves,
> and disks all independently.  With proper masking, you could provide
> redundancy to cover drive, node, and shelf failures.  You could also
> add disks "horizontally" if you have spare slots in a shelf, and you
> could add shelves "vertically" and increase the disk count available to
> existing nodes.
>

What would the benefit be of building such a complex SAS environment?
You'd be spending a lot of money on SAS switches, JBODs and cabling.

Your SPOF would still be your whole SAS setup.

And what is the benefit of having Ceph run on top of that? If you have
all the disks available to all the nodes, why not run ZFS?

ZFS would give you better performance, since what you are building
would actually be a local filesystem.

For risk spreading you should not interconnect all the nodes.

The more complexity you add to the whole setup, the more likely it is
to go down completely at some point in time.

I'm just trying to understand why you would want to run a distributed
filesystem on top of a bunch of direct attached disks.

Again, if all the disks are attached locally you'd be better off using
ZFS.

> My goal is to be able to scale without having to draw the enormous
> power of lots of 1U devices or buy lots of disks and shelves each time
> I want to add a little capacity.
>

You can do that: scale by adding a 1U node with 2, 3 or 4 disks at a
time. Depending on your crushmap, you might need to add 3 machines at
once.

If you have three "racks" in your crushmap, each containing 5 nodes,
you need to add a new node to each rack when expanding capacity to keep
the racks balanced. This way you would add three nodes when expanding.

> Anybody looked at Atom processors?
>

Yes, I have.

I'm running Atom D525 (SuperMicro X7SPA-HF) nodes with 4GB of RAM, 4
2TB disks and an 80GB SSD (old X25-M) for journaling.

That works, but what I notice is that under heavy recovery the Atoms
can't cope with it.

I'm thinking about building a couple of nodes with an AMD Brazos
mainboard, something like an Asus E35M1-I.

That is not a server board, but it would just be a reference to see
what it does.

One of the problems with the Atoms is the 4GB memory limitation; with
the AMD Brazos you can use 8GB.

I'm trying to figure out a way to have a really large number of small
nodes for a low price, to get a massive cluster where the impact of
losing one node is very small.
Wido
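
The point above about adding three machines at once follows from
keeping the CRUSH "racks" balanced. A minimal sketch of that
arithmetic, assuming the racks start out balanced and every rack must
receive the same number of new nodes; the function name and rack labels
are hypothetical and nothing here talks to Ceph itself.

```python
def balanced_expansion(nodes_per_rack, new_nodes_wanted):
    """Round an expansion up so every rack gets the same number of nodes.

    nodes_per_rack maps a rack name to its current node count.
    Returns a mapping of rack name to the number of nodes to add.
    """
    racks = len(nodes_per_rack)
    if racks == 0:
        raise ValueError("at least one rack is required")
    per_rack = -(-new_nodes_wanted // racks)   # ceiling division
    return {rack: per_rack for rack in nodes_per_rack}

# Three racks of five nodes each, as in the example above.
racks = {"rack1": 5, "rack2": 5, "rack3": 5}
plan = balanced_expansion(racks, new_nodes_wanted=1)
print(plan, "total:", sum(plan.values()))
```

Asking for a single extra node still results in one node per rack, so
the smallest balanced step for this layout is three machines.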
* RE: Ideal hardware spec?
From: Stephen Perkins @ 2012-08-24 14:17 UTC
To: 'Wido den Hollander'; +Cc: ceph-devel

Morning Wido (and all),

>> I'd like to see a "best" hardware config as well... however, I'm
>> interested in a SAS switching fabric where the nodes do not have any
>> storage (except possibly an onboard boot drive/USB stick, as Wido
>> mentioned).  Each node would have a SAS HBA that allows it to access a
>> LARGE JBOD provided by an HA set of SAS switches
>> (http://www.lsi.com/solutions/Pages/SwitchedSAS.aspx).  The drives are
>> LUN masked for each host.
>>
>> The thought here is that you can add compute nodes, storage shelves,
>> and disks all independently.  With proper masking, you could provide
>> redundancy to cover drive, node, and shelf failures.  You could also
>> add disks "horizontally" if you have spare slots in a shelf, and you
>> could add shelves "vertically" and increase the disk count available
>> to existing nodes.
>>
>
> What would the benefit be of building such a complex SAS environment?
> You'd be spending a lot of money on SAS switches, JBODs and cabling.

Density.

> Your SPOF would still be your whole SAS setup.

Well... I'm not sure I would consider it a single point of failure... a
pair of cross-connected switches and 3-5 disk shelves.  Shelves can be
purchased with fully redundant internals (dual data paths etc. to SAS
drives).  That is not even that important.  If each shelf is just
looked at as JBOD, then you can group disks from different shelves into
btrfs or hardware RAID groups.  Or... you can look at each disk as its
own storage with its own OSD.

A SAS switch going offline would have no impact since everything is
cross connected.

A whole shelf can go offline and it would only appear as a single drive
failure in a RAID group (if disk groups are distributed properly).

You can then get compute nodes fairly densely packed by purchasing
SuperMicro 2U Twin enclosures:
http://www.supermicro.com/products/nfo/2UTwin2.cfm

You can get 3-4 of those compute enclosures with dual SAS connectors
(each enclosure not necessarily fully populated initially).  The beauty
is that the SAS interconnect is fast.  Much faster than Ethernet.

Please bear in mind that I am looking to create a highly available and
scalable storage system that will fit in as small an area as possible
and draw as little power as possible.  The reasoning is that we
co-locate all our equipment at remote data centers.  Each rack (along
with its associated power and any needed cross connects) represents a
significant ongoing operational expense.  Therefore, for me, density
and incremental scalability are important.

> And what is the benefit of having Ceph run on top of that? If you have
> all the disks available to all the nodes, why not run ZFS?
>
> ZFS would give you better performance, since what you are building
> would actually be a local filesystem.

There is no high availability here.  Yes... you can try to do old
school magic with SAN file systems, complicated clustering, and
synchronous replication, but a RAIN approach appeals to me.  That is
what I see in Ceph.  Don't get me wrong... I love ZFS... but I am
trying to figure out a scalable HA solution that looks like RAIN.  (Am
I missing a feature of ZFS?)

> For risk spreading you should not interconnect all the nodes.

I do understand this.  However, our operational setup will not allow
multiple racks at the beginning.  So... given the constraints of 1 rack
(with dual power and dual WAN links), I do not see that a pair of
cross-connected SAS switches is any less reliable than a pair of
cross-connected Ethernet switches...

As storage scales and we outgrow the single rack at a location, we can
overflow into a second rack etc.

> The more complexity you add to the whole setup, the more likely it is
> to go down completely at some point in time.
>
> I'm just trying to understand why you would want to run a distributed
> filesystem on top of a bunch of direct attached disks.

I guess I don't consider a SAN a bunch of direct attached disks.  The
SAS infrastructure is a SAN with SAS interconnects (versus fiber, iSCSI
or InfiniBand)...  The disks are accessed via JBOD if desired... or you
can put RAID on top of a group of them.  The multiple shelves of drives
are a way to attempt to reduce the dependence on a single piece of
hardware (i.e. it becomes RAIN).

> Again, if all the disks are attached locally you'd be better off using
> ZFS.

This is not highly available, and AFAICT, the compute load would not
scale with the storage.

>> My goal is to be able to scale without having to draw the enormous
>> power of lots of 1U devices or buy lots of disks and shelves each time
>> I want to add a little capacity.
>>
>
> You can do that: scale by adding a 1U node with 2, 3 or 4 disks at a
> time. Depending on your crushmap, you might need to add 3 machines at
> once.

Adding three machines at once is what I was trying to avoid (I believe
that I need 3 replicas to make things reasonably redundant).  At first
glance, it does not seem like a very dense solution to try to add a
bunch of 1U servers with a few disks.  The associated cost of a bunch
of 1U servers over JBOD, plus (and more importantly) the rack space and
power draw, can cause OPEX problems.  I can purchase multiple
enclosures, but not fully populate them with disks/CPUs.  This gives me
a redundant array of nodes (RAIN).  Then, as needed, I can add drives
or compute cards to the existing enclosures for little incremental
cost.

In your case above of three 1U servers, I can add 12 disks to the
existing 4 enclosures (in groups of three) instead of three 1U servers
with 4 disks each.  I can then either run more OSDs on existing compute
nodes, or I can add one more compute node and it can handle the new
drives with one or more OSDs.

If I run out of space in enclosures, I can add one more shelf (just
one) and start adding drives.  I can then "include" the new drives into
existing OSDs such that each existing OSD has a little more storage it
needs to worry about.  (The specifics of growing an existing OSD by
adding a disk are still a little fuzzy to me.)

>> Anybody looked at Atom processors?
>>
>
> Yes, I have.
>
> I'm running Atom D525 (SuperMicro X7SPA-HF) nodes with 4GB of RAM, 4
> 2TB disks and an 80GB SSD (old X25-M) for journaling.
>
> That works, but what I notice is that under heavy recovery the Atoms
> can't cope with it.
>
> I'm thinking about building a couple of nodes with an AMD Brazos
> mainboard, something like an Asus E35M1-I.
>
> That is not a server board, but it would just be a reference to see
> what it does.
>
> One of the problems with the Atoms is the 4GB memory limitation; with
> the AMD Brazos you can use 8GB.
>
> I'm trying to figure out a way to have a really large number of small
> nodes for a low price, to get a massive cluster where the impact of
> losing one node is very small.

Given that "massive" is a relative term, I am as well... but I'm also
trying to reduce the footprint (power and space) of that "massive"
cluster.  I also want to start small (1/2 rack) and scale as needed.

- Steve
* Re: Ideal hardware spec? 2012-08-24 14:17 ` Stephen Perkins @ 2012-08-24 14:41 ` Joe Landman 2012-08-24 15:05 ` Mark Nelson ` (2 subsequent siblings) 3 siblings, 0 replies; 22+ messages in thread From: Joe Landman @ 2012-08-24 14:41 UTC (permalink / raw) To: Stephen Perkins; +Cc: 'Wido den Hollander', ceph-devel On 08/24/2012 10:17 AM, Stephen Perkins wrote: >>> The thought here is that you can add compute nodes, storage shelves, >>> and disks all independently. With proper masking, you could provide > redundancy >>> to cover drive, node, and shelf failures. You could also add disks >>> "horizontally" if you have spare slots in a shelf, and you could add >>> shelves "vertically" and increase the disk count available to existing > nodes. >>> >> >> What would the benefit be from building such a complex SAS environment? >> You'd be spending a lot of money on SAS switch, JBODs and cabling. > > Density. As a solutions vendor, we try to stay out of these discussions in general, as we are biased (of course). Your discussion of being able to scale up density, fabric, and other relevant things is rather precisely what one of our products is meant to do, though we take a different route on the fabric. Rather than using SAS switching and SAS targets, we use iSCSI and iSER transports over 10 and 40GbE and IB. Our targets are iSCSI/iSER. Put these underneath what we call the presentation layer, where the Ceph OSDs, MDSs, etc will live. Otherwise they are quite similar. I don't want to pollute this discussion with a commercial. Just wanted to chime in here to let Stephen know that we've been doing that sort of design for a while. >> Your SPOF would still be your whole SAS setup. Actually no. This design is, when well implemented, more resilient than many others. > > Well... I'm not sure I would consider it a single point of failure... a > pair of cross-connected switches and 3-5 disk shelves. 
Shelves can be > purchased with fully redundant internals (dual data paths etc to SAS > drives). That is not even that important. If each shelf is just looked at > as JBOD, then you can group disks from different shelves into btrfs or > hardware RAID groups. Or... you can look at each disk as its own storage > with its own OSD. > > A SAS switch going offline would have no impact since everything is cross > connected. > > A whole shelf can go offline and it would only appear as a single drive > failure in a RAID group (if disks groups are distributed properly). > > You can then get compute nodes fairly densely packed by purchasing > SuperMicro 2uTwin enclosures: > http://www.supermicro.com/products/nfo/2UTwin2.cfm > > You can get 3 - 4 of those compute enclosure with dual SAS connectors (each > enclosure not necessarily fully populated initially). The beauty is that the > SAS interconnect is fast. Much faster than Ethernet. You remove SPOFs by accepting the reality that it's effectively impossible to have truly redundant power/data pathways on single-backplane boards (literally the definition of a single point of failure). If your redundant power supplies have a single power path to your backplane, is that redundant power (in the event of a short on the backplane)? No, not even close. And if your expander unit completely fails and locks hard ..., do you have a completely electrically separate pathway to your data? With the single backplane/data path units, no, you don't have this. So putting multiple RAID cards into these units provides you with something akin to "security theatre". > > Please bear in mind that I am looking to create a highly available and > scalable storage system that will fit in as small an area as possible and > draw as little power as possible. The reasoning is that we co-locate all > our equipment at remote data centers.
Each rack (along with its associated > power and any needed cross connects) represents a significant ongoing > operational expense. Therefore, for me, density and incremental scalability > are important. Not trying to be a commercial: Think multi PB per 42U rack without heroics. > >> And what is the benefit for having Ceph run on top of that? If you have all > the disks available to all the nodes, why not run ZFS? >> ZFS would give you better performance since what you are building would > actually be a local filesystem. > > There is no high availability here. Yes... You can try to do old school > magic with SAN file systems, complicated clustering, and synchronous > replication, but a RAIN approach appeals to me. That is what I see in Ceph. > Don't get me wrong... I love ZFS... but am trying to figure out a scalable > HA solution that looks like RAIN. (Am I missing a feature of ZFS)? RAIN has some use cases, but rebuild times for a limited number of RAIDs and a huge number of drives will be HUGE. Especially if your distributed LUNs start looking like multi tens to hundreds of TB. Really, you'd have to go Ceph at this point. > >> For risk spreading you should not interconnect all the nodes. > > I do understand this. However, our operational setup will not allow > multiple racks at the beginning. So... given the constraints of 1 rack > (with dual power and dual WAN links), I do not see that a pair of cross > connected SAS switches is any less reliable than a pair of cross connected > ethernet switches... > > As storage scales and we outgrow the single rack at a location, we can > overflow into a second rack etc. > >> The more complexity you add to the whole setup, the more likely it's to go > down completely at some point in time. >> >> I'm just trying to understand why you would want to run a distributed > filesystem on top of a bunch of direct attached disks. > > I guess I don't consider a SAN a bunch of direct attached disks. 
The SAS > infrastructure is a SAN with SAS interconnects (versus fiber, iscsi or > infiniband)... The disks are accessed via JBOD if desired... or you can put > RAID on top of a group of them. The multiple shelves of drives are a way to > attempt to reduce the dependence on a single piece of hardware (i.e. it > becomes RAIN). > >> Again, if all the disks are attached locally you'd be better of by using > ZFS. > > This is not highly available, and AFAICT, the compute load would not scale > with the storage. > >>> My goal is to be able to scale without having to draw the enormous >>> power of lots of 1U devices or buy lots of disks and shelves each time >>> I wasn't to add a little capacity. >>> >> >> You can do that, scale by adding a 1U node with 2, 3 of 4 disks at the > time, depending on your crushmap you might need to add 3 machines at a once. > > Adding three machines at once is what I was trying to avoid (I believe that > I need 3 replicas to make things reasonably redundant). From first glance, > it does not seem like a very dense solution to try to add a bunch of 1U > servers with a few disks. The associated cost of a bunch of 1U Servers over > JBOD, plus (and more importantly) the rack space and power draw, can cause > OPEX problems. I can purchase multiple enclosures, but not fully populate > them with disks/cpus. This gives me a redundant array of nodes (RAIN). > Then. as needed, I can add drives or compute cards to the existing > enclosures for little incremental cost. > > In your 3 1U server case above, I can add 12 disks to existing 4 enclosures > (in groups of three) instead of three 1U servers with 4 disks each. I can > then either run more OSDs on existing compute nodes or I can add one more > compute node and it can handle the new drives with one or more OSDs. If I > run out of space in enclosures, I can add one more shelf (just one) and > start adding drives. 
I can then "include" the new drives into existing OSDs > such that each existing OSD has a little more storage it needs to worry > about. (The specifics of growing an existing OSD by adding a disk is still > a little fuzzy to me). > >>> Anybody looked at atom processors? >>> >> >> Yes, I have.. >> >> I'm running Atom D525 (SuperMicro X7SPA-HF) nodes with 4GB of RAM and 4 2TB > disks and a 80GB SSD (old X25-M) for journaling. >> >> That works, but what I notice is that under heavy recover the Atoms can't > cope with it. >> >> I'm thinking about building a couple of nodes with the AMD Brazos > mainboard, somelike like an Asus E35M1-I. >> >> That is not a serverboard, but it would just be a reference to see what it > does. >> >> One of the problems with the Atoms is the 4GB memory limitation, with the > AMD Brazos you can use 8GB. >> >> I'm trying to figure out a way to have a really large amount of small nodes > for a low price to have >> a massive cluster where the impact of loosing one node is very small. > > Given that "massive" is a relative term, I am as well... but I'm also trying > to reduce the footprint (power and space) of that "massive" cluster. I also > want to start small (1/2 rack) and scale as needed. Again, not a commercial: Think 1PB in less than half of a 42U rack, with a little more than 1 ton of AC. > > - Steve > > > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman@scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/sicluster phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 ^ permalink raw reply [flat|nested] 22+ messages in thread
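For a rough sense of what the density figure above implies (1PB in less than half of a 42U rack, with a little more than 1 ton of AC), here is a back-of-envelope sketch. It assumes decimal units (1 PB = 1000 TB), takes "half a rack" as exactly 21U, and uses the standard conversion of 1 ton of refrigeration = 12,000 BTU/h, about 3.517 kW. The function names are illustrative only, and the result is a bound implied by the quoted numbers, not a vendor specification.

```python
# Back-of-envelope check of "1PB in less than half a 42U rack, ~1 ton of AC".
TON_OF_REFRIGERATION_KW = 3.517  # 1 ton = 12,000 BTU/h, roughly 3.517 kW of heat removal


def tb_per_rack_unit(capacity_tb, rack_units):
    """Raw capacity packed into each rack unit."""
    return capacity_tb / rack_units


def cooling_watts_per_tb(cooling_tons, capacity_tb):
    """Heat load (in watts) that the quoted cooling covers per TB of capacity."""
    return cooling_tons * TON_OF_REFRIGERATION_KW * 1000 / capacity_tb


capacity_tb = 1000.0   # 1 PB, decimal units (assumption)
half_rack_u = 42 / 2   # "half a rack" taken as exactly 21U (assumption)

density = tb_per_rack_unit(capacity_tb, half_rack_u)
heat = cooling_watts_per_tb(1.0, capacity_tb)
print(f"{density:.1f} TB per U, {heat:.2f} W of cooling per TB")
```

Since the claim is "less than" half a rack and "a little more than" one ton, the real density would be somewhat above this figure and the cooling load per TB somewhat higher.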
* Re: Ideal hardware spec? 2012-08-24 14:17 ` Stephen Perkins 2012-08-24 14:41 ` Joe Landman @ 2012-08-24 15:05 ` Mark Nelson 2012-08-24 16:30 ` Sławomir Skowron 2012-08-24 18:12 ` Wido den Hollander 2012-08-24 16:12 ` Tommi Virtanen 2012-08-24 18:09 ` Wido den Hollander 3 siblings, 2 replies; 22+ messages in thread From: Mark Nelson @ 2012-08-24 15:05 UTC (permalink / raw) To: Stephen Perkins; +Cc: 'Wido den Hollander', ceph-devel On 08/24/2012 09:17 AM, Stephen Perkins wrote: > Morning Wido (and all), > >>> I'd like to see a "best" hardware config as well... however, I'm >>> interested in a SAS switching fabric where the nodes do not have any >>> storage (except possibly onboard boot drive/USB as listed below). >>> Each node would have a SAS HBA that allows it to access a LARGE jbod >>> provided by a HA set of SAS Switches >>> (http://www.lsi.com/solutions/Pages/SwitchedSAS.aspx). The drives are lun > masked for each host. >>> >>> The thought here is that you can add compute nodes, storage shelves, >>> and disks all independently. With proper masking, you could provide > redundancy >>> to cover drive, node, and shelf failures. You could also add disks >>> "horizontally" if you have spare slots in a shelf, and you could add >>> shelves "vertically" and increase the disk count available to existing > nodes. >>> >> >> What would the benefit be from building such a complex SAS environment? >> You'd be spending a lot of money on SAS switch, JBODs and cabling. > > Density. > Trying to balance dense solutions with more failure points against cheap low-density solutions is always tough. Though not the densest solution out there, we are starting to investigate performance on an SC847a chassis with 36 hotswap drives in 4U (along with internal drives for the system). Our setup doesn't use SAS expanders, which is a nice bonus, though it does require a lot of controllers. >> Your SPOF would still be your whole SAS setup. > > Well...
I'm not sure I would consider it a single point of failure... a > pair of cross-connected switches and 3-5 disk shelves. Shelves can be > purchased with fully redundant internals (dual data paths etc to SAS > drives). That is not even that important. If each shelf is just looked at > as JBOD, then you can group disks from different shelves into btrfs or > hardware RAID groups. Or... you can look at each disk as its own storage > with its own OSD. > > A SAS switch going offline would have no impact since everything is cross > connected. > > A whole shelf can go offline and it would only appear as a single drive > failure in a RAID group (if disks groups are distributed properly). > > You can then get compute nodes fairly densely packed by purchasing > SuperMicro 2uTwin enclosures: > http://www.supermicro.com/products/nfo/2UTwin2.cfm > > You can get 3 - 4 of those compute enclosure with dual SAS connectors (each > enclosure not necessarily fully populated initially). The beauty is that the > SAS interconnect is fast. Much faster than Ethernet. > > Please bear in mind that I am looking to create a highly available and > scalable storage system that will fit in as small an area as possible and > draw as little power as possible. The reasoning is that we co-locate all > our equipment at remote data centers. Each rack (along with its associated > power and any needed cross connects) represents a significant ongoing > operational expense. Therefore, for me, density and incremental scalability > are important. There are some pretty interesting solutions on the horizon from various vendors that achieve a pretty decent amount of density. Should be interesting times ahead. :) > >> And what is the benefit for having Ceph run on top of that? If you have all > the disks available to all the nodes, why not run ZFS? >> ZFS would give you better performance since what you are building would > actually be a local filesystem. > > There is no high availability here. Yes... 
You can try to do old school > magic with SAN file systems, complicated clustering, and synchronous > replication, but a RAIN approach appeals to me. That is what I see in Ceph. > Don't get me wrong... I love ZFS... but am trying to figure out a scalable > HA solution that looks like RAIN. (Am I missing a feature of ZFS)? > >> For risk spreading you should not interconnect all the nodes. > > I do understand this. However, our operational setup will not allow > multiple racks at the beginning. So... given the constraints of 1 rack > (with dual power and dual WAN links), I do not see that a pair of cross > connected SAS switches is any less reliable than a pair of cross connected > ethernet switches... > > As storage scales and we outgrow the single rack at a location, we can > overflow into a second rack etc. > >> The more complexity you add to the whole setup, the more likely it's to go > down completely at some point in time. >> >> I'm just trying to understand why you would want to run a distributed > filesystem on top of a bunch of direct attached disks. > > I guess I don't consider a SAN a bunch of direct attached disks. The SAS > infrastructure is a SAN with SAS interconnects (versus fiber, iscsi or > infiniband)... The disks are accessed via JBOD if desired... or you can put > RAID on top of a group of them. The multiple shelves of drives are a way to > attempt to reduce the dependence on a single piece of hardware (i.e. it > becomes RAIN). > >> Again, if all the disks are attached locally you'd be better of by using > ZFS. > > This is not highly available, and AFAICT, the compute load would not scale > with the storage. > >>> My goal is to be able to scale without having to draw the enormous >>> power of lots of 1U devices or buy lots of disks and shelves each time >>> I wasn't to add a little capacity. 
>>> >> >> You can do that, scale by adding a 1U node with 2, 3 of 4 disks at the > time, depending on your crushmap you might need to add 3 machines at a once. > > Adding three machines at once is what I was trying to avoid (I believe that > I need 3 replicas to make things reasonably redundant). From first glance, > it does not seem like a very dense solution to try to add a bunch of 1U > servers with a few disks. The associated cost of a bunch of 1U Servers over > JBOD, plus (and more importantly) the rack space and power draw, can cause > OPEX problems. I can purchase multiple enclosures, but not fully populate > them with disks/cpus. This gives me a redundant array of nodes (RAIN). > Then. as needed, I can add drives or compute cards to the existing > enclosures for little incremental cost. > > In your 3 1U server case above, I can add 12 disks to existing 4 enclosures > (in groups of three) instead of three 1U servers with 4 disks each. I can > then either run more OSDs on existing compute nodes or I can add one more > compute node and it can handle the new drives with one or more OSDs. If I > run out of space in enclosures, I can add one more shelf (just one) and > start adding drives. I can then "include" the new drives into existing OSDs > such that each existing OSD has a little more storage it needs to worry > about. (The specifics of growing an existing OSD by adding a disk is still > a little fuzzy to me). > >>> Anybody looked at atom processors? >>> >> >> Yes, I have.. >> >> I'm running Atom D525 (SuperMicro X7SPA-HF) nodes with 4GB of RAM and 4 2TB > disks and a 80GB SSD (old X25-M) for journaling. >> >> That works, but what I notice is that under heavy recover the Atoms can't > cope with it. >> >> I'm thinking about building a couple of nodes with the AMD Brazos > mainboard, somelike like an Asus E35M1-I. >> >> That is not a serverboard, but it would just be a reference to see what it > does. 
>> >> One of the problems with the Atoms is the 4GB memory limitation, with the > AMD Brazos you can use 8GB. >> >> I'm trying to figure out a way to have a really large amount of small nodes > for a low price to have >> a massive cluster where the impact of loosing one node is very small. > > Given that "massive" is a relative term, I am as well... but I'm also trying > to reduce the footprint (power and space) of that "massive" cluster. I also > want to start small (1/2 rack) and scale as needed. If you do end up testing Brazos processors, please post your results! I think it really depends on what kind of performance you are aiming for. Our stock 2U test boxes have 6-core Opterons, and our SC847a has dual 6-core low-power Xeon E5s. At 10GbE+ these are probably going to be pushed pretty hard, especially during recovery. > > - Steve ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Ideal hardware spec? 2012-08-24 15:05 ` Mark Nelson @ 2012-08-24 16:30 ` Sławomir Skowron 2012-08-24 18:12 ` Wido den Hollander 1 sibling, 0 replies; 22+ messages in thread From: Sławomir Skowron @ 2012-08-24 16:30 UTC (permalink / raw) To: Mark Nelson Cc: Stephen Perkins, Wido den Hollander, ceph-devel@vger.kernel.org On 24 Aug 2012, at 17:05, Mark Nelson <mark.nelson@inktank.com> wrote: > On 08/24/2012 09:17 AM, Stephen Perkins wrote: >> Morning Wido (and all), >> >>>> I'd like to see a "best" hardware config as well... however, I'm >>>> interested in a SAS switching fabric where the nodes do not have any >>>> storage (except possibly onboard boot drive/USB as listed below). >>>> Each node would have a SAS HBA that allows it to access a LARGE jbod >>>> provided by a HA set of SAS Switches >>>> (http://www.lsi.com/solutions/Pages/SwitchedSAS.aspx). The drives are lun >> masked for each host. >>>> >>>> The thought here is that you can add compute nodes, storage shelves, >>>> and disks all independently. With proper masking, you could provide >> redundancy >>>> to cover drive, node, and shelf failures. You could also add disks >>>> "horizontally" if you have spare slots in a shelf, and you could add >>>> shelves "vertically" and increase the disk count available to existing >> nodes. >>>> >>> >>> What would the benefit be from building such a complex SAS environment? >>> You'd be spending a lot of money on SAS switch, JBODs and cabling. >> >> Density. >> > > Trying to balance between dense solutions with more failure points vs cheap low density solutions is always tough. Though not the densest solution out there, we are starting to investigate performance on an SC847a chassis with 36 hotswap drives in 4U (along with internal drives for the system). Our setup doesn't use SAS expanders which is nice bonus, though it does require a lot of controllers. > >>> Your SPOF would still be your whole SAS setup. >> >> Well...
I'm not sure I would consider it a single point of failure... a >> pair of cross-connected switches and 3-5 disk shelves. Shelves can be >> purchased with fully redundant internals (dual data paths etc to SAS >> drives). That is not even that important. If each shelf is just looked at >> as JBOD, then you can group disks from different shelves into btrfs or >> hardware RAID groups. Or... you can look at each disk as its own storage >> with its own OSD. >> >> A SAS switch going offline would have no impact since everything is cross >> connected. >> >> A whole shelf can go offline and it would only appear as a single drive >> failure in a RAID group (if disks groups are distributed properly). >> >> You can then get compute nodes fairly densely packed by purchasing >> SuperMicro 2uTwin enclosures: >> http://www.supermicro.com/products/nfo/2UTwin2.cfm >> >> You can get 3 - 4 of those compute enclosure with dual SAS connectors (each >> enclosure not necessarily fully populated initially). The beauty is that the >> SAS interconnect is fast. Much faster than Ethernet. >> >> Please bear in mind that I am looking to create a highly available and >> scalable storage system that will fit in as small an area as possible and >> draw as little power as possible. The reasoning is that we co-locate all >> our equipment at remote data centers. Each rack (along with its associated >> power and any needed cross connects) represents a significant ongoing >> operational expense. Therefore, for me, density and incremental scalability >> are important. > > There are some pretty interesting solutions on the horizon from various vendors that achieve a pretty decent amount of density. Should be interesting times ahead. :) LSI/NetApp have a nice 4U solution with 60 NL-SAS drives and a SAS backplane, but it is always a balance between price and performance with elasticity: low- and mid-priced hardware versus midrange and enterprise solutions.
I think Ceph was created to be the cheaper solution: to give us a chance to use storage servers built from commodity hardware and fast 10Gb Ethernet, without an expensive SAN infrastructure behind them. That gives more scalability and the ability to scale out rather than scale up. Software like Ceph does the job that would otherwise fall to hardware solutions. > >> >>> And what is the benefit for having Ceph run on top of that? If you have all >> the disks available to all the nodes, why not run ZFS? >>> ZFS would give you better performance since what you are building would >> actually be a local filesystem. >> >> There is no high availability here. Yes... You can try to do old school >> magic with SAN file systems, complicated clustering, and synchronous >> replication, but a RAIN approach appeals to me. That is what I see in Ceph. >> Don't get me wrong... I love ZFS... but am trying to figure out a scalable >> HA solution that looks like RAIN. (Am I missing a feature of ZFS)? >> >>> For risk spreading you should not interconnect all the nodes. >> >> I do understand this. However, our operational setup will not allow >> multiple racks at the beginning. So... given the constraints of 1 rack >> (with dual power and dual WAN links), I do not see that a pair of cross >> connected SAS switches is any less reliable than a pair of cross connected >> ethernet switches... >> >> As storage scales and we outgrow the single rack at a location, we can >> overflow into a second rack etc. >> >>> The more complexity you add to the whole setup, the more likely it's to go >> down completely at some point in time. >>> >>> I'm just trying to understand why you would want to run a distributed >> filesystem on top of a bunch of direct attached disks. >> >> I guess I don't consider a SAN a bunch of direct attached disks. The SAS >> infrastructure is a SAN with SAS interconnects (versus fiber, iscsi or >> infiniband)... The disks are accessed via JBOD if desired... or you can put >> RAID on top of a group of them.
The multiple shelves of drives are a way to >> attempt to reduce the dependence on a single piece of hardware (i.e. it >> becomes RAIN). >> >>> Again, if all the disks are attached locally you'd be better of by using >> ZFS. >> >> This is not highly available, and AFAICT, the compute load would not scale >> with the storage. >> >>>> My goal is to be able to scale without having to draw the enormous >>>> power of lots of 1U devices or buy lots of disks and shelves each time >>>> I wasn't to add a little capacity. >>>> >>> >>> You can do that, scale by adding a 1U node with 2, 3 of 4 disks at the >> time, depending on your crushmap you might need to add 3 machines at a once. >> >> Adding three machines at once is what I was trying to avoid (I believe that >> I need 3 replicas to make things reasonably redundant). From first glance, >> it does not seem like a very dense solution to try to add a bunch of 1U >> servers with a few disks. The associated cost of a bunch of 1U Servers over >> JBOD, plus (and more importantly) the rack space and power draw, can cause >> OPEX problems. I can purchase multiple enclosures, but not fully populate >> them with disks/cpus. This gives me a redundant array of nodes (RAIN). >> Then. as needed, I can add drives or compute cards to the existing >> enclosures for little incremental cost. >> >> In your 3 1U server case above, I can add 12 disks to existing 4 enclosures >> (in groups of three) instead of three 1U servers with 4 disks each. I can >> then either run more OSDs on existing compute nodes or I can add one more >> compute node and it can handle the new drives with one or more OSDs. If I >> run out of space in enclosures, I can add one more shelf (just one) and >> start adding drives. I can then "include" the new drives into existing OSDs >> such that each existing OSD has a little more storage it needs to worry >> about. (The specifics of growing an existing OSD by adding a disk is still >> a little fuzzy to me). 
>> >>>> Anybody looked at atom processors? >>>> >>> >>> Yes, I have.. >>> >>> I'm running Atom D525 (SuperMicro X7SPA-HF) nodes with 4GB of RAM and 4 2TB >> disks and a 80GB SSD (old X25-M) for journaling. >>> >>> That works, but what I notice is that under heavy recover the Atoms can't >> cope with it. >>> >>> I'm thinking about building a couple of nodes with the AMD Brazos >> mainboard, somelike like an Asus E35M1-I. >>> >>> That is not a serverboard, but it would just be a reference to see what it >> does. >>> >>> One of the problems with the Atoms is the 4GB memory limitation, with the >> AMD Brazos you can use 8GB. >>> >>> I'm trying to figure out a way to have a really large amount of small nodes >> for a low price to have >>> a massive cluster where the impact of loosing one node is very small. >> >> Given that "massive" is a relative term, I am as well... but I'm also trying >> to reduce the footprint (power and space) of that "massive" cluster. I also >> want to start small (1/2 rack) and scale as needed. > > If you do end up testing Brazos processes, please post your results! I think it really depends on what kind of performance you are aiming for. Our stock 2U test boxes have 6-core opterons, and our SC847a has dual 6-core low power Xeon E5s. At 10GbE+ these are probably going to be pushed pretty hard, especially during recovery. Today I saw 500MB/s in a cluster with 10Gb Ethernet during recovery. Each machine has 12 Xeon E5600 cores and reached a system load of 50!!
* Re: Ideal hardware spec? 2012-08-24 15:05 ` Mark Nelson 2012-08-24 16:30 ` Sławomir Skowron @ 2012-08-24 18:12 ` Wido den Hollander 2012-08-24 18:23 ` Mark Nelson [not found] ` <00ae01cd823e$84e2ed20$8ea8c760$@netmass.com> 1 sibling, 2 replies; 22+ messages in thread From: Wido den Hollander @ 2012-08-24 18:12 UTC (permalink / raw) To: Mark Nelson; +Cc: ceph-devel On 08/24/2012 05:05 PM, Mark Nelson wrote: >>> >>> I'm running Atom D525 (SuperMicro X7SPA-HF) nodes with 4GB of RAM and >>> 4 2TB >> disks and a 80GB SSD (old X25-M) for journaling. >>> >>> That works, but what I notice is that under heavy recover the Atoms >>> can't >> cope with it. >>> >>> I'm thinking about building a couple of nodes with the AMD Brazos >> mainboard, somelike like an Asus E35M1-I. >>> >>> That is not a serverboard, but it would just be a reference to see >>> what it >> does. >>> >>> One of the problems with the Atoms is the 4GB memory limitation, with >>> the >> AMD Brazos you can use 8GB. >>> >>> I'm trying to figure out a way to have a really large amount of small >>> nodes >> for a low price to have >>> a massive cluster where the impact of loosing one node is very small. >> >> Given that "massive" is a relative term, I am as well... but I'm also >> trying >> to reduce the footprint (power and space) of that "massive" cluster. >> I also >> want to start small (1/2 rack) and scale as needed. > > If you do end up testing Brazos processes, please post your results! I > think it really depends on what kind of performance you are aiming for. > Our stock 2U test boxes have 6-core opterons, and our SC847a has dual > 6-core low power Xeon E5s. At 10GbE+ these are probably going to be > pushed pretty hard, especially during recovery. > I'm aiming for a Ceph cluster of a couple of hundred TB consisting of 5 or 6 racks full of 1U machines, each with 4x 1TB disks. That means about 200 of these nodes, none of them doing that much work.
If one fails I'd lose 0.5% of my cluster and recovery shouldn't be that hard. Assuming here that the node crashes due to hardware failure, not being plagued by some Ceph or BTRFS bug cluster-wide :) Wido ^ permalink raw reply [flat|nested] 22+ messages in thread
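The 0.5% figure follows directly from the node count. A minimal sketch of that arithmetic, assuming about 200 nodes with 4x 1TB each as described in this message. The 3x replication factor is an assumption (it is not stated in this message), and `cluster_summary` is just an illustrative name.

```python
def cluster_summary(nodes, disks_per_node, disk_tb, replicas):
    """Rough capacity and failure-impact figures for a cluster of identical nodes."""
    raw_tb = nodes * disks_per_node * disk_tb
    return {
        "raw_tb": raw_tb,
        # Usable space if every object is stored `replicas` times.
        "usable_tb": raw_tb / replicas,
        # Share of the cluster that disappears when a single node fails.
        "node_loss_pct": 100 / nodes,
        # Upper bound on data to re-replicate after one node failure
        # (reached only if the failed node was completely full).
        "max_recovery_tb": disks_per_node * disk_tb,
    }


# replicas=3 is an assumption, not a figure from this message.
summary = cluster_summary(nodes=200, disks_per_node=4, disk_tb=1.0, replicas=3)
for key, value in summary.items():
    print(f"{key}: {value:.1f}")
```

With these inputs one node is 0.5% of the cluster, raw capacity is 800 TB, and roughly 267 TB is usable at 3x replication, which is in line with "a couple of hundred TB".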
* Re: Ideal hardware spec? 2012-08-24 18:12 ` Wido den Hollander @ 2012-08-24 18:23 ` Mark Nelson 2012-08-27 18:05 ` Stephen Perkins [not found] ` <00ae01cd823e$84e2ed20$8ea8c760$@netmass.com> 1 sibling, 1 reply; 22+ messages in thread From: Mark Nelson @ 2012-08-24 18:23 UTC (permalink / raw) To: Wido den Hollander; +Cc: ceph-devel On 08/24/2012 01:12 PM, Wido den Hollander wrote: > > > On 08/24/2012 05:05 PM, Mark Nelson wrote: >>>> >>>> I'm running Atom D525 (SuperMicro X7SPA-HF) nodes with 4GB of RAM and >>>> 4 2TB >>> disks and a 80GB SSD (old X25-M) for journaling. >>>> >>>> That works, but what I notice is that under heavy recover the Atoms >>>> can't >>> cope with it. >>>> >>>> I'm thinking about building a couple of nodes with the AMD Brazos >>> mainboard, somelike like an Asus E35M1-I. >>>> >>>> That is not a serverboard, but it would just be a reference to see >>>> what it >>> does. >>>> >>>> One of the problems with the Atoms is the 4GB memory limitation, with >>>> the >>> AMD Brazos you can use 8GB. >>>> >>>> I'm trying to figure out a way to have a really large amount of small >>>> nodes >>> for a low price to have >>>> a massive cluster where the impact of loosing one node is very small. >>> >>> Given that "massive" is a relative term, I am as well... but I'm also >>> trying >>> to reduce the footprint (power and space) of that "massive" cluster. >>> I also >>> want to start small (1/2 rack) and scale as needed. >> >> If you do end up testing Brazos processes, please post your results! I >> think it really depends on what kind of performance you are aiming for. >> Our stock 2U test boxes have 6-core opterons, and our SC847a has dual >> 6-core low power Xeon E5s. At 10GbE+ these are probably going to be >> pushed pretty hard, especially during recovery. >> > > I'm aiming for a Ceph cluster of a couple of hundred TB consisting out > of 5 or 6 racks full of 1U machines with each 4x 1TB. > > Having about ~200 of these nodes all doing not that much work. 
> > If one fails I'd loose 0.5% of my cluster and recovery shouldn't be that > hard. Assuming here that the node crashes due to hardware failure, not > being plagued by some Ceph or BTRFS bug cluster-wide :) > > Wido Just based on past experience, I figure the most common causes of failure are going to be drive "failure", and controller failure. Your solution mitigates that by just going with tons of 1U nodes with few drives. I'm hoping we can also mitigate it by skipping expanders and doing no more than 8 drives per controller. It does mean you top out at like 40-48 drives per node max on most server boards. Mark ^ permalink raw reply [flat|nested] 22+ messages in thread
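The "no more than 8 drives per controller" rule translates into a controller count per chassis fairly directly. The sketch below is only that arithmetic. The slot counts of 5 and 6 are inferred from the 40-48 drive ceiling mentioned above, not taken from any particular board's specification, and the function names are illustrative.

```python
import math


def controllers_needed(drives, drives_per_controller=8):
    """Controllers required when each one serves at most `drives_per_controller` drives."""
    return math.ceil(drives / drives_per_controller)


def max_drives(controller_slots, drives_per_controller=8):
    """Drive ceiling for a board with the given number of usable controller slots."""
    return controller_slots * drives_per_controller


# 36 hot-swap bays (the SC847a mentioned earlier in the thread), no expanders.
print(controllers_needed(36))

# 5 or 6 usable slots (assumed) give the 40-48 drive range.
print(max_drives(5), max_drives(6))
```

The last controller in the 36-bay case would only be half populated (4 of 8 ports), which is part of the cost of skipping expanders.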
* RE: Ideal hardware spec? 2012-08-24 18:23 ` Mark Nelson @ 2012-08-27 18:05 ` Stephen Perkins 2012-08-27 22:33 ` Wido den Hollander 0 siblings, 1 reply; 22+ messages in thread From: Stephen Perkins @ 2012-08-27 18:05 UTC (permalink / raw) To: ceph-devel; +Cc: 'Mark Nelson' >>> Given that "massive" is a relative term, I am as well... but I'm >>> also trying to reduce the footprint (power and space) of that >>> "massive" cluster. >>> I also >>> want to start small (1/2 rack) and scale as needed. >> >> If you do end up testing Brazos processes, please post your results! >> I think it really depends on what kind of performance you are aiming for. >> Our stock 2U test boxes have 6-core opterons, and our SC847a has dual >> 6-core low power Xeon E5s. At 10GbE+ these are probably going to be >> pushed pretty hard, especially during recovery. >> > > I'm aiming for a Ceph cluster of a couple of hundred TB consisting out > of 5 or 6 racks full of 1U machines with each 4x 1TB. Thinking along the lines of the approach of many 1U by 4 drive host (as above) with no hardware RAID... what are the thoughts between SATAII (3G/s) vs SATAIII (6G/s) and on 1G Ethernet versus 10G Ethernet. - Steve P.S. I will be assuming a replication level of 3 copies and would probably be looking at 10 nodes or less initially. Maybe populating with 6 drives instead of 4 (if I can find the right chassis). ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Ideal hardware spec? 2012-08-27 18:05 ` Stephen Perkins @ 2012-08-27 22:33 ` Wido den Hollander 0 siblings, 0 replies; 22+ messages in thread From: Wido den Hollander @ 2012-08-27 22:33 UTC (permalink / raw) To: Stephen Perkins; +Cc: ceph-devel On 08/27/2012 08:05 PM, Stephen Perkins wrote: >>>> Given that "massive" is a relative term, I am as well... but I'm >>>> also trying to reduce the footprint (power and space) of that >>>> "massive" cluster. >>>> I also >>>> want to start small (1/2 rack) and scale as needed. >>> >>> If you do end up testing Brazos processes, please post your results! >>> I think it really depends on what kind of performance you are aiming for. >>> Our stock 2U test boxes have 6-core opterons, and our SC847a has dual >>> 6-core low power Xeon E5s. At 10GbE+ these are probably going to be >>> pushed pretty hard, especially during recovery. >>> >> >> I'm aiming for a Ceph cluster of a couple of hundred TB consisting out >> of 5 or 6 racks full of 1U machines with each 4x 1TB. > > Thinking along the lines of the approach of many 1U by 4 drive host (as > above) with no hardware RAID... what are the thoughts between SATAII (3G/s) > vs SATAIII (6G/s) and on 1G Ethernet versus 10G Ethernet. > While SATA3 offers more bandwidth you won't benefit that much with 7200RPM disks. Buffer writes might go a bit faster, but it won't be shocking. You will however notice the difference when using a SSD for journaling, since the new SSDs are able to utilize the SATA3 bandwidth much better. I think that 10G would be overkill for a node with just 4 OSDs running on 4 disks in total, but you might want to look at trunking 2 1Gb NIC's with LACP? > - Steve > > P.S. I will be assuming a replication level of 3 copies and would probably > be looking at 10 nodes or less initially. Maybe populating with 6 drives > instead of 4 (if I can find the right chassis). > I'd go with 3 as well. Going with 2 would cause you to limp whenever just one machine/disk fails. 
If you want to go for 6 drives in 1U you'd be looking at 2.5" drives. It's a bummer they are still so expensive when looking at price per GB. Wido ^ permalink raw reply [flat|nested] 22+ messages in thread
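The 1GbE versus 10GbE question above can be roughed out with a little arithmetic. The following is only a sketch: the per-disk figure of 100 MB/s is an assumed sequential rate for a 7200RPM drive, not a number from this thread, and protocol overhead is ignored.

```python
# Back-of-the-envelope comparison of aggregate disk and network bandwidth
# for one OSD node. All throughput figures are illustrative assumptions.

DISK_MB_S = 100      # assumed sequential throughput of one 7200RPM disk
GBIT_TO_MB_S = 125   # 1 Gbit/s is 125 MB/s, ignoring protocol overhead


def node_bottleneck(disks, link_gbit, links=1):
    """Return (disk_mb_s, net_mb_s, limiting_side) for one node."""
    disk_mb_s = disks * DISK_MB_S
    net_mb_s = links * link_gbit * GBIT_TO_MB_S
    limiting = "network" if net_mb_s < disk_mb_s else "disks"
    return disk_mb_s, net_mb_s, limiting


if __name__ == "__main__":
    options = [("1x 1GbE", 1, 1), ("2x 1GbE LACP", 1, 2), ("1x 10GbE", 10, 1)]
    for label, gbit, links in options:
        disk, net, limiting = node_bottleneck(4, gbit, links)
        print(f"{label:>13}: disks {disk} MB/s, network {net} MB/s, "
              f"limited by {limiting}")
```

Under these assumptions a 4-disk node is network-bound on one or two 1GbE links and disk-bound on 10GbE. Note that with LACP a single flow typically stays on one link, so the bonded figure is an aggregate across many peers rather than a per-connection rate.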
[parent not found: <00ae01cd823e$84e2ed20$8ea8c760$@netmass.com>]
* Re: Ideal hardware spec? [not found] ` <00ae01cd823e$84e2ed20$8ea8c760$@netmass.com> @ 2012-08-25 11:48 ` Wido den Hollander 0 siblings, 0 replies; 22+ messages in thread From: Wido den Hollander @ 2012-08-25 11:48 UTC (permalink / raw) To: Stephen Perkins; +Cc: ceph-devel@vger.kernel.org (CC back to the list) On 08/24/2012 11:22 PM, Stephen Perkins wrote: > Hi Wildo, > > Why 4 x 1TB? I get the 4 (many boards seem to have 4 sata connectors so > you don't need a separate controller). However... why not 2TB or 3TB > drives? Is recover time too large? > Yes, due to recovery time mainly. With 4x 1TB I'd lose about 3.2TB of data (85% full) at most, which is recoverable for the cluster. If I increased that to 2TB or 3TB disks, recovery would indeed get harder on the CPU and memory. I could have fewer nodes to get the same amount of storage, but in this situation I also get more IOps since I have more spindles running. > I'm guessing no RAID and one OSD process per disk? > Correct. RAID is expensive and the Ceph replication already provides the data redundancy here. > I'm still evaluating your "looking at things differently" to see about a > bunch of cheap 1Us. > > Would your 1Us have redundant power and be redundantly Ethernet connected? > Or... cheaper single power and single Ethernet (reduced cabling)? > > ECC memory? > No redundant power, no redundant Ethernet (or switches) and no ECC memory. I'm quoting here from the CRUSH publication Sage wrote [0]: "Data safety is of critical importance in large storage systems, where the large number of devices makes hardware failure the rule rather than the exception." (4.4 Reliability) I've been designing by that rule. I'm relying on CRUSH to do all the redundancy work for me. By strategically placing nodes on different power feeds and different switches I can mitigate hardware failure. You just have to make sure that your CRUSH map resembles the physical layout of your cluster. 
Make sure that two copies of your data never end up in the same rack or on the same switch. Wido [0]: http://ceph.newdream.net/papers/weil-crush-sc06.pdf > - Steve > > > > -----Original Message----- > From: ceph-devel-owner@vger.kernel.org > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Wido den Hollander > Sent: Friday, August 24, 2012 1:12 PM > To: Mark Nelson > Cc: ceph-devel@vger.kernel.org > Subject: Re: Ideal hardware spec? > > > > On 08/24/2012 05:05 PM, Mark Nelson wrote: >>>> >>>> I'm running Atom D525 (SuperMicro X7SPA-HF) nodes with 4GB of RAM >>>> and >>>> 4 2TB >>> disks and a 80GB SSD (old X25-M) for journaling. >>>> >>>> That works, but what I notice is that under heavy recover the Atoms >>>> can't >>> cope with it. >>>> >>>> I'm thinking about building a couple of nodes with the AMD Brazos >>> mainboard, somelike like an Asus E35M1-I. >>>> >>>> That is not a serverboard, but it would just be a reference to see >>>> what it >>> does. >>>> >>>> One of the problems with the Atoms is the 4GB memory limitation, >>>> with the >>> AMD Brazos you can use 8GB. >>>> >>>> I'm trying to figure out a way to have a really large amount of >>>> small nodes >>> for a low price to have >>>> a massive cluster where the impact of loosing one node is very small. >>> >>> Given that "massive" is a relative term, I am as well... but I'm also >>> trying to reduce the footprint (power and space) of that "massive" >>> cluster. >>> I also >>> want to start small (1/2 rack) and scale as needed. >> >> If you do end up testing Brazos processes, please post your results! >> I think it really depends on what kind of performance you are aiming for. >> Our stock 2U test boxes have 6-core opterons, and our SC847a has >> dual 6-core low power Xeon E5s. At 10GbE+ these are probably going to >> be pushed pretty hard, especially during recovery. >> > > I'm aiming for a Ceph cluster of a couple of hundred TB consisting out of 5 > or 6 racks full of 1U machines with each 4x 1TB. 
> > Having about ~200 of these nodes all doing not that much work. > > If one fails I'd loose 0.5% of my cluster and recovery shouldn't be that > hard. Assuming here that the node crashes due to hardware failure, not being > plagued by some Ceph or BTRFS bug cluster-wide :) > > Wido > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the > body of a message to majordomo@vger.kernel.org More majordomo info at > http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 22+ messages in thread
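The "many small nodes" reasoning in this message comes down to two numbers: the share of the cluster that one node represents, and how much data has to be re-replicated when it dies. A minimal sketch, assuming raw decimal capacities (which is why it lands a little above the "about 3.2TB" quoted earlier in the thread); the function name is made up for illustration:

```python
def node_failure_impact(nodes, disks_per_node, disk_tb, fill_ratio):
    """Return (fraction_of_cluster_lost, tb_to_re_replicate) for one node failure.

    Assumes identical nodes and evenly distributed data.
    """
    fraction = 1 / nodes
    data_tb = disks_per_node * disk_tb * fill_ratio
    return fraction, data_tb


if __name__ == "__main__":
    fraction, data_tb = node_failure_impact(
        nodes=200, disks_per_node=4, disk_tb=1.0, fill_ratio=0.85
    )
    print(f"{fraction:.1%} of the cluster, about {data_tb:.1f} TB to re-replicate")
```

Swapping in 2TB or 3TB disks while halving or thirding the node count keeps total capacity constant but raises both numbers, which is the trade-off discussed above.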
* Re: Ideal hardware spec? 2012-08-24 14:17 ` Stephen Perkins 2012-08-24 14:41 ` Joe Landman 2012-08-24 15:05 ` Mark Nelson @ 2012-08-24 16:12 ` Tommi Virtanen 2012-08-24 18:09 ` Wido den Hollander 3 siblings, 0 replies; 22+ messages in thread From: Tommi Virtanen @ 2012-08-24 16:12 UTC (permalink / raw) To: Stephen Perkins; +Cc: Wido den Hollander, ceph-devel On Fri, Aug 24, 2012 at 7:17 AM, Stephen Perkins <perkins@netmass.com> wrote: > Adding three machines at once is what I was trying to avoid (I believe that > I need 3 replicas to make things reasonably redundant). From first glance, You need 3 machines to *start with*, to have 3 truly independent replicas. After that point, there's nothing preventing you from growing one machine -- or one disk -- at a time. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Ideal hardware spec? 2012-08-24 14:17 ` Stephen Perkins ` (2 preceding siblings ...) 2012-08-24 16:12 ` Tommi Virtanen @ 2012-08-24 18:09 ` Wido den Hollander 3 siblings, 0 replies; 22+ messages in thread From: Wido den Hollander @ 2012-08-24 18:09 UTC (permalink / raw) To: Stephen Perkins; +Cc: ceph-devel On 08/24/2012 04:17 PM, Stephen Perkins wrote: > >> Your SPOF would still be your whole SAS setup. > > Well... I'm not sure I would consider it a single point of failure... a > pair of cross-connected switches and 3-5 disk shelves. Shelves can be > purchased with fully redundant internals (dual data paths etc to SAS > drives). That is not even that important. If each shelf is just looked at > as JBOD, then you can group disks from different shelves into btrfs or > hardware RAID groups. Or... you can look at each disk as its own storage > with its own OSD. > > A SAS switch going offline would have no impact since everything is cross > connected. > > A whole shelf can go offline and it would only appear as a single drive > failure in a RAID group (if disks groups are distributed properly). > I'm not against your idea and I get the reasoning; however, in my opinion a distributed filesystem should not have SAS-based interconnects between OSD nodes. There are many roads to Rome, I know, but I'm just trying to view this from another perspective. > You can then get compute nodes fairly densely packed by purchasing > SuperMicro 2uTwin enclosures: > http://www.supermicro.com/products/nfo/2UTwin2.cfm > > You can get 3 - 4 of those compute enclosure with dual SAS connectors (each > enclosure not necessarily fully populated initially). The beauty is that the > SAS interconnect is fast. Much faster than Ethernet. Yes, SAS is faster than Ethernet, but all the replication traffic between OSDs will still go over Ethernet. The OSD in its turn will write the data over SAS. 
I'd actually think your SAS bus (although it is beefy) could become a bottleneck at some point. > > Please bear in mind that I am looking to create a highly available and > scalable storage system that will fit in as small an area as possible and > draw as little power as possible. The reasoning is that we co-locate all > our equipment at remote data centers. Each rack (along with its associated > power and any needed cross connects) represents a significant ongoing > operational expense. Therefore, for me, density and incremental scalability > are important. > Got ya. Operational costs in datacenters are getting higher and higher; sometimes it's worth investing more upfront so you can save operationally. > > There is no high availability here. Yes... You can try to do old school > magic with SAN file systems, complicated clustering, and synchronous > replication, but a RAIN approach appeals to me. That is what I see in Ceph. > Don't get me wrong... I love ZFS... but am trying to figure out a scalable > HA solution that looks like RAIN. (Am I missing a feature of ZFS)? > I'm managing a couple of 50TB ZFS systems with Nexenta. The two nodes have 96GB of RAM each and all the disks are in LSI 630J JBODs with LSI SAS switches; this way both nodes have access to the disks and thus the ZFS pool. Expansion can be done by adding extra disks or creating a second pool and running that pool on a different node. Since you are staying inside one rack I don't think you'll be doing that much IOps. A decent ZFS system can do 100k IOps without any issues; I don't think you'll do that with Ceph very soon in one rack (assuming your clients are in the same rack). Don't get me wrong, I'm not trying to scare you away from Ceph, just trying to view it from a different perspective. >> For risk spreading you should not interconnect all the nodes. > > I do understand this. However, our operational setup will not allow > multiple racks at the beginning. So... 
given the constraints of 1 rack > (with dual power and dual WAN links), I do not see that a pair of cross > connected SAS switches is any less reliable than a pair of cross connected > ethernet switches... > The problem with interconnected SAS switches is that IF something goes wrong your filesystem loses its connection to the disk, risking valuable data which could still be in transit from buffers. The risk would be that all the OSDs lose access to their disks at once. Yes, it is redundant, but you wouldn't be the first to suffer from a firmware glitch somewhere. By physically keeping this separated you don't have the risk of all OSDs losing disk access at once. > As storage scales and we outgrow the single rack at a location, we can > overflow into a second rack etc. > True, that is something that you won't do with a ZFS setup that fast. The question you have to ask yourself: Do you want all your data on one system? Do you want to bet everything on one horse? Wido ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Ideal hardware spec? 2012-08-22 14:17 ` Wido den Hollander 2012-08-22 14:39 ` Stephen Perkins @ 2012-08-22 15:46 ` Jonathan Proulx 2012-08-23 9:59 ` Wido den Hollander 1 sibling, 1 reply; 22+ messages in thread From: Jonathan Proulx @ 2012-08-22 15:46 UTC (permalink / raw) To: Wido den Hollander; +Cc: ceph-devel On Wed, Aug 22, 2012 at 04:17:23PM +0200, Wido den Hollander wrote: :On 08/22/2012 03:55 PM, Jonathan Proulx wrote: :You can also use the USB sticks[0] from Stec, they have servergrade :onboard USB sticks for these kind of applications. Those look quite interesting. :A couple of questions still need to be answered though: :* Which OS are you planning on using? Ubuntu 12.04 is recommended Ubuntu 12.04 is our current preferred OS :* Which filesystem do you want to use underneath the OSDs? Whatever I can get to work best in testing :) Since this is for a research platform not a product I'd likely start with BTRFS and see if it is "stable enough" and "performant enough" with fall back to XFS if needed -Jon :Wido : :[0]: http://www.stec-inc.com/product/ufm.php ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Ideal hardware spec? 2012-08-22 15:46 ` Jonathan Proulx @ 2012-08-23 9:59 ` Wido den Hollander [not found] ` <CABYiri_-73UyTKHcHWDZdjqb=rozjraVzxd166NZV2ir53tduA@mail.gmail.com> 0 siblings, 1 reply; 22+ messages in thread From: Wido den Hollander @ 2012-08-23 9:59 UTC (permalink / raw) To: Jonathan Proulx; +Cc: ceph-devel On 08/22/2012 05:46 PM, Jonathan Proulx wrote: > On Wed, Aug 22, 2012 at 04:17:23PM +0200, Wido den Hollander wrote: > > :On 08/22/2012 03:55 PM, Jonathan Proulx wrote: > > :You can also use the USB sticks[0] from Stec, they have servergrade > :onboard USB sticks for these kind of applications. > > Those look quite interesting. > They should be much more reliable than regular USB sticks due to their SLC memory. You could also take a look at these: http://www.transcend-info.com/industry/products_details.asp?CatNo=2&SerNo=14&ModNo=28&Func1No=1 > :A couple of questions still need to be answered though: > :* Which OS are you planning on using? Ubuntu 12.04 is recommended > > Ubuntu 12.04 is our current preferred OS > That should work fine. > :* Which filesystem do you want to use underneath the OSDs? > > Whatever I can get to work best in testing :) > > Since this is for a research platform not a product I'd likely start with > BTRFS and see if it is "stable enough" and "performant enough" with > fall back to XFS if needed > BTRFS is indeed the best in terms of features. I'd recommend using a recent kernel like 3.5. Wido > -Jon > > :Wido > : > :[0]: http://www.stec-inc.com/product/ufm.php > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 22+ messages in thread
[parent not found: <CABYiri_-73UyTKHcHWDZdjqb=rozjraVzxd166NZV2ir53tduA@mail.gmail.com>]
* Re: Ideal hardware spec? [not found] ` <CABYiri_-73UyTKHcHWDZdjqb=rozjraVzxd166NZV2ir53tduA@mail.gmail.com> @ 2012-08-26 11:15 ` Wido den Hollander 2012-08-26 13:29 ` Mark Nelson 0 siblings, 1 reply; 22+ messages in thread From: Wido den Hollander @ 2012-08-26 11:15 UTC (permalink / raw) To: Andrey Korolyov; +Cc: ceph-devel@vger.kernel.org CC'ing this one back to the list. On 08/25/2012 09:58 PM, Andrey Korolyov wrote: >> >> They should be much more reliable than regular USB sticks due to their SLC >> memory. >> >> You could also take a look at these: >> http://www.transcend-info.com/industry/products_details.asp?CatNo=2&SerNo=14&ModNo=28&Func1No=1 >> >> > > Did you tried yet those or simular sticks for CEPH journal? Right now > I am using Intel 313`s, which is very fast and have durability/price > ratio a far higher than any imaginable MLC, but they occupying one HDD > slot which is a quite impractical. > No, I haven't tried, but I think it won't work. These kinds of SLC chips don't do random writes that well; you'll probably get something like 4MB/sec in random writes. Bigger SSDs have more cells to spread the writes over; those small sticks don't. The Intel 3XX or 5XX series should work just fine for journaling; I'd however recommend you change the Host Protected Area to ~50% of the available capacity to prevent write degradation over time. Wido ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Ideal hardware spec? 2012-08-26 11:15 ` Wido den Hollander @ 2012-08-26 13:29 ` Mark Nelson 0 siblings, 0 replies; 22+ messages in thread From: Mark Nelson @ 2012-08-26 13:29 UTC (permalink / raw) To: Wido den Hollander; +Cc: Andrey Korolyov, ceph-devel@vger.kernel.org On 08/26/2012 06:15 AM, Wido den Hollander wrote: > CC'ing this one back to the list. > > On 08/25/2012 09:58 PM, Andrey Korolyov wrote: >>> >>> They should be much more reliable than regular USB sticks due to >>> their SLC >>> memory. >>> >>> You could also take a look at these: >>> http://www.transcend-info.com/industry/products_details.asp?CatNo=2&SerNo=14&ModNo=28&Func1No=1 >>> >>> >>> >> >> Did you tried yet those or simular sticks for CEPH journal? Right now >> I am using Intel 313`s, which is very fast and have durability/price >> ratio a far higher than any imaginable MLC, but they occupying one HDD >> slot which is a quite impractical. >> > > No, I haven't tried, but I think it won't work. > > These kind of SLC chips don't do random writes that great, you'll > probably get something like 4MB/sec in random writes. > > Bigger SSDs have more cells to spread the writes over, those small > sticks don't. > > The Intel 3XX or 5XX serie should work just fine for journaling, I'd > however recommend you change the Host Protected Area to ~50% of the > available capacity to prevent write-degradation over time. > > Wido > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html Not just write degradation, but undersubscribing the SSDs should hopefully help them last a little longer under such a heavy write workload. We are doing 3 10GB journals per 180GB Intel 520 on our supermicro test node. Mark ^ permalink raw reply [flat|nested] 22+ messages in thread
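To put a number on "undersubscribing", here is a small sketch using the figures Mark gives (three 10GB journals on a 180GB SSD). The helper name is hypothetical, and the sketch only does the arithmetic; it does not configure a Host Protected Area or partition anything.

```python
def ssd_journal_usage(ssd_gb, journals, journal_gb):
    """Return (used_gb, spare_gb, used_fraction) for journals carved out of one SSD."""
    used_gb = journals * journal_gb
    if used_gb > ssd_gb:
        raise ValueError("journals do not fit on the SSD")
    return used_gb, ssd_gb - used_gb, used_gb / ssd_gb


if __name__ == "__main__":
    used, spare, fraction = ssd_journal_usage(ssd_gb=180, journals=3, journal_gb=10)
    print(f"{used} GB used, {spare} GB left unallocated ({fraction:.0%} subscribed)")
```

With those figures only about a sixth of the SSD is ever written to by the journals, leaving the rest as spare area for the controller's wear levelling.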
* Re: Ideal hardware spec? 2012-08-22 13:55 Ideal hardware spec? Jonathan Proulx 2012-08-22 14:17 ` Wido den Hollander @ 2012-08-22 14:41 ` Mark Nelson 2012-08-28 0:02 ` Curtis C. 1 sibling, 1 reply; 22+ messages in thread From: Mark Nelson @ 2012-08-22 14:41 UTC (permalink / raw) To: Jonathan Proulx; +Cc: ceph-devel On 08/22/2012 08:55 AM, Jonathan Proulx wrote: > Hi All, Hi Jonathon! > > Yes I'm asking the impossible question, what is the "best" hardware > confing. That is the impossible question. :) > > I'm looking at (possibly) using ceph as backing store for images and > volumes on OpenStack as well as exposing at least the object store for > direct use. > > The openstack cluster exists and is currently in the early stages of > use by researchers here, approx 1500 vCPU (counts hyperthreads > actually 768 physical cores) and 3T or RAM across 64 physical nodes. > > On the object store side it would be a new resource for usand hard to > say what people would do with it except that it would be many > different things and the use profile would be constantly changing > (which is true of all our existing storage). > > In this sense, even though it's a "private cloud" the somewhat > unpredictable useage profile gives it some charateristics of a small > public cloud. > > Size wise I'm hoping to start out with 3 monitors and 5(+) OSD nodes > to end up with a 20-30T 3x replicated storage (call me paranoid). > > So the monitor specs seem relatively easy to come up with. For the > OSDs it looks like > http://ceph.com/docs/master/install/hardware-recommendations suggests > 1 drive, 1 core and 2G RAM per OSD (with multiple OSDs per storage > node). On list discussions seem to frequently include an SSD for > journaling (which is similar to what we do for our current ZFS back > NFS storage). > > I'm hoping to wrap the hardware in a grant and willing to experiment a > bit with different software configurations to tune it up when/if I get > the hardware in. 
So my imediate concern is a hardware spec that will > ahve a reasonable processor:memory:disk ratio and opinions (or better > data) on the utility of SSD. Before I joined up with Inktank, I was prototyping a private openstack cloud for HPC applications at a supercomputing site. We similarly were pursuing grant funding. I know how it goes! > > First is the documented core to disk ratio still current best > practice? Given a platform with more drive slots could 8 cores handle > more disk? would that need/like more memory? The big thing is the CPU and memory needed during recovery. During standard operation you shouldn't be pushing the CPU too hard unless you are really pushing data through fast and have many drives per node, or have severely underspecced the CPU. Given that you are only shooting for around 90TB of space across 5+ osd nodes, you should be able to get away with 2U boxes holding 12 2TB+ drives each. That's probably the closest thing we have right now to a "standard" configuration. We use a single 6-core 2.8GHz AMD Opteron chip in each node with 16GB of memory. It might be worth bumping that up to 24-32GB of memory for very large deployments with lots of OSDs. In terms of controllers, we are using Dell H700 cards which are similar to LSI 9260s, but I think there is a good chance that it may actually be better to use H200s (i.e. LSI 9211-8i or similar) with the IT/JBOD mode firmware. That's one of the commonly used cards in ZFS builds too and has a pretty good reputation. I've actually got a Supermicro SC847a chassis and a whole bunch of various SATA/SAS/RAID controllers I'm testing now in different configurations. Hopefully I should have some data soon. For now, our best tested configuration is with 12 drive nodes. Smaller 1U nodes may be an option as well, but not very dense. > > Have SSD been shown to speed performance with this architecture? Yes, but in different ways depending on how you use them. 
SSDs for data storage tend to help mitigate some of the seek behavior issues we've seen on the filestore. This isn't really a reasonable solution for a lot of people though. In terms of the journal, the biggest benefit that SSDs provide is high throughput, so you can load multiple journals onto 1 SSD and cram more OSDs into one box. Depending on how much you trust your SSDs, you could try either a 10 disk + 2 SSD or a 9 disk + SSD configuration. Keep in mind that this will be writing a lot of data to the SSDs, so you should try to undersubscribe them to lengthen the lifespan. For testing I'm doing 3 journals per 180GB Intel 520 SSD. > > If so given the 8 drive slot example with seven OSDs presented in the > docs what is the liklihood that using a high performance SSD for the > OS image and also cutting journal/log partitions out of it for the > remaining 7 2-3T near line SAS drives? Just keep in mind that in this case your total throughput will likely be limited by the SSD unless you get a very fast one (or are using 1GbE or have some other bottleneck). > > Thanks, > -Jon > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 22+ messages in thread
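The sizing in this reply can be sketched as follows: 20-30T usable at 3x replication is where the "around 90TB" raw figure comes from, and a journal SSD shared by all OSDs in a box caps that box's write rate. The throughput values below (100 MB/s per disk, 400 MB/s for the SSD) are assumptions for illustration only, not measurements from the thread.

```python
import math


def raw_capacity_tb(usable_tb, replicas):
    """Raw disk space needed to store usable_tb with the given replica count."""
    return usable_tb * replicas


def nodes_needed(raw_tb, drives_per_node, drive_tb):
    """Smallest number of identical nodes that provides raw_tb of disk."""
    return math.ceil(raw_tb / (drives_per_node * drive_tb))


def node_write_limit(osds, disk_mb_s, ssd_mb_s):
    """Writes hit the journal first, so one shared journal SSD caps the node."""
    return min(osds * disk_mb_s, ssd_mb_s)


if __name__ == "__main__":
    raw = raw_capacity_tb(usable_tb=30, replicas=3)
    print(f"raw capacity: {raw} TB")
    print(f"12-drive 2TB nodes needed: {nodes_needed(raw, 12, 2)}")
    # Seven OSDs journaling to one SSD, using the assumed rates above.
    print(f"node write limit: {node_write_limit(7, 100, 400)} MB/s")
```

With these assumed rates the seven disks could absorb more than the single SSD can deliver, which is the bottleneck Mark warns about for the one-SSD layout.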
* Re: Ideal hardware spec? 2012-08-22 14:41 ` Mark Nelson @ 2012-08-28 0:02 ` Curtis C. 2012-08-28 1:18 ` Mark Nelson 0 siblings, 1 reply; 22+ messages in thread From: Curtis C. @ 2012-08-28 0:02 UTC (permalink / raw) To: Mark Nelson; +Cc: Jonathan Proulx, ceph-devel On Wed, Aug 22, 2012 at 8:41 AM, Mark Nelson <mark.nelson@inktank.com> wrote: > On 08/22/2012 08:55 AM, Jonathan Proulx wrote: >> >> Hi All, > > > Hi Jonathon! > > >> >> Yes I'm asking the impossible question, what is the "best" hardware >> confing. > > > That is the impossible question. :) > > >> >> I'm looking at (possibly) using ceph as backing store for images and >> volumes on OpenStack as well as exposing at least the object store for >> direct use. >> >> The openstack cluster exists and is currently in the early stages of >> use by researchers here, approx 1500 vCPU (counts hyperthreads >> actually 768 physical cores) and 3T or RAM across 64 physical nodes. >> >> On the object store side it would be a new resource for usand hard to >> say what people would do with it except that it would be many >> different things and the use profile would be constantly changing >> (which is true of all our existing storage). >> >> In this sense, even though it's a "private cloud" the somewhat >> unpredictable useage profile gives it some charateristics of a small >> public cloud. >> >> Size wise I'm hoping to start out with 3 monitors and 5(+) OSD nodes >> to end up with a 20-30T 3x replicated storage (call me paranoid). >> >> So the monitor specs seem relatively easy to come up with. For the >> OSDs it looks like >> http://ceph.com/docs/master/install/hardware-recommendations suggests >> 1 drive, 1 core and 2G RAM per OSD (with multiple OSDs per storage >> node). On list discussions seem to frequently include an SSD for >> journaling (which is similar to what we do for our current ZFS back >> NFS storage). 
>> >> I'm hoping to wrap the hardware in a grant and willing to experiment a >> bit with different software configurations to tune it up when/if I get >> the hardware in. So my imediate concern is a hardware spec that will >> ahve a reasonable processor:memory:disk ratio and opinions (or better >> data) on the utility of SSD. > > > Before I joined up with Inktank, I was prototyping a private openstack cloud > for HPC applications at a supercomputing site. We similarly were pursuing > grant funding. I know how it goes! > > >> >> First is the documented core to disk ratio still current best >> practice? Given a platform with more drive slots could 8 cores handle >> more disk? would that need/like more memory? > > > The big thing is the CPU and memory needed during recovery. During standard > operation you shouldn't be pushing the CPU too hard unless you are really > pushing data through fast and have many drives per node, or have severely > underspecced the CPU. > > Given that you are only shooting for around 90TB of space across 5+ osd > nodes, you should be able to get away with 12 2TB+ drive 2U boxes. That's > probably the closest thing we have right now to a "standard" configuration. > We use a single 6-core 2.8GHz AMD operation chip in each node with 16GB of > memory. It might be worth bumping that up to 24-32GB of memory for very > large deployments with lots of OSDs. > > In terms of controller we are using Dell H700 cards which are similar to LSI > 9260s, but I think there is a good chance that it may actually be better to > use H200s (ie LSI 9211-8i or similar) with the IT/JBOD mode firmware. > That's one of the commonly used cards in ZFS builds too and has a pretty > good reputation. > > I've actually got a supermicro SC847a chassis and a whole bunch of various > SATA/SAS/RAID controllers I'm testing now in different configurations. > Hopefully I should have some data soon. For now, our best tested > configuration is with 12 drive nodes. 
Smaller 1U nodes may be an option as > well, but not very dense. > I've worked a bit with a Supermicro 36 drive bay chassis, though I've since moved on from the organization we had them in place at. I quite liked them. Wrote a bit of a blog post about them too (http://serverascode.com/2012/06/07/36-hot-swappable-day-supermicro-chassis.html) so I'm excited to see Inktank trying them out. The place I currently work at is a big OpenStack user and thinking about Ceph, but is not, as of yet, interested in a chassis like the Supermicro, so please post about your findings. :) Thanks, Curtis. > >> >> Have SSD been shown to speed performance with this architecture? > > > Yes, but in different ways depending on how you use them. SSDs for data > storage tend to help mitigate some of the seek behavior issues we've seen on > the filestore. This isn't really a reasonable solution for a lot of people > though. > > In terms of the journal, the biggest benefit that SSDs provide is high > throughput, so you can load multiple journals onto 1 SSD and cram more OSDs > into one box. Depending on how much you trust your SSDs, you could try > either a 10 disk + 2 SSD or a 9 disk + SSD configuration. Keep in mind that > this will be writing a lot of data to the SSDs, so you should try to > undersubscribe them to lengthen the lifespan. For testing I'm doing 3 > journals per 180GB Intel 520 SSD. > > >> >> If so given the 8 drive slot example with seven OSDs presented in the >> docs what is the liklihood that using a high performance SSD for the >> OS image and also cutting journal/log partitions out of it for the >> remaining 7 2-3T near line SAS drives? > > > Just keep in mind that in this case you're total throughput will likely be > limited by the SSD unless you get a very fast one (or are using 1GbE or have > some other bottleneck). 
> > >> >> Thanks, >> -Jon >> -- >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > > > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Ideal hardware spec?
  2012-08-28 0:02 ` Curtis C.
@ 2012-08-28 1:18 ` Mark Nelson
  0 siblings, 0 replies; 22+ messages in thread
From: Mark Nelson @ 2012-08-28 1:18 UTC (permalink / raw)
  To: Curtis C.; +Cc: Jonathan Proulx, ceph-devel

On 08/27/2012 07:02 PM, Curtis C. wrote:
> On Wed, Aug 22, 2012 at 8:41 AM, Mark Nelson<mark.nelson@inktank.com> wrote:
>> On 08/22/2012 08:55 AM, Jonathan Proulx wrote:
>>>
>>> Hi All,
>>
>> Hi Jonathan!
>>
>>>
>>> Yes I'm asking the impossible question, what is the "best" hardware
>>> config.
>>
>> That is the impossible question. :)
>>
>>>
>>> I'm looking at (possibly) using ceph as backing store for images and
>>> volumes on OpenStack as well as exposing at least the object store for
>>> direct use.
>>>
>>> The openstack cluster exists and is currently in the early stages of
>>> use by researchers here, approx 1500 vCPU (counts hyperthreads
>>> actually 768 physical cores) and 3T of RAM across 64 physical nodes.
>>>
>>> On the object store side it would be a new resource for us and hard to
>>> say what people would do with it except that it would be many
>>> different things and the use profile would be constantly changing
>>> (which is true of all our existing storage).
>>>
>>> In this sense, even though it's a "private cloud" the somewhat
>>> unpredictable usage profile gives it some characteristics of a small
>>> public cloud.
>>>
>>> Size wise I'm hoping to start out with 3 monitors and 5(+) OSD nodes
>>> to end up with a 20-30T 3x replicated storage (call me paranoid).
>>>
>>> So the monitor specs seem relatively easy to come up with. For the
>>> OSDs it looks like
>>> http://ceph.com/docs/master/install/hardware-recommendations suggests
>>> 1 drive, 1 core and 2G RAM per OSD (with multiple OSDs per storage
>>> node). On list discussions seem to frequently include an SSD for
>>> journaling (which is similar to what we do for our current ZFS backed
>>> NFS storage).
>>>
>>> I'm hoping to wrap the hardware in a grant and willing to experiment a
>>> bit with different software configurations to tune it up when/if I get
>>> the hardware in. So my immediate concern is a hardware spec that will
>>> have a reasonable processor:memory:disk ratio and opinions (or better
>>> data) on the utility of SSD.
>>
>> Before I joined up with Inktank, I was prototyping a private openstack cloud
>> for HPC applications at a supercomputing site. We similarly were pursuing
>> grant funding. I know how it goes!
>>
>>>
>>> First is the documented core to disk ratio still current best
>>> practice? Given a platform with more drive slots could 8 cores handle
>>> more disk? would that need/like more memory?
>>
>> The big thing is the CPU and memory needed during recovery. During standard
>> operation you shouldn't be pushing the CPU too hard unless you are really
>> pushing data through fast and have many drives per node, or have severely
>> underspecced the CPU.
>>
>> Given that you are only shooting for around 90TB of space across 5+ osd
>> nodes, you should be able to get away with 12 2TB+ drive 2U boxes. That's
>> probably the closest thing we have right now to a "standard" configuration.
>> We use a single 6-core 2.8GHz AMD Opteron chip in each node with 16GB of
>> memory. It might be worth bumping that up to 24-32GB of memory for very
>> large deployments with lots of OSDs.
>>
>> In terms of controller we are using Dell H700 cards which are similar to LSI
>> 9260s, but I think there is a good chance that it may actually be better to
>> use H200s (ie LSI 9211-8i or similar) with the IT/JBOD mode firmware.
>> That's one of the commonly used cards in ZFS builds too and has a pretty
>> good reputation.
>>
>> I've actually got a supermicro SC847a chassis and a whole bunch of various
>> SATA/SAS/RAID controllers I'm testing now in different configurations.
>> Hopefully I should have some data soon. For now, our best tested
>> configuration is with 12 drive nodes. Smaller 1U nodes may be an option as
>> well, but not very dense.
>>
>
> I've worked a bit with a Supermicro 36 drive bay chassis, though I've
> since moved on from the organization we had them in place at. I quite
> liked them. Wrote a bit of a blog post about them too
> (http://serverascode.com/2012/06/07/36-hot-swappable-day-supermicro-chassis.html)
> so I'm excited to see Inktank trying them out.
>

I really like this chassis. It's one of the nicer ones that I've worked
with. The drives in the back could be a deal breaker for some, but I
think it's a decent trade-off for what you get.

> The place I currently work at is a big OpenStack user and thinking
> about Ceph, but is not, as of yet, interested in a chassis like the
> Supermicro, so please post about your findings. :)
>
> Thanks,
> Curtis.
>

So far I've only been doing single controller tests with an onboard LSI
SAS2208 and an external SAS2008 card (9211-8i). The SAS2008 is actually
slightly faster. With 6 7200rpm SATA drives and 2 Intel 520 SSDs for
journals I can do nearly 600MB/s with 1x replication and 4MB requests
via rados bench. I've got a couple of other cards to test (an Areca
1680, LSI SAS2308, and a Marvell-based HighPoint RocketRAID card).
After that I'll start in on multiple controllers and more drives. I
also got the bracket I needed in for my 1U client node so I should be
able to start in on 2x bonded 10GbE tests. Hopefully I can convince the
powers that be to let me fill out the SC847a chassis and maybe buy
another one if the tests look good. ;)

>>
>>>
>>> Have SSD been shown to speed performance with this architecture?
>>
>> Yes, but in different ways depending on how you use them. SSDs for data
>> storage tend to help mitigate some of the seek behavior issues we've seen on
>> the filestore. This isn't really a reasonable solution for a lot of people
>> though.
>>
>> In terms of the journal, the biggest benefit that SSDs provide is high
>> throughput, so you can load multiple journals onto 1 SSD and cram more OSDs
>> into one box. Depending on how much you trust your SSDs, you could try
>> either a 10 disk + 2 SSD or a 9 disk + SSD configuration. Keep in mind that
>> this will be writing a lot of data to the SSDs, so you should try to
>> undersubscribe them to lengthen the lifespan. For testing I'm doing 3
>> journals per 180GB Intel 520 SSD.
>>
>>>
>>> If so given the 8 drive slot example with seven OSDs presented in the
>>> docs what is the likelihood that using a high performance SSD for the
>>> OS image and also cutting journal/log partitions out of it for the
>>> remaining 7 2-3T near line SAS drives?
>>
>> Just keep in mind that in this case your total throughput will likely be
>> limited by the SSD unless you get a very fast one (or are using 1GbE or have
>> some other bottleneck).
>>
>>>
>>> Thanks,
>>> -Jon

Thanks,
Mark

^ permalink raw reply [flat|nested] 22+ messages in thread
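The sizing arithmetic discussed in this message can be sketched in a few lines of Python. This is only a rough illustration, not a Ceph tool: `NodeLayout` and the function names are hypothetical, and the per-disk, per-SSD and network bandwidth figures used in the example are placeholder assumptions rather than measurements from the thread. It also assumes that every write passes through the journal SSDs before reaching the data disks, so the slowest of the three paths sets the ceiling.

```python
import math
from dataclasses import dataclass


@dataclass
class NodeLayout:
    """Hypothetical description of one OSD node (one OSD per data disk)."""
    data_disks: int    # number of data disks, i.e. OSDs, in the node
    disk_tb: float     # capacity of each data disk in TB
    journal_ssds: int  # SSDs holding the journals (must be >= 1 here)


def raw_tb_needed(usable_tb: float, replicas: int) -> float:
    """Raw capacity required to hold usable_tb with N-way replication."""
    return usable_tb * replicas


def nodes_needed(usable_tb: float, replicas: int, layout: NodeLayout) -> int:
    """Smallest node count whose data disks cover the raw capacity."""
    per_node_tb = layout.data_disks * layout.disk_tb
    return math.ceil(raw_tb_needed(usable_tb, replicas) / per_node_tb)


def journals_per_ssd(layout: NodeLayout) -> float:
    """How many OSD journals each SSD carries (lower = less wear per SSD)."""
    return layout.data_disks / layout.journal_ssds


def node_write_ceiling_mb_s(layout: NodeLayout, disk_mb_s: float,
                            ssd_mb_s: float, net_mb_s: float) -> float:
    """Upper bound on node write throughput: the slowest of the aggregate
    data disks, the aggregate journal SSDs, and the network link."""
    return min(layout.data_disks * disk_mb_s,
               layout.journal_ssds * ssd_mb_s,
               net_mb_s)


if __name__ == "__main__":
    # 30T usable at 3x replication on 10 disk + 2 SSD nodes with 2TB disks.
    twelve_bay = NodeLayout(data_disks=10, disk_tb=2.0, journal_ssds=2)
    print(raw_tb_needed(30, 3), "TB raw")
    print(nodes_needed(30, 3, twelve_bay), "nodes")
    print(journals_per_ssd(twelve_bay), "journals per SSD")

    # 8-slot example: 7 data disks sharing one SSD for OS and journals.
    # Bandwidth numbers below are assumed placeholders, in MB/s.
    eight_bay = NodeLayout(data_disks=7, disk_tb=2.0, journal_ssds=1)
    print(node_write_ceiling_mb_s(eight_bay, disk_mb_s=100,
                                  ssd_mb_s=250, net_mb_s=1250), "MB/s")
```

With those placeholder figures the single shared SSD is the limit in the 8-slot layout, and dropping `net_mb_s` to roughly 125 (about 1GbE) makes the network the limit instead, which is the trade-off described above. Real numbers should come from benchmarking the actual drives and controllers.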
Thread overview: 22+ messages
2012-08-22 13:55 Ideal hardware spec? Jonathan Proulx
2012-08-22 14:17 ` Wido den Hollander
2012-08-22 14:39 ` Stephen Perkins
2012-08-23 8:24 ` Wido den Hollander
2012-08-24 14:17 ` Stephen Perkins
2012-08-24 14:41 ` Joe Landman
2012-08-24 15:05 ` Mark Nelson
2012-08-24 16:30 ` Sławomir Skowron
2012-08-24 18:12 ` Wido den Hollander
2012-08-24 18:23 ` Mark Nelson
2012-08-27 18:05 ` Stephen Perkins
2012-08-27 22:33 ` Wido den Hollander
[not found] ` <00ae01cd823e$84e2ed20$8ea8c760$@netmass.com>
2012-08-25 11:48 ` Wido den Hollander
2012-08-24 16:12 ` Tommi Virtanen
2012-08-24 18:09 ` Wido den Hollander
2012-08-22 15:46 ` Jonathan Proulx
2012-08-23 9:59 ` Wido den Hollander
[not found] ` <CABYiri_-73UyTKHcHWDZdjqb=rozjraVzxd166NZV2ir53tduA@mail.gmail.com>
2012-08-26 11:15 ` Wido den Hollander
2012-08-26 13:29 ` Mark Nelson
2012-08-22 14:41 ` Mark Nelson
2012-08-28 0:02 ` Curtis C.
2012-08-28 1:18 ` Mark Nelson