From: Mark Nelson <mark.nelson@inktank.com>
To: "Curtis C." <serverascode@gmail.com>
Cc: Jonathan Proulx <jon@csail.mit.edu>, ceph-devel@vger.kernel.org
Subject: Re: Ideal hardware spec?
Date: Mon, 27 Aug 2012 20:18:08 -0500
Message-ID: <503C1C50.90404@inktank.com>
In-Reply-To: <CAJ_JamB5vtgt5TWOHhd-AZfDR7aL5QNKhy1Br-RLYF5PerF88A@mail.gmail.com>

On 08/27/2012 07:02 PM, Curtis C. wrote:
> On Wed, Aug 22, 2012 at 8:41 AM, Mark Nelson<mark.nelson@inktank.com>  wrote:
>> On 08/22/2012 08:55 AM, Jonathan Proulx wrote:
>>>
>>> Hi All,
>>
>>
>> Hi Jonathan!
>>
>>
>>>
>>> Yes I'm asking the impossible question, what is the "best" hardware
>>> config.
>>
>>
>> That is the impossible question. :)
>>
>>
>>>
>>> I'm looking at (possibly) using ceph as backing store for images and
>>> volumes on OpenStack as well as exposing at least the object store for
>>> direct use.
>>>
>>> The OpenStack cluster exists and is currently in the early stages of
>>> use by researchers here: approx. 1500 vCPUs (counting hyperthreads;
>>> actually 768 physical cores) and 3T of RAM across 64 physical nodes.
>>>
>>> On the object store side it would be a new resource for us and hard to
>>> say what people would do with it except that it would be many
>>> different things and the use profile would be constantly changing
>>> (which is true of all our existing storage).
>>>
>>> In this sense, even though it's a "private cloud", the somewhat
>>> unpredictable usage profile gives it some characteristics of a small
>>> public cloud.
>>>
>>> Size-wise I'm hoping to start out with 3 monitors and 5(+) OSD nodes
>>> to end up with 20-30T of 3x replicated storage (call me paranoid).
>>>
>>> So the monitor specs seem relatively easy to come up with.  For the
>>> OSDs it looks like
>>> http://ceph.com/docs/master/install/hardware-recommendations suggests
>>> 1 drive, 1 core and 2G RAM per OSD (with multiple OSDs per storage
>>> node).  On-list discussions seem to frequently include an SSD for
>>> journaling (which is similar to what we do for our current ZFS-backed
>>> NFS storage).
>>>
>>> I'm hoping to wrap the hardware in a grant and am willing to experiment
>>> a bit with different software configurations to tune it up when/if I get
>>> the hardware in.  So my immediate concern is a hardware spec that will
>>> have a reasonable processor:memory:disk ratio and opinions (or better,
>>> data) on the utility of SSDs.
>>
>>
>> Before I joined up with Inktank, I was prototyping a private openstack cloud
>> for HPC applications at a supercomputing site.  We similarly were pursuing
>> grant funding.  I know how it goes!
>>
>>
>>>
>>> First, is the documented core-to-disk ratio still current best
>>> practice?  Given a platform with more drive slots, could 8 cores handle
>>> more disks?  Would that need/like more memory?
>>
>>
>> The big thing is the CPU and memory needed during recovery.  During standard
>> operation you shouldn't be pushing the CPU too hard unless you are really
>> pushing data through fast and have many drives per node, or have severely
>> underspecced the CPU.
>>
>> Given that you are only shooting for around 90TB of space across 5+ OSD
>> nodes, you should be able to get away with 2U boxes holding 12 2TB+
>> drives each. That's probably the closest thing we have right now to a
>> "standard" configuration.  We use a single 6-core 2.8GHz AMD Opteron chip
>> in each node with 16GB of memory.  It might be worth bumping that up to
>> 24-32GB of memory for very large deployments with lots of OSDs.
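
As a rough sanity check on the sizing above, the arithmetic can be sketched in Python.  The function names are made up for illustration, and the model ignores filesystem overhead and the free-space headroom a cluster needs during recovery, so treat the results as upper bounds.

```python
def raw_capacity_tb(nodes, drives_per_node, drive_tb):
    """Total raw disk space in the cluster, in TB."""
    return nodes * drives_per_node * drive_tb


def usable_capacity_tb(raw_tb, replicas):
    """Space left for data once every object is stored `replicas` times."""
    return raw_tb / replicas


def required_raw_tb(usable_tb, replicas):
    """Raw space needed to hold `usable_tb` of data at `replicas` copies."""
    return usable_tb * replicas


# 5 OSD nodes, 12 drives each, 2TB per drive, 3x replication.
raw = raw_capacity_tb(nodes=5, drives_per_node=12, drive_tb=2.0)
usable = usable_capacity_tb(raw, replicas=3)
print(f"raw: {raw:.0f} TB, usable at 3x: {usable:.0f} TB")

# The 20-30T usable target from the original question, at 3x replication.
for target in (20, 30):
    print(f"{target} TB usable needs {required_raw_tb(target, 3)} TB raw")
```

With these inputs the minimum build lands above the 90TB raw that the upper end of the 20-30T target calls for.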
>>
>> In terms of controllers, we are using Dell H700 cards which are similar to
>> LSI 9260s, but I think there is a good chance that it may actually be better
>> to use H200s (i.e. LSI 9211-8i or similar) with the IT/JBOD mode firmware.
>> That's one of the commonly used cards in ZFS builds too and has a pretty
>> good reputation.
>>
>> I've actually got a Supermicro SC847a chassis and a whole bunch of various
>> SATA/SAS/RAID controllers I'm testing now in different configurations.
>> Hopefully I should have some data soon.  For now, our best tested
>> configuration is with 12 drive nodes.  Smaller 1U nodes may be an option as
>> well, but they are not very dense.
>>
>
> I've worked a bit with a Supermicro 36 drive bay chassis, though I've
> since moved on from the organization where we had them in place. I quite
> liked them. Wrote a bit of a blog post about them too
> (http://serverascode.com/2012/06/07/36-hot-swappable-day-supermicro-chassis.html)
> so I'm excited to see Inktank trying them out.
>

I really like this chassis.  It's one of the nicer ones that I've worked 
with.  The drives in the back could be a deal breaker for some, but I 
think it's a decent trade-off for what you get.

> The place I currently work at is a big OpenStack user and thinking
> about Ceph, but is not, as of yet, interested in a chassis like the
> Supermicro, so please post about your findings. :)
>
> Thanks,
> Curtis.
>

So far I've only been doing single controller tests with an onboard LSI 
SAS2208 and an external SAS2008 card (9211-8i).  The SAS2008 is actually 
slightly faster.  With 6 7200rpm SATA drives and 2 Intel 520 SSDs for 
journals I can do nearly 600MB/s with 1x replication and 4MB requests 
via rados bench.
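
To put that number in per-device terms, here is a back-of-the-envelope sketch.  It assumes the load was spread evenly and that every write passes through an SSD journal before landing on a data disk (the usual filestore arrangement).  The helper name is hypothetical, and 600MB/s is the rounded-up figure from above, so the results are ceilings rather than measurements.

```python
def per_device_mb_s(aggregate_mb_s, device_count):
    """Average throughput each device sees if load is spread evenly."""
    if device_count <= 0:
        raise ValueError("device_count must be positive")
    return aggregate_mb_s / device_count


AGGREGATE_MB_S = 600  # "nearly 600MB/s", rounded up

per_disk = per_device_mb_s(AGGREGATE_MB_S, 6)  # 6 SATA data disks
per_ssd = per_device_mb_s(AGGREGATE_MB_S, 2)   # 2 journal SSDs

print(f"per data disk:   ~{per_disk:.0f} MB/s")
print(f"per journal SSD: ~{per_ssd:.0f} MB/s")
```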

I've got a couple of other cards to test (an Areca 1680, LSI SAS2308,
and a Marvell-based HighPoint RocketRAID card).  After that I'll start in
on multiple controllers and more drives.  I also got the bracket I
needed in for my 1U client node so I should be able to start in on 2x
bonded 10GbE tests.

Hopefully I can convince the powers that be to let me fill out the 
SC847a chassis and maybe buy another one if the tests look good. ;)

>>
>>>
>>> Have SSD been shown to speed performance with this architecture?
>>
>>
>> Yes, but in different ways depending on how you use them.  SSDs for data
>> storage tend to help mitigate some of the seek behavior issues we've seen on
>> the filestore.  This isn't really a reasonable solution for a lot of people
>> though.
>>
>> In terms of the journal, the biggest benefit that SSDs provide is high
>> throughput, so you can load multiple journals onto 1 SSD and cram more OSDs
>> into one box.  Depending on how much you trust your SSDs, you could try
>> either a 10 disk + 2 SSD or a 9 disk + SSD configuration.  Keep in mind that
>> this will be writing a lot of data to the SSDs, so you should try to
>> undersubscribe them to lengthen the lifespan.  For testing I'm doing 3
>> journals per 180GB Intel 520 SSD.
>>
>>
>>>
>>> If so, given the 8 drive slot example with seven OSDs presented in the
>>> docs, how viable would it be to use a high performance SSD for the
>>> OS image and also cut journal/log partitions out of it for the
>>> remaining seven 2-3T nearline SAS drives?
>>
>>
>> Just keep in mind that in this case your total throughput will likely be
>> limited by the SSD unless you get a very fast one (or are using 1GbE or have
>> some other bottleneck).
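
One way to reason about which component caps a node is to take the minimum of the aggregate journal, data disk and network bandwidth.  The sketch below is a simplified model, not a measurement: the per-device figures (100MB/s per disk, 300MB/s per SSD) are assumed round numbers, the function name is hypothetical, and the network figures are theoretical line rates (1GbE is about 125MB/s, 10GbE about 1250MB/s).

```python
def node_write_limit(n_disks, disk_mb_s, n_ssds, ssd_mb_s, net_mb_s):
    """Return (bottleneck name, MB/s) for sustained writes into one node.

    Every write is assumed to hit an SSD journal and then a data disk,
    so the slowest of the three aggregate paths sets the ceiling.
    """
    limits = {
        "journal SSDs": n_ssds * ssd_mb_s,
        "data disks": n_disks * disk_mb_s,
        "network": net_mb_s,
    }
    name = min(limits, key=limits.get)
    return name, limits[name]


# Seven data disks journaling to a single SSD, assumed device speeds.
for label, net in (("1GbE", 125), ("10GbE", 1250)):
    name, limit = node_write_limit(7, 100, 1, 300, net)
    print(f"{label}: limited by {name} at ~{limit} MB/s")
```

Under these assumptions the single SSD is the ceiling on 10GbE, while on 1GbE the network caps the node long before the SSD does.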
>>
>>
>>>
>>> Thanks,
>>> -Jon
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>

Thanks,
Mark
