From: Craig Dunwoody <cdunwoody@graphstream.com>
To: ceph-devel@lists.sourceforge.net
Cc: cdunwoody@graphstream.com
Subject: Hardware-config suggestions for HDD-based OSD node?
Date: Sun, 28 Mar 2010 15:36:57 -0700
Message-ID: <23450.1269815817@n20.hq.graphstream.com>


I'd be interested to hear from anyone who has suggestions about
optimizing the hardware config of an HDD-based OSD node for Ceph, using
currently available COTS hardware components.

More specifically, I'm interested in how one might try for an efficient
balance among key hardware resources including:

    CPU cores
    Main-memory throughput and capacity
    HDD controllers
    HDDs
    SSDs for journaling, if any
    NICs

Some reasonable answers I expect might include:

-   It's very early days for Ceph, no one really knows yet, and the only
    way to find out is to experiment with real hardware and
    applications, which is expensive

-   Answer depends a lot on many factors, including:
    -   Cost/performance tradeoff choices for a particular application
    -   Details of workloads for a particular application
    -   Details of hardware-component performance characteristics

Seems to me that one of many possible approaches would be to choose a
particular HDD type (e.g. 2TB 3.5" 7200RPM SAS-6G), and then work toward
the following goals, recognizing that there are tensions/conflicts among
these goals:

    Goal G1
        Maximize the incremental improvement in overall FS access
        performance that results from each incremental addition of a
        single HDD.

    Goal G2
        Minimize physical space used per bit of total FS capacity.

    Goal G3
        Minimize total hardware cost per bit of total FS capacity.

I would expect to be able to do well on G1 by stacking up nodes, each
with a single HDD, single cosd instance, and one or more GigE ports.
However, I would expect to do better on G2 and G3 by increasing #HDDs
per node.

Based on currently available server components that are relatively
inexpensive and convenient to deploy, I can imagine that for some
applications it might be attractive to stack up 1RU-rackmount nodes,
each with four HDDs, four cosd instances, and two or more GigE ports.

Beyond that, I'm wondering if it would be possible to serve some
applications better with a fatter OSD node config.  In particular, could
I improve space-efficiency (G2) and maybe also cost-per-bit (G3) by
increasing the #HDDs per node until incremental performance contribution
of each additional HDD (G1) just starts to drop below what I would get
with only a single HDD per node?

As one really extreme example, at a cost that might be acceptable for
some applications I could build a single max-configuration node with:
    
     2 CPU sockets
    24 CPU threads (2 x 6core x 2thread, or 2 x 12core x 1thread)
    12 DIMMs (currently up to 96GB capacity, up to 85 GByte/sec peak)
     3 8port SAS6G HBAs (aggregate 14.4GByte/sec peak to HDDs)
     5 2port 10GigE NICs (aggregate 12.5GByte/sec peak to network)
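
As a rough check of the two aggregate-peak figures above, here is a
minimal Python sketch. The per-port rates are assumptions not stated in
the list: 600 MByte/sec per SAS-6G port (6 Gbit/sec line rate with
8b/10b encoding) and 1250 MByte/sec per 10GigE port (raw line rate).
The helper name aggregate_peak_mbyte is made up for illustration.

```python
# Assumed per-port peak rates in MByte/sec (not given in the message).
SAS_6G_PORT_MBYTE_PER_SEC = 600      # 6 Gbit/sec line rate, 8b/10b encoded
TEN_GIGE_PORT_MBYTE_PER_SEC = 1250   # 10 Gbit/sec raw line rate / 8


def aggregate_peak_mbyte(devices, ports_per_device, port_mbyte_per_sec):
    """Return the peak aggregate bandwidth of a set of devices in MByte/sec."""
    return devices * ports_per_device * port_mbyte_per_sec


hba_peak = aggregate_peak_mbyte(3, 8, SAS_6G_PORT_MBYTE_PER_SEC)
nic_peak = aggregate_peak_mbyte(5, 2, TEN_GIGE_PORT_MBYTE_PER_SEC)

print(f"HBA peak: {hba_peak / 1000:.1f} GByte/sec")
print(f"NIC peak: {nic_peak / 1000:.1f} GByte/sec")
```

These are peak numbers only; sustained throughput would be lower.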

Using appropriate chassis, I could attach a pretty large number of 2TB
3.5" 7200RPM SAS-6G HDDs to this node, even hundreds if I wanted to (but
I wouldn't).

I'm wondering how large I could push the number of attached HDDs, before
the incremental performance contribution of each HDD starts to drop off.

As the number of attached HDDs increases, I would expect to hit a number
of hardware and software resource limitations in the node.  Certainly
the achievable sustained throughput of the lowest-level hardware
interfaces would be only a fraction of the aggregate-peak numbers that I
listed above.

As one very crude calculation, ignoring many other constraints, if I
thought that I could get all HDDs streaming simultaneously to Ethernet
at a sustained 100MByte/sec each (I can't), and I thought that I could
sustain 50% of wire-speed across the ten 10GigE ports, then I'd limit
myself to about 62 HDDs (6.25 GByte/sec) to avoid worrying about the
Ethernet interfaces throttling the aggregate streaming throughput of the
HDDs.
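
The crude calculation above can be reproduced with a short Python
sketch. The function name and parameter names are made up for
illustration; the inputs are the figures from the paragraph (ten 10GigE
ports, 50% of wire speed, 100 MByte/sec per HDD). The 50 MByte/sec
per-HDD figure in the last line is purely hypothetical, included only to
show how the limit moves when the per-HDD assumption changes.

```python
import math


def max_hdds_for_network(ports, port_mbyte_per_sec, wire_fraction,
                         hdd_mbyte_per_sec):
    """Return (HDD limit, sustained network MByte/sec).

    The limit is the largest whole number of HDDs whose combined
    streaming rate fits within the sustained network throughput.
    """
    sustained = ports * port_mbyte_per_sec * wire_fraction
    return math.floor(sustained / hdd_mbyte_per_sec), sustained


limit, sustained = max_hdds_for_network(10, 1250, 0.5, 100)
print(f"{sustained / 1000:.2f} GByte/sec sustained -> about {limit} HDDs")

# Hypothetical lower per-HDD rate, for comparison only.
print(max_hdds_for_network(10, 1250, 0.5, 50)[0], "HDDs at 50 MByte/sec each")
```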

I expect that a more-realistic (lower) assumption about the aggregate
streaming throughput that the HDDs could sustain under Ceph would lead
to a higher limit on #HDDs based on this one consideration, since each
HDD would then use less of the Ethernet budget.

I would expect that long before reaching 62 HDDs, many other constraints
would cause the per-HDD performance contribution to drop below the
single-HDD-per-server level, including:

-   Limitations in CPU throughput
-   Limitations in main-memory throughput and capacity
-   Various Linux limitations
-   Various Ceph limitations
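
One way to picture where the per-HDD contribution starts to drop is a
toy model in which every HDD delivers its full streaming rate until the
tightest shared ceiling in the node is reached, after which that ceiling
is split evenly. This is only a sketch under that assumption, not a
description of how Ceph behaves. The function name is made up, and the
only ceilings used are the two figures quantified in this message;
CPU, memory, Linux, and Ceph ceilings would need measured values and are
left out.

```python
def per_hdd_contribution(n_hdds, hdd_mbyte_per_sec, ceilings_mbyte_per_sec):
    """Return the streaming rate each HDD contributes, in MByte/sec.

    Toy model: each HDD runs at full rate until the tightest shared
    ceiling is saturated; beyond that the ceiling is shared equally.
    """
    tightest = min(ceilings_mbyte_per_sec.values())
    return min(hdd_mbyte_per_sec, tightest / n_hdds)


# Ceilings taken from figures in this message, in MByte/sec.
ceilings = {
    "10GigE at 50% of wire speed": 6250,
    "SAS HBAs at peak": 14400,
}

for n in (1, 4, 62, 63, 125):
    rate = per_hdd_contribution(n, 100, ceilings)
    print(f"{n:4d} HDDs: {rate:6.1f} MByte/sec per HDD")
```

Adding a measured ceiling for another resource to the dictionary would
move the knee to a lower HDD count whenever that ceiling is the tightest.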

62 HDDs and 62 cosd instances would be 2.6 cosd instances per CPU
thread, which seems to me like a lot.  I would not be surprised at all
to receive a recommendation to limit to less than 1.0 cosd instance per
CPU thread.
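
The ratio above, and the instance count implied by a per-thread cap,
can be worked out with a small Python sketch. Both function names are
made up for illustration, and the 1.0 cap is just the example value from
the paragraph, not a recommendation from anyone.

```python
def cosd_per_thread(cosd_instances, cpu_threads):
    """Return the number of cosd instances per CPU thread."""
    return cosd_instances / cpu_threads


def max_cosd_for_ratio(cpu_threads, max_ratio):
    """Return the largest instance count with a ratio of at most max_ratio."""
    return int(cpu_threads * max_ratio)


print(f"{cosd_per_thread(62, 24):.1f} cosd instances per CPU thread")
print(max_cosd_for_ratio(24, 1.0), "instances at a cap of 1.0 per thread")
```

With one cosd instance per HDD, a cap at or below 1.0 would hold this
example node to 24 HDDs or fewer.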

I can imagine reducing the number of cosd instances by running each atop
a multi-HDD btrfs-level stripe, but I expect that might have various
disadvantages, and I do like the simplicity of one cosd instance per
btrfs filesystem per HDD.

Realistically, I expect that there might be a sweet-spot at a much more
moderate number of HDDs per node, with a node hardware config that is
much less extreme than the example I described above.

I also wonder if perhaps the sweet-spot for #HDDs per OSD node might be
able to increase over time, as Ceph matures and more tuning is done.

Thanks in advance to anyone for any thoughts/comments on this topic.
Would appreciate any suggestions on better ways to analyze the
tradeoffs, and corrections of any fundamental misunderstandings that I
might have about how Ceph works and how to configure it.

-- 
Craig Dunwoody
GraphStream Incorporated

