From: Mark Nelson <mark.nelson@inktank.com>
To: "emyr.james" <emyr.james@sussex.ac.uk>
Cc: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Subject: Re: Looking to Use Ceph
Date: Thu, 03 Jan 2013 08:43:03 -0600
Message-ID: <50E598F7.8050000@inktank.com>
In-Reply-To: <50E579AF.10206@sussex.ac.uk>

On 01/03/2013 06:29 AM, emyr.james wrote:
> Hi,
>
> I'm thinking of starting to use Ceph, initially for evaluation, to see
> how it compares to our existing Lustre file system.
> One thing that I would like confirmation of is how Ceph stores large
> files. If I store a large file in CephFS, is it automatically split up
> into chunks, with the various chunks stored and replicated
> automatically across the whole cluster, or does it store the whole
> file as one object on one individual OSD and then keep individual
> replicas of the whole file on a small number of other OSDs? What
> is the typical block size used if files are split up, and can this be
> configured?
>
> Regards,
>
> Emyr

Hi Emyr,

If you are using CephFS (i.e. the POSIX file system component), large
files will by default be broken up into 4MB objects.  The object size is
configurable.  Each object is distributed pseudo-randomly across the
different OSDs (though if you have a replication level > 1, object
replicas will obey the rules defined in your CRUSH map).  Replication is
configured on a per-pool basis and happens automatically.  If an OSD
goes down and replication is used, Ceph will attempt to heal itself by
re-creating the replicas that were on the down OSD on the remaining
ones.
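
To make the chunking concrete, here is a minimal sketch of how a byte
offset in a file maps to an object number, following the arithmetic
described in the file-striping document linked below.  The function and
parameter names are mine, not Ceph's, and the defaults assume the plain
layout mentioned above (4MB objects, no striping across objects).
Treat it as an illustration rather than Ceph's actual code.

```python
MiB = 1024 * 1024


def locate(offset, object_size=4 * MiB, stripe_unit=None, stripe_count=1):
    """Map a file byte offset to (object_number, offset_within_object).

    Illustrative only: names and defaults are assumptions. With the
    defaults, this reduces to offset // object_size and offset % object_size.
    """
    if stripe_unit is None:
        stripe_unit = object_size  # default layout: one stripe unit per object
    if object_size % stripe_unit != 0:
        raise ValueError("object_size must be a multiple of stripe_unit")

    stripes_per_object = object_size // stripe_unit
    block = offset // stripe_unit        # which stripe unit in the file
    stripe = block // stripe_count       # which row of stripe units
    stripe_pos = block % stripe_count    # which object within the object set
    object_set = stripe // stripes_per_object

    object_no = object_set * stripe_count + stripe_pos
    offset_in_object = (stripe % stripes_per_object) * stripe_unit + offset % stripe_unit
    return object_no, offset_in_object


def object_count(file_size, object_size=4 * MiB):
    """Number of objects backing a file under the default (unstriped) layout."""
    return -(-file_size // object_size)  # ceiling division
```

With the defaults, locate(10 * MiB) lands 2MB into the third object
(object number 2), and a 1GB file is backed by 256 objects, each of
which is placed and replicated independently.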

For a more in-depth explanation about objects and striping, see:
http://ceph.com/docs/master/dev/file-striping/
http://ceph.com/docs/master/architecture/

One thing you should know is that Ceph's journal is similar to ext4's
"data=journal" mode in that data is always written to the journal before
it goes to disk.  If I remember correctly, ldiskfs by default uses
ext3/4's "data=ordered" mode, which only writes the data once.  The
upshot of this is that Ceph needs to do more writes for the same amount
of data vs Lustre, but there is a lower chance of data corruption as the
data is written.  When not network-bound, Ceph will likely be slower for
long sequential writes on the same hardware vs a highly tuned Lustre
system, but theoretically is faster for short, bursty traffic, as writes
can be acknowledged as soon as they hit the journal (which requires
fewer seeks vs writing the data out to the underlying filesystem).  By
putting OSD journals on high-throughput SSDs, you can mitigate the
sequential write penalty and get the best of both worlds, though you
need more PCIe and controller throughput, and you potentially lose a bit
of capacity and read throughput if you have to reduce your OSD count to
add the SSD journals.  PCIe SSDs may be a very interesting solution for
journals as the price comes down.
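
As a rough back-of-the-envelope illustration of the journal trade-off
above, the sketch below estimates sustained sequential write throughput
per OSD.  It is a deliberately crude model of my own, not a measurement:
it assumes a journal on the same disk halves usable bandwidth because
every byte is written twice, ignores seeks, network, and replication,
and assumes an SSD's bandwidth is shared evenly by the OSDs journaling
to it.  All numbers passed in are hypothetical.

```python
def sustained_write_mb_s(disk_mb_s, journal_on_same_disk=True,
                         ssd_mb_s=None, osds_per_ssd=1):
    """Crude per-OSD estimate of sustained sequential write throughput (MB/s).

    Assumptions (not from Ceph): a colocated journal doubles the bytes
    written to the data disk; a separate SSD journal is shared evenly.
    """
    if journal_on_same_disk:
        # Journal write plus data write both hit the same spindle.
        return disk_mb_s / 2
    if ssd_mb_s is None:
        raise ValueError("ssd_mb_s is required when the journal is on an SSD")
    # The OSD is limited by the slower of its data disk and its SSD share.
    return min(disk_mb_s, ssd_mb_s / osds_per_ssd)
```

For example, a hypothetical 120 MB/s disk with a colocated journal
comes out at 60 MB/s, while four such disks sharing one 400 MB/s SSD
journal come out at 100 MB/s each, which is why the SSD's own
throughput matters once several OSDs share it.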

Thanks,
Mark
