Looking to Use Ceph

* Looking to Use Ceph
@ 2013-01-03 12:29 emyr.james
  2013-01-03 14:38 ` Wido den Hollander
  2013-01-03 14:43 ` Mark Nelson
  0 siblings, 2 replies; 3+ messages in thread
From: emyr.james @ 2013-01-03 12:29 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org

Hi,

I'm thinking of starting to use Ceph, initially for evaluation, to see 
how it compares to our existing Lustre file system.
One thing that I would like confirmed is how Ceph stores large 
files. If I store a large file in CephFS, is it automatically split up 
into chunks, with the various chunks stored and replicated automatically 
across the whole cluster, or is the whole file stored as one object 
on one individual OSD, with replicas of the whole 
file on a small number of other OSDs? What is the typical block size 
used if files are split up, and can it be configured?

Regards,

Emyr

* Re: Looking to Use Ceph
  2013-01-03 12:29 Looking to Use Ceph emyr.james
@ 2013-01-03 14:38 ` Wido den Hollander
  2013-01-03 14:43 ` Mark Nelson
  1 sibling, 0 replies; 3+ messages in thread
From: Wido den Hollander @ 2013-01-03 14:38 UTC (permalink / raw)
  To: emyr.james; +Cc: ceph-devel@vger.kernel.org

On 01/03/2013 01:29 PM, emyr.james wrote:
> Hi,
>
> I'm thinking of starting to use Ceph, initially for evaluation, to see
> how it compares to our existing Lustre file system.
> One thing that I would like confirmed is how Ceph stores large
> files. If I store a large file in CephFS, is it automatically split up
> into chunks, with the various chunks stored and replicated automatically
> across the whole cluster, or is the whole file stored as one object
> on one individual OSD, with replicas of the whole
> file on a small number of other OSDs? What is the typical block size
> used if files are split up, and can it be configured?
>

By default, files are striped into 4MB objects, which are then distributed 
over the OSDs and replicated.

A 1GB file will thus result in 256 different objects, distributed and 
replicated throughout your cluster.
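
As a quick check of that arithmetic, here is a minimal sketch. The function name and constants are illustrative only (they are not part of any Ceph tool), and it assumes binary units, i.e. 1GB = 1024MB.

```python
MIB = 1024 * 1024
DEFAULT_OBJECT_SIZE = 4 * MIB  # the 4MB default mentioned above


def object_count(file_size: int, object_size: int = DEFAULT_OBJECT_SIZE) -> int:
    """Return how many objects are needed to hold file_size bytes.

    Hypothetical helper for illustration; the last object may be
    only partially filled, so the division rounds up.
    """
    if file_size < 0 or object_size <= 0:
        raise ValueError("file_size must be >= 0 and object_size > 0")
    return -(-file_size // object_size)  # ceiling division on integers


if __name__ == "__main__":
    one_gib = 1024 * MIB
    print(object_count(one_gib))  # 1GB / 4MB
```

With a replication level of N, each of those objects is stored N times, so the number of object copies in the cluster is the object count multiplied by N.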

You can configure the stripe-size with the "cephfs" tool on a client.

Note: CephFS hasn't received as much attention as RADOS and RBD, so you 
might run into some weird situations.

Wido


* Re: Looking to Use Ceph
  2013-01-03 12:29 Looking to Use Ceph emyr.james
  2013-01-03 14:38 ` Wido den Hollander
@ 2013-01-03 14:43 ` Mark Nelson
  1 sibling, 0 replies; 3+ messages in thread
From: Mark Nelson @ 2013-01-03 14:43 UTC (permalink / raw)
  To: emyr.james; +Cc: ceph-devel@vger.kernel.org

Hi Emyr,

If you are using CephFS (i.e. the POSIX file system component), large 
files will by default be broken up into 4MB objects.  The object size is 
configurable.  Each object is distributed pseudo-randomly between the 
different OSDs (though if you have a replication level > 1, object 
replicas will obey the rules defined in your CRUSH map).  Replication is 
configured on a per-pool basis and happens automatically.  If an OSD goes 
down and replication is used, Ceph will attempt to heal itself by 
re-replicating the objects that were on the down OSD to the remaining ones.
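
To make the "broken up into objects" step concrete, here is a rough sketch of RAID-0-style striping arithmetic that maps a byte offset in a file to an object number and an offset inside that object. The parameter names (stripe_unit, stripe_count, object_size) and the function are my own illustration, not copied from the Ceph source, and the sketch says nothing about which OSD an object lands on (that is CRUSH's job). With the defaults assumed here (one stripe, stripe unit equal to the 4MB object size) it reduces to a plain division.

```python
MIB = 1024 * 1024


def locate(offset: int,
           stripe_unit: int = 4 * MIB,
           stripe_count: int = 1,
           object_size: int = 4 * MIB) -> tuple[int, int]:
    """Map a file byte offset to (object_number, offset_in_object).

    Illustrative only: assumes object_size is a multiple of stripe_unit.
    """
    if offset < 0 or stripe_unit <= 0 or stripe_count <= 0:
        raise ValueError("offset must be >= 0; sizes and counts must be > 0")
    if object_size % stripe_unit != 0:
        raise ValueError("object_size must be a multiple of stripe_unit")

    units_per_object = object_size // stripe_unit
    block = offset // stripe_unit          # which stripe unit of the file
    stripe = block // stripe_count         # which row of stripe units
    position = block % stripe_count        # which object within the set
    object_set = stripe // units_per_object
    object_number = object_set * stripe_count + position
    offset_in_object = (stripe % units_per_object) * stripe_unit + offset % stripe_unit
    return object_number, offset_in_object


if __name__ == "__main__":
    print(locate(4 * MIB + 5))                              # default layout
    print(locate(2 * MIB, stripe_unit=MIB, stripe_count=2))  # striped layout
```

The second call shows why a wider stripe count spreads consecutive parts of a file over several objects instead of filling one object at a time.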

For a more in-depth explanation about objects and striping, see:
http://ceph.com/docs/master/dev/file-striping/
http://ceph.com/docs/master/architecture/

One thing you should know is that Ceph's journal is similar to ext4's 
"data=journal" mode in that data is always written to the journal before 
it goes to disk.  If I remember correctly, ldiskfs by default uses 
ext3/4's "data=ordered" mode, which only writes the data once.  The upshot 
is that Ceph needs to do more writes for the same amount of data 
than Lustre, but there is a lower chance of data corruption as the data is 
written.  When not network bound, Ceph will likely be slower for long 
sequential writes than a highly tuned Lustre system on the same hardware, 
but is theoretically faster for short bursty traffic, as writes can be 
acknowledged as soon as they hit the journal (which requires fewer seeks 
than writing the data out to the underlying filesystem).  By putting OSD 
journals on high-throughput SSDs, you can mitigate the sequential write 
penalty and get the best of both worlds, though you need more PCIe and 
controller throughput, and you potentially lose a bit of capacity and 
read throughput if you have to reduce your OSD count to add the SSD 
journals.  PCIe SSDs may become a very interesting option for journals as 
their price comes down.
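
As a back-of-the-envelope illustration of the extra writes, the sketch below counts the bytes that reach storage devices for a given payload. It is deliberately simplified and the function name is hypothetical: it assumes every replica writes the payload once to its journal and once to its data store, and it ignores metadata, journal headers, and any filesystem overhead.

```python
def device_bytes_written(payload_bytes: int,
                         replicas: int = 1,
                         journaled: bool = True) -> int:
    """Estimate total bytes written to devices for one client write.

    Illustrative model only: journaled=True means each replica writes
    the data twice (journal, then data store); journaled=False models
    a write-once mode such as "data=ordered".
    """
    if payload_bytes < 0 or replicas < 1:
        raise ValueError("payload_bytes must be >= 0 and replicas >= 1")
    writes_per_replica = 2 if journaled else 1
    return payload_bytes * replicas * writes_per_replica


if __name__ == "__main__":
    gib = 1024 ** 3
    # Same 1GB payload, two replicas, with and without a data journal.
    print(device_bytes_written(gib, replicas=2, journaled=True) // gib)
    print(device_bytes_written(gib, replicas=2, journaled=False) // gib)
```

Moving the journal to a separate SSD does not reduce the total bytes written in this model; it only moves half of them off the data disks, which is why it helps sequential write throughput.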

Thanks,
Mark
