From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mark Nelson
Subject: Re: Looking to Use Ceph
Date: Thu, 03 Jan 2013 08:43:03 -0600
Message-ID: <50E598F7.8050000@inktank.com>
References: <50E579AF.10206@sussex.ac.uk>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
Received: from mail-ia0-f182.google.com ([209.85.210.182]:61313 "EHLO mail-ia0-f182.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753052Ab3ACOmz (ORCPT ); Thu, 3 Jan 2013 09:42:55 -0500
Received: by mail-ia0-f182.google.com with SMTP id x2so12916363iad.27 for ; Thu, 03 Jan 2013 06:42:54 -0800 (PST)
In-Reply-To: <50E579AF.10206@sussex.ac.uk>
Sender: ceph-devel-owner@vger.kernel.org
List-ID:
To: "emyr.james"
Cc: "ceph-devel@vger.kernel.org"

On 01/03/2013 06:29 AM, emyr.james wrote:
> Hi,
>
> I'm thinking of starting to use ceph initially for evaluation...seeing
> how it compares to our existing lustre file system.
> One thing that I would like confirmation of is how ceph stores large
> files. If I store a large file in CephFS is it automatically split up
> into chunks with the various chunks stored and replicated
> automatically across the whole cluster, or does it store the whole
> file as one object on one individual OSD and then has individual
> replicants of the whole file on a small number of other OSD's ? What
> is the typical block size used if files are split up....can this be
> configured ?
>
> Regards,
>
> Emyr
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

Hi Emyr,

If you are using CephFS (i.e. the POSIX file system component), large files will by default be broken up into 4MB objects. The object size is configurable.
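
As a rough illustration of that splitting, here is a minimal Python sketch of how a byte offset in a file maps to an object. It assumes the simplest possible layout (a single stripe, with the stripe unit equal to the object size) and reads "4MB" as 4 MiB; the function names are made up for this example and are not part of any Ceph API.

```python
# Sketch only: assumes one stripe with stripe unit == object size.
# OBJECT_SIZE reads the "4MB" default as 4 MiB, which is an assumption.
OBJECT_SIZE = 4 * 1024 * 1024


def object_count(file_size, object_size=OBJECT_SIZE):
    """Number of objects needed to hold file_size bytes (hypothetical helper)."""
    if file_size < 0 or object_size <= 0:
        raise ValueError("file_size must be >= 0 and object_size must be > 0")
    # Ceiling division: a partially filled last object still counts.
    return -(-file_size // object_size)


def locate(offset, object_size=OBJECT_SIZE):
    """Map a file byte offset to (object index, offset inside that object)."""
    if offset < 0 or object_size <= 0:
        raise ValueError("offset must be >= 0 and object_size must be > 0")
    return divmod(offset, object_size)
```

Under these assumptions a 1 GiB file would end up as 256 objects, and byte 5 MiB of the file would sit 1 MiB into the second object (index 1). A layout with more than one stripe interleaves data differently; the striping document linked in this thread covers that case.
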
Each object is distributed pseudo-randomly across the OSDs (though if you have a replication level > 1, object replicas will obey the rules defined in your CRUSH map). Replication is configured on a per-pool basis and happens automatically. If an OSD goes down and replication is used, Ceph will attempt to heal itself by redistributing the objects that were on the down OSD across the remaining ones.

For a more in-depth explanation of objects and striping, see:

http://ceph.com/docs/master/dev/file-striping/
http://ceph.com/docs/master/architecture/

One thing you should know is that Ceph's journal is similar to ext4's "data=journal" mode, in that data is always written to the journal before it is written out to the underlying filesystem. If I remember correctly, ldiskfs by default uses ext3/4's "data=ordered" mode, which only writes the data once. The upshot is that Ceph needs to do more writes than Lustre for the same amount of data, but there is a lower chance of data corruption as the data is written. When not network bound, Ceph will likely be slower than a highly tuned Lustre system for long sequential writes on the same hardware, but it is theoretically faster for short bursty traffic, as writes can be acknowledged as soon as they hit the journal (which requires fewer seeks than writing the data out to the underlying filesystem).

By putting OSD journals on high-throughput SSDs, you can mitigate the sequential write penalty and get the best of both worlds, though you need more PCIe and controller throughput, and you potentially lose a bit of capacity and read throughput if you have to reduce your OSD count to make room for the SSD journals. PCIe SSDs may be a very interesting solution for journals as the price comes down.

Thanks,
Mark
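
To put rough numbers on the journal trade-off described above, here is a back-of-the-envelope Python sketch. It is an idealized model, not a measurement: it assumes every replica writes each byte twice (once to the journal, once to the filesystem), that a journal sharing the data disk splits that disk's sequential bandwidth evenly, and it ignores seeks, metadata, and caching. The function names and the figures used below are hypothetical.

```python
# Idealized model only: ignores seeks, metadata writes, and caching effects.


def bytes_written(payload_bytes, replicas, journaled=True):
    """Total bytes written to storage across all replicas.

    Assumes each replica writes the payload twice when journaled
    (journal + filesystem) and once otherwise.
    """
    writes_per_replica = 2 if journaled else 1
    return payload_bytes * replicas * writes_per_replica


def sustained_write_rate(disk_rate, journal_on_same_disk=True):
    """Rough ceiling on sustained sequential client writes for one OSD.

    Assumes a journal sharing the data disk takes half of the disk's
    bandwidth; with the journal on a separate (fast enough) device the
    data disk's full rate is available.
    """
    return disk_rate / 2 if journal_on_same_disk else disk_rate
```

With a hypothetical disk that sustains 100 MB/s, this model gives a ceiling of about 50 MB/s per OSD with the journal on the same disk and about 100 MB/s with the journal moved to an SSD, provided the SSD and the controller can keep up.
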