* Usage of CEPH FS versus HDFS for Hadoop: TeraSort benchmark performance comparison issue
From: Lachfeld, Jutta @ 2012-12-13 14:54 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org

Hi all,

I am currently comparing CEPH FS and HDFS as a file system for Hadoop, using Hadoop's integrated benchmark TeraSort. This benchmark first generates the specified amount of data, e.g. 1TB, in the file system used by Hadoop. It then sorts the data via Hadoop's MapReduce framework and writes the sorted output back to the same file system. The benchmark measures the elapsed time of a sort run.
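For readers who want to reproduce such a run, here is a minimal sketch of how the two benchmark steps could be driven and timed from Python. It assumes the `teragen` and `terasort` programs from the Hadoop examples jar and TeraGen's fixed row size of 100 bytes. The jar path and the directory names are hypothetical placeholders, not the ones used in my setup.

```python
import subprocess
import time

ROW_BYTES = 100  # TeraGen writes fixed-size rows of 100 bytes each


def teragen_rows(total_bytes):
    """Number of rows to request from teragen for a given data volume."""
    return total_bytes // ROW_BYTES


def build_commands(total_bytes, examples_jar, gen_dir, sort_dir):
    """Build the command lines for the generate step and the sort step."""
    rows = teragen_rows(total_bytes)
    teragen = ["hadoop", "jar", examples_jar, "teragen", str(rows), gen_dir]
    terasort = ["hadoop", "jar", examples_jar, "terasort", gen_dir, sort_dir]
    return teragen, terasort


def timed_run(cmd):
    """Run one command and return its elapsed wall-clock time in seconds."""
    start = time.time()
    subprocess.check_call(cmd)
    return time.time() - start


# Hypothetical paths; adjust to the local installation.
teragen_cmd, terasort_cmd = build_commands(
    10**12, "hadoop-examples-1.0.3.jar", "/terasort/input", "/terasort/output"
)
print(" ".join(teragen_cmd))
print(" ".join(terasort_cmd))
```

Calling `timed_run(terasort_cmd)` after the generate step would give the elapsed time of the sort run, which is the figure compared in this mail.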

I am puzzled by my best result achieved with CEPH FS in comparison to the ones achieved with HDFS. With CEPH, the runtime of the benchmark is about 1.2 times that of an HDFS run using the default HDFS block size of 64MB. Compared with an HDFS run using an HDFS block size of 512MB, the factor rises to 1.5.
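To make the two factors easier to relate, here is a small worked calculation. It assumes both factors refer to the same best CEPH FS run, and it normalises that run to 1.0 because no absolute runtimes are given here.

```python
# Slowdown factors of the best CEPH FS run relative to each HDFS run.
FACTOR_VS_HDFS_64MB = 1.2
FACTOR_VS_HDFS_512MB = 1.5

# Normalise the CEPH FS runtime to 1.0 (assumption: same CEPH run in both cases).
ceph_runtime = 1.0
hdfs_64_runtime = ceph_runtime / FACTOR_VS_HDFS_64MB
hdfs_512_runtime = ceph_runtime / FACTOR_VS_HDFS_512MB

# Relative runtime of the 512MB HDFS run compared with the 64MB HDFS run.
hdfs_512_vs_64 = hdfs_512_runtime / hdfs_64_runtime

print("HDFS 64MB runtime relative to CEPH:  %.3f" % hdfs_64_runtime)
print("HDFS 512MB runtime relative to CEPH: %.3f" % hdfs_512_runtime)
print("HDFS 512MB relative to HDFS 64MB:    %.2f" % hdfs_512_vs_64)
```

Under that assumption the 512MB HDFS run takes about 0.8 times as long as the 64MB HDFS run, so the larger block size alone accounts for roughly a 20 percent improvement on the HDFS side.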

Could you please take a look at the configuration below? Perhaps some key factor, e.g. the CEPH version, catches your eye right away.

OS: SLES 11 SP2

CEPH:
OSDs are distributed over several machines.
There is 1 MON and 1 MDS process on yet another machine.

Replication of the data pool is set to 1.
Underlying file systems for data are btrfs.
Mount options are only "rw,noatime".
For each CEPH OSD, we use a RAM disk of 256MB for the journal.
Package ceph has version 0.48-13.1, package ceph-fuse has version 0.48-13.1.

HDFS:
HDFS is distributed over the same machines.
HDFS name node on yet another machine.

Replication level is set to 1.
HDFS block size is set to either 64MB or 512MB.
Underlying file systems for data are btrfs.
Mount options are only "rw,noatime".
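As a rough illustration of what the two block sizes mean for the job, the sketch below converts them to the byte values that Hadoop 1.x expects for `dfs.block.size` and estimates the number of block-sized input splits for 10^12 bytes of data. It assumes one split per block and treats the input as a single contiguous file. The real number of map tasks depends on how the generated data is spread over files, so the counts are approximate.

```python
MB = 1024 * 1024


def block_size_bytes(size_mb):
    """Block size in bytes, as used for dfs.block.size in Hadoop 1.x."""
    return size_mb * MB


def split_count(total_bytes, block_bytes):
    """Approximate number of block-sized splits (ceiling division)."""
    return -(-total_bytes // block_bytes)


TOTAL_BYTES = 10**12  # assumed data volume for a "1TB" run

for size_mb in (64, 512):
    block = block_size_bytes(size_mb)
    print("%4d MB -> dfs.block.size=%d, about %d splits"
          % (size_mb, block, split_count(TOTAL_BYTES, block)))
```

With these assumptions the 64MB setting yields about eight times as many splits as the 512MB setting, which means correspondingly more map tasks and more per-task overhead.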

Hadoop version is 1.0.3.
The CEPH patch for Hadoop, which was generated against 0.20.205.0, has been applied.
The same maximum number of Hadoop map tasks has been used for HDFS and for CEPH FS.

The same disk partitions are either formatted for HDFS or for CEPH usage.

CPU usage in both cases is almost 100 percent on all data-related nodes.
There is enough memory on all nodes for the joint load of ceph-osd and Hadoop java processes.

Best regards,

Jutta Lachfeld.

--
jutta.lachfeld@ts.fujitsu.com, Fujitsu Technology Solutions PBG PDG ES&S SWE SOL 4, "Infrastructure Solutions", MchD 5B, Tel. ..49-89-3222-2705, Company Details: http://de.ts.fujitsu.com/imprint

