Usage of CEPH FS versa HDFS for Hadoop: TeraSort benchmark performance comparison issue

* Usage of CEPH FS versa HDFS for Hadoop: TeraSort benchmark performance comparison issue
@ 2012-12-13 14:54 Lachfeld, Jutta
  2012-12-13 17:27 ` Sage Weil
  2012-12-14 14:53 ` Mark Nelson
  0 siblings, 2 replies; 12+ messages in thread
From: Lachfeld, Jutta @ 2012-12-13 14:54 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org

Hi all,

I am currently comparing CEPH FS and HDFS as the file system for Hadoop, using Hadoop's integrated benchmark TeraSort. This benchmark first generates the specified amount of data, e.g. 1TB, in the file system used by Hadoop, then sorts the data via Hadoop's MapReduce framework and writes the sorted output back to the same file system. The benchmark measures the elapsed time of a sort run.

I am wondering about my best result with CEPH FS compared to the results with HDFS. With CEPH, the runtime of the benchmark is longer by a factor of about 1.2 compared with an HDFS run using the default HDFS block size of 64MB. Compared with an HDFS run using an HDFS block size of 512MB, the factor is even 1.5.

Could you please take a look at the configuration? Perhaps some key factor already catches your eye, e.g. the CEPH version.

OS: SLES 11 SP2

CEPH:
OSDs are distributed over several machines.
There is 1 MON and 1 MDS process on yet another machine.

Replication of the data pool is set to 1.
Underlying file systems for data are btrfs.
Mount options are only "rw,noatime".
For each CEPH OSD, we use a RAM disk of 256MB for the journal.
Package ceph has version 0.48-13.1, package ceph-fuse has version 0.48-13.1.

HDFS:
HDFS is distributed over the same machines.
HDFS name node on yet another machine.

Replication level is set to 1.
HDFS block size is set to 64MB or even 512MB.
Underlying file systems for data are btrfs.
Mount options are only "rw,noatime".

Hadoop version is 1.0.3.
Applied the CEPH patch for Hadoop that was generated with 0.20.205.0.
The same maximum number of Hadoop map tasks has been used for HDFS and for CEPH FS.

The same disk partitions are either formatted for HDFS or for CEPH usage.

CPU usage in both cases is almost 100 percent on all data related nodes.
There is enough memory on all nodes for the joint load of ceph-osd and Hadoop java processes.

Best regards,

Jutta Lachfeld.

--
jutta.lachfeld@ts.fujitsu.com, Fujitsu Technology Solutions PBG PDG ES&S SWE SOL 4, "Infrastructure Solutions", MchD 5B, Tel. ..49-89-3222-2705, Company Details: http://de.ts.fujitsu.com/imprint

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Usage of CEPH FS versa HDFS for Hadoop: TeraSort benchmark performance comparison issue
  2012-12-13 14:54 Usage of CEPH FS versa HDFS for Hadoop: TeraSort benchmark performance comparison issue Lachfeld, Jutta
@ 2012-12-13 17:27 ` Sage Weil
  2012-12-13 17:41   ` Gregory Farnum
  2012-12-14 14:53 ` Mark Nelson
  1 sibling, 1 reply; 12+ messages in thread
From: Sage Weil @ 2012-12-13 17:27 UTC (permalink / raw)
  To: Lachfeld, Jutta; +Cc: ceph-devel@vger.kernel.org

Hi Jutta,

On Thu, 13 Dec 2012, Lachfeld, Jutta wrote:
> Hi all,
> 
> I am currently doing some comparisons between CEPH FS and HDFS as a file system for Hadoop using Hadoop's integrated benchmark TeraSort. This benchmark first generates the specified amount of data in the file system used by Hadoop, e.g. 1TB of data, and then sorts the data via the MapReduce framework of Hadoop, sending the sorted output again to the file system used by Hadoop.  The benchmark measures the elapsed time of a sort run.
> 
> I am wondering about my best result achieved with CEPH FS in comparison to the ones achieved with HDFS. With CEPH, the runtime of the benchmark is somewhat longer, the factor is about 1.2 when comparing with an HDFS run using the default HDFS block size of 64MB. When comparing with an HDFS run using an HDFS block size of 512MB the factor is even 1.5.
> 
> Could you please take a look at the configuration, perhaps some key factor already catches your eye, e.g. CEPH version.
> 
> OS: SLES 11 SP2
> 
> CEPH:
> OSDs are distributed over several machines.
> There is 1 MON and 1 MDS process on yet another machine.
> 
> Replication of the data pool is set to 1.
> Underlying file systems for data are btrfs.
> Mount options  are only "rw,noatime".
> For each CEPH OSD, we use a RAM disk of 256MB for the journal.
> Package ceph has version 0.48-13.1, package ceph-fuse has version 0.48-13.1.
> 
> HDFS:
> HDFS is distributed over the same machines.
> HDFS name node on yet another machine.
> 
> Replication level is set to 1.
> HDFS block size is set to  64MB or even 512MB.

I suspect that this is part of it.  The default ceph block size is only 
4MB, and the fact that the differential increases with larger HDFS blocks 
points the same way.  I'm not sure whether the block size setting is 
properly wired up; it depends on which version of the hadoop bindings you 
are using.  Noah would know more.

You can adjust the default block/object size for the fs with the cephfs 
utility from a kernel mount.  There isn't yet a convenient way to do this 
via ceph-fuse.
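
To put the block sizes mentioned above side by side, here is a minimal
arithmetic sketch. It only counts how many chunks a data set splits into
at 4MB, 64MB and 512MB; it assumes "1TB" means 2^40 bytes, and the helper
name chunk_count is made up for illustration. It says nothing about how
either file system actually performs.

```python
# Rough arithmetic only: how many chunks a data set splits into for a given
# chunk (block/object) size. Assumption: 1 TB is taken as 2**40 bytes.
MB = 1024 * 1024
TB = 1024 * 1024 * MB


def chunk_count(total_bytes: int, chunk_bytes: int) -> int:
    """Number of chunks needed to hold total_bytes (ceiling division)."""
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    return -(-total_bytes // chunk_bytes)


if __name__ == "__main__":
    for size_mb in (4, 64, 512):
        count = chunk_count(1 * TB, size_mb * MB)
        print(f"{size_mb:>3} MB chunks: {count:>7} per TB")
```

With these assumptions, 1TB corresponds to 262144 chunks of 4MB but only
2048 chunks of 512MB, so the number of per-chunk operations differs by
more than two orders of magnitude.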

sage

> Underlying file systems for data are btrfs.
> Mount options are only "rw,noatime".
> 
> Hadoop version is 1.0.3.
> Applied the CEPH patch for Hadoop that was generated with 0.20.205.0.
> The same maximum number of Hadoop map tasks has been used for HDFS and for CEPH FS.
> 
> The same disk partitions are either formatted for HDFS or for CEPH usage.
> 
> CPU usage in both cases is almost 100 percent on all data related nodes.
> There is enough memory on all nodes for the joint load of ceph-osd and Hadoop java processes.
> 
> Best regards,
> 
> Jutta Lachfeld.
> 
> --
> jutta.lachfeld@ts.fujitsu.com, Fujitsu Technology Solutions PBG PDG ES&S SWE SOL 4, "Infrastructure Solutions", MchD 5B, Tel. ..49-89-3222-2705, Company Details: http://de.ts.fujitsu.com/imprint
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Usage of CEPH FS versa HDFS for Hadoop: TeraSort benchmark performance comparison issue
  2012-12-13 17:27 ` Sage Weil
@ 2012-12-13 17:41   ` Gregory Farnum
  2012-12-13 20:23     ` Cameron Bahar
  0 siblings, 1 reply; 12+ messages in thread
From: Gregory Farnum @ 2012-12-13 17:41 UTC (permalink / raw)
  To: Sage Weil, Lachfeld, Jutta
  Cc: ceph-devel@vger.kernel.org, Noah Watkins, Joe Buck

On Thu, Dec 13, 2012 at 9:27 AM, Sage Weil <sage@inktank.com> wrote:
> Hi Jutta,
>
> On Thu, 13 Dec 2012, Lachfeld, Jutta wrote:
>> Hi all,
>>
>> I am currently doing some comparisons between CEPH FS and HDFS as a file system for Hadoop using Hadoop's integrated benchmark TeraSort. This benchmark first generates the specified amount of data in the file system used by Hadoop, e.g. 1TB of data, and then sorts the data via the MapReduce framework of Hadoop, sending the sorted output again to the file system used by Hadoop.  The benchmark measures the elapsed time of a sort run.
>>
>> I am wondering about my best result achieved with CEPH FS in comparison to the ones achieved with HDFS. With CEPH, the runtime of the benchmark is somewhat longer, the factor is about 1.2 when comparing with an HDFS run using the default HDFS block size of 64MB. When comparing with an HDFS run using an HDFS block size of 512MB the factor is even 1.5.
>>
>> Could you please take a look at the configuration, perhaps some key factor already catches your eye, e.g. CEPH version.
>>
>> OS: SLES 11 SP2
>>
>> CEPH:
>> OSDs are distributed over several machines.
>> There is 1 MON and 1 MDS process on yet another machine.
>>
>> Replication of the data pool is set to 1.
>> Underlying file systems for data are btrfs.
>> Mount options  are only "rw,noatime".
>> For each CEPH OSD, we use a RAM disk of 256MB for the journal.
>> Package ceph has version 0.48-13.1, package ceph-fuse has version 0.48-13.1.
>>
>> HDFS:
>> HDFS is distributed over the same machines.
>> HDFS name node on yet another machine.
>>
>> Replication level is set to 1.
>> HDFS block size is set to  64MB or even 512MB.
>
> I suspect that this is part of it.  The default ceph block size is only
> 4MB.  Especially since the differential increases with larger blocks.
> I'm not sure if the setting of block sizes is properly wired up; it
> depends on what version of the hadoop bindings you are using.  Noah would
> know more.
>
> You can adjust the default block/object size for the fs with the cephfs
> utility from a kernel mount.  There isn't yet a convenient way to do this
> via ceph-fuse.

If Jutta is using the *old* bindings I last worked on in 2009, then this
is already wired up for 64MB blocks. A "ceph pg dump" would let us get
a rough estimate of the block sizes in use.

"ceph -s" would also be useful to check that everything is set up reasonably.

Other than that, it would be fair to describe these bindings as
little-used. Minimal performance tests indicated rough parity back in
2009, but those were only a couple of minutes long and on very small
clusters, so 1.2x might be normal. Noah and Joe are working on new
bindings now, and those will be tuned and accompanied by backend
changes if necessary. They might also have a better eye for typical
results.
-Greg

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Usage of CEPH FS versa HDFS for Hadoop: TeraSort benchmark performance comparison issue
  2012-12-13 17:41   ` Gregory Farnum
@ 2012-12-13 20:23     ` Cameron Bahar
  2012-12-13 20:27       ` Gregory Farnum
  0 siblings, 1 reply; 12+ messages in thread
From: Cameron Bahar @ 2012-12-13 20:23 UTC (permalink / raw)
  To: Gregory Farnum
  Cc: Sage Weil, Lachfeld, Jutta, ceph-devel@vger.kernel.org,
	Noah Watkins, Joe Buck

Is the chunk size tunable in a Ceph cluster? I don't mean dynamically, but is it at least statically configurable when a cluster is first installed?

Thanks,
Cameron


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Usage of CEPH FS versa HDFS for Hadoop: TeraSort benchmark performance comparison issue
  2012-12-13 20:23     ` Cameron Bahar
@ 2012-12-13 20:27       ` Gregory Farnum
  2012-12-13 20:33         ` Noah Watkins
  0 siblings, 1 reply; 12+ messages in thread
From: Gregory Farnum @ 2012-12-13 20:27 UTC (permalink / raw)
  To: Cameron Bahar
  Cc: Sage Weil, Lachfeld, Jutta, ceph-devel@vger.kernel.org,
	Noah Watkins, Joe Buck

On Thu, Dec 13, 2012 at 12:23 PM, Cameron Bahar <cbahar@gmail.com> wrote:
> Is the chunk size tunable in a Ceph cluster? I don't mean dynamically, but is it at least statically configurable when a cluster is first installed?

Yeah. You can set chunk size on a per-file basis; you just can't
change it once the file has any data written to it.
In the context of Hadoop the question is just if the bindings are
configured correctly to do so automatically.
-Greg

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Usage of CEPH FS versa HDFS for Hadoop: TeraSort benchmark performance comparison issue
  2012-12-13 20:27       ` Gregory Farnum
@ 2012-12-13 20:33         ` Noah Watkins
  2012-12-14 14:09           ` Lachfeld, Jutta
  2013-01-09 15:11           ` Lachfeld, Jutta
  0 siblings, 2 replies; 12+ messages in thread
From: Noah Watkins @ 2012-12-13 20:33 UTC (permalink / raw)
  To: Gregory Farnum
  Cc: Cameron Bahar, Sage Weil, Lachfeld, Jutta,
	ceph-devel@vger.kernel.org, Noah Watkins, Joe Buck

The bindings use the default Hadoop settings (e.g. 64 or 128 MB
chunks) when creating new files. The chunk size can also be specified
on a per-file basis using the same interface as Hadoop. Additionally,
while Hadoop doesn't provide an interface to configuration parameters
beyond chunk size, we will also let users fully configure any Ceph
striping strategy. http://ceph.com/docs/master/dev/file-striping/

-Noah
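
For readers unfamiliar with the striping parameters mentioned above, the
following is a minimal sketch of how a file offset maps to an object
under a (stripe unit, stripe count, object size) layout. It follows the
scheme described in the linked file-striping document as I understand it;
the names Layout and locate are made up for illustration and are not part
of Ceph or of the Hadoop bindings.

```python
from typing import NamedTuple, Tuple


class Layout(NamedTuple):
    stripe_unit: int   # bytes written to one object before moving to the next
    stripe_count: int  # number of objects in one object set
    object_size: int   # bytes per object; must be a multiple of stripe_unit


def locate(offset: int, layout: Layout) -> Tuple[int, int]:
    """Map a file offset to (object number, offset inside that object)."""
    su, sc, osz = layout
    if su <= 0 or sc <= 0 or osz % su != 0:
        raise ValueError("object_size must be a positive multiple of stripe_unit")
    stripes_per_object = osz // su
    block_no = offset // su          # index of the stripe unit in the file
    stripe_no = block_no // sc       # which stripe (row) the unit belongs to
    stripe_pos = block_no % sc       # which object of the set (column)
    object_set = stripe_no // stripes_per_object
    object_no = object_set * sc + stripe_pos
    offset_in_object = (stripe_no % stripes_per_object) * su + offset % su
    return object_no, offset_in_object


if __name__ == "__main__":
    MB = 1024 * 1024
    # Simple layout: stripe unit == object size, stripe count 1.
    simple = Layout(stripe_unit=64 * MB, stripe_count=1, object_size=64 * MB)
    print(locate(65 * MB, simple))
    # Striped layout: 1 MB units spread over 4 objects of 4 MB each.
    striped = Layout(stripe_unit=MB, stripe_count=4, object_size=4 * MB)
    print(locate(5 * MB, striped))
```

With stripe count 1 and stripe unit equal to the object size, the mapping
reduces to offset // object_size, which is why such a layout behaves like
a plain block size.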


^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: Usage of CEPH FS versa HDFS for Hadoop: TeraSort benchmark performance comparison issue
  2012-12-13 20:33         ` Noah Watkins
@ 2012-12-14 14:09           ` Lachfeld, Jutta
  2013-01-05  0:17             ` Gregory Farnum
  2013-01-09 15:11           ` Lachfeld, Jutta
  1 sibling, 1 reply; 12+ messages in thread
From: Lachfeld, Jutta @ 2012-12-14 14:09 UTC (permalink / raw)
  To: Noah Watkins, Gregory Farnum
  Cc: Cameron Bahar, Sage Weil, ceph-devel@vger.kernel.org,
	Noah Watkins, Joe Buck

Hi Noah, Gregory and Sage,

first of all, thanks for your quick replies. Here are some answers to your questions.

Gregory, I have got the output of "ceph -s" before and after this specific TeraSort run, and to me it looks ok; all 30 osds are "up":

   health HEALTH_OK
   monmap e1: 1 mons at {0=192.168.111.18:6789/0}, election epoch 0, quorum 0 0
   osdmap e22: 30 osds: 30 up, 30 in
    pgmap v13688: 5760 pgs: 5760 active+clean; 1862 GB data, 1868 GB used, 6142 GB / 8366 GB avail
   mdsmap e4: 1/1/1 up {0=0=up:active}

   health HEALTH_OK
   monmap e1: 1 mons at {0=192.168.111.18:6789/0}, election epoch 0, quorum 0 0
   osdmap e22: 30 osds: 30 up, 30 in
    pgmap v19657: 5760 pgs: 5760 active+clean; 1862 GB data, 1868 GB used, 6142 GB / 8366 GB avail
   mdsmap e4: 1/1/1 up {0=0=up:active}

I do not have the full output of "ceph pg dump" for that specific TeraSort run, but here is a typical output after automatically preparing CEPH for a benchmark run
 (I removed almost all lines of the long pg_stat table, hoping that you do not need them):

dumped all in format plain
version 403
last_osdmap_epoch 22
last_pg_scan 1
full_ratio 0.95
nearfull_ratio 0.85
pg_stat objects mip     degr    unf     bytes   log     disklog state   state_stamp     v       reported        up      acting  last_scrub      scrub_stamp
2.314   0       0       0       0       0       0       0       active+clean    2012-12-14 08:31:24.524152      0'0     11'17   [23,7]  [23,7]  0'0     2012-12-14 08:31:24.524096
0.316   0       0       0       0       0       0       0       active+clean    2012-12-14 08:25:12.780643      0'0     11'19   [23]    [23]    0'0     2012-12-14 08:24:08.394930
1.317   0       0       0       0       0       0       0       active+clean    2012-12-14 08:27:56.400997      0'0     3'17    [11,17] [11,17] 0'0     2012-12-14 08:27:56.400953
[...]
pool 0  1       0       0       0       4       136     136
pool 1  21      0       0       0       23745   5518    5518
pool 2  0       0       0       0       0       0       0
 sum    22      0       0       0       23749   5654    5654
osdstat kbused  kbavail kb      hb in   hb out
0       2724    279808588       292420608       [3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29]     []
1       2892    279808588       292420608       [3,4,5,6,8,9,11,12,13,14,15,16,17,18,20,22,24,25,26,27,28]      []
2       2844    279808588       292420608       [3,4,5,6,7,8,9,10,11,12,13,15,16,17,18,19,20,22,23,24,25,26,27,29]      []
3       2716    279808588       292420608       [0,1,2,6,7,8,9,10,11,12,13,14,15,16,17,19,20,22,23,24,25,26,27,28,29]   []
4       2556    279808588       292420608       [1,2,7,8,9,12,13,14,15,16,17,18,19,20,21,22,24,25,26,27,28,29]  []
5       2856    279808584       292420608       [0,2,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,28,29]  []
6       2840    279808584       292420608       [0,1,2,3,4,5,9,10,11,12,13,14,15,16,17,18,19,20,22,24,25,26,27,28,29]   []
7       2604    279808588       292420608       [1,2,3,4,5,9,10,11,12,13,15,17,18,19,20,21,23,24,25,26,27,28,29]        []
8       2564    279808588       292420608       [1,2,3,4,5,9,10,11,12,14,16,17,18,19,20,21,22,23,24,25,27,28,29]        []
9       2804    279808588       292420608       [1,2,3,4,5,6,8,12,13,14,15,16,17,18,19,20,21,22,23,24,26,27,29] []
10      2556    279808588       292420608       [0,1,2,4,5,6,7,8,12,13,14,15,16,17,19,20,21,22,23,24,25,26,27,28]       []
11      3084    279808588       292420608       [0,1,2,3,4,5,6,7,8,12,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29]  []
12      2572    279808588       292420608       [0,1,2,3,4,5,7,8,10,11,15,16,18,20,21,22,23,24,27,28,29]        []
13      2912    279808560       292420608       [0,1,2,3,5,6,7,8,9,10,11,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29]  []
14      2992    279808584       292420608       [1,2,3,4,5,6,7,8,9,10,11,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29]  []
15      2652    279808588       292420608       [1,2,3,4,5,6,7,8,9,10,11,13,14,19,20,21,22,23,25,26,27,28,29]   []
16      3028    279808588       292420608       [0,1,2,3,5,6,7,8,9,10,11,12,14,18,20,21,22,24,25,26,27,28,29]   []
17      2772    279808588       292420608       [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,18,19,21,22,23,24,25,26,27,28,29]   []
18      2804    279808588       292420608       [0,1,2,3,5,6,8,9,10,11,12,14,15,16,17,21,22,23,24,25,26,27,29]  []
19      2620    279808588       292420608       [0,1,2,3,4,5,6,7,8,10,11,12,13,14,15,16,17,21,22,23,25,26,27,28,29]     []
20      2956    279808588       292420608       [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,21,22,23,24,25,27,29]      []
21      2876    279808588       292420608       [0,1,2,3,4,5,6,8,9,10,12,13,15,16,17,18,19,20,24,25,26,27,29]   []
22      3044    279808588       292420608       [1,2,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,24,25,26,27,28,29]    []
23      2752    279808584       292420608       [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,24,25,27,28,29]   []
24      2948    279808588       292420608       [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,27,28,29]        []
25      3068    279808588       292420608       [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,27,28,29]        []
26      2540    279808588       292420608       [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,27,28]   []
27      3012    279808588       292420608       [0,1,2,3,4,5,6,7,8,9,10,11,13,14,15,16,17,19,20,21,22,23,24,25,26]      []
28      2800    279808560       292420608       [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,23,24,25,26]   []
29      3052    279808588       292420608       [1,2,3,4,5,7,8,9,10,11,12,13,14,16,17,18,19,20,21,22,23,24,25,26]       []
 sum    84440   8394257568      8772618240

Does this information help? Is it really 64MB? That is what I had assumed.

As I am relatively new to CEPH, I need some time to digest and understand all your answers.

Regards,
Jutta.

jutta.lachfeld@ts.fujitsu.com, Fujitsu Technology Solutions PBG PDG ES&S SWE SOL 4, "Infrastructure Solutions", MchD 5B, Tel. ..49-89-3222-2705, Company Details: http://de.ts.fujitsu.com/imprint


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Usage of CEPH FS versa HDFS for Hadoop: TeraSort benchmark performance comparison issue
  2012-12-14 14:09           ` Lachfeld, Jutta
@ 2013-01-05  0:17             ` Gregory Farnum
  0 siblings, 0 replies; 12+ messages in thread
From: Gregory Farnum @ 2013-01-05  0:17 UTC (permalink / raw)
  To: Lachfeld, Jutta
  Cc: Cameron Bahar, Sage Weil, ceph-devel@vger.kernel.org,
	Noah Watkins, Joe Buck, Mark Nelson

Sorry for the delay; I've been out on vacation...

On Fri, Dec 14, 2012 at 6:09 AM, Lachfeld, Jutta
<jutta.lachfeld@ts.fujitsu.com> wrote:
> I do not have the full output of "ceph pg dump" for that specific TeraSort run, but here is a typical output after automatically preparing CEPH for a benchmark run
>  (removed almost all lines in the long pg_stat table hoping that you do not need them):

Actually those were exactly what I was after; they include output on
the total PG size and the number of objects so we can check on average
size. :) If you'd like to do it yourself, look at some of the PGs
which correspond to your data pool (the PG ids are all of the form
0.123a, and the number before the decimal point is the pool ID; by
default you'll be looking for 0).
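
The check described above can be sketched in a few lines. This is only an
illustration: it assumes the plain-format column order shown in the dump
header earlier in this thread (pg_stat, objects, mip, degr, unf, bytes,
...), which may differ in other Ceph versions, and the function name
average_object_size as well as the two sample rows are made up.

```python
import re
from typing import Optional

# A PG row starts with "<pool>.<hex id>"; the second column is the object
# count and the sixth is the byte count (assumed layout, see lead-in).
PG_LINE = re.compile(r"^(\d+)\.[0-9a-f]+\s+(\d+)\s+\d+\s+\d+\s+\d+\s+(\d+)\s")


def average_object_size(dump_text: str, pool_id: int = 0) -> Optional[float]:
    """Average bytes per object over all PGs of one pool, or None if empty."""
    objects = 0
    total_bytes = 0
    for line in dump_text.splitlines():
        match = PG_LINE.match(line.strip())
        if match and int(match.group(1)) == pool_id:
            objects += int(match.group(2))
            total_bytes += int(match.group(3))
    return total_bytes / objects if objects else None


if __name__ == "__main__":
    # Hypothetical rows, not taken from a real cluster.
    sample = (
        "0.1a 2 0 0 0 134217728 0 0 active+clean\n"
        "0.2b 1 0 0 0 67108864 0 0 active+clean\n"
        "1.3 5 0 0 0 100 0 0 active+clean\n"
    )
    avg = average_object_size(sample, pool_id=0)
    print(f"average object size in pool 0: {avg / (1024 * 1024):.1f} MB")
```

The result is only meaningful for a dump taken while the benchmark data
is present; a dump taken right after preparing an empty cluster contains
almost no objects.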


On Fri, Dec 14, 2012 at 6:53 AM, Mark Nelson <mark.nelson@inktank.com> wrote:
> The large block size may be an issue (at least with some of our default
> tunable settings).  You might want to try 4 or 16MB and see if it's any
> better or worse.

Unless you've got a specific reason to think this is busted, I am
pretty confident it's not a problem. :)


Jutta, do you have any finer-grained numbers than total run time
(specifically, how much time is spent on data generation versus the
read-and-sort for each FS)? HDFS doesn't do any journaling like Ceph
does and the fact that the Ceph journal is in-memory might not be
helping much since it's so small compared to the amount of data being
written.
-Greg
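
To give a sense of scale for the journal remark, here is a small
arithmetic sketch using only figures quoted in this thread: 30 OSDs, a
256MB RAM-disk journal per OSD, and 1862 GB of data reported by "ceph -s".
The function name journal_totals is made up, and the calculation makes no
claim about how much time the journal actually costs.

```python
from typing import Tuple


def journal_totals(num_osds: int, journal_mb: int, data_gb: float) -> Tuple[float, float]:
    """Return (total journal size in GB, journal size as a fraction of the data)."""
    total_journal_gb = num_osds * journal_mb / 1024
    return total_journal_gb, total_journal_gb / data_gb


if __name__ == "__main__":
    total_gb, fraction = journal_totals(num_osds=30, journal_mb=256, data_gb=1862)
    print(f"total journal space: {total_gb:.1f} GB")
    print(f"share of stored data: {fraction:.2%}")
```

Under these figures the journals together hold 7.5 GB, well under one
percent of the data written during a run.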

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: Usage of CEPH FS versa HDFS for Hadoop: TeraSort benchmark performance comparison issue
  2012-12-13 20:33         ` Noah Watkins
  2012-12-14 14:09           ` Lachfeld, Jutta
@ 2013-01-09 15:11           ` Lachfeld, Jutta
  2013-01-09 16:00             ` Noah Watkins
  1 sibling, 1 reply; 12+ messages in thread
From: Lachfeld, Jutta @ 2013-01-09 15:11 UTC (permalink / raw)
  To: Noah Watkins, Gregory Farnum
  Cc: Cameron Bahar, Sage Weil, ceph-devel@vger.kernel.org,
	Noah Watkins, Joe Buck

Hi Noah,

the current content of the web page http://ceph.com/docs/master/cephfs/hadoop shows a configuration parameter ceph.object.size.
Is it the CEPH equivalent of the "HDFS block size" parameter which I have been looking for?

Does the parameter ceph.object.size apply to version 0.56.1?

I would be interested in setting this parameter to values higher than 64MB, e.g. 256MB or 512MB, similar to the values I have used for HDFS to increase the performance of the TeraSort benchmark. Would these values be allowed, and would they make sense at all for the mechanisms used in CEPH?

Regards,
Jutta.

-
jutta.lachfeld@ts.fujitsu.com, Fujitsu Technology Solutions PBG PDG ES&S SWE SOL 4, "Infrastructure Solutions", MchD 5B, Tel. ..49-89-3222-2705, Company Details: http://de.ts.fujitsu.com/imprint


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Usage of CEPH FS versa HDFS for Hadoop: TeraSort benchmark performance comparison issue
  2013-01-09 15:11           ` Lachfeld, Jutta
@ 2013-01-09 16:00             ` Noah Watkins
  2013-01-10 21:42               ` Gregory Farnum
  0 siblings, 1 reply; 12+ messages in thread
From: Noah Watkins @ 2013-01-09 16:00 UTC (permalink / raw)
  To: Lachfeld, Jutta
  Cc: Noah Watkins, Gregory Farnum, Cameron Bahar, Sage Weil,
	ceph-devel@vger.kernel.org, Joe Buck

Hi Jutta,

On Wed, Jan 9, 2013 at 7:11 AM, Lachfeld, Jutta
<jutta.lachfeld@ts.fujitsu.com> wrote:
>
> the current content of the web page http://ceph.com/docs/master/cephfs/hadoop shows a configuration parameter ceph.object.size.
> Is it the CEPH equivalent  to the "HDFS block size" parameter which I have been looking for?

Yes. By specifying ceph.object.size, Hadoop will use a default
Ceph file layout with stripe unit = object size, and stripe count = 1.
This has effectively the same meaning as dfs.block.size for HDFS.

> Does the parameter ceph.object.size apply to version 0.56.1?

The Ceph/Hadoop file system plugin is being developed here:

  git://github.com/ceph/hadoop-common cephfs/branch-1.0

There is an old version of the Hadoop plugin in the Ceph tree which
will be removed shortly. Regarding the versions, development is taking
place in cephfs/branch-1.0 and in ceph.git master. We don't yet have a
system in place for dealing with compatibility across versions because
the code is in heavy development.

If you are running 0.56.1 then a recent version of cephfs/branch-1.0
should work with that, but may not for long, as development continues.

> I would be interested in setting this parameter to values higher than 64MB, e.g. 256MB or 512MB similar to the values I have used for HDFS for increasing the performance of the TeraSort benchmark. Would these values be allowed and would they at all make sense for the mechanisms used in CEPH?

I can't think of any reason why a large size would cause concern, but
maybe someone else can chime in?

- Noah

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Usage of CEPH FS versa HDFS for Hadoop: TeraSort benchmark performance comparison issue
  2013-01-09 16:00             ` Noah Watkins
@ 2013-01-10 21:42               ` Gregory Farnum
  0 siblings, 0 replies; 12+ messages in thread
From: Gregory Farnum @ 2013-01-10 21:42 UTC (permalink / raw)
  To: Noah Watkins
  Cc: Lachfeld, Jutta, Noah Watkins, Cameron Bahar, Sage Weil,
	ceph-devel@vger.kernel.org, Joe Buck

On Wed, Jan 9, 2013 at 8:00 AM, Noah Watkins <noah.watkins@inktank.com> wrote:
> Hi Jutta,
>
> On Wed, Jan 9, 2013 at 7:11 AM, Lachfeld, Jutta
> <jutta.lachfeld@ts.fujitsu.com> wrote:
>>
>> the current content of the web page http://ceph.com/docs/master/cephfs/hadoop shows a configuration parameter ceph.object.size.
>> Is it the CEPH equivalent  to the "HDFS block size" parameter which I have been looking for?
>
> Yes. By specifying ceph.object.size, Hadoop will use a default
> Ceph file layout with stripe unit = object size, and stripe count = 1.
> This has effectively the same meaning as dfs.block.size for HDFS.
>
>> Does the parameter ceph.object.size apply to version 0.56.1?
>
> The Ceph/Hadoop file system plugin is being developed here:
>
>   git://github.com/ceph/hadoop-common cephfs/branch-1.0
>
> There is an old version of the Hadoop plugin in the Ceph tree which
> will be removed shortly. Regarding the versions, development is taking
> place in cephfs/branch-1.0 and in ceph.git master. We don't yet have a
> system in place for dealing with compatibility across versions because
> the code is in heavy development.

If you are using the old version in the Ceph tree, you should be
setting fs.ceph.blockSize rather than ceph.object.size. :)
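
As a rough sketch of the distinction above, the snippet below builds a Hadoop-style property element for either plugin. The two property names come from this thread. The plugin labels, the helper name block_size_property and the assumption that the value is given in bytes are illustrative only and should be checked against the plugin you actually run.

```python
import xml.etree.ElementTree as ET

MB = 1024 * 1024

# Property names as given in the thread; the dictionary keys are made-up labels.
PROPERTY_BY_PLUGIN = {
    "hadoop-common": "ceph.object.size",  # plugin from cephfs/branch-1.0
    "ceph-tree": "fs.ceph.blockSize",     # old plugin shipped in the Ceph tree
}


def block_size_property(plugin: str, size_mb: int) -> str:
    """Build a Hadoop <property> element; the value is ASSUMED to be in bytes."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = PROPERTY_BY_PLUGIN[plugin]
    ET.SubElement(prop, "value").text = str(size_mb * MB)
    return ET.tostring(prop, encoding="unicode")


print(block_size_property("ceph-tree", 64))
print(block_size_property("hadoop-common", 512))
```

The resulting element would go into the Hadoop site configuration next to the other file system settings.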


>> I would be interested in setting this parameter to values higher than 64MB, e.g. 256MB or 512MB similar to the values I have used for HDFS for increasing the performance of the TeraSort benchmark. Would these values be allowed and would they at all make sense for the mechanisms used in CEPH?
>
> I can't think of any reason why a large size would cause concern, but
> maybe someone else can chime in?

Yep, totally fine.
-Greg


* Re: Usage of CEPH FS versa HDFS for Hadoop: TeraSort benchmark performance comparison issue
  2012-12-13 14:54 Usage of CEPH FS versa HDFS for Hadoop: TeraSort benchmark performance comparison issue Lachfeld, Jutta
  2012-12-13 17:27 ` Sage Weil
@ 2012-12-14 14:53 ` Mark Nelson
  1 sibling, 0 replies; 12+ messages in thread
From: Mark Nelson @ 2012-12-14 14:53 UTC (permalink / raw)
  To: Lachfeld, Jutta; +Cc: ceph-devel@vger.kernel.org

On 12/13/2012 08:54 AM, Lachfeld, Jutta wrote:
> Hi all,

Hi!  Sorry to send this a bit late, it looks like the reply I authored 
yesterday from my phone got eaten by vger.

>
> I am currently doing some comparisons between CEPH FS and HDFS as a file system for Hadoop using Hadoop's integrated benchmark TeraSort. This benchmark first generates the specified amount of data in the file system used by Hadoop, e.g. 1TB of data, and then sorts the data via the MapReduce framework of Hadoop, sending the sorted output again to the file system used by Hadoop.  The benchmark measures the elapsed time of a sort run.
>
> I am wondering about my best result achieved with CEPH FS in comparison to the ones achieved with HDFS. With CEPH, the runtime of the benchmark is somewhat longer, the factor is about 1.2 when comparing with an HDFS run using the default HDFS block size of 64MB. When comparing with an HDFS run using an HDFS block size of 512MB the factor is even 1.5.
>
> Could you please take a look at the configuration, perhaps some key factor already catches your eye, e.g. CEPH version.
>
> OS: SLES 11 SP2

Beyond what the others have said, this could be an issue.  If I recall, 
that's an older version of SLES and won't have syncfs support in glibc 
(you need 2.14+).  In newer versions of Ceph you can still use syncfs if 
your kernel is new enough (2.6.38+), but in 0.48 you need support for it 
in glibc too.  This will have a performance impact, especially if you 
have more than one OSD per server.
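
A quick way to check a node against the thresholds mentioned above (glibc 2.14+, kernel 2.6.38+) is sketched below. The helper names parse_version and syncfs_usable are hypothetical, and the needs_glibc flag only models the statement that 0.48 requires glibc support while newer Ceph versions do not. It does not inspect Ceph itself.

```python
import platform
import re


def parse_version(text: str) -> tuple:
    """Turn a string such as '2.6.38-rc1' into a tuple of its leading integers."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if not match:
        return ()
    return tuple(int(part) for part in match.group().split("."))


def syncfs_usable(glibc: str, kernel: str, needs_glibc: bool = True) -> bool:
    """Apply the thresholds from the thread: kernel 2.6.38+ and glibc 2.14+."""
    kernel_ok = parse_version(kernel) >= (2, 6, 38)
    glibc_ok = parse_version(glibc) >= (2, 14)
    return kernel_ok and (glibc_ok or not needs_glibc)


# Report for the machine this runs on; libc_ver() may return empty strings
# on systems that do not use glibc, which is treated as "not usable".
libc_name, libc_version = platform.libc_ver()
print("glibc:", libc_version or "unknown", "kernel:", platform.release())
print("syncfs usable (0.48-style check):",
      syncfs_usable(libc_version, platform.release()))
```

Running this on each OSD host shows whether the glibc requirement is the limiting factor before any upgrade is considered.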

>
> CEPH:
> OSDs are distributed over several machines.
> There is 1 MON and 1 MDS process on yet another machine.
>
> Replication of the data pool is set to 1.
> Underlying file systems for data are btrfs.

What kernel are you using?  If it's older, this could also be an issue.
We've seen pretty bad btrfs fragmentation on older kernels that seems
to be related to degradation in performance over time.

> Mount options  are only "rw,noatime".
> For each CEPH OSD, we use a RAM disk of 256MB for the journal.
> Package ceph has version 0.48-13.1, package ceph-fuse has version 0.48-13.1.
>
> HDFS:
> HDFS is distributed over the same machines.
> HDFS name node on yet another machine.
>
> Replication level is set to 1.
> HDFS block size is set to  64MB or even 512MB.
> Underlying file systems for data are btrfs.
> Mount options are only "rw,noatime".

The large block size may be an issue (at least with some of our default 
tunable settings).  You might want to try 4 or 16MB and see if it's any 
better or worse.
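
To get a feel for what the different sizes mean, here is a small back-of-the-envelope calculation of how many objects (or HDFS blocks) a 1TB data set is split into. It assumes binary units (1TB = 1024^4 bytes) and ignores per-file rounding, so the numbers are rough estimates only. The function name object_count is made up for this example.

```python
MB = 1024 * 1024
TB = 1024 ** 4


def object_count(data_bytes: int, object_size: int) -> int:
    """Number of objects (or HDFS blocks) needed, rounding up."""
    return -(-data_bytes // object_size)


for size_mb in (4, 16, 64, 512):
    print(f"{size_mb:>4} MB -> {object_count(TB, size_mb * MB):>7} objects")
```

Going from 512MB down to 4MB multiplies the object count by 128, which is worth keeping in mind when comparing runs.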

>
> Hadoop version is 1.0.3.
> Applied the CEPH patch for Hadoop that was generated with 0.20.205.0.
> The same maximum number of Hadoop map tasks has been used for HDFS and for CEPH FS.
>
> The same disk partitions are either formatted for HDFS or for CEPH usage.
>
> CPU usage in both cases is almost 100 percent on all data related nodes.

If you run sysprof, you can probably get an idea of where the time is 
being spent.  perf sort of works but doesn't seem to report ceph-osd 
symbols properly.

> There is enough memory on all nodes for the joint load of ceph-osd and Hadoop java processes.
>
> Best regards,
>
> Jutta Lachfeld.
>
> --
> jutta.lachfeld@ts.fujitsu.com, Fujitsu Technology Solutions PBG PDG ES&S SWE SOL 4, "Infrastructure Solutions", MchD 5B, Tel. ..49-89-3222-2705, Company Details: http://de.ts.fujitsu.com/imprint
>



Thread overview: 12+ messages
2012-12-13 14:54 Usage of CEPH FS versa HDFS for Hadoop: TeraSort benchmark performance comparison issue Lachfeld, Jutta
2012-12-13 17:27 ` Sage Weil
2012-12-13 17:41   ` Gregory Farnum
2012-12-13 20:23     ` Cameron Bahar
2012-12-13 20:27       ` Gregory Farnum
2012-12-13 20:33         ` Noah Watkins
2012-12-14 14:09           ` Lachfeld, Jutta
2013-01-05  0:17             ` Gregory Farnum
2013-01-09 15:11           ` Lachfeld, Jutta
2013-01-09 16:00             ` Noah Watkins
2013-01-10 21:42               ` Gregory Farnum
2012-12-14 14:53 ` Mark Nelson
