OOM's on the Ceph client machine


* OOM's on the Ceph client machine
@ 2010-10-13  0:31 Theodore Ts'o
  2010-10-13  2:30 ` Gregory Farnum
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Theodore Ts'o @ 2010-10-13  0:31 UTC
  To: ceph-devel; +Cc: mrubin

Hi there,

I've recently been playing with Ceph on an evaluation basis, and found
that I was able to fairly reliably induce an OOM kill on my Ceph
client machine by using FFSB with the following configuration file (see
attached, below).

I am using Ceph v0.21.3 plus a few commits that were on the testing
branch as of late September (commit ID 569d96b).  The Ceph cluster
contains 10 commodity servers with 5 disks configured for Ceph object
storage on each server (plus a separate spindle for the journal files),
so there are 5 instances of cosd on each OSD server.  The disks are
formatted using ext4 in no-journal mode.  I am using 3 servers for the
MDS and monitoring daemons, with the MDS and monitoring daemons
colocated on these 3 servers.  The machines all have gigabit ethernet
cards.

I've been running the client on a separate machine, and this is the
machine which has been dying with an OOM.

Any help, suggestions, or "hey stupid!  You screwed up XXXX in your
ceph.conf file" would be gratefully accepted.

Thanks,

	      	       	     	     - Ted

P.S.  In case people are curious, here are the results of the "boxacle"
(http://btrfs.boxacle.net) FFSB workloads that I ran.  The results are
fairly stable, except very often the 8 thread random_write workload is a
little hard to reproduce because it very often OOM's.  I've never gotten
a 32 thread random_write workload measurement, since it very reliably
OOM's on my client machine.  

Do these results look reasonable to you?  I confess I'm a little
disappointed with the sequential and random read numbers in particular.
And given 10 servers and fifty spindles, even the large_file_create
numbers seem surprisingly slow.

(Also, given that we are using gigabit ethernet in this evaluation
cluster, the 1GB/sec seems ridiculously high, which suggests to me that
the fsync request wasn't honored -- FFSB includes the fsync time when
calculating write bandwidth -- and it may explain why we are OOM'ing in
the random_write workload.)

                    1 thread        8 threads       32 threads
large_file_create   101 MB/sec      102 MB/sec      101 MB/sec
sequential_reads     35 MB/sec      113 MB/sec      114 MB/sec
random_reads       1.48 MB/sec     5.44 MB/sec     11.7 MB/sec
random_writes       923 MB/sec     1.09 GB/sec      (*)

(*) No measurement: this run reliably OOM's on the client machine.
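
As a rough sanity check on the random_writes row, the sketch below
compares the reported rates with the raw line rate of a gigabit link
and estimates how quickly unsent data would pile up on the client if
nothing throttled the writer.  The 32 GB figure is the machine memory
given later in this thread.  Treating 1 MB as 10^6 bytes, ignoring
protocol overhead, and ignoring any kernel dirty-page throttling are
simplifying assumptions, so this is an order-of-magnitude illustration
and not a diagnosis.

```python
# Back-of-the-envelope check: reported write rate vs. what a 1 Gb/s link can carry.
# Assumptions (not from the benchmark itself): 1 MB = 10**6 bytes, no protocol
# overhead, and no throttling of the writer by the kernel.

LINK_MB_PER_SEC = 1_000_000_000 / 8 / 1_000_000   # 1 Gb/s -> 125 MB/s raw
CLIENT_RAM_MB = 32 * 1000                          # 32 GB, as reported later in the thread


def backlog_growth(reported_mb_per_sec, link_mb_per_sec=LINK_MB_PER_SEC):
    """Rate (MB/s) at which unsent data would accumulate on the client."""
    return max(0.0, reported_mb_per_sec - link_mb_per_sec)


def seconds_to_fill(ram_mb, reported_mb_per_sec):
    """Seconds until the backlog equals ram_mb, or None if it never grows."""
    growth = backlog_growth(reported_mb_per_sec)
    return ram_mb / growth if growth > 0 else None


for label, rate in [("1 thread", 923.0), ("8 threads", 1090.0)]:
    ratio = rate / LINK_MB_PER_SEC
    fill = seconds_to_fill(CLIENT_RAM_MB, rate)
    print(f"{label}: {ratio:.1f}x line rate, "
          f"backlog grows {backlog_growth(rate):.0f} MB/s, "
          f"32 GB reached in ~{fill:.0f} s")
```

Both reported rates come out at roughly seven to nine times the raw
line rate, which is consistent with the suspicion that the data had not
actually reached the servers when FFSB stopped its clock.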

For comparison, here are the FFSB numbers on a single local ext4 disk
with no journal:

                    1 thread        8 threads       32 threads
large_file_create   75.5 MB/sec     72.2 MB/sec     74.2 MB/sec
sequential_reads    77.2 MB/sec     69.2 MB/sec     70.3 MB/sec
random_reads        734 KB/sec      537 KB/sec      537 KB/sec
random_writes       44.5 MB/sec     41.5 MB/sec     41.6 MB/sec

It's very possible that I may have done something wrong, so I've
enclosed the ceph.conf file I used for doing this test run....  please
let me know if there's something I've screwed up.

---------------------------- random_write.32.ffsb
# Large file random writes.
# 1024 files, 100MB per file.

time=300  # 5 min
alignio=1

[filesystem0]
	location=/mnt/ffsb1
	num_files=1024
	min_filesize=104857600  # 100 MB
	max_filesize=104857600
	reuse=1
[end0]

[threadgroup0]
	num_threads=32

	write_random=1
	write_weight=1

	write_size=5242880  # 5 MB
	write_blocksize=4096

	[stats]
		enable_stats=1
		enable_range=1

		msec_range    0.00      0.01
		msec_range    0.01      0.02
		msec_range    0.02      0.05
		msec_range    0.05      0.10
		msec_range    0.10      0.20
		msec_range    0.20      0.50
		msec_range    0.50      1.00
		msec_range    1.00      2.00
		msec_range    2.00      5.00
		msec_range    5.00     10.00
		msec_range   10.00     20.00
		msec_range   20.00     50.00
		msec_range   50.00    100.00
		msec_range  100.00    200.00
		msec_range  200.00    500.00
		msec_range  500.00   1000.00
		msec_range 1000.00   2000.00
		msec_range 2000.00   5000.00
		msec_range 5000.00  10000.00
	[end]
[end0]
------------------------------------------------ My ceph.conf file

;
; This is the test ceph configuration file
;
; [tytso:20101007.0813EDT]
;
; This file defines cluster membership, the various locations
; that Ceph stores data, and any other runtime options.
;
; If a 'host' is defined for a daemon, the start/stop script will
; verify that it matches the hostname (or else ignore it).  If it is
; not defined, it is assumed that the daemon is intended to start on
; the current host (e.g., in a setup with a startup.conf on each
; node).

; global
[global]
	user = root
	pid file = /disk/sda3/tmp/ceph/$name.pid
	logger dir = /disk/sda3/tmp/ceph
	log dir = /disk/sda3/tmp/ceph
	chdir = /disk/sda3

; monitors
;  You need at least one.  You need at least three if you want to
;  tolerate any node failures.  Always create an odd number.
[mon]
	mon data = /disk/sda3/cephmon/data/mon$id

	; logging, for debugging monitor crashes, in order of
	; their likelihood of being helpful :)
	;debug ms = 1
	;debug mon = 20
	;debug paxos = 20
	;debug auth = 20

[mon0]
	host = mach1
	mon addr = 1.2.3.4:6789

[mon1]
	host = mach2
	mon addr = 1.2.3.5:6789

[mon2]
	host = mach3
	mon addr = 1.2.3.6:6789

; mds
;  You need at least one.  Define two to get a standby.
[mds]
	; where the mds keeps its secret encryption keys
	keyring = /data/keyring.$name

	; mds logging to debug issues.
	;debug ms = 1
	;debug mds = 20

[mds.alpha]
	host = mach2

[mds.beta]
	host = mach3

[mds.gamma]
	host = mach1

; osd
;  You need at least one.  Two if you want data to be replicated.
;  Define as many as you like.
[osd]
	; osd logging to debug osd issues, in order of likelihood of being
	; helpful
	;debug ms = 1
	;debug osd = 20
	;debug filestore = 20
	;debug journal = 20

[osd0]
	host = mach10
	osd data = /disk/sdb3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdb3

[osd1]
	host = mach11
	osd data = /disk/sdb3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdb3

[osd2]
	host = mach12
	osd data = /disk/sdb3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdb3

[osd3]
	host = mach13
	osd data = /disk/sdb3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdb3

[osd4]
	host = mach14
	osd data = /disk/sdb3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdb3

[osd5]
	host = mach15
	osd data = /disk/sdb3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdb3

[osd6]
	host = mach16
	osd data = /disk/sdb3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdb3

[osd7]
	host = mach17
	osd data = /disk/sdb3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdb3

[osd8]
	host = mach18
	osd data = /disk/sdb3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdb3

[osd9]
	host = mach19
	osd data = /disk/sdb3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdb3

[osd10]
	host = mach10
	osd data = /disk/sdd3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdd3

[osd11]
	host = mach11
	osd data = /disk/sdd3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdd3

[osd12]
	host = mach12
	osd data = /disk/sdd3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdd3

[osd13]
	host = mach13
	osd data = /disk/sdd3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdd3

[osd14]
	host = mach14
	osd data = /disk/sdd3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdd3

[osd15]
	host = mach15
	osd data = /disk/sdd3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdd3

[osd16]
	host = mach16
	osd data = /disk/sdd3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdd3

[osd17]
	host = mach17
	osd data = /disk/sdd3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdd3

[osd18]
	host = mach18
	osd data = /disk/sdd3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdd3

[osd19]
	host = mach19
	osd data = /disk/sdd3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdd3

[osd20]
	host = mach10
	osd data = /disk/sde3/cephdata
	osd journal = /disk/sdc3/cephjnl.sde3

[osd21]
	host = mach11
	osd data = /disk/sde3/cephdata
	osd journal = /disk/sdc3/cephjnl.sde3

[osd22]
	host = mach12
	osd data = /disk/sde3/cephdata
	osd journal = /disk/sdc3/cephjnl.sde3

[osd23]
	host = mach13
	osd data = /disk/sde3/cephdata
	osd journal = /disk/sdc3/cephjnl.sde3

[osd24]
	host = mach14
	osd data = /disk/sde3/cephdata
	osd journal = /disk/sdc3/cephjnl.sde3

[osd25]
	host = mach15
	osd data = /disk/sde3/cephdata
	osd journal = /disk/sdc3/cephjnl.sde3

[osd26]
	host = mach16
	osd data = /disk/sde3/cephdata
	osd journal = /disk/sdc3/cephjnl.sde3

[osd27]
	host = mach17
	osd data = /disk/sde3/cephdata
	osd journal = /disk/sdc3/cephjnl.sde3

[osd28]
	host = mach18
	osd data = /disk/sde3/cephdata
	osd journal = /disk/sdc3/cephjnl.sde3

[osd29]
	host = mach19
	osd data = /disk/sde3/cephdata
	osd journal = /disk/sdc3/cephjnl.sde3

[osd30]
	host = mach10
	osd data = /disk/sdf3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdf3

[osd31]
	host = mach11
	osd data = /disk/sdf3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdf3

[osd32]
	host = mach12
	osd data = /disk/sdf3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdf3

[osd33]
	host = mach13
	osd data = /disk/sdf3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdf3

[osd34]
	host = mach14
	osd data = /disk/sdf3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdf3

[osd35]
	host = mach15
	osd data = /disk/sdf3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdf3

[osd36]
	host = mach16
	osd data = /disk/sdf3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdf3

[osd37]
	host = mach17
	osd data = /disk/sdf3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdf3

[osd38]
	host = mach18
	osd data = /disk/sdf3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdf3

[osd39]
	host = mach19
	osd data = /disk/sdf3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdf3

[osd40]
	host = mach10
	osd data = /disk/sdg3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdg3

[osd41]
	host = mach11
	osd data = /disk/sdg3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdg3

[osd42]
	host = mach12
	osd data = /disk/sdg3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdg3

[osd43]
	host = mach13
	osd data = /disk/sdg3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdg3

[osd44]
	host = mach14
	osd data = /disk/sdg3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdg3

[osd45]
	host = mach15
	osd data = /disk/sdg3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdg3

[osd46]
	host = mach16
	osd data = /disk/sdg3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdg3

[osd47]
	host = mach17
	osd data = /disk/sdg3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdg3

[osd48]
	host = mach18
	osd data = /disk/sdg3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdg3

[osd49]
	host = mach19
	osd data = /disk/sdg3/cephdata
	osd journal = /disk/sdc3/cephjnl.sdg3





* Re: OOM's on the Ceph client machine
  2010-10-13  0:31 OOM's on the Ceph client machine Theodore Ts'o
@ 2010-10-13  2:30 ` Gregory Farnum
  2010-10-13  3:34   ` Ted Ts'o
  2010-10-13  3:43 ` DongJin Lee
  2010-10-13 17:42 ` Sage Weil
  2 siblings, 1 reply; 13+ messages in thread
From: Gregory Farnum @ 2010-10-13  2:30 UTC
  To: Theodore Ts'o; +Cc: ceph-devel, mrubin

On Tue, Oct 12, 2010 at 5:31 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> Hi there,
>
> I've recently been playing with Ceph on an evaluation basis, and found
> that I was able to fairly reliably induce an OOM kill on my Ceph
> client machine by using FFSB with the following configuration file (see
> attached, below).
Does this mean you're using cfuse rather than the kernel client?
FUSE performance in general is fairly disappointing and our cfuse is
probably not as fast as the kernel client even so, though I don't
think it should be *that* unhappy in most environments.

> I am using Ceph v0.21.3 plus a few commits that were on the testing
> branch as of late September (commit ID 569d96b).  The Ceph cluster
> contains 10 commodity servers with 5 disks configured for Ceph object
> storage on each server (plus a separate spindle for the journal files),
> so there are 5 instances of cosd on each OSD server.  The disks are
> formatted using ext4 in no-journal mode.  I am using 3 servers for the
> MDS and monitoring daemons, with the MDS and monitoring daemons
> colocated on these 3 servers.  The machines all have gigabit ethernet
> cards.
So you have 5 journals running on one spindle? This could be the cause
of your slightly low sequential write performance; in the current
default configuration writes have to go to the journal before going to
the main disk and with multiple OSDs on one journal spindle they could
be getting in each other's way.
Also, how much memory do you have on these machines?

> P.S.  In case people are curious, here are the results of the "boxacle"
> (http://btrfs.boxacle.net) FFSB workloads that I ran.  The results are
> fairly stable, except very often the 8 thread random_write workload is a
> little hard to reproduce because it very often OOM's.  I've never gotten
> a 32 thread random_write workload measurement, since it very reliably
> OOM's on my client machine.
>
> Do these results look reasonable to you?  I confess I'm a little
> disappointed with the sequential and random read numbers in particular.
> And given 10 servers and fifty spindles, even the large_file_create
> numbers seem surprisingly slow.
I'm not familiar with FFSB and there doesn't seem to be any
easily-accessible documentation, can you tell us a little more about
how it works? For instance, how are the test files created (are they
written out for the reads and then tested? Does the random write
create the files as it goes, or are they pre-existing and then
overwritten)?
A few thoughts/wild guesses:
I'm not sure exactly what the limit is, but 114MB/s reads are close to
what you can get over a 1Gb link.
If single-threaded FFSB means there's only one request in-flight at a
time there may be a latency issue which is causing those 35MB/s reads.
The kernel client ought to be prefetching but maybe it's not doing so
properly, and I don't recall how much prefetching cfuse is actually
capable of. Sage can say more on this.
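
To put a number on "close to what you can get over a 1Gb link", here is
a small estimate of the best-case TCP payload rate on gigabit Ethernet.
It assumes a standard 1500-byte MTU, IPv4, TCP with the timestamp
option, no jumbo frames, and that the benchmark's MB means 10^6 bytes.
None of these settings are stated in the thread, so treat the result as
an approximate ceiling.

```python
# Estimate the best-case TCP payload rate on gigabit Ethernet.
# Assumptions: 1500-byte MTU, IPv4, TCP timestamps enabled, no jumbo
# frames, and 1 MB = 10**6 bytes.

LINK_BITS_PER_SEC = 1_000_000_000
MTU = 1500
IP_HEADER = 20
TCP_HEADER = 20
TCP_TIMESTAMP_OPTION = 12
# Per-frame Ethernet cost on the wire: preamble + SFD (8), header (14),
# frame check sequence (4), inter-frame gap (12).
ETHERNET_OVERHEAD = 8 + 14 + 4 + 12


def max_payload_mb_per_sec():
    """Payload bytes per second, in MB/s, when the link is fully utilised."""
    payload = MTU - IP_HEADER - TCP_HEADER - TCP_TIMESTAMP_OPTION   # 1448 bytes
    on_wire = MTU + ETHERNET_OVERHEAD                                # 1538 bytes
    return LINK_BITS_PER_SEC / 8 * payload / on_wire / 1_000_000


ceiling = max_payload_mb_per_sec()
for label, measured in [("sequential_reads, 32 threads", 114.0),
                        ("large_file_create, 8 threads", 102.0)]:
    print(f"{label}: {measured:.0f} MB/s is "
          f"{measured / ceiling:.0%} of ~{ceiling:.1f} MB/s")
```

Under these assumptions the ceiling is a little under 118 MB/s, so the
multi-threaded read numbers are essentially limited by the client's
network link and not by the cluster.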

> (Also, given that we are using gigabit ethernet in this evaluation
> cluster, the 1GB/sec seems ridiculously high, which suggests to me that
> the fsync request wasn't honored -- FFSB includes the fsync time when
> calculating write bandwidth -- and it may explain why we are OOM'ing in
> the random_write workload.)
Err, yes. Extremely odd. In glancing over cfuse this looks like it's
working properly, but if you confirm that's what you're using I'll
trace it.
-Greg


* Re: OOM's on the Ceph client machine
  2010-10-13  2:30 ` Gregory Farnum
@ 2010-10-13  3:34   ` Ted Ts'o
  2010-10-13 17:29     ` Sage Weil
  0 siblings, 1 reply; 13+ messages in thread
From: Ted Ts'o @ 2010-10-13  3:34 UTC
  To: Gregory Farnum; +Cc: ceph-devel, mrubin

On Tue, Oct 12, 2010 at 07:30:48PM -0700, Gregory Farnum wrote:
> Does this mean you're using cfuse rather than the kernel client?
> FUSE performance in general is fairly disappointing and our cfuse is
> probably not as fast as the kernel client even so, though I don't
> think it should be *that* unhappy in most environments.

No, I'm using the kernel client (from 2.6.34).  Specifically, I'm
doing a "modprobe ceph; mount -t ceph 1.2.3.4:/ /mnt"

Sorry, I should have mentioned that.  I can use a more recent kernel
(i.e., 2.6.36-rc7) if that's likely to help.

> So you have 5 journals running on one spindle? This could be the cause
> of your slightly low sequential write performance; in the current
> default configuration writes have to go to the journal before going to
> the main disk and with multiple OSDs on one journal spindle they could
> be getting in each other's way.

Hmm, what do you recommend, then?  The problem is if the journal only
needs to be a few gigabytes (I used a 5GB file), using an entire 1T or
2T disk just so each of the journals can have their own spindle is
pretty wasteful.

> Also, how much memory do you have on these machines?

32GB

> I'm not familiar with FFSB and there doesn't seem to be any
> easily-accessible documentation, can you tell us a little more about
> how it works? For instance, how are the test files created (are they
> written out for the reads and then tested? Does the random write
> create the files as it goes, or are they pre-existing and then
> overwritten)?

There's a quick explanation of these workloads at
http://btrfs.boxacle.net (I'm using the raid configuration FFSB
files), but essentially, the large file create test is creating 100MB
files as quickly as possible.  In the rest of the tests we create 1024
100MB files, and then try (a) reading from them sequentially as
quickly as possible, (b) picking a random file, and a random offset,
and read 5MB, and repeat, (c) picking a random file, and a random
offset, and write 5MB, and repeat.  The creation of the 1024 100MB
files (if necessary; the tests will reuse the previously created set
of 100MB files) is not counted in the benchmark time.  So in the last
three tests there is no block allocation; just the time it takes to
read or overwrite existing data blocks.
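
The access pattern of workload (c) can be sketched as follows.  This
only models the pattern as described above (pick a file, pick an
offset, write 5MB in 4096-byte blocks) using the values from the
random_write FFSB config; it does no I/O and is not taken from FFSB's
source.  In particular, aligning offsets to the block size is an
assumed reading of "alignio=1".

```python
import random

FILE_SIZE = 104857600      # min_filesize / max_filesize in the FFSB config (100 MB)
NUM_FILES = 1024           # num_files
WRITE_SIZE = 5242880       # write_size (5 MB per operation)
BLOCK_SIZE = 4096          # write_blocksize


def random_write_ops(count, rng):
    """Yield (file_index, offset, block_writes) for `count` operations.

    Assumption: offsets are aligned to BLOCK_SIZE and each operation
    stays entirely inside its file.
    """
    last_start_block = (FILE_SIZE - WRITE_SIZE) // BLOCK_SIZE
    for _ in range(count):
        file_index = rng.randrange(NUM_FILES)
        offset = rng.randint(0, last_start_block) * BLOCK_SIZE
        yield file_index, offset, WRITE_SIZE // BLOCK_SIZE


rng = random.Random(0)   # fixed seed so the sketch is repeatable
for file_index, offset, blocks in random_write_ops(3, rng):
    print(f"file {file_index}: {blocks} writes of {BLOCK_SIZE} bytes "
          f"starting at offset {offset}")
print(f"data set: {NUM_FILES * FILE_SIZE / 2**30:.0f} GiB")
```

Each operation is 1280 block-sized writes, and the data set being
overwritten is 100 GiB, well beyond the client's page cache.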

Note BTW that this is not intrinsic to FFSB; FFSB stands for the
"flexible filesystem benchmark" system.  All of this is configurable
using the ffsb config files.  I'm just reusing the "boxacle workloads"
just because they are convenient, and I'm familiar with how they work
on local disk filesystems.  They're used for example for benchmarking
ext4 here: http://free.linux.hp.com/~enw/ext4/2.6.35/, and for btrfs
here: http://btrfs.boxacle.net.  (And when the IBM folks have done
btrfs benchmarks, since they are so detailed and the
hardware/configurations are so well described, I've also used them to
help improve ext4's performance.)

> A few thoughts/wild guesses:
> I'm not sure exactly what the limit is, but 114MB/s reads are close to
> what you can get over a 1Gb link.
> If single-threaded FFSB means there's only one request in-flight at a
> time there may be a latency issue which is causing those 35MB/s reads.
> The kernel client ought to be prefetching but maybe it's not doing so
> properly, and I don't recall how much prefetching cfuse is actually
> capable of. Sage can say more on this.

I'm not using cfuse; I'm using the in-kernel Ceph module.  As far as
network latency is concerned, ping RTT time is under 0.25ms.

And sure, maybe it's a prefetching issue --- but in that case I would
have expected the 8 thread case to do better than 2x the 1 thread case.

     	      	       	     	      	     - Ted


* Re: OOM's on the Ceph client machine
  2010-10-13  3:34   ` Ted Ts'o
@ 2010-10-13 17:29     ` Sage Weil
  2010-10-14  0:03       ` Ted Ts'o
  0 siblings, 1 reply; 13+ messages in thread
From: Sage Weil @ 2010-10-13 17:29 UTC
  To: Ted Ts'o; +Cc: Gregory Farnum, ceph-devel, mrubin

Hi Ted,

On Tue, 12 Oct 2010, Ted Ts'o wrote:
> On Tue, Oct 12, 2010 at 07:30:48PM -0700, Gregory Farnum wrote:
> > Does this mean you're using cfuse rather than the kernel client?
> > FUSE performance in general is fairly disappointing and our cfuse is
> > probably not as fast as the kernel client even so, though I don't
> > think it should be *that* unhappy in most environments.
> 
> No, I'm using the kernel client (from 2.6.34).  Specifically, I'm
> doing a "modprobe ceph; mount -t ceph 1.2.3.4:/ /mnt"
> 
> Sorry, I should have mentioned that.  I can use a more recent kernel
> (i.e., 2.6.36-rc7) if that's likely to help.

There have been a number of memory leak fixes since then, at least one of 
which may be causing your problem (it was caused by an uninitialized 
variable and didn't usually trigger for us, but may in your environment).  
Can you retry with the latest mainline?  The benchmark completes without 
problems in my test environment.

> > So you have 5 journals running on one spindle? This could be the cause
> > of your slightly low sequential write performance; in the current
> > default configuration writes have to go to the journal before going to
> > the main disk and with multiple OSDs on one journal spindle they could
> > be getting in each other's way.
> 
> Hmm, what do you recommend, then?  The problem is if the journal only
> needs to be a few gigabytes (I used a 5GB file), using an entire 1T or
> 2T disk just so each of the journals can have their own spindle is
> pretty wasteful.

If fsync on a single file in journal-less ext4 doesn't do any extra work, 
I would just put the (preallocated) journal file together with the data on 
each disk.  Usually that's bad news because of the journal flushing, but 
you shouldn't have that problem.  Alternatively, you could use a small 
separate partition on the same spindle.  

sage


* Re: OOM's on the Ceph client machine
  2010-10-13 17:29     ` Sage Weil
@ 2010-10-14  0:03       ` Ted Ts'o
  2010-10-14  3:43         ` Sage Weil
  2010-10-21 20:36         ` Ted Ts'o
  0 siblings, 2 replies; 13+ messages in thread
From: Ted Ts'o @ 2010-10-14  0:03 UTC
  To: Sage Weil; +Cc: Gregory Farnum, ceph-devel, mrubin

On Wed, Oct 13, 2010 at 10:29:43AM -0700, Sage Weil wrote:
> There have been a number of memory leak fixes since then, at least one of 
> which may be causing your problem (it was caused by an uninitialized 
> variable and didn't usually trigger for us, but may in your environment).  
> Can you retry with the latest mainline?  The benchmark completes without 
> problems in my test environment.

Sure.  This may have to wait until early next week for me to retry
with the latest mainline, but I'll definitely move to 2.6.36 in the
near future.

> If fsync on a single file in journal-less ext4 doesn't do any extra work, 
> I would just put the (preallocated) journal file together with the data on 
> each disk.  Usually that's bad news because of the journal flushing, but 
> you shouldn't have that problem.  Alternatively, you could use a small 
> separate partition on the same spindle.

I'm currently reformatting the Ceph cluster to put the journal for
/dev/sdX3 on /disk/sdX3/ceph.journal, so I'll try that test first, and
see what difference that makes.  That way I can make one change at a
time and see what difference each change in my cluster configuration
actually gives me.
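
For reference, the [osdN] sections for that layout could be generated
with a short script like the one below.  It mirrors the host and disk
numbering of the ceph.conf posted at the start of this thread and uses
the /disk/sdX3/ceph.journal path mentioned above; the script itself is
only an illustration and was not part of the original setup.

```python
# Generate [osdN] sections with each journal on the same disk as its data.
# Host and partition numbering follow the ceph.conf earlier in the thread;
# the helper names are made up for this sketch.

HOSTS = [f"mach{n}" for n in range(10, 20)]          # mach10 .. mach19
DATA_PARTITIONS = ["sdb3", "sdd3", "sde3", "sdf3", "sdg3"]


def osd_section(osd_id):
    """Return the config text for one OSD."""
    host = HOSTS[osd_id % len(HOSTS)]
    part = DATA_PARTITIONS[osd_id // len(HOSTS)]
    return (f"[osd{osd_id}]\n"
            f"\thost = {host}\n"
            f"\tosd data = /disk/{part}/cephdata\n"
            f"\tosd journal = /disk/{part}/ceph.journal\n")


def osd_config():
    """Return all 50 OSD sections, separated by blank lines."""
    total = len(HOSTS) * len(DATA_PARTITIONS)
    return "\n".join(osd_section(i) for i in range(total))


print(osd_section(0))
print(osd_section(49))
```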

BTW, this might be a good time to report a tiny little problem which I
found.  If the journal file doesn't exist, then when you run mkcephfs,
cosd will attempt to create the file for you.  But it creates it as a
4k file, and then it loops forever in FileJournal::wrap_read_bl() on
line 808, because get_top() and header.max_size are both 4096, and
it results in it being an expensive while (1) loop.  This completely
stalls the mkcephfs operation, and it took me a while to debug.

It might be nice if cosd either (a) failed completely if the journal
file is missing, or too small, or (b) if cosd is started in mkfs mode,
and the journal file does not exist, perhaps it should create a
journal file with some suitable default size.
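
Until cosd does one of these itself, a pre-flight check run before
mkcephfs could catch the problem.  The sketch below is a hypothetical
helper, not part of Ceph: the function names and the 1 GiB threshold
are invented for illustration, and the only thing established here is
that a 4096-byte journal file is too small.

```python
import os

# Hypothetical lower bound; the thread only establishes that a 4096-byte
# journal file is too small, not what the real minimum is.
MIN_JOURNAL_BYTES = 1 << 30   # 1 GiB, chosen arbitrarily for this sketch


def check_journal(path, min_bytes=MIN_JOURNAL_BYTES):
    """Return None if the journal file looks usable, else a reason string."""
    if not os.path.isfile(path):
        return f"{path}: journal file does not exist"
    size = os.path.getsize(path)
    if size < min_bytes:
        return f"{path}: journal is {size} bytes, expected at least {min_bytes}"
    return None


def check_all(paths, min_bytes=MIN_JOURNAL_BYTES):
    """Collect the problems for every journal path; an empty list means OK."""
    results = (check_journal(p, min_bytes) for p in paths)
    return [problem for problem in results if problem]


if __name__ == "__main__":
    for problem in check_all(["/disk/sdb3/ceph.journal"]):
        print(problem)
```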

For stuff like this, I assume the right thing to do is to just open a
bug in tracker.newdream.net?  Are there any project-specific customs I
should be aware of?

Thanks,

						- Ted


* Re: OOM's on the Ceph client machine
  2010-10-14  0:03       ` Ted Ts'o
@ 2010-10-14  3:43         ` Sage Weil
  2010-10-21 20:36         ` Ted Ts'o
  1 sibling, 0 replies; 13+ messages in thread
From: Sage Weil @ 2010-10-14  3:43 UTC
  To: Ted Ts'o; +Cc: Gregory Farnum, ceph-devel, mrubin, martin

On Wed, 13 Oct 2010, Ted Ts'o wrote:
> On Wed, Oct 13, 2010 at 10:29:43AM -0700, Sage Weil wrote:
> > There have been a number of memory leak fixes since then, at least one of 
> > which may be causing your problem (it was caused by an uninitialized 
> > variable and didn't usually trigger for us, but may in your environment).  
> > Can you retry with the latest mainline?  The benchmark completes without 
> > problems in my test environment.
> 
> Sure.  This may have to wait until early next week for me to retry
> with the latest mainline, but I'll definitely move to 2.6.36 in the
> near future.
> 
> > If fsync on a single file in journal-less ext4 doesn't do any extra work, 
> > I would just put the (preallocated) journal file together with the data on 
> > each disk.  Usually that's bad news because of the journal flushing, but 
> > you shouldn't have that problem.  Alternatively, you could use a small 
> > separate partition on the same spindle.
> 
> I'm currently reformatting the Ceph cluster to put the journal for
> /dev/sdX3 on /disk/sdX3/ceph.journal, so I'll try that test first, and
> see what difference that makes.  That way I can make one change at a
> time and see what difference each change in my cluster configuration
> actually gives me.

Sounds good!

> BTW, this might be a good time to report a tiny little problem which I
> found.  If the journal file doesn't exist, then when you run mkcephfs,
> cosd will attempt to create the file for you.  But it creates it as a
> 4k file, and then it loops forever in FileJournal::wrap_read_bl() on
> line 808, because get_top() and header.max_size are both 4096, and
> it results in it being an expensive while (1) loop.  This completely
> stalls the mkcephfs operation, and it took me a while to debug.

That likely explains the hang Martin saw a few days back, and why we 
haven't hit it (our journal files usually already exist).  Thanks for 
tracking that down!
 
> It might be nice if cosd either (a) failed completely if the journal
> file is missing, or too small, or (b) if cosd is started in mkfs mode,
> and the journal file does not exist, perhaps it should create a
> journal file with some suitable default size.
> 
> For stuff like this, I assume the right thing to do is to just open a
> bug in tracker.newdream.net?  Are there any project-specific customs I
> should be aware of?

Yeah, entering it directly in the tracker is nice, although just reporting 
it here is fine as well.  I went ahead and added this one (#487).

Thanks!
sage


* Re: OOM's on the Ceph client machine
  2010-10-14  0:03       ` Ted Ts'o
  2010-10-14  3:43         ` Sage Weil
@ 2010-10-21 20:36         ` Ted Ts'o
  2010-10-21 21:46           ` Sage Weil
  1 sibling, 1 reply; 13+ messages in thread
From: Ted Ts'o @ 2010-10-21 20:36 UTC
  To: Sage Weil; +Cc: Gregory Farnum, ceph-devel, mrubin

On Wed, Oct 13, 2010 at 08:03:06PM -0400, Ted Ts'o wrote:
> On Wed, Oct 13, 2010 at 10:29:43AM -0700, Sage Weil wrote:
> > There have been a number of memory leak fixes since then, at least one of 
> > which may be causing your problem (it was caused by an uninitialized 
> > variable and didn't usually trigger for us, but may in your environment).  
> > Can you retry with the latest mainline?  The benchmark completes without 
> > problems in my test environment.
> 
> Sure.  This may have to wait until early next week for me to retry
> with the latest mainline, but I'll definitely move to 2.6.36 in the
> near future.

Just to give you an update.  I've tried to use 2.6.34 with nearly all
of the commits that apply to fs/ceph between 2.6.34 and 2.6.36-rc7
both with the 0.21 version of Ceph servers, as well as 0.22 plus some
testing bug fixes (up to fd42c852).  In both cases, using the newer Ceph
client causes the FFSB process to hang when it tries running the sync
command.  The dmesg is filled with lines like this:

[ 4756.662789] ceph: skipping osd40 192.168.11.8:6808 seq 2495, expected 2496
[ 4756.662832] ceph: skipping osd7 192.168.12.18:6800 seq 4274, expected 4275
[ 4756.662843] ceph: skipping osd14 192.168.12.15:6802 seq 4124, expected 4125
[ 4756.662853] ceph: skipping osd38 192.168.11.3:6806 seq 3289, expected 3290
[ 4756.663093] ceph: skipping osd7 192.168.12.18:6800 seq 4275, expected 4276
[ 4756.882336] ceph: skipping osd7 192.168.12.18:6800 seq 4276, expected 4277
[ 4757.996962] ceph: skipping osd40 192.168.11.8:6808 seq 2496, expected 2497
[ 4757.997267] ceph: skipping osd7 192.168.12.18:6800 seq 4277, expected 4278
[ 4758.000149] ceph: skipping osd38 192.168.11.3:6806 seq 3290, expected 3291
[ 4758.003755] ceph: skipping osd14 192.168.12.15:6802 seq 4125, expected 4126
[ 4758.018078] ceph: skipping osd14 192.168.12.15:6802 seq 4126, expected 4127
[ 4758.018787] ceph: skipping osd7 192.168.12.18:6800 seq 4278, expected 4279
[ 4758.020263] ceph: skipping osd40 192.168.11.8:6808 seq 2497, expected 2498
[ 4758.020370] ceph: skipping osd10 192.168.11.8:6802 seq 946, expected 947
[ 4761.670848] ceph:  tid 4422463 timed out on osd7, will reset osd
[ 4761.813068] ceph:  tid 4480042 timed out on osd40, will reset osd
[ 4761.956584] ceph:  tid 4487615 timed out on osd14, will reset osd
[ 4762.102343] ceph:  tid 4645028 timed out on osd38, will reset osd
[ 4762.249425] ceph: skipping osd10 192.168.11.8:6802 seq 947, expected 948
[ 4767.257944] ceph: skipping osd10 192.168.11.8:6802 seq 948, expected 949
[ 4768.047058] ceph: skipping osd10 192.168.11.8:6802 seq 949, expected 950
[ 4772.260309] ceph:  tid 4817033 timed out on osd10, will reset osd
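
A log like this is easier to read once it is tallied per OSD.  The
script below is a minimal sketch that counts the "skipping" and "timed
out" lines and records the gap between the received and expected
sequence numbers.  The regular expressions are written against the
lines shown here and may not match other kernel versions.

```python
import re
from collections import Counter

SKIP_RE = re.compile(r"ceph: skipping (osd\d+) \S+ seq (\d+), expected (\d+)")
TIMEOUT_RE = re.compile(r"ceph:\s+tid (\d+) timed out on (osd\d+), will reset osd")


def summarize(lines):
    """Count skipped-message and timeout lines per OSD, and tally seq gaps."""
    skips, timeouts, gaps = Counter(), Counter(), Counter()
    for line in lines:
        m = SKIP_RE.search(line)
        if m:
            osd, seq, expected = m.group(1), int(m.group(2)), int(m.group(3))
            skips[osd] += 1
            gaps[expected - seq] += 1
            continue
        m = TIMEOUT_RE.search(line)
        if m:
            timeouts[m.group(2)] += 1
    return skips, timeouts, gaps


# A few lines copied from the dmesg output above.
sample = """\
[ 4756.662789] ceph: skipping osd40 192.168.11.8:6808 seq 2495, expected 2496
[ 4756.662832] ceph: skipping osd7 192.168.12.18:6800 seq 4274, expected 4275
[ 4756.663093] ceph: skipping osd7 192.168.12.18:6800 seq 4275, expected 4276
[ 4761.670848] ceph:  tid 4422463 timed out on osd7, will reset osd
""".splitlines()

skips, timeouts, gaps = summarize(sample)
print(dict(skips), dict(timeouts), dict(gaps))
```

In the excerpt above every skipped message carries a sequence number
exactly one lower than the expected one, and each OSD that shows up in
the "skipping" lines is later reset after a timeout.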

It's very possible (likely, even) that this was caused by my backwards
porting of the various ceph patches to 2.6.34.  Hopefully later today
I'll be able to do an actual test run using 2.6.36, without needing to
use "git cherry-pick" on some 170 odd patches.  For a variety of
reasons it was easier for me to use 2.6.34 as a base (drivers, patches
that support dmesg dumps over the network after kernel panic/oops, and
other stuff needed for our environment) but I should be able to move
to 2.6.36 soon.

I also ran into strange problems (which I haven't tried to
characterize accurately enough for a bug report) when using the 2.6.34
client against the new 0.22 release.  Is this expected to work?  If
so, I can try to more accurately characterize what was going on.  

Also, it seems that there are issues moving back and forth between
0.21 and 0.22 without reformatting the ceph client.  Is that accurate?
It looked like when I tried going back to 0.21, I needed to rerun
mkcephfs, or else the 0.21 cmon, cosd or cmds daemons would die with
various failures when they saw the 0.22 data files.  That's not
surprising, but it does make it a little harder for me to go back and
forth between 0.21 and 0.22 for the purpose of differential debugging.

If I can get something stable working with 0.22 against either the
2.6.34 or 2.6.36 Ceph client, I'll drop my efforts using 0.21.

Thanks, regards,

	      	      	       	    	    - Ted


* Re: OOM's on the Ceph client machine
  2010-10-21 20:36         ` Ted Ts'o
@ 2010-10-21 21:46           ` Sage Weil
  2010-10-21 22:28             ` Ted Ts'o
  0 siblings, 1 reply; 13+ messages in thread
From: Sage Weil @ 2010-10-21 21:46 UTC
  To: Ted Ts'o; +Cc: Gregory Farnum, ceph-devel, mrubin

On Thu, 21 Oct 2010, Ted Ts'o wrote:
> On Wed, Oct 13, 2010 at 08:03:06PM -0400, Ted Ts'o wrote:
> > On Wed, Oct 13, 2010 at 10:29:43AM -0700, Sage Weil wrote:
> > > There have been a number of memory leak fixes since then, at least one of 
> > > which may be causing your problem (it was caused by an uninitialized 
> > > variable and didn't usually trigger for us, but may in your environment).  
> > > Can you retry with the latest mainline?  The benchmark completes without 
> > > problems in my test environment.
> > 
> > Sure.  This may have to wait until early next week for me to retry
> > with the latest mainline, but I'll definitely move to 2.6.36 in the
> > near future.
> 
> Just to give you an update.  I've tried to use 2.6.34 with nearly all
> of the commits that apply to fs/ceph between 2.6.34 and 2.6.36-rc7
> both with the 0.21 version of Ceph servers, as well as 0.22 plus some
> testing bug fixes (up to fd42c852).  In both cases, using newer Ceph
> client causes the FFSB process to hang when it tries running the sync
> command.  The dmesg is filled with lines like this:
> 
> [ 4756.662789] ceph: skipping osd40 192.168.11.8:6808 seq 2495, expected 2496
> [ 4756.662832] ceph: skipping osd7 192.168.12.18:6800 seq 4274, expected 4275
> [ 4756.662843] ceph: skipping osd14 192.168.12.15:6802 seq 4124, expected 4125
> [ 4756.662853] ceph: skipping osd38 192.168.11.3:6806 seq 3289, expected 3290
> [ 4756.663093] ceph: skipping osd7 192.168.12.18:6800 seq 4275, expected 4276
> [ 4756.882336] ceph: skipping osd7 192.168.12.18:6800 seq 4276, expected 4277
> [ 4757.996962] ceph: skipping osd40 192.168.11.8:6808 seq 2496, expected 2497
> [ 4757.997267] ceph: skipping osd7 192.168.12.18:6800 seq 4277, expected 4278
> [ 4758.000149] ceph: skipping osd38 192.168.11.3:6806 seq 3290, expected 3291
> [ 4758.003755] ceph: skipping osd14 192.168.12.15:6802 seq 4125, expected 4126
> [ 4758.018078] ceph: skipping osd14 192.168.12.15:6802 seq 4126, expected 4127
> [ 4758.018787] ceph: skipping osd7 192.168.12.18:6800 seq 4278, expected 4279
> [ 4758.020263] ceph: skipping osd40 192.168.11.8:6808 seq 2497, expected 2498
> [ 4758.020370] ceph: skipping osd10 192.168.11.8:6802 seq 946, expected 947
> [ 4761.670848] ceph:  tid 4422463 timed out on osd7, will reset osd
> [ 4761.813068] ceph:  tid 4480042 timed out on osd40, will reset osd
> [ 4761.956584] ceph:  tid 4487615 timed out on osd14, will reset osd
> [ 4762.102343] ceph:  tid 4645028 timed out on osd38, will reset osd
> [ 4762.249425] ceph: skipping osd10 192.168.11.8:6802 seq 947, expected 948
> [ 4767.257944] ceph: skipping osd10 192.168.11.8:6802 seq 948, expected 949
> [ 4768.047058] ceph: skipping osd10 192.168.11.8:6802 seq 949, expected 950
> [ 4772.260309] ceph:  tid 4817033 timed out on osd10, will reset osd
> 
> It's very possible (likely, even) that this was caused by my backwards
> porting of the various ceph patches to 2.6.34.  Hopefully later today
> I'll be able to do an actual test run using 2.6.36, without needing to
> use "git cherry-pick" on some 170 odd patches.  For a variety of
> reasons it was easier for me to use 2.6.34 as a base (drivers, patches
> that support dmesg dumps over the network after kernel panic/oops, and
> other stuff needed for our environment) but I should be able to move
> to 2.6.36 soon.

There is a ceph-client-standalone.git that has just the module source, 
with backport #ifdefs through 2.6.27 (see the master-backport or 
unstable-backport branches).  It isn't well tested, but may be worth a 
shot if 2.6.36 is problematic for other reasons.

Unfortunately it's not obvious to me from dmesg where the problem is, 
other than that it looks like some of the osds aren't responding (but are 
apparently still up).  There is a known regression in v0.22 that can cause 
crashes in the osd cluster; we should have a fix pushed later today.  
That would look a bit different, though (you'd see osd down messages).  
I'll post an update (and probably v0.22.1) when that's been tested.
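
One rough way to see which osds are affected is to tally the "skipping"
and "timed out" lines per osd.  The following is only a minimal Python
sketch, assuming the two dmesg line shapes quoted above; the function
name summarize_dmesg and the SAMPLE list are made up for illustration
and are not part of any Ceph tool.

```python
import re
from collections import Counter

# Patterns for the two kernel-client message shapes quoted in this thread.
SKIP_RE = re.compile(r"ceph: skipping (osd\d+) \S+ seq (\d+), expected (\d+)")
TIMEOUT_RE = re.compile(r"ceph:\s+tid (\d+) timed out on (osd\d+), will reset osd")


def summarize_dmesg(lines):
    """Return (skips, timeouts): per-osd counts of each message type."""
    skips, timeouts = Counter(), Counter()
    for line in lines:
        m = SKIP_RE.search(line)
        if m:
            skips[m.group(1)] += 1
            continue
        m = TIMEOUT_RE.search(line)
        if m:
            timeouts[m.group(2)] += 1
    return skips, timeouts


# Hypothetical input: a few lines copied from the dmesg excerpt above.
SAMPLE = [
    "[ 4756.662789] ceph: skipping osd40 192.168.11.8:6808 seq 2495, expected 2496",
    "[ 4756.662832] ceph: skipping osd7 192.168.12.18:6800 seq 4274, expected 4275",
    "[ 4761.670848] ceph:  tid 4422463 timed out on osd7, will reset osd",
]

if __name__ == "__main__":
    skips, timeouts = summarize_dmesg(SAMPLE)
    for osd in sorted(set(skips) | set(timeouts)):
        print(osd, "skipped:", skips[osd], "timed out:", timeouts[osd])
```

An osd that shows up with many skipped sequence numbers followed by a
timeout is a reasonable first place to look in the osd-side logs.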

> I also ran into strange problems (which I haven't tried to
> characterize accurately enough for a bug report) when using the 2.6.34
> client against the new 0.22 release.  Is this expected to work?  If
> so, I can try to more accurately characterize what was going on.  

The vanilla 2.6.34 client you mean?  There have been a range of bugs fixed 
since then (enough for me to lose track of), so I wouldn't be surprised to 
see problems.  And it's not something we've been testing.  That said, the 
basics should work.

> Also, it seems that there are issues moving back and forth between
> 0.21 and 0.22 without reformatting the ceph client.  Is that accurate?

Yeah, that isn't expected to work.  In general, rolling backward isn't 
supported.  In this case we forgot to add an incompat flag to generate a 
nice error message to that effect.

sage

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: OOM's on the Ceph client machine
  2010-10-21 21:46           ` Sage Weil
@ 2010-10-21 22:28             ` Ted Ts'o
  2010-10-21 22:44               ` Sage Weil
  0 siblings, 1 reply; 13+ messages in thread
From: Ted Ts'o @ 2010-10-21 22:28 UTC (permalink / raw)
  To: Sage Weil; +Cc: Gregory Farnum, ceph-devel, mrubin

On Thu, Oct 21, 2010 at 02:46:11PM -0700, Sage Weil wrote:
> 
> Unfortunately it's not obvious to me from dmesg where the problem is, 
> other than that it looks like some of the osds aren't responding (but are 
> apparently still up).  There is a known regression in v0.22 that can cause 
> crashes in the osd cluster; we should have a fix pushed later today.  
> That would look a bit different, though (you'd see osd down messages).  
> I'll post an update (and probably v0.22.1) when that's been tested.

I looked earlier in the logs, and I do see some "osd down", "osd up",
and "osd socket closed" messages.  So it looks like the v0.22
regression you mentioned.  I'll wait for the git update and try
rebuilding the server.  Thanks!!

> > Also, it seems that there are issues moving back and forth between
> > 0.21 and 0.22 without reformatting the ceph client.  Is that accurate?
> 
> Yeah, that isn't expected to work.  In general, rolling backward isn't 
> supported.  In this case we forgot to add an incompat flag to generate a 
> nice error message to that effect.

Is rolling forward between 0.21 and 0.22 expected to work?  Or should
I just do a mkcephfs just to be safe?  It's not a data preservation
issue, but rather the time it takes to do a mkcephfs.  Random
question: how do you feel about using Python?  Trying to make a
version of mkcephfs that runs in parallel would probably be easier if
we could port the shell script to a python script.  I don't think
there are any Python dependencies in Ceph right now, though.

      	      	     		     	  - Ted

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: OOM's on the Ceph client machine
  2010-10-21 22:28             ` Ted Ts'o
@ 2010-10-21 22:44               ` Sage Weil
  0 siblings, 0 replies; 13+ messages in thread
From: Sage Weil @ 2010-10-21 22:44 UTC (permalink / raw)
  To: Ted Ts'o; +Cc: Gregory Farnum, ceph-devel, mrubin

On Thu, 21 Oct 2010, Ted Ts'o wrote:
> On Thu, Oct 21, 2010 at 02:46:11PM -0700, Sage Weil wrote:
> > 
> > Unfortunately it's not obvious to me from dmesg where the problem is, 
> > other than that it looks like some of the osds aren't responding (but are 
> > apparently still up).  There is a known regression in v0.22 that can cause 
> > crashes in the osd cluster; we should have a fix pushed later today.  
> > That would look a bit different, though (you'd see osd down messages).  
> > I'll post an update (and probably v0.22.1) when that's been tested.
> 
> I looked earlier in the logs, and I do see some "osd down", "osd up",
> and "osd socket closed" messages.  So it looks like the v0.22
> regression you mentioned.  I'll wait for the git update and try
> rebuilding the server.  Thanks!!

Phew!  :)

> > > Also, it seems that there are issues moving back and forth between
> > > 0.21 and 0.22 without reformatting the ceph client.  Is that accurate?
> > 
> > Yeah, that isn't expected to work.  In general, rolling backward isn't 
> > supported.  In this case we forgot to add an incompat flag to generate a 
> > nice error message to that effect.
> 
> Is rolling forward between 0.21 and 0.22 expected to work?  Or should
> I just do a mkcephfs just to be safe?  It's not a data preservation
> issue, but rather the time it takes to do a mkcephfs. 

Rolling forward is always supposed to work.  (And if we do end up changing 
things in a non-backward compatible way, we'll make some noise about it.)

> Random
> question: how do you feel about using Python?  Trying to make a
> version of mkcephfs that runs in parallel would probably be easier if
> we could port the shell script to a python script.  I don't think
> there are any Python dependencies in Ceph right now, though.

Python's fine.  There's an issue in the tracker relating to this, btw.  
The goal will be to create discrete steps that let you use whatever 
cluster-specific tools you have for launching parallel jobs. 
	http://tracker.newdream.net/issues/400
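
As a rough illustration of the "discrete steps run in parallel" idea,
here is a minimal Python sketch.  It is not the design from the tracker
issue: run_on_hosts and fake_init_step are hypothetical names, and the
per-host step is a placeholder standing in for whatever remote-execution
tool a given cluster already uses.

```python
from concurrent.futures import ThreadPoolExecutor


def run_on_hosts(hosts, step, max_workers=8):
    """Run step(host) for every host in parallel; return {host: result}."""
    hosts = list(hosts)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with hosts.
        results = pool.map(step, hosts)
        return dict(zip(hosts, results))


def fake_init_step(host):
    # Placeholder only: a real step would start the per-node
    # initialization on `host` and report success or failure.
    return "initialized " + host


if __name__ == "__main__":
    hosts = ["mach%d" % n for n in range(10, 20)]
    for host, result in run_on_hosts(hosts, fake_init_step).items():
        print(host, "->", result)
```

Keeping the step as a plain callable means the same driver works with
ssh, a batch scheduler, or anything else that can run a command on a
node.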

sage

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: OOM's on the Ceph client machine
  2010-10-13  0:31 OOM's on the Ceph client machine Theodore Ts'o
  2010-10-13  2:30 ` Gregory Farnum
@ 2010-10-13  3:43 ` DongJin Lee
  2010-10-13 17:42 ` Sage Weil
  2 siblings, 0 replies; 13+ messages in thread
From: DongJin Lee @ 2010-10-13  3:43 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: ceph-devel, mrubin

Hi Ted:

I'd like to try a setup similar to yours, too.
At this stage I'm running a very recent version.  I've tried btrfs, but
the ceph mount freezes as soon as I run any heavy benchmark or heavy
IOPS load.  I think it has to do with syncing, so I'm now trying ext4
without the journal, hoping for some good news.

1. have you tried btrfs, with all configs the same?
2. did you use mkfs.ext4 -O ^has_journal to disable the journal? (sorry,
you know best since you are the ext4 man!)
3. did changing the journal size in ceph.conf change any results,
e.g., 20MB to 1000MB?

Have you had any success with other benchmarks? If your time allows,
I'd really like to know whether you could try the 'fio' benchmark.
Run ./fio example_file

where the example_file content is, e.g.,

[global]
bs=4k
ioengine=libaio
iodepth=1
size=1g
direct=1
runtime=30
filename=/media/cephmount/afile

[seq-read]
  rw=read
  stonewall
[rand-read]
  rw=randread
  stonewall
[seq-write]
  rw=write
  stonewall
[rand-write]
  rw=randwrite
  stonewall

thanks a lot.

On Wed, Oct 13, 2010 at 1:31 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> Hi there,
>
> I've recently been playing with Ceph on an evaluation basis, and found
> that I was able to fairly reliably induce an OOM kill on my ceph
> client machine by using FFSB with the following configuration file (see
> attached, below).
>
> I am using Ceph v0.21.3 plus a few commits that were on the testing
> branch as of late September (commit ID 569d96b).  The Ceph cluster
> contains 10 commodity servers with 5 disks configured for Ceph object
> storage on each server (plus a separate spindle for the journal files),
> so there are 5 instances of cosd on each OSD server.  The disks are
> formatted using ext4 in no-journal mode.  I am using 3 servers for the
> MDS and monitoring daemons, with the MDS and monitoring daemons
> colocated on these 3 servers.  The machines all have gigabit ethernet
> cards.
>
> I've been running the client on a separate machine, and this is the
> machine which has been dying with an OOM.
>
> Any help, suggestions, or "hey stupid!  You screwed up XXXX in your
> ceph.conf file" would be gratefully accepted.
>
> Thanks,
>
>                                     - Ted
>
> P.S.  In case people are curious, here are the results of the "boxacle"
> (http://btrfs.boxacle.net) FFSB workloads that I ran.  The results are
> fairly stable, except very often the 8 thread random_write workload is a
> little hard to reproduce because it very often OOM's.  I've never gotten
> a 32 thread random_write workload measurement, since it very reliably
> OOM's on my client machine.
>
> Do these results look reasonable to you?  I confess I'm a little
> disappointed with the sequential and random read numbers in particular.
> And given 10 servers and fifty spindles, even the large_file_create
> numbers seem surprisingly slow.
>
> (Also, given that we are using gigabit ethernet in this evaluation
> cluster, the 1GB/sec seems ridiculously high, which suggests to me that
> the fsync request wasn't honored -- FFSB includes the fsync time when
> calculating write bandwidth -- and it may explain why we are OOM'ing in
> the random_write workload.)
>
>                     1 thread       8 threads      32 threads
> large_file_create   101 MB/sec     102 MB/sec     101 MB/sec
> sequential_reads     35 MB/sec     113 MB/sec     114 MB/sec
> random_reads       1.48 MB/sec    5.44 MB/sec    11.7 MB/sec
> random_writes       923 MB/sec    1.09 GB/sec        (*)
>
> For comparison, here are the FFSB numbers on a single local ext4 disk
> with no journal:
>
>                    1 thread           8 threads            32 threads
> large_file_create   75.5 MB/sec        72.2 MB/sec          74.2 MB/sec
> sequential_reads    77.2 MB/sec        69.2 MB/sec          70.3 MB/sec
> random_reads        734 K/sec          537 K/sec            537 K/sec
> random_writes       44.5 MB/sec        41.5 MB/sec          41.6 MB/sec
>
> It's very possible that I may have done something wrong, so I've
> enclosed the ceph.conf file I used for doing this test run....  please
> let me know if there's something I've screwed up.
>
> ---------------------------- random_write.32.ffsb
> # Large file random writes.
> # 1024 files, 100MB per file.
>
> time=300  # 5 min
> alignio=1
>
> [filesystem0]
>        location=/mnt/ffsb1
>        num_files=1024
>        min_filesize=104857600  # 100 MB
>        max_filesize=104857600
>        reuse=1
> [end0]
>
> [threadgroup0]
>        num_threads=32
>
>        write_random=1
>        write_weight=1
>
>        write_size=5242880  # 5 MB
>        write_blocksize=4096
>
>        [stats]
>                enable_stats=1
>                enable_range=1
>
>                msec_range    0.00      0.01
>                msec_range    0.01      0.02
>                msec_range    0.02      0.05
>                msec_range    0.05      0.10
>                msec_range    0.10      0.20
>                msec_range    0.20      0.50
>                msec_range    0.50      1.00
>                msec_range    1.00      2.00
>                msec_range    2.00      5.00
>                msec_range    5.00     10.00
>                msec_range   10.00     20.00
>                msec_range   20.00     50.00
>                msec_range   50.00    100.00
>                msec_range  100.00    200.00
>                msec_range  200.00    500.00
>                msec_range  500.00   1000.00
>                msec_range 1000.00   2000.00
>                msec_range 2000.00   5000.00
>                msec_range 5000.00  10000.00
>        [end]
> [end0]
> ------------------------------------------------ My ceph.conf file
>
> ;
> ; This is the test ceph configuration file
> ;
> ; [tytso:20101007.0813EDT]
> ;
> ; This file defines cluster membership, the various locations
> ; that Ceph stores data, and any other runtime options.
> ;
> ; If a 'host' is defined for a daemon, the start/stop script will
> ; verify that it matches the hostname (or else ignore it).  If it is
> ; not defined, it is assumed that the daemon is intended to start on
> ; the current host (e.g., in a setup with a startup.conf on each
> ; node).
>
> ; global
> [global]
>        user = root
>        pid file = /disk/sda3/tmp/ceph/$name.pid
>        logger dir = /disk/sda3/tmp/ceph
>        log dir = /disk/sda3/tmp/ceph
>        chdir = /disk/sda3
>
> ; monitors
> ;  You need at least one.  You need at least three if you want to
> ;  tolerate any node failures.  Always create an odd number.
> [mon]
>        mon data = /disk/sda3/cephmon/data/mon$id
>
>        ; logging, for debugging monitor crashes, in order of
>        ; their likelihood of being helpful :)
>        ;debug ms = 1
>        ;debug mon = 20
>        ;debug paxos = 20
>        ;debug auth = 20
>
> [mon0]
>        host = mach1
>        mon addr = 1.2.3.4:6789
>
> [mon1]
>        host = mach2
>        mon addr = 1.2.3.5:6789
>
> [mon2]
>        host = mach3
>        mon addr = 1.2.3.6:6789
>
> ; mds
> ;  You need at least one.  Define two to get a standby.
> [mds]
>        ; where the mds keeps its secret encryption keys
>        keyring = /data/keyring.$name
>
>        ; mds logging to debug issues.
>        ;debug ms = 1
>        ;debug mds = 20
>
> [mds.alpha]
>        host = mach2
>
> [mds.beta]
>        host = mach3
>
> [mds.gamma]
>        host = mach1
>
> ; osd
> ;  You need at least one.  Two if you want data to be replicated.
> ;  Define as many as you like.
> [osd]
>        ; osd logging to debug osd issues, in order of likelihood of being
>        ; helpful
>        ;debug ms = 1
>        ;debug osd = 20
>        ;debug filestore = 20
>        ;debug journal = 20
>
> [osd0]
>        host = mach10
>        osd data = /disk/sdb3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdb3
>
> [osd1]
>        host = mach11
>        osd data = /disk/sdb3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdb3
>
> [osd2]
>        host = mach12
>        osd data = /disk/sdb3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdb3
>
> [osd3]
>        host = mach13
>        osd data = /disk/sdb3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdb3
>
> [osd4]
>        host = mach14
>        osd data = /disk/sdb3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdb3
>
> [osd5]
>        host = mach15
>        osd data = /disk/sdb3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdb3
>
> [osd6]
>        host = mach16
>        osd data = /disk/sdb3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdb3
>
> [osd7]
>        host = mach17
>        osd data = /disk/sdb3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdb3
>
> [osd8]
>        host = mach18
>        osd data = /disk/sdb3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdb3
>
> [osd9]
>        host = mach19
>        osd data = /disk/sdb3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdb3
>
> [osd10]
>        host = mach10
>        osd data = /disk/sdd3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdd3
>
> [osd11]
>        host = mach11
>        osd data = /disk/sdd3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdd3
>
> [osd12]
>        host = mach12
>        osd data = /disk/sdd3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdd3
>
> [osd13]
>        host = mach13
>        osd data = /disk/sdd3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdd3
>
> [osd14]
>        host = mach14
>        osd data = /disk/sdd3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdd3
>
> [osd15]
>        host = mach15
>        osd data = /disk/sdd3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdd3
>
> [osd16]
>        host = mach16
>        osd data = /disk/sdd3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdd3
>
> [osd17]
>        host = mach17
>        osd data = /disk/sdd3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdd3
>
> [osd18]
>        host = mach18
>        osd data = /disk/sdd3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdd3
>
> [osd19]
>        host = mach19
>        osd data = /disk/sdd3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdd3
>
> [osd20]
>        host = mach10
>        osd data = /disk/sde3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sde3
>
> [osd21]
>        host = mach11
>        osd data = /disk/sde3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sde3
>
> [osd22]
>        host = mach12
>        osd data = /disk/sde3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sde3
>
> [osd23]
>        host = mach13
>        osd data = /disk/sde3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sde3
>
> [osd24]
>        host = mach14
>        osd data = /disk/sde3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sde3
>
> [osd25]
>        host = mach15
>        osd data = /disk/sde3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sde3
>
> [osd26]
>        host = mach16
>        osd data = /disk/sde3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sde3
>
> [osd27]
>        host = mach17
>        osd data = /disk/sde3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sde3
>
> [osd28]
>        host = mach18
>        osd data = /disk/sde3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sde3
>
> [osd29]
>        host = mach19
>        osd data = /disk/sde3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sde3
>
> [osd30]
>        host = mach10
>        osd data = /disk/sdf3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdf3
>
> [osd31]
>        host = mach11
>        osd data = /disk/sdf3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdf3
>
> [osd32]
>        host = mach12
>        osd data = /disk/sdf3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdf3
>
> [osd33]
>        host = mach13
>        osd data = /disk/sdf3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdf3
>
> [osd34]
>        host = mach14
>        osd data = /disk/sdf3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdf3
>
> [osd35]
>        host = mach15
>        osd data = /disk/sdf3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdf3
>
> [osd36]
>        host = mach16
>        osd data = /disk/sdf3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdf3
>
> [osd37]
>        host = mach17
>        osd data = /disk/sdf3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdf3
>
> [osd38]
>        host = mach18
>        osd data = /disk/sdf3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdf3
>
> [osd39]
>        host = mach19
>        osd data = /disk/sdf3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdf3
>
> [osd40]
>        host = mach10
>        osd data = /disk/sdg3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdg3
>
> [osd41]
>        host = mach11
>        osd data = /disk/sdg3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdg3
>
> [osd42]
>        host = mach12
>        osd data = /disk/sdg3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdg3
>
> [osd43]
>        host = mach13
>        osd data = /disk/sdg3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdg3
>
> [osd44]
>        host = mach14
>        osd data = /disk/sdg3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdg3
>
> [osd45]
>        host = mach15
>        osd data = /disk/sdg3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdg3
>
> [osd46]
>        host = mach16
>        osd data = /disk/sdg3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdg3
>
> [osd47]
>        host = mach17
>        osd data = /disk/sdg3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdg3
>
> [osd48]
>        host = mach18
>        osd data = /disk/sdg3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdg3
>
> [osd49]
>        host = mach19
>        osd data = /disk/sdg3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdg3
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: OOM's on the Ceph client machine
  2010-10-13  0:31 OOM's on the Ceph client machine Theodore Ts'o
  2010-10-13  2:30 ` Gregory Farnum
  2010-10-13  3:43 ` DongJin Lee
@ 2010-10-13 17:42 ` Sage Weil
  2010-10-13 21:25   ` Sage Weil
  2 siblings, 1 reply; 13+ messages in thread
From: Sage Weil @ 2010-10-13 17:42 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: ceph-devel, mrubin

Hi Ted,

On Tue, 12 Oct 2010, Theodore Ts'o wrote:
> P.S.  In case people are curious, here are the results of the "boxacle"
> (http://btrfs.boxacle.net) FFSB workloads that I ran.  The results are
> fairly stable, except very often the 8 thread random_write workload is a
> little hard to reproduce because it very often OOM's.  I've never gotten
> a 32 thread random_write workload measurement, since it very reliably
> OOM's on my client machine.  
> 
> Do these results look reasonable to you?  I confess I'm a little
> disappointed with the sequential and random read numbers in particular.
> And given 10 servers and fifty spindles, even the large_file_create
> numbers seem surprisingly slow.
> 
> (Also, given that we are using gigabit ethernet in this evaluation
> cluster, the 1GB/sec seems ridiculously high, which suggests to me that
> the fsync request wasn't honored -- FFSB includes the fsync time when
> calculating write bandwidth -- and it may explain why we are OOM'ing in
> the random_write workload.)
> 
>                     1 thread           8 threads            32 threads 
> large_file_create   101 MB/sec         102 MB/sec           101 MB/sec 

These may be a bit below the ceiling imposed by the gigabit ethernet 
because of the combined journaling disk; effectively all writes for the 
whole host were going to the same spindle.  Please try distributing the 
journals across the spindles.
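
One way to try this is to regenerate the [osdN] sections so that each
journal sits on the same disk as that osd's data instead of on sdc3.
The Python sketch below is only an assumption about how that could look:
the function name osd_sections and the journal path
/disk/<disk>3/cephjnl are hypothetical, while the host names, disk
names, and osd numbering follow the ceph.conf quoted in this thread.

```python
def osd_sections(hosts, disks):
    """Build ceph.conf [osdN] sections with one journal per data disk."""
    sections = []
    osd_id = 0
    # Same numbering as the quoted config: all hosts for the first disk,
    # then all hosts for the second disk, and so on.
    for disk in disks:
        for host in hosts:
            sections.append(
                "[osd%d]\n"
                "\thost = %s\n"
                "\tosd data = /disk/%s3/cephdata\n"
                # Hypothetical path: journal on the osd's own data disk.
                "\tosd journal = /disk/%s3/cephjnl\n"
                % (osd_id, host, disk, disk)
            )
            osd_id += 1
    return "\n".join(sections)


if __name__ == "__main__":
    hosts = ["mach%d" % n for n in range(10, 20)]
    disks = ["sdb", "sdd", "sde", "sdf", "sdg"]
    print(osd_sections(hosts, disks))
```

The trade-off is that journal and data writes then compete for the same
spindle, so this is something to measure rather than assume is faster.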

> sequential_reads     35 MB/sec         113 MB/sec           114 MB/sec 

These are mostly reasonable.  The single thread performance is primarily 
governed by the MM readahead behavior.  There is a mount option tunable to 
adjust the max readahead on the BDI: rsize=<bytes> (the default is only 
512KB, IIRC).  Some users have reported improved read performance with a 
larger rsize, but it's not something we've had time to tune ourselves.

> random_reads          1.48 MB/sec        5.44 MB/sec        11.7 MB/sec 

This one looks way too slow.  I'm going to run this locally and see what 
is going on.

> random_writes      923 MB/sec           1.09 GB/sec             (*) 

And there is definitely something wrong here with the client.  :)  Let's 
see what happens with the latest mainline!
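
To spell out the arithmetic behind that: a single gigabit link carries
at most 10^9 bits per second, i.e. 125 MB/sec before any protocol
overhead, so 923 MB/sec cannot have crossed the wire.  A small Python
sketch, assuming FFSB's "MB" means 10^6 bytes (with 2^20-byte units the
gap only gets larger); the names below are illustrative.

```python
LINK_BITS_PER_SEC = 1_000_000_000  # gigabit ethernet, one link

# Upper bound on payload rate in MB/sec (10^6 bytes), ignoring overhead.
WIRE_LIMIT_MB_PER_SEC = LINK_BITS_PER_SEC / 8 / 1_000_000


def exceeds_wire(reported_mb_per_sec):
    """True if a reported throughput is above what the link can carry."""
    return reported_mb_per_sec > WIRE_LIMIT_MB_PER_SEC


if __name__ == "__main__":
    for name, rate in [("large_file_create", 101), ("random_writes", 923)]:
        print(name, rate, "MB/sec, exceeds wire limit:", exceeds_wire(rate))
```

Any write figure above that bound means the data was still sitting in
client memory when the benchmark stopped its clock, which fits the OOM
reports.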

sage


> 
> For comparison, here are the FFSB numbers on a single local ext4 disk
> with no journal:
> 
>                     1 thread           8 threads            32 threads
> large_file_create   75.5 MB/sec        72.2 MB/sec          74.2 MB/sec
> sequential_reads    77.2 MB/sec        69.2 MB/sec          70.3 MB/sec
> random_reads        734 K/sec          537 K/sec            537 K/sec
> random_writes       44.5 MB/sec        41.5 MB/sec          41.6 MB/sec
> 
> It's very possible that I may have done something wrong, so I've
> enclosed the ceph.conf file I used for doing this test run....  please
> let me know if there's something I've screwed up.
> 
> ---------------------------- random_write.32.ffsb
> # Large file random writes.
> # 1024 files, 100MB per file.
> 
> time=300  # 5 min
> alignio=1
> 
> [filesystem0]
> 	location=/mnt/ffsb1
> 	num_files=1024
> 	min_filesize=104857600  # 100 MB
> 	max_filesize=104857600
> 	reuse=1
> [end0]
> 
> [threadgroup0]
> 	num_threads=32
> 
> 	write_random=1
> 	write_weight=1
> 
> 	write_size=5242880  # 5 MB
> 	write_blocksize=4096
> 
> 	[stats]
> 		enable_stats=1
> 		enable_range=1
> 
> 		msec_range    0.00      0.01
> 		msec_range    0.01      0.02
> 		msec_range    0.02      0.05
> 		msec_range    0.05      0.10
> 		msec_range    0.10      0.20
> 		msec_range    0.20      0.50
> 		msec_range    0.50      1.00
> 		msec_range    1.00      2.00
> 		msec_range    2.00      5.00
> 		msec_range    5.00     10.00
> 		msec_range   10.00     20.00
> 		msec_range   20.00     50.00
> 		msec_range   50.00    100.00
> 		msec_range  100.00    200.00
> 		msec_range  200.00    500.00
> 		msec_range  500.00   1000.00
> 		msec_range 1000.00   2000.00
> 		msec_range 2000.00   5000.00
> 		msec_range 5000.00  10000.00
> 	[end]
> [end0]
> ------------------------------------------------ My ceph.conf file
> 
> ;
> ; This is the test ceph configuration file
> ;
> ; [tytso:20101007.0813EDT]
> ;
> ; This file defines cluster membership, the various locations
> ; that Ceph stores data, and any other runtime options.
> ;
> ; If a 'host' is defined for a daemon, the start/stop script will
> ; verify that it matches the hostname (or else ignore it).  If it is
> ; not defined, it is assumed that the daemon is intended to start on
> ; the current host (e.g., in a setup with a startup.conf on each
> ; node).
> 
> ; global
> [global]
> 	user = root
> 	pid file = /disk/sda3/tmp/ceph/$name.pid
> 	logger dir = /disk/sda3/tmp/ceph
> 	log dir = /disk/sda3/tmp/ceph
> 	chdir = /disk/sda3
> 
> ; monitors
> ;  You need at least one.  You need at least three if you want to
> ;  tolerate any node failures.  Always create an odd number.
> [mon]
> 	mon data = /disk/sda3/cephmon/data/mon$id
> 
> 	; logging, for debugging monitor crashes, in order of
> 	; their likelihood of being helpful :)
> 	;debug ms = 1
> 	;debug mon = 20
> 	;debug paxos = 20
> 	;debug auth = 20
> 
> [mon0]
> 	host = mach1
> 	mon addr = 1.2.3.4:6789
> 
> [mon1]
> 	host = mach2
> 	mon addr = 1.2.3.5:6789
> 
> [mon2]
> 	host = mach3
> 	mon addr = 1.2.3.6:6789
> 
> ; mds
> ;  You need at least one.  Define two to get a standby.
> [mds]
> 	; where the mds keeps its secret encryption keys
> 	keyring = /data/keyring.$name
> 
> 	; mds logging to debug issues.
> 	;debug ms = 1
> 	;debug mds = 20
> 
> [mds.alpha]
> 	host = mach2
> 
> [mds.beta]
> 	host = mach3
> 
> [mds.gamma]
> 	host = mach1
> 
> ; osd
> ;  You need at least one.  Two if you want data to be replicated.
> ;  Define as many as you like.
> [osd]
> 	; osd logging to debug osd issues, in order of likelihood of being
> 	; helpful
> 	;debug ms = 1
> 	;debug osd = 20
> 	;debug filestore = 20
> 	;debug journal = 20
> 
> [osd0]
> 	host = mach10
> 	osd data = /disk/sdb3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdb3
> 
> [osd1]
> 	host = mach11
> 	osd data = /disk/sdb3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdb3
> 
> [osd2]
> 	host = mach12
> 	osd data = /disk/sdb3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdb3
> 
> [osd3]
> 	host = mach13
> 	osd data = /disk/sdb3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdb3
> 
> [osd4]
> 	host = mach14
> 	osd data = /disk/sdb3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdb3
> 
> [osd5]
> 	host = mach15
> 	osd data = /disk/sdb3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdb3
> 
> [osd6]
> 	host = mach16
> 	osd data = /disk/sdb3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdb3
> 
> [osd7]
> 	host = mach17
> 	osd data = /disk/sdb3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdb3
> 
> [osd8]
> 	host = mach18
> 	osd data = /disk/sdb3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdb3
> 
> [osd9]
> 	host = mach19
> 	osd data = /disk/sdb3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdb3
> 
> [osd10]
> 	host = mach10
> 	osd data = /disk/sdd3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdd3
> 
> [osd11]
> 	host = mach11
> 	osd data = /disk/sdd3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdd3
> 
> [osd12]
> 	host = mach12
> 	osd data = /disk/sdd3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdd3
> 
> [osd13]
> 	host = mach13
> 	osd data = /disk/sdd3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdd3
> 
> [osd14]
> 	host = mach14
> 	osd data = /disk/sdd3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdd3
> 
> [osd15]
> 	host = mach15
> 	osd data = /disk/sdd3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdd3
> 
> [osd16]
> 	host = mach16
> 	osd data = /disk/sdd3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdd3
> 
> [osd17]
> 	host = mach17
> 	osd data = /disk/sdd3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdd3
> 
> [osd18]
> 	host = mach18
> 	osd data = /disk/sdd3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdd3
> 
> [osd19]
> 	host = mach19
> 	osd data = /disk/sdd3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdd3
> 
> [osd20]
> 	host = mach10
> 	osd data = /disk/sde3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sde3
> 
> [osd21]
> 	host = mach11
> 	osd data = /disk/sde3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sde3
> 
> [osd22]
> 	host = mach12
> 	osd data = /disk/sde3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sde3
> 
> [osd23]
> 	host = mach13
> 	osd data = /disk/sde3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sde3
> 
> [osd24]
> 	host = mach14
> 	osd data = /disk/sde3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sde3
> 
> [osd25]
> 	host = mach15
> 	osd data = /disk/sde3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sde3
> 
> [osd26]
> 	host = mach16
> 	osd data = /disk/sde3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sde3
> 
> [osd27]
> 	host = mach17
> 	osd data = /disk/sde3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sde3
> 
> [osd28]
> 	host = mach18
> 	osd data = /disk/sde3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sde3
> 
> [osd29]
> 	host = mach19
> 	osd data = /disk/sde3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sde3
> 
> [osd30]
> 	host = mach10
> 	osd data = /disk/sdf3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdf3
> 
> [osd31]
> 	host = mach11
> 	osd data = /disk/sdf3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdf3
> 
> [osd32]
> 	host = mach12
> 	osd data = /disk/sdf3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdf3
> 
> [osd33]
> 	host = mach13
> 	osd data = /disk/sdf3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdf3
> 
> [osd34]
> 	host = mach14
> 	osd data = /disk/sdf3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdf3
> 
> [osd35]
> 	host = mach15
> 	osd data = /disk/sdf3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdf3
> 
> [osd36]
> 	host = mach16
> 	osd data = /disk/sdf3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdf3
> 
> [osd37]
> 	host = mach17
> 	osd data = /disk/sdf3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdf3
> 
> [osd38]
> 	host = mach18
> 	osd data = /disk/sdf3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdf3
> 
> [osd39]
> 	host = mach19
> 	osd data = /disk/sdf3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdf3
> 
> [osd40]
> 	host = mach10
> 	osd data = /disk/sdg3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdg3
> 
> [osd41]
> 	host = mach11
> 	osd data = /disk/sdg3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdg3
> 
> [osd42]
> 	host = mach12
> 	osd data = /disk/sdg3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdg3
> 
> [osd43]
> 	host = mach13
> 	osd data = /disk/sdg3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdg3
> 
> [osd44]
> 	host = mach14
> 	osd data = /disk/sdg3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdg3
> 
> [osd45]
> 	host = mach15
> 	osd data = /disk/sdg3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdg3
> 
> [osd46]
> 	host = mach16
> 	osd data = /disk/sdg3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdg3
> 
> [osd47]
> 	host = mach17
> 	osd data = /disk/sdg3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdg3
> 
> [osd48]
> 	host = mach18
> 	osd data = /disk/sdg3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdg3
> 
> [osd49]
> 	host = mach19
> 	osd data = /disk/sdg3/cephdata
> 	osd journal = /disk/sdc3/cephjnl.sdg3
> 


* Re: OOM's on the Ceph client machine
  2010-10-13 17:42 ` Sage Weil
@ 2010-10-13 21:25   ` Sage Weil
  0 siblings, 0 replies; 13+ messages in thread
From: Sage Weil @ 2010-10-13 21:25 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: ceph-devel, mrubin

On Wed, 13 Oct 2010, Sage Weil wrote:
> On Tue, 12 Oct 2010, Theodore Ts'o wrote:
> > random_reads          1.48 MB/sec        5.44 MB/sec        11.7 MB/sec 
> 
> This one looks way too slow.  I'm going to run this locally and see what 
> is going on.

I looked closer at this one.  It looks like each ffsb thread picks a 
random 5 MB chunk and does 4KB reads at random offsets within that chunk.  
Because the reads are random, there's no readahead, and we end up with 
lots of little 4KB read requests going over the wire.  Increasing the 
number of threads just means more small reads in parallel.
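
To make that access pattern concrete, here is a rough Python sketch of it.
This is not ffsb's code: the function names are made up for illustration,
the 100 MB file size is borrowed from the random_write config quoted below,
and issuing one read per 4KB block in the chunk is an assumption.

```python
import random

CHUNK_SIZE = 5 * 1024 * 1024    # 5 MB region picked for each operation
BLOCK_SIZE = 4096               # 4KB per read request
FILE_SIZE = 100 * 1024 * 1024   # assumed: 100 MB files, as in the quoted config


def random_read_offsets(rng, file_size=FILE_SIZE,
                        chunk_size=CHUNK_SIZE, block_size=BLOCK_SIZE):
    """Return the file offsets of the 4KB reads issued for one 5 MB chunk."""
    # Pick a block-aligned start for the chunk somewhere inside the file.
    max_start_block = (file_size - chunk_size) // block_size
    chunk_start = rng.randint(0, max_start_block) * block_size
    blocks_per_chunk = chunk_size // block_size
    # Assumption: one read per block in the chunk, each at a random
    # block-aligned offset within the chunk.
    return [chunk_start + rng.randrange(blocks_per_chunk) * block_size
            for _ in range(blocks_per_chunk)]


def count_sequential(offsets, block_size=BLOCK_SIZE):
    """Count reads that start exactly where the previous read ended."""
    return sum(1 for prev, cur in zip(offsets, offsets[1:])
               if cur == prev + block_size)


if __name__ == "__main__":
    offsets = random_read_offsets(random.Random())
    print(len(offsets), "reads,", count_sequential(offsets), "sequential")
```

With roughly a 1-in-1280 chance that any given read directly follows the
previous one, there is essentially never a sequential run for readahead to
pick up.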

That being the case, the single thread number isn't so surprising: 
performance is mainly bounded by the request latency.  What is a bit 
surprising is that it doesn't scale that well as threads increase.  I 
assume that's because of some contention on the OSDs (balancing is 
pseudorandom).  FWIW, in my environment (25 single-spindle OSDs, btrfs) 
I got the following for random_reads with 1/8/32 threads:

random_reads	4.1 MB/sec	7.41 MB/sec	15.3 MB/sec
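
As a back-of-the-envelope check on the latency argument, the sketch below
converts a throughput figure into an implied per-request latency.  It
assumes 4KB requests, one synchronous read in flight per thread, and MB
meaning 2^20 bytes; the function name is only illustrative.

```python
def implied_latency_ms(throughput_mb_s, threads, block_size=4096):
    """Estimate per-request latency (ms) from aggregate throughput.

    Assumes each thread keeps exactly one block_size read in flight,
    so aggregate requests/sec = threads / latency.
    """
    requests_per_sec = throughput_mb_s * 1024 * 1024 / block_size
    return threads / requests_per_sec * 1000.0


if __name__ == "__main__":
    # random_reads figures quoted above for 1, 8 and 32 threads.
    for threads, mb_s in [(1, 1.48), (8, 5.44), (32, 11.7)]:
        latency = implied_latency_ms(mb_s, threads)
        print(f"{threads:2d} threads: {latency:.2f} ms per request")
```

Fed the quoted numbers from Ted's cluster (1.48, 5.44 and 11.7 MB/sec),
this gives roughly 2.6 ms, 5.7 ms and 10.7 ms per request, so the implied
latency climbs as threads are added, which is consistent with contention
on the OSDs.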

sage


> 
> > random_writes      923 MB/sec           1.09 GB/sec             (*) 
> 
> And there is definitely something wrong here with the client.  :)  Let's 
> see what happens with the latest mainline!
> 
> sage
> 
> 
> > 
> > For comparison, here are the FFSB numbers on a single local ext4 disk
> > with no journal:
> > 
> >                     1 thread           8 threads          32 threads
> > large_file_create   75.5 MB/sec        72.2 MB/sec        74.2 MB/sec
> > sequential_reads    77.2 MB/sec        69.2 MB/sec        70.3 MB/sec
> > random_reads        734 K/sec          537 K/sec          537 K/sec
> > random_writes       44.5 MB/sec        41.5 MB/sec        41.6 MB/sec
> > 
> > It's very possible that I may have done something wrong, so I've
> > enclosed the ceph.conf file I used for doing this test run....  please
> > let me know if there's something I've screwed up.
> > 
> > ---------------------------- random_write.32.ffsb
> > # Large file random writes.
> > # 1024 files, 100MB per file.
> > 
> > time=300  # 5 min
> > alignio=1
> > 
> > [filesystem0]
> > 	location=/mnt/ffsb1
> > 	num_files=1024
> > 	min_filesize=104857600  # 100 MB
> > 	max_filesize=104857600
> > 	reuse=1
> > [end0]
> > 
> > [threadgroup0]
> > 	num_threads=32
> > 
> > 	write_random=1
> > 	write_weight=1
> > 
> > 	write_size=5242880  # 5 MB
> > 	write_blocksize=4096
> > 
> > 	[stats]
> > 		enable_stats=1
> > 		enable_range=1
> > 
> > 		msec_range    0.00      0.01
> > 		msec_range    0.01      0.02
> > 		msec_range    0.02      0.05
> > 		msec_range    0.05      0.10
> > 		msec_range    0.10      0.20
> > 		msec_range    0.20      0.50
> > 		msec_range    0.50      1.00
> > 		msec_range    1.00      2.00
> > 		msec_range    2.00      5.00
> > 		msec_range    5.00     10.00
> > 		msec_range   10.00     20.00
> > 		msec_range   20.00     50.00
> > 		msec_range   50.00    100.00
> > 		msec_range  100.00    200.00
> > 		msec_range  200.00    500.00
> > 		msec_range  500.00   1000.00
> > 		msec_range 1000.00   2000.00
> > 		msec_range 2000.00   5000.00
> > 		msec_range 5000.00  10000.00
> > 	[end]
> > [end0]
> > ------------------------------------------------ My ceph.conf file
> > 
> > ;
> > ; This is the test ceph configuration file
> > ;
> > ; [tytso:20101007.0813EDT]
> > ;
> > ; This file defines cluster membership, the various locations
> > ; that Ceph stores data, and any other runtime options.
> > ;
> > ; If a 'host' is defined for a daemon, the start/stop script will
> > ; verify that it matches the hostname (or else ignore it).  If it is
> > ; not defined, it is assumed that the daemon is intended to start on
> > ; the current host (e.g., in a setup with a startup.conf on each
> > ; node).
> > 
> > ; global
> > [global]
> > 	user = root
> > 	pid file = /disk/sda3/tmp/ceph/$name.pid
> > 	logger dir = /disk/sda3/tmp/ceph
> > 	log dir = /disk/sda3/tmp/ceph
> > 	chdir = /disk/sda3
> > 
> > ; monitors
> > ;  You need at least one.  You need at least three if you want to
> > ;  tolerate any node failures.  Always create an odd number.
> > [mon]
> > 	mon data = /disk/sda3/cephmon/data/mon$id
> > 
> > 	; logging, for debugging monitor crashes, in order of
> > 	; their likelihood of being helpful :)
> > 	;debug ms = 1
> > 	;debug mon = 20
> > 	;debug paxos = 20
> > 	;debug auth = 20
> > 
> > [mon0]
> > 	host = mach1
> > 	mon addr = 1.2.3.4:6789
> > 
> > [mon1]
> > 	host = mach2
> > 	mon addr = 1.2.3.5:6789
> > 
> > [mon2]
> > 	host = mach3
> > 	mon addr = 1.2.3.6:6789
> > 
> > ; mds
> > ;  You need at least one.  Define two to get a standby.
> > [mds]
> > 	; where the mds keeps its secret encryption keys
> > 	keyring = /data/keyring.$name
> > 
> > 	; mds logging to debug issues.
> > 	;debug ms = 1
> > 	;debug mds = 20
> > 
> > [mds.alpha]
> > 	host = mach2
> > 
> > [mds.beta]
> > 	host = mach3
> > 
> > [mds.gamma]
> > 	host = mach1
> > 
> > ; osd
> > ;  You need at least one.  Two if you want data to be replicated.
> > ;  Define as many as you like.
> > [osd]
> > 	; osd logging to debug osd issues, in order of likelihood of being
> > 	; helpful
> > 	;debug ms = 1
> > 	;debug osd = 20
> > 	;debug filestore = 20
> > 	;debug journal = 20
> > 
> > [osd0]
> > 	host = mach10
> > 	osd data = /disk/sdb3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdb3
> > 
> > [osd1]
> > 	host = mach11
> > 	osd data = /disk/sdb3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdb3
> > 
> > [osd2]
> > 	host = mach12
> > 	osd data = /disk/sdb3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdb3
> > 
> > [osd3]
> > 	host = mach13
> > 	osd data = /disk/sdb3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdb3
> > 
> > [osd4]
> > 	host = mach14
> > 	osd data = /disk/sdb3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdb3
> > 
> > [osd5]
> > 	host = mach15
> > 	osd data = /disk/sdb3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdb3
> > 
> > [osd6]
> > 	host = mach16
> > 	osd data = /disk/sdb3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdb3
> > 
> > [osd7]
> > 	host = mach17
> > 	osd data = /disk/sdb3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdb3
> > 
> > [osd8]
> > 	host = mach18
> > 	osd data = /disk/sdb3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdb3
> > 
> > [osd9]
> > 	host = mach19
> > 	osd data = /disk/sdb3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdb3
> > 
> > [osd10]
> > 	host = mach10
> > 	osd data = /disk/sdd3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdd3
> > 
> > [osd11]
> > 	host = mach11
> > 	osd data = /disk/sdd3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdd3
> > 
> > [osd12]
> > 	host = mach12
> > 	osd data = /disk/sdd3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdd3
> > 
> > [osd13]
> > 	host = mach13
> > 	osd data = /disk/sdd3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdd3
> > 
> > [osd14]
> > 	host = mach14
> > 	osd data = /disk/sdd3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdd3
> > 
> > [osd15]
> > 	host = mach15
> > 	osd data = /disk/sdd3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdd3
> > 
> > [osd16]
> > 	host = mach16
> > 	osd data = /disk/sdd3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdd3
> > 
> > [osd17]
> > 	host = mach17
> > 	osd data = /disk/sdd3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdd3
> > 
> > [osd18]
> > 	host = mach18
> > 	osd data = /disk/sdd3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdd3
> > 
> > [osd19]
> > 	host = mach19
> > 	osd data = /disk/sdd3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdd3
> > 
> > [osd20]
> > 	host = mach10
> > 	osd data = /disk/sde3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sde3
> > 
> > [osd21]
> > 	host = mach11
> > 	osd data = /disk/sde3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sde3
> > 
> > [osd22]
> > 	host = mach12
> > 	osd data = /disk/sde3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sde3
> > 
> > [osd23]
> > 	host = mach13
> > 	osd data = /disk/sde3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sde3
> > 
> > [osd24]
> > 	host = mach14
> > 	osd data = /disk/sde3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sde3
> > 
> > [osd25]
> > 	host = mach15
> > 	osd data = /disk/sde3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sde3
> > 
> > [osd26]
> > 	host = mach16
> > 	osd data = /disk/sde3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sde3
> > 
> > [osd27]
> > 	host = mach17
> > 	osd data = /disk/sde3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sde3
> > 
> > [osd28]
> > 	host = mach18
> > 	osd data = /disk/sde3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sde3
> > 
> > [osd29]
> > 	host = mach19
> > 	osd data = /disk/sde3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sde3
> > 
> > [osd30]
> > 	host = mach10
> > 	osd data = /disk/sdf3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdf3
> > 
> > [osd31]
> > 	host = mach11
> > 	osd data = /disk/sdf3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdf3
> > 
> > [osd32]
> > 	host = mach12
> > 	osd data = /disk/sdf3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdf3
> > 
> > [osd33]
> > 	host = mach13
> > 	osd data = /disk/sdf3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdf3
> > 
> > [osd34]
> > 	host = mach14
> > 	osd data = /disk/sdf3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdf3
> > 
> > [osd35]
> > 	host = mach15
> > 	osd data = /disk/sdf3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdf3
> > 
> > [osd36]
> > 	host = mach16
> > 	osd data = /disk/sdf3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdf3
> > 
> > [osd37]
> > 	host = mach17
> > 	osd data = /disk/sdf3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdf3
> > 
> > [osd38]
> > 	host = mach18
> > 	osd data = /disk/sdf3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdf3
> > 
> > [osd39]
> > 	host = mach19
> > 	osd data = /disk/sdf3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdf3
> > 
> > [osd40]
> > 	host = mach10
> > 	osd data = /disk/sdg3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdg3
> > 
> > [osd41]
> > 	host = mach11
> > 	osd data = /disk/sdg3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdg3
> > 
> > [osd42]
> > 	host = mach12
> > 	osd data = /disk/sdg3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdg3
> > 
> > [osd43]
> > 	host = mach13
> > 	osd data = /disk/sdg3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdg3
> > 
> > [osd44]
> > 	host = mach14
> > 	osd data = /disk/sdg3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdg3
> > 
> > [osd45]
> > 	host = mach15
> > 	osd data = /disk/sdg3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdg3
> > 
> > [osd46]
> > 	host = mach16
> > 	osd data = /disk/sdg3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdg3
> > 
> > [osd47]
> > 	host = mach17
> > 	osd data = /disk/sdg3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdg3
> > 
> > [osd48]
> > 	host = mach18
> > 	osd data = /disk/sdg3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdg3
> > 
> > [osd49]
> > 	host = mach19
> > 	osd data = /disk/sdg3/cephdata
> > 	osd journal = /disk/sdc3/cephjnl.sdg3
> > 


end of thread, other threads:[~2010-10-21 22:41 UTC | newest]

Thread overview: 13+ messages
2010-10-13  0:31 OOM's on the Ceph client machine Theodore Ts'o
2010-10-13  2:30 ` Gregory Farnum
2010-10-13  3:34   ` Ted Ts'o
2010-10-13 17:29     ` Sage Weil
2010-10-14  0:03       ` Ted Ts'o
2010-10-14  3:43         ` Sage Weil
2010-10-21 20:36         ` Ted Ts'o
2010-10-21 21:46           ` Sage Weil
2010-10-21 22:28             ` Ted Ts'o
2010-10-21 22:44               ` Sage Weil
2010-10-13  3:43 ` DongJin Lee
2010-10-13 17:42 ` Sage Weil
2010-10-13 21:25   ` Sage Weil
