* OOM's on the Ceph client machine
From: Theodore Ts'o @ 2010-10-13 0:31 UTC
To: ceph-devel; +Cc: mrubin
Hi there,
I've recently been playing with Ceph on an evaluation basis, and found
that I was able to fairly reliably induce an OOM kill on my Ceph
client machine by using FFSB with the following configuration file (see
attached, below).
I am using Ceph v0.21.3 plus a few commits that were on the testing
branch as of late September (commit ID 569d96b). The Ceph cluster
contains 10 commodity servers with 5 disks configured for Ceph object
storage on each server (plus a separate spindle for the journal files),
so there are 5 instances of cosd on each OSD server. The disks are
formatted using ext4 in no-journal mode. I am using 3 servers for the
MDS and monitoring daemons, with the MDS and monitoring daemons
colocated on these 3 servers. The machines all have gigabit ethernet
cards.
I've been running the client on a separate machine, and this is the
machine which has been dying with an OOM.
Any help, suggestions, or "hey stupid! You screwed up XXXX in your
ceph.conf file" would be gratefully accepted.
Thanks,
- Ted
P.S. In case people are curious, here are the results of the "boxacle"
(http://btrfs.boxacle.net) FFSB workloads that I ran. The results are
fairly stable, except very often the 8 thread random_write workload is a
little hard to reproduce because it very often OOM's. I've never gotten
a 32 thread random_write workload measurement, since it very reliably
OOM's on my client machine.
Do these results look reasonable to you? I confess I'm a little
disappointed with the sequential and random read numbers in particular.
And given 10 servers and fifty spindles, even the large_file_create
numbers seem surprisingly slow.
(Also, given that we are using gigabit ethernet in this evaluation
cluster, the 1GB/sec seems ridiculously high, which suggests to me that
the fsync request wasn't honored -- FFSB includes the fsync time when
calculating write bandwidth -- and it may explain why we are OOM'ing in
the random_write workload.)
                     1 thread      8 threads     32 threads
large_file_create    101 MB/sec    102 MB/sec    101 MB/sec
sequential_reads     35 MB/sec     113 MB/sec    114 MB/sec
random_reads         1.48 MB/sec   5.44 MB/sec   11.7 MB/sec
random_writes        923 MB/sec    1.09 GB/sec   (*)
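A quick sanity check of the random_writes figures against the link speed, as a minimal sketch. It assumes the only limit is the raw 1 Gbit/s line rate, ignores protocol overhead (so the real ceiling is somewhat lower), and treats 1.09 GB/sec as 1090 MB/sec; FFSB's exact unit convention is not stated in this thread.

```python
# Raw ceiling of a gigabit link, ignoring Ethernet/TCP overhead (assumption).
LINK_BITS_PER_SEC = 1_000_000_000
link_mb_per_sec = LINK_BITS_PER_SEC / 8 / 1_000_000  # 125.0 MB/sec

# Reported random_writes throughput from the table above, in MB/sec.
# 1.09 GB/sec is taken as 1090 MB/sec (decimal units are an assumption).
reported = {"1 thread": 923.0, "8 threads": 1090.0}


def oversubscription(mb_per_sec, ceiling=link_mb_per_sec):
    """Return how many times the reported rate exceeds the link ceiling."""
    return mb_per_sec / ceiling


for label, rate in reported.items():
    factor = oversubscription(rate)
    print(f"{label}: {rate:.0f} MB/sec is {factor:.1f}x the link ceiling")
```

Both figures come out several times higher than anything the wire could carry, which is consistent with the suspicion that the writes were being absorbed in client memory rather than reaching the OSDs.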
For comparison, here are the FFSB numbers on a single local ext4 disk
with no journal:
                     1 thread      8 threads     32 threads
large_file_create    75.5 MB/sec   72.2 MB/sec   74.2 MB/sec
sequential_reads     77.2 MB/sec   69.2 MB/sec   70.3 MB/sec
random_reads         734 K/sec     537 K/sec     537 K/sec
random_writes        44.5 MB/sec   41.5 MB/sec   41.6 MB/sec
It's very possible that I may have done something wrong, so I've
enclosed the ceph.conf file I used for doing this test run.... please
let me know if there's something I've screwed up.
---------------------------- random_write.32.ffsb
# Large file random writes.
# 1024 files, 100MB per file.
time=300 # 5 min
alignio=1
[filesystem0]
location=/mnt/ffsb1
num_files=1024
min_filesize=104857600 # 100 MB
max_filesize=104857600
reuse=1
[end0]
[threadgroup0]
num_threads=32
write_random=1
write_weight=1
write_size=5242880 # 5 MB
write_blocksize=4096
[stats]
enable_stats=1
enable_range=1
msec_range 0.00 0.01
msec_range 0.01 0.02
msec_range 0.02 0.05
msec_range 0.05 0.10
msec_range 0.10 0.20
msec_range 0.20 0.50
msec_range 0.50 1.00
msec_range 1.00 2.00
msec_range 2.00 5.00
msec_range 5.00 10.00
msec_range 10.00 20.00
msec_range 20.00 50.00
msec_range 50.00 100.00
msec_range 100.00 200.00
msec_range 200.00 500.00
msec_range 500.00 1000.00
msec_range 1000.00 2000.00
msec_range 2000.00 5000.00
msec_range 5000.00 10000.00
[end]
[end0]
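To give a feel for the scale of this workload, here is a small worked calculation from the values in the FFSB file above. It is only a sketch: it assumes write_size is the total number of bytes written per operation and write_blocksize is the size of each individual write call, which is my reading of the option names rather than something confirmed in this thread.

```python
# Values copied from random_write.32.ffsb above.
num_files = 1024
file_size = 104857600      # bytes per file (100 MB)
num_threads = 32
write_size = 5242880       # bytes per operation (assumed meaning, see lead-in)
write_blocksize = 4096     # bytes per write call (assumed meaning, see lead-in)

MIB = 1024 * 1024
GIB = 1024 * MIB

# Total size of the pre-created file set.
dataset_gib = num_files * file_size / GIB

# Number of write calls needed to complete one operation.
writes_per_op = write_size // write_blocksize

# Data covered if every thread is in the middle of one operation.
in_flight_mib = num_threads * write_size / MIB

print(f"file set:        {dataset_gib:.0f} GiB")
print(f"writes per op:   {writes_per_op}")
print(f"32 ops together: {in_flight_mib:.0f} MiB")
```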
------------------------------------------------ My ceph.conf file
;
; This is the test ceph configuration file
;
; [tytso:20101007.0813EDT]
;
; This file defines cluster membership, the various locations
; that Ceph stores data, and any other runtime options.
;
; If a 'host' is defined for a daemon, the start/stop script will
; verify that it matches the hostname (or else ignore it). If it is
; not defined, it is assumed that the daemon is intended to start on
; the current host (e.g., in a setup with a startup.conf on each
; node).
; global
[global]
user = root
pid file = /disk/sda3/tmp/ceph/$name.pid
logger dir = /disk/sda3/tmp/ceph
log dir = /disk/sda3/tmp/ceph
chdir = /disk/sda3
; monitors
; You need at least one. You need at least three if you want to
; tolerate any node failures. Always create an odd number.
[mon]
mon data = /disk/sda3/cephmon/data/mon$id
; logging, for debugging monitor crashes, in order of
; their likelihood of being helpful :)
;debug ms = 1
;debug mon = 20
;debug paxos = 20
;debug auth = 20
[mon0]
host = mach1
mon addr = 1.2.3.4:6789
[mon1]
host = mach2
mon addr = 1.2.3.5:6789
[mon1]
host = mach3
mon addr = 1.2.3.6:6789
; mds
; You need at least one. Define two to get a standby.
[mds]
; where the mds keeps its secret encryption keys
keyring = /data/keyring.$name
; mds logging to debug issues.
;debug ms = 1
;debug mds = 20
[mds.alpha]
host = mach2
[mds.beta]
host = mach3
[mds.gamma]
host = mach1
; osd
; You need at least one. Two if you want data to be replicated.
; Define as many as you like.
[osd]
; osd logging to debug osd issues, in order of likelihood of being
; helpful
;debug ms = 1
;debug osd = 20
;debug filestore = 20
;debug journal = 20
[osd0]
host = mach10
osd data = /disk/sdb3/cephdata
osd journal = /disk/sdc3/cephjnl.sdb3
[osd1]
host = mach11
osd data = /disk/sdb3/cephdata
osd journal = /disk/sdc3/cephjnl.sdb3
[osd2]
host = mach12
osd data = /disk/sdb3/cephdata
osd journal = /disk/sdc3/cephjnl.sdb3
[osd3]
host = mach13
osd data = /disk/sdb3/cephdata
osd journal = /disk/sdc3/cephjnl.sdb3
[osd4]
host = mach14
osd data = /disk/sdb3/cephdata
osd journal = /disk/sdc3/cephjnl.sdb3
[osd5]
host = mach15
osd data = /disk/sdb3/cephdata
osd journal = /disk/sdc3/cephjnl.sdb3
[osd6]
host = mach16
osd data = /disk/sdb3/cephdata
osd journal = /disk/sdc3/cephjnl.sdb3
[osd7]
host = mach17
osd data = /disk/sdb3/cephdata
osd journal = /disk/sdc3/cephjnl.sdb3
[osd8]
host = mach18
osd data = /disk/sdb3/cephdata
osd journal = /disk/sdc3/cephjnl.sdb3
[osd9]
host = mach19
osd data = /disk/sdb3/cephdata
osd journal = /disk/sdc3/cephjnl.sdb3
[osd10]
host = mach10
osd data = /disk/sdd3/cephdata
osd journal = /disk/sdc3/cephjnl.sdd3
[osd11]
host = mach11
osd data = /disk/sdd3/cephdata
osd journal = /disk/sdc3/cephjnl.sdd3
[osd12]
host = mach12
osd data = /disk/sdd3/cephdata
osd journal = /disk/sdc3/cephjnl.sdd3
[osd13]
host = mach13
osd data = /disk/sdd3/cephdata
osd journal = /disk/sdc3/cephjnl.sdd3
[osd14]
host = mach14
osd data = /disk/sdd3/cephdata
osd journal = /disk/sdc3/cephjnl.sdd3
[osd15]
host = mach15
osd data = /disk/sdd3/cephdata
osd journal = /disk/sdc3/cephjnl.sdd3
[osd16]
host = mach16
osd data = /disk/sdd3/cephdata
osd journal = /disk/sdc3/cephjnl.sdd3
[osd17]
host = mach17
osd data = /disk/sdd3/cephdata
osd journal = /disk/sdc3/cephjnl.sdd3
[osd18]
host = mach18
osd data = /disk/sdd3/cephdata
osd journal = /disk/sdc3/cephjnl.sdd3
[osd19]
host = mach19
osd data = /disk/sdd3/cephdata
osd journal = /disk/sdc3/cephjnl.sdd3
[osd20]
host = mach10
osd data = /disk/sde3/cephdata
osd journal = /disk/sdc3/cephjnl.sde3
[osd21]
host = mach11
osd data = /disk/sde3/cephdata
osd journal = /disk/sdc3/cephjnl.sde3
[osd22]
host = mach12
osd data = /disk/sde3/cephdata
osd journal = /disk/sdc3/cephjnl.sde3
[osd23]
host = mach13
osd data = /disk/sde3/cephdata
osd journal = /disk/sdc3/cephjnl.sde3
[osd24]
host = mach14
osd data = /disk/sde3/cephdata
osd journal = /disk/sdc3/cephjnl.sde3
[osd25]
host = mach15
osd data = /disk/sde3/cephdata
osd journal = /disk/sdc3/cephjnl.sde3
[osd26]
host = mach16
osd data = /disk/sde3/cephdata
osd journal = /disk/sdc3/cephjnl.sde3
[osd27]
host = mach17
osd data = /disk/sde3/cephdata
osd journal = /disk/sdc3/cephjnl.sde3
[osd28]
host = mach18
osd data = /disk/sde3/cephdata
osd journal = /disk/sdc3/cephjnl.sde3
[osd29]
host = mach19
osd data = /disk/sde3/cephdata
osd journal = /disk/sdc3/cephjnl.sde3
[osd30]
host = mach10
osd data = /disk/sdf3/cephdata
osd journal = /disk/sdc3/cephjnl.sdf3
[osd31]
host = mach11
osd data = /disk/sdf3/cephdata
osd journal = /disk/sdc3/cephjnl.sdf3
[osd32]
host = mach12
osd data = /disk/sdf3/cephdata
osd journal = /disk/sdc3/cephjnl.sdf3
[osd33]
host = mach13
osd data = /disk/sdf3/cephdata
osd journal = /disk/sdc3/cephjnl.sdf3
[osd34]
host = mach14
osd data = /disk/sdf3/cephdata
osd journal = /disk/sdc3/cephjnl.sdf3
[osd35]
host = mach15
osd data = /disk/sdf3/cephdata
osd journal = /disk/sdc3/cephjnl.sdf3
[osd36]
host = mach16
osd data = /disk/sdf3/cephdata
osd journal = /disk/sdc3/cephjnl.sdf3
[osd37]
host = mach17
osd data = /disk/sdf3/cephdata
osd journal = /disk/sdc3/cephjnl.sdf3
[osd38]
host = mach18
osd data = /disk/sdf3/cephdata
osd journal = /disk/sdc3/cephjnl.sdf3
[osd39]
host = mach19
osd data = /disk/sdf3/cephdata
osd journal = /disk/sdc3/cephjnl.sdf3
[osd40]
host = mach10
osd data = /disk/sdg3/cephdata
osd journal = /disk/sdc3/cephjnl.sdg3
[osd41]
host = mach11
osd data = /disk/sdg3/cephdata
osd journal = /disk/sdc3/cephjnl.sdg3
[osd42]
host = mach12
osd data = /disk/sdg3/cephdata
osd journal = /disk/sdc3/cephjnl.sdg3
[osd43]
host = mach13
osd data = /disk/sdg3/cephdata
osd journal = /disk/sdc3/cephjnl.sdg3
[osd44]
host = mach14
osd data = /disk/sdg3/cephdata
osd journal = /disk/sdc3/cephjnl.sdg3
[osd45]
host = mach15
osd data = /disk/sdg3/cephdata
osd journal = /disk/sdc3/cephjnl.sdg3
[osd46]
host = mach16
osd data = /disk/sdg3/cephdata
osd journal = /disk/sdc3/cephjnl.sdg3
[osd47]
host = mach17
osd data = /disk/sdg3/cephdata
osd journal = /disk/sdc3/cephjnl.sdg3
[osd48]
host = mach18
osd data = /disk/sdg3/cephdata
osd journal = /disk/sdc3/cephjnl.sdg3
[osd49]
host = mach19
osd data = /disk/sdg3/cephdata
osd journal = /disk/sdc3/cephjnl.sdg3
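The fifty [osdN] sections above follow a regular pattern: the host cycles through mach10 to mach19, the data disk advances every ten OSDs, and every journal lives on sdc3. A minimal Python sketch that regenerates them, assuming that pattern holds for every section; osd_section and osd_sections are hypothetical helper names, not part of Ceph.

```python
# Data disks per OSD server, in the order OSD ids are assigned in the file above.
DATA_DISKS = ["sdb3", "sdd3", "sde3", "sdf3", "sdg3"]
JOURNAL_DISK = "sdc3"  # the shared journal spindle on each server
HOSTS = [f"mach{n}" for n in range(10, 20)]


def osd_section(osd_id):
    """Return the ceph.conf section for one OSD (hypothetical helper)."""
    host = HOSTS[osd_id % len(HOSTS)]         # hosts repeat every 10 ids
    disk = DATA_DISKS[osd_id // len(HOSTS)]   # disk changes every 10 ids
    return "\n".join([
        f"[osd{osd_id}]",
        f"host = {host}",
        f"osd data = /disk/{disk}/cephdata",
        f"osd journal = /disk/{JOURNAL_DISK}/cephjnl.{disk}",
    ])


def osd_sections():
    """Return all OSD sections, one per (host, data disk) pair."""
    total = len(HOSTS) * len(DATA_DISKS)
    return "\n".join(osd_section(i) for i in range(total))


if __name__ == "__main__":
    print(osd_sections())
```

Generating the sections this way makes it easier to change one thing at a time, for example moving each journal onto its own data disk, without hand-editing fifty stanzas.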
* Re: OOM's on the Ceph client machine
From: Gregory Farnum @ 2010-10-13 2:30 UTC
To: Theodore Ts'o; +Cc: ceph-devel, mrubin

On Tue, Oct 12, 2010 at 5:31 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> Hi there,
>
> I've recently been playing with Ceph on an evaluation basis, and found
> that I was able to fairly reliably induce an OOM kill on my Ceph
> client machine by using FFSB with the following configuration file (see
> attached, below).

Does this mean you're using cfuse rather than the kernel client?
FUSE performance in general is fairly disappointing and our cfuse is
probably not as fast as the kernel client even so, though I don't
think it should be *that* unhappy in most environments.

> I am using Ceph v0.21.3 plus a few commits that were on the testing
> branch as of late September (commit ID 569d96b). The Ceph cluster
> contains 10 commodity servers with 5 disks configured for Ceph object
> storage on each server (plus a separate spindle for the journal files),
> so there are 5 instances of cosd on each OSD server. The disks are
> formatted using ext4 in no-journal mode. I am using 3 servers for the
> MDS and monitoring daemons, with the MDS and monitoring daemons
> colocated on these 3 servers. The machines all have gigabit ethernet
> cards.

So you have 5 journals running on one spindle? This could be the cause
of your slightly low sequential write performance; in the current
default configuration writes have to go to the journal before going to
the main disk and with multiple OSDs on one journal spindle they could
be getting in each other's way.

Also, how much memory do you have on these machines?

> P.S. In case people are curious, here are the results of the "boxacle"
> (http://btrfs.boxacle.net) FFSB workloads that I ran. The results are
> fairly stable, except very often the 8 thread random_write workload is a
> little hard to reproduce because it very often OOM's. I've never gotten
> a 32 thread random_write workload measurement, since it very reliably
> OOM's on my client machine.
>
> Do these results look reasonable to you? I confess I'm a little
> disappointed with the sequential and random read numbers in particular.
> And given 10 servers and fifty spindles, even the large_file_create
> numbers seem surprisingly slow.

I'm not familiar with FFSB and there doesn't seem to be any
easily-accessible documentation, can you tell us a little more about
how it works? For instance, how are the test files created (are they
written out for the reads and then tested? Does the random write
create the files as it goes, or are they pre-existing and then
overwritten)?

A few thoughts/wild guesses:
I'm not sure exactly what the limit is, but 114MB/s reads are close to
what you can get over a 1Gb link.
If single-threaded FFSB means there's only one request in-flight at a
time there may be a latency issue which is causing those 35MB/s reads.
The kernel client ought to be prefetching but maybe it's not doing so
properly, and I don't recall how much prefetching cfuse is actually
capable of. Sage can say more on this.

> (Also, given that we are using gigabit ethernet in this evaluation
> cluster, the 1GB/sec seems ridiculously high, which suggests to me that
> the fsync request wasn't honored -- FFSB includes the fsync time when
> calculating write bandwidth -- and it may explain why we are OOM'ing in
> the random_write workload.)

Err, yes. Extremely odd. In glancing over cfuse this looks like it's
working properly, but if you confirm that's what you're using I'll
trace it.

-Greg
* Re: OOM's on the Ceph client machine
From: Ted Ts'o @ 2010-10-13 3:34 UTC
To: Gregory Farnum; +Cc: ceph-devel, mrubin

On Tue, Oct 12, 2010 at 07:30:48PM -0700, Gregory Farnum wrote:
> Does this mean you're using cfuse rather than the kernel client?
> FUSE performance in general is fairly disappointing and our cfuse is
> probably not as fast as the kernel client even so, though I don't
> think it should be *that* unhappy in most environments.

No, I'm using the kernel client (from 2.6.34). Specifically, I'm
doing a "modprobe ceph; mount -t ceph 1.2.3.4:/ /mnt"

Sorry, I should have mentioned that. I can use a more recent kernel
(i.e., 2.6.36-rc7) if that's likely to help.

> So you have 5 journals running on one spindle? This could be the cause
> of your slightly low sequential write performance; in the current
> default configuration writes have to go to the journal before going to
> the main disk and with multiple OSDs on one journal spindle they could
> be getting in each other's way.

Hmm, what do you recommend, then? The problem is if the journal only
needs to be a few gigabytes (I used a 5GB file), using an entire 1T or
2T disk just so each of the journals can have their own spindle is
pretty wasteful.

> Also, how much memory do you have on these machines?

32GB

> I'm not familiar with FFSB and there doesn't seem to be any
> easily-accessible documentation, can you tell us a little more about
> how it works? For instance, how are the test files created (are they
> written out for the reads and then tested? Does the random write
> create the files as it goes, or are they pre-existing and then
> overwritten)?

There's a quick explanation of these workloads at
http://btrfs.boxacle.net (I'm using the raid configuration FFSB
files), but essentially, the large file create test is creating 100MB
files as quickly as possible. In the rest of the tests we create 1024
100MB files, and then try (a) reading from them sequentially as
quickly as possible, (b) picking a random file, and a random offset,
and read 5MB, and repeat, (c) picking a random file, and a random
offset, and write 5MB, and repeat.

The creation of the 1024 100MB files (if necessary; the tests will
reuse the previously created set of 100MB files) is not counted in the
benchmark time. So in the last three tests there is no block
allocation; just the time it takes to read or overwrite existing data
blocks.

Note BTW that this is not intrinsic to FFSB; FFSB stands for the
"flexible filesystem benchmark" system. All of this is configurable
using the ffsb config files. I'm just reusing the "boxacle workloads"
just because they are convenient, and I'm familiar with how they work
on local disk filesystems. They're used for example for benchmarking
ext4 here: http://free.linux.hp.com/~enw/ext4/2.6.35/, and for btrfs
here: http://btrfs.boxacle.net. (And when the IBM folks have done
btrfs benchmarks, since they are so detailed and the
hardware/configurations are so well described, I've also used them to
help improve ext4's performance.)

> A few thoughts/wild guesses:
> I'm not sure exactly what the limit is, but 114MB/s reads are close to
> what you can get over a 1Gb link.
> If single-threaded FFSB means there's only one request in-flight at a
> time there may be a latency issue which is causing those 35MB/s reads.
> The kernel client ought to be prefetching but maybe it's not doing so
> properly, and I don't recall how much prefetching cfuse is actually
> capable of. Sage can say more on this.

I'm not using cfuse; I'm using the in-kernel Ceph module. As far as
network latency is concerned, ping RTT time is under 0.25ms. And sure
maybe it's a prefetching issue --- but in that case I would have
expected 8 threads to have done better than 2x the 1 thread case.

- Ted
* Re: OOM's on the Ceph client machine
From: Sage Weil @ 2010-10-13 17:29 UTC
To: Ted Ts'o; +Cc: Gregory Farnum, ceph-devel, mrubin

Hi Ted,

On Tue, 12 Oct 2010, Ted Ts'o wrote:
> On Tue, Oct 12, 2010 at 07:30:48PM -0700, Gregory Farnum wrote:
> > Does this mean you're using cfuse rather than the kernel client?
> > FUSE performance in general is fairly disappointing and our cfuse is
> > probably not as fast as the kernel client even so, though I don't
> > think it should be *that* unhappy in most environments.
>
> No, I'm using the kernel client (from 2.6.34). Specifically, I'm
> doing a "modprobe ceph; mount -t ceph 1.2.3.4:/ /mnt"
>
> Sorry, I should have mentioned that. I can use a more recent kernel
> (i.e., 2.6.36-rc7) if that's likely to help.

There have been a number of memory leak fixes since then, at least one of
which may be causing your problem (it was caused by an uninitialized
variable and didn't usually trigger for us, but may in your environment).
Can you retry with the latest mainline? The benchmark completes without
problems in my test environment.

> > So you have 5 journals running on one spindle? This could be the cause
> > of your slightly low sequential write performance; in the current
> > default configuration writes have to go to the journal before going to
> > the main disk and with multiple OSDs on one journal spindle they could
> > be getting in each other's way.
>
> Hmm, what do you recommend, then? The problem is if the journal only
> needs to be a few gigabytes (I used a 5GB file), using an entire 1T or
> 2T disk just so each of the journals can have their own spindle is
> pretty wasteful.

If fsync on a single file in journal-less ext4 doesn't do any extra work,
I would just put the (preallocated) journal file together with the data on
each disk. Usually that's bad news because of the journal flushing, but
you shouldn't have that problem. Alternatively, you could use a small
separate partition on the same spindle.

sage
* Re: OOM's on the Ceph client machine
From: Ted Ts'o @ 2010-10-14 0:03 UTC
To: Sage Weil; +Cc: Gregory Farnum, ceph-devel, mrubin

On Wed, Oct 13, 2010 at 10:29:43AM -0700, Sage Weil wrote:
> There have been a number of memory leak fixes since then, at least one of
> which may be causing your problem (it was caused by an uninitialized
> variable and didn't usually trigger for us, but may in your environment).
> Can you retry with the latest mainline? The benchmark completes without
> problems in my test environment.

Sure. This may have to wait until early next week for me to retry
with the latest mainline, but I'll definitely move to 2.6.36 in the
near future.

> If fsync on a single file in journal-less ext4 doesn't do any extra work,
> I would just put the (preallocated) journal file together with the data on
> each disk. Usually that's bad news because of the journal flushing, but
> you shouldn't have that problem. Alternatively, you could use a small
> separate partition on the same spindle.

I'm currently reformatting the Ceph cluster to put the journal for
/dev/sdX3 on /disk/sdX3/ceph.journal, so I'll try that test first, and
see what difference that makes. That way I can make one change at a
time and see what difference each change in my cluster configuration
actually gives me.

BTW, this might be a good time to report a tiny little problem which I
found. If the journal file doesn't exist, then when you run mkcephfs,
cosd will attempt to create the file for you. But it creates it as a
4k file, and then it loops forever in FileJournal::wrap_read_bl() on
line 808, because get_top() and header.max_size are both 4096, and
it results in it being an expensive while (1) loop. This completely
stalls the mkcephfs operation, and it took me a while to debug.

It might be nice if cosd either (a) failed completely if the journal
file is missing, or too small, or (b) if cosd is started in mkfs mode,
and the journal file does not exist, perhaps it should create a
journal file with some suitable default size.

For stuff like this, I assume the right thing to do is to just open a
bug in tracker.newdream.net? Are there any project-specific customs I
should be aware of?

Thanks,

- Ted
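Until cosd handles a missing journal more gracefully, one workaround is to create the journal file at full size before running mkcephfs. The following is only a sketch of that idea: preallocate_journal is a hypothetical helper, not part of Ceph, and the path and 5GB size in the usage note are taken from the messages in this thread rather than from any Ceph default.

```python
import os


def preallocate_journal(path, size_bytes):
    """Create `path` with `size_bytes` of allocated space if it is missing.

    Hypothetical helper, not part of Ceph. Returns True if the file was
    created and False if it already existed (it is then left untouched).
    """
    if os.path.exists(path):
        return False
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        try:
            # Ask the filesystem to reserve the blocks up front.
            os.posix_fallocate(fd, 0, size_bytes)
        except (AttributeError, OSError):
            # Fallback when posix_fallocate is unavailable or unsupported:
            # write zeros so the blocks are really allocated.
            os.lseek(fd, 0, os.SEEK_SET)
            chunk = b"\0" * (1024 * 1024)
            remaining = size_bytes
            while remaining > 0:
                written = os.write(fd, chunk[:min(len(chunk), remaining)])
                remaining -= written
    finally:
        os.close(fd)
    return True
```

For the layout described above, this would be called once per data disk, for example preallocate_journal("/disk/sdb3/ceph.journal", 5 * 1024**3), before starting mkcephfs.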
* Re: OOM's on the Ceph client machine
From: Sage Weil @ 2010-10-14 3:43 UTC
To: Ted Ts'o; +Cc: Gregory Farnum, ceph-devel, mrubin, martin

On Wed, 13 Oct 2010, Ted Ts'o wrote:
> On Wed, Oct 13, 2010 at 10:29:43AM -0700, Sage Weil wrote:
> > There have been a number of memory leak fixes since then, at least one of
> > which may be causing your problem (it was caused by an uninitialized
> > variable and didn't usually trigger for us, but may in your environment).
> > Can you retry with the latest mainline? The benchmark completes without
> > problems in my test environment.
>
> Sure. This may have to wait until early next week for me to retry
> with the latest mainline, but I'll definitely move to 2.6.36 in the
> near future.
>
> > If fsync on a single file in journal-less ext4 doesn't do any extra work,
> > I would just put the (preallocated) journal file together with the data on
> > each disk. Usually that's bad news because of the journal flushing, but
> > you shouldn't have that problem. Alternatively, you could use a small
> > separate partition on the same spindle.
>
> I'm currently reformatting the Ceph cluster to put the journal for
> /dev/sdX3 on /disk/sdX3/ceph.journal, so I'll try that test first, and
> see what difference that makes. That way I can make one change at a
> time and see what difference each change in my cluster configuration
> actually gives me.

Sounds good!

> BTW, this might be a good time to report a tiny little problem which I
> found. If the journal file doesn't exist, then when you run mkcephfs,
> cosd will attempt to create the file for you. But it creates it as a
> 4k file, and then it loops forever in FileJournal::wrap_read_bl() on
> line 808, because get_top() and header.max_size are both 4096, and
> it results in it being an expensive while (1) loop. This completely
> stalls the mkcephfs operation, and it took me a while to debug.

That likely explains the hang Martin saw a few days back, and why we
haven't hit it (our journal files usually already exist). Thanks for
tracking that down!

> It might be nice if cosd either (a) failed completely if the journal
> file is missing, or too small, or (b) if cosd is started in mkfs mode,
> and the journal file does not exist, perhaps it should create a
> journal file with some suitable default size.
>
> For stuff like this, I assume the right thing to do is to just open a
> bug in tracker.newdream.net? Are there any project-specific customs I
> should be aware of?

Yeah, entering it directly in the tracker is nice, although just
reporting it here is fine as well. I went ahead and added this one
(#487).

Thanks!
sage
* Re: OOM's on the Ceph client machine
From: Ted Ts'o @ 2010-10-21 20:36 UTC
To: Sage Weil; +Cc: Gregory Farnum, ceph-devel, mrubin

On Wed, Oct 13, 2010 at 08:03:06PM -0400, Ted Ts'o wrote:
> On Wed, Oct 13, 2010 at 10:29:43AM -0700, Sage Weil wrote:
> > There have been a number of memory leak fixes since then, at least one of
> > which may be causing your problem (it was caused by an uninitialized
> > variable and didn't usually trigger for us, but may in your environment).
> > Can you retry with the latest mainline? The benchmark completes without
> > problems in my test environment.
>
> Sure. This may have to wait until early next week for me to retry
> with the latest mainline, but I'll definitely move to 2.6.36 in the
> near future.

Just to give you an update. I've tried to use 2.6.34 with nearly all
of the commits that apply to fs/ceph between 2.6.34 and 2.6.36-rc7
both with the 0.21 version of Ceph servers, as well as 0.22 plus some
testing bug fixes (up to fd42c852). In both cases, using the newer
Ceph client causes the FFSB process to hang when it tries running the
sync command. The dmesg is filled with lines like this:

[ 4756.662789] ceph: skipping osd40 192.168.11.8:6808 seq 2495, expected 2496
[ 4756.662832] ceph: skipping osd7 192.168.12.18:6800 seq 4274, expected 4275
[ 4756.662843] ceph: skipping osd14 192.168.12.15:6802 seq 4124, expected 4125
[ 4756.662853] ceph: skipping osd38 192.168.11.3:6806 seq 3289, expected 3290
[ 4756.663093] ceph: skipping osd7 192.168.12.18:6800 seq 4275, expected 4276
[ 4756.882336] ceph: skipping osd7 192.168.12.18:6800 seq 4276, expected 4277
[ 4757.996962] ceph: skipping osd40 192.168.11.8:6808 seq 2496, expected 2497
[ 4757.997267] ceph: skipping osd7 192.168.12.18:6800 seq 4277, expected 4278
[ 4758.000149] ceph: skipping osd38 192.168.11.3:6806 seq 3290, expected 3291
[ 4758.003755] ceph: skipping osd14 192.168.12.15:6802 seq 4125, expected 4126
[ 4758.018078] ceph: skipping osd14 192.168.12.15:6802 seq 4126, expected 4127
[ 4758.018787] ceph: skipping osd7 192.168.12.18:6800 seq 4278, expected 4279
[ 4758.020263] ceph: skipping osd40 192.168.11.8:6808 seq 2497, expected 2498
[ 4758.020370] ceph: skipping osd10 192.168.11.8:6802 seq 946, expected 947
[ 4761.670848] ceph: tid 4422463 timed out on osd7, will reset osd
[ 4761.813068] ceph: tid 4480042 timed out on osd40, will reset osd
[ 4761.956584] ceph: tid 4487615 timed out on osd14, will reset osd
[ 4762.102343] ceph: tid 4645028 timed out on osd38, will reset osd
[ 4762.249425] ceph: skipping osd10 192.168.11.8:6802 seq 947, expected 948
[ 4767.257944] ceph: skipping osd10 192.168.11.8:6802 seq 948, expected 949
[ 4768.047058] ceph: skipping osd10 192.168.11.8:6802 seq 949, expected 950
[ 4772.260309] ceph: tid 4817033 timed out on osd10, will reset osd

It's very possible (likely, even) that this was caused by my backwards
porting of the various ceph patches to 2.6.34. Hopefully later today
I'll be able to do an actual test run using 2.6.36, without needing to
use "git cherry-pick" on some 170 odd patches. For a variety of
reasons it was easier for me to use 2.6.34 as a base (drivers, patches
that support dmesg dumps over the network after kernel panic/oops, and
other stuff needed for our environment) but I should be able to move
to 2.6.36 soon.

I also ran into strange problems (which I haven't tried to
characterize accurately enough for a bug report) when using the 2.6.34
client against the new 0.22 release. Is this expected to work? If
so, I can try to more accurately characterize what was going on.

Also, it seems that there are issues moving back and forth between
0.21 and 0.22 without reformatting the ceph client. Is that accurate?
It looked like when I tried going back to 0.21, I needed to rerun
mkcephfs, or else the 0.21 cmon, cosd or cmds daemons would die with
various failures when they saw the 0.22 data files. That's not
surprising, but it does make it a little harder for me to go back and
forth between 0.21 and 0.22 for the purpose of differential debugging.
If I can get something stable working with 0.22 against either the
2.6.34 or 2.6.36 Ceph client, I'll drop my efforts using 0.21.

Thanks, regards,

- Ted
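When dmesg fills up with lines like the ones above, a per-OSD tally makes it easier to see which OSDs are involved. A minimal Python sketch, assuming the two message formats shown in this message are the only ones of interest; the regular expressions are derived from those sample lines and summarize is a hypothetical helper name.

```python
import re
from collections import Counter

# Patterns derived from the dmesg lines quoted in this message.
SKIP_RE = re.compile(r"ceph: skipping (osd\d+) \S+ seq (\d+), expected (\d+)")
TIMEOUT_RE = re.compile(r"ceph: tid (\d+) timed out on (osd\d+), will reset osd")


def summarize(lines):
    """Count skipped-sequence and timeout messages per OSD."""
    skipped = Counter()
    timed_out = Counter()
    for line in lines:
        match = SKIP_RE.search(line)
        if match:
            skipped[match.group(1)] += 1
            continue
        match = TIMEOUT_RE.search(line)
        if match:
            timed_out[match.group(2)] += 1
    return skipped, timed_out


# Three of the lines from the log above, used as sample input.
sample = [
    "[ 4756.662789] ceph: skipping osd40 192.168.11.8:6808 seq 2495, expected 2496",
    "[ 4756.662832] ceph: skipping osd7 192.168.12.18:6800 seq 4274, expected 4275",
    "[ 4761.670848] ceph: tid 4422463 timed out on osd7, will reset osd",
]
skipped, timed_out = summarize(sample)
print("skipped:", dict(skipped))
print("timed out:", dict(timed_out))
```

In practice the input would be the full dmesg output read line by line instead of the short sample list.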
* Re: OOM's on the Ceph client machine
2010-10-21 20:36 ` Ted Ts'o
@ 2010-10-21 21:46 ` Sage Weil
2010-10-21 22:28 ` Ted Ts'o
0 siblings, 1 reply; 13+ messages in thread
From: Sage Weil @ 2010-10-21 21:46 UTC (permalink / raw)
To: Ted Ts'o; +Cc: Gregory Farnum, ceph-devel, mrubin

On Thu, 21 Oct 2010, Ted Ts'o wrote:
> On Wed, Oct 13, 2010 at 08:03:06PM -0400, Ted Ts'o wrote:
> > On Wed, Oct 13, 2010 at 10:29:43AM -0700, Sage Weil wrote:
> > > There have been a number of memory leak fixes since then, at least one of
> > > which may be causing your problem (it was caused by an uninitialized
> > > variable and didn't usually trigger for us, but may in your environment).
> > > Can you retry with the latest mainline? The benchmark completes without
> > > problems in my test environment.
> >
> > Sure. This may have to wait until early next week for me to retry
> > with the latest mainline, but I'll definitely move to 2.6.36 in the
> > near future.
>
> Just to give you an update. I've tried to use 2.6.34 with nearly all
> of the commits that apply to fs/ceph between 2.6.34 and 2.6.36-rc7
> both with the 0.21 version of Ceph servers, as well as 0.22 plus some
> testing bug fixes (up to fd42c852). In both cases, using newer Ceph
> client causes the FFSB process to hang when it tries running the sync
> command. The dmesg is filled with lines like this:
>
> [ 4756.662789] ceph: skipping osd40 192.168.11.8:6808 seq 2495, expected 2496
> [ 4756.662832] ceph: skipping osd7 192.168.12.18:6800 seq 4274, expected 4275
> [ 4756.662843] ceph: skipping osd14 192.168.12.15:6802 seq 4124, expected 4125
> [ 4756.662853] ceph: skipping osd38 192.168.11.3:6806 seq 3289, expected 3290
> [ 4756.663093] ceph: skipping osd7 192.168.12.18:6800 seq 4275, expected 4276
> [ 4756.882336] ceph: skipping osd7 192.168.12.18:6800 seq 4276, expected 4277
> [ 4757.996962] ceph: skipping osd40 192.168.11.8:6808 seq 2496, expected 2497
> [ 4757.997267] ceph: skipping osd7 192.168.12.18:6800 seq 4277, expected 4278
> [ 4758.000149] ceph: skipping osd38 192.168.11.3:6806 seq 3290, expected 3291
> [ 4758.003755] ceph: skipping osd14 192.168.12.15:6802 seq 4125, expected 4126
> [ 4758.018078] ceph: skipping osd14 192.168.12.15:6802 seq 4126, expected 4127
> [ 4758.018787] ceph: skipping osd7 192.168.12.18:6800 seq 4278, expected 4279
> [ 4758.020263] ceph: skipping osd40 192.168.11.8:6808 seq 2497, expected 2498
> [ 4758.020370] ceph: skipping osd10 192.168.11.8:6802 seq 946, expected 947
> [ 4761.670848] ceph: tid 4422463 timed out on osd7, will reset osd
> [ 4761.813068] ceph: tid 4480042 timed out on osd40, will reset osd
> [ 4761.956584] ceph: tid 4487615 timed out on osd14, will reset osd
> [ 4762.102343] ceph: tid 4645028 timed out on osd38, will reset osd
> [ 4762.249425] ceph: skipping osd10 192.168.11.8:6802 seq 947, expected 948
> [ 4767.257944] ceph: skipping osd10 192.168.11.8:6802 seq 948, expected 949
> [ 4768.047058] ceph: skipping osd10 192.168.11.8:6802 seq 949, expected 950
> [ 4772.260309] ceph: tid 4817033 timed out on osd10, will reset osd
>
> It's very possible (likely, even) that this was caused by my backwards
> porting of the various ceph patches to 2.6.34. Hopefully later today
> I'll be able to do an actual test run using 2.6.36, without needing to
> use "git cherry-pick" on some 170 odd patches. For a variety of
> reasons it was easier for me to use 2.6.34 as a base (drivers, patches
> that support dmesg dumps over the network after kernel panic/oops, and
> other stuff needed for our environment) but I should be able to move
> to 2.6.36 soon.

There is a ceph-client-standalone.git that has just the module source,
with backport #ifdefs through 2.6.27 (see the master-backport or
unstable-backport branches). It isn't well tested, but may be worth a
shot if 2.6.36 is problematic for other reasons.

Unfortunately it's not obvious to me from dmesg where the problem is,
other than that it looks like some of the osds aren't responding (but are
apparently still up). There is a known regression in v0.22 that can cause
crashes in the osd cluster; we should have a fix pushed later today.
That would look a bit different, though (you'd see osd down messages).
I'll post an update (and probably v0.22.1) when that's been tested.

> I also ran into strange problems (which I haven't tried to
> characterize accurately enough for a bug report) when using the 2.6.34
> client against the new 0.22 release. Is this expected to work? If
> so, I can try to more accurately characterize what was going on.

The vanilla 2.6.34 client you mean? There have been a range of bugs fixed
since then (enough for me to lose track of), so I wouldn't be surprised to
see problems. And it's not something we've been testing. That said, the
basics should work.

> Also, It seems that there are issues moving back and forth between
> 0.21 and 0.22 without reformating the ceph client. Is that accurate?

Yeah, that isn't expected to work. In general, rolling backward isn't
supported. In this case we forgot to add an incompat flag to generate a
nice error message to that effect.

sage

^ permalink raw reply [flat|nested] 13+ messages in thread
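For anyone trying to see which OSDs stop responding in a dmesg excerpt like the one quoted above, a small tally can help. The following is only a sketch: the two regular expressions are written against the message shapes shown in that excerpt ("skipping osdN ... seq X, expected Y" and "tid N timed out on osdN, will reset osd"), and the function name summarize and the sample list are made up for illustration, not part of any Ceph tool.

```python
import re
from collections import Counter

# Patterns written against the client messages quoted in the dmesg excerpt.
SKIP_RE = re.compile(r"ceph: skipping (osd\d+) (\S+) seq (\d+), expected (\d+)")
TIMEOUT_RE = re.compile(r"ceph: tid (\d+) timed out on (osd\d+), will reset osd")


def summarize(lines):
    """Return (skips, timeouts): per-OSD counts of skipped messages and timeouts."""
    skips, timeouts = Counter(), Counter()
    for line in lines:
        m = SKIP_RE.search(line)
        if m:
            skips[m.group(1)] += 1
            continue
        m = TIMEOUT_RE.search(line)
        if m:
            timeouts[m.group(2)] += 1
    return skips, timeouts


# A few lines copied from the excerpt; in practice, feed in the dmesg output.
sample = [
    "[ 4756.662789] ceph: skipping osd40 192.168.11.8:6808 seq 2495, expected 2496",
    "[ 4756.662832] ceph: skipping osd7 192.168.12.18:6800 seq 4274, expected 4275",
    "[ 4756.663093] ceph: skipping osd7 192.168.12.18:6800 seq 4275, expected 4276",
    "[ 4761.670848] ceph: tid 4422463 timed out on osd7, will reset osd",
]

skips, timeouts = summarize(sample)
for osd in sorted(skips, key=lambda name: int(name[3:])):
    print(osd, "skipped:", skips[osd], "timeouts:", timeouts[osd])
```

If the same few OSDs account for most of the skips and timeouts, that would be consistent with a handful of unresponsive daemons rather than a client-wide problem, though the tally by itself does not say why they stopped responding.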
* Re: OOM's on the Ceph client machine
2010-10-21 21:46 ` Sage Weil
@ 2010-10-21 22:28 ` Ted Ts'o
2010-10-21 22:44 ` Sage Weil
0 siblings, 1 reply; 13+ messages in thread
From: Ted Ts'o @ 2010-10-21 22:28 UTC (permalink / raw)
To: Sage Weil; +Cc: Gregory Farnum, ceph-devel, mrubin

On Thu, Oct 21, 2010 at 02:46:11PM -0700, Sage Weil wrote:
>
> Unfortunately it's not obvious to me from dmesg where the problem is,
> other than that it looks like some of the osds aren't responding (but are
> apparently still up). There is a known regression in v0.22 that can cause
> crashes in the osd cluster; we should have a fix pushed later today.
> That would look a bit different, though (you'd see osd down messages).
> I'll post an update (and probably v0.22.1) when that's been tested.

I looked earlier in the logs, and I do see some "osd down", "osd up",
and "osd socket closed" messages. So it looks like the v0.22
regression you mentioned. I'll wait for the git update and try
rebuilding the server. Thanks!!

> > Also, It seems that there are issues moving back and forth between
> > 0.21 and 0.22 without reformating the ceph client. Is that accurate?
>
> Yeah, that isn't expected to work. In general, rolling backward isn't
> supported. In this case we forgot to add an incompat flag to generate a
> nice error message to that effect.

Is rolling forward between 0.21 and 0.22 expected to work? Or should
I just do a mkcephfs just to be safe? It's not a data preservation
issue, but rather the time it takes to do a mkcephfs.

Random question: how do you feel about using Python? Trying to make a
version of mkcephfs that runs in parallel would probably be easier if
we could port the shell script to a python script. I don't think
there are any Python dependencies in Ceph right now, though.

- Ted

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: OOM's on the Ceph client machine
2010-10-21 22:28 ` Ted Ts'o
@ 2010-10-21 22:44 ` Sage Weil
0 siblings, 0 replies; 13+ messages in thread
From: Sage Weil @ 2010-10-21 22:44 UTC (permalink / raw)
To: Ted Ts'o; +Cc: Gregory Farnum, ceph-devel, mrubin

On Thu, 21 Oct 2010, Ted Ts'o wrote:
> On Thu, Oct 21, 2010 at 02:46:11PM -0700, Sage Weil wrote:
> >
> > Unfortunately it's not obvious to me from dmesg where the problem is,
> > other than that it looks like some of the osds aren't responding (but are
> > apparently still up). There is a known regression in v0.22 that can cause
> > crashes in the osd cluster; we should have a fix pushed later today.
> > That would look a bit different, though (you'd see osd down messages).
> > I'll post an update (and probably v0.22.1) when that's been tested.
>
> I looked earlier in the logs, and I do see some "osd down", "osd up",
> and "osd socket closed" messages. So it looks like the v0.22
> regression you mentioned. I'll wait for the git update and try
> rebuilding the server. Thanks!!

Phew! :)

> > > Also, It seems that there are issues moving back and forth between
> > > 0.21 and 0.22 without reformating the ceph client. Is that accurate?
> >
> > Yeah, that isn't expected to work. In general, rolling backward isn't
> > supported. In this case we forgot to add an incompat flag to generate a
> > nice error message to that effect.
>
> Is rolling forward between 0.21 and 0.22 expected to work? Or should
> I just do a mkcephfs just to be safe? It's not a data preservation
> issue, but rather the time it takes to do a mkcephfs.

Rolling forward is always supposed to work. (And if we do end up changing
things in a non-backward compatible way, we'll make some noise about it.)

> Random
> question: how do you feel about using Python? Trying to make a
> version of mkcephfs that runs in parallel would probably be easier if
> we could port the shell script to a python script. I don't think
> there are any Python dependencies in Ceph right now, though.

Python's fine. There's an issue in the tracker relating to this, btw.
The goal will be to create discrete steps that let you use whatever
cluster-specific tools you have for launching parallel jobs.

http://tracker.newdream.net/issues/400

sage

^ permalink raw reply [flat|nested] 13+ messages in thread
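To make the "discrete steps run in parallel" idea from the message above concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: run_on_host and run_step are hypothetical names, ssh stands in for whatever cluster-specific launcher a site uses, and the per-host command is a placeholder rather than anything mkcephfs actually provides.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_on_host(host, command, runner=subprocess.run):
    """Run one setup step on a single host and return (host, exit status).

    ssh is only a stand-in for a site's own job launcher; pass a different
    `runner` to substitute another tool (or a fake, for testing).
    """
    result = runner(["ssh", host, command])
    return host, result.returncode


def run_step(hosts, command, max_workers=10, runner=subprocess.run):
    """Run the same step on every host in parallel; return {host: exit status}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(lambda h: run_on_host(h, command, runner), hosts))


# Hypothetical usage, with host names in the style of the posted ceph.conf:
#   hosts = ["mach%d" % n for n in range(10, 20)]
#   status = run_step(hosts, "/path/to/per-host-init-step")  # placeholder command
#   failed = [h for h, rc in status.items() if rc != 0]
```

Keeping each step as one function call per host is what makes it easy to swap the launcher: only the runner changes, while the step ordering stays in one place.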
* Re: OOM's on the Ceph client machine
2010-10-13 0:31 OOM's on the Ceph client machine Theodore Ts'o
2010-10-13 2:30 ` Gregory Farnum
@ 2010-10-13 3:43 ` DongJin Lee
2010-10-13 17:42 ` Sage Weil
2 siblings, 0 replies; 13+ messages in thread
From: DongJin Lee @ 2010-10-13 3:43 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: ceph-devel, mrubin

Hi Ted,

I'd like to try a setup similar to yours. I'm currently running a very
recent version. I've tried btrfs, but the Ceph mount freezes as soon as
I run any heavy benchmark or high-IOPS workload. I think it has to do
with syncing, so I'm now trying ext4 without the journal, hoping for
some good news.

1. Have you tried btrfs, with all other configuration the same?
2. Did you use mkfs.ext4 -O ^has_journal to disable the journal?
   (Sorry, you know best since you are the ext4 man!)
3. Did changing the journal size in ceph.conf change any results,
   e.g., from 20MB to 1000MB?

Have you had any success with other benchmarks? I'd really like to know
whether you could try the 'fio' benchmark, if your time allows.

Run ./fio example_file, where example_file contains, e.g.:

[global]
bs=4k
ioengine=libaio
iodepth=1
size=1g
direct=1
runtime=30
filename=/media/cephmount/afile

[seq-read]
rw=read
stonewall

[rand-read]
rw=randread
stonewall

[seq-write]
rw=write
stonewall

[rand-write]
rw=randwrite
stonewall

Thanks a lot.

On Wed, Oct 13, 2010 at 1:31 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> Hi there,
>
> I've recently been playing with Ceph on an evaluation basis, and found
> that I was able to fairly reliably induce an OOM kill on my the ceph
> client machine by using FFSB with the following configuration file (see
> attached, below).
>
> I am using Ceph v0.21.3 plus a few commits that were on the testing
> branch as of late September (commit ID 569d96b). The Ceph cluster
> contains 10 commodity servers with 5 disks configured for Ceph object
> storage on each server (plus a separate spindle for the journal files),
> so there are 5 instances of cosd on each OSD server. The disks are
> formatted using ext4 in no-journal mode. I am using 3 servers for the
> MDS and montioring daemons, with the MDS and monitoring daemons
> colocated these 3 servers. The machines all have gigabit ethernet
> cards.
>
> I've been running the client on a separate machine, and this is the
> machine which has been dying with an OOM.
>
> Any help, suggestions, or "hey stupid! You screwed up XXXX in your
> ceph.conf file" would be gratefully accepted.
>
> Thanks,
>
> - Ted
>
> P.S. In case people are curious, here are the results of the "boxacle"
> (http://btrfs.boxacle.net) FFSB workloads that I ran. The results are
> fairly stable, except very often the 8 thread random_write workload is a
> little hard to reproduce because it very often OOM's. I've never gotten
> a 32 thread random_write workload measurement, since it very reliably
> OOM's on my client machine.
>
> Do these results look reasonable to you? I confess I'm a little
> disappointed with the sequential and random read numbers in particular.
> And given 10 servers and fifty spindles, even the large_file_create
> numbers seems surprising slow.
>
> (Also, given the we are using gigabit ethernet in this evaluation
> cluster, the 1GB/sec seems ridiculously high, which suggests to me that
> the fsync request wasn't honored -- FFSB includes the fsync time when
> calculating write bandwidth -- and it may explain why we are OOM'ing in
> the random_write workload.)
>
>                      1 thread      8 threads     32 threads
> large_file_create    101 MB/sec    102 MB/sec    101 MB/sec
> sequential_reads     35 MB/sec     113 MB/sec    114 MB/sec
> random_reads         1.48 MB/sec   5.44 MB/sec   11.7 MB/sec
> random_writes        923 MB/sec    1.09 GB/sec   (*)
>
> For comparison, here are the FFSB numbers on a single local ext4 disk
> with no journal:
>
>                      1 thread      8 threads     32 threads
> large_file_create    75.5 MB/sec   72.2 MB/sec   74.2 MB/sec
> sequential_reads     77.2 MB/sec   69.2 MB/sec   70.3 MB/sec
> random_reads         734 K/sec     537 K/sec     537 K/sec
> random_writes        44.5 MB/sec   41.5 MB/sec   41.6 MB/sec
>
> It's very possible that I may have done something wrong, so I've
> enclosed the ceph.conf file I used for doing this test run.... please
> let me know if there's something I've screwed up.
>
> ---------------------------- random_write.32.ffsb
> # Large file random writes.
> # 1024 files, 100MB per file.
>
> time=300 # 5 min
> alignio=1
>
> [filesystem0]
>     location=/mnt/ffsb1
>     num_files=1024
>     min_filesize=104857600 # 100 MB
>     max_filesize=104857600
>     reuse=1
> [end0]
>
> [threadgroup0]
>     num_threads=32
>
>     write_random=1
>     write_weight=1
>
>     write_size=5242880 # 5 MB
>     write_blocksize=4096
>
>     [stats]
>         enable_stats=1
>         enable_range=1
>
>         msec_range 0.00 0.01
>         msec_range 0.01 0.02
>         msec_range 0.02 0.05
>         msec_range 0.05 0.10
>         msec_range 0.10 0.20
>         msec_range 0.20 0.50
>         msec_range 0.50 1.00
>         msec_range 1.00 2.00
>         msec_range 2.00 5.00
>         msec_range 5.00 10.00
>         msec_range 10.00 20.00
>         msec_range 20.00 50.00
>         msec_range 50.00 100.00
>         msec_range 100.00 200.00
>         msec_range 200.00 500.00
>         msec_range 500.00 1000.00
>         msec_range 1000.00 2000.00
>         msec_range 2000.00 5000.00
>         msec_range 5000.00 10000.00
>     [end]
> [end0]
> ------------------------------------------------ My ceph.conf file
>
> ;
> ; This is the test ceph configuration file
> ;
> ; [tytso:20101007.0813EDT]
> ;
> ; This file defines cluster membership, the various locations
> ; that Ceph stores data, and any other runtime options.
> ; > ; If a 'host' is defined for a daemon, the start/stop script will > ; verify that it matches the hostname (or else ignore it). If it is > ; not defined, it is assumed that the daemon is intended to start on > ; the current host (e.g., in a setup with a startup.conf on each > ; node). > > ; global > [global] > user = root > pid file = /disk/sda3/tmp/ceph/$name.pid > logger dir = /disk/sda3/tmp/ceph > log dir = /disk/sda3/tmp/ceph > chdir = /disk/sda3 > > ; monitors > ; You need at least one. You need at least three if you want to > ; tolerate any node failures. Always create an odd number. > [mon] > mon data = /disk/sda3/cephmon/data/mon$id > > ; logging, for debugging monitor crashes, in order of > ; their likelihood of being helpful :) > ;debug ms = 1 > ;debug mon = 20 > ;debug paxos = 20 > ;debug auth = 20 > > [mon0] > host = mach1 > mon addr = 1.2.3.4:6789 > > [mon1] > host = mach2 > mon addr = 1.2.3.5:6789 > > [mon1] > host = mach3 > mon addr = 1.2.3.6:6789 > > ; mds > ; You need at least one. Define two to get a standby. > [mds] > ; where the mds keeps it's secret encryption keys > keyring = /data/keyring.$name > > ; mds logging to debug issues. > ;debug ms = 1 > ;debug mds = 20 > > [mds.alpha] > host = mach2 > > [mds.beta] > host = mach3 > > [mds.gamma] > host = mach1 > > ; osd > ; You need at least one. Two if you want data to be replicated. > ; Define as many as you like. 
> [osd] > ; osd logging to debug osd issues, in order of likelihood of being > ; helpful > ;debug ms = 1 > ;debug osd = 20 > ;debug filestore = 20 > ;debug journal = 20 > > [osd0] > host = mach10 > osd data = /disk/sdb3/cephdata > osd journal = /disk/sdc3/cephjnl.sdb3 > > [osd1] > host = mach11 > osd data = /disk/sdb3/cephdata > osd journal = /disk/sdc3/cephjnl.sdb3 > > [osd2] > host = mach12 > osd data = /disk/sdb3/cephdata > osd journal = /disk/sdc3/cephjnl.sdb3 > > [osd3] > host = mach13 > osd data = /disk/sdb3/cephdata > osd journal = /disk/sdc3/cephjnl.sdb3 > > [osd4] > host = mach14 > osd data = /disk/sdb3/cephdata > osd journal = /disk/sdc3/cephjnl.sdb3 > > [osd5] > host = mach15 > osd data = /disk/sdb3/cephdata > osd journal = /disk/sdc3/cephjnl.sdb3 > > [osd6] > host = mach16 > osd data = /disk/sdb3/cephdata > osd journal = /disk/sdc3/cephjnl.sdb3 > > [osd7] > host = mach17 > osd data = /disk/sdb3/cephdata > osd journal = /disk/sdc3/cephjnl.sdb3 > > [osd8] > host = mach18 > osd data = /disk/sdb3/cephdata > osd journal = /disk/sdc3/cephjnl.sdb3 > > [osd9] > host = mach19 > osd data = /disk/sdb3/cephdata > osd journal = /disk/sdc3/cephjnl.sdb3 > > [osd10] > host = mach10 > osd data = /disk/sdd3/cephdata > osd journal = /disk/sdc3/cephjnl.sdd3 > > [osd11] > host = mach11 > osd data = /disk/sdd3/cephdata > osd journal = /disk/sdc3/cephjnl.sdd3 > > [osd12] > host = mach12 > osd data = /disk/sdd3/cephdata > osd journal = /disk/sdc3/cephjnl.sdd3 > > [osd13] > host = mach13 > osd data = /disk/sdd3/cephdata > osd journal = /disk/sdc3/cephjnl.sdd3 > > [osd14] > host = mach14 > osd data = /disk/sdd3/cephdata > osd journal = /disk/sdc3/cephjnl.sdd3 > > [osd15] > host = mach15 > osd data = /disk/sdd3/cephdata > osd journal = /disk/sdc3/cephjnl.sdd3 > > [osd16] > host = mach16 > osd data = /disk/sdd3/cephdata > osd journal = /disk/sdc3/cephjnl.sdd3 > > [osd17] > host = mach17 > osd data = /disk/sdd3/cephdata > osd journal = /disk/sdc3/cephjnl.sdd3 > > [osd18] > host = 
mach18 > osd data = /disk/sdd3/cephdata > osd journal = /disk/sdc3/cephjnl.sdd3 > > [osd19] > host = mach19 > osd data = /disk/sdd3/cephdata > osd journal = /disk/sdc3/cephjnl.sdd3 > > [osd20] > host = mach10 > osd data = /disk/sde3/cephdata > osd journal = /disk/sdc3/cephjnl.sde3 > > [osd21] > host = mach11 > osd data = /disk/sde3/cephdata > osd journal = /disk/sdc3/cephjnl.sde3 > > [osd22] > host = mach12 > osd data = /disk/sde3/cephdata > osd journal = /disk/sdc3/cephjnl.sde3 > > [osd23] > host = mach13 > osd data = /disk/sde3/cephdata > osd journal = /disk/sdc3/cephjnl.sde3 > > [osd24] > host = mach14 > osd data = /disk/sde3/cephdata > osd journal = /disk/sdc3/cephjnl.sde3 > > [osd25] > host = mach15 > osd data = /disk/sde3/cephdata > osd journal = /disk/sdc3/cephjnl.sde3 > > [osd26] > host = mach16 > osd data = /disk/sde3/cephdata > osd journal = /disk/sdc3/cephjnl.sde3 > > [osd27] > host = mach17 > osd data = /disk/sde3/cephdata > osd journal = /disk/sdc3/cephjnl.sde3 > > [osd28] > host = mach18 > osd data = /disk/sde3/cephdata > osd journal = /disk/sdc3/cephjnl.sde3 > > [osd29] > host = mach19 > osd data = /disk/sde3/cephdata > osd journal = /disk/sdc3/cephjnl.sde3 > > [osd30] > host = mach10 > osd data = /disk/sdf3/cephdata > osd journal = /disk/sdc3/cephjnl.sdf3 > > [osd31] > host = mach11 > osd data = /disk/sdf3/cephdata > osd journal = /disk/sdc3/cephjnl.sdf3 > > [osd32] > host = mach12 > osd data = /disk/sdf3/cephdata > osd journal = /disk/sdc3/cephjnl.sdf3 > > [osd33] > host = mach13 > osd data = /disk/sdf3/cephdata > osd journal = /disk/sdc3/cephjnl.sdf3 > > [osd34] > host = mach14 > osd data = /disk/sdf3/cephdata > osd journal = /disk/sdc3/cephjnl.sdf3 > > [osd35] > host = mach15 > osd data = /disk/sdf3/cephdata > osd journal = /disk/sdc3/cephjnl.sdf3 > > [osd36] > host = mach16 > osd data = /disk/sdf3/cephdata > osd journal = /disk/sdc3/cephjnl.sdf3 > > [osd37] > host = mach17 > osd data = /disk/sdf3/cephdata > osd journal = /disk/sdc3/cephjnl.sdf3 
> > [osd38] > host = mach18 > osd data = /disk/sdf3/cephdata > osd journal = /disk/sdc3/cephjnl.sdf3 > > [osd39] > host = mach19 > osd data = /disk/sdf3/cephdata > osd journal = /disk/sdc3/cephjnl.sdf3 > > [osd40] > host = mach10 > osd data = /disk/sdg3/cephdata > osd journal = /disk/sdc3/cephjnl.sdg3 > > [osd41] > host = mach11 > osd data = /disk/sdg3/cephdata > osd journal = /disk/sdc3/cephjnl.sdg3 > > [osd42] > host = mach12 > osd data = /disk/sdg3/cephdata > osd journal = /disk/sdc3/cephjnl.sdg3 > > [osd43] > host = mach13 > osd data = /disk/sdg3/cephdata > osd journal = /disk/sdc3/cephjnl.sdg3 > > [osd44] > host = mach14 > osd data = /disk/sdg3/cephdata > osd journal = /disk/sdc3/cephjnl.sdg3 > > [osd45] > host = mach15 > osd data = /disk/sdg3/cephdata > osd journal = /disk/sdc3/cephjnl.sdg3 > > [osd46] > host = mach16 > osd data = /disk/sdg3/cephdata > osd journal = /disk/sdc3/cephjnl.sdg3 > > [osd47] > host = mach17 > osd data = /disk/sdg3/cephdata > osd journal = /disk/sdc3/cephjnl.sdg3 > > [osd48] > host = mach18 > osd data = /disk/sdg3/cephdata > osd journal = /disk/sdc3/cephjnl.sdg3 > > [osd49] > host = mach19 > osd data = /disk/sdg3/cephdata > osd journal = /disk/sdc3/cephjnl.sdg3 > > > > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: OOM's on the Ceph client machine
2010-10-13 0:31 OOM's on the Ceph client machine Theodore Ts'o
2010-10-13 2:30 ` Gregory Farnum
2010-10-13 3:43 ` DongJin Lee
@ 2010-10-13 17:42 ` Sage Weil
2010-10-13 21:25 ` Sage Weil
2 siblings, 1 reply; 13+ messages in thread
From: Sage Weil @ 2010-10-13 17:42 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: ceph-devel, mrubin

Hi Ted,

On Tue, 12 Oct 2010, Theodore Ts'o wrote:
> P.S. In case people are curious, here are the results of the "boxacle"
> (http://btrfs.boxacle.net) FFSB workloads that I ran. The results are
> fairly stable, except very often the 8 thread random_write workload is a
> little hard to reproduce because it very often OOM's. I've never gotten
> a 32 thread random_write workload measurement, since it very reliably
> OOM's on my client machine.
>
> Do these results look reasonable to you? I confess I'm a little
> disappointed with the sequential and random read numbers in particular.
> And given 10 servers and fifty spindles, even the large_file_create
> numbers seems surprising slow.
>
> (Also, given the we are using gigabit ethernet in this evaluation
> cluster, the 1GB/sec seems ridiculously high, which suggests to me that
> the fsync request wasn't honored -- FFSB includes the fsync time when
> calculating write bandwidth -- and it may explain why we are OOM'ing in
> the random_write workload.)
>
>                      1 thread      8 threads     32 threads
> large_file_create    101 MB/sec    102 MB/sec    101 MB/sec

These may be a bit below the ceiling imposed by the gigabit ethernet
because of the combined journaling disk; effectively all writes for the
whole host were going to the same spindle. Please try distributing the
journals across the spindles.

> sequential_reads     35 MB/sec     113 MB/sec    114 MB/sec

These are mostly reasonable. The single thread performance is primarily
governed by the MM readahead behavior. There is a mount option tunable to
adjust the max readahead on the BDI: rsize=<bytes> (the default is only
512KB, IIRC). Some users have reported improved read performance with a
larger rsize, but it's not something we've had time to tune ourselves.

> random_reads         1.48 MB/sec   5.44 MB/sec   11.7 MB/sec

This one looks way too slow. I'm going to run this locally and see what
is going on.

> random_writes        923 MB/sec    1.09 GB/sec   (*)

And there is definitely something wrong here with the client. :) Let's
see what happens with the latest mainline!

sage

>
> For comparison, here are the FFSB numbers on a single local ext4 disk
> with no journal:
>
>                      1 thread      8 threads     32 threads
> large_file_create    75.5 MB/sec   72.2 MB/sec   74.2 MB/sec
> sequential_reads     77.2 MB/sec   69.2 MB/sec   70.3 MB/sec
> random_reads         734 K/sec     537 K/sec     537 K/sec
> random_writes        44.5 MB/sec   41.5 MB/sec   41.6 MB/sec
>
> It's very possible that I may have done something wrong, so I've
> enclosed the ceph.conf file I used for doing this test run.... please
> let me know if there's something I've screwed up.
>
> ---------------------------- random_write.32.ffsb
> # Large file random writes.
> # 1024 files, 100MB per file.
> > time=300 # 5 min > alignio=1 > > [filesystem0] > location=/mnt/ffsb1 > num_files=1024 > min_filesize=104857600 # 100 MB > max_filesize=104857600 > reuse=1 > [end0] > > [threadgroup0] > num_threads=32 > > write_random=1 > write_weight=1 > > write_size=5242880 # 5 MB > write_blocksize=4096 > > [stats] > enable_stats=1 > enable_range=1 > > msec_range 0.00 0.01 > msec_range 0.01 0.02 > msec_range 0.02 0.05 > msec_range 0.05 0.10 > msec_range 0.10 0.20 > msec_range 0.20 0.50 > msec_range 0.50 1.00 > msec_range 1.00 2.00 > msec_range 2.00 5.00 > msec_range 5.00 10.00 > msec_range 10.00 20.00 > msec_range 20.00 50.00 > msec_range 50.00 100.00 > msec_range 100.00 200.00 > msec_range 200.00 500.00 > msec_range 500.00 1000.00 > msec_range 1000.00 2000.00 > msec_range 2000.00 5000.00 > msec_range 5000.00 10000.00 > [end] > [end0] > ------------------------------------------------ My ceph.conf file > > ; > ; This is the test ceph configuration file > ; > ; [tytso:20101007.0813EDT] > ; > ; This file defines cluster membership, the various locations > ; that Ceph stores data, and any other runtime options. > ; > ; If a 'host' is defined for a daemon, the start/stop script will > ; verify that it matches the hostname (or else ignore it). If it is > ; not defined, it is assumed that the daemon is intended to start on > ; the current host (e.g., in a setup with a startup.conf on each > ; node). > > ; global > [global] > user = root > pid file = /disk/sda3/tmp/ceph/$name.pid > logger dir = /disk/sda3/tmp/ceph > log dir = /disk/sda3/tmp/ceph > chdir = /disk/sda3 > > ; monitors > ; You need at least one. You need at least three if you want to > ; tolerate any node failures. Always create an odd number. 
> [mon] > mon data = /disk/sda3/cephmon/data/mon$id > > ; logging, for debugging monitor crashes, in order of > ; their likelihood of being helpful :) > ;debug ms = 1 > ;debug mon = 20 > ;debug paxos = 20 > ;debug auth = 20 > > [mon0] > host = mach1 > mon addr = 1.2.3.4:6789 > > [mon1] > host = mach2 > mon addr = 1.2.3.5:6789 > > [mon1] > host = mach3 > mon addr = 1.2.3.6:6789 > > ; mds > ; You need at least one. Define two to get a standby. > [mds] > ; where the mds keeps it's secret encryption keys > keyring = /data/keyring.$name > > ; mds logging to debug issues. > ;debug ms = 1 > ;debug mds = 20 > > [mds.alpha] > host = mach2 > > [mds.beta] > host = mach3 > > [mds.gamma] > host = mach1 > > ; osd > ; You need at least one. Two if you want data to be replicated. > ; Define as many as you like. > [osd] > ; osd logging to debug osd issues, in order of likelihood of being > ; helpful > ;debug ms = 1 > ;debug osd = 20 > ;debug filestore = 20 > ;debug journal = 20 > > [osd0] > host = mach10 > osd data = /disk/sdb3/cephdata > osd journal = /disk/sdc3/cephjnl.sdb3 > > [osd1] > host = mach11 > osd data = /disk/sdb3/cephdata > osd journal = /disk/sdc3/cephjnl.sdb3 > > [osd2] > host = mach12 > osd data = /disk/sdb3/cephdata > osd journal = /disk/sdc3/cephjnl.sdb3 > > [osd3] > host = mach13 > osd data = /disk/sdb3/cephdata > osd journal = /disk/sdc3/cephjnl.sdb3 > > [osd4] > host = mach14 > osd data = /disk/sdb3/cephdata > osd journal = /disk/sdc3/cephjnl.sdb3 > > [osd5] > host = mach15 > osd data = /disk/sdb3/cephdata > osd journal = /disk/sdc3/cephjnl.sdb3 > > [osd6] > host = mach16 > osd data = /disk/sdb3/cephdata > osd journal = /disk/sdc3/cephjnl.sdb3 > > [osd7] > host = mach17 > osd data = /disk/sdb3/cephdata > osd journal = /disk/sdc3/cephjnl.sdb3 > > [osd8] > host = mach18 > osd data = /disk/sdb3/cephdata > osd journal = /disk/sdc3/cephjnl.sdb3 > > [osd9] > host = mach19 > osd data = /disk/sdb3/cephdata > osd journal = /disk/sdc3/cephjnl.sdb3 > > [osd10] > host = 
mach10 > osd data = /disk/sdd3/cephdata > osd journal = /disk/sdc3/cephjnl.sdd3 > > [osd11] > host = mach11 > osd data = /disk/sdd3/cephdata > osd journal = /disk/sdc3/cephjnl.sdd3 > > [osd12] > host = mach12 > osd data = /disk/sdd3/cephdata > osd journal = /disk/sdc3/cephjnl.sdd3 > > [osd13] > host = mach13 > osd data = /disk/sdd3/cephdata > osd journal = /disk/sdc3/cephjnl.sdd3 > > [osd14] > host = mach14 > osd data = /disk/sdd3/cephdata > osd journal = /disk/sdc3/cephjnl.sdd3 > > [osd15] > host = mach15 > osd data = /disk/sdd3/cephdata > osd journal = /disk/sdc3/cephjnl.sdd3 > > [osd16] > host = mach16 > osd data = /disk/sdd3/cephdata > osd journal = /disk/sdc3/cephjnl.sdd3 > > [osd17] > host = mach17 > osd data = /disk/sdd3/cephdata > osd journal = /disk/sdc3/cephjnl.sdd3 > > [osd18] > host = mach18 > osd data = /disk/sdd3/cephdata > osd journal = /disk/sdc3/cephjnl.sdd3 > > [osd19] > host = mach19 > osd data = /disk/sdd3/cephdata > osd journal = /disk/sdc3/cephjnl.sdd3 > > [osd20] > host = mach10 > osd data = /disk/sde3/cephdata > osd journal = /disk/sdc3/cephjnl.sde3 > > [osd21] > host = mach11 > osd data = /disk/sde3/cephdata > osd journal = /disk/sdc3/cephjnl.sde3 > > [osd22] > host = mach12 > osd data = /disk/sde3/cephdata > osd journal = /disk/sdc3/cephjnl.sde3 > > [osd23] > host = mach13 > osd data = /disk/sde3/cephdata > osd journal = /disk/sdc3/cephjnl.sde3 > > [osd24] > host = mach14 > osd data = /disk/sde3/cephdata > osd journal = /disk/sdc3/cephjnl.sde3 > > [osd25] > host = mach15 > osd data = /disk/sde3/cephdata > osd journal = /disk/sdc3/cephjnl.sde3 > > [osd26] > host = mach16 > osd data = /disk/sde3/cephdata > osd journal = /disk/sdc3/cephjnl.sde3 > > [osd27] > host = mach17 > osd data = /disk/sde3/cephdata > osd journal = /disk/sdc3/cephjnl.sde3 > > [osd28] > host = mach18 > osd data = /disk/sde3/cephdata > osd journal = /disk/sdc3/cephjnl.sde3 > > [osd29] > host = mach19 > osd data = /disk/sde3/cephdata > osd journal = /disk/sdc3/cephjnl.sde3 
> > [osd30] > host = mach10 > osd data = /disk/sdf3/cephdata > osd journal = /disk/sdc3/cephjnl.sdf3 > > [osd31] > host = mach11 > osd data = /disk/sdf3/cephdata > osd journal = /disk/sdc3/cephjnl.sdf3 > > [osd32] > host = mach12 > osd data = /disk/sdf3/cephdata > osd journal = /disk/sdc3/cephjnl.sdf3 > > [osd33] > host = mach13 > osd data = /disk/sdf3/cephdata > osd journal = /disk/sdc3/cephjnl.sdf3 > > [osd34] > host = mach14 > osd data = /disk/sdf3/cephdata > osd journal = /disk/sdc3/cephjnl.sdf3 > > [osd35] > host = mach15 > osd data = /disk/sdf3/cephdata > osd journal = /disk/sdc3/cephjnl.sdf3 > > [osd36] > host = mach16 > osd data = /disk/sdf3/cephdata > osd journal = /disk/sdc3/cephjnl.sdf3 > > [osd37] > host = mach17 > osd data = /disk/sdf3/cephdata > osd journal = /disk/sdc3/cephjnl.sdf3 > > [osd38] > host = mach18 > osd data = /disk/sdf3/cephdata > osd journal = /disk/sdc3/cephjnl.sdf3 > > [osd39] > host = mach19 > osd data = /disk/sdf3/cephdata > osd journal = /disk/sdc3/cephjnl.sdf3 > > [osd40] > host = mach10 > osd data = /disk/sdg3/cephdata > osd journal = /disk/sdc3/cephjnl.sdg3 > > [osd41] > host = mach11 > osd data = /disk/sdg3/cephdata > osd journal = /disk/sdc3/cephjnl.sdg3 > > [osd42] > host = mach12 > osd data = /disk/sdg3/cephdata > osd journal = /disk/sdc3/cephjnl.sdg3 > > [osd43] > host = mach13 > osd data = /disk/sdg3/cephdata > osd journal = /disk/sdc3/cephjnl.sdg3 > > [osd44] > host = mach14 > osd data = /disk/sdg3/cephdata > osd journal = /disk/sdc3/cephjnl.sdg3 > > [osd45] > host = mach15 > osd data = /disk/sdg3/cephdata > osd journal = /disk/sdc3/cephjnl.sdg3 > > [osd46] > host = mach16 > osd data = /disk/sdg3/cephdata > osd journal = /disk/sdc3/cephjnl.sdg3 > > [osd47] > host = mach17 > osd data = /disk/sdg3/cephdata > osd journal = /disk/sdc3/cephjnl.sdg3 > > [osd48] > host = mach18 > osd data = /disk/sdg3/cephdata > osd journal = /disk/sdc3/cephjnl.sdg3 > > [osd49] > host = mach19 > osd data = /disk/sdg3/cephdata > osd journal = 
/disk/sdc3/cephjnl.sdg3 > > > > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: OOM's on the Ceph client machine
2010-10-13 17:42 ` Sage Weil
@ 2010-10-13 21:25 ` Sage Weil
0 siblings, 0 replies; 13+ messages in thread
From: Sage Weil @ 2010-10-13 21:25 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: ceph-devel, mrubin

On Wed, 13 Oct 2010, Sage Weil wrote:
> On Tue, 12 Oct 2010, Theodore Ts'o wrote:
> > random_reads         1.48 MB/sec   5.44 MB/sec   11.7 MB/sec
>
> This one looks way too slow. I'm going to run this locally and see what
> is going on.

I looked closer at this one, and it looks like what ffsb is doing is each
thread picks a random 5 MB chunk and does 4KB reads from within that
chunk at random offsets. Because the reads are random, there's no
readahead, and we have lots of little 4KB read requests going over the
wire. Increasing the number of threads just means more small reads in
parallel.

That being the case, the single thread number isn't so surprising.
Performance is mainly bounded by the request latency. What is a bit
surprising is that it doesn't scale that well as threads increase, I
assume because of some contention on the OSDs (balancing is
pseudorandom).

FWIW, in my environment (25 single spindle OSDs, btrfs) for random_reads
and 1/8/32 threads I got

random_reads         4.1 MB/sec    7.41 MB/sec   15.3 MB/sec

sage

>
> > random_writes        923 MB/sec    1.09 GB/sec   (*)
>
> And there is definitely something wrong here with the client. :) Let's
> see what happens with the latest mainline!
>
> sage
>
>
> >
> > For comparison, here are the FFSB numbers on a single local ext4 disk
> > with no journal:
> >
> >                      1 thread      8 threads     32 threads
> > large_file_create    75.5 MB/sec   72.2 MB/sec   74.2 MB/sec
> > sequential_reads     77.2 MB/sec   69.2 MB/sec   70.3 MB/sec
> > random_reads         734 K/sec     537 K/sec     537 K/sec
> > random_writes        44.5 MB/sec   41.5 MB/sec   41.6 MB/sec
> >
> > It's very possible that I may have done something wrong, so I've
> > enclosed the ceph.conf file I used for doing this test run.... please
> > let me know if there's something I've screwed up.
> >
> > ---------------------------- random_write.32.ffsb
> > # Large file random writes.
> > # 1024 files, 100MB per file.
> >
> > time=300 # 5 min
> > alignio=1
> >
> > [filesystem0]
> > location=/mnt/ffsb1
> > num_files=1024
> > min_filesize=104857600 # 100 MB
> > max_filesize=104857600
> > reuse=1
> > [end0]
> >
> > [threadgroup0]
> > num_threads=32
> >
> > write_random=1
> > write_weight=1
> >
> > write_size=5242880 # 5 MB
> > write_blocksize=4096
> >
> > [stats]
> > enable_stats=1
> > enable_range=1
> >
> > msec_range 0.00 0.01
> > msec_range 0.01 0.02
> > msec_range 0.02 0.05
> > msec_range 0.05 0.10
> > msec_range 0.10 0.20
> > msec_range 0.20 0.50
> > msec_range 0.50 1.00
> > msec_range 1.00 2.00
> > msec_range 2.00 5.00
> > msec_range 5.00 10.00
> > msec_range 10.00 20.00
> > msec_range 20.00 50.00
> > msec_range 50.00 100.00
> > msec_range 100.00 200.00
> > msec_range 200.00 500.00
> > msec_range 500.00 1000.00
> > msec_range 1000.00 2000.00
> > msec_range 2000.00 5000.00
> > msec_range 5000.00 10000.00
> > [end]
> > [end0]
> > ------------------------------------------------ My ceph.conf file
> >
> > ;
> > ; This is the test ceph configuration file
> > ;
> > ; [tytso:20101007.0813EDT]
> > ;
> > ; This file defines cluster membership, the various locations
> > ; that Ceph stores data, and any other runtime options.
> > ;
> > ; If a 'host' is defined for a daemon, the start/stop script will
> > ; verify that it matches the hostname (or else ignore it). If it is
> > ; not defined, it is assumed that the daemon is intended to start on
> > ; the current host (e.g., in a setup with a startup.conf on each
> > ; node).
> >
> > ; global
> > [global]
> > user = root
> > pid file = /disk/sda3/tmp/ceph/$name.pid
> > logger dir = /disk/sda3/tmp/ceph
> > log dir = /disk/sda3/tmp/ceph
> > chdir = /disk/sda3
> >
> > ; monitors
> > ; You need at least one. You need at least three if you want to
> > ; tolerate any node failures. Always create an odd number.
> > [mon]
> > mon data = /disk/sda3/cephmon/data/mon$id
> >
> > ; logging, for debugging monitor crashes, in order of
> > ; their likelihood of being helpful :)
> > ;debug ms = 1
> > ;debug mon = 20
> > ;debug paxos = 20
> > ;debug auth = 20
> >
> > [mon0]
> > host = mach1
> > mon addr = 1.2.3.4:6789
> >
> > [mon1]
> > host = mach2
> > mon addr = 1.2.3.5:6789
> >
> > [mon1]
> > host = mach3
> > mon addr = 1.2.3.6:6789
> >
> > ; mds
> > ; You need at least one. Define two to get a standby.
> > [mds]
> > ; where the mds keeps it's secret encryption keys
> > keyring = /data/keyring.$name
> >
> > ; mds logging to debug issues.
> > ;debug ms = 1
> > ;debug mds = 20
> >
> > [mds.alpha]
> > host = mach2
> >
> > [mds.beta]
> > host = mach3
> >
> > [mds.gamma]
> > host = mach1
> >
> > ; osd
> > ; You need at least one. Two if you want data to be replicated.
> > ; Define as many as you like.
> > [osd]
> > ; osd logging to debug osd issues, in order of likelihood of being
> > ; helpful
> > ;debug ms = 1
> > ;debug osd = 20
> > ;debug filestore = 20
> > ;debug journal = 20
> >
> > [osd0]
> > host = mach10
> > osd data = /disk/sdb3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdb3
> >
> > [osd1]
> > host = mach11
> > osd data = /disk/sdb3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdb3
> >
> > [osd2]
> > host = mach12
> > osd data = /disk/sdb3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdb3
> >
> > [osd3]
> > host = mach13
> > osd data = /disk/sdb3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdb3
> >
> > [osd4]
> > host = mach14
> > osd data = /disk/sdb3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdb3
> >
> > [osd5]
> > host = mach15
> > osd data = /disk/sdb3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdb3
> >
> > [osd6]
> > host = mach16
> > osd data = /disk/sdb3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdb3
> >
> > [osd7]
> > host = mach17
> > osd data = /disk/sdb3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdb3
> >
> > [osd8]
> > host = mach18
> > osd data = /disk/sdb3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdb3
> >
> > [osd9]
> > host = mach19
> > osd data = /disk/sdb3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdb3
> >
> > [osd10]
> > host = mach10
> > osd data = /disk/sdd3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdd3
> >
> > [osd11]
> > host = mach11
> > osd data = /disk/sdd3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdd3
> >
> > [osd12]
> > host = mach12
> > osd data = /disk/sdd3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdd3
> >
> > [osd13]
> > host = mach13
> > osd data = /disk/sdd3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdd3
> >
> > [osd14]
> > host = mach14
> > osd data = /disk/sdd3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdd3
> >
> > [osd15]
> > host = mach15
> > osd data = /disk/sdd3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdd3
> >
> > [osd16]
> > host = mach16
> > osd data = /disk/sdd3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdd3
> >
> > [osd17]
> > host = mach17
> > osd data = /disk/sdd3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdd3
> >
> > [osd18]
> > host = mach18
> > osd data = /disk/sdd3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdd3
> >
> > [osd19]
> > host = mach19
> > osd data = /disk/sdd3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdd3
> >
> > [osd20]
> > host = mach10
> > osd data = /disk/sde3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sde3
> >
> > [osd21]
> > host = mach11
> > osd data = /disk/sde3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sde3
> >
> > [osd22]
> > host = mach12
> > osd data = /disk/sde3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sde3
> >
> > [osd23]
> > host = mach13
> > osd data = /disk/sde3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sde3
> >
> > [osd24]
> > host = mach14
> > osd data = /disk/sde3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sde3
> >
> > [osd25]
> > host = mach15
> > osd data = /disk/sde3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sde3
> >
> > [osd26]
> > host = mach16
> > osd data = /disk/sde3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sde3
> >
> > [osd27]
> > host = mach17
> > osd data = /disk/sde3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sde3
> >
> > [osd28]
> > host = mach18
> > osd data = /disk/sde3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sde3
> >
> > [osd29]
> > host = mach19
> > osd data = /disk/sde3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sde3
> >
> > [osd30]
> > host = mach10
> > osd data = /disk/sdf3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdf3
> >
> > [osd31]
> > host = mach11
> > osd data = /disk/sdf3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdf3
> >
> > [osd32]
> > host = mach12
> > osd data = /disk/sdf3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdf3
> >
> > [osd33]
> > host = mach13
> > osd data = /disk/sdf3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdf3
> >
> > [osd34]
> > host = mach14
> > osd data = /disk/sdf3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdf3
> >
> > [osd35]
> > host = mach15
> > osd data = /disk/sdf3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdf3
> >
> > [osd36]
> > host = mach16
> > osd data = /disk/sdf3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdf3
> >
> > [osd37]
> > host = mach17
> > osd data = /disk/sdf3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdf3
> >
> > [osd38]
> > host = mach18
> > osd data = /disk/sdf3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdf3
> >
> > [osd39]
> > host = mach19
> > osd data = /disk/sdf3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdf3
> >
> > [osd40]
> > host = mach10
> > osd data = /disk/sdg3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdg3
> >
> > [osd41]
> > host = mach11
> > osd data = /disk/sdg3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdg3
> >
> > [osd42]
> > host = mach12
> > osd data = /disk/sdg3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdg3
> >
> > [osd43]
> > host = mach13
> > osd data = /disk/sdg3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdg3
> >
> > [osd44]
> > host = mach14
> > osd data = /disk/sdg3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdg3
> >
> > [osd45]
> > host = mach15
> > osd data = /disk/sdg3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdg3
> >
> > [osd46]
> > host = mach16
> > osd data = /disk/sdg3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdg3
> >
> > [osd47]
> > host = mach17
> > osd data = /disk/sdg3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdg3
> >
> > [osd48]
> > host = mach18
> > osd data = /disk/sdg3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdg3
> >
> > [osd49]
> > host = mach19
> > osd data = /disk/sdg3/cephdata
> > osd journal = /disk/sdc3/cephjnl.sdg3

^ permalink raw reply [flat|nested] 13+ messages in thread
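
The latency argument in the reply above can be checked with some quick arithmetic. The sketch below is only a back-of-the-envelope estimate, not something from the thread: it assumes that 1 MB means 2**20 bytes and that each ffsb thread keeps exactly one 4KB read in flight at a time, so that Little's law (requests in flight = throughput x latency) applies. The function names are made up for illustration; the throughput figures are the random_reads numbers quoted in the thread.

```python
# Rough check of the "bounded by request latency" reading of random_reads.
# Assumptions (not stated in the thread): 1 MB = 2**20 bytes, and every
# ffsb thread issues one synchronous 4 KB read at a time.

BLOCK_SIZE = 4096  # bytes per read, as described in the message


def reads_per_sec(mb_per_sec):
    """Convert a throughput in MB/sec into 4 KB read requests per second."""
    return mb_per_sec * 2**20 / BLOCK_SIZE


def implied_latency_ms(mb_per_sec, threads):
    """Average per-request latency implied by Little's law, in milliseconds."""
    return threads / reads_per_sec(mb_per_sec) * 1000.0


def scaling_efficiency(mb_per_sec, threads, single_thread_mb_per_sec):
    """Measured throughput as a fraction of perfect linear scaling."""
    return mb_per_sec / (threads * single_thread_mb_per_sec)


# random_reads results quoted in the thread, keyed by thread count.
RESULTS = {
    "Ted (50 OSDs, ext4)": {1: 1.48, 8: 5.44, 32: 11.7},
    "Sage (25 OSDs, btrfs)": {1: 4.1, 8: 7.41, 32: 15.3},
}

for name, runs in RESULTS.items():
    print(name)
    for threads, mb in sorted(runs.items()):
        print(
            f"  {threads:2d} threads: {reads_per_sec(mb):7.1f} reads/sec, "
            f"{implied_latency_ms(mb, threads):5.2f} ms/read, "
            f"{scaling_efficiency(mb, threads, runs[1]):.0%} of linear"
        )
```

Under these assumptions, the single-thread figure of 1.48 MB/sec works out to roughly 380 reads per second, or about 2.6 ms per read, which is plausible for a network round trip plus a disk seek. At 32 threads the same cluster reaches only about a quarter of linear scaling, which matches the observation that throughput does not scale well as threads increase.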
end of thread, other threads:[~2010-10-21 22:41 UTC | newest]

Thread overview: 13+ messages
2010-10-13 0:31 OOM's on the Ceph client machine Theodore Ts'o
2010-10-13 2:30 ` Gregory Farnum
2010-10-13 3:34 ` Ted Ts'o
2010-10-13 17:29 ` Sage Weil
2010-10-14 0:03 ` Ted Ts'o
2010-10-14 3:43 ` Sage Weil
2010-10-21 20:36 ` Ted Ts'o
2010-10-21 21:46 ` Sage Weil
2010-10-21 22:28 ` Ted Ts'o
2010-10-21 22:44 ` Sage Weil
2010-10-13 3:43 ` DongJin Lee
2010-10-13 17:42 ` Sage Weil
2010-10-13 21:25 ` Sage Weil