* problems creating new ceph cluster when using journal on block device
@ 2012-11-08  7:29 Travis Rhoden
  2012-11-08  8:08 ` Wido den Hollander
  0 siblings, 1 reply; 7+ messages in thread
From: Travis Rhoden @ 2012-11-08  7:29 UTC (permalink / raw)
  To: ceph-devel

Hey folks,

I'm trying to set up a brand new Ceph cluster, based on v0.53.  My
hardware has SSDs for journals, and I'm trying to get mkcephfs to
initialize everything for me. However, the command hangs forever and I
eventually have to kill it.

After poking around a bit, it's clear that the problem has something
to do with the journal.  If I comment out the journal in ceph.conf,
the commands proceed just fine.  This is the first time I've tried to
throw a journal on a block device rather than a file, so maybe I've
done something wrong with that.

Here is the info from ceph.conf:


[osd]
        osd journal size = 4000
[osd.0]
        host = ceph1
        osd journal = /dev/sda5


When I look in the log file, here is what I see:

2012-11-07 23:18:20.578623 7fe2743e3780  1
filestore(/var/lib/ceph/osd/ceph-0) mkfs in /var/lib/ceph/osd/ceph-0
2012-11-07 23:18:20.578699 7fe2743e3780  1
filestore(/var/lib/ceph/osd/ceph-0) mkfs fsid is already set to
4aac6842-8d71-4405-88ad-e3e9e4da308d
2012-11-07 23:18:20.632138 7fe2743e3780  1
filestore(/var/lib/ceph/osd/ceph-0) leveldb db exists/created
2012-11-07 23:18:20.634338 7fe2743e3780  0 journal  kernel version is 3.2.0
2012-11-07 23:18:20.634579 7fe2743e3780  1 journal _open /dev/sda5 fd
9: 4194304000 bytes, block size 4096 bytes, directio = 1, aio = 0
2012-11-07 23:18:20.634995 7fe2743e3780  1 journal check: header looks ok
2012-11-07 23:18:20.636020 7fe2743e3780  1
filestore(/var/lib/ceph/osd/ceph-0) mkfs done in
/var/lib/ceph/osd/ceph-0
2012-11-07 23:18:20.682113 7fe2743e3780  0
filestore(/var/lib/ceph/osd/ceph-0) mount FIEMAP ioctl is supported
and appears to work
2012-11-07 23:18:20.682125 7fe2743e3780  0
filestore(/var/lib/ceph/osd/ceph-0) mount FIEMAP ioctl is disabled via
'filestore fiemap' config option
2012-11-07 23:18:20.682424 7fe2743e3780  0
filestore(/var/lib/ceph/osd/ceph-0) mount did NOT detect btrfs
2012-11-07 23:18:20.781938 7fe2743e3780  0
filestore(/var/lib/ceph/osd/ceph-0) mount syncfs(2) syscall fully
supported (by glibc and kernel)
2012-11-07 23:18:20.782061 7fe2743e3780  0
filestore(/var/lib/ceph/osd/ceph-0) mount found snaps <>
2012-11-07 23:18:20.823915 7fe2743e3780  0
filestore(/var/lib/ceph/osd/ceph-0) mount: enabling WRITEAHEAD journal
mode: btrfs not detected
2012-11-07 23:18:20.826137 7fe2743e3780  0 journal  kernel version is 3.2.0
2012-11-07 23:18:20.826386 7fe2743e3780  1 journal _open /dev/sda5 fd
15: 4194304000 bytes, block size 4096 bytes, directio = 1, aio = 0

So I know it is trying to use the right partition/block device.  It
just never gets past that line.
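
The byte count on the "journal _open" line is consistent with the
"osd journal size = 4000" setting if that value is read as mebibytes.
A quick check in Python (the variable names are only for illustration,
and the MiB interpretation is an assumption inferred from the log line):

```python
# "osd journal size" from ceph.conf, interpreted as MiB
# (assumption inferred from the size printed in the log).
journal_size_mib = 4000
journal_size_bytes = journal_size_mib * 1024 * 1024

# Size reported by "journal _open /dev/sda5" in the log above.
reported_bytes = 4194304000

print(journal_size_bytes)
print(journal_size_bytes == reported_bytes)
```

So the configured size, not the size of the partition, is what the
journal is being opened with here.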

Finally, I tried to track things down myself to see what was hanging
using strace.  I ran:

strace /usr/bin/ceph-osd -c /tmp/travis/conf --monmap
/tmp/travis/monmap -i 0 --mkfs --mkkey

And the final output from that is:

open("/dev/sda5", O_RDONLY)             = 15
fstat(15, {st_mode=S_IFBLK|0660, st_rdev=makedev(8, 5), ...}) = 0
ioctl(15, BLKGETSIZE64, 0x7fffe7a587a8) = 0
geteuid()                               = 0
pipe2([16, 17], O_CLOEXEC)              = 0
clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0x7f5365f28a50) = 707
close(17)                               = 0
fcntl(16, F_SETFD, 0)                   = 0
fstat(16, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x7f5365f14000
read(16, "\n/dev/sda5:\n write-caching =  1 "..., 4096) = 37
open("/proc/version", O_RDONLY)         = 17
read(17, "Linux version 3.2.0-23-generic ("..., 127) = 127
futex(0x2db807c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x2db8078,
{FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x2db8028, FUTEX_WAKE_PRIVATE, 1) = 1
close(17)                               = 0
close(16)                               = 0
wait4(707, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 707
munmap(0x7f5365f14000, 4096)            = 0
io_setup(128, {139996169318400})        = 0
futex(0x2db807c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x2db8078,
{FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x2db8028, FUTEX_WAKE_PRIVATE, 1) = 1
pread(15, "\2\0\0\0000\0\0\0\1\0\0\0\0\0\0\0J\254hB\215qD\5\210\255\343\351\344\3320\215"...,
4096, 0) = 4096

And that's as far as it gets.  Any thoughts?
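
One thing the trace does show: the last pread() on /dev/sda5 returned a
full 4096 bytes, and bytes 16 to 31 of what strace printed are the same
16 bytes as the fsid in the mkfs log line.  A small Python sketch of
that comparison follows.  The offset is read off the strace output, not
taken from the Ceph journal header definition, so the field layout is an
assumption:

```python
import uuid

# First 32 bytes of the pread() result, copied from the strace output.
# strace prints C-style octal escapes, which Python bytes literals
# parse the same way (e.g. "\0000" is a NUL byte followed by "0").
header_prefix = (
    b"\2\0\0\0000\0\0\0\1\0\0\0\0\0\0\0"
    b"J\254hB\215qD\5\210\255\343\351\344\3320\215"
)

# Assumption: the 16 bytes starting at offset 16 hold the fsid.
fsid_in_journal = uuid.UUID(bytes=header_prefix[16:32])

# fsid from the "mkfs fsid is already set to ..." log line.
fsid_in_log = uuid.UUID("4aac6842-8d71-4405-88ad-e3e9e4da308d")

print(fsid_in_journal)
print(fsid_in_journal == fsid_in_log)
```

If the two match, the journal header on the block device is readable and
belongs to this OSD, which points away from the device itself.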

After some sleep, I'll try throwing the journal back on a file instead
of a block device and see if that does it.

Can anyone confirm that using a block device instead of a file
actually gives better performance?

Thanks,

 - Travis


* Re: problems creating new ceph cluster when using journal on block device
  2012-11-08  7:29 problems creating new ceph cluster when using journal on block device Travis Rhoden
@ 2012-11-08  8:08 ` Wido den Hollander
  2012-11-08  8:24   ` Mark Kirkwood
  0 siblings, 1 reply; 7+ messages in thread
From: Wido den Hollander @ 2012-11-08  8:08 UTC (permalink / raw)
  To: Travis Rhoden; +Cc: ceph-devel



On 08-11-12 08:29, Travis Rhoden wrote:
> [osd]
>          osd journal size = 4000

Not sure if this is the problem, but when using a block device you don't 
have to specify the size for the journal.

Wido


* Re: problems creating new ceph cluster when using journal on block device
  2012-11-08  8:08 ` Wido den Hollander
@ 2012-11-08  8:24   ` Mark Kirkwood
  2012-11-08 15:01     ` Travis Rhoden
  0 siblings, 1 reply; 7+ messages in thread
From: Mark Kirkwood @ 2012-11-08  8:24 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: Travis Rhoden, ceph-devel

On 08/11/12 21:08, Wido den Hollander wrote:
>> [osd]
>>          osd journal size = 4000
>
> Not sure if this is the problem, but when using a block device you don't
> have to specify the size for the journal.

Also might be useful to know make/model of ssd, plus motherboard 
make/model (in case commenting out size does not fix)!

Regards

Mark



* Re: problems creating new ceph cluster when using journal on block device
  2012-11-08  8:24   ` Mark Kirkwood
@ 2012-11-08 15:01     ` Travis Rhoden
  2012-11-08 15:08       ` Travis Rhoden
  0 siblings, 1 reply; 7+ messages in thread
From: Travis Rhoden @ 2012-11-08 15:01 UTC (permalink / raw)
  To: Mark Kirkwood; +Cc: Wido den Hollander, ceph-devel

>>> [osd]
>>>          osd journal size = 4000
>>
>>
>> Not sure if this is the problem, but when using a block device you don't
>> have to specify the size for the journal.

So happy to know that, Wido!  I had hoped there was a way to skip that.

Tried without it -- only difference in the logs was seeing that it
picked up the full size of the partition.  So, same result.

> Also might be useful to know make/model of ssd, plus motherboard make/model
> (in case commenting out size does not fix)!

It's an Intel X25-E, 64GB.  It's a place-holder until some bigger ones
we have on order show up.

The motherboard is a SuperMicro X8DT6.  SSDs are connected to onboard
SATA ports; data drives are connected to an LSI 9211-8i (SAS2008).

Maybe there is a special way I need to partition the disk?  My goal was
to throw 6 journals on this disk, and it is partitioned like so:

Model: ATA SSDSA2SH064G1GC (scsi)
Disk /dev/sda: 64.0GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start   End     Size    Type      File system  Flags
 1      1049kB  512MB   511MB   primary                raid
 2      512MB   2511MB  2000MB  primary                raid
 3      2511MB  6512MB  4000MB  primary                raid
 4      6512MB  64.0GB  57.5GB  extended
 5      6513MB  15.1GB  8590MB  logical
 6      15.1GB  23.7GB  8590MB  logical
 7      23.7GB  32.3GB  8590MB  logical
 8      32.3GB  40.9GB  8590MB  logical
 9      40.9GB  49.5GB  8590MB  logical
10      49.5GB  58.1GB  8590MB  logical


So, sda5-10 are my journal partitions.  I know that I have consumed
most of the drive here, and that is bad for the SSD and such, but it
really is a temporary setup.

 - Travis


* Re: problems creating new ceph cluster when using journal on block device
  2012-11-08 15:01     ` Travis Rhoden
@ 2012-11-08 15:08       ` Travis Rhoden
  2012-11-08 17:36         ` Travis Rhoden
  0 siblings, 1 reply; 7+ messages in thread
From: Travis Rhoden @ 2012-11-08 15:08 UTC (permalink / raw)
  To: ceph-devel

One more thing -- Google search says this is harmless -- I see quite a
few of these in syslog:

hdparm: sending ioctl 2285 to a partition!


* Re: problems creating new ceph cluster when using journal on block device
  2012-11-08 15:08       ` Travis Rhoden
@ 2012-11-08 17:36         ` Travis Rhoden
  2012-11-08 17:41           ` Mark Nelson
  0 siblings, 1 reply; 7+ messages in thread
From: Travis Rhoden @ 2012-11-08 17:36 UTC (permalink / raw)
  To: ceph-devel

Solved!

I stumbled into the solution while switching from block device to a
file.  I was being bitten by running mkcephfs multiple times -- it wasn't
really failing on the journal, it was failing because the OSD data
disk had been initialized before.  I couldn't see that until I used a
file for the journal and then saw log output like:

=== osd.0 ===
2012-11-08 16:41:37.677620 7ffc3cfcd780 -1 provided osd id 0 != superblock's -1
2012-11-08 16:41:37.678726 7ffc3cfcd780 -1  ** ERROR: error creating
empty object store in /var/lib/ceph/osd/ceph-0: (22) Invalid argument

I unmounted the OSDs that had been touched before, reformatted them,
and then remounted.  I set up ceph.conf to use block devices for the
journals, and then everything proceeded normally.

So the final relevant bits from my ceph.conf file look like:

[osd]
        osd journal size = 0
        journal dio = true
        journal aio = true

[osd.0]
        host = ceph1
        osd journal = /dev/sda5

[osd.1]
        host = ceph1
        osd journal = /dev/sda6
...
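
For anyone generating these sections rather than typing them, here is a
minimal Python sketch.  It assumes the pattern continues the way the two
sections above suggest (osd.N on host ceph1 using /dev/sda(5+N), six
journals in total); only osd.0 and osd.1 are actually shown above, and
build_journal_sections() is a made-up helper name:

```python
import configparser
import io


def build_journal_sections(host, first_partition, osd_count, disk="/dev/sda"):
    """Build [osd] plus one [osd.N] section per journal partition."""
    config = configparser.ConfigParser()
    config.optionxform = str  # keep option names exactly as written
    config["osd"] = {
        "osd journal size": "0",
        "journal dio": "true",
        "journal aio": "true",
    }
    for osd_id in range(osd_count):
        # Assumption: osd.N uses partition first_partition + N on the same disk.
        config["osd.%d" % osd_id] = {
            "host": host,
            "osd journal": "%s%d" % (disk, first_partition + osd_id),
        }
    return config


config = build_journal_sections("ceph1", first_partition=5, osd_count=6)
buffer = io.StringIO()
config.write(buffer)
print(buffer.getvalue())
```

The output is plain INI text that can be pasted into ceph.conf after
checking each device path against the real partition layout.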

Thanks,

 - Travis


* Re: problems creating new ceph cluster when using journal on block device
  2012-11-08 17:36         ` Travis Rhoden
@ 2012-11-08 17:41           ` Mark Nelson
  0 siblings, 0 replies; 7+ messages in thread
From: Mark Nelson @ 2012-11-08 17:41 UTC (permalink / raw)
  To: Travis Rhoden; +Cc: ceph-devel

On 11/08/2012 11:36 AM, Travis Rhoden wrote:
> Solved!
>
> I stumbled into the solution while switching from block device to a
> file.  I was being bit by running mkcephfs multiple times -- it wasn't
> really failing on the journal, it was failing because the OSD data
> disk had been initialized before.  I couldn't see that until I used a
> file for the journal and then I see log output like:

Yeah, that was a change that landed a couple of months ago.  It's really 
important now to blow away the old data (I just reformat) if you want a 
totally clean ceph deployment rather than just running mkcephfs.

