Cannot mount ceph filesystem: error 5 (Input/Output error)

* Cannot mount ceph filesystem: error 5 (Input/Output error)
@ 2011-11-25 16:53 Guido Winkelmann
  2011-11-25 18:54 ` Wido den Hollander
  0 siblings, 1 reply; 9+ messages in thread
From: Guido Winkelmann @ 2011-11-25 16:53 UTC (permalink / raw)
  To: ceph-devel

Hi,

When trying to mount a Ceph filesystem with "mount.ceph 10.3.1.33:6789:/ /mnt", 
the command hangs for several minutes and then fails with the message 

mount error 5 = Input/output error
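
As a side note on reading that message: the number printed by mount.ceph is a standard errno value. The short sketch below only looks the number up with Python's standard library. It assumes a Linux-style errno table, and it says nothing about the cause, since EIO is a generic code.

```python
import errno
import os

# The numeric code reported by mount.ceph in this thread.
code = 5

# errno.errorcode maps the number to its symbolic name;
# os.strerror gives the platform's message text for it.
symbol = errno.errorcode[code]
print(f"mount error {code} = {symbol} ({os.strerror(code)})")
```

Because EIO only says that some I/O step failed, the cluster state and the logs still have to be checked to find the actual reason.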

Background:

I'm trying to set up a small 3-machine Ceph cluster, to be used for network-
transparent block- and file storage for Qemu server virtualization. This is the 
first time I'm doing anything at all with Ceph, so many of the concepts are 
still new and a bit confusing to me.

My plan was to set up each of the three machines equally with one mon, one osd 
and one mds, and to add more servers, or replace the existing ones with bigger 
machines, as need arises.

So far, I have only started with one of the three servers, to get at least a 
basic configuration working and to get some feel for the system and its 
components, before adding the other two.

The machines are all running a basic installation of CentOS 6. What I did so 
far on the first machine was:

- Build a newer kernel (3.1.1) with built-in support for btrfs and ceph, 
reboot with it
- Prepare two larger empty partitions (sda5 and sdb5) for data storage
- Download and unpack ceph-0.38.tar.gz, cd ceph-0.38
- Run ./configure
- Repeat numerous times while installing one dependency after the other, 
eventually ending up with ./configure --without-fuse --without-tcmalloc
- make && make install
- Copy src/init-ceph to /etc/init.d/
- Create /usr/local/etc/ceph/ceph.conf with this content:

[global]
        max open files = 131072
        log file = /var/log/ceph/$name.log
        log_to_syslog = true
        pid file = /var/run/ceph/$name.pid

[mon]
        mon data = /mondata/$name

[mon.ceph1]
        host = ceph1
        mon addr = 10.3.1.33:6789

[mds]
        keyring = /cephxdata/keyring.$name

[mds.alpha]
        host = ceph1

[osd]
        osd data = /data/$name
        osd journal = /data/$name/journal
        osd journal size = 1000 ; journal size, in megabytes

[osd.0]
        host = ceph1
        btrfs devs = /dev/sda5 /dev/sdb5

(Slightly adjusted from src/sample.ceph.conf, comments removed for brevity)

- Run mkcephfs -c /usr/local/etc/ceph/ceph.conf --mkbtrfs -a -k \
/usr/local/etc/ceph/keyring.bin
- Run /etc/init.d/ceph start

After these steps, I tried to mount the ceph filesystem (on the same machine) 
with "mount.ceph 10.3.1.33:6789:/ /mnt" and got the aforementioned error.

The only thing that happens in the logs when I try to mount is this one line 
in mon log:
2011-11-25 17:26:53.216745 7f0ee75df700 -- 10.3.1.33:6789/0 >> 
10.3.1.33:0/1719960241 pipe(0x7f0ee0001320 sd=12 pgs=0 cs=0 l=0).accept peer 
addr is really 10.3.1.33:0/1719960241 (socket is 10.3.1.33:46908/0)

The osd and mds logs do not show any activity at that point.

Does anybody have any idea what could be going wrong here?

	Guido

* Re: Cannot mount ceph filesystem: error 5 (Input/Output error)
  2011-11-25 16:53 Cannot mount ceph filesystem: error 5 (Input/Output error) Guido Winkelmann
@ 2011-11-25 18:54 ` Wido den Hollander
  2011-12-02 16:24   ` Guido Winkelmann
  0 siblings, 1 reply; 9+ messages in thread
From: Wido den Hollander @ 2011-11-25 18:54 UTC (permalink / raw)
  To: Guido Winkelmann; +Cc: ceph-devel

Hi Guido,

On 11/25/2011 05:53 PM, Guido Winkelmann wrote:
> Hi,
>
> When trying to mount a Ceph filesystem with "mount.ceph 10.3.1.33:6789:/ /mnt",
> the command hangs for several minutes and then fails with the message
>
> mount error 5 = Input/output error

Does "ceph -s" work? If so, what is the output?

>
> Background:
>
> I'm trying to set up a small 3-machine Ceph cluster, to be used for network-
> transparent block- and file storage for Qemu server virtualization. This is the
> first time I'm doing anything at all with Ceph, so many of the concepts are
> still new and a bit confusing to me.
>
> My plan was to set up each of the three machines equally with one mon, one osd
> and one mds, and to add more servers, or replace the existing ones with bigger
> machines, as need arises.

It should be enough to have 3 OSDs, 1 MON and 1 MDS; for basic
testing that is all you'd need.

>
> So far, I have only started with one of the three servers, to get at least a
> basic configuration working and to get some feel for the system and its
> components, before adding the other two.
>
> The machines are all running a basic installation of CentOS 6. What I did so
> far on the first machine was:
>
> - Build a newer kernel (3.1.1) with built-in support for btrfs and ceph,
> reboot with it
> - Prepare two larger empty partitions (sda5 and sdb5) for data storage
> - Download and unpack ceph-0.38.tar.gz, cd ceph-0.38
> - Run ./configure
> - Repeat numerous times while installing one dependency after the other,
> eventually ending up with ./configure --without-fuse --without-tcmalloc

I'd recommend using tcmalloc, it will save a lot of memory usage.

> - make && make install
> - Copy src/init-ceph to /etc/init.d/
> - Create /usr/local/etc/ceph/ceph.conf with this content:
>
> [global]
>          max open files = 131072
>          log file = /var/log/ceph/$name.log
>          log_to_syslog = true
>          pid file = /var/run/ceph/$name.pid

You want to log to syslog and files simultaneously? That will cause 
double the I/O on that system.

>
> [mon]
>          mon data = /mondata/$name
>
> [mon.ceph1]
>          host = ceph1
>          mon addr = 10.3.1.33:6789
>
> [mds]
>          keyring = /cephxdata/keyring.$name
>

Why did you specify a keyring here? You did not enable the cephx 
authentication.

> [mds.alpha]
>          host = ceph1
>
> [osd]
>          osd data = /data/$name
>          osd journal = /data/$name/journal
>          osd journal size = 1000 ; journal size, in megabytes
>
> [osd.0]
>          host = ceph1
>          btrfs devs = /dev/sda5 /dev/sdb5
>

Does that actually work? I have never tested specifying two devices. What
are you trying to do? Create a striped filesystem?

> (Slightly adjusted from src/sample.ceph.conf, comments removed for brevity)
>
> - Run mkcephfs -c /usr/local/etc/ceph/ceph.conf --mkbtrfs -a -k \
> /usr/local/etc/ceph/keyring.bin
> - Run /etc/init.d/ceph start
>
> After these steps, I tried to mount the ceph filesystem (on the same machine)
> with "mount.ceph 10.3.1.33:6789:/ /mnt" and got the aforementioned error.
>

Mounting the Ceph filesystem on the same host as where your OSD is 
running is not recommended. Although it should probably work, you could 
run into some trouble.

You could take a look at my ceph.conf: http://zooi.widodh.nl/ceph/ceph.conf

That might give you some more clues!
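
The two configuration remarks above (logging to syslog and to a file at once, and a keyring without cephx) can be checked mechanically. The following is a minimal sketch using Python's configparser on an excerpt of the posted ceph.conf. The helper names `load` and `lint` are hypothetical, and the option name `auth supported = cephx` for enabling cephx is an assumption about this Ceph version, not something stated in the thread.

```python
import configparser

# Excerpt of the ceph.conf posted in this thread.
CONF = """\
[global]
        log file = /var/log/ceph/$name.log
        log_to_syslog = true

[mds]
        keyring = /cephxdata/keyring.$name

[osd]
        osd journal size = 1000 ; journal size, in megabytes
"""


def load(text):
    """Parse ceph.conf-style text (hypothetical helper)."""
    parser = configparser.ConfigParser(
        interpolation=None, inline_comment_prefixes=(";",)
    )
    # Treat "log file" and "log_file" as the same key.
    parser.optionxform = lambda key: key.strip().lower().replace(" ", "_")
    # Strip indentation so no line is read as a continuation line.
    parser.read_string("\n".join(line.strip() for line in text.splitlines()))
    return parser


def lint(parser):
    """Return warnings for the two issues raised in the reply."""
    found = []
    glob = parser["global"] if parser.has_section("global") else {}
    if glob.get("log_file") and glob.get("log_to_syslog", "false").lower() == "true":
        found.append("[global] logs to both a file and syslog")
    # Assumption: cephx is switched on with "auth supported = cephx".
    if glob.get("auth_supported", "none").lower() != "cephx":
        for section in parser.sections():
            if parser.has_option(section, "keyring"):
                found.append(f"[{section}] sets a keyring but cephx is not enabled")
    return found


found = lint(load(CONF))
for message in found:
    print(message)
```

Neither warning explains the mount failure by itself; the sketch only makes the two review comments repeatable.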

Wido

> The only thing that happens in the logs when I try to mount is this one line
> in mon log:
> 2011-11-25 17:26:53.216745 7f0ee75df700 -- 10.3.1.33:6789/0 >>
> 10.3.1.33:0/1719960241 pipe(0x7f0ee0001320 sd=12 pgs=0 cs=0 l=0).accept peer
> addr is really 10.3.1.33:0/1719960241 (socket is 10.3.1.33:46908/0)
>
> The osd and mds logs do not show any activity at that point.
>
> Does anybody have any idea what could be going wrong here?
>
> 	Guido
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: Cannot mount ceph filesystem: error 5 (Input/Output error)
  2011-11-25 18:54 ` Wido den Hollander
@ 2011-12-02 16:24   ` Guido Winkelmann
  2011-12-02 19:29     ` Samuel Just
  0 siblings, 1 reply; 9+ messages in thread
From: Guido Winkelmann @ 2011-12-02 16:24 UTC (permalink / raw)
  To: ceph-devel

Hi,

On Friday, 25 November 2011, 19:54:42, Wido den Hollander wrote:
> Hi Guido,
> 
> On 11/25/2011 05:53 PM, Guido Winkelmann wrote:
> > Hi,
> > 
> > When trying to mount a Ceph filesystem with "mount.ceph 10.3.1.33:6789:/
> > /mnt", the command hangs for several minutes and then fails with the
> > message
> > 
> > mount error 5 = Input/output error
> 
> Does "ceph -s" work? If so, what is the output?

It works. This is the output:

# ceph -s
2011-11-28 18:51:32.018383    pg v60: 6 pgs: 6 active+clean+degraded; 0 KB 
data, 1292 KB used, 1750 GB / 1752 GB avail
2011-11-28 18:51:32.018610   mds e11: 1/1/1 up {0=alpha=up:creating}
2011-11-28 18:51:32.018698   osd e22: 1 osds: 1 up, 1 in
2011-11-28 18:51:32.018767   log 2011-11-28 18:50:16.299822 mon.0 
10.3.1.33:6789/0 3 : [INF] mds.? 10.3.1.33:6800/21218 up:boot
2011-11-28 18:51:32.018833   mon e1: 1 mons at {ceph1=10.3.1.33:6789/0}
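
For anyone who wants to pull the interesting fields out of status output like the above, here is a small sketch. It assumes the line layout shown in this thread (which may differ in other Ceph versions), and the function name `summarize` is hypothetical. The handling of several comma-separated PG states is likewise an assumption, since only one state appears here.

```python
import re

# Three of the status lines from this thread, joined onto single lines.
STATUS = """\
2011-11-28 18:51:32.018383    pg v60: 6 pgs: 6 active+clean+degraded; 0 KB data, 1292 KB used, 1750 GB / 1752 GB avail
2011-11-28 18:51:32.018610   mds e11: 1/1/1 up {0=alpha=up:creating}
2011-11-28 18:51:32.018698   osd e22: 1 osds: 1 up, 1 in
"""


def summarize(status_text):
    """Extract PG, MDS and OSD counters from 'ceph -s' style text."""
    summary = {"pgs": None, "pg_states": {}, "mds_states": {}, "osds": None}

    pg = re.search(r"pg v\d+: (\d+) pgs: ([^;]*);", status_text)
    if pg:
        summary["pgs"] = int(pg.group(1))
        for part in pg.group(2).split(","):
            count, state = part.split()
            summary["pg_states"][state] = int(count)

    mds = re.search(r"mds e\d+: \S+ up \{([^}]*)\}", status_text)
    if mds:
        # Entries look like "0=alpha=up:creating" (rank=name=state).
        for entry in filter(None, mds.group(1).split(",")):
            _rank, name, state = entry.split("=")
            summary["mds_states"][name] = state

    osd = re.search(r"osd e\d+: (\d+) osds: (\d+) up, (\d+) in", status_text)
    if osd:
        total, up, in_ = (int(value) for value in osd.groups())
        summary["osds"] = {"total": total, "up": up, "in": in_}
    return summary


summary = summarize(STATUS)
not_active = [n for n, s in summary["mds_states"].items() if s != "up:active"]
print(summary)
print("MDS daemons not yet active:", not_active)
```

In this output the only MDS is still in `up:creating` rather than `up:active`, which is consistent with a client mount that hangs and then gives up.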

> 
> > Background:
> > 
> > I'm trying to set up a small 3-machine Ceph cluster, to be used for
> > network- transparent block- and file storage for Qemu server
> > virtualization. This is the first time I'm doing anything at all with
> > Ceph, so many of the concepts are still new and a bit confusing to me.
> > 
> > My plan was to set up each of the three machines equally with one mon,
> > one osd and one mds, and to add more servers, or replace the existing
> > ones with bigger machines, as need arises.
> 
> It should be enough to have 3 OSDs, 1 MON and 1 MDS; for basic
> testing that is all you'd need.

Well as long as I have three machines, I might as well go for some more 
redundancy... No need to have the whole cluster fail just because the one node 
with the mon or mds daemon goes down.

[...]
> > - make && make install
> > - Copy src/init-ceph to /etc/init.d/
> > - Create /usr/local/etc/ceph/ceph.conf with this content:
> > 
> > [global]
> > 
> >          max open files = 131072
> >          log file = /var/log/ceph/$name.log
> >          log_to_syslog = true
> >          pid file = /var/run/ceph/$name.pid
> 
> You want to log to syslog and files simultaneously? That will cause
> double the I/O on that system.

Well, I've switched off the syslog part now.

> > [mon]
> > 
> >          mon data = /mondata/$name
> > 
> > [mon.ceph1]
> > 
> >          host = ceph1
> >          mon addr = 10.3.1.33:6789
> > 
> > [mds]
> > 
> >          keyring = /cephxdata/keyring.$name
> 
> Why did you specify a keyring here? You did not enable the cephx
> authentication.

I suppose I just forgot that particular part. It doesn't seem to make any
difference though - I tried removing that line and restarting ceph, and it
did not change anything.

At first, I tried starting with Cephx, but that led to other errors, so I
disabled it to remove one variable when looking for the problem.
The biggest problem I still have with Cephx is the lack of documentation. So
far, the best I have found is an offhand mention in the "Cluster_configuration"
page on the wiki and the man page for ceph-authtool. Maybe I've missed
something, but so far, I haven't found anything that really explains how this
is supposed to work.

One thing in particular I don't get yet is why all example config files seem to
give each component a different keyring file. As far as I understood the design,
each client has to connect directly to every component (mon/mds/osd),
so how can authentication still work if they each have their own
keyring?

> > [mds.alpha]
> > 
> >          host = ceph1
> > 
> > [osd]
> > 
> >          osd data = /data/$name
> >          osd journal = /data/$name/journal
> >          osd journal size = 1000 ; journal size, in megabytes
> > 
> > [osd.0]
> > 
> >          host = ceph1
> >          btrfs devs = /dev/sda5 /dev/sdb5
> 
> Does that actually work? I have never tested specifying two devices. What
> are you trying to do? Create a striped filesystem?

Well, mostly I was just trying to use the storage space from both hard disks in
this machine. Since the built-in multi-device capability of btrfs is one of 
the most advertised features, this looked like the logical solution to me.

It looks like it's working, too. btrfs-show shows me a filesystem with two 
devices, and twice the storage space of just one of them. The mounted 
filesystem under /data can be used like any other filesystem...

Would you say I should go about this differently?

> > (Slightly adjusted from src/sample.ceph.conf, comments removed for
> > brevity)
> > 
> > - Run mkcephfs -c /usr/local/etc/ceph/ceph.conf --mkbtrfs -a -k \
> > /usr/local/etc/ceph/keyring.bin
> > - Run /etc/init.d/ceph start
> > 
> > After these steps, I tried to mount the ceph filesystem (on the same
> > machine) with "mount.ceph 10.3.1.33:6789:/ /mnt" and got the
> > aforementioned error.
> Mounting the Ceph filesystem on the same host as where your OSD is
> running is not recommended. Although it should probably work, you could
> run into some trouble.

Really? That's the first I've heard of that, and it seems quite counter-
intuitive, too.

Anyway, I've tried mounting the ceph filesystem from a different host, and the 
result is exactly the same.

BTW, when I'm issuing the mount command, I can see these lines in dmesg:

libceph: client4602 fsid c8dd60a6-3dc7-6188-3f84-c58ff04f244a
libceph: mon0 10.3.1.33:6789 session established

> You could take a look at my ceph.conf: http://zooi.widodh.nl/ceph/ceph.conf
> 
> That might give you some more clues!

Hm, I'm afraid it doesn't. It's neat to see a working cluster with 10 storage 
nodes and IPv6, but it gives me no pointers as to why /my/ installation isn't 
working.

	Guido

* Re: Cannot mount ceph filesystem: error 5 (Input/Output error)
  2011-12-02 16:24   ` Guido Winkelmann
@ 2011-12-02 19:29     ` Samuel Just
  2011-12-03 14:21       ` Guido Winkelmann
  0 siblings, 1 reply; 9+ messages in thread
From: Samuel Just @ 2011-12-02 19:29 UTC (permalink / raw)
  To: ceph-devel

Guido,

Sorry for the confusion, you hit a bug where the default map for a
cluster with one osd contains no pgs.  0.39 (which will be released
today) will have a fix.

-Sam

* Re: Cannot mount ceph filesystem: error 5 (Input/Output error)
  2011-12-02 19:29     ` Samuel Just
@ 2011-12-03 14:21       ` Guido Winkelmann
  2011-12-06 19:51         ` Gregory Farnum
  0 siblings, 1 reply; 9+ messages in thread
From: Guido Winkelmann @ 2011-12-03 14:21 UTC (permalink / raw)
  To: ceph-devel

On Friday 02 December 2011 11:29:41 Samuel Just wrote:
> Guido,
> 
> Sorry for the confusion, you hit a bug where the default map for a
> cluster with one osd contains no pgs.  0.39 (which will be released
> today) will have a fix.

Really? Then why does the output of ceph -s below mention 6 pgs?

BTW, that's another aspect where the documentation is a bit lacking right now.
I've found a page telling me how to change the number of PGs, but so far I
couldn't find any explanation of what a PG actually is, or why I should want to
change their number...

	Guido
 
> -Sam
> 
> On Fri, Dec 2, 2011 at 8:24 AM, Guido Winkelmann
> 
> <guido-ceph@thisisnotatest.de> wrote:
> > Hi,
> > 
> > On Friday, 25 November 2011, 19:54:42, Wido den Hollander wrote:
> >> Hi Guido,
> >> 
> >> On 11/25/2011 05:53 PM, Guido Winkelmann wrote:
> >> > Hi,
> >> > 
> >> > When trying to mount a Ceph filesystem with "mount.ceph
> >> > 10.3.1.33:6789:/ /mnt", the command hangs for several minutes and
> >> > then fails with the message
> >> > 
> >> > mount error 5 = Input/output error
> >> 
> >> Does "ceph -s" work? If so, what is the output?
> > 
> > It works. This is the output:
> > 
> > # ceph -s
> > 2011-11-28 18:51:32.018383    pg v60: 6 pgs: 6 active+clean+degraded; 0
> > KB data, 1292 KB used, 1750 GB / 1752 GB avail
> > 2011-11-28 18:51:32.018610   mds e11: 1/1/1 up {0=alpha=up:creating}
> > 2011-11-28 18:51:32.018698   osd e22: 1 osds: 1 up, 1 in
> > 2011-11-28 18:51:32.018767   log 2011-11-28 18:50:16.299822 mon.0
> > 10.3.1.33:6789/0 3 : [INF] mds.? 10.3.1.33:6800/21218 up:boot
> > 2011-11-28 18:51:32.018833   mon e1: 1 mons at {ceph1=10.3.1.33:6789/0}



* Re: Cannot mount ceph filesystem: error 5 (Input/Output error)
  2011-12-03 14:21       ` Guido Winkelmann
@ 2011-12-06 19:51         ` Gregory Farnum
  2011-12-07 14:38           ` Guido Winkelmann
  0 siblings, 1 reply; 9+ messages in thread
From: Gregory Farnum @ 2011-12-06 19:51 UTC (permalink / raw)
  To: Guido Winkelmann; +Cc: ceph-devel

On Sat, Dec 3, 2011 at 6:21 AM, Guido Winkelmann
<guido-ceph@thisisnotatest.de> wrote:
> On Friday 02 December 2011 11:29:41 Samuel Just wrote:
>> Guido,
>>
>> Sorry for the confusion, you hit a bug where the default map for a
>> cluster with one osd contains no pgs.  0.39 (which will be released
>> today) will have a fix.
>
> Really? Then why does the output of ceph -s below mention 6 pgs?
There are a couple of different categories of PGs; the 6 that exist
are "local" PGs which are tied to a specific OSD. However, those
aren't actually used in a standard Ceph configuration.


> BTW, that's another aspect where the documentation is a bit lacking right now.
> I've found a page telling me how to change the number of PGs, but so far I
> couldn't find any explanation of what a PG actually is, or why I should want to
> change their number...
PG = "placement group". When placing data in the cluster, objects are
mapped into PGs, and those PGs are mapped onto OSDs. We use the
indirection so that we can group objects, which reduces the amount of
per-object metadata we need to keep track of and processes we need to
run (it would be prohibitively expensive to track, e.g., the placement
history on a per-object basis).
Increasing the number of PGs can reduce the variance in per-OSD load
across your cluster, but each PG requires a bit more CPU and memory on
the OSDs that are storing it. We try and ballpark it at 100 PGs/OSD,
although it can vary widely without ill effects depending on your
cluster. You hit a bug in how we calculate the initial PG number from
a cluster description.
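
The trade-off described above can be illustrated with a toy simulation. This is only a sketch: it spreads PGs over OSDs uniformly at random as a stand-in for the real placement logic, counts each PG once (no replicas), and the function names and the fixed seed are illustrative choices, not part of Ceph.

```python
import random
from statistics import mean, pstdev


def suggested_pg_count(num_osds, pgs_per_osd=100):
    """Ballpark from the text: roughly 100 PGs per OSD."""
    return num_osds * pgs_per_osd


def relative_spread(num_osds, pg_count, seed=0):
    """Std deviation of PGs per OSD divided by the mean (toy model)."""
    rng = random.Random(seed)  # fixed seed keeps the run repeatable
    per_osd = [0] * num_osds
    for _ in range(pg_count):
        per_osd[rng.randrange(num_osds)] += 1
    return pstdev(per_osd) / mean(per_osd)


num_osds = 20
few = relative_spread(num_osds, num_osds * 4)
many = relative_spread(num_osds, suggested_pg_count(num_osds))
print(f"  4 PGs/OSD: relative spread {few:.3f}")
print(f"100 PGs/OSD: relative spread {many:.3f}")
```

With more PGs per OSD the relative spread shrinks, which is the variance reduction mentioned above; the cost side (CPU and memory per PG) is not modelled here.
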
-Greg

* Re: Cannot mount ceph filesystem: error 5 (Input/Output error)
  2011-12-06 19:51         ` Gregory Farnum
@ 2011-12-07 14:38           ` Guido Winkelmann
  2011-12-07 18:00             ` Tommi Virtanen
  0 siblings, 1 reply; 9+ messages in thread
From: Guido Winkelmann @ 2011-12-07 14:38 UTC (permalink / raw)
  To: ceph-devel

On Tuesday, 6 December 2011, 11:51:45, you wrote:
> PG = "placement group". When placing data in the cluster, objects are
> mapped into PGs, and those PGs are mapped onto OSDs.

What does the object->PG mapping look like? Do you map more than one object to
one PG, or do you sometimes map an object to more than one PG? How about the
mapping of PGs to OSDs: does one PG belong to exactly one OSD?

Does one PG represent a fixed amount of storage space?

        Guido

* Re: Cannot mount ceph filesystem: error 5 (Input/Output error)
  2011-12-07 14:38           ` Guido Winkelmann
@ 2011-12-07 18:00             ` Tommi Virtanen
  2011-12-07 18:03               ` Tommi Virtanen
  0 siblings, 1 reply; 9+ messages in thread
From: Tommi Virtanen @ 2011-12-07 18:00 UTC (permalink / raw)
  To: Guido Winkelmann; +Cc: ceph-devel

On Wed, Dec 7, 2011 at 06:38, Guido Winkelmann
<guido-ceph@thisisnotatest.de> wrote:
> On Tuesday, 6 December 2011, 11:51:45, you wrote:
>> PG = "placement group". When placing data in the cluster, objects are
>> mapped into PGs, and those PGs are mapped onto OSDs.
>
> What does the object->PG mapping look like? Do you map more than one object to
> one PG, or do you sometimes map an object to more than one PG? How about the
> mapping of PGs to OSDs: does one PG belong to exactly one OSD?
>
> Does one PG represent a fixed amount of storage space?

Many objects map to one PG.

Each object maps to exactly one PG.

One PG maps to a single list of OSDs, where the first one in the list
is the primary and the rest are replicas.

Many PGs can map to one OSD.

A PG represents nothing but a grouping of objects; you configure the
number of PGs you want (see
http://ceph.newdream.net/wiki/Changing_the_number_of_PGs ); number of
OSDs * 100 is a good starting point, and all of your stored objects
are pseudo-randomly and evenly distributed across the PGs. So a PG
explicitly does NOT represent a fixed amount of storage; it represents
1/pg_num of the storage you happen to have on your OSDs.
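
To make the "1/pg_num" point concrete, here is a small sketch that hashes made-up object names into PGs and counts how many land in each. SHA-1 is used only as a convenient stable hash for the illustration and is not the hash Ceph uses; the object names and `pg_for_object` are hypothetical.

```python
import hashlib
from collections import Counter


def pg_for_object(name, pg_num):
    """Map an object name to exactly one PG (illustrative hash only)."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % pg_num


pg_num = 8
num_objects = 80_000
counts = Counter(pg_for_object(f"object-{i}", pg_num) for i in range(num_objects))

# Each PG ends up with roughly 1/pg_num of all objects.
shares = {pg: counts[pg] / num_objects for pg in range(pg_num)}
for pg, share in shares.items():
    print(f"pg {pg}: {share:.4f} (ideal {1 / pg_num:.4f})")
```

The same name always hashes to the same PG, so many objects share one PG while each object belongs to exactly one.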

Ignoring the finer points of CRUSH and custom placement, it goes
something like this in pseudocode:

locator = object_name
obj_hash = hash(locator)
pg = obj_hash % num_pg
osds_for_pg = crush(pg)  # returns a list of osds
primary = osds_for_pg[0]
replicas = osds_for_pg[1:]

If you want to understand the crush() part in the above, imagine a
perfectly spherical datacenter in a vacuum ;) that is, if all osds
have weight 1.0, and there is no topology to the data center (all OSDs
are on the top level), and you use defaults, etc, it simplifies to
consistent hashing; you can think of it as:

def crush(pg):
    all_osds = ['osd.0', 'osd.1', 'osd.2', ...]
    result = []
    # size is the number of copies; primary+replicas
    while len(result) < size:
        r = get_random_number()
        chosen = all_osds[ r % len(all_osds) ]
        if chosen in result:
            # osd can be picked only once
            continue
        result.append(chosen)
    return result

* Re: Cannot mount ceph filesystem: error 5 (Input/Output error)
  2011-12-07 18:00             ` Tommi Virtanen
@ 2011-12-07 18:03               ` Tommi Virtanen
  0 siblings, 0 replies; 9+ messages in thread
From: Tommi Virtanen @ 2011-12-07 18:03 UTC (permalink / raw)
  To: Guido Winkelmann; +Cc: ceph-devel

On Wed, Dec 7, 2011 at 10:00, Tommi Virtanen
<tommi.virtanen@dreamhost.com> wrote:
> def crush(pg):
>    all_osds = ['osd.0', 'osd.1', 'osd.2', ...]
>    result = []
>    # size is the number of copies; primary+replicas
>    while len(result) < size:
>        r = get_random_number()

Err I mean pseudorandom, based on pg. So basically "r = pg" there.

>        chosen = all_osds[ r % len(all_osds) ]
>        if chosen in result:
>            # osd can be picked only once
>            continue
>        result.append(chosen)
>    return result
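
A runnable variant of the simplified function, with the correction folded in, might look like the sketch below. It is not CRUSH: it assumes equal weights and a flat topology as described above, and it draws a repeatable pseudorandom sequence seeded by the PG number instead of literally setting r = pg, because a constant r would never get past the first duplicate. The explicit `all_osds` and `size` parameters are additions for the sake of a self-contained example.

```python
import random


def crush(pg, all_osds, size):
    """Pick `size` distinct OSDs for a PG, repeatably (toy stand-in for CRUSH)."""
    if size > len(all_osds):
        raise ValueError("cannot place more copies than there are OSDs")
    rng = random.Random(pg)  # same PG number -> same sequence of draws
    result = []
    # size is the number of copies; primary + replicas
    while len(result) < size:
        r = rng.randrange(2**32)
        chosen = all_osds[r % len(all_osds)]
        if chosen in result:
            # an OSD can be picked only once
            continue
        result.append(chosen)
    return result


osds = ["osd.0", "osd.1", "osd.2"]
placement = crush(42, osds, size=2)
primary, replicas = placement[0], placement[1:]
print("primary:", primary, "replicas:", replicas)
```

Calling `crush(42, osds, size=2)` again returns the same list, which is the property that lets every client compute the placement without asking a central directory.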