* recovering from 95% full osd
From: Roman Hlynovskiy @ 2013-01-08 10:42 UTC
To: ceph-devel
Hello,
I am running ceph v0.56 and am currently trying to recover a cluster that
got completely stuck after one OSD filled to 95%. It looks like the
distribution algorithm is not perfect: all 3 OSDs I use are
256 GB each, yet one of them filled faster than the others:
osd-1:
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg00-osd  252G  173G   80G  69% /var/lib/ceph/osd/ceph-0
osd-2:
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg00-osd  252G  203G   50G  81% /var/lib/ceph/osd/ceph-1
osd-3:
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg00-osd  252G  240G   13G  96% /var/lib/ceph/osd/ceph-2
At the moment the MDS is showing the following behaviour:
2013-01-08 16:25:47.006354 b4a73b70 0 mds.0.objecter FULL, paused modify 0x9ba63c0 tid 23448
2013-01-08 16:26:47.005211 b4a73b70 0 mds.0.objecter FULL, paused modify 0xca86c30 tid 23449
So it does not respond to any mount requests.
I've tried various commands, such as:
ceph mon tell \* injectargs '--mon-osd-full-ratio 98'
ceph mon tell \* injectargs '--mon-osd-full-ratio 0.98'
and setting
'mon osd full ratio = 0.98' in the mon configuration for each mon.
However:
chef@ceph-node03:/var/log/ceph$ ceph health detail
HEALTH_ERR 1 full osd(s)
osd.2 is full at 95%
The MDS still believes 95% is the threshold, so it does not respond to mount requests.
chef@ceph-node03:/var/log/ceph$ rados -p data bench 10 write
Maintaining 16 concurrent writes of 4194304 bytes for at least 10 seconds.
Object prefix: benchmark_data_ceph-node03_3903
2013-01-08 16:33:02.363206 b6be3710 0 client.9958.objecter FULL, paused modify 0xa467ff0 tid 1
2013-01-08 16:33:02.363618 b6be3710 0 client.9958.objecter FULL, paused modify 0xa468780 tid 2
2013-01-08 16:33:02.363741 b6be3710 0 client.9958.objecter FULL, paused modify 0xa468f88 tid 3
2013-01-08 16:33:02.364056 b6be3710 0 client.9958.objecter FULL, paused modify 0xa469348 tid 4
2013-01-08 16:33:02.364171 b6be3710 0 client.9958.objecter FULL, paused modify 0xa469708 tid 5
2013-01-08 16:33:02.365024 b6be3710 0 client.9958.objecter FULL, paused modify 0xa469ac8 tid 6
2013-01-08 16:33:02.365187 b6be3710 0 client.9958.objecter FULL, paused modify 0xa46a2d0 tid 7
2013-01-08 16:33:02.365296 b6be3710 0 client.9958.objecter FULL, paused modify 0xa46a690 tid 8
2013-01-08 16:33:02.365402 b6be3710 0 client.9958.objecter FULL, paused modify 0xa46aa50 tid 9
2013-01-08 16:33:02.365508 b6be3710 0 client.9958.objecter FULL, paused modify 0xa46ae10 tid 10
2013-01-08 16:33:02.365635 b6be3710 0 client.9958.objecter FULL, paused modify 0xa46b1d0 tid 11
2013-01-08 16:33:02.365742 b6be3710 0 client.9958.objecter FULL, paused modify 0xa46b590 tid 12
2013-01-08 16:33:02.365868 b6be3710 0 client.9958.objecter FULL, paused modify 0xa46b950 tid 13
2013-01-08 16:33:02.365975 b6be3710 0 client.9958.objecter FULL, paused modify 0xa46bd10 tid 14
2013-01-08 16:33:02.366096 b6be3710 0 client.9958.objecter FULL, paused modify 0xa46c0d0 tid 15
2013-01-08 16:33:02.366203 b6be3710 0 client.9958.objecter FULL, paused modify 0xa46c490 tid 16
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0      16        16         0         0         0         -         0
    1      16        16         0         0         0         -         0
    2      16        16         0         0         0         -         0
rados doesn't work either.
chef@ceph-node03:/var/log/ceph$ ceph osd reweight-by-utilization
no change: average_util: 0.812678, overload_util: 0.975214. overloaded osds: (none)
This one doesn't help either.
Is there any chance to recover ceph?
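One possible reading of the reweight-by-utilization output above: the reported overload_util is 1.2 times average_util, which suggests the command only flags OSDs that are more than 20% above the cluster average. The Python sketch below redoes that comparison with the df figures from this message. The factor of 1.2 is an assumption inferred from the two numbers in the output, the function names are illustrative, and df-based values only approximate what the monitor computes.

```python
# Used / size in GB, taken from the df output in this message.
osds = {"osd.0": (173, 252), "osd.1": (203, 252), "osd.2": (240, 252)}

# Assumption: inferred from overload_util / average_util in the command output.
OVERLOAD_FACTOR = 1.2


def utilization(used_gb, size_gb):
    """Fraction of the OSD filesystem that is in use."""
    return used_gb / size_gb


def overloaded(osds, factor=OVERLOAD_FACTOR):
    """Return (average_util, overload_util, names of OSDs above the cutoff)."""
    total_used = sum(used for used, _ in osds.values())
    total_size = sum(size for _, size in osds.values())
    average = total_used / total_size
    limit = average * factor
    names = [n for n, (u, s) in osds.items() if utilization(u, s) > limit]
    return average, limit, names


average, limit, names = overloaded(osds)
print(f"average_util={average:.3f} overload_util={limit:.3f} overloaded={names}")
```

With these figures osd.2 sits at roughly 0.95, which is past the full ratio but still below the cutoff of about 0.98, so nothing is flagged. Calling `overloaded(osds, factor=1.1)` does flag osd.2, which hints that passing a lower threshold to the command might behave differently.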
--
...WBR, Roman Hlynovskiy
* Re: recovering from 95% full osd
From: Mark Nelson @ 2013-01-08 16:16 UTC
To: Roman Hlynovskiy; +Cc: ceph-devel
On 01/08/2013 04:42 AM, Roman Hlynovskiy wrote:
> is there any chance to recover ceph?
Hi,
There may be other ways to fix it, but one method might be to simply add
another OSD so the data gets redistributed. I wouldn't keep raising the
osd full ratio. I think Sam has said in the past that filling an OSD all
the way can turn a minor problem into a very big one.
Another option that may (or may not) work as a temporary solution is to
change the osd weights.
Having said that, I'm curious to know how many PGs you have? Do you
have a custom crush map? That distribution is pretty skewed!
Thanks,
Mark
* Re: recovering from 95% full osd
From: Gregory Farnum @ 2013-01-08 17:20 UTC
To: Roman Hlynovskiy; +Cc: ceph-devel@vger.kernel.org
On Tue, Jan 8, 2013 at 2:42 AM, Roman Hlynovskiy
<roman.hlynovskiy@gmail.com> wrote:
> is there any chance to recover ceph?
"ceph pg set_full_ratio 0.98"
However, as Mark mentioned, you want to figure out why one OSD is so
much fuller than the others first. Even in a small cluster I don't
think you should be able to see that kind of variance. Simply setting
the full ratio to 98% and then continuing to run could cause bigger
problems if that OSD continues to get a disproportionate share of the
writes and fills up its disk.
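To put a rough number on how little room a 0.98 full ratio buys, here is a small Python sketch using the df figures for the fullest OSD (252G size, 240G used). The helper name is illustrative, and it treats the df numbers as if they were exactly what the cluster compares against the ratio, which is a simplification.

```python
SIZE_GB = 252   # size of each OSD filesystem per the df output
USED_GB = 240   # usage on the fullest OSD (ceph-2)


def headroom_gb(size_gb, used_gb, full_ratio):
    """GB that can still be written before used/size reaches full_ratio."""
    return size_gb * full_ratio - used_gb


for ratio in (0.95, 0.98):
    room = headroom_gb(SIZE_GB, USED_GB, ratio)
    print(f"full_ratio={ratio:.2f}: headroom {room:+.2f} GB")
```

Under these assumptions the OSD is already slightly past the 0.95 mark, and raising the ratio to 0.98 leaves only about 7 GB before writes would block again.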
-Greg
* Re: recovering from 95% full osd
From: Roman Hlynovskiy @ 2013-01-09 4:19 UTC
To: Mark Nelson; +Cc: ceph-devel
Hello Mark,
OK, adding another OSD is a good option; however, my initial plan was
to raise the full ratio watermark and remove unnecessary data. It's clear
to me that overfilling one of the OSDs will cause big problems for the fs
consistency.
But... the 2 other OSDs still have plenty of space. From ceph's point of
view, what is the difference between adding a fresh OSD with plenty of
space and using the current OSDs that still have plenty of free space?
Surprisingly, my setup is rather standard :) Everything was done according to the online manuals.
chef@ceph-node01:~$ ceph -s
   health HEALTH_ERR 1 full osd(s)
   monmap e4: 3 mons at {a=192.168.7.11:6789/0,b=192.168.7.12:6789/0,c=192.168.7.13:6789/0}, election epoch 242, quorum 0,1,2 a,b,c
   osdmap e321: 3 osds: 3 up, 3 in full
   pgmap v113335: 384 pgs: 384 active+clean; 305 GB data, 614 GB used, 141 GB / 755 GB avail
   mdsmap e4599: 1/1/1 up {0=a=up:active}, 2 up:standby
If I read this correctly, I have 384 PGs.
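As a rough cross-check of the pgmap line, the Python sketch below estimates the replication factor from "305 GB data, 614 GB used" and, from that, how many PG copies each OSD would hold on average. The function names are illustrative. The estimate ignores journal and filesystem overhead, and rounding it to a whole replica count is an assumption.

```python
DATA_GB = 305    # logical data, from the pgmap line
USED_GB = 614    # raw space used, from the pgmap line
PG_COUNT = 384   # total placement groups
OSD_COUNT = 3


def estimated_replication(data_gb, used_gb):
    """Raw usage divided by logical data; only a rough estimate."""
    return used_gb / data_gb


def pg_copies_per_osd(pg_count, replicas, osd_count):
    """Average number of PG copies an OSD holds if placement were even."""
    return pg_count * replicas / osd_count


ratio = estimated_replication(DATA_GB, USED_GB)
replicas = round(ratio)
print(f"used/data = {ratio:.2f}, assumed replicas = {replicas}")
print(f"average PG copies per OSD = {pg_copies_per_osd(PG_COUNT, replicas, OSD_COUNT):.0f}")
```

If that reading is right, the data is stored twice and each OSD would carry about 256 PG copies on average.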
My crushmap is also pretty straightforward:
chef@ceph-node03:~$ ./get_crushmap.sh
got crush map from osdmap epoch 321
# begin crush map
# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 pool
# buckets
host ceph-node01 {
    id -2 # do not change unnecessarily
    # weight 1.000
    alg straw
    hash 0 # rjenkins1
    item osd.0 weight 1.000
}
host ceph-node02 {
    id -4 # do not change unnecessarily
    # weight 1.000
    alg straw
    hash 0 # rjenkins1
    item osd.1 weight 1.000
}
host ceph-node03 {
    id -5 # do not change unnecessarily
    # weight 1.000
    alg straw
    hash 0 # rjenkins1
    item osd.2 weight 1.000
}
rack unknownrack {
    id -3 # do not change unnecessarily
    # weight 3.000
    alg straw
    hash 0 # rjenkins1
    item ceph-node01 weight 1.000
    item ceph-node02 weight 1.000
    item ceph-node03 weight 1.000
}
pool default {
    id -1 # do not change unnecessarily
    # weight 3.000
    alg straw
    hash 0 # rjenkins1
    item unknownrack weight 3.000
}
# rules
rule data {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
rule metadata {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
rule rbd {
    ruleset 2
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
# end crush map
Actually, I have a theory about this strange data distribution. The whole
setup is running in a virtualized environment. Each ceph-node is
running on its own physical server, but the overall load on each
server is quite different. The node with the 95% used OSD is running
on the least loaded system. Could it be that additional I/O waits on
the other systems are causing ceph to write data to the least loaded
OSD node?
--
...WBR, Roman Hlynovskiy
* Re: recovering from 95% full osd
From: Roman Hlynovskiy @ 2013-01-09 5:41 UTC
To: Gregory Farnum; +Cc: ceph-devel@vger.kernel.org
Thanks a lot Greg,
that was the black magic command I was looking for :)
I deleted some obsolete data and reached these figures:
chef@cephgw:~$ ./clu.sh exec "df -kh"|grep osd
/dev/mapper/vg00-osd 252G 153G 100G 61% /var/lib/ceph/osd/ceph-0
/dev/mapper/vg00-osd 252G 180G 73G 72% /var/lib/ceph/osd/ceph-1
/dev/mapper/vg00-osd 252G 213G 40G 85% /var/lib/ceph/osd/ceph-2
Compared with the previous figures:
/dev/mapper/vg00-osd 252G 173G 80G 69% /var/lib/ceph/osd/ceph-0
/dev/mapper/vg00-osd 252G 203G 50G 81% /var/lib/ceph/osd/ceph-1
/dev/mapper/vg00-osd 252G 240G 13G 96% /var/lib/ceph/osd/ceph-2
this shows that 20 GB were removed from osd-1, 23 GB from osd-2 and 27 GB from osd-3.
So the freed space is also somewhat disproportionate.
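A quick way to check these numbers is to compare the freed space with what each OSD held before the cleanup. The Python sketch below does that with the two sets of df figures. The dictionary keys follow the ceph-0/1/2 mount points and the function names are illustrative.

```python
# GB used per OSD, from the df output before and after the cleanup.
before = {"osd.0": 173, "osd.1": 203, "osd.2": 240}
after = {"osd.0": 153, "osd.1": 180, "osd.2": 213}


def freed(before, after):
    """GB freed on each OSD."""
    return {name: before[name] - after[name] for name in before}


def freed_fraction(before, after):
    """Freed space as a fraction of each OSD's previous usage."""
    return {name: (before[name] - after[name]) / before[name] for name in before}


fractions = freed_fraction(before, after)
for name, gb in freed(before, after).items():
    print(f"{name}: freed {gb} GB ({fractions[name]:.1%} of previous usage)")
```

Seen this way, each OSD gave back between 11% and 12% of what it held, so the cleanup looks roughly proportional to the existing usage rather than adding a new imbalance.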
at the same time:
chef@cephgw:~$ ceph osd tree
# id    weight  type name                       up/down reweight
-1      3       pool default
-3      3           rack unknownrack
-2      1               host ceph-node01
0       1                   osd.0               up      1
-4      1               host ceph-node02
1       1                   osd.1               up      1
-5      1               host ceph-node03
2       1                   osd.2               up      1
All OSD weights are the same. I guess there is no automatic way to
balance storage usage in my case, and I have to play with the OSD weights
using 'ceph osd reweight-by-utilization xx' until storage is used more
or less equally, and then set the weights back to 1?
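If the weights end up being adjusted by hand, one simple heuristic is to scale down only the OSDs that are above the average utilization, by the ratio of average to actual utilization. The Python sketch below applies that to the df figures after the cleanup. The heuristic and the function name are assumptions for illustration, not a description of what reweight-by-utilization does internally.

```python
# GB used per OSD after the cleanup, from the df output.
used = {"osd.0": 153, "osd.1": 180, "osd.2": 213}
SIZE_GB = 252


def suggested_reweights(used, size_gb=SIZE_GB):
    """Scale down OSDs above the average utilization; leave the rest at 1.0."""
    utils = {name: gb / size_gb for name, gb in used.items()}
    average = sum(utils.values()) / len(utils)
    return {name: round(min(1.0, average / u), 2) for name, u in utils.items()}


print(suggested_reweights(used))
```

For these figures only osd.2 would be lowered, to about 0.85, and the other two would stay at 1. Any such value would still need to be checked against how the data actually moves afterwards.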
--
...WBR, Roman Hlynovskiy
* Re: recovering from 95% full osd
From: Sage Weil @ 2013-01-09 6:52 UTC
To: Roman Hlynovskiy; +Cc: Gregory Farnum, ceph-devel@vger.kernel.org
On Wed, 9 Jan 2013, Roman Hlynovskiy wrote:
> all osd weights are the same. I guess there is no automatic way to
> balance storage usage for my case and I have to play with osd weights
> using 'ceph osd reweight-by-utilization xx' until storage is used more
> or less equally and when get the weights back to 1?
How many pgs do you have? ('ceph osd dump | grep ^pool').
You might also adjust the crush tunables, see
http://ceph.com/docs/master/rados/operations/crush-map/?highlight=tunable#tunables
sage
>
>
>
> 2013/1/8 Gregory Farnum <greg@inktank.com>:
> > On Tue, Jan 8, 2013 at 2:42 AM, Roman Hlynovskiy
> > <roman.hlynovskiy@gmail.com> wrote:
> >> Hello,
> >>
> >> I am running ceph v0.56 and at the moment trying to recover ceph which
> >> got completely stuck after 1 osd got filled by 95%. Looks like the
> >> distribution algorithm is not perfect since all 3 OSD's I user are
> >> 256Gb each, however one of them got filled faster than others:
> >>
> >> osd-1:
> >> Filesystem Size Used Avail Use% Mounted on
> >> /dev/mapper/vg00-osd 252G 173G 80G 69% /var/lib/ceph/osd/ceph-0
> >>
> >> osd-2:
> >> Filesystem Size Used Avail Use% Mounted on
> >> /dev/mapper/vg00-osd 252G 203G 50G 81% /var/lib/ceph/osd/ceph-1
> >>
> >> osd-3:
> >> Filesystem Size Used Avail Use% Mounted on
> >> /dev/mapper/vg00-osd 252G 240G 13G 96% /var/lib/ceph/osd/ceph-2
> >>
> >>
> >> by the moment mds is showing the following behaviour:
> >> 2013-01-08 16:25:47.006354 b4a73b70 0 mds.0.objecter FULL, paused
> >> modify 0x9ba63c0 tid 23448
> >> 2013-01-08 16:26:47.005211 b4a73b70 0 mds.0.objecter FULL, paused
> >> modify 0xca86c30 tid 23449
> >>
> >> so, it does not respond to any mount requests
> >>
> >> I've played around with all types of commands like:
> >> ceph mon tell \* injectargs '--mon-osd-full-ratio 98'
> >> ceph mon tell \* injectargs '--mon-osd-full-ratio 0.98'
> >>
> >> and
> >>
> >> 'mon osd full ratio = 0.98' in mon configuration for each mon
> >>
> >> however
> >>
> >> chef@ceph-node03:/var/log/ceph$ ceph health detail
> >> HEALTH_ERR 1 full osd(s)
> >> osd.2 is full at 95%
> >>
> >> mds still believes 95% is the threshold, so no responses to mount requests.
> >>
> >> chef@ceph-node03:/var/log/ceph$ rados -p data bench 10 write
> >> Maintaining 16 concurrent writes of 4194304 bytes for at least 10 seconds.
> >> Object prefix: benchmark_data_ceph-node03_3903
> >> 2013-01-08 16:33:02.363206 b6be3710 0 client.9958.objecter FULL,
> >> paused modify 0xa467ff0 tid 1
> >> 2013-01-08 16:33:02.363618 b6be3710 0 client.9958.objecter FULL,
> >> paused modify 0xa468780 tid 2
> >> 2013-01-08 16:33:02.363741 b6be3710 0 client.9958.objecter FULL,
> >> paused modify 0xa468f88 tid 3
> >> 2013-01-08 16:33:02.364056 b6be3710 0 client.9958.objecter FULL,
> >> paused modify 0xa469348 tid 4
> >> 2013-01-08 16:33:02.364171 b6be3710 0 client.9958.objecter FULL,
> >> paused modify 0xa469708 tid 5
> >> 2013-01-08 16:33:02.365024 b6be3710 0 client.9958.objecter FULL,
> >> paused modify 0xa469ac8 tid 6
> >> 2013-01-08 16:33:02.365187 b6be3710 0 client.9958.objecter FULL,
> >> paused modify 0xa46a2d0 tid 7
> >> 2013-01-08 16:33:02.365296 b6be3710 0 client.9958.objecter FULL,
> >> paused modify 0xa46a690 tid 8
> >> 2013-01-08 16:33:02.365402 b6be3710 0 client.9958.objecter FULL,
> >> paused modify 0xa46aa50 tid 9
> >> 2013-01-08 16:33:02.365508 b6be3710 0 client.9958.objecter FULL,
> >> paused modify 0xa46ae10 tid 10
> >> 2013-01-08 16:33:02.365635 b6be3710 0 client.9958.objecter FULL,
> >> paused modify 0xa46b1d0 tid 11
> >> 2013-01-08 16:33:02.365742 b6be3710 0 client.9958.objecter FULL,
> >> paused modify 0xa46b590 tid 12
> >> 2013-01-08 16:33:02.365868 b6be3710 0 client.9958.objecter FULL,
> >> paused modify 0xa46b950 tid 13
> >> 2013-01-08 16:33:02.365975 b6be3710 0 client.9958.objecter FULL,
> >> paused modify 0xa46bd10 tid 14
> >> 2013-01-08 16:33:02.366096 b6be3710 0 client.9958.objecter FULL,
> >> paused modify 0xa46c0d0 tid 15
> >> 2013-01-08 16:33:02.366203 b6be3710 0 client.9958.objecter FULL,
> >> paused modify 0xa46c490 tid 16
> >> sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
> >> 0 16 16 0 0 0 - 0
> >> 1 16 16 0 0 0 - 0
> >> 2 16 16 0 0 0 - 0
> >>
> >> rados doesn't work.
> >>
> >> chef@ceph-node03:/var/log/ceph$ ceph osd reweight-by-utilization
> >> no change: average_util: 0.812678, overload_util: 0.975214. overloaded
> >> osds: (none)
> >>
> >> this one also.
> >>
> >>
> >> is there any chance to recover ceph?
> >
> > "ceph pg set_full_ratio 0.98"
> >
> > However, as Mark mentioned, you want to figure out why one OSD is so
> > much fuller than the others first. Even in a small cluster I don't
> > think you should be able to see that kind of variance. Simply setting
> > the full ratio to 98% and then continuing to run could cause bigger
> > problems if that OSD continues to get a disproportionate share of the
> > writes and fills up its disk.
> > -Greg
>
>
>
> --
> ...WBR, Roman Hlynovskiy
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
* Re: recoverying from 95% full osd
2013-01-09 6:52 ` Sage Weil
@ 2013-01-09 7:19 ` Gregory Farnum
2013-01-09 9:47 ` Roman Hlynovskiy
0 siblings, 1 reply; 11+ messages in thread
From: Gregory Farnum @ 2013-01-09 7:19 UTC (permalink / raw)
To: Sage Weil, Roman Hlynovskiy; +Cc: ceph-devel@vger.kernel.org
On Tuesday, January 8, 2013 at 10:52 PM, Sage Weil wrote:
> On Wed, 9 Jan 2013, Roman Hlynovskiy wrote:
> > Thanks a lot Greg,
> >
> > that was the black magic command I was looking for )
> >
> > I deleted some obsolete data and reached those figures:
> >
> > chef@cephgw:~$ ./clu.sh exec "df -kh"|grep osd
> > /dev/mapper/vg00-osd 252G 153G 100G 61% /var/lib/ceph/osd/ceph-0
> > /dev/mapper/vg00-osd 252G 180G 73G 72% /var/lib/ceph/osd/ceph-1
> > /dev/mapper/vg00-osd 252G 213G 40G 85% /var/lib/ceph/osd/ceph-2
> >
> > which in comparison to previous one:
> >
> > /dev/mapper/vg00-osd 252G 173G 80G 69% /var/lib/ceph/osd/ceph-0
> > /dev/mapper/vg00-osd 252G 203G 50G 81% /var/lib/ceph/osd/ceph-1
> > /dev/mapper/vg00-osd 252G 240G 13G 96% /var/lib/ceph/osd/ceph-2
> >
> > shows that 20 GB were removed from osd-1, 23 GB from osd-2 and 27 GB from osd-3.
> > So the freed space is also disproportionate.
> >
> > at the same time:
> > chef@cephgw:~$ ceph osd tree
> >
> > # id weight type name up/down reweight
> > -1 3 pool default
> > -3 3 rack unknownrack
> > -2 1 host ceph-node01
> > 0 1 osd.0 up 1
> > -4 1 host ceph-node02
> > 1 1 osd.1 up 1
> > -5 1 host ceph-node03
> > 2 1 osd.2 up 1
> >
> >
> > All OSD weights are the same. I guess there is no automatic way to
> > balance storage usage in my case, and I have to play with OSD weights
> > using 'ceph osd reweight-by-utilization xx' until storage is used more
> > or less equally, and then set the weights back to 1?
>
>
>
> How many pgs do you have? ('ceph osd dump | grep ^pool').
I believe this is it. 384 PGs, but three pools of which only one (or maybe a second one, sort of) is in use. Automatically setting the right PG counts is coming some day, but until then being able to set up pools of the right size is a big gotcha. :(
Depending on how mutable the data is, recreate with larger PG counts on the pools in use. Otherwise we can do something more detailed.
-Greg
>
> You might also adjust the crush tunables, see
>
> http://ceph.com/docs/master/rados/operations/crush-map/?highlight=tunable#tunables
>
> sage
>
* Re: recoverying from 95% full osd
2013-01-09 7:19 ` Gregory Farnum
@ 2013-01-09 9:47 ` Roman Hlynovskiy
2013-01-10 5:32 ` Roman Hlynovskiy
2013-01-25 4:11 ` Dan Mick
0 siblings, 2 replies; 11+ messages in thread
From: Roman Hlynovskiy @ 2013-01-09 9:47 UTC (permalink / raw)
To: Gregory Farnum; +Cc: Sage Weil, ceph-devel@vger.kernel.org
>> How many pgs do you have? ('ceph osd dump | grep ^pool').
>
> I believe this is it. 384 PGs, but three pools of which only one (or maybe a second one, sort of) is in use. Automatically setting the right PG counts is coming some day, but until then being able to set up pools of the right size is a big gotcha. :(
> Depending on how mutable the data is, recreate with larger PG counts on the pools in use. Otherwise we can do something more detailed.
> -Greg
Hm... what would be the recommended PG count per pool?
chef@cephgw:~$ ceph osd lspools
0 data,1 metadata,2 rbd,
chef@cephgw:~$ ceph osd pool get data pg_num
PG_NUM: 128
chef@cephgw:~$ ceph osd pool get metadata pg_num
PG_NUM: 128
chef@cephgw:~$ ceph osd pool get rbd pg_num
PG_NUM: 128
According to http://ceph.com/docs/master/rados/operations/placement-groups/

             (OSDs * 100)
Total PGs = --------------
               Replicas

I have 3 OSDs and 2 replicas for each object, which gives a recommended
total of (3 * 100) / 2 = 150 PGs.
Will it make much difference to set 150 instead of 128, or should I
base it on different values?
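To double-check the arithmetic, here is a minimal Python sketch of the formula from the docs page. The function names are hypothetical, and rounding up to a power of two is an assumption (a commonly suggested convention), not something stated in this thread.

```python
def recommended_total_pgs(num_osds: int, replicas: int) -> int:
    """Total PGs = (OSDs * 100) / Replicas, as quoted from the docs page."""
    return (num_osds * 100) // replicas


def next_power_of_two(n: int) -> int:
    """Smallest power of two >= n (assumed rounding convention)."""
    power = 1
    while power < n:
        power *= 2
    return power


total = recommended_total_pgs(num_osds=3, replicas=2)
print(total)                     # 150
print(next_power_of_two(total))  # 256
```

For comparison, the cluster currently has 128 PGs in each of its three pools, 384 in total.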
btw, just one more off-topic question:
chef@ceph-node03:~$ ceph pg dump| egrep -v '^(0\.|1\.|2\.)'| column -t
dumped all in format plain
version 113906
last_osdmap_epoch 323
last_pg_scan 1
full_ratio 0.95
nearfull_ratio 0.85
pg_stat  objects  mip  degr  unf  bytes  log  disklog  state  state_stamp  v  reported  up  acting  last_scrub  scrub_stamp  last_deep_scrub  deep_scrub_stamp
pool 0   74748    0    0     0    286157692336  17668034  17668034
pool 1   618      0    0     0    131846062     6414518   6414518
pool 2   0        0    0     0    0             0         0
sum      75366    0    0     0    286289538398  24082552  24082552
osdstat  kbused     kbavail    kb         hb in  hb out
0        157999220  106227596  264226816  [1,2]  []
1        185604948  78621868   264226816  [0,2]  []
2        219475396  44751420   264226816  [0,1]  []
sum      563079564  229600884  792680448
pool 0 (data) is used for data storage
pool 1 (metadata) is used for metadata storage
what is pool 2 (rbd) for? looks like it's absolutely empty.
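
The osdstat rows in the dump can be turned into per-OSD utilization and compared against the full_ratio and nearfull_ratio values from the same dump. This is only a rough sketch: the helper names are hypothetical, and the ">=" comparison is a simplifying assumption about how the thresholds are applied.

```python
# kbused and kb per OSD, copied from the osdstat rows of the pg dump.
OSD_STATS = {
    0: (157999220, 264226816),
    1: (185604948, 264226816),
    2: (219475396, 264226816),
}
NEARFULL_RATIO = 0.85  # nearfull_ratio from the dump header
FULL_RATIO = 0.95      # full_ratio from the dump header


def utilization(kb_used: int, kb_total: int) -> float:
    """Fraction of the OSD's capacity that is in use."""
    return kb_used / kb_total


def classify(ratio: float) -> str:
    """Label a utilization ratio (assumed '>=' semantics)."""
    if ratio >= FULL_RATIO:
        return "full"
    if ratio >= NEARFULL_RATIO:
        return "nearfull"
    return "ok"


for osd_id, (used, total) in sorted(OSD_STATS.items()):
    ratio = utilization(used, total)
    print(f"osd.{osd_id}: {ratio:.1%} {classify(ratio)}")
```

With these numbers all three OSDs are below the nearfull threshold, but osd.2 (about 83%) is still far ahead of osd.0 (about 60%).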
>
>>
>> You might also adjust the crush tunables, see
>>
>> http://ceph.com/docs/master/rados/operations/crush-map/?highlight=tunable#tunables
>>
>> sage
>>
Thanks for the link, Sage. I set the tunable values according to the doc.
Btw, the online document is missing the magical param for crushmap which
allows those scary_tunables )
--
...WBR, Roman Hlynovskiy
* Re: recoverying from 95% full osd
2013-01-09 9:47 ` Roman Hlynovskiy
@ 2013-01-10 5:32 ` Roman Hlynovskiy
2013-01-10 16:50 ` Roman Hlynovskiy
2013-01-25 4:11 ` Dan Mick
1 sibling, 1 reply; 11+ messages in thread
From: Roman Hlynovskiy @ 2013-01-10 5:32 UTC (permalink / raw)
To: Gregory Farnum; +Cc: Sage Weil, ceph-devel@vger.kernel.org
Hello again!
I left the system in a working state overnight and found it in a weird
state this morning:
chef@ceph-node02:/var/log/ceph$ ceph -s
health HEALTH_OK
monmap e4: 3 mons at
{a=192.168.7.11:6789/0,b=192.168.7.12:6789/0,c=192.168.7.13:6789/0},
election epoch 254, quorum 0,1,2 a,b,c
osdmap e348: 3 osds: 3 up, 3 in
pgmap v114606: 384 pgs: 384 active+clean; 161 GB data, 326 GB
used, 429 GB / 755 GB avail
mdsmap e4623: 1/1/1 up {0=b=up:active}, 1 up:standby
So at first glance it looks OK; however, I am not able
to mount ceph from any of the nodes:
be01:~# mount /var/www/jroger.org/data
mount: 192.168.7.11:/: can't read superblock
On the nodes which had ceph mounted yesterday I am able to browse
the filesystem, but any kind of data read causes the client to
hang.
I made a trace on the active mds with debug ms/mds = 20
(http://wh.of.kz/ceph_logs.tar.gz)
Could you please help to identify what's going on?
--
...WBR, Roman Hlynovskiy
* Re: recoverying from 95% full osd
2013-01-10 5:32 ` Roman Hlynovskiy
@ 2013-01-10 16:50 ` Roman Hlynovskiy
0 siblings, 0 replies; 11+ messages in thread
From: Roman Hlynovskiy @ 2013-01-10 16:50 UTC (permalink / raw)
To: Gregory Farnum; +Cc: Sage Weil, ceph-devel@vger.kernel.org
Please disregard my last email. I followed the recommendation for
tunables, but missed the note that the kernel version should be 3.5 or
later in order to support them. I reverted to the
legacy values and everything is back online.
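
For anyone else hitting this, a quick check of the kernel version before changing the tunables might look like the Python sketch below. The helper names are hypothetical, the 3.5 minimum is the one from the note mentioned above, and I am assuming the check is run on the machine that mounts the filesystem.

```python
import platform
import re

MIN_KERNEL = (3, 5)  # minimum version mentioned in this thread


def parse_kernel_release(release: str) -> tuple:
    """Extract (major, minor) from a string such as '3.2.0-4-amd64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_supports_tunables(release: str) -> bool:
    """True if the release is at least MIN_KERNEL."""
    return parse_kernel_release(release) >= MIN_KERNEL


if __name__ == "__main__":
    release = platform.release()
    print(release, kernel_supports_tunables(release))
```

Comparing (major, minor) tuples avoids the string-comparison trap where "3.10" would sort before "3.5".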
--
...WBR, Roman Hlynovskiy
* Re: recoverying from 95% full osd
2013-01-09 9:47 ` Roman Hlynovskiy
2013-01-10 5:32 ` Roman Hlynovskiy
@ 2013-01-25 4:11 ` Dan Mick
1 sibling, 0 replies; 11+ messages in thread
From: Dan Mick @ 2013-01-25 4:11 UTC (permalink / raw)
To: Roman Hlynovskiy; +Cc: Gregory Farnum, Sage Weil, ceph-devel@vger.kernel.org
> what is pool 2 (rbd) for? looks like it's absolutely empty.
By default it's for rbd images (see the rbd command etc.). Whether it is
empty or not has no effect on the other pools.
Thread overview: 11+ messages
2013-01-08 10:42 recoverying from 95% full osd Roman Hlynovskiy
2013-01-08 16:16 ` Mark Nelson
2013-01-09 4:19 ` Roman Hlynovskiy
2013-01-08 17:20 ` Gregory Farnum
2013-01-09 5:41 ` Roman Hlynovskiy
2013-01-09 6:52 ` Sage Weil
2013-01-09 7:19 ` Gregory Farnum
2013-01-09 9:47 ` Roman Hlynovskiy
2013-01-10 5:32 ` Roman Hlynovskiy
2013-01-10 16:50 ` Roman Hlynovskiy
2013-01-25 4:11 ` Dan Mick