* recovering from 95% full osd
From: Roman Hlynovskiy @ 2013-01-08 10:42 UTC (permalink / raw)
  To: ceph-devel

Hello,

I am running ceph v0.56 and am currently trying to recover a cluster that
got completely stuck after one OSD filled up to 95%. It looks like the
distribution algorithm is not perfect: all 3 OSDs I use are 256 GB each,
yet one of them filled up faster than the others:

osd-1:
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg00-osd  252G  173G   80G  69% /var/lib/ceph/osd/ceph-0

osd-2:
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg00-osd  252G  203G   50G  81% /var/lib/ceph/osd/ceph-1

osd-3:
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg00-osd  252G  240G   13G  96% /var/lib/ceph/osd/ceph-2


At the moment the mds is showing the following behaviour:
2013-01-08 16:25:47.006354 b4a73b70  0 mds.0.objecter  FULL, paused modify 0x9ba63c0 tid 23448
2013-01-08 16:26:47.005211 b4a73b70  0 mds.0.objecter  FULL, paused modify 0xca86c30 tid 23449

so it does not respond to any mount requests.

I've played around with all types of commands like:
ceph mon tell \* injectargs '--mon-osd-full-ratio 98'
ceph mon tell \* injectargs '--mon-osd-full-ratio 0.98'

and

'mon osd full ratio = 0.98' in mon configuration for each mon

however

chef@ceph-node03:/var/log/ceph$ ceph health detail
HEALTH_ERR 1 full osd(s)
osd.2 is full at 95%

The mds still believes 95% is the threshold, so it does not respond to mount requests.

chef@ceph-node03:/var/log/ceph$ rados -p data bench 10 write
 Maintaining 16 concurrent writes of 4194304 bytes for at least 10 seconds.
 Object prefix: benchmark_data_ceph-node03_3903
2013-01-08 16:33:02.363206 b6be3710  0 client.9958.objecter  FULL, paused modify 0xa467ff0 tid 1
2013-01-08 16:33:02.363618 b6be3710  0 client.9958.objecter  FULL, paused modify 0xa468780 tid 2
2013-01-08 16:33:02.363741 b6be3710  0 client.9958.objecter  FULL, paused modify 0xa468f88 tid 3
2013-01-08 16:33:02.364056 b6be3710  0 client.9958.objecter  FULL, paused modify 0xa469348 tid 4
2013-01-08 16:33:02.364171 b6be3710  0 client.9958.objecter  FULL, paused modify 0xa469708 tid 5
2013-01-08 16:33:02.365024 b6be3710  0 client.9958.objecter  FULL, paused modify 0xa469ac8 tid 6
2013-01-08 16:33:02.365187 b6be3710  0 client.9958.objecter  FULL, paused modify 0xa46a2d0 tid 7
2013-01-08 16:33:02.365296 b6be3710  0 client.9958.objecter  FULL, paused modify 0xa46a690 tid 8
2013-01-08 16:33:02.365402 b6be3710  0 client.9958.objecter  FULL, paused modify 0xa46aa50 tid 9
2013-01-08 16:33:02.365508 b6be3710  0 client.9958.objecter  FULL, paused modify 0xa46ae10 tid 10
2013-01-08 16:33:02.365635 b6be3710  0 client.9958.objecter  FULL, paused modify 0xa46b1d0 tid 11
2013-01-08 16:33:02.365742 b6be3710  0 client.9958.objecter  FULL, paused modify 0xa46b590 tid 12
2013-01-08 16:33:02.365868 b6be3710  0 client.9958.objecter  FULL, paused modify 0xa46b950 tid 13
2013-01-08 16:33:02.365975 b6be3710  0 client.9958.objecter  FULL, paused modify 0xa46bd10 tid 14
2013-01-08 16:33:02.366096 b6be3710  0 client.9958.objecter  FULL, paused modify 0xa46c0d0 tid 15
2013-01-08 16:33:02.366203 b6be3710  0 client.9958.objecter  FULL, paused modify 0xa46c490 tid 16
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0      16        16         0         0         0         -         0
     1      16        16         0         0         0         -         0
     2      16        16         0         0         0         -         0

rados doesn't work.

chef@ceph-node03:/var/log/ceph$ ceph osd reweight-by-utilization
no change: average_util: 0.812678, overload_util: 0.975214. overloaded osds: (none)

This one does not help either.
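
The "no change" result is consistent with the numbers in the output itself: overload_util is 1.2 times average_util, which would match an overload threshold of 120%. The sketch below redoes that check against the df figures above. The default of 120, the cutoff rule, and the helper name `overloaded` are assumptions made for illustration, not something stated in this thread.

```python
# Sketch: why reweight-by-utilization reported "overloaded osds: (none)".
# Assumption: an OSD is flagged only when its utilization exceeds
# average_util * (threshold_pct / 100), with a default threshold of 120.

AVERAGE_UTIL = 0.812678   # taken from the command output above
THRESHOLD_PCT = 120       # assumed default

# (used GB, size GB) from the df output; osd.N is mounted at .../ceph-N.
osd_usage = {"osd.0": (173, 252), "osd.1": (203, 252), "osd.2": (240, 252)}


def overloaded(usage, average_util, threshold_pct):
    """Return the cutoff and the names of OSDs whose utilization exceeds it."""
    cutoff = average_util * threshold_pct / 100.0
    names = [name for name, (used, size) in sorted(usage.items())
             if used / size > cutoff]
    return cutoff, names


cutoff, names = overloaded(osd_usage, AVERAGE_UTIL, THRESHOLD_PCT)
print(f"cutoff = {cutoff:.6f}")
print("overloaded:", names or "(none)")
```

Under these assumptions even osd.2 at 240G of 252G stays below the cutoff, so nothing is reweighted unless a lower threshold is passed.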


is there any chance to recover ceph?


--
...WBR, Roman Hlynovskiy


* Re: recovering from 95% full osd
From: Mark Nelson @ 2013-01-08 16:16 UTC (permalink / raw)
  To: Roman Hlynovskiy; +Cc: ceph-devel

On 01/08/2013 04:42 AM, Roman Hlynovskiy wrote:
> is there any chance to recover ceph?

Hi,

There may be other ways to fix it, but one method might be to simply add
another OSD so the data gets redistributed.  I wouldn't keep raising the
osd full ratio.  I think Sam has said in the past that it can turn a
minor problem into a very big problem if you fill an OSD all the way.
Another option that may (or may not) work as a temporary solution is to
change the osd weights.

Having said that, I'm curious to know how many PGs you have?  Do you 
have a custom crush map?  That distribution is pretty skewed!

Thanks,
Mark


* Re: recovering from 95% full osd
From: Gregory Farnum @ 2013-01-08 17:20 UTC (permalink / raw)
  To: Roman Hlynovskiy; +Cc: ceph-devel@vger.kernel.org

On Tue, Jan 8, 2013 at 2:42 AM, Roman Hlynovskiy
<roman.hlynovskiy@gmail.com> wrote:
> is there any chance to recover ceph?

"ceph pg set_full_ratio 0.98"

However, as Mark mentioned, you want to figure out why one OSD is so
much fuller than the others first. Even in a small cluster I don't
think you should be able to see that kind of variance. Simply setting
the full ratio to 98% and then continuing to run could cause bigger
problems if that OSD continues to get a disproportionate share of the
writes and fills up its disk.
-Greg
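
To put the warning above in numbers, here is a rough sketch of how much room osd.2 has before it reaches each full ratio, using the df figures from the original message. It assumes the ratio is applied to the filesystem size that df reports, which is a simplification, and `headroom_gb` is a hypothetical helper.

```python
# Sketch: headroom on the fullest OSD under the old and new full ratios.
# Assumption: an OSD counts as full when used / size >= full_ratio, with
# size and used taken from df (252G and 240G for osd.2).


def headroom_gb(size_gb, used_gb, full_ratio):
    """GB that can still be written before used/size reaches full_ratio."""
    return size_gb * full_ratio - used_gb


for ratio in (0.95, 0.98):
    room = headroom_gb(252, 240, ratio)
    print(f"full_ratio {ratio:.2f}: {room:+.2f} GB of headroom on osd.2")
```

At 0.98 the OSD gains only about 7 GB of slack, which a skewed write pattern could use up quickly.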


* Re: recovering from 95% full osd
From: Roman Hlynovskiy @ 2013-01-09  4:19 UTC (permalink / raw)
  To: Mark Nelson; +Cc: ceph-devel

Hello Mark,

ok, adding another osd is a good option, however my initial plan was
to raise the full ratio watermark and remove unnecessary data. It's clear
to me that overfilling one of the osds will cause big problems for fs
consistency.
But... the 2 other OSDs still have plenty of space. From ceph's point of
view, what is the difference between adding a fresh OSD with plenty of
space and using the current OSDs that still have plenty of free space?


surprisingly my setup is rather standard ) all according to the online manuals

chef@ceph-node01:~$ ceph -s
   health HEALTH_ERR 1 full osd(s)
   monmap e4: 3 mons at {a=192.168.7.11:6789/0,b=192.168.7.12:6789/0,c=192.168.7.13:6789/0}, election epoch 242, quorum 0,1,2 a,b,c
   osdmap e321: 3 osds: 3 up, 3 in full
    pgmap v113335: 384 pgs: 384 active+clean; 305 GB data, 614 GB used, 141 GB / 755 GB avail
   mdsmap e4599: 1/1/1 up {0=a=up:active}, 2 up:standby

If I read this correctly, I have 384 PGs.


my crushmap is also pretty straightforward:


chef@ceph-node03:~$ ./get_crushmap.sh
got crush map from osdmap epoch 321
# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 pool

# buckets
host ceph-node01 {
    id -2        # do not change unnecessarily
    # weight 1.000
    alg straw
    hash 0    # rjenkins1
    item osd.0 weight 1.000
}
host ceph-node02 {
    id -4        # do not change unnecessarily
    # weight 1.000
    alg straw
    hash 0    # rjenkins1
    item osd.1 weight 1.000
}
host ceph-node03 {
    id -5        # do not change unnecessarily
    # weight 1.000
    alg straw
    hash 0    # rjenkins1
    item osd.2 weight 1.000
}
rack unknownrack {
    id -3        # do not change unnecessarily
    # weight 3.000
    alg straw
    hash 0    # rjenkins1
    item ceph-node01 weight 1.000
    item ceph-node02 weight 1.000
    item ceph-node03 weight 1.000
}
pool default {
    id -1        # do not change unnecessarily
    # weight 3.000
    alg straw
    hash 0    # rjenkins1
    item unknownrack weight 3.000
}

# rules
rule data {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
rule metadata {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
rule rbd {
    ruleset 2
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map
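
With this map every host carries weight 1.000 out of 3.000, so each OSD would be expected to hold about a third of the data. A quick sketch comparing that expectation with the df figures from the first message follows; the dictionary names are only for illustration.

```python
# Sketch: expected share of data per OSD from the CRUSH weights vs. the
# share actually observed. Weights come from the crush map above; used GB
# come from the df output in the first message (osd.N is mounted at ceph-N).

weights = {"osd.0": 1.0, "osd.1": 1.0, "osd.2": 1.0}
used_gb = {"osd.0": 173, "osd.1": 203, "osd.2": 240}

total_weight = sum(weights.values())
total_used = sum(used_gb.values())

for osd in sorted(weights):
    expected = weights[osd] / total_weight   # share implied by the weights
    actual = used_gb[osd] / total_used       # share seen on disk
    print(f"{osd}: expected {expected:.1%}, actual {actual:.1%}")
```

The observed shares come out at roughly 28%, 33% and 39%, against an expected 33% each.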


Actually, I have a theory for this strange data distribution. The whole
setup is running in a virtualized environment. Each ceph-node is
running on its own physical server, however the overall load on each
server is quite different. The node with the 95% used OSD is running
on the least loaded system. Could it be that additional i/o waits on
the other systems are causing ceph to write data to the least loaded
osd node?






-- 
...WBR, Roman Hlynovskiy


* Re: recovering from 95% full osd
From: Roman Hlynovskiy @ 2013-01-09  5:41 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel@vger.kernel.org

Thanks a lot Greg,

that was the black magic command I was looking for )

I deleted some obsolete data and reached these figures:

chef@cephgw:~$ ./clu.sh exec "df -kh"|grep osd
/dev/mapper/vg00-osd  252G  153G  100G  61% /var/lib/ceph/osd/ceph-0
/dev/mapper/vg00-osd  252G  180G   73G  72% /var/lib/ceph/osd/ceph-1
/dev/mapper/vg00-osd  252G  213G   40G  85% /var/lib/ceph/osd/ceph-2

which, compared to the previous figures:

/dev/mapper/vg00-osd  252G  173G   80G  69% /var/lib/ceph/osd/ceph-0
/dev/mapper/vg00-osd  252G  203G   50G  81% /var/lib/ceph/osd/ceph-1
/dev/mapper/vg00-osd  252G  240G   13G  96% /var/lib/ceph/osd/ceph-2

show that 20 GB were removed from osd-1, 23 GB from osd-2 and 27 GB from osd-3.
So the cleaned-up space is also somewhat disproportionate.
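
Measured against what each OSD held before the cleanup, the freed space looks close to proportional. A small sketch of that arithmetic, using only the two sets of df figures above (`freed_fraction` is a hypothetical helper):

```python
# Sketch: freed space per OSD as a fraction of its previous usage.
# Keys are the mount names from df; values are used GB before and after.

before = {"ceph-0": 173, "ceph-1": 203, "ceph-2": 240}
after = {"ceph-0": 153, "ceph-1": 180, "ceph-2": 213}


def freed_fraction(before_gb, after_gb):
    """Return {name: (freed GB, freed GB / previous used GB)}."""
    return {name: (before_gb[name] - after_gb[name],
                   (before_gb[name] - after_gb[name]) / before_gb[name])
            for name in before_gb}


for name, (freed, fraction) in sorted(freed_fraction(before, after).items()):
    print(f"{name}: freed {freed} GB, {fraction:.1%} of previous usage")
```

Each OSD gave back between 11% and 12% of what it held, so the deletion follows the same skew as the data rather than adding to it.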

at the same time:
chef@cephgw:~$ ceph osd tree

# id    weight    type name    up/down    reweight
-1    3    pool default
-3    3        rack unknownrack
-2    1            host ceph-node01
0    1                osd.0    up    1
-4    1            host ceph-node02
1    1                osd.1    up    1
-5    1            host ceph-node03
2    1                osd.2    up    1


all osd weights are the same. I guess there is no automatic way to
balance storage usage in my case, and I have to play with osd weights
using 'ceph osd reweight-by-utilization xx' until storage is used more
or less equally, and then set the weights back to 1?
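
As a rough illustration of what a lower threshold would do with the current figures, the sketch below assumes that reweight-by-utilization flags OSDs above average_util * threshold / 100 and scales a flagged OSD's weight by average_util / util. Both rules are assumptions about the command's behaviour rather than something confirmed in this thread, the threshold of 110 is only an example value, and `suggest_reweights` is a hypothetical helper.

```python
# Sketch of a utilization-based reweight, under two assumptions:
#   1. an OSD is overloaded when util > average_util * threshold_pct / 100
#   2. an overloaded OSD gets new_weight = old_weight * average_util / util
# Figures are the post-cleanup df numbers: (used GB, size GB).

usage = {"osd.0": (153, 252), "osd.1": (180, 252), "osd.2": (213, 252)}


def suggest_reweights(usage, threshold_pct, old_weight=1.0):
    """Return (average_util, cutoff, {osd: suggested weight})."""
    total_used = sum(used for used, _ in usage.values())
    total_size = sum(size for _, size in usage.values())
    average = total_used / total_size
    cutoff = average * threshold_pct / 100.0
    suggested = {}
    for osd, (used, size) in usage.items():
        util = used / size
        if util > cutoff:
            suggested[osd] = round(old_weight * average / util, 3)
    return average, cutoff, suggested


for threshold in (120, 110):
    average, cutoff, suggested = suggest_reweights(usage, threshold)
    print(f"threshold {threshold}: cutoff {cutoff:.3f}, reweight {suggested}")
```

With these numbers a threshold of 120 would still flag nothing, while 110 would lower only osd.2.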






-- 
...WBR, Roman Hlynovskiy


* Re: recovering from 95% full osd
From: Sage Weil @ 2013-01-09  6:52 UTC (permalink / raw)
  To: Roman Hlynovskiy; +Cc: Gregory Farnum, ceph-devel@vger.kernel.org

On Wed, 9 Jan 2013, Roman Hlynovskiy wrote:
> all osd weights are the same. I guess there is no automatic way to
> balance storage usage for my case and I have to play with osd weights
> using 'ceph osd reweight-by-utilization xx' until storage is used more
> or less equally and when get the weights back to 1?

How many pgs do you have?  ('ceph osd dump | grep ^pool').

You might also adjust the crush tunables, see

	http://ceph.com/docs/master/rados/operations/crush-map/?highlight=tunable#tunables

sage

> 
> 
> 
> 2013/1/8 Gregory Farnum <greg@inktank.com>:
> > On Tue, Jan 8, 2013 at 2:42 AM, Roman Hlynovskiy
> > <roman.hlynovskiy@gmail.com> wrote:
> >> [...]
> >
> > "ceph pg set_full_ratio 0.98"
> >
> > However, as Mark mentioned, you want to figure out why one OSD is so
> > much fuller than the others first. Even in a small cluster I don't
> > think you should be able to see that kind of variance. Simply setting
> > the full ratio to 98% and then continuing to run could cause bigger
> > problems if that OSD continues to get a disproportionate share of the
> > writes and fills up its disk.
> > -Greg
> 
> 
> 
> -- 
> ...WBR, Roman Hlynovskiy
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 


* Re: recoverying from 95% full osd
  2013-01-09  6:52     ` Sage Weil
@ 2013-01-09  7:19       ` Gregory Farnum
  2013-01-09  9:47         ` Roman Hlynovskiy
  0 siblings, 1 reply; 11+ messages in thread
From: Gregory Farnum @ 2013-01-09  7:19 UTC (permalink / raw)
  To: Sage Weil, Roman Hlynovskiy; +Cc: ceph-devel@vger.kernel.org

On Tuesday, January 8, 2013 at 10:52 PM, Sage Weil wrote:
> On Wed, 9 Jan 2013, Roman Hlynovskiy wrote:
> > Thanks a lot Greg,
> > 
> > that was the black magic command I was looking for )
> > 
> > I deleted some obsolete data and reached those figures:
> > 
> > chef@cephgw:~$ ./clu.sh exec "df -kh"|grep osd
> > /dev/mapper/vg00-osd 252G 153G 100G 61% /var/lib/ceph/osd/ceph-0
> > /dev/mapper/vg00-osd 252G 180G 73G 72% /var/lib/ceph/osd/ceph-1
> > /dev/mapper/vg00-osd 252G 213G 40G 85% /var/lib/ceph/osd/ceph-2
> > 
> > which in comparison to previous one:
> > 
> > /dev/mapper/vg00-osd 252G 173G 80G 69% /var/lib/ceph/osd/ceph-0
> > /dev/mapper/vg00-osd 252G 203G 50G 81% /var/lib/ceph/osd/ceph-1
> > /dev/mapper/vg00-osd 252G 240G 13G 96% /var/lib/ceph/osd/ceph-2
> > 
> > show that 20 GB were removed from osd-1, 23 GB from osd-2 and 27 GB from osd-3.
> > So the freed space is also unevenly distributed.
> > 
> > at the same time:
> > chef@cephgw:~$ ceph osd tree
> > 
> > # id    weight  type name               up/down reweight
> > -1      3       pool default
> > -3      3           rack unknownrack
> > -2      1               host ceph-node01
> > 0       1                   osd.0       up      1
> > -4      1               host ceph-node02
> > 1       1                   osd.1       up      1
> > -5      1               host ceph-node03
> > 2       1                   osd.2       up      1
> > 
> > 
> > All osd weights are the same. I guess there is no automatic way to
> > balance storage usage in my case, and I have to play with osd weights
> > using 'ceph osd reweight-by-utilization xx' until storage is used more
> > or less equally, and then set the weights back to 1?
> 
> 
> 
> How many pgs do you have? ('ceph osd dump | grep ^pool').

I believe this is it. You have 384 PGs, but they are spread over three pools, of which only one (or maybe a second one, sort of) is in use. Automatically setting the right PG counts is coming some day, but until then, having to create pools with the right PG count up front is a big gotcha. :(
Depending on how mutable the data is, recreate the pools in use with larger PG counts. Otherwise we can do something more detailed.
-Greg
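
A toy illustration of the PG count point above. This is only a sketch: it places each PG on distinct OSDs chosen uniformly at random, which is an assumption standing in for CRUSH and not how Ceph really maps PGs. The function name placement_spread and the seed are made up for the example. It only shows the statistical trend that fewer PGs give a lumpier spread of PG copies across OSDs.

```python
import random

def placement_spread(pg_count, osds=3, replicas=2, seed=0):
    """Return (per-OSD PG copy counts, relative spread) for a toy placement."""
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    counts = [0] * osds
    for _ in range(pg_count):
        # Each PG gets `replicas` distinct OSDs, chosen uniformly at random.
        # Assumption: this stands in for CRUSH; real placement differs.
        for osd in rng.sample(range(osds), replicas):
            counts[osd] += 1
    mean = sum(counts) / osds
    spread = (max(counts) - min(counts)) / mean
    return counts, spread

if __name__ == "__main__":
    for pgs in (16, 128, 1024, 8192):
        counts, spread = placement_spread(pgs)
        print(f"{pgs:5d} PGs -> copies per OSD {counts}, spread {spread:.1%}")
```

The relative spread (max minus min, divided by the mean) shrinks as the PG count grows, which is the reason a pool that holds most of the data wants more PGs.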
 
> 
> You might also adjust the crush tunables, see
> 
> http://ceph.com/docs/master/rados/operations/crush-map/?highlight=tunable#tunables
> 
> sage


* Re: recoverying from 95% full osd
  2013-01-09  7:19       ` Gregory Farnum
@ 2013-01-09  9:47         ` Roman Hlynovskiy
  2013-01-10  5:32           ` Roman Hlynovskiy
  2013-01-25  4:11           ` Dan Mick
  0 siblings, 2 replies; 11+ messages in thread
From: Roman Hlynovskiy @ 2013-01-09  9:47 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sage Weil, ceph-devel@vger.kernel.org

>> How many pgs do you have? ('ceph osd dump | grep ^pool').
>
> I believe this is it. 384 PGs, but three pools of which only one (or maybe a second one, sort of) is in use. Automatically setting the right PG counts is coming some day, but until then being able to set up pools of the right size is a big gotcha. :(
> Depending on how mutable the data is, recreate with larger PG counts on the pools in use. Otherwise we can do something more detailed.
> -Greg

Hm... what would be the recommended PG count per pool?

chef@cephgw:~$ ceph osd lspools
0 data,1 metadata,2 rbd,
chef@cephgw:~$ ceph osd pool get data pg_num
PG_NUM: 128
chef@cephgw:~$ ceph osd pool get metadata pg_num
PG_NUM: 128
chef@cephgw:~$ ceph osd pool get rbd pg_num
PG_NUM: 128

According to http://ceph.com/docs/master/rados/operations/placement-groups/

             (OSDs * 100)
Total PGs =  ------------
               Replicas

I have 3 OSDs and 2 replicas for each object, which gives a recommended
total of 150 PGs.

Will it make much difference to set 150 instead of 128, or should I
base the count on different values?
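
The arithmetic above can be checked with a few lines of Python. This is just a worked example of the quoted formula using the numbers from this message; the function name recommended_total_pgs and the pools dict are made up for illustration. Note that the formula gives a total across all pools, while pg_num is set per pool.

```python
def recommended_total_pgs(osds, replicas, pgs_per_osd=100):
    """Total PGs = (OSDs * 100) / replicas, as in the quoted formula."""
    # Integer division: exact for these numbers, rounds down otherwise.
    return osds * pgs_per_osd // replicas

osds, replicas = 3, 2
pools = {"data": 128, "metadata": 128, "rbd": 128}   # pg_num per pool, from above

total = recommended_total_pgs(osds, replicas)
print("recommended total PGs:", total)
print("configured total PGs: ", sum(pools.values()))

# Average number of PG copies each OSD holds for the pool carrying the data.
data_copies_per_osd = pools["data"] * replicas / osds
print(f"data pool PG copies per OSD: {data_copies_per_osd:.1f}")
```

So the cluster already has 384 PGs in total against a recommended 150, but only the 128 PGs of the data pool hold the bulk of the objects, which works out to about 85 PG copies per OSD.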

btw, just one more off-topic question:

chef@ceph-node03:~$ ceph pg dump| egrep -v '^(0\.|1\.|2\.)'| column -t
dumped all in format plain
version            113906
last_osdmap_epoch  323
last_pg_scan       1
full_ratio         0.95
nearfull_ratio     0.85
pg_stat  objects  mip  degr  unf  bytes         log       disklog   state  state_stamp  v  reported  up  acting  last_scrub  scrub_stamp  last_deep_scrub  deep_scrub_stamp
pool 0   74748    0    0     0    286157692336  17668034  17668034
pool 1   618      0    0     0    131846062     6414518   6414518
pool 2   0        0    0     0    0             0         0
sum      75366    0    0     0    286289538398  24082552  24082552
osdstat  kbused     kbavail    kb         hb in  hb out
0        157999220  106227596  264226816  [1,2]  []
1        185604948  78621868   264226816  [0,2]  []
2        219475396  44751420   264226816  [0,1]  []
sum      563079564  229600884  792680448
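
For reference, the osdstat rows can be turned into per-OSD utilization and compared with the two ratios in the dump. This is a sketch: the numbers are copied from the output above, the classify helper is made up, and the plain ">=" comparison is an assumption about how the thresholds are applied, not a statement about Ceph internals.

```python
FULL_RATIO = 0.95      # full_ratio from the dump
NEARFULL_RATIO = 0.85  # nearfull_ratio from the dump

# osd id -> (kbused, kb), copied from the osdstat rows
osdstat = {
    0: (157999220, 264226816),
    1: (185604948, 264226816),
    2: (219475396, 264226816),
}

def classify(used_kb, total_kb):
    """Return (utilization ratio, state) for one OSD."""
    ratio = used_kb / total_kb
    if ratio >= FULL_RATIO:          # assumed comparison, see lead-in
        state = "full"
    elif ratio >= NEARFULL_RATIO:
        state = "nearfull"
    else:
        state = "ok"
    return ratio, state

utilization = {osd: classify(*row) for osd, row in osdstat.items()}
for osd, (ratio, state) in utilization.items():
    print(f"osd.{osd}: {ratio:.1%} used ({state})")

average = sum(u for u, _ in osdstat.values()) / sum(t for _, t in osdstat.values())
print(f"cluster average: {average:.1%}")
```

By these figures osd.2 sits at roughly 83%, just under the nearfull ratio, while the cluster average is about 71%.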

Pool 0 (data) is used for data storage.
Pool 1 (metadata) is used for metadata storage.

What is pool 2 (rbd) for? It looks completely empty.


>
>>
>> You might also adjust the crush tunables, see
>>
>> http://ceph.com/docs/master/rados/operations/crush-map/?highlight=tunable#tunables
>>
>> sage
>>

Thanks for the link, Sage. I set the tunable values according to the doc.
Btw, the online document is missing the magic param for the crushmap which
allows those scary_tunables )



--
...WBR, Roman Hlynovskiy


* Re: recoverying from 95% full osd
  2013-01-09  9:47         ` Roman Hlynovskiy
@ 2013-01-10  5:32           ` Roman Hlynovskiy
  2013-01-10 16:50             ` Roman Hlynovskiy
  2013-01-25  4:11           ` Dan Mick
  1 sibling, 1 reply; 11+ messages in thread
From: Roman Hlynovskiy @ 2013-01-10  5:32 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sage Weil, ceph-devel@vger.kernel.org

Hello again!

I left the system in a working state overnight and found it in a weird
state this morning:

chef@ceph-node02:/var/log/ceph$ ceph -s
   health HEALTH_OK
   monmap e4: 3 mons at {a=192.168.7.11:6789/0,b=192.168.7.12:6789/0,c=192.168.7.13:6789/0}, election epoch 254, quorum 0,1,2 a,b,c
   osdmap e348: 3 osds: 3 up, 3 in
    pgmap v114606: 384 pgs: 384 active+clean; 161 GB data, 326 GB used, 429 GB / 755 GB avail
   mdsmap e4623: 1/1/1 up {0=b=up:active}, 1 up:standby

So it looks OK at first glance; however, I am not able
to mount ceph from any of the nodes:
be01:~# mount /var/www/jroger.org/data
mount: 192.168.7.11:/: can't read superblock

On the nodes which had ceph mounted yesterday I am able to browse
the filesystem, but any kind of data read causes the client to
hang.

I captured a trace on the active mds with debug ms/mds = 20
(http://wh.of.kz/ceph_logs.tar.gz).
Could you please help identify what's going on?

-- 
...WBR, Roman Hlynovskiy


* Re: recoverying from 95% full osd
  2013-01-10  5:32           ` Roman Hlynovskiy
@ 2013-01-10 16:50             ` Roman Hlynovskiy
  0 siblings, 0 replies; 11+ messages in thread
From: Roman Hlynovskiy @ 2013-01-10 16:50 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sage Weil, ceph-devel@vger.kernel.org

Please disregard my last email. I followed the recommendation for the
tunables, but missed the note that the kernel version should be 3.5 or
later in order to support them. I reverted them back to the
legacy ones and everything is back online.
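
A quick way to check the kernel requirement on a client before changing the tunables is sketched below. The 3.5 minimum is the one mentioned above; the helper name kernel_supports_tunables is made up, and the check only looks at the version string, so it is an assumption that the running kernel has no backported or missing support. Run it on the machines that mount ceph with the kernel client.

```python
import platform
import re

MIN_KERNEL = (3, 5)   # minimum kernel version mentioned in the message above

def kernel_supports_tunables(release):
    """True if a release string such as '3.2.0-4-amd64' is >= 3.5."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Tuple comparison keeps 3.10 newer than 3.5, unlike a string compare.
    return (major, minor) >= MIN_KERNEL

if __name__ == "__main__":
    release = platform.release()
    verdict = "ok" if kernel_supports_tunables(release) else "too old for the tunables"
    print(release, "->", verdict)
```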

-- 
...WBR, Roman Hlynovskiy


* Re: recoverying from 95% full osd
  2013-01-09  9:47         ` Roman Hlynovskiy
  2013-01-10  5:32           ` Roman Hlynovskiy
@ 2013-01-25  4:11           ` Dan Mick
  1 sibling, 0 replies; 11+ messages in thread
From: Dan Mick @ 2013-01-25  4:11 UTC (permalink / raw)
  To: Roman Hlynovskiy; +Cc: Gregory Farnum, Sage Weil, ceph-devel@vger.kernel.org


> what is pool 2 (rbd) for? looks like it's absolutely empty.

By default it's for rbd images (see the rbd command etc.). Whether it is
empty or not has no effect on the other pools.


end of thread, other threads:[~2013-01-25  4:12 UTC | newest]

Thread overview: 11+ messages
2013-01-08 10:42 recoverying from 95% full osd Roman Hlynovskiy
2013-01-08 16:16 ` Mark Nelson
2013-01-09  4:19   ` Roman Hlynovskiy
2013-01-08 17:20 ` Gregory Farnum
2013-01-09  5:41   ` Roman Hlynovskiy
2013-01-09  6:52     ` Sage Weil
2013-01-09  7:19       ` Gregory Farnum
2013-01-09  9:47         ` Roman Hlynovskiy
2013-01-10  5:32           ` Roman Hlynovskiy
2013-01-10 16:50             ` Roman Hlynovskiy
2013-01-25  4:11           ` Dan Mick
