* 70+ OSD are DOWN and not coming up
@ 2014-05-20 9:02 Karan Singh
From: Karan Singh @ 2014-05-20 9:02 UTC
To: ceph-users, ceph-devel-u79uwXL29TY76Z2rM5mHXA, Ceph Community
Hello Cephers, I need your suggestions for troubleshooting.
My cluster is struggling badly: 70+ OSDs out of 165 are down.
Problem: OSDs are getting marked down and out of the cluster, and the cluster is degraded. The logs of the failed OSDs show weird entries that are being generated continuously.
OSD debug logs: http://pastebin.com/agTKh6zB
2014-05-20 10:19:03.699886 7f2328e237a0 0 osd.158 357532 done with init, starting boot process
2014-05-20 10:19:03.700093 7f22ff621700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.109:6802/910005982 pipe(0x8698500 sd=35 :33500 s=1 pgs=0 cs=0 l=0 c=0x83018c0).connect claims to be 192.168.1.109:6802/63896 not 192.168.1.109:6802/910005982 - wrong node!
2014-05-20 10:19:03.700152 7f22ff621700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.109:6802/910005982 pipe(0x8698500 sd=35 :33500 s=1 pgs=0 cs=0 l=0 c=0x83018c0).fault with nothing to send, going to standby
2014-05-20 10:19:09.551269 7f22fdd12700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.109:6803/1176009454 pipe(0x56aee00 sd=53 :40060 s=1 pgs=0 cs=0 l=0 c=0x533fd20).connect claims to be 192.168.1.109:6803/63896 not 192.168.1.109:6803/1176009454 - wrong node!
2014-05-20 10:19:09.551347 7f22fdd12700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.109:6803/1176009454 pipe(0x56aee00 sd=53 :40060 s=1 pgs=0 cs=0 l=0 c=0x533fd20).fault with nothing to send, going to standby
2014-05-20 10:19:09.703901 7f22fd80d700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.113:6802/13870 pipe(0x56adf00 sd=137 :42889 s=1 pgs=0 cs=0 l=0 c=0x8302aa0).connect claims to be 192.168.1.113:6802/24612 not 192.168.1.113:6802/13870 - wrong node!
2014-05-20 10:19:09.704039 7f22fd80d700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.113:6802/13870 pipe(0x56adf00 sd=137 :42889 s=1 pgs=0 cs=0 l=0 c=0x8302aa0).fault with nothing to send, going to standby
2014-05-20 10:19:10.243139 7f22fd005700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.112:6800/14114 pipe(0x56a8f00 sd=146 :43726 s=1 pgs=0 cs=0 l=0 c=0x8304780).connect claims to be 192.168.1.112:6800/2852 not 192.168.1.112:6800/14114 - wrong node!
2014-05-20 10:19:10.243190 7f22fd005700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.112:6800/14114 pipe(0x56a8f00 sd=146 :43726 s=1 pgs=0 cs=0 l=0 c=0x8304780).fault with nothing to send, going to standby
2014-05-20 10:19:10.349693 7f22fc7fd700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.109:6800/13492 pipe(0x8698c80 sd=156 :0 s=1 pgs=0 cs=0 l=0 c=0x83070c0).fault with nothing to send, going to standby
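One way to sanity-check these "wrong node" messages is to compare the address the current osdmap records for the peer OSD against what the log line claims (a minimal sketch using the standard ceph CLI, assuming your version has these subcommands; osd.158 is only an example id taken from the log above):

    # which host/address does the current osdmap record for a given OSD?
    ceph osd find 158
    # or pull its line out of a full map dump:
    ceph osd dump | grep '^osd.158 '

If the address/nonce in the map disagrees with what the peer actually claims, the connecting OSD is likely working from a stale map.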
# ceph -v
ceph version 0.80-469-g991f7f1 (991f7f15a6e107b33a24bbef1169f21eb7fcce2c)
# ceph osd stat
osdmap e357073: 165 osds: 91 up, 165 in
flags noout
What I have tried so far:
1. Restarting the problematic OSDs, with no luck.
2. Restarting the entire host, also with no luck; the OSDs stay down and keep logging the same message:
2014-05-20 10:19:10.243139 7f22fd005700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.112:6800/14114 pipe(0x56a8f00 sd=146 :43726 s=1 pgs=0 cs=0 l=0 c=0x8304780).connect claims to be 192.168.1.112:6800/2852 not 192.168.1.112:6800/14114 - wrong node!
2014-05-20 10:19:10.243190 7f22fd005700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.112:6800/14114 pipe(0x56a8f00 sd=146 :43726 s=1 pgs=0 cs=0 l=0 c=0x8304780).fault with nothing to send, going to standby
2014-05-20 10:19:10.349693 7f22fc7fd700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.109:6800/13492 pipe(0x8698c80 sd=156 :0 s=1 pgs=0 cs=0 l=0 c=0x83070c0).fault with nothing to send, going to standby
2014-05-20 10:22:23.312473 7f2307e61700 0 osd.158 357781 do_command r=0
2014-05-20 10:22:23.326110 7f2307e61700 0 osd.158 357781 do_command r=0 debug_osd=0/5
2014-05-20 10:22:23.326123 7f2307e61700 0 log [INF] : debug_osd=0/5
2014-05-20 10:34:08.161864 7f230224d700 0 -- 192.168.1.112:6802/3807 >> 192.168.1.102:6808/13276 pipe(0x8698280 sd=22 :41078 s=2 pgs=603 cs=1 l=0 c=0x8301600).fault with nothing to send, going to standby
3. The disks have no errors; there is nothing in dmesg or /var/log/messages.
4. There was a similar bug in the past (http://tracker.ceph.com/issues/4006); I don't know whether it has come back in Firefly.
5. No recent activity was performed on the cluster, apart from creating some pools and keys for Cinder/Glance integration.
6. The nodes have enough free resources for the OSDs.
7. There are no network issues; OSDs are down on all cluster nodes, not just a single node (see the sketch below).
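A quick way to back up point 7 is to list the down OSDs straight from the CRUSH tree (a small sketch with the standard CLI; exact column layout varies a little between versions):

    # list only the down OSDs
    ceph osd tree | grep -w down
    # count them from the map
    ceph osd dump | grep -cw down

If the down entries are spread over many host buckets, a single bad node or switch port is unlikely.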
****************************************************************
Karan Singh
Systems Specialist , Storage Platforms
CSC - IT Center for Science,
Keilaranta 14, P. O. Box 405, FIN-02101 Espoo, Finland
mobile: +358 503 812758
tel. +358 9 4572001
fax +358 9 4572302
http://www.csc.fi/
****************************************************************
* Re: 70+ OSD are DOWN and not coming up
@ 2014-05-20 15:18 Sage Weil
From: Sage Weil @ 2014-05-20 15:18 UTC
To: Karan Singh; +Cc: ceph-users, ceph-devel-u79uwXL29TY76Z2rM5mHXA, Ceph Community
On Tue, 20 May 2014, Karan Singh wrote:
> Hello Cephers, I need your suggestions for troubleshooting.
>
> My cluster is struggling badly: 70+ OSDs out of 165 are down.
>
> Problem: OSDs are getting marked down and out of the cluster, and the
> cluster is degraded. The logs of the failed OSDs show weird entries that
> are being generated continuously.
Tracking this at http://tracker.ceph.com/issues/8387
The most recent bits you posted in the ticket don't quite make sense: the
OSD is trying to connect to an address for an OSD that is currently marked
down. I suspect this is just timing between when the logs were captured
and when the ceph osd dump was captured. To get a complete picture,
please:
1) add
debug osd = 20
debug ms = 1
in [osd] and restart all osds
2) ceph osd set nodown
(to prevent flapping)
3) find some OSD that is showing these messages
4) capture a 'ceph osd dump' output.
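If restarting every OSD is too disruptive, the same debug levels can usually also be raised at runtime (a sketch; verify the option names against your running version, and note that runtime injection only reaches daemons that are up and reachable):

    ceph tell osd.* injectargs '--debug-osd 20 --debug-ms 1'

The ceph.conf change is still worth making so the levels survive the next restart.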
Also happy to debug this interactively over IRC; that will likely be
faster!
Thanks-
sage
* Re: 70+ OSD are DOWN and not coming up
@ 2014-05-21 12:37 Karan Singh
From: Karan Singh @ 2014-05-21 12:37 UTC
To: Sage Weil, ceph-users; +Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA, Ceph Community
Hello Sage,
nodown and noout are now set on the cluster.
# ceph status
cluster 009d3518-e60d-4f74-a26d-c08c1976263c
health HEALTH_WARN 1133 pgs degraded; 44 pgs incomplete; 42 pgs stale; 45 pgs stuck inactive; 42 pgs stuck stale; 2602 pgs stuck unclean; recovery 206/2199 objects degraded (9.368%); 40/165 in osds are down; nodown,noout flag(s) set
monmap e4: 4 mons at {storage0101-ib=192.168.100.101:6789/0,storage0110-ib=192.168.100.110:6789/0,storage0114-ib=192.168.100.114:6789/0,storage0115-ib=192.168.100.115:6789/0}, election epoch 18, quorum 0,1,2,3 storage0101-ib,storage0110-ib,storage0114-ib,storage0115-ib
osdmap e358031: 165 osds: 125 up, 165 in
flags nodown,noout
pgmap v604305: 4544 pgs, 6 pools, 4309 MB data, 733 objects
3582 GB used, 357 TB / 361 TB avail
206/2199 objects degraded (9.368%)
1 inactive
5 stale+active+degraded+remapped
1931 active+clean
2 stale+incomplete
21 stale+active+remapped
380 active+degraded+remapped
38 incomplete
1403 active+remapped
2 stale+active+degraded
1 stale+remapped+incomplete
746 active+degraded
11 stale+active+clean
3 remapped+incomplete
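For a per-PG view of the stuck placement groups above, the pg subcommands help (a sketch; these subcommands exist in Firefly-era releases, though output formats shift between versions):

    ceph pg dump_stuck stale
    ceph pg dump_stuck inactive
    ceph pg dump_stuck unclean
    # then, for a single problem pg:
    ceph pg <pgid> query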
Here is my ceph.conf: http://pastebin.com/KZdgPJm7 (debug osd and debug ms are set).
I tried restarting all OSD services on node 13; the services came up only after several attempts of "service ceph restart": http://pastebin.com/yMk86YHh
For node 14, all services are up:
[root@storage0114-ib ~]# service ceph status
=== osd.142 ===
osd.142: running {"version":"0.80-475-g9e80c29"}
=== osd.36 ===
osd.36: running {"version":"0.80-475-g9e80c29"}
=== osd.83 ===
osd.83: running {"version":"0.80-475-g9e80c29"}
=== osd.107 ===
osd.107: running {"version":"0.80-475-g9e80c29"}
=== osd.47 ===
osd.47: running {"version":"0.80-475-g9e80c29"}
=== osd.130 ===
osd.130: running {"version":"0.80-475-g9e80c29"}
=== osd.155 ===
osd.155: running {"version":"0.80-475-g9e80c29"}
=== osd.60 ===
osd.60: running {"version":"0.80-475-g9e80c29"}
=== osd.118 ===
osd.118: running {"version":"0.80-475-g9e80c29"}
=== osd.98 ===
osd.98: running {"version":"0.80-475-g9e80c29"}
=== osd.70 ===
osd.70: running {"version":"0.80-475-g9e80c29"}
=== mon.storage0114-ib ===
mon.storage0114-ib: running {"version":"0.80-475-g9e80c29"}
[root@storage0114-ib ~]#
But "ceph osd tree" says osd.118 is down:
-10 29.93 host storage0114-ib
36 2.63 osd.36 up 1
47 2.73 osd.47 up 1
60 2.73 osd.60 up 1
70 2.73 osd.70 up 1
83 2.73 osd.83 up 1
98 2.73 osd.98 up 1
107 2.73 osd.107 up 1
118 2.73 osd.118 down 1
130 2.73 osd.130 up 1
142 2.73 osd.142 up 1
155 2.73 osd.155 up 1
I restarted the osd.118 service and the restart reported success, but the OSD still shows as down in "ceph osd tree". I waited 30 minutes for it to stabilize, and it is still not marked up.
Moreover, it is generating HUGE logs: http://pastebin.com/mDYnjAni
The problem now is that if I manually visit every host and run "service ceph status", all services are running on all 15 hosts, but this is not reflected in "ceph osd tree" and "ceph -s", which continue to show the OSDs as DOWN.
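Something like the following loop makes that per-host check less tedious (a sketch; the storage01NN-ib hostnames follow the naming visible in the monmap above, assume 15 consecutively numbered hosts, and require passwordless SSH from the admin node):

    # compare what each host reports against what the monitors believe
    for h in storage01{01..15}-ib; do
        echo "== $h =="
        ssh "$h" 'service ceph status'
    done
    ceph osd tree | grep -w down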
My IRC id is ksingh; let me know by email once you are available on IRC (my time zone is Finland, +2).
- Karan Singh -
* Re: 70+ OSD are DOWN and not coming up
@ 2014-05-22 1:34 Craig Lewis
From: Craig Lewis @ 2014-05-22 1:34 UTC
To: Sage Weil; +Cc: ceph-users, ceph-devel-u79uwXL29TY76Z2rM5mHXA, Ceph Community
On 5/20/14 08:18, Sage Weil wrote:
> Also happy to debug this interactively over IRC; that will likely be
> faster!
If you do this over IRC, can you please post a summary to the mailing list?
I believe I'm having this issue as well.
--
Craig Lewis
Senior Systems Engineer
Office +1.714.602.1309
Email clewis-04jk9TcbgGYP2IHM84UzcNBPR1lH4CV8@public.gmane.org
Central Desktop. Work together in ways you never thought possible.
Connect with us: Website http://www.centraldesktop.com/ | Twitter http://www.twitter.com/centraldesktop | Facebook http://www.facebook.com/CentralDesktop | LinkedIn http://www.linkedin.com/groups?gid=147417 | Blog http://cdblog.centraldesktop.com/
* Re: 70+ OSD are DOWN and not coming up
@ 2014-05-22 4:15 Sage Weil
From: Sage Weil @ 2014-05-22 4:15 UTC
To: Craig Lewis; +Cc: ceph-users, ceph-devel-u79uwXL29TY76Z2rM5mHXA, Ceph Community
On Wed, 21 May 2014, Craig Lewis wrote:
> If you do this over IRC, can you please post a summary to the mailing
> list?
>
> I believe I'm having this issue as well.
In the other case, we found that some of the OSDs were behind processing
maps (by several thousand epochs). The trick here to give them a chance
to catch up is
ceph osd set noup
ceph osd set nodown
ceph osd set noout
and wait for them to stop spinning on the CPU. You can check which map
each OSD is on with
ceph daemon osd.NNN status
to see which epoch they are on and compare that to
ceph osd stat
Once they are within 100 or fewer epochs,
ceph osd unset noup
and let them all start up.
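A rough way to script that comparison for all OSDs on one host (a sketch; the admin-socket path and the exact field names in the status output can differ between versions):

    # the cluster's current osdmap epoch
    ceph osd stat
    # each local daemon's view, via its admin socket
    for sock in /var/run/ceph/ceph-osd.*.asok; do
        echo -n "$sock: "
        ceph daemon "$sock" status | grep map
    done

An OSD whose newest_map is thousands of epochs behind the cluster's current epoch is still catching up and is best left alone under noup.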
We haven't determined whether the original problem was caused by this or
the other way around; we'll see once they are all caught up.
sage
* Re: 70+ OSD are DOWN and not coming up
@ 2014-05-22 7:26 Craig Lewis
From: Craig Lewis @ 2014-05-22 7:26 UTC
To: Sage Weil; +Cc: ceph-users, ceph-devel-u79uwXL29TY76Z2rM5mHXA, Ceph Community
On 5/21/14 21:15, Sage Weil wrote:
> In the other case, we found that some of the OSDs were behind processing
> maps (by several thousand epochs). The trick here to give them a chance
> to catch up is [...]
I was seeing the CPU spinning too, so I think it is the same issue.
Thanks for the explanation! I've been pulling my hair out for weeks.
I can give you a data point for the "how". My problems started with a
kswapd problem on Ubuntu 12.04.4 (kernel 3.5.0-46-generic
#70~precise1-Ubuntu). kswapd was consuming 100% CPU, and it was
blocking the ceph-osd processes. Even after I stopped kswapd from doing
that, my OSDs couldn't recover. noout and nodown didn't help; the OSDs
would suicide and restart.
Upgrading to Ubuntu 14.04 seems to have helped. The cluster isn't all
clear yet, but it's getting better: it is finally healthy after 2 weeks
of incomplete and stale PGs. It's still unresponsive, but it's making
progress. I am still seeing OSDs consuming 100% CPU, but only the OSDs
that are actively deep-scrubbing. Once the deep-scrub finishes, the OSD
starts behaving again. They seem to be slowly getting better, which
matches up with your explanation.
I'll go ahead and set noup. I don't think it's necessary at this point,
but it's not going to hurt.
I'm running Emperor, and it looks like the daemon "status" command isn't
supported there. Not a big deal though. Deep-scrub has made it through
half of the PGs in the last 36 hours, so I'll just watch for another day
or two. This is a slave cluster, so I have that luxury.
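For anyone else on Emperor without the daemon status command, deep-scrub progress can also be tracked from the pg dump timestamps (a sketch; the deep-scrub stamp is the trailing column of each pg line in output from this era, but check the header row of 'ceph pg dump' on your version first):

    # each PG's last deep-scrub timestamp, oldest first
    ceph pg dump 2>/dev/null | grep -E '^[0-9]+\.' | awk '{print $1, $(NF-1), $NF}' | sort -k2 | head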
--
Craig Lewis
Senior Systems Engineer
Office +1.714.602.1309
Email clewis-04jk9TcbgGYP2IHM84UzcNBPR1lH4CV8@public.gmane.org
Central Desktop. Work together in ways you never thought possible.
Connect with us: Website http://www.centraldesktop.com/ | Twitter http://www.twitter.com/centraldesktop | Facebook http://www.facebook.com/CentralDesktop | LinkedIn http://www.linkedin.com/groups?gid=147417 | Blog http://cdblog.centraldesktop.com/