* Serious problem after increase pg_num in pool
@ 2012-02-20 19:16 Sławomir Skowron
  2012-02-20 19:35 ` Sławomir Skowron
  0 siblings, 1 reply; 7+ messages in thread
From: Sławomir Skowron @ 2012-02-20 19:16 UTC (permalink / raw)
  To: ceph-devel

After increasing pg_num from 8 to 100 in the .rgw.buckets pool, I have run
into some serious problems.

pool name       category         KB  objects  clones  degraded  unfound     rd  rd KB       wr      wr KB
.intent-log     -              4662       19       0         0        0      0      0    26502      26501
.log            -                 0        0       0         0        0      0      0   913732     913342
.rgw            -                 1       10       0         0        0      1      0        9          7
.rgw.buckets    -          39582566    73707       0      8061        0  86594      0   610896   36050541
.rgw.control    -                 0        1       0         0        0      0      0        0          0
.users          -                 1        1       0         0        0      0      0        1          1
.users.uid      -                 1        2       0         0        0      2      1        3          3
data            -                 0        0       0         0        0      0      0        0          0
metadata        -                 0        0       0         0        0      0      0        0          0
rbd             -          21590723     5328       0         1        0     77     75  3013595  378345507
  total used       229514252        79068
  total avail    19685615164
  total space    20980898464

2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384251 mon.0
10.177.64.4:6789/0 36135 : [INF] osd.28 10.177.64.6:6806/824 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384275 mon.0
10.177.64.4:6789/0 36136 : [INF] osd.37 10.177.64.6:6841/29133 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384301 mon.0
10.177.64.4:6789/0 36137 : [INF] osd.7 10.177.64.4:6813/8223 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384327 mon.0
10.177.64.4:6789/0 36138 : [INF] osd.44 10.177.64.6:6859/2370 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384353 mon.0
10.177.64.4:6789/0 36139 : [INF] osd.49 10.177.64.6:6865/29878 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384384 mon.0
10.177.64.4:6789/0 36140 : [INF] osd.17 10.177.64.4:6827/5909 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384410 mon.0
10.177.64.4:6789/0 36141 : [INF] osd.12 10.177.64.4:6810/5410 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384435 mon.0
10.177.64.4:6789/0 36142 : [INF] osd.39 10.177.64.6:6843/12733 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384461 mon.0
10.177.64.4:6789/0 36143 : [INF] osd.42 10.177.64.6:6848/13067 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384485 mon.0
10.177.64.4:6789/0 36144 : [INF] osd.31 10.177.64.6:6840/1233 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384513 mon.0
10.177.64.4:6789/0 36145 : [INF] osd.36 10.177.64.6:6830/12573 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384537 mon.0
10.177.64.4:6789/0 36146 : [INF] osd.38 10.177.64.6:6833/32587 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384567 mon.0
10.177.64.4:6789/0 36147 : [INF] osd.5 10.177.64.4:6873/7842 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384596 mon.0
10.177.64.4:6789/0 36148 : [INF] osd.21 10.177.64.4:6844/11607 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384622 mon.0
10.177.64.4:6789/0 36149 : [INF] osd.23 10.177.64.4:6853/6826 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384661 mon.0
10.177.64.4:6789/0 36150 : [INF] osd.51 10.177.64.6:6858/15894 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384693 mon.0
10.177.64.4:6789/0 36151 : [INF] osd.48 10.177.64.6:6862/13476 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384723 mon.0
10.177.64.4:6789/0 36152 : [INF] osd.32 10.177.64.6:6815/3701 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384759 mon.0
10.177.64.4:6789/0 36153 : [INF] osd.41 10.177.64.6:6847/1861 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384790 mon.0
10.177.64.4:6789/0 36154 : [INF] osd.0 10.177.64.4:6800/5230 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384814 mon.0
10.177.64.4:6789/0 36155 : [INF] osd.3 10.177.64.4:6865/7242 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384838 mon.0
10.177.64.4:6789/0 36156 : [INF] osd.1 10.177.64.4:6804/9729 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384864 mon.0
10.177.64.4:6789/0 36157 : [INF] osd.47 10.177.64.6:6866/13924 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384896 mon.0
10.177.64.4:6789/0 36158 : [INF] osd.45 10.177.64.6:6857/4401 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384928 mon.0
10.177.64.4:6789/0 36159 : [INF] osd.20 10.177.64.4:6842/6246 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384952 mon.0
10.177.64.4:6789/0 36160 : [INF] osd.16 10.177.64.4:6821/5833 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384982 mon.0
10.177.64.4:6789/0 36161 : [INF] osd.35 10.177.64.6:6824/3877 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.385007 mon.0
10.177.64.4:6789/0 36162 : [INF] osd.3 10.177.64.4:6865/7242 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.385032 mon.0
10.177.64.4:6789/0 36163 : [INF] osd.7 10.177.64.4:6813/8223 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.385059 mon.0
10.177.64.4:6789/0 36164 : [INF] osd.19 10.177.64.4:6831/10499 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.851483    pg v172582: 10548 pgs: 92 creating, 1
active, 9713 active+clean, 3 active+degraded+backfill, 657 peering, 77
down+peering, 5 active+degraded; 59744 MB data, 218 GB used, 18773 GB
/ 20008 GB avail; 8071/237184 degraded (3.403%)
2012-02-20 20:06:10.967491   osd e7436: 78 osds: 70 up, 73 in
2012-02-20 20:06:10.990903   log 2012-02-20 20:05:56.448227 mon.2
10.177.64.8:6789/0 134 : [INF] mon.2 calling new monitor election
2012-02-20 20:06:10.990903   log 2012-02-20 20:05:58.252635 mon.1
10.177.64.6:6789/0 3929 : [INF] mon.1 calling new monitor election
2012-02-20 20:06:11.034669    pg v172583: 10548 pgs: 92 creating, 1
active, 9713 active+clean, 3 active+degraded+backfill, 657 peering, 77
down+peering, 5 active+degraded; 59744 MB data, 218 GB used, 18773 GB
/ 20008 GB avail; 8071/237184 degraded (3.403%)
2012-02-20 20:06:11.958126   osd e7437: 78 osds: 70 up, 73 in
2012-02-20 20:06:12.068650    pg v172584: 10548 pgs: 92 creating, 1
active, 9711 active+clean, 3 active+degraded+backfill, 659 peering, 77
down+peering, 5 active+degraded; 59744 MB data, 218 GB used, 18773 GB
/ 20008 GB avail; 8067/237184 degraded (3.401%)
2012-02-20 20:06:12.947997   osd e7438: 78 osds: 70 up, 73 in
2012-02-20 20:06:13.770942    pg v172585: 10548 pgs: 3 inactive, 92
creating, 1 active, 9824 active+clean, 3 active+degraded+backfill, 541
peering, 77 down+peering, 7 active+degraded; 59744 MB data, 218 GB
used, 18773 GB / 20008 GB avail; 8067/237184 degraded (3.401%)
2012-02-20 20:06:14.686248    pg v172586: 10548 pgs: 3 inactive, 92
creating, 1 active, 9894 active+clean, 3 active+degraded+backfill, 471
peering, 77 down+peering, 7 active+degraded; 59744 MB data, 218 GB
used, 18773 GB / 20008 GB avail; 8067/237184 degraded (3.401%)
2012-02-20 20:06:15.340365    pg v172587: 10548 pgs: 3 inactive, 92
creating, 1 active, 9915 active+clean, 3 active+degraded+backfill, 447
peering, 77 down+peering, 10 active+degraded; 59744 MB data, 218 GB
used, 18773 GB / 20008 GB avail; 8067/237184 degraded (3.401%)
2012-02-20 20:06:16.852264    pg v172588: 10548 pgs: 3 inactive, 92
creating, 84 active, 10094 active+clean, 3 active+degraded+backfill,
179 peering, 77 down+peering, 16 active+degraded; 59744 MB data, 218
GB used, 18773 GB / 20008 GB avail; 8067/237184 degraded (3.401%)
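
The "pg v..." status lines above can be checked mechanically. The following is a minimal sketch, assuming the lines are first rejoined from the mail wrapping; the function name parse_pg_status and the regular expression are illustrative and not part of any Ceph tool. It confirms that the per-state counts add up to the reported total and recomputes the degraded percentage.

```python
import re

# Two of the "pg v..." lines quoted above, with the mail wrapping undone.
STATUS_LINES = [
    "2012-02-20 20:06:10.851483    pg v172582: 10548 pgs: 92 creating, 1 "
    "active, 9713 active+clean, 3 active+degraded+backfill, 657 peering, 77 "
    "down+peering, 5 active+degraded; 59744 MB data, 218 GB used, 18773 GB "
    "/ 20008 GB avail; 8071/237184 degraded (3.403%)",
    "2012-02-20 20:06:16.852264    pg v172588: 10548 pgs: 3 inactive, 92 "
    "creating, 84 active, 10094 active+clean, 3 active+degraded+backfill, "
    "179 peering, 77 down+peering, 16 active+degraded; 59744 MB data, 218 "
    "GB used, 18773 GB / 20008 GB avail; 8067/237184 degraded (3.401%)",
]

# Hypothetical pattern written for this log format only.
PG_LINE = re.compile(
    r"pg v(?P<version>\d+): (?P<total>\d+) pgs: (?P<states>[^;]+);"
    r".*?(?P<degraded>\d+)/(?P<objects>\d+) degraded"
)


def parse_pg_status(line):
    """Return the PG map version, total PG count, per-state counts and degraded %."""
    m = PG_LINE.search(line)
    if m is None:
        raise ValueError("not a pg status line")
    states = {}
    for item in m.group("states").split(","):
        count, name = item.split()
        states[name] = int(count)
    degraded_pct = 100.0 * int(m.group("degraded")) / int(m.group("objects"))
    return {
        "version": int(m.group("version")),
        "total": int(m.group("total")),
        "states": states,
        "degraded_pct": degraded_pct,
    }


if __name__ == "__main__":
    for line in STATUS_LINES:
        status = parse_pg_status(line)
        not_clean = status["total"] - status["states"].get("active+clean", 0)
        print(status["version"], "PGs not active+clean:", not_clean,
              "degraded: %.3f%%" % status["degraded_pct"])
```

Tracking the number of PGs that are not active+clean across successive versions gives a quick view of whether peering is converging.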

OSDs keep failing, over and over, and then other ones fail. The number of
up OSDs fluctuates between 62 and 70-72: it drops, then climbs back up
again.
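
The flapping can be followed from the "osd eNNNN" summary lines in the ceph -w output. A minimal sketch, assuming that line format stays as quoted earlier in this message; parse_osd_summary and up_history are hypothetical helper names, not Ceph commands.

```python
import re

# The three osd map summary lines quoted earlier in this message.
OSD_LINES = [
    "2012-02-20 20:06:10.967491   osd e7436: 78 osds: 70 up, 73 in",
    "2012-02-20 20:06:11.958126   osd e7437: 78 osds: 70 up, 73 in",
    "2012-02-20 20:06:12.947997   osd e7438: 78 osds: 70 up, 73 in",
]

OSD_LINE = re.compile(r"osd e(\d+): (\d+) osds: (\d+) up, (\d+) in")


def parse_osd_summary(line):
    """Return epoch and OSD counts from one summary line, or None if it does not match."""
    m = OSD_LINE.search(line)
    if m is None:
        return None
    epoch, total, up, in_count = map(int, m.groups())
    return {"epoch": epoch, "total": total, "up": up,
            "in": in_count, "down": total - up}


def up_history(lines):
    """List (epoch, up count) pairs so changes in the up count stand out."""
    summaries = (parse_osd_summary(line) for line in lines)
    return [(s["epoch"], s["up"]) for s in summaries if s is not None]


if __name__ == "__main__":
    for epoch, up in up_history(OSD_LINES):
        print("epoch", epoch, "up", up)
```

Feeding a longer capture of ceph -w through up_history would show how quickly the map epoch advances while the up count moves.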

2012-02-20 20:09:47.305016 7f816009e700 osd.20 7476 heartbeat_check:
no heartbeat from osd.64 since 2012-02-20 20:09:30.286408 (cutoff
2012-02-20 20:09:42.304975)
2012-02-20 20:09:47.410159 7f816c9b8700 osd.20 7476 heartbeat_check:
no heartbeat from osd.61 since 2012-02-20 20:09:29.807115 (cutoff
2012-02-20 20:09:42.410144)
2012-02-20 20:09:47.410177 7f816c9b8700 osd.20 7476 heartbeat_check:
no heartbeat from osd.64 since 2012-02-20 20:09:30.286408 (cutoff
2012-02-20 20:09:42.410144)
2012-02-20 20:09:47.906661 7f816009e700 osd.20 7476 heartbeat_check:
no heartbeat from osd.61 since 2012-02-20 20:09:29.807115 (cutoff
2012-02-20 20:09:42.906639)
2012-02-20 20:09:47.906685 7f816009e700 osd.20 7476 heartbeat_check:
no heartbeat from osd.64 since 2012-02-20 20:09:30.286408 (cutoff
2012-02-20 20:09:42.906639)
2012-02-20 20:09:48.114431 7f815660b700 -- 10.177.64.4:0/6389 >>
10.177.64.4:6854/5398 pipe(0x1398c500 sd=47 pgs=26 cs=2 l=0).connect
claims to be 10.177.64.4:6854/17798 not 10.177.64.4:6854/5398 - wrong
node!
2012-02-20 20:09:48.410333 7f816c9b8700 osd.20 7476 heartbeat_check:
no heartbeat from osd.61 since 2012-02-20 20:09:29.807115 (cutoff
2012-02-20 20:09:43.410313)
2012-02-20 20:09:48.410361 7f816c9b8700 osd.20 7476 heartbeat_check:
no heartbeat from osd.64 since 2012-02-20 20:09:30.286408 (cutoff
2012-02-20 20:09:43.410313)
2012-02-20 20:09:51.450127 7f814b75d700 -- 10.177.64.4:0/6389 >>
10.177.64.4:6855/17423 pipe(0xa86e780 sd=17 pgs=17 cs=2 l=0).connect
claims to be 10.177.64.4:6855/17798 not 10.177.64.4:6855/17423 - wrong
node!
2012-02-20 20:09:54.498949 7f814a248700 -- 10.177.64.4:0/6389 >>
10.177.64.4:6854/19396 pipe(0x38cc780 sd=25 pgs=8 cs=2 l=0).connect
claims to be 10.177.64.4:6854/17798 not 10.177.64.4:6854/19396 - wrong
node!
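
In the heartbeat_check lines above, a peer is reported when its last heartbeat ("since") is older than the cutoff. A minimal sketch that extracts how far past the cutoff each peer is; the regular expression and the name parse_heartbeat_check are assumptions made for this log format only.

```python
import re
from datetime import datetime

# First heartbeat_check line from the log above, with the mail wrapping undone.
LINE = (
    "2012-02-20 20:09:47.305016 7f816009e700 osd.20 7476 heartbeat_check: "
    "no heartbeat from osd.64 since 2012-02-20 20:09:30.286408 (cutoff "
    "2012-02-20 20:09:42.304975)"
)

HEARTBEAT = re.compile(
    r"(?P<reporter>osd\.\d+) \d+ heartbeat_check: "
    r"no heartbeat from (?P<peer>osd\.\d+) since (?P<since>\S+ \S+) "
    r"\(cutoff (?P<cutoff>\S+ \S+)\)"
)
TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def parse_heartbeat_check(line):
    """Return (reporter, peer, seconds between last heartbeat and cutoff)."""
    m = HEARTBEAT.search(line)
    if m is None:
        raise ValueError("not a heartbeat_check line")
    since = datetime.strptime(m.group("since"), TIME_FORMAT)
    cutoff = datetime.strptime(m.group("cutoff"), TIME_FORMAT)
    overdue = (cutoff - since).total_seconds()
    return m.group("reporter"), m.group("peer"), overdue


if __name__ == "__main__":
    reporter, peer, overdue = parse_heartbeat_check(LINE)
    print("%s: %s is %.3f s past the cutoff" % (reporter, peer, overdue))
```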

Some of them go down with this:

2012-02-20 18:22:15.824992 7fe3ec1c97a0 ceph version 0.41
(commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d), process ceph-osd,
pid 31379
2012-02-20 18:22:15.826476 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
mount FIEMAP ioctl is supported
2012-02-20 18:22:15.826514 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
mount did NOT detect btrfs
2012-02-20 18:22:15.826613 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
mount found snaps <>
2012-02-20 18:22:15.826650 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
mount: WRITEAHEAD journal mode explicitly enabled in conf
2012-02-20 18:22:16.415671 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
mount FIEMAP ioctl is supported
2012-02-20 18:22:16.415703 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
mount did NOT detect btrfs
2012-02-20 18:22:16.415744 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
mount found snaps <>
2012-02-20 18:22:16.415758 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
mount: WRITEAHEAD journal mode explicitly enabled in conf
osd/OSD.cc: In function 'void OSD::split_pg(PG*, std::map<pg_t, PG*>&,
ObjectStore::Transaction&)' thread 7fe3df8c4700 time 2012-02-20
18:22:19.900886
osd/OSD.cc: 4066: FAILED assert(child)
 ceph version 0.41 (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d)
 1: (OSD::split_pg(PG*, std::map<pg_t, PG*, std::less<pg_t>,
std::allocator<std::pair<pg_t const, PG*> > >&,
ObjectStore::Transaction&)+0x23e0) [0x54cd20]
 2: (OSD::kick_pg_split_queue()+0x880) [0x556d90]
 3: (OSD::handle_pg_notify(MOSDPGNotify*)+0x4b6) [0x559546]
 4: (OSD::_dispatch(Message*)+0x608) [0x560e58]
 5: (OSD::ms_dispatch(Message*)+0x11e) [0x561b7e]
 6: (SimpleMessenger::dispatch_entry()+0x76b) [0x5c844b]
 7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4b5cfc]
 8: (()+0x7efc) [0x7fe3ebda3efc]
 9: (clone()+0x6d) [0x7fe3ea3d489d]
 ceph version 0.41 (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d)
 1: (OSD::split_pg(PG*, std::map<pg_t, PG*, std::less<pg_t>,
std::allocator<std::pair<pg_t const, PG*> > >&,
ObjectStore::Transaction&)+0x23e0) [0x54cd20]
 2: (OSD::kick_pg_split_queue()+0x880) [0x556d90]
 3: (OSD::handle_pg_notify(MOSDPGNotify*)+0x4b6) [0x559546]
 4: (OSD::_dispatch(Message*)+0x608) [0x560e58]
 5: (OSD::ms_dispatch(Message*)+0x11e) [0x561b7e]
 6: (SimpleMessenger::dispatch_entry()+0x76b) [0x5c844b]
 7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4b5cfc]
 8: (()+0x7efc) [0x7fe3ebda3efc]
 9: (clone()+0x6d) [0x7fe3ea3d489d]
*** Caught signal (Aborted) **
 in thread 7fe3df8c4700
 ceph version 0.41 (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d)
 1: /usr/bin/ceph-osd() [0x6099f6]
 2: (()+0x10060) [0x7fe3ebdac060]
 3: (gsignal()+0x35) [0x7fe3ea3293a5]
 4: (abort()+0x17b) [0x7fe3ea32cb0b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fe3eabe7d7d]
 6: (()+0xb9f26) [0x7fe3eabe5f26]
 7: (()+0xb9f53) [0x7fe3eabe5f53]
 8: (()+0xba04e) [0x7fe3eabe604e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x200) [0x5dc6b0]
 10: (OSD::split_pg(PG*, std::map<pg_t, PG*, std::less<pg_t>,
std::allocator<std::pair<pg_t const, PG*> > >&,
ObjectStore::Transaction&)+0x23e0) [0x54cd20]
 11: (OSD::kick_pg_split_queue()+0x880) [0x556d90]
 12: (OSD::handle_pg_notify(MOSDPGNotify*)+0x4b6) [0x559546]
 13: (OSD::_dispatch(Message*)+0x608) [0x560e58]
 14: (OSD::ms_dispatch(Message*)+0x11e) [0x561b7e]
 15: (SimpleMessenger::dispatch_entry()+0x76b) [0x5c844b]
 16: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4b5cfc]
 17: (()+0x7efc) [0x7fe3ebda3efc]
 18: (clone()+0x6d) [0x7fe3ea3d489d]
2012-02-20 18:23:57.915653 7fa818e3e7a0 ceph version 0.41
(commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d), process ceph-osd,
pid 6596

Do you have any ideas? If you need data from the cluster or core dumps
from the OSDs, I have a lot of them, but they are large.

-- 
-----
Regards,

Sławek "sZiBis" Skowron
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: Serious problem after increase pg_num in pool
  2012-02-20 19:16 Serious problem after increase pg_num in pool Sławomir Skowron
@ 2012-02-20 19:35 ` Sławomir Skowron
  2012-02-20 20:19   ` Sage Weil
  0 siblings, 1 reply; 7+ messages in thread
From: Sławomir Skowron @ 2012-02-20 19:35 UTC (permalink / raw)
  To: ceph-devel

I also see this in ceph -w:

2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611270 osd.76
10.177.64.8:6872/5395 49 : [ERR] mkpg 7.e up [76,11] != acting [76]
2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611308 osd.76
10.177.64.8:6872/5395 50 : [ERR] mkpg 7.16 up [76,11] != acting [76]
2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611339 osd.76
10.177.64.8:6872/5395 51 : [ERR] mkpg 7.1e up [76,11] != acting [76]
2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611369 osd.76
10.177.64.8:6872/5395 52 : [ERR] mkpg 7.26 up [76,11] != acting [76]
2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611399 osd.76
10.177.64.8:6872/5395 53 : [ERR] mkpg 7.2e up [76,11] != acting [76]
2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611428 osd.76
10.177.64.8:6872/5395 54 : [ERR] mkpg 7.36 up [76,11] != acting [76]
2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611458 osd.76
10.177.64.8:6872/5395 55 : [ERR] mkpg 7.3e up [76,11] != acting [76]
2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611488 osd.76
10.177.64.8:6872/5395 56 : [ERR] mkpg 7.46 up [76,11] != acting [76]
2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611517 osd.76
10.177.64.8:6872/5395 57 : [ERR] mkpg 7.4e up [76,11] != acting [76]
2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611547 osd.76
10.177.64.8:6872/5395 58 : [ERR] mkpg 7.56 up [76,11] != acting [76]
2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611577 osd.76
10.177.64.8:6872/5395 59 : [ERR] mkpg 7.5e up [76,11] != acting [76]
2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618816 osd.20
10.177.64.4:6839/6735 54 : [ERR] mkpg 7.f up [51,20,64] != acting
[20,51,64]
2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618854 osd.20
10.177.64.4:6839/6735 55 : [ERR] mkpg 7.17 up [51,20,64] != acting
[20,51,64]
2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618883 osd.20
10.177.64.4:6839/6735 56 : [ERR] mkpg 7.1f up [51,20,64] != acting
[20,51,64]
2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618912 osd.20
10.177.64.4:6839/6735 57 : [ERR] mkpg 7.27 up [51,20,64] != acting
[20,51,64]
2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618941 osd.20
10.177.64.4:6839/6735 58 : [ERR] mkpg 7.2f up [51,20,64] != acting
[20,51,64]
2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618970 osd.20
10.177.64.4:6839/6735 59 : [ERR] mkpg 7.37 up [51,20,64] != acting
[20,51,64]
2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618999 osd.20
10.177.64.4:6839/6735 60 : [ERR] mkpg 7.3f up [51,20,64] != acting
[20,51,64]
2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619027 osd.20
10.177.64.4:6839/6735 61 : [ERR] mkpg 7.47 up [51,20,64] != acting
[20,51,64]
2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619056 osd.20
10.177.64.4:6839/6735 62 : [ERR] mkpg 7.4f up [51,20,64] != acting
[20,51,64]
2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619085 osd.20
10.177.64.4:6839/6735 63 : [ERR] mkpg 7.57 up [51,20,64] != acting
[20,51,64]
2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619113 osd.20
10.177.64.4:6839/6735 64 : [ERR] mkpg 7.5f up [51,20,64] != acting
[20,51,64]
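
The mkpg errors above fall into two groups: one where up and acting contain different OSDs, and one where they contain the same OSDs in a different order. A minimal sketch that classifies them and, assuming the part after the dot in the PG id is a hexadecimal seed and that a new PG's parent under the old pg_num of 8 is the seed's low three bits (an assumption, not something stated in this thread), also shows which original PG each new one would have been split from. The name describe_mkpg is hypothetical.

```python
import re

# A sample of the mkpg errors above, with the mail wrapping undone.
MKPG_LINES = [
    "osd.76 10.177.64.8:6872/5395 49 : [ERR] mkpg 7.e up [76,11] != acting [76]",
    "osd.76 10.177.64.8:6872/5395 59 : [ERR] mkpg 7.5e up [76,11] != acting [76]",
    "osd.20 10.177.64.4:6839/6735 54 : [ERR] mkpg 7.f up [51,20,64] != acting [20,51,64]",
    "osd.20 10.177.64.4:6839/6735 64 : [ERR] mkpg 7.5f up [51,20,64] != acting [20,51,64]",
]

MKPG = re.compile(
    r"mkpg (\d+)\.([0-9a-f]+) up \[([\d,]*)\] != acting \[([\d,]*)\]"
)

OLD_PG_NUM = 8  # pg_num before the increase, as given in the report


def describe_mkpg(line, old_pg_num=OLD_PG_NUM):
    """Classify one mkpg error and guess the parent PG under the old pg_num."""
    m = MKPG.search(line)
    if m is None:
        raise ValueError("not a mkpg error line")
    pool = int(m.group(1))
    seed = int(m.group(2), 16)  # assumption: the seed is printed in hex
    up = [int(x) for x in m.group(3).split(",") if x]
    acting = [int(x) for x in m.group(4).split(",") if x]
    if set(up) == set(acting):
        kind = "same OSDs, different order"
    else:
        kind = "different OSDs"
    # Assumption: with a power-of-two old pg_num, the parent is the low bits.
    parent = "%d.%x" % (pool, seed & (old_pg_num - 1))
    return {"pg": "%d.%x" % (pool, seed), "parent": parent,
            "up": up, "acting": acting, "kind": kind}


if __name__ == "__main__":
    for line in MKPG_LINES:
        info = describe_mkpg(line)
        print(info["pg"], "parent", info["parent"], "-", info["kind"])
```

Under these assumptions every PG reported by osd.76 maps back to 7.6 and every PG reported by osd.20 maps back to 7.7.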

2012/2/20 Sławomir Skowron <slawomir.skowron@gmail.com>:
> After increasing pg_num from 8 to 100 in the .rgw.buckets pool, I have run
> into some serious problems.
>
> pool name       category         KB  objects  clones  degraded  unfound     rd  rd KB       wr      wr KB
> .intent-log     -              4662       19       0         0        0      0      0    26502      26501
> .log            -                 0        0       0         0        0      0      0   913732     913342
> .rgw            -                 1       10       0         0        0      1      0        9          7
> .rgw.buckets    -          39582566    73707       0      8061        0  86594      0   610896   36050541
> .rgw.control    -                 0        1       0         0        0      0      0        0          0
> .users          -                 1        1       0         0        0      0      0        1          1
> .users.uid      -                 1        2       0         0        0      2      1        3          3
> data            -                 0        0       0         0        0      0      0        0          0
> metadata        -                 0        0       0         0        0      0      0        0          0
> rbd             -          21590723     5328       0         1        0     77     75  3013595  378345507
>   total used       229514252        79068
>   total avail    19685615164
>   total space    20980898464
>
> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384251 mon.0
> 10.177.64.4:6789/0 36135 : [INF] osd.28 10.177.64.6:6806/824 failed
> (by osd.55 10.177.64.8:6809/28642)
> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384275 mon.0
> 10.177.64.4:6789/0 36136 : [INF] osd.37 10.177.64.6:6841/29133 failed
> (by osd.55 10.177.64.8:6809/28642)
> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384301 mon.0
> 10.177.64.4:6789/0 36137 : [INF] osd.7 10.177.64.4:6813/8223 failed
> (by osd.55 10.177.64.8:6809/28642)
> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384327 mon.0
> 10.177.64.4:6789/0 36138 : [INF] osd.44 10.177.64.6:6859/2370 failed
> (by osd.55 10.177.64.8:6809/28642)
> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384353 mon.0
> 10.177.64.4:6789/0 36139 : [INF] osd.49 10.177.64.6:6865/29878 failed
> (by osd.55 10.177.64.8:6809/28642)
> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384384 mon.0
> 10.177.64.4:6789/0 36140 : [INF] osd.17 10.177.64.4:6827/5909 failed
> (by osd.55 10.177.64.8:6809/28642)
> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384410 mon.0
> 10.177.64.4:6789/0 36141 : [INF] osd.12 10.177.64.4:6810/5410 failed
> (by osd.55 10.177.64.8:6809/28642)
> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384435 mon.0
> 10.177.64.4:6789/0 36142 : [INF] osd.39 10.177.64.6:6843/12733 failed
> (by osd.55 10.177.64.8:6809/28642)
> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384461 mon.0
> 10.177.64.4:6789/0 36143 : [INF] osd.42 10.177.64.6:6848/13067 failed
> (by osd.55 10.177.64.8:6809/28642)
> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384485 mon.0
> 10.177.64.4:6789/0 36144 : [INF] osd.31 10.177.64.6:6840/1233 failed
> (by osd.55 10.177.64.8:6809/28642)
> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384513 mon.0
> 10.177.64.4:6789/0 36145 : [INF] osd.36 10.177.64.6:6830/12573 failed
> (by osd.55 10.177.64.8:6809/28642)
> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384537 mon.0
> 10.177.64.4:6789/0 36146 : [INF] osd.38 10.177.64.6:6833/32587 failed
> (by osd.55 10.177.64.8:6809/28642)
> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384567 mon.0
> 10.177.64.4:6789/0 36147 : [INF] osd.5 10.177.64.4:6873/7842 failed
> (by osd.55 10.177.64.8:6809/28642)
> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384596 mon.0
> 10.177.64.4:6789/0 36148 : [INF] osd.21 10.177.64.4:6844/11607 failed
> (by osd.55 10.177.64.8:6809/28642)
> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384622 mon.0
> 10.177.64.4:6789/0 36149 : [INF] osd.23 10.177.64.4:6853/6826 failed
> (by osd.55 10.177.64.8:6809/28642)
> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384661 mon.0
> 10.177.64.4:6789/0 36150 : [INF] osd.51 10.177.64.6:6858/15894 failed
> (by osd.55 10.177.64.8:6809/28642)
> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384693 mon.0
> 10.177.64.4:6789/0 36151 : [INF] osd.48 10.177.64.6:6862/13476 failed
> (by osd.55 10.177.64.8:6809/28642)
> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384723 mon.0
> 10.177.64.4:6789/0 36152 : [INF] osd.32 10.177.64.6:6815/3701 failed
> (by osd.55 10.177.64.8:6809/28642)
> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384759 mon.0
> 10.177.64.4:6789/0 36153 : [INF] osd.41 10.177.64.6:6847/1861 failed
> (by osd.55 10.177.64.8:6809/28642)
> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384790 mon.0
> 10.177.64.4:6789/0 36154 : [INF] osd.0 10.177.64.4:6800/5230 failed
> (by osd.55 10.177.64.8:6809/28642)
> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384814 mon.0
> 10.177.64.4:6789/0 36155 : [INF] osd.3 10.177.64.4:6865/7242 failed
> (by osd.55 10.177.64.8:6809/28642)
> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384838 mon.0
> 10.177.64.4:6789/0 36156 : [INF] osd.1 10.177.64.4:6804/9729 failed
> (by osd.55 10.177.64.8:6809/28642)
> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384864 mon.0
> 10.177.64.4:6789/0 36157 : [INF] osd.47 10.177.64.6:6866/13924 failed
> (by osd.55 10.177.64.8:6809/28642)
> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384896 mon.0
> 10.177.64.4:6789/0 36158 : [INF] osd.45 10.177.64.6:6857/4401 failed
> (by osd.55 10.177.64.8:6809/28642)
> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384928 mon.0
> 10.177.64.4:6789/0 36159 : [INF] osd.20 10.177.64.4:6842/6246 failed
> (by osd.55 10.177.64.8:6809/28642)
> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384952 mon.0
> 10.177.64.4:6789/0 36160 : [INF] osd.16 10.177.64.4:6821/5833 failed
> (by osd.55 10.177.64.8:6809/28642)
> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384982 mon.0
> 10.177.64.4:6789/0 36161 : [INF] osd.35 10.177.64.6:6824/3877 failed
> (by osd.55 10.177.64.8:6809/28642)
> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.385007 mon.0
> 10.177.64.4:6789/0 36162 : [INF] osd.3 10.177.64.4:6865/7242 failed
> (by osd.55 10.177.64.8:6809/28642)
> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.385032 mon.0
> 10.177.64.4:6789/0 36163 : [INF] osd.7 10.177.64.4:6813/8223 failed
> (by osd.55 10.177.64.8:6809/28642)
> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.385059 mon.0
> 10.177.64.4:6789/0 36164 : [INF] osd.19 10.177.64.4:6831/10499 failed
> (by osd.55 10.177.64.8:6809/28642)
> 2012-02-20 20:06:10.851483    pg v172582: 10548 pgs: 92 creating, 1
> active, 9713 active+clean, 3 active+degraded+backfill, 657 peering, 77
> down+peering, 5 active+degraded; 59744 MB data, 218 GB used, 18773 GB
> / 20008 GB avail; 8071/237184 degraded (3.403%)
> 2012-02-20 20:06:10.967491   osd e7436: 78 osds: 70 up, 73 in
> 2012-02-20 20:06:10.990903   log 2012-02-20 20:05:56.448227 mon.2
> 10.177.64.8:6789/0 134 : [INF] mon.2 calling new monitor election
> 2012-02-20 20:06:10.990903   log 2012-02-20 20:05:58.252635 mon.1
> 10.177.64.6:6789/0 3929 : [INF] mon.1 calling new monitor election
> 2012-02-20 20:06:11.034669    pg v172583: 10548 pgs: 92 creating, 1
> active, 9713 active+clean, 3 active+degraded+backfill, 657 peering, 77
> down+peering, 5 active+degraded; 59744 MB data, 218 GB used, 18773 GB
> / 20008 GB avail; 8071/237184 degraded (3.403%)
> 2012-02-20 20:06:11.958126   osd e7437: 78 osds: 70 up, 73 in
> 2012-02-20 20:06:12.068650    pg v172584: 10548 pgs: 92 creating, 1
> active, 9711 active+clean, 3 active+degraded+backfill, 659 peering, 77
> down+peering, 5 active+degraded; 59744 MB data, 218 GB used, 18773 GB
> / 20008 GB avail; 8067/237184 degraded (3.401%)
> 2012-02-20 20:06:12.947997   osd e7438: 78 osds: 70 up, 73 in
> 2012-02-20 20:06:13.770942    pg v172585: 10548 pgs: 3 inactive, 92
> creating, 1 active, 9824 active+clean, 3 active+degraded+backfill, 541
> peering, 77 down+peering, 7 active+degraded; 59744 MB data, 218 GB
> used, 18773 GB / 20008 GB avail; 8067/237184 degraded (3.401%)
> 2012-02-20 20:06:14.686248    pg v172586: 10548 pgs: 3 inactive, 92
> creating, 1 active, 9894 active+clean, 3 active+degraded+backfill, 471
> peering, 77 down+peering, 7 active+degraded; 59744 MB data, 218 GB
> used, 18773 GB / 20008 GB avail; 8067/237184 degraded (3.401%)
> 2012-02-20 20:06:15.340365    pg v172587: 10548 pgs: 3 inactive, 92
> creating, 1 active, 9915 active+clean, 3 active+degraded+backfill, 447
> peering, 77 down+peering, 10 active+degraded; 59744 MB data, 218 GB
> used, 18773 GB / 20008 GB avail; 8067/237184 degraded (3.401%)
> 2012-02-20 20:06:16.852264    pg v172588: 10548 pgs: 3 inactive, 92
> creating, 84 active, 10094 active+clean, 3 active+degraded+backfill,
> 179 peering, 77 down+peering, 16 active+degraded; 59744 MB data, 218
> GB used, 18773 GB / 20008 GB avail; 8067/237184 degraded (3.401%)
>
> OSDs keep failing, over and over, and then other ones fail. The number of
> up OSDs fluctuates between 62 and 70-72: it drops, then climbs back up
> again.
>
> 2012-02-20 20:09:47.305016 7f816009e700 osd.20 7476 heartbeat_check:
> no heartbeat from osd.64 since 2012-02-20 20:09:30.286408 (cutoff
> 2012-02-20 20:09:42.304975)
> 2012-02-20 20:09:47.410159 7f816c9b8700 osd.20 7476 heartbeat_check:
> no heartbeat from osd.61 since 2012-02-20 20:09:29.807115 (cutoff
> 2012-02-20 20:09:42.410144)
> 2012-02-20 20:09:47.410177 7f816c9b8700 osd.20 7476 heartbeat_check:
> no heartbeat from osd.64 since 2012-02-20 20:09:30.286408 (cutoff
> 2012-02-20 20:09:42.410144)
> 2012-02-20 20:09:47.906661 7f816009e700 osd.20 7476 heartbeat_check:
> no heartbeat from osd.61 since 2012-02-20 20:09:29.807115 (cutoff
> 2012-02-20 20:09:42.906639)
> 2012-02-20 20:09:47.906685 7f816009e700 osd.20 7476 heartbeat_check:
> no heartbeat from osd.64 since 2012-02-20 20:09:30.286408 (cutoff
> 2012-02-20 20:09:42.906639)
> 2012-02-20 20:09:48.114431 7f815660b700 -- 10.177.64.4:0/6389 >>
> 10.177.64.4:6854/5398 pipe(0x1398c500 sd=47 pgs=26 cs=2 l=0).connect
> claims to be 10.177.64.4:6854/17798 not 10.177.64.4:6854/5398 - wrong
> node!
> 2012-02-20 20:09:48.410333 7f816c9b8700 osd.20 7476 heartbeat_check:
> no heartbeat from osd.61 since 2012-02-20 20:09:29.807115 (cutoff
> 2012-02-20 20:09:43.410313)
> 2012-02-20 20:09:48.410361 7f816c9b8700 osd.20 7476 heartbeat_check:
> no heartbeat from osd.64 since 2012-02-20 20:09:30.286408 (cutoff
> 2012-02-20 20:09:43.410313)
> 2012-02-20 20:09:51.450127 7f814b75d700 -- 10.177.64.4:0/6389 >>
> 10.177.64.4:6855/17423 pipe(0xa86e780 sd=17 pgs=17 cs=2 l=0).connect
> claims to be 10.177.64.4:6855/17798 not 10.177.64.4:6855/17423 - wrong
> node!
> 2012-02-20 20:09:54.498949 7f814a248700 -- 10.177.64.4:0/6389 >>
> 10.177.64.4:6854/19396 pipe(0x38cc780 sd=25 pgs=8 cs=2 l=0).connect
> claims to be 10.177.64.4:6854/17798 not 10.177.64.4:6854/19396 - wrong
> node!
>
> Some of them go down with this:
>
> 2012-02-20 18:22:15.824992 7fe3ec1c97a0 ceph version 0.41
> (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d), process ceph-osd,
> pid 31379
> 2012-02-20 18:22:15.826476 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
> mount FIEMAP ioctl is supported
> 2012-02-20 18:22:15.826514 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
> mount did NOT detect btrfs
> 2012-02-20 18:22:15.826613 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
> mount found snaps <>
> 2012-02-20 18:22:15.826650 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
> mount: WRITEAHEAD journal mode explicitly enabled in conf
> 2012-02-20 18:22:16.415671 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
> mount FIEMAP ioctl is supported
> 2012-02-20 18:22:16.415703 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
> mount did NOT detect btrfs
> 2012-02-20 18:22:16.415744 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
> mount found snaps <>
> 2012-02-20 18:22:16.415758 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
> mount: WRITEAHEAD journal mode explicitly enabled in conf
> osd/OSD.cc: In function 'void OSD::split_pg(PG*, std::map<pg_t, PG*>&,
> ObjectStore::Transaction&)' thread 7fe3df8c4700 time 2012-02-20
> 18:22:19.900886
> osd/OSD.cc: 4066: FAILED assert(child)
>  ceph version 0.41 (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d)
>  1: (OSD::split_pg(PG*, std::map<pg_t, PG*, std::less<pg_t>,
> std::allocator<std::pair<pg_t const, PG*> > >&,
> ObjectStore::Transaction&)+0x23e0) [0x54cd20]
>  2: (OSD::kick_pg_split_queue()+0x880) [0x556d90]
>  3: (OSD::handle_pg_notify(MOSDPGNotify*)+0x4b6) [0x559546]
>  4: (OSD::_dispatch(Message*)+0x608) [0x560e58]
>  5: (OSD::ms_dispatch(Message*)+0x11e) [0x561b7e]
>  6: (SimpleMessenger::dispatch_entry()+0x76b) [0x5c844b]
>  7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4b5cfc]
>  8: (()+0x7efc) [0x7fe3ebda3efc]
>  9: (clone()+0x6d) [0x7fe3ea3d489d]
>  ceph version 0.41 (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d)
>  1: (OSD::split_pg(PG*, std::map<pg_t, PG*, std::less<pg_t>,
> std::allocator<std::pair<pg_t const, PG*> > >&,
> ObjectStore::Transaction&)+0x23e0) [0x54cd20]
>  2: (OSD::kick_pg_split_queue()+0x880) [0x556d90]
>  3: (OSD::handle_pg_notify(MOSDPGNotify*)+0x4b6) [0x559546]
>  4: (OSD::_dispatch(Message*)+0x608) [0x560e58]
>  5: (OSD::ms_dispatch(Message*)+0x11e) [0x561b7e]
>  6: (SimpleMessenger::dispatch_entry()+0x76b) [0x5c844b]
>  7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4b5cfc]
>  8: (()+0x7efc) [0x7fe3ebda3efc]
>  9: (clone()+0x6d) [0x7fe3ea3d489d]
> *** Caught signal (Aborted) **
>  in thread 7fe3df8c4700
>  ceph version 0.41 (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d)
>  1: /usr/bin/ceph-osd() [0x6099f6]
>  2: (()+0x10060) [0x7fe3ebdac060]
>  3: (gsignal()+0x35) [0x7fe3ea3293a5]
>  4: (abort()+0x17b) [0x7fe3ea32cb0b]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fe3eabe7d7d]
>  6: (()+0xb9f26) [0x7fe3eabe5f26]
>  7: (()+0xb9f53) [0x7fe3eabe5f53]
>  8: (()+0xba04e) [0x7fe3eabe604e]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x200) [0x5dc6b0]
>  10: (OSD::split_pg(PG*, std::map<pg_t, PG*, std::less<pg_t>,
> std::allocator<std::pair<pg_t const, PG*> > >&,
> ObjectStore::Transaction&)+0x23e0) [0x54cd20]
>  11: (OSD::kick_pg_split_queue()+0x880) [0x556d90]
>  12: (OSD::handle_pg_notify(MOSDPGNotify*)+0x4b6) [0x559546]
>  13: (OSD::_dispatch(Message*)+0x608) [0x560e58]
>  14: (OSD::ms_dispatch(Message*)+0x11e) [0x561b7e]
>  15: (SimpleMessenger::dispatch_entry()+0x76b) [0x5c844b]
>  16: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4b5cfc]
>  17: (()+0x7efc) [0x7fe3ebda3efc]
>  18: (clone()+0x6d) [0x7fe3ea3d489d]
> 2012-02-20 18:23:57.915653 7fa818e3e7a0 ceph version 0.41
> (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d), process ceph-osd,
> pid 6596
>
> Do you have any ideas? If you need some data from the cluster, or core
> dumps from the OSDs, I have a lot of them, but they are large.
>
> --
> -----
> Regards
>
> Sławek "sZiBis" Skowron



-- 
-----
Regards

Sławek "sZiBis" Skowron


* Re: Serious problem after increase pg_num in pool
  2012-02-20 19:35 ` Sławomir Skowron
@ 2012-02-20 20:19   ` Sage Weil
  2012-02-21  6:46     ` Sławomir Skowron
  0 siblings, 1 reply; 7+ messages in thread
From: Sage Weil @ 2012-02-20 20:19 UTC (permalink / raw)
  To: Sławomir Skowron; +Cc: ceph-devel

Ooh, the pg split functionality is currently broken, and we weren't 
planning on fixing it for a while longer.  I didn't realize it was still 
possible to trigger from the monitor.

I'm looking at how difficult it is to make it work (even inefficiently).  

How much data do you have in the cluster?

sage
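
A note for readers of the mkpg errors quoted below: the PG ids there fall
into two groups, which would be consistent with each new PG being a child
of one of the original 8. The sketch below is only an illustration, not
Ceph code. It assumes that the part of the id after the dot is a
hexadecimal seed and that, with the old pg_num of 8 (a power of two), a
child's parent seed is the child seed modulo 8. parent_of is a
hypothetical helper name.

```python
# Illustration only: group a few PG ids from the quoted "mkpg" errors
# by their presumed parent PG.
# Assumption: the suffix after the dot is a hexadecimal seed, and with the
# old pg_num of 8 the parent seed is simply seed % 8.
OLD_PG_NUM = 8

# A subset of the ids that appear in the quoted ceph -w output.
mkpg_ids = ["7.e", "7.16", "7.1e", "7.26", "7.f", "7.17", "7.1f", "7.27"]


def parent_of(pgid: str, old_pg_num: int = OLD_PG_NUM) -> str:
    """Return the presumed parent PG id (hypothetical helper)."""
    pool, seed_hex = pgid.split(".")
    seed = int(seed_hex, 16)
    return f"{pool}.{seed % old_pg_num:x}"


by_parent = {}
for pgid in mkpg_ids:
    by_parent.setdefault(parent_of(pgid), []).append(pgid)

for parent, children in sorted(by_parent.items()):
    print(parent, "->", ", ".join(children))
```

Under these assumptions the ids reported by osd.76 all map to one parent
and the ids reported by osd.20 all map to another.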




On Mon, 20 Feb 2012, Sławomir Skowron wrote:

> And this is in ceph -w:
> 
> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611270 osd.76
> 10.177.64.8:6872/5395 49 : [ERR] mkpg 7.e up [76,11] != acting [76]
> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611308 osd.76
> 10.177.64.8:6872/5395 50 : [ERR] mkpg 7.16 up [76,11] != acting [76]
> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611339 osd.76
> 10.177.64.8:6872/5395 51 : [ERR] mkpg 7.1e up [76,11] != acting [76]
> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611369 osd.76
> 10.177.64.8:6872/5395 52 : [ERR] mkpg 7.26 up [76,11] != acting [76]
> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611399 osd.76
> 10.177.64.8:6872/5395 53 : [ERR] mkpg 7.2e up [76,11] != acting [76]
> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611428 osd.76
> 10.177.64.8:6872/5395 54 : [ERR] mkpg 7.36 up [76,11] != acting [76]
> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611458 osd.76
> 10.177.64.8:6872/5395 55 : [ERR] mkpg 7.3e up [76,11] != acting [76]
> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611488 osd.76
> 10.177.64.8:6872/5395 56 : [ERR] mkpg 7.46 up [76,11] != acting [76]
> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611517 osd.76
> 10.177.64.8:6872/5395 57 : [ERR] mkpg 7.4e up [76,11] != acting [76]
> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611547 osd.76
> 10.177.64.8:6872/5395 58 : [ERR] mkpg 7.56 up [76,11] != acting [76]
> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611577 osd.76
> 10.177.64.8:6872/5395 59 : [ERR] mkpg 7.5e up [76,11] != acting [76]
> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618816 osd.20
> 10.177.64.4:6839/6735 54 : [ERR] mkpg 7.f up [51,20,64] != acting
> [20,51,64]
> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618854 osd.20
> 10.177.64.4:6839/6735 55 : [ERR] mkpg 7.17 up [51,20,64] != acting
> [20,51,64]
> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618883 osd.20
> 10.177.64.4:6839/6735 56 : [ERR] mkpg 7.1f up [51,20,64] != acting
> [20,51,64]
> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618912 osd.20
> 10.177.64.4:6839/6735 57 : [ERR] mkpg 7.27 up [51,20,64] != acting
> [20,51,64]
> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618941 osd.20
> 10.177.64.4:6839/6735 58 : [ERR] mkpg 7.2f up [51,20,64] != acting
> [20,51,64]
> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618970 osd.20
> 10.177.64.4:6839/6735 59 : [ERR] mkpg 7.37 up [51,20,64] != acting
> [20,51,64]
> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618999 osd.20
> 10.177.64.4:6839/6735 60 : [ERR] mkpg 7.3f up [51,20,64] != acting
> [20,51,64]
> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619027 osd.20
> 10.177.64.4:6839/6735 61 : [ERR] mkpg 7.47 up [51,20,64] != acting
> [20,51,64]
> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619056 osd.20
> 10.177.64.4:6839/6735 62 : [ERR] mkpg 7.4f up [51,20,64] != acting
> [20,51,64]
> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619085 osd.20
> 10.177.64.4:6839/6735 63 : [ERR] mkpg 7.57 up [51,20,64] != acting
> [20,51,64]
> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619113 osd.20
> 10.177.64.4:6839/6735 64 : [ERR] mkpg 7.5f up [51,20,64] != acting
> [20,51,64]
> 
> 2012/2/20 Sławomir Skowron <slawomir.skowron@gmail.com>:
> > After increasing pg_num from 8 to 100 in .rgw.buckets, I have some
> > serious problems.
> >
> > pool name      category         KB  objects  clones  degraded  unfound     rd  rd KB       wr      wr KB
> > .intent-log    -              4662       19       0         0        0      0      0    26502      26501
> > .log           -                 0        0       0         0        0      0      0   913732     913342
> > .rgw           -                 1       10       0         0        0      1      0        9          7
> > .rgw.buckets   -          39582566    73707       0      8061        0  86594      0   610896   36050541
> > .rgw.control   -                 0        1       0         0        0      0      0        0          0
> > .users         -                 1        1       0         0        0      0      0        1          1
> > .users.uid     -                 1        2       0         0        0      2      1        3          3
> > data           -                 0        0       0         0        0      0      0        0          0
> > metadata       -                 0        0       0         0        0      0      0        0          0
> > rbd            -          21590723     5328       0         1        0     77     75  3013595  378345507
> >  total used       229514252        79068
> >  total avail    19685615164
> >  total space    20980898464
> >
> > 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384251 mon.0
> > 10.177.64.4:6789/0 36135 : [INF] osd.28 10.177.64.6:6806/824 failed
> > (by osd.55 10.177.64.8:6809/28642)
> > 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384275 mon.0
> > 10.177.64.4:6789/0 36136 : [INF] osd.37 10.177.64.6:6841/29133 failed
> > (by osd.55 10.177.64.8:6809/28642)
> > 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384301 mon.0
> > 10.177.64.4:6789/0 36137 : [INF] osd.7 10.177.64.4:6813/8223 failed
> > (by osd.55 10.177.64.8:6809/28642)
> > 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384327 mon.0
> > 10.177.64.4:6789/0 36138 : [INF] osd.44 10.177.64.6:6859/2370 failed
> > (by osd.55 10.177.64.8:6809/28642)
> > 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384353 mon.0
> > 10.177.64.4:6789/0 36139 : [INF] osd.49 10.177.64.6:6865/29878 failed
> > (by osd.55 10.177.64.8:6809/28642)
> > 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384384 mon.0
> > 10.177.64.4:6789/0 36140 : [INF] osd.17 10.177.64.4:6827/5909 failed
> > (by osd.55 10.177.64.8:6809/28642)
> > 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384410 mon.0
> > 10.177.64.4:6789/0 36141 : [INF] osd.12 10.177.64.4:6810/5410 failed
> > (by osd.55 10.177.64.8:6809/28642)
> > 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384435 mon.0
> > 10.177.64.4:6789/0 36142 : [INF] osd.39 10.177.64.6:6843/12733 failed
> > (by osd.55 10.177.64.8:6809/28642)
> > 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384461 mon.0
> > 10.177.64.4:6789/0 36143 : [INF] osd.42 10.177.64.6:6848/13067 failed
> > (by osd.55 10.177.64.8:6809/28642)
> > 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384485 mon.0
> > 10.177.64.4:6789/0 36144 : [INF] osd.31 10.177.64.6:6840/1233 failed
> > (by osd.55 10.177.64.8:6809/28642)
> > 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384513 mon.0
> > 10.177.64.4:6789/0 36145 : [INF] osd.36 10.177.64.6:6830/12573 failed
> > (by osd.55 10.177.64.8:6809/28642)
> > 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384537 mon.0
> > 10.177.64.4:6789/0 36146 : [INF] osd.38 10.177.64.6:6833/32587 failed
> > (by osd.55 10.177.64.8:6809/28642)
> > 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384567 mon.0
> > 10.177.64.4:6789/0 36147 : [INF] osd.5 10.177.64.4:6873/7842 failed
> > (by osd.55 10.177.64.8:6809/28642)
> > 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384596 mon.0
> > 10.177.64.4:6789/0 36148 : [INF] osd.21 10.177.64.4:6844/11607 failed
> > (by osd.55 10.177.64.8:6809/28642)
> > 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384622 mon.0
> > 10.177.64.4:6789/0 36149 : [INF] osd.23 10.177.64.4:6853/6826 failed
> > (by osd.55 10.177.64.8:6809/28642)
> > 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384661 mon.0
> > 10.177.64.4:6789/0 36150 : [INF] osd.51 10.177.64.6:6858/15894 failed
> > (by osd.55 10.177.64.8:6809/28642)
> > 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384693 mon.0
> > 10.177.64.4:6789/0 36151 : [INF] osd.48 10.177.64.6:6862/13476 failed
> > (by osd.55 10.177.64.8:6809/28642)
> > 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384723 mon.0
> > 10.177.64.4:6789/0 36152 : [INF] osd.32 10.177.64.6:6815/3701 failed
> > (by osd.55 10.177.64.8:6809/28642)
> > 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384759 mon.0
> > 10.177.64.4:6789/0 36153 : [INF] osd.41 10.177.64.6:6847/1861 failed
> > (by osd.55 10.177.64.8:6809/28642)
> > 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384790 mon.0
> > 10.177.64.4:6789/0 36154 : [INF] osd.0 10.177.64.4:6800/5230 failed
> > (by osd.55 10.177.64.8:6809/28642)
> > 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384814 mon.0
> > 10.177.64.4:6789/0 36155 : [INF] osd.3 10.177.64.4:6865/7242 failed
> > (by osd.55 10.177.64.8:6809/28642)
> > 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384838 mon.0
> > 10.177.64.4:6789/0 36156 : [INF] osd.1 10.177.64.4:6804/9729 failed
> > (by osd.55 10.177.64.8:6809/28642)
> > 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384864 mon.0
> > 10.177.64.4:6789/0 36157 : [INF] osd.47 10.177.64.6:6866/13924 failed
> > (by osd.55 10.177.64.8:6809/28642)
> > 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384896 mon.0
> > 10.177.64.4:6789/0 36158 : [INF] osd.45 10.177.64.6:6857/4401 failed
> > (by osd.55 10.177.64.8:6809/28642)
> > 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384928 mon.0
> > 10.177.64.4:6789/0 36159 : [INF] osd.20 10.177.64.4:6842/6246 failed
> > (by osd.55 10.177.64.8:6809/28642)
> > 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384952 mon.0
> > 10.177.64.4:6789/0 36160 : [INF] osd.16 10.177.64.4:6821/5833 failed
> > (by osd.55 10.177.64.8:6809/28642)
> > 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384982 mon.0
> > 10.177.64.4:6789/0 36161 : [INF] osd.35 10.177.64.6:6824/3877 failed
> > (by osd.55 10.177.64.8:6809/28642)
> > 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.385007 mon.0
> > 10.177.64.4:6789/0 36162 : [INF] osd.3 10.177.64.4:6865/7242 failed
> > (by osd.55 10.177.64.8:6809/28642)
> > 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.385032 mon.0
> > 10.177.64.4:6789/0 36163 : [INF] osd.7 10.177.64.4:6813/8223 failed
> > (by osd.55 10.177.64.8:6809/28642)
> > 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.385059 mon.0
> > 10.177.64.4:6789/0 36164 : [INF] osd.19 10.177.64.4:6831/10499 failed
> > (by osd.55 10.177.64.8:6809/28642)
> > 2012-02-20 20:06:10.851483    pg v172582: 10548 pgs: 92 creating, 1
> > active, 9713 active+clean, 3 active+degraded+backfill, 657 peering, 77
> > down+peering, 5 active+degraded; 59744 MB data, 218 GB used, 18773 GB
> > / 20008 GB avail; 8071/237184 degraded (3.403%)
> > 2012-02-20 20:06:10.967491   osd e7436: 78 osds: 70 up, 73 in
> > 2012-02-20 20:06:10.990903   log 2012-02-20 20:05:56.448227 mon.2
> > 10.177.64.8:6789/0 134 : [INF] mon.2 calling new monitor election
> > 2012-02-20 20:06:10.990903   log 2012-02-20 20:05:58.252635 mon.1
> > 10.177.64.6:6789/0 3929 : [INF] mon.1 calling new monitor election
> > 2012-02-20 20:06:11.034669    pg v172583: 10548 pgs: 92 creating, 1
> > active, 9713 active+clean, 3 active+degraded+backfill, 657 peering, 77
> > down+peering, 5 active+degraded; 59744 MB data, 218 GB used, 18773 GB
> > / 20008 GB avail; 8071/237184 degraded (3.403%)
> > 2012-02-20 20:06:11.958126   osd e7437: 78 osds: 70 up, 73 in
> > 2012-02-20 20:06:12.068650    pg v172584: 10548 pgs: 92 creating, 1
> > active, 9711 active+clean, 3 active+degraded+backfill, 659 peering, 77
> > down+peering, 5 active+degraded; 59744 MB data, 218 GB used, 18773 GB
> > / 20008 GB avail; 8067/237184 degraded (3.401%)
> > 2012-02-20 20:06:12.947997   osd e7438: 78 osds: 70 up, 73 in
> > 2012-02-20 20:06:13.770942    pg v172585: 10548 pgs: 3 inactive, 92
> > creating, 1 active, 9824 active+clean, 3 active+degraded+backfill, 541
> > peering, 77 down+peering, 7 active+degraded; 59744 MB data, 218 GB
> > used, 18773 GB / 20008 GB avail; 8067/237184 degraded (3.401%)
> > 2012-02-20 20:06:14.686248    pg v172586: 10548 pgs: 3 inactive, 92
> > creating, 1 active, 9894 active+clean, 3 active+degraded+backfill, 471
> > peering, 77 down+peering, 7 active+degraded; 59744 MB data, 218 GB
> > used, 18773 GB / 20008 GB avail; 8067/237184 degraded (3.401%)
> > 2012-02-20 20:06:15.340365    pg v172587: 10548 pgs: 3 inactive, 92
> > creating, 1 active, 9915 active+clean, 3 active+degraded+backfill, 447
> > peering, 77 down+peering, 10 active+degraded; 59744 MB data, 218 GB
> > used, 18773 GB / 20008 GB avail; 8067/237184 degraded (3.401%)
> > 2012-02-20 20:06:16.852264    pg v172588: 10548 pgs: 3 inactive, 92
> > creating, 84 active, 10094 active+clean, 3 active+degraded+backfill,
> > 179 peering, 77 down+peering, 16 active+degraded; 59744 MB data, 218
> > GB used, 18773 GB / 20008 GB avail; 8067/237184 degraded (3.401%)
> >
> > OSDs keep failing, again and again, one after another. The number of
> > up OSDs changes from 62 to 70-72, then goes down, and then goes up
> > again.
> >
> > 2012-02-20 20:09:47.305016 7f816009e700 osd.20 7476 heartbeat_check:
> > no heartbeat from osd.64 since 2012-02-20 20:09:30.286408 (cutoff
> > 2012-02-20 20:09:42.304975)
> > 2012-02-20 20:09:47.410159 7f816c9b8700 osd.20 7476 heartbeat_check:
> > no heartbeat from osd.61 since 2012-02-20 20:09:29.807115 (cutoff
> > 2012-02-20 20:09:42.410144)
> > 2012-02-20 20:09:47.410177 7f816c9b8700 osd.20 7476 heartbeat_check:
> > no heartbeat from osd.64 since 2012-02-20 20:09:30.286408 (cutoff
> > 2012-02-20 20:09:42.410144)
> > 2012-02-20 20:09:47.906661 7f816009e700 osd.20 7476 heartbeat_check:
> > no heartbeat from osd.61 since 2012-02-20 20:09:29.807115 (cutoff
> > 2012-02-20 20:09:42.906639)
> > 2012-02-20 20:09:47.906685 7f816009e700 osd.20 7476 heartbeat_check:
> > no heartbeat from osd.64 since 2012-02-20 20:09:30.286408 (cutoff
> > 2012-02-20 20:09:42.906639)
> > 2012-02-20 20:09:48.114431 7f815660b700 -- 10.177.64.4:0/6389 >>
> > 10.177.64.4:6854/5398 pipe(0x1398c500 sd=47 pgs=26 cs=2 l=0).connect
> > claims to be 10.177.64.4:6854/17798 not 10.177.64.4:6854/5398 - wrong
> > node!
> > 2012-02-20 20:09:48.410333 7f816c9b8700 osd.20 7476 heartbeat_check:
> > no heartbeat from osd.61 since 2012-02-20 20:09:29.807115 (cutoff
> > 2012-02-20 20:09:43.410313)
> > 2012-02-20 20:09:48.410361 7f816c9b8700 osd.20 7476 heartbeat_check:
> > no heartbeat from osd.64 since 2012-02-20 20:09:30.286408 (cutoff
> > 2012-02-20 20:09:43.410313)
> > 2012-02-20 20:09:51.450127 7f814b75d700 -- 10.177.64.4:0/6389 >>
> > 10.177.64.4:6855/17423 pipe(0xa86e780 sd=17 pgs=17 cs=2 l=0).connect
> > claims to be 10.177.64.4:6855/17798 not 10.177.64.4:6855/17423 - wrong
> > node!
> > 2012-02-20 20:09:54.498949 7f814a248700 -- 10.177.64.4:0/6389 >>
> > 10.177.64.4:6854/19396 pipe(0x38cc780 sd=25 pgs=8 cs=2 l=0).connect
> > claims to be 10.177.64.4:6854/17798 not 10.177.64.4:6854/19396 - wrong
> > node!
> >
> > Some of them are going down with this:
> >
> > 2012-02-20 18:22:15.824992 7fe3ec1c97a0 ceph version 0.41
> > (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d), process ceph-osd,
> > pid 31379
> > 2012-02-20 18:22:15.826476 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
> > mount FIEMAP ioctl is supported
> > 2012-02-20 18:22:15.826514 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
> > mount did NOT detect btrfs
> > 2012-02-20 18:22:15.826613 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
> > mount found snaps <>
> > 2012-02-20 18:22:15.826650 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
> > mount: WRITEAHEAD journal mode explicitly enabled in conf
> > 2012-02-20 18:22:16.415671 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
> > mount FIEMAP ioctl is supported
> > 2012-02-20 18:22:16.415703 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
> > mount did NOT detect btrfs
> > 2012-02-20 18:22:16.415744 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
> > mount found snaps <>
> > 2012-02-20 18:22:16.415758 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
> > mount: WRITEAHEAD journal mode explicitly enabled in conf
> > osd/OSD.cc: In function 'void OSD::split_pg(PG*, std::map<pg_t, PG*>&,
> > ObjectStore::Transaction&)' thread 7fe3df8c4700 time 2012-02-20
> > 18:22:19.900886
> > osd/OSD.cc: 4066: FAILED assert(child)
> >  ceph version 0.41 (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d)
> >  1: (OSD::split_pg(PG*, std::map<pg_t, PG*, std::less<pg_t>,
> > std::allocator<std::pair<pg_t const, PG*> > >&,
> > ObjectStore::Transaction&)+0x23e0) [0x54cd20]
> >  2: (OSD::kick_pg_split_queue()+0x880) [0x556d90]
> >  3: (OSD::handle_pg_notify(MOSDPGNotify*)+0x4b6) [0x559546]
> >  4: (OSD::_dispatch(Message*)+0x608) [0x560e58]
> >  5: (OSD::ms_dispatch(Message*)+0x11e) [0x561b7e]
> >  6: (SimpleMessenger::dispatch_entry()+0x76b) [0x5c844b]
> >  7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4b5cfc]
> >  8: (()+0x7efc) [0x7fe3ebda3efc]
> >  9: (clone()+0x6d) [0x7fe3ea3d489d]
> >  ceph version 0.41 (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d)
> >  1: (OSD::split_pg(PG*, std::map<pg_t, PG*, std::less<pg_t>,
> > std::allocator<std::pair<pg_t const, PG*> > >&,
> > ObjectStore::Transaction&)+0x23e0) [0x54cd20]
> >  2: (OSD::kick_pg_split_queue()+0x880) [0x556d90]
> >  3: (OSD::handle_pg_notify(MOSDPGNotify*)+0x4b6) [0x559546]
> >  4: (OSD::_dispatch(Message*)+0x608) [0x560e58]
> >  5: (OSD::ms_dispatch(Message*)+0x11e) [0x561b7e]
> >  6: (SimpleMessenger::dispatch_entry()+0x76b) [0x5c844b]
> >  7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4b5cfc]
> >  8: (()+0x7efc) [0x7fe3ebda3efc]
> >  9: (clone()+0x6d) [0x7fe3ea3d489d]
> > *** Caught signal (Aborted) **
> >  in thread 7fe3df8c4700
> >  ceph version 0.41 (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d)
> >  1: /usr/bin/ceph-osd() [0x6099f6]
> >  2: (()+0x10060) [0x7fe3ebdac060]
> >  3: (gsignal()+0x35) [0x7fe3ea3293a5]
> >  4: (abort()+0x17b) [0x7fe3ea32cb0b]
> >  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fe3eabe7d7d]
> >  6: (()+0xb9f26) [0x7fe3eabe5f26]
> >  7: (()+0xb9f53) [0x7fe3eabe5f53]
> >  8: (()+0xba04e) [0x7fe3eabe604e]
> >  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > const*)+0x200) [0x5dc6b0]
> >  10: (OSD::split_pg(PG*, std::map<pg_t, PG*, std::less<pg_t>,
> > std::allocator<std::pair<pg_t const, PG*> > >&,
> > ObjectStore::Transaction&)+0x23e0) [0x54cd20]
> >  11: (OSD::kick_pg_split_queue()+0x880) [0x556d90]
> >  12: (OSD::handle_pg_notify(MOSDPGNotify*)+0x4b6) [0x559546]
> >  13: (OSD::_dispatch(Message*)+0x608) [0x560e58]
> >  14: (OSD::ms_dispatch(Message*)+0x11e) [0x561b7e]
> >  15: (SimpleMessenger::dispatch_entry()+0x76b) [0x5c844b]
> >  16: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4b5cfc]
> >  17: (()+0x7efc) [0x7fe3ebda3efc]
> >  18: (clone()+0x6d) [0x7fe3ea3d489d]
> > 2012-02-20 18:23:57.915653 7fa818e3e7a0 ceph version 0.41
> > (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d), process ceph-osd,
> > pid 6596
> >
> > Do you have any ideas? If you need some data from the cluster, or core
> > dumps from the OSDs, I have a lot of them, but they are large.
> >
> > --
> > -----
> > Regards
> >
> > Sławek "sZiBis" Skowron
> 
> 
> 
> -- 
> -----
> Regards
> 
> Sławek "sZiBis" Skowron
> 
> 


* Re: Serious problem after increase pg_num in pool
  2012-02-20 20:19   ` Sage Weil
@ 2012-02-21  6:46     ` Sławomir Skowron
  2012-02-21  7:23       ` Sławomir Skowron
  0 siblings, 1 reply; 7+ messages in thread
From: Sławomir Skowron @ 2012-02-21  6:46 UTC (permalink / raw)
  To: Sage Weil; +Cc: Sławomir Skowron, ceph-devel@vger.kernel.org

About 40 GB, stored in 3 copies, in the rgw bucket pool, and some data in
RBD, but it can be destroyed.

ceph -s reports 224 GB when the cluster is in its normal state.
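
To compare these figures with the pool table quoted below, a quick unit
conversion helps. This is a minimal sketch that assumes the KB column
counts units of 1024 bytes; the function names are made up for
illustration.

```python
# Figures copied from the quoted pool table (KB column and "total used").
RGW_BUCKETS_KB = 39582566
RBD_KB = 21590723
TOTAL_USED_KB = 229514252


def kb_to_gib(kb: int) -> float:
    """Convert KB to GiB, assuming 1 KB = 1024 bytes."""
    return kb / 1024 ** 2


def kb_to_gb(kb: int) -> float:
    """Convert KB to decimal GB, assuming 1 KB = 1024 bytes."""
    return kb * 1024 / 1000 ** 3


for name, kb in [
    (".rgw.buckets", RGW_BUCKETS_KB),
    ("rbd", RBD_KB),
    ("total used", TOTAL_USED_KB),
]:
    print(f"{name:14s} {kb_to_gib(kb):8.2f} GiB  {kb_to_gb(kb):8.2f} GB")
```

Under that assumption .rgw.buckets comes out at roughly 40 GB (decimal)
and the total used at roughly 218 GiB, which lines up with the "218 GB
used" in the status lines earlier in the thread.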

Regards

iSS

On 20 Feb 2012, at 21:19, Sage Weil <sage@newdream.net> wrote:

> Ooh, the pg split functionality is currently broken, and we weren't
> planning on fixing it for a while longer.  I didn't realize it was still
> possible to trigger from the monitor.
>
> I'm looking at how difficult it is to make it work (even inefficiently).
>
> How much data do you have in the cluster?
>
> sage
>
>
>
>
> On Mon, 20 Feb 2012, Sławomir Skowron wrote:
>
>> And this is in ceph -w:
>>
>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611270 osd.76
>> 10.177.64.8:6872/5395 49 : [ERR] mkpg 7.e up [76,11] != acting [76]
>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611308 osd.76
>> 10.177.64.8:6872/5395 50 : [ERR] mkpg 7.16 up [76,11] != acting [76]
>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611339 osd.76
>> 10.177.64.8:6872/5395 51 : [ERR] mkpg 7.1e up [76,11] != acting [76]
>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611369 osd.76
>> 10.177.64.8:6872/5395 52 : [ERR] mkpg 7.26 up [76,11] != acting [76]
>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611399 osd.76
>> 10.177.64.8:6872/5395 53 : [ERR] mkpg 7.2e up [76,11] != acting [76]
>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611428 osd.76
>> 10.177.64.8:6872/5395 54 : [ERR] mkpg 7.36 up [76,11] != acting [76]
>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611458 osd.76
>> 10.177.64.8:6872/5395 55 : [ERR] mkpg 7.3e up [76,11] != acting [76]
>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611488 osd.76
>> 10.177.64.8:6872/5395 56 : [ERR] mkpg 7.46 up [76,11] != acting [76]
>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611517 osd.76
>> 10.177.64.8:6872/5395 57 : [ERR] mkpg 7.4e up [76,11] != acting [76]
>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611547 osd.76
>> 10.177.64.8:6872/5395 58 : [ERR] mkpg 7.56 up [76,11] != acting [76]
>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611577 osd.76
>> 10.177.64.8:6872/5395 59 : [ERR] mkpg 7.5e up [76,11] != acting [76]
>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618816 osd.20
>> 10.177.64.4:6839/6735 54 : [ERR] mkpg 7.f up [51,20,64] != acting
>> [20,51,64]
>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618854 osd.20
>> 10.177.64.4:6839/6735 55 : [ERR] mkpg 7.17 up [51,20,64] != acting
>> [20,51,64]
>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618883 osd.20
>> 10.177.64.4:6839/6735 56 : [ERR] mkpg 7.1f up [51,20,64] != acting
>> [20,51,64]
>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618912 osd.20
>> 10.177.64.4:6839/6735 57 : [ERR] mkpg 7.27 up [51,20,64] != acting
>> [20,51,64]
>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618941 osd.20
>> 10.177.64.4:6839/6735 58 : [ERR] mkpg 7.2f up [51,20,64] != acting
>> [20,51,64]
>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618970 osd.20
>> 10.177.64.4:6839/6735 59 : [ERR] mkpg 7.37 up [51,20,64] != acting
>> [20,51,64]
>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618999 osd.20
>> 10.177.64.4:6839/6735 60 : [ERR] mkpg 7.3f up [51,20,64] != acting
>> [20,51,64]
>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619027 osd.20
>> 10.177.64.4:6839/6735 61 : [ERR] mkpg 7.47 up [51,20,64] != acting
>> [20,51,64]
>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619056 osd.20
>> 10.177.64.4:6839/6735 62 : [ERR] mkpg 7.4f up [51,20,64] != acting
>> [20,51,64]
>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619085 osd.20
>> 10.177.64.4:6839/6735 63 : [ERR] mkpg 7.57 up [51,20,64] != acting
>> [20,51,64]
>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619113 osd.20
>> 10.177.64.4:6839/6735 64 : [ERR] mkpg 7.5f up [51,20,64] != acting
>> [20,51,64]
>>
>> 2012/2/20 Sławomir Skowron <slawomir.skowron@gmail.com>:
>>> After increasing pg_num from 8 to 100 in .rgw.buckets, I have some
>>> serious problems.
>>>
>>> pool name      category         KB  objects  clones  degraded  unfound     rd  rd KB       wr      wr KB
>>> .intent-log    -              4662       19       0         0        0      0      0    26502      26501
>>> .log           -                 0        0       0         0        0      0      0   913732     913342
>>> .rgw           -                 1       10       0         0        0      1      0        9          7
>>> .rgw.buckets   -          39582566    73707       0      8061        0  86594      0   610896   36050541
>>> .rgw.control   -                 0        1       0         0        0      0      0        0          0
>>> .users         -                 1        1       0         0        0      0      0        1          1
>>> .users.uid     -                 1        2       0         0        0      2      1        3          3
>>> data           -                 0        0       0         0        0      0      0        0          0
>>> metadata       -                 0        0       0         0        0      0      0        0          0
>>> rbd            -          21590723     5328       0         1        0     77     75  3013595  378345507
>>>  total used       229514252        79068
>>>  total avail    19685615164
>>>  total space    20980898464
>>>
>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384251 mon.0
>>> 10.177.64.4:6789/0 36135 : [INF] osd.28 10.177.64.6:6806/824 failed
>>> (by osd.55 10.177.64.8:6809/28642)
>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384275 mon.0
>>> 10.177.64.4:6789/0 36136 : [INF] osd.37 10.177.64.6:6841/29133 failed
>>> (by osd.55 10.177.64.8:6809/28642)
>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384301 mon.0
>>> 10.177.64.4:6789/0 36137 : [INF] osd.7 10.177.64.4:6813/8223 failed
>>> (by osd.55 10.177.64.8:6809/28642)
>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384327 mon.0
>>> 10.177.64.4:6789/0 36138 : [INF] osd.44 10.177.64.6:6859/2370 failed
>>> (by osd.55 10.177.64.8:6809/28642)
>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384353 mon.0
>>> 10.177.64.4:6789/0 36139 : [INF] osd.49 10.177.64.6:6865/29878 failed
>>> (by osd.55 10.177.64.8:6809/28642)
>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384384 mon.0
>>> 10.177.64.4:6789/0 36140 : [INF] osd.17 10.177.64.4:6827/5909 failed
>>> (by osd.55 10.177.64.8:6809/28642)
>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384410 mon.0
>>> 10.177.64.4:6789/0 36141 : [INF] osd.12 10.177.64.4:6810/5410 failed
>>> (by osd.55 10.177.64.8:6809/28642)
>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384435 mon.0
>>> 10.177.64.4:6789/0 36142 : [INF] osd.39 10.177.64.6:6843/12733 failed
>>> (by osd.55 10.177.64.8:6809/28642)
>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384461 mon.0
>>> 10.177.64.4:6789/0 36143 : [INF] osd.42 10.177.64.6:6848/13067 failed
>>> (by osd.55 10.177.64.8:6809/28642)
>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384485 mon.0
>>> 10.177.64.4:6789/0 36144 : [INF] osd.31 10.177.64.6:6840/1233 failed
>>> (by osd.55 10.177.64.8:6809/28642)
>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384513 mon.0
>>> 10.177.64.4:6789/0 36145 : [INF] osd.36 10.177.64.6:6830/12573 failed
>>> (by osd.55 10.177.64.8:6809/28642)
>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384537 mon.0
>>> 10.177.64.4:6789/0 36146 : [INF] osd.38 10.177.64.6:6833/32587 failed
>>> (by osd.55 10.177.64.8:6809/28642)
>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384567 mon.0
>>> 10.177.64.4:6789/0 36147 : [INF] osd.5 10.177.64.4:6873/7842 failed
>>> (by osd.55 10.177.64.8:6809/28642)
>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384596 mon.0
>>> 10.177.64.4:6789/0 36148 : [INF] osd.21 10.177.64.4:6844/11607 failed
>>> (by osd.55 10.177.64.8:6809/28642)
>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384622 mon.0
>>> 10.177.64.4:6789/0 36149 : [INF] osd.23 10.177.64.4:6853/6826 failed
>>> (by osd.55 10.177.64.8:6809/28642)
>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384661 mon.0
>>> 10.177.64.4:6789/0 36150 : [INF] osd.51 10.177.64.6:6858/15894 failed
>>> (by osd.55 10.177.64.8:6809/28642)
>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384693 mon.0
>>> 10.177.64.4:6789/0 36151 : [INF] osd.48 10.177.64.6:6862/13476 failed
>>> (by osd.55 10.177.64.8:6809/28642)
>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384723 mon.0
>>> 10.177.64.4:6789/0 36152 : [INF] osd.32 10.177.64.6:6815/3701 failed
>>> (by osd.55 10.177.64.8:6809/28642)
>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384759 mon.0
>>> 10.177.64.4:6789/0 36153 : [INF] osd.41 10.177.64.6:6847/1861 failed
>>> (by osd.55 10.177.64.8:6809/28642)
>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384790 mon.0
>>> 10.177.64.4:6789/0 36154 : [INF] osd.0 10.177.64.4:6800/5230 failed
>>> (by osd.55 10.177.64.8:6809/28642)
>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384814 mon.0
>>> 10.177.64.4:6789/0 36155 : [INF] osd.3 10.177.64.4:6865/7242 failed
>>> (by osd.55 10.177.64.8:6809/28642)
>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384838 mon.0
>>> 10.177.64.4:6789/0 36156 : [INF] osd.1 10.177.64.4:6804/9729 failed
>>> (by osd.55 10.177.64.8:6809/28642)
>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384864 mon.0
>>> 10.177.64.4:6789/0 36157 : [INF] osd.47 10.177.64.6:6866/13924 failed
>>> (by osd.55 10.177.64.8:6809/28642)
>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384896 mon.0
>>> 10.177.64.4:6789/0 36158 : [INF] osd.45 10.177.64.6:6857/4401 failed
>>> (by osd.55 10.177.64.8:6809/28642)
>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384928 mon.0
>>> 10.177.64.4:6789/0 36159 : [INF] osd.20 10.177.64.4:6842/6246 failed
>>> (by osd.55 10.177.64.8:6809/28642)
>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384952 mon.0
>>> 10.177.64.4:6789/0 36160 : [INF] osd.16 10.177.64.4:6821/5833 failed
>>> (by osd.55 10.177.64.8:6809/28642)
>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384982 mon.0
>>> 10.177.64.4:6789/0 36161 : [INF] osd.35 10.177.64.6:6824/3877 failed
>>> (by osd.55 10.177.64.8:6809/28642)
>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.385007 mon.0
>>> 10.177.64.4:6789/0 36162 : [INF] osd.3 10.177.64.4:6865/7242 failed
>>> (by osd.55 10.177.64.8:6809/28642)
>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.385032 mon.0
>>> 10.177.64.4:6789/0 36163 : [INF] osd.7 10.177.64.4:6813/8223 failed
>>> (by osd.55 10.177.64.8:6809/28642)
>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.385059 mon.0
>>> 10.177.64.4:6789/0 36164 : [INF] osd.19 10.177.64.4:6831/10499 failed
>>> (by osd.55 10.177.64.8:6809/28642)
>>> 2012-02-20 20:06:10.851483    pg v172582: 10548 pgs: 92 creating, 1
>>> active, 9713 active+clean, 3 active+degraded+backfill, 657 peering, 77
>>> down+peering, 5 active+degraded; 59744 MB data, 218 GB used, 18773 GB
>>> / 20008 GB avail; 8071/237184 degraded (3.403%)
>>> 2012-02-20 20:06:10.967491   osd e7436: 78 osds: 70 up, 73 in
>>> 2012-02-20 20:06:10.990903   log 2012-02-20 20:05:56.448227 mon.2
>>> 10.177.64.8:6789/0 134 : [INF] mon.2 calling new monitor election
>>> 2012-02-20 20:06:10.990903   log 2012-02-20 20:05:58.252635 mon.1
>>> 10.177.64.6:6789/0 3929 : [INF] mon.1 calling new monitor election
>>> 2012-02-20 20:06:11.034669    pg v172583: 10548 pgs: 92 creating, 1
>>> active, 9713 active+clean, 3 active+degraded+backfill, 657 peering, 77
>>> down+peering, 5 active+degraded; 59744 MB data, 218 GB used, 18773 GB
>>> / 20008 GB avail; 8071/237184 degraded (3.403%)
>>> 2012-02-20 20:06:11.958126   osd e7437: 78 osds: 70 up, 73 in
>>> 2012-02-20 20:06:12.068650    pg v172584: 10548 pgs: 92 creating, 1
>>> active, 9711 active+clean, 3 active+degraded+backfill, 659 peering, 77
>>> down+peering, 5 active+degraded; 59744 MB data, 218 GB used, 18773 GB
>>> / 20008 GB avail; 8067/237184 degraded (3.401%)
>>> 2012-02-20 20:06:12.947997   osd e7438: 78 osds: 70 up, 73 in
>>> 2012-02-20 20:06:13.770942    pg v172585: 10548 pgs: 3 inactive, 92
>>> creating, 1 active, 9824 active+clean, 3 active+degraded+backfill, 541
>>> peering, 77 down+peering, 7 active+degraded; 59744 MB data, 218 GB
>>> used, 18773 GB / 20008 GB avail; 8067/237184 degraded (3.401%)
>>> 2012-02-20 20:06:14.686248    pg v172586: 10548 pgs: 3 inactive, 92
>>> creating, 1 active, 9894 active+clean, 3 active+degraded+backfill, 471
>>> peering, 77 down+peering, 7 active+degraded; 59744 MB data, 218 GB
>>> used, 18773 GB / 20008 GB avail; 8067/237184 degraded (3.401%)
>>> 2012-02-20 20:06:15.340365    pg v172587: 10548 pgs: 3 inactive, 92
>>> creating, 1 active, 9915 active+clean, 3 active+degraded+backfill, 447
>>> peering, 77 down+peering, 10 active+degraded; 59744 MB data, 218 GB
>>> used, 18773 GB / 20008 GB avail; 8067/237184 degraded (3.401%)
>>> 2012-02-20 20:06:16.852264    pg v172588: 10548 pgs: 3 inactive, 92
>>> creating, 84 active, 10094 active+clean, 3 active+degraded+backfill,
>>> 179 peering, 77 down+peering, 16 active+degraded; 59744 MB data, 218
>>> GB used, 18773 GB / 20008 GB avail; 8067/237184 degraded (3.401%)
>>>
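A quick way to sanity-check the pg summary lines quoted above is to parse one and confirm that the per-state counts add up to the pg total. This is only a sketch: `parse_pg_summary` is a hypothetical helper, not part of any Ceph tool, and it assumes the wrapped mail lines have first been joined back into one line.

```python
import re

def parse_pg_summary(line):
    """Return (total_pgs, {state: count}, degraded_pct) for one pg summary line.

    Hypothetical helper; assumes the wrapped lines were joined into one.
    """
    m = re.search(r"(\d+) pgs: ([^;]+);", line)
    if m is None:
        raise ValueError("not a pg summary line")
    states = {}
    for part in m.group(2).split(","):
        count, state = part.split()          # e.g. "77 down+peering"
        states[state] = int(count)
    d = re.search(r"(\d+)/(\d+) degraded", line)
    degraded_pct = 100.0 * int(d.group(1)) / int(d.group(2)) if d else 0.0
    return int(m.group(1)), states, degraded_pct

# The v172582 line from the log above, joined into a single string.
sample = ("2012-02-20 20:06:10.851483    pg v172582: 10548 pgs: 92 creating, "
          "1 active, 9713 active+clean, 3 active+degraded+backfill, "
          "657 peering, 77 down+peering, 5 active+degraded; 59744 MB data, "
          "218 GB used, 18773 GB / 20008 GB avail; "
          "8071/237184 degraded (3.403%)")

total, states, pct = parse_pg_summary(sample)
not_clean = total - states.get("active+clean", 0)
print(total, sum(states.values()), not_clean, round(pct, 3))
```

For this line the state counts sum to the reported 10548 pgs, and the recomputed degraded ratio matches the 3.403% in the log.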
>>> OSDs keep failing, one after another. The number of up OSDs swings
>>> from 62 to 70-72, then goes down, and then goes up again.
>>>
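To see who is reporting the failures in the monitor log quoted above, the "failed (by ...)" entries can be tallied per target and per reporter. The sketch below is an assumption-laden illustration: `tally_failures` is a hypothetical helper, and it expects the log text without the ">" quote prefixes. The sample uses three entries copied from the log.

```python
import re
from collections import Counter

# Matches "[INF] osd.3 <addr> failed (by osd.55 <addr>)" after line joining.
FAILED_RE = re.compile(r"\[INF\] (osd\.\d+) \S+ failed \(by (osd\.\d+) \S+\)")

def tally_failures(log_text):
    """Count failure reports per failed OSD and per reporting OSD."""
    joined = " ".join(log_text.split())      # undo the mail line wrapping
    targets, reporters = Counter(), Counter()
    for target, reporter in FAILED_RE.findall(joined):
        targets[target] += 1
        reporters[reporter] += 1
    return targets, reporters

sample = """
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384814 mon.0
10.177.64.4:6789/0 36155 : [INF] osd.3 10.177.64.4:6865/7242 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.385007 mon.0
10.177.64.4:6789/0 36162 : [INF] osd.3 10.177.64.4:6865/7242 failed
(by osd.55 10.177.64.8:6809/28642)
2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.385059 mon.0
10.177.64.4:6789/0 36164 : [INF] osd.19 10.177.64.4:6831/10499 failed
(by osd.55 10.177.64.8:6809/28642)
"""

targets, reporters = tally_failures(sample)
print(dict(targets), dict(reporters))
```

In the excerpt quoted here every report comes from osd.55. Whether that points at the reporter rather than the reported OSDs is something to check, not a conclusion this tally can draw by itself.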
>>> 2012-02-20 20:09:47.305016 7f816009e700 osd.20 7476 heartbeat_check:
>>> no heartbeat from osd.64 since 2012-02-20 20:09:30.286408 (cutoff
>>> 2012-02-20 20:09:42.304975)
>>> 2012-02-20 20:09:47.410159 7f816c9b8700 osd.20 7476 heartbeat_check:
>>> no heartbeat from osd.61 since 2012-02-20 20:09:29.807115 (cutoff
>>> 2012-02-20 20:09:42.410144)
>>> 2012-02-20 20:09:47.410177 7f816c9b8700 osd.20 7476 heartbeat_check:
>>> no heartbeat from osd.64 since 2012-02-20 20:09:30.286408 (cutoff
>>> 2012-02-20 20:09:42.410144)
>>> 2012-02-20 20:09:47.906661 7f816009e700 osd.20 7476 heartbeat_check:
>>> no heartbeat from osd.61 since 2012-02-20 20:09:29.807115 (cutoff
>>> 2012-02-20 20:09:42.906639)
>>> 2012-02-20 20:09:47.906685 7f816009e700 osd.20 7476 heartbeat_check:
>>> no heartbeat from osd.64 since 2012-02-20 20:09:30.286408 (cutoff
>>> 2012-02-20 20:09:42.906639)
>>> 2012-02-20 20:09:48.114431 7f815660b700 -- 10.177.64.4:0/6389 >>
>>> 10.177.64.4:6854/5398 pipe(0x1398c500 sd=47 pgs=26 cs=2 l=0).connect
>>> claims to be 10.177.64.4:6854/17798 not 10.177.64.4:6854/5398 - wrong
>>> node!
>>> 2012-02-20 20:09:48.410333 7f816c9b8700 osd.20 7476 heartbeat_check:
>>> no heartbeat from osd.61 since 2012-02-20 20:09:29.807115 (cutoff
>>> 2012-02-20 20:09:43.410313)
>>> 2012-02-20 20:09:48.410361 7f816c9b8700 osd.20 7476 heartbeat_check:
>>> no heartbeat from osd.64 since 2012-02-20 20:09:30.286408 (cutoff
>>> 2012-02-20 20:09:43.410313)
>>> 2012-02-20 20:09:51.450127 7f814b75d700 -- 10.177.64.4:0/6389 >>
>>> 10.177.64.4:6855/17423 pipe(0xa86e780 sd=17 pgs=17 cs=2 l=0).connect
>>> claims to be 10.177.64.4:6855/17798 not 10.177.64.4:6855/17423 - wrong
>>> node!
>>> 2012-02-20 20:09:54.498949 7f814a248700 -- 10.177.64.4:0/6389 >>
>>> 10.177.64.4:6854/19396 pipe(0x38cc780 sd=25 pgs=8 cs=2 l=0).connect
>>> claims to be 10.177.64.4:6854/17798 not 10.177.64.4:6854/19396 - wrong
>>> node!
>>>
>>> Some of them go down with this:
>>>
>>> 2012-02-20 18:22:15.824992 7fe3ec1c97a0 ceph version 0.41
>>> (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d), process ceph-osd,
>>> pid 31379
>>> 2012-02-20 18:22:15.826476 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
>>> mount FIEMAP ioctl is supported
>>> 2012-02-20 18:22:15.826514 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
>>> mount did NOT detect btrfs
>>> 2012-02-20 18:22:15.826613 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
>>> mount found snaps <>
>>> 2012-02-20 18:22:15.826650 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
>>> mount: WRITEAHEAD journal mode explicitly enabled in conf
>>> 2012-02-20 18:22:16.415671 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
>>> mount FIEMAP ioctl is supported
>>> 2012-02-20 18:22:16.415703 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
>>> mount did NOT detect btrfs
>>> 2012-02-20 18:22:16.415744 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
>>> mount found snaps <>
>>> 2012-02-20 18:22:16.415758 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
>>> mount: WRITEAHEAD journal mode explicitly enabled in conf
>>> osd/OSD.cc: In function 'void OSD::split_pg(PG*, std::map<pg_t, PG*>&,
>>> ObjectStore::Transaction&)' thread 7fe3df8c4700 time 2012-02-20
>>> 18:22:19.900886
>>> osd/OSD.cc: 4066: FAILED assert(child)
>>>  ceph version 0.41 (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d)
>>>  1: (OSD::split_pg(PG*, std::map<pg_t, PG*, std::less<pg_t>,
>>> std::allocator<std::pair<pg_t const, PG*> > >&,
>>> ObjectStore::Transaction&)+0x23e0) [0x54cd20]
>>>  2: (OSD::kick_pg_split_queue()+0x880) [0x556d90]
>>>  3: (OSD::handle_pg_notify(MOSDPGNotify*)+0x4b6) [0x559546]
>>>  4: (OSD::_dispatch(Message*)+0x608) [0x560e58]
>>>  5: (OSD::ms_dispatch(Message*)+0x11e) [0x561b7e]
>>>  6: (SimpleMessenger::dispatch_entry()+0x76b) [0x5c844b]
>>>  7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4b5cfc]
>>>  8: (()+0x7efc) [0x7fe3ebda3efc]
>>>  9: (clone()+0x6d) [0x7fe3ea3d489d]
>>>  ceph version 0.41 (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d)
>>>  1: (OSD::split_pg(PG*, std::map<pg_t, PG*, std::less<pg_t>,
>>> std::allocator<std::pair<pg_t const, PG*> > >&,
>>> ObjectStore::Transaction&)+0x23e0) [0x54cd20]
>>>  2: (OSD::kick_pg_split_queue()+0x880) [0x556d90]
>>>  3: (OSD::handle_pg_notify(MOSDPGNotify*)+0x4b6) [0x559546]
>>>  4: (OSD::_dispatch(Message*)+0x608) [0x560e58]
>>>  5: (OSD::ms_dispatch(Message*)+0x11e) [0x561b7e]
>>>  6: (SimpleMessenger::dispatch_entry()+0x76b) [0x5c844b]
>>>  7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4b5cfc]
>>>  8: (()+0x7efc) [0x7fe3ebda3efc]
>>>  9: (clone()+0x6d) [0x7fe3ea3d489d]
>>> *** Caught signal (Aborted) **
>>>  in thread 7fe3df8c4700
>>>  ceph version 0.41 (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d)
>>>  1: /usr/bin/ceph-osd() [0x6099f6]
>>>  2: (()+0x10060) [0x7fe3ebdac060]
>>>  3: (gsignal()+0x35) [0x7fe3ea3293a5]
>>>  4: (abort()+0x17b) [0x7fe3ea32cb0b]
>>>  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fe3eabe7d7d]
>>>  6: (()+0xb9f26) [0x7fe3eabe5f26]
>>>  7: (()+0xb9f53) [0x7fe3eabe5f53]
>>>  8: (()+0xba04e) [0x7fe3eabe604e]
>>>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>> const*)+0x200) [0x5dc6b0]
>>>  10: (OSD::split_pg(PG*, std::map<pg_t, PG*, std::less<pg_t>,
>>> std::allocator<std::pair<pg_t const, PG*> > >&,
>>> ObjectStore::Transaction&)+0x23e0) [0x54cd20]
>>>  11: (OSD::kick_pg_split_queue()+0x880) [0x556d90]
>>>  12: (OSD::handle_pg_notify(MOSDPGNotify*)+0x4b6) [0x559546]
>>>  13: (OSD::_dispatch(Message*)+0x608) [0x560e58]
>>>  14: (OSD::ms_dispatch(Message*)+0x11e) [0x561b7e]
>>>  15: (SimpleMessenger::dispatch_entry()+0x76b) [0x5c844b]
>>>  16: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4b5cfc]
>>>  17: (()+0x7efc) [0x7fe3ebda3efc]
>>>  18: (clone()+0x6d) [0x7fe3ea3d489d]
>>> 2012-02-20 18:23:57.915653 7fa818e3e7a0 ceph version 0.41
>>> (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d), process ceph-osd,
>>> pid 6596
>>>
>>> Do you have any ideas? If you need data from the cluster or core
>>> dumps from the OSDs, I have plenty of them, but they are large.
>>>
>>> --
>>> -----
>>> Regards
>>>
>>> Sławek "sZiBis" Skowron
>>
>>
>>
>> --
>> -----
>> Regards
>>
>> Sławek "sZiBis" Skowron
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Serious problem after increase pg_num in pool
  2012-02-21  6:46     ` Sławomir Skowron
@ 2012-02-21  7:23       ` Sławomir Skowron
  2012-02-21 16:00         ` Sage Weil
  0 siblings, 1 reply; 7+ messages in thread
From: Sławomir Skowron @ 2012-02-21  7:23 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel@vger.kernel.org

If there is no chance of stabilizing this cluster, I will try something like this:

- stop one machine in the cluster
- check that the cluster is still OK and the data is available
- make a new fs on one machine
- migrate the data with rados via obsync
- expand the new cluster with the second and third machines
- change the keys for radosgw, etc.
- the new cluster is up with the old data

Can objects in the .rgw.buckets pool be migrated via obsync?
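Whatever tool ends up doing the copy, the "data is available" check in the plan above can be made concrete by comparing a manifest of the old pool against a manifest of the new one. The sketch below does not say whether obsync can migrate .rgw.buckets. It is only a hypothetical verification step, with `build_manifest` and `compare_manifests` as made-up helper names and plain dicts standing in for the two pools.

```python
import hashlib

def build_manifest(objects):
    """Map object name -> (size, sha256 hex digest) for a {name: bytes} dict.

    Hypothetical: in a real run the bytes would be read from the pool.
    """
    return {name: (len(data), hashlib.sha256(data).hexdigest())
            for name, data in objects.items()}

def compare_manifests(source, dest):
    """Report objects that are missing, unexpected, or different in dest."""
    src_names, dst_names = set(source), set(dest)
    return {
        "missing": sorted(src_names - dst_names),
        "extra": sorted(dst_names - src_names),
        "mismatched": sorted(name for name in src_names & dst_names
                             if source[name] != dest[name]),
    }

# Stand-ins for the old and new pools; the names and contents are invented.
old_pool = {"a": b"alpha", "b": b"beta", "c": b"gamma"}
new_pool = {"a": b"alpha", "b": b"BETA"}

report = compare_manifests(build_manifest(old_pool), build_manifest(new_pool))
print(report)
```

An empty report for all three keys would be the condition for moving on to the next step of the plan. This compares object data only; any metadata radosgw keeps alongside the objects would need its own check.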

On 21 Feb 2012, at 07:46, "Sławomir Skowron" <szibis@gmail.com> wrote:

> 40 GB in 3 copies in the rgw bucket, and some data in RBD, but that can
> be destroyed.
>
> ceph -s reports 224 GB in the normal state.
>
> Regards
>
> iSS
>
> On 20 Feb 2012, at 21:19, Sage Weil <sage@newdream.net> wrote:
>
>> Ooh, the pg split functionality is currently broken, and we weren't
>> planning on fixing it for a while longer.  I didn't realize it was still
>> possible to trigger from the monitor.
>>
>> I'm looking at how difficult it is to make it work (even inefficiently).
>>
>> How much data do you have in the cluster?
>>
>> sage
>>
>>
>>
>>
>> On Mon, 20 Feb 2012, Sławomir Skowron wrote:
>>
>>> and this in ceph -w
>>>
>>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611270 osd.76
>>> 10.177.64.8:6872/5395 49 : [ERR] mkpg 7.e up [76,11] != acting [76]
>>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611308 osd.76
>>> 10.177.64.8:6872/5395 50 : [ERR] mkpg 7.16 up [76,11] != acting [76]
>>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611339 osd.76
>>> 10.177.64.8:6872/5395 51 : [ERR] mkpg 7.1e up [76,11] != acting [76]
>>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611369 osd.76
>>> 10.177.64.8:6872/5395 52 : [ERR] mkpg 7.26 up [76,11] != acting [76]
>>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611399 osd.76
>>> 10.177.64.8:6872/5395 53 : [ERR] mkpg 7.2e up [76,11] != acting [76]
>>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611428 osd.76
>>> 10.177.64.8:6872/5395 54 : [ERR] mkpg 7.36 up [76,11] != acting [76]
>>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611458 osd.76
>>> 10.177.64.8:6872/5395 55 : [ERR] mkpg 7.3e up [76,11] != acting [76]
>>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611488 osd.76
>>> 10.177.64.8:6872/5395 56 : [ERR] mkpg 7.46 up [76,11] != acting [76]
>>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611517 osd.76
>>> 10.177.64.8:6872/5395 57 : [ERR] mkpg 7.4e up [76,11] != acting [76]
>>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611547 osd.76
>>> 10.177.64.8:6872/5395 58 : [ERR] mkpg 7.56 up [76,11] != acting [76]
>>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611577 osd.76
>>> 10.177.64.8:6872/5395 59 : [ERR] mkpg 7.5e up [76,11] != acting [76]
>>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618816 osd.20
>>> 10.177.64.4:6839/6735 54 : [ERR] mkpg 7.f up [51,20,64] != acting
>>> [20,51,64]
>>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618854 osd.20
>>> 10.177.64.4:6839/6735 55 : [ERR] mkpg 7.17 up [51,20,64] != acting
>>> [20,51,64]
>>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618883 osd.20
>>> 10.177.64.4:6839/6735 56 : [ERR] mkpg 7.1f up [51,20,64] != acting
>>> [20,51,64]
>>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618912 osd.20
>>> 10.177.64.4:6839/6735 57 : [ERR] mkpg 7.27 up [51,20,64] != acting
>>> [20,51,64]
>>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618941 osd.20
>>> 10.177.64.4:6839/6735 58 : [ERR] mkpg 7.2f up [51,20,64] != acting
>>> [20,51,64]
>>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618970 osd.20
>>> 10.177.64.4:6839/6735 59 : [ERR] mkpg 7.37 up [51,20,64] != acting
>>> [20,51,64]
>>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618999 osd.20
>>> 10.177.64.4:6839/6735 60 : [ERR] mkpg 7.3f up [51,20,64] != acting
>>> [20,51,64]
>>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619027 osd.20
>>> 10.177.64.4:6839/6735 61 : [ERR] mkpg 7.47 up [51,20,64] != acting
>>> [20,51,64]
>>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619056 osd.20
>>> 10.177.64.4:6839/6735 62 : [ERR] mkpg 7.4f up [51,20,64] != acting
>>> [20,51,64]
>>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619085 osd.20
>>> 10.177.64.4:6839/6735 63 : [ERR] mkpg 7.57 up [51,20,64] != acting
>>> [20,51,64]
>>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619113 osd.20
>>> 10.177.64.4:6839/6735 64 : [ERR] mkpg 7.5f up [51,20,64] != acting
>>> [20,51,64]
>>>
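The mkpg errors quoted above fall into two shapes: for the osd.76 entries the acting set is missing a member of the up set, while for the osd.20 entries both sets hold the same OSDs in a different order. A small classifier makes that split explicit. This is a sketch only; `classify_mkpg` is a hypothetical helper and it assumes each log entry has been joined into one line.

```python
import re

MKPG_RE = re.compile(r"mkpg (\S+) up \[([\d,]*)\] != acting \[([\d,]*)\]")

def _ids(text):
    """Turn "51,20,64" into [51, 20, 64]; an empty string gives []."""
    return [int(x) for x in text.split(",") if x]

def classify_mkpg(entry):
    """Return (pgid, kind, osds_in_only_one_set) for one mkpg error entry."""
    m = MKPG_RE.search(" ".join(entry.split()))   # undo mail line wrapping
    if m is None:
        raise ValueError("not a mkpg error entry")
    up, acting = _ids(m.group(2)), _ids(m.group(3))
    if set(up) == set(acting):
        kind = "same members, different order"
    else:
        kind = "different members"
    return m.group(1), kind, sorted(set(up) ^ set(acting))

first = classify_mkpg("[ERR] mkpg 7.e up [76,11] != acting [76]")
second = classify_mkpg("[ERR] mkpg 7.f up [51,20,64] != acting\n[20,51,64]")
print(first)
print(second)
```

Grouping the entries this way shows which pgs have an OSD absent from the acting set (osd.11 for the 7.e family) and which only disagree on ordering.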
>>> 2012/2/20 Sławomir Skowron <slawomir.skowron@gmail.com>:
>>>> After increasing pg_num from 8 to 100 in .rgw.buckets, I have some
>>>> serious problems.
>>>>
>>>> pool name     category        KB  objects  clones  degraded  unfound     rd  rd KB       wr      wr KB
>>>> .intent-log   -             4662       19       0         0        0      0      0    26502      26501
>>>> .log          -                0        0       0         0        0      0      0   913732     913342
>>>> .rgw          -                1       10       0         0        0      1      0        9          7
>>>> .rgw.buckets  -         39582566    73707       0      8061        0  86594      0   610896   36050541
>>>> .rgw.control  -                0        1       0         0        0      0      0        0          0
>>>> .users        -                1        1       0         0        0      0      0        1          1
>>>> .users.uid    -                1        2       0         0        0      2      1        3          3
>>>> data          -                0        0       0         0        0      0      0        0          0
>>>> metadata      -                0        0       0         0        0      0      0        0          0
>>>> rbd           -         21590723     5328       0         1        0     77     75  3013595  378345507
>>>> total used       229514252        79068
>>>> total avail    19685615164
>>>> total space    20980898464
>>>>
>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384251 mon.0
>>>> 10.177.64.4:6789/0 36135 : [INF] osd.28 10.177.64.6:6806/824 failed
>>>> (by osd.55 10.177.64.8:6809/28642)
>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384275 mon.0
>>>> 10.177.64.4:6789/0 36136 : [INF] osd.37 10.177.64.6:6841/29133 failed
>>>> (by osd.55 10.177.64.8:6809/28642)
>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384301 mon.0
>>>> 10.177.64.4:6789/0 36137 : [INF] osd.7 10.177.64.4:6813/8223 failed
>>>> (by osd.55 10.177.64.8:6809/28642)
>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384327 mon.0
>>>> 10.177.64.4:6789/0 36138 : [INF] osd.44 10.177.64.6:6859/2370 failed
>>>> (by osd.55 10.177.64.8:6809/28642)
>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384353 mon.0
>>>> 10.177.64.4:6789/0 36139 : [INF] osd.49 10.177.64.6:6865/29878 failed
>>>> (by osd.55 10.177.64.8:6809/28642)
>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384384 mon.0
>>>> 10.177.64.4:6789/0 36140 : [INF] osd.17 10.177.64.4:6827/5909 failed
>>>> (by osd.55 10.177.64.8:6809/28642)
>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384410 mon.0
>>>> 10.177.64.4:6789/0 36141 : [INF] osd.12 10.177.64.4:6810/5410 failed
>>>> (by osd.55 10.177.64.8:6809/28642)
>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384435 mon.0
>>>> 10.177.64.4:6789/0 36142 : [INF] osd.39 10.177.64.6:6843/12733 failed
>>>> (by osd.55 10.177.64.8:6809/28642)
>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384461 mon.0
>>>> 10.177.64.4:6789/0 36143 : [INF] osd.42 10.177.64.6:6848/13067 failed
>>>> (by osd.55 10.177.64.8:6809/28642)
>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384485 mon.0
>>>> 10.177.64.4:6789/0 36144 : [INF] osd.31 10.177.64.6:6840/1233 failed
>>>> (by osd.55 10.177.64.8:6809/28642)
>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384513 mon.0
>>>> 10.177.64.4:6789/0 36145 : [INF] osd.36 10.177.64.6:6830/12573 failed
>>>> (by osd.55 10.177.64.8:6809/28642)
>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384537 mon.0
>>>> 10.177.64.4:6789/0 36146 : [INF] osd.38 10.177.64.6:6833/32587 failed
>>>> (by osd.55 10.177.64.8:6809/28642)
>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384567 mon.0
>>>> 10.177.64.4:6789/0 36147 : [INF] osd.5 10.177.64.4:6873/7842 failed
>>>> (by osd.55 10.177.64.8:6809/28642)
>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384596 mon.0
>>>> 10.177.64.4:6789/0 36148 : [INF] osd.21 10.177.64.4:6844/11607 failed
>>>> (by osd.55 10.177.64.8:6809/28642)
>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384622 mon.0
>>>> 10.177.64.4:6789/0 36149 : [INF] osd.23 10.177.64.4:6853/6826 failed
>>>> (by osd.55 10.177.64.8:6809/28642)
>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384661 mon.0
>>>> 10.177.64.4:6789/0 36150 : [INF] osd.51 10.177.64.6:6858/15894 failed
>>>> (by osd.55 10.177.64.8:6809/28642)
>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384693 mon.0
>>>> 10.177.64.4:6789/0 36151 : [INF] osd.48 10.177.64.6:6862/13476 failed
>>>> (by osd.55 10.177.64.8:6809/28642)
>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384723 mon.0
>>>> 10.177.64.4:6789/0 36152 : [INF] osd.32 10.177.64.6:6815/3701 failed
>>>> (by osd.55 10.177.64.8:6809/28642)
>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384759 mon.0
>>>> 10.177.64.4:6789/0 36153 : [INF] osd.41 10.177.64.6:6847/1861 failed
>>>> (by osd.55 10.177.64.8:6809/28642)
>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384790 mon.0
>>>> 10.177.64.4:6789/0 36154 : [INF] osd.0 10.177.64.4:6800/5230 failed
>>>> (by osd.55 10.177.64.8:6809/28642)
>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384814 mon.0
>>>> 10.177.64.4:6789/0 36155 : [INF] osd.3 10.177.64.4:6865/7242 failed
>>>> (by osd.55 10.177.64.8:6809/28642)
>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384838 mon.0
>>>> 10.177.64.4:6789/0 36156 : [INF] osd.1 10.177.64.4:6804/9729 failed
>>>> (by osd.55 10.177.64.8:6809/28642)
>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384864 mon.0
>>>> 10.177.64.4:6789/0 36157 : [INF] osd.47 10.177.64.6:6866/13924 failed
>>>> (by osd.55 10.177.64.8:6809/28642)
>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384896 mon.0
>>>> 10.177.64.4:6789/0 36158 : [INF] osd.45 10.177.64.6:6857/4401 failed
>>>> (by osd.55 10.177.64.8:6809/28642)
>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384928 mon.0
>>>> 10.177.64.4:6789/0 36159 : [INF] osd.20 10.177.64.4:6842/6246 failed
>>>> (by osd.55 10.177.64.8:6809/28642)
>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384952 mon.0
>>>> 10.177.64.4:6789/0 36160 : [INF] osd.16 10.177.64.4:6821/5833 failed
>>>> (by osd.55 10.177.64.8:6809/28642)
>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384982 mon.0
>>>> 10.177.64.4:6789/0 36161 : [INF] osd.35 10.177.64.6:6824/3877 failed
>>>> (by osd.55 10.177.64.8:6809/28642)
>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.385007 mon.0
>>>> 10.177.64.4:6789/0 36162 : [INF] osd.3 10.177.64.4:6865/7242 failed
>>>> (by osd.55 10.177.64.8:6809/28642)
>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.385032 mon.0
>>>> 10.177.64.4:6789/0 36163 : [INF] osd.7 10.177.64.4:6813/8223 failed
>>>> (by osd.55 10.177.64.8:6809/28642)
>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.385059 mon.0
>>>> 10.177.64.4:6789/0 36164 : [INF] osd.19 10.177.64.4:6831/10499 failed
>>>> (by osd.55 10.177.64.8:6809/28642)
>>>> 2012-02-20 20:06:10.851483    pg v172582: 10548 pgs: 92 creating, 1
>>>> active, 9713 active+clean, 3 active+degraded+backfill, 657 peering, 77
>>>> down+peering, 5 active+degraded; 59744 MB data, 218 GB used, 18773 GB
>>>> / 20008 GB avail; 8071/237184 degraded (3.403%)
>>>> 2012-02-20 20:06:10.967491   osd e7436: 78 osds: 70 up, 73 in
>>>> 2012-02-20 20:06:10.990903   log 2012-02-20 20:05:56.448227 mon.2
>>>> 10.177.64.8:6789/0 134 : [INF] mon.2 calling new monitor election
>>>> 2012-02-20 20:06:10.990903   log 2012-02-20 20:05:58.252635 mon.1
>>>> 10.177.64.6:6789/0 3929 : [INF] mon.1 calling new monitor election
>>>> 2012-02-20 20:06:11.034669    pg v172583: 10548 pgs: 92 creating, 1
>>>> active, 9713 active+clean, 3 active+degraded+backfill, 657 peering, 77
>>>> down+peering, 5 active+degraded; 59744 MB data, 218 GB used, 18773 GB
>>>> / 20008 GB avail; 8071/237184 degraded (3.403%)
>>>> 2012-02-20 20:06:11.958126   osd e7437: 78 osds: 70 up, 73 in
>>>> 2012-02-20 20:06:12.068650    pg v172584: 10548 pgs: 92 creating, 1
>>>> active, 9711 active+clean, 3 active+degraded+backfill, 659 peering, 77
>>>> down+peering, 5 active+degraded; 59744 MB data, 218 GB used, 18773 GB
>>>> / 20008 GB avail; 8067/237184 degraded (3.401%)
>>>> 2012-02-20 20:06:12.947997   osd e7438: 78 osds: 70 up, 73 in
>>>> 2012-02-20 20:06:13.770942    pg v172585: 10548 pgs: 3 inactive, 92
>>>> creating, 1 active, 9824 active+clean, 3 active+degraded+backfill, 541
>>>> peering, 77 down+peering, 7 active+degraded; 59744 MB data, 218 GB
>>>> used, 18773 GB / 20008 GB avail; 8067/237184 degraded (3.401%)
>>>> 2012-02-20 20:06:14.686248    pg v172586: 10548 pgs: 3 inactive, 92
>>>> creating, 1 active, 9894 active+clean, 3 active+degraded+backfill, 471
>>>> peering, 77 down+peering, 7 active+degraded; 59744 MB data, 218 GB
>>>> used, 18773 GB / 20008 GB avail; 8067/237184 degraded (3.401%)
>>>> 2012-02-20 20:06:15.340365    pg v172587: 10548 pgs: 3 inactive, 92
>>>> creating, 1 active, 9915 active+clean, 3 active+degraded+backfill, 447
>>>> peering, 77 down+peering, 10 active+degraded; 59744 MB data, 218 GB
>>>> used, 18773 GB / 20008 GB avail; 8067/237184 degraded (3.401%)
>>>> 2012-02-20 20:06:16.852264    pg v172588: 10548 pgs: 3 inactive, 92
>>>> creating, 84 active, 10094 active+clean, 3 active+degraded+backfill,
>>>> 179 peering, 77 down+peering, 16 active+degraded; 59744 MB data, 218
>>>> GB used, 18773 GB / 20008 GB avail; 8067/237184 degraded (3.401%)
>>>>
>>>> OSDs keep failing, one after another. The number of up OSDs swings
>>>> from 62 to 70-72, then goes down, and then goes up again.
>>>>
>>>> 2012-02-20 20:09:47.305016 7f816009e700 osd.20 7476 heartbeat_check:
>>>> no heartbeat from osd.64 since 2012-02-20 20:09:30.286408 (cutoff
>>>> 2012-02-20 20:09:42.304975)
>>>> 2012-02-20 20:09:47.410159 7f816c9b8700 osd.20 7476 heartbeat_check:
>>>> no heartbeat from osd.61 since 2012-02-20 20:09:29.807115 (cutoff
>>>> 2012-02-20 20:09:42.410144)
>>>> 2012-02-20 20:09:47.410177 7f816c9b8700 osd.20 7476 heartbeat_check:
>>>> no heartbeat from osd.64 since 2012-02-20 20:09:30.286408 (cutoff
>>>> 2012-02-20 20:09:42.410144)
>>>> 2012-02-20 20:09:47.906661 7f816009e700 osd.20 7476 heartbeat_check:
>>>> no heartbeat from osd.61 since 2012-02-20 20:09:29.807115 (cutoff
>>>> 2012-02-20 20:09:42.906639)
>>>> 2012-02-20 20:09:47.906685 7f816009e700 osd.20 7476 heartbeat_check:
>>>> no heartbeat from osd.64 since 2012-02-20 20:09:30.286408 (cutoff
>>>> 2012-02-20 20:09:42.906639)
>>>> 2012-02-20 20:09:48.114431 7f815660b700 -- 10.177.64.4:0/6389 >>
>>>> 10.177.64.4:6854/5398 pipe(0x1398c500 sd=47 pgs=26 cs=2 l=0).connect
>>>> claims to be 10.177.64.4:6854/17798 not 10.177.64.4:6854/5398 - wrong
>>>> node!
>>>> 2012-02-20 20:09:48.410333 7f816c9b8700 osd.20 7476 heartbeat_check:
>>>> no heartbeat from osd.61 since 2012-02-20 20:09:29.807115 (cutoff
>>>> 2012-02-20 20:09:43.410313)
>>>> 2012-02-20 20:09:48.410361 7f816c9b8700 osd.20 7476 heartbeat_check:
>>>> no heartbeat from osd.64 since 2012-02-20 20:09:30.286408 (cutoff
>>>> 2012-02-20 20:09:43.410313)
>>>> 2012-02-20 20:09:51.450127 7f814b75d700 -- 10.177.64.4:0/6389 >>
>>>> 10.177.64.4:6855/17423 pipe(0xa86e780 sd=17 pgs=17 cs=2 l=0).connect
>>>> claims to be 10.177.64.4:6855/17798 not 10.177.64.4:6855/17423 - wrong
>>>> node!
>>>> 2012-02-20 20:09:54.498949 7f814a248700 -- 10.177.64.4:0/6389 >>
>>>> 10.177.64.4:6854/19396 pipe(0x38cc780 sd=25 pgs=8 cs=2 l=0).connect
>>>> claims to be 10.177.64.4:6854/17798 not 10.177.64.4:6854/19396 - wrong
>>>> node!
>>>>
>>>> Some of them go down with this:
>>>>
>>>> 2012-02-20 18:22:15.824992 7fe3ec1c97a0 ceph version 0.41
>>>> (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d), process ceph-osd,
>>>> pid 31379
>>>> 2012-02-20 18:22:15.826476 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
>>>> mount FIEMAP ioctl is supported
>>>> 2012-02-20 18:22:15.826514 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
>>>> mount did NOT detect btrfs
>>>> 2012-02-20 18:22:15.826613 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
>>>> mount found snaps <>
>>>> 2012-02-20 18:22:15.826650 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
>>>> mount: WRITEAHEAD journal mode explicitly enabled in conf
>>>> 2012-02-20 18:22:16.415671 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
>>>> mount FIEMAP ioctl is supported
>>>> 2012-02-20 18:22:16.415703 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
>>>> mount did NOT detect btrfs
>>>> 2012-02-20 18:22:16.415744 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
>>>> mount found snaps <>
>>>> 2012-02-20 18:22:16.415758 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
>>>> mount: WRITEAHEAD journal mode explicitly enabled in conf
>>>> osd/OSD.cc: In function 'void OSD::split_pg(PG*, std::map<pg_t, PG*>&,
>>>> ObjectStore::Transaction&)' thread 7fe3df8c4700 time 2012-02-20
>>>> 18:22:19.900886
>>>> osd/OSD.cc: 4066: FAILED assert(child)
>>>> ceph version 0.41 (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d)
>>>> 1: (OSD::split_pg(PG*, std::map<pg_t, PG*, std::less<pg_t>,
>>>> std::allocator<std::pair<pg_t const, PG*> > >&,
>>>> ObjectStore::Transaction&)+0x23e0) [0x54cd20]
>>>> 2: (OSD::kick_pg_split_queue()+0x880) [0x556d90]
>>>> 3: (OSD::handle_pg_notify(MOSDPGNotify*)+0x4b6) [0x559546]
>>>> 4: (OSD::_dispatch(Message*)+0x608) [0x560e58]
>>>> 5: (OSD::ms_dispatch(Message*)+0x11e) [0x561b7e]
>>>> 6: (SimpleMessenger::dispatch_entry()+0x76b) [0x5c844b]
>>>> 7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4b5cfc]
>>>> 8: (()+0x7efc) [0x7fe3ebda3efc]
>>>> 9: (clone()+0x6d) [0x7fe3ea3d489d]
>>>> ceph version 0.41 (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d)
>>>> 1: (OSD::split_pg(PG*, std::map<pg_t, PG*, std::less<pg_t>,
>>>> std::allocator<std::pair<pg_t const, PG*> > >&,
>>>> ObjectStore::Transaction&)+0x23e0) [0x54cd20]
>>>> 2: (OSD::kick_pg_split_queue()+0x880) [0x556d90]
>>>> 3: (OSD::handle_pg_notify(MOSDPGNotify*)+0x4b6) [0x559546]
>>>> 4: (OSD::_dispatch(Message*)+0x608) [0x560e58]
>>>> 5: (OSD::ms_dispatch(Message*)+0x11e) [0x561b7e]
>>>> 6: (SimpleMessenger::dispatch_entry()+0x76b) [0x5c844b]
>>>> 7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4b5cfc]
>>>> 8: (()+0x7efc) [0x7fe3ebda3efc]
>>>> 9: (clone()+0x6d) [0x7fe3ea3d489d]
>>>> *** Caught signal (Aborted) **
>>>> in thread 7fe3df8c4700
>>>> ceph version 0.41 (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d)
>>>> 1: /usr/bin/ceph-osd() [0x6099f6]
>>>> 2: (()+0x10060) [0x7fe3ebdac060]
>>>> 3: (gsignal()+0x35) [0x7fe3ea3293a5]
>>>> 4: (abort()+0x17b) [0x7fe3ea32cb0b]
>>>> 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fe3eabe7d7d]
>>>> 6: (()+0xb9f26) [0x7fe3eabe5f26]
>>>> 7: (()+0xb9f53) [0x7fe3eabe5f53]
>>>> 8: (()+0xba04e) [0x7fe3eabe604e]
>>>> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>>> const*)+0x200) [0x5dc6b0]
>>>> 10: (OSD::split_pg(PG*, std::map<pg_t, PG*, std::less<pg_t>,
>>>> std::allocator<std::pair<pg_t const, PG*> > >&,
>>>> ObjectStore::Transaction&)+0x23e0) [0x54cd20]
>>>> 11: (OSD::kick_pg_split_queue()+0x880) [0x556d90]
>>>> 12: (OSD::handle_pg_notify(MOSDPGNotify*)+0x4b6) [0x559546]
>>>> 13: (OSD::_dispatch(Message*)+0x608) [0x560e58]
>>>> 14: (OSD::ms_dispatch(Message*)+0x11e) [0x561b7e]
>>>> 15: (SimpleMessenger::dispatch_entry()+0x76b) [0x5c844b]
>>>> 16: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4b5cfc]
>>>> 17: (()+0x7efc) [0x7fe3ebda3efc]
>>>> 18: (clone()+0x6d) [0x7fe3ea3d489d]
>>>> 2012-02-20 18:23:57.915653 7fa818e3e7a0 ceph version 0.41
>>>> (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d), process ceph-osd,
>>>> pid 6596
>>>>
>>>> Do you have any ideas? If you need data from the cluster, or core
>>>> dumps from the OSDs, I have a lot of them, but they are large.
>>>>
>>>> --
>>>> -----
>>>> Regards
>>>>
>>>> Sławek "sZiBis" Skowron
>>>
>>>
>>>
>>> --
>>> -----
>>> Regards
>>>
>>> Sławek "sZiBis" Skowron
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Serious problem after increase pg_num in pool
  2012-02-21  7:23       ` Sławomir Skowron
@ 2012-02-21 16:00         ` Sage Weil
  2012-02-21 16:55           ` Sławomir Skowron
  0 siblings, 1 reply; 7+ messages in thread
From: Sage Weil @ 2012-02-21 16:00 UTC (permalink / raw)
  To: Sławomir Skowron; +Cc: ceph-devel@vger.kernel.org

On Tue, 21 Feb 2012, Sławomir Skowron wrote:
> If there is no chance to stabilize this cluster, I will try something like this:
> 
> - stop one machine in the cluster
> - check that the cluster is still OK and the data is available
> - make a new fs on one machine
> - migrate the data from rados via obsync
> - expand the new cluster with the second and third machines
> - change keys for radosgw etc.
> - the new cluster is up with the old data
> 
> Can objects in the .rgw.buckets pool be migrated via obsync?

obsync operates at the S3/Swift bucket level, and many such buckets are stored 
in .rgw.buckets.  You'll need to sync each of those buckets individually.
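
As a rough illustration of syncing each bucket individually, the loop below
is a minimal sketch, not a tested migration tool. It assumes you already have
the list of bucket names and some way to copy a single bucket (for example a
wrapper around one obsync run). The names sync_all_buckets and sync_one are
hypothetical, and no obsync command-line options are assumed.

```python
from typing import Callable, Iterable, List, Tuple


def sync_all_buckets(
    buckets: Iterable[str],
    sync_one: Callable[[str], None],
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Call sync_one(bucket) once per bucket and keep going after failures.

    sync_one is a hypothetical caller-supplied function that copies a single
    bucket (for example by running obsync for it) and raises on error.
    Returns (synced, failed), where failed holds (bucket, error message).
    """
    synced: List[str] = []
    failed: List[Tuple[str, str]] = []
    for bucket in buckets:
        try:
            sync_one(bucket)
        except Exception as exc:  # record the failure, continue with the rest
            failed.append((bucket, str(exc)))
        else:
            synced.append(bucket)
    return synced, failed
```

Buckets that end up in the failed list can be retried later without
re-copying the ones that already succeeded.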

Before you do that, though, I have a pg split branch that is almost ready.  
If you don't mind, I'd be curious if it can handle your semi-broken 
cluster.  I'll have it pushed in about 2 hours, if you can wait!  If not, 
no worries.

sage



> 
> On 21 Feb 2012, at 07:46, "Sławomir Skowron" <szibis@gmail.com> wrote:
> 
> > 40 GB in 3 copies in the rgw bucket, and some data in RBD, but that can be
> > destroyed.
> >
> > ceph -s reports 224 GB in normal state.
> >
> > Regards
> >
> > iSS
> >
> > On 20 Feb 2012, at 21:19, Sage Weil <sage@newdream.net> wrote:
> >
> >> Ooh, the pg split functionality is currently broken, and we weren't
> >> planning on fixing it for a while longer.  I didn't realize it was still
> >> possible to trigger from the monitor.
> >>
> >> I'm looking at how difficult it is to make it work (even inefficiently).
> >>
> >> How much data do you have in the cluster?
> >>
> >> sage
> >>
> >>
> >>
> >>
> >> On Mon, 20 Feb 2012, Sławomir Skowron wrote:
> >>
> >>> and this in ceph -w
> >>>
> >>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611270 osd.76
> >>> 10.177.64.8:6872/5395 49 : [ERR] mkpg 7.e up [76,11] != acting [76]
> >>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611308 osd.76
> >>> 10.177.64.8:6872/5395 50 : [ERR] mkpg 7.16 up [76,11] != acting [76]
> >>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611339 osd.76
> >>> 10.177.64.8:6872/5395 51 : [ERR] mkpg 7.1e up [76,11] != acting [76]
> >>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611369 osd.76
> >>> 10.177.64.8:6872/5395 52 : [ERR] mkpg 7.26 up [76,11] != acting [76]
> >>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611399 osd.76
> >>> 10.177.64.8:6872/5395 53 : [ERR] mkpg 7.2e up [76,11] != acting [76]
> >>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611428 osd.76
> >>> 10.177.64.8:6872/5395 54 : [ERR] mkpg 7.36 up [76,11] != acting [76]
> >>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611458 osd.76
> >>> 10.177.64.8:6872/5395 55 : [ERR] mkpg 7.3e up [76,11] != acting [76]
> >>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611488 osd.76
> >>> 10.177.64.8:6872/5395 56 : [ERR] mkpg 7.46 up [76,11] != acting [76]
> >>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611517 osd.76
> >>> 10.177.64.8:6872/5395 57 : [ERR] mkpg 7.4e up [76,11] != acting [76]
> >>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611547 osd.76
> >>> 10.177.64.8:6872/5395 58 : [ERR] mkpg 7.56 up [76,11] != acting [76]
> >>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611577 osd.76
> >>> 10.177.64.8:6872/5395 59 : [ERR] mkpg 7.5e up [76,11] != acting [76]
> >>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618816 osd.20
> >>> 10.177.64.4:6839/6735 54 : [ERR] mkpg 7.f up [51,20,64] != acting
> >>> [20,51,64]
> >>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618854 osd.20
> >>> 10.177.64.4:6839/6735 55 : [ERR] mkpg 7.17 up [51,20,64] != acting
> >>> [20,51,64]
> >>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618883 osd.20
> >>> 10.177.64.4:6839/6735 56 : [ERR] mkpg 7.1f up [51,20,64] != acting
> >>> [20,51,64]
> >>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618912 osd.20
> >>> 10.177.64.4:6839/6735 57 : [ERR] mkpg 7.27 up [51,20,64] != acting
> >>> [20,51,64]
> >>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618941 osd.20
> >>> 10.177.64.4:6839/6735 58 : [ERR] mkpg 7.2f up [51,20,64] != acting
> >>> [20,51,64]
> >>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618970 osd.20
> >>> 10.177.64.4:6839/6735 59 : [ERR] mkpg 7.37 up [51,20,64] != acting
> >>> [20,51,64]
> >>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618999 osd.20
> >>> 10.177.64.4:6839/6735 60 : [ERR] mkpg 7.3f up [51,20,64] != acting
> >>> [20,51,64]
> >>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619027 osd.20
> >>> 10.177.64.4:6839/6735 61 : [ERR] mkpg 7.47 up [51,20,64] != acting
> >>> [20,51,64]
> >>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619056 osd.20
> >>> 10.177.64.4:6839/6735 62 : [ERR] mkpg 7.4f up [51,20,64] != acting
> >>> [20,51,64]
> >>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619085 osd.20
> >>> 10.177.64.4:6839/6735 63 : [ERR] mkpg 7.57 up [51,20,64] != acting
> >>> [20,51,64]
> >>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619113 osd.20
> >>> 10.177.64.4:6839/6735 64 : [ERR] mkpg 7.5f up [51,20,64] != acting
> >>> [20,51,64]
> >>>
> >>> 2012/2/20 Sławomir Skowron <slawomir.skowron@gmail.com>:
> >>>> After increasing pg_num from 8 to 100 in .rgw.buckets, I have some
> >>>> serious problems.
> >>>>
> >>>> pool name     category        KB  objects  clones  degraded  unfound     rd  rd KB       wr      wr KB
> >>>> .intent-log   -             4662       19       0         0        0      0      0    26502      26501
> >>>> .log          -                0        0       0         0        0      0      0   913732     913342
> >>>> .rgw          -                1       10       0         0        0      1      0        9          7
> >>>> .rgw.buckets  -         39582566    73707       0      8061        0  86594      0   610896   36050541
> >>>> .rgw.control  -                0        1       0         0        0      0      0        0          0
> >>>> .users        -                1        1       0         0        0      0      0        1          1
> >>>> .users.uid    -                1        2       0         0        0      2      1        3          3
> >>>> data          -                0        0       0         0        0      0      0        0          0
> >>>> metadata      -                0        0       0         0        0      0      0        0          0
> >>>> rbd           -         21590723     5328       0         1        0     77     75  3013595  378345507
> >>>> total used       229514252        79068
> >>>> total avail    19685615164
> >>>> total space    20980898464
> >>>>
> >>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384251 mon.0
> >>>> 10.177.64.4:6789/0 36135 : [INF] osd.28 10.177.64.6:6806/824 failed
> >>>> (by osd.55 10.177.64.8:6809/28642)
> >>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384275 mon.0
> >>>> 10.177.64.4:6789/0 36136 : [INF] osd.37 10.177.64.6:6841/29133 failed
> >>>> (by osd.55 10.177.64.8:6809/28642)
> >>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384301 mon.0
> >>>> 10.177.64.4:6789/0 36137 : [INF] osd.7 10.177.64.4:6813/8223 failed
> >>>> (by osd.55 10.177.64.8:6809/28642)
> >>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384327 mon.0
> >>>> 10.177.64.4:6789/0 36138 : [INF] osd.44 10.177.64.6:6859/2370 failed
> >>>> (by osd.55 10.177.64.8:6809/28642)
> >>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384353 mon.0
> >>>> 10.177.64.4:6789/0 36139 : [INF] osd.49 10.177.64.6:6865/29878 failed
> >>>> (by osd.55 10.177.64.8:6809/28642)
> >>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384384 mon.0
> >>>> 10.177.64.4:6789/0 36140 : [INF] osd.17 10.177.64.4:6827/5909 failed
> >>>> (by osd.55 10.177.64.8:6809/28642)
> >>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384410 mon.0
> >>>> 10.177.64.4:6789/0 36141 : [INF] osd.12 10.177.64.4:6810/5410 failed
> >>>> (by osd.55 10.177.64.8:6809/28642)
> >>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384435 mon.0
> >>>> 10.177.64.4:6789/0 36142 : [INF] osd.39 10.177.64.6:6843/12733 failed
> >>>> (by osd.55 10.177.64.8:6809/28642)
> >>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384461 mon.0
> >>>> 10.177.64.4:6789/0 36143 : [INF] osd.42 10.177.64.6:6848/13067 failed
> >>>> (by osd.55 10.177.64.8:6809/28642)
> >>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384485 mon.0
> >>>> 10.177.64.4:6789/0 36144 : [INF] osd.31 10.177.64.6:6840/1233 failed
> >>>> (by osd.55 10.177.64.8:6809/28642)
> >>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384513 mon.0
> >>>> 10.177.64.4:6789/0 36145 : [INF] osd.36 10.177.64.6:6830/12573 failed
> >>>> (by osd.55 10.177.64.8:6809/28642)
> >>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384537 mon.0
> >>>> 10.177.64.4:6789/0 36146 : [INF] osd.38 10.177.64.6:6833/32587 failed
> >>>> (by osd.55 10.177.64.8:6809/28642)
> >>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384567 mon.0
> >>>> 10.177.64.4:6789/0 36147 : [INF] osd.5 10.177.64.4:6873/7842 failed
> >>>> (by osd.55 10.177.64.8:6809/28642)
> >>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384596 mon.0
> >>>> 10.177.64.4:6789/0 36148 : [INF] osd.21 10.177.64.4:6844/11607 failed
> >>>> (by osd.55 10.177.64.8:6809/28642)
> >>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384622 mon.0
> >>>> 10.177.64.4:6789/0 36149 : [INF] osd.23 10.177.64.4:6853/6826 failed
> >>>> (by osd.55 10.177.64.8:6809/28642)
> >>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384661 mon.0
> >>>> 10.177.64.4:6789/0 36150 : [INF] osd.51 10.177.64.6:6858/15894 failed
> >>>> (by osd.55 10.177.64.8:6809/28642)
> >>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384693 mon.0
> >>>> 10.177.64.4:6789/0 36151 : [INF] osd.48 10.177.64.6:6862/13476 failed
> >>>> (by osd.55 10.177.64.8:6809/28642)
> >>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384723 mon.0
> >>>> 10.177.64.4:6789/0 36152 : [INF] osd.32 10.177.64.6:6815/3701 failed
> >>>> (by osd.55 10.177.64.8:6809/28642)
> >>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384759 mon.0
> >>>> 10.177.64.4:6789/0 36153 : [INF] osd.41 10.177.64.6:6847/1861 failed
> >>>> (by osd.55 10.177.64.8:6809/28642)
> >>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384790 mon.0
> >>>> 10.177.64.4:6789/0 36154 : [INF] osd.0 10.177.64.4:6800/5230 failed
> >>>> (by osd.55 10.177.64.8:6809/28642)
> >>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384814 mon.0
> >>>> 10.177.64.4:6789/0 36155 : [INF] osd.3 10.177.64.4:6865/7242 failed
> >>>> (by osd.55 10.177.64.8:6809/28642)
> >>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384838 mon.0
> >>>> 10.177.64.4:6789/0 36156 : [INF] osd.1 10.177.64.4:6804/9729 failed
> >>>> (by osd.55 10.177.64.8:6809/28642)
> >>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384864 mon.0
> >>>> 10.177.64.4:6789/0 36157 : [INF] osd.47 10.177.64.6:6866/13924 failed
> >>>> (by osd.55 10.177.64.8:6809/28642)
> >>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384896 mon.0
> >>>> 10.177.64.4:6789/0 36158 : [INF] osd.45 10.177.64.6:6857/4401 failed
> >>>> (by osd.55 10.177.64.8:6809/28642)
> >>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384928 mon.0
> >>>> 10.177.64.4:6789/0 36159 : [INF] osd.20 10.177.64.4:6842/6246 failed
> >>>> (by osd.55 10.177.64.8:6809/28642)
> >>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384952 mon.0
> >>>> 10.177.64.4:6789/0 36160 : [INF] osd.16 10.177.64.4:6821/5833 failed
> >>>> (by osd.55 10.177.64.8:6809/28642)
> >>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384982 mon.0
> >>>> 10.177.64.4:6789/0 36161 : [INF] osd.35 10.177.64.6:6824/3877 failed
> >>>> (by osd.55 10.177.64.8:6809/28642)
> >>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.385007 mon.0
> >>>> 10.177.64.4:6789/0 36162 : [INF] osd.3 10.177.64.4:6865/7242 failed
> >>>> (by osd.55 10.177.64.8:6809/28642)
> >>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.385032 mon.0
> >>>> 10.177.64.4:6789/0 36163 : [INF] osd.7 10.177.64.4:6813/8223 failed
> >>>> (by osd.55 10.177.64.8:6809/28642)
> >>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.385059 mon.0
> >>>> 10.177.64.4:6789/0 36164 : [INF] osd.19 10.177.64.4:6831/10499 failed
> >>>> (by osd.55 10.177.64.8:6809/28642)
> >>>> 2012-02-20 20:06:10.851483    pg v172582: 10548 pgs: 92 creating, 1
> >>>> active, 9713 active+clean, 3 active+degraded+backfill, 657 peering, 77
> >>>> down+peering, 5 active+degraded; 59744 MB data, 218 GB used, 18773 GB
> >>>> / 20008 GB avail; 8071/237184 degraded (3.403%)
> >>>> 2012-02-20 20:06:10.967491   osd e7436: 78 osds: 70 up, 73 in
> >>>> 2012-02-20 20:06:10.990903   log 2012-02-20 20:05:56.448227 mon.2
> >>>> 10.177.64.8:6789/0 134 : [INF] mon.2 calling new monitor election
> >>>> 2012-02-20 20:06:10.990903   log 2012-02-20 20:05:58.252635 mon.1
> >>>> 10.177.64.6:6789/0 3929 : [INF] mon.1 calling new monitor election
> >>>> 2012-02-20 20:06:11.034669    pg v172583: 10548 pgs: 92 creating, 1
> >>>> active, 9713 active+clean, 3 active+degraded+backfill, 657 peering, 77
> >>>> down+peering, 5 active+degraded; 59744 MB data, 218 GB used, 18773 GB
> >>>> / 20008 GB avail; 8071/237184 degraded (3.403%)
> >>>> 2012-02-20 20:06:11.958126   osd e7437: 78 osds: 70 up, 73 in
> >>>> 2012-02-20 20:06:12.068650    pg v172584: 10548 pgs: 92 creating, 1
> >>>> active, 9711 active+clean, 3 active+degraded+backfill, 659 peering, 77
> >>>> down+peering, 5 active+degraded; 59744 MB data, 218 GB used, 18773 GB
> >>>> / 20008 GB avail; 8067/237184 degraded (3.401%)
> >>>> 2012-02-20 20:06:12.947997   osd e7438: 78 osds: 70 up, 73 in
> >>>> 2012-02-20 20:06:13.770942    pg v172585: 10548 pgs: 3 inactive, 92
> >>>> creating, 1 active, 9824 active+clean, 3 active+degraded+backfill, 541
> >>>> peering, 77 down+peering, 7 active+degraded; 59744 MB data, 218 GB
> >>>> used, 18773 GB / 20008 GB avail; 8067/237184 degraded (3.401%)
> >>>> 2012-02-20 20:06:14.686248    pg v172586: 10548 pgs: 3 inactive, 92
> >>>> creating, 1 active, 9894 active+clean, 3 active+degraded+backfill, 471
> >>>> peering, 77 down+peering, 7 active+degraded; 59744 MB data, 218 GB
> >>>> used, 18773 GB / 20008 GB avail; 8067/237184 degraded (3.401%)
> >>>> 2012-02-20 20:06:15.340365    pg v172587: 10548 pgs: 3 inactive, 92
> >>>> creating, 1 active, 9915 active+clean, 3 active+degraded+backfill, 447
> >>>> peering, 77 down+peering, 10 active+degraded; 59744 MB data, 218 GB
> >>>> used, 18773 GB / 20008 GB avail; 8067/237184 degraded (3.401%)
> >>>> 2012-02-20 20:06:16.852264    pg v172588: 10548 pgs: 3 inactive, 92
> >>>> creating, 84 active, 10094 active+clean, 3 active+degraded+backfill,
> >>>> 179 peering, 77 down+peering, 16 active+degraded; 59744 MB data, 218
> >>>> GB used, 18773 GB / 20008 GB avail; 8067/237184 degraded (3.401%)
> >>>>
> >>>> OSDs keep failing, one after another. The number of up OSDs swings
> >>>> between 62 and 70-72: it drops, comes back up, and drops again.
> >>>>
> >>>> 2012-02-20 20:09:47.305016 7f816009e700 osd.20 7476 heartbeat_check:
> >>>> no heartbeat from osd.64 since 2012-02-20 20:09:30.286408 (cutoff
> >>>> 2012-02-20 20:09:42.304975)
> >>>> 2012-02-20 20:09:47.410159 7f816c9b8700 osd.20 7476 heartbeat_check:
> >>>> no heartbeat from osd.61 since 2012-02-20 20:09:29.807115 (cutoff
> >>>> 2012-02-20 20:09:42.410144)
> >>>> 2012-02-20 20:09:47.410177 7f816c9b8700 osd.20 7476 heartbeat_check:
> >>>> no heartbeat from osd.64 since 2012-02-20 20:09:30.286408 (cutoff
> >>>> 2012-02-20 20:09:42.410144)
> >>>> 2012-02-20 20:09:47.906661 7f816009e700 osd.20 7476 heartbeat_check:
> >>>> no heartbeat from osd.61 since 2012-02-20 20:09:29.807115 (cutoff
> >>>> 2012-02-20 20:09:42.906639)
> >>>> 2012-02-20 20:09:47.906685 7f816009e700 osd.20 7476 heartbeat_check:
> >>>> no heartbeat from osd.64 since 2012-02-20 20:09:30.286408 (cutoff
> >>>> 2012-02-20 20:09:42.906639)
> >>>> 2012-02-20 20:09:48.114431 7f815660b700 -- 10.177.64.4:0/6389 >>
> >>>> 10.177.64.4:6854/5398 pipe(0x1398c500 sd=47 pgs=26 cs=2 l=0).connect
> >>>> claims to be 10.177.64.4:6854/17798 not 10.177.64.4:6854/5398 - wrong
> >>>> node!
> >>>> 2012-02-20 20:09:48.410333 7f816c9b8700 osd.20 7476 heartbeat_check:
> >>>> no heartbeat from osd.61 since 2012-02-20 20:09:29.807115 (cutoff
> >>>> 2012-02-20 20:09:43.410313)
> >>>> 2012-02-20 20:09:48.410361 7f816c9b8700 osd.20 7476 heartbeat_check:
> >>>> no heartbeat from osd.64 since 2012-02-20 20:09:30.286408 (cutoff
> >>>> 2012-02-20 20:09:43.410313)
> >>>> 2012-02-20 20:09:51.450127 7f814b75d700 -- 10.177.64.4:0/6389 >>
> >>>> 10.177.64.4:6855/17423 pipe(0xa86e780 sd=17 pgs=17 cs=2 l=0).connect
> >>>> claims to be 10.177.64.4:6855/17798 not 10.177.64.4:6855/17423 - wrong
> >>>> node!
> >>>> 2012-02-20 20:09:54.498949 7f814a248700 -- 10.177.64.4:0/6389 >>
> >>>> 10.177.64.4:6854/19396 pipe(0x38cc780 sd=25 pgs=8 cs=2 l=0).connect
> >>>> claims to be 10.177.64.4:6854/17798 not 10.177.64.4:6854/19396 - wrong
> >>>> node!
> >>>>
> >>>> Some of them go down with this:
> >>>>
> >>>> 2012-02-20 18:22:15.824992 7fe3ec1c97a0 ceph version 0.41
> >>>> (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d), process ceph-osd,
> >>>> pid 31379
> >>>> 2012-02-20 18:22:15.826476 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
> >>>> mount FIEMAP ioctl is supported
> >>>> 2012-02-20 18:22:15.826514 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
> >>>> mount did NOT detect btrfs
> >>>> 2012-02-20 18:22:15.826613 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
> >>>> mount found snaps <>
> >>>> 2012-02-20 18:22:15.826650 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
> >>>> mount: WRITEAHEAD journal mode explicitly enabled in conf
> >>>> 2012-02-20 18:22:16.415671 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
> >>>> mount FIEMAP ioctl is supported
> >>>> 2012-02-20 18:22:16.415703 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
> >>>> mount did NOT detect btrfs
> >>>> 2012-02-20 18:22:16.415744 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
> >>>> mount found snaps <>
> >>>> 2012-02-20 18:22:16.415758 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
> >>>> mount: WRITEAHEAD journal mode explicitly enabled in conf
> >>>> osd/OSD.cc: In function 'void OSD::split_pg(PG*, std::map<pg_t, PG*>&,
> >>>> ObjectStore::Transaction&)' thread 7fe3df8c4700 time 2012-02-20
> >>>> 18:22:19.900886
> >>>> osd/OSD.cc: 4066: FAILED assert(child)
> >>>> ceph version 0.41 (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d)
> >>>> 1: (OSD::split_pg(PG*, std::map<pg_t, PG*, std::less<pg_t>,
> >>>> std::allocator<std::pair<pg_t const, PG*> > >&,
> >>>> ObjectStore::Transaction&)+0x23e0) [0x54cd20]
> >>>> 2: (OSD::kick_pg_split_queue()+0x880) [0x556d90]
> >>>> 3: (OSD::handle_pg_notify(MOSDPGNotify*)+0x4b6) [0x559546]
> >>>> 4: (OSD::_dispatch(Message*)+0x608) [0x560e58]
> >>>> 5: (OSD::ms_dispatch(Message*)+0x11e) [0x561b7e]
> >>>> 6: (SimpleMessenger::dispatch_entry()+0x76b) [0x5c844b]
> >>>> 7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4b5cfc]
> >>>> 8: (()+0x7efc) [0x7fe3ebda3efc]
> >>>> 9: (clone()+0x6d) [0x7fe3ea3d489d]
> >>>> ceph version 0.41 (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d)
> >>>> 1: (OSD::split_pg(PG*, std::map<pg_t, PG*, std::less<pg_t>,
> >>>> std::allocator<std::pair<pg_t const, PG*> > >&,
> >>>> ObjectStore::Transaction&)+0x23e0) [0x54cd20]
> >>>> 2: (OSD::kick_pg_split_queue()+0x880) [0x556d90]
> >>>> 3: (OSD::handle_pg_notify(MOSDPGNotify*)+0x4b6) [0x559546]
> >>>> 4: (OSD::_dispatch(Message*)+0x608) [0x560e58]
> >>>> 5: (OSD::ms_dispatch(Message*)+0x11e) [0x561b7e]
> >>>> 6: (SimpleMessenger::dispatch_entry()+0x76b) [0x5c844b]
> >>>> 7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4b5cfc]
> >>>> 8: (()+0x7efc) [0x7fe3ebda3efc]
> >>>> 9: (clone()+0x6d) [0x7fe3ea3d489d]
> >>>> *** Caught signal (Aborted) **
> >>>> in thread 7fe3df8c4700
> >>>> ceph version 0.41 (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d)
> >>>> 1: /usr/bin/ceph-osd() [0x6099f6]
> >>>> 2: (()+0x10060) [0x7fe3ebdac060]
> >>>> 3: (gsignal()+0x35) [0x7fe3ea3293a5]
> >>>> 4: (abort()+0x17b) [0x7fe3ea32cb0b]
> >>>> 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fe3eabe7d7d]
> >>>> 6: (()+0xb9f26) [0x7fe3eabe5f26]
> >>>> 7: (()+0xb9f53) [0x7fe3eabe5f53]
> >>>> 8: (()+0xba04e) [0x7fe3eabe604e]
> >>>> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> >>>> const*)+0x200) [0x5dc6b0]
> >>>> 10: (OSD::split_pg(PG*, std::map<pg_t, PG*, std::less<pg_t>,
> >>>> std::allocator<std::pair<pg_t const, PG*> > >&,
> >>>> ObjectStore::Transaction&)+0x23e0) [0x54cd20]
> >>>> 11: (OSD::kick_pg_split_queue()+0x880) [0x556d90]
> >>>> 12: (OSD::handle_pg_notify(MOSDPGNotify*)+0x4b6) [0x559546]
> >>>> 13: (OSD::_dispatch(Message*)+0x608) [0x560e58]
> >>>> 14: (OSD::ms_dispatch(Message*)+0x11e) [0x561b7e]
> >>>> 15: (SimpleMessenger::dispatch_entry()+0x76b) [0x5c844b]
> >>>> 16: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4b5cfc]
> >>>> 17: (()+0x7efc) [0x7fe3ebda3efc]
> >>>> 18: (clone()+0x6d) [0x7fe3ea3d489d]
> >>>> 2012-02-20 18:23:57.915653 7fa818e3e7a0 ceph version 0.41
> >>>> (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d), process ceph-osd,
> >>>> pid 6596
> >>>>
> >>>> Do you have any ideas? If you need data from the cluster, or core
> >>>> dumps from the OSDs, I have a lot of them, but they are large.
> >>>>
> >>>> --
> >>>> -----
> >>>> Regards
> >>>>
> >>>> Sławek "sZiBis" Skowron
> >>>
> >>>
> >>>
> >>> --
> >>> -----
> >>> Regards
> >>>
> >>> Sławek "sZiBis" Skowron


* Re: Serious problem after increase pg_num in pool
  2012-02-21 16:00         ` Sage Weil
@ 2012-02-21 16:55           ` Sławomir Skowron
  0 siblings, 0 replies; 7+ messages in thread
From: Sławomir Skowron @ 2012-02-21 16:55 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel@vger.kernel.org

Unfortunately, 3 hours ago I decided to re-initialize the cluster :(

Some data was still available via rados, but the cluster was unstable, and
migrating the data was difficult under time pressure from outside :)

After initializing a new cluster on one machine, with clean pools, I was
able to increase the number of PGs in the .rgw pools.

The cluster is now stable on version 0.42, and new data is going in.

On 21 Feb 2012, at 17:00, Sage Weil <sage@newdream.net> wrote:

> On Tue, 21 Feb 2012, Sławomir Skowron wrote:
>> If there is no chance to stabilize this cluster, I will try something like this:
>>
>> - stop one machine in the cluster
>> - check that the cluster is still OK and the data is available
>> - make a new fs on one machine
>> - migrate the data from rados via obsync
>> - expand the new cluster with the second and third machines
>> - change keys for radosgw etc.
>> - the new cluster is up with the old data
>>
>> Can objects in the .rgw.buckets pool be migrated via obsync?
>
> obsync operates at the S3/Swift bucket level, and many such buckets are stored
> in .rgw.buckets.  You'll need to sync each of those buckets individually.
>
> Before you do that, though, I have a pg split branch that is almost ready.
> If you don't mind, I'd be curious if it can handle your semi-broken
> cluster.  I'll have it pushed in about 2 hours, if you can wait!  If not,
> no worries.
>
> sage
>
>
>
>>
>> On 21 Feb 2012, at 07:46, "Sławomir Skowron" <szibis@gmail.com> wrote:
>>
>>> 40 GB in 3 copies in the rgw bucket, and some data in RBD, but that can be
>>> destroyed.
>>>
>>> ceph -s reports 224 GB in normal state.
>>>
>>> Regards
>>>
>>> iSS
>>>
>>> On 20 Feb 2012, at 21:19, Sage Weil <sage@newdream.net> wrote:
>>>
>>>> Ooh, the pg split functionality is currently broken, and we weren't
>>>> planning on fixing it for a while longer.  I didn't realize it was still
>>>> possible to trigger from the monitor.
>>>>
>>>> I'm looking at how difficult it is to make it work (even inefficiently).
>>>>
>>>> How much data do you have in the cluster?
>>>>
>>>> sage
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, 20 Feb 2012, Sławomir Skowron wrote:
>>>>
>>>>> and this in ceph -w
>>>>>
>>>>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611270 osd.76
>>>>> 10.177.64.8:6872/5395 49 : [ERR] mkpg 7.e up [76,11] != acting [76]
>>>>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611308 osd.76
>>>>> 10.177.64.8:6872/5395 50 : [ERR] mkpg 7.16 up [76,11] != acting [76]
>>>>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611339 osd.76
>>>>> 10.177.64.8:6872/5395 51 : [ERR] mkpg 7.1e up [76,11] != acting [76]
>>>>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611369 osd.76
>>>>> 10.177.64.8:6872/5395 52 : [ERR] mkpg 7.26 up [76,11] != acting [76]
>>>>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611399 osd.76
>>>>> 10.177.64.8:6872/5395 53 : [ERR] mkpg 7.2e up [76,11] != acting [76]
>>>>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611428 osd.76
>>>>> 10.177.64.8:6872/5395 54 : [ERR] mkpg 7.36 up [76,11] != acting [76]
>>>>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611458 osd.76
>>>>> 10.177.64.8:6872/5395 55 : [ERR] mkpg 7.3e up [76,11] != acting [76]
>>>>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611488 osd.76
>>>>> 10.177.64.8:6872/5395 56 : [ERR] mkpg 7.46 up [76,11] != acting [76]
>>>>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611517 osd.76
>>>>> 10.177.64.8:6872/5395 57 : [ERR] mkpg 7.4e up [76,11] != acting [76]
>>>>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611547 osd.76
>>>>> 10.177.64.8:6872/5395 58 : [ERR] mkpg 7.56 up [76,11] != acting [76]
>>>>> 2012-02-20 20:34:13.531857   log 2012-02-20 20:34:07.611577 osd.76
>>>>> 10.177.64.8:6872/5395 59 : [ERR] mkpg 7.5e up [76,11] != acting [76]
>>>>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618816 osd.20
>>>>> 10.177.64.4:6839/6735 54 : [ERR] mkpg 7.f up [51,20,64] != acting
>>>>> [20,51,64]
>>>>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618854 osd.20
>>>>> 10.177.64.4:6839/6735 55 : [ERR] mkpg 7.17 up [51,20,64] != acting
>>>>> [20,51,64]
>>>>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618883 osd.20
>>>>> 10.177.64.4:6839/6735 56 : [ERR] mkpg 7.1f up [51,20,64] != acting
>>>>> [20,51,64]
>>>>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618912 osd.20
>>>>> 10.177.64.4:6839/6735 57 : [ERR] mkpg 7.27 up [51,20,64] != acting
>>>>> [20,51,64]
>>>>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618941 osd.20
>>>>> 10.177.64.4:6839/6735 58 : [ERR] mkpg 7.2f up [51,20,64] != acting
>>>>> [20,51,64]
>>>>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618970 osd.20
>>>>> 10.177.64.4:6839/6735 59 : [ERR] mkpg 7.37 up [51,20,64] != acting
>>>>> [20,51,64]
>>>>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.618999 osd.20
>>>>> 10.177.64.4:6839/6735 60 : [ERR] mkpg 7.3f up [51,20,64] != acting
>>>>> [20,51,64]
>>>>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619027 osd.20
>>>>> 10.177.64.4:6839/6735 61 : [ERR] mkpg 7.47 up [51,20,64] != acting
>>>>> [20,51,64]
>>>>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619056 osd.20
>>>>> 10.177.64.4:6839/6735 62 : [ERR] mkpg 7.4f up [51,20,64] != acting
>>>>> [20,51,64]
>>>>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619085 osd.20
>>>>> 10.177.64.4:6839/6735 63 : [ERR] mkpg 7.57 up [51,20,64] != acting
>>>>> [20,51,64]
>>>>> 2012-02-20 20:34:17.015290   log 2012-02-20 20:34:07.619113 osd.20
>>>>> 10.177.64.4:6839/6735 64 : [ERR] mkpg 7.5f up [51,20,64] != acting
>>>>> [20,51,64]
>>>>>
>>>>> 2012/2/20 Sławomir Skowron <slawomir.skowron@gmail.com>:
>>>>>> After increasing pg_num from 8 to 100 in .rgw.buckets, I have some
>>>>>> serious problems.
>>>>>>
>>>>>> pool name     category        KB  objects  clones  degraded  unfound     rd  rd KB       wr      wr KB
>>>>>> .intent-log   -             4662       19       0         0        0      0      0    26502      26501
>>>>>> .log          -                0        0       0         0        0      0      0   913732     913342
>>>>>> .rgw          -                1       10       0         0        0      1      0        9          7
>>>>>> .rgw.buckets  -         39582566    73707       0      8061        0  86594      0   610896   36050541
>>>>>> .rgw.control  -                0        1       0         0        0      0      0        0          0
>>>>>> .users        -                1        1       0         0        0      0      0        1          1
>>>>>> .users.uid    -                1        2       0         0        0      2      1        3          3
>>>>>> data          -                0        0       0         0        0      0      0        0          0
>>>>>> metadata      -                0        0       0         0        0      0      0        0          0
>>>>>> rbd           -         21590723     5328       0         1        0     77     75  3013595  378345507
>>>>>> total used       229514252        79068
>>>>>> total avail    19685615164
>>>>>> total space    20980898464
>>>>>>
>>>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384251 mon.0
>>>>>> 10.177.64.4:6789/0 36135 : [INF] osd.28 10.177.64.6:6806/824 failed
>>>>>> (by osd.55 10.177.64.8:6809/28642)
>>>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384275 mon.0
>>>>>> 10.177.64.4:6789/0 36136 : [INF] osd.37 10.177.64.6:6841/29133 failed
>>>>>> (by osd.55 10.177.64.8:6809/28642)
>>>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384301 mon.0
>>>>>> 10.177.64.4:6789/0 36137 : [INF] osd.7 10.177.64.4:6813/8223 failed
>>>>>> (by osd.55 10.177.64.8:6809/28642)
>>>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384327 mon.0
>>>>>> 10.177.64.4:6789/0 36138 : [INF] osd.44 10.177.64.6:6859/2370 failed
>>>>>> (by osd.55 10.177.64.8:6809/28642)
>>>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384353 mon.0
>>>>>> 10.177.64.4:6789/0 36139 : [INF] osd.49 10.177.64.6:6865/29878 failed
>>>>>> (by osd.55 10.177.64.8:6809/28642)
>>>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384384 mon.0
>>>>>> 10.177.64.4:6789/0 36140 : [INF] osd.17 10.177.64.4:6827/5909 failed
>>>>>> (by osd.55 10.177.64.8:6809/28642)
>>>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384410 mon.0
>>>>>> 10.177.64.4:6789/0 36141 : [INF] osd.12 10.177.64.4:6810/5410 failed
>>>>>> (by osd.55 10.177.64.8:6809/28642)
>>>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384435 mon.0
>>>>>> 10.177.64.4:6789/0 36142 : [INF] osd.39 10.177.64.6:6843/12733 failed
>>>>>> (by osd.55 10.177.64.8:6809/28642)
>>>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384461 mon.0
>>>>>> 10.177.64.4:6789/0 36143 : [INF] osd.42 10.177.64.6:6848/13067 failed
>>>>>> (by osd.55 10.177.64.8:6809/28642)
>>>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384485 mon.0
>>>>>> 10.177.64.4:6789/0 36144 : [INF] osd.31 10.177.64.6:6840/1233 failed
>>>>>> (by osd.55 10.177.64.8:6809/28642)
>>>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384513 mon.0
>>>>>> 10.177.64.4:6789/0 36145 : [INF] osd.36 10.177.64.6:6830/12573 failed
>>>>>> (by osd.55 10.177.64.8:6809/28642)
>>>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384537 mon.0
>>>>>> 10.177.64.4:6789/0 36146 : [INF] osd.38 10.177.64.6:6833/32587 failed
>>>>>> (by osd.55 10.177.64.8:6809/28642)
>>>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384567 mon.0
>>>>>> 10.177.64.4:6789/0 36147 : [INF] osd.5 10.177.64.4:6873/7842 failed
>>>>>> (by osd.55 10.177.64.8:6809/28642)
>>>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384596 mon.0
>>>>>> 10.177.64.4:6789/0 36148 : [INF] osd.21 10.177.64.4:6844/11607 failed
>>>>>> (by osd.55 10.177.64.8:6809/28642)
>>>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384622 mon.0
>>>>>> 10.177.64.4:6789/0 36149 : [INF] osd.23 10.177.64.4:6853/6826 failed
>>>>>> (by osd.55 10.177.64.8:6809/28642)
>>>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384661 mon.0
>>>>>> 10.177.64.4:6789/0 36150 : [INF] osd.51 10.177.64.6:6858/15894 failed
>>>>>> (by osd.55 10.177.64.8:6809/28642)
>>>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384693 mon.0
>>>>>> 10.177.64.4:6789/0 36151 : [INF] osd.48 10.177.64.6:6862/13476 failed
>>>>>> (by osd.55 10.177.64.8:6809/28642)
>>>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384723 mon.0
>>>>>> 10.177.64.4:6789/0 36152 : [INF] osd.32 10.177.64.6:6815/3701 failed
>>>>>> (by osd.55 10.177.64.8:6809/28642)
>>>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384759 mon.0
>>>>>> 10.177.64.4:6789/0 36153 : [INF] osd.41 10.177.64.6:6847/1861 failed
>>>>>> (by osd.55 10.177.64.8:6809/28642)
>>>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384790 mon.0
>>>>>> 10.177.64.4:6789/0 36154 : [INF] osd.0 10.177.64.4:6800/5230 failed
>>>>>> (by osd.55 10.177.64.8:6809/28642)
>>>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384814 mon.0
>>>>>> 10.177.64.4:6789/0 36155 : [INF] osd.3 10.177.64.4:6865/7242 failed
>>>>>> (by osd.55 10.177.64.8:6809/28642)
>>>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384838 mon.0
>>>>>> 10.177.64.4:6789/0 36156 : [INF] osd.1 10.177.64.4:6804/9729 failed
>>>>>> (by osd.55 10.177.64.8:6809/28642)
>>>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384864 mon.0
>>>>>> 10.177.64.4:6789/0 36157 : [INF] osd.47 10.177.64.6:6866/13924 failed
>>>>>> (by osd.55 10.177.64.8:6809/28642)
>>>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384896 mon.0
>>>>>> 10.177.64.4:6789/0 36158 : [INF] osd.45 10.177.64.6:6857/4401 failed
>>>>>> (by osd.55 10.177.64.8:6809/28642)
>>>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384928 mon.0
>>>>>> 10.177.64.4:6789/0 36159 : [INF] osd.20 10.177.64.4:6842/6246 failed
>>>>>> (by osd.55 10.177.64.8:6809/28642)
>>>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384952 mon.0
>>>>>> 10.177.64.4:6789/0 36160 : [INF] osd.16 10.177.64.4:6821/5833 failed
>>>>>> (by osd.55 10.177.64.8:6809/28642)
>>>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.384982 mon.0
>>>>>> 10.177.64.4:6789/0 36161 : [INF] osd.35 10.177.64.6:6824/3877 failed
>>>>>> (by osd.55 10.177.64.8:6809/28642)
>>>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.385007 mon.0
>>>>>> 10.177.64.4:6789/0 36162 : [INF] osd.3 10.177.64.4:6865/7242 failed
>>>>>> (by osd.55 10.177.64.8:6809/28642)
>>>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.385032 mon.0
>>>>>> 10.177.64.4:6789/0 36163 : [INF] osd.7 10.177.64.4:6813/8223 failed
>>>>>> (by osd.55 10.177.64.8:6809/28642)
>>>>>> 2012-02-20 20:06:10.688085   log 2012-02-20 20:06:09.385059 mon.0
>>>>>> 10.177.64.4:6789/0 36164 : [INF] osd.19 10.177.64.4:6831/10499 failed
>>>>>> (by osd.55 10.177.64.8:6809/28642)
>>>>>> 2012-02-20 20:06:10.851483    pg v172582: 10548 pgs: 92 creating, 1
>>>>>> active, 9713 active+clean, 3 active+degraded+backfill, 657 peering, 77
>>>>>> down+peering, 5 active+degraded; 59744 MB data, 218 GB used, 18773 GB
>>>>>> / 20008 GB avail; 8071/237184 degraded (3.403%)
>>>>>> 2012-02-20 20:06:10.967491   osd e7436: 78 osds: 70 up, 73 in
>>>>>> 2012-02-20 20:06:10.990903   log 2012-02-20 20:05:56.448227 mon.2
>>>>>> 10.177.64.8:6789/0 134 : [INF] mon.2 calling new monitor election
>>>>>> 2012-02-20 20:06:10.990903   log 2012-02-20 20:05:58.252635 mon.1
>>>>>> 10.177.64.6:6789/0 3929 : [INF] mon.1 calling new monitor election
>>>>>> 2012-02-20 20:06:11.034669    pg v172583: 10548 pgs: 92 creating, 1
>>>>>> active, 9713 active+clean, 3 active+degraded+backfill, 657 peering, 77
>>>>>> down+peering, 5 active+degraded; 59744 MB data, 218 GB used, 18773 GB
>>>>>> / 20008 GB avail; 8071/237184 degraded (3.403%)
>>>>>> 2012-02-20 20:06:11.958126   osd e7437: 78 osds: 70 up, 73 in
>>>>>> 2012-02-20 20:06:12.068650    pg v172584: 10548 pgs: 92 creating, 1
>>>>>> active, 9711 active+clean, 3 active+degraded+backfill, 659 peering, 77
>>>>>> down+peering, 5 active+degraded; 59744 MB data, 218 GB used, 18773 GB
>>>>>> / 20008 GB avail; 8067/237184 degraded (3.401%)
>>>>>> 2012-02-20 20:06:12.947997   osd e7438: 78 osds: 70 up, 73 in
>>>>>> 2012-02-20 20:06:13.770942    pg v172585: 10548 pgs: 3 inactive, 92
>>>>>> creating, 1 active, 9824 active+clean, 3 active+degraded+backfill, 541
>>>>>> peering, 77 down+peering, 7 active+degraded; 59744 MB data, 218 GB
>>>>>> used, 18773 GB / 20008 GB avail; 8067/237184 degraded (3.401%)
>>>>>> 2012-02-20 20:06:14.686248    pg v172586: 10548 pgs: 3 inactive, 92
>>>>>> creating, 1 active, 9894 active+clean, 3 active+degraded+backfill, 471
>>>>>> peering, 77 down+peering, 7 active+degraded; 59744 MB data, 218 GB
>>>>>> used, 18773 GB / 20008 GB avail; 8067/237184 degraded (3.401%)
>>>>>> 2012-02-20 20:06:15.340365    pg v172587: 10548 pgs: 3 inactive, 92
>>>>>> creating, 1 active, 9915 active+clean, 3 active+degraded+backfill, 447
>>>>>> peering, 77 down+peering, 10 active+degraded; 59744 MB data, 218 GB
>>>>>> used, 18773 GB / 20008 GB avail; 8067/237184 degraded (3.401%)
>>>>>> 2012-02-20 20:06:16.852264    pg v172588: 10548 pgs: 3 inactive, 92
>>>>>> creating, 84 active, 10094 active+clean, 3 active+degraded+backfill,
>>>>>> 179 peering, 77 down+peering, 16 active+degraded; 59744 MB data, 218
>>>>>> GB used, 18773 GB / 20008 GB avail; 8067/237184 degraded (3.401%)
>>>>>>
>>>>>> OSDs keep failing, again and again, one after another. The number of
>>>>>> up OSDs fluctuates: from 62 up to 70-72, then down, and then up
>>>>>> again.
>>>>>>
>>>>>> 2012-02-20 20:09:47.305016 7f816009e700 osd.20 7476 heartbeat_check:
>>>>>> no heartbeat from osd.64 since 2012-02-20 20:09:30.286408 (cutoff
>>>>>> 2012-02-20 20:09:42.304975)
>>>>>> 2012-02-20 20:09:47.410159 7f816c9b8700 osd.20 7476 heartbeat_check:
>>>>>> no heartbeat from osd.61 since 2012-02-20 20:09:29.807115 (cutoff
>>>>>> 2012-02-20 20:09:42.410144)
>>>>>> 2012-02-20 20:09:47.410177 7f816c9b8700 osd.20 7476 heartbeat_check:
>>>>>> no heartbeat from osd.64 since 2012-02-20 20:09:30.286408 (cutoff
>>>>>> 2012-02-20 20:09:42.410144)
>>>>>> 2012-02-20 20:09:47.906661 7f816009e700 osd.20 7476 heartbeat_check:
>>>>>> no heartbeat from osd.61 since 2012-02-20 20:09:29.807115 (cutoff
>>>>>> 2012-02-20 20:09:42.906639)
>>>>>> 2012-02-20 20:09:47.906685 7f816009e700 osd.20 7476 heartbeat_check:
>>>>>> no heartbeat from osd.64 since 2012-02-20 20:09:30.286408 (cutoff
>>>>>> 2012-02-20 20:09:42.906639)
>>>>>> 2012-02-20 20:09:48.114431 7f815660b700 -- 10.177.64.4:0/6389 >>
>>>>>> 10.177.64.4:6854/5398 pipe(0x1398c500 sd=47 pgs=26 cs=2 l=0).connect
>>>>>> claims to be 10.177.64.4:6854/17798 not 10.177.64.4:6854/5398 - wrong
>>>>>> node!
>>>>>> 2012-02-20 20:09:48.410333 7f816c9b8700 osd.20 7476 heartbeat_check:
>>>>>> no heartbeat from osd.61 since 2012-02-20 20:09:29.807115 (cutoff
>>>>>> 2012-02-20 20:09:43.410313)
>>>>>> 2012-02-20 20:09:48.410361 7f816c9b8700 osd.20 7476 heartbeat_check:
>>>>>> no heartbeat from osd.64 since 2012-02-20 20:09:30.286408 (cutoff
>>>>>> 2012-02-20 20:09:43.410313)
>>>>>> 2012-02-20 20:09:51.450127 7f814b75d700 -- 10.177.64.4:0/6389 >>
>>>>>> 10.177.64.4:6855/17423 pipe(0xa86e780 sd=17 pgs=17 cs=2 l=0).connect
>>>>>> claims to be 10.177.64.4:6855/17798 not 10.177.64.4:6855/17423 - wrong
>>>>>> node!
>>>>>> 2012-02-20 20:09:54.498949 7f814a248700 -- 10.177.64.4:0/6389 >>
>>>>>> 10.177.64.4:6854/19396 pipe(0x38cc780 sd=25 pgs=8 cs=2 l=0).connect
>>>>>> claims to be 10.177.64.4:6854/17798 not 10.177.64.4:6854/19396 - wrong
>>>>>> node!
>>>>>>
>>>>>> Some of them go down with this:
>>>>>>
>>>>>> 2012-02-20 18:22:15.824992 7fe3ec1c97a0 ceph version 0.41
>>>>>> (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d), process ceph-osd,
>>>>>> pid 31379
>>>>>> 2012-02-20 18:22:15.826476 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
>>>>>> mount FIEMAP ioctl is supported
>>>>>> 2012-02-20 18:22:15.826514 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
>>>>>> mount did NOT detect btrfs
>>>>>> 2012-02-20 18:22:15.826613 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
>>>>>> mount found snaps <>
>>>>>> 2012-02-20 18:22:15.826650 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
>>>>>> mount: WRITEAHEAD journal mode explicitly enabled in conf
>>>>>> 2012-02-20 18:22:16.415671 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
>>>>>> mount FIEMAP ioctl is supported
>>>>>> 2012-02-20 18:22:16.415703 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
>>>>>> mount did NOT detect btrfs
>>>>>> 2012-02-20 18:22:16.415744 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
>>>>>> mount found snaps <>
>>>>>> 2012-02-20 18:22:16.415758 7fe3ec1c97a0 filestore(/vol0/data/osd.24)
>>>>>> mount: WRITEAHEAD journal mode explicitly enabled in conf
>>>>>> osd/OSD.cc: In function 'void OSD::split_pg(PG*, std::map<pg_t, PG*>&,
>>>>>> ObjectStore::Transaction&)' thread 7fe3df8c4700 time 2012-02-20
>>>>>> 18:22:19.900886
>>>>>> osd/OSD.cc: 4066: FAILED assert(child)
>>>>>> ceph version 0.41 (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d)
>>>>>> 1: (OSD::split_pg(PG*, std::map<pg_t, PG*, std::less<pg_t>,
>>>>>> std::allocator<std::pair<pg_t const, PG*> > >&,
>>>>>> ObjectStore::Transaction&)+0x23e0) [0x54cd20]
>>>>>> 2: (OSD::kick_pg_split_queue()+0x880) [0x556d90]
>>>>>> 3: (OSD::handle_pg_notify(MOSDPGNotify*)+0x4b6) [0x559546]
>>>>>> 4: (OSD::_dispatch(Message*)+0x608) [0x560e58]
>>>>>> 5: (OSD::ms_dispatch(Message*)+0x11e) [0x561b7e]
>>>>>> 6: (SimpleMessenger::dispatch_entry()+0x76b) [0x5c844b]
>>>>>> 7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4b5cfc]
>>>>>> 8: (()+0x7efc) [0x7fe3ebda3efc]
>>>>>> 9: (clone()+0x6d) [0x7fe3ea3d489d]
>>>>>> ceph version 0.41 (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d)
>>>>>> 1: (OSD::split_pg(PG*, std::map<pg_t, PG*, std::less<pg_t>,
>>>>>> std::allocator<std::pair<pg_t const, PG*> > >&,
>>>>>> ObjectStore::Transaction&)+0x23e0) [0x54cd20]
>>>>>> 2: (OSD::kick_pg_split_queue()+0x880) [0x556d90]
>>>>>> 3: (OSD::handle_pg_notify(MOSDPGNotify*)+0x4b6) [0x559546]
>>>>>> 4: (OSD::_dispatch(Message*)+0x608) [0x560e58]
>>>>>> 5: (OSD::ms_dispatch(Message*)+0x11e) [0x561b7e]
>>>>>> 6: (SimpleMessenger::dispatch_entry()+0x76b) [0x5c844b]
>>>>>> 7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4b5cfc]
>>>>>> 8: (()+0x7efc) [0x7fe3ebda3efc]
>>>>>> 9: (clone()+0x6d) [0x7fe3ea3d489d]
>>>>>> *** Caught signal (Aborted) **
>>>>>> in thread 7fe3df8c4700
>>>>>> ceph version 0.41 (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d)
>>>>>> 1: /usr/bin/ceph-osd() [0x6099f6]
>>>>>> 2: (()+0x10060) [0x7fe3ebdac060]
>>>>>> 3: (gsignal()+0x35) [0x7fe3ea3293a5]
>>>>>> 4: (abort()+0x17b) [0x7fe3ea32cb0b]
>>>>>> 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fe3eabe7d7d]
>>>>>> 6: (()+0xb9f26) [0x7fe3eabe5f26]
>>>>>> 7: (()+0xb9f53) [0x7fe3eabe5f53]
>>>>>> 8: (()+0xba04e) [0x7fe3eabe604e]
>>>>>> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>>>>> const*)+0x200) [0x5dc6b0]
>>>>>> 10: (OSD::split_pg(PG*, std::map<pg_t, PG*, std::less<pg_t>,
>>>>>> std::allocator<std::pair<pg_t const, PG*> > >&,
>>>>>> ObjectStore::Transaction&)+0x23e0) [0x54cd20]
>>>>>> 11: (OSD::kick_pg_split_queue()+0x880) [0x556d90]
>>>>>> 12: (OSD::handle_pg_notify(MOSDPGNotify*)+0x4b6) [0x559546]
>>>>>> 13: (OSD::_dispatch(Message*)+0x608) [0x560e58]
>>>>>> 14: (OSD::ms_dispatch(Message*)+0x11e) [0x561b7e]
>>>>>> 15: (SimpleMessenger::dispatch_entry()+0x76b) [0x5c844b]
>>>>>> 16: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4b5cfc]
>>>>>> 17: (()+0x7efc) [0x7fe3ebda3efc]
>>>>>> 18: (clone()+0x6d) [0x7fe3ea3d489d]
>>>>>> 2012-02-20 18:23:57.915653 7fa818e3e7a0 ceph version 0.41
>>>>>> (commit:c1345f7136a0af55d88280ffe4b58339aaf28c9d), process ceph-osd,
>>>>>> pid 6596
>>>>>>
>>>>>> Do you have any ideas? If you need some data from the cluster, or core
>>>>>> dumps from the OSDs, I have a lot of them, but they are large.
>>>>>>
>>>>>> --
>>>>>> -----
>>>>>> Regards
>>>>>>
>>>>>> Sławek "sZiBis" Skowron
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> -----
>>>>> Regards
>>>>>
>>>>> Sławek "sZiBis" Skowron
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
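
Editorial note on the quoted "pg v..." status lines: a minimal Python
sketch, assuming only the line format visible in the log above, for
splitting such a line into per-state counts and checking that they add
up to the reported total. The function name parse_pg_summary is
hypothetical and not part of any Ceph tool; the sample line is copied
from the 20:06:10.851483 entry.

```python
import re

def parse_pg_summary(line):
    """Split '<total> pgs: <n> <state>, <n> <state>, ...;' into (total, {state: n}).

    Hypothetical helper; it relies only on the format seen in the quoted log.
    """
    m = re.search(r"(\d+) pgs: ([^;]+);", line)
    if m is None:
        raise ValueError("not a pg summary line")
    total = int(m.group(1))
    states = {}
    for item in m.group(2).split(","):
        count, state = item.strip().split(None, 1)
        states[state] = int(count)
    return total, states

# Sample taken from the quoted log (line breaks and quote markers removed).
LINE = ("pg v172582: 10548 pgs: 92 creating, 1 active, 9713 active+clean, "
        "3 active+degraded+backfill, 657 peering, 77 down+peering, "
        "5 active+degraded; 59744 MB data, 218 GB used, "
        "18773 GB / 20008 GB avail; 8071/237184 degraded (3.403%)")

total, states = parse_pg_summary(LINE)
# PGs in any state other than active+clean.
not_clean = total - states.get("active+clean", 0)
print(total, sum(states.values()), not_clean)
```

For this sample the per-state counts sum to the reported 10548 PGs, and
835 of them are in a state other than active+clean. Running the same
parse over successive status lines gives a quick view of whether the
peering count is shrinking.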

Thread overview: 7+ messages
2012-02-20 19:16 Serious problem after increase pg_num in pool Sławomir Skowron
2012-02-20 19:35 ` Sławomir Skowron
2012-02-20 20:19   ` Sage Weil
2012-02-21  6:46     ` Sławomir Skowron
2012-02-21  7:23       ` Sławomir Skowron
2012-02-21 16:00         ` Sage Weil
2012-02-21 16:55           ` Sławomir Skowron
