* Journal too small
@ 2012-05-17 10:59 Karol Jurak
2012-05-17 16:01 ` Sage Weil
2012-05-17 18:59 ` Josh Durgin
0 siblings, 2 replies; 9+ messages in thread
From: Karol Jurak @ 2012-05-17 10:59 UTC (permalink / raw)
To: ceph-devel
Hi,
During an ongoing recovery in one of my clusters a couple of OSDs
complained about too small journal. For instance:
2012-05-12 13:31:04.034144 7f491061d700 1 journal check_for_full at
863363072 : JOURNAL FULL 863363072 >= 1048571903 (max_size 1048576000
start 863363072)
2012-05-12 13:31:04.034680 7f491061d700 0 journal JOURNAL TOO SMALL: item
1693745152 > journal 1048571904 (usable)
I was under the impression that the OSDs stopped participating in recovery
after this event. (ceph -w showed that the number of PGs in state
active+clean no longer increased.) They resumed recovery after I enlarged
their journals (stop osd, --flush-journal, --mkjournal, start osd).
How serious is such situation? Do the OSDs know how to handle it
correctly? Or could this result in some data loss or corruption? After the
recovery finished (ceph -w showed that all PGs are in active+clean state)
I noticed that a few rbd images were corrupted.
The cluster runs v0.46. The OSDs use ext4. I'm pretty sure that during the
recovery no clients were accessing the cluster.
Best regards,
Karol
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Journal too small
2012-05-17 10:59 Journal too small Karol Jurak
@ 2012-05-17 16:01 ` Sage Weil
2012-05-17 17:00 ` Tommi Virtanen
2012-05-18 10:58 ` Karol Jurak
2012-05-17 18:59 ` Josh Durgin
1 sibling, 2 replies; 9+ messages in thread
From: Sage Weil @ 2012-05-17 16:01 UTC (permalink / raw)
To: Karol Jurak; +Cc: ceph-devel
On Thu, 17 May 2012, Karol Jurak wrote:
> Hi,
>
> During an ongoing recovery in one of my clusters a couple of OSDs
> complained about too small journal. For instance:
>
> 2012-05-12 13:31:04.034144 7f491061d700 1 journal check_for_full at
> 863363072 : JOURNAL FULL 863363072 >= 1048571903 (max_size 1048576000
> start 863363072)
> 2012-05-12 13:31:04.034680 7f491061d700 0 journal JOURNAL TOO SMALL: item
> 1693745152 > journal 1048571904 (usable)
>
> I was under the impression that the OSDs stopped participating in recovery
> after this event. (ceph -w showed that the number of PGs in state
> active+clean no longer increased.) They resumed recovery after I enlarged
> their journals (stop osd, --flush-journal, --mkjournal, start osd).
>
> How serious is such situation? Do the OSDs know how to handle it
> correctly? Or could this result in some data loss or corruption? After the
> recovery finished (ceph -w showed that all PGs are in active+clean state)
> I noticed that a few rbd images were corrupted.
The osds tolerate the full journal. There will be a big latency spike,
but they'll recover without risking data. You should definitely increase
the journal size if this happens regulary, though.
sage
>
> The cluster runs v0.46. The OSDs use ext4. I'm pretty sure that during the
> recovery no clients were accessing the cluster.
>
> Best regards,
> Karol
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Journal too small
2012-05-17 16:01 ` Sage Weil
@ 2012-05-17 17:00 ` Tommi Virtanen
2012-05-17 17:33 ` Sage Weil
2012-05-18 10:58 ` Karol Jurak
1 sibling, 1 reply; 9+ messages in thread
From: Tommi Virtanen @ 2012-05-17 17:00 UTC (permalink / raw)
To: Sage Weil; +Cc: Karol Jurak, ceph-devel
On Thu, May 17, 2012 at 9:01 AM, Sage Weil <sage@inktank.com> wrote:
>> 2012-05-12 13:31:04.034144 7f491061d700 1 journal check_for_full at
>> 863363072 : JOURNAL FULL 863363072 >= 1048571903 (max_size 1048576000
>> start 863363072)
>> 2012-05-12 13:31:04.034680 7f491061d700 0 journal JOURNAL TOO SMALL: item
>> 1693745152 > journal 1048571904 (usable)
> The osds tolerate the full journal. There will be a big latency spike,
> but they'll recover without risking data. You should definitely increase
> the journal size if this happens regulary, though.
I propose for your merging pleasure:
https://github.com/ceph/ceph/commits/journal-too-small
https://github.com/ceph/ceph/commit/62db60bede8b187e25acb715f6616d2ce7cfc97f
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Journal too small
2012-05-17 17:00 ` Tommi Virtanen
@ 2012-05-17 17:33 ` Sage Weil
0 siblings, 0 replies; 9+ messages in thread
From: Sage Weil @ 2012-05-17 17:33 UTC (permalink / raw)
To: Tommi Virtanen; +Cc: Karol Jurak, ceph-devel
[-- Attachment #1: Type: TEXT/PLAIN, Size: 834 bytes --]
On Thu, 17 May 2012, Tommi Virtanen wrote:
> On Thu, May 17, 2012 at 9:01 AM, Sage Weil <sage@inktank.com> wrote:
> >> 2012-05-12 13:31:04.034144 7f491061d700 1 journal check_for_full at
> >> 863363072 : JOURNAL FULL 863363072 >= 1048571903 (max_size 1048576000
> >> start 863363072)
> >> 2012-05-12 13:31:04.034680 7f491061d700 0 journal JOURNAL TOO SMALL: item
> >> 1693745152 > journal 1048571904 (usable)
>
> > The osds tolerate the full journal. There will be a big latency spike,
> > but they'll recover without risking data. You should definitely increase
> > the journal size if this happens regulary, though.
>
> I propose for your merging pleasure:
> https://github.com/ceph/ceph/commits/journal-too-small
> https://github.com/ceph/ceph/commit/62db60bede8b187e25acb715f6616d2ce7cfc97f
Perfect, merged.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Journal too small
2012-05-17 10:59 Journal too small Karol Jurak
2012-05-17 16:01 ` Sage Weil
@ 2012-05-17 18:59 ` Josh Durgin
2012-05-18 10:56 ` Karol Jurak
1 sibling, 1 reply; 9+ messages in thread
From: Josh Durgin @ 2012-05-17 18:59 UTC (permalink / raw)
To: Karol Jurak; +Cc: ceph-devel
On 05/17/2012 03:59 AM, Karol Jurak wrote:
> How serious is such situation? Do the OSDs know how to handle it
> correctly? Or could this result in some data loss or corruption? After the
> recovery finished (ceph -w showed that all PGs are in active+clean state)
> I noticed that a few rbd images were corrupted.
As Sage mentioned, the OSDs know how to handle full journals correctly.
I'd like to figure out how your rbd images got corrupted, if possible.
How did you notice the corruption?
Has your cluster always run 0.46, or did you upgrade from earlier
versions?
What happened to the cluster between your last check for corruption and
now? Did your use of it or any ceph client or server configuration
change?
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Journal too small
2012-05-17 18:59 ` Josh Durgin
@ 2012-05-18 10:56 ` Karol Jurak
2012-05-19 1:51 ` Josh Durgin
0 siblings, 1 reply; 9+ messages in thread
From: Karol Jurak @ 2012-05-18 10:56 UTC (permalink / raw)
To: Josh Durgin; +Cc: ceph-devel
[-- Attachment #1: Type: Text/Plain, Size: 5656 bytes --]
On Thursday 17 of May 2012 20:59:52 Josh Durgin wrote:
> On 05/17/2012 03:59 AM, Karol Jurak wrote:
> > How serious is such situation? Do the OSDs know how to handle it
> > correctly? Or could this result in some data loss or corruption?
> > After the recovery finished (ceph -w showed that all PGs are in
> > active+clean state) I noticed that a few rbd images were corrupted.
>
> As Sage mentioned, the OSDs know how to handle full journals correctly.
>
> I'd like to figure out how your rbd images got corrupted, if possible.
>
> How did you notice the corruption?
>
> Has your cluster always run 0.46, or did you upgrade from earlier
> versions?
>
> What happened to the cluster between your last check for corruption and
> now? Did your use of it or any ceph client or server configuration
> change?
My question about journal is actually connected to a larger case I'm
currently trying to investigate.
The cluster initially run v0.45 but I upgraded it to v0.46 because of the
issue I described in this bug report (upgrade didn't resolve it):
http://tracker.newdream.net/issues/2446
The cluster consisted of 26 OSDs and used the crushmap which had a
structure identical to that of a default crushmap constructed during the
cluster creation. It had the unknownrack which contained 26 hosts and
every host contained one OSD.
Problems started when one of my collegues created and installed into the
cluster the new crush map which introduced a couple of new racks, changed
the placement rule to 'step chooseleaf firstn 0 type rack' and changed the
weights of most of the OSDs to 0 (they were meant to be removed from the
cluster). I don't have the exact copy of that crushmap but my collegue
reconstructed it from memory the best he could. It's attached as new-
crushmap.txt.
The OSDs reacted to the new crushmap by allocating large amounts of
memory. Most of them had only 1 or 2 GB of RAM. That proved to be not
enough and the Xen VMs hosting the OSDs crashed. It turned out later, that
most of the OSDs required as much as 6 to 10 GB of memory to complete the
peering phase (ceph -w showed large number of PGs in that state while the
OSDs were allocating memory).
One factor which I think might have played significant role in this
situation was the large number of PGs - 20000. Our idea was to
incrementally build the cluster consisting of approximately 200 OSDs,
hence the 20000 PGs.
I see some items in your issue tracker that look like they may be
addressing this large memory consumption issue:
http://tracker.newdream.net/issues/2321
http://tracker.newdream.net/issues/2041
I reverted to the default crushmap, changed replication level to 1 and
marked all OSDs but 2 out. That allowed me to finally recover the cluster
and bring it online but in the process all the OSDs crashed numerous
times. They were either killed by the OOM Killer or the whole VMs were
destroyed by me because they were unresponsive or the OSDs crashed due to
failed asserts such as:
====
2012-05-10 13:07:39.869811 7f878645a700 -1 common/HeartbeatMap.cc: In
function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_
handle_d*, const char*, time_t)' thread 7f878645a700 time 2012-05-10
13:07:38.816680
common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
ceph version 0.46 (commit:cb7f1c9c7520848b0899b26440ac34a8acea58d1)
1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*,
long)+0x270) [0x7a32e0]
2: (ceph::HeartbeatMap::is_healthy()+0x87) [0x7a34f7]
3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0x7a3748]
4: (CephContextServiceThread::entry()+0x5c) [0x64c27c]
5: (()+0x68ba) [0x7f87888be8ba]
6: (clone()+0x6d) [0x7f8786f4302d]
====
or
====
2012-05-10 16:33:30.437730 7f062e9c1700 -1 osd/PG.cc: In function 'void
PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_
log_t&, int)' thread 7f062e9c1700 time 2012-05-10 16:33:30.369211
osd/PG.cc: 369: FAILED assert(log.head >= olog.tail && olog.head >=
log.tail)
ceph version 0.46 (commit:cb7f1c9c7520848b0899b26440ac34a8acea58d1)
1: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&,
int)+0x1f14) [0x77d894]
2: (PG::RecoveryState::Stray::react(PG::RecoveryState::MLogRec
const&)+0x2c5) [0x77dba5]
3: (boost::statechart::simple_state<PG::RecoveryState::Stray,
PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na>,
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
const&, void const*)+0x213) [0x794d93]
4: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine,
PG::RecoveryState::Initial, std::allocator<void>,
boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base
const&)+0x6b) [0x78c3cb]
5: (PG::RecoveryState::handle_log(int, MOSDPGLog*,
PG::RecoveryCtx*)+0x1a6) [0x745b76]
6: (OSD::handle_pg_log(std::tr1::shared_ptr<OpRequest>)+0x56f) [0x5e1b8f]
7: (OSD::dispatch_op(std::tr1::shared_ptr<OpRequest>)+0x13b) [0x5e291b]
8: (OSD::_dispatch(Message*)+0x17d) [0x5e7afd]
9: (OSD::ms_dispatch(Message*)+0x1df) [0x5e83cf]
10: (SimpleMessenger::dispatch_entry()+0x979) [0x6dadf9]
11: (SimpleMessenger::DispatchThread::entry()+0xd) [0x613e8d]
12: (()+0x68ba) [0x7f063c63c8ba]
13: (clone()+0x6d) [0x7f063acc102d]
====
Although 'ceph -w' showed that all PGs are in active+clean state, during
the attempt to start the VMs which had their disk images on rbd devices,
fsck revealed multiple filesystem errors.
Karol
[-- Attachment #2: new-crushmap.txt --]
[-- Type: text/plain, Size: 6181 bytes --]
# begin crush map
# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18
device 19 osd.19
device 20 osd.20
device 21 osd.21
device 22 osd.22
device 23 osd.23
device 24 osd.24
device 25 osd.25
# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 pool
# buckets
host ceph-backup-osd-1 {
id -2 # do not change unnecessarily
# weight 0.000
alg straw
hash 0 # rjenkins1
item osd.0 weight 0.000
}
host ceph-backup-osd-2 {
id -8 # do not change unnecessarily
# weight 0.000
alg straw
hash 0 # rjenkins1
item osd.1 weight 0.000
}
host ceph-backup-osd-3 {
id -4 # do not change unnecessarily
# weight 0.000
alg straw
hash 0 # rjenkins1
item osd.2 weight 0.000
}
host ceph-backup-osd-4 {
id -11 # do not change unnecessarily
# weight 0.000
alg straw
hash 0 # rjenkins1
item osd.3 weight 0.000
}
host ceph-backup-osd-5 {
id -12 # do not change unnecessarily
# weight 0.000
alg straw
hash 0 # rjenkins1
item osd.4 weight 0.000
}
host ceph-backup-osd-6 {
id -5 # do not change unnecessarily
# weight 0.000
alg straw
hash 0 # rjenkins1
item osd.5 weight 0.000
}
host ceph-backup-osd-7 {
id -6 # do not change unnecessarily
# weight 0.000
alg straw
hash 0 # rjenkins1
item osd.6 weight 0.000
}
host ceph-backup-osd-8 {
id -10 # do not change unnecessarily
# weight 0.000
alg straw
hash 0 # rjenkins1
item osd.7 weight 0.000
}
host ceph-backup-osd-9 {
id -9 # do not change unnecessarily
# weight 0.000
alg straw
hash 0 # rjenkins1
item osd.8 weight 0.000
}
host ceph-backup-osd-10 {
id -7 # do not change unnecessarily
# weight 0.000
alg straw
hash 0 # rjenkins1
item osd.9 weight 0.000
}
host ceph-backup-osd-11 {
id -13 # do not change unnecessarily
# weight 0.000
alg straw
hash 0 # rjenkins1
item osd.10 weight 0.000
}
host ceph-backup-osd-12 {
id -22 # do not change unnecessarily
# weight 0.000
alg straw
hash 0 # rjenkins1
item osd.11 weight 0.000
}
host ceph-backup-osd-13 {
id -14 # do not change unnecessarily
# weight 0.000
alg straw
hash 0 # rjenkins1
item osd.12 weight 0.000
}
host ceph-backup-osd-14 {
id -15 # do not change unnecessarily
# weight 0.000
alg straw
hash 0 # rjenkins1
item osd.13 weight 0.000
}
host ceph-backup-osd-15 {
id -16 # do not change unnecessarily
# weight 0.000
alg straw
hash 0 # rjenkins1
item osd.14 weight 0.000
}
host ceph-backup-osd-16 {
id -17 # do not change unnecessarily
# weight 0.000
alg straw
hash 0 # rjenkins1
item osd.15 weight 0.000
}
host ceph-backup-osd-17 {
id -18 # do not change unnecessarily
# weight 0.000
alg straw
hash 0 # rjenkins1
item osd.16 weight 0.000
}
host ceph-backup-osd-18 {
id -19 # do not change unnecessarily
# weight 0.000
alg straw
hash 0 # rjenkins1
item osd.17 weight 0.000
}
host ceph-backup-osd-19 {
id -20 # do not change unnecessarily
# weight 0.000
alg straw
hash 0 # rjenkins1
item osd.18 weight 0.000
}
host ceph-backup-osd-20 {
id -21 # do not change unnecessarily
# weight 0.000
alg straw
hash 0 # rjenkins1
item osd.19 weight 0.000
}
host ceph-backup-osd-21 {
id -23 # do not change unnecessarily
# weight 2.700
alg straw
hash 0 # rjenkins1
item osd.20 weight 2.700
}
host ceph-backup-osd-22 {
id -24 # do not change unnecessarily
# weight 0.000
alg straw
hash 0 # rjenkins1
item osd.21 weight 0.000
}
host ceph-backup-osd-23 {
id -25 # do not change unnecessarily
# weight 1.000
alg straw
hash 0 # rjenkins1
item osd.22 weight 1.000
}
host ceph-backup-osd-24 {
id -26 # do not change unnecessarily
# weight 2.700
alg straw
hash 0 # rjenkins1
item osd.23 weight 2.700
}
host ceph-backup-osd-25 {
id -27 # do not change unnecessarily
# weight 0.000
alg straw
hash 0 # rjenkins1
item osd.24 weight 0.000
}
host ceph-backup-osd-26 {
id -28 # do not change unnecessarily
# weight 0.000
alg straw
hash 0 # rjenkins1
item osd.25 weight 0.000
}
rack unknownrack {
id -3 # do not change unnecessarily
# weight 0.000
alg straw
hash 0 # rjenkins1
item ceph-backup-osd-1 weight 0.000
item ceph-backup-osd-2 weight 0.000
item ceph-backup-osd-3 weight 0.000
item ceph-backup-osd-4 weight 0.000
item ceph-backup-osd-5 weight 0.000
item ceph-backup-osd-6 weight 0.000
item ceph-backup-osd-7 weight 0.000
item ceph-backup-osd-8 weight 0.000
item ceph-backup-osd-9 weight 0.000
item ceph-backup-osd-10 weight 0.000
item ceph-backup-osd-11 weight 0.000
item ceph-backup-osd-12 weight 0.000
item ceph-backup-osd-13 weight 0.000
item ceph-backup-osd-14 weight 0.000
item ceph-backup-osd-15 weight 0.000
item ceph-backup-osd-16 weight 0.000
item ceph-backup-osd-17 weight 0.000
item ceph-backup-osd-18 weight 0.000
item ceph-backup-osd-19 weight 0.000
item ceph-backup-osd-20 weight 0.000
}
rack a8 {
id -29 # do not change unnecessarily
# weight 5.400
alg straw
hash 0 # rjenkins1
item ceph-backup-osd-21 weight 2.700
item ceph-backup-osd-24 weight 2.700
}
rack c11 {
id -30 # do not change unnecessarily
# weight 4.000
alg straw
hash 0 # rjenkins1
item ceph-backup-osd-23 weight 2.000
item ceph-backup-osd-22 weight 2.000
}
rack d12 {
id -31 # do not change unnecessarily
# weight 2.000
alg straw
hash 0 # rjenkins1
item ceph-backup-osd-26 weight 1.000
item ceph-backup-osd-25 weight 1.000
}
pool backup {
id -1 # do not change unnecessarily
# weight 11.400
alg straw
hash 0 # rjenkins1
item a8 weight 5.400
item c11 weight 4.000
item d12 weight 2.000
item unknownrack weight 0.000
}
# rules
rule data {
ruleset 0
type replicated
min_size 1
max_size 10
step take backup
step chooseleaf firstn 0 type rack
step emit
}
rule metadata {
ruleset 1
type replicated
min_size 1
max_size 10
step take backup
step chooseleaf firstn 0 type rack
step emit
}
rule rbd {
ruleset 2
type replicated
min_size 1
max_size 10
step take backup
step chooseleaf firstn 0 type rack
step emit
}
# end crush map
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Journal too small
2012-05-17 16:01 ` Sage Weil
2012-05-17 17:00 ` Tommi Virtanen
@ 2012-05-18 10:58 ` Karol Jurak
1 sibling, 0 replies; 9+ messages in thread
From: Karol Jurak @ 2012-05-18 10:58 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel
On Thursday 17 of May 2012 18:01:55 Sage Weil wrote:
> On Thu, 17 May 2012, Karol Jurak wrote:
> > Hi,
> >
> > During an ongoing recovery in one of my clusters a couple of OSDs
> > complained about too small journal. For instance:
> >
> > 2012-05-12 13:31:04.034144 7f491061d700 1 journal check_for_full at
> > 863363072 : JOURNAL FULL 863363072 >= 1048571903 (max_size 1048576000
> > start 863363072)
> > 2012-05-12 13:31:04.034680 7f491061d700 0 journal JOURNAL TOO SMALL:
> > item 1693745152 > journal 1048571904 (usable)
> >
> > I was under the impression that the OSDs stopped participating in
> > recovery after this event. (ceph -w showed that the number of PGs in
> > state active+clean no longer increased.) They resumed recovery after
> > I enlarged their journals (stop osd, --flush-journal, --mkjournal,
> > start osd).
> >
> > How serious is such situation? Do the OSDs know how to handle it
> > correctly? Or could this result in some data loss or corruption?
> > After the recovery finished (ceph -w showed that all PGs are in
> > active+clean state) I noticed that a few rbd images were corrupted.
>
> The osds tolerate the full journal. There will be a big latency spike,
> but they'll recover without risking data. You should definitely
> increase the journal size if this happens regulary, though.
Thank you for clarification and advice.
Karol
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Journal too small
2012-05-18 10:56 ` Karol Jurak
@ 2012-05-19 1:51 ` Josh Durgin
2012-05-21 15:11 ` Karol Jurak
0 siblings, 1 reply; 9+ messages in thread
From: Josh Durgin @ 2012-05-19 1:51 UTC (permalink / raw)
To: Karol Jurak; +Cc: ceph-devel
On 05/18/2012 03:56 AM, Karol Jurak wrote:
> On Thursday 17 of May 2012 20:59:52 Josh Durgin wrote:
>> On 05/17/2012 03:59 AM, Karol Jurak wrote:
>>> How serious is such situation? Do the OSDs know how to handle it
>>> correctly? Or could this result in some data loss or corruption?
>>> After the recovery finished (ceph -w showed that all PGs are in
>>> active+clean state) I noticed that a few rbd images were corrupted.
>>
>> As Sage mentioned, the OSDs know how to handle full journals correctly.
>>
>> I'd like to figure out how your rbd images got corrupted, if possible.
>>
>> How did you notice the corruption?
>>
>> Has your cluster always run 0.46, or did you upgrade from earlier
>> versions?
>>
>> What happened to the cluster between your last check for corruption and
>> now? Did your use of it or any ceph client or server configuration
>> change?
>
> My question about journal is actually connected to a larger case I'm
> currently trying to investigate.
>
> The cluster initially run v0.45 but I upgraded it to v0.46 because of the
> issue I described in this bug report (upgrade didn't resolve it):
>
> http://tracker.newdream.net/issues/2446
Could you attach an archive of all the osdmaps from to that bug?
You can extract them with something like:
for epoch in $(seq 1 2000)
do
ceph osd getmap $epoch -o osdmap_$epoch
done
> The cluster consisted of 26 OSDs and used the crushmap which had a
> structure identical to that of a default crushmap constructed during the
> cluster creation. It had the unknownrack which contained 26 hosts and
> every host contained one OSD.
>
> Problems started when one of my collegues created and installed into the
> cluster the new crush map which introduced a couple of new racks, changed
> the placement rule to 'step chooseleaf firstn 0 type rack' and changed the
> weights of most of the OSDs to 0 (they were meant to be removed from the
> cluster). I don't have the exact copy of that crushmap but my collegue
> reconstructed it from memory the best he could. It's attached as new-
> crushmap.txt.
>
> The OSDs reacted to the new crushmap by allocating large amounts of
> memory. Most of them had only 1 or 2 GB of RAM. That proved to be not
> enough and the Xen VMs hosting the OSDs crashed. It turned out later, that
> most of the OSDs required as much as 6 to 10 GB of memory to complete the
> peering phase (ceph -w showed large number of PGs in that state while the
> OSDs were allocating memory).
>
> One factor which I think might have played significant role in this
> situation was the large number of PGs - 20000. Our idea was to
> incrementally build the cluster consisting of approximately 200 OSDs,
> hence the 20000 PGs.
Large numbers of PGs per OSD are problematic due to memory usage linear
in the number of PGs, and increased during peering and recovery.
We recommend keeping the number of PGs per OSD on the order of 100s.
In the future, it'll be possible to split PGs to increase their number
when your cluster grows, or merge them when it shrinks. For now you
should probably wait to create a pool with a large number of PGs until
you have enough OSDs up and in to handle them.
PG splitting is http://tracker.newdream.net/issues/1515
Your crushmap with many devices with weight 0 might also have
contributed to the problem due an issue with local retries.
See:
http://tracker.newdream.net/issues/2047
http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/6244
A workaround in the meantime is to remove devices in deep hierarchies
from the crush map.
> I see some items in your issue tracker that look like they may be
> addressing this large memory consumption issue:
>
> http://tracker.newdream.net/issues/2321
> http://tracker.newdream.net/issues/2041
Those and the recent improvements in OSD map processing will help.
> I reverted to the default crushmap, changed replication level to 1 and
> marked all OSDs but 2 out. That allowed me to finally recover the cluster
> and bring it online but in the process all the OSDs crashed numerous
> times. They were either killed by the OOM Killer or the whole VMs were
> destroyed by me because they were unresponsive or the OSDs crashed due to
> failed asserts such as:
>
> ====
> 2012-05-10 13:07:39.869811 7f878645a700 -1 common/HeartbeatMap.cc: In
> function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_
> handle_d*, const char*, time_t)' thread 7f878645a700 time 2012-05-10
> 13:07:38.816680
> common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
>
> ceph version 0.46 (commit:cb7f1c9c7520848b0899b26440ac34a8acea58d1)
> 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*,
> long)+0x270) [0x7a32e0]
> 2: (ceph::HeartbeatMap::is_healthy()+0x87) [0x7a34f7]
> 3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0x7a3748]
> 4: (CephContextServiceThread::entry()+0x5c) [0x64c27c]
> 5: (()+0x68ba) [0x7f87888be8ba]
> 6: (clone()+0x6d) [0x7f8786f4302d]
> ====
This is unresponsiveness again.
> ====
> 2012-05-10 16:33:30.437730 7f062e9c1700 -1 osd/PG.cc: In function 'void
> PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_
> log_t&, int)' thread 7f062e9c1700 time 2012-05-10 16:33:30.369211
> osd/PG.cc: 369: FAILED assert(log.head>= olog.tail&& olog.head>=
> log.tail)
>
> ceph version 0.46 (commit:cb7f1c9c7520848b0899b26440ac34a8acea58d1)
> 1: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&,
> int)+0x1f14) [0x77d894]
> 2: (PG::RecoveryState::Stray::react(PG::RecoveryState::MLogRec
> const&)+0x2c5) [0x77dba5]
> 3: (boost::statechart::simple_state<PG::RecoveryState::Stray,
> PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na,
> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
> mpl_::na, mpl_::na, mpl_::na>,
> (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
> const&, void const*)+0x213) [0x794d93]
> 4: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine,
> PG::RecoveryState::Initial, std::allocator<void>,
> boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base
> const&)+0x6b) [0x78c3cb]
> 5: (PG::RecoveryState::handle_log(int, MOSDPGLog*,
> PG::RecoveryCtx*)+0x1a6) [0x745b76]
> 6: (OSD::handle_pg_log(std::tr1::shared_ptr<OpRequest>)+0x56f) [0x5e1b8f]
> 7: (OSD::dispatch_op(std::tr1::shared_ptr<OpRequest>)+0x13b) [0x5e291b]
> 8: (OSD::_dispatch(Message*)+0x17d) [0x5e7afd]
> 9: (OSD::ms_dispatch(Message*)+0x1df) [0x5e83cf]
> 10: (SimpleMessenger::dispatch_entry()+0x979) [0x6dadf9]
> 11: (SimpleMessenger::DispatchThread::entry()+0xd) [0x613e8d]
> 12: (()+0x68ba) [0x7f063c63c8ba]
> 13: (clone()+0x6d) [0x7f063acc102d]
> ====
This is a bug. If it's reproducible, could you generate logs of it
happening with 'debug osd = 20'?
> Although 'ceph -w' showed that all PGs are in active+clean state, during
> the attempt to start the VMs which had their disk images on rbd devices,
> fsck revealed multiple filesystem errors.
Were any of the osds restarted when they were running 0.45? There were
a couple issues with journal replay on non-btrfs that were fixed in 0.46.
If any of the nodes were powered off, it would be good to know whether
Xen was flushing disk caches for the VMs running your OSDs as well.
Josh
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Journal too small
2012-05-19 1:51 ` Josh Durgin
@ 2012-05-21 15:11 ` Karol Jurak
0 siblings, 0 replies; 9+ messages in thread
From: Karol Jurak @ 2012-05-21 15:11 UTC (permalink / raw)
To: Josh Durgin; +Cc: ceph-devel
On Saturday 19 of May 2012 03:51:13 Josh Durgin wrote:
> On 05/18/2012 03:56 AM, Karol Jurak wrote:
> > My question about journal is actually connected to a larger case I'm
> > currently trying to investigate.
> >
> > The cluster initially run v0.45 but I upgraded it to v0.46 because of
> > the issue I described in this bug report (upgrade didn't resolve
> > it):
> >
> > http://tracker.newdream.net/issues/2446
>
> Could you attach an archive of all the osdmaps from to that bug?
> You can extract them with something like:
>
> for epoch in $(seq 1 2000)
> do
> ceph osd getmap $epoch -o osdmap_$epoch
> done
The monitors have deleted the osdmaps from that period, however I managed
to reproduce this bug and I took a snapshot of osdmap and osdmap_full
directories of one of the monitors. I attached it to the bug report.
> Large numbers of PGs per OSD are problematic due to memory usage linear
> in the number of PGs, and increased during peering and recovery.
> We recommend keeping the number of PGs per OSD on the order of 100s.
> In the future, it'll be possible to split PGs to increase their number
> when your cluster grows, or merge them when it shrinks. For now you
> should probably wait to create a pool with a large number of PGs until
> you have enough OSDs up and in to handle them.
>
> PG splitting is http://tracker.newdream.net/issues/1515
>
> Your crushmap with many devices with weight 0 might also have
> contributed to the problem due an issue with local retries.
> See:
>
> http://tracker.newdream.net/issues/2047
> http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/6244
>
> A workaround in the meantime is to remove devices in deep hierarchies
> from the crush map.
>
> > I see some items in your issue tracker that look like they may be
> > addressing this large memory consumption issue:
> >
> > http://tracker.newdream.net/issues/2321
> > http://tracker.newdream.net/issues/2041
>
> Those and the recent improvements in OSD map processing will help.
Thanks for info and advice.
> > ====
> > 2012-05-10 13:07:39.869811 7f878645a700 -1 common/HeartbeatMap.cc: In
> > function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_
> > handle_d*, const char*, time_t)' thread 7f878645a700 time 2012-05-10
> > 13:07:38.816680
> > common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
> >
> > ceph version 0.46 (commit:cb7f1c9c7520848b0899b26440ac34a8acea58d1)
> > 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
> > const*,
> >
> > long)+0x270) [0x7a32e0]
> >
> > 2: (ceph::HeartbeatMap::is_healthy()+0x87) [0x7a34f7]
> > 3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0x7a3748]
> > 4: (CephContextServiceThread::entry()+0x5c) [0x64c27c]
> > 5: (()+0x68ba) [0x7f87888be8ba]
> > 6: (clone()+0x6d) [0x7f8786f4302d]
> >
> > ====
>
> This is unresponsiveness again.
That makes sense. Most OSDs' filestores were on a storage shared with
other VMs and also their (heavily utilized) swap partitions were on it.
> > ====
> > 2012-05-10 16:33:30.437730 7f062e9c1700 -1 osd/PG.cc: In function
> > 'void PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_
> > log_t&, int)' thread 7f062e9c1700 time 2012-05-10 16:33:30.369211
> > osd/PG.cc: 369: FAILED assert(log.head>= olog.tail&& olog.head>=
> > log.tail)
> >
> > ceph version 0.46 (commit:cb7f1c9c7520848b0899b26440ac34a8acea58d1)
> > 1: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&,
> >
> > int)+0x1f14) [0x77d894]
> >
> > 2: (PG::RecoveryState::Stray::react(PG::RecoveryState::MLogRec
> >
> > const&)+0x2c5) [0x77dba5]
> >
> > 3: (boost::statechart::simple_state<PG::RecoveryState::Stray,
> >
> > PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na,
> > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
> > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
> > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
> > (boost::statechart::history_mode)0>::react_impl(boost::statechart::ev
> > ent_base const&, void const*)+0x213) [0x794d93]
> >
> > 4:
> > (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachi
> > ne,
> >
> > PG::RecoveryState::Initial, std::allocator<void>,
> > boost::statechart::null_exception_translator>::process_event(boost::s
> > tatechart::event_base const&)+0x6b) [0x78c3cb]
> >
> > 5: (PG::RecoveryState::handle_log(int, MOSDPGLog*,
> >
> > PG::RecoveryCtx*)+0x1a6) [0x745b76]
> >
> > 6: (OSD::handle_pg_log(std::tr1::shared_ptr<OpRequest>)+0x56f)
> > [0x5e1b8f] 7:
> > (OSD::dispatch_op(std::tr1::shared_ptr<OpRequest>)+0x13b)
> > [0x5e291b] 8: (OSD::_dispatch(Message*)+0x17d) [0x5e7afd]
> > 9: (OSD::ms_dispatch(Message*)+0x1df) [0x5e83cf]
> > 10: (SimpleMessenger::dispatch_entry()+0x979) [0x6dadf9]
> > 11: (SimpleMessenger::DispatchThread::entry()+0xd) [0x613e8d]
> > 12: (()+0x68ba) [0x7f063c63c8ba]
> > 13: (clone()+0x6d) [0x7f063acc102d]
> >
> > ====
>
> This is a bug. If it's reproducible, could you generate logs of it
> happening with 'debug osd = 20'?
Unfortunately I don't know how to reproduce this bug. I've only seen it a
couple of times on one specific OSD and that OSD was running with logging
verbosity at default level. However I still have that logs, if they are of
any help.
> > Although 'ceph -w' showed that all PGs are in active+clean state,
> > during the attempt to start the VMs which had their disk images on
> > rbd devices, fsck revealed multiple filesystem errors.
>
> Were any of the osds restarted when they were running 0.45? There were
> a couple issues with journal replay on non-btrfs that were fixed in
> 0.46.
I can't recall for sure but I think it's quite possible that some of the
OSDs were restarted when they were running 0.45. I don't think that any of
them crashed at that time, though.
> If any of the nodes were powered off, it would be good to know whether
> Xen was flushing disk caches for the VMs running your OSDs as well.
Based on my limited research I think that it's possible that Xen (at least
older versions of it which we use) does not flush disk caches for the VMs.
On the other hand I'm almost certain that none of the xen hosts on which
OSD VMs were running were powered down or crashed at that time. So all I/O
of OSD VMs should eventually be persisted on disks.
Karol
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2012-05-21 15:11 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-05-17 10:59 Journal too small Karol Jurak
2012-05-17 16:01 ` Sage Weil
2012-05-17 17:00 ` Tommi Virtanen
2012-05-17 17:33 ` Sage Weil
2012-05-18 10:58 ` Karol Jurak
2012-05-17 18:59 ` Josh Durgin
2012-05-18 10:56 ` Karol Jurak
2012-05-19 1:51 ` Josh Durgin
2012-05-21 15:11 ` Karol Jurak
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.