* issue 8747 / 9011
From: Sage Weil @ 2014-09-19 23:43 UTC (permalink / raw)
To: onlyjob; +Cc: ceph-devel
Hey Dmitry,
Are you still seeing this crash?
osd/ReplicatedPG.cc: 5297: FAILED assert(soid < scrubber.start || soid >= scrubber.end)
We haven't turned it up in our testing in the last two months, so we
still have no log of it occurring.
Thanks!
sage
* Re: issue 8747 / 9011
From: Dmitry Smirnov @ 2014-09-20 3:08 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel
Hi Sage,
On Fri, 19 Sep 2014 16:43:31 Sage Weil wrote:
> Are you still seeing this crash?
>
> osd/ReplicatedPG.cc: 5297: FAILED assert(soid < scrubber.start || soid >=
> scrubber.end)
Thanks for following up on this, Sage.
Yes, I saw this crash just recently on 0.80.5. It usually happens during a
long recovery, such as when an OSD is replaced. I've seen it happen after hours
of backfilling/remapping, although it may take a long time to manifest.
--
Cheers,
Dmitry Smirnov
GPG key : 4096R/53968D1B
---
However beautiful the strategy, you should occasionally look at the
results.
-- Winston Churchill
* Re: issue 8747 / 9011
From: Sage Weil @ 2014-09-21 19:28 UTC (permalink / raw)
To: Dmitry Smirnov; +Cc: ceph-devel
On Fri, 19 Sep 2014, Dmitry Smirnov wrote:
> Hi Sage,
>
> On Fri, 19 Sep 2014 16:43:31 Sage Weil wrote:
> > Are you still seeing this crash?
> >
> > osd/ReplicatedPG.cc: 5297: FAILED assert(soid < scrubber.start || soid >=
> > scrubber.end)
>
> Thanks for following up on this, Sage.
> Yes, I saw this crash just recently on 0.80.5. It usually happens during a
> long recovery, such as when an OSD is replaced. I've seen it happen after hours
> of backfilling/remapping, although it may take a long time to manifest.
Is there any possibility of enabling logging on your osds (debug ms = 1,
debug osd = 20) so that we can capture this?
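
For reference, a minimal sketch of how these two settings could be written in
ceph.conf. Placing them in the [osd] section, so that they apply to every OSD
daemon reading that file, is an assumption here and is not stated in this
thread:

```ini
[osd]
    ; verbose OSD subsystem logging, as requested above
    debug osd = 20
    ; log messenger (network) traffic
    debug ms = 1
```

Settings in ceph.conf are read when a daemon starts, so the OSDs would
typically need a restart to pick them up, and logs at this level grow quickly.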
Thanks-
sage
* Re: issue 8747 / 9011
From: Dmitry Smirnov @ 2014-09-22 1:13 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel
On Sun, 21 Sep 2014 12:28:23 Sage Weil wrote:
> Is there any possibility of enabling logging on your osds (debug ms = 1,
> debug osd = 20) so that we can capture this?
I'll put it on my TODO list, but I won't be able to do it anytime soon...
Meanwhile, is there any chance of #8752 getting some attention? It concerns
inconsistent PGs on an RBD caching pool. Thanks.
--
Cheers,
Dmitry Smirnov
GPG key : 4096R/53968D1B
* Re: issue 8747 / 9011
From: Sage Weil @ 2014-09-22 2:01 UTC (permalink / raw)
To: Dmitry Smirnov; +Cc: ceph-devel
On Mon, 22 Sep 2014, Dmitry Smirnov wrote:
> On Sun, 21 Sep 2014 12:28:23 Sage Weil wrote:
> > Is there any possibility of enabling logging on your osds (debug ms = 1,
> > debug osd = 20) so that we can capture this?
>
> I'll put it on my TODO list, but I won't be able to do it anytime soon...
>
> Meanwhile, is there any chance of #8752 getting some attention? It concerns
> inconsistent PGs on an RBD caching pool. Thanks.
This is one we have never seen in our QA environment, and we have no real
leads. There are a couple of slightly different scrub issues that pop up
occasionally that we are trying to nail down, but this one is a bit
different. Being able to reliably reproduce it and generate logs is the
usual strategy...
sage
* Re: issue #8752 (inconsistent PGs on RBD caching pool)
From: Dmitry Smirnov @ 2014-10-02 13:47 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel
On Sun, 21 Sep 2014 19:01:52 Sage Weil wrote:
> This is one we have never seen in our QA environment, and we have no real
> leads.
I'm quite surprised by this... Is it really that unusual to use a replicated
caching pool in front of an RBD erasure-coded pool? All my OSDs are
Btrfs-based, and I recently upgraded all kernels (i.e. kernel RBD clients) to
3.16.3.
Unlike some elusive issues that may be hard to replicate, this particular one
has been very persistent and noticeable; it takes no effort to reproduce at
all. I've been observing it for several months already...
It is unlikely that I have anything special in my v0.80.5 cluster's
configuration...
> There are a couple of slightly different scrub issues that pop up
> occasionally that we are trying to nail down, but this one is a bit
> different. Being able to reliably reproduce it and generate logs is the
> usual strategy...
Please advise what kind of logs would be useful. Something like "(debug ms =
1, debug osd = 20)" from the primary OSD of the inconsistent PG, captured at
the time the "scrub" command is given?
Thanks.
--
All the best,
Dmitry Smirnov.
* Re: issue #8752 (inconsistent PGs on RBD caching pool)
From: Sage Weil @ 2014-10-02 15:28 UTC (permalink / raw)
To: Dmitry Smirnov; +Cc: ceph-devel
On Thu, 2 Oct 2014, Dmitry Smirnov wrote:
> On Sun, 21 Sep 2014 19:01:52 Sage Weil wrote:
> > This is one we have never seen in our QA environment, and we have no real
> > leads.
>
> I'm quite surprised by this... Is it really that unusual to use a replicated
> caching pool in front of an RBD erasure-coded pool? All my OSDs are
> Btrfs-based, and I recently upgraded all kernels (i.e. kernel RBD clients) to
> 3.16.3.
My guess is a btrfs issue. The weird thing about your report is the byte
totals are off by an uneven number of bytes (3 bytes, 9 bytes, etc.).
We haven't ever seen this. We do test RBD over cache tiers on btrfs,
but not with EC on the base. I'll add that combo to the matrix. My first
guess is a btrfs issue, honestly.
> Unlike some elusive issues that may be hard to replicate, this particular one
> has been very persistent and noticeable; it takes no effort to reproduce at
> all. I've been observing it for several months already...
Does it continue to come up after the kernels are upgraded (and after a
full cycle of scrub and repair has been done to clear out
inconsistencies introduced while running the older kernel)?
sage
> It is unlikely that I have anything special in my v0.80.5 cluster's
> configuration...
>
> > There are a couple of slightly different scrub issues that pop up
> > occasionally that we are trying to nail down, but this one is a bit
> > different. Being able to reliably reproduce it and generate logs is the
> > usual strategy...
>
> Please advise what kind of logs would be useful. Something like "(debug ms =
> 1, debug osd = 20)" from the primary OSD of the inconsistent PG, captured at
> the time the "scrub" command is given?
>
> Thanks.
>
> --
> All the best,
> Dmitry Smirnov.
>
* Re: issue #8752 (inconsistent PGs on RBD caching pool)
From: Dmitry Smirnov @ 2014-10-02 21:09 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel
On Thu, 2 Oct 2014 08:28:16 Sage Weil wrote:
> My guess is a btrfs issue. The weird thing about your report is the byte
> totals are off by an uneven number of bytes (3 bytes, 9 bytes, etc.).
> We haven't ever seen this. We do test RBD over cache tiers on btrfs,
> but not with EC on the base. I'll add that combo to the matrix. My first
> guess is a btrfs issue, honestly.
I think I found where it is happening: for a while I was using Btrfs-based
OSDs with journals on an ext4 partition on an SSD. As an experiment, I decided
to try moving all journal files back to their OSDs, and that eliminated the
inconsistencies. I've updated the ticket with this information.
This behaviour is reproducible on 0.80.6.
It looks like Btrfs snapshotting does not affect this issue.
> Does it continue to come up after the kernels are upgraded (and after a
> full cycle of scrub and repair has been done to clear out
> inconsistencies introduced while running the older kernel)?
Yes, I tried many times, after every kernel update and after any other change
in the cluster. Repair is usually ineffective and doesn't change anything: it
logs "repair 1 errors, 1 fixed", but "ceph pg scrub" finds an error again
right away. Moreover, repair is not even necessary: inconsistencies stay on
some PGs for a while and then "move" to different PGs. For example, "ceph pg
scrub 19.NN" sometimes clears the affected PG from the "inconsistent" state
and sometimes discovers a new inconsistency, seemingly at random.
Thank you.
--
Cheers,
Dmitry Smirnov.
---
Odious ideas are not entitled to hide from criticism behind the human
shield of their believers' feelings.
-- Richard Stallman