issue 8747 / 9011

* issue 8747 / 9011
@ 2014-09-19 23:43 Sage Weil
  2014-09-20  3:08 ` Dmitry Smirnov
  0 siblings, 1 reply; 8+ messages in thread
From: Sage Weil @ 2014-09-19 23:43 UTC (permalink / raw)
  To: onlyjob; +Cc: ceph-devel

Hey Dmitry,

Are you still seeing this crash?

 osd/ReplicatedPG.cc: 5297: FAILED assert(soid < scrubber.start || soid >= scrubber.end)

We haven't turned it up in our testing in the last two months, so we 
still have no log of it occurring.

Thanks!
sage
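
For readers following the thread: the failed assertion requires that the object id (soid) lie outside the range the scrubber currently covers. Below is a minimal Python sketch of that condition. It assumes a half-open range [start, end) and uses plain integers as stand-ins for Ceph's object ids; ScrubRange and is_outside are hypothetical names for illustration, not Ceph code.

```python
from dataclasses import dataclass


@dataclass
class ScrubRange:
    """Hypothetical model of the scrubber's current half-open range [start, end)."""
    start: int
    end: int

    def is_outside(self, soid: int) -> bool:
        # Same shape as the asserted condition:
        #   soid < scrubber.start || soid >= scrubber.end
        return soid < self.start or soid >= self.end


def check_object(scrub: ScrubRange, soid: int) -> None:
    # Fails in the same situation as the assert from the thread:
    # the object id falls inside the range being scrubbed.
    assert scrub.is_outside(soid), f"soid {soid} is inside [{scrub.start}, {scrub.end})"


if __name__ == "__main__":
    scrub = ScrubRange(start=10, end=20)
    check_object(scrub, 9)   # below the range: passes
    check_object(scrub, 20)  # end is exclusive: passes
    try:
        check_object(scrub, 15)
    except AssertionError as exc:
        print("assertion failed:", exc)
```

Note that the end bound is exclusive, so an object id equal to scrubber.end does not trip the assertion.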

* Re: issue 8747 / 9011
  2014-09-19 23:43 issue 8747 / 9011 Sage Weil
@ 2014-09-20  3:08 ` Dmitry Smirnov
  2014-09-21 19:28   ` Sage Weil
  0 siblings, 1 reply; 8+ messages in thread
From: Dmitry Smirnov @ 2014-09-20  3:08 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Hi Sage,

On Fri, 19 Sep 2014 16:43:31 Sage Weil wrote:
> Are you still seeing this crash?
> 
>  osd/ReplicatedPG.cc: 5297: FAILED assert(soid < scrubber.start || soid >=
> scrubber.end)

Thanks for following up on this, Sage.
Yes, I saw this crash just recently on 0.80.5. It usually happens during a
long recovery, such as when an OSD is replaced. I've seen it happen after
hours of backfilling/remapping, although it may take a long time to manifest.

-- 
Cheers,
 Dmitry Smirnov
 GPG key : 4096R/53968D1B

---

However beautiful the strategy, you should occasionally look at the
results.
        -- Winston Churchill

* Re: issue 8747 / 9011
  2014-09-20  3:08 ` Dmitry Smirnov
@ 2014-09-21 19:28   ` Sage Weil
  2014-09-22  1:13     ` Dmitry Smirnov
  0 siblings, 1 reply; 8+ messages in thread
From: Sage Weil @ 2014-09-21 19:28 UTC (permalink / raw)
  To: Dmitry Smirnov; +Cc: ceph-devel

On Fri, 19 Sep 2014, Dmitry Smirnov wrote:
> Hi Sage,
> 
> On Fri, 19 Sep 2014 16:43:31 Sage Weil wrote:
> > Are you still seeing this crash?
> > 
> >  osd/ReplicatedPG.cc: 5297: FAILED assert(soid < scrubber.start || soid >=
> > scrubber.end)
> 
> Thanks for following up on this, Sage.
> Yes, I saw this crash just recently on 0.80.5. It usually happens during a
> long recovery, such as when an OSD is replaced. I've seen it happen after
> hours of backfilling/remapping, although it may take a long time to manifest.

Is there any possibility of enabling logging on your osds (debug ms = 1, 
debug osd = 20) so that we can capture this?

Thanks-
sage
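
For reference, the two settings named above can be written as a ceph.conf-style fragment. The Python sketch below renders such a fragment. It assumes the options belong under an [osd] section; render_debug_fragment is a hypothetical helper that relies only on the standard configparser module.

```python
import configparser
import io


def render_debug_fragment() -> str:
    """Render the debug settings from the thread as an ini-style fragment."""
    parser = configparser.ConfigParser()
    # Assumption: the options are placed under [osd].
    # The values (1 and 20) are the ones requested in the thread.
    parser["osd"] = {"debug ms": "1", "debug osd": "20"}
    buf = io.StringIO()
    parser.write(buf)
    return buf.getvalue()


if __name__ == "__main__":
    print(render_debug_fragment(), end="")
```

Logging at these levels is verbose, so it is usually worth checking free disk space on the OSD hosts before leaving it enabled for a long recovery.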

* Re: issue 8747 / 9011
  2014-09-21 19:28   ` Sage Weil
@ 2014-09-22  1:13     ` Dmitry Smirnov
  2014-09-22  2:01       ` Sage Weil
  0 siblings, 1 reply; 8+ messages in thread
From: Dmitry Smirnov @ 2014-09-22  1:13 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Sun, 21 Sep 2014 12:28:23 Sage Weil wrote:
> Is there any possibility of enabling logging on your osds (debug ms = 1,
> debug osd = 20) so that we can capture this?

I'll put it on my TODO list but I won't be able to do it anytime soon...

Meanwhile is there any chance for #8752 to get some attention? It is regarding 
inconsistent PGs on RBD caching pool. Thanks.

-- 
Cheers,
 Dmitry Smirnov
 GPG key : 4096R/53968D1B

* Re: issue 8747 / 9011
  2014-09-22  1:13     ` Dmitry Smirnov
@ 2014-09-22  2:01       ` Sage Weil
  2014-10-02 13:47         ` issue #8752 (inconsistent PGs on RBD caching pool) Dmitry Smirnov
  0 siblings, 1 reply; 8+ messages in thread
From: Sage Weil @ 2014-09-22  2:01 UTC (permalink / raw)
  To: Dmitry Smirnov; +Cc: ceph-devel

On Mon, 22 Sep 2014, Dmitry Smirnov wrote:
> On Sun, 21 Sep 2014 12:28:23 Sage Weil wrote:
> > Is there any possibility of enabling logging on your osds (debug ms = 1,
> > debug osd = 20) so that we can capture this?
> 
> I'll put it on my TODO list but I won't be able to do it anytime soon...
> 
> Meanwhile is there any chance for #8752 to get some attention? It is regarding 
> inconsistent PGs on RBD caching pool. Thanks.

This is one we have never seen in our QA environment, and no real leads.  
There are a couple slightly different scrub issues that pop up 
occasionally that we are trying to nail down, but this one is a bit 
different.  Being able to reliably reproduce it and generate logs is the 
usual strategy...

sage

* Re: issue #8752 (inconsistent PGs on RBD caching pool)
  2014-09-22  2:01       ` Sage Weil
@ 2014-10-02 13:47         ` Dmitry Smirnov
  2014-10-02 15:28           ` Sage Weil
  0 siblings, 1 reply; 8+ messages in thread
From: Dmitry Smirnov @ 2014-10-02 13:47 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Sun, 21 Sep 2014 19:01:52 Sage Weil wrote:
> This is one we have never seen in our QA environment, and no real leads.

I'm very surprised by this... Is it really that unusual to use a replicated
caching pool in front of an RBD erasure pool? All my OSDs are Btrfs-based and
I've recently upgraded all kernels (i.e. kernel RBD clients) to 3.16.3.

Unlike some elusive issues that may be hard to replicate, this particular one
has been very persistent and noticeable; it takes no effort to reproduce at
all. I've been observing it for several months already...

It is unlikely that I have anything special in my v0.80.5 cluster's 
configuration...

> There are a couple slightly different scrub issues that pop up
> occasionally that we are trying to nail down, but this one is a bit
> different.  Being able to reliably reproduce it and generate logs is the
> usual strategy...

Please advise what kind of logs would be useful. Something like "(debug ms =
1, debug osd = 20)" from the primary OSD of the inconsistent PG, captured at
the time the "scrub" command is given?

Thanks.

-- 
All the best,
 Dmitry Smirnov.

* Re: issue #8752 (inconsistent PGs on RBD caching pool)
  2014-10-02 13:47         ` issue #8752 (inconsistent PGs on RBD caching pool) Dmitry Smirnov
@ 2014-10-02 15:28           ` Sage Weil
  2014-10-02 21:09             ` Dmitry Smirnov
  0 siblings, 1 reply; 8+ messages in thread
From: Sage Weil @ 2014-10-02 15:28 UTC (permalink / raw)
  To: Dmitry Smirnov; +Cc: ceph-devel

On Thu, 2 Oct 2014, Dmitry Smirnov wrote:
> On Sun, 21 Sep 2014 19:01:52 Sage Weil wrote:
> > This is one we have never seen in our QA environment, and no real leads.
> 
> I'm very surprised by this... Is it really that unusual to use a replicated
> caching pool in front of an RBD erasure pool? All my OSDs are Btrfs-based and
> I've recently upgraded all kernels (i.e. kernel RBD clients) to 3.16.3.

My guess is a btrfs issue.  The weird thing about your report is the byte 
totals are off by an uneven number of bytes (3 bytes, 9 bytes, etc.).  
We haven't ever seen this.  We do test RBD over cache tiers on btrfs, 
but not with EC on the base.  I'll add that combo to the matrix.  My first 
guess is a btrfs issue, honestly.

> Unlike some elusive issues that may be hard to replicate, this particular one
> has been very persistent and noticeable; it takes no effort to reproduce at
> all. I've been observing it for several months already...

Does it continue to come up after the kernels are upgraded (and after a 
full cycle of scrub and repairs have been done to clear out 
inconsistencies introduced while running the older kernel)?

sage

> It is unlikely that I have anything special in my v0.80.5 cluster's 
> configuration...
>
> > There are a couple slightly different scrub issues that pop up
> > occasionally that we are trying to nail down, but this one is a bit
> > different.  Being able to reliably reproduce it and generate logs is the
> > usual strategy...
> 
> Please advise what kind of logs would be useful. Something like "(debug ms =
> 1, debug osd = 20)" from the primary OSD of the inconsistent PG, captured at
> the time the "scrub" command is given?
> 
> Thanks.
> 
> -- 
> All the best,
>  Dmitry Smirnov.
> 

* Re: issue #8752 (inconsistent PGs on RBD caching pool)
  2014-10-02 15:28           ` Sage Weil
@ 2014-10-02 21:09             ` Dmitry Smirnov
  0 siblings, 0 replies; 8+ messages in thread
From: Dmitry Smirnov @ 2014-10-02 21:09 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Thu, 2 Oct 2014 08:28:16 Sage Weil wrote:
> My guess is a btrfs issue.  The weird thing about your report is the byte
> totals are off by an uneven number of bytes (3 bytes, 9 bytes, etc.).
> We haven't ever seen this.  We do test RBD over cache tiers on btrfs,
> but not with EC on the base.  I'll add that combo to the matrix.  My first
> guess is a btrfs issue, honestly.

I think I found where it is happening: for a while I was using Btrfs-based
OSDs with journals on an ext4 partition on an SSD. As an experiment I decided
to move all journal files back to their OSDs, and that eliminated the
inconsistencies. I've updated the ticket with this information.
This behaviour is reproducible on 0.80.6.

It looks like Btrfs snapshotting does not affect this issue.

> Does it continue to come up after the kernels are upgraded (and after a
> full cycle of scrub and repairs have been done to clear out
> inconsistencies introduced while running the older kernel)?

Yes, I have tried many times, after every kernel update or any other change
in the cluster. Repair is usually ineffective: it logs "repair 1 errors,
1 fixed", but "ceph pg scrub" finds an error again right away. Moreover,
repair is not even necessary -- inconsistencies stay on some PGs for a while
and then "move" to different PGs. For example, "ceph pg scrub 19.NN"
sometimes clears the affected PG from the "inconsistent" state and sometimes
discovers a new inconsistency, seemingly at random.

Thank you.

-- 
Cheers,
 Dmitry Smirnov.

---

Odious ideas are not entitled to hide from criticism behind the human
shield of their believers' feelings.
        -- Richard Stallman
