linux-xfs.vger.kernel.org archive mirror
* XFS bug?
@ 2016-11-30 13:07 Christian Theune
  2016-12-01 11:03 ` Dave Chinner
  0 siblings, 1 reply; 7+ messages in thread
From: Christian Theune @ 2016-11-30 13:07 UTC (permalink / raw)
  To: linux-xfs


Hi there,

we’re running a Ceph cluster which had a very rough outage not long ago[1].

When updating our previous kernels from 4.1.16 (Gentoo) to 4.4.27 (Gentoo) we encountered the following problem in our production environment (but not in staging or development):

- Properly shut down and reboot the machine running Ceph OSDs on XFS w/ kernel 4.1.16.
- Boot with 4.4.27, let the machine mount the filesystems and start the OSDs
- Have everything run 20-30 minutes
- Ceph OSDs start crashing. Kernel shows messages attached in kern.log
- Panic. Breathe.
- The RAID controllers (LSI) did not exhibit any sign of disk problems at all.
- Trying to interact with the crashed filesystems, e.g. through xfs_repair, caused syscalls to hang indefinitely. A clean reboot was no longer possible at that point.

After some experimentation the way to clean things up with negligible residual harm was:

- reboot into 4.1 kernel
- run xfs_repair, force the journal to be cleaned with -L (in some instances)
- ensure a second xfs_repair run comes up clean, also after a mount/umount cycle
- reboot into 4.4 kernel
- run xfs_repair again, ensure it eventually comes up clean and stays that way after a mount/umount cycle as well as a reboot
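
The per-kernel part of the steps above can be sketched as a small planner that only builds the command lines and executes nothing. This is a rough illustration, not the exact commands we ran: the device path and mount point are hypothetical, and using xfs_repair -n (no-modify mode) for the "is it clean" checks is an assumption. Note that -L zeroes the log and can discard metadata updates, so it should stay a last resort.

```python
def repair_plan(device, mountpoint, zero_log=False):
    """Build the command sequence for one kernel boot; nothing is executed."""
    # -L forces log zeroing and is destructive; only used "in some instances".
    first = ["xfs_repair", "-L", device] if zero_log else ["xfs_repair", device]
    return [
        first,                          # repair pass
        ["xfs_repair", "-n", device],   # assumed check: no-modify mode
        ["mount", device, mountpoint],  # mount/umount cycle
        ["umount", mountpoint],
        ["xfs_repair", "-n", device],   # check again after the cycle
    ]


# Hypothetical device and mount point, for illustration only.
for argv in repair_plan("/dev/sdX1", "/mnt/osd", zero_log=True):
    print(" ".join(argv))
```

The same plan would be run once under the 4.1 kernel and once more under 4.4, with a reboot in between.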

An interesting error we saw during repair was this (I can’t remember or reconstruct whether this was on the 4.1 or 4.4 kernel):

bad agbno 4294967295 in agfl, agno 12
freeblk count 7 != flcount 6 in ag 12
sb_fdblocks 82969993, counted 82969994

bad agbno 4294967295 in agfl, agno 13
freeblk count 7 != flcount 6 in ag 13
sb_fdblocks 98156324, counted 98156325

Note that the agbno is 2**32-1 in both cases and that sb_fdblocks is off by one each time. I personally don’t have enough internal XFS knowledge, but to me this smells “interesting”.
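
To make the arithmetic explicit, here is a small sketch that parses the lines quoted above and checks both observations. The regular expressions are written only against these six lines and are not a general xfs_repair parser. Treating agbno as an unsigned 32-bit value is an assumption on my part; if it holds, 4294967295 is simply all bits set, i.e. (uint32)-1, which looks more like a "no block" marker than a real block number.

```python
import re

# Lines copied from the xfs_repair output quoted above.
REPAIR_OUTPUT = """\
bad agbno 4294967295 in agfl, agno 12
freeblk count 7 != flcount 6 in ag 12
sb_fdblocks 82969993, counted 82969994
bad agbno 4294967295 in agfl, agno 13
freeblk count 7 != flcount 6 in ag 13
sb_fdblocks 98156324, counted 98156325
"""

# Assumption: agbno is unsigned 32-bit, so all bits set equals (uint32)-1.
ALL_ONES_U32 = 2**32 - 1


def summarize(text):
    """Return (agbno values, freeblk - flcount, counted - sb_fdblocks)."""
    agbnos = [int(m) for m in re.findall(r"bad agbno (\d+) in agfl", text)]
    fl = [int(a) - int(b)
          for a, b in re.findall(r"freeblk count (\d+) != flcount (\d+)", text)]
    fd = [int(c) - int(s)
          for s, c in re.findall(r"sb_fdblocks (\d+), counted (\d+)", text)]
    return agbnos, fl, fd


agbnos, fl_deltas, fd_deltas = summarize(REPAIR_OUTPUT)
print(all(a == ALL_ONES_U32 for a in agbnos))  # True
print(fl_deltas, fd_deltas)                    # [1, 1] [1, 1]
```

In both AGs the free list count and the superblock free block count are each short by exactly one block, and each AG has one all-ones agfl entry.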

Also interesting: the broken filesystems and xfs_repair behaved completely differently depending on whether they were accessed from a 4.1 or a 4.4 kernel, hence the pattern of first running xfs_repair on 4.1 and then again on 4.4.


[-- Attachment #2: kern.log.gz --]
[-- Type: application/x-gzip, Size: 49378 bytes --]

This looks similar to [2] and may be related to the already fixed bug referenced by Dave in [3], but in our case there was no 32/64 bit migration involved.

I’d love it if someone could check whether this is a new bug. I reviewed all kernel logs since the old kernel we had but could not find anything that I can pinpoint to our situation.

Unfortunately, my notes aren’t as complete as I would have liked them to be. Let me know if you need anything specific and I’ll do my best to dig it up.

Cheers and thanks in advance,
Christian

[1] http://status.flyingcircus.io/incidents/h37gk5v81nz5
[2] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1576599
[3] https://plus.google.com/u/0/+FlorianHaas/posts/LNYMKQF7rgU

-- 
Christian Theune · ct@flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian. Theune, Christian. Zagrodnick


