public inbox for linux-xfs@vger.kernel.org
* XFS corrupt after RAID failure and resync
@ 2015-01-06  5:39 David Raffelt
  2015-01-06 12:36 ` Stefan Ring
  2015-01-06 12:41 ` Brian Foster
  0 siblings, 2 replies; 12+ messages in thread
From: David Raffelt @ 2015-01-06  5:39 UTC (permalink / raw)
  To: xfs



Hi All,

I have 7 drives in a RAID6 configuration with an XFS partition (running Arch
Linux). Recently two drives dropped out simultaneously, and a hot spare
immediately synced successfully, so I now have 6/7 drives up in the
array.

After a reboot (to replace the faulty drives) the XFS file system would not
mount. Note that I had to perform a hard reboot since the server hung on
shutdown. When I try to mount I get the following error:
mount: mount /dev/md0 on /export/data failed: Structure needs cleaning

I have tried to perform: xfs_repair /dev/md0
And I get the following output:

Phase 1 - find and verify superblock...
couldn't verify primary superblock - bad magic number !!!
attempting to find secondary superblock...
..............................................................................
                          [many lines like this]
..............................................................................
found candidate secondary superblock...unable to verify superblock,
continuing
..............................................................................

Note that it has been scanning for many hours and has located several
secondary superblocks with the same error. It is still scanning; however,
based on other posts I'm guessing it will not be successful.

To investigate the superblock info I used xfs_db and the magic number looks
ok:
sudo xfs_db /dev/md0
xfs_db> sb
xfs_db> p

magicnum = 0x58465342
blocksize = 4096
dblocks = 3662666880
rblocks = 0
rextents = 0
uuid = e74e5814-3e0f-4cd1-9a68-65d9df8a373f
logstart = 2147483655
rootino = 1024
rbmino = 1025
rsumino = 1026
rextsize = 1
agblocks = 114458368
agcount = 32
rbmblocks = 0
logblocks = 521728
versionnum = 0xbdb4
sectsize = 4096
inodesize = 512
inopblock = 8
fname = "\000\000\000\000\000\000\000\000\000\000\000\000"
blocklog = 12
sectlog = 12
inodelog = 9
inopblog = 3
agblklog = 27
rextslog = 0
inprogress = 0
imax_pct = 5
icount = 4629568
ifree = 34177
fdblocks = 362013500
frextents = 0
uquotino = 0
gquotino = null
qflags = 0
flags = 0
shared_vn = 0
inoalignmt = 2
unit = 128
width = 640
dirblklog = 0
logsectlog = 12
logsectsize = 4096
logsunit = 4096
features2 = 0xa
bad_features2 = 0xa
features_compat = 0
features_ro_compat = 0
features_incompat = 0
features_log_incompat = 0
crc = 0 (unchecked)
pquotino = 0
lsn = 0


Any help or suggestions at this point would be much appreciated!  Is my
only option to try xfs_repair -L?

Thanks in advance,
Dave


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

* XFS corrupt after RAID failure and resync
@ 2015-01-06  6:12 David Raffelt
  2015-01-06 12:47 ` Brian Foster
       [not found] ` <44b127de199c445fa12c3b832a05f108@000s-ex-hub-qs1.unimelb.edu.au>
  0 siblings, 2 replies; 12+ messages in thread
From: David Raffelt @ 2015-01-06  6:12 UTC (permalink / raw)
  To: xfs



Hi again,
Some more information... the kernel log shows the following errors were
occurring after the RAID recovery, but before I reset the server.

Jan 06 00:00:27 server kernel: XFS (md0): Corruption detected. Unmount and
run xfs_repair
Jan 06 00:00:27 server kernel: XFS (md0): Corruption detected. Unmount and
run xfs_repair
Jan 06 00:00:27 server kernel: XFS (md0): Corruption detected. Unmount and
run xfs_repair
Jan 06 00:00:27 server kernel: XFS (md0): metadata I/O error: block
0x36b106c00 ("xfs_trans_read_buf_map") error 117 numblks 16
Jan 06 00:00:27 server kernel: XFS (md0): xfs_imap_to_bp:
xfs_trans_read_buf() returned error 117.


Thanks,
Dave



* Re: XFS corrupt after RAID failure and resync
@ 2015-01-08  8:09 Chris Murphy
  0 siblings, 0 replies; 12+ messages in thread
From: Chris Murphy @ 2015-01-08  8:09 UTC (permalink / raw)
  To: David Raffelt; +Cc: Chris Murphy, xfs@oss.sgi.com

On Wed, Jan 7, 2015 at 12:05 AM, David Raffelt
<david.raffelt@florey.edu.au> wrote:

> Yes, after the 2 disks were dropped I definitely had a working degraded
> drive with 5/7 . I only see XFS errors in the kernel log soon AFTER the hot
> spare finished syncing.

I suggest moving this to the linux-raid@ list and include the following:

brief description: e.g. 7-drive raid6 array, 2 drives got kicked out at
some point due to errors, a hot spare started rebuilding and finished,
then XFS errors appeared in the log, and xfs_repair -n results suggest a
bad RAID assembly

kernel version
mdadm version
drive model numbers as well as their SCT ERC values
mdadm -E for all drives
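A rough way to pull all of that together is below. It is written as a dry
run that only prints the commands; the drive letters sda..sdg are
assumptions (check yours with lsblk), and you'd pipe the output through
sh as root to actually run it:

```shell
# Dry run: print the commands that would collect the details listed above
# (kernel version, mdadm version, per-drive SCT ERC, mdadm metadata).
# Drive letters sda..sdg are assumptions; adjust for your machine, then
# pipe the output through "sh" (as root) to actually run them.
echo "uname -r"
echo "mdadm --version"
for d in sda sdb sdc sdd sde sdf sdg; do
    echo "smartctl -l scterc /dev/$d"
    echo "mdadm -E /dev/$d"
done
```

(smartctl comes from the smartmontools package on most distros.)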

The list can take all of this. I'm not sure if it'll also take a large
journal but I'd try it first before using a URL.

For the journal, two things. First, it's not going back far enough: the
problems had already begun, and it'd be good to have a lot more context,
so I'd dig back and find the first indication of a problem. You can use
journalctl --since for this. It can take either form:

journalctl --since "24 hours ago"
journalctl --since "2015-01-04 12:15:00"


Also use the option -o short-monotonic, which prints monotonic
timestamps; that could come in handy, and it is closer to dmesg output.

>> smartctl -l scterc /dev/sdX
>
>
> I'm ashamed to say that this command only works on 1 of the 8 drives since
> this is the only enterprise class drive (we are funded by small science
> grants). We have been gradually replacing the desktop class drives as they
> fail.

The errors in your logs are a lot more extensive than what I'm used to
seeing in cases of misconfiguration with desktop drives that lack
configurable SCT ERC, but the failure is consistent with that common
misconfiguration. The problem with desktop drives is the combination
of long error recoveries for bad sectors along with a short kernel
SCSI command timer. So what happens is the kernel thinks the drive has
hung, and does a link reset. In reality the drive is probably in a
so-called "deep recovery" but doesn't get a chance to report an
explicit read error. An explicit read error includes the affected
sector LBA, which the md kernel code can then use to rebuild the data
from parity and overwrite the bad sector, fixing the problem.

However...


>> This has to be issued per drive, no shortcut available by specifying
>> all letters at once in brackets. And then lastly this one:
>>
>> cat /sys/block/sd[abcdefg]/device/timeout
>>
>> Again plug in the correct letters.
>
>
> All devices are set to 30 seconds.

This effectively prevents consumer drives from reporting marginally
bad blocks. If they're clearly bad, the drive's ECC reports read errors
fairly quickly. If they're fuzzy, the ECC does a bunch of retries,
potentially well beyond 30 seconds. I've heard times of 2-3 minutes,
which seems crazy, but that's apparently how long it can be before the
drive gives up and reports a read error. And that read error is
necessary for RAID to work correctly.

So what you need to do for all drives that do not have configurable SCT ERC is:

echo 180 > /sys/block/sdX/device/timeout

That way the kernel will wait up to 3 minutes. The drive will almost
certainly report an explicit read error in less than that, and then md
can fix the problem by writing over that bad sector. To force this
correction actively rather than passively, you should schedule a scrub
of all arrays:

echo check > /sys/block/mdX/md/sync_action

You can do this on complete arrays in normal operation. I wouldn't do
this on the degraded array though. Consult linux-raid@ and do what's
suggested there.
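Putting those two steps together, here is a sketch for a complete,
healthy array, written as a dry run that only prints the commands. The
drive letters and md0 are assumptions; remove the echo/quote wrapper to
apply it for real, and again, do not scrub the degraded array:

```shell
# Dry run: print the commands that would raise each drive's SCSI command
# timer to 180s and then start an md scrub. Drive letters sda..sdg and
# md0 are assumptions; only run the real commands (as root) on a
# complete, healthy array.
for d in sda sdb sdc sdd sde sdf sdg; do
    echo "echo 180 > /sys/block/$d/device/timeout"
done
echo "echo check > /sys/block/md0/md/sync_action"
```

Note that the timeout setting does not survive a reboot, so people
usually reapply it from a udev rule or a startup script.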




>> Right, well, it's not for sure toast yet. Also, one of the things
>> gluster is intended to mitigate is the loss of an entire brick, which
>> is what happened, but you need another 15TB of space to do
>> distributed-replicated on your scratch space. If you can tolerate
>> upwards of 48 hour single disk rebuild times, there are now 8TB HGST
>> Helium drives :-P
>
>
> Just to confirm, we have 3x15TB bricks in a 45TB volume. Don't we need
> complete duplication in a distributed-replicated Gluster volume, or can we
> get away with only 1 more brick?

If you want all the data to be replicated you need double the storage.
But you can have more than one volume, such that one has replication
and the other doesn't. The bricks used for replication volumes don't
both have to be raid6. It could be one raid6 and one raid5, or one
raid6 and one raid0. It's a risk assessment.


-- 
Chris Murphy



end of thread, other threads:[~2015-01-08  8:09 UTC | newest]

Thread overview: 12+ messages
2015-01-06  5:39 XFS corrupt after RAID failure and resync David Raffelt
2015-01-06 12:36 ` Stefan Ring
2015-01-06 12:41 ` Brian Foster
  -- strict thread matches above, loose matches on Subject: below --
2015-01-06  6:12 David Raffelt
2015-01-06 12:47 ` Brian Foster
     [not found] ` <44b127de199c445fa12c3b832a05f108@000s-ex-hub-qs1.unimelb.edu.au>
2015-01-06 20:34   ` David Raffelt
2015-01-06 23:16     ` Brian Foster
     [not found]     ` <8cc9a649ec2240faa4e38fd742437546@000S-EX-HUB-NP2.unimelb.edu.au>
2015-01-06 23:47       ` David Raffelt
2015-01-07  0:27         ` Dave Chinner
2015-01-07 16:16         ` Brian Foster
2015-01-07  2:35     ` Chris Murphy
2015-01-08  8:09 Chris Murphy
