problems with dm-raid 6

linux-raid.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* problems with dm-raid 6
       [not found] <trinity-235b76ed-571d-4615-b6f7-b4d5ed6a116d-1458509365312@3capp-gmx-bs09>
@ 2016-03-20 21:44 ` Patrick Tschackert
  2016-03-20 22:37   ` Andreas Klauer
  0 siblings, 1 reply; 16+ messages in thread
From: Patrick Tschackert @ 2016-03-20 21:44 UTC (permalink / raw)
  To: linux-raid

Hi, I've been referred here after this exchange: https://mail-archive.com/linux-btrfs@vger.kernel.org/msg51726.html
Especially the last email: https://mail-archive.com/linux-btrfs@vger.kernel.org/msg51763.html

Here's a rundown of my problem:
After rebooting the system, one of the harddisks was missing from my md raid 6 (the drive was /dev/sdf), so i rebuilt it with a hotspare that was already present in the system.
I physically removed the "missing" /dev/sdf drive after the restore and replaced it with a new drive.
This was all done using the following kernel:

$ uname -a
Linux vmhost 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt20-1+deb8u4
(2016-02-29) x86_64 GNU/Linux

After I got advice from the linux-btrfs mailing list, i upgraded to a newer kernel from the debian backports and increased the command timeout on the drives:

$ uname -a
Linux vmhost 4.3.0-0.bpo.1-amd64 #1 SMP Debian 4.3.5-1~bpo8+1 (2016-02-23) x86_64 GNU/Linux

$ cat /sys/block/md0/md/mismatch_cnt
0

$ for i in /sys/class/scsi_generic/*/device/timeout; do echo 120 > "$i"; done
(I know this isn't persistent across reboots...)

$ echo check > /sys/block/md0/md/sync_action

$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sda[0] sdf[12](S) sdg[11](S) sdj[9] sdh[7] sdi[6] sdk[10] sde[4] sdd[3] sdc[2] sdb[1]
20510948416 blocks super 1.2 level 6, 64k chunk, algorithm 2 [9/9] [UUUUUUUUU]
[>....................] check = 1.0% (30812476/2930135488) finish=340.6min speed=141864K/sec

unused devices: <none>

After the raid was done checking, I got this:

$ cat /sys/block/md0/md/mismatch_cnt
311936608

And messages in dmesg (attached to this mail) lead me to believe that the /dev/sdh drive is also faulty:

[12235.372901] sd 7:0:0:0: [sdh] tag#15 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[12235.372906] sd 7:0:0:0: [sdh] tag#15 Sense Key : Medium Error [current] [descriptor]
[12235.372909] sd 7:0:0:0: [sdh] tag#15 Add. Sense: Unrecovered read error - auto reallocate failed
[12235.372913] sd 7:0:0:0: [sdh] tag#15 CDB: Read(16) 88 00 00 00 00 00 af b2 bb 48 00 00 05 40 00 00
[12235.372916] blk_update_request: I/O error, dev sdh, sector 2947727304
[12235.372941] ata8: EH complete
[12266.856747] ata8.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
[12266.856753] ata8.00: irq_stat 0x40000008
[12266.856756] ata8.00: failed command: READ FPDMA QUEUED
[12266.856762] ata8.00: cmd 60/40:d8:08:17:b5/05:00:af:00:00/40 tag 27 ncq 688128 in
res 41/40:00:18:1b:b5/00:00:af:00:00/40 Emask 0x409 (media error) <F>
[12266.856765] ata8.00: status: { DRDY ERR }
[12266.856767] ata8.00: error: { UNC }
[12266.858112] ata8.00: configured for UDMA/133

Here are the output for "smartctl -x" for each disk in the array: http://pastebin.com/PCMMByJc
And here's my complete dmesg: http://pastebin.com/bwkhXh2S

This is the current status of the array:

$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sda[0] sdf[12](S) sdg[11](S) sdj[9] sdh[7] sdi[6] sdk[10] sde[4] sdd[3] sdc[2] sdb[1]
20510948416 blocks super 1.2 level 6, 64k chunk, algorithm 2 [9/9] [UUUUUUUUU]

unused devices: <none>

$ mdadm -D /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Sat Jun 14 18:47:44 2014
Raid Level : raid6
Array Size : 20510948416 (19560.77 GiB 21003.21 GB)
Used Dev Size : 2930135488 (2794.40 GiB 3000.46 GB)
Raid Devices : 9
Total Devices : 11
Persistence : Superblock is persistent

Update Time : Sun Mar 20 18:04:04 2016
State : clean
Active Devices : 9
Working Devices : 11
Failed Devices : 0
Spare Devices : 2

Layout : left-symmetric
Chunk Size : 64K

Name : brain:0
UUID : e45daf8f:99d0ff7f:e8244429:827e7c71
Events : 2393

Number Major Minor RaidDevice State
0 8 0 0 active sync /dev/sda
1 8 16 1 active sync /dev/sdb
2 8 32 2 active sync /dev/sdc
3 8 48 3 active sync /dev/sdd
4 8 64 4 active sync /dev/sde
10 8 160 5 active sync /dev/sdk
6 8 128 6 active sync /dev/sdi
7 8 112 7 active sync /dev/sdh
9 8 144 8 active sync /dev/sdj

11 8 96 - spare /dev/sdg
12 8 80 - spare /dev/sdf

The RAID holds an encrypted LUKS container. After opening it, the filesys can't be mounted (see https://mail-archive.com/linux-btrfs@vger.kernel.org/msg51726.html[https://mail-archive.com/linux-btrfs@vger.kernel.org/msg51726.html]).
Could this be due to errors on the raid?
Should i manually fail /dev/sdh and rebuild?

Thank you & kind Regards

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: problems with dm-raid 6
  2016-03-20 21:44 ` problems with dm-raid 6 Patrick Tschackert
@ 2016-03-20 22:37   ` Andreas Klauer
  2016-03-21 12:42     ` Phil Turmel
  0 siblings, 1 reply; 16+ messages in thread
From: Andreas Klauer @ 2016-03-20 22:37 UTC (permalink / raw)
  To: Patrick Tschackert; +Cc: linux-raid

On Sun, Mar 20, 2016 at 10:44:57PM +0100, Patrick Tschackert wrote:
> After rebooting the system, one of the harddisks was missing from my md raid 6 (the drive was /dev/sdf), so i rebuilt it with a hotspare that was already present in the system.
> I physically removed the "missing" /dev/sdf drive after the restore and replaced it with a new drive.

Exact commands involved for those steps?

mdadm --examine output for your disks?

> $ cat /sys/block/md0/md/mismatch_cnt
> 311936608

Basically the whole array out of whack.

This is what you get when you use --create --assume-clean on disks 
that are not actually clean... or if you somehow convince md to 
integrate a disk that does not have valid data on, for example 
because you copied partition table and md metadata - but not  
everything else - using dd.

Something really bad happened here and the only person who 
can explain it, is probably yourself.

Your best bet is that the data is valid on n-2 disks.

Use overlay https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID#Making_the_harddisks_read-only_using_an_overlay_file

Assemble the overlay RAID with any 2 disks missing (try all combinations) and see if you get valid data.

Regards
Andreas Klauer

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: problems with dm-raid 6
  2016-03-20 22:37   ` Andreas Klauer
@ 2016-03-21 12:42     ` Phil Turmel
  2016-03-21 13:27       ` Andreas Klauer
  2016-03-21 21:26       ` Chris Murphy
  0 siblings, 2 replies; 16+ messages in thread
From: Phil Turmel @ 2016-03-21 12:42 UTC (permalink / raw)
  To: Andreas Klauer, Patrick Tschackert; +Cc: linux-raid

Hi Patrick,

On 03/20/2016 06:37 PM, Andreas Klauer wrote:
> On Sun, Mar 20, 2016 at 10:44:57PM +0100, Patrick Tschackert wrote:
>> After rebooting the system, one of the harddisks was missing from my md raid 6 (the drive was /dev/sdf), so i rebuilt it with a hotspare that was already present in the system.
>> I physically removed the "missing" /dev/sdf drive after the restore and replaced it with a new drive.

Your smartctl output shows pending sector problems with sdf, sdh, and
sdj.  The latter are WD Reds that won't keep those problems through a
scrub, so I guess the smartctl report was from before that?

> Exact commands involved for those steps?
> 
> mdadm --examine output for your disks?

Yes, we want these.

>> $ cat /sys/block/md0/md/mismatch_cnt
>> 311936608
> 
> Basically the whole array out of whack.

Wow.

> This is what you get when you use --create --assume-clean on disks 
> that are not actually clean... or if you somehow convince md to 
> integrate a disk that does not have valid data on, for example 
> because you copied partition table and md metadata - but not  
> everything else - using dd.
> 
> Something really bad happened here and the only person who 
> can explain it, is probably yourself.

This is wrong.  Your mdadm -D output clearly shows a 2014 creation date,
so you definitely hadn't done --create --assume-clean at that point.
(Don't.)

> Your best bet is that the data is valid on n-2 disks.
> 
> Use overlay https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID#Making_the_harddisks_read-only_using_an_overlay_file
> 
> Assemble the overlay RAID with any 2 disks missing (try all combinations) and see if you get valid data.

No.  Something else is wrong, quite possibly hardware.  You don't get a
mismatch count like that without it showing up in smartctl too, unless
corrupt data was being written to one or more disks for a long time.

It's unclear from your dmesg what might have happened.  Probably bad
stuff going back years.

If you used ddrescue to replace sdf instead of letting mdadm reconstruct
it, that would have introduced zero sectors that would scramble your
encrypted filesystem.  Please let us know that you didn't use ddrescue.

The encryption inside your array will frustrate any attempt to do
per-member analysis.  I don't think there's anything still wrong with
the array (anything fixable, that is).

If an array error stomped on the key area of your dm-crypt layer, you
are totally destroyed, unless you happen to have a key backup you can
restore.

Otherwise you are at the mercy of fsck to try to fix your volume.  I
would use an overlay for that.

Phil

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: problems with dm-raid 6
  2016-03-21 12:42     ` Phil Turmel
@ 2016-03-21 13:27       ` Andreas Klauer
  2016-03-21 21:26       ` Chris Murphy
  1 sibling, 0 replies; 16+ messages in thread
From: Andreas Klauer @ 2016-03-21 13:27 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Patrick Tschackert, linux-raid

On Mon, Mar 21, 2016 at 08:42:16AM -0400, Phil Turmel wrote:
> On 03/20/2016 06:37 PM, Andreas Klauer wrote:
> > This is what you get when you use --create --assume-clean on disks 
> > that are not actually clean... or if you somehow convince md to 
> > integrate a disk that does not have valid data on, for example 
> > because you copied partition table and md metadata - but not  
> > everything else - using dd.
> > 
> > Something really bad happened here and the only person who 
> > can explain it, is probably yourself.
> 
> This is wrong.  Your mdadm -D output clearly shows a 2014 creation date,
> so you definitely hadn't done --create --assume-clean at that point.
> (Don't.)

It was just an example. You get a mismatch if (at least) one disk has 
wrong data on it; this many mismatches means there is a disk full of 
wrong data for some reason we do not know, hence I suggested trying 
to assemble it with disks missing, in the hope the data will start 
making sense once the bad disk(s) are gone.

> Something else is wrong, quite possibly hardware.

Nothing is impossible, but for a hardware error this is very unusual.

> I would use an overlay for that.

Use overlays, and for disks with bad sectors, make good copies of them 
(make sure to remember which of them were bad and where the holes in 
the copies are, i.e. keep the ddrescue logs). If you play around with 
disks that already have bad sectors, they might die completely on you.

Regards
Andreas Klauer

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: problems with dm-raid 6
  2016-03-21 12:42     ` Phil Turmel
  2016-03-21 13:27       ` Andreas Klauer
@ 2016-03-21 21:26       ` Chris Murphy
  2016-03-21 21:38         ` Andreas Klauer
  1 sibling, 1 reply; 16+ messages in thread
From: Chris Murphy @ 2016-03-21 21:26 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Andreas Klauer, Patrick Tschackert, linux-raid

Original thread on btrfs list (the OP links here didn't work for me):
http://www.spinics.net/lists/linux-btrfs/msg53143.html

On Mon, Mar 21, 2016 at 6:42 AM, Phil Turmel <philip@turmel.org> wrote:
> Hi Patrick,
>
> On 03/20/2016 06:37 PM, Andreas Klauer wrote:
>> On Sun, Mar 20, 2016 at 10:44:57PM +0100, Patrick Tschackert wrote:
>>> After rebooting the system, one of the harddisks was missing from my md raid 6 (the drive was /dev/sdf), so i rebuilt it with a hotspare that was already present in the system.
>>> I physically removed the "missing" /dev/sdf drive after the restore and replaced it with a new drive.
>
> Your smartctl output shows pending sector problems with sdf, sdh, and
> sdj.  The latter are WD Reds that won't keep those problems through a
> scrub, so I guess the smartctl report was from before that?

From what I understand, no, the smartctl are after the scrub check.
The dmesg shows read errors but no md attempt to fix up those errors,
which I thought was strange but might be a good thing if the raid is
not assembled correctly.

>> Your best bet is that the data is valid on n-2 disks.
>>
>> Use overlay https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID#Making_the_harddisks_read-only_using_an_overlay_file
>>
>> Assemble the overlay RAID with any 2 disks missing (try all combinations) and see if you get valid data.
>
> No.  Something else is wrong, quite possibly hardware.  You don't get a
> mismatch count like that without it showing up in smartctl too, unless
> corrupt data was being written to one or more disks for a long time.
>
> It's unclear from your dmesg what might have happened.  Probably bad
> stuff going back years.

Seems unlikely because this was a functioning raid6 with Btrfs on top.
So there'd have been a ton of Btrfs complaints.

I think something wrong happened with the device replace procedure, I
just can't tell what because all the devices are present and working
according to the -D output.

In that first message on the btrfs list you can see what things work
and don't work in more detail. The summary is, all three Btrfs super
blocks are found. This wouldn't be possible if the array weren't at
least partially correct, and also the LUKS volume were being unlocked
correctly. Unless there's something very nuanced and detailed we're
not understanding yet.

But as soon as commands are used to look for other things, there are
immediate failures, lots of metadata checksum errors, an inability to
read the chunk and root trees. So it's like there's a hole in the file
system. I just can't tell if it's a small one like the size of a drive
or a big one.

> Otherwise you are at the mercy of fsck to try to fix your volume.  I
> would use an overlay for that.

At this point I'm skeptical this will work. Also, I'm not familiar
with this overlay technique. I did look at the URL provided by
Andreas, my concern is whether it's possible for the volume UUID to
appear more than once to the kernel? There are some very tricky things
about Btrfs's dependency on volume UUID that can make it get confused
where it should be writing when it sees more than one device with the
same UUID. This is a problem with for example Btrfs on LVM, taking a
snapshot of an LV, and both LV's being active means in effect two
Btrfs instances with the same UUID and Btrfs can clobber them both in
a bad way.
https://btrfs.wiki.kernel.org/index.php/Gotchas

I really think the Btrfs file system, based on the OP's description on
the Btrfs list, is probably OK. The issue is raid6 assembly somehow
being wonky. Even if it were double degraded by pulling any two
suspect drives, I'd expect things to immediately get better and a
nomodify 'btrfs check' will then come up clean. The OP had a clean
shutdown. But it's an open question how long after device failure he
actually noticed it before doing the rebuild and how he did that
rebuild; and whether there's missing critical data on any of the other
bad sectors on the three remaining drives. Chances are, those sectors
don't overlap though.

But at this point we need to hear back from Patrick.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: problems with dm-raid 6
  2016-03-21 21:26       ` Chris Murphy
@ 2016-03-21 21:38         ` Andreas Klauer
  2016-03-21 21:46           ` Chris Murphy
  2016-03-21 22:42           ` Patrick Tschackert
  0 siblings, 2 replies; 16+ messages in thread
From: Andreas Klauer @ 2016-03-21 21:38 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Phil Turmel, Patrick Tschackert, linux-raid

On Mon, Mar 21, 2016 at 03:26:30PM -0600, Chris Murphy wrote:
> At this point I'm skeptical this will work. Also, I'm not familiar
> with this overlay technique. I did look at the URL provided by
> Andreas, my concern is whether it's possible for the volume UUID to
> appear more than once to the kernel?

Thanks for pointing out this issue, I'm not familiar with btrfs at all.

Normally this would be a problem and you'd have to find some way 
to change the UUID in the overlay or whatever, but if it's LUKS 
encrypted that should solve the problem nicely, you just have to 
stop the original RAID and close the original LUKS container 
and then only open the one of the overlay.

You should not assemble the original RAID anymore anyhow, anything 
you write to the array at this point will likely only increase 
damages. The overlay allows you to experiment in read-"write" 
mode without actually changing anything on your disks.

> The issue is raid6 assembly somehow being wonky.
> But at this point we need to hear back from Patrick.

Yes, it's quite a mystery, I hope the OP can shed some light on 
what might have caused this.

Regards
Andreas Klauer

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: problems with dm-raid 6
  2016-03-21 21:38         ` Andreas Klauer
@ 2016-03-21 21:46           ` Chris Murphy
  2016-03-21 22:42           ` Patrick Tschackert
  1 sibling, 0 replies; 16+ messages in thread
From: Chris Murphy @ 2016-03-21 21:46 UTC (permalink / raw)
  To: Andreas Klauer; +Cc: Chris Murphy, Phil Turmel, Patrick Tschackert, linux-raid

On Mon, Mar 21, 2016 at 3:38 PM, Andreas Klauer
<Andreas.Klauer@metamorpher.de> wrote:
> On Mon, Mar 21, 2016 at 03:26:30PM -0600, Chris Murphy wrote:
>> At this point I'm skeptical this will work. Also, I'm not familiar
>> with this overlay technique. I did look at the URL provided by
>> Andreas, my concern is whether it's possible for the volume UUID to
>> appear more than once to the kernel?
>
> Thanks for pointing out this issue, I'm not familiar with btrfs at all.
>
> Normally this would be a problem and you'd have to find some way
> to change the UUID in the overlay or whatever, but if it's LUKS
> encrypted that should solve the problem nicely, you just have to
> stop the original RAID and close the original LUKS container
> and then only open the one of the overlay.

Yes hiding it with LUKS is the way to go here. There's no practical
way to change the volume UUID as it's pervasive throughout the fs
metadata, it's used everywhere. Changing it with btrfstune means
completely rewriting out the entire fs metadata. It's the same now for
XFS v5 which similarly uses fs volume UUID everywhere as a part of
consistency checking, along with metadata checksumming. I'm not sure
about the optional ext4 checksumming but I think it too is making wide
use of fs volume UUID throughout metadata.

So I think we have to be careful with multiple instances of UUID
appearing and probably there need to be better checks in kernel code
to avoid confusion. Btrfs is harder because it explicitly supports
multiple devices, and the volume UUID correctly appears on all devices
(as well as a unique device specific UUID). So it's used to seeing
multiple instances of a volume UUID. But anyway, this is a side track
topic.

>
> You should not assemble the original RAID anymore anyhow, anything
> you write to the array at this point will likely only increase
> damages. The overlay allows you to experiment in read-"write"
> mode without actually changing anything on your disks.

Agreed. If it turns out there are some repairs needed within Btrfs
it's better with the overlay because it's unclear based on the errors
thus far what repair step to use next, and some of these repair
attempts can still sometimes make things worse (which are of course
bugs, but nevertheless...)

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: problems with dm-raid 6
@ 2016-03-21 22:42           ` Patrick Tschackert
  2016-03-21 22:54             ` Adam Goryachev
  2016-03-21 23:04             ` Andreas Klauer
  0 siblings, 2 replies; 16+ messages in thread
From: Patrick Tschackert @ 2016-03-21 22:42 UTC (permalink / raw)
  To: lists, Andreas.Klauer; +Cc: linux-raid

Hi Chris,

> From what I understand, no, the smartctl are after the scrub check.
That's correct, I ran those shortly before sending the OP.

> But it's an open question how long after device failure he
> actually noticed it before doing the rebuild and how he did that
> rebuild;

I noticed something was wrong directy after the reboot. I opened the LUKS container (which worked) and tried to mount the btrfs filesys, which didn't. After noticing the raid was in the state "clean, degraded", I triggered the restore by running mdadm --run, because i already had two hotspares present in the raid. So one of the hotspares was used to do the restore.
After that, i ran mdadm --readwrite /dev/md0, because the raid was set to read only. The problem with mounting the btrfs filesys still occurred.

> But at this point we need to hear back from Patrick.
Sorry for taking so long, long day at work :(

> You should not assemble the original RAID anymore anyhow, anything
> you write to the array at this point will likely only increase
> damages. The overlay allows you to experiment in read-"write"
> mode without actually changing anything on your disks.

> Agreed. If it turns out there are some repairs needed within Btrfs
> it's better with the overlay because it's unclear based on the errors
> thus far what repair step to use next, and some of these repair
> attempts can still sometimes make things worse (which are of course
> bugs, but nevertheless...)

I'll look into overlays and try that tomorrow, it's too late and i don't want to further screw this up by doing this half asleep :/
Based on your advice, I'll use an overlay on the array and then try to fix the btrfs filesystem. If I understand correctly, the dm overlay file would enable me to revert to the current state, in case the btrfs repair goes wrong?

Regards,
Patrick

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: problems with dm-raid 6
  2016-03-21 22:42           ` Patrick Tschackert
@ 2016-03-21 22:54             ` Adam Goryachev
  2016-03-21 23:15               ` Andreas Klauer
  2016-03-21 23:04             ` Andreas Klauer
  1 sibling, 1 reply; 16+ messages in thread
From: Adam Goryachev @ 2016-03-21 22:54 UTC (permalink / raw)
  To: Patrick Tschackert, lists, Andreas.Klauer; +Cc: linux-raid

On 22/03/16 09:42, Patrick Tschackert wrote:
> Hi Chris,
>
>>  From what I understand, no, the smartctl are after the scrub check.
> That's correct, I ran those shortly before sending the OP.
>
>> But it's an open question how long after device failure he
>> actually noticed it before doing the rebuild and how he did that
>> rebuild;
> I noticed something was wrong directy after the reboot. I opened the LUKS container (which worked) and tried to mount the btrfs filesys, which didn't. After noticing the raid was in the state "clean, degraded", I triggered the restore by running mdadm --run, because i already had two hotspares present in the raid. So one of the hotspares was used to do the restore.
> After that, i ran mdadm --readwrite /dev/md0, because the raid was set to read only. The problem with mounting the btrfs filesys still occurred.
>
>> But at this point we need to hear back from Patrick.
> Sorry for taking so long, long day at work :(
>
>> You should not assemble the original RAID anymore anyhow, anything
>> you write to the array at this point will likely only increase
>> damages. The overlay allows you to experiment in read-"write"
>> mode without actually changing anything on your disks.
>> Agreed. If it turns out there are some repairs needed within Btrfs
>> it's better with the overlay because it's unclear based on the errors
>> thus far what repair step to use next, and some of these repair
>> attempts can still sometimes make things worse (which are of course
>> bugs, but nevertheless...)
> I'll look into overlays and try that tomorrow, it's too late and i don't want to further screw this up by doing this half asleep :/
> Based on your advice, I'll use an overlay on the array and then try to fix the btrfs filesystem. If I understand correctly, the dm overlay file would enable me to revert to the current state, in case the btrfs repair goes wrong?
>
> Regards,
> Patrick

I think the suggestion is to use overlays on each individual 
drive/partition that is used in the RAID array, and then try/test adding 
a subset of those drives to assemble the raid array. That way, any 
writes to the raid will not damage the underlying data. To my untrained 
eye, it looks like maybe the "first" drive in your array is correct, and 
hence the first block returns the correct data so you can access the 
LUKS, but the second (or third, or fourth) is damaged, and thats why you 
can't read the filesystem inside the LUKS. Hence, try swapping the order 
of the disks and/or leaving different disks out, and see if you can then 
read both LUKS and the filesystem inside it.

Once you can do that, then either the filesystem will "Just Work" or 
else you might need to do a repair depending on what exactly went wrong, 
and how much was written during that time.

Regards,
Adam


-- 
Adam Goryachev Website Managers www.websitemanagers.com.au

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: problems with dm-raid 6
  2016-03-21 22:54             ` Adam Goryachev
@ 2016-03-21 23:15               ` Andreas Klauer
  2016-03-21 23:48                 ` Adam Goryachev
  0 siblings, 1 reply; 16+ messages in thread
From: Andreas Klauer @ 2016-03-21 23:15 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Patrick Tschackert, lists, linux-raid

On Tue, Mar 22, 2016 at 09:54:54AM +1100, Adam Goryachev wrote:
> To my untrained eye, it looks like maybe the "first" drive in your array is correct, and 
> hence the first block returns the correct data so you can access the 
> LUKS, but the second (or third, or fourth) is damaged, and thats why you 
> can't read the filesystem inside the LUKS.

This is a "problem" you get with arrays of many disks, if you "forget"
the correct drive order and then "create" the RAID anew, it might 
result in a perfectly mountable filesystem but with errors down the 
way since the first "wrong" data may appear outside the filesystems 
immediate metadata zone, if two later disks switched places.

However the OP only uses 64K chunksize, so that gives a lot less 
valid data than you'd get with 512K chunks. The LUKS header is already 
larger than 64K so if there is really bad data on one of the disks 
throughout, it's already quite lucky for the LUKS header to have 
survived. May be a good idea to grab a backup of that header while 
it's still working anyhow.

The one disk full of bad data theory might not even be correct, 
maybe a sync started, and somehow the disk got accepted as fully 
synced even though it didn't... because the controller silently 
ignored all writes? Mysterious selective hardware failure?

> Once you can do that, then either the filesystem will "Just Work" or 
> else you might need to do a repair depending on what exactly went wrong, 
> and how much was written during that time.

Hope dies last.

If btrfs stored data the same way a traditional filesystem would, 
uncompressed unencrypted unfragmented, you could hunt the raw data 
of your md for magic headers of known large files and see if you 
can tell in more detail the type of damage.

For example if you could find a large megapixel JPEG image like 
that and were able to load it but it would appear corrupted at 
some point, the point of corruption might point you to the 
disk you no longer want to be in your array.

But I don't know enough/anything about btrfs so not sure if viable.

Regards,
Andreas Klauer

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: problems with dm-raid 6
  2016-03-21 23:15               ` Andreas Klauer
@ 2016-03-21 23:48                 ` Adam Goryachev
  0 siblings, 0 replies; 16+ messages in thread
From: Adam Goryachev @ 2016-03-21 23:48 UTC (permalink / raw)
  To: Andreas Klauer; +Cc: Patrick Tschackert, lists, linux-raid

On 22/03/16 10:15, Andreas Klauer wrote:
> On Tue, Mar 22, 2016 at 09:54:54AM +1100, Adam Goryachev wrote:
>> To my untrained eye, it looks like maybe the "first" drive in your array is correct, and
>> hence the first block returns the correct data so you can access the
>> LUKS, but the second (or third, or fourth) is damaged, and thats why you
>> can't read the filesystem inside the LUKS.
> This is a "problem" you get with arrays of many disks, if you "forget"
> the correct drive order and then "create" the RAID anew, it might
> result in a perfectly mountable filesystem but with errors down the
> way since the first "wrong" data may appear outside the filesystems
> immediate metadata zone, if two later disks switched places.
>
> However the OP only uses 64K chunksize, so that gives a lot less
> valid data than you'd get with 512K chunks. The LUKS header is already
> larger than 64K so if there is really bad data on one of the disks
> throughout, it's already quite lucky for the LUKS header to have
> survived. May be a good idea to grab a backup of that header while
> it's still working anyhow.
>
> The one disk full of bad data theory might not even be correct,
> maybe a sync started, and somehow the disk got accepted as fully
> synced even though it didn't... because the controller silently
> ignored all writes? Mysterious selective hardware failure?
>
>> Once you can do that, then either the filesystem will "Just Work" or
>> else you might need to do a repair depending on what exactly went wrong,
>> and how much was written during that time.
> Hope dies last.
>
> If btrfs stored data the same way a traditional filesystem would,
> uncompressed unencrypted unfragmented, you could hunt the raw data
> of your md for magic headers of known large files and see if you
> can tell in more detail the type of damage.
>
> For example if you could find a large megapixel JPEG image like
> that and were able to load it but it would appear corrupted at
> some point, the point of corruption might point you to the
> disk you no longer want to be in your array.
>
> But I don't know enough/anything about btrfs so not sure if viable.

All of that is true, but it is a LUKS (encrypted) partition, so even if 
the filesystem format was simple, you wouldn't be able to work it out 
because all the data is encrypted. (At least, that is the point of data 
encryption right.... in practice maybe there are still some options, but 
I'm definitely out of my depth there).

Regards,
Adam

-- 
Adam Goryachev Website Managers www.websitemanagers.com.au

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: problems with dm-raid 6
  2016-03-21 22:42           ` Patrick Tschackert
  2016-03-21 22:54             ` Adam Goryachev
@ 2016-03-21 23:04             ` Andreas Klauer
  2016-03-22  3:53               ` Chris Murphy
  1 sibling, 1 reply; 16+ messages in thread
From: Andreas Klauer @ 2016-03-21 23:04 UTC (permalink / raw)
  To: Patrick Tschackert; +Cc: lists, linux-raid

On Mon, Mar 21, 2016 at 11:42:00PM +0100, Patrick Tschackert wrote:
> Based on your advice, I'll use an overlay on the array

Below the array. Assemble an array of overlay disks with two missing.
One of the missing disks should be that spare you synced in, because 
that's one of the prime suspects... actually if it was synced to it 
should not be the cause of mismatches, but if there are mismatches 
then it's likely the data that was synced to it is not good.

Although RAID is able to detect mismatches, it can not tell 
which is the valid data and which is the invalid data so ... 

Definitely assemble without that disk and without each other 
disk in turn; so for each attempt two disks missing from 
your RAID-6, hopefully one of them is a bad egg and with it 
gone, things will look better to btrfs than they do now.

The ultimate cause of those many mismatches remains a mystery.

Of course you can also attempt to repair btrfs directly but 
if btrfs redundancy is not equal to RAID-6 then it won't be 
able to fix. (I think you already cleared that point on 
the btrfs mailing list and would not be asking here if 
btrfs had the magic ability to recover)

Regards,
Andreas

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: problems with dm-raid 6
  2016-03-21 23:04             ` Andreas Klauer
@ 2016-03-22  3:53               ` Chris Murphy
  2016-03-22  4:22                 ` Chris Murphy
  0 siblings, 1 reply; 16+ messages in thread
From: Chris Murphy @ 2016-03-22  3:53 UTC (permalink / raw)
  To: Andreas Klauer; +Cc: Patrick Tschackert, Chris Murphy, linux-raid

On Mon, Mar 21, 2016 at 5:04 PM, Andreas Klauer
<Andreas.Klauer@metamorpher.de> wrote:

> Of course you can also attempt to repair btrfs directly but
> if btrfs redundancy is not equal to RAID-6 then it won't be
> able to fix. (I think you already cleared that point on
> the btrfs mailing list and would not be asking here if
> btrfs had the magic ability to recover)

No such unicorns in Btrfs.

In the Btrfs thread, I suggested the md scrub "check", and when I saw
that high count I was like, no Patrick, do not change anything, don't
run any Btrfs repair tools, go talk to the linux-raid@ folks. Those
mismatches are not read errors. Those are discrepancies between data
and parity strips (chunks for md folks, I use SNIA strip terminology
because in Btrfs a chunk is a kind of super-extent or collection of
extents, either metadata or data).

Patrick, do you remember what your mkfs options were?

If default, it will be single profile for data chunks, and DUP for
metadata chunks. That means there's only one copy of data extents, and
two copies of metadata. The file system is the metadata including all
checksums for data and metadata, and all trees. The problem though is
there's a possible single source of failure in really narrow cases.
Everything in Btrfs land is a logical address. if you do filefrag on a
file, the physical address * 4096 is the logical address within
Btrfs's address space, it's not a physical block. That address has to
be referenced in the chunk tree, which is what references two things:
what device(s) *and* what sector on those devices. So even though
there's a duplicate of the chunk tree, since there's only one logical
address in the super for the chunk tree, if that logical address
doesn't resolve to a sane physical location, there's no way to find
out where any copies are. So it's stuck.

Per the Btrfs FAQ:
------
There are three superblocks: the first one is located at 64K, the
second one at 64M, the third one at 256GB. The following lines reset
the magic string on all the three superblocks

# dd if=/dev/zero bs=1 count=8 of=/dev/sda seek=$((64*1024+64))
# dd if=/dev/zero bs=1 count=8 of=/dev/sda seek=$((64*1024*1024+64))
# dd if=/dev/zero bs=1 count=8 of=/dev/sda seek=$((256*1024*1024*1024+64))
------

So if someone can do the math and figure out what physical devices
those supers might be on, with a 64K chunk and 9 devices, might be
funny if they end up all on one drive... tragically funny. Hopefully
they're on multiple drives thought, and this suggests at least the
critical minimum number of drives are still sane and this can be
recovered even minus two drives.

If there's more than two drives toasted, then I don't think Btrfs can
help at all - same as any other file system.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: problems with dm-raid 6
  2016-03-22  3:53               ` Chris Murphy
@ 2016-03-22  4:22                 ` Chris Murphy
  0 siblings, 0 replies; 16+ messages in thread
From: Chris Murphy @ 2016-03-22  4:22 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Andreas Klauer, Patrick Tschackert, linux-raid

On Mon, Mar 21, 2016 at 9:53 PM, Chris Murphy <lists@colorremedies.com> wrote:
 So even though
> there's a duplicate of the chunk tree, since there's only one logical
> address in the super for the chunk tree, if that logical address
> doesn't resolve to a sane physical location, there's no way to find
> out where any copies are. So it's stuck.

Small clarification. The Btrfs superblock contains four backups. But
in Patrick's case, all four backups contain the same chunk root
address as the primary entry in the super (all five copies are the
same address) so... yeah stuck. Basically that address has to become
viable to start to get access to one or more copies of the filesystem,
including the root tree which is also a logical address. So none of
the trees can be sorted out without the chunk tree. There is a chunk
recover option, but I'm very skeptical of it being able to do anything
when the btrfs-find-root command fails totally for Patrick. There's
just not enough information available.

So I think the key is to get at the least n-2 drives up in a degraded
array and then yes indeed we'll have magic, or unicorn poop cookies,
or whatever...

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: problems with dm-raid 6
@ 2016-03-21 22:06 Patrick Tschackert
  0 siblings, 0 replies; 16+ messages in thread
From: Patrick Tschackert @ 2016-03-21 22:06 UTC (permalink / raw)
  To: Andreas.Klauer; +Cc: linux-raid

Thank you for answering!

>> After rebooting the system, one of the harddisks was missing from my md raid 6 (the drive was /dev/sdf), so i rebuilt it with a hotspare that was already present in the system.
>> I physically removed the "missing" /dev/sdf drive after the restore and replaced it with a new drive.

>Exact commands involved for those steps?

Well since the /dev/sdf disk was missing from the array after the reboot, i didn't use any command to remove it. I just used
$ mdadm --run /dev/md0
to trigger the rebuild/restore. As i had two spare drives present in the array anyway, i thought that was the smartest thing to do.
After the restore was done, i shut down the system and swapped the missing disk (/dev/sdf) with a new one.
I then added the new disk to the array as a spare (mdadm --add /dev/md0 /dev/sdf)

> mdadm --examine output for your disks?
Here is the output for every disk in the array: http://pastebin.com/JW8rbJYY

> This is what you get when you use --create --assume-clean on disks
> that are not actually clean... or if you somehow convince md to
> integrate a disk that does not have valid data on, for example
> because you copied partition table and md metadata - but not
> everything else - using dd.

I didn't use that command or anything like that, i just triggered the rebuild with mdadm --run. It then started the restore (i monitored the progress by looking at /proc/mdstat), and it seemed to complete successfully.

> Your best bet is that the data is valid on n-2 disks.
> Use overlay https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID#Making_the_harddisks_read-only_using_an_overlay_file
> Assemble the overlay RAID with any 2 disks missing (try all combinations) and see if you get valid data.

Thanks, I will definitely try that!

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: problems with dm-raid 6
@ 2016-03-21 22:19 Patrick Tschackert
  0 siblings, 0 replies; 16+ messages in thread
From: Patrick Tschackert @ 2016-03-21 22:19 UTC (permalink / raw)
  To: philip; +Cc: linux-raid

Hi Philip, thanks for answering!

> Your smartctl output shows pending sector problems with sdf, sdh, and
> sdj.  The latter are WD Reds that won't keep those problems through a
> scrub, so I guess the smartctl report was from before that?

The smartctl results are "fresh", i ran the commands just before sending my last eMail.

>> mdadm --examine output for your disks?
>Yes, we want these.

Here: http://pastebin.com/JW8rbJYY

> Your mdadm -D output clearly shows a 2014 creation date,
> so you definitely hadn't done --create --assume-clean at that point.
> (Don't.)

I didn't do that, I used mdadm --run /dev/md0 to start the rebuild/restore

> Something else is wrong, quite possibly hardware.  You don't get a
> mismatch count like that without it showing up in smartctl too, unless
> corrupt data was being written to one or more disks for a long time.

As I said in my initial eMail, I got

$ cat /sys/block/md0/md/mismatch_cnt
0

directly after the rebuild/restore. I then ran

$ for i in /sys/class/scsi_generic/*/device/timeout; do echo 120 > "$i"; done

to correct disk timeouts (got that advice from irc) and

$ echo check > /sys/block/md0/md/sync_action

to start a check on the raid. After the check was completed i got

$ cat /sys/block/md0/md/mismatch_cnt
311936608

> If you used ddrescue to replace sdf instead of letting mdadm reconstruct
> it, that would have introduced zero sectors that would scramble your
> encrypted filesystem.  Please let us know that you didn't use ddrescue.

I didn't do that, I just ran mdadm --run /dev/md0, which started the rebuild, nothing else.


> The encryption inside your array will frustrate any attempt to do
> per-member analysis.  I don't think there's anything still wrong with
> the array (anything fixable, that is).
> If an array error stomped on the key area of your dm-crypt layer, you
> are totally destroyed, unless you happen to have a key backup you can
> restore.
> Otherwise you are at the mercy of fsck to try to fix your volume.  I
> would use an overlay for that.

Well, the key area seems alright, i can open the volume using "cryptsetup luksOpen /dev/md0 storage", it asks for my passphrase and then opens the volume.
I can even read the BTRFS superblock (of the filesys on my luks volume), so the whole thing doesn't seem to be completely borked.
I'll read up on overlays and try them maybe.

Kind regards

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2016-03-22  4:22 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <trinity-235b76ed-571d-4615-b6f7-b4d5ed6a116d-1458509365312@3capp-gmx-bs09>
2016-03-20 21:44 ` problems with dm-raid 6 Patrick Tschackert
2016-03-20 22:37   ` Andreas Klauer
2016-03-21 12:42     ` Phil Turmel
2016-03-21 13:27       ` Andreas Klauer
2016-03-21 21:26       ` Chris Murphy
2016-03-21 21:38         ` Andreas Klauer
2016-03-21 21:46           ` Chris Murphy
2016-03-21 22:42           ` Patrick Tschackert
2016-03-21 22:54             ` Adam Goryachev
2016-03-21 23:15               ` Andreas Klauer
2016-03-21 23:48                 ` Adam Goryachev
2016-03-21 23:04             ` Andreas Klauer
2016-03-22  3:53               ` Chris Murphy
2016-03-22  4:22                 ` Chris Murphy
2016-03-21 22:06 Patrick Tschackert
  -- strict thread matches above, loose matches on Subject: below --
2016-03-21 22:19 Patrick Tschackert

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).