Linux RAID subsystem development

* [PATCH] md/bitmap: call bitmap_file_unmap once bitmap_storage_alloc returns -ENOMEM
From: Guoqing Jiang @ 2016-10-31  2:19 UTC (permalink / raw)
  To: shli; +Cc: linux-raid, Guoqing Jiang

It is possible that bitmap_storage_alloc could return -ENOMEM,
and some member inside store could be allocated such as filemap.

To avoid memory leak, we need to call bitmap_file_unmap to free
those members in the bitmap_resize.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/bitmap.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 2d826927a3bf..cd3a0659cc07 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -2029,8 +2029,10 @@ int bitmap_resize(struct bitmap *bitmap, sector_t blocks,
 					   !bitmap->mddev->bitmap_info.external,
 					   mddev_is_clustered(bitmap->mddev)
 					   ? bitmap->cluster_slot : 0);
-	if (ret)
+	if (ret) {
+		bitmap_file_unmap(&store);
 		goto err;
+	}
 
 	pages = DIV_ROUND_UP(chunks, PAGE_COUNTER_RATIO);
 
-- 
2.6.2



* Re: Panicked and deleted superblock
From: Andreas Klauer @ 2016-10-30 21:11 UTC (permalink / raw)
  To: Peter Hoffmann; +Cc: linux-raid
In-Reply-To: <7f1dd791-ed1c-5db6-555d-ed377995420e@gmx.net>

On Sun, Oct 30, 2016 at 09:45:27PM +0100, Peter Hoffmann wrote:
> nothing should be lost, as growing consumes more
> than it writes, stripe-wise speaking

That's what I meant by 'overlap' - it's the wrong word I guess.

> /dev/sda2 --luks--> /dev/mapper/HDD_0 \
> /dev/sdb2 --luks--> /dev/mapper/HDD_1 --raid--> /dev/md127 -ext4-> /raid
> /dev/sdc2 --luks--> /dev/mapper/HDD_2 /

You're hoping it will be faster with three threads instead of one?
That adds the overhead of encrypting parity. Not sure it's worth it.
This idea belongs to another era (before AES-NI).

But it's good, that way, you have "unencrypted" data on your RAID and can 
make deductions from that raw data as to chunk size and such things. 

> * anything else?

This is where I don't know how to provide specific help,
since you did not provide specific data I can work with.
Your data offset sounds strange to me, but with overlays 
it's faster to just go ahead and try.

You'll have to figure out the details by yourself, pretty much.

Once you have the correct offset you might be able to deduce the other 
offset. Create 4 loop devices the size of your disks (sparse files in tmpfs, 
truncate -s <size> thefile, losetup), create a 3-disk raid, grow it to 
4 disks, and check with mdadm --examine if & how the data offset changed.

> So I'm looking for a sequence of bytes that is duplicated on both
> overlays. This way I find the border between both parts.

Yes, there should be an identical region (let's hope not zeroes)
and you should roughly determine the end of that region and that's 
your entry point for a linear device mapping.
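
A minimal sketch of how that identical region could be located, assuming 
both overlay arrays are readable as plain block devices. The function name, 
block size and run length are illustrative choices, not anything from mdadm. 
All-zero blocks are skipped on purpose, since zeroes match everywhere and 
prove nothing.

```python
def find_common_region(dev_a, dev_b, block_size=4096, min_run=16):
    """Return the byte offset at which the first run of `min_run`
    consecutive identical, non-zero blocks starts in both streams.
    Returns None if no such run is found before either stream ends."""
    zero = bytes(block_size)
    run_start = None
    run_len = 0
    offset = 0
    while True:
        a = dev_a.read(block_size)
        b = dev_b.read(block_size)
        if len(a) < block_size or len(b) < block_size:
            return None  # ran out of data without a long enough match
        if a == b and a != zero:
            if run_len == 0:
                run_start = offset
            run_len += 1
            if run_len >= min_run:
                return run_start
        else:
            run_len = 0  # mismatch or all-zero block: start over
        offset += block_size
```

It would be called with two files opened in binary mode, for example 
open("/dev/md100", "rb") and open("/dev/md101", "rb"); those device names 
are hypothetical. The result only says where the identical region starts; 
where it ends, and therefore where to switch over, still has to be 
checked by hand.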

Regards
Andreas Klauer


* Re: [ LR] Kernel 4.8.4: INFO: task kworker/u16:8:289 blocked for more than 120 seconds.
From: TomK @ 2016-10-30 21:08 UTC (permalink / raw)
  To: Andreas Klauer; +Cc: linux-raid
In-Reply-To: <20161030201302.GB6727@metamorpher.de>

On 10/30/2016 4:13 PM, Andreas Klauer wrote:
> On Sun, Oct 30, 2016 at 02:56:58PM -0400, TomK wrote:
>> So the question is how come the mdadm RAID did not catch this disk as a
>> failed disk and pull it out of the array?
>
> RAID doesn't know about SMART. It's that simple.
>
> If SMART already knows about errors - too bad, RAID doesn't care.
> It also doesn't know about anything else really. You ddrescue the
> member disk directly and it finds tons of errors... RAID isn't involved.
>
> RAID will only kick when it by itself stumbles over an error that does
> not go away when rewriting data. Or when the drive just doesn't respond
> anymore for an extended period of time. And that timeout is per request
> so a bad disk can grind the entire system to a halt without ever being
> kicked.
>
> ddrescue has this nice --min-read-rate option, any zone that yields data
> slower will be considered a hopeless case, RAID does not have such magic.
> If your drive always responds and always claims to successfully write
> even when it doesn't, then RAID will never kick it.
>
> If you never run array checks or smart selftests, errors won't show.
> RAID will show them as healthy, SMART will show them as healthy,
> doesn't mean diddly-squat until you actually test it. Regularly.
>
> Kicking drives yourself is quite normal. RAID only does so much.
> This is why we have mdadm --replace, that way even a semi-broken disk
> can help with the rebuild effort and bad sectors on other disks won't
> result in an even bigger problem, or at least, not right away.
>
> If you leave RAID to its own devices, it has a much higher chance of dying
> than if you run tests, and actually decide to do something once *you're*
> aware that there are problems that RAID itself isn't aware of.
>
>> On a separate topic, if I eventually expand the array to 6 2TB disks,
>> will the array be smart enough to allow me to expand it to the new size?
>
> Yes. Perhaps after an additional --grow --size=max.
>
> Regards
> Andreas Klauer
Very clear. Thanks Andreas!

-- 
Cheers,
Tom K.
-------------------------------------------------------------------------------------

Living on earth is expensive, but it includes a free trip around the sun.



* Re: Panicked and deleted superblock
From: Peter Hoffmann @ 2016-10-30 20:45 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <20161030194326.GA6727@metamorpher.de>

On Sun, Oct 30, 2016 at 08:43:00PM +0100, Andreas Klauer wrote:
> On Sun, Oct 30, 2016 at 07:23:00PM +0100, Peter Hoffmann wrote:
>> I assume that both processes - re-sync  and grow - raced
>> through the array and did their job.
> 
> Oi oi oi, it's still one process per raid, no races. Isn't it? 
> I'm not a kernel developer so I don't really *know* what happens 
> in this case, but in my imagination it should go something like 
> - disk that is not fully synced, treat as unsynced/degraded, repopulate.
> 
> Either that or it's actually smart enough to remember it synced up to X 
> and just does the right thing(tm), whatever that is. But that sounds 
> like having to write out a lot of special cases instead of handling 
> the degraded case you must be able to cope with anyhow.
> 
> You have to re-write it and recalculate all parity anyway since the grow 
> changes everything.
> 
> As long as it didn't consider your half a disk to be fully synced, 
> your data should be completely fine. The only question is - where. ;)
All right, but even if the re-sync stopped as I started growing, like
you wrote, nothing should be lost, as growing consumes more
than it writes, stripe-wise speaking

| (D1) |  D2  |  D3  |      |  D1  |  D2  |  D3  | (D4) |
|------|------|------|      |------|------|------|------|
| (B1) |  B2  | P1,2 | ->   |  B1  |  B2  |  B3  |(P123)|
| (B4) | P3,4 |  B3  |      |  B5  |  B6  | P456 | (B4) |
| (P)  |  B5  |  B6  |      |  ?   | [B5] | [B6] |      |
where () marks blocks that don't exist but should be there, and
      [] marks blocks that exist but shouldn't be there

>> And after running for a while - my NAS is very slow (partly because all
>> disks are LUKS'd), mdstat showed around 1GiB of Data processed - we had
>> a blackout.
> 
> Stop trying to scare me! I'm not scared. 
> You you you and your spine-chilling halloween horror story.
> 
> Slow because of LUKS? You don't have LUKS below your RAID layer, right?
> Right? (Right?)
Ehm, ehm, may I call my lawyer ;-) Yes,

/dev/sda2 --luks--> /dev/mapper/HDD_0 \
/dev/sdb2 --luks--> /dev/mapper/HDD_1 --raid--> /dev/md127 -ext4-> /raid
/dev/sdc2 --luks--> /dev/mapper/HDD_2 /

>> the RAID superblock is now lost
> 
> Other people have proved Murphy's Law before, you know, 
> why bother doing it again?
> 
>> My idea is to look for that magic number of the ext4-fs to find the
>> beginning of Block 1 on Disk 1, then I would copy an reasonable amount
>> of data and try to figure out how big Block 1 and hence chunk-size is -
>> perhaps fsck.ext4 can help do that?
> 
> Determining the data offset, that's fine, only one thing to consider.
> Growing RAIDs changes that very offset you're looking for, so.
> Even if you find it, it's still wrong.
> 
>> One thing I'm wondering is if I got the layout right. And the other
>> might be rather a case for the ext4-mailing list but I'd ask it anyway:
>> how can I figure where the file system starts to be corrupted?
> 
> Let's not care about your filesystem for now. Also forget fsck.
> 
> It's dangerous to go alone. Take this.
> 
> https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID#Making_the_harddisks_read-only_using_an_overlay_file
> 
> Create two overlays. Two. Okay?
> 
> Overlay #1: You create your not-fully-grown 4 disk raid.
> 
> You have to figure out the disk order, raid level, metadata version,
* disk order seems pretty obvious to me _UU and later UUU_
* raid level is 5
* 1) data offset on the grown system seems to be 100000
     (at least I find ext4's magic signature at 100000+400+38)
  2) no idea where it might be for the unsynced version
* chunk size: I have no idea, I might have adjusted it from the default
  value for better alignment with the file system
* layout should be the default left-symmetric
  (all diagrams in the original mail are wrong, as data blocks in a
  stripe start after the parity block, not with the first disk)
* anything else?
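
For reference, a small sketch of the left-symmetric placement rule 
mentioned in the list above: parity moves one disk to the left on each 
stripe, and the data chunks of a stripe start on the disk right after 
parity, wrapping around. This follows the rule as md implements it to the 
best of my knowledge; the function name and the B1, B2, ... labels are 
only for illustration.

```python
def raid5_left_symmetric(n_disks, n_stripes):
    """Return one row per stripe; each row lists what every disk holds
    ('P' for parity, 'B<k>' for the k-th data chunk of the array)."""
    data_disks = n_disks - 1
    rows = []
    chunk = 0
    for stripe in range(n_stripes):
        # Parity starts on the last disk and moves left each stripe.
        parity = data_disks - (stripe % n_disks)
        row = [None] * n_disks
        row[parity] = "P"
        # Data chunks begin on the disk after parity and wrap around.
        for d in range(data_disks):
            chunk += 1
            row[(parity + 1 + d) % n_disks] = f"B{chunk}"
        rows.append(row)
    return rows


if __name__ == "__main__":
    for n in (3, 4):
        print(f"{n} disks:")
        for row in raid5_left_symmetric(n, 3):
            print("  " + " | ".join(f"{cell:>3}" for cell in row))
```

Comparing the 3-disk and 4-disk output shows which chunk should sit on 
which disk in either geometry, which helps when guessing chunk size and 
disk order from raw data.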


> data offset, chunk size, layout, and some things I don't remember. 
> If you got it right there should be a filesystem on the raid device. 
> Or a LUKS header. Or something that makes any sense whatsoever at 
> least for however far the reshape actually progressed.
> 
> Overlay #2: You create your not-fully-synced 3 disk raid.
>             Leaving the not-fully-synced disk as missing.
> 
> Basically this is the same thing as #1, except the data offset 
> might be different, there's obviously no 4th disk, and one of 
> the other three missing.
> 
> There probably WON'T be a filesystem on this one because it's 
> already grown over. So the beginning of this device is garbage, 
> it only starts making sense after the area that wasn't reshaped.
So I'm looking for a sequence of bytes that is duplicated on both
overlays. This way I find the border between both parts.

> If it was unencrypted... oh well. It wasn't. Was it?
> Now you've done it, I'm confused.
> 
> Then you find the point where data overlaps and create a linear mapping. 
> It overlaps because 4 disks hold more space than 3, so 1GB on 4 won't 
> overwrite 1GB on 3; there is an overlapping zone.
> 
> And you're done. At least in terms of having access to the whole thing.
> 
> Easy peasy.
> 
> Regards
> Andreas Klauer
Thank you, that overlay file system is the way to go

> PS: Do you _really_ not have anything left. Logfiles? Anything?
>     Maybe you asked anything about your raid anywhere before 
>     and posted examine along with it, tucked away in some 
>     linux forum or chat you might have perused...
> 
>     Please check. Your story is really interesting but nothing 
>     beats hard facts such as actual output of your crap.
I'd be happy to have any such things but I never had any trouble before


* Re: [ LR] Kernel 4.8.4: INFO: task kworker/u16:8:289 blocked for more than 120 seconds.
From: Andreas Klauer @ 2016-10-30 20:13 UTC (permalink / raw)
  To: TomK; +Cc: linux-raid
In-Reply-To: <73e35e17-80aa-c7e6-535c-3665d9789e16@mdevsys.com>

On Sun, Oct 30, 2016 at 02:56:58PM -0400, TomK wrote:
> So the question is how come the mdadm RAID did not catch this disk as a 
> failed disk and pull it out of the array?

RAID doesn't know about SMART. It's that simple.

If SMART already knows about errors - too bad, RAID doesn't care.
It also doesn't know about anything else really. You ddrescue the 
member disk directly and it finds tons of errors... RAID isn't involved.

RAID will only kick when it by itself stumbles over an error that does 
not go away when rewriting data. Or when the drive just doesn't respond 
anymore for an extended period of time. And that timeout is per request 
so a bad disk can grind the entire system to a halt without ever being 
kicked.

ddrescue has this nice --min-read-rate option, any zone that yields data
slower will be considered a hopeless case, RAID does not have such magic. 
If your drive always responds and always claims to successfully write 
even when it doesn't, then RAID will never kick it.
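
Since md will not flag such a disk by itself, one rough way to spot it is 
to look for a device that sits at 100% utilisation while completing no 
requests at all. The sketch below assumes the 12-column "iostat -x" layout 
posted in this thread, with each device on a single line; the function 
name and threshold are made up for illustration.

```python
def stalled_devices(iostat_rows, util_threshold=99.0):
    """Return names of devices that are (almost) fully busy yet complete
    no reads or writes. Expects 'iostat -x' device lines with the columns
    Device rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await
    svctm %util."""
    stalled = []
    for line in iostat_rows:
        fields = line.split()
        # Skip blank lines, the header, and anything with another layout.
        if len(fields) != 12 or fields[0].startswith("Device"):
            continue
        name = fields[0]
        reads_per_s = float(fields[3])
        writes_per_s = float(fields[4])
        util = float(fields[11])
        if util >= util_threshold and reads_per_s == 0.0 and writes_per_s == 0.0:
            stalled.append(name)
    return stalled
```

A single stalled sample means little; it is the pattern repeating for 
many seconds on the same member disk that points at a drive worth testing 
or replacing.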

If you never run array checks or smart selftests, errors won't show.
RAID will show them as healthy, SMART will show them as healthy, 
doesn't mean diddly-squat until you actually test it. Regularly.
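
As a sketch of what a regular array check amounts to: md exposes 
sync_action and mismatch_cnt under /sys/block/<md device>/md/, and writing 
"check" to sync_action starts a read-only scrub. The helper names below 
are invented, the sysfs_root parameter exists only so the code can be 
tried against a scratch directory, and writing to the real files needs 
root.

```python
from pathlib import Path


def start_check(md_name, sysfs_root="/sys/block"):
    """Start a scrub on the given md array if it is currently idle.
    Returns True if the check was requested, False otherwise."""
    action = Path(sysfs_root) / md_name / "md" / "sync_action"
    if action.read_text().strip() != "idle":
        return False  # a resync, recovery, reshape or check is running
    action.write_text("check\n")
    return True


def mismatch_count(md_name, sysfs_root="/sys/block"):
    """Return the mismatch counter reported by md for the array."""
    counter = Path(sysfs_root) / md_name / "md" / "mismatch_cnt"
    return int(counter.read_text())
```

Many distributions already ship a cron job or timer that does the same 
thing monthly; the point is only that something has to trigger it, and 
that somebody has to look at the result afterwards.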

Kicking drives yourself is quite normal. RAID only does so much. 
This is why we have mdadm --replace, that way even a semi-broken disk 
can help with the rebuild effort and bad sectors on other disks won't 
result in an even bigger problem, or at least, not right away.

If you leave RAID to its own devices, it has a much higher chance of dying 
than if you run tests, and actually decide to do something once *you're* 
aware that there are problems that RAID itself isn't aware of.

> On a separate topic, if I eventually expand the array to 6 2TB disks, 
> will the array be smart enough to allow me to expand it to the new size? 

Yes. Perhaps after an additional --grow --size=max.
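
The arithmetic behind that answer, as a rough sketch: in RAID 6 every 
member contributes only as much as the smallest one, and two members' 
worth of space goes to parity. Sizes below are nominal figures in GB taken 
from the disk mix described in the question; real usable sizes are 
slightly smaller, and the function name is only illustrative.

```python
def raid6_usable(member_sizes):
    """Usable capacity of a RAID 6 array, in the same unit as the input.
    Each member counts only as much as the smallest member."""
    if len(member_sizes) < 4:
        raise ValueError("RAID 6 needs at least 4 member devices")
    return (len(member_sizes) - 2) * min(member_sizes)


if __name__ == "__main__":
    mixed = [2000, 2000, 2000, 1000, 1000, 1500]  # current mix, nominal GB
    all_2tb = [2000] * 6
    print(raid6_usable(mixed))    # limited by the 1000 GB members
    print(raid6_usable(all_2tb))  # after the last small disk is replaced
```

So nothing is gained until the last small disk has been replaced, and 
even then the per-device size used by the array has to be raised, which 
is what --grow --size=max is for.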

Regards
Andreas Klauer


* Re: Panicked and deleted superblock
From: Andreas Klauer @ 2016-10-30 19:43 UTC (permalink / raw)
  To: Peter Hoffmann; +Cc: linux-raid
In-Reply-To: <0e68051d-1008-cf9b-1f8f-0a0736b1c58f@gmx.net>

On Sun, Oct 30, 2016 at 07:23:00PM +0100, Peter Hoffmann wrote:
> I assume that both processes - re-sync  and grow - raced
> through the array and did their job.

Oi oi oi, it's still one process per raid, no races. Isn't it? 
I'm not a kernel developer so I don't really *know* what happens 
in this case, but in my imagination it should go something like 
- disk that is not fully synced, treat as unsynced/degraded, repopulate.

Either that or it's actually smart enough to remember it synced up to X 
and just does the right thing(tm), whatever that is. But that sounds 
like having to write out a lot of special cases instead of handling 
the degraded case you must be able to cope with anyhow.

You have to re-write it and recalculate all parity anyway since the grow 
changes everything.

As long as it didn't consider your half a disk to be fully synced, 
your data should be completely fine. The only question is - where. ;)

> And after running for a while - my NAS is very slow (partly because all
> disks are LUKS'd), mdstat showed around 1GiB of Data processed - we had
> a blackout.

Stop trying to scare me! I'm not scared. 
You you you and your spine-chilling halloween horror story.

Slow because of LUKS? You don't have LUKS below your RAID layer, right?
Right? (Right?)

> the RAID superblock is now lost

Other people have proved Murphy's Law before, you know, 
why bother doing it again?

> My idea is to look for that magic number of the ext4-fs to find the
> beginning of Block 1 on Disk 1, then I would copy an reasonable amount
> of data and try to figure out how big Block 1 and hence chunk-size is -
> perhaps fsck.ext4 can help do that?

Determining the data offset, that's fine, only one thing to consider.
Growing RAIDs changes that very offset you're looking for, so.
Even if you find it, it's still wrong.
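
For the search itself, a minimal sketch: the ext4 superblock begins 
0x400 bytes into the filesystem and its magic 0xEF53 is stored 
little-endian at offset 0x38 inside it, so a filesystem start S shows the 
bytes 53 EF at S + 0x438. The function name and the 4096-byte alignment 
are assumptions for illustration. A two-byte magic gives false positives, 
and backup superblocks carry it too, so every hit is only a candidate.

```python
EXT4_MAGIC = b"\x53\xef"        # 0xEF53 stored little-endian
MAGIC_OFFSET = 0x400 + 0x38     # superblock offset + s_magic offset


def candidate_fs_starts(image, align=4096):
    """Return aligned offsets S such that image[S + 0x438:][:2] is the
    ext4 magic. `image` is any bytes-like object (bytes, mmap, ...)."""
    hits = []
    # Stop early enough that the two magic bytes still fit in the image.
    for start in range(0, len(image) - MAGIC_OFFSET - 1, align):
        pos = start + MAGIC_OFFSET
        if image[pos:pos + 2] == EXT4_MAGIC:
            hits.append(start)
    return hits
```

On a real disk this would be run over an mmap of the member device or of 
the first few hundred megabytes copied out of it, and the resulting 
offsets tried out on overlays rather than trusted.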

> One thing I'm wondering is if I got the layout right. And the other
> might be rather a case for the ext4-mailing list but I'd ask it anyway:
> how can I figure where the file system starts to be corrupted?

Let's not care about your filesystem for now. Also forget fsck.

It's dangerous to go alone. Take this.

https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID#Making_the_harddisks_read-only_using_an_overlay_file

Create two overlays. Two. Okay?

Overlay #1: You create your not-fully-grown 4 disk raid.

You have to figure out the disk order, raid level, metadata version, 
data offset, chunk size, layout, and some things I don't remember. 
If you got it right there should be a filesystem on the raid device. 
Or a LUKS header. Or something that makes any sense whatsoever at 
least for however far the reshape actually progressed.

Overlay #2: You create your not-fully-synced 3 disk raid.
            Leaving the not-fully-synced disk as missing.

Basically this is the same thing as #1, except the data offset 
might be different, there's obviously no 4th disk, and one of 
the other three missing.

There probably WON'T be a filesystem on this one because it's 
already grown over. So the beginning of this device is garbage, 
it only starts making sense after the area that wasn't reshaped.

If it was unencrypted... oh well. It wasn't. Was it?
Now you've done it, I'm confused.

Then you find the point where data overlaps and create a linear mapping. 
It overlaps because 4 disks hold more space than 3, so 1GB on 4 won't 
overwrite 1GB on 3; there is an overlapping zone.
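
A sketch of what such a linear mapping could look like as a 
device-mapper table: everything before the switch point comes from the 
grown 4-disk array, everything after it from the old 3-disk array, at the 
same logical sector on both. The function name and the device names in 
the usage note are hypothetical; the "start length linear device offset" 
line format is the standard dm linear target syntax, in 512-byte sectors.

```python
def linear_table(grown_dev, old_dev, switch_sector, total_sectors):
    """Build a two-line dm 'linear' table that reads sectors
    [0, switch_sector) from grown_dev and the rest from old_dev."""
    if not 0 < switch_sector < total_sectors:
        raise ValueError("switch_sector must lie inside the device")
    tail = total_sectors - switch_sector
    return "\n".join([
        f"0 {switch_sector} linear {grown_dev} 0",
        f"{switch_sector} {tail} linear {old_dev} {switch_sector}",
    ])


if __name__ == "__main__":
    # Hypothetical overlay arrays and sizes, for illustration only.
    print(linear_table("/dev/md100", "/dev/md101", 2048, 10000))
```

The printed table can be handed to dmsetup create; the switch sector has 
to fall inside the zone where both arrays show identical data, and the 
whole thing should only ever be built on top of the overlays.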

And you're done. At least in terms of having access to the whole thing.

Easy peasy.

Regards
Andreas Klauer

PS: Do you _really_ not have anything left. Logfiles? Anything?
    Maybe you asked anything about your raid anywhere before 
    and posted examine along with it, tucked away in some 
    linux forum or chat you might have perused...

    Please check. Your story is really interesting but nothing 
    beats hard facts such as actual output of your crap.


* Re: [ LR] Kernel 4.8.4: INFO: task kworker/u16:8:289 blocked for more than 120 seconds.
From: TomK @ 2016-10-30 19:16 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <73e35e17-80aa-c7e6-535c-3665d9789e16@mdevsys.com>

On 10/30/2016 2:56 PM, TomK wrote:
> Hey guys,
>
> We recently saw a situation where smartctl -A errored out; within a
> few days the disk cascaded into bad blocks, eventually becoming a
> completely unrecognizable SATA disk.  It had apparently been limping
> along for 6 months, causing random timeouts and slowdowns when
> accessing the array.  But the RAID array did not pull it out and did
> not mark it as bad.  The RAID 6 we have has been running for 6 years;
> we did have a lot of disk replacements in it, yet it was always very
> reliable.  Disks started as all 1TB Seagates but are now 2 WD 2TB,
> 1 2TB Seagate, 2 remaining 1TB Seagates, and the last one a 1.5TB.
> It has a mix of green, red, blue etc., yet it is rock solid.
>
> We did not do a thorough R/W test to see how the error and bad disk
> affected the data stored on the array, but we did notice pauses,
> slowdowns, and general difficulty reading data on the CIFS share
> presented from it, though no data errors that we could see.
> Since then we replaced the 2TB Seagate with a new 2TB WD and everything
> is fine even if the array is degraded.  But as soon as we put in this
> bad disk, it degraded to its previous behaviour.  Yet the array didn't
> catch it as a failed disk until the disk was nearly completely
> inaccessible.
>
> So the question is how come the mdadm RAID did not catch this disk as a
> failed disk and pull it out of the array?  It seems this disk had been
> going bad for a while, but as long as the array reported all 6 healthy,
> there was no cause for alarm.  Also, how does the array not detect the
> disk failure while issues in applications using the array show up?
> Removing the disk and leaving the array in a degraded state also solved
> the accessibility issue on the array.  So it appears the disk was
> generating some sort of errors (possibly a bad PCB) that were not
> caught before.
>
> Looking at the changelogs, has a similar case been addressed?
>
> On a separate topic, if I eventually expand the array to 6 2TB disks,
> will the array be smart enough to allow me to expand it to the new size?
>  Have not tried that yet and wanted to ask first.
>
> Cheers,
> Tom
>
>
> [root@mbpc-pc modprobe.d]# rpm -qf /sbin/mdadm
> mdadm-3.3.2-5.el6.x86_64
> [root@mbpc-pc modprobe.d]#
>
>
> (The 100% util lasts roughly 30 seconds)
> 10/23/2016 10:18:20 PM
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            0.00    0.00    0.25   25.19    0.00   74.56
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
> sdb               0.00     0.00    0.00    1.00     0.00     2.50     5.00     0.03   27.00  27.00   2.70
> sdc               0.00     0.00    0.00    1.00     0.00     2.50     5.00     0.01   15.00  15.00   1.50
> sdd               0.00     0.00    0.00    1.00     0.00     2.50     5.00     0.02   18.00  18.00   1.80
> sde               0.00     0.00    0.00    1.00     0.00     2.50     5.00     0.02   23.00  23.00   2.30
> sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     1.15    0.00   0.00 100.00
> sdg               0.00     2.00    1.00    4.00     4.00   172.00    70.40     0.04    8.40   2.80   1.40
> sda               0.00     0.00    0.00    1.00     0.00     2.50     5.00     0.04   37.00  37.00   3.70
> sdh               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sdj               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sdk               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sdi               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> fd0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> dm-0              0.00     0.00    1.00    6.00     4.00   172.00    50.29     0.05    7.29   2.00   1.40
> dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> dm-3              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> dm-5              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> dm-6              0.00     0.00    0.00    0.00     0.00     0.00     0.00     1.00    0.00   0.00 100.00
>
> 10/23/2016 10:18:21 PM
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            0.00    0.00    0.25   24.81    0.00   74.94
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
> sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     2.00    0.00   0.00 100.00
> sdg               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sdh               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sdj               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sdk               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sdi               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> fd0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> dm-3              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> dm-5              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> dm-6              0.00     0.00    0.00    0.00     0.00     0.00     0.00     1.00    0.00   0.00 100.00
>
>
> We can see that /dev/sdf ramps up to 100% starting at around (10/23/2016
> 10:18:18 PM) and stays that way till about the (10/23/2016 10:18:42 PM)
> mark when something occurs and it drops down to below 100% numbers.
>
> So I checked the array which shows all clean, even across reboots:
>
> [root@mbpc-pc ~]# cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid6 sdb[7] sdf[6] sdd[3] sda[5] sdc[1] sde[8]
>       3907045632 blocks super 1.2 level 6, 64k chunk, algorithm 2 [6/6]
> [UUUUUU]
>       bitmap: 1/8 pages [4KB], 65536KB chunk
>
> unused devices: <none>
> [root@mbpc-pc ~]#
>
>
> Then I run smartctl across all disks and sure enough /dev/sdf prints this:
>
> [root@mbpc-pc ~]# smartctl -A /dev/sdf
> smartctl 5.43 2012-06-30 r3573 [x86_64-linux-4.8.4] (local build)
> Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
>
> Error SMART Values Read failed: scsi error badly formed scsi parameters
> Smartctl: SMART Read Values failed.
>
> === START OF READ SMART DATA SECTION ===
> [root@mbpc-pc ~]#
>
>
>

Bit trigger happy.  Here's a better version of the first sentence.  :)

We recently saw a situation where smartctl -A errored out but mdadm 
didn't pick this up. Eventually, in a short time of a few days, the disk 
cascaded into bad blocks then became a completely unrecognizable SATA disk.

-- 
Cheers,
Tom K.
-------------------------------------------------------------------------------------

Living on earth is expensive, but it includes a free trip around the sun.



* [ LR] Kernel 4.8.4: INFO: task kworker/u16:8:289 blocked for more than 120 seconds.
From: TomK @ 2016-10-30 18:56 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <f6b83548-cb8b-be21-ee4f-cae9f7fa2950@turmel.org>

Hey guys,

We recently saw a situation where smartctl -A errored out; within a 
few days the disk cascaded into bad blocks, eventually becoming a 
completely unrecognizable SATA disk.  It had apparently been limping 
along for 6 months, causing random timeouts and slowdowns when 
accessing the array.  But the RAID array did not pull it out and did 
not mark it as bad.  The RAID 6 we have has been running for 6 years; 
we did have a lot of disk replacements in it, yet it was always very 
reliable.  Disks started as all 1TB Seagates but are now 2 WD 2TB, 
1 2TB Seagate, 2 remaining 1TB Seagates, and the last one a 1.5TB.  
It has a mix of green, red, blue etc., yet it is rock solid.

We did not do a thorough R/W test to see how the error and bad disk 
affected the data stored on the array, but we did notice pauses, 
slowdowns, and general difficulty reading data on the CIFS share 
presented from it, though no data errors that we could see. 
Since then we replaced the 2TB Seagate with a new 2TB WD and everything 
is fine even if the array is degraded.  But as soon as we put in this 
bad disk, it degraded to its previous behaviour.  Yet the array didn't 
catch it as a failed disk until the disk was nearly completely 
inaccessible.

So the question is how come the mdadm RAID did not catch this disk as a 
failed disk and pull it out of the array?  It seems this disk had been 
going bad for a while, but as long as the array reported all 6 healthy, 
there was no cause for alarm.  Also, how does the array not detect the 
disk failure while issues in applications using the array show up?  
Removing the disk and leaving the array in a degraded state also solved 
the accessibility issue on the array.  So it appears the disk was 
generating some sort of errors (possibly a bad PCB) that were not 
caught before.

Looking at the changelogs, has a similar case been addressed?

On a separate topic, if I eventually expand the array to 6 2TB disks, 
will the array be smart enough to allow me to expand it to the new size? 
  Have not tried that yet and wanted to ask first.

Cheers,
Tom


[root@mbpc-pc modprobe.d]# rpm -qf /sbin/mdadm
mdadm-3.3.2-5.el6.x86_64
[root@mbpc-pc modprobe.d]#


(The 100% util lasts roughly 30 seconds)
10/23/2016 10:18:20 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            0.00    0.00    0.25   25.19    0.00   74.56

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    0.00    1.00     0.00     2.50     5.00     0.03   27.00  27.00   2.70
sdc               0.00     0.00    0.00    1.00     0.00     2.50     5.00     0.01   15.00  15.00   1.50
sdd               0.00     0.00    0.00    1.00     0.00     2.50     5.00     0.02   18.00  18.00   1.80
sde               0.00     0.00    0.00    1.00     0.00     2.50     5.00     0.02   23.00  23.00   2.30
sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     1.15    0.00   0.00 100.00
sdg               0.00     2.00    1.00    4.00     4.00   172.00    70.40     0.04    8.40   2.80   1.40
sda               0.00     0.00    0.00    1.00     0.00     2.50     5.00     0.04   37.00  37.00   3.70
sdh               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdj               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdk               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdi               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
fd0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-0              0.00     0.00    1.00    6.00     4.00   172.00    50.29     0.05    7.29   2.00   1.40
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-3              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-5              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-6              0.00     0.00    0.00    0.00     0.00     0.00     0.00     1.00    0.00   0.00 100.00

10/23/2016 10:18:21 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            0.00    0.00    0.25   24.81    0.00   74.94

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     2.00    0.00   0.00 100.00
sdg               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdh               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdj               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdk               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdi               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
fd0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-3              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-5              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-6              0.00     0.00    0.00    0.00     0.00     0.00     0.00     1.00    0.00   0.00 100.00


We can see that /dev/sdf ramps up to 100% utilization starting at around
10/23/2016 10:18:18 PM and stays there until about 10/23/2016 10:18:42 PM,
when something occurs and it drops back below 100%.

So I checked the array which shows all clean, even across reboots:

[root@mbpc-pc ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdb[7] sdf[6] sdd[3] sda[5] sdc[1] sde[8]
       3907045632 blocks super 1.2 level 6, 64k chunk, algorithm 2 [6/6] [UUUUUU]
       bitmap: 1/8 pages [4KB], 65536KB chunk

unused devices: <none>
[root@mbpc-pc ~]#


Then I ran smartctl on all disks and, sure enough, /dev/sdf prints this:

[root@mbpc-pc ~]# smartctl -A /dev/sdf
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-4.8.4] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

Error SMART Values Read failed: scsi error badly formed scsi parameters
Smartctl: SMART Read Values failed.

=== START OF READ SMART DATA SECTION ===
[root@mbpc-pc ~]#




^ permalink raw reply

* Panicked and deleted superblock
From: Peter Hoffmann @ 2016-10-30 18:23 UTC (permalink / raw)
  To: linux-raid

My problem is the result of working late and not informing myself
beforehand. I'm fully aware that I should have had a backup and should
have been less spontaneous and more cautious.

The initial situation is a RAID-5 array with three disks. I assume it
looks as follows:

| Disk 1   | Disk 2   | Disk 3   |
|----------|----------|----------|
|    out   | Block 2  | P(1,2)   |
|    of    | P(3,4)   | Block 4  |	degraded but working
|   sync   | Block 5  | Block 6  |
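
For comparison with the assumed layout above, here is a minimal sketch of
how md's left-symmetric RAID-5 layout (shown as "algorithm 2" in
/proc/mdstat, and the layout mdadm uses for RAID-5 unless told otherwise)
places data and parity chunks. Whether this particular array used that
layout is not stated in the message, so treat this as an aid for checking
the tables, not as a description of the array. The function name
left_symmetric_row is illustrative, and chunks are numbered from 0 here
rather than from 1 as in the tables.

```python
def left_symmetric_row(stripe, raid_disks):
    """Return the chunk labels for one stripe, one label per disk.

    Parity starts on the last disk and moves one disk to the left with
    each stripe; data chunks start on the disk after parity and wrap.
    """
    data_disks = raid_disks - 1
    parity_disk = data_disks - (stripe % raid_disks)
    row = [None] * raid_disks
    row[parity_disk] = "P"
    for i in range(data_disks):
        disk = (parity_disk + 1 + i) % raid_disks
        row[disk] = "D%d" % (stripe * data_disks + i)
    return row


for s in range(3):
    print(left_symmetric_row(s, 3))
```

In this layout the second stripe places its first data chunk on the last
disk and its second data chunk on the first disk, which is the reverse of
the second row in the table above. That row is worth double-checking
before relying on the assumed block order.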


Then I started the re-sync:

| Disk 1   | Disk 2   | Disk 3   |
|----------|----------|----------|
| Block 1  | Block 2  | P(1,2)   |
| Block 3  | P(3,4)   | Block 4  |   	already synced
| P(5,6)   | Block 5  | Block 6  |
               . . .
|    out   | Block b  | P(a,b)   |
|    of    | P(c,d)   | Block d  |	not yet synced
|   sync   | Block e  | Block f  |

But I didn't wait for it to finish, as I actually wanted to add a fourth
disk, and so I started a grow process. However, I only changed the size
of the array; I didn't actually add the fourth disk (don't ask why, I
cannot recall). I assume that both processes, re-sync and grow, raced
through the array and did their job.

| Disk 1   | Disk 2   | Disk 3   |
|----------|----------|----------|
| Block 1  | Block 2  | Block 3  |
| Block 4  | Block 5  | P(4,5,6) |	with four disks but degraded
| Block 7  | P(7,8,9) | Block 8  |
               . . .
| Block a  | Block b  | P(a,b)   |
| Block c  | P(c,d)   | Block d  |	not yet grown but synced
| P(e,f)   | Block e  | Block f  |
               . . .
|    out   | Block V  | P(U,V)   |
|    of    | P(W,X)   | Block X  |		not yet synced
|   sync   | Block Y  | Block Z  |

After it had run for a while (my NAS is very slow, partly because all
disks are LUKS-encrypted; mdstat showed around 1 GiB of data processed),
we had a blackout. Water dripped into a distribution socket and *poff*.
After the reboot I wanted to reassemble everything but didn't know what
I was doing, so the RAID superblock is now lost and I failed to
reassemble the array (this is the part I really can't recall; I
panicked). I never wrote anything to the actual array, so I assume, or
rather hope, that no actual data is lost.

I have a plan but wanted to check with you before doing anything stupid
again.
My idea is to look for the ext4 magic number to find the beginning of
Block 1 on Disk 1. Then I would copy a reasonable amount of data and try
to figure out how big Block 1, and hence the chunk size, is; perhaps
fsck.ext4 can help with that? After that I would copy another reasonable
amount of data from Disks 1-3 to figure out the border between the grown
stripes and the synced stripes. From there on I'd have my data in a
defined state from which I can save the whole file system.
One thing I'm wondering is whether I got the layout right. The other
might be more a case for the ext4 mailing list, but I'll ask it anyway:
how can I figure out where the file system starts to be corrupted?
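
The search for the ext4 magic number mentioned above could look roughly
like the sketch below. It relies on two facts about the ext2/3/4 on-disk
format: the superblock starts 1024 bytes into the filesystem, and its
s_magic field (0xEF53, little-endian) sits at offset 0x38 within the
superblock. Everything else is an assumption made for illustration: the
function name find_ext_superblocks, the 4096-byte scan step, and the extra
plausibility check on s_log_block_size. Since the disks are LUKS-encrypted,
the scan would have to run against the opened mapping (or, better, a copy
of it), and the file is only ever opened read-only.

```python
import struct

SUPERBLOCK_OFFSET = 1024       # superblock starts 1024 bytes into the filesystem
MAGIC_OFFSET = 0x38            # s_magic field within the superblock
LOG_BLOCK_SIZE_OFFSET = 0x18   # s_log_block_size field within the superblock
EXT_MAGIC = 0xEF53


def find_ext_superblocks(path, step=4096, limit=None):
    """Yield byte offsets at which an ext2/3/4 filesystem might start.

    Opens `path` read-only and checks every `step` bytes for a plausible
    superblock 1024 bytes further on. Matches are candidates only: a
    two-byte magic also occurs by chance, and backup superblocks match too.
    """
    with open(path, "rb") as dev:
        offset = 0
        while limit is None or offset < limit:
            dev.seek(offset + SUPERBLOCK_OFFSET)
            sb = dev.read(MAGIC_OFFSET + 2)
            if len(sb) < MAGIC_OFFSET + 2:
                break  # reached the end of the device or image
            (magic,) = struct.unpack_from("<H", sb, MAGIC_OFFSET)
            (log_block_size,) = struct.unpack_from("<I", sb, LOG_BLOCK_SIZE_OFFSET)
            # Block size is 1024 << s_log_block_size; reject absurd values.
            if magic == EXT_MAGIC and log_block_size <= 6:
                yield offset
            offset += step
```

A call such as `list(find_ext_superblocks("/path/to/image", limit=2**30))`
(the path is a placeholder) would list candidate offsets in the first GiB.
Each candidate still needs checking by hand, for example with dumpe2fs on a
loop device set up at that offset.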

Embarrassed greetings,
Peter Hoffmann


^ permalink raw reply

* Re: clearing blocks wrongfully marked as bad if --update=no-bbl can't be used?
From: Marc MERLIN @ 2016-10-30 17:16 UTC (permalink / raw)
  To: Phil Turmel, Neil Brown, Andreas Klauer; +Cc: linux-raid
In-Reply-To: <20161030171234.GD28648@merlins.org>

On Sun, Oct 30, 2016 at 10:12:34AM -0700, Marc MERLIN wrote:
> Hi Neil,
> 
> Could you offer any guidance here? Is there something else I can do to clear
> those fake bad blocks (the underlying disks are fine, I scanned them)
> without rebuilding the array?

On Sun, Oct 30, 2016 at 06:02:42PM +0100, Andreas Klauer wrote:
> > There should be some --update=no-bbl --force if the admin knows the bad
> > block list is wrong and due to IO issues not related to the drive.
> 
> Good point. And hey, there it is.
> 
> mdadm.c
> 
> |	if (strcmp(c.update, "bbl") == 0)
> |		continue;
> |	if (strcmp(c.update, "no-bbl") == 0)
> |		continue;
> |	if (strcmp(c.update, "force-no-bbl") == 0)
> |		continue;
> 
> force-no-bbl. It's in mdadm v3.4, not sure about older ones.

Oh, very nice, thank you. It's not in the man page, but it works:

myth:~# mdadm --assemble --update=force-no-bbl /dev/md5
mdadm: /dev/md5 has been started with 5 drives.
myth:~# 
myth:~# mdadm --examine-badblocks /dev/sd[defgh]1
No bad-blocks list configured on /dev/sdd1
No bad-blocks list configured on /dev/sde1
No bad-blocks list configured on /dev/sdf1
No bad-blocks list configured on /dev/sdg1
No bad-blocks list configured on /dev/sdh1

Now I'll make sure to turn off this feature on all my other arrays
in case it got turned on without my asking for it.

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply

* clearing blocks wrongfully marked as bad if --update=no-bbl can't be used?
From: Marc MERLIN @ 2016-10-30 17:12 UTC (permalink / raw)
  To: Phil Turmel, Neil Brown; +Cc: Andreas Klauer, linux-raid
In-Reply-To: <f6b83548-cb8b-be21-ee4f-cae9f7fa2950@turmel.org>

Hi Neil,

Could you offer any guidance here? Is there something else I can do to clear
those fake bad blocks (the underlying disks are fine, I scanned them)
without rebuilding the array?

On Sun, Oct 30, 2016 at 12:34:56PM -0400, Phil Turmel wrote:
> On 10/30/2016 12:19 PM, Andreas Klauer wrote:
> > On Sun, Oct 30, 2016 at 08:38:57AM -0700, Marc MERLIN wrote:
> >> (mmmh, but even so, rebuilding the spare should have cleared the bad blocks
> >> on at least one drive, no?)
> > 
> > If n+1 disks have bad blocks there's no data to sync over, so they just 
> > propagate and stay bad forever. Or at least that's how it seemed to work 
> > last time I tried it. I'm not familiar with bad blocks. I just turn it off.
> 
> I, too, turn it off.  (I never let it turn on, actually.)
> 
> I'm a little disturbed that this feature has become the default on new
> arrays.  This feature was introduced specifically to support underlying
> storage technologies that cannot perform their own bad block management.
>  And since it doesn't implement any relocation algorithm for blocks
> marked bad, it simply gives up any redundancy for affected sectors.  And
> when there's no remaining redundancy, it simply passes the error up the
> stack.  In this case, your errors were created by known communications
> weaknesses that should always be recoverable with --assemble --force.
> 
> As far as I'm concerned, the bad block system is an incomplete feature
> that should never be used in production, and certainly not on top of any
> storage technology that implements error detection, correction, and
> relocation.  Like, every modern SATA and SAS drive.

Agreed. Just to confirm, I did indeed not willingly turn this on, and I
really wish it had not been turned on automatically.
As you point out, I've never needed this. Cabling-induced problems just
used to kill my array; I would fix the cabling and manually rebuild it.
Now my array doesn't get killed, but it is rendered barely usable and
causes my filesystem (btrfs) to abort and fail when I access the wrong
parts of it.

I'm now stuck with those fake bad blocks that I can't remove without some
complicated surgery: editing md metadata on disk, or recreating an array
on top of the current one with the option disabled and hoping things line up.

This really ought to work, or something similar:
myth:~# mdadm --assemble --force --update=no-bbl /dev/md5
mdadm: Cannot remove active bbl from /dev/sdf1
mdadm: Cannot remove active bbl from /dev/sdd1
mdadm: /dev/md5 has been started with 5 drives.
(as in the array was assembled, but it's not really useful without those
fake bad blocks cleared from the bad block list)

And yes, I agree that bad blocks should not be a default. I really wish they
had never been turned on automatically: I have already lost a week scanning
this array and looking at problems caused by this feature, which turns out to
have made a wrong assumption and doesn't seem to let me clear it :-/

Thanks to you both for your answers and for pointing me in the right direction.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply

* Re: Buffer I/O error on dev md5, logical block 7073536, async page read
From: Andreas Klauer @ 2016-10-30 17:02 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-raid
In-Reply-To: <20161030164342.GC28648@merlins.org>

On Sun, Oct 30, 2016 at 09:43:42AM -0700, Marc MERLIN wrote:
> Right.
> There should be some --update=no-bbl --force if the admin knows the bad
> block list is wrong and due to IO issues not related to the drive.

Good point. And hey, there it is.

mdadm.c

|	if (strcmp(c.update, "bbl") == 0)
|		continue;
|	if (strcmp(c.update, "no-bbl") == 0)
|		continue;
|	if (strcmp(c.update, "force-no-bbl") == 0)
|		continue;

force-no-bbl. It's in mdadm v3.4, not sure about older ones.

If I stumbled across that one before then I forgot about it.

Good luck
Andreas Klauer

^ permalink raw reply

* Re: Buffer I/O error on dev md5, logical block 7073536, async page read
From: Marc MERLIN @ 2016-10-30 16:43 UTC (permalink / raw)
  To: Andreas Klauer; +Cc: linux-raid
In-Reply-To: <20161030161929.GA5582@metamorpher.de>

On Sun, Oct 30, 2016 at 05:19:29PM +0100, Andreas Klauer wrote:
> On Sun, Oct 30, 2016 at 08:38:57AM -0700, Marc MERLIN wrote:
> > (mmmh, but even so, rebuilding the spare should have cleared the bad blocks
> > on at least one drive, no?)
> 
> If n+1 disks have bad blocks there's no data to sync over, so they just 
> propagate and stay bad forever. Or at least that's how it seemed to work 
> last time I tried it. I'm not familiar with bad blocks. I just turn it off.
> 
> As long as the bad block list is empty you can --update=no-bbl.
> If everything else fails - edit the metadata or carefully recreate.
> Which I don't recommend because you can go wrong in a hundred ways.
 
Right.
There should be some --update=no-bbl --force if the admin knows the bad
block list is wrong and due to IO issues not related to the drive.

> I don't remember if anyone ever had a proper solution to this.
> It came up a couple of times on the list so you could search.

Will look, thanks.

> If you've replaced drives since, the drive that has been part of the array 
> the longest is probably the most likely to still have valid data in there. 
> That could be synced over to the other drives once the bbl is cleared. 
> It might not matter, you'd have to check with your filesystems if they
> believe any files are located there. (Filesystems sometimes maintain their
> own bad block lists so you'd have to check those too.)

No drives were ever replaced, this is an original array used only a few
times (for backups).
At this point I'm almost tempted to wipe and start over, but it's going to
take a week to recreate the backup (lots of data, slow link).
As for the filesystem it's btrfs with data and metadata checksums, so it's
easy to verify that everything is fine once I can get md5 to stop returning
IO errors on blocks it thinks are bad, but in fact are not.

And there isn't one good drive between the two: the bad blocks are identical
on both drives and must have happened at the same time due to those
cable-induced IO errors I mentioned.
Too bad that mdadm doesn't seem to account for the fact that it could be
wrong when marking blocks as bad, and does not seem to give a way to recover
from this easily...
I'll do more reading, thanks.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply

* Re: Buffer I/O error on dev md5, logical block 7073536, async page read
From: Phil Turmel @ 2016-10-30 16:34 UTC (permalink / raw)
  To: Andreas Klauer, Marc MERLIN; +Cc: linux-raid
In-Reply-To: <20161030161929.GA5582@metamorpher.de>

On 10/30/2016 12:19 PM, Andreas Klauer wrote:
> On Sun, Oct 30, 2016 at 08:38:57AM -0700, Marc MERLIN wrote:
>> (mmmh, but even so, rebuilding the spare should have cleared the bad blocks
>> on at least one drive, no?)
> 
> If n+1 disks have bad blocks there's no data to sync over, so they just 
> propagate and stay bad forever. Or at least that's how it seemed to work 
> last time I tried it. I'm not familiar with bad blocks. I just turn it off.

I, too, turn it off.  (I never let it turn on, actually.)

I'm a little disturbed that this feature has become the default on new
arrays.  This feature was introduced specifically to support underlying
storage technologies that cannot perform their own bad block management.
 And since it doesn't implement any relocation algorithm for blocks
marked bad, it simply gives up any redundancy for affected sectors.  And
when there's no remaining redundancy, it simply passes the error up the
stack.  In this case, your errors were created by known communications
weaknesses that should always be recoverable with --assemble --force.

As far as I'm concerned, the bad block system is an incomplete feature
that should never be used in production, and certainly not on top of any
storage technology that implements error detection, correction, and
relocation.  Like, every modern SATA and SAS drive.

Phil

^ permalink raw reply

* Re: Buffer I/O error on dev md5, logical block 7073536, async page read
From: Andreas Klauer @ 2016-10-30 16:19 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-raid
In-Reply-To: <20161030153857.GB28648@merlins.org>

On Sun, Oct 30, 2016 at 08:38:57AM -0700, Marc MERLIN wrote:
> (mmmh, but even so, rebuilding the spare should have cleared the bad blocks
> on at least one drive, no?)

If n+1 disks have bad blocks there's no data to sync over, so they just 
propagate and stay bad forever. Or at least that's how it seemed to work 
last time I tried it. I'm not familiar with bad blocks. I just turn it off.

As long as the bad block list is empty you can --update=no-bbl.
If everything else fails - edit the metadata or carefully recreate.
Which I don't recommend because you can go wrong in a hundred ways.

I don't remember if anyone ever had a proper solution to this.
It came up a couple of times on the list so you could search.

If you've replaced drives since, the drive that has been part of the array 
the longest is probably the most likely to still have valid data in there. 
That could be synced over to the other drives once the bbl is cleared. 
It might not matter, you'd have to check with your filesystems if they
believe any files are located there. (Filesystems sometimes maintain their
own bad block lists so you'd have to check those too.)

Regards
Andreas Klauer

^ permalink raw reply

* Re: recovering failed raid5
From: Phil Turmel @ 2016-10-30 16:18 UTC (permalink / raw)
  To: Andreas Klauer, Roman Mamedov; +Cc: linux-raid
In-Reply-To: <20161029120230.GA4725@metamorpher.de>

On 10/29/2016 08:02 AM, Andreas Klauer wrote:

> If you rented a server in a datacenter, thus entitled to working hardware, 
> would you create a ticket on read failure or not?

If the rate of read errors is within the manufacturer specs, it *is*
"working hardware", by definition.  I would expect if you did file such
a ticket, without a pattern of multiple read errors outside the hardware
spec, for it to be rejected.  And if you made a nuisance of yourself in
such a professional environment when you are so clearly wrong, to find
your server rental contract cancelled.
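
To put "within the manufacturer specs" into numbers, the expected count of
unrecoverable read errors for a given amount of data can be estimated from
the quoted error rate. The sketch below is illustrative only: the drive
size and the rate of one error per 1e14 bits are assumed figures of the
kind found on datasheets, not values from this thread, and the probability
estimate treats bit errors as independent, which real drives need not obey.

```python
import math


def expected_read_errors(bytes_read, errors_per_bit):
    """Expected number of unrecoverable read errors for the data read."""
    return bytes_read * 8 * errors_per_bit


def probability_of_any_error(bytes_read, errors_per_bit):
    """Probability of at least one error, assuming independent bit errors."""
    # 1 - (1 - p)^n, computed via log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(bytes_read * 8 * math.log1p(-errors_per_bit))


# Assumed figures: a 2 TB drive read end to end at 1 error per 1e14 bits.
size_bytes = 2 * 10**12
rate = 1e-14
print(expected_read_errors(size_bytes, rate))
print(probability_of_any_error(size_bytes, rate))
```

Under these assumptions a full read of the drive is expected to hit well
under one error, so an isolated read error over months of use would not by
itself put the drive outside such a spec.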

Phil

^ permalink raw reply

* Re: Buffer I/O error on dev md5, logical block 7073536, async page read
From: Marc MERLIN @ 2016-10-30 15:38 UTC (permalink / raw)
  To: Andreas Klauer; +Cc: linux-raid
In-Reply-To: <20161030093337.GA3627@metamorpher.de>

On Sun, Oct 30, 2016 at 10:33:37AM +0100, Andreas Klauer wrote:
> On Sat, Oct 29, 2016 at 07:16:14PM -0700, Marc MERLIN wrote:
> > Can someone tell me how this is possible?
> > More generally, is it possible for the kernel to return an md error 
> > and then not log any underlying hardware error on the drives the md 
> > was being read from?
> 
> Is there something in mdadm --examine(-badblocks) /dev/sd*?

Well, well, I learned something new today. First I had to upgrade my mdadm
tools to get that option, and sure enough:
myth:~# mdadm --examine-badblocks /dev/sd[defgh]1
Bad-blocks on /dev/sdd1:
            14408704 for 352 sectors
            14409568 for 160 sectors
           132523032 for 512 sectors
           372496968 for 440 sectors
Bad-blocks list is empty in /dev/sde1
Bad-blocks on /dev/sdf1:
            14408704 for 352 sectors
            14409568 for 160 sectors
           132523032 for 512 sectors
           372496968 for 440 sectors
Bad-blocks list is empty in /dev/sdg1
Bad-blocks list is empty in /dev/sdh1
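
Comparing the per-device lists by eye works for five drives, but the same
check can be scripted. The sketch below parses output in the format shown
above and compares the lists of two members. It assumes the line formats
seen in this message ("Bad-blocks on DEV:", "START for LEN sectors",
"Bad-blocks list is empty in DEV"); other mdadm versions may print
something different, and the function name parse_badblocks is illustrative.

```python
import re


def parse_badblocks(text):
    """Parse `mdadm --examine-badblocks` output into {device: [(start, length)]}."""
    result = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        m = re.match(r"Bad-blocks on (\S+):$", line)
        if m:
            current = m.group(1)
            result[current] = []
            continue
        m = re.match(r"Bad-blocks list is empty in (\S+)$", line)
        if m:
            result[m.group(1)] = []
            current = None
            continue
        m = re.match(r"(\d+) for (\d+) sectors$", line)
        if m and current is not None:
            result[current].append((int(m.group(1)), int(m.group(2))))
    return result


# The output quoted in this message.
output = """\
Bad-blocks on /dev/sdd1:
            14408704 for 352 sectors
            14409568 for 160 sectors
           132523032 for 512 sectors
           372496968 for 440 sectors
Bad-blocks list is empty in /dev/sde1
Bad-blocks on /dev/sdf1:
            14408704 for 352 sectors
            14409568 for 160 sectors
           132523032 for 512 sectors
           372496968 for 440 sectors
Bad-blocks list is empty in /dev/sdg1
Bad-blocks list is empty in /dev/sdh1
"""

lists = parse_badblocks(output)
print(lists["/dev/sdd1"] == lists["/dev/sdf1"])
```

For the output above the two non-empty lists compare equal: both drives
carry the same four ranges, which is why neither can serve as a good copy
for the other at those offsets.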

So thank you for pointing me in the right direction.

I think they are due to the fact that it's an external disk array on a port
multiplier where sometimes I get bus errors that aren't actually on the
disks.

Questions:
1) Shouldn't my array have been invalidated if I have bad blocks on 2 drives
in the same place? Or is the only possible way for this to happen that it did
get invalidated, and I somehow force-rebuilt the array to bring it back up
without remembering that I did so?
(mmmh, but even so, rebuilding the spare should have cleared the bad blocks
on at least one drive, no?)

2) I'm currently running this, which I believe is the way to recover:
myth:~# echo 'check' > /sys/block/md5/md/sync_action 
but I'm not too hopeful on how that's going to work out if I have 2 drives with
supposed bad blocks at the same offsets.

Is there another way to just clear the bad block list on both drives if I've
already verified that those blocks are not bad and that they were due to some 
I/O errors that came from a bad cable connection?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply

* Re: Buffer I/O error on dev md5, logical block 7073536, async page read
From: Andreas Klauer @ 2016-10-30  9:33 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-raid
In-Reply-To: <20161030021614.asws67j34ji64qle@merlins.org>

On Sat, Oct 29, 2016 at 07:16:14PM -0700, Marc MERLIN wrote:
> Can someone tell me how this is possible?
> More generally, is it possible for the kernel to return an md error 
> and then not log any underlying hardware error on the drives the md 
> was being read from?

Is there something in mdadm --examine(-badblocks) /dev/sd*?

Regards
Andreas Klauer

^ permalink raw reply

* Buffer I/O error on dev md5, logical block 7073536, async page read
From: Marc MERLIN @ 2016-10-30  2:16 UTC (permalink / raw)
  To: linux-raid

Howdy,

I'm struggling with this problem.

I have this md5 array with 5 drives:
Personalities : [linear] [raid0] [raid1] [raid10] [multipath] [raid6] [raid5] [raid4] 
md5 : active raid5 sdg1[0] sdh1[6] sdf1[2] sde1[3] sdd1[5]
      15627542528 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
      bitmap: 0/30 pages [0KB], 65536KB chunk

I started having filesystem problems with it, so I did a scan with hdrecover on the drives first,
and that passed. Then I did it on the md5 array, and it failed.

With a simple dd, I get this:

25526374400 bytes (26 GB) copied, 249.888 s, 102 MB/s
dd: reading `/dev/md5': Input/output error
56588288+0 records in
56588288+0 records out
28973203456 bytes (29 GB) copied, 283.325 s, 102 MB/s
[1]+  Exit 1                  dd if=/dev/md5 of=/dev/null
kernel: [202693.708639] Buffer I/O error on dev md5, logical block 7073536, async page read
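
The numbers in the dd output and the kernel message can be cross-checked
with a little arithmetic. The sketch below assumes that the kernel's
"logical block" is counted in 4 KiB units and uses the geometry from the
mdstat output above (5 disks, RAID-5, 512k chunk). It also maps the failing
array sector onto a member device using the usual striping arithmetic. The
data offset of 262144 sectors used for that last step is an assumption, as
the thread never states it; `mdadm --examine` reports the real value. The
function name array_sector_to_member is illustrative.

```python
BLOCK_SIZE = 4096      # assumption: kernel "logical block" is in 4 KiB units
SECTOR_SIZE = 512      # dd's default record size and md's sector unit
CHUNK_SECTORS = 512 * 1024 // SECTOR_SIZE   # "512k chunk" from /proc/mdstat
RAID_DISKS = 5
DATA_DISKS = RAID_DISKS - 1                 # RAID-5: one parity chunk per stripe


def array_sector_to_member(sector, data_offset):
    """Map an array sector to (data chunk index in its stripe, member sector)."""
    chunk_number, chunk_offset = divmod(sector, CHUNK_SECTORS)
    stripe, data_index = divmod(chunk_number, DATA_DISKS)
    return data_index, stripe * CHUNK_SECTORS + chunk_offset + data_offset


failing_byte = 7073536 * BLOCK_SIZE
failing_sector = failing_byte // SECTOR_SIZE
print(failing_byte)     # compare with the byte count dd reported
print(failing_sector)   # compare with dd's "records in" count
# data_offset=262144 is an assumed value, not taken from the thread.
print(array_sector_to_member(failing_sector, data_offset=262144))
```

The byte offset comes out as 28973203456 and the sector as 56588288, which
are exactly the byte and record counts dd printed before it stopped. So
the read failed at the block named in the kernel message, not somewhere
before it. Under the assumed data offset, the member sector comes out as
14408704.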

Yes, I can read the entire disk devices without problem (took a long time
to run, but it finished)

Can someone tell me how this is possible?
More generally, is it possible for the kernel to return an md error and then not log
any underlying hardware error on the drives the md was being read from?

Kernel 4.6.0. I'll upgrade just in case, but md has been stable enough for so many years that I'm 
thinking the problem is likely elsewhere.

Any ideas?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901

^ permalink raw reply

* Re: [PATCH 23/60] block: introduce flag QUEUE_FLAG_NO_MP
From: Ming Lei @ 2016-10-29 22:20 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Linux Kernel Mailing List, linux-block,
	Linux FS Devel, Kirill A . Shutemov, Mike Christie,
	Hannes Reinecke, Dan Williams, Toshi Kani, Shaohua Li,
	open list:SOFTWARE RAID (Multiple Disks) SUPPORT
In-Reply-To: <20161029152933.GA17241@infradead.org>

On Sat, Oct 29, 2016 at 11:29 PM, Christoph Hellwig <hch@infradead.org> wrote:
> On Sat, Oct 29, 2016 at 04:08:22PM +0800, Ming Lei wrote:
>> MD (especially raid1 and raid10) is a bit difficult to support with
>> multipage bvecs, so introduce this flag for not enabling multipage
>> bvecs; MD can then still accept single-page bvecs only, and once
>> direct access to the bvec table in MD and other fs/drivers is cleaned up,
>> the flag can be removed. BTRFS has a similar issue too.
>
> There is really no good reason for that.  The RAID1 and 10 code really
> just needs some love to use the bio cloning infrastructure, bio
> iterators and generally recent bio apis.  btrfs just needs a tiny little
> bit of help and I'll send patches soon.

That is very nice of you to do this cleanup, cool!

I guess it will still take a bit of time, and I hope that won't block
the whole patchset, :-)

[linux-2.6-next]$ git grep -n -E "bi_io_vec|bi_vcnt" ./fs/btrfs/ | wc -l
45

[linux-2.6-next]$ git grep -n -E "bi_io_vec|bi_vcnt" ./drivers/md/ | grep raid | wc -l
54

>
> Having two different code paths is just asking for trouble in the long
> run.

Definitely, that flag is introduced just as a short-term solution.

Thanks,
Ming Lei

^ permalink raw reply

* Re: [PATCH 23/60] block: introduce flag QUEUE_FLAG_NO_MP
From: Christoph Hellwig @ 2016-10-29 15:29 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-kernel, linux-block, linux-fsdevel,
	Christoph Hellwig, Kirill A . Shutemov, Mike Christie,
	Hannes Reinecke, Dan Williams, Toshi Kani, shli, linux-raid
In-Reply-To: <1477728600-12938-24-git-send-email-tom.leiming@gmail.com>

On Sat, Oct 29, 2016 at 04:08:22PM +0800, Ming Lei wrote:
> MD (especially raid1 and raid10) is a bit difficult to support with
> multipage bvecs, so introduce this flag for not enabling multipage
> bvecs; MD can then still accept single-page bvecs only, and once
> direct access to the bvec table in MD and other fs/drivers is cleaned up,
> the flag can be removed. BTRFS has a similar issue too.

There is really no good reason for that.  The RAID1 and 10 code really
just needs some love to use the bio cloning infrastructure, bio
iterators and generally recent bio apis.  btrfs just needs a tiny little
bit of help and I'll send patches soon.

Having two different code paths is just asking for trouble in the long
run.

^ permalink raw reply

* Re: recovering failed raid5
From: Andreas Klauer @ 2016-10-29 12:02 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: linux-raid
In-Reply-To: <20161029152951.62add3ca@natsu>

On Sat, Oct 29, 2016 at 03:29:51PM +0500, Roman Mamedov wrote:
> And if there's an unreadable sector on rebuild as a drive found its 8th bad
> sector after 3 more years of perfect operation, that's not a problem either,
> because the setup they run in, is RAID6.

But if such disks are acceptable to run in a RAID, and you advertize it 
as such, you have to expect to see RAIDs where every single disk has a 
dozen reallocated sectors and a history of read errors to go with it.
Is that still fine? Do you expect to be lucky every time?

RAID-6 is not magic, either. Sooner or later, it will fail, too. 

Keep ignoring errors in RAID-5 and you'll see double failure.
Keep ignoring errors in RAID-6 long enough and you'll see triple failure.
All disks fail, and many of them do so silently, undetected if untested.

If you rented a server in a datacenter, thus entitled to working hardware, 
would you create a ticket on read failure or not?

Regards
Andreas Klauer

^ permalink raw reply

* Re: recovering failed raid5
From: Roman Mamedov @ 2016-10-29 10:29 UTC (permalink / raw)
  To: Andreas Klauer; +Cc: linux-raid
In-Reply-To: <20161028133304.GA11564@metamorpher.de>

On Fri, 28 Oct 2016 15:33:04 +0200
Andreas Klauer <Andreas.Klauer@metamorpher.de> wrote:

> On Fri, Oct 28, 2016 at 01:22:31PM +0100, Alexander Shenkin wrote:
> > One remaining question: is sdc definitely toast?
> 
> In my opinion a drive is toast starting from the very first reallocated/ 
> pending/uncorrectable sector, your drive has several of those and that's 
> only the ones the drive already knows about - there may be more.

I'd say you are overly cautious on this. Yes, there are drives for which one
reallocated sector is a sign of the coming avalanche of them, but then there
are also ones (e.g. my Hitachi 2TB) which work for years, over that period
develop 3-5-7 reallocated sectors, and THAT'S IT, they just continue to work.
And if there's an unreadable sector on rebuild as a drive found its 8th bad
sector after 3 more years of perfect operation, that's not a problem either,
because the setup they run in, is RAID6. (Not to compensate for this, but I
wouldn't be running an 8-10 drive RAID5 in any case).

-- 
With respect,
Roman

^ permalink raw reply

