help channel for md raid?

public inbox for linux-raid@vger.kernel.org

* help channel for md raid?
From: Stephan Böttcher @ 2021-08-02 14:38 UTC
  To: linux-raid


Moin!

May I ask a question on this list, or is this strictly for development?

Thanks,
Stephan


My question is how to translate sector numbers from a RAID6 as in

> Aug  2 01:32:28 falbala kernel: md8: mismatch sector in range 1460408-1460416

to ext4 inode numbers, as in

> debugfs: icheck 730227 730355 730483 730611
> Block   Inode number
> 730227  30474245
> 730355  30474245
> 730483  30474245
> 730611  30474245

Is there a list, channel, matrix room for this kind of question?
Are there tools to do what I need?
Is the approach below sensible?

It is a RAID6 with six drives, one of them failed.
A check yielded 378 such mismatches.

I assume the sectors count from the start of the `Data Offset`.
`ext4 block numbers` count from the start of the partition?
Is that correct?

The failed drive has >3000 unreadable sectors and became very slow.

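#! /usr/bin/gawk -f

BEGIN {
	SS = 512		# sector size in bytes
	CS = 0x80000/SS		# chunk size in sectors (512 KiB -> 1024)
	BS = 4096/SS		# ext4 block size in sectors (4 KiB -> 8)
	N  = 4			# data chunks per stripe (6 drives minus P and Q)
}
/mismatch sector in range/ {
	split($11,A,/-/)	# $11 is the "first-last" range in the kernel message
	S = A[1]		# first sector of the reported range
	M = S%CS		# sector offset within the chunk
	C = S/CS		# chunk number; not truncated here (see the int() remark in the follow-up mail)
	B = (C*N*CS + M)/BS	# first candidate ext4 block
	printf "icheck"
	for (i=0; i<N; i++) printf " %u", B + i*CS/BS
	printf "\n"
}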


/dev/sda3:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : d58e68a2:7d9e9218:1bb61653:c295ae02
           Name : falbala:8  (local to host falbala)
  Creation Time : Wed Aug 20 12:23:51 2014
     Raid Level : raid6
   Raid Devices : 6

 Avail Dev Size : 5440837632 (2594.39 GiB 2785.71 GB)
     Array Size : 10881675264 (10377.57 GiB 11142.84 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262064 sectors, after=0 sectors
          State : clean
    Device UUID : 16ee808f:9cf5420e:270cc442:db7768b6

Internal Bitmap : 8 sectors from superblock
    Update Time : Mon Aug  2 09:27:57 2021
  Bad Block Log : 512 entries available at offset 24 sectors
       Checksum : 2b557999 - correct
         Events : 405815

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 0
   Array State : AAA.AA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdb3:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : d58e68a2:7d9e9218:1bb61653:c295ae02
           Name : falbala:8  (local to host falbala)
  Creation Time : Wed Aug 20 12:23:51 2014
     Raid Level : raid6
   Raid Devices : 6

 Avail Dev Size : 5440837632 (2594.39 GiB 2785.71 GB)
     Array Size : 10881675264 (10377.57 GiB 11142.84 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262056 sectors, after=0 sectors
          State : clean
    Device UUID : 28c5ded0:c005aaf2:167d3af3:8714b48c

Internal Bitmap : 8 sectors from superblock
    Update Time : Mon Aug  2 09:27:57 2021
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : d7dc6e93 - correct
         Events : 405815

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 1
   Array State : AAA.AA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdc3:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : d58e68a2:7d9e9218:1bb61653:c295ae02
           Name : falbala:8  (local to host falbala)
  Creation Time : Wed Aug 20 12:23:51 2014
     Raid Level : raid6
   Raid Devices : 6

 Avail Dev Size : 5440837632 (2594.39 GiB 2785.71 GB)
     Array Size : 10881675264 (10377.57 GiB 11142.84 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262056 sectors, after=0 sectors
          State : clean
    Device UUID : 1e9be3ef:f6f44ef4:cd4c39d4:002b515c

Internal Bitmap : 8 sectors from superblock
    Update Time : Mon Aug  2 09:27:57 2021
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : a92219f4 - correct
         Events : 405815

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 5
   Array State : AAA.AA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdd3:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : d58e68a2:7d9e9218:1bb61653:c295ae02
           Name : falbala:8  (local to host falbala)
  Creation Time : Wed Aug 20 12:23:51 2014
     Raid Level : raid6
   Raid Devices : 6

 Avail Dev Size : 5440837632 (2594.39 GiB 2785.71 GB)
     Array Size : 10881675264 (10377.57 GiB 11142.84 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262056 sectors, after=0 sectors
          State : active
    Device UUID : e85da470:24a385b8:ae66eacd:d822539d

Internal Bitmap : 8 sectors from superblock
    Update Time : Mon Aug  2 01:24:06 2021
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : 28d023be - correct
         Events : 400427

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 3
   Array State : AAAAAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sde3:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : d58e68a2:7d9e9218:1bb61653:c295ae02
           Name : falbala:8  (local to host falbala)
  Creation Time : Wed Aug 20 12:23:51 2014
     Raid Level : raid6
   Raid Devices : 6

 Avail Dev Size : 5440837632 (2594.39 GiB 2785.71 GB)
     Array Size : 10881675264 (10377.57 GiB 11142.84 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262056 sectors, after=0 sectors
          State : clean
    Device UUID : 48223f81:5c9a0ddd:ff0d1782:57941c27

Internal Bitmap : 8 sectors from superblock
    Update Time : Mon Aug  2 09:27:57 2021
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : 9be57126 - correct
         Events : 405815

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 4
   Array State : AAA.AA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdf3:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : d58e68a2:7d9e9218:1bb61653:c295ae02
           Name : falbala:8  (local to host falbala)
  Creation Time : Wed Aug 20 12:23:51 2014
     Raid Level : raid6
   Raid Devices : 6

 Avail Dev Size : 5440837632 (2594.39 GiB 2785.71 GB)
     Array Size : 10881675264 (10377.57 GiB 11142.84 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262056 sectors, after=0 sectors
          State : clean
    Device UUID : 881ee37d:810e582a:bc0d4841:27916f6b

Internal Bitmap : 8 sectors from superblock
    Update Time : Mon Aug  2 09:27:57 2021
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : e957ddfe - correct
         Events : 405815

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 2
   Array State : AAA.AA ('A' == active, '.' == missing, 'R' == replacing)


-- 
Stephan

* Re: help channel for md raid?
From: Wols Lists @ 2021-08-02 18:29 UTC
  To: Stephan Böttcher, linux-raid

On 02/08/21 15:38, Stephan Böttcher wrote:
> 
> Moin!
> 
> May I ask a question on this list, or is this strictly for development?

Ask away!
> 
> Thanks,
> Stephan
> 
> 
> My question is how to translate sector numbers from a RAID6 as in
> 
>> Aug  2 01:32:28 falbala kernel: md8: mismatch sector in range 1460408-1460416
> 
> to ext4 inode numbers, as in
> 
>> debugfs: icheck 730227 730355 730483 730611
>> Block   Inode number
>> 730227  30474245
>> 730355  30474245
>> 730483  30474245
>> 730611  30474245
> 
> Is there a list, channel, matrix room for this kind of question?
> Are there tools to do what I need?
> Is the approach below sensible?
> 
> It is a RAID6 with six drives, one of them failed.
> A check yielded 378 such mismatches.
> 
> I assume the sectors count from the start of the `Data Offset`.
> `ext4 block numbers` count from the start of the partition?
> Is that correct?

md8 is your array. This is the block device presented to your file
system so you're feeding it 730227, 730355, 730483, 730611, these are
the sector numbers of md8, and they will be *linear* within it.

So if you're trying to map filesystem sectors to md8 sectors, they are
the same thing.

Only if you're trying to map filesystem sectors to the hard drives
underlying the raid do you need to worry about the 'Data Offset'. (And
this varies on a per-drive basis!)
> 
> The failed drive has >3000 unreadable sectors and became very slow.

So you've removed the drive? Have you replaced it? If you have a drive
fail again, always try and replace it using the --replace option, I
think it's too late for that now ...

But as for finding out which files may have been corrupted, you want to
use the tools that come with the filesystem, and ask it which files use
sectors 1460408-1460416. Don't worry about the underlying raid.
Hopefully those tools will come back and say those sectors are unused.
If they ARE used, the chances are it's just the parity which is corrupt.
Otherwise you're looking at a backup ...

So I think what you need to do is (1) find out which files use those
sectors, (2) replace that missing drive asap, and (3) check the
integrity of the file system with fsck. Then (4) do a repair scrub.

(I gather such errors are reasonably common, and do not signify a
problem. On a live filesystem they could well be a collision between a
file write and the check ...)

Cheers,
Wol

* Re: help channel for md raid?
From: Stephan Böttcher @ 2021-08-02 22:20 UTC
  To: Wols Lists, linux-raid

Wol writes:

> Ask away!

Thanks.

>> My question is how to translate sector numbers from a RAID6 as in
>> 
>>> Aug  2 01:32:28 falbala kernel: md8: mismatch sector in range 1460408-1460416
>> 
>> to ext4 inode numbers, as in
>> 
>>> debugfs: icheck 730227 730355 730483 730611
>>> Block   Inode number
>>> 730227  30474245
>>> 730355  30474245
>>> 730483  30474245
>>> 730611  30474245
>> 
>> It is a RAID6 with six drives, one of them failed.
>> A check yielded 378 such mismatches.
>> 
>> I assume the sectors count from the start of the `Data Offset`.
>> `ext4 block numbers` count from the start of the partition?
>> Is that correct?
>
> md8 is your array. 

Yes, and it issued a few hundred messages of sector numbers with
mismatches in the parity, like: sectors 1460408-1460416.

Those sectors seem to be in units of 512 bytes, per drive.

> This is the block device presented to your file system so you're
> feeding it 730227, 730355, 730483, 730611, 

These are block numbers I calculated that may match those sector
numbers, using the awk script in my first mail.

The ext4 block size is 4kByte.

The chunk size is 512kByte = 1024 sectors = 128 blocks.  Is that per
drive or total?  I assume per drive.

The sector numbers are per drive.  Sector 1460408 is in chunk
1460408/1024 = 1426, sector offset  1460408 % 1024 = 184.

My RAID6 with six drives has a payload of 4 × 512kByte per stripe
("chunk slice").  That is 512 blocks per stripe.  The first block of the
affected stripe should be 1426 × 512 = 730112.

The sector offset 184 into the chunk is 184/8 = 23 blocks in, i.e., block
730112 + 23 = 730135.  And the corresponding sectors in the other drives
are multiples of the chunk size ahead, so I need to look for blocks
730112 + 23 + i×128, i=0…3.

So when I ask debugfs which inode uses the block 730135, I should get
(one of) the file(s) that is affected by the mismatch.

(My awk script was missing an `int()` call, that's why these numbers are
different from those posted before.)
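
A minimal sketch of the arithmetic above, with the integer division made explicit. It assumes the reported sector is a per-device offset counted from 'Data Offset', that the ext4 filesystem starts at sector 0 of /dev/md8, and that the geometry is the one quoted in this thread (512 KiB chunks, 4 data chunks per stripe, 4 KiB blocks). The function name `candidate_blocks` is made up for illustration.

```python
SECTOR_SIZE = 512
CHUNK_SECTORS = 512 * 1024 // SECTOR_SIZE   # 1024 sectors per chunk
BLOCK_SECTORS = 4096 // SECTOR_SIZE         # 8 sectors per ext4 block
DATA_CHUNKS = 4                             # 6 drives minus P and Q


def candidate_blocks(dev_sector):
    """Map a per-device sector to the ext4 blocks sharing its stripe row."""
    # Integer division: which chunk on the device, and how far into it.
    chunk, offset = divmod(dev_sector, CHUNK_SECTORS)
    # First array sector of that stripe, plus the offset, in 4 KiB blocks.
    first = (chunk * DATA_CHUNKS * CHUNK_SECTORS + offset) // BLOCK_SECTORS
    # The same offset in each of the other data chunks is one chunk further on.
    step = CHUNK_SECTORS // BLOCK_SECTORS
    return [first + i * step for i in range(DATA_CHUNKS)]


print("icheck", *candidate_blocks(1460408))
# prints: icheck 730135 730263 730391 730519
```

The sketch does not say which of the four blocks sits on which drive; that depends on the layout and is not needed to find the affected inodes.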

Does any of this make sense?

> these are the sector numbers of md8, and they will be *linear* within
> it.

I believe these sector numbers are per drive.  The parity is calculated
for all sectors at the same offset to 'Data Offset'. I found something
to that effect when searching for the term "mismatch sector in range".
The numbers I got agree with the size of my drives.

I did not find much documentation on how exactly the layout of such a
RAID6 works.  From what I did find, I made the conclusions above.  Maybe
I should look into the kernel source.  (Well, I did, grepping for
"mismatch sector in range").

What does "Layout : left-symmetric" mean?  I guess it does not matter
here, unless I need to map those four block numbers to drives.

> So if you're trying to map filesystem sectors to md8 sectors, they are
> the same thing.
>
> Only if you're trying to map filesystem sectors to the hard drives
> underlying the raid do you need to worry about the 'Data Offset'. (And
> this varies on a per-drive basis!)
>> 
>> The failed drive has >3000 unreadable sectors and became very slow.
>
> So you've removed the drive? Have you replaced it? If you have a drive
> fail again, always try and replace it using the --replace option, I
> think it's too late for that now ...

I'd need one more SATA port for that. I could put the slow drive on
a USB. I'll do that next time.  Most drive failures I had in the past
were total losses.

I issued mdadm --fail on it.  The drive was 100× slower than normal.
Then I did a parity check, which gave all those mismatch errors.  A fsck
went through without errors.

Replacement drives are in the drawer. It will go in tomorrow.

When I now do a resync with the new drive, how will those mismatches be
resolved?

I was thinking of pulling each of the remaining drives in turn, --assemble
--readonly, and see how the affected files are different.  If all failures
are on a single drive, I can --fail that drive too before the resync.

Will a -A --readonly assembled RAID be truly read-only?  When I plug a
pulled drive back in, will the RAID accept it?

The --fail-ed drive is still readable, slowly.  Unfortunately it is now
out of date.

> But as for finding out which files may have been corrupted, you want to
> use the tools that come with the filesystem, and ask it which files use
> sectors 1460408-1460416. Don't worry about the underlying raid.
> Hopefully those tools will come back and say those sectors are unused.

If it were that easy …

> If they ARE used, the chances are it's just the parity which is corrupt.
> Otherwise you're looking at a backup ...

No backup.  The computer center needs to back up several multi TByte
filesystems for our group.  This is one of the filesystems where
everybody is told not to put irreplaceable data.  As if anybody will
listen to what I say.

> (I gather such errors are reasonably common, and do not signify a
> problem. On a live filesystem they could well be a collision between a
> file write and the check ...)

So you say it may all be errors in the parity only?  This helps for 5/6th
of the data, not the chunks where the actual data was pulled.  

Gruß,
-- 
Stephan

* Re: help channel for md raid?
From: antlists @ 2021-08-03 19:44 UTC
  To: Stephan Böttcher, linux-raid; +Cc: NeilBrown

On 02/08/2021 23:20, Stephan Böttcher wrote:
> 
> Wol writes:
> 
>> Ask away!
> 
> Thanks.
> 
>>> My question is how to translate sector numbers from a RAID6 as in
>>>
>>>> Aug  2 01:32:28 falbala kernel: md8: mismatch sector in range 1460408-1460416
>>>
>>> to ext4 inode numbers, as in
>>>
>>>> debugfs: icheck 730227 730355 730483 730611
>>>> Block   Inode number
>>>> 730227  30474245
>>>> 730355  30474245
>>>> 730483  30474245
>>>> 730611  30474245
>>>
>>> It is a RAID6 with six drives, one of them failed.
>>> A check yielded 378 such mismatches.
>>>
>>> I assume the sectors count from the start of the `Data Offset`.
>>> `ext4 block numbers` count from the start of the partition?
>>> Is that correct?
>>
>> md8 is your array.
> 
> Yes, and it issued a few hundred messages of sector numbers with
> mismatches in the parity, like: sectors 1460408-1460416.

Bear in mind I'm now getting out of my depth, I just edit the wiki, I 
don't know the deep internals :-), but something doesn't feel right 
here. 408-416 is *nine* sectors. I'd expect it to be a multiple of 2, or 
related to the number of data drives (in your case 4). So 9 is well weird.
> 
> Those sectors seem to be in units of 512 bytes, per drive.
> 
>> This is the block device presented to your file system so you're
>> feeding it 730227, 730355, 730483, 730611,
> 
> These are block numbers I calculated that may match those sector
> numbers, using the awk script in my first mail.
> 
> The ext4 block size is 4kByte.
> 
> The chunk size is 512kByte = 1024 sectors = 128 blocks.  Is that per
> drive or total?  I assume per drive.

I would think so. You have 6 chunks per stripe, d1, d2, d3, d4, P and Q,
so each stripe holds 2MB of data (3MB including parity).
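
A small sanity check of that geometry against the `mdadm --examine` output earlier in the thread. It assumes 'Avail Dev Size' is given in 512-byte sectors and 'Array Size' in KiB, which is consistent with the GiB/GB figures printed next to them; the variable names are illustrative only.

```python
SECTOR = 512
KIB = 1024

raid_devices = 6
data_chunks = raid_devices - 2          # RAID6: one P and one Q chunk per stripe
chunk_bytes = 512 * KIB                 # "Chunk Size : 512K"
avail_dev_sectors = 5440837632          # "Avail Dev Size", assumed to be sectors
array_size_kib = 10881675264            # "Array Size", assumed to be KiB

stripe_data_bytes = data_chunks * chunk_bytes
stripe_total_bytes = raid_devices * chunk_bytes
array_bytes_from_devices = data_chunks * avail_dev_sectors * SECTOR

print(stripe_data_bytes // KIB, "KiB of data per stripe")
print(stripe_total_bytes // KIB, "KiB per stripe including P and Q")
print("array size matches 4 x device size:",
      array_bytes_from_devices == array_size_kib * KIB)
```

If the last line prints True, the usable array size is four times the per-device size, which supports the "chunk size is per drive, four data chunks per stripe" reading.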
> 
> The sector numbers are per drive.  Sector 1460408 is in chunk
> 1460408/1024 = 1426, sector offset  1460408 % 1024 = 184.

??? If it's scanning md8, then the sector numbers will be relative to 
md8. Stuff working at this level is merely scanning a block device - it 
has no clue whether it's a drive - it may well not be!
> 
> My RAID6 with six drives has a payload of 4 × 512kByte per stripe
> ("chunk slice").  That is 512 blocks per stripe.  The first block of the
> affected stripe should be 1426 × 512 = 730112.
> 
> The sector offset 184 into the chunk is 184/8 = 23 blocks in, i.e., block
> 730112 + 23 = 730135.  And the corresponding sectors in the other drives
> are multiples of the chunk size ahead, so I need to look for blocks
> 730112 + 23 + i×128, i=0…3.
> 
> So when I ask debugfs which inode uses the block 730135, I should get
> (one of) the file(s) that is affected by the mismatch.

If this is a DISK block, then debugfs will be getting VERY confused, 
because it will be looking for the filesystem block 730135, which will be 
somewhere else entirely.
> 
> (My awk script was missing an `int()` call, that's why these numbers are
> different from those posted before.)
> 
> Does any of this make sense?

Not really, primarily because (a) I don't know what you're trying to do, 
and (b) because if I do understand correctly what you're trying to do, 
you're doing it all wrong.
> 
>> these are the sector numbers of md8, and they will be *linear* within
>> it.
> 
> I believe these sector numbers are per drive.  The parity is calculated
> for all sectors at the same offset to 'Data Offset'. I found something
> to that effect when searching for the term "mismatch sector in range".
> The numbers I got agree with the size of my drives.
> 
> I did not find much documentation on how exactly the layout of such a
> RAID6 works.  From what I did find, I made the conclusions above.  Maybe
> I should look into the kernel source.  (Well, I did, grepping for
> "mismatch sector in range").

As I said above, you'll have 2MB per slice, 4 data and 2 parity chunks. 
The parity blocks move around, so if you want to recover the data, 
you're also going to have to work out which disk your data chunk is on, 
because it's not a simple 1,5,9 or 1,7,13 ...

And while, with a brand new array, the data offset will be the same 
across all drives, any management activity is likely to move the data 
around, not necessarily the same across all drives.
> 
> What does "Layout : left-symmetric" mean?  I guess it does not matter
> here, unless I need to map those four block numbers to drives.

Yup. A quick google tells me it's the algorithm used to place the parity 
chunks.
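
A rough sketch of what left-symmetric placement looks like for RAID6. The formula is taken from a reading of `raid5_compute_sector()` in drivers/md/raid5.c and should be treated as an assumption to verify against the source of the running kernel. The indices are raid device roles (the 'Device Role' line in `mdadm --examine`), not sdX names, and the function name `left_symmetric_raid6` is made up here.

```python
def left_symmetric_raid6(stripe, raid_disks=6):
    """Return (P role, Q role, [roles of data chunks 0..n-1]) for one stripe."""
    # P starts on the last device and moves one device to the left per stripe.
    pd = raid_disks - 1 - (stripe % raid_disks)
    # Q follows P, wrapping around to device 0.
    qd = (pd + 1) % raid_disks
    # Data chunks follow Q, also wrapping around.
    data = [(pd + 2 + d) % raid_disks for d in range(raid_disks - 2)]
    return pd, qd, data


for stripe in range(6):
    pd, qd, data = left_symmetric_raid6(stripe)
    print(f"stripe {stripe}: P on role {pd}, Q on role {qd}, data on roles {data}")
```

Under this assumption every device carries P in one stripe out of six and Q in another, so a chunk at a given per-device offset is data on four of the six devices.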
> 
>> So if you're trying to map filesystem sectors to md8 sectors, they are
>> the same thing.
>>
>> Only if you're trying to map filesystem sectors to the hard drives
>> underlying the raid do you need to worry about the 'Data Offset'. (And
>> this varies on a per-drive basis!)
>>>
>>> The failed drive has >3000 unreadable sectors and became very slow.
>>
>> So you've removed the drive? Have you replaced it? If you have a drive
>> fail again, always try and replace it using the --replace option, I
>> think it's too late for that now ...
> 
> I'd need one more SATA port for that. I could put the slow drive on
> a USB. I'll do that next time.  Most drive failures I had in the past
> were total losses.

If you've got room, put an expansion card with an eSATA port on it. Or 
yes, you could physically replace the drive, put the old one back on a 
USB port, and do an mdadm --replace. DON'T do it the other way round, 
putting the new one on a USB for the mdadm replace before swapping it in 
- experience says this is a bad move ...
> 
> I issued mdadm --fail on it.  The drive was 100× slower than normal.
> Then I did a parity check, which gave all those mismatch errors.  A fsck
> went through without errors.
> 
> Replacement drives are in the drawer. It will go in tomorrow.
> 
> When I now do a resync with the new drive, how will those mismatches be
> resolved?

I'm not sure. Neil? I'm guessing that if all four data chunks have 
survived, parity will be recalculated, and everything is hunky-dory. But 
every third block is a parity. So you'll lose one block in three, 
because raid-6 can recover two missing pieces of information, but here 
we have three - a missing block (the failed drive), a corrupt block, and 
we don't know which block is corrupt.
> 
> I was thinking of pulling each of the remaining drives in turn, --assemble
> --readonly, and see how the affected files are different.  If all failures
> are on a single drive, I can --fail that drive too before the resync.

That's a lot of work that will almost certainly be a waste of time. The 
likely cause is parity being messed up on write, so that will be spread 
across drives.
> 
> Will a -A --readonly assembled RAID be truly read-only?  When I plug a
> pulled drive back in, will the RAID accept it?
> 
Well, if you don't mount the partition, or you mount that read-only, 
nothing is going to write to it.

> The --fail-ed drive is still readable, slowly.  Unfortunately it is now
> out of date.

Okay. Can you mount the partition read-only and stop it getting even 
more out-of-date?

Copy the dud drive with ddrescue, and then put it into the array with 
--re-add. Hopefully it will do a re-add, not an add. And then you'll 
have recovered pretty much everything you can.
> 
>> But as for finding out which files may have been corrupted, you want to
>> use the tools that come with the filesystem, and ask it which files use
>> sectors 1460408-1460416. Don't worry about the underlying raid.
>> Hopefully those tools will come back and say those sectors are unused.
> 
> If it were that easy …
> 
>> If they ARE used, the chances are it's just the parity which is corrupt.
>> Otherwise you're looking at a backup ...
> 
> No backup.  The computer center needs to back up several multi TByte
> filesystems for our group.  This is one of the filesystems where
> everybody is told not to put irreplaceable data.  As if anybody will
> listen to what I say.

Well, unfortunately, they're going to learn the hard way ...
> 
>> (I gather such errors are reasonably common, and do not signify a
>> problem. On a live filesystem they could well be a collision between a
>> file write and the check ...)
> 
> So you say it may all be errors in the parity only?  This helps for 5/6th
> of the data, not the chunks where the actual data was pulled.
> 
Auf wiederhoren,
Wol

* Re: help channel for md raid?
From: Stephan Böttcher @ 2021-08-04  9:40 UTC
  To: antlists; +Cc: linux-raid, NeilBrown


antlists <antlists@youngman.org.uk> writes:

> On 02/08/2021 23:20, Stephan Böttcher wrote:
>> Wol writes:
>> 
>> Yes, and it issued a few hundred messages of sector numbers with
>> mismatches in the parity, like: sectors 1460408-1460416.
>
> Bear in mind I'm now getting out of my depth, I just edit the wiki, I
> don't know the deep internals :-), but something doesn't feel right 
> here. 408-416 is *nine* sectors. I'd expect it to be a multiple of 2,
> or related to the number of data drives (in your case 4). So 9 is well
> weird.

from drivers/md/raid5.c:

       atomic64_add(STRIPE_SECTORS, &conf->mddev->resync_mismatches);
       if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery)) {
               /* don't try to repair!! */
               set_bit(STRIPE_INSYNC, &sh->state);
               pr_warn_ratelimited("%s: mismatch sector in range "
                                   "%llu-%llu\n", mdname(conf->mddev),
                                   (unsigned long long) sh->sector,
                                   (unsigned long long) sh->sector +
                                   STRIPE_SECTORS);

The check reads STRIPE_SECTORS=8 sectors from each drive at drive
location sh->sector=…408 and checks the parity.  Since one drive is
missing, it cannot find out which drive the error is on, so it issues
this warning, reporting the half-open range of sectors […408, …416).
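
To make the "eight, not nine" point concrete, here is a small sketch that parses such a log line and treats the two numbers as a half-open range, as the `sh->sector + STRIPE_SECTORS` upper bound in the quoted code suggests. The function name `mismatch_range` is hypothetical.

```python
import re

STRIPE_SECTORS = 8  # value quoted above: one 4 KiB unit per device


def mismatch_range(line):
    """Return (first, end, count) for a 'mismatch sector in range' message."""
    m = re.search(r"mismatch sector in range (\d+)-(\d+)", line)
    if m is None:
        return None
    first, end = int(m.group(1)), int(m.group(2))
    # The second number is first + STRIPE_SECTORS, i.e. one past the last sector.
    return first, end, end - first


line = ("Aug  2 01:32:28 falbala kernel: md8: "
        "mismatch sector in range 1460408-1460416")
print(mismatch_range(line))
# prints: (1460408, 1460416, 8)
```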

>> These are block numbers I calculated that may match those sector
>> numbers, using the awk script in my first mail.
>> The ext4 block size is 4kByte.
>> The chunk size is 512kByte = 1024 sectors = 128 blocks.  Is that per
>> drive or total?  I assume per drive.
>
> I would think so. You have 6 chunks per stripe, d1, d2, d3, d4, P and
> Q, so each stripe holds 2MB of data (3MB including parity).

>> The sector numbers are per drive.  Sector 1460408 is in chunk
>> 1460408/1024 = 1426, sector offset  1460408 % 1024 = 184.
>
> ??? If it's scanning md8, then the sector numbers will be relative to
> md8.

why?

> Stuff working at this level is merely scanning a block device - it has
> no clue whether it's a drive - it may well not be!

?

>> My RAID6 with six drives has a payload of 4 × 512kByte per stripe
>> ("chunk slice").  That is 512 blocks per stripe.  The first block of the
>> affected stripe should be 1426 × 512 = 730112.
>> The sector offset 184 into the chunk is 184/8 = 23 blocks in, i.e., block
>> 730112 + 23 = 730135.  And the corresponding sectors in the other drives
>> are multiples of the chunk size ahead, so I need to look for blocks
>> 730112 + 23 + i×128, i=0…3.
>> So when I ask debugfs which inode uses the block 730135, I should get
>> (one of) the file(s) that is affected by the mismatch.
>
> If this is a DISK block, then debugfs will be getting VERY confused,
> because it will be looking for the filesystem block 730135, which will
> be somewhere else entirely.

- I took the sector number reported by the raid driver, and calculated
  which block numbers on /dev/md8 those correspond to.

    sector 1460408 -> block 730135

- I asked debugfs (e2fsprogs) which inodes use those blocks.

- I asked debugfs which file paths point to those inodes.

- I asked debugfs to dump those files to a filesystem on another host.

The `debugfs dump` I repeated five times, once with each remaining drive
omitted from the --readonly assembly with four drives.  
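
The steps above could be scripted roughly as follows. This sketch only builds and prints the `debugfs -R` command lines (the `icheck`, `ncheck` and `dump` requests) and runs nothing. The inode number is the one from the first mail and serves purely as a placeholder, and the output path under /tmp is hypothetical.

```python
import shlex

DEVICE = "/dev/md8"


def debugfs_cmd(request, device=DEVICE):
    """Build an argv list that runs a single debugfs request on the device."""
    return ["debugfs", "-R", request, device]


blocks = [730135, 730263, 730391, 730519]   # candidate blocks for one mismatch
inode = 30474245                            # placeholder, taken from the first mail

steps = [
    # 1. Which inodes own these blocks?
    debugfs_cmd("icheck " + " ".join(str(b) for b in blocks)),
    # 2. Which path names point to that inode?
    debugfs_cmd(f"ncheck {inode}"),
    # 3. Dump the file contents by inode number (hypothetical output path).
    debugfs_cmd(f"dump <{inode}> /tmp/inode-{inode}.dump"),
]

for argv in steps:
    print(shlex.join(argv))
```

Printing instead of executing keeps the sketch safe to try; the real inode numbers have to come from the `icheck` output of the array in question.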

A few files were always the same. Did I make a mistake in my
calculations, did I miss some offset?

Some files were different for all six dumps.  Probably because of errors
in multiple chunks.

Most files have dumps that agree for some subset. When that happened,
the "full" five drive dump was always among the set.
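
One way to find the agreeing subsets is to group the dumps of a file by content hash. The sketch below does that on in-memory byte strings; the labels and contents are made-up example data, and `group_identical` is a hypothetical helper name.

```python
import hashlib
from collections import defaultdict


def group_identical(dumps):
    """Group dump labels by SHA-256 of their contents, largest group first."""
    groups = defaultdict(list)
    for label, data in dumps.items():
        groups[hashlib.sha256(data).hexdigest()].append(label)
    return sorted(groups.values(), key=len, reverse=True)


# Made-up example: six dumps of one file, two of which differ from the rest.
example = {
    "full":   b"AAAA",
    "no-0":   b"AAAA",
    "no-1":   b"AAAB",
    "no-2":   b"AAAA",
    "no-4":   b"AAAA",
    "no-5":   b"AABA",
}

for group in group_identical(example):
    print(len(group), group)
```

With real dumps the byte strings would be read from the six files per inode; a large group that contains the "full" dump matches the pattern described above.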

I put in a new drive, the array is now complete with six drives.  I will
have a look how the files compare now, and for those few that may be
valuable I may be able to fix something picking from the six dumps I made.

Gruß,
-- 
Stephan
