public inbox for linux-btrfs@vger.kernel.org
From: Qu Wenruo <wqu@suse.com>
To: Stefan N <stefannnau@gmail.com>, Qu Wenruo <quwenruo.btrfs@gmx.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: scrub: unrepaired sectors detected
Date: Sat, 9 Dec 2023 15:55:45 +1030	[thread overview]
Message-ID: <c626e033-eaec-4392-9215-b57dcb165b4e@suse.com> (raw)
In-Reply-To: <CA+W5K0ouj3WN3en7q_1uGTzCqzuAiUJqQJot7zkRpH8M1-6vBA@mail.gmail.com>


On 2023/12/9 12:20, Stefan N wrote:
> Hi Qu,
> 
> Thanks for explaining that, and giving a path to remediation.
> 
> Could you please explain how you derived the bytenr range from the log
> message, as my attempt to reverse engineer the maths was not
> successful in the next reported error:
> 
> BTRFS error (device sdg): unrepaired sectors detected, full stripe
> 145932367691776 data stripe 2 errors 14-15

145932367691776 is the logical start of the full stripe; a full stripe 
looks like this:

	X             X+64K         X+128K                   X+64*N K
	|   Data 1    |   Data 2    |   ...   |    Data N    |

Data stripe 2 means it's the 3rd data stripe (numbering starts from data stripe 0).

So the 3rd data stripe covers the logical range
[full stripe + 2 * 64K, full stripe + 3 * 64K).

Furthermore, "errors" refers to the vertical stripes: since btrfs uses a 
fixed 64K stripe length and normally a 4K sector size, each data stripe 
has 16 sectors.

And since the value is for vertical stripes, it applies to all data 
stripes. But since the report is only for data stripe 2, we only need to 
add the sector offsets, giving:

  [Full stripe + 2 * 64K + 14 * 4K, Full stripe + 2 * 64K + 16 * 4K)
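As a sanity check, that arithmetic can be done in a few lines of shell (a sketch only; the 64K stripe length and 4K sector size are the btrfs defaults, adjust if your filesystem differs):

```shell
# Turn a scrub message like
#   "full stripe 145932367691776 data stripe 2 errors 14-15"
# into the affected logical byte range [start, end).
full_stripe=145932367691776
data_stripe=2
first_err=14
last_err=15

stripe_len=$(( 64 * 1024 ))   # fixed btrfs RAID56 stripe length
sector=4096                   # default sector size

start=$(( full_stripe + data_stripe * stripe_len + first_err * sector ))
end=$((   full_stripe + data_stripe * stripe_len + (last_err + 1) * sector ))
echo "[$start, $end)"
```

which prints `[145932367880192, 145932367888384)`, i.e. the two bad 4K sectors inside data stripe 2.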

Hopefully this helps you pin down all the affected data.
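With that range in hand, the per-4K resolve loop from my earlier mail could be scripted roughly like this (an untested sketch: it only prints the commands so you can review them first, uses the full "inspect-internal" command form, and assumes the filesystem is mounted at /mnt/point):

```shell
# Print one logical-resolve command per 4K sector in the affected range;
# drop the leading "echo" to actually resolve each bytenr to file paths.
start=145932367880192
end=145932367888384
for (( bytenr = start; bytenr < end; bytenr += 4096 )); do
    echo btrfs inspect-internal logical-resolve "$bytenr" /mnt/point
done
```

Any path the real command prints belongs to a file overlapping a bad sector; a bytenr that produces no output no longer maps to a file.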

BTW, considering btrfs scrub tries all possible RAID6 combinations, if 
we still have unrepairable data, it really means more than 2 corruptions.
Did you experience more than 2 power losses before hitting this problem?

Thanks,
Qu
> 
> Cheers,
> 
> Stefan
> 
> On Wed, 6 Dec 2023 at 06:35, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2023/12/5 18:21, Stefan N wrote:
>>> Hi all,
>>>
>>> I'm having trouble getting an array to perform a scrub or replace, and
>>> would appreciate any assistance. I have two empty disks I can use to
>>> move things around, but the intended outcome is to use them to replace
>>> two of the smaller disks.
>>>
>>> $ uname -a ; btrfs --version ; btrfs fi show
>>> Linux $hostname 6.5.0-13-generic #13-Ubuntu SMP PREEMPT_DYNAMIC Fri
>>> Nov  3 12:16:05 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
>>> btrfs-progs v6.3.2
>>> Label: none  uuid: 3cde0d85-f53e-4db6-ac2c-a0e6528c5ced
>>>           Total devices 8 FS bytes used 71.32TiB
>>>           devid    1 size 16.37TiB used 16.37TiB path /dev/sdg
>>>           devid    2 size 10.91TiB used 10.91TiB path /dev/sdf
>>>           devid    3 size 16.37TiB used 16.36TiB path /dev/sdd
>>>           devid    4 size 16.37TiB used 12.54TiB path /dev/sda
>>>           devid    5 size 10.91TiB used 10.91TiB path /dev/sde
>>>           devid    6 size 10.91TiB used 10.91TiB path /dev/sdc
>>>           devid    7 size 16.37TiB used 16.37TiB path /dev/sdh
>>>           devid    8 size 10.91TiB used 10.91TiB path /dev/sdb
>>>
>>> $ btrfs fi df /mnt/point/
>>> Data, RAID6: total=71.97TiB, used=71.23TiB
>>> System, RAID1C3: total=36.00MiB, used=6.62MiB
>>> Metadata, RAID1C3: total=91.00GiB, used=85.09GiB
>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>> $
>>>
>>> Attempting to scrub
>>> BTRFS error (device sdg): unrepaired sectors detected, full stripe
>>> 145926853230592 data stripe 2 errors 5-13
>>
>> This check was introduced in recent kernels to detect RAID56 full
>> stripes which contain sectors that can not be repaired.
>>
>> This is pretty new behavior, added as an extra safety net, since such a
>> scrub can itself further corrupt the P/Q stripes and cause unrepairable sectors.
>>
>> And I'm afraid that's already the case here.
>> Older RAID56 code (and even the newer one) still has the old write-hole
>> problem, thus a previous power loss can reduce the redundancy and
>> eventually lead to data corruption.
>>
>> Newer scrub code addresses this by detecting the problem and erroring
>> out, rather than further spreading the corruption.
>>> BTRFS info (device sdg): scrub: not finished on devid 2 with status: -5
>>>
>>> Scrub device /dev/sdf (id 2) canceled
>>> Scrub started:    Thu Nov 30 08:01:03 2023
>>> Status:           aborted
>>> Duration:         32:17:10
>>>           data_extents_scrubbed: 89766644
>>>           tree_extents_scrubbed: 0
>>>           data_bytes_scrubbed: 5856020676608
>>>           tree_bytes_scrubbed: 0
>>>           read_errors: 0
>>>           csum_errors: 0
>>>           verify_errors: 0
>>>           no_csum: 0
>>>           csum_discards: 0
>>>           super_errors: 0
>>>           malloc_errors: 0
>>>           uncorrectable_errors: 0
>>>           unverified_errors: 0
>>>           corrected_errors: 0
>>>           last_physical: 7984173809664
>>>
>>> Attempting to do replace using brand new disks, failed at ~50%, ran
>>> twice with two different pairs of disks
>>> Disk /dev/sdi: 16.37 TiB, 18000207937536 bytes, 35156656128 sectors
>>> Disk /dev/sdl: 16.37 TiB, 18000207937536 bytes, 35156656128 sectors
>>>
>>> BTRFS error (device sdg): unrepaired sectors detected, full stripe
>>> 145926853230592 data stripe 2 errors 5-13
>>> BTRFS error (device sdg): btrfs_scrub_dev(/dev/sdf, 2, /dev/sdl) failed -5
>>>
>>> The data is fairly replaceable, so I have typically been deleting
>>> files that fail checks and performing roughly 3-monthly scrubs and
>>> weekly balances (musage/dusage=50).
>>
>> This could be something that happened in the past but was only caught by the newer kernel.
>>
>> Anyway, if you're fine with deleting some files (only 9 sectors are
>> affected), you can try to locate the inodes for the following bytenr range:
>>
>>    [145926853382144, 145926853414912]
>>
>> The way to go is using "btrfs inspect-internal logical-resolve -o <bytenr> <mnt>".
>>
>> Then delete all the involved files, increase the bytenr by 4K, and try
>> again until there is no more output for any 4K block in the above range.
>>
>> Normally it should only be one or two files.
>>
>> Then retry scrub, re-do the loop until the scrub can finish properly.
>>
>> Thanks,
>> Qu
>>
>>>
>>> Any help would be appreciated!
>>>
>>> Cheers,
>>>
>>> Stefan
>>>
> 


Thread overview: 4 messages
2023-12-05  7:51 scrub: unrepaired sectors detected Stefan N
2023-12-05 20:05 ` Qu Wenruo
2023-12-09  1:50   ` Stefan N
2023-12-09  5:25     ` Qu Wenruo [this message]
