RAID6, errors at missing device replacement

linux-btrfs.vger.kernel.org archive mirror

* RAID6, errors at missing device replacement
@ 2016-04-15 19:49 Yauhen Kharuzhy
  2016-04-15 23:00 ` Henk Slager
  2016-04-16  7:37 ` Duncan
  0 siblings, 2 replies; 7+ messages in thread
From: Yauhen Kharuzhy @ 2016-04-15 19:49 UTC (permalink / raw)
  To: linux-btrfs

Hi.

I have discovered a case where replacing missing devices causes
metadata corruption. Does anybody know anything about this?

I use a 4.4.5 kernel with the latest global spare patches.

On a RAID6 filesystem (this may be reproducible on RAID5 too), if I
replace one missing drive with another and then remove a second drive
and replace it as well, plenty of errors appear in the log:

[  748.641766] BTRFS error (device sdf): failed to rebuild valid logical 7366459392 for dev /dev/sde
[  748.678069] BTRFS error (device sdf): failed to rebuild valid logical 7381139456 for dev /dev/sde
[  748.693559] BTRFS error (device sdf): failed to rebuild valid logical 7290974208 for dev /dev/sde
[  752.039100] BTRFS error (device sdf): bad tree block start 13048831955636601734 6919258112
[  752.647869] BTRFS error (device sdf): bad tree block start 12819300352 6919290880
[  752.658520] BTRFS error (device sdf): bad tree block start 31618367488 6919290880
[  752.712633] BTRFS error (device sdf): bad tree block start 31618367488 6919290880

After the device replacement finishes, scrub reports uncorrectable errors.
btrfs check complains about errors too:
root@test:~/# btrfs check -p /dev/sdc
Checking filesystem on /dev/sdc
UUID: 833fef31-5536-411c-8f58-53b527569fa5
checksum verify failed on 9359163392 found E4E3BDB6 wanted 00000000
checksum verify failed on 9359163392 found E4E3BDB6 wanted 00000000
checksum verify failed on 9359163392 found 4D1F4197 wanted DE0E50EC
bytenr mismatch, want=9359163392, have=9359228928

Errors found in extent allocation tree or chunk allocation
checking free space cache [.]
checking fs roots [.]
checking csums
checking root refs
found 1049788420 bytes used err is 0
total csum bytes: 1024000
total tree bytes: 1179648
total fs tree bytes: 16384
total extent tree bytes: 16384
btree space waste bytes: 124962
file data blocks allocated: 1049755648
 referenced 1049755648

After the first replacement, metadata does not seem to be spread across
all devices (devid 4, /dev/sdg, has no System,RAID6 allocation):
Label: none  uuid: 3db39446-6810-47bf-8732-d5a8793500f3
        Total devices 4 FS bytes used 1002.00MiB
        devid    1 size 8.00GiB used 1.28GiB path /dev/sdc
        devid    2 size 8.00GiB used 1.28GiB path /dev/sdd
        devid    3 size 8.00GiB used 1.28GiB path /dev/sdf
        devid    4 size 8.00GiB used 1.25GiB path /dev/sdg

# btrfs device usage /mnt/
/dev/sdc, ID: 1
   Device size:             8.00GiB
   Data,RAID6:              1.00GiB
   Metadata,RAID6:        256.00MiB
   System,RAID6:           32.00MiB
   Unallocated:             6.72GiB

/dev/sdd, ID: 2
   Device size:             8.00GiB
   Data,RAID6:              1.00GiB
   Metadata,RAID6:        256.00MiB
   System,RAID6:           32.00MiB
   Unallocated:             6.72GiB

/dev/sdf, ID: 3
   Device size:             8.00GiB
   Data,RAID6:              1.00GiB
   Metadata,RAID6:        256.00MiB
   System,RAID6:           32.00MiB
   Unallocated:             6.72GiB

/dev/sdg, ID: 4
   Device size:             8.00GiB
   Data,RAID6:              1.00GiB
   Metadata,RAID6:        256.00MiB
   Unallocated:             6.75GiB
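
The imbalance in the listing above can be checked mechanically. The
following is a minimal Python sketch, not part of the original report. The
function names parse_device_usage and devices_missing are hypothetical, and
the parser assumes exactly the 'btrfs device usage' text layout shown above
(the sample is trimmed to two of the four devices).

```python
import re


def parse_device_usage(text):
    """Map each device path to {chunk type: size string}.

    Assumes the 'btrfs device usage' layout quoted in this thread:
    a '/dev/xxx, ID: N' header followed by indented 'Type,PROFILE: size' lines.
    """
    devices = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"^(/dev/\S+), ID: (\d+)$", line.strip())
        if header:
            current = header.group(1)
            devices[current] = {}
            continue
        # Only lines such as 'Data,RAID6:  1.00GiB' carry a chunk type;
        # 'Device size' and 'Unallocated' have no comma and are skipped.
        entry = re.match(r"^\s+([A-Za-z]+),\S+:\s+(\S+)$", line)
        if entry and current is not None:
            devices[current][entry.group(1)] = entry.group(2)
    return devices


def devices_missing(devices, chunk_type):
    """Return the sorted device paths that have no allocation of chunk_type."""
    return sorted(dev for dev, chunks in devices.items() if chunk_type not in chunks)


# Trimmed copy of the listing above (devid 1 and devid 4 only).
SAMPLE = """\
/dev/sdc, ID: 1
   Device size:             8.00GiB
   Data,RAID6:              1.00GiB
   Metadata,RAID6:        256.00MiB
   System,RAID6:           32.00MiB
   Unallocated:             6.72GiB

/dev/sdg, ID: 4
   Device size:             8.00GiB
   Data,RAID6:              1.00GiB
   Metadata,RAID6:        256.00MiB
   Unallocated:             6.75GiB
"""

if __name__ == "__main__":
    usage = parse_device_usage(SAMPLE)
    print("no System chunk on:", devices_missing(usage, "System"))
```

On the trimmed sample this reports /dev/sdg as the only device without a
System allocation, while Data and Metadata are present on both devices.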


Steps to reproduce:
1) Create and mount a RAID6 filesystem.
2) Remove a drive belonging to the RAID, attempt a write, and let the
kernel close the device.
3) Replace the missing device with the 'btrfs replace start' command.
4) Remove a drive in another slot, attempt a write, and wait for it to
be closed.
5) Start replacing the missing drive -> ERRORS.

If a full balance is run after step 3), no errors appear.
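
The steps above can be sketched as a dry-run plan. This is only an
illustration under stated assumptions, not the procedure actually used: the
function name reproduction_plan, the device names, devids and SCSI
addresses are all hypothetical, the drive removal is emulated through the
sysfs 'delete' attribute, and the write is a plain dd. The sketch only
builds command strings and executes nothing. It also omits any remount that
a given kernel may need after a device disappears.

```python
def reproduction_plan(mount_point, disks, first_failure, second_failure):
    """Return the shell commands for steps 1) to 5) as strings (dry run).

    Each *_failure argument is a tuple (scsi_address, devid, spare_device);
    all values are supplied by the caller and are illustrative only.
    """

    def fail_and_replace(scsi_address, devid, spare):
        return [
            # Emulate a drive failure by detaching it from its SCSI host.
            f"echo 1 > /sys/class/scsi_device/{scsi_address}/device/delete",
            # Attempt a write so the filesystem notices the missing device.
            f"dd if=/dev/zero of={mount_point}/probe bs=1M count=16 conv=fsync",
            # Replace the missing device by devid; -B waits for completion.
            f"btrfs replace start -B {devid} {spare} {mount_point}",
        ]

    plan = [
        # Step 1: create and mount the RAID6 filesystem.
        f"mkfs.btrfs -f -d raid6 -m raid6 {' '.join(disks)}",
        f"mount {disks[0]} {mount_point}",
    ]
    plan += fail_and_replace(*first_failure)   # steps 2) and 3)
    plan += fail_and_replace(*second_failure)  # steps 4) and 5)
    return plan


if __name__ == "__main__":
    for command in reproduction_plan(
        "/mnt",
        ["/dev/sdc", "/dev/sdd", "/dev/sde", "/dev/sdf"],
        ("5:0:0:0", 4, "/dev/sdg"),  # hypothetical first failure
        ("4:0:0:0", 3, "/dev/sdh"),  # hypothetical second failure
    ):
        print(command)
```

Printing the plan rather than running it keeps the sketch safe to try on
any machine. The commands would have to be adapted and run as root on
scratch disks to actually exercise the replace path.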

* Re: RAID6, errors at missing device replacement
  2016-04-15 19:49 RAID6, errors at missing device replacement Yauhen Kharuzhy
@ 2016-04-15 23:00 ` Henk Slager
  2016-04-16  7:37 ` Duncan
  1 sibling, 0 replies; 7+ messages in thread
From: Henk Slager @ 2016-04-15 23:00 UTC (permalink / raw)
  To: linux-btrfs

On Fri, Apr 15, 2016 at 9:49 PM, Yauhen Kharuzhy
<yauhen.kharuzhy@zavadatar.com> wrote:
> Hi.
>
> I have discovered case when replacement of missing devices causes
> metadata corruption. Does anybody know anything about this?

I can confirm that there is corruption when doing a replace on both
raid5 and raid6, and not only of metadata.
If the replace is done in a very stepwise way, with no other
transactions ongoing on the fs, and the device 'failure'/removal is
done in a planned way, the replace can be successful.

For a raid5 extension from 3x100GB -> 4x100GB, balance with the stripe
filter worked as expected (some 4.4 kernel). I still had these images
stored and tested how the fs would survive an overwrite of 1 device with
a DVD image (kernel 4.6.0-rc1). To summarize, I had to do a replace and
scrub, and although there were tons of errors, some very weird/wrong,
all files still seemed to be there. Until I unmounted and tried to
remount: the fs was totally corrupted, with no way to recover.

> I use 4.4.5 kernel with latest global spare patches.
>
> If we have RAID6 (may be reproducible on RAID5 too) and try to replace
> one missing drive by other and after this try to remove another drive
> and replace it, plenty of errors are shown in the log:
>
> [  748.641766] BTRFS error (device sdf): failed to rebuild valid
> logical 7366459392 for dev /dev/sde
> [  748.678069] BTRFS error (device sdf): failed to rebuild valid
> logical 7381139456 for dev /dev/sde
> [  748.693559] BTRFS error (device sdf): failed to rebuild valid
> logical 7290974208 for dev /dev/sde
> [  752.039100] BTRFS error (device sdf): bad tree block start
> 13048831955636601734 6919258112
> [  752.647869] BTRFS error (device sdf): bad tree block start
> 12819300352 6919290880
> [  752.658520] BTRFS error (device sdf): bad tree block start
> 31618367488 6919290880
> [  752.712633] BTRFS error (device sdf): bad tree block start
> 31618367488 6919290880
>
> After device replacement finish, scrub shows uncorrectable errors.
> Btrfs check complains about errors too:
> root@test:~/# btrfs check -p /dev/sdc
> Checking filesystem on /dev/sdc
> UUID: 833fef31-5536-411c-8f58-53b527569fa5
> checksum verify failed on 9359163392 found E4E3BDB6 wanted 00000000
> checksum verify failed on 9359163392 found E4E3BDB6 wanted 00000000
> checksum verify failed on 9359163392 found 4D1F4197 wanted DE0E50EC
> bytenr mismatch, want=9359163392, have=9359228928
>
> Errors found in extent allocation tree or chunk allocation
> checking free space cache [.]
> checking fs roots [.]
> checking csums
> checking root refs
> found 1049788420 bytes used err is 0
> total csum bytes: 1024000
> total tree bytes: 1179648
> total fs tree bytes: 16384
> total extent tree bytes: 16384
> btree space waste bytes: 124962
> file data blocks allocated: 1049755648
>  referenced 1049755648
>
> After first replacement metadata seems not spread across all devices:
> Label: none  uuid: 3db39446-6810-47bf-8732-d5a8793500f3
>         Total devices 4 FS bytes used 1002.00MiB
>         devid    1 size 8.00GiB used 1.28GiB path /dev/sdc
>         devid    2 size 8.00GiB used 1.28GiB path /dev/sdd
>         devid    3 size 8.00GiB used 1.28GiB path /dev/sdf
>         devid    4 size 8.00GiB used 1.25GiB path /dev/sdg
>
> # btrfs device usage /mnt/
> /dev/sdc, ID: 1
>    Device size:             8.00GiB
>    Data,RAID6:              1.00GiB
>    Metadata,RAID6:        256.00MiB
>    System,RAID6:           32.00MiB
>    Unallocated:             6.72GiB
>
> /dev/sdd, ID: 2
>    Device size:             8.00GiB
>    Data,RAID6:              1.00GiB
>    Metadata,RAID6:        256.00MiB
>    System,RAID6:           32.00MiB
>    Unallocated:             6.72GiB
>
> /dev/sdf, ID: 3
>    Device size:             8.00GiB
>    Data,RAID6:              1.00GiB
>    Metadata,RAID6:        256.00MiB
>    System,RAID6:           32.00MiB
>    Unallocated:             6.72GiB
>
> /dev/sdg, ID: 4
>    Device size:             8.00GiB
>    Data,RAID6:              1.00GiB
>    Metadata,RAID6:        256.00MiB
>    Unallocated:             6.75GiB
>
>
> Steps to reproduce:
> 1) Create and mount RAID6
> 2) remove drive belonging to RAID, try write and let kernel code close
> the device
> 3) replace missing device by 'btrfs replace start' command
> 4) remove drive in another slot, try write, wait for closing of it
> 5) start replacing of missing drive -> ERRORS.
>
> If full balance after step 3) was done, no errors appeared.

I used kernel 4.6.0-rc3 running in VirtualBox, and deleted and added
drives as one would on a live system, rsyncing files to the fs in
the meantime. Both the 1st and 2nd replacement devices show device
errors later on, but steps 1) to 5) seem to have worked fine, and
'btrfs device usage' shows correct and regular numbers. So the step 5)
ERRORS don't seem to occur.
BUT:
- when scrub is run, it just stops way too early, with no errors in dmesg
- umount works
- mounting again then seems to succeed, but no mount actually happens,
also not after a device scan or other attempts
- after a reboot, the fs can be mounted, but many files have changed size
(to 0) and dmesg mentions lots of 'no csum' errors
- roughly half of the data has disappeared, comparing scrub output and du

Given all this, I did not try the full-balance-after-step-3) workaround;
too many things go wrong at the same time with the kernel I used.

Could it be that you want to see how the kernel + global spare patches
work out for raid6 replace specifically? Or just in general for a new
kernel like 4.6.0-rc3?

At least it looks like the kernel you used did better than 4.6.0-rc3.

* Re: RAID6, errors at missing device replacement
  2016-04-15 19:49 RAID6, errors at missing device replacement Yauhen Kharuzhy
  2016-04-15 23:00 ` Henk Slager
@ 2016-04-16  7:37 ` Duncan
  2016-05-02 18:43   ` Yauhen Kharuzhy
  1 sibling, 1 reply; 7+ messages in thread
From: Duncan @ 2016-04-16  7:37 UTC (permalink / raw)
  To: linux-btrfs

Yauhen Kharuzhy posted on Fri, 15 Apr 2016 12:49:36 -0700 as excerpted:

> I have discovered case when replacement of missing devices causes
> metadata corruption. Does anybody know anything about this?
> 
> I use 4.4.5 kernel with latest global spare patches.
> 
> If we have RAID6 (may be reproducible on RAID5 too) and try to replace
> one missing drive by other and after this try to remove another drive
> and replace it, plenty of errors are shown in the log:

I know you're working on testing the global spare patches, and thanks for 
that, you've already helped catch bugs that otherwise might conceivably 
have made it into the first release with the feature, such that they 
would likely have had to be fixed later, keeping the feature from 
stabilizing for some time.

Unfortunately, that seems to be what happened to the raid56 mode
recovery/repair/reshape/scrub patches, despite the long development time 
after the basic parity-writing "partial raid56 support" went in.  Unlike 
the global-spare patches, I don't recall the raid56 recover/... patches 
getting posted a kernel and userspace release cycle or more in advance 
and getting the type of independent review and testing that you're doing 
for global-spare, leading to multiple public revisions as issues were 
found and corrected.  Arguably, that only happened once (nominally) full 
functionality was in mainline, with the result being a kernel cycle and a 
half before raid56 was really working at all for recovery, and there 
still being issues over five cycles later.

And arguably, with patches for global-spare posted to the list and your 
well beyond cursory independent testing, global-spare should be far more 
mature on mainlining, with your efforts very possibly helping it avoid 
the same sort of issues.

Tho in all fairness, btrfs itself is maturing, and it may well be that 
either the raid56 experience directly led to the tougher but ultimately 
better process for global-spare, or the btrfs process itself is simply 
mature enough now that the raid56 situation wouldn't happen were it to be 
introduced now, either.

So two main points:

1) Due to raid56 mode itself still being somewhat immature, it may not be 
appropriate to use as a platform for testing further new features (like 
global spare) just yet -- global-spare testing with raid56 may either 
have to wait (i.e. skip it for now), or someone who's intimately familiar 
with the current known raid56 problems and able to recognize them on 
sight might need to do that testing, if it is to be done at this stage.

2) Thanks very much for your work testing global-spare, and of course to 
Anand Jain for posting the patches so you can. =:^)  Your work is 
directly contributing to it being more mature at mainline feature 
release, so that (unlike raid56) hopefully it can fast-stabilize once 
released, because of all the testing and work that is going in now, 
before mainlining and release. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

* Re: RAID6, errors at missing device replacement
  2016-04-16  7:37 ` Duncan
@ 2016-05-02 18:43   ` Yauhen Kharuzhy
  2016-05-02 19:04     ` Chris Murphy
  0 siblings, 1 reply; 7+ messages in thread
From: Yauhen Kharuzhy @ 2016-05-02 18:43 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs

On Sat, Apr 16, 2016 at 07:37:48AM +0000, Duncan wrote:
> Yauhen Kharuzhy posted on Fri, 15 Apr 2016 12:49:36 -0700 as excerpted:
> 
> > I have discovered case when replacement of missing devices causes
> > metadata corruption. Does anybody know anything about this?
> > 
> > I use 4.4.5 kernel with latest global spare patches.
> > 
> > If we have RAID6 (may be reproducible on RAID5 too) and try to replace
> > one missing drive by other and after this try to remove another drive
> > and replace it, plenty of errors are shown in the log:

I have reproduced this with a vanilla 4.6-rc4 kernel and RAID5.

The script used to reproduce it is attached; run it as "./test-replace.sh <mount point> <disk1 disk2...>".

Kernel log:

[  402.878389] BTRFS: device fsid eabede3e-1e50-46cd-92ec-f9476b321f63 devid 1 transid 3 /dev/sdc
[  402.911820] BTRFS: device fsid eabede3e-1e50-46cd-92ec-f9476b321f63 devid 2 transid 3 /dev/sdd
[  402.972031] BTRFS: device fsid eabede3e-1e50-46cd-92ec-f9476b321f63 devid 3 transid 3 /dev/sde
[  403.020067] BTRFS: device fsid eabede3e-1e50-46cd-92ec-f9476b321f63 devid 4 transid 3 /dev/sdf
[  404.042312] BTRFS info (device sdf): disk space caching is enabled
[  404.051338] BTRFS: has skinny extents
[  404.056805] BTRFS: flagging fs with big metadata feature
[  404.149815] BTRFS: creating UUID tree
[  407.321146] sd 5:0:0:0: [sdf] Synchronizing SCSI cache
[  407.349530] sd 5:0:0:0: [sdf] Stopping disk
[  407.376682] ata6.00: disabled
[  407.695945] BTRFS error (device sdf): bdev /dev/sdf errs: wr 0, rd 0, flush 1, corrupt 0, gen 0
[  407.703760] BTRFS warning (device sdf): lost page write due to IO error on /dev/sdf
[  407.726179] BTRFS error (device sdf): bdev /dev/sdf errs: wr 1, rd 0, flush 1, corrupt 0, gen 0
[  407.733718] BTRFS warning (device sdf): lost page write due to IO error on /dev/sdf
[  407.739873] BTRFS error (device sdf): bdev /dev/sdf errs: wr 2, rd 0, flush 1, corrupt 0, gen 0
[  410.631220] ata6: hard resetting link
[  411.041672] ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[  411.090105] ata6.00: ATA-6: VBOX HARDDISK, 1.0, max UDMA/133
[  411.153739] ata6.00: 16777216 sectors, multi 128: LBA48 NCQ (depth 31/32)
[  411.189534] ata6.00: configured for UDMA/133
[  411.225526] ata6: EH complete
[  411.229002] scsi 5:0:0:0: Direct-Access     ATA      VBOX HARDDISK    1.0  PQ: 0 ANSI: 5
[  411.278584] sd 5:0:0:0: [sdg] 16777216 512-byte logical blocks: (8.59 GB/8.00 GiB)
[  411.297341] sd 5:0:0:0: [sdg] Write Protect is off
[  411.300054] sd 5:0:0:0: Attached scsi generic sg5 type 0
[  411.350875] sd 5:0:0:0: [sdg] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[  411.371402] sd 5:0:0:0: [sdg] Attached SCSI disk
[  413.663624] BTRFS error (device sdf): bdev /dev/sdf errs: wr 2, rd 0, flush 2, corrupt 0, gen 0
[  413.714417] BTRFS warning (device sdf): lost page write due to IO error on /dev/sdf
[  413.719450] BTRFS error (device sdf): bdev /dev/sdf errs: wr 3, rd 0, flush 2, corrupt 0, gen 0
[  413.728705] BTRFS warning (device sdf): lost page write due to IO error on /dev/sdf
[  413.734030] BTRFS error (device sdf): bdev /dev/sdf errs: wr 4, rd 0, flush 2, corrupt 0, gen 0
[  413.841946] BTRFS info (device sde): allowing degraded mounts
[  413.848622] BTRFS info (device sde): disk space caching is enabled
[  413.877470] BTRFS: has skinny extents
[  413.942027] BTRFS info (device sde): bdev /dev/sdf errs: wr 2, rd 0, flush 1, corrupt 0, gen 0
[  414.076571] BTRFS info (device sde): dev_replace from <missing disk> (devid 4) to /dev/sdg started
[  420.402126] BTRFS info (device sde): dev_replace from <missing disk> (devid 4) to /dev/sdg finished
[  420.646768] sd 4:0:0:0: [sde] Synchronizing SCSI cache
[  420.653786] sd 4:0:0:0: [sde] Stopping disk
[  420.707224] ata5.00: disabled
[  420.991219] BTRFS error (device sde): bdev /dev/sde errs: wr 0, rd 0, flush 1, corrupt 0, gen 0
[  421.006803] BTRFS warning (device sde): lost page write due to IO error on /dev/sde
[  421.013813] BTRFS error (device sde): bdev /dev/sde errs: wr 1, rd 0, flush 1, corrupt 0, gen 0
[  421.022001] BTRFS warning (device sde): lost page write due to IO error on /dev/sde
[  421.032855] BTRFS error (device sde): bdev /dev/sde errs: wr 2, rd 0, flush 1, corrupt 0, gen 0
[  423.943549] ata5: hard resetting link
[  424.264086] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[  424.270354] ata5.00: ATA-6: VBOX HARDDISK, 1.0, max UDMA/133
[  424.303915] ata5.00: 41943040 sectors, multi 128: LBA48 NCQ (depth 31/32)
[  424.312418] ata5.00: configured for UDMA/133
[  424.317876] ata5: EH complete
[  424.346139] scsi 4:0:0:0: Direct-Access     ATA      VBOX HARDDISK    1.0  PQ: 0 ANSI: 5
[  424.389067] sd 4:0:0:0: [sdf] 41943040 512-byte logical blocks: (21.5 GB/20.0 GiB)
[  424.389110] sd 4:0:0:0: Attached scsi generic sg4 type 0
[  424.453500] sd 4:0:0:0: [sdf] Write Protect is off
[  424.460923] sd 4:0:0:0: [sdf] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[  424.526381] sd 4:0:0:0: [sdf] Attached SCSI disk
[  426.636182] BTRFS error (device sde): bdev /dev/sde errs: wr 2, rd 0, flush 2, corrupt 0, gen 0
[  426.641741] BTRFS warning (device sde): lost page write due to IO error on /dev/sde
[  426.691659] BTRFS error (device sde): bdev /dev/sde errs: wr 3, rd 0, flush 2, corrupt 0, gen 0
[  426.698723] BTRFS warning (device sde): lost page write due to IO error on /dev/sde
[  426.710799] BTRFS error (device sde): bdev /dev/sde errs: wr 4, rd 0, flush 2, corrupt 0, gen 0
[  426.834307] BTRFS info (device sdg): allowing degraded mounts
[  426.842495] BTRFS info (device sdg): disk space caching is enabled
[  426.860045] BTRFS: has skinny extents
[  426.875105] BTRFS info (device sdg): bdev /dev/sdg errs: wr 2, rd 0, flush 1, corrupt 0, gen 0
[  426.886143] BTRFS info (device sdg): bdev /dev/sde errs: wr 2, rd 0, flush 1, corrupt 0, gen 0
[  427.146338] BTRFS info (device sdg): dev_replace from <missing disk> (devid 3) to /dev/sdf started
[  427.936021] BTRFS error (device sdg): failed to rebuild valid logical 3279355904 for dev /dev/sde
[  428.076806] BTRFS error (device sdg): failed to rebuild valid logical 3267567616 for dev /dev/sde
[  428.189681] BTRFS error (device sdg): failed to rebuild valid logical 3277004800 for dev /dev/sde
[  428.768747] BTRFS error (device sdg): failed to rebuild valid logical 3279372288 for dev /dev/sde
[  429.411867] BTRFS error (device sdg): failed to rebuild valid logical 3269947392 for dev /dev/sde
[  429.438711] BTRFS error (device sdg): failed to rebuild valid logical 3271520256 for dev /dev/sde
[  429.499210] BTRFS error (device sdg): failed to rebuild valid logical 3268378624 for dev /dev/sde
[  429.870200] BTRFS error (device sdg): failed to rebuild valid logical 3276255232 for dev /dev/sde
[  429.967750] BTRFS error (device sdg): failed to rebuild valid logical 3266834432 for dev /dev/sde
[  430.028623] BTRFS error (device sdg): failed to rebuild valid logical 3274698752 for dev /dev/sde
[  430.488825] BTRFS info (device sdg): dev_replace from <missing disk> (devid 3) to /dev/sdf finished
[  430.620438] sd 3:0:0:0: [sdd] Synchronizing SCSI cache
[  430.692664] sd 3:0:0:0: [sdd] Stopping disk
[  430.760882] ata4.00: disabled
[  430.958960] BTRFS error (device sdg): bdev /dev/sdd errs: wr 0, rd 0, flush 1, corrupt 0, gen 0
[  430.982233] BTRFS warning (device sdg): lost page write due to IO error on /dev/sdd
[  430.999441] BTRFS error (device sdg): bdev /dev/sdd errs: wr 1, rd 0, flush 1, corrupt 0, gen 0
[  431.036540] BTRFS warning (device sdg): lost page write due to IO error on /dev/sdd
[  431.074314] BTRFS error (device sdg): bdev /dev/sdd errs: wr 2, rd 0, flush 1, corrupt 0, gen 0
[  433.961963] ata4: hard resetting link
[  434.287424] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[  434.292584] ata4.00: ATA-6: VBOX HARDDISK, 1.0, max UDMA/133
[  434.302767] ata4.00: 41943040 sectors, multi 128: LBA48 NCQ (depth 31/32)
[  434.342383] ata4.00: configured for UDMA/133
[  434.354685] ata4: EH complete
[  434.364789] scsi 3:0:0:0: Direct-Access     ATA      VBOX HARDDISK    1.0  PQ: 0 ANSI: 5
[  434.440122] sd 3:0:0:0: Attached scsi generic sg3 type 0
[  434.448358] sd 3:0:0:0: [sde] 41943040 512-byte logical blocks: (21.5 GB/20.0 GiB)
[  434.448481] sd 3:0:0:0: [sde] Write Protect is off
[  434.448517] sd 3:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[  434.589187] sd 3:0:0:0: [sde] Attached SCSI disk
[  436.639464] BTRFS error (device sdg): bdev /dev/sdd errs: wr 2, rd 0, flush 2, corrupt 0, gen 0
[  436.701947] BTRFS warning (device sdg): lost page write due to IO error on /dev/sdd
[  436.713283] BTRFS error (device sdg): bdev /dev/sdd errs: wr 3, rd 0, flush 2, corrupt 0, gen 0
[  436.723682] BTRFS warning (device sdg): lost page write due to IO error on /dev/sdd
[  436.731662] BTRFS error (device sdg): bdev /dev/sdd errs: wr 4, rd 0, flush 2, corrupt 0, gen 0
[  436.761114] BTRFS error (device sdg): bdev /dev/sdd errs: wr 4, rd 0, flush 3, corrupt 0, gen 0
[  436.783619] BTRFS warning (device sdg): lost page write due to IO error on /dev/sdd
[  436.790353] BTRFS error (device sdg): bdev /dev/sdd errs: wr 5, rd 0, flush 3, corrupt 0, gen 0
[  436.828784] BTRFS warning (device sdg): lost page write due to IO error on /dev/sdd
[  436.840279] BTRFS error (device sdg): bdev /dev/sdd errs: wr 6, rd 0, flush 3, corrupt 0, gen 0
[  436.963086] BTRFS info (device sdf): allowing degraded mounts
[  436.977520] BTRFS info (device sdf): disk space caching is enabled
[  436.982720] BTRFS: has skinny extents
[  436.998246] BTRFS info (device sdf): bdev /dev/sdf errs: wr 2, rd 0, flush 1, corrupt 0, gen 0
[  437.023059] BTRFS info (device sdf): bdev /dev/sdg errs: wr 2, rd 0, flush 1, corrupt 0, gen 0
[  437.040400] BTRFS info (device sdf): bdev /dev/sdd errs: wr 4, rd 0, flush 2, corrupt 0, gen 0
[  437.241595] BTRFS info (device sdf): dev_replace from <missing disk> (devid 2) to /dev/sde started
[  438.185590] scrub_missing_raid56_worker: 2 callbacks suppressed
[  438.188229] BTRFS error (device sdf): failed to rebuild valid logical 3279421440 for dev /dev/sdd
[  438.300493] BTRFS error (device sdf): failed to rebuild valid logical 3267633152 for dev /dev/sdd
[  438.703672] BTRFS error (device sdf): failed to rebuild valid logical 3277070336 for dev /dev/sdd
[  439.157045] BTRFS error (device sdf): failed to rebuild valid logical 3279437824 for dev /dev/sdd
[  439.373168] BTRFS error (device sdf): failed to rebuild valid logical 3270012928 for dev /dev/sdd
[  439.423270] BTRFS error (device sdf): failed to rebuild valid logical 3271585792 for dev /dev/sdd
[  439.601332] BTRFS error (device sdf): failed to rebuild valid logical 3268444160 for dev /dev/sdd
[  440.043626] BTRFS error (device sdf): failed to rebuild valid logical 3276320768 for dev /dev/sdd
[  440.205525] BTRFS error (device sdf): failed to rebuild valid logical 3266899968 for dev /dev/sdd
[  440.249055] BTRFS error (device sdf): failed to rebuild valid logical 3274764288 for dev /dev/sdd
[  440.351454] BTRFS info (device sdf): dev_replace from <missing disk> (devid 2) to /dev/sde finished
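
The rebuild failures in a log like the one above can be tallied per device
with a few lines of Python. This is a minimal sketch rather than an
existing tool: the name rebuild_failures is hypothetical, and the regular
expression assumes the exact message wording shown in this log. It
tolerates a line break inside the message, as happens when mail clients
wrap long lines.

```python
import re
from collections import defaultdict

# Message wording taken from the kernel log in this thread; \s+ allows the
# message to be wrapped across two lines.
REBUILD_RE = re.compile(
    r"failed to rebuild valid\s+logical\s+(\d+)\s+for dev\s+(\S+)"
)


def rebuild_failures(log_text):
    """Map each device path to the list of logical addresses that failed."""
    failures = defaultdict(list)
    for match in REBUILD_RE.finditer(log_text):
        failures[match.group(2)].append(int(match.group(1)))
    return dict(failures)


# Three lines copied from the log above.
SAMPLE_LOG = """\
[  427.936021] BTRFS error (device sdg): failed to rebuild valid logical 3279355904 for dev /dev/sde
[  428.076806] BTRFS error (device sdg): failed to rebuild valid logical 3267567616 for dev /dev/sde
[  438.188229] BTRFS error (device sdf): failed to rebuild valid logical 3279421440 for dev /dev/sdd
"""

if __name__ == "__main__":
    for device, addresses in rebuild_failures(SAMPLE_LOG).items():
        print(device, len(addresses), addresses)
```

Fed the full log, the same function groups the failures by the device named
in each message, which makes it easy to compare the second and third
replace runs.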


-- 
Yauhen Kharuzhy

[-- Attachment #2: test-replace.sh --]
[-- Type: application/x-sh, Size: 3101 bytes --]

* Re: RAID6, errors at missing device replacement
  2016-05-02 18:43   ` Yauhen Kharuzhy
@ 2016-05-02 19:04     ` Chris Murphy
  2016-05-02 19:19       ` Yauhen Kharuzhy
  0 siblings, 1 reply; 7+ messages in thread
From: Chris Murphy @ 2016-05-02 19:04 UTC (permalink / raw)
  To: Yauhen Kharuzhy; +Cc: Duncan, Btrfs BTRFS

On Mon, May 2, 2016 at 12:43 PM, Yauhen Kharuzhy
<yauhen.kharuzhy@zavadatar.com> wrote:
> On Sat, Apr 16, 2016 at 07:37:48AM +0000, Duncan wrote:
>> Yauhen Kharuzhy posted on Fri, 15 Apr 2016 12:49:36 -0700 as excerpted:
>>
>> > I have discovered case when replacement of missing devices causes
>> > metadata corruption. Does anybody know anything about this?
>> >
>> > I use 4.4.5 kernel with latest global spare patches.
>> >
>> > If we have RAID6 (may be reproducible on RAID5 too) and try to replace
>> > one missing drive by other and after this try to remove another drive
>> > and replace it, plenty of errors are shown in the log:
>
> I have reproduced this with vanilla 4.6-rc4 kernel and RAID5.
>
> Script used to reproduce is attached, run as "./test-replace.sh <mount point> <disk1 disk2...>"
>
> Kernel log:
>
> [  402.878389] BTRFS: device fsid eabede3e-1e50-46cd-92ec-f9476b321f63 devid 1 transid 3 /dev/sdc
> [  402.911820] BTRFS: device fsid eabede3e-1e50-46cd-92ec-f9476b321f63 devid 2 transid 3 /dev/sdd
> [  402.972031] BTRFS: device fsid eabede3e-1e50-46cd-92ec-f9476b321f63 devid 3 transid 3 /dev/sde
> [  403.020067] BTRFS: device fsid eabede3e-1e50-46cd-92ec-f9476b321f63 devid 4 transid 3 /dev/sdf
> [  404.042312] BTRFS info (device sdf): disk space caching is enabled
> [  404.051338] BTRFS: has skinny extents
> [  404.056805] BTRFS: flagging fs with big metadata feature
> [  404.149815] BTRFS: creating UUID tree
> [  407.321146] sd 5:0:0:0: [sdf] Synchronizing SCSI cache
> [  407.349530] sd 5:0:0:0: [sdf] Stopping disk
> [  407.376682] ata6.00: disabled

Why is ata6 disabled?

> [  407.695945] BTRFS error (device sdf): bdev /dev/sdf errs: wr 0, rd 0, flush 1, corrupt 0, gen 0
> [  407.703760] BTRFS warning (device sdf): lost page write due to IO error on /dev/sdf
> [  407.726179] BTRFS error (device sdf): bdev /dev/sdf errs: wr 1, rd 0, flush 1, corrupt 0, gen 0
> [  407.733718] BTRFS warning (device sdf): lost page write due to IO error on /dev/sdf
> [  407.739873] BTRFS error (device sdf): bdev /dev/sdf errs: wr 2, rd 0, flush 1, corrupt 0, gen 0
> [  410.631220] ata6: hard resetting link

And now reset?


> [  411.041672] ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> [  411.090105] ata6.00: ATA-6: VBOX HARDDISK, 1.0, max UDMA/133
> [  411.153739] ata6.00: 16777216 sectors, multi 128: LBA48 NCQ (depth 31/32)
> [  411.189534] ata6.00: configured for UDMA/133
> [  411.225526] ata6: EH complete
> [  411.229002] scsi 5:0:0:0: Direct-Access     ATA      VBOX HARDDISK    1.0  PQ: 0 ANSI: 5
> [  411.278584] sd 5:0:0:0: [sdg] 16777216 512-byte logical blocks: (8.59 GB/8.00 GiB)

sd 5:0:0:0 was sdf but now it's sdg



> [  411.297341] sd 5:0:0:0: [sdg] Write Protect is off
> [  411.300054] sd 5:0:0:0: Attached scsi generic sg5 type 0
> [  411.350875] sd 5:0:0:0: [sdg] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> [  411.371402] sd 5:0:0:0: [sdg] Attached SCSI disk
> [  413.663624] BTRFS error (device sdf): bdev /dev/sdf errs: wr 2, rd 0, flush 2, corrupt 0, gen 0
> [  413.714417] BTRFS warning (device sdf): lost page write due to IO error on /dev/sdf
> [  413.719450] BTRFS error (device sdf): bdev /dev/sdf errs: wr 3, rd 0, flush 2, corrupt 0, gen 0
> [  413.728705] BTRFS warning (device sdf): lost page write due to IO error on /dev/sdf
> [  413.734030] BTRFS error (device sdf): bdev /dev/sdf errs: wr 4, rd 0, flush 2, corrupt 0, gen 0
> [  413.841946] BTRFS info (device sde): allowing degraded mounts
> [  413.848622] BTRFS info (device sde): disk space caching is enabled
> [  413.877470] BTRFS: has skinny extents
> [  413.942027] BTRFS info (device sde): bdev /dev/sdf errs: wr 2, rd 0, flush 1, corrupt 0, gen 0
> [  414.076571] BTRFS info (device sde): dev_replace from <missing disk> (devid 4) to /dev/sdg started
> [  420.402126] BTRFS info (device sde): dev_replace from <missing disk> (devid 4) to /dev/sdg finished
> [  420.646768] sd 4:0:0:0: [sde] Synchronizing SCSI cache
> [  420.653786] sd 4:0:0:0: [sde] Stopping disk
> [  420.707224] ata5.00: disabled

sde is stopped? ata5 is disabled

> [  420.991219] BTRFS error (device sde): bdev /dev/sde errs: wr 0, rd 0, flush 1, corrupt 0, gen 0
> [  421.006803] BTRFS warning (device sde): lost page write due to IO error on /dev/sde
> [  421.013813] BTRFS error (device sde): bdev /dev/sde errs: wr 1, rd 0, flush 1, corrupt 0, gen 0
> [  421.022001] BTRFS warning (device sde): lost page write due to IO error on /dev/sde
> [  421.032855] BTRFS error (device sde): bdev /dev/sde errs: wr 2, rd 0, flush 1, corrupt 0, gen 0
> [  423.943549] ata5: hard resetting link

and now reset


> [  424.264086] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> [  424.270354] ata5.00: ATA-6: VBOX HARDDISK, 1.0, max UDMA/133
> [  424.303915] ata5.00: 41943040 sectors, multi 128: LBA48 NCQ (depth 31/32)
> [  424.312418] ata5.00: configured for UDMA/133
> [  424.317876] ata5: EH complete
> [  424.346139] scsi 4:0:0:0: Direct-Access     ATA      VBOX HARDDISK    1.0  PQ: 0 ANSI: 5
> [  424.389067] sd 4:0:0:0: [sdf] 41943040 512-byte logical blocks: (21.5 GB/20.0 GiB)
> [  424.389110] sd 4:0:0:0: Attached scsi generic sg4 type 0
> [  424.453500] sd 4:0:0:0: [sdf] Write Protect is off

sd 4:0:0:0: was sde now it's sdf


I think there's another bug here instigating all of this. I'm not sure
it's a Btrfs bug at all.
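
The renames pointed out above can be picked out of a kernel log
mechanically. The following Python sketch is only an illustration: the name
scsi_renames is hypothetical, and the pattern assumes the
'sd H:C:T:L: [sdX]' prefix used by the messages quoted in this thread.

```python
import re

# Matches e.g. 'sd 5:0:0:0: [sdf] Synchronizing SCSI cache'.
SD_LINE_RE = re.compile(r"sd (\d+:\d+:\d+:\d+): \[(sd[a-z]+)\]")


def scsi_renames(log_text):
    """Return (scsi_address, old_name, new_name) for every name change seen."""
    last_name = {}
    renames = []
    for line in log_text.splitlines():
        match = SD_LINE_RE.search(line)
        if not match:
            continue
        address, name = match.groups()
        previous = last_name.get(address)
        if previous is not None and previous != name:
            renames.append((address, previous, name))
        last_name[address] = name
    return renames


# Lines copied from the quoted kernel log.
SAMPLE_LOG = """\
[  407.321146] sd 5:0:0:0: [sdf] Synchronizing SCSI cache
[  407.349530] sd 5:0:0:0: [sdf] Stopping disk
[  411.278584] sd 5:0:0:0: [sdg] 16777216 512-byte logical blocks: (8.59 GB/8.00 GiB)
[  420.646768] sd 4:0:0:0: [sde] Synchronizing SCSI cache
[  424.389067] sd 4:0:0:0: [sdf] 41943040 512-byte logical blocks: (21.5 GB/20.0 GiB)
"""

if __name__ == "__main__":
    for address, old, new in scsi_renames(SAMPLE_LOG):
        print(f"sd {address}: was {old}, now {new}")
```

On the sample lines this lists the same two changes noted inline above:
5:0:0:0 going from sdf to sdg, and 4:0:0:0 going from sde to sdf.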



-- 
Chris Murphy

* Re: RAID6, errors at missing device replacement
  2016-05-02 19:04     ` Chris Murphy
@ 2016-05-02 19:19       ` Yauhen Kharuzhy
  2016-05-02 19:33         ` Chris Murphy
  0 siblings, 1 reply; 7+ messages in thread
From: Yauhen Kharuzhy @ 2016-05-02 19:19 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Duncan, Btrfs BTRFS

On Mon, May 02, 2016 at 01:04:30PM -0600, Chris Murphy wrote:
> On Mon, May 2, 2016 at 12:43 PM, Yauhen Kharuzhy
> <yauhen.kharuzhy@zavadatar.com> wrote:
> > On Sat, Apr 16, 2016 at 07:37:48AM +0000, Duncan wrote:
> >> Yauhen Kharuzhy posted on Fri, 15 Apr 2016 12:49:36 -0700 as excerpted:
> >>
> >> > I have discovered case when replacement of missing devices causes
> >> > metadata corruption. Does anybody know anything about this?
> >> >
> >> > I use 4.4.5 kernel with latest global spare patches.
> >> >
> >> > If we have RAID6 (may be reproducible on RAID5 too) and try to replace
> >> > one missing drive by other and after this try to remove another drive
> >> > and replace it, plenty of errors are shown in the log:
> >
> > I have reproduced this with vanilla 4.6-rc4 kernel and RAID5.
> >
> > Script used to reproduce is attached, run as "./test-replace.sh <mount point> <disk1 disk2...>"
> >
> > Kernel log:
> >
> > [  402.878389] BTRFS: device fsid eabede3e-1e50-46cd-92ec-f9476b321f63 devid 1 transid 3 /dev/sdc
> > [  402.911820] BTRFS: device fsid eabede3e-1e50-46cd-92ec-f9476b321f63 devid 2 transid 3 /dev/sdd
> > [  402.972031] BTRFS: device fsid eabede3e-1e50-46cd-92ec-f9476b321f63 devid 3 transid 3 /dev/sde
> > [  403.020067] BTRFS: device fsid eabede3e-1e50-46cd-92ec-f9476b321f63 devid 4 transid 3 /dev/sdf
> > [  404.042312] BTRFS info (device sdf): disk space caching is enabled
> > [  404.051338] BTRFS: has skinny extents
> > [  404.056805] BTRFS: flagging fs with big metadata feature
> > [  404.149815] BTRFS: creating UUID tree
> > [  407.321146] sd 5:0:0:0: [sdf] Synchronizing SCSI cache
> > [  407.349530] sd 5:0:0:0: [sdf] Stopping disk
> > [  407.376682] ata6.00: disabled
> 
> Why is ata6 disabled?

To emulate a failed drive, I detach it from the SCSI host (see script) with the
'echo 1 > /sys/class/scsi_device/<dev>/device/delete' command.
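
For anyone who wants to script the same thing outside the attached shell script, here is a rough sketch of the detach step. The 'delete' attribute path is the one from the command above; the rescan helper (writing "- - -" to the SCSI host's 'scan' attribute) is my assumption about how the drive could be brought back and is not taken from the script. Function names and the sysfs_root parameter are hypothetical; the latter exists only so the code can be tried against a scratch directory.

```python
from pathlib import Path


def detach_scsi_device(dev, sysfs_root="/sys"):
    """Write '1' to the SCSI device's 'delete' attribute (needs root on a real system).

    dev is the SCSI address as it appears under /sys/class/scsi_device,
    for example "4:0:0:0".
    """
    delete_attr = Path(sysfs_root) / "class" / "scsi_device" / dev / "device" / "delete"
    delete_attr.write_text("1")
    return delete_attr


def rescan_scsi_host(host, sysfs_root="/sys"):
    """Ask a SCSI host (for example "host4") to rescan all channels, targets and LUNs.

    Assumption: this is one way to re-attach the drive; the original
    script may do it differently.
    """
    scan_attr = Path(sysfs_root) / "class" / "scsi_host" / host / "scan"
    scan_attr.write_text("- - -")
    return scan_attr
```

On a test VM this would be called as detach_scsi_device("4:0:0:0") and later rescan_scsi_host("host4"); the address and host name depend on the machine.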

> 
> > [  407.695945] BTRFS error (device sdf): bdev /dev/sdf errs: wr 0, rd 0, flush 1, corrupt 0, gen 0
> > [  407.703760] BTRFS warning (device sdf): lost page write due to IO error on /dev/sdf
> > [  407.726179] BTRFS error (device sdf): bdev /dev/sdf errs: wr 1, rd 0, flush 1, corrupt 0, gen 0
> > [  407.733718] BTRFS warning (device sdf): lost page write due to IO error on /dev/sdf
> > [  407.739873] BTRFS error (device sdf): bdev /dev/sdf errs: wr 2, rd 0, flush 1, corrupt 0, gen 0
> > [  410.631220] ata6: hard resetting link
> 
> And now reset?
> 
> 
> > [  411.041672] ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> > [  411.090105] ata6.00: ATA-6: VBOX HARDDISK, 1.0, max UDMA/133
> > [  411.153739] ata6.00: 16777216 sectors, multi 128: LBA48 NCQ (depth 31/32)
> > [  411.189534] ata6.00: configured for UDMA/133
> > [  411.225526] ata6: EH complete
> > [  411.229002] scsi 5:0:0:0: Direct-Access     ATA      VBOX HARDDISK    1.0  PQ: 0 ANSI: 5
> > [  411.278584] sd 5:0:0:0: [sdg] 16777216 512-byte logical blocks: (8.59 GB/8.00 GiB)
> 
> sd 5:0:0:0 was sdf but now it's sdg

Yes, I reinserted the drive, wiped btrfs from it, and started a
replace of the missing device with it. The sdf block device will be released by
btrfs at unmount (without Anand's global spare patchset there is no way
to close a failed or removed device and mark it as missing).
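
Roughly, one replace cycle looks like the sketch below. It only builds and prints the commands (a dry run) and is an approximation based on the kernel log, not a copy of the attached script: the unmount, degraded mount and 'btrfs replace start' by devid match what the log shows, while the use of 'wipefs -a' for the wipe and the function name are assumptions. shlex.join needs Python 3.8 or newer.

```python
import shlex


def replace_missing_commands(healthy_dev, new_dev, missing_devid, mount_point):
    """Return the commands for one replace cycle, in order, as argv lists."""
    return [
        # Unmount so btrfs releases the block device of the detached drive.
        ["umount", mount_point],
        # Assumption: wipe old signatures from the reinserted drive.
        ["wipefs", "-a", new_dev],
        # Mount degraded through a device that is still present.
        ["mount", "-o", "degraded", healthy_dev, mount_point],
        # Replace the missing device, addressed by devid; -B stays in the foreground.
        ["btrfs", "replace", "start", "-B", str(missing_devid), new_dev, mount_point],
    ]


if __name__ == "__main__":
    # Values taken from the first replace in the log; /mnt is a placeholder.
    for argv in replace_missing_commands("/dev/sde", "/dev/sdg", 4, "/mnt"):
        print(shlex.join(argv))
```

The devid form of 'btrfs replace start' is used because a missing device has no usable path to name as the source.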

> 
> 
> 
> > [  411.297341] sd 5:0:0:0: [sdg] Write Protect is off
> > [  411.300054] sd 5:0:0:0: Attached scsi generic sg5 type 0
> > [  411.350875] sd 5:0:0:0: [sdg] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> > [  411.371402] sd 5:0:0:0: [sdg] Attached SCSI disk
> > [  413.663624] BTRFS error (device sdf): bdev /dev/sdf errs: wr 2, rd 0, flush 2, corrupt 0, gen 0
> > [  413.714417] BTRFS warning (device sdf): lost page write due to IO error on /dev/sdf
> > [  413.719450] BTRFS error (device sdf): bdev /dev/sdf errs: wr 3, rd 0, flush 2, corrupt 0, gen 0
> > [  413.728705] BTRFS warning (device sdf): lost page write due to IO error on /dev/sdf
> > [  413.734030] BTRFS error (device sdf): bdev /dev/sdf errs: wr 4, rd 0, flush 2, corrupt 0, gen 0
> > [  413.841946] BTRFS info (device sde): allowing degraded mounts
> > [  413.848622] BTRFS info (device sde): disk space caching is enabled
> > [  413.877470] BTRFS: has skinny extents
> > [  413.942027] BTRFS info (device sde): bdev /dev/sdf errs: wr 2, rd 0, flush 1, corrupt 0, gen 0
> > [  414.076571] BTRFS info (device sde): dev_replace from <missing disk> (devid 4) to /dev/sdg started
> > [  420.402126] BTRFS info (device sde): dev_replace from <missing disk> (devid 4) to /dev/sdg finished
> > [  420.646768] sd 4:0:0:0: [sde] Synchronizing SCSI cache
> > [  420.653786] sd 4:0:0:0: [sde] Stopping disk
> > [  420.707224] ata5.00: disabled
> 
> sde is stopped? ata5 is disabled

This is the second replace. The 'failed to rebuild logical...' messages appear
only during a second replace that targets a different device than the first one.

> 
> > [  420.991219] BTRFS error (device sde): bdev /dev/sde errs: wr 0, rd 0, flush 1, corrupt 0, gen 0
> > [  421.006803] BTRFS warning (device sde): lost page write due to IO error on /dev/sde
> > [  421.013813] BTRFS error (device sde): bdev /dev/sde errs: wr 1, rd 0, flush 1, corrupt 0, gen 0
> > [  421.022001] BTRFS warning (device sde): lost page write due to IO error on /dev/sde
> > [  421.032855] BTRFS error (device sde): bdev /dev/sde errs: wr 2, rd 0, flush 1, corrupt 0, gen 0
> > [  423.943549] ata5: hard resetting link
> 
> and now reset
> 
> 
> > [  424.264086] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> > [  424.270354] ata5.00: ATA-6: VBOX HARDDISK, 1.0, max UDMA/133
> > [  424.303915] ata5.00: 41943040 sectors, multi 128: LBA48 NCQ (depth 31/32)
> > [  424.312418] ata5.00: configured for UDMA/133
> > [  424.317876] ata5: EH complete
> > [  424.346139] scsi 4:0:0:0: Direct-Access     ATA      VBOX HARDDISK    1.0  PQ: 0 ANSI: 5
> > [  424.389067] sd 4:0:0:0: [sdf] 41943040 512-byte logical blocks: (21.5 GB/20.0 GiB)
> > [  424.389110] sd 4:0:0:0: Attached scsi generic sg4 type 0
> > [  424.453500] sd 4:0:0:0: [sdf] Write Protect is off
> 
> sd 4:0:0:0: was sde now it's sdf
> 
> 
> I think there's another bug here instigating all of this. I'm not sure
> it's a Btrfs bug at all.

-- 
Yauhen Kharuzhy

* Re: RAID6, errors at missing device replacement
  2016-05-02 19:19       ` Yauhen Kharuzhy
@ 2016-05-02 19:33         ` Chris Murphy
  0 siblings, 0 replies; 7+ messages in thread
From: Chris Murphy @ 2016-05-02 19:33 UTC (permalink / raw)
  To: Yauhen Kharuzhy; +Cc: Chris Murphy, Duncan, Btrfs BTRFS

On Mon, May 2, 2016 at 1:19 PM, Yauhen Kharuzhy
<yauhen.kharuzhy@zavadatar.com> wrote:
> On Mon, May 02, 2016 at 01:04:30PM -0600, Chris Murphy wrote:
>> On Mon, May 2, 2016 at 12:43 PM, Yauhen Kharuzhy
>> <yauhen.kharuzhy@zavadatar.com> wrote:
>> > On Sat, Apr 16, 2016 at 07:37:48AM +0000, Duncan wrote:
>> >> Yauhen Kharuzhy posted on Fri, 15 Apr 2016 12:49:36 -0700 as excerpted:
>> >>
>> >> > I have discovered case when replacement of missing devices causes
>> >> > metadata corruption. Does anybody know anything about this?
>> >> >
>> >> > I use 4.4.5 kernel with latest global spare patches.
>> >> >
>> >> > If we have RAID6 (may be reproducible on RAID5 too) and try to replace
>> >> > one missing drive by other and after this try to remove another drive
>> >> > and replace it, plenty of errors are shown in the log:
>> >
>> > I have reproduced this with vanilla 4.6-rc4 kernel and RAID5.
>> >
>> > Script used to reproduce is attached, run as "./test-replace.sh <mount point> <disk1 disk2...>"
>> >
>> > Kernel log:
>> >
>> > [  402.878389] BTRFS: device fsid eabede3e-1e50-46cd-92ec-f9476b321f63 devid 1 transid 3 /dev/sdc
>> > [  402.911820] BTRFS: device fsid eabede3e-1e50-46cd-92ec-f9476b321f63 devid 2 transid 3 /dev/sdd
>> > [  402.972031] BTRFS: device fsid eabede3e-1e50-46cd-92ec-f9476b321f63 devid 3 transid 3 /dev/sde
>> > [  403.020067] BTRFS: device fsid eabede3e-1e50-46cd-92ec-f9476b321f63 devid 4 transid 3 /dev/sdf
>> > [  404.042312] BTRFS info (device sdf): disk space caching is enabled
>> > [  404.051338] BTRFS: has skinny extents
>> > [  404.056805] BTRFS: flagging fs with big metadata feature
>> > [  404.149815] BTRFS: creating UUID tree
>> > [  407.321146] sd 5:0:0:0: [sdf] Synchronizing SCSI cache
>> > [  407.349530] sd 5:0:0:0: [sdf] Stopping disk
>> > [  407.376682] ata6.00: disabled
>>
>> Why is ata6 disabled?
>
> To emulate a failed drive, I detach it from the SCSI host (see script) with the
> 'echo 1 > /sys/class/scsi_device/<dev>/device/delete' command.
>
>>
>> > [  407.695945] BTRFS error (device sdf): bdev /dev/sdf errs: wr 0, rd 0, flush 1, corrupt 0, gen 0
>> > [  407.703760] BTRFS warning (device sdf): lost page write due to IO error on /dev/sdf
>> > [  407.726179] BTRFS error (device sdf): bdev /dev/sdf errs: wr 1, rd 0, flush 1, corrupt 0, gen 0
>> > [  407.733718] BTRFS warning (device sdf): lost page write due to IO error on /dev/sdf
>> > [  407.739873] BTRFS error (device sdf): bdev /dev/sdf errs: wr 2, rd 0, flush 1, corrupt 0, gen 0
>> > [  410.631220] ata6: hard resetting link
>>
>> And now reset?
>>
>>
>> > [  411.041672] ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>> > [  411.090105] ata6.00: ATA-6: VBOX HARDDISK, 1.0, max UDMA/133
>> > [  411.153739] ata6.00: 16777216 sectors, multi 128: LBA48 NCQ (depth 31/32)
>> > [  411.189534] ata6.00: configured for UDMA/133
>> > [  411.225526] ata6: EH complete
>> > [  411.229002] scsi 5:0:0:0: Direct-Access     ATA      VBOX HARDDISK    1.0  PQ: 0 ANSI: 5
>> > [  411.278584] sd 5:0:0:0: [sdg] 16777216 512-byte logical blocks: (8.59 GB/8.00 GiB)
>>
>> sd 5:0:0:0 was sdf but now it's sdg
>
> Yes, I reinserted the drive, wiped btrfs from it, and started a
> replace of the missing device with it. The sdf block device will be released by
> btrfs at unmount (without Anand's global spare patchset there is no way
> to close a failed or removed device and mark it as missing).
>
>>
>>
>>
>> > [  411.297341] sd 5:0:0:0: [sdg] Write Protect is off
>> > [  411.300054] sd 5:0:0:0: Attached scsi generic sg5 type 0
>> > [  411.350875] sd 5:0:0:0: [sdg] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
>> > [  411.371402] sd 5:0:0:0: [sdg] Attached SCSI disk
>> > [  413.663624] BTRFS error (device sdf): bdev /dev/sdf errs: wr 2, rd 0, flush 2, corrupt 0, gen 0
>> > [  413.714417] BTRFS warning (device sdf): lost page write due to IO error on /dev/sdf
>> > [  413.719450] BTRFS error (device sdf): bdev /dev/sdf errs: wr 3, rd 0, flush 2, corrupt 0, gen 0
>> > [  413.728705] BTRFS warning (device sdf): lost page write due to IO error on /dev/sdf
>> > [  413.734030] BTRFS error (device sdf): bdev /dev/sdf errs: wr 4, rd 0, flush 2, corrupt 0, gen 0
>> > [  413.841946] BTRFS info (device sde): allowing degraded mounts
>> > [  413.848622] BTRFS info (device sde): disk space caching is enabled
>> > [  413.877470] BTRFS: has skinny extents
>> > [  413.942027] BTRFS info (device sde): bdev /dev/sdf errs: wr 2, rd 0, flush 1, corrupt 0, gen 0
>> > [  414.076571] BTRFS info (device sde): dev_replace from <missing disk> (devid 4) to /dev/sdg started
>> > [  420.402126] BTRFS info (device sde): dev_replace from <missing disk> (devid 4) to /dev/sdg finished
>> > [  420.646768] sd 4:0:0:0: [sde] Synchronizing SCSI cache
>> > [  420.653786] sd 4:0:0:0: [sde] Stopping disk
>> > [  420.707224] ata5.00: disabled
>>
>> sde is stopped? ata5 is disabled
>
> This is the second replace. The 'failed to rebuild logical...' messages appear
> only during a second replace that targets a different device than the first one.

OK thanks.

Maybe an RFE for a Btrfs umount message to the kernel buffer would be
a good idea? XFS has this:

[166852.899040] XFS (dm-6): Unmounting Filesystem

It can be useful to have kernel confirmation whether a volume is umounted.





-- 
Chris Murphy
