From: Chris Murphy
To: Yauhen Kharuzhy
Cc: Duncan <1i5t5.duncan@cox.net>, Btrfs BTRFS
Subject: Re: RAID6, errors at missing device replacement
Date: Mon, 2 May 2016 13:04:30 -0600

On Mon, May 2, 2016 at 12:43 PM, Yauhen Kharuzhy wrote:
> On Sat, Apr 16, 2016 at 07:37:48AM +0000, Duncan wrote:
>> Yauhen Kharuzhy posted on Fri, 15 Apr 2016 12:49:36 -0700 as excerpted:
>>
>> > I have discovered case when replacement of missing devices causes
>> > metadata corruption. Does anybody know anything about this?
>> >
>> > I use 4.4.5 kernel with latest global spare patches.
>> >
>> > If we have RAID6 (may be reproducible on RAID5 too) and try to replace
>> > one missing drive by other and after this try to remove another drive
>> > and replace it, plenty of errors are shown in the log:
>
> I have reproduced this with vanilla 4.6-rc4 kernel and RAID5.
>
> Script used to reproduce is attached, run as "./test-replace.sh "
>
> Kernel log:
>
> [ 402.878389] BTRFS: device fsid eabede3e-1e50-46cd-92ec-f9476b321f63 devid 1 transid 3 /dev/sdc
> [ 402.911820] BTRFS: device fsid eabede3e-1e50-46cd-92ec-f9476b321f63 devid 2 transid 3 /dev/sdd
> [ 402.972031] BTRFS: device fsid eabede3e-1e50-46cd-92ec-f9476b321f63 devid 3 transid 3 /dev/sde
> [ 403.020067] BTRFS: device fsid eabede3e-1e50-46cd-92ec-f9476b321f63 devid 4 transid 3 /dev/sdf
> [ 404.042312] BTRFS info (device sdf): disk space caching is enabled
> [ 404.051338] BTRFS: has skinny extents
> [ 404.056805] BTRFS: flagging fs with big metadata feature
> [ 404.149815] BTRFS: creating UUID tree
> [ 407.321146] sd 5:0:0:0: [sdf] Synchronizing SCSI cache
> [ 407.349530] sd 5:0:0:0: [sdf] Stopping disk
> [ 407.376682] ata6.00: disabled

Why is ata6 disabled?

> [ 407.695945] BTRFS error (device sdf): bdev /dev/sdf errs: wr 0, rd 0, flush 1, corrupt 0, gen 0
> [ 407.703760] BTRFS warning (device sdf): lost page write due to IO error on /dev/sdf
> [ 407.726179] BTRFS error (device sdf): bdev /dev/sdf errs: wr 1, rd 0, flush 1, corrupt 0, gen 0
> [ 407.733718] BTRFS warning (device sdf): lost page write due to IO error on /dev/sdf
> [ 407.739873] BTRFS error (device sdf): bdev /dev/sdf errs: wr 2, rd 0, flush 1, corrupt 0, gen 0
> [ 410.631220] ata6: hard resetting link

And now reset?

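When reading a log like this, it helps to track each disk by its SCSI address (host:channel:target:lun) rather than by its sdX name. The name is assigned when the disk is probed, so a disk that drops off and is re-probed after a link reset can come back under a different name. Here is a rough Python sketch that flags such renames. It assumes kernel messages in the `sd H:C:T:L: [sdX] ...` form seen in this log. `find_renames` is a hypothetical helper for illustration and is not part of any btrfs or kernel tool:

```python
import re

# Matches sd driver messages such as "sd 5:0:0:0: [sdf] Stopping disk".
# Group 1 is the SCSI host:channel:target:lun address, group 2 the sdX name.
# Lines without a bracketed name (e.g. "Attached scsi generic") are ignored.
SD_LINE = re.compile(r"\bsd (\d+:\d+:\d+:\d+): \[(sd[a-z]+)\]")


def find_renames(log_lines):
    """Return (address, old_name, new_name) for every SCSI address whose
    sdX name changes between messages in the given kernel log lines."""
    last_name = {}  # SCSI address -> most recently seen sdX name
    renames = []
    for line in log_lines:
        match = SD_LINE.search(line)
        if match is None:
            continue
        address, name = match.groups()
        previous = last_name.get(address)
        if previous is not None and previous != name:
            renames.append((address, previous, name))
        last_name[address] = name
    return renames


if __name__ == "__main__":
    # A few lines taken from the kernel log in this thread.
    excerpt = [
        "[ 407.321146] sd 5:0:0:0: [sdf] Synchronizing SCSI cache",
        "[ 407.349530] sd 5:0:0:0: [sdf] Stopping disk",
        "[ 411.278584] sd 5:0:0:0: [sdg] 16777216 512-byte logical blocks: (8.59 GB/8.00 GiB)",
        "[ 420.646768] sd 4:0:0:0: [sde] Synchronizing SCSI cache",
        "[ 424.389067] sd 4:0:0:0: [sdf] 41943040 512-byte logical blocks: (21.5 GB/20.0 GiB)",
    ]
    for address, old, new in find_renames(excerpt):
        print(f"sd {address}: {old} -> {new}")
```

On the excerpt in the `__main__` block it prints `sd 5:0:0:0: sdf -> sdg` and then `sd 4:0:0:0: sde -> sdf`.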
> [ 411.041672] ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> [ 411.090105] ata6.00: ATA-6: VBOX HARDDISK, 1.0, max UDMA/133
> [ 411.153739] ata6.00: 16777216 sectors, multi 128: LBA48 NCQ (depth 31/32)
> [ 411.189534] ata6.00: configured for UDMA/133
> [ 411.225526] ata6: EH complete
> [ 411.229002] scsi 5:0:0:0: Direct-Access ATA VBOX HARDDISK 1.0 PQ: 0 ANSI: 5
> [ 411.278584] sd 5:0:0:0: [sdg] 16777216 512-byte logical blocks: (8.59 GB/8.00 GiB)

sd 5:0:0:0 was sdf but now it's sdg

> [ 411.297341] sd 5:0:0:0: [sdg] Write Protect is off
> [ 411.300054] sd 5:0:0:0: Attached scsi generic sg5 type 0
> [ 411.350875] sd 5:0:0:0: [sdg] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> [ 411.371402] sd 5:0:0:0: [sdg] Attached SCSI disk
> [ 413.663624] BTRFS error (device sdf): bdev /dev/sdf errs: wr 2, rd 0, flush 2, corrupt 0, gen 0
> [ 413.714417] BTRFS warning (device sdf): lost page write due to IO error on /dev/sdf
> [ 413.719450] BTRFS error (device sdf): bdev /dev/sdf errs: wr 3, rd 0, flush 2, corrupt 0, gen 0
> [ 413.728705] BTRFS warning (device sdf): lost page write due to IO error on /dev/sdf
> [ 413.734030] BTRFS error (device sdf): bdev /dev/sdf errs: wr 4, rd 0, flush 2, corrupt 0, gen 0
> [ 413.841946] BTRFS info (device sde): allowing degraded mounts
> [ 413.848622] BTRFS info (device sde): disk space caching is enabled
> [ 413.877470] BTRFS: has skinny extents
> [ 413.942027] BTRFS info (device sde): bdev /dev/sdf errs: wr 2, rd 0, flush 1, corrupt 0, gen 0
> [ 414.076571] BTRFS info (device sde): dev_replace from (devid 4) to /dev/sdg started
> [ 420.402126] BTRFS info (device sde): dev_replace from (devid 4) to /dev/sdg finished
> [ 420.646768] sd 4:0:0:0: [sde] Synchronizing SCSI cache
> [ 420.653786] sd 4:0:0:0: [sde] Stopping disk
> [ 420.707224] ata5.00: disabled

sde is stopped?
ata5 is disabled.

> [ 420.991219] BTRFS error (device sde): bdev /dev/sde errs: wr 0, rd 0, flush 1, corrupt 0, gen 0
> [ 421.006803] BTRFS warning (device sde): lost page write due to IO error on /dev/sde
> [ 421.013813] BTRFS error (device sde): bdev /dev/sde errs: wr 1, rd 0, flush 1, corrupt 0, gen 0
> [ 421.022001] BTRFS warning (device sde): lost page write due to IO error on /dev/sde
> [ 421.032855] BTRFS error (device sde): bdev /dev/sde errs: wr 2, rd 0, flush 1, corrupt 0, gen 0
> [ 423.943549] ata5: hard resetting link

And now reset.

> [ 424.264086] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> [ 424.270354] ata5.00: ATA-6: VBOX HARDDISK, 1.0, max UDMA/133
> [ 424.303915] ata5.00: 41943040 sectors, multi 128: LBA48 NCQ (depth 31/32)
> [ 424.312418] ata5.00: configured for UDMA/133
> [ 424.317876] ata5: EH complete
> [ 424.346139] scsi 4:0:0:0: Direct-Access ATA VBOX HARDDISK 1.0 PQ: 0 ANSI: 5
> [ 424.389067] sd 4:0:0:0: [sdf] 41943040 512-byte logical blocks: (21.5 GB/20.0 GiB)
> [ 424.389110] sd 4:0:0:0: Attached scsi generic sg4 type 0
> [ 424.453500] sd 4:0:0:0: [sdf] Write Protect is off

sd 4:0:0:0 was sde, now it's sdf.

I think there's another bug here instigating all of this. I'm not sure it's a Btrfs bug at all.

-- 
Chris Murphy