From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from smtp2-g21.free.fr ([212.27.42.2]:60863 "EHLO smtp2-g21.free.fr"
    rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
    id S1755140Ab3HRWgG (ORCPT ); Sun, 18 Aug 2013 18:36:06 -0400
Received: from shiva (unknown [88.171.134.228])
    by smtp2-g21.free.fr (Postfix) with ESMTP id AF8E84B00BF
    for ; Mon, 19 Aug 2013 00:35:57 +0200 (CEST)
Message-ID: <52114C4A.6000003@pook.it>
Date: Mon, 19 Aug 2013 00:35:54 +0200
From: Stuart Pook
MIME-Version: 1.0
To: Chris Murphy
CC: Btrfs BTRFS
Subject: Re: uncorrectable errors after btrfs replace
References: <52111C9D.3090704@pook.it> <73B6CA35-5279-44D6-A427-46985C3F554C@colorremedies.com>
In-Reply-To: <73B6CA35-5279-44D6-A427-46985C3F554C@colorremedies.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Sender: linux-btrfs-owner@vger.kernel.org
List-ID:

hi Chris

thanks for your reply. I was unable to save the filesystem. Even after deleting all but 4Gb I still had too many errors so I just reformatted the device. I'm glad that it was my backups and not my data.

On 18/08/13 23:43, Chris Murphy wrote:
> On Aug 18, 2013, at 1:12 PM, Stuart Pook wrote:
>
>> 6 btrfs filesystem resize 580g .
>
> You first shrank a 2TB btrfs file system on dmcrypt device to 590GB.
> But then you didn't resize the dm device or the partition?

No, I had no need to resize the dm device or partition. I had just read that when doing a replace the new device must be no smaller than the old device, so I shrank the filesystem on the old device using "btrfs filesystem resize". Once the resize worked I was able to do the replace, but I didn't try to replace before resizing. This is what btrfs(1) says on Debian: "The targetdev needs to be same size or larger than the srcdev." I may be confused here.

>> 9 time btrfs balance start -musage=1 -dusage=1 . && time btrfs filesystem resize 580g .

I was surprised that the resize to 580Gb didn't work so I tried a magical rebalance before doing the resize to 580 again.
It still didn't work (not enough space) but a resize to 590 Gb did.

>> 10 time btrfs filesystem resize 590g .

This worked.

> You followed the resize of the fs, but not the underlying devices,
> with a balance, then resized it two more times?

The resize to 580 didn't work. So I did a balance. The resize to 580 still didn't work so I resized to 590.

> This is weird, but also makes the sequence difficult to follow.

>> 13 time btrfs replace start /dev/dm-11 /dev/dm-12 -B /disks/backups
>> 14 time btrfs replace start /dev/dm-11 /dev/dm-12 -B /disks/backups

> Why is this command repeated? What's with the numbering system that
> skips numbers?

The command is repeated because I cancelled it by mistake by setting the filesystem to readonly. I'm not sure if I restarted it by rerunning the replace or just by remounting the filesystem readwrite in another window. I'll put all of the commands at the end of this mail.

>> Aug 18 12:28:17 kooka kernel: [54139.448029] ata10: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

> Bad connection so libata is dropping the link from 3 Gbps to 1.5 Gbps.

>> 199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 12080
>
> This confirms that both ends of the cable are sensing communication
> problems between drive and controller. The cable needs to be
> replaced, likely it's the connector not the cable itself.

I think that I should stop using my SATA dock with the SATA ports on my motherboard, which are probably not designed to be hot plugged.

>> I guess that /disks/backup is mostly dead and that I should just
>> reformat it. What do you think?
>
> Well I think I'd try to simplify this drastically and see if you've
> got a reproducing bug.

I ran a badblocks scan on the raw device (not the luks device) and didn't get any errors.

> The steps you've got I find mostly incoherent, so I can't try to do
> what you did to see if it's reproducible.

Yes, this was the first time I've tried this.
And just to make this more difficult some commands were typed in a different window.

>> Next time I'll watch /var/log/syslog but I would have preferred
>> that "btrfs replace" stop when getting errors.
>
> The errors should be self correcting, but the mere fact they're
> happening means that some errors could be occurring but aren't
> detected. If the data is corrupting in-transit, but the drive or
> controller didn't report a problem, then btrfs has no way of knowing
> it was written incorrectly.

The data was written to the WD-Blue (640Gb) disk and then copied off it. The only errors I saw concerned the WD-Blue. If the errors were data corruption on writing or reading the WD-Blue then I would have thought that the checksums would have told me that there was something wrong. btrfs didn't give me an IO error until I started to read the files once the data was on the final disk. Does "btrfs replace" check the checksums as it reads the data from the disk that is being replaced?

Just to be clear, this is the series of btrfs replaces I did:

backups : HD204UI -> WD-Blue
/mnt : WD-Black -> HD204UI
backups : WD-Blue -> WD-Black

I guess that my backups were corrupted as they were written to or read from the WD-Blue. Wouldn't the checksums have detected this problem before the data was written to the WD-Black?

> There's only so much software can do to overcome blatant hardware problems.

I was hoping to be informed of them.

> But, it seems unlikely such a high percent of errors would go
> undetected to result in so many uncorrectable errors, so there may be
> user error here along with a bug.

I'm not sure how I could have done it better. Does "btrfs replace" check that the data is correctly written to the new disk before it is removed from the old disk? Should I have used the 2 disks to make a RAID-1 and then done a scrub before removing the old disk?

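One more cautious sequence, sketched here as an assumption rather than documented btrfs practice, would be to scrub the filesystem before the replace and again afterwards, and only then reuse the old disk. The helper names `run` and `checked_replace` and the `DRY_RUN` variable are hypothetical; by default the script only prints the commands. It also assumes that `btrfs scrub start -B` exits non-zero when it finds errors it cannot correct, which should be checked against the installed btrfs-progs.

```shell
#!/bin/sh
# Hypothetical sketch: scrub, replace, scrub again.
# DRY_RUN=1 (the default) only prints the commands; DRY_RUN=0 runs them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# checked_replace SRCDEV TARGETDEV MOUNTPOINT
# Stops at the first step that fails (assumption: scrub and replace
# report problems through their exit status).
checked_replace() {
    src="$1"
    dst="$2"
    mnt="$3"
    # Verify checksums on the source device before copying anything.
    run btrfs scrub start -Bd "$mnt" || return 1
    # Same argument order as the replace commands in this mail.
    run btrfs replace start "$src" "$dst" -B "$mnt" || return 1
    # Verify checksums again, now reading from the target device.
    run btrfs scrub start -Bd "$mnt"
}
```

For example, `checked_replace /dev/dm-11 /dev/dm-12 /disks/backups` prints the three commands in order without touching any device.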
Here is the complete list of commands I made in the main terminal

 1 cd /disks/backups/
 2 btrfs filesystem df
 3 btrfs filesystem df ,
 4*
 5 btrfs filesystem df .
 6 btrfs filesystem resize 580g .
 7 date
 8 btrfs filesystem df .
 9 time btrfs balance start -musage=1 -dusage=1 . && time btrfs filesystem resize 580g .
10 time btrfs filesystem resize 590g .
11 btrfs filesystem show
12 cryptsetup luksOpen /dev/sdd2 640Gb
13 time btrfs replace start /dev/dm-11 /dev/dm-12 -B /disks/backups
14 time btrfs replace start /dev/dm-11 /dev/dm-12 -B /disks/backups
15 cd /
16 btrfs filesystem show
17 btrfs filesystem show
18 cryptsetup remove _dev_sdc2
19 fdisk /dev/sdc
20 fdisk /dev/sdc
21 fdisk -c /dev/sdc
22 fdisk -c=dos /dev/sdc
23 fdisk /dev/sdc
24 fdisk -c=dos /dev/sdc
25 l /mnt
26 mount /dev/sdb1 /mnt
27 l /mnt
28 btrfs subv list /mnt
29 btrfs filesystem show
30 #time btrfs replace start /dev/dm-11 /dev/dm-12 -B /disks/backups
31 fdisk -l /dev/sdc
32 time btrfs replace start /dev/sdb1 /dev/sdc2 -B /mnt
33 btrfs filesystem show
34 btrfs filesystem label /dev/dm-12
35 btrfs filesystem label /disks/backups
36 btrfs filesystem label /disks/backups backups2Tb
37 btrfs filesystem show
38 btrfs filesystem label /disks/backups
39 cryptsetup luksFormat /dev/sdb2
40 cryptsetup luksAddKey /dev/sdb2
41 cryptsetup open /dev/sdb2 newbackups
42 l /dev/mapper/newbackups
43 time btrfs replace start /dev/dm-12 /dev/dm-11 -B /disks/backups
44 btrfs filesystem show
45 cryptsetup status 640Gb
46 cryptsetup remove 640Gb
47 btrfs filesystem show
48 btrfs filesystem df /disks/backups/
49 btrfs filesystem resize max /disks/backups/
50 btrfs filesystem df /disks/backups/
51 btrfs filesystem show
52 vi /etc/cron.daily/storebackup
53 vi /etc/cron.daily/stuart
54 /etc/local/backups
55 mount
56 mount -o remount,rw /disks/backups/
57 time btrfs scrub start -Bd /disks/backups
58 smartctl -a /dev/sdb
59 smartctl -a /dev/sdc
60 smartctl -a /dev/sdd
61 smartctl -t short /dev/sdd
62 sleep 2m; smartctl -a /dev/sdd
63 history > /tmp/root.commands

Which disk is which?

WD-Black ata-WDC_WD2002FAEX-007BA0_WD-WCAY00589823 -> ../../sdb
HD204UI  ata-ST2000DL004_HD204UI_S2H7J90C549571 -> ../../sdc
WD-Blue  ata-WDC_WD6400AAKS-00A7B2_WD-WMASY2546840 -> ../../sdd

please let me know if I can be any clearer,

thanks
Stuart