* Fwd: btrfs replace seems to corrupt the file system
  [not found] <CA+xOVSOD1YY-=Cm+vmzTUV9cHe9idtDkRr0RmpRP5a0Z6eC4YQ@mail.gmail.com>
@ 2015-06-27 23:17 ` Mordechay Kaganer
  2015-06-28  0:52   ` Moby
                     ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Mordechay Kaganer @ 2015-06-27 23:17 UTC (permalink / raw)
  To: linux-btrfs

B.H.

Hello. I'm running our backup archive on btrfs. We have an MD-based RAID5
array with 4 6TB disks, then LVM on top of it, and a btrfs volume on the
LV (we don't use btrfs's own RAID features because we want RAID5 and, as
far as I understand, the support is only partial).

I wanted to move the archive to another MD array of 4 8TB drives (this
time without LVM). So I did:

btrfs replace start 1 /dev/md1 <mount_point>

Where 1 is the only devid that was present and /dev/md1 is the new array.

The replace ran successfully until it finished after more than 5 days.
The system downloaded some fresh backups and created new snapshots
during the ongoing replace. I got 2 kernel warnings about the replace
task waiting for more than 120 seconds in the middle, but the process
seemed to go on anyway.

After the replace had finished I did btrfs fi resize 1:max
<mount_point>, then unmounted and mounted again using the new drive.

Then I ran a scrub on the FS - and got a lot of checksum errors.
Messages like this:

BTRFS: checksum error at logical 5398405586944 on dev /dev/md1, sector
10576283152, root 12788, inode 4512290, offset 23592960, length 4096,
links 1 (path: XXXXXXXXX)
BTRFS: bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 67165, gen 0
BTRFS: unable to fixup (regular) error at logical 5398405586944 on dev /dev/md1

Is there any way to fix this? I still have the old array available, but
the replace has wiped out its superblock so it's not mountable.

# uname -a
Linux <hostname> 3.16.0-41-generic #57~14.04.1-Ubuntu SMP Thu Jun 18
18:01:13 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

# btrfs --version
Btrfs v3.12

--
משיח NOW!
Moshiach is coming very soon, prepare yourself!
יחי אדוננו מורינו ורבינו מלך המשיח לעולם ועד!

^ permalink raw reply	[flat|nested] 15+ messages in thread
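[For readers skimming the thread, the migration described above, expressed
as commands, would look roughly like the following. This is only a sketch:
the mount point is a placeholder, and the devid and timings come from the
report above.]

# btrfs replace start 1 /dev/md1 /mnt/backup    # devid 1 is the only existing device
# btrfs replace status /mnt/backup              # poll progress; this ran for over 5 days here
# btrfs filesystem resize 1:max /mnt/backup     # grow the fs to fill the larger new array
# umount /mnt/backup
# mount /dev/md1 /mnt/backup                    # remount from the new device
# btrfs scrub start -B /mnt/backup              # -B stays in the foreground and prints a summary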
* Re: Fwd: btrfs replace seems to corrupt the file system
  2015-06-27 23:17 ` Fwd: btrfs replace seems to corrupt the file system Mordechay Kaganer
@ 2015-06-28  0:52   ` Moby
  2015-06-28 16:31   ` Mordechay Kaganer
  2015-06-28 16:45   ` Chris Murphy
  2 siblings, 0 replies; 15+ messages in thread
From: Moby @ 2015-06-28 0:52 UTC (permalink / raw)
  To: linux-btrfs

On 06/27/2015 06:17 PM, Mordechay Kaganer wrote:
> B.H.
>
> Hello. I'm running our backup archive on btrfs. We have an MD-based RAID5
> array with 4 6TB disks, then LVM on top of it, and a btrfs volume on the
> LV (we don't use btrfs's own RAID features because we want RAID5 and, as
> far as I understand, the support is only partial).
>
> I wanted to move the archive to another MD array of 4 8TB drives (this
> time without LVM). So I did:
>
> btrfs replace start 1 /dev/md1 <mount_point>
>
> Where 1 is the only devid that was present and /dev/md1 is the new array.
>
> The replace ran successfully until it finished after more than 5 days.
> The system downloaded some fresh backups and created new snapshots
> during the ongoing replace. I got 2 kernel warnings about the replace
> task waiting for more than 120 seconds in the middle, but the process
> seemed to go on anyway.
>
> After the replace had finished I did btrfs fi resize 1:max
> <mount_point>, then unmounted and mounted again using the new drive.
>
> Then I ran a scrub on the FS - and got a lot of checksum errors.
> Messages like this:
>
> BTRFS: checksum error at logical 5398405586944 on dev /dev/md1, sector
> 10576283152, root 12788, inode 4512290, offset 23592960, length 4096,
> links 1 (path: XXXXXXXXX)
> BTRFS: bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 67165, gen 0
> BTRFS: unable to fixup (regular) error at logical 5398405586944 on dev /dev/md1
>
> Is there any way to fix this? I still have the old array available, but
> the replace has wiped out its superblock so it's not mountable.
>
> # uname -a
> Linux <hostname> 3.16.0-41-generic #57~14.04.1-Ubuntu SMP Thu Jun 18
> 18:01:13 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>
> # btrfs --version
> Btrfs v3.12
>

I was seeing insane behavior with btrfs and the kernel versions from the
stock/update distro repositories. Upgrading the kernel to stable
(4.1.0-1.gfcf8349-default as of today) and btrfs-progs to
btrfs-progs v4.1+20150622 resolved the insane behavior (such as negative
percentages left during tasks, etc.) and the errors I was seeing.

--
--Moby

They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety.
   -- Benjamin Franklin

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: btrfs replace seems to corrupt the file system
  2015-06-27 23:17 ` Fwd: btrfs replace seems to corrupt the file system Mordechay Kaganer
  2015-06-28  0:52   ` Moby
@ 2015-06-28 16:31   ` Mordechay Kaganer
  2015-06-29  2:50     ` Duncan
  2015-06-28 16:45   ` Chris Murphy
  2 siblings, 1 reply; 15+ messages in thread
From: Mordechay Kaganer @ 2015-06-28 16:31 UTC (permalink / raw)
  To: linux-btrfs

On Sun, Jun 28, 2015 at 2:17 AM, Mordechay Kaganer <mkaganer@gmail.com> wrote:
> B.H.
>
> Hello. I'm running our backup archive on btrfs. We have an MD-based RAID5
> array with 4 6TB disks, then LVM on top of it, and a btrfs volume on the
> LV (we don't use btrfs's own RAID features because we want RAID5 and, as
> far as I understand, the support is only partial).
>
> I wanted to move the archive to another MD array of 4 8TB drives (this
> time without LVM). So I did:
>
> btrfs replace start 1 /dev/md1 <mount_point>
>
> Where 1 is the only devid that was present and /dev/md1 is the new array.
>
> The replace ran successfully until it finished after more than 5 days.
> The system downloaded some fresh backups and created new snapshots
> during the ongoing replace. I got 2 kernel warnings about the replace
> task waiting for more than 120 seconds in the middle, but the process
> seemed to go on anyway.
>
> After the replace had finished I did btrfs fi resize 1:max
> <mount_point>, then unmounted and mounted again using the new drive.
>
> Then I ran a scrub on the FS - and got a lot of checksum errors.
> Messages like this:
>
> BTRFS: checksum error at logical 5398405586944 on dev /dev/md1, sector
> 10576283152, root 12788, inode 4512290, offset 23592960, length 4096,
> links 1 (path: XXXXXXXXX)
> BTRFS: bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 67165, gen 0
> BTRFS: unable to fixup (regular) error at logical 5398405586944 on dev /dev/md1
>
> Is there any way to fix this? I still have the old array available, but
> the replace has wiped out its superblock so it's not mountable.
>
> # uname -a
> Linux <hostname> 3.16.0-41-generic #57~14.04.1-Ubuntu SMP Thu Jun 18
> 18:01:13 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>
> # btrfs --version
> Btrfs v3.12

I'm trying to recover the original data from before the replace
operation. What I did so far is restore the superblock of the original
(replaced) device from a backup superblock copy, like this:

btrfs-select-super -s 2 /dev/mapper/XXXXXX

This worked, so the btrfs tools now recognize the device as having a
btrfs volume on it. I did a full btrfs check on the partition - it
didn't find any errors, at least per my understanding. But it's
impossible to mount the volume.

When trying to mount the volume I get the following messages in dmesg:

[109989.432274] BTRFS warning (device dm-2): cannot mount because device replace operation is ongoing and
[109989.432280] BTRFS warning (device dm-2): tgtdev (devid 0) is missing, need to run 'btrfs dev scan'?
[109989.432282] BTRFS: failed to init dev_replace: -5
[109989.459719] BTRFS: open_ctree failed

On the other hand, the "replaced" device mounts OK, but btrfs scrub
returns lots of checksum errors, so I fear the data is probably corrupt.
The volume is about 15TB and has many subvolumes and snapshots, so
finding out what exactly is corrupt will be very tricky.

Any idea what I can do to recover the data?

--
משיח NOW!
Moshiach is coming very soon, prepare yourself!
יחי אדוננו מורינו ורבינו מלך המשיח לעולם ועד!

^ permalink raw reply	[flat|nested] 15+ messages in thread
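[For reference, the recovery attempt described above amounts to roughly
the following. The LV name is a placeholder, and -s 2 selects one of the
backup superblock copies to promote over the damaged primary one:]

# btrfs-select-super -s 2 /dev/mapper/vg0-backup   # copy backup superblock #2 over the primary
# btrfs check /dev/mapper/vg0-backup               # read-only check; the fs must stay unmounted
# mount /dev/mapper/vg0-backup /mnt/old            # this is the step that fails...
# dmesg | tail                                     # ...with the dev_replace / open_ctree errors quoted above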
* Re: btrfs replace seems to corrupt the file system
  2015-06-28 16:31   ` Mordechay Kaganer
@ 2015-06-29  2:50     ` Duncan
  0 siblings, 0 replies; 15+ messages in thread
From: Duncan @ 2015-06-29 2:50 UTC (permalink / raw)
  To: linux-btrfs

Mordechay Kaganer posted on Sun, 28 Jun 2015 19:31:31 +0300 as excerpted:

> On Sun, Jun 28, 2015 at 2:17 AM, Mordechay Kaganer <mkaganer@gmail.com>
> wrote:
>> B.H.
>>
>> Hello. I'm running our backup archive on btrfs. We have an MD-based RAID5
>> array with 4 6TB disks, then LVM on top of it, and a btrfs volume on the
>> LV (we don't use btrfs's own RAID features because we want RAID5 and, as
>> far as I understand, the support is only partial).

(I see people already helping with the primary issue so won't address
that here. However, addressing the above...)

FWIW... btrfs raid56 (5 and 6) support is now (from kernel 3.19) "code
complete". However, "code complete" is far from "stable and mature", and
I (as a list regular but not a dev) have been recommending that people
continue to hold off a few kernels until it has had some time to
stabilize to more or less the same point btrfs itself is at, unless of
course their purpose is actually to test the code with data they're
prepared to lose, report bugs and help get them fixed, in which case,
welcome aboard! =:^)

Of course btrfs itself isn't really mature or entirely stable yet, tho
it's reasonable for ordinary use, provided the sysadmins' rule of
backups is observed: (a) if it's not backed up, by definition the data
is worth less to you than the time and media required to do the backups,
despite any claims to the contrary, and (b) for purposes of this rule, a
would-be backup that hasn't been tested restorable isn't yet a backup.

But back to raid56: my recommendation has been to wait at LEAST TWO
kernel cycles, which would be the just-released 4.1, and even then,
consider it bleeding edge and be prepared to deal with bugs. For
stability comparable to btrfs in general, my recommendation is to wait
at least a year, which happens to be about five kernel cycles, so until
at least 4.4. At that point, either check a few weeks of list traffic
and decide for yourself based on that, or ask, but that's a reasonably
educated guess.

Btrfs raid56 bottom line: 4.1 is the minimal two-kernel-cycle code
maturity I suggested; if you're prepared to be bleeding edge, try it.
Else wait the full year, kernel 4.4 or so.

(More below...)

>> I wanted to move the archive to another MD array of 4 8TB drives (this
>> time without LVM). So I did:
>>
>> btrfs replace start 1 /dev/md1 <mount_point>
>>
>> Where 1 is the only devid that was present and /dev/md1 is the new
>> array.

FWIW, I hadn't even considered the possibility of doing a replace from a
single device. I had thought it required raid mode. But if it appeared
to work...

>> The replace ran successfully until it finished after more than 5 days.
>> The system downloaded some fresh backups and created new snapshots
>> during the ongoing replace. I got 2 kernel warnings about the replace
>> task waiting for more than 120 seconds in the middle, but the process
>> seemed to go on anyway.
>>
>> After the replace had finished I did btrfs fi resize 1:max
>> <mount_point>, then unmounted and mounted again using the new drive.
>>
>> Then I ran a scrub on the FS - and got a lot of checksum errors.

Had you done a pre-replace scrub on the existing device? If not, is the
corruption actually new, or from before the replace and simply
transferred? You don't know.
Meanwhile, one reason not to particularly like the idea of btrfs on top
of something like mdraid is that btrfs is checksumming and operationally
verifying, while mdraid is not. If btrfs reports an error, was it at the
media level (and if so, on which raid member device), the raid level,
the btrfs level, or...?

Tho for mdraid5/6 you can do a raid scrub, and hopefully detect and
correct media- and raid-level errors, but you still don't have raid-level
checksum verification. And with multiple terabyte drives that's
definitely going to take a while!

With btrfs raid1/10 there will be a second, hopefully checksum-valid,
copy to use and rebuild from. And btrfs raid56 should be able to
reconstruct hopefully checksum-valid data from parity, tho of course at
its maturity level one can't yet assume it's entirely bug-free.

(Again, as I observed above, the problem resolution is occurring on
another subthread, so I'll leave this at the above.)

--
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

^ permalink raw reply	[flat|nested] 15+ messages in thread
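[For completeness, the mdraid-level scrub mentioned above is driven through
sysfs; md1 here stands in for whichever array backs the filesystem:]

# echo check > /sys/block/md1/md/sync_action   # start a read-only consistency check of the array
# cat /proc/mdstat                             # shows check progress and estimated time remaining
# cat /sys/block/md1/md/mismatch_cnt           # non-zero after the check means parity mismatches were found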
* Re: btrfs replace seems to corrupt the file system
  2015-06-27 23:17 ` Fwd: btrfs replace seems to corrupt the file system Mordechay Kaganer
  2015-06-28  0:52   ` Moby
  2015-06-28 16:31   ` Mordechay Kaganer
@ 2015-06-28 16:45   ` Chris Murphy
  2015-06-28 18:02     ` Mordechay Kaganer
  2 siblings, 1 reply; 15+ messages in thread
From: Chris Murphy @ 2015-06-28 16:45 UTC (permalink / raw)
  To: Btrfs BTRFS

On Sat, Jun 27, 2015 at 5:17 PM, Mordechay Kaganer <mkaganer@gmail.com> wrote:

> # uname -a
> Linux <hostname> 3.16.0-41-generic #57~14.04.1-Ubuntu SMP Thu Jun 18
> 18:01:13 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>
> # btrfs --version
> Btrfs v3.12

Well, it's over a weekend, so many devs may not get around to responding
until Monday; if it's urgent then IRC is probably better. But the thing
is, the kernel and btrfs-progs are kinda old. So I'm reasonably sure the
first suggestion is going to be to upgrade both of them; it's sorta par
for the course with Btrfs problems.

Option A: Maybe someone has advice on how to get the demoted device to
be valid again, as if the replace command hadn't been used. Because then
you could try the replace again with a newer kernel and progs, and see
if the problem still happens. That's a good question to ask on IRC if
you don't have a response by tomorrow.

Option B: In the meantime, start to check some files to see if they're
actually corrupt or if these csum errors are bogus.

Option C: Well, it's a backup, so actually, before A or B, it's probably
best to start making a new backup in case this one is beyond repair.
Hopefully you have a 3rd location to put it in, so you can keep both of
these Btrfs volumes in their current states until you have a clearer
idea what to do with them, but at least you're not delaying getting a
current backup in place. Whether this new backup is Btrfs based or not
is less important than actually having a backup.

--
Chris Murphy

^ permalink raw reply	[flat|nested] 15+ messages in thread
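[One possible way to act on option B, assuming the secondary backup
mentioned later in the thread holds copies of the same files; paths are
hypothetical, and reads may simply fail with EIO where btrfs refuses to
return data that fails csum verification:]

# md5sum /mnt/new-backup/some/dir/file                                # may return an I/O error instead of a hash
# cmp /mnt/new-backup/some/dir/file /mnt/other-backup/some/dir/file   # byte-for-byte comparison against the other copy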
* Re: btrfs replace seems to corrupt the file system
  2015-06-28 16:45   ` Chris Murphy
@ 2015-06-28 18:02     ` Mordechay Kaganer
  2015-06-28 18:30       ` Chris Murphy
  2015-06-28 18:50       ` Noah Massey
  0 siblings, 2 replies; 15+ messages in thread
From: Mordechay Kaganer @ 2015-06-28 18:02 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

B.H.

Thanks for the reply.

On Sun, Jun 28, 2015 at 7:45 PM, Chris Murphy <lists@colorremedies.com> wrote:
>
> Option A: Maybe someone has advice on how to get the demoted device to
> be valid again, as if the replace command hadn't been used. Because then
> you could try the replace again with a newer kernel and progs, and see
> if the problem still happens. That's a good question to ask on IRC if
> you don't have a response by tomorrow.

To recover the old device - that's what I'm trying to do. I asked on IRC
also, no reply. As stated above, the device passes btrfs check without
errors but cannot mount, because it complains about an "ongoing replace"
and the replace target device being missing.

> Option B: In the meantime, start to check some files to see if they're
> actually corrupt or if these csum errors are bogus.

I tried to copy some files that are reported with bad checksums out of
the "new" volume. The copy fails with messages like this:

[181896.761117] BTRFS info (device md1): csum failed ino 3849795 off 1388544 csum 2566472073 expected csum 3428551483
[181896.761362] BTRFS info (device md1): csum failed ino 3849795 off 1519616 csum 2566472073 expected csum 1565909691
[181896.761997] BTRFS info (device md1): csum failed ino 3849795 extent 5084061945856 csum 2566472073 wanted 2627769260 mirror 0
[181896.769091] BTRFS info (device md1): csum failed ino 3849795 off 1257472 csum 2566472073 expected csum 4184704592
[181896.769509] BTRFS info (device md1): csum failed ino 3849795 off 1257472 csum 2566472073 expected csum 4184704592
[181897.171789] BTRFS info (device md1): csum failed ino 2940181 extent 4288477184000 csum 2566472073 wanted 1434149511 mirror 0
[181897.171984] BTRFS info (device md1): csum failed ino 2940181 extent 4288477270016 csum 2566472073 wanted 439924019 mirror 0
[181897.172199] BTRFS info (device md1): csum failed ino 2940181 extent 4288477356032 csum 2566472073 wanted 3293573949 mirror 0

> Option C: Well, it's a backup, so actually, before A or B, it's probably
> best to start making a new backup in case this one is beyond repair.
> Hopefully you have a 3rd location to put it in, so you can keep both of
> these Btrfs volumes in their current states until you have a clearer
> idea what to do with them, but at least you're not delaying getting a
> current backup in place. Whether this new backup is Btrfs based or not
> is less important than actually having a backup.

Unfortunately, I don't have that much free storage; it's more than 15TB.
But I do have another backup, so the recovery is not so urgent. The
btrfs backup is the only place where I have the older snapshots of the
data (actually, this is why we decided to use btrfs in the first place),
and reformatting would mean losing that older data.

--
משיח NOW!
Moshiach is coming very soon, prepare yourself!
יחי אדוננו מורינו ורבינו מלך המשיח לעולם ועד!

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: btrfs replace seems to corrupt the file system
  2015-06-28 18:02     ` Mordechay Kaganer
@ 2015-06-28 18:30       ` Chris Murphy
  0 siblings, 0 replies; 15+ messages in thread
From: Chris Murphy @ 2015-06-28 18:30 UTC (permalink / raw)
  To: Btrfs BTRFS

There is a workaround for the file system not reading files when there
are csum errors, which is btrfs check --init-csum-tree, but there is a
caveat: you might need a newer btrfs-progs. I forget exactly which
version will reconstruct new csums, maybe 3.18 or 3.19. Older versions
create a new csum tree that's empty, so you get a bunch of errors for
everything, but you can at least still read the files and see if they're
corrupt or not.

Before you do that, you could try btrfs restore to extract some files
and see if they are really corrupt or if the csum errors are bogus.

If the files are corrupted, then the (new device) backup is probably
useless anyway and there's no point in keeping it around. You can still
keep the old device backup that won't mount, in case someone eventually
can say how to reset it as if the replace had not occurred.

I'm willing to bet that the data is OK, and that you've run into some
obscure bug where csum computation during the replace went bad, so it's
just that the csums are all wrong. Speculation though.

Chris Murphy

^ permalink raw reply	[flat|nested] 15+ messages in thread
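[A rough sketch of the two workarounds mentioned above; both operate on
the unmounted filesystem, and the restore target directory is a
placeholder, not something from the thread:]

# btrfs restore -v -i /dev/md1 /mnt/scratch/restore-test   # copy files out; -i ignores errors, -v is verbose
# sha1sum /mnt/scratch/restore-test/somefile               # compare against the copy in the other backup
# btrfs check --init-csum-tree /dev/md1                    # last resort: recreate the csum tree in place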
* Re: btrfs replace seems to corrupt the file system
  2015-06-28 18:02     ` Mordechay Kaganer
  2015-06-28 18:30       ` Chris Murphy
@ 2015-06-28 18:50       ` Noah Massey
  2015-06-28 19:08         ` Chris Murphy
  2015-06-28 19:20         ` Mordechay Kaganer
  1 sibling, 2 replies; 15+ messages in thread
From: Noah Massey @ 2015-06-28 18:50 UTC (permalink / raw)
  To: Btrfs BTRFS

On Sun, Jun 28, 2015 at 2:02 PM, Mordechay Kaganer <mkaganer@gmail.com> wrote:
> To recover the old device - that's what I'm trying to do. I asked on IRC
> also, no reply. As stated above, the device passes btrfs check without
> errors but cannot mount, because it complains about an "ongoing replace"
> and the replace target device being missing.

Standard disclaimer: not a dev, just a user.

The following worked for me to recover the old device after reproducing
your situation (where loop0 is my "old" device):

# mount -t btrfs -o degraded /dev/loop0 /mnt
# btrfs replace cancel /mnt
# umount /mnt
# mount -t btrfs /dev/loop0 /mnt

The mount now succeeds without error.

$ uname -r
4.1.0
$ btrfs version
btrfs-progs v4.1

^ permalink raw reply	[flat|nested] 15+ messages in thread
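[For anyone wanting to reproduce this at small scale, a loop-device setup
along the lines Noah used might look like this; image names and sizes are
arbitrary:]

# truncate -s 4G old.img new.img
# losetup /dev/loop0 old.img
# losetup /dev/loop1 new.img
# mkfs.btrfs /dev/loop0
# mount /dev/loop0 /mnt
# cp -a /some/test/data /mnt/
# btrfs replace start 1 /dev/loop1 /mnt    # migrate the single-device fs (devid 1) onto loop1
# btrfs replace status /mnt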
* Re: btrfs replace seems to corrupt the file system
  2015-06-28 18:50       ` Noah Massey
@ 2015-06-28 19:08         ` Chris Murphy
  2015-06-28 19:20         ` Mordechay Kaganer
  1 sibling, 0 replies; 15+ messages in thread
From: Chris Murphy @ 2015-06-28 19:08 UTC (permalink / raw)
  To: Btrfs BTRFS

On Sun, Jun 28, 2015 at 12:50 PM, Noah Massey <noah.massey@gmail.com> wrote:
> On Sun, Jun 28, 2015 at 2:02 PM, Mordechay Kaganer <mkaganer@gmail.com> wrote:
>> To recover the old device - that's what I'm trying to do. I asked on IRC
>> also, no reply. As stated above, the device passes btrfs check without
>> errors but cannot mount, because it complains about an "ongoing replace"
>> and the replace target device being missing.
>
> Standard disclaimer: not a dev, just a user.
> The following worked for me to recover the old device after reproducing
> your situation (where loop0 is my "old" device):
>
> # mount -t btrfs -o degraded /dev/loop0 /mnt
> # btrfs replace cancel /mnt
> # umount /mnt
> # mount -t btrfs /dev/loop0 /mnt
>
> The mount now succeeds without error.

Neat trick!

> $ uname -r
> 4.1.0
> $ btrfs version
> btrfs-progs v4.1

Yeah, I definitely advise a newer kernel and progs. Even if this trick
works with the older kernel (which seems reasonably likely), the next
attempt at btrfs replace needs to happen with a newer kernel and progs
anyway.

A bit of an off-topic question: the device used as the target for the
replacement still has superblocks from the first (failed, bad csum)
attempt. So I wonder to what degree it's non-deterministic to try this
again without erasing at least those old superblocks first. The thing is,
they have the same UUID; if the UUID from a different volume were used,
there would be no ambiguity between the stale and the new data, but is
there any possibility of confusion on a new attempt when the UUID is the
same?

--
Chris Murphy

^ permalink raw reply	[flat|nested] 15+ messages in thread
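[On the stale-superblock question above, one way to make a retry
unambiguous would be to wipe the target's old signatures before starting
over; whether the kernel actually gets confused without this is exactly
the open question here:]

# wipefs -a /dev/md1    # erase all known filesystem signatures on the replace target
# blkid /dev/md1        # confirm no stale btrfs signature (with the same UUID) remains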
* Re: btrfs replace seems to corrupt the file system
  2015-06-28 18:50       ` Noah Massey
  2015-06-28 19:08         ` Chris Murphy
@ 2015-06-28 19:20         ` Mordechay Kaganer
  2015-06-28 19:32           ` Chris Murphy
  1 sibling, 1 reply; 15+ messages in thread
From: Mordechay Kaganer @ 2015-06-28 19:20 UTC (permalink / raw)
  To: Noah Massey; +Cc: Btrfs BTRFS

B.H.

On Sun, Jun 28, 2015 at 9:50 PM, Noah Massey <noah.massey@gmail.com> wrote:
> On Sun, Jun 28, 2015 at 2:02 PM, Mordechay Kaganer <mkaganer@gmail.com> wrote:
>> To recover the old device - that's what I'm trying to do. I asked on IRC
>> also, no reply. As stated above, the device passes btrfs check without
>> errors but cannot mount, because it complains about an "ongoing replace"
>> and the replace target device being missing.
>
> Standard disclaimer: not a dev, just a user.
> The following worked for me to recover the old device after reproducing
> your situation (where loop0 is my "old" device):
>
> # mount -t btrfs -o degraded /dev/loop0 /mnt
> # btrfs replace cancel /mnt
> # umount /mnt
> # mount -t btrfs /dev/loop0 /mnt
>
> The mount now succeeds without error.

Yeah! That worked even with my old kernel/btrfs-progs. Thank you very
much. Now the old volume mounts OK.

The next step for me is to run a scrub on it to see if the data is
actually intact.

Then (if it's OK, hopefully) we'll see how to redo the replace. Maybe
unmounting and doing a simple "dd" would be the best option? At least
it's not going to corrupt the original data :-).

--
משיח NOW!
Moshiach is coming very soon, prepare yourself!
יחי אדוננו מורינו ורבינו מלך המשיח לעולם ועד!

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: btrfs replace seems to corrupt the file system
  2015-06-28 19:20         ` Mordechay Kaganer
@ 2015-06-28 19:32           ` Chris Murphy
  2015-06-29  5:02             ` Mordechay Kaganer
  0 siblings, 1 reply; 15+ messages in thread
From: Chris Murphy @ 2015-06-28 19:32 UTC (permalink / raw)
  To: Btrfs BTRFS

On Sun, Jun 28, 2015 at 1:20 PM, Mordechay Kaganer <mkaganer@gmail.com> wrote:

> Then (if it's OK, hopefully) we'll see how to redo the replace. Maybe
> unmounting and doing a simple "dd" would be the best option? At least
> it's not going to corrupt the original data :-).

Use of dd can cause corruption of the original.

"Do not make a block-level copy of a btrfs filesystem to a block device,
and then try to mount either the original or the copy while both are
visible to the same kernel."
https://btrfs.wiki.kernel.org/index.php/Gotchas

Once you do the dd, you can't mount either one of the copies until one
of them is completely hidden (i.e. on an LV that's inactive and flagged
to never automatically become active). I think it's too risky just to
avoid using a newer kernel. I'd sooner create a new file system and
tediously btrfs send/receive the subvolumes you want to keep.

--
Chris Murphy

^ permalink raw reply	[flat|nested] 15+ messages in thread
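[A sketch of the precautions Chris describes, assuming the old copy lives
on a hypothetical LV vg0/backup; the point is that only one device
carrying that UUID may ever be visible to the running kernel:]

# umount /mnt/backup                          # neither copy may be mounted while dd runs
# dd if=/dev/vg0/backup of=/dev/md1 bs=64M
# lvchange -an vg0/backup                     # deactivate the old LV...
# lvchange --setactivationskip y vg0/backup   # ...and flag it so it is not auto-activated again
# mount /dev/md1 /mnt/backup                  # only now mount the copy on the new array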
* Re: btrfs replace seems to corrupt the file system
  2015-06-28 19:32           ` Chris Murphy
@ 2015-06-29  5:02             ` Mordechay Kaganer
  2015-06-29  8:08               ` Duncan
  0 siblings, 1 reply; 15+ messages in thread
From: Mordechay Kaganer @ 2015-06-29 5:02 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

B.H.

On Sun, Jun 28, 2015 at 10:32 PM, Chris Murphy <lists@colorremedies.com> wrote:
> On Sun, Jun 28, 2015 at 1:20 PM, Mordechay Kaganer <mkaganer@gmail.com> wrote:
>
>> Then (if it's OK, hopefully) we'll see how to redo the replace. Maybe
>> unmounting and doing a simple "dd" would be the best option? At least
>> it's not going to corrupt the original data :-).
>
> Use of dd can cause corruption of the original.
>

But doing a block-level copy, while taking care that the original volume
is hidden from the kernel when mounting the new one, is safe, isn't it?

Anyway, what is the "straightforward" and recommended way of replacing
the underlying device of a single-device btrfs that doesn't use any raid
features? I can see 3 options:

1. btrfs replace - as far as I understand, it's primarily intended for
replacing member disks under btrfs's own raid.

2. Add a new device, then remove the old one. Maybe this way we'll need
to do a full balance after that?

3. Block-level copy of the partition, then hide the original from the
kernel to avoid confusion because of the same UUID. Of course, this way
the volume is going to be off-line until the copy is finished.

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: btrfs replace seems to corrupt the file system
  2015-06-29  5:02             ` Mordechay Kaganer
@ 2015-06-29  8:08               ` Duncan
  2015-06-29 11:23                 ` Mike Fleetwood
  2015-06-29 11:39                 ` Mordechay Kaganer
  0 siblings, 2 replies; 15+ messages in thread
From: Duncan @ 2015-06-29 8:08 UTC (permalink / raw)
  To: linux-btrfs

Mordechay Kaganer posted on Mon, 29 Jun 2015 08:02:01 +0300 as excerpted:

> On Sun, Jun 28, 2015 at 10:32 PM, Chris Murphy <lists@colorremedies.com>
> wrote:
>> On Sun, Jun 28, 2015 at 1:20 PM, Mordechay Kaganer <mkaganer@gmail.com>
>> wrote:
>>
>> Use of dd can cause corruption of the original.
>>
> But doing a block-level copy, while taking care that the original volume
> is hidden from the kernel when mounting the new one, is safe, isn't it?

As long as neither one is mounted while doing the copy, and one or the
other is hidden before any attempt to mount, it should be safe, yes.

The base problem is that btrfs can be multi-device, and that it tracks
the devices belonging to the filesystem by UUID, so as soon as it sees
another device with the same UUID, it considers it part of the same
filesystem. Writes can go to any of the devices it considers a component
device, and after a write creates a difference, reads can end up coming
from the stale one.

Meanwhile, unlike many filesystems, btrfs uses the UUID as part of the
metadata, so changing the UUID isn't as simple as rewriting a
superblock; the metadata must be rewritten to the new UUID. There's
actually a tool now available to do just that, but it's new enough that
I'm not even sure it's available in release form yet; if so, it'll be in
the latest releases. Otherwise, it'd be in the integration branch.

And FWIW, a different aspect of the same problem can occur in raid1
mode, when a device drops out and is later reintroduced, with both
devices separately mounted rw,degraded and updated in the meantime.
Normally, btrfs will track the generation, a monotonically increasing
integer, and will read from the higher/newer generation, but with
separate updates to each, if they both happen to have the same
generation when reunited... So for raid1 mode, the recommendation is
that if there's a split and one copy continues to be updated, be sure
the other one isn't separately mounted writable and the two then
combined again; or, if both must be separately mounted writable and then
recombined, wipe one and add it as a new device, thus avoiding the
possibility of confusion.

> Anyway, what is the "straightforward" and recommended way of replacing
> the underlying device of a single-device btrfs that doesn't use any raid
> features? I can see 3 options:
>
> 1. btrfs replace - as far as I understand, it's primarily intended for
> replacing member disks under btrfs's own raid.

It seems this /can/ work. You demonstrated that much. But I'm not sure
whether btrfs replace was actually designed to do a single-device
replace. If not, it almost certainly hasn't been tested for it. Even if
so, I'm sure I'm not the only one who hadn't thought of using it that
way, so while it might have been development-tested for single-device
replace, it's unlikely to have had the same degree of broader testing in
actual usage, simply because few even thought of using it that way.

Regardless, you seem to have flushed out some bugs. Now that they're
visible and the weekend's over, the devs will likely get to work tracing
them down and fixing them.

> 2. Add a new device, then remove the old one. Maybe this way we'll need
> to do a full balance after that?
This is the alternative I'd have used in your scenario (but see below),
except that a manual balance shouldn't be necessary. The device add part
should go pretty fast, as it simply makes more space available. The
device remove will go much slower, as in effect it triggers that
balance, forcing everything over to the just-added, pretty much empty
device. You'd do a manual balance if you wanted to convert to raid or
some such, but from single device to single device, just the add/remove
should do it.

> 3. Block-level copy of the partition, then hide the original from the
> kernel to avoid confusion because of the same UUID. Of course, this way
> the volume is going to be off-line until the copy is finished.

This could work too, but in addition to being forced to keep the
filesystem offline the entire time, the block-level copy will copy any
problems, etc., too.

But what I'd /prefer/ to do would be to take the opportunity to create a
new filesystem, possibly using different mkfs.btrfs options, or at least
starting over with a fresh filesystem and thus eliminating any as yet
undetected or still developing problems with the old filesystem. Since
the replace or device remove will end up rewriting everything anyway,
might as well make a clean break and start fresh, would be my thinking.
You could then use send/receive to copy all the snapshots, etc., over.
Currently, that would need to be done one at a time, but there's
discussion of adding a subvolume-recursive mode.

Tho while on the subject of snapshots, it should be noted that btrfs
operations such as balance don't scale so well with tens of thousands of
snapshots. So the recommendation is to try to keep it to 250 snapshots
or so per subvolume, under 2000 snapshots total if possible, which of
course at 250 per subvolume would be 8 separate subvolumes. You can go
above that to 3000 or so if absolutely necessary, but if it reaches near
10K, expect more problems in general, and dramatically increased memory
and time requirements, for balance, check, device replace/remove, etc.

--
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

^ permalink raw reply	[flat|nested] 15+ messages in thread
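[Rough sketches of the two approaches Duncan prefers above; mount points,
devices and snapshot names are placeholders. The add/delete route migrates
data in place, while send/receive copies snapshots onto a freshly made
filesystem; source snapshots must be read-only to be sent:]

# btrfs device add /dev/md1 /mnt/backup                     # fast: just adds free space
# btrfs device delete /dev/mapper/vg0-backup /mnt/backup    # slow: moves every chunk onto the new device

# mkfs.btrfs /dev/md1
# mount /dev/md1 /mnt/new
# btrfs send /mnt/old/snap/2015-06-01 | btrfs receive /mnt/new/snap/
# btrfs send -p /mnt/old/snap/2015-06-01 /mnt/old/snap/2015-06-08 | btrfs receive /mnt/new/snap/
#   (-p sends only the delta against an already-received parent snapshot)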
* Re: btrfs replace seems to corrupt the file system
  2015-06-29  8:08               ` Duncan
@ 2015-06-29 11:23                 ` Mike Fleetwood
  2015-06-29 11:39                 ` Mordechay Kaganer
  1 sibling, 0 replies; 15+ messages in thread
From: Mike Fleetwood @ 2015-06-29 11:23 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs

On 29 June 2015 at 09:08, Duncan <1i5t5.duncan@cox.net> wrote:
> Meanwhile, unlike many filesystems, btrfs uses the UUID as part of the
> metadata, so changing the UUID isn't as simple as rewriting a
> superblock; the metadata must be rewritten to the new UUID. There's
> actually a tool now available to do just that, but it's new enough that
> I'm not even sure it's available in release form yet; if so, it'll be in
> the latest releases. Otherwise, it'd be in the integration branch.

FYI, btrfstune with the capability to change the file system UUID was
included in btrfs-progs 4.1, released last week, Mon, 22 Jun 2015.
http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg44182.html

Mike

^ permalink raw reply	[flat|nested] 15+ messages in thread
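[With btrfs-progs 4.1 as mentioned above, rewriting the UUID of one of the
copies would look roughly like this; it must run against an unmounted
filesystem and rewrites every metadata block, so it is not instant:]

# btrfstune -u /dev/md1       # assign a new random fsid to this copy
# btrfs filesystem show       # confirm the two copies no longer share a UUID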
* Re: btrfs replace seems to corrupt the file system
  2015-06-29  8:08               ` Duncan
  2015-06-29 11:23                 ` Mike Fleetwood
@ 2015-06-29 11:39                 ` Mordechay Kaganer
  1 sibling, 0 replies; 15+ messages in thread
From: Mordechay Kaganer @ 2015-06-29 11:39 UTC (permalink / raw)
  To: Duncan; +Cc: Btrfs BTRFS

B.H.

Regarding the main issue: the drive that was "recovered" using Noah's
trick (mount -o degraded, then btrfs replace cancel) appears to be
clean. At least, it passes scrub without any errors. It even contains
all the changes that were made while the replace was ongoing.

Also, I've run MD's consistency check on the destination drive which
contains the corrupt FS, and it appears to be clean from MD's point of
view, so I think it can be considered a "proof" that btrfs replace was
actually the source of the corruption.

Before trying to upgrade the kernel/btrfs-progs, I'll try to reproduce
the situation with smaller loopback devices. Not sure if it is
reproducible so easily. The original replace operation took more than
5 days and I'm not going to play with the actual data again ;-).

If the "corrupt" version of the FS might help in debugging the issue,
please contact me today, before we wipe it out.

On Mon, Jun 29, 2015 at 11:08 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> Mordechay Kaganer posted on Mon, 29 Jun 2015 08:02:01 +0300 as excerpted:
>> 1. btrfs replace - as far as I understand, it's primarily intended for
>> replacing member disks under btrfs's own raid.
>
> It seems this /can/ work. You demonstrated that much. But I'm not sure
> whether btrfs replace was actually designed to do a single-device
> replace. If not, it almost certainly hasn't been tested for it. Even if
> so, I'm sure I'm not the only one who hadn't thought of using it that
> way, so while it might have been development-tested for single-device
> replace, it's unlikely to have had the same degree of broader testing in
> actual usage, simply because few even thought of using it that way.

*If* replace is usable for a single-drive FS, this method has the
advantage that it can be cancelled in the middle and (for a single
drive, using Noah's trick) even after the operation has finished. For a
multi-drive FS, the trick wouldn't help once any changes had been made
to the FS after the replace.

--
משיח NOW!
Moshiach is coming very soon, prepare yourself!
יחי אדוננו מורינו ורבינו מלך המשיח לעולם ועד!

^ permalink raw reply	[flat|nested] 15+ messages in thread
Thread overview: 15+ messages
[not found] <CA+xOVSOD1YY-=Cm+vmzTUV9cHe9idtDkRr0RmpRP5a0Z6eC4YQ@mail.gmail.com>
2015-06-27 23:17 ` Fwd: btrfs replace seems to corrupt the file system Mordechay Kaganer
2015-06-28 0:52 ` Moby
2015-06-28 16:31 ` Mordechay Kaganer
2015-06-29 2:50 ` Duncan
2015-06-28 16:45 ` Chris Murphy
2015-06-28 18:02 ` Mordechay Kaganer
2015-06-28 18:30 ` Chris Murphy
2015-06-28 18:50 ` Noah Massey
2015-06-28 19:08 ` Chris Murphy
2015-06-28 19:20 ` Mordechay Kaganer
2015-06-28 19:32 ` Chris Murphy
2015-06-29 5:02 ` Mordechay Kaganer
2015-06-29 8:08 ` Duncan
2015-06-29 11:23 ` Mike Fleetwood
2015-06-29 11:39 ` Mordechay Kaganer