* Fwd: btrfs replace seems to corrupt the file system
  [not found] <CA+xOVSOD1YY-=Cm+vmzTUV9cHe9idtDkRr0RmpRP5a0Z6eC4YQ@mail.gmail.com>
@ 2015-06-27 23:17 ` Mordechay Kaganer
  2015-06-28  0:52   ` Moby
                     ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Mordechay Kaganer @ 2015-06-27 23:17 UTC (permalink / raw)
  To: linux-btrfs

B.H.

Hello. I'm running our backup archive on btrfs. We have an MD-based RAID5
array with 4 6TB disks, then LVM on top of it, and a btrfs volume on the
LV (we don't use btrfs's own RAID features because we want RAID5 and, as
far as I understand, the support is only partial).

I wanted to move the archive to another MD array of 4 8TB drives (this
time without LVM). So I did:

btrfs replace start 1 /dev/md1 <mount_point>

Where 1 is the only devid that was present and /dev/md1 is the new array.

The replace ran successfully until it finished after more than 5 days.
The system downloaded some fresh backups and created new snapshots
during the ongoing replace. I got 2 kernel warnings about the replace
task waiting for more than 120 seconds in the middle, but the process
seemed to go on anyway.

After the replace had finished I did btrfs fi resize 1:max
<mount_point>, then unmounted and mounted again using the new drive.

Then I ran a scrub on the FS - and got a lot of checksum errors.
Messages like this:

BTRFS: checksum error at logical 5398405586944 on dev /dev/md1, sector
10576283152, root 12788, inode 4512290, offset 23592960, length 4096,
links 1 (path: XXXXXXXXX)
BTRFS: bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 67165, gen 0
BTRFS: unable to fixup (regular) error at logical 5398405586944 on dev /dev/md1

Is there any way to fix this? I still have the old array available, but
the replace has wiped out its superblock so it's not mountable.

# uname -a
Linux <hostname> 3.16.0-41-generic #57~14.04.1-Ubuntu SMP Thu Jun 18
18:01:13 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

# btrfs --version
Btrfs v3.12

--
משיח NOW!
Moshiach is coming very soon, prepare yourself!
יחי אדוננו מורינו ורבינו מלך המשיח לעולם ועד!

^ permalink raw reply	[flat|nested] 15+ messages in thread
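[For readers skimming the thread, the migration described above, expressed
as commands, would look roughly like the following. This is only a sketch:
the mount point is a placeholder, and the devid and timings come from the
report above.]

# btrfs replace start 1 /dev/md1 /mnt/backup    # devid 1 is the only existing device
# btrfs replace status /mnt/backup              # poll progress; this ran for over 5 days here
# btrfs filesystem resize 1:max /mnt/backup     # grow the fs to fill the larger new array
# umount /mnt/backup
# mount /dev/md1 /mnt/backup                    # remount from the new device
# btrfs scrub start -B /mnt/backup              # -B stays in the foreground and prints a summary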
* Re: Fwd: btrfs replace seems to corrupt the file system
  2015-06-27 23:17 ` Fwd: btrfs replace seems to corrupt the file system Mordechay Kaganer
@ 2015-06-28  0:52   ` Moby
  2015-06-28 16:31   ` Mordechay Kaganer
  2015-06-28 16:45   ` Chris Murphy
  2 siblings, 0 replies; 15+ messages in thread
From: Moby @ 2015-06-28 0:52 UTC (permalink / raw)
  To: linux-btrfs

On 06/27/2015 06:17 PM, Mordechay Kaganer wrote:
> B.H.
>
> Hello. I'm running our backup archive on btrfs. We have an MD-based RAID5
> array with 4 6TB disks, then LVM on top of it, and a btrfs volume on the
> LV (we don't use btrfs's own RAID features because we want RAID5 and, as
> far as I understand, the support is only partial).
>
> I wanted to move the archive to another MD array of 4 8TB drives (this
> time without LVM). So I did:
>
> btrfs replace start 1 /dev/md1 <mount_point>
>
> Where 1 is the only devid that was present and /dev/md1 is the new array.
>
> The replace ran successfully until it finished after more than 5 days.
> The system downloaded some fresh backups and created new snapshots
> during the ongoing replace. I got 2 kernel warnings about the replace
> task waiting for more than 120 seconds in the middle, but the process
> seemed to go on anyway.
>
> After the replace had finished I did btrfs fi resize 1:max
> <mount_point>, then unmounted and mounted again using the new drive.
>
> Then I ran a scrub on the FS - and got a lot of checksum errors.
> Messages like this:
>
> BTRFS: checksum error at logical 5398405586944 on dev /dev/md1, sector
> 10576283152, root 12788, inode 4512290, offset 23592960, length 4096,
> links 1 (path: XXXXXXXXX)
> BTRFS: bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 67165, gen 0
> BTRFS: unable to fixup (regular) error at logical 5398405586944 on dev /dev/md1
>
> Is there any way to fix this? I still have the old array available, but
> the replace has wiped out its superblock so it's not mountable.
>
> # uname -a
> Linux <hostname> 3.16.0-41-generic #57~14.04.1-Ubuntu SMP Thu Jun 18
> 18:01:13 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>
> # btrfs --version
> Btrfs v3.12
>

I was seeing insane behavior with btrfs and the kernel versions from the
stock/update distro repositories. Upgrading the kernel to stable
(4.1.0-1.gfcf8349-default as of today) and btrfs-progs to
btrfs-progs v4.1+20150622 resolved the insane behavior (such as negative
percentages left during tasks, etc.) and the errors I was seeing.

--
--Moby

They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety.
   -- Benjamin Franklin

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: btrfs replace seems to corrupt the file system
  2015-06-27 23:17 ` Fwd: btrfs replace seems to corrupt the file system Mordechay Kaganer
  2015-06-28  0:52   ` Moby
@ 2015-06-28 16:31   ` Mordechay Kaganer
  2015-06-29  2:50     ` Duncan
  2015-06-28 16:45   ` Chris Murphy
  2 siblings, 1 reply; 15+ messages in thread
From: Mordechay Kaganer @ 2015-06-28 16:31 UTC (permalink / raw)
  To: linux-btrfs

On Sun, Jun 28, 2015 at 2:17 AM, Mordechay Kaganer <mkaganer@gmail.com> wrote:
> B.H.
>
> Hello. I'm running our backup archive on btrfs. We have an MD-based RAID5
> array with 4 6TB disks, then LVM on top of it, and a btrfs volume on the
> LV (we don't use btrfs's own RAID features because we want RAID5 and, as
> far as I understand, the support is only partial).
>
> I wanted to move the archive to another MD array of 4 8TB drives (this
> time without LVM). So I did:
>
> btrfs replace start 1 /dev/md1 <mount_point>
>
> Where 1 is the only devid that was present and /dev/md1 is the new array.
>
> The replace ran successfully until it finished after more than 5 days.
> The system downloaded some fresh backups and created new snapshots
> during the ongoing replace. I got 2 kernel warnings about the replace
> task waiting for more than 120 seconds in the middle, but the process
> seemed to go on anyway.
>
> After the replace had finished I did btrfs fi resize 1:max
> <mount_point>, then unmounted and mounted again using the new drive.
>
> Then I ran a scrub on the FS - and got a lot of checksum errors.
> Messages like this:
>
> BTRFS: checksum error at logical 5398405586944 on dev /dev/md1, sector
> 10576283152, root 12788, inode 4512290, offset 23592960, length 4096,
> links 1 (path: XXXXXXXXX)
> BTRFS: bdev /dev/md1 errs: wr 0, rd 0, flush 0, corrupt 67165, gen 0
> BTRFS: unable to fixup (regular) error at logical 5398405586944 on dev /dev/md1
>
> Is there any way to fix this? I still have the old array available, but
> the replace has wiped out its superblock so it's not mountable.
>
> # uname -a
> Linux <hostname> 3.16.0-41-generic #57~14.04.1-Ubuntu SMP Thu Jun 18
> 18:01:13 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>
> # btrfs --version
> Btrfs v3.12

I'm trying to recover the original data from before the replace
operation. What I did so far is restore the superblock of the original
(replaced) device from a backup superblock copy, like this:

btrfs-select-super -s 2 /dev/mapper/XXXXXX

This worked, so the btrfs tools now recognize the device as having a
btrfs volume on it. I did a full btrfs check on the partition - it
didn't find any errors, at least per my understanding. But it's
impossible to mount the volume.

When trying to mount the volume I get the following messages in dmesg:

[109989.432274] BTRFS warning (device dm-2): cannot mount because device replace operation is ongoing and
[109989.432280] BTRFS warning (device dm-2): tgtdev (devid 0) is missing, need to run 'btrfs dev scan'?
[109989.432282] BTRFS: failed to init dev_replace: -5
[109989.459719] BTRFS: open_ctree failed

On the other hand, the "replaced" device mounts OK, but btrfs scrub
returns lots of checksum errors, so I fear the data is probably corrupt.
The volume is about 15TB and has many subvolumes and snapshots, so
finding out what exactly is corrupt will be very tricky.

Any idea what I can do to recover the data?

--
משיח NOW!
Moshiach is coming very soon, prepare yourself!
יחי אדוננו מורינו ורבינו מלך המשיח לעולם ועד!

^ permalink raw reply	[flat|nested] 15+ messages in thread
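[For reference, the recovery attempt described above amounts to roughly
the following. The LV name is a placeholder, and -s 2 selects one of the
backup superblock copies to promote over the damaged primary one:]

# btrfs-select-super -s 2 /dev/mapper/vg0-backup   # copy backup superblock #2 over the primary
# btrfs check /dev/mapper/vg0-backup               # read-only check; the fs must stay unmounted
# mount /dev/mapper/vg0-backup /mnt/old            # this is the step that fails...
# dmesg | tail                                     # ...with the dev_replace / open_ctree errors quoted above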
* Re: btrfs replace seems to corrupt the file system
  2015-06-28 16:31   ` Mordechay Kaganer
@ 2015-06-29  2:50     ` Duncan
  0 siblings, 0 replies; 15+ messages in thread
From: Duncan @ 2015-06-29 2:50 UTC (permalink / raw)
  To: linux-btrfs

Mordechay Kaganer posted on Sun, 28 Jun 2015 19:31:31 +0300 as excerpted:

> On Sun, Jun 28, 2015 at 2:17 AM, Mordechay Kaganer <mkaganer@gmail.com>
> wrote:
>> B.H.
>>
>> Hello. I'm running our backup archive on btrfs. We have an MD-based RAID5
>> array with 4 6TB disks, then LVM on top of it, and a btrfs volume on the
>> LV (we don't use btrfs's own RAID features because we want RAID5 and, as
>> far as I understand, the support is only partial).

(I see people already helping with the primary issue so won't address
that here. However, addressing the above...)

FWIW... btrfs raid56 (5 and 6) support is now (from kernel 3.19) "code
complete". However, "code complete" is far from "stable and mature", and
I (as a list regular but not a dev) have been recommending that people
continue to hold off a few kernels until it has had some time to
stabilize to more or less the same point btrfs itself is at, unless of
course their purpose is actually to test the code with data they're
prepared to lose, report bugs and help get them fixed, in which case,
welcome aboard! =:^)

Of course btrfs itself isn't really mature or entirely stable yet, tho
it's reasonable for ordinary use, provided the sysadmins' rule of
backups is observed: (a) if it's not backed up, by definition the data
is worth less to you than the time and media required to do the backups,
despite any claims to the contrary, and (b) for purposes of this rule, a
would-be backup that hasn't been tested restorable isn't yet a backup.

But back to raid56: my recommendation has been to wait at LEAST TWO
kernel cycles, which would be the just-released 4.1, and even then,
consider it bleeding edge and be prepared to deal with bugs. For
stability comparable to btrfs in general, my recommendation is to wait
at least a year, which happens to be about five kernel cycles, so until
at least 4.4. At that point, either check a few weeks of list traffic
and decide for yourself based on that, or ask, but that's a reasonably
educated guess.

Btrfs raid56 bottom line: 4.1 is the minimal two-kernel-cycle code
maturity I suggested; if you're prepared to be bleeding edge, try it.
Else wait the full year, kernel 4.4 or so.

(More below...)

>> I wanted to move the archive to another MD array of 4 8TB drives (this
>> time without LVM). So I did:
>>
>> btrfs replace start 1 /dev/md1 <mount_point>
>>
>> Where 1 is the only devid that was present and /dev/md1 is the new
>> array.

FWIW, I hadn't even considered the possibility of doing a replace from a
single device. I had thought it required raid mode. But if it appeared
to work...

>> The replace ran successfully until it finished after more than 5 days.
>> The system downloaded some fresh backups and created new snapshots
>> during the ongoing replace. I got 2 kernel warnings about the replace
>> task waiting for more than 120 seconds in the middle, but the process
>> seemed to go on anyway.
>>
>> After the replace had finished I did btrfs fi resize 1:max
>> <mount_point>, then unmounted and mounted again using the new drive.
>>
>> Then I ran a scrub on the FS - and got a lot of checksum errors.

Had you done a pre-replace scrub on the existing device? If not, is the
corruption actually new, or from before the replace and simply
transferred? You don't know.
Meanwhile, one reason not to particularly like the idea of btrfs on top
of something like mdraid is that btrfs is checksumming and operationally
verifying, while mdraid is not. If btrfs reports an error, was it at the
media level (and if so, on which raid member device), the raid level,
the btrfs level, or...?

Tho for mdraid5/6 you can do a raid scrub, and hopefully detect and
correct media- and raid-level errors, but you still don't have raid-level
checksum verification. And with multiple terabyte drives that's
definitely going to take a while!

With btrfs raid1/10 there will be a second, hopefully checksum-valid,
copy to use and rebuild from. And btrfs raid56 should be able to
reconstruct hopefully checksum-valid data from parity, tho of course at
its maturity level one can't yet assume it's entirely bug-free.

(Again, as I observed above, the problem resolution is occurring on
another subthread, so I'll leave this at the above.)

--
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

^ permalink raw reply	[flat|nested] 15+ messages in thread
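[For completeness, the mdraid-level scrub mentioned above is driven through
sysfs; md1 here stands in for whichever array backs the filesystem:]

# echo check > /sys/block/md1/md/sync_action   # start a read-only consistency check of the array
# cat /proc/mdstat                             # shows check progress and estimated time remaining
# cat /sys/block/md1/md/mismatch_cnt           # non-zero after the check means parity mismatches were found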
* Re: btrfs replace seems to corrupt the file system
  2015-06-27 23:17 ` Fwd: btrfs replace seems to corrupt the file system Mordechay Kaganer
  2015-06-28  0:52   ` Moby
  2015-06-28 16:31   ` Mordechay Kaganer
@ 2015-06-28 16:45   ` Chris Murphy
  2015-06-28 18:02     ` Mordechay Kaganer
  2 siblings, 1 reply; 15+ messages in thread
From: Chris Murphy @ 2015-06-28 16:45 UTC (permalink / raw)
  To: Btrfs BTRFS

On Sat, Jun 27, 2015 at 5:17 PM, Mordechay Kaganer <mkaganer@gmail.com> wrote:

> # uname -a
> Linux <hostname> 3.16.0-41-generic #57~14.04.1-Ubuntu SMP Thu Jun 18
> 18:01:13 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>
> # btrfs --version
> Btrfs v3.12

Well, it's over a weekend, so many devs may not get around to responding
until Monday; if it's urgent then IRC is probably better. But the thing
is, the kernel and btrfs-progs are kinda old. So I'm reasonably sure the
first suggestion is going to be to upgrade both of them; it's sorta par
for the course with Btrfs problems.

Option A: Maybe someone has advice on how to get the demoted device to
be valid again, as if the replace command hadn't been used. Because then
you could try the replace again with a newer kernel and progs, and see
if the problem still happens. That's a good question to ask on IRC if
you don't have a response by tomorrow.

Option B: In the meantime, start to check some files to see if they're
actually corrupt or if these csum errors are bogus.

Option C: Well, it's a backup, so actually, before A or B, it's probably
best to start making a new backup in case this one is beyond repair.
Hopefully you have a 3rd location to put it in, so you can keep both of
these Btrfs volumes in their current states until you have a clearer
idea what to do with them, but at least you're not delaying getting a
current backup in place. Whether this new backup is Btrfs based or not
is less important than actually having a backup.

--
Chris Murphy

^ permalink raw reply	[flat|nested] 15+ messages in thread
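[One possible way to act on option B, assuming the secondary backup
mentioned later in the thread holds copies of the same files; paths are
hypothetical, and reads may simply fail with EIO where btrfs refuses to
return data that fails csum verification:]

# md5sum /mnt/new-backup/some/dir/file                                # may return an I/O error instead of a hash
# cmp /mnt/new-backup/some/dir/file /mnt/other-backup/some/dir/file   # byte-for-byte comparison against the other copy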
* Re: btrfs replace seems to corrupt the file system
  2015-06-28 16:45   ` Chris Murphy
@ 2015-06-28 18:02     ` Mordechay Kaganer
  2015-06-28 18:30       ` Chris Murphy
  2015-06-28 18:50       ` Noah Massey
  0 siblings, 2 replies; 15+ messages in thread
From: Mordechay Kaganer @ 2015-06-28 18:02 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

B.H.

Thanks for the reply.

On Sun, Jun 28, 2015 at 7:45 PM, Chris Murphy <lists@colorremedies.com> wrote:
>
> Option A: Maybe someone has advice on how to get the demoted device to
> be valid again, as if the replace command hadn't been used. Because then
> you could try the replace again with a newer kernel and progs, and see
> if the problem still happens. That's a good question to ask on IRC if
> you don't have a response by tomorrow.

To recover the old device - that's what I'm trying to do. I asked on IRC
also, no reply. As stated above, the device passes btrfs check without
errors but cannot mount, because it complains about an "ongoing replace"
and the replace target device being missing.

> Option B: In the meantime, start to check some files to see if they're
> actually corrupt or if these csum errors are bogus.

I tried to copy some files that are reported with bad checksums out of
the "new" volume. The copy fails with messages like this:

[181896.761117] BTRFS info (device md1): csum failed ino 3849795 off 1388544 csum 2566472073 expected csum 3428551483
[181896.761362] BTRFS info (device md1): csum failed ino 3849795 off 1519616 csum 2566472073 expected csum 1565909691
[181896.761997] BTRFS info (device md1): csum failed ino 3849795 extent 5084061945856 csum 2566472073 wanted 2627769260 mirror 0
[181896.769091] BTRFS info (device md1): csum failed ino 3849795 off 1257472 csum 2566472073 expected csum 4184704592
[181896.769509] BTRFS info (device md1): csum failed ino 3849795 off 1257472 csum 2566472073 expected csum 4184704592
[181897.171789] BTRFS info (device md1): csum failed ino 2940181 extent 4288477184000 csum 2566472073 wanted 1434149511 mirror 0
[181897.171984] BTRFS info (device md1): csum failed ino 2940181 extent 4288477270016 csum 2566472073 wanted 439924019 mirror 0
[181897.172199] BTRFS info (device md1): csum failed ino 2940181 extent 4288477356032 csum 2566472073 wanted 3293573949 mirror 0

> Option C: Well, it's a backup, so actually, before A or B, it's probably
> best to start making a new backup in case this one is beyond repair.
> Hopefully you have a 3rd location to put it in, so you can keep both of
> these Btrfs volumes in their current states until you have a clearer
> idea what to do with them, but at least you're not delaying getting a
> current backup in place. Whether this new backup is Btrfs based or not
> is less important than actually having a backup.

Unfortunately, I don't have that much free storage; it's more than 15TB.
But I do have another backup, so the recovery is not so urgent. The
btrfs backup is the only place where I have the older snapshots of the
data (actually, this is why we decided to use btrfs in the first place),
and reformatting would mean losing that older data.

--
משיח NOW!
Moshiach is coming very soon, prepare yourself!
יחי אדוננו מורינו ורבינו מלך המשיח לעולם ועד!

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: btrfs replace seems to corrupt the file system
  2015-06-28 18:02     ` Mordechay Kaganer
@ 2015-06-28 18:30       ` Chris Murphy
  0 siblings, 0 replies; 15+ messages in thread
From: Chris Murphy @ 2015-06-28 18:30 UTC (permalink / raw)
  To: Btrfs BTRFS

There is a workaround for the file system not reading files when there
are csum errors, which is btrfs check --init-csum-tree, but there is a
caveat: you might need a newer btrfs-progs. I forget exactly which
version will reconstruct new csums, maybe 3.18 or 3.19. Older versions
create a new csum tree that's empty, so you get a bunch of errors for
everything, but you can at least still read the files and see if they're
corrupt or not.

Before you do that, you could try btrfs restore to extract some files
and see if they are really corrupt or if the csum errors are bogus.

If the files are corrupted, then the (new device) backup is probably
useless anyway and there's no point in keeping it around. You can still
keep the old device backup that won't mount, in case someone eventually
can say how to reset it as if the replace had not occurred.

I'm willing to bet that the data is OK, and that you've run into some
obscure bug where csum computation during the replace went bad, so it's
just that the csums are all wrong. Speculation though.

Chris Murphy

^ permalink raw reply	[flat|nested] 15+ messages in thread
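[A rough sketch of the two workarounds mentioned above; both operate on
the unmounted filesystem, and the restore target directory is a
placeholder, not something from the thread:]

# btrfs restore -v -i /dev/md1 /mnt/scratch/restore-test   # copy files out; -i ignores errors, -v is verbose
# sha1sum /mnt/scratch/restore-test/somefile               # compare against the copy in the other backup
# btrfs check --init-csum-tree /dev/md1                    # last resort: recreate the csum tree in place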
* Re: btrfs replace seems to corrupt the file system
  2015-06-28 18:02     ` Mordechay Kaganer
  2015-06-28 18:30       ` Chris Murphy
@ 2015-06-28 18:50       ` Noah Massey
  2015-06-28 19:08         ` Chris Murphy
  2015-06-28 19:20         ` Mordechay Kaganer
  1 sibling, 2 replies; 15+ messages in thread
From: Noah Massey @ 2015-06-28 18:50 UTC (permalink / raw)
  To: Btrfs BTRFS

On Sun, Jun 28, 2015 at 2:02 PM, Mordechay Kaganer <mkaganer@gmail.com> wrote:
> To recover the old device - that's what I'm trying to do. I asked on IRC
> also, no reply. As stated above, the device passes btrfs check without
> errors but cannot mount, because it complains about an "ongoing replace"
> and the replace target device being missing.

Standard disclaimer: not a dev, just a user.

The following worked for me to recover the old device after reproducing
your situation (where loop0 is my "old" device):

# mount -t btrfs -o degraded /dev/loop0 /mnt
# btrfs replace cancel /mnt
# umount /mnt
# mount -t btrfs /dev/loop0 /mnt

The mount now succeeds without error.

$ uname -r
4.1.0
$ btrfs version
btrfs-progs v4.1

^ permalink raw reply	[flat|nested] 15+ messages in thread
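[For anyone wanting to reproduce this at small scale, a loop-device setup
along the lines Noah used might look like this; image names and sizes are
arbitrary:]

# truncate -s 4G old.img new.img
# losetup /dev/loop0 old.img
# losetup /dev/loop1 new.img
# mkfs.btrfs /dev/loop0
# mount /dev/loop0 /mnt
# cp -a /some/test/data /mnt/
# btrfs replace start 1 /dev/loop1 /mnt    # migrate the single-device fs (devid 1) onto loop1
# btrfs replace status /mnt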
* Re: btrfs replace seems to corrupt the file system
  2015-06-28 18:50       ` Noah Massey
@ 2015-06-28 19:08         ` Chris Murphy
  2015-06-28 19:20         ` Mordechay Kaganer
  1 sibling, 0 replies; 15+ messages in thread
From: Chris Murphy @ 2015-06-28 19:08 UTC (permalink / raw)
  To: Btrfs BTRFS

On Sun, Jun 28, 2015 at 12:50 PM, Noah Massey <noah.massey@gmail.com> wrote:
> On Sun, Jun 28, 2015 at 2:02 PM, Mordechay Kaganer <mkaganer@gmail.com> wrote:
>> To recover the old device - that's what I'm trying to do. I asked on IRC
>> also, no reply. As stated above, the device passes btrfs check without
>> errors but cannot mount, because it complains about an "ongoing replace"
>> and the replace target device being missing.
>
> Standard disclaimer: not a dev, just a user.
> The following worked for me to recover the old device after reproducing
> your situation (where loop0 is my "old" device):
>
> # mount -t btrfs -o degraded /dev/loop0 /mnt
> # btrfs replace cancel /mnt
> # umount /mnt
> # mount -t btrfs /dev/loop0 /mnt
>
> The mount now succeeds without error.

Neat trick!

> $ uname -r
> 4.1.0
> $ btrfs version
> btrfs-progs v4.1

Yeah, I definitely advise a newer kernel and progs. Even if this trick
works with the older kernel (which seems reasonably likely), the next
attempt at btrfs replace needs to happen with a newer kernel and progs
anyway.

A bit of an off-topic question: the device used as the target for the
replacement still has superblocks from the first (failed, bad csum)
attempt. So I wonder to what degree it's non-deterministic to try this
again without erasing at least those old superblocks first. The thing is,
they have the same UUID; if the UUID from a different volume were used,
there would be no ambiguity between the stale and the new data, but is
there any possibility of confusion on a new attempt when the UUID is the
same?

--
Chris Murphy

^ permalink raw reply	[flat|nested] 15+ messages in thread
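[On the stale-superblock question above, one way to make a retry
unambiguous would be to wipe the target's old signatures before starting
over; whether the kernel actually gets confused without this is exactly
the open question here:]

# wipefs -a /dev/md1    # erase all known filesystem signatures on the replace target
# blkid /dev/md1        # confirm no stale btrfs signature (with the same UUID) remains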
* Re: btrfs replace seems to corrupt the file system
  2015-06-28 18:50       ` Noah Massey
  2015-06-28 19:08         ` Chris Murphy
@ 2015-06-28 19:20         ` Mordechay Kaganer
  2015-06-28 19:32           ` Chris Murphy
  1 sibling, 1 reply; 15+ messages in thread
From: Mordechay Kaganer @ 2015-06-28 19:20 UTC (permalink / raw)
  To: Noah Massey; +Cc: Btrfs BTRFS

B.H.

On Sun, Jun 28, 2015 at 9:50 PM, Noah Massey <noah.massey@gmail.com> wrote:
> On Sun, Jun 28, 2015 at 2:02 PM, Mordechay Kaganer <mkaganer@gmail.com> wrote:
>> To recover the old device - that's what I'm trying to do. I asked on IRC
>> also, no reply. As stated above, the device passes btrfs check without
>> errors but cannot mount, because it complains about an "ongoing replace"
>> and the replace target device being missing.
>
> Standard disclaimer: not a dev, just a user.
> The following worked for me to recover the old device after reproducing
> your situation (where loop0 is my "old" device):
>
> # mount -t btrfs -o degraded /dev/loop0 /mnt
> # btrfs replace cancel /mnt
> # umount /mnt
> # mount -t btrfs /dev/loop0 /mnt
>
> The mount now succeeds without error.

Yeah! That worked even with my old kernel/btrfs-progs. Thank you very
much. Now the old volume mounts OK.

The next step for me is to run a scrub on it to see if the data is
actually intact.

Then (if it's OK, hopefully) we'll see how to redo the replace. Maybe
unmounting and doing a simple "dd" would be the best option? At least
it's not going to corrupt the original data :-).

--
משיח NOW!
Moshiach is coming very soon, prepare yourself!
יחי אדוננו מורינו ורבינו מלך המשיח לעולם ועד!

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: btrfs replace seems to corrupt the file system
  2015-06-28 19:20         ` Mordechay Kaganer
@ 2015-06-28 19:32           ` Chris Murphy
  2015-06-29  5:02             ` Mordechay Kaganer
  0 siblings, 1 reply; 15+ messages in thread
From: Chris Murphy @ 2015-06-28 19:32 UTC (permalink / raw)
  To: Btrfs BTRFS

On Sun, Jun 28, 2015 at 1:20 PM, Mordechay Kaganer <mkaganer@gmail.com> wrote:

> Then (if it's OK, hopefully) we'll see how to redo the replace. Maybe
> unmounting and doing a simple "dd" would be the best option? At least
> it's not going to corrupt the original data :-).

Use of dd can cause corruption of the original.

"Do not make a block-level copy of a btrfs filesystem to a block device,
and then try to mount either the original or the copy while both are
visible to the same kernel."
https://btrfs.wiki.kernel.org/index.php/Gotchas

Once you do the dd, you can't mount either one of the copies until one
of them is completely hidden (i.e. on an LV that's inactive and flagged
to never automatically become active). I think it's too risky just to
avoid using a newer kernel. I'd sooner create a new file system and
tediously btrfs send/receive the subvolumes you want to keep.

--
Chris Murphy

^ permalink raw reply	[flat|nested] 15+ messages in thread
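[A sketch of the precautions Chris describes, assuming the old copy lives
on a hypothetical LV vg0/backup; the point is that only one device
carrying that UUID may ever be visible to the running kernel:]

# umount /mnt/backup                          # neither copy may be mounted while dd runs
# dd if=/dev/vg0/backup of=/dev/md1 bs=64M
# lvchange -an vg0/backup                     # deactivate the old LV...
# lvchange --setactivationskip y vg0/backup   # ...and flag it so it is not auto-activated again
# mount /dev/md1 /mnt/backup                  # only now mount the copy on the new array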
* Re: btrfs replace seems to corrupt the file system
  2015-06-28 19:32           ` Chris Murphy
@ 2015-06-29  5:02             ` Mordechay Kaganer
  2015-06-29  8:08               ` Duncan
  0 siblings, 1 reply; 15+ messages in thread
From: Mordechay Kaganer @ 2015-06-29 5:02 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

B.H.

On Sun, Jun 28, 2015 at 10:32 PM, Chris Murphy <lists@colorremedies.com> wrote:
> On Sun, Jun 28, 2015 at 1:20 PM, Mordechay Kaganer <mkaganer@gmail.com> wrote:
>
>> Then (if it's OK, hopefully) we'll see how to redo the replace. Maybe
>> unmounting and doing a simple "dd" would be the best option? At least
>> it's not going to corrupt the original data :-).
>
> Use of dd can cause corruption of the original.
>

But doing a block-level copy, while taking care that the original volume
is hidden from the kernel when mounting the new one, is safe, isn't it?

Anyway, what is the "straightforward" and recommended way of replacing
the underlying device of a single-device btrfs that doesn't use any raid
features? I can see 3 options:

1. btrfs replace - as far as I understand, it's primarily intended for
replacing member disks under btrfs's own raid.

2. Add a new device, then remove the old one. Maybe this way we'll need
to do a full balance after that?

3. Block-level copy of the partition, then hide the original from the
kernel to avoid confusion because of the same UUID. Of course, this way
the volume is going to be off-line until the copy is finished.

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: btrfs replace seems to corrupt the file system
  2015-06-29  5:02             ` Mordechay Kaganer
@ 2015-06-29  8:08               ` Duncan
  2015-06-29 11:23                 ` Mike Fleetwood
  2015-06-29 11:39                 ` Mordechay Kaganer
  0 siblings, 2 replies; 15+ messages in thread
From: Duncan @ 2015-06-29 8:08 UTC (permalink / raw)
  To: linux-btrfs

Mordechay Kaganer posted on Mon, 29 Jun 2015 08:02:01 +0300 as excerpted:

> On Sun, Jun 28, 2015 at 10:32 PM, Chris Murphy <lists@colorremedies.com>
> wrote:
>> On Sun, Jun 28, 2015 at 1:20 PM, Mordechay Kaganer <mkaganer@gmail.com>
>> wrote:
>>
>> Use of dd can cause corruption of the original.
>>
> But doing a block-level copy, while taking care that the original volume
> is hidden from the kernel when mounting the new one, is safe, isn't it?

As long as neither one is mounted while doing the copy, and one or the
other is hidden before any attempt to mount, it should be safe, yes.

The base problem is that btrfs can be multi-device, and that it tracks
the devices belonging to the filesystem by UUID, so as soon as it sees
another device with the same UUID, it considers it part of the same
filesystem. Writes can go to any of the devices it considers a component
device, and after a write creates a difference, reads can end up coming
from the stale one.

Meanwhile, unlike many filesystems, btrfs uses the UUID as part of the
metadata, so changing the UUID isn't as simple as rewriting a
superblock; the metadata must be rewritten to the new UUID. There's
actually a tool now available to do just that, but it's new enough that
I'm not even sure it's available in release form yet; if so, it'll be in
the latest releases. Otherwise, it'd be in the integration branch.

And FWIW, a different aspect of the same problem can occur in raid1
mode, when a device drops out and is later reintroduced, with both
devices separately mounted rw,degraded and updated in the meantime.
Normally, btrfs will track the generation, a monotonically increasing
integer, and will read from the higher/newer generation, but with
separate updates to each, if they both happen to have the same
generation when reunited... So for raid1 mode, the recommendation is
that if there's a split and one copy continues to be updated, be sure
the other one isn't separately mounted writable and the two then
combined again; or, if both must be separately mounted writable and then
recombined, wipe one and add it as a new device, thus avoiding the
possibility of confusion.

> Anyway, what is the "straightforward" and recommended way of replacing
> the underlying device of a single-device btrfs that doesn't use any raid
> features? I can see 3 options:
>
> 1. btrfs replace - as far as I understand, it's primarily intended for
> replacing member disks under btrfs's own raid.

It seems this /can/ work. You demonstrated that much. But I'm not sure
whether btrfs replace was actually designed to do a single-device
replace. If not, it almost certainly hasn't been tested for it. Even if
so, I'm sure I'm not the only one who hadn't thought of using it that
way, so while it might have been development-tested for single-device
replace, it's unlikely to have had the same degree of broader testing in
actual usage, simply because few even thought of using it that way.

Regardless, you seem to have flushed out some bugs. Now that they're
visible and the weekend's over, the devs will likely get to work tracing
them down and fixing them.

> 2. Add a new device, then remove the old one. Maybe this way we'll need
> to do a full balance after that?
This is the alternative I'd have used in your scenario (but see below),
except that a manual balance shouldn't be necessary. The device add part
should go pretty fast, as it simply makes more space available. The
device remove will go much slower, as in effect it triggers that
balance, forcing everything over to the just-added, pretty much empty
device. You'd do a manual balance if you wanted to convert to raid or
some such, but from single device to single device, just the add/remove
should do it.

> 3. Block-level copy of the partition, then hide the original from the
> kernel to avoid confusion because of the same UUID. Of course, this way
> the volume is going to be off-line until the copy is finished.

This could work too, but in addition to being forced to keep the
filesystem offline the entire time, the block-level copy will copy any
problems, etc., too.

But what I'd /prefer/ to do would be to take the opportunity to create a
new filesystem, possibly using different mkfs.btrfs options, or at least
starting over with a fresh filesystem and thus eliminating any as yet
undetected or still developing problems with the old filesystem. Since
the replace or device remove will end up rewriting everything anyway,
might as well make a clean break and start fresh, would be my thinking.
You could then use send/receive to copy all the snapshots, etc., over.
Currently, that would need to be done one at a time, but there's
discussion of adding a subvolume-recursive mode.

Tho while on the subject of snapshots, it should be noted that btrfs
operations such as balance don't scale so well with tens of thousands of
snapshots. So the recommendation is to try to keep it to 250 snapshots
or so per subvolume, under 2000 snapshots total if possible, which of
course at 250 per subvolume would be 8 separate subvolumes. You can go
above that to 3000 or so if absolutely necessary, but if it reaches near
10K, expect more problems in general, and dramatically increased memory
and time requirements, for balance, check, device replace/remove, etc.

--
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

^ permalink raw reply	[flat|nested] 15+ messages in thread
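[Rough sketches of the two approaches Duncan prefers above; mount points,
devices and snapshot names are placeholders. The add/delete route migrates
data in place, while send/receive copies snapshots onto a freshly made
filesystem; source snapshots must be read-only to be sent:]

# btrfs device add /dev/md1 /mnt/backup                     # fast: just adds free space
# btrfs device delete /dev/mapper/vg0-backup /mnt/backup    # slow: moves every chunk onto the new device

# mkfs.btrfs /dev/md1
# mount /dev/md1 /mnt/new
# btrfs send /mnt/old/snap/2015-06-01 | btrfs receive /mnt/new/snap/
# btrfs send -p /mnt/old/snap/2015-06-01 /mnt/old/snap/2015-06-08 | btrfs receive /mnt/new/snap/
#   (-p sends only the delta against an already-received parent snapshot)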
* Re: btrfs replace seems to corrupt the file system
  2015-06-29  8:08               ` Duncan
@ 2015-06-29 11:23                 ` Mike Fleetwood
  2015-06-29 11:39                 ` Mordechay Kaganer
  1 sibling, 0 replies; 15+ messages in thread
From: Mike Fleetwood @ 2015-06-29 11:23 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs

On 29 June 2015 at 09:08, Duncan <1i5t5.duncan@cox.net> wrote:
> Meanwhile, unlike many filesystems, btrfs uses the UUID as part of the
> metadata, so changing the UUID isn't as simple as rewriting a
> superblock; the metadata must be rewritten to the new UUID. There's
> actually a tool now available to do just that, but it's new enough that
> I'm not even sure it's available in release form yet; if so, it'll be in
> the latest releases. Otherwise, it'd be in the integration branch.

FYI, btrfstune with the capability to change the file system UUID was
included in btrfs-progs 4.1, released last week, Mon, 22 Jun 2015.
http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg44182.html

Mike

^ permalink raw reply	[flat|nested] 15+ messages in thread
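[With btrfs-progs 4.1 as mentioned above, rewriting the UUID of one of the
copies would look roughly like this; it must run against an unmounted
filesystem and rewrites every metadata block, so it is not instant:]

# btrfstune -u /dev/md1       # assign a new random fsid to this copy
# btrfs filesystem show       # confirm the two copies no longer share a UUID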
* Re: btrfs replace seems to corrupt the file system
  2015-06-29  8:08               ` Duncan
  2015-06-29 11:23                 ` Mike Fleetwood
@ 2015-06-29 11:39                 ` Mordechay Kaganer
  1 sibling, 0 replies; 15+ messages in thread
From: Mordechay Kaganer @ 2015-06-29 11:39 UTC (permalink / raw)
  To: Duncan; +Cc: Btrfs BTRFS

B.H.

Regarding the main issue: the drive that was "recovered" using Noah's
trick (mount -o degraded, then btrfs replace cancel) appears to be
clean. At least, it passes scrub without any errors. It even contains
all the changes that were made while the replace was ongoing.

Also, I've run MD's consistency check on the destination drive which
contains the corrupt FS, and it appears to be clean from MD's point of
view, so I think it can be considered a "proof" that btrfs replace was
actually the source of the corruption.

Before trying to upgrade the kernel/btrfs-progs, I'll try to reproduce
the situation with smaller loopback devices. Not sure if it is
reproducible so easily. The original replace operation took more than
5 days and I'm not going to play with the actual data again ;-).

If the "corrupt" version of the FS might help in debugging the issue,
please contact me today, before we wipe it out.

On Mon, Jun 29, 2015 at 11:08 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> Mordechay Kaganer posted on Mon, 29 Jun 2015 08:02:01 +0300 as excerpted:
>> 1. btrfs replace - as far as I understand, it's primarily intended for
>> replacing member disks under btrfs's own raid.
>
> It seems this /can/ work. You demonstrated that much. But I'm not sure
> whether btrfs replace was actually designed to do a single-device
> replace. If not, it almost certainly hasn't been tested for it. Even if
> so, I'm sure I'm not the only one who hadn't thought of using it that
> way, so while it might have been development-tested for single-device
> replace, it's unlikely to have had the same degree of broader testing in
> actual usage, simply because few even thought of using it that way.

*If* replace is usable for a single-drive FS, this method has the
advantage that it can be cancelled in the middle and (for a single
drive, using Noah's trick) even after the operation has finished. For a
multi-drive FS, the trick wouldn't help once any changes had been made
to the FS after the replace.

--
משיח NOW!
Moshiach is coming very soon, prepare yourself!
יחי אדוננו מורינו ורבינו מלך המשיח לעולם ועד!

^ permalink raw reply	[flat|nested] 15+ messages in thread
Thread overview: 15+ messages
[not found] <CA+xOVSOD1YY-=Cm+vmzTUV9cHe9idtDkRr0RmpRP5a0Z6eC4YQ@mail.gmail.com>
2015-06-27 23:17 ` Fwd: btrfs replace seems to corrupt the file system Mordechay Kaganer
2015-06-28 0:52 ` Moby
2015-06-28 16:31 ` Mordechay Kaganer
2015-06-29 2:50 ` Duncan
2015-06-28 16:45 ` Chris Murphy
2015-06-28 18:02 ` Mordechay Kaganer
2015-06-28 18:30 ` Chris Murphy
2015-06-28 18:50 ` Noah Massey
2015-06-28 19:08 ` Chris Murphy
2015-06-28 19:20 ` Mordechay Kaganer
2015-06-28 19:32 ` Chris Murphy
2015-06-29 5:02 ` Mordechay Kaganer
2015-06-29 8:08 ` Duncan
2015-06-29 11:23 ` Mike Fleetwood
2015-06-29 11:39 ` Mordechay Kaganer