* "btrfs: 1 enospc errors during balance" when balancing after formerly failed raid1 device re-appeared
@ 2013-11-15 11:31 Lutz Vieweg
From: Lutz Vieweg @ 2013-11-15 11:31 UTC (permalink / raw)
  To: linux-btrfs

Hi again,

I just did another test on resilience with btrfs/raid1; this time I tested
the following scenario: One out of two raid1 devices disappears. The filesystem
is written to in degraded mode. The missing device re-appears (think of e.g.
a storage device that temporarily became unavailable due to a cable or controller
issue that is later fixed). User issues "btrfs filesystem balance".

Alas, this scenario ends in the error "btrfs: 1 enospc errors during balance",
with the raid1 staying degraded.

Here's the test procedure in detail:

Testing was done using vanilla linux-3.12 (x86_64)
plus btrfs-progs at commit 9f0c53f574b242b0d5988db2972c8aac77ef35a9
plus "[PATCH] btrfs-progs: for mixed group check opt before default raid profile is enforced"

Preparing two 100 MB image files:
 > # dd if=/dev/zero of=/tmp/img1 bs=1024k count=100
 > 100+0 records in
 > 100+0 records out
 > 104857600 bytes (105 MB) copied, 0.201003 s, 522 MB/s
 >
 > # dd if=/dev/zero of=/tmp/img2 bs=1024k count=100
 > 100+0 records in
 > 100+0 records out
 > 104857600 bytes (105 MB) copied, 0.185486 s, 565 MB/s

Preparing two loop devices on those images to act as the underlying
block devices for btrfs:
 > # losetup /dev/loop1 /tmp/img1
 > # losetup /dev/loop2 /tmp/img2

Mounting / writing to the fs:
 > # mount -t btrfs /dev/loop1 /mnt/tmp
 > # echo asdfasdfasdfasdf >/mnt/tmp/testfile1
 > # md5sum /mnt/tmp/testfile1
 > f1264d450b9feda62fec5a1e11faba1a  /mnt/tmp/testfile1
 > # umount /mnt/tmp

First storage device "disappears":
 > # losetup -d /dev/loop1

Mounting degraded btrfs:
 > # mount -t btrfs -o degraded /dev/loop2 /mnt/tmp

Testing that testfile1 is still readable:
 > # md5sum /mnt/tmp/testfile1
 > f1264d450b9feda62fec5a1e11faba1a  /mnt/tmp/testfile1

Creating "testfile2" on the degraded filesystem:
 > # echo qwerqwerqwerqwer >/mnt/tmp/testfile2
 > # md5sum /mnt/tmp/testfile2
 > 9df26d2f2657462c435d58274cc5bdf0  /mnt/tmp/testfile2
 > # umount /mnt/tmp

Now we assume the issue that caused the first storage device
to be unavailable has been fixed:

 > # losetup /dev/loop1 /tmp/img1
 > # mount -t btrfs /dev/loop1 /mnt/tmp

Notice that at this point, I would have expected some kind of warning
in the syslog that the mounted filesystem is not balanced and
thus not redundant.
But there was no such warning.
This may easily lead operators into a situation where they do
not realize that a btrfs is not redundant and that losing one
storage device will lose data.
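
For the record, one can at least check the allocation by hand; a sketch
(I did not run this as part of the transcript above, and the exact
output depends on the btrfs-progs version):

btrfs filesystem df /mnt/tmp   # allocation per profile; non-raid1 chunks would be the red flag
btrfs filesystem show          # devices with per-device size/used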

Testing that the two testfiles (one of which is not yet
stored on both devices) are still readable:
 > # md5sum /mnt/tmp/testfile1
 > f1264d450b9feda62fec5a1e11faba1a  /mnt/tmp/testfile1
 > # md5sum /mnt/tmp/testfile2
 > 9df26d2f2657462c435d58274cc5bdf0  /mnt/tmp/testfile2

So far, so good.
Now since we know the filesystem is not really redundant,
we start a "balance":

 > # btrfs filesystem balance /mnt/tmp
 > ERROR: error during balancing '/mnt/tmp' - No space left on device
 > There may be more info in syslog - try dmesg | tail

Syslog shows:
 > kernel: btrfs: relocating block group 20971520 flags 21
 > kernel: btrfs: found 3 extents
 > kernel: btrfs: relocating block group 4194304 flags 5
 > kernel: btrfs: relocating block group 0 flags 2
 > kernel: btrfs: 1 enospc errors during balance

So the raid1 remains "degraded".

BTW: I wonder why "btrfs balance" seems to require additional space
for writing data to the re-appeared disk.

I also wonder: Would btrfs try to write _two_ copies of
everything to _one_ remaining device of a degraded two-disk raid1?
(If yes, then this means a raid1 would have to be planned with
twice the capacity just to be sure that one failing disk will
not lead to an out-of-diskspace situation. Not good.)

Regards,

Lutz Vieweg



* Re: "btrfs: 1 enospc errors during balance" when balancing after formerly failed raid1 device re-appeared
From: Hugo Mills @ 2013-11-15 12:38 UTC (permalink / raw)
  To: Lutz Vieweg; +Cc: linux-btrfs


On Fri, Nov 15, 2013 at 12:31:24PM +0100, Lutz Vieweg wrote:
> Hi again,
> 
> I just did another test on resilience with btrfs/raid1; this time I tested
> the following scenario: One out of two raid1 devices disappears. The filesystem
> is written to in degraded mode. The missing device re-appears (think of e.g.
> a storage device that temporarily became unavailable due to a cable or controller
> issue that is later fixed). User issues "btrfs filesystem balance".
> 
> Alas, this scenario ends in the error "btrfs: 1 enospc errors during balance",
> with the raid1 staying degraded.
> 
> Here's the test procedure in detail:
> 
> Testing was done using vanilla linux-3.12 (x86_64)
> plus btrfs-progs at commit 9f0c53f574b242b0d5988db2972c8aac77ef35a9
> plus "[PATCH] btrfs-progs: for mixed group check opt before default raid profile is enforced"
> 
> Preparing two 100 MB image files:
> > # dd if=/dev/zero of=/tmp/img1 bs=1024k count=100
> > 100+0 records in
> > 100+0 records out
> > 104857600 bytes (105 MB) copied, 0.201003 s, 522 MB/s
> >
> > # dd if=/dev/zero of=/tmp/img2 bs=1024k count=100
> > 100+0 records in
> > 100+0 records out
> > 104857600 bytes (105 MB) copied, 0.185486 s, 565 MB/s

   For btrfs, this is *tiny*. I'm not surprised you've got ENOSPC
problems here -- it's got nowhere to move the data to, even if you are
using --mixed mode.

   I would recommend using larger sparse files for doing this;
otherwise you're going to keep hitting ENOSPC errors instead of
triggering actual bugs in device recovery:

$ dd if=/dev/zero of=/tmp/img1 bs=1M count=0 seek=10240

That will give you a small file with a large apparent size.
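
   (Equivalently, truncate from coreutils will create the same sparse
files; the loop setup is then unchanged -- just a sketch:)

truncate -s 10G /tmp/img1
truncate -s 10G /tmp/img2
losetup /dev/loop1 /tmp/img1
losetup /dev/loop2 /tmp/img2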

[...]

> So far, so good.
> Now since we know the filesystem is not really redundant,
> we start a "balance":
> 
> > # btrfs filesystem balance /mnt/tmp
> > ERROR: error during balancing '/mnt/tmp' - No space left on device
> > There may be more info in syslog - try dmesg | tail
> 
> Syslog shows:
> > kernel: btrfs: relocating block group 20971520 flags 21
> > kernel: btrfs: found 3 extents
> > kernel: btrfs: relocating block group 4194304 flags 5
> > kernel: btrfs: relocating block group 0 flags 2
> > kernel: btrfs: 1 enospc errors during balance
> 
> So the raid1 remains "degraded".
> 
> BTW: I wonder why "btrfs balance" seems to require additional space
> for writing data to the re-appeared disk.

   It's a copy-on-write filesystem: *every* modification of the FS
requires additional space to write the new copy to. In your example
here, the FS is so small that I'm surprised you could write anything
to it at all, due to metadata overheads.
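
   A quick way to see the squeeze before balancing (a sketch, not
output from this particular filesystem):

# balance relocates a block group by writing it into a *new* chunk first,
# so each device needs roughly a chunk's worth of unallocated space
btrfs filesystem show   # the per-device "size ... used ..." lines show the headroom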

> I also wonder: Would btrfs try to write _two_ copies of
> everything to _one_ remaining device of a degraded two-disk raid1?

   No. It would have to degrade from RAID-1 to DUP to do that (and I
think we prevent DUP data for some reason).
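
   For completeness, putting chunks back to raid1 once both devices
are present again is an explicit convert balance -- a sketch, assuming
a balance-filter-capable kernel (3.3 or later), and no claim that it
avoids the ENOSPC above:

# rewrite any non-raid1 chunks as raid1 across both devices
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/tmp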

   Hugo.

> (If yes, then this means a raid1 would have to be planned with
> twice the capacity just to be sure that one failing disk will
> not lead to an out-of-diskspace situation. Not good.)
> 
> Regards,
> 
> Lutz Vieweg
> 

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
               --- Doughnut furs ache me, Omar Dorlin. ---               


* Re: "btrfs: 1 enospc errors during balance" when balancing after formerly failed raid1 device re-appeared
From: Duncan @ 2013-11-15 15:00 UTC (permalink / raw)
  To: linux-btrfs

Hugo Mills posted on Fri, 15 Nov 2013 12:38:41 +0000 as excerpted:

>> I also wonder: Would btrfs try to write _two_ copies of everything to
>> _one_ remaining device of a degraded two-disk raid1?
> 
> No. It would have to degrade from RAID-1 to DUP to do that (and I
> think we prevent DUP data for some reason).

You may be correct about DUP data, but that is unlikely to be the issue 
here, because he's likely using the mixed-mode default due to the <1GB 
filesystem size, and on a multi-device filesystem that should default to 
RAID1 just as metadata by itself does.

However, I noticed that his outlined reproducer SKIPPED the mkfs.btrfs 
command, and there's no btrfs filesystem show or btrfs filesystem df output 
to verify how the kernel's actually treating the filesystem, so...

@ LV:

For further tests, please include these commands and their output:

1) your mkfs.btrfs command

[Then mount, and after mount...]

2) btrfs filesystem show <path>

3) btrfs filesystem df <path>

Thanks.  These should make what btrfs is doing far clearer.
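
For the loop-device setup in this thread, the whole sequence would look 
roughly like the sketch below -- the mkfs options are pure guesswork, 
since that's exactly the part we haven't seen:

# hypothetical reconstruction; the actual mkfs.btrfs options are unknown
mkfs.btrfs -m raid1 -d raid1 /dev/loop1 /dev/loop2
mount -t btrfs /dev/loop1 /mnt/tmp
btrfs filesystem show            # devices, per-device size/used
btrfs filesystem df /mnt/tmp     # allocation per profile (or mixed)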


Meanwhile, I've been following your efforts with quite some interest as 
they correspond to some of the pre-deployment btrfs raid1 mode testing I 
did.  This was several kernels ago, however, so I had been wondering if 
the behavior had changed, hopefully for the better, and your testing 
looks to be headed toward the same test I did at some point.

Back then, I found a rather counterintuitive result of my own.

Basically, take a two-device raid1-mode btrfs (raid1 for both data and 
metadata; in my case the devices were over a gig, so mixed data+metadata 
wasn't invoked, and I specified -m raid1 -d raid1 when doing the 
mkfs.btrfs), mount it, copy some files to it, and unmount it.

Then disconnect one device (I was using actual devices not the loop 
devices you're using) and mount degraded.  Make a change to the degraded 
filesystem.  Unmount.

Then disconnect that device and reconnect the other.  Mount degraded.  
Make a *DIFFERENT* change to the same file.  Unmount.  The two copies 
have now forked in an incompatible manner.

Now reconnect both devices and remount, this time without degraded.
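
In loop-device terms (so nobody has to pull real disks), the whole 
sequence is roughly the sketch below, reusing the /tmp/img* files from 
earlier in the thread -- though as Hugo noted, they'd want to be much 
larger than 100 MB:

# two-device raid1, then deliberately fork the two copies
mkfs.btrfs -m raid1 -d raid1 /dev/loop1 /dev/loop2
mount -t btrfs /dev/loop1 /mnt/tmp
echo original > /mnt/tmp/testfile
umount /mnt/tmp

# change the file with only the first device present...
losetup -d /dev/loop2
mount -t btrfs -o degraded /dev/loop1 /mnt/tmp
echo fork-A > /mnt/tmp/testfile
umount /mnt/tmp

# ...then make a different change with only the second device present
losetup /dev/loop2 /tmp/img2
losetup -d /dev/loop1
mount -t btrfs -o degraded /dev/loop2 /mnt/tmp
echo fork-B > /mnt/tmp/testfile
umount /mnt/tmp

# bring both devices back, mount without "degraded": which fork shows up?
losetup /dev/loop1 /tmp/img1
mount -t btrfs /dev/loop1 /mnt/tmp
cat /mnt/tmp/testfile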

Like you, I expected btrfs to protest here, particularly since the two 
copies were incompatible.  *NO* *PROTEST*!!

OK, so I checked that file to see which version I got.  I've now 
forgotten which one it was, but it was definitely one of the two forks, 
not the original version.

Now unmount and disconnect the device with the copy it said it had.  
Mount the filesystem degraded with the other device.  Check the file 
again.

!!  I got the other fork! !!

Not only did btrfs not protest when I mounted the raid1 undegraded 
after making incompatible changes to the file with each of the two 
devices mounted degraded separately, but accessing the file on the 
undegraded filesystem neither protested nor corrected the other copy, 
which remained the incompatibly forked version, as confirmed by 
remounting the filesystem degraded with just that device in order to 
access it.

To my way of thinking, that's downright dangerous, as well as being 
entirely unintuitive.

Unfortunately, I didn't actually do a balance to see what btrfs would do 
with the incompatible versions; I simply blew away that testing 
filesystem with a new mkfs.btrfs (I'm on SSD, so mkfs.btrfs automatically 
issues a trim/discard to clear the new filesystem space before making 
it), and I've been kicking myself for not doing so ever since, because I 
really would like to know what balance actually /does/ in such a case!  
But I was still new enough to btrfs at the time that I didn't really 
know what I was doing, so I didn't realize I'd omitted a critical part 
of the test until it was too late, and I've not been interested /enough/ 
in the outcome to redo the test with a newer kernel and tools, and a 
balance this time.

What scrub would have done with it would be an interesting testcase as 
well, but I don't know that either, because I never tried it.

At any rate, I hadn't redone the test yet.  But I keep thinking about 
it, and I guess I would have gotten around to it one of these days.  Now 
it seems like you're heading toward doing it for me. =:^)

Meanwhile, the conclusion I took from the test was that if I ever had to 
mount degraded in read/write mode, I should make *VERY* sure I 
CONSISTENTLY used the same device when I did so.  And when I undegraded, 
/hopefully/ a balance would choose the newer version.  Unfortunately I 
never did actually test that, so I figured that, should I actually need 
to mount degraded, even if the degradation was only temporary, the best 
way to recover was probably to trim the entire partition on the lost 
device, add it back into the filesystem as if it were a new device, and 
do a balance to finally recover raid1 mode.
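
In command terms, that recovery path would be roughly the sketch below 
(device names are placeholders, and I haven't actually verified this 
sequence):

# mount the surviving half, then treat the stale device as brand new
mount -t btrfs -o degraded /dev/GOOD /mnt
wipefs -a /dev/STALE                # or blkdiscard /dev/STALE on an SSD
btrfs device add /dev/STALE /mnt
btrfs device delete missing /mnt    # drop the old, absent member
btrfs filesystem balance /mnt       # rebuild raid1 across both devices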

Meanwhile (2), what btrfs raid1 mode /has/ been providing me with is data 
integrity via the checksumming and scrub features.  And with raid1 
providing a second copy of the data to work with, scrub really does 
appear to do what it says on the tin: copy the good copy over the bad if 
there's a checksum mismatch on the bad one.  What I do NOT know is what 
happens, in /either/ the scrub or the balance case, when both copies are 
valid (checksums and all) but they differ from each other!
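
(The scrub invocation itself is simple enough -- it's the 
both-copies-valid-but-different case that's the open question:)

btrfs scrub start -B /mnt/tmp    # -B: run in the foreground, report when done
btrfs scrub status /mnt/tmp      # summary of errors found / corrected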

FWIW, here's my original list post on the subject, tho it doesn't seem to 
have generated any followup (beyond my own).  IIRC I did get a reply from 
another sysadmin in another thread, but I can't find it now, and all it 
did was echo my concern; there was no reply from a dev or the like.

http://permalink.gmane.org/gmane.comp.file-systems.btrfs/26096

So anyway, yes, I'm definitely following your results with interest! =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


