* Does btrfs "raid1" actually provide any resilience?
From: Lutz Vieweg @ 2013-11-14 11:02 UTC (permalink / raw)
To: linux-btrfs
Hi,
on a server that so far uses an MD RAID1 with XFS on it we wanted
to try btrfs, instead.
But even the most basic check for btrfs actually providing
resilience against one of the physical storage devices failing
yields a "does not work" result - so I wonder whether I misunderstood
that btrfs is meant to not require block-device level RAID
functionality underneath.
Here is the test procedure:
Testing was done using vanilla linux-3.12 (x86_64) plus btrfs-progs at
commit c652e4efb8e2dd76ef1627d8cd649c6af5905902.
Preparing two 100 MB image files:
> # dd if=/dev/zero of=/tmp/img1 bs=1024k count=100
> 100+0 records in
> 100+0 records out
> 104857600 bytes (105 MB) copied, 0.201003 s, 522 MB/s
>
> # dd if=/dev/zero of=/tmp/img2 bs=1024k count=100
> 100+0 records in
> 100+0 records out
> 104857600 bytes (105 MB) copied, 0.185486 s, 565 MB/s
Preparing two loop devices on those images to act as the underlying
block devices for btrfs:
> # losetup /dev/loop1 /tmp/img1
> # losetup /dev/loop2 /tmp/img2
Preparing the btrfs filesystem on the loop devices:
> # mkfs.btrfs --data raid1 --metadata raid1 --label test /dev/loop1 /dev/loop2
> SMALL VOLUME: forcing mixed metadata/data groups
>
> WARNING! - Btrfs v0.20-rc1-591-gc652e4e IS EXPERIMENTAL
> WARNING! - see http://btrfs.wiki.kernel.org before using
>
> Performing full device TRIM (100.00MiB) ...
> Turning ON incompat feature 'mixed-bg': mixed data and metadata block groups
> Created a data/metadata chunk of size 8388608
> Performing full device TRIM (100.00MiB) ...
> adding device /dev/loop2 id 2
> fs created label test on /dev/loop1
> nodesize 4096 leafsize 4096 sectorsize 4096 size 200.00MiB
> Btrfs v0.20-rc1-591-gc652e4e
Mounting the btrfs filesystem:
> # mount -t btrfs /dev/loop1 /mnt/tmp
Copying just 70MB of zeroes into a test file:
> # dd if=/dev/zero of=/mnt/tmp/testfile bs=1024k count=70
> 70+0 records in
> 70+0 records out
> 73400320 bytes (73 MB) copied, 0.0657669 s, 1.1 GB/s
Checking that the testfile can be read:
> # md5sum /mnt/tmp/testfile
> b89fdccdd61d57b371f9611eec7d3cef /mnt/tmp/testfile
Unmounting before further testing:
> # umount /mnt/tmp
Now we assume that one of the two "storage devices" is broken,
so we remove one of the two loop devices:
> # losetup -d /dev/loop1
Trying to mount the btrfs filesystem from the one storage device that is left:
> # mount -t btrfs -o device=/dev/loop2,degraded /dev/loop2 /mnt/tmp
> mount: wrong fs type, bad option, bad superblock on /dev/loop2,
> missing codepage or helper program, or other error
> In some cases useful info is found in syslog - try
> dmesg | tail or so
... does not work.
In /var/log/messages we find:
> kernel: btrfs: failed to read chunk root on loop2
> kernel: btrfs: open_ctree failed
(The same happens when adding ",ro" to the mount options.)
Ok, so if the first of the two disks is broken, so is our filesystem.
Isn't that what RAID1 should prevent?
We tried a different scenario, now the first disk remains
but the second is broken:
> # losetup -d /dev/loop2
> # losetup /dev/loop1 /tmp/img1
>
> # mount -t btrfs -o degraded /dev/loop1 /mnt/tmp
> mount: wrong fs type, bad option, bad superblock on /dev/loop1,
> missing codepage or helper program, or other error
> In some cases useful info is found in syslog - try
> dmesg | tail or so
>
> In /var/log/messages:
> kernel: Btrfs: too many missing devices, writeable mount is not allowed
The message is different, but still unsatisfactory: Not being
able to write to a RAID1 because one out of two disks failed
is not what one would expect - the machine should remain operable
as normal with a degraded RAID1.
But let's see whether at least a read-only mount works:
> # mount -t btrfs -o degraded,ro /dev/loop1 /mnt/tmp
The mount command itself does work.
But then:
> # md5sum /mnt/tmp/testfile
> md5sum: /mnt/tmp/testfile: Input/output error
The testfile is not readable anymore. (At this point, no messages
are to be found in dmesg/syslog - I would expect such on an
input/output error.)
So the bottom line is: All the double writing that comes with RAID1
mode did not provide any useful resilience.
I am kind of sure this is not as intended, or is it?
Regards,
Lutz Vieweg
* Re: Does btrfs "raid1" actually provide any resilience?
From: George Mitchell @ 2013-11-14 17:18 UTC (permalink / raw)
Cc: linux-btrfs
The read only mount issue is by design. It is intended to make sure you
know exactly what is going on before you proceed. For example, a drive
may actually be fine, and the apparent failure may have been caused by a bad
cable. In that case you would want to fix the cable problem before you break the
mirror by writing to a single drive. The read only function is designed
to make certain you know that you are simplex before you proceed further.
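(As a concrete illustration, and only a sketch assuming the stock btrfs-progs
tooling of this era: listing the filesystem reports any device it cannot find,
which tells you whether you really are down to a single copy before you decide
to mount degraded and write to it.

  # btrfs filesystem show /dev/loop2

A device that has dropped out is reported as missing in that output.)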
As for the rest of it, hopefully someone else here can shed more light.
For sure RAID1 mode works fairly reliably (like traditional RAID1) in a
non-virtual setting. But it IS still experimental. I am using btrfs
RAID1 on my workstation, four partitions spread over five hard drives,
but I back everything up 100% multiple times daily via anacron and cron
functions. I certainly wouldn't trust it just yet as it is not fully
production ready. That said, I have been using it for over six months
now, coming off of 3ware RAID, and I have no regrets.
On 11/14/2013 03:02 AM, Lutz Vieweg wrote:
> [snip]
* Re: Does btrfs "raid1" actually provide any resilience?
From: Lutz Vieweg @ 2013-11-14 17:35 UTC (permalink / raw)
To: linux-btrfs
On 11/14/2013 06:18 PM, George Mitchell wrote:
> The read only mount issue is by design. It is intended to make sure you know exactly what is going
> on before you proceed.
Hmmm... but will a server be able to continue its operation (including writes) on
an already mounted btrfs when a storage device in a btrfs-raid1 fails?
(If not, that would contradict the idea of achieving a higher reliability.)
> The read only function is designed to make certain you know that you are
> simplex before you proceed further.
Ok, but once I know - e.g. by verifying that indeed, one storage device is broken -
is there any option to proceed (without redundancy) until I can replace the broken
device?
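(For reference, the sequence I would expect to use at that point, sketched
under the assumption of a raid1 that really does hold a second copy, with
the btrfs-progs of this era and /dev/loop3 standing in for a hypothetical
replacement device:

  # mount -t btrfs -o degraded /dev/loop1 /mnt/tmp
  # btrfs device add /dev/loop3 /mnt/tmp
  # btrfs device delete missing /mnt/tmp

The last step should restore redundancy onto the new device, but as my test
above showed, even the first step, a writable degraded mount, is currently
refused.)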
> I certainly wouldn't trust it just yet as it is not fully production ready.
Sure, the server we intend to try btrfs on is one that we can restore when required,
and there is a redundant server (without btrfs) that can stand in. I was just
hoping for some good experiences to justify a larger field-trial.
> That said, I have been using it for over six
> months now, coming off of 3ware RAID, and I have no regrets.
I guess every Linux software RAID option is an improvement when
you come from those awful hardware RAID controllers, which caused
us additional downtime more often than they prevented downtime.
Regards,
Lutz Vieweg
> [snip]
* Re: Does btrfs "raid1" actually provide any resilience?
From: Goffredo Baroncelli @ 2013-11-14 18:22 UTC (permalink / raw)
To: Lutz Vieweg; +Cc: linux-btrfs
On 2013-11-14 12:02, Lutz Vieweg wrote:
> Hi,
>
> on a server that so far uses an MD RAID1 with XFS on it we wanted
> to try btrfs, instead.
>
> But even the most basic check for btrfs actually providing
> resilience against one of the physical storage devices failing
> yields a "does not work" result - so I wonder whether I misunderstood
> that btrfs is meant to not require block-device level RAID
> functionality underneath.
I don't think that you have misunderstood btrfs. On the basis of my
knowledge you are right.
With a kernel v3.11.6 I made your test and I got the following:
- 2 disks of 100M each and 1 file of 70M: I was *unable* to create the
file because I got a "No space left on device". I was not surprised; BTRFS
behaves badly when the free space is low. However I was able to remove a
disk and remount the filesystem in "degraded" mode.
- 2 disks of 3G each and 1 file of 100M: I was *able* to create the file,
and to remount the filesystem in degraded mode when I deleted a disk.
Note: in any case I needed to mount the filesystem in read-only mode.
I will try also with a 3.12 kernel.
BR
G.Baroncelli
> [snip]
--
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
* RE: Does btrfs "raid1" actually provide any resilience?
From: Kyle Gates @ 2013-11-14 19:59 UTC (permalink / raw)
To: linux-btrfs@vger.kernel.org
On 11/14/2013 11:35 AM, Lutz Vieweg wrote:
>
> On 11/14/2013 06:18 PM, George Mitchell wrote:
>> The read only mount issue is by design. It is intended to make sure you know exactly what is going
>> on before you proceed.
>
> Hmmm... but will a server be able to continue its operation (including writes) on
> an already mounted btrfs when a storage device in a btrfs-raid1 fails?
> (If not, that would contradict the idea of achieving a higher reliability.)
>
>> The read only function is designed to make certain you know that you are
>> simplex before you proceed further.
>
> Ok, but once I know - e.g. by verifying that indeed, one storage device is broken -
> is there any option to proceed (without redundancy) until I can replace the broken
> device?
Bonus points if the raid mode is maintained during degraded operation via either dup (2 disk array) or allocating additional chunks (3+ disk array).
> [snip]
* BUG: btrfsRe: Does btrfs "raid1" actually provide any resilience?
From: Goffredo Baroncelli @ 2013-11-14 20:47 UTC (permalink / raw)
Cc: Lutz Vieweg, linux-btrfs
On 2013-11-14 19:22, Goffredo Baroncelli wrote:
> On 2013-11-14 12:02, Lutz Vieweg wrote:
>> Hi,
>>
>> on a server that so far uses an MD RAID1 with XFS on it we wanted
>> to try btrfs, instead.
>>
>> But even the most basic check for btrfs actually providing
>> resilience against one of the physical storage devices failing
>> yields a "does not work" result - so I wonder whether I misunderstood
>> that btrfs is meant to not require block-device level RAID
>> functionality underneath.
>
> I don't think that you have misunderstood btrfs. On the basis of my
> knowledge you are right.
>
> With a kernel v3.11.6 I made your test and I got the following:
>
> - 2 disks of 100M each and 1 file of 70M: I was *unable* to create the
> file because I got a "No space left on device". I was not surprised; BTRFS
> behaves badly when the free space is low. However I was able to remove a
> disk and remount the filesystem in "degraded" mode.
>
> - 2 disks of 3G each and 1 file of 100M: I was *able* to create the file,
> and to remount the filesystem in degraded mode when I deleted a disk.
>
> Note: in any case I needed to mount the filesystem in read-only mode.
>
> I will try also with a 3.12 kernel.
Ok, it seems to be a BUG in the latest mkfs.btrfs:
If I use the standard Debian "mkfs.btrfs":
ghigo@venice:/tmp$ sudo mkfs.btrfs -m raid1 -d raid1 -K /dev/loop[01]
WARNING! - Btrfs v0.20-rc1 IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using
SMALL VOLUME: forcing mixed metadata/data groups
Created a data/metadata chunk of size 8388608
adding device /dev/loop1 id 2
fs created label (null) on /dev/loop0
nodesize 4096 leafsize 4096 sectorsize 4096 size 202.00MB
Btrfs v0.20-rc1
ghigo@venice:/tmp$ sudo mount /dev/loop1 /mnt/test
ghigo@venice:/tmp$ sudo btrfs fi df /mnt/test
System, RAID1: total=8.00MB, used=4.00KB
System: total=4.00MB, used=0.00
Data+Metadata, RAID1: total=64.00MB, used=28.00KB
Data+Metadata: total=8.00MB, used=0.00
Note the presence of the profile Data+Metadata RAID1
Instead if I use the btrfs-progs c652e4efb8e2dd7... I got
ghigo@venice:/tmp$ sudo ~ghigo/btrfs/btrfs-progs/mkfs.btrfs -m raid1 -d raid1 -K /dev/loop[01]
SMALL VOLUME: forcing mixed metadata/data groups
WARNING! - Btrfs v0.20-rc1-591-gc652e4e IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using
Turning ON incompat feature 'mixed-bg': mixed data and metadata block groups
Created a data/metadata chunk of size 8388608
adding device /dev/loop1 id 2
fs created label (null) on /dev/loop0
nodesize 4096 leafsize 4096 sectorsize 4096 size 202.00MiB
Btrfs v0.20-rc1-591-gc652e4e
ghigo@venice:/tmp$ sudo mount /dev/loop1 /mnt/test
ghigo@venice:/tmp$ sudo btrfs fi df /mnt/test
System: total=4.00MB, used=4.00KB
Data+Metadata: total=8.00MB, used=28.00KB
Note the absence of any RAID1 profile.
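(A quick way to guard against this, just a sketch based on the btrfs fi df
output format above, is to count the RAID1 block group lines right after mkfs:

ghigo@venice:/tmp$ sudo btrfs fi df /mnt/test | grep -c RAID1

This prints a non-zero count on the correctly created filesystem and 0 on
the broken one.)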
> [snip]
--
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
* Mixed and raid [was Re: BUG: btrfsRe: Does btrfs "raid1" actually provide any resilience?]
From: Goffredo Baroncelli @ 2013-11-14 21:21 UTC (permalink / raw)
To: Anand Jain; +Cc: Lutz Vieweg, linux-btrfs
Hi Anand,
after some tests and looking at the code I discovered that the current
mkfs.btrfs doesn't allow any raid profile other than SINGLE for data and
meta-data when the mixed metadata/data group is enabled. It seems this
behaviour was introduced by one of your commits [1].
mkfs.c line 1384 onwards:

	/*
	 * Set default profiles according to number of added devices.
	 * For mixed groups defaults are single/single.
	 */
	if (!mixed) {
		[....]
	} else {
		u32 best_leafsize = max_t(u32, sysconf(_SC_PAGESIZE),
				sectorsize);
		metadata_profile = 0;
		data_profile = 0;
But in another of your commits [2] it seems that you check that, in the
mixed case, the metadata and data profiles have to be equal (implicitly
allowing that they could be different from single?).
mkfs.c line 1373 onward:

	if (is_vol_small(file)) {
		printf("SMALL VOLUME: forcing mixed metadata/data groups\n");
		mixed = 1;
		if (metadata_profile != data_profile) {
			if (metadata_profile_opt || data_profile_opt) {
				fprintf(stderr, "With mixed block groups data and metadata profiles must be the same\n");
				exit(1);
			}
		}
	}
So I am a bit confused: is a raid profile other than single allowed when
mixed mode is enabled? Of course mixed and raid together makes little
sense (or almost very little sense), but the mkfs code is a bit confusing,
and a warning should be raised when the raid profiles are forced to a
default different from the one selected by the user.
Thanks for the attention.
BR
G.Baroncelli
[1] btrfs-progs: avoid write to the disk before sure to create fs
71d6bd3c8d70fb682c7fd50796f587ce1f1cf6f8
.
[2] btrfs-progs: mkfs should check for small vol well before
cdbc10729266c03aeb2eb812c17a3ef6c1ceae26
--
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
* Re: BUG: btrfsRe: Does btrfs "raid1" actually provide any resilience?
From: Chris Murphy @ 2013-11-14 21:22 UTC (permalink / raw)
To: Btrfs BTRFS
On Nov 14, 2013, at 1:47 PM, Goffredo Baroncelli <kreijack@libero.it> wrote:
>
> Instead if I use the btrfs-progs c652e4efb8e2dd7... I got
>
> [snip]
> Data+Metadata: total=8.00MB, used=28.00KB
>
> Note the absence of any RAID1 profile.
What happens if the devices are large enough to avoid mandatory block group mixing? Try 100GB for each device. Is the problem reproducible?
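(If devices of that size are not at hand, sparse backing files should be
enough for this experiment, for example:

  # truncate -s 100G /tmp/big1 /tmp/big2
  # losetup /dev/loop1 /tmp/big1
  # losetup /dev/loop2 /tmp/big2
  # mkfs.btrfs --data raid1 --metadata raid1 /dev/loop1 /dev/loop2

assuming sparse loop files behave the same as real disks as far as profile
selection is concerned.)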
Chris Murphy
* Re: BUG: btrfsRe: Does btrfs "raid1" actually provide any resilience?
From: Goffredo Baroncelli @ 2013-11-14 21:31 UTC (permalink / raw)
To: Chris Murphy; +Cc: Btrfs BTRFS
On 2013-11-14 22:22, Chris Murphy wrote:
>
> On Nov 14, 2013, at 1:47 PM, Goffredo Baroncelli <kreijack@libero.it> wrote:
>>
>> Instead if I use the btrfs-progs c652e4efb8e2dd7... I got
>>
>> [snip]
>
>> Data+Metadata: total=8.00MB, used=28.00KB
>>
>> Note the absence of any RAID1 profile.
>
> What happens if the devices are large enough to avoid mandatory block group mixing? Try 100GB for each device. Is the problem reproducible?
It seems related to the mixing (see my other email). Both looking at the
code and doing some tests seem to confirm that.
>
> Chris Murphy
Goffredo
--
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
* Re: Does btrfs "raid1" actually provide any resilience?
From: George Mitchell @ 2013-11-15 1:58 UTC (permalink / raw)
Cc: linux-btrfs
On 11/14/2013 09:35 AM, Lutz Vieweg wrote:
> On 11/14/2013 06:18 PM, George Mitchell wrote:
>> The read only mount issue is by design. It is intended to make sure
>> you know exactly what is going
>> on before you proceed.
>
> Hmmm... but will a server be able to continue its operation (including
> writes) on
> an already mounted btrfs when a storage device in a btrfs-raid1 fails?
> (If not, that would contradict the idea of achieving a higher
> reliability.)
I am pretty sure that a drive dropping out when it is "in service" is
handled differently than a drive failing to appear when the system is
freshly booted. In the case of an "in service" drive, I believe there
would be full transparent redundancy rw.
>
>> The read only function is designed to make certain you know that you are
>> simplex before you proceed further.
>
> Ok, but once I know - e.g. by verifying that indeed, one storage
> device is broken -
> is there any option to proceed (without redundancy) until I can
> replace the broken
> device?
>
>> I certainly wouldn't trust it just yet as it is not fully production
>> ready.
>
> Sure, the server we intend to try btrfs on is one that we can restore
> when required,
> and there is a redundant server (without btrfs) that can stand in. I
> was just
> hoping for some good experiences to justify a larger field-trial.
I waited until April of this year for the same reasons, but decided it
WAS ready, as long as one takes precautions and doesn't bet the farm on
it. Just make sure you don't try to do anything exotic with it (RAID5,
etc), it's really not ready for that yet. But for vanilla RAID1 it seems
to work just fine. I don't really mess with snapshots and such at this
point, I run a pretty spartan environment with it. It IS file system
RAID, so it might have a problem with something that looks like a bogus
file, like a file filled with all zeros for example. Additionally, as
the previous poster mentioned, it is very sensitive to low free space.
>
>> That said, I have been using it for over six
>> months now, coming off of 3ware RAID, and I have no regrets.
>
> I guess every Linux software RAID option is an improvement when
> you come from those awful hardware RAID controllers, which caused
> us additional downtime more often than they prevented downtime.
I went to hardware RAID precisely because soft RAID sucked in my
opinion. But btrfs is miles ahead of hardware RAID. There is simply no
comparison.
> [snip]
* Re: Mixed and raid [was Re: BUG: btrfsRe: Does btrfs "raid1" actually provide any resilience?]
From: Anand Jain @ 2013-11-15 4:44 UTC (permalink / raw)
To: kreijack, Lutz Vieweg; +Cc: linux-btrfs
Hi G.Baroncelli, Lutz,
Thanks for the test case and heads-up on this. The code missed
checking whether the user has provided the option before the default
profile for the mixed group (due to small vol) is enforced.
I have sent out the following patch to fix it.
[PATCH] btrfs-progs: for mixed group check opt before default raid
profile is enforced
Kindly let us know how it performed if you could.
Thanks,
Anand
On 11/15/2013 05:21 AM, Goffredo Baroncelli wrote:
> [snip]
* Re: Mixed and raid [was Re: BUG: btrfsRe: Does btrfs "raid1" actually provide any resilience?]
From: Duncan @ 2013-11-15 7:12 UTC (permalink / raw)
To: linux-btrfs
Goffredo Baroncelli posted on Thu, 14 Nov 2013 22:21:22 +0100 as
excerpted:
> after some tests and looking at the code I discovered that the current
> mkfs.btrfs doesn't allow any raid profile other than SINGLE for data and
> meta-data when the mixed metadata/data group is enabled.
That'd be a big problem for me, here, as I run a separate sub-GiB (640
MiB) btrfs filesystem /var/log, in data+metadata raid1 mode. (The
mountpoint is actually /lg, with /var/log a symlink pointing at it.)
btrfs f sh /lg
Label: lg0238gcnx+35l0 uuid: c77a9eb8-9841-4c2b-925e-75d0a925dcc3
Total devices 2 FS bytes used 51.93MiB
devid 1 size 640.00MiB used 288.00MiB path /dev/sdc4
devid 2 size 640.00MiB used 288.00MiB path /dev/sda4
btrfs f df /lg
System, RAID1: total=32.00MiB, used=4.00KiB
Data+Metadata, RAID1: total=256.00MiB, used=51.92MiB
I've had a couple bad shutdowns, but btrfs scrub has reliably cleaned up
the resulting mess on /lg due to open logfiles, and I'd be rather unhappy
if that weren't possible.
Meanwhile, I also have two separate sub-GiB (256 MiB) /boot filesystems,
one on each (of two) SSDs, with the grub2 on each pointing at its own
/boot so a broken grub2 update won't break my ability to boot.
Those are both data+metadata dup mode:
btrfs f df /bt
System, DUP: total=8.00MiB, used=4.00KiB
System, single: total=4.00MiB, used=0.00
Data+Metadata, DUP: total=114.00MiB, used=41.38MiB
Data+Metadata, single: total=8.00MiB, used=0.00
You're saying data+metadata DUP wouldn't be possible here either, which
would make me pretty unhappy too.
Fortunately I did those mkfs.btrfs on an earlier btrfs-tools so wasn't
affected by this bug, but bug I would indeed call it, for sure!
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: Mixed and raid [was Re: BUG: btrfsRe: Does btrfs "raid1" actually provide any resilience?]
From: Goffredo Baroncelli @ 2013-11-15 7:30 UTC (permalink / raw)
To: Duncan; +Cc: linux-btrfs, Anand Jain
On 2013-11-15 08:12, Duncan wrote:
> Goffredo Baroncelli posted on Thu, 14 Nov 2013 22:21:22 +0100 as
> excerpted:
>
>> after some tests and looking at the code I discovered that the current
>> mkfs.btrfs doesn't allow any raid profile other than SINGLE for data and
>> meta-data when the mixed metadata/data group is enabled.
>
> That'd be a big problem for me, here, as I run a separate sub-GiB (640
> MiB) btrfs filesystem /var/log, in data+metadata raid1 mode. (The
> mountpoint is actually /lg, with /var/log a symlink pointing at it.)
>
[...]
>
> You're saying data+metadata DUP wouldn't be possible here either, which
> would make me pretty unhappy too.
The problem should be in mkfs.btrfs, not in the btrfs kernel code. So if
the filesystem was created, there should be no problem.
>
> Fortunately I did those mkfs.btrfs on an earlier btrfs-tools so wasn't
> affected by this bug, but bug I would indeed call it, for sure!
Anand posted a patch a few hours ago (which worked for me). I think that
this bug could be addressed quickly.
BR
--
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
* Re: Mixed and raid [was Re: BUG: btrfsRe: Does btrfs "raid1" actually provide any resilience?]
From: Duncan @ 2013-11-15 9:37 UTC (permalink / raw)
To: linux-btrfs
Goffredo Baroncelli posted on Fri, 15 Nov 2013 08:30:49 +0100 as
excerpted:
> On 2013-11-15 08:12, Duncan wrote:
>>
> [...]
>>
>> You're saying data+metadata DUP wouldn't be possible here either, which
>> would make me pretty unhappy too.
>
> The problem should be in mkfs.btrfs, not in the btrfs kernel code. So if
> the filesystem was created, there should be no problem.
Yes. Thanks.
>> Fortunately I did those mkfs.btrfs on an earlier btrfs-tools so wasn't
>> affected by this bug, but bug I would indeed call it, for sure!
>
> Anand posted a patch a few hours ago (which worked for me). I think that
> this bug could be addressed quickly.
His patch crossed with my reply. Quick work, indeed! =:^)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: Mixed and raid [was Re: BUG: btrfsRe: Does btrfs "raid1" actually provide any resilience?]
From: Lutz Vieweg @ 2013-11-15 10:35 UTC (permalink / raw)
To: linux-btrfs; +Cc: kreijack, linux-btrfs
On 11/15/2013 05:44 AM, Anand Jain wrote:
> Thanks for the test case and heads-up on this. The code missed
> checking whether the user has provided the option before the default
> profile for the mixed group (due to small vol) is enforced.
>
> I have sent out the following patch to fix it.
>
> [PATCH] btrfs-progs: for mixed group check opt before default raid profile is enforced
>
> Kindly let us know how it performed if you could.
I just tried it: The test case I posted now works as expected.
Thanks a lot for your effort!
(The patch did not apply to the current head of
git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git
without manual intervention, but was easy to fix.)
Regards,
Lutz Vieweg
PS: Will now proceed with some less basic resilience tests... ;-)
Thread overview: 16+ messages:
2013-11-14 11:02 Does btrfs "raid1" actually provide any resilience? Lutz Vieweg
2013-11-14 17:18 ` George Mitchell
2013-11-14 17:35 ` Lutz Vieweg
2013-11-14 19:59 ` Kyle Gates
2013-11-15 1:58 ` George Mitchell
2013-11-14 18:22 ` Goffredo Baroncelli
2013-11-14 20:47 ` BUG: btrfsRe: " Goffredo Baroncelli
2013-11-14 21:21 ` Mixed and raid [was Re: BUG: btrfsRe: Does btrfs "raid1" actually provide any resilience?] Goffredo Baroncelli
2013-11-15 4:44 ` Anand Jain
2013-11-15 10:35 ` Lutz Vieweg
2013-11-15 10:36 ` Lutz Vieweg
2013-11-15 7:12 ` Duncan
2013-11-15 7:30 ` Goffredo Baroncelli
2013-11-15 9:37 ` Duncan
2013-11-14 21:22 ` BUG: btrfsRe: Does btrfs "raid1" actually provide any resilience? Chris Murphy
2013-11-14 21:31 ` Goffredo Baroncelli