* Understanding BTRFS storage
@ 2015-08-26 8:56 George Duffield
2015-08-26 11:41 ` Austin S Hemmelgarn
` (3 more replies)
0 siblings, 4 replies; 18+ messages in thread
From: George Duffield @ 2015-08-26 8:56 UTC (permalink / raw)
To: linux-btrfs
Hi
Is there a more comprehensive discussion of, or documentation for, Btrfs
features than what is referenced at
https://btrfs.wiki.kernel.org/index.php/Main_Page? I'd love to learn
more, but it seems there's no readily available authoritative
documentation out there.
I'm looking to switch from a 5x3TB mdadm raid5 array to a Btrfs based
solution that will involve duplicating a data store on a second
machine for backup purposes (the machine is only powered up for
backups).
Two quick questions:
- If I were simply to create a Btrfs volume using 5x3TB drives and not
create a raid5/6/10 array, I understand data would be striped across
the 5 drives with no redundancy, i.e. if a drive fails all data is
lost? Is this correct?
- Is Btrfs RAID10 (for data) ready to be used reliably?
* Re: Understanding BTRFS storage
2015-08-26 8:56 Understanding BTRFS storage George Duffield
@ 2015-08-26 11:41 ` Austin S Hemmelgarn
2015-08-26 11:50 ` Hugo Mills
` (2 subsequent siblings)
3 siblings, 0 replies; 18+ messages in thread
From: Austin S Hemmelgarn @ 2015-08-26 11:41 UTC (permalink / raw)
To: George Duffield, linux-btrfs
On 2015-08-26 04:56, George Duffield wrote:
> Hi
>
> Is there a more comprehensive discussion/ documentation of Btrfs
> features than is referenced in
> https://btrfs.wiki.kernel.org/index.php/Main_Page...I'd love to learn
> more but it seems there's no readily available authoritative
> documentation out there?
>
> I'm looking to switch from a 5x3TB mdadm raid5 array to a Btrfs based
> solution that will involve duplicating a data store on a second
> machine for backup purposes (the machine is only powered up for
> backups).
>
> Two quick questions:
> - If I were simply to create a Btrfs volume using 5x3TB drives and not
> create a raid5/6/10 array I understand data would be striped across
> the 5 drives with no redundancy ... i.e. if a drive fails all data is
> lost? Is this correct?
Yes, although the striping is at a much larger granularity than a
typical RAID0.
>
> - Is Btrfs RAID10 (for data) ready to be used reliably?
Yes, in general it is.
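For illustration, a minimal sketch of what creating such a filesystem
might look like (device names and mount point are placeholders):

    # Explicit raid10 for both data and metadata (needs at least 4 devices):
    mkfs.btrfs -d raid10 -m raid10 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde
    mount /dev/sda /mnt/pool

    # Confirm the profiles actually in use:
    btrfs filesystem df /mnt/pool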
* Re: Understanding BTRFS storage
2015-08-26 8:56 Understanding BTRFS storage George Duffield
2015-08-26 11:41 ` Austin S Hemmelgarn
@ 2015-08-26 11:50 ` Hugo Mills
2015-08-26 11:50 ` Roman Mamedov
2015-08-26 11:50 ` Duncan
3 siblings, 0 replies; 18+ messages in thread
From: Hugo Mills @ 2015-08-26 11:50 UTC (permalink / raw)
To: George Duffield; +Cc: linux-btrfs
On Wed, Aug 26, 2015 at 10:56:03AM +0200, George Duffield wrote:
> Hi
>
> Is there a more comprehensive discussion/ documentation of Btrfs
> features than is referenced in
> https://btrfs.wiki.kernel.org/index.php/Main_Page...I'd love to learn
> more but it seems there's no readily available authoritative
> documentation out there?
>
> I'm looking to switch from a 5x3TB mdadm raid5 array to a Btrfs based
> solution that will involve duplicating a data store on a second
> machine for backup purposes (the machine is only powered up for
> backups).
>
> Two quick questions:
> - If I were simply to create a Btrfs volume using 5x3TB drives and not
> create a raid5/6/10 array I understand data would be striped across
> the 5 drives with no redundancy ... i.e. if a drive fails all data is
> lost? Is this correct?
With RAID-1 metadata and single data, when you lose a device the FS
will continue to be usable. Any data that was stored on the missing
device will return an I/O error when you try to read it. With single
data, the data space is assigned to devices in 1 GiB chunks in
turn. Within that, files that are written once and not modified are
likely to be placed linearly within that sequence. Files that get
modified may have their modifications placed out of sequence on other
chunks and devices.
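As a rough sketch of that configuration (placeholder device names; any
reasonably recent btrfs-progs):

    # Single data, raid1 metadata -- the behaviour described above:
    mkfs.btrfs -d single -m raid1 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
    mount /dev/sdb /mnt/pool

    # Shows how the ~1 GiB data chunks are distributed across the devices:
    btrfs filesystem usage /mnt/pool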
> - Is Btrfs RAID10 (for data) ready to be used reliably?
I'd say yes, others may say no. I'd suggest using RAID-1 for now
anyway -- it uses the space better when you come to add new devices or
replace them (with different sizes).
Hugo.
--
Hugo Mills | Preventing talpidian orogenesis.
hugo@... carfax.org.uk |
http://carfax.org.uk/ |
PGP: E2AB1DE4 |
* Re: Understanding BTRFS storage
2015-08-26 8:56 Understanding BTRFS storage George Duffield
2015-08-26 11:41 ` Austin S Hemmelgarn
2015-08-26 11:50 ` Hugo Mills
@ 2015-08-26 11:50 ` Roman Mamedov
2015-08-26 12:03 ` Austin S Hemmelgarn
2015-08-26 11:50 ` Duncan
3 siblings, 1 reply; 18+ messages in thread
From: Roman Mamedov @ 2015-08-26 11:50 UTC (permalink / raw)
To: George Duffield; +Cc: linux-btrfs
On Wed, 26 Aug 2015 10:56:03 +0200
George Duffield <forumscollective@gmail.com> wrote:
> I'm looking to switch from a 5x3TB mdadm raid5 array to a Btrfs based
> solution that will involve duplicating a data store on a second
> machine for backup purposes (the machine is only powered up for
> backups).
What do you want to achieve by switching? As Btrfs RAID5/6 is not safe yet,
do you also plan to migrate to RAID10, losing some storage efficiency?
Why not use Btrfs in single-device mode on top of your mdadm RAID5/6? You
could even migrate without moving any data if you currently use Ext4, as it
can be converted to Btrfs in-place.
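Roughly, assuming the existing array shows up as /dev/md0 (a sketch only,
names are placeholders):

    # Fresh Btrfs as a plain single-device filesystem on the md array:
    mkfs.btrfs /dev/md0
    mount /dev/md0 /mnt/pool

    # Or the in-place route from an unmounted ext4 on that device:
    btrfs-convert /dev/md0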
--
With respect,
Roman
* Re: Understanding BTRFS storage
2015-08-26 8:56 Understanding BTRFS storage George Duffield
` (2 preceding siblings ...)
2015-08-26 11:50 ` Roman Mamedov
@ 2015-08-26 11:50 ` Duncan
3 siblings, 0 replies; 18+ messages in thread
From: Duncan @ 2015-08-26 11:50 UTC (permalink / raw)
To: linux-btrfs
George Duffield posted on Wed, 26 Aug 2015 10:56:03 +0200 as excerpted:
> Two quick questions:
> - If I were simply to create a Btrfs volume using 5x3TB drives and not
> create a raid5/6/10 array I understand data would be striped across the
> 5 drives with no redundancy ... i.e. if a drive fails all data is lost?
> Is this correct?
I'm not actually sure whether the data default on a multi-device
filesystem is raid0 (all data effectively lost if a device dies) or what
btrfs calls single mode, which is what it uses for a single device; on a
multi-device fs, single mode is sort of like raid0 but with very large
strips. Earlier on the default was single mode, but somebody commented
that it's raid0 now, so I'm no longer sure what the current default is.
In single mode, files written all at once and not changed, up to a gig
in size (that being the nominal data chunk size), will likely end up on
a single device. With five devices, dropping only one should in theory
leave many of those files, and even a reasonable number of 2 GiB files,
intact. However, fragmentation or rewriting some data within a file
would tend to spread it out among data chunks, and thus likely across
more devices, making the chance of losing it higher.
Meanwhile, metadata default remains paired-mirrored raid1, regardless of
the number of devices.
But you can always specify the data and metadata raid levels as desired,
assuming you have at least the minimum number of devices required for
that raid level. I always specify them here, preferring raid1 for both
data and metadata, though if it were available, I'd probably use
3-way mirroring. That's roadmapped but probably won't be available for
a year or so yet, and it'll take some time to stabilize after that.
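Specifying them looks roughly like this (made-up device and mount point
names):

    # At mkfs time:
    mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc

    # Or convert the profiles of an existing filesystem later via balance:
    btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/pool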
> - Is Btrfs RAID10 (for data) ready to be used reliably?
Btrfs raid0/1/10 modes as well as single and (for single device metadata)
dup modes are all relatively mature, and should be as stable as btrfs
itself, meaning stabilizing, but not fully stable just yet, with bugs
from time to time.
Basically, that means the sysadmin's backup rule applies double: if it's
not backed up, then by action and definition it wasn't valuable,
regardless of claims to the contrary (and a backup isn't complete until
it's been tested usable/restorable). Really, have backups or you're
playing Russian roulette with your data -- but those modes are stable
enough for daily use, as long as you do have those backups or the data
is simply throw-away.
Btrfs raid56 (5 and 6, it's the same code dealing with both) modes were
nominally code-complete as of 3.19, but are still new enough that they
haven't reached the stability of the rest of btrfs yet. As such, I've
been suggesting that unless people are prepared to deal with that
additional potential instability and bugginess, they wait for a year
after introduction, effectively five kernel cycles, which should put
raid56 at roughly the stability of the rest of btrfs around the 4.4
kernel timeframe.
Similarly, quota code has been a problem and remains less than stable, so
don't use btrfs quotas in the near term (until at least 4.3, then see
what behavior looks like), unless of course you're doing so in
cooperation with the devs working on it specifically to help test and
stabilize it.
Other features are generally as stable as btrfs as a whole, except that
keeping to, say, 250-ish snapshots per subvolume and 1000-2000 snapshots
per filesystem is recommended: snapshotting works well in general as
long as there aren't too many snapshots, but it simply doesn't scale
well in terms of maintenance time -- device replaces, balances, btrfs
checks, etc.
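A quick, rough way to keep an eye on the count (mount point is a
placeholder):

    # -s limits the listing to snapshot subvolumes:
    btrfs subvolume list -s /mnt/pool | wc -l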
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: Understanding BTRFS storage
2015-08-26 11:50 ` Roman Mamedov
@ 2015-08-26 12:03 ` Austin S Hemmelgarn
2015-08-27 2:58 ` Duncan
2015-08-28 8:50 ` George Duffield
0 siblings, 2 replies; 18+ messages in thread
From: Austin S Hemmelgarn @ 2015-08-26 12:03 UTC (permalink / raw)
To: Roman Mamedov, George Duffield; +Cc: linux-btrfs
On 2015-08-26 07:50, Roman Mamedov wrote:
> On Wed, 26 Aug 2015 10:56:03 +0200
> George Duffield <forumscollective@gmail.com> wrote:
>
>> I'm looking to switch from a 5x3TB mdadm raid5 array to a Btrfs based
>> solution that will involve duplicating a data store on a second
>> machine for backup purposes (the machine is only powered up for
>> backups).
>
> What do you want to achieve by switching? As Btrfs RAID5/6 is not safe yet, do
> you also plan to migrate to RAID10, losing in storage efficiency?
>
> Why not use Btrfs in single-device mode on top of your mdadm RAID5/6? Can even
> migrate without moving any data if you currently use Ext4, as it can be
> converted to Btrfs in-place.
>
As of right now, btrfs-convert does not work reliably or safely. I
would strongly advise against using it unless you are trying to help get
it working again.
* Re: Understanding BTRFS storage
2015-08-26 12:03 ` Austin S Hemmelgarn
@ 2015-08-27 2:58 ` Duncan
2015-08-27 12:01 ` Austin S Hemmelgarn
2015-08-28 8:50 ` George Duffield
1 sibling, 1 reply; 18+ messages in thread
From: Duncan @ 2015-08-27 2:58 UTC (permalink / raw)
To: linux-btrfs
Austin S Hemmelgarn posted on Wed, 26 Aug 2015 08:03:40 -0400 as
excerpted:
> On 2015-08-26 07:50, Roman Mamedov wrote:
>> On Wed, 26 Aug 2015 10:56:03 +0200 George Duffield
>> <forumscollective@gmail.com> wrote:
>>
>>> I'm looking to switch from a 5x3TB mdadm raid5 array to a Btrfs based
>>> solution that will involve duplicating a data store on a second
>>> machine for backup purposes (the machine is only powered up for
>>> backups).
>>
>> What do you want to achieve by switching? As Btrfs RAID5/6 is not safe
>> yet, do you also plan to migrate to RAID10, losing in storage
>> efficiency?
>>
>> Why not use Btrfs in single-device mode on top of your mdadm RAID5/6?
>> Can even migrate without moving any data if you currently use Ext4, as
>> it can be converted to Btrfs in-place.
Someone (IIRC it was Austin H) posted what I thought was an extremely
good setup, a few weeks ago. Create two (or more) mdraid0s, and put
btrfs raid1 (or raid5/6 when it's a bit more mature, I've been
recommending waiting until 4.4 and see what the on-list reports for it
look like then) on top. The btrfs raid on top lets you use btrfs' data
integrity features, while the mdraid0s beneath help counteract the fact
that btrfs isn't well optimized for speed yet, the way mdraid has been.
And the btrfs raid on top means all is not lost when a device goes bad in
one of the mdraid0s, as would normally be the case: the other raid0(s),
functioning as the remaining btrfs devices, let you rebuild the missing
btrfs device once you recreate the failed raid0.
Normally, that sort of raid01 is discouraged in favor of raid10, with
raid1 at the lower level and raid0 on top, for more efficient rebuilds,
but btrfs' data integrity features change that story entirely. =:^)
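A minimal sketch of that layout, with four disks and made-up names:

    # Two mdraid0 pairs...
    mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda /dev/sdb
    mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sdc /dev/sdd

    # ...with btrfs raid1 (data and metadata) across them:
    mkfs.btrfs -d raid1 -m raid1 /dev/md0 /dev/md1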
> As of right now, btrfs-convert does not work reliably or safely. I
> would strongly advise against using it unless you are trying to help get
> it working again.
Seconded. Better to use your existing ext4 as a backup, which you should
have anyway if you value your data, and copy the data from that ext4
"backup" to a new btrfs created with mkfs.btrfs using your preferred
options. That leaves the existing ext4 in place /as/ a backup, while
starting with a fresh and clean btrfs, set up with exactly the options
you want.
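Something along these lines, with placeholder devices and paths:

    mkfs.btrfs -d raid1 -m raid1 /dev/sdc /dev/sdd
    mount /dev/sdc /mnt/newpool
    mount -o ro /dev/md0 /mnt/oldext4    # the existing ext4 array, read-only
    rsync -aHAX /mnt/oldext4/ /mnt/newpool/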
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: Understanding BTRFS storage
2015-08-27 2:58 ` Duncan
@ 2015-08-27 12:01 ` Austin S Hemmelgarn
2015-08-28 9:47 ` Duncan
0 siblings, 1 reply; 18+ messages in thread
From: Austin S Hemmelgarn @ 2015-08-27 12:01 UTC (permalink / raw)
To: Duncan, linux-btrfs
On 2015-08-26 22:58, Duncan wrote:
> Austin S Hemmelgarn posted on Wed, 26 Aug 2015 08:03:40 -0400 as
> excerpted:
>
>> On 2015-08-26 07:50, Roman Mamedov wrote:
>>> On Wed, 26 Aug 2015 10:56:03 +0200 George Duffield
>>> <forumscollective@gmail.com> wrote:
>>>
>>>> I'm looking to switch from a 5x3TB mdadm raid5 array to a Btrfs based
>>>> solution that will involve duplicating a data store on a second
>>>> machine for backup purposes (the machine is only powered up for
>>>> backups).
>>>
>>> What do you want to achieve by switching? As Btrfs RAID5/6 is not safe
>>> yet, do you also plan to migrate to RAID10, losing in storage
>>> efficiency?
>>>
>>> Why not use Btrfs in single-device mode on top of your mdadm RAID5/6?
>>> Can even migrate without moving any data if you currently use Ext4, as
>>> it can be converted to Btrfs in-place.
>
> Someone (IIRC it was Austin H) posted what I thought was an extremely
> good setup, a few weeks ago. Create two (or more) mdraid0s, and put
> btrfs raid1 (or raid5/6 when it's a bit more mature, I've been
> recommending waiting until 4.4 and see what the on-list reports for it
> look like then) on top. The btrfs raid on top lets you use btrfs' data
> integrity features, while the mdraid0s beneath help counteract the fact
> that btrfs isn't well optimized for speed yet, the way mdraid has been.
> And the btrfs raid on top means all is not lost with a device going bad
> in the mdraid0, as would normally be the case, since the other raid0(s),
> functioning as the remaining btrfs devices, let you rebuild the missing
> btrfs device, by recreating the missing raid0.
>
> Normally, that sort of raid01 is discouraged in favor of raid10, with
> raid1 at the lower level and raid0 on top, for more efficient rebuilds,
> but btrfs' data integrity features change that story entirely. =:^)
>
Two additional things:
1. If you use MD RAID1 instead of RAID0, it's just as fast for reads, no
slower than on top of single disks for writes, and gets you better data
safety guarantees than even raid6 (if you do 2 MD RAID1 devices with
BTRFS raid1 on top, you can lose all but one disk and still have all
your data). A sketch of that layout follows below.
2. I would be cautious of MD/DM RAID on the most recent kernels: the
clustered MD code that went in recently broke a lot of things initially,
and I'm not yet convinced that they have managed to glue everything back
together (I'm still having occasional problems with RAID1 and RAID10 on
LVM), so do some testing on a non-production system first.
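Sketch of option 1, with made-up device names:

    # Two mdraid1 pairs instead of the raid0s...
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda /dev/sdb
    mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc /dev/sdd

    # ...with btrfs raid1 on top; four copies of everything in total:
    mkfs.btrfs -d raid1 -m raid1 /dev/md0 /dev/md1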
* Re: Understanding BTRFS storage
2015-08-26 12:03 ` Austin S Hemmelgarn
2015-08-27 2:58 ` Duncan
@ 2015-08-28 8:50 ` George Duffield
2015-08-28 9:35 ` Hugo Mills
2015-08-28 9:46 ` Roman Mamedov
1 sibling, 2 replies; 18+ messages in thread
From: George Duffield @ 2015-08-28 8:50 UTC (permalink / raw)
To: Austin S Hemmelgarn; +Cc: Roman Mamedov, linux-btrfs
Running a traditional raid5 array of that size is statistically
guaranteed to fail in the event of a rebuild. I also need to expand
the size of available storage to accommodate future storage
requirements. My understanding is that a Btrfs array is easily
expanded without the overhead associated with expanding a traditional
array. Add to that the ability to throw varying drive sizes at the
problem, and a Btrfs RAID array looks pretty appealing.
For clarity, my intention is to create a Btrfs array using new drives,
not to convert the existing ext4 raid5 array.
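The expansion path I have in mind is essentially this (a sketch with
placeholder names):

    # Grow the pool by adding a drive of whatever size is handy:
    btrfs device add /dev/sdf /mnt/pool

    # Optionally re-spread existing data across the new device:
    btrfs balance start /mnt/pool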
On Wed, Aug 26, 2015 at 2:03 PM, Austin S Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2015-08-26 07:50, Roman Mamedov wrote:
>>
>> On Wed, 26 Aug 2015 10:56:03 +0200
>> George Duffield <forumscollective@gmail.com> wrote:
>>
>>> I'm looking to switch from a 5x3TB mdadm raid5 array to a Btrfs based
>>> solution that will involve duplicating a data store on a second
>>> machine for backup purposes (the machine is only powered up for
>>> backups).
>>
>>
>> What do you want to achieve by switching? As Btrfs RAID5/6 is not safe
>> yet, do
>> you also plan to migrate to RAID10, losing in storage efficiency?
>>
>> Why not use Btrfs in single-device mode on top of your mdadm RAID5/6? Can
>> even
>> migrate without moving any data if you currently use Ext4, as it can be
>> converted to Btrfs in-place.
>>
> As of right now, btrfs-convert does not work reliably or safely. I would
> strongly advise against using it unless you are trying to help get it
> working again.
>
* Re: Understanding BTRFS storage
2015-08-28 8:50 ` George Duffield
@ 2015-08-28 9:35 ` Hugo Mills
2015-08-28 15:42 ` Chris Murphy
` (2 more replies)
2015-08-28 9:46 ` Roman Mamedov
1 sibling, 3 replies; 18+ messages in thread
From: Hugo Mills @ 2015-08-28 9:35 UTC (permalink / raw)
To: George Duffield; +Cc: Austin S Hemmelgarn, Roman Mamedov, linux-btrfs
On Fri, Aug 28, 2015 at 10:50:12AM +0200, George Duffield wrote:
> Running a traditional raid5 array of that size is statistically
> guaranteed to fail in the event of a rebuild.
Except that if it were, you wouldn't see anyone running RAID-5
arrays of that size and (considerably) larger. And successfully
replacing devices in them.
As I understand it, the calculations that lead to the conclusion
you quote are based on the assumption that the bit error rate (BER) of
the drive applies to every read -- this is not the case. The BER is
the error rate of the platter after the device has been left unread
(and powered off) for some long period of time. (I've seen 5 years
quoted for that.)
Hugo.
> I also need to expand
> the size of available storage to accommodate future storage
> requirements. My understanding is that a Btrfs array is easily
> expanded without the overhead associated with expanding a traditional
> array. Add to that the ability to throw varying drive sizes at the
> problem and a Btrfs RAID array looks pretty appealing.
>
> For clarity, my intention is to create a Btrfs array using new drives,
> not to convert the existing ext4 raid5 array.
>
> On Wed, Aug 26, 2015 at 2:03 PM, Austin S Hemmelgarn
> <ahferroin7@gmail.com> wrote:
> > On 2015-08-26 07:50, Roman Mamedov wrote:
> >>
> >> On Wed, 26 Aug 2015 10:56:03 +0200
> >> George Duffield <forumscollective@gmail.com> wrote:
> >>
> >>> I'm looking to switch from a 5x3TB mdadm raid5 array to a Btrfs based
> >>> solution that will involve duplicating a data store on a second
> >>> machine for backup purposes (the machine is only powered up for
> >>> backups).
> >>
> >>
> >> What do you want to achieve by switching? As Btrfs RAID5/6 is not safe
> >> yet, do
> >> you also plan to migrate to RAID10, losing in storage efficiency?
> >>
> >> Why not use Btrfs in single-device mode on top of your mdadm RAID5/6? Can
> >> even
> >> migrate without moving any data if you currently use Ext4, as it can be
> >> converted to Btrfs in-place.
> >>
> > As of right now, btrfs-convert does not work reliably or safely. I would
> > strongly advise against using it unless you are trying to help get it
> > working again.
> >
--
Hugo Mills | Beware geeks bearing GIFs
hugo@... carfax.org.uk |
http://carfax.org.uk/ |
PGP: E2AB1DE4 |
* Re: Understanding BTRFS storage
2015-08-28 8:50 ` George Duffield
2015-08-28 9:35 ` Hugo Mills
@ 2015-08-28 9:46 ` Roman Mamedov
1 sibling, 0 replies; 18+ messages in thread
From: Roman Mamedov @ 2015-08-28 9:46 UTC (permalink / raw)
To: George Duffield; +Cc: Austin S Hemmelgarn, linux-btrfs
On Fri, 28 Aug 2015 10:50:12 +0200
George Duffield <forumscollective@gmail.com> wrote:
> Running a traditional raid5 array of that size is statistically
> guaranteed to fail in the event of a rebuild.
Yeah, I consider RAID5 to be safe up to about 4 devices. As you already
have 5 and are looking to expand, I'd recommend going RAID6. The "fail on
rebuild" issue is almost completely mitigated by it, perhaps up to a dozen
drives or more.
I don't know about your usage scenarios, but for me the loss of storage
efficiency in RAID10 compared to RAID6 is unacceptable, and I also don't
need the performance benefits of RAID10 at all.
So from both the efficiency and stability standpoints, my personal choice
currently is Btrfs in single-device mode on top of MD RAID5 (in a 4-drive
array) and RAID6 (with 7 drives).
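In concrete terms that setup looks roughly like this (sketch, device
names made up):

    mdadm --create /dev/md0 --level=6 --raid-devices=7 /dev/sd[b-h]
    mkfs.btrfs /dev/md0
    mount /dev/md0 /mnt/pool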
--
With respect,
Roman
* Re: Understanding BTRFS storage
2015-08-27 12:01 ` Austin S Hemmelgarn
@ 2015-08-28 9:47 ` Duncan
2015-08-28 12:54 ` Austin S Hemmelgarn
0 siblings, 1 reply; 18+ messages in thread
From: Duncan @ 2015-08-28 9:47 UTC (permalink / raw)
To: linux-btrfs
Austin S Hemmelgarn posted on Thu, 27 Aug 2015 08:01:58 -0400 as
excerpted:
>> Someone (IIRC it was Austin H) posted what I thought was an extremely
>> good setup, a few weeks ago. Create two (or more) mdraid0s, and put
>> btrfs raid1 (or raid5/6 when it's a bit more mature, I've been
>> recommending waiting until 4.4 and see what the on-list reports for it
>> look like then) on top. The btrfs raid on top lets you use btrfs' data
>> integrity features, while the mdraid0s beneath help counteract the fact
>> that btrfs isn't well optimized for speed yet, the way mdraid has been.
>> And the btrfs raid on top means all is not lost with a device going bad
>> in the mdraid0, as would normally be the case, since the other
>> raid0(s),
>> functioning as the remaining btrfs devices, let you rebuild the missing
>> btrfs device, by recreating the missing raid0.
>>
>> Normally, that sort of raid01 is discouraged in favor of raid10, with
>> raid1 at the lower level and raid0 on top, for more efficient rebuilds,
>> but btrfs' data integrity features change that story entirely. =:^)
>>
> Two additional things:
> 1. If you use MD RAID1 instead of RAID0, it's just as fast for reads, no
> slower than on top of single disks for writes, and get's you better data
> safety guarantees than even raid6 (if you do 2 MD RAID 1 devices with
> BTRFS raid1 on top, you can lose all but one disk and still have all
> your data).
My hesitation about btrfs raid1 on top of mdraid1 is that a btrfs scrub
doesn't scrub all the mdraid component devices.
Of course if btrfs scrub finds an error, it will try to rewrite the bad
copy from the (hopefully good) other btrfs raid1 copy, and that will
trigger a rewrite of both/all copies on that underlying mdraid1, which
should catch the bad one in the process no matter which one it was.
But if one of the lower level mdraid1 component devices is bad while the
other(s) are good, and mdraid happens to pick the good device, it won't
even see and thus can't scrub the bad lower-level copy.
To avoid that problem, one can of course do an mdraid-level scrub
followed by a btrfs scrub. The mdraid-level scrub won't tell bad from
good, but will simply ensure the copies match; if it happens to keep the
bad one at that level, the follow-on btrfs-level scrub will detect that
and trigger a rewrite from its other copy, which again rewrites both/all
the underlying mdraid1 component devices on that btrfs raid1 side. But
that still wouldn't ensure the rewrite actually happened properly, so
you're left redoing both levels yet again to verify it.
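A rough sketch of that two-pass routine (md device and mount point are
placeholders):

    # mdraid-level scrub first (repeat for each component array):
    echo check > /sys/block/md0/md/sync_action
    cat /proc/mdstat    # wait for the check to finish

    # then the btrfs-level scrub to catch any bad copy md happened to keep:
    btrfs scrub start -Bd /mnt/pool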
That can work in theory, but in practice, particularly on spinning rust,
you pretty quickly reach a point where you're running 24/7 scrubs, which
will kill throughput for pretty much any other IO going on at the same
time.
That is one of the reasons I found btrfs raid1 on mdraid0 so appealing
in comparison -- raid0 keeps only a single copy, which is either correct
or incorrect. If the btrfs scrub turns up a problem, it does the
rewrite, and a single second pass of that btrfs scrub can verify that
the rewrite happened correctly, because there are no hidden copies being
picked more or less randomly at the mdraid level. I like that
determinism! =:^)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: Understanding BTRFS storage
2015-08-28 9:47 ` Duncan
@ 2015-08-28 12:54 ` Austin S Hemmelgarn
0 siblings, 0 replies; 18+ messages in thread
From: Austin S Hemmelgarn @ 2015-08-28 12:54 UTC (permalink / raw)
To: Duncan, linux-btrfs
On 2015-08-28 05:47, Duncan wrote:
> Austin S Hemmelgarn posted on Thu, 27 Aug 2015 08:01:58 -0400 as
> excerpted:
>
>>> Someone (IIRC it was Austin H) posted what I thought was an extremely
>>> good setup, a few weeks ago. Create two (or more) mdraid0s, and put
>>> btrfs raid1 (or raid5/6 when it's a bit more mature, I've been
>>> recommending waiting until 4.4 and see what the on-list reports for it
>>> look like then) on top. The btrfs raid on top lets you use btrfs' data
>>> integrity features, while the mdraid0s beneath help counteract the fact
>>> that btrfs isn't well optimized for speed yet, the way mdraid has been.
>>> And the btrfs raid on top means all is not lost with a device going bad
>>> in the mdraid0, as would normally be the case, since the other
>>> raid0(s),
>>> functioning as the remaining btrfs devices, let you rebuild the missing
>>> btrfs device, by recreating the missing raid0.
>>>
>>> Normally, that sort of raid01 is discouraged in favor of raid10, with
>>> raid1 at the lower level and raid0 on top, for more efficient rebuilds,
>>> but btrfs' data integrity features change that story entirely. =:^)
>>>
>> Two additional things:
>> 1. If you use MD RAID1 instead of RAID0, it's just as fast for reads, no
>> slower than on top of single disks for writes, and get's you better data
>> safety guarantees than even raid6 (if you do 2 MD RAID 1 devices with
>> BTRFS raid1 on top, you can lose all but one disk and still have all
>> your data).
>
> My hesitation for btrfs raid1 on top of mdraid1, is that a btrfs scrub
> doesn't scrub all the mdraid component devices.
>
> Of course if btrfs scrub finds an error, it will try to rewrite the bad
> copy from the (hopefully good) other btrfs raid1 copy, and that will
> trigger a rewrite of both/all copies on that underlying mdraid1, which
> should catch the bad one in the process no matter which one it was.
>
> But if one of the lower level mdraid1 component devices is bad while the
> other(s) are good, and mdraid happens to pick the good device, it won't
> even see and thus can't scrub the bad lower-level copy.
>
> To avoid that problem, one can of course do an mdraid level scrub
> followed by a btrfs scrub. The mdraid level scrub won't tell bad from
> good but will simply ensure they match, and if it happens to pick the bad
> one at that level, the followon btrfs level scrub will detect that and
> trigger a rewrite from its other copy, which again, will rewrite both/all
> the underlying mdraid1 component devices on that btrfs raid1 side, but
> that still wouldn't ensure that the rewrite actually happened properly,
> so then you're left redoing both levels yet again, to ensure that.
>
> Which in theory can work, but in practice, particularly on spinning rust,
> you pretty quickly reach a point when you're running 24/7 scrubs, which,
> again particularly on spinning rust, is going to kill throughput for
> pretty much any other IO going on at the same time.
Well yes, but only if you are working with large data sets. In my use
case the usage amounts to write once, read at most twice, and the data
sets are both less than 32G, so scrubbing the lower-level RAID1 takes
about 10 minutes as of right now. The arrays get written to at most once
a day, and are only read when the primary data sources fail; performance
isn't as important to me as uptime.
>
> Which is one of the reasons I found btrfs raid1 on mdraid0 so appealing
> in comparison -- raid0 has only the single copy, which is either correct
> or incorrect, and if the btrfs scrub turns up a problem, it does the
> rewrite, and a single second pass of that btrfs scrub can verify that the
> rewrite happened correctly, because there's no hidden copies being picked
> more or less randomly at the mdraid level, only the single copy, which is
> either correct or incorrect. I like that determinism! =:^)
>
* Re: Understanding BTRFS storage
2015-08-28 9:35 ` Hugo Mills
@ 2015-08-28 15:42 ` Chris Murphy
2015-08-28 17:11 ` Austin S Hemmelgarn
2015-08-29 8:52 ` George Duffield
2015-09-02 5:01 ` Russell Coker
2 siblings, 1 reply; 18+ messages in thread
From: Chris Murphy @ 2015-08-28 15:42 UTC (permalink / raw)
To: Hugo Mills, George Duffield, Austin S Hemmelgarn, Roman Mamedov,
Btrfs BTRFS
On Fri, Aug 28, 2015 at 3:35 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
> On Fri, Aug 28, 2015 at 10:50:12AM +0200, George Duffield wrote:
>> Running a traditional raid5 array of that size is statistically
>> guaranteed to fail in the event of a rebuild.
>
> Except that if it were, you wouldn't see anyone running RAID-5
> arrays of that size and (considerably) larger. And successfully
> replacing devices in them.
>
> As I understand it, the calculations that lead to the conclusion
> you quote are based on the assumption that the bit error rate (BER) of
> the drive is applied on all reads -- this is not the case. The BER is
> the error rate of the platter after the device has been left unread
> (and powered off) for some long period of time. (I've seen 5 years
> been quoted for that).
I think the confusion comes from the Unrecovered Read Error (URE) or
"Non-recoverable read errors per bits read" figure in the drive spec
sheet. For example, on a WDC Red this is written as "<1 in 10^14", but
it gets (wrongly) reinterpreted as an *expected* URE once every 12.5TB
(not TiB) read, which is of course complete utter bullshit. But it gets
repeated all the time.
It's as if symbols have no meaning, and < were some sort of arrow, or
someone got bored and just didn't want to use a space. That symbol makes
the URE value a maximum for what is ostensibly a scientific sample of
drives. We have no idea what the minimum is, we don't even know the
mean, and it's not in the manufacturer's best interest to tell us. The
mean between consumer SATA and enterprise SAS may not be all that
different, while the maximum is two orders of magnitude better for
enterprise SAS, so it makes sense to try to upsell us with that promise.
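To put numbers on the misreading: even if you take the spec-sheet upper
bound as though it were the actual per-bit rate, the worst case for a
~12 TB rebuild read works out to something like:

    awk 'BEGIN { p = 1e-14; bits = 12e12 * 8;
                 printf "P(at least one URE) <= %.2f\n", 1 - exp(-p * bits) }'
    # prints roughly 0.62 -- an upper bound, not an expectation, and real
    # drives routinely do far better than the spec-sheet maximum.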
--
Chris Murphy
* Re: Understanding BTRFS storage
2015-08-28 15:42 ` Chris Murphy
@ 2015-08-28 17:11 ` Austin S Hemmelgarn
0 siblings, 0 replies; 18+ messages in thread
From: Austin S Hemmelgarn @ 2015-08-28 17:11 UTC (permalink / raw)
To: Chris Murphy, Hugo Mills, George Duffield, Roman Mamedov,
Btrfs BTRFS
On 2015-08-28 11:42, Chris Murphy wrote:
> On Fri, Aug 28, 2015 at 3:35 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
>> On Fri, Aug 28, 2015 at 10:50:12AM +0200, George Duffield wrote:
>>> Running a traditional raid5 array of that size is statistically
>>> guaranteed to fail in the event of a rebuild.
>>
>> Except that if it were, you wouldn't see anyone running RAID-5
>> arrays of that size and (considerably) larger. And successfully
>> replacing devices in them.
>>
>> As I understand it, the calculations that lead to the conclusion
>> you quote are based on the assumption that the bit error rate (BER) of
>> the drive is applied on all reads -- this is not the case. The BER is
>> the error rate of the platter after the device has been left unread
>> (and powered off) for some long period of time. (I've seen 5 years
>> been quoted for that).
>
> I think the confusion comes from the Unrecovered Read Error (URE) or
> "Non-recoverable read errors per bits read" in the drive spec sheet.
> e.g. on a WDC Red this is written as "<1 in 10^14" but this gets
> (wrongly) reinterpreted into an *expected* URE once every 12.5TB (not
> TiB) read, which is of course complete utter bullshit. But it gets
> repeated all the time.
>
> It's as if symbols have no meaning, and < is some sort of arrow, or
> someone got bored and just didn't want to use a space. That symbol
> makes the URE value a maximum for what is ostensibly a scientific
> sample of drives. We have no idea what the minimum is, we don't even
> know the mean, and it's not in the manufacturer's best interest to do
> that. The mean between consumer SATA and enterprise SAS may not be all
> that different, while the maximum is two orders magnitude better for
> enterprise SAS so it makes sense to try to upsell us with that
> promise.
>
That probably is the case; the truly sad thing is that there are so many
engineers (read as 'people who are supposed to actually pay attention to
the specs') who make this mistake on a regular basis.
* Re: Understanding BTRFS storage
2015-08-28 9:35 ` Hugo Mills
2015-08-28 15:42 ` Chris Murphy
@ 2015-08-29 8:52 ` George Duffield
2015-08-29 22:28 ` Chris Murphy
2015-09-02 5:01 ` Russell Coker
2 siblings, 1 reply; 18+ messages in thread
From: George Duffield @ 2015-08-29 8:52 UTC (permalink / raw)
To: Hugo Mills, George Duffield, Austin S Hemmelgarn, Roman Mamedov,
linux-btrfs
Funny you should say that: whilst I'd read about it, it didn't concern
me much until Neil Brown himself advised me against expanding the
raid5 arrays any further (one was built using 3TB drives and the other
using 4TB drives). My understanding is that larger arrays are
typically built using more drives of lower capacity. I'm also loath
to use mdadm, as expanding arrays takes forever, whereas a Btrfs array
should expand much more quickly. If Btrfs raid isn't yet ready for prime
time I'll just hold off doing anything for the moment, frustrating as
that is.
On Fri, Aug 28, 2015 at 11:35 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
> On Fri, Aug 28, 2015 at 10:50:12AM +0200, George Duffield wrote:
>> Running a traditional raid5 array of that size is statistically
>> guaranteed to fail in the event of a rebuild.
>
> Except that if it were, you wouldn't see anyone running RAID-5
> arrays of that size and (considerably) larger. And successfully
> replacing devices in them.
>
> As I understand it, the calculations that lead to the conclusion
> you quote are based on the assumption that the bit error rate (BER) of
> the drive is applied on all reads -- this is not the case. The BER is
> the error rate of the platter after the device has been left unread
> (and powered off) for some long period of time. (I've seen 5 years
> been quoted for that).
>
> Hugo.
>
>> I also need to expand
>> the size of available storage to accommodate future storage
>> requirements. My understanding is that a Btrfs array is easily
>> expanded without the overhead associated with expanding a traditional
>> array. Add to that the ability to throw varying drive sizes at the
>> problem and a Btrfs RAID array looks pretty appealing.
>>
>> For clarity, my intention is to create a Btrfs array using new drives,
>> not to convert the existing ext4 raid5 array.
>>
>> On Wed, Aug 26, 2015 at 2:03 PM, Austin S Hemmelgarn
>> <ahferroin7@gmail.com> wrote:
>> > On 2015-08-26 07:50, Roman Mamedov wrote:
>> >>
>> >> On Wed, 26 Aug 2015 10:56:03 +0200
>> >> George Duffield <forumscollective@gmail.com> wrote:
>> >>
>> >>> I'm looking to switch from a 5x3TB mdadm raid5 array to a Btrfs based
>> >>> solution that will involve duplicating a data store on a second
>> >>> machine for backup purposes (the machine is only powered up for
>> >>> backups).
>> >>
>> >>
>> >> What do you want to achieve by switching? As Btrfs RAID5/6 is not safe
>> >> yet, do
>> >> you also plan to migrate to RAID10, losing in storage efficiency?
>> >>
>> >> Why not use Btrfs in single-device mode on top of your mdadm RAID5/6? Can
>> >> even
>> >> migrate without moving any data if you currently use Ext4, as it can be
>> >> converted to Btrfs in-place.
>> >>
>> > As of right now, btrfs-convert does not work reliably or safely. I would
>> > strongly advise against using it unless you are trying to help get it
>> > working again.
>> >
>
> --
> Hugo Mills | Beware geeks bearing GIFs
> hugo@... carfax.org.uk |
> http://carfax.org.uk/ |
> PGP: E2AB1DE4 |
* Re: Understanding BTRFS storage
2015-08-29 8:52 ` George Duffield
@ 2015-08-29 22:28 ` Chris Murphy
0 siblings, 0 replies; 18+ messages in thread
From: Chris Murphy @ 2015-08-29 22:28 UTC (permalink / raw)
To: George Duffield
Cc: Hugo Mills, Austin S Hemmelgarn, Roman Mamedov, Btrfs BTRFS
On Sat, Aug 29, 2015 at 2:52 AM, George Duffield
<forumscollective@gmail.com> wrote:
> Funny you should say that, whilst I'd read about it it didn't concern
> me much until Neil Brown himself advised me against expanding the
> raid5 arrays any further (one was built using 3TB drives and the other
> using 4TB drives). My understanding is that larger arrays are
> typically built using more drives of lower capacity. I'm also loathe
> to use mdadm as expanding arrays takes forever whereas a Btrfs array
> should expand much quicker. If Btrfs raid isn't yet ready for prime
> time I'll just hold off doing anything for the moment, frustrating as
> that is.
I think a grid of mdadm vs btrfs feature/behavior comparisons might be useful.
The main thing to be aware of with btrfs multiple-device support is that
the failure handling is really not present, whereas it is with mdadm and
LVM raids. This means btrfs tolerates read and write failures where md
will "eject" the drive from the array after even one write failure, and
after some number of read failures (I'm not sure what the threshold is).
There's also no spares support, and no notification of problems -- just
kernel messages.
Instead of notification emails the mdadm way, I think it's better to
look at the libblockdev and storaged projects, since both of those
are taking on standardizing the manipulation of mdadm arrays, LVM,
LUKS, and other Linux storage technologies. Then projects like (but
not limited to) openLMI and a future udisks2 replacement can get
information and state on such things, and propagate that up to the
user (with email, text message, web browser, whatever).
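In the meantime, a crude stand-in for mdadm's monitoring emails could be
a cron job along these lines (sketch only; assumes a local mail command
and a placeholder mount point):

    # Any non-zero error counter is worth a notification:
    if btrfs device stats /mnt/pool | grep -vqw 0; then
        echo "btrfs errors on /mnt/pool" | mail -s "btrfs alert" root
    fi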
--
Chris Murphy
* Re: Understanding BTRFS storage
2015-08-28 9:35 ` Hugo Mills
2015-08-28 15:42 ` Chris Murphy
2015-08-29 8:52 ` George Duffield
@ 2015-09-02 5:01 ` Russell Coker
2 siblings, 0 replies; 18+ messages in thread
From: Russell Coker @ 2015-09-02 5:01 UTC (permalink / raw)
To: Hugo Mills, linux-btrfs
On Fri, 28 Aug 2015 07:35:02 PM Hugo Mills wrote:
> On Fri, Aug 28, 2015 at 10:50:12AM +0200, George Duffield wrote:
> > Running a traditional raid5 array of that size is statistically
> > guaranteed to fail in the event of a rebuild.
>
> Except that if it were, you wouldn't see anyone running RAID-5
> arrays of that size and (considerably) larger. And successfully
> replacing devices in them.
Let's not assume that everyone who thinks that they are "successfully" running
a RAID-5 array is actually doing so.
One of the features of BTRFS is that you won't get undetected data corruption.
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
end of thread, other threads: [~2015-09-02 5:02 UTC | newest]
Thread overview: 18+ messages -- the index below lists the messages in this thread --
2015-08-26 8:56 Understanding BTRFS storage George Duffield
2015-08-26 11:41 ` Austin S Hemmelgarn
2015-08-26 11:50 ` Hugo Mills
2015-08-26 11:50 ` Roman Mamedov
2015-08-26 12:03 ` Austin S Hemmelgarn
2015-08-27 2:58 ` Duncan
2015-08-27 12:01 ` Austin S Hemmelgarn
2015-08-28 9:47 ` Duncan
2015-08-28 12:54 ` Austin S Hemmelgarn
2015-08-28 8:50 ` George Duffield
2015-08-28 9:35 ` Hugo Mills
2015-08-28 15:42 ` Chris Murphy
2015-08-28 17:11 ` Austin S Hemmelgarn
2015-08-29 8:52 ` George Duffield
2015-08-29 22:28 ` Chris Murphy
2015-09-02 5:01 ` Russell Coker
2015-08-28 9:46 ` Roman Mamedov
2015-08-26 11:50 ` Duncan