linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* btrfs RAID5 or btrfs on md RAID5?
@ 2025-09-22  7:09 Ulli Horlacher
  2025-09-22  7:41 ` Qu Wenruo
  2025-09-22  8:07 ` Lukas Straub
  0 siblings, 2 replies; 21+ messages in thread
From: Ulli Horlacher @ 2025-09-22  7:09 UTC (permalink / raw)
  To: linux-btrfs


I have 4 x 4 TB SAS SSD (from a deactivated Netapp system) which I want to
recycle in my workstation PC (Ubuntu 24 with kernel 6.14).

Is btrfs RAID5 ready for production usage or shall I use non-RAID btrfs on
top of a md RAID5?

What is the current status?

-- 
Ullrich Horlacher              Server und Virtualisierung
Rechenzentrum TIK
Universitaet Stuttgart         E-Mail: horlacher@tik.uni-stuttgart.de
Allmandring 30a                Tel:    ++49-711-68565868
70569 Stuttgart (Germany)      WWW:    https://www.tik.uni-stuttgart.de/
REF:<20250922070956.GA2624931@tik.uni-stuttgart.de>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: btrfs RAID5 or btrfs on md RAID5?
  2025-09-22  7:09 btrfs RAID5 or btrfs on md RAID5? Ulli Horlacher
@ 2025-09-22  7:41 ` Qu Wenruo
  2025-09-22  8:28   ` Ulli Horlacher
                     ` (2 more replies)
  2025-09-22  8:07 ` Lukas Straub
  1 sibling, 3 replies; 21+ messages in thread
From: Qu Wenruo @ 2025-09-22  7:41 UTC (permalink / raw)
  To: Ulli Horlacher, linux-btrfs



On 2025/9/22 16:39, Ulli Horlacher wrote:
> 
> I have 4 x 4 TB SAS SSD (from a deactivated Netapp system) which I want to
> recycle in my workstation PC (Ubuntu 24 with kernel 6.14).
> 
> Is btrfs RAID5 ready for production usage or shall I use non-RAID btrfs on
> top of a md RAID5?

Neither is perfect.

Btrfs RAID56 has no journal to protect against the write hole, but it has the 
ability to properly detect and rebuild corrupted data using data checksums.

Meanwhile md RAID56 has a journal to protect against the write hole, but it 
has no checksums to know which data is correct.

> 
> What is the current status?
> 

No extra work has been done on the btrfs RAID56 write hole for a while.

The experimental raid-stripe-tree has some potential to address the 
problem, but that feature doesn't support RAID56 yet.


Another solution is something like RAIDZ, which requires block size > 
page size support, and extra RAID56 changes (mostly a much smaller stripe 
length, 4K instead of the current 64K).

The bs > ps support is not even merged yet, and the submitted patchset lacks 
certain features (RAID56, ironically).
And no formal RAIDZ support has even been considered.

So you either run RAID5 for data only and run a full scrub after every 
unexpected power loss (slow, and no further writes until the scrub is done, 
which is a further maintenance burden), or just don't use RAID5 at all.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: btrfs RAID5 or btrfs on md RAID5?
  2025-09-22  7:09 btrfs RAID5 or btrfs on md RAID5? Ulli Horlacher
  2025-09-22  7:41 ` Qu Wenruo
@ 2025-09-22  8:07 ` Lukas Straub
  2025-09-22  8:50   ` Ulli Horlacher
  1 sibling, 1 reply; 21+ messages in thread
From: Lukas Straub @ 2025-09-22  8:07 UTC (permalink / raw)
  To: Ulli Horlacher; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 775 bytes --]

On Mon, 22 Sep 2025 09:09:56 +0200
Ulli Horlacher <framstag@rus.uni-stuttgart.de> wrote:

> I have 4 x 4 TB SAS SSD (from a deactivated Netapp system) which I want to
> recycle in my workstation PC (Ubuntu 24 with kernel 6.14).
> 
> Is btrfs RAID5 ready for production usage or shall I use non-RAID btrfs on
> top of a md RAID5?
> 
> What is the current status?
> 

Hi,

md RAID5 with Partial Parity Log is perfect for btrfs:
https://www.kernel.org/doc/html/latest/driver-api/md/raid5-ppl.html

When a stripe is partially updated with new data, PPL ensures that the
old data in the stripe will not be corrupted by the write-hole. The new
data on the other hand is still affected by the write hole, but for
btrfs that is not a problem.
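
For reference, a rough sketch (untested; device and array names are just
examples), both for creating a new array with PPL and for enabling it on an
existing one:

mdadm --create /dev/md0 --level=5 --raid-devices=4 \
      --consistency-policy=ppl /dev/sda4 /dev/sdb4 /dev/sdc4 /dev/sdd4
mkfs.btrfs /dev/md0

# or, on an array that already exists:
mdadm --grow --consistency-policy=ppl /dev/md0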

Regards,
Lukas

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: btrfs RAID5 or btrfs on md RAID5?
  2025-09-22  7:41 ` Qu Wenruo
@ 2025-09-22  8:28   ` Ulli Horlacher
  2025-09-22  9:06     ` Qu Wenruo
  2025-09-22  9:43   ` Ulli Horlacher
  2025-10-21  1:02   ` DanglingPointer
  2 siblings, 1 reply; 21+ messages in thread
From: Ulli Horlacher @ 2025-09-22  8:28 UTC (permalink / raw)
  To: linux-btrfs

On Mon 2025-09-22 (17:11), Qu Wenruo wrote:

> > Is btrfs RAID5 ready for production usage or shall I use non-RAID btrfs on
> > top of a md RAID5?
> 
> Neither is perfect.

We live in a non-perfect world :-}


> Btrfs RAID56 has no journal to protect against write hole.

What does this mean? 
What is a write hole and what is the danger with it?


> So you either run RAID5 for data only

This is a mkfs.btrfs option?
Shall I use "mkfs.btrfs -m dup" or "mkfs.btrfs -m raid1"?


> and run a full scrub after every unexpected power loss (slow, and no
> further writes until the scrub is done, which is a further maintenance burden).

Ubuntu has (like most Linux distributions) systemd.
How can I detect a previous power loss and force full scrub on booting?


> Or just don't use RAID5 at all.

You suggest btrfs RAID0?
As I wrote: I have 4 x 4 TB SAS SSD (enterprise hardware, very reliable).

Another disk layout option for me could be:

64 GB / filesystem RAID1
32 GB swap RAID1
3.9 TB /home
3.9 TB /data
3.9 TB /VM
3.9 TB /backup

In case of an SSD failure I would have to recover from (external) backup.

-- 
Ullrich Horlacher              Server und Virtualisierung
Rechenzentrum TIK
Universitaet Stuttgart         E-Mail: horlacher@tik.uni-stuttgart.de
Allmandring 30a                Tel:    ++49-711-68565868
70569 Stuttgart (Germany)      WWW:    https://www.tik.uni-stuttgart.de/
REF:<d3a5e463-d00e-4428-ad7b-35f87f9a6550@gmx.com>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: btrfs RAID5 or btrfs on md RAID5?
  2025-09-22  8:07 ` Lukas Straub
@ 2025-09-22  8:50   ` Ulli Horlacher
  0 siblings, 0 replies; 21+ messages in thread
From: Ulli Horlacher @ 2025-09-22  8:50 UTC (permalink / raw)
  To: linux-btrfs

On Mon 2025-09-22 (10:07), Lukas Straub wrote:

> md RAID5 with Partial Parity Log is perfect for btrfs:
> https://www.kernel.org/doc/html/latest/driver-api/md/raid5-ppl.html

I already have another system with btrfs on top of md RAID5 (4 x 1.6 TB
SSD):

root@juhu:~# uname -a
Linux juhu 6.8.0-83-generic #83~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Sep  9 18:19:47 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

root@juhu:~# mount | grep local
/dev/md127 on /local type btrfs (rw,relatime,space_cache,user_subvol_rm_allowed,subvolid=5,subvol=/)

root@juhu:~# mdadm --detail /dev/md127
/dev/md127:
           Version : 1.2
     Creation Time : Thu Feb 10 09:38:22 2022
        Raid Level : raid5
        Array Size : 4285387776 (3.99 TiB 4.39 TB)
     Used Dev Size : 1428462592 (1362.29 GiB 1462.75 GB)
      Raid Devices : 4
     Total Devices : 4
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Mon Sep 22 10:43:43 2025
             State : clean
    Active Devices : 4
   Working Devices : 4
    Failed Devices : 0
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : mux:nts1
              UUID : 74388db3:3c3b30c3:e1295cc5:46f23ff7
            Events : 23359

    Number   Major   Minor   RaidDevice State
       0       8       20        0      active sync   /dev/sdb4
       1       8        4        1      active sync   /dev/sda4
       2       8       52        2      active sync   /dev/sdd4
       4       8       36        3      active sync   /dev/sdc4


Shall I enable PPL there with "mdadm --grow --consistency-policy=ppl /dev/md127"?

-- 
Ullrich Horlacher              Server und Virtualisierung
Rechenzentrum TIK
Universitaet Stuttgart         E-Mail: horlacher@tik.uni-stuttgart.de
Allmandring 30a                Tel:    ++49-711-68565868
70569 Stuttgart (Germany)      WWW:    https://www.tik.uni-stuttgart.de/
REF:<20250922100715.7f847dc0@penguin>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: btrfs RAID5 or btrfs on md RAID5?
  2025-09-22  8:28   ` Ulli Horlacher
@ 2025-09-22  9:06     ` Qu Wenruo
  2025-09-22  9:23       ` Ulli Horlacher
  0 siblings, 1 reply; 21+ messages in thread
From: Qu Wenruo @ 2025-09-22  9:06 UTC (permalink / raw)
  To: linux-btrfs



On 2025/9/22 17:58, Ulli Horlacher wrote:
> On Mon 2025-09-22 (17:11), Qu Wenruo wrote:
> 
>>> Is btrfs RAID5 ready for production usage or shall I use non-RAID btrfs on
>>> top of a md RAID5?
>>
>> Neither is perfect.
> 
> We live in a non-perfect world :-}
> 
> 
>> Btrfs RAID56 has no journal to protect against write hole.
> 
> What does this mean?
> What is a write hole and what is the danger with it?

The write hole means that if a power loss happens during a partial stripe 
update, the parity may be left out of sync with the data.

This means that stripe will not get the full protection of RAID5.

E.g. if a device is lost after that power loss, btrfs may not be able to 
rebuild the correct data.
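
As a concrete sketch: take a stripe D1 D2 D3 P with P = D1 xor D2 xor D3.
If D1 is rewritten but power is lost before P is updated, P is stale. If D2
is then lost, rebuilding it as D1_new xor D3 xor P_old returns garbage, even
though D2 itself was never touched by that write.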

> 
> 
>> So you either run RAID5 for data only
> 
> This is a mkfs.btrfs option?
> Shall I use "mkfs.btrfs -m dup" or "mkfs.btrfs -m raid1"?

For RAID5, RAID1 is preferred for data.

> 
> 
>> and run a full scrub after every unexpected power loss (slow, and no
>> further writes until the scrub is done, which is a further maintenance burden).
> 
> Ubuntu has (like most Linux distributions) systemd.
> How can I detect a previous power loss and force full scrub on booting?

Not sure. You may dig into systemd docs to find that out.
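
One rough, untested idea (the mount point /data, the script path, the unit
name and the marker file are all just examples): keep a marker file that is
only removed on a clean shutdown, and scrub if it is still there at the next
boot.

#!/bin/sh
# /usr/local/sbin/scrub-if-unclean (example path)
MARKER=/var/lib/btrfs-unclean-boot
if [ -e "$MARKER" ]; then
    # The marker survived the previous boot, so the last shutdown was
    # unclean: run a full scrub. -B blocks until the scrub finishes.
    btrfs scrub start -B /data
fi
touch "$MARKER"

# /etc/systemd/system/scrub-if-unclean.service (example name)
[Unit]
Description=Full btrfs scrub after an unclean shutdown (sketch)
RequiresMountsFor=/data

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/local/sbin/scrub-if-unclean
# Only runs on a clean shutdown, so the marker disappears.
ExecStop=/bin/rm -f /var/lib/btrfs-unclean-boot

[Install]
WantedBy=multi-user.target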

> 
> 
>> Or just don't use RAID5 at all.
> 
> You suggest btrfs RAID0?

I'd suggest RAID10. But that means you're "wasting" half of your capacity.
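
With four equally sized devices that would be something like (device names
assumed):

mkfs.btrfs -m raid10 -d raid10 /dev/sda4 /dev/sdb4 /dev/sdc4 /dev/sdd4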

Thanks,
Qu

> As I wrote: I have 4 x 4 TB SAS SSD (enterprise hardware, very reliable).
> 
> Another disk layout option for me could be:
> 
> 64 GB / filesystem RAID1
> 32 GB swap RAID1
> 3.9 TB /home
> 3.9 TB /data
> 3.9 TB /VM
> 3.9 TB /backup
> 
> In case of a SSD damage failure I have to recover from (external) backup.
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: btrfs RAID5 or btrfs on md RAID5?
  2025-09-22  9:06     ` Qu Wenruo
@ 2025-09-22  9:23       ` Ulli Horlacher
  2025-09-22  9:27         ` Qu Wenruo
  0 siblings, 1 reply; 21+ messages in thread
From: Ulli Horlacher @ 2025-09-22  9:23 UTC (permalink / raw)
  To: linux-btrfs

On Mon 2025-09-22 (18:36), Qu Wenruo wrote:


> Write-hole means during a partial stripe update, a power loss happened,
> the parity may be out-of-sync.
> 
> This means that stripe will not get the full protection of RAID5.
> 
> E.g. after that power loss one device is lost, then btrfs may not be
> able to rebuild the correct data.

Both must happen, a power loss and the loss of a device?
Then this is a rather rare situation: unlikely, but not impossible.


> >> So you either run RAID5 for data only
> >
> > This is a mkfs.btrfs option?
> > Shall I use "mkfs.btrfs -m dup" or "mkfs.btrfs -m raid1"?
> 
> For RAID5, RAID1 is preferred for data.

Then the real usable capacity of this volume is only half?
With 4 x 4 TB disks I get 2 TB, as opposed to 3 TB with RAID5 data?


> >> and run a full scrub after every unexpected power loss (slow, and no
> >> further writes until the scrub is done, which is a further maintenance burden).
> >
> > Ubuntu has (like most Linux distributions) systemd.
> > How can I detect a previous power loss and force full scrub on booting?
> 
> Not sure. You may dig into systemd docs to find that out.

So, no recommendation from you. Difficult situation :-}


> >> Or just don't use RAID5 at all.
> >
> > You suggest btrfs RAID0?
> 
> I'd suggest RAID10. But that means you're "wasting" half of your capacity.

Ok, it is a trade off...


-- 
Ullrich Horlacher              Server und Virtualisierung
Rechenzentrum TIK
Universitaet Stuttgart         E-Mail: horlacher@tik.uni-stuttgart.de
Allmandring 30a                Tel:    ++49-711-68565868
70569 Stuttgart (Germany)      WWW:    https://www.tik.uni-stuttgart.de/
REF:<95ece5d8-0e5a-4db9-8603-c819980c3a3b@suse.com>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: btrfs RAID5 or btrfs on md RAID5?
  2025-09-22  9:23       ` Ulli Horlacher
@ 2025-09-22  9:27         ` Qu Wenruo
  2025-10-20  9:00           ` Ulli Horlacher
  0 siblings, 1 reply; 21+ messages in thread
From: Qu Wenruo @ 2025-09-22  9:27 UTC (permalink / raw)
  To: linux-btrfs



On 2025/9/22 18:53, Ulli Horlacher wrote:
> On Mon 2025-09-22 (18:36), Qu Wenruo wrote:
> 
> 
>> Write-hole means during a partial stripe update, a power loss happened,
>> the parity may be out-of-sync.
>>
>> This means that stripe will not get the full protection of RAID5.
>>
>> E.g. after that power loss one device is lost, then btrfs may not be
>> able to rebuild the correct data.
> 
> Both must happen, a power loss and the loss of a device?
> Then this is a rather rare situation: unlikely, but not impossible.
> 
> 
>>>> So you either run RAID5 for data only
>>>
>>> This is a mkfs.btrfs option?
>>> Shall I use "mkfs.btrfs -m dup" or "mkfs.btrfs -m raid1"?
>>
>> For RAID5, RAID1 is preferred for data.
> 
> Then the real usable capacity of this volume is only the half?

No, metadata is really a small part of the fs.

The majority of usable space really depends on the data profile.

If you use RAID1 metadata + RAID5 data, I believe less than 10% of the 
real space is used as RAID1; the rest is still RAID5.

Unless you put in tons of small files (smaller than 2K), in which case those 
files will be inlined into metadata and take a lot of space there...
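
You can check the actual split at any time with e.g.

btrfs filesystem usage /mountpoint

(the mount point is just a placeholder); it reports allocated and used space 
per profile, e.g. Data,RAID5 and Metadata,RAID1 separately.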

Thanks,
Qu

> With 4 x 4 TB disks I get 2 TB, in opposite to 3 TB with RAID5 data?
> 
> 
>>>> and run a full scrub after every unexpected power loss (slow, and no
>>>> further writes until the scrub is done, which is a further maintenance burden).
>>>
>>> Ubuntu has (like most Linux distributions) systemd.
>>> How can I detect a previous power loss and force full scrub on booting?
>>
>> Not sure. You may dig into systemd docs to find that out.
> 
> So, no recommendation from you. Difficult situation :-}
> 
> 
>>>> Or just don't use RAID5 at all.
>>>
>>> You suggest btrfs RAID0?
>>
>> I'd suggest RAID10. But that means you're "wasting" half of your capacity.
> 
> Ok, it is a trade off...
> 
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: btrfs RAID5 or btrfs on md RAID5?
  2025-09-22  7:41 ` Qu Wenruo
  2025-09-22  8:28   ` Ulli Horlacher
@ 2025-09-22  9:43   ` Ulli Horlacher
  2025-09-22 10:41     ` Qu Wenruo
  2025-10-21  1:02   ` DanglingPointer
  2 siblings, 1 reply; 21+ messages in thread
From: Ulli Horlacher @ 2025-09-22  9:43 UTC (permalink / raw)
  To: linux-btrfs

On Mon 2025-09-22 (17:11), Qu Wenruo wrote:

> Btrfs RAID56 has no journal to protect against write hole. But has the
> ability to properly detect and rebuild corrupted data using data checksum.

As I wrote before, I could use btrfs RAID1 (only) for the / filesystem (64
GB), and the other partitions without any RAID level, just simple btrfs
filesystems. No md RAID volumes at all.

btrfs RAID1 is not prone to write holes, but is able to rebuild corrupted
data using data checksum?

Then this could be the most robust solution for me.
In case of a disk failure I have to recover from backup.

-- 
Ullrich Horlacher              Server und Virtualisierung
Rechenzentrum TIK
Universitaet Stuttgart         E-Mail: horlacher@tik.uni-stuttgart.de
Allmandring 30a                Tel:    ++49-711-68565868
70569 Stuttgart (Germany)      WWW:    https://www.tik.uni-stuttgart.de/
REF:<d3a5e463-d00e-4428-ad7b-35f87f9a6550@gmx.com>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: btrfs RAID5 or btrfs on md RAID5?
  2025-09-22  9:43   ` Ulli Horlacher
@ 2025-09-22 10:41     ` Qu Wenruo
  0 siblings, 0 replies; 21+ messages in thread
From: Qu Wenruo @ 2025-09-22 10:41 UTC (permalink / raw)
  To: linux-btrfs



On 2025/9/22 19:13, Ulli Horlacher wrote:
> On Mon 2025-09-22 (17:11), Qu Wenruo wrote:
> 
>> Btrfs RAID56 has no journal to protect against write hole. But has the
>> ability to properly detect and rebuild corrupted data using data checksum.
> 
> As I wrote before, I could use btrfs RAID1 (only) for the / filesystem (64
> GB), the other partitions without any RAID level, just simple btrfs
> filesystems. No md RAID volumes at all.
> 
> btrfs RAID1 is not prone to write holes, but is able to rebuild corrupted
> data using data checksum?

Yes.

The write hole is only possible for the RAID56 profiles, which need RMW 
(read-modify-write) of partial stripes.

RAID0/1/10 do not need RMW at all, thus they are completely safe.

> 
> Then this could be the most robust solution for me.
> In case of a disk failure I have to recover from backup.
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: btrfs RAID5 or btrfs on md RAID5?
  2025-09-22  9:27         ` Qu Wenruo
@ 2025-10-20  9:00           ` Ulli Horlacher
  2025-10-20  9:31             ` Andrei Borzenkov
  0 siblings, 1 reply; 21+ messages in thread
From: Ulli Horlacher @ 2025-10-20  9:00 UTC (permalink / raw)
  To: linux-btrfs


Resuming this discussion... 

On Mon 2025-09-22 (18:57), Qu Wenruo wrote:

> >>>> So you either run RAID5 for data only
> >>>
> >>> This is a mkfs.btrfs option?
> >>> Shall I use "mkfs.btrfs -m dup" or "mkfs.btrfs -m raid1"?
> >>
> >> For RAID5, RAID1 is preferred for data.
> >
> > Then the real usable capacity of this volume is only the half?
> 
> No, metadata is really a small part of the fs.
> 
> The majority of usable space really depends on the data profile.
> 
> If you use RAID1 metadata + RAID5 data, I believe only less than 10% of
> real space is used on RAID1, the remaining is still RAID5.

Sounds like a good compromise solution!

Assuming I have 4 partitions of equal size, the suggested command to
create the filesystem would be:

mkfs.btrfs -m raid1 -d raid5 /dev/sda4 /dev/sdb4 /dev/sdc4 /dev/sdd4

Does this setup help to protect against write hole?


-- 
Ullrich Horlacher              Server und Virtualisierung
Rechenzentrum TIK
Universitaet Stuttgart         E-Mail: horlacher@tik.uni-stuttgart.de
Allmandring 30a                Tel:    ++49-711-68565868
70569 Stuttgart (Germany)      WWW:    https://www.tik.uni-stuttgart.de/
REF:<1e4baff2-1310-437a-be62-5e9b72784a54@gmx.com>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: btrfs RAID5 or btrfs on md RAID5?
  2025-10-20  9:00           ` Ulli Horlacher
@ 2025-10-20  9:31             ` Andrei Borzenkov
  0 siblings, 0 replies; 21+ messages in thread
From: Andrei Borzenkov @ 2025-10-20  9:31 UTC (permalink / raw)
  To: linux-btrfs

On Mon, Oct 20, 2025 at 12:07 PM Ulli Horlacher
<framstag@rus.uni-stuttgart.de> wrote:
>
>
> Resuming this discussion...
>
> On Mon 2025-09-22 (18:57), Qu Wenruo wrote:
>
> > >>>> So you either run RAID5 for data only
> > >>>
> > >>> This is a mkfs.btrfs option?
> > >>> Shall I use "mkfs.btrfs -m dup" or "mkfs.btrfs -m raid1"?
> > >>
> > >> For RAID5, RAID1 is preferred for data.
> > >
> > > Then the real usable capacity of this volume is only the half?
> >
> > No, metadata is really a small part of the fs.
> >
> > The majority of usable space really depends on the data profile.
> >
> > If you use RAID1 metadata + RAID5 data, I believe only less than 10% of
> > real space is used on RAID1, the remaining is still RAID5.
>
> Sounds like a good compromise solution!
>
> Assuming I have 4 partitions of equal size, the suggested command to
> create the filesystem would be:
>
> mkfs.btrfs -m raid1 -d raid5 /dev/sda4 /dev/sdb4 /dev/sdc4 /dev/sdd4
>
> Does this setup help to protect against write hole?
>

No. It simply reduces the damage caused by the write hole. Only the
content of individual files is affected, not the metadata.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: btrfs RAID5 or btrfs on md RAID5?
  2025-09-22  7:41 ` Qu Wenruo
  2025-09-22  8:28   ` Ulli Horlacher
  2025-09-22  9:43   ` Ulli Horlacher
@ 2025-10-21  1:02   ` DanglingPointer
  2025-10-21 15:46     ` Mark Harmstone
  2 siblings, 1 reply; 21+ messages in thread
From: DanglingPointer @ 2025-10-21  1:02 UTC (permalink / raw)
  To: Qu Wenruo, Ulli Horlacher, linux-btrfs

Are there any plans to work on either of the proposed solutions 
mentioned here to once and for all fix RAID56?

On 22/9/25 17:41, Qu Wenruo wrote:
>
>
> On 2025/9/22 16:39, Ulli Horlacher wrote:
>>
>> I have 4 x 4 TB SAS SSD (from a deactivated Netapp system) which I 
>> want to
>> recycle in my workstation PC (Ubuntu 24 with kernel 6.14).
>>
>> Is btrfs RAID5 ready for production usage or shall I use non-RAID 
>> btrfs on
>> top of a md RAID5?
>
> Neither is perfect.
>
> Btrfs RAID56 has no journal to protect against write hole. But has the 
> ability to properly detect and rebuild corrupted data using data 
> checksum.
>
> Meanwhile md RAID56 has a journal to protect against the write hole, but has 
> no checksums to know which data is correct.
>
>>
>> What is the current status?
>>
>
> No extra work is done for btrfs RAID56 write hole for a while.
>
> The experimental raid-stripe-tree has some potential to address the 
> problem, but that feature doesn't support RAID56 yet.
>
>
> Another solution is something like RAIDZ, which requires block size > 
> page size support, and extra RAID56 changes (mostly much smaller 
> stripe length, 4K instead of the current 64K).
>
> The bs > ps support is not even merged, and submitted patchset lacks 
> certain features (RAID56 ironically).
> And no formal RAIDZ support is even considered.
>
> So you either run RAID5 for data only and run a full scrub after every 
> unexpected power loss (slow, and no further writes until the scrub is 
> done, which is a further maintenance burden).
> Or just don't use RAID5 at all.
>
> Thanks,
> Qu
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: btrfs RAID5 or btrfs on md RAID5?
  2025-10-21  1:02   ` DanglingPointer
@ 2025-10-21 15:46     ` Mark Harmstone
  2025-10-21 15:53       ` Christoph Anton Mitterer
  0 siblings, 1 reply; 21+ messages in thread
From: Mark Harmstone @ 2025-10-21 15:46 UTC (permalink / raw)
  To: DanglingPointer, Qu Wenruo, Ulli Horlacher, linux-btrfs

On 21/10/2025 2.02 am, DanglingPointer wrote:
> Are there any plans to work on either of the proposed solutions mentioned here to once and for all fix RAID56?

The brutal truth is probably that RAID5/6 is an idea whose time has passed.
Storage is cheap enough that it doesn't warrant the added latency, CPU time,
and complexity.

If I had four 4TB drives I would probably go for RAID1 data and RAID1C4
metadata.
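
Something like this, for example (device names assumed):

mkfs.btrfs -m raid1c4 -d raid1 /dev/sda /dev/sdb /dev/sdc /dev/sdd
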
> On 22/9/25 17:41, Qu Wenruo wrote:
>>
>>
>> On 2025/9/22 16:39, Ulli Horlacher wrote:
>>>
>>> I have 4 x 4 TB SAS SSD (from a deactivated Netapp system) which I want to
>>> recycle in my workstation PC (Ubuntu 24 with kernel 6.14).
>>>
>>> Is btrfs RAID5 ready for production usage or shall I use non-RAID btrfs on
>>> top of a md RAID5?
>>
>> Neither is perfect.
>>
>> Btrfs RAID56 has no journal to protect against write hole. But has the ability to properly detect and rebuild corrupted data using data checksum.
>>
>> Meanwhile md RAID56 has a journal to protect against the write hole, but has no checksums to know which data is correct.
>>
>>>
>>> What is the current status?
>>>
>>
>> No extra work is done for btrfs RAID56 write hole for a while.
>>
>> The experimental raid-stripe-tree has some potential to address the problem, but that feature doesn't support RAID56 yet.
>>
>>
>> Another solution is something like RAIDZ, which requires block size > page size support, and extra RAID56 changes (mostly much smaller stripe length, 4K instead of the current 64K).
>>
>> The bs > ps support is not even merged, and submitted patchset lacks certain features (RAID56 ironically).
>> And no formal RAIDZ support is even considered.
>>
>> So you either run RAID5 for data only and run a full scrub after every unexpected power loss (slow, and no further writes until the scrub is done, which is a further maintenance burden).
>> Or just don't use RAID5 at all.
>>
>> Thanks,
>> Qu
>>
> 



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: btrfs RAID5 or btrfs on md RAID5?
  2025-10-21 15:46     ` Mark Harmstone
@ 2025-10-21 15:53       ` Christoph Anton Mitterer
  2025-10-21 16:15         ` Jukka Larja
  2025-10-21 16:45         ` Mark Harmstone
  0 siblings, 2 replies; 21+ messages in thread
From: Christoph Anton Mitterer @ 2025-10-21 15:53 UTC (permalink / raw)
  To: linux-btrfs

On Tue, 2025-10-21 at 16:46 +0100, Mark Harmstone wrote:
> The brutal truth is probably that RAID5/6 is an idea whose time has
> passed.
> Storage is cheap enough that it doesn't warrant the added latency,
> CPU time,
> and complexity.

That doesn't seem to be generally the case. We have e.g. large storage
servers with 24x 22 TB HDDs.

RAID6 is plenty of redundancy for these, losing only 2 HDDs of capacity.
RAID1 would lose half.


Cheers,
Chris.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: btrfs RAID5 or btrfs on md RAID5?
  2025-10-21 15:53       ` Christoph Anton Mitterer
@ 2025-10-21 16:15         ` Jukka Larja
  2025-10-21 16:45         ` Mark Harmstone
  1 sibling, 0 replies; 21+ messages in thread
From: Jukka Larja @ 2025-10-21 16:15 UTC (permalink / raw)
  To: linux-btrfs

Christoph Anton Mitterer wrote on 21.10.2025 at 18.53:
> On Tue, 2025-10-21 at 16:46 +0100, Mark Harmstone wrote:
>> The brutal truth is probably that RAID5/6 is an idea whose time has
>> passed.
>> Storage is cheap enough that it doesn't warrant the added latency,
>> CPU time,
>> and complexity.
> 
> That doesn't seem to be generally the case. We have e.g. large storage
> servers with 24x 22 TB HDDs.
> 
> RAID6 is plenty of redundancy for these, losing only 2 HDDs of capacity.
> RAID1 would lose half.

Also significant: RAID1 only protects against a single drive failure, RAID6 against two.

-JLarja

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: btrfs RAID5 or btrfs on md RAID5?
  2025-10-21 15:53       ` Christoph Anton Mitterer
  2025-10-21 16:15         ` Jukka Larja
@ 2025-10-21 16:45         ` Mark Harmstone
  2025-10-21 17:32           ` Andrei Borzenkov
  2025-10-21 19:32           ` Goffredo Baroncelli
  1 sibling, 2 replies; 21+ messages in thread
From: Mark Harmstone @ 2025-10-21 16:45 UTC (permalink / raw)
  To: Christoph Anton Mitterer, linux-btrfs

On 21/10/2025 4.53 pm, Christoph Anton Mitterer wrote:
> On Tue, 2025-10-21 at 16:46 +0100, Mark Harmstone wrote:
>> The brutal truth is probably that RAID5/6 is an idea whose time has
>> passed.
>> Storage is cheap enough that it doesn't warrant the added latency,
>> CPU time,
>> and complexity.
> 
> That doesn't seem to be generally the case. We have e.g. large storage
> servers with 24x 22 TB HDDs.
> 
> RAID6 is plenty of redundancy for these, losing only 2 HDDs of capacity.
> RAID1 would lose half.
> 
> 
> Cheers,
> Chris.
So for every sector you want to write, you actually need to write three
and read 21. That seems a very quick way to wear out all those disks.
And then one starts operating more slowly, which slows down every write...

I'd still use RAID1 in this case.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: btrfs RAID5 or btrfs on md RAID5?
  2025-10-21 16:45         ` Mark Harmstone
@ 2025-10-21 17:32           ` Andrei Borzenkov
  2025-10-21 17:43             ` Mark Harmstone
  2025-10-21 19:32           ` Goffredo Baroncelli
  1 sibling, 1 reply; 21+ messages in thread
From: Andrei Borzenkov @ 2025-10-21 17:32 UTC (permalink / raw)
  To: Mark Harmstone, Christoph Anton Mitterer, linux-btrfs

21.10.2025 19:45, Mark Harmstone wrote:
> On 21/10/2025 4.53 pm, Christoph Anton Mitterer wrote:
>> On Tue, 2025-10-21 at 16:46 +0100, Mark Harmstone wrote:
>>> The brutal truth is probably that RAID5/6 is an idea whose time has
>>> passed.
>>> Storage is cheap enough that it doesn't warrant the added latency,
>>> CPU time,
>>> and complexity.
>>
>> That doesn't seem to be generally the case. We have e.g. large storage
>> servers with 24x 22 TB HDDs.
>>
>> RAID6 is plenty of redundancy for these, losing only 2 HDDs of capacity.
>> RAID1 would lose half.
>>
>>
>> Cheers,
>> Chris.
> So for every sector you want to write, you actually need to write three
> and read 21. 

RAID5 needs to read 2 sectors and write 2 sectors, independently of the 
number of disks in the array.

It is more difficult to make any generic statement about RAID6 because, 
to the best of my knowledge, there is no standard parity computation 
algorithm for it; each vendor does something different. But simply adding 
the second parity block means you need to read 3 and write 3 blocks.
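
(That is the standard read-modify-write shortcut: with P_new = P_old xor
D_old xor D_new, a single-sector update only has to read D_old and P_old and
write back D_new and P_new, no matter how wide the stripe is.)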

> That seems a very quick way to wear out all those disks.
> And then one starts operating more slowly, which slows down every write...
> 
> I'd still use RAID1 in this case.
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: btrfs RAID5 or btrfs on md RAID5?
  2025-10-21 17:32           ` Andrei Borzenkov
@ 2025-10-21 17:43             ` Mark Harmstone
  0 siblings, 0 replies; 21+ messages in thread
From: Mark Harmstone @ 2025-10-21 17:43 UTC (permalink / raw)
  To: Andrei Borzenkov, Christoph Anton Mitterer, linux-btrfs

On 21/10/2025 6.32 pm, Andrei Borzenkov wrote:
> 21.10.2025 19:45, Mark Harmstone wrote:
>> On 21/10/2025 4.53 pm, Christoph Anton Mitterer wrote:
>>> On Tue, 2025-10-21 at 16:46 +0100, Mark Harmstone wrote:
>>>> The brutal truth is probably that RAID5/6 is an idea whose time has
>>>> passed.
>>>> Storage is cheap enough that it doesn't warrant the added latency,
>>>> CPU time,
>>>> and complexity.
>>>
>>> That doesn't seem to be generally the case. We have e.g. large storage
>>> servers with 24x 22 TB HDDs.
>>>
>>> RAID6 is plenty of redundancy for these, losing only 2 HDDs of capacity.
>>> RAID1 would lose half.
>>>
>>>
>>> Cheers,
>>> Chris.
>> So for every sector you want to write, you actually need to write three
>> and read 21. 
> 
> RAID5 needs to read 2 sectors and write 2 sectors. Independently of the number of disks in the array.

This isn't the case for btrfs' implementation, which will stripe every chunk
over each disk if it can. Possibly other people do something different.

> It is more difficult to make any generic statement about RAID6 because to my best knowledge there is no standard parity computation algorithm for it, each vendor does something different. But simply adding the second parity block means you need to read 3 and write 3 blocks.
Likewise for btrfs' implementation of RAID6. I suppose this shows that if
anyone were ever to fix it, they would need to make sure that RAID6 chunks
get given 4 stripes rather than 24.

>> That seems a very quick way to wear out all those disks.
>> And then one starts operating more slowly, which slows down every write...
>>
>> I'd still use RAID1 in this case.
>>
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: btrfs RAID5 or btrfs on md RAID5?
  2025-10-21 16:45         ` Mark Harmstone
  2025-10-21 17:32           ` Andrei Borzenkov
@ 2025-10-21 19:32           ` Goffredo Baroncelli
  2025-10-21 22:19             ` DanglingPointer
  1 sibling, 1 reply; 21+ messages in thread
From: Goffredo Baroncelli @ 2025-10-21 19:32 UTC (permalink / raw)
  To: Mark Harmstone, Christoph Anton Mitterer, linux-btrfs

On 21/10/2025 18.45, Mark Harmstone wrote:
> On 21/10/2025 4.53 pm, Christoph Anton Mitterer wrote:
>> On Tue, 2025-10-21 at 16:46 +0100, Mark Harmstone wrote:
>>> The brutal truth is probably that RAID5/6 is an idea whose time has
>>> passed.
>>> Storage is cheap enough that it doesn't warrant the added latency,
>>> CPU time,
>>> and complexity.
>>
>> That doesn't seem to be generally the case. We have e.g. large storage
>> servers with 24x 22 TB HDDs.
>>
>> RAID6 is plenty of redundancy for these, losing only 2 HDDs of capacity.
>> RAID1 would lose half.
>>
>>
>> Cheers,
>> Chris.
> So for every sector you want to write, you actually need to write three
> and read 21. That seems a very quick way to wear out all those disks.
> And then one starts operating more slowly, which slows down every write...
  
Yes, it is true that the classic raid5/6 doesn't scale well when the number
of disks grows.

However I still think that there is room for a different approach, like 
putting the redundancy inside the extent to avoid RMW cycles. This, and the 
fact that in BTRFS the extents are immutable, would avoid the slowness 
that you mention.

> I'd still use RAID1 in this case.
> 
This is faster, but it also doesn't scale well (== gets expensive) as the
storage size grows.

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: btrfs RAID5 or btrfs on md RAID5?
  2025-10-21 19:32           ` Goffredo Baroncelli
@ 2025-10-21 22:19             ` DanglingPointer
  0 siblings, 0 replies; 21+ messages in thread
From: DanglingPointer @ 2025-10-21 22:19 UTC (permalink / raw)
  To: kreijack, Mark Harmstone, Christoph Anton Mitterer, linux-btrfs

Just going back to the original question I posted...

Will the BTRFS project decide to fix RAID56 once and for all, closing the 
write hole, even if the result is higher latency?

At least that would be a fully functional, production-ready offering, a 
version 1.0 of sorts.

Future optimisations and improvements on that version 1.0 would obviously 
happen for the life of BTRFS and Linux, winning back the performance cost 
of whatever is needed to close the write hole, just like everything else. 
At least everyone could say it is done, although slower for now, but 
feature complete! Making it faster would then happen incrementally as it 
evolves, like everything else.



On 22/10/25 06:32, Goffredo Baroncelli wrote:
> On 21/10/2025 18.45, Mark Harmstone wrote:
>> On 21/10/2025 4.53 pm, Christoph Anton Mitterer wrote:
>>> On Tue, 2025-10-21 at 16:46 +0100, Mark Harmstone wrote:
>>>> The brutal truth is probably that RAID5/6 is an idea whose time has
>>>> passed.
>>>> Storage is cheap enough that it doesn't warrant the added latency,
>>>> CPU time,
>>>> and complexity.
>>>
>>> That doesn't seem to be generally the case. We have e.g. large storage
>>> servers with 24x 22 TB HDDs.
>>>
>>> RAID6 is plenty of redundancy for these, losing only 2 HDDs of capacity.
>>> RAID1 would lose half.
>>>
>>>
>>> Cheers,
>>> Chris.
>> So for every sector you want to write, you actually need to write three
>> and read 21. That seems a very quick way to wear out all those disks.
>> And then one starts operating more slowly, which slows down every 
>> write...
>
> Yes, it is true that the classic raid5/6 doesn't scale well when the 
> number
> of disks grows.
>
> However I still think that there is room to a different approach. Like 
> putting
> the redundancy inside the extent to avoid an RMW cycles. This and the
> fact that in BTRFS the extents are immutable, would avoid the slowness
> that you mention.
>
>> I'd still use RAID1 in this case.
>>
> This is faster, but also doesn't scale well (== expensive) when the 
> storage
>  size grows.
>
> BR
> G.Baroncelli
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2025-10-21 22:19 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-22  7:09 btrfs RAID5 or btrfs on md RAID5? Ulli Horlacher
2025-09-22  7:41 ` Qu Wenruo
2025-09-22  8:28   ` Ulli Horlacher
2025-09-22  9:06     ` Qu Wenruo
2025-09-22  9:23       ` Ulli Horlacher
2025-09-22  9:27         ` Qu Wenruo
2025-10-20  9:00           ` Ulli Horlacher
2025-10-20  9:31             ` Andrei Borzenkov
2025-09-22  9:43   ` Ulli Horlacher
2025-09-22 10:41     ` Qu Wenruo
2025-10-21  1:02   ` DanglingPointer
2025-10-21 15:46     ` Mark Harmstone
2025-10-21 15:53       ` Christoph Anton Mitterer
2025-10-21 16:15         ` Jukka Larja
2025-10-21 16:45         ` Mark Harmstone
2025-10-21 17:32           ` Andrei Borzenkov
2025-10-21 17:43             ` Mark Harmstone
2025-10-21 19:32           ` Goffredo Baroncelli
2025-10-21 22:19             ` DanglingPointer
2025-09-22  8:07 ` Lukas Straub
2025-09-22  8:50   ` Ulli Horlacher

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).