* btrfs RAID5 or btrfs on md RAID5?
From: Ulli Horlacher @ 2025-09-22 7:09 UTC
To: linux-btrfs

I have 4 x 4 TB SAS SSDs (from a decommissioned NetApp system) which I want
to recycle in my workstation PC (Ubuntu 24 with kernel 6.14).

Is btrfs RAID5 ready for production use, or shall I use non-RAID btrfs on
top of an md RAID5?

What is the current status?

-- 
Ullrich Horlacher              Server und Virtualisierung
Rechenzentrum TIK              Universitaet Stuttgart
Allmandring 30a                E-Mail: horlacher@tik.uni-stuttgart.de
70569 Stuttgart (Germany)      Tel:    ++49-711-68565868
                               WWW:    https://www.tik.uni-stuttgart.de/
* Re: btrfs RAID5 or btrfs on md RAID5?
From: Qu Wenruo @ 2025-09-22 7:41 UTC
To: Ulli Horlacher, linux-btrfs

On 2025/9/22 16:39, Ulli Horlacher wrote:
>
> I have 4 x 4 TB SAS SSDs (from a decommissioned NetApp system) which I
> want to recycle in my workstation PC (Ubuntu 24 with kernel 6.14).
>
> Is btrfs RAID5 ready for production use, or shall I use non-RAID btrfs
> on top of an md RAID5?

Neither is perfect.

Btrfs RAID56 has no journal to protect against the write hole, but it has
the ability to properly detect and rebuild corrupted data using data
checksums.

Meanwhile, md RAID56 has a journal to protect against the write hole, but
no checksums to tell which copy of the data is correct.

>
> What is the current status?
>

No extra work has been done on the btrfs RAID56 write hole for a while.

The experimental raid-stripe-tree has some potential to address the
problem, but that feature doesn't support RAID56 yet.

Another solution is something like RAIDZ, which requires block size >
page size support, plus extra RAID56 changes (mostly a much smaller
stripe length, 4K instead of the current 64K).

The bs > ps support is not even merged, and the submitted patchset lacks
certain features (RAID56, ironically).
And no formal RAIDZ support is even being considered.

So you either run RAID5 for data only and run a full scrub after every
unexpected power loss (slow, and no further writes until the scrub is
done, which is a further maintenance burden), or just don't use RAID5
at all.

Thanks,
Qu
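For reference, the post-power-loss routine described above boils down to a
foreground scrub plus an error check. A minimal sketch, assuming the
filesystem is mounted at /data (the mount point is an assumption, not from
the thread):

    # run a full scrub and block until it finishes (-B = foreground)
    btrfs scrub start -B /data

    # then inspect the result and the per-device error counters
    btrfs scrub status /data
    btrfs device stats /data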
* Re: btrfs RAID5 or btrfs on md RAID5?
From: Ulli Horlacher @ 2025-09-22 8:28 UTC
To: linux-btrfs

On Mon 2025-09-22 (17:11), Qu Wenruo wrote:

> > Is btrfs RAID5 ready for production use, or shall I use non-RAID btrfs
> > on top of an md RAID5?
>
> Neither is perfect.

We live in a non-perfect world :-}

> Btrfs RAID56 has no journal to protect against the write hole.

What does this mean?
What is a write hole, and what is the danger with it?

> So you either run RAID5 for data only

This is a mkfs.btrfs option?
Shall I use "mkfs.btrfs -m dup" or "mkfs.btrfs -m raid1"?

> and run a full scrub after every unexpected power loss (slow, and no
> further writes until the scrub is done, which is a further maintenance
> burden).

Ubuntu has (like most Linux distributions) systemd.
How can I detect a previous power loss and force a full scrub on booting?

> Or just don't use RAID5 at all.

You suggest btrfs RAID0?
As I wrote: I have 4 x 4 TB SAS SSDs (enterprise hardware, very reliable).

Another disk layout option for me could be:

  64 GB   /       filesystem RAID1
  32 GB   swap    RAID1
  3.9 TB  /home
  3.9 TB  /data
  3.9 TB  /VM
  3.9 TB  /backup

In case of an SSD failure I would have to recover from (external) backup.

-- 
Ullrich Horlacher
* Re: btrfs RAID5 or btrfs on md RAID5?
From: Qu Wenruo @ 2025-09-22 9:06 UTC
To: linux-btrfs

On 2025/9/22 17:58, Ulli Horlacher wrote:
> On Mon 2025-09-22 (17:11), Qu Wenruo wrote:
>
>> Btrfs RAID56 has no journal to protect against the write hole.
>
> What does this mean?
> What is a write hole, and what is the danger with it?

The write hole means that if a power loss happens during a partial stripe
update, the parity may be left out of sync with the data.

That stripe then no longer gets the full protection of RAID5.

E.g. if one device is lost after that power loss, btrfs may not be able
to rebuild the correct data.

>> So you either run RAID5 for data only
>
> This is a mkfs.btrfs option?
> Shall I use "mkfs.btrfs -m dup" or "mkfs.btrfs -m raid1"?

For RAID5 data, RAID1 is preferred for the metadata.

>> and run a full scrub after every unexpected power loss (slow, and no
>> further writes until the scrub is done, which is a further maintenance
>> burden).
>
> Ubuntu has (like most Linux distributions) systemd.
> How can I detect a previous power loss and force a full scrub on booting?

Not sure. You may dig into the systemd docs to find that out.

>> Or just don't use RAID5 at all.
>
> You suggest btrfs RAID0?

I'd suggest RAID10. But that means you're "wasting" half of your capacity.

Thanks,
Qu
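One possible answer to the systemd question above, as a minimal sketch:
keep a marker file that is removed only on clean shutdown, and scrub when
it survives into the next boot. The unit name, script path, and /data
mount point are all hypothetical, not anything from this thread:

    # /etc/systemd/system/scrub-after-crash.service (hypothetical name)
    [Unit]
    Description=Scrub btrfs after an unclean shutdown
    After=local-fs.target

    [Service]
    Type=oneshot
    RemainAfterExit=yes
    ExecStart=/usr/local/sbin/scrub-after-crash
    # a clean shutdown removes the marker
    ExecStop=/bin/rm -f /var/lib/scrub-after-crash.marker

    [Install]
    WantedBy=multi-user.target

    #!/bin/bash
    # /usr/local/sbin/scrub-after-crash (hypothetical path)
    # if the marker survived, the previous boot ended in a crash or
    # power loss, so run a full foreground scrub before anything else
    marker=/var/lib/scrub-after-crash.marker
    if [ -e "$marker" ]; then
        btrfs scrub start -B /data
    fi
    touch "$marker"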
* Re: btrfs RAID5 or btrfs on md RAID5?
From: Ulli Horlacher @ 2025-09-22 9:23 UTC
To: linux-btrfs

On Mon 2025-09-22 (18:36), Qu Wenruo wrote:

> The write hole means that if a power loss happens during a partial
> stripe update, the parity may be left out of sync with the data.
>
> That stripe then no longer gets the full protection of RAID5.
>
> E.g. if one device is lost after that power loss, btrfs may not be able
> to rebuild the correct data.

So both must happen: a power loss AND the loss of a device?
Then this is a rarer situation: unlikely, but not impossible.

>>> So you either run RAID5 for data only
>>
>> This is a mkfs.btrfs option?
>> Shall I use "mkfs.btrfs -m dup" or "mkfs.btrfs -m raid1"?
>
> For RAID5 data, RAID1 is preferred for the metadata.

Then the real usable capacity of this volume is only half?
With 4 x 4 TB disks I would get 8 TB, as opposed to 12 TB with RAID5 data?

>>> and run a full scrub after every unexpected power loss (slow, and no
>>> further writes until the scrub is done, which is a further
>>> maintenance burden).
>>
>> Ubuntu has (like most Linux distributions) systemd.
>> How can I detect a previous power loss and force a full scrub on
>> booting?
>
> Not sure. You may dig into the systemd docs to find that out.

So, no recommendation from you. Difficult situation :-}

>>> Or just don't use RAID5 at all.
>>
>> You suggest btrfs RAID0?
>
> I'd suggest RAID10. But that means you're "wasting" half of your
> capacity.

Ok, it is a trade-off...

-- 
Ullrich Horlacher
* Re: btrfs RAID5 or btrfs on md RAID5?
From: Qu Wenruo @ 2025-09-22 9:27 UTC
To: linux-btrfs

On 2025/9/22 18:53, Ulli Horlacher wrote:
> On Mon 2025-09-22 (18:36), Qu Wenruo wrote:
>
>> For RAID5 data, RAID1 is preferred for the metadata.
>
> Then the real usable capacity of this volume is only half?

No, metadata is really a small part of the fs.

The majority of the usable space really depends on the data profile.

If you use RAID1 metadata + RAID5 data, I believe less than 10% of the
real space is used by RAID1; the rest is still RAID5.

Unless you put in tons of small files (smaller than 2K); those files will
be inlined into the metadata and take a lot of space...

Thanks,
Qu
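To see how this split works out in practice, "btrfs filesystem usage"
reports allocation per profile. A sketch, again assuming a mount point of
/data:

    # prints separate "Data,RAID5" and "Metadata,RAID1" allocation lines,
    # so the RAID1 metadata overhead can be read off directly
    btrfs filesystem usage /data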
* Re: btrfs RAID5 or btrfs on md RAID5?
From: Ulli Horlacher @ 2025-10-20 9:00 UTC
To: linux-btrfs

Resuming this discussion...

On Mon 2025-09-22 (18:57), Qu Wenruo wrote:

> If you use RAID1 metadata + RAID5 data, I believe less than 10% of the
> real space is used by RAID1; the rest is still RAID5.

Sounds like a good compromise solution!

Assuming I have 4 partitions of equal size, the suggested command to
create the filesystem would be:

  mkfs.btrfs -m raid1 -d raid5 /dev/sda4 /dev/sdb4 /dev/sdc4 /dev/sdd4

Does this setup help to protect against the write hole?

-- 
Ullrich Horlacher
* Re: btrfs RAID5 or btrfs on md RAID5?
From: Andrei Borzenkov @ 2025-10-20 9:31 UTC
To: linux-btrfs

On Mon, Oct 20, 2025 at 12:07 PM Ulli Horlacher
<framstag@rus.uni-stuttgart.de> wrote:
>
> Assuming I have 4 partitions of equal size, the suggested command to
> create the filesystem would be:
>
>   mkfs.btrfs -m raid1 -d raid5 /dev/sda4 /dev/sdb4 /dev/sdc4 /dev/sdd4
>
> Does this setup help to protect against the write hole?

No. It simply reduces the damage caused by the write hole. Only the
content of individual files is affected, not the metadata.
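In that damage-limited scenario, a scrub can also tell you which files
were hit. A sketch, assuming /data again; as far as I know, data blocks
that cannot be repaired are reported in the kernel log together with the
owning inode and path:

    btrfs scrub start -B /data
    # unrepairable data shows up as "checksum error at logical ..." and
    # "unable to fixup" messages, including inode and path information
    dmesg | grep -iE 'checksum error|unable to fixup'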
* Re: btrfs RAID5 or btrfs on md RAID5?
From: Ulli Horlacher @ 2025-09-22 9:43 UTC
To: linux-btrfs

On Mon 2025-09-22 (17:11), Qu Wenruo wrote:

> Btrfs RAID56 has no journal to protect against the write hole, but it
> has the ability to properly detect and rebuild corrupted data using
> data checksums.

As I wrote before, I could use btrfs RAID1 (only) for the / filesystem
(64 GB), and the other partitions without any RAID level, just as simple
btrfs filesystems. No md RAID volumes at all.

btrfs RAID1 is not prone to write holes, but is able to rebuild corrupted
data using data checksums?

Then this could be the most robust solution for me.
In case of a disk failure I would have to recover from backup.

-- 
Ullrich Horlacher
* Re: btrfs RAID5 or btrfs on md RAID5?
From: Qu Wenruo @ 2025-09-22 10:41 UTC
To: linux-btrfs

On 2025/9/22 19:13, Ulli Horlacher wrote:
>
> btrfs RAID1 is not prone to write holes, but is able to rebuild
> corrupted data using data checksums?

Yes.

The write hole is only possible for the RAID56 profiles, which need RMW
(read-modify-write).
RAID0/1/10 do not need RMW at all, thus they are completely safe.

> Then this could be the most robust solution for me.
> In case of a disk failure I would have to recover from backup.
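For completeness, the recovery path after a dead disk in a btrfs RAID1
filesystem is "btrfs replace". A minimal sketch; the devid (2), the
replacement device (/dev/sde), and the /data mount point are all
assumptions:

    # replace the failed device (devid 2 here) with a new, empty one
    btrfs replace start 2 /dev/sde /data
    btrfs replace status /data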
* Re: btrfs RAID5 or btrfs on md RAID5?
From: DanglingPointer @ 2025-10-21 1:02 UTC
To: Qu Wenruo, Ulli Horlacher, linux-btrfs

Are there any plans to work on either of the proposed solutions mentioned
here, to fix RAID56 once and for all?

On 22/9/25 17:41, Qu Wenruo wrote:
> No extra work has been done on the btrfs RAID56 write hole for a while.
> ...
> So you either run RAID5 for data only and run a full scrub after every
> unexpected power loss (slow, and no further writes until the scrub is
> done, which is a further maintenance burden), or just don't use RAID5
> at all.
* Re: btrfs RAID5 or btrfs on md RAID5?
From: Mark Harmstone @ 2025-10-21 15:46 UTC
To: DanglingPointer, Qu Wenruo, Ulli Horlacher, linux-btrfs

On 21/10/2025 2.02 am, DanglingPointer wrote:
> Are there any plans to work on either of the proposed solutions
> mentioned here, to fix RAID56 once and for all?

The brutal truth is probably that RAID5/6 is an idea whose time has
passed. Storage is cheap enough that it doesn't warrant the added
latency, CPU time, and complexity.

If I had four 4 TB drives I would probably go for RAID1 data and RAID1C4
metadata.
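That suggestion as a concrete command, a sketch reusing the partition
names assumed earlier in the thread; raid1c4 keeps four copies of the
metadata, one per drive:

    mkfs.btrfs -d raid1 -m raid1c4 /dev/sda4 /dev/sdb4 /dev/sdc4 /dev/sdd4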
* Re: btrfs RAID5 or btrfs on md RAID5?
From: Christoph Anton Mitterer @ 2025-10-21 15:53 UTC
To: linux-btrfs

On Tue, 2025-10-21 at 16:46 +0100, Mark Harmstone wrote:
> The brutal truth is probably that RAID5/6 is an idea whose time has
> passed. Storage is cheap enough that it doesn't warrant the added
> latency, CPU time, and complexity.

That doesn't seem to be generally the case. We have e.g. large storage
servers with 24 x 22 TB HDDs.

RAID6 is plenty of redundancy for these, losing only 2 HDDs' worth of
capacity.
RAID1 would lose half.

Cheers,
Chris.
* Re: btrfs RAID5 or btrfs on md RAID5?
From: Jukka Larja @ 2025-10-21 16:15 UTC
To: linux-btrfs

On 21.10.2025 at 18.53, Christoph Anton Mitterer wrote:
> That doesn't seem to be generally the case. We have e.g. large storage
> servers with 24 x 22 TB HDDs.
>
> RAID6 is plenty of redundancy for these, losing only 2 HDDs' worth of
> capacity.
> RAID1 would lose half.

Also significant: RAID1 only protects against a single drive failure,
RAID6 against two.

-JLarja
* Re: btrfs RAID5 or btrfs on md RAID5?
From: Mark Harmstone @ 2025-10-21 16:45 UTC
To: Christoph Anton Mitterer, linux-btrfs

On 21/10/2025 4.53 pm, Christoph Anton Mitterer wrote:
> That doesn't seem to be generally the case. We have e.g. large storage
> servers with 24 x 22 TB HDDs.
>
> RAID6 is plenty of redundancy for these, losing only 2 HDDs' worth of
> capacity.
> RAID1 would lose half.

So for every sector you want to write, you actually need to write three
and read 21. That seems a very quick way to wear out all those disks.
And then one starts operating more slowly, which slows down every
write...

I'd still use RAID1 in this case.
* Re: btrfs RAID5 or btrfs on md RAID5?
From: Andrei Borzenkov @ 2025-10-21 17:32 UTC
To: Mark Harmstone, Christoph Anton Mitterer, linux-btrfs

On 21.10.2025 19:45, Mark Harmstone wrote:
> So for every sector you want to write, you actually need to write three
> and read 21.

RAID5 needs to read 2 sectors and write 2 sectors, independently of the
number of disks in the array.

It is more difficult to make any generic statement about RAID6, because
to the best of my knowledge there is no standard parity computation
algorithm for it; each vendor does something different. But simply adding
a second parity block means you need to read 3 and write 3 blocks.

> That seems a very quick way to wear out all those disks.
> And then one starts operating more slowly, which slows down every
> write...
>
> I'd still use RAID1 in this case.
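The 2-read/2-write figure follows from the standard small-write parity
update, which needs only the old data and the old parity rather than the
whole stripe (general RAID background, not spelled out in the thread):

    P_new = P_old XOR D_old XOR D_new

    # read D_old and P_old (2 reads), compute the new parity,
    # then write D_new and P_new (2 writes)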
* Re: btrfs RAID5 or btrfs on md RAID5?
From: Mark Harmstone @ 2025-10-21 17:43 UTC
To: Andrei Borzenkov, Christoph Anton Mitterer, linux-btrfs

On 21/10/2025 6.32 pm, Andrei Borzenkov wrote:
> RAID5 needs to read 2 sectors and write 2 sectors, independently of the
> number of disks in the array.

This isn't the case for btrfs' implementation, which will stripe every
chunk over each disk if it can. Possibly other people do something
different.

> It is more difficult to make any generic statement about RAID6, because
> to the best of my knowledge there is no standard parity computation
> algorithm for it; each vendor does something different. But simply
> adding a second parity block means you need to read 3 and write 3
> blocks.

Likewise for btrfs' implementation of RAID6.

I suppose this shows that if anyone were ever to fix it, they would need
to make sure that RAID6 chunks get given 4 stripes rather than 24.
* Re: btrfs RAID5 or btrfs on md RAID5?
From: Goffredo Baroncelli @ 2025-10-21 19:32 UTC
To: Mark Harmstone, Christoph Anton Mitterer, linux-btrfs

On 21/10/2025 18.45, Mark Harmstone wrote:
> So for every sector you want to write, you actually need to write three
> and read 21. That seems a very quick way to wear out all those disks.
> And then one starts operating more slowly, which slows down every
> write...

Yes, it is true that classic RAID5/6 doesn't scale well when the number
of disks grows.

However, I still think there is room for a different approach, like
putting the redundancy inside the extent to avoid the RMW cycles. This,
and the fact that in btrfs the extents are immutable, would avoid the
slowness that you mention.

> I'd still use RAID1 in this case.

This is faster, but also doesn't scale well (== expensive) when the
storage size grows.

BR
G.Baroncelli

--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
* Re: btrfs RAID5 or btrfs on md RAID5?
From: DanglingPointer @ 2025-10-21 22:19 UTC
To: kreijack, Mark Harmstone, Christoph Anton Mitterer, linux-btrfs

Just going back to the original question I posted...

Will the btrfs project decide to fix RAID56 once and for all, closing the
write hole even if the result is higher latency? At least that would be a
fully functional, production-ready offering as a version 1.0.

Future optimisations and improvements on that version 1.0 will obviously
happen for the life of btrfs and Linux, winning back the performance lost
to whatever is needed to close the write hole. Just like everything else!

At least everyone could then say it is done: slower for now, but feature
complete! Making it faster will then happen incrementally as it evolves,
like everything else.
* Re: btrfs RAID5 or btrfs on md RAID5?
From: Lukas Straub @ 2025-09-22 8:07 UTC
To: Ulli Horlacher
Cc: linux-btrfs

On Mon, 22 Sep 2025 09:09:56 +0200
Ulli Horlacher <framstag@rus.uni-stuttgart.de> wrote:

> I have 4 x 4 TB SAS SSDs (from a decommissioned NetApp system) which I
> want to recycle in my workstation PC (Ubuntu 24 with kernel 6.14).
>
> Is btrfs RAID5 ready for production use, or shall I use non-RAID btrfs
> on top of an md RAID5?
>
> What is the current status?

Hi,

md RAID5 with the Partial Parity Log (PPL) is perfect for btrfs:
https://www.kernel.org/doc/html/latest/driver-api/md/raid5-ppl.html

When a stripe is partially updated with new data, PPL ensures that the
old data in the stripe will not be corrupted by the write hole. The new
data, on the other hand, is still affected by the write hole, but for
btrfs that is not a problem.

Regards,
Lukas
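For a fresh setup along these lines, PPL can be requested at array
creation time. A sketch with assumed device and array names (PPL needs
mdadm 4.0+ and kernel 4.12+, to my knowledge):

    mdadm --create /dev/md0 --level=5 --raid-devices=4 \
          --consistency-policy=ppl /dev/sda4 /dev/sdb4 /dev/sdc4 /dev/sdd4
    mkfs.btrfs /dev/md0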
* Re: btrfs RAID5 or btrfs on md RAID5?
From: Ulli Horlacher @ 2025-09-22 8:50 UTC
To: linux-btrfs

On Mon 2025-09-22 (10:07), Lukas Straub wrote:

> md RAID5 with the Partial Parity Log (PPL) is perfect for btrfs:
> https://www.kernel.org/doc/html/latest/driver-api/md/raid5-ppl.html

I already have another system with btrfs on top of md RAID5
(4 x 1.6 TB SSD):

root@juhu:~# uname -a
Linux juhu 6.8.0-83-generic #83~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Sep 9 18:19:47 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

root@juhu:~# mount | grep local
/dev/md127 on /local type btrfs (rw,relatime,space_cache,user_subvol_rm_allowed,subvolid=5,subvol=/)

root@juhu:~# mdadm --detail /dev/md127
/dev/md127:
           Version : 1.2
     Creation Time : Thu Feb 10 09:38:22 2022
        Raid Level : raid5
        Array Size : 4285387776 (3.99 TiB 4.39 TB)
     Used Dev Size : 1428462592 (1362.29 GiB 1462.75 GB)
      Raid Devices : 4
     Total Devices : 4
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Mon Sep 22 10:43:43 2025
             State : clean
    Active Devices : 4
   Working Devices : 4
    Failed Devices : 0
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : mux:nts1
              UUID : 74388db3:3c3b30c3:e1295cc5:46f23ff7
            Events : 23359

    Number   Major   Minor   RaidDevice State
       0       8       20        0      active sync   /dev/sdb4
       1       8        4        1      active sync   /dev/sda4
       2       8       52        2      active sync   /dev/sdd4
       4       8       36        3      active sync   /dev/sdc4

Shall I enable PPL there with

  mdadm --grow --consistency-policy=ppl /dev/md127

?

-- 
Ullrich Horlacher
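One caveat on that last command, to my understanding of mdadm (not stated
in the thread): PPL and the internal write-intent bitmap are mutually
exclusive, and this array currently uses a bitmap ("Consistency Policy :
bitmap" above), so the bitmap would have to be dropped first. A sketch:

    # drop the internal write-intent bitmap, then enable PPL
    mdadm --grow /dev/md127 --bitmap=none
    mdadm --grow /dev/md127 --consistency-policy=ppl

    # verify the change
    mdadm --detail /dev/md127 | grep 'Consistency Policy'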