Linux Btrfs filesystem development
* BTRFS list of grievances
@ 2024-09-27 11:20 waxhead
  2024-09-27 16:27 ` Roman Mamedov
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: waxhead @ 2024-09-27 11:20 UTC (permalink / raw)
  To: Btrfs BTRFS

First things first: I am a long-time BTRFS user and frequent reader of 
the mailing list. I am *NOT* a BTRFS developer, but that being said, I 
have been known to summon a segmentation fault or two from years of 
programming in C.

Since I have been using BTRFS more or less problem-free since 2013 or so 
for nearly everything, I figured that I should be entitled to simply 
write down a list of things that I personally think suck (more or less) 
about this otherwise fine filesystem.

Make of it what you will, but what I am trying to get across is what the 
upper class would probably call 'constructive criticism'.

So here goes:



1. FS MANAGEMENT
================
BTRFS is rather simple to manage. We can add/remove devices on the fly, 
balance the filesystem, scrub, defrag, select compression algorithms, 
etc. Some of these things are done as mount options, some as properties, 
and some by issuing a command that processes something.

Personally, I feel this is a bit messy and in some cases quite backwards. 
I believe the original idea was that BTRFS should support per-subvolume 
mount options, storage profiles, etc., and subvolumes are, after all, a 
key feature of the filesystem.

Heck, we even have a root subvolume (id 256) which ideally is the parent 
(or root) of all other subvolumes on the filesystem. So why on earth do 
we have commands such as 'btrfs balance start -dusage=50 /fsmnt' when 
logically it could just as easily have been 'btrfs <subvolume> balance 
start -dusage=50', i.e. on the root subvolume instead of the fs mount 
point.

Besides, if BTRFS is at some point supposed to be more "subvolume 
centric", then why aren't things like scrub, balance, convert 
(data/metadata), device add/remove, or even defrag handled as properties 
of a subvolume? E.g. why not set a flag that triggers what needs to be 
done, and let the filesystem process that as a background task?

That would, for example, allow finer granularity by scrubbing certain 
subvolumes, instead of having to do the entire filesystem as is 
currently the case.

Status for these jobs does, in my opinion, belong in sysfs, but there is 
nothing wrong with a simple command to prettify the status either.

And yes, I even mentioned device add/remove, because if it were 
possible at some point to assign priority/weight to certain devices for 
certain subvolumes, then making a subvolume prefer or avoid a certain 
storage device would be as "simple" as setting a suitable 
weight/priority, and it would be possible to add/remove (assign) storage 
devices without affecting all other subvolumes.

So for me, 'btrfs property set' (or something similar) sounds like the 
only sensible way of properly managing a BTRFS filesystem. And really, 
with the exception of the rescue and subvolume mount options, most if 
not all other mount options seem to belong better as properties of a 
subvolume (which may or may not be the id 256 / root subvolume).



2. USE DEVICE IDs EVERYWHERE INSTEAD OF /dev/sdX:
==================================================
Using "btrfs filesystem show" will list all BTRFS devices and also show 
the assigned ID for each device / partition / whatever. Since BTRFS 
already has the notion of a device ID, it seems pointless not to use 
that ID for management / identification wherever possible.
(for example btrfs device stats /mnt)


3. SOME DEVICES MISSING SHOULD BE ID 1,2,3,4... MISSING:
========================================================
If one or more devices are missing, it would be great to know WHAT 
devices were missing. Why not print the IDs of the missing devices 
instead of just letting the user know that "some" of them are missing?



4. THE ABILITY TO SET A LABEL FOR A DEVICE ID:
==============================================
It would be great to be able to set a label for a BTRFS device ID. For 
example ID1 = "Shelf01.24", ID2 = "NAS_01", ID3 = "localdiskXYZ".



5. DEDUPLICATION IS NOT INTEGRATED IN BTRFS:
============================================
I think that some form of (simple) deduplication should be integrated 
into BTRFS. Using unofficial tools may be perfectly safe, but it feels 
"unsafe" to be honest. Besides, deduplication is something that might 
be interesting to turn on/on_whenidle/off as a property of a subvolume 
as well.



6. DEVICE STATS:
================
Again, device IDs are not used, but also, why is this info not listed in 
a table? Showing this in a table would turn 5 lines into 1 line, which 
would be far more readable. Finally, it is not clear to me which counts 
are fixed errors and which represent actual damage accumulated in the 
filesystem.
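As a rough illustration of the table idea, here is a short Python sketch (not part of btrfs-progs). The sample input mimics the "[<dev>].<counter> <value>" lines that `btrfs device stats` prints, with made-up devices and values:

```python
# Sketch: pivot the per-line output of `btrfs device stats` into one
# row per device. SAMPLE mimics the real output format; the devices
# and error counts below are invented.

SAMPLE = """\
[/dev/sda].write_io_errs    0
[/dev/sda].read_io_errs     0
[/dev/sda].flush_io_errs    0
[/dev/sda].corruption_errs  2
[/dev/sda].generation_errs  0
[/dev/sdb].write_io_errs    5
[/dev/sdb].read_io_errs     1
[/dev/sdb].flush_io_errs    0
[/dev/sdb].corruption_errs  0
[/dev/sdb].generation_errs  0
"""

COUNTERS = ["write_io_errs", "read_io_errs", "flush_io_errs",
            "corruption_errs", "generation_errs"]

def tabulate(stats_text):
    """Parse '[dev].counter value' lines into {dev: {counter: value}}."""
    rows = {}
    for line in stats_text.splitlines():
        devcounter, value = line.split()
        dev, counter = devcounter.rsplit(".", 1)
        rows.setdefault(dev.strip("[]"), {})[counter] = int(value)
    return rows

rows = tabulate(SAMPLE)
print("device      " + " ".join(f"{c:>16}" for c in COUNTERS))
for dev, counters in rows.items():
    print(f"{dev:<12}" + " ".join(f"{counters[c]:>16}" for c in COUNTERS))
```

Ten source lines become a header plus one row per device.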



7. LIST OF DAMAGED FILES:
=========================
There is no easy way to get a list of damaged files on a BTRFS 
filesystem, to my knowledge. It would be great to have a command for that.



8. ABILITY TO RESERVE SPARE SPACE:
==================================
Because of the way BTRFS works, a spare device is not very useful. 
Rather, reserved spare space would be a good idea, I think. That way, if 
one device is missing, data could be replicated to other drives (or even 
on a single device [DUP] in emergency situations).



9. ABILITY TO MERGE / CONSUME EXISTING BTRFS:
=============================================
It would be great to merge existing BTRFS volumes into a larger 
volume, e.g. assimilate them ...because we all know resistance is futile.
Again, a subvolume would be the cleanest way of importing another BTRFS, 
I think.


10. AUTOREJECT FAILED DEVICES:
==============================
As I have mentioned before, if it were possible to assign certain 
storage devices / storage device groups to certain subvolumes, then as 
the failure count for a device increases, it may be preferable to 
automatically lower the weight/priority of that device so that things 
are stored elsewhere. If auto-migration is triggered at a low enough 
weight, then devices with a high failure rate/count could be rejected.
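To make the weight idea concrete, here is a hypothetical Python sketch. The Device class, the decay rule, and the selection function are all invented for illustration; nothing like this exists in btrfs today:

```python
import random

# Hypothetical model: every device starts at full weight, each recorded
# error lowers it, and a device whose weight reaches zero is no longer
# chosen for new allocations (i.e. it is auto-rejected).

class Device:
    def __init__(self, devid, weight=100):
        self.devid = devid
        self.weight = weight
        self.errors = 0

    def record_error(self, penalty=10):
        self.errors += 1
        self.weight = max(0, self.weight - penalty)

def pick_device(devices, rng=random.random):
    """Weighted random pick; weight-0 devices are never chosen."""
    eligible = [d for d in devices if d.weight > 0]
    if not eligible:
        raise RuntimeError("no writable devices left")
    total = sum(d.weight for d in eligible)
    r = rng() * total
    for d in eligible:
        r -= d.weight
        if r <= 0:
            return d
    return eligible[-1]

devs = [Device(1), Device(2)]
for _ in range(10):           # device 2 accumulates failures...
    devs[1].record_error()
assert devs[1].weight == 0    # ...and is auto-rejected
assert pick_device(devs).devid == 1
```

Auto-migration would then be a matter of re-running the picker for data currently residing on low-weight devices.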


11. That's it folks!
====================
I know it is a lot of "rant", but I hope someone finds it useful or 
inspiring. If for nothing more than to keep my mouth shut. ;)



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: BTRFS list of grievances
  2024-09-27 11:20 BTRFS list of grievances waxhead
@ 2024-09-27 16:27 ` Roman Mamedov
  2024-09-27 18:05   ` Remi Gauvin
                     ` (2 more replies)
  2024-09-27 17:44 ` Mark Harmstone
  2024-09-30 21:43 ` Goffredo Baroncelli
  2 siblings, 3 replies; 14+ messages in thread
From: Roman Mamedov @ 2024-09-27 16:27 UTC (permalink / raw)
  To: waxhead; +Cc: Btrfs BTRFS

On Fri, 27 Sep 2024 13:20:14 +0200
waxhead <waxhead@dirtcellar.net> wrote:

> 1. FS MANAGEMENT
> ================
> BTRFS is rather simple to manage. We can add/remove devices on the fly, 
> balance the filesystem, scrub, defrag, select compression algorithms 
> etc. Some of these things are done as mount options, some as properties 
> and some by issuing a command that process something.

I will add my annoyance or rather a showstopper.

Consider a RAID1 of two 20TB disks. One disk disconnects and the system
operates on just the remaining one for a few days.

Side note: will Btrfs even agree to operate in such a state without a
constant stream of errors to dmesg?

Then the disk is reconnected to the system.

For a start, are we even able to cleanly forget an abruptly disappeared drive
in RAID1, and then re-add it when the same disk reappears (possibly under a
different /dev/sdX location)? Without remounting or rebooting?

Secondly, it feels like you'll be extremely lucky not to die a fiery death of
"parent transid mismatch errors" right away with Btrfs, after this.

Or if not, then how do you get from there to a consistent state? Run a scrub,
making the system reread the entire 40 TB of data, correcting errors and
missing duplication where necessary.

Meanwhile, mdadm RAID1: thanks to the Write-intent bitmap, after a re-add the
RAID resyncs just the small changed areas from the continuously running disk
to the temporarily-absent one, and the array consistency is almost instantly
restored, in many cases just with a few GBs read and written.
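For readers unfamiliar with the mechanism, the mdadm behaviour described above can be modelled in a few lines of Python. The region size and data structures are toy values chosen for illustration, not mdadm's actual layout:

```python
# Toy model of a write-intent bitmap: while a mirror member is absent,
# writes set a "dirty" bit for the affected region; on re-add, only
# dirty regions are copied instead of the whole device.

REGION = 4  # blocks per bitmap region (mdadm uses much larger chunks)

class Mirror:
    def __init__(self, nblocks):
        self.primary = [0] * nblocks
        self.secondary = [0] * nblocks
        self.degraded = False
        self.dirty = set()  # region indices written while degraded

    def write(self, block, value):
        self.primary[block] = value
        if self.degraded:
            self.dirty.add(block // REGION)
        else:
            self.secondary[block] = value

    def readd(self):
        """Resync only the dirty regions, then clear the bitmap."""
        for region in sorted(self.dirty):
            start = region * REGION
            self.secondary[start:start + REGION] = \
                self.primary[start:start + REGION]
        synced = len(self.dirty)
        self.dirty.clear()
        self.degraded = False
        return synced

m = Mirror(1024)
m.degraded = True            # one member drops out
m.write(7, 42)
m.write(500, 99)
assert m.readd() == 2        # only two regions resynced, not 1024 blocks
assert m.secondary[7] == 42 and m.secondary[500] == 99
```

The point is that resync cost scales with the amount written while degraded, not with the device size.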

Or maybe I underestimate the current Btrfs capabilities here?

-- 
With respect,
Roman


* Re: BTRFS list of grievances
  2024-09-27 11:20 BTRFS list of grievances waxhead
  2024-09-27 16:27 ` Roman Mamedov
@ 2024-09-27 17:44 ` Mark Harmstone
  2024-09-30 21:43 ` Goffredo Baroncelli
  2 siblings, 0 replies; 14+ messages in thread
From: Mark Harmstone @ 2024-09-27 17:44 UTC (permalink / raw)
  To: waxhead@dirtcellar.net, Btrfs BTRFS

On 27/9/24 12:20, waxhead wrote:
> 2. USE DEVICE ID's EVERYWHERE INSTEAD OF /dev/sdX:
> 4. THE ABILITY TO SET A LABEL FOR A DEVICE ID:

I think the view is that this is the job of udev, not btrfs. Presumably 
you can use udev rules to give block devices arbitrary names.
Maybe it might be useful if there was an option to btrfs-progs so that 
it printed the symlink names in /dev/disk/by-partlabel if present.

> 9. ABILITY TO MERGE / CONSUME EXISTING BTRFS:

Yeah, I've had the same idea - this is something that's definitely 
possible, it just needs someone to implement it.

If we're writing a wishlist, I'll add:

BAD SECTOR TREE

i.e. a list of sectors known to be bad that the allocator should avoid, 
in the same way that it avoids the superblocks. NTFS has something 
similar, and I think ext2 does too.

Mark


* Re: BTRFS list of grievances
  2024-09-27 16:27 ` Roman Mamedov
@ 2024-09-27 18:05   ` Remi Gauvin
  2024-09-27 19:01     ` Colin S
  2024-09-28 10:15   ` Paul Jones
  2024-09-28 17:51   ` Roman Mamedov
  2 siblings, 1 reply; 14+ messages in thread
From: Remi Gauvin @ 2024-09-27 18:05 UTC (permalink / raw)
  To: Roman Mamedov, waxhead; +Cc: Btrfs BTRFS

On 2024-09-27 12:27 p.m., Roman Mamedov wrote:
>
> Or if not, then how do you get from there to a consistent state? Run a scrub,
> make the system reread the entire 40 TB of data, correcting errors and lack of
> duplication where necessary.
>

The BTRFS handling of this situation is actually worse.

The often-given (and entirely too simple) answer is to scrub. But this
has several caveats.


1. The system will not detect the error state automatically, so fixing
this requires the admin to be actively monitoring for errors to detect
the missed writes. (Regular monitoring of btrfs dev stats and alerting
on errors is required.)

2. Any files that are stored on the device with CoW disabled will
not be fixed, and the two copies will be different, with no real way to
detect or fix this. There are packages that disable CoW on files by
default (systemd journal files, for example, and, probably more
concerning, virtual disks created by libvirt). Some amount of divergence
can happen at any unclean shutdown in this scenario.

3. I don't have the exact math at my fingertips, but with enough failed
writes, the chance of a CRC32 collision with the stale data, leaving
unfixed/corrupted data behind, gets fairly high.

For reasons 2 and 3, the only way to fix this without increasing the
chance of data corruption is to replace the previously disconnected
drive with a hot spare (with the -r option to btrfs replace).






* Re: BTRFS list of grievances
  2024-09-27 18:05   ` Remi Gauvin
@ 2024-09-27 19:01     ` Colin S
  2024-10-02 19:31       ` Chris Murphy
  0 siblings, 1 reply; 14+ messages in thread
From: Colin S @ 2024-09-27 19:01 UTC (permalink / raw)
  To: Btrfs BTRFS

On 27/09/2024 13:05, Remi Gauvin wrote:
> 2.  Any files that are stored on on the device with CoW Disabled will
> not be fixed, and the two copies will be different, with no real way to
> detect or fix.  There are packages that disable CoW on files by
> default.  (systemd log files, but probably more concerning, and virtual
> disk created by libvirt, for example. (Some amount of divergence can
> happen at any unclean shutdown in this scenario)

For reference, there is a longstanding open request for enhancement to 
detect mismatch and enable manual recovery[0].

> 3. I don't have exact math at my fingertips, but with enough failed
> writes, the chances of a CRC32 collision of the stale data leaving
> unfixed/corrupted data behind gets fairly high.

It was mentioned to me that the CRC32C stale-collision chance for 12 TiB 
is 75%[1]. Given that there is a proven alternative with no risk of 
collision (a write-intent bitmap), I would say relying on checksums here 
is the wrong thing to do.
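One way to sanity-check a figure of this magnitude: under a naive model where each of N stale 4 KiB blocks independently matches a uniformly random 32-bit checksum, the expected number of silent matches for 12 TiB of stale data is N / 2^32 = 0.75 exactly, and the probability of at least one match is about 53%. The exact model behind the cited 75% may differ (it coincides with the expected-count figure rather than the probability), but the order of magnitude is clearly alarming either way:

```python
import math

# Naive birthday-style estimate of unfixable stale-block matches.
TIB = 2 ** 40
BLOCK = 4096                 # assumed checksummed block size
N = 12 * TIB // BLOCK        # stale blocks after a 12 TiB divergence
SPACE = 2 ** 32              # CRC32C output space

expected_matches = N / SPACE            # 12 * 2**28 / 2**32 = 0.75
p_at_least_one = 1 - math.exp(-expected_matches)

print(f"stale blocks: {N}")
print(f"expected silent matches: {expected_matches}")   # 0.75
print(f"P(at least one): {p_at_least_one:.3f}")         # ~0.528
```

Either number says a multi-terabyte divergence cannot be safely reconciled by checksums alone.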

> For reasons of 2 and 3, the only way to fix this without increasing
> chance of data corruption is to replace the previously disconnected
> drive to a hot spare. (with the -r option to btrfs replace.)

Furthermore, if a lost device ever mounts rw on its own, it will cause 
permanent split-brain, because btrfs doesn’t track lost devices so will 
happily rejoin all devices again later. Compared to everything else that 
btrfs already solves, this seems like such a trivial problem, as my 
understanding is that it only requires storing a bitmap on each device 
that indicates which other devices were present/absent according to that 
device, and if the bitmaps don’t match, then don’t rejoin the devices 
without manual intervention.

I wrote about this exact thing already a little over a month ago[2] plus 
gave a dozen citations to past discussions, and didn’t get any feedback 
from anyone working on btrfs. btrfs developers: short of implementing a 
write-intent bitmap myself, which is not possible, what can I (or anyone 
else) do to get some developer time on this?

Thanks,

[0] https://github.com/kdave/btrfs-progs/issues/134
[1] https://github.com/kdave/btrfs-progs/pull/863#discussion_r1710574045
[2] 
https://lore.kernel.org/linux-btrfs/55c3f03d-a650-4193-8982-ffcb70575c2e@zetafleet.com/T/


* RE: BTRFS list of grievances
  2024-09-27 16:27 ` Roman Mamedov
  2024-09-27 18:05   ` Remi Gauvin
@ 2024-09-28 10:15   ` Paul Jones
  2024-09-28 17:51   ` Roman Mamedov
  2 siblings, 0 replies; 14+ messages in thread
From: Paul Jones @ 2024-09-28 10:15 UTC (permalink / raw)
  To: Roman Mamedov, waxhead; +Cc: Btrfs BTRFS

> -----Original Message-----
> From: Roman Mamedov <rm@romanrm.net>
> Sent: Saturday, 28 September 2024 2:28 AM
> To: waxhead <waxhead@dirtcellar.net>
> Cc: Btrfs BTRFS <linux-btrfs@vger.kernel.org>
> Subject: Re: BTRFS list of grievances
> 
> On Fri, 27 Sep 2024 13:20:14 +0200
> waxhead <waxhead@dirtcellar.net> wrote:
> 
> > 1. FS MANAGEMENT
> > ================
> > BTRFS is rather simple to manage. We can add/remove devices on the
> > fly, balance the filesystem, scrub, defrag, select compression
> > algorithms etc. Some of these things are done as mount options, some
> > as properties and some by issuing a command that process something.
> 
> I will add my annoyance or rather a showstopper.
> 
> Consider a RAID1 of two 20TB disks. One disk disconnects and the system
> operates on just the remaining one for a few days.
> 
> Side note: will Btrfs even agree to operate in such state without constant
> stream of errors to dmesg?
> 
> Then the disk is reconnected to the system.
> 
> For a start, are we even able to cleanly forget an abruptly disappeared drive
> in RAID1, and then re-add it back when the same disk it reappears (under a
> different /dev/sdX location)? Without remounting and reboot?
> 
> Secondly, it feels like you'll be extremely lucky not to die a fiery death of
> "parent transid mismatch errors" right away with Btrfs, after this.
> 
> Or if not, then how do you get from there to a consistent state? Run a scrub,
> make the system reread the entire 40 TB of data, correcting errors and lack of
> duplication where necessary.
> 
> Meanwhile, mdadm RAID1: thanks to the Write-intent bitmap, after a re-add
> the RAID resyncs just the small changed areas from the continuously running
> disk to the temporarily-absent one, and the array consistency is almost
> instantly restored, in many cases just with a few GBs read and written.
> 
> Or maybe I underestimate the current Btrfs capabilities here?

I have some experience with this - once the disk is reconnected: unmount, btrfs sync, mount. Yes, there will be a firestorm of errors when recent data is accessed (I've had over 100k errors fixed by scrub) but all the data stays intact. You do need to run scrub eventually to be sure all errors have been found and eliminated, but btrfs will fix any problems it encounters on the fly so immediate scrub/rebuild is not needed.
It's not the perfect solution but it's definitely robust. Re-adding a disk without unmounting would be amazing.

Paul.


* Re: BTRFS list of grievances
  2024-09-27 16:27 ` Roman Mamedov
  2024-09-27 18:05   ` Remi Gauvin
  2024-09-28 10:15   ` Paul Jones
@ 2024-09-28 17:51   ` Roman Mamedov
  2 siblings, 0 replies; 14+ messages in thread
From: Roman Mamedov @ 2024-09-28 17:51 UTC (permalink / raw)
  To: waxhead; +Cc: Btrfs BTRFS

On Fri, 27 Sep 2024 21:27:55 +0500
Roman Mamedov <rm@romanrm.net> wrote:

> "parent transid mismatch errors" right away with Btrfs, after this.

Speaking of which, another annoyance is that a "parent transid verify failed"
still seems to be game over for any Btrfs filesystem (salvage data, reformat
and restore from backups), even with a low transid difference.

There is no option to forcibly restore FS consistency in exchange for losing
some of the stored data.

And it still happens sometimes, in conjunction with things like USB enclosures,
faulty cables or power cuts.

-- 
With respect,
Roman


* Re: BTRFS list of grievances
  2024-09-27 11:20 BTRFS list of grievances waxhead
  2024-09-27 16:27 ` Roman Mamedov
  2024-09-27 17:44 ` Mark Harmstone
@ 2024-09-30 21:43 ` Goffredo Baroncelli
  2024-10-03 17:10   ` Goffredo Baroncelli
  2 siblings, 1 reply; 14+ messages in thread
From: Goffredo Baroncelli @ 2024-09-30 21:43 UTC (permalink / raw)
  To: waxhead, Btrfs BTRFS

On 27/09/2024 13.20, waxhead wrote:
> First thing first: I am a long time BTRFS user and frequent reader of the mailing list. I am *NOT* a BTRFS developer, but that being said I have been known to summon a segmentation failure or two from years of programming in C.
> 
> Since I have been using BTRFS more or less problem free since 2013 or so for nearly everything, I figured that I should be entitled to simply write down a list of things that I personally think sucks (more or less) with this otherwise fine filesystem
> 
> Make of it what you will, but what I am trying to get across is what the upper class would probably call 'constructive criticism'.
> 
> So here goes:
> 
> 
> 
> 1. FS MANAGEMENT
> ================
> BTRFS is rather simple to manage. We can add/remove devices on the fly, balance the filesystem, scrub, defrag, select compression algorithms etc. Some of these things are done as mount options, some as properties and some by issuing a command that process something.
> 
> Personally, I feel this is a bit messy and in some cases quite backwards at times. I believe the original idea was that BTRFS should support pr. subvolume mount options, storage profiles, etc etc.... and subvolumes are after all a key feature of the filesystem.
> 
> Heck, we even have a root subvolume (id 256) which ideally is the parent (or root) for all other subvolumes on the filesystem. So why on earth do we have commands such as 'btrfs balance start -dusage=50 /fsmnt' when logically it could just has easily have been 'btrfs <subvolume> balance start -dusage=50' . E.g. on the root subvolume instead of the fs mount point.
> 
> Besides, if BTRFS at some point are supposed to be more "subvolume centric" then why are not things like scrub, balance, convert (data/metadata), device add/remove or even defrag handled as properties to a subvolume. E.g. why not set a flag that triggers what needs to be done, and let the filesystem process that as a background task.
> 
> That would for example allow for finer granularity for scrub for certain subvolumes, instead of having to do the entire filesystem as it currently is now.

I am not sure I agree. Some properties are per "filesystem", others are per "subvolume"; since a "subvolume" is a subset of a filesystem, it might seem that providing a setting on a per-"subvolume" basis gives the user more flexibility.
However, this is a gain only if there isn't any possible confusion about what the filesystem will do. For example, it is not clear to me what it means to do a balance (e.g. reshape the raid profile) for a subvolume when a snapshot also exists: does the user want to balance only the subvolume (un-sharing the data), or does the user want to balance the subvolume data and all the shared extents? I am not saying that we cannot define a semantic for a subvolume balance; I am saying that this is not so obvious and should be avoided.

I think that, depending on the use case, the expectation might be different. IMHO a filesystem should behave following the "least surprise" principle. And if something may be misunderstood, then it is better not to have it.

This is to say that, for me, if something is related to shared data, it should be "per filesystem" (like the raid profile), to avoid any ambiguity. Other properties (like an inode property) should be on a per-"subvolume" basis.

> 
> Status for the jobs do in my opinion belong in sysfs, but there is nothing wrong with a simple command to "pretty'fy" the status either.
> 
> And yes, I even mentioned device add/remove because if it would be possible at some point to assign priority/weight to certain devices for certain subvolumes then making a subvolume prefer or avoid using a certain storage device wold be as "simple" as setting a suitable weight/priority, and it would be possible to add/remove (assign) storage devices without affecting all other subvolumes.
> 
> So for me , 'btrfs property set' (or something similar) sounds like the only sensible way of properly managing a BTRFS. And really, with the exception of the rescue and subvolume mount options most, if not all other mount options seems to better belong as a property for a subvolume (which may or may not be the id 256 / root subvolume)
> 
> 
> 
> 2. USE DEVICE ID's EVERYWHERE INSTEAD OF /dev/sdX:
> ==================================================
> Using "btrfs filesystem show" will list all BTRFS devices, and also show the assigned ID for that device / partition / whatever. Since BTRFS already have the notion of a device ID, it seems pointless to not use that ID for management / identification anywhere possible.
> (for example btrfs device stat /mnt)
> 

I suggest supporting both ways: if the argument is a device, interpret it as a device; otherwise, try to interpret it as an ID. I worked on something like that in the past, but I never finalized it.

> 
> 3. SOME DEVICES MISSING SHOULD BE ID 1,2,3,4... MISSING:
> ========================================================
> If one or more devices are missing it would have been great to know WHAT devices where missing. Why not print the ID's of the missing devices instead of just let the user know that "some" of them are missing?
> 

+1

> 
> 4. THE ABILITY TO SET A LABEL FOR A DEVICE ID:
> ==============================================
> It would have been great to set a label for a BTRFS device ID. For example ID1 = "Shelf01.24", ID2 = "NAS_01", ID3 = "localdiskXYZ"
> 

Considering the ubiquity of GUID partition tables, currently we have:
- a device name (/dev/sdx), which can be customized by udev
- a partition type GUID
- a unique partition UUID
- a partition label (yes, GPT has room for 36 UTF-16 code units)
- a btrfs sub UUID
- a btrfs device ID

I think that is enough :-), and a further label would only increase the confusion.

> 
> 5. DEDUPLICATION IS NOT INTEGRATED IN BTRFS:
> ============================================
> I think that some form of (simple) deduplication should be integrated in BTRFS. Using unofficial tools may be perfectly safe, but it feels "unsafe" to be honest. Besides deduplication is something that might have been interesting to turn on/on_whenidle/off as a property to a subvolume as well.
> 

It is not clear if the problem is "online vs offline" deduplication or the fact that dedup is not integrated into the btrfs-progs commands.

> 
> 6. DEVICE STATS:
> ================
> Again device ID's are not used, but also why is this info not listed in a table? Showing this in a table would make 5x lines become 1x line which would be far more readable. Finaly it is not clear to me what is fixed errors, and what are actual damage accumulated in the filesystem
> 

+1

> 
> 7. LIST OF DAMAGED FILES:
> =========================
> There is no easy way to get a list of damaged files on a BTRFS filesystem to my knowledge. It would be great to have a command for that.
> 

I am not sure if it's worth the complexity. Basically, right now it is enough to look in the log for a filesystem error showing the inode. Tracking damaged inodes at the filesystem level would increase the complexity.

> 
> 8. ABILITY TO RESERVE SPARE SPACE:
> ==================================
> Because of the way BTRFS works a spare device is not very useful. Rather spare space would be a good idea I think. That way if one device is missing data, it could be replicated to other drives (or even on a single device [DUP] in emergency situations)
> 

We could reserve (e.g.) 1 GB on each disk that cannot be allocated until root requests it. That would not prevent the exhaustion of free space, but it would prevent the situation where the user cannot free space because... there is no space.
When the filesystem fills all the disk(s) (with the exception of the above 1 GB reserved space), it goes read-only; then the administrator can unlock the reserved space and start removing things.
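The reservation scheme sketched above, modelled as a toy allocator in Python (the sizes and the unlock API are invented for illustration):

```python
# Toy model of per-disk emergency reservation: each disk keeps 1 GiB
# that ordinary allocations may not touch; once the admin "unlocks"
# it, that headroom becomes available so deletions can proceed.

GIB = 2 ** 30

class Disk:
    def __init__(self, size, reserve=GIB):
        self.size = size
        self.used = 0
        self.reserve = reserve
        self.unlocked = False

    def free(self):
        """Allocatable space; the reserve is hidden until unlocked."""
        headroom = 0 if self.unlocked else self.reserve
        return self.size - self.used - headroom

    def alloc(self, n):
        if n > self.free():
            raise OSError("ENOSPC")
        self.used += n

d = Disk(10 * GIB)
d.alloc(9 * GIB)             # fine: 9 GiB used, 1 GiB still reserved
try:
    d.alloc(GIB)             # would eat the reserve: refused
    hit_enospc = False
except OSError:
    hit_enospc = True
assert hit_enospc
d.unlocked = True            # admin unlocks the emergency headroom
d.alloc(GIB)                 # now succeeds, so cleanup can run
assert d.used == 10 * GIB
```

The "goes read-only" step would correspond to the allocator refusing everything before the unlock.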


> 
> 9. ABILITY TO MERGE / CONSUME EXISTING BTRFS:
> =============================================
> It would have been great to merge existing BTRFS volumes into a larger volume e.g. assimilate it ..because we all know resistance is futile.
> Again a subvolume would be the cleanest way of importing another BTRFS I think.
> 

What about inode number collisions?

[...]

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: BTRFS list of grievances
  2024-09-27 19:01     ` Colin S
@ 2024-10-02 19:31       ` Chris Murphy
  2024-10-02 23:18         ` Colin S
  0 siblings, 1 reply; 14+ messages in thread
From: Chris Murphy @ 2024-10-02 19:31 UTC (permalink / raw)
  To: Colin S, Btrfs BTRFS



On Fri, Sep 27, 2024, at 3:01 PM, Colin S wrote:

> Furthermore, if a lost device ever mounts rw on its own, it will cause 
> permanent split-brain, because btrfs doesn’t track lost devices so will 
> happily rejoin all devices again later. 

RW degraded mount makes transid ambiguous. There isn't a timestamp in the super, so we can't use that to help disambiguate matching or similar transids on multiple members that were mounted rw degraded.

One idea I had is a "mounted degraded" flag that would cause the kernel to do some logic to prevent rw mount that will cause the split brain problem. i.e. do not permit the mount of a file system when 2+ devices present have the degraded rw flag set. Perhaps not even RO, I'm not sure.

Would such a flag need to go in the super though? Or could we just make such a thing an item in the device tree? And for that matter, add fs create time, and the last mounted and unmounted times in device tree?

We also need a partial scrub, i.e. start a scrub from a certain point so that not all data and metadata needs to be read. Write intent bitmap would help do that but can we infer a write intent bitmap via transid? 

Or still another idea, a variation on the seed device but a single device can be both seed and sprout?  i.e. upon mounting rw degraded, changes to the filesystem need to go in a separate location, the point being to preserve the state prior to mounting degraded, and isolate the degraded writes to "play them back" later when all the drives are together again and we're running normally (not degraded).

We really need some things in place with automatic degraded recovery and device readd before we could ever figure out how to have unattended degraded boot (for the 10 people on earth who want this - bad but funny joke). Right now we can't set degraded mount option persistently because split brain. And we can't even try to mount when not all devices are present because mount will fail (without degraded mount option). Therefore there's a udev rule in place to not even try to mount during boot if not all devices are found. Indefinitely waits. Kinda annoying!



-- 
Chris Murphy


* Re: BTRFS list of grievances
  2024-10-02 19:31       ` Chris Murphy
@ 2024-10-02 23:18         ` Colin S
  0 siblings, 0 replies; 14+ messages in thread
From: Colin S @ 2024-10-02 23:18 UTC (permalink / raw)
  To: Btrfs BTRFS

On 02/10/2024 14:31, Chris Murphy wrote:
> On Fri, Sep 27, 2024, at 3:01 PM, Colin S wrote:
>
>> Furthermore, if a lost device ever mounts rw on its own, it will cause
>> permanent split-brain, because btrfs doesn’t track lost devices so will
>> happily rejoin all devices again later.
> RW degraded mount makes transid ambiguous. There isn't a timestamp in the super, so we can't use that to help disambiguate matching or similar transids on multiple members that were mounted rw degraded.
>
> One idea I had is a "mounted degraded" flag that would cause the kernel to do some logic to prevent rw mount that will cause the split brain problem. i.e. do not permit the mount of a file system when 2+ devices present have the degraded rw flag set. Perhaps not even RO, I'm not sure.
>
> Would such a flag need to go in the super though? Or could we just make such a thing an item in the device tree? And for that matter, add fs create time, and the last mounted and unmounted times in device tree?

I suspect I may be missing something important because I am no FS expert 
and md-raid doesn’t appear to do this exactly, but I can’t think of a 
case where it doesn’t work to have a simple per-device bitmap that says 
which devices have up-to-date writes, from the perspective of each 
device. Using a bitwise-AND across all visible device bitmaps, whichever 
bits remain set must be the good device(s) because they were always seen 
by all devices as never having missed a write. If the bitwise-AND 
results in 0, then there is a split-brain, because it means each device 
saw some other device miss a write. When a lost device reappears, set 
its own up-to-date bit to 0 until it is recovered, and from there I 
think the existing btrfs-replace mechanism can be used to ‘replace’ the 
lost disk with itself.
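The bitmap-intersection rule described above can be sketched in a few lines of Python (the device IDs and data structures are invented for illustration):

```python
# Each device stores a bitmap of which devices it considers up to date.
# AND-ing all visible bitmaps yields the devices never seen missing a
# write; an empty result signals split-brain.

def good_devices(bitmaps):
    """bitmaps: dict devid -> set of devids that device trusts.

    Returns the devids every visible device agrees are up to date;
    an empty set signals split-brain (manual intervention needed)."""
    result = set(bitmaps)            # start from all visible devids
    for trusted in bitmaps.values():
        result &= trusted            # bitwise-AND, as set intersection
    return result

# Normal case: both devices always saw each other.
assert good_devices({1: {1, 2}, 2: {1, 2}}) == {1, 2}

# Device 2 went missing while 1 kept writing: only 1 is good, and
# the reappeared device 2 must be resynced (e.g. replaced by itself).
assert good_devices({1: {1}, 2: {1, 2}}) == {1}

# Split-brain: each device was mounted rw alone and distrusts the other.
assert good_devices({1: {1}, 2: {2}}) == set()
```

In the middle case the kernel could proceed degraded-but-recoverable; only the empty-intersection case would require refusing to rejoin without manual intervention.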

> We also need a partial scrub, i.e. start a scrub from a certain point so that not all data and metadata needs to be read. Write intent bitmap would help do that but can we infer a write intent bitmap via transid?

Whether this is possible is beyond my knowledge level. This patch from 
2022[0] says something about why not the btrfs btree, but I don't know 
if that is talking about some implementation detail, or if it is saying 
that what you suggest is a fundamentally unsound approach. Either way, 
my understanding is that the write-intent bitmap is important for 
reducing the time/resource impact of a failure, but is not strictly 
necessary to solve the data corruption problem; doing a full device scan 
and updating mismatched blocks would be sufficient, just slow. (I think 
checksums cannot be used safely here due to the collision risk.)

> We really need some things in place with automatic degraded recovery and device readd before we could ever figure out how to have unattended degraded boot (for the 10 people on earth who want this - bad but funny joke). Right now we can't set degraded mount option persistently because split brain. And we can't even try to mount when not all devices are present because mount will fail (without degraded mount option). Therefore there's a udev rule in place to not even try to mount during boot if not all devices are found. Indefinitely waits. Kinda annoying!

I may just be repeating/agreeing with what you are saying, but just in 
case: most of the time a degraded mount will not result in split-brain, 
and in a split-brain situation I don’t believe there is ever a safe way 
to automatically choose which half is the right one for a degraded 
mount, so btrfs should not try. Timestamps would not account for errors 
made during hardware recovery, such as rw-mounting the wrong device and 
inadvertently writing to it. As long as btrfs does not try to rejoin a 
split-brain, the user can take action to restore the correct half.

Thank you for sharing your thoughts, and I hope this is the start of a 
solution for this issue.

Best,

[0] 
https://lore.kernel.org/linux-btrfs/bd94acc1-5c1d-203b-8523-e6986206b267@suse.com/T/

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: BTRFS list of grievances
  2024-09-30 21:43 ` Goffredo Baroncelli
@ 2024-10-03 17:10   ` Goffredo Baroncelli
  2024-10-03 17:26     ` Remi Gauvin
  0 siblings, 1 reply; 14+ messages in thread
From: Goffredo Baroncelli @ 2024-10-03 17:10 UTC (permalink / raw)
  To: waxhead; +Cc: Btrfs BTRFS

On 30/09/2024 23.43, Goffredo Baroncelli wrote:
> On 27/09/2024 13.20, waxhead wrote:
[...]
>>
>> 6. DEVICE STATS:
>> ================
>> Again device ID's are not used, but also why is this info not listed in a table? Showing this in a table would make 5x lines become 1x line which would be far more readable. 


This is an already-solved problem:


$ sudo ./btrfs dev stat -T /mnt/btrfs-raid1/
Id Path      Write errors Read errors Flush errors Corruption errors Generation errors
-- --------- ------------ ----------- ------------ ----------------- -----------------
  1 /dev/sda2            0           0            0               763                 0
  2 /dev/sdb2            0           0            0              3504                 0
  3 /dev/sdd2           13           0            0              6218                 0

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: BTRFS list of grievances
  2024-10-03 17:10   ` Goffredo Baroncelli
@ 2024-10-03 17:26     ` Remi Gauvin
  2024-10-03 18:24       ` Goffredo Baroncelli
  0 siblings, 1 reply; 14+ messages in thread
From: Remi Gauvin @ 2024-10-03 17:26 UTC (permalink / raw)
  To: kreijack; +Cc: Btrfs BTRFS

On 2024-10-03 1:10 p.m., Goffredo Baroncelli wrote:
>
> $ sudo ./btrfs dev stat -T /mnt/btrfs-raid1/
> Id Path      Write errors Read errors Flush errors Corruption errors Generation errors
> -- --------- ------------ ----------- ------------ ----------------- -----------------
>  1 /dev/sda2            0           0            0               763                 0
>  2 /dev/sdb2            0           0            0              3504                 0
>  3 /dev/sdd2           13           0            0              6218                 0
>

I hope that's a made up sample and not actual output of your
filesystem.  Otherwise, you have a problem...


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: BTRFS list of grievances
  2024-10-03 17:26     ` Remi Gauvin
@ 2024-10-03 18:24       ` Goffredo Baroncelli
  2024-10-03 18:32         ` Remi Gauvin
  0 siblings, 1 reply; 14+ messages in thread
From: Goffredo Baroncelli @ 2024-10-03 18:24 UTC (permalink / raw)
  To: Remi Gauvin; +Cc: Btrfs BTRFS

On 03/10/2024 19.26, Remi Gauvin wrote:
> On 2024-10-03 1:10 p.m., Goffredo Baroncelli wrote:
>>
>> $ sudo ./btrfs dev stat -T /mnt/btrfs-raid1/
>> Id Path      Write errors Read errors Flush errors Corruption errors Generation errors
>> -- --------- ------------ ----------- ------------ ----------------- -----------------
>>  1 /dev/sda2            0           0            0               763                 0
>>  2 /dev/sdb2            0           0            0              3504                 0
>>  3 /dev/sdd2           13           0            0              6218                 0
>>
> 
> I hope that's a made up sample and not actual output of your
> filesystem.  Otherwise, you have a problem...
> 

It is real output, and I hadn't noticed these errors.

This is an old set of disks, and these are old errors caused by a bad
power supply. After replacing the power supply, all the problems
disappeared.

Of course I didn't have any data issues, thanks to btrfs+raid1.

However, I never bothered to clear those errors. I did run a "btrfs scrub",
which didn't find any errors.

BR

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: BTRFS list of grievances
  2024-10-03 18:24       ` Goffredo Baroncelli
@ 2024-10-03 18:32         ` Remi Gauvin
  0 siblings, 0 replies; 14+ messages in thread
From: Remi Gauvin @ 2024-10-03 18:32 UTC (permalink / raw)
  To: kreijack; +Cc: Btrfs BTRFS

On 2024-10-03 2:24 p.m., Goffredo Baroncelli wrote:
>
>>
>
> It is real output, and I hadn't noticed these errors.
>
> This is an old set of disks, and these are old errors caused by a bad
> power supply. After replacing the power supply, all the problems
> disappeared.
>
> Of course I didn't have any data issues, thanks to btrfs+raid1.
>
> However, I never bothered to clear those errors. I did run a "btrfs scrub",
> which didn't find any errors.



An excellent example of btrfs doing its job as advertised! I would
suggest resetting the counters, though, so future problems will be easy
to spot.



^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2024-10-03 18:32 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-09-27 11:20 BTRFS list of grievances waxhead
2024-09-27 16:27 ` Roman Mamedov
2024-09-27 18:05   ` Remi Gauvin
2024-09-27 19:01     ` Colin S
2024-10-02 19:31       ` Chris Murphy
2024-10-02 23:18         ` Colin S
2024-09-28 10:15   ` Paul Jones
2024-09-28 17:51   ` Roman Mamedov
2024-09-27 17:44 ` Mark Harmstone
2024-09-30 21:43 ` Goffredo Baroncelli
2024-10-03 17:10   ` Goffredo Baroncelli
2024-10-03 17:26     ` Remi Gauvin
2024-10-03 18:24       ` Goffredo Baroncelli
2024-10-03 18:32         ` Remi Gauvin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox