* raid5/6 production use status?
@ 2016-06-01 22:25 Christoph Anton Mitterer
  2016-06-02  9:24 ` Gerald Hopf
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Anton Mitterer @ 2016-06-01 22:25 UTC
To: linux-btrfs

Hey.

I've lost a bit track recently and the wiki changelog doesn't seem to
contain much about how things went on at the RAID5/6 front... so how're
things going?

Is it already more or less "productively" usable? What's still missing?

I guess there still aren't any administrative tools that e.g. monitor
for failed disks or block errors?

Does RAID5/6 itself work already? Is it possible to replace broken
devices (or ones with block errors)? Are things like a completely
failing disk (while the fs is online) handled gracefully?

How about scrubbing/repairing... I assume on read it would identify
silent block errors by the checksum[0] and rebuild the block if
possible. What happens when that fails? Just read errors? Marking the
btrfs RAID as failed and remounting the fs read-only?

Cheers & thx,
Chris.

[0] Except of course for the nodatacow case, which, albeit a major
    case, unfortunately still seems to lack the important checksumming
    support :-(
* Re: raid5/6 production use status?
From: Gerald Hopf @ 2016-06-02 9:24 UTC
To: linux-btrfs

> Hey.
>
> I've lost a bit track recently and the wiki changelog doesn't seem to
> contain much about how things went on at the RAID5/6 front... so how're
> things going?
>
> Is it already more or less "productively" usable? What's still missing?

Well, you still can't even check for free space.

~ # btrfs fi usage /mnt/data-raid
WARNING: RAID56 detected, not implemented
WARNING: RAID56 detected, not implemented
WARNING: RAID56 detected, not implemented
Overall:
    Device size:                  18.19TiB
    Device allocated:                0.00B
    Device unallocated:           18.19TiB
    Device missing:                  0.00B
    Used:                            0.00B
    Free (estimated):                0.00B      (min: 8.00EiB)

btrfs --version ==> btrfs-progs v4.5.3-70-gc1c27b9
kernel          ==> 4.6.0
* Re: raid5/6 production use status?
From: Hugo Mills @ 2016-06-02 9:35 UTC
To: Gerald Hopf
Cc: linux-btrfs

On Thu, Jun 02, 2016 at 11:24:45AM +0200, Gerald Hopf wrote:
> >Hey.
> >
> >I've lost a bit track recently and the wiki changelog doesn't seem to
> >contain much about how things went on at the RAID5/6 front... so how're
> >things going?
> >
> >Is it already more or less "productively" usable? What's still missing?
> Well, you still can't even check for free space.

   You can, but not with that tool.

https://btrfs.wiki.kernel.org/index.php/FAQ#Understanding_free_space.2C_using_the_original_tools

   Hugo.

> ~ # btrfs fi usage /mnt/data-raid
> WARNING: RAID56 detected, not implemented
> WARNING: RAID56 detected, not implemented
> WARNING: RAID56 detected, not implemented
> Overall:
>     Device size:                  18.19TiB
>     Device allocated:                0.00B
>     Device unallocated:           18.19TiB
>     Device missing:                  0.00B
>     Used:                            0.00B
>     Free (estimated):                0.00B      (min: 8.00EiB)
>
> btrfs --version ==> btrfs-progs v4.5.3-70-gc1c27b9
> kernel          ==> 4.6.0

-- 
Hugo Mills             | UNIX: Spanish manufacturer of fire extinguishers
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4          |
* Re: raid5/6 production use status?
From: Gerald Hopf @ 2016-06-02 10:03 UTC
To: Hugo Mills, linux-btrfs

>>> Hey.
>>>
>>> I've lost a bit track recently and the wiki changelog doesn't seem to
>>> contain much about how things went on at the RAID5/6 front... so how're
>>> things going?
>>>
>>> Is it already more or less "productively" usable? What's still missing?
>> Well, you still can't even check for free space.
> You can, but not with that tool.
>
> https://btrfs.wiki.kernel.org/index.php/FAQ#Understanding_free_space.2C_using_the_original_tools
>
> Hugo.

That tool, however, is according to the wiki the "new tool" which you
are supposed to use! The other options are not that good...

btrfs fi usage ==> 3x WARNING: RAID56 detected, not implemented
btrfs fi df    ==> only shows what part of the "allocated" space is in
                   use, which is not useful if you want to know whether
                   you have free space
btrfs fi show  ==> does not show total free space.

I guess you can use the information in btrfs fi show and then subtract
used from total and then multiply that? But by what? n disks? Or by n-1
disks because of parity?
==> multiplying it by all disks (including parity) seems to arrive at a
free space similar to what df -h shows me. But is it correct? Or should
it be 4/5 of this because I have one parity disk and 4 data disks?

I do however stand corrected: You actually can (barely) check for free
space. And you can get a number that might or might not be the free
space.
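
As a rough worked example of the "n or n-1?" question above: with one
parity device out of five, only 4/5 of the raw unallocated space can
hold data. The sketch below is an estimate under stated assumptions
(all devices the same size, all remaining space allocated as full-width
RAID5 stripes, metadata overhead ignored); the function name is made up
for illustration and is not part of btrfs-progs.

```python
TIB = 2**40


def raid5_usable_estimate(raw_unallocated, num_devices, parity_devices=1):
    """Estimate usable data space from raw unallocated space.

    Assumes equal-sized devices and full-width stripes, so the data
    share of every stripe is (n - parity) / n. Hypothetical helper.
    """
    if num_devices <= parity_devices:
        raise ValueError("need more devices than parity devices")
    data_fraction = (num_devices - parity_devices) / num_devices
    return raw_unallocated * data_fraction


if __name__ == "__main__":
    # 18.19 TiB raw across 5 disks (4 data + 1 parity), as in the thread.
    usable = raid5_usable_estimate(18.19 * TIB, num_devices=5)
    print(f"{usable / TIB:.2f} TiB")  # prints "14.55 TiB"
```

Under these assumptions the raw figure (multiplying by all disks)
overstates the usable space, and the 4/5 factor is the one to apply.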
* Re: btrfs (was: raid5/6) production use status (and future)?
From: Christoph Anton Mitterer @ 2016-06-03 17:38 UTC
To: linux-btrfs

Hey..

Hm... so the overall btrfs state seems to be still pretty worrying,
doesn't it?

- RAID5/6 seems far from being stable or even usable,... not to talk
  about higher parity levels, whose earlier posted patches (e.g.
  http://thread.gmane.org/gmane.linux.kernel/1654735) seem to have
  been given up.

- Serious show-stoppers and security deficiencies like the UUID
  collision corruptions/attacks that have been extensively discussed
  earlier, are still open

- a number of important core features not fully working in many
  situations (e.g. the issues with defrag, not being ref-link aware,...
  and I vaguely remember similar things with compression).

- OTOH, defrag seems to be viable for important use cases (VM images,
  DBs,... everything where large files are internally re-written
  randomly).
  Sure there is nodatacow, but with that one effectively completely
  loses one of the core features/promises of btrfs (integrity by
  checksumming)... and as I've shown in an earlier large discussion,
  none of the typical use cases for nodatacow has any high-level
  checksumming, and even if it does, it's not used by default, or
  doesn't give the same benefits as it would on the fs level (like
  using it for RAID recovery).

- other earlier anticipated features like newer/better compression or
  checksum algos seem to be dead as well

- still no real RAID 1

- no end-user/admin grade management/analysis tools that tell
  non-experts about the state/health of their fs, and whether things
  like balance etc. pp. are necessary

- the still problematic documentation situation
* Re: btrfs
From: Austin S Hemmelgarn @ 2016-06-03 19:50 UTC
To: Christoph Anton Mitterer, linux-btrfs

On 2016-06-03 13:38, Christoph Anton Mitterer wrote:
> Hey..
>
> Hm... so the overall btrfs state seems to be still pretty worrying,
> doesn't it?
>
> - RAID5/6 seems far from being stable or even usable,... not to talk
>   about higher parity levels, whose earlier posted patches (e.g.
>   http://thread.gmane.org/gmane.linux.kernel/1654735) seem to have
>   been given up.

There's no point in trying to do higher parity levels if we can't get
regular parity working correctly. Given the current state of things, it
might be better to break even and just rewrite the whole parity raid
thing from scratch, but I doubt that anybody is willing to do that.

>
> - Serious show-stoppers and security deficiencies like the UUID
>   collision corruptions/attacks that have been extensively discussed
>   earlier, are still open

The UUID issue is not a BTRFS specific one, it just happens to be easier
to cause issues with it on BTRFS, it causes problems with all Linux
native filesystems, as well as LVM, and is also an issue on Windows.
There is no way to solve it sanely given the requirement that userspace
not be broken. Properly fixing this would likely make us more dependent
on hardware configuration than even mounting by device name.

>
> - a number of important core features not fully working in many
>   situations (e.g. the issues with defrag, not being ref-link aware,...
>   and I vaguely remember similar things with compression).

OK, how then should defrag handle reflinks? Preserving them prevents it
from being able to completely defragment data.
It's worth pointing out that it is generally pointless to defragment
snapshots, as they are typically infrequently accessed in most use
cases.

>
> - OTOH, defrag seems to be viable for important use cases (VM images,
>   DBs,... everything where large files are internally re-written
>   randomly).
>   Sure there is nodatacow, but with that one effectively completely
>   loses one of the core features/promises of btrfs (integrity by
>   checksumming)... and as I've shown in an earlier large discussion,
>   none of the typical use cases for nodatacow has any high-level
>   checksumming, and even if it does, it's not used by default, or
>   doesn't give the same benefits as it would on the fs level (like
>   using it for RAID recovery).

The argument of nodatacow being viable for anything is a pretty
significant secondary discussion that is itself entirely orthogonal to
the point you appear to be trying to make here.

>
> - other earlier anticipated features like newer/better compression or
>   checksum algos seem to be dead as well

This one I entirely agree about. The arguments against adding other
compression algorithms and new checksums are entirely bogus. Ideally
we'd switch to just encoding API info from the CryptoAPI and let people
use whatever they want from there.

>
> - still no real RAID 1

No, you mean still no higher order replication. I know I'm being
stubborn about this, but RAID-1 is officially defined in the standards
as 2-way replication. The only extant systems that support higher
levels of replication and call it RAID-1 are entirely based on MD RAID
and its poor choice of naming.

Overall, between this and the insanity that is raid5/6, somebody with
significantly more skill than me, and significantly more time than most
of the developers, needs to just take a step back and rewrite the whole
multi-device profile support from scratch.
>
> - no end-user/admin grade management/analysis tools that tell
>   non-experts about the state/health of their fs, and whether things
>   like balance etc. pp. are necessary

I don't see anyone forthcoming with such tools either. As far as basic
monitoring goes, it's trivial to do with simple scripts from tools like
monit or nagios. As for complex things like determining whether a fs
needs to be balanced, that's really non-trivial to figure out. Even
with a person looking at it, it's still not easy to know whether or not
a balance will actually help.

>
> - the still problematic documentation situation

Not trying to rationalize this, but go take a look at a majority of
other projects: most of those that aren't backed by some huge
corporation throwing insane amounts of money at them have at best
mediocre end-user documentation. The fact that more effort is being put
into development than documentation is generally a good thing,
especially for something that is not yet feature complete like BTRFS.
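
A minimal sketch of the kind of "simple script" meant above, assuming
the per-device error counters printed by `btrfs device stats <mount>`
in the form `[/dev/sdX].counter_name  N`. The function names and the
sample text are illustrative only, and the parser reports nothing about
missing devices or overall RAID state.

```python
import re
import subprocess

# One counter per line, e.g. "[/dev/sdb].corruption_errs  3"
STAT_LINE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$")


def parse_device_stats(text):
    """Return {device: {counter: value}} for all non-zero counters."""
    errors = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match is None:
            continue  # ignore anything that is not a counter line
        value = int(match["value"])
        if value > 0:
            errors.setdefault(match["dev"], {})[match["counter"]] = value
    return errors


def read_device_stats(mountpoint):
    """Run `btrfs device stats` on a mounted fs (usually needs root)."""
    result = subprocess.run(
        ["btrfs", "device", "stats", mountpoint],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


# Made-up sample output, so the parser can be tried without a btrfs fs.
SAMPLE = """\
[/dev/sda].write_io_errs    0
[/dev/sda].read_io_errs     0
[/dev/sda].corruption_errs  0
[/dev/sdb].write_io_errs    0
[/dev/sdb].read_io_errs     0
[/dev/sdb].corruption_errs  3
"""

if __name__ == "__main__":
    print(parse_device_stats(SAMPLE))
```

A monit or nagios check could call parse_device_stats(read_device_stats(path))
and alert whenever the returned dictionary is non-empty.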
* Re: btrfs
From: Christoph Anton Mitterer @ 2016-06-04 1:51 UTC
To: Austin S Hemmelgarn, linux-btrfs

On Fri, 2016-06-03 at 15:50 -0400, Austin S Hemmelgarn wrote:
> There's no point in trying to do higher parity levels if we can't get
> regular parity working correctly. Given the current state of things,
> it might be better to break even and just rewrite the whole parity
> raid thing from scratch, but I doubt that anybody is willing to do
> that.

Well... as I've said, things are pretty worrying. Obviously I cannot
really judge, since I'm not into btrfs' development... maybe there's a
lack of manpower?
Since btrfs seems to be a very important part (i.e. next-gen fs),
wouldn't it be possible to either get some additional funding from the
Linux Foundation, or possibly have some of the core developers make an
open call for funding by companies?
Having some additional people, perhaps working full-time on it, may be
a big help.

As for the RAID... given how much time/effort is now being spent on
5/6,.. it really seems that one should have considered multi-parity
from the beginning.
It kinda feels like either, with multi-parity, this whole instability
phase would start again, or it will simply never happen.

> > - Serious show-stoppers and security deficiencies like the UUID
> >   collision corruptions/attacks that have been extensively
> >   discussed earlier, are still open
> The UUID issue is not a BTRFS specific one, it just happens to be
> easier to cause issues with it on BTRFS

uhm this had been discussed extensively before, as I've said... AFAICS
btrfs is the only system we have, that can possibly cause data
corruption or even security breach by UUID collisions.
I'm not aware that other filesystems or LVM are affected; these just
continue to use those devices already "online"... and I think lvm
refuses to activate VGs if conflicting UUIDs are found.

> There is no way to solve it sanely given the requirement that
> userspace not be broken.

No this is not true. Back when this was discussed, I and others
described how it could/should be done,... respectively how
userspace/kernel should behave, in short:
- continue using those devices that are already active
- refusing to (auto)assemble by UUID, if there are conflicts, or
  requiring to specify the devices (with some
  --override-yes-i-know-what-i-do option or so)
- in case of assembling/rebuilding/similar... never doing this
  automatically

I think there were some more corner cases, I basically had them all
discussed in the thread back then (search for "attacking btrfs
filesystems via UUID collisions?" and IIRC some differently titled
parent or child threads).

> Properly fixing this would likely make us more dependent
> on hardware configuration than even mounting by device name.

Sure, if there are colliding UUIDs, and one still wants to mount (by
using some --override-yes-i-know-what-i-do option),.. it would need to
be by specifying the device name...
But where's the problem?
This would anyway only happen if someone either attacks or someone made
a clone, and it's far better to refuse automatic assembly in cases
where accidental corruption can happen or where attacks may be
possible, requiring the user/admin to manually take action, than having
corruption or security breach.

Imagine the simple case: degraded RAID1 on a PC; if btrfs would do some
auto-rebuild based on UUID, then if an attacker knows that he'd just
need to plug in a USB disk with a fitting UUID...and easily gets a copy
of everything on disk, gpg keys, ssh keys, etc.

> > - a number of important core features not fully working in many
> >   situations (e.g. the issues with defrag, not being ref-link
> >   aware,...
> >   and I vaguely remember similar things with compression).
> OK, how then should defrag handle reflinks? Preserving them prevents
> it from being able to completely defragment data.

Didn't that even work in the past and had just some performance issues?

> > - OTOH, defrag seems to be viable for important use cases (VM
> >   images, DBs,... everything where large files are internally
> >   re-written randomly).
> >   Sure there is nodatacow, but with that one effectively completely
> >   loses one of the core features/promises of btrfs (integrity by
> >   checksumming)... and as I've shown in an earlier large
> >   discussion, none of the typical use cases for nodatacow has any
> >   high-level checksumming, and even if it does, it's not used by
> >   default, or doesn't give the same benefits as it would on the fs
> >   level (like using it for RAID recovery).
> The argument of nodatacow being viable for anything is a pretty
> significant secondary discussion that is itself entirely orthogonal
> to the point you appear to be trying to make here.

Well the point here was:
- many people (including myself) like btrfs, its
  (promised/future/current) features
- it's intended as a general purpose fs
- this includes the case of having such file/IO patterns as e.g. for VM
  images or DBs
- this is currently not really doable without losing one of the
  promises (integrity)

So the point I'm trying to make:
People do probably not care so much whether their VM image/etc. is
COWed or not, snapshots/etc. still work with that,... but they may
likely care if the integrity feature is lost.
So IMHO, nodatacow + checksumming deserves to be amongst the top
priorities.

> > - still no real RAID 1
> No, you mean still no higher order replication. I know I'm being
> stubborn about this, but RAID-1 is officially defined in the
> standards as 2-way replication.
I think I remember that you've claimed that last time already, and as
I've said back then:
- what counts is probably the common understanding of the term, which
  is N disks RAID1 = N disks mirrored
- if there is something like an "official definition", it's probably
  the original paper that introduced RAID:
  http://www.eecs.berkeley.edu/Pubs/TechRpts/1987/CSD-87-391.pdf
  PDF page 11, respectively content page 9 describes RAID1 as:
  "This is the most expensive option since *all* disks are
  duplicated..."

> The only extant systems that support higher
> levels of replication and call it RAID-1 are entirely based on MD
> RAID and its poor choice of naming.

Not true either, show me any single hardware RAID controller that does
RAID1 in a dup2 fashion... I manage some >2PiB of storage at the
faculty, and all the controllers we have handle RAID1 in the sense of
"all disks mirrored".

> > - no end-user/admin grade management/analysis tools that tell
> >   non-experts about the state/health of their fs, and whether
> >   things like balance etc. pp. are necessary
> I don't see anyone forthcoming with such tools either. As far as
> basic monitoring, it's trivial to do with simple scripts from tools
> like monit or nagios.

AFAIU, even that isn't really possible right now, is it?
Take RAID again,... there is no place where you can see whether the
RAID state is "optimal", or does that exist in the meantime?
Last time, people were advised to look at the kernel logs, but this is
no proper way to check for the state... logging may simply be
deactivated, or you may have an offline fs, for which the logs have
been lost because they were on another disk.

Not to talk about the inability to properly determine how often btrfs
encountered errors, and "silently" corrected them.
E.g. some statistics about a device, that can be used to decide whether
it's dying.
I think these things should be stored in the fs (and additionally also
on the respective device), where they can also be extracted when no
/var/log is present or when forensics are done.

> As far as complex things like determining whether a fs needs
> balanced, that's really non-trivial to figure out. Even with a
> person looking at it, it's still not easy to know whether or not a
> balance will actually help.

Well I wouldn't call myself a btrfs expert, but from time to time I've
been a bit "more active" on the list. Even I know about these strange
cases (sometimes tricks), like many empty data/meta block groups, that
may or may not get cleaned up, and may result in trouble.
How should the normal user/admin be able to cope with such things if
there are no good tools?

It starts with simple things like:
- adding a further disk to a RAID
  => there should be a tool which tells you: dude, some files are not
     yet "rebuilt" (duplicated),... do a balance or whatever.

> > - the still problematic documentation situation
> Not trying to rationalize this, but go take a look at a majority of
> other projects, most of them that aren't backed by some huge
> corporation throwing insane amounts of money at them have at best
> mediocre end-user documentation. The fact that more effort is being
> put into development than documentation is generally a good thing,
> especially for something that is not yet feature complete like BTRFS.

Uhm.. yes and no...
The lack of documentation (i.e. admin/end-user-grade documentation)
also means that people have less understanding of the system, less
trust, less knowledge of what they can expect/do with it (will Ctrl-C
on btrfs check work? what if I shut down during a balance? does it
break then? etc. pp.), less will to play with it.

Further,...
if btrfs would reach the state of being "feature complete" (if that
ever happens, and I don't mean because of slow development, but rather
because most other fs show that development goes on "forever"),...
there would be *so much* to do in documentation, that it's unlikely it
will happen.

Cheers,
Chris.
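
The refusal rule from the short list in the message above (no automatic
assembly when identifiers collide) could look roughly like the sketch
below. Everything in it is an assumption made for illustration: the
Candidate tuple, the function names, and the choice to treat a repeated
(fsid, devid) pair as a collision. It is not btrfs code, and devices
that are already active are outside its scope.

```python
from collections import defaultdict
from typing import NamedTuple


class Candidate(NamedTuple):
    """A scanned block device claiming membership in a filesystem."""
    path: str
    fsid: str   # filesystem UUID shared by all members
    devid: int  # per-device id, expected to be unique within one fsid


def find_conflicts(candidates):
    """Map each (fsid, devid) claimed by more than one device to its paths."""
    seen = defaultdict(list)
    for cand in candidates:
        seen[(cand.fsid, cand.devid)].append(cand.path)
    return {key: paths for key, paths in seen.items() if len(paths) > 1}


def may_auto_assemble(fsid, candidates):
    """Allow automatic assembly only if no member of fsid is duplicated."""
    return not any(key[0] == fsid for key in find_conflicts(candidates))


if __name__ == "__main__":
    scanned = [
        Candidate("/dev/sda2", "fsid-A", 1),
        Candidate("/dev/sdb2", "fsid-A", 2),
        Candidate("/dev/sdc1", "fsid-A", 2),  # e.g. a clone on a USB disk
        Candidate("/dev/sdd1", "fsid-B", 1),
    ]
    print(find_conflicts(scanned))
    print(may_auto_assemble("fsid-A", scanned))  # False: admin must pick devices
    print(may_auto_assemble("fsid-B", scanned))  # True: no collision
```

With a policy like this, the colliding filesystem would only be
mountable by naming the devices explicitly, which is the manual
override described in the message.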
* Re: btrfs
From: Andrei Borzenkov @ 2016-06-04 7:24 UTC
To: Christoph Anton Mitterer, Austin S Hemmelgarn, linux-btrfs

04.06.2016 04:51, Christoph Anton Mitterer wrote:
...
>
>> The only extant systems that support higher
>> levels of replication and call it RAID-1 are entirely based on MD
>> RAID and its poor choice of naming.
>
> Not true either, show me any single hardware RAID controller that does
> RAID1 in a dup2 fashion... I manage some >2PiB of storage at the
> faculty, and all the controllers we have handle RAID1 in the sense of
> "all disks mirrored".
>

Out of curiosity - which model of hardware controllers? Those I am
aware of simply won't let you create RAID1 if more than 2 disks are
selected.
* Re: btrfs
From: Chris Murphy @ 2016-06-04 17:00 UTC
To: Andrei Borzenkov
Cc: Christoph Anton Mitterer, Austin S Hemmelgarn, Btrfs BTRFS

On Sat, Jun 4, 2016 at 1:24 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
> 04.06.2016 04:51, Christoph Anton Mitterer wrote:
> ...
>>
>>> The only extant systems that support higher
>>> levels of replication and call it RAID-1 are entirely based on MD
>>> RAID and its poor choice of naming.
>>
>> Not true either, show me any single hardware RAID controller that does
>> RAID1 in a dup2 fashion... I manage some >2PiB of storage at the
>> faculty, and all the controllers we have handle RAID1 in the sense of
>> "all disks mirrored".
>>
>
> Out of curiosity - which model of hardware controllers? Those I am aware
> of simply won't let you create RAID1 if more than 2 disks are selected.

SNIA's DDF 2.0 spec Rev 19, pages 18/19, shows "RAID-1 Simple
Mirroring" vs "RAID-1 Multi-Mirroring"

-- 
Chris Murphy
* Re: btrfs
From: Christoph Anton Mitterer @ 2016-06-04 17:37 UTC
To: Chris Murphy
Cc: Austin S Hemmelgarn, Btrfs BTRFS

On Sat, 2016-06-04 at 11:00 -0600, Chris Murphy wrote:
> SNIA's DDF 2.0 spec Rev 19, pages 18/19, shows "RAID-1 Simple
> Mirroring" vs "RAID-1 Multi-Mirroring"

And DDF came how many years after the original RAID paper, by which
time everyone understood RAID1 as it was defined there? 1987 vs. ~2003
or so?

Also SNIA's "standard definition" seems pretty strange, doesn't it?
They have two RAID1, as you say:
- "simple mirroring" with n=2
- "multi mirroring" with n=3

I wouldn't see why the n=2 case is "simpler" than the n=3 case, nor why
the n=3 case is multi and the n=2 is not (it's also already multiple
disks).
Also why did they allow n=3 but not n>=3? If n=4 wouldn't make sense,
why would n=3, compared to n=2?

Anyway,...
- the original paper defines it as n mirrored disks
- Wikipedia handles it like that
- the already existing major RAID implementation (MD) in the Linux
  kernel handles it like that
- LVM's native mirroring allows setting the number of mirrors, i.e. it
  allows for everything >=2, which is IMHO closer to the common meaning
  of RAID1 than to btrfs' two-duplicates

So even if there were some reasonable competing definition (and I
don't think the rather proprietary DDF is very reasonable here), why
use one that is incompatible with everything we have in Linux?

Cheers,
Chris.
* Re: btrfs
From: Chris Murphy @ 2016-06-04 19:13 UTC
To: Christoph Anton Mitterer
Cc: Chris Murphy, Austin S Hemmelgarn, Btrfs BTRFS

On Sat, Jun 4, 2016 at 11:37 AM, Christoph Anton Mitterer <calestyo@scientia.net> wrote:
> On Sat, 2016-06-04 at 11:00 -0600, Chris Murphy wrote:
>> SNIA's DDF 2.0 spec Rev 19, pages 18/19, shows "RAID-1 Simple
>> Mirroring" vs "RAID-1 Multi-Mirroring"
>
> And DDF came how many years after the original RAID paper, by which
> time everyone understood RAID1 as it was defined there? 1987 vs. ~2003
> or so?
>
> Also SNIA's "standard definition" seems pretty strange, doesn't it?
> They have two RAID1, as you say:
> - "simple mirroring" with n=2
> - "multi mirroring" with n=3
>
> I wouldn't see why the n=2 case is "simpler" than the n=3 case, nor why
> the n=3 case is multi and the n=2 is not (it's also already multiple
> disks).
> Also why did they allow n=3 but not n>=3? If n=4 wouldn't make sense,
> why would n=3, compared to n=2?
>
> Anyway,...
> - the original paper defines it as n mirrored disks
> - Wikipedia handles it like that
> - the already existing major RAID implementation (MD) in the Linux
>   kernel handles it like that
> - LVM's native mirroring allows setting the number of mirrors, i.e. it
>   allows for everything >=2, which is IMHO closer to the common meaning
>   of RAID1 than to btrfs' two-duplicates
>
> So even if there were some reasonable competing definition (and I
> don't think the rather proprietary DDF is very reasonable here), why
> use one that is incompatible with everything we have in Linux?

mdadm supports DDF.

-- 
Chris Murphy
* Re: btrfs
From: Christoph Anton Mitterer @ 2016-06-04 22:43 UTC
To: Chris Murphy
Cc: Austin S Hemmelgarn, Btrfs BTRFS

On Sat, 2016-06-04 at 13:13 -0600, Chris Murphy wrote:
> mdadm supports DDF.

Sure... it also supports IMSM,... so what? Neither of them are the
default for mdadm, nor does it change the used terminology :)

Cheers,
Chris.
* Re: btrfs
From: Chris Murphy @ 2016-06-05 15:51 UTC
To: Christoph Anton Mitterer
Cc: Chris Murphy, Austin S Hemmelgarn, Btrfs BTRFS

On Sat, Jun 4, 2016 at 4:43 PM, Christoph Anton Mitterer <calestyo@scientia.net> wrote:
> On Sat, 2016-06-04 at 13:13 -0600, Chris Murphy wrote:
>> mdadm supports DDF.
>
> Sure... it also supports IMSM,... so what? Neither of them are the
> default for mdadm, nor does it change the used terminology :)

Why is mdadm the reference point for terminology?

There's actually better consistency in terminology usage outside Linux
because of SNIA and DDF than within Linux where the most basic terms
aren't agreed upon by various upstream maintainers. mdadm and lvm use
different terms even though they're both now using the same md backend
in the kernel.

mdadm chunk = lvm segment = btrfs stripe = ddf strip = ddf stripe
element. Some things have no equivalents like the Btrfs chunk. But
someone hears chunk and they wonder if it's the same thing as the mdadm
chunk but it isn't, and actually Btrfs also uses the term block group
for chunk, because...

So if you want to create a decoder ring for terminology that's great
and would be useful; but just asking everyone in Btrfs land to come up
with Btrfs terminology 2.0 merely adds to the list of inconsistent term
usage, it doesn't actually fix any problems.

-- 
Chris Murphy
* Re: btrfs 2016-06-05 15:51 ` btrfs Chris Murphy @ 2016-06-05 20:39 ` Christoph Anton Mitterer 0 siblings, 0 replies; 25+ messages in thread From: Christoph Anton Mitterer @ 2016-06-05 20:39 UTC (permalink / raw) To: Btrfs BTRFS [-- Attachment #1: Type: text/plain, Size: 1682 bytes --] On Sun, 2016-06-05 at 09:51 -0600, Chris Murphy wrote: > Why is mdadm the reference point for terminology? I haven't said it is,... I just said it mdadm, original paper, WP use it the common/historic way. And since all of these were there before btrfs, and in the case of mdadm/MD "in" the kernel,... one should probably try to follow that, if possible. > There's actually better consistency in terminology usage outside > Linux > because of SNIA and DDF than within Linux where the most basic terms > aren't agreed upon by various upstream maintainers. Does anyone in the Linux world really care much about DDF? Even outside? ;-) Seriously,... as I tried to show in one of my previous posts, I think the terminology of DDF, at least WRT RAID1 is a bit awkward. > mdadm and lvm use > different terms even though they're both now using the same md > backend > in the kernel. Depending on whether one choose to use "raid1" and "mirror" segment types.... Anyway,... I think that discussion gets a bit pointless,... I think it's clear that the current terminology may easily cause confusion, and I think for a term like "RAID1", which is a artificial name it's something completely else as for terms like "stripe", "chunk", etc., which are rather common terms and where one must expect that they are used for different things in different areas. And as I've said just before... the other points on my bucket list, like the UUID collision (security) issues, the no checksumming with nodatacow, etc. deserve IMHO much more attention than the terminology :) So I'm kinda out of this specific part of the discussion. Cheers, Chris. 
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: btrfs
  2016-06-04 17:00 ` btrfs Chris Murphy
  2016-06-04 17:37 ` btrfs Christoph Anton Mitterer
@ 2016-06-04 21:18 ` Andrei Borzenkov
  1 sibling, 0 replies; 25+ messages in thread
From: Andrei Borzenkov @ 2016-06-04 21:18 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Christoph Anton Mitterer, Austin S Hemmelgarn, Btrfs BTRFS

04.06.2016 20:00, Chris Murphy wrote:
> On Sat, Jun 4, 2016 at 1:24 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
>> 04.06.2016 04:51, Christoph Anton Mitterer wrote:
>> ...
>>>
>>>> The only extant systems that support higher levels of replication
>>>> and call it RAID-1 are entirely based on MD RAID and its poor
>>>> choice of naming.
>>>
>>> Not true either, show me any single hardware RAID controller that does
>>> RAID1 in a dup2 fashion... I manage some >2PiB of storage at the
>>> faculty, all controllers we have handle RAID1 in the sense of "all
>>> disks mirrored".
>>>
>>
>> Out of curiosity - which model of hardware controllers? Those I am aware
>> of simply won't let you create RAID1 if more than 2 disks are selected.
>
> SNIA's DDF 2.0 spec Rev 19
> page 18/19 shows "RAID-1 Simple Mirroring" vs "RAID-1 Multi-Mirroring"
>

The question was about hardware that implements it.

^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: btrfs 2016-06-04 1:51 ` btrfs Christoph Anton Mitterer 2016-06-04 7:24 ` btrfs Andrei Borzenkov @ 2016-06-05 20:39 ` Henk Slager 2016-06-05 20:56 ` btrfs Christoph Anton Mitterer 2016-06-06 0:56 ` btrfs Chris Murphy 2016-06-06 13:04 ` btrfs Austin S. Hemmelgarn 3 siblings, 1 reply; 25+ messages in thread From: Henk Slager @ 2016-06-05 20:39 UTC (permalink / raw) To: Christoph Anton Mitterer; +Cc: linux-btrfs >> > - OTOH, defrag seems to be viable for important use cases (VM >> > images, >> > DBs,... everything where large files are internally re-written >> > randomly). >> > Sure there is nodatacow, but with that one effectively completely >> > looses one of the core features/promises of btrfs (integrity by >> > checksumming)... and as I've showed in an earlier large >> > discussion, >> > none of the typical use cases for nodatacow has any high-level >> > checksumming, and even if, it's not used per default, or doesn't >> > give >> > the same benefits at it would on the fs level, like using it for >> > RAID >> > recovery). >> The argument of nodatacow being viable for anything is a pretty >> significant secondary discussion that is itself entirely orthogonal >> to >> the point you appear to be trying to make here. > > Well the point here was: > - many people (including myself) like btrfs, it's > (promised/future/current) features > - it's intended as a general purpose fs > - this includes the case of having such file/IO patterns as e.g. for VM > images or DBs > - this is currently not really doable without loosing one of the > promises (integrity) > > So the point I'm trying to make: > People do probably not care so much whether their VM image/etc. is > COWed or not, snapshots/etc. still work with that,... but they may > likely care if the integrity feature is lost. > So IMHO, nodatacow + checksumming deserves to be amongst the top > priorities. Have you tried blockdevice/HDD caching like bcache or dmcache in combination with VMs on BTRFS? 
Or ZVOL for VMs in ZFS with L2ARC? I assume the primary reason for wanting nodatacow + checksumming is to avoid long seektimes on HDDs due to growing fragmentation of the VM images over time. But even if you have nodatacow + checksumming implemented, it is then still HDD access and a VM imagefile itself is not guaranteed to be continuous. It is clear that for VM images the amount of extents will be large over time (like 50k or so, autodefrag on), but with a modern SSD used as cache, it doesn't matter. It is still way faster than just HDD(s), even with freshly copied image with <100 extents. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: btrfs 2016-06-05 20:39 ` btrfs Henk Slager @ 2016-06-05 20:56 ` Christoph Anton Mitterer 2016-06-05 21:07 ` btrfs Hugo Mills 0 siblings, 1 reply; 25+ messages in thread From: Christoph Anton Mitterer @ 2016-06-05 20:56 UTC (permalink / raw) To: Henk Slager; +Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 2396 bytes --] On Sun, 2016-06-05 at 22:39 +0200, Henk Slager wrote: > > So the point I'm trying to make: > > People do probably not care so much whether their VM image/etc. is > > COWed or not, snapshots/etc. still work with that,... but they may > > likely care if the integrity feature is lost. > > So IMHO, nodatacow + checksumming deserves to be amongst the top > > priorities. > Have you tried blockdevice/HDD caching like bcache or dmcache in > combination with VMs on BTRFS? No yet,... my personal use case is just some VMs on the notebook, and for this, the above would seem a bit overkill. For the larger VM cluster at the institute,... puh to be honest I don't know by hard what we do there. > Or ZVOL for VMs in ZFS with L2ARC? Well but all this is an alternative solution,... > I assume the primary reason for wanting nodatacow + checksumming is > to > avoid long seektimes on HDDs due to growing fragmentation of the VM > images over time. Well the primary reason is wanting to have overall checksumming in the fs, regardless of which features one uses. I think we already have some situations where tools use/set btrfs features by themselves (i.e. automatically)... wasn't systemd creating subvols per default in some locations, when there's btrfs? So it's no big step to postgresql/etc. setting nodatacow, making people loose integrity without them even knowing. Of course, avoiding the fragmentation is the reason for the desire to have nodatacow. > But even if you have nodatacow + checksumming > implemented, it is then still HDD access and a VM imagefile itself is > not guaranteed to be continuous. Uhm... 
sure, but that's no difference to other filesystems?! > It is clear that for VM images the amount of extents will be large > over time (like 50k or so, autodefrag on), Wasn't it said, that autodefrag performs bad for anything larger than ~1G? > but with a modern SSD used > as cache, it doesn't matter. It is still way faster than just HDD(s), > even with freshly copied image with <100 extents. Well the fragmentation has also many other consequences and not just seeks (assuming everyone would use SSDs, which is and probably won't be the case for quite a while). Most obviously you get much more IOPS and btrfs itself will, AFAIU, also suffer from some issues due to the fragmentation. Cheers, Chris. [-- Attachment #2: smime.p7s --] [-- Type: application/x-pkcs7-signature, Size: 5930 bytes --] ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: btrfs 2016-06-05 20:56 ` btrfs Christoph Anton Mitterer @ 2016-06-05 21:07 ` Hugo Mills 2016-06-05 21:31 ` btrfs Christoph Anton Mitterer 0 siblings, 1 reply; 25+ messages in thread From: Hugo Mills @ 2016-06-05 21:07 UTC (permalink / raw) To: Christoph Anton Mitterer; +Cc: Henk Slager, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 3726 bytes --] On Sun, Jun 05, 2016 at 10:56:45PM +0200, Christoph Anton Mitterer wrote: > On Sun, 2016-06-05 at 22:39 +0200, Henk Slager wrote: > > > So the point I'm trying to make: > > > People do probably not care so much whether their VM image/etc. is > > > COWed or not, snapshots/etc. still work with that,... but they may > > > likely care if the integrity feature is lost. > > > So IMHO, nodatacow + checksumming deserves to be amongst the top > > > priorities. > > Have you tried blockdevice/HDD caching like bcache or dmcache in > > combination with VMs on BTRFS? > No yet,... my personal use case is just some VMs on the notebook, and > for this, the above would seem a bit overkill. > For the larger VM cluster at the institute,... puh to be honest I don't > know by hard what we do there. > > > > Or ZVOL for VMs in ZFS with L2ARC? > Well but all this is an alternative solution,... > > > > I assume the primary reason for wanting nodatacow + checksumming is > > to > > avoid long seektimes on HDDs due to growing fragmentation of the VM > > images over time. > Well the primary reason is wanting to have overall checksumming in the > fs, regardless of which features one uses. The problem is that you can't guarantee consistency with nodatacow+checksums. If you have nodatacow, then data is overwritten, in place. If you do that, then you can't have a fully consistent checksum -- there are always race conditions between the checksum and the data being written (or the data and the checksum, depending on which way round you do it). > I think we already have some situations where tools use/set btrfs > features by themselves (i.e. 
automatically)... wasn't systemd creating > subvols per default in some locations, when there's btrfs? > So it's no big step to postgresql/etc. setting nodatacow, making people > loose integrity without them even knowing. > > Of course, avoiding the fragmentation is the reason for the desire to > have nodatacow. > > > > But even if you have nodatacow + checksumming > > implemented, it is then still HDD access and a VM imagefile itself is > > not guaranteed to be continuous. > Uhm... sure, but that's no difference to other filesystems?! > > > > It is clear that for VM images the amount of extents will be large > > over time (like 50k or so, autodefrag on), > Wasn't it said, that autodefrag performs bad for anything larger than > ~1G? I don't recall ever seeing someone saying that. Of course, I may have forgotten seeing it... > > but with a modern SSD used > > as cache, it doesn't matter. It is still way faster than just HDD(s), > > even with freshly copied image with <100 extents. > Well the fragmentation has also many other consequences and not just > seeks (assuming everyone would use SSDs, which is and probably won't be > the case for quite a while). > Most obviously you get much more IOPS and btrfs itself will, AFAIU, > also suffer from some issues due to the fragmentation. This is a fundamental problem with all CoW filesystems. There are some mititgations that can be put in place (true CoW rather than btrfs's redirect-on-write, like some databases do, where the original data is copied elsewhere before overwriting; cache aggressively and with knowledge of the CoW nature of the FS, like ZFS does), but they all have their drawbacks and pathological cases. Hugo. -- Hugo Mills | How do you become King? You stand in the marketplace hugo@... carfax.org.uk | and announce you're going to tax everyone. If you http://carfax.org.uk/ | get out alive, you're King. 
PGP: E2AB1DE4 | Harry Harrison [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: btrfs 2016-06-05 21:07 ` btrfs Hugo Mills @ 2016-06-05 21:31 ` Christoph Anton Mitterer 2016-06-05 23:39 ` btrfs Chris Murphy 2016-06-08 6:13 ` btrfs Duncan 0 siblings, 2 replies; 25+ messages in thread From: Christoph Anton Mitterer @ 2016-06-05 21:31 UTC (permalink / raw) To: Hugo Mills; +Cc: Henk Slager, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 3516 bytes --] On Sun, 2016-06-05 at 21:07 +0000, Hugo Mills wrote: > The problem is that you can't guarantee consistency with > nodatacow+checksums. If you have nodatacow, then data is overwritten, > in place. If you do that, then you can't have a fully consistent > checksum -- there are always race conditions between the checksum and > the data being written (or the data and the checksum, depending on > which way round you do it). I'm not an expert in the btrfs internals... but I had a pretty long discussion back then when I brought this up first, and everything that came out of that - to my understanding - indicated, that it should be simply possible. a) nodatacow just means "no data cow", but not "no meta data cow". And isn't the checksumming data meda data? So AFAIU, this is itself anyway COWed. b) What you refer to above is, AFAIU, that data may be written (not COWed) and there is of course no guarantee that the written data matches the checksum (which may e.g. still be the old sum). => So what? This anyway only happens in case of crash/etc. and in that case we anyway have no idea, whether the written not COWed block is consistent or not, whether we do checksumming or not. We rather get the benefit that we now know: it may be garbage The only "bad" thing that could happen was: the block is fully written and actually consistent, but the checksum hasn't been written yet - IMHO much less likely than the other case(s). And I rather get one false positive in an more unlikely case, than corrupted blocks in all other possible situations (silent block errors, etc. pp.) 
And in principle, nothing would prevent a future btrfs from getting a
journal for the nodatacow-ed writes.

Look for the past thread "dear developers, can we have notdatacow +
checksumming, plz?",... I think I wrote about many more cases there,
and why - even if it may not be as perfect as datacow+checksumming - it
would always still be better to have checksumming with nodatacow.

> > Wasn't it said, that autodefrag performs bad for anything larger
> > than ~1G?
>
>    I don't recall ever seeing someone saying that. Of course, I may
> have forgotten seeing it...
I think it was mentioned below this thread:
http://thread.gmane.org/gmane.comp.file-systems.btrfs/50444/focus=50586
and also implied here:
http://article.gmane.org/gmane.comp.file-systems.btrfs/51399/match=autodefrag+large+files

> > Well the fragmentation has also many other consequences and not
> > just seeks (assuming everyone would use SSDs, which is and probably
> > won't be the case for quite a while).
> > Most obviously you get much more IOPS and btrfs itself will, AFAIU,
> > also suffer from some issues due to the fragmentation.
>    This is a fundamental problem with all CoW filesystems. There are
> some mitigations that can be put in place (true CoW rather than
> btrfs's redirect-on-write, like some databases do, where the original
> data is copied elsewhere before overwriting; cache aggressively and
> with knowledge of the CoW nature of the FS, like ZFS does), but they
> all have their drawbacks and pathological cases.
Sure... but defrag (if it would generally work) or nodatacow (if it
wouldn't make you lose the ability to determine whether you're
consistent or not) would be already quite helpful here.

Cheers,
Chris.

^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: btrfs 2016-06-05 21:31 ` btrfs Christoph Anton Mitterer @ 2016-06-05 23:39 ` Chris Murphy 2016-06-08 6:13 ` btrfs Duncan 1 sibling, 0 replies; 25+ messages in thread From: Chris Murphy @ 2016-06-05 23:39 UTC (permalink / raw) To: Christoph Anton Mitterer; +Cc: Hugo Mills, Henk Slager, linux-btrfs On Sun, Jun 5, 2016 at 3:31 PM, Christoph Anton Mitterer <calestyo@scientia.net> wrote: > On Sun, 2016-06-05 at 21:07 +0000, Hugo Mills wrote: >> The problem is that you can't guarantee consistency with >> nodatacow+checksums. If you have nodatacow, then data is overwritten, >> in place. If you do that, then you can't have a fully consistent >> checksum -- there are always race conditions between the checksum and >> the data being written (or the data and the checksum, depending on >> which way round you do it). > > I'm not an expert in the btrfs internals... but I had a pretty long > discussion back then when I brought this up first, and everything that > came out of that - to my understanding - indicated, that it should be > simply possible. > > a) nodatacow just means "no data cow", but not "no meta data cow". > And isn't the checksumming data meda data? So AFAIU, this is itself > anyway COWed. > b) What you refer to above is, AFAIU, that data may be written (not > COWed) and there is of course no guarantee that the written data > matches the checksum (which may e.g. still be the old sum). > => So what? For a file like a VM image constantly being modified, essentially at no time will the csums on disk ever reflect the state of the file. > This anyway only happens in case of crash/etc. and in that case > we anyway have no idea, whether the written not COWed block is > consistent or not, whether we do checksumming or not. If the file is cow'd and checksummed, and there's a crash, there is supposed to be consistency: either the old state or new state for the data is on-disk and the current valid metadata correctly describes which state that data is in. 
If the file is not cow'd and not checksummed, its consistency is
unknown but also ignored, when doing normal reads, balance or scrubs.

If the file is not cow'd but were checksummed, there would always be
some inconsistency if the file is actively being modified. Only when
it's not being modified, and metadata writes for that file are
committed to disk and the superblock updated, is there consistency. At
any other time, there's inconsistency. So if there's a crash, a balance
or scrub or normal read will say the file is corrupt. And the normal
way Btrfs deals with corruption on reads from a mounted fs is to
complain and it does not pass the corrupt data to user space, instead
there's an i/o error. You have to use restore to scrape it off the
volume; or alternatively use btrfsck to recompute checksums.

Presumably you'd ask for an exception for this kind of file, where it
can still be read even though there's a checksum mismatch, can be
scrubbed and balanced which will report there's corruption even if
there isn't any, and you've gained, insofar as I can tell, a lot of
confusion and ambiguity.

It's fine you want a change in behavior for Btrfs. But when a developer
responds, more than once, about how this is somewhere between difficult
and not possible, and you say it should simply be possible, I think
that's annoying, bordering on irritating.

--
Chris Murphy

^ permalink raw reply [flat|nested] 25+ messages in thread
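[Illustrative sketch, not part of the thread] The ordering problem described in the message above can be modelled in a few lines of Python. This is a toy under stated assumptions, not btrfs code: the InPlaceStore class, its method names and the 4096-byte block are hypothetical, and zlib.crc32 merely stands in for the crc32c that btrfs uses. It shows that when data is overwritten in place and the checksum is stored separately, a crash between the two writes leaves a mismatch whichever write goes first.

```python
import zlib


def checksum(block: bytes) -> int:
    # Assumption: zlib.crc32 stands in for btrfs's crc32c, only because it
    # ships with the standard library.
    return zlib.crc32(block)


class InPlaceStore:
    """Hypothetical model: one data block overwritten in place (nodatacow),
    with its checksum kept in a separate location."""

    def __init__(self, data: bytes) -> None:
        self.data = data
        self.csum = checksum(data)

    def overwrite(self, new_data, order=("data", "csum"), crash_after=None):
        # The two updates are separate writes; a crash can land between them.
        for step_number, step in enumerate(order, start=1):
            if step == "data":
                self.data = new_data
            else:
                self.csum = checksum(new_data)
            if crash_after == step_number:
                return  # simulated crash: the remaining step never happens

    def verify(self) -> bool:
        return checksum(self.data) == self.csum


OLD = b"\x00" * 4096
NEW = b"\x01" + b"\x00" * 4095

# Overwrite that completes: data and checksum agree again.
clean = InPlaceStore(OLD)
clean.overwrite(NEW)

# Crash after the data write: the new data is intact, the checksum is stale.
data_first = InPlaceStore(OLD)
data_first.overwrite(NEW, order=("data", "csum"), crash_after=1)

# Crash after the checksum write: the old data is intact, the checksum is new.
csum_first = InPlaceStore(OLD)
csum_first.overwrite(NEW, order=("csum", "data"), crash_after=1)

print(clean.verify(), data_first.verify(), csum_first.verify())
```

In this model only the completed overwrite verifies. Both interrupted orderings report a mismatch, including the data-first case where the block holds the complete new data, which is the false positive discussed earlier in the thread. Copy-on-write or a journal avoids this by never exposing the intermediate state.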
* Re: btrfs 2016-06-05 21:31 ` btrfs Christoph Anton Mitterer 2016-06-05 23:39 ` btrfs Chris Murphy @ 2016-06-08 6:13 ` Duncan 1 sibling, 0 replies; 25+ messages in thread From: Duncan @ 2016-06-08 6:13 UTC (permalink / raw) To: linux-btrfs Christoph Anton Mitterer posted on Sun, 05 Jun 2016 23:31:57 +0200 as excerpted: >> > Wasn't it said, that autodefrag performs bad for anything larger than >> > ~1G? >> >> I don't recall ever seeing someone saying that. Of course, I may >> have forgotten seeing it... > I think it was mentioned below this thread: > http://thread.gmane.org/gmane.comp.file-systems.btrfs/50444/focus=50586 > and also implied here: > http://article.gmane.org/gmane.comp.file-systems.btrfs/51399/match=autodefrag+large+files Yes. I was rather surprised to see Hugo say he doesn't recall seeing anyone state that autodefrag performs poorly on large (from half gig) files, and that its primary recommended use is for smaller database files such as the typical quarter-gig or smaller sqlite files created by firefox and various mail clients (thunderbird, evolution). Because I've both seen and repeated that many times, myself, and indeed, the wiki's mount options page used to say effectively that. And actually, looking at the history of the page, it was Hugo that deleted the wording to the effect that autodefrag didn't work well on large database or VM files.. https://btrfs.wiki.kernel.org/index.php?title=Mount_options&diff=29268&oldid=28191 So if he doesn't remember it... But perhaps Hugo read it as manual defrag, not autodefrag, as I don't remember manual defrag ever being associated with that problem (tho it did and does still have the reflinks/snapshots problem, but that's a totally different issue). Meanwhile, it's news to me that autodefrag doesn't have that problem any longer... -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." 
Richard Stallman ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: btrfs 2016-06-04 1:51 ` btrfs Christoph Anton Mitterer 2016-06-04 7:24 ` btrfs Andrei Borzenkov 2016-06-05 20:39 ` btrfs Henk Slager @ 2016-06-06 0:56 ` Chris Murphy 2016-06-06 13:04 ` btrfs Austin S. Hemmelgarn 3 siblings, 0 replies; 25+ messages in thread From: Chris Murphy @ 2016-06-06 0:56 UTC (permalink / raw) To: Christoph Anton Mitterer; +Cc: Austin S Hemmelgarn, Btrfs BTRFS On Fri, Jun 3, 2016 at 7:51 PM, Christoph Anton Mitterer <calestyo@scientia.net> wrote: > I think I remember that you've claimed that last time already, and as > I've said back then: > - what counts is probably the common understanding of the term, which > is N disks RAID1 = N disks mirrored > - if there is something like an "official definition", it's probably > the original paper that introduced RAID: > http://www.eecs.berkeley.edu/Pubs/TechRpts/1987/CSD-87-391.pdf > PDF page 11, respectively content page 9 describes RAID1 as: > "This is the most expensive option since *all* disks are > duplicated..." You've misread the paper. It defines what it means by "all disks are duplicated" as G=1 and C=1. That is, every data disk has one check disk. That is, two copies. There is no mention of n-copies. Further in table 2 "Characteristics of Level 1 RAID" the overhead is described as 100%, and the usable storage capacity is 50%. Again, that is consistent with duplication. The definition of duplicate is "one of two or more identical things." The etymology of duplicate is "1400-50; late Middle English < Latin duplicātus (past participle of duplicāre to make double), equivalent to duplic- (stem of duplex) duplex + -ātus -ate1 http://www.dictionary.com/browse/duplicate There is no possible reading of this that suggests n-way RAID is intended. -- Chris Murphy ^ permalink raw reply [flat|nested] 25+ messages in thread
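[Illustrative sketch, not part of the thread] The figures quoted from the paper in the message above can be checked with simple arithmetic. The sketch assumes that overhead is the number of check disks divided by the number of data disks in a group, and that usable capacity is the data disks' share of all disks; the function names are made up for this example. The G=1, C=2 row is a hypothetical 3-way mirror added only for contrast and does not come from the paper.

```python
def overhead(data_disks: int, check_disks: int) -> float:
    # Assumed definition: check disks per data disk in a group (C / G).
    return check_disks / data_disks


def usable_fraction(data_disks: int, check_disks: int) -> float:
    # Assumed definition: share of raw capacity that holds user data.
    return data_disks / (data_disks + check_disks)


cases = [
    ("G=1, C=1 (two copies)", 1, 1),
    ("G=1, C=2 (hypothetical 3-way mirror)", 1, 2),
]
for label, g, c in cases:
    print(f"{label}: overhead {overhead(g, c):.0%}, "
          f"usable {usable_fraction(g, c):.0%}")
```

With G=1 and C=1 this gives 100% overhead and 50% usable capacity, matching the numbers cited above; a third copy would give 200% and roughly 33%.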
* Re: btrfs 2016-06-04 1:51 ` btrfs Christoph Anton Mitterer ` (2 preceding siblings ...) 2016-06-06 0:56 ` btrfs Chris Murphy @ 2016-06-06 13:04 ` Austin S. Hemmelgarn 3 siblings, 0 replies; 25+ messages in thread From: Austin S. Hemmelgarn @ 2016-06-06 13:04 UTC (permalink / raw) To: Christoph Anton Mitterer, linux-btrfs On 2016-06-03 21:51, Christoph Anton Mitterer wrote: > On Fri, 2016-06-03 at 15:50 -0400, Austin S Hemmelgarn wrote: >> There's no point in trying to do higher parity levels if we can't get >> regular parity working correctly. Given the current state of things, >> it might be better to break even and just rewrite the whole parity >> raid thing from scratch, but I doubt that anybody is willing to do >> that. > > Well... as I've said, things are pretty worrying. Obviously I cannot > really judge, since I'm not into btrfs' development... maybe there's a > lack of manpower? Since btrfs seems to be a very important part (i.e. > next-gen fs), wouldn't it be possible to either get some additional > funding by the Linux Foundation, or possible that some of the core > developers make an open call for funding by companies? > Having some additional people, perhaps working fulltime on it, may be a > big help. > > As for the RAID... given how many time/effort is spent now into 5/6,.. > it really seems that one should have considered multi-parity from the > beginning on. > Kinda feels like either, with multi-parity this whole instability phase > would start again, or it will simply never happen. New features will always cause some instability, period, there is no way to avoid that. > > >>> - Serious show-stoppers and security deficiencies like the UUID >>> collision corruptions/attacks that have been extensively >>> discussed >>> earlier, are still open >> The UUID issue is not a BTRFS specific one, it just happens to be >> easier >> to cause issues with it on BTRFS > > uhm this had been discussed extensively before, as I've said... 
AFAICS > btrfs is the only system we have, that can possibly cause data > corruption or even security breach by UUID collisions. > I wouldn't know that other fs, or LVM are affected, these just continue > to use those devices already "online"... and I think lvm refuses to > activate VGs, if conflicting UUIDs are found. If you are mounting by UUID, it is entirely non-deterministic which filesystem with that UUID will be mounted (because device enumeration is non-deterministic). As far as LVM, it refuses activating VG's, but it can still have issues if you have LV's with the same UUID (which can be done pretty trivially), and the fact that it refuses to activate them technically constitutes a DoS attack (because you can't use the resources). > > >> There is no way to solve it sanely given the requirement that >> userspace >> not be broken. > No this is not true. Back when this was discussed, I and others > described how it could/should be done,... respectively how > userspace/kernel should behave, in short: > - continue using those devices that are already active This is easy, but only works for mounted filesystems. > - refusing to (auto)assemble by UUID, if there are conflicts > or requiring to specify the devices (with some --override-yes-i-know- > what-i-do option option or so) > - in case of assembling/rebuilding/similar... never doing this > automatically These two allow anyone with the ability to plug in a USB device to DoS the system. > > I think there were some more corner cases, I basically had them all > discussed in the thread back then (search for "attacking btrfs > filesystems via UUID collisions?" and IIRC some different titled parent > or child threads). > > >> Properly fixing this would likely make us more dependent >> on hardware configuration than even mounting by device name. > Sure, if there are colliding UUIDs, and one still wants to mount (by > using some --override-yes-i-know-what-i-do option),.. 
it would need to > be by specifying the device name... > But where's the problem? > This would anyway only happen if someone either attacks or someone made > a clone, and it's far better to refuse automatic assembly in cases > where accidental corruption can happen or where attacks may be > possible, requiring the user/admin to manually take action, than having > corruption or security breach. Refusing automatic assembly does not prevent the attack, it simply converts it from a data corruption attack to a DoS attack. > > Imagine the simple case: degraded RAID1 on a PC; if btrfs would do some > auto-rebuild based on UUID, then if an attacker knows that he'd just > need to plug in a USB disk with a fitting UUID...and easily gets a copy > of everything on disk, gpg keys, ssh keys, etc. If the attacker has physical access to the machine, it's irrelevant even with such protection, as there are all kinds of other things that could be done to get data off of the disk (especially if the system has thunderbolt ports or USB C ports). If the user has any unsecured encryption or authentication tokens on the system, they're screwed anyway though. > >>> - a number of important core features not fully working in many >>> situations (e.g. the issues with defrag, not being ref-link >>> aware,... >>> an I vaguely remember similar things with compression). >> OK, how then should defrag handle reflinks? Preserving them prevents >> it >> from being able to completely defragment data. > Didn't that even work in the past and had just some performance issues? Most of it was scaling issues, but unless you have some solution to handle it correctly, there's no point in complaining about it. And my point about defragmentation with reflinks still stands. > > >>> - OTOH, defrag seems to be viable for important use cases (VM >>> images, >>> DBs,... everything where large files are internally re-written >>> randomly). 
>>>   Sure there is nodatacow, but with that one effectively completely
>>>   loses one of the core features/promises of btrfs (integrity by
>>>   checksumming)... and as I've shown in an earlier large discussion,
>>>   none of the typical use cases for nodatacow has any high-level
>>>   checksumming, and even if, it's not used per default, or doesn't
>>>   give the same benefits as it would on the fs level, like using it
>>>   for RAID recovery).
>> The argument of nodatacow being viable for anything is a pretty
>> significant secondary discussion that is itself entirely orthogonal
>> to the point you appear to be trying to make here.
>
> Well the point here was:
> - many people (including myself) like btrfs, its
>   (promised/future/current) features
> - it's intended as a general purpose fs
> - this includes the case of having such file/IO patterns as e.g. for VM
>   images or DBs
> - this is currently not really doable without losing one of the
>   promises (integrity)
>
> So the point I'm trying to make:
> People do probably not care so much whether their VM image/etc. is
> COWed or not, snapshots/etc. still work with that,... but they may
> likely care if the integrity feature is lost.
> So IMHO, nodatacow + checksumming deserves to be amongst the top
> priorities.

You're not thinking from a programming perspective. There is no way to
force atomic updates of data in chunks bigger than the sector size on a
block storage device. Without that ability, there is no way to ensure
that the checksum for a data block and the data block itself are either
both written or neither written unless you either use COW or some form
of journaling.
>
>
>>> - still no real RAID 1
>> No, you mean still no higher order replication. I know I'm being
>> stubborn about this, but RAID-1 is officially defined in the
>> standards as 2-way replication.
> I think I remember that you've claimed that last time already, and as > I've said back then: > - what counts is probably the common understanding of the term, which > is N disks RAID1 = N disks mirrored > - if there is something like an "official definition", it's probably > the original paper that introduced RAID: > http://www.eecs.berkeley.edu/Pubs/TechRpts/1987/CSD-87-391.pdf > PDF page 11, respectively content page 9 describes RAID1 as: > "This is the most expensive option since *all* disks are > duplicated..." > > >> The only extant systems that support higher >> levels of replication and call it RAID-1 are entirely based on MD >> RAID >> and it's poor choice of naming. > > Not true either, show me any single hardware RAID controller that does > RAID1 in a dup2 fashion... I manage some >2PiB of storage at the > faculty, all controller we have, handle RAID1 in the sense of "all > disks mirrored". Exact specs, please. While I don't manage data on anywhere near that scale, I have seen hundreds of different models of RAID controllers over the years, and have yet to see one that is an actual hardware implementation that supports creating a RAID1 configuration with more than two disks. As far as controllers that I've seen that do RAID-1 solely as 2 way replication: * Every single Dell branded controller I've dealt with, including recent SAS3 based ones (pretty sure most of these are LSI Logic devices) * Every single Marvell based controller I've dealt with. * All of the Adaptec and LSI Logic controllers I've dealt with (although most of these I've dealt with are older devices). * All of the HighPoint controllers I've dealt with. * The few non-Marvell based Areca controllers I've dealt with. > > >>> - no end-user/admin grade maangement/analysis tools, that tell non- >>> experts about the state/health of their fs, and whether things >>> like >>> balance etc.pp. are necessary >> I don't see anyone forthcoming with such tools either. 
As far as basic monitoring, it's trivial to do with simple scripts from tools like monit or nagios.
>
> AFAIU, even that isn't really possible right now, is it?

There's a limit to what you can do with this, but you can definitely
check things like error counts from normal operation and scrubs, notify
when the filesystem goes degraded, and other basic things that most
people expect out of system monitoring.

In my particular case, what I'm doing is:
1. Run scrub from a cronjob daily (none of my filesystems are big
   enough for this to take more than an hour)
2. From monit, check the return code of 'btrfs scrub status' at some
   point early in the morning after the scrub finishes, if it returns
   non-zero, there were errors during the scrub.
3. Have monit poll filesystem flags every cycle (in my case, every
   minute). If it sees these change, the filesystem had some issue.
4. Parse the output of 'btrfs device stats' to check for recorded
   errors and send an alert under various cases (checking whole system
   aggregates of each type, and per-filesystem aggregates of all types,
   and flagging when it's above a certain threshold).
5. Run an hourly filtered balance with -dusage=50 -dlimit=2 -musage=50
   -mlimit=3 to clean up partially used chunks.
6. If any of these have issues, I get an e-mail from the system (and
   because of how I set that up, that works even if none of the
   persistent storage on the system is working correctly).

Note that this is just the BTRFS specific things, and doesn't include
SMART checks, low-level LVM verification, and other similar things.

> Take RAID again,... there is no place where you can see whether the
> RAID state is "optimal", or does that exist in the meantime? Last time,
> people were advised to look at the kernel logs, but this is no proper
> way to check for the state... logging may simply be deactivated, or you
> may have an offline fs, for which the logs have been lost because they
> were on another disk.
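[Illustrative sketch, not part of the thread] Step 4 of the monitoring list above (parsing 'btrfs device stats' and alerting on a threshold) could look roughly like the following. This is a minimal sketch, not the poster's actual script: the function names and the sample text are made up, and the parser assumes the usual one-counter-per-line layout of the form "[device].counter value". The read_device_stats helper is defined but not called here, since it needs a mounted btrfs filesystem and usually root.

```python
import re
import subprocess
from collections import defaultdict

# Assumed line layout: "[/dev/sdX].counter_name   <integer>"
STAT_LINE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$")


def read_device_stats(mountpoint: str) -> str:
    # Runs the real command; not exercised in this example.
    result = subprocess.run(
        ["btrfs", "device", "stats", mountpoint],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_device_stats(text: str) -> dict:
    """Return {device: {counter: value}} for every line that matches."""
    stats = defaultdict(dict)
    for line in text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match:
            stats[match["dev"]][match["counter"]] = int(match["value"])
    return dict(stats)


def devices_over_threshold(stats: dict, threshold: int = 0) -> dict:
    """Return {device: total_errors} for devices above the threshold."""
    totals = {dev: sum(counters.values()) for dev, counters in stats.items()}
    return {dev: total for dev, total in totals.items() if total > threshold}


# Made-up sample text in the assumed layout, for demonstration only.
SAMPLE_OUTPUT = """\
[/dev/sda].write_io_errs    0
[/dev/sda].read_io_errs     0
[/dev/sda].flush_io_errs    0
[/dev/sda].corruption_errs  0
[/dev/sda].generation_errs  0
[/dev/sdb].write_io_errs    0
[/dev/sdb].read_io_errs     2
[/dev/sdb].flush_io_errs    0
[/dev/sdb].corruption_errs  3
[/dev/sdb].generation_errs  0
"""

stats = parse_device_stats(SAMPLE_OUTPUT)
print(devices_over_threshold(stats))
```

A monitoring job would feed read_device_stats(mountpoint) into parse_device_stats and send a notification whenever devices_over_threshold returns a non-empty result. Summing all counters per device is just one possible aggregation; per-counter thresholds would work the same way.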
Unless you have a modified kernel or are using raid5/6, the filesystem
will go read-only when degraded.  You can poll the filesystem flags to
verify this (although it's better to poll and check whether they've
changed, as that can detect other issues too).  Additionally, you can
check device stats, which will show any errors.

>
> Not to talk about the inability to properly determine how often btrfs
> encountered errors, and "silently" corrected it.
> E.g. some statistics about a device, that can be used to decide whether
> it's dying.
> I think these things should be stored in the fs (and additionally also
> on the respective device), where it can also be extracted when no
> /var/log is present or when forensics are done.

'btrfs device stats' will show you running error counts since the last
time they were manually reset (by passing the -z flag to said command).
It's also notably one of the few tools that has output which is easy to
parse programmatically (which is an entirely separate issue).

>
>
>> As far as complex things like determining whether a fs needs
>> balanced, that's really non-trivial to figure out.  Even with a
>> person looking at it, it's still not easy to know whether or not a
>> balance will actually help.
> Well I wouldn't call myself a btrfs expert, but from time to time I've
> been a bit "more active" on the list.
> Even I know about these strange cases (sometimes tricks), like many
> empty data/meta block groups, that may or may not get cleaned up, and
> may result in troubles
> How should the normal user/admin be able to cope with such things if
> there are no good tools?

Empty block groups get deleted automatically these days (I distinctly
remember this going in, because it temporarily broke discard and fstrim
support), so that one is not an issue if they're on a new enough kernel.

As far as what I specifically said, it's still hard to know if a balance
will _help_ or not.
For example, one of the people I was helping on the mailing list
recently had a filesystem which had a bunch of chunks which were
partially allocated, and thus a lot of 'free space' listed in various
tools, but none which were empty.  The only reason this was apparent was
that a balance filtered on usage was failing above a certain threshold
and not balancing anything below that threshold.  Having to test for
such things, and potentially use a lot of disk bandwidth doing so
(especially because the threshold can be pretty high, in this case it
was 67%), is not user friendly any more than not reporting an issue at
all is.

Part of the issue here is that people aren't used to using filesystem
specific tools to check their filesystems.  df is a classic example of
this: it was designed in the 70's and never envisioned some of the cases
we have to deal with in BTRFS.

>
> It starts with simple things like:
> - adding a further disk to a RAID
>   => there should be a tool which tells you: dude, some files are not
>      yet "rebuild"(duplicated),... do a balance or whatever.

Adding a disk should implicitly balance the FS unless you tell it not
to; it was just a poor design choice in the first place to not do it
that way.

>
>
>>> - the still problematic documentation situation
>> Not trying to rationalize this, but go take a look at a majority of
>> other projects, most of them that aren't backed by some huge
>> corporation throwing insane amounts of money at them have at best
>> mediocre end-user documentation.  The fact that more effort is being
>> put into development than documentation is generally a good thing,
>> especially for something that is not yet feature complete like BTRFS.
>
> Uhm.. yes and no...
> The lack of documentation (i.e. admin/end-user-grade documentation)
> also means that people have less understanding in the system, less
> trust, less knowledge on what they can expect/do with it (will Ctrl-C
> on btrfs check work?
> what if I shut down during a balance? does it
> break then? etc. pp.), less will to play with it.

Given the state of BTRFS, that's not a bad thing.  A good administrator
looking into it will do proper testing before using it.  If you aren't
going to properly test something this comparatively new, you probably
shouldn't be just arbitrarily using it without question.

> Further,... if btrfs would reach the state of being "feature complete"
> (if that ever happens, and I don't mean because of slow development,
> but rather, because most other fs shows that development goes "ever"
> on),... there would be *so much* to do in documentation, that it's
> unlikely it will happen.

In this particular case, I use the term 'feature complete' to mean on
par feature wise with most other equivalent software (in this case, near
feature parity with ZFS, as that's really the only significant
competitor in the intended market).  As of right now, the only extant
items other than bugs that would need to be in BTRFS to be feature
complete by this definition are:
1. Quota support
2. Higher-order replication (at a minimum, 3 copies)
3. Higher order parity (at a minimum, 3-level, which is the highest ZFS
   supports right now).
4. Online filesystem checking.
5. In-band deduplication.
6. In-line encryption.
7. Storage tiering (like ZFS's L2ARC, or bcache).

Of these, items 1 and 5 are under active development, 6 would likely not
require much effort for a basic implementation because there's a VFS
level API for it now, and 2 and 3 are stalled pending functional raid5/6
(which is the correct choice, adding them now would make it more
complicated to fix raid5/6).  That means the only ones that don't appear
to be actively on the radar are 4 (which most non-enterprise users
probably don't strictly need) and 7 (which would be nice but would
require significant work for limited benefit given the alternative
options in the block layer itself).

^ permalink raw reply	[flat|nested] 25+ messages in thread
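Step 4 of the monitoring routine in the message above (parsing the output of 'btrfs device stats' and flagging devices above a threshold) could look roughly like the following. This is only a minimal sketch, not the script the poster actually runs: the function names, the threshold default, and the sample text are all made up for illustration, and it assumes the usual one-counter-per-line output of the form `[device].counter_name  value`.

```python
import re
from collections import defaultdict

# Matches one counter line of `btrfs device stats` output, e.g.
#   [/dev/sda1].write_io_errs   0
# (assumed format; adjust if your btrfs-progs prints something different)
STAT_LINE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$")


def parse_device_stats(text):
    """Return {device: {counter_name: count}} parsed from the given text."""
    stats = defaultdict(dict)
    for line in text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match:  # silently skip anything that is not a counter line
            stats[match["dev"]][match["counter"]] = int(match["value"])
    return dict(stats)


def devices_over_threshold(stats, threshold=0):
    """Return {device: total_errors} for devices whose summed counters
    exceed the (hypothetical) alert threshold."""
    totals = {dev: sum(counters.values()) for dev, counters in stats.items()}
    return {dev: total for dev, total in totals.items() if total > threshold}


# Illustrative sample only; the device names and counts are invented.
SAMPLE = """\
[/dev/sda1].write_io_errs   0
[/dev/sda1].read_io_errs    0
[/dev/sda1].flush_io_errs   0
[/dev/sda1].corruption_errs 0
[/dev/sda1].generation_errs 0
[/dev/sdb1].write_io_errs   0
[/dev/sdb1].read_io_errs    2
[/dev/sdb1].flush_io_errs   0
[/dev/sdb1].corruption_errs 3
[/dev/sdb1].generation_errs 0
"""

if __name__ == "__main__":
    for dev, total in devices_over_threshold(parse_device_stats(SAMPLE)).items():
        print(f"ALERT: {dev} has {total} recorded errors")
```

In real use the text would come from running 'btrfs device stats' against a mount point (for example via subprocess) rather than from a string constant, and the alert would go to whatever notification path monit or nagios already provides. Because the counters are cumulative until reset with -z, a threshold of 0 keeps alerting until someone looks at the device and clears them.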
[parent not found: <f4a9ef2f-99a8-bcc4-5a8f-b022914980f0@swiftspirit.co.za>]
* Re: btrfs
       [not found]     ` <f4a9ef2f-99a8-bcc4-5a8f-b022914980f0@swiftspirit.co.za>
@ 2016-06-04  2:13       ` Christoph Anton Mitterer
  2016-06-04  2:36         ` btrfs Chris Murphy
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Anton Mitterer @ 2016-06-04  2:13 UTC (permalink / raw)
  To: Brendan Hide, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 4591 bytes --]

On Sat, 2016-06-04 at 00:22 +0200, Brendan Hide wrote:
> > - RAID5/6 seems far from being stable or even usable,... not to talk
> >   about higher parity levels, whose earlier posted patches (e.g.
> >   http://thread.gmane.org/gmane.linux.kernel/1654735) seem to have
> >   been given up.
> I'm not certain why that patch didn't get any replies, though it
> should also be noted that it was sent to three mailing lists - and
> that btrfs was simply an implementation example. See previous thread
> here: http://thread.gmane.org/gmane.linux.kernel/1622485

Ah... I remembered that one, but just couldn't find it anymore... so
even two efforts already, both seem dead :-(

> I recall reading it and thinking 6 parities is madness - but I
> certainly see how it would be good for future-proofing.

Well I can imagine that scenarios exist in which more than two parities
may be highly desirable...

> > - a number of important core features not fully working in many
> >   situations (e.g. the issues with defrag, not being ref-link
> >   aware,... and I vaguely remember similar things with compression).
> True also. There are various features and situations where btrfs
> does not work as intelligently as expected.

And even worse: Some of these are totally impossible to know for the
average user. => the documentation issue (though at least the defrag
issue is documented now in btrfs-filesystem(8)).

> I class these under the "you're doing it wrong" theme.
> The vast majority of popular database engines have been designed
> without CoW in mind and, unfortunately, one *cannot* simply dump it
> onto a CoW system and expect it to perform well. There is no easy
> answer here.

Well the easy answer is: nodatacow
At least in the sense that it's technically possible; whether it's easy
for the end-user is another matter (the average admin may possibly at
one point read that nodatacow should be done for VMs and DBs, but what
about all the smallish DBs like Firefox sqlites and so on, or simply any
other scenario where such IO patterns happen?).
But the problem with nodatacow is that it implies the loss of
checksumming.

> > - other earlier anticipated features like newer/better compression
> >   or checksum algos seem to be dead either
> Re alternative compression:
> https://btrfs.wiki.kernel.org/index.php/FAQ#Will_btrfs_support_LZ4.3F
> My short version: This is a premature optimisation.
>
> IMO, alternative checksums is also a premature optimisation. An RFC
> for alternative checksums was last looked at by Liu Bo in November
> 2014. A different strategy was proposed as the code didn't make use
> of a pre-existing crypto code in the kernel.

> > - still no real RAID 1
> This depends on what you mean by "real" - and I'm guessing you're
> misled by mdraid's feature to have multiple copies in RAID1 rather
> than just the two. RAID1 by definition is exactly two mirrored
> copies. No more. No less.

See my answer to Austin about the same claim.
Actually I have no idea where it comes from,... even the more
down-to-earth sources like Wikipedia all speak about "mirroring of all
disks", as does the original paper about RAID.

> > - no end-user/admin grade management/analysis tools, that tell
> >   non-experts about the state/health of their fs, and whether things
> >   like balance etc.pp. are necessary
> >
> > - the still problematic documentation situation
> Simple answer: RAID5/6 is not yet recommended for storing data you
> don't mind losing.
> Btrfs is *also* not yet ready for install-and-forget-style system
> administration.

Well the problem with writing good documentation in the "we do it once
it's finished" style is often that it will never happen... or that the
devs themselves don't recall all details.
Also in the meantime so much 3rd party documentation (often outdated as
well) and so many myths come alive that it takes ages to clean all that
up.

> I personally recommend against using btrfs for people who aren't
> familiar with it.

I think it *is* pretty important that many people try/test/play with it,
because that helps stabilisation... but even during that phase,
documentation would be quite important.

If there were e.g. a kept-up-to-date wiki page about the status and
current perils of e.g. RAID5/6, people (like me) wouldn't ask every
week, saving the devs' time.
Plus people wouldn't end up simply trying it, believing it works
already, and then face data loss.

Cheers,
Chris.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5930 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: btrfs
  2016-06-04  2:13       ` btrfs Christoph Anton Mitterer
@ 2016-06-04  2:36         ` Chris Murphy
  0 siblings, 0 replies; 25+ messages in thread
From: Chris Murphy @ 2016-06-04  2:36 UTC (permalink / raw)
  To: Christoph Anton Mitterer; +Cc: Brendan Hide, Btrfs BTRFS

On Fri, Jun 3, 2016 at 8:13 PM, Christoph Anton Mitterer
<calestyo@scientia.net> wrote:

> If there were e.g. a kept-up-to-date wiki page about the status
> and current perils of e.g. RAID5/6, people (like me) wouldn't ask every
> week, saving the devs' time.

Well up until 4.6, there was a rather clear "Btrfs is under heavy
development, and is not suitable for any uses other than benchmarking
and review." statement in kernel documentation.

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/diff/Documentation/filesystems/btrfs.txt?id=v4.6&id2=v4.5

There's no longer such a strongly worded caution in that document, nor
in the wiki.

The wiki has stale information still, but it's a volunteer effort like
everything else Btrfs related.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 25+ messages in thread
Thread overview: 25+ messages
2016-06-01 22:25 raid5/6 production use status? Christoph Anton Mitterer
2016-06-02  9:24 ` Gerald Hopf
2016-06-02  9:35   ` Hugo Mills
2016-06-02 10:03     ` Gerald Hopf
2016-06-03 17:38   ` btrfs (was: raid5/6) production use status (and future)? Christoph Anton Mitterer
2016-06-03 19:50     ` btrfs Austin S Hemmelgarn
2016-06-04  1:51       ` btrfs Christoph Anton Mitterer
2016-06-04  7:24         ` btrfs Andrei Borzenkov
2016-06-04 17:00           ` btrfs Chris Murphy
2016-06-04 17:37             ` btrfs Christoph Anton Mitterer
2016-06-04 19:13               ` btrfs Chris Murphy
2016-06-04 22:43                 ` btrfs Christoph Anton Mitterer
2016-06-05 15:51                   ` btrfs Chris Murphy
2016-06-05 20:39                     ` btrfs Christoph Anton Mitterer
2016-06-04 21:18             ` btrfs Andrei Borzenkov
2016-06-05 20:39         ` btrfs Henk Slager
2016-06-05 20:56           ` btrfs Christoph Anton Mitterer
2016-06-05 21:07             ` btrfs Hugo Mills
2016-06-05 21:31               ` btrfs Christoph Anton Mitterer
2016-06-05 23:39                 ` btrfs Chris Murphy
2016-06-08  6:13                 ` btrfs Duncan
2016-06-06  0:56         ` btrfs Chris Murphy
2016-06-06 13:04         ` btrfs Austin S. Hemmelgarn
     [not found]     ` <f4a9ef2f-99a8-bcc4-5a8f-b022914980f0@swiftspirit.co.za>
2016-06-04  2:13       ` btrfs Christoph Anton Mitterer
2016-06-04  2:36         ` btrfs Chris Murphy