* "-d single" for data blocks on a multiple devices doesn't work as it should @ 2014-06-24 10:42 Gerald Hopf 2014-06-24 11:02 ` Roman Mamedov 2014-06-24 11:45 ` Duncan 0 siblings, 2 replies; 6+ messages in thread From: Gerald Hopf @ 2014-06-24 10:42 UTC (permalink / raw) To: linux-btrfs Dear btrfs-developers, thank you for making such a nice and innovative filesystem. I do have a small complaint however :-) I read the documentation and liked the idea of having a multiple-device filesystem with mirrored metadata while having data in "single" mode. This would be perfect for my backup purposes, where I don't want to have a parity disk - but I also don't want to lose the _entire_ backup in the worst case scenario of having already lost the main data RAID5 array and then one of my backup HDDs refusing to spin up or failing while restoring). For testing purposes, I therefore created a 2x 3TB btrfs filesystem as described in the "Using BTRFS with Multiple Devices" Wiki: # Use full capacity of multiple drives with different sizes (metadata mirrored, data not mirrored and not striped) mkfs.btrfs -d single /dev/sdh1 /dev/sdi1 and proceeded to copy about 5.5TB of data on it, about 800 subdirectories each containing a few small files (1-5KB), a medium sized file (50-100MB) and a big file (3GB-15GB). The copy process was completely sequential (only one task copying from source to destination, no random writes, no simultaneous copies to the btrfs volume). After copying, I then unmounted the filesystem, switched off one of the two 3TB USB disks and mounted the remaining 3TB disk in recovery mode (-o degraded,ro) and proceeded to check whether any data was still left alive. Result: - the directories and files were there and looked good (metadata raid1 seems to work) - some small files I tested were fine (probably 50%?) - even some the medium sized files (50-100MB) were fine (not sure about the percentage, might have been less than for the small files) - not a single one (!) 
of the big files (3GB-15GB) survived Conclusion: The "-d single" allocator is useless (or broken?). It seems to randomly write data blocks to each of the multiple devices, thereby combining the disadvantage of a single disk (low write speed) with the disadvantage of raid0 (loss of all files when a device is missing), while not offering any benefits. In my opinion to offer any benefit compared to raid0 for data, "-d single" should never allocate blocks for a single file across multiple disks unless you start to run ouf of contiguous space when the disk gets almost full. Is there any chance that "-d single" will be fixed at some point in the future? Thanks for listening, Gerald ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: "-d single" for data blocks on a multiple devices doesn't work as it should 2014-06-24 10:42 "-d single" for data blocks on a multiple devices doesn't work as it should Gerald Hopf @ 2014-06-24 11:02 ` Roman Mamedov 2014-06-24 21:48 ` Gerald Hopf 2014-06-24 11:45 ` Duncan 1 sibling, 1 reply; 6+ messages in thread From: Roman Mamedov @ 2014-06-24 11:02 UTC (permalink / raw) To: Gerald Hopf; +Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 564 bytes --] On Tue, 24 Jun 2014 12:42:00 +0200 Gerald Hopf <gerald.hopf@nv-systems.net> wrote: > The "-d single" allocator is useless (or broken?). It's just not designed with your use case in mind. It operates on the level of allocation extents (if I'm not mistaken), not of whole files. If you want to join multiple devices with a per-file granularity (so that a single file is wholely stored on one given device), check out the FUSE filesystem called mhddfs; I wrote an article about it some time ago: https://romanrm.net/mhddfs -- With respect, Roman [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: "-d single" for data blocks on a multiple devices doesn't work as it should 2014-06-24 11:02 ` Roman Mamedov @ 2014-06-24 21:48 ` Gerald Hopf 0 siblings, 0 replies; 6+ messages in thread From: Gerald Hopf @ 2014-06-24 21:48 UTC (permalink / raw) To: Roman Mamedov; +Cc: linux-btrfs On 24.06.2014 13:02, Roman Mamedov wrote: > If you want to join multiple devices with a per-file granularity (so > that a single file is wholely stored on one given device), check out > the FUSE filesystem called mhddfs; I wrote an article about it some > time ago: https://romanrm.net/mhddfs Thank you very much for this excellent recommendation. I had never heard of this layered "filesystem" before. And I have to admit, I was initially quite skeptical of your mhddfs recommendation, because of... well... FUSE :-) But I read your article and the mhddfs documentation and it is exactly the solution I had in mind (where trouble on one of the disks does never ever affect data on other disks). mhddfs performance is also great, 130MB/s at only 50% load on one of the cpu's cores - it does only seem to be limited by the disk and not by the CPU or FUSE. Thanks again. I will now use mhddfs for my backups. And even though I'm now not using btrfs for my backups, having a perfectly working backup solution will make it much more likely for me to use BTFS on my main disk array once the btrfs RAID5/6 support is slightly more complete :-) Thanks again, Gerald ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: "-d single" for data blocks on a multiple devices doesn't work as it should 2014-06-24 10:42 "-d single" for data blocks on a multiple devices doesn't work as it should Gerald Hopf 2014-06-24 11:02 ` Roman Mamedov @ 2014-06-24 11:45 ` Duncan 2014-06-24 21:51 ` Gerald Hopf 1 sibling, 1 reply; 6+ messages in thread From: Duncan @ 2014-06-24 11:45 UTC (permalink / raw) To: linux-btrfs Gerald Hopf posted on Tue, 24 Jun 2014 12:42:00 +0200 as excerpted: > After copying, I then unmounted the filesystem, switched off one of the > two 3TB USB disks and mounted the remaining 3TB disk in recovery mode > (-o degraded,ro) and proceeded to check whether any data was still left > alive. > > Result: > - the directories and files were there and looked good (metadata raid1 > seems to work) > - some small files I tested were fine (probably 50%?) > - even some the medium sized files (50-100MB) were fine (not sure about > the percentage, might have been less than for the small files) > - not a single one (!) of the big files (3GB-15GB) survived > > Conclusion: > The "-d single" allocator is useless (or broken?). It seems to randomly > write data blocks to each of the multiple devices, thereby combining the > disadvantage of a single disk (low write speed) with the disadvantage of > raid0 (loss of all files when a device is missing), while not offering > any benefits. A little familiarity with btrfs' chunk allocator and it's obvious what happened. The critical point is that btrfs data chunks are 1 GiB in size, so files over a GiB will require multiple data chunks. Meanwhile, from what I've read (I'm not an expert here but it does match what you saw), the chunk allocation algorithm allocates new chunks from the device with the most space left. 
With two equal sized 3 TB devices and metadata in (default) raid1 mode, thus metadata allocations two (256 MiB) chunks at a time, one from each device, with single data mode, the 1 GiB data chunks will be allocated from one device to put it 1 GiB allocated ahead of the other, then from the other device since it has more unallocated space left to bring it up even with the first one again. Thus allocation will be alternating, 1 GiB data from one, the next from the other. Which with files over a GiB in size and only two devices, pretty much guarantees the file will be split, 1 GiB chunks, chunks on alternating devices. Single data mode doesn't make any specific guarantees about recovery, however (altho for files significantly under a GiB in size some should still be recoverable as long as the metadata is intact, presumably because it's in raid1 mode), and for most usage, 1 GiB plus files are still rather less common than smaller sizes, so that's where we are ATM. If you want somewhat better chances with large files, add more drives of the same size, since the effect of single-mode allocation with multiple drives in that case should be round-robin, so say 2-GiB files should have a reasonable chance of recovery (not hitting the bad drive) with 8 our 10 drives in the filesystem. Tho you're pretty much screwed on the 15 GiB files unless you run say 50 devices, in which case the chance of more than one going out at a time is unfortunately going to be dramatically higher as well. 
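[Editorial note: the allocation behaviour Duncan describes -- 1 GiB data chunks, each taken from the device with the most unallocated space -- can be sketched as a toy simulation. This is an illustrative model, not btrfs code; real allocation also interleaves metadata chunks and rounds differently, but the alternating pattern is the same.]

```python
# Toy model of btrfs "single" data chunk allocation across devices:
# each 1 GiB chunk goes to the device with the most free space left.
# With two equal devices, chunks of a >1 GiB file alternate devices,
# so losing either device destroys the file.

def allocate_chunks(free_gib, n_chunks, chunk_gib=1):
    """Allocate n_chunks, each from the device with the most free space."""
    placement = []
    for _ in range(n_chunks):
        dev = max(range(len(free_gib)), key=lambda i: free_gib[i])
        free_gib[dev] -= chunk_gib
        placement.append(dev)
    return placement

# Two equal 3 TB (~2794 GiB) devices; a 5 GiB file needs 5 data chunks.
free = [2794, 2794]
chunks = allocate_chunks(free, 5)
print(chunks)            # [0, 1, 0, 1, 0] -- strictly alternating
print(len(set(chunks)))  # 2 -> the file spans both devices
```

This is why every big file in the test above had chunks on both disks, and none survived the loss of one disk.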
The other alternative would be raid1 or raid10 mode data, or, once the raid5/6 modes are completed (AFAIK raid5/6 mode is still lacking full recovery code, tho the parity is being written), those, since that would be more efficient storage-wise than raid1 (with raid6 more reliable as well, since current raid1 and raid10 modes are only two-way-mirroring[1], so drop more than one device and data's likely to be gone, while raid6 should allow dropping two devices -- when the recovery code is complete and tested, of course).

Farther out, there has been discussion of adding additional chunk allocation schemes and making the choice configurable, which is really what you're asking for. But while I think that's reasonably likely to eventually happen, I wouldn't expect to see it for a year at least, and honestly it's more likely two years out or more...

... Unless of course you happen to have sufficient interest in that feature to either code it up yourself if you have the skill, or (assuming you have the resources) sponsor someone who actually has the skill to do so. After all, people either scratching their own itches or hiring others to do it for them is what drives freedomware forward. =:^)

---
[1] My own #1 anticipated feature is N-way-mirroring, with my personal sweet spot being N=3. Combined with the existing data-integrity and scrub features, three-way-mirroring would be /so/ sweet! Which is why I'm impatiently waiting for raid5/6 completion, since N-way-mirroring is next on the roadmap after that. But raid5/6 has been "at least a couple kernels out" for over a year now, so it's taking awhile. =:^( Meanwhile we all gotta make do with what's available now, which isn't /too/ shabby, after all. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

^ permalink raw reply	[flat|nested] 6+ messages in thread
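[Editorial note: Duncan's back-of-envelope odds for adding more drives can be checked numerically. This sketch assumes ideal round-robin placement of 1 GiB chunks, which is an idealization of the most-free-space allocator, not measured btrfs behaviour.]

```python
# Survival odds for a file under idealized round-robin chunk placement:
# a file of k 1 GiB chunks touches min(k, n) distinct devices out of n,
# and survives a single-device failure iff the failed device is not one
# of them. Idealized model -- real placement is only approximately uniform.

def survival_chance(file_gib, n_devices):
    touched = min(file_gib, n_devices)  # distinct devices holding chunks
    return 1.0 - touched / n_devices

print(survival_chance(2, 10))   # 0.8: a 2 GiB file has decent odds
print(survival_chance(15, 10))  # 0.0: a 15 GiB file touches all 10 drives
print(survival_chance(15, 50))  # 0.7: even 50 drives only get you to 70%
```

This matches Duncan's point: more drives help mid-sized files, but the largest files need impractically many devices, by which point multi-device failure becomes the dominant risk.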
* Re: "-d single" for data blocks on a multiple devices doesn't work as it should 2014-06-24 11:45 ` Duncan @ 2014-06-24 21:51 ` Gerald Hopf 2014-06-25 2:20 ` Austin S Hemmelgarn 0 siblings, 1 reply; 6+ messages in thread From: Gerald Hopf @ 2014-06-24 21:51 UTC (permalink / raw) To: Duncan, linux-btrfs On 24.06.2014 13:45, Duncan wrote: >> - not a single one (!) of the big files (3GB-15GB) survived > A little familiarity with btrfs' chunk allocator and it's obvious what > happened. The critical point is that btrfs data chunks are 1 GiB in > size, so files over a GiB will require multiple data chunks. Meanwhile, > from what I've read (I'm not an expert here but it does match what you > saw), the chunk allocation algorithm allocates new chunks from the device > with the most space left. [...] Thus allocation will be alternating, 1 GiB data from one, the next from the other. Thanks for this excellent explanation. The benefits of using "single" (vs. raid0) seem a bit limited now with the only advantage for "single" now being that there is a good chance of recovering small files. > Farther out, there has been discussion of adding additional chunk > allocation schemes and making the choice configurable, which is really > what you're asking for. But while I think that's reasonably likely to > eventually happen, I wouldn't expect to see it for a year at least, and > honestly it's more likely two years out or more... Looking forward to that happening one day. An simple allocator that just sequentially allocates each disk until full or some code that tries to make sure all data chunks for a file are on the same disk (given space is free) in "-d single" would be a cool feature that could provide "-d single" with a lot of benefit over btrfs-raid0 or LVM-Spanning. But I agree of course that probably way more people are interested in completion of the RAID5/6 code (me included!), so that's probably quite far out. > ... 
Unless of course you happen to have sufficient interest in that > feature to either code it up yourself if you have the skill, or (assuming > you have the resources) sponsor someone who actually has the skill to do > so. I somehow have doubts that a complex filesystem is the right project for me to start learning C, so I'll have to pass :-) No huge corporation with that itch behind me either, and I guess it will be more than a few hours for a btrfs programmer so no way I could sponsor that on my own. Also, with Roman's excellent recommendation of mhddfs my itch has almost entirely disappeared Thanks again for the quick response and the excellent explanation. Gerald ^ permalink raw reply [flat|nested] 6+ messages in thread
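[Editorial note: the allocator Gerald proposes -- fill one disk before touching the next -- would keep large files on a single device. A toy sketch of this hypothetical policy (not actual or proposed btrfs code) shows the difference from the most-free-space allocator:]

```python
# Toy model of the proposed "fill-first" allocator: each 1 GiB data chunk
# comes from the first device that still has room, instead of the device
# with the most free space. Hypothetical policy, not btrfs code.

def allocate_fill_first(free_gib, n_chunks, chunk_gib=1):
    """Allocate n_chunks, filling one device before moving to the next."""
    placement = []
    for _ in range(n_chunks):
        dev = next(i for i, f in enumerate(free_gib) if f >= chunk_gib)
        free_gib[dev] -= chunk_gib
        placement.append(dev)
    return placement

free = [2794, 2794]                     # two ~3 TB devices, in GiB
chunks = allocate_fill_first(free, 15)  # a 15 GiB file -> 15 chunks
print(set(chunks))  # {0}: the whole file sits on device 0
```

Under this policy a device failure only loses the files whose chunks landed on it -- the same per-file isolation mhddfs provides, but inside the filesystem.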
* Re: "-d single" for data blocks on a multiple devices doesn't work as it should 2014-06-24 21:51 ` Gerald Hopf @ 2014-06-25 2:20 ` Austin S Hemmelgarn 0 siblings, 0 replies; 6+ messages in thread From: Austin S Hemmelgarn @ 2014-06-25 2:20 UTC (permalink / raw) To: Gerald Hopf, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 1003 bytes --] > I somehow have doubts that a complex filesystem is the right project for > me to start learning C, so I'll have to pass :-) No huge corporation > with that itch behind me either, and I guess it will be more than a few > hours for a btrfs programmer so no way I could sponsor that on my own. Whether or not it is the right project really depends on where you intend to do most of your C programming. If you plan to do most of it in kernel code and occasional userspace wrappers for kernel interfaces (like me), then it could be a great place because it's under such heavy development (which means more developers are working on it, and bugs get spotted faster, both of which are good things for project you are using to learn a language). If, however, you intend to use it mostly for userspace, then I would definitely agree with you, programming in userspace and kernel-space are so different that it's almost like a different language using the same syntax and similar semantics. [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 2967 bytes --] ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2014-06-25  2:20 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-06-24 10:42 "-d single" for data blocks on a multiple devices doesn't work as it should Gerald Hopf
2014-06-24 11:02 ` Roman Mamedov
2014-06-24 21:48   ` Gerald Hopf
2014-06-24 11:45 ` Duncan
2014-06-24 21:51   ` Gerald Hopf
2014-06-25  2:20   ` Austin S Hemmelgarn