Subject: Re: About free space fragmentation, metadata write amplification and (no)ssd
To: Peter Grandi, Linux fs Btrfs
From: Hans van Kranenburg
Date: Sun, 9 Apr 2017 02:21:19 +0200
In-Reply-To: <22761.23640.216570.125948@tree.ty.sabi.co.uk>

On 04/08/2017 11:55 PM, Peter Grandi wrote:
>> [ ... ] This post is way too long [ ... ]
>
> Many thanks for your report, it is really useful, especially the
> details.

Thanks!

>> [ ... ] using rsync with --link-dest to btrfs while still
>> using rsync, but with btrfs subvolumes and snapshots [1]. [ ... ]
>> Currently there's ~35TiB of data present on the example
>> filesystem, with a total of just a bit more than 90000
>> subvolumes, in groups of 32 snapshots per remote host (daily for
>> 14 days, weekly for 3 months, monthly for a year), so that's
>> about 2800 'groups' of them. Inside are millions and millions
>> and millions of files. And the best part is... it just
>> works. [ ... ]
>
> That kind of arrangement, with a single large pool and very many
> files and many subdirectories, is a worst case scenario for any
> filesystem type, so it is amazing-ish that it works well so far,
> especially with 90,000 subvolumes.

Yes, this is one of the reasons for this post. Instead of only
hearing about problems all day on the mailing list and IRC, we need
some more reports of success.

The fundamental functionality of doing the cow snapshots, moo, and
the related subvolume removal on filesystem trees is so awesome. I
have no idea how we would have been able to continue this type of
backup system if btrfs had not been available. Hardlinks and rm -rf
were a total dead-end road.

The growth has been slow but steady (oops, fast and steady, I
immediately got corrected by our sales department), but anyway,
steady. This makes it possible to just let it do its thing every
day, spot small changes in behaviour over time, detect patterns that
could be a ticking time bomb, and then deal with them in a way that
allows conscious decisions, well-tested changes and continuous
measurement of the results.

But, ok, it's surely not for the faint of heart, and the devil is in
the details. If it breaks, you keep the pieces.

Using the NetApp hardware is one of the relevant decisions made
here. The shameful state of the most basic case of recovering (or
failing to recover) from a failure in a two-disk btrfs RAID1 is
enough of a sign that the whole multi-disk handling is a nice idea,
but hasn't yet gotten the attention it needs for me to be able to
rely on it. Having the data safe in my NetApp filer gives me the
opportunity to take regular (like, monthly) snapshots of the
complete thing, so that I have something to go back to if disaster
strikes in linux land. Yes, it's a bit inconvenient, because I want
to umount for a few minutes in a quiet moment of the week, but it's
worth the effort, since I can keep the eggs in a shadow basket.
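For context: the nightly cycle behind all of this is conceptually
just a handful of commands. A minimal sketch (the paths and the
snapshot naming here are made up for illustration; the real tooling
does a lot more bookkeeping around it):

  # rsync into a persistent writable subvolume; unchanged files keep
  # sharing their extents with all existing snapshots of it
  rsync -aH --delete host1:/ /backups/host1/current/

  # freeze tonight's state as a read-only snapshot, which is cheap
  # no matter how many millions of files are inside
  btrfs subvolume snapshot -r /backups/host1/current \
      /backups/host1/snapshots/daily-$(date +%Y%m%d)

  # expiring an old backup is a subvolume delete, not an rm -rf
  btrfs subvolume delete /backups/host1/snapshots/daily-20160409

The subvolume delete is where the hardlink approach really lost out:
the kernel cleans up the shared trees in the background, instead of
userspace having to unlink every single file.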
OTOH, what we do with btrfs (taking a bulldozer and driving across
all the boundaries of sanity according to all recommendations and
warnings) on this scale of individual remotes is something that the
NetApp people should totally be jealous of. Backup management
(manual create, restore etc. on top of the nightlies) is
self-service functionality for our customers, and being able to
implement the magic behind the APIs with just a few commands like a
btrfs sub snap and some rsync gives us the freedom and flexibility
we need.

And, monitoring of trends is so. super. important. It's not a secret
that when I work with technology, I want to see what's going on in
there, crack the black box open and try to understand why the lights
are blinking in a specific pattern. What does this balance
-dusage=75 mean? Why does it know what's 75% full and I don't? Where
does it get that information from? The open source kernel code and
the IOCTL API are a source of many hours of happy hacking, because
they allow all of this to be done.

> As I mentioned elsewhere I would rather do a rotation of smaller
> volumes, to reduce risk, like "Duncan" also on this mailing list
> likes to do (perhaps to the opposite extreme).

Well, as seen in my 'keeps allocating new chunks for no apparent
reason' thread... even small filesystems can have really weird
problems. :)

> As to the 'ssd'/'nossd' issue, that is as described in 'man 5
> btrfs' (and I wonder whether 'ssd_spread' was tried too), but it
> is not at all obvious it should impact metadata handling so much.

I'll add a new item to the "gotcha" list. I suspect that the -o ssd
behaviour is a decent source of the "help! my filesystem is full but
df says it's not" problems we see about every week. But, I can't
just assert that. Apart from the fact that this was the very same
problem btrfs greeted me with when I tried it out for the first time
a few years ago (and it still is one of the first problems people
who start using btrfs encounter), I haven't spent time debugging the
behaviour when running fully allocated.

OTOH, the two-step allocation process is also a nice thing, because
I *know* when I still have unallocated space available, which makes,
for example, the free space fragmentation debugging process much
more bearable.

> It is sad that 'ssd' is used by default in your case, and it is
> quite perplexing that the "wandering trees" problem (that is,
> "write amplification") is so large with 64KiB write clusters for
> metadata (and 'dup' profile for metadata).

In the worst case, 32 of those 64KiB clusters fit into a single 2MiB
one. That is a bit of a bogus argument on its own, but take the
extent tree changes (the number of leaves and nodes) each write
causes, including all the wandering and shifting around of items
when they don't fit, and then the recursive updating, and apparently
it already makes enough of a difference to cause an entire day of
writing metadata at Gigabit/s speed.

Note that everyone who has rotational set to 0 in /sys is
experiencing this behaviour right now when removing snapshots... and
then they end up on IRC complaining to us that their computer is
totally unusable for hours when they remove some snapshots...

> * Probably the metadata and data cluster sizes should be create
>   or mount parameters instead of being implicit in the 'ssd'
>   option.
> * A cluster size of 2MiB for metadata and/or data presumably has
>   some downsides, otherwise it would be the default. I wonder
>   whether the downsides are related to barriers...

I don't know... yet.
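To make the "where does it get that information from" part a bit
more concrete: the usage numbers that balance filters on are plain
per-block-group accounting, visible with stock btrfs-progs (the
device and mount point below are just examples):

  # does the kernel consider this device non-rotational, and thus
  # silently mount with -o ssd?
  cat /sys/block/sda/queue/rotational

  # allocated vs. actually used space per chunk type; the gap
  # between the two is exactly what confuses df
  btrfs filesystem usage /mnt/backups

  # rewrite only data block groups that are at most 75% used,
  # compacting their contents into fewer, fuller chunks
  btrfs balance start -dusage=75 /mnt/backups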
What I do know is that adding options to tune things will lead to
users not setting them, or setting them to the wrong value. It's a
bit like having btrfs-zero-log, or --init-extent-tree. It just
doesn't work out in harsh reality.

-- 
Hans van Kranenburg