From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: Status of FST and mount times
Date: Wed, 14 Feb 2018 23:24:47 +0000 (UTC)

Ellis H. Wilson III posted on Wed, 14 Feb 2018 11:00:29 -0500 as
excerpted:

> Hi again -- back with a few more questions:
>
> Frame-of-reference here: RAID0.  Around 70TB raw capacity.  No
> compression.  No quotas enabled.  Many (potentially tens to hundreds)
> of subvolumes, each with tens of snapshots.  No control over size or
> number of files, but directory tree (entries per dir and general tree
> depth) can be controlled in case that's helpful.

?? How can you control both breadth (entries per dir) AND depth of the
directory tree without ultimately limiting your number of files?  Or do
you mean you can control breadth XOR depth of the tree as needed,
allowing the other to expand as necessary to accommodate the
uncontrolled number of files?

Anyway, AFAIK the only performance issue there would be the (IIRC)
~65535 limit on directory hard links before additional ones are
out-of-lined into a secondary node, with the entailing performance
implications.

> 1. I've been reading up about the space cache, and it appears there is
> a v2 of it called the free space tree that is much friendlier to large
> filesystems such as the one I am designing for.  It is listed as OK/OK
> on the wiki status page, but there is a note that btrfs-progs treats
> it as read-only (i.e., that btrfs check --repair cannot help me
> without a full space cache rebuild is my biggest concern), and the
> last status update on this I can find was circa fall 2016.  Can
> anybody give me an updated status on this feature?  From what I read,
> v1 and tens-of-TB filesystems will not play well together, so I'm
> inclined to dig into this.

At tens of TB, yes, the free-space-cache (v1) has issues that the
free-space-tree (aka free-space-cache-v2) is designed to solve.  And v2
should be very well tested in large enterprise installations by now,
given facebook's usage of and intimate involvement with btrfs.  (The
switch itself is just a one-time mount option; see the sketch a bit
further down.)

But I have an arguably more basic concern...

Pardon me for reviewing the basics, as I feel rather like a pupil
attempting to lecture a teacher on the point, and you could very likely
teach /me/ about them, but they set up the point...

Raid0, particularly at the tens-of-TB scale, has some implications that
don't particularly well match your specified concerns above.  Of course
"raid0" is a convenient misnomer, as there's nothing "redundant" about
the "array of independent devices" in a raid0 configuration; it's done
purely for the space and speed features, with the sacrificial tradeoff
being reliability.  It's only called raid0 as a convenience, allowing
it to be grouped with the other raid configurations where "redundant"
/is/ a feature, the more important grouping commonality being that
they're all multi-device.
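(A btrfs-specific aside on that grouping: btrfs sets the data and
metadata profiles independently, so raid0 needn't be all-or-nothing.
A minimal sketch, with /dev/sdX, /dev/sdY and /mnt as hypothetical
placeholders, and the profile mix as illustration only, not a
recommendation for your workload:

  # raid0 data for the space/speed, raid1 metadata so the trees can
  # survive corruption on one device (data still has no second copy):
  mkfs.btrfs -d raid0 -m raid1 /dev/sdX /dev/sdY

  # after mounting, confirm which profiles are actually in use:
  btrfs filesystem df /mnt

This matters again further down, when checksum verification comes up.)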
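(And the promised sketch for your question 1, before going further:
IIRC the v1 -> v2 switch is just a one-time mount option on any kernel
from 4.5 on, and it's sticky across subsequent plain mounts.  /dev/sdX
and /mnt are placeholders again:

  # clear the old v1 cache and build the free-space-tree on this mount:
  mount -o clear_cache,space_cache=v2 /dev/sdX /mnt

  # reverting means clearing v2 offline, which needs a reasonably new
  # btrfs-progs (check's --clear-space-cache option, IIRC):
  umount /mnt
  btrfs check --clear-space-cache v2 /dev/sdX

Note that v2 sets a read-only-compat feature flag, IIRC, so a kernel
too old to understand it will only mount the filesystem read-only.)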
Because reliability /is/ the sacrificial tradeoff for raid0, it's
relatively safe to assume that reliability either isn't needed at all,
because the data is literally of "throw-away" value (cache, say, where
refilling the cache isn't a big cost or time factor), or that
reliability is assured by other mechanisms, backups being the most
basic, though there are others like multi-layered raid, etc.  In
practice that makes at least the particular instance of the data on the
raid0 of "throw-away" value, even if the data as a whole is not.

So far, so good.  But then above you mention concern about btrfs-progs
treating the free-space-tree (free-space-cache-v2) as read-only, and
the time cost of having to clear and rebuild it after a btrfs check
--repair.  Which is what triggered the mismatch warning I mentioned
above.

Either that raid0 data is of throw-away value appropriate to placement
on a raid0, in which case btrfs check --repair is of little concern, as
its benefits are questionable (no guarantees it'll work, and the data
is either directly of throw-away value anyway, or there's a backup at
hand that /does/ have a tested guarantee of viability, or it's not
worthy of being called a backup in the first place), or it's not.

It's the concern about the viability of btrfs check --repair on what
you're defining as throw-away data, by placing it on raid0 in the first
place, that's raising all those red warning flags for me!  And the fact
that you didn't even explain it with a side note to the effect that
reliability is addressed some other way but you still need to worry
about btrfs check --repair viability because $REASONS, is turning those
red flags into flashing red lights accompanied by blaring sirens!

OK, so let's assume you /do/ have a tested backup, ready to go.  Then
the viability of btrfs check --repair is of less concern, but remains
something you might still be interested in for trivial cases, because
let's face it, transferring tens of TB of data, even if ready at hand,
does take time, and if you can avoid that because the btrfs check
--repair fix is trivial, it's worth doing.  Valid case, but there's
nothing in your post indicating it's valid as /your/ case.

Of course the other possibility is live-failover, which is surely
facebook's use-case.  But with live-failover, the viability of btrfs
check --repair more or less ceases to be of interest, because the
failover happens (relative to the offline check or restore time)
instantly, and once the failed devices/machine are taken out of service
it's far more effective to simply blow away the filesystem (if not
replace the device(s) entirely) and restore "at leisure" from backup, a
relatively guaranteed procedure compared to the "no guarantees" of
attempting to check --repair the filesystem out of trouble.

Which is very likely why the free-space-tree still isn't well supported
by btrfs-progs, including btrfs check, several kernel (and thus -progs)
development cycles later: the people who really need the one (whichever
of the two)... don't tend to (or at least /shouldn't/) make use of the
other so much.

It's also worth mentioning that btrfs raid0 mode, as well as single
mode, hobbles the btrfs data and metadata integrity feature.  Checksums
are still generated, stored, and checked by default, so integrity
problems can still be detected, but because raid0 (and single) includes
no redundancy, there's no second copy (raid1/10) or parity (raid5/6) to
rebuild the bad data from, so it's simply gone.
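You can see that for yourself with a scrub: on raid0 data it still
walks and verifies every checksum, but anything bad is reported as
uncorrectable instead of being rewritten from the good copy as it would
be on raid1.  A sketch, /mnt being a placeholder mountpoint again:

  # foreground scrub; -B blocks until done and prints the error stats:
  btrfs scrub start -B /mnt

  # or run it in the background and poll for results:
  btrfs scrub start /mnt
  btrfs scrub status /mnt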
(Well, for data you can try btrfs restore of the otherwise-inaccessible
file and hope for the best, and for metadata you can try check --repair
and again hope for the best, but...)  If you're using that feature of
btrfs and want/need more than just detection of a problem that can't be
fixed due to lack of redundancy, there's a good chance you want a real
redundancy raid mode on multi-device, or dup mode on a single device.

So bottom line: given the sacrificial lack of redundancy and
reliability of raid0, btrfs or not, in an enterprise setting with tens
of TB of data, why are you worrying about the viability of btrfs check
--repair on what the placement on raid0 decrees to be throw-away data
anyway?

At first glance anyway, one or the other must be wrong: either the
raid0 mode, and thus the declared throw-away value of those tens of TB
of data, or the concern over the viability of btrfs check --repair,
which says you don't consider that data to be of throw-away value after
all.  Which one is wrong is your call, and there are certainly
individual cases (one of which I even named) where concern about the
viability of btrfs check --repair on raid0 might be valid, but your
post has no real indication that your case is such a case, and
honestly, that worries me!

> 2. There's another thread ongoing about mount delays.  I've been
> completely blind to this specific problem until it caught my eye.
> Does anyone have ballpark estimates for how long very large HDD-based
> filesystems will take to mount?  Yes, I know it will depend on the
> dataset.  I'm looking for O() worst-case approximations for
> enterprise-grade large drives (12/14TB), as I expect it should scale
> with multiple drives, so approximating for a single drive should be
> good enough.

No input on that question here (my own use-case couldn't be more
different: multiple small sub-half-TB independent btrfs raid1s on
partitioned ssds), but another concern, based on real-world reports
I've seen on-list:

12-14 TB individual drives?  While you /did/ say enterprise grade, so
this probably doesn't apply to you, it might apply to others who read
this: be careful that you're not trying to use the "archive
application" targeted SMR drives for general-purpose use.  Occasionally
people will try to buy and use such drives in general-purpose roles due
to their cheaper per-TB cost, and it just doesn't go well.  We've had a
number of reports of that.  =:^(

-- 
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman