To: linux-btrfs@vger.kernel.org
From: Duncan <1i5t5.duncan@cox.net>
Subject: Re: btrfs fi defrag interfering (maybe) with Ceph OSD operation
Date: Mon, 28 Sep 2015 00:18:12 +0000 (UTC)
References: <56080C9A.6030102@bouton.name>

Lionel Bouton posted on Sun, 27 Sep 2015 17:34:50 +0200 as excerpted:

> Hi,
>
> we use BTRFS for Ceph filestores (after much tuning and testing over
> more than a year). One of the problems we've had to face was the slow
> decrease in performance caused by fragmentation.

While I'm a regular user/admin (not dev) on the btrfs lists, my ceph
knowledge is essentially zero, so this is intended to address the btrfs
side ONLY.

> Here's a small recap of the history for context.
> Initially we used internal journals on the few OSDs where we tested
> BTRFS, which meant constantly overwriting 10GB files (which is
> obviously bad for CoW). Before using NoCoW and eventually moving the
> journals to raw SSD partitions, we understood autodefrag was not being
> effective: the initial performance on a fresh, recently populated OSD
> was great and slowly degraded over time without access patterns and
> filesystem sizes changing significantly.

Yes.  Autodefrag works most effectively on (relatively) small files,
generally for performance reasons: it detects fragmentation and queues
up a defragmenting rewrite by a separate defragmentation worker thread.
As file sizes increase, that defragmenting rewrite takes longer, until
at some point, particularly on actively rewritten files, change-writes
come in faster than the file can be rewritten...

Generally speaking, therefore, it's great for small database files up
to a quarter gig or so (think firefox sqlite database files on the
desktop), with people starting to see issues somewhere between a
quarter gig and a gig on spinning rust, depending on disk speed as well
as the active rewrite load on the file in question.

So constantly rewritten 10-gig journal files... entirely inappropriate
for autodefrag. =:^(

There has been discussion and a general plan for some sort of
larger-file autodefrag optimization, but btrfs continues to be rather
"idea and opportunity rich" and "implementation coder poor", so
realistically we're looking at years to implementation.  Meanwhile,
other measures should be taken for multigig files, as you're already
doing. =:^)

> I couldn't find any description of the algorithms/heuristics used by
> autodefrag [...]

This is in general documented on the wiki, tho not with the level of
explanation I included above.

https://btrfs.wiki.kernel.org
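To put rough numbers on the file-size argument above, here's a
back-of-the-envelope sketch (python, untested; the function name and
every figure in it are invented for illustration, not measured, and it
is not how the kernel actually schedules anything):

    # Illustrative sketch only: invented round numbers, not measurements.
    def autodefrag_keeps_up(file_bytes, seq_write_bps, churn_bps,
                            churn_window):
        """True if a whole-file defragmenting rewrite can finish before
        the file accumulates churn_window bytes of scattered rewrites."""
        rewrite_secs = file_bytes / seq_write_bps
        refragment_secs = churn_window / churn_bps
        return rewrite_secs < refragment_secs

    # ~100 MB/s sequential on spinning rust, ~5 MB/s of scattered
    # rewrites, and call the file "fragmented again" after 64 MiB of
    # churn; quarter-gig file vs 10 GiB journal:
    for size in (256 * 2**20, 10 * 2**30):
        print(size, autodefrag_keeps_up(size, 100e6, 5e6, 64 * 2**20))

With those (made-up) figures the quarter-gig file wins the race
comfortably and the 10-gig journal doesn't come close, which is the
arithmetic behind the sad smiley above.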
> I decided to disable it and develop our own defragmentation scheduler.
> It is based on both a slow walk through the filesystem (which acts as
> a safety net over a one-week period) and a fatrace pipe (used to
> detect recent fragmentation). Fragmentation is computed from filefrag
> detailed outputs and it learns how much it can defragment files with
> calls to filefrag after defragmentation (we learned compressed files
> and uncompressed files don't behave the same way in the process so we
> ended up treating them separately).

Note that unless this has very recently changed, filefrag doesn't know
how to calculate btrfs-compressed file fragmentation correctly.  Btrfs
uses (IIRC) 128 KiB compression blocks, which filefrag will see (I'm
not actually sure if it's 100% consistent or if it's conditional on
something else) as separate extents.

Bottom line, there's no easily accessible, reliable way to get the
fragmentation level of a btrfs-compressed file. =:^(

(Presumably btrfs-debug-tree with the -e option to print extents info,
with the output fed to some parsing script, could do it, but that's not
what I'd call easily accessible, at least at a non-programmer admin
level.)

Again, there has been some discussion around teaching filefrag about
btrfs compression, and it may well eventually happen, but I'm not aware
of an e2fsprogs release doing it yet, nor of whether there are even
actual patches for it yet, let alone merge status.
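Tangentially, if you're computing your fragmentation "cost" from
filefrag -v anyway, the usual workaround for the compression issue is
to coalesce extents whose physical ranges are back-to-back, so that a
compressed file laid out as thousands of adjacent 128-KiB compressed
extents isn't counted as thousands of fragments.  A rough, untested
sketch (python; rough_extent_count is a name I just made up, the regex
assumes the current e2fsprogs -v output format, and this is certainly
not your scheduler's actual logic):

    import re
    import subprocess

    def rough_extent_count(path):
        """Count extents from `filefrag -v`, merging extents whose
        physical ranges are contiguous, so btrfs-compressed files
        aren't over-counted."""
        out = subprocess.check_output(["filefrag", "-v", path],
                                      universal_newlines=True)
        # rows look like: "  0:    0..   31:  279552..  279583:   32:"
        rows = re.findall(
            r"^\s*\d+:\s*\d+\.\.\s*\d+:\s*(\d+)\.\.\s*(\d+):",
            out, re.MULTILINE)
        merged, prev_end = 0, None
        for phys_start, phys_end in ((int(a), int(b)) for a, b in rows):
            if prev_end is None or phys_start != prev_end + 1:
                merged += 1   # not physically adjacent to the previous
            prev_end = phys_end
        return merged

Whether a merged count like that tracks real-world read performance on
compressed files any better is exactly the sort of thing you'd want to
verify against your own before/after filefrag numbers.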
> Simply excluding the journal from defragmentation and using some basic
> heuristics (don't defragment recently written files but keep them in a
> pool then queue them, and don't defragment files below a given
> fragmentation "cost" where defragmentation becomes ineffective) gave
> us usable performance in the long run. Then we successively moved the
> journal to NoCoW files and SSDs and disabled Ceph's use of BTRFS
> snapshots which were too costly (removing snapshots generated 120MB of
> writes to the disks and this was done every 30s on our configuration).

It can be noted that there's a negative interaction between btrfs
snapshots and nocow, sometimes called cow1.  The btrfs snapshot feature
is predicated on cow, with a snapshot locking existing file extents in
place, which is normally no big deal, as ordinary cow files will have
rewrites cowed elsewhere in any case.  Obviously, then, snapshots must
by definition play havoc with nocow.  What actually happens is that
with existing extents locked in place, the first post-snapshot change
to a block must be cowed into a new extent.  The nocow attribute
remains on the file, however, and further writes to that block (until
the next snapshot, anyway) will be written in-place, to the
(first-post-snapshot-cowed) current extent.

When one list poster referred to that as cow1, I found the term so
nicely descriptive that I adopted it for myself, altho for obvious
reasons I have to explain it first in many posts.

It should now be obvious why 30-second snapshots weren't working well
on your nocow files, and why they seemed to become fragmented anyway:
the 30-second snapshots were effectively disabling nocow!

In general, for nocow files, snapshotting should be disabled (as you
ultimately did) or made as low-frequency as is practically possible.
Some list posters have, however, reported a good experience with a
combination of lower-frequency snapshotting (say daily, or maybe every
six hours, but DEFINITELY not more frequent than half-hourly) and
periodic defrag, on the order of the weekly period you implied in a bit
I snipped, to perhaps monthly.

> In the end we had a very successful experience, migrated everything
> to BTRFS filestores that were noticeably faster than XFS (according
> to Ceph metrics), detected silent corruption and compressed data.
> Everything worked well [...]

=:^)

> [...] until this morning.

=:^(

> I woke up to a text message signalling VM freezes all over our
> platform. 2 Ceph OSDs died at the same time on two of our servers
> (20s apart), which for durability reasons freezes writes on the data
> chunks shared by these two OSDs.
>
> The errors we got in the OSD logs seem to point to an IO error (at
> least IIRC we got a similar crash on an OSD where we had invalid csum
> errors logged by the kernel) but we couldn't find any kernel error,
> and btrfs scrubs finished on the filesystems without finding any
> corruption.

Snipping some of the ceph stuff, since as I said I've essentially zero
knowledge there, but...

> Given that the defragmentation scheduler treats file accesses the
> same on all replicas to decide when triggering a call to "btrfs fi
> defrag <file>", I suspect this manual call to defragment could have
> happened on the 2 OSDs affected for the same file at nearly the same
> time and caused the near-simultaneous crashes.

... While what I /do/ know of ceph suggests that it should be protected
against this sort of thing, perhaps there's a bug, because...

I know for sure that btrfs itself is not intended for distributed
access from more than one system/kernel at a time.  Assuming my ceph
illiteracy isn't negatively affecting my reading of the above, that
seems to be more or less what you're suggesting happened, and I do know
that *if* it *did* happen, it could indeed trigger all sorts of havoc!
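If that theory pans out then, purely as an illustration from the btrfs
side (untested python sketch; staggered_defrag and the five-minute
window are invented here, and this is emphatically not something ceph
or your scheduler already does), deterministic per-host jitter keyed on
hostname plus path would at least make it unlikely that two replicas
launch "btrfs fi defrag" on the same object at the same moment:

    import hashlib
    import socket
    import subprocess
    import time

    def staggered_defrag(path, window_secs=300):
        """Sleep a host-and-path-dependent delay before defragmenting,
        so hosts holding replicas of the same object are unlikely to
        defragment it at (nearly) the same instant."""
        key = (socket.gethostname() + path).encode()
        delay = int.from_bytes(hashlib.sha1(key).digest()[:4],
                               "big") % window_secs
        time.sleep(delay)
        subprocess.run(["btrfs", "filesystem", "defragment", path],
                       check=True)

That obviously doesn't answer whether defrag can interfere in the first
place; it just avoids the suspected coincidence while you investigate.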
> It's not clear to me that "btrfs fi defrag <file>" can't interfere
> with another process trying to use the file. I assume basic reading
> and writing is OK but there might be restrictions on
> unlinking/locking/using other ioctls... Are there any I should be
> aware of and should look for in Ceph OSDs? This is on a 3.8.19 kernel
> (with Gentoo patches which don't touch BTRFS sources) with btrfs-progs
> 4.0.1. We have 5 servers on our storage network: 2 are running a 4.0.5
> kernel and 3 are running 3.8.19. The 3.8.19 servers are waiting for an
> opportunity to reboot on 4.0.5 (or better, if we have the time to test
> a more recent kernel before rebooting: 4.1.8 and 4.2.1 are our
> candidates for testing right now).

It's worth keeping in mind that the explicit warnings about btrfs being
experimental weren't removed until 3.12, and while the current status
is no longer experimental or entirely unstable, it remains, as I
characterize it, "maturing and stabilizing, not yet entirely stable and
mature."  So 3.8 is very much still in btrfs-experimental land!  And so
many bugs have been fixed since then that... well, just get off of it
ASAP, which it seems you're already doing.

While it's no longer absolutely necessary to stay current with the
latest non-long-term-support kernel, list consensus seems to be that
where stability is a prime consideration, recommended best practice is
to stick to long-term-support kernel series: stay no more than one LTS
series behind the latest, and upgrade to the latest LTS series some
reasonable time after its announcement, after deployment-specific
testing as appropriate, of course.  (Two exceptions while I'm at it:
raid56 mode is still new enough not to be as stable as the rest of
btrfs, so running the latest kernel continues to be critical there, and
the btrfs quota code continues to be a problem even with the newest
kernels, so I recommend it remain off unless you're specifically
working with the devs to debug and test it.)

With the kernel 4.1 series now blessed as the latest long-term-stable,
and 3.18 the latest before that, the above suggests targeting them.
Indeed, list reports for the 3.18 series as it has matured have been
very good, while 4.1 is still new enough that the stability-cautious
are still testing or have only just deployed, so there aren't many
reports on it yet.

Meanwhile, while the latest (or second-latest, until the latest is
site-tested) LTS kernel is recommended for stable deployment, when
encountering specific bugs be prepared to upgrade to the latest stable
kernel at least for testing, possibly with cherry-picked
not-yet-mainlined patches if appropriate for individual bugs.

But definitely get off of anything pre-3.12, as that really is when the
experimental label came off, and you don't want to be running kernel
btrfs of that age in production.  Again, 3.18 is well tested and rated,
so targeting it for ASAP deployment is good, with 4.1 targeted for
testing and deployment "soon" also recommended.

And once again, that's purely from the btrfs side.  I know absolutely
nothing about ceph stability on any of these kernels, tho obviously for
you that's going to be a consideration as well.

Tying up a couple of loose ends...

Regarding nocow...

Given that you had apparently missed much of the general list and wiki
wisdom above (while eventually coming to many of the same conclusions
on your own), it's worth mentioning the following additional nocow
caveat and recommended procedure, in case you missed it as well:

On btrfs, setting nocow on an existing file with existing content
leaves undefined exactly when the nocow attribute will take effect.
(FWIW, this is mentioned in the chattr (1) manpage as well.)

Recommended procedure is therefore to set the nocow attribute on the
directory, such that newly created files (and subdirs) will inherit it.
(There's no effect on the directory itself, just this inheritance.)
Then, for existing files, copy them into the new location, preferably
from a different filesystem, in order to guarantee that the file is
actually newly created and thus gets nocow applied appropriately.

(cp currently copies the file in unless the reflink option is set
anyway, but there has been discussion of changing that to reflink by
default for speed and space-usage reasons, and that would play havoc
with nocow on file creation.  Btrfs doesn't support cross-filesystem
reflinks, however, so copying in from a different filesystem should
always force creation of a new file, with nocow inherited from its
directory as intended.)
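Concretely, the shape of that procedure is something like the following
(untested python sketch; the /srv/osd-42 and /mnt/scratch-xfs paths are
made up for illustration):

    import shutil
    import subprocess

    # 1. Set nocow on the (ideally still empty) directory, so files
    #    created inside it inherit the attribute.
    subprocess.run(["chattr", "+C", "/srv/osd-42/journal-dir"],
                   check=True)

    # 2. Create the file fresh inside it.  shutil.copy2 does a plain
    #    data copy (no reflink), and copying from a *different*
    #    filesystem guarantees a genuinely new file in any case, so the
    #    inherited nocow attribute applies to all of its data.
    shutil.copy2("/mnt/scratch-xfs/journal",
                 "/srv/osd-42/journal-dir/journal")

    # 3. Sanity check: a 'C' should show up in the attribute listing.
    print(subprocess.check_output(
        ["lsattr", "/srv/osd-42/journal-dir/journal"],
        universal_newlines=True))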
What about btrfs-progs versions?

In general, in normal online operation the btrfs command simply tells
the kernel what to do and the kernel takes care of the details, so it's
the kernel code that's critical.  However, various recovery operations
(btrfs check, btrfs restore, btrfs rescue, etc; I'm not actually sure
about mkfs.btrfs, whether that's primarily userspace code or calls into
the kernel, tho I suspect the former) operate on an unmounted btrfs
using primarily userspace code, and it's here that the latest userspace
code, updated to deal with the latest known problems, becomes critical.

So in general, it's kernel code age and stability that's critical for a
deployed and operational filesystem, but userspace code that's critical
if you run into problems.  For that reason, unless you have backups and
intend to simply blow away any filesystem that develops problems and
recreate it fresh, restoring from those backups, a reasonably current
btrfs userspace is important as well, even if it's not critical in
normal operation.

And of course you need current userspace as well as kernelspace to best
support the newest features, but that's a given. =:^)

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman