Subject: Re: Uncorrectable errors with RAID1
To: "Janos Toth F.", Btrfs BTRFS
References: <87o9z7dzvd.fsf@grothesque.org>
 <85a62769-0607-4be5-3c5b-5091bebea07e@gmail.com>
 <87fukjdna0.fsf@grothesque.org>
From: "Austin S. Hemmelgarn"
Date: Tue, 17 Jan 2017 07:25:38 -0500

On 2017-01-16 23:50, Janos Toth F. wrote:
>> BTRFS uses a 2-level allocation system.  At the higher level, you
>> have chunks.  These are just big blocks of space on the disk that get
>> used for only one type of lower-level allocation (Data, Metadata, or
>> System).  Data chunks are normally 1GB, Metadata 256MB, and System
>> depends on the size of the FS when it was created.  Within these
>> chunks, BTRFS then allocates individual blocks just like any other
>> filesystem.
>
> This always seems to confuse me when I try to get an abstract idea
> about de-/fragmentation of Btrfs.
> Can meta-/data be fragmented on both levels?  And if so, can defrag
> and/or balance "cure" both levels of fragmentation (if any)?
> But how?  Maybe several defrag and balance runs, repeated until
> returns diminish (or at least you consider them meaningless and/or
> unnecessary)?
Defrag operates only at the block level.  It won't allocate chunks
unless it has to, and it won't remove chunks unless they become empty
from it moving things around (although that's not likely to happen most
of the time).

Balance functionally operates at both levels, but it doesn't really do
any defragmentation.  Balance _may_ merge extents sometimes, but I'm
not sure of this.  It will compact allocations and therefore
functionally defragment free space within chunks (though not
necessarily at the chunk level itself).

Defrag run with the same options _should_ have no net effect after the
first run, the two exceptions being if the filesystem is close to full
or if the data set is being modified live while the defrag is running.
Balance run with the same options will eventually hit a point where it
doesn't do anything (or only touches one chunk of each type without
providing any actual benefit).  If you're just using the usage filters
or doing a full balance, that point is the second run.  If you're using
other filters, it's functionally impossible to determine when that
point will be reached without low-level knowledge of the chunk layout.

For an idle filesystem, running defrag and then a full balance will get
you a near-optimal layout.  Running them in the reverse order will get
you a different layout that may be less optimal, because defrag may
move data in such a way that new chunks get allocated.  Repeated runs
of defrag and balance provide no extra benefit in more than 95% of
cases.
>
>> What balancing does is send everything back through the allocator,
>> which in turn back-fills chunks that are only partially full, and
>> removes ones that are now empty.
>
> Doesn't this have a potential chance of introducing (additional)
> extent-level fragmentation?
In theory, yes.  IIRC, extents can't cross a chunk boundary.  Beyond
that packing constraint, balance shouldn't fragment things further.
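
If you want to see what each tool is actually operating on, both levels
are easy to inspect by hand.  Something along these lines works (the
mount point and file name are just placeholders, and the exact output
format depends on your btrfs-progs version):

  # Chunk level: how much space is allocated to Data/Metadata/System
  # chunks versus how much is actually used inside those chunks.
  btrfs filesystem df /mnt/data
  btrfs filesystem usage /mnt/data

  # Block/extent level: how many extents an individual file is
  # split into.
  filefrag -v /mnt/data/some-big-file

A big gap between the total and used figures for a chunk type is the
chunk-level slack that balance cleans up; a long extent list from
filefrag is the block-level fragmentation that defrag targets.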
>
>> FWIW, while there isn't a daemon yet that does this, it's a perfect
>> thing for a cronjob.  The general maintenance regimen that I use for
>> most of my filesystems is:
>> * Run 'btrfs balance start -dusage=20 -musage=20' daily.  This will
>>   complete really fast on most filesystems, and keeps the slack space
>>   relatively under control (and has the nice bonus that it helps
>>   defragment free space).
>> * Run a full scrub on all filesystems weekly.  This catches silent
>>   corruption of the data, and will fix it if possible.
>> * Run a full defrag on all filesystems monthly.  This should be run
>>   before the balance (reasons are complicated and require more
>>   explanation than you probably care for).  I would run this at least
>>   weekly on HDDs, though, as they tend to be more negatively impacted
>>   by fragmentation.
>
> I wonder if one should always run a full balance instead of a full
> scrub, since balance should also read (and thus theoretically verify)
> the meta-/data (does it though?  I would expect it to check the
> checksums, but who knows...?  Maybe it's "optimized" to skip that
> step?) and also perform the "consolidation" of the chunk level.
Scrub uses fewer resources than balance.  Balance has to read _and_
re-write all data in the FS regardless of the state of that data.
Scrub only needs to read the data if it's good, and if it's bad it only
(for raid1) has to re-write the replica that's bad, not both of them.
In fact, the only practical reason to run balance on a regular basis at
all is to compact allocations and defragment free space, which is why I
only have it balance chunks that are less than 1/5 full.
>
> I wish there was some more "integrated" solution for this: a
> balance-like operation which consolidates the chunks and also
> defragments the file extents at the same time, while passively
> uncovering (and fixing, if necessary and possible) any checksum
> mismatches / data errors, so that balance and defrag can't work
> against each other and the overall work is minimized (compared to
> several full runs or many different commands).
More than 90% of the time, the performance difference between the
absolute optimal layout and the one you get from just running defrag
and then balance is so small as to be insignificant.  The closer you
get to the optimal layout, the lower the returns for optimizing further
(and this applies to any filesystem, in fact).  In essence, it's a bit
like the traveling salesman problem: any arbitrary solution probably
isn't optimal, but it's generally close enough not to matter.

As far as scrub fitting into all of this, I'd personally rather have a
daemon that slowly (less than 1% bandwidth usage) scrubs the FS over
time in the background and logs and fixes any errors it encounters
(similar to how filesystem scrubbing works in many clustered
filesystems), instead of always having to manually invoke it and jump
through hoops to keep the bandwidth usage reasonable.
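
To make the regimen above concrete, in root's crontab it would look
roughly like this (the mount point is a placeholder, adjust the
schedule to your workload, you may need the full path to btrfs
depending on cron's PATH, and treat it as a sketch rather than
something I've tested verbatim):

  # Daily: compact chunks that are less than 1/5 full; keeps slack
  # space and free-space fragmentation down.
  0 3 * * *  btrfs balance start -dusage=20 -musage=20 /mnt/data

  # Weekly: full scrub to catch (and where possible repair) silent
  # corruption.
  0 4 * * 0  btrfs scrub start -B /mnt/data

  # Monthly: recursive defrag, scheduled ahead of that day's balance.
  0 2 1 * *  btrfs filesystem defragment -r /mnt/data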
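
And until something like that daemon exists, the closest approximation
I know of is running the scrub in the idle I/O class, for example
(again, the path is a placeholder, this limits priority rather than
capping bandwidth, and how much effect it has depends on the kernel
version and I/O scheduler in use):

  # Let scrub use only otherwise-idle disk bandwidth.
  ionice -c 3 btrfs scrub start -B /mnt/data

  # scrub can also set the I/O priority class itself:
  btrfs scrub start -B -c 3 /mnt/data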