From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: from plane.gmane.org ([80.91.229.3]:56519 "EHLO plane.gmane.org"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756204AbaITF7H
	(ORCPT ); Sat, 20 Sep 2014 01:59:07 -0400
Received: from list by plane.gmane.org with local (Exim 4.69)
	(envelope-from ) id 1XVDhI-0006Vd-VI
	for linux-btrfs@vger.kernel.org; Sat, 20 Sep 2014 07:59:04 +0200
Received: from ip68-231-22-224.ph.ph.cox.net ([68.231.22.224])
	by main.gmane.org with esmtp (Gmexim 0.1 (Debian))
	id 1AlnuQ-0007hv-00 for ; Sat, 20 Sep 2014 07:59:04 +0200
Received: from 1i5t5.duncan by ip68-231-22-224.ph.ph.cox.net with local
	(Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ;
	Sat, 20 Sep 2014 07:59:04 +0200
To: linux-btrfs@vger.kernel.org
From: Duncan <1i5t5.duncan@cox.net>
Subject: Re: Performance Issues
Date: Sat, 20 Sep 2014 05:58:52 +0000 (UTC)
Message-ID: 
References: <1411129114.1811.7.camel@zarniwoop.blob>
	<541C464F.3030600@fb.com> <1411145469.1601.2.camel@zarniwoop.blob>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Sender: linux-btrfs-owner@vger.kernel.org
List-ID: 

Rob Spanton posted on Fri, 19 Sep 2014 17:51:09 +0100 as excerpted:

> The evolution problem has been improved: the sqlite db that it was
> using had over 18000 fragments, so I got evolution to recreate that
> file with nocow set.  It now takes "only" 30s to load my mail rather
> than 80s, which is better...
>
> On Fri, 2014-09-19 at 11:05 -0400, Josef Bacik wrote:
>> Weird, I get the exact opposite performance.  Anyway it's probably
>> because of your file layouts, try defragging your git dir and see if
>> that helps.  Thanks,
>
> Defragging has improved matters a bit: it now takes 26s (was 46s) to
> run git status.  Still not amazing, but at the moment I have no
> evidence to suggest that it's not something to do with the machine's
> hardware.  If I get time over the weekend I'll dig out an external
> hard disk and try a couple of benchmarks with that.
[Replying via mail and list both, as requested.]

If you're snapshotting those nocow files, be aware (if you aren't
already) that nocow, snapshots and defrag (all on the same files) don't
work all that well together...

First let's deal with snapshots of nocow files.

What does a snapshot do?  It locks in place the existing version of a
file, both logically, so you can get at that version of it via the
snapshot even after changes have been made, and physically, locking the
existing extents where they are.

With normal cow files this is fine, since any change causes the changed
block to be written elsewhere, freeing the now-replaced block if
there's nothing holding it in place.  A snapshot simply keeps a
reference to the existing extent when the data is cowed elsewhere,
instead of releasing it, so there's a way to get the old version, as
referenced by that snapshot, back too.

But nocow files are normally overwritten in place; that's what nocow
/is/.  Obviously that conflicts with what a snapshot does, locking the
existing version in place.

What btrfs does to handle that, then, is force a cow write anyway on
the first write to a (4 KiB) block of a (normally) nocow file after a
snapshot.  The file remains nocow, and additional writes to the /same/
block continue to go to the same new location... until another snapshot
locks /that/ in place.

All fine if you're just doing occasional snapshots and/or if the nocow
file isn't being very actively rewritten after all; it's not that big a
deal in that case.

*BUT*, if you're doing time-based snapshots, say every hour or so, and
the file is actively being semi-randomly rewritten, the constant
snapshotting, locking in place the current version and forcing many of
those writes to cow anyway, is going to end up fragmenting that file
nearly as fast as it would fragment without the nocow.  IOW, the nocow
ends up being nearly worthless on that file!

There is a (partial) workaround, however.
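(As an aside, since it trips people up: nocow only takes reliably if
the file gets +C while it's still empty, which is why the usual recipe
is to set +C on the directory and recreate the file inside it -- which
sounds like what you did with the sqlite db.  A sketch, with
hypothetical paths:)

```shell
# Mark a directory nocow; files *created* in it inherit the C
# attribute.  chattr +C on an existing non-empty file is unreliable.
mkdir -p /mnt/data/db
chattr +C /mnt/data/db

# Recreate the database inside the nocow directory so the new copy
# inherits +C (paths here are just examples).
cp ~/old-mail.sqlite /mnt/data/db/mail.sqlite
lsattr /mnt/data/db/mail.sqlite    # the C flag should show up

# Any read-only snapshot now pins the current extents; the first
# write to each block after this point gets cowed once anyway.
btrfs subvolume snapshot -r /mnt/data /mnt/snapshots/data-$(date +%F)
```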
You can use the fact that snapshots stop at subvolume boundaries,
putting the nocow files on their own dedicated subvolume.  You can then
continue snapshotting the up-tree subvolume as you were before, and
it'll stop at the dedicated subvolume, so the nocow files on that
subvolume don't get snapshotted and thus don't get fragmented anyway.
Of course without that snapshotting you'll need to do conventional
backups of the files in that dedicated nocow subvolume.

Another alternative is to continue snapshotting the dedicated subvolume
and its nocow files, but at a lower frequency, perhaps every day or
twice a day instead of every hour, or maybe twice a week instead of
daily, or whatever.  That will slow down, but not eliminate, the
snapshot-triggered fragmentation of the nocow files.

If you then combine that with scheduled (presumably cron-job or
systemd-timer) defrag of that dedicated subvolume, perhaps weekly or
monthly depending on how fast it still fragments, that can help keep
performance from dragging down too badly.  Of course you can use the
scheduled-defrag technique without the dedicated subvolume, and just up
the frequency of the defrags instead of decreasing the frequency of the
snapshotting, too, if that works better for you.

Meanwhile, how big are those files?  If you're not dealing with any
nocow-candidate files approaching a gig or larger, you may find that
the autodefrag mount option helps.  However, it works by queuing up a
rewrite of the entire file for a worker thread that comes along a bit
later, and if the file is too big and being written to too much, the
changes to the file can end up coming in faster than the file can be
rewritten.  Obviously that's not a good thing.

Generally, for files under 100 MB autodefrag works very well.  For
actively rewritten files over a GB it doesn't work well at all, and for
files between 100 MB and 1 GB it depends on the speed of your hardware
and how fast the rewrites are coming in.
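For reference, the dedicated-subvolume arrangement is just a few
commands; this is a sketch assuming an illustrative /mnt/data layout,
not your actual paths:

```shell
# Give the nocow files their own subvolume; snapshots of the parent
# stop at its boundary, so these files won't be re-pinned hourly.
btrfs subvolume create /mnt/data/nocow
chattr +C /mnt/data/nocow          # new files created here inherit +C

# Hourly snapshots of the parent no longer capture /mnt/data/nocow:
btrfs subvolume snapshot -r /mnt/data /mnt/snapshots/data-hourly

# Instead, defrag the dedicated subvolume on a schedule, e.g. from a
# weekly cron job or systemd timer:
btrfs filesystem defragment -r /mnt/data/nocow
```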
Actually, most folks seem to be OK up to a quarter GiB or so, and most
folks have problems starting around 3/4 GiB or so.  256-768 MiB is the
YMMV zone.

Meanwhile, from what I've read, sqlite apparently works best with under
half a gig of data to manage anyway; beyond that it's time to consider
scaling up to something like mysql/mariadb.  So for most people, if all
they're dealing with is sqlite files, those are usually under half a
gig in size and the autodefrag mount option works at least reasonably
well.

But I mentioned defrag as not working so well with snapshots too.  The
problem there is somewhat different.

Before kernel 3.9, btrfs defrag wasn't snapshot-aware -- it would
defrag just the current snapshot, leaving the others in place.  This of
course duplicated the data that defrag moved, since the old locations
couldn't be freed while other snapshots were still referencing them,
thus eating up space rather faster than might have been expected.

With 3.9, defrag became snapshot-aware, and would track and adjust all
references to a block when defrag moved it.  Unfortunately, that first
attempt had **HUGE** scaling issues -- defrags that should have taken
hours were taking days, even weeks, and multiple gigabytes of memory,
such that it was running into out-of-memory errors even on 16 and 32
GiB RAM machines!  (IIRC we had one report of it happening on a 64 gig
machine too!)  Let alone the poor 32-bit folks!

It turned out that quotas and snapshots were the big culprits, and
people with thousands of snapshots (as can happen with snapper and the
like if it's not set to thin them out regularly) AND quotas enabled
simply found defrag didn't work at all for them.

So along about 3.12 (I'm not sure exactly), that first attempt at
snapshot-aware-defrag was disabled again, so people could at least
/run/ defrag.  While they've rewritten various bits to scale MUCH
better now, snapshot-aware-defrag remains disabled for the time being.
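If you want to see where your files fall before reaching for
autodefrag, something like this will do (the evolution path is just an
example; substitute your own):

```shell
# How big is the file, and how fragmented is it right now?
du -h ~/.local/share/evolution/mail.sqlite
filefrag ~/.local/share/evolution/mail.sqlite   # reports extent count

# Well under ~256 MiB?  autodefrag is usually a safe bet; it can be
# enabled on a live filesystem with a remount:
mount -o remount,autodefrag /mnt/data
```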
Which means defrag is again only working on the current snapshot it's
pointed at, leaving the other snapshots in place as they are.  Which
means if you're snapshotting and defragging, and not deleting those
snapshots within a reasonable time, data usage is going to go up as
defrag duplicates the data it moves around: it's only moving it for the
current snapshot, so the references other snapshots hold to the
fragmented version remain in place, continuing to take up space until
all snapshots referencing the old blocks are deleted.

So if you're doing regular snapshots, try to keep them to a reasonably
limited time frame, with conventional backups before that if needed.
If you can keep snapshots to a month's time, great.  But do try to keep
it to 60 or 90 days if possible.  Beyond that, keep conventional
backups if you need to.  And since btrfs is still under development and
not fully stable, such backups are strongly recommended anyway.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman