To: linux-btrfs@vger.kernel.org
From: Duncan <1i5t5.duncan@cox.net>
Subject: Re: Options for SSD - autodefrag etc?
Date: Sun, 26 Jan 2014 21:44:00 +0000 (UTC)

Martin Steigerwald posted on Sat, 25 Jan 2014 13:54:40 +0100 as
excerpted:

> Hi Duncan,
>
> On Friday, 24 January 2014, 06:54:31, Duncan wrote:
>> Anyway, yes, I turned autodefrag on for my SSDs, here, but there are
>> arguments to be made in either direction, so I can understand people
>> choosing not to do that.
>
> Do you have numbers to back up that this gives any advantage?

Your post (like some of mine) reads like a stream of consciousness more
than a well organized post, making it somewhat difficult to reply to (I
guess I'm now experiencing the pain others sometimes mention when trying
to reply to some of mine).  However, I'll try...

I haven't done benchmarks, etc., nor do I have any at hand to quote, if
that's what you're asking for.  But of course I did say I understand the
arguments made by both sides, and just gave the reasons why I made the
choice I did, here.

What I /do/ have is the multiple posts here on this list from people
complaining about pathologic[1] performance issues due to fragmentation
of large internally-rewritten files, even on SSDs, particularly so when
interacting with non-trivial numbers of snapshots as well.  That's a
case that at present simply Does. Not. Scale. Period!

Of course the multi-gig internally-rewritten-file case is better suited
to the NOCOW extended attribute than to autodefrag, but anyway...

> I have it disabled and yet I have things like:
>
> Oh, this is insane.  This filefrag has been running for over [five
> minutes] already, hogging one core and eating almost 100% of its
> processing power.
>
> /usr/bin/time -v filefrag soprano-virtuoso.db
>
> Well, now that command completed:
>
> soprano-virtuoso.db: 93807 extents found
>         Command being timed: "filefrag soprano-virtuoso.db"
>         User time (seconds): 0.00
>         System time (seconds): 338.77
>         Percent of CPU this job got: 98%
>         Elapsed (wall clock) time (h:mm:ss or m:ss): 5:42.81

I don't see any mention of the file size.  I'm (informally) collecting
data on that sort of thing ATM, since it's exactly the sort of thing I
was referring to, and I've seen enough posts on the list about it to
have caught my interest.  FWIW I'll guess something over a gig, perhaps
2-3 gigs...

Also FWIW, while my desktop of choice is indeed KDE, I'm running gentoo,
and turned off USE=semantic-desktop and related flags some time ago
(early KDE 4.7, so over 2.5 years ago now), entirely purging nepomuk,
virtuoso, etc., from my system.

That was well before I switched to btrfs, but the performance
improvement from not just turning it off at runtime (I already had it
off at runtime) but entirely purging it from my system was HUGE.  I mean
clean-all-the-malware-off-an-MS-Windows-machine-and-watch-it-fly HUGE,
*WELL* more than I expected!  (I had /expected/ just to get rid of a few
packages that I'd no longer have to update, with little or no
performance improvement at all, since I already had the data indexing,
etc., turned off at runtime to the extent that I could.  Boy was I
surprised, but in a GOOD way! =:^)

Anyway, because I have that stuff not only disabled at runtime but
entirely turned off at build time and purged from the system as well, I
don't have such a database file available here to compare with yours.
But I'd certainly be interested in knowing how big yours actually is,
since I already have both the filefrag report on it and your complaint
about how long it took filefrag to compile that information and report
back.

> Well, I have some files with several tens of thousands of extents.
> But first, this is mounted with compress=lzo, so 128k is the largest
> extent size as far as I know

Well, you're mounting with compress=lzo (which I'm using too, FWIW), not
compress-force=lzo, so btrfs won't try to compress a file if it thinks
the file is already compressed.  Unfortunately, I believe there's no
tool to report whether btrfs has actually compressed a file or not, and
as you imply, filefrag doesn't know about btrfs compression yet, so just
running filefrag on a file on a compress=lzo btrfs doesn't really tell
you a whole lot. =:^(

What you /could/ do (well, after you've freed some space given your
filesystem usage information below, or perhaps to a different
filesystem) is copy the file elsewhere, using cp's --reflink=never
option just to be sure it's actually copied, and see what filefrag
reports on the new copy.  Assuming enough free space, btrfs should write
the new file as a single extent (or nearly so), so if filefrag reports a
similar number of extents on the new copy, you'll know it's compression
related, while if it reports only one or a small handful of extents,
you'll know the original wasn't compressed and it's real fragmentation.

It would also be interesting to know how long a filefrag on the new file
takes, compared to the original, but in order to get an apples-to-apples
comparison, you'd have to either drop caches before running filefrag on
the new copy, or reboot, since after the copy it'd be cached, while the
5+ minute time on the original above was presumably with very little of
the file actually cached.
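Something like this would do it, as a rough sketch (the scratch
destination path is hypothetical, and dropping caches needs root):

cp --reflink=never soprano-virtuoso.db /mnt/scratch/virtuoso-copy.db
sync
echo 3 > /proc/sys/vm/drop_caches   # cold cache, for comparable timings
/usr/bin/time -v filefrag /mnt/scratch/virtuoso-copy.db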
And of course you could temporarily mount without the compress=lzo
option and do the copy, if you find it is the compression triggering the
extents report from filefrag, just to see the difference compression
makes.  Or similarly, you could mount with compress-force=lzo and try
it, if you find btrfs isn't compressing the file with ordinary
compress=lzo, again to see the difference that makes.

> and second: I did manual btrfs filesystem defragment on files like
> those and never ever perceived any noticeable difference in
> performance.
>
> Thus I just gave up on trying to defragment stuff on the SSD.

I still say it'd be interesting to see the (cold-cache) filefrag report
and timing on a fresh copy, compared to the 5-minute-plus timing above.

> And this is really quite high.
>
> But… I think I have a more pressing issue with that BTRFS /home on an
> Intel SSD 320, and that is that it is almost full:
>
> merkaba:~> LANG=C df -hT /home
> Filesystem               Type   Size  Used Avail Use% Mounted on
> /dev/mapper/merkaba-home btrfs  254G  241G  8.5G  97% /home

Yeah, that's uncomfortably close to full...

(FWIW, it's also interesting comparing that to a df on my /home...

$>> df .
Filesystem     2M-blocks  Used Available Use% Mounted on
/dev/sda6          20480 12104      7988  61% /h

As you can see, I'm using 2M blocks (alias df="df -B2M"), but the
filesystem is raid1 in both data and metadata, so the raw numbers are
doubled and the 2M blocks thus read as 1M-block equivalents.  (You can
also see that I've actually mounted it on /h, not /home.  /home is
actually a symlink to /h just in case, but I export HOME=/h/whatever,
and most programs honor that.)

So the partition size is 20480 MiB or 20.0 GiB, with ~12 GiB used and
just under 8 GiB available.  It can be and is so small because I have a
dedicated media partition with all the big stuff located elsewhere
(still on reiserfs on spinning rust, as a matter of fact).  Just
interesting to see how people set up their systems differently, is all,
thus the "FWIW".  But the small independent partitions do make for much
shorter balance times, etc.! =:^)

> merkaba:~> btrfs filesystem show […]
> Label: home  uuid: […]
>         Total devices 1 FS bytes used 238.99GiB
>         devid 1 size 253.52GiB used 253.52GiB path [...]
>
> Btrfs v3.12
>
> merkaba:~> btrfs filesystem df /home
> Data, single: total=245.49GiB, used=237.07GiB
> System, DUP: total=8.00MiB, used=48.00KiB
> System, single: total=4.00MiB, used=0.00
> Metadata, DUP: total=4.00GiB, used=1.92GiB
> Metadata, single: total=8.00MiB, used=0.00

It has come up before on this list, and it doesn't hurt anything, but
those extra system-single and metadata-single chunks can be removed.  A
balance with a zero-usage filter should do it.  Something like this:

btrfs balance start -musage=0 /home

That will act on metadata chunks with usage=0 only.  It may or may not
act on the system chunk as well.  Here it does, as metadata implies
system too, but someone reported that it didn't, for them.  If it
doesn't...

btrfs balance start -f -susage=0 /home

... should do it.  (-f is force, needed if acting on the system chunk
only.)

https://btrfs.wiki.kernel.org/index.php/Balance_Filters

(That's for the filter info, which isn't well documented in the manpage
yet.  The manpage documents btrfs balance fairly well otherwise, tho.)

Anyway... 253.52 GiB used of 253.52 GiB total in filesystem show.
That's full enough you may not even be able to balance, as there are no
unallocated blocks left to allocate for the balance.  But the usage=0
thing may get you a bit of room, after which you can try usage=1, etc.,
to hopefully recover a bit more, until you get at least /some/
unallocated space as a buffer to work with.  Right now, you're risking
being unable to allocate anything more when data or metadata runs out,
and I'd be worried about that.
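If the usage=0 pass does free up a chunk or two, gradually raising the
filter usually recovers more without rewriting everything.  A sketch
(untested here; adjust the mountpoint, and check btrfs filesystem df
between steps):

for u in 0 1 5 10 25; do
    btrfs balance start -dusage=$u /home  # rewrites data chunks <= u% full
done

The nearly-empty chunks are the cheap wins; pushing the filter much
higher rewrites a lot of data for little gain, which is probably not
what you want on an already nearly full SSD.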
> Okay, I could probably get back 1.5 GiB on metadata, but whenever I
> tried a btrfs filesystem balance on any of the BTRFS filesystems on my
> SSD, I usually got the following unpleasant result:
>
> Half the performance.  Like double boot times on / and such.

That's weird.  I wonder why/how, unless it's simply so full an SSD that
the firmware's having serious trouble doing its thing.  I know I've seen
nothing like that on my SSDs.

But then again, my usage is WILDLY different, with my largest partition
24 gigs, and only about 60% of the SSD even partitioned at all, because
I keep the big stuff like media files on spinning rust (and reiserfs,
not btrfs), so the firmware has *LOTS* of room to shuffle blocks around
for write-cycle balancing, etc.  And of course I'm using a different
brand of SSD.  (FWIW, a Corsair Neutron 256 GB, 238 GiB, *NOT* the
Neutron GTX.)  But if anything, Intel SSDs have a better rep than my
Corsair Neutrons do, so I doubt that has anything to do with it.

> So I have the following thoughts:
>
> 1) I am not yet clear whether defragmenting files on SSD will really
> bring a benefit.

Of course that's the question of the entire thread.  As I said, I have
it turned on here, but I understand the arguments for both sides, and
from here that question does appear to remain open for debate.

One other related critical point while we're on the subject.  A number
of people have reported that at least for some distros installed to
btrfs, brand-new installs are coming up significantly fragmented.
Apparently some distros do their install to btrfs mounted without
autodefrag turned on.  And once there's existing fragmentation, turning
on autodefrag /then/ results in a slowdown for several boot cycles, as
normal usage detects, queues, and defrags all those already-fragmented
files.  There's an eventual speedup (at least on spinning rust; SSDs of
course are open to question, thus this thread), but the system has to
work thru the existing backlog of fragmentation before you'll see it.

Of course one way out of that (temporary, but sometimes lasting several
days of) pain is to deliberately run a recursive btrfs defrag on the
entire filesystem, as sketched below (new enough btrfs-progs have a
recursive flag; before that, one had to play some tricks with find, as
documented on the wiki).  That will be more intense pain, but it'll be
over faster! =:^)
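For reference, that looks something like this (a sketch; the mountpoint
is made up, and the find variant approximates the wiki trick for
btrfs-progs without the recursive flag):

# new enough btrfs-progs: recursive defrag of the whole tree
btrfs filesystem defragment -r /mnt/target

# older progs: defragment file by file, staying on this one filesystem
find /mnt/target -xdev -type f -exec btrfs filesystem defragment {} +

Note that with lots of snapshots this can eat space, since defragmenting
breaks the reflinks that snapshots share.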
The point being: if a reader is considering autodefrag, be SURE to turn
it on BEFORE there's a whole bunch of already-fragmented data on the
filesystem.  Ideally, turn it on for the first mount after the
mkfs.btrfs, and never mount without it.  That ensures there's never a
chance for fragmentation to get out of hand in the first place. =:^)

(Well, with the additional caveat that the NOCOW extended attribute is
used appropriately on internal-rewrite files such as VM images,
databases, bittorrent preallocations, etc., when said file approaches a
gig or larger.  But that is discussed elsewhere.)
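In practice, that means setting the attribute on the containing
directory before the files exist, since NOCOW only takes effect on newly
created files.  A sketch (the directory path is hypothetical):

mkdir /var/lib/vm-images
chattr +C /var/lib/vm-images   # files created in here inherit NOCOW
lsattr -d /var/lib/vm-images   # verify: a C should show in the list

Setting +C on an already-written file doesn't reliably convert it; the
usual trick is to create a fresh file in a +C directory and copy the
data over.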
> 2) On my /home the problem is more that it is almost full, and free
> space appears to be highly fragmented.  Long fstrim times tend to
> agree with that:
>
> merkaba:~> /usr/bin/time fstrim -v /home
> /home: 13494484992 bytes were trimmed
> 0.00user 12.64system 1:02.93elapsed 20%CPU

Some people wouldn't call a minute "long", but yeah, on an SSD, even at
several hundred gigs, that's definitely not "short".

It's not directly comparable because, as I explained, my partition sizes
are so much smaller, but for reference, a trim on my 20-gig /home took a
bit over a second.  Doing the math, that'd be 10-20 seconds for 200+
gigs.  That you're seeing a minute does indeed seem to indicate high
free-space fragmentation.

But again, I'm at under 60% of SSD space even partitioned, so there's
LOTS of space for the firmware to do its management thing.  If your SSD
is 256 gig like mine, with 253+ gigs used (well, I see below it's 300
gig, but still...), especially if you're not running with the discard
mount option (which could be an entire thread of its own, but at least
there's some official guidance on it), that firmware could be working
pretty hard indeed with the resources it has at its disposal!

I expect you'd see quite a difference if you could reduce that to, say,
80% partitioned and trim the other 20%, giving the firmware a solid 20%
extra space to work with.  If you could then give btrfs some headroom on
the reduced-size partition as well, well...
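If you do shrink things, the freed tail of the disk can be handed back
to the firmware with blkdiscard.  A sketch with a made-up offset; this
DESTROYS whatever lives past the offset, so triple-check your partition
table first:

# assume the last partition ends at 240 GiB and the rest is unused
blkdiscard --offset $((240 * 1024**3)) /dev/sda

With no --length given, blkdiscard discards from the offset to the end
of the device, so take the real offset from your own partition table,
not from this example.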
> 3) Turning autodefrag on might fragment free space even more.

Now, yes.  As I stressed above, turn it on when the filesystem's new,
before you start loading it with content, and the story should be quite
different.  Don't give it a chance to fragment in the first place. =:^)

> 4) I have no clear conclusion on what maintenance other than scrubbing
> might make sense for BTRFS filesystems on SSDs at all.  Everything I
> tried either did not have any perceivable effect or made things worse.

Well, of course there's backups.  Given that btrfs isn't fully
stabilized yet and there are still bugs being worked out, those are
*VITAL* maintenance! =:^)

Also, for the same reason (btrfs isn't yet fully stable), I recently
refreshed and double-checked my backups, then blew away the existing
btrfs with a fresh mkfs.btrfs and restored from backup.  The brand-new
filesystems now make use of several features that the older ones didn't
have, including the new 16k nodesize default. =:^)  For anyone who has
been running btrfs for a while, that's potentially a nice improvement.

I expect to do the same thing at least once more, later on after btrfs
has settled down to more or less routine stability, just to clear out
any remaining not-fully-stable-yet corner cases that may eventually come
back to haunt me if I don't, as well as to update the filesystem to take
advantage of any further format updates between now and then.  That's
useful btrfs maintenance, SSD or no SSD. =:^)

> Thus for SSDs, except for the scrubbing and the occasional fstrim, I'd
> be done with it.
>
> For harddisks I enable autodefrag.
>
> But still, for now this is only guesswork.  I don't have much clue
> about BTRFS filesystem maintenance yet, and I just remember the slogan
> on the xfs.org wiki:
>
> "Use the defaults."

=:^)

> I would love to hear some more or less official words from BTRFS
> filesystem developers on that.  But for now I think one of the best
> optimizations would be to complement that 300 GB Intel SSD 320 with a
> 512 GB Crucial m5 mSATA SSD or some Intel mSATA SSDs (but these cost
> twice as much), and make more free space on /home again.  For critical
> data, regarding data safety and amount of accesses, I could even use
> BTRFS RAID 1 then.

Indeed.  I'm running btrfs raid1 mode with my SSDs (except for /boot,
where I have a separate one configured on each drive, so I can
grub-install to one and test it before doing the other, without
endangering my ability to boot off the other should something go wrong).

> All those MP3s and photos I could place on the bigger mSATA SSD.
> Granted, an SSD is definitely not needed for those, but it is just
> quieter.  I never realized how loud even a tiny 2.5-inch laptop drive
> is until I switched an external one on while using this ThinkPad T520
> with its SSD.  For the first time I heard the harddisk clearly.  Thus
> I'd prefer an SSD anyway.

Well, yes.  But SSDs cost money.

And at least here, while I could justify two SSDs in raid1 mode for my
critical data, and even overprovision such that I have nearly 50% of the
available space entirely unpartitioned, I really couldn't justify
spending SSD money on gigs of media files.  But as they say, YMMV...

---
[1] Pathologic: THAT is the word I was looking for in several recent
posts but couldn't remember, not "pathetic" -- "pathologic"!  But all I
could think of was pathetic, and I knew /that/ wasn't what I wanted, so
I explained using other words instead.  So if you see any of my other
recent posts on the issue and think I'm describing a pathologic case
using other words, it's because I AM!

-- 
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman