From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: Options for SSD - autodefrag etc?
Date: Sun, 26 Jan 2014 21:44:00 +0000 (UTC)
Message-ID: <pan$7fa2c$63fcd33a$64312f4a$d9ed1b9@cox.net>
In-Reply-To: 2056077.lUO7kt9W8r@merkaba
Martin Steigerwald posted on Sat, 25 Jan 2014 13:54:40 +0100 as excerpted:
> Hi Duncan,
>
> Am Freitag, 24. Januar 2014, 06:54:31 schrieb Duncan:
>> Anyway, yes, I turned autodefrag on for my SSDs, here, but there are
>> arguments to be made in either direction, so I can understand people
>> choosing not to do that.
>
> Do you have numbers to back up that this gives any advantage?
Your post (like some of mine) reads like a stream of consciousness more
than a well organized post, making it somewhat difficult to reply to (I
guess I'm now experiencing the pain others sometimes mention when trying
to reply to some of mine). However, I'll try...
I haven't done benchmarks, etc, nor do I have them at hand to quote, if
that's what you're asking for. But of course I did say I understand the
arguments made by both sides, and just gave the reasons why I made the
choice I did, here.
What I /do/ have is multiple posts here on this list from people
complaining about pathologic[1] performance issues due to fragmentation
of large internally-rewritten files even on SSDs, particularly when
interacting with non-trivial numbers of snapshots as well. That's a case
that at present simply Does. Not. Scale. Period!
Of course the multi-gig internal-rewritten-file case is better suited to
the NOCOW extended attribute than to autodefrag, but anyway...
> I have it disabled and yet I have things like:
>
> Oh, this is insane. This filefrag has been running for over [five
> minutes] already. And it's hogging one core, eating almost 100% of its
> processing power.
> /usr/bin/time -v filefrag soprano-virtuoso.db
> Well, now that command completed:
>
> soprano-virtuoso.db: 93807 extents found
> Command being timed: "filefrag soprano-virtuoso.db"
> User time (seconds): 0.00
> System time (seconds): 338.77
> Percent of CPU this job got: 98%
> Elapsed (wall clock) time (h:mm:ss or m:ss): 5:42.81
I don't see any mention of the file size. I'm (informally) collecting
data on that sort of thing ATM, since it's exactly the sort of thing I
was referring to, and I've seen enough posts on the list about it to have
caught my interest.
FWIW I'll guess something over a gig, perhaps 2-3 gigs...
Also FWIW, while my desktop of choice is indeed KDE, I'm running gentoo,
and turned off USE=semantic-desktop and related flags some time ago
(early kde 4.7, so over 2.5 years ago now), entirely purging nepomuk,
virtuoso, etc, from my system. That was well before I switched to btrfs,
but the performance improvement from not just turning it off at runtime
(I already had it off at runtime) but entirely purging it from my system
was HUGE, I mean like clean all the malware off an MS Windows machine and
see how much faster it runs HUGE, *WELL* more than I expected! (I had
/expected/ just to get rid of a few packages that I'd no longer have to
update, with little or no performance improvement at all, since I already
had the data indexing, etc, turned off to the extent that I could, at
runtime.)
Boy was I surprised, but in a GOOD way! =:^)
Anyway, because I have that stuff not only disabled at runtime but
entirely turned off at build time and purged from the system as well, I
don't have such a database file available here to compare with yours.
But I'd certainly be interested in knowing how big yours actually was,
since I already have both the filefrag report on it, and your complaint
about how long it took filefrag to compile that information and report
back.
> Well, I have some files with several tens of thousands of extents. But
> first, this is mounted with compress=lzo, so 128k is the largest extent
> size as far as I know
Well, you're mounting with compress=lzo (which I'm using too, FWIW), not
compress-force=lzo, so btrfs won't try to compress it if it thinks it's
already compressed.
Unfortunately, I believe there's no tool to report on whether btrfs has
actually compressed the file or not, and as you imply filefrag doesn't
know about btrfs compression yet, so just running the filefrag on a file
on a compress=lzo btrfs doesn't really tell you a whole lot. =:^(
What you /could/ do (well, after you've freed some space given your
filesystem usage information below, or perhaps to a different filesystem)
would be to copy the file elsewhere, using a plain non-reflink copy just
to be sure it's actually copied, and see what filefrag reports. Assuming
enough free space btrfs should write the new file as a single extent, so
if filefrag reports a similar number of extents on the new copy, you'll
know it's compression related, while if it reports only one or a small
handful of extents, you'll know the original wasn't compressed and it's
real fragmentation.
It would also be interesting to know how long a filefrag on the new file
takes, as compared to the original, but in order to get an apples-to-
apples comparison, you'd have to either drop-caches before doing the
filefrag on the new one, or reboot, since after the copy it'd be cached,
while the 5+ minute time on the original above was presumably with very
little of the file actually cached.
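Concretely, something like this, roughly (untested here; the destination
path is just a placeholder, and the cache drop needs root):

  # real copy, no reflink, ideally somewhere with plenty of free space
  # (--reflink=never forces a real copy; if your cp lacks it, a plain cp
  # to a different filesystem does the same)
  cp --reflink=never soprano-virtuoso.db /mnt/other/soprano-copy.db

  # drop the page cache so both timings start cold
  sync
  echo 3 > /proc/sys/vm/drop_caches

  # then compare extent count and wall-clock time against the original
  /usr/bin/time -v filefrag /mnt/other/soprano-copy.db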
And of course you could temporarily mount without the compress=lzo option
and do the copy, if you find it is the compression triggering the extents
report from filefrag, just to see the difference compression makes. Or
similarly, you could mount with compress-force=lzo and try it, if you
find btrfs isn't compressing the file with ordinary compress=lzo, again
to see the difference that makes.
> and second: I did a manual btrfs filesystem defragment on
> files like those and never ever perceived any noticeable difference
> in performance.
>
> Thus I just gave up on trying to defragment stuff on the SSD.
I still say it'd be interesting to see the (from cold-cache) filefrag
report and timing on a fresh copy, compared to the 5 minute plus timing
above.
> And this is really quite high.
> But… I think I have a more pressing issue with that BTRFS /home
> on an Intel SSD 320 and that is that it is almost full:
>
> merkaba:~> LANG=C df -hT /home
> Filesystem Type Size Used Avail Use% Mounted on
> /dev/mapper/merkaba-home btrfs 254G 241G 8.5G 97% /home
Yeah, that's uncomfortably close to full...
(FWIW, it's also interesting comparing that to a df on my /home...
$>> df .
Filesystem 2M-blocks Used Available Use% Mounted on
/dev/sda6 20480 12104 7988 61% /h
As you can see I'm using 2M blocks (alias df=df -B2M), but the filesystem
is raid1 both data and metadata, so the numbers would be double and the
2M blocks are thus 1M block equivalent. (You can also see that I've
actually mounted it on /h, not /home. /home is actually a symlink to /h
just in case, but I export HOME=/h/whatever, and most programs honor
that.)
So the partition size is 20480 MiB or 20.0 GiB, with ~12+ GiB used, just
under 8 GiB available.
It can be and is so small because I have a dedicated media partition with
all the big stuff located elsewhere (still on reiserfs on spinning rust,
as a matter of fact).
Just interesting to see how people setup their systems differently, is
all, thus the "FWIW". But the small independent partitions do make for
much shorter balance times, etc! =:^)
> merkaba:~> btrfs filesystem show […]
> Label: home uuid: […]
> Total devices 1 FS bytes used 238.99GiB
> devid 1 size 253.52GiB used 253.52GiB path [...]
>
> Btrfs v3.12
>
> merkaba:~> btrfs filesystem df /home
> Data, single: total=245.49GiB, used=237.07GiB
> System, DUP: total=8.00MiB, used=48.00KiB
> System, single: total=4.00MiB, used=0.00
> Metadata, DUP: total=4.00GiB, used=1.92GiB
> Metadata, single: total=8.00MiB, used=0.00
It has come up before on this list and doesn't hurt anything, but those
extra system-single and metadata-single chunks can be removed. A balance
with a zero usage filter should do it. Something like this:
btrfs balance start -musage=0 /home
That will act on metadata chunks with usage=0 only. It may or may not
act on the system chunk. Here it does, and metadata implies system also,
but someone reported it didn't, for them. If it doesn't...
btrfs balance start -f -susage=0 /home
... should do it. (-f=force, needed if acting on system chunk only.)
https://btrfs.wiki.kernel.org/index.php/Balance_Filters
(That's for the filter info, not well documented in the manpage yet. The
manpage documents btrfs balance fairly well tho, other than that.)
Anyway... 253.52 GiB used of 253.52 GiB total in filesystem show. That's
full enough that you may not even be able to balance, as there are no
unallocated blocks left to allocate for the balance. But the usage=0
thing may get
you a bit of room, after which you can try usage=1, etc, to hopefully
recover a bit more, until you get at least /some/ unallocated space as a
buffer to work with. Right now, you're risking being unable to allocate
anything more when data or metadata runs out, and I'd be worried about
that.
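FWIW, one pattern that gets suggested for that (just a sketch; adjust the
mountpoint, and stop as soon as filesystem show reports some unallocated
space again) is stepping the usage filter up gradually:

  for u in 0 1 5 10 25; do
      btrfs balance start -dusage=$u -musage=$u /home
  done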
> Okay, I could probably get back 1.5 GiB on metadata, but whenever I
> tried a btrfs filesystem balance on any of the BTRFS filesystems on my
> SSD I usually got the following unpleasant result:
>
> Half the performance. Like doubled boot times on / and such.
That's weird. I wonder why/how, unless it's simply so full an SSD that
the firmware's having serious trouble doing its thing. I know I've seen
nothing like that on my SSDs. But then again, my usage is WILDLY
different, with my largest partition 24 gigs, and only about 60% of the
SSD even partitioned at all because I keep the big stuff like media files
on spinning rust (and reiserfs, not btrfs), so the firmware has *LOTS* of
room to shuffle blocks around for write-cycle balancing, etc.
And of course I'm using a different brand SSD. (FWIW, Corsair Neutron
256 GB, 238 GiB, *NOT* the Neutron GTX.) But if anything, Intel SSDs
have a better rep than my Corsair Neutrons do, so I doubt that has
anything to do with it.
> So I have the following thoughts:
>
> 1) I am not yet clear whether defragmenting files on SSD will really
> bring a benefit.
Of course that's the question of the entire thread. As I said, I have it
turned on here, but I understand the arguments for both sides, and from
here that question does appear to remain open for debate.
One other related critical point while we're on the subject.
A number of people have reported that at least for some distros installed
to btrfs, brand new installs are coming up significantly fragmented.
Apparently some distros do their install to btrfs mounted without
autodefrag turned on.
And once there's existing fragmentation, turning on autodefrag /then/
results in a slowdown for several boot cycles, as normal usage detects,
queues, and then defragments all those already-fragmented files.
There's an eventual speedup (at least on spinning rust, SSDs of course
are open to question, thus this thread), but the system has to work thru
the existing backlog of fragmentation before you'll see it.
Of course one way out of that (temporary but sometimes several days) pain
is to deliberately run a btrfs defrag recursive (new enough btrfs has a
recursive flag, previous to that, one had to play some tricks with find,
as documented on the wiki) on the entire filesystem. That will be more
intense pain, but it'll be over faster! =:^)
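Something along these lines (just a sketch; -r needs a new enough
btrfs-progs, and the find variant is roughly the sort of trick the wiki
documented for older tools):

  # new enough btrfs-progs: recursive defrag in one go
  btrfs filesystem defragment -r /home

  # older tools: roughly the find-based workaround
  find /home -xdev -type f -exec btrfs filesystem defragment {} +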
The point being, if a reader is considering autodefrag, be SURE to turn
it on BEFORE there's a whole bunch of already fragmented data on the
filesystem.
Ideally, turn it on for the first mount after the mkfs.btrfs, and never
mount without it. That ensures there's never a chance for fragmentation
to get out of hand in the first place. =:^)
(Well, with the additional caveat that the NOCOW extended attribute is
used appropriately on internal-rewrite files such as VM images,
databases, bittorrent preallocations, etc, when said file approaches a
gig or larger. But that is discussed elsewhere.)
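To make that concrete, a sketch (the device name is simply the one from
your df output above, and the option list is just an example):

  # /etc/fstab: autodefrag in place from the very first mount
  /dev/mapper/merkaba-home  /home  btrfs  ssd,compress=lzo,autodefrag  0 0

  # NOCOW only helps if set while a file is still empty; easiest is +C on
  # a directory, so new files created in it inherit it
  mkdir /home/vm-images
  chattr +C /home/vm-images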
> 2) On my /home problem is more that it is almost full and free space
> appears to be highly fragmented. Long fstrim times tend to agree
> with that:
>
> merkaba:~> /usr/bin/time fstrim -v /home
> /home: 13494484992 bytes were trimmed
> 0.00user 12.64system 1:02.93elapsed 20%CPU
Some people wouldn't call a minute "long", but yeah, on an SSD, even at
several hundred gig, that's definitely not "short".
It's not well comparable because as I explained, my partition sizes are
so much smaller, but for reference, a trim on my 20-gig /home took a bit
over a second. Doing the math, that'd be 10-20 seconds for 200+ gigs.
That you're seeing a minute, does indeed seem to indicate high free-space
fragmentation.
But again, I'm at under 60% SSD space even partitioned, so there's LOTS
of space for the firmware to do its management thing. If your SSD is 256
gig like mine, with 253+ gigs used (well, I see below it's 300 gig, but
still...) ... especially if you're not running with the discard mount
option (which could be an entire thread of its own, but at least there's
some official guidance on it), that firmware could be working pretty hard
indeed with the resources it has at its disposal!
I expect you'd see quite a difference if you could reduce that to say 80%
partitioned and trim the other 20%, giving the firmware a solid 20% extra
space to work with.
If you could then give btrfs some headroom on the reduced size partition
as well, well...
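If you do go that route, the general shape would be something like the
below -- just a sketch, NOT a recipe, so back up first, adjust all names
and sizes to your layout, and note the offset/length are deliberately
left as placeholders:

  # 1) shrink the btrfs filesystem to leave headroom
  btrfs filesystem resize 200G /home

  # 2) shrink the underlying partition or LV to match (parted, lvreduce,
  #    ...), leaving the freed tail of the SSD unpartitioned

  # 3) discard the now-unused area once, so the firmware can reuse it
  blkdiscard --offset <start-of-free-area> --length <size> /dev/sdX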
> 3) Turning autodefrag on might fragment free space even more.
Now, yes. As I stressed above, turn it on when the filesystem's new,
before you start loading it with content, and the story should be quite
different. Don't give it a chance to fragment in the first place. =:^)
> 4) I have no clear conclusion on what maintenance other than scrubbing
> might make sense for BTRFS filesystems on SSDs at all. Everything I
> tried either did not have any perceivable effect or made things worse.
Well, of course there's backups. Given that btrfs isn't fully stabilized
yet and there are still bugs being worked out, those are *VITAL*
maintenance! =:^)
Also, for the same reason (btrfs isn't yet fully stable), I recently
refreshed and double-checked my backups, then blew away the existing
btrfs with a fresh mkfs.btrfs and restored from backup.
The brand new filesystems now make use of several features that the older
ones didn't have, including the new 16k nodesize default. =:^) For
anyone who has been running btrfs for awhile, that's potentially a nice
improvement.
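For reference, a sketch: with new enough btrfs-progs the 16k nodesize is
simply the default, so a plain mkfs picks it up; on older tools you could
request it explicitly, assuming the -n/--nodesize option is available
there:

  mkfs.btrfs -L home -n 16384 /dev/mapper/merkaba-home

(Only on a filesystem you're recreating from backup anyway, of course.)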
I expect to do the same thing at least once more, later on after btrfs
has settled down to more or less routine stability, just to clear out any
remaining not-fully-stable-yet corner-cases that may eventually come back
to haunt me if I don't, as well as to update the filesystem to take
advantage of any further format updates between now and then.
That's useful btrfs maintenance, SSD or no SSD. =:^)
> Thus for SSDs, except for the scrubbing and the occasional fstrim, I'll
> be done with it.
>
> For harddisks I enable autodefrag.
>
> But still, for now this is only guesswork. I don't have much clue about
> BTRFS filesystem maintenance yet, and I just remember the slogan on the
> xfs.org wiki:
>
> "Use the defaults."
=:^)
> I would love to hear some more or less official words from BTRFS
> filesystem developers on that. But for now I think one of the best
> optimizations would be to complement that 300 GB Intel SSD 320 with a
> 512 GB Crucial m5 mSATA SSD or some Intel mSATA SSDs (but these cost
> twice as much), and make more free space on /home again. For data that
> is critical regarding safety and amount of accesses, I could even use
> BTRFS RAID 1 then.
Indeed. I'm running btrfs raid1 mode with my ssds (except for /boot,
where I have a separate one configured on each drive, so I can grub
install update one and test it before doing the other, without
endangering my ability to boot off the other should something go wrong).
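(For anyone wondering what that looks like at mkfs time, something like
the following, with purely hypothetical device names:

  mkfs.btrfs -L rootfs -d raid1 -m raid1 /dev/sda5 /dev/sdb5

Both data and metadata then get one copy on each device.)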
> All those MP3s and photos I could place on the bigger
> mSATA SSD. Granted, an SSD is definitely not needed for those, but it is
> just quieter. I never realized how loud even a tiny 2.5 inch laptop
> drive is until I switched on an external one while using this ThinkPad
> T520 with its SSD. For the first time I heard the hard disk clearly.
> Thus I'd prefer an SSD anyway.
Well, yes. But SSDs cost money. And at least here, while I could
justify two SSDs in raid1 mode for my critical data, and even
overprovision such that I have nearly 50% available space entirely
unpartitioned, I really couldn't justify spending SSD money on gigs of
media files.
But as they say, YMMV...
---
[1] Pathologic: THAT is the word I was looking for in several recent
posts, but couldn't remember, not "pathetic", "pathologic"! But all I
could think of was pathetic, and I knew /that/ wasn't what I wanted, so
explained using other words instead. So if you see any of my other
recent posts on the issue and think I'm describing a pathologic case
using other words, it's because I AM!
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman