* Very slow filesystem
@ 2014-06-04 22:15 Igor M
2014-06-04 22:27 ` Fajar A. Nugraha
` (2 more replies)
0 siblings, 3 replies; 18+ messages in thread
From: Igor M @ 2014-06-04 22:15 UTC (permalink / raw)
To: linux-btrfs@vger.kernel.org
Hello,
Why does btrfs become EXTREMELY slow after some time (months) of usage?
This has now happened a second time; the first time I thought it was a
hard drive fault, but now the drive seems fine.
The filesystem is mounted with compress-force=lzo and is used for MySQL
databases; the files are mostly big, 2G-8G.
Copying from this filesystem is unbelievably slow. It goes from 500
KB/s to maybe 5 MB/s, maybe faster for some files.
hdparm -t and dd show 130 MB/s+. There are no errors on the drive and
no errors in the logs.
Can I somehow get the position of a file on disk, so I can try a raw
read with dd or something to make sure it's not a drive fault?
As I said, I tried dd and speeds are normal, but maybe there is a
problem with only some sectors.
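One way to try this: filefrag -v prints each extent's physical offset in filesystem blocks, which can be turned into raw dd reads against the device. Below is a sketch (the device /dev/sde comes from the btrfs fi show output further down; "somefile" is a made-up name; note that filefrag's numbers are not reliable for compressed btrfs files):

```shell
#!/bin/sh
# Sketch: turn `filefrag -v <file>` output into raw dd reads of each extent.
# /dev/sde and the 4096-byte block size match the report below; adjust for
# your filesystem. Caveat: filefrag is unreliable on compressed btrfs files.
extents_to_dd() {            # expects `filefrag -v` output on stdin
    awk '$1 ~ /^[0-9]+:$/ {
        phys = $4; sub(/\.\.$/, "", phys)   # physical start block of extent
        len  = $6; sub(/:$/,   "", len)     # extent length in blocks
        printf "dd if=/dev/sde of=/dev/null bs=4096 skip=%s count=%s\n",
               phys, len
    }'
}

# Real use would be:  filefrag -v /mnt/old/somefile | extents_to_dd | sh
# Canned two-extent sample so the transformation is visible:
extents_to_dd <<'EOF'
File size of somefile is 268435456 (65536 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..   32767:      884736..    917503:  32768:
   1:    32768..   65535:     1056768..   1089535:  32768:      917504: last,eof
EOF
```

Reading the extents directly at full device speed, while normal file reads stay slow, would rule out bad sectors and point at fragmentation or filesystem overhead instead.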
Below are btrfs version and info:
# uname -a
Linux voyager 3.14.2 #1 SMP Tue May 6 09:25:40 CEST 2014 x86_64
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz GenuineIntel GNU/Linux
That's the current kernel; when the filesystem was created it was some 3.x version, I don't remember which.
# btrfs --version
Btrfs v0.20-rc1-358-g194aa4a (now I'm upgraded to Btrfs v3.14.2)
# btrfs fi show
Label: none uuid: b367812a-b91a-4fb2-a839-a3a153312eba
Total devices 1 FS bytes used 2.36TiB
devid 1 size 2.73TiB used 2.38TiB path /dev/sde
Label: none uuid: 09898e7a-b0b4-4a26-a956-a833514c17f6
Total devices 1 FS bytes used 1.05GiB
devid 1 size 3.64TiB used 5.04GiB path /dev/sdb
Btrfs v3.14.2
# btrfs fi df /mnt/old
Data, single: total=2.36TiB, used=2.35TiB
System, DUP: total=8.00MiB, used=264.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=8.50GiB, used=7.13GiB
Metadata, single: total=8.00MiB, used=0.00
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Very slow filesystem
From: Fajar A. Nugraha @ 2014-06-04 22:27 UTC (permalink / raw)
To: Igor M; +Cc: linux-btrfs@vger.kernel.org
On Thu, Jun 5, 2014 at 5:15 AM, Igor M <igork20@gmail.com> wrote:
> Hello,
>
> Why does btrfs become EXTREMELY slow after some time (months) of usage?
> # btrfs fi show
> Label: none uuid: b367812a-b91a-4fb2-a839-a3a153312eba
> Total devices 1 FS bytes used 2.36TiB
> devid 1 size 2.73TiB used 2.38TiB path /dev/sde
> # btrfs fi df /mnt/old
> Data, single: total=2.36TiB, used=2.35TiB
Is that the fs that is slow?
It's almost full. Most filesystems exhibit really bad performance
when close to full due to fragmentation issues (thresholds vary, but
80-90% full usually means you need to start adding space).
You should free up some space (e.g. add a new disk so it becomes
multi-device, or delete some files) and rebalance/defrag.
--
Fajar
* Re: Very slow filesystem
From: Roman Mamedov @ 2014-06-04 22:40 UTC (permalink / raw)
To: Fajar A. Nugraha; +Cc: Igor M, linux-btrfs@vger.kernel.org
On Thu, 5 Jun 2014 05:27:33 +0700
"Fajar A. Nugraha" <list@fajar.net> wrote:
> On Thu, Jun 5, 2014 at 5:15 AM, Igor M <igork20@gmail.com> wrote:
> > Hello,
> >
> > Why does btrfs become EXTREMELY slow after some time (months) of usage?
>
> > # btrfs fi show
> > Label: none uuid: b367812a-b91a-4fb2-a839-a3a153312eba
> > Total devices 1 FS bytes used 2.36TiB
> > devid 1 size 2.73TiB used 2.38TiB path /dev/sde
>
> > # btrfs fi df /mnt/old
> > Data, single: total=2.36TiB, used=2.35TiB
>
> Is that the fs that is slow?
>
> It's almost full.
Really, is it? The device size is 2.73 TiB, while only 2.36 TiB is used.
About 380 GiB should be free. That's not "almost full". The "btrfs fi df"
readings may be a little confusing, but usually it's those who ask
questions on this list who are confused by them, not those who (try to)
answer. :)
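For reference, the two different "free space" figures implied by the btrfs fi show and btrfs fi df output above can be sketched like this (the numbers are hard-coded from the original report):

```shell
# Unallocated space = device size minus space allocated to chunks
# ("btrfs fi show": size 2.73TiB, used 2.38TiB). Free space inside the
# already-allocated data chunks = Data total minus Data used ("btrfs fi df").
awk 'BEGIN {
    size = 2.73; alloc = 2.38            # TiB, from "btrfs fi show"
    dtot = 2.36; dused = 2.35            # TiB, Data line of "btrfs fi df"
    printf "unallocated:         %.2f TiB\n", size - alloc
    printf "free in data chunks: %.2f TiB\n", dtot - dused
}'
```

So roughly 0.35 TiB is unallocated, while the data chunks themselves are nearly full, which is why the two readings can give different impressions of "fullness".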
--
With respect,
Roman
* Re: Very slow filesystem
From: Igor M @ 2014-06-04 22:45 UTC (permalink / raw)
To: Fajar A. Nugraha; +Cc: linux-btrfs@vger.kernel.org
On Thu, Jun 5, 2014 at 12:27 AM, Fajar A. Nugraha <list@fajar.net> wrote:
> On Thu, Jun 5, 2014 at 5:15 AM, Igor M <igork20@gmail.com> wrote:
>> Hello,
>>
>> Why does btrfs become EXTREMELY slow after some time (months) of usage?
>
>> # btrfs fi show
>> Label: none uuid: b367812a-b91a-4fb2-a839-a3a153312eba
>> Total devices 1 FS bytes used 2.36TiB
>> devid 1 size 2.73TiB used 2.38TiB path /dev/sde
>
>> # btrfs fi df /mnt/old
>> Data, single: total=2.36TiB, used=2.35TiB
>
> Is that the fs that is slow?
>
> It's almost full. Most filesystems exhibit really bad performance
> when close to full due to fragmentation issues (thresholds vary, but
> 80-90% full usually means you need to start adding space).
> You should free up some space (e.g. add a new disk so it becomes
> multi-device, or delete some files) and rebalance/defrag.
>
> --
> Fajar
Yes, this one is slow. I know it's getting full; I'm just copying to a new
disk (it will take days or even weeks!).
It shouldn't be so heavily fragmented, since data is mostly just appended.
But still, can reading become this slow just because of fullness and
fragmentation?
It just seems strange to me. If it were 60 MB/s instead of 130 I could
understand, but it's so much slower. I'll delete some files and see if it
gets faster, but it will take hours to copy them to the new disk.
* Re: Very slow filesystem
From: Timofey Titovets @ 2014-06-04 23:17 UTC (permalink / raw)
To: Igor M; +Cc: linux-btrfs
I may be mistaken, but I think:
btrfstune -x <dev> # can improve performance, because it reduces metadata size
Also, in recent versions of btrfs-progs the default node size changed
from 4k to 16k, which can also help (but for this you must reformat the fs).
To get a clean "btrfs fi df /", you can try:
btrfs bal start -f -sconvert=dup,soft -mconvert=dup,soft <path>
Data, single: total=52.01GiB, used=49.29GiB
System, DUP: total=8.00MiB, used=16.00KiB
Metadata, DUP: total=1.50GiB, used=483.77MiB
Also, disable compression or use it without the force option; if I
understand correctly, force also causes additional fragmentation
(filefrag is helpful here).
Also, to defragment individual files you can do it with a simple
copy-paste: the copy is created unfragmented.
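The "just copy" defragmentation suggested above can be sketched as follows (an illustrative sketch only: the helper name is made up, and on btrfs the copy must not be a reflink, or no new extents are written):

```shell
#!/bin/sh
# Sketch of defragmenting a single file by plain copy: the copy is written
# as new, mostly sequential extents, then atomically replaces the original.
# Only safe while no application (e.g. MySQL) has the file open.
defrag_by_copy() {
    f=$1
    cp --reflink=never -- "$f" "$f.defrag"   # force a real copy, not a reflink
    sync                                     # let the new extents reach disk
    mv -- "$f.defrag" "$f"                   # swap it in place of the original
}

# Example on a throwaway file:
tmp=$(mktemp)
printf 'frequently rewritten data\n' > "$tmp"
defrag_by_copy "$tmp"
rm -f "$tmp"
```

The same effect can also be had with btrfs filesystem defragment on a single file, with the caveat mentioned elsewhere in the thread that defragmented data no longer shares extents with snapshots.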
2014-06-05 1:45 GMT+03:00 Igor M <igork20@gmail.com>:
> [snip]
--
Best regards,
Timofey.
* Re: Very slow filesystem
From: Duncan @ 2014-06-05 3:05 UTC (permalink / raw)
To: linux-btrfs
Igor M posted on Thu, 05 Jun 2014 00:15:31 +0200 as excerpted:
> Why does btrfs become EXTREMELY slow after some time (months) of usage?
> This has now happened a second time; the first time I thought it was a
> hard drive fault, but now the drive seems fine.
> The filesystem is mounted with compress-force=lzo and is used for MySQL
> databases; the files are mostly big, 2G-8G.
That's the problem right there, database access pattern on files over 1
GiB in size, but the problem along with the fix has been repeated over
and over and over and over... again on this list, and it's covered on the
btrfs wiki as well, so I guess you haven't checked existing answers
before you asked the same question yet again.
Nevertheless, here's the basic answer yet again...
Btrfs, like all copy-on-write (COW) filesystems, has a tough time with a
particular file rewrite pattern, that being frequently changed and
rewritten data internal to an existing file (as opposed to appended to
it, like a log file). In the normal case, such an internal-rewrite
pattern triggers copies of the rewritten blocks every time they change,
*HIGHLY* fragmenting this type of files after only a relatively short
period. While compression changes things up a bit (filefrag doesn't know
how to deal with it yet and its report isn't reliable), it's not unusual
to see people with several-gig files with this sort of write pattern on
btrfs without compression find filefrag reporting literally hundreds of
thousands of extents!
For smaller files with this access pattern (think firefox/thunderbird
sqlite database files and the like), typically up to a few hundred MiB or
so, btrfs' autodefrag mount option works reasonably well, as when it sees
a file fragmenting due to rewrite, it'll queue up that file for
background defrag via sequential copy, deleting the old fragmented copy
after the defrag is done.
For larger files (say a gig plus) with this access pattern, typically
larger database files as well as VM images, autodefrag doesn't scale so
well, as the whole file must be rewritten each time, and at that size the
changes can come faster than the file can be rewritten. So a different
solution must be used for them.
The recommended solution for larger internal-rewrite-pattern files is to
give them the NOCOW file attribute (chattr +C), so they're updated in
place. However, this attribute cannot be added to a file with existing
data and have things work as expected. NOCOW must be added to the file
before it contains data. The easiest way to do that is to set the
attribute on the subdir that will contain the files and let the files
inherit the attribute as they are created. Then you can copy (not move,
and don't use cp's --reflink option) existing files into the new subdir,
such that the new copy gets created with the NOCOW attribute.
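The procedure above might look like this as shell (a sketch with made-up paths; chattr +C has its NOCOW meaning only on btrfs, and must be set on the directory before the files are created in it):

```shell
#!/bin/sh
# Sketch of the NOCOW setup described above: set chattr +C on a fresh
# directory, then copy (not move, and not a reflink copy) existing files
# into it, so each new copy is created with the NOCOW attribute.
# Directory and file names are made-up examples.
make_nocow_copy() {
    src=$1 dstdir=$2
    mkdir -p "$dstdir"
    chattr +C "$dstdir" 2>/dev/null || true  # NOCOW only on btrfs; no-op elsewhere
    cp --reflink=never -- "$src" "$dstdir/"  # new file inherits NOCOW from the dir
}

# Example with throwaway paths (on btrfs, verify with: lsattr -d <dir>):
d=$(mktemp -d)
printf 'table data\n' > "$d/big.ibd"
make_nocow_copy "$d/big.ibd" "$d/mysql-nocow"
rm -rf "$d"
```

After the copy, lsattr on the new file should show the C flag, confirming it will be rewritten in place.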
NOCOW files are updated in-place, thereby eliminating the fragmentation
that would otherwise occur, keeping them fast to access.
However, there are a few caveats. Setting NOCOW turns off file
compression and checksumming as well, which is actually what you want for
such files as it eliminates race conditions and other complex issues that
would otherwise occur when trying to update the files in-place (thus the
reason such features aren't part of most non-COW filesystems, which
update in-place by default).
Additionally, taking a btrfs snapshot locks the existing data in place
for the snapshot, so the first rewrite to a file block (4096 bytes, I
believe) after a snapshot will always be COW, even if the file has the
NOCOW attribute set. Some people run automatic snapshotting software and
can be taking snapshots as often as once a minute. Obviously, this
effectively almost kills NOCOW entirely, since it's then only effective
on changes after the first one between snapshots, and with snapshots only
a minute apart, the file fragments almost as fast as it would have
otherwise!
So snapshots and the NOCOW attribute basically don't get along with each
other. But because snapshots stop at subvolume boundaries, one method to
avoid snapshotting NOCOW files is to put your NOCOW files, already in
their own subdirs if using the suggestion above, into dedicated subvolumes
as well. That lets you continue taking snapshots of the parent subvolume,
without snapshotting the dedicated subvolumes containing the NOCOW
database or VM-image files.
You'd then do conventional backups of your database and VM-image files,
instead of snapshotting them.
Of course if you're not using btrfs snapshots in the first place, you can
avoid the whole subvolume thing, and just put your NOCOW files in their
own subdirs, setting NOCOW on the subdir as suggested above, so files
(and further subdirs, nested subdirs inherit the NOCOW as well) inherit
the NOCOW of the subdir they're created in, at that creation.
Meanwhile, it can be noted that once you turn off COW/compression/
checksumming, and if you're not snapshotting, you're almost back to the
features of a normal filesystem anyway, except you can still use the
btrfs multi-device features, of course. So if you're not using the multi-
device features either, an alternative solution is to simply use a more
traditional filesystem (like ext4 or xfs, with xfs being targeted at
large files anyway, so for multi-gig database and VM-image files it could
be a good choice =:^) for your large internal-rewrite-pattern files,
while potentially continuing to use btrfs for your normal files, where
btrfs' COW nature and other features are a better match for the use-case,
than they are for gig-plus internal-rewrite-pattern files.
As I said, further discussion elsewhere already, but that's the problem
you're seeing along with a couple potential solutions.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: Very slow filesystem
From: Fajar A. Nugraha @ 2014-06-05 3:22 UTC (permalink / raw)
To: linux-btrfs
(resending to the list as plain text, the original reply was rejected
due to HTML format)
On Thu, Jun 5, 2014 at 10:05 AM, Duncan <1i5t5.duncan@cox.net> wrote:
>
> Igor M posted on Thu, 05 Jun 2014 00:15:31 +0200 as excerpted:
>
> > Why does btrfs become EXTREMELY slow after some time (months) of usage?
> > This has now happened a second time; the first time I thought it was a
> > hard drive fault, but now the drive seems fine.
> > The filesystem is mounted with compress-force=lzo and is used for MySQL
> > databases; the files are mostly big, 2G-8G.
>
> That's the problem right there, database access pattern on files over 1
> GiB in size, but the problem along with the fix has been repeated over
> and over and over and over... again on this list, and it's covered on the
> btrfs wiki as well
Which part on the wiki? It's not on
https://btrfs.wiki.kernel.org/index.php/FAQ or
https://btrfs.wiki.kernel.org/index.php/UseCases
> so I guess you haven't checked existing answers
> before you asked the same question yet again.
>
> Nevertheless, here's the basic answer yet again...
>
> Btrfs, like all copy-on-write (COW) filesystems, has a tough time with a
> particular file rewrite pattern, that being frequently changed and
> rewritten data internal to an existing file (as opposed to appended to
> it, like a log file). In the normal case, such an internal-rewrite
> pattern triggers copies of the rewritten blocks every time they change,
> *HIGHLY* fragmenting this type of files after only a relatively short
> period. While compression changes things up a bit (filefrag doesn't know
> how to deal with it yet and its report isn't reliable), it's not unusual
> to see people with several-gig files with this sort of write pattern on
> btrfs without compression find filefrag reporting literally hundreds of
> thousands of extents!
>
> For smaller files with this access pattern (think firefox/thunderbird
> sqlite database files and the like), typically up to a few hundred MiB or
> so, btrfs' autodefrag mount option works reasonably well, as when it sees
> a file fragmenting due to rewrite, it'll queue up that file for
> background defrag via sequential copy, deleting the old fragmented copy
> after the defrag is done.
>
> For larger files (say a gig plus) with this access pattern, typically
> larger database files as well as VM images, autodefrag doesn't scale so
> well, as the whole file must be rewritten each time, and at that size the
> changes can come faster than the file can be rewritten. So a different
> solution must be used for them.
If COW and rewrite is the main issue, why doesn't zfs experience this
extreme slowdown (that is, as long as you have sufficient free space
available, like 20% or so)?
--
Fajar
* Re: Very slow filesystem
From: Duncan @ 2014-06-05 4:45 UTC (permalink / raw)
To: linux-btrfs
Fajar A. Nugraha posted on Thu, 05 Jun 2014 10:22:49 +0700 as excerpted:
> (resending to the list as plain text, the original reply was rejected
> due to HTML format)
>
> On Thu, Jun 5, 2014 at 10:05 AM, Duncan <1i5t5.duncan@cox.net> wrote:
>>
>> Igor M posted on Thu, 05 Jun 2014 00:15:31 +0200 as excerpted:
>>
>> > Why does btrfs become EXTREMELY slow after some time (months) of usage?
>> > This has now happened a second time; the first time I thought it was a
>> > hard drive fault, but now the drive seems fine.
>> > The filesystem is mounted with compress-force=lzo and is used for MySQL
>> > databases; the files are mostly big, 2G-8G.
>>
>> That's the problem right there, database access pattern on files over 1
>> GiB in size, but the problem along with the fix has been repeated over
>> and over and over and over... again on this list, and it's covered on
>> the btrfs wiki as well
>
> Which part on the wiki? It's not on
> https://btrfs.wiki.kernel.org/index.php/FAQ or
> https://btrfs.wiki.kernel.org/index.php/UseCases
Most of the discussion and information is on the list, but there's a
limited amount of information on the wiki in at least three places. Two
are on the mount options page, in the autodefrag and nodatacow options
description:
* Autodefrag says it's well suited to bdb and sqlite dbs but not vm
images or big dbs (yet).
* Nodatacow says performance gain is usually under 5% *UNLESS* the
workload is random writes to large db files, where the difference can be
VERY large. (There's also mention of the fact that this turns off
checksumming and compression.)
Of course that's the nodatacow mount option, not the NOCOW file
attribute, which isn't to my knowledge discussed on the wiki, and given
the wiki wording, one does indeed have to read a bit between the lines,
but it is there if one looks. That was certainly enough hint for me to
mark the issue for further study as I did my initial pre-mkfs.btrfs
research, for instance, and that it was a problem, with additional
detail, was quickly confirmed once I checked the list.
* Additionally, there's some discussion in the FAQ under "Can copy-on-write
be turned off for data blocks?", including discussion of the command used
(chattr +C), a link to a script, a shell commands example, and the hint
"will produce file suitable for a raw VM image -- the blocks will be
updated in-place and are preallocated."
FWIW, if I did wiki editing there'd probably be a dedicated page
discussing it, but for better or worse, I seem to work best on mailing
lists and newsgroups, and every time I've tried contributing on the web,
even when it has been to a web forum which one would think would be close
enough to lists/groups for me to adapt to, it simply hasn't gone much of
anywhere. So these days I let other people more comfortable with editing
wikis or doing web forums do that (and sometimes people do that by either
actually quoting my list post nearly verbatim or simply linking to it,
which I'm fine with, as after all that's where much of the info I post
comes from in the first place), and I stick to the lists. Since I don't
directly contribute to the wiki I don't much criticize it, but there are
indeed at least hints there for those who can read them, something I did
myself so I know it's not asking the impossible.
> If COW and rewrite is the main issue, why doesn't zfs experience this
> extreme slowdown (that is, as long as you have sufficient free space
> available, like 20% or so)?
My personal opinion? Primarily two things:
1) zfs is far more mature than btrfs and has been in production usage for
many years now, while btrfs is still barely getting the huge warnings
stripped off. There's a lot of btrfs optimization possible that simply
hasn't occurred yet as the focus is still real data-destruction-risk
bugs, and in fact, btrfs isn't yet feature-complete either, so there's
still focus on raw feature development as well. When btrfs gets to the
maturity level that zfs is at now, I expect a lot of the problems we have
now will have been dramatically reduced if not eliminated. (And the devs
are indeed working on this problem, among others.)
2) Stating the obvious, while both btrfs and zfs are COW based and have
other similarities, btrfs is a different filesystem, with an entirely
different implementation and somewhat different emphasis. There
consequently WILL be some differences, even when they're both mature
filesystems. It's entirely possible that something about the btrfs
implementation makes it less suitable in general to this particular use-
case.
Additionally, while I don't have zfs experience myself nor do I find it a
particularly feasible option for me due to licensing and political
issues, from what I've read it tends to handle certain issues by simply
throwing gigs on gigs of memory at the problem. Btrfs is designed to
require far less memory, and as such, will by definition be somewhat more
limited in spots. (Arguably, this is simply a specific case of #2 above,
they're individual filesystems with differing implementation and
emphasis, so WILL by definition have different ideal use-cases.)
Meanwhile, there's that specific mention of 20% zfs free-space available,
above. On btrfs, as long as some amount of chunk-space remains
unallocated to chunks, percentage free-space has little to no effect on
performance. And with metadata chunk-sizes of a quarter gig and data
chunk-sizes of a gig, at the terabyte filesystem scale that equates to
well under 1% free, before free-space becomes a performance issue at all.
So if indeed zfs is like many other filesystems in requiring 10-20%
free space in order to perform at best efficiency (I really don't know
if that's the case or not, but it is part of the claim above), then that
again simply emphasizes the differences between zfs and btrfs, since that
literally has zero bearing at all on btrfs efficiency.
Rather, at least until btrfs gets automatic, entirely unattended
chunk-space rebalance triggering, the btrfs issue is far more likely to be
literally running out of either data or metadata space as all the chunks
with free space are allocated to the other one. (Usually, it's metadata
that runs out first, with lots of free space tied up in nearly empty data
chunks. But it can be either. Of course a currently manually triggered
rebalance can be used to solve this problem, but at present, it IS
manually triggered, no automatic rebalancing functionality at all.)
So while zfs and btrfs might be similarly based on COW technology, they
really are entirely different filesystems, with vastly different maturity
levels and some pretty big differences in behavior as well as licensing
and political philosophy, certainly now, but potentially even as btrfs
matures to match zfs maturity, too.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: Very slow filesystem
From: Igor M @ 2014-06-05 7:50 UTC (permalink / raw)
To: Duncan; +Cc: linux-btrfs@vger.kernel.org
Thanks for the explanation. I did read the wiki, but didn't see this mentioned.
I saw the 'nodatacow' mount option mentioned, but that disables
compression, and I need compression.
Also, I was wrong about the file sizes; files can go up to 70GB.
But data is only appended to these big tables, it's never deleted, so
no rewrites should be happening.
I also see now that reading is very slow for some initial files (which
are under 1GB and are rewritten/modified a lot), but for the other
files, where data is only appended, the speed is more or less normal.
So it does seem to be fragmentation.
For example, for one file where reading is slow (800M) filefrag reports
63282 extents, while for an 8G file that has only 9 extents the read
speed is normal.
I'll put the frequently modified files in a directory with the NOCOW
attribute. Thanks.
On Thu, Jun 5, 2014 at 5:05 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> Igor M posted on Thu, 05 Jun 2014 00:15:31 +0200 as excerpted:
>
>> Why btrfs becames EXTREMELY slow after some time (months) of usage ?
>> This is now happened second time, first time I though it was hard drive
>> fault, but now drive seems ok.
>> Filesystem is mounted with compress-force=lzo and is used for MySQL
>> databases, files are mostly big 2G-8G.
>
> That's the problem right there, database access pattern on files over 1
> GiB in size, but the problem along with the fix has been repeated over
> and over and over and over... again on this list, and it's covered on the
> btrfs wiki as well, so I guess you haven't checked existing answers
> before you asked the same question yet again.
>
> Never-the-less, here's the basic answer yet again...
>
> Btrfs, like all copy-on-write (COW) filesystems, has a tough time with a
> particular file rewrite pattern, that being frequently changed and
> rewritten data internal to an existing file (as opposed to appended to
> it, like a log file). In the normal case, such an internal-rewrite
> pattern triggers copies of the rewritten blocks every time they change,
> *HIGHLY* fragmenting this type of files after only a relatively short
> period. While compression changes things up a bit (filefrag doesn't know
> how to deal with it yet and its report isn't reliable), it's not unusual
> to see people with several-gig files with this sort of write pattern on
> btrfs without compression find filefrag reporting literally hundreds of
> thousands of extents!
>
> For smaller files with this access pattern (think firefox/thunderbird
> sqlite database files and the like), typically up to a few hundred MiB or
> so, btrfs' autodefrag mount option works reasonably well, as when it sees
> a file fragmenting due to rewrite, it'll queue up that file for
> background defrag via sequential copy, deleting the old fragmented copy
> after the defrag is done.
>
> For larger files (say a gig plus) with this access pattern, typically
> larger database files as well as VM images, autodefrag doesn't scale so
> well, as the whole file must be rewritten each time, and at that size the
> changes can come faster than the file can be rewritten. So a different
> solution must be used for them.
>
> The recommended solution for larger internal-rewrite-pattern files is to
> give them the NOCOW file attribute (chattr +C) , so they're updated in
> place. However, this attribute cannot be added to a file with existing
> data and have things work as expected. NOCOW must be added to the file
> before it contains data. The easiest way to do that is to set the
> attribute on the subdir that will contain the files and let the files
> inherit the attribute as they are created. Then you can copy (not move,
> and don't use cp's --reflink option) existing files into the new subdir,
> such that the new copy gets created with the NOCOW attribute.
>
> NOCOW files are updated in-place, thereby eliminating the fragmentation
> that would otherwise occur, keeping them fast to access.
>
> However, there are a few caveats. Setting NOCOW turns off file
> compression and checksumming as well, which is actually what you want for
> such files as it eliminates race conditions and other complex issues that
> would otherwise occur when trying to update the files in-place (thus the
> reason such features aren't part of most non-COW filesystems, which
> update in-place by default).
>
> Additionally, taking a btrfs snapshot locks the existing data in place
> for the snapshot, so the first rewrite to a file block (4096 bytes, I
> believe) after a snapshot will always be COW, even if the file has the
> NOCOW attribute set. Some people run automatic snapshotting software and
> can be taking snapshots as often as once a minute. Obviously, this
> effectively almost kills NOCOW entirely, since it's then only effective
> on changes after the first one between shapshots, and with snapshots only
> a minute apart, the file fragments almost as fast as it would have
> otherwise!
>
> So snapshots and the NOCOW attribute basically don't get along with each
> other. But because snapshots stop at subvolume boundaries, one method to
> avoid snapshotting NOCOW files is to put your NOCOW files, already in
> their own subdirs if using the suggestion above, into dedicated subvolumes
> as well. That lets you continue taking snapshots of the parent subvolume,
> without snapshotting the the dedicated subvolumes containing the NOCOW
> database or VM-image files.
>
> You'd then do conventional backups of your database and VM-image files,
> instead of snapshotting them.
>
> Of course if you're not using btrfs snapshots in the first place, you can
> avoid the whole subvolume thing, and just put your NOCOW files in their
> own subdirs, setting NOCOW on the subdir as suggested above, so files
> (and further subdirs, nested subdirs inherit the NOCOW as well) inherit
> the NOCOW of the subdir they're created in, at that creation.
>
> Meanwhile, it can be noted that once you turn off COW/compression/
> checksumming, and if you're not snapshotting, you're almost back to the
> features of a normal filesystem anyway, except you can still use the
> btrfs multi-device features, of course. So if you're not using the multi-
> device features either, an alternative solution is to simply use a more
> traditional filesystem (like ext4 or xfs, with xfs being targeted at
> large files anyway, so for multi-gig database and VM-image files it could
> be a good choice =:^) for your large internal-rewrite-pattern files,
> while potentially continuing to use btrfs for your normal files, where
> btrfs' COW nature and other features are a better match for the use-case,
> than they are for gig-plus internal-rewrite-pattern files.
>
> As I said, further discussion elsewhere already, but that's the problem
> you're seeing along with a couple potential solutions.
>
> --
> Duncan - List replies preferred. No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master." Richard Stallman
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
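[Editor's sketch: the dedicated-subvolume-plus-NOCOW layout described in the quoted message above could look like the following. All paths are hypothetical examples, and note that chattr +C only takes full effect on files created after the flag is set on the directory.]

```shell
# Dedicated subvolume for the database, so snapshots of the parent
# subvolume stop at its boundary and never re-COW these files:
btrfs subvolume create /data/mysql

# Set NOCOW on the (still empty) subvolume; files created inside
# inherit the attribute:
chattr +C /data/mysql

# Verify the 'C' flag is set on the directory:
lsattr -d /data/mysql
```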
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Very slow filesystem
2014-06-04 22:15 Very slow filesystem Igor M
2014-06-04 22:27 ` Fajar A. Nugraha
2014-06-05 3:05 ` Duncan
@ 2014-06-05 8:08 ` Erkki Seppala
2014-06-05 8:12 ` Erkki Seppala
2 siblings, 1 reply; 18+ messages in thread
From: Erkki Seppala @ 2014-06-05 8:08 UTC (permalink / raw)
To: linux-btrfs
Igor M <igork20@gmail.com> writes:
> Why does btrfs become EXTREMELY slow after some time (months) of usage?
Have you tried iostat (from the sysstat package) to see the number of I/O
operations performed per second (tps) on the devices while the filesystem
is performing badly? If that number is hitting your seek rate (i.e.
1/0.0075 ≈ 133 for a 7.5 ms seek), then fragmentation is surely to blame.
--
_____________________________________________________________________
/ __// /__ ____ __ http://www.modeemi.fi/~flux/\ \
/ /_ / // // /\ \/ / \ /
/_/ /_/ \___/ /_/\_\@modeemi.fi \/
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Very slow filesystem
2014-06-05 8:08 ` Erkki Seppala
@ 2014-06-05 8:12 ` Erkki Seppala
0 siblings, 0 replies; 18+ messages in thread
From: Erkki Seppala @ 2014-06-05 8:12 UTC (permalink / raw)
To: linux-btrfs
Erkki Seppala <flux-btrfs@inside.org> writes:
> If that number is hitting your seek rate (i.e. 1/0.0075 ≈ 133 for a
> 7.5 ms seek), then fragmentation is surely to blame.
Actually the number may very well be off by at least a factor of two (I
tested that my device did 400 tps when I expected 200; perhaps bulk
transfers cause more transactions than I expect), but it should be in
the ballpark I think :).
--
_____________________________________________________________________
/ __// /__ ____ __ http://www.modeemi.fi/~flux/\ \
/ /_ / // // /\ \/ / \ /
/_/ /_/ \___/ /_/\_\@modeemi.fi \/
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Very slow filesystem
2014-06-05 7:50 ` Igor M
@ 2014-06-05 10:54 ` Russell Coker
0 siblings, 0 replies; 18+ messages in thread
From: Russell Coker @ 2014-06-05 10:54 UTC (permalink / raw)
To: Igor M; +Cc: Duncan, linux-btrfs@vger.kernel.org
On Thu, 5 Jun 2014 09:50:53 Igor M wrote:
> But data to this big tables is only appended, it's never deleted. So
> no rewrites should be happening.
When you write to the big tables the indexes will be rewritten. Indexes can
be in the same file as the table data or in separate files, depending on
which database you use. In the former case you get fragmented table files;
in the latter, 70G of data will have index files that are large enough to
get fragmented.
Also, when multiple files in a filesystem are written at the same time
(e.g. multiple tables appended to in each transaction), you will get some
fragmentation. Add COW and that becomes a lot of fragmentation.
Finally, appends happen at the file level while COW rewrites happen at the
block level. If your database rounds up the allocated space to some power
of 2 larger than 4K, then things will be fine on a filesystem like Ext3
where file offsets correspond to fixed locations on disk. But with BTRFS,
that preallocated space filled with zeros will be rewritten to a different
part of the disk when the allocated space is used.
If you use a database that doesn't preallocate space, then COW will be
invoked whenever the end of the file is written at an offset that isn't a
multiple of 4K (or, I think, 16K for a BTRFS filesystem created with a
recent mkfs.btrfs), since appending within a partially filled block means
rewriting that block.
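[Editor's sketch: the unaligned-append effect described above can be modeled as a toy counter. This assumes a 4K block size and ignores delayed allocation and batching, so it overstates the real rewrite count, but it shows the mechanism.]

```python
BLOCK = 4096  # assumed filesystem block size

def cow_tail_rewrites(append_sizes):
    """Count how many appends start mid-block and therefore force a COW
    filesystem to rewrite (copy elsewhere) the partially filled tail
    block, rather than simply filling in new space."""
    size = 0
    rewrites = 0
    for n in append_sizes:
        if size % BLOCK != 0:  # append lands inside an existing block
            rewrites += 1
        size += n
    return rewrites

# 1000 unaligned 100-byte appends: nearly every one rewrites the tail block.
print(cow_tail_rewrites([100] * 1000))   # → 999
# 1000 block-sized appends: always aligned, no tail-block rewrites at all.
print(cow_tail_rewrites([4096] * 1000))  # → 0
```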
I believe that COW is desirable for a database. I don't believe that a lack
of integrity at the filesystem level will help integrity at the database
level. If the working set of your database can fit in RAM then you can rely
on cache to ensure that little data is read during operation. For example,
one of my database servers has been running for 330 days and the /mysql
filesystem has writes outnumbering reads by a factor of 3:1. When most IO
is for writes, fragmentation of data is less of an issue - although in
this case the server is running Ext3, so it wouldn't get the COW
fragmentation issues anyway.
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Very slow filesystem
2014-06-05 3:05 ` Duncan
2014-06-05 3:22 ` Fajar A. Nugraha
2014-06-05 7:50 ` Igor M
@ 2014-06-05 15:52 ` Igor M
2014-06-05 16:13 ` Timofey Titovets
2 siblings, 1 reply; 18+ messages in thread
From: Igor M @ 2014-06-05 15:52 UTC (permalink / raw)
To: Duncan; +Cc: linux-btrfs@vger.kernel.org
One more question. Is there any other way to find out file fragmentation?
I just copied a 35 GB file onto a new btrfs filesystem (compressed) and
filefrag reports 282275 extents. This can't be right?
On Thu, Jun 5, 2014 at 5:05 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> Igor M posted on Thu, 05 Jun 2014 00:15:31 +0200 as excerpted:
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Very slow filesystem
2014-06-05 15:52 ` Igor M
@ 2014-06-05 16:13 ` Timofey Titovets
2014-06-05 19:53 ` Duncan
0 siblings, 1 reply; 18+ messages in thread
From: Timofey Titovets @ 2014-06-05 16:13 UTC (permalink / raw)
To: Igor M, linux-btrfs
2014-06-05 18:52 GMT+03:00 Igor M <igork20@gmail.com>:
> One more question. Is there any other way to find out file fragmentation ?
> I just copied 35Gb file on new btrfs filesystem (compressed) and
> filefrag reports 282275 extents found. This can't be right ?
Yes, because filefrag shows each compressed block (128 KiB) as one extent.
--
Best regards,
Timofey.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Very slow filesystem
2014-06-05 16:13 ` Timofey Titovets
@ 2014-06-05 19:53 ` Duncan
2014-06-06 19:06 ` Mitch Harder
0 siblings, 1 reply; 18+ messages in thread
From: Duncan @ 2014-06-05 19:53 UTC (permalink / raw)
To: linux-btrfs
Timofey Titovets posted on Thu, 05 Jun 2014 19:13:08 +0300 as excerpted:
> 2014-06-05 18:52 GMT+03:00 Igor M <igork20@gmail.com>:
>> One more question. Is there any other way to find out file
>> fragmentation ?
>> I just copied 35Gb file on new btrfs filesystem (compressed) and
>> filefrag reports 282275 extents found. This can't be right ?
> Yes, because filefrag shows each compressed block (128 KiB) as one extent.
Correct. Which is why my original answer had this:
>>> While compression changes things up a bit (filefrag doesn't know
>>> how to deal with it yet and its report isn't reliable),
I skipped over the "why" at the time as it wasn't necessary for the then-
current discussion, but indeed, the reason is that filefrag counts each
128 KiB compressed block as a separate fragment because it doesn't
understand them, and as a result it's not (currently) usable for
btrfs-compressed files.
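[Editor's note: the 282275-extent figure quoted earlier in the thread is about what this miscounting predicts. A fully compressed file shows roughly one "extent" per 128 KiB compressed block; this is plain arithmetic, no btrfs internals assumed.]

```python
GiB = 1024 ** 3
KiB = 1024

file_size = 35 * GiB            # the file copied earlier in the thread
compressed_block = 128 * KiB    # btrfs compressed-extent granularity

# Upper bound on what filefrag reports for a fully compressed file; the
# observed 282275 is just under this, since not every block compresses.
print(file_size // compressed_block)  # → 286720
```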
They (the btrfs, filefrag, and kernel folks) are working on teaching
filefrag about the problem so it can report correct information. But the
approach being taken is to set up generic kernel functionality to report
that information to filefrag, such that other filesystems can use the same
functionality, which makes it a rather more complicated project than a
simple one-shot fix for btrfs alone would be. So while the problem is
indeed being actively worked on, it could be some time before we actually
have a filefrag that's accurate for btrfs-compressed files. But once we
do, we can be confident the solution is a correct one that btrfs and other
filesystems can use well into the future, not a brittle hack of a few
minutes to a day that can't be used for anything else and could well break
again two kernel cycles down the road.
Unfortunately, I know of nothing else that can report that information,
so the only real suggestion I have is to either turn off compression or
forget about tracking individual file fragmentation for now and go only
on performance.
But as it happens, the NOCOW file attribute turns off compression (as
well as checksumming) for that file anyway, because in-place rewrite
would otherwise trigger complex issues and race conditions that are a
risk to the data as well as performance, which is why traditional non-COW
filesystems don't tend to offer these features in the first place.
Btrfs' normal COW nature makes these features possible, but as this
thread already explores, it unfortunately simply isn't suitable for
certain access patterns.
So if the file is properly (that is, at creation) set NOCOW, filefrag
should indeed be accurate, because the file won't be (btrfs-)compressed.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Very slow filesystem
2014-06-05 19:53 ` Duncan
@ 2014-06-06 19:06 ` Mitch Harder
2014-06-06 19:59 ` Duncan
2014-06-07 2:29 ` Russell Coker
0 siblings, 2 replies; 18+ messages in thread
From: Mitch Harder @ 2014-06-06 19:06 UTC (permalink / raw)
To: Duncan; +Cc: linux-btrfs
On Thu, Jun 5, 2014 at 2:53 PM, Duncan <1i5t5.duncan@cox.net> wrote:
> Timofey Titovets posted on Thu, 05 Jun 2014 19:13:08 +0300 as excerpted:
>
>> 2014-06-05 18:52 GMT+03:00 Igor M <igork20@gmail.com>:
>>> One more question. Is there any other way to find out file
>>> fragmentation ?
>>> I just copied 35Gb file on new btrfs filesystem (compressed) and
>>> filefrag reports 282275 extents found. This can't be right ?
>
>> Yes, because filefrag shows each compressed block (128 KiB) as one extent.
>
> Correct. Which is why my original answer had this:
>
>>>> While compression changes things up a bit (filefrag doesn't know
>>>> how to deal with it yet and its report isn't reliable),
>
> I skipped over the "why" at the time as it wasn't necessary for the then-
> current discussion, but indeed, the reason is because filefrag counts
> each 128 KiB block as a separate fragment because it doesn't understand
> them, and as a result, it's not (currently) usable for btrfs-compressed
> files.
>
In the context of a compressed database file, the 128 KiB compressed
block size has more severe consequences.
First, even if the 128 KiB blocks are contiguous, each 128 KiB block has
its own metadata entry, so you already have much higher metadata
utilization than without compression. And the metadata can also get
fragmented.
Every time you update your database, btrfs is going to update
whichever 128 KiB blocks need to be modified.
Even for a tiny modification, the new compressed block may be slightly
more or slightly less than 128 KiB.
If you have a 1-2 GB database that is being updated with any
frequency, you can see how you will quickly end up with lots of
metadata fragmentation as well as inefficient data block utilization.
I think this will be the case even if you switch to NOCOW due to the
compression.
On a very fundamental level, file system compression and large
databases are two use cases that are difficult to reconcile.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Very slow filesystem
2014-06-06 19:06 ` Mitch Harder
@ 2014-06-06 19:59 ` Duncan
2014-06-07 2:29 ` Russell Coker
1 sibling, 0 replies; 18+ messages in thread
From: Duncan @ 2014-06-06 19:59 UTC (permalink / raw)
To: linux-btrfs
Mitch Harder posted on Fri, 06 Jun 2014 14:06:53 -0500 as excerpted:
> Every time you update your database, btrfs is going to update whichever
> 128 KiB blocks need to be modified.
>
> Even for a tiny modification, the new compressed block may be slightly
> more or slightly less than 128 KiB.
FWIW, I believe that's 128 KiB pre-compression. And at least without
compress-force, btrfs will try the compression and if the compressed size
is larger than the uncompressed size, it simply won't compress that
block. So 128 KiB is the largest amount of space that 128 KiB of data
could take with compression on, but it can be half that or less if the
compression happens to be good for that 128 KiB block.
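[Editor's sketch: the per-block compress-or-store-raw decision described above can be approximated as follows. This is a simplification: real btrfs works per 128 KiB extent with lzo or zlib and its own heuristics; Python's zlib is just a stand-in here.]

```python
import os
import zlib

BLOCK = 128 * 1024  # btrfs' per-extent compression granularity

def stored_size(block: bytes) -> int:
    """Sketch of the per-block decision: keep the compressed form only
    if it is actually smaller than the raw block, else store it raw."""
    compressed = zlib.compress(block)
    return min(len(compressed), len(block))

# Highly compressible data takes far less than 128 KiB on disk:
print(stored_size(b"\x00" * BLOCK) < BLOCK)     # → True
# Incompressible (random) data is stored raw, never more than 128 KiB:
print(stored_size(os.urandom(BLOCK)) == BLOCK)  # → True
```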
> If you have a 1-2 GB database that is being updated with any frequency,
> you can see how you will quickly end up with lots of metadata
> fragmentation as well as inefficient data block utilization.
> I think this will be the case even if you switch to NOCOW due to the
> compression.
That is one reason that, as I said, NOCOW turns off compression.
Compression simply doesn't work well with in-place updates, because as
you point out, the update may compress more or less well than the
original, and that won't work in-place. So NOCOW turns off compression
to avoid the problem.
If it's COW (that is, not NOCOW), then the COW-based out-of-place updates
avoid the problem of fitting more data in the same space, because the new
write can take more space in the new location if it has to.
But you are correct that compression and large, frequently updated
databases don't play well together either. Which is why turning off
compression when turning off COW isn't the big problem it would first
appear to be -- as it happens, the very same files where COW doesn't work
well, are also the ones where compression doesn't work well.
Similarly for checksumming. When there are enough updates, in addition to
taking more time to calculate and write, checksumming simply invites race
conditions between the last then-valid checksum and the next update
invalidating it. In addition, in many, perhaps most, cases the sorts of
apps that do constant internal updates have already evolved their own data
integrity verification methods, in order to cope with issues on the (after
all far more common) unverified filesystems, creating even more possible
race conditions and timing issues and making all the extra verification
work that btrfs normally does unnecessary. Trying to do all that in-place
due to NOCOW is a recipe for failure or insanity, if not both.
So when turning off COW, just turning off checksumming/verification and
compression along with it makes the most sense, and that's what btrfs
does. To do otherwise is just asking for trouble, which is why you very
rarely see in-place-update-by-default filesystems offering either
transparent compression or data verification as features.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Very slow filesystem
2014-06-06 19:06 ` Mitch Harder
2014-06-06 19:59 ` Duncan
@ 2014-06-07 2:29 ` Russell Coker
1 sibling, 0 replies; 18+ messages in thread
From: Russell Coker @ 2014-06-07 2:29 UTC (permalink / raw)
To: linux-btrfs
On Fri, 6 Jun 2014 14:06:53 Mitch Harder wrote:
> Every time you update your database, btrfs is going to update
> whichever 128 KiB blocks need to be modified.
>
> Even for a tiny modification, the new compressed block may be slightly
> more or slightly less than 128 KiB.
>
> If you have a 1-2 GB database that is being updated with any
> frequency, you can see how you will quickly end up with lots of
> metadata fragmentation as well as inefficient data block utilization.
> I think this will be the case even if you switch to NOCOW due to the
> compression.
>
> On a very fundamental level, file system compression and large
> databases are two use cases that are difficult to reconcile.
The ZFS approach of using a ZIL (a write-back cache that caches before
allocation) and L2ARC (a read cache on SSD) mitigates these problems.
Samsung 1TB SSDs are $565 at my local computer store; if your database has
a working set of less than 2TB, then SSDs with L2ARC should solve those
performance problems at low cost. The vast majority of sysadmins have
never seen a database that's 2TB in size, let alone one with a 2TB working
set.
That said I've seen Oracle docs recommending against ZFS for large databases,
but the Oracle definition of "large database" is probably a lot larger than
anything that is likely to be stored on BTRFS in the near future.
Another thing to note is that there are a variety of ways of storing
compressed data in databases. Presumably anyone who is storing so much data
that the working set exceeds the ability to attach lots of SSDs is going to be
using some form of compressed tables which will reduce the ability of
filesystem compression to do any good.
On Fri, 6 Jun 2014 19:59:55 Duncan wrote:
> Similarly for checksumming. When there are enough updates, in addition
> to taking more time to calculate and write, checksumming simply invites
> race conditions between the last then-valid checksum and the next update
> invalidating it. In addition, in many, perhaps most cases, the sorts of
> apps that do constant internal updates, have already evolved their own
> data integrity verification methods in ordered to cope with issues on the
> after all way more common unverified filesystems, creating even more
> possible race conditions and timing issues and making all that extra work
> that btrfs normally does for verification unnecessary. Trying to do all
> that in-place due to NOCOW is a recipe for failure or insanity if not both
http://www.strchr.com/crc32_popcnt
The above URL has some interesting information about CRC32 speed. In
summary, on a Core i5 system you are looking at less than a clock cycle
per byte on average. So only if your storage is capable of handling more
than 4GB/s of data transfer might CRC32 become a bottleneck, and doing
4GB/s for a database is a very different problem.
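[Editor's sketch: a quick software baseline for the checksum cost. Caveat: Python's zlib.crc32 implements the IEEE CRC-32 polynomial in software, while the SSE4.2 instruction discussed at that URL computes CRC-32C, the polynomial btrfs actually uses, so this only shows that even unaccelerated CRC is far from a disk bottleneck.]

```python
import time
import zlib

buf = bytes(64 * 1024 * 1024)  # 64 MiB buffer of zeros

t0 = time.perf_counter()
crc = zlib.crc32(buf)
elapsed = time.perf_counter() - t0

# Standard CRC-32 check value, as a sanity check of the implementation:
assert zlib.crc32(b"123456789") == 0xCBF43926

print(f"crc32 of 64 MiB in {elapsed * 1000:.1f} ms "
      f"({len(buf) / elapsed / 1e9:.2f} GB/s)")
```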
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2014-06-07 2:29 UTC | newest]
Thread overview: 18+ messages
2014-06-04 22:15 Very slow filesystem Igor M
2014-06-04 22:27 ` Fajar A. Nugraha
2014-06-04 22:40 ` Roman Mamedov
2014-06-04 22:45 ` Igor M
2014-06-04 23:17 ` Timofey Titovets
2014-06-05 3:05 ` Duncan
2014-06-05 3:22 ` Fajar A. Nugraha
2014-06-05 4:45 ` Duncan
2014-06-05 7:50 ` Igor M
2014-06-05 10:54 ` Russell Coker
2014-06-05 15:52 ` Igor M
2014-06-05 16:13 ` Timofey Titovets
2014-06-05 19:53 ` Duncan
2014-06-06 19:06 ` Mitch Harder
2014-06-06 19:59 ` Duncan
2014-06-07 2:29 ` Russell Coker
2014-06-05 8:08 ` Erkki Seppala
2014-06-05 8:12 ` Erkki Seppala