* Very slow filesystem
@ 2014-06-04 22:15 Igor M
  2014-06-04 22:27 ` Fajar A. Nugraha
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Igor M @ 2014-06-04 22:15 UTC (permalink / raw)
  To: linux-btrfs@vger.kernel.org

Hello,

Why does btrfs become EXTREMELY slow after some time (months) of usage?
This has now happened a second time; the first time I thought it was a hard
drive fault, but now the drive seems fine.
The filesystem is mounted with compress-force=lzo and is used for MySQL
databases; the files are mostly big, 2G-8G.
Copying from this filesystem is unbelievably slow. It goes from 500 KB/s to
maybe 5 MB/s, perhaps faster for some files. hdparm -t or dd show 130 MB/s+.
There are no errors on the drive and no errors in the logs.
Can I somehow get the position of a file on disk, so I can try a raw read
with dd or something to make sure it's not a drive fault? As I said, I tried
dd and the speeds are normal, but maybe the problem is limited to only some
sectors.

Below are the btrfs version and info:

# uname -a
Linux voyager 3.14.2 #1 SMP Tue May 6 09:25:40 CEST 2014 x86_64 Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz GenuineIntel GNU/Linux

(That's the current kernel; when the filesystem was created it was some 3.x,
I don't remember exactly.)

# btrfs --version
Btrfs v0.20-rc1-358-g194aa4a
(now upgraded to Btrfs v3.14.2)

# btrfs fi show
Label: none  uuid: b367812a-b91a-4fb2-a839-a3a153312eba
        Total devices 1 FS bytes used 2.36TiB
        devid    1 size 2.73TiB used 2.38TiB path /dev/sde

Label: none  uuid: 09898e7a-b0b4-4a26-a956-a833514c17f6
        Total devices 1 FS bytes used 1.05GiB
        devid    1 size 3.64TiB used 5.04GiB path /dev/sdb

Btrfs v3.14.2

# btrfs fi df /mnt/old
Data, single: total=2.36TiB, used=2.35TiB
System, DUP: total=8.00MiB, used=264.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=8.50GiB, used=7.13GiB
Metadata, single: total=8.00MiB, used=0.00

^ permalink raw reply	[flat|nested] 18+ messages in thread
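Before blaming the filesystem, the drive itself can be spot-checked without
knowing exactly where a given file lives: read a few large raw spans at
different offsets across the device and compare throughput, and check SMART
data. A rough sketch; the skip values are arbitrary examples picking regions
near the start, middle and end of a ~2.7 TiB disk:

# smartctl -a /dev/sde
# dd if=/dev/sde of=/dev/null bs=1M count=4096 skip=0
# dd if=/dev/sde of=/dev/null bs=1M count=4096 skip=1400000
# dd if=/dev/sde of=/dev/null bs=1M count=4096 skip=2700000

If every region reads at the expected ~130 MB/s and SMART shows no pending or
reallocated sectors, slow reads through the filesystem point at fragmentation
rather than the hardware.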
* Re: Very slow filesystem
  2014-06-04 22:15 Very slow filesystem Igor M
@ 2014-06-04 22:27 ` Fajar A. Nugraha
  2014-06-04 22:40   ` Roman Mamedov
  2014-06-04 22:45   ` Igor M
  2014-06-05  3:05 ` Duncan
  2014-06-05  8:08 ` Erkki Seppala
  2 siblings, 2 replies; 18+ messages in thread
From: Fajar A. Nugraha @ 2014-06-04 22:27 UTC (permalink / raw)
  To: Igor M; +Cc: linux-btrfs@vger.kernel.org

On Thu, Jun 5, 2014 at 5:15 AM, Igor M <igork20@gmail.com> wrote:
> Hello,
>
> Why btrfs becames EXTREMELY slow after some time (months) of usage ?

> # btrfs fi show
> Label: none uuid: b367812a-b91a-4fb2-a839-a3a153312eba
> Total devices 1 FS bytes used 2.36TiB
> devid 1 size 2.73TiB used 2.38TiB path /dev/sde

> # btrfs fi df /mnt/old
> Data, single: total=2.36TiB, used=2.35TiB

Is that the fs that is slow?

It's almost full. Most filesystems exhibit really bad performance when close
to full due to fragmentation issues (thresholds vary, but 80-90% full usually
means you need to start adding space). You should free up some space (e.g.
add a new disk so it becomes multi-device, or delete some files) and
rebalance/defrag.

-- 
Fajar

^ permalink raw reply	[flat|nested] 18+ messages in thread
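As a sketch of those suggested steps (the added device name is a placeholder;
an unfiltered balance rewrites every chunk and can take many hours on a
filesystem this size, so the usage filter below limits it to mostly-empty
chunks):

# btrfs device add /dev/sdf /mnt/old
# btrfs balance start -dusage=50 /mnt/old
# btrfs filesystem defragment -r -v /mnt/old

On kernels of this era, defragmenting files that are shared by snapshots or
reflinks unshares them, so the defragment step can cost extra space if
snapshots exist.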
* Re: Very slow filesystem
  2014-06-04 22:27 ` Fajar A. Nugraha
@ 2014-06-04 22:40   ` Roman Mamedov
  2014-06-04 22:45   ` Igor M
  1 sibling, 0 replies; 18+ messages in thread
From: Roman Mamedov @ 2014-06-04 22:40 UTC (permalink / raw)
  To: Fajar A. Nugraha; +Cc: Igor M, linux-btrfs@vger.kernel.org

On Thu, 5 Jun 2014 05:27:33 +0700 "Fajar A. Nugraha" <list@fajar.net> wrote:

> On Thu, Jun 5, 2014 at 5:15 AM, Igor M <igork20@gmail.com> wrote:
> > Hello,
> >
> > Why btrfs becames EXTREMELY slow after some time (months) of usage ?
>
> > # btrfs fi show
> > Label: none uuid: b367812a-b91a-4fb2-a839-a3a153312eba
> > Total devices 1 FS bytes used 2.36TiB
> > devid 1 size 2.73TiB used 2.38TiB path /dev/sde
>
> > # btrfs fi df /mnt/old
> > Data, single: total=2.36TiB, used=2.35TiB
>
> Is that the fs that is slow?
>
> It's almost full.

Really, is it? The device size is 2.73 TiB, while only 2.35 TiB is used.
About 400 GiB should be free. That's not "almost full". The "btrfs fi df"
readings may be a little confusing, but usually it's those who ask questions
on this list who are confused by them, not those who (try to) answer. :)

-- 
With respect,
Roman

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Very slow filesystem
  2014-06-04 22:27 ` Fajar A. Nugraha
  2014-06-04 22:40   ` Roman Mamedov
@ 2014-06-04 22:45   ` Igor M
  2014-06-04 23:17     ` Timofey Titovets
  1 sibling, 1 reply; 18+ messages in thread
From: Igor M @ 2014-06-04 22:45 UTC (permalink / raw)
  To: Fajar A. Nugraha; +Cc: linux-btrfs@vger.kernel.org

On Thu, Jun 5, 2014 at 12:27 AM, Fajar A. Nugraha <list@fajar.net> wrote:
> On Thu, Jun 5, 2014 at 5:15 AM, Igor M <igork20@gmail.com> wrote:
>> Hello,
>>
>> Why btrfs becames EXTREMELY slow after some time (months) of usage ?
>
>> # btrfs fi show
>> Label: none uuid: b367812a-b91a-4fb2-a839-a3a153312eba
>> Total devices 1 FS bytes used 2.36TiB
>> devid 1 size 2.73TiB used 2.38TiB path /dev/sde
>
>> # btrfs fi df /mnt/old
>> Data, single: total=2.36TiB, used=2.35TiB
>
> Is that the fs that is slow?
>
> It's almost full. Most filesystems would exhibit really bad
> performance when close to full due to fragmentation issue (threshold
> vary, but 80-90% full usually means you need to start adding space).
> You should free up some space (e.g. add a new disk so it becomes
> multi-device, or delete some files) and rebalance/defrag.
>
> --
> Fajar

Yes, this one is slow. I know it's getting full; I'm just copying to a new
disk (it will take days or even weeks!).
It shouldn't be that fragmented, since data is mostly just added. But still,
can reading become so slow just because of fullness and fragmentation?
It just seems strange to me. I could understand 60 MB/s instead of 130, but
not this much slower. I'll delete some files and see if it gets faster, but
it will take hours to copy them to the new disk.

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Very slow filesystem
  2014-06-04 22:45   ` Igor M
@ 2014-06-04 23:17     ` Timofey Titovets
  0 siblings, 0 replies; 18+ messages in thread
From: Timofey Titovets @ 2014-06-04 23:17 UTC (permalink / raw)
  To: Igor M; +Cc: linux-btrfs

I may be mistaken, but I think that:

btrfstune -x <dev>   # can improve performance, because it reduces metadata size

Also, in recent versions of btrfs-progs the default node size was changed
from 4k to 16k; that can also help (but for that you must reformat the fs).

To clean up "btrfs fi df /", you can try:

btrfs bal start -f -sconvert=dup,soft -mconvert=dup,soft <path>

Data, single: total=52.01GiB, used=49.29GiB
System, DUP: total=8.00MiB, used=16.00KiB
Metadata, DUP: total=1.50GiB, used=483.77MiB

Also, disable compression or use it without the force option; if I understand
correctly, forcing compression also adds fragmentation (filefrag is helpful
here).

Also, to defragment data (if you need to defrag some files), you can do it
with a plain copy and paste: the copy is created without fragmentation.

2014-06-05 1:45 GMT+03:00 Igor M <igork20@gmail.com>:
> On Thu, Jun 5, 2014 at 12:27 AM, Fajar A. Nugraha <list@fajar.net> wrote:
>> On Thu, Jun 5, 2014 at 5:15 AM, Igor M <igork20@gmail.com> wrote:
>>> Hello,
>>>
>>> Why btrfs becames EXTREMELY slow after some time (months) of usage ?
>>
>>> # btrfs fi show
>>> Label: none uuid: b367812a-b91a-4fb2-a839-a3a153312eba
>>> Total devices 1 FS bytes used 2.36TiB
>>> devid 1 size 2.73TiB used 2.38TiB path /dev/sde
>>
>>> # btrfs fi df /mnt/old
>>> Data, single: total=2.36TiB, used=2.35TiB
>>
>> Is that the fs that is slow?
>>
>> It's almost full. Most filesystems would exhibit really bad
>> performance when close to full due to fragmentation issue (threshold
>> vary, but 80-90% full usually means you need to start adding space).
>> You should free up some space (e.g. add a new disk so it becomes
>> multi-device, or delete some files) and rebalance/defrag.
>>
>> --
>> Fajar
>
> Yes this one is slow. I know it's getting full I'm just copying to new
> disk (it will take days or even weeks!).
> It shouldn't be so much fragmented, data is mostly just added. But
> still, can reading became so slow just because fullness and
> fragmentation ?
> It just seems strange to me. If it would be 60Mb/s instead 130, but so
> much slower. I'll delete some files and see if it will be faster, but
> it will take hours to copy them to new disk.

-- 
Best regards,
Timofey.

^ permalink raw reply	[flat|nested] 18+ messages in thread
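For targeted defragmentation of individual files, btrfs also has an online
defragmenter that can recompress while it rewrites. A sketch with example
paths (again, on these kernels defragmenting snapshotted or reflinked files
duplicates their data):

# filefrag /mnt/old/mysql/big_table.MYD
# btrfs filesystem defragment -v -clzo /mnt/old/mysql/big_table.MYD
# btrfs filesystem defragment -r -v -clzo /mnt/old/mysql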
* Re: Very slow filesystem 2014-06-04 22:15 Very slow filesystem Igor M 2014-06-04 22:27 ` Fajar A. Nugraha @ 2014-06-05 3:05 ` Duncan 2014-06-05 3:22 ` Fajar A. Nugraha ` (2 more replies) 2014-06-05 8:08 ` Erkki Seppala 2 siblings, 3 replies; 18+ messages in thread From: Duncan @ 2014-06-05 3:05 UTC (permalink / raw) To: linux-btrfs Igor M posted on Thu, 05 Jun 2014 00:15:31 +0200 as excerpted: > Why btrfs becames EXTREMELY slow after some time (months) of usage ? > This is now happened second time, first time I though it was hard drive > fault, but now drive seems ok. > Filesystem is mounted with compress-force=lzo and is used for MySQL > databases, files are mostly big 2G-8G. That's the problem right there, database access pattern on files over 1 GiB in size, but the problem along with the fix has been repeated over and over and over and over... again on this list, and it's covered on the btrfs wiki as well, so I guess you haven't checked existing answers before you asked the same question yet again. Never-the-less, here's the basic answer yet again... Btrfs, like all copy-on-write (COW) filesystems, has a tough time with a particular file rewrite pattern, that being frequently changed and rewritten data internal to an existing file (as opposed to appended to it, like a log file). In the normal case, such an internal-rewrite pattern triggers copies of the rewritten blocks every time they change, *HIGHLY* fragmenting this type of files after only a relatively short period. While compression changes things up a bit (filefrag doesn't know how to deal with it yet and its report isn't reliable), it's not unusual to see people with several-gig files with this sort of write pattern on btrfs without compression find filefrag reporting literally hundreds of thousands of extents! For smaller files with this access pattern (think firefox/thunderbird sqlite database files and the like), typically up to a few hundred MiB or so, btrfs' autodefrag mount option works reasonably well, as when it sees a file fragmenting due to rewrite, it'll queue up that file for background defrag via sequential copy, deleting the old fragmented copy after the defrag is done. For larger files (say a gig plus) with this access pattern, typically larger database files as well as VM images, autodefrag doesn't scale so well, as the whole file must be rewritten each time, and at that size the changes can come faster than the file can be rewritten. So a different solution must be used for them. The recommended solution for larger internal-rewrite-pattern files is to give them the NOCOW file attribute (chattr +C) , so they're updated in place. However, this attribute cannot be added to a file with existing data and have things work as expected. NOCOW must be added to the file before it contains data. The easiest way to do that is to set the attribute on the subdir that will contain the files and let the files inherit the attribute as they are created. Then you can copy (not move, and don't use cp's --reflink option) existing files into the new subdir, such that the new copy gets created with the NOCOW attribute. NOCOW files are updated in-place, thereby eliminating the fragmentation that would otherwise occur, keeping them fast to access. However, there are a few caveats. 
Setting NOCOW turns off file compression and checksumming as well, which is actually what you want for such files as it eliminates race conditions and other complex issues that would otherwise occur when trying to update the files in-place (thus the reason such features aren't part of most non-COW filesystems, which update in-place by default). Additionally, taking a btrfs snapshot locks the existing data in place for the snapshot, so the first rewrite to a file block (4096 bytes, I believe) after a snapshot will always be COW, even if the file has the NOCOW attribute set. Some people run automatic snapshotting software and can be taking snapshots as often as once a minute. Obviously, this effectively almost kills NOCOW entirely, since it's then only effective on changes after the first one between shapshots, and with snapshots only a minute apart, the file fragments almost as fast as it would have otherwise! So snapshots and the NOCOW attribute basically don't get along with each other. But because snapshots stop at subvolume boundaries, one method to avoid snapshotting NOCOW files is to put your NOCOW files, already in their own subdirs if using the suggestion above, into dedicated subvolumes as well. That lets you continue taking snapshots of the parent subvolume, without snapshotting the the dedicated subvolumes containing the NOCOW database or VM-image files. You'd then do conventional backups of your database and VM-image files, instead of snapshotting them. Of course if you're not using btrfs snapshots in the first place, you can avoid the whole subvolume thing, and just put your NOCOW files in their own subdirs, setting NOCOW on the subdir as suggested above, so files (and further subdirs, nested subdirs inherit the NOCOW as well) inherit the NOCOW of the subdir they're created in, at that creation. Meanwhile, it can be noted that once you turn off COW/compression/ checksumming, and if you're not snapshotting, you're almost back to the features of a normal filesystem anyway, except you can still use the btrfs multi-device features, of course. So if you're not using the multi- device features either, an alternative solution is to simply use a more traditional filesystem (like ext4 or xfs, with xfs being targeted at large files anyway, so for multi-gig database and VM-image files it could be a good choice =:^) for your large internal-rewrite-pattern files, while potentially continuing to use btrfs for your normal files, where btrfs' COW nature and other features are a better match for the use-case, than they are for gig-plus internal-rewrite-pattern files. As I said, further discussion elsewhere already, but that's the problem you're seeing along with a couple potential solutions. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 18+ messages in thread
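A concrete sketch of the setup described above, with example paths: a
dedicated subvolume (so snapshots of the parent don't descend into it) whose
directory carries the NOCOW attribute, and a plain copy of the existing
database files into it. The +C attribute only takes effect for files created
after it is set, which is why the copy matters:

# btrfs subvolume create /mnt/pool/mysql-nocow
# chattr +C /mnt/pool/mysql-nocow
# lsattr -d /mnt/pool/mysql-nocow
# cp --reflink=never /mnt/old/mysql/ibdata1 /mnt/pool/mysql-nocow/
# lsattr /mnt/pool/mysql-nocow/ibdata1

lsattr should show the C flag on both the directory and the new copy; the
database is then pointed at the new location.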
* Re: Very slow filesystem 2014-06-05 3:05 ` Duncan @ 2014-06-05 3:22 ` Fajar A. Nugraha 2014-06-05 4:45 ` Duncan 2014-06-05 7:50 ` Igor M 2014-06-05 15:52 ` Igor M 2 siblings, 1 reply; 18+ messages in thread From: Fajar A. Nugraha @ 2014-06-05 3:22 UTC (permalink / raw) To: linux-btrfs (resending to the list as plain text, the original reply was rejected due to HTML format) On Thu, Jun 5, 2014 at 10:05 AM, Duncan <1i5t5.duncan@cox.net> wrote: > > Igor M posted on Thu, 05 Jun 2014 00:15:31 +0200 as excerpted: > > > Why btrfs becames EXTREMELY slow after some time (months) of usage ? > > This is now happened second time, first time I though it was hard drive > > fault, but now drive seems ok. > > Filesystem is mounted with compress-force=lzo and is used for MySQL > > databases, files are mostly big 2G-8G. > > That's the problem right there, database access pattern on files over 1 > GiB in size, but the problem along with the fix has been repeated over > and over and over and over... again on this list, and it's covered on the > btrfs wiki as well Which part on the wiki? It's not on https://btrfs.wiki.kernel.org/index.php/FAQ or https://btrfs.wiki.kernel.org/index.php/UseCases > so I guess you haven't checked existing answers > before you asked the same question yet again. > > Never-the-less, here's the basic answer yet again... > > Btrfs, like all copy-on-write (COW) filesystems, has a tough time with a > particular file rewrite pattern, that being frequently changed and > rewritten data internal to an existing file (as opposed to appended to > it, like a log file). In the normal case, such an internal-rewrite > pattern triggers copies of the rewritten blocks every time they change, > *HIGHLY* fragmenting this type of files after only a relatively short > period. While compression changes things up a bit (filefrag doesn't know > how to deal with it yet and its report isn't reliable), it's not unusual > to see people with several-gig files with this sort of write pattern on > btrfs without compression find filefrag reporting literally hundreds of > thousands of extents! > > For smaller files with this access pattern (think firefox/thunderbird > sqlite database files and the like), typically up to a few hundred MiB or > so, btrfs' autodefrag mount option works reasonably well, as when it sees > a file fragmenting due to rewrite, it'll queue up that file for > background defrag via sequential copy, deleting the old fragmented copy > after the defrag is done. > > For larger files (say a gig plus) with this access pattern, typically > larger database files as well as VM images, autodefrag doesn't scale so > well, as the whole file must be rewritten each time, and at that size the > changes can come faster than the file can be rewritten. So a different > solution must be used for them. If COW and rewrite is the main issue, why don't zfs experience the extreme slowdown (that is, not if you have sufficient free space available, like 20% or so)? -- Fajar ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Very slow filesystem 2014-06-05 3:22 ` Fajar A. Nugraha @ 2014-06-05 4:45 ` Duncan 0 siblings, 0 replies; 18+ messages in thread From: Duncan @ 2014-06-05 4:45 UTC (permalink / raw) To: linux-btrfs Fajar A. Nugraha posted on Thu, 05 Jun 2014 10:22:49 +0700 as excerpted: > (resending to the list as plain text, the original reply was rejected > due to HTML format) > > On Thu, Jun 5, 2014 at 10:05 AM, Duncan <1i5t5.duncan@cox.net> wrote: >> >> Igor M posted on Thu, 05 Jun 2014 00:15:31 +0200 as excerpted: >> >> > Why btrfs becames EXTREMELY slow after some time (months) of usage ? >> > This is now happened second time, first time I though it was hard >> > drive fault, but now drive seems ok. >> > Filesystem is mounted with compress-force=lzo and is used for MySQL >> > databases, files are mostly big 2G-8G. >> >> That's the problem right there, database access pattern on files over 1 >> GiB in size, but the problem along with the fix has been repeated over >> and over and over and over... again on this list, and it's covered on >> the btrfs wiki as well > > Which part on the wiki? It's not on > https://btrfs.wiki.kernel.org/index.php/FAQ or > https://btrfs.wiki.kernel.org/index.php/UseCases Most of the discussion and information is on the list, but there's a limited amount of information on the wiki in at least three places. Two are on the mount options page, in the autodefrag and nodatacow options description: * Autodefrag says it's well suited to bdb and sqlite dbs but not vm images or big dbs (yet). * Nodatacow says performance gain is usually under 5% *UNLESS* the workload is random writes to large db files, where the difference can be VERY large. (There's also mention of the fact that this turns off checksumming and compression.) Of course that's the nodatacow mount option, not the NOCOW file attribute, which isn't to my knowledge discussed on the wiki, and given the wiki wording, one does indeed have to read a bit between the lines, but it is there if one looks. That was certainly enough hint for me to mark the issue for further study as I did my initial pre-mkfs.btrfs research, for instance, and that it was a problem, with additional detail, was quickly confirmed once I checked the list. * Additionally, there some discussion in the FAQ under "Can copy-on-write be turned off for data blocks?", including discussion of the command used (chattr +C), a link to a script, a shell commands example, and the hint "will produce file suitable for a raw VM image -- the blocks will be updated in-place and are preallocated." FWIW, if I did wiki editing there'd probably be a dedicated page discussing it, but for better or worse, I seem to work best on mailing lists and newsgroups, and every time I've tried contributing on the web, even when it has been to a web forum which one would think would be close enough to lists/groups for me to adapt to, it simply hasn't gone much of anywhere. So these days I let other people more comfortable with editing wikis or doing web forums do that (and sometimes people do that by either actually quoting my list post nearly verbatim or simply linking to it, which I'm fine with, as after all that's where much of the info I post comes from in the first place), and I stick to the lists. Since I don't directly contribute to the wiki I don't much criticize it, but there are indeed at least hints there for those who can read them, something I did myself so I know it's not asking the impossible. 
> If COW and rewrite is the main issue, why don't zfs experience the > extreme slowdown (that is, not if you have sufficient free space > available, like 20% or so)? My personal opinion? Primarily two things: 1) zfs is far more mature than btrfs and has been in production usage for many years now, while btrfs is still barely getting the huge warnings stripped off. There's a lot of btrfs optimization possible that simply hasn't occurred yet as the focus is still real data-destruction-risk bugs, and in fact, btrfs isn't yet feature-complete either, so there's still focus on raw feature development as well. When btrfs gets to the maturity level that zfs is at now, I expect a lot of the problems we have now will have been dramatically reduced if not eliminated. (And the devs are indeed working on this problem, among others.) 2) Stating the obvious, while both btrfs and zfs are COW based and have other similarities, btrfs is an different filesystem, with an entirely different implementation and somewhat different emphasis. There consequently WILL be some differences, even when they're both mature filesystems. It's entirely possible that something about the btrfs implementation makes it less suitable in general to this particular use- case. Additionally, while I don't have zfs experience myself nor do I find it a particularly feasible option for me due to licensing and political issues, from what I've read it tends to handle certain issues by simply throwing gigs on gigs of memory at the problem. Btrfs is designed to require far less memory, and as such, will by definition be somewhat more limited in spots. (Arguably, this is simply a specific case of #2 above, they're individual filesystems with differing implementation and emphasis, so WILL by definition have different ideal use-cases.) Meanwhile, there's that specific mention of 20% zfs free-space available, above. On btrfs, as long as some amount of chunk-space remains unallocated to chunks, percentage free-space has little to no effect on performance. And with metadata chunk-sizes of a quarter gig and data chunk-sizes of a gig, at the terabyte filesystem scale that equates to well under 1% free, before free-space becomes a performance issue at all. So if indeed zfs is like many other filesystems in requiring 10-20% freespace in ordered to perform at best efficiency (I really don't know if that's the case or not, but it is part of the claim above), then that again simply emphasizes the differences between zfs and btrfs, since that literally has zero bearing at all on btrfs efficiency. Rather, at least until btrfs gets automatic entirely unattended chunkspace rebalance triggering the btrfs issue is far more likely to be literally running out of either data or metadata space as all the chunks with freespace are allocated to the other one. (Usually, it's metadata that runs out first, with lots of free space tied up in nearly empty data chunks. But it can be either. Of course a currently manually triggered rebalance can be used to solve this problem, but at present, it IS manually triggered, no automatic rebalancing functionality at all.) So while zfs and btrfs might be similarly based on COW technology, they really are entirely different filesystems, with vastly different maturity levels and some pretty big differences in behavior as well as licensing and political philosophy, certainly now, but potentially even as btrfs matures to match zfs maturity, too. -- Duncan - List replies preferred. No HTML msgs. 
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Very slow filesystem
  2014-06-05  3:05 ` Duncan
  2014-06-05  3:22   ` Fajar A. Nugraha
@ 2014-06-05  7:50   ` Igor M
  2014-06-05 10:54     ` Russell Coker
  2014-06-05 15:52 ` Igor M
  2 siblings, 1 reply; 18+ messages in thread
From: Igor M @ 2014-06-05 7:50 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs@vger.kernel.org

Thanks for the explanation. I did read the wiki, but didn't see this
mentioned. I saw the 'nodatacow' mount option mentioned, but that disables
compression and I need compression.
Also, I was wrong about the file sizes; files can go up to 70 GB. But data is
only ever appended to these big tables, never deleted, so no rewrites should
be happening. I also see now that reads are very slow for some of the initial
files (under 1 GB) that are rewritten/modified a lot, while files where data
is only appended read at more or less normal speed. So it does seem to be
fragmentation. For example, for one file that reads slowly (800 MB) filefrag
reports 63282 extents, while an 8 GB file that has only 9 extents reads at
normal speed.
I'll put the frequently modified files in a directory with the NOCOW
attribute. Thanks.

On Thu, Jun 5, 2014 at 5:05 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> Igor M posted on Thu, 05 Jun 2014 00:15:31 +0200 as excerpted:
>
>> Why btrfs becames EXTREMELY slow after some time (months) of usage ?
>> This is now happened second time, first time I though it was hard drive
>> fault, but now drive seems ok.
>> Filesystem is mounted with compress-force=lzo and is used for MySQL
>> databases, files are mostly big 2G-8G.
>
> That's the problem right there, database access pattern on files over 1
> GiB in size, but the problem along with the fix has been repeated over
> and over and over and over... again on this list, and it's covered on the
> btrfs wiki as well, so I guess you haven't checked existing answers
> before you asked the same question yet again.
>
> Never-the-less, here's the basic answer yet again...
>
> Btrfs, like all copy-on-write (COW) filesystems, has a tough time with a
> particular file rewrite pattern, that being frequently changed and
> rewritten data internal to an existing file (as opposed to appended to
> it, like a log file).  In the normal case, such an internal-rewrite
> pattern triggers copies of the rewritten blocks every time they change,
> *HIGHLY* fragmenting this type of files after only a relatively short
> period.  While compression changes things up a bit (filefrag doesn't know
> how to deal with it yet and its report isn't reliable), it's not unusual
> to see people with several-gig files with this sort of write pattern on
> btrfs without compression find filefrag reporting literally hundreds of
> thousands of extents!
>
> For smaller files with this access pattern (think firefox/thunderbird
> sqlite database files and the like), typically up to a few hundred MiB or
> so, btrfs' autodefrag mount option works reasonably well, as when it sees
> a file fragmenting due to rewrite, it'll queue up that file for
> background defrag via sequential copy, deleting the old fragmented copy
> after the defrag is done.
>
> For larger files (say a gig plus) with this access pattern, typically
> larger database files as well as VM images, autodefrag doesn't scale so
> well, as the whole file must be rewritten each time, and at that size the
> changes can come faster than the file can be rewritten.  So a different
> solution must be used for them.
> > The recommended solution for larger internal-rewrite-pattern files is to > give them the NOCOW file attribute (chattr +C) , so they're updated in > place. However, this attribute cannot be added to a file with existing > data and have things work as expected. NOCOW must be added to the file > before it contains data. The easiest way to do that is to set the > attribute on the subdir that will contain the files and let the files > inherit the attribute as they are created. Then you can copy (not move, > and don't use cp's --reflink option) existing files into the new subdir, > such that the new copy gets created with the NOCOW attribute. > > NOCOW files are updated in-place, thereby eliminating the fragmentation > that would otherwise occur, keeping them fast to access. > > However, there are a few caveats. Setting NOCOW turns off file > compression and checksumming as well, which is actually what you want for > such files as it eliminates race conditions and other complex issues that > would otherwise occur when trying to update the files in-place (thus the > reason such features aren't part of most non-COW filesystems, which > update in-place by default). > > Additionally, taking a btrfs snapshot locks the existing data in place > for the snapshot, so the first rewrite to a file block (4096 bytes, I > believe) after a snapshot will always be COW, even if the file has the > NOCOW attribute set. Some people run automatic snapshotting software and > can be taking snapshots as often as once a minute. Obviously, this > effectively almost kills NOCOW entirely, since it's then only effective > on changes after the first one between shapshots, and with snapshots only > a minute apart, the file fragments almost as fast as it would have > otherwise! > > So snapshots and the NOCOW attribute basically don't get along with each > other. But because snapshots stop at subvolume boundaries, one method to > avoid snapshotting NOCOW files is to put your NOCOW files, already in > their own subdirs if using the suggestion above, into dedicated subvolumes > as well. That lets you continue taking snapshots of the parent subvolume, > without snapshotting the the dedicated subvolumes containing the NOCOW > database or VM-image files. > > You'd then do conventional backups of your database and VM-image files, > instead of snapshotting them. > > Of course if you're not using btrfs snapshots in the first place, you can > avoid the whole subvolume thing, and just put your NOCOW files in their > own subdirs, setting NOCOW on the subdir as suggested above, so files > (and further subdirs, nested subdirs inherit the NOCOW as well) inherit > the NOCOW of the subdir they're created in, at that creation. > > Meanwhile, it can be noted that once you turn off COW/compression/ > checksumming, and if you're not snapshotting, you're almost back to the > features of a normal filesystem anyway, except you can still use the > btrfs multi-device features, of course. So if you're not using the multi- > device features either, an alternative solution is to simply use a more > traditional filesystem (like ext4 or xfs, with xfs being targeted at > large files anyway, so for multi-gig database and VM-image files it could > be a good choice =:^) for your large internal-rewrite-pattern files, > while potentially continuing to use btrfs for your normal files, where > btrfs' COW nature and other features are a better match for the use-case, > than they are for gig-plus internal-rewrite-pattern files. 
> > As I said, further discussion elsewhere already, but that's the problem > you're seeing along with a couple potential solutions. > > -- > Duncan - List replies preferred. No HTML msgs. > "Every nonfree program has a lord, a master -- > and if you use the program, he is your master." Richard Stallman > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 18+ messages in thread
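To rank which files are actually worst affected before deciding what to move
into a NOCOW directory, something like the following works (the path is an
example, and as noted earlier in the thread the extent counts are inflated
for btrfs-compressed files, so treat them as relative rather than absolute
numbers):

# find /mnt/old/mysql -xdev -type f -size +100M -exec filefrag {} + | sort -t: -k2 -rn | head -20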
* Re: Very slow filesystem 2014-06-05 7:50 ` Igor M @ 2014-06-05 10:54 ` Russell Coker 0 siblings, 0 replies; 18+ messages in thread From: Russell Coker @ 2014-06-05 10:54 UTC (permalink / raw) To: Igor M; +Cc: Duncan, linux-btrfs@vger.kernel.org On Thu, 5 Jun 2014 09:50:53 Igor M wrote: > But data to this big tables is only appended, it's never deleted. So > no rewrites should be happening. When you write to the big tables the indexes will be rewritten. Indexes can be in the same file as table data or as separate files depending on what data base you use. For the former you get fragmented table files and for the latter 70G of data will have index files that are large enough to get fragmented. Also when you have multiple files in a filesystem being written at the same time (EG multiple tables appended to in each transaction) then you will get some fragmentation. Add COW and that makes a lot of fragmentation. Finally append is done at the file level while COW is rewriting at the block level. If your database rounds up the allocated space to some power of 2 larger than 4K then things will be fine for a filesystem like Ext3 where file offsets correspond to fixed locations on disk. But with BTRFS that pre- allocated space filled with zeros will be rewritten to a different part of disk when the allocated space is used. If you use a database that doesn't preallocate space then COW will be invoked when the end of the file at an offset that isn't a multiple of 4K (or I think 16K for a BTRFS filesystem created with a recent mkfs.btrfs) is written as appending to data within a block offset means rewriting the block. I believe that COW is desirable for a database. I don't believe that a lack of integrity at the filesystem level will help integrity at the database level. If the working set of your database can fit in RAM then you can rely on cache to ensure that little data is read during operation. For example one of my database servers has been running for 330 days and the /mysql filesystem has writes outnumbering reads by a factor of 3:1. When most IO is for writes fragmentation of data is less of an issue - although in this case the server is running Ext3 so it wouldn't get the COW fragmentation issues. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/ ^ permalink raw reply [flat|nested] 18+ messages in thread
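The database side has knobs that interact with all of this. Purely as an
illustration for MySQL/InnoDB (whether these particular values are
appropriate depends entirely on the workload, available RAM and durability
requirements):

[mysqld]
innodb_file_per_table   = 1
innodb_flush_method     = O_DIRECT
innodb_buffer_pool_size = 8G

One file per table makes it possible to keep only the hot, rewrite-heavy
tables in a NOCOW directory; O_DIRECT avoids double caching in the page
cache; and a buffer pool sized to the working set keeps most reads out of the
filesystem entirely, as described above.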
* Re: Very slow filesystem
  2014-06-05  3:05 ` Duncan
  2014-06-05  3:22   ` Fajar A. Nugraha
  2014-06-05  7:50   ` Igor M
@ 2014-06-05 15:52 ` Igor M
  2014-06-05 16:13   ` Timofey Titovets
  2 siblings, 1 reply; 18+ messages in thread
From: Igor M @ 2014-06-05 15:52 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs@vger.kernel.org

One more question: is there any other way to find out file fragmentation?
I just copied a 35 GB file onto a new (compressed) btrfs filesystem and
filefrag reports 282275 extents found. That can't be right?

On Thu, Jun 5, 2014 at 5:05 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> Igor M posted on Thu, 05 Jun 2014 00:15:31 +0200 as excerpted:
>

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Very slow filesystem
  2014-06-05 15:52 ` Igor M
@ 2014-06-05 16:13   ` Timofey Titovets
  2014-06-05 19:53     ` Duncan
  0 siblings, 1 reply; 18+ messages in thread
From: Timofey Titovets @ 2014-06-05 16:13 UTC (permalink / raw)
  To: Igor M, linux-btrfs

2014-06-05 18:52 GMT+03:00 Igor M <igork20@gmail.com>:
> One more question. Is there any other way to find out file fragmentation ?
> I just copied 35Gb file on new btrfs filesystem (compressed) and
> filefrag reports 282275 extents found. This can't be right ?

Yes, it can: filefrag shows each compressed block (128 KiB) as one extent.

-- 
Best regards,
Timofey.

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: Very slow filesystem 2014-06-05 16:13 ` Timofey Titovets @ 2014-06-05 19:53 ` Duncan 2014-06-06 19:06 ` Mitch Harder 0 siblings, 1 reply; 18+ messages in thread From: Duncan @ 2014-06-05 19:53 UTC (permalink / raw) To: linux-btrfs Timofey Titovets posted on Thu, 05 Jun 2014 19:13:08 +0300 as excerpted: > 2014-06-05 18:52 GMT+03:00 Igor M <igork20@gmail.com>: >> One more question. Is there any other way to find out file >> fragmentation ? >> I just copied 35Gb file on new btrfs filesystem (compressed) and >> filefrag reports 282275 extents found. This can't be right ? > hes, because filefrag show compressed block (128kbite) as one extent. Correct. Which is why my original answer had this: >>> While compression changes things up a bit (filefrag doesn't know >>> how to deal with it yet and its report isn't reliable), I skipped over the "why" at the time as it wasn't necessary for the then- current discussion, but indeed, the reason is because filefrag counts each 128 KiB block as a separate fragment because it doesn't understand them, and as a result, it's not (currently) usable for btrfs-compressed files. They (the btrfs, filefrag, and kernel folks) are working on teaching filefrag about the problem so it can report correct information, but the approach being taken is to setup generic kernel functionality to report that information to filefrag, such that other filesystems can use the same functionality, which makes it a rather more complicated project than a simple one-shot fix for btrfs by itself would be. So while the problem is indeed being actively worked on, it could be some time before we actually have a filefrag that's accurate in btrfs-compressed-file situations, but once we do, we can be confident the solution is a correct one that can be used well into the future by btrfs and other filesystems as well, not just a brittle hack of a few minutes to a day that can't be used for anything else and that could well break again two kernel cycles down the road. Unfortunately, I know of nothing else that can report that information, so the only real suggestion I have is to either turn off compression or forget about tracking individual file fragmentation for now and go only on performance. But as it happens, the NOCOW file attribute turns off compression (as well as checksumming) for that file anyway, because in-place rewrite would otherwise trigger complex issues and race conditions that are a risk to the data as well as performance, which is why traditional non-COW filesystems don't tend to offer these features in the first place. Btrfs' normal COW nature makes these features possible, but as this thread already explores, unfortunately simply isn't suitable for certain access patterns. So if the file is properly (that is, at creation) set NOCOW, filefrag should indeed be accurate, because the file won't be (btrfs-)compressed. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 18+ messages in thread
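Until filefrag learns about btrfs compression, a rough workaround is to parse
filefrag -v output and merge extents that are physically adjacent on disk,
which collapses the per-128-KiB compressed extents back into contiguous runs.
A sketch, assuming the usual e2fsprogs -v column layout (ext, logical range,
physical range, length, ...), which may differ between versions:

filefrag -v /mnt/new/bigfile | awk '
    /^ *[0-9]+:/ {
        gsub(/\.\./, " "); gsub(/:/, " ")     # strip range/field punctuation
        start = $4; end = $5                  # physical start/end, in fs blocks
        if (start != prev_end + 1) runs++     # discontiguous -> new run
        prev_end = end
    }
    END { print runs, "physically contiguous runs" }'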
* Re: Very slow filesystem 2014-06-05 19:53 ` Duncan @ 2014-06-06 19:06 ` Mitch Harder 2014-06-06 19:59 ` Duncan 2014-06-07 2:29 ` Russell Coker 0 siblings, 2 replies; 18+ messages in thread From: Mitch Harder @ 2014-06-06 19:06 UTC (permalink / raw) To: Duncan; +Cc: linux-btrfs On Thu, Jun 5, 2014 at 2:53 PM, Duncan <1i5t5.duncan@cox.net> wrote: > Timofey Titovets posted on Thu, 05 Jun 2014 19:13:08 +0300 as excerpted: > >> 2014-06-05 18:52 GMT+03:00 Igor M <igork20@gmail.com>: >>> One more question. Is there any other way to find out file >>> fragmentation ? >>> I just copied 35Gb file on new btrfs filesystem (compressed) and >>> filefrag reports 282275 extents found. This can't be right ? > >> hes, because filefrag show compressed block (128kbite) as one extent. > > Correct. Which is why my original answer had this: > >>>> While compression changes things up a bit (filefrag doesn't know >>>> how to deal with it yet and its report isn't reliable), > > I skipped over the "why" at the time as it wasn't necessary for the then- > current discussion, but indeed, the reason is because filefrag counts > each 128 KiB block as a separate fragment because it doesn't understand > them, and as a result, it's not (currently) usable for btrfs-compressed > files. > In the context of a compressed database file, the 128 KiB compressed block size has more severe consequences. First, even if the 128 KiB blocks are contiguous, each 128 KiB block has it's own metadata entry. So you already have much higher metadata utilization than without compression. And the metadata can also get fragmented. Every time you update your database, btrfs is going to update whichever 128 KiB blocks need to be modified. Even for a tiny modification, the new compressed block may be slightly more or slightly less than 128 KiB. If you have a 1-2 GB database that is being updated with any frequency, you can see how you will quickly end up with lots of metadata fragmentation as well as inefficient data block utilization. I think this will be the case even if you switch to NOCOW due to the compression. On a very fundamental level, file system compression and large databases are two use cases that are difficult to reconcile. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Very slow filesystem 2014-06-06 19:06 ` Mitch Harder @ 2014-06-06 19:59 ` Duncan 2014-06-07 2:29 ` Russell Coker 1 sibling, 0 replies; 18+ messages in thread From: Duncan @ 2014-06-06 19:59 UTC (permalink / raw) To: linux-btrfs Mitch Harder posted on Fri, 06 Jun 2014 14:06:53 -0500 as excerpted: > Every time you update your database, btrfs is going to update whichever > 128 KiB blocks need to be modified. > > Even for a tiny modification, the new compressed block may be slightly > more or slightly less than 128 KiB. FWIW, I believe that's 128 KiB pre-compression. And at least without compress-force, btrfs will try the compression and if the compressed size is larger than the uncompressed size, it simply won't compress that block. So 128 KiB is the largest amount of space that 128 KiB of data could take with compression on, but it can be half that or less if the compression happens to be good for that 128 KiB block. > If you have a 1-2 GB database that is being updated with any frequency, > you can see how you will quickly end up with lots of metadata > fragmentation as well as inefficient data block utilization. > I think this will be the case even if you switch to NOCOW due to the > compression. That is one reason that, as I said, NOCOW turns off compression. Compression simply doesn't work well with in-place updates, because as you point out, the update may compress more or less well than the original, and that won't work in-place. So NOCOW turns off compression to avoid the problem. If its COW (that is, not NOCOW), then the COW-based out-of-place-updates avoid the problem of fitting more data in the same space, because the new write can take more space in the new location if it has to. But you are correct that compression and large, frequently updated databases don't play well together either. Which is why turning off compression when turning off COW isn't the big problem it would first appear to be -- as it happens, the very same files where COW doesn't work well, are also the ones where compression doesn't work well. Similarly for checksumming. When there are enough updates, in addition to taking more time to calculate and write, checksumming simply invites race conditions between the last then-valid checksum and the next update invalidating it. In addition, in many, perhaps most cases, the sorts of apps that do constant internal updates, have already evolved their own data integrity verification methods in ordered to cope with issues on the after all way more common unverified filesystems, creating even more possible race conditions and timing issues and making all that extra work that btrfs normally does for verification unnecessary. Trying to do all that in-place due to NOCOW is a recipe for failure or insanity if not both So when turning off COW, just turning off checksumming/verification and compression along with it makes the most sense, and that's what btrfs does. To do otherwise is just asking for trouble, which is why you very rarely see in-place-update-by-default filesystems offering either transparent compression or data verification as features. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Very slow filesystem 2014-06-06 19:06 ` Mitch Harder 2014-06-06 19:59 ` Duncan @ 2014-06-07 2:29 ` Russell Coker 1 sibling, 0 replies; 18+ messages in thread From: Russell Coker @ 2014-06-07 2:29 UTC (permalink / raw) To: linux-btrfs On Fri, 6 Jun 2014 14:06:53 Mitch Harder wrote: > Every time you update your database, btrfs is going to update > whichever 128 KiB blocks need to be modified. > > Even for a tiny modification, the new compressed block may be slightly > more or slightly less than 128 KiB. > > If you have a 1-2 GB database that is being updated with any > frequency, you can see how you will quickly end up with lots of > metadata fragmentation as well as inefficient data block utilization. > I think this will be the case even if you switch to NOCOW due to the > compression. > > On a very fundamental level, file system compression and large > databases are two use cases that are difficult to reconcile. The ZFS approach of using a ZIL (write-back cache that caches before allocation) and L2ARC (read-cache on SSD) mitigates these problems. Samsung 1TB SSDs are $565 at my local computer store, if your database has a working set of less than 2TB then SSDs with L2ARC should solve those performance problems at low cost. The vast majority of sysadmins have never seen a database that's 2TB in size, let alone one with a 2TB working set. That said I've seen Oracle docs recommending against ZFS for large databases, but the Oracle definition of "large database" is probably a lot larger than anything that is likely to be stored on BTRFS in the near future. Another thing to note is that there are a variety of ways of storing compressed data in databases. Presumably anyone who is storing so much data that the working set exceeds the ability to attach lots of SSDs is going to be using some form of compressed tables which will reduce the ability of filesystem compression to do any good. On Fri, 6 Jun 2014 19:59:55 Duncan wrote: > Similarly for checksumming. When there are enough updates, in addition > to taking more time to calculate and write, checksumming simply invites > race conditions between the last then-valid checksum and the next update > invalidating it. In addition, in many, perhaps most cases, the sorts of > apps that do constant internal updates, have already evolved their own > data integrity verification methods in ordered to cope with issues on the > after all way more common unverified filesystems, creating even more > possible race conditions and timing issues and making all that extra work > that btrfs normally does for verification unnecessary. Trying to do all > that in-place due to NOCOW is a recipe for failure or insanity if not both http://www.strchr.com/crc32_popcnt The above URL has some interesting information about CRC32 speed. In summary if you have a Core i5 system then you are looking at less than a clock cycle per byte on average. So if your storage is capable of handling more than 4GB/s of data transfer then CRC32 might be a bottleneck. But doing 4GB/s for a database is a very different problem. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/ ^ permalink raw reply [flat|nested] 18+ messages in thread
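A quick way to put a number on checksumming cost for a particular machine is
to time a plain CRC over a few gigabytes of data; cksum computes a CRC-32 in
userspace, so this is a conservative lower bound compared to the
hardware-assisted CRC32C the kernel can use:

# time dd if=/dev/zero bs=1M count=4096 | cksum

If the pipeline sustains several hundred MB/s or more, checksumming is
unlikely to be the bottleneck next to a disk doing 130 MB/s sequential.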
* Re: Very slow filesystem
  2014-06-04 22:15 Very slow filesystem Igor M
  2014-06-04 22:27 ` Fajar A. Nugraha
  2014-06-05  3:05 ` Duncan
@ 2014-06-05  8:08 ` Erkki Seppala
  2014-06-05  8:12   ` Erkki Seppala
  2 siblings, 1 reply; 18+ messages in thread
From: Erkki Seppala @ 2014-06-05 8:08 UTC (permalink / raw)
  To: linux-btrfs

Igor M <igork20@gmail.com> writes:

> Why btrfs becames EXTREMELY slow after some time (months) of usage ?

Have you tried iostat from sysstat to see the number of I/O operations
performed per second (tps) on the devices when the filesystem is performing
badly?

If that number is hitting your seek rate (i.e. 1/0.0075 = 133 for a 7.5 ms
seek), then fragmentation is surely to blame.

-- 
_____________________________________________________________________ / __// /__ ____ __ http://www.modeemi.fi/~flux/\ \ / /_ / // // /\ \/ / \ / /_/ /_/ \___/ /_/\_\@modeemi.fi \/

^ permalink raw reply	[flat|nested] 18+ messages in thread
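For example (5-second intervals on the suspect device; the interesting
columns are r/s, w/s and avgrq-sz):

# iostat -dxk 5 /dev/sde

A 7200 rpm disk tops out at very roughly 100-150 random seeks per second, so
if r/s plus w/s sits around that ceiling while the average request size stays
small and throughput is only a few MB/s, the workload is seek-bound, which is
exactly what heavy fragmentation looks like.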
* Re: Very slow filesystem 2014-06-05 8:08 ` Erkki Seppala @ 2014-06-05 8:12 ` Erkki Seppala 0 siblings, 0 replies; 18+ messages in thread From: Erkki Seppala @ 2014-06-05 8:12 UTC (permalink / raw) To: linux-btrfs Erkki Seppala <flux-btrfs@inside.org> writes: > If the number is hitting your seek rate (ie. 1/0.0075 for 7.5 ms seek = > 133), then fragmentation is sure to be blamed. Actually the number may very well be off by at least a factor of two (I tested that my device did 400 tps when I expected 200; perhaps bulk transfers cause more transactions than I expect), but it should be in the ballpark I think :). -- _____________________________________________________________________ / __// /__ ____ __ http://www.modeemi.fi/~flux/\ \ / /_ / // // /\ \/ / \ / /_/ /_/ \___/ /_/\_\@modeemi.fi \/ ^ permalink raw reply [flat|nested] 18+ messages in thread