* Oddly slow read performance with near-full largish FS
@ 2014-12-17 2:42 Charles Cazabon
2014-12-19 8:58 ` Satoru Takeuchi
2014-12-20 10:57 ` Robert White
0 siblings, 2 replies; 16+ messages in thread
From: Charles Cazabon @ 2014-12-17 2:42 UTC (permalink / raw)
To: btrfs list
Hi,
I've been running btrfs for various filesystems for a few years now, and have
recently run into problems with a large filesystem becoming *really* slow for
basic reading. None of the debugging/testing suggestions I've come across in
the wiki or in the mailing list archives seems to have helped.
Background: this particular filesystem holds backups for various other
machines on the network, a mix of rdiff-backup data (so lots of small files)
and rsync copies of larger files (everything from ~5MB data files to ~60GB VM
HD images). There's roughly 16TB of data in this filesystem (the filesystem
is ~17TB). The btrfs filesystem is a simple single volume, no snapshots,
multiple devices, or anything like that. It's an LVM logical volume on top of
dmcrypt on top of an mdadm RAID set (8 disks in RAID 6).
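(For reference, a layering like this can be confirmed at a glance on a running system; a minimal sketch, assuming the array is /dev/md0 as mentioned later in the thread:

$ lsblk -o NAME,TYPE,FSTYPE,SIZE    # shows disks -> md raid6 -> dm-crypt -> LVM -> btrfs
$ cat /proc/mdstat                  # RAID level, member count, and any rebuild in progress
)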
The performance: trying to copy the data off this filesystem to another
(non-btrfs) filesystem with rsync or just cp was taking aaaages - I found one
suggestion that it could be because updating the atimes required a COW of the
metadata in btrfs, so I mounted the filesystem noatime, but this doesn't
appear to have made any difference. The speeds I'm seeing (with iotop)
fluctuate a lot. They spend most of the time in the range of 1-3 MB/s, with
large periods of time where no IO seems to happen at all, and occasional short
spikes to ~25-30 MB/s. System load seems to sit around 10-12 (with only 2
processes reported as running, everything else sleeping) while this happens.
The server is doing nothing other than this copy at the time. The only
processes using any noticeable CPU are rsync (source and destination processes,
around 3% CPU each, plus an md0:raid6 process around 2-3%), and a handful of
"kworker" processes, perhaps one per CPU (there are 8 physical cores in the
server, plus hyperthreading).
Other filesystems on the same physical disks have no trouble exceeding 100MB/s
reads. The machine is not swapping (16GB RAM, ~8GB swap with 0 swap used).
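(One way to separate btrfs from the layers underneath it is to time a direct read of the block device itself, bypassing the filesystem; a sketch, using the /dev/mapper/vg-backup path from the btrfs fi show output below, with arbitrary sizes and offsets:

$ sudo dd if=/dev/mapper/vg-backup of=/dev/null bs=1M count=4096 iflag=direct
$ sudo dd if=/dev/mapper/vg-backup of=/dev/null bs=1M count=4096 skip=500000 iflag=direct

If the raw device sustains ~100MB/s here, the bottleneck is above the block layer rather than in md/dmcrypt/LVM.)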
Is there something obvious I'm missing here? Is there a reason I can only
average ~3MB/s reads from a btrfs filesystem?
kernel is x86_64 linux-stable 3.17.6. btrfs-progs is v3.17.3-3-g8cb0438.
Output of the various info commands is:
$ sudo btrfs fi df /media/backup/
Data, single: total=16.24TiB, used=15.73TiB
System, DUP: total=8.00MiB, used=1.75MiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=35.50GiB, used=34.05GiB
Metadata, single: total=8.00MiB, used=0.00
unknown, single: total=512.00MiB, used=0.00
$ btrfs --version
Btrfs v3.17.3-3-g8cb0438
$ sudo btrfs fi show
Label: 'backup' uuid: c18dfd04-d931-4269-b999-e94df3b1918c
Total devices 1 FS bytes used 15.76TiB
devid 1 size 16.37TiB used 16.31TiB path /dev/mapper/vg-backup
Thanks in advance for any suggestions.
Charles
--
-----------------------------------------------------------------------
Charles Cazabon
GPL'ed software available at: http://pyropus.ca/software/
-----------------------------------------------------------------------
^ permalink raw reply [flat|nested] 16+ messages in thread* Re: Oddly slow read performance with near-full largish FS 2014-12-17 2:42 Oddly slow read performance with near-full largish FS Charles Cazabon @ 2014-12-19 8:58 ` Satoru Takeuchi 2014-12-19 16:58 ` Charles Cazabon 2014-12-20 10:57 ` Robert White 1 sibling, 1 reply; 16+ messages in thread From: Satoru Takeuchi @ 2014-12-19 8:58 UTC (permalink / raw) To: Charles Cazabon, linux-btrfs@vger.kernel.org Hi, Sorry for late reply. Let me ask some questions. On 2014/12/17 11:42, Charles Cazabon wrote: > Hi, > > I've been running btrfs for various filesystems for a few years now, and have > recently run into problems with a large filesystem becoming *really* slow for > basic reading. None of the debugging/testing suggestions I've come across in > the wiki or in the mailing list archives seems to have helped. > > Background: this particular filesystem holds backups for various other > machines on the network, a mix of rdiff-backup data (so lots of small files) > and rsync copies of larger files (everything from ~5MB data files to ~60GB VM > HD images). There's roughly 16TB of data in this filesystem (the filesystem > is ~17TB). The btrfs filesystem is a simple single volume, no snapshots, > multiple devices, or anything like that. It's an LVM logical volume on top of > dmcrypt on top of an mdadm RAID set (8 disks in RAID 6). Q1. You mean your Btrfs file system exists on the top of the following deep layers? +---------------+ |Btrfs(single) | +---------------+ |LVM(non RAID?) | +---------------+ |dmcrypt | +---------------+ |mdadm RAID set | +---------------+ # Unfortunately, I don't know how Btrfs works in conjunction #with such a deep layers. Q2. If Q1 is true, is it possible to reduce that layers as follows? +-----------+ |Btrfs(*1) | +-----------+ |dmcrypt | +-----------+ It's because there are too many layers and these have the same/similar features and heavy layered file system tends to cause more trouble than thinner layered ones regardless of file system type. *1) Currently I don't recommend you to use RAID56 of Btrfs. So, if RAID6 is mandatory, mdadm RAID6 is also necessary. > > The performance: trying to copy the data off this filesystem to another > (non-btrfs) filesystem with rsync or just cp was taking aaaages - I found one > suggestion that it could be because updating the atimes required a COW of the > metadata in btrfs, so I mounted the filesystem noatime, but this doesn't > appear to have made any difference. The speeds I'm seeing (with iotop) > fluctuate a lot. They spend most of the time in the range of 1-3 MB/s, with > large periods of time where no IO seems to happen at all, and occasional short > spikes to ~25-30 MB/s. System load seems to sit around 10-12 (with only 2 > processes reported as running, everything else sleeping) while this happens. > The server is doing nothing other than this copy at the time. The only > processes using any noticable CPU are rsync (source and destination processes, > around 3% CPU each, plus an md0:raid6 process around 2-3%), and a handful of > "kworker" processes, perhaps one per CPU (there are 8 physical cores in the > server, plus hyperthreading). > > Other filesystems on the same physical disks have no trouble exceeding 100MB/s > reads. The machine is not swapping (16GB RAM, ~8GB swap with 0 swap used). Q3. They are also consist of the following layers? +---------------+ |XFS/ext4 | +---------------+ |LVM(non RAID?) 
| +---------------+ |dmcrypt | +---------------+ |mdadm RAID set | +---------------+ Q4. Are other filesystems also near-full? Q5. Is there any error/warning message about Btrfs/LVM/dmcrypt/mdadm/hardwares? Thanks, Satoru > > Is there something obvious I'm missing here? Is there a reason I can only > average ~3MB/s reads from a btrfs filesystem? > > kernel is x86_64 linux-stable 3.17.6. btrfs-progs is v3.17.3-3-g8cb0438. > Output of the various info commands is: > > $ sudo btrfs fi df /media/backup/ > Data, single: total=16.24TiB, used=15.73TiB > System, DUP: total=8.00MiB, used=1.75MiB > System, single: total=4.00MiB, used=0.00 > Metadata, DUP: total=35.50GiB, used=34.05GiB > Metadata, single: total=8.00MiB, used=0.00 > unknown, single: total=512.00MiB, used=0.00 > > $ btrfs --version > Btrfs v3.17.3-3-g8cb0438 > > $ sudo btrfs fi show > > Label: 'backup' uuid: c18dfd04-d931-4269-b999-e94df3b1918c > Total devices 1 FS bytes used 15.76TiB > devid 1 size 16.37TiB used 16.31TiB path /dev/mapper/vg-backup > > Thanks in advance for any suggestions. > > Charles > ^ permalink raw reply [flat|nested] 16+ messages in thread
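(For Q5, a few places worth checking; this is a sketch only - the device names are assumptions, and if the disks sit behind a hardware RAID controller, smartctl may need an extra -d option for that controller:

$ dmesg | grep -iE 'btrfs|md0|dm-|error|fail'   # kernel-side complaints from any layer
$ sudo mdadm --detail /dev/md0                  # per-member state, failed/spare counts
$ sudo smartctl -H -A /dev/sdb                  # health summary and attributes for one member disk
)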
* Re: Oddly slow read performance with near-full largish FS 2014-12-19 8:58 ` Satoru Takeuchi @ 2014-12-19 16:58 ` Charles Cazabon 2014-12-19 17:33 ` Duncan 0 siblings, 1 reply; 16+ messages in thread From: Charles Cazabon @ 2014-12-19 16:58 UTC (permalink / raw) To: linux-btrfs@vger.kernel.org Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> wrote: > > Let me ask some questions. Sure - thanks for taking an interest. > On 2014/12/17 11:42, Charles Cazabon wrote: > > There's roughly 16TB of data in this filesystem (the filesystem is ~17TB). > > The btrfs filesystem is a simple single volume, no snapshots, multiple > > devices, or anything like that. It's an LVM logical volume on top of > > dmcrypt on top of an mdadm RAID set (8 disks in RAID 6). > > Q1. You mean your Btrfs file system exists on the top of > the following deep layers? > > +---------------+ > |Btrfs(single) | > +---------------+ > |LVM(non RAID?) | > +---------------+ > |dmcrypt | > +---------------+ > |mdadm RAID set | > +---------------+ Yes, precisely. mdadm is used to make a large RAID6 device, which is encrypted with LUKS, on top of which is layered LVM (for ease of management), and the btrfs filesystem sits on that. > Q2. If Q1 is true, is it possible to reduce that layers as follows? > > +-----------+ > |Btrfs(*1) | > +-----------+ > |dmcrypt | > +-----------+ I don't see how I could do that - I simply have far too much data for a single disk (not to mention I don't want to risk loss of data from a single disk failing). This filesystem has 16.x TB of data in it at present. > It's because there are too many layers and these have > the same/similar features and heavy layered file system > tends to cause more trouble than thinner layered ones > regardless of file system type. This configuration is one I've been using for many years. It's only recently that I've noticed it being particularly slow with btrfs -- I don't know if that's because the filesystem has filled up past some critical point, or due to something else entirely. That's why I'm trying to figure this out. > *1) Currently I don't recommend you to use RAID56 of Btrfs. > So, if RAID6 is mandatory, mdadm RAID6 is also necessary. Yes, exactly. That's why I use mdadm. > > The speeds I'm seeing (with iotop) fluctuate a lot. They spend most of > > the time in the range of 1-3 MB/s, with large periods of time where no IO > > seems to happen at all, and occasional short spikes to ~25-30 MB/s. > > System load seems to sit around 10-12 (with only 2 processes reported as > > running, everything else sleeping) while this happens. [...] > > Other filesystems on the same physical disks have no trouble exceeding > > 100MB/s reads. The machine is not swapping (16GB RAM, ~8GB swap with 0 > > swap used). > > Q3. They are also consist of the following layers? Yes, exactly the same configuration. The fact that I don't see any speed problems with other filesystems (even in the same LVM volume group) leads me in the direction of suspecting something to do with btrfs. > Q4. Are other filesystems also near-full? No, not particularly. Now, the btrfs volume in question isn't exactly close to full - there's more than 500 GB free. It's just *relatively* full. > Q5. Is there any error/warning message about > Btrfs/LVM/dmcrypt/mdadm/hardwares? No, no errors or warnings in logs related to the disks, LVM, or btrfs. I have historically, with previous kernels, gotten the "task blocked for more than 120 seconds" warnings fairly often, but I haven't seen those lately. 
Is there any other info I can collect on this that would help?

Thanks,
Charles
--
-----------------------------------------------------------------------
Charles Cazabon
GPL'ed software available at: http://pyropus.ca/software/
-----------------------------------------------------------------------

^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Oddly slow read performance with near-full largish FS 2014-12-19 16:58 ` Charles Cazabon @ 2014-12-19 17:33 ` Duncan 2014-12-20 8:53 ` Chris Murphy 2014-12-20 10:03 ` Robert White 0 siblings, 2 replies; 16+ messages in thread From: Duncan @ 2014-12-19 17:33 UTC (permalink / raw) To: linux-btrfs Charles Cazabon posted on Fri, 19 Dec 2014 10:58:49 -0600 as excerpted: > This configuration is one I've been using for many years. It's only > recently that I've noticed it being particularly slow with btrfs -- I > don't know if that's because the filesystem has filled up past some > critical point, or due to something else entirely. That's why I'm > trying to figure this out. Not recommending at this point, just saying these are options... Btrfs raid56 mode should, I believe, be pretty close to done with the latest patches. That would be 3.19, however, which isn't out yet of course. There's also raid10, if you have enough devices or little enough data to do it. That's much more mature than raid56 mode and should be about as mature and stable as btrfs in single-device mode, which is what you are using now. But it'll require more devices than a raid56 would... -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Oddly slow read performance with near-full largish FS 2014-12-19 17:33 ` Duncan @ 2014-12-20 8:53 ` Chris Murphy 2014-12-20 10:03 ` Robert White 1 sibling, 0 replies; 16+ messages in thread From: Chris Murphy @ 2014-12-20 8:53 UTC (permalink / raw) To: Btrfs BTRFS On Fri, Dec 19, 2014 at 10:33 AM, Duncan <1i5t5.duncan@cox.net> wrote: > Charles Cazabon posted on Fri, 19 Dec 2014 10:58:49 -0600 as excerpted: > There's also raid10, if you have enough devices or little enough data to > do it. That's much more mature than raid56 mode and should be about as > mature and stable as btrfs in single-device mode, which is what you are > using now. But it'll require more devices than a raid56 would... And also with such large storage stacks with big drives, when they fail (note I use when not if) it takes a long time to restore. So if you have the ability to break them up and use something like GlusterFS to distribute it, it helps to mitigate this as well as other kinds of failures like power supply, logic board, controllers, and with georep even the entire local site. This is not meant to indicate the current layout is wrong. Just that there are other possibilities to achieve the desired up-time and data safety. -- Chris Murphy ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Oddly slow read performance with near-full largish FS 2014-12-19 17:33 ` Duncan 2014-12-20 8:53 ` Chris Murphy @ 2014-12-20 10:03 ` Robert White 1 sibling, 0 replies; 16+ messages in thread From: Robert White @ 2014-12-20 10:03 UTC (permalink / raw) To: Duncan, linux-btrfs On 12/19/2014 09:33 AM, Duncan wrote: > Charles Cazabon posted on Fri, 19 Dec 2014 10:58:49 -0600 as excerpted: > >> This configuration is one I've been using for many years. It's only >> recently that I've noticed it being particularly slow with btrfs -- I >> don't know if that's because the filesystem has filled up past some >> critical point, or due to something else entirely. That's why I'm >> trying to figure this out. > > Not recommending at this point, just saying these are options... > > Btrfs raid56 mode should, I believe, be pretty close to done with the > latest patches. That would be 3.19, however, which isn't out yet of > course. Putting the encryption above the raid is a _huge_ win that he'd lose. I've used this same layering before (though not with btrfs). So if you write a sector in this order only one encryption event (e.g. "encrypt this sector") has to take place no matter what raid level is in place. If you put the encryption below the raid, then a write or one sector on a non-degraded RAID5 requires four encryption events (two decrypts, one for the parity and one for the sector being overwritten; followed by two encryptions on the same results). In degraded conditions the profile is much worse. If encryptions and RAID > 0 is in use, he's better off with what he's got in terms of CPU and scheduling. > > There's also raid10, if you have enough devices or little enough data to > do it. That's much more mature than raid56 mode and should be about as > mature and stable as btrfs in single-device mode, which is what you are > using now. But it'll require more devices than a raid56 would... > ^ permalink raw reply [flat|nested] 16+ messages in thread
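(For concreteness, "encryption above the raid" means a single dm-crypt mapping over the whole md array. A sketch only - the device names are illustrative and the volume-group/LV names are guesses based on the /dev/mapper/vg-backup path earlier in the thread, not the poster's actual commands:

# mdadm --create /dev/md0 --level=6 --raid-devices=8 /dev/sd[b-i]
# cryptsetup luksFormat /dev/md0
# cryptsetup luksOpen /dev/md0 md0_crypt        # one crypt layer for the whole array
# pvcreate /dev/mapper/md0_crypt
# vgcreate vg /dev/mapper/md0_crypt
# lvcreate -l 100%FREE -n backup vg
# mkfs.btrfs -L backup /dev/vg/backup

With this ordering each block btrfs writes is encrypted exactly once on its way down; pushing dm-crypt below md would instead mean one crypt mapping per member disk and several decrypt/encrypt operations for every partial-stripe RAID6 write.)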
* Re: Oddly slow read performance with near-full largish FS 2014-12-17 2:42 Oddly slow read performance with near-full largish FS Charles Cazabon 2014-12-19 8:58 ` Satoru Takeuchi @ 2014-12-20 10:57 ` Robert White 2014-12-21 16:32 ` Charles Cazabon 1 sibling, 1 reply; 16+ messages in thread From: Robert White @ 2014-12-20 10:57 UTC (permalink / raw) To: btrfs list On 12/16/2014 06:42 PM, Charles Cazabon wrote: > Hi, > > I've been running btrfs for various filesystems for a few years now, and have > recently run into problems with a large filesystem becoming *really* slow for > basic reading. None of the debugging/testing suggestions I've come across in > the wiki or in the mailing list archives seems to have helped. > > Background: this particular filesystem holds backups for various other > machines on the network, a mix of rdiff-backup data (so lots of small files) > and rsync copies of larger files (everything from ~5MB data files to ~60GB VM > HD images). There's roughly 16TB of data in this filesystem (the filesystem > is ~17TB). The btrfs filesystem is a simple single volume, no snapshots, > multiple devices, or anything like that. It's an LVM logical volume on top of > dmcrypt on top of an mdadm RAID set (8 disks in RAID 6). > > The performance: trying to copy the data off this filesystem to another > (non-btrfs) filesystem with rsync or just cp was taking aaaages - I found one > suggestion that it could be because updating the atimes required a COW of the > metadata in btrfs, so I mounted the filesystem noatime, but this doesn't > appear to have made any difference. The speeds I'm seeing (with iotop) > fluctuate a lot. They spend most of the time in the range of 1-3 MB/s, with > large periods of time where no IO seems to happen at all, and occasional short > spikes to ~25-30 MB/s. System load seems to sit around 10-12 (with only 2 > processes reported as running, everything else sleeping) while this happens. > The server is doing nothing other than this copy at the time. The only > processes using any noticable CPU are rsync (source and destination processes, > around 3% CPU each, plus an md0:raid6 process around 2-3%), and a handful of > "kworker" processes, perhaps one per CPU (there are 8 physical cores in the > server, plus hyperthreading). > > Other filesystems on the same physical disks have no trouble exceeding 100MB/s > reads. The machine is not swapping (16GB RAM, ~8GB swap with 0 swap used). > > Is there something obvious I'm missing here? Is there a reason I can only > average ~3MB/s reads from a btrfs filesystem? > > kernel is x86_64 linux-stable 3.17.6. btrfs-progs is v3.17.3-3-g8cb0438. > Output of the various info commands is: > > $ sudo btrfs fi df /media/backup/ > Data, single: total=16.24TiB, used=15.73TiB > System, DUP: total=8.00MiB, used=1.75MiB > System, single: total=4.00MiB, used=0.00 > Metadata, DUP: total=35.50GiB, used=34.05GiB > Metadata, single: total=8.00MiB, used=0.00 > unknown, single: total=512.00MiB, used=0.00 > > $ btrfs --version > Btrfs v3.17.3-3-g8cb0438 > > $ sudo btrfs fi show > > Label: 'backup' uuid: c18dfd04-d931-4269-b999-e94df3b1918c > Total devices 1 FS bytes used 15.76TiB > devid 1 size 16.37TiB used 16.31TiB path /dev/mapper/vg-backup > > Thanks in advance for any suggestions. > > Charles > Totally spit-balling ideas here (e.g. no suggestion as to which one to try first etc, just typing them as they come to me): Have you tried increasing the number of stripe buffers for the filesystem? 
If you've gotten things spread way out you might be thrashing your stripe cache. (see /sys/block/md(number here)/md/stripe_cache_size). Have you taken SMART (smartmotools etc) to these disks to see if any of them are reporting any sort of incipient failure conditions? If one or more drives is reporting recoverable read errors it might just be clogging you up. Try experimentally mounting the filesystem read-only and dong some read tests. This elimination of all possible write sources will tell you things. In particular if all your reads just start breezing through then you know something in the write path is "iffy". One thing that comes to mind is that anything accessing the drive with a barrier-style operation (wait for verification of data sync all the way to disk) would have to pass all the way down through the encryption layer which could be having a multiplier effect. (you know, lots of very short delays making a large net delay). Have you changed any hardware lately in a way that could de-optimize your interrupt handling. I have a vague recollection that somewhere in the last month and a half or so there was a patch here (or in the kernel changelogs) about an extra put operation (or something) that would cause a worker thread to roll over to -1, then spin back down to zero before work could proceed. I know, could I _be_ more vague? Right? Try switching to kernel 3.18.1 to see if the issue just goes away. (Honestly this one's just been scratching at my brain since I started writing this reply and I just _can't_ remember the reference for it... dangit...) When was the last time you did any of the maintenance things (like balance or defrag)? Not that I'd want to sit through 15Tb of that sort of thing, but I'm curious about the maintenance history. Does the read performance fall off with uptime? E.g. is it "okay" right after a system boot and then start to fall off as uptime (and activity) increases? I _imagine_ that if your filesystem huge and your server is modest by comparison in terms of ram, cache pinning and fragmentation can start becoming a real problem. What else besides marshaling this filesystem is this system used for? Have you tried segregating some of your system memory for to make sure that you aren't actually having application performance issues? I've had some luck with kernelcore= and moveablecore= (particularly moveablecore=) kernel command line options when dealing with IO induced fragmentation. On problematic systems I'll try classifying at least 1/4 of the system ram as movablecore. (e.g. on my 8GiB laptop were I do some of my experimental work, I have moveablecore=2G on the command line). Any pages that get locked into memory will be moved out of the movable-only memory first. This can have a profound (usually positive) effect on applications that want to spread out in memory. If you are running anything that likes large swaths of memory then this can help a lot. Particularly if you are also running programs that traverse large swaths of disk. Some programs (rsync of large files etc may be such a program) can do "much better" if you've done this. (BUT DON'T OVERDO IT, enough is good but too much is very bad. 8-) ). ASIDE: Anything that uses hugepages, transparent or explicit, in any serious number has a tendency to antagonize the system cache (and vice-versa). It's a silent fight of the cache-pressure sort. When you explicitly declare an amount of ram for moveable pages only, the disk cache will not grow into that space. 
so moveablecore=3G creates 3GiB of space where only unlocked pages (malloced heap, stack, etc; basically only things that can get moved -- particularly swapped -- will go in that space.) The practical effect is that certain kinds of pressures will never compete. So broad-format disk I/O (e.g. using find etc) will tend to be on one side of the barrier while video playback buffers and virtual machine's ram regions are on the other. The broad and deep filesystem you describe could be thwarting your program's attempt to access it. That is, the rsync's need to load a large number of inodes could be starving rsync for memory (etc). Keeping the disk cache out of your program's space at least in part could prevent some very "interesting" contention models from ruining your day. Or it could just make things worse. So it's worth a try but it's not gospel. 8-) ^ permalink raw reply [flat|nested] 16+ messages in thread
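(A minimal sketch of the stripe-cache, SMART, and read-only checks suggested above; the md device, disk names, and mount point are assumptions to adjust for the local system:

$ cat /sys/block/md0/md/stripe_cache_size        # default is 256
$ echo 4096 | sudo tee /sys/block/md0/md/stripe_cache_size
                                                 # costs roughly 4096 pages x 4KiB x 8 members, ~128MiB of RAM
$ sudo smartctl -t long /dev/sdb                 # start the long self-test on one member disk
$ sudo smartctl -l selftest /dev/sdb             # read the result once it completes
$ sudo mount -o remount,ro /media/backup         # the read-only experiment, reversible with remount,rw
)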
* Re: Oddly slow read performance with near-full largish FS 2014-12-20 10:57 ` Robert White @ 2014-12-21 16:32 ` Charles Cazabon 2014-12-21 21:32 ` Robert White 0 siblings, 1 reply; 16+ messages in thread From: Charles Cazabon @ 2014-12-21 16:32 UTC (permalink / raw) To: btrfs list Hi, Robert, Thanks for the response. Many of the things you mentioned I have tried, but for completeness: > Have you taken SMART (smartmotools etc) to these disks Yes. The disks are actually connected to a proper hardware RAID controller that does SMART monitoring of all the disks, although I don't use the RAID features of the controller. By using mdadm, if the controller fails I can slap the disks in another machine, or a different controller into this one, and still have it work without needing to worry about getting a replacement for this particular model of controller. There are no errors or warnings from SMART for the disks. > Try experimentally mounting the filesystem read-only Actually, I'd already done that before I mailed the list. It made no difference to the symptoms. > Have you changed any hardware lately in a way that could de-optimize > your interrupt handling. No. > I have a vague recollection that somewhere in the last month and a > half or so there was a patch here (or in the kernel changelogs) > about an extra put operation (or something) that would cause a > worker thread to roll over to -1, then spin back down to zero before > work could proceed. I know, could I _be_ more vague? Right? Try > switching to kernel 3.18.1 to see if the issue just goes away. I tend to track linux-stable pretty closely (as that seems to be recommended for btrfs use), so I already switched to 3.18.1 as soon as it came out. That made no difference to the symptoms either. > When was the last time you did any of the maintenance things (like > balance or defrag)? Not that I'd want to sit through 15Tb of that > sort of thing, but I'm curious about the maintenance history. I don't generally do those at all. I was under the impression that balance would not apply in my case as btrfs is on a single logical device, but I see that I was wrong in that impression. Is this something that is recommended on a regular basis? Most of the advice I've read regarding them is that it's no longer necessary unless there is a particular problem that these will fix... > Does the read performance fall off with uptime? No. I see these problems right from boot. > I _imagine_ that if your filesystem huge and your server is modest by > comparison in terms of ram, cache pinning and fragmentation can start > becoming a real problem. What else besides marshaling this filesystem is > this system used for? This particular server is only used for holding backups of other machines, nothing else. It has far more CPU and memory (2x quad-core Xeon plus hyperthreading, 16GB RAM) than it needs for this task. So when I say the machine is doing nothing other than this copy/rsync I'm currently running, that's practically the literal truth - there are the normal system processes and my ssh/shell running and that's about it. > Have you tried segregating some of your system memory for to make > sure that you aren't actually having application performance issues? The system isn't running out of memory; as I say, about the only userspace processes running are ssh, my shell, and rsync. However, your first suggestion caused me to slap myself: > Have you tried increasing the number of stripe buffers for the > filesystem? This I had totally forgotten. 
When I bump up the stripe cache size, it *seems* (so far, at least) to
eliminate the slowest performance I'm seeing - specifically, the periods I've
been seeing where no I/O at all seems to happen, plus the long runs of
1-3MB/s. The copy is now staying pretty much in the 22-27MB/s range.

That's not as fast as the hardware is capable of - as I say, with other
filesystems on the same hardware, I can easily see 100+MB/s - but it's much
better than it was.

Is this remaining difference (25 vs 100+ MB/s) simply due to btrfs not being
tuned for performance yet, or is there something else I'm probably
overlooking?

Thanks,
Charles
--
-----------------------------------------------------------------------
Charles Cazabon
GPL'ed software available at: http://pyropus.ca/software/
-----------------------------------------------------------------------

^ permalink raw reply [flat|nested] 16+ messages in thread
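(If the larger stripe cache turns out to be the fix, note that the setting does not survive a reboot. One hedged way to persist it is a udev rule - the match and value here are a sketch, not a tested configuration:

$ echo 'SUBSYSTEM=="block", KERNEL=="md0", ACTION=="add|change", ATTR{md/stripe_cache_size}="4096"' | sudo tee /etc/udev/rules.d/60-md-stripe-cache.rules

Alternatively, the echo into /sys/block/md0/md/stripe_cache_size can simply be added to a boot-time script such as rc.local.)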
* Re: Oddly slow read performance with near-full largish FS 2014-12-21 16:32 ` Charles Cazabon @ 2014-12-21 21:32 ` Robert White 2014-12-21 22:53 ` Charles Cazabon 2014-12-22 2:13 ` Satoru Takeuchi 0 siblings, 2 replies; 16+ messages in thread From: Robert White @ 2014-12-21 21:32 UTC (permalink / raw) To: btrfs list On 12/21/2014 08:32 AM, Charles Cazabon wrote: > Hi, Robert, > > Thanks for the response. Many of the things you mentioned I have tried, but > for completeness: > >> Have you taken SMART (smartmotools etc) to these disks > There are no errors or warnings from SMART for the disks. Do make sure you are regularly running the long "offline" test. [offline is a bad name, what it really should be called is the long idle-interval test. sigh] about once a week. Otherwise SMART is just going to tell you the disk just died when it dies. I'm not saying this is relevant to the current circumstance. But since you didn't mention a testing schedule I figured it bared a mention >> Have you tried segregating some of your system memory for to make >> sure that you aren't actually having application performance issues? > > The system isn't running out of memory; as I say, about the only userspace > processes running are ssh, my shell, and rsync. The thing with "movablecore=" will not lead to an "out of memory" condition or not, its a question of cache and buffer evictions. I figured that you'd have said something about actual out of memory errors. But here's the thing. Once storage pressure gets "high enough" the system will start forgetting things intermittently to make room for other things. One of the things it will "forget" is pages of code from running programs. The other thing it can "forget" is dirent (directory entries) relevant to ongoing activity. The real killer can involve "swappiness" (e.g. /proc/sys/vm/swapiness :: the tendency of the system to drop pages of program code, do not adjust this till you understand it fully) and overall page fault rates on the system. You'll start geting evictions long before you start using _any_ swap file space. So if your effective throughput is low, the first thing to really look at is if your page fault rates are rising. Variations of sar, ps, and top may be able to tell you about the current system and/or per-process page fault rates. You'll have to compare your distro's tool set to the procedures you can find online. It's a little pernicious because it's a silent performance drain. There are no system messages to tell you "uh, hey dude, I'm doing a lot of reclaims lately and even going back to disk for pages of this program you really like". You just have to know how to look in that area. > > However, your first suggestion caused me to slap myself: > >> Have you tried increasing the number of stripe buffers for the >> filesystem? > > This I had totally forgotten. When I bump up the stripe cache size, it > *seems* (so far, at least) to eliminate the slowest performance I'm seeing - > specifically, the periods I've been seeing where no I/O at all seems to > happen, plus the long runs of 1-3MB/s. The copy is now staying pretty much in > the 22-27MB/s range. > > That's not as fast as the hardware is capable of - as I say, with other > filesystems on the same hardware, I can easily see 100+MB/s - but it's much > better than it was. > > Is this remaining difference (25 vs 100+ MB/s) simply due to btrfs not being > tuned for performance yet, or is there something else I'm probably > overlooking? 
I find BTRFS can be a little slow on my laptop, but I blame memory pressure
evicting important structures somewhat system wide. Which is part of why I
did the moveablecore= parametric tuning. I don't think there is anything
that will pack the locality of the various trees, so you can end up needing
bits of things from all over your disk in order to sequentially resolve a
large directory and compute the running checksums for rsync (etc.).

Simple rule of thumb, if "wait for I/O time" has started to rise you've got
some odd memory pressure that's sending you to idle land. It's not
hard-and-fast as a rule, but since you've said that your CPU load (which I'm
taking to be the user+system time) is staying low you are likely waiting for
something.

^ permalink raw reply [flat|nested] 16+ messages in thread
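(A few standard places to read those fault and iowait numbers from, as a sketch; sysstat's sar/pidstat are assumed to be installed, and the rsync selector is just an example:

$ vmstat 5                   # 'wa' column = time waiting for I/O, si/so = swap traffic
$ sar -B 5                   # system-wide paging: majflt/s, pgscank/s, pgsteal/s
$ pidstat -r -C rsync 5      # per-process minor/major page-fault rates for the rsync processes
$ iostat -x 5                # per-device await and %util, to see which layer is actually busy
)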
* Re: Oddly slow read performance with near-full largish FS 2014-12-21 21:32 ` Robert White @ 2014-12-21 22:53 ` Charles Cazabon 2014-12-22 0:38 ` Robert White 2014-12-22 14:16 ` Austin S Hemmelgarn 2014-12-22 2:13 ` Satoru Takeuchi 1 sibling, 2 replies; 16+ messages in thread From: Charles Cazabon @ 2014-12-21 22:53 UTC (permalink / raw) To: btrfs list Hi, Robert, My performance issues with btrfs are more-or-less resolved now -- the performance under btrfs still seems quite variable compared to other filesystems -- my rsync speed is now varying between 40MB and ~90MB/s, with occasional intervals where it drops further, down into the 10-20MB/s range. Still no disk errors or SMART warnings that would indicate that problem is at the hardware level. > Do make sure you are regularly running the long "offline" test. Ok, I'll do that. > Otherwise SMART is just going to tell you the disk just died when it dies. Ya, I'm aware of how limited/useful the SMART diagnostics are. I'm also paranoid enough to be using RAID 6... > The thing with "movablecore=" will not lead to an "out of memory" > condition or not, its a question of cache and buffer evictions. I'm fairly certain memory isn't the issue here. For what it's worth: %Cpu(s): 2.1 us, 19.4 sy, 0.0 ni, 78.0 id, 0.2 wa, 0.3 hi, 0.0 si, 0.0 st KiB Mem: 16469880 total, 16301252 used, 168628 free, 720 buffers KiB Swap: 7811068 total, 0 used, 7811068 free, 15146580 cached Swappiness I've left at the default of 60, but I'm not seeing swapping going on regardless. > > Is this remaining difference (25 vs 100+ MB/s) simply due to btrfs not being > > tuned for performance yet I found the cause of this. Stupidly enough, there was a bwlimit set up in a shell alias for rsync. So btrfs is not nearly as slow as I was seeing. It's still slower than reading from an ext4 or XFS filesystem on these disks, but the absolute level of read speed seems reasonable enough given that btrfs has not been under heavy performance tuning to date. My only remaining concern would be the variability I still see in the read speed. Charles -- ----------------------------------------------------------------------- Charles Cazabon GPL'ed software available at: http://pyropus.ca/software/ ----------------------------------------------------------------------- ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Oddly slow read performance with near-full largish FS 2014-12-21 22:53 ` Charles Cazabon @ 2014-12-22 0:38 ` Robert White 2014-12-25 3:14 ` Charles Cazabon 2014-12-22 14:16 ` Austin S Hemmelgarn 1 sibling, 1 reply; 16+ messages in thread From: Robert White @ 2014-12-22 0:38 UTC (permalink / raw) To: btrfs list On 12/21/2014 02:53 PM, Charles Cazabon wrote: > Hi, Robert, > > My performance issues with btrfs are more-or-less resolved now -- the > performance under btrfs still seems quite variable compared to other > filesystems -- my rsync speed is now varying between 40MB and ~90MB/s, with > occasional intervals where it drops further, down into the 10-20MB/s range. > Still no disk errors or SMART warnings that would indicate that problem is at > the hardware level. > >> Do make sure you are regularly running the long "offline" test. > > Ok, I'll do that. > >> Otherwise SMART is just going to tell you the disk just died when it dies. > > Ya, I'm aware of how limited/useful the SMART diagnostics are. I'm also > paranoid enough to be using RAID 6... > >> The thing with "movablecore=" will not lead to an "out of memory" >> condition or not, its a question of cache and buffer evictions. > > I'm fairly certain memory isn't the issue here. For what it's worth: > > %Cpu(s): 2.1 us, 19.4 sy, 0.0 ni, 78.0 id, 0.2 wa, 0.3 hi, 0.0 si, 0.0 st > KiB Mem: 16469880 total, 16301252 used, 168628 free, 720 buffers > KiB Swap: 7811068 total, 0 used, 7811068 free, 15146580 cached > > Swappiness I've left at the default of 60, but I'm not seeing swapping going > on regardless. Swappiness has nothing to do with swapping. You have very little free memory. Here is how a linux system runs a program. When exec() is called the current process memory is wiped (dropped, forgotten, whatever) [except for a few things like the open file discriptor table]. Then the executable is opened and mmap() is called to map the text portions of that executable into memory. This does not involve any particular reading of that file. The dynamic linker also selectively mmap()s the needed libraries. So you end up with something that looks like this: [+.....................] Most of the program is not actually in memory. (the "." parts), and some minimum part is in memory (the "+" part). As you use the program more of it will work its way into memory. [+++.....+.+....++....] swappiness controls the likelihood that memory pressure will cause parts that have been read in to be "forgotten" based on the idea that it can be read again later if needed. [+++.....+.+....++....] [+.......+......+.....] This is called "demand paging", and because linux uses ELF (extensible link format) and all programs run in a uniform memory map, program text _never_ needs to be written to swap space. Windows DLL/EXE has to "relocate" the code, e.g. re-write it to make it runable. So on windows code text has to be paged to swap. So what is "swapping" and swap space for? Well next to the code is the data. [+++.....+.+....++....] {.............} As it gets written to, it cannot just be forgotten because the program needs that data or it wouldn't have written it. [+++.....+.+....++....] {..******.*...} So if the system needs to reclaim the memory used by that kind of data it sends it to the swap space. [+++.....+.+....++....] {..******.*...} ^^^^ swapping vvvv [+++.....+.+....++....] {..**...*.*...} Swappiness is how you tel the system that you want to keep the code ("+") in memory in favor of the data ("*"). 
But _long_ before you start having to actually write the data to disk in the swap space, the operating system will start casually forgetting the code. Most of both of these things, while "freshly forgotten" can be "reclaimed" from the disk/page cache. So when you show me that free memory listing what I see is someone who is bumping against their "minimum desired free memory" limit (e.g. about 1% free ram) and so has a system where it's _possible_ that a good bit of stuff is getting dumped out of the active page tables and into the disk/page cache where it could start bogging down the system in large numbers of short reclaim delays and potentially non-trivial amounts of demand paging. So not "out of memory" but not ideal. Real measurements of page fault activity and wait for io time needs to be done to determine if more action needs to be taken. Compare that to my laptop where I've deliberately made sure that memory is always available for fast transient use. Gust ~ # free total used free shared buff/cache available Mem: 7915804 2820368 2419300 47472 2676136 4790892 Swap: 8778748 311688 8467060 In this configuration when I run bursty stuff I've set aside two gig that sits around "Free" into which the dynamic load can find ample space. (I am not recommending that for you necessarily, were I present I'd do some experimenting). But that dynamic space is where the rsync would be doing its work and so be less likely to stall. (Etc). ^ permalink raw reply [flat|nested] 16+ messages in thread
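(For anyone wanting to try the reservation described above: it is a boot parameter, and the name the kernel documents is movablecore=. A sketch for a GRUB2 system, with 4G as an arbitrary example value:

# in /etc/default/grub:
GRUB_CMDLINE_LINUX="... movablecore=4G"

$ sudo update-grub                   # or grub2-mkconfig -o /boot/grub2/grub.cfg, depending on distro
$ cat /proc/cmdline                  # after reboot, confirm the parameter took
$ grep -A3 Movable /proc/zoneinfo    # the Movable zone should now have pages
)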
* Re: Oddly slow read performance with near-full largish FS
2014-12-22 0:38 ` Robert White
@ 2014-12-25 3:14 ` Charles Cazabon
0 siblings, 0 replies; 16+ messages in thread
From: Charles Cazabon @ 2014-12-25 3:14 UTC (permalink / raw)
To: btrfs list

Robert White <rwhite@pobox.com> wrote:
>
> You have very little free memory.

I think you're mistaken. Every diagnostic I've looked at says the opposite.

From 30 seconds ago on the same machine, after unmounting the big btrfs
filesystem (and with a larger xfs one mounted), /proc/meminfo says almost the
entirety of the machine's 16GB is free:

MemTotal:       16469880 kB
MemFree:        16005392 kB
MemAvailable:   15974244 kB
Buffers:              84 kB
[...]

> This is called "demand paging",

Yes, I'm aware of how this works.

Charles
--
-----------------------------------------------------------------------
Charles Cazabon
GPL'ed software available at: http://pyropus.ca/software/
-----------------------------------------------------------------------

^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Oddly slow read performance with near-full largish FS 2014-12-21 22:53 ` Charles Cazabon 2014-12-22 0:38 ` Robert White @ 2014-12-22 14:16 ` Austin S Hemmelgarn 2014-12-25 3:15 ` Charles Cazabon 1 sibling, 1 reply; 16+ messages in thread From: Austin S Hemmelgarn @ 2014-12-22 14:16 UTC (permalink / raw) To: btrfs list [-- Attachment #1: Type: text/plain, Size: 2805 bytes --] On 2014-12-21 17:53, Charles Cazabon wrote: > Hi, Robert, > > My performance issues with btrfs are more-or-less resolved now -- the > performance under btrfs still seems quite variable compared to other > filesystems -- my rsync speed is now varying between 40MB and ~90MB/s, with > occasional intervals where it drops further, down into the 10-20MB/s range. > Still no disk errors or SMART warnings that would indicate that problem is at > the hardware level. > >> Do make sure you are regularly running the long "offline" test. > > Ok, I'll do that. > >> Otherwise SMART is just going to tell you the disk just died when it dies. > > Ya, I'm aware of how limited/useful the SMART diagnostics are. I'm also > paranoid enough to be using RAID 6... > >> The thing with "movablecore=" will not lead to an "out of memory" >> condition or not, its a question of cache and buffer evictions. > > I'm fairly certain memory isn't the issue here. For what it's worth: > > %Cpu(s): 2.1 us, 19.4 sy, 0.0 ni, 78.0 id, 0.2 wa, 0.3 hi, 0.0 si, 0.0 st > KiB Mem: 16469880 total, 16301252 used, 168628 free, 720 buffers > KiB Swap: 7811068 total, 0 used, 7811068 free, 15146580 cached > > Swappiness I've left at the default of 60, but I'm not seeing swapping going > on regardless. > >>> Is this remaining difference (25 vs 100+ MB/s) simply due to btrfs not being >>> tuned for performance yet > > I found the cause of this. Stupidly enough, there was a bwlimit set up in a > shell alias for rsync. > > So btrfs is not nearly as slow as I was seeing. It's still slower than > reading from an ext4 or XFS filesystem on these disks, but the absolute level > of read speed seems reasonable enough given that btrfs has not been under > heavy performance tuning to date. My only remaining concern would be the > variability I still see in the read speed. This actually sounds kind of like the issues I have sometimes on my laptop using btrfs on an SSD, I've mostly resolved them by tuning IO scheduler parameters, as the default IO scheduler (the supposedly Completely Fair Queue, which was obviously named by a mathematician who had never actually run the algorithm) has some pretty brain-dead default settings. The other thing I would suggest looking into regarding the variability is tuning the kernel's write-caching settings, with the defaults you're caching ~1.6G worth of writes before it forces write-back, which is a ridiculous amount; I've that the highest value that is actually usable is about 256M, and that's only if you are doing mostly bursty IO and not the throughput focused stuff that rsync does, I'd say try setting /proc/sys/vm/dirty_background_bytes to 67108864 (64M) and see if that helps things some. [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 2455 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
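(A sketch of both knobs mentioned above. On an md/dm stack the I/O scheduler applies to the underlying sdX member disks rather than to md0 or the device-mapper nodes, and the disk names here are assumptions:

$ cat /sys/block/sdb/queue/scheduler                 # e.g. "noop deadline [cfq]"
$ echo deadline | sudo tee /sys/block/sd{b..i}/queue/scheduler
$ sudo sysctl -w vm.dirty_background_bytes=67108864  # the suggested 64MiB; setting *_bytes zeroes the *_ratio twin
$ sudo sysctl -w vm.dirty_bytes=268435456            # optional: also cap total dirty memory at 256MiB
)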
* Re: Oddly slow read performance with near-full largish FS
2014-12-22 14:16 ` Austin S Hemmelgarn
@ 2014-12-25 3:15 ` Charles Cazabon
0 siblings, 0 replies; 16+ messages in thread
From: Charles Cazabon @ 2014-12-25 3:15 UTC (permalink / raw)
To: btrfs list

Austin S Hemmelgarn <ahferroin7@gmail.com> wrote:
>
> This actually sounds kind of like the issues I have sometimes on my
> laptop using btrfs on an SSD, I've mostly resolved them by tuning IO
> scheduler parameters, as the default IO scheduler (the supposedly
> Completely Fair Queue, which was obviously named by a mathematician
> who had never actually run the algorithm) has some pretty brain-dead
> default settings. The other thing I would suggest looking into
> regarding the variability is tuning the kernel's write-caching
> settings

Ok, that's something I will examine. I knew CFQ is completely wrong for SSD
use, but I thought it was still one of the better schedulers for spinning
disks. Apparently that may not be the case.

Thanks,
Charles
--
-----------------------------------------------------------------------
Charles Cazabon
GPL'ed software available at: http://pyropus.ca/software/
-----------------------------------------------------------------------

^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Oddly slow read performance with near-full largish FS 2014-12-21 21:32 ` Robert White 2014-12-21 22:53 ` Charles Cazabon @ 2014-12-22 2:13 ` Satoru Takeuchi 2014-12-25 3:18 ` Charles Cazabon 1 sibling, 1 reply; 16+ messages in thread From: Satoru Takeuchi @ 2014-12-22 2:13 UTC (permalink / raw) To: Robert White, btrfs list Hi, On 2014/12/22 6:32, Robert White wrote: > On 12/21/2014 08:32 AM, Charles Cazabon wrote: >> Hi, Robert, >> >> Thanks for the response. Many of the things you mentioned I have tried, but >> for completeness: >> >>> Have you taken SMART (smartmotools etc) to these disks >> There are no errors or warnings from SMART for the disks. > > > Do make sure you are regularly running the long "offline" test. [offline is a bad name, what it really should be called is the long idle-interval test. sigh] about once a week. Otherwise SMART is just going to tell you the disk just died when it dies. > > I'm not saying this is relevant to the current circumstance. But since you didn't mention a testing schedule I figured it bared a mention > >>> Have you tried segregating some of your system memory for to make >>> sure that you aren't actually having application performance issues? >> >> The system isn't running out of memory; as I say, about the only userspace >> processes running are ssh, my shell, and rsync. > > The thing with "movablecore=" will not lead to an "out of memory" condition or not, its a question of cache and buffer evictions. > > I figured that you'd have said something about actual out of memory errors. > > But here's the thing. > > Once storage pressure gets "high enough" the system will start forgetting things intermittently to make room for other things. One of the things it will "forget" is pages of code from running programs. The other thing it can "forget" is dirent (directory entries) relevant to ongoing activity. > > The real killer can involve "swappiness" (e.g. /proc/sys/vm/swapiness :: the tendency of the system to drop pages of program code, do not adjust this till you understand it fully) and overall page fault rates on the system. You'll start geting evictions long before you start using _any_ swap file space. > > So if your effective throughput is low, the first thing to really look at is if your page fault rates are rising. Variations of sar, ps, and top may be able to tell you about the current system and/or per-process page fault rates. You'll have to compare your distro's tool set to the procedures you can find online. > > It's a little pernicious because it's a silent performance drain. There are no system messages to tell you "uh, hey dude, I'm doing a lot of reclaims lately and even going back to disk for pages of this program you really like". You just have to know how to look in that area. > >> >> However, your first suggestion caused me to slap myself: >> >>> Have you tried increasing the number of stripe buffers for the >>> filesystem? >> >> This I had totally forgotten. When I bump up the stripe cache size, it >> *seems* (so far, at least) to eliminate the slowest performance I'm seeing - >> specifically, the periods I've been seeing where no I/O at all seems to >> happen, plus the long runs of 1-3MB/s. The copy is now staying pretty much in >> the 22-27MB/s range. >> >> That's not as fast as the hardware is capable of - as I say, with other >> filesystems on the same hardware, I can easily see 100+MB/s - but it's much >> better than it was. 
>>
>> Is this remaining difference (25 vs 100+ MB/s) simply due to btrfs not being
>> tuned for performance yet, or is there something else I'm probably
>> overlooking?
>
> I find BTRFS can be a little slow on my laptop, but I blame memory pressure
> evicting important structures somewhat system wide. Which is part of why I
> did the moveablecore= parametric tuning. I don't think there is anything
> that will pack the locality of the various trees, so you can end up needing
> bits of things from all over your disk in order to sequentially resolve a
> large directory and compute the running checksums for rsync (etc.).
>
> Simple rule of thumb, if "wait for I/O time" has started to rise you've got
> some odd memory pressure that's sending you to idle land. It's not
> hard-and-fast as a rule, but since you've said that your CPU load (which I'm
> taking to be the user+system time) is staying low you are likely waiting for
> something.

Capturing "echo t > /proc/sysrq-trigger" while it is waiting for I/O may help
you. It shows us where the kernel is actually waiting.

In addition, to confirm whether this problem is caused only by Btrfs or not,
the following way can be used:

1. Prepare the extra storage.
2. Copy the Btrfs data onto it with dd if=<LVM volume> of=<extra storage>
3. Use it and confirm whether this problem still happens or not.

However, since the size of your Btrfs is quite large, I guess you can't do it.
If you had such extra storage, you'd have already added it to Btrfs.

Thanks,
Satoru

^ permalink raw reply [flat|nested] 16+ messages in thread
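(A sketch of capturing that task dump; the sysrq facility has to be enabled, and the output lands in the kernel ring buffer, which can be sizeable:

$ cat /proc/sys/kernel/sysrq              # 1 means all sysrq functions are allowed
$ echo t | sudo tee /proc/sysrq-trigger   # dump every task's state and kernel stack
$ dmesg | less                            # look for rsync/btrfs threads in D state and what they block on
)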
* Re: Oddly slow read performance with near-full largish FS
2014-12-22 2:13 ` Satoru Takeuchi
@ 2014-12-25 3:18 ` Charles Cazabon
0 siblings, 0 replies; 16+ messages in thread
From: Charles Cazabon @ 2014-12-25 3:18 UTC (permalink / raw)
To: btrfs list

Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> wrote:
>
> In addition, to confirm whether this problem is caused only by Btrfs or not,
> the following way can be used:
>
> 1. Prepare the extra storage.
> 2. Copy the Btrfs data onto it with dd if=<LVM volume> of=<extra storage>
> 3. Use it and confirm whether this problem still happens or not.

I've already copied the ~16TB of data from the btrfs filesystem to an XFS
filesystem. I do not see the performance variability under xfs that I see
under btrfs.

> However, since the size of your Btrfs is quite large, I guess you can't do
> it. If you had such extra storage, you'd have already added it to Btrfs.

Actually, I decided to move to xfs, at least for now. Apparently not many
people are using btrfs with filesystems >15TB, so it seems I'm in
more-or-less uncharted territory, at least according to the responses I've
gotten when looking into this issue.

Charles
--
-----------------------------------------------------------------------
Charles Cazabon
GPL'ed software available at: http://pyropus.ca/software/
-----------------------------------------------------------------------

^ permalink raw reply [flat|nested] 16+ messages in thread