From: Robert White <rwhite@pobox.com>
To: btrfs list <linux-btrfs@vger.kernel.org>
Subject: Re: Oddly slow read performance with near-full largish FS
Date: Sat, 20 Dec 2014 02:57:40 -0800
Message-ID: <54955624.5040808@pobox.com>
In-Reply-To: <20141217024228.GA5544@pyropus.ca>

On 12/16/2014 06:42 PM, Charles Cazabon wrote:
> Hi,
>
> I've been running btrfs for various filesystems for a few years now, and have
> recently run into problems with a large filesystem becoming *really* slow for
> basic reading.  None of the debugging/testing suggestions I've come across in
> the wiki or in the mailing list archives seems to have helped.
>
> Background: this particular filesystem holds backups for various other
> machines on the network, a mix of rdiff-backup data (so lots of small files)
> and rsync copies of larger files (everything from ~5MB data files to ~60GB VM
> HD images).  There's roughly 16TB of data in this filesystem (the filesystem
> is ~17TB).  The btrfs filesystem is a simple single volume, no snapshots,
> multiple devices, or anything like that.  It's an LVM logical volume on top of
> dmcrypt on top of an mdadm RAID set (8 disks in RAID 6).
>
> The performance:  trying to copy the data off this filesystem to another
> (non-btrfs) filesystem with rsync or just cp was taking aaaages - I found one
> suggestion that it could be because updating the atimes required a COW of the
> metadata in btrfs, so I mounted the filesystem noatime, but this doesn't
> appear to have made any difference.  The speeds I'm seeing (with iotop)
> fluctuate a lot.  They spend most of the time in the range of 1-3 MB/s, with
> large periods of time where no IO seems to happen at all, and occasional short
> spikes to ~25-30 MB/s.  System load seems to sit around 10-12 (with only 2
> processes reported as running, everything else sleeping) while this happens.
> The server is doing nothing other than this copy at the time.  The only
> processes using any noticable CPU are rsync (source and destination processes,
> around 3% CPU each, plus an md0:raid6 process around 2-3%), and a handful of
> "kworker" processes, perhaps one per CPU (there are 8 physical cores in the
> server, plus hyperthreading).
>
> Other filesystems on the same physical disks have no trouble exceeding 100MB/s
> reads.  The machine is not swapping (16GB RAM, ~8GB swap with 0 swap used).
>
> Is there something obvious I'm missing here?  Is there a reason I can only
> average ~3MB/s reads from a btrfs filesystem?
>
> kernel is x86_64 linux-stable 3.17.6.  btrfs-progs is v3.17.3-3-g8cb0438.
> Output of the various info commands is:
>
>    $ sudo btrfs fi df /media/backup/
>    Data, single: total=16.24TiB, used=15.73TiB
>    System, DUP: total=8.00MiB, used=1.75MiB
>    System, single: total=4.00MiB, used=0.00
>    Metadata, DUP: total=35.50GiB, used=34.05GiB
>    Metadata, single: total=8.00MiB, used=0.00
>    unknown, single: total=512.00MiB, used=0.00
>
>    $ btrfs --version
>    Btrfs v3.17.3-3-g8cb0438
>
>    $ sudo btrfs fi show
>
>    Label: 'backup'  uuid: c18dfd04-d931-4269-b999-e94df3b1918c
>      Total devices 1 FS bytes used 15.76TiB
>      devid    1 size 16.37TiB used 16.31TiB path /dev/mapper/vg-backup
>
> Thanks in advance for any suggestions.
>
> Charles
>

Totally spit-balling ideas here (in no particular order of what to try 
first, just typing them as they come to me):

Have you tried increasing the number of stripe buffers for the md 
array? If your data has gotten spread way out you might be thrashing 
the stripe cache. (See /sys/block/mdX/md/stripe_cache_size, where mdX 
is your RAID device.)
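
For example, something like this (assuming the array is md0 here; the 
value is in stripes, and the cache costs roughly stripe_cache_size * 
4KiB * number of member disks, so 4096 on an 8-disk array is on the 
order of 128MiB):

   $ cat /sys/block/md0/md/stripe_cache_size
   $ echo 4096 | sudo tee /sys/block/md0/md/stripe_cache_size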

Have you taken SMART (smartmontools etc.) to these disks to see if any 
of them are reporting any sort of incipient failure condition? If one 
or more drives are reporting recoverable read errors it might just be 
clogging you up.
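
Something along these lines, per disk (substitute your actual member 
devices for sda):

   $ sudo smartctl -H /dev/sda            # quick overall health verdict
   $ sudo smartctl -a /dev/sda | less     # full attributes and error log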

Try experimentally mounting the filesystem read-only and doing some 
read tests. Eliminating all possible write sources will tell you 
things. In particular, if all your reads suddenly start breezing 
through, then you know something in the write path is "iffy". One thing 
that comes to mind is that anything accessing the drive with a 
barrier-style operation (wait for verification of data sync all the way 
to disk) would have to pass all the way down through the encryption 
layer, which could have a multiplier effect (you know, lots of very 
short delays adding up to one large net delay).
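
A rough sketch of the kind of test I mean (paths and sizes are just 
illustrative):

   $ sudo mount -o remount,ro /media/backup
   $ echo 3 | sudo tee /proc/sys/vm/drop_caches   # measure disk, not cache
   $ dd if=/media/backup/some/large/file of=/dev/null bs=1M count=4096
   $ sudo mount -o remount,rw /media/backup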

Have you changed any hardware lately in a way that could de-optimize 
your interrupt handling?
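
You can at least eyeball where the disk controller's interrupts are 
landing (the grep pattern depends on which controller you actually 
have, of course):

   $ grep -i -e ahci -e sas -e mpt /proc/interrupts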

I have a vague recollection that somewhere in the last month and a half 
or so there was a patch here (or in the kernel changelogs) about an 
extra put operation (or something) that would cause a worker thread to 
roll over to -1, then spin back down to zero before work could proceed. 
I know, could I _be_ more vague? Right? Try switching to kernel 3.18.1 
to see if the issue just goes away. (Honestly this one's just been 
scratching at my brain since I started writing this reply and I just 
_can't_ remember the reference for it... dangit...)

When was the last time you did any of the maintenance things (like 
balance or defrag)? Not that I'd want to sit through 15TB of that sort 
of thing, but I'm curious about the maintenance history.
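
For reference, the usual incantations look something like this (the 
usage filter keeps the balance from rewriting every single chunk on a 
filesystem this size):

   $ sudo btrfs balance start -dusage=50 -musage=50 /media/backup
   $ sudo btrfs filesystem defragment -r /media/backup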

Does the read performance fall off with uptime? E.g. is it "okay" right 
after a system boot, and then does it start to fall off as uptime (and 
activity) increases? I _imagine_ that if your filesystem is huge and 
your server is modest by comparison in terms of RAM, cache pinning and 
fragmentation can start becoming a real problem. What else besides 
marshaling this filesystem is this system used for?
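
One quick way to see whether memory fragmentation is in play: after a 
long uptime, check whether the free memory is almost entirely stuck in 
the low orders (the left-hand columns):

   $ free -m
   $ cat /proc/buddyinfo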

Have you tried segregating some of your system memory to make sure that 
you aren't actually having application performance issues? I've had 
some luck with the kernelcore= and movablecore= (particularly 
movablecore=) kernel command line options when dealing with IO-induced 
fragmentation. On problematic systems I'll try classifying at least 1/4 
of the system RAM as movablecore (e.g. on my 8GiB laptop where I do 
some of my experimental work, I have movablecore=2G on the command 
line). Any pages that get locked into memory will be moved out of the 
movable-only memory first. This can have a profound (usually positive) 
effect on applications that want to spread out in memory. If you are 
running anything that likes large swaths of memory then this can help a 
lot, particularly if you are also running programs that traverse large 
swaths of disk. Some programs (rsync of large files etc. may be such a 
program) can do "much better" if you've done this. (BUT DON'T OVERDO 
IT, enough is good but too much is very bad. 8-) )
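
If you want to try it, it's just a boot parameter (grub2 assumed here; 
some distros use update-grub or grub2-mkconfig instead, and 4G is just 
my "1/4 of your 16GiB" rule of thumb):

   $ sudo vi /etc/default/grub    # append movablecore=4G to GRUB_CMDLINE_LINUX
   $ sudo grub-mkconfig -o /boot/grub/grub.cfg
   $ sudo reboot
   $ cat /proc/cmdline            # confirm it took after the reboot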


ASIDE: Anything that uses hugepages, transparent or explicit, in any 
serious number has a tendency to antagonize the system cache (and 
vice-versa). It's a silent fight of the cache-pressure sort. When you 
explicitly declare an amount of RAM for movable pages only, the disk 
cache will not grow into that space. So movablecore=3G creates 3GiB of 
space where only unlocked pages (malloced heap, stack, etc. -- 
basically only things that can get moved, particularly swapped) will 
go. The practical effect is that certain kinds of pressures will never 
compete. So broad-format disk I/O (e.g. using find etc.) will tend to 
be on one side of the barrier while video playback buffers and virtual 
machines' RAM regions are on the other. The broad and deep filesystem 
you describe could be thwarting your programs' attempts to access it. 
That is, rsync's need to load a large number of inodes could be 
starving rsync for memory (etc). Keeping the disk cache out of your 
programs' space, at least in part, could prevent some very 
"interesting" contention models from ruining your day.

Or it could just make things worse.

So it's worth a try but it's not gospel. 8-)
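
If you want to check whether hugepages (transparent or otherwise) are 
even in play on this box before going down that road:

   $ cat /sys/kernel/mm/transparent_hugepage/enabled
   $ grep -i huge /proc/meminfo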
