Re: very slow "btrfs dev delete" 3x6Tb, 7Tb of data

From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: Chris Murphy <lists@colorremedies.com>
Cc: Leszek Dubiel <leszek@dubiel.pl>,
	Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: very slow "btrfs dev delete" 3x6Tb, 7Tb of data
Date: Sat, 4 Jan 2020 00:38:43 -0500
Message-ID: <20200104053843.GK13306@hungrycats.org>
In-Reply-To: <CAJCQCtSr9j8AzLRfguHb8+9n_snxmpXkw0V+LiuDnqqvLVAxKQ@mail.gmail.com>

On Thu, Jan 02, 2020 at 04:22:37PM -0700, Chris Murphy wrote:
> On Thu, Jan 2, 2020 at 3:39 PM Leszek Dubiel <leszek@dubiel.pl> wrote:
> 
> >  > Almost no reads, all writes, but slow. And rather high write request
> >  > per second, almost double for sdc. And sdc is near its max
> >  > utilization so it might be near its iops limit?
> >  >
> >  > ~210 rareq-sz = 210KiB is the average size of the read request for
> >  > sda and sdb
> >  >
> >  > Default mkfs and default mount options? Or other and if so what other?
> >  >
> >  > Many small files on this file system? Or possibly large files with a
> >  > lot of fragmentation?
> >
> > Default mkfs and default mount options.
> >
> > This system could have a few million (!) small files.
> > On reiserfs it takes about 40 minutes to do "find /".
> > Rsync runs for 6 hours to backup data.
> 
> There is a mount option:  max_inline=<bytes> which the man page says
> (default: min(2048, page size) )

The default is half the page size, per a commit from some years ago.  For
compressed files, the limit applies to the compressed data size (e.g. you
can have a 4095-byte inline file with max_inline=2048 if compression
shrinks it enough).
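
To illustrate the point about compressed size, here is a toy model, not
the kernel's actual logic.  The function name would_inline is made up for
this example, zlib merely stands in for whichever compressor the
filesystem uses, and every other condition the kernel checks before
inlining is ignored.

```python
import zlib


def would_inline(data: bytes, max_inline: int = 2048,
                 compressed: bool = False) -> bool:
    """Toy check: compare the size that would be stored against max_inline.

    Assumption: only the size limit is modelled; real btrfs applies
    further conditions that are not reproduced here.
    """
    stored = len(data)
    if compressed:
        # zlib is a stand-in compressor for this sketch.  Keep the
        # uncompressed size if compression does not make the data smaller.
        stored = min(stored, len(zlib.compress(data)))
    return stored <= max_inline


sample = b"a" * 4095  # a highly compressible 4095-byte file
print(would_inline(sample))                   # compares 4095 against 2048
print(would_inline(sample, compressed=True))  # compares the compressed size
```

The same 4095-byte payload fails the limit uncompressed and passes it once
the compressed size is what gets compared.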

> I've never used it, so in theory the max_inline byte size is 2KiB.
> However, I have seen substantially larger inline extents than 2KiB
> when using a nodesize larger than 16KiB at mkfs time.
> 
> I've wondered whether it makes any difference for the "many small
> files" case to do more aggressive inlining of extents.
> 
> I've seen with 16KiB leaf size, often small files that could be
> inlined, are instead put into a data block group, taking up a minimum
> 4KiB block size (on x86_64 anyway). I'm not sure why, but I suspect
> there just isn't enough room in that leaf to always use inline
> extents, and yet there is enough room to just reference it as a data
> block group extent. When using a larger node size, a larger percentage
> of small files ended up using inline extents. I'd expect this to be
> quite a bit more efficient, because it eliminates a time expensive (on
> HDD anyway) seek.

Putting a lot of inline file data into metadata pages makes them less
dense (fewer items per page), which is either good or bad depending on
which bottleneck you're currently hitting.  If you have snapshots there is
an up-to-300x metadata write amplification penalty to update extent item
references every time a shared metadata page is unshared.  Inline extents
reduce the write amplification.  On the other hand, if you are doing a lot
of 'find'-style tree sweeps, then inline extents will reduce their
efficiency because more pages will have to be read to scan the same number
of dirents and inodes.
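
A rough back-of-envelope sketch of the density effect on tree sweeps.
All sizes below are approximations assumed for illustration (16KiB leaf,
approximate header and item sizes), dirents and inode refs are ignored,
and leaves_needed is a made-up helper, so treat the output as an order of
magnitude only.

```python
import math

# Assumed, approximate on-disk sizes in bytes (illustration only).
LEAF_SIZE = 16 * 1024     # default nodesize
LEAF_HEADER = 101         # per-leaf header
ITEM_HEADER = 25          # per-item header
INODE_ITEM = 160          # inode item body
EXTENT_ITEM_REGULAR = 53  # file extent item pointing at a data extent
EXTENT_ITEM_INLINE = 21   # inline file extent header, followed by the data


def leaves_needed(n_files: int, inline_bytes: int = 0) -> int:
    """Estimate leaves holding n_files inodes plus their file extent items.

    inline_bytes == 0 models files stored in data block groups;
    inline_bytes > 0 models files whose data lives inside the leaf.
    """
    per_file = ITEM_HEADER + INODE_ITEM + ITEM_HEADER
    if inline_bytes:
        per_file += EXTENT_ITEM_INLINE + inline_bytes
    else:
        per_file += EXTENT_ITEM_REGULAR
    files_per_leaf = (LEAF_SIZE - LEAF_HEADER) // per_file
    return math.ceil(n_files / files_per_leaf)


n = 1_000_000
print("leaves, data in block groups:", leaves_needed(n))
print("leaves, 1000 bytes inline:   ", leaves_needed(n, inline_bytes=1000))
```

Under these assumptions a sweep over inodes has to read several times as
many leaves when each file carries about 1000 bytes of inline data.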

For workloads that reiserfs was good at, there's no reliable rule of
thumb to guess which is better--you have to try both, and measure results.
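
One way to measure the tree-sweep side is a small timing harness like the
sketch below.  time_tree_sweep is a hypothetical helper, not an existing
tool, and the path in the last lines is a placeholder.  For a cold-cache
number you would also need to drop the page cache between runs (for
example by writing 3 to /proc/sys/vm/drop_caches as root), which this
sketch does not do.

```python
import os
import time


def time_tree_sweep(root: str) -> tuple[int, float]:
    """Walk root like 'find' would, lstat every file, and time the sweep.

    Returns (number of files visited, elapsed seconds).
    """
    start = time.perf_counter()
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # lstat forces the inode to be read, not just the dirent.
            os.lstat(os.path.join(dirpath, name))
            count += 1
    return count, time.perf_counter() - start


if __name__ == "__main__":
    files, seconds = time_tree_sweep(".")  # placeholder path
    print(f"{files} files in {seconds:.2f}s")
```

Run it against the same data set on filesystems created with each
setting, and compare the elapsed times rather than trusting a rule of
thumb.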

> Another optimization, using compress=zstd:1, which is the lowest
> compression setting. That'll increase the chance a file can use inline
> extents, in particular with a larger nodesize.
> 
> And still another optimization, at the expense of much more
> complexity, is LVM cache with an SSD. You'd have to pick a suitable
> policy for the workload, but I expect that if the iostat utilizations
> you see are often near max utilization in normal operation, you'll see
> improved performance. SSDs can handle way higher iops than HDD. But a
> lot of this optimization stuff is use case specific. I'm not even sure
> what your mean small file size is.

I've found an interesting result in cache configuration testing: btrfs's
writes with datacow seem to be very well optimized, to the point that
adding a writeback SSD cache between btrfs and a HDD makes btrfs commits
significantly slower.  A writeback cache adds latency to the write path
without removing many seeks--btrfs already does writes in big contiguous
bursts--so the extra latency makes the writeback cache slow compared to
writethrough.  A writethrough SSD cache helps with reads (which are very
seeky and benefit a lot from caching) without adding latency to writes,
and btrfs reads a _lot_ during commits.
 
> > # iotop -d30
> >
> > Total DISK READ:        34.12 M/s | Total DISK WRITE: 40.36 M/s
> > Current DISK READ:      34.12 M/s | Current DISK WRITE:      79.22 M/s
> >    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO> COMMAND
> >   4596 be/4 root       34.12 M/s   37.79 M/s  0.00 % 91.77 % btrfs
> 
> Not so bad for many small file reads and writes with HDD. I've seen
> this myself with single spindle when doing small file reads and
> writes.
> 
> 
> -- 
> Chris Murphy
