From: Forza <forza@tnonline.net>
To: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>,
	Andrey Melnikov <temnota.am@gmail.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: btrfs with huge numbers of hardlinks is extremely slow
Date: Fri, 26 Nov 2021 09:23:12 +0100 (GMT+01:00)
Message-ID: <8747149.faa9ddba.17d5b575f6b@tnonline.net>
In-Reply-To: <20211126051503.GG17148@hungrycats.org>



---- From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> -- Sent: 2021-11-26 - 06:15 ----

> On Fri, Nov 26, 2021 at 12:56:25AM +0300, Andrey Melnikov wrote:
>> Every night a new backup is stored on this fs with 'rsync
>> --link-dest=$yestoday/ $today/' - 1402700 hardlinks and 23000
>> directories are created, 50-100 normal files transferred.
>> Now, FS contains 351 copies of backup data with 486086495 hardlinks
>> and ANY operations on this FS take significant time. For example -
>> simple count hardlinks with
>> "time find . -type f -links +1 | wc -l" take:
>> real    28567m33.611s
>> user    31m33.395s
>> sys     506m28.576s
>> 
>> 19 days 20 hours 10 mins with constant reads from storage 2-4Mb/s.
> 
> That works out to reading the entire drive 4x in 20 days, or all of the
> metadata 30x.  Certainly hardlinks will not result in optimal object
> placement, and you probably don't have enough RAM to cache the entire
> metadata tree, and you're using WD Purple drives in a fileserver for
> some reason, so those numbers seem plausible.

Defragmenting the subvolume and extent trees could help reduce the number and distance of seeks to metadata, which should improve the performance of find.

# btrfs fi defrag /path/to/subvol 
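
If the filesystem gains more subvolumes or snapshots later, each one has its own metadata tree and would need its own defrag pass. A minimal sketch, assuming the filesystem is mounted at /srv and that the paths printed by 'btrfs subvolume list' are relative to that mount point:

	btrfs filesystem defragment /srv
	for sub in $(btrfs subvolume list -o /srv | awk '{print $NF}'); do
		btrfs filesystem defragment "/srv/$sub"
	done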


If you cannot change the drive model, breaking up your HW RAID and creating a btrfs raid1 should also improve performance, as well as fault tolerance. See the examples below.
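
For example (a sketch only; the second device name here is made up), you could either create a fresh two-device raid1 filesystem:

	mkfs.btrfs -d raid1 -m raid1 /dev/sdb1 /dev/sdc1

or add a second device to the existing filesystem and convert the data and metadata profiles in place:

	btrfs device add /dev/sdc1 /srv
	btrfs balance start -dconvert=raid1 -mconvert=raid1 /srv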

Apart from this, I second Zygo's suggestions below. 




> 
>> - BTRFS not suitable for this workload?
> 
> There are definitely better ways to do this on btrfs, e.g.
> 
> 	btrfs sub snap $yesterday $today
> 	rsync ... (no link-dest) ... $today
> 
> This will avoid duplicating the entire file tree every time.  It will also
> store historical file attributes correctly, which --link-dest sometimes
> does not.
> 
> You might also consider doing it differently:
> 
> 	rsync ... (no link-dest) ... working-dir/. &&
> 	btrfs sub snap -r working-dir $today
> 
> so that your $today directory doesn't exist until it is complete with
> no rsync errors.  'working-dir' will have to be a subvol, but you only
> have to create it once and you can keep reusing it afterwards.
> 
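
The one-time setup for 'working-dir' mentioned above would be something like this (the path is just an example):

	btrfs subvolume create /srv/working-dir
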
>> - using reflinks helps speedup FS operations?
> 
> Snapshots are lazy reflink copies, so they'll do a little better than
> reflinks.  You'll only modify the metadata for the 50-100 files that you
> transfer each day, instead of completely rewriting all of the metadata
> in the filesystem every day with hardlinks.
> 
> Hardlinks put the inodes further and further away from their directory
> nodes each day, and add some extra searching overhead within directories
> as well.  You'll need more and more RAM to cache the same amount of
> each filesystem tree, because they're all in the same metadata tree.
> With snapshots they'll end up in separate metadata trees.
> 
>> - readed metadata not cached at all? 
> 
> If you have less than about 640GB of RAM (4x the size of your metadata)
> then you're going to be rereading metadata pages at some point.  Because
> you're using hardlinks, the metadata pages from different days are all
> mashed together, and 'find' will flood the cache chasing references to
> them.
> 
> Other recommendations:
> 
> - Use the right drive model for your workload.  WD Purple drives are for
> continuous video streaming, they are not for seeky millions-of-tiny-files
> rsync workloads.  Almost any other model will outperform them, and
> better drives for this workload (e.g. CMR WD Red models) are cheaper.
> Your WD Purple drives are getting 283 links/s.  Compare that with some
> other drive models:
> 
> 	1665 links/s:  WD Green (2x1TB + 1x2TB btrfs raid1)
> 
> 	6850 links/s:  Sandisk Extreme MicroSD (1x256GB btrfs single/dup)
> 
> 	12511 links/s:  WD Red (2x1TB btrfs raid1)
> 
> 	13371 links/s:  WD Red SSD + Seagate Ironwolf SSD (6x1TB btrfs raid1)
> 
> 	14872 links/s:  WD Black (1x1TB btrfs single/dup, 8 years old)
> 
> 	25498 links/s:  WD Gold + Seagate Exos (3x16TB btrfs raid1)
> 
> 	27341 links/s:  Toshiba NVME (1x2TB btrfs single/dup)
> 
> 	311284 links/s:  Sabrent Rocket 4 NVME (2x1TB btrfs raid1)
> 	(1344748222 links, 111 snapshots)
> 
> Some of these numbers are lower than they should be, because I ran
> 'find' commands on some machines that were busy doing other work.
> The point is that even if some of these numbers are too low, all of
> these numbers are higher than what we can expect from a WD Purple.
> 
> - Use btrfs raid1 instead of hardware RAID1, i.e. expose each disk
> separately through the RAID interface to btrfs.  This will enable btrfs
> to correct errors and isolate faults if one of your drives goes bad.
> You can also use iostat to see if one of the drives is running much
> slower than the other, which might be an early indication of failure
> (and it might be the only indication of failure you get, if your drive's
> firmware doesn't support SCTERC and hides failures).
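
For the iostat check suggested above, per-device extended statistics are enough to spot one drive consistently lagging behind the other. A sketch, with the second device name made up:

	iostat -dx /dev/sdb /dev/sdc 5
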
> 
>> What BTRFS read 19 days from disks???
>> 
>> Hardware: dell r300 with 2 WD Purple 1Tb disk on LSISAS1068E RAID 1
>> (without cache).
>> Linux nlxc 5.14-51412-generic #0~lch11 SMP Wed Oct 13 15:57:07 UTC
>> 2021 x86_64 GNU/Linux
>> btrfs-progs v5.14.1
>> 
>> # btrfs fi show
>> Label: none  uuid: a840a2ca-bf05-4074-8895-60d993cb5bdd
>>         Total devices 1 FS bytes used 474.26GiB
>>         devid    1 size 931.00GiB used 502.23GiB path /dev/sdb1
>> 
>> # btrfs fi df /srv
>> Data, single: total=367.19GiB, used=343.92GiB
>> System, single: total=32.00MiB, used=128.00KiB
>> Metadata, single: total=135.00GiB, used=130.34GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>> 
>> # btrfs fi us /srv
>> Overall:
>>     Device size:                 931.00GiB
>>     Device allocated:            502.23GiB
>>     Device unallocated:          428.77GiB
>>     Device missing:                  0.00B
>>     Used:                        474.26GiB
>>     Free (estimated):            452.04GiB      (min: 452.04GiB)
>>     Free (statfs, df):           452.04GiB
>>     Data ratio:                       1.00
>>     Metadata ratio:                   1.00
>>     Global reserve:              512.00MiB      (used: 0.00B)
>>     Multiple profiles:                  no
>> 
>> Data,single: Size:367.19GiB, Used:343.92GiB (93.66%)
>>    /dev/sdb1     367.19GiB
>> 
>> Metadata,single: Size:135.00GiB, Used:130.34GiB (96.55%)
>>    /dev/sdb1     135.00GiB
>> 
>> System,single: Size:32.00MiB, Used:128.00KiB (0.39%)
>>    /dev/sdb1      32.00MiB
>> 
>> Unallocated:
>>    /dev/sdb1     428.77GiB


