Date: Fri, 26 Nov 2021 09:23:12 +0100 (GMT+01:00)
From: Forza
To: Zygo Blaxell, Andrey Melnikov
Cc: linux-btrfs@vger.kernel.org
Message-ID: <8747149.faa9ddba.17d5b575f6b@tnonline.net>
In-Reply-To: <20211126051503.GG17148@hungrycats.org>
Subject: Re: btrfs with huge numbers of hardlinks is extremely slow

---- From: Zygo Blaxell -- Sent: 2021-11-26 - 06:15 ----

> On Fri, Nov 26, 2021 at 12:56:25AM +0300, Andrey Melnikov wrote:
>> Every night a new backup is stored on this fs with 'rsync
>> --link-dest=$yestoday/ $today/' - 1402700 hardlinks and 23000
>> directories are created, 50-100 normal files transferred.
>> Now, FS contains 351 copies of backup data with 486086495 hardlinks
>> and ANY operations on this FS take significant time. For example -
>> simple count hardlinks with
>> "time find . -type f -links +1 | wc -l" take:
>> real 28567m33.611s
>> user 31m33.395s
>> sys 506m28.576s
>>
>> 19 days 20 hours 10 mins with constant reads from storage 2-4Mb/s.
>
> That works out to reading the entire drive 4x in 20 days, or all of the
> metadata 30x. Certainly hardlinks will not result in optimal object
> placement, and you probably don't have enough RAM to cache the entire
> metadata tree, and you're using WD Purple drives in a fileserver for
> some reason, so those numbers seem plausible.

Defragmenting the subvolume and extent trees could help reduce the number
and distance of seeks needed to read metadata, which should improve the
performance of find:

# btrfs fi defrag /path/to/subvol

If you cannot change drive model, breaking up your HW RAID and creating a
btrfs raid1 should also improve performance, as well as fault tolerance.
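As a rough sketch, assuming the second disk shows up as /dev/sdc once the
HW RAID is broken up (device name is an example only) and /srv is the
btrfs mount point:

# btrfs device add /dev/sdc1 /srv
# btrfs balance start -dconvert=raid1 -mconvert=raid1 /srv
# btrfs fi usage /srv

The balance rewrites the existing chunks so both data and metadata end up
mirrored across the two disks; 'btrfs fi usage' afterwards should show
raid1 profiles on both devices.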
Apart from this, I second Zygo's suggestions below.

>
>> - BTRFS not suitable for this workload?
>
> There are definitely better ways to do this on btrfs, e.g.
>
> btrfs sub snap $yesterday $today
> rsync ... (no link-dest) ... $today
>
> This will avoid duplicating the entire file tree every time. It will also
> store historical file attributes correctly, which --link-dest sometimes
> does not.
>
> You might also consider doing it differently:
>
> rsync ... (no link-dest) ... working-dir/. &&
> btrfs sub snap -r working-dir $today
>
> so that your $today directory doesn't exist until it is complete with
> no rsync errors. 'working-dir' will have to be a subvol, but you only
> have to create it once and you can keep reusing it afterwards.
>
>> - using reflinks helps speedup FS operations?
>
> Snapshots are lazy reflink copies, so they'll do a little better than
> reflinks. You'll only modify the metadata for the 50-100 files that you
> transfer each day, instead of completely rewriting all of the metadata
> in the filesystem every day with hardlinks.
>
> Hardlinks put the inodes further and further away from their directory
> nodes each day, and add some extra searching overhead within directories
> as well. You'll need more and more RAM to cache the same amount of
> each filesystem tree, because they're all in the same metadata tree.
> With snapshots they'll end up in separate metadata trees.
>
>> - readed metadata not cached at all?
>
> If you have less than about 640GB of RAM (4x the size of your metadata)
> then you're going to be rereading metadata pages at some point. Because
> you're using hardlinks, the metadata pages from different days are all
> mashed together, and 'find' will flood the cache chasing references to
> them.
>
> Other recommendations:
>
> - Use the right drive model for your workload. WD Purple drives are for
> continuous video streaming, they are not for seeky millions-of-tiny-files
> rsync workloads. Almost any other model will outperform them, and
> better drives for this workload (e.g. CMR WD Red models) are cheaper.
> Your WD Purple drives are getting 283 links/s. Compare that with some
> other drive models:
>
>     1665 links/s: WD Green (2x1TB + 1x2TB btrfs raid1)
>     6850 links/s: Sandisk Extreme MicroSD (1x256GB btrfs single/dup)
>    12511 links/s: WD Red (2x1TB btrfs raid1)
>    13371 links/s: WD Red SSD + Seagate Ironwolf SSD (6x1TB btrfs raid1)
>    14872 links/s: WD Black (1x1TB btrfs single/dup, 8 years old)
>    25498 links/s: WD Gold + Seagate Exos (3x16TB btrfs raid1)
>    27341 links/s: Toshiba NVME (1x2TB btrfs single/dup)
>   311284 links/s: Sabrent Rocket 4 NVME (2x1TB btrfs raid1)
>                   (1344748222 links, 111 snapshots)
>
> Some of these numbers are lower than they should be, because I ran
> 'find' commands on some machines that were busy doing other work.
> The point is that even if some of these numbers are too low, all of
> these numbers are higher than what we can expect from a WD Purple.
>
> - Use btrfs raid1 instead of hardware RAID1, i.e. expose each disk
> separately through the RAID interface to btrfs. This will enable btrfs
> to correct errors and isolate faults if one of your drives goes bad.
> You can also use iostat to see if one of the drives is running much
> slower than the other, which might be an early indication of failure
> (and it might be the only indication of failure you get, if your drive's
> firmware doesn't support SCTERC and hides failures).
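If you do split the drives, per-device latency can be watched with iostat
from the sysstat package, e.g. (device names are placeholders for however
the individual disks end up named):

iostat -dxm 5 /dev/sda /dev/sdb

A drive whose r_await/w_await or %util stays much higher than its
mirror's under the same workload is worth keeping an eye on.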
>
>> What BTRFS read 19 days from disks???
>>
>> Hardware: dell r300 with 2 WD Purple 1Tb disk on LSISAS1068E RAID 1
>> (without cache).
>> Linux nlxc 5.14-51412-generic #0~lch11 SMP Wed Oct 13 15:57:07 UTC
>> 2021 x86_64 GNU/Linux
>> btrfs-progs v5.14.1
>>
>> # btrfs fi show
>> Label: none  uuid: a840a2ca-bf05-4074-8895-60d993cb5bdd
>>         Total devices 1 FS bytes used 474.26GiB
>>         devid    1 size 931.00GiB used 502.23GiB path /dev/sdb1
>>
>> # btrfs fi df /srv
>> Data, single: total=367.19GiB, used=343.92GiB
>> System, single: total=32.00MiB, used=128.00KiB
>> Metadata, single: total=135.00GiB, used=130.34GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>> # btrfs fi us /srv
>> Overall:
>>     Device size:                 931.00GiB
>>     Device allocated:            502.23GiB
>>     Device unallocated:          428.77GiB
>>     Device missing:                  0.00B
>>     Used:                        474.26GiB
>>     Free (estimated):            452.04GiB  (min: 452.04GiB)
>>     Free (statfs, df):           452.04GiB
>>     Data ratio:                       1.00
>>     Metadata ratio:                   1.00
>>     Global reserve:              512.00MiB  (used: 0.00B)
>>     Multiple profiles:                  no
>>
>> Data,single: Size:367.19GiB, Used:343.92GiB (93.66%)
>>    /dev/sdb1  367.19GiB
>>
>> Metadata,single: Size:135.00GiB, Used:130.34GiB (96.55%)
>>    /dev/sdb1  135.00GiB
>>
>> System,single: Size:32.00MiB, Used:128.00KiB (0.39%)
>>    /dev/sdb1   32.00MiB
>>
>> Unallocated:
>>    /dev/sdb1  428.77GiB
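To make the snapshot + rsync flow Zygo suggests above a bit more concrete,
a nightly job could look roughly like this. The source location, target
paths and rsync options are only examples; adjust them to your layout:

# One-time setup: the working copy must be a subvolume
btrfs subvolume create /srv/backups/current

# Nightly job: snapshot only if the transfer succeeded
today=/srv/backups/$(date +%F)
rsync -aHAX --delete source:/data/ /srv/backups/current/. &&
    btrfs subvolume snapshot -r /srv/backups/current "$today"

Each day's read-only snapshot then shares extents with the previous ones
instead of multiplying hardlinks, so only the metadata for the changed
files is written.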