Date: Fri, 26 Nov 2021 09:23:12 +0100 (GMT+01:00)
From: Forza
To: Zygo Blaxell, Andrey Melnikov
Cc: linux-btrfs@vger.kernel.org
Message-ID: <8747149.faa9ddba.17d5b575f6b@tnonline.net>
In-Reply-To: <20211126051503.GG17148@hungrycats.org>
Subject: Re: btrfs with huge numbers of hardlinks is extremely slow

---- From: Zygo Blaxell -- Sent: 2021-11-26 - 06:15 ----

> On Fri, Nov 26, 2021 at 12:56:25AM +0300, Andrey Melnikov wrote:
>> Every night a new backup is stored on this fs with 'rsync
>> --link-dest=$yestoday/ $today/' - 1402700 hardlinks and 23000
>> directories are created, 50-100 normal files transferred.
>> Now, FS contains 351 copies of backup data with 486086495 hardlinks
>> and ANY operations on this FS take significant time. For example -
>> simple count hardlinks with
>> "time find . -type f -links +1 | wc -l" take:
>> real 28567m33.611s
>> user 31m33.395s
>> sys 506m28.576s
>>
>> 19 days 20 hours 10 mins with constant reads from storage 2-4Mb/s.
>
> That works out to reading the entire drive 4x in 20 days, or all of the
> metadata 30x. Certainly hardlinks will not result in optimal object
> placement, and you probably don't have enough RAM to cache the entire
> metadata tree, and you're using WD Purple drives in a fileserver for
> some reason, so those numbers seem plausible.

Defragmenting the subvolume and extent trees could help reduce the number
and distance of seeks needed to read metadata, which should improve the
performance of find:

# btrfs fi defrag /path/to/subvol

If you cannot change drive model, breaking up your HW RAID and creating a
btrfs raid1 should also improve performance, as well as fault tolerance.
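As a rough sketch, assuming the second disk shows up as /dev/sdc once the
HW RAID is broken up (device name is an example only) and /srv is the
btrfs mount point:

# btrfs device add /dev/sdc1 /srv
# btrfs balance start -dconvert=raid1 -mconvert=raid1 /srv
# btrfs fi usage /srv

The balance rewrites the existing chunks so both data and metadata end up
mirrored across the two disks; 'btrfs fi usage' afterwards should show
raid1 profiles on both devices.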
Apart from this, I second Zygo's suggestions below.

>
>> - BTRFS not suitable for this workload?
>
> There are definitely better ways to do this on btrfs, e.g.
>
> btrfs sub snap $yesterday $today
> rsync ... (no link-dest) ... $today
>
> This will avoid duplicating the entire file tree every time. It will also
> store historical file attributes correctly, which --link-dest sometimes
> does not.
>
> You might also consider doing it differently:
>
> rsync ... (no link-dest) ... working-dir/. &&
> btrfs sub snap -r working-dir $today
>
> so that your $today directory doesn't exist until it is complete with
> no rsync errors. 'working-dir' will have to be a subvol, but you only
> have to create it once and you can keep reusing it afterwards.
>
>> - using reflinks helps speedup FS operations?
>
> Snapshots are lazy reflink copies, so they'll do a little better than
> reflinks. You'll only modify the metadata for the 50-100 files that you
> transfer each day, instead of completely rewriting all of the metadata
> in the filesystem every day with hardlinks.
>
> Hardlinks put the inodes further and further away from their directory
> nodes each day, and add some extra searching overhead within directories
> as well. You'll need more and more RAM to cache the same amount of
> each filesystem tree, because they're all in the same metadata tree.
> With snapshots they'll end up in separate metadata trees.
>
>> - readed metadata not cached at all?
>
> If you have less than about 640GB of RAM (4x the size of your metadata)
> then you're going to be rereading metadata pages at some point. Because
> you're using hardlinks, the metadata pages from different days are all
> mashed together, and 'find' will flood the cache chasing references to
> them.
>
> Other recommendations:
>
> - Use the right drive model for your workload. WD Purple drives are for
> continuous video streaming, they are not for seeky millions-of-tiny-files
> rsync workloads. Almost any other model will outperform them, and
> better drives for this workload (e.g. CMR WD Red models) are cheaper.
> Your WD Purple drives are getting 283 links/s. Compare that with some
> other drive models:
>
>     1665 links/s: WD Green (2x1TB + 1x2TB btrfs raid1)
>     6850 links/s: Sandisk Extreme MicroSD (1x256GB btrfs single/dup)
>    12511 links/s: WD Red (2x1TB btrfs raid1)
>    13371 links/s: WD Red SSD + Seagate Ironwolf SSD (6x1TB btrfs raid1)
>    14872 links/s: WD Black (1x1TB btrfs single/dup, 8 years old)
>    25498 links/s: WD Gold + Seagate Exos (3x16TB btrfs raid1)
>    27341 links/s: Toshiba NVME (1x2TB btrfs single/dup)
>   311284 links/s: Sabrent Rocket 4 NVME (2x1TB btrfs raid1)
>                   (1344748222 links, 111 snapshots)
>
> Some of these numbers are lower than they should be, because I ran
> 'find' commands on some machines that were busy doing other work.
> The point is that even if some of these numbers are too low, all of
> these numbers are higher than what we can expect from a WD Purple.
>
> - Use btrfs raid1 instead of hardware RAID1, i.e. expose each disk
> separately through the RAID interface to btrfs. This will enable btrfs
> to correct errors and isolate faults if one of your drives goes bad.
> You can also use iostat to see if one of the drives is running much
> slower than the other, which might be an early indication of failure
> (and it might be the only indication of failure you get, if your drive's
> firmware doesn't support SCTERC and hides failures).
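If you do split the drives, per-device latency can be watched with iostat
from the sysstat package, e.g. (device names are placeholders for however
the individual disks end up named):

iostat -dxm 5 /dev/sda /dev/sdb

A drive whose r_await/w_await or %util stays much higher than its
mirror's under the same workload is worth keeping an eye on.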
>
>> What BTRFS read 19 days from disks???
>>
>> Hardware: dell r300 with 2 WD Purple 1Tb disk on LSISAS1068E RAID 1
>> (without cache).
>> Linux nlxc 5.14-51412-generic #0~lch11 SMP Wed Oct 13 15:57:07 UTC
>> 2021 x86_64 GNU/Linux
>> btrfs-progs v5.14.1
>>
>> # btrfs fi show
>> Label: none  uuid: a840a2ca-bf05-4074-8895-60d993cb5bdd
>>         Total devices 1 FS bytes used 474.26GiB
>>         devid    1 size 931.00GiB used 502.23GiB path /dev/sdb1
>>
>> # btrfs fi df /srv
>> Data, single: total=367.19GiB, used=343.92GiB
>> System, single: total=32.00MiB, used=128.00KiB
>> Metadata, single: total=135.00GiB, used=130.34GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>> # btrfs fi us /srv
>> Overall:
>>     Device size:                 931.00GiB
>>     Device allocated:            502.23GiB
>>     Device unallocated:          428.77GiB
>>     Device missing:                  0.00B
>>     Used:                        474.26GiB
>>     Free (estimated):            452.04GiB  (min: 452.04GiB)
>>     Free (statfs, df):           452.04GiB
>>     Data ratio:                       1.00
>>     Metadata ratio:                   1.00
>>     Global reserve:              512.00MiB  (used: 0.00B)
>>     Multiple profiles:                  no
>>
>> Data,single: Size:367.19GiB, Used:343.92GiB (93.66%)
>>    /dev/sdb1  367.19GiB
>>
>> Metadata,single: Size:135.00GiB, Used:130.34GiB (96.55%)
>>    /dev/sdb1  135.00GiB
>>
>> System,single: Size:32.00MiB, Used:128.00KiB (0.39%)
>>    /dev/sdb1   32.00MiB
>>
>> Unallocated:
>>    /dev/sdb1  428.77GiB
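To make the snapshot + rsync flow Zygo suggests above a bit more concrete,
a nightly job could look roughly like this. The source location, target
paths and rsync options are only examples; adjust them to your layout:

# One-time setup: the working copy must be a subvolume
btrfs subvolume create /srv/backups/current

# Nightly job: snapshot only if the transfer succeeded
today=/srv/backups/$(date +%F)
rsync -aHAX --delete source:/data/ /srv/backups/current/. &&
    btrfs subvolume snapshot -r /srv/backups/current "$today"

Each day's read-only snapshot then shares extents with the previous ones
instead of multiplying hardlinks, so only the metadata for the changed
files is written.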