From: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>
To: "Ellis H. Wilson III" <ellisw@panasas.com>,
	Qu Wenruo <quwenruo.btrfs@gmx.com>,
	Hans van Kranenburg <hans.van.kranenburg@mendix.com>,
	Nikolay Borisov <nborisov@suse.com>,
	linux-btrfs@vger.kernel.org
Subject: Re: Status of FST and mount times
Date: Tue, 20 Feb 2018 10:41:23 -0500	[thread overview]
Message-ID: <ccc1f937-1cdd-afdd-d164-99c3ad648387@gmail.com> (raw)
In-Reply-To: <5743750c-644d-9160-c0f3-599caf92dcb6@panasas.com>

On 2018-02-20 09:59, Ellis H. Wilson III wrote:
> On 02/16/2018 07:59 PM, Qu Wenruo wrote:
>> On 2018年02月16日 22:12, Ellis H. Wilson III wrote:
>>> $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
>>> 3454
>>
>> OK, this explains everything.
>>
>> There are too many chunks.
>> This means that at mount time you need to search for block group items
>> 3454 times.
>>
>> Even if each search only needs to iterate over 3 tree blocks,
>> multiplied by 3454 that is still a lot of work.
>> Although some tree blocks, like the root node and level 1 nodes, can be
>> cached, we still need to read about 3500 tree blocks.
>>
>> If the fs was created with a 16K nodesize, this means you need to do
>> about 54M of random reads in 16K blocks.
>>
>> No wonder it takes some time.
>>
>> Normally I would expect 1G chunks for both data and metadata.
>>
>> If there is nothing special going on, this means your filesystem is
>> already larger than 3T.
>> If your used space is much smaller than 3.5T (less than 30%), then
>> your chunk usage is quite low, and in that case a balance to reduce
>> the number of chunks (block groups) would reduce the mount time.
> 
> The nodesize is 16K, and the filesystem data is 3.32TiB as reported by 
> btrfs fi df.  So, from what I am hearing, this mount time is normal for 
> a filesystem this size.  Ignoring a more complex and proper fix like the 
> ones we've been discussing, would bumping the nodesize reduce the number 
> of chunks, thereby reducing the mount time?
Probably not.  Chunk size is based only on the total size of the 
filesystem (with reasonable base values), so you would still need at 
least as many chunks to store the same amount of data (increase the 
node size too much, though, and you will end up with more chunks, 
because you'll waste more space on partially empty nodes).
> 
> I don't see why balance would come into play here -- my understanding 
> was that was for aged filesystems.  The only operations I've done on 
> here was:
> 1. Format filesystem clean
> 2. Create a subvolume
> 3. rsync our home directories into that new subvolume
> 4. Create another subvolume
> 5. rsync our home directories into that new subvolume
> 
> Accordingly, zero (or at least, extremely little) data should have been 
> overwritten, so I would expect things to be fairly well allocated 
> already.  Please correct me if this is naive thinking.
Your logic is generally correct for data, but not necessarily for 
metadata.  Assuming you did not use the `--inplace` option for rsync, 
it had to issue a rename for each individual file it copied in, so a 
lot of metadata was likely rewritten anyway.

As far as balance being for aged filesystems, that's not exactly true. 
There are four big reasons you might run a balance:

1. As part of reshaping a volume.  You generally want to run a balance 
whenever the number of disks in a volume permanently increases (it will 
happen automatically when it permanently decreases, as the device 
deletion operation is a special type of balance under the hood).  It's 
also used for converting chunk profiles.
2. To free up empty space inside chunks when the filesystem is full at 
the chunk level.
3. To redistribute data across multiple disks in a more even manner 
after deleting a lot of data.
4. To reduce the likelihood of 2 or 3 being an issue.

Reasons 2 and 3 are generally more likely to be needed on old volumes. 
Reason 1 is independent of the age of a volume.  Reason 4 is the reason 
for the regular filtered balances that I and some other people recommend 
be run as part of preventative maintenance, and is also generally 
independent of the age of a volume.

Qu's suggestion is actually independent of all the above reasons, but 
does kind of fit in with the fourth as another case of preventative 
maintenance.
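A minimal sketch of such a filtered maintenance balance (the mount 
point and the 30% usage thresholds are placeholders; pick values that 
fit your volume, and run as root):

```shell
# Rewrite only data and metadata chunks that are under 30% full,
# packing their contents into fewer chunks and freeing the empty ones.
btrfs balance start -dusage=30 -musage=30 /mnt/data

# Re-count the chunks afterwards (same check as earlier in the thread;
# newer btrfs-progs spell btrfs-debug-tree as 'inspect-internal dump-tree').
btrfs inspect-internal dump-tree -t chunk /dev/sdb | grep -c CHUNK_ITEM
```

Because the usage filter skips well-packed chunks, this touches far 
less data than a full balance and can be run regularly.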
> 
>>> I was using btrfs sub del -C for the deletions, so I believe (if that
>>> command truly waits for the subvolume to be utterly gone) it captures
>>> the entirety of the snapshot.
>>
>> No, snapshot deletion is completely delayed in background.
>>
>> -C only ensures that even if a power loss happens after the command
>> returns, you won't see the snapshot anywhere, but it will still be
>> deleted in the background.
> 
> Ah, I had no idea.  Thank you!  Is there any way to "encourage" 
> btrfs-cleaner to run at specific times, which I presume is the snapshot 
> deletion process you are referring to?  If it can be told to run at a 
> given time, can I throttle how fast it works, such that I avoid some of 
> the high foreground interruption I've seen in the past?
I don't think there's any way to do this right now (though it would be 
nice if there were).  In theory, you could adjust the priority of the 
kernel thread itself, but messing around with kthread priorities is 
seriously dangerous even if you know exactly what you're doing.
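One partial workaround, if the goal is just to confine the cleaner's 
work to a known window: to my knowledge `btrfs subvolume sync` blocks 
until deleted subvolumes have actually been cleaned up, so you can at 
least wait for the deletion to finish at a time of your choosing (the 
paths below are placeholders):

```shell
# -C only waits for the transaction commit that hides the snapshot,
# not for the cleaner to reclaim its space.
btrfs subvolume delete -C /mnt/data/snapshots/home-20180220

# Poll until the cleaner has fully removed all deleted subvolumes on
# this filesystem; run this during an idle period.
btrfs subvolume sync /mnt/data
```

That doesn't throttle the cleaner, but it does tell you when the heavy 
background work is over.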
