From: Andrej Friesen <andre.friesen@gmail.com>
To: Lionel Bouton <lionel-subscription@bouton.name>,
linux-btrfs@vger.kernel.org
Subject: Re: Questions about BTRFS balance and scrub on non-RAID setup
Date: Tue, 31 Aug 2021 10:17:07 +0200 [thread overview]
Message-ID: <d765bf95-0463-59bd-022a-39a0c2d8a241@gmail.com> (raw)
In-Reply-To: <04941c75-3ea5-32de-5978-efe5c5681ee2@bouton.name>
Hi,
thanks for the useful information, Lionel.
That already helped a lot!
Scrub:
> Partially. Ceph replication/scrub/repair will cover individual disk/OSD
> server faults but not faults at the origin of the data being stored.
>
> We provide the same service for a customer. Several years ago the VM
> hosting the NFS server for this customer ran on hardware that developed
> a fault, the result was silent corruption of the data written by the NFS
> server *before* being handed to Ceph for storage (probably memory or CPU
> related, we threw the server out of the cluster and never looked back...).
> - ceph scrubbing was of no use there because from its point of view the
> replicated blocks were all fine.
> - we launch btrfs scrub monthly by default and this is how we detected
> the corruption.
This is a really good point!
Even if btrfs cannot automatically repair the corrupted files during
the scrub, it would be good to know that corruption happened so we can
act accordingly.
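For the "know that it happened" part, a periodic scrub plus a check of
its status output might already be enough. A minimal sketch (the sample
statuses below are made up; the "error summary" wording follows what
btrfs-progs prints, but check your version's output format):

```shell
# Sketch: flag corruption from `btrfs scrub status` output.
# The sample statuses are fabricated for illustration.
scrub_has_errors() {
    # btrfs-progs prints "error summary: no errors found" after a
    # clean scrub; anything else means errors were detected.
    echo "$1" | grep -q 'no errors found' && return 1
    return 0
}

clean_status="scrub started at Tue Aug 31 02:00:01 2021, finished
error summary: no errors found"
bad_status="scrub started at Tue Aug 31 02:00:01 2021, finished
error summary: csum=12"

scrub_has_errors "$clean_status" && echo "clean: ALERT" || echo "clean: ok"
scrub_has_errors "$bad_status"   && echo "bad: ALERT"   || echo "bad: ok"
```

On a real system the status text would come from
`btrfs scrub status /mountpoint` after a `btrfs scrub start`, and the
alert branch would page someone instead of echoing.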
> We make regular rbd snapshots so we could :
> - switch the NFS server to an existing read-only replica (that could not
> be corrupted by the same fault as it was replicated using simple
> file-level content synchronization),
> - restart the original NFS server using the last known good snapshot,
> - rsync fresh data from the replica to the original server to catch up,
> - switch back.
We also wanted to take some rbd snapshots to have some kind of disaster
recovery in case something happens. Just in case.
Our idea was also to offer quick file-based "backups" with btrfs
snapshots. These would help if a file was created correctly at first
and later writes to it were corrupted by hardware failures.
But to guard against filesystem-level corruption we also wanted to keep
some rbd snapshots; you never know.
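A simple daily rotation along those lines could look like the sketch
below. The pool/image name, retention window, and date-based snapshot
naming are all assumptions, and RBD is set to a dry-run echo so the
script can be read without a cluster:

```shell
# Dry-run sketch of a daily rbd snapshot with retention.
# Remove the `echo` to run the real rbd(8) commands.
RBD="echo rbd"
IMAGE="rbd/nfs-data"   # assumed pool/image name
KEEP=7                 # days of snapshots to retain

# Take today's snapshot (date-based names sort chronologically).
$RBD snap create "$IMAGE@$(date +%Y-%m-%d)"

# Remove snapshots outside the retention window (GNU head's
# negative -n keeps the last $KEEP entries).
$RBD snap ls "$IMAGE" | awk 'NR > 1 { print $2 }' | sort | head -n -"$KEEP" |
while read -r snap; do
    $RBD snap rm "$IMAGE@$snap"
done
```

In dry-run mode the `snap ls` pipeline produces nothing to delete, so
only the create command is printed; against a cluster, `rbd snap ls`
would list real snapshot names for the retention pass.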
Balance:
> Full balance is probably overkill in any situation and can sunk your I/O
> bandwidth. With recent kernels it seems there is less need for
> balancing. We still use an automatic balancing script that tries to
> limit the amount of free space allocated to nearly empty allocation
> groups (by using "usage=50+" filters) and cancels the balance if it is
> too long (to avoid limiting IO performance for too long, waiting for a
> next call to continue) but I'm not sure if it's still worth it. In our
> case we have been bitten by out of space situations with old kernels
> brought by over-allocation of free space due to temporary large space
> usages so we consider it an additional safeguard.
To solve the filesystem-full "problem" we wanted to create a large
block device and give the data subvolume a quota of, let's say, 80 % of
it.
We could also make the block device double the size of the subvolume
quota we offer; since it is thin-provisioned on the Ceph side, we do
not lose any storage.
We have tested discard/trim with btrfs and Ceph and everything worked
fine :-)
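Written out, the sizing arithmetic would be something like this; the
1 TiB image size is just an example, and the qgroup commands in the
trailing comment are the standard btrfs ones, shown for context rather
than executed here:

```shell
# Sketch: cap the data subvolume at 80 % of an oversized,
# thin-provisioned rbd image. The image size is an example value.
IMAGE_GIB=1000                        # thin rbd image, costs nothing unwritten
LIMIT_GIB=$((IMAGE_GIB * 80 / 100))   # 80 % of the device for data

echo "image=${IMAGE_GIB}GiB qgroup-limit=${LIMIT_GIB}GiB"

# Applied on a mounted filesystem roughly as:
#   btrfs quota enable /mnt/data
#   btrfs qgroup limit ${LIMIT_GIB}G /mnt/data/subvol
```

The 20 % headroom stays available to metadata and to btrfs's own
chunk allocation, which is the point of the exercise.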
Is there any metric we could or should measure to see whether a balance
would give us some benefit?
Did you run the balance only because of the filesystem-full problem?
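One thing we could already watch ourselves is the gap between allocated
and used chunk space reported by `btrfs filesystem df`, since that
slack is what usage-filtered balances reclaim. A sketch against a
made-up sample of that output (on a real system the command output
would be piped in instead):

```shell
# Sketch: estimate how much allocated-but-unused chunk space a
# usage-filtered balance could reclaim. Sample output is fabricated
# in the style of `btrfs filesystem df`.
sample="Data, single: total=100.00GiB, used=60.00GiB
Metadata, DUP: total=4.00GiB, used=1.50GiB
System, DUP: total=0.03GiB, used=0.01GiB"

echo "$sample" | awk -F'[=,]' '
    # Splitting on "=" and "," puts total in field 3, used in field 5.
    { gsub(/GiB/, ""); slack += $3 - $5 }
    END { printf "allocated but unused: %.2f GiB\n", slack }
'
```

A persistently large slack (here 42.52 GiB) suggests a balance with
`-dusage=...` filters would give chunk space back to the allocator.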
I saw a recommendation to run this balance daily:
`btrfs balance start -dusage=50 -dlimit=2 -musage=50 -mlimit=4`
Source:
https://github.com/netdata/netdata/issues/3203#issuecomment-356026930
Is that recommendation still valid today?
If so, why does the FAQ not have such information?
I am happy to put something in the wiki, if needed.
Defragmentation:
> You probably want to use autodefrag or a custom defragmentation solution
> too. We weren't satisfied with autodefrag in some situations (were
> clearly fragmentation crept in and IO performance suffered until a
> manual defrag) and developed our own scheduler for triggering
> defragmentation based on file writes and slow full filesystem scans,
The Ceph cluster uses only SSDs, so I guess we do not suffer from
fragmentation problems the way HDDs do; at least that is my
understanding of SSDs.
--
Andrej Friesen
https://www.ajfriesen.com/
Thread overview: 5+ messages
2021-08-30 13:20 Questions about BTRFS balance and scrub on non-RAID setup Andrej Friesen
2021-08-30 14:18 ` Lionel Bouton
2021-08-31 8:17 ` Andrej Friesen [this message]
2021-08-31 13:06 ` Lionel Bouton
2021-09-01 4:54 Duncan