Re: Questions about BTRFS balance and scrub on non-RAID setup

public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed

* Re: Questions about BTRFS balance and scrub on non-RAID setup
@ 2021-09-01  4:54 Duncan
  0 siblings, 0 replies; 5+ messages in thread
From: Duncan @ 2021-09-01  4:54 UTC (permalink / raw)
  To: Andrej Friesen, linux-btrfs

Andrej Friesen posted on Tue, 31 Aug 2021 10:17:07 +0200 as excerpted:

>> You probably want to use autodefrag or a custom defragmentation
>> solution too. We weren't satisfied with autodefrag in some situations
>> (were clearly fragmentation crept in and IO performance suffered
>> until a manual defrag) and developed our own scheduler for triggering
>> defragmentation based on file writes and slow full filesystem scans,
> 
> The ceph cluster only uses SSDs therefore I guess we do not suffer
> from fragmentation problem as with HDDs. As far as I understood SSDs.

Since I saw mention of btrfs snapshots as well...

It's worth mentioning that defrag (of course) triggers a write-out of
the new defragmented data, which because btrfs snapshots are cow-based
(copy- on-write), duplicates blocks still locked into place by existing 
snapshots.  With rewrite-in-place write patterns (typical
write-patterns for database or VM image usage), defrag and repeated
snapshots this can eat up space rather fast.

(They tried snapshot-aware defrag at one point but due to the exploded 
complexity of dealing with all the COW-references the performance just 
wasn't within the realm of practical as the defrag ended up making
little forward progress, so that was dropped in favor of a defrag that
would break the cow-references and thus use extra space, but at least
/worked/ for its labeled purpose.)

So I'd suggest choosing either one or the other, either snapshotting or 
defrag, don't try to use both in combination, or at least limit their 
usage in combination and keep an eye on space usage, deleting snapshots 
and/or reducing defrag frequency to some fraction of the snapshot 
frequency as necessary.

For ssds, autodefrag without manual defrag may be a reasonable
compromise (it's one I like personally but my use-case isn't
commercial), tho it is said that autodefrag may be a performance
bottleneck for some database (and I suspect VM-image as well)
use-cases, but I suspect autodefrag on ssds should both mitigate the
performance issue and likely eliminate the need for more intensive
manual/scheduled defrag runs.

The other thing to consider with below-btrfs-level snapshotting, and
I'm out-of-league for ceph/rdb but know it's definitely a problem with
lvm, is that btrfs due to its multi-device functionality cannot be
allowed to see other snapshots of the filesystem with the same btrfs
UUID.  (Btrfs- scan is what would make btrfs aware of them, but udev
typically triggers btrfs-scan when it detects new devices, and with lvm
at least, udev device detection can trigger somewhat unexpectedly.)
Because when btrfs sees these other devices with the same btrfs UUID,
it considers them additional devices of a multi-device btrfs and can
attempt to write to them instead of the original target device,
potentially creating all sorts of mayhem!

Like I said I'm out-of-league with ceph, etc, and have no idea if this 
even applies with it, but when I saw rdb snapshots mentioned I thought
of the lvm snapshots problem and thought it was worth a heads-up, in
case further investigation is necessary.

Likewise I saw the mention of quotas and balance.  Balance with quotas 
running similarly explodes due to constant recalculation of the quota
as the balance does its thing, increasing balance time dramatically and 
often out of the realm of the practical.  So if quotas are needed, 
minimize the use of balance, and if a balance is necessary, turning off 
quotas temporarily may be the only way to make reasonable forward 
progress on the balance.

But it sounds like btrfs quotas may not be necessary, thus avoiding
that problem entirely. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Questions about BTRFS balance and scrub on non-RAID setup
@ 2021-08-30 13:20 Andrej Friesen
  2021-08-30 14:18 ` Lionel Bouton
  0 siblings, 1 reply; 5+ messages in thread
From: Andrej Friesen @ 2021-08-30 13:20 UTC (permalink / raw)
  To: linux-btrfs

Hey folks,

I have used btrfs now for a few years on my home server and have had a
good experience so far.

But now I need some advice because I and my team want to use BTRFs in
a product. And personal use is something really different than
enterprise :-)

Use case and context for my questions:

A file system as a service for our customers.
This will be offered to the customer as a network share via NFS. That
also means we do not have any control over the usage patterns.
No idea about how often, how much they write small or big files to
that file system.

Technically we only create one block device with several terabytes and
format this with btrfs. The actual block device which we format is
backed by a ceph cluster.
So the actual block device is already been on a distributed storage,
therefore we will not do any raid configuration.

The kernel will be a recent 5.10.

Scrub:

Do I need to regularly scrub?
If so, what would be a recommendation for my use case?

My conclusion after reading about the scrub. This checks for damaged
data and will recover the data if this filesystem has another copy of
that data.
Since we will run without raid in btrfs this is not needed in my opinion.
Am I right with my conclusion here?

Balance:

Do I need to regularly balance my filesystem?
If so, what would be a recommendation for my use case?

I am a little bit confused about this one.
The FAQ (https://btrfs.wiki.kernel.org/index.php/FAQ#Do_I_need_to_run_a_balance_regularly.3F)
says:

> In general usage, no. A full unfiltered balance typically takes a long time, and will rewrite huge amounts of data unnecessarily. You may wish to run a balance on metadata only (see Balance_Filters) if you find you have very large amounts of metadata space allocated but unused, but this should be a last resort. At some point, this kind of clean-up will be made an automatic background process.

Others on the wide internet however say it makes sense to regularly balance:

https://github.com/netdata/netdata/issues/3203#issuecomment-356026930

Something like this every day:
`btrfs balance start -dusage=50 -dlimit=2 -musage=50 -mlimit=4`

I also asked on IRC (username ajfriesen) about regular balance and
people seem to have different opinions on that topic as well.

What would a recommendation look like for my use case?
Would it make sense to update the FAQ in that regard?

PS: First-time mailing list user, please tell me if I did something wrong.

All the best
---
Andrej Friesen

https://www.ajfriesen.com/

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Questions about BTRFS balance and scrub on non-RAID setup
  2021-08-30 13:20 Andrej Friesen
@ 2021-08-30 14:18 ` Lionel Bouton
  2021-08-31  8:17   ` Andrej Friesen
  0 siblings, 1 reply; 5+ messages in thread
From: Lionel Bouton @ 2021-08-30 14:18 UTC (permalink / raw)
  To: Andrej Friesen, linux-btrfs

Hi,

Le 30/08/2021 à 15:20, Andrej Friesen a écrit :
> {...]
> Use case and context for my questions:
>
> A file system as a service for our customers.
> This will be offered to the customer as a network share via NFS. That
> also means we do not have any control over the usage patterns.
> No idea about how often, how much they write small or big files to
> that file system.
>
> Technically we only create one block device with several terabytes and
> format this with btrfs. The actual block device which we format is
> backed by a ceph cluster.
> So the actual block device is already been on a distributed storage,
> therefore we will not do any raid configuration.
>
> The kernel will be a recent 5.10.
>
> Scrub:
>
> Do I need to regularly scrub?
> If so, what would be a recommendation for my use case?
>
> My conclusion after reading about the scrub. This checks for damaged
> data and will recover the data if this filesystem has another copy of
> that data.
> Since we will run without raid in btrfs this is not needed in my opinion.
> Am I right with my conclusion here?

Partially. Ceph replication/scrub/repair will cover individual disk/OSD
server faults but not faults at the origin of the data being stored.

We provide the same service for a customer. Several years ago the VM
hosting the NFS server for this customer ran on hardware that developed
a fault, the result was silent corruption of the data written by the NFS
server *before* being handed to Ceph for storage (probably memory or CPU
related, we threw the server out of the cluster and never looked back...).
- ceph scrubbing was of no use there because from its point of view the
replicated blocks were all fine.
- we launch btrfs scrub monthly by default and this is how we detected
the corruption.

We make regular rbd snapshots so we could :
- switch the NFS server to an existing read-only replica (that could not
be corrupted by the same fault as it was replicated using simple
file-level content synchronization),
- restart the original NFS server using the last known good snapshot,
- rsync fresh data from the replica to the original server to catch up,
- switch back.

IIRC I've seen posts here about more checks done in the write path to
catch corruption but even if the likelihood of such corruption is lower
with recent kernels, hardware faults happen and software solutions can't
fully cover for them. Being able to catch corruption after the fact
relatively early makes recovery simpler and faster so I would only
disable scrubs on disposable data. Imagine discovering corruption when
you reboot your NFS server and the filesystem refuses to mount...

>
> Balance:
>
> Do I need to regularly balance my filesystem?
> If so, what would be a recommendation for my use case?

Full balance is probably overkill in any situation and can sunk your I/O
bandwidth. With recent kernels it seems there is less need for
balancing. We still use an automatic balancing script that tries to
limit the amount of free space allocated to nearly empty allocation
groups (by using "usage=50+" filters) and cancels the balance if it is
too long (to avoid limiting IO performance for too long, waiting for a
next call to continue) but I'm not sure if it's still worth it. In our
case we have been bitten by out of space situations with old kernels
brought by over-allocation of free space due to temporary large space
usages so we consider it an additional safeguard.

You probably want to use autodefrag or a custom defragmentation solution
too. We weren't satisfied with autodefrag in some situations (were
clearly fragmentation crept in and IO performance suffered until a
manual defrag) and developed our own scheduler for triggering
defragmentation based on file writes and slow full filesystem scans,
using filefrag to estimate the fragmentation cost file by file.

Best regards,

Lionel

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Questions about BTRFS balance and scrub on non-RAID setup
  2021-08-30 14:18 ` Lionel Bouton
@ 2021-08-31  8:17   ` Andrej Friesen
  2021-08-31 13:06     ` Lionel Bouton
  0 siblings, 1 reply; 5+ messages in thread
From: Andrej Friesen @ 2021-08-31  8:17 UTC (permalink / raw)
  To: Lionel Bouton, linux-btrfs

Hi,

thanks for the useful information Lionel.
That already helped a lot!

Scrub:

> Partially. Ceph replication/scrub/repair will cover individual disk/OSD
> server faults but not faults at the origin of the data being stored.
> 
> We provide the same service for a customer. Several years ago the VM
> hosting the NFS server for this customer ran on hardware that developed
> a fault, the result was silent corruption of the data written by the NFS
> server *before* being handed to Ceph for storage (probably memory or CPU
> related, we threw the server out of the cluster and never looked back...).
> - ceph scrubbing was of no use there because from its point of view the
> replicated blocks were all fine.
> - we launch btrfs scrub monthly by default and this is how we detected
> the corruption.

This is a really good point!
Even though we might not be able to automatically let btrfs repair the 
corrupted files during the scrub it would be nice to know that this 
happened and act accordingly.

> We make regular rbd snapshots so we could :
> - switch the NFS server to an existing read-only replica (that could not
> be corrupted by the same fault as it was replicated using simple
> file-level content synchronization),
> - restart the original NFS server using the last known good snapshot,
> - rsync fresh data from the replica to the original server to catch up,
> - switch back.

We also wanted to do some rbd snapshots to have some kind of disaster 
recovery if something happens. Just in case.

Our idea was also to offer quick file based "backups" to with btrfs 
snapshots. This would help if the file was once created correctly and 
afterwards writes to that file would get corrupt because of hardware 
failures.
But for filesystem corruption reasons we also wanted to keep some rbd 
snapshots, you never know.

Balance:

> Full balance is probably overkill in any situation and can sunk your I/O
> bandwidth. With recent kernels it seems there is less need for
> balancing. We still use an automatic balancing script that tries to
> limit the amount of free space allocated to nearly empty allocation
> groups (by using "usage=50+" filters) and cancels the balance if it is
> too long (to avoid limiting IO performance for too long, waiting for a
> next call to continue) but I'm not sure if it's still worth it. In our
> case we have been bitten by out of space situations with old kernels
> brought by over-allocation of free space due to temporary large space
> usages so we consider it an additional safeguard.

In order to solve the file system full "problem" we wanted to create a 
large block device and use a quota of lets say 80 % of that for the data 
subvolume.
We could also make the block device double the size of the subvolume and 
quota we offer because it is thin provisioned from the ceph side we do 
not lose any storage.
We have tested discard/trim with btrfs and ceph and everything worked 
fine :-)

Is there any metric we could/should measure in order to see if a balance 
would give us some benefit in some way?

Did you only do the balance for the file system full problem?

I saw a recommendation to run this balance daily:

`btrfs balance start -dusage=50 -dlimit=2 -musage=50 -mlimit=4`

Source:
https://github.com/netdata/netdata/issues/3203#issuecomment-356026930

Is that a valid recommendation still today?
If so, why is the FAQ not having such information available?
I am happy to put something in the wiki, if needed.

Defragmentation:

> You probably want to use autodefrag or a custom defragmentation solution
> too. We weren't satisfied with autodefrag in some situations (were
> clearly fragmentation crept in and IO performance suffered until a
> manual defrag) and developed our own scheduler for triggering
> defragmentation based on file writes and slow full filesystem scans,

The ceph cluster only uses SSDs therefore I guess we do not suffer from 
fragmentation problem as with HDDs. As far as I understood SSDs.

-- 
Andrej Friesen

https://www.ajfriesen.com/

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Questions about BTRFS balance and scrub on non-RAID setup
  2021-08-31  8:17   ` Andrej Friesen
@ 2021-08-31 13:06     ` Lionel Bouton
  0 siblings, 0 replies; 5+ messages in thread
From: Lionel Bouton @ 2021-08-31 13:06 UTC (permalink / raw)
  To: Andrej Friesen, linux-btrfs

Hi,

Le 31/08/2021 à 10:17, Andrej Friesen a écrit :
> Hi,
>
> thanks for the useful information Lionel.
> That already helped a lot!

My pleasure.

>
> [...]
> In order to solve the file system full "problem" we wanted to create a
> large block device and use a quota of lets say 80 % of that for the
> data subvolume.
> We could also make the block device double the size of the subvolume
> and quota we offer because it is thin provisioned from the ceph side
> we do not lose any storage.

We've never used quotas with btrfs. There's a long history of
difficulties with them that I didn't follow closely. My impression is
that the situation is improving but IIRC I've seen somewhat recent
messages on this list advising them to be disabled at least temporarily
to get out of trouble (something like them slowing down balances). I'd
advise documenting yourself on these difficulties before using quotas.

We over-provision by creating comparatively large RBD volumes allowing
for future growth, create a smaller filesystem and don't limit storage
space based on Unix users so we don't need quotas. We bill the customer
based on actual space usage, monitor the filesystem space used and
resize it when needed (the goal is avoiding reaching 70% used) keeping
the customer in the loop when doing so to better evaluate their needs
and prevent unwanted surprises at billing time.

> We have tested discard/trim with btrfs and ceph and everything worked
> fine :-)

We don't use it often but never had a problem with it.

>
> Is there any metric we could/should measure in order to see if a
> balance would give us some benefit in some way?

You can have a pretty good idea directly with the "btrfs fi usage" output.

In case you are interested I developed this to keep filesystems from
being filled with mostly empty allocation groups :
https://github.com/jtek/ceph-utils/blob/master/btrfs-auto-rebalance.rb

By default it launches balances with increasing usage= values when it
detects relatively large amounts of allocated free space (relatively is
configurable in the script). It tries to handle some corner cases too
(like almost full filesystems where naive balances fail). Some
complexities could probably be removed as they deal with problems that
don't seem to appear anymore (at least not with regular uses of the script).

You can launch it with "-a" (for analyze) to get a "waste" percent
(amount of free space that is in allocation groups and could
theoretically be freed) for all your btrfs filesystems. It parses btrfs
fi usage and it doesn't need root just for analyzing although you'll get
warnings (and it probably won't work properly on multi-device filesystems).

>
> Did you only do the balance for the file system full problem?
>

Yes initially it was only to avoid this problem and we kept it as a
safeguard.

Doing it regularly seems to help speeding up shrinking filesystems and
limiting the IO load when doing so. When you shrink a filesystem it has
to move allocation groups to the beginning of the block device and I
believe this move is one side-effect of balance. Only allocation groups
matching your balance filters will move but my understanding and limited
experience is that they mostly move towards the beginning of the block
device. Spreading balances over long periods of time in advance is less
impactful than waiting until the last moment to move large amounts of
data when you need to shrink the filesystem.

I guess there might be some minor or even negligible performance
benefits too in some cases (less allocation groups probably means less
time spent on them for some operations) and on HDDs there was an
incentive to keep all allocation groups nice and tidy to avoid large
seek times.

>
> I saw a recommendation to run this balance daily:
>
> `btrfs balance start -dusage=50 -dlimit=2 -musage=50 -mlimit=4`
>
> Source:
> https://github.com/netdata/netdata/issues/3203#issuecomment-356026930
>
> Is that a valid recommendation still today?
> If so, why is the FAQ not having such information available?

This is probably heavily dependent on your use case and why there isn't
a single solution. Balance is usually very IO intensive and can block
other operations for long periods of time. This seems the reason for the
limit filters of the recommendation you cited above. And this is the
reason for the tuning parameters (MAX_TIME and MAX_FS_TIME) I put in the
btrfs-auto-rebalance.rb script (instead of limiting the number of
operations it limits the time spent doing them).
Depending on the device IO behavior we had to lower MAX_FS_TIME to 20
minutes and on some occasions even less to avoid disturbing clients too
much on some systems and could raise it on others with faster storage
devices or less client load.

The problem exists in a more limited way with defragmentation too : it
seems these operations bypass the IO scheduler priorities and will
happily delay other IO (ionice didn't have any measurable effect with
all IO schedulers supporting IO priorities we tried).

> I am happy to put something in the wiki, if needed.
>
>
> Defragmentation:
>
>> You probably want to use autodefrag or a custom defragmentation solution
>> too. We weren't satisfied with autodefrag in some situations (were
>> clearly fragmentation crept in and IO performance suffered until a
>> manual defrag) and developed our own scheduler for triggering
>> defragmentation based on file writes and slow full filesystem scans,
>
>
> The ceph cluster only uses SSDs therefore I guess we do not suffer
> from fragmentation problem as with HDDs. As far as I understood SSDs.
>

Depending on your SSDs and the rest of your cluster hardware they can
still become the limiting factor if you force small IOs on them which
fragmentation will do. In your case, even with SSD Ceph has a minimum
"seek" time (the RTT for an IO request is almost always above 1ms). This
is compounded by the fact that even if the Ceph client sees a delay for
its request, the SSD behind has bandwidth left for other clients (or
even the same client if it is doing multiple IO requests) so the cluster
as a whole is often fine and performance doesn't plummet.
So fragmentation doesn't bite you nearly as hard on SSD and even with
added Ceph latencies I don't expect an SSD-based RBD volume to be
crushed by fragmentation. I would keep an eye on the IO wait times of
the NFS server (this can become worse very slowly so we keep an history
of at least a whole year) and be ready to defragment but I wouldn't
worry too much about it. An alert raised when the average recent wait
times (say over a week) jumps by an order of magnitude compared to the
average over the year is a good start.

AFAIK you can deal with the problem easily if it arises simply by
defragmenting manually the most used files and mount -o
remount,autodefrag <yourfs>. With HDDs the performance can become so bad
that your clients are blocked just when you look at your cluster the
wrong way so correcting fragmentation problems becomes a frustrating
experience. I simply wouldn't use BTRFS on devices with significant seek
times (in the ~10ms range like most HDDs) without some form of
defragmentation solution.

Best regards,

Lionel

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2021-09-01  4:54 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2021-09-01  4:54 Questions about BTRFS balance and scrub on non-RAID setup Duncan
  -- strict thread matches above, loose matches on Subject: below --
2021-08-30 13:20 Andrej Friesen
2021-08-30 14:18 ` Lionel Bouton
2021-08-31  8:17   ` Andrej Friesen
2021-08-31 13:06     ` Lionel Bouton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox