* About free space fragmentation, metadata write amplification and (no)ssd
@ 2017-04-08 20:19 Hans van Kranenburg
2017-04-08 21:55 ` Peter Grandi
` (4 more replies)
0 siblings, 5 replies; 19+ messages in thread
From: Hans van Kranenburg @ 2017-04-08 20:19 UTC (permalink / raw)
To: linux-btrfs
So... today a real life story / btrfs use case example from the trenches
at work...
tl;dr 1) btrfs is awesome, but you have to carefully choose which parts
of it you want to use or avoid 2) improvements can be made, but at least
the problems relevant for this use case are manageable and behaviour is
quite predictable.
This post is way too long, but I hope it's a fun read for a lazy sunday
afternoon. :) Otherwise, skip some sections, they have headers.
...
The example filesystem for this post is one of the backup server
filesystems we have, running btrfs for the data storage.
== About ==
In Q4 2014, we converted all our backup storage from ext4 with rsync and
--link-dest to btrfs, still using rsync, but now with btrfs subvolumes
and snapshots [1]. For every new backup, it creates a writable snapshot
of the previous backup and then uses rsync on the file tree to get
changes from the remote.
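For illustration, a minimal sketch of that per-host cycle (paths, rsync
flags and the find_latest_snapshot() helper are made up for this
example; the real tooling is described in [1]):

import datetime
import subprocess

def find_latest_snapshot(host, base):
    # Hypothetical helper: path of the most recent existing snapshot
    # for this host.
    raise NotImplementedError

def backup(host, base='/backups'):
    today = datetime.date.today().isoformat()
    previous = find_latest_snapshot(host, base)
    target = '{0}/{1}/{2}'.format(base, host, today)
    # Writable snapshot of the previous backup of this host...
    subprocess.run(['btrfs', 'subvolume', 'snapshot', previous, target],
                   check=True)
    # ...then rsync only the changes from the remote into it.
    subprocess.run(['rsync', '-a', '--delete', '--whole-file',
                    '{0}:/'.format(host), target + '/'], check=True)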
Currently there's ~35TiB of data present on the example filesystem, with
a total of just a bit more than 90000 subvolumes, in groups of 32
snapshots per remote host (daily for 14 days, weekly for 3 months,
monthly for a year), so that's about 2800 'groups' of them. Inside are
millions and millions and millions of files.
And the best part is... it just works. Well, almost, given the title of
the post. But the effort needed for creating all backups and doing
subvolume removal for expiries scales linearly with the number of them.
== Hardware and filesystem setup ==
The actual disk storage is done using NetApp storage equipment, in this
case a FAS2552 with 1.2T SAS disks and some extra disk shelves. Storage
is exported over multipath iSCSI over ethernet, and then grouped
together again with multipathd and LVM, striping (like, RAID0) over
active/active controllers. We've been using this setup for years now in
different places, and it works really well. So, using this, we keep the
whole RAID / multiple disks / hardware disk failure part outside the
reach of btrfs. And yes, checksums are done twice, but who cares. ;]
Since the maximum iSCSI lun size is 16TiB, the maximum block device size
that we use by combining two is 32TiB. This filesystem is already
bigger, so at some point we added two new luns in a new LVM volume
group, and added the result to the btrfs filesystem (yay!):
Total devices 2 FS bytes used 35.10TiB
devid 1 size 29.99TiB used 29.10TiB path /dev/xvdb
devid 2 size 12.00TiB used 11.29TiB path /dev/xvdc
Data, single: total=39.50TiB, used=34.67TiB
System, DUP: total=40.00MiB, used=6.22MiB
Metadata, DUP: total=454.50GiB, used=437.36GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
Yes, DUP metadata, more about that later...
I can also umount the filesystem for a short time, take a snapshot on
NetApp level from the luns, clone them and then have a writable clone of
a 40TiB btrfs filesystem, to be able to do crazy things and tests before
really doing changes, like a kernel version change or converting to
the free space tree etc.
From end 2014 to September 2016, we used the 3.16 LTS kernel from Debian
Jessie. Since September 2016, it's 4.7.5, after torturing it for two
weeks on such a clone, replaying the daily workload on it.
== What's not so great... Allocated but unused space... ==
From the beginning, the filesystem showed a tendency to accumulate
allocated but unused space that didn't get reused again by new writes.
In the last months of using kernel 3.16 the situation worsened, ending
up with about 30% allocated but unused space (11TiB...), while the
filesystem kept allocating new space all the time instead of reusing it:
https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-backups-16-Q23.png
Using balance with the 3.16 kernel and space cache v1 to fight this was
nearly impossible because of the amount of scattered metadata writes
plus amplification (a 1:40 overall read/write ratio during balance) and
the space cache information being written over and over again on every
commit.
When making the switch to the 4.7 kernel I also switched to the free
space tree, eliminating the space cache flush problems and did a
mega-balance operation which brought it back down quite a bit.
Here's what it looked like for the last 6 months:
https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-backups-16-Q4-17-Q1.png
This is not too bad, but also not good enough. I want my picture to
become brighter white than this:
https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-03-14-backups-heatmap-chunks.png
The picture shows that the unused space is scattered all around the
whole filesystem.
So about a month ago, I continued searching the kernel code for the
cause of this behaviour. This is a fun, but time-consuming and often
mind-boggling activity, because you run into 10 different interesting
things at the same time and want to start finding out about all of them
at once. :D
The first two things I found out about were:
1) the 'free space cluster' code, which is responsible for finding empty
space that new writes can go into, sometimes by combining several free
space fragments that are close to each other.
2) the bool fragmented, which causes a block group to get blacklisted
for any more writes because finding free space for a write did not
succeed easily enough.
I haven't been able to find a concise description of how all of it is
actually supposed to work, so I ended up reverse engineering it from
code, comments and git history.
And, in practice the feeling was that btrfs doesn't really try that
hard, and quickly gives up and just starts allocating new chunks for
everything. So, maybe it was just listing all my block groups as
fragmented and ignoring them?
== Balance based on free space fragmentation level ==
Now, free space being fragmented when you have a high-churn-rate
snapshot create and expire workload is not a surprise... Also, when data
is added there is no way to predict if, and when, it will ever be
unreferenced from the snapshots again, which means I really don't care
where it ends up on disk.
But how fragmented is the free space, and how can we measure it?
Three weeks ago I made up a free space 'scoring' algorithm, revised it a
few times, and now I'm using it to feed block groups with bad free space
fragmentation to balance, to clean up the filesystem a bit. But that is
a fun story for a separate post. In short, take the log2() of the size
of a free space extent, and then punish it the hardest if it ends up in
the middle between log2(sectorsize) and log2(block_group.length), and
less if it's smaller or bigger.
It's still 'mopping with the tap open', like we say in the Netherlands.
But it's already much better than usage-based balance. If a block group
is 50% used and has 512 alternating 1MiB filled and free segments, I
want to get rid of it, but if it's 512MiB of data followed by 512MiB of
empty space, it can stay.
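To make that a bit more concrete, here is a rough sketch of such a
scoring function (an illustration of the idea, not the exact revision
used on this filesystem); the 512 alternating 1MiB holes from the
example above score much worse than one contiguous 512MiB hole:

import math

def free_space_score(free_extents, sectorsize, block_group_length):
    # free_extents: sizes (in bytes) of the free space extents in one
    # block group.
    lo = math.log2(sectorsize)          # e.g. log2(4096)  = 12
    hi = math.log2(block_group_length)  # e.g. log2(1 GiB) = 30
    mid = (lo + hi) / 2
    score = 0.0
    for size in free_extents:
        x = math.log2(size)
        # Worst when a free fragment sits right in the middle between
        # the sector size and the block group size; better when it is
        # either tiny or almost as big as the whole block group.
        score += 1.0 - abs(x - mid) / (mid - lo)
    return score

# 512 alternating 1MiB free fragments: high score -> feed it to balance.
fragmented = free_space_score([2**20] * 512, 4096, 2**30)
# One contiguous 512MiB free area: low score -> leave it alone.
contiguous = free_space_score([512 * 2**20], 4096, 2**30)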
== But... -o remount,nossd ==
About two weeks ago, I ran into this code, from extent-tree.c:
bool ssd = btrfs_test_opt(fs_info, SSD);
*empty_cluster = 0;
[...]
if (ssd)
        *empty_cluster = SZ_2M;
if (space_info->flags & BTRFS_BLOCK_GROUP_METADATA) {
        ret = &fs_info->meta_alloc_cluster;
        if (!ssd)
                *empty_cluster = SZ_64K;
} else if ((space_info->flags & BTRFS_BLOCK_GROUP_DATA) && ssd) {
        ret = &fs_info->data_alloc_cluster;
}
[...]
[...]
Wait, what? If I mount -o ssd, every small write will turn into at least
finding 2MiB for a write? What is this magic number?
Since the rotational flag in /sys is set to 0 for this filesystem, which
does not at all mean it's an ssd by the way, it mounts with the ssd
option by default. Since the lower layer of storage is iSCSI on NetApp,
it does not make any sense at all for btrfs to make assumptions about
what goes where or how optimal it is, as everything will be reorganized
anyway.
These two if statements are pretty much all the ssd option does. There's
one other if, in tree-log.c, but... that's it, folks.
The number of lines of administrative code handling the mount option
itself far outnumbers the number of lines where the option is actually
used. :D
As the careful reader can see, the minimum amount of space used for
metadata writes also gets changed...
After playing around with -o nossd in a few other places, I finally did
it on this filesystem, first by a complete umount and mount, and then,
something magical happened:
https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-btrfs-nossd-whoa.gif
(timelapse of daily btrfs-heatmap --sort virtual)
After two weeks of creating new backups and feeding fragmented block
groups to balance, 25% of the filesystem consists of chunks that are
100% filled up. (:
== But! The Meta Mummy returns! ==
After changing to nossd, another thing happened. The expiry process,
which normally takes about 1.5 hours to remove ~2500 subvolumes (keeping
up to 100 orphans queued all the time), suddenly took the entire rest of
the day, not being done before the nightly backups had to start again at
10PM...
And the only thing it seemed to do was write, write, write at 100MB/s
all day long. To see what it was doing I put some code together in
show_orphan_cleaner_progress.py:
https://github.com/knorrie/python-btrfs/commit/dd34044adf24f7febf6f6992f11966c9094c058b
The output showed it was just doing the normal expiry, but really really
slow. When changing back to -o ssd, it's back at normal speed.
Since the only thing that seems to change is a minimum of 64KiB instead
of 2MiB for metadata writes, I suspect the result of doing smaller
writes is an avalanche of write amplification, especially in the extent
tree. Since more small spots are filled, more extent tree pages get
cowed, which causes metadata writes, which need free space, which causes
changes in the extent tree, which causes more pages to be cowed, which
need free space, which causes changes in the extent tree, which...
Warning: do NOT click if you have epilepsy!
http://31.media.tumblr.com/3c316665d64ecd625eb3b6bc160f08fd/tumblr_mo73kigx0t1s92vobo1_250.gif
Wheeeeeeeeeeee!
<to be continued>
== So, what do we want? ssd? nossd? ==
Well, neither of them does it for me. I want my expensive NetApp disk
space to be filled up, without requiring me to clean up after it all the
time using painful balance actions, and I want to quickly get rid of old
snapshots.
So currently, there are two mount -o remount statements before and after
doing the expiries...
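In other words, something along these lines (just a sketch; the
mountpoint is made up, and the actual expiry work is passed in as a
callable because it's not part of this sketch):

import subprocess

def remount(mountpoint, option):
    subprocess.run(['mount', '-o', 'remount,' + option, mountpoint],
                   check=True)

def expire_with_ssd(mountpoint, run_expiries):
    # nossd keeps the filesystem nicely packed during backups and
    # balance, but makes subvolume cleaning explode, so flip to ssd just
    # for the expiries and then flip back.
    remount(mountpoint, 'ssd')
    try:
        run_expiries()
    finally:
        remount(mountpoint, 'nossd')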
== Ok, one more picture ==
Here's a picture of disk read/write throughput of yesterday:
https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-backups-diskstats_throughput-day.png
* The balance part is me feeding fragmented block groups to balance. And
yes, rewriting 1GiB of data requires writing about 40GiB of metadata! :(
* Backup 1 and 2 are the backups, rsync limited at 16MB/s incoming
remote network traffic, which ends up as 50MB/s writes including
metadata changes. :(
* Expire, which today took 2.5 hours, removing 4240 subvolumes (+14 days
and a lot +3 months)
While snapshot removal totally explodes with nossd, there seems to be
little impact on backups and balance... :?
== Work to do ==
The next big change on this system will be to move from the 4.7 kernel
to the 4.9 LTS kernel and Debian Stretch.
Note that our metadata is still DUP, and it doesn't have skinny extent
tree metadata yet. It was originally created with btrfs-progs 3.17, and
when we realized we should have used single, it was too late. I want to
change that and see if I can convert on a NetApp clone. This should
reduce extent tree metadata size by maybe more than 60%, and who knows
what will happen to the abhorrent write traffic.
This conversion can run on the clone, after removing as many subvolumes
as possible with the least amount of data going away.
Before switching over to the clone as live backup server, all missing
snapshots can be rsynced over from the live backup server.
== So ==
Thanks for reading. Now, feel free to ask me anything... :D ...or on IRC
of course.
Moo,
[1] http://tech.mendix.com/linux/2015/02/12/btrfs-dirvish/
--
Hans van Kranenburg,
Production Engineer at Mendix
* Re: About free space fragmentation, metadata write amplification and (no)ssd
2017-04-08 20:19 About free space fragmentation, metadata write amplification and (no)ssd Hans van Kranenburg
@ 2017-04-08 21:55 ` Peter Grandi
2017-04-09 0:21 ` Hans van Kranenburg
2017-04-09 6:38 ` Paul Jones
` (3 subsequent siblings)
4 siblings, 1 reply; 19+ messages in thread
From: Peter Grandi @ 2017-04-08 21:55 UTC (permalink / raw)
To: Linux fs Btrfs
> [ ... ] This post is way too long [ ... ]
Many thanks for your report, it is really useful, especially the
details.
> [ ... ] using rsync with --link-dest to btrfs while still
> using rsync, but with btrfs subvolumes and snapshots [1]. [
> ... ] Currently there's ~35TiB of data present on the example
> filesystem, with a total of just a bit more than 90000
> subvolumes, in groups of 32 snapshots per remote host (daily
> for 14 days, weekly for 3 months, montly for a year), so
> that's about 2800 'groups' of them. Inside are millions and
> millions and millions of files. And the best part is... it
> just works. [ ... ]
That kind of arrangement, with a single large pool and very many files
and many subdirectories, is a worst-case scenario for any filesystem
type, so it is amazing-ish that it works well so far, especially with
90,000 subvolumes. As I mentioned elsewhere, I would rather do a
rotation of smaller volumes, to reduce risk, like "Duncan" also on this
mailing list likes to do (perhaps to the opposite extreme).
As to the 'ssd'/'nossd' issue, that is as described in 'man 5 btrfs'
(and I wonder whether 'ssd_spread' was tried too), but it is not at all
obvious it should impact metadata handling so much. I'll add a new item
to the "gotcha" list.
It is sad that 'ssd' is used by default in your case, and it is quite
perplexing that the "wandering trees" problem (that is, "write
amplification") is so large with 64KiB write clusters for metadata (and
the 'dup' profile for metadata).
* Probably the metadata and data cluster sizes should be creation or
mount parameters instead of being implicit in the 'ssd' option.
* A cluster size of 2MiB for metadata and/or data presumably has some
downsides, otherwise it would be the default. I wonder whether the
downsides are related to barriers...
* Re: About free space fragmentation, metadata write amplification and (no)ssd
2017-04-08 21:55 ` Peter Grandi
@ 2017-04-09 0:21 ` Hans van Kranenburg
2017-04-09 0:39 ` Hans van Kranenburg
2017-04-09 3:14 ` Kai Krakow
0 siblings, 2 replies; 19+ messages in thread
From: Hans van Kranenburg @ 2017-04-09 0:21 UTC (permalink / raw)
To: Peter Grandi, Linux fs Btrfs
On 04/08/2017 11:55 PM, Peter Grandi wrote:
>> [ ... ] This post is way too long [ ... ]
>
> Many thanks for your report, it is really useful, especially the
> details.
Thanks!
>> [ ... ] using rsync with --link-dest to btrfs while still
>> using rsync, but with btrfs subvolumes and snapshots [1]. [
>> ... ] Currently there's ~35TiB of data present on the example
>> filesystem, with a total of just a bit more than 90000
>> subvolumes, in groups of 32 snapshots per remote host (daily
>> for 14 days, weekly for 3 months, montly for a year), so
>> that's about 2800 'groups' of them. Inside are millions and
>> millions and millions of files. And the best part is... it
>> just works. [ ... ]
>
> That kind of arrangement, with a single large pool and very many
> many files and many subdirectories is a worst case scanario for
> any filesystem type, so it is amazing-ish that it works well so
> far, especially with 90,000 subvolumes.
Yes, this is one of the reasons for this post. Instead of only hearing
about problems all day on the mailing list and IRC, we need some more
reports of success.
The fundamental functionality of doing the cow snapshots, moo, and the
related subvolume removal on filesystem trees is so awesome. I have no
idea how we would have been able to continue this type of backup system
if btrfs had not been available. Hardlinks and rm -rf were a total dead
end road.
The growth has been slow but steady (oops, fast and steady, I
immediately got corrected by our sales department), but anyway, steady.
This makes it possible to just let it do its thing every day and spot
small changes in behaviour over time, detect patterns that could be a
ticking time bomb and then deal with them in a way that allows conscious
decisions, well-tested changes and continuous measurements of the result.
But, ok, it's surely not for the faint of heart, and the devil is in the
details. If it breaks, you keep the pieces. Using the NetApp hardware is
one of the relevant decisions made here. The shameful state of the most
basic case of recovering (or not being able to recover) from a failure
in a two-disk btrfs RAID1 is enough of a sign that the whole multi-disk
handling is a nice idea, but hasn't yet gotten the amount of attention
it deserves to be something I can rely on. Having the data safe in my
NetApp filer gives me the opportunity to take regular (like, monthly)
snapshots of the complete thing, so that I have something to go back to
if disaster strikes in Linux land. Yes, it's a bit inconvenient because
I have to umount for a few minutes in a silent moment of the week, but
it's worth the effort, since I can keep the eggs in a shadow basket.
OTOH, what we do with btrfs (taking a bulldozer and driving across all
the boundaries of sanity according to all recommendations and warnings)
on this scale of individual remotes is something that the NetApp people
should totally be jealous of. Backup management (manual create, restore
etc. on top of the nightlies) is self-service functionality for our
customers, and being able to implement the magic behind the APIs with
just a few commands like a btrfs sub snap and some rsync gives the right
amount of freedom and flexibility we need.
And, monitoring of trends is so. super. important. It's not a secret
that when I work with technology, I want to see what's going on in
there, crack the black box open and try to understand why the lights are
blinking in a specific pattern. What does this balance -dusage=75 mean?
Why does it know what's 75% full and I don't? Where does it get that
information from? The open source kernel code and the IOCTL API are a
source of many hours of happy hacking, because they allow all of this to
be done.
> As I mentioned elsewhere
> I would rather do a rotation of smaller volumes, to reduce risk,
> like "Duncan" also on this mailing list likes to do (perhaps to
> the opposite extreme).
Well, as seen in my 'keeps allocating new chunks for no apparent
reason' thread... even small filesystems can have really weird problems. :)
> As to the 'ssd'/'nossd' issue that is as described in 'man 5
> btrfs' (and I wonder whether 'ssd_spread' was tried too) but it
> is not at all obvious it should impact so much metadata
> handling. I'll add a new item in the "gotcha" list.
I suspect that the -o ssd behaviour is a decent source of the "help! my
filesystem is full but df says it's not" problems we see about every
week. But I can't prove that yet. Apart from the fact that this was the
very same problem btrfs greeted me with when I tried it out for the
first time a few years ago (and it still is one of the first problems
people who start using btrfs encounter), I haven't spent time debugging
the behaviour when running fully allocated.
OTOH the two-step allocation process is also a nice thing, because I
*know* when I still have unallocated space available, which makes for
example the free space fragmentation debugging process much more bearable.
> It is sad that 'ssd' is used by default in your case, and it is
> quite perplexing that tghe "wandering trees" problem (that is
> "write amplification") is so large with 64KiB write clusters for
> metadata (and 'dup' profile for metadata).
Worst case, 32 x 64KiB fits into 1x 2MiB. That is a bit of a bogus
argument, but take the extent tree changes (the number of leaves/nodes
it causes to change), including all the wandering and shifting around of
items if they don't fit, and then the recursive updating, and it
apparently already makes enough of a difference to cause an entire day
of writing metadata at Gigabit/s speed.
Notice that everyone who has rotational 0 in /sys is experiencing this
behaviour right now, when removing snapshots... and then they end up on
IRC complaining to us their computer is totally unusable for hours when
they remove some snapshots...
> * Probably the metadata and data cluster sizes should be create
> or mount parameters instead of being implicit in the 'ssd'
> option.
> * A cluster size of 2MiB for metadata and/or data presumably
> has some downsides, otrherwise it would be the default. I
> wonder whether the downsides related to barriers...
I don't know... yet. What I know is that adding options to tune things
will lead to users not setting them, or setting them to the wrong value.
It's a bit like having btrfs-zero-log, or --init-extent-tree. It just
doesn't work out in harsh reality.
--
Hans van Kranenburg
* Re: About free space fragmentation, metadata write amplification and (no)ssd
2017-04-09 0:21 ` Hans van Kranenburg
@ 2017-04-09 0:39 ` Hans van Kranenburg
2017-04-09 3:14 ` Kai Krakow
1 sibling, 0 replies; 19+ messages in thread
From: Hans van Kranenburg @ 2017-04-09 0:39 UTC (permalink / raw)
To: Peter Grandi, Linux fs Btrfs
On 04/09/2017 02:21 AM, Hans van Kranenburg wrote:
> [...]
> Notice that everyone who has rotational 0 in /sys is experiencing this
> behaviour right now, when removing snapshots... [...]
Eh, 1
--
Hans van Kranenburg
* Re: About free space fragmentation, metadata write amplification and (no)ssd
2017-04-09 0:21 ` Hans van Kranenburg
2017-04-09 0:39 ` Hans van Kranenburg
@ 2017-04-09 3:14 ` Kai Krakow
2017-04-09 20:48 ` Hans van Kranenburg
1 sibling, 1 reply; 19+ messages in thread
From: Kai Krakow @ 2017-04-09 3:14 UTC (permalink / raw)
To: linux-btrfs
On Sun, 9 Apr 2017 02:21:19 +0200, Hans van Kranenburg
<hans.van.kranenburg@mendix.com> wrote:
> On 04/08/2017 11:55 PM, Peter Grandi wrote:
> >> [ ... ] This post is way too long [ ... ]
> >
> > Many thanks for your report, it is really useful, especially the
> > details.
>
> Thanks!
>
> >> [ ... ] using rsync with --link-dest to btrfs while still
> >> using rsync, but with btrfs subvolumes and snapshots [1]. [
> >> ... ] Currently there's ~35TiB of data present on the example
> >> filesystem, with a total of just a bit more than 90000
> >> subvolumes, in groups of 32 snapshots per remote host (daily
> >> for 14 days, weekly for 3 months, montly for a year), so
> >> that's about 2800 'groups' of them. Inside are millions and
> >> millions and millions of files. And the best part is... it
> >> just works. [ ... ]
> >
> > That kind of arrangement, with a single large pool and very many
> > many files and many subdirectories is a worst case scanario for
> > any filesystem type, so it is amazing-ish that it works well so
> > far, especially with 90,000 subvolumes.
>
> Yes, this is one of the reasons for this post. Instead of only hearing
> about problems all day on the mailing list and IRC, we need some more
> reports of success.
>
> The fundamental functionality of doing the cow snapshots, moo, and the
> related subvolume removal on filesystem trees is so awesome. I have no
> idea how we would have been able to continue this type of backup
> system when btrfs was not available. Hardlinks and rm -rf was a total
> dead end road.
I'm absolutely no expert with arrays of the sizes you use, but I also
stopped using the hardlink-and-remove approach: It was slow to manage
(rsync is slow for it, rm is slow for it) and it was error-prone
(due to the nature of hardlinks). I used btrfs with snapshots and rsync
for a while in my personal testbed, and experienced great slowness over
time: rsync started to become slower and slower, a full backup took 4
hours with huge %IO usage, maintaining the backup history was also slow
(removing backups took a while), and rebalancing was needed due to huge
wasted space. I used rsync with --inplace and --no-whole-file to waste
as little space as possible.
What I first found was an adaptive rebalancer script which I still use
for the main filesystem:
https://www.spinics.net/lists/linux-btrfs/msg52076.html
(thanks to Lionel)
It works pretty well and doesn't have such big IO overhead, thanks to
the adaptive multi-pass approach.
But it still did not help the slowness. I have now been testing
borgbackup for a while, and it's fast: It does the same job in 30
minutes or less instead of 4 hours, it has much better backup density,
and it comes with easy history maintenance, too. I can now store much
more backup history in the same space. Full restore time is about the
same as copying back with rsync.
For a professional deployment I'm planning to use XFS as the storage
backend and borgbackup as the backup frontend, because my findings
showed that XFS allocation groups span diagonally across the disk
array. That is, if you used a simple JBOD of your iSCSI LUNs, XFS would
spread writes across all the LUNs without you needing to do normal RAID
striping, which should eliminate the need to migrate when adding more
LUNs, and the underlying storage layer on the NetApp side will probably
already do RAID for redundancy anyway. Just feed more space to XFS
using LVM.
Borgbackup can do everything that btrfs can do for you but is
targeting the job of doing backups only: It can compress, deduplicate,
encrypt and do history thinning. The only downside I found is that only
one backup job at a time can access the backup repository. So you'd
have to use one backup repo per source machine. That way you cannot
benefit from deduplication across multiple sources. But I'm sure NetApp
can do that. OTOH, maybe backup duration drops to a point where you
could serialize the backups of some machines.
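For illustration (repository path, archive name and exact retention
numbers are made up here, loosely following the scheme described
earlier in this thread), creating an archive and thinning the history
with borg looks roughly like this:

borg create /backups/borg/somehost::2017-04-09 /srv/somehost
borg prune --keep-daily 14 --keep-weekly 13 --keep-monthly 12 \
    /backups/borg/somehost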
> OTOH, what we do with btrfs (taking a bulldozer and drive across all
> the boundaries of sanity according to all recommendations and
> warnings) on this scale of individual remotes is something that the
> NetApp people should totally be jealous of. Backups management
> (manual create, restore etc on top of the nightlies) is self service
> functionality for our customers, and being able to implement the
> magic behind the APIs with just a few commands like a btrfs sub snap
> and some rsync gives the right amount of freedom and flexibility we
> need.
This is something I'm planning here, too: Self-service backups, do a
btrfs snap, but then use borgbackup for archiving purposes.
BTW: I think the 2M size comes from the assumption that SSDs manage
their storage in groups of erase-block size. The optimization here
would be that btrfs deallocates (and maybe trims) only whole erase
blocks, which are typically 2M. This has a performance benefit. But if
your underlying storage layer is RAID anyway, this no longer maps
correctly. So giving "nossd" here would probably be the better decision
right from the start. Or at least you should be able to tell the mount
option the number of stripes your RAID uses so it would align properly
again. XFS has such tuning options, which are usually auto-detected if
the storage driver correctly passes that info through.
Btrfs still has a lot of open opportunities here, but currently it's in
the stabilization phase (while still adding new features). I guess it
will probably take a long time until it is tuned for optimal
performance. Too long to deploy scenarios such as yours today - at
least for production usage. Just my two cents.
--
Regards,
Kai
Replies to list-only preferred.
* RE: About free space fragmentation, metadata write amplification and (no)ssd
2017-04-08 20:19 About free space fragmentation, metadata write amplification and (no)ssd Hans van Kranenburg
2017-04-08 21:55 ` Peter Grandi
@ 2017-04-09 6:38 ` Paul Jones
2017-04-09 8:43 ` Roman Mamedov
2017-04-09 18:10 ` Chris Murphy
` (2 subsequent siblings)
4 siblings, 1 reply; 19+ messages in thread
From: Paul Jones @ 2017-04-09 6:38 UTC (permalink / raw)
To: Hans van Kranenburg, linux-btrfs
-----Original Message-----
From: linux-btrfs-owner@vger.kernel.org [mailto:linux-btrfs-owner@vger.kernel.org] On Behalf Of Hans van Kranenburg
Sent: Sunday, 9 April 2017 6:19 AM
To: linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: About free space fragmentation, metadata write amplification and (no)ssd
> So... today a real life story / btrfs use case example from the trenches at work...
Snip!!
Great read. I do the same thing for backups on a much smaller scale and it works brilliantly. Two 4T drives in btrfs raid1.
I will mention that I recently set up caching using LLVM (1 x 300G ssd for each 4T drive), and it's extraordinary how much of a difference it makes. Especially when running deduplication. If it's feasible, perhaps you could try it with an NVMe drive.
Paul.
* Re: About free space fragmentation, metadata write amplification and (no)ssd
2017-04-09 6:38 ` Paul Jones
@ 2017-04-09 8:43 ` Roman Mamedov
0 siblings, 0 replies; 19+ messages in thread
From: Roman Mamedov @ 2017-04-09 8:43 UTC (permalink / raw)
To: Paul Jones; +Cc: Hans van Kranenburg, linux-btrfs
On Sun, 9 Apr 2017 06:38:54 +0000
Paul Jones <paul@pauljones.id.au> wrote:
> -----Original Message-----
> From: linux-btrfs-owner@vger.kernel.org [mailto:linux-btrfs-owner@vger.kernel.org] On Behalf Of Hans van Kranenburg
> Sent: Sunday, 9 April 2017 6:19 AM
> To: linux-btrfs <linux-btrfs@vger.kernel.org>
> Subject: About free space fragmentation, metadata write amplification and (no)ssd
>
> > So... today a real life story / btrfs use case example from the trenches at work...
>
> Snip!!
>
> Great read. I do the same thing for backups on a much smaller scale and it works brilliantly. Two 4T drives in btrfs raid1.
> I will mention that I recently setup caching using LLVM (1 x 300G ssd for each 4T drive), and it's extraordinary how much of a difference it makes. Especially when running deduplication. If it's feasible perhaps you could try it with a nvme drive.
You mean LVM, not LLVM :)
I was actually going to suggest that as well, in my case I use a 32GB SSD
cache for my entire 14TB filesystem with 15 GB metadata (*2 in DUP). In fact
you should check the metadata size on yours, most likely you can get by with
an order of magnitude smaller cache for exactly the same benefit (and have the
rest of 2x300GB for other interesting uses).
And yeah, it's amazing, especially when deleting old snapshots or doing
backups. In my case I back up the entire root FS from about 30 hosts,
and keep that in periodic snapshots for a month. Previously I would also
stagger rsync runs so that no more than 4 or 5 hosts got backed up at
the same time (and still there would be tons of thrashing in seeks and
iowait); now it's no problem whatsoever.
The only issue that I have with this setup is you need to "cleanly close" the
cached LVM device on shutdown/reboot, and apparently there is no init script
in Debian that would do that (experimenting with adding some hacks, but no
success yet). So on every boot the entire cache is marked dirty and data is
being copied from cache to the actual storage, which takes some time, since
this appears to be done in a random IO pattern.
--
With respect,
Roman
* Re: About free space fragmentation, metadata write amplification and (no)ssd
2017-04-08 20:19 About free space fragmentation, metadata write amplification and (no)ssd Hans van Kranenburg
2017-04-08 21:55 ` Peter Grandi
2017-04-09 6:38 ` Paul Jones
@ 2017-04-09 18:10 ` Chris Murphy
2017-04-09 20:15 ` Hans van Kranenburg
2017-04-10 12:23 ` Austin S. Hemmelgarn
2017-05-28 0:59 ` Hans van Kranenburg
4 siblings, 1 reply; 19+ messages in thread
From: Chris Murphy @ 2017-04-09 18:10 UTC (permalink / raw)
To: Hans van Kranenburg; +Cc: linux-btrfs
On Sat, Apr 8, 2017 at 2:19 PM, Hans van Kranenburg
<hans.van.kranenburg@mendix.com> wrote:
> After changing to nossd, another thing happened. The expiry process,
> which normally takes about 1.5 hour to remove ~2500 subvolumes (keeping
> it queued up to a 100 orphans all the time), suddenly took the entire
> rest of the day, not being done before the nightly backups had to start
> again at 10PM...
Is this 'btrfs sub del' with 100 subvolumes listed? What happens if
the delete command is issued with all 2500 at once? Deleting snapshots
is definitely expensive, and deleting them one at a time is more
expensive in total time than deleting them in one whack. But I've
never deleted 100 or more at once.
--
Chris Murphy
* Re: About free space fragmentation, metadata write amplification and (no)ssd
2017-04-09 18:10 ` Chris Murphy
@ 2017-04-09 20:15 ` Hans van Kranenburg
0 siblings, 0 replies; 19+ messages in thread
From: Hans van Kranenburg @ 2017-04-09 20:15 UTC (permalink / raw)
To: Chris Murphy; +Cc: linux-btrfs
On 04/09/2017 08:10 PM, Chris Murphy wrote:
> On Sat, Apr 8, 2017 at 2:19 PM, Hans van Kranenburg
> <hans.van.kranenburg@mendix.com> wrote:
>> After changing to nossd, another thing happened. The expiry process,
>> which normally takes about 1.5 hour to remove ~2500 subvolumes (keeping
>> it queued up to a 100 orphans all the time), suddenly took the entire
>> rest of the day, not being done before the nightly backups had to start
>> again at 10PM...
>
> Is this 'btrfs sub del' with 100 subvolumes listed? What happens if
> the delete command is issued with all 2500 at once? Deleting snapshots
> is definitely expensive, and deleting them one at a time is more
> expensive in total time than deleting them in one whack. But I've
> never deleted 100 or more at once.
It doesn't really matter how many, because it still cleans only one at a
time, in the order that they were submitted.
(And this also means that if you delete 1000 snapshots of the same huge
subvolume, it will do all the inefficient backref walking 1000 times etc.)
Doing a subvolume delete (or multiple) on the command line will only
append them to this list, besides removing some tree items so that it's
no longer visible as a normal subvolume. The list of subvolume ids
queued for cleaning can be found in tree 1 with keys of
(ORPHAN_OBJECTID, ORPHAN_ITEM_KEY, <subvolid>).
The 100 is a bit of an arbitrarily chosen number that makes sure it'll
be working at full speed all the time, and also gives me a somewhat
acceptable time to wait for it to finish when interrupting it.
Here's some more and a snippet of example code (at the end of the commit
message) which looks almost 100% like what I have in my backup expiry code:
https://github.com/knorrie/python-btrfs/commit/9d697ba7d4782afbb070bf057aa4ff3e3aa51be0
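For context, a bare-bones sketch of that throttling loop
(count_orphans() is a hypothetical stand-in for counting the
(ORPHAN_OBJECTID, ORPHAN_ITEM_KEY, <subvolid>) items in tree 1, for
example with a python-btrfs tree search like in the commit linked
above):

import subprocess
import time

MAX_ORPHANS = 100

def count_orphans(mountpoint):
    # Hypothetical helper: count the (ORPHAN_OBJECTID, ORPHAN_ITEM_KEY,
    # <subvolid>) items in tree 1, e.g. via a python-btrfs tree search.
    raise NotImplementedError

def expire(mountpoint, expired_paths):
    for path in expired_paths:
        # Keep the cleaner fed, but never queue more than MAX_ORPHANS
        # at once, so interrupting the expiry run doesn't take forever
        # to settle.
        while count_orphans(mountpoint) >= MAX_ORPHANS:
            time.sleep(10)
        subprocess.run(['btrfs', 'subvolume', 'delete', path], check=True)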
--
Hans van Kranenburg
* Re: About free space fragmentation, metadata write amplification and (no)ssd
2017-04-09 3:14 ` Kai Krakow
@ 2017-04-09 20:48 ` Hans van Kranenburg
0 siblings, 0 replies; 19+ messages in thread
From: Hans van Kranenburg @ 2017-04-09 20:48 UTC (permalink / raw)
To: Kai Krakow, linux-btrfs
On 04/09/2017 05:14 AM, Kai Krakow wrote:
> On Sun, 9 Apr 2017 02:21:19 +0200, Hans van Kranenburg
> <hans.van.kranenburg@mendix.com> wrote:
>
>> [...]
>> The fundamental functionality of doing the cow snapshots, moo, and the
>> related subvolume removal on filesystem trees is so awesome. I have no
>> idea how we would have been able to continue this type of backup
>> system when btrfs was not available. Hardlinks and rm -rf was a total
>> dead end road.
>
> I'm absolutely no expert with arrays of sizes that you use but I also
> stopped using the hardlink-and-remove approach: It was slow to manage
> (rsync works slow for it, rm works slow for it) and it was error-probe
> (due to the nature of hardlinks). I used btrfs with snapshots and rsync
> for a while in my personal testbed, and experienced great slowness over
> time: rsync started to become slower and slower, full backup took 4
> hours with huge %IO usage, maintaining the backup history was also slow
> (removing backups took a while), rebalancing was needed due to huge
> wasted space.
Did you debug why it was slow?
> I used rsync with --inplace and --no-whole-file to waste
> as few space as possible.
Most of the files on the remotes I back up do not change. There can be
new extra files, or files can be removed, but they don't change in place.
Files that change a lot, together with --inplace, cause extra reflinking
of data, which at first seems to save space, but also makes backref
walking slower and causes more fragmentation and more places in the
trees that need to change when doing balance and subvolume delete. So
that might have been one of the reasons for the increasing slowness.
And, of course, always make sure *everything* runs with noatime on the
remotes, or you'll be unnecessarily thrashing all metadata all the time.
Aaaaand, if you get a new 128MiB extent with shiny new data on day 1,
and then the remote changes 75% of it before doing backup of day 2, then
25% of the file as seen in day 2 backup might reflink to parts of the
old 128MiB extent of day 1. But, if you expire backup of day 1, that
128MiB extent just stays there, with 75% of it still keeping disk space
occupied, but not reachable from any file on your filesystem! And
balance doesn't fix that.
I have an explicit --whole-file in the rsync command because of this.
Some of the remotes do actually have changing files, which are
postgresql dumps in sqlformat compressed with gzip --rsyncable. Rsync
can combine the data of the day before and new fragments from the
remote, but I'd rather have it write out 1 new complete file again.
> What I first found was an adaptive rebalancer script which I still use
> for the main filesystem:
>
> https://www.spinics.net/lists/linux-btrfs/msg52076.html
> (thanks to Lionel)
>
> It works pretty well and has no such big IO overhead due to the
> adaptive multi-pass approach.
It looks like it sets a target amount of unallocated space it wants to
have, and then starts doing balance with dusage=0, 1, 2, 3, etc. until
that target is reached. That's a nice way, yes.
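For reference, a small sketch of that multi-pass idea
(unallocated_bytes() is a hypothetical helper, e.g. built on
python-btrfs or parsed from 'btrfs filesystem usage'):

import subprocess

def unallocated_bytes(mountpoint):
    # Hypothetical helper: return the amount of unallocated space.
    raise NotImplementedError

def adaptive_balance(mountpoint, target_unallocated):
    # Raise the usage filter step by step, so the emptiest (cheapest to
    # relocate) block groups get compacted first, and stop as soon as
    # enough space is unallocated again.
    for usage in range(0, 101):
        if unallocated_bytes(mountpoint) >= target_unallocated:
            break
        subprocess.run(['btrfs', 'balance', 'start',
                        '-dusage={0}'.format(usage), mountpoint],
                       check=True)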
--
Hans van Kranenburg
* Re: About free space fragmentation, metadata write amplification and (no)ssd
2017-04-08 20:19 About free space fragmentation, metadata write amplification and (no)ssd Hans van Kranenburg
` (2 preceding siblings ...)
2017-04-09 18:10 ` Chris Murphy
@ 2017-04-10 12:23 ` Austin S. Hemmelgarn
2017-04-10 22:59 ` Hans van Kranenburg
2017-05-28 0:59 ` Hans van Kranenburg
4 siblings, 1 reply; 19+ messages in thread
From: Austin S. Hemmelgarn @ 2017-04-10 12:23 UTC (permalink / raw)
To: Hans van Kranenburg, linux-btrfs
On 2017-04-08 16:19, Hans van Kranenburg wrote:
> So... today a real life story / btrfs use case example from the trenches
> at work...
>
> tl;dr 1) btrfs is awesome, but you have to carefully choose which parts
> of it you want to use or avoid 2) improvements can be made, but at least
> the problems relevant for this use case are managable and behaviour is
> quite predictable.
>
> This post is way too long, but I hope it's a fun read for a lazy sunday
> afternoon. :) Otherwise, skip some sections, they have headers.
>
> ...
>
> The example filesystem for this post is one of the backup server
> filesystems we have, running btrfs for the data storage.
Two things before I go any further:
1. Thank you for such a detailed and well written post, and especially
one that isn't just complaining but also going over what works.
2. Apologies if I repeat something from another reply, I didn't do much
other than skimming them.
>
> == About ==
>
> In Q4 2014, we converted all our backup storage from ext4 and using
> rsync with --link-dest to btrfs while still using rsync, but with btrfs
> subvolumes and snapshots [1]. For every new backup, it creates a
> writable snapshot of the previous backup and then uses rsync on the file
> tree to get changes from the remote.
>
> Currently there's ~35TiB of data present on the example filesystem, with
> a total of just a bit more than 90000 subvolumes, in groups of 32
> snapshots per remote host (daily for 14 days, weekly for 3 months,
> montly for a year), so that's about 2800 'groups' of them. Inside are
> millions and millions and millions of files.
>
> And the best part is... it just works. Well, almost, given the title of
> the post. But, the effort needed for creating all backups and doing
> subvolume removal for expiries scales linearly with the amount of them.
>
> == Hardware and filesystem setup ==
>
> The actual disk storage is done using NetApp storage equipment, in this
> case a FAS2552 with 1.2T SAS disks and some extra disk shelves. Storage
> is exported over multipath iSCSI over ethernet, and then grouped
> together again with multipathd and LVM, striping (like, RAID0) over
> active/active controllers. We've been using this setup for years now in
> different places, and it works really well. So, using this, we keep the
> whole RAID / multiple disks / hardware disk failure part outside the
> reach of btrfs. And yes, checksums are done twice, but who cares. ;]
>
> Since the maximum iSCSI lun size is 16TiB, the maximum block device size
> that we use by combining two is 32TiB. This filesystem is already
> bigger, so at some point we added two new luns in a new LVM volume
> group, and added the result to the btrfs filesystem (yay!):
>
> Total devices 2 FS bytes used 35.10TiB
> devid 1 size 29.99TiB used 29.10TiB path /dev/xvdb
> devid 2 size 12.00TiB used 11.29TiB path /dev/xvdc
>
> Data, single: total=39.50TiB, used=34.67TiB
> System, DUP: total=40.00MiB, used=6.22MiB
> Metadata, DUP: total=454.50GiB, used=437.36GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> Yes, DUP metadata, more about that later...
>
> I can also umount the filesystem for a short time, take a snapshot on
> NetApp level from the luns, clone them and then have a writable clone of
> a 40TiB btrfs filesystem, to be able to do crazy things and tests before
> really doing changes, like kernel version or things like converting to
> the free space tree etc.
>
> From end 2014 to september 2016, we used the 3.16 LTS kernel from Debian
> Jessie. Since september 2016, it's 4.7.5, after torturing it for two
> weeks on such a clone, replaying the daily workload on it.
>
> == What's not so great... Allocated but unused space... ==
>
> Since the beginning it showed that the filesystem had a tendency to
> accumulate allocated but unused space that didn't get reused again by
> writes.
>
> In the last months of using kernel 3.16 the situation worsened, ending
> up with about 30% allocated but unused space (11TiB...), while the
> filesystem kept allocating new space all the time instead of reusing it:
>
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-backups-16-Q23.png
>
> Using balance with the 3.16 kernel and space cache v1 to fight this was
> almost not possible because of the amount of scattered out metadata
> writes + amplification (1:40 overall read/write ratio during balance)
> and writing space cache information over and over again on every commit.
>
> When making the switch to the 4.7 kernel I also switched to the free
> space tree, eliminating the space cache flush problems and did a
> mega-balance operation which brought it back down quite a bit.
>
> Here's what it looked like for the last 6 months:
>
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-backups-16-Q4-17-Q1.png
>
> This is not too bad, but also not good enough. I want my picture to
> become brighter white than this:
>
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-03-14-backups-heatmap-chunks.png
>
> The picture shows that the unused space is scattered all around the
> whole filesystem.
>
> So about a month ago, I continued searching kernel code for the cause of
> this behaviour. This is a fun, but time consuming and often mind
> boggling activity, because you run into 10 different interesting things
> at the same time and want to start to find out about all of them at the
> same time etc. :D
>
> The two first things I found out about were:
> 1) the 'free space cluster' code, which is responsible to find empty
> space that new writes can go into, sometimes by combining several free
> space fragments that are close to each other.
> 2) the bool fragmented, which causes a block group to get blacklisted
> for any more writes because finding free space for a write did not
> succeed too easily.
>
> I haven't been able to find a concise description of how all of it
> actually is supposed to work, so have to end up reverse engineering it
> from code, comments and git history.
>
> And, in practice the feeling was that btrfs doesn't really try that
> hard, and quickly gives up and just starts allocating new chunks for
> everything. So, maybe it was just listing all my block groups as
> fragmented and ignoring them?
On this part in particular, while I've seen this behavior on my own
systems to a certain extent, I've never seen it as bad as you're
describing. Based on what I have seen though, it really depends on the
workload. In my case, the only things that cause this degree of
free-space fragmentation are RRD files and data files for BOINC
applications, but both of those have write patterns that are probably
similar to what your backups produce.
One thing I've found helps at least with these particular cases is
bumping the commit time up a bit in BTRFS itself. For both filesystems,
I run with -o commit=150, which is 5 times the default commit time. In
effect, this means I'll lose up to 2.5 minutes of data if the system
crashes, but in both cases, this is not hugely critical data (the BOINC
data costs exactly as much time to regenerate as the length of time's
worth of data that was lost, and the RRD files are just statistics from
collectd).
>
> == Balance based on free space fragmentation level ==
>
> Now, free space being fragmented when you have a high churn rate
> snapshot create and expire workload is not a surprise... Also, when data
> is added there is no way to predict if, and when it ever will be
> unreferenced from the snapshots again, which means I really don't care
> where it ends up on disk.
>
> But how fragmented is the free space, and how can we measure it?
>
> Three weeks ago I made up a free space 'scoring' algorithm, revised it a
> few times and now I'm using it to feed block groups with bad free space
> fragmentation to balance to clean up the filesystem a bit. But, this is
> a fun story for a separate post. In short, take the log2() of the size
> of a free space extent, and then punish it the hardest if it ends up in
> the middle of log2(sectorsize) and log2(block_group.length) and less if
> it's smaller or bigger.
>
> It's still 'mopping with the tap open', like we say in the Netherlands.
> But it's already much better than usage-based balance. If a block group
> is used for 50% and it has 512 alternating 1MiB filled and free
> segments, I want to get rid of it, but if it's 512MiB data and then
> 512MiB empty space, it has to stay.
If you could write up a patch for the balance operation itself to add
this as a filter (probably with some threshold value to control how
picky to be), that would be a great addition.
>
> == But... -o remount,nossd ==
>
> About two weeks ago, I ran into this code, from extent-tree.c:
>
> bool ssd = btrfs_test_opt(fs_info, SSD);
> *empty_cluster = 0;
> [...]
> if (ssd)
>         *empty_cluster = SZ_2M;
> if (space_info->flags & BTRFS_BLOCK_GROUP_METADATA) {
>         ret = &fs_info->meta_alloc_cluster;
>         if (!ssd)
>                 *empty_cluster = SZ_64K;
> } else if ((space_info->flags & BTRFS_BLOCK_GROUP_DATA) && ssd) {
>         ret = &fs_info->data_alloc_cluster;
> }
> [...]
>
> Wait, what? If I mount -o ssd, every small write will turn into at least
> finding 2MiB for a write? What is this magic number?
Explaining this requires explaining a bit of background on SSDs. Most
modern SSDs use NAND flash, which, while byte-addressable for reads and
writes, is only large-block addressable for resetting written bytes.
This erase block is usually a power of 2, and on most drives is 2 or
4MB. That lower size of 2MB is what got chosen here, and in essence the
code is trying to write to each erase block exactly once, which in turn
helps with SSD lifetime, since rewriting part of an erase block may
require erasing the block, and that erase operation is the limiting
factor for the life of flash memory.
>
> Since the rotational flag in /sys is set to 0 for this filesystem, which
> does not at all mean it's an ssd by the way, it mounts with the ssd
> option by default. Since the lower layer of storage is iSCSI on NetApp,
> it does not make any sense at all for btrfs to make assumptions about
> where goes what or how optimal it is, as everything will be reorganized
> anyway.
FWIW, it is possible to use a udev rule to change the rotational flag
from userspace. The kernel's selection algorithm for determining this is
somewhat sub-optimal (essentially, if it's not a local disk that can be
proven to be rotational, it assumes it's non-rotational), so
re-selecting this ends up being somewhat important in certain cases
(virtual machines for example).
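For example, a rule along these lines (untested sketch; the device match
would need to be adjusted to whatever the devices are called on the
system, e.g. the xvd* devices from the filesystem listed above):

# /etc/udev/rules.d/99-btrfs-rotational.rules
ACTION=="add|change", KERNEL=="xvd[bc]", ATTR{queue/rotational}="1"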
>
> These two if statements is pretty much about it, what the ssd option
> does. There's one other if, in tree-log.c, but t.. that's it folks.
> The amount of lines of administration code for handling the mount
> options itself is outnumbering the amount of lines where the option is
> used by far. :D
>
> Like the careful reader can see, the minimum amount of space used for
> metadata writes also gets changed...
>
> After playing around with -o nossd in a few other places, I finally did
> it on this filesystem, first by a complete umount and mount, and then,
> something magical happened:
>
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-btrfs-nossd-whoa.gif
> (timelapse of daily btrfs-heatmap --sort virtual)
>
> After two weeks of creating new backups and feeding fragmented block
> groups to balance, 25% of the filesystem consists of chunks that are
> 100% filled up. (:
>
> == But! The Meta Mummy returns! ==
>
> After changing to nossd, another thing happened. The expiry process,
> which normally takes about 1.5 hour to remove ~2500 subvolumes (keeping
> it queued up to a 100 orphans all the time), suddenly took the entire
> rest of the day, not being done before the nightly backups had to start
> again at 10PM...
>
> And the only thing it seemed to do is writing, writing, writing 100MB/s
> all day long. To see what it was doing I put some code together into
> show_orphan_cleaner_progress.py:
>
> https://github.com/knorrie/python-btrfs/commit/dd34044adf24f7febf6f6992f11966c9094c058b
>
> The output showed it was just doing the normal expiry, but really really
> slow. When changing back to -o ssd, it's back at normal speed.
>
> Since the only thing that seems to change is a minimum of 64KiB instead
> of 2MiB for metadata writes, I suspect the result of doing smaller
> writes is an avalanche of write amplification, especially in the extent
> tree. Since more small spots are filled, it causes more extent tree
> pages to be cowed, which causes metadata writes, which need free space,
> which cause changes in the extent tree, which causes more pages to be
> cowed, which needs free space, which cause changes in the extent tree,
> which...
>
> Warning: do NOT click if you have epilepsy!
> http://31.media.tumblr.com/3c316665d64ecd625eb3b6bc160f08fd/tumblr_mo73kigx0t1s92vobo1_250.gif
> Wheeeeeeeeeeee!
>
> <to be continued>
>
> == So, what do we want? ssd? nossd? ==
>
> Well, both don't do it for me. I want my expensive NetApp disk space to
> be filled up, without requiring me to clean up after it all the time
> using painful balance actions and I want to quickly get rid of old
> snapshots.
>
> So currently, there's two mount -o remount statements before and after
> doing the expiries...
>
> == Ok, one more picture ==
>
> Here's a picture of disk read/write throughput of yesterday:
>
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-backups-diskstats_throughput-day.png
>
> * The balance part is me feeding fragmented block groups to balance. And
> yes, rewriting 1GiB of data requires writing about 40GiB of metadata. ! :(
> * Backup 1 and 2 are the backups, rsync limited at 16MB/s incoming
> remote network traffic, which ends up as 50MB/s writes including
> metadata changes. :(
> * Expire, which today took 2.5 hours, removing 4240 subvolumes (+14 days
> and a lot +3 months)
>
> While snapshot removal totally explodes with nossd, there seems to be
> little impact on backups and balance... :?
>
> == Work to do ==
>
> The next big change on this system will be to move from the 4.7 kernel
> to the 4.9 LTS kernel and Debian Stretch.
>
> Note that our metadata is still DUP, and it doesn't have skinny extent
> tree metadata yet. It was originally created with btrfs-progs 3.17, and
> when we realized we should have single it was too late. I want to change
> that and see if I can convert on a NetApp clone. This should reduce
> extent tree metadata size by maybe more than 60% and whoknowswhat will
> happen to the abhorrent write traffic.
Depending on how much you trust the NetApp storage appliance you're
using, you may also consider nodatasum. It won't help much with the
metadata issues, but it may cut down on the resource usage on the system
itself while doing backups. Overall though, based on your description,
the only thing you really need from BTRFS itself is the snapshots, and
given that, there may be other options out there that are more efficient.
>
> This conversion can run on the clone, after removing as many subvolumes
> as possible with the least amount of data going away.
>
> Before switching over to the clone as live backup server, all missing
> snapshots can be rsynced over from the live backup server.
>
> == So ==
>
> Thanks for reading. Now, feel free to ask me anything... :D ...or on IRC
> of course.
* Re: About free space fragmentation, metadata write amplification and (no)ssd
2017-04-10 12:23 ` Austin S. Hemmelgarn
@ 2017-04-10 22:59 ` Hans van Kranenburg
2017-04-11 11:33 ` Austin S. Hemmelgarn
0 siblings, 1 reply; 19+ messages in thread
From: Hans van Kranenburg @ 2017-04-10 22:59 UTC (permalink / raw)
To: Austin S. Hemmelgarn, linux-btrfs
On 04/10/2017 02:23 PM, Austin S. Hemmelgarn wrote:
> On 2017-04-08 16:19, Hans van Kranenburg wrote:
>> So... today a real life story / btrfs use case example from the trenches
>> at work...
>>
>> tl;dr 1) btrfs is awesome, but you have to carefully choose which parts
>> of it you want to use or avoid 2) improvements can be made, but at least
>> the problems relevant for this use case are managable and behaviour is
>> quite predictable.
>>
>> This post is way too long, but I hope it's a fun read for a lazy sunday
>> afternoon. :) Otherwise, skip some sections, they have headers.
>>
>> ...
>>
>> The example filesystem for this post is one of the backup server
>> filesystems we have, running btrfs for the data storage.
> Two things before I go any further:
> 1. Thank you for such a detailed and well written post, and especially
> one that isn't just complaining but also going over what works.
Thanks!
>> [...]
>>
>>
>> == What's not so great... Allocated but unused space... ==
>>
>> Since the beginning it showed that the filesystem had a tendency to
>> accumulate allocated but unused space that didn't get reused again by
>> writes.
>>
>> [...]
>>
>> So about a month ago, I continued searching kernel code for the cause of
>> this behaviour. This is a fun, but time consuming and often mind
>> boggling activity, because you run into 10 different interesting things
>> at the same time and want to start to find out about all of them at the
>> same time etc. :D
>>
>> The two first things I found out about were:
>> 1) the 'free space cluster' code, which is responsible to find empty
>> space that new writes can go into, sometimes by combining several free
>> space fragments that are close to each other.
>> 2) the bool fragmented, which causes a block group to get blacklisted
>> for any more writes because finding free space for a write did not
>> succeed too easily.
>>
>> I haven't been able to find a concise description of how all of it
>> actually is supposed to work, so have to end up reverse engineering it
>> from code, comments and git history.
>>
>> And, in practice the feeling was that btrfs doesn't really try that
>> hard, and quickly gives up and just starts allocating new chunks for
>> everything. So, maybe it was just listing all my block groups as
>> fragmented and ignoring them?
> On this part in particular, while I've seen this behavior on my own
> systems to a certain extent, I've never seen it as bad as you're
> describing. Based on what I have seen though, it really depends on the
> workload.
Yes.
> In my case, the only things that cause this degree of
> free-space fragmentation are RRD files and data files for BOINC
> applications, but both of those have write patterns that are probably
> similar to what your backups produce.
>
> One thing I've found helps at least with these particular cases is
> bumping the commit time up a bit in BTRFS itself. For both filesystems,
> I run with -o commit=150, which is 5 times the default commit time. In
> effect, this means I'll lose up to 2.5 minutes of data if the system
> crashes, but in both cases, this is not hugely critical data (the BOINC
> data costs exactly as much time to regenerate as the length of time's
> worth of data that was lost, and the RRD files are just statistics from
> collectd).
I think this might help if you have many little writes piling up in
memory, which then get written out less often, in one go, yes. It doesn't
help when you're pumping data into the fs as fast as possible because you
want to have your backups finished.
I did some tests with commit times once, to see if it would influence
the amount of rumination the cow does before defecating metadata onto
disk, but it didn't show any difference, I guess because the commit
timeout never gets reached. It just keeps writing metadata at full speed
to disk all the time.
...
In my case the next thing after getting this free space fragmentation
fixed (which looks like it's going in the right direction), is to go see
why this filesystem needs to write so much metadata all the time (like,
how many % is which tree, how close or far apart are the writes in the
trees, and how close or far apart are the locations on disk that it's
written to).
>> == Balance based on free space fragmentation level ==
>>
>> Now, free space being fragmented when you have a high churn rate
>> snapshot create and expire workload is not a surprise... Also, when data
>> is added there is no way to predict if, and when it ever will be
>> unreferenced from the snapshots again, which means I really don't care
>> where it ends up on disk.
>>
>> But how fragmented is the free space, and how can we measure it?
>>
>> Three weeks ago I made up a free space 'scoring' algorithm, revised it a
>> few times and now I'm using it to feed block groups with bad free space
>> fragmentation to balance to clean up the filesystem a bit. But, this is
>> a fun story for a separate post. In short, take the log2() of the size
>> of a free space extent, and then punish it the hardest if it ends up in
>> the middle of log2(sectorsize) and log2(block_group.length) and less if
>> it's smaller or bigger.
>>
>> It's still 'mopping with the tap open', like we say in the Netherlands.
>> But it's already much better than usage-based balance. If a block group
>> is used for 50% and it has 512 alternating 1MiB filled and free
>> segments, I want to get rid of it, but if it's 512MiB data and then
>> 512MiB empty space, it has to stay.
> If you could write up a patch for the balance operation itself to add
> this as a filter (probably with some threshold value to control how
> picky to be), that would be a great addition.
I found out it's quite hard to come up with a useful scoring mechanism.
Creating one that results in a number between 0 and 100 is even harder.
But... now that I know about the nossd/ssd things seen below, I think it's
better to find out what patterns of free space are *actually* a problem,
i.e. which ones result in free space being ignored and getting that
little boolean flag that prevents further writes.
For example, if it's as simple as "with -o ssd all free space fragments
that are <2 MiB will be ignored" (note to self: what about alignment?),
then it's quite simple to calculate how many MiB in total this is, and
express that as a % of the total block group size. This is already a
totally different method than what I describe above.
So finding out what the actual behaviour is needs to be done first
(like, how do I get access to that flagged list of blockgroups). Making
up some algorithm is pointless without that.
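As an illustration of the kind of calculation I mean (just a sketch in
Python, names made up, not the code I actually run), given the sizes in
bytes of the free space extents in one block group, e.g. collected from
the free space tree:

# Sketch only: how much of a block group's free space would be ignored if
# the allocator skips every free space fragment smaller than 2MiB (-o ssd).
SZ_2M = 2 * 1024 * 1024  # the ssd empty_cluster size from extent-tree.c
MiB = 1 << 20
GiB = 1 << 30

def unusable_free_pct(free_extent_sizes, block_group_length, threshold=SZ_2M):
    """Free space sitting in fragments < threshold, as % of the block group."""
    small = sum(size for size in free_extent_sizes if size < threshold)
    return 100.0 * small / block_group_length

# 50% used, 512 alternating 1MiB filled/free segments: all free space is in
# fragments below 2MiB, so half of the block group is effectively ignored.
print(unusable_free_pct([1 * MiB] * 512, 1 * GiB))  # 50.0

# 512MiB of data followed by one 512MiB free extent: nothing gets ignored.
print(unusable_free_pct([512 * MiB], 1 * GiB))      # 0.0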
>> == But... -o remount,nossd ==
>>
>> About two weeks ago, I ran into this code, from extent-tree.c:
>>
>> bool ssd = btrfs_test_opt(fs_info, SSD);
>> *empty_cluster = 0;
>> [...]
>> if (ssd)
>> *empty_cluster = SZ_2M;
>> if (space_info->flags & BTRFS_BLOCK_GROUP_METADATA) {
>> ret = &fs_info->meta_alloc_cluster;
>> if (!ssd)
>> *empty_cluster = SZ_64K;
>> } else if ((space_info->flags & BTRFS_BLOCK_GROUP_DATA) && ssd) {
>> ret = &fs_info->data_alloc_cluster;
>> }
>> [...]
>>
>> Wait, what? If I mount -o ssd, every small write will turn into at least
>> finding 2MiB for a write? What is this magic number?
> Explaining this requires explaining a bit of background on SSD's. Most
> modern SSD's use NAND flash, which while byte-addressable for reads and
> writes, is only large-block addressable for resetting written bytes.
> This erase-block is usually a power of 2, and on most drives is 2 or
> 4MB. That lower size of 2MB is what got chosen here, and in essence the
> code is trying to write to each erase block exactly once, which in turn
> helps with SSD lifetime, since rewriting part of an erase block may
> require erasing the block, and that erase operation is the limiting
> factor for the life of flash memory.
What I understand (e.g. [1]) is that the biggest challenge is to line up
your writes in a way so that future 1) discards (assuming the fs sends
discards for deletes) and 2) overwrites (which invalidate the previous
write at that LBA) line up in the most optimal (same) way. That's a very
different thing than the (from btrfs virtual address space point of
view) pattern in which data is written, and it involves being able to
see in the future to do it well.
Any incoming write will be placed into a new empty page of the NAND. So,
if an erase block is 2MiB, and in one btrfs transaction I do 512x a 4kiB
write in 512 random places of the physical block device as seen by
Linux, they will end up after each other on the NAND, filling one erase
block (like, the first column in the pictures in the pdf).
So, that would mean that the location where (seen from the point of view
of the btrfs virtual address space) data is written does not matter, as
long as all data that is part of the same thing (like everything that
belongs to 1 file) is written out together.
That means that if I have a 2MiB free space extent, and I write 4KiB, it
does not make any sense at all to ignore the remaining 2093056 bytes
after that. I really hope that's not what it's doing now.
In this case... (-o nossd)
https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-ichiban-walk-nossd.mp4
...a small amount of vaddr locations are used over and over again (the
/var/spool/postfix behaviour). This means that if this was an actual
ssd, I'd be filling up erase blocks with small writes, and sending
discards for those writes a bit later, over and over again, having the
effect of having a nice pile of erase blocks that can be erased without
having to move live data out of it first.
In this case... (-o ssd)
https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-01-19-noautodefrag-ichiban.mp4
...the ssd sees the same write pattern coming in! Hey! That's nice. And
if I have -o discard, it also sees the same pattern of discards coming in.
If I *don't* have discard enabled in my mount options, I'm killing my
ssd at a much higher speed with -o ssd than with -o nossd because it
doesn't get the invalidate hints from overwriting used space.
8-)
And, the difference is that my btrfs filesystem is totally killing
itself creating blockgroups with usage <10% which are blocked from
further writes, creating a nightmare for the user, having to use balance
to clean them up, only resulting in doing many many more writes to the
ssd...
Since my reasoning here seems to be almost the 100% opposite of what
btrfs tries to do, I must be missing something very important here. What
is it?
Doing bigger writes instead of splitting it up in fragmented writes all
over the place is better for
1. having less extent items in the extent tree and less places in the
(vaddr sorted) trees that need to be updated
2. causing less cowing because tree updates are less or closer to each other
3. keeping data together on "rotational" disks to minimize seeks.
1 is always good and does not have to do anything with ssd or no ssd,
but with the size of metadata and complexity of operations that handle
metadata.
2 seems to be pretty important also, because of the effect of the 64KiB
writes instead of 2MiB writes that happen with -o nossd, so it's
important for both ssd and nossd, while the nossd users are now
suffering from these effects
3 is also nice for the rotational users, but for the ssd users it
wouldn't really matter.
[1]
https://www.micron.com/~/media/documents/products/technical-marketing-brief/brief_ssd_effect_data_placement_writes.pdf
>> Since the rotational flag in /sys is set to 0 for this filesystem, which
>> does not at all mean it's an ssd by the way, it mounts with the ssd
>> option by default. Since the lower layer of storage is iSCSI on NetApp,
>> it does not make any sense at all for btrfs to make assumptions about
>> where goes what or how optimal it is, as everything will be reorganized
>> anyway.
> FWIW, it is possible to use a udev rule to change the rotational flag
> from userspace. The kernel's selection algorithm for determining this is
> somewhat sub-optimal (essentially, if it's not a local disk that can be
> proven to be rotational, it assumes it's non-rotational), so
> re-selecting this ends up being somewhat important in certain cases
> (virtual machines for example).
Just putting nossd in fstab seems convenient enough.
>> == Work to do ==
>>
>> The next big change on this system will be to move from the 4.7 kernel
>> to the 4.9 LTS kernel and Debian Stretch.
>>
>> Note that our metadata is still DUP, and it doesn't have skinny extent
>> tree metadata yet. It was originally created with btrfs-progs 3.17, and
>> when we realized we should have single it was too late. I want to change
>> that and see if I can convert on a NetApp clone. This should reduce
>> extent tree metadata size by maybe more than 60% and whoknowswhat will
>> happen to the abhorrent write traffic.
> Depending on how much you trust the NetApp storage appliance you're
> using, you may also consider nodatasum.
That... is... actually... a very interesting idea...
> It won't help much with the
> metadata issues, but it may cut down on the resource usage on the system
> itself while doing backups.
Who knows... The csum tree must also be huuuge with 35TiB of data, and,
snapshot removal causes a lot of deletes in the tree.
So if this is the case, then who knows what things like "Btrfs: bulk
delete checksum items in the same leaf" will do to it.
But, first it needs more research about what those metadata writes are.
> [...]
--
Hans van Kranenburg
* Re: About free space fragmentation, metadata write amplification and (no)ssd
2017-04-10 22:59 ` Hans van Kranenburg
@ 2017-04-11 11:33 ` Austin S. Hemmelgarn
2017-04-11 13:13 ` Kai Krakow
0 siblings, 1 reply; 19+ messages in thread
From: Austin S. Hemmelgarn @ 2017-04-11 11:33 UTC (permalink / raw)
To: Hans van Kranenburg, linux-btrfs
On 2017-04-10 18:59, Hans van Kranenburg wrote:
> On 04/10/2017 02:23 PM, Austin S. Hemmelgarn wrote:
>> On 2017-04-08 16:19, Hans van Kranenburg wrote:
>>> So... today a real life story / btrfs use case example from the trenches
>>> at work...
>>>
>>> tl;dr 1) btrfs is awesome, but you have to carefully choose which parts
>>> of it you want to use or avoid 2) improvements can be made, but at least
>>> the problems relevant for this use case are managable and behaviour is
>>> quite predictable.
>>>
>>> This post is way too long, but I hope it's a fun read for a lazy sunday
>>> afternoon. :) Otherwise, skip some sections, they have headers.
>>>
>>> ...
>>>
>>> The example filesystem for this post is one of the backup server
>>> filesystems we have, running btrfs for the data storage.
>> Two things before I go any further:
>> 1. Thank you for such a detailed and well written post, and especially
>> one that isn't just complaining but also going over what works.
>
> Thanks!
>
>>> [...]
>>>
>>>
>>> == What's not so great... Allocated but unused space... ==
>>>
>>> Since the beginning it showed that the filesystem had a tendency to
>>> accumulate allocated but unused space that didn't get reused again by
>>> writes.
>>>
>>> [...]
>>>
>>> So about a month ago, I continued searching kernel code for the cause of
>>> this behaviour. This is a fun, but time consuming and often mind
>>> boggling activity, because you run into 10 different interesting things
>>> at the same time and want to start to find out about all of them at the
>>> same time etc. :D
>>>
>>> The two first things I found out about were:
>>> 1) the 'free space cluster' code, which is responsible to find empty
>>> space that new writes can go into, sometimes by combining several free
>>> space fragments that are close to each other.
>>> 2) the bool fragmented, which causes a block group to get blacklisted
>>> for any more writes because finding free space for a write did not
>>> succeed too easily.
>>>
>>> I haven't been able to find a concise description of how all of it
>>> actually is supposed to work, so have to end up reverse engineering it
>>> from code, comments and git history.
>>>
>>> And, in practice the feeling was that btrfs doesn't really try that
>>> hard, and quickly gives up and just starts allocating new chunks for
>>> everything. So, maybe it was just listing all my block groups as
>>> fragmented and ignoring them?
>> On this part in particular, while I've seen this behavior on my own
>> systems to a certain extent, I've never seen it as bad as you're
>> describing. Based on what I have seen though, it really depends on the
>> workload.
>
> Yes.
>
>> In my case, the only things that cause this degree of
>> free-space fragmentation are RRD files and data files for BOINC
>> applications, but both of those have write patterns that are probably
>> similar to what your backups produce.
>>
>> One thing I've found helps at least with these particular cases is
>> bumping the commit time up a bit in BTRFS itself. For both filesystems,
>> I run with -o commit=150, which is 5 times the default commit time. In
>> effect, this means I'll lose up to 2.5 minutes of data if the system
>> crashes, but in both cases, this is not hugely critical data (the BOINC
>> data costs exactly as much time to regenerate as the length of time's
>> worth of data that was lost, and the RRD files are just statistics from
>> collectd).
>
> I think this might help if you have many little writes piling up in
> memory, which then get written out less often, in one go, yes. It doesn't
> help when you're pumping data into the fs as fast as possible because you
> want to have your backups finished.
>
> I did some tests with commit times once, to see if it would influence
> the amount of rumination the cow does before defecating metadata onto
> disk, but it didn't show any difference, I guess because the commit
> timeout never gets reached. It just keeps writing metadata at full speed
> to disk all the time.
>
> ...
>
> In my case the next thing after getting this free space fragmentation
> fixed (which looks like it's going in the right direction), is to go see
> why this filesystem needs to write so much metadata all the time (like,
> how many % is which tree, how close or far apart are the writes in the
> trees, and how close or far apart are the locations on disk that it's
> written to).
What the commit timeout ends up being is the longest the FS will wait
before forcing the in-memory state out to disk. IOW, the FS is
guaranteed consistent at least once every 'commit' seconds. In
retrospect, you're right that it almost certainly won't help much in
this case.
>
>>> == Balance based on free space fragmentation level ==
>>>
>>> Now, free space being fragmented when you have a high churn rate
>>> snapshot create and expire workload is not a surprise... Also, when data
>>> is added there is no way to predict if, and when it ever will be
>>> unreferenced from the snapshots again, which means I really don't care
>>> where it ends up on disk.
>>>
>>> But how fragmented is the free space, and how can we measure it?
>>>
>>> Three weeks ago I made up a free space 'scoring' algorithm, revised it a
>>> few times and now I'm using it to feed block groups with bad free space
>>> fragmentation to balance to clean up the filesystem a bit. But, this is
>>> a fun story for a separate post. In short, take the log2() of the size
>>> of a free space extent, and then punish it the hardest if it ends up in
>>> the middle of log2(sectorsize) and log2(block_group.length) and less if
>>> it's smaller or bigger.
>>>
>>> It's still 'mopping with the tap open', like we say in the Netherlands.
>>> But it's already much better than usage-based balance. If a block group
>>> is used for 50% and it has 512 alternating 1MiB filled and free
>>> segments, I want to get rid of it, but if it's 512MiB data and then
>>> 512MiB empty space, it has to stay.
>> If you could write up a patch for the balance operation itself to add
>> this as a filter (probably with some threshold value to control how
>> picky to be), that would be a great addition.
>
> I found out it's quite hard to come up with a useful scoring mechanism.
> Creating one that results in a number between 0 and 100 is even harder.
>
> But... now that I know about the nossd/ssd things seen below, I think it's
> better to find out what patterns of free space are *actually* a problem,
> i.e. which ones result in free space being ignored and getting that
> little boolean flag that prevents further writes.
>
> For example, if it's as simple as "with -o ssd all free space fragments
> that are <2 MiB will be ignored" (note to self: what about alignment?),
> then it's quite simple to calculate how many MiB in total this is, and
> express that as a % of the total block group size. This is already a
> totally different method than what I describe above.
>
> So finding out what the actual behaviour is needs to be done first
> (like, how do I get access to that flagged list of blockgroups). Making
> up some algorithm is pointless without that.
Excellent point.
>
>>> == But... -o remount,nossd ==
>>>
>>> About two weeks ago, I ran into this code, from extent-tree.c:
>>>
>>> bool ssd = btrfs_test_opt(fs_info, SSD);
>>> *empty_cluster = 0;
>>> [...]
>>> if (ssd)
>>> *empty_cluster = SZ_2M;
>>> if (space_info->flags & BTRFS_BLOCK_GROUP_METADATA) {
>>> ret = &fs_info->meta_alloc_cluster;
>>> if (!ssd)
>>> *empty_cluster = SZ_64K;
>>> } else if ((space_info->flags & BTRFS_BLOCK_GROUP_DATA) && ssd) {
>>> ret = &fs_info->data_alloc_cluster;
>>> }
>>> [...]
>>>
>>> Wait, what? If I mount -o ssd, every small write will turn into at least
>>> finding 2MiB for a write? What is this magic number?
>> Explaining this requires explaining a bit of background on SSD's. Most
>> modern SSD's use NAND flash, which while byte-addressable for reads and
>> writes, is only large-block addressable for resetting written bytes.
>> This erase-block is usually a power of 2, and on most drives is 2 or
>> 4MB. That lower size of 2MB is what got chosen here, and in essence the
>> code is trying to write to each erase block exactly once, which in turn
>> helps with SSD lifetime, since rewriting part of an erase block may
>> require erasing the block, and that erase operation is the limiting
>> factor for the life of flash memory.
>
> What I understand (e.g. [1]) is that the biggest challenge is to line up
> your writes in a way so that future 1) discards (assuming the fs sends
> discards for deletes) and 2) overwrites (which invalidate the previous
> write at that LBA) line up in the most optimal (same) way. That's a very
> different thing than the (from btrfs virtual address space point of
> view) pattern in which data is written, and it involves being able to
> see in the future to do it well.
>
> Any incoming write will be placed into a new empty page of the NAND. So,
> if an erase block is 2MiB, and in one btrfs transaction I do 512x a 4kiB
> write in 512 random places of the physical block device as seen by
> Linux, they will end up after each other on the NAND, filling one erase
> block (like, the first column in the pictures in the pdf).
>
> So, that would mean that the location where (seen from the point of view
> of the btrfs virtual address space) data is written does not matter, as
> long as all data that is part of the same thing (like everything that
> belongs to 1 file) is written out together.
>
> That means that if I have a 2MiB free space extent, and I write 4KiB, it
> does not make any sense at all to ignore the remaining 2093056 bytes
> after that. I really hope that's not what it's doing now.
That all assumes you have a smart FTL in the SSD's firmware. Most
modern ones do a decent job, but there are still some out there that don't do
much in the way of remapping data.
>
> In this case... (-o nossd)
>
>
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-ichiban-walk-nossd.mp4
>
> ...a small amount of vaddr locations are used over and over again (the
> /var/spool/postfix behaviour). This means that if this was an actual
> ssd, I'd be filling up erase blocks with small writes, and sending
> discards for those writes a bit later, over and over again, having the
> effect of having a nice pile of erase blocks that can be erased without
> having to move live data out of it first.
>
> In this case... (-o ssd)
>
>
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-01-19-noautodefrag-ichiban.mp4
>
> ...the ssd sees the same write pattern coming in! Hey! That's nice. And
> if I have -o discard, it also sees the same pattern of discards coming in.
>
> If I *don't* have discard enabled in my mount options, I'm killing my
> ssd at a much higher speed with -o ssd than with -o nossd because it
> doesn't get the invalidate hints from overwriting used space.
>
> 8-)
>
> And, the difference is that my btrfs filesystem is totally killing
> itself creating blockgroups with usage <10% which are blocked from
> further writes, creating a nightmare for the user, having to use balance
> to clean them up, only resulting in doing many many more writes to the
> ssd...
>
> Since my reasoning here seems to be almost the 100% opposite of what
> btrfs tries to do, I must be missing something very important here. What
> is it?
>
> Doing bigger writes instead of splitting it up in fragmented writes all
> over the place is better for
> 1. having less extent items in the extent tree and less places in the
> (vaddr sorted) trees that need to be updated
> 2. causing less cowing because tree updates are less or closer to each other
> 3. keeping data together on "rotational" disks to minimize seeks.
>
> 1 is always good and does not have to do anything with ssd or no ssd,
> but with the size of metadata and complexity of operations that handle
> metadata.
> 2 seems to be pretty important also, because of the effect of the 64KiB
> writes instead of 2MiB writes that happen with -o nossd, so it's
> important for both ssd and nossd, while the nossd users are now
> suffering from these effects
> 3 is also nice for the rotational users, but for the ssd users it
> wouldn't really matter.
>
> [1]
> https://www.micron.com/~/media/documents/products/technical-marketing-brief/brief_ssd_effect_data_placement_writes.pdf
>
>>> Since the rotational flag in /sys is set to 0 for this filesystem, which
>>> does not at all mean it's an ssd by the way, it mounts with the ssd
>>> option by default. Since the lower layer of storage is iSCSI on NetApp,
>>> it does not make any sense at all for btrfs to make assumptions about
>>> where goes what or how optimal it is, as everything will be reorganized
>>> anyway.
>> FWIW, it is possible to use a udev rule to change the rotational flag
>> from userspace. The kernel's selection algorithm for determining this is
>> somewhat sub-optimal (essentially, if it's not a local disk that can be
>> proven to be rotational, it assumes it's non-rotational), so
>> re-selecting this ends up being somewhat important in certain cases
>> (virtual machines for example).
>
> Just putting nossd in fstab seems convenient enough.
While that does work, there are other pieces of software that change
behavior based on the value of the rotational flag, and likewise make
misguided assumptions about what it means.
>
>>> == Work to do ==
>>>
>>> The next big change on this system will be to move from the 4.7 kernel
>>> to the 4.9 LTS kernel and Debian Stretch.
>>>
>>> Note that our metadata is still DUP, and it doesn't have skinny extent
>>> tree metadata yet. It was originally created with btrfs-progs 3.17, and
>>> when we realized we should have single it was too late. I want to change
>>> that and see if I can convert on a NetApp clone. This should reduce
>>> extent tree metadata size by maybe more than 60% and whoknowswhat will
>>> happen to the abhorrent write traffic.
>> Depending on how much you trust the NetApp storage appliance you're
>> using, you may also consider nodatasum.
>
> That... is... actually... a very interesting idea...
>
>> It won't help much with the
>> metadata issues, but it may cut down on the resource usage on the system
>> itself while doing backups.
>
> Who knows... The csum tree must also be huuuge with 35TiB of data, and,
> snapshot removal causes a lot of deletes in the tree.
Well, for 35TiB of data, checksummed in 4KiB blocks and not including the
btree overhead, you're looking at a bit more than 9 billion blocks, and each
block has its own checksum. Data checksums are crc32c, so 4 bytes each, which
(assuming I'm doing the math right) equates to roughly 35GiB of raw
checksums (70GiB with DUP). At a minimum, turning off
data checksums will save you some space, and assuming I understand how
the trees work, should actually cut down on the commit time for snapshot
deletion and will cut down on the metadata writes (because the csum tree
is metadata).
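A trivial back-of-the-envelope version of that estimate (sketch only,
assuming the default 4-byte crc32c checksum per 4KiB data sector; btree
leaf and item overhead come on top of the raw payload):

# Rough estimate only: assumes crc32c data checksums (4 bytes) per 4KiB
# sector; btree leaf/item overhead is not included.
TiB = 1 << 40
GiB = 1 << 30

data_bytes = 35 * TiB
sector_size = 4096   # checksum granularity for data (assumed default)
csum_size = 4        # crc32c (assumed default)

sectors = data_bytes // sector_size
raw_csum_bytes = sectors * csum_size

print("data sectors: %d" % sectors)                                    # 9395240960
print("raw csum payload: %.1f GiB" % (raw_csum_bytes / GiB))           # 35.0
print("stored twice with DUP: %.1f GiB" % (2 * raw_csum_bytes / GiB))  # 70.0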
>
> So if this is the case, then who knows what things like "Btrfs: bulk
> delete checksum items in the same leaf" will do to it.
>
> But, first it needs more research about what those metadata writes are.
>
>> [...]
>
* Re: About free space fragmentation, metadata write amplification and (no)ssd
2017-04-11 11:33 ` Austin S. Hemmelgarn
@ 2017-04-11 13:13 ` Kai Krakow
0 siblings, 0 replies; 19+ messages in thread
From: Kai Krakow @ 2017-04-11 13:13 UTC (permalink / raw)
To: linux-btrfs
Am Tue, 11 Apr 2017 07:33:41 -0400
schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
> >> FWIW, it is possible to use a udev rule to change the rotational
> >> flag from userspace. The kernel's selection algorithm for
> >> determining this is somewhat sub-optimal (essentially, if it's not a
> >> local disk that can be proven to be rotational, it assumes it's
> >> non-rotational), so re-selecting this ends up being somewhat
> >> important in certain cases (virtual machines for example).
> >
> > Just putting nossd in fstab seems convenient enough.
> While that does work, there are other pieces of software that change
> behavior based on the value of the rotational flag, and likewise make
> misguided assumptions about what it means.
Something similar happens when you put btrfs on bcache. It now assumes
it is on SSD but in reality it isn't. Thus, I also deployed udev rules
to force back nossd behavior.
But maybe, in the bcache case using "nossd" instead would make more
sense. Any ideas on this?
--
Regards,
Kai
Replies to list-only preferred.
* Re: About free space fragmentation, metadata write amplification and (no)ssd
2017-04-08 20:19 About free space fragmentation, metadata write amplification and (no)ssd Hans van Kranenburg
` (3 preceding siblings ...)
2017-04-10 12:23 ` Austin S. Hemmelgarn
@ 2017-05-28 0:59 ` Hans van Kranenburg
2017-05-28 3:54 ` Duncan
2017-06-08 17:57 ` Hans van Kranenburg
4 siblings, 2 replies; 19+ messages in thread
From: Hans van Kranenburg @ 2017-05-28 0:59 UTC (permalink / raw)
To: linux-btrfs
A small update...
Original (long) message:
https://www.spinics.net/lists/linux-btrfs/msg64446.html
On 04/08/2017 10:19 PM, Hans van Kranenburg wrote:
> [...]
>
> == But! The Meta Mummy returns! ==
>
> After changing to nossd, another thing happened. The expiry process,
> which normally takes about 1.5 hour to remove ~2500 subvolumes (keeping
> it queued up to a 100 orphans all the time), suddenly took the entire
> rest of the day, not being done before the nightly backups had to start
> again at 10PM...
>
> And the only thing it seemed to do is writing, writing, writing 100MB/s
> all day long.
This behaviour was observed with a 4.7.5 linux kernel.
When running 4.9.25 now with -o nossd, this weird behaviour is gone. I
have no idea what change between 4.7 and 4.9 is responsible for this,
but it's good.
> == So, what do we want? ssd? nossd? ==
>
> Well, both don't do it for me. I want my expensive NetApp disk space to
> be filled up, without requiring me to clean up after it all the time
> using painful balance actions and I want to quickly get rid of old
> snapshots.
>
> So currently, there's two mount -o remount statements before and after
> doing the expiries...
With 4.9+ now, it stays on nossd for sure, everywhere. :)
I keep doing daily btrfs-heatmap pictures, here's a nice timelapse of
Feb 22 until May 26th. One picture per day.
https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-05-28-btrfs-nossd-whoa.mp4
These images use --sort virtual, so the block groups jump around a bit
because of the free-space-fragmentation-level-score-based btrfs balance
that I did for a few weeks. Total fs size is close to 40TiB.
At 17 seconds into the movie, I switched over to -o nossd. The effect is
very clearly visible. Suddenly the filesystem starts filling up all
empty space, starting at the beginning of the virtual address space. In
the last few months the amount of allocated but unused space went down
from about 6 TiB to a bit more than 2 TiB now, and it's still decreasing
every day. \o/
This actually means that forcing -o nossd solved the main headache and
cause of babysitting requirements when using btrfs that I have been
experiencing from the very beginning of trying it...
By the way, being able to use nossd only is also a big improvement
for the (a few dozen) smaller filesystems that we use with replication
for DR purposes (yay, btrbk). We don't have to look around and respond
to alerts all the time any more to see which filesystem is choking
itself to death today and then rescue it with btrfs balance, and the
snapshot and send/receive schedule and expiry doesn't cause abnormal
write IO any more. \o/
> [...]
>
> == Work to do ==
>
> The next big change on this system will be to move from the 4.7 kernel
> to the 4.9 LTS kernel and Debian Stretch.
After starting to upgrade other btrfs filesystems to use kernel 4.9 in
the last few weeks (including the smaller backup servers), I did the
biggest one today. It's running 4.9.25 now, or Debian 4.9.25-1~bpo8+1 to
be exact. Currently it's working its way through the nightlies, looking
good.
> Note that our metadata is still DUP, and it doesn't have skinny extent
> tree metadata yet. It was originally created with btrfs-progs 3.17, and
> when we realized we should have single it was too late. I want to change
> that and see if I can convert on a NetApp clone. This should reduce
> extent tree metadata size by maybe more than 60% and whoknowswhat will
> happen to the abhorrent write traffic.
Yeah, blabla... Converting metadata from DUP to single is a big no-go
with btrfs balance, that much I have clearly figured out by now.
> Before switching over to the clone as live backup server, all missing
> snapshots can be rsynced over from the live backup server.
Using snapshot/clone functionality of our NetApp storage, I did the move
from 4.7 to 4.9 in the last two days.
Since mounting with 4.9 requires a rebuild of the free space tree (and
since I didn't feel like hacking the feature bit in instead), this
wasn't going to be a quick maintenance action.
Two days ago I cloned the luns that make up the (now) 40TiB filesystem
and did the skinny-metadata and free space tree changes, and also
cleaned out the free space cache v1 (byebye..)
-# time btrfsck --clear-space-cache v2 /dev/xvdb
Clear free space cache v2
free space cache v2 cleared
real 10m47.854s
user 0m17.200s
sys 0m11.040s
-# time btrfsck --clear-space-cache v1 /dev/xvdb
Clearing free space cache
Free space cache cleared
real 195m8.970s
user 161m32.380s
sys 24m23.476s
^^notice the cpu usage...
-# time btrfstune -x /dev/xvdb
real 17m4.647s
user 0m16.856s
sys 0m3.944s
-# time mount -o noatime,nossd,space_cache=v2 /dev/xvdb /srv/backup
real 289m55.671s
user 0m0.000s
sys 1m11.156s
Yeah, random read IO sucks... :|
In the two days after, I ran the same expiries as the production backup
server was doing, and synced new backup data to the clone. Tonight, just
before the nightly run, I swapped the production luns and the clones so
the real backup server could quickly continue using the prepared filesystem.
--
Hans van Kranenburg
* Re: About free space fragmentation, metadata write amplification and (no)ssd
2017-05-28 0:59 ` Hans van Kranenburg
@ 2017-05-28 3:54 ` Duncan
2017-06-08 17:57 ` Hans van Kranenburg
1 sibling, 0 replies; 19+ messages in thread
From: Duncan @ 2017-05-28 3:54 UTC (permalink / raw)
To: linux-btrfs
Hans van Kranenburg posted on Sun, 28 May 2017 02:59:57 +0200 as
excerpted:
>> Note that our metadata is still DUP, and it doesn't have skinny extent
>> tree metadata yet. It was originally created with btrfs-progs 3.17, and
>> when we realized we should have single it was too late. I want to
>> change that and see if I can convert on a NetApp clone. This should
>> reduce extent tree metadata size by maybe more than 60% and
>> whoknowswhat will happen to the abhorrent write traffic.
>
> Yeah, blabla... Converting metadata from DUP to single is a big no go
> with btrfs balance, that's what I clearly got figured out now.
Umm... Did you try -f (force)? See the manpage.
OTOH, I'd have thought that was obvious enough that you'd have tried it,
which would make the problem here something not so simple; but then again,
I'd have thought you'd mention trying it too, if you did, to prevent exactly
this sort of followup. So I don't know what to think, except that I think
it's worth covering the possibility just in case.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: About free space fragmentation, metadata write amplification and (no)ssd
2017-05-28 0:59 ` Hans van Kranenburg
2017-05-28 3:54 ` Duncan
@ 2017-06-08 17:57 ` Hans van Kranenburg
2017-06-08 18:47 ` Roman Mamedov
1 sibling, 1 reply; 19+ messages in thread
From: Hans van Kranenburg @ 2017-06-08 17:57 UTC (permalink / raw)
To: linux-btrfs
Ehrm,
On 05/28/2017 02:59 AM, Hans van Kranenburg wrote:
> A small update...
>
> Original (long) message:
> https://www.spinics.net/lists/linux-btrfs/msg64446.html
>
> On 04/08/2017 10:19 PM, Hans van Kranenburg wrote:
>> [...]
>>
>> == But! The Meta Mummy returns! ==
>>
>> After changing to nossd, another thing happened. The expiry process,
>> which normally takes about 1.5 hour to remove ~2500 subvolumes (keeping
>> it queued up to a 100 orphans all the time), suddenly took the entire
>> rest of the day, not being done before the nightly backups had to start
>> again at 10PM...
>>
>> And the only thing it seemed to do is writing, writing, writing 100MB/s
>> all day long.
>
> This behaviour was observed with a 4.7.5 linux kernel.
>
> When running 4.9.25 now with -o nossd, this weird behaviour is gone. I
> have no idea what change between 4.7 and 4.9 is responsible for this,
> but it's good.
Ok, that hooray was a bit too early...
---- ----
There is an improvement with subvolume delete + nossd that is visible
between 4.7 and 4.9.
This example that I saved shows what happened when doing remount,nossd
on 4.7.8:
https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-06-08-xvdb-nossd-sub-del.png
That example filesystem has about 1.5TiB of small files (subversion
repositories) on it. Every 15 minutes, incremental changes are sent to
another location using send/receive (helped by btrbk), and snapshots
older than a day are removed.
When switching to nossd, the snapshot removals (also every 15 mins)
suddenly showed quite a lot more disk writes happening (metadata).
With 4.9.25, that effect on this one and smaller filesystems is gone.
The graphs look the same when switching to nossd.
---- ----
But still, on the large filesystem (>30TiB), removing
subvolumes/snapshots takes like >10x the time (and metadata write IO)
with nossd compared to ssd.
An example:
https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-06-08-big-expire-ssd-nossd.png
With -o nossd, I was able to remove 900 subvolumes (varying fs tree
sizes) in about 17 hours, doing sustained 100MB/s writes to disk.
When switching to -o ssd, I was able to remove 4300 of them within 4
hours, with way less disk write activity.
So, I'm still suspecting it's simply the SZ_64K vs SZ_2M difference for
metadata *empty_cluster that is making this huge difference, and that
the absurd metadata overhead is generated because of the fact that the
extent tree is tracked inside the extent tree itself.
To gather proof of this, and to research the effect of different
settings and patches (like playing with the empty_cluster values, the
shift to left page patch, bulk csum delete, etc.), I need to be able to
measure some things first.
So, my current idea is to put per tree (all fs trees combined under 5)
cow counters in, exposed via sysfs, so that I can create munin cow rate
graphs per filesystem. Currently, I put the python-to-C btrfs-progs
bindings project aside again, and am teaching myself enough to get this
done first. :) Free time is a bit limited nowadays, but progress is steady.
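Roughly the shape of what I have in mind for the graphing side
(completely hypothetical sketch: the sysfs file, its path and its
'objectid count' per-line format don't exist yet):

# Hypothetical sketch: the sysfs counter file does not exist (yet); the
# path and the per-line format are made up for illustration only.
import time

def read_cow_counters(path):
    """Parse hypothetical '<tree objectid> <cow count>' lines into a dict."""
    counters = {}
    with open(path) as f:
        for line in f:
            tree, count = line.split()
            counters[tree] = int(count)
    return counters

def cow_rates(path, interval=300):
    """Yield per-tree cow operations per second, munin-style sampling."""
    previous = read_cow_counters(path)
    while True:
        time.sleep(interval)
        current = read_cow_counters(path)
        yield {tree: (count - previous.get(tree, 0)) / interval
               for tree, count in current.items()}
        previous = current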
To be continued...
>> == So, what do we want? ssd? nossd? ==
>>
>> Well, both don't do it for me. I want my expensive NetApp disk space to
>> be filled up, without requiring me to clean up after it all the time
>> using painful balance actions and I want to quickly get rid of old
>> snapshots.
>>
>> So currently, there's two mount -o remount statements before and after
>> doing the expiries...
>
> With 4.9+ now, it stays on nossd for sure, everywhere. :)
Nope, the daily remounts are back again, well only on the biggest
filesystems. :@
--
Hans van Kranenburg
* Re: About free space fragmentation, metadata write amplification and (no)ssd
2017-06-08 17:57 ` Hans van Kranenburg
@ 2017-06-08 18:47 ` Roman Mamedov
2017-06-08 19:19 ` Hans van Kranenburg
0 siblings, 1 reply; 19+ messages in thread
From: Roman Mamedov @ 2017-06-08 18:47 UTC (permalink / raw)
To: Hans van Kranenburg; +Cc: linux-btrfs
On Thu, 8 Jun 2017 19:57:10 +0200
Hans van Kranenburg <hans.van.kranenburg@mendix.com> wrote:
> There is an improvement with subvolume delete + nossd that is visible
> between 4.7 and 4.9.
I don't remember if I asked before, but did you test on 4.4? The two latest
longterm series are 4.9 and 4.4. 4.7 should be abandoned and forgotten by now
really, certainly not used daily in production, it's not even listed on
kernel.org anymore. Also it's possible the 4.7 branch that you test did not
receive all the bugfix backports from mainline like the longterm series do.
> I have no idea what change between 4.7 and 4.9 is responsible for this, but
> it's good.
FWIW, this appears to be the big Btrfs change between 4.7 and 4.9 (in 4.8):
Btrfs: introduce ticketed enospc infrastructure
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=957780eb2788d8c218d539e19a85653f51a96dc1
--
With respect,
Roman
* Re: About free space fragmentation, metadata write amplification and (no)ssd
2017-06-08 18:47 ` Roman Mamedov
@ 2017-06-08 19:19 ` Hans van Kranenburg
0 siblings, 0 replies; 19+ messages in thread
From: Hans van Kranenburg @ 2017-06-08 19:19 UTC (permalink / raw)
To: Roman Mamedov; +Cc: linux-btrfs
On 06/08/2017 08:47 PM, Roman Mamedov wrote:
> On Thu, 8 Jun 2017 19:57:10 +0200
> Hans van Kranenburg <hans.van.kranenburg@mendix.com> wrote:
>
>> There is an improvement with subvolume delete + nossd that is visible
>> between 4.7 and 4.9.
>
> I don't remember if I asked before, but did you test on 4.4?
No, I jumped from 3.16 lts (debian) to 4.7.8 to 4.9.25 now. I haven't
been building my own (yet), it's all debian kernels.
The biggest improvement I needed was the free space tree (>=4.5),
because with 3.16 transaction commit disk write IO was going through the
roof, blocking the fs for too long every few seconds. 4.7.8 was about
the first kernel that I tested which I couldn't too easily get to
explode and corrupt file systems. The 3.16 lts was (is) a really stable
kernel for btrfs.
> The two latest
> longterm series are 4.9 and 4.4. 4.7 should be abandoned and forgotten by now
> really, certainly not used daily in production,
I know, I know. They're already gone now. :)
> it's not even listed on
> kernel.org anymore. Also it's possible the 4.7 branch that you test did not
> receive all the bugfix backports from mainline like the longterm series do.
Well, I wouldn't say "all" the bugfixes, looking at the history of
fs/btrfs in current 4.9. It's more like.. sporadically, someone might
take time to also think about the longterm kernel. ;-)
>> I have no idea what change between 4.7 and 4.9 is responsible for this, but
>> it's good.
>
> FWIW, this appears to be the big Btrfs change between 4.7 and 4.9 (in 4.8):
>
> Btrfs: introduce ticketed enospc infrastructure
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=957780eb2788d8c218d539e19a85653f51a96dc1
Since that part of the problem is gone now, I don't think it makes sense
any more to spend time to find where it improved...
--
Hans van Kranenburg
Thread overview: 19+ messages:
2017-04-08 20:19 About free space fragmentation, metadata write amplification and (no)ssd Hans van Kranenburg
2017-04-08 21:55 ` Peter Grandi
2017-04-09 0:21 ` Hans van Kranenburg
2017-04-09 0:39 ` Hans van Kranenburg
2017-04-09 3:14 ` Kai Krakow
2017-04-09 20:48 ` Hans van Kranenburg
2017-04-09 6:38 ` Paul Jones
2017-04-09 8:43 ` Roman Mamedov
2017-04-09 18:10 ` Chris Murphy
2017-04-09 20:15 ` Hans van Kranenburg
2017-04-10 12:23 ` Austin S. Hemmelgarn
2017-04-10 22:59 ` Hans van Kranenburg
2017-04-11 11:33 ` Austin S. Hemmelgarn
2017-04-11 13:13 ` Kai Krakow
2017-05-28 0:59 ` Hans van Kranenburg
2017-05-28 3:54 ` Duncan
2017-06-08 17:57 ` Hans van Kranenburg
2017-06-08 18:47 ` Roman Mamedov
2017-06-08 19:19 ` Hans van Kranenburg