From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: Phillip Susi <phill@thesusis.net>
Cc: Qu Wenruo <quwenruo.btrfs@gmx.com>,
Jan Ziak <0xe2.0x9a.0x9b@gmail.com>,
linux-btrfs@vger.kernel.org
Subject: Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit
Date: Tue, 15 Mar 2022 17:06:55 -0400
Message-ID: <YjD/7zhERFjcY5ZP@hungrycats.org>
In-Reply-To: <87fsnjnjxr.fsf@vps.thesusis.net>

On Tue, Mar 15, 2022 at 02:28:46PM -0400, Phillip Susi wrote:
> Zygo Blaxell <ce3g8jdj@umail.furryterror.org> writes:
>
> > btrfs extents are immutable, so the filesystem can't extend an existing
> > extent with new data. Instead, a new extent must be created that contains
> > both the old and new data to replace the old extent. At least one new
>
> Wait, what? How is an extent immutable? Why isn't a new tree written
> out with a larger extent and once the transaction commits, bam... you've
> enlarged your extent? Just like modifying any other data.

If the extent is compressed, you have to write a new extent, because
there's no other way to atomically update a compressed extent.

If it's reflinked or snapshotted, you can't overwrite the data in place
as long as a second reference to the data exists. This is what makes
nodatacow and prealloc slow--on every write, they have to check whether
the blocks being written are shared or not, and that check is expensive
because it's a linear search of every reference for overlapping block
ranges, and it can't exit the search early until it has proven there
are no shared references. Contrast with datacow, which allocates a new
unshared extent that it knows it can write to, and only has to check
overwritten extents when they are completely overwritten (and only has
to check for the existence of one reference, not enumerate them all).
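
As a rough illustration of that cost difference, here is a toy Python
model. It is not btrfs code: `Ref`, `nodatacow_can_overwrite` and
`old_extent_still_referenced` are made-up names, and references are
assumed to be a flat list of byte ranges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """One reference to a byte range of an extent (hypothetical model)."""
    owner: str   # file or snapshot holding the reference
    start: int   # first referenced byte within the extent
    length: int  # number of referenced bytes


def overlaps(ref, start, length):
    """True if ref covers any byte of [start, start + length)."""
    return ref.start < start + length and start < ref.start + ref.length


def nodatacow_can_overwrite(refs, writer, start, length):
    """In-place write check: is the range free of other owners?

    Returns (answer, references_visited). Answering "yes" means every
    reference was visited, because none of them can be skipped.
    """
    visited = 0
    for ref in refs:
        visited += 1
        if ref.owner != writer and overlaps(ref, start, length):
            return False, visited
    return True, visited


def old_extent_still_referenced(refs):
    """datacow check after a complete overwrite: does any reference exist?

    Stops at the first reference found instead of enumerating them all.
    """
    return next(iter(refs), None) is not None


refs = [Ref("a", 0, 4096), Ref("a", 4096, 4096), Ref("snap", 8192, 4096)]
print(nodatacow_can_overwrite(refs, "a", 0, 4096))  # (True, 3)
print(old_extent_still_referenced(refs))            # True
```

The point of the sketch is only the shape of the two checks: one has to
walk the whole list to say "not shared", the other needs one lookup.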

When a file refers to an extent, it refers to the entire extent from the
file's subvol tree, even if only a single byte of the extent is contained
in the file. There's no mechanism in btrfs extent tree v1 for atomically
replacing an extent with separately referenceable objects, and updating
all the pointers to parts of the old object to point to the new one.
Any such update could cascade into updates across all reflinks and
snapshots of the extent, so the write multiplier can be arbitrarily large.
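
To make the "entire extent" point concrete, a minimal sketch under
simplifying assumptions (the `Extent` class and its methods are
hypothetical, not a btrfs data structure): the space an extent occupies
stays allocated for as long as any reference exists, no matter how few
bytes that reference covers.

```python
from dataclasses import dataclass, field


@dataclass
class Extent:
    """Toy model of an extent and the references pointing into it."""
    length: int
    refs: list = field(default_factory=list)  # (owner, offset, length)

    def bytes_pinned(self):
        """Disk space held: all of the extent or none of it."""
        return self.length if self.refs else 0

    def bytes_referenced(self):
        """Bytes actually reachable from some file (union of ranges)."""
        covered = 0
        end_so_far = 0
        for off, length in sorted((o, l) for _, o, l in self.refs):
            start = max(off, end_so_far)
            end = off + length
            if end > start:
                covered += end - start
                end_so_far = end
        return covered


ext = Extent(length=1 << 20)          # a 1 MiB extent
ext.refs.append(("file", 0, 1))       # one file still uses a single byte
print(ext.bytes_referenced())         # 1
print(ext.bytes_pinned())             # 1048576
```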

There is an extent tree v2 project which provides for splitting
uncompressed extents (compressed extents are always immutable) by storing
all the overlapping references as objects in the extent tree. It does
reference tracking by creating an extent item for every referenced
block range, so changing one reference's position or length (e.g. by
overwriting or deleting part of an extent reference in a file) doesn't
affect any other reference. In theory it could also append to the end
of an existing extent, if that case ever came up.

That brings us to the next problem: mutable extents won't help with
the appending case without also teaching the allocator how to spread out
files all over the disk so there's physical space available at file EOF.
Normally in btrfs, if you write to 3 files, whatever you wrote is packed
into 3 physically contiguous and adjacent extents. If you then want
to append to the first or second file, you'll need a new extent, because
there's no physical space between the files.
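
The three-file case can be sketched with a toy allocator. This assumes a
simple "next free byte" allocation policy, which is a deliberate
simplification of the real allocator, and `PackedAllocator` is a
hypothetical name.

```python
class PackedAllocator:
    """Toy allocator that packs every new extent right after the last."""

    def __init__(self):
        self.next_free = 0
        self.extents = []  # (file name, physical start, length)

    def write(self, name, length):
        """Allocate a new extent for `name` at the next free position."""
        start = self.next_free
        self.extents.append((name, start, length))
        self.next_free += length
        return start

    def can_grow_in_place(self, name):
        """True if the byte after the file's last extent is still free."""
        _, start, length = [e for e in self.extents if e[0] == name][-1]
        return start + length == self.next_free


alloc = PackedAllocator()
for name in ("first", "second", "third"):
    alloc.write(name, 4096)

print(alloc.can_grow_in_place("first"))   # False
print(alloc.can_grow_in_place("second"))  # False
print(alloc.can_grow_in_place("third"))   # True
```

Only the most recently written file has free space after its EOF; the
other two are boxed in by their neighbours.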

> And do you mean to say that before the new data can be written, the old
> data must first be read in and moved to the new extent? That seems
> horridly inefficient.

Normally btrfs doesn't read anything when it writes. New writes create
new extents for the new data, and delete only extents that are completely
replaced by the new extents.
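
A minimal sketch of that rule, assuming a file is described only by the
logical ranges of the extents backing it (the `write` helper below is
illustrative and ignores sharing, compression and checksums):

```python
def write(file_extents, start, length):
    """Apply one write of [start, start + length) to a file.

    file_extents is a list of (logical_start, length) tuples, one per
    extent currently backing the file. No old data is read. Returns
    (kept, deleted): old extents that stay on disk, plus the new one,
    and old extents that were completely replaced.
    """
    end = start + length
    kept, deleted = [], []
    for ext_start, ext_len in file_extents:
        if start <= ext_start and ext_start + ext_len <= end:
            deleted.append((ext_start, ext_len))  # fully overwritten
        else:
            kept.append((ext_start, ext_len))     # still referenced
    kept.append((start, length))                  # the new extent
    return kept, deleted


old = [(0, 4096), (4096, 4096), (8192, 4096)]
kept, deleted = write(old, 2048, 8192)
print(deleted)  # [(4096, 4096)]
print(kept)     # [(0, 4096), (8192, 4096), (2048, 8192)]
```

In this example the file is 12288 bytes long but 16384 bytes of extents
remain on disk, because the two partially overwritten extents are kept
whole.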

A series of sequential small writes creates a lot of small extents,
and small extents are sometimes undesirable. Defrag gathers these
small extents when they are logically adjacent, reads them into memory,
writes a new physically contiguous extent to replace them, then deletes
the old extents. Autodefrag is a process that runs defrag on recently
written extents shortly after they are written.
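
The gather-and-rewrite step can be sketched like this. It is a toy
model: the `threshold` parameter and the `defrag` function are
assumptions made for illustration and do not correspond to the kernel's
actual tunables or code.

```python
def defrag(extents, threshold):
    """Merge runs of logically adjacent extents smaller than threshold.

    extents is a list of (logical_start, length) sorted by logical_start.
    Returns (new_extents, bytes_rewritten), where bytes_rewritten counts
    the data that had to be read and written again.
    """
    result = []
    rewritten = 0
    run = []  # current run of small, logically adjacent extents

    def flush():
        nonlocal rewritten
        if len(run) > 1:
            total = sum(length for _, length in run)
            result.append((run[0][0], total))  # one contiguous extent
            rewritten += total
        else:
            result.extend(run)                 # nothing to merge
        run.clear()

    for start, length in extents:
        adjacent = bool(run) and run[-1][0] + run[-1][1] == start
        if length < threshold and (not run or adjacent):
            run.append((start, length))
        else:
            flush()
            if length < threshold:
                run.append((start, length))
            else:
                result.append((start, length))
    flush()
    return result, rewritten


small = [(0, 4096), (4096, 4096), (8192, 4096), (12288, 1 << 20)]
print(defrag(small, threshold=256 * 1024))
# ([(0, 12288), (12288, 1048576)], 12288)
```

Every merged byte is counted in `bytes_rewritten`, which is the cost
side of defrag: the data is read and written a second time.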

Defrag isn't the only way to resolve the small-extents issue. If the
file is only read once (e.g. a log file that is rotated and compressed
with a high-performance compressor like xz) then defrag is a waste of
read/write cycles--it's better to leave the small fragments where they
are until they are deleted by an application.