From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Cc: Matthew Warren <matthewwarren101010@gmail.com>,
dsterba@suse.cz, linux-btrfs@vger.kernel.org
Subject: Re: Changing the checksum algorithm on an existing BTRFS filesystem
Date: Tue, 5 Oct 2021 14:00:15 +0800 [thread overview]
Message-ID: <4be6e096-b617-41a1-e535-296b6949b765@gmx.com> (raw)
In-Reply-To: <20211005032621.GQ29026@hungrycats.org>
On 2021/10/5 11:26, Zygo Blaxell wrote:
> On Tue, Oct 05, 2021 at 06:54:27AM +0800, Qu Wenruo wrote:
>>
>>
>> On 2021/10/5 00:01, Matthew Warren wrote:
>>> I don't know how btrfs is laid out internally, but is the checksum tree
>>> separate from the file (meta)data, or is it part of the (meta)data? If it's
>>> separate, it should be possible to build a second csum tree and then
>>> replace the old one once it's done, right?
>>
>> There are several problems, even for off-line convert.
>
> I worked out something of a plan to do an online conversion, but there
> are some unfortunate issues with the btrfs on-disk design that make online
> conversion harder than it has to be. It would be easier to do an online
> conversion to a new on-disk format that supports multiple csum types,
> than to convert from one of the existing csum types to another.
>
>> 1) Metadata checksum
>> Unlike data checksum, metadata checksum is inlined into the metadata
>> header.
>> Thus there is no way to make the csum co-exist for metadata, and we
>> need to re-write (CoW) all metadata of the fs to convert them to the
>> new algo.
>
> Metadata is the easy case. In the metadata pages the csum is a
> fixed-width field. There is also a transid/generation field on each page.
> So if you know that the csum was changed from crc32c to xxhash on transid
> 123456 (and you'd know this because the online conversion would store
> this in an item at the beginning of the conversion process), then any
> time you read a page with a transid below 123456 you check crc32c, and any
> time you read a page with transid above 123456 you check xxhash, and any
> time you write a page you use xxhash.
This means we need a way to record such info on-disk (source checksum
algo, dest checksum algo, and the transid at which the convert started).
Thus a format change is needed (which is not really a big deal compared
to the other things involved).
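As a sketch, the transid-based selection described above could look like
the following (hypothetical names, not the real kernel API; the real code
would read the conversion transid from the on-disk item mentioned above):

```c
#include <stdint.h>

/* Hypothetical csum type ids, standing in for BTRFS_CSUM_TYPE_*. */
enum csum_type {
	CSUM_CRC32C = 0,
	CSUM_XXHASH = 1,
};

/*
 * Pick which algorithm to verify a metadata page with, given the
 * transid stored in the page header and the transid at which the
 * conversion started (recorded on-disk when conversion begins).
 */
static enum csum_type csum_algo_for_page(uint64_t page_transid,
					 uint64_t convert_transid,
					 enum csum_type old_algo,
					 enum csum_type new_algo)
{
	/* Pages last written before the conversion keep the old algo. */
	if (page_transid < convert_transid)
		return old_algo;
	/* Pages written during or after the conversion use the new algo. */
	return new_algo;
}
```

Whether the boundary transid itself counts as "old" or "new" is a detail
the on-disk format would have to pin down; the sketch treats it as "new".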
> Once conversion is set up, run a
> metadata balance, and all the metadata pages are converted to use the new
> csum algorithm. A few other places (chunk headers and superblocks) need
> to be handled separately. Allow only one conversion at a time to keep
> things simple;
Well, even a single conversion is already going to be painful...
> otherwise, we'd need to maintain a list of transids and
> csum algorithms to figure out which to use, or we'd have to brute-force
> all possible algorithms until we figure out which one a metadata page
> is using. When all the metadata block groups have been balanced, the
> conversion is done, and future csum checks use only the new algorithm.
Yes, that's it, we need something like a metadata balance to do that.
But I doubt whether we should do it in kernel space.
>
> An offline conversion tool can read, verify, compute, and write the new
> csum algorithm in place,
For in-place convert, it still needs some way to record the source/dest
algos.
It would work just like the UUID-changing tool, and would be my
preferred way to do it.
> as there is a fixed-length 32 bytes struct in
> each metadata page for csum. It would need to do some kind of data
> journalling to be crash-safe, but that's not difficult to arrange
Either we go full CoW, or we don't care and go with a trial-and-error
approach.
The only thing we need to know is the source and dest algos; just two
tries should be enough.
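The trial-and-error check is trivial when only two algorithms are
possible. A toy sketch (the two hash functions here are simple
stand-ins, not real crc32c/xxhash):

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the source algorithm (not real crc32c). */
static uint32_t algo_a(const uint8_t *d, size_t n)
{
	uint32_t h = 0;

	for (size_t i = 0; i < n; i++)
		h = h * 31 + d[i];
	return h;
}

/* Toy stand-in for the destination algorithm (FNV-1a style, not real xxhash). */
static uint32_t algo_b(const uint8_t *d, size_t n)
{
	uint32_t h = 0x811c9dc5;

	for (size_t i = 0; i < n; i++)
		h = (h ^ d[i]) * 16777619u;
	return h;
}

/*
 * Trial-and-error verification: with only the source and dest algos
 * of a single in-flight conversion, two tries suffice.
 * Returns 0 if either algorithm matches the stored csum, -1 otherwise.
 */
static int verify_either(const uint8_t *d, size_t n, uint32_t stored)
{
	if (algo_a(d, n) == stored)
		return 0;
	if (algo_b(d, n) == stored)
		return 0;
	return -1;
}
```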
> (and in practice it can be YOLOed with no journal most of the time,
> using dup or raid1 metadata to fix any blocks mangled by a crash).
>
>> 2) Superblock flags/csum_type
>> Currently we have only one csum_type, which is shared by both data
>> and metadata.
>>
>> We need extra csum_type to indicate the destination csum algo.
>> We will also need to sync this bit to kernel, no matter if the kernel
>> really uses that.
>
> In the data csum tree it's a bit awkward. Although there are 136 bits
> of (objectid, key, offset) for each csum item, for historical reasons
> the objectid and key fields are constant, leaving no room to identify
> the csum type in the item key (and we don't want to use any item data
> bytes for this as it will make the items larger). We could create a
> new csum tree with a different root, or use the objectid in the key to
> identify the csum algorithm, but this requires completely new multi-tree
> code to read and modify multiple csum trees during conversion.
Yes, that's the problem.
> It would
> have to ensure there are no overlapping csums in two trees (or distant
> locations in a single tree, which is just as bad) at the same time.
> Every read would require potentially two tree lookups--all at places
> where there have historically been bugs in the csum tree implementation.
> This double-lookup and double-update (and quadruple-kernel-bug-fixing)
> would have to continue until the last item in the old csum tree is
> deleted.
>
> Ideally a future csum tree (csum_tree_v2?) would have keys of (objectid =
> bytenr of data block, type = CSUM_ITEM_KEY, offset = csum algorithm).
This indeed sounds pretty interesting.
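For illustration, the proposed csum_tree_v2 keying could be sketched
like this (simplified key struct and placeholder constants, not the real
on-disk values):

```c
#include <stdint.h>

/* Simplified view of a btrfs key for this sketch. */
struct sketch_key {
	uint64_t objectid;
	uint8_t  type;
	uint64_t offset;
};

/* Placeholder constants; not the real on-disk values. */
#define SKETCH_CSUM_ITEM_KEY	0x80
#define SKETCH_CSUM_CRC32C	0
#define SKETCH_CSUM_XXHASH	1

/*
 * csum_tree_v2 keying as proposed: the data bytenr goes in objectid
 * and the csum algorithm in offset, so csums for adjacent blocks stay
 * adjacent in key order while still carrying the algorithm.
 */
static struct sketch_key csum_v2_key(uint64_t bytenr, uint64_t algo)
{
	struct sketch_key k = {
		.objectid = bytenr,
		.type = SKETCH_CSUM_ITEM_KEY,
		.offset = algo,
	};

	return k;
}
```

Since objectid is the primary sort key, lookups by bytenr keep the same
locality as the current csum tree.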
> Then the existing csum tree code can be used more or less unmodified,
> except for the detail that each csum item might contain csums from
> any one of the csum algorithms. This is a simpler modification to the
> csum tree code, since it uses only one tree, adjacent block csums are
> adjacent in the tree key order, and lookup performance doesn't change.
> The gotcha is that doing this requires a less-simple modification to the
> fsync log tree code, which is apparently leveraging the existing csum item
> key format to share a lot of code between fsync log tree and csum tree.
> Also a single extent might have csums of different sizes, which will
> complicate code that deals with e.g. printing csum values.
>
> The advantage of csum_tree_v2 is that once the tree is converted to
> a multi-csum tree format, it's much easier/faster to convert it again
> (e.g. maybe you decide sha256 is too slow for you and you want to go
> back to xxhash or blake2b), and you can mkfs new filesystems using
> csum_tree_v2 from the beginning for online conversion later.
That's why I'd prefer the offline convert.
We only need to build the new csum tree, without bothering with all the
runtime csum lookup paths.
Then we just rename the objectid after finishing the csum tree update,
and finally drop the old csum tree.
>
> Any change in the tree organization probably means we have to solve the
> multi-tree problem above anyway, since whatever new tree format we choose
> to support multiple csums, the existing csum tree isn't laid out in the
> same format. There might be a way out of this if we borrow an idea from
> space_cache=v2, and convert the csum tree to csum_tree_v2 format during
> mount, without changing the csum algorithm at the same time. This would
> be relatively fast (compared to reading all the data blocks) but it
> would eliminate the need to update multiple trees online (conversion
> would simply copy the csums to a new tree and delete the old one, while
> nothing is allowed to write to the filesystem). After the csum tree
> is converted to csum_tree_v2, the kernel would be able to complete the
> conversion online.
>
> Regardless of how multiple csum algorithms are supported, a data balance
> would be sufficient to recompute all of the data csums by brute-force
> reading (checking old csum) and writing (computing new csum) the data.
> This is very wasteful as it rewrites many (but not all) of the metadata
> trees again, as well as all of the data blocks, but it would work and
> could be used until a proper conversion tool exists.
>
> It might be possible to do something like scrub, but only reading the
> data blocks and writing the csum trees. Except not really scrub--I don't
> think any of the existing scrub code can be used this way. Maybe a
> variation of defrag that only rewrites csums? It might be easier to
> build this component from scratch--none of the existing building blocks
> in the kernel really fits.
>
> Offline conversion of the existing csum tree from one csum algorithm
> to another is non-trivial, since the csum data is packed into pages
> dynamically. Changing from any csum algorithm to one with a different
> length will require not only rewriting the content of the csum tree pages,
> but also allocating a new tree structures to store them.
For offline convert, we need to do regular tree CoW to create a new
csum tree, and rely on CoW to protect against power loss or
interruption.
The good thing is, we can easily find where we left off, thus no extra
on-disk format is needed to record the progress.
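The resume logic can be that simple. A sketch (hypothetical helper; the
real tool would walk the partially built btree rather than take an
array, but the idea is the same):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Resume point for the offline tool: look at the last key already
 * written into the new csum tree and continue just past it, so no
 * extra on-disk progress record is needed. new_tree_bytenrs is the
 * sorted list of data bytenrs already covered by the new tree
 * (a stand-in for walking the btree to its last key).
 */
static uint64_t resume_bytenr(const uint64_t *new_tree_bytenrs, size_t n,
			      uint64_t block_size)
{
	if (n == 0)
		return 0; /* nothing converted yet: start from the beginning */
	/* Continue after the last block already covered. */
	return new_tree_bytenrs[n - 1] + block_size;
}
```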
>
> Offline conversion to csum_tree_v2 is easy since it changes only the
> arrangement of bits in the btrfs key structure, not the size of anything.
> It can be done in place the same way metadata can be converted in place.
If we go offline, then we don't need online convert, and thus no need
for csum_tree_v2 for conversion purposes.
Thanks,
Qu
>
>> Thanks,
>> Qu
>>
>>>
>>> Matthew Warren
>>>
>>> On Mon, Oct 4, 2021 at 4:52 AM David Sterba <dsterba@suse.cz> wrote:
>>>>
>>>> On Mon, Oct 04, 2021 at 12:26:16AM -0500, Matthew Warren wrote:
>>>>> Is there currently any way to change the checksum used by btrfs
>>>>> without recreating the filesystem and copying data to the new fs?
>>>>
>>>> I have a WIP patch but it's buggy. It works on small filesystems but
>>>> when I tried it on TB-sized images it blew up somewhere. Also the WIP is
>>>> not too smart: it deletes the whole checksum tree and rebuilds it from
>>>> the on-disk data, so it lacks verification against the existing
>>>> csums. I intend to debug it some day, but it's a nice-to-have feature.
>>>> I'm aware that people have been asking for it, but at this point it would
>>>> be too dangerous to provide even the prototype.
Thread overview: 8+ messages
2021-10-04 5:26 Changing the checksum algorithm on an existing BTRFS filesystem Matthew Warren
2021-10-04 6:20 ` Nikolay Borisov
2021-10-04 9:51 ` David Sterba
2021-10-04 16:01 ` Matthew Warren
2021-10-04 20:37 ` Chris Murphy
2021-10-04 22:54 ` Qu Wenruo
2021-10-05 3:26 ` Zygo Blaxell
2021-10-05 6:00 ` Qu Wenruo [this message]