Re: [PATCH 2/3] btrfs-progs: mkfs: rework how we traverse rootdir

public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed

From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Boris Burkov <boris@bur.io>, Qu Wenruo <wqu@suse.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 2/3] btrfs-progs: mkfs: rework how we traverse rootdir
Date: Thu, 1 Aug 2024 08:49:39 +0930	[thread overview]
Message-ID: <dd3444f9-c8be-46e7-97b1-8f95a161c709@gmx.com> (raw)
In-Reply-To: <20240731225935.GB3808023@zen.localdomain>



在 2024/8/1 08:29, Boris Burkov 写道:
> On Wed, Jul 31, 2024 at 07:08:47PM +0930, Qu Wenruo wrote:
>> There are several hidden pitfalls of the existing traverse_directory():
>>
>> - Hand written preorder traversal
>>    Meanwhile there is already a better written standard library function,
>>    nftw() doing exactly what we need.
>
> This is great!
>
>>
>> - Over-designed path list
>>    To properly handle the directory change, we have structure
>>    directory_name_entry, to record every inode until rootdir.
>>
>>    But it has two string members, dir_name and path, which is a little
>>    overkilled.
>>    As for preorder traversal, we will never need to read the parent's
>>    filename, just its inode number.
>>
>>    And it's exported while no one utilizes it out of mkfs/rootdir.c.
>>
>> - Weird inode numbers
>>    We use the inode number from st->st_ino, with an extra offset.
>>    This by itself is not safe, if the rootdir has child directory in
>>    another filesystem.
>
> Can you explain what you mean by not safe? As far as I can tell, the
> +256 is to handle that particular invariant of btrfs. Are you worried
> about duplicate inode numbers in the vein of the usual st_dev/st_ino
> debates?

I'm worried about this case:

rootdir (on fs1)
|- dir1
|  |- file1 (fs1, ino 1024)
|- dir2 (on fs2)
    |- file2 (fs2, ino 1024)

We do not have any cross mount point checks.

I created a case like this:

# fallocate -l 1G 1.img
# fallocate -l 1G 2.img
# mkfs.ext4 1.img
# mkfs.ext4 2.img
# mount 1.img mnt
# mkdir mnt/dir1 mnt/dir2
# touch mnt/dir1/file1 mnt/file1
# umount mnt
# mount 2.img mnt
# mkdir mnt/dir1 mnt/dir2
# touch mnt/dir1/file1 mnt/file1
# umount mnt
# mkdir rootdir
# mount 1.img rootdir
# mount 2.img rootdir/dir2
# mkfs.btrfs -f --rootdir rootdir test.img
ERROR: item file1 already exists but has wrong st_nlink 1 <= 1
ERROR: unable to traverse directory rootdir/: 1
ERROR: error while filling filesystem: 1

And it failed as expeceted.

>
> In that case, if this is a bugfix, I think it makes sense to write a
> regression test that highlights it. It would be nice to pull the fix
> out of the refactor, too, but I suppose that with such a deep refactor,
> that might not be possible/worth it.

Unfortunately I didn't even notice how serious these problems are, until
I tried...

Will add two new test cases. One is the above one, the other is the
hardlink out of rootdir bug I mentioned in 3/3.

>
>>
>>    And this results very weird inode numbers, e.g:
>>
>> 	item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
>> 	item 6 key (88347519 INODE_ITEM 0) itemoff 15815 itemsize 160
>> 	item 16 key (88347520 INODE_ITEM 0) itemoff 15363 itemsize 160
>> 	item 20 key (88347521 INODE_ITEM 0) itemoff 15119 itemsize 160
>> 	item 24 key (88347522 INODE_ITEM 0) itemoff 14875 itemsize 160
>> 	item 26 key (88347523 INODE_ITEM 0) itemoff 14700 itemsize 160
>> 	item 28 key (88347524 INODE_ITEM 0) itemoff 14525 itemsize 160
>> 	item 30 key (88347557 INODE_ITEM 0) itemoff 14350 itemsize 160
>> 	item 32 key (88347566 INODE_ITEM 0) itemoff 14175 itemsize 160
>>
>>    Which is far from a regular fs created by copying the data.
>
> +1, especially since it doesn't even *preserve* the inode numbers, which
> would be a comprehensible (if not working) behavior. Creating the oids
> in "btrfs order" seems like a nice improvement!
>
>>
>> - Weird directory inode size calculation
>>    Unlike kernel, which updated the directory inode size every time new
>>    child inodes are added, we calculate the directory inode size by
>>    searching all its children first, then later new inodes linked to this
>>    directory won't touch the inode size.
>>
>> Enhance all these points by:
>>
>> - Use nftw() to do the preorder traversal
>>    It also provides the extra level detection, which is pretty handy.
>>
>> - Use a simple local inode_entry to record each parent
>>    The only value is a u64 to record the inode number.
>>    And one simple rootdir_path structure to record the list of
>>    inode_entry, alone with the current level.
>>
>>    This rootdir_path structure along with two helpers,
>>    rootdir_path_push() and rootdir_path_pop(), along with the
>>    preorder traversal provided by nftw(), are enough for us to record
>>    all the parent directories until the rootdir.
>>
>> - Grab new inode number properly
>>    Just call btrfs_get_free_objectid() to grab a proper inode number,
>>    other than using some weird calculated value.
>>
>> - Use btrfs_insert_inode() and btrfs_add_link() to update directory
>>    inode automatically
>>
>> With all the refactor, the code is shorter and easier to read.
>
> Agreed on all counts, I think this is a lovely refactor.
>
> For a change of this magnitude, I think it is helpful to justify the
> correctness of the change by documenting the testing plan. Are there
> existing unit tests of mkfs that cover all the cases we are modifying
> here?

We have existing rootdir test cases, from the regular functional tests,
to corner cases.

But we lack corner cases of corner cases, like duplicating inode number
(cross mnt) and hard links.

> Is there some --rootdir master case that covers everything?

It's inside tests/mkfs-tests/

A quick grep shows:

004-rootdir-keeps-size
009-special-files-for-rootdir
011-rootdir-create-file
012-rootdir-no-shrink
014-rootdir-inline-extent
016-rootdir-bad-symbolic-link
021-rfeatures-quota-rootdir
022-rootdir-size
027-rootdir-inode


> Is it
> in fstests? Knowing that all of the new code has at least run correctly
> would go a long way to feeling confident in the details of the
> transformation.

We have btrfs-progs github CI, runs the all the selftests on each PR.
And it's all green (except a typo caught by code spell).

[...]
>> -	parent_inum = highest_inum + BTRFS_FIRST_FREE_OBJECTID;
>> -	dir_entry->inum = parent_inum;
>> -	list_add_tail(&dir_entry->list, &dir_head->list);
>> +	/*
>> +	 * If our path level is higher than this level - 1, this means
>> +	 * we have changed directory.
>> +	 * Poping out the unrelated inodes.
>
> Popping

Exposed by the CI, and fixed immediately in github.

>
> Also, this took me like 15 minutes of working examples to figure out why
> it worked. I think it could definitely do with some deeper explanation
> of the invariant, like:
>
> ftwbuf->level - 1 is the level of the parent of the current traversed
> inode. ftw will traverse all inodes in a directory before leaving it,
> and will never traverse an inode without first traversing its dir,
> so if we see a level less than or equal to the last directory we saw,
> we are finished with that directory and can pop it.
>
> Perhaps with an annotated drawing like:
>
> 0: /
> 1:         /foo/
> 2:                 /foo/bar/
> 3:                         /foo/bar/f
> 2:                 /foo/baz/            POP! 2 > (2 - 1); done with /foo/bar/
> 3:                         /foo/baz/g
> 1:         /h                           POP! 2 > (1 - 1); done with /foo/baz/
>
> To help make it clearer. I honestly even think just changing to >= makes
> it clearer? Not sure about that.

I'll add comments with an example to explain the workflow.

BTW, David and I are working with Github review system a lot recently:
https://github.com/kdave/btrfs-progs/pull/855

We do not force anyone to use specific system to do anything, but you
may find it a little easier to comment, and feel free to fall back to
the mail based review workflow at any time.

Thanks a lot for the detailed review!
Qu

next prev parent reply	other threads:[~2024-07-31 23:19 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-07-31  9:38 [PATCH 0/3] btrfs-progs: rework how we traverse rootdir Qu Wenruo
2024-07-31  9:38 ` [PATCH 1/3] btrfs-progs: constify the name parameter of btrfs_add_link() Qu Wenruo
2024-07-31  9:38 ` [PATCH 2/3] btrfs-progs: mkfs: rework how we traverse rootdir Qu Wenruo
2024-07-31 22:59   ` Boris Burkov
2024-07-31 23:19     ` Qu Wenruo [this message]
2024-07-31 23:42       ` Boris Burkov
2024-08-01  0:06         ` Qu Wenruo
2024-07-31  9:38 ` [PATCH 3/3] btrfs-progs: rootdir: reject hard links Qu Wenruo
2024-07-31 22:17   ` Boris Burkov
2024-07-31 22:55     ` Qu Wenruo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=dd3444f9-c8be-46e7-97b1-8f95a161c709@gmx.com \
    --to=quwenruo.btrfs@gmx.com \
    --cc=boris@bur.io \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=wqu@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox