From: Boris Burkov <boris@bur.io>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>
Cc: Qu Wenruo <wqu@suse.com>, linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 2/3] btrfs-progs: mkfs: rework how we traverse rootdir
Date: Wed, 31 Jul 2024 16:42:31 -0700 [thread overview]
Message-ID: <20240731234231.GA3809836@zen.localdomain> (raw)
In-Reply-To: <dd3444f9-c8be-46e7-97b1-8f95a161c709@gmx.com>
On Thu, Aug 01, 2024 at 08:49:39AM +0930, Qu Wenruo wrote:
>
>
> 在 2024/8/1 08:29, Boris Burkov 写道:
> > On Wed, Jul 31, 2024 at 07:08:47PM +0930, Qu Wenruo wrote:
> > > There are several hidden pitfalls of the existing traverse_directory():
> > >
> > > - Hand written preorder traversal
> > > Meanwhile there is already a better written standard library function,
> > > nftw() doing exactly what we need.
> >
> > This is great!
> >
> > >
> > > - Over-designed path list
> > > To properly handle the directory change, we have structure
> > > directory_name_entry, to record every inode until rootdir.
> > >
> > > But it has two string members, dir_name and path, which is a little
> > > overkilled.
> > > As for preorder traversal, we will never need to read the parent's
> > > filename, just its inode number.
> > >
> > > And it's exported while no one utilizes it out of mkfs/rootdir.c.
> > >
> > > - Weird inode numbers
> > > We use the inode number from st->st_ino, with an extra offset.
> > > This by itself is not safe, if the rootdir has child directory in
> > > another filesystem.
> >
> > Can you explain what you mean by not safe? As far as I can tell, the
> > +256 is to handle that particular invariant of btrfs. Are you worried
> > about duplicate inode numbers in the vein of the usual st_dev/st_ino
> > debates?
>
> I'm worried about this case:
>
> rootdir (on fs1)
> |- dir1
> | |- file1 (fs1, ino 1024)
> |- dir2 (on fs2)
> |- file2 (fs2, ino 1024)
>
> We do not have any cross mount point checks.
>
> I created a case like this:
>
> # fallocate -l 1G 1.img
> # fallocate -l 1G 2.img
> # mkfs.ext4 1.img
> # mkfs.ext4 2.img
> # mount 1.img mnt
> # mkdir mnt/dir1 mnt/dir2
> # touch mnt/dir1/file1 mnt/file1
> # umount mnt
> # mount 2.img mnt
> # mkdir mnt/dir1 mnt/dir2
> # touch mnt/dir1/file1 mnt/file1
> # umount mnt
> # mkdir rootdir
> # mount 1.img rootdir
> # mount 2.img rootdir/dir2
> # mkfs.btrfs -f --rootdir rootdir test.img
> ERROR: item file1 already exists but has wrong st_nlink 1 <= 1
> ERROR: unable to traverse directory rootdir/: 1
> ERROR: error while filling filesystem: 1
>
> And it failed as expeceted.
>
Perfect example, it did seem to be vulnerable to this with just
"original st_ino + 256"... Thanks for the new test.
> >
> > In that case, if this is a bugfix, I think it makes sense to write a
> > regression test that highlights it. It would be nice to pull the fix
> > out of the refactor, too, but I suppose that with such a deep refactor,
> > that might not be possible/worth it.
>
> Unfortunately I didn't even notice how serious these problems are, until
> I tried...
That's kind of my concern with the higher level testing question, that
we don't have much coverage for such a big refactor. OTOH, if we already
had such serious bugs, how bad could the refactor really be :)
>
> Will add two new test cases. One is the above one, the other is the
> hardlink out of rootdir bug I mentioned in 3/3.
>
> >
> > >
> > > And this results very weird inode numbers, e.g:
> > >
> > > item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
> > > item 6 key (88347519 INODE_ITEM 0) itemoff 15815 itemsize 160
> > > item 16 key (88347520 INODE_ITEM 0) itemoff 15363 itemsize 160
> > > item 20 key (88347521 INODE_ITEM 0) itemoff 15119 itemsize 160
> > > item 24 key (88347522 INODE_ITEM 0) itemoff 14875 itemsize 160
> > > item 26 key (88347523 INODE_ITEM 0) itemoff 14700 itemsize 160
> > > item 28 key (88347524 INODE_ITEM 0) itemoff 14525 itemsize 160
> > > item 30 key (88347557 INODE_ITEM 0) itemoff 14350 itemsize 160
> > > item 32 key (88347566 INODE_ITEM 0) itemoff 14175 itemsize 160
> > >
> > > Which is far from a regular fs created by copying the data.
> >
> > +1, especially since it doesn't even *preserve* the inode numbers, which
> > would be a comprehensible (if not working) behavior. Creating the oids
> > in "btrfs order" seems like a nice improvement!
> >
> > >
> > > - Weird directory inode size calculation
> > > Unlike kernel, which updated the directory inode size every time new
> > > child inodes are added, we calculate the directory inode size by
> > > searching all its children first, then later new inodes linked to this
> > > directory won't touch the inode size.
> > >
> > > Enhance all these points by:
> > >
> > > - Use nftw() to do the preorder traversal
> > > It also provides the extra level detection, which is pretty handy.
> > >
> > > - Use a simple local inode_entry to record each parent
> > > The only value is a u64 to record the inode number.
> > > And one simple rootdir_path structure to record the list of
> > > inode_entry, alone with the current level.
> > >
> > > This rootdir_path structure along with two helpers,
> > > rootdir_path_push() and rootdir_path_pop(), along with the
> > > preorder traversal provided by nftw(), are enough for us to record
> > > all the parent directories until the rootdir.
> > >
> > > - Grab new inode number properly
> > > Just call btrfs_get_free_objectid() to grab a proper inode number,
> > > other than using some weird calculated value.
> > >
> > > - Use btrfs_insert_inode() and btrfs_add_link() to update directory
> > > inode automatically
> > >
> > > With all the refactor, the code is shorter and easier to read.
> >
> > Agreed on all counts, I think this is a lovely refactor.
> >
> > For a change of this magnitude, I think it is helpful to justify the
> > correctness of the change by documenting the testing plan. Are there
> > existing unit tests of mkfs that cover all the cases we are modifying
> > here?
>
> We have existing rootdir test cases, from the regular functional tests,
> to corner cases.
>
> But we lack corner cases of corner cases, like duplicating inode number
> (cross mnt) and hard links.
>
> > Is there some --rootdir master case that covers everything?
>
> It's inside tests/mkfs-tests/
>
> A quick grep shows:
>
> 004-rootdir-keeps-size
> 009-special-files-for-rootdir
> 011-rootdir-create-file
> 012-rootdir-no-shrink
> 014-rootdir-inline-extent
> 016-rootdir-bad-symbolic-link
> 021-rfeatures-quota-rootdir
> 022-rootdir-size
> 027-rootdir-inode
Thanks for collecting those. By scanning those, it's hard to tell if
there is, for example, a test of an "interesting" directory structure
like the one I drew above, to test the traversal, for example.
Out of curiousity, I applied your patches, ran the tests (pass) then
put a bug in the traversal code (only pop on cur_lvl > ftw_level, not
ftw_level - 1) and 004-rootdir-keeps-size did fail, as it ran on the
Documentation/ dir, which does seem like a sufficiently interesting
directory.
I feel a bit better now :)
Maybe something that ran after fsstress (or however we test send/recv?)
in the long run would be a good test, as we nail down some of the bugs
you have hit here.
This comment is not directed at you, but is more general for btrfs
development: I think explicitly stating the testing conducted on a
complex patch is really helpful for reviewers.
>
>
> > Is it
> > in fstests? Knowing that all of the new code has at least run correctly
> > would go a long way to feeling confident in the details of the
> > transformation.
>
> We have btrfs-progs github CI, runs the all the selftests on each PR.
> And it's all green (except a typo caught by code spell).
>
> [...]
> > > - parent_inum = highest_inum + BTRFS_FIRST_FREE_OBJECTID;
> > > - dir_entry->inum = parent_inum;
> > > - list_add_tail(&dir_entry->list, &dir_head->list);
> > > + /*
> > > + * If our path level is higher than this level - 1, this means
> > > + * we have changed directory.
> > > + * Poping out the unrelated inodes.
> >
> > Popping
>
> Exposed by the CI, and fixed immediately in github.
>
Cool!
> >
> > Also, this took me like 15 minutes of working examples to figure out why
> > it worked. I think it could definitely do with some deeper explanation
> > of the invariant, like:
> >
> > ftwbuf->level - 1 is the level of the parent of the current traversed
> > inode. ftw will traverse all inodes in a directory before leaving it,
> > and will never traverse an inode without first traversing its dir,
> > so if we see a level less than or equal to the last directory we saw,
> > we are finished with that directory and can pop it.
> >
> > Perhaps with an annotated drawing like:
> >
> > 0: /
> > 1: /foo/
> > 2: /foo/bar/
> > 3: /foo/bar/f
> > 2: /foo/baz/ POP! 2 > (2 - 1); done with /foo/bar/
> > 3: /foo/baz/g
> > 1: /h POP! 2 > (1 - 1); done with /foo/baz/
> >
> > To help make it clearer. I honestly even think just changing to >= makes
> > it clearer? Not sure about that.
>
> I'll add comments with an example to explain the workflow.
>
> BTW, David and I are working with Github review system a lot recently:
> https://github.com/kdave/btrfs-progs/pull/855
>
> We do not force anyone to use specific system to do anything, but you
> may find it a little easier to comment, and feel free to fall back to
> the mail based review workflow at any time.
Happy to do code review in PRs! It's a bit annoying that the history
gets blown up when the author re-pushes the branch after incorporating
feeback (never understood that design, Phabricator tracks the *branch*
history too and it seems very helpful to me.)
BTW, if this review style is going well, we should discuss more and get
the workflow more documented/supported in the workflow docs.
Also, while we are still here in email-land, you can add
Reviewed-by: Boris Burkov <boris@bur.io>
Thanks,
Boris
>
> Thanks a lot for the detailed review!
> Qu
next prev parent reply other threads:[~2024-07-31 23:44 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-07-31 9:38 [PATCH 0/3] btrfs-progs: rework how we traverse rootdir Qu Wenruo
2024-07-31 9:38 ` [PATCH 1/3] btrfs-progs: constify the name parameter of btrfs_add_link() Qu Wenruo
2024-07-31 9:38 ` [PATCH 2/3] btrfs-progs: mkfs: rework how we traverse rootdir Qu Wenruo
2024-07-31 22:59 ` Boris Burkov
2024-07-31 23:19 ` Qu Wenruo
2024-07-31 23:42 ` Boris Burkov [this message]
2024-08-01 0:06 ` Qu Wenruo
2024-07-31 9:38 ` [PATCH 3/3] btrfs-progs: rootdir: reject hard links Qu Wenruo
2024-07-31 22:17 ` Boris Burkov
2024-07-31 22:55 ` Qu Wenruo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20240731234231.GA3809836@zen.localdomain \
--to=boris@bur.io \
--cc=linux-btrfs@vger.kernel.org \
--cc=quwenruo.btrfs@gmx.com \
--cc=wqu@suse.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox