public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Boris Burkov <boris@bur.io>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>
Cc: Qu Wenruo <wqu@suse.com>, linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 2/3] btrfs-progs: mkfs: rework how we traverse rootdir
Date: Wed, 31 Jul 2024 16:42:31 -0700	[thread overview]
Message-ID: <20240731234231.GA3809836@zen.localdomain> (raw)
In-Reply-To: <dd3444f9-c8be-46e7-97b1-8f95a161c709@gmx.com>

On Thu, Aug 01, 2024 at 08:49:39AM +0930, Qu Wenruo wrote:
> 
> 
> 在 2024/8/1 08:29, Boris Burkov 写道:
> > On Wed, Jul 31, 2024 at 07:08:47PM +0930, Qu Wenruo wrote:
> > > There are several hidden pitfalls of the existing traverse_directory():
> > > 
> > > - Hand written preorder traversal
> > >    Meanwhile there is already a better written standard library function,
> > >    nftw() doing exactly what we need.
> > 
> > This is great!
> > 
> > > 
> > > - Over-designed path list
> > >    To properly handle the directory change, we have structure
> > >    directory_name_entry, to record every inode until rootdir.
> > > 
> > >    But it has two string members, dir_name and path, which is a little
> > >    overkilled.
> > >    As for preorder traversal, we will never need to read the parent's
> > >    filename, just its inode number.
> > > 
> > >    And it's exported while no one utilizes it out of mkfs/rootdir.c.
> > > 
> > > - Weird inode numbers
> > >    We use the inode number from st->st_ino, with an extra offset.
> > >    This by itself is not safe, if the rootdir has child directory in
> > >    another filesystem.
> > 
> > Can you explain what you mean by not safe? As far as I can tell, the
> > +256 is to handle that particular invariant of btrfs. Are you worried
> > about duplicate inode numbers in the vein of the usual st_dev/st_ino
> > debates?
> 
> I'm worried about this case:
> 
> rootdir (on fs1)
> |- dir1
> |  |- file1 (fs1, ino 1024)
> |- dir2 (on fs2)
>    |- file2 (fs2, ino 1024)
> 
> We do not have any cross mount point checks.
> 
> I created a case like this:
> 
> # fallocate -l 1G 1.img
> # fallocate -l 1G 2.img
> # mkfs.ext4 1.img
> # mkfs.ext4 2.img
> # mount 1.img mnt
> # mkdir mnt/dir1 mnt/dir2
> # touch mnt/dir1/file1 mnt/file1
> # umount mnt
> # mount 2.img mnt
> # mkdir mnt/dir1 mnt/dir2
> # touch mnt/dir1/file1 mnt/file1
> # umount mnt
> # mkdir rootdir
> # mount 1.img rootdir
> # mount 2.img rootdir/dir2
> # mkfs.btrfs -f --rootdir rootdir test.img
> ERROR: item file1 already exists but has wrong st_nlink 1 <= 1
> ERROR: unable to traverse directory rootdir/: 1
> ERROR: error while filling filesystem: 1
> 
> And it failed as expeceted.
> 

Perfect example, it did seem to be vulnerable to this with just
"original st_ino + 256"... Thanks for the new test.

> > 
> > In that case, if this is a bugfix, I think it makes sense to write a
> > regression test that highlights it. It would be nice to pull the fix
> > out of the refactor, too, but I suppose that with such a deep refactor,
> > that might not be possible/worth it.
> 
> Unfortunately I didn't even notice how serious these problems are, until
> I tried...

That's kind of my concern with the higher level testing question, that
we don't have much coverage for such a big refactor. OTOH, if we already
had such serious bugs, how bad could the refactor really be :)

> 
> Will add two new test cases. One is the above one, the other is the
> hardlink out of rootdir bug I mentioned in 3/3.
> 
> > 
> > > 
> > >    And this results very weird inode numbers, e.g:
> > > 
> > > 	item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
> > > 	item 6 key (88347519 INODE_ITEM 0) itemoff 15815 itemsize 160
> > > 	item 16 key (88347520 INODE_ITEM 0) itemoff 15363 itemsize 160
> > > 	item 20 key (88347521 INODE_ITEM 0) itemoff 15119 itemsize 160
> > > 	item 24 key (88347522 INODE_ITEM 0) itemoff 14875 itemsize 160
> > > 	item 26 key (88347523 INODE_ITEM 0) itemoff 14700 itemsize 160
> > > 	item 28 key (88347524 INODE_ITEM 0) itemoff 14525 itemsize 160
> > > 	item 30 key (88347557 INODE_ITEM 0) itemoff 14350 itemsize 160
> > > 	item 32 key (88347566 INODE_ITEM 0) itemoff 14175 itemsize 160
> > > 
> > >    Which is far from a regular fs created by copying the data.
> > 
> > +1, especially since it doesn't even *preserve* the inode numbers, which
> > would be a comprehensible (if not working) behavior. Creating the oids
> > in "btrfs order" seems like a nice improvement!
> > 
> > > 
> > > - Weird directory inode size calculation
> > >    Unlike kernel, which updated the directory inode size every time new
> > >    child inodes are added, we calculate the directory inode size by
> > >    searching all its children first, then later new inodes linked to this
> > >    directory won't touch the inode size.
> > > 
> > > Enhance all these points by:
> > > 
> > > - Use nftw() to do the preorder traversal
> > >    It also provides the extra level detection, which is pretty handy.
> > > 
> > > - Use a simple local inode_entry to record each parent
> > >    The only value is a u64 to record the inode number.
> > >    And one simple rootdir_path structure to record the list of
> > >    inode_entry, alone with the current level.
> > > 
> > >    This rootdir_path structure along with two helpers,
> > >    rootdir_path_push() and rootdir_path_pop(), along with the
> > >    preorder traversal provided by nftw(), are enough for us to record
> > >    all the parent directories until the rootdir.
> > > 
> > > - Grab new inode number properly
> > >    Just call btrfs_get_free_objectid() to grab a proper inode number,
> > >    other than using some weird calculated value.
> > > 
> > > - Use btrfs_insert_inode() and btrfs_add_link() to update directory
> > >    inode automatically
> > > 
> > > With all the refactor, the code is shorter and easier to read.
> > 
> > Agreed on all counts, I think this is a lovely refactor.
> > 
> > For a change of this magnitude, I think it is helpful to justify the
> > correctness of the change by documenting the testing plan. Are there
> > existing unit tests of mkfs that cover all the cases we are modifying
> > here?
> 
> We have existing rootdir test cases, from the regular functional tests,
> to corner cases.
> 
> But we lack corner cases of corner cases, like duplicating inode number
> (cross mnt) and hard links.
> 
> > Is there some --rootdir master case that covers everything?
> 
> It's inside tests/mkfs-tests/
> 
> A quick grep shows:
> 
> 004-rootdir-keeps-size
> 009-special-files-for-rootdir
> 011-rootdir-create-file
> 012-rootdir-no-shrink
> 014-rootdir-inline-extent
> 016-rootdir-bad-symbolic-link
> 021-rfeatures-quota-rootdir
> 022-rootdir-size
> 027-rootdir-inode

Thanks for collecting those. By scanning those, it's hard to tell if
there is, for example, a test of an "interesting" directory structure
like the one I drew above, to test the traversal, for example.

Out of curiousity, I applied your patches, ran the tests (pass) then
put a bug in the traversal code (only pop on cur_lvl > ftw_level, not
ftw_level - 1) and 004-rootdir-keeps-size did fail, as it ran on the
Documentation/ dir, which does seem like a sufficiently interesting
directory.

I feel a bit better now :)

Maybe something that ran after fsstress (or however we test send/recv?)
in the long run would be a good test, as we nail down some of the bugs
you have hit here.

This comment is not directed at you, but is more general for btrfs
development: I think explicitly stating the testing conducted on a
complex patch is really helpful for reviewers.

> 
> 
> > Is it
> > in fstests? Knowing that all of the new code has at least run correctly
> > would go a long way to feeling confident in the details of the
> > transformation.
> 
> We have btrfs-progs github CI, runs the all the selftests on each PR.
> And it's all green (except a typo caught by code spell).
> 
> [...]
> > > -	parent_inum = highest_inum + BTRFS_FIRST_FREE_OBJECTID;
> > > -	dir_entry->inum = parent_inum;
> > > -	list_add_tail(&dir_entry->list, &dir_head->list);
> > > +	/*
> > > +	 * If our path level is higher than this level - 1, this means
> > > +	 * we have changed directory.
> > > +	 * Poping out the unrelated inodes.
> > 
> > Popping
> 
> Exposed by the CI, and fixed immediately in github.
> 

Cool!

> > 
> > Also, this took me like 15 minutes of working examples to figure out why
> > it worked. I think it could definitely do with some deeper explanation
> > of the invariant, like:
> > 
> > ftwbuf->level - 1 is the level of the parent of the current traversed
> > inode. ftw will traverse all inodes in a directory before leaving it,
> > and will never traverse an inode without first traversing its dir,
> > so if we see a level less than or equal to the last directory we saw,
> > we are finished with that directory and can pop it.
> > 
> > Perhaps with an annotated drawing like:
> > 
> > 0: /
> > 1:         /foo/
> > 2:                 /foo/bar/
> > 3:                         /foo/bar/f
> > 2:                 /foo/baz/            POP! 2 > (2 - 1); done with /foo/bar/
> > 3:                         /foo/baz/g
> > 1:         /h                           POP! 2 > (1 - 1); done with /foo/baz/
> > 
> > To help make it clearer. I honestly even think just changing to >= makes
> > it clearer? Not sure about that.
> 
> I'll add comments with an example to explain the workflow.
> 
> BTW, David and I are working with Github review system a lot recently:
> https://github.com/kdave/btrfs-progs/pull/855
> 
> We do not force anyone to use specific system to do anything, but you
> may find it a little easier to comment, and feel free to fall back to
> the mail based review workflow at any time.

Happy to do code review in PRs! It's a bit annoying that the history
gets blown up when the author re-pushes the branch after incorporating
feeback (never understood that design, Phabricator tracks the *branch*
history too and it seems very helpful to me.)

BTW, if this review style is going well, we should discuss more and get
the workflow more documented/supported in the workflow docs.

Also, while we are still here in email-land, you can add
Reviewed-by: Boris Burkov <boris@bur.io>

Thanks,
Boris

> 
> Thanks a lot for the detailed review!
> Qu

  reply	other threads:[~2024-07-31 23:44 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-07-31  9:38 [PATCH 0/3] btrfs-progs: rework how we traverse rootdir Qu Wenruo
2024-07-31  9:38 ` [PATCH 1/3] btrfs-progs: constify the name parameter of btrfs_add_link() Qu Wenruo
2024-07-31  9:38 ` [PATCH 2/3] btrfs-progs: mkfs: rework how we traverse rootdir Qu Wenruo
2024-07-31 22:59   ` Boris Burkov
2024-07-31 23:19     ` Qu Wenruo
2024-07-31 23:42       ` Boris Burkov [this message]
2024-08-01  0:06         ` Qu Wenruo
2024-07-31  9:38 ` [PATCH 3/3] btrfs-progs: rootdir: reject hard links Qu Wenruo
2024-07-31 22:17   ` Boris Burkov
2024-07-31 22:55     ` Qu Wenruo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240731234231.GA3809836@zen.localdomain \
    --to=boris@bur.io \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=quwenruo.btrfs@gmx.com \
    --cc=wqu@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox