public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <djwong@kernel.org>
To: Zizhi Wo <wozizhi@huawei.com>
Cc: chandan.babu@oracle.com, dchinner@redhat.com, osandov@fb.com,
	john.g.garry@oracle.com, linux-xfs@vger.kernel.org,
	linux-kernel@vger.kernel.org, yangerkun@huawei.com
Subject: Re: [PATCH 1/2] xfs: Fix missing block calculations in xfs datadev fsmap
Date: Fri, 8 Nov 2024 09:30:06 -0800	[thread overview]
Message-ID: <20241108173006.GA168069@frogsfrogsfrogs> (raw)
In-Reply-To: <1549f04a-8431-405d-adfc-23e5988abe51@huawei.com>

On Fri, Nov 08, 2024 at 10:29:08AM +0800, Zizhi Wo wrote:
> 
> 
> 在 2024/11/8 7:43, Darrick J. Wong 写道:
> > On Mon, Aug 26, 2024 at 11:10:04AM +0800, Zizhi Wo wrote:
> > > In xfs datadev fsmap query, I noticed a missing block calculation problem:
> > > [root@fedora ~]# xfs_db -r -c "sb 0" -c "p" /dev/vdb
> > > magicnum = 0x58465342
> > > blocksize = 4096
> > > dblocks = 5242880
> > > ......
> > > [root@fedora ~]# xfs_io -c 'fsmap -vvvv' /mnt
> > > ...
> > > 30: 253:16 [31457384..41943031]: free space            3  (104..10485751)    10485648
> > > 
> > > (41943031 + 1) / 8 = 5242879 != 5242880
> > > We missed one block in our fsmap calculation!
> > 
> > Eek.
> > 
> > > The root cause of the problem lies in __xfs_getfsmap_datadev(), where the
> > > calculation of "end_fsb" requires a classification discussion. If "end_fsb"
> > > is calculated based on "eofs", we need to add an extra sentinel node for
> > > subsequent length calculations. Otherwise, one block will be missed. If
> > > "end_fsb" is calculated based on "keys[1]", then there is no need to add an
> > > extra node. Because "keys[1]" itself is unreachable, it cancels out one of
> > > the additions. The diagram below illustrates this:
> > > 
> > > |0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|-----eofs
> > > |---------------|---------------------|
> > > a       n       b         n+1         c
> > > 
> > > Assume that eofs is 16, the start address of the previous query is block n,
> > > sector 0, and the length is 1, so the "info->next" is at point b, sector 8.
> > > In the last query, suppose the "rm_startblock" calculated based on
> > > "eofs - 1" is the last block n+1 at point b. All we get is the starting
> > > address of the block, not the end. Therefore, an additional sentinel node
> > > needs to be added to move it to point c. After that, subtracting one from
> > > the other will yield the remaining 1.
> > > 
> > > Although we can now calculate the exact last query using "info->end_daddr",
> > > we will still get an incorrect value if the device at this point is not the
> > > boundary device specified by "keys[1]", as "end_daddr" is still the initial
> > > value. Therefore, the eofs situation here needs to be corrected. The issue
> > > is resolved by adding a sentinel node.
> > 
> > Why don't we set end_daddr unconditionally, then?
> > 
> > Hmm, looking at the end_daddr usage in fsmap.c, I think it's wrong.  If
> > end_daddr is set at all, it's set either to the last sector for which
> > the user wants a mapping; or it's set to the last sector for the device.
> > But then look at how we use it:
> > 
> > 	if (info->last...)
> > 		frec->start_daddr = info->end_daddr;
> > 
> > 	...
> > 
> > 	/* "report the gap..."
> > 	if (frec->start_daddr > info->next_daddr) {
> > 		fmr.fmr_length = frec->start_daddr - info->next_daddr;
> > 	}
> > 
> > This is wrong -- we're using start_daddr to compute the distance from
> > the last mapping that we output up to the end of the range that we want.
> > The "end of the range" is modeled with a phony rmap record that starts
> > at the first fsblock after that range.
> > 
> 
> In the current code, we set "rec_daddr = end_daddr" only when
> (info->last && info->end_daddr != NULL), which should ensure that this
> is the last query?

Right.

> Because end_daddr is set to the last device, and
> info->last is set to the last query. Therefore, assigning it to
> start_daddr should not cause issues in the next query?

Right, the code currently sets end_daddr only for the last device, so
there won't be any issues with the next query.

That said, we reset the xfs_getfsmap_info state between each device, so
it's safe to set end_daddr for every device, not just the last time
through that loop.

> Did I misunderstand something? Or is it because the latest code
> constantly updates end_daddr, which is why this issue arises?

The 6.13 metadir/rtgroups patches didn't change when end_daddr gets set,
but my fixpatch *does* make it set end_daddr for all devices.  Will send
a patch + fstests update shortly to demonstrate. :)

> > IOWs, that assignment should have been
> > frec->start_daddr = info->end_daddr + 1.
> > 
> > Granted in August the codebase was less clear about the difference
> > between rec_daddr and rmap->rm_startblock.  For 6.13, hch cleaned all
> > that up -- rec_daddr is now called start_daddr and the fsmap code passes
> > rmap records with space numbers in units of daddrs via a new struct
> > xfs_fsmap_rec.  Unfortunately, that's all buried in the giant pile of
> > pull requests I sent a couple of days ago which hasn't shown up on
> > for-next yet.
> > 
> > https://lore.kernel.org/linux-xfs/173084396955.1871025.18156568347365549855.stgit@frogsfrogsfrogs/
> > 
> > So I think I know how to fix this against the 6.13 codebase, but I'm
> > going to take a slightly different approach than yours...
> > 
> > > Fixes: e89c041338ed ("xfs: implement the GETFSMAP ioctl")
> > > Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
> > > ---
> > >   fs/xfs/xfs_fsmap.c | 19 +++++++++++++++++--
> > >   1 file changed, 17 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_fsmap.c b/fs/xfs/xfs_fsmap.c
> > > index 85dbb46452ca..8a2dfe96dae7 100644
> > > --- a/fs/xfs/xfs_fsmap.c
> > > +++ b/fs/xfs/xfs_fsmap.c
> > > @@ -596,12 +596,27 @@ __xfs_getfsmap_datadev(
> > >   	xfs_agnumber_t			end_ag;
> > >   	uint64_t			eofs;
> > >   	int				error = 0;
> > > +	int				sentinel = 0;
> > >   	eofs = XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks);
> > >   	if (keys[0].fmr_physical >= eofs)
> > >   		return 0;
> > >   	start_fsb = XFS_DADDR_TO_FSB(mp, keys[0].fmr_physical);
> > > -	end_fsb = XFS_DADDR_TO_FSB(mp, min(eofs - 1, keys[1].fmr_physical));
> > > +	/*
> > > +	 * For the case of eofs, we need to add a sentinel node;
> > > +	 * otherwise, one block will be missed when calculating the length
> > > +	 * in the last query.
> > > +	 * For the case of key[1], there is no need to add a sentinel node
> > > +	 * because it already represents a value that cannot be reached.
> > > +	 * For the case where key[1] after shifting is within the same
> > > +	 * block as the starting address, it is resolved using end_daddr.
> > > +	 */
> > > +	if (keys[1].fmr_physical > eofs - 1) {
> > > +		sentinel = 1;
> > > +		end_fsb = XFS_DADDR_TO_FSB(mp, eofs - 1);
> > > +	} else {
> > > +		end_fsb = XFS_DADDR_TO_FSB(mp, keys[1].fmr_physical);
> > > +	}
> > 
> > ...because running against djwong-wtf, I actually see the same symptoms
> > for the realtime device.  So I think a better solution is to change
> > xfs_getfsmap to set end_daddr always, and then fix the off by one error.
> > 
> 
> Yes, my second patch looks at this rt problem...
> Thank you for your reply

<nod>

--D

> Thanks,
> Zizhi Wo
> 
> 
> > I also don't really like "sentinel" values because they're not
> > intuitive.
> > 
> > I will also go update xfs/273 to check that there are no gaps in the
> > mappings returned, and that they go to where the filesystem thinks is
> > the end of the device.  Thanks for reporting this, sorry I was too busy
> > trying to get metadir/rtgroups done to look at this until now. :(
> > 
> > --D
> > 
> 
> 
> 
> > >   	/*
> > >   	 * Convert the fsmap low/high keys to AG based keys.  Initialize
> > > @@ -649,7 +664,7 @@ __xfs_getfsmap_datadev(
> > >   		info->pag = pag;
> > >   		if (pag->pag_agno == end_ag) {
> > >   			info->high.rm_startblock = XFS_FSB_TO_AGBNO(mp,
> > > -					end_fsb);
> > > +					end_fsb) + sentinel;
> > >   			info->high.rm_offset = XFS_BB_TO_FSBT(mp,
> > >   					keys[1].fmr_offset);
> > >   			error = xfs_fsmap_owner_to_rmap(&info->high, &keys[1]);
> > > -- 
> > > 2.39.2
> > > 
> > > 
> 

  reply	other threads:[~2024-11-08 17:30 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-08-26  3:10 [PATCH 0/2] Some boundary error bugfix related to XFS fsmap Zizhi Wo
2024-08-26  3:10 ` [PATCH 1/2] xfs: Fix missing block calculations in xfs datadev fsmap Zizhi Wo
2024-11-07 23:43   ` Darrick J. Wong
2024-11-08  2:29     ` Zizhi Wo
2024-11-08 17:30       ` Darrick J. Wong [this message]
2024-11-09  2:34         ` Zizhi Wo
2024-08-26  3:10 ` [PATCH 2/2] xfs: Fix incorrect parameter calculation in rt fsmap Zizhi Wo
2024-11-07 23:51   ` Darrick J. Wong
2024-08-29 11:24 ` [PATCH 0/2] Some boundary error bugfix related to XFS fsmap Zizhi Wo
2024-09-02 19:08   ` Darrick J. Wong
2024-09-03  9:18     ` Zizhi Wo
2024-10-09 13:01     ` Zizhi Wo
2024-11-07  9:21       ` Zizhi Wo
2024-11-07 23:51         ` Darrick J. Wong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20241108173006.GA168069@frogsfrogsfrogs \
    --to=djwong@kernel.org \
    --cc=chandan.babu@oracle.com \
    --cc=dchinner@redhat.com \
    --cc=john.g.garry@oracle.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=osandov@fb.com \
    --cc=wozizhi@huawei.com \
    --cc=yangerkun@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox