From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Brian Foster <bfoster@redhat.com>
Cc: linux-xfs@vger.kernel.org, Omar Sandoval <osandov@osandov.com>,
Dave Chinner <david@fromorbit.com>
Subject: Re: [PATCH] xfs: fix log reservation overflows when allocating large rt extents
Date: Fri, 13 Dec 2019 12:19:57 -0800 [thread overview]
Message-ID: <20191213201957.GI99875@magnolia> (raw)
In-Reply-To: <20191213121840.GA43376@bfoster>
On Fri, Dec 13, 2019 at 07:18:40AM -0500, Brian Foster wrote:
> On Wed, Dec 04, 2019 at 08:38:09AM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> >
> > Omar Sandoval reported that a 4G fallocate on the realtime device causes
> > filesystem shutdowns due to a log reservation overflow that happens when
> > we log the rtbitmap updates. Factor rtbitmap/rtsummary updates into the
> > the tr_write and tr_itruncate log reservation calculation.
> >
> > "The following reproducer results in a transaction log overrun warning
> > for me:
> >
> > mkfs.xfs -f -r rtdev=/dev/vdc -d rtinherit=1 -m reflink=0 /dev/vdb
> > mount -o rtdev=/dev/vdc /dev/vdb /mnt
> > fallocate -l 4G /mnt/foo
> >
> > Reported-by: Omar Sandoval <osandov@osandov.com>
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
>
> Looks reasonable enough given my limited knowledge on the rt bits. One
> question..
>
> > fs/xfs/libxfs/xfs_trans_resv.c | 96 ++++++++++++++++++++++++++++++++--------
> > 1 file changed, 77 insertions(+), 19 deletions(-)
> >
> > diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> > index c55cd9a3dec9..824073a839ac 100644
> > --- a/fs/xfs/libxfs/xfs_trans_resv.c
> > +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> > @@ -196,6 +196,24 @@ xfs_calc_inode_chunk_res(
> > return res;
> > }
> >
> > +/*
> > + * Per-extent log reservation for the btree changes involved in freeing or
> > + * allocating a realtime extent. We have to be able to log as many rtbitmap
> > + * blocks as needed to mark inuse MAXEXTLEN blocks' worth of realtime extents,
> > + * as well as the realtime summary block.
> > + */
> > +unsigned int
> > +xfs_rtalloc_log_count(
> > + struct xfs_mount *mp,
> > + unsigned int num_ops)
> > +{
> > + unsigned int blksz = XFS_FSB_TO_B(mp, 1);
> > + unsigned int rtbmp_bytes;
> > +
> > + rtbmp_bytes = (MAXEXTLEN / mp->m_sb.sb_rextsize) / NBBY;
> > + return (howmany(rtbmp_bytes, blksz) + 1) * num_ops;
> > +}
> > +
> > /*
> > * Various log reservation values.
> > *
> > @@ -218,13 +236,21 @@ xfs_calc_inode_chunk_res(
> >
> > /*
> > * In a write transaction we can allocate a maximum of 2
> > - * extents. This gives:
> > + * extents. This gives (t1):
> > * the inode getting the new extents: inode size
> > * the inode's bmap btree: max depth * block size
> > * the agfs of the ags from which the extents are allocated: 2 * sector
> > * the superblock free block counter: sector size
> > * the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size
> > - * And the bmap_finish transaction can free bmap blocks in a join:
> > + * Or, if we're writing to a realtime file (t2):
> > + * the inode getting the new extents: inode size
> > + * the inode's bmap btree: max depth * block size
> > + * the agfs of the ags from which the extents are allocated: 2 * sector
> > + * the superblock free block counter: sector size
> > + * the realtime bitmap: ((MAXEXTLEN / rtextsize) / NBBY) bytes
> > + * the realtime summary: 1 block
> > + * the allocation btrees: 2 trees * (2 * max depth - 1) * block size
>
> Why do we include the allocation btrees in the rt reservations? I
> thought that we'd either allocate (or free) out of one pool or the
> other. Do we operate on both sets of structures in the same transaction?
I read "allocation btrees: 2 exts * 2 trees..." for t1 to mean that we
need to be able to allocate one datadev extent (which could cause a full
bnobt/cntbt split) for the actual file data, and then the second extent is
to handle allocating a new bmbt block to the bmap btree.
Based on that, I concluded that we still need to reserve space for that
"second" extent to handle allocating a new bmbt block (on the datadev).
While pondering that, I wondered if even that's really true because what
happens if you suffer a full bmbt split, the free space is so fragmented
that each level of the bmbt split ends up allocating a new block from a
different part of the free space btrees and that in turn causes splits
in the free space btrees?
I think the answer is "no we're fine" because even if each new bmbt
block comes from a different bnobt record, a full bmbt split will never
hollow out more than half of one bnobt block's worth of free space
records.
--D
> Brian
>
> > + * And the bmap_finish transaction can free bmap blocks in a join (t3):
> > * the agfs of the ags containing the blocks: 2 * sector size
> > * the agfls of the ags containing the blocks: 2 * sector size
> > * the super block free block counter: sector size
> > @@ -234,40 +260,72 @@ STATIC uint
> > xfs_calc_write_reservation(
> > struct xfs_mount *mp)
> > {
> > - return XFS_DQUOT_LOGRES(mp) +
> > - max((xfs_calc_inode_res(mp, 1) +
> > + unsigned int t1, t2, t3;
> > + unsigned int blksz = XFS_FSB_TO_B(mp, 1);
> > +
> > + t1 = xfs_calc_inode_res(mp, 1) +
> > + xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK), blksz) +
> > + xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
> > + xfs_calc_buf_res(xfs_allocfree_log_count(mp, 2), blksz);
> > +
> > + if (xfs_sb_version_hasrealtime(&mp->m_sb)) {
> > + t2 = xfs_calc_inode_res(mp, 1) +
> > xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK),
> > - XFS_FSB_TO_B(mp, 1)) +
> > + blksz) +
> > xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
> > - xfs_calc_buf_res(xfs_allocfree_log_count(mp, 2),
> > - XFS_FSB_TO_B(mp, 1))),
> > - (xfs_calc_buf_res(5, mp->m_sb.sb_sectsize) +
> > - xfs_calc_buf_res(xfs_allocfree_log_count(mp, 2),
> > - XFS_FSB_TO_B(mp, 1))));
> > + xfs_calc_buf_res(xfs_rtalloc_log_count(mp, 1), blksz) +
> > + xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1), blksz);
> > + } else {
> > + t2 = 0;
> > + }
> > +
> > + t3 = xfs_calc_buf_res(5, mp->m_sb.sb_sectsize) +
> > + xfs_calc_buf_res(xfs_allocfree_log_count(mp, 2), blksz);
> > +
> > + return XFS_DQUOT_LOGRES(mp) + max3(t1, t2, t3);
> > }
> >
> > /*
> > - * In truncating a file we free up to two extents at once. We can modify:
> > + * In truncating a file we free up to two extents at once. We can modify (t1):
> > * the inode being truncated: inode size
> > * the inode's bmap btree: (max depth + 1) * block size
> > - * And the bmap_finish transaction can free the blocks and bmap blocks:
> > + * And the bmap_finish transaction can free the blocks and bmap blocks (t2):
> > * the agf for each of the ags: 4 * sector size
> > * the agfl for each of the ags: 4 * sector size
> > * the super block to reflect the freed blocks: sector size
> > * worst case split in allocation btrees per extent assuming 4 extents:
> > * 4 exts * 2 trees * (2 * max depth - 1) * block size
> > + * Or, if it's a realtime file (t3):
> > + * the agf for each of the ags: 2 * sector size
> > + * the agfl for each of the ags: 2 * sector size
> > + * the super block to reflect the freed blocks: sector size
> > + * the realtime bitmap: 2 exts * ((MAXEXTLEN / rtextsize) / NBBY) bytes
> > + * the realtime summary: 2 exts * 1 block
> > + * worst case split in allocation btrees per extent assuming 2 extents:
> > + * 2 exts * 2 trees * (2 * max depth - 1) * block size
> > */
> > STATIC uint
> > xfs_calc_itruncate_reservation(
> > struct xfs_mount *mp)
> > {
> > - return XFS_DQUOT_LOGRES(mp) +
> > - max((xfs_calc_inode_res(mp, 1) +
> > - xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) + 1,
> > - XFS_FSB_TO_B(mp, 1))),
> > - (xfs_calc_buf_res(9, mp->m_sb.sb_sectsize) +
> > - xfs_calc_buf_res(xfs_allocfree_log_count(mp, 4),
> > - XFS_FSB_TO_B(mp, 1))));
> > + unsigned int t1, t2, t3;
> > + unsigned int blksz = XFS_FSB_TO_B(mp, 1);
> > +
> > + t1 = xfs_calc_inode_res(mp, 1) +
> > + xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) + 1, blksz);
> > +
> > + t2 = xfs_calc_buf_res(9, mp->m_sb.sb_sectsize) +
> > + xfs_calc_buf_res(xfs_allocfree_log_count(mp, 4), blksz);
> > +
> > + if (xfs_sb_version_hasrealtime(&mp->m_sb)) {
> > + t3 = xfs_calc_buf_res(5, mp->m_sb.sb_sectsize) +
> > + xfs_calc_buf_res(xfs_rtalloc_log_count(mp, 2), blksz) +
> > + xfs_calc_buf_res(xfs_allocfree_log_count(mp, 2), blksz);
> > + } else {
> > + t3 = 0;
> > + }
> > +
> > + return XFS_DQUOT_LOGRES(mp) + max3(t1, t2, t3);
> > }
> >
> > /*
> >
>
next prev parent reply other threads:[~2019-12-13 20:41 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-12-04 16:38 [PATCH] xfs: fix log reservation overflows when allocating large rt extents Darrick J. Wong
2019-12-05 19:23 ` Omar Sandoval
2019-12-12 23:01 ` Darrick J. Wong
2019-12-13 12:18 ` Brian Foster
2019-12-13 20:19 ` Darrick J. Wong [this message]
2019-12-16 12:23 ` Brian Foster
2019-12-16 16:47 ` Darrick J. Wong
2019-12-16 18:00 ` Brian Foster
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20191213201957.GI99875@magnolia \
--to=darrick.wong@oracle.com \
--cc=bfoster@redhat.com \
--cc=david@fromorbit.com \
--cc=linux-xfs@vger.kernel.org \
--cc=osandov@osandov.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox