From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: xfs@oss.sgi.com
Subject: Re: [PATCH v3 04/11] xfs: update inode allocation/free transaction reservations for finobt
Date: Tue, 18 Feb 2014 15:34:10 -0500 [thread overview]
Message-ID: <5303C3C2.5070501@redhat.com> (raw)
In-Reply-To: <530393F8.4070106@redhat.com>
On 02/18/2014 12:10 PM, Brian Foster wrote:
> On 02/11/2014 01:46 AM, Dave Chinner wrote:
>> On Tue, Feb 04, 2014 at 12:49:35PM -0500, Brian Foster wrote:
>>> Create the xfs_calc_finobt_res() helper to calculate the finobt log
>>> reservation for inode allocation and free. Update
>>> XFS_IALLOC_SPACE_RES() to reserve blocks for the additional finobt
>>> insertion on inode allocation. Create XFS_IFREE_SPACE_RES() to
>>> reserve blocks for the potential finobt record insertion on inode
>>> free (i.e., if an inode chunk was previously fully allocated).
>>>
>>> Signed-off-by: Brian Foster <bfoster@redhat.com>
>>> ---
>>> fs/xfs/xfs_inode.c | 4 +++-
>>> fs/xfs/xfs_trans_resv.c | 47 +++++++++++++++++++++++++++++++++++++++++++----
>>> fs/xfs/xfs_trans_space.h | 7 ++++++-
>>> 3 files changed, 52 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
>>> index 001aa89..57c77ed 100644
>>> --- a/fs/xfs/xfs_inode.c
>>> +++ b/fs/xfs/xfs_inode.c
>>> @@ -1730,7 +1730,9 @@ xfs_inactive_ifree(
>>> int error;
>>>
>>> tp = xfs_trans_alloc(mp, XFS_TRANS_INACTIVE);
>>> - error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree, 0, 0);
>>> + tp->t_flags |= XFS_TRANS_RESERVE;
>>> + error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree,
>>> + XFS_IFREE_SPACE_RES(mp), 0);
>>
>> Can you add a comment explaining why the XFS_TRANS_RESERVE flag is
>> used here, and why it's use won't lead to accelerated reserve pool
>> depletion?
>>
>
> So this aspect of things appears to be a bit more interesting than I
> originally anticipated. I "reserve enabled" this transaction to
> facilitate the ability to free up inodes under ENOSPC conditions.
> Without this, the problem of failing out of xfs_inactive_ifree() (and
> leaving an inode chained on the unlinked list) is easily reproducible
> with generic/083.
>
> The basic argument for why this is reasonable is that releasing an inode
> releases used space (i.e., file blocks and potentially directory blocks
> and inode chunks over time). That said, I can manufacture situations
> where this is not the case. E.g., allocate a bunch of 0-sized files,
> consume remaining free space in some separate file, start removing
> inodes in a manner that removes a single inode per chunk or so. This
> creates a scenario where the inobt can be very large and the finobt very
> small (likely a single record). Removing the inodes in this manner
> reduces the likelihood of freeing up any space and thus rapidly grows
> the finobt towards the size of the inobt without any free space
> available. This might or might not qualify as sane use of the fs, but I
> don't think the failure scenario is acceptable as things currently stand.
>
> I think there are several ways this can go from here. A couple ideas
> that have crossed my mind:
>
> - Find a way to variably reserve the number of blocks that would be
> required to grow the finobt to the finobt, based on current state. This
> would require the total number of blocks (not just enough for a split),
> so this could get complex and somewhat overbearing (i.e., a lot of space
> could be quietly reserved, current tracking might not be sufficient and
> the allocation paths could get hairy).
>
> - Work to push the ifree transaction allocation and reservation to the
> unlink codepath rather than the eviction codepath. Under normal
> circumstances, chain the tp to the xfs_inode such that the eviction code
> path can grab it and run. This prevents us going into the state where an
> inode is unlinked without having enough space to free up. On the flip
> side, ENOSPC on unlink isn't very forgiving behavior to the user.
>
- Add some state or flags bits to the finobt and the associated ability
to kill/invalidate it at runtime. Print a warning with regard to the
situation that indicates performance might be affected and a repair is
required to re-enable.
Brian
> I think the former approach is probably overkill for something that
> might be a pathological situation. The latter approach is more simple,
> but it feels like a bit of a hack. I've experimented with it a bit, but
> I'm not quite sure yet if it introduces any transaction issues by
> allocating the unlink and ifree transactions at the same time.
>
> Perhaps another argument could be made that it's rather unlikely we run
> into an fs with as many 0-sized (or sub-inode chunk sized) files as
> required to deplete the reserve pool without freeing any space, and we
> should just touch up the failure handling. E.g.,
>
> 1.) Continue to reserve enable the ifree transaction. Consider expanding
> the reserve pool on finobt-enabled fs' if appropriate. Note that this is
> not guaranteed to provide enough resources to populate the finobt to the
> level of the inobt without freeing up more space.
> 2.) Attempt a !XFS_TRANS_RESERVE tp reservation in xfs_inactive_ifree().
> If fails, xfs_warn()/notice() and enable XFS_TRANS_RESERVE.
> 3.) Attempt XFS_TRANS_RESERVE reservation. If fails, xfs_notice() and
> shutdown.
>
> And this could probably be made more intelligent to bail out sooner if
> we repeat XFS_TRANS_RESERVE reservations without freeing up any space,
> etc. Before going too far in one direction... thoughts?
>
> Brian
>
>>> if (error) {
>>> ASSERT(XFS_FORCED_SHUTDOWN(mp));
>>> xfs_trans_cancel(tp, XFS_TRANS_RELEASE_LOG_RES);
>>> diff --git a/fs/xfs/xfs_trans_resv.c b/fs/xfs/xfs_trans_resv.c
>>> index 2fd59c0..32f35c1 100644
>>> --- a/fs/xfs/xfs_trans_resv.c
>>> +++ b/fs/xfs/xfs_trans_resv.c
>>> @@ -98,6 +98,37 @@ xfs_calc_inode_res(
>>> }
>>>
>>> /*
>>> + * The free inode btree is a conditional feature and the log reservation
>>> + * requirements differ slightly from that of the traditional inode allocation
>>> + * btree. The finobt tracks records for inode chunks with at least one free inode.
>>> + * Therefore, a record can be removed from the tree for an inode allocation or
>>> + * free and the associated merge reservation is unconditional. This also covers
>>> + * the possibility of a split on record insertion.
>>
>> Slightly wider than 80 columns here. FWIW, if you use vim, add this
>> rule to have it add a red line at the textwidth you have set:
>>
>> " highlight textwidth
>> set cc=+1
>>
>> And that will point out lines that are too long quite obviously ;)
>>
>>> + *
>>> + * the free inode btree: max depth * block size
>>> + * the free inode btree entry: block size
>>> + *
>>> + * TODO: is the modify res really necessary? covered by the merge/split res?
>>> + * This seems to be the pattern of ifree, but not create_resv_alloc. Why?
>>
>> The modify case is for an allocation that only updates an inobt
>> record (i.e. chunk already allocated, free inodes in it). Because
>> we can remove a finobt record when "modifying" the last free inode
>> record in a chunk, "modify" can cause a redcord removal and hence a
>> tree merge. In which case it's no different of any of the other
>> finobt reservations....
>>
>>> @@ -267,6 +298,7 @@ xfs_calc_remove_reservation(
>>> * the superblock for the nlink flag: sector size
>>> * the directory btree: (max depth + v2) * dir block size
>>> * the directory inode's bmap btree: (max depth + v2) * block size
>>> + * the finobt
>>> */
>>> STATIC uint
>>> xfs_calc_create_resv_modify(
>>> @@ -275,7 +307,8 @@ xfs_calc_create_resv_modify(
>>> return xfs_calc_inode_res(mp, 2) +
>>> xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
>>> (uint)XFS_FSB_TO_B(mp, 1) +
>>> - xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1));
>>> + xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1)) +
>>> + xfs_calc_finobt_res(mp, 1);
>>> }
>>
>> And this is where is starts to get complex. The modify operation can
>> now cause a finobt merge, when means blocks will be allocated/freed.
>> That means we now need to take into account:
>>
>> * the allocation btrees: 2 trees * (max depth - 1) * block size
>>
>> and anything else freeing an extent requires.
>>
>>> /*
>>> @@ -285,6 +318,7 @@ xfs_calc_create_resv_modify(
>>> * the inode blocks allocated: XFS_IALLOC_BLOCKS * blocksize
>>> * the inode btree: max depth * blocksize
>>> * the allocation btrees: 2 trees * (max depth - 1) * block size
>>> + * the finobt
>>> */
>>> STATIC uint
>>> xfs_calc_create_resv_alloc(
>>> @@ -295,7 +329,8 @@ xfs_calc_create_resv_alloc(
>>> xfs_calc_buf_res(XFS_IALLOC_BLOCKS(mp), XFS_FSB_TO_B(mp, 1)) +
>>> xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
>>> xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
>>> - XFS_FSB_TO_B(mp, 1));
>>> + XFS_FSB_TO_B(mp, 1)) +
>>> + xfs_calc_finobt_res(mp, 0);
>>> }
>>
>> This reservation is only for v4 superblocks - the icreate
>> transaction reservation is used for v5 superblocks, so that's the
>> only one you need to modify.
>>
>> Cheers,
>>
>> Dave.
>>
>
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
>
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
next prev parent reply other threads:[~2014-02-18 20:34 UTC|newest]
Thread overview: 34+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-02-04 17:49 [PATCH v3 00/11] xfs: introduce the free inode btree Brian Foster
2014-02-04 17:49 ` [PATCH v3 01/11] xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbers Brian Foster
2014-02-04 17:49 ` [PATCH v3 02/11] xfs: reserve v5 superblock read-only compat. feature bit for finobt Brian Foster
2014-02-11 6:07 ` Dave Chinner
2014-02-04 17:49 ` [PATCH v3 03/11] xfs: support the XFS_BTNUM_FINOBT free inode btree type Brian Foster
2014-02-11 6:22 ` Dave Chinner
2014-02-04 17:49 ` [PATCH v3 04/11] xfs: update inode allocation/free transaction reservations for finobt Brian Foster
2014-02-11 6:46 ` Dave Chinner
2014-02-11 16:22 ` Brian Foster
2014-02-20 1:00 ` Dave Chinner
2014-02-20 16:04 ` Brian Foster
2014-02-18 17:10 ` Brian Foster
2014-02-18 20:34 ` Brian Foster [this message]
2014-02-20 2:01 ` Dave Chinner
2014-02-20 18:49 ` Brian Foster
2014-02-20 20:50 ` Dave Chinner
2014-02-20 21:14 ` Christoph Hellwig
2014-02-20 23:13 ` Dave Chinner
2014-02-04 17:49 ` [PATCH v3 05/11] xfs: insert newly allocated inode chunks into the finobt Brian Foster
2014-02-11 6:48 ` Dave Chinner
2014-02-04 17:49 ` [PATCH v3 06/11] xfs: use and update the finobt on inode allocation Brian Foster
2014-02-11 7:17 ` Dave Chinner
2014-02-11 16:32 ` Brian Foster
2014-02-14 20:01 ` Brian Foster
2014-02-20 0:38 ` Dave Chinner
2014-02-04 17:49 ` [PATCH v3 07/11] xfs: refactor xfs_difree() inobt bits into xfs_difree_inobt() helper Brian Foster
2014-02-11 7:19 ` Dave Chinner
2014-02-04 17:49 ` [PATCH v3 08/11] xfs: update the finobt on inode free Brian Foster
2014-02-11 7:31 ` Dave Chinner
2014-02-04 17:49 ` [PATCH v3 09/11] xfs: add finobt support to growfs Brian Foster
2014-02-04 17:49 ` [PATCH v3 10/11] xfs: report finobt status in fs geometry Brian Foster
2014-02-11 7:34 ` Dave Chinner
2014-02-04 17:49 ` [PATCH v3 11/11] xfs: enable the finobt feature on v5 superblocks Brian Foster
2014-02-11 7:34 ` Dave Chinner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=5303C3C2.5070501@redhat.com \
--to=bfoster@redhat.com \
--cc=david@fromorbit.com \
--cc=xfs@oss.sgi.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox