From: Wengang Wang <wen.gang.wang@oracle.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>,
"linux-xfs@vger.kernel.org" <linux-xfs@vger.kernel.org>
Subject: Re: [PATCH] xfs: avoid freeing multiple extents from same AG in pending transactions
Date: Wed, 17 May 2023 19:41:31 +0000 [thread overview]
Message-ID: <2D4697C7-AF79-482B-B2D3-C41418FC6911@oracle.com> (raw)
In-Reply-To: <20230517015433.GQ858815@frogsfrogsfrogs>
> On May 16, 2023, at 6:54 PM, Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Wed, May 17, 2023 at 11:40:05AM +1000, Dave Chinner wrote:
>> On Tue, May 16, 2023 at 05:59:13PM -0700, Darrick J. Wong wrote:
>>> Since 6.3 we got rid of the _THIS_AG indirection stuff and that becomes:
>>>
>>> xfs_alloc_fix_freelist ->
>>> xfs_alloc_ag_vextent_size ->
>>> (run all the way to the end of the bnobt) ->
>>> xfs_extent_busy_flush ->
>>> <stall on the busy extent that's in @tp->busy_list>
>>>
>>> xfs_extent_busy_flush does this, potentially while we're holding the
>>> freed extent in @tp->t_busy_list:
>>>
>>> error = xfs_log_force(mp, XFS_LOG_SYNC);
>>> if (error)
>>> return;
>>>
>>> do {
>>> prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE);
>>> if (busy_gen != READ_ONCE(pag->pagb_gen))
>>> break;
>>> schedule();
>>> } while (1);
>>>
>>> finish_wait(&pag->pagb_wait, &wait);
>>>
>>> The log force kicks the CIL to process whatever other committed items
>>> might be lurking in the log. *Hopefully* someone else freed an extent
>>> in the same AG, so the log force has now caused that *other* extent to
>>> get processed so it has now cleared the busy list. Clearing something
>>> from the busy list increments the busy generation (aka pagb_gen).
>>
>> *nod*
>>
>>> Unfortunately, there aren't any other extents, so the busy_gen does not
>>> change, and the loop runs forever.
>>>
>>> At this point, Dave writes:
>>>
>>> [15:57] <dchinner> so if we enter that function with busy extents on the
>>> transaction, and we are doing an extent free operation, we should return
>>> after the sync log force and not do the generation number wait
>>>
>>> [15:58] <dchinner> if we fail to allocate again after the sync log force
>>> and the generation number hasn't changed, then return -EAGAIN because no
>>> progress has been made.
>>>
>>> [15:59] <dchinner> Then the transaction is rolled, the transaction busy
>>> list is cleared, and if the next allocation attempt fails becaues
>>> everything is busy, we go to sleep waiting for the generation to change
>>>
>>> [16:00] <dchinner> but because the transaction does not hold any busy
>>> extents, it cannot deadlock here because it does not pin any extents
>>> that are in the busy tree....
>>>
>>> [16:05] <dchinner> All the generation number is doing here is telling us
>>> whether there was busy extent resolution between the time we last
>>> skipped a viable extent because it was busy and when the flush
>>> completes.
>>>
>>> [16:06] <dchinner> it doesn't mean the next allocation will succeed,
>>> just that progress has been made so trying the allocation attempt will
>>> at least get a different result to the previous scan.
>>>
>>> I think the callsites go from this:
>>>
>>> if (busy) {
>>> xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
>>> trace_xfs_alloc_size_busy(args);
>>> xfs_extent_busy_flush(args->mp, args->pag, busy_gen);
>>> goto restart;
>>> }
>>
>> I was thinking this code changes to:
>>
>> flags |= XFS_ALLOC_FLAG_TRY_FLUSH;
>> ....
>> <attempt allocation>
>> ....
>> if (busy) {
>> xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
>> trace_xfs_alloc_size_busy(args);
>> error = xfs_extent_busy_flush(args->tp, args->pag,
>> busy_gen, flags);
>> if (!error) {
>> flags &= ~XFS_ALLOC_FLAG_TRY_FLUSH;
>> goto restart;
>> }
>> /* jump to cleanup exit point */
>> goto out_error;
>> }
>>
>> Note the different first parameter - we pass args->tp, not args->mp
>> so that the flush has access to the transaction context...
>
> <nod>
>
>>> to something like this:
>>>
>>> bool try_log_flush = true;
>>> ...
>>> restart:
>>> ...
>>>
>>> if (busy) {
>>> bool progress;
>>>
>>> xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
>>> trace_xfs_alloc_size_busy(args);
>>>
>>> /*
>>> * If the current transaction has an extent on the busy
>>> * list, we're allocating space as part of freeing
>>> * space, and all the free space is busy, we can't hang
>>> * here forever. Force the log to try to unbusy free
>>> * space that could have been freed by other
>>> * transactions, and retry the allocation. If the
>>> * allocation fails a second time because all the free
>>> * space is busy and nobody made any progress with
>>> * clearing busy extents, return EAGAIN so the caller
>>> * can roll this transaction.
>>> */
>>> if ((flags & XFS_ALLOC_FLAG_FREEING) &&
>>> !list_empty(&tp->t_busy_list)) {
>>> int log_flushed;
>>>
>>> if (try_log_flush) {
>>> _xfs_log_force(mp, XFS_LOG_SYNC, &log_flushed);
>>> try_log_flush = false;
>>> goto restart;
>>> }
>>>
>>> if (busy_gen == READ_ONCE(pag->pagb_gen))
>>> return -EAGAIN;
>>>
>>> /* XXX should we set try_log_flush = true? */
>>> goto restart;
>>> }
>>>
>>> xfs_extent_busy_flush(args->mp, args->pag, busy_gen);
>>> goto restart;
>>> }
>>>
>>> IOWs, I think Dave wants us to keep the changes in the allocator instead
>>> of spreading it around.
>>
>> Sort of - I want the busy extent flush code to be isolated inside
>> xfs_extent_busy_flush(), not spread around the allocator. :)
>>
>> xfs_extent_busy_flush(
>> struct xfs_trans *tp,
>> struct xfs_perag *pag,
>> unsigned int busy_gen,
>> unsigned int flags)
>> {
>> error = xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
>> if (error)
>> return error;
>>
>> /*
>> * If we are holding busy extents, the caller may not want
>> * to block straight away. If we are being told just to try
>> * a flush or progress has been made since we last skipped a busy
>> * extent, return immediately to allow the caller to try
>> * again. If we are freeing extents, we might actually be
>> * holding the onyly free extents in the transaction busy
>
> only
>
>> * list and the log force won't resolve that situation. In
>> * this case, return -EAGAIN in that case to tell the caller
>> * it needs to commit the busy extents it holds before
>> * retrying the extent free operation.
>> */
>> if (!list_empty(&tp->t_busy_list)) {
>> if (flags & XFS_ALLOC_FLAG_TRY_FLUSH)
>> return 0;
>> if (busy_gen != READ_ONCE(pag->pagb_gen))
>> return 0;
>> if (flags & XFS_ALLOC_FLAG_FREEING)
>> return -EAGAIN;
>> }
>
> Indeed, that's a lot cleaner.
>
>>
>> /* wait for progressing resolving busy extents */
>> do {
>> prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE);
>> if (busy_gen != READ_ONCE(pag->pagb_gen))
>> break;
>> schedule();
>> } while (1);
>>
>> finish_wait(&pag->pagb_wait, &wait);
>> return 0;
>> }
>>
>> It seems cleaner to me to put this all in xfs_extent_busy_flush()
>> rather than having open-coded handling of extent free constraints in
>> each potential flush location. We already have retry semantics
>> around the flush, let's just extend them slightly....
>
> <nod> Wengang, how does this sound?
Thanks Darrick, pls see my previous reply.
thanks,
wengang
next prev parent reply other threads:[~2023-05-17 19:41 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-04-24 22:51 [PATCH] xfs: avoid freeing multiple extents from same AG in pending transactions Wengang Wang
2023-05-02 22:41 ` Wengang Wang
2023-05-12 18:24 ` Darrick J. Wong
2023-05-12 19:55 ` Wengang Wang
2023-05-12 21:13 ` Darrick J. Wong
2023-05-12 22:20 ` Wengang Wang
2023-05-16 16:05 ` Wengang Wang
2023-05-17 0:59 ` Darrick J. Wong
2023-05-17 1:40 ` Dave Chinner
2023-05-17 1:54 ` Darrick J. Wong
2023-05-17 19:41 ` Wengang Wang [this message]
2023-05-17 19:14 ` Wengang Wang
2023-05-17 22:49 ` Dave Chinner
2023-05-18 0:10 ` Wengang Wang
2023-05-18 1:01 ` Dave Chinner
2023-05-18 1:25 ` Wengang Wang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=2D4697C7-AF79-482B-B2D3-C41418FC6911@oracle.com \
--to=wen.gang.wang@oracle.com \
--cc=david@fromorbit.com \
--cc=djwong@kernel.org \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).