linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Wengang Wang <wen.gang.wang@oracle.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>,
	"linux-xfs@vger.kernel.org" <linux-xfs@vger.kernel.org>
Subject: Re: [PATCH] xfs: avoid freeing multiple extents from same AG in pending transactions
Date: Wed, 17 May 2023 19:41:31 +0000	[thread overview]
Message-ID: <2D4697C7-AF79-482B-B2D3-C41418FC6911@oracle.com> (raw)
In-Reply-To: <20230517015433.GQ858815@frogsfrogsfrogs>



> On May 16, 2023, at 6:54 PM, Darrick J. Wong <djwong@kernel.org> wrote:
> 
> On Wed, May 17, 2023 at 11:40:05AM +1000, Dave Chinner wrote:
>> On Tue, May 16, 2023 at 05:59:13PM -0700, Darrick J. Wong wrote:
>>> Since 6.3 we got rid of the _THIS_AG indirection stuff and that becomes:
>>> 
>>> xfs_alloc_fix_freelist ->
>>> xfs_alloc_ag_vextent_size ->
>>> (run all the way to the end of the bnobt) ->
>>> xfs_extent_busy_flush ->
>>> <stall on the busy extent that's in @tp->busy_list>
>>> 
>>> xfs_extent_busy_flush does this, potentially while we're holding the
>>> freed extent in @tp->t_busy_list:
>>> 
>>> error = xfs_log_force(mp, XFS_LOG_SYNC);
>>> if (error)
>>> return;
>>> 
>>> do {
>>> prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE);
>>> if  (busy_gen != READ_ONCE(pag->pagb_gen))
>>> break;
>>> schedule();
>>> } while (1);
>>> 
>>> finish_wait(&pag->pagb_wait, &wait);
>>> 
>>> The log force kicks the CIL to process whatever other committed items
>>> might be lurking in the log.  *Hopefully* someone else freed an extent
>>> in the same AG, so the log force has now caused that *other* extent to
>>> get processed so it has now cleared the busy list.  Clearing something
>>> from the busy list increments the busy generation (aka pagb_gen).
>> 
>> *nod*
>> 
>>> Unfortunately, there aren't any other extents, so the busy_gen does not
>>> change, and the loop runs forever.
>>> 
>>> At this point, Dave writes:
>>> 
>>> [15:57] <dchinner> so if we enter that function with busy extents on the
>>> transaction, and we are doing an extent free operation, we should return
>>> after the sync log force and not do the generation number wait
>>> 
>>> [15:58] <dchinner> if we fail to allocate again after the sync log force
>>> and the generation number hasn't changed, then return -EAGAIN because no
>>> progress has been made.
>>> 
>>> [15:59] <dchinner> Then the transaction is rolled, the transaction busy
>>> list is cleared, and if the next allocation attempt fails becaues
>>> everything is busy, we go to sleep waiting for the generation to change
>>> 
>>> [16:00] <dchinner> but because the transaction does not hold any busy
>>> extents, it cannot deadlock here because it does not pin any extents
>>> that are in the busy tree....
>>> 
>>> [16:05] <dchinner> All the generation number is doing here is telling us
>>> whether there was busy extent resolution between the time we last
>>> skipped a viable extent because it was busy and when the flush
>>> completes.
>>> 
>>> [16:06] <dchinner> it doesn't mean the next allocation will succeed,
>>> just that progress has been made so trying the allocation attempt will
>>> at least get a different result to the previous scan.
>>> 
>>> I think the callsites go from this:
>>> 
>>> if (busy) {
>>> xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
>>> trace_xfs_alloc_size_busy(args);
>>> xfs_extent_busy_flush(args->mp, args->pag, busy_gen);
>>> goto restart;
>>> }
>> 
>> I was thinking this code changes to:
>> 
>> flags |= XFS_ALLOC_FLAG_TRY_FLUSH;
>> ....
>> <attempt allocation>
>> ....
>> if (busy) {
>> xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
>> trace_xfs_alloc_size_busy(args);
>> error = xfs_extent_busy_flush(args->tp, args->pag,
>> busy_gen, flags);
>> if (!error) {
>> flags &= ~XFS_ALLOC_FLAG_TRY_FLUSH;
>> goto restart;
>> }
>> /* jump to cleanup exit point */
>> goto out_error;
>> }
>> 
>> Note the different first parameter - we pass args->tp, not args->mp
>> so that the flush has access to the transaction context...
> 
> <nod>
> 
>>> to something like this:
>>> 
>>> bool try_log_flush = true;
>>> ...
>>> restart:
>>> ...
>>> 
>>> if (busy) {
>>> bool progress;
>>> 
>>> xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
>>> trace_xfs_alloc_size_busy(args);
>>> 
>>> /*
>>>  * If the current transaction has an extent on the busy
>>>  * list, we're allocating space as part of freeing
>>>  * space, and all the free space is busy, we can't hang
>>>  * here forever.  Force the log to try to unbusy free
>>>  * space that could have been freed by other
>>>  * transactions, and retry the allocation.  If the
>>>  * allocation fails a second time because all the free
>>>  * space is busy and nobody made any progress with
>>>  * clearing busy extents, return EAGAIN so the caller
>>>  * can roll this transaction.
>>>  */
>>> if ((flags & XFS_ALLOC_FLAG_FREEING) &&
>>>     !list_empty(&tp->t_busy_list)) {
>>> int log_flushed;
>>> 
>>> if (try_log_flush) {
>>> _xfs_log_force(mp, XFS_LOG_SYNC, &log_flushed);
>>> try_log_flush = false;
>>> goto restart;
>>> }
>>> 
>>> if (busy_gen == READ_ONCE(pag->pagb_gen))
>>> return -EAGAIN;
>>> 
>>> /* XXX should we set try_log_flush = true? */
>>> goto restart;
>>> }
>>> 
>>> xfs_extent_busy_flush(args->mp, args->pag, busy_gen);
>>> goto restart;
>>> }
>>> 
>>> IOWs, I think Dave wants us to keep the changes in the allocator instead
>>> of spreading it around.
>> 
>> Sort of - I want the busy extent flush code to be isolated inside
>> xfs_extent_busy_flush(), not spread around the allocator. :)
>> 
>> xfs_extent_busy_flush(
>> struct xfs_trans *tp,
>> struct xfs_perag *pag,
>> unsigned int busy_gen,
>> unsigned int flags)
>> {
>> error = xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
>> if (error)
>> return error;
>> 
>> /*
>>  * If we are holding busy extents, the caller may not want
>>  * to block straight away. If we are being told just to try
>>  * a flush or progress has been made since we last skipped a busy
>>  * extent, return immediately to allow the caller to try
>>  * again. If we are freeing extents, we might actually be
>>  * holding the onyly free extents in the transaction busy
> 
>                       only
> 
>>  * list and the log force won't resolve that situation. In
>>  * this case, return -EAGAIN in that case to tell the caller
>>  * it needs to commit the busy extents it holds before
>>  * retrying the extent free operation.
>>  */
>> if (!list_empty(&tp->t_busy_list)) {
>> if (flags & XFS_ALLOC_FLAG_TRY_FLUSH)
>> return 0;
>> if (busy_gen != READ_ONCE(pag->pagb_gen))
>> return 0;
>> if (flags & XFS_ALLOC_FLAG_FREEING)
>> return -EAGAIN;
>> }
> 
> Indeed, that's a lot cleaner.
> 
>> 
>> /* wait for progressing resolving busy extents */
>> do {
>> prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE);
>> if  (busy_gen != READ_ONCE(pag->pagb_gen))
>> break;
>> schedule();
>> } while (1);
>> 
>> finish_wait(&pag->pagb_wait, &wait);
>> return 0;
>> }
>> 
>> It seems cleaner to me to put this all in xfs_extent_busy_flush()
>> rather than having open-coded handling of extent free constraints in
>> each potential flush location. We already have retry semantics
>> around the flush, let's just extend them slightly....
> 
> <nod> Wengang, how does this sound?

Thanks Darrick, pls see my previous reply.
thanks,
wengang


  reply	other threads:[~2023-05-17 19:41 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-04-24 22:51 [PATCH] xfs: avoid freeing multiple extents from same AG in pending transactions Wengang Wang
2023-05-02 22:41 ` Wengang Wang
2023-05-12 18:24 ` Darrick J. Wong
2023-05-12 19:55   ` Wengang Wang
2023-05-12 21:13     ` Darrick J. Wong
2023-05-12 22:20       ` Wengang Wang
2023-05-16 16:05         ` Wengang Wang
2023-05-17  0:59           ` Darrick J. Wong
2023-05-17  1:40             ` Dave Chinner
2023-05-17  1:54               ` Darrick J. Wong
2023-05-17 19:41                 ` Wengang Wang [this message]
2023-05-17 19:14               ` Wengang Wang
2023-05-17 22:49                 ` Dave Chinner
2023-05-18  0:10                   ` Wengang Wang
2023-05-18  1:01                     ` Dave Chinner
2023-05-18  1:25                       ` Wengang Wang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2D4697C7-AF79-482B-B2D3-C41418FC6911@oracle.com \
    --to=wen.gang.wang@oracle.com \
    --cc=david@fromorbit.com \
    --cc=djwong@kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).