public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Wengang Wang <wen.gang.wang@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: "linux-xfs@vger.kernel.org" <linux-xfs@vger.kernel.org>
Subject: Re: [PATCH 1/2] xfs: IO time one extent per EFI
Date: Mon, 24 Apr 2023 22:52:49 +0000	[thread overview]
Message-ID: <3B023A0D-DF2E-48B8-9BC9-46498CBCF8AB@oracle.com> (raw)
In-Reply-To: <134EE4A4-8F16-4E58-8764-EA49B0EE3107@oracle.com>



> On Apr 24, 2023, at 8:53 AM, Wengang Wang <wen.gang.wang@oracle.com> wrote:
> 
> 
> 
>> On Apr 21, 2023, at 8:22 PM, Wengang Wang <wen.gang.wang@oracle.com> wrote:
>> 
>> 
>> 
>>> On Apr 21, 2023, at 11:23 AM, Wengang Wang <wen.gang.wang@oracle.com> wrote:
>>> 
>>> 
>>> 
>>>> On Apr 21, 2023, at 2:34 AM, Dave Chinner <david@fromorbit.com> wrote:
>>>> 
>>>> On Fri, Apr 21, 2023 at 12:24:49AM +0000, Wengang Wang wrote:
>>>>>> On Apr 20, 2023, at 4:22 PM, Dave Chinner <david@fromorbit.com> wrote:
>>>>>> We don't do that anymore for partially processed multi-extent
>>>>>> intents anymore. Instead, we use deferred ops to chain updates. i.e.
>>>>>> we log a complete intent done items alongside a new intent
>>>>>> containing the remaining work to be done in the same transaction.
>>>>>> This cancels the original intent and atomically replaces it with a
>>>>>> new intent containing the remaining work to be done.
>>>>>> 
>>>>>> So when I say "update the EFD" I'm using historic terminology for
>>>>>> processing and recovering multi-extent intents. In modern terms,
>>>>>> what I mean is "update the deferred work intent chain to reflect the
>>>>>> work remaining to be done".
>>>>> 
>>>>> OK. so let’s see the difference between your implementation from mine.
>>>>> Say, there are three extents to free.
>>>>> 
>>>>> My patch would result in:
>>>>> 
>>>>> EFI 1  with extent1
>>>>> free extent1
>>>>> EFD 1 with extent1
>>>>> transaction roll
>>>>> EFI 2 with extent2
>>>>> free extent2
>>>>> EFD 2 with extent2
>>>>> transaction roll
>>>>> EFI 3 with extent3
>>>>> free extent3
>>>>> EFD3 with extent3
>>>>> transaction commit
>>>> 
>>>> No, it wouldn't. This isn't how the deferred work processes work
>>>> items on the transaction. A work item with multiple extents on it
>>>> would result in this:
>>>> 
>>>> xfs_defer_finish(tp)  # tp contains three xefi work items 
>>>> xfs_defer_finish_noroll
>>>> xfs_defer_create_intents()
>>>>   list_for_each_defer_pending
>>>>     xfs_defer_create_intent(dfp)
>>>> ops->create_intent(tp, &dfp->dfp_work, dfp->dfp_count, sort);
>>>> xfs_extent_free_create_intent()
>>>>   <create EFI>
>>>>   list_for_each_xefi
>>>>     xfs_extent_free_log_item(xefi)
>>>> <adds extent to current EFI>
>>>> 
>>>> xfs_defer_trans_roll()
>>>> <save>
>>>> xfs_trans_roll()
>>>>   xfs_trans_dup()
>>>>   xfs_trans_commit()
>>>> <restore>
>>>> 
>>>> At this point we have this committed to the CIL
>>>> 
>>>> EFI 1 with extent1
>>>> EFI 2 with extent2
>>>> EFI 3 with extent3
>>>> 
>>>> And xfs_defer_finish_noroll() continues with
>>>> 
>>>> <grabs first work item>
>>>> xfs_defer_finish_one(dfp)
>>>> ->create_done(EFI 1)
>>>>   xfs_extent_free_create_done
>>>> <create EFD>
>>>> list_for_each_xefi
>>>>   ops->finish_item(tp, dfp->dfp_done, li, &state);
>>>>     xfs_extent_free_finish_item()
>>>> xfs_trans_free_extent
>>>> __xfs_free_extent
>>>>   <adds extent to EFD>
>>>> 
>>>> And once the processing of the single work item is done we loop
>>>> back to the start of the xfs_defer_finish_noroll() loop. We don't
>>>> have any new intents, so xfs_defer_create_intents() returns false,
>>>> but we completed a dfp work item, so we run:
>>>> 
>>>> xfs_defer_trans_roll()
>>>> <save>
>>>> xfs_trans_roll()
>>>>   xfs_trans_dup()
>>>>   xfs_trans_commit()
>>>> <restore>
>>>> 
>>>> At this point we have this committed to the CIL
>>>> 
>>>> EFI 1 with extent1
>>>> EFI 2 with extent2
>>>> EFI 3 with extent3
>>>> <AGF, AGFL, free space btree block mods>
>>>> EFD 1 with extent1
>>>> 
>>>> Then we run xfs_defer_finish_one() on EFI 2, commit, then run
>>>> xfs_defer_finish_one() on EFI 3. At this point, we have in the log:
>>>> 
>>>> EFI 1 with extent1
>>>> EFI 2 with extent2
>>>> EFI 3 with extent3
>>>> <AGF, AGFL, free space btree block mods>
>>>> EFD 1 with extent1
>>>> <AGF, AGFL, free space btree block mods>
>>>> EFD 2 with extent2
>>>> 
>>>> But we have not committed the final extent free or EFD 3 - that's in
>>>> the last transaction context we pass back to the _xfs_trans_commit()
>>>> context for it to finalise and close off the rolling transaction
>>>> chain. At this point, we finally have this in the CIL:
>>>> 
>>>> EFI 1 with extent1
>>>> EFI 2 with extent2
>>>> EFI 3 with extent3
>>>> <AGF, AGFL, free space btree block mods>
>>>> EFD 1 with extent1
>>>> <AGF, AGFL, free space btree block mods>
>>>> EFD 2 with extent2
>>>> <AGF, AGFL, free space btree block mods>
>>>> EFD 3 with extent3
>>> 
>>> 
>>> Yes, correct. thanks for so much details! I mis-read xfs_defer_create_intents()
>>> thinking it only create intent for the first in tp->t_dfops.
>>> 
>>>> 
>>>>> The EFI/EFD log item pairs should not be written to log as they appear in same checkpoint.
>>>> 
>>>> I *always* ignore CIL intent whiteouts when thinking about
>>>> transaction chains and intents. That is purely a journal efficiency
>>>> optimisation, not something that is necessary for correct operation.
>>> 
>>> OK.
>>> 
>>>> 
>>>>> Your idea yields this:
>>>>> 
>>>>> EFI 1 with extent1 extent2 extent3
>>>>> free extent1
>>>>> EFI 2 with extent2 extent3
>>>>> EFD 1 with extent1 extent2 extent3
>>>>> transaction commit
>>>>> create transaction
>>>>> free extent2
>>>>> EFI 3 with extent3
>>>>> EFD 2 with extent extent2 extent3
>>>>> transaction commit
>>>>> create transaction
>>>>> free extent3
>>>>> EFD 3 with extent3
>>>>> transaction commit.
>>>> 
>>>> The EFI/EFD contents are correct, but the rest of it is not - I am
>>>> not suggesting open coding transaction rolling like that. Everything
>>>> I am suggesting works through the same defer ops mechanism as you
>>>> are describing. So if we start with the initial journal contents
>>>> looks like this:
>>>> 
>>>> EFI 1 with extent1 extent2 extent3.
>>>> 
>>>> Then we run xfs_defer_finish_one() on that work, 
>>>> 
>>>> xfs_defer_finish_one(dfp)
>>>> ->create_done(EFI 1)
>>>>   xfs_extent_free_create_done
>>>> <create EFD>
>>>> list_for_each_xefi
>>>>   ops->finish_item(tp, dfp->dfp_done, li, &state);
>>>>     xfs_extent_free_finish_item()
>>>> xfs_trans_free_extent
>>>> __xfs_free_extent
>>>>   <adds extent to EFD>
>>>> 
>>>> But now we have 3 xefis on the work to process. So on success of
>>>> the first call to xfs_trans_free_extent(), we want it to return
>>>> -EAGAIN to trigger the existing relogging code to create the new
>>>> EFI. How this works is describe in the section "Requesting a
>>>> Fresh Transaction while Finishing Deferred Work" in
>>>> fs/xfs/libxfs/xfs_defer.c, no point me duplicating that here.
>>>> 
>>>> The result is that the deferred work infrastructure does the work
>>>> of updaing the done intent and creating the new intents for the work
>>>> remaining. Hence after the next transaction roll, we have in the CIL
>>>> 
>>>> EFI 1 with extent1 extent2 extent3.
>>>> <AGF, AGFL, free space btree block mods>
>>>> EFD 1 with extent1 extent2 extent3.
>>>> EFI 2 with extent2 extent3.
>>>> 
>>> 
>>> Taking transaction rolls into account (also adding up to EFD3), above would be:
>>> 
>>> EFI 1 with extent1 extent2 extent3.
>>> transaction roll
>>> <AGF, AGFL, free space btree block mods>  for extent 1
>>> EFD 1 with extent1 extent2 extent3.
>>> EFI 2 with extent2 extent3.
>>> transaction roll
>>> free extent 2
>>> EFD 2 with extent2 extent3
>>> EFI 3 with extent3
>>> transaction roll
>>> free extent 3
>>> EFD 3 with extent3
>>> 
>>> Because EFD N is always be with EFI N+1 in the SAME transaction, so EFI N+1 doesn’t have to be placed before EFD N.
>>> I got it.
>>> 
>>>> And so the loop goes until there is no more work remaining.
>>>> 
>>>>> Your implementation also includes three EFI/EFD pairs, that’s the same as mine.
>>>>> So actually it’s still one extent per EFI per transaction. Though you are not changing
>>>>> XFS_EFI_MAX_FAST_EXTENTS.
>>>> 
>>>> The difference is that what I'm suggesting is that the extent free
>>>> code can decide when it needs a transaction to be rolled. It isn't
>>>> forced to always run a single free per transaction, it can decide
>>>> that it can free multiple extents per transaction if there is no
>>>> risk of deadlocks (e.g. extents are in different AGs).  Forcing
>>>> everything to be limited to a transaction per extent free even when
>>>> there is no risk of deadlocks freeing multiple extents in a single
>>>> transaction is unnecessary.
>>>> 
>>>> Indeed, if the second extent is in a different AG, we don't risk
>>>> busy extents causing us issues, so we could do:
>>>> 
>>>> EFI 1 with extent1 extent2 extent3.
>>>> <AGF 1, AGFL 1, free space btree block mods>
>>>> <AGF 2, AGFL 2, free space btree block mods>
>>>> EFD 1 with extent1 extent2 extent3.
>>>> EFI 2 with extent3.
>>>> .....
>>>> 
>>> 
>>> My thumb is up.
>> 
>> Well, I am wondering if there is ABBA deadlock instead of self deadlock then.
>> Say process 1 produced busy extent 1 on AG 0 and now blocking at AGFL
>> allocation on AG 1. And at the same time, process 2 produced busy extent 2 on AG 1
>> and now blocking at AGFL allocation on AG 0. Anything prevents that from happening?
>> 
>> Shall we introduce a per FS list, say named “pending_busy_AGs", where the AG number
>> are stored there. For those AGs
>> they have pending busy extents in in-memory transactions (not committed to CIL).
>> We add the AG number to pending_busy_AGs at the start of xfs_trans_free_extent()
>> if it’s not there, and continue to free the extent. Otherwise if the AG number is already
>> there in pending_busy_AGs, we return -EAGAIN. The AG number is removed when
>> the busy extents are committed to the xfs_cil_ctx.
>> Well, wondering if the pending busy extents would remain in pending state long before
>> committed to CIL and that become a bottle neck for freeing extents.  
> 
> Thinking more, we could add per AG counter indicating current pending busy extents on the AG.
> For the 2nd+ extent in the EFI, return -EAGAIN on seeing the counter set (>0). As the first extent
> is always OK to go, there shouldn’t be a bottle neck.

I just sent the first attempt (this way), pls review.

thanks,
wengang


  reply	other threads:[~2023-04-24 22:52 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-04-14 22:58 [PATCH 0/2] xfs: one extent per EFI Wengang Wang
2023-04-14 22:58 ` [PATCH 1/2] xfs: IO time " Wengang Wang
2023-04-19 23:55   ` Dave Chinner
2023-04-20 17:31     ` Wengang Wang
2023-04-20 23:22       ` Dave Chinner
2023-04-21  0:24         ` Wengang Wang
2023-04-21  9:34           ` Dave Chinner
2023-04-21 18:23             ` Wengang Wang
2023-04-22  3:22               ` Wengang Wang
2023-04-24 15:53                 ` Wengang Wang
2023-04-24 22:52                   ` Wengang Wang [this message]
2023-04-14 22:58 ` [PATCH 2/2] xfs: log recovery stage split EFIs with multiple extents Wengang Wang
2023-04-20  0:30   ` Dave Chinner
2023-04-20 17:10     ` Wengang Wang
2023-04-20 22:54       ` Dave Chinner
2023-04-21  0:32         ` Wengang Wang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3B023A0D-DF2E-48B8-9BC9-46498CBCF8AB@oracle.com \
    --to=wen.gang.wang@oracle.com \
    --cc=david@fromorbit.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox