From: Wengang Wang <wen.gang.wang@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: "linux-xfs@vger.kernel.org" <linux-xfs@vger.kernel.org>
Subject: Re: [PATCH 1/2] xfs: IO time one extent per EFI
Date: Mon, 24 Apr 2023 22:52:49 +0000 [thread overview]
Message-ID: <3B023A0D-DF2E-48B8-9BC9-46498CBCF8AB@oracle.com> (raw)
In-Reply-To: <134EE4A4-8F16-4E58-8764-EA49B0EE3107@oracle.com>
> On Apr 24, 2023, at 8:53 AM, Wengang Wang <wen.gang.wang@oracle.com> wrote:
>
>
>
>> On Apr 21, 2023, at 8:22 PM, Wengang Wang <wen.gang.wang@oracle.com> wrote:
>>
>>
>>
>>> On Apr 21, 2023, at 11:23 AM, Wengang Wang <wen.gang.wang@oracle.com> wrote:
>>>
>>>
>>>
>>>> On Apr 21, 2023, at 2:34 AM, Dave Chinner <david@fromorbit.com> wrote:
>>>>
>>>> On Fri, Apr 21, 2023 at 12:24:49AM +0000, Wengang Wang wrote:
>>>>>> On Apr 20, 2023, at 4:22 PM, Dave Chinner <david@fromorbit.com> wrote:
>>>>>> We don't do that anymore for partially processed multi-extent
>>>>>> intents anymore. Instead, we use deferred ops to chain updates. i.e.
>>>>>> we log a complete intent done items alongside a new intent
>>>>>> containing the remaining work to be done in the same transaction.
>>>>>> This cancels the original intent and atomically replaces it with a
>>>>>> new intent containing the remaining work to be done.
>>>>>>
>>>>>> So when I say "update the EFD" I'm using historic terminology for
>>>>>> processing and recovering multi-extent intents. In modern terms,
>>>>>> what I mean is "update the deferred work intent chain to reflect the
>>>>>> work remaining to be done".
>>>>>
>>>>> OK. so let’s see the difference between your implementation from mine.
>>>>> Say, there are three extents to free.
>>>>>
>>>>> My patch would result in:
>>>>>
>>>>> EFI 1 with extent1
>>>>> free extent1
>>>>> EFD 1 with extent1
>>>>> transaction roll
>>>>> EFI 2 with extent2
>>>>> free extent2
>>>>> EFD 2 with extent2
>>>>> transaction roll
>>>>> EFI 3 with extent3
>>>>> free extent3
>>>>> EFD3 with extent3
>>>>> transaction commit
>>>>
>>>> No, it wouldn't. This isn't how the deferred work processes work
>>>> items on the transaction. A work item with multiple extents on it
>>>> would result in this:
>>>>
>>>> xfs_defer_finish(tp) # tp contains three xefi work items
>>>> xfs_defer_finish_noroll
>>>> xfs_defer_create_intents()
>>>> list_for_each_defer_pending
>>>> xfs_defer_create_intent(dfp)
>>>> ops->create_intent(tp, &dfp->dfp_work, dfp->dfp_count, sort);
>>>> xfs_extent_free_create_intent()
>>>> <create EFI>
>>>> list_for_each_xefi
>>>> xfs_extent_free_log_item(xefi)
>>>> <adds extent to current EFI>
>>>>
>>>> xfs_defer_trans_roll()
>>>> <save>
>>>> xfs_trans_roll()
>>>> xfs_trans_dup()
>>>> xfs_trans_commit()
>>>> <restore>
>>>>
>>>> At this point we have this committed to the CIL
>>>>
>>>> EFI 1 with extent1
>>>> EFI 2 with extent2
>>>> EFI 3 with extent3
>>>>
>>>> And xfs_defer_finish_noroll() continues with
>>>>
>>>> <grabs first work item>
>>>> xfs_defer_finish_one(dfp)
>>>> ->create_done(EFI 1)
>>>> xfs_extent_free_create_done
>>>> <create EFD>
>>>> list_for_each_xefi
>>>> ops->finish_item(tp, dfp->dfp_done, li, &state);
>>>> xfs_extent_free_finish_item()
>>>> xfs_trans_free_extent
>>>> __xfs_free_extent
>>>> <adds extent to EFD>
>>>>
>>>> And once the processing of the single work item is done we loop
>>>> back to the start of the xfs_defer_finish_noroll() loop. We don't
>>>> have any new intents, so xfs_defer_create_intents() returns false,
>>>> but we completed a dfp work item, so we run:
>>>>
>>>> xfs_defer_trans_roll()
>>>> <save>
>>>> xfs_trans_roll()
>>>> xfs_trans_dup()
>>>> xfs_trans_commit()
>>>> <restore>
>>>>
>>>> At this point we have this committed to the CIL
>>>>
>>>> EFI 1 with extent1
>>>> EFI 2 with extent2
>>>> EFI 3 with extent3
>>>> <AGF, AGFL, free space btree block mods>
>>>> EFD 1 with extent1
>>>>
>>>> Then we run xfs_defer_finish_one() on EFI 2, commit, then run
>>>> xfs_defer_finish_one() on EFI 3. At this point, we have in the log:
>>>>
>>>> EFI 1 with extent1
>>>> EFI 2 with extent2
>>>> EFI 3 with extent3
>>>> <AGF, AGFL, free space btree block mods>
>>>> EFD 1 with extent1
>>>> <AGF, AGFL, free space btree block mods>
>>>> EFD 2 with extent2
>>>>
>>>> But we have not committed the final extent free or EFD 3 - that's in
>>>> the last transaction context we pass back to the _xfs_trans_commit()
>>>> context for it to finalise and close off the rolling transaction
>>>> chain. At this point, we finally have this in the CIL:
>>>>
>>>> EFI 1 with extent1
>>>> EFI 2 with extent2
>>>> EFI 3 with extent3
>>>> <AGF, AGFL, free space btree block mods>
>>>> EFD 1 with extent1
>>>> <AGF, AGFL, free space btree block mods>
>>>> EFD 2 with extent2
>>>> <AGF, AGFL, free space btree block mods>
>>>> EFD 3 with extent3
>>>
>>>
>>> Yes, correct. thanks for so much details! I mis-read xfs_defer_create_intents()
>>> thinking it only create intent for the first in tp->t_dfops.
>>>
>>>>
>>>>> The EFI/EFD log item pairs should not be written to log as they appear in same checkpoint.
>>>>
>>>> I *always* ignore CIL intent whiteouts when thinking about
>>>> transaction chains and intents. That is purely a journal efficiency
>>>> optimisation, not something that is necessary for correct operation.
>>>
>>> OK.
>>>
>>>>
>>>>> Your idea yields this:
>>>>>
>>>>> EFI 1 with extent1 extent2 extent3
>>>>> free extent1
>>>>> EFI 2 with extent2 extent3
>>>>> EFD 1 with extent1 extent2 extent3
>>>>> transaction commit
>>>>> create transaction
>>>>> free extent2
>>>>> EFI 3 with extent3
>>>>> EFD 2 with extent extent2 extent3
>>>>> transaction commit
>>>>> create transaction
>>>>> free extent3
>>>>> EFD 3 with extent3
>>>>> transaction commit.
>>>>
>>>> The EFI/EFD contents are correct, but the rest of it is not - I am
>>>> not suggesting open coding transaction rolling like that. Everything
>>>> I am suggesting works through the same defer ops mechanism as you
>>>> are describing. So if we start with the initial journal contents
>>>> looks like this:
>>>>
>>>> EFI 1 with extent1 extent2 extent3.
>>>>
>>>> Then we run xfs_defer_finish_one() on that work,
>>>>
>>>> xfs_defer_finish_one(dfp)
>>>> ->create_done(EFI 1)
>>>> xfs_extent_free_create_done
>>>> <create EFD>
>>>> list_for_each_xefi
>>>> ops->finish_item(tp, dfp->dfp_done, li, &state);
>>>> xfs_extent_free_finish_item()
>>>> xfs_trans_free_extent
>>>> __xfs_free_extent
>>>> <adds extent to EFD>
>>>>
>>>> But now we have 3 xefis on the work to process. So on success of
>>>> the first call to xfs_trans_free_extent(), we want it to return
>>>> -EAGAIN to trigger the existing relogging code to create the new
>>>> EFI. How this works is describe in the section "Requesting a
>>>> Fresh Transaction while Finishing Deferred Work" in
>>>> fs/xfs/libxfs/xfs_defer.c, no point me duplicating that here.
>>>>
>>>> The result is that the deferred work infrastructure does the work
>>>> of updaing the done intent and creating the new intents for the work
>>>> remaining. Hence after the next transaction roll, we have in the CIL
>>>>
>>>> EFI 1 with extent1 extent2 extent3.
>>>> <AGF, AGFL, free space btree block mods>
>>>> EFD 1 with extent1 extent2 extent3.
>>>> EFI 2 with extent2 extent3.
>>>>
>>>
>>> Taking transaction rolls into account (also adding up to EFD3), above would be:
>>>
>>> EFI 1 with extent1 extent2 extent3.
>>> transaction roll
>>> <AGF, AGFL, free space btree block mods> for extent 1
>>> EFD 1 with extent1 extent2 extent3.
>>> EFI 2 with extent2 extent3.
>>> transaction roll
>>> free extent 2
>>> EFD 2 with extent2 extent3
>>> EFI 3 with extent3
>>> transaction roll
>>> free extent 3
>>> EFD 3 with extent3
>>>
>>> Because EFD N is always be with EFI N+1 in the SAME transaction, so EFI N+1 doesn’t have to be placed before EFD N.
>>> I got it.
>>>
>>>> And so the loop goes until there is no more work remaining.
>>>>
>>>>> Your implementation also includes three EFI/EFD pairs, that’s the same as mine.
>>>>> So actually it’s still one extent per EFI per transaction. Though you are not changing
>>>>> XFS_EFI_MAX_FAST_EXTENTS.
>>>>
>>>> The difference is that what I'm suggesting is that the extent free
>>>> code can decide when it needs a transaction to be rolled. It isn't
>>>> forced to always run a single free per transaction, it can decide
>>>> that it can free multiple extents per transaction if there is no
>>>> risk of deadlocks (e.g. extents are in different AGs). Forcing
>>>> everything to be limited to a transaction per extent free even when
>>>> there is no risk of deadlocks freeing multiple extents in a single
>>>> transaction is unnecessary.
>>>>
>>>> Indeed, if the second extent is in a different AG, we don't risk
>>>> busy extents causing us issues, so we could do:
>>>>
>>>> EFI 1 with extent1 extent2 extent3.
>>>> <AGF 1, AGFL 1, free space btree block mods>
>>>> <AGF 2, AGFL 2, free space btree block mods>
>>>> EFD 1 with extent1 extent2 extent3.
>>>> EFI 2 with extent3.
>>>> .....
>>>>
>>>
>>> My thumb is up.
>>
>> Well, I am wondering if there is ABBA deadlock instead of self deadlock then.
>> Say process 1 produced busy extent 1 on AG 0 and now blocking at AGFL
>> allocation on AG 1. And at the same time, process 2 produced busy extent 2 on AG 1
>> and now blocking at AGFL allocation on AG 0. Anything prevents that from happening?
>>
>> Shall we introduce a per FS list, say named “pending_busy_AGs", where the AG number
>> are stored there. For those AGs
>> they have pending busy extents in in-memory transactions (not committed to CIL).
>> We add the AG number to pending_busy_AGs at the start of xfs_trans_free_extent()
>> if it’s not there, and continue to free the extent. Otherwise if the AG number is already
>> there in pending_busy_AGs, we return -EAGAIN. The AG number is removed when
>> the busy extents are committed to the xfs_cil_ctx.
>> Well, wondering if the pending busy extents would remain in pending state long before
>> committed to CIL and that become a bottle neck for freeing extents.
>
> Thinking more, we could add per AG counter indicating current pending busy extents on the AG.
> For the 2nd+ extent in the EFI, return -EAGAIN on seeing the counter set (>0). As the first extent
> is always OK to go, there shouldn’t be a bottle neck.
I just sent the first attempt (this way), pls review.
thanks,
wengang
next prev parent reply other threads:[~2023-04-24 22:52 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-04-14 22:58 [PATCH 0/2] xfs: one extent per EFI Wengang Wang
2023-04-14 22:58 ` [PATCH 1/2] xfs: IO time " Wengang Wang
2023-04-19 23:55 ` Dave Chinner
2023-04-20 17:31 ` Wengang Wang
2023-04-20 23:22 ` Dave Chinner
2023-04-21 0:24 ` Wengang Wang
2023-04-21 9:34 ` Dave Chinner
2023-04-21 18:23 ` Wengang Wang
2023-04-22 3:22 ` Wengang Wang
2023-04-24 15:53 ` Wengang Wang
2023-04-24 22:52 ` Wengang Wang [this message]
2023-04-14 22:58 ` [PATCH 2/2] xfs: log recovery stage split EFIs with multiple extents Wengang Wang
2023-04-20 0:30 ` Dave Chinner
2023-04-20 17:10 ` Wengang Wang
2023-04-20 22:54 ` Dave Chinner
2023-04-21 0:32 ` Wengang Wang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=3B023A0D-DF2E-48B8-9BC9-46498CBCF8AB@oracle.com \
--to=wen.gang.wang@oracle.com \
--cc=david@fromorbit.com \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox