linux-fsdevel.vger.kernel.org archive mirror
From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>,
	John Garry <john.g.garry@oracle.com>,
	brauner@kernel.org, cem@kernel.org, linux-xfs@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	ojaswin@linux.ibm.com, ritesh.list@gmail.com,
	martin.petersen@oracle.com
Subject: Re: [PATCH v5 03/10] xfs: Refactor xfs_reflink_end_cow_extent()
Date: Wed, 12 Mar 2025 21:51:21 -0700
Message-ID: <20250313045121.GE2803730@frogsfrogsfrogs>
In-Reply-To: <Z9I0Ab5TyBEdkC32@dread.disaster.area>

On Thu, Mar 13, 2025 at 12:25:21PM +1100, Dave Chinner wrote:
> On Wed, Mar 12, 2025 at 08:46:36AM -0700, Darrick J. Wong wrote:
> > On Wed, Mar 12, 2025 at 01:35:23AM -0700, Christoph Hellwig wrote:
> > > On Wed, Mar 12, 2025 at 08:27:05AM +0000, John Garry wrote:
> > > > On 12/03/2025 07:24, Christoph Hellwig wrote:
> > > > > On Mon, Mar 10, 2025 at 06:39:39PM +0000, John Garry wrote:
> > > > > > Refactor xfs_reflink_end_cow_extent() into separate parts which process
> > > > > > the CoW range and commit the transaction.
> > > > > > 
> > > > > > This refactoring will be used in future for when it is required to commit
> > > > > > a range of extents as a single transaction, similar to how it was done
> > > > > > pre-commit d6f215f359637.
> > > > > 
> > > > > Darrick pointed out that if you do more than just a tiny number
> > > > > of extents per transaction you run out of log reservations very
> > > > > quickly here:
> > > > > 
> > > > > https://lore.kernel.org/all/20240329162936.GI6390@frogsfrogsfrogs/
> > > > > 
> > > > > how does your scheme deal with that?
> > > > > 
> > > > The resblks calculation in xfs_reflink_end_atomic_cow() takes care of this,
> > > > right? Or does the log reservation have a hard size limit, regardless of
> > > > that calculation?
> > > 
> > > The resblks calculated there are the reserved disk blocks
> 
> Used for btree block allocations that might be needed during the
> processing of the transaction.
> 
> > > and have
> > > nothing to do with the log reservations, which come from the
> > > tr_write field passed in.  There is some kind of upper limit to it
> > > obviously by the log size, although I'm not sure if we've formalized
> > > that somewhere.  Dave might be the right person to ask about that.
> > 
> > The (very very rough) upper limit for how many intent items you can
> > attach to a tr_write transaction is:
> > 
> > per_extent_cost = (cui_size + rui_size + bui_size + efi_size + ili_size)
> > max_blocks = tr_write::tr_logres / per_extent_cost
> > 
> > (ili_size is the inode log item size)
> 
> That doesn't sound right. The number of intents we can log is not
> dependent on the aggregated size of all intent types. We do not log
> all those intent types in a single transaction, nor do we process
> more than one type of intent in a given transaction. Also, we only
> log the inode once per transaction, so that is not a per-extent
> overhead.
> 
> Realistically, the tr_write transaction is going to be at least
> 100kB because it has to be big enough to log full splits of multiple
> btrees (e.g. BMBT + both free space trees). Yeah, a small 4kB
> filesystem spits out:
> 
> xfs_trans_resv_calc:  dev 7:0 type 0 logres 193528 logcount 5 flags 0x4
> 
> About 190kB.
> 
> However, intents are typically very small - around 32 bytes in size
> plus another 12 bytes for the log region ophdr.
> 
> This implies that we can fit thousands of individual intents in a
> single tr_write log reservation on any given filesystem, and the
> number of loop iterations in a transaction is therefore dependent
> largely on how many intents are logged per iteration.
> 
> Hence if we are walking a range of extents in the BMBT to unmap
> them, then we should only be generating 2 intents per loop - a BUI
> for the BMBT removal and a CUI for the shared refcount decrease.
> That means we should be able to run at least a thousand iterations
> of that loop per transaction without getting anywhere near the
> transaction reservation limits.
> 
> *However!*
> 
> We have to relog every intent we haven't processed in the deferred
> batch every-so-often to prevent the outstanding intents from pinning
> the tail of the log. Hence the larger the number of intents in the
> initial batch, the more work we have to do later on (and the more
> overall log space and bandwidth they will consume) to relog them
> over and over again until they pop to the head of the
> processing queue.
> 
> Hence there is no real performance advantage to creating massive intent
> batches because we end up doing more work later on to relog those
> intents to prevent journal space deadlocks. It also doesn't speed up
> processing, because we still process the intent chains one at a time
> from start to completion before moving on to the next high level
> intent chain that needs to be processed.
> 
> Further, after the first couple of intent chains have been
> processed, the initial log space reservation will have run out, and
> we are now asking for a new reservation on every transaction roll we
> do. i.e. we are now doing a log space reservation on every
> transaction roll in the processing chain instead of only doing it
> once per high level intent chain.
> 
> Hence from a log space accounting perspective (the hottest code path
> in the journal), it is far more efficient to perform a single high
> level transaction per extent unmap operation than it is to batch
> intents into a single high level transaction.
> 
> My advice is this: we should never batch high level iterative
> intent-based operations into a single transaction because it's a
> false optimisation.  It might look like it is an efficiency
> improvement from the high level, but it ends up hammering the hot,
> performance critical paths in the transaction subsystem much, much
> harder and so will end up being slower than the single transaction
> per intent-based operation algorithm when it matters most....

How specifically do you propose remapping all the extents in a file
range after an untorn write?  The regular cow ioend does a single
transaction per extent across the entire ioend range and cannot deliver
untorn writes.  This latest proposal does, but now you've torn that idea
down too.

At this point I have run out of ideas and conclude that I can only submit
to your superior intellect.

--D

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
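
As a rough check on the arithmetic in the quoted message, the sketch below
plugs in the figures given there: a tr_write logres of 193528 bytes, about
32 bytes per intent plus a 12 byte ophdr, and two intents (one BUI, one CUI)
per unmapped extent. The function names are hypothetical, and the result is
only an upper bound under the assumption that the whole reservation is
available for intents; it ignores the space needed for btree splits, the
inode log item, and the later relogging cost described above.

```python
# Figures taken from the quoted message; the helper names are hypothetical
# and exist only for this illustration.
TR_WRITE_LOGRES = 193528   # bytes, from the xfs_trans_resv_calc trace line
INTENT_BYTES = 32          # "around 32 bytes" per intent item
OPHDR_BYTES = 12           # log region ophdr charged per intent
INTENTS_PER_EXTENT = 2     # one BUI plus one CUI per unmapped extent


def max_intents(logres: int = TR_WRITE_LOGRES) -> int:
    """Upper bound on intents, assuming the whole reservation holds intents."""
    return logres // (INTENT_BYTES + OPHDR_BYTES)


def max_extents(logres: int = TR_WRITE_LOGRES,
                intents_per_extent: int = INTENTS_PER_EXTENT) -> int:
    """Upper bound on loop iterations (extents) per transaction."""
    return max_intents(logres) // intents_per_extent


if __name__ == "__main__":
    print("intents per reservation:", max_intents())
    print("extents per reservation:", max_extents())
```

Under these assumptions the bound comes out at a few thousand intents, which
is consistent with the "at least a thousand iterations" estimate in the
quoted text. It says nothing about whether batching that many is worthwhile.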
