All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mike Snitzer <snitzer@kernel.org>
To: Dave Chinner <dchinner@redhat.com>
Cc: Joe Thornber <thornber@redhat.com>,
	Eric Sandeen <sandeen@redhat.com>,
	dm-devel@lists.linux.dev, Mikulas Patocka <mpatocka@redhat.com>,
	Jonathan Brassow <jbrassow@redhat.com>
Subject: Re: RFE: dm-thin: fine-grained handling of REQ_FUA is needed
Date: Tue, 23 Apr 2024 13:26:31 -0400	[thread overview]
Message-ID: <ZifvR1_FXjPnjOO7@redhat.com> (raw)
In-Reply-To: <ZiftqzT1a8YO7gQN@redhat.com>

[one more time with feeling, after fixing dm-devel address typo]

On Tue, Apr 23, 2024 at 01:19:39PM -0400, Mike Snitzer wrote:
> [adding snitzer@kernel.org and dm-devel to the Cc so I don't lose this important thread again]
> [this was old Red Hat internal discussion that I tripped over like a dead cat, but there isn't anything sensitive here: dm-thinp needs to improve]
> 
> On Sun, Jan 24 2021 at  8:34P -0500,
> Dave Chinner <dchinner@redhat.com> wrote:
> 
> > On Thu, Jan 21, 2021 at 11:25:37PM -0500, Mike Snitzer wrote:
> > > On Thu, Jan 21 2021 at  9:33pm -0500,
> > > Dave Chinner <dchinner@redhat.com> wrote:
> > > 
> > > > On Thu, Jan 21, 2021 at 04:23:20PM +0000, Joe Thornber wrote:
> > > > > On Thu, 2021-01-21 at 17:45 +1100, Dave Chinner wrote:
> > > > > > On Wed, Jan 20, 2021 at 10:34:01AM +0000, Joe Thornber wrote:
> > > > > > 
> > > > > > > - Currently the tools unshare _metadata_ when restoring/repairing
> > > > > > > pools
> > > > > > > with snapshots.  Combined with the hard 16gb limit on metadata size
> > > > > > > this means some systems with many snapshots can't be repaired if
> > > > > > > they
> > > > > > > run into trouble.  Next release of the tools will fix, but as said
> > > > > > > before, we have to assess the software as is.
> > > > > > 
> > > > > > Can you explain more about this limit and what it means? The test
> > > > > > I'm currently running (1000 iterations of "write 10,000 4kB IOs,
> > > > > > snapshot"
> > > > > 
> > > > > Right let's talk about your scenario.
> > > > > 
> > > > > Firstly metadata.  Our transactional model is based on persistent data
> > > > > structures in the functional programming sense.  To update metadata in
> > > > > a new transaction we COW key metadata pages to build a new data
> > > > > structure that shares data with that of the previous transaction.  We
> > > > > amortise the costs associated with this by holding the transactions
> > > > > open as long as possible and allowing real mutation *within* a
> > > > > transaction.
> > > > > 
> > > > > There are advantages and disadvantages with this scheme.
> > > > > 
> > > > > Advantages: It's simple.  It allows the 'take snapshot' operation to be
> > > > > very quick (just writing a handful of pages).  It also allows us to
> > > > > take a 'metadata snapshot' of all the metadata which userland can
> > > > > access safely.  So we can move the implementation of certain
> > > > > query/health functions entirely to userland. (see thin_check, thin_ls,
> > > > > thin_dump etc).
> > > > > 
> > > > > Disadvantages: The biggest down side is everything is done in one
> > > > > global transaction; if a commit is required by one thin device (eg, due
> > > > > to a REQ_FUA), then other active thin devices will be temporarily
> > > > > slowed as the transaction commits.
> > > > 
> > > > OUCH!
> > > > 
> > > > That means every journal IO from the filesystem on the thin vol will
> > > > trigger a globally serialising metadata commit. It also means every
> > > > user O_DSYNC/RWF_DSYNC direct IO (which we optimise to FUA writes
> > > > to avoid journal flushes) will trigger a thinp commit.
> > > 
> > > So XFS hijacks REQ_FUA to bypass its journaling _but_ expects that
> > > REQ_FUA to have no impact on the underlying storage?  Why not clear
> > > REQ_FUA after XFS decides to bypass its journal?  Why use REQ_FUA for
> > > this?  What are your assumptions about how the underlying block device
> > > will respond?
> > 
> > "Hijack" is a bit ... strong. We replaced a full device cache flush
> > with an FUA write at the request of a partner that demonstrated that
> > using FUA writes in their applications on other operating systems on
> > the same hardware configurations ran them 25-50% faster than on
> > RHEL/XFS.
> > 
> > I reproduced their results locally, and the difference in
> > performance was the device level overhead between write+full device
> > caceh flush (the traditional linux DIO O_DSYNC implementation) and
> > using FUA writes to avoid the full device cache flush when only the
> > data in the IO itself needed to be written to stable storage.
> > 
> > It's not at all controversial - REQ_FUA has been used by filesystems
> > for years for their journal writes to optimise away the post-write
> > cache flush on hardware that supports FUA. And ext4, gfs2, zonefs
> > and btrfs all use this same code direct IO code path now, so it's
> > not just XFS doing this...
> > 
> > FWIW, I mused on the weekend that we don't even need to check that
> > the block device supports FUA to make this optimisation, because the
> > block layer will convert REQ_FUA to a normal write + post write
> > cache flush if the underlying device doesn't support FUA. So we are
> > likely to make all pure O_DSYNC data overwrites use REQ_FUA in the
> > future, not just on devices that support FUA...
> > 
> > > > This is what enterprise applications do for IO performance - small,
> > > > random, highly concurrent O_DSYNC AIO+DIO overwrites - so it seems
> > > > like this is going to cause IO serialisation problems even if there is 
> > > > only one thin-vol in a thin-pool...
> > > 
> > > Yes, but I currently cannot say this is thinp's fault.  REQ_FUA must be
> > > honored.  If dm-thinp is being asked to flush caches and get a
> > > consistent point-in-time it cannot be ignored.
> > 
> > Yes, that's the point I was trying to make. FUA is not a cache flush
> > request - it's much more fine grained than that, and so can be used
> > to elide full device cache flushes. Hence requiring global commits
> > to honor FUA data writes defeats the higher level optimisations that
> > try to minimise the device cache flush overhead...
> > 
> > > So what additonal data structures and approach should we be taking to
> > > mask the impact of needing to keep data and metadata consistent?
> > 
> > I'm not a dm-thin expert, but I can describe the constraints for
> > using REQ_FUA because I think they probably apply to dm-thin, too
> > 
> > For filesystems, there are per-inode metadata trees, and that's all
> > we need to be stable w.r.t. O_DSYNC data writes. If the per-inode
> > metadata is up-to-date on disk (i.e. we are doing a pure overwrite
> > and no metadata needs to change to perform the data IO) then we can
> > issue a REQ_FUA write.  If there is any metadata that needs to be
> > modified to perform the write (e.g. we need an allocation, convert
> > an unwritten extent, etc) or something else dirtied the inode
> > metadata prior to the write, then we do a normal data write and a
> > post-write journal flush (which is REQ_PREFLUSH|REQ_FUA write) to
> > ensure both the data and the metadata that references it is on
> > stable storage.
> > 
> > See this commit:
> > 
> > commit 3460cac1ca76215a60acb086ebe97b3e50731628
> > Author: Dave Chinner <dchinner@redhat.com>
> > Date:   Wed May 2 12:54:53 2018 -0700
> > 
> >     iomap: Use FUA for pure data O_DSYNC DIO writes
> > ....
> > 
> > Also, check the current upstream tree for IOMAP_F_DIRTY so the fs
> > can indicate that inode has O_DSYNC dirty metadata (allocate,
> > unwritten) or IOMAP_F_SHARED (requires COW), and the iomap
> > implementation uses the IOMAP_DIO_WRITE_FUA and IOMAP_DIO_NEED_SYNC
> > flags to determine what to do at IO completion.
> > 
> > I suspect that dm-thin would need to do something similar for
> > REQ_FUA writes - if the thin-dev metadata (rather than the global
> > pool metadata) is clean and there's an already allocated block being
> > overwritten into, then you can just pass the FUA down and not have
> > to run a global transaction commit.  If there's anything dirty, or
> > the IO requires COW or allocation, then it needs to run through the
> > existing slow path...
> > 
> > > > > This limits the number of active
> > > > > thin devices.  Similarly if different thin devices are performing
> > > > > operations that allocate new data space (provisioning, COW), then there
> > > > > can be contention in the allocators.
> > > > 
> > > > Or we are doing concurrent IO to the same thin device...
> > > 
> > > Concurrent IO that is causing a denial-of-service due to floods of
> > > REQ_FUA IO?  Yeap.. that sounds like it'd hurt ;)
> > 
> > This is what enterprise applications like SAP, MS-SQL, Seastar, etc
> > all do to extract maximum performance out of the underlying storage.
> > They can have hundreds, even thousands of O_DSYNC AIO+DIO in flight at
> > once, generally only limited only by software and hardware queue depths.
> > 
> > So, yeah, if "general use" is the goal, this better not be a DOS
> > vector. ;)
> 
> Just bumping this thread because it has been over 3 years now and
> dm-thinp hasn't seen any changes to address its heavy handling of
> REQ_FUA.
> 
> Mike

           reply	other threads:[~2024-04-23 17:26 UTC|newest]

Thread overview: expand[flat|nested]  mbox.gz  Atom feed
 [parent not found: <ZiftqzT1a8YO7gQN@redhat.com>]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZifvR1_FXjPnjOO7@redhat.com \
    --to=snitzer@kernel.org \
    --cc=dchinner@redhat.com \
    --cc=dm-devel@lists.linux.dev \
    --cc=jbrassow@redhat.com \
    --cc=mpatocka@redhat.com \
    --cc=sandeen@redhat.com \
    --cc=thornber@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.