public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Jeremy Bongio <bongiojp@gmail.com>
Cc: Ted Tso <tytso@mit.edu>, "Darrick J . Wong" <djwong@kernel.org>,
	Allison Henderson <allison.henderson@oracle.com>,
	linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-xfs@vger.kernel.org
Subject: Re: [PATCH 0/1] iomap regression for aio dio 4k writes
Date: Thu, 22 Jun 2023 11:55:23 +1000	[thread overview]
Message-ID: <ZJOqC7Cfjr5AoW7S@dread.disaster.area> (raw)
In-Reply-To: <ZJOO4SobNFaQ+C5g@dread.disaster.area>

On Thu, Jun 22, 2023 at 09:59:30AM +1000, Dave Chinner wrote:
> On Wed, Jun 21, 2023 at 10:29:19AM -0700, Jeremy Bongio wrote:
> > Since no issues have been found for ext4 calling completion work
> > directly in the io handler pre-iomap, it is unlikely that this is
> > unsafe (sleeping within an io handler callback). However, this may not
> > be true for all filesystems. Does XFS potentially sleep in its
> > completion code?
> 
> Yes, and ext4 does too. e.g. O_DSYNC overwrites always need to be
> deferred to task context to be able to take sleeping locks and
> potentially block on journal and or device cache flushes.
> 
> i.e. Have you considered what context all of XFS, f2fs, btrfs,
> zonefs and gfs2 need for pure DIO overwrite completion in all it's
> different variants?
> 
> AFAIC, it's far simpler conceptually to defer all writes to
> completion context than it is to try to work out what writes need to
> be deferred and what doesn't, especially as the filesystem ->end_io
> completion might need to sleep and the iomap code has no idea
> whether that is possible.

Ok, so having spent a bit more thought on this away from the office
this morning, I think there is a generic way we can avoid deferring
completions for pure overwrites.

We already have a mechanism in iomap that tells us if the write is a
pure overwrite and we use it to change how we issue O_DSYNC DIO
writes. i.e. we use it to determine if we can use FUA writes rather
than a post-IO journal/device cache flush to guarantee data
integrity. See IOMAP_DIO_WRITE_FUA for how we determine whether we
need issue a generic_write_sync() call or not in the post IO
completion processing.

The iomap flags that determines if we can make this optimisation are
IOMAP_F_SHARED and IOMAP_F_DIRTY. IOMAP_F_SHARED indicates a COW is
required to break sharing for the write IO to proceed, whilst
IOMAP_F_DIRTY indicates that the inode is either dirty or that the
write IO requires metadata to be dirtied at completion time (e.g.
unwritten extent conversion) before the sync operation that provides
data integrity guarantees can be run.

If neither of these flags are set in the iomap, it effectively means
that the IO is a pure overwrite. i.e. the filesytsem has explicitly
said that this write IO does not need any post-IO completion
filesystem work to be done.

At this point, the iomap code can optimise for a pure overwrite into
an IOMAP_MAPPED extent, knowing that the only thing it needs to care
about on completion is internal data integrity requirements (i.e.
O_DSYNC/O_SYNC) of the IO.

Hence if the filesystem has told iomap that it has no IO completion
requirements, and iomap doesn't need generic_write_sync() for data
integrity (i.e. no data integrity required or FUA was used for the
entire IO), then we could complete the DIO write directly from the
bio completion callback context...

IOWs, what you want can be done, it's just a whole lot more complex
than just avoiding a queue_work() call...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

  reply	other threads:[~2023-06-22  1:55 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-06-21 17:29 [PATCH 0/1] iomap regression for aio dio 4k writes Jeremy Bongio
2023-06-21 17:29 ` [PATCH 1/1] For DIO writes with no mapped pages for inode, skip deferring completion Jeremy Bongio
2023-06-21 18:55   ` Matthew Wilcox
2023-06-22  0:04   ` Dave Chinner
2023-06-21 23:59 ` [PATCH 0/1] iomap regression for aio dio 4k writes Dave Chinner
2023-06-22  1:55   ` Dave Chinner [this message]
2023-06-22  2:55     ` Matthew Wilcox
2023-06-22  4:08       ` Christoph Hellwig
2023-06-22  4:47       ` Dave Chinner
2023-06-23  2:32   ` Theodore Ts'o
2023-06-23  3:02     ` Dave Chinner
2023-06-22 23:22 ` Allison Henderson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZJOqC7Cfjr5AoW7S@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=allison.henderson@oracle.com \
    --cc=bongiojp@gmail.com \
    --cc=djwong@kernel.org \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox