Re: [PATCH 1/4] iomap: Lift blocksize restriction on atomic writes

From: John Garry <john.g.garry@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: brauner@kernel.org, djwong@kernel.org, cem@kernel.org,
	dchinner@redhat.com, hch@lst.de, ritesh.list@gmail.com,
	linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, martin.petersen@oracle.com
Subject: Re: [PATCH 1/4] iomap: Lift blocksize restriction on atomic writes
Date: Mon, 13 Jan 2025 21:35:01 +0000
Message-ID: <ef979627-52dc-4a15-896b-c848ab703cd6@oracle.com>
In-Reply-To: <Z1IX2dFida3coOxe@dread.disaster.area>

On 05/12/2024 21:15, Dave Chinner wrote:
> On Thu, Dec 05, 2024 at 10:52:50AM +0000, John Garry wrote:
>> On 04/12/2024 20:35, Dave Chinner wrote:
>>> On Wed, Dec 04, 2024 at 03:43:41PM +0000, John Garry wrote:
>>>> From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
>>>>
>>>> Filesystems like ext4 can submit writes in multiples of blocksizes.
>>>> But we still can't allow the writes to be split into multiple BIOs. Hence
>>>> let's check if the iomap_length() is same as iter->len or not.
>>>>
>>>> It is the responsibility of userspace to ensure that a write does not span
>>>> mixed unwritten and mapped extents (which would lead to multiple BIOs).
>>>
>>> How is "userspace" supposed to do this?
>>
>> If an atomic write spans mixed unwritten and mapped extents, then it should
>> manually zero the unwritten extents beforehand.
>>
>>>
>>> No existing utility in userspace is aware of atomic write limits or
>>> rtextsize configs, so how does "userspace" ensure everything is
>>> laid out in a manner compatible with atomic writes?
>>>
>>> e.g. restoring a backup (or other disaster recovery procedures) is
>>> going to have to lay the files out correctly for atomic writes.
>>> backup tools often sparsify the data set and so what gets restored
>>> will not have the same layout as the original data set...
>>
>> I am happy to support whatever is needed to make atomic writes work over
>> mixed extents if that is really an expected use case and it is a pain for an
>> application writer/admin to deal with this (by manually zeroing extents).
>>
>> JFYI, I did originally support the extent pre-zeroing for this. That was to
>> support a real-life scenario which we saw where we were attempting atomic
>> writes over mixed extents. The mixed extents were coming from userspace
>> punching holes and then attempting an atomic write over that space. However
>> that was using an early experimental and buggy forcealign; it was buggy as
>> it did not handle punching holes properly - it punched out single blocks and
>> not only full alloc units.
>>
>>>
>>> Where's the documentation that outlines all the restrictions on
>>> userspace behaviour to prevent this sort of problem being triggered?
>>
>> I would provide a man page update.
> 
> I think, at this point, we need a better way of documenting all the
> atomic write stuff in one place. Not just the user interface and
> what is expected of userspace, but also all the things the
> filesystems need to do to ensure atomic writes work correctly. I was
> thinking that a document somewhere in the Documentation/ directory,
> rather than random pieces of information splattered across random man pages
> would be a much better way of explaining all this.
> 
> Don't get me wrong - man pages explaining the programmatic API are
> necessary, but there's a whole lot more to understanding and making
> effective use of atomic writes than what has been added to the man
> pages so far.
> 
>>> Common operations such as truncate, hole punch,
>>
>> So how would punch hole be a problem? The atomic write unit max is limited
>> by the alloc unit, and we can only punch out full alloc units.
> 
> I was under the impression that this was a feature of the
> force-align code, not a feature of atomic writes. i.e. force-align
> is what ensures the BMBT aligns correctly with the underlying
> extents.
> 
> Or did I miss the fact that some of the force-align semantics bleed
> back into the original atomic write patch set?
> 
>>> buffered writes,
>>> reflinks, etc will trip over this, so application developers, users
>>> and admins really need to know what they should be doing to avoid
>>> stepping on this landmine...
>>
>> If this is not a real-life scenario which we expect to see, then I don't see
>> why we would add the complexity to the kernel for this.
> 
> I gave you one above - restoring a data set as a result of disaster
> recovery.
> 
>> My motivation for atomic writes support is to support atomically writing
>> a large database internal page size. If the database only writes at a fixed
>> internal page size, then we should not see mixed mappings.
> 
> Yup, that's the problem here. Once atomic writes are supported by
> the kernel and userspace, all sorts of applications are going to
> start using them in all sorts of ways you didn't think of.
> 
>> But you see potential problems elsewhere ..
> 
> That's my job as a senior engineer with 20+ years of experience in
> filesystems and storage related applications. I see far because I
> stand on the shoulders of giants - I don't try to be a giant myself.
> 
> Other people become giants by implementing ground-breaking features
> (e.g. atomic writes), but without the people who can see far
> enough ahead, just adding features ends up with an incoherent mess of
> special interest niche features rather than a neatly integrated set
> of widely usable generic features.
> 
> e.g. look at MySQL's use of fallocate(hole punch) for transparent
> data compression - nobody had foreseen that hole punching would be
> used like this, but it's a massive win for the applications which
> store bulk compressible data in the database even though it does bad
> things to the filesystem.
> 
> Spend some time looking outside the proprietary database application
> box and think a little harder about the implications of atomic write
> functionality.  i.e. what happens when we have ubiquitous support
> for guaranteeing only the old or the new data will be seen after
> a crash *without the need for using fsync*.
> 
> Think about the implications of that for a minute - for any full
> file overwrite up to the hardware atomic limits, we won't need fsync
> to guarantee the integrity of overwritten data anymore. We only need
> a mechanism to flush the journal and device caches once all the data
> has been written (e.g. syncfs)...
> 
> Want to overwrite a bunch of small files safely?  Atomic write the
> new data, then syncfs(). There's no need to run fdatasync after each
> write to ensure individual files are not corrupted if we crash in
> the middle of the operation. Indeed, atomic writes actually provide
> better overwrite integrity semantics than fdatasync as it will be
> all or nothing. fdatasync does not provide that guarantee if we
> crash during the fdatasync operation.
> 
> Further, with COW data filesystems like XFS, btrfs and bcachefs, we
> can emulate atomic writes for any size larger than what the hardware
> supports.
> 
> At this point we actually provide app developers with what they've
> been repeatedly asking kernel filesystem engineers to provide them
> for the past 20 years: a way of overwriting arbitrary file data
> safely without needing an expensive fdatasync operation on every
> file that gets modified.
> 
> Put simply: atomic writes have a huge potential to fundamentally
> change the way applications interact with Linux filesystems and to
> make it *much* simpler for applications to safely overwrite user
> data.  Hence there is an imperative here to make the foundational
> support for this technology solid and robust because atomic writes
> are going to be with us for the next few decades...
> 
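
As a rough illustration of the "atomic write limits" mentioned above,
here is a minimal userspace sketch of the checks that, as far as I
understand the current man pages, a single RWF_ATOMIC write has to
pass: the length is a power of two, it lies within the unit min/max
that statx() reports, and the offset is naturally aligned to the
length. The function name and its parameters are made up for this
example; it only models the rules and issues no I/O.

```python
def atomic_write_ok(offset: int, length: int,
                    unit_min: int, unit_max: int) -> bool:
    """Model of the per-write rules for an untorn (RWF_ATOMIC) write.

    unit_min and unit_max stand in for the values statx() would report
    for the file; here they are plain parameters (an assumption of this
    sketch, nothing is queried from the kernel).
    """
    # The length must be a non-zero power of two.
    if length <= 0 or length & (length - 1):
        return False
    # The length must fall inside the advertised unit range.
    if not unit_min <= length <= unit_max:
        return False
    # The offset must be naturally aligned to the write length.
    return offset % length == 0
```

With unit_min = 4096 and unit_max = 65536, a 16 KiB write at offset
16384 passes, while the same write at offset 4096 does not, because
the offset is not a multiple of the length.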



Dave,

I provided a proposal to solve this issue in
https://lore.kernel.org/lkml/20241210125737.786928-3-john.g.garry@oracle.com/
(there is also a v3, which is much the same), but I can't make progress,
as there is no agreement on how this should be implemented, if at all.
Any input there would be appreciated...

Cheers
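
For anyone joining the thread here, the mixed-mapping issue referred
to above can be modelled in a few lines. This is a toy userspace
model, not the iomap code: the Extent class, its state names and both
helper functions are hypothetical. single_mapping() mirrors the idea
from the patch description that an atomic write has to be covered by
one mapping so that it is not split into multiple BIOs, and
unwritten_ranges() lists what userspace would have to zero beforehand
under the approach discussed earlier in the thread.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Extent:
    """One file mapping in this toy model (all fields in bytes)."""
    start: int
    length: int
    state: str  # "written" or "unwritten" (names are assumptions)


def single_mapping(extents: Sequence[Extent], pos: int, length: int) -> bool:
    """True if [pos, pos + length) lies entirely inside one extent."""
    end = pos + length
    return any(e.start <= pos and end <= e.start + e.length
               for e in extents)


def unwritten_ranges(extents: Sequence[Extent], pos: int,
                     length: int) -> List[Tuple[int, int]]:
    """Return (offset, length) pieces of the write that hit unwritten extents."""
    end = pos + length
    pieces = []
    for e in extents:
        # Clip the extent to the write range.
        lo = max(pos, e.start)
        hi = min(end, e.start + e.length)
        if lo < hi and e.state == "unwritten":
            pieces.append((lo, hi - lo))
    return pieces
```

With a written extent at bytes 0..4095 followed by an unwritten extent
at 4096..8191, an 8 KiB write at offset 0 straddles both, so
single_mapping() is false and unwritten_ranges() reports the second
4 KiB. A 4 KiB write at offset 0 stays inside one mapping.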
