From: John Garry <john.g.garry@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: brauner@kernel.org, djwong@kernel.org, cem@kernel.org,
dchinner@redhat.com, hch@lst.de, ritesh.list@gmail.com,
linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org, martin.petersen@oracle.com
Subject: Re: [PATCH 1/4] iomap: Lift blocksize restriction on atomic writes
Date: Mon, 13 Jan 2025 21:35:01 +0000
Message-ID: <ef979627-52dc-4a15-896b-c848ab703cd6@oracle.com>
In-Reply-To: <Z1IX2dFida3coOxe@dread.disaster.area>
On 05/12/2024 21:15, Dave Chinner wrote:
> On Thu, Dec 05, 2024 at 10:52:50AM +0000, John Garry wrote:
>> On 04/12/2024 20:35, Dave Chinner wrote:
>>> On Wed, Dec 04, 2024 at 03:43:41PM +0000, John Garry wrote:
>>>> From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
>>>>
>>>> Filesystems like ext4 can submit writes in multiples of blocksizes.
>>>> But we still can't allow the writes to be split into multiple BIOs. Hence
>>>> let's check if the iomap_length() is same as iter->len or not.
>>>>
>>>> It is the responsibility of userspace to ensure that a write does not span
>>>> mixed unwritten and mapped extents (which would lead to multiple BIOs).
>>>
>>> How is "userspace" supposed to do this?
>>
>> If an atomic write spans mixed unwritten and mapped extents, then it should
>> manually zero the unwritten extents beforehand.
>>
>>>
>>> No existing utility in userspace is aware of atomic write limits or
>>> rtextsize configs, so how does "userspace" ensure everything is
>>> laid out in a manner compatible with atomic writes?
>>>
>>> e.g. restoring a backup (or other disaster recovery procedures) is
>>> going to have to lay the files out correctly for atomic writes.
>>> backup tools often sparsify the data set and so what gets restored
>>> will not have the same layout as the original data set...
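
For reference, the size and alignment rules that a userspace tool would have to respect can be sketched as below. This is only a minimal sketch: struct atomic_limits and atomic_write_ok() are hypothetical names, and the two limits are assumed to have been read from the stx_atomic_write_unit_min and stx_atomic_write_unit_max fields reported by statx(2) with STATX_WRITE_ATOMIC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the limits reported by statx(2). */
struct atomic_limits {
	uint32_t unit_min;	/* assumed: stx_atomic_write_unit_min */
	uint32_t unit_max;	/* assumed: stx_atomic_write_unit_max */
};

static bool is_power_of_two(uint64_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/*
 * Check the documented RWF_ATOMIC constraints: the length is a power of
 * two within [unit_min, unit_max] and the offset is naturally aligned
 * to the length.
 */
static bool atomic_write_ok(const struct atomic_limits *lim,
			    uint64_t offset, uint64_t len)
{
	assert(lim != NULL);

	/* Limits of zero mean atomic writes are not supported here. */
	if (lim->unit_min == 0 || lim->unit_max == 0)
		return false;
	if (!is_power_of_two(len))
		return false;
	if (len < lim->unit_min || len > lim->unit_max)
		return false;
	return offset % len == 0;
}
```

Passing this check says nothing about the extent layout under the file, so a write that satisfies it can still be rejected for the mixed-mapping reason discussed in this thread.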
>>
>> I am happy to support whatever is needed to make atomic writes work over
>> mixed extents if that is really an expected use case and it is a pain for an
>> application writer/admin to deal with this (by manually zeroing extents).
>>
>> JFYI, I did originally support the extent pre-zeroing for this. That was to
>> support a real-life scenario which we saw where we were attempting atomic
>> writes over mixed extents. The mixed extents were coming from userspace
>> punching holes and then attempting an atomic write over that space. However
>> that was using an early experimental and buggy forcealign; it was buggy as
>> it did not handle punching holes properly - it punched out single blocks and
>> not only full alloc units.
>>
>>>
>>> Where's the documentation that outlines all the restrictions on
>>> userspace behaviour to prevent this sort of problem being triggered?
>>
>> I would provide a man page update.
>
> I think, at this point, we need a better way of documenting all the
> atomic write stuff in one place. Not just the user interface and
> what is expected of userspace, but also all the things the
> filesystems need to do to ensure atomic writes work correctly. I was
> thinking that a document somewhere in the Documentation/ directory,
> rather than random pieces of information splattered across random man pages
> would be a much better way of explaining all this.
>
> Don't get me wrong - man pages explaining the programmatic API are
> necessary, but there's a whole lot more to understanding and making
> effective use of atomic writes than what has been added to the man
> pages so far.
>
>>> Common operations such as truncate, hole punch,
>>
>> So how would punch hole be a problem? The atomic write unit max is limited
>> by the alloc unit, and we can only punch out full alloc units.
>
> I was under the impression that this was a feature of the
> force-align code, not a feature of atomic writes. i.e. force-align
> is what ensures the BMBT aligns correctly with the underlying
> extents.
>
> Or did I miss the fact that some of the force-align semantics bleed
> back into the original atomic write patch set?
>
>>> buffered writes,
>>> reflinks, etc will trip over this, so application developers, users
>>> and admins really need to know what they should be doing to avoid
>>> stepping on this landmine...
>>
>> If this is not a real-life scenario which we expect to see, then I don't see
>> why we would add the complexity to the kernel for this.
>
> I gave you one above - restoring a data set as a result of disaster
> recovery.
>
>> My motivation for atomic writes support is to support atomically writing
>> large database internal page size. If the database only writes at a fixed
>> internal page size, then we should not see mixed mappings.
>
> Yup, that's the problem here. Once atomic writes are supported by
> the kernel and userspace, all sorts of applications are going to
> start using them in all sorts of ways you didn't think of.
>
>> But you see potential problems elsewhere ..
>
> That's my job as a senior engineer with 20+ years of experience in
> filesystems and storage related applications. I see far because I
> stand on the shoulders of giants - I don't try to be a giant myself.
>
> Other people become giants by implementing ground-breaking features
> (e.g. like atomic writes), but without the people who can see far
> enough ahead just adding features ends up with an incoherent mess of
> special interest niche features rather than a neatly integrated set
> of widely usable generic features.
>
> e.g. look at MySQL's use of fallocate(hole punch) for transparent
> data compression - nobody had foreseen that hole punching would be
> used like this, but it's a massive win for the applications which
> store bulk compressible data in the database even though it does bad
> things to the filesystem.
>
> Spend some time looking outside the proprietary database application
> box and think a little harder about the implications of atomic write
> functionality. i.e. what happens when we have ubiquitous support
> for guaranteeing only the old or the new data will be seen after
> a crash *without the need for using fsync*.
>
> Think about the implications of that for a minute - for any full
> file overwrite up to the hardware atomic limits, we won't need fsync
> to guarantee the integrity of overwritten data anymore. We only need
> a mechanism to flush the journal and device caches once all the data
> has been written (e.g. syncfs)...
>
> Want to overwrite a bunch of small files safely? Atomic write the
> new data, then syncfs(). There's no need to run fdatasync after each
> write to ensure individual files are not corrupted if we crash in
> the middle of the operation. Indeed, atomic writes actually provide
> better overwrite integrity semantics than fdatasync as it will be
> all or nothing. fdatasync does not provide that guarantee if we
> crash during the fdatasync operation.
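
The "atomic write, then syncfs()" pattern described above could look roughly like this from userspace. It is a sketch under assumptions, not a tested recipe: atomic_overwrite() and flush_filesystem() are hypothetical helper names, the fd is assumed to be opened with O_DIRECT on a filesystem and device that support atomic writes, and the buffer and length are assumed to already satisfy the advertised atomic write limits.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_ATOMIC
/* Value from include/uapi/linux/fs.h, for libc headers that predate it. */
#define RWF_ATOMIC 0x00000040
#endif

/*
 * Overwrite the first len bytes of the file as one atomic write.
 * Returns 0 on success or a negative errno value.
 */
static int atomic_overwrite(int fd, const void *buf, size_t len)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	ssize_t ret;

	assert(buf != NULL);

	ret = pwritev2(fd, &iov, 1, 0, RWF_ATOMIC);
	if (ret < 0)
		return -errno;
	/* An atomic write is all or nothing; treat a short count as an error. */
	if ((size_t)ret != len)
		return -EIO;
	return 0;
}

/*
 * Flush the filesystem containing fd once, after all files are written.
 * Returns 0 on success or a negative errno value.
 */
static int flush_filesystem(int fd)
{
	return syncfs(fd) < 0 ? -errno : 0;
}
```

A caller would run atomic_overwrite() on each file in turn and then call flush_filesystem() once on any fd from the same filesystem, instead of issuing fdatasync() per file.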
>
> Further, with COW data filesystems like XFS, btrfs and bcachefs, we
> can emulate atomic writes for any size larger than what the hardware
> supports.
>
> At this point we actually provide app developers with what they've
> been repeatedly asking kernel filesystem engineers to provide them
> for the past 20 years: a way of overwriting arbitrary file data
> safely without needing an expensive fdatasync operation on every
> file that gets modified.
>
> Put simply: atomic writes have a huge potential to fundamentally
> change the way applications interact with Linux filesystems and to
> make it *much* simpler for applications to safely overwrite user
> data. Hence there is an imperative here to make the foundational
> support for this technology solid and robust because atomic writes
> are going to be with us for the next few decades...
>
Dave,

I provided a proposal to solve this issue in
https://lore.kernel.org/lkml/20241210125737.786928-3-john.g.garry@oracle.com/
(there is also a v3, which is much the same),
but I can't make progress, as there is no agreement on how this should
be implemented, if at all. Any input there would be appreciated...

Cheers